## Phase 1: Setup and Tools

#### Install Necessary Libraries

You'll need the following Python libraries:

* requests: To fetch the HTML content of the webpage.
* beautifulsoup4: To parse the HTML and extract the data.
* gspread: To interact with Google Sheets.
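* google-auth-oauthlib: To authenticate with Google Sheets. It installs google-auth as a dependency, which provides the service account Credentials class used in Phase 3.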

pip install requests beautifulsoup4 gspread google-auth-oauthlib

#### Set up Google Sheets API:

* Go to the Google Cloud Console (https://console.cloud.google.com/).
* Create a new project or select an existing one.
* Enable the Google Sheets API for your project.
* Create credentials (a service account is recommended for automation).
* Download the credentials JSON file (e.g., credentials.json). Keep this file secure.
* Share your Google Sheet with the client email address from your credentials file.

#### Enable the Google Sheets API for your project
* Go to the Google Cloud Console.
* If you don't have a project yet, create one. If you do, select the project where you want to enable the Google Sheets API.
* In the console, navigate to APIs & Services > Library.
* In the search bar, type "Google Sheets API" and press Enter.
* Click on the Google Sheets API in the search results.
* On the API's overview page, click the Enable button.

#### API Keys:

Purpose: API keys identify your project when making requests. They are useful only for reading public, non-sensitive data; an API key cannot write to a sheet or read a private one.
##### How to create:
* Go to the Google Cloud Console.
* Select your project.
* Navigate to APIs & Services > Credentials.
* Click Create credentials and select API key.
* Your new API key will be displayed.
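
As a rough illustration of what an API key is used for, the sketch below builds the Sheets API v4 request URL that reads a range of values from a sheet shared publicly by link. The function name `values_url` and the environment variable `SHEETS_API_KEY` are assumptions made for this example; the key itself should stay out of notebooks and source control.

```python
from urllib.parse import quote, urlencode

SHEETS_API = "https://sheets.googleapis.com/v4/spreadsheets"


def values_url(spreadsheet_id, cell_range, api_key):
    """Build the Sheets API v4 URL that reads a range of cell values."""
    # Percent-encode the ID and the A1 range so characters like '!' and ':' are safe in the path.
    path = f"{SHEETS_API}/{quote(spreadsheet_id, safe='')}/values/{quote(cell_range, safe='')}"
    return f"{path}?{urlencode({'key': api_key})}"


# Example usage (hypothetical spreadsheet ID; the sheet must be readable by anyone with the link):
# import os, requests
# url = values_url("your-spreadsheet-id", "Sheet1!A1:F10", os.environ["SHEETS_API_KEY"])
# rows = requests.get(url, timeout=10).json().get("values", [])
```

The rest of this walkthrough writes to a sheet, which an API key cannot do, so it uses a service account instead.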

#### Service Accounts:

Purpose: Best for server-to-server communication or background processes that need to access Google Sheets without direct user interaction.
##### How to create:
* Go to IAM & Admin > Service Accounts in the Google Cloud Console.
* Click Create service account.
* Enter a Service account name, ID, and description.
* Click Create and continue.
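* Optionally grant the service account project-level roles. None are required just to read or write a spreadsheet: access to a specific Google Sheet is granted by sharing that sheet with the service account (see the last step below).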
* Click Continue.
* Click Create key. Choose JSON as the Key type and click Create. This will download a private key file to your computer. Keep this file secure!
* Click Done.
* To allow this service account to access your Google Sheet, you need to share the sheet with the service account's email address (found in the Service account details).
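
The address to share the sheet with is also stored in the downloaded key file, under the `client_email` field. A minimal sketch for reading it, assuming the file is a standard service account JSON key (the helper name `service_account_email` is made up for this example):

```python
import json
from pathlib import Path


def service_account_email(key_path):
    """Return the client_email field from a service account JSON key file."""
    info = json.loads(Path(key_path).read_text(encoding="utf-8"))
    # Service account keys carry "type": "service_account"; other credential files do not.
    if info.get("type") != "service_account":
        raise ValueError(f"{key_path} does not look like a service account key")
    return info["client_email"]


# Example usage (credentials.json is the file downloaded in the steps above):
# print(service_account_email("credentials.json"))
```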

## Phase 2: Scraping the Data

* Fetch the HTML: Use the requests library to get the page content.

In [1]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_highest-grossing_Nigerian_films"
response = requests.get(url)
response.raise_for_status()  # Check for errors
soup = BeautifulSoup(response.content, 'html.parser')
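
The request above has no timeout and sends the default requests User-Agent. A slightly more defensive variant might look like the sketch below; the `fetch_html` helper and the `USER_AGENT` string are assumptions for illustration, not part of the original notebook.

```python
import requests

# Hypothetical contact string; replace it with details that identify your script.
USER_AGENT = "nigerian-films-scraper/0.1 (contact: you@example.com)"


def fetch_html(url, session=None, timeout=10):
    """Return the page body as bytes, raising on HTTP errors or timeouts."""
    session = session or requests.Session()
    response = session.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.content


# Example usage:
# html = fetch_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_Nigerian_films")
# soup = BeautifulSoup(html, "html.parser")
```

Passing a session in makes the function easy to exercise without network access, and the timeout keeps the notebook from hanging if the server does not respond.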

In [2]:
soup

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of highest-grossing Nigerian films - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-

* Locate the Table: Inspect the Wikipedia page's HTML source (using your browser's developer tools) to find the table containing the film data. It sits in a <table> element; look for attributes such as class or id to target it. On this page the table carries the classes 'wikitable' and 'sortable'.

In [3]:
table = soup.find('table', class_='wikitable sortable')

In [4]:
table

<table class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Title
</th>
<th>Year
</th>
<th>Domestic Gross
</th>
<th>Studio(s)
</th>
<th>Director(s)
</th></tr>
<tr>
<td>1
</td>
<td><i><a href="/wiki/Everybody_Loves_Jenifa" title="Everybody Loves Jenifa">Everybody Loves Jenifa</a></i>
</td>
<td>2024
</td>
<td>₦1,882,553,548
</td>
<td>Funke Ayotunde Akindele Network
</td>
<td><a href="/wiki/Funke_Akindele" title="Funke Akindele">Funke Akindele</a>, Tunde Olaoye
</td></tr>
<tr>
<td>2
</td>
<td><i><a href="/wiki/A_Tribe_Called_Judah" title="A Tribe Called Judah">A Tribe Called Judah</a></i>
</td>
<td>2023
</td>
<td>₦1,404,187,806
</td>
<td>Funke Ayotunde Akindele Network
</td>
<td><a href="/wiki/Funke_Akindele" title="Funke Akindele">Funke Akindele</a>, <a href="/wiki/Adeoluwa_Owu" title="Adeoluwa Owu">Adeoluwa Owu</a>
</td></tr>
<tr>
<td>3
</td>
<td><i><a href="/wiki/Battle_on_Buka_Street" title="Battle on Buka Street">Battle on Buka Street</a></i>
</td>
<td>2022
</td>
<td>₦668,423,0

* Extract Data from Rows: Iterate through the table rows (<tr> tags), skipping the header row.  Extract the data from each cell (<td> tags).

In [5]:
data = []
if table:
    rows = table.find_all('tr')[1:]  # Skip the header row
    for row in rows:
        cells = row.find_all('td')
        if len(cells) >= 6:  # Ensure enough columns exist
            rank = cells[0].text.strip()
            title = cells[1].text.strip()
            year = cells[2].text.strip()
            domestic_gross = cells[3].text.strip()
            studios = cells[4].text.strip()
            directors = cells[5].text.strip()
            data.append([rank, title, year, domestic_gross, studios, directors])
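
The loop above relies on the column order staying fixed. As an alternative sketch, the header cells can be read from the first row and used as keys, so each film becomes a dict. The sample HTML below is a shortened stand-in for the real table, and `table_to_records` is a name chosen for this example.

```python
from bs4 import BeautifulSoup

# Shortened sample in the same shape as the Wikipedia table (three columns only).
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Rank</th><th>Title</th><th>Year</th></tr>
<tr><td>1</td><td><i><a href="/wiki/Everybody_Loves_Jenifa">Everybody Loves Jenifa</a></i></td><td>2024</td></tr>
<tr><td>2</td><td><i>A Tribe Called Judah</i></td><td>2023</td></tr>
</table>
"""


def table_to_records(table):
    """Turn a header row plus data rows into a list of dicts keyed by header text."""
    rows = table.find_all("tr")
    headers = [th.text.strip() for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.text.strip() for td in row.find_all("td")]
        if len(cells) == len(headers):  # skip rows that do not line up with the header
            records.append(dict(zip(headers, cells)))
    return records


# class_="wikitable" matches any table whose class list includes "wikitable".
sample_table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table", class_="wikitable")
records = table_to_records(sample_table)
```

With the real page, `table_to_records(table)` could be called on the `table` object found earlier.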

In [6]:
data

[['1',
  'Everybody Loves Jenifa',
  '2024',
  '₦1,882,553,548',
  'Funke Ayotunde Akindele Network',
  'Funke Akindele, Tunde Olaoye'],
 ['2',
  'A Tribe Called Judah',
  '2023',
  '₦1,404,187,806',
  'Funke Ayotunde Akindele Network',
  'Funke Akindele, Adeoluwa Owu'],
 ['3',
  'Battle on Buka Street',
  '2022',
  '₦668,423,056',
  'Funke Ayotunde Akindele Network / FilmOne Studios',
  'Funke Akindele, Tobi Makinde'],
 ['4',
  'Omo Ghetto: The Saga',
  '2020',
  '₦636,129,120',
  'SceneOne Productions',
  'Funke Akindele, JJC Skillz'],
 ['5',
  'Alakada: Bad and Boujee',
  '2024',
  '₦500,000,000',
  'Toyin Abraham Film Productions / FilmOne Studios',
  'Adebayo Tijani'],
 ['6',
  'The Wedding Party',
  '2016',
  '[1]₦452,288,605',
  'Ebonylife Films / FilmOne / Inkblot Production / Koga Studios',
  'Kemi Adetiba'],
 ['7',
  'The Wedding Party 2',
  '2017',
  '[2]₦433,197,377',
  'Ebonylife Films / FilmOne / Inkblot Production / Koga Studios',
  'Niyi Akinmolayan'],
 ['8',
  'Chief D
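
Some of the gross values in the output above carry footnote markers, such as '[1]₦452,288,605', and all of them are strings. If numeric values are wanted in the sheet, a cleaning step along these lines could be applied before uploading. This is a sketch that assumes whole-naira amounts with comma separators, as in the rows shown; `parse_gross` is a hypothetical helper.

```python
import re

FOOTNOTE = re.compile(r"\[\d+\]")  # matches markers such as [1] or [12]


def parse_gross(text):
    """Convert a string like '[1]₦452,288,605' to the integer 452288605."""
    cleaned = FOOTNOTE.sub("", text)       # drop footnote markers first
    digits = re.sub(r"[^\d]", "", cleaned)  # then drop the currency sign and commas
    if not digits:
        raise ValueError(f"no amount found in {text!r}")
    return int(digits)


# Example usage with the rows scraped above (column 3 is the domestic gross):
# for row in data:
#     row[3] = parse_gross(row[3])
```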

## Phase 3: Storing in Google Sheets

* Authenticate and Connect: Use gspread and your credentials file to connect to Google Sheets.

In [7]:
import gspread
from google.oauth2.service_account import Credentials

# --- Google Sheets Setup ---
# The Drive scopes let gc.open() look a spreadsheet up by title and let gc.create() make a new one.
scope = [
    'https://www.googleapis.com/auth/spreadsheets',
    'https://www.googleapis.com/auth/drive.file',
    'https://www.googleapis.com/auth/drive'
]
creds = Credentials.from_service_account_file('nigeriamovie.json', scopes=scope)  # Replace with your path
gc = gspread.authorize(creds)

try:
    spreadsheet = gc.open('Nigerian_Movie_Data')  # Replace with your sheet name
    worksheet = spreadsheet.sheet1
except gspread.SpreadsheetNotFound:
    # A spreadsheet created here is owned by the service account, so it will not
    # appear in your own Drive until you share it, e.g. with
    # spreadsheet.share('you@example.com', perm_type='user', role='writer').
    spreadsheet = gc.create('Nigerian_Movie_Data')
    worksheet = spreadsheet.sheet1
    worksheet.append_row(["Rank", "Title", "Year", "Domestic Gross", "Studio(s)", "Director(s)"])  # Add headers

In [8]:
creds

<google.oauth2.service_account.Credentials at 0x1c7cba07740>

In [9]:
gc

<gspread.client.Client at 0x1c7cbeef740>

* Append Data to Sheet: Use worksheet.append_rows() to add the scraped data in a single request.

In [10]:
if data:
    worksheet.append_rows(data)
    print(f"Successfully added {len(data)} rows to Google Sheets.")
else:
    print("No data found to add to Google Sheets.")

Successfully added 105 rows to Google Sheets.
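
Running the cell above a second time appends the same rows again. One possible guard, sketched here under the assumption that the film title (column index 1) identifies a row, is to filter out rows already present before appending. The helper name `rows_not_in_sheet` is made up for this example.

```python
def rows_not_in_sheet(existing_rows, scraped_rows, key_index=1):
    """Return scraped rows whose key column (the title by default) is not already present."""
    seen = {row[key_index] for row in existing_rows if len(row) > key_index}
    return [row for row in scraped_rows if row[key_index] not in seen]


# Example usage with the objects from the cells above:
# existing = worksheet.get_all_values()  # every row currently in the sheet, header included
# fresh = rows_not_in_sheet(existing, data)
# if fresh:
#     worksheet.append_rows(fresh)
```

Matching on title rather than rank is a design choice: ranks shift as new films enter the list, while titles stay stable.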
