## Introduction: Exploring the World of Cinema Through Web Scraping

Welcome to this exciting journey into the world of cinema and film industry data. In this project, we will scrape the highest-grossing film of each year, from 1915 to date, from this [Wikipedia page](https://en.wikipedia.org/wiki/List_of_highest-grossing_films#High-grossing_films_by_year).


At the end of this project we should have a table similar to this:

| | Year | Titles | Worldwide_gross | Budget |
|---|------|--------|-----------------|--------|
| 0 | 1915 | The Birth of a Nation | 50000000.0 | 110000 |
| 1 | 1916 | Intolerance | 1750000.0 | 385907 |
| 2 | 1917 | Cleopatra | 500000.0 | 300000 |
| 3 | 1918 | Mickey | 8000000.0 | 250000 |
| 4 | 1919 | The Miracle Man | 3000000.0 | 120000 |



### Web Scraping

In [1]:
# Import necessary libraries for web scraping and data manipulation
import pandas as pd 
import requests  
from bs4 import BeautifulSoup 


In [2]:
# Suppress warning messages to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

In [4]:
# Define the URL of the Wikipedia page containing the list of highest-grossing films and retrieve its content
url = "https://en.wikipedia.org/wiki/List_of_highest-grossing_films#High-grossing_films_by_year"

data = requests.get(url).text

# Create a BeautifulSoup object to parse the HTML content of the web page
soup = BeautifulSoup(data, "html.parser")


In [5]:
# Find the third table on the web page using BeautifulSoup (index 2, since Python uses 0-based indexing)
table = soup.find_all("table")[2]

In [6]:
# Create empty lists to store film titles, worldwide gross, budgets, and years
film_titles = []
worldwide_gross = []
budgets = []
years = []

In [7]:
# Loop through each row in the table (starting from the second row)
for row in table.find_all("tr")[1:]:
    cells = row.find_all("td") 
    
    # Check if there are at least three cells (title, gross, and budget)
    if len(cells) >= 3:
        # Extract and strip the text content of each cell
        title = cells[0].text.strip()     
        gross = cells[1].text.strip()  
        budget = cells[2].text.strip()
        
        # Append the extracted data to their respective lists
        film_titles.append(title)
        worldwide_gross.append(gross)
        budgets.append(budget)


In [8]:
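# Collect the year for each data row; the year is stored in a header cell (<th>)
for row in table.find_all("tr")[1:]:
    # If a row has several <th> cells, the last one is kept.
    # If a row has no <th> cell, the value from the previous row is reused.
    for y in row.find_all("th"):
        year = y.text.strip()
    years.append(year)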
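
The two loops above fill the lists independently, so a row with an unusual layout can leave the year and the other fields out of step. A minimal sketch of a single-pass alternative follows. It assumes the year is in the first `<th>` of a row and the other fields are in `<td>` cells, and it runs on a small hand-written HTML sample rather than the live page; `sample_html` and `parse_rows` are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table layout (not the real page):
# the year sits in a <th>, the remaining fields in <td> cells.
sample_html = """
<table>
  <tr><th>Year</th><th>Title</th><th>Worldwide gross</th><th>Budget</th></tr>
  <tr><th>1918</th><td>Mickey</td><td>$8,000,000</td><td>$250,000</td></tr>
  <tr><th>1919</th><td>The Miracle Man</td><td>$3,000,000R</td><td>$120,000</td></tr>
  <tr><td colspan="4">footnote row without a year</td></tr>
</table>
"""


def parse_rows(table):
    """Read year, title, gross, and budget from each row in one pass."""
    records = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        header = row.find("th")
        cells = row.find_all("td")
        # Skip rows that lack a year cell or have fewer than three data cells
        if header is None or len(cells) < 3:
            continue
        records.append({
            "Year": header.text.strip(),
            "Titles": cells[0].text.strip(),
            "Worldwide_gross": cells[1].text.strip(),
            "Budget": cells[2].text.strip(),
        })
    return records


sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
records = parse_rows(sample_table)
```

Because every field of a record comes from the same row, a skipped row drops all four values together instead of leaving the lists with different lengths.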

In [9]:
# Create a dictionary 'merge_data' to store film data with corresponding labels
merge_data = {
    'Year': years,            
    'Titles': film_titles,     
    'Worldwide_gross': worldwide_gross,
    'Budget': budgets         
}

# Create a DataFrame 'gross_earnings' using the 'merge_data' dictionary
gross_earnings = pd.DataFrame(merge_data)
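
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if the loops above skip different rows. As a rough sketch, a small check like the one below reports the length of each list before the DataFrame is built. The helper `check_equal_lengths` and the sample values are assumptions made for illustration, not part of the original notebook.

```python
def check_equal_lengths(columns):
    """Return {column name: length}; raise ValueError if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Illustrative sample with the same keys as 'merge_data'
sample_columns = {
    "Year": ["1915", "1916"],
    "Titles": ["The Birth of a Nation", "Intolerance"],
    "Worldwide_gross": ["$50,000,000", "$1,750,000"],
    "Budget": ["$110,000", "$385,907"],
}
lengths = check_equal_lengths(sample_columns)
```

In the notebook this would be called as `check_equal_lengths(merge_data)` just before creating the DataFrame.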


### Data Cleaning

Now that we've successfully scraped the data from the Wikipedia page, the next crucial step is data cleaning.

#### Initial Data Inspection:
Before we dive into cleaning, it's essential to get an overview of our dataset. Let's start by looking at the first few rows to understand the structure and identify any obvious issues:

In [10]:
gross_earnings.head(5)

Unnamed: 0,Year,Titles,Worldwide_gross,Budget
0,1915,The Birth of a Nation,"$50,000,000–100,000,000$20,000,000+R ($5,200,0...","$110,000"
1,1916,Intolerance,"$1,750,000R IN","$385,907"
2,1917,Cleopatra,"$500,000*R","$300,000"
3,1918,Mickey,"$8,000,000","$250,000"
4,1919,The Miracle Man,"$3,000,000R","$120,000"


#### Handling Missing Values:

In [11]:
gross_earnings[['Year', 'Titles', 'Worldwide_gross', 'Budget']].isna().sum()

Year               0
Titles             0
Worldwide_gross    0
Budget             0
dtype: int64

The dataset has no missing values. However, the output of _gross_earnings.head()_ above shows that the Worldwide_gross and Budget columns contain non-numeric characters such as dollar signs, commas, and footnote markers. In order to analyze those columns, we need to clean and format them properly.

In [12]:
# Remove dollar signs and commas from the 'Worldwide_gross' and 'Budget' columns in the DataFrame to make them numeric-friendly.
# regex=False treats '$' as a literal character rather than a regular-expression anchor.
gross_earnings['Worldwide_gross'] = gross_earnings['Worldwide_gross'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
gross_earnings['Budget'] = gross_earnings['Budget'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)


In [13]:
gross_earnings.head()

Unnamed: 0,Year,Titles,Worldwide_gross,Budget
0,1915,The Birth of a Nation,50000000–10000000020000000+R (5200000)R,110000
1,1916,Intolerance,1750000R IN,385907
2,1917,Cleopatra,500000*R,300000
3,1918,Mickey,8000000,250000
4,1919,The Miracle Man,3000000R,120000


In [14]:
# Extract the first run of digits from the 'Worldwide_gross' and 'Budget' columns and store it back in the same column.
gross_earnings['Worldwide_gross'] = gross_earnings['Worldwide_gross'].str.extract(r'(\d+)')
gross_earnings['Budget'] = gross_earnings['Budget'].str.extract(r'(\d+)')

In [15]:
gross_earnings.head()

Unnamed: 0,Year,Titles,Worldwide_gross,Budget
0,1915,The Birth of a Nation,50000000,110000
1,1916,Intolerance,1750000,385907
2,1917,Cleopatra,500000,300000
3,1918,Mickey,8000000,250000
4,1919,The Miracle Man,3000000,120000
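
The pattern `(\d+)` keeps only the first uninterrupted run of digits in each string, so anything after the first non-digit character is dropped. The short sketch below applies the same call to the uncleaned values shown earlier. The name `raw` is introduced for illustration.

```python
import pandas as pd

# Values copied from the 'Worldwide_gross' column after removing '$' and ','
raw = pd.Series([
    "50000000–10000000020000000+R (5200000)R",
    "1750000R IN",
    "500000*R",
    "8000000",
])

# expand=False returns a Series instead of a one-column DataFrame
first_digits = raw.str.extract(r"(\d+)", expand=False)
print(first_digits.tolist())
# ['50000000', '1750000', '500000', '8000000']
```

Note that for a range such as the first entry, only the lower bound (50000000) survives.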


During web scraping, two rows were parsed incorrectly: 'Jurassic Park' and 'Barbie', at index 117 and 148 respectively. Since only those two films are affected, we will correct their values manually.
But first, let's view them to understand the issue.

In [16]:
gross_earnings.iloc[117]

Year                             Jurassic Park †
Titles             $1,037,119,542 ($912,667,947)
Worldwide_gross                         63000000
Budget                                        74
Name: 117, dtype: object

In [17]:
gross_earnings.iloc[148]

Year                     Barbie †
Titles             $1,385,132,678
Worldwide_gross         128000000
Budget                         27
Name: 148, dtype: object

Using iloc, we see that the values were shifted one column to the left: the title landed in Year, the worldwide gross in Titles, and so on.

So we will enter the correct values manually, again using iloc.

In [18]:
gross_earnings.iloc[117] = ['1993', 'Jurassic Park', '1037119542', '63000000']
gross_earnings.iloc[148] = ['2023', 'Barbie', '1385132678', '128000000']

In [19]:
gross_earnings.iloc[117]

Year                        1993
Titles             Jurassic Park
Worldwide_gross       1037119542
Budget                  63000000
Name: 117, dtype: object
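
The shifted rows were looked up by index above. One possible way to find such rows programmatically is to flag every row whose Year value is not exactly four digits. This is a sketch on a small made-up frame that reuses rows shown earlier in this notebook; `sample`, `is_year`, and `shifted` are illustrative names.

```python
import pandas as pd

# Small illustrative frame: one row is shifted one column to the left
sample = pd.DataFrame({
    "Year": ["1918", "Jurassic Park †", "1919"],
    "Titles": ["Mickey", "$1,037,119,542 ($912,667,947)", "The Miracle Man"],
})

# True where the whole string is exactly four digits
is_year = sample["Year"].str.fullmatch(r"\d{4}")

# Rows that fail the check are candidates for manual correction
shifted = sample[~is_year]
print(shifted.index.tolist())
# [1]
```

This check has to run before the manual correction, while the Year column still holds strings.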

#### Data Type Conversion

In [20]:
# Parse the 'Year' column as dates and keep only the year as an integer
gross_earnings['Year'] = pd.to_datetime(gross_earnings['Year']).dt.year
# Convert the cleaned monetary columns from strings to numbers
gross_earnings['Worldwide_gross'] = gross_earnings['Worldwide_gross'].astype(float)
gross_earnings['Budget'] =  gross_earnings['Budget'].astype(int)

In [21]:
gross_earnings.dtypes

Year                 int64
Titles              object
Worldwide_gross    float64
Budget               int32
dtype: object
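
`astype` raises an error as soon as one value cannot be converted, and the width of the resulting integer type can depend on the platform (the output above shows `int32`). As a hedged alternative, `pd.to_numeric` with `errors="coerce"` turns unparseable values into `NaN` so they can be counted, and an explicit `"int64"` fixes the width. The sample values below, including `"not available"`, are made up for illustration.

```python
import pandas as pd

# Illustrative budget strings; the last one is deliberately unparseable
budget = pd.Series(["110000", "385907", "not available"])

# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(budget, errors="coerce")
n_unparsed = int(numeric.isna().sum())

# Drop the NaN rows, then request a fixed 64-bit integer type
budget_int = numeric.dropna().astype("int64")
print(n_unparsed, budget_int.tolist(), budget_int.dtype)
# 1 [110000, 385907] int64
```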

In [22]:
gross_earnings.head(5)

Unnamed: 0,Year,Titles,Worldwide_gross,Budget
0,1915,The Birth of a Nation,50000000.0,110000
1,1916,Intolerance,1750000.0,385907
2,1917,Cleopatra,500000.0,300000
3,1918,Mickey,8000000.0,250000
4,1919,The Miracle Man,3000000.0,120000


In [23]:
# Save the DataFrame 'gross_earnings' to a CSV file for further data cleaning and analysis
gross_earnings.to_csv('highest_gross_films_by_year.csv')
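
By default, `to_csv` also writes the DataFrame index, and that column comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference with and without `index=False`, using in-memory buffers instead of files. The sample frame and variable names are illustrative.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "Year": [1915, 1916],
    "Titles": ["The Birth of a Nation", "Intolerance"],
})

# Default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Year', 'Titles']
print(cols_without_index)  # ['Year', 'Titles']
```

If the index carries no information, passing `index=False` to the `to_csv` call above keeps the saved file to the four data columns.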