# Taylor Imhof
# Bellevue University | DSC 540
# Final Project Milestone 3
# 2/20/2022

## Milestone 3: Cleaning/Formatting Website Data

Perform at least 5 data transformation and/or cleansing steps to your website data:
 - Replace Headers
 - Format data into a more readable format
 - Identify outliers and bad data
 - Find duplicates
 - Fix casing or inconsistent values
 - Conduct Fuzzy Matching

## Import Required Libraries

In [1]:
# load libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
from urllib.request import Request, urlopen

## Generate List Of Urls To Scrape Movie Data

The website I used (The Numbers) has thousands of records of movie financial data, but each page displays only 100 movies. To scrape multiple pages, I wrote a simple for loop that builds a list of URL strings to pass to a web-scraping utility function.

In [2]:
# use a for loop to generate list of urls to scrape from
movie_urls = ["https://www.the-numbers.com/movie/budgets/all"]
for i in range(101, 3000, 100):
    url = "https://www.the-numbers.com/movie/budgets/all/" + str(i)
    movie_urls.append(url)
movie_urls

['https://www.the-numbers.com/movie/budgets/all',
 'https://www.the-numbers.com/movie/budgets/all/101',
 'https://www.the-numbers.com/movie/budgets/all/201',
 'https://www.the-numbers.com/movie/budgets/all/301',
 'https://www.the-numbers.com/movie/budgets/all/401',
 'https://www.the-numbers.com/movie/budgets/all/501',
 'https://www.the-numbers.com/movie/budgets/all/601',
 'https://www.the-numbers.com/movie/budgets/all/701',
 'https://www.the-numbers.com/movie/budgets/all/801',
 'https://www.the-numbers.com/movie/budgets/all/901',
 'https://www.the-numbers.com/movie/budgets/all/1001',
 'https://www.the-numbers.com/movie/budgets/all/1101',
 'https://www.the-numbers.com/movie/budgets/all/1201',
 'https://www.the-numbers.com/movie/budgets/all/1301',
 'https://www.the-numbers.com/movie/budgets/all/1401',
 'https://www.the-numbers.com/movie/budgets/all/1501',
 'https://www.the-numbers.com/movie/budgets/all/1601',
 'https://www.the-numbers.com/movie/budgets/all/1701',
 'https://www.the-number
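
The same list could also be built with a list comprehension. This is only a sketch of an alternative, not what the notebook ran; `BASE_URL` and `page_urls` are names I made up for illustration, and the offsets assume the same `range(101, 3000, 100)` used above.

```python
BASE_URL = "https://www.the-numbers.com/movie/budgets/all"

# the first page has no offset; later pages start at record 101, 201, ...
page_urls = [BASE_URL] + [f"{BASE_URL}/{start}" for start in range(101, 3000, 100)]

print(len(page_urls))   # 30
print(page_urls[-1])    # https://www.the-numbers.com/movie/budgets/all/2901
```

Checking the length before scraping is a cheap way to confirm the loop bounds produce the expected 30 pages.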

## Utility Function For Webscraping

After manually scraping each webpage for its content, I felt it would be much easier to abstract this process into its own function. I also kept running into an HTTP 403 (Forbidden) error; the website appears to have some minor protection against scraping, so I needed to pass a `User-Agent` header to make my requests look like they came from an actual browser.

I also learned that since I was using a function for each of the webscraping processes, I couldn't use a library alias (soup) and wound up having to call the actual library name `BeautifulSoup` to create the bs4 objects.

In [3]:
# utility function that attempts to access html at url and returns bs4 object
def get_soup_obj(url):
    """
    args: a string url
    
    attempts to request string html from specified url
    if success, then the html is opened and read via urllib.urlopen
    last, the html string is stored as a bs4 object
    
    returns: a bs4 soup object
    """
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        content = urlopen(req).read().decode('utf-8')
        soup_obj = BeautifulSoup(content, 'html.parser')
        return soup_obj
    except Exception as err:
        # report which url failed; the caller receives None in that case
        print('error', type(err), url)
        return None
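
To make the 403 workaround a little more explicit, here is a rough sketch that separates building the request from fetching it and reports the HTTP status code on failure. The helper names `build_request` and `fetch_html` and the 10-second timeout are my own assumptions, not part of the original function.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Attach a browser-like User-Agent header to the request."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Return the page html as a string, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=10) as resp:
            return resp.read().decode("utf-8")
    except HTTPError as err:
        # e.g. 403 when the site rejects the request
        print(f"HTTP {err.code} for {url}")
    except URLError as err:
        print(f"could not reach {url}: {err.reason}")
    return None


# inspect the header without touching the network
req = build_request("https://www.the-numbers.com/movie/budgets/all")
print(req.get_header("User-agent"))  # Mozilla/5.0
```

Note that `urllib` stores header names capitalized as `User-agent`, which is why the lookup uses that spelling.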

The easiest solution I found for combining all of the data from the website into a single, consolidated dataframe was to first create individual dataframes for each of the distinct urls. Then, after I had all of them converted into dataframes, I would simply concatenate them together.

Fortunately, each page contained only one html table, so scraping the content of each table was relatively painless. I found a really slick for loop implementation that pulls the content out of every table row (`tr`) and inserts it into the dataframe one row at a time.

In [4]:
# utility function to take soup object, store table data, and create data frame
def create_data_frame(soup_obj):
    """
    args: bs4 object
    
    first finds the soup table content
    then, an empty list is used to append the headers of each of the table columns
    then, a new dataframe is created using the extracted header names
    then, a for loop is used to extract all of the table rows data and insert them into the new dataframe
    
    returns: dataframe of movie finance info
    """
    table = soup_obj.find('table')
    
    # get table headers
    headers=[] # empty list to store table headers
    for i in table.find_all('th'):
        title = i.text
        headers.append(title)
        
    # create new df from headers
    df = pd.DataFrame(columns=headers)
    
    # fill df rows with tr from soup.table.tr
    for j in table.find_all('tr')[1:]:
        row_data = j.find_all('td')
        row = [i.text for i in row_data]
        length = len(df)
        df.loc[length] = row
        
    return df
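
As a hedged alternative to appending with `df.loc[length]` inside the loop, the rows can be collected in a plain list first and turned into a dataframe once at the end. The sketch below runs on a small inline html table that I wrote by hand to mimic the layout of the scraped pages; `table_to_frame` and `sample_html` are hypothetical names.

```python
import pandas as pd
from bs4 import BeautifulSoup

# hand-written stand-in for one scraped page (assumed layout)
sample_html = """
<table>
  <tr><th></th><th>ReleaseDate</th><th>Movie</th><th>ProductionBudget</th></tr>
  <tr><td>1</td><td>Apr 23, 2019</td><td>Avengers: Endgame</td><td>$400,000,000</td></tr>
  <tr><td>2</td><td>May 20, 2011</td><td>Pirates of the Caribbean: On Stranger Tides</td><td>$379,000,000</td></tr>
</table>
"""


def table_to_frame(soup_obj):
    """Build a dataframe from the first html table in a soup object."""
    table = soup_obj.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == len(headers):  # ignore malformed rows
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)


sample_df = table_to_frame(BeautifulSoup(sample_html, "html.parser"))
print(sample_df.shape)  # (2, 4)
```

Building the dataframe once avoids growing it row by row, which tends to matter more as the number of rows increases.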

In [5]:
# create empty list to store dataframes created via utility functions
df_list = []

# iterate across each distinct url and attempt to create df
for url in movie_urls:
    soup = get_soup_obj(url) # create new soup object
    df = create_data_frame(soup) # create new dataframe on soup table data
    df_list.append(df) # add each newly created dataframe to list of dataframes

In [6]:
# check length of dataframe list to ensure 30 dfs were created
len(df_list)

30

In [7]:
# merge list of dataframes together via pd.concat()
merged = pd.concat(df_list)
merged.head()

Unnamed: 0,Unnamed: 1,ReleaseDate,Movie,ProductionBudget,DomesticGross,WorldwideGross
0,1,"Apr 23, 2019",Avengers: Endgame,"$400,000,000","$858,373,000","$2,797,800,564"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$379,000,000","$241,071,802","$1,045,713,802"
2,3,"Apr 22, 2015",Avengers: Age of Ultron,"$365,000,000","$459,005,868","$1,395,316,979"
3,4,"Dec 16, 2015",Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,064,615,817"
4,5,"Apr 25, 2018",Avengers: Infinity War,"$300,000,000","$678,815,482","$2,048,359,754"


In [8]:
# display info to see dimensions and dtypes
print(merged.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3000 entries, 0 to 99
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0                     3000 non-null   object
 1   ReleaseDate       3000 non-null   object
 2   Movie             3000 non-null   object
 3   ProductionBudget  3000 non-null   object
 4   DomesticGross     3000 non-null   object
 5   WorldwideGross    3000 non-null   object
dtypes: object(6)
memory usage: 164.1+ KB
None
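
The `0 to 99` in the index line above suggests each page's row labels were carried over as-is, so the labels repeat across pages. If unique labels are wanted, `pd.concat` accepts `ignore_index=True`. The sketch below uses two tiny made-up frames (`page_one`, `page_two`) rather than the scraped data.

```python
import pandas as pd

# two toy "pages" standing in for the scraped dataframes
page_one = pd.DataFrame({"Movie": ["A", "B"]})
page_two = pd.DataFrame({"Movie": ["C", "D"]})

stacked = pd.concat([page_one, page_two])
renumbered = pd.concat([page_one, page_two], ignore_index=True)

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```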


## Drop Redundant Columns (index)

After scraping the table content from each page and building the dataframes, I ended up with an extra first column that has an empty header and simply numbers the rows. I dropped it with `DataFrame.drop()`, passing in the column to drop (`merged.columns[0]`).

In [9]:
# drop redundant index column
merged.drop(columns=merged.columns[0],
           inplace=True)
merged.head() # check to ensure correct column was dropped successfully

Unnamed: 0,ReleaseDate,Movie,ProductionBudget,DomesticGross,WorldwideGross
0,"Apr 23, 2019",Avengers: Endgame,"$400,000,000","$858,373,000","$2,797,800,564"
1,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$379,000,000","$241,071,802","$1,045,713,802"
2,"Apr 22, 2015",Avengers: Age of Ultron,"$365,000,000","$459,005,868","$1,395,316,979"
3,"Dec 16, 2015",Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,064,615,817"
4,"Apr 25, 2018",Avengers: Infinity War,"$300,000,000","$678,815,482","$2,048,359,754"
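
For comparison, here is a small sketch of the same drop done without `inplace=True`, so the original frame stays untouched and the result is assigned to a new name. The `frame` and `trimmed` names and the two-row toy data are assumptions for illustration.

```python
import pandas as pd

# toy frame whose first column has an empty header, like the scraped table
frame = pd.DataFrame(
    [[1, "Apr 23, 2019", "Avengers: Endgame"],
     [2, "May 20, 2011", "Pirates of the Caribbean: On Stranger Tides"]],
    columns=["", "ReleaseDate", "Movie"],
)

# drop the first column by position; returns a new dataframe
trimmed = frame.drop(columns=frame.columns[0])

print(list(trimmed.columns))  # ['ReleaseDate', 'Movie']
print(frame.shape)            # (2, 3)
```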


## Rename Columns/Headers

To stay consistent with the naming conventions of my flat-file headers, I made all of the headers lowercase and separated multi-word headers with an underscore. I also renamed `Movie` to `title`.

In [10]:
rename_cols = merged.rename({'ReleaseDate' : 'release_date',
                            'Movie' : 'title',
                            'ProductionBudget' : 'production_budget',
                            'DomesticGross' : 'domestic_gross',
                            'WorldwideGross' : 'worldwide_gross'}, axis=1)
rename_cols.head()

Unnamed: 0,release_date,title,production_budget,domestic_gross,worldwide_gross
0,"Apr 23, 2019",Avengers: Endgame,"$400,000,000","$858,373,000","$2,797,800,564"
1,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$379,000,000","$241,071,802","$1,045,713,802"
2,"Apr 22, 2015",Avengers: Age of Ultron,"$365,000,000","$459,005,868","$1,395,316,979"
3,"Dec 16, 2015",Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,064,615,817"
4,"Apr 25, 2018",Avengers: Infinity War,"$300,000,000","$678,815,482","$2,048,359,754"
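
The lowercase-with-underscores rule could also be applied programmatically instead of spelling out every header. This is a sketch under the assumption that the scraped headers are plain CamelCase; `to_snake_case` is a hypothetical helper, and a rename such as `Movie` to `title` would still need to be handled by hand.

```python
import re


def to_snake_case(name):
    """Insert an underscore before each inner capital letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


scraped_headers = ["ReleaseDate", "Movie", "ProductionBudget",
                   "DomesticGross", "WorldwideGross"]
print([to_snake_case(h) for h in scraped_headers])
# ['release_date', 'movie', 'production_budget', 'domestic_gross', 'worldwide_gross']
```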


## Convert Data Types (String to Float)

The main reason I wanted to use this data was to analyze the relationship between a movie's production budget and how much money it collected at the box office. In tandem with my flat-file data, I also wanted to see whether these values had any relationship with the movie's IMDb rating (or another rating system I could gather).

Because every value in the money columns contained dollar signs and commas, the values were stored as strings (dtype `object`) by default.

In [11]:
# check column dtypes to see if there is any conversion necessary
rename_cols.dtypes

release_date         object
title                object
production_budget    object
domestic_gross       object
worldwide_gross      object
dtype: object

In [12]:
# utility function that strips dollar signs and commas from a currency string
def clean_currency(x):
    """
    args: string containing unclean currency value
    
    checks if arg is a string
    if it is a string, all dollar signs are replaced with an empty string
    then all commas are replaced with an empty string
    
    returns: cleaned currency value as a string (converted to float later via astype)
    """
    if isinstance(x, str):
        return x.replace('$', '').replace(',', '')
    return x

In [13]:
# create copy to preserve original df
cleaned_currency = rename_cols.copy()

# use utility function for all currency values in each of the currency columns
cleaned_currency['production_budget'] = cleaned_currency['production_budget'].apply(clean_currency).astype(float)
cleaned_currency['domestic_gross'] = cleaned_currency['domestic_gross'].apply(clean_currency).astype(float)
cleaned_currency['worldwide_gross'] = cleaned_currency['worldwide_gross'].apply(clean_currency).astype(float)

# check dtypes to ensure column values were converted to floats successfully
cleaned_currency.dtypes

release_date          object
title                 object
production_budget    float64
domestic_gross       float64
worldwide_gross      float64
dtype: object
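
A vectorized version of the same cleanup is sketched below, using pandas string methods instead of `apply`. It is an alternative I am suggesting, not what the notebook ran; the sample values are copied from the first row of the table above, and `raw` and `cleaned` are made-up names.

```python
import pandas as pd

raw = pd.Series(["$400,000,000", "$858,373,000", "$2,797,800,564"])

# remove every dollar sign and comma in one regex pass, then convert to float
cleaned = raw.str.replace(r"[$,]", "", regex=True).astype(float)

print(cleaned.tolist())  # [400000000.0, 858373000.0, 2797800564.0]
```

Inside the character class `[$,]` the dollar sign is treated as a literal character, so it does not need escaping.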

## Check For Missing Or Null Values

In [14]:
# check for missing values
cleaned_currency.isnull().sum()

release_date         0
title                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64

There do not appear to be any missing values in this dataframe!
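
One caveat worth keeping in mind: `isnull()` only flags `NaN`/`None`, so placeholder values such as empty strings or zeros would not show up in the count above. The sketch below shows how such values could be counted on a toy frame; I have not run this against the scraped data, and whether a zero is a placeholder or a real value is something that would need checking.

```python
import pandas as pd

# toy data with one empty title and one zero gross
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", ""],
    "domestic_gross": [100.0, 0.0, 250.0],
})

null_count = int(toy.isnull().sum().sum())
empty_titles = int((toy["title"].str.strip() == "").sum())
zero_gross = int((toy["domestic_gross"] == 0).sum())

print(null_count)                 # 0
print(empty_titles, zero_gross)   # 1 1
```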

## Check For Duplicate Values

In [15]:
# check for duplicate values
dupes = cleaned_currency.duplicated()
dupes.value_counts()

False    3000
dtype: int64

There do not appear to be any fully duplicated rows in this dataframe!
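
`duplicated()` with no arguments only flags rows where every column matches. To look for repeats on a key such as title plus release date, a `subset` can be passed. The sketch below uses made-up rows to show the difference; it says nothing about whether the scraped data contains such cases.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B"],
    "release_date": ["Jan 1, 2000", "Jan 1, 2000", "Feb 2, 2001"],
    "worldwide_gross": [100.0, 105.0, 50.0],
})

# no row matches another on every column
full_row_dupes = int(toy.duplicated().sum())

# but one row repeats an earlier title/release_date pair
key_dupes = int(toy.duplicated(subset=["title", "release_date"]).sum())

print(full_row_dupes, key_dupes)  # 0 1
```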

## View Final Cleaned Website Dataframe

In [16]:
# display final clean data frame head()
cleaned_website_df = cleaned_currency.copy()
print(cleaned_website_df.shape)
cleaned_website_df.head()

(3000, 5)


Unnamed: 0,release_date,title,production_budget,domestic_gross,worldwide_gross
0,"Apr 23, 2019",Avengers: Endgame,400000000.0,858373000.0,2797801000.0
1,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,379000000.0,241071802.0,1045714000.0
2,"Apr 22, 2015",Avengers: Age of Ultron,365000000.0,459005868.0,1395317000.0
3,"Dec 16, 2015",Star Wars Ep. VII: The Force Awakens,306000000.0,936662225.0,2064616000.0
4,"Apr 25, 2018",Avengers: Infinity War,300000000.0,678815482.0,2048360000.0


In [17]:
# create .csv from dataframe for use in another notebook/project step
cleaned_website_df.to_csv('cleaned_webiste_data.csv')

In [18]:
# create csv from clean df for use in other notebook
cleaned_website_df.to_csv('final_website_df.csv')
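
By default `to_csv` also writes the row index as an unnamed first column, which would then need dropping again when the file is read back in. If the index is not needed, `index=False` leaves it out. The sketch below writes to a string instead of a file and uses a toy frame with a repeated index; the names are assumptions.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"title": ["Movie A", "Movie B"], "worldwide_gross": [100.0, 50.0]},
    index=[0, 0],  # repeated labels, like a concat of per-page frames
)

with_index = toy.to_csv()                # returns the csv text when no path is given
without_index = toy.to_csv(index=False)

print(with_index.splitlines()[0])        # ,title,worldwide_gross
print(without_index.splitlines()[0])     # title,worldwide_gross

# reading the index-free csv back gives only the real columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(list(round_trip.columns))          # ['title', 'worldwide_gross']
```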