# Gathering Data `Lesson02`

##### Student Tags

* Author : AH Uyekita
* Title  :  _Data Wrangling_
* Date   : 21/12/2018
* Course : Data Science II - Foundations Nanodegree
    * COD    : ND111
    * **Instructor:** David Venturi
    * **Instructor:** Mat Leonard

***

# Summary

* <a href="#scraping">1. Web Scraping</a>;
* <a href="#assessing">Assessing</a>, and;
* <a href="#cleaning">Cleaning</a>.

***

## 1. Web Scraping <a id='scraping'></a>

This is a fancy way to say gathering text/data from websites. For this course, I will use the [Beautiful Soup][beau_pack] package, the lxml parser, and the os module.

[beau_pack]: https://www.crummy.com/software/BeautifulSoup/

#### Libraries

In [30]:
# Importing Libraries to perform this study
import pandas as pd
import os
import matplotlib.pyplot as plt
%matplotlib inline

# This is the parser (Beautiful Soup needs a parser such as lxml to read the HTML files).
import lxml

# Importing the Beautiful Soup Package
from bs4 import BeautifulSoup

Loading the `bestofrt.tsv`.

In [34]:
# Loading the Best of Rotten Tomatoes.
df_rt = pd.read_csv('01-Dataset/bestofrt.tsv', sep='\t') # Bear in mind, I used '\t' because the file holds tab-separated values.

In [35]:
# Printing the first 5 rows.
df_rt.head()

| | ranking | critic_score | title | number_of_critic_ratings |
|---|---|---|---|---|
| 0 | 1 | 99 | The Wizard of Oz (1939) | 110 |
| 1 | 2 | 100 | Citizen Kane (1941) | 75 |
| 2 | 3 | 100 | The Third Man (1949) | 77 |
| 3 | 4 | 99 | Get Out (2017) | 282 |
| 4 | 5 | 97 | Mad Max: Fury Road (2015) | 370 |
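To make the role of `sep='\t'` concrete, here is a minimal sketch that reads a tab-separated sample from memory instead of from disk. The two-row `sample` string and the name `df_sample` are made up for illustration; only `pd.read_csv` with `sep` comes from the cell above.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout.
sample = (
    "ranking\tcritic_score\ttitle\n"
    "1\t99\tThe Wizard of Oz (1939)\n"
    "2\t100\tCitizen Kane (1941)\n"
)

# io.StringIO lets read_csv treat the string as if it were a file.
df_sample = pd.read_csv(io.StringIO(sample), sep="\t")
print(df_sample.shape)  # (2, 3)
```

Without `sep="\t"`, pandas would look for commas and would not split these rows into the three intended columns.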


### 1.1 Creating the Soup - Rotten Tomatoes

Creating the "soup" for the Rotten Tomatoes Website.

In [4]:
# Parsing the document - Using the BeautifulSoup constructor
with open('01-Dataset/rt-html/et_the_extraterrestrial.html') as file:
    soup = BeautifulSoup(file, 'lxml')

In [5]:
# Let's print out this file just to see how it is.
soup

<!DOCTYPE html>
<html lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
<head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">
<script src="//cdn.optimizely.com/js/594670329.js"></script>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="VPPXtECgUUeuATBacnqnCm4ydGO99reF-xgNklSbNbc" name="google-site-verification"/>
<meta content="034F16304017CA7DCF45D43850915323" name="msvalidate.01"/>
<link href="https://staticv2-4.rottentomatoes.com/static/images/iphone/apple-touch-icon.png" rel="apple-touch-icon"/>
<link href="https://staticv2-4.rottentomatoes.com/static/images/icons/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="https://staticv2-4.rottentomatoes.com/static/styles/css/rt_main.css" rel="stylesheet"/>
<script id="jsonLdSchema" type="application/ld+json">{"@context":"http

### 1.2. Gathering the Data

This is the process of extracting information from the soup. I want the following fields:

* Title;
* Audience Score, and;
* Number of Audience Ratings.

#### 1.2.1. Title

In [6]:
# Finding the title is simple: I just apply the .find() method to the soup object.
soup.find('title')

<title>E.T. The Extra-Terrestrial (1982) - Rotten Tomatoes</title>

Keep in mind that the result of the `.find()` method is a bs4 `Tag` object.

In [7]:
# There is another way to find the title.
soup.find_all('title', limit = 1)

[<title>E.T. The Extra-Terrestrial (1982) - Rotten Tomatoes</title>]

Although the result looks the same as that of `.find()`, the objects are different: `.find_all()` returns a `ResultSet`, which behaves like a list of `Tag` objects. This means that some `Tag` methods will not work on it directly.
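The difference between the two return types can be checked with a small sketch. The one-line `html` string and the name `demo_soup` are hypothetical, and I use Python's built-in `html.parser` instead of lxml here only so the snippet runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for illustration.
html = "<html><head><title>Example Movie (2000) - Rotten Tomatoes</title></head></html>"
demo_soup = BeautifulSoup(html, "html.parser")

single = demo_soup.find("title")             # a Tag, or None if nothing matches
many = demo_soup.find_all("title", limit=1)  # a ResultSet (a list subclass)

print(type(single).__name__)  # Tag
print(type(many).__name__)    # ResultSet

# Tag attributes such as .contents work directly on the result of .find() ...
first_text = single.contents[0]
# ... but a ResultSet has to be indexed first.
same_text = many[0].contents[0]
```

In practice this means `.find()` is convenient when exactly one match is expected, while `.find_all()` needs an index (or a loop) before any `Tag` attribute can be used.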

Now I want to remove the " - Rotten Tomatoes" suffix. I will proceed with the output of `.find()`, because I can use the `.contents` attribute to get the content between the `<title>` tags.

In [8]:
# Removing the tags `<>`.
soup.find('title').contents[0]

'E.T. The Extra-Terrestrial\xa0(1982) - Rotten Tomatoes'

The above result is a string, which I can slice, so I will count the number of characters from right to left to remove the " - Rotten Tomatoes".

In [9]:
# Length of the unnecessary string at the end of each title.
len(' - Rotten Tomatoes')

18

Let's combine the last steps to remove the tags and also remove the " - Rotten Tomatoes" string.

In [10]:
# Removing the tag and slicing.
soup.find('title').contents[0][0:-len(' - Rotten Tomatoes')]

'E.T. The Extra-Terrestrial\xa0(1982)'
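The slicing above assumes every title ends with the suffix. As a slightly more defensive sketch, the helper below only cuts the suffix when it is really there and, optionally, replaces the non-breaking space (`\xa0`) visible in the output with a regular space. The names `SUFFIX` and `clean_title` are my own and not part of the course material.

```python
SUFFIX = " - Rotten Tomatoes"


def clean_title(raw_title: str) -> str:
    """Drop the site suffix (if present) and normalise non-breaking spaces."""
    text = raw_title.replace("\xa0", " ")
    if text.endswith(SUFFIX):
        text = text[: -len(SUFFIX)]
    return text.strip()


print(clean_title("E.T. The Extra-Terrestrial\xa0(1982) - Rotten Tomatoes"))
# E.T. The Extra-Terrestrial (1982)
```

Whether to normalise `\xa0` is a judgment call: it matters mostly if the titles are later compared with titles from another source.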

Let's take a look at another piece of information for our analysis.

In [11]:
# This is a little more complicated. First, find the div whose class is 'audience-score meter'.
# Then, inside this div, find the span tag.
soup.find('div', class_ = 'audience-score meter').find('span')

<span class="superPageFontColor" style="vertical-align:top">72%</span>

As with the `title`, I will use the `.contents` attribute to extract what I want.

#### 1.2.2. Audience Score

In [12]:
# In this step I remove the tags `<>`.
soup.find('div', class_ = 'audience-score meter').find('span').contents[0]

'72%'

In [13]:
# Removing the % (slicing the string)
soup.find('div', class_ = 'audience-score meter').find('span').contents[0][:-1]

'72'
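The same value can probably be reached with a CSS selector, which some readers find easier to follow than chained `.find()` calls. This is only a sketch on a hypothetical fragment that mimics the markup shown above; it uses `select_one()` and `get_text()` from Beautiful Soup and the built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the markup shown above.
html = """
<div class="audience-score meter">
  <span class="superPageFontColor" style="vertical-align:top">72%</span>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# CSS selector: a <span> inside a <div> that carries both classes.
span = demo_soup.select_one("div.audience-score.meter span")

# get_text(strip=True) returns the text without surrounding whitespace;
# rstrip("%") drops the trailing percent sign before the int conversion.
score = int(span.get_text(strip=True).rstrip("%"))
print(score)  # 72
```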

#### 1.2.3. Number of Audience Ratings

In [14]:
# Finding the Number of Audience Ratings. There are two `div`s, and I want the last one.
soup.find('div', class_ = 'audience-info hidden-xs superPageFontColor')

<div class="audience-info hidden-xs superPageFontColor">
<div>
<span class="subtle superPageFontColor">Average Rating:</span>
            3.5/5
                </div>
<div>
<span class="subtle superPageFontColor">User Ratings:</span>
        32,313,030</div>
</div>

In [15]:
# I will filter using the find_all. The output of the find_all is a list, so I can index what I want.
soup.find('div', class_ = 'audience-info hidden-xs superPageFontColor').find_all('div')

[<div>
 <span class="subtle superPageFontColor">Average Rating:</span>
             3.5/5
                 </div>, <div>
 <span class="subtle superPageFontColor">User Ratings:</span>
         32,313,030</div>]

Because the `.find_all()` output is a list, I will use an index to select what I want.

In [16]:
# Subsetting. I want the second element of the list.
soup.find('div', class_ = 'audience-info hidden-xs superPageFontColor').find_all('div')[1]

<div>
<span class="subtle superPageFontColor">User Ratings:</span>
        32,313,030</div>

I will apply the `.contents` attribute to this element of the `.find_all()` result to extract the data.

In [17]:
# Using the .contents attribute.
soup.find('div', class_ = 'audience-info hidden-xs superPageFontColor').find_all('div')[1].contents

['\n',
 <span class="subtle superPageFontColor">User Ratings:</span>,
 '\n        32,313,030']

In [18]:
# I want to remove the \n, so I use the .strip() method.
soup.find('div', class_ = 'audience-info hidden-xs superPageFontColor').find_all('div')[1].contents[2].strip()

'32,313,030'

In [19]:
# Preparing the string for conversion to an integer. The comma prevents the value from being parsed as a number, so I must remove it.
soup.find('div', class_ = 'audience-info hidden-xs superPageFontColor').find_all('div')[1].contents[2].strip().replace(',','')

'32313030'
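The result is still a string. A minimal sketch of the remaining step, wrapped in a small helper (the name `parse_count` is my own), could look like this:

```python
def parse_count(raw: str) -> int:
    """Convert a string such as '32,313,030' (possibly padded) to an integer."""
    # strip() removes the newline and indentation; replace() removes the commas.
    return int(raw.strip().replace(",", ""))


print(parse_count("\n        32,313,030"))  # 32313030
```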

This was a walkthrough for one movie; I want to do the same for the other 99 movies. I will use a `for` loop to automate this.

Bear in mind that `os.listdir(folder)` returns a list with the names of the files inside the folder path.

In [20]:
# Variable Initialization
df_list = []
folder = '01-Dataset/rt-html'

# Loop to repeat the same code for 100 files.
for movie_html in os.listdir(folder):
    print(movie_html) # just to see if it works.

1000013-12_angry_men.html
1000121-39_steps.html
1000355-adventures_of_robin_hood.html
1000626-all_about_eve.html
1000642-all_quiet_on_the_western_front.html
1003707-casablanca.html
1007818-frankenstein.html
1011615-king_kong.html
1012007-laura.html
1012928-m.html
1013139-maltese_falcon.html
1013775-metropolis.html
1017289-rear_window.html
1017293-rebecca.html
1020333-streetcar_named_desire.html
1021749-touch_of_evil.html
1046060-high_noon.html
1048445-snow_white_and_the_seven_dwarfs.html
12_years_a_slave.html
400_blows.html
alien.html
apocalypse_now.html
argo_2012.html
army_of_shadows.html
arrival_2016.html
baby_driver.html
battleship_potemkin.html
beatles_a_hard_days_night.html
bicycle_thieves.html
boyhood.html
bride_of_frankenstein.html
brooklyn.html
citizen_kane.html
dr_strangelove.html
dunkirk_2017.html
et_the_extraterrestrial.html
finding_nemo.html
get_out.html
godfather.html
godfather_part_ii.html
gone_with_the_wind.html
grapes_of_wrath.html
gravity_2013.html
harry_potter_and_the
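One detail worth knowing: `os.listdir()` returns the names in arbitrary order and includes every entry in the folder, not only HTML files. The sketch below builds a throwaway folder with made-up file names to show one way to get a sorted, filtered list; the names `html_files` and the three sample files are hypothetical.

```python
import os
import tempfile

# Build a throwaway folder with hypothetical file names, just for the demo.
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ["b_movie.html", "a_movie.html", "notes.txt"]:
        open(os.path.join(demo_folder, name), "w").close()

    # Sort the names for a reproducible order and keep only the .html files.
    html_files = sorted(
        f for f in os.listdir(demo_folder) if f.endswith(".html")
    )

print(html_files)  # ['a_movie.html', 'b_movie.html']
```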

In [21]:
# Variable Initialization
df_list = []
folder = '01-Dataset/rt-html'

# Loop to repeat the same code for 100 files.
for movie_html in os.listdir(folder):
    # Opening the file.
    with open(os.path.join(folder, movie_html)) as file:
        # Creating the soup for each iteration
        soup = BeautifulSoup(file, 'lxml')
        
        # Gathering the title.
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        
        # Gathering the Audience Score.
        audience_score = soup.find('div', class_ = 'audience-score meter').find('span').contents[0][:-1]
        
        # Gathering the Number of Audience Ratings.
        num_audience_ratings = soup.find('div', class_ = 'audience-info hidden-xs superPageFontColor').find_all('div')[1].contents[2].strip().replace(',','')

        # Append to list of dictionaries
        df_list.append({'title': title,
                        'audience_score': int(audience_score),
                        'number_of_audience_ratings': int(num_audience_ratings)})

df = pd.DataFrame(df_list, columns = ['title', 'audience_score', 'number_of_audience_ratings'])

In [22]:
# Printing the results of the scraping.
df.head()

| | title | audience_score | number_of_audience_ratings |
|---|---|---|---|
| 0 | 12 Angry Men (Twelve Angry Men) (1957) | 97 | 103672 |
| 1 | The 39 Steps (1935) | 86 | 23647 |
| 2 | The Adventures of Robin Hood (1938) | 89 | 33584 |
| 3 | All About Eve (1950) | 94 | 44564 |
| 4 | All Quiet on the Western Front (1930) | 89 | 17768 |
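As a possible refactoring, the three extraction steps from the loop can be moved into one function, which makes them easy to try on a small piece of HTML before running all 100 files. This is only a sketch: the function name `extract_movie_fields` and the shortened `sample_html` are my own, and I use the built-in `html.parser` instead of lxml so the snippet runs without extra dependencies.

```python
from bs4 import BeautifulSoup


def extract_movie_fields(html: str) -> dict:
    """Return title, audience score and rating count from one page's HTML."""
    page = BeautifulSoup(html, "html.parser")

    # Same three steps as in the loop above.
    title = page.find("title").contents[0][:-len(" - Rotten Tomatoes")]
    score = page.find("div", class_="audience-score meter").find("span").contents[0][:-1]
    info = page.find("div", class_="audience-info hidden-xs superPageFontColor")
    count = info.find_all("div")[1].contents[2].strip().replace(",", "")

    return {"title": title,
            "audience_score": int(score),
            "number_of_audience_ratings": int(count)}


# Hypothetical, heavily shortened page with the same structure as the real files.
sample_html = """<html><head><title>Example Movie (2000) - Rotten Tomatoes</title></head>
<body>
<div class="audience-score meter"><span class="superPageFontColor">72%</span></div>
<div class="audience-info hidden-xs superPageFontColor">
<div>
<span class="subtle superPageFontColor">Average Rating:</span>
            3.5/5
                </div>
<div>
<span class="subtle superPageFontColor">User Ratings:</span>
        32,313,030</div>
</div>
</body></html>"""

result = extract_movie_fields(sample_html)
print(result)
```

Inside the loop, the body would then reduce to something like `df_list.append(extract_movie_fields(file.read()))`, assuming the files are read as text.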
