# Python AI: Hack the web, get big data

## Finding data:

* Gather data
* Datasets in CSVs
* APIs (government APIs like https://data.gov.sg )
* SQl/NoSQL databases
* Web Pages (web scrapping)

## Web data Scrapping:
* Requests to obtain website
* Beautiful soup for HTML parsing 
* Pandas for data analysis 
* Matplotib for data plots

## The problem:

We want to analyze the distributions of IMDB and Metacritic movie ratings to see if we find anything interesting. To do this, we'll first scrape data for over 2000 movies.

### Identify the goal of our scraping right from the beginning

* ### takes a lot of time

## Find what webs to scrape

* ### Small number of requests
* ### Request: what happens whenever we access a web page
* ### More requests the longer we take

## 1st Approach:

### Compile a list of movie names
### Use it to access the web of each movie on:
#### IMDB
#### Metacritic


One way to get all the data we need is to compile a list of movie names, and use it to access the web page of each movie on both IMDB and Metacritic websites.

2000 ratings from both Metacritic and IMBD, we need 4000 requests

## Identifying the URL structure

 ### understand the logic of the URL as the pages we want to scrape change.

Let’s browse by year 2017

### URL has several parameters after the question mark:

    release_date — Shows only the movies released in a specific year.
    sort — Sorts the movies on the page. sort=num_votes,desc translates to sort by number of votes in a descending order.
    page — Specifies the page number.
    ref_ — Takes us to the the next or the previous page. The reference is the page we are currently on. adv_nxt and adv_prv are two possible values. They translate to advance to the next page, and advance to the previous page, respectively.


### navigate through those pages and observe the URL, you will notice that only the values of the parameters change.

In [0]:
from requests import get

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'

response = get(url)

print(response.text[:500])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle"


# country where English is not the main language

In [0]:
headers = {"Accept-Language": "en-US, en;q=0.5"}

In [0]:
from requests import get

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'

response = get(url)

print(response.text[:500])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle"


## Understanding the HTML structure of a single page

### first line of response.text, the server sent us an HTML

### Most websites have a similar structure

### browser’s Developer Tools to undestand the HMTL

### inspect movie name

### ll of the information for each movie, including the poster, is contained in a div tag.

# Using BeautifulSoup to parse the HTML content

### BeautifulSoup, the most common web scraping module for Python.

### Import the BeautifulSoup class creator from the package bs4.

###   Parse response.text by creating a BeautifulSoup object,

### The 'html.parser' argument indicates that we want to do the parsing using Python’s built-in HTML parser.



In [0]:
from bs4 import BeautifulSoup

html_soup = BeautifulSoup(response.text, 'html.parser')

type(html_soup)

bs4.BeautifulSoup

In [0]:
html_soup.find_all?   

SyntaxError: ignored

In [0]:
movie_containers = html_soup.find_all('div', attrs={'class': 'lister-item mode-advanced'})

print(len(movie_containers))

50


In [0]:
movie_containers[0]

<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt3315342"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt3315342/"> <img alt="Logan" class="loadlate" data-tconst="tt3315342" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BYzc5MTU4N2EtYTkyMi00NjdhLTg3NWEtMTY4OTEyMzJhZTAzXkEyXkFqcGdeQXVyNjc1NTYyMjg@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB470041630_.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt3315342/">Logan</a>
<span class="lister-item-year text-muted unbold">(2017)</span>
</h3>
<p class="text-muted ">
<span class="certificate">R</span>
<span class="ghost">|</span>
<span class="runtime">137 min</span>
<span class="ghost">|</span>
<span class="genre">
Ac

### What distinguishes them from other div elements on that page?

### ResultSet object which is a list containing all the 50 divs

# The fields we want to collect
    The name of the movie.
    The year of release.
    The IMDB rating.
    The Metascore.
    The number of votes.


# The name of the movie

In [0]:
first_movie = movie_containers[0]

In [0]:
first_movie.div

<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt3315342"></div>
</div>

In [0]:
first_movie.a

<a href="/title/tt3315342/"> <img alt="Logan" class="loadlate" data-tconst="tt3315342" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BYzc5MTU4N2EtYTkyMi00NjdhLTg3NWEtMTY4OTEyMzJhZTAzXkEyXkFqcGdeQXVyNjc1NTYyMjg@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB470041630_.png" width="67"/>
</a>

In [0]:
first_movie.h3

<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt3315342/">Logan</a>
<span class="lister-item-year text-muted unbold">(2017)</span>
</h3>

In [0]:
first_movie.h3.a

<a href="/title/tt3315342/">Logan</a>

In [0]:
first_name = first_movie.h3.a.text
print(first_name)

'Logan'

# The year of the movie's release

In [0]:
first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
print(first_year)

<span class="lister-item-year text-muted unbold">(2017)</span>


In [0]:
first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold').text
print(first_year)

(2017)


# The IMDB rating

In [0]:
first_rating = first_movie.strong.text
first_rating = float(first_rating)
first_rating


8.1

# The Metascore

In [0]:
first_metascore = first_movie.find('span', class_='metascore favorable')
print(first_metascore)

<span class="metascore favorable">77        </span>


In [0]:
first_mscore = int(first_metascore.text)
print(first_mscore)

77


# The number of votes

In [0]:
first_votes = first_movie.find('span', attrs={'name': 'nv'})
first_votes

<span data-value="593689" name="nv">593,689</span>

In [0]:
first_votes['data-value']

'593689'

In [0]:
first_votes = int(first_votes['data-value'])
print(first_votes)

593689


# The script for a single page

### We need to add a condition to skip movies without a Metascore.

### The class attribute has two values: inline-block and ratings-metascore

In [0]:
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

for containers in movie_containers:

  #The name
  name = containers.h3.a.text
  names.append(name)

  #The year
  year = containers.h3.find('span', class_='lister-item-year text-muted unbold').text
  years.append(year)

  #the imdb_rating
  imdb = float(first_movie.strong.text)
  imdb_ratings.append(imdb)

  #the metascores
  m_score = containers.find('span', class_='metascore favorable')
  metascores.append(m_score)

  #the votes
  vote = int(containers.find('span', attrs={'name': 'nv'})['data-value'])
  votes.append(vote)

# Pandas!

In [0]:
import pandas as pd

test_df = pd.DataFrame({'movie': names,
                        'year': years,
                        'imdb': imdb_ratings,
                        'metascore': metascores,
                        'votes':votes})

print(test_df.info())
test_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
movie        50 non-null object
year         50 non-null object
imdb         50 non-null float64
metascore    27 non-null object
votes        50 non-null int64
dtypes: float64(1), int64(1), object(3)
memory usage: 2.1+ KB
None


Unnamed: 0,movie,year,imdb,metascore,votes
0,Logan,(2017),8.1,[77 ],593689
1,Thor: Ragnarok,(2017),8.1,[74 ],526651
2,Guardians of the Galaxy Vol. 2,(2017),8.1,[67 ],518166
3,Wonder Woman,(2017),8.1,[76 ],512305
4,Star Wars: Episode VIII - The Last Jedi,(2017),8.1,[85 ],512076
5,Dunkirk,(2017),8.1,[94 ],493813
6,Spider-Man: Homecoming,(2017),8.1,[73 ],466473
7,Get Out,(I) (2017),8.1,[84 ],436831
8,It,(I) (2017),8.1,[69 ],415786
9,Blade Runner 2049,(2017),8.1,[81 ],407184


# The script for multiple pages

1. ### Making all the requests we want from within the loop.
2. ### Controlling the loop's rate to avoid bombarding the server with requests.
3. ### Monitoring the loop while it runs.


### 2000-2017 (18)
### 4 pages of each year in the interval
### 72 pages
### 50 movies

### about 3600, but due to no metacric just around 2000

## Changing the URL's parameters

In [0]:
pages = [str(i) for i in range(1,5)]
years_url = [str(i) for i in range(2014,2018)]


In [0]:
years_url

['2014', '2015', '2016', '2017']

## Controlling the crawl-rate

In [0]:
from time import sleep
from random import randint


## Monitoring the loop as it's still going


1. The frequency (speed) of requests, so we make sure our program is not overloading the server.
2. The number of requests, so we can halt the loop in case the number of expected requests is exceeded.
3. The status code of our requests, so we make sure the server is sending back the proper responses.


In [0]:
from time import sleep
from random import randint
from time import time

start_time = time()
requests = 0

for _ in range(5):

   # A REQUEST GOES HERE

   requests += 1
   #sleep for some random time
   sleep(randint(1,3))
   elapsed_time = time() - start_time

   print(requests, requests/elapsed_time)



1 0.3329445557310298
2 0.3994152238353424
3 0.499256506121956
4 0.4438264527318612
5 0.4539258406034486


### Warning

#### A successful request is indicated by a status code of 200. 
#### We'll use the warn() function from the warnings module to throw a warning if the status code is not 200.

# Putting all together

In [0]:
#list to store aLL DATA

names = []
years = []
imdb_ratings = []
metascores = []
votes = []

start_time = time()
requests = 0

#going through years
for year_url in years_url:

  #going through the pages
  for page in pages:

    response = get('http://www.imdb.com/search/title?release_date='+ year_url + 
                   '&sort=num_votes,desc&page=' + page)
    
    sleep(randint(8,15))

    requests += 1
    elapsed_time = time() - start_time

    print(requests, requests/elapsed_time)

    html_soup = BeautifulSoup(response.text, 'html.parser')

    movie_containers = html_soup.find_all('div', attrs={'class': 'lister-item mode-advanced'})

for containers in movie_containers:

  #The name
  name = containers.h3.a.text
  names.append(name)

  #The year
  year = containers.h3.find('span', class_='lister-item-year text-muted unbold').text
  years.append(year)

  #the imdb_rating
  imdb = float(first_movie.strong.text)
  imdb_ratings.append(imdb)

  #the metascores
  m_score = containers.find('span', class_='metascore favorable')
  metascores.append(m_score)

  #the votes
  vote = int(containers.find('span', attrs={'name': 'nv'})['data-value'])
  votes.append(vote)

1 0.08531604690529114
2 0.07036958096190157
3 0.06700797286492016
4 0.06542150240423558
5 0.06713734140440908
6 0.06745237303415169
7 0.06935677510298414
8 0.06990028592398626
9 0.0682201167016187
10 0.06744796322155788
11 0.06678171064749497
12 0.06705179255385942
13 0.06689820718560599
14 0.06868595306247251
15 0.06875929929168695
16 0.06941525950791014


# Examining the scraped data

# Cleaning the scraped data

### Let's start by reordering the columns:

### Year column to integers

# Plotting and analyzing the distributions

In [0]:
import matplotlib.pyplot as plt
%matplotlib inline

fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize = (16,4))
ax1, ax2, ax3 = fig.axes

ax1.hist(movie_ratings['imdb'], bins = 10, range = (0,10)) # bin range = 1
ax1.set_title('IMDB rating')

ax2.hist(movie_ratings['metascore'], bins = 10, range = (0,100)) # bin range = 10
ax2.set_title('Metascore')

ax3.hist(movie_ratings['n_imdb'], bins = 10, range = (0,100), histtype = 'step')
ax3.hist(movie_ratings['metascore'], bins = 10, range = (0,100), histtype = 'step')
ax3.legend(loc = 'upper left')
ax3.set_title('The Two Normalized Distributions')

for ax in fig.axes:
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

plt.show()

### Metascore ratings resembles a normal distribution
### IMDB histogram, we can see that most ratings are between 6 and 8. 