# Introduction
<a name="introduction"></a>

This week I am working on a fun project: I am going to scrape a bunch of movie websites, such as IMDb and Rotten Tomatoes, and collect a lot of information about the movies listed there. Some of it I will use for visualization, and most of the rest will be used to build a machine learning model that can predict a movie's score. So, let's get started.

# Objectives
- [Introduction](#introduction)
- [Import necessary modules](#modules)
- [Imdb_webscraping_part#1](#imdb1)
- [Imdb_webscraping_part#2](#imdb2)
- [Rotten tomato webscraping](#Rotten tomato webscraping)
- [Cleaning All datasets(This is the starting point of this project after collecting all the data)](#cleaning_data_sets)
- [Visualization](#visualization)
- [Feature Engineering](#feature_engineering)
- [Train-test split](#train_test_split)
- [Creating models](#creating_models)
- [Result metrics](#result_metrics)
- [Feature Importance](#feature_importance)
- [ROC/AUC curve](#roc_auc_curve)
- [Wrapping Up](#wrapping_up)
- [Next](#next)

#### Importing all the necessary modules
<a name="modules"></a>

In [6]:

import pandas as pd
import requests
import numpy as np
import time
import re
import csv
from bs4 import BeautifulSoup
pd.set_option('display.max_columns', 500) ## to see all the columns,
pd.set_option('display.max_rows', 500)

#### Imdb webscraping part #1
<a name="imdb1"></a>

In [5]:
%%time
## Creating a couple of empty lists that will be used to create the dataframe. 
title_id = []
title = []
runtime =[]
genre = []
certificate = []
imdb_rating = []
gross = []
year = []
votes = []
director_actor=[]
metascore=[]
## Looping through each page of the IMDb search results, which show 50
## movies per page, picking only movies with at least 1000 votes
## and a rating of either 8.0 and above or 5.0 and below

## a is doing 30 iterations to get 1500 movies per rating band
for a in range(30): 
    ## b is doing 2 iterations to switch between the two rating bands
    for b in range(2): 
        
        if b == 0:
            # r with rating of 8.0 and above; adjacent string literals are joined
            # without adding any whitespace to the URL
            r = requests.get("http://www.imdb.com/search/title?num_votes=1000,"
                             "&title_type=feature,tv_movie,documentary,short"
                             "&user_rating=8.0,&page="+str(a)+"&ref_=adv_nxt")
        else:
            # r with rating of 5.0 and below
            r = requests.get("http://www.imdb.com/search/title?num_votes=1000,"
                             "&title_type=feature,tv_movie,documentary,short"
                             "&user_rating=,5.0&page="+str(a)+"&ref_=adv_nxt")
       
        ## use BeautifulSoup based on either r
        soup = BeautifulSoup(r.content, "lxml")
        
        for i in soup.findAll(class_='lister-item-content'):
            
            # Getting title_id
            title_id.append(re.findall(r'tt.+\d', str(i.find("a"))))
            
            # Getting title
            title.append(i.find('a').text.strip())
            
            # Getting genre
            try:
                genre.append(i.find('span', class_ = "genre").text.strip())
            except:
                genre.append(None)
            
            # Getting runtime
            try:
                # runtime.append(re.findall(r'\d+', i.find('span', class_ = "runtime").text)[0])
                runtime.append(re.findall(r'\d+', i.find('span', class_ = "runtime").text))
            except:
                runtime.append(None)
            
            # Getting certificate    
            try:
                certificate.append(i.find("span", class_ ="certificate").text)
            except:
                certificate.append(None)
            
            # Getting imdb_rating
            imdb_rating.append(float(i.find("strong").text))
            
            # Getting year
            year.append(i.find("span", class_="lister-item-year text-muted unbold").text)
            
            # Getting votes
            votes.append((i.find("span", attrs={"name":"nv"}).text).replace(",",""))
            
            # Getting gross
            try:
                #gross.append(i.find("span", attrs={"name":"nv"}).find_next_sibling("span", attrs={"name":"nv"}).get_text())
                gross.append(re.findall(r'\d.+\d', i.find("span", attrs={"name":"nv"}).find_next_sibling("span", \
                attrs={"name":"nv"}).get_text()))
            except:
                gross.append(None)
            
            ## Getting director and actors
            try:
                director_actor.append(i.find("p", class_="text-muted").find_next_sibling("p", class_="").text.strip())
            except:
                director_actor.append(None)
            
            ## Getting metascore
            try:
                metascore.append(int(i.find("span",  class_="metascore favorable").text.strip()))
            except:
                metascore.append(None)

CPU times: user 15.3 s, sys: 269 ms, total: 15.5 s
Wall time: 1min 23s
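The two search URLs in the loop above differ only in the `user_rating` value and the page number. As a minimal sketch, the URL could be assembled from its parts with the standard library; `build_search_url` is a hypothetical helper that is not part of the notebook, and the parameter values are simply copied from the cell above.

```python
from urllib.parse import urlencode

# Base endpoint and parameter values are copied from the scraping cell above.
BASE_URL = "http://www.imdb.com/search/title"

def build_search_url(page, user_rating):
    """Return the search URL for one results page and one rating band.

    build_search_url is a hypothetical helper used only for illustration.
    """
    params = {
        "num_votes": "1000,",
        "title_type": "feature,tv_movie,documentary,short",
        "user_rating": user_rating,  # "8.0," = 8.0 and above, ",5.0" = 5.0 and below
        "page": page,
        "ref_": "adv_nxt",
    }
    # safe="," keeps the commas as they are instead of encoding them as %2C
    return BASE_URL + "?" + urlencode(params, safe=",")

high_band = build_search_url(3, "8.0,")
low_band = build_search_url(3, ",5.0")
print(high_band)
print(low_band)
```

Building the URL from a dictionary avoids the stray whitespace that can slip in when a long string is split across lines.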


In [7]:
## checking the length of each feature
print (len(title_id))
print (len(title))
print (len(runtime))
print (len(genre))
print (len(certificate))
print (len(imdb_rating))
print (len(gross))  ## there seems to be a problem with the gross, will look into that. 
print (len(year))
print (len(votes))
print (len(director_actor))
print (len(metascore))

3000
3000
3000
3000
3000
3000
3000
3000
3000
3000
3000


In [10]:
title

['Spider-Man: Into the Spider-Verse',
 'Bohemian Rhapsody',
 'The Favourite',
 'Green Book',
 'Roma',
 'A Star Is Born',
 'Avengers: Infinity War',
 'Dragon Ball Super: Broly',
 'How to Train Your Dragon: The Hidden World',
 'The Godfather',
 'The Shawshank Redemption',
 'The Dark Knight',
 'Pulp Fiction',
 'Gone Girl',
 'Three Billboards Outside Ebbing, Missouri',
 'The Wolf of Wall Street',
 'The Revenant',
 'Interstellar',
 'The Lord of the Rings: The Fellowship of the Ring',
 'Inception',
 'Guardians of the Galaxy',
 'Blade Runner 2049',
 'The Intouchables',
 'The Big Lebowski',
 'Uri: The Surgical Strike',
 'Andhadhun',
 'Room',
 'Mad Max: Fury Road',
 'The Dark Knight Rises',
 'Raiders of the Lost Ark',
 'Shoplifters',
 'Wonder',
 'Petta',
 'The Avengers',
 "Schindler's List",
 'The Lord of the Rings: The Return of the King',
 'The Departed',
 'Harry Potter and the Deathly Hallows: Part 2',
 'Inglourious Basterds',
 'The Green Mile',
 'Monty Python and the Holy Grail',
 'The Sile

In [8]:
## Fix title_id since it's picking up some extra characters we do not want. 
a = []
for i in title_id:
    for j in i:
        a.append(j.split("/")[0])
title_id = a
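The clean-up above could also be avoided by matching only the id itself. The sketch below assumes an IMDb title id is `tt` followed by digits and that the anchor tag looks like the sample string, which is made up for illustration; `extract_title_id` is a hypothetical helper.

```python
import re

def extract_title_id(anchor_html):
    """Return the first 'tt' + digits id found in an anchor tag, or None."""
    match = re.search(r"tt\d+", anchor_html)
    return match.group(0) if match else None

# Sample anchor; the exact markup of the real page is an assumption.
sample_anchor = '<a href="/title/tt4154756/?ref_=adv_li_tt">Avengers: Infinity War</a>'
print(extract_title_id(sample_anchor))
```

Because the pattern stops at the last digit, there is nothing left to split off afterwards.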

In [None]:
data_list = [title_id, title, runtime, genre, certificate, imdb_rating, gross, year, votes, director_actor, metascore]

In [None]:
df = pd.DataFrame(data_list)

In [None]:
df = df.T

In [None]:
header =["title_id", "title", "runtime", "genre", "certificate", "imdb_rating", "gross", "year", "votes", "director_actor", "metascore"] 

In [None]:
df.columns = header

In [None]:
df.head()

In [None]:
## The code above ran in AWS and once the web scraping was completed, the data was stored as a csv file in EC2. 
df.to_csv('imdb_1.csv', encoding="utf-8")

<a name="imdb2"></a>
#### Imdb webscraping part #2

In [None]:
%%time

import sys
sys.stdout = open('/dev/stdout', 'w') ## writing straight to the terminal so I can watch the progress and make sure my code is working

user_review =[]
critic_review = []
writer = []
language = []
country = []
budget = []
gross_1 = []
opening_week=[]
oscar_win = []
oscar_nom = []
other_win = []
other_nom = []
count = 0
for i in title_id:
    
    sys.stdout.write(i + '\n')
    
    r = requests.get("http://www.imdb.com/title/"+str(i)+"/?ref_=nv_sr_1")
    soup = BeautifulSoup(r.content, "lxml")
    
    # Getting User_review_no
    try:
        user_review.append(re.findall(r'.+\d', soup.find('span', attrs={'itemprop':"reviewCount"}).text))
    except:
        user_review.append(None)
    # Getting Critic_no
    try:
        critic_review.append(re.findall(r'.+\d',soup.find('span', attrs={'itemprop':"reviewCount"}).find_next_sibling().text))
    except:
        critic_review.append(None)
        
    """try:
        won.append(soup.find("span", itemprop="awards").text.strip())
    except:
        won.append(None)"""
    # Getting Writer
    try:
        writer.append(soup.find('span', attrs={'itemprop':"creator"}).text)
    except:
        writer.append(None)
    """for i in soup.find_all('h4',class_ = "inline"):
        if i.text.strip() == "Writers:" or i.text.strip() =="Writer":
            writer.append(i.find_next().text.strip())"""
    
    # Getting Language
    a = 0
    try:
        for i in soup.find_all("h4", class_ = 'inline'):
            if i.text.strip() == "Language:":
                a = i.find_next().text
                #language.append(i.find_next().text)
    except:
        a = None
    language.append(a)
        
    # Getting Country
    a = 0
    try:
        for i in soup.find_all('h4',class_ = "inline"):
            if i.text.strip() == "Country:":
                a = i.find_next().text.strip()
                #country.append(i.find_next().text.strip())
    except:
        a = None
    country.append(a)
            
    # Getting Budget
    a = 0
    try:
        for i in soup.find_all("h4", class_ = "inline"):
            if i.text.strip() == "Budget:":
                a = (i.next_sibling.strip()).replace(",","")
               # budget.append((i.next_sibling.strip()).replace(",",""))
    except:
        a = None
    budget.append(a)
            
    # Getting Gross
    a = 0
    try:
        for i in soup.find_all("h4", class_ = "inline"):
            if i.text.strip() == "Gross:":    
                a = (i.next_sibling.strip()).replace(",","")
                #gross_1.append((i.next_sibling.strip()).replace(",",""))
    except:
        a = None
    gross_1.append(a)
        
    # Getting Opening Weekend
    a = 0
    try:
        for i in soup.find_all("h4", class_ = "inline"):
            if i.text.strip() == "Opening Weekend:":
                a = (i.next_sibling.strip()).replace(",","")
                #opening_week.append((i.next_sibling.strip()).replace(",",""))
    except:
        a = None
    opening_week.append(a)
    
    
    # Getting Oscar, Oscar_nomination, Other_awards, Other_nominations
    # getting oscar
    while soup.find_all("span", attrs={"itemprop":"awards"}):
        for i in soup.find_all("span", attrs={"itemprop":"awards"}):
            if "Won" in i.text and ("Oscar." in i.text or "Oscars." in i.text):
                oscar_win.append(re.findall(r'\d+', i.text.strip("")))
                break
            else:
                oscar_win.append(None)
                break
        break
    else:
        oscar_win.append(None)
        
    # find nominations for oscar
    while soup.find_all("span", attrs={"itemprop":"awards"}):
        for i in soup.find_all("span", attrs={"itemprop":"awards"}):
            if "Nominated" in i.text and ("Oscar." in i.text or "Oscars." in i.text):
                oscar_nom.append(re.findall(r'\d+', i.text.strip("")))
                break
            else:
                oscar_nom.append(None)
                break
        break
    else:
        oscar_nom.append(None)
    
    # Getting other wins
    try:
        for i in soup.find_all("span", attrs={"itemprop":"awards"}):
            #print i.text
            if ("wins" in i.text or "win" in i.text) and ("nominations" in i.text or "nomination." in i.text):
                a = re.findall(r'\d+', i.text.strip(""))[0]
            elif ("wins" in i.text or "win" in i.text) and ("nominations" not in i.text or "nomination." not in i.text):
                a = re.findall(r'\d+', i.text.strip(""))[0]
            elif ("wins" not in i.text or "win" not in i.text) and ("nominations" in i.text or "nomination." in i.text):
                a = None
    except:
        a = None
    other_win.append(a)
        
        
    # Getting other nominations
    try:
        for i in soup.find_all("span", attrs={"itemprop":"awards"}):
            #print i.text
            if ("wins" in i.text or "win" in i.text) and ("nominations" in i.text or "nomination." in i.text):
                #other_nom.append(re.findall(r'\d+', i.text.strip(""))[1])
                a = re.findall(r'\d+', i.text.strip(""))[1]
            elif ("wins" in i.text or "win" in i.text) and ("nominations" not in i.text or "nomination." not in i.text):
                a = None
            elif ("wins" not in i.text or "win" not in i.text) and ("nominations" in i.text or "nomination." in i.text):
                a = re.findall(r'\d+', i.text.strip(""))[0]
    except:
        a = None
    other_nom.append(a)
    count += 1
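The four award counts collected above all come from the same kind of text. Here is a minimal sketch of pulling them out with one regular expression per count. It assumes the award text reads like "Won 3 Oscars. Another 120 wins & 150 nominations."; that wording and the helper `parse_awards` are assumptions for illustration, not taken from the scraped pages.

```python
import re

# One pattern per count; the wording of the award text is an assumption.
AWARD_PATTERNS = {
    "oscar_win": r"Won (\d+) Oscars?",
    "oscar_nom": r"Nominated for (\d+) Oscars?",
    "other_win": r"(\d+) wins?\b",
    "other_nom": r"(\d+) nominations?\b",
}

def parse_awards(text):
    """Return a dict of award counts, using 0 when a count is not mentioned."""
    counts = {key: 0 for key in AWARD_PATTERNS}
    for key, pattern in AWARD_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            counts[key] = int(match.group(1))
    return counts

print(parse_awards("Won 3 Oscars. Another 120 wins & 150 nominations."))
print(parse_awards("7 nominations."))
```

Every movie gets exactly one value per count this way, so the four lists always stay the same length.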
    


In [None]:
# checking the length of each feature

print (len(user_review))
print (len(critic_review))
print (len(writer))
print (len(language))
print (len(country))
print (len(budget))
print (len(gross_1))
print (len(opening_week))
print (len(oscar_win))
print (len(oscar_nom))
print (len(other_win))
print (len(other_nom))



In [None]:
## creating a second list with all the scraped features
data_list_2 = [user_review, critic_review, writer, language, country, budget, gross_1, opening_week, oscar_win, oscar_nom, other_win, other_nom]

In [None]:

df2 = pd.DataFrame(data_list_2)

In [None]:
df2 = df2.T

In [None]:

header2 = ["user_review", "critic_review", "writer", "language", "country", "budget", "gross_1", "opening_week", "oscar_win", "oscar_nom", "other_win", "other_nom"]

In [None]:
df2.columns = header2

In [None]:
## The code above ran in AWS and once the scraping was completed, the scraped data was stored as a csv file in EC2. 
df2.to_csv('imdb_2.csv', encoding="utf-8")

In [None]:
## Getting all the titles in order to run through Rotten Tomatoes. 
title = df["title"]

In [None]:
## making necessary changes in order to fit the rotten tomato urls. 
title = title.apply(lambda x:x.replace(" ", "_"))
title = title.apply(lambda x:x.replace(".", ""))
title = title.apply(lambda x:x.replace(":", ""))
title = title.apply(lambda x:x.replace("-", ""))
title = title.apply(lambda x:x.replace(",", ""))
title = title.apply(lambda x:x.replace("'", ""))
title = title.apply(lambda x:x.replace("__", "_"))

## changing the names of the spotted movies to fit the url. A dict passed to Series.replace
## matches whole titles only, so "Lion" or "Logan" inside a longer title is left alone. 
title = title.replace({
    "Logan": "logan_2017",
    "Lion": "lion_2016",
    "Bahubali_The_Beginning": "Baahubali_The_Beginning",
    "Star_Wars_Episode_V_The_Empire_Strikes_Back": "empire_strikes_back",
    "Kavkazskaya_plennitsa_ili_Novye_priklyucheniya_Shurika": "kavkazskaya_plennitsa_ili_novye_priklyucheniya_shurika_kidnapping_caucassian_style",
    "Star_Wars_Episode_IV_A_New_Hope": "star_wars",
    "Tom_Petty_and_the_Heartbreakers_Runnin_Down_a_Dream": "runnin_down_a_dream_tom_petty_and_the_heartbreakers",
    "The_Incredibly_Strange_Creatures_Who_Stopped_Living_and_Became_MixedUp_Zombies!!?": "the_incredibly_strange_creatures",
})
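The title clean-up above can be sketched as a single function in plain Python. `to_rt_slug` and the small `OVERRIDES` table are hypothetical names for illustration. Unlike the cell above, this sketch collapses any run of underscores rather than only double ones, and it applies overrides to whole titles only.

```python
import re

# Whole-title overrides; two entries copied from the cell above for illustration.
OVERRIDES = {"Logan": "logan_2017", "Lion": "lion_2016"}

def to_rt_slug(title, overrides=OVERRIDES):
    """Turn a movie title into the underscore form used in the URLs above."""
    slug = title.replace(" ", "_")
    slug = re.sub(r"[.:,'\-]", "", slug)   # drop the same punctuation as above
    slug = re.sub(r"_+", "_", slug)        # collapse repeated underscores
    return overrides.get(slug, slug)       # exact match only, never a substring

print(to_rt_slug("Spider-Man: Into the Spider-Verse"))
print(to_rt_slug("Logan"))
print(to_rt_slug("Logan Lucky"))
```

The exact-match lookup is the point of the design: an override for "Logan" should not rewrite part of another title that merely starts with the same word.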

#### Rotten_tomato webscraping
<a name="Rotten tomato webscraping"></a>

In [None]:
%%time

import sys
sys.stdout = open('/dev/stdout', 'w')

rt_score = []
rt_avg_rating = []
rt_audience_score = []
rt_user_rating = []
rt_avg_aud_rating = []
rt_fresh = []
rt_rotten = []

for i in title:
    
    sys.stdout.write(i + '\n')
    
    r = requests.get("https://www.rottentomatoes.com/m/"+str(i))
    soup = BeautifulSoup(r.content, "lxml")
    
    # Getting tomato meter score.
    a = 0
    for i in soup.find_all("h3", class_ ="scoreTitle superPageFontColor"):
        if i.text.strip() == "TOMATOMETER":
            try:
                a = i.find_next("span", class_="meter-value superPageFontColor").text
            except:
                a = None
    rt_score.append(a)
    
    # Getting average_rating from tomatometer 
    a = 0
    for i in soup.find_all("div", attrs= {"id":"all-critics-numbers"}):
        try:
            a = re.findall(r'./.\d', i.text.strip())[0]
        except:
            a = None
    rt_avg_rating.append(a)
    
    # Getting audience score meter score.
    a = 0
    for i in soup.find_all("h3", class_ ="scoreTitle superPageFontColor"):
        if i.text.strip() == "AUDIENCE SCORE":
            try:
                a = i.find_next("span" ,class_ ="superPageFontColor").text
            except:
                a = None
    rt_audience_score.append(a)
    
    # Getting user rating from Rotten Tomato
    a = 0
    for i in soup.find_all("span" ,class_ ="subtle superPageFontColor"):
        if i.text == "User Ratings:":
            try:
                a = i.next_sibling.strip()
            except:
                a = None
    rt_user_rating.append(a)
    
    # Getting Average rating from Rotten Tomato(Audience score)
    a = 0
    for i in soup.find_all("span" ,class_ ="subtle superPageFontColor"):
        if i.text == "Average Rating:":
            try:
                a = i.next_sibling.strip()
            except:
                a = None
    rt_avg_aud_rating.append(a)
    
    # finding fresh for all critic
    a = 0
    for i in soup.find_all("span" ,class_ ="subtle superPageFontColor audience-info"):
        if i.text.strip() == "Fresh:":
            a = int(i.find_next().text)
            break
    rt_fresh.append(a)
    
    # finding rotten for all critic
    a = 0
    for i in soup.find_all("span" ,class_ ="subtle superPageFontColor audience-info"):
        if i.text.strip() == "Rotten:":
            a = int(i.find_next().text)
            break
    rt_rotten.append(a)

In [None]:
## creating a third list with all the scraped features
data_list_3 = [rt_score, rt_avg_rating, rt_audience_score, rt_user_rating, rt_avg_aud_rating, rt_fresh, rt_rotten ]

In [None]:
for i in data_list_3:
    print(len(i))

In [None]:

df3 = pd.DataFrame(data_list_3)

In [None]:
df3 = df3.T

In [None]:

header3 = ["rt_score", "rt_avg_rating", "rt_audience_score", "rt_user_rating", "rt_avg_aud_rating", "rt_fresh", "rt_rotten"]

In [None]:
df3.columns = header3

In [None]:
## The code above ran in AWS and once the scraping was completed, the scraped data was stored as a csv file in EC2. 
df3.to_csv('rt.csv', encoding="utf-8")

In [None]:
## re-ran the code for meta score. 
mt_score = []

for i in title_id:
    r = requests.get("http://www.imdb.com/title/"+str(i)+"/?ref_=nv_sr_2")
    soup = BeautifulSoup(r.content, "lxml")
    a = 0
    try:
        a = soup.find("a", attrs={"href":"criticreviews?ref_=tt_ov_rt"}).text.strip()
    except:
        a =  None
    print(a)
    mt_score.append(a)

In [None]:
header = ["metascores"]

In [None]:
## putting the scraped meta scores into their own dataframe
df_mt = pd.DataFrame(mt_score, columns=header)

In [None]:
df_mt.to_csv("meta.csv", encoding="utf-8")

# Cleaning all the datasets
<a name ="cleaning_data_sets"></a>

In [1]:
## This can be the second part of this project. From here on, I don't have to worry about web scraping: all 
## the scraped data was saved in csv files and is imported fresh with the following code
import pandas as pd
import requests
import numpy as np
import re
import csv
from bs4 import BeautifulSoup
from sklearn.metrics import roc_curve, auc, precision_recall_curve

In [3]:
!ls


Project-Netflix-Movie-Recommender.ipynb


In [2]:
## getting the scraped csv files. 
df1 = pd.read_csv('imdb_1.csv')
df2 = pd.read_csv('imdb_2.csv')
df3 = pd.read_csv('rt.csv')
df4 = pd.read_csv('meta.csv')

FileNotFoundError: File b'imdb_1.csv' does not exist

In [3]:
#Dropping the first column of each dataset
df1.drop(df1.columns[0], axis=1, inplace=True)
df2.drop(df2.columns[0], axis=1, inplace=True)
df3.drop(df3.columns[0], axis=1, inplace=True)
df4.drop(df4.columns[0], axis=1, inplace=True)

In [8]:
## Assigning header for df4
header = ["metascores"]
df4.columns = header

In [6]:
df1.head()

Unnamed: 0,title_id,title,runtime,genre,certificate,imdb_rating,gross,year,votes,director_actor,metascore
0,tt4912910,Mission: Impossible - Fallout,['147'],"Action, Adventure, Thriller",PG-13,8.4,['83.86'],(2018),47815,Director:\nChristopher McQuarrie\n| \n Star...,86.0
1,tt4154756,Avengers: Infinity War,['149'],"Action, Adventure, Fantasy",PG-13,8.7,['677.76'],(2018),417825,"Directors:\nAnthony Russo, \nJoe Russo\n| \n ...",68.0
2,tt3606756,Incredibles 2,['118'],"Animation, Action, Adventure",PG,8.1,['577.02'],(2018),77932,Director:\nBrad Bird\n| \n Stars:\nCraig T....,80.0
3,tt5463162,Deadpool 2,['119'],"Action, Adventure, Comedy",R,8.0,['317.81'],(2018),189429,Director:\nDavid Leitch\n| \n Stars:\nRyan ...,66.0
4,tt5104604,Isle of Dogs,['101'],"Animation, Adventure, Comedy",PG-13,8.0,['31.97'],(2018),58787,Director:\nWes Anderson\n| \n Stars:\nBryan...,82.0


In [None]:
df1.columns

In [None]:
## Adding all the columns together. 
df = pd.concat([df1, df2, df3, df4], axis = 1)

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
df = df.replace('N/A',np.nan)

In [None]:
# Dropping duplicates 
df.drop_duplicates(inplace = True)

In [None]:
len(df)

In [None]:
## dropping gross column, since I already have another column for gross with less NaN values
df.drop("gross",inplace = True,axis = 1)

In [None]:
df.info()

In [None]:
df.head()

##### Cleaning runtime. 
<a name = "cleaning_runtime"></a>

In [None]:
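## expand=False returns a Series; converting to float so the average can be filled in later
df["runtime"] = df['runtime'].str.extract(r'(\d+)', expand=False).astype(float)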

In [None]:
#df["runtime"] = df["runtime"].apply(lambda x: None if x is None else int(re.findall(r'\d+',i)[0]))

In [None]:
df[df.runtime.isnull()]

In [None]:
# fill the 4 missing values with the average. 
df["runtime"] = df["runtime"].fillna(106)

In [None]:
df.runtime.isnull().sum()

##### Cleaning Award rows

In [None]:
df.oscar_win = df.oscar_win.str.extract(r'(\d+)')
df.oscar_nom = df.oscar_nom.str.extract(r'(\d+)')
df.oscar_win.fillna(0, inplace=True)
df.oscar_nom.fillna(0, inplace=True)
df.other_win.fillna(0, inplace=True)
df.other_nom.fillna(0, inplace=True)



In [None]:
other_win = []
for i in df.other_win:
    a = 0
    try:
        a = int(i)
    except:
        a = 0
    other_win.append(a)
    
df.other_win = other_win

In [None]:
other_nom = []
for i in df.other_nom:
    a = 0
    try:
        a = int(i)
    except:
        a = 0
    other_nom.append(a)
    
df.other_nom = other_nom

##### Cleaning Genre and Creating columns for each genre

In [None]:
df.genre.isnull().sum()

In [None]:
df[df.genre.isnull()]

In [None]:
## filled the 3 missing genre rows with "Drama"
df.genre.fillna("Drama", inplace=True)

In [None]:
# Create a unique list for genre.
unique_genre = []
for i in df.genre:
    a = i.split(",")
    for b in a:
        if b.strip() not in unique_genre:
            unique_genre.append(b.strip())

In [None]:
## create a cleaned list of lists, where each inner list holds the genres of one row of the dataFrame. 
cleaned_genre = []
for i in df.genre:
    a = i.split(",")
    c =[]
    for b in a:
        c.append(b.strip())
    cleaned_genre.append(c)

In [None]:
## Creating new columns in the dataFrame and, based on the unique genre list, plugging in 1 or 0
for i in unique_genre:
    df[i] = [1 if i in x else 0 for x in cleaned_genre]
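The genre steps above (unique list, cleaned list of lists, one 0/1 column per genre) can be sketched end to end in plain Python. `multi_hot` is a hypothetical helper, and the three sample strings are shortened examples in the same comma-separated format as the genre column.

```python
def multi_hot(genre_strings):
    """Return the unique genres and a 0/1 column for each of them."""
    # One list of stripped genre names per row
    cleaned = [[g.strip() for g in s.split(",")] for s in genre_strings]
    # Unique genres in order of first appearance
    unique = []
    for row in cleaned:
        for g in row:
            if g not in unique:
                unique.append(g)
    # One column per genre: 1 if the row has that genre, otherwise 0
    columns = {g: [1 if g in row else 0 for row in cleaned] for g in unique}
    return unique, columns

sample = ["Action, Adventure, Thriller", "Animation, Action, Adventure", "Drama"]
unique, columns = multi_hot(sample)
print(unique)
print(columns["Action"])
```

Checking membership against the cleaned lists rather than the raw strings keeps a genre from matching inside a longer genre name.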

In [None]:
df.head(3)

In [None]:
# Dropping genre and opening_week columns 
df.drop("genre", axis=1, inplace=True)
df.drop("opening_week",axis=1, inplace=True)



In [None]:
df.head(3)

##### Meta Score

In [None]:
df.drop("metascore", axis=1, inplace=True) ## dropped the previous meta score code. 

In [None]:
df.metascores.value_counts(dropna=False).head()

In [None]:
#df.metascores.fillna(np.nanmean(df.metascores), inplace=True)

In [None]:
#df.metascores = df.metascores.apply(lambda x: int(x))

##### Munging Language, Country and year column. 

In [None]:
df.language.replace('0', "English", inplace=True)
df.country.replace('0', "USA", inplace=True)
df.year = df.year.apply(lambda x : int(re.findall(r"\D(\d{4})\D", x)[0]))
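The year column holds strings such as `(2018)`. A minimal sketch of the extraction, written so that an unexpected value gives `None` instead of raising an error, is shown here. `parse_year` is a hypothetical helper, and the `(I) (2017)` form is an assumption about how some titles are listed.

```python
import re

def parse_year(raw):
    """Return the four-digit year found in parentheses, or None if there is none."""
    match = re.search(r"\((\d{4})\)", raw)
    return int(match.group(1)) if match else None

print(parse_year("(2018)"))
print(parse_year("(I) (2017)"))
print(parse_year("unknown"))
```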

##### User_review and critic review

In [None]:
## removing the thousands separators; .str.replace skips missing values instead of failing on them
df.user_review = df.user_review.str.replace(",", "", regex=False)

In [None]:
df.user_review = df.user_review.str.extract(r'(\d+)')

In [None]:
df.user_review.isnull().sum()

In [None]:
#df.user_review.fillna(0, inplace=True)

In [None]:
df.critic_review = df.critic_review.str.extract(r'(\d+)')

In [None]:
df.critic_review.isnull().sum()

In [None]:
#df.critic_review.fillna(0, inplace=True)

##### budget and gross

In [None]:
## keeping the first run of digits as a number; a budget of 0 means no budget was found, so it becomes NaN
df.budget = pd.to_numeric(df.budget.str.extract(r'(\d+)', expand=False))
df.budget = df.budget.replace(0, np.nan)

In [None]:
## same treatment for gross: digits to a number, and 0 becomes NaN
df.gross_1 = pd.to_numeric(df.gross_1.str.extract(r'(\d+)', expand=False))
df.gross_1 = df.gross_1.replace(0, np.nan)
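The idea behind the budget and gross clean-up is to keep the digits as a number and treat 0 as missing. A minimal plain-Python sketch is shown here. `parse_money` is a hypothetical helper and the sample strings are made up to resemble the scraped values.

```python
import re

def parse_money(raw):
    """Return the amount as an int, or None for missing values and zeros."""
    if not isinstance(raw, str):
        return None                      # NaN and other non-strings count as missing
    match = re.search(r"\d+", raw.replace(",", ""))
    value = int(match.group(0)) if match else 0
    return value or None                 # 0 was the placeholder for "not found"

print(parse_money("$25000000"))
print(parse_money("0"))
print(parse_money(float("nan")))
```

Returning `None` for both cases means the later imputation step sees one kind of missing value rather than a mix of zeros and NaN.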

##### Rotten Tomato scores

In [None]:
df.rt_score = df.rt_score.str.extract(r'(\d+)')

In [None]:
df.rt_score = [None if type(x) is float else int(x) for x in df.rt_score]

In [None]:
df.rt_avg_rating = [None if type(x) is float else x.split("/")[0] for x in df.rt_avg_rating]


In [None]:
df.rt_avg_rating = [int(x) if x else None for x in df.rt_avg_rating]

In [None]:
df.rt_audience_score = df.rt_audience_score.str.extract(r'(\d+)')

In [None]:
df.rt_avg_aud_rating = [None if type(x) is float else x.split("/")[0] for x in df.rt_avg_aud_rating]

In [None]:
df.rt_user_rating = df.rt_user_rating.str.replace(",", "", regex=False)

##### Getting Writers.

In [None]:
writer = []
for x in df.writer:
    try:
        a = x.split("(")[0].strip()
        a = a.strip(",")
        a = a.replace(" ", "_")
    except:
        a = None
    writer.append(a)
df.writer = writer

In [None]:
unique_writer = []
for i in df.writer:
    if i not in unique_writer:
        unique_writer.append(i)
len(unique_writer)

In [None]:
## Dropping writer since there are too many unique values. 
df.drop("writer", axis=1, inplace=True)

##### Getting Directors.

In [None]:
## Creating a new column with only Directors name. 
director = []
for x in df.director_actor:
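    try:
        ## setting a on every row so a value from the previous movie is never reused;
        ## "Director" also matches the plural label "Directors:"
        a = x if "Director" in x else None
    except TypeError:
        a = None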
    director.append(a)
df["director"] = director

In [None]:
## Getting rid of the stars part and the label, keeping only the director's name
director_1 = []
for i in df.director:
    try:
        a = i.split("|")[0]
        a = a.split(":")[1]
        a = a.strip()
        a = a.replace(" ", "_")
    except:
        a = None
    director_1.append(a)
df["director"] = director_1
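The `director_actor` text has the shape `Director: ... | Stars: ...`, so directors and stars can be separated in one pass. The sketch below is one way to do it. `split_credits` is a hypothetical helper and the sample string is made up to resemble the scraped column.

```python
def split_credits(text):
    """Return (directors, stars) as lists of names with spaces turned into underscores."""
    directors, stars = [], []
    if not isinstance(text, str):
        return directors, stars          # missing values give two empty lists
    for part in text.split("|"):
        label, _, names = part.partition(":")
        cleaned = [n.strip().replace(" ", "_") for n in names.split(",") if n.strip()]
        label = label.strip()
        if label.startswith("Director"):  # covers "Director" and "Directors"
            directors = cleaned
        elif label.startswith("Star"):
            stars = cleaned
    return directors, stars

sample = "Directors:\nAnthony Russo, \nJoe Russo\n| \n    Stars:\nRobert Downey Jr., \nChris Hemsworth"
print(split_credits(sample))
```

Returning lists keeps movies with several directors intact instead of folding them into a single string.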

In [None]:
df.director.isnull().sum()

In [None]:
#df[df.director.isnull()].director = "Don Hertzfeldt" ## was trying to manually import one missing value.
## getting rid of that one missing value
df.director.dropna(inplace=True)

In [None]:
u_director = []
for i in df.director:
    if i not in u_director:
        u_director.append(i)
len(u_director)

In [None]:
## Dropping "director" columns since there are too many unique values. 
df.drop("director", axis=1, inplace=True)

##### Getting Actors. 

In [None]:
## Creating a new column with only the actors' names. 
actor = []
for x in df.director_actor:
    try:
        ## setting a on every row so a value from the previous movie is never reused
        a = x if "Stars:" in x else None
    except TypeError:
        a = None
    actor.append(a)
df["actors"] = actor

In [None]:
## Getting rid of the director part and the label, then splitting the actors into a list
actor = []
for i in df.actors:
    try:
        a = i.split("|")[1]
        a = a.split(":")[1]
        a = a.strip("\n")
        a = a.strip(",")
        a = a.split(",")
        #a = a.replace(" ", "_")
    except:
        a = None
    actor.append(a)
df["actors"] = actor

In [None]:
df.actors[0]

In [None]:
b = []
for i in df.actors:
    try:
        a = []
        for j in i:
            c = j.strip()
            c = c.replace(" ","_")
            a.append(c)
    except:
        a = None
    b.append(a)
df.actors = b
        

In [None]:
df.actors[1]

In [None]:
# Dropping the director_actor column since it's not needed anymore.
df.drop("director_actor", axis=1,inplace=True)

In [None]:
# Create a unique list for directors, writers.

In [None]:
## Dropping the actors for now but will look into it in the future. 
df.drop("actors",axis = 1, inplace=True)

##### Certificate

In [None]:
df.certificate.replace("Not Rated", "Unrated", inplace=True)
df.certificate.replace(np.nan, "Unrated", inplace=True)

In [None]:
df.certificate.value_counts(dropna=False)

## Visualization
<a name="visualization"></a>

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
fig, ax = plt.subplots(figsize = (15,8))
sns.heatmap(df.corr());

In [None]:
fig, ax = plt.subplots(figsize = (15,8))
sns.heatmap(df.corr()**2);
plt.title("Correlation between features", size = 20);

In [None]:
fig, ax = plt.subplots(figsize = (15,8))
sns.distplot(df.imdb_rating,bins = 50);
plt.title("Distribution of imdb ratings among movies", size = 20);

The distribution of imdb_rating clearly shows that I chose two separate sets of movies. This could give me a model with better scores. However, it is important to keep in mind that in the real world the distribution does not behave like this. Often you get data sets 


In [None]:
fig, ax = plt.subplots(figsize = (15,8))
sns.distplot(df.year,bins = 50);
plt.title("Distribution of years among movies", size = 20);

Most of the movies are pretty recent.

In [None]:
df

## Feature Engineering
<a name ="feature_engineering"></a>

In [None]:
df = pd.get_dummies(df, columns=["language", "country"], drop_first=True)

In [None]:
df = pd.get_dummies(df, columns=["certificate"],dummy_na=True, drop_first=True)

In [None]:
df['label'] = df.imdb_rating.apply(lambda x : 1 if x > 7.5 else 0)

In [None]:
y = df.label
X = df.drop(["title_id", "title", "imdb_rating", "label" ],axis = 1)


### Train-test split
<a name="train_test_split"></a>

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.25)

#### Imputing for NaN values

In [None]:
from fancyimpute import KNN

In [None]:
header = X_train.columns

In [None]:
X_train= KNN(k=5).complete(X_train)

In [None]:
X_train = pd.DataFrame(X_train, columns=header)

In [None]:
X_train.head()

In [None]:
X_test = KNN(k=5).complete(X_test)

In [None]:
X_test = pd.DataFrame(X_test, columns=header)

In [None]:
X_test.head()

In [None]:
# imputing the mean for any nulls
"""for i in numeric_columns:
    X_train[i] = [X_train.i.mean() if x == np.nan else x for x in X_train[i]]
    X_test[i] = [X_test.i.mean() if x == np.nan else x for x in X_test[i]]"""

#### Feature scaling

In [None]:
from sklearn.preprocessing import minmax_scale, StandardScaler

In [None]:
scale = ["runtime", "year", "votes", "user_review", "critic_review", "budget", "gross_1", "rt_score", "rt_avg_rating", "rt_audience_score", "rt_user_rating", "rt_avg_aud_rating", "rt_fresh", "rt_rotten", "metascores"]

In [None]:
scaler = StandardScaler()

In [None]:
X_train[scale] = scaler.fit_transform(X_train[scale])
#X_test[scale] = scaler.fit_transform(X_test[scale])

In [None]:
## the test set reuses the mean and standard deviation learned from the training set
X_test[scale] = scaler.transform(X_test[scale])
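Standard scaling subtracts a mean and divides by a standard deviation, and those two numbers should come from the training data only. Here is a minimal plain-Python sketch of that idea for a single feature. `fit_scaler` and `apply_scaler` are hypothetical helpers, the numbers are made up, and the sketch assumes the feature is not constant (so the standard deviation is not zero).

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation of one feature."""
    return mean(values), pstdev(values)

def apply_scaler(values, mu, sigma):
    """Scale values with statistics that were learned elsewhere."""
    return [(v - mu) / sigma for v in values]

train = [90.0, 100.0, 110.0]   # made-up runtimes for the training set
test = [100.0, 120.0]          # made-up runtimes for the test set

mu, sigma = fit_scaler(train)                  # learned from train only
train_scaled = apply_scaler(train, mu, sigma)
test_scaled = apply_scaler(test, mu, sigma)    # reuses the train statistics
print(train_scaled)
print(test_scaled)
```

If the test set were scaled with its own mean and standard deviation, the same raw value could end up with different scaled values in train and test.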

## Creating Models
<a name="creating_models"></a>


In [None]:
from sklearn.ensemble import RandomForestClassifier,BaggingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report,precision_recall_curve
from sklearn.tree import DecisionTreeClassifier

##### Finding the best Max_depth

In [None]:
# finding the best max depth for the DecisionTreeClassifier. 

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

for i in [1,2,3,4,5,6,7,None]:
    print("Max depth:{}".format(i))
    clf = DecisionTreeClassifier(max_depth=i)
    print(cross_val_score(clf, X_train, y_train, cv=cv, n_jobs =1).mean())
    print()

##### GridSearch on Random Forest

In [None]:
grid = {
    'n_estimators': [10, 20, 30, 50, 100],
    'max_features': [1,2,3,4,5,6,'sqrt'],
    'criterion': ['gini','entropy'],
    'class_weight': ["balanced","balanced_subsample",None]
}



## the forest builds its own trees, so the max depth is passed to it directly
rf = RandomForestClassifier(max_depth=6)
gs = GridSearchCV(rf, grid)

model_rf_gs = gs.fit(X_train, y_train)
gs.best_params_

In [None]:
print(gs.best_params_)
print(gs.best_score_)

##### Create Random Forest Model using the best_params_ from GridSearch

In [None]:
# the grid keys are exactly the constructor arguments, so unpack them directly
rf = RandomForestClassifier(**gs.best_params_)
model_rf = rf.fit(X_train, y_train)
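
As an alternative to rebuilding the forest by hand, `GridSearchCV` (with its default `refit=True`) already refits the best parameter combination on the full training data and exposes it as `best_estimator_`. A minimal sketch on synthetic data, with a deliberately small grid that is an assumption for speed rather than the grid used in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for X_train / y_train (not the movie data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, random_state=11)

# Small illustrative grid.
small_grid = {
    "n_estimators": [10, 20],
    "max_features": [2, "sqrt"],
    "criterion": ["gini", "entropy"],
}

cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=11)
search = GridSearchCV(RandomForestClassifier(random_state=11), small_grid, cv=cv_demo)
search.fit(X_demo, y_demo)

# Already fitted on all of X_demo with the winning parameters.
best_rf = search.best_estimator_

print(search.best_params_)
print(round(search.best_score_, 3))
print(best_rf.predict(X_demo[:5]))
```

Passing an explicit `random_state` keeps the search and the final model reproducible, which a freshly constructed forest without a seed is not.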

In [None]:
## Predict y_train for X_train
y_pred = model_rf.predict(X_train)

In [None]:
## Predict y_test
y_pred_test = model_rf.predict(X_test)

### Result Metrics
<a name="result_metrics"></a>

In [None]:
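# printing the confusion matrix
# note: confusion_matrix orders rows and columns by sorted label value,
# so the names below must follow that order
pd.DataFrame(confusion_matrix(y_test, y_pred_test),
             columns=["Predicted high", "Predicted low"],
             index=["is high", "is low"])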
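
Since `confusion_matrix` sorts the labels by default, passing `labels=` explicitly is one way to make sure the row and column names match the counts. The sketch below assumes a hypothetical encoding of 1 for a high rating and 0 for a low rating, which may differ from the encoding used for `y_test` in this notebook.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical encoding: 1 = high rating, 0 = low rating.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0, 1, 0]

# labels=[1, 0] puts the "high" class in the first row and column.
cm = confusion_matrix(y_true, y_hat, labels=[1, 0])
cm_df = pd.DataFrame(cm,
                     index=["is high", "is low"],
                     columns=["Predicted high", "Predicted low"])
print(cm_df)

# Default ordering sorts the labels, so the "low" class (0) comes first.
cm_default = confusion_matrix(y_true, y_hat)
print(cm_default)
```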

In [None]:
print(classification_report(y_train, y_pred, labels=model_rf.classes_))

In [None]:
# Print classification report for y_test
print(classification_report(y_test, y_pred_test, labels=model_rf.classes_))

In [None]:
# Printing accuracy Score
print(accuracy_score(y_train, y_pred))
print(accuracy_score(y_test, y_pred_test))

### Feature Importance
<a name="feature_importance"></a>

In [None]:
# Get the impurity-based feature importances (Gini or entropy, depending on the chosen criterion)
feature_importances = pd.DataFrame(model_rf.feature_importances_, 
                                   index = X_train.columns, 
                                   columns=['importance'])

feature_importances[feature_importances['importance']!=0].sort_values(by='importance', ascending=False).head(10)

### ROC/AUC CURVE
<a name="roc_auc_curve"></a>

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# predicted probability of class 1 (second column of predict_proba)
Y_score = model_rf_gs.best_estimator_.predict_proba(X_test)[:,1]

# For class 1, find the area under the ROC curve
FPR, TPR, _ = roc_curve(y_test, Y_score)
ROC_AUC = auc(FPR, TPR)

# Plot the ROC curve for class 1
plt.figure(figsize=[11,9])
plt.plot(FPR, TPR, label='ROC curve (area = %0.2f)' % ROC_AUC, linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('Random Forest for imdb rating', fontsize=18)
plt.legend(loc="lower right")
plt.show()

In [None]:
from sklearn.metrics import auc

# predicted probability of class 1 (second column of predict_proba)
Y_score = model_rf_gs.best_estimator_.predict_proba(X_test)[:,1]

# For class 1, find the area under the precision-recall curve
PREC, REC, _ = precision_recall_curve(y_test, Y_score)
PR_AUC = auc(REC, PREC)

# Plot the precision-recall curve for class 1
plt.figure(figsize=[11,9])
plt.plot(REC, PREC, label='PR curve (area = %0.2f)' % PR_AUC, linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall Rate', fontsize=18)
plt.ylabel('Precision Rate', fontsize=18)
plt.title('Random Forest for imdb rating', fontsize=18)
plt.legend(loc="lower right")
plt.show()
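
When reading a precision-recall curve, the useful reference is the share of positives in the data rather than a diagonal: a classifier that scores at random has a precision close to that share at every recall level. A minimal sketch with hand-made labels and scores (hypothetical values, not model output from this notebook), which also computes `average_precision_score` as a second summary of the curve:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

# Hypothetical labels and predicted probabilities for class 1.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7, 0.6, 0.3])

prec, rec, _ = precision_recall_curve(y_true, scores)

# Trapezoidal area under the PR curve, as in the cell above.
pr_auc = auc(rec, prec)

# Step-wise summary of the same curve.
ap = average_precision_score(y_true, scores)

# Baseline precision of a random scorer: the fraction of positives.
baseline = y_true.mean()

print(round(pr_auc, 3), round(ap, 3), baseline)
```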

# Wrapping Up
<a name="wrapping_up"></a>

My model seems to perform too well. One reason could be the way I collected the data: my hypothesis is that I introduced bias while scraping. The way I web-scraped and collected my data points left a huge gap between the two types of movies, which is unlikely in real-world data problems. This is something to keep in mind for upcoming projects. 

### Next..
<a name="next"></a>

I believe one of the reasons my model gave such good results is the way I collected the data from the beginning. While web scraping, I collected just two types of movies: first those with imdb_rating over 8, then those with imdb_rating below 5. Because of this big gap in the ratings, my model could separate the two groups too easily, which is unlikely in the real world. Therefore, my next step is to web scrape again, this time with more variety in the ratings, since the rating is my target variable. I would also like to work with actors, directors and writers as part of NLP for the next trial. 
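
The effect of that gap can be illustrated with a small simulation. Everything in it is an assumption made for the sketch: the ratings are drawn uniformly, the two features are just the rating plus noise, and "high" is defined as a rating of 6.5 or more. It compares a random forest trained on the full range of ratings with one trained only on the extremes (above 8 or below 5).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(11)
n = 4000

# Hypothetical ratings and two noisy features that loosely track them.
rating = rng.uniform(1.0, 10.0, size=n)
X = np.column_stack([rating + rng.normal(0, 1.5, n),
                     rating + rng.normal(0, 1.5, n)])
y = (rating >= 6.5).astype(int)  # assumed definition of a "high" rating


def holdout_accuracy(features, target):
    """Train a forest on 70% of the rows and score it on the other 30%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, target, test_size=0.3, random_state=11, stratify=target)
    model = RandomForestClassifier(n_estimators=50, random_state=11)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)


# Keep only the movies that the original scrape would have collected.
extremes = (rating > 8.0) | (rating < 5.0)

acc_all = holdout_accuracy(X, y)
acc_extremes = holdout_accuracy(X[extremes], y[extremes])

print("full range:", round(acc_all, 3))
print("extremes only:", round(acc_extremes, 3))
```

In this toy setting the extremes-only model scores higher simply because the hard cases near the class boundary were never collected, which is the same kind of optimism I suspect in my results.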