#  Webscraping Module


1. Import the required libraries
    * requests : Retrieves the HTML of the source page.
    * bs4      : Converts the retrieved HTML text into a soup object, from which we can extract specific data such as headers, paragraphs, etc.
    * lxml     : Parser that bs4 uses to understand the HTML.
    * csv      : Reads from and writes to CSV files.

In [1]:
import requests
import lxml
import bs4
import csv 
import time
from math import sqrt


2. Load the featured page links into a list so that we can request them, and create headers that mimic an actual browser.

In [2]:
urls =["https://www.metacritic.com/browse/movies/score/metascore/all/filtered?sort=desc&page=0",
       "https://www.metacritic.com/browse/movies/score/metascore/all/filtered?sort=desc&page=1",
       "https://www.metacritic.com/browse/movies/score/metascore/all/filtered?sort=desc&page=2",
       "https://www.metacritic.com/browse/movies/score/metascore/all/filtered?sort=desc&page=3",
       "https://www.metacritic.com/browse/movies/score/metascore/all/filtered?sort=desc&page=4"]
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
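
The five listing URLs above differ only in the trailing `page` value, so the list could also be generated instead of typed out. The following is a minimal sketch under that assumption; `BASE_URL`, `build_page_urls` and `page_urls` are names introduced here for illustration and are not part of the original notebook.

```python
# Assumption: the listing pages differ only in the trailing "page" query parameter.
BASE_URL = ("https://www.metacritic.com/browse/movies/score/metascore/all/"
            "filtered?sort=desc&page={}")


def build_page_urls(page_count):
    """Return the listing URLs for pages 0 .. page_count - 1."""
    return [BASE_URL.format(page) for page in range(page_count)]


page_urls = build_page_urls(5)
for page_url in page_urls:
    print(page_url)
```

Generating the list keeps the URLs consistent if more pages need to be scraped later.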



3. Initialise the variables that are used to store the data.

In [3]:
movie_short_href={} # used to store the href values 

movie_details={}    # Used to store the final movie details as 
                    #   { 'movie_name':{'meta-score': 'score',
                    #                   'Director':['director_name'], 
                    #                   'Principal Cast' : [name_1, name_2 ...name_m], 
                    #                   'Cast':[name_1, name_2 ... name_n] }}  note m>n.
            
failed_items=[]     # Used to store the ids of the movies for which the request failed.




4. The process below collects the link for each movie from the main listing pages.

    Result: we now have the collection of movie endpoints.

In [4]:

index_num=0
for url in urls:
    result = requests.get(url,headers=headers)
    soup = bs4.BeautifulSoup(result.text, "lxml")
    div_tag = soup.find_all('div', class_=["browse_list_wrapper one browse-list-large",
                                           "browse_list_wrapper two browse-list-large",
                                           "browse_list_wrapper three browse-list-large",
                                           "browse_list_wrapper four browse-list-large"])
    for item in div_tag:
        for element in item.find_all('td', class_='clamp-image-wrap'):
            index_num=index_num+1
            source=element.find_all('a' , href= True)[0]
            movie_short_href[index_num]=source['href'].replace('/movie/','')
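
To illustrate what this step extracts without any network access, here is a dependency-free sketch that uses the standard library's `html.parser` instead of bs4 (a swapped-in technique, not the notebook's method). The HTML fragment is made up for the example, and `MovieLinkParser` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class MovieLinkParser(HTMLParser):
    """Collect the first <a href> inside each <td class="clamp-image-wrap">."""

    def __init__(self):
        super().__init__()
        self.in_cell = False      # True while inside a matching <td>
        self.took_link = False    # True once the first link of the cell is stored
        self.slugs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "td" and "clamp-image-wrap" in classes:
            self.in_cell = True
            self.took_link = False
        elif tag == "a" and self.in_cell and not self.took_link and attrs.get("href"):
            # Same clean-up as the notebook: drop the '/movie/' prefix.
            self.slugs.append(attrs["href"].replace("/movie/", ""))
            self.took_link = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False


# Made-up fragment that imitates one row of the listing table.
sample_html = """
<table><tr>
  <td class="clamp-image-wrap"><a href="/movie/citizen-kane"><img src="x.jpg"></a></td>
  <td class="clamp-summary-wrap"><a href="/movie/citizen-kane">Citizen Kane</a></td>
</tr></table>
"""
parser = MovieLinkParser()
parser.feed(sample_html)
print(parser.slugs)  # ['citizen-kane']
```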
    
    

5. This method scrapes the details of an individual movie from its details page. If the request fails, we store the movie id in a list (failed_items).

In [5]:
def retrive_data_from_website(i):
    movie_link='https://www.metacritic.com/movie/'
    url = movie_link+movie_short_href[i]+'/details'
    result = requests.get(url,headers=headers)
    if result.status_code != 200:
        # Remember the failure (once) so that the href can be corrected and retried later.
        if i not in failed_items:
            failed_items.append(i)
    else:
        final_details={}
        soup = bs4.BeautifulSoup(result.text,'lxml')
        tags = soup.find_all('table', class_='credits')
        wanted_columns=['Director', 'Principal Cast', 'Cast']
        movie_name = soup.find_all('div', class_='product_page_title oswald upper')[0].find_all('h1')[0].getText()
        meta_score = soup.find_all('a', class_ = 'metascore_anchor')[0].getText().strip()
        final_details['meta-score']=meta_score
        # Each credits table is labelled by its first header cell; keep only the wanted ones.
        for table in tags:
            column = table.find_all('th', class_="person")[0].getText()
            if column in wanted_columns:
                details=[]
                for name in table.find_all('a', href=True):
                    details.append(name.getText().strip())
                final_details[column]=details
        if movie_name not in movie_short_href.values():
            movie_short_href[i] = movie_name
        movie_details[movie_short_href[i]] = final_details
        # The request succeeded, so clear any earlier failure recorded for this movie.
        if i in failed_items:
            failed_items.remove(i)
    # Pause briefly after every 100th movie.
    if i in [100,200,300,400,500]:
        time.sleep(2)
    

6. Now we hit every movie link in our collection and check for failed attempts.
    A failed attempt implies that the href we obtained differs from the one the website
    uses internally (the site reroutes it). For the failed items only, instead of hitting the
    details page directly, we go step by step and find the actual href value.
    
    Example:
    movie href value = 'citizen-kane' on page 1
    we would be hitting = '/movie/citizen-kane/details', which is the wrong endpoint.
    actual endpoint = '/movie/citizen-kane-1941/details' 
    
    For movies like these, which are fewer than 10 in count,
    we first hit the movie page '/movie/citizen-kane', which the server redirects to
    '/movie/citizen-kane-1941'. We then take the final URL, update the href value in the
    movie_short_href variable, and continue the process.
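
The href correction described above amounts to taking the last path segment of the URL that the request ends up on. The following is a minimal sketch of that extraction using `urllib.parse` from the standard library; `slug_from_url` is a hypothetical helper name, and the sketch works on a plain URL string so it needs no network access.

```python
from urllib.parse import urlparse


def slug_from_url(final_url):
    """Return the last path segment of a URL, ignoring a trailing slash."""
    path = urlparse(final_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


print(slug_from_url("https://www.metacritic.com/movie/citizen-kane-1941"))
# citizen-kane-1941
```

Parsing the URL first means that a query string or trailing slash on the redirected URL does not end up in the stored href.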
    

In [6]:

for i in movie_short_href.keys():
    retrive_data_from_website(i)
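if len(failed_items)>0:
    # Iterate over a copy, because retrive_data_from_website modifies failed_items.
    for i in list(failed_items):
        url = 'https://www.metacritic.com/movie/'+movie_short_href[i]
        dummy = requests.get(url,headers={'User-Agent': 'Mozilla/5.0'})
        # requests follows the redirect, so dummy.url holds the corrected endpoint.
        movie_short_href[i]=dummy.url[dummy.url.rindex('/')+1::]
        retrive_data_from_website(i)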


#  Store into CSV 


7. Now we use the csv writer class to save our movie details into a CSV file.

In [7]:
with open("dict.csv", "w", newline="") as csv_file:  
    writer = csv.writer(csv_file)
    writer.writerow(['Movie Name','Director','Cast','Meta Score'])
    for key, value in movie_details.items():
        writer.writerow([key,value['Director'],value.get('Principal Cast',[])+value.get('Cast',[]),value['meta-score']])
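
The cell above writes the director and cast lists as Python list representations. If plain text columns are preferred, the lists can be joined before writing. The following is a sketch under that assumption: `write_movies_csv`, the `"; "` separator and the toy record are all invented for illustration, and an in-memory buffer stands in for the file.

```python
import csv
import io


def write_movies_csv(movie_details, csv_file):
    """Write one row per movie, joining list-valued fields into plain strings."""
    writer = csv.writer(csv_file)
    writer.writerow(["Movie Name", "Director", "Cast", "Meta Score"])
    for name, info in movie_details.items():
        cast = info.get("Principal Cast", []) + info.get("Cast", [])
        writer.writerow([
            name,
            "; ".join(info.get("Director", [])),  # tolerate a missing Director key
            "; ".join(cast),
            info.get("meta-score", ""),
        ])


# Toy record invented for illustration only.
sample = {
    "Example Movie": {
        "Director": ["A. Director"],
        "Principal Cast": ["Actor One"],
        "Cast": ["Actor Two"],
        "meta-score": "90",
    }
}
buffer = io.StringIO()
write_movies_csv(sample, buffer)
print(buffer.getvalue())
```

Using `.get` with a default also avoids a `KeyError` for movies whose details page lists no director.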


# Store in SQLite DB
8. We store our movie details in an SQLite database named MovieInfoDatabase.db, in the table MovieInfoTable.

In [8]:
import sqlite3
connection = sqlite3.connect('MovieInfoDatabase.db')
cursor= connection.cursor()
# IF NOT EXISTS lets this cell be re-run without raising "table MovieInfoTable already exists".
cursor.execute('CREATE TABLE IF NOT EXISTS MovieInfoTable (Movie_name Varchar, Director varchar, Cast Varchar, Meta_score varchar)')
for key, value in movie_details.items():
    cast = ','.join(value.get('Principal Cast',[])+value.get('Cast',[]))
    # Use placeholders so that quotes inside names cannot break the statement.
    sql_query = "insert into MovieInfoTable values (?, ?, ?, ?)"
    cursor.execute(sql_query, (str(key), str(value['Director'][0]), cast, str(value['meta-score'])))
connection.commit()

 9. Look up movie info by giving a movie name.
 

In [14]:
movie_name = input('Which movie do you want to check? >')
if movie_name  in movie_details.keys():
    desire_info = input('What information about this movie do you want to check? (Choose Director or Cast)')
    print(movie_details[movie_name][desire_info])
else:
    print('The movie name you provided is not in our data or does not match exactly. Choose from the movies below:')
    print(movie_details.keys())
    


Which movie do you want to check? >Boyhood
What information about this movie do you want to check? (Choose Director or Cast)Cast
['Alyssa Petersen', 'Andrea Chen', 'Andrew Villarreal', 'Angela Rawna', 'Bonnie Cross', 'Brad Hawkins', 'Cassidy Johnson', 'David Blackwell', 'Deanna Brochin', 'Derek Chase Hickey', 'Elijah Smith', 'Ethan Hawke', 'Evie Thompson', 'Greg Baglia', 'Jamie Howard', 'Jonathan Bell', 'Jordan Howard', 'Libby Villari', 'Lorelei Linklater', 'Marco Perella', 'Mark Finn', 'Megan Devine', 'Mona Lee Fultz', 'Natalie Wilemon', 'Nick Krause', 'Patricia Arquette', 'Ryan Power', 'Sam Dillon', 'Savannah Welch', 'Shane Graham', 'Sharee Fowler', 'Steven Chester Prince', 'Sydney Orta', 'Tamara Jolaine', 'Tess Allen', 'Zoe Graham']


# Task 1:
1. Analyze how many times has each actor/actress appeared in these top 500 movies, analyze how many times has each director appeared in these top 500 movies, what can that tell you about their career?

We accumulate, for every cast member, the number of times they appeared in the top 500 movies together with the rating given by the Metacritic website (the metascore) into the "cast_details" dictionary. Directors are collected the same way in "director_details".

Collect all the cast members and directors from each movie and add them to the dictionaries:
    cast_details     = {'name' :{'count':'v','rating':'v'}}
    director_details = {'name' :{'count':'v','rating':'v'}}
                          
Count is the number of movies that they worked on.
Rating is the average metascore of all the movies that they worked on.
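
The count and average described above can be sketched on toy data as follows. This is an illustration only: `summarize_people` is a hypothetical helper, and the two movies and their scores are invented, not scraped results.

```python
from collections import defaultdict


def summarize_people(movies, role_keys):
    """Return {name: {'count': n, 'rating': average metascore}} for the given roles."""
    totals = defaultdict(lambda: {"count": 0, "rating": 0})
    for info in movies.values():
        # A set ensures a person listed under several roles is counted once per movie.
        names = set()
        for key in role_keys:
            names.update(info.get(key, []))
        for name in names:
            totals[name]["count"] += 1
            totals[name]["rating"] += int(info["meta-score"])
    return {
        name: {"count": t["count"], "rating": t["rating"] / t["count"]}
        for name, t in totals.items()
    }


# Toy data invented for illustration only.
toy_movies = {
    "Movie A": {"Director": ["Director X"], "Cast": ["Actor One", "Actor Two"],
                "meta-score": "100"},
    "Movie B": {"Director": ["Director X"], "Principal Cast": ["Actor One"],
                "Cast": ["Actor One"], "meta-score": "90"},
}
cast_summary = summarize_people(toy_movies, ["Principal Cast", "Cast"])
director_summary = summarize_people(toy_movies, ["Director"])
print(cast_summary["Actor One"])  # {'count': 2, 'rating': 95.0}
```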
    
                               
                        


In [15]:
cast_details={}

director_details={}

# The two functions below build the cast and director dictionaries with their details.
def add_castnames_count(cast_names,score):
    for name in cast_names:
        cast_details[name]= cast_details.get(name,{'count':0, 'rating': 0})
        cast_details[name]['count'] = cast_details[name].get('count',0) + 1
        cast_details[name]['rating'] = cast_details[name]['rating']+int(score)
            
def add_directornames_count(director_name,score):
    for name in director_name:
        director_details[name]= director_details.get(name,{'count':0, 'rating': 0})
        director_details[name]['count'] = director_details[name].get('count',0) + 1
        director_details[name]['rating'] = director_details[name]['rating']+int(score)   
        
#We run the methods from the below code by passing one movie detail at a time.
for movie_name,movies in movie_details.items():
    add_castnames_count(list(set(movies.get('Principal Cast',[])+movies.get('Cast',[]))), movies['meta-score'])
    add_directornames_count(movies.get('Director',[]),movies['meta-score'])

# The code below is performed to find the average rating of the cast/director.
for name in cast_details.keys():
        cast_details[name]['rating'] = cast_details[name]['rating']/cast_details[name]['count']
for name in director_details.keys():
        director_details[name]['rating'] = director_details[name]['rating']/director_details[name]['count']
    
#To-Do

# Task 2: Finding cosine similarities between directors

There are 410 distinct directors in the top 500 movies.

First we create a collection of directors along with the cast they worked with and how many times they worked together.

director_cast_details = {'director-1':{'cast-1' : 1, 'cast-2' : 3, ...}..}

 
To find the cosine similarity between two directors, we need all the cast members who worked with director-1 or director-2, which we call union_cast. We then create, for each director, a vector that holds the number of times they worked with each cast member in union_cast.
 
 Suppose we take the pair director-1, director-2 and look at the vectors that we need to generate.
 
 Example:
 
 'director-1':{'cast-1' : 1, 
                'cast-2' : 2, }
 'director-2':{'cast-3' : 3, 
               'cast-2' : 4, }
               
  The length of each vector is the number of distinct cast members across both directors.
  [cast-1, cast-2, cast-3]
  
  The director vectors are 
  director-1 = [1, 2, 0]
  
  director-2 = [0, 4, 3]
  
  We pass these vectors to our cosine function and get the similarity.
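
As a worked check of the example above: cosine similarity is the dot product of the two vectors divided by the product of their lengths. For [1, 2, 0] and [0, 4, 3] the dot product is 8 and the lengths are sqrt(5) and 5, which gives roughly 0.71554. The sketch below reproduces that number; `cosine_from_counts` is a name chosen for this illustration and mirrors the approach described in the text rather than replacing the notebook's function.

```python
from math import sqrt


def cosine_from_counts(counts_1, counts_2):
    """Cosine similarity of two {cast_name: times_worked_together} dictionaries."""
    # Fix one ordering of the union so both vectors use the same positions.
    union_cast = sorted(set(counts_1) | set(counts_2))
    vector_1 = [counts_1.get(name, 0) for name in union_cast]
    vector_2 = [counts_2.get(name, 0) for name in union_cast]
    dot = sum(a * b for a, b in zip(vector_1, vector_2))
    norm = sqrt(sum(a * a for a in vector_1)) * sqrt(sum(b * b for b in vector_2))
    return 0 if norm == 0 else round(dot / norm, 5)


director_1 = {"cast-1": 1, "cast-2": 2}
director_2 = {"cast-3": 3, "cast-2": 4}
print(cosine_from_counts(director_1, director_2))  # 0.71554
```

Two directors who share no cast members get a dot product of 0 and therefore a similarity of 0.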

  
 
 

In [17]:
director_cast_details={}

# We create a dictionary with the director name as key; the value maps each cast member the director worked with to the number of times they worked together.
for v in movie_details.values():
    for director_name in v['Director']:
        cast_data=director_cast_details.get(director_name,{})
        cast = list(set(v.get('Principal Cast',[])+(v.get('Cast',[]))))
        for name in cast:
            cast_data[name] = cast_data.get(name,0)+1
        director_cast_details[director_name]=cast_data

# Inner product function is created to be used by the  cosine similarity function.
def inner_product(vector_1, vector_2):    
    product_sum = 0
    for i in range(len(vector_1)):
        product_sum = product_sum +(vector_1[i]*vector_2[i])
    return product_sum
        
# Cosine similarity function has been created which takes the dictionary value of director 1 and director 2.
def cosine_similarity(director_1, director_2):
    union_cast= set(list(director_1.keys())+list(director_2.keys()))
    vector_1=[director_1.get(i,0) for i in union_cast]       # This line provides us with the vector of the directors with respect to the all cast between two directors.
    vector_2=[director_2.get(i,0) for i in union_cast]
    numerator = inner_product(vector_1,vector_2)
    if numerator==0:
        return 0
    denominator = sqrt(inner_product(vector_1,vector_1)) * sqrt(inner_product(vector_2,vector_2))
    if denominator==0:
        return 0
    return round(numerator/denominator,5)


while(True):
    print("Please choose from the below Options \n\t1.List out of all directors\n\t2.Cosine similarity between  directors \n\t3.Exit")
    choice= int(input())
    if choice==2:
        print("Please enter the names of the directors whose similarity you want to find")
        director_1 = str(input("Enter the name of the first director: "))
        director_2 = str(input("Enter the name of the second director: "))
        print("The cosine similarity between {0} and {1} is {2}\n".format(director_1,director_2,cosine_similarity
                                                                        (director_cast_details[director_1],director_cast_details[director_2])))
    elif choice==3:
        print('Exit completed')
        break
    elif choice==1:
        print("!!!!!!!!!!!!!!!!!!!!!!!!!Please Scroll down in the output cell to continue the loop or to enter your next choice!!!!!!!!!!!!!!!!!! ")
        print("Names of all director in top 500 movies")
        print(director_details.keys())
    

Please choose from the below Options 
	1.List out of all directors
	2.Cosine similarity between  directors 
	3.Exit
1
!!!!!!!!!!!!!!!!!!!!!Please Scroll down in the output cell to continue the loop or to enter your next choice!!!!!!!!!!!!!!!!!! 
Names of all director in top 500 movies
dict_keys(['Orson Welles', 'Francis Ford Coppola', 'Alfred Hitchcock', 'Michael Curtiz', 'Richard Linklater', 'Krzysztof Kieslowski', 'Gene Kelly', 'Stanley Donen', 'Charles Chaplin', 'Barry Jenkins', 'D.W. Griffith', 'Ben Sharpsteen', 'Bill Roberts', 'Hamilton Luske', 'Jack Kinney', 'Norman Ferguson', 'T. Hee', 'Wilfred Jackson', 'John Huston', 'Guillermo del Toro', 'Billy Wilder', 'Steve James', 'Akira Kurosawa', 'Joseph L. Mankiewicz', 'François Truffaut', 'Sam Peckinpah', 'Jim Sheridan', 'Carol Reed', 'Stanley Kubrick', 'Jasmila Zbanic', 'George Cukor', 'Sam Wood', 'Victor Fleming', 'Cristian Mungiu', 'Sergei M. Eisenstein', 'Elia Kazan', 'George Lucas', 'John Elliotte', 'Samuel Armstrong', 'Alfonso C

# Task 3: Find similarity between 5 actors/actresses
 
    1. I selected John Ratzenberger, Wallace Shawn, Tom Hanks, Ray Collins and Joseph Cotten to find similarities among them.
    2. For each selected cast member, we find the co-cast who worked with them and how many times they worked together, and store the counts in small_cast_details.
    3. We then compute the cosine similarity between each pair and print the results.
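
Step 3 compares every unordered pair of the five selected names. One way to enumerate those pairs, shown here as a sketch rather than the notebook's own approach, is `itertools.combinations` from the standard library.

```python
from itertools import combinations

cast_list = ["John Ratzenberger", "Wallace Shawn", "Tom Hanks",
             "Ray Collins", "Joseph Cotten"]

# Every unordered pair appears exactly once, in list order.
pairs = list(combinations(cast_list, 2))
print(len(pairs))  # 10
print(pairs[0])    # ('John Ratzenberger', 'Wallace Shawn')
```

Five names give 10 pairs, which matches the number of similarity values printed for this task.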
    

In [43]:
#Loading the selected cast into a list.
cast_list=['John Ratzenberger','Wallace Shawn','Tom Hanks','Ray Collins','Joseph Cotten']
    
#Store the selected cast details into small_cast_details
small_cast_details={}
for name in cast_list:
    cast_data=small_cast_details.get(name,{})
    for k,v in movie_details.items():
        cast = list(set(v.get('Principal Cast',[])+(v.get('Cast',[]))))
        if name in cast:
            cast.remove(name)
            for cast_name in cast:
                cast_data[cast_name] = cast_data.get(cast_name,0)+1
            small_cast_details[name]=cast_data
            
#Find Similarity between selected cast 
sub_list=list(cast_list)
for i in cast_list:
    sub_list.remove(i)
    for j in sub_list:
        print("The cosine similarity between {} and {} is {}\n".format(i,j,cosine_similarity(small_cast_details[i],small_cast_details[j])))



The cosine similarity between John Ratzenberger and Wallace Shawn is 0.55144

The cosine similarity between John Ratzenberger and Tom Hanks is 0.59779

The cosine similarity between John Ratzenberger and Ray Collins is 0

The cosine similarity between John Ratzenberger and Joseph Cotten is 0

The cosine similarity between Wallace Shawn and Tom Hanks is 0.68945

The cosine similarity between Wallace Shawn and Ray Collins is 0

The cosine similarity between Wallace Shawn and Joseph Cotten is 0

The cosine similarity between Tom Hanks and Ray Collins is 0

The cosine similarity between Tom Hanks and Joseph Cotten is 0

The cosine similarity between Ray Collins and Joseph Cotten is 0.55379



In [40]:
print(cast_list,sub_list)
    

['John Ratzenberger', 'Wallace Shawn', 'Tom Hanks', 'Ray Collins', 'Joseph Cotten'] []
