# __INSY 5336: PYTHON PROGRAMMING__
### FINAL PROJECT
#### NAME: NISARG PAWAR
#### STUDENT ID: 1001720812
------

#### **Objectives:**
 - Write a Python program that fetches movie reviews for the top 50 movies from 3 major websites: IMDB, Metacritic, and Rotten Tomatoes.
 - Store the data in comma-separated (CSV) files and in SQLite databases.
 - Compare the review style across the 3 websites by calculating the cosine similarity score between the keywords used in the reviews of these 3 websites.
 - Compare the top 50 movies that are considered top rated across the 3 websites by calculating the cosine similarity score between the genres of movies.

#### **Solution:**
 - Step 1: Web scraping. Download the movie title, genres and reviews from the 3 given websites.
     - IMDB: Top rated movies
     - Rotten Tomatoes: BEST MOVIES OF ALL TIME
     - Metacritic: Movies of All Time
 - Step 2: Store the downloaded data in CSV files. Use these CSV files to load the same data into SQLite databases.
 - Step 3: Pre-process and clean data and create vocabulary:
      - Remove special characters and extra spaces
      - Lowercase all text
      - Create a wordlist and a vocabulary from the movie reviews
      - Remove the stop words

 - Step 4: Use the vocabulary to find the 50 most used words on each site, then merge the three lists into a unique list of words, word_attribute_list.
 - Step 5: Use word_attribute_list to create a review word-count vector for each website and calculate the cosine similarity between each pair of websites. (Task 1 complete!)
 - Step 6: Create a dictionary of genre counts from the genre details of the movies on each website.
 - Step 7: Create a list of unique genres and use this list to create a genre vector for each website.
 - Step 8: Calculate the cosine similarity between each pair of websites using the genre vectors. (Task 2 complete!)
---

In [1]:
import urllib.request, urllib.parse, urllib.error
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import ssl
import requests
import re
import nltk
import csv
import sqlite3
import operator
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
nltk.download('stopwords')

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

[nltk_data] Downloading package stopwords to /home/nisarg/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


----
### Step 1: Web Scraping, Download movie title, genre and reviews from the 3 given websites:

#### **IMDB:**
 - page link: https://www.imdb.com/search/title/?groups=top_250&sort=user_rating
 - site: https://www.imdb.com/
 - We need two things: Genres and Reviews
 - Genres: they are available on the list page under the name of each movie and can be scraped by finding the \<span> tag with class = 'genre'.
 - Reviews: Initially the website shows only 25 reviews. To load the next 25 we must request the page's AJAX "load more" URL ourselves. That is why the function get_reviews has two parts: the first 25 and the second 25. Both are then merged and a list of up to 50 reviews is returned.
 - All the above fetched data is stored in a dictionary called IMDB_movie_list in the form: {movie_name: {link: '', genres: [], reviews: []}}
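
The second request in get_reviews depends on how the "load more" URL is assembled. Below is a minimal sketch of that step using only the standard library. The helper name `build_load_more_url` is hypothetical, and the path and key values are made-up placeholders rather than real IMDB data:

```python
from urllib.parse import urljoin

def build_load_more_url(review_page_url, ajax_path, pagination_key):
    """Join the relative AJAX path onto the review page URL and append the key."""
    # ajax_path and pagination_key stand in for the 'data-ajaxurl' and
    # 'data-key' attributes read from the page's .load-more-data element.
    return urljoin(review_page_url, ajax_path + '&paginationKey=' + pagination_key)

# Placeholder values for illustration only (assumed shape, not scraped data).
example_url = build_load_more_url(
    'https://www.imdb.com/title/tt0000000/reviews?spoiler=hide',
    '/title/tt0000000/reviews/_ajax?spoiler=hide',
    'abc123',
)
print(example_url)
```

Because the AJAX path starts with a slash, urljoin keeps only the scheme and host of the review page URL and replaces the rest with the new path and query string.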

In [2]:
imdb_url = 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating'
imdb_site = 'https://www.imdb.com'
imdb_page = requests.get(imdb_url, headers = headers)
imdb_soup = BeautifulSoup(imdb_page.content, 'lxml') 

In [3]:
movies_list = imdb_soup.find('div', class_ = 'lister-list').findAll('div',class_ = 'lister-item-content')

In [4]:
def get_reviews(link):

    def extract_reviews(soup):
        # collect the review text from every review block on one page
        texts = []
        items = soup.find('div', class_ = 'lister-list').findAll('div', class_ = 'lister-item-content')
        for rev in items:
            # some reviews carry an extra 'clickable' class; fall back to the plain class otherwise
            body = rev.find('div', class_ = 'text show-more__control clickable')
            if body is None:
                body = rev.find('div', class_ = 'text show-more__control')
            texts.append(body.text)
        return texts

    # first 25 reviews: the regular reviews page
    rev_page_1 = requests.get(link, headers = headers)
    rev_soup_1 = BeautifulSoup(rev_page_1.content, 'lxml')
    reviews = extract_reviews(rev_soup_1)

    # second 25 reviews: the "load more" AJAX URL plus its pagination key
    load_more = rev_soup_1.select(".load-more-data")[0]
    rev_page_2_url = urljoin(link, load_more['data-ajaxurl'] + '&paginationKey=' + load_more['data-key'])
    rev_page_2 = requests.get(rev_page_2_url, headers = headers)
    rev_soup_2 = BeautifulSoup(rev_page_2.content, 'lxml')
    reviews.extend(extract_reviews(rev_soup_2))

    return reviews

In [5]:
IMDB_movie_list = dict()

for movie in movies_list:
    
    # movie title
    title_header = movie.find('h3', class_ = 'lister-item-header')
    title = title_header.find('a').text
    link = imdb_site + re.search(r'/.*/',title_header.find('a').get('href',None)).group(0)

    # movie genre
    genres = movie.find('span',class_ = 'genre').text.lower().strip()
    genres = re.sub(r'\s+','',genres).split(',')
    
    review_link = link + 'reviews?spoiler=hide'
    
    reviews = get_reviews(review_link)
    IMDB_movie_list[title] = {'link':link,'genres':genres,'reviews':reviews}

In [6]:
print(IMDB_movie_list.keys())

dict_keys(['The Shawshank Redemption', 'The Godfather', 'The Dark Knight', 'The Godfather: Part II', '12 Angry Men', 'The Lord of the Rings: The Return of the King', 'Pulp Fiction', "Schindler's List", 'Inception', 'Fight Club', 'The Lord of the Rings: The Fellowship of the Ring', 'Forrest Gump', 'The Good, the Bad and the Ugly', 'The Lord of the Rings: The Two Towers', 'The Matrix', 'Goodfellas', 'Star Wars: Episode V - The Empire Strikes Back', "One Flew Over the Cuckoo's Nest", 'Hamilton', 'Parasite', 'Interstellar', 'City of God', 'Spirited Away', 'Saving Private Ryan', 'The Green Mile', 'Life Is Beautiful', 'Se7en', 'The Silence of the Lambs', 'Star Wars: Episode IV - A New Hope', 'Harakiri', 'Seven Samurai', "It's a Wonderful Life", 'Joker', 'Vikram Vedha', 'Whiplash', 'The Intouchables', 'The Prestige', 'The Departed', 'The Pianist', 'Gladiator', 'American History X', 'The Usual Suspects', 'Léon: The Professional', 'The Lion King', 'Terminator 2: Judgment Day', 'Cinema Paradiso'

In [7]:
IMDB_movie_list['Star Wars: Episode IV - A New Hope']['genres']

['action', 'adventure', 'fantasy']

#### **Rotten Tomatoes:**
 - page link: https://www.rottentomatoes.com/top/bestofrt/
 - site: https://www.rottentomatoes.com/
 - The site only has a list of movies ordered in their ranking.
 - We must visit the page of each movie for further details.
 - We need two things: Genres and Reviews
 - Genres: they are available on the movie page in a 'div' with class = 'meta-value genre'
 - Reviews: reviews for each movie can be found by appending /reviews to the movie page URL. We also add the 'type=user' parameter to the URL to get user reviews instead of critic reviews.
 - All the above fetched data is stored in a dictionary called RT_movie_list in the form: {movie_name: {link: '', genres: [], reviews: []}}
 - Unfortunately, for this site I was unable to fetch 50 reviews per movie, since I could not simulate the 'next' event that loads the rest of the reviews. Only the reviews on the first page are collected.

In [8]:
rt_url = 'https://www.rottentomatoes.com/top/bestofrt/'
rt_site = 'https://www.rottentomatoes.com'
rt_page = requests.get(rt_url, headers = headers)
rt_soup = BeautifulSoup(rt_page.content, 'lxml') 

In [9]:
rt_movies = rt_soup.find('table', class_ = 'table').findAll('tr')[1:51] # skip the first row since it is the header

In [10]:
def get_rt_reviews(link):
    reviews = []
    rev_page = requests.get(link, headers = headers)
    rev_soup = BeautifulSoup(rev_page.content, 'lxml')
    
    revs = rev_soup.find_all('li', class_ = 'audience-reviews__item')
    for rev in revs:
        reviews.append(rev.find('p').text)
    return reviews

In [11]:
RT_movie_list = dict()

for movie in rt_movies:
    
    # movie title
    title = movie.find('a').text.strip()
    link = rt_site + movie.find('a').get('href',None)
    
    # movie genre
    m_page = requests.get(link, headers = headers)
    m_soup = BeautifulSoup(m_page.content, 'lxml') 
    genres = m_soup.find('div', class_ = 'meta-value genre').text.lower().strip()
    genres = re.sub(r'\s+','',genres).split(',')
    
    review_link = link + '/reviews?type=user'
    reviews = get_rt_reviews(review_link)
    
    RT_movie_list[title] = {'link':link, 'genres': genres, 'reviews': reviews}

In [12]:
print(RT_movie_list.keys())

dict_keys(['Black Panther (2018)', 'Avengers: Endgame (2019)', 'Us (2019)', 'Toy Story 4 (2019)', 'Lady Bird (2017)', 'Citizen Kane (1941)', 'Mission: Impossible - Fallout (2018)', 'The Wizard of Oz (1939)', 'The Irishman (2019)', 'BlacKkKlansman (2018)', 'Get Out (2017)', 'Casablanca (1942)', 'Mad Max: Fury Road (2015)', 'Spider-Man: Into the Spider-Verse (2018)', 'Moonlight (2016)', 'Wonder Woman (2017)', 'A Star Is Born (2018)', 'Roma (2018)', 'Dunkirk (2017)', 'Inside Out (2015)', 'The Farewell (2019)', 'Modern Times (1936)', 'A Quiet Place (2018)', 'It Happened One Night (1934)', 'Eighth Grade (2018)', 'A Night at the Opera (1935)', 'Booksmart (2019)', 'The Third Man (1949)', 'Coco (2017)', 'The Shape of Water (2017)', 'Thor: Ragnarok (2017)', 'Selma (2014)', 'Spotlight (2015)', 'The Godfather (1972)', 'La Grande illusion (Grand Illusion) (1938)', 'Snow White and the Seven Dwarfs (1937)', 'Arrival (2016)', "Singin' in the Rain (1952)", 'Logan (2017)', 'The Cabinet of Dr. Caligari 

In [13]:
RT_movie_list['Spider-Man: Into the Spider-Verse (2018)']['reviews'][1]

'I\'ve watched this film around when it came out but decided to rewatch it again recently just because of the new Spider-Man: Miles Morales game. The premise of the movie is simply "Bitten by a radioactive spider in the subway, Brooklyn teenager Miles Morales suddenly develops mysterious powers that transform him into the one and only Spider-Man. When he meets Peter Parker, he soon realizes that many others share his special, high-flying talents. Miles must now use his newfound skills to battle the evil Kingpin, a hulking madman who can open portals to other universes and pull different versions of Spider-Man into our world". I absolutely adored it when it came out but do I still feel that way after rewatching it?\n\nFirstly, the animation style and 3D models are absolutely amazing! While I do love the Pixar style of animation, this comic book style of animation was such a breath of fresh air in the animated film department. I also appreciate a lot of the aesthetic choices such as havi

#### **Metacritic:**
 - Page: https://www.metacritic.com/browse/movies/score/metascore/all/filtered?sort=desc
 - site: https://www.metacritic.com/
 - The site only has a list of movies ordered in their ranking.
 - We must visit the page of each movie for further details.
 - We need two things: Genres and Reviews
 - Genres: they are available on the movie page in a 'div' with class = 'genres'
 - Reviews: reviews for each movie can be found through an anchor tag with class = 'see_all boxed oswald' on the movie page. There are two such anchor tags; we must use the second one since it is the tag that links to the user reviews.
 - All the above fetched data is stored in a dictionary called META_movie_list in the form: {movie_name: {link: '', genres: [], reviews: []}}
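
All three scrapers clean the genre text the same way: drop any label, lowercase, remove every whitespace character, and split on commas. The sketch below isolates that step on made-up input strings; `parse_genres` is a hypothetical helper name and the sample strings are assumptions about what the pages roughly look like, not captured HTML:

```python
import re

def parse_genres(raw_text):
    """Turn the raw genre text of a movie page into a list of lowercase genres."""
    # Remove the "Genre(s):" label that Metacritic puts in front of the list.
    text = re.sub(r'Genre\(s\):', '', raw_text).lower()
    # Remove ALL whitespace, including the spaces inside multi-word genres.
    text = re.sub(r'\s+', '', text)
    return text.split(',')

parsed_meta = parse_genres('Genre(s): Drama,\n        Crime, Film-Noir')
parsed_multiword = parse_genres('Mystery and thriller, Horror')
print(parsed_meta)
print(parsed_multiword)
```

Note the side effect visible in the second call: because every space is removed, a multi-word genre collapses into a single token such as 'mysteryandthriller'.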

In [14]:
meta_url = 'https://www.metacritic.com/browse/movies/score/metascore/all/filtered?sort=desc'
meta_site = 'https://www.metacritic.com'
meta_page = requests.get(meta_url, headers = headers)
meta_soup = BeautifulSoup(meta_page.content, 'lxml') 

In [15]:
meta_movies = meta_soup.find('div', class_ = 'title_bump').find_all('td', class_='clamp-summary-wrap')[:50]

In [16]:
def get_meta_reviews(link):
    rev_page = requests.get(link, headers = headers)
    rev_soup = BeautifulSoup(rev_page.content, 'lxml')
    container = rev_soup.find('div', class_ = 'user_reviews')
    if container is None:
        # the page has no user reviews section
        return ['No Reviews for this movie!']
    # keep at most 50 reviews
    revs = container.find_all('div', class_ = 'review_body')[:50]
    return [rev.text.strip() for rev in revs]

In [17]:
META_movie_list = dict()

for movie in meta_movies:
    
    # movie title
    title = movie.find('a', class_ = 'title').text.strip()
    link = meta_site + movie.find('a', class_ = 'title').get('href',None)
    
    # movie genre
    m_page = requests.get(link, headers = headers)
    m_soup = BeautifulSoup(m_page.content, 'lxml') 
    genres = m_soup.find('div', class_ = 'genres').text
    genres = re.sub(r'Genre\(s\):','',genres).lower() # to remove the genres title from the start of the string
    genres = re.sub(r'\s+','',genres).split(',')
    
    # reviews
    review_link = meta_site + m_soup.find_all('a', class_ = 'see_all boxed oswald')[1].get('href',None) # 0 is critic review and 1 is user reviews
    reviews = get_meta_reviews(review_link)
    
    META_movie_list[title] = {'link':link, 'genres': genres, 'reviews': reviews}

In [18]:
print(META_movie_list.keys())

dict_keys(['Citizen Kane', 'The Godfather', 'Rear Window', 'Casablanca', 'Boyhood', 'Three Colors: Red', 'Vertigo', 'Notorious', "Singin' in the Rain", 'City Lights', 'Moonlight', 'Intolerance', 'Pinocchio', 'Touch of Evil', 'The Treasure of the Sierra Madre', "Pan's Labyrinth", 'Some Like It Hot', 'North by Northwest', 'Nomadland', 'Hoop Dreams', 'Rashomon', 'All About Eve', 'Jules and Jim', 'The Wild Bunch', 'My Left Foot', 'The Third Man', 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb', 'Gone with the Wind', '4 Months, 3 Weeks and 2 Days', 'Psycho', 'Battleship Potemkin', 'A Streetcar Named Desire', 'American Graffiti', 'Dumbo', 'Roma', 'Ran', 'The Shop Around the Corner', '12 Angry Men', 'Manchester by the Sea', "Rosemary's Baby", 'The Maltese Falcon', '12 Years a Slave', 'Killer of Sheep', 'Rocks', 'Nashville', 'Ratatouille', 'Parasite', "Don't Look Now", 'The Grapes of Wrath', 'Children of Paradise (1945)'])


In [19]:
META_movie_list['The Godfather']['reviews'][4]

"One of the best films I've ever seen and it's masterpiece story telling from beginning to end and by the end you realize this is what real filmmaking is all about the films direction property built the film does have a rather slow pace butOne of the best films I've ever seen and it's masterpiece story telling from beginning to end and by the end you realize this is what real filmmaking is all about the films direction property built the film does have a rather slow pace but overall a excellent movie… Expand"

#### **All Data Downloaded! That's it for Web Scraping!**

----
### Step 2: Store the downloaded data into a CSV file. Use this CSV file to load the same data in an sqlite database:

#### **Storing in csv:**

Creating two types of files: movie_genres.csv, which stores the genre details of the movies on each website, <br>
and m_movie_name.csv, which has two columns: id and review. We create one such file for each movie.

Storage Structure:
 - nisarg_reviews/
    -  imdb/
        - movie_genres.csv
        - m_movie_name.csv x50
    -  rt/
        - movie_genres.csv
        - m_movie_name.csv x50
    -  meta/
        - movie_genres.csv
        - m_movie_name.csv x50
        
Some movie titles in our list start with digits or contain special characters, which can cause problems in file names and, later, in SQLite table names. To remedy this, the movie name on file takes the format: m_movie_name.csv

We remove all special characters, replace spaces with underscores and add 'm_' to the start of the name.
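
As a small illustration of this naming rule, the sketch below applies the same two substitutions used in the save functions to a few titles from the lists above. `safe_name` is a hypothetical helper name introduced only for this example:

```python
import re

def safe_name(title):
    """Map a movie title to the m_movie_name form used for files and tables."""
    # Drop everything that is not a letter, digit, underscore or whitespace.
    name = re.sub(r'[^\w\s]+', '', title)
    # Collapse each run of whitespace into a single underscore.
    name = re.sub(r'\s+', '_', name)
    return 'm_' + name

titles = ['12 Angry Men', "Schindler's List",
          'Star Wars: Episode V - The Empire Strikes Back']
safe_names = {title: safe_name(title) for title in titles}
for title, name in safe_names.items():
    print(title, '->', name)
```

One caveat of this scheme: two titles that differ only in punctuation would map to the same name, and the later one would overwrite the earlier one.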

In [20]:
def save_to_csv(path,movie_list):
    
    genres = dict()
    gen_headers = ['movie name', 'genres']
    rev_headers = ['id','review']
    
    # storing movie reviews: one file per movie (the directory `path` must already exist)
    for movie in movie_list:
        genres[movie] = movie_list[movie]['genres']
        reviews = movie_list[movie]['reviews']
        name = re.sub(r'[^\w\s]+','',movie)
        name = re.sub(r'\s+','_',name)
        # newline='' lets the csv module control line endings; utf-8 keeps non-ASCII review text intact
        with open(path + '/m_' + name + '.csv','w',newline='',encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(rev_headers)
            writer.writerows(enumerate(reviews))
        
    # storing genres: one row per movie, genres joined into a comma-separated string
    with open(path + '/movie_genres.csv','w',newline='',encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(gen_headers)
        for movie in genres:
            writer.writerow([movie, ",".join(genres[movie])])

In [21]:
save_to_csv('nisarg_reviews/imdb',IMDB_movie_list)
save_to_csv('nisarg_reviews/rt',RT_movie_list)
save_to_csv('nisarg_reviews/meta',META_movie_list)

#### **Save in SQLite DB:**
- The databases follow a similar structure:
- SQL/
    - imdb.sqlite
        - movie_genre
        - m_movie_name x50
    - rt.sqlite
        - movie_genre
        - m_movie_name x50
    - meta.sqlite
        - movie_genre
        - m_movie_name x50
        
The tables follow the same naming convention as the CSV files.
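
The cell below fills the tables directly from the in-memory dictionaries. If the tables should instead be loaded from the saved CSV files, something along these lines could work. This is only a sketch under stated assumptions: `load_reviews_csv` is a hypothetical helper, the demo file and its two rows are invented, and an in-memory database is used so nothing is written to SQL/:

```python
import csv
import os
import sqlite3
import tempfile

def load_reviews_csv(conn, csv_path, table):
    """Copy an (id, review) CSV file into a table with the same two columns."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    with open(csv_path, newline='', encoding='utf-8') as f:
        rows = [(int(r['id']), r['review']) for r in csv.DictReader(f)]
    cur = conn.cursor()
    cur.execute(f'DROP TABLE IF EXISTS {quoted}')
    cur.execute(f'CREATE TABLE {quoted} (id INTEGER, review TEXT)')
    cur.executemany(f'INSERT INTO {quoted} (id, review) VALUES (?, ?)', rows)
    conn.commit()
    return len(rows)

# Demo with an invented two-row file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'm_Example_Movie.csv')
    with open(demo_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'review'])
        writer.writerows([[0, 'Great film.'], [1, 'A classic,\nwith a line break.']])
    demo_conn = sqlite3.connect(':memory:')
    loaded = load_reviews_csv(demo_conn, demo_path, 'm_Example_Movie')
    stored = demo_conn.execute('SELECT review FROM "m_Example_Movie" ORDER BY id').fetchall()
    demo_conn.close()

print(loaded, stored)
```

Opening the file with newline='' on both the write and the read side matters here, because reviews often contain commas and line breaks that the csv module must quote and restore itself.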

In [22]:
def save_in_db(cur, movie_list):
    
    genres = dict()
    
    # storing movie reviews
    movies = list(movie_list)
    for movie in movies:
        
        genres[movie] = movie_list[movie]['genres']
        reviews = movie_list[movie]['reviews']
        movie = re.sub(r'[^\w\s]+','',movie)
        movie = 'm_'+re.sub(r'\s+','_',movie)
        
        cur.execute("DROP table IF EXISTS "+movie+";")
        cur.execute("CREATE TABLE "+movie+" (id INTEGER, review TEXT);")
        
        data = [(i,reviews[i]) for i in range(len(reviews))]
        cur.executemany("INSERT INTO "+movie+" (id, review) VALUES (?, ?);", data)
    
    # storing genres
    data = [(movie, ",".join(genres[movie])) for movie in movies]
    cur.execute("DROP table IF EXISTS movie_genre;")
    cur.execute("CREATE TABLE movie_genre (movie_name TEXT, genres TEXT);")
    cur.executemany("INSERT INTO movie_genre (movie_name, genres) VALUES (?, ?);", data)

In [23]:
imdb_conn = sqlite3.connect("SQL/imdb.sqlite")
rt_conn = sqlite3.connect("SQL/rt.sqlite")
meta_conn = sqlite3.connect("SQL/meta.sqlite")

save_in_db(imdb_conn.cursor(),IMDB_movie_list)
save_in_db(rt_conn.cursor(),RT_movie_list)
save_in_db(meta_conn.cursor(),META_movie_list)

imdb_conn.commit()
imdb_conn.close()
rt_conn.commit()
rt_conn.close()
meta_conn.commit()
meta_conn.close()

----
### Step 3: Pre-process and clean data and create vocabulary:
 - Remove special characters and extra spaces
 - Lowercase all text
 - Create a wordlist and a vocabulary from the movie reviews
 - Remove the stop words
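
To make these four steps concrete, here is a minimal sketch run on two invented reviews. It uses a tiny hard-coded stop word set as a stand-in for the NLTK list and collections.Counter instead of np.unique, so the names `STOP_WORDS`, `tokenize` and `build_vocabulary` are assumptions for illustration, not the functions used in this notebook:

```python
import re
from collections import Counter

# Tiny stand-in for nltk's English stop word list (assumption for this demo).
STOP_WORDS = {'the', 'a', 'is', 'and', 'of', 'it', 'this'}

def tokenize(review):
    """Remove special characters, lowercase, and split on whitespace."""
    cleaned = re.sub(r'[^\w\s]+', '', review)
    return cleaned.lower().split()

def build_vocabulary(reviews, stop_words=STOP_WORDS):
    """Count every non-stop word across all reviews: {word: count}."""
    counts = Counter()
    for review in reviews:
        counts.update(w for w in tokenize(review) if w not in stop_words)
    return counts

sample_reviews = ["The BEST film of the decade!",
                  "Don't miss it: the film is great."]
sample_vocab = build_vocabulary(sample_reviews)
print(sample_vocab)
```

Removing punctuation before the stop word check turns "Don't" into "dont", which no longer matches a contraction written with an apostrophe. That is a likely reason why 'dont' and 'didnt' show up among the most frequent words further down.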

In [24]:
def create_wordlist(movie_details):
    
    wordlist = []
    movies = list(movie_details)
    
    for movie in movies:
        reviews = movie_details[movie]['reviews']
        for review in reviews:
            rev = re.sub(r'[^\w\s]+','',review)
            words = rev.lower().strip().split()
            wordlist.extend(words)
    return wordlist

# vocabulary is a dictionary with word and its frequency of occurrence - {word:count}
def create_vocabulary(wordlist):
    words, freq = np.unique(wordlist, return_counts=True)
    vocabulary = dict(zip(words, freq))
    stop_words = stopwords.words('english')
    # remove stop words
    for word in words:
        if word in stop_words:
            del vocabulary[word]
    
    return vocabulary

In [25]:
imdb_wordlist = create_wordlist(IMDB_movie_list)
imdb_vocab = create_vocabulary(imdb_wordlist)

print('total no of words: ', len(imdb_wordlist))
print('unique no of words with no stop words: ', len(imdb_vocab))

total no of words:  582390
unique no of words with no stop words:  29294


In [26]:
rt_wordlist = create_wordlist(RT_movie_list)
rt_vocab = create_vocabulary(rt_wordlist)

print('total no of words: ', len(rt_wordlist))
print('unique no of words with no stop words: ', len(rt_vocab))

total no of words:  33483
unique no of words with no stop words:  5631


In [27]:
meta_wordlist = create_wordlist(META_movie_list)
meta_vocab = create_vocabulary(meta_wordlist)

print('total no of words: ', len(meta_wordlist))
print('unique no of words with no stop words: ', len(meta_vocab))

total no of words:  113290
unique no of words with no stop words:  11334


#### **Vocabularies created!!**

----
### Step 4: Use the vocabulary to find the 50 most used words on each site, then merge the lists into a unique list of words, word_attribute_list:

In [28]:
def sort_dict(vocab):
    sorted_dict = dict(sorted(vocab.items(), key=operator.itemgetter(1),reverse=True))
    return sorted_dict

In [29]:
imdb_sorted = sort_dict(imdb_vocab)
imdb_top_50_words = list(imdb_sorted)[:50]
print(imdb_top_50_words)

['film', 'movie', 'one', 'time', 'like', 'story', 'best', 'great', 'good', 'see', 'first', 'even', 'movies', 'films', 'characters', 'really', 'would', 'ever', 'well', 'also', 'many', 'much', 'people', 'life', 'way', 'made', 'seen', 'character', 'watch', 'never', 'think', 'dont', 'get', 'make', 'two', 'could', 'every', 'scenes', 'say', 'acting', 'love', 'action', 'still', 'us', 'scene', 'world', 'makes', 'know', 'back', 'better']


In [30]:
rt_sorted = sort_dict(rt_vocab)
rt_top_50_words = list(rt_sorted)[:50]
print(rt_top_50_words)

['movie', 'film', 'one', 'story', 'good', 'like', 'best', 'time', 'great', 'well', 'movies', 'de', 'que', 'really', 'even', 'watch', 'would', 'characters', 'films', 'love', 'also', 'think', 'get', 'la', 'made', 'amazing', 'see', 'still', 'acting', 'ever', 'much', 'en', 'plot', 'seen', 'say', 'classic', 'funny', 'way', 'dont', 'never', 'yet', 'el', 'end', 'life', 'perfect', 'una', 'many', 'scene', 'could', 'didnt']


In [31]:
meta_sorted = sort_dict(meta_vocab)
meta_top_50_words = list(meta_sorted)[:50]
print(meta_top_50_words)

['movie', 'film', 'one', 'expand', 'best', 'like', 'story', 'time', 'great', 'ever', 'good', 'really', 'films', 'movies', 'see', 'made', 'life', 'well', 'character', 'acting', 'years', 'even', 'watch', 'also', 'much', 'de', 'way', 'characters', 'people', 'dont', 'seen', 'would', 'many', 'think', 'every', 'get', 'masterpiece', 'la', 'first', 'never', 'plot', 'say', 'scenes', 'love', 'still', 'scene', 'end', 'greatest', 'make', 'nothing']


In [32]:
word_attribute_list = set(imdb_top_50_words + rt_top_50_words + meta_top_50_words)
print("word attribute list: ",word_attribute_list)

word attribute list:  {'still', 'ever', 'every', 'expand', 'action', 'plot', 'characters', 'seen', 'que', 'character', 'first', 'yet', 'know', 'much', 'scene', 'watch', 'never', 'better', 'classic', 'greatest', 'life', 'story', 'us', 'way', 'time', 'best', 'masterpiece', 'end', 'dont', 'one', 'think', 'like', 'movie', 'movies', 'really', 'two', 'de', 'amazing', 'el', 'years', 'film', 'would', 'perfect', 'makes', 'love', 'films', 'good', 'una', 'made', 'even', 'say', 'scenes', 'get', 'see', 'en', 'acting', 'many', 'could', 'world', 'nothing', 'also', 'people', 'funny', 'didnt', 'make', 'la', 'well', 'great', 'back'}
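
Since word_attribute_list is a set, its iteration order is stable within one session but is not guaranteed to be the same from run to run, so the printed vectors below may come out in a different order each time the notebook is executed. A minimal sketch of a reproducible variant, using invented toy vocabularies and hypothetical helper names (`top_words`, `build_attribute_list`):

```python
from collections import Counter

def top_words(vocab, n):
    """Return the n most frequent words of a {word: count} vocabulary."""
    return [word for word, _ in Counter(vocab).most_common(n)]

def build_attribute_list(vocabs, n=50):
    """Merge the top-n words of every vocabulary into one sorted, unique list."""
    merged = set()
    for vocab in vocabs:
        merged.update(top_words(vocab, n))
    # Sorting fixes the order, so every vector uses the same column layout.
    return sorted(merged)

# Toy vocabularies for illustration only.
site_a = {'film': 9, 'movie': 7, 'plot': 2}
site_b = {'movie': 8, 'classic': 5, 'film': 1}
attributes = build_attribute_list([site_a, site_b], n=2)
print(attributes)
```

The cosine similarity itself does not depend on the column order, as long as all vectors are built from the same ordering.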


----
### Step 5: Use word_attribute_list to create a review vector for each website, and calculate the cosine similarity between each pair of websites:
#### **Word attribute vector for each website:**

In [33]:
imdb_vector = [imdb_vocab.get(word,0) for word in word_attribute_list]
rt_vector = [rt_vocab.get(word,0) for word in word_attribute_list]
meta_vector = [meta_vocab.get(word,0) for word in word_attribute_list]

print("IMDB vector: ", imdb_vector)
print("---------------------------")
print("RT vector: ", rt_vector)
print("---------------------------")
print("META vector: ", meta_vector)

IMDB vector:  [561, 1016, 648, 8, 599, 487, 1043, 870, 0, 825, 1154, 356, 519, 966, 558, 816, 774, 512, 278, 316, 925, 1524, 560, 901, 1682, 1495, 369, 481, 728, 3031, 730, 1660, 4877, 1122, 1039, 708, 142, 414, 2, 498, 5013, 1038, 434, 521, 623, 1119, 1283, 0, 892, 1138, 627, 629, 714, 1209, 7, 626, 974, 673, 538, 331, 977, 927, 135, 337, 712, 43, 995, 1366, 512]
---------------------------
RT vector:  [50, 46, 31, 0, 27, 44, 67, 44, 78, 33, 31, 40, 33, 46, 37, 73, 40, 24, 41, 23, 38, 147, 29, 41, 103, 108, 17, 38, 40, 167, 54, 113, 335, 83, 76, 32, 82, 51, 39, 21, 281, 71, 38, 26, 55, 61, 118, 38, 52, 75, 43, 23, 53, 51, 44, 46, 37, 36, 22, 18, 54, 29, 41, 35, 34, 52, 84, 102, 22]
---------------------------
META vector:  [121, 302, 138, 625, 18, 124, 167, 156, 101, 190, 126, 48, 90, 175, 118, 178, 125, 94, 82, 115, 212, 342, 94, 169, 316, 404, 133, 117, 157, 626, 149, 349, 1279, 223, 273, 68, 174, 104, 31, 189, 1187, 155, 94, 104, 122, 223, 277, 40, 213, 186, 123, 123, 133, 223, 50,

In [34]:
def cosine_sim(p,q):
    
    # cosine similarity = (p . q) / (||p|| * ||q||)
    p = np.asarray(p)
    q = np.asarray(q)
    dot = np.dot(p,q)
    p_norm = np.linalg.norm(p)
    q_norm = np.linalg.norm(q)
    cos = dot / (p_norm * q_norm)

    return cos.item()
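
For reference, cosine_sim computes the standard cosine similarity between two count vectors $p$ and $q$ of equal length $n$:

$$
\cos(p, q) = \frac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert}
= \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}}
$$

As a small worked example with made-up vectors $p = (2, 1, 0)$ and $q = (1, 1, 1)$:

$$
p \cdot q = 2 + 1 + 0 = 3, \qquad
\lVert p \rVert = \sqrt{5}, \qquad
\lVert q \rVert = \sqrt{3}, \qquad
\cos(p, q) = \frac{3}{\sqrt{15}} \approx 0.775
$$

Because all entries here are non-negative counts, the result always lies between 0 and 1, and it depends only on the relative proportions of the counts, not on how many reviews each site contributed.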

#### **Calculate Cosine Similarity:**
**IMDB and Rotten Tomatoes:**

In [35]:
cosine_sim(imdb_vector,rt_vector)

0.9551444607366143

**IMDB and Metacritic:**

In [36]:
cosine_sim(imdb_vector,meta_vector)

0.9462254599007427

**Metacritic and Rotten Tomatoes:**

In [37]:
cosine_sim(meta_vector,rt_vector)

0.9338429843128762

#### **Task 1 Complete!**

----
### Step 6: Create a dictionary from the genre details of each movie for each website:
- This dictionary captures the genres and their frequency of occurrence in each site's list of movies.
- {genre:count}
- We can reuse the create_vocabulary function above to create the dictionary.

In [38]:
def genre_list(movie_details):
    genres = []
    for movie in movie_details:
        genres.extend(movie_details[movie]['genres'])
    return genres

In [39]:
imdb_genre_list = genre_list(IMDB_movie_list)
imdb_genres = create_vocabulary(imdb_genre_list)
print("Genres: ", imdb_genres)

Genres:  {'action': 14, 'adventure': 12, 'animation': 3, 'biography': 5, 'comedy': 4, 'crime': 15, 'drama': 39, 'family': 2, 'fantasy': 4, 'history': 2, 'horror': 1, 'music': 2, 'mystery': 5, 'romance': 2, 'sci-fi': 6, 'thriller': 6, 'war': 2, 'western': 2}


In [40]:
rt_genre_list = genre_list(RT_movie_list)
rt_genres = create_vocabulary(rt_genre_list)
print("Genres: ", rt_genres)

Genres:  {'action': 10, 'adventure': 14, 'animation': 5, 'comedy': 18, 'crime': 5, 'drama': 21, 'fantasy': 13, 'history': 2, 'horror': 4, 'kidsandfamily': 8, 'music': 2, 'musical': 2, 'mysteryandthriller': 7, 'romance': 4, 'scifi': 5, 'war': 2}


In [41]:
meta_genre_list = genre_list(META_movie_list)
meta_genres = create_vocabulary(meta_genre_list)
print("Genres: ", meta_genres)

Genres:  {'action': 3, 'adventure': 3, 'animation': 3, 'biography': 2, 'comedy': 9, 'crime': 4, 'documentary': 1, 'drama': 39, 'family': 3, 'fantasy': 3, 'film-noir': 4, 'history': 5, 'horror': 3, 'music': 1, 'musical': 3, 'mystery': 10, 'romance': 11, 'sport': 1, 'thriller': 13, 'war': 5, 'western': 2}


#### **Since the number of genres is relatively low, instead of taking the top 50 we take all genres and merge them to create a list of unique genres.**

In [42]:
genre_attr_list = set(list(imdb_genres) + list(rt_genres) + list(meta_genres))
print(genre_attr_list)

{'crime', 'war', 'action', 'romance', 'history', 'music', 'biography', 'animation', 'family', 'documentary', 'drama', 'horror', 'sci-fi', 'musical', 'mysteryandthriller', 'western', 'sport', 'fantasy', 'adventure', 'kidsandfamily', 'film-noir', 'mystery', 'scifi', 'comedy', 'thriller'}
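
The merged list shows that the sites label some genres differently: 'sci-fi' and 'scifi' appear as separate entries, as do 'family' and 'kidsandfamily', and 'mysteryandthriller' next to 'mystery' and 'thriller'. These count as unrelated attributes in the vectors below. One possible way to align them before counting is sketched here; the alias table is my own assumption about which labels are equivalent, the helper names are hypothetical, and this step is not applied to the results in this notebook:

```python
from collections import Counter

# Assumed equivalences between site-specific labels (illustrative only).
GENRE_ALIASES = {
    'scifi': ['sci-fi'],
    'kidsandfamily': ['family'],
    'mysteryandthriller': ['mystery', 'thriller'],
}

def normalize_genres(genres):
    """Replace site-specific labels with the shared ones, keeping the rest."""
    normalized = []
    for genre in genres:
        normalized.extend(GENRE_ALIASES.get(genre, [genre]))
    return normalized

def count_genres(movie_details):
    """Build a {genre: count} dictionary from a {movie: {'genres': [...]}} dict."""
    counts = Counter()
    for details in movie_details.values():
        counts.update(normalize_genres(details['genres']))
    return counts

# Invented movies for illustration only.
demo_movies = {
    'Movie A': {'genres': ['scifi', 'action']},
    'Movie B': {'genres': ['mysteryandthriller', 'sci-fi']},
}
demo_genre_counts = count_genres(demo_movies)
print(demo_genre_counts)
```

With aligned labels the genre similarities between Rotten Tomatoes and the other two sites would probably change, so the scores below should be read with this labeling difference in mind.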


#### **Based on the above list we create genre vectors:**

In [43]:
imdb_g_v = [imdb_genres.get(g,0) for g in genre_attr_list]
rt_g_v = [rt_genres.get(g,0) for g in genre_attr_list]
meta_g_v = [meta_genres.get(g,0) for g in genre_attr_list]

print("IMDB vector: ", imdb_g_v)
print("---------------------------")
print("RT vector: ", rt_g_v)
print("---------------------------")
print("META vector: ", meta_g_v)

IMDB vector:  [15, 2, 14, 2, 2, 2, 5, 3, 2, 0, 39, 1, 6, 0, 0, 2, 0, 4, 12, 0, 0, 5, 0, 4, 6]
---------------------------
RT vector:  [5, 2, 10, 4, 2, 2, 0, 5, 0, 0, 21, 4, 0, 2, 7, 0, 0, 13, 14, 8, 0, 0, 5, 18, 0]
---------------------------
META vector:  [4, 5, 3, 11, 5, 1, 2, 3, 3, 1, 39, 3, 0, 3, 0, 2, 1, 3, 3, 0, 4, 10, 0, 9, 13]


#### **Calculate Cosine Similarity:**
**IMDB and Rotten Tomatoes:**

In [44]:
cosine_sim(imdb_g_v,rt_g_v)

0.7476019681651895

**IMDB and Metacritic:**

In [45]:
cosine_sim(imdb_g_v,meta_g_v)

0.8646658729415193

**Metacritic and Rotten Tomatoes:**

In [46]:
cosine_sim(meta_g_v,rt_g_v)

0.6824329998331107

#### **Task 2 Complete!**