### Movies Recommendation System IMDB

In this project I use the IMDb movies dataset to build a movie recommendation system!

Link to IMDB movies dataset: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset

Link to IMDB site: https://www.imdb.com/

Let's import our tools!

In [1]:
import pandas as pd 
import numpy as np 
import csv
import pickle 

from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df_movies= pd.read_csv("../input/imdb-extensive-dataset/IMDb movies.csv")

In [2]:
df_movies.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,537,$ 2250,,,,7.0,7.0
1,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.9,171,,,,,4.0,2.0
2,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,420,$ 45000,,,,24.0,3.0
3,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2019,,,,,28.0,14.0
4,tt0002199,"From the Manger to the Cross; or, Jesus of Naz...","From the Manger to the Cross; or, Jesus of Naz...",1912,1913,"Biography, Drama",60,USA,English,Sidney Olcott,...,"R. Henderson Bland, Percy Dyer, Gene Gauntier,...","An account of the life of Jesus Christ, based ...",5.7,438,,,,,12.0,5.0


In [3]:
df_movies.dtypes

imdb_title_id             object
title                     object
original_title            object
year                       int64
date_published            object
genre                     object
duration                   int64
country                   object
language                  object
director                  object
writer                    object
production_company        object
actors                    object
description               object
avg_vote                 float64
votes                      int64
budget                    object
usa_gross_income          object
worlwide_gross_income     object
metascore                float64
reviews_from_users       float64
reviews_from_critics     float64
dtype: object

### Drop columns

We'll drop the columns we don't need and keep only movies released after 1980 (`year > 1980`).

In [4]:
df_movies= df_movies.drop(["original_title", "duration", "metascore", "worlwide_gross_income", "date_published", "production_company", "budget", "usa_gross_income", "reviews_from_users", "reviews_from_critics"], axis=1)

In [5]:
df_movies= df_movies[df_movies.year > 1980]

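Next we look at the rows that have no `NaN` values. Note that `dropna()` returns a new DataFrame, so the call below only displays the result; `df_movies` itself stays unchanged unless the result is assigned back.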

In [6]:
df_movies.dropna()

Unnamed: 0,imdb_title_id,title,year,genre,country,language,director,writer,actors,description,avg_vote,votes
4115,tt0035423,Kate & Leopold,2001,"Comedy, Fantasy, Romance",USA,"English, French",James Mangold,"Steven Rogers, James Mangold","Meg Ryan, Hugh Jackman, Liev Schreiber, Brecki...",An English Duke from 1876 is inadvertedly drag...,6.4,75298
4425,tt0036606,"Another Time, Another Place",1983,Drama,UK,"English, Italian",Michael Radford,"Jessie Kesson, John Francis Lane","Phyllis Logan, Giovanni Mauriello, Denise Coff...","Set in 1943 Scotland during World War II, Jani...",6.5,234
12699,tt0062181,Rece do góry,1981,Drama,Poland,Polish,Jerzy Skolimowski,"Andrzej Kostenko, Jerzy Skolimowski","Jerzy Skolimowski, Joanna Szczerbic, Tadeusz L...","Censored by the Polish authorities, this movie...",6.5,277
13606,tt0064820,The Plot Against Harry,1989,Comedy,USA,English,Michael Roemer,Michael Roemer,"Martin Priest, Ben Lang, Maxine Woods, Henry N...","A small-time Jewish racketeer, just out of pri...",6.9,273
13656,tt0064994,Skrivánci na niti,1990,"Comedy, Drama, Romance",Czechoslovakia,Czech,Jirí Menzel,"Bohumil Hrabal, Jirí Menzel","Rudolf Hrusínský, Vlastimil Brodský, Václav Ne...","Set in the late 1940s, the film concerns the t...",7.5,1506
...,...,...,...,...,...,...,...,...,...,...,...,...
81266,tt9899880,Columbus,2018,"Comedy, Drama",Iran,"Persian, English",Hatef Alimardani,Hatef Alimardani,"Farhad Aslani, Majid Salehi, Saeed Poursamimi,...",A rich family are deciding to immigrate to the...,4.0,130
81267,tt9900782,Kaithi,2019,"Action, Thriller",India,Tamil,Lokesh Kanagaraj,Lokesh Kanagaraj,"Karthi, Narain, Dheena, George Maryan, Ramana,...","A drug bust, an injured cop and a convicted cr...",8.9,3082
81268,tt9903716,Jessie,2019,"Horror, Thriller",India,Telugu,Aswani Kumar V.,Aswani Kumar V.,"Sritha Chandana, Pavani Gangireddy, Abhinav Go...","Set in an abandoned house, the film follows a ...",7.2,219
81269,tt9905412,Ottam,2019,Drama,India,Malayalam,Zam,Rajesh k Narayan,"Nandu Anand, Roshan Ullas, Manikandan R. Achar...","Set in Trivandrum, the story of Ottam unfolds ...",7.8,510
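
To make the behaviour of `dropna()` concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration and stand in for `df_movies`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_movies (hypothetical data).
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "language": ["English", np.nan, "French"],
})

# dropna() returns a new DataFrame; the original is left untouched.
dropped = toy.dropna()

# Assigning the result is what actually removes the rows.
toy_clean = toy.dropna().reset_index(drop=True)

print(len(toy), len(dropped), list(toy_clean["title"]))
```

If the rows should really be removed from the working dataset, the same pattern (`df_movies = df_movies.dropna()`) would apply, at the cost of changing the row counts and indices shown in the later outputs.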


### Create a new CSV

Let's save the filtered dataset to a new CSV file and read it back.

In [7]:
df_movies.to_csv("Mov.csv")

In [8]:
df_mov=pd.read_csv("Mov.csv")

df_mov.head()

Unnamed: 0.1,Unnamed: 0,imdb_title_id,title,year,genre,country,language,director,writer,actors,description,avg_vote,votes
0,4115,tt0035423,Kate & Leopold,2001,"Comedy, Fantasy, Romance",USA,"English, French",James Mangold,"Steven Rogers, James Mangold","Meg Ryan, Hugh Jackman, Liev Schreiber, Brecki...",An English Duke from 1876 is inadvertedly drag...,6.4,75298
1,4425,tt0036606,"Another Time, Another Place",1983,Drama,UK,"English, Italian",Michael Radford,"Jessie Kesson, John Francis Lane","Phyllis Logan, Giovanni Mauriello, Denise Coff...","Set in 1943 Scotland during World War II, Jani...",6.5,234
2,12699,tt0062181,Rece do góry,1981,Drama,Poland,Polish,Jerzy Skolimowski,"Andrzej Kostenko, Jerzy Skolimowski","Jerzy Skolimowski, Joanna Szczerbic, Tadeusz L...","Censored by the Polish authorities, this movie...",6.5,277
3,13606,tt0064820,The Plot Against Harry,1989,Comedy,USA,English,Michael Roemer,Michael Roemer,"Martin Priest, Ben Lang, Maxine Woods, Henry N...","A small-time Jewish racketeer, just out of pri...",6.9,273
4,13656,tt0064994,Skrivánci na niti,1990,"Comedy, Drama, Romance",Czechoslovakia,Czech,Jirí Menzel,"Bohumil Hrabal, Jirí Menzel","Rudolf Hrusínský, Vlastimil Brodský, Václav Ne...","Set in the late 1940s, the film concerns the t...",7.5,1506
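
The extra `Unnamed: 0` column in the table above is the old row index, which `to_csv` writes by default. A small sketch of the difference, using an in-memory buffer and a made-up frame instead of the real file:

```python
import io
import pandas as pd

# Toy frame with a non-default index, similar to the filtered df_movies.
toy = pd.DataFrame({"title": ["A", "B"], "votes": [10, 20]}, index=[4115, 4425])

# Default behaviour: the index is written as an unnamed first column.
buf_default = io.StringIO()
toy.to_csv(buf_default)
buf_default.seek(0)
with_index = pd.read_csv(buf_default)

# index=False leaves the index out of the file.
buf_clean = io.StringIO()
toy.to_csv(buf_clean, index=False)
buf_clean.seek(0)
without_index = pd.read_csv(buf_clean)

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to keep the old index is a design choice; dropping it gives a cleaner table, keeping it preserves a link back to the original rows.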


### Find the weighted rating

First we will find the 80th percentile of `votes` and the `mean` of `avg_vote`.

In [9]:
quant= df_mov["votes"].quantile(0.8)

print("Number of votes: " +str(quant))

mean= df_mov["avg_vote"].mean()

print("Average vote: " +str(mean))

Number of votes: 3532.0
Average vote: 5.77913943264069


In [10]:
movies= df_movies.copy().loc[df_movies["votes"] >= quant]

movies.shape

(11967, 12)

Weighted rating function

In [11]:
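def Wrate(df, m=mean, q=quant):
    v = df["votes"]      # number of votes for this movie
    R = df["avg_vote"]   # average rating of this movie

    # Blend the movie's own rating with the overall mean m:
    # the more votes a movie has relative to q, the more its own rating counts.
    return (v / (v + q) * R) + (q / (q + v) * m)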
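
Written as a formula, with the same symbols as the code ($v$ = votes, $R$ = `avg_vote`, $q$ = the 80th-percentile vote count, $m$ = the mean of `avg_vote`), the function computes:

$$
\text{score} = \frac{v}{v+q}\,R + \frac{q}{v+q}\,m
$$

As a worked example under the values printed earlier ($q = 3532$, $m \approx 5.779$): a movie with exactly $v = 3532$ votes and $R = 8.0$ gets $0.5 \cdot 8.0 + 0.5 \cdot 5.779 \approx 6.89$, so a rating backed by few votes is pulled halfway toward the mean.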

In [12]:
movies["score"]= movies.apply(Wrate, axis=1)
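
The same score can also be computed on whole columns at once instead of row by row with `apply`. This is only a sketch: the function name `weighted_rating`, the toy frame, and the rounded constants are assumptions made for illustration.

```python
import pandas as pd

def weighted_rating(votes, rating, m, q):
    """Weighted rating: pulls ratings with few votes toward the mean m."""
    return (votes / (votes + q)) * rating + (q / (votes + q)) * m

# Toy data: same rating, very different vote counts (hypothetical values).
toy = pd.DataFrame({"votes": [3532, 100000], "avg_vote": [8.0, 8.0]})
q = 3532.0  # vote threshold, as printed earlier
m = 5.78    # mean rating, rounded from the value printed earlier

# Works on whole Series, so no row-wise apply is needed.
toy["score"] = weighted_rating(toy["votes"], toy["avg_vote"], m, q)
print(toy)
```

The movie with more votes ends up with a score closer to its own rating of 8.0, which is the intended effect of the weighting.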

Top titles by weighted rating `score`

In [13]:
movies= movies.sort_values("score", ascending= False)

movies["title"].head()

27628                         The Shawshank Redemption
46756                                  The Dark Knight
27558                                     Pulp Fiction
33198    The Lord of the Rings: The Return of the King
75818                                           Dag II
Name: title, dtype: object

We'll clean the text data and build a new column named `features` from the existing columns:

   * `writer`
        
   * `actors`
        
   * `director`
        
   * `genre`
        
   * `description`

In [14]:
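def clean_data(x):
    # Lowercase and remove spaces, so that "Meg Ryan" becomes "megryan"
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    elif isinstance(x, str):
        return str.lower(x.replace(" ", ""))
    else:
        # Anything else (e.g. NaN) becomes an empty string
        return ""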

In [15]:
features=["actors", "writer", "director", "genre", "description"]

for f in features:
    df_movies[f]= df_movies[f].apply(clean_data)

Create a new `features` column from the cleaned columns

In [16]:
def create_col(x):
    return " ".join(x["writer"]) + " " + " ".join(x["actors"]) + " " + x["director"] + " " + " ".join(x["genre"])+ " "+ " ".join(x["description"])


df_movies["features"] = df_movies.apply(create_col, axis=1)
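
One thing to watch here: `clean_data` returns a plain string for these columns, so `" ".join(x["writer"])` puts a space between every character rather than between names, and `CountVectorizer`'s default token pattern ignores one-character tokens. Only `director`, which is not passed through `join`, survives as a usable token. The sketch below is one possible alternative, not the notebook's method; the helper names `split_names` and `build_features` and the toy row are assumptions. It treats the comma-separated fields as lists of names and keeps the description as ordinary words.

```python
import pandas as pd

def split_names(value):
    """Turn 'Meg Ryan, Hugh Jackman' into ['megryan', 'hughjackman']."""
    if not isinstance(value, str):
        return []  # missing values (NaN) become an empty list
    parts = [part.strip() for part in value.split(",")]
    return [part.replace(" ", "").lower() for part in parts if part]

def build_features(row):
    """Join the name tokens and append the lowercased description."""
    tokens = []
    for col in ["writer", "actors", "director", "genre"]:
        tokens.extend(split_names(row[col]))
    description = row["description"] if isinstance(row["description"], str) else ""
    return " ".join(tokens + [description.lower()]).strip()

# Toy row (the description is made up for illustration).
toy = pd.DataFrame([{
    "writer": "Steven Rogers, James Mangold",
    "actors": "Meg Ryan, Hugh Jackman",
    "director": "James Mangold",
    "genre": "Comedy, Fantasy, Romance",
    "description": "A duke travels through time",
}])
toy["features"] = toy.apply(build_features, axis=1)
print(toy["features"].iloc[0])
```

Gluing each name into one token keeps "James Mangold" from matching every other "James", while the description still contributes individual words.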

In [17]:
df_movies.dtypes

imdb_title_id     object
title             object
year               int64
genre             object
country           object
language          object
director          object
writer            object
actors            object
description       object
avg_vote         float64
votes              int64
features          object
dtype: object

### CountVectorizer

We'll use `CountVectorizer` to turn the `features` text into token counts, and then compare movies with cosine similarity.

In [18]:
cv = CountVectorizer(stop_words="english")
cv_matrix = cv.fit_transform(df_movies["features"])

In [19]:
cosine_sim = cosine_similarity(cv_matrix, cv_matrix)
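
To see what the similarity matrix contains, here is a minimal sketch on three made-up feature strings. The strings and the `toy_` names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three hypothetical feature strings: the first two share two of four tokens.
docs = [
    "christophernolan thriller dream heist",
    "christophernolan thriller magician rivalry",
    "romance comedy paris",
]

toy_cv = CountVectorizer(stop_words="english")
toy_matrix = toy_cv.fit_transform(docs)   # sparse matrix: documents x tokens
toy_sim = cosine_similarity(toy_matrix)   # dense matrix: documents x documents

print(toy_sim.round(2))
```

Each row of the result holds one movie's similarity to every other movie: 1.0 on the diagonal, 0.5 for the two documents that share half their tokens, and 0.0 where nothing overlaps. Because the result is a dense n-by-n array, memory use grows with the square of the number of movies.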

### Show recommendations

In [20]:
def get_recom(title, cosine_sim=cosine_sim):
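    # Get the index of the movie that matches the title
    # (`indices` is built in the next cell and must exist before this function is called)
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (normally the movie itself) and keep the next 19
    sim_scores = sim_scores[1:20]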

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    return df_movies["title"].iloc[movie_indices]

In [21]:
df_movies = df_movies.reset_index()
indices = pd.Series(df_movies.index, index=df_movies["title"])

Test it!

In [22]:
get_recom("Inception", cosine_sim)

14906                        Memento
17676                       Insomnia
21449                  Batman Begins
25453                The Dark Knight
26111                   The Prestige
27608                   Interstellar
33679          The Dark Knight Rises
33914                      Inception
52247                        Dunkirk
0                     Kate & Leopold
1        Another Time, Another Place
2                       Rece do góry
3             The Plot Against Harry
4                  Skrivánci na niti
5                          7 xiao fu
6                        Firecracker
7               Proverka na dorogakh
8         The Other Side of the Wind
9                  Malibu Hot Summer
Name: title, dtype: object
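
In the output above, `Inception` shows up in its own recommendations, and the tail of the list is simply the first rows of the table. That pattern is what tied similarity scores produce: `sim_scores[1:20]` removes whichever entry happens to sort first, which is the queried movie only when it is the sole best match. A minimal sketch of a variant that excludes the query by position instead; the function name `recommend`, its parameters, and the toy data are assumptions, and `titles` is assumed to have a default 0..n-1 index.

```python
import numpy as np
import pandas as pd

def recommend(title, titles, sim, top_n=3):
    """Return the top_n most similar titles, never the query itself."""
    # Position of the queried movie (assumes a default RangeIndex and a unique title).
    idx = titles[titles == title].index[0]

    scores = np.asarray(sim[idx], dtype=float).copy()
    scores[idx] = -np.inf  # the query can never be recommended

    # Stable sort on the negated scores: highest first, ties keep table order.
    order = np.argsort(-scores, kind="stable")[:top_n]
    return titles.iloc[order].tolist()

# Toy data: three movies tied at similarity 1.0, one unrelated movie.
titles = pd.Series(["Memento", "Inception", "Dunkirk", "Kate & Leopold"])
sim = np.array([
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

print(recommend("Inception", titles, sim, top_n=2))
```

A further refinement could drop candidates whose score is 0, so that unrelated movies never pad the list.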

### Save our model with pickle

In [23]:
# Call pickle.dump (do not assign to it) and close the file when done
with open("CvModel.pkl", "wb") as f:
    pickle.dump((cv, cv_matrix, cv.vocabulary_, cosine_sim), f)
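
To reuse saved objects later, they can be read back with `pickle.load`. A minimal round-trip sketch with toy objects; the file name `toy_model.pkl` and the dictionary keys are placeholders, not part of the notebook.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy stand-ins for the fitted vocabulary and the similarity matrix.
artifacts = {"vocabulary": {"drama": 0, "comedy": 1}, "cosine_sim": np.eye(2)}

# Write to a temporary directory so the example leaves no files behind in the project.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(artifacts, f)

# Reading uses "rb" and returns the same Python objects.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["vocabulary"])
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.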