### Movies Recommendation System IMDB

For this project I am using the data from IMDb movies to create a movie recommendation system!

Link to IMDb movies dataset: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset

Link to IMDb site: https://www.imdb.com/

Let's import our tools!

In [68]:
import pandas as pd 
import numpy as np 
import pickle 

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df_movies= pd.read_csv("../input/imdb-extensive-dataset/IMDb movies.csv")

In [69]:
df_movies.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,537,$ 2250,,,,7.0,7.0
1,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.9,171,,,,,4.0,2.0
2,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,420,$ 45000,,,,24.0,3.0
3,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2019,,,,,28.0,14.0
4,tt0002199,"From the Manger to the Cross; or, Jesus of Naz...","From the Manger to the Cross; or, Jesus of Naz...",1912,1913,"Biography, Drama",60,USA,English,Sidney Olcott,...,"R. Henderson Bland, Percy Dyer, Gene Gauntier,...","An account of the life of Jesus Christ, based ...",5.7,438,,,,,12.0,5.0


In [70]:
df_movies.dtypes

imdb_title_id             object
title                     object
original_title            object
year                       int64
date_published            object
genre                     object
duration                   int64
country                   object
language                  object
director                  object
writer                    object
production_company        object
actors                    object
description               object
avg_vote                 float64
votes                      int64
budget                    object
usa_gross_income          object
worlwide_gross_income     object
metascore                float64
reviews_from_users       float64
reviews_from_critics     float64
dtype: object

### Drop columns

We'll drop some useless columns and we will work with movies from ` >1990 `

In [71]:
df_movies= df_movies.drop(["original_title", "duration", "metascore", "worlwide_gross_income", "date_published", "production_company", "budget", "usa_gross_income", "reviews_from_users", "reviews_from_critics"], axis=1)

In [72]:
df_movies= df_movies[df_movies.year > 1990]
df_movies= df_movies[df_movies.avg_vote > 6.5]

### Find the Weight Rate

First we will find the `80%` of votes and the `mean` of `avg_vote`

In [73]:
quant= df_movies["votes"].quantile(0.8)

print("Number of votes: " +str(quant))

mean= df_movies["avg_vote"].mean()

print("Average vote: " +str(mean))

Number of votes: 11820.400000000001
Average vote: 7.121994153946187


In [74]:
movies= df_movies.copy().loc[df_movies["votes"] >= quant]

movies.shape

(3079, 12)

Weight rate function

In [75]:
def Wrate(df, m=mean,  q=quant):
    v=df["votes"]
    R= df["avg_vote"]
    
    return (v/(v+q)* R) + (q/(q+v)* m)

In [54]:
movies["score"]= movies.apply(Wrate, axis=1)

Top `Weight Rate` score by `title`

In [55]:
movies= movies.sort_values("score", ascending= False)

movies["title"].head()

27628                         The Shawshank Redemption
46756                                  The Dark Knight
27558                                     Pulp Fiction
33198    The Lord of the Rings: The Return of the King
26817                                 Schindler's List
Name: title, dtype: object

We 'll clean the text data and make a new column named features with the existing columns of:

   * `writer`
        
   * `actors`
        
   * `director`
        
   * `genre`
        
   * `description`

In [56]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    
    else: 
        
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        
        else: return ""

In [57]:
features=["actors", "director", "genre", "description"]

for f in features:
    df_movies[f]= df_movies[f].apply(clean_data)

Create new columns from the features

In [58]:
def create_col(x):
    return " ".join(x["actors"]) + " " + x["director"] + " " + " ".join(x["genre"])+ " "+ " ".join(x["description"])


df_movies["features"] = df_movies.apply(create_col, axis=1)

In [59]:
df_movies.dtypes

index              int64
imdb_title_id     object
title             object
year               int64
genre             object
country           object
language          object
director          object
writer            object
actors            object
description       object
avg_vote         float64
votes              int64
features          object
dtype: object

### TfidfVectorizer

We 'll use `TfidfVectorizer` to find the similar movies

In [60]:
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(df_movies["features"])

In [61]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim.shape

(15395, 15395)

In [62]:
pickle.dump(cosine_sim, open("cos.pkl", "wb"))

### Show recommendations

In [63]:
def get_recom(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    sim_scores = sim_scores[1:10]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    return df_movies["title"].iloc[movie_indices]

In [64]:
df_movies = df_movies.reset_index()
indices = pd.Series(df_movies.index, index=df_movies["title"])

In [47]:
df_movies.to_csv("Mov.csv")

Test it!

In [48]:
get_recom("JFK", cosine_sim)

112                             JFK
550                  Heaven & Earth
869            Natural Born Killers
1159                          Nixon
1773                         U Turn
2145               Any Given Sunday
12001                       Snowden
0        The Other Side of the Wind
1                         Abhimanyu
Name: title, dtype: object