# Movie Recommendation Systeme

A REFAIRE 

This project aims to develop a movie recommendation system using a dataset of Netflix movies.  
Source : https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies?select=titles.csv , Data collected from JustWatch.   
The data includes various features such as titles, genres, release years, and user ratings.   

The goal is to clean and preprocess the data, handle missing values, and build a recommendation engine that can suggest movies based on user preferences.   

1. Data Loading and Preparation : Loading the dataset, filtering for movies, and drop unnecessary columns.  
2. Data Transformation: Extracting genres and main production countries from the data.  
3. Handling Missing Values: Identifying and filling missing values using techniques such as imputation and prediction.  
4. Feature Engineering: Creating additional features like normalized scores and weighted ratings.  
5. Keyword Extraction: Extracting keywords from movie titles and descriptions using the YAKE algorithm.  
6. Visualization: Visualizing the distribution of movie ratings.  
7. Recommendation System: Implementing a recommendation system using content-based filtering and machine learning algorithms.  

Methods and Algorithms Used:
* Data Cleaning and Preprocessing  
* Feature Engineering
* Missing Value Imputation using Random Forest Regressor
* Keyword Extraction using YAKE (Yet Another Keyword Extractor)
* Data Visualization with Matplotlib and Seaborn
* Content-Based Filtering for Recommendations
* Nearest Neighbors Algorithm for Finding Similar Movies

In [15]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ast

from gensim.models import Word2Vec
from sklearn.cluster import KMeans

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestRegressor

import yake

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from nltk.stem import WordNetLemmatizer

In [17]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\melan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\melan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\melan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Section 1 : Data Loading and Preparation

In [19]:
# Loading Data
df = pd.read_csv('input/titles.csv')

In [23]:
df.head(3)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],,tt0068473,7.7,107673.0,10.01,7.3


In [25]:
# Filtering Movies and Dropping Unnecessary Columns
df = df[df['type']=='MOVIE'].drop(['type','seasons','age_certification','imdb_id'], axis=1).reset_index(drop=True)

In [27]:
df.head(3)

Unnamed: 0,id,title,description,release_year,runtime,genres,production_countries,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,tm84618,Taxi Driver,A mentally unstable Vietnam War veteran works ...,1976,114,"['drama', 'crime']",['US'],8.2,808582.0,40.965,8.179
1,tm154986,Deliverance,Intent on seeing the Cahulawassee River before...,1972,109,"['drama', 'action', 'thriller', 'european']",['US'],7.7,107673.0,10.01,7.3
2,tm127384,Monty Python and the Holy Grail,"King Arthur, accompanied by his squire, recrui...",1975,91,"['fantasy', 'action', 'comedy']",['GB'],8.2,534486.0,15.461,7.811


In [29]:
len(df)

3744

# Df final : celui qui accueillera les nouvelles data 
## L'utilisateur rentrera : 
* id  
* title  
* description  
* release_year  
* runtime  
* list of genre
* list of country production
* scores of differents plateforme 

## Mise en forme des data - Preparation df pour Pipeline

### Missing Value

In [108]:
# Missing value
class MissingValueHandler(BaseEstimator,TransformerMixin):
    def __init__(self):
        pass        
        
    def fit(self, X,y=None):
        return self
    
    def transform(self, X,y=None):
        df = X.copy()
        
        # Les lignes dont title ou description sont vide
        df = df.dropna(subset=['title', 'description','genres'])
        
        # Les lignes dont plus de 3 colonnes ont des valeurs nulles
        missing_value_line = df[df.isnull().sum(axis=1) >3]
        df = df.drop(missing_value_line.index)
        
        # Remplir les pays manquant pas unknown country
        df['production_countries'].fillna('unknown_country', inplace=True)
        
        # Supprimer les lignes où les deux scores sont manquants
        drop_line = df[(df['imdb_score'].isnull())&(df['tmdb_score'].isnull())]
        df = df.drop(drop_line.index)
        
        # Remplir les valeurs manquantes des scores en se basant sur l'autre score
        df['tmdb_score'] = df['tmdb_score'].fillna(df['imdb_score'])
        df['imdb_score'] = df['imdb_score'].fillna(df['tmdb_score'])
        

        
        # Predire les valeures manquante des scores  
        def predict_missing_values(df, target_column):
            df_model = df[["release_year", "runtime", "imdb_score", "tmdb_score", target_column]]

            train_data = df_model.dropna(subset=[target_column])
            test_data = df_model[df_model[target_column].isnull()]
            X_train = train_data.drop(target_column, axis=1)
            y_train = train_data[target_column]

            model = RandomForestRegressor(n_estimators=100, random_state=0)
            model.fit(X_train, y_train)

            X_test = test_data.drop(target_column, axis=1)
            predictions = model.predict(X_test)

            df.loc[df[target_column].isnull(), target_column] = predictions
            
        #Appliquer sur les colonnes cible
        target_columns = ['tmdb_popularity','imdb_votes']
        for column in target_columns: 
            predict_missing_values(df, column)
            
        return df            

### Score unique

In [110]:
# calculated weighted score
class ScoreCalculator(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
        
    def fit(self, X,y=None):
        return self
        
    def transform(self, X, y=None):
        df = X.copy()
        
        # Normalizing Scores
        df['tmdb_popularity_norm'] = (df['tmdb_popularity'] - df['tmdb_popularity'].min()) / (df['tmdb_popularity'].max() - df['tmdb_popularity'].min())
        df['imdb_votes_norm'] = (df['imdb_votes'] - df['imdb_votes'].min()) / (df['imdb_votes'].max() - df['imdb_votes'].min())

        # Calculating Weighted Scores
        df['weighted_imdb_score'] = df['imdb_score'] * df['imdb_votes_norm']
        df['weighted_tmdb_score'] = df['tmdb_score'] * df['tmdb_popularity_norm']

        df['my_rating'] = (df['weighted_imdb_score'] + df['weighted_tmdb_score']) / (df['imdb_votes_norm'] + df['tmdb_popularity_norm'])
        
        #drop colonnes : 
        df = df.drop(
            ['tmdb_popularity_norm','tmdb_popularity','tmdb_score','weighted_tmdb_score',
             'imdb_votes','imdb_votes_norm','imdb_score','weighted_imdb_score']
            , axis=1)
        
        return df

### Nettoyer la description

In [106]:
class TextCleaner(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
        self.punctuation = set(string.punctuation)
        
    def fit(self, X,y=None):
        return self
        
    def transform(self, X, y=None):
        df = X.copy()
        
        words_description  = df['description'].apply(lambda x: x.lower())
        words_description = words_description.apply(word_tokenize)

        # filtrer les mots de la description qui ne sont pas dans stop_words (pourrais etre fait plus simplement!
        def filter_stop_words(words, stop_words,punctuation):
            filtered_words = []
            for word in words:
                clean_word = ''.join([char for char in word if char not in punctuation ])
               # Vérifie que clean_word n'est pas vide et n'est pas dans les stop words
                if clean_word and clean_word.lower() not in stop_words:
                    filtered_words.append(clean_word)
            return filtered_words

        #appliquer le filtre
        df['description_filtered'] = words_description.apply(lambda x : filter_stop_words(x, self.stop_words, self.punctuation))

        lemmatizer = WordNetLemmatizer()
        
        #lemmentizer les mots (tous la meme racine)
        def apply_lemmatization(words):           
            lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
            return lemmatized_words 

        df['description_lem'] = df['description_filtered'].apply(lambda x : apply_lemmatization(x))

        #les joindre ensemble
        df['description_join'] = df['description_lem'].apply(lambda x : ' '.join(x))
        
        #drop colonnes : 
        df = df.drop(['description_lem','description_filtered'], axis=1)
        
        return df

## Verification du df

In [112]:
from sklearn.pipeline import Pipeline

In [114]:
pipeline = Pipeline([
    ('missing_values', MissingValueHandler()),
    ('score_calculation', ScoreCalculator()),
    ('text_cleaning', TextCleaner())
])

In [116]:
df_transformed = pipeline.fit_transform(df)

In [118]:
df_transformed.head(3)

Unnamed: 0,id,title,description,release_year,runtime,genres,production_countries,my_rating,description_join
0,tm84618,Taxi Driver,A mentally unstable Vietnam War veteran works ...,1976,114,"['drama', 'crime']",['US'],8.19898,mentally unstable vietnam war veteran work nig...
1,tm154986,Deliverance,Intent on seeing the Cahulawassee River before...,1972,109,"['drama', 'action', 'thriller', 'european']",['US'],7.665871,intent seeing cahulawassee river turned one hu...
2,tm127384,Monty Python and the Holy Grail,"King Arthur, accompanied by his squire, recrui...",1975,91,"['fantasy', 'action', 'comedy']",['GB'],8.189007,king arthur accompanied squire recruit knight ...


In [120]:
def percent_missing(df):
    per_nan = 100*df.isnull().sum() /len(df)
    per_nan = per_nan[per_nan >0].sort_values()    
    return per_nan

In [122]:
percent_missing(df_transformed)

Series([], dtype: float64)

## Preprocessing

# J'en suis là  de la v2!!!!!!

### Genre et contry : Binariser genre et country

In [None]:
# Extractiondata : multilabelBinarizer
class GenreCountryExtractor(BaseEstimator,TransformerMixin):
        def __init__(self):
            self.genre_binarizer = MultiLabelBinarizer()
            self.country_binarizer = MultiLabelBinarizer()            
        
        def fit(self, X, y=None):
            self.genre_binarizer.fit(['genre'])
            self.country_binarizer.fit(['country'])
            return self
        
        def transform(self,X,y=None):
            genre_encoded = self.genre_binarizer.transform(X['genre'])
            country_encoded = self.country_binarizer.transform(X['country'])
            
            genre_df = pd.DataFrame(genre_encoded, index=X.index, columns=self.genre_binarizer.classes_)
            country_df = pd.DataFrame(country_encoded, index=X.index, columns=self.country_binarizer.classes_)
            return pd.concat([genre_df, country_df], axis=1)

### Transfo description clean en vecteur

In [None]:
word2vec_model = Word2Vec(df['description_join'], vector_size=100, window=5, min_count=1, sg=1)

# Calculating mean vectors for each movies
def get_mean_vector(words):
    words = [word for word in words if word in word2vec_model.wv]
    if len(words) >= 1:
        return np.mean(word2vec_model.wv[words], axis=0)
    else:
        return np.zeros(word2vec_model.vector_size)
    
df['vector'] = df['description_join'].apply(get_mean_vector)

## Dummies

## Scaling

## Section 6 : Saving Cleaned Data

In [262]:
df = df[['id', 'title', 'description', 'release_year', 'runtime', 'first_genre', 'second_genre', 'country', 'description_join', 'my_rating']] 

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3677 entries, 0 to 3676
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                3677 non-null   object 
 1   title             3677 non-null   object 
 2   description       3677 non-null   object 
 3   release_year      3677 non-null   int64  
 4   runtime           3677 non-null   int64  
 5   first_genre       3677 non-null   object 
 6   second_genre      3677 non-null   object 
 7   country           3677 non-null   object 
 8   description_join  3677 non-null   object 
 9   my_rating         3677 non-null   float64
dtypes: float64(1), int64(2), object(7)
memory usage: 287.4+ KB


In [266]:
df.to_csv("input/movie_recom_pandas.csv", index=False)

In [3]:
df = pd.read_csv("input/movie_recom_pandas.csv")

In [274]:
df.to_json('data/films_clean.json', orient='records')

## Section 7 : Movie Recommendation System

### Similarity-Based Recommendation

#### Preprocessing : Vectorization Keywords

In [15]:
word2vec_model = Word2Vec(df['description_join'], vector_size=100, window=5, min_count=1, sg=1)

In [17]:
# Calculating mean vectors for each movies
def get_mean_vector(words):
    words = [word for word in words if word in word2vec_model.wv]
    if len(words) >= 1:
        return np.mean(word2vec_model.wv[words], axis=0)
    else:
        return np.zeros(word2vec_model.vector_size)

In [19]:
df['vector'] = df['description_join'].apply(get_mean_vector)

In [21]:
df

Unnamed: 0,id,title,description,release_year,runtime,first_genre,second_genre,country,description_join,my_rating,vector
0,tm84618,Taxi Driver,a mentally unstable vietnam war veteran works ...,1976,114,drama,crime,US,mentally unstable vietnam war veteran work nig...,8.198980,"[-0.09833875, 0.18513083, 0.062483143, 0.06308..."
1,tm154986,Deliverance,intent on seeing the cahulawassee river before...,1972,109,drama,action,US,intent seeing cahulawassee river turned one hu...,7.665871,"[-0.104093514, 0.17885852, 0.057409924, 0.0549..."
2,tm127384,Monty Python and the Holy Grail,"king arthur, accompanied by his squire, recrui...",1975,91,fantasy,action,GB,king arthur accompanied squire recruit knight ...,8.189007,"[-0.10537343, 0.1804194, 0.063013054, 0.061206..."
3,tm120801,The Dirty Dozen,12 american military prisoners in world war ii...,1967,150,war,action,GB,12 american military prisoner world war ii ord...,7.677974,"[-0.1000509, 0.18254818, 0.0591866, 0.06253671..."
4,tm70993,Life of Brian,"brian cohen is an average young jewish man, bu...",1979,94,comedy,no_second-genre,GB,brian cohen average young jewish man series ri...,7.991343,"[-0.10681966, 0.17896973, 0.060149003, 0.06195..."
...,...,...,...,...,...,...,...,...,...,...,...
3672,tm1066324,Super Monsters: Once Upon a Rhyme,the super monsters rethink exemplary fantasies...,2021,25,animation,family,unknown_country,super monster rethink exemplary fantasy loved ...,6.299574,"[-0.09043901, 0.1746572, 0.061467055, 0.065398..."
3673,tm1097142,My Bride,the story follows a young man and woman who go...,2021,93,romance,comedy,EG,story follows young man woman go various situa...,5.265915,"[-0.1138273, 0.18390398, 0.058553066, 0.052136..."
3674,tm1014599,Fine Wine,a beautiful love story that can happen between...,2021,100,romance,drama,NG,beautiful love story happen two people regardl...,6.800000,"[-0.112132974, 0.18680863, 0.0631173, 0.074510..."
3675,tm898842,C/O Kaadhal,a heart warming film that explores the concept...,2021,134,drama,no_second-genre,unknown_country,heart warming film explores concept romance ep...,7.700000,"[-0.105311394, 0.18741384, 0.06420465, 0.07070..."


#### Preprocessing : Dummies and features

In [23]:
# Encoding genres and country
df = pd.get_dummies(df, columns=['first_genre', 'second_genre', 'country'])

In [25]:
# Adding other features
features = df[['release_year', 'runtime'] 
                     + list(df.columns[df.columns.str.startswith('first_genre_')]) 
                     + list(df.columns[df.columns.str.startswith('second_genre_')]) 
                     + list(df.columns[df.columns.str.startswith('country_')])]

In [27]:
features

Unnamed: 0,release_year,runtime,first_genre_action,first_genre_animation,first_genre_comedy,first_genre_crime,first_genre_documentation,first_genre_drama,first_genre_family,first_genre_fantasy,...,country_TZ,country_UA,country_US,country_UY,country_VE,country_VN,country_XX,country_ZA,country_ZW,country_unknown_country
0,1976,114,False,False,False,False,False,True,False,False,...,False,False,True,False,False,False,False,False,False,False
1,1972,109,False,False,False,False,False,True,False,False,...,False,False,True,False,False,False,False,False,False,False
2,1975,91,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
3,1967,150,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,1979,94,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3672,2021,25,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
3673,2021,93,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3674,2021,100,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3675,2021,134,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,True


#### Preprocessing : Scaling and combined all features

In [29]:
# normalized / scale 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

In [31]:
# Adding the keywords vectors
vectors = np.array(df['vector'].tolist())
combined_features = np.hstack((vectors, features_scaled))

#### Implementing KNN

In [33]:
knn = NearestNeighbors(n_neighbors=10, metric='cosine').fit(combined_features)

## Section 8 : Testing Similarity-Based Recommendation

In [35]:
def recommend_movies(movie_id, n_recommendations=5):
    movie_index = df.index[df['id'] == movie_id][0]
    distances, indices = knn.kneighbors([combined_features[movie_index]])
    recommended_movies = df.iloc[indices[0][1:n_recommendations+1]]
    return recommended_movies[['id', 'title']]

In [37]:
movie_id = 'tm1097142'  # ID du film que l'utilisateur a aimé
recommendations = recommend_movies(movie_id)

In [39]:
recommendations

Unnamed: 0,id,title
1857,tm311193,Seeking a Man
1126,tm235232,Misunderstanding
2908,tm853091,The American Game
3001,tm815438,Sab el-Burumbah
2916,tm823848,Mohamed Hussein


In [41]:
def recommend_movies(liked_movie_ids, n_recommendations=5):
    movie_indices = [df.index[df['id'] == movie_id][0] for movie_id in liked_movie_ids]
    liked_movie_vectors = [combined_features[movie_index] for movie_index in movie_indices]
    average_vector = np.mean(liked_movie_vectors, axis=0)
    
    distances, indices = knn.kneighbors([average_vector])
    
    # Exclude already liked movies from the recommendations
    recommended_indices = []
    for index in indices[0]:
        movie_id = df.iloc[index]['id']
        if movie_id not in liked_movie_ids:
            recommended_indices.append(index)
        if len(recommended_indices) == n_recommendations:
            break
    
    recommended_movies = df.iloc[recommended_indices]
    
    
    return recommended_movies[['id', 'title']]


In [43]:
liked_movie_ids = ['tm84618', 'tm15897', 'tm25947', 'tm70807']  # IDs of the movies the user liked


In [45]:
recommendations = recommend_movies(liked_movie_ids)

In [47]:
recommendations

Unnamed: 0,id,title
117,tm192431,The Devil's Own
38,tm177480,Police Academy
26,tm336403,Khoon Khoon
6,tm119281,Bonnie and Clyde
34,tm180542,Once Upon a Time in America
