# Movie Recommendation and Rating - Team ZF3

© Explore Data Science Academy 2022

---

###### Team Members

1. Ubasinachi Eleonu
2. Bongani Mkhize
3. Abubakar Abdulkadir
4. Michael Mamah
5. Joseph Okonkwo

---

## Project Overview

<img src="https://th.bing.com/th/id/R.f32f6c0a36b1166033122544cf0dd8a1?rik=QmYumf41lwVQgA&pid=ImgRaw&r=0" style='margin-top:30px; margin-bottom:30px'/>
No one can consume every product and choice available. Most people lack the time, patience or resources even to browse the myriad products and services at their disposal. Producers of goods and services therefore need to narrow down the choices they present, so that users are not overwhelmed and can reach the products and services relevant to them without wasting time. This gives users a better experience and exposes them to products and services they might never have discovered otherwise. This help comes in the form of <b>recommendation</b>.

Simple as this sounds, it is not easy to implement. The traditional approach would be to deploy product recommender agents (such as customer service representatives) to handle recommendation requests from customers. But these agents cannot learn about every one of their customers and which products and services each might want or find useful. So how does one recommend products and services to people one does not know?

The answer is recommender systems. Recommender systems are machine learning systems that help users discover products and services based on the relationship between the users and the products. They are like salespeople who have learnt to recognize customers and the products they might like based on their history and preferences. Recommender systems are now so commonplace that almost every time you shop online, one is guiding you towards the products you are most likely to purchase.

Recommender systems have several use cases; this project focuses on movie recommendation.

---

## 1.0 Project Objective

To build a recommendation system capable of recommending movies to users and predicting the rating a user might give a movie they have never seen before. <br><br>

## 2.0 Packages

### 2.1. Installing Packages

For this project, two major libraries were used: sklearn (scikit-learn) and surprise. Sklearn is the more popular of the two.

In [None]:
!pip install scikit-surprise

- <a href="http://surpriselib.com/">Surprise</a> is a Python scikit for building and analyzing recommender systems that deal with explicit rating data. It does not support implicit ratings or content-based information. Surprise was used in this project to make collaborative filtering predictions. <br>

### 2.2 Importing Packages 

In [2]:
# data loading and preprocessing 
import numpy as np 
import pandas as pd 
import pickle as pkl
from collections import Counter
from surprise import Reader
from surprise import Dataset
import math

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# feature extraction and similarity metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

# modeling and validation
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

<br />

## 3.0 Loading Datasets

    
The dataset used for this project is the MovieLens dataset maintained by the GroupLens research group in the Department of Computer Science and Engineering at the University of Minnesota. Additional movie content data was legally scraped from IMDB. The dataset can be found <a href="https://www.kaggle.com/competitions/edsa-movie-recommendation-2022/data">here</a>. The pandas library is used to load and manipulate the datasets.

In [3]:
# read movie dataset
df_movies = pd.read_csv('data/movies.csv')

In [9]:
# read the ratings dataset
df_rating = pd.read_csv('data/train.csv')

In [5]:
# read the movie additional information
df_meta = pd.read_csv('data/imdb_data.csv')

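In [13]:
# read a previously generated submission file (path is local to the author's machine)
data = pd.read_csv("C:/Users/USER/Desktop/recommedation/submission/submission.csv")

In [15]:
# round the predicted ratings to one decimal place
data['rating'] = data['rating'].apply(lambda x: round(x, 1))

In [16]:
# save the rounded predictions without the index column
data.to_csv("C:/Users/USER/Desktop/recommedation/submission/submission_main.csv", index=False)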

<br><br>
## 4.0 Exploratory Data Analysis


Exploratory data analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. Primarily, EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. It uses many tools (mainly graphical) to maximize insight into a dataset, extract important variables, and detect outliers, anomalies and other details that are missed when looking at a raw DataFrame. This step is especially important before modeling the data with machine learning techniques.
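
As a minimal sketch of the kind of summary EDA usually starts with, the snippet below works on a small made-up ratings table. The values and the name `toy_ratings` are illustrative assumptions, not the project's data. It counts how often each rating value occurs and flags unusually active users with the interquartile-range rule.

```python
import pandas as pd

# Hypothetical ratings table; the real ratings are loaded from data/train.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 5],
    "movieId": [1, 2, 1, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 2, 3],
    "rating":  [4.0, 3.5, 5.0, 4.0, 2.0, 3.0, 3.0, 4.5,
                4.0, 1.0, 5.0, 4.0, 3.5, 4.0, 2.5],
})

# Distribution of rating values, ordered by rating
rating_counts = toy_ratings["rating"].value_counts().sort_index()

# Number of ratings per user, and users above Q3 + 1.5 * IQR
per_user = toy_ratings.groupby("userId")["rating"].count()
q1, q3 = per_user.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
heavy_users = per_user[per_user > upper_fence].index.tolist()

print(rating_counts)
print("Unusually active users:", heavy_users)
```

The same two summaries could then be plotted with seaborn or matplotlib to make the distribution and any outliers easier to see.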

<br><br>
## 5.0 Content-Based Recommendation

This section of the project aims to make recommendations and predict ratings using the content-based approach. This approach uses the similarity between items to make recommendations. It is based on the assumption that if a user likes a particular item, the user will also like items similar to it. Hence, if a user rates a particular movie very highly, there is a high chance the user will rate other similar movies highly.
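
The idea can be sketched with a toy example. Below, three made-up movies are described by hand-picked genre flags (the titles and vectors are illustrative assumptions), and cosine similarity ranks how close each one is to a movie the user rated highly.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical movies encoded as [adventure, comedy, romance] flags
titles = ["Movie A", "Movie B", "Movie C"]
item_vectors = np.array([
    [1, 1, 0],   # Movie A: adventure + comedy
    [1, 1, 1],   # Movie B: adventure + comedy + romance
    [0, 0, 1],   # Movie C: romance only
])

# Similarity of every movie to Movie A (the one the user liked)
sims_to_a = cosine_similarity(item_vectors[0:1], item_vectors)[0]

# Rank the other movies from most to least similar
ranked = [titles[i] for i in np.argsort(-sims_to_a) if i != 0]
print(ranked)
```

Movie B shares two of its three genres with Movie A and so ranks ahead of Movie C, which shares none.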

### 5.1 Feature Engineering and Selection


This project builds a recommender from the movie genre, director and plot keyword features.

#### 5.1.1 Selecting the Required Features

The movie genre is available in the movies dataset, while the director and plot keyword features are in the imdb_data dataset. Hence, both datasets need to be merged and the required features extracted.
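
A left merge keeps every movie even when no IMDB metadata exists for it, which is why missing values show up after merging. A minimal sketch with made-up frames (`toy_movies` and `toy_meta` are hypothetical names and values):

```python
import pandas as pd

toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Comedy", "Drama", "Action"],
})
toy_meta = pd.DataFrame({
    "movieId": [1, 3],
    "director": ["Director One", "Director Three"],
})

# how='left' keeps all rows of toy_movies; rows without a match get NaN
toy_merged = toy_movies.merge(toy_meta, on="movieId", how="left")
missing_directors = int(toy_merged["director"].isna().sum())

print(toy_merged)
print("Movies without a director:", missing_directors)
```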

In [33]:
# Extract movieId, director and plot_keywords from df_meta
df_meta = df_meta[['movieId', 'director', 'plot_keywords']]


# merge meta dataset to movies dataset to produce our train dataset
df_train = df_movies.merge(df_meta, on='movieId', how='left')
df_train.head()

Unnamed: 0,movieId,title,genres,director,plot_keywords
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,John Lasseter,toy|rivalry|cowboy|cgi animation
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jonathan Hensleigh,board game|adventurer|fight|game
2,3,Grumpier Old Men (1995),Comedy|Romance,Mark Steven Johnson,boat|lake|neighbor|rivalry
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Terry McMillan,black american|husband wife relationship|betra...
4,5,Father of the Bride Part II (1995),Comedy,Albert Hackett,fatherhood|doberman|dog|mansion


<br>

#### 5.1.2 Cleaning the Selected features

The genres and plot keywords features contain genres and keywords separated by the '|' character, so the separator needs to be replaced with a space. In the director feature, the space between the director's first name and surname needs to be removed, so that the model does not perceive any similarity between, say, Albert Johnson and Albert Robert, who are entirely different people. Lastly, the features are merged into a single string and converted to lowercase.
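
To illustrate why the director's names are joined, the sketch below tokenizes the two example names with scikit-learn's `CountVectorizer`. The helper `shared_tokens` is a hypothetical name introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

def shared_tokens(a, b):
    """Return the word tokens that two strings have in common."""
    vec = CountVectorizer()
    counts = vec.fit_transform([a, b]).toarray()
    # vocabulary_ maps each token to its column index in the count matrix
    return sorted(tok for tok, col in vec.vocabulary_.items()
                  if counts[0, col] and counts[1, col])

# With the space kept, both names contribute the token "albert"
with_space = shared_tokens("Albert Johnson", "Albert Robert")

# With the space removed, each name is a single distinct token
joined = shared_tokens("Albert Johnson".replace(" ", ""),
                       "Albert Robert".replace(" ", ""))

print(with_space, joined)
```

With the space kept, the two directors overlap on a first name; once joined, they share no tokens and so add nothing to the similarity between their movies.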

In [35]:
# handle missing data
df_train = df_train.fillna(' ')

# replace "|" and "(no genres listed)" with ' ' in genres
df_train['genres'] = df_train['genres'].apply(lambda x: x.replace("|", ' ')
                                       .replace("(no genres listed)", ' '))

# remove the spaces inside each director name so it becomes a single token;
# the appended "|" is turned back into a trailing space
df_train['director'] = df_train['director'].apply(lambda x: (x + '|')
                                            .replace(" ", '')
                                            .replace("|", " "))

# replace "|" with ' ' in plot_keywords
df_train['plot_keywords'] = df_train['plot_keywords'].apply(lambda x: x.replace("|", " "))

# merge the genres, director names and plot_keywords into one string per movie
# (df_train is a Series from this point on)
df_train = df_train['genres'] + " " + df_train['director'] + " " + df_train['plot_keywords']

# change to lower case and keep the result
df_train = df_train.str.lower()
df_train

0        adventure animation children comedy fantasy jo...
1        adventure children fantasy jonathanhensleigh  ...
2        comedy romance markstevenjohnson  boat lake ne...
3        comedy drama romance terrymcmillan  black amer...
4        comedy alberthackett  fatherhood doberman dog ...
                               ...                        
62418                                            drama    
62419                                      documentary    
62420                                     comedy drama    
62421                                                     
62422                           action adventure drama    
Length: 62423, dtype: object

#### 5.1.3 Vectorization

To create a model, the feature set must have numerical values, since most models only accept numerical inputs. For this project, the feature is a string of words, so these words need to be converted into numeric vectors. This process is called vectorization.
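
As a small illustration of vectorization, the snippet below turns three made-up feature strings into a TF-IDF matrix. The strings and the `toy_` names are assumptions for the example; the default `TfidfVectorizer` settings are used here rather than the project's tuning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "comedy romance markstevenjohnson boat lake",
    "comedy alberthackett fatherhood dog",
    "adventure children fantasy board game",
]

toy_vectorizer = TfidfVectorizer()
# Sparse matrix with one row per string and one column per distinct word
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

n_docs, n_terms = toy_matrix.shape
vocabulary = sorted(toy_vectorizer.vocabulary_)

print(n_docs, n_terms)
print(vocabulary)
```

Each row is the numeric representation of one movie, and only the columns for words that occur in that movie's string are non-zero.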

For this project we define a vectorizer with the following settings:
- analyzer = 'word'
- ngram_range = (1, 1)
- max_df = 0.3
- min_df = 20
- stop_words = 'english'

In [37]:
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=20, max_df=0.3, stop_words='english')
features = vectorizer.fit_transform(df_train)
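
The effect of `min_df` and `max_df` can be seen on a toy corpus. In this sketch the strings are made up and the thresholds are scaled down to fit four documents, so they are not the project's values. Terms that appear in too few or too many documents are dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "drama war soldier",
    "drama romance paris",
    "drama war spy",
    "drama comedy paris",
]

# min_df=2: keep only terms found in at least 2 documents
# max_df=0.75: drop terms found in more than 75% of documents
toy_vectorizer = TfidfVectorizer(min_df=2, max_df=0.75)
toy_vectorizer.fit(toy_docs)

kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
```

Here "drama" is removed because it occurs in every document and so cannot distinguish between them, while words that occur only once are removed as too rare to link any two movies.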

### 5.2 Recommending

In this section, we will recommend movies based on the similarity between their features.

In [28]:
# merging datasets to form our initial dataset

df_merged = df_movies.merge(df_meta, on='movieId', how='left')

In [29]:
df_merged.shape

(62423, 5)

In [30]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62423 entries, 0 to 62422
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   movieId        62423 non-null  int64 
 1   title          62423 non-null  object
 2   genres         62423 non-null  object
 3   director       15347 non-null  object
 4   plot_keywords  14384 non-null  object
dtypes: int64(1), object(4)
memory usage: 2.9+ MB


In [31]:
# handle missing data; title_cast is filled only if that column is still present
for col in ['title_cast', 'director', 'plot_keywords']:
    if col in df_merged.columns:
        df_merged[col] = df_merged[col].fillna(' ')

In [None]:
df_merged.info()

In [None]:
df_merged

In [None]:
# cleaning the data in genres
df_cleaned = df_merged.copy()
df_cleaned['genres'] = df_merged['genres'].apply(lambda x: x.replace("|" , ' ').replace("(no genres listed)", ' '))

In [None]:
df_cleaned['title_cast'] = df_merged['title_cast'].apply(lambda x: x.replace(" " , '').replace("|", ' '))


In [18]:
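# join each director name into one token and repeat it three times,
# which raises its term frequency in the combined text
df_cleaned['director'] = df_merged['director'].apply(lambda x: ((x+'|') * 3).replace(" ", '').replace("|", " "))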

In [19]:
df_cleaned['director']

0                  JohnLasseter JohnLasseter JohnLasseter 
1        JonathanHensleigh JonathanHensleigh JonathanHe...
2        MarkStevenJohnson MarkStevenJohnson MarkSteven...
3               TerryMcMillan TerryMcMillan TerryMcMillan 
4               AlbertHackett AlbertHackett AlbertHackett 
                               ...                        
62418                                                     
62419                                                     
62420                                                     
62421                                                     
62422                                                     
Name: director, Length: 62423, dtype: object

In [20]:
df_cleaned['plot_keywords'] = df_merged['plot_keywords'].apply(lambda x: x.replace("|", " "))

In [21]:
df_cleaned['plot_keywords']

0                         toy rivalry cowboy cgi animation
1                         board game adventurer fight game
2                               boat lake neighbor rivalry
3        black american husband wife relationship betra...
4                          fatherhood doberman dog mansion
                               ...                        
62418                                                     
62419                                                     
62420                                                     
62421                                                     
62422                                                     
Name: plot_keywords, Length: 62423, dtype: object

In [22]:
df_cleaned.head()

Unnamed: 0,movieId,title,genres,title_cast,director,plot_keywords
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,TomHanks TimAllen DonRickles JimVarney Wallace...,JohnLasseter JohnLasseter JohnLasseter,toy rivalry cowboy cgi animation
1,2,Jumanji (1995),Adventure Children Fantasy,RobinWilliams JonathanHyde KirstenDunst Bradle...,JonathanHensleigh JonathanHensleigh JonathanHe...,board game adventurer fight game
2,3,Grumpier Old Men (1995),Comedy Romance,WalterMatthau JackLemmon SophiaLoren Ann-Margr...,MarkStevenJohnson MarkStevenJohnson MarkSteven...,boat lake neighbor rivalry
3,4,Waiting to Exhale (1995),Comedy Drama Romance,WhitneyHouston AngelaBassett LorettaDevine Lel...,TerryMcMillan TerryMcMillan TerryMcMillan,black american husband wife relationship betra...
4,5,Father of the Bride Part II (1995),Comedy,SteveMartin DianeKeaton MartinShort KimberlyWi...,AlbertHackett AlbertHackett AlbertHackett,fatherhood doberman dog mansion


In [23]:
df_data_string = df_cleaned['title'] + " " + df_cleaned['genres'] + " " + df_cleaned['title_cast'] + " " + df_cleaned['director'] + " " + df_cleaned['plot_keywords']

In [24]:
df_data_string.head()

0    Toy Story (1995) Adventure Animation Children ...
1    Jumanji (1995) Adventure Children Fantasy Robi...
2    Grumpier Old Men (1995) Comedy Romance WalterM...
3    Waiting to Exhale (1995) Comedy Drama Romance ...
4    Father of the Bride Part II (1995) Comedy Stev...
dtype: object

In [None]:
# vectorization

vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, max_df=0.5, stop_words='english')
features = vectorizer.fit_transform(df_data_string)

In [None]:
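# store the features as 32-bit floats to reduce memory and keep the result
# (float16 is not among the dtypes SciPy sparse matrices support)
features = features.astype(np.float32)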

In [None]:
# creating the similarity matrix: first 1000 movies (rows) against all movies (columns)
cosine_sim = cosine_similarity(features[0:1000], features)

In [None]:
cosine_sim.shape
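
Once a similarity matrix exists, recommending reduces to sorting one row of it. The function below is a minimal sketch of that step under stated assumptions, not the project's final recommender: `recommend_similar` and the toy data are hypothetical, and row `i` of the matrix is assumed to correspond to entry `i` of the title list.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend_similar(title, titles, sim_matrix, top_n=2):
    """Return the top_n titles most similar to `title`, excluding itself."""
    idx = titles.index(title)
    # Sort column indices by descending similarity to the chosen movie
    order = np.argsort(-sim_matrix[idx], kind="stable")
    return [titles[i] for i in order if i != idx][:top_n]

# Made-up titles and feature strings, for illustration only
toy_titles = ["Space War", "Star Battle", "Love in Paris", "Paris Romance"]
toy_text = [
    "action space war alien",
    "action space battle alien",
    "romance paris love",
    "romance paris wedding",
]

toy_features = TfidfVectorizer().fit_transform(toy_text)
toy_sim = cosine_similarity(toy_features)

picks = recommend_similar("Space War", toy_titles, toy_sim, top_n=1)
print(picks)
```

With a matrix that covers only a slice of the rows, such as the first 1000 movies, the lookup index must fall inside that slice, while the column indices still refer to the full movie list.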

In [45]:
df_cleaned.movieId

0             1
1             2
2             3
3             4
4             5
          ...  
62418    209157
62419    209159
62420    209163
62421    209169
62422    209171
Name: movieId, Length: 62423, dtype: int64