> # Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

We {**TEAM ES4**}, confirm - by submitting this document - that the solutions in this notebook are a result of our own work and that we abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.


### Challenge Description

In today’s technology driven world, recommender systems are socially and economically critical for ensuring that individuals can make appropriate choices surrounding the content they engage with on a daily basis. One application where this is especially true surrounds movie content recommendations; where intelligent algorithms can help viewers find great titles from tens of thousands of options.

If you have ever wondered how Netflix, Amazon Prime, Showmax, Disney and the likes somehow know what to recommend to you, it's not just a guess drawn out of the hat. There is an algorithm behind it.

With this context, EA is challenging you to construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences.

## What value is achieved through building a functional recommender system?

Providing an accurate and robust solution to this challenge has immense economic potential, with users of the system being exposed to content they would like to view or purchase - generating revenue and platform affinity.



# EDSA Movie Recommendation Challenge

![image.png](attachment:09684f0d-a96b-4a7c-a2f0-7c92d4bf6fd7.png)

# Introduction 

In today’s technology driven world, recommender systems are socially and economically critical for ensuring that individuals can make appropriate choices surrounding the content they engage with on a daily basis. One application where this is especially true surrounds movie content recommendations; where intelligent algorithms can help viewers find great titles from tens of thousands of options.

…ever wondered how Netflix, Amazon Prime, Showmax, Disney and the likes somehow know what to recommend to you?

![image.png](attachment:9397b72d-0cc7-4d73-93e7-1e60a44dc6dc.png)

…it's not just a guess drawn out of the hat. There is an algorithm behind it.
This notebook follows the step-by-step process to construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences.

Providing an accurate and robust solution to this challenge has immense economic potential, with users of the system being exposed to content they would like to view or purchase - generating revenue and platform affinity.

# Table of contents:

* Import libraries and datasets
* Load the Data
* Data preprocessing
* Checking for missing values column wise
* Checking for duplicate records
* Create copy
* Evaluating Length of Unique Values
* Evaluating unique values for movies
* Joining Datasets
* Exploratory data analysis
* Collaborative and content-based filtering
* Content-based filtering
* Collaborative filtering
* Item-Item based
* User-User
* Singular value decomposition
* Model Building
* Attempt prediction with altering parameters
* Submission
* Conclusion

# Import libraries and datasets

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Packages for data processing (numpy and pandas are already imported above)
import datetime
from sklearn import preprocessing
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from scipy.sparse import csr_matrix
import scipy as sp


# Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, plot, iplot
%matplotlib inline

# Packages for modeling
from surprise import Reader
from surprise import Dataset
from surprise import KNNWithMeans 
from surprise import KNNBasic
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise import SVD
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
import heapq

# Packages for model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from time import time

# Package to suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Packages for saving models
import pickle

/kaggle/input/edsa-movie-recommendation-predict/sample_submission.csv
/kaggle/input/edsa-movie-recommendation-predict/movies.csv
/kaggle/input/edsa-movie-recommendation-predict/imdb_data.csv
/kaggle/input/edsa-movie-recommendation-predict/genome_tags.csv
/kaggle/input/edsa-movie-recommendation-predict/genome_scores.csv
/kaggle/input/edsa-movie-recommendation-predict/train.csv
/kaggle/input/edsa-movie-recommendation-predict/test.csv
/kaggle/input/edsa-movie-recommendation-predict/tags.csv
/kaggle/input/edsa-movie-recommendation-predict/links.csv




### Loading experiments to Comet

In [None]:
%pip install comet_ml

In [None]:
from comet_ml import Experiment
from comet_ml.integration.pytorch import log_model

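# Read the API key from the environment rather than hard-coding the secret in the notebook
# (assumes the key has been stored in a COMET_API_KEY environment variable)
experiment = Experiment(
    api_key=os.environ["COMET_API_KEY"],
    project_name="ea-movie-recommendation",
    workspace="ifeoluwa13",
)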

# Load the Data

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `train.csv` file into a DataFrame. |

---

The train DataFrame below contains the main training data (over 10 million rows), while the test DataFrame (over 5 million rows) contains the data for which we have to predict the users' movie ratings. Due to the size of the datasets, they are all imported locally.

In [None]:
train = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/train.csv')
test_df = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/test.csv')
df_movies = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/movies.csv')
df_imdb = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/imdb_data.csv')
df_gtags = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/genome_tags.csv')
df_scores = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/genome_scores.csv')
df_tags = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/tags.csv')
df_links = pd.read_csv('/kaggle/input/edsa-movie-recommendation-predict/links.csv')

**Here is a description of the data we just loaded:**

* test.csv - The test split of the dataset. Contains user and movie IDs with no rating data.
* train.csv - The training split of the dataset. Contains user and movie IDs with associated rating data.
* genome_scores.csv - A score mapping the strength between movies and tag-related properties.
* genome_tags.csv - User-assigned tags for genome-related scores.
* imdb_data.csv - Additional movie metadata scraped from IMDB using the links.csv file.
* links.csv - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
* tags.csv - User-assigned tags for the movies within the dataset.

In [None]:
#take a look at the training data
train.head(10)

**Train Data:**

rating : Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

timestamp : Represents seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
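
Because `timestamp` is stored as Unix seconds, it can be converted to a calendar date whenever a readable date or a rating year is needed. The following is a minimal sketch on a small hypothetical frame called `sample` (not the real `train` data) that only mimics the column layout described above:

```python
import pandas as pd

# Hypothetical rows that mimic the columns of train.csv
sample = pd.DataFrame({
    "userId": [1, 2],
    "movieId": [10, 20],
    "rating": [4.0, 3.5],
    "timestamp": [0, 1_000_000_000],
})

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s")

# Derive the calendar year in which each rating was made
sample["rating_year"] = sample["rated_at"].dt.year

print(sample[["timestamp", "rated_at", "rating_year"]])
```

The same two lines could be applied to the `timestamp` column of `train`, at the cost of one extra datetime column in memory.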

In [None]:
#take a look at the movies data
df_movies.head(10)

In [None]:
all_genres = df_movies['genres'].str.split('|').tolist()

# Flatten the list of lists into a single list
all_genres_flat = [genre for sublist in all_genres for genre in sublist]

# Get unique genres
unique_genres = list(set(all_genres_flat))

# Print the unique genres
print(unique_genres)

**Movies:**
 
* movieId : Identifier for movies used


* title : These were entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.


* genres: Genres are a pipe-separated list, and are selected from the following:
  * Action
  * Adventure
  * Animation
  * Children's
  * Comedy
  * Crime
  * Documentary
  * Drama
  * Fantasy
  * Film-Noir
  * Horror
  * Musical
  * Mystery
  * Romance
  * Sci-Fi
  * Thriller
  * War
  * Western
  * (no genres listed)

In [None]:
#Viewing imdb dataframe

df_imdb.head(1)

In [None]:
#Let's take a look at Genome tags
df_gtags.head(1)

In [None]:
#Let's take a look at scores
df_scores.head(10)

In [None]:
#Let's take a look at tags
df_tags.head(10)

In [None]:
#Let's take a look at links
df_links.head(10)

**Links:**

* movieId : Identifier for movies used by https://movielens.org
* imdbId : Identifier for movies used by http://www.imdb.com
* tmdbId : An identifier for movies used by https://www.themoviedb.org.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Split the genres by "|" and create a list of all genres
all_genres = df_movies['genres'].str.split('|').tolist()

# Flatten the list of lists into a single list
all_genres_flat = [genre for sublist in all_genres for genre in sublist]

# Create a Pandas Series from the flattened list
genres_series = pd.Series(all_genres_flat)

# Count the occurrences of each genre
genre_counts = genres_series.value_counts()

# Plot the bar chart
plt.figure(figsize=(10, 6))
genre_counts.plot(kind='bar')
plt.title('Most Common Genres')
plt.xlabel('Genres')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

# Data Preprocessing
**Preparing raw data:**

We will first prepare the raw data to make it suitable for our machine learning model. This is a crucial step when creating a machine learning model. Data preprocessing will be done before the EDA to briefly view the data structure and properties, and to perform merges of the datasets in preparation for the more in-depth analysis in the EDA section that follows.

# Checking for missing values column wise
**Handling Missing Data:**

In our dataset, there may be some missing values. We cannot train our model with a dataset that contains missing values. So we have to check if our dataset has missing values.

In [None]:
#check for missing values
train.isnull().sum()
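
A count of missing values is easier to judge next to the share of rows it represents. The sketch below is illustrative only: it uses a small made-up frame called `toy` (its column names are assumptions, not the real schema) to show how counts and percentages can be reported together.

```python
import numpy as np
import pandas as pd

# Made-up frame with some missing entries, for illustration only
toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "budget": ["$1,000", None, "$2,500", None],
    "runtime": [90.0, 101.0, np.nan, 88.0],
})

# Number of missing values per column
missing_count = toy.isnull().sum()

# Share of rows that are missing per column, as a percentage
missing_share = toy.isnull().mean().mul(100).round(1)

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary)
```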

# Checking for duplicate records
**Checking Duplicate Values:**

At times our dataset may contain duplicated records. These add no information, so they must be identified and removed.

In [None]:
# check duplicates
dup_bool = train.duplicated(['userId', 'movieId', 'rating'])

# display the number of duplicates (the vectorised sum is much faster than Python's built-in sum on 10 million rows)
print("Number of duplicate records:", dup_bool.sum())
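
To make the behaviour of `duplicated` concrete, here is a small sketch on made-up data (the frame `toy` is hypothetical). It shows which rows are flagged when only `userId`, `movieId` and `rating` are compared, and how the flagged rows could be dropped if any were found.

```python
import pandas as pd

# Made-up ratings: the second row repeats the first except for the timestamp
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [10, 10, 10, 20],
    "rating": [4.0, 4.0, 3.0, 5.0],
    "timestamp": [100, 200, 300, 400],
})

key = ["userId", "movieId", "rating"]

# True for the second and later occurrences of the same key
dup_mask = toy.duplicated(subset=key)
print("Number of duplicate records:", dup_mask.sum())

# Keep the first occurrence of every key and renumber the rows
deduped = toy.drop_duplicates(subset=key, keep="first").reset_index(drop=True)
print(deduped)
```

Note that comparing only a subset of columns treats two ratings with different timestamps as duplicates, which may or may not be what is wanted.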

# Creating a copy
We will make a copy of our train data and look at the top 10 records in the DataFrame.

In [None]:
# Create a copy
df = train.copy()

In [None]:
# Create a copy of the train data
df_train = train.copy()

# Display top 10 records
df_train.head(10)

# Evaluating Length of Unique Values

In [None]:
# Count the unique users and the unique movies in the training data
df_train['userId'].nunique(), df_train['movieId'].nunique()
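
The two counts above determine how sparse the user-movie rating matrix is, which matters for collaborative filtering. As a rough sketch on made-up data (the frame `toy` is hypothetical, and the calculation assumes each user rates a movie at most once), the sparsity can be estimated as follows:

```python
import pandas as pd

# Made-up ratings: 3 users, 3 movies, 4 observed ratings
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 3.0, 5.0, 2.5],
})

n_users = toy["userId"].nunique()
n_movies = toy["movieId"].nunique()

# Fraction of the user x movie grid that actually holds a rating
density = len(toy) / (n_users * n_movies)
sparsity = 1 - density

print(f"users={n_users}, movies={n_movies}, sparsity={sparsity:.2%}")
```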

In [None]:
# View movies
df_movies.head()

# Evaluating unique values for movies

In [None]:
# Count the unique movies
df_movies['movieId'].nunique()

# Joining Datasets

In [None]:
# Merge the ratings and movies
df_merge1 = df_train.merge(df_movies, on='movieId')
# View the first 5 rows
df_merge1.head()

In [None]:
# Merge the ratings with the IMDb data
df_merge2 = df_train.merge(df_imdb, on="movieId")
# View first 5 rows
df_merge2.head()

In [None]:
# Merge the earlier ratings-and-movies merge with the IMDb data
df_merge3 = df_merge1.merge(df_imdb, on="movieId" )
# View first 5 rows
df_merge3.head()

In [None]:
# Check the null values of the data that has just been merged.
df_merge3.isnull().sum()

In [None]:
# View keywords
df_merge3['plot_keywords'].tail(100)

In [None]:
# Extract unique values from rating column
train['rating'].unique()

**Merging More Datasets**

The following code was intended to merge all of the available datasets. However, we get an error indicating that the result is too large, so we will have to either reduce the dataset or find an alternative approach.

In [None]:
# Merge the ratings with the IMDb data
df_merge2 = df_train.merge(df_imdb, on="movieId")
df_merge2.head(1)

In [None]:
# Merge the earlier ratings-and-movies merge with the IMDb data
df_merge3 = df_merge1.merge(df_imdb, on="movieId" )
df_merge3.head()
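
One common way to make large merges more manageable is to shrink the ratings table before joining, by dropping columns that are not needed and storing the rest in smaller numeric types. The sketch below only illustrates the idea on a tiny made-up frame called `ratings`; whether it is enough for the full dataset is an assumption that would have to be checked.

```python
import numpy as np
import pandas as pd

# Made-up ratings stored with pandas' default 64-bit types
ratings = pd.DataFrame({
    "userId": np.array([1, 2, 3], dtype="int64"),
    "movieId": np.array([10, 20, 30], dtype="int64"),
    "rating": np.array([4.0, 3.5, 5.0], dtype="float64"),
    "timestamp": np.array([100, 200, 300], dtype="int64"),
})
before = ratings.memory_usage(deep=True).sum()

# Drop the column that is not needed for the merge and downcast the rest.
# Half-star ratings are represented exactly in float32.
slim = ratings.drop(columns="timestamp").astype(
    {"userId": "int32", "movieId": "int32", "rating": "float32"}
)
after = slim.memory_usage(deep=True).sum()

print(f"bytes before: {before}, bytes after: {after}")
```

Selecting only the metadata columns that are actually used (for example `df_imdb[['movieId', 'director']]`) before merging reduces the size of the result in the same spirit.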

# Exploratory data analysis (EDA)

![image.png](attachment:5157135e-5994-4937-b0e0-ae1a60104798.png)

**Ratings Distribution**

In [None]:
# Get summary statistics of rating
train['rating'].describe()

In [None]:
data = df_merge1['rating'].value_counts().sort_index(ascending=False)

# Create trace
trace = go.Bar(x=data.index,
               text=['{:.1f} %'.format(val) for val in (data.values / df_merge1.shape[0] * 100)],
               textposition='auto',
               textfont=dict(color='#000000'),
               y=data.values,
               marker=dict(color='#db0000'))
# Create layout
layout = dict(title='Distribution Of {} movie-ratings'.format(df_merge1.shape[0]),
              xaxis=dict(title='rating'),
              yaxis=dict(title='Count'))
# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

**Observations:**

* We can observe that a high percentage of the ratings are above 3.
* A low percentage of the ratings are below 3.
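
The shares behind these observations can be computed directly rather than read off the chart. A minimal sketch, using a short made-up series `toy_ratings` in place of the real `rating` column:

```python
import pandas as pd

# Made-up ratings on the 0.5 to 5.0 half-star scale
toy_ratings = pd.Series([5.0, 4.0, 3.0, 2.5, 4.5, 3.5, 1.0, 4.0])

# The mean of a boolean mask is the fraction of True values
share_above_3 = (toy_ratings > 3).mean() * 100
share_equal_3 = (toy_ratings == 3).mean() * 100
share_below_3 = (toy_ratings < 3).mean() * 100

print(f"above 3: {share_above_3:.1f} %")
print(f"equal to 3: {share_equal_3:.1f} %")
print(f"below 3: {share_below_3:.1f} %")
```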

**Recommendations:**

* Most ratings are high. This may be because people tend to watch movies that were recommended to them, either by their social groups or by the recommender system itself.

In [None]:
# Count the number of ratings each movie received
num_ratings = df_merge3.groupby('movieId')['rating'].count().reset_index(name='numRatings')
# Attach the per-movie rating count to df_merge3 as the numRatings column
df_merge3 = pd.merge(left=df_merge3, right=num_ratings, on='movieId')

In [None]:
# pre_process the budget column

# remove commas
df_merge3['budget'] = df_merge3['budget'].str.replace(',', '')

# keep only the digits, dropping currency markers such as "$" and "GBP"
df_merge3['budget'] = df_merge3['budget'].str.extract(r'(\d+)', expand=False)

# convert the feature into a float
df_merge3['budget'] = df_merge3['budget'].astype(float)

# replace NaN values with 0
df_merge3['budget'] = df_merge3['budget'].fillna(0)

# convert the feature into an integer
df_merge3['budget'] = df_merge3['budget'].astype(int)
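
To see what the budget clean-up does to individual values, the same steps can be run on a handful of made-up strings. The series `budgets` below is hypothetical and is chosen to cover a dollar amount, a GBP amount, a missing value and a non-numeric entry.

```python
import pandas as pd

# Made-up budget strings covering the main cases
budgets = pd.Series(["$1,000,000", "GBP500,000", None, "unknown"])

cleaned = (
    budgets.str.replace(",", "", regex=False)   # remove thousands separators
           .str.extract(r"(\d+)", expand=False)  # keep the first run of digits
           .astype(float)                        # missing values become NaN
           .fillna(0)                            # treat unknown budgets as 0
           .astype(int)
)

print(cleaned.tolist())
```

Two limitations are worth keeping in mind: amounts in different currencies end up on the same numeric scale, and a budget of 0 cannot be told apart from a missing budget.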

In [None]:
#extract the four-digit release year from the "(YYYY)" part of the title
df_merge3['release_year'] = df_merge3.title.str.extract(r'\((\d{4})\)', expand=False)

#view top 2 rows of the dataframe
df_merge3.head(2)

In [None]:
#keep one row per movie by dropping repeated movieId values
data_1 = df_merge3.drop_duplicates('movieId')

#view top 2 rows of the dataframe
data_1.head(2)

In [None]:
#create ratings dataframe
ratings_df = pd.DataFrame()

#extract average ratings
ratings_df['Mean_Rating'] = df_merge3.groupby('title')['rating'].mean().values

#extract the number of ratings per title
ratings_df['Num_Ratings'] = df_merge3.groupby('title')['rating'].count().values

#make a plot
fig, ax = plt.subplots(figsize=(14, 7))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Rating vs. Number of Ratings', fontsize=24, pad=20)
ax.set_xlabel('Rating', fontsize=16, labelpad=20)
ax.set_ylabel('Number of Ratings', fontsize=16, labelpad=20)

plt.scatter(ratings_df['Mean_Rating'], ratings_df['Num_Ratings'], alpha=0.5, color='purple')

Movies with more ratings tend to have higher average ratings. One possible explanation is that a movie watched by more and more people probably has a good budget and good marketing, and such movies are also highly rated.
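
The visual impression from the scatter plot can be backed by a number. As a hedged sketch on made-up data (the frame `toy` is hypothetical), the per-title mean and count can be computed in one pass and their Pearson correlation reported:

```python
import pandas as pd

# Made-up ratings for three titles with different popularity
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B", "C"],
    "rating": [4.0, 5.0, 4.5, 3.0, 3.5, 2.0],
})

# Mean rating and number of ratings per title, aligned by title
stats = toy.groupby("title")["rating"].agg(Mean_Rating="mean", Num_Ratings="count")

# Pearson correlation between the two per-title statistics
corr = stats["Mean_Rating"].corr(stats["Num_Ratings"])

print(stats)
print(f"correlation: {corr:.2f}")
```

A positive coefficient on the real data would support the observation, although it would not by itself say anything about budgets or marketing.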

# **Visualising Genres**

The genres variable will surely be important while building the recommendation engines since it describes the content of the film (e.g. Animation, Horror, Sci-Fi). A basic assumption is that films in the same genre should have similar content. Let's see exactly which genres are the most popular.

![image.png](attachment:8f17a784-2415-471d-bfee-96e173f44768.png)

In [None]:
movies = df_movies.copy()

In [None]:
# Make a census of the genre keywords
genre_labels = set()
for s in movies['genres'].str.split('|').values:
    genre_labels = genre_labels.union(set(s))

# Function that counts the number of times each of the genre keywords appear
def count_word(dataset, ref_col, census):
    """
    Count the number of times each genre keyword appears.

    Input : dataset (DataFrame), ref_col (name of the
    pipe-separated column), census (set of keywords to count)

    Output : tuple of (list of [keyword, count] pairs sorted
    by decreasing count, dict mapping keyword to count)
    """
    keyword_count = dict()
    for s in census: 
        keyword_count[s] = 0
    for census_keywords in dataset[ref_col].str.split('|'):        
        if type(census_keywords) == float and pd.isnull(census_keywords): 
            continue        
        for s in [s for s in census_keywords if s in census]: 
            if pd.notnull(s): 
                keyword_count[s] += 1
    
    # convert the dictionary in a list to sort the keywords by frequency
    keyword_occurences = []
    for k,v in keyword_count.items():
        keyword_occurences.append([k,v])
    keyword_occurences.sort(key = lambda x:x[1], reverse = True)
    return keyword_occurences, keyword_count

# Calling this function gives access to a list of genre keywords which are sorted by decreasing frequency
keyword_occurences, dum = count_word(movies, 'genres', genre_labels)
keyword_occurences[:5]

The top 5 genres are: Drama, Comedy, Thriller, Romance and Action.
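
The same genre census can also be obtained with built-in pandas operations. The sketch below is an alternative to `count_word`, not a replacement chosen by the authors, and it runs on a made-up frame `toy_movies`:

```python
import pandas as pd

# Made-up movies with pipe-separated genres
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Drama|Comedy", "Drama", "Action|Drama|Thriller"],
})

# Split each string into a list, give every genre its own row, then count
genre_counts = toy_movies["genres"].str.split("|").explode().value_counts()

print(genre_counts)
```

`value_counts` already returns the genres sorted by decreasing frequency, so `genre_counts.head(5)` would give the top five.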

![image.png](attachment:ada07504-af87-43d1-a5f6-de5dfb713470.png)

We will show this on a wordcloud in order to make it more visually appealing.

In [None]:
# Import new libraries
from wordcloud import WordCloud, STOPWORDS
# Define the dictionary used to produce the genre wordcloud
genres = dict()
trunc_occurences = keyword_occurences[0:18]
for s in trunc_occurences:
    genres[s[0]] = s[1]

# Create the wordcloud
genre_wordcloud = WordCloud(width=1000,height=400, background_color='white')
genre_wordcloud.generate_from_frequencies(genres)

# Plot the wordcloud
f, ax = plt.subplots(figsize=(16, 8))
plt.imshow(genre_wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
df = df_imdb[['movieId','title_cast','director', 'plot_keywords']]
df = df.merge(df_movies[['movieId', 'genres', 'title']], on='movieId', how='inner')
df.head()

In [None]:
# Convert data types to strings for string handling
df['title_cast'] = df.title_cast.astype(str)
df['plot_keywords'] = df.plot_keywords.astype(str)
df['genres'] = df.genres.astype(str)
df['director'] = df.director.astype(str)

# Removing spaces between names
df['director'] = df['director'].apply(lambda x: "".join(x.lower() for x in x.split()))
df['title_cast'] = df['title_cast'].apply(lambda x: "".join(x.lower() for x in x.split()))

# Discarding the pipes between the actors' full names and getting only the first three names
df['title_cast'] = df['title_cast'].map(lambda x: x.split('|')[:3])

# Discarding the pipes between the plot keywords' and getting only the first five words
df['plot_keywords'] = df['plot_keywords'].map(lambda x: x.split('|')[:5])
df['plot_keywords'] = df['plot_keywords'].apply(lambda x: " ".join(x))

# Discarding the pipes between the genres 
df['genres'] = df['genres'].map(lambda x: x.lower().split('|'))
df['genres'] = df['genres'].apply(lambda x: " ".join(x))

df.head()
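
Cleaned text columns like these are typically combined into one string per movie and compared with TF-IDF and cosine similarity, both of which are imported at the top of this notebook. The sketch below shows that idea on three made-up movies; the frame `toy`, the column `soup` and all names in it are hypothetical, and this is not necessarily the exact feature set used later.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up movies in the same shape as the cleaned dataframe above
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "title_cast": [["alicesmith", "bobjones"], ["alicesmith", "carolwhite"], ["danblack"]],
    "director": ["erinbrown", "frankgreen", "gracegrey"],
    "plot_keywords": ["toy friendship", "letters bookstore", "hostage skyscraper"],
    "genres": ["animation comedy", "comedy romance", "action thriller"],
})

# Join all text features of a movie into a single string
toy["soup"] = (
    toy["title_cast"].apply(" ".join) + " " + toy["director"] + " "
    + toy["plot_keywords"] + " " + toy["genres"]
)

# Turn each string into a TF-IDF vector and compare all pairs of movies
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(toy["soup"])
similarity = cosine_similarity(matrix)

print(pd.DataFrame(similarity, index=toy["title"], columns=toy["title"]).round(2))
```

Movies A and B share a cast member and a genre, so they score higher with each other than with Movie C, which shares no terms with either.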

# Genre Popularity

In [None]:
def most_watched(input_df):
    
    """  
    
    This function plots the average yearly number of ratings
    for each genre (years are derived from the rating timestamps)
    
    Input : dataframe
    datatype : dataframe        
    
    output : Figure (bar graph)
    
    """
    # Create a copy of the input dataframe and merge it to the movies dataframe
    df = input_df.copy()
    df = df.merge(df_movies,on='movieId',how='left')
    
    # Create an empty dataframe
    b = pd.DataFrame()
    
    # Extract the timestamps and genres 
    timestamps = [timestamp for timestamp in df.timestamp]
    all_genres = set(','.join([genres.replace('|',',') for genres in df.genres]).split(','))
    
    # Get the number of ratings for each genre for each year since 1970
    for index,genre in enumerate(all_genres):
        a = pd.Series([int((timestamps[i]/31536000)+1970) for i,x in enumerate(df.genres) if genre in x])
        a = a.value_counts()
        b = pd.concat([b,pd.DataFrame({genre:a})],axis=1)
        
    # Plot the average yearly number of ratings for each genre as a bar chart
    plt.figure(figsize=(40,20))
    plot = sns.barplot(data=b, ci=None)
      
    # Add plot labels
    plt.title('Trends in genre popularity',fontsize=20)
    plt.xlabel('Genres', fontsize=15)
    plt.ylabel('Number of ratings', fontsize=15)
    
    plt.show()
    
    return

most_watched(train)

**Observations:**

* It is reasonable to expect that movies with a high number of ratings have also garnered a high number of views.

* Looking at the bar graph, it is clear that Comedy, Drama and Action have the highest number of ratings and therefore, presumably, views. It is therefore advisable to commission more movies in these genres in order to increase viewership, which in turn should increase revenue.

In [None]:
def most_watched(input_df):

    """
    This function gives out the number of ratings
    for each genre for each year since 1970  
    
    Input : dataframe
    datatype : dataframe        
    
    output : Figure (line graph)
    
    """
    
    # Create a copy of the input dataframe and merge it to the movies dataframe
    df = input_df.copy()
    df = df.merge(df_movies,on='movieId',how='left')
    
    # Create an empty dataframe
    b = pd.DataFrame()
    
    # Extract the timestamps and genres 
    timestamps = [timestamp for timestamp in df.timestamp]
    all_genres = set(','.join([genres.replace('|',',') for genres in df.genres]).split(','))
    
    # Get the number of ratings for each genre for each year since 1970
    for index,genre in enumerate(all_genres):
        a = pd.Series([int((timestamps[i]/31536000)+1970) for i,x in enumerate(df.genres) if genre in x])
        a = a.value_counts()
        b = pd.concat([b,pd.DataFrame({genre:a})],axis=1)
    
    # Plot the trends for each genre on the same line graph 
    plt.figure(figsize=(40,20))
    # Use one colour per genre column so the palette always matches the data
    palette = sns.color_palette("mako_r", b.shape[1])
    sns.set()
    sns.set_palette("PuBuGn_d")
    plot = sns.lineplot(data=b, dashes=False, palette=palette)
      
    # Add plot labels
    plt.title('Trends in genre popularity',fontsize=20)
    plt.xlabel('Years', fontsize=15)
    plt.ylabel('Number of ratings', fontsize=15)
    
    plt.show()
    
    return

most_watched(train)

**Observations:**

* It is reasonable to expect that movies with a high number of ratings have also garnered a high number of views.
* Therefore the higher the number of ratings the greater the popularity.
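
The `most_watched` function loops over every genre and every rating in Python, and it approximates the year by dividing the timestamp by 31,536,000 seconds, which ignores leap days. A vectorised alternative is sketched below on made-up data; the frames `toy_ratings` and `toy_movies` are hypothetical, and this is only one possible way to build the same year-by-genre table.

```python
import pandas as pd

# Made-up ratings made on 1 January (UTC) of 2000, 2001, 2001 and 2002
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 2, 3],
    "timestamp": [946684800, 978307200, 978307200, 1009843200],
})
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Comedy|Drama", "Drama", "Action"],
})

merged = toy_ratings.merge(toy_movies, on="movieId", how="left")

# Exact calendar year of each rating
merged["year"] = pd.to_datetime(merged["timestamp"], unit="s").dt.year

# One row per (rating, genre) pair
long_df = (
    merged.assign(genre=merged["genres"].str.split("|"))
          .explode("genre")
          .reset_index(drop=True)
)

# Rows are years, columns are genres, values are numbers of ratings
counts = pd.crosstab(long_df["year"], long_df["genre"])
print(counts)
```

A table of this shape could be passed to `sns.lineplot(data=counts, dashes=False)` in the same way as the dataframe `b` above.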

In [None]:
# Create a wordcloud of the movie titles
movies['title'] = movies['title'].fillna("").astype('str')
title_corpus = ' '.join(movies['title'])
title_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='black', height=2000, width=4000).generate(title_corpus)

# Plot the wordcloud
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud)
plt.axis('off')
plt.show()

**Observations:**

* We can observe that Man, Girl and Love are larger than the rest, which tells us that they are the most common title words.
* II & III are relatively small, which tells us that they are less common than other title words.

**Recommendations:**

* The small size of II, III and Three suggests that there are not many franchise films. This is a concern for Netflix, because franchise films are made precisely because they already have established fanbases, which would add to viewership and improve revenue if brought to the platform.
* Love, La, Girl and Man are the most frequently occurring title words. This matches the popularity line graph, where Romance and Drama are among the top genres. If any changes are observed in the relative popularity of the genres, they should be reflected in the popularity of title words.

# Popular Movies by Genre

In [None]:
# Build a long table with one row per (movieId, genre) pair
genre_df = df_merge3[['movieId', 'genres']].copy()
genre_df['Genre'] = genre_df['genres'].str.split('|')
genre_df = genre_df.explode('Genre')[['movieId', 'Genre']].reset_index(drop=True)

In [None]:
def make_bar_chart(dataset, attribute, bar_color='#3498db', edge_color='#2980b9', title='Title', xlab='X', ylab='Y', sort_index=False):

    """
    This function gives the count of the
    different genres
    
    Input : dataframe, dataframe column,
    colour of figure, title of figure,
    x and y labels
    datatype : dataframe        
    
    output : Figure (bar plot)
    
    """
    
    # Count each category once, optionally ordering the bars by category label
    counts = dataset[attribute].value_counts()
    if sort_index:
        counts = counts.sort_index()
    xs = counts.index
    ys = counts.values
    
    # Plotting the figure   
    fig, ax = plt.subplots(figsize=(14, 7))
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.set_title(title, fontsize=24, pad=20)
    ax.set_xlabel(xlab, fontsize=16, labelpad=20)
    ax.set_ylabel(ylab, fontsize=16, labelpad=20)
    # Plot the bar graph
    plt.bar(x=xs, height=ys, color=bar_color, edgecolor=edge_color, linewidth=2)
    plt.xticks(rotation=45)
    
# Show the bar chart with selected features    
make_bar_chart(genre_df, 'Genre', title='Most Popular Movie Genres', xlab='Genre', ylab='Counts')

**Observations:**

* Drama, Comedy, Action, Thriller and Adventure are the top 5 genres in the dataset.

**Recommendations:**

* Netflix should endeavor to match the number of movies available in each genre to the popularity of that genre so as to maximise views, which in turn will maximise revenue.

# Movies Published per Year

In [None]:
# Create an empty list
years = []
# Extract the release year from the "(YYYY)" suffix of each title
for title in df_merge3['title']:
    year_subset = title[-5:-1]
    try:
        years.append(int(year_subset))
    except ValueError:
        # Use 9999 as a sentinel for titles without a parsable year
        years.append(9999)
# Create a new column in the dataframe
df_merge3['moviePubYear'] = years
print('Number of rows without a parsable release year:', len(df_merge3[df_merge3['moviePubYear'] == 9999]))

In [None]:
def make_histogram(dataset, attribute, bins=25, bar_color='#3498db', edge_color='#2980b9', 
                   title='Title', xlab='X', ylab='Y', sort_index=False):
    """
    This function plots a histogram of the
    number of movies published per year
    
    Input : dataframe, dataframe column,
    bins, colour of figure, title of figure,
    x and y labels
    datatype : dataframe        
    
    output : Figure (histogram)
    
    """
    if attribute == 'moviePubYear':
        dataset = dataset[dataset['moviePubYear'] != 9999]
        
    fig, ax = plt.subplots(figsize=(14, 7))
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.set_title(title, fontsize=24, pad=20)
    ax.set_xlabel(xlab, fontsize=16, labelpad=20)
    ax.set_ylabel(ylab, fontsize=16, labelpad=20)
    
    plt.hist(dataset[attribute], bins=bins, color=bar_color, ec=edge_color, linewidth=2)
    
    plt.xticks(rotation=45)
    
    
make_histogram(df_merge3, 'moviePubYear', title='Movies Published per Year', xlab='Year', ylab='Counts')

**Observations:**

* We observe a decrease in the number of movies published per year from 2000 onwards.

**Recommendations:**

* It is not clear what accounts for the decrease in movies published, but possible reasons for this change include the economic downturns of the early 2000s and of 2008-2009.
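
Because `df_merge3` holds one row per rating, a histogram of its `moviePubYear` column weights each movie by the number of ratings it received. The sketch below, on a made-up frame `toy`, shows the difference between counting ratings and counting unique movies per release year. It is an illustration of the distinction, not a statement about what the real data shows.

```python
import pandas as pd

# Made-up merged data: movie 1 has three ratings, movies 2 and 3 have one each
toy = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 3],
    "title": ["Old Hit (1995)"] * 3 + ["Newer Film (2005)", "Another Film (2005)"],
})

# Release year taken from a "(YYYY)" group at the end of the title
toy["moviePubYear"] = pd.to_numeric(
    toy["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
)

# Counting rows counts ratings, so heavily rated movies dominate
ratings_per_year = toy["moviePubYear"].value_counts().sort_index()

# Keeping one row per movie counts movies instead
movies_per_year = (
    toy.drop_duplicates("movieId")["moviePubYear"].value_counts().sort_index()
)

print(ratings_per_year)
print(movies_per_year)
```

On the real data, passing `df_merge3.drop_duplicates('movieId')` to `make_histogram` would give the per-movie view.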

In [None]:
df_merge3.head(2)

In [None]:
# Create a dataframe with the number of movies that the directors have made
director_m = pd.DataFrame(data_1.groupby('director').count()['title'].sort_values(ascending=False)).reset_index()
# View the first five directors
director_m.head()

In [None]:
# the least and most number of movies by directors in the dataset.

print(f'No of unique movies: \t{len(data_1)}\nLeast produced: \t{director_m.title.min()}\nMost produced: \t\t{director_m.title.max()}')

In [None]:
# View the total number of ratings per director (sum only the numRatings column)
pd.DataFrame(data_1.groupby('director')['numRatings'].sum().sort_values(ascending=False)).reset_index()

# Number of ratings per director

In [None]:
# Total number of ratings received by each director's movies
director_n = pd.DataFrame(data_1.groupby('director')['numRatings'].sum().sort_values(ascending=False)).reset_index()

# visualize the number of ratings per director
plt.figure(figsize = (14, 9.5))
sns.barplot(data = director_n.head(50), y = 'director', x = 'numRatings', color = 'purple')
plt.ylabel('Directors')
plt.xlabel('Number of ratings')
plt.title('Number of ratings per director\n')
#plt.xlim(0, 27)
plt.show()

# Number of movies per director

In [None]:
# visualize the number of movies per director
plt.figure(figsize = (14, 9.5))
sns.barplot(data = director_m.head(50), y = 'director', x = 'title', color = 'blue')
plt.ylabel('Directors')
plt.xlabel('Number of movies released')
plt.title('Number of Movies released per director\n')
plt.xlim(0, 27)
plt.show()

**Observations:**

* The bar chart reflects the number of movies per director. The chart is laid out so that the directors with the most movies appear first, in descending order of the number of movies per director.
* Luc Besson is the most prolific director in the dataset.
* We can see that Stephen King (23 movies released) and Edward Burns (9 movies released) are outperforming the other directors, including Luc Besson, who has directed the most movies (26).

In [None]:
# Create a list from the dataframe of the movie count of directors
top_100_produced = list(director_m.director.head(100))

top_produced = data_1[data_1['director'].isin(top_100_produced)]

In [None]:
# find the mean rating for the movies
avg_per_director = df_merge3[['rating','movieId']].groupby('movieId').mean().reset_index()

In [None]:
print(f'Min_rating: \t{top_produced.rating.min()}\nMax_rating: \t{top_produced.rating.max()}\nMean_rating: \t{round(top_produced.rating.mean(),2)}')

In [None]:
# Print the rows with the lowest and highest value of a given column.
def find_minmax(x):
    # use 'idxmin' to find the index of the movie with the lowest value of x.
    min_index = data_1[x].idxmin()
    # use 'idxmax' to find the index of the movie with the highest value of x.
    high_index = data_1[x].idxmax()
    high = pd.DataFrame(data_1.loc[high_index,:])
    low = pd.DataFrame(data_1.loc[min_index,:])
    
    # print the titles of the movies with the highest and lowest value of x
    print("Movie Which Has Highest "+ x + " : ",data_1['title'][high_index])
    print("Movie Which Has Lowest "+ x + "  : ",data_1['title'][min_index])
    return pd.concat([high,low],axis = 1)

#call the find_minmax function.
find_minmax('budget')

# Top Ten Budget Movies

In [None]:
# Plot the 10 highest-budget movies.
# Sort the 'budget' column in descending order and store it in a new dataframe.
info = pd.DataFrame(data_1['budget'].sort_values(ascending = False))
info['title'] = data_1['title']
data = list(map(str,(info['title'])))

# Extract the top 10 budget movies from the list and dataframe.
x = list(data[:10])
y = list(info['budget'][:10])

# Set the figure size and style first so that they apply to this plot.
sns.set(rc={'figure.figsize':(10,5)})
sns.set_style("darkgrid")

# Draw the point plot and set the title and labels.
ax = sns.pointplot(x=y,y=x)
ax.set_title("Top 10 High Budget Movies",fontsize = 15)
ax.set_xlabel("Budget",fontsize = 13)

**Observations:**

This point plot shows the 10 highest-budget movies, of which My Way has the highest budget.

# Top Ten Longest Movies

In [None]:
# Plot the 10 movies with the longest runtime.
# Sort the 'runtime' column in descending order and store it in a new dataframe.
info = pd.DataFrame(data_1['runtime'].sort_values(ascending = False))
info['title'] = data_1['title']
data = list(map(str,(info['title'])))

# Extract the top 10 longest movies from the list and dataframe.
x = list(data[:10])
y = list(info['runtime'][:10])

# Set the figure size and style first so that they apply to this plot.
sns.set(rc={'figure.figsize':(10,5)})
sns.set_style("darkgrid")

# Draw the point plot and set the title and labels.
ax = sns.pointplot(x=y,y=x)
ax.set_title("Top 10 Longest Movies",fontsize = 15)
ax.set_xlabel("Runtime",fontsize = 13)

**Observations:**

This point plot shows the 10 longest movies, of which Taken is the longest.

# Average Runtime Per Annum

In [None]:
from sklearn.linear_model import LinearRegression

# create a dataframe with runtime data
runtime_data = pd.DataFrame(data_1.dropna().groupby('release_year')['runtime'].mean()).dropna()
runtime_data.index = runtime_data.index.astype('int')
runtime_data = runtime_data[runtime_data['runtime']>1].copy()

# train a linear regression model for the trend
lrm = LinearRegression()
runtime_data = runtime_data.reset_index()
lrm.fit(runtime_data.release_year.values.reshape(-1,1),runtime_data.runtime.values.reshape(-1,1))

# make predictions
runtime_data['regression'] = lrm.predict(runtime_data.release_year.values.reshape(-1,1))

# visualize the runtime per annum
runtime_data = runtime_data.set_index('release_year')
runtime_data.plot(figsize=(10,6))
plt.title("Average runtime per annum\n")
plt.ylabel("Average runtime")
plt.show()

**Observation:**

* The average runtime of movies increased consistently from 1900 to the early 1990s, after which it plateaued until the last year of the dataset. This could be due to the decrease in the crisis of studio lot time, which allows movies to be made longer due to the reduction in cast.

* There is high variation from 1900 to the early 1990s and less variation until the end of the dataset. This could be due to the fact that movies produced in the 90s were very short.

**Recommendations:**

* As seen in the line graph, movie runtime varied widely in the earlier years, which could be attributed to the lack of technology through which feedback could be given to producers or directors. From around 2000 the runtime has stabilised, which could be because widespread internet access now lets viewers voice their opinions.
* The average runtime declines slightly in the later 2000s, as seen on the graph, so producers should consider producing shorter movies.

# Average Budget Per Annum

In [None]:
# create a dataframe with budget data
budget_data = pd.DataFrame(data_1.dropna().groupby('release_year')['budget'].mean()).dropna()
budget_data.index = budget_data.index.astype('int')
budget_data = budget_data[budget_data['budget']>1].copy()

# keep only the years after 1982
budget_data = budget_data.reset_index()
budget_data = budget_data[budget_data['release_year']>1982]

# # visualize the budget per annum
plt.figure(figsize=(10,5))
sns.barplot(x='release_year',y='budget',data=budget_data, color='darkblue')
plt.title("Average budget per annum\n")
plt.ylabel("Average budget")
plt.xticks(rotation=45)
plt.show()

**Observations:**

* There is an observable upward trend in the average film budget, which could be due to special effects and CGI.

**Recommendations**:

* This trend is particularly relevant for streaming services: networks and production houses are launching their own competing streaming services and removing their content from existing ones, which puts pressure on streaming services to create more in-house content.

# Genre With Highest Release

In [None]:
def count_genre(x):
    
    """
    Split the pipe-separated strings in a column
    and return the count of each genre.
    
    Input : name of a dataframe column
    datatype : str
    
    Output : pandas Series of counts per genre
    
    """
    
    # concatenate all the rows of the genres column.
    data_plot = data_1[x].str.cat(sep = '|')
    data = pd.Series(data_plot.split('|'))
    # count each genre and return the result.
    info = data.value_counts(ascending=False)
    return info

#call the function for counting the movies of each genre.
total_genre_movies = count_genre('genres')
#plot a 'barh' plot using plot function for 'genre vs number of movies'.
total_genre_movies.plot(kind= 'barh',figsize = (13,6),fontsize=12,colormap='tab20c')

#setup the title and the labels of the plot.
plt.title("Genre With Highest Release",fontsize=15)
plt.xlabel('Number Of Movies',fontsize=13)
plt.ylabel("Genres",fontsize= 13)
sns.set_style("whitegrid")

**Observations:**

* Drama, Comedy, Action, Thriller and Adventure are the top 5 genres in the dataset.

**Recommendations:**

* Netflix should endeavor to match the quantity of movies available in each genre to the popularity of that genre so as to maximise views, which in turn will maximise revenue.

In [None]:
# Create a list of [genre, count] pairs
genre_count = [[genre, count] for genre, count in total_genre_movies.items()]
# Set up the figure
plt.rc('font', weight='bold')
f, ax = plt.subplots(figsize=(10, 10))
genre_count.sort(key = lambda x:x[1], reverse = True)
labels, sizes = zip(*genre_count)
# Label only the genres that make up more than 1% of the total
labels_selected = [n if v > sum(sizes) * 0.01 else '' for n, v in genre_count]
# Plot the pie chart
ax.pie(sizes, labels=labels_selected,
       autopct = lambda x:'{:2.0f}%'.format(x) if x > 1 else '',
       shadow=False, startangle=0)
ax.axis('equal')
plt.tight_layout()

The Pie Chart represents the same data as the Bar graph of the movies released with the added benefit of the relative distribution of releases.

# Correlation of Features

In [None]:
def plot_correlation_map( df ):
    
    """
    Plot a correlation heatmap of the numeric
    features in the given dataframe.
    
    Input : df
    datatype : dataframe (numeric columns)
    
    Output : Figure (heatmap)
    
    """
    # Compute the pairwise correlations and plot them as a heatmap
    
    corr = df.corr()
    _ , ax = plt.subplots( figsize =( 12 , 10 ) )
    cmap = sns.diverging_palette( 240 , 10 , as_cmap = True )
    _ = sns.heatmap(corr,cmap = cmap,square=True, cbar_kws={ 'shrink' : .9 }, ax=ax, annot = True, annot_kws = { 'fontsize' : 12 })

In [None]:
# Select a number of features from the dataframe to make the correlation map
plot_correlation_map(data_1[['userId','movieId','rating', 'timestamp', 'budget','runtime', 'numRatings']])

We can observe that there aren't any significant positive correlations amongst the features, aside from timestamp and movieId.
There is a very clear correlation between movieId and timestamp. This is possibly because movie IDs are assigned roughly in the order in which movies are added to the dataset, so movies with higher IDs can only have been rated at later timestamps.

# Content-Based Filtering and Collaborative Filtering

**What Is Content-based Filtering?**

This filtering is based on the description or other data provided for a product. The system finds the similarity between items based on their description or context. The user’s historical preferences are taken into account to find products they may like in the future. For instance, if a user likes a movie such as ‘Men in Black’, then we can recommend other ‘Will Smith’ movies or movies in the ‘Sci-fi’ genre.

**Techniques used for our content based filtering:**

We used CountVectorizer, a feature-extraction tool that converts text into vectors of token counts. We chose CountVectorizer over TfidfVectorizer to avoid penalising keywords, directors and genres that occur more frequently: a high count in the dataset does not mean that a word is less important.
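
As a minimal sketch of the difference, the snippet below fits both scikit-learn vectorizers on three made-up feature strings (the strings are illustrative and are not rows from our dataset). The token `action` appears in every string, so it keeps the same count as the rarer token `tomhanks` but receives a lower TF-IDF weight.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical combined-feature strings, for illustration only
docs = [
    "action tomhanks",
    "action timallen",
    "action comedy tomhanks",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs).toarray()

# Column positions of the two tokens in each matrix
c_action, c_tom = count_vec.vocabulary_["action"], count_vec.vocabulary_["tomhanks"]
t_action, t_tom = tfidf_vec.vocabulary_["action"], tfidf_vec.vocabulary_["tomhanks"]

# In the first string both tokens occur once, so their counts are equal ...
print(counts[0, c_action], counts[0, c_tom])
# ... but TF-IDF down-weights 'action' because it occurs in every string
print(weights[0, t_action], weights[0, t_tom])
```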

**Why we don't consider Content-based filtering:**

Content-based recommendation systems have inherent limitations because they do not use data from other users. And because the approach is inherently retrospective, it does not help users discover potential new favourite movies. For instance, let’s say that user X and user Y both like action movies. User Y also likes comedy movies, but because you don’t have that knowledge, you keep offering action movies. Eventually, you’re eliminating other options that user Y might potentially like.

**What is cosine similarity?**

Cosine similarity is a technique for measuring the similarity between two vectors. It calculates the cosine of the angle between them. If the angle is zero, the similarity is 1 because the cosine of zero is 1, which means the two vectors point in the same direction. In general the cosine of an angle ranges from -1 to 1, but count vectors have no negative entries, so here the similarity scores range from 0 to 1. The formula is expressed as follows:

![image.png](attachment:d36462c2-c175-4f45-b329-889259317e1e.png)
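
The calculation can be sketched with NumPy as follows. The vectors are toy count vectors chosen for illustration, and `cosine_sim` is a helper name introduced here; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy count vectors (illustrative only)
movie_a = [1, 1, 0, 2]
movie_b = [2, 2, 0, 4]   # same direction as movie_a, so similarity is 1
movie_c = [0, 0, 3, 0]   # shares no tokens with movie_a, so similarity is 0

print(cosine_sim(movie_a, movie_b))
print(cosine_sim(movie_a, movie_c))
```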


**Advantages:**

* The user gets recommended the types of items they love.
* The user is satisfied by the type of recommendation.
* New items can be recommended; just data for that item is required.

**Disadvantages:**

* The user will never be recommended different types of items.
* Business cannot be expanded as the user does not try a different type of product.
* If the user matrix or item matrix is changed the cosine similarity matrix needs to be calculated again.
* Limited content analysis: If the content doesn’t contain enough information to discriminate the items precisely, the recommendation itself risks being imprecise.
* Over-specialization: Content-based filtering provides a limited degree of novelty since it has to match up the features of a user’s profile with available items. In the case of item-based filtering, only item profiles are created and users are suggested items similar to what they rate or search for, instead of their history. A perfect content-based filtering system may suggest nothing unexpected or surprising.

In [None]:
# Create a copy of a dataframe
movies = df_movies.copy()

In [None]:
# Merge two dataframes
df_1 = df_imdb[['movieId','title_cast','director', 'plot_keywords']]
df_1 = df_1.merge(movies[['movieId', 'genres', 'title']], on='movieId', how='inner')
df_1.head()

In [None]:
# Convert data types to strings for string handling
df_1['title_cast'] = df_1.title_cast.astype(str)
df_1['plot_keywords'] = df_1.plot_keywords.astype(str)
df_1['genres'] = df_1.genres.astype(str)
df_1['director'] = df_1.director.astype(str)

# Removing spaces between names
df_1['director'] = df_1['director'].apply(lambda x: "".join(x.lower() for x in x.split()))
df_1['title_cast'] = df_1['title_cast'].apply(lambda x: "".join(x.lower() for x in x.split()))

# Splitting the pipe-separated cast string into a list of actor names
df_1['title_cast'] = df_1['title_cast'].map(lambda x: x.split('|'))

# Replacing the pipes between the plot keywords with spaces
df_1['plot_keywords'] = df_1['plot_keywords'].map(lambda x: x.split('|'))
df_1['plot_keywords'] = df_1['plot_keywords'].apply(lambda x: " ".join(x))

# Discarding the pipes between the genres 
df_1['genres'] = df_1['genres'].map(lambda x: x.lower().split('|'))
df_1['genres'] = df_1['genres'].apply(lambda x: " ".join(x))

df_1.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# ML Pre processing
from surprise.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hyperparameter tuning
from surprise.model_selection import GridSearchCV

In [None]:
#we convert the title_cast column from list to string
df_1['liststring'] = df_1['title_cast'].apply(lambda x: ','.join(map(str, x)))

#we replace the commas between the names in each row with spaces
df_1['liststring'] = df_1['liststring'].replace(',',' ', regex=True)

#we choose the keywords, cast (liststring), director and genres columns to use as our features
df_features = df_1[['liststring','director','plot_keywords','genres']]

#we combine the feature columns into a single string
df_1['combined_features'] = df_features['liststring'] +' '+ df_features['director'] +' '+ df_features['plot_keywords'] +' '+ df_features['genres']

#we now feed the combined features to a CountVectorizer() object for getting the cv matrix.
cv =CountVectorizer()
cv_matrix = cv.fit_transform(df_1['combined_features'])

#now we obtain the cosine similarity matrix from the cv matrix
sim_score = cosine_similarity(cv_matrix,cv_matrix)

df_1.set_index('title', inplace = True)
indices = pd.Series(df_1.index)

In [None]:
print(sim_score)

In [None]:
# Function to get recommendations
def recommendations(title,n,sim_score = sim_score):
    '''
    This function returns movies which are similar to a given movie.
    
    Input:
        title: name of the movie to be compared
        n: number (quantity) of movies to be returned
        sim_score: similarity score matrix
    Output:
        list of recommended movie titles
    '''
    
    recommended_movies = []
    
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(sim_score[idx]).sort_values(ascending = False)

    # getting the indexes of the n most similar movies (position 0 is normally the movie itself)
    top_n_indexes = list(score_series.iloc[1:n+1].index)
    
    # populating the list with the titles of the best n matching movies
    for i in top_n_indexes:
        recommended_movies.append(list(df_1.index)[i])
        
    return recommended_movies

![image.png](attachment:8c0c15cf-732f-40ae-8c6f-43833e553a7e.png)

Here’s a glimpse of what happens when you call the above function.

In [None]:
recommendations('Innocence (2014)',10)

**Observations:**

* We can observe that the top 10 movies above are similar to the movie Innocence, released in 2014.

**Recommendations:**

* Since a user may not be impressed by a plain list of recommendations, a possible improvement for this content-based recommendation system would be to show only the most recent movies.

* Content-based filtering is not practical, or rather, not very dependable when the number of items increases along with a need for clear and differentiated descriptions.

* To overcome the issues above, we can implement collaborative filtering techniques, which have proven to be better and more scalable.

* The content-based method is also extremely memory intensive, therefore we will focus on collaborative filtering.
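
To make the memory point concrete, here is a rough back-of-the-envelope sketch. It assumes the similarity matrix is stored densely as 64-bit floats, and the catalogue sizes are hypothetical rather than measured from our dataset.

```python
def dense_similarity_bytes(n_movies, bytes_per_value=8):
    """Memory needed for a dense n x n matrix of 64-bit floats."""
    return n_movies * n_movies * bytes_per_value

# Hypothetical catalogue sizes, for illustration only
for n in (1_000, 10_000, 25_000):
    gb = dense_similarity_bytes(n) / 1e9
    print(f"{n:>6} movies -> {gb:.3f} GB")
```

Because the matrix has one entry per pair of movies, the memory requirement grows with the square of the catalogue size.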

# Collaborative Filtering

**What Is Collaborative Filtering?**

Collaborative filtering is a technique that can filter out items that a user might like based on reactions by similar users.

It works by searching a large group of people and finding a smaller set of users with tastes similar to a particular user. It looks at the items they like and combines them to create a ranked list of suggestions.

To be more precise, it is based on similarity in the preferences, tastes and choices of two users. For example, if user A likes movies 1, 2 and 3 and user B likes movies 2, 3 and 4, then they have similar interests, so user A should like movie 4 and user B should like movie 1.
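
The example with users A and B can be sketched in a few lines of Python. The Jaccard overlap used here to quantify "similar interests" is our own illustrative choice, not the similarity measure used by the models later in this notebook.

```python
# Movies liked by each user, taken from the example above
likes = {"A": {1, 2, 3}, "B": {2, 3, 4}}

def jaccard(u, v):
    """Share of liked movies that the two users have in common."""
    return len(likes[u] & likes[v]) / len(likes[u] | likes[v])

def suggest(user, neighbour):
    """Movies the neighbour likes that the user has not liked yet."""
    return likes[neighbour] - likes[user]

print(jaccard("A", "B"))   # overlap between A and B
print(suggest("A", "B"))   # candidate movies for A
print(suggest("B", "A"))   # candidate movies for B
```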

**Why Do We Consider Collaborating Filtering Over Content Based Filtering?**

A collaborative filtering recommender engine is a much better algorithm than content-based filtering since it can do feature learning on its own; in other words, it can learn which features to use.

**Advantages of Collaborative filtering:**

Given that we find collaborative filtering better than content-based filtering, here are a few advantages to support the argument:

* It takes other users' ratings into consideration.
* It doesn't need to study or extract information from a recommended item.
* It adapts to the user's interests, which might change over time.

**About Collaborative Filtering Datasets:**

Note that to implement this or any other recommendation algorithm, we need a dataset structured in a specific format. The data should contain a set of items and a set of users who have reacted to some of the items.

While working with such data, you’ll mostly see it in the form of a matrix consisting of the reactions given by a set of users to some items from a set of items. Each row would contain the ratings given by a user, and each column would contain the ratings received by an item. A matrix with five users and five items could look like this:

![image.png](attachment:b9a5704c-5a13-4307-a8d7-cfe1b8455f75.png)

**Rating Matrix:**


The matrix shows five users who have rated some of the items on a scale of 1 to 5. For example, the first user has given a rating of 4 to the third item. In most cases, the cells in the matrix are empty, as users only rate a few items. It’s highly unlikely for every user to rate or react to every item available. A matrix with mostly empty cells is called sparse, and the opposite (a mostly filled matrix) is called dense.
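
As a small sketch of how such a matrix can be built from ratings in long format, the snippet below pivots a toy table with pandas and measures how sparse it is. The values are made up for illustration; the column names mirror those of our `train` data.

```python
import pandas as pd

# Toy ratings in long format (illustrative values only)
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 30, 10, 20, 30],
    "rating":  [5.0, 4.0, 3.0, 2.0, 4.5],
})

# One row per user, one column per movie; unrated cells become NaN
rating_matrix = toy.pivot(index="userId", columns="movieId", values="rating")
print(rating_matrix)

# Fraction of empty cells in the matrix
sparsity = rating_matrix.isna().to_numpy().mean()
print(f"sparsity: {sparsity:.2f}")
```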

**How do you measure the accuracy of the ratings you calculate?**

There are many approaches, but we will explain the main one that we need for this project: the Root Mean Square Error (RMSE). You predict ratings for a test dataset of user-item pairs whose rating values are already known. The difference between the known value and the predicted value is the error. Square all the error values for the test set, find the average (or mean), and then take the square root of that average to get the RMSE.

![image.png](attachment:fb470fec-1495-40b1-9670-5dc10bd7856c.png)

Another metric to measure the accuracy is Mean Absolute Error (MAE), in which you find the magnitude of error by finding its absolute value and then taking the average of all error values.

However, we will be focusing on the RMSE for our predictions.
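
The two metrics described above can be sketched with NumPy as follows. The ratings are toy values, and `rmse` and `mae` are helper names introduced here for illustration.

```python
import numpy as np

def rmse(actual, predicted):
    """Square the errors, average them, then take the square root."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(actual, predicted):
    """Average of the absolute errors."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(errors)))

# Toy known ratings and predictions (illustrative only)
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(rmse(actual, predicted))  # errors 0.5, 0, 1, -1 give an RMSE of 0.75
print(mae(actual, predicted))   # the same errors give an MAE of 0.625
```

Because the errors are squared before averaging, the RMSE penalises large errors more heavily than the MAE does.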

Before diving deep into the code we would like to clarify the type of collaborative filtering we are going to implement.

Recommender systems are divided into three branches, one of which is collaborative filtering. The figure below gives a clear breakdown.

![image.png](attachment:2de7ab20-4a17-47f5-9ea8-ab6d0e8643d4.png)

You will notice that collaborative filtering consists of two techniques:

* Model-based filtering
* Memory-based filtering

Below is a short description of these techniques.

**Model-based filtering**
* Model-based collaborative filtering algorithms provide item recommendations by first developing a model of user ratings. With these systems you build a model from user ratings and then make recommendations based on that model. This offers a speed and scalability that is not available when you are forced to refer back to the entire dataset to make a prediction.

**Memory-based filtering**
* Memory-based methods rely heavily on simple similarity measures (cosine similarity, Pearson correlation and more) to match similar people or items together. They consist of two methods, namely item-based and user-based collaborative filtering.

# User-Based and Item Based

![image.png](attachment:fed7e0d3-ccd6-460c-ae49-63c9ef95a60d.png)


**User-user collaborative based filtering(UB-CF)**

User-User Collaborative Filtering:

Here we find look-alike users based on similarity and recommend movies that the first user’s look-alikes have chosen in the past. This algorithm is very effective but takes a lot of time and resources, because it requires computing information for every pair of users. Therefore, for large platforms, this algorithm is hard to implement without a strongly parallelizable system.

A specific application of this is the user-based Nearest Neighbor algorithm. This algorithm needs two tasks:

1. Find the K-nearest neighbors (KNN) to the user a, using a similarity function w to measure the distance between each pair of users:

![image.png](attachment:d4731617-460f-429e-8539-f17130ce1854.png)
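
The neighbour search, together with the prediction step that typically follows it, can be sketched in NumPy. This is a simplified illustration under our own assumptions (cosine similarity over co-rated items and a similarity-weighted average); it is not the `KNNWithMeans` implementation used below, which also mean-centres the ratings. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def cosine_on_common(u, v):
    """Cosine similarity using only the items both users have rated."""
    common = ~np.isnan(u) & ~np.isnan(v)
    if not common.any():
        return 0.0
    a, b = u[common], v[common]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_rating(ratings, user, item, k=2):
    """Predict ratings[user, item] from the k most similar users who rated the item."""
    candidates = [
        (cosine_on_common(ratings[user], ratings[other]), other)
        for other in range(ratings.shape[0])
        if other != user and not np.isnan(ratings[other, item])
    ]
    neighbours = sorted(candidates, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in neighbours)
    if not neighbours or total_sim == 0:
        return float(np.nanmean(ratings))  # fall back to the global mean
    return sum(sim * ratings[other, item] for sim, other in neighbours) / total_sim

# Toy matrix: rows are users, columns are movies, NaN means "not rated"
toy_ratings = np.array([
    [5.0, 4.0, np.nan],
    [5.0, 4.0, 5.0],
    [1.0, 2.0, 1.0],
])

# User 0 rates like user 1, so with k=1 the prediction follows user 1's rating
print(predict_rating(toy_ratings, user=0, item=2, k=1))
# With k=2 the less similar user 2 pulls the estimate down
print(predict_rating(toy_ratings, user=0, item=2, k=2))
```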

**Advantages:**

* Easy to implement.
* Context independent.
* Compared to other techniques, such as content-based, it is more accurate.

**Disadvantages**

* Sparsity: The percentage of people who rate items is really low.
* Scalability: The more K neighbors we consider (under a certain threshold), the better the predictions should be. Nevertheless, the more users there are in the system, the greater the cost of finding the K nearest neighbors.
* Cold-start: New users will have little to no information about them to be compared with other users.

In [None]:
# Creating a small test dataframe to evaluate our models
tests = train.copy()
tests.drop(['timestamp'], axis=1, inplace=True)
tests = tests.head(20000)

# Creating the training data
reader = Reader(rating_scale=(0.5, 5))
test_data = Dataset.load_from_df(tests[['userId','movieId','rating']], reader)

# Compute similarities between users using cosine distance
sim_options = {"name": "cosine",
               "user_based": True}  

# Evaluate the model 
user = KNNWithMeans(sim_options=sim_options)
cv = cross_validate(user, test_data, cv=5, measures=['RMSE'], verbose=True)

Using UBCF gives us an RMSE score of 1.1 (based on a 2% sample of the train data).

**Item-item collaborative based filtering**

Item-Item Collaborative Filtering:

It is quite similar to the previous algorithm, but instead of finding a user’s look-alikes, we try to find a movie’s look-alikes. Once we have a movie look-alike matrix, we can easily recommend similar movies to a user who has rated any movie from the dataset. This algorithm consumes far fewer resources than user-user collaborative filtering. Hence, for a new user, the algorithm takes far less time than user-user collaborative filtering, as we don’t need all the similarity scores between users.

In [None]:
# Compute similarities between items using cosine distance
sim_options = {"name": "cosine",
               "user_based": False}  

# Instantiate the KNNWithMeans algorithm with the item-based similarity options
item_based = KNNWithMeans(sim_options=sim_options)

# Evaluate the model 
cv = cross_validate(item_based, test_data, cv=5, measures=['RMSE'], verbose=True)

Using IBCF gives us an RMSE score of 1.08 (based on a 2% sample of the train data), which is only a slight improvement on the UBCF method.

**Singular value decomposition (SVD)**

SVD is the decomposition of a matrix R, the utility matrix with m rows equal to the number of users and n columns equal to the number of items (movies), into the product of three matrices:

* U is a left singular orthogonal matrix, representing the relationship between users and latent factors.
* Σ is a diagonal matrix (with positive real values) describing the strength of each latent factor.
* V(transpose) is a right singular orthogonal matrix, indicating the similarity between items and latent factors.


![image.png](attachment:0b5e4823-fc09-436b-a4cd-d76c93d5480f.png)


Decompose the rating matrix R into a unique product of 3 matrices, with the aim of revealing latent factors in R by minimizing the RMSE:

* r is the rank of R
* U and V are column orthonormal
* V^T has orthonormal rows
* Σ is a diagonal matrix with the singular values

The aim of SVD is to make r smaller by setting the smallest singular values to 0.
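
The effect of dropping the smallest singular values can be sketched with NumPy on a small toy matrix. Note the assumption: `np.linalg.svd` needs a fully observed matrix, so the toy matrix below has no missing ratings, unlike our real (sparse) rating data.

```python
import numpy as np

# Small, fully observed toy rating matrix (4 users x 3 movies)
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# U holds user factors, s the singular values (largest first), Vt the item factors
U, s, Vt = np.linalg.svd(R, full_matrices=False)

def rank_k_approx(k):
    """Rebuild R from the k largest singular values only."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

for k in (1, 2, 3):
    err = np.sqrt(np.mean((R - rank_k_approx(k)) ** 2))
    print(f"k={k}: RMSE of reconstruction = {err:.4f}")
```

Keeping all three singular values reproduces R, while smaller values of k trade reconstruction error for a more compact representation.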

# Model Building

In [None]:
# Loading as Surprise dataframe
df_train = train.copy()
reader = Reader()
# Data selected for model training
data = Dataset.load_from_df(df_train[['userId', 'movieId', 'rating']], reader)

In [None]:
# Data split 99/1
trainset, testset = train_test_split(data, test_size=0.01)

In [None]:
# Check the info of the dataset
df_train.info()

**Base Algorithm**

In [None]:
# Base algorithm
algo_1 = SVD()

In [None]:
# Fitting our trainset
algo_1.fit(trainset)

In [None]:
# Using the 1% testset to make predictions
predictions = algo_1.test(testset)

test = pd.DataFrame(predictions)

In [None]:
# View the test data
test.head()

In [None]:
# We are trying to predict ratings for every userId / movieId pair, we implement the below list comprehension to achieve this.
ratings_predictions=[algo_1.predict(row.userId, row.movieId) for _,row in test_df.iterrows()]
ratings_predictions

In [None]:
# Converting our prediction into a familiar format-Dataframe
df_pred=pd.DataFrame(ratings_predictions)
df_pred

In [None]:
# Renaming our predictions to original names
df_pred=df_pred.rename(columns={'uid':'userId', 'iid':'movieId','est':'rating'})
df_pred.drop(['r_ui','details'],axis=1,inplace=True)

In [None]:
# Snippet of our ratings
df_pred.head()

In [None]:
# Concatenating userId/movieId into a single Id column.
# Building the string column-wise keeps the ids as integers, so the cell only needs to run once.
df_pred['Id'] = df_pred['userId'].astype(int).astype(str) + '_' + df_pred['movieId'].astype(int).astype(str)

In [None]:
# View the first five rows of the dataframe
df_pred.head()

In [None]:
# Drop the columns: 'userId' and 'movieId'
df_pred.drop(['userId', 'movieId'], inplace=True, axis= 1)

In [None]:
# Keep only the columns required for the submission, in the right order
df_pred = df_pred[['Id', 'rating']]

In [None]:
# View the last 5 rows
df_pred.tail()

In [None]:
# View the shape of the dataset to be submitted 
df_pred.shape

In [None]:
df_pred.to_csv('/kaggle/working/submission.csv', index=False)

In [None]:
# The submitted base model
df_pred.to_csv("SVD_model_base.csv", index=False)

**SVD prediction with altered parameters**

**Parameters:**

* n_factors – The number of factors. Default is 100.
* n_epochs – The number of iterations of the SGD procedure. Default is 20.
* init_mean – The mean of the normal distribution for factor vectors initialization. Default is 0.
* init_std_dev – The standard deviation of the normal distribution for factor vectors initialization. Default is 0.1.
* lr_all – The learning rate for all parameters. Default is 0.005.
* reg_all – The regularization term for all parameters. Default is 0.02.
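
To show where these parameters enter the training loop, here is a simplified NumPy sketch of matrix factorisation trained with SGD. It is written for illustration only and is not Surprise's actual implementation; `train_mf`, `train_rmse` and the toy ratings are hypothetical names and values.

```python
import numpy as np

def train_mf(ratings, n_users, n_items, n_factors=2, n_epochs=20,
             lr_all=0.005, reg_all=0.02, init_mean=0.0, init_std_dev=0.1, seed=0):
    """Fit biases and latent factors with plain SGD (simplified sketch)."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))   # global mean rating
    bu, bi = np.zeros(n_users), np.zeros(n_items)     # user and item biases
    # Factor vectors start as draws from a normal(init_mean, init_std_dev)
    pu = rng.normal(init_mean, init_std_dev, (n_users, n_factors))
    qi = rng.normal(init_mean, init_std_dev, (n_items, n_factors))
    for _ in range(n_epochs):                         # one pass over the data per epoch
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + qi[i] @ pu[u])
            # lr_all scales every step; reg_all shrinks every parameter
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            pu_old = pu[u].copy()
            pu[u] += lr_all * (err * qi[i] - reg_all * pu[u])
            qi[i] += lr_all * (err * pu_old - reg_all * qi[i])
    return mu, bu, bi, pu, qi

def train_rmse(model, ratings):
    """RMSE of the model on the ratings it was trained on."""
    mu, bu, bi, pu, qi = model
    sq = [(r - (mu + bu[u] + bi[i] + qi[i] @ pu[u])) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(sq)))

# Toy (user, item, rating) triples, for illustration only
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

before = train_rmse(train_mf(toy_ratings, 3, 3, n_epochs=0), toy_ratings)
after = train_rmse(train_mf(toy_ratings, 3, 3, n_epochs=200, lr_all=0.01), toy_ratings)
print(before, after)
```

The training error falls as the number of epochs grows; on real data this has to be balanced against overfitting, which is what the regularization term is for.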

In [None]:
# Copy of the train dataset
df_train = train.copy()
reader = Reader(rating_scale=(0, 5))
# Data for training the SVD model
sup_data= Dataset.load_from_df(df_train[['userId', 'movieId', 'rating']], reader)

In [None]:
# The full dataset for model training
sup_train = sup_data.build_full_trainset()
# The parameters obtained from randomised search CV
algo_2 = SVD(n_factors = 200, n_epochs = 30, lr_all = 0.005, reg_all = 0.02, init_std_dev=0.02)
# Fit the model
algo_2.fit(sup_train)


In [None]:
# We are trying to predict ratings for every userId / movieId pair, we implement the below list comprehension to achieve this.
ratings_predictions=[algo_2.predict(row.userId, row.movieId) for _,row in test_df.iterrows()]
# View the predictions
ratings_predictions

In [None]:
# Converting our prediction into a familiar format-Dataframe
df_pred=pd.DataFrame(ratings_predictions)
# View the predictions from a dataframe
df_pred

In [None]:
# Renaming our predictions to original names
df_pred=df_pred.rename(columns={'uid':'userId', 'iid':'movieId','est':'rating'})
# Drop the columns not required for the submission
df_pred.drop(['r_ui','details'],axis=1,inplace=True)

In [None]:
# Snippet of our ratings
df_pred.head()

In [None]:
# Concatenating userId/movieId into a single Id column.
# Building the string column-wise keeps the ids as integers, so the cell only needs to run once.
df_pred['Id'] = df_pred['userId'].astype(int).astype(str) + '_' + df_pred['movieId'].astype(int).astype(str)

In [None]:
# View the top 5 rows for the prediction
df_pred.head()

In [None]:
# Drop the features that will not be required for the submission
df_pred.drop(['userId', 'movieId'], inplace=True, axis= 1)

In [None]:
# Dataframe that will be ready for submission
df_pred = df_pred[['Id', 'rating']]

In [None]:
# View the first 5 rows 
df_pred.head()

In [None]:
# Shape of the prediction dataset
df_pred.shape

In [None]:
# Submission final csv. file
df_pred.to_csv('/kaggle/working/submission2.csv', index=False)


In [None]:
# Submission final csv. file
df_pred.to_csv("SVD_altered_params2.csv", index=False)

In [None]:
# Logging the experiment for SVD Model

run_experiment(algo_1, "SVD initial", trainset)

In [None]:
# Logging the experiment for the tuned SVD model

run_experiment(algo_2, "SVD tuned parameters", sup_train)

In [None]:
experiment.end()

   ![image.png](attachment:099c47cf-8503-4434-bb0f-85fb77d0c925.png)

In this notebook, the movie dataset was used to create a recommender system. The dataset draws on movie ratings and movie-specific data dating back over 50 years. The EDA revealed that there was an increase in movie production from 1990 to 2000, which subsequently slowed down in the last few years.

We observed that a high percentage of our movies were rated above 3, with the 3 most frequent ratings being 4, 3 and 5, in that order, which together comprised 50% of total ratings. There are 19 unique movie genres in the dataset, with Drama, Comedy and Thriller being the 3 most popular genres.


In order to produce new recommendations we attempted collaborative filtering methods because they draw only on past interactions between users and items. These methods do not require item metadata like their content-based counterparts. This has the added advantage of adapting to users' interests, which might change over time.

We found that sparsity and scalability were a challenge when we attempted both user-based and item-based memory methods. We settled on singular value decomposition (SVD), a collaborative filtering method that deals with the sparsity we encountered with the user-user and item-item memory-based methods, with the added advantage of being computationally more efficient than the content-based method.


**Possible improvements:**

Collaborative filtering methods have an issue with the cold start problem, which the content based filtering method doesn’t. This problem can be addressed by implementing a hybrid recommender system that uses a combination of both content and collaborative filtering based methods.



   ![image.png](attachment:6edd70b1-8c26-4aff-8879-4e5befdcdbf8.png)