<a href="https://colab.research.google.com/github/Lebo1024/FNBioscope_Recommender_System/blob/main/Recommender_system_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction**

In today’s technology-driven world, recommender systems are socially and economically important because they help people choose among the content they engage with every day. Movie recommendation is a clear example: intelligent algorithms help viewers find great titles among tens of thousands of options.

This notebook walks step by step through building a recommendation algorithm, based on content or collaborative filtering, that predicts how a user will rate a movie they have not yet viewed from their historical preferences.

An accurate and robust solution to this challenge has considerable economic potential: users are exposed to content they would like to view or purchase, which generates revenue and platform affinity.

In [None]:
%pip install scikit-surprise

In [None]:
# data analysis libraries
import pandas as pd
import numpy as np

# Kaggle requirements
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        

# visualisation libraries
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
from numpy.random import RandomState

# word cloud
import wordcloud
from wordcloud import WordCloud, STOPWORDS

# interactive plotting libraries
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.offline as pyo

# plotting defaults
sns.set()
init_notebook_mode(connected=True)


# ML Models
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# ML Pre processing
from surprise.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hyperparameter tuning
from surprise.model_selection import GridSearchCV

# High performance hyperparameter tuning
#from tune_sklearn import TuneSearchCV
#import warnings
#warnings.filterwarnings("ignore")

# **Data Imports**

The expected data sets are as follows:

*   genome_scores.csv - A score mapping the strength between movies and tag-related properties.
*   genome_tags.csv - User-assigned tags for genome-related scores.
*   imdb_data.csv - Additional movie metadata scraped from IMDB using the links.csv file.
*   links.csv - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
*   sample_submission.csv - Sample of the submission format for the hackathon.
*   tags.csv - User-assigned tags for the movies within the dataset.
*   test.csv - The test split of the dataset. Contains user and movie IDs with no rating data.
*   train.csv - The training split of the dataset. Contains user and movie IDs with associated rating data.

In [None]:
test_df = pd.read_csv('/content/drive/MyDrive/edsa-recommender-system-predict/test.csv')
movies = pd.read_csv('/content/drive/MyDrive/edsa-recommender-system-predict/movies.csv')
train = pd.read_csv("/content/drive/MyDrive/edsa-recommender-system-predict/train.csv")
imdb = pd.read_csv('/content/drive/MyDrive/edsa-recommender-system-predict/imdb_data.csv')
gtags = pd.read_csv("/content/drive/MyDrive/edsa-recommender-system-predict/genome_tags.csv")
gscores = pd.read_csv("/content/drive/MyDrive/edsa-recommender-system-predict/genome_scores.csv")
tags = pd.read_csv("/content/drive/MyDrive/edsa-recommender-system-predict/tags.csv")
links = pd.read_csv("/content/drive/MyDrive/edsa-recommender-system-predict/links.csv")

# **Basic Data Analysis**

In [None]:
# Display top 5 rows of dataframe
train.head()

In [None]:
#Viewing movies data
movies.head()

In [None]:
#Viewing imdb dataframe

imdb.head()

In [None]:
#Viewing genome tags
gtags.head()


In [None]:
#Viewing scores
gscores.head()

In [None]:
#viewing tags
tags.head()

In [None]:
#view links
links.head()

# **Data Preprocessing**

Preparing raw data:

We first prepare the raw data to make it suitable for our machine learning model. This is a crucial step when creating a machine learning model.

## **Checking for missing values column-wise**

**Handling Missing Data:**

Our dataset may contain missing values, and we cannot train our model on data that contains them, so we first check each column for missing values.

In [None]:
#check for missing values
train.isnull().sum()
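
The check above can be illustrated on a small table. The sketch below uses a hypothetical toy dataframe (the values are made up, not taken from train.csv) to show the per-column count of missing values and, as an optional extra, the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings table; the values are made up for illustration
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, np.nan, 3.5, np.nan],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Share of missing values in each column, as a percentage
missing_percent = toy.isnull().mean() * 100

print(missing_counts)
print(missing_percent)
```

A column with a count of zero needs no further treatment; any other column has to be imputed or dropped before modelling.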

## **Checking for duplicate records**

In [None]:
# check duplicates
dup_bool = train.duplicated(['userId', 'movieId', 'rating'])

# display duplicates
print("Number of duplicate records:", sum(dup_bool))
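
The check above only flags rows where the user, the movie and the rating are all identical. As a hedged sketch on a hypothetical toy table (made-up values), the example below shows how that differs from checking the user and movie pair alone, which would also catch a user who rated the same movie twice with different scores.

```python
import pandas as pd

# Hypothetical toy ratings table; the values are made up for illustration
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [10, 10, 10, 10],
    "rating": [4.0, 4.0, 3.0, 5.0],
})

# Rows that repeat an earlier (userId, movieId, rating) combination
exact_dups = toy.duplicated(["userId", "movieId", "rating"])

# Rows that repeat an earlier (userId, movieId) pair, whatever the rating
pair_dups = toy.duplicated(["userId", "movieId"])

n_exact = int(exact_dups.sum())
n_pair = int(pair_dups.sum())

print("Exact duplicate records:", n_exact)
print("Repeated user-movie pairs:", n_pair)
```

Which of the two checks is appropriate depends on whether conflicting ratings for the same pair can occur in the data, which this sketch does not establish for train.csv.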

## **Creating a copy**

We make copies of our train data, df and df_train, so that the original stays available, and look at the top 5 records of the copy.

In [None]:
# Create a copy
df = train.copy()

In [None]:
# Create a copy of the train data
df_train = train.copy()

# Display top 5 records
df_train.head()

## **Evaluating Length of Unique Values**

In [None]:
# Count the unique users and the unique movies
len(df_train['userId'].unique()), len(df_train['movieId'].unique())


In [None]:
# View movies
movies.head()

In [None]:
# View unique values of movies
len(movies['movieId'].unique())

## **Joining Datasets**

In [None]:
# Merge the ratings and movies
df_merge1 = train.merge(movies, on='movieId')
# View the first 5 rows
df_merge1.head()

In [None]:
# Merge the ratings with the IMDb metadata
df_merge2 = train.merge(imdb, on="movieId")
# View first 5 rows
df_merge2.head()

In [None]:
# Merge the ratings-and-movies data from earlier with the IMDb metadata
df_merge3 = df_merge1.merge(imdb, on="movieId" )
# View first 5 rows
df_merge3.head()

In [None]:
# Check the null values of the data that has just been merged.
df_merge3.isnull().sum()
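
Note that `merge` performs an inner join by default, so ratings whose movieId has no match in the other table are silently dropped. The sketch below, on hypothetical toy tables with made-up values, contrasts the default with a left join and uses the `indicator` option to count unmatched rows.

```python
import pandas as pd

# Hypothetical toy tables; the values are made up for illustration
ratings = pd.DataFrame({
    "userId": [1, 2, 3],
    "movieId": [10, 20, 30],
    "rating": [4.0, 3.0, 5.0],
})
meta = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["Movie A", "Movie B"],
})

# Default inner join: the rating for movieId 30 disappears
inner = ratings.merge(meta, on="movieId")

# Left join: every rating is kept, unmatched metadata becomes NaN
left = ratings.merge(meta, on="movieId", how="left", indicator=True)

# Number of ratings without matching metadata
n_unmatched = int((left["_merge"] == "left_only").sum())

print(len(inner), len(left), n_unmatched)
```

Comparing the row count before and after a merge is a quick way to see whether any ratings were lost.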

# **Exploratory Data Analysis**

## **Missing Data and Data Types**

In order to facilitate the identification of missing data and data types, a function, print_dtypes_null, is defined below.

In [None]:
def print_dtypes_null(df):

    """
    This function takes a dataframe as input and prints out the
    data types and the null-value counts of the dataframe
    """

    # print data types (df.info() prints directly and returns None)
    print('Data type')
    df.info()
    print('\n======================')

    # get number of null values
    total = df.isnull().sum().sort_values(ascending=False)

    # get percentage null values
    percent = (df.isnull().mean() * 100).sort_values(ascending=False)

    # create dataframe
    print('Missing Values')
    print(pd.concat([total, percent], axis=1, keys=['Total Number Missing', 'Percent Missing']),'\n======================')

    # print original dataframe for ease of reading
    print('Dataset')
    print(df.head())

In [None]:
print_dtypes_null(train)

In [None]:
print_dtypes_null(test_df)

In [None]:
print_dtypes_null(gscores)

In [None]:
print_dtypes_null(gtags)

In [None]:
print_dtypes_null(movies)

In [None]:
print_dtypes_null(imdb)

imdb_data_df includes numerical data (float64) and has 5 columns with missing data, ranging from 36% for director to 71% for budget.

In [None]:
print_dtypes_null(links)

links_df consists of numerical data (int64 and float64) and has 1 column, tmdbId, with 17% missing data.

In [None]:
print_dtypes_null(tags)

tags_df consists of numerical data (int64) and non-numeric data (object), and has less than 1% missing values in the tag column.

**Conclusion:**

1.) From the assessment we see that our dataset consists of a combination of numeric and non-numeric data types.

2.) The imdb_data_df dataset has between 36% and 71% missing data in five of its columns. This dataset will therefore not be considered going forward in this exercise. In a different context, however, the links_df dataset would be used to source the missing data from a supplementary dataset. The links_df dataset will also not be considered going forward.

In [None]:
# remove data that will not be considered
del imdb
del links

# **Exploratory Data Analysis**

In [None]:
def make_histogram(df, col):

    """Plot and save a histogram of column col in dataframe df."""

    # Plot the histogram with default number of bins and label the axes
    _ = plt.hist(df[col])
    _ = plt.xlabel(col)
    _ = plt.xticks(rotation=90)
    _ = plt.ylabel('Frequency')
    
    plt.savefig(f'Histogram of {col}.png')

    # Show the plot
    plt.show()


def show_wordcloud(data, col):

    """Plot and save a word cloud of the text in column col of data."""

    # define text from data
    text = ' '.join(data[col].values.astype(str))

    # generate word cloud
    wordcloud = WordCloud(max_words=50,
                          background_color='black',
                          scale=3,
                          random_state=4).generate(text)

    # plot word cloud (draw the image before saving, otherwise the saved file is blank)
    fig = plt.figure(1, figsize=(15, 15))
    plt.axis('off')
    plt.imshow(wordcloud)

    plt.savefig(f'Word cloud of {col}.png')
    plt.show()


def ecdf(data):
    
    """Compute ECDF for a one-dimensional array of measurements."""
    
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y


def plot_ecdf(df, col):
    
    """plot ECDF for a column, col, in a dataframe, df."""
    
    # Compute ECDF 
    x, y = ecdf(df[col])
    
    # Generate plot
    _ = plt.plot(x, y, marker='.', linestyle = 'none')
    
    # Label axes
    plt.ylabel('ECDF')
    plt.xlabel(f'{col}')
    
    
    plt.savefig(f'ecdf of {col}.png')
    
    # display
    plt.show()
    
    
def plot_category_distribution(data, category, value, plot_type=sns.violinplot):

    """
    To create a distribution plot. The default plot type is a violin plot.
    """

    # Create the plot with the chosen Seaborn function and its default settings
    _ = plot_type(x=category, y=value, data=data)

    # Label Title and axes
    _ = plt.title(f'distribution of {category} vs {value}')
    _ = plt.xlabel(category)
    _ = plt.xticks(rotation=90)
    _ = plt.ylabel(value)
    
    
    # save the plot
    plt.savefig(f'distribution of {category} vs {value}.png')

    # Show the plot
    plt.show()
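
To make the ECDF helper concrete, here is a small worked example on a hypothetical array of four ratings (made-up values). It repeats the `ecdf` function from the cell above so that it runs on its own: each sorted value is paired with the fraction of observations less than or equal to it.

```python
import numpy as np


def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y


# Hypothetical ratings; the values are made up for illustration
ratings = np.array([4.0, 2.5, 5.0, 4.0])

x, y = ecdf(ratings)

print(x.tolist())  # [2.5, 4.0, 4.0, 5.0]
print(y.tolist())  # [0.25, 0.5, 0.75, 1.0]
```

Reading the output, 75% of these ratings are 4.0 or lower, which is the kind of statement an ECDF plot supports without any choice of bin width.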

In [None]:
train.describe().T

In [None]:
train.nunique()

In [None]:
# remove timestamp
train.drop('timestamp', axis=1, inplace=True)
train.head()

The train_df dataset contains 10 000 038 records from 162 541 unique userIds, covering 48 213 unique movies. There are 10 distinct rating values and 8 795 101 distinct timestamps.

It was assumed that people view different movies at different times for reasons that have little or nothing to do with the movies they like. For this reason, the timestamp data will not be assessed going forward in this exercise.

The rating data is explored below.

In [None]:
# Determining number of rows for each rating value
rows_rating = train["rating"].value_counts()
rows_rating_df = pd.DataFrame({"rating": rows_rating.index, "Rows": rows_rating.values})

# Determining percentage of rows for each rating value
percentage_rating = round(train["rating"].value_counts(normalize=True) * 100, 2)
percentage_rating_df = pd.DataFrame(
    {"rating": percentage_rating.index, "Percentage": percentage_rating.values}
)

# Joining row and percentage information
ratings_distribution_df = pd.merge(
    rows_rating_df, percentage_rating_df, on="rating", how="outer"
)
ratings_distribution_df.set_index("rating", inplace=True)
ratings_distribution_df.sort_index(axis=0)

In the dataframe above we can see that 4.0 is the most common score, with 26.53% of the ratings in the dataframe assigned that score.
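
The same kind of table can be built more compactly. The sketch below is one possible alternative, shown on a hypothetical toy series of ratings (made-up values, so the percentages differ from those of train.csv): the counts and the percentages share the rating values as their index, so they can be combined without a merge.

```python
import pandas as pd

# Hypothetical toy ratings; the values are made up for illustration
toy_ratings = pd.Series([4.0, 4.0, 3.0, 5.0, 4.0, 3.0, 0.5, 4.0], name="rating")

# Number of rows for each rating value
counts = toy_ratings.value_counts()

# Percentage of rows for each rating value
percent = (toy_ratings.value_counts(normalize=True) * 100).round(2)

# Both series are indexed by rating value, so they align automatically
distribution = pd.DataFrame({"Rows": counts, "Percentage": percent}).sort_index()

# Most frequently given rating
most_common = counts.idxmax()

print(distribution)
print("Most common rating:", most_common)
```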

We visualize the data below:

In [None]:
# The plotting libraries were already imported at the top of the notebook;
# only make_subplots is new here
from plotly.subplots import make_subplots

In [None]:
init_notebook_mode(connected=True)
data = train["rating"].value_counts().sort_index(ascending=False)

# Plot data
trace = go.Bar(
    x=data.index,
    text=["{:.1f} %".format(val) for val in (data.values / train.shape[0] * 100)],
    textposition="auto",
    textfont=dict(color="#000000"),
    y=data.values,
)

# Create layout
layout = dict(
    title="Distribution Of {} ratings".format(train.shape[0]),
    xaxis=dict(title="Rating"),
    yaxis=dict(title="Count"),
)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
pyo.iplot(fig)

Interestingly, half scores (0.5, 1.5, 2.5, 3.5 and 4.5) are used less often than integer scores. We don't know whether this is because users prefer to rate movies with integer values or because half scores were introduced after the original scoring system was already in use, which would reduce their volume in a dataset with ratings going back to 1995. We briefly investigate this by checking which years recorded half-score ratings.
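
One way to approach this check is sketched below on a hypothetical toy table (made-up values). It assumes the timestamp column holds Unix time in seconds. Since timestamp was dropped from train above, the real analysis would need a copy that still carries it, such as df_train.

```python
import pandas as pd

# Hypothetical toy data; ratings and timestamps are made up for illustration
toy = pd.DataFrame({
    "rating": [4.0, 3.0, 3.5, 4.5, 5.0],
    "timestamp": [
        820454400,   # 1996-01-01 00:00:00 UTC
        946684800,   # 2000-01-01 00:00:00 UTC
        1072915200,  # 2004-01-01 00:00:00 UTC
        1262304000,  # 2010-01-01 00:00:00 UTC
        1262304000,  # 2010-01-01 00:00:00 UTC
    ],
})

# Assumption: timestamps are seconds since the Unix epoch
toy["year"] = pd.to_datetime(toy["timestamp"], unit="s").dt.year

# A half score has a fractional part of exactly 0.5
toy["is_half"] = (toy["rating"] % 1) == 0.5

# Number of half-score ratings recorded in each year
half_by_year = toy.groupby("year")["is_half"].sum()

# First year in which any half score appears
first_half_year = int(half_by_year[half_by_year > 0].index.min())

print(half_by_year)
print("First year with half scores:", first_half_year)
```

If the first year with half scores on the real data is later than the first year of ratings overall, that would support the explanation that half scores were introduced later.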