# PHASE 4 FINAL PROJECT

## 1. BUSINESS UNDERSTANDING

### 1.1 INTRODUCTION

Building recommendation systems is a well-known problem. Its importance is such that Netflix, a streaming service, once offered a $1 million prize to anyone who could improve on its own recommendation system by 10%. The prize was won in 2009, after 3 years of competition, by a team of researchers called `BellKor's Pragmatic Chaos`. Netflix has also said that it would lose over a billion dollars each year in revenue if it were not for its recommendation system.

Knowing this, a startup company in Kenya called Phoenix Incorporated has decided to start its own streaming service to compete with the likes of Showmax, Netflix and Prime Video, which are all available in Kenya. The idea behind Phoenix is to offer the same variety as a company like Netflix but at a lower price, so as to be accessible to the average Kenyan in the current economy. Phoenix Incorporated understands that for a streaming service to be successful it needs a good recommendation system that keeps its existing clients happy and attracts new ones, and hence brings in more revenue. To achieve this, they have decided to hire a well-known data science firm called Regex Analytics to assist them.

After meeting with one of the representatives from Phoenix Incorporated, the CEO of Regex Analytics has given this task to the firm's head data scientist, who has in turn delegated it to one of the experienced data scientists on his team. This data scientist is to build a recommendation system that recommends 5 movies a user may like, based on the user's previous choices and on what other users of the streaming platform liked, so as to address the cold start problem. He is then to present the model to the head data scientist, who will show it to the CEO. He is also to prepare a presentation summarising his findings for a non-technical audience, which will include some members from Phoenix Incorporated.

### 1.2 OBJECTIVES

- To build a recommendation system capable of suggesting 5 movies to users based on their past choices and on content that is currently popular on the streaming service.

- To address the `cold start problem` so that new users with limited interaction history still receive valuable recommendations.

- To optimise the recommendation algorithms to maximise user satisfaction and platform revenue.

- To implement the recommendation system to enhance user engagement and retention.
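One common way to handle the cold start objective is to fall back on popular titles when a user has no history. The sketch below is only an illustration under assumptions that are not in the project: the helper `top_n_popular`, the `min_ratings` threshold, and the toy data are all hypothetical.

```python
import pandas as pd

def top_n_popular(ratings, n=5, min_ratings=2):
    """Return the movieIds of the n best-rated movies with at least min_ratings ratings."""
    # Mean rating and number of ratings per movie
    stats = ratings.groupby('movieId')['rating'].agg(['mean', 'count'])
    # Ignore movies with too few ratings to trust their mean
    eligible = stats[stats['count'] >= min_ratings]
    # Best mean first, ties broken by the number of ratings
    ranked = eligible.sort_values(['mean', 'count'], ascending=False)
    return ranked.head(n).index.tolist()

# Made-up ratings, used only to illustrate the idea
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2, 3, 1],
    'movieId': [10, 10, 10, 20, 20, 30, 40],
    'rating':  [5.0, 4.0, 4.5, 3.0, 2.0, 5.0, 1.0],
})

# A brand-new user has no history, so they receive the popular list
new_user_recs = top_n_popular(toy_ratings, n=5, min_ratings=2)
print(new_user_recs)
```

Movies 30 and 40 are excluded here because each has a single rating, which is why a minimum-count threshold is often paired with the mean.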

### 1.3 PROBLEM STATEMENT

To build a recommendation system that offers 5 recommendations to a user based on the content they have previously watched and on what other users with similar interests have watched.

### 1.4 MEASURE OF SUCCESS

The goal is to build a recommendation system that recommends 5 movies to a user based on what they have previously watched and on what other users with similar interests have watched. The measure of success will therefore be a working recommendation system that can offer recommendations to both new and existing users with at least 70% accuracy. This is lower than the roughly 80% accuracy attributed to major platforms such as Netflix, but it is still a good starting point. The model can then be improved later based on suggestions from users; Netflix itself has been around for more than 10 years and has had time to refine its recommendation system.
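The text does not say how "accuracy" will be computed for a list of 5 recommendations. One possible reading, offered here purely as an assumption, is precision at 5: the share of the 5 recommended movies that the user actually liked. The function name `precision_at_k` and the ids below are illustrative only.

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended items that appear in the relevant set."""
    top_k = list(recommended)[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)

# Hypothetical example: 5 recommended movieIds and the movies the user liked
recommended_ids = [1, 50, 260, 296, 318]
liked_ids = {1, 260, 318, 2571}

score = precision_at_k(recommended_ids, liked_ids, k=5)
print(score)  # 3 of the 5 recommendations are in liked_ids
```

Under this reading, the 70% target would mean that, averaged over users, at least 3.5 of every 5 recommendations are relevant.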

First, we load the libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

Next, we want to see what our data looks like and combine the dataframes into one using a common key, so that further analysis works with a single dataframe instead of 4.

In [2]:
# Creating dataframes based on each csv file
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')
links = pd.read_csv('links.csv')
tags = pd.read_csv('tags.csv')

# Merging all the dataframes based on their common column movieId
combined_df = movies.merge(ratings, on='movieId').merge(links, on='movieId').merge(tags, on='movieId')


# Results
combined_df.head()

Unnamed: 0,movieId,title,genres,userId_x,rating,timestamp_x,imdbId,tmdbId,userId_y,tag,timestamp_y
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,336,pixar,1139045764
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,474,pixar,1137206825
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,567,fun,1525286013
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,336,pixar,1139045764
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,474,pixar,1137206825


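**We have successfully loaded the dataframes and joined them on the common column `movieId`. Because we did not specify the `how` parameter, the join defaulted to `inner`, meaning only movies whose `movieId` appears in every dataframe are kept. Both `ratings` and `tags` contain `userId` and `timestamp` columns, so pandas added the `_x` (ratings) and `_y` (tags) suffixes, and each rating of a movie is paired with every tag of that movie, as the repeated rows above show.**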
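To make the effect of joining on `movieId` alone concrete, here is a minimal sketch on made-up frames (the `toy_*` names and values are illustrative, not the project's data). It shows the default inner join dropping a movie with no tags and multiplying rows for a movie with several ratings and several tags.

```python
import pandas as pd

# Toy stand-ins for movies.csv, ratings.csv and tags.csv
toy_movies = pd.DataFrame({'movieId': [1, 2],
                           'title': ['A (2000)', 'B (2001)']})
toy_ratings = pd.DataFrame({'movieId': [1, 1, 2],
                            'userId': [1, 5, 1],
                            'rating': [4.0, 4.0, 3.0]})
toy_tags = pd.DataFrame({'movieId': [1, 1, 1],
                         'userId': [336, 474, 567],
                         'tag': ['pixar', 'pixar', 'fun']})

# Same chain as in the notebook; how='inner' is the pandas default
merged = toy_movies.merge(toy_ratings, on='movieId').merge(toy_tags, on='movieId')

print(len(merged))              # 2 ratings x 3 tags for movie 1
print(merged.columns.tolist())  # userId gets _x / _y suffixes
print(merged['movieId'].unique())  # movie 2 has no tags and is dropped
```

Whether this row multiplication matters depends on the analysis: counting rows per movie in the merged frame counts rating and tag pairs, not ratings.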

We now want to see the general info about our data such as the type of data in each column and whether or not there are missing values in any of the columns.

In [3]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233213 entries, 0 to 233212
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   movieId      233213 non-null  int64  
 1   title        233213 non-null  object 
 2   genres       233213 non-null  object 
 3   userId_x     233213 non-null  int64  
 4   rating       233213 non-null  float64
 5   timestamp_x  233213 non-null  int64  
 6   imdbId       233213 non-null  int64  
 7   tmdbId       233213 non-null  float64
 8   userId_y     233213 non-null  int64  
 9   tag          233213 non-null  object 
 10  timestamp_y  233213 non-null  int64  
dtypes: float64(2), int64(6), object(3)
memory usage: 19.6+ MB


**There are no missing values in any of the columns of the combined dataframe. The `dtypes` are mixed: integers, floats and objects (strings).**
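An inner join silently drops any movie that lacks a match in one of the frames, so a clean `info()` on the merged result does not by itself show that the source files are complete. A minimal sketch, on made-up data, of how a left join with `indicator=True` can reveal what an inner join would hide:

```python
import pandas as pd

# Toy stand-ins: movie 2 and movie 3 have no tags
toy_movies = pd.DataFrame({'movieId': [1, 2, 3],
                           'title': ['A (2000)', 'B (2001)', 'C (2002)']})
toy_tags = pd.DataFrame({'movieId': [1, 1],
                         'tag': ['pixar', 'fun']})

# Keep every movie and record where each row came from
left = toy_movies.merge(toy_tags, on='movieId', how='left', indicator=True)

# Movies that an inner join would have dropped
untagged = left.loc[left['_merge'] == 'left_only', 'movieId'].tolist()

# Missing values per column in the left-joined frame
missing_per_column = left.isna().sum()

print(untagged)
print(missing_per_column)
```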

Checking for duplicates

In [4]:
combined_df.duplicated().sum()

0
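Called without arguments, `duplicated()` flags a row only when every column matches an earlier row. A small sketch on made-up data shows how passing `subset` can surface repeated values in selected columns that the full-row check does not report; the column choice here is an illustration, not a step taken in this notebook.

```python
import pandas as pd

# Toy rows shaped like the merged frame: one rating repeated across two tags
toy = pd.DataFrame({
    'movieId':  [1, 1, 1],
    'userId_x': [1, 1, 5],
    'rating':   [4.0, 4.0, 4.0],
    'tag':      ['pixar', 'fun', 'pixar'],
})

# Full-row check: no two rows agree on every column
full_row_dupes = int(toy.duplicated().sum())

# Subset check: the same (movie, rating user) pair appears more than once
repeated_ratings = int(toy.duplicated(subset=['movieId', 'userId_x']).sum())

print(full_row_dupes, repeated_ratings)
```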

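**No fully duplicated rows are present.**

As a quick check on a copy of the data, we look at how splitting a title into words behaves:

In [5]:
tester_df = combined_df.copy()

tester_df['title'] = tester_df['title'].apply(lambda x: x.split())

tester_df['title'][0]

['Toy', 'Story', '(1995)']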

We now want to create a new column, `year`, holding the year each movie came out. This might be useful during Exploratory Data Analysis.

We will do this using a function called `year_splitter` as shown below.

In [6]:
# Creating a copy to test the function to be created
test_df = combined_df.copy()

def year_splitter(df):
    """
    Moves the last word of each title into a new 'year' column.

    Each title is split into words, the last word (expected to be the
    release year in parentheses) is stored in 'year' without its
    parentheses, and the remaining words are joined back into the title.
    Note: the DataFrame passed in is modified in place.

    :param df: DataFrame containing a 'title' column with movie titles.
    :return: The same DataFrame with 'title' modified and 'year' added.
    """
    # Splitting the titles into individual words
    words = df['title'].apply(lambda x: x.split())

    # Extracting the years: the last word, with surrounding parentheses removed
    df['year'] = [w[-1].strip('()') for w in words]

    # Joining the remaining words back into one string, without the year
    df['title'] = [' '.join(w[:-1]) for w in words]

    # Returning the modified DataFrame
    return df




Testing the function to see if it worked

In [7]:
new_df = year_splitter(test_df)
new_df


Unnamed: 0,movieId,title,genres,userId_x,rating,timestamp_x,imdbId,tmdbId,userId_y,tag,timestamp_y,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,336,pixar,1139045764,1995
1,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,474,pixar,1137206825,1995
2,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,567,fun,1525286013,1995
3,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,336,pixar,1139045764,1995
4,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,474,pixar,1137206825,1995
...,...,...,...,...,...,...,...,...,...,...,...,...
233208,187595,Solo: A Star Wars Story,Action|Adventure|Children|Sci-Fi,586,5.0,1529899556,3778644,348350.0,62,star wars,1528934552,2018
233209,193565,Gintama: The Movie,Action|Animation|Comedy|Sci-Fi,184,3.5,1537098554,1636780,71172.0,184,anime,1537098582,2010
233210,193565,Gintama: The Movie,Action|Animation|Comedy|Sci-Fi,184,3.5,1537098554,1636780,71172.0,184,comedy,1537098587,2010
233211,193565,Gintama: The Movie,Action|Animation|Comedy|Sci-Fi,184,3.5,1537098554,1636780,71172.0,184,gintama,1537098603,2010


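**The function worked for the rows shown: we now have a new `year` column, and the years have been removed from the `title` column. Titles that do not end in a year are an exception, as the check for `Paterson` below shows.**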

From now on we will be using the `new_df` dataframe.

In [11]:
new_df['year'][new_df['year'] == 'Paterson']

232424    Paterson
232425    Paterson
232426    Paterson
Name: year, dtype: object
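The output above shows that when a title has no trailing year, taking the last word puts part of the title (here `Paterson`) into `year`. One possible alternative, sketched below on made-up titles, is to match an explicit `(YYYY)` pattern with `str.extract` so that titles without a year get a missing value instead. The helper name `split_year` and the assumption that years are always four digits in parentheses at the end of the title are mine, not the notebook's.

```python
import pandas as pd

def split_year(df):
    """Return a copy of df with a trailing '(YYYY)' moved from 'title' into 'year'."""
    out = df.copy()
    # Four digits in parentheses at the very end of the title
    pattern = r'\s*\((\d{4})\)\s*$'
    # Titles that do not match get NaN in 'year'
    out['year'] = out['title'].str.extract(pattern, expand=False)
    # Remove the matched year, leaving other titles untouched
    out['title'] = out['title'].str.replace(pattern, '', regex=True).str.strip()
    return out

toy_titles = pd.DataFrame({'title': ['Toy Story (1995)',
                                     'Paterson',
                                     'Solo: A Star Wars Story (2018)']})
result = split_year(toy_titles)
print(result)
```

Unlike `year_splitter`, this sketch works on a copy, so the frame passed in keeps its original titles. The extracted years are still strings; converting them to numbers would need a separate step that tolerates the missing values.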

In [14]:
new_df['tag'].unique()

array(['pixar', 'fun', 'fantasy', ..., 'star wars', 'gintama', 'remaster'],
      dtype=object)