# PHASE 4 FINAL PROJECT

## 1. BUSINESS UNDERSTANDING

### 1.1 INTRODUCTION

Building recommendation systems is a well-known problem. Its importance is such that Netflix, a streaming service, once offered a $1 million prize to anyone who could beat its own recommendation system by 10%. The prize was won in 2009, after three years of competition, by a team of researchers called `BellKor's Pragmatic Chaos`. Netflix has also said that it would lose over a billion dollars in revenue each year if it were not for its recommendation system.

Knowing this, a startup company in Kenya called Phoenix Incorporated has decided to start its own streaming service to compete with the likes of Showmax, Netflix and Prime Video, all of which are available in Kenya. The idea behind Phoenix is to offer the same variety as a company like Netflix but at a lower price, so as to be accessible to the average Kenyan in the current economy. Phoenix Incorporated understands that for a streaming service to be successful it needs a good recommendation system to keep its existing clients happy and to attract new ones, and hence to grow revenue. To do this, they have decided to hire a well-known data science firm called Regex Analytics to assist them.

After meeting with one of the representatives from Phoenix Incorporated, the CEO of Regex Analytics gave this task to the firm's head data scientist, who in turn delegated it to one of the experienced data scientists on his team. This data scientist is to build a recommendation system that recommends 5 movies a user may like, based on the user's previous choices and on what other users of the streaming platform liked, so as to address the cold start problem. He is then to present the model to the head data scientist, who will show it to the CEO. He is also to prepare a presentation summarising his findings for a non-technical audience, which will include some members from Phoenix Incorporated.

### 1.2 OBJECTIVES

- To build a recommendation system capable of suggesting 5 movies to users based on their past choices and on content that is currently popular on the streaming service.

- To address the `cold start problem` so as to provide valuable recommendations to new users with limited interaction history.

- To optimise the recommendation algorithms to maximise user satisfaction and platform revenue.

- To implement the recommendation system to enhance user engagement and retention.

### 1.3 PROBLEM STATEMENT

To build a recommendation system that offers 5 recommendations to a user based on content the user has previously watched and on what other users with similar interests have watched.

### 1.4 MEASURE OF SUCCESS

The goal is to build a recommendation system that recommends 5 movies to a user based on what that user has previously watched and what other users with similar interests have watched. The measure of success will therefore be a working recommendation system that can offer recommendations to both new and existing users with at least 70% accuracy. This is lower than the 80% accuracy attributed to major platforms like Netflix but is still a good starting point. The model can then be improved over time based on feedback from users; Netflix has been around for more than 10 years, so it has had time to refine its recommendation system.

## 2. DATA UNDERSTANDING

To build the recommendation system, the data was sourced from the `MovieLens` dataset, which is commonly used for building recommendation systems. The dataset consists of several CSV files, each of which is explained below.

First we load the libraries needed to view each CSV file as a dataframe, along with libraries we may use later on.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

### 2.1 Ratings

These are the ratings given to various movies and the data contained can be seen below.

In [18]:
ratings = pd.read_csv('ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


**As can be seen above, the columns for ratings are `userId`, `movieId`, `rating` and `timestamp`.**

These are defined as follows:
- **userId** : The unique identifier of the user who gave the rating, which lets us group the ratings given by each user.

- **movieId** : The unique identifier of the movie that was rated. It lets us collect the ratings given to each movie, so that a user who is new to the system can be given recommendations based on the ratings of users with similar interests.

- **rating** : The rating a user gave to a specific movie, on a scale from 0.5 to 5 in half-star steps, with 5 being the highest score. This is used in content-based and collaborative filtering.

- **timestamp** : The date and time at which the user rated the movie, stored as seconds since midnight UTC on 1 January 1970 (the Unix epoch).
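Since the `timestamp` values are seconds since the Unix epoch, they can be turned into readable dates. The snippet below is a minimal sketch on two rows copied from the preview above; the column name `rated_at` is an assumption made for illustration and is not used elsewhere in this notebook.

```python
import pandas as pd

# Two rows copied from the ratings preview above
sample = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 3],
    "rating": [4.0, 4.0],
    "timestamp": [964982703, 964981247],
})

# Interpret the integers as seconds since the Unix epoch (UTC).
# "rated_at" is a hypothetical column name chosen for this sketch.
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s")

print(sample[["timestamp", "rated_at"]])
```

A datetime column like this would make it possible to sort or filter ratings by when they were given, should that become useful later.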


### 2.2 Movies

These are the IDs of the movies to be rated, the titles of the movies in question and the category in which each film belongs.

In [19]:
movies = pd.read_csv('movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


**The columns of the movie dataframe are `movieId`, `title` and `genres`.**

These are defined as follows:
- **movieId** : Same as above, the unique identifier of the movie to be rated.

- **title** : The title of the movie to be rated. It ties each `movieId` to a string so that the system can show the name of a specific movie.

- **genres** : The categories the film belongs to, separated by `|` when a film has more than one.
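Because each `genres` value packs several categories into one `|`-separated string, it usually needs to be split before it can be used as a feature. The following is a rough sketch on three rows copied from the preview above; the names `genre_flags` and `genre_long` are assumptions made for illustration.

```python
import pandas as pd

# Three rows copied from the movies preview above
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Wide format: one 0/1 indicator column per genre
genre_flags = movies_sample["genres"].str.get_dummies(sep="|")

# Long format: one row per (movie, genre) pair
genre_long = (
    movies_sample.assign(genre=movies_sample["genres"].str.split("|"))
    .explode("genre")[["movieId", "genre"]]
)

print(genre_flags)
print(genre_long)
```

The wide format suits similarity calculations between movies, while the long format is convenient for counting how many movies fall under each genre.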

### 2.3 Links

This table contains ID data for various databases and also for the MovieLens site. This is most likely a junction table used to connect various dataframes through common columns.

In [20]:
links = pd.read_csv('links.csv')
links

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
9737,193581,5476944,432131.0
9738,193583,5914996,445030.0
9739,193585,6397426,479308.0
9740,193587,8391976,483455.0


**The columns are `movieId`, `imdbId` and `tmdbId`.**

They are defined as follows:

- **movieId** : Unique identifier for every movie according to the MovieLens site.

- **imdbId** : Unique identifier for every movie according to the Internet Movie Database (IMDb).

- **tmdbId** : Unique identifier for every movie according to The Movie Database (TMDb).

### 2.4 Tags

This table contains the IDs of the movies and the users, the tags given by each user, and the timestamp at which each tag was given.

In [21]:
tags = pd.read_csv('tags.csv')
tags

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978


**The columns present are `userId`, `movieId`, `tag` and `timestamp`.**

They are defined as follows:

- **userId** : Unique identifier of the user who tagged a movie.

- **movieId** : Unique identifier of the movie that was tagged.

- **tag** : A short, free-text description of a movie given by a user to express an opinion. Recommendation systems can use tags to understand user preferences and to recommend movies based on the tags users have assigned, so this column will be useful in building our system.

- **timestamp** : The time at which the user created the tag, stored as seconds since midnight UTC on 1 January 1970.
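Tags are free text, so the same idea can appear with different capitalisation or spacing. The sketch below, built on five rows copied from the preview above, shows one possible way to normalise tags and gather them per movie. The names `tag_clean` and `tags_per_movie` are assumptions for illustration, not part of the pipeline in this notebook.

```python
import pandas as pd

# Five rows copied from the tags preview above
tags_sample = pd.DataFrame({
    "userId": [2, 2, 2, 2, 2],
    "movieId": [60756, 60756, 60756, 89774, 89774],
    "tag": ["funny", "Highly quotable", "will ferrell", "Boxing story", "MMA"],
})

# Normalise case and surrounding whitespace so equivalent tags match
tags_sample["tag_clean"] = tags_sample["tag"].str.strip().str.lower()

# Collect the distinct tags of each movie into one comma-separated string
tags_per_movie = (
    tags_sample.groupby("movieId")["tag_clean"]
    .apply(lambda s: ", ".join(sorted(set(s))))
)

print(tags_per_movie)
```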


**These are the columns present in all the dataframes. Even though we may drop some of them, the remaining ones will each be useful in building a recommendation system. We now move on to the Data Preparation phase of the project.**

## 3. DATA PREPARATION

We want to make our dataframe ready for analysis by handling outliers, wrong datatypes, duplicates and missing values in the dataset. We may also do some feature engineering and feature selection along the way to help us understand our data better.

The first step, to make our work easier, is to combine all the dataframes into one using the common key `movieId`.

In [22]:
# Merging all the dataframes based on their common column movieId
combined_df = movies.merge(ratings, on='movieId').merge(links, on='movieId').merge(tags, on='movieId')

# Results
combined_df.head()

Unnamed: 0,movieId,title,genres,userId_x,rating,timestamp_x,imdbId,tmdbId,userId_y,tag,timestamp_y
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,336,pixar,1139045764
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,474,pixar,1137206825
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,567,fun,1525286013
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,336,pixar,1139045764
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,474,pixar,1137206825


**We have successfully joined the dataframes on their common column `movieId`. We did not specify the `how` parameter, so each join used the default, `inner`, meaning that only movies whose `movieId` appears in every dataframe are kept. Because `userId` and `timestamp` appear in both `ratings` and `tags`, pandas added the suffixes `_x` (from `ratings`) and `_y` (from `tags`) to tell them apart.**
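One consequence of joining `ratings` and `tags` on `movieId` alone is worth illustrating: every rating of a movie is paired with every tag of that movie, and movies without any tag are dropped by the inner join. The sketch below uses small made-up dataframes (all values are hypothetical) to show both effects; it is not a replacement for the merge above.

```python
import pandas as pd

# Hypothetical toy data: movie 1 has 2 ratings and 3 tags, movie 2 has 1 rating and no tags
movies_s = pd.DataFrame({"movieId": [1, 2], "title": ["A (2000)", "B (2001)"]})
ratings_s = pd.DataFrame({"userId": [10, 11, 12], "movieId": [1, 1, 2], "rating": [4.0, 5.0, 3.0]})
tags_s = pd.DataFrame({"userId": [20, 21, 22], "movieId": [1, 1, 1], "tag": ["x", "y", "z"]})

# Default inner joins, as in the cell above
inner = movies_s.merge(ratings_s, on="movieId").merge(tags_s, on="movieId")

# Same chain, but keeping rated movies that have no tags
left = movies_s.merge(ratings_s, on="movieId").merge(tags_s, on="movieId", how="left")

print(len(inner))  # 2 ratings x 3 tags for movie 1; movie 2 is dropped
print(len(left))   # the same rows plus one row for movie 2 with a missing tag
```

If the repeated rating rows turn out to be a problem for modelling, one option would be to keep `ratings` and `tags` in separate dataframes and merge each with `movies` only when needed.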

We now want to see the general info about our data such as the type of data in each column and whether or not there are missing values in any of the columns.

In [3]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233213 entries, 0 to 233212
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   movieId      233213 non-null  int64  
 1   title        233213 non-null  object 
 2   genres       233213 non-null  object 
 3   userId_x     233213 non-null  int64  
 4   rating       233213 non-null  float64
 5   timestamp_x  233213 non-null  int64  
 6   imdbId       233213 non-null  int64  
 7   tmdbId       233213 non-null  float64
 8   userId_y     233213 non-null  int64  
 9   tag          233213 non-null  object 
 10  timestamp_y  233213 non-null  int64  
dtypes: float64(2), int64(6), object(3)
memory usage: 19.6+ MB


**There are no missing values in any of the columns, and the `dtypes` are a mix of integers, floats and objects (strings).**

Checking for duplicates

In [4]:
combined_df.duplicated().sum()

0

**There are no duplicates in this dataset.**
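Note that `duplicated()` with no arguments only flags rows that match in every column. As a hedged illustration on made-up data, the sketch below shows how passing `subset` checks for repeats on chosen key columns instead; the choice of `userId` and `movieId` as the key is an assumption for this example.

```python
import pandas as pd

# Hypothetical data: user 1 rated movie 10 twice with different scores
ratings_s = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 10, 10],
    "rating": [4.0, 3.5, 5.0],
})

# Rows identical in every column
full_row_dupes = int(ratings_s.duplicated().sum())

# Rows that repeat an earlier (userId, movieId) pair
key_dupes = int(ratings_s.duplicated(subset=["userId", "movieId"]).sum())

print(full_row_dupes, key_dupes)
```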

Looking at the combined dataset above, it is interesting to see that the strings in the `title` column end with the year the movie came out. We can separate this year from the `title` column into its own column, `year`. We will do this using a function called `year_splitter`, as shown below.

In [6]:
# Creating a copy to test the function to be created
test_df = combined_df.copy()

def year_splitter(df):
    """
    Moves the release year from the end of each movie title into its own column.

    Each title is split into words, the last word is taken as the year (with its
    parentheses stripped), and the remaining words are joined back into the title.

    :param df: DataFrame containing a 'title' column with titles such as 'Toy Story (1995)'.
    :return: The same DataFrame with the year removed from 'title' and a new 'year' column of strings.
    """
    # Splitting the titles into individual words
    words = df['title'].apply(lambda x: x.split())

    # Extracting the years: the last word, without the surrounding parentheses
    df['year'] = [x[-1].strip('()') for x in words]

    # Joining the remaining words back into a single title string
    df['title'] = [' '.join(x[:-1]) for x in words]

    # Returning the modified DataFrame
    return df




Testing the function to see if it worked

In [7]:
# Creating a new dataframe based on the changes to the test df.
new_df = year_splitter(test_df)
new_df


Unnamed: 0,movieId,title,genres,userId_x,rating,timestamp_x,imdbId,tmdbId,userId_y,tag,timestamp_y,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,336,pixar,1139045764,1995
1,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,474,pixar,1137206825,1995
2,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,114709,862.0,567,fun,1525286013,1995
3,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,336,pixar,1139045764,1995
4,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,114709,862.0,474,pixar,1137206825,1995
...,...,...,...,...,...,...,...,...,...,...,...,...
233208,187595,Solo: A Star Wars Story,Action|Adventure|Children|Sci-Fi,586,5.0,1529899556,3778644,348350.0,62,star wars,1528934552,2018
233209,193565,Gintama: The Movie,Action|Animation|Comedy|Sci-Fi,184,3.5,1537098554,1636780,71172.0,184,anime,1537098582,2010
233210,193565,Gintama: The Movie,Action|Animation|Comedy|Sci-Fi,184,3.5,1537098554,1636780,71172.0,184,comedy,1537098587,2010
233211,193565,Gintama: The Movie,Action|Animation|Comedy|Sci-Fi,184,3.5,1537098554,1636780,71172.0,184,gintama,1537098603,2010


**The function worked and now we have a new `year` column with the years removed from the `title` column.**
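For comparison, the same split can also be done with a regular expression, which only accepts a four-digit number in parentheses as a year. This is an alternative sketch, not the method used in this notebook; the third title is made up to show what happens when no year is present, and `title_clean` is a hypothetical column name.

```python
import pandas as pd

# The last title is a hypothetical example with no year at the end
titles = pd.DataFrame({
    "title": ["Toy Story (1995)", "Solo: A Star Wars Story (2018)", "A Title Without A Year"]
})

# Capture a four-digit year in parentheses at the very end of the title;
# titles that do not match get a missing value instead of a wrong year
titles["year"] = titles["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Remove the trailing "(YYYY)" and any whitespace before it
titles["title_clean"] = titles["title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

print(titles)
```

With this approach a title lacking a year keeps its full text and shows up as a missing `year`, which can then be found with `isna()`.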

From now on we will be using the `new_df` dataframe.

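We wanted to convert the new `year` column into integers, but it contains values that are not years, such as `Paterson`, as shown below. These come from titles that have no year at the end, so the function took the last word of the title instead. We therefore need to drop these rows.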

In [11]:
new_df['year'][new_df['year'] == 'Paterson']

232424    Paterson
232425    Paterson
232426    Paterson
Name: year, dtype: object

In [16]:
new_df = new_df.drop(new_df[new_df['year'] == 'Paterson'].index)

Seeing if it worked.

In [23]:
new_df['year'][new_df['year'] == 'Paterson']

Series([], Name: year, dtype: object)

**Since it worked, we now convert the `year` column to integers.**

In [24]:
new_df['year'] = new_df['year'].apply(lambda x: int(x))

Seeing if it worked by taking a sample

In [26]:
type(new_df['year'][6000])

numpy.int64

**It worked.**
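The steps above handle one known bad value. As a more general sketch, `pd.to_numeric` with `errors="coerce"` turns every non-numeric entry into a missing value, so all such rows can be found and dropped in one pass. The small dataframe below is made up for illustration, apart from the `Paterson` value seen above.

```python
import pandas as pd

# Hypothetical 'year' column containing one non-numeric value
years = pd.DataFrame({"year": ["1995", "2018", "Paterson", "2010"]})

# Non-numeric strings become NaN instead of raising an error
parsed = pd.to_numeric(years["year"], errors="coerce")

# Inspect which original values failed to parse
bad_values = years.loc[parsed.isna(), "year"].unique()

# Keep only the rows that parsed and store the years as integers
clean = years.loc[parsed.notna()].copy()
clean["year"] = parsed[parsed.notna()].astype("int64")

print(bad_values)
print(clean)
```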