## Final Project Submission

Please fill out:
* Group 17
* Student names:
    - [Fredrick Kyeki]()
    - [Stacy Kiriiri]()
    - [Wilfred Kivinda]()
* Student pace: **part time**
* Scheduled project review date/time: 1st - 10th December, 2023
* Instructor name: ...
* Blog post URL: ...

BUSINESS UNDERSTANDING:
=======

Objective:
-------------
The primary objective of the recommender system project is to enhance user satisfaction and engagement on the MovieLens platform by delivering personalized and relevant movie recommendations. The recommender system aims to provide users with tailored suggestions based on their historical movie ratings and tagging activities, ultimately improving their overall experience.

Scope:
-------------
The project will focus on implementing a collaborative filtering-based recommender system, leveraging the ml-latest-small dataset from MovieLens. The recommendations will be centered around user preferences, ensuring that users discover movies aligned with their tastes and interests. The scope includes both explicit ratings and user-generated tags as valuable indicators of user preferences.

Success Criteria:
-------------
The success of the recommender system will be evaluated based on several key performance indicators (KPIs):
##### User Engagement:
Increase in the number of user interactions with the platform, including ratings, tags, and time spent on the website.
##### Recommendation Accuracy:
Improvement in the precision and relevance of movie recommendations, reducing instances of irrelevant or disliked suggestions.
##### User Satisfaction:
Positive feedback from users, measured through surveys, reviews, and user ratings.
##### Platform Adoption:
Growth in the number of registered users and active users leveraging the recommendation features.
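The Recommendation Accuracy criterion mentions precision without saying how it would be measured. One common option is precision at k: the share of the top-k recommended movies that the user actually liked. The sketch below is an assumption about how this KPI could be computed, not a metric defined by this project. The function name `precision_at_k` and the `liked_ids` set are hypothetical.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical example: a ranked list of recommended movieIds and the
# movieIds this user is assumed to have liked.
recommended_ids = [1, 47, 50, 3, 6]
liked_ids = {1, 50, 6}

score = precision_at_k(recommended_ids, liked_ids, k=3)
print(f"precision@3 = {score:.2f}")
```

What counts as "liked" (for example, a rating at or above some threshold) is a design choice that would need to be fixed before the metric is comparable across models.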

DATA UNDERSTANDING:
=======

## Data Source:

The dataset (ml-latest-small) consists of 100,836 ratings and 3,683 tag applications across 9,742 movies. The data were 
generated by 610 users between March 29, 1996, and September 24, 2018.
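These counts also show how sparse the user-movie rating matrix is, which matters for collaborative filtering. The short calculation below uses only the figures quoted above; the variable names are illustrative.

```python
# Figures quoted in the dataset description
n_ratings = 100_836
n_users = 610
n_movies = 9_742

# Every (user, movie) pair is a potential rating
possible_pairs = n_users * n_movies

# Share of pairs that actually have a rating, and its complement
density = n_ratings / possible_pairs
sparsity = 1 - density

print(f"possible user-movie pairs: {possible_pairs}")
print(f"density: {density:.2%}, sparsity: {sparsity:.2%}")
```

Roughly 1.7% of all possible user-movie pairs carry a rating, so about 98.3% of the matrix is empty.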

The data used in this project are pulled from four separate files:

##### 1. Movies Data (movies.csv):

Contains movie information, including titles and genres.

Columns: 

 * movieId: Unique identifier for each movie.
 * title: The title of the movie, which also includes the year of release in parentheses.
 * genres: A pipe-separated list of genres to categorize the movie (e.g., Action|Adventure|Comedy).
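Because the release year is embedded in `title` and the genres are packed into one string, both usually need to be unpacked before analysis. The following is a minimal sketch on two rows copied from movies.csv. The column names `year` and `genre_list` are assumptions introduced here for illustration.

```python
import pandas as pd

# Two rows copied from movies.csv
movies_sample = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance"],
})

# Pull the four-digit year out of the trailing parentheses (kept as a string)
movies_sample["year"] = movies_sample["title"].str.extract(
    r"\((\d{4})\)\s*$", expand=False
)

# Turn the pipe-separated string into a list of genre names
movies_sample["genre_list"] = movies_sample["genres"].str.split("|")

print(movies_sample[["title", "year", "genre_list"]])
```

Titles that lack a year in parentheses would yield a missing value in `year`, so the result should be checked for nulls before converting it to a numeric type.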


##### 2. Links Data (links.csv):

Provides identifiers for linking to external movie-related sources (IMDb, TMDb).

Columns:

* movieId: Unique identifier for each movie, consistent with other data files.

* imdbId: Identifier for movies used by IMDb (Internet Movie Database).

* tmdbId: Identifier for movies used by TMDb (The Movie Database).

##### 3. Ratings Data (ratings.csv):


Contains user ratings for movies on a 5-star scale. Each entry represents one user's rating of one movie.

Columns: 

* userId: ID representing the unique identifier for each user.
* movieId: Unique identifier for each movie.
* rating: User's rating for the movie on a 5-star scale with half-star increments (0.5 to 5.0).
* timestamp: The timestamp when the rating was recorded, represented in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
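Since `timestamp` is stored as seconds since the Unix epoch, it can be converted to a readable datetime with `pd.to_datetime`. The sketch below uses two timestamp values copied from ratings.csv. The column name `rated_at` is an assumption for illustration.

```python
import pandas as pd

# Two timestamps copied from the first rows of ratings.csv
ratings_sample = pd.DataFrame({"timestamp": [964982703, 964981247]})

# unit="s" interprets the integers as seconds since 1970-01-01 00:00:00 UTC
ratings_sample["rated_at"] = pd.to_datetime(
    ratings_sample["timestamp"], unit="s", utc=True
)

print(ratings_sample)
```

The same conversion applies to the `timestamp` column in tags.csv.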

##### 4. Tags Data (tags.csv):

Contains user-generated metadata (tags) about movies.

Columns: 
* userId: ID representing the unique identifier for each user.
* movieId: Unique identifier for each movie.
* tag: User-generated metadata describing a movie, typically a single word or short phrase.
* timestamp: The timestamp when the tag was applied, represented in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.


In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np

In [2]:
# Load the four MovieLens CSV files into a dict of DataFrames keyed by file name
data_dir = "data/ml-latest-small"
data = {}

for key in ["movies", "links", "ratings", "tags"]:
    data[key] = pd.read_csv(f"{data_dir}/{key}.csv")

#### EXPLORING DATAFRAMES

Datasets Lengths

In [3]:
print("Length of each data-set:")
for k, v in data.items():
    print(k, ":",len(v))

Length of each data-set:
movies : 9742
links : 9742
ratings : 100836
tags : 3683


First five rows of each:

In [4]:
data["movies"].head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
data["links"].head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [6]:
data["ratings"].head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [7]:
data["tags"].head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [8]:
print("Summary of each data-set:\n")
for k, v in data.items():
    print(k, "\n")
    print(v.info())
    print("="*100, "\n")

Summary of each data-set:

movies 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
None

links 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB
None

ratings 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movie

In [9]:
print("Columns of each data-set:\n")
all_columns = []
for k, v in data.items():
    print(k, "\n")
    all_columns += list(v.columns)
    print(list(v.columns))
    print("="*100, "\n")

print("all unique columns", set(all_columns))

Columns of each data-set:

movies 

['movieId', 'title', 'genres']

links 

['movieId', 'imdbId', 'tmdbId']

ratings 

['userId', 'movieId', 'rating', 'timestamp']

tags 

['userId', 'movieId', 'tag', 'timestamp']

all unique columns {'userId', 'timestamp', 'genres', 'title', 'tag', 'movieId', 'rating', 'imdbId', 'tmdbId'}


### MERGING DATAFRAMES

In [10]:
# merging movies df and link df using an inner join
merged_movies_links = pd.merge(data["movies"], data["links"], on="movieId", how="inner")
merged_movies_links.shape

(9742, 5)

In [49]:
merged_movies_links.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497,8844.0
2,3,Grumpier Old Men (1995),Comedy|Romance,113228,15602.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885,31357.0
4,5,Father of the Bride Part II (1995),Comedy,113041,11862.0


In [11]:
# Left-join ratings, then tags, onto merged_movies_links.
# Note: tags are joined on movieId only, so every rating of a movie is paired
# with every tag of that movie (a many-to-many join that multiplies rows).
merged_data_ratings = pd.merge(merged_movies_links, data["ratings"], on="movieId", how="left")

df = pd.merge(merged_data_ratings, data['tags'], on="movieId", how="left")
print("Final Merged Data:")
df.head()

Final Merged Data:


Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId_x,rating,timestamp_x,userId_y,tag,timestamp_y
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,964982703.0,336.0,pixar,1139046000.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,964982703.0,474.0,pixar,1137207000.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,964982703.0,567.0,fun,1525286000.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,847434962.0,336.0,pixar,1139046000.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,847434962.0,474.0,pixar,1137207000.0


In [12]:
print(f"The Final dataframe has {df.shape[0]} rows and {df.shape[1]} columns")

The Final dataframe has 285783 rows and 11 columns
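The merged frame has 285,783 rows even though ratings.csv has only 100,836, because joining tags on `movieId` alone pairs each rating of a movie with each tag of that movie. The sketch below reproduces the effect on a few values taken from the output above and shows one possible alternative: collapsing tags to one row per movie before merging. This alternative is an assumption about what may be wanted, not the approach used in this notebook. The names `tags_per_movie` and `one_per_rating` are hypothetical.

```python
import pandas as pd

# Small stand-ins for ratings.csv and tags.csv (Toy Story rows only)
ratings = pd.DataFrame({"userId": [1, 5], "movieId": [1, 1], "rating": [4.0, 4.0]})
tags = pd.DataFrame({
    "userId": [336, 474, 567],
    "movieId": [1, 1, 1],
    "tag": ["pixar", "pixar", "fun"],
})

# Joining on movieId alone: 2 ratings x 3 tags = 6 rows
many_to_many = ratings.merge(tags, on="movieId", how="left")
print("rows after many-to-many merge:", len(many_to_many))

# Alternative: one row of de-duplicated tags per movie, then a many-to-one merge
tags_per_movie = (
    tags.groupby("movieId")["tag"]
    .agg(lambda s: "|".join(sorted(set(s))))
    .reset_index()
)
one_per_rating = ratings.merge(
    tags_per_movie, on="movieId", how="left", validate="many_to_one"
)
print("rows after aggregated merge:", len(one_per_rating))
```

Passing `validate="many_to_one"` makes pandas raise an error if the right-hand frame has duplicate keys, which guards against accidental row multiplication.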


### DATA CLEANING

#### MISSING VALUES

In [13]:
# Checking for missing values in each column
missing_values = df.isna().sum()

for column, count in missing_values.items():
    if count > 0:
        print(f"The {column} column has {count} missing values")

The tmdbId column has 13 missing values
The userId_x column has 21 missing values
The rating column has 21 missing values
The timestamp_x column has 21 missing values
The userId_y column has 52549 missing values
The tag column has 52549 missing values
The timestamp_y column has 52549 missing values


In [14]:
# Calculating percentage of missing values in each column
missing_percentage = df.isna().mean() * 100

missing_percentage = missing_percentage[missing_percentage > 0]

# A DataFrame with columns and percentage of missing values
missing_table = pd.DataFrame({
    'Columns': missing_percentage.index,
    '% of Missing Values': missing_percentage.values
})
print("Percentage of Missing Values")
missing_table

Percentage of Missing Values


Unnamed: 0,Columns,% of Missing Values
0,tmdbId,0.004549
1,userId_x,0.007348
2,rating,0.007348
3,timestamp_x,0.007348
4,userId_y,18.387728
5,tag,18.387728
6,timestamp_y,18.387728


In [15]:
# Drop rows with missing values in tmdbId, userId_x, rating, timestamp_x, userId_y, tag, or timestamp_y

df.dropna(subset = ['tmdbId','userId_x','timestamp_x','userId_y','tag','timestamp_y','rating'], inplace=True)
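Dropping rows with a missing `tag` removes about 18.4% of the merged rows, namely the ratings of movies that have no tags at all. If those ratings should be retained, one option is to fill the tag with a placeholder instead. The sketch below illustrates the difference on toy data. The placeholder `"no_tag"` and the frame `toy` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame: movie 2 has a rating but no tag
toy = pd.DataFrame({
    "movieId": [1, 1, 2],
    "rating": [4.0, 5.0, 3.5],
    "tag": ["pixar", "fun", np.nan],
})

# Option used in this notebook: drop rows without a tag (loses movie 2's rating)
dropped = toy.dropna(subset=["tag"])

# Possible alternative: keep the rating and mark the tag as absent
kept = toy.assign(tag=toy["tag"].fillna("no_tag"))

print(len(dropped), "rows after dropna;", len(kept), "rows after fillna")
```

Which option is preferable depends on whether the downstream model relies on tags or only on ratings.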

In [16]:
# Confirm that no missing values remain
missing_values1 = df.isna().sum()

for column, count in missing_values1.items():
    print(f"The {column} column has {count} missing values")

The movieId column has 0 missing values
The title column has 0 missing values
The genres column has 0 missing values
The imdbId column has 0 missing values
The tmdbId column has 0 missing values
The userId_x column has 0 missing values
The rating column has 0 missing values
The timestamp_x column has 0 missing values
The userId_y column has 0 missing values
The tag column has 0 missing values
The timestamp_y column has 0 missing values


#### DUPLICATES

In [17]:
duplicated_rows = df.duplicated().sum()
print(f'The DataFrame has {duplicated_rows} duplicated rows.')

The DataFrame has 0 duplicated rows.


#### CORRECTING DATA TYPES

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233213 entries, 0 to 285773
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   movieId      233213 non-null  int64  
 1   title        233213 non-null  object 
 2   genres       233213 non-null  object 
 3   imdbId       233213 non-null  int64  
 4   tmdbId       233213 non-null  float64
 5   userId_x     233213 non-null  float64
 6   rating       233213 non-null  float64
 7   timestamp_x  233213 non-null  float64
 8   userId_y     233213 non-null  float64
 9   tag          233213 non-null  object 
 10  timestamp_y  233213 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 21.4+ MB


In [21]:
# Converting user IDs and tmdbId to object data type
df[['userId_x', 'userId_y', 'tmdbId']] = df[['userId_x', 'userId_y', 'tmdbId']].astype('object')
# df.info()
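Casting a float column straight to `object` keeps the underlying float values (for example `862.0`). If the IDs are meant to behave as plain labels, one option is to cast to integer first and then to string. This is a sketch on toy values and assumes the columns no longer contain missing values, which holds after the `dropna` step.

```python
import pandas as pd

# Toy frame with float IDs, as produced by the left joins
ids = pd.DataFrame({"userId_x": [1.0, 5.0], "tmdbId": [862.0, 8844.0]})

id_cols = ["userId_x", "tmdbId"]

# float -> int removes the trailing ".0"; int -> str makes the IDs categorical labels
ids[id_cols] = ids[id_cols].astype("int64").astype(str)

print(ids)
```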

#### HANDLING OUTLIERS

In [27]:
# OUTLIERS IN RATING COLUMN
df.value_counts(['rating'])

rating
4.0       64781
5.0       63845
4.5       31502
3.0       28550
3.5       22895
2.0        7955
2.5        6488
1.0        3721
0.5        1908
1.5        1568
dtype: int64

In [48]:
import plotly.express as px

# Box plot to visualize the distribution of the rating column and flag outliers
fig = px.box(df, y='rating')

# Set the title and center it above the plot
fig.update_layout(
    title=dict(text='Box Plot of Rating with Outliers', x=0.5, y=0.95),
)

fig.show()

max_rating = df['rating'].max()
min_rating = df['rating'].min()
print(f"The maximum rating is {max_rating}")
print(f"The minimum rating is {min_rating}")

The maximum rating is 5.0
The minimum rating is 0.5
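A box plot marks points beyond 1.5 times the interquartile range (IQR) from the quartiles. The sketch below applies that rule to the rating distribution, rebuilt from the value counts printed above, to show which ratings it would flag. It uses pandas' default quantile method, which may differ slightly from the one plotly uses.

```python
import numpy as np
import pandas as pd

# Rating counts copied from the value_counts output above
counts = {4.0: 64781, 5.0: 63845, 4.5: 31502, 3.0: 28550, 3.5: 22895,
          2.0: 7955, 2.5: 6488, 1.0: 3721, 0.5: 1908, 1.5: 1568}

# Rebuild the rating column by repeating each value by its count
rating = pd.Series(np.repeat(list(counts.keys()), list(counts.values())))

# 1.5 * IQR rule
q1, q3 = rating.quantile(0.25), rating.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged = rating[(rating < lower) | (rating > upper)]
print(f"Q1={q1}, Q3={q3}, fences=({lower}, {upper}), flagged={len(flagged)}")
```

Any ratings flagged this way still lie within the valid 0.5 to 5.0 scale, so they are genuine low ratings rather than data errors, and removing them would bias the data toward positive ratings.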


#### REMOVING UNNECESSARY COLUMNS

In [50]:
# dropping timestamp_x & timestamp_y columns
df = df.drop(['timestamp_x', 'timestamp_y'], axis=1)

df.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId_x,rating,userId_y,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862,1,4.0,336,pixar
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862,1,4.0,474,pixar
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862,1,4.0,567,fun
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862,5,4.0,336,pixar
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862,5,4.0,474,pixar


### EXPLORATORY DATA ANALYSIS