# © Explore Data Science Academy

Honour Code
We, {NM_2_Avengers}, confirm - by submitting this document - that the solutions in this notebook are a result of our own work and that we abide by the EDSA honour code.

Non-compliance with the honour code constitutes a material breach of contract.


<a id="cont"></a>
# Table of Contents

<a href=#one>1. Introduction</a>

<a href=#two>2. Problem Statement</a>

<a href=#three>3. Importing Packages</a>

<a href=#four>4. Loading Data</a>

<a href=#five>5. Data Preprocessing</a>

<a href=#six>6. Exploratory Data Analysis (EDA)</a>

<a href=#seven>7. Feature Engineering</a>

<a href=#eight>8. Modeling</a>

<a href=#nine>9. Model Performance</a>

<a id = "one"></a>
## 1. Introduction
<a href=#cont>Back to Table of Contents</a>

If you have ever used a streaming platform such as Netflix, Showmax, or YouTube, you will have noticed that after you watch a movie, the platform begins suggesting more films and TV series of a similar kind. This is a recommender system at work. Recommender systems learn a user's viewing habits and offer relevant suggestions. They are economically and socially important in today's technology-driven world because they help people make good choices about the information they consume every day. This is particularly true of movie recommendations, where well-designed algorithms can guide viewers toward films they will enjoy among tens of thousands of options.

The task is to develop a collaborative filtering or content-based recommendation algorithm that can correctly forecast how a user would evaluate a film they haven't yet seen based on their past preferences.

A precise and reliable solution to this problem has considerable economic potential: users receive personalised suggestions, which builds affinity for the streaming platforms that make it easiest for their audience to find something to watch.

# TEAM

1. THATO RABODIBA
2. KOKETSO MAHLANGU
3. ZITHULELE MANYATHI
4. NONTOKOZO NDLOVU
5. THABATHA NOMPOKO
6. MINENHLE MAPHUMOLO

<a id = "two"></a>
## 2. Problem Statement
<a href=#cont>Back to Table of Contents</a>

The goal is to create a collaborative filtering system or content-based recommendation algorithm that can accurately predict a user's evaluation of a movie they haven't seen based on their past preferences.
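
To make the collaborative filtering idea concrete, here is a minimal sketch of item-based rating prediction on a toy ratings matrix. The matrix values, the `cosine` and `predict_rating` helpers, and the choice of cosine similarity are illustrative assumptions, not the model built later in this notebook.

```python
import numpy as np

# Toy user x movie ratings matrix (0 = not rated). Values are invented.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 4.0, 1.0],
    [1.0, 1.0, 2.0, 5.0],
    [5.0, 4.0, 5.0, 2.0],
])

def cosine(a, b):
    """Cosine similarity between two rating vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict_rating(ratings, user, movie):
    """Hypothetical helper: similarity-weighted average of the user's own ratings."""
    rated = np.flatnonzero(ratings[user])  # indices of movies this user has rated
    sims = np.array([cosine(ratings[:, movie], ratings[:, j]) for j in rated])
    if sims.sum() == 0:
        # No similar movies: fall back to the user's mean rating
        return float(ratings[user, rated].mean())
    return float(sims @ ratings[user, rated] / sims.sum())

# Predict how user 0 would rate movie 2, which they have not rated yet
prediction = predict_rating(ratings, user=0, movie=2)
print(round(prediction, 2))
```

Because the prediction is a weighted average with non-negative weights, it always falls between the lowest and highest rating the user has already given.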

In [5]:
# import comet_ml at the top of your file
import os

from comet_ml import Experiment


# Read the API key from the COMET_API_KEY environment variable instead of hardcoding it
experiment = Experiment(
  api_key=os.environ["COMET_API_KEY"],
  project_name="general",
  workspace="proudmamatoboys",
)

COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.com/proudmamatoboys/general/782fb1c9f0ef4a93983e860c224d6881
COMET INFO:   Uploads:
COMET INFO:     conda-environment-definition : 1
COMET INFO:     conda-info                   : 1
COMET INFO:     conda-specification          : 1
COMET INFO:     environment details          : 1
COMET INFO:     filename                     : 1
COMET INFO:     git metadata                 : 1
COMET INFO:     installed packages

<a id="three"></a>
## 3. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

In [None]:


# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Install packages here
# Packages for data processing
import numpy as np
import pandas as pd
import datetime
from sklearn import preprocessing
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from scipy.sparse import csr_matrix
import scipy as sp


# Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


# Packages for model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from time import time



# Input data files are available in the read-only "../input/" directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))





<a id="four"></a>
## 4. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [None]:
movies_df = pd.read_csv('movies.csv')
gs_df = pd.read_csv('genome_scores.csv')
gt_df = pd.read_csv('genome_tags.csv')
imdb_df = pd.read_csv('imdb_data.csv')
links_df = pd.read_csv('links.csv')
tags_df = pd.read_csv('tags.csv')
train_df = pd.read_csv('train.csv') 
test_df = pd.read_csv('test.csv')
sample_df = pd.read_csv('sample_submission.csv')

### Brief Description of the Datasets
- genome_scores.csv - A score mapping the strength of the relationship between movies and tag-related properties.
- genome_tags.csv - User-assigned tags for genome-related scores.
- imdb_data.csv - Additional movie metadata scraped from IMDB using the links.csv file.
- links.csv - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
- sample_submission.csv - Sample of the submission format for the hackathon.
- tags.csv - User-assigned tags for the movies within the dataset.
- test.csv - The test split of the dataset. Contains user and movie IDs with no rating data.
- train.csv - The training split of the dataset. Contains user and movie IDs with associated rating data.

In [None]:
train_df.head()

In [None]:
movies_df.head()

In [None]:
train_df.shape

In [None]:
movies_df.head(1)

<a id="five"></a>
## 5. Data Preprocessing
<a href=#cont>Back to Table of Contents</a>

In [None]:
movies_df.info() # Get concise summary of the movie dataset

In [None]:
movies_df.head() # Get the first 5 observations 

In [None]:
gt_df.info() # Get concise summary of the Genome_tag Dataframe 

In [None]:
gs_df.info() # Get concise summary of the Genome_score Dataframe 

In [None]:
imdb_df.info() # Get concise summary of the IMDB DataFrame

In [None]:
imdb_df.head() # Shows the first 5 observations

In [None]:
train_df.info() # Get the summary of the dataset's metadata

In [None]:
train_df.isnull().sum() # check if there are any null values

In [None]:
train_df.head() # show the first 5 observations 
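
The checks above are run one DataFrame at a time. As a hedged alternative, the sketch below collects the shape and missing-value count of several DataFrames into one table. The `summarise` helper and the toy frames (including the `runtime` column) are hypothetical and only stand in for the real files.

```python
import pandas as pd

def summarise(frames):
    """Hypothetical helper: one row per DataFrame with its shape and missing-value count."""
    rows = [
        {
            "name": name,
            "rows": len(df),
            "columns": df.shape[1],
            "missing_values": int(df.isnull().sum().sum()),
        }
        for name, df in frames.items()
    ]
    return pd.DataFrame(rows).set_index("name")

# Toy stand-ins for the real files (values and the 'runtime' column are invented)
toy_frames = {
    "train": pd.DataFrame({"userId": [1, 2], "movieId": [10, 20], "rating": [4.0, 3.5]}),
    "imdb": pd.DataFrame({"movieId": [10, 20], "runtime": [120.0, None]}),
}

summary = summarise(toy_frames)
print(summary)
```

In the notebook this could be called as `summarise({"train": train_df, "movies": movies_df, "imdb": imdb_df})`.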

<a id="six"></a>
## 6. Exploratory Data Analysis (EDA)
<a href=#cont>Back to Table of Contents</a>

In [None]:
# Display summary statistics of numerical features in train data
print(train_df.describe())

# Visualize the distribution of ratings for each user
user_rating_counts = train_df.groupby('userId')['rating'].count()
plt.hist(user_rating_counts, bins=50, edgecolor='black')
plt.title('Distribution of Ratings Count per User')
plt.xlabel('Number of Ratings')
plt.ylabel('Frequency')
plt.show()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.hist(train_df['rating'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Movie Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()



In [None]:
# Visualize the distribution of movie genres using a bar plot
# (assumes movies_df has a pipe-separated 'genres' column, as in MovieLens)
plt.figure(figsize=(10, 6))
genre_counts = movies_df['genres'].str.split('|').explode().value_counts().head(10)
genre_counts.plot(kind='bar', color='skyblue')
plt.title('Top 10 Movie Genres')
plt.xlabel('Genres')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Visualize the distribution of user-assigned tags using a pie chart
# (assumes tags_df has a 'tag' column)
plt.figure(figsize=(8, 8))
tag_counts = tags_df['tag'].value_counts().head(10)
tag_counts.plot(kind='pie', autopct='%1.1f%%', colors=['lightgreen', 'lightblue', 'lightcoral', 'lightskyblue', 'lightpink', 'lightyellow', 'lightgrey', 'lightseagreen', 'lightcyan', 'lightsteelblue'])
plt.title('Top 10 User-Assigned Tags')
plt.ylabel('')
plt.tight_layout()
plt.show()

In [None]:
print(f"There are {train_df['userId'].nunique()} users and {train_df['movieId'].nunique()} movies in the training data")

In [None]:
# Create a DataFrame with each user's average rating and the number of movies they have rated
train1 = (
    train_df.groupby('userId')['rating']
    .agg(avg_rating='mean', number_of_movies='count')
    .reset_index()
)

# Sort the data in descending order of the number of movies each user has rated
train1 = train1.sort_values('number_of_movies', ascending = False)

# Show the first 5 observations
train1.head()

In [None]:
# Initialize the plot with a set figure size
fig, ax = plt.subplots(1,2,figsize=(20, 10)) 

# Create a density plot of the average rating given by each user
sns.kdeplot(ax=ax[0], x='avg_rating', data=train1) 
ax[0].set_title("Density of the average rating per user",fontsize = 20)

# Set the tick labels to appear in non-scientific form
plt.ticklabel_format(style='plain', axis='y', useOffset=False) 

# Create a density plot of the number of movies rated per user
# (train1[1:] skips the first row, i.e. the user with the most ratings after sorting)
sns.kdeplot(ax=ax[1], x='number_of_movies', data=train1[1:]) 
ax[1].set_title("Density of the number of movies rated per user",fontsize = 20)
    
# Show the density plots
plt.show()

In [None]:
# Check for correlation between a user's average rating and the number of movies they have rated

# Create a scatter plot with a fitted regression line to visualise the relationship
sns.regplot(data = train1, y = "number_of_movies", x = "avg_rating",line_kws={"color": "red"})

# Show the scatter plot
plt.show()
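
The regression plot shows the trend visually. A correlation coefficient puts a number on it. The sketch below uses an invented per-user table with the same column names as `train1`. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which is an illustrative choice rather than something the notebook prescribes.

```python
import pandas as pd

# Toy per-user summary with the same column names as train1 (values are invented)
toy_users = pd.DataFrame({
    "avg_rating": [4.5, 4.0, 3.5, 3.0, 2.5],
    "number_of_movies": [10, 40, 120, 300, 900],
})

# Pearson measures linear association between the two columns
pearson = toy_users["avg_rating"].corr(toy_users["number_of_movies"])

# Spearman measures monotonic association: Pearson correlation of the ranks
spearman = toy_users["avg_rating"].rank().corr(toy_users["number_of_movies"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

With heavily skewed counts such as the number of ratings per user, the rank-based coefficient is usually less sensitive to a handful of very active users.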

In [None]:
train_df.rating.value_counts()

In [None]:
# Plot the number of ratings given at each rating value
fig, ax = plt.subplots(figsize=(20, 10)) 

sns.countplot(ax=ax, x='rating', data=train_df) 
ax.set_title("Number of ratings for each rating value")
plt.ticklabel_format(style='plain', axis='y', useOffset=False) 

# Annotate each bar with its count
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+50))
    
# Show the countplot
plt.show()

In [None]:
train_movies = train_df.merge(movies_df,on = 'movieId') # Merges the movie and train datasets
train_movies.drop(columns=['timestamp'],inplace=True) # Dropping the timestamp column
train_movies.head(10) # Shows the first 10 observations
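
Once ratings and movie metadata are merged, ratings can be broken down by genre. The sketch below uses a toy version of the merged frame. It assumes `genres` is a pipe-separated string column, as in the usual MovieLens layout, and all values are invented.

```python
import pandas as pd

# Toy version of the merged train_movies frame (values are invented;
# the pipe-separated 'genres' column is an assumption)
toy_train_movies = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 2.0, 5.0, 3.0],
    "genres": ["Action|Comedy", "Drama", "Action|Comedy", "Comedy|Drama"],
})

genre_ratings = (
    toy_train_movies
    .assign(genre=toy_train_movies["genres"].str.split("|"))  # list of genres per row
    .explode("genre")                                         # one row per (rating, genre)
    .groupby("genre")["rating"]
    .agg(["mean", "count"])
    .sort_values("count", ascending=False)
)
print(genre_ratings)
```

A movie with several genres contributes its rating to each of them, so the counts sum to more than the number of ratings.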

In [None]:
# Create a scatter plot to explore the relationship between each movie's
# average rating and the number of ratings it has received
movie_stats = train_df.groupby('movieId')['rating'].agg(['mean', 'count'])
plt.figure(figsize=(8, 6))
plt.scatter(movie_stats['mean'], movie_stats['count'], color='skyblue', alpha=0.6)
plt.title('Relationship between Average Movie Rating and Number of Ratings')
plt.xlabel('Average Movie Rating')
plt.ylabel('Number of Ratings')
plt.grid(True)
plt.show()


In [None]:
# Display links data
print(links_df.head())

# Check for missing values in links data
print(links_df.isnull().sum())