<img src="https://explore-datascience.net/images/images_admissions2/main-logo.jpg">

<img src="https://github.com/Explore-AI/Pictures/blob/master/sql_tmdb.jpg?raw=true" width=90%/>

# Streamlit-based Movie Recommender System

## Team 14

## Table of contents
1. [Introduction](#intro)
2. [Data Collection](#data)
3. [Data Preprocessing](#cleaning)
4. [Exploratory Data Analysis](#EDA)
5. [Feature Engineering And Selection](#features)
6. [Model Building And Evaluation](#model)
7. [Model Parameter Tuning](#evaluation)
8. [Conclusion](#conclusion)
9. [References](#references)
 

<a id="intro"></a>
## 1. **Introduction**

<a id="data"></a>
## 2. **Data Collection**

### **Import Libraries**

In [None]:
!pip install comet_ml
!pip install surprise

In [1]:
# import comet_ml at the top of your file
import os
from comet_ml import Experiment

# Create an experiment; the API key is read from the COMET_API_KEY
# environment variable instead of being hard-coded in the notebook
experiment = Experiment(
    api_key=os.environ["COMET_API_KEY"],
    project_name="streamlit-based-movie-recommender-system",
    workspace="kwanda2426",
)

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/kwanda2426/streamlit-based-movie-recommender-system/a57a1063ab9a43e8b0ccc7a4633d6899



We use Comet to track our experiments, saving the parameters and results of each run.

In [1]:

# Data manipulation
import pandas as pd
import numpy as np

# Dates and times
import datetime

# Visualisation
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

# Content-based filtering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

# Collaborative filtering
from surprise import Reader
from surprise import Dataset
from surprise import SVD
from surprise.model_selection import train_test_split
from surprise.accuracy import rmse

# Saving the model
import pickle

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Make sure that we can see all rows and columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### **Loading Data**

Each CSV file is loaded into a pandas DataFrame with the `read_csv` function.

In [2]:
# imdb
imdb_df = pd.read_csv('../input/edsa-movie-recommendation-wilderness/imdb_data.csv')

# movies
movies_df = pd.read_csv('../input/edsa-movie-recommendation-wilderness/movies.csv')

# train 
train_df = pd.read_csv('../input/edsa-movie-recommendation-wilderness/train.csv')

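# test
test_df = pd.read_csv('../input/edsa-movie-recommendation-wilderness/test.csv')

# tags (file name assumed to follow the same pattern; tags_df is used in the overview below)
tags_df = pd.read_csv('../input/edsa-movie-recommendation-wilderness/tags.csv')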
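
The train table holds about ten million rows, so the default 64-bit dtypes add up quickly. The following is a minimal sketch, not the notebook's own loading code, of passing `dtype` to `read_csv` to load narrower types. The inline `SAMPLE_CSV` string and the `TRAIN_DTYPES` mapping are illustrative assumptions; in particular it assumes the ids fit in 32 bits.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv: same columns, a few rows only
SAMPLE_CSV = """userId,movieId,rating,timestamp
5163,57669,4.0,1518349992
106343,5,4.5,1206238739
146790,5459,5.0,1076215539
"""

# Assumed dtypes: the ids fit in 32 bits and ratings need only float32
TRAIN_DTYPES = {
    "userId": np.int32,
    "movieId": np.int32,
    "rating": np.float32,
    "timestamp": np.int64,
}

default_df = pd.read_csv(io.StringIO(SAMPLE_CSV))
compact_df = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype=TRAIN_DTYPES)

# Compare the memory footprint of the two versions (index excluded)
default_bytes = default_df.memory_usage(index=False).sum()
compact_bytes = compact_df.memory_usage(index=False).sum()
print(f"default: {default_bytes} bytes, compact: {compact_bytes} bytes")
```

The same `dtype` argument could be passed when reading the real file; whether the saving matters depends on the memory available in the environment.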

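|   | userId | movieId | tag | timestamp |
|---|--------|---------|-----|-----------|
| 0 | 3 | 260 | classic | 1439472355 |
| 1 | 3 | 260 | sci-fi | 1439472256 |
| 2 | 4 | 1732 | dark comedy | 1573943598 |
| 3 | 4 | 1732 | great dialogue | 1573943604 |
| 4 | 4 | 7569 | so bad it's good | 1573943455 |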


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093360 entries, 0 to 1093359
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   userId     1093360 non-null  int64 
 1   movieId    1093360 non-null  int64 
 2   tag        1093344 non-null  object
 3   timestamp  1093360 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 33.4+ MB


|   | movieId | tagId | relevance |
|---|---------|-------|-----------|
| 0 | 1 | 1 | 0.02875 |
| 1 | 1 | 2 | 0.02375 |
| 2 | 1 | 3 | 0.0625 |
| 3 | 1 | 4 | 0.07575 |
| 4 | 1 | 5 | 0.14075 |


|   | tagId | tag |
|---|-------|-----|
| 0 | 1 | 007 |
| 1 | 2 | 007 (series) |
| 2 | 3 | 18th century |
| 3 | 4 | 1920s |
| 4 | 5 | 1930s |


|   | movieId | imdbId | tmdbId |
|---|---------|--------|--------|
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| 4 | 5 | 113041 | 11862.0 |


|   | movieId | title_cast | director | runtime | budget | plot_keywords |
|---|---------|------------|----------|---------|--------|---------------|
| 0 | 1 | Tom Hanks\|Tim Allen\|Don Rickles\|Jim Varney\|Wal... | John Lasseter | 81.0 | \$30,000,000 | toy\|rivalry\|cowboy\|cgi animation |
| 1 | 2 | Robin Williams\|Jonathan Hyde\|Kirsten Dunst\|Bra... | Jonathan Hensleigh | 104.0 | \$65,000,000 | board game\|adventurer\|fight\|game |
| 2 | 3 | Walter Matthau\|Jack Lemmon\|Sophia Loren\|Ann-Ma... | Mark Steven Johnson | 101.0 | \$25,000,000 | boat\|lake\|neighbor\|rivalry |
| 3 | 4 | Whitney Houston\|Angela Bassett\|Loretta Devine\|... | Terry McMillan | 124.0 | \$16,000,000 | black american\|husband wife relationship\|betra... |
| 4 | 5 | Steve Martin\|Diane Keaton\|Martin Short\|Kimberl... | Albert Hackett | 106.0 | \$30,000,000 | fatherhood\|doberman\|dog\|mansion |


|   | movieId | title | genres |
|---|---------|-------|--------|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [19]:
train_df.head()

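|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|
| 0 | 5163 | 57669 | 4.0 | 1518349992 |
| 1 | 106343 | 5 | 4.5 | 1206238739 |
| 2 | 146790 | 5459 | 5.0 | 1076215539 |
| 3 | 106362 | 32296 | 2.0 | 1423042565 |
| 4 | 9041 | 366 | 3.0 | 833375837 |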


In [20]:
test_df.head()

|   | userId | movieId |
|---|--------|---------|
| 0 | 1 | 2011 |
| 1 | 1 | 4144 |
| 2 | 1 | 5767 |
| 3 | 1 | 6711 |
| 4 | 1 | 7318 |


### Data Overview


This section gives an overview of the datasets used most in the rest of the notebook: the movies, tags, train, test and imdb_data datasets.
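
The checks in the cells that follow repeat the same steps for every dataset. As a rough sketch of how they could be gathered into one place, the hypothetical helper `summarise` below returns the row count, column count, total missing values and unique counts for any DataFrame. The helper name and the toy frame are assumptions made for illustration.

```python
import pandas as pd


def summarise(df: pd.DataFrame) -> dict:
    """Return the basic facts reported for each dataset in this section."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        # Total number of missing cells across the whole frame
        "missing": int(df.isnull().sum().sum()),
        # Distinct values per column; a missing value counts as one value,
        # which matches len(df[col].unique())
        "unique": {col: len(df[col].unique()) for col in df.columns},
    }


# Toy frame standing in for tags_df (values are illustrative)
toy_df = pd.DataFrame(
    {
        "userId": [3, 3, 4, 4],
        "movieId": [260, 260, 1732, 1732],
        "tag": ["classic", "sci-fi", None, "dark comedy"],
    }
)

summary = summarise(toy_df)
print(summary)
```

Calling `summarise(movies_df)`, `summarise(tags_df)` and so on would give the same numbers as the individual cells, without the copy-and-paste.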

**movies dataset**

In [22]:
# Check what our movies dataset looks like
print("Rows    : ", movies_df.shape[0])

print("Columns : ", movies_df.shape[1])

print("\nMissing values: ", movies_df.isnull().sum().values.sum())

print("\nInformation about the data: ")
print("  \n", movies_df.info())
 
print("\nAbout the data: \n")

# Check how many unique items are in each column of the dataframe
for col_name in movies_df.columns:
    unique_out = len(movies_df[col_name].unique())
    print(f"Feature '{col_name}' has {unique_out} unique categories") 

Rows    :  62423
Columns :  3

Missing values:  0

Information about the data: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB
  
 None

About the data: 

Feature 'movieId' has 62423 unique categories
Feature 'title' has 62325 unique categories
Feature 'genres' has 1639 unique categories


**tags dataset**

In [23]:
# Check what our tags dataset looks like
print("Rows    : ", tags_df.shape[0])

print("Columns : ", tags_df.shape[1])

print("\nMissing values: ", tags_df.isnull().sum().values.sum())

print("\nInformation about the data: ")
print("  \n", tags_df.info())
 
print("\nAbout the data: \n")

# Check how many unique items are in each column of the dataframe
for col_name in tags_df.columns:
    unique_out = len(tags_df[col_name].unique())
    print(f"Feature '{col_name}' has {unique_out} unique categories") 

Rows    :  1093360
Columns :  4

Missing values:  16

Information about the data: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093360 entries, 0 to 1093359
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   userId     1093360 non-null  int64 
 1   movieId    1093360 non-null  int64 
 2   tag        1093344 non-null  object
 3   timestamp  1093360 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 33.4+ MB
  
 None

About the data: 

Feature 'userId' has 14592 unique categories
Feature 'movieId' has 45251 unique categories
Feature 'tag' has 73051 unique categories
Feature 'timestamp' has 907730 unique categories


**train dataset**

In [24]:
# Check what our train dataset looks like
print("Rows    : ", train_df.shape[0])

print("Columns : ", train_df.shape[1])

print("\nMissing values: ", train_df.isnull().sum().values.sum())

print("\nInformation about the data: ")
print("  \n", train_df.info())
 
print("\nAbout the data: \n")

# Check how many unique items are in each column of the dataframe
for col_name in train_df.columns:
    unique_out = len(train_df[col_name].unique())
    print(f"Feature '{col_name}' has {unique_out} unique categories") 

Rows    :  10000038
Columns :  4

Missing values:  0

Information about the data: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000038 entries, 0 to 10000037
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 305.2 MB
  
 None

About the data: 

Feature 'userId' has 162541 unique categories
Feature 'movieId' has 48213 unique categories
Feature 'rating' has 10 unique categories
Feature 'timestamp' has 8795101 unique categories


**test dataset**

In [25]:
# Check what our test dataset looks like
print("Rows    : ", test_df.shape[0])

print("Columns : ", test_df.shape[1])

print("\nMissing values: ", test_df.isnull().sum().values.sum())

print("\nInformation about the data: ")
print("  \n", test_df.info())
 
print("\nAbout the data: \n")

# Check how many unique items are in each column of the dataframe
for col_name in test_df.columns:
    unique_out = len(test_df[col_name].unique())
    print(f"Feature '{col_name}' has {unique_out} unique categories") 

Rows    :  5000019
Columns :  2

Missing values:  0

Information about the data: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000019 entries, 0 to 5000018
Data columns (total 2 columns):
 #   Column   Dtype
---  ------   -----
 0   userId   int64
 1   movieId  int64
dtypes: int64(2)
memory usage: 76.3 MB
  
 None

About the data: 

Feature 'userId' has 162350 unique categories
Feature 'movieId' has 39643 unique categories


**imdb_data dataset**

In [26]:
# Check what our imdb_data dataset looks like
print("Rows    : ", imdb_df.shape[0])

print("Columns : ", imdb_df.shape[1])

print("\nMissing values: ", imdb_df.isnull().sum().values.sum())

print("\nInformation about the data: ")
print("  \n", imdb_df.info())
 
print("\nAbout the data: \n")

# Check how many unique items are in each column of the dataframe
for col_name in imdb_df.columns:
    unique_out = len(imdb_df[col_name].unique())
    print(f"Feature '{col_name}' has {unique_out} unique categories") 

Rows    :  27278
Columns :  6

Missing values:  62481

Information about the data: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movieId        27278 non-null  int64  
 1   title_cast     17210 non-null  object 
 2   director       17404 non-null  object 
 3   runtime        15189 non-null  float64
 4   budget         7906 non-null   object 
 5   plot_keywords  16200 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 1.2+ MB
  
 None

About the data: 

Feature 'movieId' has 27278 unique categories
Feature 'title_cast' has 17144 unique categories
Feature 'director' has 11787 unique categories
Feature 'runtime' has 275 unique categories
Feature 'budget' has 1363 unique categories
Feature 'plot_keywords' has 16009 unique categories


<a id="cleaning"></a>
## 3. **Data Preprocessing**

Data preprocessing takes raw data and transforms it into a clean, usable format. It covers data cleaning, integration, transformation, reduction and discretization. Our preprocessing plan includes the following steps:

- **Data cleaning**

- **Table merging process**

- **Dealing with missing values**


### Data cleaning

Data cleaning matters because it improves the quality of the data and, with it, the reliability of everything built on top of it. Cleaning removes outdated or incorrect information. We aim to identify inaccurate, incomplete, or unreasonable data and then improve quality by correcting the detected errors and omissions.
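
As an illustration of what "inaccurate or unreasonable" data could look like in a ratings table, the sketch below flags repeated user/movie pairs and ratings outside an assumed valid range. The toy frame, the 0.5 to 5.0 rating scale and the choice to drop rather than correct the rows are all assumptions, not steps taken elsewhere in this notebook.

```python
import pandas as pd

# Toy ratings standing in for train_df (values are illustrative)
ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 2, 3],
        "movieId": [10, 10, 10, 20, 30],
        "rating": [4.0, 4.0, 5.5, 3.0, 0.0],
    }
)

# Assumption: valid ratings lie between 0.5 and 5.0 inclusive
MIN_RATING, MAX_RATING = 0.5, 5.0

# A user should rate a given movie at most once
duplicate_mask = ratings.duplicated(subset=["userId", "movieId"], keep="first")

# Ratings outside the assumed scale are treated as unreasonable
out_of_range_mask = ~ratings["rating"].between(MIN_RATING, MAX_RATING)

# Keep only rows that pass both checks
clean = ratings[~duplicate_mask & ~out_of_range_mask].reset_index(drop=True)
print(f"duplicates: {duplicate_mask.sum()}, out of range: {out_of_range_mask.sum()}")
print(clean)
```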

In [3]:
# create copies of the dataframes

imdb_df = imdb_df.copy()
movies_df = movies_df.copy()
train_df = train_df.copy()
test_df = test_df.copy()

#### Removing noise

In [None]:
# genres
movies_df['genres'] = movies_df['genres'].str.split('|')

# title_cast
imdb_df['title_cast'] = imdb_df['title_cast'].str.split('|') 

# plot keywords
imdb_df['plot_keywords'] = imdb_df['plot_keywords'].str.split('|')

# Dropping the budget column
imdb_df.drop('budget', axis = 1, inplace = True)
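
To make the effect of `str.split('|')` concrete, here is a small sketch on a toy frame (the values are illustrative). It shows that each pipe-delimited string becomes a list while missing values stay missing, and that `explode` can then be used to count individual genres.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movies_df (values are illustrative)
toy = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "genres": ["Adventure|Children|Fantasy", "Comedy", np.nan],
    }
)

# Split the pipe-delimited string into a list; missing values stay missing
toy["genres"] = toy["genres"].str.split("|")

# Count how often each genre occurs by giving every list element its own row
genre_counts = toy["genres"].explode().value_counts()

print(toy)
print(genre_counts)
```

Because missing entries remain `NaN` rather than becoming empty lists, any later `fillna` on these columns mixes lists with whatever fill value is chosen.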

#### Merging tables

In [None]:
# Merging datasets
data = pd.merge(movies_df, imdb_df, how = 'left', on = 'movieId')
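
A left merge keeps every movie, including those with no row in the imdb table. The sketch below, on toy frames with illustrative values, uses the `indicator` and `validate` options of `pd.merge` to count such movies and to guard against duplicated keys. Using these options here is a suggestion, not part of the original merge.

```python
import pandas as pd

# Toy frames standing in for movies_df and imdb_df (values are illustrative)
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
imdb = pd.DataFrame({"movieId": [1, 3], "director": ["X", "Y"]})

# validate="one_to_one" raises if movieId is duplicated on either side;
# indicator=True adds a _merge column recording where each row came from
merged = pd.merge(
    movies, imdb, how="left", on="movieId", validate="one_to_one", indicator=True
)

# Movies with no imdb row show up as "left_only"
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged)
print(f"movies without imdb data: {unmatched}")
```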

### Dealing with missing values

In [None]:
# Percentage of missing values
(data.isnull().sum()/len(data))*100

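In the raw **imdb_data** table (27,278 rows), **title_cast** is missing about **36.9%** of its values, **director** is missing **36.2%**, **runtime** is missing **44.3%**, **budget** is missing **71.0%** and **plot_keywords** is missing **40.6%**. In the merged table the shares are higher still, because every movie without a matching imdb_data row is missing all of these columns.
Because the **budget** column is missing most of its data, **we can't make a reliable analysis on it**, which is why it is dropped in the *Removing noise* step.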
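
The percentage calculation can be written a little more compactly with `isnull().mean()`. The sketch below runs on a toy frame with illustrative values; `DROP_THRESHOLD` is a hypothetical cut-off chosen only for the example, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged data (values are illustrative)
toy = pd.DataFrame(
    {
        "movieId": [1, 2, 3, 4],
        "director": ["X", None, "Y", None],
        "runtime": [81.0, np.nan, np.nan, np.nan],
    }
)

# isnull().mean() is the fraction of missing rows per column
missing_pct = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)

# Hypothetical cut-off: flag columns missing more than 70% of their values
DROP_THRESHOLD = 70.0
to_drop = missing_pct[missing_pct > DROP_THRESHOLD].index.tolist()

print(missing_pct)
print("candidates to drop:", to_drop)
```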

In [4]:
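# fill nan in text data
# (assign the result back; calling fillna(inplace=True) on a selected
# column is deprecated in recent pandas and may not update `data`)

# title cast
data['title_cast'] = data['title_cast'].fillna('no cast')

# plot key_words
data['plot_keywords'] = data['plot_keywords'].fillna('no keywords')

# director
data['director'] = data['director'].fillna('no director')

# runtime: fill with the median, rounded to one decimal place
data['runtime'] = data['runtime'].fillna(round(data['runtime'].median(), 1))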
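
The same imputation can also be expressed as a single call by passing a column-to-value mapping to `DataFrame.fillna`. This is a minimal sketch on a toy frame with illustrative values; the `fill_values` name is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged data (values are illustrative)
toy = pd.DataFrame(
    {
        "director": ["John Lasseter", None, "Albert Hackett"],
        "runtime": [81.0, np.nan, 106.0],
    }
)

# One fill value per column; the median is computed before filling
fill_values = {
    "director": "no director",
    "runtime": round(toy["runtime"].median(), 1),
}

# fillna returns a new frame, so the result is assigned to a new name
filled = toy.fillna(value=fill_values)
print(filled)
```

The median is used for `runtime` because it is less sensitive than the mean to a few very long or very short films.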

<a id="EDA"></a>
## 4. **Exploratory Data Analysis**

<a id="features"></a>
## 5. **Feature Engineering And Selection**

<a id="model"></a>
## 6. **Model Building And Evaluation**

### **Content-based Filtering**

### **Collaborative Filtering**

<a id="evaluation"></a>
## 7. **Model Parameter Tuning**

In [14]:
params = {'n_neighbors' : 3,
          'model_name' : 'KNN'}


In [15]:
# log our parameters and results

experiment.log_parameters(params)

# experiment.log_metrics(metrics)

In [None]:
# ending the experiment

experiment.end()

<a id="conclusion"></a>
## 8. **Conclusion**

<a id="references"></a>
## 9. **References**