# **Comet Experiment**

In [1]:
# Install comet_ml if the import below fails because the package is missing
# !pip install comet_ml

In [2]:
"""
# import comet_ml at the top of your file
from comet_ml import Experiment

# Create an experiment with your api key:
experiment = Experiment(
    api_key="ZBRB8H2ncCGZsUoS6CqVIAr0y",
    project_name="general",
    workspace="knetshiongolwe",
)

"""

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/knetshiongolwe/general/a6513bd1eea24a968832a738525b06dd



<a id="section-one"></a>

# **Import libraries and datasets**

We will be working with the well-known Surprise (Simple Python RecommendatIon System Engine) library. Surprise is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

Below are all the libraries used throughout this notebook.

In [15]:
# data analysis libraries
import pandas as pd
import numpy as np

# visualisation libraries
from matplotlib import pyplot as plt
import seaborn as sns
from numpy.random import RandomState


# Notebook styling
%matplotlib inline
sns.set()


# ML Models
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate
from surprise import CoClustering, SVDpp
from surprise.accuracy import rmse
from surprise import accuracy

# ML Pre processing
from surprise.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hyperparameter tuning
from surprise.model_selection import GridSearchCV

# High performance hyperparameter tuning
#from tune_sklearn import TuneSearchCV
#import warnings
#warnings.filterwarnings("ignore")

# **Loading data**
<a id="section-two"></a>

We will load all the dataframes that we want to work with.

In [16]:
# Load the CSV files used in the notebook into pandas dataframes.
train = pd.read_csv('./data/train.csv')
test_df = pd.read_csv('./data/test.csv')
df_movies = pd.read_csv('./data/movies.csv')
# df_samp = pd.read_csv('/kaggle/input/edsa-recommender-system-predict/sample_submission.csv')
# df_imdb = pd.read_csv('/kaggle/input/edsa-recommender-system-predict/imdb_data.csv')
# df_gtags = pd.read_csv("/kaggle/input/edsa-recommender-system-predict/genome_tags.csv")
# df_scores = pd.read_csv("/kaggle/input/edsa-recommender-system-predict/genome_scores.csv")
# df_tags = pd.read_csv("/kaggle/input/edsa-recommender-system-predict/tags.csv")
# df_links = pd.read_csv("/kaggle/input/edsa-recommender-system-predict/links.csv")


<a id="section-three"></a>
# **Evaluating Data**

Here is the data that was given to us.

**Supplied Files**

* train.csv - The training split of the dataset. Contains user and movie IDs with associated rating data.

*Description of the data that is given to us*

In [17]:
#viewing training data
train.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


Train:

* userId : Identifier for users
* movieId : Identifier for movies used
* rating : Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
* timestamp : Represents seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
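
As a small illustration of the timestamp format described above, the sketch below converts the first `timestamp` value shown in `train.head()` into a readable UTC date. It uses only the Python standard library; the variable names are chosen for illustration.

```python
from datetime import datetime, timezone

# First timestamp shown in train.head(): seconds since 1970-01-01 00:00:00 UTC
ts = 1260759144

# Convert the Unix timestamp to a timezone-aware datetime in UTC
rated_at = datetime.fromtimestamp(ts, tz=timezone.utc)
print(rated_at.isoformat())
```

The same conversion can be applied to a whole column in pandas if rating dates are ever needed as features.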

In [18]:
#Viewing movies data
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Movies:

* movieId : Identifier for movies used

* title : These were entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

* genres: Genres are a pipe-separated list, and are selected from the following:

    * Action
    * Adventure
    * Animation
    * Children
    * Comedy
    * Crime
    * Documentary
    * Drama
    * Fantasy
    * Film-Noir
    * Horror
    * Musical
    * Mystery
    * Romance
    * Sci-Fi
    * Thriller
    * War
    * Western
    * (no genres listed)
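
Because the genres are stored as one pipe-separated string per movie, they usually need to be split before they can be counted or compared. A minimal sketch, using the five rows shown in `df_movies.head()` and plain Python (the names `rows` and `genre_counts` are illustrative):

```python
from collections import Counter

# The five movies shown in df_movies.head(): (title, pipe-separated genres)
rows = [
    ("Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
    ("Jumanji (1995)", "Adventure|Children|Fantasy"),
    ("Grumpier Old Men (1995)", "Comedy|Romance"),
    ("Waiting to Exhale (1995)", "Comedy|Drama|Romance"),
    ("Father of the Bride Part II (1995)", "Comedy"),
]

# Split each genre string on the pipe and count how often each genre occurs
genre_counts = Counter(
    genre for _, genres in rows for genre in genres.split("|")
)
print(genre_counts.most_common(3))
```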

<a id="section-four"></a>
# **Data Preprocessing**
**Preparing raw data:**

We will first prepare this raw data to make it suitable for our machine learning model. This is a crucial step when creating a machine learning model.

<a id="subsection-one"></a>
# **Checking for missing values column wise**

**Handling Missing Data:**

In our dataset, there may be some missing values. We cannot train our model with a dataset that contains missing values. So we have to check if our dataset has missing values.


In [19]:
#check for missing values
train.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

<a id="subsection-two"></a>
# **Checking for duplicate records**
**Checking Duplicate Values:**

At times our dataset may contain duplicated records, which are unnecessary and must be removed. Before removing duplicates, we first check whether we have any, using the code below.

In [20]:
#check duplicates
dup_bool = train.duplicated(['userId','movieId','rating'])

#display duplicates
print("Number of duplicate records:",sum(dup_bool))

Number of duplicate records: 0


<a id="subsection-three"></a>
# **Creating a copy**

We will copy our train data into `df_train` and look at the top 5 records in the dataframe.

In [21]:
df = train.copy()

In [22]:
#create a copy of the train data
df_train = train.copy()

#display top 5 records
df_train.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


## **Evaluating Length of Unique Values**

In [23]:
# Count the unique users and the unique movies
len(df_train['userId'].unique()), len(df_train['movieId'].unique())

(671, 9066)

In [24]:
#view movies
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## **Joining Datasets**

In [25]:
# Merge
df_merge1 = df_train.merge(df_movies, on = 'movieId')
df_merge1.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,31,2.5,1260759144,Dangerous Minds (1995),Drama
1,7,31,3.0,851868750,Dangerous Minds (1995),Drama
2,31,31,4.0,1273541953,Dangerous Minds (1995),Drama
3,32,31,4.0,834828440,Dangerous Minds (1995),Drama
4,36,31,3.0,847057202,Dangerous Minds (1995),Drama


# **Collaborative Filtering**

**What Is Collaborative Filtering?**

Collaborative filtering is a technique that can filter out items that a user might like on the basis of reactions by similar users.

It works by searching a large group of people and finding a smaller set of users with tastes similar to a particular user. It looks at the items they like and combines them to create a ranked list of suggestions.

To be more precise, it is based on the similarity in preferences, tastes and choices of two users. For example, if user A likes movies 1, 2 and 3 and user B likes movies 2, 3 and 4, this implies that they have similar interests, so user A should like movie 4 and user B should like movie 1.
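
The example with users A and B can be sketched in a few lines of plain Python. This is only a toy illustration of the idea, not the algorithm used later in this notebook: the Jaccard overlap used as the similarity measure and the helper names `jaccard` and `recommend` are assumptions made for the sketch.

```python
# Movies liked by each user, as in the example above
likes = {"A": {1, 2, 3}, "B": {2, 3, 4}}


def jaccard(a, b):
    """Share of liked movies that two users have in common."""
    return len(a & b) / len(a | b)


def recommend(user, likes):
    """Rank movies the user has not seen by the similarity of the users who like them."""
    seen = likes[user]
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        sim = jaccard(seen, items)
        for item in items - seen:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)


print(recommend("A", likes))  # movies suggested to user A
print(recommend("B", likes))  # movies suggested to user B
```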


**Why Do We Consider Collaborative Filtering Over Content-Based Filtering?**

We prefer a collaborative filtering recommender engine over content-based filtering because it is able to do feature learning on its own; in other words, it can learn which features to use.

**Advantages of Collaborative filtering:**

Given that we find collaborative filtering better than content-based filtering, we give a few advantages to support the argument.

* Takes other users' ratings into consideration.
* Doesn't need to study or extract information from the recommended item.
* It adapts to the user's interests, which might change over time.

**About Collaborative Filtering Datasets:**

Note that in order to implement this algorithm, or any recommendation algorithm, we need a dataset that is structured in a specific format. This data should contain a set of items and a set of users who have reacted to some of the items.

While working with such data, you’ll mostly see it in the form of a matrix consisting of the reactions given by a set of users to some items from a set of items. Each row would contain the ratings given by a user, and each column would contain the ratings received by an item. A matrix with five users and five items could look like this:


**Rating Matrix:**


![](https://files.realpython.com/media/rating-matrix.04153775e4c1.jpg)

The matrix shows five users who have rated some of the items on a scale of 1 to 5. For example, the first user has given a rating of 4 to the third item. In most cases, the cells in the matrix are empty, as users only rate a few items. It’s highly unlikely for every user to rate or react to every item available. A matrix with mostly empty cells is called sparse, and the opposite (a mostly filled matrix) is called dense.
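
To make the idea of a sparse rating matrix concrete, the sketch below builds a five-by-five matrix from a handful of ratings and measures how empty it is. The rating values are hypothetical (only the first user's rating of 4 for the third item follows the description above), and the names `ratings`, `matrix` and `sparsity` are chosen for illustration.

```python
# Hypothetical ratings: (user, item) -> rating on a 1 to 5 scale
ratings = {
    ("U1", "I1"): 5, ("U1", "I3"): 4,
    ("U2", "I2"): 3, ("U2", "I4"): 2,
    ("U3", "I1"): 1, ("U3", "I5"): 5,
    ("U4", "I3"): 2, ("U4", "I4"): 4,
    ("U5", "I2"): 5, ("U5", "I5"): 3,
}
users = [f"U{n}" for n in range(1, 6)]
items = [f"I{n}" for n in range(1, 6)]

# One row per user, one column per item; None marks an unrated cell
matrix = [[ratings.get((u, i)) for i in items] for u in users]

# Sparsity is the share of cells that hold no rating
n_filled = sum(value is not None for row in matrix for value in row)
sparsity = 1 - n_filled / (len(users) * len(items))

for user, row in zip(users, matrix):
    print(user, ["-" if value is None else value for value in row])
print(f"sparsity: {sparsity:.0%}")
```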

**How do you measure the accuracy of the ratings you calculate?**

Essentially, there are many approaches, but we will explain the main one needed for this project: the Root Mean Square Error (RMSE), in which you predict ratings for a test dataset of user-item pairs whose rating values are already known. The difference between the known value and the predicted value is the error. Square all the error values for the test set, find the average (or mean), and then take the square root of that average to get the RMSE.

![](https://www.analyticsvidhya.com/wp-content/uploads/2016/02/rmse.png)

Another metric to measure the accuracy is Mean Absolute Error (MAE), in which you find the magnitude of error by finding its absolute value and then taking the average of all error values.
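
Both metrics follow directly from the descriptions above and can be sketched in plain Python. The `actual` and `predicted` ratings below are made-up values for illustration only; later in the notebook the Surprise library is what actually produces the predictions.

```python
import math


def rmse(actual, predicted):
    """Square the errors, average them, then take the square root."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(actual, predicted):
    """Average of the absolute errors."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical known ratings and model predictions for four user-item pairs
actual = [4.0, 4.5, 3.0, 4.0]
predicted = [3.0, 3.5, 4.0, 4.0]

print("RMSE:", rmse(actual, predicted))
print("MAE: ", mae(actual, predicted))
```

Because the errors are squared before averaging, RMSE penalises large errors more heavily than MAE does.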

However we will be focusing on the RMSE for our predictions.

Before diving deep into the code, we would like to clarify the type of collaborative filtering we are going to implement.

Recommender systems are divided into three branches, one of which is collaborative filtering; the figure below gives a clear breakdown.

![](https://www.seoclerk.com/pics/want61009-1nSWOn1525162745.jpg)

You will notice that collaborative filtering consists of two filtering techniques:

* **Model-based Technique**
* **Memory-based filtering**

We will give a short description of these techniques. 

* **Model-based Technique**
Model-based collaborative filtering algorithms provide item recommendations by first developing a model of user ratings. With these systems you build a model from user ratings and then make recommendations based on that model. This offers speed and scalability that are not available when you are forced to refer back to the entire dataset to make a prediction.

* **Memory based filtering**
Memory-based methods rely heavily on simple similarity measures (cosine similarity, Pearson correlation and more) to match similar people or items together.
These consist of two methods, namely **item-based** and **user-based** collaborative filtering.

The figure below defines the two filtering methods.

![](https://cdn-images-1.medium.com/max/1600/1*7uW5hLXztSu_FOmZOWpB6g.png)
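
The two similarity measures named for memory-based filtering can be sketched as follows. This is a minimal, standard-library illustration that compares two users on the movies they have both rated; the ratings and the helper names `cosine` and `pearson` are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equally long rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])


# Hypothetical ratings keyed by movieId
user_a = {1: 5.0, 2: 3.0, 3: 4.0, 7: 2.0}
user_b = {1: 4.0, 2: 2.0, 3: 3.0, 9: 5.0}

# User-based filtering compares users on the movies both of them rated
common = sorted(user_a.keys() & user_b.keys())
vec_a = [user_a[m] for m in common]
vec_b = [user_b[m] for m in common]

print("cosine :", cosine(vec_a, vec_b))
print("pearson:", pearson(vec_a, vec_b))
```

Item-based filtering uses the same measures, but compares the rating vectors of two movies across the users who rated both.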

### **Loading as Surprise Dataframe**

We will use the `Dataset` module to load the pandas dataframe used in this experiment. The `Reader` class is used to parse a file containing ratings data; by default it expects each rating to be stored on a separate line in the order user, item (movie) and rating. When loading from a dataframe, only its `rating_scale` is used.

In [26]:
reader = Reader(rating_scale=(0.5, 5))

data = Dataset.load_from_df(df_train[['userId', 'movieId', 'rating']], reader)

### **Splitting into Train and Test Set**

Another way to sample a trainset and a testset, without using a cross-validation procedure, is a train-test split with a given test size, evaluated with the accuracy metric of choice.

We use a random trainset and testset, where the testset holds only 15% of the ratings.
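
As a rough picture of what a random split does, the plain-Python sketch below shuffles a list of ratings and sets aside 15% of them for testing. It is not how Surprise implements its split; the function name `split_ratings`, the fixed seed and the generated ratings are assumptions made for the illustration.

```python
import random


def split_ratings(ratings, test_size=0.15, seed=42):
    """Shuffle the ratings and hold out a fraction of them as the test set."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# 100 hypothetical (userId, movieId, rating) triples
all_ratings = [(u, m, 3.0) for u in range(10) for m in range(10)]

train_part, test_part = split_ratings(all_ratings)
print(len(train_part), len(test_part))
```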

In [27]:
# Data split 85/15
trainset, testset = train_test_split(data, test_size=0.15)

### **Training Model**

Using `SVDpp` (SVD++) as the base algorithm, we call the `fit()` method, which trains the algorithm on the trainset, and the `test()` method, which returns the predictions made on the testset. We then store all our predictions in a dataframe called `test`.
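
The `SVDpp` class used in the cells below is a model-based, matrix factorisation algorithm. As a rough intuition for what fitting such a model involves, here is a heavily simplified sketch of plain matrix factorisation trained with stochastic gradient descent. It is not SVD++ itself (it omits the bias terms and the implicit-feedback part) and not Surprise's implementation; the ratings, hyperparameters and function names are assumptions made for the illustration.

```python
import math
import random


def train_mf(ratings, n_factors=2, n_epochs=500, lr=0.05, reg=0.02, seed=0):
    """Learn one small factor vector per user and per movie with SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    movies = sorted({m for _, m, _ in ratings})
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {m: [rng.gauss(0, 0.1) for _ in range(n_factors)] for m in movies}
    mean = sum(r for _, _, r in ratings) / len(ratings)
    for _ in range(n_epochs):
        for u, m, r in ratings:
            err = r - (mean + sum(a * b for a, b in zip(p[u], q[m])))
            for f in range(n_factors):
                pu, qm = p[u][f], q[m][f]
                # Move both factors against the gradient of the squared error
                p[u][f] += lr * (err * qm - reg * pu)
                q[m][f] += lr * (err * pu - reg * qm)
    return mean, p, q


def predict(model, u, m):
    """Predicted rating: global mean plus the dot product of the factors."""
    mean, p, q = model
    return mean + sum(a * b for a, b in zip(p[u], q[m]))


def train_rmse(model, ratings):
    """RMSE of the model on the ratings it was given."""
    sq = [(r - predict(model, u, m)) ** 2 for u, m, r in ratings]
    return math.sqrt(sum(sq) / len(sq))


# Hypothetical (userId, movieId, rating) triples with two taste groups
toy_ratings = [
    (1, 10, 5.0), (1, 20, 4.5), (1, 30, 1.0),
    (2, 10, 4.5), (2, 30, 1.5),
    (3, 10, 1.0), (3, 20, 1.5), (3, 30, 5.0),
    (4, 20, 1.0), (4, 30, 4.5),
]

untrained = train_mf(toy_ratings, n_epochs=0)
trained = train_mf(toy_ratings)
print("RMSE before training:", train_rmse(untrained, toy_ratings))
print("RMSE after training: ", train_rmse(trained, toy_ratings))
```

Once the factors are learned, a prediction only needs the two factor vectors involved, which is where the speed advantage of model-based methods comes from.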

In [28]:
# Base algorithm
svdpp = SVDpp()

In [29]:
# Fitting our trainset
svdpp.fit(trainset)

# Using the 15% testset to make predictions
predictions = svdpp.test(testset)

# Store the predictions in a dataframe
test = pd.DataFrame(predictions)


Let us have a closer look into the predictions on the dataframe test.

In [30]:
# View the head
test.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,15,94677,4.0,3.052915,{'was_impossible': False}
1,433,4027,4.5,3.695145,{'was_impossible': False}
2,355,1261,3.0,3.872683,{'was_impossible': False}
3,73,100159,4.0,3.387418,{'was_impossible': False}
4,430,45499,4.0,4.001106,{'was_impossible': False}


### **Evaluate Model**

Using the fitted model, we are going to predict a rating for each userId and movieId pair in the `test_df` dataframe. These predictions are collected and stored in a list. Filling in the unknown values of the original rating matrix in this way is also known as matrix completion.

Let us look at the list called `ratings_predictions`.

In [31]:
# Predict a rating for every userId / movieId pair in test_df using a list comprehension.
ratings_predictions=[svdpp.predict(row.userId, row.movieId) for _,row in test_df.iterrows()]
ratings_predictions

[Prediction(uid=102413, iid=74, r_ui=None, est=3.316838779644708, details={'was_impossible': False}),
 Prediction(uid=122968, iid=52328, r_ui=None, est=3.4843440215365105, details={'was_impossible': False}),
 Prediction(uid=89372, iid=1188, r_ui=None, est=3.2625163101168173, details={'was_impossible': False}),
 Prediction(uid=1812, iid=6874, r_ui=None, est=3.808868745332308, details={'was_impossible': False}),
 Prediction(uid=7355, iid=112175, r_ui=None, est=3.6972281881594298, details={'was_impossible': False}),
 Prediction(uid=159376, iid=2087, r_ui=None, est=3.6308708890014376, details={'was_impossible': False}),
 Prediction(uid=47553, iid=1172, r_ui=None, est=4.267248272478779, details={'was_impossible': False}),
 Prediction(uid=70393, iid=7361, r_ui=None, est=3.999079383358388, details={'was_impossible': False}),
 Prediction(uid=42418, iid=594, r_ui=None, est=3.7210582274506288, details={'was_impossible': False}),
 Prediction(uid=101637, iid=923, r_ui=None, est=3.952813324226784, 

We will store the list of predictions in a dataframe, which gives us a familiar format to work with.

In [32]:
# Converting our prediction into a familiar format-Dataframe
df_pred=pd.DataFrame(ratings_predictions)
df_pred

Unnamed: 0,uid,iid,r_ui,est,details
0,102413,74,,3.316839,{'was_impossible': False}
1,122968,52328,,3.484344,{'was_impossible': False}
2,89372,1188,,3.262516,{'was_impossible': False}
3,1812,6874,,3.808869,{'was_impossible': False}
4,7355,112175,,3.697228,{'was_impossible': False}
...,...,...,...,...,...
249996,30316,637,,3.024559,{'was_impossible': False}
249997,82743,1270,,3.897435,{'was_impossible': False}
249998,140187,1282,,3.674567,{'was_impossible': False}
249999,5399,6373,,3.402437,{'was_impossible': False}


In [33]:
# Renaming our predictions to original names
df_pred=df_pred.rename(columns={'uid':'userId', 'iid':'movieId','est':'rating'})
df_pred.drop(['r_ui','details'],axis=1,inplace=True)

In [34]:
# Snippet of our ratings
df_pred.head()

Unnamed: 0,userId,movieId,rating
0,102413,74,3.316839
1,122968,52328,3.484344
2,89372,1188,3.262516
3,1812,6874,3.808869
4,7355,112175,3.697228


In [35]:
# Concatenate userId and movieId into a single Id column.
# Building the string from the columns directly keeps the integer IDs intact, so the cell only needs to run once.
df_pred['Id'] = df_pred['userId'].astype(str) + '_' + df_pred['movieId'].astype(str)

In [36]:
# drop the two features from the dataset userId and movieId
df_pred.drop(['userId', 'movieId'], inplace=True, axis= 1)

## **Preparing Submission**

The submission for this competition has to be a CSV file containing an `Id` and a `rating` column.

In [37]:
# df_pred = df_pred[['Id', 'rating']]
# df_pred.shape

In [38]:
# df_pred.to_csv("svdpp_model_base.csv", index=False)

# IMPORTANT NOTE

In this notebook, the model is implemented on a sample dataset. The same procedure was implemented in a notebook on Kaggle, but because of memory issues that notebook was moved here to GitHub. Another issue arises when running the notebook on the full data because of the huge dataset. It can be resolved with the following steps:
1. Clone this repository.
2. Go to Kaggle and download the dataset from: https://www.kaggle.com/c/edsa-recommender-system-predict/data
3. Replace the data in the folder titled 'data' with the three datasets loaded in this notebook.
4. If you run this whole notebook with the submission cells uncommented, it will return a CSV file titled `svdpp_model_base.csv`.
5. Uploading that file on kaggle will give a score of

In [27]:
# experiment.end()

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/knetshiongolwe/general/a6513bd1eea24a968832a738525b06dd
COMET INFO:   Uploads:
COMET INFO:     code                     : 1 (4 KB)
COMET INFO:     environment details      : 1
COMET INFO:     filename                 : 1
COMET INFO:     git metadata             : 1
COMET INFO:     git-patch (uncompressed) : 1 (16 MB)
COMET INFO:     installed packages       : 1
COMET INFO:     notebook                 : 1
COMET INFO: ---------------------------
COMET INFO: Uploading stats to Comet before program termination (may take several seconds)
COMET INFO: Waiting for completion of the file uploads (may take several seconds)
COMET INFO: Still uploading
