# Recommender Systems

**EDSA 2020: Predict 7 - Unsupervised Learning**


<img src="https://raw.githubusercontent.com/Tiroamodimo/tikitaka_unsupervised_project_Repo/master/Notebook_cover_photo.jpg?token=AGIEGQ2SA2OQXDS6RIZSHJK7DUBXC" width = "100%" align = "left" />

### Contributors - TS4_JHB
* Lebogang Lamola (Team Captain)
* Jagannath Chetty
* Akhona Stefane
* Abel Marumo
* Letlhogile Mothoagae


### Contents
* [Introduction](#intro)
* [Library Imports](#libraries)
* [Data Imports](#data)
* [Exploratory Data Analysis](#eda)
* [Feature Engineering](#feat_eng)
* [Recommender Systems](#rec_sys)
    * [Content-Based Recommender System](#cb_rec)
    * [Collaborative Filtering Recommender System](#cf_rec)
* [Conclusion](#conclusion)

<a id="intro"></a>
# Introduction

<a id="libraries"></a>

# Library Imports

Let's install the libraries that don't normally come with cloud-based kernels.

In [None]:
!pip install comet_ml

To log our model experiments to the team's [comet repository](https://www.comet.ml/tiroamodimo/jhb-ts4-unsupervised/view/new), we instantiate an `Experiment` instance before importing the other libraries.

In [None]:
# import comet_ml
from comet_ml import Experiment
# Add the following code anywhere in your machine learning file
experiment = Experiment(api_key="quY9CXKJTLd4wCLNuIQqCuVGa",
                     project_name="jhb-ts4-unsupervised",
                     workspace="tiroamodimo")

Now we can import the other modules we'll be using in the Notebook

In [None]:
# Data Analysis libraries
import pandas as pd
import numpy as np

# Text Data Analysis
from textblob import TextBlob
from wordcloud import WordCloud

# visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks as cf

# Styling
%matplotlib inline
sns.set(style='whitegrid', palette='muted',
        rc={'figure.figsize': (15,10)})
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Machine Learning
import surprise
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.decomposition import PCA

# sundry imports
import os
from timeit import default_timer
start = default_timer()

<a id="data"></a>

# Data Imports

The expected data sets are as follows:

* `genome_scores.csv` - a score mapping the strength between movies and tag-related properties. Read more [here](http://files.grouplens.org/papers/tag_genome.pdf)
* `genome_tags.csv` - user assigned tags for genome-related scores
* `imdb_data.csv` - Additional movie metadata scraped from IMDB using the links.csv file.
* `links.csv` - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
* `movies.csv` - Movie titles (with year of release) and pipe-separated genres for each MovieLens ID.
* `sample_submission.csv` - Sample of the submission format for the hackathon.
* `tags.csv` - User-assigned tags for the movies within the dataset.
* `test.csv` - The test split of the dataset. Contains user and movie IDs with no rating data.
* `train.csv` - The training split of the dataset. Contains user and movie IDs with associated rating data.


In [None]:
# List all data files
basepath = '../input/edsa-recommender-system-predict/'
for entry in os.listdir(basepath):
    if os.path.isfile(os.path.join(basepath, entry)):
        print(entry)

In [None]:
# import Training, Testing and Submission Data
train_df = pd.read_csv(basepath + 'train.csv')
test_df = pd.read_csv(basepath + 'test.csv')
sample_submission_df = pd.read_csv(basepath + 'sample_submission.csv')

# Movie - Tag relationship (tag genome)
genome_scores_df = pd.read_csv(basepath + 'genome_scores.csv')
genome_tags_df = pd.read_csv(basepath + 'genome_tags.csv')

# Other Data to be explored
movies_df = pd.read_csv(basepath + 'movies.csv')
imdb_data_df = pd.read_csv(basepath + 'imdb_data.csv')
links_df = pd.read_csv(basepath + 'links.csv')
tags_df = pd.read_csv(basepath + 'tags.csv')

All ratings are contained in the file `train.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
```
userId,movieId,rating,timestamp
```

* In the original MovieLens ratings file, lines are ordered first by userId, then, within user, by movieId. The preview of `train.csv` below does not preserve that order.
* Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
* Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
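The rating scale and timestamp convention described above can be checked with a few lines of pandas. This is a minimal sketch on three rows taken from `train.csv`; the names `ratings`, `rated_at` and `valid` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Three rows from train.csv; the full file has roughly 10 million rows.
ratings = pd.DataFrame({
    "userId": [5163, 106343, 9041],
    "movieId": [57669, 5, 366],
    "rating": [4.0, 4.5, 3.0],
    "timestamp": [1518349992, 1206238739, 833375837],
})

# Timestamps are seconds since 1970-01-01 00:00 UTC, so unit="s" applies.
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)

# A valid rating lies between 0.5 and 5.0 and is a multiple of 0.5.
valid = ratings["rating"].between(0.5, 5.0) & ((ratings["rating"] * 2) % 1 == 0)

print(ratings[["rating", "rated_at"]])
print("all ratings valid:", valid.all())
```

Converting to a timezone-aware datetime makes it straightforward to group ratings by year or month later in the analysis.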

In [14]:
print(train_df.shape)
train_df.head()

(10000038, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [6]:
print(test_df.shape)
test_df.head()

(5000019, 2)


Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


In [7]:
print(sample_submission_df.shape)
sample_submission_df.head()


Unnamed: 0,Id,rating
0,1_2011,1.0
1,1_4144,1.0
2,1_5767,1.0
3,1_6711,1.0
4,1_7318,1.0
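The sample above suggests that each submission `Id` is the `userId` and `movieId` of a test row joined by an underscore. Assuming that reading is correct, a submission frame could be built roughly as follows; `test_pairs` and the constant `placeholder_rating` are hypothetical stand-ins for `test_df` and real model predictions.

```python
import pandas as pd

# First rows of test.csv, used here as a small stand-in for test_df.
test_pairs = pd.DataFrame({"userId": [1, 1, 1], "movieId": [2011, 4144, 5767]})

# Hypothetical constant prediction; a real model would supply one value per row.
placeholder_rating = 3.0

submission = pd.DataFrame({
    # Assumed format: "<userId>_<movieId>", inferred from the sample submission.
    "Id": test_pairs["userId"].astype(str) + "_" + test_pairs["movieId"].astype(str),
    "rating": placeholder_rating,
})

print(submission)
```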


Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:
```
movieId,title,genres
```
Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.
Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
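Because genres are pipe-separated and titles carry the release year in parentheses, both can be unpacked with pandas string methods. The sketch below uses three rows from `movies.csv`; the names `movies`, `genre_flags` and `year` are illustrative, and the year pattern assumes the year is the last parenthesised group in the title.

```python
import pandas as pd

# Three rows from movies.csv.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One indicator column per genre (1 if the movie has the genre, else 0).
genre_flags = movies["genres"].str.get_dummies(sep="|")

# Pull a trailing four-digit year; titles without one become NaN.
year_text = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies["year"] = pd.to_numeric(year_text, errors="coerce")

print(genre_flags)
print(movies[["title", "year"]])
```

Using `errors="coerce"` keeps the extraction tolerant of the title inconsistencies mentioned above.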

In [8]:
print(movies_df.shape)
movies_df.head()

(62423, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [9]:
print(imdb_data_df.shape)
imdb_data_df.head()

(27278, 6)


Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


As described in [this article](http://files.grouplens.org/papers/tag_genome.pdf), the tag genome encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.). The tag genome was computed using a machine learning algorithm on user-contributed content including tags, ratings, and textual reviews.

The genome is split into two files. The file `genome_scores.csv` contains movie-tag relevance data in the following format:
```
movieId,tagId,relevance
```

In [10]:
print(genome_scores_df.shape)
genome_scores_df.head()


Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075


The second file, `genome_tags.csv`, provides the tag descriptions for the tag IDs in the genome file, in the following format:
```
tagId,tag
```
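The two genome files join on `tagId`. As a rough sketch, they can be combined and reshaped into a movie-by-tag relevance matrix as shown here. The rows for movie 1 come from the scores file; the relevance values for movie 2 are made up for illustration, and the names `labelled` and `movie_tag_matrix` are not from the original notebook.

```python
import pandas as pd

genome_scores = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 2],
    "tagId": [1, 2, 3, 1, 2, 3],
    # Movie 1 values are from genome_scores.csv; movie 2 values are invented.
    "relevance": [0.02875, 0.02375, 0.0625, 0.5, 0.25, 0.125],
})
genome_tags = pd.DataFrame({
    "tagId": [1, 2, 3],
    "tag": ["007", "007 (series)", "18th century"],
})

# Attach the human-readable tag to every (movieId, tagId) score.
labelled = genome_scores.merge(genome_tags, on="tagId", how="left")

# One row per movie, one column per tag, relevance as the cell value.
movie_tag_matrix = labelled.pivot(index="movieId", columns="tag", values="relevance")

print(movie_tag_matrix)
```

A matrix in this shape is a common starting point for content-based similarity, since each movie becomes a fixed-length vector of tag relevances.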

In [11]:
print(genome_tags_df.shape)
genome_tags_df.head()

(1128, 2)


Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:
```
movieId,imdbId,tmdbId
```
movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.

imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.

tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.

Use of the resources listed above is subject to the terms of each provider.
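The identifiers can be turned back into URLs following the Toy Story examples above. This is a sketch under two assumptions: that IMDB IDs are zero-padded to seven digits (inferred from `114709` mapping to `tt0114709`), and that a missing `tmdbId` should yield no link. The function names are hypothetical.

```python
import pandas as pd

def movielens_url(movie_id):
    """Build the MovieLens page URL for a movieId."""
    return f"https://movielens.org/movies/{int(movie_id)}"

def imdb_url(imdb_id):
    """Build the IMDB URL; assumes IDs are zero-padded to 7 digits."""
    return f"http://www.imdb.com/title/tt{int(imdb_id):07d}/"

def tmdb_url(tmdb_id):
    """Build the TMDB URL, or return None when the ID is missing."""
    if pd.isna(tmdb_id):
        return None
    return f"https://www.themoviedb.org/movie/{int(tmdb_id)}"

print(movielens_url(1))
print(imdb_url(114709))
print(tmdb_url(862.0))
```

The `tmdbId` column is stored as a float in `links.csv`, which is why the value is cast to `int` before formatting.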

In [12]:
print(links_df.shape)
links_df.head()


(62423, 3)


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:
```
userId,movieId,tag,timestamp
```

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

In [13]:
print(tags_df.shape)
tags_df.head()

(1093360, 4)


Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


<a id="eda"></a>

# Exploratory Data Analysis

<a id="feat_eng"></a>

# Feature Engineering

<a id="rec_sys"></a>

# Recommender Systems

With the data prepared, we can move on to the machine learning experiments. The functions below let us log each experiment to the team's [comet repository](https://www.comet.ml/tiroamodimo/jhb-ts4-unsupervised/view/new).

In [None]:
def remove_unchanged_params(params_dict, used_params_list=None):
    """
    Take a dictionary of parameters and a list of used parameter names and
    return a dictionary containing only the parameters named in that list,
    plus a 'params_used' flag ('custom' or 'default').
    """

    # if no list of parameters was specified, the model ran on its defaults
    if used_params_list is None:

        # flag that default parameters were used; no values are logged
        return {'params_used': 'default'}

    # initialise a new dictionary of parameters
    new_params_dict = {'params_used': 'custom'}

    # for each parameter in params_dict
    for param in params_dict:

        # if the parameter is in used_params_list
        if param in used_params_list:

            # add that parameter's entry to the new dictionary of parameters
            new_params_dict[param] = params_dict[param]

    return new_params_dict

In [None]:
def add_model_type(params_dict, model_type='Not Specified'):

    """
    Take a dictionary of parameters, add the specified model_type to it
    (the dictionary is modified in place) and return it.
    """

    # add model_type parameter to the dictionary
    params_dict['model_type'] = model_type

    return params_dict

In [None]:
def make_comet_model_params(model_params, model_name, used_params_list=None):

    """
    Take a dictionary of model parameters, a string naming the model type and
    a list of used parameter names, and return a dictionary of the parameters
    used in the model for comet experiment logging.
    """

    # get the parameters that were used
    new_params_dict = remove_unchanged_params(model_params,
                                              used_params_list)

    # add model_type to dictionary of parameters
    new_params_dict = add_model_type(new_params_dict, model_name)

    return new_params_dict
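To make the behaviour of these helpers concrete, the sketch below restates the same filtering logic in condensed form so that it runs on its own, then shows the dictionaries that would be logged. The function name `build_logged_params` and the SVD-style parameter values are illustrative assumptions, not part of the original notebook.

```python
def build_logged_params(model_params, model_name, used_params_list=None):
    """Condensed restatement of the helpers above, for illustration only."""
    if used_params_list is None:
        # No list given: record that the model ran on default parameters.
        logged = {"params_used": "default"}
    else:
        # Keep only the parameters that were explicitly set.
        logged = {"params_used": "custom"}
        logged.update(
            {k: v for k, v in model_params.items() if k in used_params_list}
        )
    logged["model_type"] = model_name
    return logged

# Hypothetical parameter values for an SVD-style model.
svd_params = {"n_factors": 50, "n_epochs": 20, "lr_all": 0.005}

custom = build_logged_params(svd_params, "SVD", ["n_factors"])
default = build_logged_params(svd_params, "SVD")

print(custom)   # {'params_used': 'custom', 'n_factors': 50, 'model_type': 'SVD'}
print(default)  # {'params_used': 'default', 'model_type': 'SVD'}
```

Logging only the parameters that were changed keeps the comet dashboard focused on what differs between experiments.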

In [None]:
def deploy_comet(experiment, metrics, parameters=None):
    """
    Take a comet experiment object, a dictionary of model test results
    (metrics) and, optionally, a dictionary of model parameters, and upload
    the experiment to comet.
    """

    # log the parameters, if any were supplied
    if parameters is not None:
        print('logging parameters...')
        experiment.log_parameters(parameters)

    # log model performance
    print('logging metrics...')
    experiment.log_metrics(metrics)

    # end experiment
    print('ending experiment...')
    experiment.end()

    # display experiment
    experiment.display()

<a id="cb_rec"></a>

## Content-Based Recommender System

<a id="cf_rec"></a>

## Collaborative Filtering Recommender System

<a id="conclusion"></a>

# Conclusion