# Recommender Systems

**EDSA 2020: Predict 7 - Unsupervised Learning**


<img src="https://raw.githubusercontent.com/Tiroamodimo/tikitaka_unsupervised_project_Repo/master/Notebook_cover_photo.jpg?token=AGIEGQ2SA2OQXDS6RIZSHJK7DUBXC" width = "100%" align = "left" />

### Contributors - TS4_JHB
* Lebogang Lamola (Team Captain)
* Jagannath Chetty
* Akhona Stafane
* Abel Marumo
* Letlhogile Mothoagae


### Contents
* [Introduction](#intro)
* [Library Imports](#libraries)
* [Data Imports](#data)
* [Exploratory Data Analysis](#eda)
* [Data Cleaning](#data_clean)
* [Feature Engineering](#feat_eng)
* [Recommender Systems](#rec_sys)
    * [Content-Based Recommender System](#cb_rec)
    * [Collaborative Filtering Recommender System](#cf_rec)
* [Conclusion](#conclusion)

<a id="intro"></a>
# Introduction

<a id="libraries"></a>

# Library Imports

Let's install the libraries that don't normally come with cloud-based kernels.

In [None]:
!pip install comet_ml

In order to deploy our model experiments to the team's [comet repository](https://www.comet.ml/tiroamodimo/jhb-ts4-unsupervised/view/new), we'll instantiate an `Experiment` instance before importing other libraries.

In [None]:
# # import comet_ml
# from comet_ml import Experiment
# # Add the following code anywhere in your machine learning file
# experiment = Experiment(api_key="quY9CXKJTLd4wCLNuIQqCuVGa",
#                      project_name="jhb-ts4-unsupervised",
#                      workspace="tiroamodimo")

Now we can import the other modules we'll be using in the Notebook

In [None]:
# Data Analysis libraries
import pandas as pd
import numpy as np

# Text Data Analysis
from textblob import TextBlob
from wordcloud import WordCloud

# visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks as cf

# Styling
%matplotlib inline
sns.set(style='whitegrid', palette='muted',
        rc={'figure.figsize': (15,10)})
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Machine Learning
import surprise
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.decomposition import PCA

# sundry imports
import os
from timeit import default_timer
start = default_timer()

<a id="data"></a>

# Data Imports

The expected datasets are as follows:

* `genome_scores.csv` - a score mapping the strength between movies and tag-related properties. Read more [here](http://files.grouplens.org/papers/tag_genome.pdf)
* `genome_tags.csv` - User-assigned tags for the genome-related scores.
* `imdb_data.csv` - Additional movie metadata scraped from IMDB using the links.csv file.
* `links.csv` - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
* `movies.csv` - Movie titles and genres for each MovieLens ID.
* `sample_submission.csv` - Sample of the submission format for the hackathon.
* `tags.csv` - User-assigned tags for the movies within the dataset.
* `test.csv` - The test split of the dataset. Contains user and movie IDs with no rating data.
* `train.csv` - The training split of the dataset. Contains user and movie IDs with associated rating data.


In [None]:
# List all data files
basepath = '../input/edsa-recommender-system-predict/'
for entry in os.listdir(basepath):
    if os.path.isfile(os.path.join(basepath, entry)):
        print(entry)

In [None]:
# import Training, Testing and Submission Data
train_df = pd.read_csv(basepath + 'train.csv')
test_df = pd.read_csv(basepath + 'test.csv')
sample_submission_df = pd.read_csv(basepath + 'sample_submission.csv')

# Movie - Tag relationship (tag genome)
genome_scores_df = pd.read_csv(basepath + 'genome_scores.csv')
genome_tags_df = pd.read_csv(basepath + 'genome_tags.csv')

# Other Data to be explored
movies_df = pd.read_csv(basepath + 'movies.csv')
imdb_data_df = pd.read_csv(basepath + 'imdb_data.csv')
links_df = pd.read_csv(basepath + 'links.csv')
tags_df = pd.read_csv(basepath + 'tags.csv')

All ratings are contained in the file `train.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
```
userId,movieId,rating,timestamp
```

* The lines within this file are ordered first by userId, then, within user, by movieId.
* Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
* Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
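Because the timestamps are Unix epoch seconds, they can be turned into readable dates when needed. A minimal sketch using `pd.to_datetime`; the two-row frame and the `rated_at` column name are illustrative assumptions, not part of the dataset:

```python
import pandas as pd

# toy frame with the same columns as train.csv (all values are made up)
ratings = pd.DataFrame({
    'userId': [1, 2],
    'movieId': [10, 20],
    'rating': [4.0, 3.5],
    'timestamp': [0, 86400],  # seconds since the Unix epoch
})

# interpret the integers as seconds since 1970-01-01 00:00 UTC
ratings['rated_at'] = pd.to_datetime(ratings['timestamp'], unit='s', utc=True)
print(ratings[['timestamp', 'rated_at']])
```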

In [None]:
print(train_df.shape)
train_df.head()

In [None]:
print(test_df.shape)
test_df.head()

In [None]:
print(sample_submission_df.shape)
sample_submission_df.head()

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:
```
movieId,title,genres
```
Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.
Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)

In [None]:
print(movies_df.shape)
movies_df.head()

In [None]:
print(imdb_data_df.shape)
imdb_data_df.head()

As described in [this article](http://files.grouplens.org/papers/tag_genome.pdf), the tag genome encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.). The tag genome was computed using a machine learning algorithm on user-contributed content including tags, ratings, and textual reviews.

The genome is split into two files. The file `genome_scores.csv` contains movie-tag relevance data in the following format:
```
movieId,tagId,relevance
```

In [None]:
print(genome_scores_df.shape)
genome_scores_df.head()

The second file, `genome_tags.csv`, provides the tag descriptions for the tag IDs in the genome file, in the following format:
```
tagId,tag
```

In [None]:
print(genome_tags_df.shape)
genome_tags_df.head()

Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:
```
movieId,imdbId,tmdbId
```
movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.

imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.

tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.

Use of the resources listed above is subject to the terms of each provider.
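The link patterns above can be sketched as small helper functions. This assumes that `links.csv` stores `imdbId` as a plain integer without the `tt` prefix or leading zeros (so Toy Story would be `114709`); the function names are hypothetical:

```python
def movielens_url(movie_id: int) -> str:
    """Build a MovieLens URL from a movieId."""
    return f"https://movielens.org/movies/{int(movie_id)}"


def imdb_url(imdb_id: int) -> str:
    """Build an IMDB URL; assumes the ID must be zero-padded to 7 digits."""
    return f"http://www.imdb.com/title/tt{int(imdb_id):07d}/"


def tmdb_url(tmdb_id: float) -> str:
    """Build a TMDB URL; the ID may arrive as a float when the column has gaps."""
    return f"https://www.themoviedb.org/movie/{int(tmdb_id)}"


# Toy Story, using the IDs quoted in the text above
print(movielens_url(1))
print(imdb_url(114709))
print(tmdb_url(862.0))
```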

In [None]:
print(links_df.shape)
links_df.head()


All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:
```
userId,movieId,tag,timestamp
```

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

In [None]:
print(tags_df.shape)
tags_df.head()

<a id="eda"></a>

# Exploratory Data Analysis

## Missing Data and Data Types

In order to facilitate the identification of missing data and data types, a function, `print_dtypes_null`, is defined below.

In [None]:
def print_dtypes_null(df):
    
    """
    This function takes a dataframe as input and prints out the
    data types and the number and percentage of null values per column
    """
    
    # print data types (df.info() prints directly and returns None)
    print('Data type')
    df.info()
    print('\n======================')
    
    # get number of null values
    total = df.isnull().sum().sort_values(ascending=False)
    
    # get percentage null values
    percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)*100
    
    # combine both measures into one dataframe and print it
    print('Missing Values')
    print(pd.concat([total, percent], axis=1, keys=['Total Number Missing', 'Percent Missing']),'\n======================')
    
    # print original dataframe for ease of reading
    print('Dataset')
    print(df.head())

Data types and missing values were assessed below

In [None]:
print_dtypes_null(train_df)

`train_df` consists of numerical data, _int64_ and _float64_ and has no missing values in any of the columns

In [None]:
print_dtypes_null(test_df)

`test_df` consists of numerical data, _int64_, and has no missing values in any of the columns

In [None]:
print_dtypes_null(genome_scores_df)

`genome_scores_df` consists of numerical data, _int64_ and _float64_ and has no missing values in any of the columns

In [None]:
print_dtypes_null(genome_tags_df)

`genome_tags_df` consists of numerical data, _int64_, and non-numeric data, _object_, and has no missing values in any of the columns

In [None]:
print_dtypes_null(movies_df)

`movies_df` consists of numerical data, _int64_, and non-numeric data, _object_, and has no missing values in any of the columns

In [None]:
print_dtypes_null(imdb_data_df)

`imdb_data_df` includes numerical data, _float64_, and has 5 columns with missing data, ranging from 36% for `director` to 71% for `budget`

In [None]:
print_dtypes_null(links_df)

`links_df` consists of numerical data, _int64_ and _float64_ and has 1 column, `tmdbId` with 17% missing data 

In [None]:
print_dtypes_null(tags_df)

`tags_df` consists of numerical data, _int64_, and non-numeric data, _object_, and has less than 1% missing values in the `tag` column

**Outcomes From Assessment of Datatypes and Null Values**

1. From the assessment we see that our dataset consists of a combination of _numeric_ and _non-numeric_ data types
    * In order to implement machine learning, the non-numeric datatypes need to be converted to numeric datatypes.
2. The `imdb_data_df` dataset has 36% - 71% missing data across 5 of its columns. This dataset will therefore not be considered going forward in this exercise. In a different context, however, the `links_df` dataset would be used to source the missing data from a supplementary dataset. The `links_df` dataset will also not be considered going forward.
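One possible way to convert a non-numeric column such as `genres` into numeric form is a 0/1 indicator column per genre. The sketch below uses `Series.str.get_dummies` on a made-up three-row frame; it illustrates the idea and is not necessarily the encoding used later in this notebook:

```python
import pandas as pd

# toy frame shaped like movies.csv (rows are made up)
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': ['Adventure|Comedy', 'Drama', 'Comedy|Drama|Romance'],
})

# split on the pipe and create one 0/1 indicator column per genre
genre_flags = movies['genres'].str.get_dummies(sep='|')

# attach the numeric indicators to the movie IDs
encoded = pd.concat([movies[['movieId']], genre_flags], axis=1)
print(encoded)
```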

In [None]:
# remove data that will not be considered
del imdb_data_df
del links_df

## Assessing The Data

In order to facilitate the assessment of our data the functions below are defined.

In [None]:
def make_histogram(df, col):


    # Plot the histogram with default number of bins; label your axes
    _ = plt.hist(df[col])
    _ = plt.xlabel(col)
    _ = plt.ylabel('Frequency')
    
    plt.savefig(f'Histogram of {col}.png')

    # Show the plot
    plt.show()


def show_wordcloud(data, col):
    
    # define text from data
    text = ' '.join(data[col].values.astype(str))
    
    # generate wordcloud
    wordcloud = WordCloud(max_words=50,
                          background_color='black',
                          scale=3,
                          random_state=4).generate(text)
    
    # plot wordcloud
    fig = plt.figure(1, figsize=(15, 15))
    plt.axis('off')
    plt.imshow(wordcloud)
    
    # save after drawing, otherwise the saved image is blank
    plt.savefig(f'Word cloud of {col}.png')
    plt.show()


def ecdf(data):
    
    """Compute ECDF for a one-dimensional array of measurements."""
    
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y


def plot_ecdf(df, col):
    
    """plot ECDF for a column, col, in a dataframe, df."""
    
    # Compute ECDF 
    x, y = ecdf(df[col])
    
    # Generate plot
    _ = plt.plot(x, y, marker='.', linestyle = 'none')
    
    # Label axes
    _ = plt.ylabel('ECDF')
    _ = plt.xlabel(f'{col}')
    
    plt.savefig(f'ecdf of {col}.png')
    
    # display
    plt.show()

Assessing the `train_df` data

In [None]:
train_df.describe().T

In [None]:
train_df.nunique()

There are 10 000 038 records in the `train_df` dataset, covering 162 541 unique user IDs and 48 213 unique movies. There are 10 unique rating values and 8 795 101 unique timestamps.
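As a rough worked calculation from the counts quoted above, the share of all possible user-movie pairs that actually carry a rating can be estimated as follows (the variable names are illustrative):

```python
# counts quoted in the text above
n_ratings = 10_000_038
n_users = 162_541
n_movies = 48_213

# every user could in principle rate every movie
possible_pairs = n_users * n_movies

# fraction of the user-movie grid that is actually filled in
density = n_ratings / possible_pairs
print(f'{density:.4%} of all possible user-movie pairs have a rating')
```

The result is well below 1%, meaning each user has rated only a small fraction of the movies.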

It was assumed that people view different movies at different times for reasons that have little or nothing to do with the _movies_ they like. For this reason, the `timestamp` data will not be assessed going forward in this exercise.

The `rating` data is explored below.

In [None]:
train_df['rating'].value_counts()

In [None]:
make_histogram(train_df, 'rating')

* Most users rated movies with a 4, followed by 3 and 5 respectively.
* On average, users rated movies with 3.5.
* 0.5 was the least frequent rating observed.
* In general, ratings below 3 were less frequent.


Assessing the `movies_df` data

In [None]:
movies_df.describe()

In [None]:
movies_df.nunique()

* There are 62 423 records in the `movies_df` data and 62 423 unique movie IDs. There are 62 325 unique movie titles, which suggests that 98 records repeat a title that already appears elsewhere. There are 1 639 different combinations of genres for the various movies.
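The duplicate count above is the difference between the number of records and the number of unique titles. A small sketch of how the repeated titles could be listed, using a made-up frame in place of `movies_df`:

```python
import pandas as pd

# toy frame shaped like movies.csv (titles are made up)
movies = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['Alpha (2000)', 'Beta (2001)', 'Alpha (2000)', 'Gamma (2002)'],
})

# records minus unique titles = number of surplus (repeated) titles
n_duplicated = len(movies) - movies['title'].nunique()

# keep=False marks every occurrence of a repeated title, not only the later ones
repeated = movies[movies['title'].duplicated(keep=False)]
print(n_duplicated)
print(repeated)
```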

In [None]:
show_wordcloud(movies_df, 'genres')

* 'no genres', 'genres listed', 'Comedy Drama' and 'Drama Romance' are the most common genre types, closely followed by 'Thriller Comedy', 'Thriller Drama' and 'Romance Comedy'. The first two appear to be fragments of the `(no genres listed)` placeholder.
* The `genres` data includes a combination of different genres; this can be normalised to 1NF to make the data easier to analyse.
* The `title` column includes the year that the movie was released, which can be extracted.
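Normalising `genres` to 1NF means storing one genre per row instead of a pipe-separated list. A minimal sketch with `str.split` and `DataFrame.explode`, on a made-up frame; the `genre` column and `movie_genres` name are illustrative:

```python
import pandas as pd

# toy frame shaped like movies.csv (rows are made up)
movies = pd.DataFrame({
    'movieId': [1, 2],
    'genres': ['Adventure|Comedy', 'Drama'],
})

# split the pipe-separated string into a list, then emit one row per list item
movie_genres = (
    movies.assign(genre=movies['genres'].str.split('|'))
          .explode('genre')[['movieId', 'genre']]
          .reset_index(drop=True)
)
print(movie_genres)
```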

Assessing `genome_tags_df` data

In [None]:
genome_tags_df.describe().T

In [None]:
genome_tags_df.nunique()

There are 1128 unique values under `tag`. Let's have a closer look at the most common words

In [None]:
show_wordcloud(genome_tags_df, 'tag')

* The most common words under the `tag` data are 'war' and 'good', followed closely by 'movie' and 'comedy'

Assessing the `genome_scores_df` data

In [None]:
genome_scores_df.describe().T

In [None]:
genome_scores_df.nunique()

* 13 816 movies have tags with an associated `relevance` score.

The `relevance` was investigated further below

In [None]:
make_histogram(genome_scores_df, 'relevance')

In [None]:
plot_ecdf(genome_scores_df, 'relevance')

* More than 80% of the tags have a relevance of less than 0.2 for each movie
* The `genome_tags_df` and `genome_scores_df` datasets provide interesting metadata about movies. However, only 13 816 out of 62 423 movies (22%) have genome scores.
* For this reason, the above-mentioned datasets were not considered further in this exercise.
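The share of scores below a threshold can be read off an ECDF numerically as well as visually. A small sketch on made-up relevance values (not the real genome scores), using `np.searchsorted` on the sorted data:

```python
import numpy as np

# made-up relevance scores standing in for genome_scores_df['relevance']
relevance = np.array([0.02, 0.05, 0.10, 0.15, 0.40, 0.90])

# the ECDF x-values are simply the sorted observations
x = np.sort(relevance)
n = len(x)

# index of the first value >= 0.2, i.e. the count of values strictly below 0.2
below = np.searchsorted(x, 0.2, side='left')
fraction_below = below / n
print(f'{fraction_below:.0%} of scores are below 0.2')
```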

In [None]:
del genome_tags_df
del genome_scores_df

Assessing the `tags_df` data

In [None]:
tags_df.describe().T

In [None]:
tags_df.nunique()

* The `timestamp` data will not be assessed in this dataset, for the same reason it was not assessed in the `train_df` data.

The `tag` data is explored further below.

In [None]:
show_wordcloud(tags_df, 'tag')

* There are 45 251 movies out of 62 423 (72%) with associated tags.
* This data may be useful in describing 72% of the movies data, therefore it will be kept.
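A coverage figure like the one above can be computed by checking which movie IDs appear in the tags data. A minimal sketch with made-up frames standing in for `movies_df` and `tags_df`:

```python
import pandas as pd

# toy stand-ins for movies_df and tags_df (values are made up)
movies = pd.DataFrame({'movieId': [1, 2, 3, 4]})
tags = pd.DataFrame({
    'userId': [7, 7, 8],
    'movieId': [1, 1, 3],
    'tag': ['funny', 'pixar', 'dark'],
})

# True for each movie that has at least one tag
tagged = movies['movieId'].isin(tags['movieId'])

# the mean of a boolean series is the fraction of True values
coverage = tagged.mean()
print(f'{coverage:.0%} of movies have at least one tag')
```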

<a id="data_clean"></a>
# Data Cleaning

The previous assessment assisted in 

<a id="feat_eng"></a>

# Feature Engineering

In [None]:
# get the year (a four-digit number in parentheses)
movies_df['year'] = movies_df['title'].str.extract(r"\((\d{4})\)", expand=False)

# remove the trailing year from the title, keeping digits that are part of the title itself
movies_df['title'] = movies_df['title'].str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True).str.strip()

# get the number of genres
movies_df['genre_count'] = movies_df['genres'].apply(lambda x: x.count('|') + 1)

# get the polarity of the title
movies_df['title_polarity'] = movies_df['title'].apply(lambda x: TextBlob(x).sentiment.polarity)

# get the subjectivity of the title
movies_df['title_subjectivity'] = movies_df['title'].apply(lambda x: TextBlob(x).sentiment.subjectivity)


movies_df.head()
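The year handling can be checked on a few toy titles. The sketch below assumes the year always appears as a four-digit number in parentheses at the end of the title, and uses pandas string methods anchored to the end of the string so that digits inside the title are preserved; the titles are illustrative:

```python
import pandas as pd

# illustrative titles in the "Title (year)" format described earlier
titles = pd.Series([
    'Toy Story (1995)',
    '2001: A Space Odyssey (1968)',
    'Seven (a.k.a. Se7en) (1995)',
])

# capture the four digits inside parentheses
year = titles.str.extract(r'\((\d{4})\)', expand=False)

# strip only the trailing "(year)", leaving other digits and parentheses alone
clean = titles.str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)

print(pd.DataFrame({'title': clean, 'year': year}))
```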

<a id="rec_sys"></a>

# Recommender Systems

Now that the data is prepared, the machine learning experiments will follow. The functions below are defined to enable experiments to be deployed to the team's [comet repository](https://www.comet.ml/tiroamodimo/jhb-ts4-unsupervised/view/new)

In [None]:
def remove_unchanged_params(params_dict, used_params_list=None):
    
    """
    This function takes a dictionary of parameters and a list of used parameters
    as inputs and returns a dictionary of parameters that are in the list of
    used parameters.
    """

    # check if a list of parameters was specified
    if used_params_list is None:

        # if not, flag that the default parameters were used
        return {'params_used': 'default'}

    # initialise a new dictionary of parameters
    new_params_dict = {'params_used': 'custom'}

    # for each parameter in params_dict
    for param in params_dict.keys():

        # if a parameter is in used_params_list
        if param in used_params_list:

            # add that parameter's entry to the new dictionary of parameters
            new_params_dict[param] = params_dict[param]

    return new_params_dict
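The filtering step in `remove_unchanged_params` amounts to keeping only the keys named in a list. A compact sketch of the same idea with a dict comprehension; `filter_params` and the sample parameter names are hypothetical and only serve as an illustration:

```python
def filter_params(params, used=None):
    """Keep only the entries of `params` whose keys appear in `used`."""
    # no list given: report that the defaults were used
    if used is None:
        return {'params_used': 'default'}

    # keep the explicitly listed parameters and mark the run as customised
    kept = {key: value for key, value in params.items() if key in used}
    return {'params_used': 'custom', **kept}


# sample parameter dictionary (names and values are illustrative)
all_params = {'n_factors': 100, 'n_epochs': 20, 'lr_all': 0.005}

print(filter_params(all_params))
print(filter_params(all_params, ['n_factors', 'n_epochs']))
```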

In [None]:
def add_model_type(params_dict,model_type='Not Specified'):

    """
    This function takes a dictionary of parameters as inputs
    and returns a dictionary of parameters that includes the specified
    model_type.
    """

    # add model_type parameter to the dictionary
    params_dict['model_type'] = model_type

    return params_dict

In [None]:
def make_comet_model_params(model_params, model_name, used_params_list=None):

    """
    This function takes a dictionary of model parameters, a string naming the
    model type and a list of used parameters as inputs, and returns a dictionary
    of the parameters used in the model for comet experiment logging.
    """

    # get parameters that were used
    new_params_dict = remove_unchanged_params(model_params,
                                            used_params_list)

    # add model_type to dictionary of parameters
    new_params_dict = add_model_type(new_params_dict,model_name)

    return new_params_dict

In [None]:
def deploy_comet(experiment, metrics, parameters=None):
    """
    This function takes a comet experiment object, a dictionary of model
    test results and an optional dictionary of model parameters as inputs
    and uploads the experiment to comet.
    """

    # log our parameters
    if parameters is not None:
        print('logging parameters...')
        experiment.log_parameters(parameters)

    # log model performance
    print('logging metric...')
    experiment.log_metrics(metrics)

    print('ending experiment...')
    # end experiment
    experiment.end()

    # display experiment
    experiment.display()

<a id="cb_rec"></a>

## Content-Based Recommender System

<a id="cf_rec"></a>

## Collaborative Filtering Recommender System

<a id="conclusion"></a>

# Conclusion