# Kaggle Competition

## Description

In today’s technology-driven world, recommender systems are socially and economically critical for helping individuals make appropriate choices about the content they engage with daily. Movie recommendation is one application where this is especially true: intelligent algorithms can help viewers find great titles among tens of thousands of options.

With this context, EDSA is challenging you to construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences.

Providing an accurate and robust solution to this challenge has immense economic potential, with users of the system being exposed to content they would like to view or purchase - generating revenue and platform affinity.

## Evaluation



The evaluation metric for this competition is [Root Mean Square Error](https://surprise.readthedocs.io/en/stable/accuracy.html). Root Mean Square Error (RMSE) is commonly used in regression analysis and forecasting, and measures the standard deviation of the residuals between predicted and actual observed values for a modelling process. For our task of generating user movie ratings via recommendation algorithms, with $\hat{R}$ the set of predicted ratings, $r_{ui}$ the actual rating of user $u$ for movie $i$ and $\hat{r}_{ui}$ the predicted rating, the formula is given by:

$$RMSE = \sqrt{\frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}} \left(r_{ui} - \hat{r}_{ui}\right)^2}$$
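As a minimal sketch of the metric, the snippet below computes RMSE with NumPy. The function name `rmse` and the ratings are made up for illustration and are not taken from the competition data.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean square error between two equal-length rating sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # mean of the squared residuals, then the square root of that mean
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Illustrative (made-up) ratings
true_ratings = [4.0, 4.5, 5.0, 2.0]
predicted_ratings = [3.5, 4.5, 4.0, 3.0]

score = rmse(true_ratings, predicted_ratings)
print(score)  # 0.75
```

The residuals here are 0.5, 0, 1 and -1, so the mean squared error is 2.25 / 4 = 0.5625 and its square root is 0.75.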

# Data Assessment

## 1.1 Load Libraries

In [1]:
# !pip install scikit-surprise

In [1]:
# Data analysis libraries
import pandas as pd
import numpy as np

# visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns

# other
import os

# Figures inline and set visualization style
%matplotlib inline
sns.set()

# Machine Learning
import surprise

## 1.2 Import Datasets

In [2]:
# List all files in a directory using os.listdir
basepath = '../unsupervised_data/unsupervised_movie_data/'
for entry in os.listdir(basepath):
    if os.path.isfile(os.path.join(basepath, entry)):
        print(entry)

genome-scores.csv
genome-tags.csv
imdb_data.csv
links.csv
movies.csv
sample_submission.csv
test.csv
train.csv


Expected data sets

* `genome-scores.csv` - Scores mapping the strength of the relationship between movies and tag-related properties. Read more [here](http://files.grouplens.org/papers/tag_genome.pdf)
* `genome-tags.csv` - User-assigned tags for the genome-related scores.
* `imdb_data.csv` - Additional movie metadata scraped from IMDB using the `links.csv` file.
* `links.csv` - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
* `movies.csv` - Movie titles and genres.
* `sample_submission.csv` - Sample of the submission format for the hackathon.
* `tags.csv` - User-assigned tags for the movies within the dataset (not present in the directory listing above).
* `test.csv` - The test split of the dataset. Contains user and movie IDs with no rating data.
* `train.csv` - The training split of the dataset. Contains user and movie IDs with associated rating data.


In [3]:
# import Training, Testing and Submission Data
train_df = pd.read_csv(basepath + 'train.csv', nrows=50000)
test_df = pd.read_csv(basepath + 'test.csv')
sample_submission_df = pd.read_csv(basepath + 'sample_submission.csv')

# # User - Movie relationship
# genome_scores_df = pd.read_csv(basepath + 'genome-scores.csv')
# genome_tags_df = pd.read_csv(basepath + 'genome-tags.csv')

# # Other Data to be explored
# movies_df = pd.read_csv(basepath + 'movies.csv')
# imdb_data_df = pd.read_csv(basepath + 'imdb_data.csv')

# links_df = pd.read_csv(basepath + 'links.csv')
# tags_df = pd.read_csv(basepath + 'tags.csv')

### Machine Learning Data

All ratings are contained in the file `train.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
```
userId,movieId,rating,timestamp
```

* In the original MovieLens ratings file the lines are ordered first by userId, then, within user, by movieId; the `train.csv` rows shown below are not in that order.
* Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
* Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
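The timestamp column can be turned into readable dates with `pd.to_datetime`. This is a small sketch using two of the rows shown in the `train_df.head()` output; the frame name `ratings` and the new column `rated_at` are illustrative.

```python
import pandas as pd

# Two rows copied from the train_df.head() output
ratings = pd.DataFrame({
    "userId": [5163, 9041],
    "movieId": [57669, 366],
    "rating": [4.0, 3.0],
    "timestamp": [1518349992, 833375837],
})

# unit="s" interprets the integers as seconds since 1970-01-01 00:00 UTC
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)

rating_years = ratings["rated_at"].dt.year.tolist()
print(ratings[["timestamp", "rated_at"]])
```

Once converted, the `.dt` accessor gives the year, month or weekday of each rating, which may be useful as extra features.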

In [5]:
print(train_df.shape)
train_df.head()

(50000, 4)


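|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|
| 0 | 5163 | 57669 | 4.0 | 1518349992 |
| 1 | 106343 | 5 | 4.5 | 1206238739 |
| 2 | 146790 | 5459 | 5.0 | 1076215539 |
| 3 | 106362 | 32296 | 2.0 | 1423042565 |
| 4 | 9041 | 366 | 3.0 | 833375837 |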


In [6]:
print(test_df.shape)
test_df.head()

(5000019, 2)


|   | userId | movieId |
|---|--------|---------|
| 0 | 1 | 2011 |
| 1 | 1 | 4144 |
| 2 | 1 | 5767 |
| 3 | 1 | 6711 |
| 4 | 1 | 7318 |


In [7]:
print(sample_submission_df.shape)
sample_submission_df.head()

(5000019, 2)


|   | Id | rating |
|---|--------|--------|
| 0 | 1_2011 | 1.0 |
| 1 | 1_4144 | 1.0 |
| 2 | 1_5767 | 1.0 |
| 3 | 1_6711 | 1.0 |
| 4 | 1_7318 | 1.0 |


### Other Data

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:
```
movieId,title,genres
```
Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.
Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
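Because genres are stored as one pipe-separated string, they usually need splitting before use. The sketch below shows two common options on a made-up frame with the `movies.csv` column layout; the titles are placeholders, not rows from the dataset.

```python
import pandas as pd

# Hypothetical rows with the movieId,title,genres layout
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Example Movie A (1995)", "Example Movie B (1995)"],
    "genres": ["Adventure|Animation|Comedy", "(no genres listed)"],
})

# Option 1: one Python list of genres per movie
genre_lists = movies["genres"].str.split("|")

# Option 2: one 0/1 indicator column per genre
genre_flags = movies["genres"].str.get_dummies(sep="|")

print(genre_lists)
print(genre_flags)
```

The indicator form is convenient for content-based filtering, since each movie becomes a fixed-length genre vector.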

In [8]:
# movies_df.head()

In [9]:
# imdb_data_df.head()

As described in [this article](http://files.grouplens.org/papers/tag_genome.pdf), the tag genome encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.). The tag genome was computed using a machine learning algorithm on user-contributed content including tags, ratings, and textual reviews.

The genome is split into two files. The file `genome-scores.csv` contains movie-tag relevance data in the following format:
```
movieId,tagId,relevance
```

In [10]:
# genome_scores_df.head()

The second file, `genome-tags.csv`, provides the tag descriptions for the tag IDs in the genome file, in the following format:
```
tagId,tag
```

In [11]:
# genome_tags_df.head()
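The two genome files join on `tagId`. As a rough sketch, the made-up frames below follow the column layouts described above and are merged and pivoted into a movie-by-tag relevance matrix; the relevance values are invented for illustration.

```python
import pandas as pd

# Hypothetical data in the movieId,tagId,relevance layout
genome_scores = pd.DataFrame({
    "movieId": [1, 1, 2, 2],
    "tagId": [1, 2, 1, 2],
    "relevance": [0.9, 0.1, 0.2, 0.8],
})

# Hypothetical data in the tagId,tag layout
genome_tags = pd.DataFrame({
    "tagId": [1, 2],
    "tag": ["atmospheric", "realistic"],
})

# attach the tag text to every (movie, tag) relevance score
labelled = genome_scores.merge(genome_tags, on="tagId", how="left")

# one row per movie, one column per tag
tag_matrix = labelled.pivot(index="movieId", columns="tag", values="relevance")
print(tag_matrix)
```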

Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:
```
movieId,imdbId,tmdbId
```
movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.

imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.

tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.

Use of the resources listed above is subject to the terms of each provider.
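The link patterns above can be sketched as small helper functions. The function names are hypothetical, and the zero-padding of `imdbId` to seven digits is an assumption based on the Toy Story example (`tt0114709`), on the expectation that pandas reads the ID as the integer 114709.

```python
def movielens_url(movie_id):
    """Link to a movie page on movielens.org."""
    return f"https://movielens.org/movies/{int(movie_id)}"


def imdb_url(imdb_id):
    """Link to a title on IMDB; assumes the ID is zero-padded to 7 digits."""
    return f"http://www.imdb.com/title/tt{int(imdb_id):07d}/"


def tmdb_url(tmdb_id):
    """Link to a movie page on themoviedb.org."""
    return f"https://www.themoviedb.org/movie/{int(tmdb_id)}"


# Toy Story, using the IDs from the examples above
print(movielens_url(1))
print(imdb_url(114709))
print(tmdb_url(862))
```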

In [12]:
# links_df.head()

All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:
```
userId,movieId,tag,timestamp
```

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

In [13]:
# tags_df.head()

## Machine Learning

In [9]:
# determine range of ratings
lower_rating = train_df['rating'].min()
upper_rating = train_df['rating'].max()
print(f'review range: {lower_rating} to {upper_rating}')

review range: 0.5 to 5.0


In [10]:
# instantiate reader object
reader = surprise.Reader(rating_scale = (lower_rating, upper_rating))

# create data object
data = surprise.Dataset.load_from_df(train_df.drop('timestamp', axis=1), reader)
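`Dataset.load_from_df` expects exactly three columns in the order user, item, rating. Dropping `timestamp` works here because the remaining columns are already in that order; selecting the columns by name is a slightly more defensive alternative. The sketch below uses pandas only, with two rows copied from the `train_df.head()` output and the hypothetical name `ratings_for_surprise`.

```python
import pandas as pd

# Two rows copied from the train_df.head() output
train_sample = pd.DataFrame({
    "userId": [5163, 106343],
    "movieId": [57669, 5],
    "rating": [4.0, 4.5],
    "timestamp": [1518349992, 1206238739],
})

# Explicit selection fixes both the set of columns and their order,
# regardless of how the source file happens to be laid out
ratings_for_surprise = train_sample[["userId", "movieId", "rating"]]
print(ratings_for_surprise)
```

The resulting frame can be passed to `surprise.Dataset.load_from_df` together with the `reader` defined above.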

In [11]:
# instantiate the SVD++ model
alg = surprise.SVDpp()

# train the SVD++ model on the full training set
alg.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x7f687b6e5650>

Create a function that generates a submission csv

In [40]:
def make_kaggle_csv(model_name, clf, features):

    """
    Take a file name suffix (string), a fitted model object and an array of
    (userId, movieId) pairs, and write a csv to submit to Kaggle or Zindi.
    """

    # define columns in sample submission data
    cols = sample_submission_df.columns

    # Make prediction of submission data with the model passed in as clf
    print('making prediction...')
    prediction = [clf.predict(uid=my_uid, iid=my_iid).est for my_uid, my_iid in features]

    # concatenate user id and movie id
    # regex=False matches '.0' literally; as a regex, '.' matches any character and turns 2011 into 11
    uid_iid_concat = test_df['userId'].astype(str) + '_' + test_df['movieId'].astype(str).str.replace('.0', '', regex=False)

    # remove duplicates
    # uid_iid_concat.drop_duplicates(keep='first', inplace=True)

    # make dataframe of predictions of submission data
    print('creating dataframe... ')
    submission = pd.DataFrame({cols[0]: uid_iid_concat,
                               cols[1]: prediction})

    # save file
    filename = 'submission_' + model_name + '.csv'
    submission.to_csv(filename, index=False)
    print(f'Saved file {filename}')

In [41]:
# make submission csv
make_kaggle_csv('simple_recommender_SVD++', alg, test_df.values)

making prediction...
creating dataframe... 
Saved file submission_simple_recommender_SVD++.csv


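With the `drop_duplicates` line enabled, building the dataframe fails because the deduplicated Id series is shorter than the list of predictions:

ValueError: array length 5000019 does not match index length 4826489

That is, 5000019 - 4826489 = 173530 records are removed when duplicates are removed.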

In [38]:
# concatenate user id and movie id
uid_iid_concat = test_df['userId'].astype(str) + '_' + test_df['movieId'].astype(str).str.replace('.0','')

uid_iid_concat

0                 1_11
1               1_4144
2               1_5767
3               1_6711
4               1_7318
              ...     
5000014      162541_79
5000015    162541_4467
5000016      162541_49
5000017    162541_5689
5000018    162541_7153
Length: 5000019, dtype: object
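The Ids above do not match the sample submission (`1_11` here versus `1_2011` there). A likely cause is that `str.replace('.0', '')` is treated as a regular expression, where `.` matches any character, so any character followed by `0` is deleted. The sketch below reproduces this on four movie IDs taken from the `test_df['movieId']` output; the variable names are illustrative.

```python
import pandas as pd

# Movie IDs taken from the test_df output, as strings
movie_ids = pd.Series([2011, 4144, 4079, 4980]).astype(str)

# As a regex, '.0' means "any character followed by 0"
as_regex = movie_ids.str.replace('.0', '', regex=True)

# With regex=False, only a literal ".0" is removed
as_literal = movie_ids.str.replace('.0', '', regex=False)

print(as_regex.tolist())    # ['11', '4144', '79', '49']
print(as_literal.tolist())  # ['2011', '4144', '4079', '4980']
```

Distinct movie IDs collapsing onto the same string would also explain why duplicate Ids appear. Since `movieId` is already `int64`, the replace step can be dropped entirely.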

In [39]:
type(uid_iid_concat)

pandas.core.series.Series

### Check user ids

In [28]:
sample_submission_df['Id'].apply(lambda x: int(x[:x.find('_')]))

0               1
1               1
2               1
3               1
4               1
            ...  
5000014    162541
5000015    162541
5000016    162541
5000017    162541
5000018    162541
Name: Id, Length: 5000019, dtype: int64

In [30]:
sum(test_df['userId'] == sample_submission_df['Id'].apply(lambda x: int(x[:x.find('_')])))

5000019

In [27]:
test_df['userId']

0               1
1               1
2               1
3               1
4               1
            ...  
5000014    162541
5000015    162541
5000016    162541
5000017    162541
5000018    162541
Name: userId, Length: 5000019, dtype: int64

### Check movie ids

In [33]:
sample_submission_df['Id'].apply(lambda x: int(x[x.find('_')+1:]))

0          2011
1          4144
2          5767
3          6711
4          7318
           ... 
5000014    4079
5000015    4467
5000016    4980
5000017    5689
5000018    7153
Name: Id, Length: 5000019, dtype: int64

In [36]:
test_df['movieId']

0          2011
1          4144
2          5767
3          6711
4          7318
           ... 
5000014    4079
5000015    4467
5000016    4980
5000017    5689
5000018    7153
Name: movieId, Length: 5000019, dtype: int64

In [35]:
# check if 2nd part of Id in sample_submission_df is the same as the movieId
sum(test_df['movieId'] == sample_submission_df['Id'].apply(lambda x: int(x[x.find('_')+1:])))

5000019
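The user and movie checks above can also be done in one step by rebuilding the `Id` string and comparing it with the sample submission. This is a sketch on small made-up frames with the same column names as `test_df` and `sample_submission_df`; the frame names are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of test_df and sample_submission_df
test_sample = pd.DataFrame({
    "userId": [1, 1, 162541],
    "movieId": [2011, 4144, 7153],
})
submission_sample = pd.DataFrame({
    "Id": ["1_2011", "1_4144", "162541_7153"],
    "rating": [1.0, 1.0, 1.0],
})

# Rebuild the Id from the integer columns and compare row by row
built_ids = test_sample["userId"].astype(str) + "_" + test_sample["movieId"].astype(str)
ids_match = bool((built_ids == submission_sample["Id"]).all())

# Vectorised alternative to the lambda-based parsing of Id
parts = submission_sample["Id"].str.split("_", expand=True).astype(int)
parts.columns = ["userId", "movieId"]

print(ids_match)
print(parts)
```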