# Practical Machine Learning and Deep Learning - Assignment 2 - Movie Recommender System

## Task description

A recommender system is a type of information filtering system that suggests items or content to users based on their interests, preferences, or past behavior. These systems are commonly used in various domains, such as e-commerce, entertainment, social media, and online content platforms.

Your assignment is to create a recommender system of movies for users:
* Your system should suggest movies to the user based on the user's demographic information (age, gender, occupation, zip code) and favorite movies (list of movie ids).
* Solve this task using a machine learning model. A single model is sufficient.
* Create a benchmark that would evaluate the quality of recommendations of your model. Look for commonly used metrics to evaluate a recommender system and use at least one metric.
* Write a single report describing data exploration, solution implementation, training process, and evaluation on the benchmark.
* Explicitly state the benchmark scores of your systems.
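
Precision and recall at k are among the commonly used ranking metrics for recommender benchmarks. The sketch below only illustrates how such a score could be computed for one user; the function name `precision_recall_at_k` and the sample movie ids are hypothetical and not prescribed by the assignment.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k) for a single user.

    recommended: list of item ids ordered from best to worst.
    relevant:    set of item ids the user actually liked.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: only movie 20 is both in the top 3 and relevant.
precision, recall = precision_recall_at_k([10, 20, 30, 40, 50], {20, 50, 60}, k=3)
print(precision, recall)
```

A benchmark would typically average these values over all test users.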

The submission should be a link to a GitHub repository. The repository should be public so that the instructors can assess it easily.

## Data Description

In this assignment you will use the [MovieLens 100K dataset](https://grouplens.org/datasets/movielens/100k/), which consists of user ratings of movies.

**General information about the dataset:**
* It consists of 100,000 ratings from 943 users on 1682 movies
* Ratings range from 1 to 5
* Each user has rated at least 20 movies
* It contains simple demographic info for the users (age, gender, occupation, zip code)

**Detailed description of data files:**

| **File** | **Description** |
| -------- | --------------- |
| u.data | Full dataset of 100000 ratings by 943 users on 1682 items. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a tab separated list of user id, item id, rating, and timestamp. The time stamps are unix seconds. |
| u.info | The number of users, items, and ratings in the u data set |
| u.item | Information about the items (movies). This is a pipe-separated list of movie id, movie title, release date, video release date, IMDb URL, and genres. The last 19 fields are genres and contain binary values. A movie can belong to several genres at once. The movie ids are the ones used in u.data. |
| u.genre | List of genres. |
| u.user | Demographic information about the users. This is a pipe-separated list of user id, age, gender, occupation, zip code. The user ids are the ones used in the u.data file. |
| u.occupation | List of occupations. |
| u1.base, u1.test, u2.base, u2.test, u3.base, u3.test, u4.base, u4.test, u5.base, u5.test | The data sets u1.base and u1.test through u5.base and u5.test are 80%/20% splits of the u data into training and test data. The test sets of u1, ..., u5 are disjoint; this is for 5-fold cross-validation (where you repeat your experiment with each training and test set and average the results). These data sets can be generated from u.data by mku.sh. |
| ua.base, ua.test, ub.base, ub.test | The data sets ua.base, ua.test, ub.base, and ub.test split the u data into a training set and a test set with exactly 10 ratings per user in the test set. The sets ua.test and ub.test are disjoint. These data sets can be generated from u.data by mku.sh. |
| allbut.pl | The script that generates training and test sets where all but n of a user's ratings are in the training data. |
| mku.sh | A shell script to generate all the u data sets from u.data. |
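
As a rough illustration of the `u.data` layout described in the table, the following sketch parses tab-separated rating lines with the standard `csv` module. The `Rating` tuple, the `parse_ratings` helper, and the sample rows are assumptions made for this example and are not taken from the dataset files.

```python
import csv
import io
from collections import namedtuple

# Field order follows the u.data description: user id, item id, rating, timestamp.
Rating = namedtuple("Rating", ["user_id", "item_id", "rating", "timestamp"])

# Hypothetical sample rows in the same tab-separated layout.
SAMPLE = "1\t50\t5\t880000000\n2\t17\t3\t880000100\n"


def parse_ratings(lines):
    """Parse an iterable of tab-separated lines into Rating tuples."""
    reader = csv.reader(lines, delimiter="\t")
    return [Rating(int(u), int(i), int(r), int(t)) for u, i, r, t in reader]


ratings = parse_ratings(io.StringIO(SAMPLE))
print(ratings[0])
```

With the real file, an open file handle would be passed in place of the `io.StringIO` object.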

## Evaluation criteria

The repository should have the following structure:

```
movie-recommender-system
├── README.md               # The top-level README
│
├── data
│   ├── external            # Data from third party sources
│   ├── interim             # Intermediate data that has been transformed.
│   └── raw                 # The original, immutable data
│
├── models                  # Trained and serialized models, final checkpoints
│
├── notebooks               #  Jupyter notebooks. Naming convention is a number (for ordering),
│                               and a short delimited description, e.g.
│                               "1.0-initial-data-exploration.ipynb"
│
├── references              # Data dictionaries, manuals, and all other explanatory materials.
│
├── reports
│   ├── figures             # Generated graphics and figures to be used in reporting
│   └── final_report.pdf    # Report containing data exploration, solution exploration, training process, and evaluation
│
└── benchmark
    ├── data                # dataset used for evaluation
    └── evaluate.py         # script that performs evaluation of the given model
```


In the top `README.md` file put your name, email and group number.

In the `reports` directory create a report about your work. In the report, describe the implementation of your system in detail. Mention its advantages and disadvantages.

### Expected Report Structure

```
# Introduction
...
# Data analysis
...
# Model Implementation
...
# Model Advantages and Disadvantages
...
# Training Process
...
# Evaluation
...
# Results
...
```

In the `notebooks` directory put at least two notebooks. The **first notebook** should contain your initial data exploration and the basic ideas behind data preprocessing. The **second notebook** should contain information about the final solution training and visualization.

## Grading criteria

A complete assignment without any problems counts as the `100%` solution.

| Criteria | Weight (%) | Comment |
| ---- | ----- | ----- |
| Structure and code quality | 30 | Code quality, structure, comments, clean repo, commit history, reproducibility (manual seeding) |
| Visualization, notebooks quality | 10 | Jupyter notebooks, visualizations |
| Solution building | 40 |  Implementation description, references, final report structure |
| Final score, evaluation  | 20 | Evaluation function, final score, quality of results |

If the **PMLDL Course Team** has any questions about your assignment, or your work fails to demonstrate your results, you will be called to a solution defence.




# Solution

## Data downloading

In [1]:
complete_dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-latest.zip'
small_dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'

In [2]:
!pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9655 sha256=15afc752c6b229803d6454e91fd85c8ee768043486945efc992b9b9b71daafe5
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [3]:
import os
import wget

complete_f = wget.download(complete_dataset_url)

small_f = wget.download(small_dataset_url)

In [4]:
!unzip /content/ml-latest-small.zip

Archive:  /content/ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [5]:
!unzip /content/ml-latest.zip

Archive:  /content/ml-latest.zip
   creating: ml-latest/
  inflating: ml-latest/tags.csv      
  inflating: ml-latest/links.csv     
  inflating: ml-latest/README.txt    
  inflating: ml-latest/ratings.csv   
  inflating: ml-latest/genome-tags.csv  
  inflating: ml-latest/genome-scores.csv  
  inflating: ml-latest/movies.csv    
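
The download and extraction steps above can also be done with the standard library alone, which avoids the extra `wget` package and the shell `unzip` call. This is a minimal sketch under that assumption; `archive_name` and `fetch_and_extract` are hypothetical helper names, not part of the notebook.

```python
import os
import urllib.request
import zipfile
from urllib.parse import urlparse


def archive_name(url):
    """Return the file name at the end of a URL path."""
    return os.path.basename(urlparse(url).path)


def fetch_and_extract(url, dest_dir="."):
    """Download a zip archive into dest_dir (if missing) and extract it there."""
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, archive_name(url))
    if not os.path.exists(archive_path):
        # Skipped when the archive was already downloaded.
        urllib.request.urlretrieve(url, archive_path)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)
    return archive_path
```

In the notebook this could be called as `fetch_and_extract(small_dataset_url, "/content")`.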


In [6]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=eb92fe757f69adaa6d11d4b6033f2dce648a266c5d482ab53c450b36f63ecc18
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


## Initializing Spark

In [7]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
from pyspark.sql import SparkSession

In [8]:
from pyspark import SparkContext
sc = SparkContext()

In [9]:
# I will use the smaller version of the dataset to select the optimal parameters
small_ratings_file = "/content/ml-latest-small/ratings.csv"

small_ratings_raw_data = sc.textFile(small_ratings_file)
small_ratings_raw_data_header = small_ratings_raw_data.take(1)[0]

In [10]:
small_ratings_data = small_ratings_raw_data.filter(lambda line: line!=small_ratings_raw_data_header)\
    .map(lambda line: line.split(",")).map(lambda tokens: (tokens[0],tokens[1],tokens[2])).cache()

In [11]:
small_ratings_data.take(3)

[('1', '1', '4.0'), ('1', '3', '4.0'), ('1', '6', '4.0')]

In [39]:
# preparing the data, we use pyspark rdd
datasets_path = "/content/"

small_movies_file = os.path.join(datasets_path, 'ml-latest-small', 'movies.csv')

small_movies_raw_data = sc.textFile(small_movies_file)
small_movies_raw_data_header = small_movies_raw_data.take(1)[0]

small_movies_data = small_movies_raw_data.filter(lambda line: line!=small_movies_raw_data_header)\
    .map(lambda line: line.split(",")).map(lambda tokens: (tokens[0],tokens[1])).cache()

small_movies_data.take(3)

[('1', 'Toy Story (1995)'),
 ('2', 'Jumanji (1995)'),
 ('3', 'Grumpier Old Men (1995)')]
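
Note that `line.split(",")` also splits inside quoted titles that contain a comma, which truncates such titles to their first fragment. A minimal sketch of a more robust parser using the standard `csv` module is shown below; `parse_movie_line` is a hypothetical helper and the sample line is illustrative.

```python
import csv


def parse_movie_line(line):
    """Parse one movies.csv line into (movie_id, title, genres).

    csv.reader honours double quotes, so commas inside a quoted
    title do not start a new field.
    """
    fields = next(csv.reader([line]))
    return int(fields[0]), fields[1], fields[2]


# Illustrative line with a comma inside the quoted title.
sample = '1258,"Shining, The (1980)",Horror'

print(sample.split(",")[1])         # naive split keeps only a fragment
print(parse_movie_line(sample)[1])  # csv-based parsing keeps the full title
```

With an RDD, the same function could be applied as `.map(parse_movie_line)` in place of the `split` and tuple-building steps.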

## Selecting ALS parameters using the small dataset

In [13]:
# Splitting the datasets
training_RDD, validation_RDD, test_RDD = small_ratings_data.randomSplit([6, 2, 2], seed=0)
validation_for_predict_RDD = validation_RDD.map(lambda x: (x[0], x[1]))
test_for_predict_RDD = test_RDD.map(lambda x: (x[0], x[1]))
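
The weights `[6, 2, 2]` are normalized, so the split is roughly 60% training, 20% validation, and 20% test; the resulting sizes are approximate because each record is assigned at random. The pure-Python sketch below illustrates that idea only. It is not Spark's implementation, and `weighted_split` is a hypothetical name.

```python
import random


def weighted_split(items, weights, seed=0):
    """Randomly assign each item to one part with probability
    proportional to the corresponding weight."""
    total = float(sum(weights))
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    parts = [[] for _ in weights]
    for item in items:
        draw = rng.random()  # uniform in [0, 1)
        for idx, upper in enumerate(bounds):
            if draw < upper:
                parts[idx].append(item)
                break
    return parts


train, validation, test = weighted_split(range(1000), [6, 2, 2], seed=0)
print(len(train), len(validation), len(test))
```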

In [14]:
from pyspark.mllib.recommendation import ALS
import math

seed = 5
iterations = 10
regularization_parameter = 0.1
ranks = [4, 8, 12]
errors = []

min_error = float('inf')
best_rank = -1
for rank in ranks:
    # train on the training split with the candidate rank
    model = ALS.train(training_RDD, rank, seed=seed, iterations=iterations,
                      lambda_=regularization_parameter)
    # predict ratings for the (user, movie) pairs of the validation split
    predictions = model.predictAll(validation_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
    # pair each actual rating with its prediction, keyed by (user, movie)
    rates_and_preds = validation_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
    error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())
    errors.append(error)
    print('For rank %s the RMSE is %s' % (rank, error))
    if error < min_error:
        min_error = error
        best_rank = rank

print('The best model was trained with rank %s' % best_rank)

For rank 4 the RMSE is 0.9121002117526161
For rank 8 the RMSE is 0.9184327220254898
For rank 12 the RMSE is 0.9160151516811809
The best model was trained with rank 4
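
The loop above scores each rank by the root mean squared error (RMSE) between actual and predicted ratings on the validation split. The sketch below reproduces that computation on plain dictionaries keyed by `(user, movie)`; the `rmse` helper and the sample values are assumptions for illustration. As with the RDD `join`, pairs without a prediction are left out of the average.

```python
import math


def rmse(actual, predicted):
    """RMSE over the (user, movie) keys present in both dictionaries."""
    common = actual.keys() & predicted.keys()
    if not common:
        raise ValueError("no overlapping (user, movie) pairs")
    squared_errors = [(actual[key] - predicted[key]) ** 2 for key in common]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings: the pair (2, 10) has no prediction and is skipped.
actual = {(1, 10): 4.0, (1, 20): 3.0, (2, 10): 5.0}
predicted = {(1, 10): 3.0, (1, 20): 3.0}
print(rmse(actual, predicted))
```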


In [17]:
# evaluating the predictions
model = ALS.train(training_RDD, best_rank, seed=seed, iterations=iterations,
                      lambda_=regularization_parameter)
predictions = model.predictAll(test_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
rates_and_preds = test_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())

print('For testing data the RMSE is %s' % (error))

For testing data the RMSE is 0.912132604158584


## Using the complete dataset to build the final model

In [18]:
# Load the complete dataset file
complete_ratings_file = os.path.join(datasets_path, 'ml-latest', 'ratings.csv')
complete_ratings_raw_data = sc.textFile(complete_ratings_file)
complete_ratings_raw_data_header = complete_ratings_raw_data.take(1)[0]

# Parse
complete_ratings_data = complete_ratings_raw_data.filter(lambda line: line!=complete_ratings_raw_data_header)\
    .map(lambda line: line.split(",")).map(lambda tokens: (int(tokens[0]),int(tokens[1]),float(tokens[2]))).cache()

print ("There are %s recommendations in the complete dataset" % (complete_ratings_data.count()))

There are 33832162 recommendations in the complete dataset


In [19]:
# training the final model on the complete dataset
training_RDD, test_RDD = complete_ratings_data.randomSplit([7, 3], seed=0)

complete_model = ALS.train(training_RDD, best_rank, seed=seed,
                           iterations=iterations, lambda_=regularization_parameter)

In [20]:
# testing the final model
test_for_predict_RDD = test_RDD.map(lambda x: (x[0], x[1]))

predictions = complete_model.predictAll(test_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
rates_and_preds = test_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())

print ('For testing data the RMSE is %s' % (error))

For testing data the RMSE is 0.8261327073910048


## Getting recommendations for a sample user

Let's make predictions for a sample user with the following rating history:

userId,movieId,rating,timestamp

1,19,4.0,964982703

1,34,4.0,964981247

1,6,4.0,964982224

1,47,5.0,964983815

1,51,5.0,964982931

1,75,3.0,964982400

1,10,5.0,964980868

1,10,4.0,964982176

1,151,5.0,964984041

1,117,5.0,964984100

1,13,5.0,964983650

1,2,5.0,964981208

1,23,3.0,964980985


In [30]:
complete_movies_file = os.path.join(datasets_path, 'ml-latest', 'movies.csv')
complete_movies_raw_data = sc.textFile(complete_movies_file)
complete_movies_raw_data_header = complete_movies_raw_data.take(1)[0]

# Parse
complete_movies_data = complete_movies_raw_data.filter(lambda line: line!=complete_movies_raw_data_header)\
    .map(lambda line: line.split(",")).map(lambda tokens: (int(tokens[0]),tokens[1],tokens[2])).cache()

complete_movies_titles = complete_movies_data.map(lambda x: (int(x[0]),x[1]))

print ("There are %s movies in the complete dataset" % (complete_movies_titles.count()))

There are 86537 movies in the complete dataset


In [31]:
def get_counts_and_averages(ID_and_ratings_tuple):
    nratings = len(ID_and_ratings_tuple[1])
    return ID_and_ratings_tuple[0], (nratings, float(sum(x for x in ID_and_ratings_tuple[1]))/nratings)

movie_ID_with_ratings_RDD = (complete_ratings_data.map(lambda x: (x[1], x[2])).groupByKey())
movie_ID_with_avg_ratings_RDD = movie_ID_with_ratings_RDD.map(get_counts_and_averages)
movie_rating_counts_RDD = movie_ID_with_avg_ratings_RDD.map(lambda x: (x[0], x[1][0]))
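
The cell above computes, for every movie, the number of ratings and the average rating. The same result can be obtained in a single pass by keeping a running `(count, sum)` per movie instead of collecting all ratings first. The pure-Python sketch below shows that logic on hypothetical data; `count_and_average` is an illustrative name, and porting it to Spark (for example with `aggregateByKey`) is left as an assumption.

```python
def count_and_average(pairs):
    """Map each movie id to (number of ratings, average rating)."""
    totals = {}
    for movie_id, rating in pairs:
        count, total = totals.get(movie_id, (0, 0.0))
        totals[movie_id] = (count + 1, total + rating)
    return {movie_id: (count, total / count)
            for movie_id, (count, total) in totals.items()}


# Hypothetical (movie_id, rating) pairs.
stats = count_and_average([(1, 4.0), (1, 5.0), (2, 3.0), (1, 3.0)])
print(stats)
```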

In [25]:
complete_ratings_file_eval = "/content/eval/ratings.csv"
raw_eval = sc.textFile(complete_ratings_file_eval)

In [26]:
data_eval = raw_eval.filter(lambda line: line!=complete_ratings_raw_data_header)\
    .map(lambda line: line.split(",")).map(lambda tokens: (int(tokens[0]),int(tokens[1]),float(tokens[2]))).cache()

In [28]:
users_to_recommend = data_eval.map(lambda x: (x[0], x[1]))
new_user_recommendations_RDD = complete_model.predictAll(users_to_recommend)

In [32]:
new_user_recommendations_rating_RDD = new_user_recommendations_RDD.map(lambda x: (x.product, x.rating))
new_user_recommendations_rating_title_and_count_RDD = new_user_recommendations_rating_RDD.join(complete_movies_titles).join(movie_rating_counts_RDD)
new_user_recommendations_rating_title_and_count_RDD.take(3)

[(97328, ((3.2456643250798933, 'Liberal Arts (2012)'), 414)),
 (4928,
  ((4.1395190947861344,
    'That Obscure Object of Desire (Cet obscur objet du désir) (1977)'),
   905)),
 (4928,
  ((3.9225236849812015,
    'That Obscure Object of Desire (Cet obscur objet du désir) (1977)'),
   905))]
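
Movie 4928 appears twice in this output because a pair-RDD `join` emits one row for every matching combination of left and right values, so a movie id that occurs several times among the predictions is repeated in the result. The sketch below mimics that inner-join behaviour in plain Python; `inner_join` and the shortened sample values are assumptions for illustration.

```python
from collections import defaultdict


def inner_join(left, right):
    """Join two lists of (key, value) pairs, one output row per match."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, (left_value, right_value))
            for key, left_value in left
            for right_value in index.get(key, [])]


# Two predictions share the movie id 4928, so it is joined twice.
predictions = [(4928, 4.14), (4928, 3.92), (97328, 3.25)]
titles = [(4928, "That Obscure Object of Desire"), (97328, "Liberal Arts (2012)")]
rows = inner_join(predictions, titles)
print(rows)
```

If one row per movie is wanted, the predictions would need to be deduplicated or aggregated per movie id before the join.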

In [33]:
new_user_recommendations_rating_title_and_count_RDD = new_user_recommendations_rating_title_and_count_RDD.map(lambda r: (r[1][0][1], r[1][0][0], r[1][1]))

Getting the recommendations in a human-readable form

In [34]:
top_movies = new_user_recommendations_rating_title_and_count_RDD.filter(lambda r: r[2]>=25).takeOrdered(25, key=lambda x: -x[1])

print ('TOP recommended movies (with more than 25 reviews):\n%s' %
        '\n'.join(map(str, top_movies)))

TOP recommended movies (with more than 25 reviews):
('Apocalypse Now (1979)', 6.415263624734343, 34020)
('Reservoir Dogs (1992)', 6.205831000329978, 45318)
('"Shining', 6.175700739561374, 40297)
('Requiem for a Dream (2000)', 6.041638315947007, 30402)
('"Good', 6.040596117829245, 23823)
('Goodfellas (1990)', 6.035250005435119, 44592)
('Eternal Sunshine of the Spotless Mind (2004)', 5.963101701998606, 46292)
('Alien (1979)', 5.95763215351198, 46572)
('Donnie Darko (2001)', 5.924413319259936, 36667)
('Harry Potter and the Prisoner of Azkaban (2004)', 5.9078172758939225, 32517)
("Monty Python's Life of Brian (1979)", 5.906836046588124, 28801)
('"Shawshank Redemption', 5.903036156870099, 122296)
('Kill Bill: Vol. 1 (2003)', 5.850071139395098, 46973)
('Memento (2000)', 5.84109154513231, 55649)
('Twelve Monkeys (a.k.a. 12 Monkeys) (1995)', 5.803686360726292, 59730)
('"Shawshank Redemption', 5.797215967440351, 122296)
('Kill Bill: Vol. 2 (2004)', 5.736034165605361, 38553)
('"Usual Suspects', 
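
Several predicted scores in the list above exceed 5, because the matrix factorization does not constrain its predictions to the rating scale. For display purposes the scores can be clipped, as in the sketch below. The bounds 0.5 and 5.0 are an assumption based on the MovieLens star scale, and `clip_rating` is a hypothetical helper.

```python
def clip_rating(value, low=0.5, high=5.0):
    """Clamp a predicted rating to the assumed [low, high] rating scale."""
    return max(low, min(high, value))


print(clip_rating(6.415263624734343))
print(clip_rating(3.2456643250798933))
```

Clipping removes the ordering among scores above the upper bound, so ranking should still be done on the raw predictions.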

## Saving the model

In [36]:
from pyspark.mllib.recommendation import MatrixFactorizationModel

model_path = "/content/model"

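# Save the final model trained on the complete dataset, then load it back
# (`model` at this point still refers to the model trained on the small dataset)
complete_model.save(sc, model_path)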
same_model = MatrixFactorizationModel.load(sc, model_path)
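
Once the model directory exists, it can also be archived from Python instead of the shell. This is a minimal sketch using `shutil.make_archive`; the `archive_directory` helper name is an assumption made for this example.

```python
import os
import shutil


def archive_directory(src_dir, archive_base):
    """Zip src_dir into '<archive_base>.zip' and return the archive path.

    Entries in the archive are stored relative to the parent of src_dir,
    so they start with the directory's own name.
    """
    src_dir = os.path.abspath(os.path.normpath(src_dir))
    return shutil.make_archive(
        archive_base,
        "zip",
        root_dir=os.path.dirname(src_dir),
        base_dir=os.path.basename(src_dir),
    )
```

For the paths used in this notebook, the call would be `archive_directory("/content/model", "/content/model")`.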

In [38]:
!zip -r /content/model.zip /content/model

  adding: content/model/ (stored 0%)
  adding: content/model/data/ (stored 0%)
  adding: content/model/data/user/ (stored 0%)
  adding: content/model/data/user/part-00000-3309dc45-0ebd-40d6-a714-bbdf11c690a5-c000.snappy.parquet (deflated 23%)
  adding: content/model/data/user/part-00001-3309dc45-0ebd-40d6-a714-bbdf11c690a5-c000.snappy.parquet (deflated 23%)
  adding: content/model/data/user/.part-00001-3309dc45-0ebd-40d6-a714-bbdf11c690a5-c000.snappy.parquet.crc (stored 0%)
  adding: content/model/data/user/._SUCCESS.crc (stored 0%)
  adding: content/model/data/user/_SUCCESS (stored 0%)
  adding: content/model/data/user/.part-00000-3309dc45-0ebd-40d6-a714-bbdf11c690a5-c000.snappy.parquet.crc (stored 0%)
  adding: content/model/data/product/ (stored 0%)
  adding: content/model/data/product/.part-00000-fad96151-59d4-4b4c-9e77-eb94a2b0e959-c000.snappy.parquet.crc (stored 0%)
  adding: content/model/data/product/._SUCCESS.crc (stored 0%)
  adding: content/model/data/product/part-00000-fad9