<a href="https://colab.research.google.com/github/monicafar147/unsupervised-predict-streamlit-template/blob/eda/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EDSA Movie Recommendation Challenge

This notebook was created as an entry to EDSA's private Kaggle competition on building a movie recommendation predictor.

In today’s technology-driven world, recommender systems are socially and economically critical for ensuring that individuals can make appropriate choices about the content they engage with daily.

This is especially true of movie recommendations, where intelligent algorithms can help viewers find great titles among tens of thousands of options.

This notebook constructs a recommendation algorithm, based on content or collaborative filtering, that predicts how a user will rate a movie they have not yet viewed, given their historical preferences.

An accurate and robust solution to this challenge has immense economic potential: users are exposed to content they would like to view or purchase, which generates revenue and platform affinity.

# Imports

## Launching Comet Experiment

In [None]:
#pip install comet_ml

In [4]:
# Import comet_ml at the top of your file
from comet_ml import Experiment

# Create the experiment. NOTE: a hard-coded API key is visible to anyone who can
# read this notebook; consider loading it from an environment variable instead.
experiment = Experiment(api_key="cFwBSBV5CDhaL3B3LHdwF7O3M",
                        project_name="recommender-systems", workspace="jonnybegreat")

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/jonnybegreat/recommender-systems/193f9d063d4c462291106e21f0dc241d



## Importing modules

In [None]:
#pip install surprise

In [1]:
# Import essential packages
import pandas as pd
import numpy as np

# Import surprise packages
from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# Import models
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import SVD
from surprise.model_selection import GridSearchCV

# Import utilities
import os
import pickle  # needed later to save the trained model

## Importing the raw data

In [2]:
imdb = pd.read_csv('imdb_data.csv')
train = pd.read_csv('train.csv')
movies = pd.read_csv('movies.csv')
g_tags = pd.read_csv('genome_tags.csv')
g_scores = pd.read_csv('genome_scores.csv')
tags = pd.read_csv('tags.csv')
print('Data Successfully imported!')

Data Successfully imported!


In [4]:
# Import the train and test datasets (the timestamp column is not needed)
ratings_df = pd.read_csv('train.csv').drop(columns=['timestamp'])
test_df = pd.read_csv('test.csv')
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,5163,57669.0,4.0
1,106343,5.0,4.5
2,146790,5459.0,5.0
3,106362,32296.0,2.0
4,9041,366.0,3.0


Because the dataset is large, only a subset of the test data (the first 1,000 rows) is selected here.

In [5]:
# Subset of test data
test_df_subset = test_df[:1000]

# EDA

## What information is available in the data?
The dataset consists of 9 951 829 ratings (on a 5-star scale) obtained from users of the online MovieLens movie recommendation service. The MovieLens dataset has long been used by industry and academic researchers to improve the performance of recommender systems based on explicit ratings.

- genome_scores.csv - A score mapping the strength between movies and tag-related properties.
- genome_tags.csv - User-assigned tags for genome-related scores.
- imdb_data.csv - Additional movie metadata scraped from IMDB using the links.csv file.
- links.csv - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
- tags.csv - User-assigned tags for the movies within the dataset.
- test.csv - The test split of the dataset. Contains user and movie IDs with no rating data.
- train.csv - The training split of the dataset. Contains user and movie IDs with associated rating data.

In [3]:
g_scores.head(1)

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875


In [17]:
g_tags.head(1)

Unnamed: 0,tagId,tag
0,1,7


In [19]:
tags.head(1)

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355


In [21]:
movies.head(1)

Unnamed: 0,35,Comedy,5,35.1
0,35,Comedy,13,35


In [23]:
imdb.head(1)

Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation


## Research Question 1: Which year had the most movie releases?

## Research Question 2: Which movies have the highest and lowest budgets?

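One possible starting point for this question is sketched below. It assumes `budget` is stored as text such as `"$30,000,000"` (as in the `imdb.head(1)` output above) and uses a small made-up frame in place of `imdb`. The `budget_usd` column name and the sample rows are illustrative only, and budgets quoted in other currencies would need separate handling.

```python
import pandas as pd

# Hypothetical stand-in for the imdb frame (only the columns used here)
sample = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "budget": ["$30,000,000", "$65,000,000", None, "$1,500,000"],
})

# Strip every non-digit character, then convert; missing budgets stay NaN
digits = sample["budget"].str.replace(r"[^\d]", "", regex=True)
sample["budget_usd"] = pd.to_numeric(digits, errors="coerce")

# idxmax / idxmin skip NaN values by default
highest = sample.loc[sample["budget_usd"].idxmax(), "movieId"]
lowest = sample.loc[sample["budget_usd"].idxmin(), "movieId"]
print(highest, lowest)
```
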
## Research Question 3: Which movies have the longest and shortest runtimes?

## Research Question 4: Which movies have the highest and lowest ratings?

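A rough way to approach this question, assuming the `userId`, `movieId`, `rating` layout of `ratings_df` shown earlier, is to average the ratings per movie and ignore movies with very few ratings so that a single vote cannot dominate. The toy data and the `min_ratings` threshold are assumptions made for illustration.

```python
import pandas as pd

# Made-up ratings in the same column layout as ratings_df
toy_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 3, 1],
    "movieId": [10, 10, 10, 20, 20, 20, 30],
    "rating":  [5.0, 4.5, 4.0, 2.0, 1.5, 1.0, 5.0],
})

min_ratings = 2  # hypothetical cut-off; tune for the real data

# Mean rating and number of ratings per movie
stats = toy_ratings.groupby("movieId")["rating"].agg(["mean", "count"])
eligible = stats[stats["count"] >= min_ratings]

best_movie = eligible["mean"].idxmax()
worst_movie = eligible["mean"].idxmin()
print(best_movie, worst_movie)
```

Movie 30 has a perfect average but only one rating, so the threshold excludes it from the ranking.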
## Research Question 5: How does the average movie runtime change from year to year?

## Research Question 6: Which genre has the most movie releases?

## Research Question 7: Who is the most frequent actor?

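As a sketch of how this could be counted, assuming `title_cast` holds actor names separated by `|` (as in the `imdb.head(1)` output above): split each cast string, put one actor on each row, and count occurrences. The sample frame is made up.

```python
import pandas as pd

# Hypothetical stand-in for the imdb frame
sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title_cast": ["Tom Hanks|Tim Allen", "Tom Hanks|Meg Ryan", None],
})

actor_counts = (
    sample["title_cast"]
    .dropna()          # movies without cast information are skipped
    .str.split("|")    # one list of names per movie
    .explode()         # one row per (movie, actor)
    .value_counts()    # appearances per actor, most frequent first
)
top_actor = actor_counts.idxmax()
print(top_actor, actor_counts[top_actor])
```

The same split-and-count pattern should carry over to `plot_keywords`, which appears to use the same `|` separator.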
## Research Question 8: Which 20 directors have directed the most movies?

## Research Question 9: What is the most popular keyword?

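# Modelling

A `Reader` is needed to specify the `rating_scale` parameter.

In [None]:
# NOTE: MovieLens ratings usually run from 0.5 to 5.0 in half-star steps;
# check ratings_df['rating'].min() and widen the scale if needed.
reader = Reader(rating_scale=(1, 5))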

 - Load the data into a format that Surprise understands, using its `Dataset` class.
 - The columns must be the user ID, item ID, and rating, in that order.

In [None]:
data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)

We can now use this dataset as we please, e.g. by calling `cross_validate`. Here, the full dataset is used to build the training set.

In [None]:
# The full train dataset is used for training here;
# a train/test split would be needed to evaluate the model.
trainset = data.build_full_trainset()

## SVD model

In [None]:
# Build an algorithm, and train it.
algo = SVD()
algo.fit(trainset)

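For intuition about what `SVD()` fits: Surprise's SVD is a biased matrix-factorization model in which a rating is estimated as the global mean plus a user bias, an item bias, and the dot product of a user and an item factor vector, trained by stochastic gradient descent. The NumPy sketch below is a simplified illustration of that idea, not Surprise's actual implementation. The function names, toy ratings, and hyperparameters are made up for the example.

```python
import numpy as np

def fit_biased_mf(ratings, n_users, n_items, n_factors=2,
                  n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit mu + b_u + b_i + q_i . p_u by SGD on (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))
    bu, bi = np.zeros(n_users), np.zeros(n_items)
    pu = rng.normal(0, 0.1, (n_users, n_factors))
    qi = rng.normal(0, 0.1, (n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + qi[i] @ pu[u])
            # Regularised gradient steps on biases and factors
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu_old = pu[u].copy()
            pu[u] += lr * (err * qi[i] - reg * pu[u])
            qi[i] += lr * (err * pu_old - reg * qi[i])
    return mu, bu, bi, pu, qi

def estimate(model, u, i):
    mu, bu, bi, pu, qi = model
    return mu + bu[u] + bi[i] + qi[i] @ pu[u]

def train_rmse(model, ratings):
    errors = [(r - estimate(model, u, i)) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(errors)))

# (user index, item index, rating) triples, invented for the example
toy = [(0, 0, 4.0), (0, 1, 3.0), (1, 0, 5.0),
       (1, 2, 2.0), (2, 1, 4.5), (2, 2, 1.0)]

untrained = fit_biased_mf(toy, n_users=3, n_items=3, n_epochs=0)
trained = fit_biased_mf(toy, n_users=3, n_items=3, n_epochs=200)
print(train_rmse(untrained, toy), train_rmse(trained, toy))
```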
In [None]:
# Pickle model

model_save_path = "SVD_full_dataset.pkl"
with open(model_save_path,'wb') as file:
    pickle.dump(algo,file)

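Loading the model back is the mirror image of the cell above. The sketch below round-trips a stand-in dictionary through a temporary file so that it runs without a trained model. With the real pickle, the same `pickle.load` call would return the fitted `SVD` object, provided Surprise is installed in the loading environment. Only unpickle files from sources you trust, since `pickle.load` can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Placeholder object, not a real model
stand_in_model = {"model_type": "SVD", "note": "stand-in for the fitted algo"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Save: binary write mode
    with open(path, "wb") as file:
        pickle.dump(stand_in_model, file)

    # Load: binary read mode
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == stand_in_model)
```
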
# Recommender predictions

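Example of how to predict using the SVD algorithm

In [None]:
# Raw user ID, as in the ratings DataFrame. Because the data was loaded with
# load_from_df, raw IDs keep the DataFrame's numeric type; they are strings
# only when Surprise reads the ratings from a file itself.
uid = 1
# Raw item ID, as in the ratings DataFrame (same type rule as above).
iid = 2011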

Get a prediction for specific users and items.

In [None]:
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

Get predictions for the test data

In [None]:
# Predict a rating for every (userId, movieId) pair in the test data;
# .est is the estimated rating of each Prediction
predcol = [
    algo.predict(uid, iid).est
    for uid, iid in zip(test_df['userId'], test_df['movieId'])
]

In [None]:
# check if the prediction length is the same as input length
len(predcol) == len(test_df)

In [None]:
# Convert columns to string in order to prep for concatenating
test_df['userId'] = test_df['userId'].astype(str)
test_df['movieId'] = test_df['movieId'].astype(str)

In [None]:
# Create the submission Id column in the form userId_movieId
test_df['Id'] = test_df['userId'] + '_' + test_df['movieId']

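The Id construction above can be checked on a tiny frame. This sketch uses made-up IDs and assumes the submission key is `userId` and `movieId` joined by an underscore, as in the cell above. If an ID column has been read as float, it may need casting to int first, otherwise the key picks up a trailing `.0`.

```python
import pandas as pd

toy_test = pd.DataFrame({"userId": [1, 1, 2], "movieId": [2011, 4144, 5767]})

# Cast to str first: concatenating an integer column with '_' raises a TypeError
toy_test["Id"] = toy_test["userId"].astype(str) + "_" + toy_test["movieId"].astype(str)
print(toy_test["Id"].tolist())
```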
In [None]:
kaggle_df = pd.DataFrame(
    {'Id': test_df['Id'],
     'rating': predcol
    })

In [None]:
kaggle_df.shape

In [None]:
kaggle_df.head()

In [None]:
# Export to csv
kaggle_df.to_csv("recommend_2.csv",index=False)

In [None]:
# Comet logging: create dictionaries of parameters and metrics

# param_grid is not defined anywhere in this notebook, so a placeholder is logged
params = {"random_state": 'none',
          "model_type": "SVD",
          "scaler": "none",
          "param_grid": 'none'
          }
metrics = {"RMSE": 'none',
           "top n": 'none',
           "MAE": 'none'
           }

# Log parameters and results
experiment.log_parameters(params)
experiment.log_metrics(metrics)

# End Experiment
experiment.end()
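The `RMSE` and `MAE` entries above are still placeholders. One way to fill them, assuming true ratings are available for a held-out set, is to compute both from paired arrays of true and predicted ratings. The helper names and the numbers below are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

# Made-up held-out ratings and model estimates
y_true = [4.0, 3.0, 5.0, 2.0]
y_pred = [3.5, 3.0, 4.0, 2.5]

example_metrics = {"RMSE": rmse(y_true, y_pred), "MAE": mae(y_true, y_pred)}
print(example_metrics)
```

Floats computed this way could then be passed to `experiment.log_metrics` in place of the `'none'` strings.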