# Recommendation system for movies

**Benedikt Roth**

**data-pt-ber-08-20**

## Overview
Main project: Building recommendation systems based on different techniques<br>
Hypothesis/question to answer: Is machine learning the best approach for building a recommendation system?<br>
Testing: Build recommendation systems using several techniques and compare them with the machine learning approach<br>

Sub project: Clustering people based on their average genre ratings<br>
Hypothesis: Is there one genre that mostly drives the formation of the different clusters?<br>
Testing: Build an unsupervised machine learning algorithm to identify the most important feature for clustering people based on movie genres.<br>
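
The sub project could be sketched roughly as follows. This is a minimal illustration on hypothetical toy ratings, not the notebook's actual model: the `toy` frame, the choice of `KMeans`, the mean-imputation of unrated genres and the "center spread" heuristic for naming the driving genre are all assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy data with the same columns as the merged ratings table.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3, 4, 4],
    "genres": ["Comedy|Romance", "Action", "Comedy", "Action|Thriller",
               "Action", "Comedy|Drama", "Action|Thriller", "Comedy"],
    "rating": [5.0, 1.0, 4.5, 1.5, 5.0, 1.0, 4.5, 2.0],
})

# One row per (rating, genre) pair, then the mean rating per user and genre.
exploded = toy.assign(genre=toy["genres"].str.split("|")).explode("genre")
genre_means = exploded.pivot_table(index="userId", columns="genre",
                                   values="rating", aggfunc="mean")
# Assumption: genres a user never rated get that genre's overall mean.
features = genre_means.fillna(genre_means.mean())

# Inertia feeds the elbow plot; the silhouette score rates cluster separation.
models, scores = {}, {}
for k in (2, 3):
    models[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    scores[k] = {"inertia": models[k].inertia_,
                 "silhouette": silhouette_score(features, models[k].labels_)}

# Heuristic: the genre whose cluster centers lie furthest apart.
centers = pd.DataFrame(models[2].cluster_centers_, columns=features.columns)
driving_genre = (centers.max() - centers.min()).idxmax()
```

On the real data, `scores` would be computed over a wider range of `k` and plotted to look for the elbow.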

Main structure (process):

1. Data acquisition
2. Data cleaning
3. Data exploration and analysis
4. Sub project:
    * Feature selection
    * Train unsupervised learning model
    * Model evaluation using the Elbow method and Silhouette score
5. Building recommendation engines:
    * Content-based filtering
    * Item-item based filtering
    * User-item based filtering
    * Model-based filtering
    * Model-based filtering using an ML approach:
        * Train/test split
        * Train model
        * Tune model
        * Evaluate model
6. Conclusion
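
As an illustration of one of the engines listed above, item-item based filtering can be sketched with a user x item matrix and cosine similarity. The toy ratings, the `similar_items` helper and the choice to fill unrated cells with 0 are assumptions for this example, not the notebook's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with userId / title / rating columns.
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3, 3, 3],
    "title":  ["A", "B", "C", "A", "B", "A", "C", "D"],
    "rating": [5.0, 4.0, 1.0, 4.0, 5.0, 1.0, 5.0, 4.0],
})

# User x item matrix; unrated cells become 0 (a common simplification).
matrix = toy.pivot_table(index="userId", columns="title", values="rating").fillna(0.0)

# Cosine similarity between the item columns.
items = matrix.to_numpy().T
norms = np.linalg.norm(items, axis=1, keepdims=True)
similarity = pd.DataFrame((items @ items.T) / (norms @ norms.T),
                          index=matrix.columns, columns=matrix.columns)

def similar_items(title, n=2):
    """Return the n items most similar to `title`, excluding the item itself."""
    return similarity[title].drop(title).sort_values(ascending=False).head(n)
```

`similar_items("A", n=1)` then returns the item whose rating pattern is closest to that of "A".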

## Data Preparation
### Overview:
Data from MovieLens: MovieLens is a non-commercial, personalized movie recommendation website
https://grouplens.org/datasets/movielens/

size:(105339, 7)
datatypes:int, float, timestamp

### Data Ingestion

In [3]:
# Importing packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from scipy.sparse.linalg import svds
from surprise import Dataset
from surprise import Reader
from surprise import SVD
from surprise.model_selection import train_test_split
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
import random

In [6]:
# Loading datasets
movies_df = pd.read_csv('./Dataset_original/movies.csv')
ratings_df = pd.read_csv('./Dataset_original/ratings.csv')

In [14]:
# Displaying movies dataset
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
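
The `genres` column stores several genres in one pipe-separated string, and the release year is embedded in `title`. A possible way to unpack both is sketched below on two rows copied from the table above. The `year` column and the regular expression are assumptions added for illustration.

```python
import pandas as pd

# Two rows copied from the movies table.
movies = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance"],
})

# One 0/1 indicator column per genre.
genre_flags = movies["genres"].str.get_dummies(sep="|")

# Release year from a trailing "(YYYY)"; titles without one become NaN.
movies["year"] = (movies["title"]
                  .str.extract(r"\((\d{4})\)\s*$", expand=False)
                  .astype(float))
```

The indicator columns are a convenient starting point for per-genre averages or for content-based features.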


In [15]:
# Checking shape of dataset
movies_df.shape

(10329, 3)

In [16]:
# Checking datatypes
print(movies_df.dtypes)

movieId     int64
title      object
genres     object
dtype: object


In [17]:
# Describing dataset
movies_df.describe()

Unnamed: 0,movieId
count,10329.0
mean,31924.282893
std,37734.741149
min,1.0
25%,3240.0
50%,7088.0
75%,59900.0
max,149532.0


In [18]:
# Displaying dataset
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523


In [19]:
# Checking shape of dataset
ratings_df.shape

(105339, 4)

In [20]:
# Checking datatypes
print(ratings_df.dtypes)

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object


In [21]:
# Describing dataset
ratings_df.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,105339.0,105339.0,105339.0,105339.0
mean,364.924539,13381.312477,3.51685,1130424000.0
std,197.486905,26170.456869,1.044872,180266000.0
min,1.0,1.0,0.5,828565000.0
25%,192.0,1073.0,3.0,971100800.0
50%,383.0,2497.0,3.5,1115154000.0
75%,557.0,5991.0,4.0,1275496000.0
max,668.0,149532.0,5.0,1452405000.0


In [22]:
# Converting the Unix timestamps (seconds since the epoch) to datetime
ratings_df["timestamp"] = pd.to_datetime(ratings_df["timestamp"], unit="s")
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,2008-08-05 00:56:33
1,1,24,1.5,2008-08-05 00:23:27
2,1,32,4.0,2008-08-05 00:30:46
3,1,47,4.0,2008-08-05 00:35:56
4,1,50,4.0,2008-08-05 00:35:23


In [23]:
# Merging datasets on movieId
all_movies = pd.merge(left=movies_df, right=ratings_df, how='left', on='movieId')
print('Total dataset: {}'.format(all_movies.shape[0]))
all_movies.head()

Total dataset: 105343


Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,2.0,5.0,1997-03-22 16:08:15
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,2011-04-22 19:37:19
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8.0,5.0,1997-03-17 15:02:13
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11.0,4.0,1996-12-17 09:43:30
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,14.0,4.0,1996-12-28 09:44:46


In [24]:
# Checking shape of dataset
all_movies.shape

(105343, 6)
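
The merged table has 105343 rows although there are only 105339 ratings. With a left merge, movies that have no rating are kept as rows with empty rating columns. One way to make these rows visible is the `indicator` option of `pd.merge`, sketched here on hypothetical toy frames:

```python
import pandas as pd

# Hypothetical toy data: movie 3 has no rating.
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({"userId": [10, 11, 10],
                        "movieId": [1, 1, 2],
                        "rating": [4.0, 5.0, 3.0]})

# indicator=True adds a "_merge" column naming the source of each row.
merged = pd.merge(movies, ratings, how="left", on="movieId", indicator=True)

# Rows that exist only in the left table are movies without any rating.
unrated = merged.loc[merged["_merge"] == "left_only", ["movieId", "title"]]
```

Applied to `movies_df` and `ratings_df`, the same filter would list the movies behind the extra rows.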

In [25]:
# Checking datatypes
print(all_movies.dtypes)

movieId               int64
title                object
genres               object
userId              float64
rating              float64
timestamp    datetime64[ns]
dtype: object


In [26]:
# Describing dataset
all_movies.describe()

Unnamed: 0,movieId,userId,rating
count,105343.0,105339.0,105339.0
mean,13382.696373,364.924539,3.51685
std,26172.698128,197.486905,1.044872
min,1.0,1.0,0.5
25%,1073.0,192.0,3.0
50%,2497.0,383.0,3.5
75%,5991.0,557.0,4.0
max,149532.0,668.0,5.0


### Data Wrangling and Cleaning

In [28]:
# Checking NULL values per row
amount_of_null_values_per_row = all_movies.isnull().sum(axis=1)
pd.Series(amount_of_null_values_per_row).value_counts()

0    105339
3         4
dtype: int64

In [29]:
# Checking NULL values per variable
missing = all_movies.isnull().sum().sort_values(ascending=False)
missing

timestamp    4
rating       4
userId       4
genres       0
title        0
movieId      0
dtype: int64

In [30]:
# Checking NULL values per column in relation to all values of a column
def missing_values_table(df):
    # Count and percentage of missing values for each column of df
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: "Missing Values", 1: "% of Total Values"})
    return mis_val_table_ren_columns
missing_values_table(all_movies)

Unnamed: 0,Missing Values,% of Total Values
movieId,0,0.0
title,0,0.0
genres,0,0.0
userId,4,0.003797
rating,4,0.003797
timestamp,4,0.003797


In [31]:
# Dropping NULL values
all_movies.dropna(inplace=True) 
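
The left merge turned `userId` into a float column because of the missing values. After the incomplete rows are dropped, the integer type can be restored. A minimal sketch on a hypothetical toy frame (the `subset` argument and the cast are optional additions, not part of the cell above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the last row is a movie without ratings.
merged = pd.DataFrame({
    "movieId": [1, 1, 2],
    "userId": [2.0, 5.0, np.nan],
    "rating": [5.0, 4.0, np.nan],
})

# Drop rows without a rating, then cast userId back to integers.
cleaned = merged.dropna(subset=["userId", "rating"]).copy()
cleaned["userId"] = cleaned["userId"].astype("int64")
```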

### Data Storage

In [32]:
# Saving the cleaned data as csv
all_movies.to_csv('./Dataset_cleaned/Dataset_cleaned.csv')
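
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the difference using an in-memory buffer instead of a file. The toy frame is hypothetical, and whether the index should be kept is a design choice.

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the cleaned data.
cleaned = pd.DataFrame({
    "movieId": [1, 1],
    "rating": [5.0, 4.0],
    "timestamp": pd.to_datetime([859046895, 1303501039], unit="s"),
})

# Default: the index is written and returns as "Unnamed: 0".
with_index = pd.read_csv(io.StringIO(cleaned.to_csv()))

# index=False skips the index; parse_dates restores the datetime column.
without_index = pd.read_csv(io.StringIO(cleaned.to_csv(index=False)),
                            parse_dates=["timestamp"])
```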

## Data Analysis sub project
### Overview
### Data Exploration and Visualization
### Model Training and Evaluation
## Conclusion
* Summarize your data analysis result.
* State your conclusion of your hypothesis testing.
* Interpret your findings in terms of the human-understandable question you try to answer.
* What are the next steps?

## Data Analysis main project
### Overview
### Data Exploration and Visualization
### Model Training and Evaluation
## Conclusion
* Summarize your data analysis result.
* State your conclusion of your hypothesis testing.
* Interpret your findings in terms of the human-understandable question you try to answer.
* What are the next steps?