<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">TABLE OF CONTENTS</p>

* [1. IMPORTING LIBRARIES](#1)
* [2. LOADING DATA](#2)    
* [3. DATA PIPELINE](#3) (Malleable section)
* [4. EXPLORATORY DATA ANALYSIS](#4)     
* [5. MODELING](#5)
* [6. EVALUATION](#6)
* [7. DISCUSSION](#7)  
* [8. DEPLOYMENT AND ENSEMBLES](#8)
* [9. END](#9)

<a id="1"></a>
# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">IMPORTING LIBRARIES</p>

In [None]:
# Parameters
path_root = "/home/magody/programming/python/data_science/"
path_output = f"{path_root}output/"
path_data = f"{path_root}data/movies/"

In [None]:
# For Basic Operations
import numpy as np
import pandas as pd
from collections import defaultdict

In [None]:
# !pip install scikit-surprise
# Surprise: data loading (Dataset, Reader), the KNN algorithm and cross-validation
from surprise import Dataset, Reader
from surprise import KNNWithMeans
from surprise.model_selection import cross_validate

<a id="2"></a>
# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">LOADING DATA</p>

## Description

## Load

In [None]:
movies: pd.DataFrame = pd.read_csv(f"{path_data}movies.csv")
# Let's also load the ratings dataset
ratings: pd.DataFrame = pd.read_csv(f"{path_data}ratings.csv")
# Drop the timestamp column: only user, item and rating are needed here
ratings = ratings.drop(['timestamp'], axis=1)

file_path = f"{path_data}ratings_modified.csv"

# Write the ratings to a CSV file for Dataset.load_from_file
# (Dataset.load_from_df could read the DataFrame directly instead).
# No header row is written, otherwise the Reader would need skip_lines=1.
# No index is written, because it would add an extra leading column to every line.
ratings.to_csv(file_path,
               header=None,
               index=False)
# The file has no header row, so read it back with header=None
x = pd.read_csv(file_path, header=None)
x.head()


In [None]:
# Describe the file layout: one "user,item,rating" triple per line.
# rating_scale should match the actual range of the ratings; check the
# minimum and maximum of the rating column if unsure.
reader = Reader(line_format='user item rating', sep=',', rating_scale=(1, 5))

# Load the file into a Surprise Dataset. The result is a Surprise object,
# so it cannot be inspected like a DataFrame.
data = Dataset.load_from_file(file_path, reader=reader)

# Build a training set from all available ratings
train = data.build_full_trainset()

# Number of distinct users and items in the training set
print('Number of users in the Database :', train.n_users)
print('Number of items in the Database :', train.n_items)

## Eager exploration

In [None]:
print(movies.shape, ratings.shape)
movies.head()

In [None]:
ratings.head()

<a id="3"></a>
# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">DATA PIPELINE</p>
- Special section: it is used both before and after the following sections, so it sits outside the usual sequential flow.
- Malleable section.
- Here we define a pipeline for cleaning, preprocessing, dimensionality reduction, feature engineering, etc. It can be modified at any time to serve the later steps.
- Typically, the insights gained during EDA guide how this part is written.
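
As a rough illustration of the idea above, a pipeline can be written as a list of small DataFrame-to-DataFrame functions applied in order. The step names (`drop_missing`, `drop_duplicate_ratings`), the helper `run_pipeline`, and the column names `userId`, `movieId` and `rating` are assumptions made for this sketch, not steps taken from this notebook.

```python
import pandas as pd

def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: remove rows with any missing value."""
    return df.dropna()

def drop_duplicate_ratings(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step: keep the last rating per (userId, movieId) pair."""
    return df.drop_duplicates(subset=["userId", "movieId"], keep="last")

def run_pipeline(df: pd.DataFrame, steps) -> pd.DataFrame:
    """Apply each step in order; every step takes and returns a DataFrame."""
    for step in steps:
        df = step(df)
    return df.reset_index(drop=True)

# Steps can be added, removed, or reordered as EDA produces new insights.
pipeline_steps = [drop_missing, drop_duplicate_ratings]

# Toy data (assumed column names) with one missing and one duplicated rating
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [10, 10, 10, 20],
    "rating": [3.0, 4.0, None, 5.0],
})
clean = run_pipeline(toy, pipeline_steps)
print(clean)
```

Keeping each step as a separate function makes it cheap to revisit this section after EDA without touching the modeling cells.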

<a id="4"></a>
# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">EXPLORATORY DATA ANALYSIS</p>

## Exploration and understanding

## Visualization of data prepared for consumption

## Pivoting

## Correlation

## Dimensionality reduction

### PCA
- Principal component analysis (PCA) reduces the dimensionality of a dataset by projecting it onto a few directions of maximal variance, which can make the data easier to interpret while minimizing information loss.
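
A minimal sketch of PCA through the singular value decomposition, using only NumPy. The function name `pca` and the synthetic toy data are assumptions for illustration; nothing here is applied to the movie ratings.

```python
import numpy as np

def pca(X: np.ndarray, n_components: int):
    """Project X (samples x features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered features
    # SVD of the centered data: the rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    projected = X_centered @ components.T
    # Fraction of the total variance captured by each component
    explained = (S ** 2) / np.sum(S ** 2)
    return projected, components, explained[:n_components]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Toy data: two strongly correlated features plus a little noise
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=200)])
projected, components, explained = pca(X, n_components=1)
print(projected.shape, explained)
```

Because the two toy features are almost perfectly correlated, a single component should capture nearly all of the variance.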

## Insights



<a id="5"></a>
# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">MODELING</p>

## User based

In [None]:
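# User-based collaborative filtering with Pearson similarity between users
my_sim_option = {'name':'pearson', 'user_based':True}

# KNN with mean-centered ratings: use up to k = 15 neighbors per prediction,
# and require at least min_k = 5 of them
algo = KNNWithMeans(k = 15, min_k = 5, 
    sim_options = my_sim_option, verbose = True
    )

# Train the model on the full training set
algo.fit(train)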
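
To make the idea behind the cell above concrete, here is a simplified pure-Python sketch of user-based filtering with Pearson similarity and mean-centered ratings. It is an illustration only, not Surprise's implementation: the helpers `pearson` and `predict`, the toy users, and the default `min_k=1` (chosen because the toy data is tiny) are all assumptions.

```python
from math import sqrt

def pearson(ratings_u: dict, ratings_v: dict) -> float:
    """Pearson correlation over the items both users rated (0.0 if undefined)."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    xs = [ratings_u[i] for i in common]
    ys = [ratings_v[i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def predict(user, item, ratings, k=15, min_k=1):
    """Mean-centered weighted average over the k most similar raters of `item`."""
    mean = {u: sum(r.values()) / len(r) for u, r in ratings.items()}
    neighbors = [
        (pearson(ratings[user], r), v)
        for v, r in ratings.items()
        if v != user and item in r
    ]
    # Keep only positively correlated neighbors, most similar first
    top = sorted((n for n in neighbors if n[0] > 0), reverse=True)[:k]
    if len(top) < min_k:
        return mean[user]  # too few neighbors: fall back to the user's mean
    num = sum(s * (ratings[v][item] - mean[v]) for s, v in top)
    den = sum(s for s, _ in top)
    return mean[user] + num / den

# Hypothetical toy ratings: user -> {item: rating}
toy_ratings = {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 2, "m3": 3, "m4": 4},
    "carol": {"m1": 1, "m2": 5, "m4": 2},
}
print(predict("alice", "m4", toy_ratings))
```

In this toy example only bob is positively correlated with alice, so the estimate is alice's mean rating shifted by how far bob's rating of `m4` sits above his own mean.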

<a id="6"></a>
# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">EVALUATION</p>

In [None]:
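# 5-fold cross-validation: report the mean RMSE over the test folds
results = cross_validate(algo = algo, 
                         data = data, 
                         measures=['RMSE'], 
                         cv = 5, 
                         return_train_measures=True)
    
print(results['test_rmse'].mean())

# Cross-validation fits the algorithm on each fold's training split,
# so refit on the full training set before using algo for predictions
algo.fit(train)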
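
For reference, RMSE is the square root of the mean squared difference between true and estimated ratings. The sketch below computes it by hand on made-up pairs; the function name `rmse` and the toy values are assumptions for illustration.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("rmse needs at least one (true, estimate) pair")
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

# Hypothetical (true, estimated) ratings
toy_pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0)]
print(rmse(toy_pairs))
```

Lower values are better, and the value is expressed in the same units as the ratings themselves.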

<a id="7"></a>
# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">DISCUSSION</p>

## Patterns study

## Profiling

## Conclusions


<a id="8"></a>
# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">DEPLOYMENT AND ENSEMBLES</p>

In [None]:
# Map each movie id (as a string, matching Surprise's raw item ids) to its title
movie_id_to_title_map = {}

for m_id , title in zip(movies['movieId'].values , movies['title'].values):
    movie_id_to_title_map[str(m_id)] = title

In [None]:
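# Real-time prediction

# How would user id 1 rate item id 31? Raw ids are strings because the data was read from a file.
val = algo.predict(uid = '1', iid = '31')
print(val)
# A Prediction is a named tuple: iid is the raw item id, est the estimated rating
print(movie_id_to_title_map[val.iid] , val.est)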

In [None]:
def get_top_n(predictions, n=10):
    """Map each user id to their n highest-estimated (item id, estimate) pairs."""
    top_n = defaultdict(list)
    # Group the estimated ratings by user
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Sort each user's predictions by estimate and keep the n best
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

# Print every movie a user has already rated, together with the rating
def PreviousMoviedUserWatched(user_df , user_id , item_map):
    # Columns are accessed by position: 0 = user id, 1 = item id, 2 = rating
    user_df = user_df[user_df.iloc[: , 0] == user_id]
    for movie , rating in zip(user_df.iloc[:,1].values , user_df.iloc[:,2].values):
        print(item_map[str(movie)] , rating)

# Print the top-n recommended movies for a user with their estimated ratings
def UserPredictions(user_id , top_n , item_map):
    print("Predictions for User Id : " , user_id)
    # user_id must be the raw id used by Surprise (a string here)
    user_ratings = top_n[user_id]
    for item_id , rating in user_ratings :
        print(item_map[item_id] , " : " , rating)

In [None]:
# algo.test expects a list of (user, item, rating) tuples, not a Trainset,
# so build a test set from the training data first

testdata = train.build_anti_testset() # every (user, item) pair with no rating in the training set
predictions = algo.test(testdata)
top_n = get_top_n(predictions, n = 10)
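
The anti-test set used above can be pictured as follows: for every user and item seen in training, keep the pairs that have no rating and attach a placeholder rating. The sketch is a plain-Python illustration of that idea, not Surprise's code; the standalone function, the toy triples, and the use of the global mean as the fill value (Surprise's default, as far as I know) are assumptions.

```python
def build_anti_testset(ratings, fill):
    """Return (user, item, fill) for every known user/item pair with no rating."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    rated = {(u, i) for u, i, _ in ratings}
    return [(u, i, fill) for u in users for i in items if (u, i) not in rated]

# Hypothetical (user, item, rating) triples
toy = [("1", "10", 4.0), ("1", "20", 3.0), ("2", "10", 5.0)]
global_mean = sum(r for _, _, r in toy) / len(toy)
anti = build_anti_testset(toy, fill=global_mean)
print(anti)
```

The result has one entry per unrated pair, that is, users times items minus the number of known ratings, so on a real dataset it can be very large and slow to score.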

In [None]:
PreviousMoviedUserWatched(ratings , 1 , movie_id_to_title_map)

In [None]:
UserPredictions('1' , top_n , movie_id_to_title_map)

<a id="9"></a>
# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">END</p>

[Return to table of contents](#top)