# Movie recommendation system
## Recommendation system based on movie similarity in genres and rating, using unsupervised k-nearest neighbours (KNN) with scikit-learn, pandas and NumPy
#### Author: Julia Bunescu
---

**Table of contents**<a id='toc0_'></a>    
- [Section 0: General Functions and Libraries](#toc1_1_1_)    
    - [Section 1](#toc1_1_2_)    
    - [Section 2](#toc1_1_3_)    
    - [Section 3](#toc1_1_4_)    
    - [Section 4](#toc1_1_5_)    
- [Application](#toc1_2_)    
  - [Section 0: Functions and Libraries](#toc1_2_1_)    
  - [Section 1](#toc1_2_2_)    


## <a id='toc1_1_1_'></a>[Section 0: General Functions and Libraries](#toc0_)

Please run these functions before running any of the sections below.

In [1]:
# install any missing packages (sklearn is published on PyPI as scikit-learn)
%pip install scikit-learn
%pip install numpy
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [3]:
# importing libraries
from sklearn.neighbors import NearestNeighbors
import pandas as pd
import numpy as np
from myfunctions import read_from_global_csv, add_to_global_csv, get_csv_data

In [9]:
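# remove the listed columns that contain only 0 values
def remove_zero(dataframe, columns=None):
    # work on a copy so the caller's dataframe is left untouched
    new_data = dataframe.copy()

    # columns defaults to None to avoid a shared mutable default argument
    for col in (columns or []):
        # `not` is used instead of `~`, which is truthy for a plain Python bool
        if not new_data[col].any():
            new_data.drop(columns=col, inplace=True)

    return new_data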
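
As an illustration of what this helper is for, here is a minimal sketch on a toy dataframe. The function name `drop_all_zero_columns` and the sample values are assumptions made for this example only; they are not part of the notebook.

```python
import pandas as pd


def drop_all_zero_columns(frame, columns):
    """Return a new frame without the listed columns that contain only zeros."""
    empty = [col for col in columns if not frame[col].any()]
    return frame.drop(columns=empty)


# toy data (hypothetical): "War" is 0 for every row, "Drama" is not
toy = pd.DataFrame({
    "primaryTitle": ["A", "B"],
    "Drama": [1, 0],
    "War": [0, 0],
})

trimmed = drop_all_zero_columns(toy, ["Drama", "War"])
print(list(trimmed.columns))  # ['primaryTitle', 'Drama']
```

Only the columns passed in the list are checked, so text columns such as `primaryTitle` are never dropped.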

### <a id='toc1_1_2_'></a>[Section 1](#toc0_)

Obtain the data from the IMDb datasets (https://www.imdb.com/interfaces/). The datasets used are *title.basics.tsv.gz* and *title.ratings.tsv.gz*, which the code below reads as `../data/general_data.tsv` and `../data/rating_data.tsv`.
Columns used:
- *tconst*: identifier and merge key, later replaced by an auto-incrementing pandas index
- *primaryTitle*: the movie name
- *genres*: later split into multiple columns, one per genre
- *averageRating*: the rating used, together with the genres, to measure similarity

Keep only titles of type *movie* and remove movies with no genre.

Resulting data frame: *movie_data*

**The resulting CSV file can be found here: https://drive.google.com/file/d/1BasDf5kGMpARja5sO2tkgxqW_yalquPk/view?usp=sharing . To use the app from the Application section, download this file and place it in your data folder.**

In [4]:
# make a pandas dataframe from movies and their rating from IMDB
movie_general_data = pd.read_table('../data/general_data.tsv', delimiter='\t', usecols=['tconst','titleType', 'primaryTitle','genres'], dtype={'tconst':'string','titleType':'string', 'primaryTitle':'string', 'genres':'string'})
movie_rating_data = pd.read_table('../data/rating_data.tsv', delimiter='\t', usecols=['tconst','averageRating'], dtype={'tconst':'string','averageRating':'float'})
movie_data = pd.merge(movie_general_data, movie_rating_data, on='tconst')

# get only movie type pictures
options = ['movie']
movie_data = movie_data[movie_data['titleType'].isin(options)]

# clean data
# remove the now redundant titleType column
movie_data.pop('titleType')

# remove entries with no genre (IMDb marks missing values as \N)
null_values = ['\\N']
movie_data = movie_data[~movie_data['genres'].isin(null_values)]

# replace the index with ids running from 1 to the number of remaining movies
movie_data.index = range(1, len(movie_data) + 1)
movie_data.index.name = 'id'

# tconst is no longer needed after the merge
movie_data.pop('tconst')

#preview pandas dataframe 
display(movie_data.head(5))

| id | primaryTitle | genres | averageRating |
|---|---|---|---|
| 1 | Miss Jerry | Romance | 5.3 |
| 2 | The Corbett-Fitzsimmons Fight | Documentary,News,Sport | 5.3 |
| 3 | The Story of the Kelly Gang | Action,Adventure,Biography | 6.0 |
| 4 | The Prodigal Son | Drama | 4.4 |
| 5 | Robbery Under Arms | Drama | 4.3 |
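
The effect of the merge and the two filters can be checked without downloading the IMDb files. The sketch below uses a few made-up rows (the titles and `tconst` values are hypothetical) and relies on `pd.merge` performing an inner join by default, so titles without a rating are dropped as well.

```python
import pandas as pd

# made-up stand-ins for the two IMDb tables
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "titleType": ["movie", "short", "movie", "movie"],
    "primaryTitle": ["Alpha", "Beta", "Gamma", "Delta"],
    "genres": ["Drama", "Comedy", "\\N", "Action,Drama"],
})
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averageRating": [5.3, 6.1, 7.0],
})

# inner join: "tt4" has no rating, so it does not survive the merge
merged = pd.merge(basics, ratings, on="tconst")

# keep movies only and drop rows whose genre is the IMDb null marker
is_movie = merged["titleType"] == "movie"
has_genre = merged["genres"] != "\\N"
movies = merged[is_movie & has_genre].drop(columns=["tconst", "titleType"])

# number the remaining rows from 1, as the notebook does
movies = movies.reset_index(drop=True)
movies.index = movies.index + 1
movies.index.name = "id"

print(movies)
```

Only "Alpha" remains: "Beta" is a short, "Gamma" has no genre, and "Delta" has no rating.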


### <a id='toc1_1_3_'></a>[Section 2](#toc0_)

To use KNN, the dataframe needs to be modified: each genre becomes its own column holding either a 0 or a 1. 
The list of all movie genres is also saved as an entry in the global CSV file so that it can be read back later.

In [5]:
# get all unique genres in a list
genres = movie_data.genres.unique().tolist()

split_genres_list = [item.split(',') for item in genres]

flat_genres_list = [item for l in split_genres_list for item in l]

unique_geners_set = set(flat_genres_list)
unique_geners_list = list(unique_geners_set)


add_to_global_csv("unique_geners_list", unique_geners_list)


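# add a 0/1 column to the dataset for each genre
# whole genre names are compared: a substring test would also flag 'Musical' movies as 'Music'
genre_lists = movie_data['genres'].str.split(',')
for genre in unique_geners_list:
    movie_data[genre] = genre_lists.apply(lambda movie_genres: genre in movie_genres).astype(int)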

# remove genres column
movie_data.pop('genres')

# write the new data into a csv file in the data folder
movie_data.to_csv('../data/modified_movie_data.csv') 

# preview pandas dataframe
display(movie_data.head(5))

| id | primaryTitle | averageRating | Horror | War | Western | Game-Show | Fantasy | Adventure | Short | History | ... | Crime | Biography | Mystery | Musical | Music | Reality-TV | Family | Film-Noir | Romance | Action |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Miss Jerry | 5.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | The Corbett-Fitzsimmons Fight | 5.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | The Story of the Kelly Gang | 6.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | The Prodigal Son | 4.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | Robbery Under Arms | 4.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
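
One detail worth keeping in mind when encoding the genres: a plain substring test matches `Music` inside `Musical`, so the genre string has to be split on commas before comparing. As a possible alternative to the loop above, pandas offers `Series.str.get_dummies`, which splits and encodes in one call. The sample strings below are made up for illustration.

```python
import pandas as pd

# hypothetical genre strings in the same comma-separated format as IMDb
genres = pd.Series(["Romance", "Documentary,News,Sport", "Musical", "Music,Drama"])

# one 0/1 column per whole genre name, split on commas
encoded = genres.str.get_dummies(sep=",")
print(encoded)

# a substring test wrongly treats the "Musical" row as "Music"
substring_match = genres.str.contains("Music")
print(substring_match.tolist())   # [False, False, True, True]
print(encoded["Music"].tolist())  # [0, 0, 0, 1]
```

`get_dummies` returns its columns in alphabetical order, so the column order differs from the one produced by iterating over a set.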


### <a id='toc1_1_4_'></a>[Section 3](#toc0_)

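Select the training and testing data and fit the model. The model is fitted on the whole dataset, and a single randomly sampled movie serves as the test query.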

In [283]:
# get relevant columns
columns = ['averageRating'] + unique_geners_list

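# use one randomly sampled movie as the test query
test_data = movie_data.sample()[columns]
# the model is trained on every movie, including the sampled one
train_data = movie_data[columns]

# create a NearestNeighbors model that returns the 5 nearest neighbours
k_neighbours = NearestNeighbors(n_neighbors=5)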

# fitting the model 
k_neighbours.fit(train_data)
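
To make the notion of "nearest" concrete, here is a small sketch with made-up feature rows. `NearestNeighbors` uses Euclidean distance by default, so the rating and the 0/1 genre flags are compared on their raw scales: swapping one genre for another costs a distance of about 1.41, roughly the same as a rating gap of 1.4 points.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# columns: averageRating, Drama, Comedy (toy values, not taken from the IMDb data)
features = np.array([
    [5.3, 1, 0],  # position 0: the query movie
    [5.3, 1, 0],  # same rating and genre, distance 0
    [5.4, 1, 0],  # same genre, rating differs by 0.1
    [5.3, 0, 1],  # same rating, different genre, distance sqrt(2)
    [9.0, 1, 0],  # same genre, very different rating
])

model = NearestNeighbors(n_neighbors=3)
model.fit(features)

# kneighbors returns two arrays: distances and positional indices into the fitted data
distances, positions = model.kneighbors(features[[0]])
print(distances[0])
print(positions[0])
```

The three nearest rows are positions 0, 1 and 2. Positions 0 and 1 are tied at distance 0, so their relative order is not guaranteed.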

### <a id='toc1_1_5_'></a>[Section 4](#toc0_)

Build a new DataFrame that combines the test movie (the request) with the resulting recommendations.

In [288]:
# get the positions of the nearest neighbours (kneighbors returns distances and positional indices)
neighbours_ids = k_neighbours.kneighbors(test_data)[1][0]

# get the data of the test movie and add it to the results DataFrame
test_data_row  = movie_data.loc[test_data.index]

relevant_test_results = remove_zero(test_data_row, unique_geners_list)
relevant_test_results['Type'] = 'current'

# get the recommendations
rec_data_row = movie_data.iloc[neighbours_ids]

relevant_rec_results = remove_zero(rec_data_row, unique_geners_list)
relevant_rec_results['Type'] = 'recommendation'

# make a final results DataFrame
results = pd.concat([relevant_test_results,relevant_rec_results])

display(results)
    

| id | primaryTitle | averageRating | Documentary | Type |
|---|---|---|---|---|
| 99157 | Great White Death | 5.6 | 1 | current |
| 49848 | Dirigenterna | 5.6 | 1 | recommendation |
| 63884 | The Volcano Man | 5.6 | 1 | recommendation |
| 47895 | Die Kümmeltürkin geht | 5.6 | 1 | recommendation |
| 28511 | España insólita | 5.6 | 1 | recommendation |
| 44392 | Too Early/Too Late | 5.6 | 1 | recommendation |
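
Note that `kneighbors` returns positions in the fitted data, not the `id` labels of the dataframe. Because the ids start at 1, position `p` corresponds to id `p + 1`, which is why the rows are fetched with `iloc` here. A short sketch with a hypothetical three-row frame shows the two equivalent lookups:

```python
import numpy as np
import pandas as pd

# hypothetical frame with a 1-based "id" index, like movie_data
movies = pd.DataFrame(
    {"primaryTitle": ["A", "B", "C"]},
    index=pd.RangeIndex(1, 4, name="id"),
)

# positions as a fitted model would return them (0-based)
positions = np.array([2, 0])

by_position = movies.iloc[positions]    # positional lookup
by_label = movies.loc[positions + 1]    # label lookup after shifting by one

print(by_position)
```

Both lookups return the rows with ids 3 and 1, in that order.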


## <a id='toc1_2_'></a>[Application](#toc0_)
**A small application where the user can choose a movie and receive recommendations.**

### <a id='toc1_2_1_'></a>[Section 0: Functions and Libraries](#toc0_)

Run the following code cells before running the sections below. `get_recom` also relies on `remove_zero` from the general Section 0.

In [5]:
# importing libraries
from sklearn.neighbors import NearestNeighbors
import pandas as pd
import numpy as np
from myfunctions import read_from_global_csv, get_csv_data

In [39]:
# build and fit the model on a dataset
def prepare_model(movie_dataset):
    # get the genres list global variable
    unique_geners_list = read_from_global_csv('unique_geners_list', list)

    columns = ['averageRating'] + unique_geners_list

    # get training data by using only the numeric columns
    train_data = movie_dataset[columns]

    # request 6 neighbours: the requested movie is usually its own nearest
    # neighbour, which leaves 5 recommendations once it is removed
    k_neighbours = NearestNeighbors(n_neighbors=6)

    # fitting the model
    k_neighbours.fit(train_data)
    return (k_neighbours, columns)

In [41]:
# get the recommendations for a specific entry by id, from a dataset
def get_recom(movie_id, dataset, k_neighbours, columns):
    # get the feature row of the requested movie using the id from the input
    req_data = dataset.loc[[movie_id]][columns]

    # get the positions of the nearest neighbours
    neighbours_ids = k_neighbours.kneighbors(req_data)[1][0]

    # remove the requested movie itself (ids start at 1, positions at 0)
    neighbours_ids = neighbours_ids[neighbours_ids != (movie_id - 1)]

    # with tied distances the movie may be missing from its own neighbours,
    # so keep at most 5 recommendations
    neighbours_ids = neighbours_ids[:5]

    # get the data of the requested movie and add it to the results DataFrame
    req_data_row = dataset.loc[[movie_id]].copy()
    req_data_row['Type'] = 'requested'

    # get the recommendations (convert positions back to ids)
    rec_data_row = dataset.loc[neighbours_ids + 1].copy()
    rec_data_row['Type'] = 'recommendation'

    # make a final results DataFrame
    results = pd.concat([req_data_row, rec_data_row])

    # remove the genre columns that only contain 0 values
    genre_list = read_from_global_csv('unique_geners_list', list)
    results = remove_zero(results, genre_list)

    return results
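
When several movies share the same rating and genres, their distance to the requested movie is zero, so the requested movie is not guaranteed to appear in its own neighbour list. The sketch below shows one way to always end up with exactly `k` recommendations: ask for `k + 1` neighbours, drop the query position if present, and keep the first `k`. The helper name `neighbours_without_self` and the toy features are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def neighbours_without_self(model, features, position, k):
    """Return k neighbour positions for the row at `position`, excluding itself."""
    query = features[position:position + 1]
    _, found = model.kneighbors(query, n_neighbors=k + 1)
    found = found[0]
    # drop the query row if it was returned, then trim to k entries
    return found[found != position][:k]


# toy features: four identical rows (all tied at distance 0) and one outlier
features = np.array([
    [5.3, 1.0],
    [5.3, 1.0],
    [5.3, 1.0],
    [5.3, 1.0],
    [7.0, 0.0],
])

model = NearestNeighbors().fit(features)
result = neighbours_without_self(model, features, position=0, k=2)
print(result)
```

Which of the tied rows are returned depends on the tie-breaking inside scikit-learn, but the result always has length `k` and never contains the query itself.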

### <a id='toc1_2_2_'></a>[Section 1](#toc0_)

This section ties together the code presented above. The user is prompted to enter an id, which serves as the query for the trained model, and receives 5 recommendations in return.

**If you haven't already, please run both Section 0 parts of this file (general and Application) before running anything below.**


In [45]:
# flags
ready = False # for validity of the input
retry = False # for entering another input
load = False # for preparing the model

movie_dataset = get_csv_data('../data/modified_movie_data.csv' )

max_index = len(movie_dataset)-1

# prompting the user for an id
print('Please choose a number between 1 and {}, or write 0 to exit.'.format(max_index))
choice = input('Welcome to the movie recommendation app! \nPlease choose a number between 1 and {}, or write 0 to exit.'.format(max_index))
print('You said: {}'.format(choice))

# looping based on the user input
while  choice not in [0, '0']:
    # check the validity of the user data
    while not ready:
        if retry:

            # re-prompting the user to enter another id
            print('Please choose a number between 1 and {}, or write 0 to exit.'.format(max_index))
            choice = input('Please choose a number between 1 and {}, or write 0 to exit.'.format(max_index))
            print('You said: {}'.format(choice))

            retry = False
        try:
            choice = int(choice)

        # checking for data type error
        except ValueError:
            print('Wrong data type. Try again:')
            retry = True
        else:
            if choice == 0:
                break

            # checking if the input id is within the limits
            elif choice < 0 or choice > max_index:
                print('Wrong number. Try again:')
                retry = True
            else:
                ready = True 
                retry = False 

    # continuing the process with valid user data: preparing the knn model
    if not load:
        (k_neighbours, columns) = prepare_model(movie_dataset)
        load = True

    # generating the movie recommendations based on the user input and the model
    if ready:
        recom = get_recom(choice, movie_dataset, k_neighbours, columns)

        print('This is your data:')
        display(recom)

        ready = False
        retry = True

# if 0 is the input, exiting   
print('Thanks, bye!')

    

Please choose a number between 1 and 279123, or write 0 to exit.
You said: 1
This is your data:


| id | primaryTitle | averageRating | Romance | Type |
|---|---|---|---|---|
| 1 | Miss Jerry | 5.3 | 1 | requested |
| 48217 | Salon krasoty | 5.3 | 1 | recommendation |
| 21084 | Ännchen von Tharau | 5.3 | 1 | recommendation |
| 2715 | The Third Degree | 5.3 | 1 | recommendation |
| 23602 | El amor empieza en sábado | 5.3 | 1 | recommendation |
| 3547 | Hearts in Exile | 5.3 | 1 | recommendation |


Please choose a number between 1 and 279123, or write 0 to exit.
You said: 0
Thanks, bye!
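
The input checks performed inside the loop (integer conversion, exit on 0, range check) could also be collected in a small pure function, which makes them easy to test without typing into a prompt. This is only a sketch; the name `parse_choice` and its return convention are assumptions, not part of the app above.

```python
def parse_choice(raw, max_index):
    """Return the chosen id as an int (0 means exit), or None if the input is invalid."""
    try:
        value = int(raw)
    except ValueError:
        # not a whole number, e.g. "abc" or "1.5"
        return None
    if 0 <= value <= max_index:
        return value
    # negative or larger than the highest id on offer
    return None


print(parse_choice("1", 10))    # 1
print(parse_choice("0", 10))    # 0
print(parse_choice("abc", 10))  # None
print(parse_choice("11", 10))   # None
```

The loop would then only need to re-prompt while the result is `None` and stop when it is `0`.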
