#What question are we trying to answer?

## Given a movies data set, what are the 5 most similar movies to a movie query?

#Data Source and Contents

We will use a small subset of the data extracted from UCI’s IMDB data set.

Data File Name: movies_recommendation_data.csv
File Location: https://github.com/ArinB/CA05-kNN/raw/master/movies_recommendation_data.csv

The data contains **thirty movies**, including data for each movie across **seven genres** and their
**IMDB ratings**.

The labels column values are all zeroes because we aren’t using this data set for classification or
regression. We can ignore this column. The implementation assumes that all columns contain numerical data.


Additionally, there are relationships among the movies (e.g. shared actors, directors, and themes)
that the KNN algorithm will not account for, simply because the data that captures those
relationships is missing from the data set.

Consequently, when we run the KNN algorithm on our data, *similarity will be based solely on the included genres and the IMDB ratings of the movies*.
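
To make "similarity" concrete, here is the distance this notebook ends up using, assuming the default Euclidean metric (the model summary printed further down reports `metric='minkowski'` with `p=2`). The symbols $q$ (the query movie's feature vector), $x$ (a movie row from the data set), and $d$ are notation introduced here for illustration; index 1 is the IMDB rating and indices 2 to 8 are the seven genre flags.

```latex
d(q, x) = \sqrt{\sum_{i=1}^{8} \left(q_i - x_i\right)^2}
```

A smaller $d$ means a more similar movie. Each genre mismatch adds exactly 1 under the square root, while a rating gap of $r$ points adds $r^2$.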

#Building Recommender System

We are building our own movie recommendation website, which uses our Recommendation Engine at
the back-end. 

In this notebook we build that back-end Recommendation Engine. 

##How things look: 
Imagine a user is navigating our recommendation website and encounters a movie named “The Post”. The user
is not sure whether to watch it, but its genres are intriguing, and the user is curious about other similar movies. The user scrolls down to the “More Like This” section to see what our website will recommend, and the back-end algorithmic gears begin to turn.


##Logic:
Our website sends a request to its back-end for the 5 movies that are most similar to The Post. 
The back-end has a recommendation data set exactly like ours. It begins by creating the row representation (better
known as a feature vector) for The Post, then runs a program similar to the one below to search for
the 5 movies that are most similar to The Post, and finally sends the results back to the user on our
website.
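
The search step described above can be sketched without any library support beyond NumPy. This is only an illustration under stated assumptions: it uses the five rows that `df.head()` prints further down rather than all thirty movies, it assumes Euclidean distance, and `nearest_rows` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np

# First five rows of the data set, as printed by df.head() later in the notebook.
# Columns: IMDB Rating, Biography, Drama, Thriller, Comedy, Crime, Mystery, History
names = ["The Imitation Game", "Ex Machina", "A Beautiful Mind",
         "Good Will Hunting", "Forrest Gump"]
features = np.array([
    [8.0, 1, 1, 1, 0, 0, 0, 0],
    [7.7, 0, 1, 0, 0, 0, 1, 0],
    [8.2, 1, 1, 0, 0, 0, 0, 0],
    [8.3, 0, 1, 0, 0, 0, 0, 0],
    [8.8, 0, 1, 0, 0, 0, 0, 0],
])


def nearest_rows(data, query, k):
    """Hypothetical helper: indices of the k rows closest to query (Euclidean)."""
    distances = np.linalg.norm(data - query, axis=1)
    # A stable sort keeps the lower row index first when two distances are equal.
    return np.argsort(distances, kind="stable")[:k]


the_post = np.array([7.2, 1, 1, 0, 0, 0, 0, 1])
top3 = nearest_rows(features, the_post, k=3)
print([names[i] for i in top3])
# ['A Beautiful Mind', 'The Imitation Game', 'Good Will Hunting']
```

With only five candidate rows the result differs from the full recommendation list, but the mechanics (compute every distance, sort, keep the first k) are the same as a brute-force neighbor search.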


###*Following is the genre information about the movie “The Post”*


IMDB Rating = 7.2, Biography = Yes, Drama = Yes, Thriller = No, Comedy = No, Crime = No,
Mystery = No, History = Yes
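
The Yes/No description above has to become a row of numbers before it can be compared with the data set. A minimal sketch of that encoding follows; `FEATURE_COLUMNS` and `to_feature_vector` are hypothetical names introduced for illustration, and the column order is assumed to match the feature columns shown later in the notebook.

```python
# Feature columns in the order the notebook's X matrix uses them.
FEATURE_COLUMNS = ["IMDB Rating", "Biography", "Drama", "Thriller",
                   "Comedy", "Crime", "Mystery", "History"]


def to_feature_vector(rating, genres):
    """Hypothetical helper: rating first, then 1/0 for each genre column."""
    genre_columns = FEATURE_COLUMNS[1:]
    unknown = set(genres) - set(genre_columns)
    if unknown:
        raise ValueError(f"Unknown genres: {sorted(unknown)}")
    return [rating] + [1 if column in genres else 0 for column in genre_columns]


the_post = to_feature_vector(7.2, {"Biography", "Drama", "History"})
print(the_post)  # [7.2, 1, 1, 0, 0, 0, 0, 1]
```

The resulting list is the same query vector that the code cells below pass to `kneighbors`.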


In [1]:
#import packages 
import pandas as pd
import numpy as np
#classifier implementing the k-nearest neighbors vote
from sklearn.neighbors import  KNeighborsClassifier 
#unsupervised learner for implementing neighbor searches
from sklearn.neighbors import NearestNeighbors 

In [2]:
#read file into data frame
df = pd.read_csv( 'https://github.com/ArinB/CA05-kNN/raw/master/movies_recommendation_data.csv')

In [3]:
#display the top 5 rows of df
df.head()

,Movie ID,Movie Name,IMDB Rating,Biography,Drama,Thriller,Comedy,Crime,Mystery,History,Label
0,58,The Imitation Game,8.0,1,1,1,0,0,0,0,0
1,8,Ex Machina,7.7,0,1,0,0,0,1,0,0
2,46,A Beautiful Mind,8.2,1,1,0,0,0,0,0,0
3,62,Good Will Hunting,8.3,0,1,0,0,0,0,0,0
4,97,Forrest Gump,8.8,0,1,0,0,0,0,0,0


In [4]:
#get the descriptive data of df
df.describe()

,Movie ID,IMDB Rating,Biography,Drama,Thriller,Comedy,Crime,Mystery,History,Label
count,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0
mean,48.133333,7.696667,0.233333,0.6,0.1,0.1,0.133333,0.1,0.1,0.0
std,29.288969,0.666169,0.430183,0.498273,0.305129,0.305129,0.345746,0.305129,0.305129,0.0
min,1.0,5.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,27.75,7.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,48.5,7.75,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,64.25,8.175,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
max,98.0,8.8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0


In [5]:
#get more info about cols
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Movie ID     30 non-null     int64  
 1   Movie Name   30 non-null     object 
 2   IMDB Rating  30 non-null     float64
 3   Biography    30 non-null     int64  
 4   Drama        30 non-null     int64  
 5   Thriller     30 non-null     int64  
 6   Comedy       30 non-null     int64  
 7   Crime        30 non-null     int64  
 8   Mystery      30 non-null     int64  
 9   History      30 non-null     int64  
 10  Label        30 non-null     int64  
dtypes: float64(1), int64(9), object(1)
memory usage: 2.7+ KB


Since this is a pretty clean data frame, we can start to build the recommendation engine.

In [6]:

#define the feature matrix 'X' and the title series 'y'
    #drop 'Label' (all zeroes, as stated previously) plus the non-feature columns
X, y = df.drop(['Movie Name','Label','Movie ID'],axis=1), df['Movie Name']

#display our defined X and y
print('X variable: \n', X[:5])
print('y variable: \n', y[:5])

X variable: 
    IMDB Rating  Biography  Drama  Thriller  Comedy  Crime  Mystery  History
0          8.0          1      1         1       0      0        0        0
1          7.7          0      1         0       0      0        1        0
2          8.2          1      1         0       0      0        0        0
3          8.3          0      1         0       0      0        0        0
4          8.8          0      1         0       0      0        0        0
y variable: 
 0    The Imitation Game
1            Ex Machina
2      A Beautiful Mind
3     Good Will Hunting
4          Forrest Gump
Name: Movie Name, dtype: object


In [7]:

#use brute-force search (compare the query against every row); fine for 30 movies
knn = NearestNeighbors(n_neighbors=5, algorithm='brute') 
#NearestNeighbors is unsupervised, so only the features are needed
knn.fit(X)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [8]:
#IMDB Rating = 7.2, Biography = Yes, Drama = Yes, Thriller = No, Comedy = No, 
    #Crime = No, Mystery = No, History = Yes
#print(knn.kneighbors([[7.2,1,1,0,0,0,0,1]]))
knn.kneighbors([[7.2,1,1,0,0,0,0,1]], 5, return_distance=False)

array([[28, 27, 29, 16,  9]])

In [9]:
#get the row indices of the 5 nearest neighbors
indices = knn.kneighbors([[7.2,1,1,0,0,0,0,1]], 5, return_distance=False)

print('More Like This: ')

#collapse the (1, 5) index array into one dimension: array([28, 27, 29, 16,  9])
neighbor_rows = indices.flatten()
#look up each neighbor's title by row position
for rank, row in enumerate(neighbor_rows, start=1):
    print('{0}:{1}'.format(rank, df['Movie Name'].iloc[row]))


#def recommends(i):
#    title = knn.kneighbors_graph([i]).indices
#    print('Top 5 Recommendations', title)


More Like This: 
1:12 Years a Slave
2:Hacksaw Ridge
3:Queen of Katwe
4:The Wind Rises
5:The Karate Kid
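
The lookup in the cell above can be wrapped in a reusable function. The sketch below is one possible shape for it, not the notebook's own code: `recommend` is a hypothetical helper, and to stay self-contained it fits the model on only the five rows printed by `df.head()`, so it asks for 3 neighbors instead of 5.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Five sample rows (from the df.head() output), not the full 30-movie data set.
names = ["The Imitation Game", "Ex Machina", "A Beautiful Mind",
         "Good Will Hunting", "Forrest Gump"]
features = np.array([
    [8.0, 1, 1, 1, 0, 0, 0, 0],
    [7.7, 0, 1, 0, 0, 0, 1, 0],
    [8.2, 1, 1, 0, 0, 0, 0, 0],
    [8.3, 0, 1, 0, 0, 0, 0, 0],
    [8.8, 0, 1, 0, 0, 0, 0, 0],
])


def recommend(model, titles, query, k=5):
    """Hypothetical helper: titles of the k rows nearest to the query vector."""
    indices = model.kneighbors([query], n_neighbors=k, return_distance=False)
    return [titles[i] for i in indices[0]]


model = NearestNeighbors(algorithm="brute").fit(features)
for rank, title in enumerate(recommend(model, names, [7.2, 1, 1, 0, 0, 0, 0, 1], k=3), start=1):
    print(f"{rank}:{title}")
```

Passing the fitted model and the title list as arguments keeps the function independent of global notebook state, which makes it easier to call from a web back-end.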


In [10]:
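#Another way (maybe): a KNN classifier that treats each movie title as its own class

#create the KNN classifier
knn5 = KNeighborsClassifier(n_neighbors=5)
knn5.fit(X,y)

#input the same criteria; predict returns a single title,
#predict_proba returns each title's share of the 5 neighbor votes
knn5.predict([[7.2,1,1,0,0,0,0,1]])
knn5.predict_proba([[7.2,1,1,0,0,0,0,1]])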

array([[0.2, 0. , 0. , 0.2, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
        0.2, 0. , 0. , 0.2, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
        0.2, 0. , 0. , 0. ]])
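
The five values of 0.2 in this output follow from how a KNN classifier with uniform weights (the scikit-learn default) votes: every title is its own class, so each of the 5 neighbors casts one vote for a different class. A small sketch of that arithmetic, using the neighbor titles this classifier reports in the cells that follow (`vote_shares` is a hypothetical name):

```python
from collections import Counter

# Titles of the 5 nearest neighbors returned by the classifier for "The Post".
neighbor_titles = ["12 Years a Slave", "Hacksaw Ridge", "Queen of Katwe",
                   "The Wind Rises", "A Beautiful Mind"]

# Uniform weighting: each neighbor contributes one vote to its own class.
votes = Counter(neighbor_titles)
vote_shares = {title: count / len(neighbor_titles) for title, count in votes.items()}

for title, share in vote_shares.items():
    print(f"{title}: {share}")
```

Because all five shares are equal, `predict` can return only one of five equally ranked titles, which is why the neighbor indices from `kneighbors` are the more useful output for recommendations.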

In [11]:
knn5.kneighbors([[7.2,1,1,0,0,0,0,1]],return_distance=False)

array([[28, 27, 29, 16,  2]])

In [12]:
#display the recommendation list again 
 
#return the indices of the training data
indices5=knn5.kneighbors([[7.2,1,1,0,0,0,0,1]],return_distance=False)

print('More Like This: ')

#collapse the (1, 5) index array into one dimension: array([28, 27, 29, 16,  2])
neighbor_rows5 = indices5.flatten()
#look up each neighbor's title by row position
for rank, row in enumerate(neighbor_rows5, start=1):
    print('{0}:{1}'.format(rank, df['Movie Name'].iloc[row]))

More Like This: 
1:12 Years a Slave
2:Hacksaw Ridge
3:Queen of Katwe
4:The Wind Rises
5:A Beautiful Mind


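#What recommendations will the user see?

Depending on which of the two models the back-end uses, the user will see one of two result lists for “The Post”. The first four titles are identical; only the fifth differs. Both models use the same Euclidean distance, so the likeliest explanation is that The Karate Kid and A Beautiful Mind are equally (or almost equally) distant from the query, and the brute-force search and the classifier's default `auto` algorithm break that tie differently: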


1. 12 Years a Slave
2. Hacksaw Ridge
3. Queen of Katwe
4. The Wind Rises
5. The Karate Kid

OR

1. 12 Years a Slave
2. Hacksaw Ridge
3. Queen of Katwe
4. The Wind Rises
5. A Beautiful Mind
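
One way to investigate the differing fifth title is to ask `kneighbors` for the distances as well as the indices. The sketch below shows the call on the five rows printed by `df.head()` only, so it is an illustration of the technique rather than a reproduction of the full result. From those rows, A Beautiful Mind (rating 8.2, Biography and Drama) differs from the query by 1.0 rating point and by the History flag, giving a distance of √2 ≈ 1.414.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Five sample rows (from the df.head() output), not the full 30-movie data set.
names = ["The Imitation Game", "Ex Machina", "A Beautiful Mind",
         "Good Will Hunting", "Forrest Gump"]
features = np.array([
    [8.0, 1, 1, 1, 0, 0, 0, 0],
    [7.7, 0, 1, 0, 0, 0, 1, 0],
    [8.2, 1, 1, 0, 0, 0, 0, 0],
    [8.3, 0, 1, 0, 0, 0, 0, 0],
    [8.8, 0, 1, 0, 0, 0, 0, 0],
])

knn = NearestNeighbors(n_neighbors=3, algorithm="brute").fit(features)

# return_distance defaults to True, so the call returns (distances, indices).
distances, indices = knn.kneighbors([[7.2, 1, 1, 0, 0, 0, 0, 1]])
for row, distance in zip(indices[0], distances[0]):
    print(f"{names[row]}: {distance:.3f}")
```

Running the same call on the full data set with `n_neighbors=6` would show whether the fifth and sixth neighbors share a distance, which would confirm or rule out a tie.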

