##1. Import packages and mount Google Drive

In [14]:
#Import packages
import pandas as pd
import numpy as np

#Import viz packages
import matplotlib.pyplot as plt

#Import KNN package
from sklearn.neighbors import KNeighborsClassifier

In [15]:
#Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


##2. Read CSV from URL

In [16]:
#Read csv from URL
movie = pd.read_csv('https://github.com/ArinB/CA05-kNN/raw/master/movies_recommendation_data.csv')

##3. Check Movie Dataset

The dataset contains thirty movies. Each row holds a movie's IMDB rating and a 0/1 indicator for each of seven genres. The implementation assumes that all feature columns contain numerical data.

In [17]:
#Head of dataset
movie.head(10)

Unnamed: 0,Movie ID,Movie Name,IMDB Rating,Biography,Drama,Thriller,Comedy,Crime,Mystery,History,Label
0,58,The Imitation Game,8.0,1,1,1,0,0,0,0,0
1,8,Ex Machina,7.7,0,1,0,0,0,1,0,0
2,46,A Beautiful Mind,8.2,1,1,0,0,0,0,0,0
3,62,Good Will Hunting,8.3,0,1,0,0,0,0,0,0
4,97,Forrest Gump,8.8,0,1,0,0,0,0,0,0
5,98,21,6.8,0,1,0,0,1,0,1,0
6,31,Gifted,7.6,0,1,0,0,0,0,0,0
7,3,Travelling Salesman,5.9,0,1,0,0,0,1,0,0
8,51,Avatar,7.9,0,0,0,0,0,0,0,0
9,47,The Karate Kid,7.2,0,1,0,0,0,0,0,0


In [18]:
#Dataset info
movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Movie ID     30 non-null     int64  
 1   Movie Name   30 non-null     object 
 2   IMDB Rating  30 non-null     float64
 3   Biography    30 non-null     int64  
 4   Drama        30 non-null     int64  
 5   Thriller     30 non-null     int64  
 6   Comedy       30 non-null     int64  
 7   Crime        30 non-null     int64  
 8   Mystery      30 non-null     int64  
 9   History      30 non-null     int64  
 10  Label        30 non-null     int64  
dtypes: float64(1), int64(9), object(1)
memory usage: 2.7+ KB


In [19]:
#Check for null values
movie.isnull().sum()

Movie ID       0
Movie Name     0
IMDB Rating    0
Biography      0
Drama          0
Thriller       0
Comedy         0
Crime          0
Mystery        0
History        0
Label          0
dtype: int64

##4. Data cleansing

The Label column contains only zeros because this dataset is not used for classification or regression, so the column can be dropped.
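
Whether a column really carries no information can be confirmed before dropping it. The sketch below does this in plain Python on three rows copied from the `head()` output; `is_constant` is a hypothetical helper introduced here, not part of the notebook. In the notebook itself, `movie['Label'].nunique()` would report the number of distinct values directly.

```python
# Three rows copied from the head() output (only the columns needed here).
rows = [
    {'Movie Name': 'The Imitation Game', 'Drama': 1, 'Thriller': 1, 'Label': 0},
    {'Movie Name': 'Ex Machina', 'Drama': 1, 'Thriller': 0, 'Label': 0},
    {'Movie Name': 'Avatar', 'Drama': 0, 'Thriller': 0, 'Label': 0},
]

def is_constant(rows, column):
    """Hypothetical helper: True when every row holds the same value in `column`."""
    return len({row[column] for row in rows}) <= 1

print(is_constant(rows, 'Label'))     # True: safe to drop
print(is_constant(rows, 'Thriller'))  # False: the column varies
```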

In [20]:
#Drop the column that we won't use
movie.drop(columns = 'Label', inplace=True)

In [21]:
#Check the columns again
movie.columns

Index(['Movie ID', 'Movie Name', 'IMDB Rating', 'Biography', 'Drama',
       'Thriller', 'Comedy', 'Crime', 'Mystery', 'History'],
      dtype='object')

##5. Separate input and output variables

In [22]:
#Separate input and output variables
X = movie[['IMDB Rating', 'Biography', 'Drama', 'Thriller', 'Comedy', 'Crime', 'Mystery', 'History']]
#Select the names as a 1-D Series, the shape scikit-learn expects for the target
Y = movie['Movie Name']

##6. Building recommendation system

In [23]:
#Construct a nearest neighbors classifier and fit it with X and Y
classifier = KNeighborsClassifier(n_neighbors = 5)
classifier.fit(X, Y)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

The query movie is "The Post", described by the following feature values: IMDB Rating = 7.2, Biography = Yes, Drama = Yes, Thriller = No, Comedy = No, Crime = No, Mystery = No, History = Yes
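
These Yes/No values have to become a numeric row in the same column order as `X` before they can be passed to `kneighbors`. A minimal sketch of that encoding follows; `FEATURE_COLUMNS` and `encode_query` are hypothetical names introduced for illustration, not part of the notebook.

```python
# Column order copied from the X selection in section 5.
FEATURE_COLUMNS = ['IMDB Rating', 'Biography', 'Drama', 'Thriller',
                   'Comedy', 'Crime', 'Mystery', 'History']

def encode_query(rating, genres):
    """Hypothetical helper: return [rating, g1, ..., g7], with 1 for each genre in `genres`."""
    genre_columns = FEATURE_COLUMNS[1:]
    unknown = set(genres) - set(genre_columns)
    if unknown:
        raise ValueError(f'Unknown genres: {sorted(unknown)}')
    return [rating] + [1 if g in genres else 0 for g in genre_columns]

the_post = encode_query(7.2, {'Biography', 'Drama', 'History'})
print(the_post)  # [7.2, 1, 1, 0, 0, 0, 0, 1]
```

Rejecting unknown genre names guards against a misspelled genre silently becoming a row of zeros.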

In [24]:
#Find the indices of the 5 movies nearest to "The Post", using the feature values listed above
movies = classifier.kneighbors([[7.2, 1, 1, 0, 0, 0, 0, 1]], return_distance=False)

Implement this problem using Python scikit-learn and display the answer within the Notebook with proper narrative / comments.

In [25]:
#Look up the names of the recommended movies from the neighbor indices.
#kneighbors returns one row of indices per query, and only one query was passed.
rec = movie.iloc[movies[0]]['Movie Name']

print(f'Recommended movies for people who want to see "The Post":\n{rec}')

Recommended movies for people who want to see "The Post":
28    12 Years a Slave
27       Hacksaw Ridge
29      Queen of Katwe
16      The Wind Rises
2     A Beautiful Mind
Name: Movie Name, dtype: object
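
The neighbor lookup can also be reproduced by hand, which shows what `kneighbors` computes with the settings printed for the classifier (`metric='minkowski'`, `p=2`, i.e. Euclidean distance). The sketch below uses only five rows copied from the `head()` output, so its ranking is illustrative and differs from the notebook's result over all thirty movies; `euclidean` and `nearest_movies` are hypothetical helpers.

```python
import math

# Five rows copied from the head() output: (name, feature row in X's column order).
SAMPLE = [
    ('The Imitation Game', [8.0, 1, 1, 1, 0, 0, 0, 0]),
    ('Ex Machina',         [7.7, 0, 1, 0, 0, 0, 1, 0]),
    ('A Beautiful Mind',   [8.2, 1, 1, 0, 0, 0, 0, 0]),
    ('Forrest Gump',       [8.8, 0, 1, 0, 0, 0, 0, 0]),
    ('Avatar',             [7.9, 0, 0, 0, 0, 0, 0, 0]),
]

def euclidean(a, b):
    """Euclidean distance between two equal-length feature rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_movies(query, rows, k):
    """Hypothetical helper: names of the k rows closest to `query`, nearest first."""
    ranked = sorted(rows, key=lambda row: euclidean(query, row[1]))
    return [name for name, _ in ranked[:k]]

the_post = [7.2, 1, 1, 0, 0, 0, 0, 1]
top3 = nearest_movies(the_post, SAMPLE, k=3)
print(top3)  # ['A Beautiful Mind', 'The Imitation Game', 'Ex Machina']
```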


Final result:

Recommended movies for people who want to see "The Post", nearest first (dataset row index in parentheses):

1. 12 Years a Slave (28)

2. Hacksaw Ridge (27)

3. Queen of Katwe (29)

4. The Wind Rises (16)

5. A Beautiful Mind (2)