#KNN (k-nearest neighbors) is a simple concept. Define a distance between the items in your dataset, then find the K items closest to a given test item. Those neighbors "vote" on some property of the test item, which gives you a prediction for it. As an example, let's look at a movie rating predictor: we will guess a movie's rating from the K movies closest to it in genre and popularity (the code below uses K = 5). We start by loading every rating in the dataset into a pandas DataFrame.
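
Before touching the real data, the idea can be sketched on a toy example. The item names, feature vectors, and ratings below are made up for illustration, and plain Euclidean distance stands in for whatever distance suits your data.

```python
import numpy as np

# Hypothetical toy data: each item has a 2-D feature vector and a known rating.
items = {
    "A": (np.array([0.0, 0.0]), 4.0),
    "B": (np.array([1.0, 0.0]), 3.0),
    "C": (np.array([0.0, 2.0]), 5.0),
    "D": (np.array([5.0, 5.0]), 1.0),
}

def knn_predict(query, items, k):
    """Average the ratings of the k items closest to `query` (Euclidean distance)."""
    by_distance = sorted(items.values(), key=lambda item: np.linalg.norm(item[0] - query))
    nearest = by_distance[:k]
    return sum(rating for _, rating in nearest) / k

# A, B and C are the three closest items to the query point; D is far away.
prediction = knn_predict(np.array([0.0, 0.5]), items, k=3)
print(prediction)
```

The far-away item D never influences the prediction, which is the whole point of restricting the vote to the nearest neighbors.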

In [1]:
import pandas as pd 
import numpy as np
r_cols = ['user id', 'movie_id', 'rating']
ratings = pd.read_csv('C:/Users/Hamsini Sankaran/Desktop/DataScience/DataScience-Python3/ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head()


   user id  movie_id  rating
0        0        50       5
1        0       172       5
2        0       133       1
3      196       242       3
4      186       302       3


#Group everything by movie ID, then compute each movie's total number of ratings (its popularity) and its average rating.

In [2]:
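# String aggregation names give the same result here and avoid the deprecation warning newer pandas versions raise for np.size / np.mean
movieProperties = ratings.groupby('movie_id').agg({'rating': ['size', 'mean']})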
movieProperties.head()

         rating
           size      mean
movie_id
1           452  3.878319
2           131  3.206107
3            90  3.033333
4           209  3.550239
5            86  3.302326
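
To see what the grouping does without the full dataset, here is a minimal sketch on a tiny, made-up ratings table (the `toy` DataFrame and its values are hypothetical).

```python
import pandas as pd

# Hypothetical ratings: movie 1 was rated three times, movie 2 twice.
toy = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 2],
    "rating":   [5, 4, 3, 2, 4],
})

# One row per movie: how many ratings it received and their mean.
summary = toy.groupby("movie_id")["rating"].agg(["size", "mean"])
print(summary)
```

Selecting the `rating` column before aggregating yields flat column names (`size`, `mean`) instead of the two-level header shown in the output above.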


In [3]:
#The raw number of ratings isn't very useful for computing distances between movies, so we create a new DataFrame that contains the min-max normalized number of ratings. A value of 0 means the movie has the fewest ratings in the dataset, and a value of 1 means it is the most-rated (most popular) movie.

In [4]:
movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
movieNormalizedNumRatings.head()

              size
movie_id
1         0.773585
2         0.222985
3         0.152659
4         0.356775
5         0.145798
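
The normalization can be checked by hand. The sketch below reuses the five rating counts printed earlier and adds two assumed extremes: a minimum of 1 rating and a maximum of 584. Those two numbers are inferred from the normalized values above (they are the only pair consistent with them), not read from the dataset.

```python
import pandas as pd

# Counts for movies 1-5 come from the notebook output; the two extremes are inferred.
counts = pd.Series(
    [452, 131, 90, 209, 86, 1, 584],
    index=["movie 1", "movie 2", "movie 3", "movie 4", "movie 5", "assumed min", "assumed max"],
)

# Min-max scaling: subtract the minimum, then divide by the range.
normalized = (counts - counts.min()) / (counts.max() - counts.min())
print(normalized.round(6))
```

For movie 1 this is (452 - 1) / (584 - 1), which matches the 0.773585 shown above.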


In [5]:
#Now let's get the genre information from the u.item file. Each movie has 19 genre fields, one per genre: a value of 0 means the movie is not in that genre and a value of 1 means it is. A movie may belong to more than one genre. Each movie is stored in a Python dictionary called movieDict. Every entry contains the movie name, the array of genre flags, the normalized popularity score, and the movie's average rating.

In [6]:
movieDict = {}
# u.item is pipe-delimited and Latin-1 encoded; naming the encoding avoids decode errors where the platform default is UTF-8
with open('C:/Users/Hamsini Sankaran/Desktop/DataScience/DataScience-Python3/ml-100k/u.item', encoding='ISO-8859-1') as f:
    for line in f:
        fields = line.rstrip('\n').split('|')
        movieID = int(fields[0])
        name = fields[1]
        # Fields 5 through 23 are the 19 genre flags (0 or 1)
        genres = np.array([int(g) for g in fields[5:24]])
        movieDict[movieID] = (name, genres, movieNormalizedNumRatings.loc[movieID].get('size'), movieProperties.loc[movieID].rating.get('mean'))

In [7]:
movieDict[1]

('Toy Story (1995)',
 array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 0.77358490566037741,
 3.8783185840707963)
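
To make the field layout concrete, here is a sketch that parses a single line shaped like a u.item record. The line is built by hand for illustration: the title and genre flags mirror the Toy Story entry above, while the date and URL fields are placeholders, not values taken from the file.

```python
import numpy as np

# 19 genre flags, matching the array printed for Toy Story.
genre_flags = [0, 0, 0, 1, 1, 1] + [0] * 13

# Layout: id | title | release date | video release date | URL | 19 genre flags
sample_line = "|".join(
    ["1", "Toy Story (1995)", "01-Jan-1995", "", "http://example.com/placeholder"]
    + [str(flag) for flag in genre_flags]
) + "\n"

fields = sample_line.rstrip("\n").split("|")
movie_id = int(fields[0])
name = fields[1]
genres = np.array([int(flag) for flag in fields[5:24]])

print(movie_id, name, genres)
```

The slice `fields[5:24]` picks exactly the 19 genre columns, skipping the five metadata fields at the start of each record.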

In [8]:
from scipy import spatial 

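def ComputeDistance(a, b):
    # Each argument is a movieDict entry: (name, genre array, popularity, mean rating)
    genresA = a[1]
    genresB = b[1]
    # Cosine distance is 0 for identical genre sets and 1 for no shared genres
    genreDistance = spatial.distance.cosine(genresA, genresB)
    popularityA = a[2]
    popularityB = b[2]
    # Popularity is already normalized to [0, 1], so the gap is on a comparable scale
    popularityDistance = abs(popularityA - popularityB)
    return genreDistance + popularityDistance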

ComputeDistance(movieDict[2], movieDict[4])

0.8004574042309891

In [9]:
#The higher the distance, the less similar the movies are 

In [10]:
print(movieDict[2])
print(movieDict[4])

('GoldenEye (1995)', array([0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]), 0.22298456260720412, 3.2061068702290076)
('Get Shorty (1995)', array([0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 0.35677530017152659, 3.5502392344497609)
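
The distance of roughly 0.80 between these two movies can be reproduced by hand from the printed values. The sketch below spells out the cosine distance (1 minus the cosine similarity) with NumPy instead of calling SciPy; the variable names are only for illustration.

```python
import numpy as np

# Genre flags and normalized popularity copied from the two entries printed above.
golden_eye_genres = np.array([0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])
get_shorty_genres = np.array([0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
golden_eye_popularity = 0.22298456260720412
get_shorty_popularity = 0.35677530017152659

# The dot product of two 0/1 vectors counts the genres they share.
shared = int(golden_eye_genres @ get_shorty_genres)

# Cosine similarity = shared / (|a| * |b|); each movie has 3 genres, so each norm is sqrt(3).
cosine_similarity = shared / (np.linalg.norm(golden_eye_genres) * np.linalg.norm(get_shorty_genres))
genre_distance = 1 - cosine_similarity

popularity_distance = abs(golden_eye_popularity - get_shorty_popularity)
total = genre_distance + popularity_distance
print(shared, genre_distance, popularity_distance, total)
```

The two movies share one genre out of three each, so the genre term is 1 - 1/3, and the popularity gap adds about 0.134 on top of that.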


In [17]:
import operator

def getNeighbors(movieID, K):
    # Compute the distance from the target movie to every other movie
    distances = []
    for movie in movieDict:
        if movie != movieID:
            dist = ComputeDistance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    # Sort by distance (ascending) and keep the IDs of the K closest movies
    distances.sort(key=operator.itemgetter(1))
    return [movie for movie, dist in distances[:K]]

K = 5
avgRating = 0
neighbors = getNeighbors(1, K)
# Print each neighbor's title and mean rating, and accumulate the ratings
for neighbor in neighbors:
    avgRating += movieDict[neighbor][3]
    print(movieDict[neighbor][0] + " " + str(movieDict[neighbor][3]))

# The prediction is the average rating of the K neighbors
avgRating /= float(K)

Liar Liar (1997) 3.15670103093
Aladdin (1992) 3.81278538813
Willy Wonka and the Chocolate Factory (1971) 3.63190184049
Monty Python and the Holy Grail (1974) 4.0664556962
Full Monty, The (1997) 3.92698412698


In [18]:
avgRating 

3.7189656165466287

In [19]:
movieDict[1]

('Toy Story (1995)',
 array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 0.77358490566037741,
 3.8783185840707963)
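
One simple way to judge the guess is to compare it with the movie's actual mean rating. This sketch just copies the two numbers printed above; absolute error is one possible measure among many.

```python
# Values copied from the notebook output above.
predicted = 3.7189656165466287  # average rating of the 5 nearest neighbors
actual = 3.8783185840707963     # Toy Story's own mean rating

error = abs(predicted - actual)
print(error)
```

The prediction lands within about 0.16 stars of the actual average for this one movie. A single example says little about overall accuracy, which would need the same comparison across many movies.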