# Collaborative Filtering

-------------------
Collaborative filtering is the process of filtering information by collecting human judgments (ratings), in effect an automated form of “word of mouth”.

### User 
Any individual who provides ratings to a system

### Items 
Anything for which a human can provide a rating

The collaborative filtering problem is to predict how well a user will like an item they have not yet rated, given a set of historical preference judgments from a community of users.

## Import Libraries

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from math import sqrt

## Load Dataset

In [3]:
dataset=pd.read_csv('film.csv')
nama=dataset['Nama Anda']
dataset.head()


Unnamed: 0,Timestamp,Nama Anda,Ada Apa dengan Cinta 2,Gundala,Dilan 1991,Bumi Manusia,Dua Garis Biru,Avengers: End Game,The Lion King,Aladdin,Spiderman: Far From Home,Captain Marvel
0,2019/09/17 10:18:21 AM GMT+7,Hania,3.0,5.0,4.0,4.0,4.0,,,,,
1,2019/09/17 10:18:37 AM GMT+7,Topik Zulkarnain,,,,,,5.0,5.0,,4.0,2.0
2,2019/09/17 10:18:39 AM GMT+7,AhokTemanFirli,,,,,,3.0,,,,4.0
3,2019/09/17 10:18:42 AM GMT+7,franadek,4.0,4.0,4.0,5.0,3.0,5.0,4.0,5.0,4.0,4.0
4,2019/09/17 10:19:01 AM GMT+7,OM INDRA,3.0,,2.0,,5.0,5.0,,1.0,5.0,5.0


## Drop the 'Timestamp' Column and Use 'Nama Anda' as the Index

In [4]:
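# Use keyword arguments: recent pandas versions no longer accept the positional axis/drop argument
dataset=dataset.drop(columns='Timestamp')
dataset=dataset.set_index('Nama Anda')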
dataset.head()

Nama Anda,Ada Apa dengan Cinta 2,Gundala,Dilan 1991,Bumi Manusia,Dua Garis Biru,Avengers: End Game,The Lion King,Aladdin,Spiderman: Far From Home,Captain Marvel
Hania,3.0,5.0,4.0,4.0,4.0,,,,,
Topik Zulkarnain,,,,,,5.0,5.0,,4.0,2.0
AhokTemanFirli,,,,,,3.0,,,,4.0
franadek,4.0,4.0,4.0,5.0,3.0,5.0,4.0,5.0,4.0,4.0
OM INDRA,3.0,,2.0,,5.0,5.0,,1.0,5.0,5.0


## Missing Value Imputation with 0

In [5]:
dataset=dataset.fillna(0)
dataset.head()

Nama Anda,Ada Apa dengan Cinta 2,Gundala,Dilan 1991,Bumi Manusia,Dua Garis Biru,Avengers: End Game,The Lion King,Aladdin,Spiderman: Far From Home,Captain Marvel
Hania,3.0,5.0,4.0,4.0,4.0,0.0,0.0,0.0,0.0,0.0
Topik Zulkarnain,0.0,0.0,0.0,0.0,0.0,5.0,5.0,0.0,4.0,2.0
AhokTemanFirli,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,4.0
franadek,4.0,4.0,4.0,5.0,3.0,5.0,4.0,5.0,4.0,4.0
OM INDRA,3.0,0.0,2.0,0.0,5.0,5.0,0.0,1.0,5.0,5.0
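
One side effect of filling with 0 is that an unrated movie becomes indistinguishable from a rating of 0, so any distance computed later treats "not watched" like a very low rating. The sketch below illustrates this on a small made-up frame; the names `toy`, `u1`, `u2` and the film titles are hypothetical and not part of the survey data. It compares the Euclidean distance after `fillna(0)` with the distance over co-rated movies only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; NaN means "not rated".
toy = pd.DataFrame(
    {"Film A": [4.0, 4.0], "Film B": [5.0, np.nan], "Film C": [np.nan, 3.0]},
    index=["u1", "u2"],
)

# Distance after zero imputation: missing ratings count as 0.
filled = toy.fillna(0)
dist_filled = np.sqrt(((filled.loc["u1"] - filled.loc["u2"]) ** 2).sum())

# Distance restricted to movies that both users actually rated.
co_rated = toy.dropna(axis="columns").columns
dist_co_rated = np.sqrt(
    ((toy.loc["u1", co_rated] - toy.loc["u2", co_rated]) ** 2).sum()
)

print(dist_filled)    # driven entirely by the imputed zeros
print(dist_co_rated)  # the users agree on the one movie they share
```

Whether zero imputation is acceptable depends on how the scores are used afterwards; the rest of this notebook keeps the zero-filled data.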


## Convert the DataFrame into a Nested Dictionary via JSON

In [6]:
import json
export=dataset.to_json(orient='index')
dataset=json.loads(export)
dataset

{'ANI': {'Ada Apa dengan Cinta 2': 4.0,
  'Aladdin': 4.0,
  'Avengers: End Game': 0.0,
  'Bumi Manusia ': 5.0,
  'Captain Marvel': 4.0,
  'Dilan 1991': 4.0,
  'Dua Garis Biru': 0.0,
  'Gundala': 0.0,
  'Spiderman: Far From Home': 3.0,
  'The Lion King': 0.0},
 'AhokTemanFirli': {'Ada Apa dengan Cinta 2': 0.0,
  'Aladdin': 0.0,
  'Avengers: End Game': 3.0,
  'Bumi Manusia ': 0.0,
  'Captain Marvel': 4.0,
  'Dilan 1991': 0.0,
  'Dua Garis Biru': 0.0,
  'Gundala': 0.0,
  'Spiderman: Far From Home': 0.0,
  'The Lion King': 0.0},
 'Damar Teman Firli': {'Ada Apa dengan Cinta 2': 5.0,
  'Aladdin': 0.0,
  'Avengers: End Game': 5.0,
  'Bumi Manusia ': 0.0,
  'Captain Marvel': 0.0,
  'Dilan 1991': 0.0,
  'Dua Garis Biru': 0.0,
  'Gundala': 0.0,
  'Spiderman: Far From Home': 5.0,
  'The Lion King': 0.0},
 'Dpv': {'Ada Apa dengan Cinta 2': 5.0,
  'Aladdin': 0.0,
  'Avengers: End Game': 5.0,
  'Bumi Manusia ': 0.0,
  'Captain Marvel': 5.0,
  'Dilan 1991': 4.0,
  'Dua Garis Biru': 0.0,
  'Gundala': 
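
The JSON round trip above produces a nested dictionary of the form `{user: {movie: rating}}`. As a hedged aside, pandas can build the same structure directly with `to_dict(orient='index')`. The sketch below checks that the two approaches agree on a small hypothetical frame (`toy`, `u1`, `u2` and the film titles are made up for illustration).

```python
import json

import pandas as pd

# Hypothetical two-user frame standing in for the ratings table.
toy = pd.DataFrame(
    {"Film A": [4.0, 0.0], "Film B": [5.0, 3.0]},
    index=["u1", "u2"],
)

# Approach used in the notebook: serialise to JSON, then parse it back.
via_json = json.loads(toy.to_json(orient="index"))

# Direct conversion without the intermediate JSON string.
via_dict = toy.to_dict(orient="index")

print(via_dict)
print(via_json == via_dict)
```

One difference to keep in mind: JSON object keys are always strings, so the round trip turns non-string index labels into strings, while `to_dict` keeps them as they are.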

## Similarity Score

In [7]:
def similarity_score(person1,person2):

    # Returns a similarity score based on the Euclidean distance between person1 and person2

    # Collect the items rated by both person1 and person2
    both_viewed = {}

    for item in dataset[person1]:
        if item in dataset[person2]:
            both_viewed[item] = 1

    # If they have no rated items in common, the similarity is 0
    if len(both_viewed) == 0:
        return 0

    # Sum the squared rating differences over the common items
    sum_of_euclidean_distance = sum(
        pow(dataset[person1][item] - dataset[person2][item], 2) for item in both_viewed
    )

    return 1/(1+sqrt(sum_of_euclidean_distance))


In [8]:
similarity_score('Mulya','ANI')

0.10056040392403998

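The similarity score shows how close one user's ratings are to another user's ratings. It ranges from 0 to 1: the closer the score is to 1, the more similar the two users are, and a score of exactly 1 means their ratings are identical. Because missing ratings were filled with 0, movies that only one of the two users has rated also add to the distance.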
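
To make the score concrete, here is a minimal worked sketch of the same idea, 1 / (1 + Euclidean distance), on hypothetical ratings. The names `ratings`, `u1`, `u2` and `euclidean_similarity` are illustrative and not part of the notebook.

```python
from math import sqrt

# Hypothetical ratings for two users over the same three movies.
ratings = {
    "u1": {"Film A": 4.0, "Film B": 5.0, "Film C": 0.0},
    "u2": {"Film A": 4.0, "Film B": 2.0, "Film C": 4.0},
}


def euclidean_similarity(a, b):
    """Return 1 / (1 + Euclidean distance) over the items present for both users."""
    shared = [item for item in a if item in b]
    if not shared:
        return 0
    distance = sqrt(sum((a[item] - b[item]) ** 2 for item in shared))
    return 1 / (1 + distance)


# Identical ratings have distance 0, so the score is 1.
print(euclidean_similarity(ratings["u1"], ratings["u1"]))

# Differences are 0, 3 and -4, so the distance is 5 and the score is 1/6.
print(euclidean_similarity(ratings["u1"], ratings["u2"]))
```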

## Pearson Correlation

In [9]:
def person_correlation(person1, person2):

    # Collect the items rated by both users
    both_rated = {}
    for item in dataset[person1]:
        if item in dataset[person2]:
            both_rated[item] = 1

    number_of_ratings = len(both_rated)

    # Checking for ratings in common
    if number_of_ratings == 0:
        return 0

    # Add up all the preferences of each user
    person1_preferences_sum = sum([dataset[person1][item] for item in both_rated])
    person2_preferences_sum = sum([dataset[person2][item] for item in both_rated])

    # Sum up the squares of preferences of each user
    person1_square_preferences_sum = sum([pow(dataset[person1][item],2) for item in both_rated])
    person2_square_preferences_sum = sum([pow(dataset[person2][item],2) for item in both_rated])

    # Sum up the product value of both preferences for each item
    product_sum_of_both_users = sum([dataset[person1][item] * dataset[person2][item] for item in both_rated])

    # Calculate the pearson score
    numerator_value = product_sum_of_both_users - (person1_preferences_sum*person2_preferences_sum/number_of_ratings)
    denominator_value = sqrt((person1_square_preferences_sum - pow(person1_preferences_sum,2)/number_of_ratings) * (person2_square_preferences_sum -pow(person2_preferences_sum,2)/number_of_ratings))

    if denominator_value == 0:
        return 0
    else:
        r = numerator_value / denominator_value
        return r

In [10]:
person_correlation('Mulya','ANI')

0.020526049976004625

The Pearson score measures how strongly two users' ratings are linearly related. It ranges from -1 to 1: values close to 1 mean the users tend to rate the same movies high and low together, values close to -1 mean their tastes are opposite, and values near 0 mean there is no linear relationship.
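
As a sanity check, the sum-based formula used in the function can be compared against `numpy.corrcoef`. The sketch below does this on two hypothetical rating vectors (`x` and `y` are made-up values, not survey data).

```python
from math import sqrt

import numpy as np

# Hypothetical ratings of two users over the same four movies.
x = [4.0, 5.0, 0.0, 3.0]
y = [3.0, 5.0, 1.0, 2.0]
n = len(x)

# Same building blocks as the function above: sums, sums of squares, sum of products.
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

numerator = sum_xy - sum_x * sum_y / n
denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r_manual = numerator / denominator

# Reference value from NumPy's correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

The two values should agree up to floating-point rounding, which suggests the hand-written formula is the standard Pearson coefficient.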

## Most Similar Users

In [11]:
def most_similar_users(person, number_of_users):

    # Returns the number_of_users most similar persons for the given person
    scores = [(person_correlation(person, other_person), other_person) for other_person in dataset if other_person != person]

    # Sort so that the person with the highest score appears first
    scores.sort(reverse=True)
    return scores[0:number_of_users]

In [12]:
most_similar_users('franadek',5)

[(0.4641395608164579, 'Mulya'),
 (0.45226701686664833, 'Rima'),
 (0.3482630165734967, 'ANI'),
 (0.325300024316178, 'Genjeh'),
 (0.325300024316178, 'Febi ganteng gak ada obat')]

The output of most_similar_users for franadek shows that Mulya and Rima have the movie-rating patterns most similar to franadek's.
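
Sorting every score works well for a small survey. When only the top few neighbours are needed from a larger user base, `heapq.nlargest` is one possible alternative. The following is a sketch under assumed inputs: the `scores` dictionary and the helper `top_k_similar` are hypothetical and not part of the notebook.

```python
import heapq

# Hypothetical similarity scores of other users relative to one target user.
scores = {"u2": 0.46, "u3": -0.10, "u4": 0.45, "u5": 0.33}


def top_k_similar(scores, k):
    """Return the k (score, user) pairs with the highest scores, best first."""
    return heapq.nlargest(k, ((score, user) for user, score in scores.items()))


print(top_k_similar(scores, 2))
```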

## User Recommendations

In [13]:
def user_recommendations(person):

    # Gets recommendations for a person by using a weighted average of every other user's ratings
    totals = {}
    simSums = {}
    for other in dataset:
        # don't compare the person to themselves
        if other == person:
            continue
        sim = person_correlation(person,other)

        # ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in dataset[other]:

            # only score movies the person hasn't rated yet (0 means unrated)
            if item not in dataset[person] or dataset[person][item] == 0:

                # similarity * rating
                totals.setdefault(item,0)
                totals[item] += dataset[other][item] * sim
                # sum of similarities
                simSums.setdefault(item,0)
                simSums[item] += sim

    # Create the normalized list of (predicted rating, item), highest first
    rankings = [(total/simSums[item],item) for item,total in totals.items()]
    rankings.sort(reverse=True)
    # Return the recommended items together with their predicted ratings
    recommendations_list = [recommend_item for score,recommend_item in rankings]
    return recommendations_list, rankings

In [24]:
rekomendasi=pd.DataFrame(user_recommendations('Mulya')[1],columns=['Ratings','Movie'])
rekomendasi[0:3]

Unnamed: 0,Ratings,Movie
0,2.594478,Ada Apa dengan Cinta 2
1,2.355511,Gundala
2,1.93597,Dilan 1991


### Based on user-based collaborative filtering, the movies recommended for Mulya are Ada Apa dengan Cinta 2, Gundala, and Dilan 1991
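
The predicted rating for each movie is a similarity-weighted average of the neighbours' ratings: the sum of similarity times rating, divided by the sum of similarities. A minimal worked sketch with hypothetical numbers (`similarities` and `neighbour_ratings` are made up for illustration):

```python
# Hypothetical inputs: similarity of two neighbours to the target user, and the
# ratings those neighbours gave to one movie the target user has not rated.
similarities = {"u2": 0.5, "u3": 0.25}
neighbour_ratings = {"u2": 4.0, "u3": 1.0}

# Numerator: each rating weighted by how similar that neighbour is.
weighted_total = sum(similarities[u] * neighbour_ratings[u] for u in similarities)

# Denominator: total similarity, which normalises the result back to the rating scale.
similarity_sum = sum(similarities.values())

predicted = weighted_total / similarity_sum
print(predicted)  # (0.5 * 4 + 0.25 * 1) / 0.75 = 3.0
```

The more similar neighbour pulls the prediction towards their own rating, which is why the result sits closer to 4 than to 1.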

## Thank You