# Recommender system

A recommender system can be built in two ways:
  1. User-based collaborative filtering:
            Build a matrix of the things each user bought/viewed/rated.
            Compute similarity scores between users.
            Find users similar to you.
            Recommend things they bought/viewed/rated that you haven't yet.
      Limitations:
            Sparsity: there are usually more people than items, and each person rates only a small fraction of the items, so two users often have few items in common.
            Scalability: the computation grows with the number of people.
            People's interests keep changing.
  
  2. Item-based collaborative filtering:
            Recommendations are based on relationships between things instead of people.
            There are usually fewer things than people, and hence less computation.
            Things, and facts about them, don't change the way people's interests do.
            It is harder to game the system (fake users are easy to create, fake items are not).
            
            
## How item-based collaborative filtering works:
(movie recommendation)

  1. find every pair of movies that are watched by the same person
  2. measure the similarity of their ratings across all users who watched both
  3. sort by movie, then by similarity strength
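
The three steps above can be sketched on a small, made-up ratings table before running them on the real dataset. The data and the names `toy`, `matrix`, `similarity`, and `pairs` are illustrative assumptions, and Pearson correlation is used as the similarity measure here; other measures, such as cosine similarity, are also common.

```python
import pandas as pd

# Hypothetical toy ratings in long format: one row per (user, movie) rating.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    "title":   ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B"],
    "rating":  [5, 4, 1, 4, 5, 2, 1, 2, 5, 2, 1],
})

# Step 1: a user x movie matrix lines up every pair of movies rated by the same person.
matrix = toy.pivot_table(index="user_id", columns="title", values="rating")

# Step 2: Pearson correlation of each movie pair, over the users who rated both.
similarity = matrix.corr()

# Step 3: flatten to (movie, other) pairs, drop self-pairs,
# then sort by movie and by similarity strength.
pairs = (
    similarity.rename_axis(index="movie", columns="other")
    .reset_index()
    .melt(id_vars="movie", var_name="other", value_name="similarity")
)
pairs = pairs[pairs["movie"] != pairs["other"]]
pairs = pairs.sort_values(
    ["movie", "similarity"], ascending=[True, False]
).reset_index(drop=True)

print(pairs)
```

In this toy data, movie B comes out as the closest match to movie A, because the users who liked one also tended to like the other.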

In [19]:
import pandas as pd
import numpy as np

In [5]:

r_cols = ['user_id', 'movie_id','rating']
ratings = pd.read_csv('./ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3), encoding="ISO-8859-1")

m_cols = ['movie_id', 'title']
movies = pd.read_csv('./ml-100k/u.item', sep='|', names=m_cols, usecols=range(2), encoding="ISO-8859-1")

ratings = pd.merge(movies, ratings)

In [6]:
ratings.head()

| | movie_id | title | user_id | rating |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 308 | 4 |
| 1 | 1 | Toy Story (1995) | 287 | 5 |
| 2 | 1 | Toy Story (1995) | 148 | 4 |
| 3 | 1 | Toy Story (1995) | 280 | 4 |
| 4 | 1 | Toy Story (1995) | 66 | 3 |


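Use the **pivot_table** function, which takes long-format (column-wise) data as input and groups the entries into a two-dimensional table that summarizes the data. Here each row is a user, each column is a movie title, and each cell holds that user's rating of that movie.

NaN  - not available / null (the user has not rated that movie)
values - the rating values that fill the cells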
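
As a minimal sketch of what the pivot does, here is the same call on a tiny, made-up table (the `toy` data and the `wide` name are assumptions for illustration). Note that `pivot_table` aggregates with the mean by default, so if a user had rated the same title twice, the cell would hold the average.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "title":   ["Movie A", "Movie B", "Movie A", "Movie B"],
    "rating":  [5, 3, 4, 2],
})

# Rows become users, columns become titles, cells hold the rating.
wide = toy.pivot_table(index="user_id", columns="title", values="rating")
print(wide)

# User 2 never rated "Movie B", so that cell is NaN.
print(wide.loc[2, "Movie B"])
```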

In [14]:
movieRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
movieRatings.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,,,,,,,,,,,...,,,,,,,,,,
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [16]:
# Extract the column of user ratings for Star Wars (NaN where a user did not rate it)

starWarsRatings = movieRatings['Star Wars (1977)']
starWarsRatings.head()

user_id
0    5.0
1    5.0
2    5.0
3    NaN
4    5.0
Name: Star Wars (1977), dtype: float64

Pandas' corrwith function makes it easy to compute the pairwise correlation of the Star Wars vector of user ratings with every other movie's ratings.

In [17]:
similarMovies = movieRatings.corrwith(starWarsRatings)
similarMovies = similarMovies.dropna()
df = pd.DataFrame(similarMovies)
df.head(10)



| title | 0 |
|---|---|
| 'Til There Was You (1997) | 0.872872 |
| 1-900 (1994) | -0.645497 |
| 101 Dalmatians (1996) | 0.211132 |
| 12 Angry Men (1957) | 0.184289 |
| 187 (1997) | 0.027398 |
| 2 Days in the Valley (1996) | 0.066654 |
| 20,000 Leagues Under the Sea (1954) | 0.289768 |
| 2001: A Space Odyssey (1968) | 0.230884 |
| 39 Steps, The (1935) | 0.106453 |
| 8 1/2 (1963) | -0.142977 |


In [18]:
similarMovies.sort_values(ascending=False)

title
Hollow Reed (1996)                        1.0
Man of the Year (1995)                    1.0
Star Wars (1977)                          1.0
Stripes (1981)                            1.0
Full Speed (1996)                         1.0
                                         ... 
Theodore Rex (1995)                      -1.0
I Like It Like That (1994)               -1.0
Two Deaths (1995)                        -1.0
Roseanna's Grave (For Roseanna) (1997)   -1.0
Frankie Starlight (1995)                 -1.0
Length: 1410, dtype: float64
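
Scores of exactly 1.0 or -1.0 for little-known titles are a warning sign: a movie that shares only two raters with Star Wars will correlate perfectly (positively or negatively) whenever the two ratings differ, whatever the movie is like. The sketch below shows the effect on made-up data; `target`, `popular`, and `obscure` are hypothetical names, not columns of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings by six users for the movie we compare against.
target = pd.Series([5, 4, 1, 2, 4, 3], name="target")

toy = pd.DataFrame({
    # Rated by all six users.
    "popular": [5, 3, 2, 2, 5, 3],
    # Rated by only two of them.
    "obscure": [np.nan, 5, np.nan, np.nan, np.nan, 4],
})

# corrwith uses only the rows where both ratings are present.
scores = toy.corrwith(target)

# Number of users each movie has in common with the target.
overlap = toy[target.notna()].count()

print(pd.DataFrame({"similarity": scores, "shared_raters": overlap}))
```

The movie with two shared raters gets a similarity of 1.0, higher than the movie with six shared raters, which is why the number of ratings per movie has to be taken into account.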

Let's construct a new DataFrame that counts how many ratings exist for each movie, along with the average rating.

In [20]:
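# 'size' counts the ratings per movie and 'mean' averages them
# (string names avoid the deprecation warning newer pandas raises for np.size / np.mean)
movieStats = ratings.groupby('title').agg({'rating': ['size', 'mean']})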
movieStats.head()

| title | rating (size) | rating (mean) |
|---|---|---|
| 'Til There Was You (1997) | 9 | 2.333333 |
| 1-900 (1994) | 5 | 2.6 |
| 101 Dalmatians (1996) | 109 | 2.908257 |
| 12 Angry Men (1957) | 125 | 4.344 |
| 187 (1997) | 41 | 3.02439 |


Drop the movies that have fewer than 100 ratings.

In [38]:
famousmovies = movieStats['rating']['size']>=100
movieStats[famousmovies].sort_values([('rating', 'mean')], ascending=False)

# To see how a stricter cutoff affects the results, keep only movies with at least 200 ratings:
#famousmovies = movieStats['rating']['size']>=200
#movieStats[famousmovies].sort_values([('rating', 'mean')], ascending=False)

| title | rating (size) | rating (mean) |
|---|---|---|
| Close Shave, A (1995) | 112 | 4.491071 |
| Schindler's List (1993) | 298 | 4.466443 |
| Wrong Trousers, The (1993) | 118 | 4.466102 |
| Casablanca (1942) | 243 | 4.456790 |
| Shawshank Redemption, The (1994) | 283 | 4.445230 |
| ... | ... | ... |
| Spawn (1997) | 143 | 2.615385 |
| Event Horizon (1997) | 127 | 2.574803 |
| Crash (1996) | 128 | 2.546875 |
| Jungle2Jungle (1997) | 132 | 2.439394 |


In [39]:
df = movieStats[famousmovies].join(pd.DataFrame(similarMovies, columns=['similarity']))
df.head()



| title | (rating, size) | (rating, mean) | similarity |
|---|---|---|---|
| 101 Dalmatians (1996) | 109 | 2.908257 | 0.211132 |
| 12 Angry Men (1957) | 125 | 4.344 | 0.184289 |
| 2001: A Space Odyssey (1968) | 259 | 3.969112 | 0.230884 |
| Absolute Power (1997) | 127 | 3.370079 | 0.08544 |
| Abyss, The (1989) | 151 | 3.589404 | 0.203709 |


In [41]:
'''The similarity column tells how similar the movie in that row is to Star Wars (1977).
   (rating, size) tells how many users rated that movie.
   (rating, mean) tells the mean rating of that movie.'''

df.sort_values(['similarity'], ascending=False)

| title | (rating, size) | (rating, mean) | similarity |
|---|---|---|---|
| Star Wars (1977) | 584 | 4.359589 | 1.000000 |
| Empire Strikes Back, The (1980) | 368 | 4.206522 | 0.748353 |
| Return of the Jedi (1983) | 507 | 4.007890 | 0.672556 |
| Raiders of the Lost Ark (1981) | 420 | 4.252381 | 0.536117 |
| Austin Powers: International Man of Mystery (1997) | 130 | 3.246154 | 0.377433 |
| ... | ... | ... | ... |
| Edge, The (1997) | 113 | 3.539823 | -0.127167 |
| As Good As It Gets (1997) | 112 | 4.196429 | -0.130466 |
| Crash (1996) | 128 | 2.546875 | -0.148507 |
| G.I. Jane (1997) | 175 | 3.360000 | -0.176734 |
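
The whole walkthrough can be folded into one reusable function. The following is a sketch under stated assumptions rather than the notebook's own code: the function name `similar_movies`, its `min_ratings` parameter, and the toy data are hypothetical, and it builds flat column names (`size`, `mean`, `similarity`) instead of the two-level columns shown above.

```python
import pandas as pd

def similar_movies(ratings, title, min_ratings=100):
    """Rank movies by rating correlation with `title`.

    `ratings` is assumed to be long-format with the columns
    user_id, title and rating. Movies with fewer than
    `min_ratings` ratings are dropped.
    """
    matrix = ratings.pivot_table(index="user_id", columns="title", values="rating")
    similarity = matrix.corrwith(matrix[title]).rename("similarity")
    stats = ratings.groupby("title")["rating"].agg(["size", "mean"])
    result = stats.join(similarity).dropna(subset=["similarity"])
    # Keep well-rated movies and leave out the query movie itself.
    keep = (result["size"] >= min_ratings) & (result.index != title)
    return result[keep].sort_values("similarity", ascending=False)

# Hypothetical toy data: movie C has only three ratings.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    "title":   ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B"],
    "rating":  [5, 4, 1, 4, 5, 2, 1, 2, 5, 2, 1],
})

top = similar_movies(toy, "A", min_ratings=4)
print(top)
```

On the real data, the equivalent call would be `similar_movies(ratings, 'Star Wars (1977)')`, assuming `ratings` is the merged DataFrame loaded earlier.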
