# Finding Similar Movies (ref. Star Wars)
We start by loading the MovieLens dataset and merging the columns we care about from the u.data and u.item files, so that we can work with movie names instead of IDs. (In real production code, you would stick with IDs and look up the names only at display time, which is more efficient.)

In [1]:
import pandas as pd
import numpy as np

In [2]:
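r_cols = ['user_id', 'movie_id', 'rating']
# u.data is tab-separated; keep only the first three columns (indices 0, 1, 2)
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols,
                      usecols=range(3))
m_cols = ['movie_id', 'title']
# u.item is pipe-separated and not UTF-8; latin-1 decodes accented titles correctly
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols,
                     usecols=range(2), encoding='latin-1')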

In [3]:
ratings.shape,movies.shape

((100000, 3), (1682, 2))

In [4]:
ratings.head()

,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [5]:
movies.head()

,movie_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


### Merging Both Files

In [6]:
ratings= pd.merge(movies,ratings)
ratings.head()

,movie_id,title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3
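
Called with no extra arguments, `pd.merge` joins on the columns the two frames share (here only `movie_id`) and keeps only the keys present in both. The sketch below illustrates this on made-up toy frames; the names `toy_movies` and `toy_ratings` and their values are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy stand-ins for u.item and u.data (values are made up for illustration)
toy_movies = pd.DataFrame({"movie_id": [1, 2, 3],
                           "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({"user_id": [10, 11, 10],
                            "movie_id": [1, 1, 2],
                            "rating": [4, 5, 3]})

# With no arguments, merge joins on the shared column(s), here movie_id
implicit = pd.merge(toy_movies, toy_ratings)

# The same join with the key and the join type spelled out
explicit = pd.merge(toy_movies, toy_ratings, on="movie_id", how="inner")

print(implicit.equals(explicit))
print(implicit)
```

Movie "C" has no ratings in the toy data, so the inner join drops it. Spelling out `on` and `how` makes that behavior easier to spot when reading the code later.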


The pivot_table method on a DataFrame constructs a user/movie rating matrix, with one row per user and one column per title. Note how NaN indicates missing data, i.e. movies the user has not rated.

In [7]:
movieRatings=ratings.pivot_table(index='user_id',
                           columns=['title'],
                           values='rating')

movieRatings.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,,,,,,,,,,,,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,2.0,,,,,4.0,,,...,,,,4.0,,,,,4.0,
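
To see on a small scale what the pivot produces, here is a minimal sketch with a made-up long-format table (the frame `toy` and its values are assumptions for illustration). It also shows that the default aggregation is the mean, so a user who rated the same title twice ends up with the average of the two ratings.

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, title) rating
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "title":   ["A", "B", "A", "B", "B"],  # user 3 rated "B" twice
    "rating":  [5, 3, 4, 2, 4],
})

# One row per user, one column per title; unrated cells become NaN
matrix = toy.pivot_table(index="user_id", columns="title", values="rating")
print(matrix)

# Fraction of cells that are missing
sparsity = matrix.isna().to_numpy().mean()
print(sparsity)
```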


## Finding who rated Star Wars

In [8]:
starwars_rating=movieRatings['Star Wars (1977)']
starwars_rating.head()

user_id
1    5.0
2    5.0
3    NaN
4    5.0
5    4.0
Name: Star Wars (1977), dtype: float64

In [9]:
starwars_rating.shape

(943,)

### Correlating every other movie with the Star Wars rating vector and dropping NaN values

In [10]:
# Equivalent: movieRatings.corrwith(movieRatings['Star Wars (1977)'])
similar_movies = movieRatings.corrwith(starwars_rating)
# NaN means the correlation could not be computed for that movie
similar_movies = similar_movies.dropna()
df = pd.DataFrame(similar_movies)


In [11]:
df.head(10)

title,0
'Til There Was You (1997),0.872872
1-900 (1994),-0.645497
101 Dalmatians (1996),0.211132
12 Angry Men (1957),0.184289
187 (1997),0.027398
2 Days in the Valley (1996),0.066654
"20,000 Leagues Under the Sea (1954)",0.289768
2001: A Space Odyssey (1968),0.230884
"39 Steps, The (1935)",0.106453
8 1/2 (1963),-0.142977
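
Each value in this table is a Pearson correlation computed only over the users who rated both movies. The sketch below checks that on made-up data by comparing `corrwith` against a manual calculation with `np.corrcoef`; the column names `Target` and `Other` and all ratings are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ratings; NaN marks a movie the user did not rate
toy = pd.DataFrame({
    "Target": [5.0, 4.0, 1.0, 2.0, np.nan],
    "Other":  [4.0, 5.0, 2.0, np.nan, 3.0],
}, index=pd.Index([1, 2, 3, 4, 5], name="user_id"))

# Correlation of every column with the Target column
via_corrwith = toy.corrwith(toy["Target"])["Other"]

# Manual check: keep only users who rated both movies, then apply Pearson
both = toy.dropna()
manual = np.corrcoef(both["Target"], both["Other"])[0, 1]

print(len(both), via_corrwith, manual)
```

Only three of the five toy users rated both movies, so the correlation rests on three points even though each movie has four ratings.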


In [12]:
# Sort by correlation, highest first
similar_movies.sort_values(ascending=False)
# or: df.sort_values(by=[0], ascending=False)

title
Hollow Reed (1996)                        1.0
Commandments (1997)                       1.0
Cosi (1996)                               1.0
No Escape (1994)                          1.0
Stripes (1981)                            1.0
                                         ... 
Roseanna's Grave (For Roseanna) (1997)   -1.0
For Ever Mozart (1996)                   -1.0
American Dream (1990)                    -1.0
Frankie Starlight (1995)                 -1.0
Fille seule, La (A Single Girl) (1995)   -1.0
Length: 1410, dtype: float64

These results are probably distorted by movies that were viewed by only a handful of people who also happen to like Star Wars. A correlation computed from just two or three shared ratings can easily come out as exactly 1.0 or -1.0, so we need to get rid of the rarely watched movies that produce these spurious results. Let's construct a new DataFrame that counts how many ratings exist for each movie, along with the average rating while we're at it, since that could come in handy later.

In [13]:
# String names select pandas' built-in groupby aggregations
movieStats = ratings.groupby('title').agg({'rating': ['size', 'mean']})
movieStats.head()

,rating,rating
title,size,mean
'Til There Was You (1997),9,2.333333
1-900 (1994),5,2.6
101 Dalmatians (1996),109,2.908257
12 Angry Men (1957),125,4.344
187 (1997),41,3.02439


**Let's get rid of any movies rated by fewer than 100 people, and check the top-rated ones that are left**

In [14]:
popularMovies = movieStats['rating']['size']>=100
movieStats[popularMovies].sort_values([('rating','mean')], ascending=False)

,rating,rating
title,size,mean
"Close Shave, A (1995)",112,4.491071
Schindler's List (1993),298,4.466443
"Wrong Trousers, The (1993)",118,4.466102
Casablanca (1942),243,4.456790
"Shawshank Redemption, The (1994)",283,4.445230
...,...,...
Spawn (1997),143,2.615385
Event Horizon (1997),127,2.574803
Crash (1996),128,2.546875
Jungle2Jungle (1997),132,2.439394


Let's join this data with the original set of movies similar to Star Wars.

In [15]:
# DataFrame.join defaults to a left join on the index (here, the title),
# so every popular movie is kept
df = movieStats[popularMovies].join(
    pd.DataFrame(similar_movies, columns=["similarity"]))



In [16]:
df.head()

title,"(rating, size)","(rating, mean)",similarity
101 Dalmatians (1996),109,2.908257,0.211132
12 Angry Men (1957),125,4.344,0.184289
2001: A Space Odyssey (1968),259,3.969112,0.230884
Absolute Power (1997),127,3.370079,0.08544
"Abyss, The (1989)",151,3.589404,0.203709
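
The tuple-style column names such as "(rating, size)" appear because the statistics frame has two-level columns while the similarity frame has a single level. One possible way to avoid them, sketched here on made-up data, is to aggregate the `rating` column directly so the result has flat columns. The names `toy_ratings`, `toy_similarity`, and `flat_stats` are assumptions for illustration.

```python
import pandas as pd

# Made-up ratings and similarity scores
toy_ratings = pd.DataFrame({
    "title":  ["A", "A", "B", "B", "B"],
    "rating": [4, 5, 2, 3, 4],
})
toy_similarity = pd.Series({"A": 0.9, "B": 0.1}, name="similarity")

# Selecting the column before agg yields flat column names: size, mean
flat_stats = toy_ratings.groupby("title")["rating"].agg(["size", "mean"])

# A named Series can be joined directly; its name becomes the column name
joined = flat_stats.join(toy_similarity)
print(joined)
```

With flat columns, later steps can refer to `joined["size"]` and `joined["mean"]` instead of tuple keys.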


In [17]:
df.shape

(338, 3)

In [18]:
df.sort_values(['similarity'],ascending=False)

title,"(rating, size)","(rating, mean)",similarity
Star Wars (1977),583,4.358491,1.000000
"Empire Strikes Back, The (1980)",367,4.204360,0.747981
Return of the Jedi (1983),507,4.007890,0.672556
Raiders of the Lost Ark (1981),420,4.252381,0.536117
Austin Powers: International Man of Mystery (1997),130,3.246154,0.377433
...,...,...,...
"Edge, The (1997)",113,3.539823,-0.127167
As Good As It Gets (1997),112,4.196429,-0.130466
Crash (1996),128,2.546875,-0.148507
G.I. Jane (1997),175,3.360000,-0.176734
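
The steps of this notebook can be collected into a single helper. The function below is a hypothetical sketch, not part of the original notebook: the name `similar_to`, its parameters, and the toy data are all assumptions for illustration. It also drops the target movie itself, which otherwise always ranks first with a similarity of 1.0.

```python
import pandas as pd


def similar_to(ratings, title, min_ratings=100, top_n=5):
    """Rank movies by rating correlation with `title` (hypothetical helper)."""
    # User/movie matrix, NaN where a user has not rated a movie
    matrix = ratings.pivot_table(index="user_id", columns="title",
                                 values="rating")
    # Pearson correlation of every movie with the target movie
    similarity = matrix.corrwith(matrix[title]).dropna().rename("similarity")
    # Number of ratings per movie, used to filter out rarely rated titles
    counts = ratings.groupby("title")["rating"].size().rename("num_ratings")
    result = pd.concat([counts, similarity], axis=1, join="inner")
    result = result[result["num_ratings"] >= min_ratings]
    # The target trivially correlates perfectly with itself, so drop it
    result = result.drop(index=title, errors="ignore")
    return result.sort_values("similarity", ascending=False).head(top_n)


# Made-up ratings: {title: {user_id: rating}}
scores = {
    "Target":    {1: 5, 2: 4, 3: 1, 4: 2, 5: 5, 6: 3},
    "Sequel":    {1: 5, 2: 5, 3: 1, 4: 1, 5: 4, 6: 3},
    "Unrelated": {1: 1, 2: 2, 3: 5, 4: 4, 5: 2, 6: 3},
    "Rare":      {1: 2, 2: 3},  # too few ratings to trust
}
toy = pd.DataFrame(
    [(user, title, rating)
     for title, by_user in scores.items()
     for user, rating in by_user.items()],
    columns=["user_id", "title", "rating"],
)

result = similar_to(toy, "Target", min_ratings=3)
print(result)
```

On the real data, the call would look like `similar_to(ratings, 'Star Wars (1977)')`, assuming `ratings` is the merged frame built earlier.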
