# Movie Strand Recommendation 
<br>The basic ideas is if a customer watched one movie, mean
<br>There are two most popular recommendation algo **content-based filtering** and ***Collaborative filter***.

* ***How is content-based Filter working***?
    <img src="images/content_based_filtering.png" alt="Drawing" style="width: 400px;"/>

* ***Current Flow***
    <img src="images/oldArch.png" alt="Drawing" style="width: 400px;"/>
* ***New Flow***

    <img src="images/newArch.png" alt="Drawing" style="width: 400px;"/>


* #### Content-based filtering theory 
**[TF_IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)**, reflect how important a word is to a document in a collection or corpus.
<br>TF-IDF weighting frequency genres to determining the simularity of programs.

    <img src="images/tfidf.jpg" alt="Drawing" style="width: 400px;"/>
**[Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)**
In this model, each item is stored as a vector of its attributes (which are also vectors) in an **n-dimensional space** and the angles between the vectors are calculated to **determine the similarity between the vectors**. 
<br> The user profile vectors are also created based on his WH.
<br> Once we have the user/movie profile, the similarity between an item and a user can be determined in a similar way.
    <img src="images/vector_space.png" alt="Drawing" style="width: 400px;"/>
**[1/2-Gram](https://www.quora.com/How-are-TF-IDF-vectorizers-with-n-gram-features-created)**
<br> An n-gram is a contiguous sequence of n items from a given sample of text or speech

                        0    Animation|Children's|Comedy
                        Name: genres, dtype: object
                        animation|children|comedy|animation children|children comedy
                          (0, 29)	0.49792764065786643
                          (0, 37)	0.40421250853461244
                          (0, 45)	0.23523321955508567
                          (0, 30)	0.521820712785304
                          (0, 38)	0.5109293383032849
 <br> For our case, we may have different Gram

* ***Our Genres from Bam*** <br>
 "Action & Adventure”, “Comedy Drama", "Children", “Kids","Thriller", "Suspense”, "Dark comedy",  “Suspense”,
"Adventure", "Animated", "Animation", "Comedy", "Children", “Kids”, "Music & Musical", "Musical comedy"
 "Horror", "Thriller", “Suspense","Documentary", "Music", “Children",”Sports”, ”Romance",  "Thriller", "Science fiction", “Romance",
“Fantasy",”Mystery","Holiday", “Holiday Drama”, "Animated”,"Other Sports","Historical drama”,"History","Military & War", “War","Western"
"Documentary","Crime", "Crime drama”,"Animated", “Animation",”Biography”, "Erotic", "Adults only"

* #### Data Preparation
In this Demo, I prepared three CSV file for user, movie, rating.
<br> For on-deman:
<br> we need to create two profiles for both user and movies.
<br> With the profile of user and movie, we can calculate their similarity.



### Implementation
With all that theory in mind, I am going to build a Content-Based Recommendation Engine that computes similarity between movies based on movie genres. It will suggest movies that are most similar to a particular movie based on its genre. To do so, I will make use of the file **movies.csv**.

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading users file
users = pd.read_csv('user.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])
print(users.head())

# Reading movies file
AllMovies = pd.read_csv('movie.csv', sep='\t', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])
print(AllMovies.head())

ratings = pd.read_csv('rating.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating'])
print(ratings.head())
ratings['rating'].describe()

   user_id gender zipcode  age_desc              occ_desc
0        1      F   48067  Under 18          K-12 student
1        2      M   70072       56+         self-employed
2        3      M   55117     25-34             scientist
3        4      M   02460     45-49  executive/managerial
4        5      M   55455     25-34                writer
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
   user_id  movie_id  rating
0        1      1193       5
1        1       661       3
2        1       914       3
3        1      3408       4
4        1      2355       5


count    1.000209e+06
mean     3.581564e+00
std      1.117102e+00
min      1.000000e+00
25%      3.000000e+00
50%      4.000000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64



* #### Merged Dataset used

In [20]:
# Join all 3 files into one dataframe
dataset = pd.merge(pd.merge(AllMovies, ratings),users)
# Display 20 movies with highest ratings
dataset[['genres','rating']].sort_values('rating', ascending=False).head(10)

Unnamed: 0,genres,rating
0,Animation|Children's|Comedy,5
489283,Comedy|Drama,5
489259,Comedy,5
489257,Action|Sci-Fi|Thriller,5
489256,Drama|Thriller,5
489237,Comedy,5
489236,Crime|Thriller,5
489226,Documentary,5
489224,Comedy,5
489212,Comedy|Drama|Romance,5


* #### WatchHistory
It will be used to create user profile

In [21]:
dataset.head(10)
watchHistory = dataset[['user_id', 'genres']]
print(watchHistory.head(10))

   user_id                                genres
0        1           Animation|Children's|Comedy
1        1  Animation|Children's|Musical|Romance
2        1                                 Drama
3        1       Action|Adventure|Fantasy|Sci-Fi
4        1                             Drama|War
5        1                      Children's|Drama
6        1   Animation|Children's|Comedy|Musical
7        1          Animation|Children's|Musical
8        1          Animation|Children's|Musical
9        1                  Crime|Drama|Thriller


* #### TopK algo 
<br> Top K algo based on WH

In [None]:
def topK(dataset, movies, ref_col, census, K):
    keyword_count = dict()
    popularDataSet = pd.DataFrame()
    for s in census: 
        keyword_count[s] = 0
    for census_keywords in dataset[ref_col]:     
        if type(census_keywords) == float and pd.isnull(census_keywords): 
            continue        
        #for s in [s for s in census_keywords if s in census]: 
        if pd.notnull(s): 
            #print(census_keywords)
            keyword_count[census_keywords] += 1
    keyword_occurences = []
    for k,v in keyword_count.items():
        keyword_occurences.append([k,v])
    keyword_occurences.sort(key = lambda x:x[1], reverse = True)
    cur = 0
    for k in keyword_occurences:
        cur += 1
        #print(cur)
        tmp = (movies.loc[lambda movies: movies['movie_id'] == int(k[0])])
        popularDataSet = popularDataSet.append(tmp)
        
        if cur >= K:
            break
    print(popularDataSet.size)
    return popularDataSet, keyword_occurences[0:K]
popularityMovies = pd.DataFrame()
popularityMovies, keyword_occurences = topK(dataset, AllMovies, 'movie_id', AllMovies['movie_id'], 5000)
print(popularityMovies.size)
print(popularityMovies.head(10))

print(keyword_occurences)
    

Movies will be used to create movie profile 

In [5]:
movies = popularityMovies
print(movies['genres'])
# Break up the big genre string into a string array
movies['genres'] = movies['genres'].str.split('|')
# Convert genres to string value
movies['genres'] = movies['genres'].fillna("").astype('str')
print(movies['genres'])

2789                           Comedy|Drama
257         Action|Adventure|Fantasy|Sci-Fi
1178      Action|Adventure|Drama|Sci-Fi|War
1192    Action|Adventure|Romance|Sci-Fi|War
476                 Action|Adventure|Sci-Fi
1959                       Action|Drama|War
585                  Action|Sci-Fi|Thriller
2502                 Action|Sci-Fi|Thriller
1250                          Comedy|Sci-Fi
589                          Drama|Thriller
1539         Action|Adventure|Comedy|Sci-Fi
1180                       Action|Adventure
604                    Crime|Drama|Thriller
2693                               Thriller
108                        Action|Drama|War
2327                         Comedy|Romance
1179        Action|Adventure|Comedy|Romance
523                               Drama|War
1575       Crime|Film-Noir|Mystery|Thriller
1245                         Comedy|Romance
1081        Children's|Drama|Fantasy|Sci-Fi
2559        Action|Adventure|Fantasy|Sci-Fi
2928                            

### Lets start to create matrix
<br>Movie/User profile 
<br>Convert a collection of raw documents to a matrix of TF-IDF features.

<br> Movie Profile from Movie Metadata:

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')
print(movies.head(2))
movie_profile = tf.fit_transform(movies['genres'])
feature_names = tf.get_feature_names()
print(len(feature_names))
print(feature_names[45], feature_names[72],feature_names[48], feature_names[0], feature_names[14], 
      feature_names[83], feature_names[119],feature_names[88], feature_names[1], feature_names[20],
      feature_names[86], feature_names[120],
      sep='|')
print(movie_profile[:2])

      movie_id                                      title  \
2789      2858                     American Beauty (1999)   
257        260  Star Wars: Episode IV - A New Hope (1977)   

                                            genres  
2789                           ['Comedy', 'Drama']  
257   ['Action', 'Adventure', 'Fantasy', 'Sci-Fi']  
127
comedy|drama|comedy drama|action|adventure|fantasy|sci|fi|action adventure|adventure fantasy|fantasy sci|sci fi
  (0, 45)	0.44898589583946386
  (0, 72)	0.38922070282647575
  (0, 48)	0.8043126940615187
  (1, 0)	0.21752151760260852
  (1, 14)	0.258536953612869
  (1, 83)	0.3597071272924552
  (1, 119)	0.26032148572733815
  (1, 88)	0.26032148572733815
  (1, 1)	0.31496600199160996
  (1, 20)	0.4737611821219671
  (1, 86)	0.479060275624473
  (1, 120)	0.26032148572733815


User Profile from WH:

In [7]:
print(ratings.loc[lambda ratings: ratings['user_id'] == 1])

    user_id  movie_id  rating
0         1      1193       5
1         1       661       3
2         1       914       3
3         1      3408       4
4         1      2355       5
5         1      1197       3
6         1      1287       5
7         1      2804       5
8         1       594       4
9         1       919       4
10        1       595       5
11        1       938       4
12        1      2398       4
13        1      2918       4
14        1      1035       5
15        1      2791       4
16        1      2687       3
17        1      2018       4
18        1      3105       5
19        1      2797       4
20        1      2321       3
21        1       720       3
22        1      1270       5
23        1       527       5
24        1      2340       3
25        1        48       5
26        1      1097       4
27        1      1721       4
28        1      1545       4
29        1       745       3
30        1      2294       4
31        1      3186       4
32        

In [8]:
print(dataset.head(2))
user_profile = tf.fit_transform(watchHistory['genres'])
feature_names = tf.get_feature_names()
#print(len(feature_names))
print(feature_names[29], feature_names[37],feature_names[30], feature_names[100], 
      feature_names[114], feature_names[42], feature_names[101], sep='|')
print(user_profile[:2])
print(user_profile.shape)

   movie_id              title                                genres  user_id  \
0         1   Toy Story (1995)           Animation|Children's|Comedy        1   
1        48  Pocahontas (1995)  Animation|Children's|Musical|Romance        1   

   rating gender zipcode  age_desc      occ_desc  
0       5      F   48067  Under 18  K-12 student  
1       5      F   48067  Under 18  K-12 student  
animation|children|animation children|musical|romance|children musical|musical romance
  (0, 29)	0.4837968178823387
  (0, 37)	0.42405235813507347
  (0, 45)	0.23739089261783358
  (0, 30)	0.5138586049354334
  (0, 38)	0.5154756393884438
  (1, 29)	0.348001467565544
  (1, 37)	0.3050264852951659
  (1, 30)	0.36962530969396445
  (1, 100)	0.3514900711161935
  (1, 114)	0.2449466944566168
  (1, 42)	0.47270271583983325
  (1, 101)	0.4921777539603983
(1000209, 127)


First, lets calculate similarity between each movie using Cosine Similarity. 

In [9]:
from sklearn.metrics.pairwise import linear_kernel
movie_movie_sim = linear_kernel(movie_profile, movie_profile)
print(movies.shape[0])
print(movie_movie_sim.shape)
print(movie_movie_sim[4].shape)
print(movie_movie_sim[:4, :4])

3883
(3883, 3883)
(3883,)
[[1.         0.         0.04820848 0.        ]
 [0.         1.         0.38301    0.36296304]
 [0.04820848 0.38301    1.         0.62054267]
 [0.         0.36296304 0.62054267 1.        ]]


In [10]:
print(movie_profile.shape, user_profile.shape)
user_movie_sim = linear_kernel(user_profile[0:5000],movie_profile)
print(movies.shape[0])
print(user_movie_sim.shape)
print(user_movie_sim[4].shape)
print(user_movie_sim[:10,:10])

(3883, 127) (1000209, 127)
3883
(5000, 3883)
(3883,)
[[0.10658516 0.         0.         0.         0.         0.
  0.         0.         0.05714765 0.        ]
 [0.         0.         0.         0.04741517 0.         0.
  0.         0.         0.         0.        ]
 [0.3892207  0.         0.12385899 0.         0.         0.20739437
  0.         0.         0.         0.3165589 ]
 [0.         0.99936687 0.37267247 0.35316657 0.53501323 0.06972072
  0.31895904 0.31895904 0.30453182 0.        ]
 [0.12798762 0.         0.2084481  0.15894096 0.         0.76542706
  0.         0.         0.         0.1040942 ]
 [0.10879787 0.         0.03462199 0.         0.         0.05797242
  0.         0.         0.         0.0884869 ]
 [0.08391139 0.         0.         0.         0.         0.
  0.         0.         0.04499068 0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.

I now have a pairwise cosine similarity matrix for all the movies in the dataset. The next step is to write a function that returns the 20 most similar movies based on the cosine similarity score.

In [11]:
# Build a 1-dimensional array with movie titles
titles = movies['title']
indices = pd.Series(movies.index, index=movies['title'])

# sort cosine similarity 
def getSimilarMovies(title):
    idx = indices[title]
    sim_scores = list(enumerate(movie_movie_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:21]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

def getRecommendation(userId):
    #idx = indices[userId]
    sim_scores = list(enumerate(user_movie_sim[userId]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:21]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

Let's try and get the top recommendations for a few movies and see how good the recommendations are.

In [12]:
getSimilarMovies('Chicken Run (2000)').head(20)

160                                          Crumb (1994)
2938                                American Movie (1999)
1131                            When We Were Kings (1996)
1172                           Thin Blue Line, The (1988)
2790                             Stop Making Sense (1984)
2608                       Buena Vista Social Club (1999)
2624                                      Trekkies (1997)
1174                        Madonna: Truth or Dare (1991)
1603                  Fast, Cheap & Out of Control (1997)
1737                                       Everest (1998)
1340    Paradise Lost: The Child Murders at Robin Hood...
545       Thirty-Two Short Films About Glenn Gould (1993)
3584                           Endless Summer, The (1966)
577                          Celluloid Closet, The (1995)
3258                                Beyond the Mat (2000)
552                                  War Room, The (1993)
3113    Mr. Death: The Rise and Fall of Fred A. Leucht...
1175          

In [13]:
getSimilarMovies('Toy Story 2 (1999)').head(20)

3039                             Fisher King, The (1991)
2179                              Say Anything... (1989)
2174                               Broadcast News (1987)
1224                                    Manhattan (1979)
2076                               Pretty in Pink (1986)
1156                              Cinema Paradiso (1988)
3616                               Prizzi's Honor (1985)
827                                          Emma (1996)
1996                    Purple Rose of Cairo, The (1985)
3192                                      Singles (1992)
1819                                  Hope Floats (1998)
221                              Don Juan DeMarco (1995)
2193                          About Last Night... (1986)
1826                            Can't Hardly Wait (1998)
3238                                  City Lights (1931)
347                              Corrina, Corrina (1994)
1193    Wings of Desire (Der Himmel Ã¼ber Berlin) (1987)
1503                           

In [14]:
print(dataset.head(100))


    movie_id                                              title  \
0          1                                   Toy Story (1995)   
1         48                                  Pocahontas (1995)   
2        150                                   Apollo 13 (1995)   
3        260          Star Wars: Episode IV - A New Hope (1977)   
4        527                            Schindler's List (1993)   
5        531                          Secret Garden, The (1993)   
6        588                                     Aladdin (1992)   
7        594             Snow White and the Seven Dwarfs (1937)   
8        595                        Beauty and the Beast (1991)   
9        608                                       Fargo (1996)   
10       661                   James and the Giant Peach (1996)   
11       720  Wallace & Gromit: The Best of Aardman Animatio...   
12       745                              Close Shave, A (1995)   
13       783                Hunchback of Notre Dame, The (1996

In [15]:
getRecommendation(81)

3200                                 Forever Young (1992)
1630                             Starship Troopers (1997)
1937                            Mask of Zorro, The (1998)
14                                Cutthroat Island (1995)
740     Dr. Strangelove or: How I Learned to Stop Worr...
957                             African Queen, The (1951)
1178    Star Wars: Episode V - The Empire Strikes Back...
2525                Open Your Eyes (Abre los ojos) (1997)
770                         Independence Day (ID4) (1996)
2593                        War of the Worlds, The (1953)
2612                               Free Enterprise (1998)
3569                                     Moonraker (1979)
3630                                       Starman (1984)
2753                                  Medicine Man (1992)
1564                                Excess Baggage (1997)
1198                 Big Blue, The (Le Grand Bleu) (1988)
913                                Golden Earrings (1947)
32            

In [16]:
getRecommendation(1)

360                      Lion King, The (1994)
591                Beauty and the Beast (1991)
1262                           Fantasia (1940)
590     Snow White and the Seven Dwarfs (1937)
1009                         Cinderella (1950)
1016                              Dumbo (1941)
655           James and the Giant Peach (1996)
1019                Alice in Wonderland (1951)
2027                    Sleeping Beauty (1959)
3090                      Fantasia 2000 (1999)
773        Hunchback of Notre Dame, The (1996)
1642                          Anastasia (1997)
2023               Return of Jafar, The (1993)
1011              Three Caballeros, The (1945)
626             All Dogs Go to Heaven 2 (1996)
1459                   Cats Don't Dance (1997)
3690                 Fun and Fancy Free (1947)
3707                        Melody Time (1948)
3706                    Make Mine Music (1946)
2012                Little Mermaid, The (1989)
Name: title, dtype: object

In [17]:
getRecommendation(7)

591                Beauty and the Beast (1991)
1262                           Fantasia (1940)
590     Snow White and the Seven Dwarfs (1937)
1009                         Cinderella (1950)
1016                              Dumbo (1941)
655           James and the Giant Peach (1996)
1019                Alice in Wonderland (1951)
2027                    Sleeping Beauty (1959)
3090                      Fantasia 2000 (1999)
773        Hunchback of Notre Dame, The (1996)
1642                          Anastasia (1997)
2023               Return of Jafar, The (1993)
1011              Three Caballeros, The (1945)
626             All Dogs Go to Heaven 2 (1996)
1459                   Cats Don't Dance (1997)
3690                 Fun and Fancy Free (1947)
3707                        Melody Time (1948)
3706                    Make Mine Music (1946)
47                           Pocahontas (1995)
1017                      Pete's Dragon (1977)
Name: title, dtype: object

In [18]:
getRecommendation(92)

14                          Cutthroat Island (1995)
2753                            Medicine Man (1992)
1564                          Excess Baggage (1997)
1198           Big Blue, The (Le Grand Bleu) (1988)
913                          Golden Earrings (1947)
32                          Wings of Courage (1995)
727                                  Twister (1996)
3636                          Bird on a Wire (1990)
957                       African Queen, The (1951)
1180                 Raiders of the Lost Ark (1981)
1271      Indiana Jones and the Last Crusade (1989)
2046    Indiana Jones and the Temple of Doom (1984)
1255                              Highlander (1986)
206                               Waterworld (1995)
1944                 Poseidon Adventure, The (1972)
1546                     Conan the Barbarian (1982)
1035             Ghost and the Darkness, The (1996)
928            Adventures of Robin Hood, The (1938)
43                             Mortal Kombat (1995)
2084        

Pros:
* No need for data on other users, thus no cold-start or sparsity problems.
* Can recommend to users with unique tastes.
* Can recommend new & unpopular items.
* Can provide explanations for recommended items by listing content-features that caused an item to be recommended (in this case, movie genres)

Cons:
* Finding the appropriate features is hard.
* Does not recommend items outside a user's content profile.
* Unable to exploit quality judgments of other users.

#### * ***On-Demand Algo improvement Spike***

    * Content-Based. 
    * Collaborative Filtering.
    * Neural Network.
    It is an investigation and POC. We may improve the algo step by step. 
    So, for the phase 1, we can use Content-Based Filtering as our algo. 
    Then Hybrid Recommendation, or even Neural Network.
    We will use genres to build recommendation, since genres is using for descrbing the movie. 
    And we can kind of assume that movies has the same genres are highly possiblely similar
 