Two types of recommendation system:

    * Content-based: recommends items based on their tags (action, drama, etc.)
    * Collaborative filtering: recommends items based on user behaviour and ratings

In [14]:
import pandas as pd
movies = pd.read_csv("movies.csv")

In [15]:
movies

,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
0,278,The Shawshank Redemption,"Drama,Crime",en,Framed in the 1940s for the double murder of h...,94.075,1994-09-23,8.7,21862
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance",hi,"Raj is a rich, carefree, happy-go-lucky second...",25.408,1995-10-19,8.7,3731
2,238,The Godfather,"Drama,Crime",en,"Spanning the years 1945 to 1955, a chronicle o...",90.585,1972-03-14,8.7,16280
3,424,Schindler's List,"Drama,History,War",en,The true story of how businessman Oskar Schind...,44.761,1993-12-15,8.6,12959
4,240,The Godfather: Part II,"Drama,Crime",en,In the continuing saga of the Corleone crime f...,57.749,1974-12-20,8.6,9811
...,...,...,...,...,...,...,...,...,...
9995,10196,The Last Airbender,"Action,Adventure,Fantasy",en,"The story follows the adventures of Aang, a yo...",98.322,2010-06-30,4.7,3347
9996,331446,Sharknado 3: Oh Hell No!,"Action,TV Movie,Science Fiction,Comedy,Adventure",en,The sharks take bite out of the East Coast whe...,12.490,2015-07-22,4.7,417
9997,13995,Captain America,"Action,Science Fiction,War",en,"During World War II, a brave, patriotic Americ...",18.333,1990-12-14,4.6,332
9998,2312,In the Name of the King: A Dungeon Siege Tale,"Adventure,Fantasy,Action,Drama",en,A man named Farmer sets out to rescue his kidn...,15.159,2007-11-29,4.7,668


Summary statistics and column info

In [16]:
movies.describe()

,id,popularity,vote_average,vote_count
count,10000.0,10000.0,10000.0,10000.0
mean,161243.505,34.697267,6.62115,1547.3094
std,211422.046043,211.684175,0.766231,2648.295789
min,5.0,0.6,4.6,200.0
25%,10127.75,9.15475,6.1,315.0
50%,30002.5,13.6375,6.6,583.5
75%,310133.5,25.65125,7.2,1460.0
max,934761.0,10436.917,8.7,31917.0


In [17]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 10000 non-null  int64  
 1   title              10000 non-null  object 
 2   genre              9997 non-null   object 
 3   original_language  10000 non-null  object 
 4   overview           9987 non-null   object 
 5   popularity         10000 non-null  float64
 6   release_date       10000 non-null  object 
 7   vote_average       10000 non-null  float64
 8   vote_count         10000 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 703.2+ KB


In [18]:
movies.isnull().sum()

id                    0
title                 0
genre                 3
original_language     0
overview             13
popularity            0
release_date          0
vote_average          0
vote_count            0
dtype: int64

## Feature engineering
Select necessary features.

In [19]:
movies.columns

Index(['id', 'title', 'genre', 'original_language', 'overview', 'popularity',
       'release_date', 'vote_average', 'vote_count'],
      dtype='object')

Selected columns: id, title, genre, overview.

original_language is not selected because most movies are dubbed.
release_date is not selected either: it only matters for the few movies that come in several parts, and recommendations are made by title anyway.

In [20]:
movies_1 = movies[['id','title','overview','genre']]

In [21]:
movies_1

,id,title,overview,genre
0,278,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,"Drama,Crime"
1,19404,Dilwale Dulhania Le Jayenge,"Raj is a rich, carefree, happy-go-lucky second...","Comedy,Drama,Romance"
2,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...","Drama,Crime"
3,424,Schindler's List,The true story of how businessman Oskar Schind...,"Drama,History,War"
4,240,The Godfather: Part II,In the continuing saga of the Corleone crime f...,"Drama,Crime"
...,...,...,...,...
9995,10196,The Last Airbender,"The story follows the adventures of Aang, a yo...","Action,Adventure,Fantasy"
9996,331446,Sharknado 3: Oh Hell No!,The sharks take bite out of the East Coast whe...,"Action,TV Movie,Science Fiction,Comedy,Adventure"
9997,13995,Captain America,"During World War II, a brave, patriotic Americ...","Action,Science Fiction,War"
9998,2312,In the Name of the King: A Dungeon Siege Tale,A man named Farmer sets out to rescue his kidn...,"Adventure,Fantasy,Action,Drama"


### Content based recommendation system

In [22]:
movies_1 = movies_1.copy()  # explicit copy, so the assignment below does not raise SettingWithCopyWarning
# Note: plain + yields NaN tags when overview or genre is missing and inserts no separator
movies_1['tags'] = movies_1['overview'] + movies_1['genre']
new_data = movies_1.drop(columns=['overview', 'genre'])
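Plain `+` concatenation has two side effects: a missing `overview` or `genre` turns the whole tag into `NaN`, and nothing separates the last word of the overview from the first genre. Below is a minimal sketch of a more defensive variant on a small made-up frame. The names `build_tags` and `toy` are hypothetical and not part of the notebook, and switching to this variant would probably shift the similarity scores shown later by a small amount.

```python
import pandas as pd

def build_tags(df: pd.DataFrame) -> pd.Series:
    """Join overview and genre with a space, treating missing values as empty text."""
    overview = df["overview"].fillna("")
    # Turn "Drama,Crime" into "Drama Crime" so each genre is its own token.
    genre = df["genre"].fillna("").str.replace(",", " ", regex=False)
    return (overview + " " + genre).str.strip()

# Hypothetical toy rows that mimic the two columns used above.
toy = pd.DataFrame({
    "overview": ["A banker is sent to prison", None],
    "genre": ["Drama,Crime", "Action"],
})
tags = build_tags(toy)
print(tags.tolist())
```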


In [23]:
new_data

,id,title,tags
0,278,The Shawshank Redemption,Framed in the 1940s for the double murder of h...
1,19404,Dilwale Dulhania Le Jayenge,"Raj is a rich, carefree, happy-go-lucky second..."
2,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o..."
3,424,Schindler's List,The true story of how businessman Oskar Schind...
4,240,The Godfather: Part II,In the continuing saga of the Corleone crime f...
...,...,...,...
9995,10196,The Last Airbender,"The story follows the adventures of Aang, a yo..."
9996,331446,Sharknado 3: Oh Hell No!,The sharks take bite out of the East Coast whe...
9997,13995,Captain America,"During World War II, a brave, patriotic Americ..."
9998,2312,In the Name of the King: A Dungeon Siege Tale,A man named Farmer sets out to rescue his kidn...


The tags column is text, so it has to be converted into numeric vectors.
Two common NLP techniques for this:

    * Bag of words
    * TF-IDF

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

In [26]:
cv = CountVectorizer(max_features=10000, stop_words='english')  # keep the 10000 most frequent terms; drop common English stop words

In [27]:
cv

CountVectorizer(max_features=10000, stop_words='english')

In [28]:
vector = cv.fit_transform(new_data['tags'].values.astype('U')).toarray()  # astype('U') casts to NumPy Unicode strings (NaN becomes 'nan')

In [29]:
vector.shape

(10000, 10000)
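A dense array of this shape holds 10000 × 10000 counts, which is about 800 MB assuming the default 64-bit integer dtype. As a sketch of an alternative, the sparse matrix returned by `fit_transform` can be passed to `cosine_similarity` directly, without `.toarray()`. The corpus and the names `cv_demo`, `sparse_vectors`, `sim_sparse` and `sim_dense` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for new_data['tags'].
docs = ["crime drama family", "crime drama prison", "space adventure aliens"]

cv_demo = CountVectorizer()
sparse_vectors = cv_demo.fit_transform(docs)  # SciPy sparse matrix, no .toarray()

# cosine_similarity accepts sparse input and returns a dense ndarray by default.
sim_sparse = cosine_similarity(sparse_vectors)
sim_dense = cosine_similarity(sparse_vectors.toarray())

print(sparse_vectors.shape, sim_sparse.shape)
print(np.allclose(sim_sparse, sim_dense))
```

The similarity matrix itself is still dense, so this only saves the memory of the input vectors.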

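For a recommendation system we need a measure of similarity between data instances.

    * Cosine similarity: each movie is now a vector of 10000 word counts, i.e. a point in 10000-dimensional space. The similarity of two movies is the cosine of the angle between their vectors: 1 when they point in the same direction, 0 when they share no words. It depends on direction only, not on the distance between the points.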
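As a worked illustration, cosine similarity can be computed by hand as cos(θ) = (u · v) / (‖u‖ ‖v‖) and compared with scikit-learn's result. The vectors `a`, `b`, `c` and the helper `cosine` are made up for this sketch.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """cos(theta) = (u . v) / (||u|| * ||v||)"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical word-count vectors for three movies over a 4-word vocabulary.
a = np.array([2, 1, 0, 0])
b = np.array([4, 2, 0, 0])  # same direction as a, twice as long
c = np.array([0, 0, 3, 1])  # shares no words with a

print(cosine(a, b))  # parallel vectors
print(cosine(a, c))  # orthogonal vectors
print(cosine_similarity(np.vstack([a, b, c])).round(3))
```

`a` and `b` are far apart as points, yet they count as maximally similar because they point in the same direction; this is why cosine similarity is not simply an inverse distance.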

In [30]:
from sklearn.metrics.pairwise import cosine_similarity

In [31]:
similarity  = cosine_similarity(vector)

In [32]:
similarity.shape

(10000, 10000)

In [33]:
similarity


array([[1.        , 0.05634362, 0.12888482, ..., 0.07559289, 0.11065667,
        0.06388766],
       [0.05634362, 1.        , 0.07624929, ..., 0.        , 0.03636965,
        0.        ],
       [0.12888482, 0.07624929, 1.        , ..., 0.02273314, 0.06655583,
        0.08645856],
       ...,
       [0.07559289, 0.        , 0.02273314, ..., 1.        , 0.03253   ,
        0.02817181],
       [0.11065667, 0.03636965, 0.06655583, ..., 0.03253   , 1.        ,
        0.0412393 ],
       [0.06388766, 0.        , 0.08645856, ..., 0.02817181, 0.0412393 ,
        1.        ]])

In [40]:
x = list(enumerate(similarity[0]))
x

[(0, 1.0000000000000002),
 (1, 0.0563436169819011),
 (2, 0.12888481555661677),
 (3, 0.03688555567816587),
 (4, 0.11428571428571427),
 (5, 0.08451542547285165),
 (6, 0.03779644730092272),
 (7, 0.081199794294115),
 (8, 0.027066598098038335),
 (9, 0.035245368842512066),
 (10, 0.09583148474999097),
 (11, 0.0),
 (12, 0.04688072309384954),
 (13, 0.11952286093343935),
 (14, 0.030860669992418377),
 (15, 0.07476671794188401),
 (16, 0.0),
 (17, 0.06761234037828133),
 (18, 0.0),
 (19, 0.0),
 (20, 0.07207499701564471),
 (21, 0.040996003084539386),
 (22, 0.03194382824999699),
 (23, 0.06761234037828133),
 (24, 0.08199200616907877),
 (25, 0.059761430466719674),
 (26, 0.12298800925361816),
 (27, 0.05039526306789696),
 (28, 0.0994490316197694),
 (29, 0.03450327796711771),
 (30, 0.036037498507822355),
 (31, 0.030860669992418377),
 (32, 0.10811249552346706),
 (33, 0.06761234037828133),
 (34, 0.07824607964359516),
 (35, 0.036037498507822355),
 (36, 0.14638501094227999),
 (37, 0.04364357804719847),
 (38, 0

In [34]:
# simple test: look up the row index of a movie by its title
new_data[new_data['title']=="The Godfather"].index[0]

2

In [36]:
distance = sorted(list(enumerate(similarity[2])), reverse=True, key=lambda pair: pair[1])
# enumerate pairs each score with its row index; key sorts on the score (second element of each pair);
# reverse=True puts the most similar movies first. Despite its name, `distance` holds similarity scores.
distance

[(2, 1.0000000000000004),
 (4, 0.48976229911514363),
 (7419, 0.3521803625302496),
 (153, 0.3354968547317302),
 (2624, 0.3234983196103152),
 (9520, 0.3112864031823452),
 (2412, 0.3081578172139684),
 (330, 0.30499714066520933),
 (5010, 0.2995012465378748),
 (779, 0.29606845410646954),
 (7049, 0.29606845410646954),
 (9362, 0.2934836354418746),
 (4569, 0.29261523994305977),
 (3670, 0.2893456933022473),
 (4872, 0.28934569330224724),
 (1816, 0.2857953049377246),
 (4811, 0.28529870107872785),
 (6964, 0.2803652103289399),
 (4380, 0.2798845714165278),
 (734, 0.2758802939230217),
 (5605, 0.2758802939230217),
 (1223, 0.2756247308353552),
 (6788, 0.2756247308353552),
 (9245, 0.2756247308353552),
 (8555, 0.2744974265986884),
 (709, 0.2727977357881894),
 (3742, 0.2686124597780274),
 (519, 0.26687249808205815),
 (821, 0.26687249808205815),
 (6565, 0.26687249808205815),
 (250, 0.26622333025588873),
 (8503, 0.26622333025588873),
 (747, 0.2652335521937267),
 (233, 0.2641352718976872),
 (2272, 0.26413527
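Sorting the full list of (index, score) pairs works, but the same ranking can be obtained from the indices alone. A minimal sketch with NumPy, using a made-up similarity row (`row`, `k`, `order` and `top_k` are hypothetical names):

```python
import numpy as np

# Hypothetical similarity row for one movie against six titles; index 2 is the movie itself.
row = np.array([0.13, 0.08, 1.00, 0.02, 0.49, 0.35])

k = 3
order = np.argsort(row)[::-1]                   # indices from most to least similar
top_k = [int(i) for i in order if i != 2][:k]   # drop the query movie itself
print(top_k)
```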

In [44]:
for i in distance[0:5]:
    print(new_data.iloc[[i[0]]].title)

2    The Godfather
Name: title, dtype: object
4    The Godfather: Part II
Name: title, dtype: object
7419    Blood Ties
Name: title, dtype: object
153    Joker
Name: title, dtype: object
2624    Bomb City
Name: title, dtype: object


In [51]:
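def recommend(movie):
    # Row index of the requested title (raises IndexError if the title is not in new_data)
    index = new_data[new_data['title']==movie].index[0]
    # (row, score) pairs sorted from most to least similar
    distance = sorted(list(enumerate(similarity[index])), reverse=True, key=lambda pair: pair[1])
    # The first entry is the movie itself, since its self-similarity is 1
    for i in distance[0:5]:
        print(new_data.iloc[[i[0]]].title)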

In [52]:
recommend("Iron Man")

969    Iron Man
Name: title, dtype: object
3563    Iron Man 3
Name: title, dtype: object
962    Guardians of the Galaxy Vol. 2
Name: title, dtype: object
2100    Avengers: Age of Ultron
Name: title, dtype: object
1722    Star Wars: Episode III - Revenge of the Sith
Name: title, dtype: object
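The first result is always the queried movie itself, and each title is printed as a one-row Series. One possible refinement, sketched here on made-up data, returns plain title strings, skips the query movie and returns an empty list for an unknown title. `recommend_titles`, `toy_data` and `toy_sim` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

def recommend_titles(movie, data, sim, k=5):
    """Return the k titles most similar to `movie`, excluding the movie itself."""
    matches = data.index[data["title"] == movie]
    if len(matches) == 0:
        return []                            # unknown title: nothing to recommend
    pos = data.index.get_loc(matches[0])     # positional row in `sim`
    order = np.argsort(sim[pos])[::-1]       # most similar first
    picks = [int(i) for i in order if i != pos][:k]
    return data["title"].iloc[picks].tolist()

# Hypothetical toy inputs shaped like new_data and similarity.
toy_data = pd.DataFrame({"title": ["Iron Man", "Iron Man 3", "Joker", "Up"]})
toy_sim = np.array([
    [1.0, 0.6, 0.2, 0.1],
    [0.6, 1.0, 0.3, 0.0],
    [0.2, 0.3, 1.0, 0.1],
    [0.1, 0.0, 0.1, 1.0],
])

print(recommend_titles("Iron Man", toy_data, toy_sim, k=2))
print(recommend_titles("Not A Movie", toy_data, toy_sim))
```

With the notebook's objects, the call would be `recommend_titles("Iron Man", new_data, similarity)`.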
