# Stage 0 : Introduction

## Recommender System

A recommender system is a tool that suggests items (like movies, products, or content) to users based on their preferences or behaviors. It uses algorithms to analyze data and predict what the user might like based on past interactions or similarities with other users.

For example, Netflix recommending movies or Amazon suggesting products are both examples of recommender systems.

#### Types of Recommender System
- Content-based: describe each item with tags and recommend items whose tags are similar (content similarity).
- Collaborative filtering: recommend items based on the interests of users with similar behavior.
- Hybrid: a combination of the two.

#### Flow:
- Data -> Preprocessing -> Model -> Website -> Deployment

# Stage 1 : Data

In [81]:
import numpy as np
import pandas as pd

In [82]:
credits = pd.read_csv("./tmdb_5000_credits.csv")

In [83]:
movies = pd.read_csv("./tmdb_5000_movies.csv")

In [84]:
# credits.head()

In [85]:
# movies.head()

In [86]:
movies = movies.merge(credits, on='title')
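
Merging on `title` assumes that titles are unique. A minimal sketch with made-up frames (the rows below are hypothetical, not taken from the TMDB files) shows how a repeated title multiplies rows, and how merging on the numeric id avoids that. The `movie_id` column name is the one that appears in the merged output later in this notebook.

```python
import pandas as pd

# Hypothetical toy data: two different movies share the title "Batman".
toy_movies = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})
toy_credits = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})

# Merging on title pairs every "Batman" row with every other "Batman" row.
by_title = toy_movies.merge(toy_credits, on="title")

# Merging on the id keeps exactly one row per movie.
by_id = toy_movies.merge(toy_credits, left_on="id", right_on="movie_id")

print(len(by_title), len(by_id))
```

If the real files contain repeated titles, the same effect would add a few extra rows to the merged frame, so it may be worth comparing row counts before and after the merge.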

In [87]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

In [88]:
movies.iloc[1, :]

budget                                                          300000000
genres                  [{"id": 12, "name": "Adventure"}, {"id": 14, "...
homepage                     http://disney.go.com/disneypictures/pirates/
id                                                                    285
keywords                [{"id": 270, "name": "ocean"}, {"id": 726, "na...
original_language                                                      en
original_title                   Pirates of the Caribbean: At World's End
overview                Captain Barbossa, long believed to be dead, ha...
popularity                                                     139.082615
production_companies    [{"name": "Walt Disney Pictures", "id": 2}, {"...
production_countries    [{"iso_3166_1": "US", "name": "United States o...
release_date                                                   2007-05-19
revenue                                                         961000000
runtime                               

In [89]:
movies.head(1)['genres'].values

array(['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'],
      dtype=object)

In [90]:
movies.head(1)['cast'].values

array(['[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "ge

In [126]:
movies[movies['id'] == 19615]

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
3669,0,"[{""id"": 18, ""name"": ""Drama""}]",,19615,[],en,Flying By,A real estate developer goes to his 25th high ...,1.546169,[],...,95.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,It's about the music,Flying By,7.0,2,19615,"[{""cast_id"": 1, ""character"": ""George"", ""credit...",[]


In [91]:
movies.head(1)['crew'].values

array(['[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cam

# Stage 2 : Data Pre-Processing

### Very Important (based on Domain Knowledge)

- genres
- keywords -> tags
- title
- overview -> Important for content similarity
- cast
- crew

### Neutral

- id -> used for website
- budget -> Compact | Mainstream | Epic
- popularity -> `numeric measure`
- production_companies
- production_countries
- release_date
- revenue ("(revenue-budget) / budget")
- vote_average `numeric measure`
- vote_count `numeric measure`


### Useless

- homepage
- original_title -> we use `title`
- runtime - `long , normal , short`
- spoken_languages 
- status
- tagline (vague)
- movie_id

In [92]:
movies.head(1)['keywords'].values

array(['[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'],
      dtype=object)

In [93]:
movies['original_language'].value_counts()

original_language
en    4510
fr      70
es      32
zh      27
de      27
hi      19
ja      16
it      14
ko      12
cn      12
ru      11
pt       9
da       7
sv       5
nl       4
fa       4
th       3
he       3
ta       2
cs       2
ro       2
id       2
ar       2
vi       1
sl       1
ps       1
no       1
ky       1
hu       1
pl       1
af       1
nb       1
tr       1
is       1
xx       1
te       1
el       1
Name: count, dtype: int64

---

In [94]:
# Selecting only important columns
movies_data = movies[['genres', 'keywords', 'title', 'overview', 'id', 'cast','crew' ]]

In [95]:
# checking null values
movies_data.isnull().sum()
# movies_data.dropna(inplace=True)

genres      0
keywords    0
title       0
overview    3
id          0
cast        0
crew        0
dtype: int64

In [96]:
# checking duplicate data points
movies_data.duplicated().sum()

0

#### Plan: transform the columns into the desired format

- We take the `overview` column and append the genres, keywords, cast and director to the end of it, which gives one larger block of text (the tags) per movie.
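
As a rough sketch of that plan on a single record: the dictionary below is a hypothetical, shortened stand-in for one row, and `build_tags` is an illustrative helper name that does not appear in the notebook.

```python
# Hypothetical toy record; field values are shortened for illustration.
record = {
    "overview": "A paraplegic Marine is dispatched to the moon Pandora.",
    "genres": ["Action", "Science Fiction"],
    "keywords": ["space war"],
}

def build_tags(rec):
    """Join the overview words with space-free genre and keyword tokens."""
    tokens = rec["overview"].split()
    for field in ("genres", "keywords"):
        tokens += [name.replace(" ", "") for name in rec[field]]
    return " ".join(tokens).lower()

tags = build_tags(record)
print(tags)
```

The notebook reaches the same result step by step on whole DataFrame columns rather than one record at a time.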

##### 1. transforming `genres` & `keywords`

In [97]:
movies_data.iloc[0]['genres']

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

- FROM THIS (genres column and keywords column): 

```
'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'
```
> string of list of dictionaries
>> 1. string to list,
>> 2. extract 'name' from list of dict, make a list of that.

- TO THIS : 
```
["Action", "Adventure", "Fantasy", "Science Fiction"]
```

In [98]:
import ast

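# First helper function: extract the 'name' of every dict in the stringified list

def dicToList(obj):
    names = []  # avoid naming this `list`, which would shadow the built-in
    for i in ast.literal_eval(obj):
        names.append(i['name'])
    return names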

- We use `ast.literal_eval()` to convert the string into a Python list of dictionaries.
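
A small standalone sketch of that conversion, using a shortened version of the genres string shown above. `ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for data read from a file. Because this particular string is also valid JSON, `json.loads` is shown as an alternative; whether every row in the dataset is valid JSON is an assumption.

```python
import ast
import json

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = ast.literal_eval(raw)               # str -> list of dicts
names = [entry["name"] for entry in parsed]  # keep only the 'name' values

# For this sample, parsing as JSON gives the same structure.
same_as_json = json.loads(raw) == parsed

print(names, same_as_json)
```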

In [99]:
movies_data['genres'] = movies_data['genres'].apply(dicToList)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_data['genres'] = movies_data['genres'].apply(dicToList)
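
This warning typically appears when a DataFrame created by selecting columns from another frame is modified afterwards. A common way to avoid it, sketched here on a hypothetical toy frame rather than the real data, is to take an explicit copy at selection time.

```python
import pandas as pd

# Hypothetical stand-in for the merged `movies` frame.
movies_toy = pd.DataFrame({
    "title": ["Avatar"],
    "genres": ['[{"id": 28, "name": "Action"}]'],
    "budget": [237000000],
})

# .copy() makes the selection an independent DataFrame, so later
# column assignments are unambiguous and do not raise the warning.
subset = movies_toy[["title", "genres"]].copy()
subset["genres"] = subset["genres"].apply(len)

print(subset)
```

Applied to this notebook, that would mean adding `.copy()` where `movies_data` is created; the assignments themselves could stay as they are.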


In [100]:
movies_data['keywords'] = movies_data['keywords'].apply(dicToList)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_data['keywords'] = movies_data['keywords'].apply(dicToList)


In [101]:
movies_data.head(2)

Unnamed: 0,genres,keywords,title,overview,id,cast,crew
0,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...",Avatar,"In the 22nd century, a paraplegic Marine is di...",19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


#### Now we need to modify the cast and crew columns.
- From `cast`, we take the first three actors.
- From `crew`, we take the director's name.

In [102]:
movies_data['cast'][0]

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

In [103]:
# Cast helper function: keep the names of the first three cast members
def castHelper(obj):
    names = []  # avoid naming this `list`, which would shadow the built-in
    for i in ast.literal_eval(obj)[:3]:
        names.append(i['name'])
    return names

In [104]:
movies_data['cast'] = movies_data['cast'].apply(castHelper)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_data['cast'] = movies_data['cast'].apply(castHelper)


In [105]:
movies_data['cast'][0]

['Sam Worthington', 'Zoe Saldana', 'Sigourney Weaver']

In [106]:
movies_data['crew'][0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [107]:
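import ast
# Crew helper function
# Looks for the crew entry with 'job' : 'Director'

def fetch_director(obj):
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            return [i['name']]
    # No director listed: return None, which shows up as a missing value
    return None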

In [108]:
movies_data['Director'] = movies_data['crew'].apply(fetch_director)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_data['Director'] = movies_data['crew'].apply(fetch_director)


In [125]:
movies_data[movies_data['Director'].isnull()]


Unnamed: 0,genres,keywords,title,overview,id,cast,crew,Director
3669,[Drama],[],Flying By,"[A, real, estate, developer, goes, to, his, 25...",19615,"[BillyRayCyrus, HeatherLocklear, AhnaiseChrist...",[],
3678,[Family],[],Running Forever,"[After, being, estranged, since, her, mother's...",447027,[],[],
3736,"[Drama, Family, Foreign]",[],Paa,"[He, suffers, from, a, progeria, like, syndrom...",26379,"[AmitabhBachchan, AbhishekBachchan, VidyaBalan]","[{""credit_id"": ""52fe44fec3a368484e042a29"", ""de...",
3984,"[Comedy, Drama, Romance]",[independentfilm],Boynton Beach Club,"[A, handful, of, men, and, women, of, a, certa...",55831,"[BrendaVaccaro, DyanCannon, JosephBologna]",[],
4075,[],[],Sharkskin,"[The, Post, War, II, story, of, Manhattan, bor...",371085,[],[],
4112,[],[],"The Book of Mormon Movie, Volume 1: The Journey","[The, story, of, Lehi, and, his, wife, Sariah,...",48382,"[KirbyHeyborne, MichaelFlynn]",[],
4125,[],[],Hum To Mohabbat Karega,"[Raju,, a, waiter,, is, in, love, with, the, f...",325140,[],[],
4130,"[Animation, Family, Foreign]",[],Roadside Romeo,"[This, is, the, story, of, Romeo., A, dude, wh...",20653,"[SaifAliKhan, KareenaKapoor, JavedJaffrey]",[],
4254,"[Romance, Comedy, Drama]",[],Me You and Five Bucks,"[A, womanizing, yet, lovable, loser,, Charlie,...",361505,[],[],
4311,"[Comedy, Music]",[],Down & Out With The Dolls,"[The, raunchy,, spunky, tale, of, the, rise, a...",114065,[],[],
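
Rows such as these have no crew entry with the job `Director`, so `fetch_director` returns nothing for them and the `Director` value is missing. One alternative, shown only as a sketch and not what this notebook does, is to return an empty list so that such rows can still be combined with the other tag lists. The function name `fetch_director_or_empty` is hypothetical.

```python
import ast

def fetch_director_or_empty(obj):
    """Return [director name], or [] when no crew entry has job == 'Director'."""
    for member in ast.literal_eval(obj):
        if member.get("job") == "Director":
            return [member["name"]]
    return []

with_director = (
    '[{"job": "Editor", "name": "Stephen E. Rivkin"}, '
    '{"job": "Director", "name": "James Cameron"}]'
)

print(fetch_director_or_empty(with_director))
print(fetch_director_or_empty("[]"))
```

Whether to keep or drop movies without a director is a design choice; keeping them gives more titles to recommend, at the cost of slightly thinner tags for those rows.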


In [110]:
# import ast
# # Helper function
# def crewHelper(obj):
#     list = []
#     for i in ast.literal_eval(obj):
#         if i['job'] == 'Director':
#             list.append(i['name'])
#             break
#     return list

In [111]:
# movies_data['crew'] = movies_data['crew'].apply(crewHelper)

#### Converting Overview column (string) to List.

In [112]:
type(movies_data['overview'])

pandas.core.series.Series

In [113]:
movies_data['overview'] = movies_data['overview'].apply(lambda x: x.split() if isinstance(x, str) else [])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_data['overview'] = movies_data['overview'].apply(lambda x: x.split() if isinstance(x, str) else [])


---
- Problem: multi-word names such as "Science Fiction" would later be split into separate tokens, so we remove the spaces -> "ScienceFiction".
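
A short sketch of why this matters once the text is split into words. The two names are used purely as an illustration, and `tokens` is a hypothetical helper.

```python
def tokens(names, strip_spaces):
    """Turn a list of names into a set of lowercase word tokens."""
    if strip_spaces:
        names = [n.replace(" ", "") for n in names]
    return set(" ".join(names).lower().split())

movie_a = ["Sam Worthington"]
movie_b = ["Sam Mendes"]

# With spaces kept, the two different people share the token "sam".
shared_with_spaces = tokens(movie_a, False) & tokens(movie_b, False)

# With spaces removed, each full name is one token and nothing overlaps.
shared_without_spaces = tokens(movie_a, True) & tokens(movie_b, True)

print(shared_with_spaces, shared_without_spaces)
```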

In [114]:
movies_data['genres'] = movies_data['genres'].apply(lambda x: [ i.replace(" ", "") for i in x])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_data['genres'] = movies_data['genres'].apply(lambda x: [ i.replace(" ", "") for i in x])


In [115]:
movies_data.head(2)

Unnamed: 0,genres,keywords,title,overview,id,cast,crew,Director
0,"[Action, Adventure, Fantasy, ScienceFiction]","[culture clash, future, space war, space colon...",Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...",19995,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",[James Cameron]
1,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...",Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...",285,"[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",[Gore Verbinski]


In [127]:
movies_data['keywords'] = movies_data['keywords'].apply(lambda x: [ i.replace(" ", "") for i in x])
movies_data['cast'] = movies_data['cast'].apply(lambda x: [ i.replace(" ", "") for i in x])
movies_data['Director'] = movies_data['Director'].apply(
    lambda x: [i.replace(" ", "") for i in x] if isinstance(x, list) and x else x
)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_data['keywords'] = movies_data['keywords'].apply(lambda x: [ i.replace(" ", "") for i in x])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_data['cast'] = movies_data['cast'].apply(lambda x: [ i.replace(" ", "") for i in x])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_data['

In [128]:
movies_data.head(2)

Unnamed: 0,genres,keywords,title,overview,id,cast,crew,Director
0,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...",Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...",19995,"[SamWorthington, ZoeSaldana, SigourneyWeaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",[JamesCameron]
1,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...",Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...",285,"[JohnnyDepp, OrlandoBloom, KeiraKnightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",[GoreVerbinski]


#### Finally, the `tags` column

In [129]:
movies_data['tags'] = movies_data['overview'] + movies_data['genres'] + movies_data['keywords'] + movies_data['cast'] + movies_data['Director']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_data['tags'] = movies_data['overview'] + movies_data['genres'] + movies_data['keywords'] + movies_data['cast'] + movies_data['Director']


In [130]:
movies_data['tags'] 

0       [In, the, 22nd, century,, a, paraplegic, Marin...
1       [Captain, Barbossa,, long, believed, to, be, d...
2       [A, cryptic, message, from, Bond’s, past, send...
3       [Following, the, death, of, District, Attorney...
4       [John, Carter, is, a, war-weary,, former, mili...
                              ...                        
4804    [El, Mariachi, just, wants, to, play, his, gui...
4805    [A, newlywed, couple's, honeymoon, is, upended...
4806    ["Signed,, Sealed,, Delivered", introduces, a,...
4807    [When, ambitious, New, York, attorney, Sam, is...
4808    [Ever, since, the, second, grade, when, he, fi...
Name: tags, Length: 4809, dtype: object

In [131]:
new_df = movies_data[['id', 'title', 'tags']]

In [134]:
new_df['tags'].isnull().sum()

30

In [135]:
new_df = new_df.dropna(subset=['tags'])  # drop the rows whose tags are missing

In [136]:
new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x))

In [137]:
new_df['tags'] = new_df['tags'].apply(lambda x: x.lower())

In [138]:
new_df['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'

In [139]:
new_df['tags'][1]

"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger johnnydepp orlandobloom keiraknightley goreverbinski"

In [140]:
new_df['tags'].shape

(4779,)

In [141]:
# new_df.to_csv('preprocessed_5000_tmdb_movies_data.csv', index=False)

#### How do we calculate a similarity score between two of these texts?

1. Count the number of words they share.
2. Vectorization (converting text to a vector). {many techniques, e.g. Bag of Words}
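
To make the second option concrete, here is a minimal pure-Python sketch of Bag of Words on two made-up tag strings. The documents and the helper name `to_vector` are illustrative only.

```python
from collections import Counter

docs = ["space alien battle", "space war alien alien"]

# Vocabulary: every distinct word across all documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_vector(doc) for doc in docs]

print(vocab)    # ['alien', 'battle', 'space', 'war']
print(vectors)  # [[1, 1, 1, 0], [2, 0, 1, 1]]
```

Each document becomes a list of counts in the same word order, which is what makes the two texts comparable as vectors.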

#### Similarity Measures:
- Cosine Similarity
> - Works well when the data has high-dimensional features
> - Suitable for continuous or ordinal features
> - Insensitive to the scale and magnitude of the feature vectors
> - Captures only the direction (angle) between vectors, not their magnitude
- Jaccard Similarity
> - Less informative for word counts, since it compares sets and ignores how often a word occurs
> - Suitable for sparse, binary (present / absent) data
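
Both measures can be sketched in a few lines of plain Python. The vectors and word lists below are made up; the functions follow the standard definitions (dot product over the product of norms, and intersection size over union size).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

u = [1, 1, 1, 0]
v = [2, 2, 2, 0]  # same direction as u, twice the magnitude

cos_uv = cosine_similarity(u, v)
jac = jaccard_similarity("space alien battle".split(), "space war alien".split())

print(cos_uv, jac)
```

Doubling every count leaves the cosine score at (approximately) 1, which illustrates that cosine similarity depends on direction rather than magnitude.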

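##### Vectorization - Bag of words.
- We generate one large text by combining the tags of all rows.
- We find the M most common words (the vocabulary).
- We do not consider stop words.
- Each movie becomes a vector of counts over that vocabulary.
- We use `CountVectorizer` from scikit-learn.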

In [147]:
# Print NumPy arrays in full instead of truncating them
np.set_printoptions(threshold=np.inf)

from sklearn.feature_extraction.text import CountVectorizer
# TODO: Read CountVectorizer Docs

In [148]:
cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(new_df['tags']).toarray() # fit_transform returns a sparse matrix; toarray() makes it dense
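
For intuition, the same calls can be tried on a tiny made-up corpus. The last step, scikit-learn's `cosine_similarity`, is one plausible follow-up for the similarity measure discussed earlier; it is not part of this notebook, so treat it as an assumption about where the workflow goes next.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy tags, one string per movie.
toy_tags = [
    "space alien battle marine",
    "space alien war soldier",
    "romance wedding comedy",
]

toy_cv = CountVectorizer(stop_words="english")
sparse_counts = toy_cv.fit_transform(toy_tags)  # SciPy sparse matrix
dense_counts = sparse_counts.toarray()          # dense NumPy array

# Entry [i, j] compares movie i with movie j.
similarity = cosine_similarity(sparse_counts)

print(dense_counts.shape)
print(similarity.round(2))
```

The first two toy movies share two of their four words and score 0.5, while the third shares none and scores 0.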

In [149]:
cv.get_feature_names_out() # the vocabulary kept by the vectorizer (at most 5000 words)

array(['000', '007', '10', '100', '11', '12', '13', '14', '15', '16',
       '17', '18', '18th', '19', '1930s', '1940s', '1950s', '1960s',
       '1970s', '1980', '1980s', '1985', '1990s', '1999', '19th',
       '19thcentury', '20', '200', '2003', '2009', '20th', '21st', '23',
       '24', '25', '30', '300', '3d', '40', '50', '500', '60', '60s',
       '70', '70s', 'aaron', 'aaroneckhart', 'abandoned', 'abducted',
       'abigailbreslin', 'abilities', 'ability', 'able', 'aboard',
       'abuse', 'abusive', 'academic', 'academy', 'accept', 'accepted',
       'accepts', 'access', 'accident', 'accidental', 'accidentally',
       'accompanied', 'accomplish', 'account', 'accountant', 'accused',
       'ace', 'achieve', 'act', 'acting', 'action', 'actionhero',
       'actions', 'activist', 'activities', 'activity', 'actor', 'actors',
       'actress', 'acts', 'actual', 'actually', 'adam', 'adams',
       'adamsandler', 'adamshankman', 'adaptation', 'adapted', 'addict',
       'addicted', 'ad

#### Problem 1: We have `'accident', 'accidental', 'accidentally'` as separate words.

Solution: Stemming, which reduces each word to its stem
- e.g. ['love', 'loved', 'loving']
- becomes ['love', 'love', 'love']

In [152]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [155]:
# testing helper function.
for i in ['love', 'loved', 'loving']:
    print(ps.stem(i))

love
love
love


In [160]:
# making stem helper function

# we pass the text of one tags entry and stem every word in it.
def stem(text):
    y = []

    for i in text.split():
        y.append(ps.stem(i))

    return " ".join(y)

In [161]:
new_df['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'

> new_df['tags'][0]

```'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'```

> stem(new_df['tags'][0])

```'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav jamescameron'```

In [163]:
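# assign the result back, otherwise the saved file keeps the unstemmed tags
new_df['tags'] = new_df['tags'].apply(stem)
new_df['tags']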

0       in the 22nd century, a parapleg marin is dispa...
1       captain barbossa, long believ to be dead, ha c...
2       a cryptic messag from bond’ past send him on a...
3       follow the death of district attorney harvey d...
4       john carter is a war-weary, former militari ca...
                              ...                        
4804    el mariachi just want to play hi guitar and ca...
4805    a newlyw couple' honeymoon is upend by the arr...
4806    "signed, sealed, delivered" introduc a dedic q...
4807    when ambiti new york attorney sam is sent to s...
4808    ever sinc the second grade when he first saw h...
Name: tags, Length: 4779, dtype: object

In [165]:
new_df.to_csv('preprocessed_5000_tmdb_movies_data.csv', index=False)