# [Getting Started with NLP](https://dphi.tech/bootcamps/getting-started-with-natural-language-processing?utm_source=header)
by [CSpanias](https://cspanias.github.io/aboutme/), 28/01 - 06/02/2022 <br>

Bootcamp organized by **[DPhi](https://dphi.tech/community/)**, lectures given by [**Dipanjan (DJ) Sarkar**](https://www.linkedin.com/in/dipanzan/) ([GitHub repo](https://github.com/dipanjanS/nlp_essentials)) <br>

## Fundamental Tutorials for NLP:
* [NLTK Book](https://www.nltk.org/book/)
* [spaCy Tutorials](https://course.spacy.io/en/chapter1)

# CONTENT
1. Text Wrangling
2. Text Representation with Feature Engineering - Statistical Models
3. Text Representation with Feature Engineering - Deep Learning Models
4. [NLP Applications - Recommender Systems](#RecSys)
    1. [Content-Based Recommendation System](#CBRS)
    2. [Load and View Data](#Data)
    3. [NLP Pipeline](#Pipeline)
        1. [Text Pre-processing](#TextPre)
        1. [Feature Engineering - Extracting TF-IDF Features](#FeaEng)
        1. [Document Similarity Computation](#Similarity)
        1. [Find Top Similar Movies](#SimilarMovies)
            1. [Find Movie's ID](#ID)
            2. [Get Movie's Similarities](#MSim)
            3. [Get the 5 Most Similar Movies](#MostSim)
        1. [Build a Movie Recommendation Function](#Function)

<a name="RecSys"></a>
# 4. NLP Applications - Recommender Systems

**Recommender systems** are among the most popular and widely adopted applications of machine learning. They are typically used to **recommend entities to users**, and these entities can be anything: products, movies, services, and so on.

Examples of recommendation systems include **Amazon suggesting products** on its website, **Netflix recommending movies**, **YouTube recommending videos**, etc.

Recommender systems are usually categorized as:

1. **Simple Rule-based Recommenders** <br>
They are typically based on **specific global metrics and thresholds** like movie popularity, global ratings etc.


2. **Content-based Recommenders** <br>
These **recommend entities similar to a specific entity of interest**. To achieve that, **content metadata can be used**, such as movie descriptions, genre, cast, director, etc.


3. **Collaborative filtering Recommenders** <br>
They try to **predict recommendations and ratings based on past ratings** of different users and specific items.
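
As a rough illustration of the first category, a rule-based recommender can be as small as a filter plus a sort. The sketch below is not part of the bootcamp material: the titles, the numbers, and the `min_votes` threshold are all made up for the example.

```python
# Toy catalogue: every title and number here is invented for illustration.
movies = [
    {"title": "Movie A", "vote_average": 8.1, "vote_count": 1200},
    {"title": "Movie B", "vote_average": 9.0, "vote_count": 15},
    {"title": "Movie C", "vote_average": 7.4, "vote_count": 5400},
    {"title": "Movie D", "vote_average": 6.2, "vote_count": 900},
]


def rule_based_top_n(catalogue, min_votes=100, n=2):
    """Return the n best-rated titles that have at least min_votes votes."""
    # global threshold: ignore movies with too few votes to trust the rating
    eligible = [m for m in catalogue if m["vote_count"] >= min_votes]
    # global metric: rank the remaining movies by average rating
    ranked = sorted(eligible, key=lambda m: m["vote_average"], reverse=True)
    return [m["title"] for m in ranked[:n]]


top_titles = rule_based_top_n(movies)
print(top_titles)
```

Every user gets the same list here, which is what separates this category from the content-based and collaborative approaches.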

<a name="CBRS"></a>
# 4.1 Content-Based Recommendation System

We will be building a **movie recommendation system based on data/metadata** pertaining to different movies.

Since our focus is not really recommendation engines but **NLP**, we will be leveraging the **text-based metadata for each movie** to try and recommend similar movies based on specific movies of interest. 

This falls under **content-based recommenders**. 

Install required **dependencies**.

In [2]:
#!pip install textsearch
#!pip install contractions
#nltk.download('punkt')
#nltk.download('stopwords')

<a name="Data"></a>
# 4.2 Load and View Data

In [14]:
import pandas as pd

# read data as dataframe
df = pd.read_csv('https://github.com/CSpanias/nlp_resources/blob/main/dphi_nlp_bootcamp/tmdb_5000_movies.csv.gz?raw=true',
                 compression='gzip')

# get basic info about df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [15]:
# check the 1st 5 rows
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [19]:
# create a new df with the chosen columns (copy to avoid SettingWithCopyWarning)
df = df[['title', 'tagline', 'overview', 'popularity']].copy()

# replace NaN values in the 'tagline' column with ''
df['tagline'] = df['tagline'].fillna('')

# create new column by combining 'tagline' and 'overview' columns
df['description'] = df['tagline'].map(str) + ' ' + df['overview']

# drop all missing values
df.dropna(inplace=True)

# sort values based on 'popularity' column
df = df.sort_values(by=['popularity'], ascending=False)

# check basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 546 to 4553
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        4800 non-null   object 
 1   tagline      4800 non-null   object 
 2   overview     4800 non-null   object 
 3   popularity   4800 non-null   float64
 4   description  4800 non-null   object 
dtypes: float64(1), object(4)
memory usage: 225.0+ KB


In [20]:
# check the 1st 5 rows
df.head()

Unnamed: 0,title,tagline,overview,popularity,description
546,Minions,"Before Gru, they had a history of bad bosses","Minions Stuart, Kevin and Bob are recruited by...",875.581305,"Before Gru, they had a history of bad bosses M..."
95,Interstellar,Mankind was born on Earth. It was never meant ...,Interstellar chronicles the adventures of a gr...,724.247784,Mankind was born on Earth. It was never meant ...
788,Deadpool,Witness the beginning of a happy ending,Deadpool tells the origin story of former Spec...,514.569956,Witness the beginning of a happy ending Deadpo...
94,Guardians of the Galaxy,All heroes start somewhere.,"Light years from Earth, 26 years after being a...",481.098624,All heroes start somewhere. Light years from E...
127,Mad Max: Fury Road,What a Lovely Day.,An apocalyptic story set in the furthest reach...,434.278564,What a Lovely Day. An apocalyptic story set in...


<a name="Pipeline"></a>
# 4.3 NLP Pipeline

The steps below will form our **NLP pipeline** for building our recommender system:
1. [Text Pre-processing](#TextPre)
1. [Feature Engineering - Extracting TF-IDF Features](#FeaEng)
1. [Document Similarity Computation](#Similarity)
1. [Find Top Similar Movies](#SimilarMovies)
1. [Build a Movie Recommendation Function](#Function)

Recommendations are about **understanding the underlying features** that make us favour one choice over another; **similarity metrics** can help with that. 

One popular and widely-used similarity metric is the **Cosine Similarity**.

_More info about Cosine Similarity can be found [here](https://medium.com/geekculture/cosine-similarity-and-cosine-distance-48eed889a5c4#:~:text=Cosine%20similarity%20is%20a%20metric,in%20a%20multi%2Ddimensional%20space.&text=As%20the%20cosine%20similarity%20measurement,more%20similar%20to%20each%20other.)._
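
In short, cosine similarity is the dot product of two vectors divided by the product of their lengths, i.e. the cosine of the angle between them. The sketch below works this out by hand on three made-up vectors; `cosine_sim` is a hypothetical helper written for this example, not the scikit-learn function used later.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, twice the length
doc_c = [0.0, 0.0, 3.0]  # shares no non-zero dimension with doc_a

sim_ab = cosine_sim(doc_a, doc_b)
sim_ac = cosine_sim(doc_a, doc_c)
print(round(sim_ab, 6), round(sim_ac, 6))
```

Vectors pointing in the same direction score (up to floating-point rounding) 1 regardless of their length, and vectors with no overlapping dimensions score 0. For non-negative vectors such as TF-IDF features, all scores fall between those two values.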

<a name="TextPre"></a>
## 4.3.1 Text Pre-Processing

In [23]:
import nltk
import re
import numpy as np
import contractions

# load stopwords default nltk list
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    """Normalize the document by performing basic text pre-processing tasks."""
    
    # remove special characters (flags must be passed by keyword; the 4th positional argument is count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    # lower case letters
    doc = doc.lower()
    # remove leading and trailing whitespace
    doc = doc.strip()
    # expand contractions
    doc = contractions.fix(doc)
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # remove stopwords
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from tokens
    doc = ' '.join(filtered_tokens)
    return doc

# vectorize function for faster computations
normalize_corpus = np.vectorize(normalize_document)

# normalize the 'description' column
norm_corpus = normalize_corpus(list(df['description']))

# check the length (rows) of corpus
print("The length of the normalized corpus is: {} rows.".format(len(norm_corpus)))

The length of the normalized corpus is: 4800 rows.
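
One detail of `re.sub` is easy to miss when writing a normalizer like this: its signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag passed as the fourth positional argument is interpreted as `count`. The sketch below uses a made-up string to show the difference. It spells the positional case out as `count=int(re.I)`, which is what the positional form amounts to, because recent Python versions warn about passing `count` positionally.

```python
import re

text = "Hello World"

# re.I has the integer value 2, so passing it in the `count` slot means
# "replace at most 2 matches", and the match stays case-sensitive.
as_count = re.sub(r"[a-z]", "", text, count=int(re.I))

# Passing the flag by keyword applies it as a flag.
as_flags = re.sub(r"[a-z]", "", text, flags=re.I)

print(repr(as_count))
print(repr(as_flags))
```

In the first call only two lowercase letters are removed; in the second, every letter is removed and only the space is left.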


<a name="FeaEng"></a>
## 4.3.2 Feature Engineering - Extracting TF-IDF Features

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

# instantiate Vectorizer
tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)

# fit vectorizer to our corpus
tfidf_matrix = tf.fit_transform(norm_corpus)

# check shape
print("The shape of our tf-idf matrix is {} rows and {} columns.".format(tfidf_matrix.shape[0], tfidf_matrix.shape[1]))

The shape of our tf-idf matrix is 4800 rows and 20471 columns.
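
To get a feel for what `ngram_range=(1, 2)` and `min_df=2` do, here is a small sketch on a made-up three-document corpus (the `toy_*` names are hypothetical). Unigrams and bigrams are both extracted, but a term is kept only if it appears in at least two documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: only the word 'space' occurs in more than one document.
toy_corpus = ["space pirates", "space cowboys", "yellow minions"]

# Same settings as above: unigrams + bigrams, document frequency >= 2.
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# Terms that survived the min_df filter
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)
print(toy_matrix.toarray())
```

Only `space` survives the filter, so the matrix has a single column, and the third document ends up with an all-zero row. On the real corpus the same filter is what keeps the vocabulary at a manageable size despite the bigrams.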


<a name="Similarity"></a>
## 4.3.3 Document Similarity Computation

In [25]:
from sklearn.metrics.pairwise import cosine_similarity

# calculate cosine similarity
doc_sim = cosine_similarity(tfidf_matrix)

# convert result to pandas DataFrame
doc_sim_df = pd.DataFrame(doc_sim)

# check the 1st 5 rows
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,1.0,0.0,0.0,0.0,0.006071,0.008067,0.0,0.0,0.0,0.0,...,0.018758,0.0,0.03793,0.0,0.0,0.0,0.0,0.0,0.0,0.009646
1,0.0,1.0,0.0,0.017839,0.007968,0.0,0.0,0.012501,0.0,0.01484,...,0.0,0.0,0.017564,0.0,0.019152,0.0,0.0,0.0,0.0,0.007963
2,0.0,0.0,1.0,0.0,0.017178,0.0,0.0,0.0,0.0,0.024326,...,0.0,0.006903,0.005024,0.0,0.012893,0.0,0.025975,0.0,0.027126,0.00934
3,0.0,0.017839,0.0,1.0,0.0,0.022414,0.0,0.0,0.0,0.037207,...,0.0,0.060846,0.025039,0.0,0.036237,0.030516,0.022605,0.0,0.0,0.0
4,0.006071,0.007968,0.017178,0.0,1.0,0.004673,0.0,0.064581,0.0,0.0,...,0.022064,0.019662,0.036561,0.0,0.015826,0.0,0.076033,0.004516,0.043475,0.011465


<a name="SimilarMovies"></a>
## 4.3.4 Find Top Similar Movies

Get a list of **movie titles**.

In [30]:
# create a list with the movie titles
movies_list = df['title'].values

# check the first 5 movie titles
print("The first 5 movie titles are:\n{}\n".format(movies_list[:5]))

# check shape
print("The list with the movie titles has {} rows.".format(movies_list.shape[0]))

The first 5 movie titles are:
['Minions' 'Interstellar' 'Deadpool' 'Guardians of the Galaxy'
 'Mad Max: Fury Road']

The list with the movie titles has 4800 rows.


Find the **top similar movies for a sample movie** with the following process:

1. [Find Movie's ID](#ID)
2. [Get Movie's Similarities](#MSim)
3. [Get the 5 Most Similar Movies](#MostSim)

<a name="ID"></a>
### 4.3.4.1 Find Movie's ID

In [33]:
# find the id of 'Minions'
movie_idx = np.where(movies_list == 'Minions')[0][0]

# check id
print("The index of the movie 'Minions' is: {}.".format(movie_idx))

The index of the movie 'Minions' is: 0.


Our movie list is **sorted by movie popularity** and the movie **Minions is the most popular movie**, thus, its **index is 0**.

<a name="MSim"></a>
### 4.3.4.2 Get Movie's Similarities

In [32]:
# find the movie similarity for 'Minions'
movie_similarities = doc_sim_df.iloc[movie_idx].values

# check the 1st 5 similarities
print("The first five movie similarities for 'Minions' are:\n{}".format(movie_similarities[:5]))

The first five movie similarities for 'Minions' are:
[1.         0.         0.         0.         0.00607053]


These are just **the first 5 similarities**, **not the 5 most similar movies** to 'Minions'. 

Notice that the first entry has a **similarity score of 1**, a perfect score, because it **compares 'Minions' with itself**.

<a name="MostSim"></a>
### 4.3.4.3 Get the 5 Most Similar Movies

In [42]:
# sort movies' index with most similar movies first
similar_movie_idxs = np.argsort(-movie_similarities)[1:6]

# check movies' indices
print("The indices of the 5 most similar movies to 'Minions' are: {}.\n".format(similar_movie_idxs))

# get the movies' names from the indices
similar_movies = movies_list[similar_movie_idxs]

# check movies' names
print("The 5 most similar movies to 'Minions' are:\n{}.".format(similar_movies))

The indices of the 5 most similar movies to 'Minions' are: [ 33  60 737 490 298].

The 5 most similar movies to 'Minions' are:
['Despicable Me 2' 'Despicable Me'
 'Teenage Mutant Ninja Turtles: Out of the Shadows' 'Superman'
 'Rise of the Guardians'].
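
The `np.argsort(-movie_similarities)[1:6]` idiom can be traced on a small made-up array. `np.argsort` returns indices in ascending order of value, so negating the scores gives a descending ranking. The slice then skips position 0, which is expected to hold the query movie itself (similarity 1).

```python
import numpy as np

# Made-up similarity scores; index 0 plays the role of the query movie.
toy_sims = np.array([1.0, 0.10, 0.45, 0.0, 0.30])

# Negating the scores makes argsort rank from highest to lowest similarity.
ranked = np.argsort(-toy_sims)

# Skip the first position (the query's own index) and keep the next two.
top_2 = ranked[1:3]

print(ranked)
print(top_2)
```

Note that slicing from position 1 assumes the query ranks first. If another movie had an identical description, it would also score 1, and the order among tied scores is not something this idiom controls.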


<a name="Function"></a>
## 4.3.5 Build a Movie Recommendation Function

We will build a movie recommender function that **recommends the top 5 similar movies for any movie**. It takes as **input**:
1. Movie Title
2. List with All Movie Titles
3. Document Similarity Matrix

In [43]:
def movie_recommender(movie_title, movies=movies_list, doc_sims=doc_sim_df):
    """Recommend the top five similar movies for a movie of interest."""
    
    # find movie id
    movie_idx = np.where(movies == movie_title)[0][0]
    # get movie similarities
    movie_similarities = doc_sims.iloc[movie_idx].values
    # get top 5 similar movie IDs
    similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
    # get top 5 most similar movies
    similar_movies = movies[similar_movie_idxs]
    # return top 5 most similar movies
    return similar_movies
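
The function above raises an `IndexError` when the title is not in the list, because `np.where(...)[0]` is then empty. A slightly more defensive variant is sketched below, tested on made-up data. `safe_movie_recommender` and its `top_n` parameter are hypothetical names introduced for this example; the variant also removes the query movie by index instead of relying on it being ranked first.

```python
import numpy as np


def safe_movie_recommender(movie_title, movies, doc_sims, top_n=5):
    """Like movie_recommender, but returns an empty array for unknown titles."""
    matches = np.where(movies == movie_title)[0]
    if matches.size == 0:
        # unknown title: nothing to recommend
        return np.array([], dtype=movies.dtype)
    movie_idx = matches[0]
    # np.asarray accepts both a NumPy matrix and a pandas DataFrame
    similarities = np.asarray(doc_sims)[movie_idx]
    ranked = np.argsort(-similarities)
    # drop the query movie explicitly, then keep the top_n best matches
    ranked = ranked[ranked != movie_idx][:top_n]
    return movies[ranked]


# Made-up titles and similarity matrix for a quick check
toy_titles = np.array(["A", "B", "C"])
toy_doc_sims = np.array([[1.0, 0.2, 0.7],
                         [0.2, 1.0, 0.1],
                         [0.7, 0.1, 1.0]])

found = safe_movie_recommender("A", toy_titles, toy_doc_sims, top_n=2)
missing = safe_movie_recommender("Z", toy_titles, toy_doc_sims)
print(found)
print(missing)
```

Returning an empty array is only one possible design choice; raising a clearer error or suggesting close title matches would be equally reasonable.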

We can test the function by getting **movie recommendations for 20 popular movies in our list**.

In [53]:
# create a list with the 20 movie titles
popular_movies = ['Minions', 'Interstellar', 'Deadpool', 'Jurassic World', 'Pirates of the Caribbean: The Curse of the Black Pearl',
              'Dawn of the Planet of the Apes', 'The Hunger Games: Mockingjay - Part 1', 'Terminator Genisys', 
              'Captain America: Civil War', 'The Dark Knight', 'The Martian', 'Batman v Superman: Dawn of Justice', 
              'Pulp Fiction', 'The Godfather', 'The Shawshank Redemption', 'The Lord of the Rings: The Fellowship of the Ring',  
              'Harry Potter and the Chamber of Secrets', 'Star Wars', 'The Hobbit: The Battle of the Five Armies',
              'Iron Man']

# check the length of the list
print("The list of the popular movies includes {} movies.\n".format(len(popular_movies)))

# for every movie in our list with popular movies
for movie in popular_movies:
    # print movie's name
    print('Movie of Interest: {}\n'.format(movie))
    # print the top 5 recommended movies
    print('Top 5 recommended Movies:\n{}\n'.
          format(movie_recommender(movie_title=movie, movies=movies_list, doc_sims=doc_sim_df)))

The list of the popular movies includes 20 movies.

Movie of Interest: Minions

Top 5 recommended Movies:
['Despicable Me 2' 'Despicable Me'
 'Teenage Mutant Ninja Turtles: Out of the Shadows' 'Superman'
 'Rise of the Guardians']

Movie of Interest: Interstellar

Top 5 recommended Movies:
['Gattaca' 'Space Pirate Captain Harlock' 'Space Cowboys'
 'Starship Troopers' 'Final Destination 2']

Movie of Interest: Deadpool

Top 5 recommended Movies:
['Silent Trigger' 'Underworld: Evolution' 'Bronson' 'Shaft' 'Don Jon']

Movie of Interest: Jurassic World

Top 5 recommended Movies:
['Jurassic Park' 'The Lost World: Jurassic Park'
 "National Lampoon's Vacation" 'The Nut Job' 'Vacation']

Movie of Interest: Pirates of the Caribbean: The Curse of the Black Pearl

Top 5 recommended Movies:
["Pirates of the Caribbean: Dead Man's Chest"
 'Pirates of the Caribbean: On Stranger Tides' 'The Pirate'
 'The Pirates! In an Adventure with Scientists!' 'Joyful Noise']

Movie of Interest: Dawn of the Planet o