### This notebook covers the first step in computing a 'distance' metric between the synopses of two movies.

### All movie descriptions are cleaned and processed with spaCy so that a TF-IDF (Term Frequency - Inverse Document Frequency) matrix can then be computed from them.

In [None]:
!pip install spacy -q

In [2]:
import re
import pandas as pd
import spacy
spacy.__version__

'3.1.1'

In [3]:
imdb = pd.read_csv('IMDb movies.csv',
                   usecols=[
                       'imdb_title_id',
                       'title',
                       'year',
                       'genre',
                       'duration',
                       'country',
                       'director',
                       'writer',
                       'actors',
                       'description',
                       'avg_vote'
                   ])


In [4]:
imdb = imdb[imdb['year'] != 'TV Movie 2019']  # drop the one row whose 'year' field is malformed

In [5]:
imdb['imdb_title_id'] = imdb['imdb_title_id'].apply(lambda x: int(x.replace('tt', '')))
imdb.rename(columns={'imdb_title_id': 'imdbId'}, inplace=True)  # match the id column of the MovieLens file
imdb['genre'] = imdb['genre'].str.lower().str.replace(' ','').str.split(',')
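
The two column transformations above can be checked on single values. This is a minimal sketch in plain Python; the helper names `parse_title_id` and `split_genres` and the sample values are hypothetical and are not part of the notebook.

```python
def parse_title_id(title_id):
    """Turn an IMDb-style id such as 'tt0000009' into the integer 9."""
    return int(title_id.replace("tt", ""))


def split_genres(genre):
    """Lowercase a comma-separated genre string and split it into a list."""
    return genre.lower().replace(" ", "").split(",")


# Hypothetical sample values, in the same shape as the raw CSV columns.
movie_id = parse_title_id("tt0000009")
genres = split_genres("Romance, Drama")
```

Converting the id to an integer drops the leading zeros, so the join key on the other file has to be an integer as well.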

In [7]:
imdb = imdb.reset_index(drop=True)
imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85854 entries, 0 to 85853
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   imdbId       85854 non-null  int64  
 1   title        85854 non-null  object 
 2   year         85854 non-null  object 
 3   genre        85854 non-null  object 
 4   duration     85854 non-null  int64  
 5   country      85790 non-null  object 
 6   director     85767 non-null  object 
 7   writer       84282 non-null  object 
 8   actors       85785 non-null  object 
 9   description  83739 non-null  object 
 10  avg_vote     85854 non-null  float64
dtypes: float64(1), int64(2), object(8)
memory usage: 7.2+ MB


In [8]:
# Look at one description to get a feel for the texts we are dealing with.
imdb['description'][100]

'Prince Kasatsky is a just and proud youth, shock and disappointment with the world bring him to church, he becomes father Sergius. It is a story of his piety and temptation.'

### Now that we are familiar with the dataset, we can begin the NLP steps:
### 1. clean the text
### 2. tokenize the text, keeping only the lemmas and removing all stop words

### Note that the TF-IDF matrix itself is computed in the following notebook: 'Synopsis-distance'.

In [9]:
# Download the spaCy model for English.
!python -m spacy download en_core_web_sm -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [10]:
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS

nlp = en_core_web_sm.load()

In [11]:
imdb['description'] = imdb['description'].fillna("")

In [12]:
imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85854 entries, 0 to 85853
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   imdbId       85854 non-null  int64  
 1   title        85854 non-null  object 
 2   year         85854 non-null  object 
 3   genre        85854 non-null  object 
 4   duration     85854 non-null  int64  
 5   country      85790 non-null  object 
 6   director     85767 non-null  object 
 7   writer       84282 non-null  object 
 8   actors       85785 non-null  object 
 9   description  85854 non-null  object 
 10  avg_vote     85854 non-null  float64
dtypes: float64(1), int64(2), object(8)
memory usage: 7.2+ MB


In [13]:
# Replace punctuation with spaces, collapse runs of whitespace, then lowercase and trim.
imdb['clean_description'] = imdb['description'].apply(lambda x: re.sub(r'[!"#$%&()*+,\-./:;<=>?@\[\]^_`{|}\\]', " ", x))
imdb['clean_description'] = imdb['clean_description'].apply(lambda x: re.sub(r"\s+", " ", x))
imdb['clean_description'] = imdb['clean_description'].apply(lambda x: x.lower().strip())
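
The three cleaning steps can be tried on a single string before running them on the whole column. The sketch below wraps them in a hypothetical helper, `clean_text`, with a made-up sample sentence. Note that the character class does not include the apostrophe, so contractions and possessives survive the cleaning.

```python
import re

# Same punctuation set as the cleaning cell; the apostrophe is deliberately absent.
PUNCTUATION = re.compile(r'[!"#$%&()*+,\-./:;<=>?@\[\]^_`{|}\\]')
WHITESPACE = re.compile(r"\s+")


def clean_text(text):
    """Replace punctuation with spaces, collapse whitespace, lowercase and trim."""
    text = PUNCTUATION.sub(" ", text)
    text = WHITESPACE.sub(" ", text)
    return text.lower().strip()


# Hypothetical sample sentence.
cleaned = clean_text("Hello, World!  It's (great).")
hyphenated = clean_text("A middle-aged surgeon")
```

Because the hyphen is replaced by a space, a compound such as "middle-aged" becomes two separate tokens later on.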

In [14]:
# Spot check of the effect of the cleaning.
imdb['clean_description'][95]

'a wwi english officer is inspired the night before a dangerous mission by a vision of joan of arc whose story he relives'

In [15]:
tokenized_description = imdb['clean_description'].apply(lambda x: nlp(x))

In [16]:
tokenized_description

0        (the, adventures, of, a, female, reporter, in,...
1        (true, story, of, notorious, australian, outla...
2        (two, men, of, high, rank, are, both, wooing, ...
3        (the, fabled, queen, of, egypt, 's, affair, wi...
4        (loosely, adapted, from, dante, 's, divine, co...
                               ...                        
85849    (a, psychiatric, hospital, patient, pretends, ...
85850    (a, middle, aged, veterinary, surgeon, believe...
85851                                                   ()
85852                                                   ()
85853    (pep, a, 13, year, old, boy, is, in, love, wit...
Name: clean_description, Length: 85854, dtype: object

In [18]:
tokenized_description = tokenized_description.apply(
    lambda doc: [token.lemma_ for token in doc
                 if token.text not in STOP_WORDS and token.lemma_ not in STOP_WORDS])
tokenized_description

0                     [adventure, female, reporter, 1890s]
1        [true, story, notorious, australian, outlaw, n...
2        [man, high, rank, woo, beautiful, famous, eque...
3        [fabled, queen, egypt, affair, roman, general,...
4        [loosely, adapt, dante, divine, comedy, inspir...
                               ...                        
85849    [psychiatric, hospital, patient, pretend, craz...
85850    [middle, aged, veterinary, surgeon, believe, w...
85851                                                   []
85852                                                   []
85853    [pep, 13, year, old, boy, love, girl, grandpar...
Name: clean_description, Length: 85854, dtype: object
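
The filter above tests both the surface form and the lemma of each token, because an inflected form can be missing from the stop-word list even when its lemma is on it. The sketch below illustrates that with plain tuples instead of spaCy tokens; the tiny stop-word set and the (text, lemma) pairs are hypothetical and far smaller than spaCy's `STOP_WORDS`.

```python
# Hypothetical stop-word list, much smaller than spaCy's STOP_WORDS.
stop_words = {"the", "of", "a", "be", "go"}

# Hypothetical (text, lemma) pairs standing in for spaCy tokens.
tokens = [
    ("the", "the"),
    ("adventures", "adventure"),
    ("of", "of"),
    ("a", "a"),
    ("reporter", "reporter"),
    ("goes", "go"),  # the text is not a stop word here, but the lemma is
]

# Keep the lemma only when neither the text nor the lemma is a stop word.
kept = [lemma for text, lemma in tokens
        if text not in stop_words and lemma not in stop_words]
```

In this toy case "goes" is dropped only because of the lemma check, and "adventures" is kept in its lemmatized form "adventure".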

In [19]:
imdb['clean_token_description'] = [" ".join(x) for x in tokenized_description]

In [21]:
imdb_export = imdb[['imdbId', 'clean_token_description']]

In [24]:
imdb_export.to_csv('clean_token_description.csv', index=False)

### Now the TF-IDF will be computed in the 'Synopsis_distance' notebook.
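
To show how the exported strings feed into that step, here is a minimal pure-Python sketch of the textbook weighting tf × log(N / df). It is an illustration only, under the assumption that each row is a space-joined list of lemmas. The function name `tf_idf` and the sample rows are hypothetical, and the follow-up notebook may well use a different formula: a library implementation such as scikit-learn's `TfidfVectorizer` applies a smoothed variant by default, so its numbers would differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical rows in the same format as 'clean_token_description'.
docs = [
    "adventure female reporter 1890s",
    "true story notorious australian outlaw",
    "true love story",
    "",  # missing descriptions end up as empty strings
]
weights = tf_idf(docs)
```

A term that appears in few documents, such as "love" here, gets a larger weight than one shared across several, such as "true". Rows with an empty description produce an empty vector, so they would need to be handled before any distance is computed.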