## Part 2.1 - Correcting & Enriching the dataset

The notebook is split into two parts:

**Part 2.1.1:** Corrects the mistakes found in the dataset produced by the previous notebook versions and adds a new column holding the official TMDB id of each movie. It also adds new columns that support the *training* and the *accuracy* of the recommendation algorithm. The columns currently added to enrich the written text of each movie are:
* The overview text from the TMDB API.
* Three reviews per movie, where possible, instead of one.

**Part 2.1.2:** Adds further movies (rows) to the dataset so that it covers more genres and titles. One way to achieve this is to use the TMDB ids collected in part 2.1.1 to pull movie content from the TMDB API.

#### -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

### Section 2.1.1
To follow through this section you are advised to read the comments on top of each code block.

In [3]:
"""
Import the two csv files that contain the movie titles and the movie TMDB ids. 
This dataset will be merged with the already created and cleaned dataset from part 2 to bring the registered TMDB id per movie.
"""
import pandas as pd
import joblib
import os
import re
movie_links=pd.read_csv('links.csv')
movie_movies=pd.read_csv('movies.csv')
movie_merged=movie_movies.merge(movie_links, how='inner', on='movieId')
assert movie_links.shape[0]==movie_movies.shape[0]==movie_merged.shape[0]

exp = r'\(\d\d\d\d.'
pattern = r'\((\d{4})\)'
movie_merged['year'] =movie_merged.title.str.extract(pattern, expand=False) #False returns a series
movie_merged['title']=movie_merged['title'].apply(lambda x: re.sub(exp,"",x).strip())
movie_merged['title']=movie_merged['title'].str.replace(r'(.*?),?\s*(The|A|An|Les)?(?=\s*\(.*\)\s*|$).*', r'\2 \1', regex=True)
movie_merged['title']=movie_merged['title'].str.strip().reset_index(drop=True)
movie_merged.dropna(inplace=True)



In [17]:
movie_merged[movie_merged["title"].isin(["Babylon"])]

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,year
17222,86657,Babylon,Drama,80406,57082.0,1981


In [12]:
movie_merged.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1995
1,2,Jumanji,Adventure|Children|Fantasy,113497,8844.0,1995
2,3,Grumpier Old Men,Comedy|Romance,113228,15602.0,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,114885,31357.0,1995
4,5,Father of the Bride Part II,Comedy,113041,11862.0,1995
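
The title clean-up performed in the first cell can be illustrated on a few raw strings. The sketch below is a simplified version that uses only the `re` module: the helper name `split_title_and_year` and the pattern `TRAILING_ARTICLE` are assumptions made for this example, and the pattern is deliberately simpler than the expression used in the notebook (it does not handle alternative titles in parentheses).

```python
import re

YEAR_PATTERN = re.compile(r'\((\d{4})\)')          # captures the four-digit year
YEAR_SUFFIX = re.compile(r'\(\d{4}\)')             # the "(1995)" part to strip
TRAILING_ARTICLE = re.compile(r'^(.*?),\s*(The|A|An|Les)$')  # simplified, hypothetical pattern

def split_title_and_year(raw_title):
    """Return (clean_title, year) for a MovieLens-style title string."""
    match = YEAR_PATTERN.search(raw_title)
    year = match.group(1) if match else None
    title = YEAR_SUFFIX.sub('', raw_title).strip()
    article = TRAILING_ARTICLE.match(title)
    if article:
        # move the trailing article to the front: "Usual Suspects, The" -> "The Usual Suspects"
        title = '{} {}'.format(article.group(2), article.group(1))
    return title, year

for raw in ["Toy Story (1995)", "Usual Suspects, The (1995)", "Babylon 5"]:
    print(split_title_and_year(raw))
```

Titles without a year yield `None`, which mirrors the NaN values that `dropna` removes in the notebook.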


In [2]:
"""
python package 1: tmdbv3api
python package 2: tmdbsimple
"""
from tmdbv3api import TMDb
from tmdbv3api import Movie
from tmdbv3api import Discover

import os
tmdb = TMDb()
tmdb.api_key = os.environ['TMDB_API_KEY'] #read the key from an environment variable (name assumed here) instead of hard-coding it
tmdb.language = 'en'
tmdb.debug = True

movie=Movie()
m=movie.details(movie_id=602262)
print(m.keys())
print(m.genres)
print(m.title)
print(m.overview)
# discover = Discover()
# movie = discover.discover_movies({
#     'primary_release_date.gte': '2015-01-01'
# })

dict_keys(['adult', 'backdrop_path', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count', 'videos', 'trailers', 'images', 'casts', 'translations', 'keywords', 'release_dates'])
[{'id': 35, 'name': 'Comedy'}, {'id': 9648, 'name': 'Mystery'}, {'id': 53, 'name': 'Thriller'}, {'id': 10770, 'name': 'TV Movie'}]
Dead Husbands
Dr. Carter Elson is a man who finds a list of men's names among his wife Alex's possessions. When Carter discovers his own name at the bottom of the list, and that some of the other names are those of dead men, he confides in his friend/agent Betty. Time is ticking as they try and figure out what the list means before his name reaches the top. Alex, a small town girl who marries the up-and-coming do

In [6]:
movie.details(165911)["imdb_id"]

'tt0187809'
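
The `genres` field printed above is a list of dictionaries, whereas the GroupLens files store genres as a single `|`-separated string. A minimal sketch of converting one form to the other follows; `genre_names` is a hypothetical helper, and the sample payload is copied from the output above.

```python
def genre_names(genres):
    """Extract the genre names from a TMDB-style list of {'id', 'name'} dictionaries."""
    return [genre["name"] for genre in genres]

# payload copied from the printed output of movie.details(movie_id=602262)
sample_genres = [
    {'id': 35, 'name': 'Comedy'},
    {'id': 9648, 'name': 'Mystery'},
    {'id': 53, 'name': 'Thriller'},
    {'id': 10770, 'name': 'TV Movie'},
]

names = genre_names(sample_genres)
print(names)
print("|".join(names))  # same layout as the genres column of movies.csv
```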

In [473]:
"""
Import the dataset. The version of the dataset below is derived from part 2.
"""
dataset=joblib.load(os.path.join(os.getcwd(),"dataset_part_2_cleaned_of_redundant_genres_18022021.pkl"))

In [474]:
"""Print the total na values across all columns"""
dataset.isna().sum()

title                0
genres               0
rating               0
imdb_url             0
reviews_url          0
actors               0
plot                 0
imdb_rating          0
director             0
reviews              0
year               175
sentiment_value      0
movie_features       0
reduced_genres       0
dtype: int64

In [475]:
"""Drop the movies with na values found in column year."""
dataset.dropna(inplace=True)
dataset.shape

(48947, 14)

In [476]:
"""
Merge the dataset from part 2 with the dataset downloaded from GroupLens that contains the TMDB id per movie.
This code block is the starting point for enriching movies with additional content.
Using their registered TMDB id we can extract information from the TMDB database API.
"""
dataset_tmdbid=dataset.merge(movie_merged[['title', 'year', 'tmdbId']], how='left', on=['title', 'year'], indicator=True)
dataset_tmdbid_rduplicates=dataset_tmdbid.sort_values('year').drop_duplicates(['title', 'year'], keep='last').sort_index().reset_index(drop=True)
dataset_tmdbid_rduplicates=dataset_tmdbid_rduplicates.reset_index(drop=True)
print(dataset_tmdbid.shape)
print(dataset_tmdbid_rduplicates.shape)

(49108, 16)
(48875, 16)
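
The merge and de-duplication steps can be reproduced on a toy example. The sketch below assumes pandas is installed; the frames `left` and `right` are made up for illustration (the second Babylon row with id 99999.0 is not real data), and the year sort used in the notebook is omitted for brevity.

```python
import pandas as pd

left = pd.DataFrame({
    "title": ["Toy Story", "Babylon", "Halfmoon"],
    "year": ["1995", "1981", "1995"],
})
right = pd.DataFrame({
    "title": ["Toy Story", "Babylon", "Babylon"],
    "year": ["1995", "1981", "1981"],
    "tmdbId": [862.0, 57082.0, 99999.0],  # 99999.0 is a made-up duplicate entry
})

# a left merge keeps every row of `left`; duplicated keys in `right` multiply rows
merged = left.merge(right, how="left", on=["title", "year"], indicator=True)

# keep one row per (title, year) pair
deduplicated = merged.drop_duplicates(subset=["title", "year"], keep="last").reset_index(drop=True)

# rows flagged 'left_only' found no TMDB id in the right-hand frame
unmatched = deduplicated[deduplicated["_merge"] != "both"]

print(merged.shape, deduplicated.shape)
print(unmatched["title"].tolist())
```

This is why the row count first grows after the merge and then shrinks after `drop_duplicates`.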


In [477]:
"""
Print the dataset rows with no TMDB id.
For the rows printed we could not locate an existing TMDB id in the GroupLens dataset.
"""
dataset_tmdbid_rduplicates[dataset_tmdbid_rduplicates['_merge']!='both']

Unnamed: 0,title,genres,rating,imdb_url,reviews_url,actors,plot,imdb_rating,director,reviews,year,sentiment_value,movie_features,reduced_genres,tmdbId,_merge
700,Halfmoon,[Drama],3.0,http://www.imdb.com/title/tt0114103/,http://www.imdb.com/title/tt0114103/reviews?sp...,"[Samir Guesmi, Khaled Ksouri, Sondos Belhassen...",Three short stories by the American expatriate...,6.9,Frieder Schlaich,[I just got done watching this. It has 3 short...,1995,1.0,Halfmoon Samir Guesmi Khaled Ksouri Sondos Bel...,[Drama],,left_only
1626,The Sadness of Sex,[Drama],2.71,http://www.imdb.com/title/tt0114322/,http://www.imdb.com/title/tt0114322/reviews?sp...,"[Barry Yourgrau, Peta Wilson, Barbara Baumann,...",Comprised of fifteen vignettes of varying leng...,6.5,Rupert Wainwright,[The Sadness Of Sex is one of the best art wor...,1995,1.0,The Sadness of Sex Barry Yourgrau Peta Wilson ...,[Drama],,left_only
1658,Follow the Bitch,[Comedy],2.79,http://www.imdb.com/title/tt0119139/,http://www.imdb.com/title/tt0119139/reviews?sp...,"[Michael Cudlitz, Ray Porter, Dion Luther, Mel...","A group of single guys meet for their weekly, ...",7.4,Julian Stone,[I haven't seen this movie since 1997 and I ju...,1996,1.0,Follow the Bitch Michael Cudlitz Ray Porter Di...,[Comedy],,left_only
2079,The Master,[Action],2.06,http://www.imdb.com/title/tt0087690/,http://www.imdb.com/title/tt0087690/reviews?sp...,"[Lee Van Cleef, Timothy Van Patten, Shô Kosugi]",An aging American ninja master and his headstr...,4.5,Michael Sloan,"[I remember my excitement, as an 11 year old a...",1984,0.0,The Master Lee Van Cleef Timothy Van Patten Sh...,[Action],,left_only
3002,Creature,[Documentary],2.95,http://www.imdb.com/title/tt0198385/,http://www.imdb.com/title/tt0198385/reviews?sp...,"[Filberto Ascencio, Butch Dean, Dusty Dean, St...",Kyle Dean was a misfit in her North Carolina s...,6.5,Parris Patton,"[I would give this a 10 out of 10, but I did w...",1999,1.0,Creature Filberto Ascencio Butch Dean Dusty De...,[Documentary],,left_only
4950,Luminarias,"[Comedy, Romance]",2.68,http://www.imdb.com/title/tt0160498/,http://www.imdb.com/title/tt0160498/reviews?sp...,"[Evelina Fernández, Scott Bakula, Cheech Marin...",Four professional women meet at an East Los An...,5.4,José Luis Valenzuela,"[Though some of the acting was a little stiff,...",2000,1.0,Luminarias Evelina Fernández Scott Bakula Chee...,"[Comedy, Romance]",,left_only
5337,The Dogwalker,[Drama],1.5,http://www.imdb.com/title/tt0309521/,http://www.imdb.com/title/tt0309521/reviews?sp...,"[Diane Gaidry, Pamela Gordon, Lyn Vaus, Lisa J...",The L. A. dog walking scene provides a colorfu...,6.9,Jacques Thelemaque,[I saw this at the Santa Fe Film Festival whil...,2002,0.0,The Dogwalker Diane Gaidry Pamela Gordon Lyn V...,[Drama],,left_only
5508,Faces of Death 5,"[Documentary, Horror]",1.35,http://www.imdb.com/title/tt0223249/,http://www.imdb.com/title/tt0223249/reviews?sp...,"[Clyde Barrow, Michael Carr, Bonnie Parker, Ro...",The fifth entry in the Faces of Death series.,2.9,John Alan Schwartz,"[Gorgon Video, obviously trying to milk this c...",1996,0.0,Faces of Death 5 Clyde Barrow Michael Carr Bon...,"[Documentary, Horror]",,left_only
6892,Prisoner of Paradise,[Documentary],3.21,http://www.imdb.com/title/tt0348862/,http://www.imdb.com/title/tt0348862/reviews?sp...,"[Ian Holm, Robert Lantz, Eleonore Hertzberg, L...",Documentary about Holocaust victim Kurt Gerron.,7.6,Malcolm Clarke,[Any documentary about a successful Berlin Cab...,2002,1.0,Prisoner of Paradise Ian Holm Robert Lantz Ele...,[Documentary],,left_only
7227,Love Life,"[Comedy, Romance]",4.12,http://www.imdb.com/title/tt0253016/,http://www.imdb.com/title/tt0253016/reviews?sp...,"[Des Brady, Galit Hershkovitz, Luke Goss, Suri...","When a one-night stand results in pregnancy, S...",5.9,Ray Brady,[Inherited this from my x's DVD collection whe...,2001,1.0,Love Life Des Brady Galit Hershkovitz Luke Gos...,"[Comedy, Romance]",,left_only


In [478]:
"""
Get the index of the movies that have an NA tmdbId after the join with the GroupLens dataset.
The 55 indexes below should be reviewed one by one and the necessary action applied to each:
either remove the movie from the dataset or replace its tmdbId with the correct one from the TMDB API.
"""
list_values=dataset_tmdbid_rduplicates['title'].index[dataset_tmdbid_rduplicates['tmdbId'].isna()]
list_values

Int64Index([  700,  1626,  1658,  2079,  3002,  4950,  5337,  5508,  6892,
             7227,  8793,  9166, 13251, 14355, 14552, 14816, 15330, 15353,
            15484, 15554, 15934, 16226, 16311, 16536, 16764, 17676, 17836,
            17949, 18300, 18718, 18874, 19057, 20901, 20913, 21088, 21099,
            21149, 21150, 21157, 21162, 21222, 21281, 21454, 21520, 21959,
            22372, 22427, 22611, 22632, 22818, 22902, 23073, 23177, 23428,
            23621],
           dtype='int64')

In [479]:
"""
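list_one: 55 indexes with a missing TMDB id.
list_two: 39 indexes to remove from the dataset.
list_keep_indexes: 16 (55-39) indexes to keep in the dataset. Three of them (2079, 3002, 22902) get their TMDB id set manually,
                   so 13 indexes remain whose official TMDB id is looked up through the API.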
"""
list_one=[700,1626,1658,2079,3002,4950,5337,5508,6892,7227,8793,9166,13251,14355,14552,14816,15330,15353,15484,15554,15934,16226,16311,16536,16764,17676,17836,17949,18300,18718,18874,19057,20901,20913,21088,21099,21149,21150,21157,21162,21222,21281,21454,21520,21959,22372,22427,22611,22632,22818,22902,23073,23177,23428,23621]
list_two=[5508,7227,8793,9166,14355,14552,15353,15554,15934,16226,16311,16536,16764,17676,17836,17949,18300,18718,18874,19057,20913,21088,21099,21149,21150,21157,21162,21222,21281,21454,21520,21959,22372,22611,22632,22818,23073,23177,23621]
list_keep_indexes=[x for x in list_one if x not in list_two]
list_keep_indexes=[elem for elem in list_keep_indexes if elem not in [2079,3002,22902]]
list_keep_indexes

[700,
 1626,
 1658,
 4950,
 5337,
 6892,
 13251,
 14816,
 15330,
 15484,
 20901,
 22427,
 23428]
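
The counts quoted in the docstring can be checked directly from the two lists. The sketch below copies the lists from the cell above; using a set for the membership test is a choice made for this example and does not change the result.

```python
list_one = [700, 1626, 1658, 2079, 3002, 4950, 5337, 5508, 6892, 7227, 8793, 9166,
            13251, 14355, 14552, 14816, 15330, 15353, 15484, 15554, 15934, 16226,
            16311, 16536, 16764, 17676, 17836, 17949, 18300, 18718, 18874, 19057,
            20901, 20913, 21088, 21099, 21149, 21150, 21157, 21162, 21222, 21281,
            21454, 21520, 21959, 22372, 22427, 22611, 22632, 22818, 22902, 23073,
            23177, 23428, 23621]
list_two = [5508, 7227, 8793, 9166, 14355, 14552, 15353, 15554, 15934, 16226, 16311,
            16536, 16764, 17676, 17836, 17949, 18300, 18718, 18874, 19057, 20913,
            21088, 21099, 21149, 21150, 21157, 21162, 21222, 21281, 21454, 21520,
            21959, 22372, 22611, 22632, 22818, 23073, 23177, 23621]
manually_fixed = {2079, 3002, 22902}  # ids assigned by hand in the next cell

to_remove = set(list_two)
kept = [i for i in list_one if i not in to_remove]          # order of list_one is preserved
to_look_up = [i for i in kept if i not in manually_fixed]

print(len(list_one), len(list_two), len(kept), len(to_look_up))
```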

In [480]:
"""
Out of the 55 NaN TMDB ids, and after reviewing each of the 55 movies with the cell above,
we decided that the following 16 movies will be kept and assigned their correct TMDB id.
"""
#index 700 - Change title to "Paul Bowles: Half Moon"
dataset_tmdbid_rduplicates.iloc[list_values[0],0]="Paul Bowles: Half Moon"

#index 1626 - Change year of release from 1995 to 1998
dataset_tmdbid_rduplicates.iloc[list_values[1],10]=str(1998)

#index 2079,3002 - Change manually TMDB id
dataset_tmdbid_rduplicates.iloc[list_values[3],14]=68722.0
dataset_tmdbid_rduplicates.iloc[list_values[4],14]=73963.0

#index 6892 - Change year of release to 2003
dataset_tmdbid_rduplicates.iloc[list_values[8],10]=str(2003)

#index 15330 - Change year of release from 1978 to 2009
dataset_tmdbid_rduplicates.iloc[list_values[16],10]=str(2009)

#index 20901 - Change year of release from 2001 to 1999
dataset_tmdbid_rduplicates.iloc[list_values[32],10]=str(1999)

#index 22902 - Change tmdb id to 722616
dataset_tmdbid_rduplicates.iloc[list_values[50],14]=722616.0
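
The corrections above address cells by position (`iloc` with column numbers 0, 10 and 14). Because the index was reset, the row labels equal the row positions, so the same edits could also be written with labels, which tends to be easier to read. A minimal sketch on a made-up two-row frame, assuming pandas is installed:

```python
import pandas as pd

# toy frame whose index mimics the row labels 700 and 1626 of the notebook
movies = pd.DataFrame(
    {
        "title": ["Halfmoon", "The Sadness of Sex"],
        "year": ["1995", "1995"],
        "tmdbId": [float("nan"), float("nan")],
    },
    index=[700, 1626],
)

# label-based equivalents of the first two corrections
movies.loc[700, "title"] = "Paul Bowles: Half Moon"
movies.loc[1626, "year"] = str(1998)

print(movies[["title", "year"]])
```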

In [481]:
"""
The function to apply on the indexes kept out of those 55 with nan tmdb ids. 
Note that out of the 16 indexes, three of them were excluded because their official TMDB id was manually set. 
For the transformations applied on the 16 indexes see the previous code block.
"""
import numpy as np
def replace_nan_tmdbids(row_index):
    """
    Purpose: Collect the official TMDB id for a movie.
    Arguments: The row of the movie corresponding to its index.
    Output: The collected TMDB id.
    """
    keep_movie_title=dataset_tmdbid_rduplicates.iloc[row_index]["title"] #1
    collected_id=[options['id'] for options in movie.search(keep_movie_title) if (options['title'].lower()==keep_movie_title.lower() and options['release_date'].split('-')[0]==dataset_tmdbid_rduplicates['year'][dataset_tmdbid_rduplicates['title']==keep_movie_title].values[0])] #2
    return float(collected_id[0]) #3
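
The matching rule inside `replace_nan_tmdbids` (same title ignoring case, same release year) can be isolated from the API call. The sketch below works on plain dictionaries shaped like the search results used above (`id`, `title`, `release_date`); the helper name `pick_tmdb_id` and the candidate ids are hypothetical. Unlike the function above, it returns `None` instead of raising when nothing matches.

```python
def pick_tmdb_id(candidates, title, year):
    """Return the id of the first candidate matching title and year, or None."""
    matches = [
        float(candidate["id"])
        for candidate in candidates
        if candidate.get("title", "").lower() == title.lower()
        # release_date looks like 'YYYY-MM-DD'; it may be empty for some entries
        and (candidate.get("release_date") or "").split("-")[0] == str(year)
    ]
    return matches[0] if matches else None

# made-up search results for illustration only
candidates = [
    {"id": 1, "title": "Luminarias", "release_date": "2000-05-05"},
    {"id": 2, "title": "Luminarias", "release_date": ""},
    {"id": 3, "title": "Another Title", "release_date": "2000-01-01"},
]

print(pick_tmdb_id(candidates, "luminarias", "2000"))
print(pick_tmdb_id(candidates, "Luminarias", "1999"))
```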

In [482]:
"""Apply the function above to collect the official TMDB id of the 13 movies listed in list_keep_indexes."""
from tqdm.notebook import tqdm
dataset_tmdbid_rduplicates["index"]=dataset_tmdbid_rduplicates.index
tqdm.pandas()
dataset_tmdbid_rduplicates["tmdbId"]=dataset_tmdbid_rduplicates["index"].progress_apply(lambda x: replace_nan_tmdbids(x) if x in list_keep_indexes else dataset_tmdbid_rduplicates.iloc[x,14])

  0%|          | 0/48875 [00:00<?, ?it/s]

In [491]:
"""
The remaining 39 indexes (movies) are removed from the dataset.
"""
index_to_remove=[5508,7227,8793,9166,14355,14552,15353,15554,15934,16226,16311,16536,16764,17676,17836,17949,18300,18718,18874,19057,20913,21088,21099,21149,21150,21157,21162,21222,21281,21454,21520,21959,22372,22611,22632,22818,23073,23177,23621]
dataset_tmdbid_rduplicates=dataset_tmdbid_rduplicates[~dataset_tmdbid_rduplicates.index.isin(index_to_remove)]
dataset_tmdbid_rduplicates=dataset_tmdbid_rduplicates.reset_index(drop=True)
dataset_tmdbid_rduplicates.shape

(48836, 17)

In [492]:
array_nonvalid_tmdbids=[]
def get_updated_tmdbid(tmdbid, array_nonvalid_tmdbids):
    """
    Purpose: Update the TMDB id of each movie from the TMDB API. For the majority of the movies in the dataset,
             the TMDB id should match the corresponding id from the GroupLens dataset. In any other scenario the correct
             id from the API replaces the old one from the GroupLens dataset. The updated ids will be used to extract
             each movie's overview text from the TMDB API.
    Arguments: tmdbid -> The movie's id from the TMDB database
               array_nonvalid_tmdbids -> A list that is updated every time a tmdbid recorded in the GroupLens dataset is invalid.
                                         In that case we use the movie's title and year of release to try to discover the correct tmdbid.
    Outputs: The updated tmdbid of the movie, or an empty list when no unique match is found.
    """
    try:
        details=movie.details(movie_id=tmdbid) #query the API once and reuse the response
        if not details:
            raise ValueError("empty response for tmdbid {}".format(tmdbid))
        return details.id
    except Exception:
        #the id is not valid: fall back to a search by title and year of release
        movie_title=dataset_tmdbid_rduplicates['title'][dataset_tmdbid_rduplicates['tmdbId']==tmdbid].values[0]
        movie_year=dataset_tmdbid_rduplicates['year'][dataset_tmdbid_rduplicates['title']==movie_title].values[0]
        candidates=list(movie.search(movie_title)) #search once and reuse the results
        print("exception caught at movie:",movie_title,"\n")
        search_result=[]
        if any(option["original_title"]==movie_title for option in candidates):
            matches=[option['id'] for option in candidates if (option['title'].lower()==movie_title.lower() and option['release_date'].split('-')[0]==movie_year)]
            if len(matches)==1:
                search_result=matches[0]
        array_nonvalid_tmdbids.append(tmdbid)
        return search_result
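
Since the lookup above runs once per row, one possible way to avoid repeated requests for an id that has already been resolved is memoization. The sketch below is an assumption of this example rather than part of the notebook: `fetch_details` is a hypothetical stand-in for `movie.details()` that only records the calls it receives.

```python
from functools import lru_cache

call_log = []  # records every simulated network request

def fetch_details(tmdb_id):
    """Hypothetical stand-in for movie.details(); returns a fake payload."""
    call_log.append(tmdb_id)
    return {"id": tmdb_id}

@lru_cache(maxsize=None)
def cached_details_id(tmdb_id):
    """Resolve an id through fetch_details, caching the answer per id."""
    return fetch_details(tmdb_id)["id"]

ids = [862.0, 8844.0, 862.0]          # the first id appears twice
resolved = [cached_details_id(i) for i in ids]

print(resolved)
print(call_log)                       # the repeated id triggers only one request
```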

In [493]:
"""
The movie titled 1066 is dropped from the dataset because the TMDB database does not hold sufficient information for it.
The dataset is therefore expected to shrink from 48836 movies to 48835.
"""
dataset_tmdbid_rduplicates=dataset_tmdbid_rduplicates[dataset_tmdbid_rduplicates["title"]!="1066"]
dataset_tmdbid_rduplicates.shape

(48835, 17)

In [494]:
"""
The movie titled The Queen of Spades (1916) is dropped from the dataset because the TMDB database does not hold sufficient information for it.
The dataset is therefore expected to shrink from 48835 movies to 48834.
"""
dataset_tmdbid_rduplicates=dataset_tmdbid_rduplicates.drop(dataset_tmdbid_rduplicates[(dataset_tmdbid_rduplicates["title"]=="The Queen of Spades") & (dataset_tmdbid_rduplicates["year"]==str(1916))].index)
dataset_tmdbid_rduplicates=dataset_tmdbid_rduplicates.reset_index(drop=True)
dataset_tmdbid_rduplicates.shape

(48834, 17)

In [495]:
"""Apply the function get_updated_tmdbid() to return a new column with the updated ids per movie."""
from tqdm.notebook import tqdm
tqdm.pandas()
dataset_tmdbid_rduplicates['tmdbid_updated']=dataset_tmdbid_rduplicates.iloc[0:,14].progress_apply(lambda x: get_updated_tmdbid(x, array_nonvalid_tmdbids))

  0%|          | 0/48834 [00:00<?, ?it/s]

exception caught at movie: Navy Seals 

exception caught at movie: Best of the Best 

exception caught at movie: Escaflowne: The Movie 

exception caught at movie: Ffolkes 

exception caught at movie: Rose Red 

exception caught at movie: Pride and Prejudice 

exception caught at movie: Tinker, Tailor, Soldier, Spy 

exception caught at movie: Children of Dune 

exception caught at movie: Dune 

exception caught at movie: Jesus of Nazareth 

exception caught at movie: The Blue and the Gray 

exception caught at movie: Smiley's People 

exception caught at movie: The Bourne Identity 

exception caught at movie: Lonesome Dove 

exception caught at movie: Prime Suspect 

exception caught at movie: It 

exception caught at movie: Prime Suspect 2 

exception caught at movie: Return to Lonesome Dove 

exception caught at movie: The Stand 

exception caught at movie: The Langoliers 

exception caught at movie: From the Earth to the Moon 

exception caught at movie: Merlin 

exception caught a

exception caught at movie: 8 

exception caught at movie: Cranford 

exception caught at movie: Shrek the Musical 

exception caught at movie: Going Postal 

exception caught at movie: Rosemary's Baby 

exception caught at movie: Neverland 

exception caught at movie: The Escape Artist 

exception caught at movie: Gunbuster 

exception caught at movie: 10.5 

exception caught at movie: Al-risâlah 

exception caught at movie: 2AM: The Smiling Man 

exception caught at movie: Centennial 

exception caught at movie: Sins 

exception caught at movie: Archangel 

exception caught at movie: Cambridge Spies 

exception caught at movie: King Solomon's Mines 

exception caught at movie: Generation War 

exception caught at movie: Auschwitz: The Nazis and the 'Final Solution' 

exception caught at movie: History of the Eagles 

exception caught at movie: Cleopatra 

exception caught at movie: Olive Kitteridge 

exception caught at movie: Wuthering Heights 

exception caught at movie: Long Way Ro

exception caught at movie: Jean-Claude Van Johnson 

exception caught at movie: Over the Garden Wall 

exception caught at movie: Wolf Creek 

exception caught at movie: Small Island 

exception caught at movie: Horace and Pete 

exception caught at movie: Maigret Sets A Trap 

exception caught at movie: Sense and Sensibility 

exception caught at movie: Jane Eyre 

exception caught at movie: Twist of Fate 

exception caught at movie: Baseball 

exception caught at movie: Crisis in Six Scenes 

exception caught at movie: Legally Blonde: The Musical 

exception caught at movie: The Way We Live Now 

exception caught at movie: The Secret of Crickley Hall 

exception caught at movie: Nature's Great Events 

exception caught at movie: Political Animals 

exception caught at movie: Spies of Warsaw 

exception caught at movie: Kickassia 

exception caught at movie: Suburban Knights 

exception caught at movie: The Spoils of Babylon 

exception caught at movie: Le passe-muraille 

exception c

exception caught at movie: Pole to Pole 

exception caught at movie: I Know My First Name Is Steven 

exception caught at movie: Carne Y Arena 

exception caught at movie: Chinna Gounder 

exception caught at movie: Brotherhood of the Rose 



In [498]:
"""
This code block prints the rows of the dataset whose tmdbid_updated differs from their tmdbId.
For these movies the initial tmdbId from GroupLens is not equal to the id that the TMDB database assigns based on title and year of release.
Fortunately, there are only approximately 511 such movies out of 48834.
"""
import numpy as np
dataset_tmdbid_rduplicates=dataset_tmdbid_rduplicates.reset_index(drop=True)
comparison_column=np.where(dataset_tmdbid_rduplicates["tmdbId"]==dataset_tmdbid_rduplicates["tmdbid_updated"], True, False)
dataset_tmdbid_rduplicates["equal"]=comparison_column
dataset_tmdbid_rduplicates.iloc[0:,:][dataset_tmdbid_rduplicates["equal"]==False]

Unnamed: 0,title,genres,rating,imdb_url,reviews_url,actors,plot,imdb_rating,director,reviews,year,sentiment_value,movie_features,reduced_genres,tmdbId,_merge,index,tmdbid_updated,equal
3999,Navy Seals,"[Action, Adventure, War]",2.71,http://www.imdb.com/title/tt0100232/,http://www.imdb.com/title/tt0100232/reviews?sp...,"[Charlie Sheen, Michael Biehn, Joanne Whalley,...",A battle-hardened Seal Team sets off on a miss...,5.6,Lewis Teague,"[""Navy SEALS"" is a fun escapist Hollywood test...",1990,1.0,Navy Seals Charlie Sheen Michael Biehn Joanne ...,"[Action, Adventure, War]",12773.0,both,3999,427910,False
4353,Best of the Best,[Action],2.97,http://www.imdb.com/title/tt0096913/,http://www.imdb.com/title/tt0096913/reviews?sp...,"[Eric Roberts, Phillip Rhee, James Earl Jones,...",A team from the United States is going to comp...,6.4,Robert Radler,[I'm not sure why I liked this movie so much. ...,1989,1.0,Best of the Best Eric Roberts Phillip Rhee Jam...,[Action],17882.0,both,4353,238751,False
4844,Escaflowne: The Movie,"[Action, Adventure, Animation]",3.45,http://www.imdb.com/title/tt0270933/,http://www.imdb.com/title/tt0270933/reviews?sp...,"[Maaya Sakamoto, Tomokazu Seki, Jôji Nakata, M...","A grim retelling of the television series ""The...",6.7,Kazuki Akane,[Anyone settling down to watch the usual roman...,2000,1.0,Escaflowne: The Movie Maaya Sakamoto Tomokazu ...,"[Action, Adventure, Animation]",68149.0,both,4844,[],False
4982,Ffolkes,"[Action, Adventure, Thriller]",2.96,http://www.imdb.com/title/tt0081809/,http://www.imdb.com/title/tt0081809/reviews?sp...,"[Roger Moore, James Mason, Anthony Perkins, Mi...","When terrorists take over two oil rigs, and th...",6.3,Andrew V. McLaglen,[A gang of criminals hijack a Norwegian supply...,1979,1.0,Ffolkes Roger Moore James Mason Anthony Perkin...,"[Action, Adventure, Thriller]",24549.0,both,4982,[],False
7263,Rose Red,"[Horror, Mystery, Thriller]",3.21,http://www.imdb.com/title/tt0259153/,http://www.imdb.com/title/tt0259153/reviews?sp...,"[Nancy Travis, Matt Keeslar, Kimberly J. Brown...",A group of people with psychic powers are invi...,6.7,Nancy Travis,"[Someone said this was ""too long"" and made the...",2002,1.0,Rose Red Nancy Travis Matt Keeslar Kimberly J....,"[Horror, Mystery, Thriller]",14980.0,both,7265,788551,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48486,Pole to Pole,"[Adventure, Documentary]",4.00,http://www.imdb.com/title/tt0103514/,http://www.imdb.com/title/tt0103514/reviews?sp...,[Michael Palin],Michael Palin undertakes a journey by the most...,8.4,Michael Palin,[This review is based on watching the DVD vers...,1992,1.0,Pole to Pole Michael Palin Michael Palin Micha...,"[Adventure, Documentary]",167774.0,both,48527,[],False
48513,I Know My First Name Is Steven,"[Crime, Drama]",4.00,http://www.imdb.com/title/tt0097553/,http://www.imdb.com/title/tt0097553/reviews?sp...,"[Cindy Pickett, John Ashton, Corin Nemec, Luke...","The harrowing true account of Steven Stayner, ...",7.7,Cindy Pickett,[I remember so much of this movie even though ...,1989,1.0,I Know My First Name Is Steven Cindy Pickett J...,"[Crime, Drama]",61693.0,both,48554,[],False
48794,Carne Y Arena,"[Short, Drama]",2.50,http://www.imdb.com/title/tt6212516/,http://www.imdb.com/title/tt6212516/reviews?sp...,"[Hector Luis Bustamante, Toy Lei, Christopher ...","Based on true accounts, the superficial lines ...",8.5,Alejandro G. Iñárritu,[To review a new medium can be a daunting task...,2017,1.0,Carne Y Arena Hector Luis Bustamante Toy Lei C...,[Drama],524475.0,both,48835,[],False
48797,Chinna Gounder,"[Drama, Romance, Thriller]",2.50,http://www.imdb.com/title/tt0306644/,http://www.imdb.com/title/tt0306644/reviews?sp...,"[Salim Ghouse, Aachi Manorama, Sukanya, Vadive...",The village head is a man who sticks to honest...,6.8,R.V. Udhaya Kumar,[The story is about the village panchayat man ...,1992,1.0,Chinna Gounder Salim Ghouse Aachi Manorama Suk...,"[Drama, Romance, Thriller]",344115.0,both,48838,78983,False
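
The rows above fall into two groups: movies whose id was replaced by a different one, and movies for which no unique match was found (`[]`). A small sketch of separating these cases follows; the function name `classify_id_change` is hypothetical, the second and third pairs are taken from the table above, and the first pair is an assumed example of an unchanged id.

```python
from collections import Counter

def classify_id_change(original_id, updated_id):
    """Label one (tmdbId, tmdbid_updated) pair."""
    if updated_id == []:
        return "unresolved"      # the fallback search found no unique match
    if float(updated_id) == float(original_id):
        return "unchanged"
    return "replaced"

pairs = [
    (862.0, 862),        # assumed example of an id confirmed by the API
    (12773.0, 427910),   # Navy Seals in the table above
    (68149.0, []),       # Escaflowne: The Movie in the table above
]

labels = [classify_id_change(original, updated) for original, updated in pairs]
print(labels)
print(Counter(labels))
```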


In [503]:
dataset_tmdbid_rduplicates.shape

(48834, 19)

In [505]:
"""Run once to save (serialize) the dataset with the updated tmdbids to local disk"""
joblib.dump(dataset_tmdbid_rduplicates, "dataset_updated_tmdbid_18022021.pkl")
dataset_overview=dataset_tmdbid_rduplicates.copy()
dataset_overview.shape

(48834, 19)

In [736]:
"""For the next function to work properly it's imperative to change the data type of tmdbid_updated column from integer to float."""
dataset_overview["tmdbid_updated"]=dataset_overview["tmdbid_updated"].apply(lambda x: float(x) if x!=[] else x)

In [737]:
array_empty_overview=[]
def get_overview(tmdbid, array_empty_overview):
    """
    Purpose: Extract the overview text of a movie based on its assigned tmdb id from TMDB API.
             The overview text of a movie will be combined to its plot summary from IMDB database. The latter is already extracted.
    Arguments: tmdbid -> The movie's id from TMDB database
               array_empty_overview -> Collect the ids with no recorded overview text.
    Outputs: The overview text of the movie as officially written in the TMDB API.
    """
    try:
        if(tmdbid!=[]):
            assert type(tmdbid)==float
            search_result=movie.details(movie_id=tmdbid).overview
        else:
            search_result=" "
            pass
        return search_result
    except:
        movie_title=dataset_overview['title'][dataset_overview['tmdbid_updated']==tmdbid].values[0]
        print("exception caught at movie:",movie_title,"\n")
        search_result=" "
        array_empty_overview.append(tmdbid)
        return search_result

In [738]:
"""
Apply the function get_overview() to retrieve the official overview text per movie from the TMDB database.
Note that for movies without a valid tmdbid (i.e. []) a single-space string stands in for their overview.
"""
dataset_overview["overview"]=dataset_overview.iloc[0:,17].progress_apply(lambda x: get_overview(x, array_empty_overview))

  0%|          | 0/48834 [00:00<?, ?it/s]

exception caught at movie: Kickassia 

exception caught at movie: Suburban Knights 

exception caught at movie: To Boldly Flee 
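
Both `get_updated_tmdbid` and `get_overview` follow the same pattern: try a lookup, and on failure record the key and return a placeholder. That pattern can be sketched independently of the API. In the example below, `lookup_with_fallback` and `fake_overview` are hypothetical names, and the stub simulates a failed request with a `KeyError`.

```python
def lookup_with_fallback(fetch, key, failed_keys, default=" "):
    """Call fetch(key); on any error remember the key and return the default."""
    try:
        return fetch(key)
    except Exception:
        failed_keys.append(key)
        return default

def fake_overview(tmdb_id):
    """Hypothetical stand-in for movie.details(...).overview."""
    overviews = {862.0: "overview text for id 862"}
    return overviews[tmdb_id]  # a KeyError simulates a failed API call

failed = []
results = [lookup_with_fallback(fake_overview, tmdb_id, failed) for tmdb_id in [862.0, -1.0]]

print(results)
print(failed)
```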



In [43]:
from tqdm.notebook import tqdm
"""Save the dataset with the overview text extracted per movie (run the dump once), then reload it from disk."""
#joblib.dump(dataset_overview,"dataset_overview_18022021.pkl")
dataset_overview=joblib.load("dataset_overview_18022021.pkl")

In [44]:
"""
Before proceeding to the next part, Data tokenization & transformation for NLP applications, we should perform 3 tasks.
Task 1: Replace the information of the movies "The Master" and "Creature" with new details corresponding to different movies than those already imported in the dataset.
After researching the movies with a non-valid tmdbid, we decided that some of them should be dropped from the dataset, while for others the title was kept but the content was replaced.
Content is: actors, plot, reviews, imdb_rating, directors
"""
#Task 1: Bring info about two movies: The Master, Creature
from bs4 import BeautifulSoup
import requests
import re

#URL links to IMDB
the_master_link="http://www.imdb.com/title/tt1560747/"
creature_link="http://www.imdb.com/title/tt1686018/"
content_links=[the_master_link,creature_link]

reviews_master_link=the_master_link+"reviews?spoiler=hide&sort=helpfulnessScore&dir=desc&ratingFilter=0"
reviews_creature_link=creature_link+"reviews?spoiler=hide&sort=helpfulnessScore&dir=desc&ratingFilter=0"
reviews_links=[reviews_master_link,reviews_creature_link]

content_html=[requests.get(i) for i in content_links]
reviews_html=[requests.get(i) for i in reviews_links]

content_soup=[BeautifulSoup(i.text, "html.parser") for i in content_html]
reviews_soup=[BeautifulSoup(i.text, "html.parser") for i in reviews_html]

"""
Field 1: Extract plot summary
"""
myfield_plot = []
plot_summary = []
index_to_remove_no_plot = []

[myfield_plot.append(i.find_all('div', {'class':'plot_summary'})) for i in tqdm(content_soup, desc="Plot Summary")]
[[[plot_summary.append(y.text) for y in x.find_all('div', {'class':'summary_text'})] for x in i] if len(i) !=0 else index_to_remove_no_plot.append(myfield_plot.index(i)) for i in myfield_plot]
print("Length of the list with Movies that don't have plot summary: {}".format(len(index_to_remove_no_plot)))
if len(index_to_remove_no_plot) == 0:
    print("None of the movie miss plot")
else:
    print("Indexes to remove: {}".format(index_to_remove_no_plot))

"""
Field 2: Extract actors
"""
myfield_cast = []
phase_two = []
phase_three = []
actors_list = []
index_to_remove_no_actors = []

[myfield_cast.append(i.find_all('table', {'class':'cast_list'})) for i in tqdm(content_soup, desc="Actors")]
r_one = re.compile(".*name")
[[phase_two.append(j.find_all('a', {'href':r_one})) for j in i] for i in myfield_cast]
[phase_three.append(phase_two[i][1::2]) for i in range(len(phase_two))]
[actors_list.append(list(map(lambda x: x.text.strip(' ').replace('\n', ''), actors))) for actors in phase_three]            
index_to_remove_no_actors = [i for i,x in enumerate(myfield_cast) if not x]
print("Length of the list with Movies that don't have actors: {}".format(len(index_to_remove_no_actors)))
if len(index_to_remove_no_actors) == 0:
    print("None of the movie miss actors")
else:
    print("Indexes to remove: {}".format(index_to_remove_no_actors))

"""
Field 3: Extract director name(s)
"""
myfield_director = []
director_name = []
index_to_remove_no_directors = []

[myfield_director.append(i.find_all('div', {'class':'plot_summary'})) for i in tqdm(content_soup, desc="Director names")]
r_name = re.compile(".*name")
[[director_name.append(j.find_all('a', {'href':r_name})) for j in i] for i in myfield_director]
director_names = [item[0].text for item in director_name if len(item)!=0]
index_to_remove_no_directors = [i for i,x in enumerate(myfield_director) if not x]
print("Length of the list with Movies that don't have directors: {}".format(len(index_to_remove_no_directors)))
if len(index_to_remove_no_directors) == 0:
    print("None of the movie miss directors")
else:
    print("Indexes to remove: {}".format(index_to_remove_no_directors))

"""
Field 4: Extract imdb movie rating
"""
myfield_rating = []
ratings = []
index_to_remove_no_rating = []

[myfield_rating.append(i.find_all('div', {'class':'ratingValue'})) for i in tqdm(content_soup, desc="IMDB rating")]
[[[ratings.append(y.text) for y in x.find_all('span', {'itemprop':'ratingValue'})] for x in i] for i in myfield_rating]
index_to_remove_no_rating = [i for i,x in enumerate(myfield_rating) if not x]
print("Length of the list with Movies that are not rated: {}".format(len(index_to_remove_no_rating)))
if len(index_to_remove_no_rating) == 0:
    print("None of the movie miss ratings")
else:
    print("Indexes to remove: {}".format(index_to_remove_no_rating))

"""
Field 5: Extract movie reviews
"""
myfield_review_step_one = []
myfield_review_step_two = []
myfield_review_step_three = []

[myfield_review_step_one.append(i.find_all('div', {'class':'lister-list'})) for i in tqdm(reviews_soup, desc="User reviews")]
[[myfield_review_step_two.append(j.find_all('div', {'class':'text show-more__control'})) for j in i] for i in myfield_review_step_one]
[myfield_review_step_three.append(list(map(lambda x: x.text, reviews))) for reviews in myfield_review_step_two]
index_to_remove_no_review = [i for i,x in enumerate(myfield_review_step_one) if not x]
print("Length of the list with Movies that don't have reviews: {}".format(len(index_to_remove_no_review)))
if len(index_to_remove_no_review) == 0:
    print("None of the movies miss reviews")
else:
    print("Indexes to remove: {}".format(index_to_remove_no_review))
# print([i for i,x in enumerate(myfield_review_step_one) if not x])
# print([i for i,x in enumerate(myfield_review_step_two) if not x])
# print([i for i,x in enumerate(myfield_review_step_three) if not x])

Plot Summary:   0%|          | 0/2 [00:00<?, ?it/s]

Length of the list with Movies that don't have plot summary: 0
None of the movie miss plot


Actors:   0%|          | 0/2 [00:00<?, ?it/s]

Length of the list with Movies that don't have actors: 0
None of the movie miss actors


Director names:   0%|          | 0/2 [00:00<?, ?it/s]

Length of the list with Movies that don't have directors: 0
None of the movie miss directors


IMDB rating:   0%|          | 0/2 [00:00<?, ?it/s]

Length of the list with Movies that are not rated: 0
None of the movie miss ratings


User reviews:   0%|          | 0/2 [00:00<?, ?it/s]

Length of the list with Movies that don't have reviews: 0
None of the movies miss reviews


In [45]:
"""
Task 1 (continued)
Also extract the year and the genre of the movies.
"""
year_value=[movie.details(68722.0).release_date.split("-")[0],movie.details(73963.0).release_date.split("-")[0]]
genres_value=[[movie.details(68722.0).genres[0]['name']],[movie.details(73963.0).genres[0]['name']]]
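The cell above reads the year from `release_date` and keeps only the first genre name. As a rough sketch, the same two steps can be written as small helpers. The `details` dictionary below is a made-up stand-in for a TMDB details response (it is not real data), and `year_from_release_date` and `genre_names` are hypothetical names. Unlike the cell above, `genre_names` returns every genre, which is a variation and not the notebook's behaviour.

```python
# Made-up stand-in for one details response; only the field names
# (release_date, genres[i]['name']) mirror the cell above.
details = {
    "release_date": "1999-12-31",
    "genres": [{"id": 1, "name": "Drama"}, {"id": 2, "name": "Mystery"}],
}

def year_from_release_date(release_date):
    """Return the 'YYYY' part of a 'YYYY-MM-DD' string, or '' when it is empty."""
    return release_date.split("-")[0] if release_date else ""

def genre_names(genres):
    """Return the name of every genre entry as a list of strings."""
    return [genre["name"] for genre in genres]

print(year_from_release_date(details["release_date"]))  # 1999
print(genre_names(details["genres"]))                   # ['Drama', 'Mystery']
```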

In [46]:
"""Clean the plot summary and reviews extracted for the new movies"""
plot_summary_cleaned=[re.sub(' +', ' ',plot_text.strip().replace(',', ', ').replace('.', '. ').replace('?', '? ').replace('!', '! ').replace('\n                    See full summary\xa0»', '').rstrip()) for plot_text in plot_summary]
reviews_cleaned=[list(map(lambda x: re.sub(' +', ' ', x.strip().replace(',', ', ').replace('.', '. ').replace('?', '? ').replace('!', '! ').replace('\n                    See full summary\xa0»', '').rstrip()), reviews)) for reviews in myfield_review_step_three]
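The two one-liners above apply the same chain of replacements to plot summaries and to reviews. A minimal sketch of that chain as a reusable helper is shown below. The name `clean_text` is hypothetical, and the "See full summary" suffix is matched with a regular expression instead of the exact string, which is a slight generalisation of the original.

```python
import re

def clean_text(text):
    """Add a space after , . ? ! then drop the 'See full summary' suffix and collapse spaces."""
    text = text.strip()
    for mark in (",", ".", "?", "!"):
        text = text.replace(mark, mark + " ")
    # Assumption: the suffix is a newline, some whitespace and the link label.
    text = re.sub(r"\n\s*See full summary\xa0»", "", text)
    return re.sub(" +", " ", text.rstrip())

print(clean_text("Great movie,really!Loved it.  "))  # Great movie, really! Loved it.
print(clean_text("7.5/10"))                          # 7. 5/10
```

The second call shows a side effect of this approach: decimals and ellipses are split as well, which is why values such as `7. 5/10` and `. . .` appear in the cleaned text.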

In [47]:
"""Finally, write the extracted values into the rows of the movies 'The Master' and 'Creature'"""
dataset_overview.iloc[2079,1]=[genres_value[0]] # the genres must be stored as a list for the notebooks to come
dataset_overview.iloc[3002,1]=[genres_value[1]] # the genres must be stored as a list for the notebooks to come

dataset_overview.iloc[2079,3]=content_links[0]
dataset_overview.iloc[3002,3]=content_links[1]

dataset_overview.iloc[2079,4]=reviews_links[0]
dataset_overview.iloc[3002,4]=reviews_links[1]

dataset_overview.iloc[2079,5][0:len(dataset_overview.iloc[2079,5])]=actors_list[0][0:len(dataset_overview.iloc[2079,5])]
dataset_overview.iloc[3002,5][0:len(dataset_overview.iloc[3002,5])]=actors_list[1][0:len(dataset_overview.iloc[3002,5])]

dataset_overview.iloc[2079,6]=plot_summary_cleaned[0]
dataset_overview.iloc[3002,6]=plot_summary_cleaned[1]

dataset_overview.iloc[2079,7]=ratings[0]
dataset_overview.iloc[3002,7]=ratings[1]

dataset_overview.iloc[2079,8]=director_names[0]
dataset_overview.iloc[3002,8]=director_names[1]

dataset_overview.iloc[2079,9][0:len(dataset_overview.iloc[2079,9])]=reviews_cleaned[0][0:len(dataset_overview.iloc[2079,9])]
dataset_overview.iloc[3002,9][0:len(dataset_overview.iloc[3002,9])]=reviews_cleaned[1][0:len(dataset_overview.iloc[3002,9])]

dataset_overview.iloc[2079,10]=year_value[0]
dataset_overview.iloc[3002,10]=year_value[1]

dataset_overview.iloc[2079,13]=[genres_value[0]]
dataset_overview.iloc[3002,13]=[genres_value[1]]

In [48]:
"""Task 2 keeps, per movie, the longer of the plot summary and the overview text as a single corpus column (the plot is used when the overview is blank)"""
dataset_plot_overview=dataset_overview.copy()
dataset_plot_overview["plot_overview"]=dataset_plot_overview.apply(lambda x: x["plot"] if (x["overview"]==" " or len(x["plot"])>len(x["overview"])) else x["overview"], axis=1)
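The selection rule in the lambda above can also be read as a small named function. This is a sketch only: `pick_plot_or_overview` is a hypothetical name, and the guard for a non-string overview (for example a missing value) is an assumption that is not part of the original cell.

```python
def pick_plot_or_overview(plot, overview):
    """Return the plot when the overview is blank or shorter, otherwise the overview."""
    # Assumption: a non-string overview (e.g. NaN) is treated like a blank one.
    if not isinstance(overview, str) or overview == " ":
        return plot
    return plot if len(plot) > len(overview) else overview

print(pick_plot_or_overview("A long plot summary.", "Short."))      # A long plot summary.
print(pick_plot_or_overview("Short.", "A much longer overview."))   # A much longer overview.
print(pick_plot_or_overview("Only the plot.", " "))                 # Only the plot.
```

Such a function could be passed to `apply` row by row, for example `lambda x: pick_plot_or_overview(x["plot"], x["overview"])`. When both texts have the same length the overview is kept, as in the lambda.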

In [49]:
"""Task 3 joins the first three reviews of each movie into one string; movies with fewer than three reviews use all the reviews they have."""
dataset_plot_overview["reviews_enriched"]=dataset_plot_overview["reviews"].apply(lambda x: " ".join(x[0:3]) if len(x)>=3 else " ".join(x[0:2]) if len(x)==2 else x[0])
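Because slicing a list never fails when the list is shorter than the slice, the three branches of the lambda above can be expressed with a single slice. The sketch below uses the hypothetical name `join_first_reviews`. It differs from the lambda in one case: an empty review list yields an empty string, where the lambda would raise an `IndexError`.

```python
def join_first_reviews(reviews, limit=3):
    """Join at most `limit` reviews into one string separated by spaces."""
    return " ".join(reviews[:limit])

print(join_first_reviews(["a", "b", "c", "d"]))  # a b c
print(join_first_reviews(["a", "b"]))            # a b
print(join_first_reviews(["a"]))                 # a
```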

In [50]:
"""
Display the rows of the dataframe that contain exactly one review.
This code block is only for demonstration purposes and shows an easy way to filter cells that hold list content.
"""
dataset_plot_overview[dataset_plot_overview["reviews"].map(len)==1]

Unnamed: 0,title,genres,rating,imdb_url,reviews_url,actors,plot,imdb_rating,director,reviews,...,movie_features,reduced_genres,tmdbId,_merge,index,tmdbid_updated,equal,overview,plot_overview,reviews_enriched
55,Kids of the Round Table,"[Adventure, Children, Comedy]",2.06,http://www.imdb.com/title/tt0113541/,http://www.imdb.com/title/tt0113541/reviews?spoiler=hide&sort=helpfulnessScore&dir=desc&ratingFilter=0,"[Johnny Morina, Maggie Castle, Christopher Olscamp, Justin Borntraeger, Billy Coyle, Jeoffrey Graves, Malcolm McDowell, Peter Aykroyd, Mélany Goudreau, James Rae, Jamieson Boulanger, Roc LaFortune, Michael Ironside, René Simard, Melissa Altro]","Eleven-year-old Alex and his fellow homemade heroes are having a backyard blast. Battles rage, knights fight and damsels distress in a cardboard Camelot of dirt-bike steeds, aluminum foil . . .",4.8,Robert Tinnell,"[when i was a little younger i loved this movie and now i watch it and it's boring and major un-exciting, its a cute movie for kids, like kids 9 or 10 but if you watch it and you are over that age, i dont think you will like it, its based on kids and things kids do]",...,"Kids of the Round Table Johnny Morina Maggie Castle Christopher Olscamp Justin Borntraeger Billy Coyle Jeoffrey Graves Malcolm McDowell Peter Aykroyd Mélany Goudreau James Rae Jamieson Boulanger Roc LaFortune Michael Ironside René Simard Melissa Altro Robert Tinnell Eleven-year-old Alex and his fellow homemade heroes are having a backyard blast. Battles rage, knights fight and damsels distress in a cardboard Camelot of dirt-bike steeds, aluminum foil . . . Adventure Children Comedy","[Adventure, Children, Comedy]",124057.0,both,55,124057.0,True,"Set in modern times, Alex finds King Arthur's sword Excalibur and must prove himself worthy of it.","Eleven-year-old Alex and his fellow homemade heroes are having a backyard blast. Battles rage, knights fight and damsels distress in a cardboard Camelot of dirt-bike steeds, aluminum foil . . 
.","when i was a little younger i loved this movie and now i watch it and it's boring and major un-exciting, its a cute movie for kids, like kids 9 or 10 but if you watch it and you are over that age, i dont think you will like it, its based on kids and things kids do"
106,Catwalk,[Documentary],3.04,http://www.imdb.com/title/tt0112646/,http://www.imdb.com/title/tt0112646/reviews?spoiler=hide&sort=helpfulnessScore&dir=desc&ratingFilter=0,"[Christy Turlington, Azzedine Alaïa, Giorgio Armani, Nadja Auermann, Sandra Bernhard, Kate Betts, Brandi, Carla Bruni, Naomi Campbell, Nino Cerruti, Helena Christensen, Francesco Clemente, Grace Coddington, Cindy Crawford, Rufus Crawford]","A camera follows model Christy Turlington through the spring fashion shows in Milan, Paris, and New York one year in the early 1990s, probably 1992. She and others dash from one designer's . . .",6.9,Robert Leacock,"[Christy Turlington is the perfect person for documenting how the modeling and fashion industries collide. This film was a diamond in the rough. It not keeps true to the pure nature of the world of fashion and it's industry. It finally provides the respect needed for the modeling industry. \nThis film follows the most highly respected models in the Fashion History. A 90+ minute chronicle of Christy Turlington's world wind life during collections in the fashion capitals. Moving from Milan to Paris and then ending in New York. The movie provides a realistic view of a model and the lives they lead. They are just like the rest of us. One exception - in the case of Christy, she is flawless by the centimeter. What I would like to see is Christy do a follow up documentary - since the filming of this movie in 1993 she has grown considerably in her career as well as her personal life. 
\nThe documentary explains why Christy is one of the very few Supermodels still on top of an industry that is ever changing.]",...,"Catwalk Christy Turlington Azzedine Alaïa Giorgio Armani Nadja Auermann Sandra Bernhard Kate Betts Brandi Carla Bruni Naomi Campbell Nino Cerruti Helena Christensen Francesco Clemente Grace Coddington Cindy Crawford Rufus Crawford Robert Leacock A camera follows model Christy Turlington through the spring fashion shows in Milan, Paris, and New York one year in the early 1990s, probably 1992. She and others dash from one designer's . . . Documentary",[Documentary],89333.0,both,106,89333.0,True,"A documentary following Christy Turlington and other models during spring fashion week in Milan, Paris and New York.","A camera follows model Christy Turlington through the spring fashion shows in Milan, Paris, and New York one year in the early 1990s, probably 1992. She and others dash from one designer's . . .","Christy Turlington is the perfect person for documenting how the modeling and fashion industries collide. This film was a diamond in the rough. It not keeps true to the pure nature of the world of fashion and it's industry. It finally provides the respect needed for the modeling industry. \nThis film follows the most highly respected models in the Fashion History. A 90+ minute chronicle of Christy Turlington's world wind life during collections in the fashion capitals. Moving from Milan to Paris and then ending in New York. The movie provides a realistic view of a model and the lives they lead. They are just like the rest of us. One exception - in the case of Christy, she is flawless by the centimeter. What I would like to see is Christy do a follow up documentary - since the filming of this movie in 1993 she has grown considerably in her career as well as her personal life. \nThe documentary explains why Christy is one of the very few Supermodels still on top of an industry that is ever changing."
139,Shadows,[Drama],3.06,http://www.imdb.com/title/tt0094878/,http://www.imdb.com/title/tt0094878/reviews?spoiler=hide&sort=helpfulnessScore&dir=desc&ratingFilter=0,"[Zygmunt Bielawski, Jerzy Binczycki, Mieczyslaw Janowski, Jerzy Kaszubowski, Andrzej Kowalik, Bogdan Kuczkowski, Eugeniusz Kujawski, Igor Kujawski, Eliasz Kuziemski, Slawa Kwasniewska, Boguslaw Linda, Beata Maj, Ludomir Olszewski, Ryszard Radwanski, Beata Redodober]","During the Second World War, tens of thousands of blonde, blue-eyed Polish children were snatched from their parents and given to German families. Lebensborn was part of Hitler's plan to . . .",6.5,Jerzy Kaszubowski,"[The Polish title translates to Shadows, but the movie can be found in the rental services under the name The Road Home (which risks confusion with a 1999 Chinese movie with the same title). The place is Poland and the year is 1945, right after the end of WWII. Russians soldiers and tanks are ubiquitous and the fledgling communist government has a tenuous hold on power, as Nationalist partisans roam the forests and conduct night raids. Ethnic Germans are being deported back to Germany in appalling conditions. The corruption and misdeeds of the prewar Polish government are being exposed, principally the appalling lack of military preparation prior to the German onslaught in 1939. All of this is seen through the eyes of Jerzy Ostrovsky, a ten year old child. Jerzy is blond and blue-eyed, thus a candidate for membership in the mythical ""Aryan race""; he was an unwitting part of a Nazi program to take by force children like him to Germany to be indoctrinated as ""culturally German"". Jerzy has just been repatriated and returned to his family; his mother (whose welcome is strangely subdued) and his paternal grandparents (Jerzy's father is missing since 1939 and presumed dead). Jerzy's family seems to have weathered the Nazi occupation; they still possess their large country estate more or less intact. 
Jerzy's mother refuses to play her role as widow of a hero and is struggling to lead a normal life, while Jerzy's grandfather clings to Poland's traditions and past military glory. Jerzy is more proficient in German than in Polish (he is taunted and abused by other children at school) and he refuses to believe his father is dead. Jerzy's take on reality is poetic and has at times the quality of a fairy tale, and we see what he actually witnesses as well what he imagines or dreams. The movie is heavily symbolic, with an eagle, a white horse and the father's ghost a part of the plot. Director Jerzy Kaszubowski, who also wrote the script does an excellent job of integrating disparate elements into a coherent movie, and acting is excellent all around. If anything could be objected is that knowledge of recent Polish history is taken for granted, which may disorient at times the non-Polish viewer (like me).]",...,"Shadows Zygmunt Bielawski Jerzy Binczycki Mieczyslaw Janowski Jerzy Kaszubowski Andrzej Kowalik Bogdan Kuczkowski Eugeniusz Kujawski Igor Kujawski Eliasz Kuziemski Slawa Kwasniewska Boguslaw Linda Beata Maj Ludomir Olszewski Ryszard Radwanski Beata Redodober Jerzy Kaszubowski During the Second World War, tens of thousands of blonde, blue-eyed Polish children were snatched from their parents and given to German families. Lebensborn was part of Hitler's plan to . . . Drama",[Drama],525153.0,both,139,525153.0,True,"During the Second World War, tens of thousands of blonde, blue-eyed Polish children were snatched from their parents and given to German families. Lebensborn was part of Hitler's plan to expand the Aryan master race within the Third Reich. In 'The Road Home', eight-year old Jerzy returns home at the end of the war to a joyful reunion with his long-lost mother and grandfather. 
But problems arise as he is taunted by his peers and, longing for his missing father, burns with resentment for his new communist stepfather.","During the Second World War, tens of thousands of blonde, blue-eyed Polish children were snatched from their parents and given to German families. Lebensborn was part of Hitler's plan to expand the Aryan master race within the Third Reich. In 'The Road Home', eight-year old Jerzy returns home at the end of the war to a joyful reunion with his long-lost mother and grandfather. But problems arise as he is taunted by his peers and, longing for his missing father, burns with resentment for his new communist stepfather.","The Polish title translates to Shadows, but the movie can be found in the rental services under the name The Road Home (which risks confusion with a 1999 Chinese movie with the same title). The place is Poland and the year is 1945, right after the end of WWII. Russians soldiers and tanks are ubiquitous and the fledgling communist government has a tenuous hold on power, as Nationalist partisans roam the forests and conduct night raids. Ethnic Germans are being deported back to Germany in appalling conditions. The corruption and misdeeds of the prewar Polish government are being exposed, principally the appalling lack of military preparation prior to the German onslaught in 1939. All of this is seen through the eyes of Jerzy Ostrovsky, a ten year old child. Jerzy is blond and blue-eyed, thus a candidate for membership in the mythical ""Aryan race""; he was an unwitting part of a Nazi program to take by force children like him to Germany to be indoctrinated as ""culturally German"". Jerzy has just been repatriated and returned to his family; his mother (whose welcome is strangely subdued) and his paternal grandparents (Jerzy's father is missing since 1939 and presumed dead). Jerzy's family seems to have weathered the Nazi occupation; they still possess their large country estate more or less intact. 
Jerzy's mother refuses to play her role as widow of a hero and is struggling to lead a normal life, while Jerzy's grandfather clings to Poland's traditions and past military glory. Jerzy is more proficient in German than in Polish (he is taunted and abused by other children at school) and he refuses to believe his father is dead. Jerzy's take on reality is poetic and has at times the quality of a fairy tale, and we see what he actually witnesses as well what he imagines or dreams. The movie is heavily symbolic, with an eagle, a white horse and the father's ghost a part of the plot. Director Jerzy Kaszubowski, who also wrote the script does an excellent job of integrating disparate elements into a coherent movie, and acting is excellent all around. If anything could be objected is that knowledge of recent Polish history is taken for granted, which may disorient at times the non-Polish viewer (like me)."
399,Brother Minister: The Assassination of Malcolm X,[Documentary],3.32,http://www.imdb.com/title/tt0109339/,http://www.imdb.com/title/tt0109339/reviews?spoiler=hide&sort=helpfulnessScore&dir=desc&ratingFilter=0,"[Peter Bailey, Roscoe Lee Browne, John Henrik Clarke, Louis Farrakhan, James Fox, Robert Haggins, Robert L. Haggins, Khalil Islam, Benjamin Karim, Charles Kenyatta, Bab Zak Kondo, William Kunstler, Malcolm X, Elijah Muhammad, Wallace D. Muhammad]","Brother Minister reveals the mystery surrounding the assassination of Malcolm X at the Audubon Ballroom in New York City on February 21, 1965. It probes the innocence of two of the . . .",7.2,Jefri Aalmuhammed,"[If you have read the Autobiography of Malcolm X, you probably have most of the background. And the actual assassination is not clear at the end. Maybe I am wrong for expecting a solution. But this documentary is about the production team's efforts to find people related to the event and that is about it. Much conspiracy. Speculation. Not much data. Yet a lot of ""I heard from someone"". In the end I felt this was a waste of my time. Contact me with Questions, Comments or Suggestions ryitfork @ bitmail. ch]",...,"Brother Minister: The Assassination of Malcolm X Peter Bailey Roscoe Lee Browne John Henrik Clarke Louis Farrakhan James Fox Robert Haggins Robert L. Haggins Khalil Islam Benjamin Karim Charles Kenyatta Bab Zak Kondo William Kunstler Malcolm X Elijah Muhammad Wallace D. Muhammad Jefri Aalmuhammed Brother Minister reveals the mystery surrounding the assassination of Malcolm X at the Audubon Ballroom in New York City on February 21, 1965. It probes the innocence of two of the . . . Documentary",[Documentary],316098.0,both,399,316098.0,True,"Brother Minister reveals the mystery surrounding the assassination of Malcolm X at the Audubon Ballroom in New York City on February 21, 1965. 
It probes the innocence of two of the ...","Brother Minister reveals the mystery surrounding the assassination of Malcolm X at the Audubon Ballroom in New York City on February 21, 1965. It probes the innocence of two of the . . .","If you have read the Autobiography of Malcolm X, you probably have most of the background. And the actual assassination is not clear at the end. Maybe I am wrong for expecting a solution. But this documentary is about the production team's efforts to find people related to the event and that is about it. Much conspiracy. Speculation. Not much data. Yet a lot of ""I heard from someone"". In the end I felt this was a waste of my time. Contact me with Questions, Comments or Suggestions ryitfork @ bitmail. ch"
615,Asfour Stah,[Drama],3.27,http://www.imdb.com/title/tt0090665/,http://www.imdb.com/title/tt0090665/reviews?spoiler=hide&sort=helpfulnessScore&dir=desc&ratingFilter=0,"[Selim Boughedir, Mustapha Adouani, Rabia Ben Abdallah, Mohamed Driss, Hélène Catzaras, Fatma Ben Saïdane, Abdelhamid Gayess, Jamel Sassi, Radhouane Meddeb, Carolyn Chelby, Zahira Ben Ammar, Fethi Haddaoui, Issa Harath, Taoufik Chabchoub, Slah Msadek]",A coming-of-age comedy/drama set in Tunisia. Twelve-year-old Noura is an impressionable boy who must learn to reconcile two conflicting worlds - the loving world of Muslim women and the . . .,6.7,Férid Boughedir,"[""Halfaouine: Boy of the Terraces"" (1990, Ferid Boughedir) Apparently puberty's the same for everybody, everywhere. Transcending time and place, date and setting, the film is a coming-of-age tale far from the thematically similar Tunisian entry ""The Silences of the Palace"", which depicting in scorching detail the plight of the woman in this society and her expectory burden of rape and sexual favors, ""Halfaouine"" depicts male-female Tunisian relations as much like our own American, with a lot of strong and knowing females and the lascivious lunkhead sweethearts who love them. Into this world is twelve-year-old Noura, endeavoring on a quest to understand the world of women, how to be a man, relationships, and above all, to see boobs. Now, this is not the demented and disgusting brilliance of something like ""Leolo"", and it's not any outrageous or controversial endeavor on the level of ""Maladolescenza"" or ""Murmur of the Heart"", but its intent is to be breezy entertaining, not to get people talking about it in shocked whispers to their friends. 
The film has a solid grasp for both a good punchline (for instance, when a womanizer gives a young boy advice on how to get with women, the film cuts directly to the boy getting punished for harassing local girls) and a good visual joke (as when a street argument is cooled as the participants find a common bond: ogling a woman as she saunters away), functioning as little more than mindless entertainment. It fails to be overly notable outside of surprising me how comfortable Tunisia is with nudity, as the film is filled to the brim with it. Lead actor Selim Boughedir is not only sort of dumb-looking, but also only seems to have two facial expressions: A blank slate and a sheepish grin. The rest of the characters are ill-defined, and most of the women, with the except his town-temptress aunt, are more notable for their bodies than their lives, which, it could be argued, is somewhat appropriate, considering the movie's title character is a horny, inexperienced twelve-year-old. In the end, it's a harmless bit a fluff, a sex comedy with broad characters and a surplus of nudity, and more than anything else, I think it's nice to know that no matter where you're from, be it the suburbs of Michigan, the neighborhoods of Paris or the streets of Tunisia, every boy around the world just wants to get a little action. {Grade: 7. 5/10 (B-) / #15 (of 24) of 1990}]",...,Asfour Stah Selim Boughedir Mustapha Adouani Rabia Ben Abdallah Mohamed Driss Hélène Catzaras Fatma Ben Saïdane Abdelhamid Gayess Jamel Sassi Radhouane Meddeb Carolyn Chelby Zahira Ben Ammar Fethi Haddaoui Issa Harath Taoufik Chabchoub Slah Msadek Férid Boughedir A coming-of-age comedy/drama set in Tunisia. Twelve-year-old Noura is an impressionable boy who must learn to reconcile two conflicting worlds - the loving world of Muslim women and the . . . Drama,[Drama],44281.0,both,615,44281.0,True,"Twelve-year-old Noura dangles uncertainly in that difficult netherworld between childhood and adulthood. 
His growing libido has gotten him banned from the women's baths, where his mother took him when he was younger, but he's not yet old enough to participate in grown-up discussions with the men of his Tunisian village. Noura's only real friend is a troublemaker named Salih -- the village political outcast.","Twelve-year-old Noura dangles uncertainly in that difficult netherworld between childhood and adulthood. His growing libido has gotten him banned from the women's baths, where his mother took him when he was younger, but he's not yet old enough to participate in grown-up discussions with the men of his Tunisian village. Noura's only real friend is a troublemaker named Salih -- the village political outcast.","""Halfaouine: Boy of the Terraces"" (1990, Ferid Boughedir) Apparently puberty's the same for everybody, everywhere. Transcending time and place, date and setting, the film is a coming-of-age tale far from the thematically similar Tunisian entry ""The Silences of the Palace"", which depicting in scorching detail the plight of the woman in this society and her expectory burden of rape and sexual favors, ""Halfaouine"" depicts male-female Tunisian relations as much like our own American, with a lot of strong and knowing females and the lascivious lunkhead sweethearts who love them. Into this world is twelve-year-old Noura, endeavoring on a quest to understand the world of women, how to be a man, relationships, and above all, to see boobs. Now, this is not the demented and disgusting brilliance of something like ""Leolo"", and it's not any outrageous or controversial endeavor on the level of ""Maladolescenza"" or ""Murmur of the Heart"", but its intent is to be breezy entertaining, not to get people talking about it in shocked whispers to their friends. 
The film has a solid grasp for both a good punchline (for instance, when a womanizer gives a young boy advice on how to get with women, the film cuts directly to the boy getting punished for harassing local girls) and a good visual joke (as when a street argument is cooled as the participants find a common bond: ogling a woman as she saunters away), functioning as little more than mindless entertainment. It fails to be overly notable outside of surprising me how comfortable Tunisia is with nudity, as the film is filled to the brim with it. Lead actor Selim Boughedir is not only sort of dumb-looking, but also only seems to have two facial expressions: A blank slate and a sheepish grin. The rest of the characters are ill-defined, and most of the women, with the except his town-temptress aunt, are more notable for their bodies than their lives, which, it could be argued, is somewhat appropriate, considering the movie's title character is a horny, inexperienced twelve-year-old. In the end, it's a harmless bit a fluff, a sex comedy with broad characters and a surplus of nudity, and more than anything else, I think it's nice to know that no matter where you're from, be it the suburbs of Michigan, the neighborhoods of Paris or the streets of Tunisia, every boy around the world just wants to get a little action. {Grade: 7. 5/10 (B-) / #15 (of 24) of 1990}"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48807,Kaloyan,"[Drama, History]",2.75,http://www.imdb.com/title/tt0311383/,http://www.imdb.com/title/tt0311383/reviews?spoiler=hide&sort=helpfulnessScore&dir=desc&ratingFilter=0,"[Vasil Stoychev, Bogomil Simeonov, Spas Dzhonev, Andrey Mihaylov, Magdalena Mircheva, Tzvetana Maneva, Ivan Stefanov, Ivan Tonev, Nikolay Doychev, Lyubomir Dimitrov, Bozhidar Lechev, Konstantin Dimchev, Stefan Petrov, Boris Mihaylov, Vasil Prodanov]",1197. King Kaloyan ascends the throne in hard times for Bulgaria. The country is still recovering from a century of Byzantine subjugation. He is forced to carry out a very flexible foreign . . .,7.7,Yuriy Arnaudov,"[For me Kaloyan is a greatly underestimated achievement of Bulgarian cinema, perhaps because its generation could not appreciate its advantages - one expects to see just a war movie that shows how great this Bulgarian tzar was (and he certainly was). However this movie is not just a lesson in History - it makes you be Kaloyan, feel his emotions, think the way he thinks. His character is realistic as well - equally fair and brutal, which a real leader had to be. The story is complex and dynamic as it also features minor characters (in terms of history); the music is suitable and the battle scenes are well made. Its atmosphere is certainly memorable. I consider the film as obligatory for Bulgarian movie fans; personally I prefer it to Aszparuh.]",...,Kaloyan Vasil Stoychev Bogomil Simeonov Spas Dzhonev Andrey Mihaylov Magdalena Mircheva Tzvetana Maneva Ivan Stefanov Ivan Tonev Nikolay Doychev Lyubomir Dimitrov Bozhidar Lechev Konstantin Dimchev Stefan Petrov Boris Mihaylov Vasil Prodanov Yuriy Arnaudov 1197. King Kaloyan ascends the throne in hard times for Bulgaria. The country is still recovering from a century of Byzantine subjugation. He is forced to carry out a very flexible foreign . . . Drama History,[Drama],248775.0,both,48848,248775.0,True,"1197. King Kaloyan ascends the throne in hard times for Bulgaria. 
The country is still recovering from a century of Byzantine subjugation. He is forced to carry out a very flexible foreign policy in order to strength his positions. Pope Innocent III recognizes him as Emperor (Tsar), but a little later the fourth Crusade crosses the country under Emperor Baldwin. A new conflict is coming. Tsar Kaloyan wages the decisive battle at Adrianople and wins.","1197. King Kaloyan ascends the throne in hard times for Bulgaria. The country is still recovering from a century of Byzantine subjugation. He is forced to carry out a very flexible foreign policy in order to strength his positions. Pope Innocent III recognizes him as Emperor (Tsar), but a little later the fourth Crusade crosses the country under Emperor Baldwin. A new conflict is coming. Tsar Kaloyan wages the decisive battle at Adrianople and wins.","For me Kaloyan is a greatly underestimated achievement of Bulgarian cinema, perhaps because its generation could not appreciate its advantages - one expects to see just a war movie that shows how great this Bulgarian tzar was (and he certainly was). However this movie is not just a lesson in History - it makes you be Kaloyan, feel his emotions, think the way he thinks. His character is realistic as well - equally fair and brutal, which a real leader had to be. The story is complex and dynamic as it also features minor characters (in terms of history); the music is suitable and the battle scenes are well made. Its atmosphere is certainly memorable. I consider the film as obligatory for Bulgarian movie fans; personally I prefer it to Aszparuh."
48809,"Parakalo, gynaikes, min klaite...","[Comedy, Drama]",3.50,http://www.imdb.com/title/tt0237541/,http://www.imdb.com/title/tt0237541/reviews?spoiler=hide&sort=helpfulnessScore&dir=desc&ratingFilter=0,"[Argyris Bakirtzis, Dimitris Vlahos, Dora Masklavanou, Nikolas Kekos, Emily Papahristou, Makis Artopoulos, Nikos Bogris, Dina Bouli-Gounari, Evgenia Fousiani, Konstadinos Fousianis, Evelyn Gavrilou, Voula Glinou, Dinos Haralambopoulos, Eirini Kaltezioti, Antonis Kioukas]","A village in Arcadia invites a ""famous"" icon painter and his assistant, Theofanis and Theodosios (Argyris Bakirtzis and Dimitris Vlachos), to restore the worn-out murals in an historic . . .",7.1,Stavros Tsiolis,"[This 1992 film was directed by Stavros Tsiolis and Christos Vakalopoulos who unfortunately died of cancer a little time later. The film tells the story of a painter of religious paintings and his apprentice who are hired by the town council of a small town in the Greek province to restore the paintings of their church. The painter and his apprentice begin their journey, arrive to the town, meet with all kinds of people, make philosophic conversations, do almost everything except getting their job done, but this doesn't matter at all. As the Greek poet Kostas Kavafis writes in his poem ""Ithaki"". what is important is not reaching the destination but the journey itself. This is exactly the case in the film. There is a strong comic aura in the film and the plot flows smoothly without the film becoming boring or ""pseudo-sophisticated"", something that was happening in most of the Greek films of the 80's and early 90's. This film reconciled me with the Greek cinema.]",...,"Parakalo, gynaikes, min klaite... 
Argyris Bakirtzis Dimitris Vlahos Dora Masklavanou Nikolas Kekos Emily Papahristou Makis Artopoulos Nikos Bogris Dina Bouli-Gounari Evgenia Fousiani Konstadinos Fousianis Evelyn Gavrilou Voula Glinou Dinos Haralambopoulos Eirini Kaltezioti Antonis Kioukas Stavros Tsiolis A village in Arcadia invites a ""famous"" icon painter and his assistant, Theofanis and Theodosios (Argyris Bakirtzis and Dimitris Vlachos), to restore the worn-out murals in an historic . . . Comedy Drama","[Comedy, Drama]",312510.0,both,48850,312510.0,True,,"A village in Arcadia invites a ""famous"" icon painter and his assistant, Theofanis and Theodosios (Argyris Bakirtzis and Dimitris Vlachos), to restore the worn-out murals in an historic . . .","This 1992 film was directed by Stavros Tsiolis and Christos Vakalopoulos who unfortunately died of cancer a little time later. The film tells the story of a painter of religious paintings and his apprentice who are hired by the town council of a small town in the Greek province to restore the paintings of their church. The painter and his apprentice begin their journey, arrive to the town, meet with all kinds of people, make philosophic conversations, do almost everything except getting their job done, but this doesn't matter at all. As the Greek poet Kostas Kavafis writes in his poem ""Ithaki"". what is important is not reaching the destination but the journey itself. This is exactly the case in the film. There is a strong comic aura in the film and the plot flows smoothly without the film becoming boring or ""pseudo-sophisticated"", something that was happening in most of the Greek films of the 80's and early 90's. This film reconciled me with the Greek cinema."
48811,Première année,[Drama],1.17,http://www.imdb.com/title/tt6690004/,http://www.imdb.com/title/tt6690004/reviews?spoiler=hide&sort=helpfulnessScore&dir=desc&ratingFilter=0,"[Vincent Lacoste, William Lebghil, Michel Lerousseau, Darina Al Joundi, Benoît Di Marco, Graziella Delerm, Guillaume Clérice, Alexandre Blazy, Noémi Silvania, Laurette Tessier, Adrien Schmück, Quitterie Boascals de Reals, Tiphaine Piovesan, Jeremy Corallo, Mehdi Amouri]",Friendship sparks between newcomer Benjamin and held-back Antoine during the first year of medical school.,6.7,Thomas Lilti,"[That's a purely medical student study movie, made by a former doctor, who lnows what he's talking about. the tale of a friendship between two students with its ups and downs. A very good character study, which also shows what medical students life is actually is. Also a beautiful friendship story pulled by a solid ending. An ending about sacrifice.]",...,Première année Vincent Lacoste William Lebghil Michel Lerousseau Darina Al Joundi Benoît Di Marco Graziella Delerm Guillaume Clérice Alexandre Blazy Noémi Silvania Laurette Tessier Adrien Schmück Quitterie Boascals de Reals Tiphaine Piovesan Jeremy Corallo Mehdi Amouri Thomas Lilti Friendship sparks between newcomer Benjamin and held-back Antoine during the first year of medical school. Drama,[Drama],463858.0,both,48852,463858.0,True,"Antoine is about to start his first year of medical school… for the third time. Benjamin, just out of high school, will make his first try. He soon realizes it's not exactly a walk in the park. In a fiercely competitive environment, with nights dedicated to hard studying rather than hard partying, the two freshmen will have to adapt and find a middle ground between despair for the present and hope for the future.","Antoine is about to start his first year of medical school… for the third time. Benjamin, just out of high school, will make his first try. He soon realizes it's not exactly a walk in the park. 
In a fiercely competitive environment, with nights dedicated to hard studying rather than hard partying, the two freshmen will have to adapt and find a middle ground between despair for the present and hope for the future.","That's a purely medical student study movie, made by a former doctor, who lnows what he's talking about. the tale of a friendship between two students with its ups and downs. A very good character study, which also shows what medical students life is actually is. Also a beautiful friendship story pulled by a solid ending. An ending about sacrifice."
48825,The Happy Sad,"[Drama, Romance]",3.00,http://www.imdb.com/title/tt2049559/,http://www.imdb.com/title/tt2049559/reviews?spoiler=hide&sort=helpfulnessScore&dir=desc&ratingFilter=0,"[Leroy McClain, Sorel Carradine, Charlie Barnett, Cameron Scoggins, Maria Dizzia, Sue Jean Kim, Jamie Harrold, Michael Nathanson, Devon O'Brien, Alex Jenkins, Russell Durham, Ben Sinclair, Ricardo Calderon, Debra Barnes, Selin Baykal]","Two young couples in New York-one black and gay, one white and heterosexual-find their lives intertwined as they create new relationship norms, explore sexual identity, and redefine monogamy.",5.2,Rodney Evans,"[Should have been called The Miserably Sad since there is nary a whiff of happiness in this exercise in bitterness. Who is audience for it supposed to be? Self hating gays? Unhappy heterosexuals? Miserable bisexuals? Selfish jackasses? If so than they hit the nail on the head but if their target audience was discriminating viewers the makers of this picture must have been sorely disappointed, as anybody unfortunate enough to see this will be too. Badly written, poorly acted by most of the cast amateurishly shot and with a callous attitude towards its characters there ISN'T something here for anyone.]",...,"The Happy Sad Leroy McClain Sorel Carradine Charlie Barnett Cameron Scoggins Maria Dizzia Sue Jean Kim Jamie Harrold Michael Nathanson Devon O'Brien Alex Jenkins Russell Durham Ben Sinclair Ricardo Calderon Debra Barnes Selin Baykal Rodney Evans Two young couples in New York-one black and gay, one white and heterosexual-find their lives intertwined as they create new relationship norms, explore sexual identity, and redefine monogamy. 
Drama Romance","[Drama, Romance]",209108.0,both,48866,209108.0,True,"Two young couples in New York-one black and gay, one white and heterosexual-find their lives intertwined as they create new relationship norms, explore sexual identity, and redefine monogamy.","Two young couples in New York-one black and gay, one white and heterosexual-find their lives intertwined as they create new relationship norms, explore sexual identity, and redefine monogamy.","Should have been called The Miserably Sad since there is nary a whiff of happiness in this exercise in bitterness. Who is audience for it supposed to be? Self hating gays? Unhappy heterosexuals? Miserable bisexuals? Selfish jackasses? If so than they hit the nail on the head but if their target audience was discriminating viewers the makers of this picture must have been sorely disappointed, as anybody unfortunate enough to see this will be too. Badly written, poorly acted by most of the cast amateurishly shot and with a callous attitude towards its characters there ISN'T something here for anyone."


#### -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

### Section 2.1.2
To follow through this section you are advised to read the comments on top of each code block.

In [40]:
"""This section is currently empty. We plan to periodically add more movies and genres to the dataset."""

'This section is currently empty. We plan to periodically add more movies and genres to the dataset.'

#### -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

In [51]:
"""Full version, with all columns included: serialize the dataset that will be used in part 3.1 for data tokenization and transformation."""
joblib.dump(dataset_plot_overview,'dataset_part_3.1_22022021.pkl')

['dataset_part_3.1_22022021.pkl']

In [52]:
"""Drop the helper columns (_merge, index, equal) that the following notebooks do not use."""
dataset_plot_overview.drop(["_merge", "index", "equal"], axis=1, inplace=True)
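
Because the drop above runs in place, executing the cell a second time raises a `KeyError` once the three columns are gone. A minimal sketch of a re-runnable variant follows. It uses a hypothetical two-row stand-in (`toy`) rather than the real `dataset_plot_overview`, and the stand-in's column names other than `_merge`, `index` and `equal` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for dataset_plot_overview; only the three helper
# column names are taken from the notebook, the rest are illustrative.
toy = pd.DataFrame({
    "title": ["The Happy Sad", "Première année"],
    "tmdbId": [209108.0, 463858.0],
    "_merge": ["both", "both"],
    "index": [48866, 48852],
    "equal": [True, True],
})

helper_columns = ["_merge", "index", "equal"]

# errors="ignore" skips labels that are no longer present, so the same
# call can be repeated without raising a KeyError.
light = toy.drop(columns=helper_columns, errors="ignore")
light_again = light.drop(columns=helper_columns, errors="ignore")

print(list(light.columns))  # ['title', 'tmdbId']
```

Returning a new frame instead of using `inplace=True` also leaves the full version untouched, which can be convenient if both the full and the light files need to be written again later.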

In [53]:
"""Light version, with only the necessary columns included: serialize the dataset that will be used in part 3.1 for data tokenization and transformation."""
joblib.dump(dataset_plot_overview,'dataset_part_3.1_22022021_light.pkl')

['dataset_part_3.1_22022021_light.pkl']
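
Before moving on to part 3.1, it can be worth confirming that a serialized file loads back unchanged and that no helper column survived. The sketch below shows one way to do that with `joblib.dump` and `joblib.load`. It writes a hypothetical one-row frame to a temporary directory instead of touching the real `.pkl` files, so the frame contents and the file name `toy_light.pkl` are assumptions.

```python
import os
import tempfile

import joblib
import pandas as pd

# Hypothetical stand-in for the light dataset (illustrative values only).
toy = pd.DataFrame({"title": ["The Happy Sad"], "tmdbId": [209108.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_light.pkl")
    # joblib.dump returns the list of file names it wrote,
    # which is what the cell outputs above display.
    written = joblib.dump(toy, path)
    restored = joblib.load(path)

# The reloaded frame should match the original and contain no helper columns.
leftover = {"_merge", "index", "equal"} & set(restored.columns)
print(restored.equals(toy), leftover)
```

The same two checks can be pointed at `dataset_part_3.1_22022021_light.pkl` by replacing the temporary path with the real file name.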

#### THIS IS THE END OF PART 2.1 - Correcting & Enriching the dataset