# Data preparation

The datasets from the IMDB database are normalised to at least third normal form, so this notebook takes the necessary fields and puts them together for quick data retrieval. This is essential, as we want our system to be as fast as possible.

In [153]:
##importing libraries
import pandas as pd #for data manipulation
import unidecode #to replace accents with english letters

## Creating dataframe with Movie/Show information

### 1. Starting with main title file with language information

In [154]:
#reading the title files
df = pd.read_csv('../datasets from IMDB/title_akas.tsv',sep='\t')
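
The IMDB files mark missing values with the literal string `\N`, and pandas may guess mixed types for some columns when it reads a large file in chunks. Below is a minimal sketch of one way to read such a file, shown on a small in-memory sample rather than the real `title_akas.tsv`. The sample rows and the choice to read every column as a string are assumptions made for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for title_akas.tsv (tab-separated, "\N" marks a missing value).
sample_tsv = (
    "titleId\tordering\ttitle\tregion\tlanguage\n"
    "tt0000001\t1\tCarmencita\tDE\t\\N\n"
    "tt0000012\t1\tThe Arrival of a Train\tXWW\ten\n"
)

# dtype=str avoids per-chunk type guessing; na_values turns "\N" into NaN.
# keep_default_na=False stops pandas from treating strings such as "NA" as missing.
sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    dtype=str,
    na_values="\\N",
    keep_default_na=False,
)

print(sample["language"].isna().tolist())
```

If the files were read this way, the later comparisons against `'\\N'` would become `isna()` checks instead.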



In [155]:
df.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


In [156]:
df.language.unique()

array(['\\N', 'ja', 'sv', 'en', 'tr', 'es', 'sr', 'cs', 'fa', 'fr', 'bg',
       'ca', 'nl', 'qbn', 'pt', 'ru', 'uk', 'qbp', 'ar', 'cmn', 'rn',
       'bs', 'de', 'hi', 'yi', 'qbo', 'ka', 'hr', 'sl', 'he', 'tg', 'sk',
       'kk', 'da', 'el', 'fi', 'it', 'gsw', 'yue', 'az', 'ms', 'pl', 'mr',
       'uz', 'gl', 'th', 'ta', 'eu', 'be', 'af', 'la', 'hy', 'ur', 'bn',
       'te', 'lt', 'mk', 'et', 'lv', 'gd', 'tl', 'cy', 'id', 'qal', 'gu',
       'ml', 'ro', 'hu', 'pa', 'kn', 'wo', 'no', 'is', 'sq', 'zh', 'ps',
       'nqo', 'sd', 'ga', 'xh', 'mi', 'zu', 'ku', 'rm', 'prs', 'ky', 'vi',
       'fro', 'ko', 'haw', 'mn', 'lo', 'my', 'am', 'qac', 'ne', 'myv',
       'br', 'iu', 'st', 'tn', 'cr'], dtype=object)

As observed above, the IMDB database contains movies/shows in many languages. This project focuses only on English and Hindi titles, so rows for all other languages are removed.

In [157]:
#Keeping rows whose title language is English or Hindi and removing every other row
selected_languages=['en','hi']
df = df[df.language.isin(selected_languages)]

Also, the columns ordering, types, attributes and isOriginalTitle are of no use to the recommendation system. Hence, these columns are removed below.

In [158]:
df = df.drop(labels=['types','attributes','isOriginalTitle','ordering'],axis=1)

In [159]:
df.head()

Unnamed: 0,titleId,title,region,language
95,tt0000012,The Arrival of a Train,XWW,en
97,tt0000012,The Arrival of a Train at La Ciotat,XWW,en
107,tt0000012,The Arrival of a Train,XEU,en
157,tt0000016,Boat Leaving the Port,XWW,en
239,tt0000029,Baby's Meal,XWW,en


### 2. Adding genre, year and title type (movie or show) information

In [160]:
#reading the file with basic title information 
basic = pd.read_csv('../datasets from IMDB/title_basics.tsv',sep='\t')



In [161]:
basic.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


Next, the basic data is joined with the title information. 

In [162]:
#joining the required columns from the basic title dataframe
basic = basic.set_index('tconst')
df_with_basic = pd.merge(df,basic[['titleType','isAdult','startYear','genres']],left_on='titleId',right_on='tconst',how='left')
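
A left join keeps every row of the left frame, so titles without a match in `basic` end up with `NaN` in the joined columns, which is presumably where the `nan` title type in the check further down comes from. The sketch below uses two hypothetical miniature frames to show how the `indicator` and `validate` options of `pd.merge` can surface such rows. It is an illustration, not the notebook's own cell.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames used in the notebook.
akas = pd.DataFrame({
    "titleId": ["tt01", "tt01", "tt02", "tt99"],
    "title": ["A", "A (alt)", "B", "Orphan"],
})
basics = pd.DataFrame({
    "tconst": ["tt01", "tt02"],
    "titleType": ["movie", "tvSeries"],
})

# validate="many_to_one" raises if tconst is not unique in `basics`;
# indicator=True adds a _merge column saying where each row came from.
merged = pd.merge(
    akas,
    basics,
    left_on="titleId",
    right_on="tconst",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows with no match in `basics` keep NaN in titleType.
unmatched = merged.loc[merged["_merge"] == "left_only", "titleId"].tolist()
print(unmatched)
```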

In [163]:
df_with_basic.head()

Unnamed: 0,titleId,title,region,language,titleType,isAdult,startYear,genres
0,tt0000012,The Arrival of a Train,XWW,en,short,0.0,1896,"Action,Documentary,Short"
1,tt0000012,The Arrival of a Train at La Ciotat,XWW,en,short,0.0,1896,"Action,Documentary,Short"
2,tt0000012,The Arrival of a Train,XEU,en,short,0.0,1896,"Action,Documentary,Short"
3,tt0000016,Boat Leaving the Port,XWW,en,short,0.0,1895,"Documentary,Short"
4,tt0000029,Baby's Meal,XWW,en,short,0.0,1895,"Documentary,Short"


Checking the types of titles in the data

In [164]:
df_with_basic.titleType.unique()

array(['short', 'movie', 'tvSeries', 'tvMovie', 'tvMiniSeries',
       'tvEpisode', 'tvShort', 'video', 'videoGame', 'tvSpecial', nan],
      dtype=object)

Only movie, tvSeries and tvMiniSeries are relevant to this project. Hence, every row with any other title type is removed.

In [165]:
to_keep = ['movie','tvSeries','tvMiniSeries']
df_with_basic = df_with_basic[df_with_basic['titleType'].isin(to_keep)]

Checking the genres column for values.

In [166]:
df_with_basic.genres.unique()

array(['\\N', 'Drama', 'Drama,Romance', ...,
       'Musical,Reality-TV,Talk-Show', 'Comedy,Short,Talk-Show',
       'Music,Musical,Reality-TV'], dtype=object)

In [167]:
df_with_basic[df_with_basic.genres == '\\N']

Unnamed: 0,titleId,title,region,language,titleType,isAdult,startYear,genres
68,tt0000838,The Cultivation of the Cacao Tree,XWW,en,movie,0.0,1909,\N
78,tt0001051,Magical Dream,XWW,en,movie,0.0,1909,\N
80,tt0001122,The Red Inn,XWW,en,movie,0.0,1910,\N
112,tt0002329,Today and Tomorrow,XWW,en,movie,0.0,1912,\N
126,tt0002801,The Black Diamond,XWW,en,movie,0.0,1913,\N
...,...,...,...,...,...,...,...,...
2651813,tt9908394,Sex Documentary: Meaty,XWW,en,movie,0.0,1981,\N
2651969,tt9909276,Documentary Porn: Compulsive Rapist,XWW,en,movie,0.0,1981,\N
2651992,tt9909736,Porno Documentary: Housewife's Prostitution Team,XWW,en,movie,0.0,1981,\N
2651993,tt9909744,Please Seduce Me with Dirty Words,XWW,en,movie,0.0,1981,\N


Movies without genres are of no use to this project. Hence, they are removed.

In [168]:
#removing rows with \N as a genre
df_with_basic = df_with_basic[~(df_with_basic.genres =='\\N')]

Some English movies appear twice in the dataset, once with Hindi and once with English as the language. This may be because English movies are often dubbed in Hindi to reach a wider audience. One such example is shown below.

In [169]:
df_with_basic[df_with_basic.title == 'Coffee & Kareem']

Unnamed: 0,titleId,title,region,language,titleType,isAdult,startYear,genres
2649423,tt9898858,Coffee & Kareem,CA,en,movie,0.0,2020,"Action,Comedy"
2649424,tt9898858,Coffee & Kareem,IN,hi,movie,0.0,2020,"Action,Comedy"


Checking whether there are Hindi titles with a region other than India.

In [170]:
df_with_basic[(df_with_basic.region != 'IN') & (df_with_basic.language == 'hi')]

Unnamed: 0,titleId,title,region,language,titleType,isAdult,startYear,genres
21192,tt0085743,Jaane Bhi Do Yaaro,US,hi,movie,0.0,1983,"Comedy,Drama"
2483062,tt9248934,Marjaavaan,CA,hi,movie,0.0,2019,"Action,Drama,Romance"


Changing the region of the above titles to India

In [171]:
df_with_basic.loc[(df_with_basic.title == 'Jaane Bhi Do Yaaro'),'region']='IN'
df_with_basic.loc[(df_with_basic.title == 'Marjaavaan'),'region']='IN'
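
Matching on the title string also changes any other title that happens to share the same name. A hedged alternative is to match on `titleId`, using the two identifiers visible in the output above. The small frame here is a made-up stand-in for `df_with_basic`, and the third row is hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    # The last row is a hypothetical, unrelated title with the same name.
    "titleId": ["tt0085743", "tt9248934", "tt0000000"],
    "title": ["Jaane Bhi Do Yaaro", "Marjaavaan", "Marjaavaan"],
    "region": ["US", "CA", "GB"],
    "language": ["hi", "hi", "en"],
})

# Identifiers taken from the output shown above.
hindi_ids = ["tt0085743", "tt9248934"]

# Restrict the update to Hindi rows of exactly these titles.
mask = demo["titleId"].isin(hindi_ids) & (demo["language"] == "hi")
demo.loc[mask, "region"] = "IN"

print(demo["region"].tolist())
```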

Checking titles with region India and language English

In [172]:
df_with_basic[(df_with_basic.region == 'IN') & (df_with_basic.language == 'en')].tail(4000)


Unnamed: 0,titleId,title,region,language,titleType,isAdult,startYear,genres
843703,tt1210819,The Lone Ranger,IN,en,movie,0.0,2013,"Action,Adventure,Western"
846501,tt1211837,Doctor Strange,IN,en,movie,0.0,2016,"Action,Adventure,Fantasy"
846961,tt1211956,Escape Plan,IN,en,movie,0.0,2013,"Action,Thriller"
847493,tt1212142,Rowdy Gari Pellam,IN,en,movie,0.0,1991,Drama
848750,tt12125238,Ek Thi Begum,IN,en,tvSeries,0.0,2020,"Crime,Drama"
850344,tt1212974,Bitch Slap,IN,en,movie,0.0,2009,"Action,Comedy,Crime"
851239,tt1213404,Knight Rider,IN,en,tvSeries,0.0,2008,"Adventure,Mystery,Thriller"
859903,tt1216496,Madeo,IN,en,movie,0.0,2009,"Crime,Drama,Thriller"
859968,tt1216520,Womb,IN,en,movie,0.0,2010,"Drama,Romance,Sci-Fi"
862975,tt12176398,Cooked with Cannabis,IN,en,tvSeries,0.0,2020,Reality-TV


Changing the language of the following titles with region India to Hindi

In [173]:
to_hindi = ['Baiju Bawra','Devdas','Pyaasa','Madhumati','Apur Sansar','Mughal-E-Azam','Dil Hi to Hai','Mahanagar','Kashmir Ki Kali','Mahapurush','Sadhu Aur Shaitaan','Amar Prem','Andaaz','Abhimaan','Bobby','Saudagar','Benaam','Majboor','Muqaddar Ka Sikandar','Sparsh','Chakra','Ram Balram','Taxi Chor','Naseeb','Namak Halaal','Satte Pe Satta','Andhaa Kaanoon','Sharabi','Ghulami','Mohabbat','Aakhree Raasta','Ijaazat','Awara Baap','Jalwa','Jawab Hum Denge','Salaam Bombay!','Batwara','Lamhe','Sanam Bewafa','Shivaji Surathkal','One Two Three','Agnisakshi','Sakhi Tumi Kar','Taj Mahal 1989','Kehta Hai Yeh Dil','Dosti Ka Naya Maidan','Kambakkht Ishq','Aswathama','Meri Gudiya','Amaanat','Mahabharatham','Kavita Bhabhi','Yehh Jadu Hai Jinn Ka','3 Shyaane','Chhaliya','Agra','Savdhaan India 3 - India Fights Back','Shubharambh','Phir Laut Aayi Naagin','Thappad','Hurdang','Bhoomi','Radhe','Dostana 2','K.G.F: Chapter 2','Sindura Bindu','Badnaam Gali','Hum Toh Tere Aashiq Hai','Naqaab','Awarapan','Race','Inkaar','Baarish','Love Aaj Kal','Dhol','Jeena Isi Ka Naam Hai','Om Namah Shivay','Kya Hoga Nimmo Ka','Kasamh Se',"Mani Ratnam's Guru",'Deal or No Deal','Pyaar Ke Side Effects','Sivaji: The Boss','Satyaghath: Crime Never Pays','Khosla Ka Ghosla!','Chalte Chalte Kahan Aagaye Hum','Lage Raho Munna Bhai','Chandramukhi','Lucky','Meri Jung: One Man Army','Deewane Huye Paaga','Mumbai Express','Parineeta','Zabaan Sambhal Ke','Afsana Dilwalon Ka','Phir Bhi Dil Hai Hindustani','Sanskar','Vikram Aur Betaal','Aap Ki Adalat','Tu Tu Main Main','Shaktimaan: The First Indian Superhero','Ramayan','Chandrakanta','Buniyaad','Meri Aan Man At Work','Tere Naam','Hungama','Kohraam','Karishma Kali Kaa','Ek Aur Ek Gyarah','Sakshyam','Mahatma','Lahore','Ek Hasina Thi','Krantikari','Saathiya','Yeh Dil Aashiqanaa','Ittefaq','Yeh Jo Hai Zindagi','Aarzoo','Shri Krishna','Hamara Dil Aapke Paas Hai','Saajan Chale Sasural','Tenali Rama','Phir Bhi Dil Hai Hindustani','Shola Aur Shabnam','Billu Barber','Firaaq','Don 
2','Wake Up Sid','Kahaani Ghar Ghar Kii','Once Upon a Time in Bombay','StreetDance 3D','Kasautii Zindagii Kay','Kahiin to Hoga','Anjaana Anjaani','Paan Singh Tomar','Yamla Pagla Deewana','Kismat','Once Upon a Time in Mumbaai Again','Murder 2','Sasural Simar Ka','Matru Ki Bijlee Ka Mandola','Y.J.H.D.','Aashiqui 2','Shootout at Wadala','Singham Returns','Jannat 2','A.B.C.D','Gully Boy','Savdhaan India','Comedy Circus Ke Ajoobe','Kahani Comedy Circus Ki','Love Marriage Ya Arranged Marriage','Kuch Toh Log Kahenge','F.I.R.','Ninja Hattori','Tere Mere Sapne','Madhubala - Ek Ishq Ek Junoon','Jaanu','Yeh Hai Mohabbatein','Hum Saath Aath Hai','Humpty Sharma Ki Dulhania','Dishoom','The Shaukeens','Welcome to Karachi','Yeh Hai Aashiqui','Akbar Birbal','MTV Roadies','Bhabi Ji Ghar Par Hai','On Air with AIB','Meri Pyaari Bindu','Aahat','MTV Unplugged India','Yamla Pagla Deewana Again',"TSP's Zeroes",'Salaam Zindagi','Street Dancer 3D']

In [174]:
remove = ['Munnabhai 2nd Innings']

Changing the language of the titles listed above to Hindi

In [175]:
df_with_basic.loc[(df_with_basic.title.isin(to_hindi)),'language']='hi'

In [176]:
#removing the titles listed in remove (the filter is applied to the title column)
df_with_basic = df_with_basic[~(df_with_basic.title.isin(remove))]

In [178]:
pd.set_option('display.max_rows', 4100)

In [181]:
df_with_basic[(df_with_basic.language == 'hi')].head(4060)

Unnamed: 0,titleId,title,region,language,titleType,isAdult,startYear,genres
543,tt0015324,Sherlock Jr.,IN,hi,movie,0.0,1924,"Action,Comedy,Romance"
1044,tt0021282,Rain or Shine,IN,hi,movie,0.0,1930,"Comedy,Drama,Romance"
1139,tt0022268,Platinum Blonde,IN,hi,movie,0.0,1931,"Comedy,Romance"
1235,tt0023464,Shopworn,IN,hi,movie,0.0,1932,"Drama,Romance"
1315,tt0024617,The Story of Temple Drake,IN,hi,movie,0.0,1933,Drama
1531,tt0027817,Jeevan Naiya,IN,hi,movie,0.0,1936,Drama
1544,tt0027977,Modern Times,IN,hi,movie,0.0,1936,"Comedy,Drama,Family"
1562,tt0028212,Sabotage,IN,hi,movie,0.0,1936,"Crime,Thriller"
1993,tt0033029,Second Chorus,IN,hi,movie,0.0,1940,"Comedy,Musical,Romance"
2278,tt0035209,Prelude to War,IN,hi,movie,0.0,1942,"Documentary,War"


In [184]:
to_english = ['Sherlock Jr.','Rain or Shine','Platinum Blonde','Shopworn','Black Widow','Seven Samurai','12 Angry Men','Witness for the Prosecution','Hiroshima Mon Amour','Psycho',"Cleopatra's Daughter",'The Good, the Bad and the Ugly',"Guess Who's Coming to Dinner",'The Nude Restaurant','Point Blank','Enter the Game of Death','Marlowe','The Love Factor','Days and Nights in the Forest','Hollywood Blue','Murmur of the Heart','Enter the Dragon','Belladonna of Sadness','The Godfather: Part II','The Holy Mountain','The Night Porter','The Story of O','Submission','Star Wars: Episode IV - A New Hope','Apocalypse Now','American Gigolo','The Blue Lagoon','Caligula','Star Wars: Episode V - The Empire Strikes Back','Superman II','The Shining','The Entity','Private Lessons','Reds','Basket Case','Caligula and Messalina','E.T. the Extra-Terrestrial','The Evil Dead','A Little Sex','Outsiders','Star Wars: Episode VI - Return of the Jedi','Scarface','Trading Places','Gremlins','Angel','Hollywood Hot Tubs','The Terminator','Dragon Ball','Back to the Future','The Goonies','Little Flames','The Mosquito Coast','9½ Weeks','Top Gun','Stand by Me','Welcome to 18','Star Trek: The Next Generation','What Every Frenchwoman Wants','Full Metal Jacket','The Lost Boys','The Untouchables','Grave of the Fireflies','Legend of the Galactic Heroes','Wild Orchid','The Marrying Man','Only Yesterday','The Silence of the Lambs','Basic Instinct','Dracula','All Ladies Do It','he Opposite Sex and How to Live with Them','Reservoir Dogs','Scent of a Woman','Tokyo Decadence','Bad Boy Bubby','Jurassic Park',"Schindler's List",'Naked',"Baby's Day Out",'Friends','Cold Water','Forrest Gump','The Lion King','The Professional','The Shawshank Redemption','Pulp Fiction','The Smile of the Fox','Bad Boys','Braveheart','The Wood','Cast Away','American Pie','The Lord of the Rings: The Return of the King','The Sixth Sense','Boredom','American Beauty','The White Ship','Siska','Frivolous Lola','Golden Eyes Secret Agent 
077','Requiem for a Dream','Little Nicky','Romance','Schoolgirls in Chains','Bionic Ninja','Wolf Guy','Sexy','Sexy Beast','Paths in the Night','Naked Video','Snatch','Memento','Amélie','Love & Sex','The Matrix Revolutions','Spirited Away','AMALL','The Pianist','Sex and Lucia','The Pornographer','Catch Me If You Can','The Girl Next Door','Battle Royale','Hollywood Sex Fantasy','Kill Bill: Vol. 1','Black Angel',"People's Dada","Red Dragon",'Meet the Fockers','X2: X-Men United','The Office','800 Bullets','Wrong Turn','xXx','Take Care of My Cat','Shrek 2','Timeline','Sex Is Comedy','The Wire','City of God','Berserk','Big Fish','The Day After Tomorrow','The Day Maradona Met Gardel','Pirates of the Caribbean: The Curse of the Blacl Pearl','The Best Sex Ever',"I'm Not Scared",'Mystic River','The Stepford Wives','Ella Enchanted','xXx: State of the Union','Scooby-Doo 2: Monsters Unleashed','The Notebook','Eternal Sunshine of the Spotless Mind','The Haunted Mansion','The Polar Express','Two Brothers','The Aviator','I, Robot','Memories of Murder','King Kong','Inglourious Basterds','The Life Aquatic with Steve Zissou','Downfall','The Chronicles of Narnia: The Lion, the Witch and the Wardrobe','Taking Lives','Please Teacher!','Charlie and the Chocolate Factory','D.E.B.S.','Jurassic World','Iron Man','Batman Begins','Saints and Soldiers','The Story of the Weeping Came','Halloween','Mi piace lavorare (Mobbing)','Jack Paradise (Les nuits de Montréal)','Crash','Survival Island','The Criminals','The Ellen DeGeneres Show','The 7th Day','Casino Royale','Ma Mère','The Da Vinci Code','Ratatouille','Rome','Motherless Brooklyn','The Office','The Grudge','The Exorcism of Emily Rose','Little Children','Casshern','The Departed','Watchmen','Lost','9 Songs','House M.D.','Veronica Mars','Spider-Man 3',"Grey's Anatomy",'Show Me Yours','300','Avatar: The Last Airbender','Land of the Dead','Lie with Me','A Soap','Fullmetal Alchemist','The Curious Case of Benjamin Button','V for Vendetta','The 
Invisible','Kidulthood','Next','Doctor Who','Alita: Battle Ange','Angel','Beowulf','Zodiac','The Painted Veil','Shazam!','Hancock','Blood Diamond','Wonder Woman','Good Luck Chuck','Lady in the Water','Perfect Mismatch','Life of Pi','The Equalizer','The Smart Hunt','Prison Break','Reincarnation','The Holiday','Star Wars: The Clone Wars','Captain America: The First Avenger','X-Men Origins: Wolverine','How I Met Your Mother','Supernatural','Spanish Beauty','The Prince of Tennis','Unnatural & Accidental','The Dark Knight','There Will Be Blood','Eagle Man','Apocalypto',"It's Always Sunny in Philadelphia",'Robbery','Westworld','Big Brother','No Country for Old Men','I Am Legend','A Good Day to Be Black & Sexy','The Prestige','Bus Conductor','The Ode to Joy','Fantastic 4: Rise of the Silver Surfer','Miss Cobra','XXXHOLiC','The Oxford Murders','Good Boy, Bad Boy','Carnival Row','Sixty Six','Strictly Sexual','Our Victory','Time','Into the Wild','Body of Lies','The Tudors','Man of Steel','Dexter',"My Mom's New Boyfriend",'Getting Home','The Last Full Measure','Honeymoon','Meet Bill','Flood','Carriers','Interstellar','World War Z','The Sex Movie','Arn: The Knight Templar','Pet Sematary','The Town','The Avengers','Death Note','The Big Bang Theory','The Hobbit: An Unexpected Journey','Breaking Bad','Young People Fucking','Harry Potter and the Deathly Hallows: Part 1','Salt','Game of Thrones','Source Code','Black Swan','The Amazing Spider-Man','The Happening','Six Sex Scenes and a Murder','Sherlock Holmes','The Wolf of Wall Street','The Struggle','Sex Drugs & Theatre','Medically Yourrs',"Hellcat's Revenge II: Deadman's Hand",'The Gift','Sweater','Under the Blue Sky','The Secret Life of My Secretary',"Angel's Last Mission: Love",'Sing "Yesterday" for Me','Never Back Down','Avenue 5','Argo','Sleeping with My Student','Death Proof',"Hachi: A Dog's Tale",'Bluff City Law','Stumptown','Outmatched','Prodigal Son','All Rise','You Cannot Hide','Fly Girls','Titans','Mission Over 
Mars','Black Christmas','Stalked','Bad Guys: The Movie','They Say Nothing Stays the Same',"Inside Bill's Brain: Decoding Bill Gates",'The Hole','House Arrest','Cosmos: Possible Worlds','Shutter Island','Sex Drive','Cubicles','Coloquinte','The Kung Fu Master','Handsome Siblings','Zombieland','Spice and Wolf','Little Soldier','The Rite','Never Kiss Your Best Friend','The Hobbit: The Desolation of Smaug','Wuthering Heights','7 Islands','The Boys','Memorist','The Mentalist','Harry Potter and the Deathly Hallows: Part 2','Gran Torino','The Rebound','Jack Ryan: Shadow Recruit','Rambo: Last Blood','Moneyball','Doctor Strange','I Am Love','Mission: Impossible - Ghost Protocol','Sorority Row',"Don't Look Down",'Conviction','Sex and the City 2','Room in Rome','X-Men: First Class','A Serbian Film','The Social Network','Bad Teacher','Sex Ed','xXx: Return of Xander Cage','Pirates of the Caribbean: On Stranger Tides','The Irishman','The Secret in Their Eyes','The Twilight Saga: Breaking Dawn: Part 1','3-D Sex and Zen: Extreme Ecstasy','Tomb Raider','Suicide Squad','Mad Max: Fury Road','42 Kilometres','Journey 2: The Mysterious Island','No Strings Attached','Apartment: Are You Looking for One?','X-Man Wolverine 2','The Conjuring','Gravity','Sherlock','Aquaman','Oblivion','Bad Boys for Life','The Walking Dead','Point Blank','Zombieland: Double Tap','Maleficent','Suits','Friends with Benefits','Maid Sama!','Project X','The Dictator','The Impossible','X: Night of Vengeance','This Is Not a Film','Now You See Me','The Twilight Saga: Breaking Dawn - Part 2','Intouchables','Trollhunters: Tales of Arcadia','Top Gun: Maverick','Orchids: My Intersex Adventure','Hollywood Sex Wars','Hollywood Sex Wars','Is This a Zombie?','Pirates of the Caribbean: Dead Men Tell No Tales','Black & White & Sex','Prosecutor Princess','13 Reasons Why','Person of Interest','American Horror Story','Django Unchained','House of Cards','Blade Runner 2049','Hollywood Rules','The World God Only Knows',"Let's Be 
Cops","Miss Peregrine's Home for Peculiar Children",'The Hangover Part III','Toy Story 4','Rush','Angry Birds','A Aa E Ee','Guardians of the Galaxy','The Great Wall','Ip Man 4: The Finale','The Imitation Game','Black Mirror','Inside Out','The Hunt','The Queen of Versailles','A Boy Called Sailboat','True Story','Men in Black: International','Naruto SD: Rock Lee & His Ninja Pals','The Hobbit: The Battle of the Five Armies','Fifty Shades of Grey','Avengers: Age of Ultron','Cosmos: A Spacetime Odyssey','Predestination','Outlawed','Star Wars: The Rise of Skywalker','Silicon Valley','The Town That Dreaded Sundown','The Fault in Our Stars','Whiplash','Me Before You','Narcos','The Blacklist','Bad Man','The Interview','Kingsman: The Secret Service','Furious 7','Ip Man 3','Shank','John Wick','Insurgent','The Jungle Book','Now You See Me 2','Into the Ashes','I Hear Your Voice','The Nightmare','The Boy Next Door',"Master's Sun",'Palm Trees in the Snow','Nine: Nine Time Travels',"I Do... Until I Don't",'Sully','Obsessed','Logan','Zapatlela 2','Daredevil','Heartless','Transformers: The Last Knight','X-Men: Apocalypse','When Marnie Was There','Despicable Me 3','The Hateful Eight','My Love from the Star']

In [183]:
remove_eng = ['Léon: The Professional','KKHH','****','KNPH','K3G','KKKG','www.XXX.com','R.D.B.','Veer-Zaara','Koi... Tumsa Nahin','Perfect Mis Match','Munnabhai 2nd Innings','Don: The Chase Begins Again','Dashavtar','Sivaji',"Mani Ratnam's Guru",'Harry Potter aur maut ke tohfe, part 1','XXY','Search: WWW','www.love.com','Love.Com','Guardians of the Galaxy: Anktriksh ke Boss','A.B.C.D','Cosmos','Furious Seven']
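
The cells that apply `to_english` and `remove_eng` are not shown here. Assuming they mirror the handling of `to_hindi` and `remove`, a sketch on a toy frame with shortened, hypothetical stand-in lists could look like this:

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["Psycho", "Devdas", "K3G", "Top Gun"],
    "language": ["hi", "hi", "en", "hi"],
})

# Shortened stand-ins for the full lists defined above.
to_english_demo = ["Psycho", "Top Gun"]
remove_eng_demo = ["K3G"]

# Relabel the listed titles as English.
demo.loc[demo["title"].isin(to_english_demo), "language"] = "en"

# Drop the listed titles; the filter is applied to the title column.
demo = demo[~demo["title"].isin(remove_eng_demo)].reset_index(drop=True)

print(demo.to_dict("records"))
```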

Some titles are now repeated, as seen below. To keep the right row for each title:
* The region of every title from the US or Canada will be changed to AA
* The region of every title from India stays as it is
* The region of every title from any other country/region changes to ZZ

We will then sort the rows in descending order of region and language. For each title, the first row in that order is kept and the other duplicates are removed.
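
The sort-then-deduplicate idea can be sketched on a toy frame. Here an explicit rank column replaces the reliance on the alphabetical order of region codes. The rank values and the preference order (US/CA, then India, then everything else) are assumptions read off the AA/IN/ZZ scheme, chosen so that an English row wins over a Hindi row for the same `titleId`.

```python
import pandas as pd

demo = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt2", "tt3", "tt3"],
    "region": ["CA", "IN", "IN", "GB", "US"],
    "language": ["en", "hi", "hi", "en", "en"],
})

def region_rank(region: str) -> int:
    """Lower rank wins: US/CA first, then India, then everything else."""
    if region in ("US", "CA"):
        return 0
    if region == "IN":
        return 1
    return 2

demo["rank"] = demo["region"].map(region_rank)

# A stable sort keeps the original order among rows with equal rank.
deduped = (
    demo.sort_values("rank", kind="stable")
        .drop_duplicates(subset="titleId", keep="first")
        .drop(columns="rank")
        .sort_values("titleId")
        .reset_index(drop=True)
)

print(deduped.to_dict("records"))
```

An explicit rank makes the preference visible in the code and keeps it from depending on whether the sort is ascending or descending.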


In [75]:
df_with_basic[df_with_basic.duplicated(subset=['title'], keep=False)]

Unnamed: 0,titleId,title,region,language,titleType,isAdult,startYear,genres
73,tt0000941,Love Crazy,XWW,en,movie,0.0,1909,Drama
82,tt0001175,Camille,XWW,en,movie,0.0,1912,"Drama,Romance"
127,tt0002844,Fantomas,XWW,en,movie,0.0,1913,"Crime,Drama"
150,tt0003419,The Student of Prague,XWW,en,movie,0.0,1913,"Drama,Fantasy,Horror"
153,tt0003622,Love of Perdition,XWW,en,movie,0.0,1914,Drama
...,...,...,...,...,...,...,...,...
2649644,tt9900908,Handcuffs,XWW,en,movie,0.0,1969,"Action,Comedy,Crime"
2651088,tt9906644,Manoharam,CA,en,movie,0.0,2019,"Comedy,Drama"
2651089,tt9906644,Manoharam,IN,hi,movie,0.0,2019,"Comedy,Drama"
2653686,tt9915686,Khatra Khatra Khatra,IN,hi,tvSeries,0.0,2019,"Comedy,Reality-TV"


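Changing the region of other countries to ZZ. Note that the condition below is true for every English-language row, so US and CA rows are set to ZZ as well, as the outputs that follow show.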

In [76]:
index = (df_with_basic.language == 'en') & ((df_with_basic.region != 'US') | (df_with_basic.region != 'CA'))
df_with_basic.loc[(index),'region']='ZZ'
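
Every region code differs from at least one of `'US'` and `'CA'`, which is why `(region != 'US') | (region != 'CA')` selects all rows. The sketch below contrasts it, on a hypothetical toy frame, with an `isin`-based test that does exclude the two regions. Whether that exclusion was the intent is an assumption.

```python
import pandas as pd

demo = pd.DataFrame({
    "region": ["US", "CA", "GB", "IN"],
    "language": ["en", "en", "en", "hi"],
})

is_english = demo["language"] == "en"

# Original form: true for every row, so all English rows are selected.
always_true = (demo["region"] != "US") | (demo["region"] != "CA")

# Alternative form: true only for regions other than US and CA.
outside_us_ca = ~demo["region"].isin(["US", "CA"])

print((is_english & always_true).tolist())
print((is_english & outside_us_ca).tolist())
```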

In [77]:
df_with_basic[(df_with_basic.language == 'en') ]

Unnamed: 0,titleId,title,region,language,titleType,isAdult,startYear,genres
73,tt0000941,Love Crazy,ZZ,en,movie,0.0,1909,Drama
82,tt0001175,Camille,ZZ,en,movie,0.0,1912,"Drama,Romance"
86,tt0001258,The White Slave Trade,ZZ,en,movie,0.0,1910,Drama
88,tt0001338,A Night in May,ZZ,en,movie,0.0,1910,Drama
93,tt0001790,"Les Misérables, Part 1: Jean Valjean",ZZ,en,movie,0.0,1913,Drama
...,...,...,...,...,...,...,...,...
2653687,tt9915686,The Khatra Show,ZZ,en,tvSeries,0.0,2019,"Comedy,Reality-TV"
2653688,tt9915686,Khatra Khatra Khatra,ZZ,en,tvSeries,0.0,2019,"Comedy,Reality-TV"
2653746,tt9916170,The Rehearsal,ZZ,en,movie,0.0,2019,Drama
2653749,tt9916206,Nojor,ZZ,en,tvSeries,0.0,2019,Fantasy


In [78]:
#Sorting the dataframe by region and language in descending order
df_with_basic.sort_values(by=['region','language'],ascending= False ,inplace=True)

In [79]:
df_with_basic.drop_duplicates(subset=['titleId'],keep='first',inplace=True)

Checking again.

In [80]:
#checking again to confirm that Coffee & Kareem only appears in English language
df_with_basic[df_with_basic.title == 'Coffee & Kareem']

Unnamed: 0,titleId,title,region,language,titleType,isAdult,startYear,genres
2649423,tt9898858,Coffee & Kareem,ZZ,en,movie,0.0,2020,"Action,Comedy"


The language is correct now. Let's add ratings to the dataframe now.

### 3.  Adding ratings information

In [81]:
ratings = pd.read_csv('../datasets from IMDB/title_rating.tsv',sep='\t')

In [82]:
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.6,1608
1,tt0000002,6.0,197
2,tt0000003,6.5,1286
3,tt0000004,6.1,121
4,tt0000005,6.1,2051


Joining the ratings information with the dataframe created above.

In [83]:
ratings = ratings.set_index('tconst')
df_with_ratings = pd.merge(df_with_basic,ratings,left_on='titleId',right_on='tconst',how='left')

In [84]:
df_with_ratings.head()

Unnamed: 0,titleId,title,region,language,titleType,isAdult,startYear,genres,averageRating,numVotes
0,tt0000941,Love Crazy,ZZ,en,movie,0.0,1909,Drama,4.2,13.0
1,tt0001175,Camille,ZZ,en,movie,0.0,1912,"Drama,Romance",5.5,22.0
2,tt0001258,The White Slave Trade,ZZ,en,movie,0.0,1910,Drama,5.8,80.0
3,tt0001338,A Night in May,ZZ,en,movie,0.0,1910,Drama,5.4,7.0
4,tt0001790,"Les Misérables, Part 1: Jean Valjean",ZZ,en,movie,0.0,1913,Drama,5.8,21.0


In [85]:
df_with_ratings.shape

(99015, 10)

In [86]:
df_with_ratings['title'] = df_with_ratings['title'].apply(lambda x: unidecode.unidecode(x))

In [87]:
#dropping titles without a start year, then keeping titles from 1950 onwards
df_with_ratings = df_with_ratings[~(df_with_ratings.startYear =='\\N')].copy()
df_with_ratings['startYear'] = df_with_ratings['startYear'].apply(int)
df_with_ratings =  df_with_ratings[df_with_ratings.startYear >= 1950]
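
The year handling above can also be written with `pd.to_numeric`, which converts and flags missing years in one step. This is a sketch on a made-up three-row frame, not the notebook's own cell.

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["Old Film", "Undated", "Recent Film"],
    "startYear": ["1949", "\\N", "2019"],
})

# errors="coerce" turns anything non-numeric (including "\N") into NaN.
years = pd.to_numeric(demo["startYear"], errors="coerce")

# Keep rows with a known year from 1950 onwards; NaN compares as False.
# .copy() makes the result an independent frame before assigning to it.
recent = demo[years >= 1950].copy()
recent["startYear"] = years[years >= 1950].astype(int)

print(recent.to_dict("records"))
```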

In [88]:
df_with_ratings.startYear.min()

1950

In [89]:
df_with_ratings.dropna(axis=0,how='any',inplace=True)

In [90]:
df_with_ratings.shape

(73161, 10)

In [91]:
df_with_ratings.titleType.unique()

array(['movie', 'tvSeries', 'tvMiniSeries'], dtype=object)

In [92]:
df_with_ratings.loc[(df_with_ratings.titleType == 'tvMiniSeries'),'titleType']='Mini-Series'
df_with_ratings.loc[(df_with_ratings.titleType == 'tvSeries'),'titleType']='Series'

In [93]:
df_with_ratings.loc[(df_with_ratings.language == 'en'),'language']='English'
df_with_ratings.loc[(df_with_ratings.language == 'hi'),'language']='Hindi'

In [100]:
df_with_ratings[df_with_ratings.language == 'Hindi'].sample(50)

Unnamed: 0,titleId,title,region,language,titleType,isAdult,startYear,genres,averageRating,numVotes
97768,tt0483786,Ram Aur Shyam,IN,Hindi,movie,0.0,1996,"Action,Crime,Drama",4.7,14.0
98399,tt3322892,XX,IN,Hindi,movie,0.0,2017,Horror,4.5,10635.0
97308,tt0215987,Dawedaar,IN,Hindi,movie,0.0,1987,Drama,5.7,22.0
97268,tt0175572,Do Dil,IN,Hindi,movie,0.0,1965,Romance,6.5,24.0
98112,tt1606253,Yes or No,IN,Hindi,movie,0.0,2001,"Drama,Romance",6.5,152.0
98100,tt1586680,Shameless,IN,Hindi,Series,0.0,2011,"Comedy,Drama",8.6,188594.0
97298,tt0213611,Dharam-Veer,IN,Hindi,movie,0.0,1977,"Action,Adventure,Comedy",6.7,630.0
97450,tt0272736,Mujhse Dosti Karoge!,IN,Hindi,movie,0.0,2002,"Drama,Family,Musical",5.1,4767.0
98976,tt9448332,Ishq,IN,Hindi,movie,0.0,2019,"Drama,Romance,Thriller",7.5,1507.0
98672,tt6117702,Munna Michael,IN,Hindi,movie,0.0,2017,"Action,Drama,Music",3.3,2502.0


Creating a csv with title information

In [95]:
df_with_ratings.to_csv("title_information.csv")

## Creating dataframe with cast information for each title

### 1. Creating dataframe with actor, actress and director information

Reading principal actor info

In [42]:
cast_info = pd.read_csv('../datasets from IMDB/title_principal.tsv',sep='\t')

In [43]:
cast_info.head(20)

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Self""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0374658,cinematographer,director of photography,\N
3,tt0000002,1,nm0721526,director,\N,\N
4,tt0000002,2,nm1335271,composer,\N,\N
5,tt0000003,1,nm0721526,director,\N,\N
6,tt0000003,2,nm5442194,producer,producer,\N
7,tt0000003,3,nm1335271,composer,\N,\N
8,tt0000003,4,nm5442200,editor,\N,\N
9,tt0000004,1,nm0721526,director,\N,\N


Removing titles in the cast dataframe which are not present in the dataframe created above.

In [44]:
cast_info = cast_info[cast_info.tconst.isin(df_with_ratings.titleId)]

Keeping the information of actor, actress and director and removing everything else

In [45]:
cast_info = cast_info[(cast_info.category == 'actor') | (cast_info.category == 'actress') | (cast_info.category == 'director')]

Removing the columns which are not required for the project. 

In [46]:
cast_info.drop(labels=['ordering','job','characters'], axis=1,inplace=True)

In [47]:
cast_info.head()


Unnamed: 0,tconst,nconst,category
316495,tt0039442,nm0007023,actor
316496,tt0039442,nm0544330,actress
316497,tt0039442,nm0019330,actor
316498,tt0039442,nm0370455,actress
316499,tt0039442,nm0349426,director


### 2. Replacing the codes for cast with their names

In [48]:
name_info = pd.read_csv('../datasets from IMDB/name_basics.tsv',sep='\t')

In [49]:
name_info.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0050419,tt0043044,tt0072308,tt0053137"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0037382,tt0038355,tt0071877,tt0117057"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,producer","tt0059956,tt0049189,tt0054452,tt0057345"
3,nm0000004,John Belushi,1949,1982,"actor,soundtrack,writer","tt0077975,tt0080455,tt0072562,tt0078723"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0083922,tt0050976,tt0050986,tt0060827"


In [50]:
#replacing the codes in the cast dataframe with names
cast_info['nconst'] = cast_info['nconst'].map(name_info.set_index('nconst')['primaryName'])

In [51]:
#renaming names column
cast_info.rename(columns = {'nconst':'name'}, inplace = True)

In [52]:
cast_info.head()

Unnamed: 0,tconst,name,category
316495,tt0039442,José Luis López Vázquez,actor
316496,tt0039442,Kiti Mánver,actress
316497,tt0039442,Francisco Algora,actor
316498,tt0039442,Hanna Haxmann,actress
316499,tt0039442,Manuel Gutiérrez Aragón,director


Changing the values in the name column to string and then replacing accented letters with unaccented ones.

In [53]:
cast_info['name'] = cast_info['name'].apply(str)
cast_info['name'] = cast_info['name'].apply(lambda x: unidecode.unidecode(x))
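
Note that `apply(str)` turns a missing name into the literal text `nan`. The sketch below shows one alternative that fills missing names explicitly first. It uses the standard-library `unicodedata` module as a stand-in for `unidecode` (it only strips combining accents and does not transliterate other scripts), and the placeholder `Unknown` is an assumption.

```python
import unicodedata
import pandas as pd

def strip_accents(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

names = pd.Series(["José Luis López Vázquez", None, "Kiti Mánver"])

# Fill missing names before converting, so they do not become the string "nan".
cleaned = names.fillna("Unknown").map(strip_accents)

print(cleaned.tolist())
```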

Writing to csv file

In [54]:
cast_info = cast_info.reset_index()
cast_info.to_csv('cast_information.csv',encoding="UTF-8")