### Feature Pre-processing

#### Feature pre-processing of the IMDb data is done in Python.
#### The script IMDB Feature Pre-processing.py reads the 7 IMDb data files and pre-processes their features.
#### The resulting tables are then written out as tab-separated values (TSV) files.

## Importing required libraries and reading the data files

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('title.basics.tsv.gz', delimiter ='\t', encoding='utf-8', low_memory=False)
df1 = pd.read_csv('title.akas.tsv.gz', delimiter ='\t', encoding='utf-8', low_memory=False)
df2 = pd.read_csv('title.crew.tsv.gz', delimiter ='\t', encoding='utf-8', low_memory=False)
df3 = pd.read_csv('title.episode.tsv.gz', delimiter ='\t', encoding='utf-8', low_memory=False)
df4 = pd.read_csv('name.basics.tsv.gz', delimiter ='\t', encoding='utf-8', low_memory=False)
df5 = pd.read_csv('title.principals.tsv.gz', delimiter ='\t', encoding='utf-8', low_memory=False)
df6 = pd.read_csv('title.ratings.tsv.gz', delimiter ='\t', encoding='utf-8', low_memory=False)
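
IMDb marks missing values with a literal `\N`, which is why later cells filter on that string. As an alternative sketch only (the notebook itself keeps `\N` as text), the placeholder could be parsed as missing at load time. The helper name `read_imdb_tsv` and the in-memory sample are assumptions made for illustration.

```python
import io
import pandas as pd

def read_imdb_tsv(source):
    """Hypothetical helper: read one IMDb TSV, treating the literal \\N as missing."""
    return pd.read_csv(
        source,
        sep="\t",
        na_values=["\\N"],      # IMDb's placeholder for a missing value
        keep_default_na=False,  # keep strings such as "NA" or "None" as text
        low_memory=False,
    )

# Tiny in-memory stand-in for title.basics.tsv.gz (illustrative rows only).
sample = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\tendYear\n"
    "tt0000001\tshort\tCarmencita\t\\N\n"
    "tt0035599\ttvSeries\tVoice of Firestone Televues\t1947\n"
)
basics_sample = read_imdb_tsv(sample)
print(basics_sample["endYear"].isna().tolist())
```

If this approach were adopted, the later `!= "\\N"` filters would have to become `.notna()` checks, so it is a design trade-off rather than a drop-in change.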

## Segregating the basics dataset on the basis of titleType
#### The titleType column of this dataset contains the following values

In [2]:
df.titleType.value_counts()

titleType
tvEpisode       7802453
short            956529
movie            659814
video            281557
tvSeries         250599
tvMovie          143207
tvMiniSeries      50785
tvSpecial         43769
videoGame         36187
tvShort           10037
tvPilot               1
Name: count, dtype: int64

#### We grouped the above values into episode, movie and series; the remaining types (video, videoGame, tvSpecial, tvPilot) are left out
## Movie Dataframe

In [3]:
# .isin() selects the four movie-like types; .copy() returns an independent frame,
# so the assignment below does not raise SettingWithCopyWarning
dfMovie = df[df["titleType"].isin(["short", "tvShort", "movie", "tvMovie"])].copy()
dfMovie["titleType"] = "movie"


## Series dataset

In [4]:
# .copy() returns an independent frame, so the assignment below does not raise SettingWithCopyWarning
dfSeries = df[df["titleType"].isin(["tvSeries", "tvMiniSeries"])].copy()
dfSeries["titleType"] = "series"


## Episode dataset

In [5]:
# .copy() returns an independent frame, so the assignment below does not raise SettingWithCopyWarning
dfEpisode = df[df["titleType"] == "tvEpisode"].copy()
dfEpisode["titleType"] = "episode"


## Merge the movie, series and episode dataframes

In [6]:
dfBasicsNew = pd.concat([dfMovie, dfSeries, dfEpisode], ignore_index=True)
dfBasicsNew.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,movie,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,movie,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,movie,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,movie,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,movie,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


#### The dataset now contains only 3 titleType values

In [7]:
dfBasicsNew.titleType.value_counts()

titleType
episode    7802453
movie      1769587
series      301384
Name: count, dtype: int64
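
The regrouping above can also be expressed as a single mapping instead of three filters and a concat. This is a minimal sketch on toy data, not the notebook's method; `TYPE_GROUPS` and `regroup_title_types` are hypothetical names. Unlike the concat, it keeps the original row order rather than placing movies first.

```python
import pandas as pd

# Mapping from raw IMDb titleType to the three groups used in this notebook.
# Types absent from the mapping (video, videoGame, tvSpecial, tvPilot) are dropped,
# mirroring the filters above.
TYPE_GROUPS = {
    "short": "movie", "tvShort": "movie", "movie": "movie", "tvMovie": "movie",
    "tvSeries": "series", "tvMiniSeries": "series",
    "tvEpisode": "episode",
}

def regroup_title_types(basics):
    """Hypothetical helper: relabel titleType and drop rows with unmapped types."""
    grouped = basics.assign(titleType=basics["titleType"].map(TYPE_GROUPS))
    return grouped.dropna(subset=["titleType"]).reset_index(drop=True)

# Toy rows with placeholder ids.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "titleType": ["short", "tvMiniSeries", "videoGame", "tvEpisode"],
})
regrouped = regroup_title_types(toy)
print(regrouped["titleType"].tolist())
```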

#### Later in this notebook, the movie, series and episode data are merged with the ratings dataset so that each title gets its average rating and vote count.

## Merge new dataset with akas
#### We merged title.akas with the regrouped title.basics data so that each movie, series and episode carries its alternative titles by region and language

In [8]:
titleakaswithbasics = pd.merge(df1, dfBasicsNew, left_on=['titleId'], right_on = ['tconst'])
titleakaswithbasics

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0,tt0000001,movie,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0,tt0000001,movie,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0,tt0000001,movie,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0,tt0000001,movie,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0,tt0000001,movie,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36697119,tt9916852,5,Episódio #3.20,PT,pt,\N,\N,0,tt9916852,episode,Episode #3.20,Episode #3.20,0,2010,\N,\N,"Action,Drama,Family"
36697120,tt9916852,6,Episodio #3.20,IT,it,\N,\N,0,tt9916852,episode,Episode #3.20,Episode #3.20,0,2010,\N,\N,"Action,Drama,Family"
36697121,tt9916852,7,एपिसोड #3.20,IN,hi,\N,\N,0,tt9916852,episode,Episode #3.20,Episode #3.20,0,2010,\N,\N,"Action,Drama,Family"
36697122,tt9916856,1,The Wind,DE,\N,imdbDisplay,\N,0,tt9916856,movie,The Wind,The Wind,0,2015,\N,27,Short
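
The inner merge above silently discards akas rows whose title is not in the regrouped basics data (for example, titles of a type that was left out). A hedged sketch of how such a merge could be checked on toy data follows; the frame names and placeholder ids are assumptions, while `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

akas_toy = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt2", "tt9"],
    "ordering": [1, 2, 1, 1],
    "title": ["A", "A (DE)", "B", "Orphan"],
})
basics_toy = pd.DataFrame({
    "tconst": ["tt1", "tt2"],
    "titleType": ["movie", "series"],
})

merged = pd.merge(
    akas_toy, basics_toy,
    left_on="titleId", right_on="tconst",
    how="left",              # keep every akas row so unmatched ones stay visible
    validate="many_to_one",  # raises MergeError if tconst repeats in basics_toy
    indicator=True,          # adds a _merge column: "both" or "left_only"
)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

Counting the `left_only` rows before switching back to an inner join gives a quick measure of how much alias data the filter removes.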


#### Then we split this merged dataset by titleType into separate series, movie and episode datasets.

In [9]:
akas_series = titleakaswithbasics[titleakaswithbasics["titleType"] == "series"]
akas_movie = titleakaswithbasics[titleakaswithbasics["titleType"] == "movie"]
akas_episode = titleakaswithbasics[titleakaswithbasics["titleType"] == "episode"]

## Dropping the columns that came from basics

In [10]:
# Columns contributed by title.basics; the alias tables keep only the akas columns
basics_columns = ["tconst", "titleType", "primaryTitle", "originalTitle", "isAdult",
                  "startYear", "endYear", "runtimeMinutes", "genres"]
series_alias = akas_series.drop(columns=basics_columns)
movie_alias = akas_movie.drop(columns=basics_columns)
episode_alias = akas_episode.drop(columns=basics_columns)

#### Preview of the movie alias table; the language, types and attributes columns are removed before export

In [12]:
movie_alias.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


# Merging the episode dataset with the episode rows of title.basics and with ratings

## Title.episode.tsv
#### title.basics already holds the descriptive information about episodes, so we merge title.episode with the episode rows of title.basics

In [13]:
df_episode_new = pd.merge(df3, dfEpisode, left_on=['tconst'], right_on = ['tconst'])
df_episode_new.drop(columns=["genres"], axis = 1, inplace=True)
df_episode_new.head()

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes
0,tt0041951,tt0041038,1,9,episode,The Tenderfeet,The Tenderfeet,0,1949,\N,30
1,tt0042816,tt0989125,1,17,episode,Othello,Othello,0,1950,\N,135
2,tt0042889,tt0989125,\N,\N,episode,The Tragedy of King Richard II/II,The Tragedy of King Richard II/II,0,1950,\N,145
3,tt0043426,tt0040051,3,42,episode,Coriolanus,Coriolanus,0,1951,\N,60
4,tt0043631,tt0989125,2,16,episode,The Life of King Henry V,The Life of King Henry V,0,1951,\N,133


In [14]:
df_episode_new = pd.merge(df_episode_new, df6, on='tconst', how = "left")
df_episode_new.head()

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,averageRating,numVotes
0,tt0041951,tt0041038,1,9,episode,The Tenderfeet,The Tenderfeet,0,1949,\N,30,7.6,86.0
1,tt0042816,tt0989125,1,17,episode,Othello,Othello,0,1950,\N,135,,
2,tt0042889,tt0989125,\N,\N,episode,The Tragedy of King Richard II/II,The Tragedy of King Richard II/II,0,1950,\N,145,,
3,tt0043426,tt0040051,3,42,episode,Coriolanus,Coriolanus,0,1951,\N,60,,
4,tt0043631,tt0989125,2,16,episode,The Life of King Henry V,The Life of King Henry V,0,1951,\N,133,,


In [15]:
df6[df6["tconst"] == "tt0041951"]

Unnamed: 0,tconst,averageRating,numVotes
23268,tt0041951,7.6,86
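
The outputs above show that the left merge keeps unrated episodes with empty rating fields, and that numVotes turns from 86 into 86.0. A minimal sketch of why that happens, and of one optional way to keep whole-number vote counts, is given below. The toy frames are assumptions; the nullable `Int64` dtype is a standard pandas feature, not something the notebook uses.

```python
import pandas as pd

episodes_toy = pd.DataFrame({"tconst": ["tt0041951", "tt0042816"]})
ratings_toy = pd.DataFrame({
    "tconst": ["tt0041951"], "averageRating": [7.6], "numVotes": [86],
})

# how="left" keeps every episode; episodes without a rating get NaN.
rated = episodes_toy.merge(ratings_toy, on="tconst", how="left")
print(rated["numVotes"].dtype)  # NaN cannot live in an int64 column, so it becomes float

# Optional: the nullable integer dtype stores missing values without floats.
rated["numVotes"] = rated["numVotes"].astype("Int64")
print(rated["numVotes"].dtype)
```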


## Segregating movies and series by titleType

In [16]:
dfBasicsNew_movie = dfBasicsNew[dfBasicsNew["titleType"] == "movie"]
dfBasicsNew_series = dfBasicsNew[dfBasicsNew["titleType"] == "series"]
print(dfBasicsNew_movie.shape, dfBasicsNew_series.shape)

(1769587, 9) (301384, 9)


## For Movies

In [17]:
df_movie_genre = dfBasicsNew_movie[["tconst", "genres"]]
df_movie_genre = df_movie_genre[df_movie_genre["genres"] != "\\N"]
df_movie_genre = pd.DataFrame(df_movie_genre.genres.str.split(',')
                                          .tolist(), index=df_movie_genre.tconst).stack()
df_movie_genre = df_movie_genre.reset_index()[['tconst', 0]] 
df_movie_genre.columns = ['tconst', 'genres']
df_movie_genre.head()

Unnamed: 0,tconst,genres
0,tt0000001,Documentary
1,tt0000001,Short
2,tt0000002,Animation
3,tt0000002,Short
4,tt0000003,Animation


## For series

In [18]:
df_series_genre = dfBasicsNew_series[["tconst", "genres"]]
df_series_genre = df_series_genre[df_series_genre["genres"] != "\\N"]
df_series_genre = pd.DataFrame(df_series_genre.genres.str.split(',')
                                          .tolist(), index=df_series_genre.tconst).stack()
df_series_genre = df_series_genre.reset_index()[['tconst', 0]] 
df_series_genre.columns = ['tconst', 'genres']
df_series_genre.head()

Unnamed: 0,tconst,genres
0,tt0035803,Documentary
1,tt0035803,News
2,tt0038276,Talk-Show
3,tt0039120,Family
4,tt0039120,Game-Show
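
The movie and series cells above apply the same split-and-stack pattern to a comma-separated column. As a sketch under stated assumptions, the pattern could be wrapped in one function built on `Series.str.split` and `DataFrame.explode`; the helper name `split_multivalued` is hypothetical and the toy rows are illustrative.

```python
import pandas as pd

def split_multivalued(frame, key, column, missing="\\N"):
    """Hypothetical helper: one output row per comma-separated value in `column`."""
    # Drop rows holding IMDb's missing-value placeholder and keep only the two columns.
    subset = frame.loc[frame[column] != missing, [key, column]]
    # Turn "a,b" into ["a", "b"], then give each list element its own row.
    exploded = subset.assign(**{column: subset[column].str.split(",")}).explode(column)
    return exploded.reset_index(drop=True)

toy = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000003", "tt0035599"],
    "genres": ["Documentary,Short", "Animation,Comedy,Romance", "\\N"],
})
genre_rows = split_multivalued(toy, "tconst", "genres")
print(genre_rows)
```

The same call shape would cover other multi-valued columns, for example `split_multivalued(people, "nconst", "knownForTitles")`.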


## Name.basics.tsv
#### This dataset describes people and their professions. Its primaryProfession column is multi-valued (comma-separated), so we converted each cell into one row per profession.

In [19]:
# Assign the result back; an in-place replace on a selected column may not update df4
df4["primaryProfession"] = df4["primaryProfession"].fillna("None")
df42 = pd.DataFrame(df4.primaryProfession.str.split(',')
                                         .tolist(), index=df4.nconst).stack()
df42 = df42.reset_index()[['nconst', 0]] 
df42.rename(columns = {0: 'primaryProfession'}, inplace=True)

In [20]:
df4_notNone = df4[df4["primaryProfession"] != "None"]

In [21]:
df4_final = pd.merge(df4_notNone, df42, on="nconst")
# Keep the per-row profession (suffix _y) and drop the original comma-separated one (suffix _x)
df4_final.drop(columns=["primaryProfession_x"], inplace=True)
df4_final.rename(columns={"primaryProfession_y": "primaryProfession"}, inplace=True)

In [22]:
df4_final.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,knownForTitles,primaryProfession
0,nm0000001,Fred Astaire,1899,1987,"tt0050419,tt0031983,tt0072308,tt0053137",soundtrack
1,nm0000001,Fred Astaire,1899,1987,"tt0050419,tt0031983,tt0072308,tt0053137",actor
2,nm0000001,Fred Astaire,1899,1987,"tt0050419,tt0031983,tt0072308,tt0053137",miscellaneous
3,nm0000002,Lauren Bacall,1924,2014,"tt0117057,tt0037382,tt0038355,tt0075213",actress
4,nm0000002,Lauren Bacall,1924,2014,"tt0117057,tt0037382,tt0038355,tt0075213",soundtrack


#### Using this information, we created 3 entities: actor (which contains both actor and actress information), writer and director.

In [23]:
df4_final_actor_actress = df4_final[(df4_final["primaryProfession"] == "actor") | 
                                    (df4_final["primaryProfession"] == "actress")]
df4_final_director = df4_final[df4_final["primaryProfession"] == "director"]
df4_final_writer = df4_final[df4_final["primaryProfession"] == "writer"]

In [24]:
# Work on an independent copy so that adding a column does not raise SettingWithCopyWarning
df4_final_actor_actress = df4_final_actor_actress.copy()
df4_final_actor_actress["gender"] = "None"


In [25]:
df4_final_actor_actress.loc[df4_final_actor_actress['primaryProfession'] == 'actor', 'gender'] = 'male'
df4_final_actor_actress.loc[df4_final_actor_actress['primaryProfession'] == 'actress', 'gender'] = 'female'
df4_final_actor_actress

Unnamed: 0,nconst,primaryName,birthYear,deathYear,knownForTitles,primaryProfession,gender
1,nm0000001,Fred Astaire,1899,1987,"tt0050419,tt0031983,tt0072308,tt0053137",actor,male
3,nm0000002,Lauren Bacall,1924,2014,"tt0117057,tt0037382,tt0038355,tt0075213",actress,female
5,nm0000003,Brigitte Bardot,1934,\N,"tt0057345,tt0049189,tt0054452,tt0056404",actress,female
8,nm0000004,John Belushi,1949,1982,"tt0078723,tt0077975,tt0080455,tt0072562",actor,male
13,nm0000005,Ingmar Bergman,1918,2007,"tt0069467,tt0050986,tt0050976,tt0083922",actor,male
...,...,...,...,...,...,...,...
14273827,nm9993698,Sebi John,\N,\N,tt8736744,actor,male
14273828,nm9993699,Dani Jacob,\N,\N,tt8736744,actor,male
14273829,nm9993700,Sexy Angel,\N,\N,tt7523066,actress,female
14273830,nm9993701,Sanjai Kuriakose,\N,\N,tt8736744,actor,male
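
The gender column above is derived purely from the profession label (actor or actress). A compact sketch of the same derivation, assuming toy data with placeholder ids, uses `isin` for the filter and `map` for the labelling in place of a placeholder value followed by two `.loc` assignments.

```python
import pandas as pd

people_toy = pd.DataFrame({
    "nconst": ["nm1", "nm2", "nm3"],
    "primaryProfession": ["actor", "actress", "director"],
})

# Keep only acting professions; .copy() makes the result an independent frame.
cast = people_toy[people_toy["primaryProfession"].isin(["actor", "actress"])].copy()
# Derive gender from the profession label in a single step.
cast["gender"] = cast["primaryProfession"].map({"actor": "male", "actress": "female"})
print(cast)
```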


## Title.crew.tsv
#### In this dataset, both the directors and writers columns contain multi-valued cells, so we generated one row per value. We also removed null values.

In [26]:
df2 = pd.read_csv('title.crew.tsv.gz', delimiter ='\t', encoding='utf-8', low_memory=False)
df2.head()

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N


In [27]:
df2_writers = df2[(df2["writers"] != "\\N")]
# Print only the number of distinct writer combinations; printing the full list of
# unique values exceeds the notebook's IOPub data rate limit
print(df2_writers["writers"].nunique())


In [28]:
df2_writers

Unnamed: 0,tconst,directors,writers
8,tt0000009,nm0085156,nm0085156
34,tt0000036,nm0005690,nm0410331
74,tt0000076,nm0005690,nm0410331
89,tt0000091,nm0617588,nm0617588
106,tt0000108,nm0005690,nm0410331
...,...,...,...
10234933,tt9916848,nm1485677,"nm9187127,nm1485677,nm9826385,nm9299459,nm1628284"
10234934,tt9916850,nm1485677,"nm9187127,nm1485677,nm9826385,nm1628284"
10234935,tt9916852,nm1485677,"nm9187127,nm1485677,nm9826385,nm9299459,nm1628284"
10234936,tt9916856,nm10538645,nm6951431
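
The step described for this dataset (one row per director or writer, nulls removed) is not shown in the cells above. One possible way to do it is sketched below, assuming toy data with placeholder ids; the `role` column and the frame names are hypothetical. It stacks both columns with `melt` and then splits the comma-separated ids.

```python
import pandas as pd

crew_toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "directors": ["nm1", "nm2", "nm3"],
    "writers": ["nm1", "nm4,nm2", "\\N"],
})

# Stack the two role columns into (tconst, role, nconst) rows.
long_crew = crew_toy.melt(id_vars="tconst", var_name="role", value_name="nconst")
# Remove IMDb's \N placeholder, then emit one row per person id.
long_crew = long_crew[long_crew["nconst"] != "\\N"]
long_crew = long_crew.assign(nconst=long_crew["nconst"].str.split(",")).explode("nconst")
long_crew = long_crew.reset_index(drop=True)

writers_rel = long_crew[long_crew["role"] == "writers"]
print(writers_rel)
```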


## Title.principals.tsv
### We first merged this dataset with the regrouped basics dataset, which contains the movie, series and episode information.


In [30]:
df5 = pd.read_csv('title.principals.tsv.gz', delimiter ='\t', encoding='utf-8', low_memory=False)

In [31]:
df_principal_basecnew = pd.merge(dfBasicsNew, df5, left_on=['tconst'], right_on = ['tconst'])

In [32]:
df_principal_basecnew.titleType.value_counts()

titleType
episode    44276835
movie      10617738
series      1685539
Name: count, dtype: int64

In [33]:
df_principal_movie = df_principal_basecnew[(df_principal_basecnew['titleType']=='movie')]
df_principal_series = df_principal_basecnew[(df_principal_basecnew['titleType']=='series')]
df_principal_episode = df_principal_basecnew[(df_principal_basecnew['titleType']=='episode')]

#### The cell above splits the merged dataset by title type into 3 datasets that hold the principal information of movies, series and episodes respectively.

## For movies

In [34]:
dfMovieRelation = df_principal_movie[(df_principal_movie['category']=='actor') | 
                                         (df_principal_movie['category']=='actress') |
                                         (df_principal_movie['category']=='writer') |
                                         (df_principal_movie['category']=='director')]
# drop() returns a new frame; reassigning it avoids modifying a slice in place
dfMovieRelation = dfMovieRelation.drop(columns=['titleType', 'primaryTitle', 'originalTitle', 'isAdult',
                                                'startYear', 'endYear', 'runtimeMinutes', 'job', 'genres'])


#### Acted by relation set between actor and movie

In [35]:
dfActorMovieRelation = dfMovieRelation[(dfMovieRelation["category"] == "actor") | 
                                       (dfMovieRelation["category"] == "actress")]
dfActorMovieRelation.head()

Unnamed: 0,tconst,ordering,nconst,category,characters
11,tt0000005,1,nm0443482,actor,"[""Blacksmith""]"
12,tt0000005,2,nm0653042,actor,"[""Assistant""]"
16,tt0000007,1,nm0179163,actor,\N
17,tt0000007,2,nm0183947,actor,\N
21,tt0000008,1,nm0653028,actor,"[""Sneezing Man""]"


#### Directed by relation set between director and movie

In [36]:
dfDirectorMovieRelation = dfMovieRelation[(dfMovieRelation["category"] == "director")]
dfDirectorMovieRelation.head()

Unnamed: 0,tconst,ordering,nconst,category,characters
1,tt0000001,2,nm0005690,director,\N
3,tt0000002,1,nm0721526,director,\N
5,tt0000003,1,nm0721526,director,\N
9,tt0000004,1,nm0721526,director,\N
13,tt0000005,3,nm0005690,director,\N


#### Written by relation set between writer and movie

In [37]:
dfWriterMovieRelation = dfMovieRelation[(dfMovieRelation["category"] == "writer")]
dfWriterMovieRelation.head()

Unnamed: 0,tconst,ordering,nconst,category,characters
89,tt0000036,3,nm0410331,writer,\N
146,tt0000076,3,nm0410331,writer,\N
188,tt0000108,3,nm0410331,writer,\N
192,tt0000109,3,nm0410331,writer,\N
196,tt0000110,3,nm0410331,writer,\N


## For series 

In [38]:
dfSeriesRelation = df_principal_series[(df_principal_series['category']=='actor') | 
                                         (df_principal_series['category']=='actress') |
                                         (df_principal_series['category']=='writer') |
                                         (df_principal_series['category']=='director')]
# drop() returns a new frame; reassigning it avoids modifying a slice in place
dfSeriesRelation = dfSeriesRelation.drop(columns=['titleType', 'primaryTitle', 'originalTitle', 'isAdult',
                                                  'startYear', 'endYear', 'runtimeMinutes', 'job', 'genres'])


#### Acted by relation set between actor and series


In [39]:
dfActorSeriesRelation = dfSeriesRelation[(dfSeriesRelation["category"] == "actor") | 
                                       (dfSeriesRelation["category"] == "actress")]
dfActorSeriesRelation.head()

Unnamed: 0,tconst,ordering,nconst,category,characters
10617751,tt0039120,10,nm0416564,actor,"[""Announcer""]"
10617759,tt0039120,8,nm0274631,actress,"[""Co-host"",""Self""]"
10617760,tt0039120,9,nm0240118,actor,"[""Announcer""]"
10617762,tt0039121,2,nm0114766,actor,"[""King Cole (1949)""]"
10617763,tt0039121,3,nm0333320,actress,"[""Host (1948)""]"


#### Directed by relation set between director and series


In [40]:
dfDirectorSeriesRelation = dfSeriesRelation[(dfSeriesRelation["category"] == "director")]
dfDirectorSeriesRelation.head()

Unnamed: 0,tconst,ordering,nconst,category,characters
10645795,tt0080344,5,nm0469957,director,\N
10659431,tt0096275,5,nm0288476,director,\N
10665966,tt0102759,5,nm0924334,director,\N
10666067,tt0103246,5,nm0776209,director,\N
10670997,tt0107506,5,nm1091568,director,\N


#### Written by relation set between writer and series


In [41]:
dfWriterSeriesRelation = dfSeriesRelation[(dfSeriesRelation["category"] == "writer")]
dfWriterSeriesRelation.head()

Unnamed: 0,tconst,ordering,nconst,category,characters
10617987,tt0040051,5,nm0548529,writer,\N
10618069,tt0040996,5,nm0326055,writer,\N
10618144,tt0041008,5,nm0520492,writer,\N
10618198,tt0041018,5,nm2186062,writer,\N
10618374,tt0041038,5,nm0872077,writer,\N
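
The movie and series sections repeat the same three category filters. As an illustrative sketch only, the relation tables could be built from one mapping of relation name to categories. The names `RELATIONS` and `relation_tables` and the placeholder ids are assumptions, not part of the notebook.

```python
import pandas as pd

principals_toy = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt2", "tt1"],
    "nconst": ["nm1", "nm2", "nm3", "nm4"],
    "category": ["actor", "director", "writer", "producer"],
})

# Relation name -> principal categories it covers (names are illustrative).
RELATIONS = {
    "acted_by": ["actor", "actress"],
    "directed_by": ["director"],
    "written_by": ["writer"],
}

# One filtered table per relation; categories outside the mapping are ignored.
relation_tables = {
    name: principals_toy[principals_toy["category"].isin(categories)]
    for name, categories in RELATIONS.items()
}
for name, table in relation_tables.items():
    print(name, len(table))
```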


## Adding ratings to movies, series and episodes

In [45]:
dfMovieNew = pd.merge(dfMovie, df6, how = "left", left_on=['tconst'], right_on = ['tconst'])
dfSeriesNew = pd.merge(dfSeries, df6, how = "left", left_on=['tconst'], right_on = ['tconst'])
dfEpisodeNew = pd.merge(dfEpisode, df6, how = "left", left_on=['tconst'], right_on = ['tconst'])

In [46]:
dfEpisodeNew

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0041951,episode,The Tenderfeet,The Tenderfeet,0,1949,\N,30,Western,7.6,86.0
1,tt0042816,episode,Othello,Othello,0,1950,\N,135,Drama,,
2,tt0042889,episode,The Tragedy of King Richard II/II,The Tragedy of King Richard II/II,0,1950,\N,145,Drama,,
3,tt0043426,episode,Coriolanus,Coriolanus,0,1951,\N,60,Drama,,
4,tt0043631,episode,The Life of King Henry V,The Life of King Henry V,0,1951,\N,133,Drama,,
...,...,...,...,...,...,...,...,...,...,...,...
7802448,tt9916846,episode,Episode #3.18,Episode #3.18,0,2009,\N,\N,"Action,Drama,Family",,
7802449,tt9916848,episode,Episode #3.17,Episode #3.17,0,2009,\N,\N,"Action,Drama,Family",,
7802450,tt9916850,episode,Episode #3.19,Episode #3.19,0,2010,\N,\N,"Action,Drama,Family",,
7802451,tt9916852,episode,Episode #3.20,Episode #3.20,0,2010,\N,\N,"Action,Drama,Family",,


In [47]:
dfSeriesNew

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0035599,series,Voice of Firestone Televues,Voice of Firestone Televues,0,1943,1947,15,\N,,
1,tt0035803,series,The German Weekly Review,Die Deutsche Wochenschau,0,1940,1945,12,"Documentary,News",8.1,57.0
2,tt0038276,series,You Are an Artist,You Are an Artist,0,1946,1955,15,Talk-Show,,
3,tt0039120,series,Americana,Americana,0,1947,1949,30,"Family,Game-Show",2.7,18.0
4,tt0039121,series,Birthday Party,Birthday Party,0,1947,1949,30,Family,,
...,...,...,...,...,...,...,...,...,...,...,...
301379,tt9916210,series,Rumpole of the Bailey,Rumpole of the Bailey,0,\N,\N,\N,\N,,
301380,tt9916216,series,Kalyanam Mudhal Kadhal Varai,Kalyanam Mudhal Kadhal Varai,0,2014,2017,22,Romance,8.6,15.0
301381,tt9916218,series,Lost in Food,Lost in Food,0,2016,2017,\N,Talk-Show,,
301382,tt9916380,series,Meie aasta Aafrikas,Meie aasta Aafrikas,0,2019,\N,43,"Adventure,Comedy,Family",8.3,117.0


In [48]:
dfMovieNew

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0000001,movie,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short",5.7,2000.0
1,tt0000002,movie,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short",5.8,269.0
2,tt0000003,movie,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance",6.5,1889.0
3,tt0000004,movie,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short",5.5,178.0
4,tt0000005,movie,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short",6.2,2676.0
...,...,...,...,...,...,...,...,...,...,...,...
1769582,tt9916730,movie,6 Gunn,6 Gunn,0,2017,\N,116,Drama,7.6,11.0
1769583,tt9916754,movie,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,0,2013,\N,49,Documentary,,
1769584,tt9916756,movie,Pretty Pretty Black Girl,Pretty Pretty Black Girl,0,2019,\N,\N,Short,,
1769585,tt9916764,movie,38,38,0,2018,\N,\N,Short,,


## Dropping columns and converting datasets to TSV files - basics + ratings

### Dropping endYear, titleType and genres (plus runtimeMinutes for series and episodes)

In [110]:
dfMovieNewFinal = dfMovieNew.drop(columns=["endYear", "titleType", "genres"], axis = 1)
dfMovieNewFinal.to_csv("Final CSV Files/Movie.tsv", encoding="utf-8", sep="\t", index=False)

In [51]:
dfSeriesNewFinal = dfSeriesNew.drop(columns=["endYear", "titleType", "genres", "runtimeMinutes"], axis = 1)
dfSeriesNewFinal.to_csv("Final CSV Files/Series.tsv", encoding="utf-8", sep="\t", index=False)

In [52]:
dfEpisodeNewFinal = df_episode_new.drop(columns=["endYear", "titleType","runtimeMinutes"], axis = 1)
dfEpisodeNewFinal.to_csv("Final CSV Files/Episode.tsv", encoding="utf-8", sep="\t", index=False)
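
The export cells above assume that the output folder already exists, since `to_csv` does not create missing directories. A minimal sketch of an export helper that creates the folder first is shown here; `export_tsv` is a hypothetical name and the example writes to a temporary directory rather than the notebook's real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_tsv(frame, out_dir, name):
    """Hypothetical helper: write `frame` to <out_dir>/<name>.tsv, creating the folder."""
    out_path = Path(out_dir) / f"{name}.tsv"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # to_csv does not create folders
    frame.to_csv(out_path, sep="\t", encoding="utf-8", index=False)
    return out_path

toy = pd.DataFrame({"tconst": ["tt1"], "primaryTitle": ["Example"]})
with tempfile.TemporaryDirectory() as tmp:
    written = export_tsv(toy, Path(tmp) / "Final CSV Files", "Movie")
    # Read the file back to confirm that the round trip preserves the table.
    round_trip = pd.read_csv(written, sep="\t")
print(round_trip.equals(toy))
```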

### Episode.tsv

In [112]:
dfEpisodeNewFinal.head(5)

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber,primaryTitle,originalTitle,isAdult,startYear,averageRating,numVotes
0,tt0041951,tt0041038,1,9,The Tenderfeet,The Tenderfeet,0,1949,7.6,86.0
1,tt0042816,tt0989125,1,17,Othello,Othello,0,1950,,
2,tt0042889,tt0989125,\N,\N,The Tragedy of King Richard II/II,The Tragedy of King Richard II/II,0,1950,,
3,tt0043426,tt0040051,3,42,Coriolanus,Coriolanus,0,1951,,
4,tt0043631,tt0989125,2,16,The Life of King Henry V,The Life of King Henry V,0,1951,,


In [54]:
df_episode_new.head(5)

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,averageRating,numVotes
0,tt0041951,tt0041038,1,9,episode,The Tenderfeet,The Tenderfeet,0,1949,\N,30,7.6,86.0
1,tt0042816,tt0989125,1,17,episode,Othello,Othello,0,1950,\N,135,,
2,tt0042889,tt0989125,\N,\N,episode,The Tragedy of King Richard II/II,The Tragedy of King Richard II/II,0,1950,\N,145,,
3,tt0043426,tt0040051,3,42,episode,Coriolanus,Coriolanus,0,1951,\N,60,,
4,tt0043631,tt0989125,2,16,episode,The Life of King Henry V,The Life of King Henry V,0,1951,\N,133,,


In [55]:
df_episode_new

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,averageRating,numVotes
0,tt0041951,tt0041038,1,9,episode,The Tenderfeet,The Tenderfeet,0,1949,\N,30,7.6,86.0
1,tt0042816,tt0989125,1,17,episode,Othello,Othello,0,1950,\N,135,,
2,tt0042889,tt0989125,\N,\N,episode,The Tragedy of King Richard II/II,The Tragedy of King Richard II/II,0,1950,\N,145,,
3,tt0043426,tt0040051,3,42,episode,Coriolanus,Coriolanus,0,1951,\N,60,,
4,tt0043631,tt0989125,2,16,episode,The Life of King Henry V,The Life of King Henry V,0,1951,\N,133,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7800094,tt9916846,tt1289683,3,18,episode,Episode #3.18,Episode #3.18,0,2009,\N,\N,,
7800095,tt9916848,tt1289683,3,17,episode,Episode #3.17,Episode #3.17,0,2009,\N,\N,,
7800096,tt9916850,tt1289683,3,19,episode,Episode #3.19,Episode #3.19,0,2010,\N,\N,,
7800097,tt9916852,tt1289683,3,20,episode,Episode #3.20,Episode #3.20,0,2010,\N,\N,,


In [56]:
dfEpisodeNewFinal.set_index(["tconst"]).index.is_unique

True
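
The check above confirms that tconst can serve as a unique key for the episode table. When such a check fails, it helps to see the offending rows; a small sketch using `Series.duplicated` follows, with a hypothetical helper name and toy data.

```python
import pandas as pd

def duplicate_keys(frame, key):
    """Hypothetical helper: return every row whose `key` value occurs more than once."""
    return frame[frame[key].duplicated(keep=False)]

toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt2"],
    "primaryTitle": ["A", "B", "B (again)"],
})
dupes = duplicate_keys(toy, "tconst")
print(dupes)
```

An empty result means the key is unique, which matches `index.is_unique` returning True.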

### Series.tsv

In [57]:
dfSeriesNewFinal

Unnamed: 0,tconst,primaryTitle,originalTitle,isAdult,startYear,averageRating,numVotes
0,tt0035599,Voice of Firestone Televues,Voice of Firestone Televues,0,1943,,
1,tt0035803,The German Weekly Review,Die Deutsche Wochenschau,0,1940,8.1,57.0
2,tt0038276,You Are an Artist,You Are an Artist,0,1946,,
3,tt0039120,Americana,Americana,0,1947,2.7,18.0
4,tt0039121,Birthday Party,Birthday Party,0,1947,,
...,...,...,...,...,...,...,...
301379,tt9916210,Rumpole of the Bailey,Rumpole of the Bailey,0,\N,,
301380,tt9916216,Kalyanam Mudhal Kadhal Varai,Kalyanam Mudhal Kadhal Varai,0,2014,8.6,15.0
301381,tt9916218,Lost in Food,Lost in Food,0,2016,,
301382,tt9916380,Meie aasta Aafrikas,Meie aasta Aafrikas,0,2019,8.3,117.0


### Movie.tsv

In [58]:
dfMovieNewFinal

Unnamed: 0,tconst,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,averageRating,numVotes
0,tt0000001,Carmencita,Carmencita,0,1894,1,5.7,2000.0
1,tt0000002,Le clown et ses chiens,Le clown et ses chiens,0,1892,5,5.8,269.0
2,tt0000003,Pauvre Pierrot,Pauvre Pierrot,0,1892,4,6.5,1889.0
3,tt0000004,Un bon bock,Un bon bock,0,1892,12,5.5,178.0
4,tt0000005,Blacksmith Scene,Blacksmith Scene,0,1893,1,6.2,2676.0
...,...,...,...,...,...,...,...,...
1769582,tt9916730,6 Gunn,6 Gunn,0,2017,116,7.6,11.0
1769583,tt9916754,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,0,2013,49,,
1769584,tt9916756,Pretty Pretty Black Girl,Pretty Pretty Black Girl,0,2019,\N,,
1769585,tt9916764,38,38,0,2018,\N,,


## Dropping columns and converting datasets to TSV files - akas

In [59]:
movie_alias_final = movie_alias.drop(columns=["language", "types", "attributes"], axis = 1)
movie_alias_final.to_csv("Final CSV Files/movie_alias.tsv", encoding="utf-8", sep="\t", index=False)

In [60]:
series_alias_final = series_alias.drop(columns=["language", "types", "attributes"], axis = 1)
series_alias_final.to_csv("Final CSV Files/series_alias.tsv", encoding="utf-8", sep="\t", index=False)

In [61]:
episode_alias_final = episode_alias.drop(columns=["language", "types", "attributes"], axis = 1)
episode_alias_final.to_csv("Final CSV Files/episode_alias.tsv", encoding="utf-8", sep="\t", index=False)

### movie_alias.tsv

In [62]:
movie_alias_final

Unnamed: 0,titleId,ordering,title,region,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,0
1,tt0000001,2,Carmencita,DE,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,0
3,tt0000001,4,Καρμενσίτα,GR,0
4,tt0000001,5,Карменсита,RU,0
...,...,...,...,...,...
36696853,tt9916756,2,Pretty Pretty Black Girl,\N,1
36696868,tt9916764,1,38,US,0
36696869,tt9916764,2,38,\N,1
36697122,tt9916856,1,The Wind,DE,0


### series_alias.tsv

In [63]:
series_alias_final

Unnamed: 0,titleId,ordering,title,region,isOriginalTitle
204777,tt0035599,1,Voice of Firestone Televues,\N,1
204778,tt0035599,2,Voice of Firestone Televues,US,0
206488,tt0035803,1,The German Weekly Review,GB,0
206489,tt0035803,2,The German Weekly Review,US,0
206490,tt0035803,3,Die Deutsche Wochenschau,DE,0
...,...,...,...,...,...
36696251,tt9916218,1,Lost in Food,CN,0
36696412,tt9916380,1,Meie aasta Aafrikas,EE,0
36696413,tt9916380,2,Meie aasta Aafrikas,\N,1
36696800,tt9916678,1,Acelerados,BR,0


### episode_alias.tsv

In [64]:
episode_alias_final

Unnamed: 0,titleId,ordering,title,region,isOriginalTitle
262032,tt0041951,1,The Tenderfeet,US,0
270461,tt0042816,1,Othello,GB,0
271262,tt0042889,1,The Tragedy of King Richard II/II,\N,1
271263,tt0042889,2,The Tragedy of King Richard II/II,GB,0
271264,tt0042889,3,Richard II,GB,0
...,...,...,...,...,...
36697117,tt9916852,3,Folge #3.20,DE,0
36697118,tt9916852,4,エピソード #3.20,JP,0
36697119,tt9916852,5,Episódio #3.20,PT,0
36697120,tt9916852,6,Episodio #3.20,IT,0


## Dropping columns and converting datasets to TSV files - name.basics

### Actor

In [65]:
df_actor_final_new = df4_final_actor_actress.drop(columns=["primaryProfession"], axis = 1)

In [66]:
df_actor_knownForTitles = df_actor_final_new[["nconst", "knownForTitles"]]
df_actor_knownForTitles = df_actor_knownForTitles[df_actor_knownForTitles["knownForTitles"] != "\\N"]
df_actor_knownForTitles = pd.DataFrame(df_actor_knownForTitles.knownForTitles.str.split(',')
                                          .tolist(), index=df_actor_knownForTitles.nconst).stack()
df_actor_knownForTitles = df_actor_knownForTitles.reset_index()[['nconst', 0]] 
df_actor_knownForTitles.columns = ['nconst', 'knownForTitles']

In [67]:
df_actor_knownForTitles.to_csv("Final CSV Files/actorknownfortitles.tsv", encoding="utf-8", sep="\t", index=False)

In [68]:
df_actor_final_new.drop(columns = ["knownForTitles"], axis = 1, inplace=True)

In [69]:
df_actor_final_new.to_csv("Final CSV Files/actor.tsv", encoding="utf-8", sep="\t", index=False)

### actor.tsv

In [70]:
df_actor_final_new

Unnamed: 0,nconst,primaryName,birthYear,deathYear,gender
1,nm0000001,Fred Astaire,1899,1987,male
3,nm0000002,Lauren Bacall,1924,2014,female
5,nm0000003,Brigitte Bardot,1934,\N,female
8,nm0000004,John Belushi,1949,1982,male
13,nm0000005,Ingmar Bergman,1918,2007,male
...,...,...,...,...,...
14273827,nm9993698,Sebi John,\N,\N,male
14273828,nm9993699,Dani Jacob,\N,\N,male
14273829,nm9993700,Sexy Angel,\N,\N,female
14273830,nm9993701,Sanjai Kuriakose,\N,\N,male


### actorknownfortitles.tsv

In [106]:
df_actor_knownForTitles

Unnamed: 0,nconst,knownForTitles
0,nm0000001,tt0050419
1,nm0000001,tt0031983
2,nm0000001,tt0072308
3,nm0000001,tt0053137
4,nm0000002,tt0117057
...,...,...
8941512,nm9993701,tt8736744
8941513,nm9993703,tt11212278
8941514,nm9993703,tt6225166
8941515,nm9993703,tt10627062


### Director

In [71]:
df_director_final_new = df4_final_director.drop(columns=["primaryProfession"])

In [72]:
df_director_knownForTitles = df_director_final_new[["nconst", "knownForTitles"]]
df_director_knownForTitles = df_director_knownForTitles[df_director_knownForTitles["knownForTitles"] != "\\N"]
df_director_knownForTitles = pd.DataFrame(df_director_knownForTitles.knownForTitles.str.split(',')
                                          .tolist(), index=df_director_knownForTitles.nconst).stack()
df_director_knownForTitles = df_director_knownForTitles.reset_index()[['nconst', 0]] 
df_director_knownForTitles.columns = ['nconst', 'knownForTitles']

In [73]:
df_director_knownForTitles.to_csv("Final CSV Files/directorknownfortitles.tsv", encoding="utf-8", sep="\t", index=False)

In [74]:
df_director_final_new.drop(columns=["knownForTitles"], inplace=True)

In [75]:
df_director_final_new.to_csv("Final CSV Files/director.tsv", encoding="utf-8", sep="\t", index=False)

### directorknownfortitles.tsv

In [107]:
df_director_knownForTitles

Unnamed: 0,nconst,knownForTitles
0,nm0000005,tt0069467
1,nm0000005,tt0050986
2,nm0000005,tt0050976
3,nm0000005,tt0083922
4,nm0000008,tt0047296
...,...,...
1858228,nm9993708,tt11772858
1858229,nm9993709,tt11697102
1858230,nm9993709,tt17717854
1858231,nm9993709,tt11772812


### director.tsv

In [76]:
df_director_final_new

Unnamed: 0,nconst,primaryName,birthYear,deathYear
12,nm0000005,Ingmar Bergman,1918,2007
22,nm0000008,Marlon Brando,1924,2004
28,nm0000010,James Cagney,1899,1986
52,nm0000019,Federico Fellini,1920,1993
67,nm0000024,John Gielgud,1904,2000
...,...,...,...,...
14273810,nm9993679,Art Jones,\N,\N
14273822,nm9993694,Chinmay Mishra,\N,\N
14273824,nm9993696,Ibrahim-Aloduley,\N,\N
14273837,nm9993708,Eli Bevins,\N,\N


### Writer

In [77]:
df_writer_final_new = df4_final_writer.drop(columns=["primaryProfession"])

In [78]:
df_writer_knownForTitles = df_writer_final_new[["nconst", "knownForTitles"]]
df_writer_knownForTitles = df_writer_knownForTitles[df_writer_knownForTitles["knownForTitles"] != "\\N"]
df_writer_knownForTitles = pd.DataFrame(df_writer_knownForTitles.knownForTitles.str.split(',')
                                          .tolist(), index=df_writer_knownForTitles.nconst).stack()
df_writer_knownForTitles = df_writer_knownForTitles.reset_index()[['nconst', 0]] 
df_writer_knownForTitles.columns = ['nconst', 'knownForTitles']

In [79]:
df_writer_knownForTitles.to_csv("Final CSV Files/writerknownfortitles.tsv", encoding="utf-8", sep="\t", index=False)

In [80]:
df_writer_final_new.drop(columns=["knownForTitles"], inplace=True)

In [81]:
df_writer_final_new.to_csv("Final CSV Files/writer.tsv", encoding="utf-8", sep="\t", index=False)
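
The actor, director, and writer cells repeat the same three steps: drop `primaryProfession`, write a long `knownForTitles` table, then write the person table without `knownForTitles`. As a sketch only, those steps could be folded into one function. `export_role` is a hypothetical helper, it uses `explode` instead of the notebook's split-and-stack, and creating the output folder when it is missing is an added assumption. The demo writes made-up rows to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_role(df_role: pd.DataFrame, role: str, out_dir: str = "Final CSV Files") -> None:
    """Hypothetical helper: write <role>.tsv and <role>knownfortitles.tsv."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # assumption: create the folder if absent

    people = df_role.drop(columns=["primaryProfession"])

    # One row per (person, title) pair, skipping people with no listed titles.
    known = people.loc[people["knownForTitles"] != "\\N", ["nconst", "knownForTitles"]]
    known = (
        known.assign(knownForTitles=known["knownForTitles"].str.split(","))
             .explode("knownForTitles")
             .reset_index(drop=True)
    )
    known.to_csv(out / f"{role}knownfortitles.tsv", encoding="utf-8", sep="\t", index=False)

    # Person table without the column that was just normalised out.
    people.drop(columns=["knownForTitles"]).to_csv(
        out / f"{role}.tsv", encoding="utf-8", sep="\t", index=False
    )


# Demo on illustrative rows, written to a temporary folder.
demo = pd.DataFrame({
    "nconst": ["nm1", "nm2"],
    "primaryName": ["A", "B"],
    "birthYear": ["1900", "\\N"],
    "deathYear": ["1980", "\\N"],
    "primaryProfession": ["writer", "writer,director"],
    "knownForTitles": ["tt1,tt2", "\\N"],
})
with tempfile.TemporaryDirectory() as tmp:
    export_role(demo, "writer", out_dir=tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())
    known_back = pd.read_csv(Path(tmp) / "writerknownfortitles.tsv", sep="\t")
print(written)
```

With such a helper, the three role blocks would reduce to three calls, for example `export_role(df4_final_writer, "writer")`.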

### writerknownfortitles.tsv

In [108]:
df_writer_knownForTitles

Unnamed: 0,nconst,knownForTitles
0,nm0000004,tt0078723
1,nm0000004,tt0077975
2,nm0000004,tt0080455
3,nm0000004,tt0072562
4,nm0000005,tt0069467
...,...,...
2172867,nm9993709,tt11772904
2172868,nm9993713,tt10709066
2172869,nm9993713,tt15134202
2172870,nm9993713,tt20319332


### writer.tsv

In [82]:
df_writer_final_new

Unnamed: 0,nconst,primaryName,birthYear,deathYear
10,nm0000004,John Belushi,1949,1982
11,nm0000005,Ingmar Bergman,1918,2007
51,nm0000019,Federico Fellini,1920,1993
66,nm0000024,John Gielgud,1904,2000
76,nm0000027,Alec Guinness,1914,2000
...,...,...,...,...
14273795,nm9993657,Jason Green,\N,\N
14273823,nm9993694,Chinmay Mishra,\N,\N
14273838,nm9993708,Eli Bevins,\N,\N
14273841,nm9993709,Lu Bevins,\N,\N


## Dropping columns and exporting datasets to TSV files - title.principals

### All relations related to movies

In [83]:
dfActorMovieRelation = dfActorMovieRelation.drop(columns=['category'])


In [84]:
dfActorMovieRelation.to_csv("Final CSV Files/movie_actor_relation.tsv", encoding="utf-8", sep="\t", index=False)

In [85]:
dfDirectorMovieRelation = dfDirectorMovieRelation.drop(columns=['category', 'characters'])


In [86]:
dfDirectorMovieRelation.to_csv("Final CSV Files/movie_director_relation.tsv", encoding="utf-8", sep="\t", index=False)

In [87]:
dfWriterMovieRelation = dfWriterMovieRelation.drop(columns=['category', 'characters'])


In [88]:
dfWriterMovieRelation.to_csv("Final CSV Files/movie_writer_relation.tsv", encoding="utf-8", sep="\t", index=False)
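
pandas can emit a "value is trying to be set on a copy of a slice" warning when a frame obtained by filtering a larger frame is modified in place. A minimal sketch of one way to avoid it follows, assuming the relation frames were built by boolean filtering of the title.principals data. The `principals` frame and its rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the title.principals data; values are illustrative only.
principals = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt2"],
    "ordering": [1, 2, 1],
    "nconst": ["nm1", "nm2", "nm3"],
    "category": ["actor", "director", "actress"],
    "characters": ['["Hero"]', "\\N", '["Lead"]'],
})

# .copy() makes the filtered frame independent of its parent frame.
actor_relation = principals[principals["category"].isin(["actor", "actress"])].copy()

# Reassigning the result of drop() returns a new frame instead of mutating in place.
actor_relation = actor_relation.drop(columns=["category"])
print(actor_relation)
```

Either step alone is usually enough; combining them keeps the parent frame untouched and makes the intent explicit.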

### movie_actor_relation.tsv

In [96]:
dfActorMovieRelation

Unnamed: 0,tconst,ordering,nconst,characters
11,tt0000005,1,nm0443482,"[""Blacksmith""]"
12,tt0000005,2,nm0653042,"[""Assistant""]"
16,tt0000007,1,nm0179163,\N
17,tt0000007,2,nm0183947,\N
21,tt0000008,1,nm0653028,"[""Sneezing Man""]"
...,...,...,...,...
10617724,tt9916764,4,nm7403794,"[""Judah Harris""]"
10617730,tt9916856,1,nm3394271,"[""Maria""]"
10617731,tt9916856,2,nm10538650,"[""Sandra""]"
10617732,tt9916856,3,nm10538646,"[""Stephan""]"


### movie_director_relation.tsv

In [97]:
dfDirectorMovieRelation

Unnamed: 0,tconst,ordering,nconst
1,tt0000001,2,nm0005690
3,tt0000002,1,nm0721526
5,tt0000003,1,nm0721526
9,tt0000004,1,nm0721526
13,tt0000005,3,nm0005690
...,...,...,...
10617695,tt9916730,5,nm10538612
10617705,tt9916754,5,nm9272490
10617706,tt9916754,6,nm8349149
10617725,tt9916764,5,nm6685122


### movie_writer_relation.tsv

In [89]:
dfWriterMovieRelation

Unnamed: 0,tconst,ordering,nconst
89,tt0000036,3,nm0410331
146,tt0000076,3,nm0410331
188,tt0000108,3,nm0410331
192,tt0000109,3,nm0410331
196,tt0000110,3,nm0410331
...,...,...,...
10617726,tt9916764,6,nm6687687
10617727,tt9916764,7,nm10538642
10617728,tt9916764,8,nm9641593
10617729,tt9916764,9,nm10538643


### All relations related to series

In [90]:
dfActorSeriesRelation = dfActorSeriesRelation.drop(columns=['category'])


In [91]:
dfActorSeriesRelation.to_csv("Final CSV Files/series_actor_relation.tsv", encoding="utf-8", sep="\t", index=False)

In [92]:
dfDirectorSeriesRelation = dfDirectorSeriesRelation.drop(columns=['category', 'characters'])


In [93]:
dfDirectorSeriesRelation.to_csv("Final CSV Files/series_director_relation.tsv", encoding="utf-8", sep="\t", index=False)

In [94]:
dfWriterSeriesRelation = dfWriterSeriesRelation.drop(columns=['category', 'characters'])


In [95]:
dfWriterSeriesRelation.to_csv("Final CSV Files/series_writer_relation.tsv", encoding="utf-8", sep="\t", index=False)

### series_actor_relation.tsv

In [98]:
dfActorSeriesRelation

Unnamed: 0,tconst,ordering,nconst,characters
10617751,tt0039120,10,nm0416564,"[""Announcer""]"
10617759,tt0039120,8,nm0274631,"[""Co-host"",""Self""]"
10617760,tt0039120,9,nm0240118,"[""Announcer""]"
10617762,tt0039121,2,nm0114766,"[""King Cole (1949)""]"
10617763,tt0039121,3,nm0333320,"[""Host (1948)""]"
...,...,...,...,...
12303271,tt9916678,4,nm10715037,"[""Caio"",""Tito""]"
12303273,tt9916678,6,nm10561260,"[""Henrique"",""Pedro""]"
12303274,tt9916678,7,nm10561289,"[""Ivan"",""O Crush""]"
12303275,tt9916678,8,nm5828671,"[""Carla"",""Juliana""]"


### series_director_relation.tsv

In [99]:
dfDirectorSeriesRelation

Unnamed: 0,tconst,ordering,nconst
10645795,tt0080344,5,nm0469957
10659431,tt0096275,5,nm0288476
10665966,tt0102759,5,nm0924334
10666067,tt0103246,5,nm0776209
10670997,tt0107506,5,nm1091568
...,...,...,...
12272800,tt9352278,2,nm4506941
12276297,tt9422982,1,nm10319788
12288941,tt9646928,3,nm9956362
12290481,tt9675440,4,nm10433499


### series_writer_relation.tsv

In [100]:
dfWriterSeriesRelation

Unnamed: 0,tconst,ordering,nconst
10617987,tt0040051,5,nm0548529
10618069,tt0040996,5,nm0326055
10618144,tt0041008,5,nm0520492
10618198,tt0041018,5,nm2186062
10618374,tt0041038,5,nm0872077
...,...,...,...
12303122,tt9914698,5,nm10537329
12303131,tt9914700,5,nm10184851
12303192,tt9915338,4,nm10184851
12303200,tt9915672,6,nm10537972


## Genre

In [101]:
df_series_genre.genres.value_counts()

genres
Comedy         68397
Drama          57021
Documentary    41055
Reality-TV     25282
Talk-Show      23058
Animation      17575
Family         16964
Romance        12633
Crime          12234
Action         12185
Adventure      11756
Music          11048
Game-Show       8902
Sport           8694
News            8642
History         6926
Mystery         6464
Fantasy         6313
Thriller        6054
Sci-Fi          5430
Short           5302
Horror          5185
Biography       3600
Adult           2701
Musical         1816
War             1387
Western          497
Film-Noir          1
Name: count, dtype: int64

In [102]:
df_movie_genre.to_csv("Final CSV Files/movie_genre.tsv", encoding="utf-8", sep="\t", index=False)

In [103]:
df_series_genre.to_csv("Final CSV Files/series_genre.tsv", encoding="utf-8", sep="\t", index=False)

### movie_genre.tsv

In [104]:
df_movie_genre

Unnamed: 0,tconst,genres
0,tt0000001,Documentary
1,tt0000001,Short
2,tt0000002,Animation
3,tt0000002,Short
4,tt0000003,Animation
...,...,...
3032606,tt9916730,Drama
3032607,tt9916754,Documentary
3032608,tt9916756,Short
3032609,tt9916764,Short


### series_genre.tsv

In [105]:
df_series_genre

Unnamed: 0,tconst,genres
0,tt0035803,Documentary
1,tt0035803,News
2,tt0038276,Talk-Show
3,tt0039120,Family
4,tt0039120,Game-Show
...,...,...
387117,tt9916218,Talk-Show
387118,tt9916380,Adventure
387119,tt9916380,Comedy
387120,tt9916380,Family
