**Firstly, import and read the necessary datasets for the project.**


**title.principals.tsv: Contains the principal cast/crew for titles**
tconst (string) - alphanumeric unique identifier of the title
ordering (integer) – a number to uniquely identify rows for a given titleId
nconst (string) - alphanumeric unique identifier of the name/person
category (string) - the category of job that person was in
job (string) - the specific job title if applicable, else '\N'

**name.basics.tsv – Contains the following information for names of people:**
nconst (string) - alphanumeric unique identifier of the name/person
primaryName (string)– name by which the person is most often credited
birthYear – in YYYY format
deathYear – in YYYY format if applicable, else '\N'
primaryProfession (array of strings)– the top-3 professions of the person
knownForTitles (array of tconsts) – titles the person is known for.


**title.basics.tsv contains all available title information from imdb:**
tconst (string) - alphanumeric unique identifier of the title
titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
originalTitle (string) - original title, in the original language
isAdult (boolean) - 0: non-adult title; 1: adult title
startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
runtimeMinutes – primary runtime of the title, in minutes
genres (string array) – includes up to three genres associated with the title.

As the project focuses on movies, only titles of type 'movie', 'tvMovie', and 'short' are kept; all other formats are excluded.
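A minimal sketch of this filter, assuming a tiny made-up stand-in for title.basics.tsv (the rows and the names `sample_tsv`, `kept_types` and `movies_only` are illustrative, not part of the project). It also shows one way the '\N' placeholder could be turned into NaN already at read time, via the `na_values` parameter of `pd.read_csv`:

```python
import io
import pandas as pd

# Tiny stand-in for title.basics.tsv (made-up rows, subset of the real columns)
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\n"
    "tt0000001\tmovie\tExample Feature\t1999\t\\N\n"
    "tt0000002\ttvSeries\tExample Series\t2001\t2005\n"
    "tt0000003\tshort\tExample Short\t2010\t\\N\n"
    "tt0000004\ttvMovie\tExample TV Movie\t2012\t\\N\n"
    "tt0000005\tvideoGame\tExample Game\t2015\t\\N\n"
)

# na_values turns the IMDb placeholder '\N' into NaN while reading
titles = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Keep only the title types the project focuses on
kept_types = ["movie", "tvMovie", "short"]
movies_only = titles[titles["titleType"].isin(kept_types)].reset_index(drop=True)

print(movies_only[["tconst", "titleType"]])
```

Listing the types to keep, rather than the types to drop, means a title type that was not anticipated is excluded by default.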

**movie-links.gz - file containing movies and their citations from a repository...**
**countries.gz - file containing movies and their country of origin;** in some cases, there is more than one value.

Using the functions imported from listreader, the last two files are parsed further to build the corresponding dataframes, since in their original state they contain only plain text. Remember to credit the authors of the other project -- link--



In [2]:

# creating the tables with pandas to interact with our datasets
# getting the current working directory
cwd = os.getcwd()

name = cwd+'/datasets/name.basics.tsv'
name_table = pd.read_csv(name , sep='\t', header=0)

title = cwd+'/datasets/title.principals.tsv'
title_table = pd.read_csv(title , sep='\t', header=0)

# import the titles dataset to match the movie-links titles
title_name = cwd+'/datasets/title.basics.tsv'
title_name_table= pd.read_csv(title_name, sep='\t', header=0, low_memory=False)


countries = cwd+'/datasets/countries.list.gz'   #get countries
countries_table = get_movie_countries(countries)

movies = cwd+'/datasets/movie-links.list.gz'
movies_table = get_movie_links(movies)

'''
testing the tables
'''
# print(movies_table.head(5))
# print(countries_table.head(5))
# print(title_table.head(5))
# print(name_table.head(5))


#merge films and citations with their corresponding countries:

movies_cit_countries = pd.merge(movies_table, countries_table, on='movie', how='left')

#perform second merge for the movie's citations
movies_cit_countries=pd.merge(movies_cit_countries, countries_table, left_on="cites", right_on="movie",how="left")

# rename the suffixed columns produced by the second merge
movies_cit_countries.rename(columns={"movie_x":"movie","country_x":"movie_country","country_y":"cites_country"},inplace=True)
# keep only the needed columns; this drops 'movie_y', which duplicates 'cites'
movies_cit_countries=movies_cit_countries[['movie', 'movie_country', 'cites', 'cites_country']].copy()
movies_cit_countries.head(100)

Unnamed: 0,movie,movie_country,cites,cites_country
0,#DevinAuditions (2016),Canada,Alien (1979),UK
1,#DevinAuditions (2016),Canada,Alien (1979),USA
2,#DevinAuditions (2016),Canada,Die Hard (1988),USA
3,#DevinAuditions (2016),Canada,Ferris Bueller's Day Off (1986),USA
4,#DevinAuditions (2016),Canada,Fight Club (1999),USA
...,...,...,...,...
95,'Joe Strummer': The Man (2007),USA,Joe Strummer: The Future Is Unwritten (2007),UK
96,'Kurenai no kenjû' yo eien ni (2000),Japan,Kurenai no kenju (1961),Japan
97,'Men Olsenbanden var ikke død!' (1984),Norway,Olsen-banden møter kongen og knekten (1974),Norway
98,'Men Olsenbanden var ikke død!' (1984),Norway,Olsen-banden og Dynamitt-Harry (1970),Norway


In [3]:
movies_cit_countries['movie_country'].unique()
movies_cit_countries['cites_country'].unique()

array(['UK', 'USA', 'Germany', 'Hong Kong', 'New Zealand', 'Australia',
       'Netherlands', 'India', 'West Germany', 'Finland', 'Poland',
       'France', 'Italy', 'Philippines', 'Mexico', 'Ireland', 'Japan',
       'Norway', 'Greece', 'Jamaica', 'Belgium', 'Spain', 'Soviet Union',
       'Canada', 'South Korea', 'Argentina', 'Bolivia', 'Sweden',
       'Bosnia and Herzegovina', 'Austria', 'Turkey', 'Vietnam',
       'East Germany', 'Taiwan', 'Czechoslovakia', 'Yugoslavia',
       'Czech Republic', 'Bahamas', 'Iran', 'Ukraine', nan, 'Singapore',
       'Thailand', 'Brazil', 'Switzerland', 'China', 'Denmark', 'Romania',
       'South Africa', 'Israel', 'Cambodia', 'Indonesia', 'Panama',
       'Portugal', 'Pakistan', 'Chile', 'Peru', 'Estonia', 'Hungary',
       'Russia', 'Ghana', 'Burkina Faso', 'Malaysia', 'Luxembourg',
       'United Arab Emirates', 'Morocco', 'Monaco', 'Mauritius', 'Serbia',
       'Albania', 'Senegal', 'Latvia', 'Bulgaria',
       'Federal Republic of Yugoslavia'

To merge data coming from different sources, the cleaning process includes:
<li> replacing "\\N" with np.nan values in the IMDb datasets.
<li> extracting values from the 'movie' column of the movie-links dataset and storing title and year in separate columns.
<li> further cleaning of the titles: removing leading and trailing spaces and normalizing values by removing special characters.<br>
We also identified some differences in the titles of non-English movies; some were fixed, but unfortunately a large part of them had to be dropped.
<li> removing from both datasets the titles that are not movies, TV movies, or shorts.
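
The extraction and normalization steps listed above can be sketched on two made-up rows. The helper `normalize_title` is hypothetical, and it strips accents with the standard-library `unicodedata` module as a stand-in for a transliteration library. Unlike full transliteration, this only removes combining marks, so letters such as 'ø' or 'ð' are left unchanged:

```python
import re
import unicodedata
import pandas as pd

# Made-up rows in the "Title (YYYY)" layout used by the movie-links data
links = pd.DataFrame({"movie": ["Le cinquième élément (1997)", " Fast & Furious (2009)"]})

# Title is everything before the final " (YYYY)"; year is the four digits
pattern = r"^(.*) \((\d{4})\)$"
links[["title", "year"]] = links["movie"].str.extract(pattern)
links["title"] = links["title"].str.strip()

def normalize_title(s):
    """Hypothetical helper: lowercase, strip accents, punctuation and spaces."""
    s = s.replace("&", "and")
    # Decompose accented letters and drop the combining marks
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    # Drop everything that is not a letter, digit or underscore
    return re.sub(r"[^\w]", "", s).lower()

links["title_normalized"] = links["title"].apply(normalize_title)
print(links)
```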

In [4]:
#retrieve information for movies collaborators

# replace the '\N' placeholder with NaN values in the imdb data
title_name_table = title_name_table.replace("\\N", np.nan)

# extract titles and year from the 'movie' column: pattern of four digits between parentheses at the end of the string
pattern = r'^(.*) \((\d{4})\)$'

movies_cit_countries[['title', 'year']] = movies_cit_countries['movie'].str.extract(pattern)
movies_cit_countries[['primaryTitle_citation', 'year_citation']] = movies_cit_countries['cites'].str.extract(pattern)

movies_cit_countries.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 332925 entries, 0 to 332924
Data columns (total 8 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   movie                  332925 non-null  object
 1   movie_country          332634 non-null  object
 2   cites                  332925 non-null  object
 3   cites_country          332867 non-null  object
 4   title                  332925 non-null  object
 5   year                   332925 non-null  object
 6   primaryTitle_citation  332925 non-null  object
 7   year_citation          332925 non-null  object
dtypes: object(8)
memory usage: 22.9+ MB


In [5]:
#******TROUBLESHOOTING missing values on merge ********

#check encoding is same in both, utf-8 
#check that year format is correct everywhere
non_valid_year = list([y for y in movies_cit_countries['year'].unique() if not (str(y).isdigit() and len(str(y)) == 4)])
non_valid_year
error_rows = movies_cit_countries[movies_cit_countries['year'].isin(non_valid_year)]
error_rows

non_valid_year_x = list([y for y in movies_cit_countries['year_citation'].unique() if not (str(y).isdigit() and len(str(y)) == 4)])
non_valid_year_x
error_rows_x = movies_cit_countries[movies_cit_countries['year_citation'].isin(non_valid_year_x)]
error_rows.empty and error_rows_x.empty



True

In [6]:
#remove useless whitespaces
movies_cit_countries['title'] = movies_cit_countries['title'].str.strip()
movies_cit_countries['primaryTitle_citation'] = movies_cit_countries['primaryTitle_citation'].str.strip()

title_name_table['primaryTitle'] = title_name_table['primaryTitle'].str.strip()
title_name_table['originalTitle'] = title_name_table['originalTitle'].str.strip()

#take out tv series and videogames/videos

# title types to exclude; rows with any other type (or a missing type) are kept
excluded_types = ['tvEpisode', 'tvSeries', 'videoGame', 'video', 'tvMiniSeries', 'tvShort', 'tvSpecial']
title_name_table = title_name_table.loc[~title_name_table["titleType"].isin(excluded_types)]



title_name_table.drop(columns=['endYear','runtimeMinutes'], inplace=True)


In [7]:
movies_cit_countries = movies_cit_countries.replace(['Zui jia pai dang 3: Nv huang mi ling'] ,'Zui jia pai dang 3: Nu huang mi ling')


In [8]:

#remove special characters etc

def normalize_string(s):
    s = s.replace('&', 'and')
    s = re.sub(r'[^\w\s]', '', s).lower()  # Keep alphanumeric and whitespace characters
    s = re.sub(r'\s+', '', s)  # Remove whitespace characters
    s = unidecode(s)  # Remove accents and diacritics
    return s


title_name_table['originalTitle_normalized'] = title_name_table['originalTitle'].apply(normalize_string)
movies_cit_countries['title_citation_normalized'] = movies_cit_countries['primaryTitle_citation'].apply(normalize_string)
title_name_table['primaryTitle_normalized'] = title_name_table['primaryTitle'].apply(normalize_string)


movies_cit_countries 

#make sure formats are the same
#assign best datatypes

movies_cit_countries=movies_cit_countries.convert_dtypes()
title_name_table=title_name_table.convert_dtypes()

#they all turn out as str- edit years
#remove nan values

movies_cit_countries['year'] =  movies_cit_countries['year'].fillna('0')
movies_cit_countries['year_citation'] =  movies_cit_countries['year_citation'].fillna('0')
title_name_table['startYear'] =  title_name_table['startYear'].fillna('0')


#assign types
movies_cit_countries= movies_cit_countries.astype({'year': 'int32', 'year_citation':'int32'})

title_name_table= title_name_table.astype({'startYear': 'int32'})



In [9]:
movies_cit_countries.info()
title_name_table.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 332925 entries, 0 to 332924
Data columns (total 9 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   movie                      332925 non-null  string
 1   movie_country              332634 non-null  string
 2   cites                      332925 non-null  string
 3   cites_country              332867 non-null  string
 4   title                      332925 non-null  string
 5   year                       332925 non-null  int32 
 6   primaryTitle_citation      332925 non-null  string
 7   year_citation              332925 non-null  int32 
 8   title_citation_normalized  332925 non-null  string
dtypes: int32(2), string(7)
memory usage: 22.9 MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1692874 entries, 0 to 9640999
Data columns (total 9 columns):
 #   Column                    Dtype 
---  ------                    ----- 
 0   tconst                    strin

---------------------------------------------------------------------------------------------------------------------------
The first outer merge concerns the cited films and connects the movie data on the normalized movie title; the matches are then filtered further, keeping only those whose years differ by at most 3 years. This approach helps to (a) identify the unmatched values and (b) avoid a large number of missing values, given that our data comes from different sources and the recorded years sometimes differ.
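
As an illustration of this two-stage approach, here is a small sketch on made-up rows (the `tt_a` … `tt_d` identifiers are placeholders, not real IMDb ids). The `indicator=True` option of `pd.merge` adds a `_merge` column that labels each row as `both`, `left_only` or `right_only`, which is what makes the unmatched values easy to isolate:

```python
import pandas as pd

# Made-up citation rows and IMDb-like rows, already normalized
citations = pd.DataFrame({
    "title_citation_normalized": ["alien", "ilsuccesso", "unknownfilm"],
    "year_citation": [1979, 1963, 2000],
})
imdb = pd.DataFrame({
    "originalTitle_normalized": ["alien", "alien", "ilsuccesso", "otherfilm"],
    "startYear": [1979, 2011, 1964, 1990],
    "tconst": ["tt_a", "tt_b", "tt_c", "tt_d"],  # placeholder ids
})

merged = pd.merge(
    citations, imdb,
    left_on="title_citation_normalized",
    right_on="originalTitle_normalized",
    how="outer", indicator=True,
)

# Rows found on both sides, kept only if the years differ by at most 3
both = merged[merged["_merge"] == "both"].copy()
both["year_diff"] = both["year_citation"] - both["startYear"]
matched = both[both["year_diff"].between(-3, 3)]

# Rows present on one side only remain available for inspection
left_only = merged[merged["_merge"] == "left_only"]
right_only = merged[merged["_merge"] == "right_only"]

print(matched[["title_citation_normalized", "tconst", "year_diff"]])
```

In this toy data the second "alien" entry (2011) shares the title but falls outside the 3-year window, so only the 1979 entry survives the filter.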


In [10]:
#first merge on title 
final_df = pd.merge(movies_cit_countries, title_name_table,
             left_on=['title_citation_normalized'],
             right_on=['originalTitle_normalized'],how='outer', indicator=True)
                              

final_df.info()                     

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2798154 entries, 0 to 2798153
Data columns (total 19 columns):
 #   Column                     Dtype   
---  ------                     -----   
 0   movie                      string  
 1   movie_country              string  
 2   cites                      string  
 3   cites_country              string  
 4   title                      string  
 5   year                       float64 
 6   primaryTitle_citation      string  
 7   year_citation              float64 
 8   title_citation_normalized  string  
 9   tconst                     string  
 10  titleType                  string  
 11  primaryTitle               string  
 12  originalTitle              string  
 13  isAdult                    string  
 14  startYear                  float64 
 15  genres                     string  
 16  originalTitle_normalized   string  
 17  primaryTitle_normalized    string  
 18  _merge                     category
dtypes: category(1), float

In [11]:
#identify which are the movies that share the same year and filter
matches = final_df[final_df['_merge'] == "both"].copy()

#include 3 year differences
matches['year_diff'] = matches['year_citation'] - matches['startYear']
matches = matches.loc[matches['year_diff'].between(-3, 3)]

#now we have the more accurate matches
              

Unnamed: 0,movie,movie_country,cites,cites_country,title,year,primaryTitle_citation,year_citation,title_citation_normalized,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,genres,originalTitle_normalized,primaryTitle_normalized,_merge,year_diff
0,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,Alien,1979.0,alien,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",alien,alien,both,0.0
10,#DevinAuditions (2016),Canada,Alien (1979),USA,#DevinAuditions,2016.0,Alien,1979.0,alien,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",alien,alien,both,0.0
20,.com for Murder (2002),USA,Alien (1979),UK,.com for Murder,2002.0,Alien,1979.0,alien,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",alien,alien,both,0.0
30,.com for Murder (2002),USA,Alien (1979),USA,.com for Murder,2002.0,Alien,1979.0,alien,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",alien,alien,both,0.0
40,30 Minutes or Less (2011),Germany,Alien (1979),UK,30 Minutes or Less,2011.0,Alien,1979.0,alien,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",alien,alien,both,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1224183,è solo questione di punti di vista (2012),Italy,Letto a tre piazze (1960),Italy,è solo questione di punti di vista,2012.0,Letto a tre piazze,1960.0,lettoatrepiazze,tt0054023,movie,Letto a tre piazze,Letto a tre piazze,0,1960.0,Comedy,lettoatrepiazze,lettoatrepiazze,both,0.0
1224184,è solo questione di tempo (2013),Italy,Il successo (1963),Italy,è solo questione di tempo,2013.0,Il successo,1963.0,ilsuccesso,tt0057540,movie,Il successo,Il successo,0,1963.0,Comedy,ilsuccesso,ilsuccesso,both,0.0
1224185,è solo questione di tempo (2013),Italy,Il successo (1963),Italy,è solo questione di tempo,2013.0,Il successo,1963.0,ilsuccesso,tt1248945,tvMovie,Il successo,Il successo,0,1963.0,,ilsuccesso,ilsuccesso,both,0.0
1224187,è solo questione di tempo (2013),Italy,Il successo (1963),France,è solo questione di tempo,2013.0,Il successo,1963.0,ilsuccesso,tt0057540,movie,Il successo,Il successo,0,1963.0,Comedy,ilsuccesso,ilsuccesso,both,0.0


In [12]:
#find missing values
movies_cit_countries['movie'].unique()


<StringArray>
[                                     '#DevinAuditions (2016)',
                                        '#FromJennifer (2017)',
                                                 '#Rip (2013)',
                                         '#TubeClash02 (2016)',
 '#chicagoGirl: The Social Network Takes on a Dictator (2013)',
                                        '#twitterkills (2014)',
                                                '#will (2016)',
                                            '$10 Raise (1935)',
                               '$10,000 Under a Pillow (1921)',
                                             '$5 a Day (2008)',
 ...
                                'Úristen, itt lönek... (2005)',
                                       'Üksindusfilm 2 (2016)',
                                           'Üvegtigris (2001)',
                                        'Üvegtigris 2. (2006)',
                                        'Üvegtigris 3. (2010)',
                     

In [13]:
matches['primaryTitle_citation'].unique() # about 5,000 values are missing

<StringArray>
[                           'Alien',                         'Die Hard',
         "Ferris Bueller's Day Off",                       'Fight Club',
                       'Magic Mike',                    'The Gold Rush',
                      'The Shining',                       '2 Jennifer',
                      'To Jennifer',                     'Love, Simple',
 ...
                'Með allt á hreinu',              'Plecarea Vlasinilor',
 'Ôgon no taka - Zempen: Makyô-hen',                   'Taii no musume',
                     'Üksindusfilm',                       'Üvegtigris',
                    'Üvegtigris 2.',                        'Smáfuglar',
               'Letto a tre piazze',                      'Il successo']
Length: 40790, dtype: string

The next step is to investigate which films were not merged and to perform a second merge between their title in the movies dataframe and their title in the IMDb dataset, **this time using primaryTitle,** i.e. the title that is more commonly used.
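
A sketch of such a fallback pass, on made-up leftovers (placeholder ids `tt_x`, `tt_y`). It assumes each side keeps only the columns it needs before merging, which is a simplification made here to avoid the `_x`/`_y` suffixes that appear when both frames share column names:

```python
import pandas as pd

# Made-up leftovers from a first merge on the original title
links_unmatched = pd.DataFrame({
    "title_citation_normalized": ["troublesomenight11", "lecinquiemeelement"],
    "year_citation": [2001, 1997],
})
imdb_unmatched = pd.DataFrame({
    "tconst": ["tt_x", "tt_y"],  # placeholder ids
    "originalTitle_normalized": ["yamyeunglo11liugwailomin", "theempirestrikesback"],
    "primaryTitle_normalized": ["troublesomenight11", "starwarsepisodevtheempirestrikesback"],
    "startYear": [2001, 1980],
})

# Second pass: match on the primary title instead of the original title
second_pass = pd.merge(
    links_unmatched, imdb_unmatched,
    left_on="title_citation_normalized",
    right_on="primaryTitle_normalized",
    how="outer", indicator="match_primaryTitle",
)

# Same year-window filter as in the first pass
both = second_pass[second_pass["match_primaryTitle"] == "both"].copy()
both["year_diff"] = both["year_citation"] - both["startYear"]
recovered = both[both["year_diff"].between(-3, 3)]

# Citations that neither pass could link to an IMDb entry
still_missing = second_pass[second_pass["match_primaryTitle"] == "left_only"]

print(recovered[["title_citation_normalized", "tconst"]])
print(still_missing["title_citation_normalized"].tolist())
```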

In [14]:

imdb_unmatched = final_df[final_df['_merge'] == "right_only"].copy()
links_unmatched = final_df[final_df['_merge'] == "left_only"].copy()

imdb_unmatched
matches_1 = pd.merge(links_unmatched, imdb_unmatched,
             left_on=['title_citation_normalized'],
             right_on=['primaryTitle_normalized'],how='outer', indicator="match_primaryTitle") #match with primaryTitle this time

matches_1


Unnamed: 0,movie_x,movie_country_x,cites_x,cites_country_x,title_x,year_x,primaryTitle_citation_x,year_citation_x,title_citation_normalized_x,tconst_x,...,titleType_y,primaryTitle_y,originalTitle_y,isAdult_y,startYear_y,genres_y,originalTitle_normalized_y,primaryTitle_normalized_y,_merge_y,match_primaryTitle
0,(T)Raumschiff Surprise - Periode 1 (2004),Germany,Le cinquième élément (1997),France,(T)Raumschiff Surprise - Periode 1,2004.0,Le cinquième élément,1997.0,lecinquiemeelement,,...,,,,,,,,,,left_only
1,24x36: A Movie About Movie Posters (2016),Canada,Le cinquième élément (1997),France,24x36: A Movie About Movie Posters,2016.0,Le cinquième élément,1997.0,lecinquiemeelement,,...,,,,,,,,,,left_only
2,24x36: A Movie About Movie Posters (2016),USA,Le cinquième élément (1997),France,24x36: A Movie About Movie Posters,2016.0,Le cinquième élément,1997.0,lecinquiemeelement,,...,,,,,,,,,,left_only
3,Astro Boy (2009),Hong Kong,Le cinquième élément (1997),France,Astro Boy,2009.0,Le cinquième élément,1997.0,lecinquiemeelement,,...,,,,,,,,,,left_only
4,Astro Boy (2009),USA,Le cinquième élément (1997),France,Astro Boy,2009.0,Le cinquième élément,1997.0,lecinquiemeelement,,...,,,,,,,,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1577509,,,,,,,,,,,...,movie,Dankyavar Danka,Dankyavar Danka,0,2013.0,Comedy,dankyavardanka,dankyavardanka,right_only,right_only
1577510,,,,,,,,,,,...,short,Hay Que Ser Paciente,Hay Que Ser Paciente,0,2015.0,"Documentary,Short",hayqueserpaciente,hayqueserpaciente,right_only,right_only
1577511,,,,,,,,,,,...,movie,6 Gunn,6 Gunn,0,2017.0,,6gunn,6gunn,right_only,right_only
1577512,,,,,,,,,,,...,movie,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,0,2013.0,Documentary,chicoalbuquerquerevelacoes,chicoalbuquerquerevelacoes,right_only,right_only


In [15]:
matches_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1577514 entries, 0 to 1577513
Data columns (total 39 columns):
 #   Column                       Non-Null Count    Dtype   
---  ------                       --------------    -----   
 0   movie_x                      3588 non-null     string  
 1   movie_country_x              3581 non-null     string  
 2   cites_x                      3588 non-null     string  
 3   cites_country_x              3587 non-null     string  
 4   title_x                      3588 non-null     string  
 5   year_x                       3588 non-null     float64 
 6   primaryTitle_citation_x      3588 non-null     string  
 7   year_citation_x              3588 non-null     float64 
 8   title_citation_normalized_x  3588 non-null     string  
 9   tconst_x                     0 non-null        string  
 10  titleType_x                  0 non-null        string  
 11  primaryTitle_x               0 non-null        string  
 12  originalTitle_x             

In [16]:
matches_primaryTitle = matches_1[matches_1['match_primaryTitle'] == "both"].copy()
matches_primaryTitle.values[0]
matches_primaryTitle['year_diff_prim'] = matches_primaryTitle['year_citation_x'] - matches_primaryTitle['startYear_y']
matches_primaryTitle = matches_primaryTitle.loc[matches_primaryTitle['year_diff_prim'].between(-3, 3)]
matches_primaryTitle

Unnamed: 0,movie_x,movie_country_x,cites_x,cites_country_x,title_x,year_x,primaryTitle_citation_x,year_citation_x,title_citation_normalized_x,tconst_x,...,primaryTitle_y,originalTitle_y,isAdult_y,startYear_y,genres_y,originalTitle_normalized_y,primaryTitle_normalized_y,_merge_y,match_primaryTitle,year_diff_prim
48,(T)Raumschiff Surprise - Periode 1 (2004),Germany,Star Wars: Episode V - The Empire Strikes Back...,USA,(T)Raumschiff Surprise - Periode 1,2004.0,Star Wars: Episode V - The Empire Strikes Back,1980.0,starwarsepisodevtheempirestrikesback,,...,Star Wars: Episode V - The Empire Strikes Back,The Empire Strikes Back,0,1980.0,"Action,Adventure,Fantasy",theempirestrikesback,starwarsepisodevtheempirestrikesback,right_only,both,0.0
49,17 Again (2009),USA,Star Wars: Episode V - The Empire Strikes Back...,USA,17 Again,2009.0,Star Wars: Episode V - The Empire Strikes Back,1980.0,starwarsepisodevtheempirestrikesback,,...,Star Wars: Episode V - The Empire Strikes Back,The Empire Strikes Back,0,1980.0,"Action,Adventure,Fantasy",theempirestrikesback,starwarsepisodevtheempirestrikesback,right_only,both,0.0
50,2010: The Odyssey Continues (1984),USA,Star Wars: Episode V - The Empire Strikes Back...,USA,2010: The Odyssey Continues,1984.0,Star Wars: Episode V - The Empire Strikes Back,1980.0,starwarsepisodevtheempirestrikesback,,...,Star Wars: Episode V - The Empire Strikes Back,The Empire Strikes Back,0,1980.0,"Action,Adventure,Fantasy",theempirestrikesback,starwarsepisodevtheempirestrikesback,right_only,both,0.0
51,24x36: A Movie About Movie Posters (2016),Canada,Star Wars: Episode V - The Empire Strikes Back...,USA,24x36: A Movie About Movie Posters,2016.0,Star Wars: Episode V - The Empire Strikes Back,1980.0,starwarsepisodevtheempirestrikesback,,...,Star Wars: Episode V - The Empire Strikes Back,The Empire Strikes Back,0,1980.0,"Action,Adventure,Fantasy",theempirestrikesback,starwarsepisodevtheempirestrikesback,right_only,both,0.0
52,24x36: A Movie About Movie Posters (2016),USA,Star Wars: Episode V - The Empire Strikes Back...,USA,24x36: A Movie About Movie Posters,2016.0,Star Wars: Episode V - The Empire Strikes Back,1980.0,starwarsepisodevtheempirestrikesback,,...,Star Wars: Episode V - The Empire Strikes Back,The Empire Strikes Back,0,1980.0,"Action,Adventure,Fantasy",theempirestrikesback,starwarsepisodevtheempirestrikesback,right_only,both,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3290,Yin yang lu shi ba zhi Gui shang shen (2003),Hong Kong,Troublesome Night 11 (2001),Hong Kong,Yin yang lu shi ba zhi Gui shang shen,2003.0,Troublesome Night 11,2001.0,troublesomenight11,,...,Troublesome Night 11,Yam yeung lo 11: Liu gwai lo min,0,2001.0,"Comedy,Horror",yamyeunglo11liugwailomin,troublesomenight11,right_only,both,0.0
3291,Yin yang lu shi jiu zhi Wo dui yan jian dao ye...,Hong Kong,Troublesome Night 11 (2001),Hong Kong,Yin yang lu shi jiu zhi Wo dui yan jian dao ye,2003.0,Troublesome Night 11,2001.0,troublesomenight11,,...,Troublesome Night 11,Yam yeung lo 11: Liu gwai lo min,0,2001.0,"Comedy,Horror",yamyeunglo11liugwailomin,troublesomenight11,right_only,both,0.0
3292,Yin yang lu shi liu zhi hui dao wu xia shi dai...,Hong Kong,Troublesome Night 11 (2001),Hong Kong,Yin yang lu shi liu zhi hui dao wu xia shi dai,2002.0,Troublesome Night 11,2001.0,troublesomenight11,,...,Troublesome Night 11,Yam yeung lo 11: Liu gwai lo min,0,2001.0,"Comedy,Horror",yamyeunglo11liugwailomin,troublesomenight11,right_only,both,0.0
3293,Yin yang lu shi qi zhi jian fang you gui (2002),Hong Kong,Troublesome Night 11 (2001),Hong Kong,Yin yang lu shi qi zhi jian fang you gui,2002.0,Troublesome Night 11,2001.0,troublesomenight11,,...,Troublesome Night 11,Yam yeung lo 11: Liu gwai lo min,0,2001.0,"Comedy,Horror",yamyeunglo11liugwailomin,troublesomenight11,right_only,both,0.0


In [17]:
links_missing_completely = matches_1[matches_1['match_primaryTitle'] == "left_only"].copy()#show non matched data
#to check:
links_missing_completely['primaryTitle_citation_x'].unique()

<StringArray>
[                         'Le cinquième élément',
                      'The 12 Dogs of Christmas',
      'Ging chaat goo si 4: Ji gaan daan yam mo',
                '2012 Gold Rush Expedition Race',
                              'Ah fei zing zyun',
                              'Faa yeung nin wa',
      'Non si deve profanare il sonno dei morti',
                             'Jeg, en kvinda II',
              'Yu pu tuan er zhi Yu nu xin jing',
                 'Yu pu tuan: Tou qing bao jian',
 ...
      'Yin yang lu shi qi zhi jian fang you gui',
         'Yin yang lu shi ba zhi Gui shang shen',
                                 'The Goldbergs',
                                       'Yu feng',
                                 'Detstvo Bambi',
 'Yôjû kyôshitsu gaiden 2: Eros no megami kôrin',
                             'Cossacks in Exile',
                                        'Ji sor',
                    'Zhi zun ji zhuang yuan cai',
                               

As we can see, the missing values are often non-English titles. We made lengthy efforts to match the remaining titles using similarity algorithms,
but because these are very computationally demanding, and given the limitations of this project, we had to drop the remaining unrecognized titles.
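
For reference, one possible way to keep similarity matching affordable is to compare each title only against candidates from nearby years instead of the whole dataset. The sketch below is not the approach tried in this project; it uses the standard-library `difflib.get_close_matches`, and the data, the helper `closest_title`, and the `cutoff` and `window` values are all illustrative assumptions:

```python
import difflib

# Made-up normalized titles; candidates are grouped by year so each
# lookup only compares against a small "block" of titles
candidates_by_year = {
    1997: ["thefifthelement", "titanic", "gattaca"],
    2000: ["huayangnianhua", "gladiator"],
}

def closest_title(title, year, cutoff=0.6, window=1):
    """Hypothetical helper: best fuzzy match among titles within +/- window years."""
    pool = []
    for y in range(year - window, year + window + 1):
        pool.extend(candidates_by_year.get(y, []))
    # get_close_matches returns at most n candidates scoring at least cutoff
    hits = difflib.get_close_matches(title, pool, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(closest_title("fifthelement", 1997))
print(closest_title("lecinquiemeelement", 1997))
```

Character-level similarity of this kind can absorb small spelling or transliteration differences, but it cannot be expected to link a title to its translation in another language.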

In [18]:
#remove columns with all nan values after merging

matches = matches.dropna(axis=1, how='all')
matches_primaryTitle = matches_primaryTitle.dropna(axis=1, how='all')

#cleaning

# strip the '_x' / '_y' merge suffixes from the end of the column names
matches_primaryTitle.columns = matches_primaryTitle.columns.str.replace(r"_[xy]$", "", regex=True)
matches_primaryTitle = matches_primaryTitle.reset_index(drop=True)
matches = matches.reset_index(drop=True)
matches_primaryTitle.drop(columns=['_merge'], inplace=True)
matches.drop(columns=['_merge'], inplace=True)
matches_primaryTitle.rename(columns={'year_diff_prim' : 'year_diff'}, inplace= True)

matches


Unnamed: 0,movie,movie_country,cites,cites_country,title,year,primaryTitle_citation,year_citation,title_citation_normalized,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,genres,originalTitle_normalized,primaryTitle_normalized,year_diff
0,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,Alien,1979.0,alien,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",alien,alien,0.0
1,#DevinAuditions (2016),Canada,Alien (1979),USA,#DevinAuditions,2016.0,Alien,1979.0,alien,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",alien,alien,0.0
2,.com for Murder (2002),USA,Alien (1979),UK,.com for Murder,2002.0,Alien,1979.0,alien,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",alien,alien,0.0
3,.com for Murder (2002),USA,Alien (1979),USA,.com for Murder,2002.0,Alien,1979.0,alien,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",alien,alien,0.0
4,30 Minutes or Less (2011),Germany,Alien (1979),UK,30 Minutes or Less,2011.0,Alien,1979.0,alien,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",alien,alien,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
358636,è solo questione di punti di vista (2012),Italy,Letto a tre piazze (1960),Italy,è solo questione di punti di vista,2012.0,Letto a tre piazze,1960.0,lettoatrepiazze,tt0054023,movie,Letto a tre piazze,Letto a tre piazze,0,1960.0,Comedy,lettoatrepiazze,lettoatrepiazze,0.0
358637,è solo questione di tempo (2013),Italy,Il successo (1963),Italy,è solo questione di tempo,2013.0,Il successo,1963.0,ilsuccesso,tt0057540,movie,Il successo,Il successo,0,1963.0,Comedy,ilsuccesso,ilsuccesso,0.0
358638,è solo questione di tempo (2013),Italy,Il successo (1963),Italy,è solo questione di tempo,2013.0,Il successo,1963.0,ilsuccesso,tt1248945,tvMovie,Il successo,Il successo,0,1963.0,,ilsuccesso,ilsuccesso,0.0
358639,è solo questione di tempo (2013),Italy,Il successo (1963),France,è solo questione di tempo,2013.0,Il successo,1963.0,ilsuccesso,tt0057540,movie,Il successo,Il successo,0,1963.0,Comedy,ilsuccesso,ilsuccesso,0.0


In [19]:
#create final df with matches
final_citations = pd.concat([matches,matches_primaryTitle],ignore_index=True)


final_citations['match_primaryTitle'].replace({'both': 'True'},inplace=True)
final_citations['match_primaryTitle'] = final_citations['match_primaryTitle'].cat.add_categories(['False']).fillna('False')
final_citations.drop(columns=['originalTitle_normalized','primaryTitle_normalized','title_citation_normalized'],inplace=True)
final_citations


Unnamed: 0,movie,movie_country,cites,cites_country,title,year,primaryTitle_citation,year_citation,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,genres,year_diff,match_primaryTitle
0,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,Alien,1979.0,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",0.0,False
1,#DevinAuditions (2016),Canada,Alien (1979),USA,#DevinAuditions,2016.0,Alien,1979.0,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",0.0,False
2,.com for Murder (2002),USA,Alien (1979),UK,.com for Murder,2002.0,Alien,1979.0,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",0.0,False
3,.com for Murder (2002),USA,Alien (1979),USA,.com for Murder,2002.0,Alien,1979.0,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",0.0,False
4,30 Minutes or Less (2011),Germany,Alien (1979),UK,30 Minutes or Less,2011.0,Alien,1979.0,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",0.0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
359436,Yin yang lu shi ba zhi Gui shang shen (2003),Hong Kong,Troublesome Night 11 (2001),Hong Kong,Yin yang lu shi ba zhi Gui shang shen,2003.0,Troublesome Night 11,2001.0,tt0297441,movie,Troublesome Night 11,Yam yeung lo 11: Liu gwai lo min,0,2001.0,"Comedy,Horror",0.0,True
359437,Yin yang lu shi jiu zhi Wo dui yan jian dao ye...,Hong Kong,Troublesome Night 11 (2001),Hong Kong,Yin yang lu shi jiu zhi Wo dui yan jian dao ye,2003.0,Troublesome Night 11,2001.0,tt0297441,movie,Troublesome Night 11,Yam yeung lo 11: Liu gwai lo min,0,2001.0,"Comedy,Horror",0.0,True
359438,Yin yang lu shi liu zhi hui dao wu xia shi dai...,Hong Kong,Troublesome Night 11 (2001),Hong Kong,Yin yang lu shi liu zhi hui dao wu xia shi dai,2002.0,Troublesome Night 11,2001.0,tt0297441,movie,Troublesome Night 11,Yam yeung lo 11: Liu gwai lo min,0,2001.0,"Comedy,Horror",0.0,True
359439,Yin yang lu shi qi zhi jian fang you gui (2002),Hong Kong,Troublesome Night 11 (2001),Hong Kong,Yin yang lu shi qi zhi jian fang you gui,2002.0,Troublesome Night 11,2001.0,tt0297441,movie,Troublesome Night 11,Yam yeung lo 11: Liu gwai lo min,0,2001.0,"Comedy,Horror",0.0,True


In [20]:

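final_citations.drop_duplicates(inplace=True)

#shorten the "_citation" suffix, then mark every column as describing the cited movie
final_citations.columns = final_citations.columns.str.replace("_citation", "_cit", regex=False)
final_citations = final_citations.add_suffix('_cit')

#the columns describing the citing movie keep their original names
final_citations.rename(columns={'movie_cit':'movie', 'movie_country_cit':'movie_country','title_cit':'title','year_cit':'year'} ,inplace=True)

#'cites_cit' -> 'cit', 'cites_country_cit' -> 'country_cit'
final_citations.columns = final_citations.columns.str.replace("cites_", "", regex=False)
#collapse the doubled suffix; this leaves two columns named 'primaryTitle_cit'
final_citations.columns = final_citations.columns.str.replace("_cit_cit", "_cit", regex=False)

final_citations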


Unnamed: 0,movie,movie_country,cit,country_cit,title,year,primaryTitle_cit,year_cit,tconst_cit,titleType_cit,primaryTitle_cit.1,originalTitle_cit,isAdult_cit,startYear_cit,genres_cit,year_diff_cit,match_primaryTitle_cit
0,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,Alien,1979.0,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",0.0,False
1,#DevinAuditions (2016),Canada,Alien (1979),USA,#DevinAuditions,2016.0,Alien,1979.0,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",0.0,False
2,.com for Murder (2002),USA,Alien (1979),UK,.com for Murder,2002.0,Alien,1979.0,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",0.0,False
3,.com for Murder (2002),USA,Alien (1979),USA,.com for Murder,2002.0,Alien,1979.0,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",0.0,False
4,30 Minutes or Less (2011),Germany,Alien (1979),UK,30 Minutes or Less,2011.0,Alien,1979.0,tt0078748,movie,Alien,Alien,0,1979.0,"Horror,Sci-Fi",0.0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
359436,Yin yang lu shi ba zhi Gui shang shen (2003),Hong Kong,Troublesome Night 11 (2001),Hong Kong,Yin yang lu shi ba zhi Gui shang shen,2003.0,Troublesome Night 11,2001.0,tt0297441,movie,Troublesome Night 11,Yam yeung lo 11: Liu gwai lo min,0,2001.0,"Comedy,Horror",0.0,True
359437,Yin yang lu shi jiu zhi Wo dui yan jian dao ye...,Hong Kong,Troublesome Night 11 (2001),Hong Kong,Yin yang lu shi jiu zhi Wo dui yan jian dao ye,2003.0,Troublesome Night 11,2001.0,tt0297441,movie,Troublesome Night 11,Yam yeung lo 11: Liu gwai lo min,0,2001.0,"Comedy,Horror",0.0,True
359438,Yin yang lu shi liu zhi hui dao wu xia shi dai...,Hong Kong,Troublesome Night 11 (2001),Hong Kong,Yin yang lu shi liu zhi hui dao wu xia shi dai,2002.0,Troublesome Night 11,2001.0,tt0297441,movie,Troublesome Night 11,Yam yeung lo 11: Liu gwai lo min,0,2001.0,"Comedy,Horror",0.0,True
359439,Yin yang lu shi qi zhi jian fang you gui (2002),Hong Kong,Troublesome Night 11 (2001),Hong Kong,Yin yang lu shi qi zhi jian fang you gui,2002.0,Troublesome Night 11,2001.0,tt0297441,movie,Troublesome Night 11,Yam yeung lo 11: Liu gwai lo min,0,2001.0,"Comedy,Horror",0.0,True


This concludes the preparation of the citation data. The final DataFrame contains the title, year, and country of each citing movie and of the movie it cites, together with the IMDb information for the cited movie (originalTitle, primaryTitle, year, tconst identifier, type, and genres). The last column, match_primaryTitle, indicates whether a row was matched on primaryTitle (True) or on originalTitle (False).
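As a small illustration of how such a flag can be derived, the sketch below concatenates two toy match tables and turns the merge indicator into a boolean column. The frames `by_original` and `by_primary` and their values are hypothetical stand-ins, not the notebook's data, and the boolean dtype is a simplification of the string labels used in the notebook.

```python
import pandas as pd

# Toy stand-ins for the two match tables (hypothetical rows).
by_original = pd.DataFrame({"tconst_cit": ["tt0078748"],
                            "primaryTitle_cit": ["Alien"]})
by_primary = pd.DataFrame({
    "tconst_cit": ["tt0297441"],
    "primaryTitle_cit": ["Troublesome Night 11"],
    # value left behind by merge(..., indicator="match_primaryTitle")
    "match_primaryTitle": ["both"],
})

combined = pd.concat([by_original, by_primary], ignore_index=True)

# First-pass rows have no indicator value (NaN), so eq() yields False for them.
combined["match_primaryTitle"] = combined["match_primaryTitle"].eq("both")

# Number of rows that were matched on primaryTitle.
n_primary_matches = int(combined["match_primaryTitle"].sum())
```

A boolean column makes later filtering (`combined[combined["match_primaryTitle"]]`) straightforward.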
<br>
<br>
We now merge the two datasets to obtain the IMDb information for the *movies* that cite other movies. The process is the same as before: normalize the movie titles, then perform two merges, first on the original title and then on the primary title, keeping only matches whose release years lie within three years of each other.
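The two-pass matching described here can be sketched on toy data as follows. This is a minimal illustration under stated assumptions: `normalize_title` is a hypothetical stand-in for the notebook's `normalize_string`, the `tconst` values and the "Example Film" row are made up, and, unlike the cells below, rows that fail the year filter in the first pass are retried in the second pass.

```python
import re
import unicodedata

import pandas as pd


def normalize_title(text):
    """Hypothetical stand-in for normalize_string: strip accents,
    lowercase, and keep only the characters a-z and 0-9."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_only.lower())


def match_titles(links, imdb, imdb_key, max_year_gap=3):
    """Join on a normalized title, then keep rows whose release
    years differ by at most max_year_gap."""
    merged = links.merge(imdb, left_on="movie_normalized",
                         right_on=imdb_key, how="inner")
    gap = merged["year"] - merged["startYear"]
    return merged[gap.between(-max_year_gap, max_year_gap)]


links = pd.DataFrame({
    "title": ["Blood for Dracula", "¡Maldito bastardo!", "Example Film"],
    "year": [1974, 2008, 2000],
})
imdb = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],  # made-up ids
    "primaryTitle": ["Blood for Dracula", "¡Maldito bastardo!", "Example Film"],
    "originalTitle": ["Sangue per Dracula", "¡Maldito bastardo!", "Example Film"],
    "startYear": [1974, 2008, 2010],  # last row is too far from 2000
})

links["movie_normalized"] = links["title"].map(normalize_title)
imdb["originalTitle_normalized"] = imdb["originalTitle"].map(normalize_title)
imdb["primaryTitle_normalized"] = imdb["primaryTitle"].map(normalize_title)

# Pass 1: match on the original title.
first_pass = match_titles(links, imdb, "originalTitle_normalized")
first_pass = first_pass.assign(match_primaryTitle=False)

# Pass 2: match whatever is still unmatched on the primary title.
links_left = links[~links["title"].isin(first_pass["title"])]
imdb_left = imdb[~imdb["tconst"].isin(first_pass["tconst"])]
second_pass = match_titles(links_left, imdb_left, "primaryTitle_normalized")
second_pass = second_pass.assign(match_primaryTitle=True)

matched = pd.concat([first_pass, second_pass], ignore_index=True)
```

In this toy run, "¡Maldito bastardo!" is matched on its original title, "Blood for Dracula" only on its primary title, and "Example Film" is rejected by the year filter in both passes.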

In [22]:
#MERGE OF MOVIES AND CITATIONS
#normalize movie names

movies_cit_countries['movie_normalized'] = movies_cit_countries['title'].apply(normalize_string)


Unnamed: 0,movie,movie_country,cites,cites_country,title,year,primaryTitle_citation,year_citation,title_citation_normalized,movie_normalized
322,(T)Raumschiff Surprise - Periode 1 (2004),Germany,2001: A Space Odyssey (1968),UK,(T)Raumschiff Surprise - Periode 1,2004,2001: A Space Odyssey,1968,2001aspaceodyssey,traumschiffsurpriseperiode1
323,(T)Raumschiff Surprise - Periode 1 (2004),Germany,2001: A Space Odyssey (1968),USA,(T)Raumschiff Surprise - Periode 1,2004,2001: A Space Odyssey,1968,2001aspaceodyssey,traumschiffsurpriseperiode1
325,(T)Raumschiff Surprise - Periode 1 (2004),Germany,Back to the Future Part III (1990),USA,(T)Raumschiff Surprise - Periode 1,2004,Back to the Future Part III,1990,backtothefuturepartiii,traumschiffsurpriseperiode1
327,(T)Raumschiff Surprise - Periode 1 (2004),Germany,Der Schuh des Manitu (2001),Germany,(T)Raumschiff Surprise - Periode 1,2004,Der Schuh des Manitu,2001,derschuhdesmanitu,traumschiffsurpriseperiode1
374,...Promises to Keep (1974),USA,The Yakuza (1974),USA,...Promises to Keep,1974,The Yakuza,1974,theyakuza,promisestokeep
...,...,...,...,...,...,...,...,...,...,...
332320,xXx: Return of Xander Cage (2017),China,xXx: State of the Union (2005),USA,xXx: Return of Xander Cage,2017,xXx: State of the Union,2005,xxxstateoftheunion,xxxreturnofxandercage
332321,xXx: Return of Xander Cage (2017),Canada,xXx: State of the Union (2005),USA,xXx: Return of Xander Cage,2017,xXx: State of the Union,2005,xxxstateoftheunion,xxxreturnofxandercage
332322,xXx: Return of Xander Cage (2017),USA,xXx: State of the Union (2005),USA,xXx: Return of Xander Cage,2017,xXx: State of the Union,2005,xxxstateoftheunion,xxxreturnofxandercage
332344,¡Maldito bastardo! (2008),Spain,La consulta del Dr. Natalio (2004),Spain,¡Maldito bastardo!,2008,La consulta del Dr. Natalio,2004,laconsultadeldrnatalio,malditobastardo


In [23]:
#merge 
movies_df = pd.merge(movies_cit_countries, title_name_table,
             left_on='movie_normalized',
             right_on='originalTitle_normalized', how='outer', indicator=True)

movies_df

Unnamed: 0,movie,movie_country,cites,cites_country,title,year,primaryTitle_citation,year_citation,title_citation_normalized,movie_normalized,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,genres,originalTitle_normalized,primaryTitle_normalized,_merge
0,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,Alien,1979.0,alien,devinauditions,tt6231178,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",devinauditions,devinauditions,both
1,#DevinAuditions (2016),Canada,Alien (1979),USA,#DevinAuditions,2016.0,Alien,1979.0,alien,devinauditions,tt6231178,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",devinauditions,devinauditions,both
2,#DevinAuditions (2016),Canada,Die Hard (1988),USA,#DevinAuditions,2016.0,Die Hard,1988.0,diehard,devinauditions,tt6231178,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",devinauditions,devinauditions,both
3,#DevinAuditions (2016),Canada,Ferris Bueller's Day Off (1986),USA,#DevinAuditions,2016.0,Ferris Bueller's Day Off,1986.0,ferrisbuellersdayoff,devinauditions,tt6231178,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",devinauditions,devinauditions,both
4,#DevinAuditions (2016),Canada,Fight Club (1999),USA,#DevinAuditions,2016.0,Fight Club,1999.0,fightclub,devinauditions,tt6231178,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",devinauditions,devinauditions,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2457536,,,,,,,,,,,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,0,2013.0,Comedy,dankyavardanka,dankyavardanka,right_only
2457537,,,,,,,,,,,tt9916724,short,Hay Que Ser Paciente,Hay Que Ser Paciente,0,2015.0,"Documentary,Short",hayqueserpaciente,hayqueserpaciente,right_only
2457538,,,,,,,,,,,tt9916730,movie,6 Gunn,6 Gunn,0,2017.0,,6gunn,6gunn,right_only
2457539,,,,,,,,,,,tt9916754,movie,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,0,2013.0,Documentary,chicoalbuquerquerevelacoes,chicoalbuquerquerevelacoes,right_only


In [24]:
matches_movies = movies_df[movies_df['_merge'] == "both"].copy()
matches_movies['year_diff'] = matches_movies['year'] - matches_movies['startYear']
matches_movies = matches_movies.loc[matches_movies['year_diff'].between(-3, 3)]
matches_movies

Unnamed: 0,movie,movie_country,cites,cites_country,title,year,primaryTitle_citation,year_citation,title_citation_normalized,movie_normalized,...,titleType,primaryTitle,originalTitle,isAdult,startYear,genres,originalTitle_normalized,primaryTitle_normalized,_merge,year_diff
0,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,Alien,1979.0,alien,devinauditions,...,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",devinauditions,devinauditions,both,0.0
1,#DevinAuditions (2016),Canada,Alien (1979),USA,#DevinAuditions,2016.0,Alien,1979.0,alien,devinauditions,...,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",devinauditions,devinauditions,both,0.0
2,#DevinAuditions (2016),Canada,Die Hard (1988),USA,#DevinAuditions,2016.0,Die Hard,1988.0,diehard,devinauditions,...,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",devinauditions,devinauditions,both,0.0
3,#DevinAuditions (2016),Canada,Ferris Bueller's Day Off (1986),USA,#DevinAuditions,2016.0,Ferris Bueller's Day Off,1986.0,ferrisbuellersdayoff,devinauditions,...,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",devinauditions,devinauditions,both,0.0
4,#DevinAuditions (2016),Canada,Fight Club (1999),USA,#DevinAuditions,2016.0,Fight Club,1999.0,fightclub,devinauditions,...,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",devinauditions,devinauditions,both,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
882066,è solo questione di punti di vista (2012),Italy,"Un genio, due compari, un pollo (1975)",Italy,è solo questione di punti di vista,2012.0,"Un genio, due compari, un pollo",1975.0,ungenioduecompariunpollo,esoloquestionedipuntidivista,...,movie,è solo questione di punti di vista,è solo questione di punti di vista,0,2012.0,"Action,Adventure,Comedy",esoloquestionedipuntidivista,esoloquestionedipuntidivista,both,0.0
882067,è solo questione di punti di vista (2012),Italy,"Un genio, due compari, un pollo (1975)",France,è solo questione di punti di vista,2012.0,"Un genio, due compari, un pollo",1975.0,ungenioduecompariunpollo,esoloquestionedipuntidivista,...,movie,è solo questione di punti di vista,è solo questione di punti di vista,0,2012.0,"Action,Adventure,Comedy",esoloquestionedipuntidivista,esoloquestionedipuntidivista,both,0.0
882068,è solo questione di punti di vista (2012),Italy,"Un genio, due compari, un pollo (1975)",West Germany,è solo questione di punti di vista,2012.0,"Un genio, due compari, un pollo",1975.0,ungenioduecompariunpollo,esoloquestionedipuntidivista,...,movie,è solo questione di punti di vista,è solo questione di punti di vista,0,2012.0,"Action,Adventure,Comedy",esoloquestionedipuntidivista,esoloquestionedipuntidivista,both,0.0
882069,è solo questione di tempo (2013),Italy,Il successo (1963),Italy,è solo questione di tempo,2013.0,Il successo,1963.0,ilsuccesso,esoloquestioneditempo,...,movie,è solo questione di tempo,è solo questione di tempo,0,2013.0,Comedy,esoloquestioneditempo,esoloquestioneditempo,both,0.0


In [25]:
#rows left unmatched by the originalTitle merge, on each side
imdb_unmatched_1 = movies_df[movies_df['_merge'] == "right_only"].copy()
links_unmatched_1 = movies_df[movies_df['_merge'] == "left_only"].copy()

#second pass: match the remaining movies on the normalized primaryTitle
movies_candidate_primary = pd.merge(links_unmatched_1, imdb_unmatched_1,
             left_on=['movie_normalized'],
             right_on=['primaryTitle_normalized'],how='outer', indicator="match_primaryTitle")

#keep title matches whose release years differ by at most three years
movies_match_primary = movies_candidate_primary[movies_candidate_primary['match_primaryTitle'] == "both"].copy()
movies_match_primary['year_diff_prim'] = movies_match_primary['year_x'] - movies_match_primary['startYear_y']
movies_match_primary = movies_match_primary.loc[movies_match_primary['year_diff_prim'].between(-3, 3)]
movies_match_primary

Unnamed: 0,movie_x,movie_country_x,cites_x,cites_country_x,title_x,year_x,primaryTitle_citation_x,year_citation_x,title_citation_normalized_x,movie_normalized_x,...,primaryTitle_y,originalTitle_y,isAdult_y,startYear_y,genres_y,originalTitle_normalized_y,primaryTitle_normalized_y,_merge_y,match_primaryTitle,year_diff_prim
135,Angels Wash Their Faces (1939),USA,Code of the Streets (1939),USA,Angels Wash Their Faces,1939.0,Code of the Streets,1939.0,codeofthestreets,angelswashtheirfaces,...,Angels Wash Their Faces,The Angels Wash Their Faces,0,1939.0,"Drama,Romance",theangelswashtheirfaces,angelswashtheirfaces,right_only,both,0.0
136,Angels Wash Their Faces (1939),USA,Hell's Kitchen (1939),USA,Angels Wash Their Faces,1939.0,Hell's Kitchen,1939.0,hellskitchen,angelswashtheirfaces,...,Angels Wash Their Faces,The Angels Wash Their Faces,0,1939.0,"Drama,Romance",theangelswashtheirfaces,angelswashtheirfaces,right_only,both,0.0
137,Angels Wash Their Faces (1939),USA,They Made Me a Criminal (1939),USA,Angels Wash Their Faces,1939.0,They Made Me a Criminal,1939.0,theymademeacriminal,angelswashtheirfaces,...,Angels Wash Their Faces,The Angels Wash Their Faces,0,1939.0,"Drama,Romance",theangelswashtheirfaces,angelswashtheirfaces,right_only,both,0.0
138,Angels Wash Their Faces (1939),USA,Angels with Dirty Faces (1938),USA,Angels Wash Their Faces,1939.0,Angels with Dirty Faces,1938.0,angelswithdirtyfaces,angelswashtheirfaces,...,Angels Wash Their Faces,The Angels Wash Their Faces,0,1939.0,"Drama,Romance",theangelswashtheirfaces,angelswashtheirfaces,right_only,both,0.0
322,Blood for Dracula (1974),Italy,Flesh for Frankenstein (1973),USA,Blood for Dracula,1974.0,Flesh for Frankenstein,1973.0,fleshforfrankenstein,bloodfordracula,...,Blood for Dracula,Sangue per Dracula,0,1974.0,Horror,sangueperdracula,bloodfordracula,right_only,both,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4696,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),Canada,X-Men: Dark Phoenix,2018.0,X-Men: The Last Stand,2006.0,xmenthelaststand,xmendarkphoenix,...,X-Men: Dark Phoenix,Dark Phoenix,0,2019.0,"Action,Adventure,Sci-Fi",darkphoenix,xmendarkphoenix,right_only,both,-1.0
4697,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),USA,X-Men: Dark Phoenix,2018.0,X-Men: The Last Stand,2006.0,xmenthelaststand,xmendarkphoenix,...,X-Men: Dark Phoenix,Dark Phoenix,0,2019.0,"Action,Adventure,Sci-Fi",darkphoenix,xmendarkphoenix,right_only,both,-1.0
4698,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),UK,X-Men: Dark Phoenix,2018.0,X-Men: The Last Stand,2006.0,xmenthelaststand,xmendarkphoenix,...,X-Men: Dark Phoenix,Dark Phoenix,0,2019.0,"Action,Adventure,Sci-Fi",darkphoenix,xmendarkphoenix,right_only,both,-1.0
4699,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),USA,X-Men: Dark Phoenix,2018.0,X: First Class,2011.0,xfirstclass,xmendarkphoenix,...,X-Men: Dark Phoenix,Dark Phoenix,0,2019.0,"Action,Adventure,Sci-Fi",darkphoenix,xmendarkphoenix,right_only,both,-1.0


In [26]:
movies_candidate_primary[movies_candidate_primary['match_primaryTitle']=='left_only']


Unnamed: 0,movie_x,movie_country_x,cites_x,cites_country_x,title_x,year_x,primaryTitle_citation_x,year_citation_x,title_citation_normalized_x,movie_normalized_x,...,titleType_y,primaryTitle_y,originalTitle_y,isAdult_y,startYear_y,genres_y,originalTitle_normalized_y,primaryTitle_normalized_y,_merge_y,match_primaryTitle
0,...Be yom hashlishi (2010),Israel,Ai no korîda (1976),Japan,...Be yom hashlishi,2010.0,Ai no korîda,1976.0,ainokorida,beyomhashlishi,...,,,,,,,,,,left_only
1,...Be yom hashlishi (2010),Israel,Ai no korîda (1976),France,...Be yom hashlishi,2010.0,Ai no korîda,1976.0,ainokorida,beyomhashlishi,...,,,,,,,,,,left_only
2,100 Tears (2007),USA,Unearthed (2004),USA,100 Tears,2007.0,Unearthed,2004.0,unearthed,100tears,...,,,,,,,,,,left_only
3,100 lat w kinie (1995),Poland,Amator (1979),Poland,100 lat w kinie,1995.0,Amator,1979.0,amator,100latwkinie,...,,,,,,,,,,left_only
4,100 lat w kinie (1995),UK,Amator (1979),Poland,100 lat w kinie,1995.0,Amator,1979.0,amator,100latwkinie,...,,,,,,,,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4963,Zui jia pai dang (1982),Hong Kong,Goldfinger (1964),UK,Zui jia pai dang,1982.0,Goldfinger,1964.0,goldfinger,zuijiapaidang,...,,,,,,,,,,left_only
4964,Zui jia pai dang (1982),Hong Kong,The Godfather (1972),USA,Zui jia pai dang,1982.0,The Godfather,1972.0,thegodfather,zuijiapaidang,...,,,,,,,,,,left_only
4965,Zui jia pai dang (1982),Hong Kong,The Pink Panther (1963),USA,Zui jia pai dang,1982.0,The Pink Panther,1963.0,thepinkpanther,zuijiapaidang,...,,,,,,,,,,left_only
4966,Zui jia pai dang: Zui jie pai dang (1997),Hong Kong,Zui jia pai dang (1982),Hong Kong,Zui jia pai dang: Zui jie pai dang,1997.0,Zui jia pai dang,1982.0,zuijiapaidang,zuijiapaidangzuijiepaidang,...,,,,,,,,,,left_only


In [27]:
#drop columns that are entirely empty
matches_movies = matches_movies.dropna(axis=1, how='all')
movies_match_primary = movies_match_primary.dropna(axis=1, how='all')

#strip the merge suffixes; anchoring at the end of the name avoids touching "_x"/"_y" elsewhere
movies_match_primary.columns = movies_match_primary.columns.str.replace(r"_[xy]$", "", regex=True)

matches_movies = matches_movies.reset_index(drop=True)
movies_match_primary = movies_match_primary.reset_index(drop=True)
#'_merge' stays in matches_movies for now; it is dropped from final_movies in the next step
movies_match_primary.drop(columns=['_merge'], inplace=True)
movies_match_primary.rename(columns={'year_diff_prim' : 'year_diff'}, inplace= True)

movies_match_primary


Unnamed: 0,movie,movie_country,cites,cites_country,title,year,primaryTitle_citation,year_citation,title_citation_normalized,movie_normalized,...,titleType,primaryTitle,originalTitle,isAdult,startYear,genres,originalTitle_normalized,primaryTitle_normalized,match_primaryTitle,year_diff
0,Angels Wash Their Faces (1939),USA,Code of the Streets (1939),USA,Angels Wash Their Faces,1939.0,Code of the Streets,1939.0,codeofthestreets,angelswashtheirfaces,...,movie,Angels Wash Their Faces,The Angels Wash Their Faces,0,1939.0,"Drama,Romance",theangelswashtheirfaces,angelswashtheirfaces,both,0.0
1,Angels Wash Their Faces (1939),USA,Hell's Kitchen (1939),USA,Angels Wash Their Faces,1939.0,Hell's Kitchen,1939.0,hellskitchen,angelswashtheirfaces,...,movie,Angels Wash Their Faces,The Angels Wash Their Faces,0,1939.0,"Drama,Romance",theangelswashtheirfaces,angelswashtheirfaces,both,0.0
2,Angels Wash Their Faces (1939),USA,They Made Me a Criminal (1939),USA,Angels Wash Their Faces,1939.0,They Made Me a Criminal,1939.0,theymademeacriminal,angelswashtheirfaces,...,movie,Angels Wash Their Faces,The Angels Wash Their Faces,0,1939.0,"Drama,Romance",theangelswashtheirfaces,angelswashtheirfaces,both,0.0
3,Angels Wash Their Faces (1939),USA,Angels with Dirty Faces (1938),USA,Angels Wash Their Faces,1939.0,Angels with Dirty Faces,1938.0,angelswithdirtyfaces,angelswashtheirfaces,...,movie,Angels Wash Their Faces,The Angels Wash Their Faces,0,1939.0,"Drama,Romance",theangelswashtheirfaces,angelswashtheirfaces,both,0.0
4,Blood for Dracula (1974),Italy,Flesh for Frankenstein (1973),USA,Blood for Dracula,1974.0,Flesh for Frankenstein,1973.0,fleshforfrankenstein,bloodfordracula,...,movie,Blood for Dracula,Sangue per Dracula,0,1974.0,Horror,sangueperdracula,bloodfordracula,both,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
400,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),Canada,X-Men: Dark Phoenix,2018.0,X-Men: The Last Stand,2006.0,xmenthelaststand,xmendarkphoenix,...,movie,X-Men: Dark Phoenix,Dark Phoenix,0,2019.0,"Action,Adventure,Sci-Fi",darkphoenix,xmendarkphoenix,both,-1.0
401,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),USA,X-Men: Dark Phoenix,2018.0,X-Men: The Last Stand,2006.0,xmenthelaststand,xmendarkphoenix,...,movie,X-Men: Dark Phoenix,Dark Phoenix,0,2019.0,"Action,Adventure,Sci-Fi",darkphoenix,xmendarkphoenix,both,-1.0
402,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),UK,X-Men: Dark Phoenix,2018.0,X-Men: The Last Stand,2006.0,xmenthelaststand,xmendarkphoenix,...,movie,X-Men: Dark Phoenix,Dark Phoenix,0,2019.0,"Action,Adventure,Sci-Fi",darkphoenix,xmendarkphoenix,both,-1.0
403,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),USA,X-Men: Dark Phoenix,2018.0,X: First Class,2011.0,xfirstclass,xmendarkphoenix,...,movie,X-Men: Dark Phoenix,Dark Phoenix,0,2019.0,"Action,Adventure,Sci-Fi",darkphoenix,xmendarkphoenix,both,-1.0


In [28]:
#get final data for movies
final_movies = pd.concat([matches_movies,movies_match_primary],ignore_index=True)
#flag rows matched on primaryTitle ('both'); first-pass rows have NaN here and become 'False'
final_movies['match_primaryTitle'] = (
    final_movies['match_primaryTitle'].astype(object).eq('both').map({True: 'True', False: 'False'})
)
final_movies.columns = final_movies.columns.str.replace("_citation", "_cit")
final_movies.drop(columns=['title_cit_normalized','movie_normalized','originalTitle_normalized','primaryTitle_normalized','_merge'], inplace=True)
final_movies

Unnamed: 0,movie,movie_country,cites,cites_country,title,year,primaryTitle_cit,year_cit,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,genres,year_diff,match_primaryTitle
0,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,Alien,1979.0,tt6231178,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",0.0,False
1,#DevinAuditions (2016),Canada,Alien (1979),USA,#DevinAuditions,2016.0,Alien,1979.0,tt6231178,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",0.0,False
2,#DevinAuditions (2016),Canada,Die Hard (1988),USA,#DevinAuditions,2016.0,Die Hard,1988.0,tt6231178,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",0.0,False
3,#DevinAuditions (2016),Canada,Ferris Bueller's Day Off (1986),USA,#DevinAuditions,2016.0,Ferris Bueller's Day Off,1986.0,tt6231178,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",0.0,False
4,#DevinAuditions (2016),Canada,Fight Club (1999),USA,#DevinAuditions,2016.0,Fight Club,1999.0,tt6231178,short,#DevinAuditions,#DevinAuditions,0,2016.0,"Comedy,Short",0.0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
372573,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),Canada,X-Men: Dark Phoenix,2018.0,X-Men: The Last Stand,2006.0,tt6565702,movie,X-Men: Dark Phoenix,Dark Phoenix,0,2019.0,"Action,Adventure,Sci-Fi",-1.0,True
372574,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),USA,X-Men: Dark Phoenix,2018.0,X-Men: The Last Stand,2006.0,tt6565702,movie,X-Men: Dark Phoenix,Dark Phoenix,0,2019.0,"Action,Adventure,Sci-Fi",-1.0,True
372575,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),UK,X-Men: Dark Phoenix,2018.0,X-Men: The Last Stand,2006.0,tt6565702,movie,X-Men: Dark Phoenix,Dark Phoenix,0,2019.0,"Action,Adventure,Sci-Fi",-1.0,True
372576,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),USA,X-Men: Dark Phoenix,2018.0,X: First Class,2011.0,tt6565702,movie,X-Men: Dark Phoenix,Dark Phoenix,0,2019.0,"Action,Adventure,Sci-Fi",-1.0,True



In [30]:
final_movies = final_movies.drop_duplicates()
final_citations = final_citations.drop_duplicates()

final_movies = final_movies.dropna(axis=1, how='all')

final_citations = final_citations.dropna(axis=1, how='all')






In [60]:
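#remove unneeded columns
#'primaryTitle_cit' occurs twice in final_citations, so this selection returns both copies
citations_data = final_citations[['movie', 'movie_country','cit','country_cit','primaryTitle_cit','year_cit','tconst_cit','titleType_cit','genres_cit']]

#keep only the first occurrence of each duplicated column name
citations_data = citations_data.loc[:, ~citations_data.columns.duplicated()]
citations_data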


movies_data  = final_movies[['movie', 'movie_country','cites','cites_country','title','year','tconst','titleType','genres']]

movies_data



Unnamed: 0,movie,movie_country,cit,country_cit,primaryTitle_cit,year_cit,tconst_cit,titleType_cit,genres_cit
0,#DevinAuditions (2016),Canada,Alien (1979),UK,Alien,1979.0,tt0078748,movie,"Horror,Sci-Fi"
1,#DevinAuditions (2016),Canada,Alien (1979),USA,Alien,1979.0,tt0078748,movie,"Horror,Sci-Fi"
2,.com for Murder (2002),USA,Alien (1979),UK,Alien,1979.0,tt0078748,movie,"Horror,Sci-Fi"
3,.com for Murder (2002),USA,Alien (1979),USA,Alien,1979.0,tt0078748,movie,"Horror,Sci-Fi"
4,30 Minutes or Less (2011),Germany,Alien (1979),UK,Alien,1979.0,tt0078748,movie,"Horror,Sci-Fi"
...,...,...,...,...,...,...,...,...,...
359436,Yin yang lu shi ba zhi Gui shang shen (2003),Hong Kong,Troublesome Night 11 (2001),Hong Kong,Troublesome Night 11,2001.0,tt0297441,movie,"Comedy,Horror"
359437,Yin yang lu shi jiu zhi Wo dui yan jian dao ye...,Hong Kong,Troublesome Night 11 (2001),Hong Kong,Troublesome Night 11,2001.0,tt0297441,movie,"Comedy,Horror"
359438,Yin yang lu shi liu zhi hui dao wu xia shi dai...,Hong Kong,Troublesome Night 11 (2001),Hong Kong,Troublesome Night 11,2001.0,tt0297441,movie,"Comedy,Horror"
359439,Yin yang lu shi qi zhi jian fang you gui (2002),Hong Kong,Troublesome Night 11 (2001),Hong Kong,Troublesome Night 11,2001.0,tt0297441,movie,"Comedy,Horror"


As a last step, we combine the two DataFrames created above into the final data for our graphs. Each row holds the relevant information for a citing movie and for the movie it cites: title, country, year, genre, type, and IMDb identifier.
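A merge on these four key columns multiplies rows whenever a key combination occurs more than once on both sides, for example when one title matched several IMDb entries within the year window. The sketch below shows one way to check for that before merging. The toy frames and `tconst` values are hypothetical, and `validate="one_to_one"` is only one possible safeguard, not a step the notebook performs.

```python
import pandas as pd

movie_keys = ["movie", "movie_country", "cites", "cites_country"]
cit_keys = ["movie", "movie_country", "cit", "country_cit"]

# Hypothetical case: the same citation link matched two IMDb entries on each side.
movies = pd.DataFrame({
    "movie": ["A (2000)", "A (2000)"],
    "movie_country": ["USA", "USA"],
    "cites": ["B (1990)", "B (1990)"],
    "cites_country": ["UK", "UK"],
    "tconst": ["tt0000001", "tt0000009"],
})
citations = pd.DataFrame({
    "movie": ["A (2000)", "A (2000)"],
    "movie_country": ["USA", "USA"],
    "cit": ["B (1990)", "B (1990)"],
    "country_cit": ["UK", "UK"],
    "tconst_cit": ["tt0000002", "tt0000003"],
})

# Count repeated key combinations on each side.
dup_movies = int(movies.duplicated(subset=movie_keys).sum())
dup_citations = int(citations.duplicated(subset=cit_keys).sum())

# Two rows on each side with the same key produce 2 x 2 = 4 merged rows.
merged = pd.merge(movies, citations, left_on=movie_keys, right_on=cit_keys)

# Asking pandas to validate the relationship raises MergeError on repeated keys.
try:
    pd.merge(movies, citations, left_on=movie_keys, right_on=cit_keys,
             validate="one_to_one")
    keys_are_unique = True
except pd.errors.MergeError:
    keys_are_unique = False
```

If the duplicate counts are non-zero, comparing `len(merged)` with the sizes of the two inputs shows how much the merge has inflated the table.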

In [63]:

graph_df = pd.merge(movies_data, citations_data, left_on=['movie','movie_country','cites','cites_country'],
                    right_on = ['movie','movie_country','cit','country_cit'])

graph_df

Unnamed: 0,movie,movie_country,cites,cites_country,title,year,tconst,titleType,genres,cit,country_cit,primaryTitle_cit,year_cit,tconst_cit,titleType_cit,genres_cit
0,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),UK,Alien,1979.0,tt0078748,movie,"Horror,Sci-Fi"
1,#DevinAuditions (2016),Canada,Alien (1979),USA,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),USA,Alien,1979.0,tt0078748,movie,"Horror,Sci-Fi"
2,#DevinAuditions (2016),Canada,Die Hard (1988),USA,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Die Hard (1988),USA,Die Hard,1988.0,tt0095016,movie,"Action,Thriller"
3,#DevinAuditions (2016),Canada,Ferris Bueller's Day Off (1986),USA,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Ferris Bueller's Day Off (1986),USA,Ferris Bueller's Day Off,1986.0,tt0091042,movie,Comedy
4,#DevinAuditions (2016),Canada,Fight Club (1999),USA,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Fight Club (1999),USA,Fight Club,1999.0,tt0137523,movie,Drama
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
403277,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),Canada,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X-Men: The Last Stand (2006),Canada,X-Men: The Last Stand,2006.0,tt0376994,movie,"Action,Adventure,Sci-Fi"
403278,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),USA,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X-Men: The Last Stand (2006),USA,X-Men: The Last Stand,2006.0,tt0376994,movie,"Action,Adventure,Sci-Fi"
403279,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X-Men: The Last Stand (2006),UK,X-Men: The Last Stand,2006.0,tt0376994,movie,"Action,Adventure,Sci-Fi"
403280,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),USA,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),USA,X: First Class,2011.0,tt1270798,movie,"Action,Sci-Fi"
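
The merged frame can end up with more rows than either input when the same key combination appears more than once on both sides, because every left match is paired with every right match. The sketch below illustrates this on toy data; the frames `movies` and `citations` and their values are assumptions made for illustration, not the notebook's actual data.

```python
import pandas as pd

# Toy stand-ins for movies_data and citations_data (values are illustrative).
movies = pd.DataFrame({
    "movie": ["A (2000)", "A (2000)"],
    "movie_country": ["USA", "USA"],
    "cites": ["B (1990)", "B (1990)"],
    "cites_country": ["UK", "UK"],
    "tconst": ["tt0000001", "tt0000001"],
})
citations = pd.DataFrame({
    "movie": ["A (2000)", "A (2000)"],
    "movie_country": ["USA", "USA"],
    "cit": ["B (1990)", "B (1990)"],
    "country_cit": ["UK", "UK"],
    "tconst_cit": ["tt0000002", "tt0000002"],
})

left_keys = ["movie", "movie_country", "cites", "cites_country"]
right_keys = ["movie", "movie_country", "cit", "country_cit"]

# Each repeated key on the left pairs with each repeated key on the right.
merged = pd.merge(movies, citations, left_on=left_keys, right_on=right_keys)

# Dropping exact duplicates first keeps one row per citation edge.
deduped = pd.merge(
    movies.drop_duplicates(),
    citations.drop_duplicates(),
    left_on=left_keys,
    right_on=right_keys,
    validate="one_to_one",  # raises MergeError if a key still repeats
)
print(len(merged), len(deduped))
```

Whether de-duplicating before the merge is appropriate here depends on how the two input frames were built earlier in the notebook.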


In [64]:
graph_df['movie'].unique()

<StringArray>
[                                     '#DevinAuditions (2016)',
                                        '#FromJennifer (2017)',
                                                 '#Rip (2013)',
                                               'R.I.P. (2001)',
                                               'R.I.P. (2013)',
                                         '#TubeClash02 (2016)',
 '#chicagoGirl: The Social Network Takes on a Dictator (2013)',
                                        '#twitterkills (2014)',
                                                '#will (2016)',
                                            '$10 Raise (1935)',
 ...
                                   'Tea with Mussolini (1999)',
                    "The Crime Doctor's Strangest Case (1943)",
                         'The Killer Bean 2: The Party (2000)',
                            'The Prince and the Dybbuk (2017)',
                             'The Young Black Stallion (2003)',
                     

In [65]:
graph_df.to_csv('graph_1_movies_cit.csv')

To retrieve the film agents, we use the tconst obtained for each film and for each of its citations to link the data with the people credited as directors, writers, or producers in the IMDb dataset.

In [67]:
#get directors, writers, producers of films

movies_data


Unnamed: 0,movie,movie_country,cites,cites_country,title,year,tconst,titleType,genres
0,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short"
1,#DevinAuditions (2016),Canada,Alien (1979),USA,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short"
2,#DevinAuditions (2016),Canada,Die Hard (1988),USA,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short"
3,#DevinAuditions (2016),Canada,Ferris Bueller's Day Off (1986),USA,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short"
4,#DevinAuditions (2016),Canada,Fight Club (1999),USA,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short"
...,...,...,...,...,...,...,...,...,...
372573,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),Canada,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi"
372574,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),USA,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi"
372575,X-Men: Dark Phoenix (2018),USA,X-Men: The Last Stand (2006),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi"
372576,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),USA,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi"


In [95]:
# Isolate the positions we are interested in, filtering out actors and other crew members.
crew_table = title_table[title_table['category'].isin(['director', 'producer', 'writer'])]

# Left merge so that every film is kept, even when it has no matching crew rows.
crew_data = pd.merge(graph_df, crew_table, on='tconst', how='left')
crew_data

Unnamed: 0,movie,movie_country,cites,cites_country,title,year,tconst,titleType,genres,cit,...,primaryTitle_cit,year_cit,tconst_cit,titleType_cit,genres_cit,ordering,nconst,category,job,characters
0,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,Alien,1979.0,tt0078748,movie,"Horror,Sci-Fi",,,,,
1,#DevinAuditions (2016),Canada,Alien (1979),USA,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,Alien,1979.0,tt0078748,movie,"Horror,Sci-Fi",,,,,
2,#DevinAuditions (2016),Canada,Die Hard (1988),USA,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Die Hard (1988),...,Die Hard,1988.0,tt0095016,movie,"Action,Thriller",,,,,
3,#DevinAuditions (2016),Canada,Ferris Bueller's Day Off (1986),USA,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Ferris Bueller's Day Off (1986),...,Ferris Bueller's Day Off,1986.0,tt0091042,movie,Comedy,,,,,
4,#DevinAuditions (2016),Canada,Fight Club (1999),USA,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Fight Club (1999),...,Fight Club,1999.0,tt0137523,movie,Drama,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192906,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),USA,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,X: First Class,2011.0,tt1270798,movie,"Action,Sci-Fi",8.0,nm0795682,producer,producer,\N
1192907,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,X: First Class,2011.0,tt1270798,movie,"Action,Sci-Fi",5.0,nm1334526,director,\N,\N
1192908,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,X: First Class,2011.0,tt1270798,movie,"Action,Sci-Fi",6.0,nm0356745,producer,producer,\N
1192909,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,X: First Class,2011.0,tt1270798,movie,"Action,Sci-Fi",7.0,nm0404446,producer,producer,\N


In [94]:
len(crew_data[crew_data['tconst'].isna()])

0

In [92]:
# rows for which no director, writer, or producer was matched
len(crew_data[crew_data['nconst'].isna()])

22558

The missing values arise because some films have no director, writer, or producer listed in the IMDb data, so the left merge leaves the crew columns empty for them.
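
One way to confirm which films are affected is the `indicator` option of `pd.merge`, which labels each row by where it found a match. The following is a minimal sketch on toy data; the frames `films` and `crew` and their identifiers are illustrative assumptions standing in for `graph_df` and `crew_table`.

```python
import pandas as pd

# Toy stand-ins for graph_df and crew_table (identifiers are illustrative).
films = pd.DataFrame({"tconst": ["tt0000001", "tt0000002", "tt0000003"]})
crew = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001", "tt0000003"],
    "nconst": ["nm0000001", "nm0000002", "nm0000003"],
    "category": ["director", "writer", "producer"],
})

# indicator=True adds a _merge column: 'both' or 'left_only' in a left merge.
checked = pd.merge(films, crew, on="tconst", how="left", indicator=True)

# Titles that have no director, writer, or producer row at all.
unmatched = checked.loc[checked["_merge"] == "left_only", "tconst"].unique()
print(list(unmatched))
```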

In [96]:
crew_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1192911 entries, 0 to 1192910
Data columns (total 21 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   movie             1192911 non-null  string 
 1   movie_country     1191487 non-null  string 
 2   cites             1192911 non-null  string 
 3   cites_country     1192663 non-null  string 
 4   title             1192911 non-null  string 
 5   year              1192911 non-null  float64
 6   tconst            1192911 non-null  object 
 7   titleType         1192911 non-null  string 
 8   genres            1185232 non-null  string 
 9   cit               1192911 non-null  string 
 10  country_cit       1192663 non-null  string 
 11  primaryTitle_cit  1192911 non-null  string 
 12  year_cit          1192911 non-null  float64
 13  tconst_cit        1192911 non-null  string 
 14  titleType_cit     1192911 non-null  string 
 15  genres_cit        1180353 non-null  string 
 16  

In [90]:
title_table[title_table['tconst'] == 'tt6231178']
# The data lists actors for this title, but none of the positions we are interested in.

Unnamed: 0,tconst,ordering,nconst,category,job,characters
44884253,tt6231178,1,nm2100606,actress,\N,"[""Melody Casting""]"
44884254,tt6231178,2,nm2153041,actor,\N,"[""Devin""]"


Perform a second merge to get the same crew information for the cited films as well.

In [98]:
crew_df = pd.merge(crew_data,crew_table,left_on='tconst_cit', right_on='tconst', how='left')
crew_df

Unnamed: 0,movie,movie_country,cites,cites_country,title,year,tconst_x,titleType,genres,cit,...,nconst_x,category_x,job_x,characters_x,tconst_y,ordering_y,nconst_y,category_y,job_y,characters_y
0,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,,,,,tt0078748,10.0,nm0001353,producer,producer,\N
1,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,,,,,tt0078748,5.0,nm0000631,director,\N,\N
2,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,,,,,tt0078748,6.0,nm0639321,writer,screenplay by,\N
3,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,,,,,tt0078748,7.0,nm0795953,writer,story by,\N
4,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,,,,,tt0078748,8.0,nm0140826,producer,producer,\N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4330563,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,nm0795682,producer,producer,\N,tt1270798,5.0,nm0891216,director,\N,\N
4330564,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,nm0795682,producer,producer,\N,tt1270798,6.0,nm1005420,writer,screenplay by,\N
4330565,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,nm0795682,producer,producer,\N,tt1270798,7.0,nm0826714,writer,screenplay by,\N
4330566,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,nm0795682,producer,producer,\N,tt1270798,8.0,nm0963359,writer,screenplay by,\N


In [100]:
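# Rename the merge suffixes to show which columns describe the movie ('_movie') and which the citation ('_cit').
# Note: tconst_y becomes a second 'tconst_cit' column that duplicates the existing one.
crew_df.columns = crew_df.columns.str.replace(r"_y$", "_cit", regex=True)
crew_df.columns = crew_df.columns.str.replace(r"_x$", "_movie", regex=True)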

crew_df

Unnamed: 0,movie,movie_country,cites,cites_country,title,year,tconst_movie,titleType,genres,cit,...,nconst_movie,category_movie,job_movie,characters_movie,tconst_cit,ordering_cit,nconst_cit,category_cit,job_cit,characters_cit
0,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,,,,,tt0078748,10.0,nm0001353,producer,producer,\N
1,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,,,,,tt0078748,5.0,nm0000631,director,\N,\N
2,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,,,,,tt0078748,6.0,nm0639321,writer,screenplay by,\N
3,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,,,,,tt0078748,7.0,nm0795953,writer,story by,\N
4,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,,,,,tt0078748,8.0,nm0140826,producer,producer,\N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4330563,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,nm0795682,producer,producer,\N,tt1270798,5.0,nm0891216,director,\N,\N
4330564,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,nm0795682,producer,producer,\N,tt1270798,6.0,nm1005420,writer,screenplay by,\N
4330565,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,nm0795682,producer,producer,\N,tt1270798,7.0,nm0826714,writer,screenplay by,\N
4330566,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,nm0795682,producer,producer,\N,tt1270798,8.0,nm0963359,writer,screenplay by,\N
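
As an alternative to renaming after the fact, the suffixes can be set in the merge itself, and the right-hand key can be renamed beforehand so that no duplicate `tconst_cit` column is created. This is only a sketch on toy data, not the notebook's method; `edges`, `cited_crew`, and all values are hypothetical, and in this variant the citing film's key keeps the name `tconst` rather than `tconst_movie`.

```python
import pandas as pd

# Toy stand-ins for graph_df and crew_table (values are illustrative).
crew_table = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000002"],
    "nconst": ["nm0000001", "nm0000002", "nm0000003"],
    "category": ["director", "director", "writer"],
})
edges = pd.DataFrame({"tconst": ["tt0000001"], "tconst_cit": ["tt0000002"]})

# First merge: crew of the citing film.
crew_data = pd.merge(edges, crew_table, on="tconst", how="left")

# Second merge: crew of the cited film. Renaming the key lets both frames
# share 'tconst_cit', so the result holds that column only once.
cited_crew = crew_table.rename(columns={"tconst": "tconst_cit"})
crew_df = pd.merge(
    crew_data,
    cited_crew,
    on="tconst_cit",
    how="left",
    suffixes=("_movie", "_cit"),  # applied to the overlapping crew columns
)
print(list(crew_df.columns))
```

Each crew member of the citing film is paired with each crew member of the cited film, which is why the row count grows so sharply after the second merge.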


In [101]:
crew_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4330568 entries, 0 to 4330567
Data columns (total 27 columns):
 #   Column            Dtype  
---  ------            -----  
 0   movie             string 
 1   movie_country     string 
 2   cites             string 
 3   cites_country     string 
 4   title             string 
 5   year              float64
 6   tconst_movie      object 
 7   titleType         string 
 8   genres            string 
 9   cit               string 
 10  country_cit       string 
 11  primaryTitle_cit  string 
 12  year_cit          float64
 13  tconst_cit        object 
 14  titleType_cit     string 
 15  genres_cit        string 
 16  ordering_movie    float64
 17  nconst_movie      object 
 18  category_movie    object 
 19  job_movie         object 
 20  characters_movie  object 
 21  tconst_cit        object 
 22  ordering_cit      float64
 23  nconst_cit        object 
 24  category_cit      object 
 25  job_cit           object 
 26  characters_cit

In [103]:
# remove unnecessary columns
crew_df.drop(columns=['ordering_movie','ordering_cit','characters_movie', 'characters_cit'], inplace=True)
crew_df

Unnamed: 0,movie,movie_country,cites,cites_country,title,year,tconst_movie,titleType,genres,cit,...,tconst_cit,titleType_cit,genres_cit,nconst_movie,category_movie,job_movie,tconst_cit.1,nconst_cit,category_cit,job_cit
0,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,tt0078748,movie,"Horror,Sci-Fi",,,,tt0078748,nm0001353,producer,producer
1,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,tt0078748,movie,"Horror,Sci-Fi",,,,tt0078748,nm0000631,director,\N
2,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,tt0078748,movie,"Horror,Sci-Fi",,,,tt0078748,nm0639321,writer,screenplay by
3,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,tt0078748,movie,"Horror,Sci-Fi",,,,tt0078748,nm0795953,writer,story by
4,#DevinAuditions (2016),Canada,Alien (1979),UK,#DevinAuditions,2016.0,tt6231178,short,"Comedy,Short",Alien (1979),...,tt0078748,movie,"Horror,Sci-Fi",,,,tt0078748,nm0140826,producer,producer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4330563,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,tt1270798,movie,"Action,Sci-Fi",nm0795682,producer,producer,tt1270798,nm0891216,director,\N
4330564,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,tt1270798,movie,"Action,Sci-Fi",nm0795682,producer,producer,tt1270798,nm1005420,writer,screenplay by
4330565,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,tt1270798,movie,"Action,Sci-Fi",nm0795682,producer,producer,tt1270798,nm0826714,writer,screenplay by
4330566,X-Men: Dark Phoenix (2018),USA,X: First Class (2011),UK,X-Men: Dark Phoenix,2018.0,tt6565702,movie,"Action,Adventure,Sci-Fi",X: First Class (2011),...,tt1270798,movie,"Action,Sci-Fi",nm0795682,producer,producer,tt1270798,nm0963359,writer,screenplay by


In [104]:
crew_df.to_csv('crew_data.csv')

In [106]:
crew_df[crew_df.duplicated()]


Unnamed: 0,movie,movie_country,cites,cites_country,title,year,tconst_movie,titleType,genres,cit,...,tconst_cit,titleType_cit,genres_cit,nconst_movie,category_movie,job_movie,tconst_cit.1,nconst_cit,category_cit,job_cit


In [107]:
gephi_1 = graph_df.rename(columns={'movie':'Source','cites':'Target'})

gephi_1.to_csv('gephi_g1.csv')

In [111]:
g_countries= graph_df.rename(columns={'movie_country':'Source','cites_country':'Target'})
g_countries.info()
g_countries.to_csv('gephi_countries.csv')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 403282 entries, 0 to 403281
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   movie             403282 non-null  string 
 1   Source            402808 non-null  string 
 2   cites             403282 non-null  string 
 3   Target            403167 non-null  string 
 4   title             403282 non-null  string 
 5   year              403282 non-null  float64
 6   tconst            403282 non-null  string 
 7   titleType         403282 non-null  string 
 8   genres            400232 non-null  string 
 9   cit               403282 non-null  string 
 10  country_cit       403167 non-null  string 
 11  primaryTitle_cit  403282 non-null  string 
 12  year_cit          403282 non-null  float64
 13  tconst_cit        403282 non-null  string 
 14  titleType_cit     403282 non-null  string 
 15  genres_cit        399049 non-null  string 
dtypes: float64(2), strin
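
The `info()` output above shows that some `Source` and `Target` values are null, and the country edge list repeats the same pair once per citation. A possible refinement, sketched below on toy data, drops incomplete edges and collapses repeated pairs into a single weighted edge. The frame `graph` and its values are illustrative assumptions, and the `Weight` column name is assumed to be the one Gephi's spreadsheet import treats as the edge weight.

```python
import pandas as pd

# Toy stand-in for graph_df (values are illustrative).
graph = pd.DataFrame({
    "movie_country": ["Canada", "Canada", "USA", None],
    "cites_country": ["UK", "UK", "USA", "USA"],
})

edges = (
    graph.rename(columns={"movie_country": "Source", "cites_country": "Target"})
    .dropna(subset=["Source", "Target"])  # an edge needs both endpoints
    .groupby(["Source", "Target"])
    .size()
    .reset_index(name="Weight")           # number of citations per country pair
)

# index=False keeps the pandas row index out of the export; passing a file
# path as the first argument would write the CSV to disk instead.
csv_text = edges.to_csv(index=False)
print(csv_text)
```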

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0050419,tt0045537,tt0072308,tt0053137"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0037382,tt0071877,tt0117057,tt0038355"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,music_department","tt0057345,tt0054452,tt0049189,tt0056404"
3,nm0000004,John Belushi,1949,1982,"actor,soundtrack,writer","tt0078723,tt0080455,tt0072562,tt0077975"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0050976,tt0083922,tt0060827,tt0050986"
...,...,...,...,...,...,...
95,nm0000096,Gillian Anderson,1968,\N,"actress,producer,soundtrack","tt0442632,tt2294189,tt0106179,tt0455590"
96,nm0000097,Pamela Anderson,1967,\N,"actress,producer,director","tt0115624,tt0893509,tt0426592,tt0306047"
97,nm0000098,Jennifer Aniston,1969,\N,"actress,producer,soundtrack","tt1723121,tt0108778,tt1038919,tt3442006"
98,nm0000099,Patricia Arquette,1968,\N,"actress,producer,director","tt0412175,tt0145531,tt1065073,tt0108399"
