# Acquisition
[Overall Data Source with Dataset Details](https://www.imdb.com/interfaces/)
[Data Location](https://datasets.imdbws.com)

The Internet Movie Database (IMDb) freely provides a subset of its data for personal and non-commercial use. Web scraping is strictly prohibited, so although the data could have been obtained that way, we opted to abide by IMDb's site access policies. All data set content included herein is the property of IMDb. This presentation is purely for educational purposes and is not meant to critique the aggregate data. With that said, let's obtain the data sets.

There are seven separate datasets, each contained in a gzipped, tab-separated-values (TSV) file in the UTF-8 character set. A '\N' denotes that a particular field is missing or null for that title/name.
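
As a minimal sketch of how such a file could be loaded so that the null marker becomes a real missing value, the example below builds a tiny gzipped TSV in memory (the rows are made up for illustration) and passes the marker to `na_values`. The variable names `raw`, `buffer`, and `sample` are assumptions, not part of the original notebook.

```python
import gzip
import io

import pandas as pd

# Made-up rows that mimic the IMDb layout: tab-separated, '\N' for nulls.
raw = "tconst\tstartYear\truntimeMinutes\ntt0000001\t2001\t118\ntt0000002\t\\N\t\\N\n"
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))

# na_values=r"\N" converts the null marker to NaN while the file is parsed.
sample = pd.read_csv(
    buffer,
    sep="\t",
    compression="gzip",
    na_values=r"\N",
    encoding="utf-8",
)
print(sample.dtypes)
```

Loading the marker as NaN up front would avoid columns that mix numbers and the literal string later on, at the cost of numeric columns becoming floats wherever values are missing.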

The seven datasets are merged on two keys: the title subsets share the primary key 'tconst', and name.basics.tsv is joined through its primary key 'nconst' to the foreign key 'nconst' in title.principals, which holds the cast/crew metadata for each title.
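
The key relationships can be sketched with three toy frames. The frame names `basics`, `principals`, and `names`, and their contents, are hypothetical stand-ins for the real files; only the key columns `tconst` and `nconst` come from the datasets.

```python
import pandas as pd

# Toy stand-ins for title.basics, title.principals and name.basics.
basics = pd.DataFrame({"tconst": ["tt1", "tt2"], "primaryTitle": ["A", "B"]})
principals = pd.DataFrame(
    {"tconst": ["tt1", "tt1", "tt2"], "nconst": ["nm1", "nm2", "nm1"]}
)
names = pd.DataFrame({"nconst": ["nm1", "nm2"], "primaryName": ["Ann", "Bob"]})

# Titles join to principals on 'tconst'; principals join to names on 'nconst'.
combined = basics.merge(principals, on="tconst").merge(names, on="nconst")
print(combined)
```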

At the end of acquisition, the combined dataframe was saved to .csv and subsequently read in as the dataframe for preparation and pre-processing. 
***

In [1]:
import pandas as pd

In [1]:
df = pd.read_csv('data-2.tsv', sep ='\t')

In [2]:
df

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N
...,...,...,...
8946053,tt9916848,"nm5519454,nm5519375","nm6182221,nm1628284,nm2921377"
8946054,tt9916850,"nm5519454,nm5519375","nm6182221,nm1628284,nm2921377"
8946055,tt9916852,"nm5519454,nm5519375","nm6182221,nm1628284,nm2921377"
8946056,tt9916856,nm10538645,nm6951431


In [39]:
df1 = pd.read_csv('data-3.tsv', sep ='\t')



In [40]:
df1.to_csv('title_basics_unfiltered.csv', index=False)

In [5]:
import mitosheet
mitosheet.sheet(df1, analysis_to_replay="id-hamtlyclfd")


In [None]:
from mitosheet import *; # Analysis Name:id-hamtlyclfd;
    
# Filtered titleType
df1 = df1[df1['titleType'].apply(lambda val: all(val != s for s in ['short', 'tvEpisode', 'tvMiniSeries', 'tvPilot', 'tvSeries', 'tvShort', 'tvSpecial', 'video', 'videoGame']))]

# Filtered startYear
df1 = df1[df1['startYear'].apply(lambda val: all(val != s for s in ['1999', '1998', '1997', '1995', '1994', '1992', '1991', '1987', '1985', '1984', '1982', '1904', '1903', '1902', '1901', '1900', '1899', '1898', '1897', '1896']))]
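
The generated filters above test each value against a list with `apply` and a lambda. A shorter equivalent, shown here on a made-up frame, is to negate `Series.isin`. The names `titles` and `excluded` are illustrative assumptions.

```python
import pandas as pd

titles = pd.DataFrame(
    {
        "tconst": ["tt1", "tt2", "tt3", "tt4"],
        "titleType": ["movie", "short", "tvSeries", "movie"],
    }
)
excluded = ["short", "tvSeries"]

# Style of the generated code: keep rows whose value differs from every entry.
via_apply = titles[titles["titleType"].apply(lambda val: all(val != s for s in excluded))]

# Equivalent vectorized form: keep rows whose value is NOT in the list.
via_isin = titles[~titles["titleType"].isin(excluded)]
print(via_isin)
```

Both selections keep the same rows; the `isin` form avoids a Python-level call per row, which tends to matter on frames with millions of rows.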


In [8]:
df1 = df1[df1['titleType'] == 'movie']

In [None]:
df1

In [14]:
import mitosheet
mitosheet.sheet(df1, analysis_to_replay="id-hhnrsklphg")


In [None]:
from mitosheet import *; # Analysis Name:id-hhnrsklphg;
    
# Filtered startYear
df1 = df1[df1['startYear'] != '\\N']


In [9]:
df1.dtypes

tconst            object
titleType         object
primaryTitle      object
originalTitle     object
isAdult           object
startYear         object
endYear           object
runtimeMinutes    object
genres            object
dtype: object

In [12]:
df1.startYear.astype(int)

ValueError: invalid literal for int() with base 10: '\\N'
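
The cast fails because the column still contains the literal string `\N`. One possible alternative to filtering those rows first is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN. The sketch below uses a made-up series named `years`.

```python
import pandas as pd

# Mixed ints, numeric strings and the IMDb null marker, as in startYear.
years = pd.Series([1905, "1972", "\\N", "2019"], dtype=object)

# Unparseable entries such as '\N' become NaN instead of raising ValueError.
numeric = pd.to_numeric(years, errors="coerce")

# Drop the missing entries, then cast the remainder to plain integers.
clean = numeric.dropna().astype(int)
print(clean.tolist())
```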

In [16]:
df1.startYear.unique()  # the '\N' placeholders are what block the cast to int

array([1905, 1906, 1907, 1908, 1909, 1910, 1912, 1911, 1913, 1915, 1914,
       1919, 1916, 1917, 1936, 1925, 1918, 1920, 1922, 1921, 1924, 1923,
       1928, 2019, 2021, 1926, 1927, 1929, 2000, 1993, 1935, 1930, 1942,
       1932, 1931, 1934, 1939, 1937, 1933, 1950, 1938, 1951, 1946, 1996,
       1940, 1944, 1947, 1941, 1952, 1970, 1957, 1943, 1948, 1945, 2001,
       1949, 1953, 1954, 1965, 1983, 1980, 1973, 1961, 1955, 1962, 1958,
       1956, 1977, 1964, 1960, 1959, 1967, 1968, 1963, 1971, 1969, 1972,
       1966, 1976, 1990, 1979, 1981, 2020, 1975, 1978, 1989, 1974, 1986,
       '1972', '1971', '1970', '1974', '1973', '1976', '1969', '1981',
       '1968', '1995', '1986', '1987', '1975', '1965', '1978', '1967',
       '1990', '1980', '1985', '2018', '1977', '1989', '1979', '1984',
       '1966', '1982', '1988', '1983', '1991', '1963', '2001', '1961',
       '\\N', '1994', '1993', '1964', '1957', '2019', '1992', '2005',
       '1953', '2004', '1998', '2020', '1947', '2016', '2002',

In [18]:
df1 = df1[df1['startYear'] != '\\N']

In [19]:
df1.startYear.unique()

array([1905, 1906, 1907, 1908, 1909, 1910, 1912, 1911, 1913, 1915, 1914,
       1919, 1916, 1917, 1936, 1925, 1918, 1920, 1922, 1921, 1924, 1923,
       1928, 2019, 2021, 1926, 1927, 1929, 2000, 1993, 1935, 1930, 1942,
       1932, 1931, 1934, 1939, 1937, 1933, 1950, 1938, 1951, 1946, 1996,
       1940, 1944, 1947, 1941, 1952, 1970, 1957, 1943, 1948, 1945, 2001,
       1949, 1953, 1954, 1965, 1983, 1980, 1973, 1961, 1955, 1962, 1958,
       1956, 1977, 1964, 1960, 1959, 1967, 1968, 1963, 1971, 1969, 1972,
       1966, 1976, 1990, 1979, 1981, 2020, 1975, 1978, 1989, 1974, 1986,
       '1972', '1971', '1970', '1974', '1973', '1976', '1969', '1981',
       '1968', '1995', '1986', '1987', '1975', '1965', '1978', '1967',
       '1990', '1980', '1985', '2018', '1977', '1989', '1979', '1984',
       '1966', '1982', '1988', '1983', '1991', '1963', '2001', '1961',
       '1994', '1993', '1964', '1957', '2019', '1992', '2005', '1953',
       '2004', '1998', '2020', '1947', '2016', '2002', '1996'

In [20]:
df1.startYear.astype(int)

498        1905
570        1906
587        1907
610        1907
625        1908
           ... 
8945948    2015
8945975    2007
8945987    2013
8945998    2017
8946008    2013
Name: startYear, Length: 531264, dtype: int64

In [22]:
df1.dtypes

tconst            object
titleType         object
primaryTitle      object
originalTitle     object
isAdult           object
startYear         object
endYear           object
runtimeMinutes    object
genres            object
dtype: object

In [24]:
df1['startYear'] = df1['startYear'].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['startYear'] = df1['startYear'].astype(int)
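
The warning above appears because `df1` was itself produced by boolean filtering, so pandas cannot tell whether the assignment targets the original frame or a temporary copy. A common way to make the intent explicit is to call `.copy()` on the filtered result before assigning to it. The frame below is a made-up example, not the notebook's data.

```python
import pandas as pd

frame = pd.DataFrame(
    {"tconst": ["tt1", "tt2", "tt3"], "startYear": ["1999", "\\N", "2005"]}
)

# .copy() makes the filtered result an independent DataFrame, so the
# column assignment below is unambiguous and leaves `frame` untouched.
filtered = frame[frame["startYear"] != "\\N"].copy()
filtered["startYear"] = filtered["startYear"].astype(int)
print(filtered.dtypes)
```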


In [25]:
df1.dtypes

tconst            object
titleType         object
primaryTitle      object
originalTitle     object
isAdult           object
startYear          int64
endYear           object
runtimeMinutes    object
genres            object
dtype: object

In [35]:
df1 = df1[df1['startYear'] >= 2000]

In [36]:
df1.startYear.unique()

array([2019, 2021, 2000, 2001, 2020, 2018, 2005, 2004, 2016, 2002, 2017,
       2006, 2008, 2009, 2003, 2007, 2022, 2010, 2012, 2013, 2011, 2024,
       2015, 2014, 2023, 2025, 2027, 2026, 2028, 2029])

In [37]:
df1.shape

(280658, 9)

In [38]:
df1.to_csv('title_basics.csv', index=False)

In [2]:
df = pd.read_csv('title_basics.csv')

In [3]:
df

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0011801,movie,Tötet nicht mehr,Tötet nicht mehr,0,2019,\N,\N,"Action,Crime"
1,tt0013274,movie,Istoriya grazhdanskoy voyny,Istoriya grazhdanskoy voyny,0,2021,\N,133,Documentary
2,tt0015414,movie,La tierra de los toros,La tierra de los toros,0,2000,\N,60,\N
3,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,\N,118,"Comedy,Fantasy,Romance"
4,tt0062336,movie,The Tango of the Widower and Its Distorting Mirror,El Tango del Viudo y Su Espejo Deformante,0,2020,\N,70,Drama
...,...,...,...,...,...,...,...,...,...
280653,tt9916622,movie,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,0,2015,\N,57,Documentary
280654,tt9916680,movie,De la ilusión al desconcierto: cine colombiano 1970-1995,De la ilusión al desconcierto: cine colombiano 1970-1995,0,2007,\N,100,Documentary
280655,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,0,2013,\N,\N,Comedy
280656,tt9916730,movie,6 Gunn,6 Gunn,0,2017,\N,116,\N


In [4]:
df1 = pd.read_csv('data-2.tsv', sep ='\t')

In [7]:
df1

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N
...,...,...,...
8946053,tt9916848,"nm5519454,nm5519375","nm6182221,nm1628284,nm2921377"
8946054,tt9916850,"nm5519454,nm5519375","nm6182221,nm1628284,nm2921377"
8946055,tt9916852,"nm5519454,nm5519375","nm6182221,nm1628284,nm2921377"
8946056,tt9916856,nm10538645,nm6951431


pandas provides three main ways to combine data:

- merge() for combining data on common columns or indices
- .join() for combining data on a key column or an index
- concat() for combining DataFrames across rows or columns
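
The three approaches listed above can be compared side by side on two small frames. The frames `left` and `right` and their values are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"tconst": ["tt1", "tt2"], "primaryTitle": ["A", "B"]})
right = pd.DataFrame({"tconst": ["tt1", "tt2"], "averageRating": [6.4, 7.1]})

# merge(): combine on a shared column.
merged = pd.merge(left, right, on="tconst")

# .join(): combine on the index, so the key is moved into the index first.
joined = left.set_index("tconst").join(right.set_index("tconst"))

# concat(): stack frames along rows (or columns with axis=1).
stacked = pd.concat([left, left], ignore_index=True)

print(merged.shape, joined.shape, stacked.shape)
```

Since every dataset in this notebook shares a key column rather than an index, `merge()` is the natural fit for the steps that follow.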

In [None]:
dft = 

In [5]:
df.shape

(280658, 9)

In [6]:
df1.shape

(8946058, 3)

In [9]:
df2 = pd.merge(df, df1, on='tconst')

In [10]:
df2

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,directors,writers
0,tt0011801,movie,Tötet nicht mehr,Tötet nicht mehr,0,2019,\N,\N,"Action,Crime",nm0681726,"nm0483944,nm0681726"
1,tt0013274,movie,Istoriya grazhdanskoy voyny,Istoriya grazhdanskoy voyny,0,2021,\N,133,Documentary,"nm0412842,nm0895048",\N
2,tt0015414,movie,La tierra de los toros,La tierra de los toros,0,2000,\N,60,\N,nm0615736,\N
3,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,\N,118,"Comedy,Fantasy,Romance",nm0003506,"nm0737216,nm0003506"
4,tt0062336,movie,The Tango of the Widower and Its Distorting Mirror,El Tango del Viudo y Su Espejo Deformante,0,2020,\N,70,Drama,"nm0749914,nm0765384","nm0749914,nm1146177"
...,...,...,...,...,...,...,...,...,...,...,...
280653,tt9916622,movie,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,0,2015,\N,57,Documentary,"nm9272491,nm9272490","nm9272490,nm9272491"
280654,tt9916680,movie,De la ilusión al desconcierto: cine colombiano 1970-1995,De la ilusión al desconcierto: cine colombiano 1970-1995,0,2007,\N,100,Documentary,nm0652213,"nm0652213,nm10538576"
280655,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,0,2013,\N,\N,Comedy,nm7764440,nm7933903
280656,tt9916730,movie,6 Gunn,6 Gunn,0,2017,\N,116,\N,nm10538612,nm10538612


In [11]:
df2.to_csv('title_basics_and_title_crew.csv', index=False)

In [12]:
df3 = pd.read_csv('data-4.tsv', sep ='\t')



In [None]:
df3

In [16]:
df3.rename(columns={'titleId': 'tconst'}, inplace=True)

In [17]:
df4 = pd.merge(df2, df3, on='tconst')

In [18]:
df4.shape

(1213147, 18)
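
The row count grows from 280658 titles to 1213147 rows because the alternate-titles data holds several rows per `tconst`, so each title is repeated once per alternate title. A small sketch of this one-to-many behavior follows; the toy frames are made up, and `validate="one_to_many"` is an optional check added here that raises if the left-hand keys are not unique.

```python
import pandas as pd

titles = pd.DataFrame({"tconst": ["tt1", "tt2"], "primaryTitle": ["A", "B"]})
akas = pd.DataFrame(
    {"tconst": ["tt1", "tt1", "tt1", "tt2"], "region": ["DE", "US", "FR", "US"]}
)

# Each title row is duplicated once for every matching alternate-title row.
merged = pd.merge(titles, akas, on="tconst", validate="one_to_many")

# Count how many rows each title contributes after the merge.
rows_per_title = merged.groupby("tconst").size()
print(rows_per_title)
```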

In [19]:
df4

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,directors,writers,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0011801,movie,Tötet nicht mehr,Tötet nicht mehr,0,2019,\N,\N,"Action,Crime",nm0681726,"nm0483944,nm0681726",1,Tötet nicht mehr,DE,\N,imdbDisplay,\N,0
1,tt0011801,movie,Tötet nicht mehr,Tötet nicht mehr,0,2019,\N,\N,"Action,Crime",nm0681726,"nm0483944,nm0681726",2,Misericordia - Tötet nicht mehr!,DE,\N,\N,censored version,0
2,tt0011801,movie,Tötet nicht mehr,Tötet nicht mehr,0,2019,\N,\N,"Action,Crime",nm0681726,"nm0483944,nm0681726",3,Tötet nicht mehr,\N,\N,original,\N,1
3,tt0013274,movie,Istoriya grazhdanskoy voyny,Istoriya grazhdanskoy voyny,0,2021,\N,133,Documentary,"nm0412842,nm0895048",\N,1,History of the Civil War,\N,\N,\N,\N,0
4,tt0013274,movie,Istoriya grazhdanskoy voyny,Istoriya grazhdanskoy voyny,0,2021,\N,133,Documentary,"nm0412842,nm0895048",\N,2,Histoire de la guerre civile,FR,\N,\N,\N,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1213142,tt9916680,movie,De la ilusión al desconcierto: cine colombiano 1970-1995,De la ilusión al desconcierto: cine colombiano 1970-1995,0,2007,\N,100,Documentary,nm0652213,"nm0652213,nm10538576",2,De la ilusión al desconcierto: cine colombiano 1970-1995,\N,\N,original,\N,1
1213143,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,0,2013,\N,\N,Comedy,nm7764440,nm7933903,1,Dankyavar Danka,IN,\N,\N,\N,0
1213144,tt9916730,movie,6 Gunn,6 Gunn,0,2017,\N,116,\N,nm10538612,nm10538612,1,6 Gunn,IN,\N,\N,\N,0
1213145,tt9916754,movie,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,0,2013,\N,49,Documentary,"nm8349149,nm9272490","nm8349149,nm9272490",1,Chico Albuquerque - Revelações,BR,\N,imdbDisplay,\N,0


In [21]:
df5 = df4[df4['region'] == 'US']

In [22]:
df5.shape

(148940, 18)

In [30]:
df5.titleType.unique()
df5.startYear.unique()
df5.region.unique()

array(['US'], dtype=object)

In [31]:
df5.to_csv('title_basics_and_title_crew_title_akas.csv', index=False)

In [33]:
df6 = pd.read_csv('ratings.tsv', sep ='\t')

In [34]:
df6

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1882
1,tt0000002,5.9,250
2,tt0000003,6.5,1661
3,tt0000004,5.8,163
4,tt0000005,6.2,2486
...,...,...,...
1247228,tt9916690,6.5,6
1247229,tt9916720,5.2,216
1247230,tt9916730,8.4,6
1247231,tt9916766,6.7,19


In [35]:
df7 = pd.merge(df5, df6, on='tconst')

In [36]:
df7

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,directors,writers,ordering,title,region,language,types,attributes,isOriginalTitle,averageRating,numVotes
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,\N,118,"Comedy,Fantasy,Romance",nm0003506,"nm0737216,nm0003506",35,Kate and Leopold,US,\N,\N,alternative spelling,0,6.4,83822
1,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,\N,118,"Comedy,Fantasy,Romance",nm0003506,"nm0737216,nm0003506",37,Kate & Leopold,US,\N,imdbDisplay,\N,0,6.4,83822
2,tt0062336,movie,The Tango of the Widower and Its Distorting Mirror,El Tango del Viudo y Su Espejo Deformante,0,2020,\N,70,Drama,"nm0749914,nm0765384","nm0749914,nm1146177",5,The Tango of the Widower and Its Distorting Mirror,US,\N,imdbDisplay,\N,0,6.4,161
3,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,\N,122,Drama,nm0000080,"nm0000080,nm0462648",3,The Other Side of the Wind,US,\N,imdbDisplay,\N,0,6.7,7234
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,\N,100,"Comedy,Horror,Sci-Fi","nm0078540,nm0628399",nm0628399,1,Attack of the B-Movie Monster,US,\N,working,\N,0,5.2,320
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101462,tt9915872,movie,The Last White Witch,My Girlfriend is a Wizard,0,2019,\N,97,"Comedy,Drama,Fantasy",nm8063415,nm2507310,3,The Last White Witch,US,\N,imdbDisplay,\N,0,6.9,8
101463,tt9916170,movie,The Rehearsal,O Ensaio,0,2019,\N,51,Drama,nm5412267,"nm5412267,nm6743460,nm3245789",5,The Rehearsal,US,\N,imdbDisplay,\N,0,6.7,6
101464,tt9916190,movie,Safeguard,Safeguard,0,2020,\N,90,"Action,Adventure,Thriller",nm7308376,nm7308376,6,Safeguard,US,\N,imdbDisplay,\N,0,3.6,233
101465,tt9916362,movie,Coven,Akelarre,0,2020,\N,92,"Drama,History",nm1893148,"nm1893148,nm3471432",18,Coven,US,\N,imdbDisplay,\N,0,6.4,4659


In [37]:
df8 = pd.read_csv('data-5.tsv', sep ='\t')

In [43]:
df9 = pd.merge(df7, df8, on='tconst')

In [46]:
df9.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,directors,...,types,attributes,isOriginalTitle,averageRating,numVotes,ordering_y,nconst,category,job,characters
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,\N,118,"Comedy,Fantasy,Romance",nm0003506,...,\N,alternative spelling,0,6.4,83822,10,nm0107463,editor,\N,\N
1,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,\N,118,"Comedy,Fantasy,Romance",nm0003506,...,\N,alternative spelling,0,6.4,83822,1,nm0000212,actress,\N,"[""Kate McKay""]"
2,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,\N,118,"Comedy,Fantasy,Romance",nm0003506,...,\N,alternative spelling,0,6.4,83822,2,nm0413168,actor,\N,"[""Leopold""]"
3,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,\N,118,"Comedy,Fantasy,Romance",nm0003506,...,\N,alternative spelling,0,6.4,83822,3,nm0000630,actor,\N,"[""Stuart Besser""]"
4,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,\N,118,"Comedy,Fantasy,Romance",nm0003506,...,\N,alternative spelling,0,6.4,83822,4,nm0005227,actor,\N,"[""Charlie McKay""]"


In [48]:
df9.dtypes

tconst              object
titleType           object
primaryTitle        object
originalTitle       object
isAdult              int64
startYear            int64
endYear             object
runtimeMinutes      object
genres              object
directors           object
writers             object
ordering_x           int64
title               object
region              object
language            object
types               object
attributes          object
isOriginalTitle     object
averageRating      float64
numVotes             int64
ordering_y           int64
nconst              object
category            object
job                 object
characters          object
dtype: object
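
The `ordering_x` and `ordering_y` columns appear because both merged frames carried a column named `ordering`, and pandas appends `_x` and `_y` by default. As a sketch, the `suffixes` parameter of `pd.merge` can give them more descriptive names. The suffix strings and toy frames below are illustrative choices, not what the notebook used.

```python
import pandas as pd

akas = pd.DataFrame({"tconst": ["tt1"], "ordering": [1], "title": ["A"]})
principals = pd.DataFrame({"tconst": ["tt1"], "ordering": [2], "nconst": ["nm1"]})

# Default behavior: overlapping non-key columns get '_x' and '_y'.
default = pd.merge(akas, principals, on="tconst")

# Explicit suffixes record which source each column came from.
named = pd.merge(akas, principals, on="tconst", suffixes=("_akas", "_principals"))
print(list(named.columns))
```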

In [49]:
df10 = pd.read_csv('data-6.tsv', sep ='\t')

In [50]:
df10.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0031983,tt0053137,tt0072308,tt0050419"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0117057,tt0037382,tt0071877,tt0038355"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,music_department","tt0049189,tt0057345,tt0056404,tt0054452"
3,nm0000004,John Belushi,1949,1982,"actor,soundtrack,writer","tt0077975,tt0078723,tt0080455,tt0072562"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0050986,tt0060827,tt0083922,tt0050976"


In [52]:
df10.shape

(11654331, 6)

In [51]:
df11 = pd.merge(df9, df10, on='nconst')

In [53]:
df11.shape

(900195, 30)

In [54]:
df11

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,directors,...,ordering_y,nconst,category,job,characters,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,\N,118,"Comedy,Fantasy,Romance",nm0003506,...,10,nm0107463,editor,\N,\N,David Brenner,1962,2022,"editor,editorial_department","tt1190080,tt0319262,tt0116629,tt0096969"
1,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,\N,118,"Comedy,Fantasy,Romance",nm0003506,...,10,nm0107463,editor,\N,\N,David Brenner,1962,2022,"editor,editorial_department","tt1190080,tt0319262,tt0116629,tt0096969"
2,tt0309698,movie,Identity,Identity,0,2003,\N,90,"Mystery,Thriller",nm0003506,...,10,nm0107463,editor,\N,\N,David Brenner,1962,2022,"editor,editorial_department","tt1190080,tt0319262,tt0116629,tt0096969"
3,tt0309698,movie,Identity,Identity,0,2003,\N,90,"Mystery,Thriller",nm0003506,...,10,nm0107463,editor,\N,\N,David Brenner,1962,2022,"editor,editorial_department","tt1190080,tt0319262,tt0116629,tt0096969"
4,tt0319262,movie,The Day After Tomorrow,The Day After Tomorrow,0,2004,\N,124,"Action,Adventure,Sci-Fi",nm0000386,...,10,nm0107463,editor,\N,\N,David Brenner,1962,2022,"editor,editorial_department","tt1190080,tt0319262,tt0116629,tt0096969"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
900190,tt9916362,movie,Coven,Akelarre,0,2020,\N,92,"Drama,History",nm1893148,...,6,nm3471432,writer,screenplay by,\N,Katell Guillou,\N,\N,"writer,director","tt9916362,tt12339398,tt2104911,tt8368232"
900191,tt9916428,movie,The Secret of China,Hong xing zhao yao Zhong guo,0,2019,\N,\N,"Adventure,History,War",nm0910951,...,2,nm9445072,actor,\N,"[""Mao Ze Dong""]",Wang Peng Kai,\N,\N,actor,"tt7674346,tt9916428"
900192,tt9916428,movie,The Secret of China,Hong xing zhao yao Zhong guo,0,2019,\N,\N,"Adventure,History,War",nm0910951,...,3,nm8594703,actor,\N,"[""Dr. Hatem""]",Valery Gadreau,\N,\N,actor,tt1424310
900193,tt9916428,movie,The Secret of China,Hong xing zhao yao Zhong guo,0,2019,\N,\N,"Adventure,History,War",nm0910951,...,5,nm0910951,director,\N,\N,Jixing Wang,\N,\N,"director,writer","tt1267271,tt0340445,tt0344062,tt13932406"


In [55]:
df11.to_csv('dirty_df.csv', index=False)

In [3]:
df12 = pd.read_csv('dirty_df.csv')

In [5]:
df12.columns

Index(['tconst', 'titleType', 'primaryTitle', 'originalTitle', 'isAdult',
       'startYear', 'endYear', 'runtimeMinutes', 'genres', 'directors',
       'writers', 'ordering_x', 'title', 'region', 'language', 'types',
       'attributes', 'isOriginalTitle', 'averageRating', 'numVotes',
       'ordering_y', 'nconst', 'category', 'job', 'characters', 'primaryName',
       'birthYear', 'deathYear', 'primaryProfession', 'knownForTitles'],
      dtype='object')

In [6]:
df12.primaryName.unique()

array(['David Brenner', 'Meg Ryan', 'Hugh Jackman', ..., 'Valery Gadreau',
       'Jixing Wang', 'Vincent Matile'], dtype=object)