In [1]:
from textblob import TextBlob

In [39]:
import pandas as pd
import seaborn as sns
from langdetect import detect

ModuleNotFoundError: No module named 'langdetect'

### Reading in Data
Because the raw IMDb data is very large, the records listed below were removed beforehand, both to keep the analysis focused and so that Jupyter has less trouble handling the DataFrames.

#### Actors
- Missing the birthYear value
- Birth years before 1900
- Primary Profession not listing "actor" or "actress"

#### Movies
- Missing a runtime
- Not labeled as 'movie' (e.g. short films, television)
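
The pre-filtering itself is not shown in this notebook, so the following is only a minimal sketch of how the criteria above could be applied. It assumes the raw frames use the IMDb column names `birthYear`, `primaryProfession`, `titleType` and `runtimeMinutes`, and that missing values appear as the string `\N`; the function names `filter_actors` and `filter_movies` are hypothetical.

```python
import pandas as pd


def filter_actors(names: pd.DataFrame) -> pd.DataFrame:
    """Keep people with a known birth year of 1900 or later who list acting."""
    # Non-numeric markers such as "\N" become NaN, which fails the >= test.
    birth_year = pd.to_numeric(names["birthYear"], errors="coerce")
    # primaryProfession is a comma-separated string, e.g. "actor,producer".
    is_actor = (
        names["primaryProfession"]
        .fillna("")
        .str.contains(r"\b(?:actor|actress)\b")
    )
    return names[birth_year.ge(1900) & is_actor]


def filter_movies(titles: pd.DataFrame) -> pd.DataFrame:
    """Keep rows labeled 'movie' that have a numeric runtime."""
    has_runtime = pd.to_numeric(titles["runtimeMinutes"], errors="coerce").notna()
    is_movie = titles["titleType"] == "movie"
    return titles[has_runtime & is_movie]
```

Each function takes a raw frame and returns the reduced one, which could then be written out with `to_csv` to produce files like the two loaded below.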

In [6]:
actors_df = pd.read_csv('actors_data_1900.csv')
movies_df = pd.read_csv('movies_wruntime.csv')

In [14]:
movies_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
0,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,45,Romance
1,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897,20,"Documentary,News,Sport"
3,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,70,"Biography,Crime,Drama"
4,tt0000679,movie,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0,1908,120,"Adventure,Fantasy"
5,tt0001184,movie,Don Juan de Serrallonga,Don Juan de Serrallonga,0,1910,58,"Adventure,Drama"


### Data Cleaning
Several entries do not have a start year (the data marks missing values with '\N'), so we first remove those. Once the missing markers are gone, the years can be converted to integers, which makes them much easier to compare.
Also, since we are not working with any television shows, we can drop the endYear column and relabel the startYear column.

In [7]:
# Keep only rows with a known start year, then drop the unused endYear column
movies_df = movies_df[movies_df['startYear'] != '\\N'].drop(columns=['endYear'])
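
Filtering out the `\N` rows does not change the column type on its own, so the years stay strings until they are cast. A minimal sketch of the cast on a toy frame (`toy_df` is a hypothetical stand-in for `movies_df`):

```python
import pandas as pd

# Toy stand-in for movies_df; startYear is read as strings because of "\N".
toy_df = pd.DataFrame({
    "primaryTitle": ["Miss Jerry", "Untitled", "Don Juan de Serrallonga"],
    "startYear": ["1894", "\\N", "1910"],
})

# Drop rows with the missing-value marker, then cast the rest to int.
toy_df = toy_df[toy_df["startYear"] != "\\N"].copy()
toy_df["startYear"] = toy_df["startYear"].astype(int)

# Numeric comparisons on the year now behave as expected.
before_1900 = toy_df[toy_df["startYear"] < 1900]
```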

Films that may not be relevant (very small productions, films never released to a wide audience, or films from the turn of the century) generally have missing genre information, so I will drop the entries without at least one genre.

In [8]:
movies_df = movies_df[movies_df['genres'] != '\\N']
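
As an alternative to comparing against the string `'\\N'` column by column, the marker can be declared as a missing value when the file is read, after which `dropna` handles both filters at once. This is only a sketch on an in-memory toy CSV, not the approach used in this notebook; `raw`, `toy_df` and `cleaned` are hypothetical names.

```python
import io

import pandas as pd

# Small in-memory CSV that mimics the IMDb "\N" missing-value marker.
raw = io.StringIO(
    "tconst,startYear,genres\n"
    "tt0000009,1894,Romance\n"
    "tt0000001,\\N,Drama\n"
    "tt0000002,1905,\\N\n"
)

# na_values tells read_csv to parse the marker as NaN.
toy_df = pd.read_csv(raw, na_values="\\N")

# Drop every row that lacks a start year or a genre.
cleaned = toy_df.dropna(subset=["startYear", "genres"])
```

One trade-off of this route is that a numeric column containing NaN is read as float, so the years would still need a cast to int after the rows are dropped.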

Now I am curious about the difference between the Original Title and the Primary Title: how often do they not match?

In [21]:
(movies_df['primaryTitle'] != movies_df['originalTitle']).value_counts()

False    116708
True      31315
dtype: int64

There are 31,315 instances (about 21% of the 148,023 films) where the Primary Title and the Original Title do not match. Looking at the titles, it seems that the Primary Title is almost always the English title, while the Original Title is in the language of the country of origin.
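
The raw counts can also be expressed as proportions. A minimal sketch on a toy frame (the four rows are taken from the tables in this notebook; `toy_df` is a hypothetical stand-in for `movies_df`):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "primaryTitle": ["Miss Jerry", "The Traitress", "Dante's Inferno", "Nell Gwynne"],
    "originalTitle": ["Miss Jerry", "Die Verräterin", "L'Inferno", "Sweet Nell of Old Drury"],
})

# Boolean Series: True where the two titles differ.
titles_differ = toy_df["primaryTitle"] != toy_df["originalTitle"]

# normalize=True returns fractions instead of raw counts.
share = titles_differ.value_counts(normalize=True)

# The mean of a boolean Series is the fraction of True values.
mismatch_share = titles_differ.mean()
```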

In [38]:
diffTitle_df = movies_df[movies_df['primaryTitle'] != movies_df['originalTitle']]
diffTitle_df

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
6,tt0001258,movie,The White Slave Trade,Den hvide slavehandel,0,1910,45,Drama
12,tt0001790,movie,"Les Misérables, Part 1: Jean Valjean",Les misérables - Époque 1: Jean Valjean,0,1913,60,Drama
15,tt0001911,movie,Nell Gwynne,Sweet Nell of Old Drury,0,1911,50,"Biography,Drama,History"
16,tt0001964,movie,The Traitress,Die Verräterin,0,1911,48,Drama
17,tt0002026,movie,Anny - Story of a Prostitute,Anny - en gatepiges roman,0,1912,68,"Drama,Romance"
20,tt0002130,movie,Dante's Inferno,L'Inferno,0,1911,68,"Adventure,Drama,Fantasy"
21,tt0002153,movie,The Great Circus Catastrophe,Dødsspring til hest fra cirkuskuplen,0,1912,45,Drama
22,tt0002186,movie,The Flying Circus,Den flyvende cirkus,0,1912,46,Drama
28,tt0002452,movie,The Independence of Romania,Independenta Romaniei,0,1912,120,"History,War"
29,tt0002461,movie,The Life and Death of King Richard III,Richard III,0,1912,55,Drama
