In [15]:
from textblob import TextBlob

In [16]:
import pandas as pd
import seaborn as sns

### Reading in Data
Because the raw IMDb data is large, I removed several parts of it beforehand, both to focus the analysis and so that Jupyter has less trouble handling the DataFrames.

#### Actors
- Removed rows missing a birthYear value
- Removed rows with birth years before 1900
- Removed rows where the primary profession did not list "actor" or "actress"
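
The actor filters listed above could be reproduced roughly as follows. This is a minimal sketch, not the code that produced actors_data_1900.csv: the `primaryProfession` column name, the comma-separated profession format, and the made-up sample rows are assumptions for illustration.

```python
import pandas as pd

def filter_actors(names: pd.DataFrame) -> pd.DataFrame:
    """Keep people born in 1900 or later who are listed as actor or actress."""
    # "\N" (the missing-value marker) becomes NaN when coerced to a number.
    birth_year = pd.to_numeric(names["birthYear"], errors="coerce")
    # Assumed format: a comma-separated string such as "actress,producer".
    is_actor = (
        names["primaryProfession"]
        .fillna("")
        .str.contains(r"\b(?:actor|actress)\b", regex=True)
    )
    # NaN >= 1900 evaluates to False, so rows without a birth year are dropped too.
    return names[(birth_year >= 1900) & is_actor]

# Tiny made-up sample standing in for the raw names table.
sample_names = pd.DataFrame({
    "nconst": ["nm1", "nm2", "nm3", "nm4"],
    "birthYear": ["1899", "1924", "\\N", "1950"],
    "primaryProfession": ["actor", "actress,soundtrack", "actor", "director,writer"],
})
kept_actors = filter_actors(sample_names)
print(kept_actors)
```

Only the second sample row passes all three conditions; the others fail on birth year, missing birth year, or profession.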

#### Movies
- Removed all titles without a runtime
- Removed all titles not labeled as 'movie' (i.e. short films, television, etc.)
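
A similar sketch for the movie filters, again under assumptions: the sample rows are invented, and missing runtimes are assumed to appear either as the `\N` placeholder or as NaN.

```python
import pandas as pd

def filter_movies(titles: pd.DataFrame) -> pd.DataFrame:
    """Keep rows labeled 'movie' that have a runtime."""
    is_movie = titles["titleType"] == "movie"
    # Treat both NaN and the "\N" placeholder as a missing runtime.
    has_runtime = titles["runtimeMinutes"].notna() & (titles["runtimeMinutes"] != "\\N")
    return titles[is_movie & has_runtime]

# Tiny made-up sample standing in for the raw titles table.
sample_titles = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "titleType": ["movie", "short", "movie", "tvSeries"],
    "runtimeMinutes": ["45", "10", "\\N", "30"],
})
kept_movies = filter_movies(sample_titles)
print(kept_movies)
```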

In [17]:
actors_df = pd.read_csv('actors_data_1900.csv')
movies_df = pd.read_csv('movies_wruntime.csv')

In [18]:
movies_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
1,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897,\N,20,"Documentary,News,Sport"
2,tt0000502,movie,Bohemios,Bohemios,0,1905,\N,100,\N
3,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,\N,70,"Biography,Crime,Drama"
4,tt0000679,movie,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0,1908,\N,120,"Adventure,Fantasy"


### Data Cleaning
Several entries do not have a start year (the data marks missing values as \N), so first we will remove those. With the placeholders gone, the remaining years can all be treated as integer values, which makes them much easier to compare.
Also, since we are not working with any television shows, we can drop the endYear and titleType columns.

In [19]:
# Keep only rows with a known start year, then drop the TV-only columns.
movies_df = movies_df[movies_df['startYear'] != '\\N'].drop(columns=['endYear', 'titleType'])
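
One caveat: if pandas read startYear as text because of the \N placeholders, filtering alone leaves the remaining values as strings. A minimal sketch of the explicit conversion, using invented sample rows and assuming the column holds strings:

```python
import pandas as pd

# Made-up rows; startYear is text because of the "\N" placeholder.
sample = pd.DataFrame({
    "primaryTitle": ["Title A", "Title B", "Title C"],
    "startYear": ["1894", "\\N", "1905"],
})

# Drop the placeholder rows, then convert the remaining strings to integers.
cleaned = sample[sample["startYear"] != "\\N"].copy()
cleaned["startYear"] = cleaned["startYear"].astype(int)

# Numeric comparisons now behave as expected.
before_1900 = cleaned[cleaned["startYear"] < 1900]
print(before_1900)
```

The `.copy()` keeps the conversion from touching a slice of the original frame.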

Films that may not be relevant here (very small productions, films never released to a wide audience, or films from the turn of the century) generally have missing genre information, so I will drop the entries without at least one genre.

In [20]:
movies_df = movies_df[movies_df['genres'] !='\\N']

Now I am curious about the difference between the Original Title and the Primary Title. How often do they not match?

In [21]:
(movies_df['primaryTitle'] != movies_df['originalTitle']).value_counts()

False    116708
True      31315
dtype: int64

There are 31,315 instances of the Primary Title and the Original Title not matching. After looking at the titles of the films, it seems that the Primary Title is almost always the English title, while the Original Title is in the language of the country of origin.
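
For scale, 31315 out of 116708 + 31315 = 148023 films is roughly 21%. The same count and share can be read directly off the boolean comparison, since True counts as 1. A small sketch on invented rows (the non-matching original titles are placeholders, not real data):

```python
import pandas as pd

sample = pd.DataFrame({
    "primaryTitle": ["Title A", "Title B", "Title C", "Title D"],
    "originalTitle": ["Title A", "Placeholder B", "Title C", "Placeholder D"],
})

# Boolean Series: True where the two titles differ.
mismatch = sample["primaryTitle"] != sample["originalTitle"]

n_diff = int(mismatch.sum())   # number of mismatching rows
share = mismatch.mean()        # fraction of mismatching rows
print(n_diff, share)
```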

In [25]:
diffTitle_df = movies_df[movies_df['primaryTitle'] != movies_df['originalTitle']].drop(columns = ['originalTitle'])
sameTitle_df = movies_df[movies_df['primaryTitle'] == movies_df['originalTitle']].drop(columns=['originalTitle'])
diffTitle_df.head()

Unnamed: 0,tconst,primaryTitle,isAdult,startYear,runtimeMinutes,genres
6,tt0001258,The White Slave Trade,0,1910,45,Drama
12,tt0001790,"Les Misérables, Part 1: Jean Valjean",0,1913,60,Drama
15,tt0001911,Nell Gwynne,0,1911,50,"Biography,Drama,History"
16,tt0001964,The Traitress,0,1911,48,Drama
17,tt0002026,Anny - Story of a Prostitute,0,1912,68,"Drama,Romance"


The problem is that, after looking through more rows of this DataFrame, I found that some films were never given an English title: the primaryTitle and originalTitle values are the same, but neither is in English.

Let's take all the films whose primaryTitle and originalTitle match and check whether those titles are in English.

### TextBlob
TextBlob is a library for processing text data, and I will use it to detect the language of the film titles. The complete documentation can be found [here](https://textblob.readthedocs.io/en/dev/).

In [30]:
#b = TextBlob(sameTitle_df['primaryTitle'])
sameTitle_df['language'] = sameTitle_df['primaryTitle'].apply(lambda a: TextBlob(a).language)
#b.detect_language()

AttributeError: 'TextBlob' object has no attribute 'language'
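
The error occurs because TextBlob objects have no `language` attribute. Older TextBlob releases exposed a `detect_language()` method instead, which queried an external translation service. As far as I know it was deprecated in 0.16.0 and removed in later releases, so it may not be available in a current install. Calling it once per row would also be slow for this many titles.

Below is a minimal sketch of a detector-agnostic alternative. The helper `add_language_column` and the stand-in `toy_detect` are hypothetical names introduced here for illustration, not part of TextBlob or of this notebook. Any callable that maps a string to a language code can be passed in.

```python
import pandas as pd

def add_language_column(df, detect, title_col="primaryTitle"):
    """Return a copy of df with a 'language' column produced by `detect`."""
    cache = {}
    # Detect each distinct title once; repeated titles reuse the cached result.
    for title in df[title_col].dropna().unique():
        try:
            cache[title] = detect(title)
        except Exception:
            # Very short or symbol-only titles can make detectors fail.
            cache[title] = "unknown"
    out = df.copy()
    out["language"] = out[title_col].map(cache)
    return out

def toy_detect(text):
    """Hypothetical stand-in so the sketch runs without extra packages.

    It only checks for non-ASCII characters; it is NOT real language detection.
    """
    if not text.strip():
        raise ValueError("empty title")
    return "en" if text.isascii() else "other"

sample = pd.DataFrame({"primaryTitle": ["Miss Jerry", "Les Misérables", "Miss Jerry", " "]})
labeled = add_language_column(sample, toy_detect)
print(labeled)
```

In practice `toy_detect` would be swapped for a real detector, for example the `detect` function from the third-party langdetect package. That choice is an assumption on my part, and short titles are hard for any detector, so the results would still need spot checks.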