# Data Frame Creation

In [1]:
# import packages
import numpy as np
import pandas as pd

In [2]:
# creating the dfs
characters_df = pd.read_csv('data/movie_characters_metadata.txt', sep=r' \+\+\+\$\+\+\+ ',
                            names=['character_ID', 'character_name', 'movie_ID', 'movie_title', 'gender', 'credit_position'],
                            index_col='character_ID', engine='python', encoding='ISO-8859-1')
movies_df = pd.read_csv('data/movie_titles_metadata.txt', sep=r' \+\+\+\$\+\+\+ ',
                        names=['movie_ID', 'movie_title', 'movie_year', 'IMDB_rating', 'IMBD_votes', 'genres'],
                        index_col='movie_ID', engine='python', encoding='ISO-8859-1')
utterances_df = pd.read_csv('data/movie_lines.txt', sep=r' \+\+\+\$\+\+\+ ',
                            names=['line_ID', 'character_ID', 'movie_ID', 'character_name', 'utterance'],
                            index_col='line_ID', engine='python', encoding='ISO-8859-1')
conversations_df = pd.read_csv('data/movie_conversations.txt', sep=r' \+\+\+\$\+\+\+ ',
                               names=['character_ID1', 'character_ID2', 'movie_ID', 'dialogue'], engine='python')

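The fields in every file are separated by ' +++$+++ ', and the files have no header row. Because the separator is longer than one character, pandas treats it as a regular expression, which is why the `+` and `$` characters are escaped and `engine='python'` is set. The README describes what each column holds, so I used it to create the column names. Where it made sense, I used the leading ID column as the index of the data frame.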
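To illustrate how the separator is handled, here is a minimal sketch that parses two rows in the same layout from an in-memory string instead of the real file. The two rows are copied from the `characters_df.head()` output further down; the names `sample`, `SEP`, and `sample_df` are only for illustration.

```python
import io

import pandas as pd

# Two rows in the corpus layout, held in memory instead of read from disk
sample = (
    "u0 +++$+++ BIANCA +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ f +++$+++ 4\n"
    "u1 +++$+++ BRUCE +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ ? +++$+++ ?\n"
)

# Raw string so the backslashes reach the regex engine unchanged
SEP = r" \+\+\+\$\+\+\+ "

columns = ["character_ID", "character_name", "movie_ID",
           "movie_title", "gender", "credit_position"]

# A multi-character separator is interpreted as a regex and needs the python engine
sample_df = pd.read_csv(io.StringIO(sample), sep=SEP, names=columns,
                        index_col="character_ID", engine="python")
print(sample_df)
```

Writing the pattern as a raw string avoids the invalid-escape warnings that newer Python versions raise for `'\+'` in a normal string literal.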

### characters_df

This data frame lists all characters, their movie, their gender, and their credit position. The `character_ID` column is later referenced in other data frames.

In [3]:
characters_df.shape

(9035, 5)

In [4]:
characters_df.info()
# some characters are unnamed

<class 'pandas.core.frame.DataFrame'>
Index: 9035 entries, u0 to u9034
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   character_name   9033 non-null   object
 1   movie_ID         9035 non-null   object
 2   movie_title      9035 non-null   object
 3   gender           9035 non-null   object
 4   credit_position  9035 non-null   object
dtypes: object(5)
memory usage: 423.5+ KB


In [5]:
characters_df.head()

| character_ID | character_name | movie_ID | movie_title | gender | credit_position |
| --- | --- | --- | --- | --- | --- |
| u0 | BIANCA | m0 | 10 things i hate about you | f | 4 |
| u1 | BRUCE | m0 | 10 things i hate about you | ? | ? |
| u2 | CAMERON | m0 | 10 things i hate about you | m | 3 |
| u3 | CHASTITY | m0 | 10 things i hate about you | ? | ? |
| u4 | JOEY | m0 | 10 things i hate about you | m | 6 |


Some gender markers are missing for characters in this data frame (shown as `?`). I will work on filling in the missing data.

In [6]:
# how many characters are missing gender?
characters_df.groupby('gender').count()

| gender | character_name | movie_ID | movie_title | credit_position |
| --- | --- | --- | --- | --- |
| ? | 6018 | 6020 | 6020 | 6020 |
| F | 45 | 45 | 45 | 45 |
| M | 150 | 150 | 150 | 150 |
| f | 921 | 921 | 921 | 921 |
| m | 1899 | 1899 | 1899 | 1899 |


That is a lot of missing gender markers: 6,020 of the 9,035 characters have `?` for gender. The flags are also inconsistent, with both 'm' and 'M' (and both 'f' and 'F') in use.
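One possible way to tidy these flags is to lowercase them and turn `?` into a proper missing value. This is a sketch on a small made-up Series, not the cleaning actually applied to `characters_df`; the names `genders`, `lowered`, `normalized`, and `counts` are hypothetical.

```python
import pandas as pd

# Made-up sample covering every flag seen in the groupby output
genders = pd.Series(["f", "?", "m", "M", "F", "?"], name="gender")

# Lowercase so that 'M'/'m' and 'F'/'f' collapse into one flag each
lowered = genders.str.lower()

# mask() replaces values with NaN wherever the condition is True
normalized = lowered.mask(lowered == "?")

# dropna=False keeps the missing values visible in the tally
counts = normalized.value_counts(dropna=False)
print(counts)
```

Storing unknown genders as NaN rather than `?` means pandas functions such as `isna()` and `count()` will treat them as missing without extra handling.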

### movies_df

This data frame lists all movies, their year, IMDb rating, IMDb votes, and their genres. The `movie_ID` column is later referenced in other data frames.

In [7]:
movies_df.shape

(617, 5)

In [8]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 617 entries, m0 to m616
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   movie_title  617 non-null    object 
 1   movie_year   617 non-null    object 
 2   IMDB_rating  617 non-null    float64
 3   IMBD_votes   617 non-null    int64  
 4   genres       617 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 28.9+ KB


In [9]:
movies_df.head()

| movie_ID | movie_title | movie_year | IMDB_rating | IMBD_votes | genres |
| --- | --- | --- | --- | --- | --- |
| m0 | 10 things i hate about you | 1999 | 6.9 | 62847 | ['comedy', 'romance'] |
| m1 | 1492: conquest of paradise | 1992 | 6.2 | 10421 | ['adventure', 'biography', 'drama', 'history'] |
| m2 | 15 minutes | 2001 | 6.1 | 25854 | ['action', 'crime', 'drama', 'thriller'] |
| m3 | 2001: a space odyssey | 1968 | 8.4 | 163227 | ['adventure', 'mystery', 'sci-fi'] |
| m4 | 48 hrs. | 1982 | 6.9 | 22289 | ['action', 'comedy', 'crime', 'drama', 'thrill... |


In [10]:
# what is the earliest movie in the df?
movies_df['movie_year'].min()

'1927'

In [11]:
# what is the most recent movie in the df?
movies_df['movie_year'].max()

'2010'

In [12]:
# how many movies per year?
movies_df.groupby('movie_year').count()

| movie_year | movie_title | IMDB_rating | IMBD_votes | genres |
| --- | --- | --- | --- | --- |
| 1927 | 2 | 2 | 2 | 2 |
| 1931 | 2 | 2 | 2 | 2 |
| 1932 | 4 | 4 | 4 | 4 |
| 1933 | 2 | 2 | 2 | 2 |
| 1934 | 3 | 3 | 3 | 3 |
| ... | ... | ... | ... | ... |
| 2007/I | 1 | 1 | 1 | 1 |
| 2008 | 1 | 1 | 1 | 1 |
| 2009 | 3 | 3 | 3 | 3 |
| 2009/I | 1 | 1 | 1 | 1 |


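Some of this data will need to be cleaned up as well: `movie_year` is stored as strings, and some values carry a suffix such as `2007/I`, so the `min()` and `max()` calls above compare strings rather than numbers.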
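A rough sketch of one way to clean the year column is to pull out the four-digit year and convert it to an integer. The sample values come from the groupby output above; the names `years` and `year_int` are hypothetical, and this assumes every value starts with a four-digit year, which I have not verified for the full column.

```python
import pandas as pd

# Sample year strings taken from the groupby output, including suffixed ones
years = pd.Series(["1927", "2007/I", "2008", "2009/I"], name="movie_year")

# Capture the first run of four digits, then convert to integers
year_int = years.str.extract(r"(\d{4})", expand=False).astype(int)

print(year_int.min(), year_int.max())
```

With integer years, `min()`, `max()`, and grouping by year behave numerically, and entries like `2009` and `2009/I` fall into the same group.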

### utterances_df

This data frame lists the movie lines (utterances) and the character speaking. The `line_ID` column is referenced in the `conversations_df`.

In [13]:
utterances_df.shape

(304713, 4)

In [14]:
utterances_df.info()
# looks like there may be some missing information here

<class 'pandas.core.frame.DataFrame'>
Index: 304713 entries, L1045 to L666256
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   character_ID    304713 non-null  object
 1   movie_ID        304713 non-null  object
 2   character_name  304670 non-null  object
 3   utterance       304446 non-null  object
dtypes: object(4)
memory usage: 11.6+ MB


In [15]:
utterances_df.head()

| line_ID | character_ID | movie_ID | character_name | utterance |
| --- | --- | --- | --- | --- |
| L1045 | u0 | m0 | BIANCA | They do not! |
| L1044 | u2 | m0 | CAMERON | They do to! |
| L985 | u0 | m0 | BIANCA | I hope so. |
| L984 | u2 | m0 | CAMERON | She okay? |
| L925 | u0 | m0 | BIANCA | Let's go. |
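
The `info()` output for this data frame shows fewer non-null values in `character_name` and `utterance` than there are rows. A quick way to see how many values are missing, and in which rows, could look like the sketch below. It runs on a made-up toy frame (the NaN positions are invented for illustration), and the names `toy`, `missing_per_column`, and `rows_with_gaps` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame in the shape of utterances_df; the gaps are invented for illustration
toy = pd.DataFrame(
    {
        "character_name": ["BIANCA", np.nan, "CAMERON"],
        "utterance": ["They do not!", "I hope so.", np.nan],
    },
    index=pd.Index(["L1045", "L985", "L984"], name="line_ID"),
)

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows that have at least one missing value
rows_with_gaps = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```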


### conversations_df

This data frame lists the `character_ID`s for the two characters in the conversation, the `movie_ID`, and the dialogue through references to `line_ID`s.

In [16]:
conversations_df.shape

(83097, 4)

In [17]:
conversations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83097 entries, 0 to 83096
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   character_ID1  83097 non-null  object
 1   character_ID2  83097 non-null  object
 2   movie_ID       83097 non-null  object
 3   dialogue       83097 non-null  object
dtypes: object(4)
memory usage: 2.5+ MB


In [18]:
conversations_df.head()

|  | character_ID1 | character_ID2 | movie_ID | dialogue |
| --- | --- | --- | --- | --- |
| 0 | u0 | u2 | m0 | ['L194', 'L195', 'L196', 'L197'] |
| 1 | u0 | u2 | m0 | ['L198', 'L199'] |
| 2 | u0 | u2 | m0 | ['L200', 'L201', 'L202', 'L203'] |
| 3 | u0 | u2 | m0 | ['L204', 'L205', 'L206'] |
| 4 | u0 | u2 | m0 | ['L207', 'L208'] |


This data frame will only make sense once it is joined with the others to bring in character names, movie names, and the utterances (stored as `line_ID` references in the `dialogue` column).
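
As a sketch of how such a join might work: the `dialogue` values are strings that look like Python lists, so they can be parsed with `ast.literal_eval`, expanded to one row per `line_ID`, and merged with the utterances. The toy frames below mimic the real layout, but the utterance texts and speaker assignments are placeholders I made up, and all variable names are hypothetical.

```python
import ast

import pandas as pd

# Toy conversation row in the same layout as conversations_df
conversations = pd.DataFrame({
    "character_ID1": ["u0"],
    "character_ID2": ["u2"],
    "movie_ID": ["m0"],
    "dialogue": ["['L198', 'L199']"],
})

# Toy utterances indexed by line_ID; speakers and texts are placeholders
utterances = pd.DataFrame(
    {"character_ID": ["u2", "u0"],
     "utterance": ["placeholder line A", "placeholder line B"]},
    index=pd.Index(["L198", "L199"], name="line_ID"),
)

# Parse the list-like string into a real Python list of line IDs
conversations["line_ID"] = conversations["dialogue"].apply(ast.literal_eval)

# One row per line, keeping the original row number as a conversation ID
long_df = (conversations.explode("line_ID")
           .rename_axis("conversation_ID")
           .reset_index())

# Attach speaker and text to every line of every conversation
merged = long_df.merge(utterances, left_on="line_ID", right_index=True, how="left")
print(merged[["conversation_ID", "line_ID", "character_ID", "utterance"]])
```

A left merge keeps every line of every conversation even if its `line_ID` has no match in the utterances, which makes missing lines easy to spot afterwards.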

### Data overview

The corpus consists of 304,713 utterances from 83,097 conversations between 9,035 characters in 617 movies released between 1927 and 2010. Metadata are available for both characters and movies. I will have to join several data frames to bring the relevant information into `conversations_df`, which is where much of my analysis will be done.