**Import Section**

Importing the relevant packages for the project.

Note: We will ignore warnings for simplicity.

In [918]:
import numpy as py
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

**Creating initial dataframes**

The following files are available for the project:

1. movie_characters_metadata.tsv
2. movie_conversations.tsv
3. movie_lines.tsv
4. movie_titles_metadata.tsv
5. raw_script_urls.tsv


Here is the list of initial dataframes for the respective files:
1. characters_df
2. conversations_df
3. lines_df
4. titles_df
5. rawScript_df


General File Features:
1. Files are tab-separated ('\t').
2. None of the files has a header or footer row.
3. A few bad rows are present in the files. These rows are dropped during dataframe creation.
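
The three file features above map directly onto `pd.read_csv` arguments. The following is a minimal sketch, assuming a hypothetical helper named `load_tsv` and a made-up in-memory sample in place of the real files:

```python
import io

import pandas as pd


def load_tsv(source, column_names):
    """Hypothetical helper: read a headerless, tab-separated source."""
    return pd.read_csv(
        source,
        sep="\t",             # feature 1: tab-separated values
        header=None,          # feature 2: no header row in the file
        names=column_names,   # column names are supplied by the caller
        on_bad_lines="skip",  # feature 3: skip rows with too many fields
    )


# Made-up sample: the second row carries two extra fields and is skipped.
sample = io.StringIO(
    "u0\tBIANCA\tm0\n"
    "u1\tBRUCE\tm0\textra\tfields\n"
    "u2\tCAMERON\tm0\n"
)
demo_df = load_tsv(sample, ["chId", "chName", "mId"])
print(demo_df)
```

Note that rows with too few fields are not skipped: pandas pads them with NaN, which is why the NaN checks in the following cells are still needed.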


In [919]:
##Creating characters_df dataframe from movie_characters_metadata.tsv

column_names = ['chId', 'chName', 'mId', 'mName', 'gender', 'posCredits']
characters_df = pd.read_csv('movie_characters_metadata.tsv', sep='\t', header=None, names=column_names, on_bad_lines='skip')
print("SUCCESS : 'characters_df' dataframe created from 'movie_characters_metadata.tsv'\n")
print('Here is a snapshot of data')
characters_df.head()


SUCCESS : 'characters_df' dataframe created from 'movie_characters_metadata.tsv'

Here is a snapshot of data


Unnamed: 0,chId,chName,mId,mName,gender,posCredits
0,u0,BIANCA,m0,10 things i hate about you,f,4
1,u1,BRUCE,m0,10 things i hate about you,?,?
2,u2,CAMERON,m0,10 things i hate about you,m,3
3,u3,CHASTITY,m0,10 things i hate about you,?,?
4,u4,JOEY,m0,10 things i hate about you,m,6


In [920]:
print("Dimension & metadata of characters_df",characters_df.shape)
print(characters_df.info())

Dimension & metadata of characters_df (9034, 6)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9034 entries, 0 to 9033
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   chId        9034 non-null   object
 1   chName      9015 non-null   object
 2   mId         9017 non-null   object
 3   mName       9017 non-null   object
 4   gender      9017 non-null   object
 5   posCredits  9017 non-null   object
dtypes: object(6)
memory usage: 423.6+ KB
None


In [921]:
##Checking on the percentage of NaN for each column in the dataframe

nan_percentage = (characters_df.isna().mean() * 100).round(2)
print("Distribution Of NaN across dataframe")
print(nan_percentage)

Distribution Of NaN across dataframe
chId          0.00
chName        0.21
mId           0.19
mName         0.19
gender        0.19
posCredits    0.19
dtype: float64


In [922]:
num_rows_dropped = characters_df.isna().any(axis=1).sum()
print("Number of rows that would be dropped due to NaN:", num_rows_dropped)

Number of rows that would be dropped due to NaN: 19


In [923]:
# Removing rows with NaN values
characters_df = characters_df.dropna()

In [924]:
# Remove duplicate rows
duplicate_counts = characters_df.duplicated().sum()
print("Total duplicate", duplicate_counts)
if duplicate_counts > 0:
  characters_df = characters_df.drop_duplicates()

Total duplicate 0


**Data Cleansing: Removing Ambiguity From The Column Gender**

In this project, gender is divided into two groups: "Male" (M) and "Female" (F).

During the initial data analysis, five distinct values were found in the gender column ('?', 'm', 'f', 'M', 'F'). Further investigation confirmed that 'm' and 'M' both denote the male class, and the remaining known values ('f' and 'F') denote the female class. Rows with '?' as the gender are dropped: chId is the identifier column of the dataframe and no chId carries more than one gender value, so the missing gender cannot be recovered from another row.

In this dataframe, the 'Male' class is represented as 'M' and the 'Female' class as 'F'.
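
The normalisation described above can be sketched on a small made-up frame. This is an illustrative variant, not the exact cell used below: it upper-cases the known values instead of mapping everything that is not male to 'F', so an unexpected value would stay visible. The names `demo_df` and `known_df` are assumptions.

```python
import pandas as pd

# Made-up sample covering every raw value seen in the gender column.
demo_df = pd.DataFrame({
    "chId": ["u0", "u1", "u2", "u3", "u4"],
    "gender": ["f", "?", "m", "M", "F"],
})

# Drop rows with an unknown gender; .copy() avoids writing into a view.
known_df = demo_df[demo_df["gender"] != "?"].copy()

# Collapse 'm'/'M' into 'M' and 'f'/'F' into 'F'.
known_df["gender"] = known_df["gender"].str.upper()
print(known_df["gender"].value_counts())
```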

In [925]:
##Count of values of each gender type
characters_df.gender.value_counts()

gender
?    6006
m    1899
f     921
M     145
F      44
Name: count, dtype: int64

In [926]:
##Data Analysis to check if any chId has multiple gender assigned to it.
##Result: None found
grouped = characters_df.groupby('chId')['gender'].nunique()
chId_with_multiple_genders = grouped[grouped > 1].index.tolist()
print("chId values with more than one gender:", len(chId_with_multiple_genders))

chId values with more than one gender: 0


In [927]:
##Removing data ambiguity from gender
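# .copy() makes the filtered frame independent, so the assignment below does not write into a view
characters_df = characters_df[characters_df.gender != '?'].copy()
characters_df['gender'] = characters_df['gender'].apply(lambda g: 'M' if g in ['m', 'M'] else 'F')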
print("Data Distribution Per Gender:")
characters_df.gender.value_counts()

Data Distribution Per Gender:


gender
M    2044
F     965
Name: count, dtype: int64

**Data Cleansing: Removing non-numeric values from the column posCredits**

As per the column definition, 'posCredits' is a numeric column giving the position at which the character appears in the credits, so it cannot hold non-numeric values.

Since a significant number of rows (330) contain '?', we replace it with -1 instead of dropping those rows. The value -1 indicates that the character did not appear in the credits.
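
A minimal sketch of the sentinel replacement on a made-up series follows. It coerces anything non-numeric to NaN first, which is an alternative to replacing the literal '?' and is shown here only for illustration; `demo_credits` and `clean_credits` are assumed names.

```python
import pandas as pd

# Made-up sample: credit positions stored as strings, with '?' for unknowns.
demo_credits = pd.Series(["4", "?", "3", "?", "6"], name="posCredits")

# Non-numeric entries become NaN, then NaN is replaced by the -1 sentinel.
numeric_credits = pd.to_numeric(demo_credits, errors="coerce")
clean_credits = numeric_credits.fillna(-1).astype(int)
print(clean_credits.tolist())
```

Because -1 is a sentinel rather than a real position, summary statistics such as the mean should be computed on rows where the value is not -1.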


In [928]:
##Find the non-numeric value for posCredits column
non_numeric_posCredits = characters_df.loc[pd.to_numeric(characters_df['posCredits'], errors='coerce').isna(), 'posCredits'].unique()
print("Non-numeric values in 'posCredits' column:", non_numeric_posCredits)


#Examine the distribution of male & female with this junk data
filtered_df = characters_df[characters_df['posCredits'] == '?']
# Count the number of males and females
male_count = filtered_df[filtered_df['gender'] == 'M'].shape[0]
female_count = filtered_df[filtered_df['gender'] == 'F'].shape[0]

print("Number of males with posCredits = '?':", male_count)
print("Number of females with posCredits = '?':", female_count)

Non-numeric values in 'posCredits' column: ['?']
Number of males with posCredits = '?': 225
Number of females with posCredits = '?': 105


In [929]:
characters_df['posCredits'] = characters_df['posCredits'].replace('?', -1).astype(int)
characters_df.describe()

Unnamed: 0,posCredits
count,3009.0
mean,17.641741
std,111.34138
min,-1.0
25%,1.0
50%,3.0
75%,6.0
max,1000.0


**Movie Title Dataframe**

This dataframe is created from the movie_titles_metadata.tsv file

In [930]:
##Creating the dataframe : titles_df

column_names = ['mId', 'mName', 'mYear', 'mRating', 'mVotes', 'mGenre']
titles_df = pd.read_csv('movie_titles_metadata.tsv', sep='\t', names=column_names, header=None, on_bad_lines='skip')
print("SUCCESS : 'titles_df' dataframe created from 'movie_titles_metadata.tsv'\n")
print('Here is a snapshot of data')
titles_df.head()

SUCCESS : 'titles_df' dataframe created from 'movie_titles_metadata.tsv'

Here is a snapshot of data


Unnamed: 0,mId,mName,mYear,mRating,mVotes,mGenre
0,m0,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
1,m1,1492: conquest of paradise,1992,6.2,10421.0,['adventure' 'biography' 'drama' 'history']
2,m2,15 minutes,2001,6.1,25854.0,['action' 'crime' 'drama' 'thriller']
3,m3,2001: a space odyssey,1968,8.4,163227.0,['adventure' 'mystery' 'sci-fi']
4,m4,48 hrs.,1982,6.9,22289.0,['action' 'comedy' 'crime' 'drama' 'thriller']


In [931]:
print("Dimension & metadata of the dataframe",titles_df.shape)
print(titles_df.info())


Dimension & metadata of the dataframe (617, 6)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 617 entries, 0 to 616
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   mId      617 non-null    object 
 1   mName    616 non-null    object 
 2   mYear    616 non-null    object 
 3   mRating  616 non-null    float64
 4   mVotes   616 non-null    float64
 5   mGenre   616 non-null    object 
dtypes: float64(2), object(4)
memory usage: 29.1+ KB
None


**Since very few rows contain NaN, we drop them**

In [932]:
print("Distribution of NAN across columns in the dataframe")
nan_percentage = (titles_df.isna().mean() * 100).round(2)
print(nan_percentage)

Distribution of NAN across columns in the dataframe
mId        0.00
mName      0.16
mYear      0.16
mRating    0.16
mVotes     0.16
mGenre     0.16
dtype: float64


In [933]:
num_rows_dropped = titles_df.isna().any(axis=1).sum()
print("Number of rows that would be dropped due to NaN:", num_rows_dropped)

Number of rows that would be dropped due to NaN: 1


In [934]:
# Remove rows with NaN values
titles_df = titles_df.dropna()

**No Duplicates to Delete**

In [935]:
# Remove duplicate rows
duplicate_counts = titles_df.duplicated().sum()
print("Total Number of duplicate rows", duplicate_counts)
if duplicate_counts > 0:
  titles_df = titles_df.drop_duplicates()

Total Number of duplicate rows 0


**No data ambiguity in the numeric columns for the dataframe**

In [936]:
titles_df.describe()

Unnamed: 0,mRating,mVotes
count,616.0,616.0
mean,6.865584,49901.698052
std,1.215463,61898.367352
min,2.5,9.0
25%,6.2,9992.5
50%,7.0,27121.5
75%,7.8,66890.0
max,9.3,419312.0


**Data Cleansing: Removing Data Ambiguity from Year Column**

The column 'mYear' contains the release year of the movie. The year cannot be non-numeric and must be in 'YYYY' format.

As shown below, some rows contain the year followed by a suffix such as '/I'. As part of preprocessing, the year is extracted from the string.
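
The extraction step can be sketched on a few made-up values. This is an illustrative variant that matches exactly four digits; `demo_years` and `clean_years` are assumed names.

```python
import pandas as pd

# Made-up sample: plain years mixed with suffixed values such as '1989/I'.
demo_years = pd.Series(["1999", "1989/I", "2004/I"], name="mYear")

# Capture the first run of four digits, then convert to integers.
clean_years = demo_years.str.extract(r"(\d{4})", expand=False).astype(int)
print(clean_years.tolist())
```

The raw string prefix (`r"..."`) keeps `\d` from being read as a string escape sequence.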


In [937]:
column_name = 'mYear' 

# Filter non-numeric values in the specified column
non_numeric_values = titles_df.loc[pd.to_numeric(titles_df[column_name], errors='coerce').isna(), column_name]

# Get unique non-numeric values
unique_non_numeric_values = non_numeric_values.unique()

# Display the unique non-numeric values
print("Unique non-numeric values in column '{}':".format(column_name))
print(unique_non_numeric_values)

Unique non-numeric values in column 'mYear':
['1989/I' '1990/I' '1995/I' '1998/I' '2004/I' '2007/I' '1992/I' '2005/I'
 '2002/I' '1968/I' '1996/I' '2000/I' '2009/I' '2003/I']


In [938]:
column_name = 'mYear'

# Extract numeric portion using regular expression
titles_df[column_name] = titles_df[column_name].str.extract(r'(\d+)', expand=False)
titles_df[column_name] = titles_df[column_name].astype(int)


**Final Data Snapshot**

In [939]:
titles_df.head()

Unnamed: 0,mId,mName,mYear,mRating,mVotes,mGenre
0,m0,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
1,m1,1492: conquest of paradise,1992,6.2,10421.0,['adventure' 'biography' 'drama' 'history']
2,m2,15 minutes,2001,6.1,25854.0,['action' 'crime' 'drama' 'thriller']
3,m3,2001: a space odyssey,1968,8.4,163227.0,['adventure' 'mystery' 'sci-fi']
4,m4,48 hrs.,1982,6.9,22289.0,['action' 'comedy' 'crime' 'drama' 'thriller']


**Movie Lines**

In [940]:
column_names = ['lId', 'chId', 'mId', 'chName', 'chLine']
lines_df = pd.read_csv('movie_lines.tsv', sep='\t', header=None,names=column_names, on_bad_lines='skip')
lines_df.head()

Unnamed: 0,lId,chId,mId,chName,chLine
0,L1045,u0,m0,BIANCA,They do not!
1,L1044,u2,m0,CAMERON,They do to!
2,L985,u0,m0,BIANCA,I hope so.
3,L984,u2,m0,CAMERON,She okay?
4,L925,u0,m0,BIANCA,Let's go.


In [941]:
print(lines_df.shape)
print(lines_df.info())

(293202, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293202 entries, 0 to 293201
Data columns (total 5 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   lId     293202 non-null  object
 1   chId    288917 non-null  object
 2   mId     288917 non-null  object
 3   chName  288874 non-null  object
 4   chLine  288663 non-null  object
dtypes: object(5)
memory usage: 11.2+ MB
None


In [942]:
nan_percentage = (lines_df.isna().mean() * 100).round(2)
print(nan_percentage)

lId       0.00
chId      1.46
mId       1.46
chName    1.48
chLine    1.55
dtype: float64


In [943]:
num_rows_dropped = lines_df.isna().any(axis=1).sum()
print("Number of rows that would be dropped due to NaN:", num_rows_dropped)

Number of rows that would be dropped due to NaN: 4582


In [944]:
lines_df = lines_df.dropna()

In [945]:
# Remove duplicate rows
duplicate_counts = lines_df.duplicated().sum()
print("Total duplicates: ",duplicate_counts)
if duplicate_counts > 0:
  lines_df = lines_df.drop_duplicates()

Total duplicates:  0


In [946]:
print(lines_df.shape)

(288620, 5)


In [947]:
column_names = ['chId1', 'chId2','mId', 'lineList']
conversations_df = pd.read_csv('movie_conversations.tsv', sep='\t',names=column_names, header=None, on_bad_lines='skip')
conversations_df.head()

Unnamed: 0,chId1,chId2,mId,lineList
0,u0,u2,m0,['L194' 'L195' 'L196' 'L197']
1,u0,u2,m0,['L198' 'L199']
2,u0,u2,m0,['L200' 'L201' 'L202' 'L203']
3,u0,u2,m0,['L204' 'L205' 'L206']
4,u0,u2,m0,['L207' 'L208']


In [948]:
nan_percentage = (conversations_df.isna().mean() * 100).round(2)
print(nan_percentage)

chId1       0.0
chId2       0.0
mId         0.0
lineList    0.0
dtype: float64


In [949]:
# Remove duplicate rows from the conversations dataframe
duplicate_counts = conversations_df.duplicated().sum()
print("Total Number Of Duplicates: ", duplicate_counts)
if duplicate_counts > 0:
  conversations_df = conversations_df.drop_duplicates()

Total Number Of Duplicates:  0


In [950]:
conversations_df.shape

(83097, 4)

In [951]:
conversations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83097 entries, 0 to 83096
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   chId1     83097 non-null  object
 1   chId2     83097 non-null  object
 2   mId       83097 non-null  object
 3   lineList  83097 non-null  object
dtypes: object(4)
memory usage: 2.5+ MB


In [952]:
column_names = ['mId','mName','url']
rawScript_df = pd.read_csv('raw_script_urls.tsv', sep='\t', names=column_names, header=None,on_bad_lines='skip')
rawScript_df.head()

Unnamed: 0,mId,mName,url
0,m0,10 things i hate about you,http://www.dailyscript.com/scripts/10Things.html
1,m1,1492: conquest of paradise,http://www.hundland.org/scripts/1492-ConquestO...
2,m2,15 minutes,http://www.dailyscript.com/scripts/15minutes.html
3,m3,2001: a space odyssey,http://www.scifiscripts.com/scripts/2001.txt
4,m4,48 hrs.,http://www.awesomefilm.com/script/48hours.txt


In [953]:
rawScript_df.shape

(617, 3)

In [954]:
nan_percentage = (rawScript_df.isna().mean() * 100).round(2)
print(nan_percentage)

mId      0.00
mName    0.16
url      0.16
dtype: float64


In [955]:
num_rows_dropped = rawScript_df.isna().any(axis=1).sum()
print("Number of rows that would be dropped due to NaN:", num_rows_dropped)

Number of rows that would be dropped due to NaN: 1


In [956]:
# Remove rows with NaN values
rawScript_df = rawScript_df.dropna()

In [957]:
# Remove duplicate rows
duplicate_counts = rawScript_df.duplicated().sum()
print("Total Number Of Duplicates: ", duplicate_counts)
if duplicate_counts > 0:
  rawScript_df = rawScript_df.drop_duplicates()

Total Number Of Duplicates:  0


In [958]:
rawScript_df.shape

(616, 3)

Here is the list of initial dataframes:
1. characters_df
2. titles_df
3. conversations_df
4. rawScript_df
5. lines_df

In [959]:
unique_movies_count = characters_df['mId'].nunique()
print("Number of unique movies:", unique_movies_count)

Number of unique movies: 599


In [960]:
# Movies in titles_df but not in characters_df
movies_in_titles_not_in_characters = set(titles_df['mId']) - set(characters_df['mId'])

# Movies in characters_df but not in titles_df
movies_in_characters_not_in_titles = set(characters_df['mId']) - set(titles_df['mId'])

# Convert the sets to lists for easier handling or further analysis
movies_in_titles_not_in_characters_list = list(movies_in_titles_not_in_characters)
movies_in_characters_not_in_titles_list = list(movies_in_characters_not_in_titles)

print("Movies in titles_df but not in characters_df:", movies_in_titles_not_in_characters_list)
print("Movies in characters_df but not in titles_df:", movies_in_characters_not_in_titles_list)


Movies in titles_df but not in characters_df: ['m237', 'm535', 'm602', 'm412', 'm270', 'm406', 'm600', 'm321', 'm338', 'm614', 'm364', 'm484', 'm456', 'm483', 'm521', 'm616', 'm135']
Movies in characters_df but not in titles_df: []
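
The same coverage check can also be expressed with an outer merge and pandas' `indicator=True` option, which labels each key as `left_only`, `right_only` or `both`. The sketch below uses made-up frames (`titles_demo`, `characters_demo`) instead of the real dataframes:

```python
import pandas as pd

# Made-up stand-ins for titles_df and characters_df.
titles_demo = pd.DataFrame({"mId": ["m0", "m1", "m2"]})
characters_demo = pd.DataFrame({
    "mId": ["m0", "m0", "m2"],
    "chId": ["u0", "u2", "u9"],
})

# One row per movie on each side, then an outer merge with a _merge label.
coverage = titles_demo.merge(
    characters_demo[["mId"]].drop_duplicates(),
    on="mId",
    how="outer",
    indicator=True,
)

# Movies present in the titles frame but absent from the characters frame.
only_in_titles = coverage.loc[coverage["_merge"] == "left_only", "mId"].tolist()
only_in_characters = coverage.loc[coverage["_merge"] == "right_only", "mId"].tolist()
print(only_in_titles, only_in_characters)
```

Compared with set differences, this keeps the result as a dataframe, which can be convenient when further columns need to be inspected for the unmatched movies.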


In [961]:
detailed_characters_df = characters_df.merge(titles_df[['mId', 'mYear', 'mRating', 'mVotes', 'mGenre']], on='mId', how='left')
print("Dimension Of Characters Dataframe",characters_df.shape)
print("Dimension Of Merged Character Dataframe",detailed_characters_df.shape)
detailed_characters_df.head()

Dimension Of Characters Dataframe (3009, 6)
Dimension Of Merged Character Dataframe (3009, 10)


Unnamed: 0,chId,chName,mId,mName,gender,posCredits,mYear,mRating,mVotes,mGenre
0,u0,BIANCA,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance']
1,u2,CAMERON,m0,10 things i hate about you,M,3,1999,6.9,62847.0,['comedy' 'romance']
2,u4,JOEY,m0,10 things i hate about you,M,6,1999,6.9,62847.0,['comedy' 'romance']
3,u5,KAT,m0,10 things i hate about you,F,2,1999,6.9,62847.0,['comedy' 'romance']
4,u6,MANDELLA,m0,10 things i hate about you,F,7,1999,6.9,62847.0,['comedy' 'romance']


In [962]:
duplicate_counts = detailed_characters_df.duplicated().sum()
print("Total Number Of Duplicates: ", duplicate_counts)

Total Number Of Duplicates:  0


In [963]:
nan_percentage = (detailed_characters_df.isna().mean() * 100).round(2)
print(nan_percentage)

chId          0.0
chName        0.0
mId           0.0
mName         0.0
gender        0.0
posCredits    0.0
mYear         0.0
mRating       0.0
mVotes        0.0
mGenre        0.0
dtype: float64


In [964]:
# Characters in lines_df but not in detailed_characters_df
characters_in_lines_not_in_merged = set(lines_df['chId']) - set(detailed_characters_df['chId'])

# Characters in detailed_characters_df but not in lines_df
characters_in_merged_not_in_lines = set(detailed_characters_df['chId']) - set(lines_df['chId'])

# Convert the sets to lists for easier handling or further analysis
characters_in_lines_not_in_merged_list = list(characters_in_lines_not_in_merged)
characters_in_merged_not_in_lines_list = list(characters_in_merged_not_in_lines)

print("Characters in lines_df but not in detailed_characters_df:", characters_in_lines_not_in_merged_list)
print("Characters in detailed_characters_df but not in lines_df:", characters_in_merged_not_in_lines_list)


Characters in lines_df but not in detailed_characters_df: ['u4095', 'u5463', 'u6293', 'u3271', 'u815', 'u4259', 'u1085', 'u8313', 'u3160', 'u5896', 'u7278', 'u2964', 'u5149', 'u4669', 'u1876', 'u1808', 'u316', 'u3692', 'u3347', 'u2258', 'u7907', 'u518', 'u448', 'u3721', 'u1683', 'u7882', 'u1075', 'u3855', 'u4947', 'u484', 'u107', 'u4371', 'u2145', 'u1337', 'u2437', 'u5619', 'u410', 'u3038', 'u5879', 'u7095', 'u3958', 'u1914', 'u6097', 'u6644', 'u1590', 'u3320', 'u1291', 'u149', 'u3128', 'u3950', 'u1942', 'u7726', 'u1033', 'u3790', 'u3925', 'u6570', 'u4254', 'u7241', 'u1439', 'u7609', 'u2396', 'u4944', 'u7685', 'u8921', 'u1168', 'u5775', 'u6244', 'u2257', 'u1401', 'u1250', 'u2694', 'u6485', 'u1690', 'u6888', 'u9027', 'u647', 'u1726', 'u396', 'u7466', 'u5786', 'u2703', 'u6973', 'u7393', 'u1182', 'u5765', 'u6275', 'u568', 'u7861', 'u5955', 'u7549', 'u2713', 'u6764', 'u4272', 'u1316', 'u3328', 'u4313', 'u6272', 'u4816', 'u6307', 'u9010', 'u8485', 'u4229', 'u4821', 'u6441', 'u2548', 'u8116'

In [965]:
##Once lId is removed, the same character may repeat a line, so duplicate rows can appear
lines_df_mod = lines_df[['chId', 'mId', 'chName', 'chLine']]

# Remove duplicate rows
duplicate_counts = lines_df_mod.duplicated().sum()
print("Total Number Of Duplicates: ", duplicate_counts)
if duplicate_counts > 0:
   lines_df_mod = lines_df_mod.drop_duplicates()
   
lines_df_mod.shape

Total Number Of Duplicates:  5438


(283182, 4)

In [966]:
df = detailed_characters_df.merge(lines_df_mod, on=['chId', 'mId'], how='inner')
df.head()

Unnamed: 0,chId,chName_x,mId,mName,gender,posCredits,mYear,mRating,mVotes,mGenre,chName_y,chLine
0,u0,BIANCA,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],BIANCA,They do not!
1,u0,BIANCA,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],BIANCA,I hope so.
2,u0,BIANCA,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],BIANCA,Let's go.
3,u0,BIANCA,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],BIANCA,Okay -- you're gonna need to learn how to lie.
4,u0,BIANCA,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],BIANCA,Like my fear of wearing pastels?


In [967]:
# Filter rows where chName_x does not match chName_y
mismatched_names_df = df[df['chName_x'] != df['chName_y']]

# Display the rows with mismatched names
print("Rows with mismatched chName_x and chName_y:")
print(mismatched_names_df)

Rows with mismatched chName_x and chName_y:
Empty DataFrame
Columns: [chId, chName_x, mId, mName, gender, posCredits, mYear, mRating, mVotes, mGenre, chName_y, chLine]
Index: []


In [968]:
df['chName'] = df['chName_x']
df = df.drop(['chName_y','chName_x'], axis=1)
df.head()

Unnamed: 0,chId,mId,mName,gender,posCredits,mYear,mRating,mVotes,mGenre,chLine,chName
0,u0,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],They do not!,BIANCA
1,u0,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],I hope so.,BIANCA
2,u0,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],Let's go.,BIANCA
3,u0,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],Okay -- you're gonna need to learn how to lie.,BIANCA
4,u0,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],Like my fear of wearing pastels?,BIANCA


In [969]:
nan_percentage = (df.isna().mean() * 100).round(2)
print(nan_percentage)

chId          0.0
mId           0.0
mName         0.0
gender        0.0
posCredits    0.0
mYear         0.0
mRating       0.0
mVotes        0.0
mGenre        0.0
chLine        0.0
chName        0.0
dtype: float64


In [970]:
# Remove duplicate rows
duplicate_counts = df.duplicated().sum()
print("Total Number Of Duplicates: ", duplicate_counts)
if duplicate_counts > 0:
   df = df.drop_duplicates()
df.shape

Total Number Of Duplicates:  0


(224203, 11)

In [971]:
##Get the number of unique dialogues spoken by each character in each movie
df['dialoguesCount'] = df.groupby(['chId', 'mId'])['chLine'].transform('count')

##Get the number of characters per dialogue
df['charCountPerDialogue'] = df.chLine.str.len()

##Get the number of words per dialogue
df['wordCountPerDialogue'] = df.chLine.str.count(' ') + 1
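

Counting spaces and adding one gives the word count only when words are separated by single spaces and the line has no leading or trailing space. As a hedged alternative (not the method used in this notebook), splitting on whitespace counts tokens directly. The sample lines and variable names below are made up for illustration:

```python
import pandas as pd

# Made-up lines: one clean, one with a double space, one with padding spaces.
demo_lines = pd.Series(["They do not!", "I  hope so.", " Let's go. "])

# Approach used in the cell above: number of spaces plus one.
space_based = demo_lines.str.count(" ") + 1

# Alternative: split on any run of whitespace and count the tokens.
token_based = demo_lines.str.split().str.len()

print(pd.DataFrame({"line": demo_lines,
                    "space_based": space_based,
                    "token_based": token_based}))
```

The two counts agree on cleanly spaced lines and diverge only where extra whitespace is present.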

In [972]:
df.head()

Unnamed: 0,chId,mId,mName,gender,posCredits,mYear,mRating,mVotes,mGenre,chLine,chName,dialoguesCount,charCountPerDialogue,wordCountPerDialogue
0,u0,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],They do not!,BIANCA,92,12,3
1,u0,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],I hope so.,BIANCA,92,10,3
2,u0,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],Let's go.,BIANCA,92,9,2
3,u0,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],Okay -- you're gonna need to learn how to lie.,BIANCA,92,46,10
4,u0,m0,10 things i hate about you,F,4,1999,6.9,62847.0,['comedy' 'romance'],Like my fear of wearing pastels?,BIANCA,92,32,6
