# Significance Tests: Token, Type, & TTR
I found many differences in token counts, type counts, and TTR. Are any of them statistically significant?

In the code below, I first compare Genders and Roles separately, asking...
* In general, is there a difference in male/female or pro/ant stats?
* Across Disney eras, is there a difference in male, female, pro, and ant stats?
* Within Each Disney Era, is there a difference in male/female or ant/pro stats?
* Across the two companies, Disney and Dreamworks, are there differences in male/female and pro/ant stats?
* Within each company, is there a difference in male/female and pro/ant stats?

Then, I combine role and gender, asking...
* Is there a difference between male pros and ants?
* Is there a difference between female pros and ants?
* Is there a difference between male and female pros?
* Is there a difference between male and female ants?

Specifically, I will investigate the following stats:
* Token Count per Line
* Type Count per Line
* Total Token Count
* Total Type Count
* TTR
* K-Band
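
For reference, here is a minimal sketch of how a type-token ratio (TTR) can be computed from a list of tokens. The helper name `type_token_ratio` and the choice to lowercase tokens before counting types are my assumptions for illustration; the preprocessing behind the `TTR` column in my data may differ.

```python
def type_token_ratio(tokens):
    """Return distinct tokens (types) divided by total tokens."""
    if not tokens:
        return 0.0
    # Assumption: types are counted case-insensitively.
    types = {tok.lower() for tok in tokens}
    return len(types) / len(tokens)

# Toy utterance: 6 tokens, 4 distinct types (the, cat, saw, other).
toy_tokens = ["The", "cat", "saw", "the", "other", "cat"]
ttr = type_token_ratio(toy_tokens)
print(round(ttr, 3))
```

Because longer texts repeat words more often, raw TTR tends to fall as the token count grows, which is worth keeping in mind when comparing characters with very different amounts of dialogue.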

# Table of Contents
1. [Data Frames/Data](#data)
2. [Token/Type Count per Line](#toktypeline)
    a. [Gender]
    b. [Role]
    c. [Gender and Role]
3. [Total Token/Type Count per Character](#totaltoktype)
    a. [Gender]
    b. [Role]
    c. [Gender and Role]
4. [TTR](#ttr)
    a. [Gender]
    b. [Role]
    c. [Gender and Role]
5. [K-Band](#kband)
    a. [Gender]
    b. [Role]
    c. [Gender and Role]

In [1]:
import pandas as pd

In [2]:
from scipy import stats

In [3]:
movie_df = pd.read_pickle(r'C:/Users/cassi/Desktop/Data_Science/Animated-Movie-Gendered-Dialogue/private/all_tagged_dialogue.pkl')

In [4]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13442 entries, 0 to 13441
Data columns (total 19 columns):
Disney_Period       13442 non-null object
Gender              13442 non-null object
Movie               13442 non-null object
Role                13442 non-null object
Song                13442 non-null object
Speaker             13442 non-null object
Speaker_Status      13442 non-null object
Text                13442 non-null object
UTTERANCE_NUMBER    13442 non-null int64
Year                13442 non-null int64
Tokens              13442 non-null object
Types               13442 non-null object
Token_Count         13442 non-null int64
Type_Count          13442 non-null int64
POS                 13442 non-null object
Tag_Freq            13442 non-null object
Adj_Count           13442 non-null int64
Adv_Count           13442 non-null int64
Adj_over_Tokens     13442 non-null float64
dtypes: float64(1), int64(6), object(12)
memory usage: 1.3+ MB


In [5]:
char_df = pd.read_pickle(r'C:/Users/cassi/Desktop/Data_Science/Animated-Movie-Gendered-Dialogue/private/char_tok_type_TTR.pkl')

In [6]:
char_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 632 entries, 0 to 631
Data columns (total 14 columns):
Disney_Period       632 non-null object
Gender              632 non-null object
Movie               632 non-null object
Role                632 non-null object
Speaker             632 non-null object
Speaker_Status      632 non-null object
Total_Tok_Count     632 non-null float64
Total_Toks          632 non-null object
Total_Type_Count    632 non-null float64
Total_Types         632 non-null object
Year                632 non-null object
TTR                 632 non-null float64
G_TTR               632 non-null float64
AVG_K_BAND          600 non-null float64
dtypes: float64(5), object(9)
memory usage: 46.9+ KB
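
Note that `AVG_K_BAND` has only 600 non-null values out of 632 rows. By default `stats.ttest_ind` propagates missing values, so a test on a column containing NaN returns NaN. The sketch below, on made-up numbers, shows the difference that `nan_policy="omit"` makes; dropping the rows beforehand with `dropna()` would be another option.

```python
import math

import numpy as np
from scipy import stats

# Made-up samples, each with one missing value.
a = np.array([1.0, 2.0, 3.0, np.nan, 5.0])
b = np.array([2.0, 4.0, 6.0, 8.0, np.nan])

# Default behaviour (nan_policy="propagate"): the result is NaN.
default = stats.ttest_ind(a, b, equal_var=False)

# Ignore missing values when computing the Welch t-test.
omitted = stats.ttest_ind(a, b, equal_var=False, nan_policy="omit")

print(math.isnan(float(default.pvalue)))
print(math.isfinite(float(omitted.pvalue)))
```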


In [7]:
f_movie_df = movie_df[movie_df.Gender == 'f']
m_movie_df = movie_df[movie_df.Gender == 'm']

In [8]:
pro_movie_df = movie_df[movie_df.Role == 'PRO']
ant_movie_df = movie_df[movie_df.Role == 'ANT']
help_movie_df = movie_df[movie_df.Role == 'HELPER']

## Token and Type Stats
### Average Token and Type Count per Line
### Gender
#### Overall

In [9]:
#Token count per Line M v F
stats.ttest_ind(m_movie_df.Token_Count, f_movie_df.Token_Count, equal_var = False)

Ttest_indResult(statistic=4.094583947758668, pvalue=4.271602422580622e-05)

In [10]:
#Type Count per Line M v F
stats.ttest_ind(m_movie_df.Type_Count, f_movie_df.Type_Count, equal_var = False)

Ttest_indResult(statistic=4.923396910028177, pvalue=8.671647557539395e-07)

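Overall, token count per line and type count per line differ significantly between genders. Both t-statistics are positive with the male sample listed first, so male characters average more tokens and more types per line.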
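
The sign of the t-statistic depends on the order of the arguments, so it is easy to misread the direction of a difference. Here is a minimal sketch of a helper that reports both group means next to the Welch test result. The name `welch_summary` and the toy numbers are assumptions for illustration, not part of my data or of scipy.

```python
import numpy as np
from scipy import stats

def welch_summary(a, b, label_a="a", label_b="b"):
    """Run Welch's t-test and return the group means alongside t and p."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    t, p = stats.ttest_ind(a, b, equal_var=False)
    return {
        f"mean_{label_a}": float(a.mean()),
        f"mean_{label_b}": float(b.mean()),
        "t": float(t),
        "p": float(p),
    }

# Toy token counts per line (hypothetical values).
male = [12, 9, 15, 11, 14, 10, 13]
female = [8, 7, 10, 9, 6, 8, 9]

summary = welch_summary(male, female, "m", "f")
print(summary)
```

With the real data this could be called as `welch_summary(m_movie_df.Token_Count, f_movie_df.Token_Count, "m", "f")`.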

#### Gender Over Time

In [11]:
f_movies_early = f_movie_df[f_movie_df.Disney_Period == 'EARLY']
m_movies_early = m_movie_df[m_movie_df.Disney_Period == 'EARLY']

f_movies_mid = f_movie_df[f_movie_df.Disney_Period == 'MID']
m_movies_mid = m_movie_df[m_movie_df.Disney_Period == 'MID']

f_movies_late = f_movie_df[f_movie_df.Disney_Period == 'LATE']
m_movies_late = m_movie_df[m_movie_df.Disney_Period == 'LATE']

In [12]:
# Comparing female Token Count over time
stats.f_oneway(f_movies_early.Token_Count, f_movies_mid.Token_Count, f_movies_late.Token_Count)

F_onewayResult(statistic=7.667236763403715, pvalue=0.0004774701445951512)

In [13]:
stats.ttest_ind(f_movies_early.Token_Count, f_movies_mid.Token_Count, equal_var=False)

Ttest_indResult(statistic=3.5171839133852822, pvalue=0.0004530579575576083)

In [14]:
stats.ttest_ind(f_movies_early.Token_Count, f_movies_late.Token_Count, equal_var=False)

Ttest_indResult(statistic=-0.024549755074654074, pvalue=0.9804174502622016)

In [15]:
stats.ttest_ind(f_movies_mid.Token_Count, f_movies_late.Token_Count, equal_var=False)

Ttest_indResult(statistic=-4.340179617140939, pvalue=1.4888114373993228e-05)

There are significant differences in female token count per line over time. However, to my surprise, female token counts per line were actually significantly higher in the early period than in the middle period. They then went up again in the late period, to a level that isn't significantly different from the early period.
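
Running three pairwise t-tests after the ANOVA raises the chance of a false positive, so a multiple-comparison correction is worth considering. Below is a minimal sketch that applies a Bonferroni correction to the three female token-count p-values reported above. The choice of Bonferroni and of alpha = 0.05 is my assumption; other corrections (such as Holm) are less conservative.

```python
# Pairwise p-values copied from the three Welch t-tests above.
pairwise_p = {
    "early vs mid": 0.0004530579575576083,
    "early vs late": 0.9804174502622016,
    "mid vs late": 1.4888114373993228e-05,
}

alpha = 0.05
n_tests = len(pairwise_p)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
adjusted = {pair: min(1.0, p * n_tests) for pair, p in pairwise_p.items()}
significant = {pair: p_adj < alpha for pair, p_adj in adjusted.items()}

for pair, p_adj in adjusted.items():
    print(f"{pair}: adjusted p = {p_adj:.3g}, significant = {significant[pair]}")
```

Under these assumptions the early vs. mid and mid vs. late differences remain significant after correction, and early vs. late does not.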

In [16]:
# Comparing Male Token Count per Line Over Time
stats.f_oneway(m_movies_early.Token_Count, m_movies_mid.Token_Count, m_movies_late.Token_Count)

F_onewayResult(statistic=10.286719936520177, pvalue=3.504240285911992e-05)

In [17]:
stats.ttest_ind(m_movies_early.Token_Count, m_movies_mid.Token_Count, equal_var=False)

Ttest_indResult(statistic=-0.17217560048124253, pvalue=0.8633440525880111)

In [18]:
stats.ttest_ind(m_movies_early.Token_Count, m_movies_late.Token_Count, equal_var=False)

Ttest_indResult(statistic=-3.093215771492932, pvalue=0.002022707332134338)

In [19]:
stats.ttest_ind(m_movies_mid.Token_Count, m_movies_late.Token_Count, equal_var=False)

Ttest_indResult(statistic=-3.9348909201915405, pvalue=8.6114160345075e-05)

Unlike female token counts per line, male token counts per line have not dipped in any era. The difference between the early and mid periods wasn't significant, but there was a significant increase from the middle to the late period.

In [20]:
#Comparing Male and Female token counts per line w/in each era
stats.ttest_ind(m_movies_early.Token_Count, f_movies_early.Token_Count, equal_var=False)

Ttest_indResult(statistic=-0.46352345106853576, pvalue=0.6430724764804259)

In [21]:
stats.ttest_ind(m_movies_mid.Token_Count, f_movies_mid.Token_Count, equal_var=False)

Ttest_indResult(statistic=4.842783133866519, pvalue=1.3960234471327022e-06)

In [22]:
stats.ttest_ind(m_movies_late.Token_Count, f_movies_late.Token_Count, equal_var=False)

Ttest_indResult(statistic=2.8962769649577087, pvalue=0.0038082458130593235)

Again, not the discrepancies that I expected. Women had slightly more tokens per line than men in the early movies, though not significantly, while women spoke significantly fewer tokens per line than men did in the middle and late periods!
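
The within-era comparisons repeat the same test on different slices, which makes copy-paste mistakes easy. Here is a minimal sketch of how the era loop could be written once and reused. The function `compare_within_eras` and the toy DataFrame are hypothetical; only the column names `Disney_Period`, `Gender`, and `Token_Count` come from my data.

```python
import pandas as pd
from scipy import stats

# Toy stand-in for movie_df (hypothetical values).
toy = pd.DataFrame({
    "Disney_Period": ["EARLY"] * 6 + ["MID"] * 6,
    "Gender": ["m", "m", "m", "f", "f", "f"] * 2,
    "Token_Count": [5, 6, 7, 5, 7, 6, 12, 14, 13, 6, 5, 7],
})

def compare_within_eras(df, value_col, group_col="Gender", a="m", b="f"):
    """Welch t-test of group a vs. group b within each Disney_Period."""
    rows = []
    for era, era_df in df.groupby("Disney_Period"):
        x = era_df.loc[era_df[group_col] == a, value_col]
        y = era_df.loc[era_df[group_col] == b, value_col]
        t, p = stats.ttest_ind(x, y, equal_var=False)
        rows.append({
            "era": era,
            f"mean_{a}": x.mean(),
            f"mean_{b}": y.mean(),
            "t": float(t),
            "p": float(p),
        })
    return pd.DataFrame(rows)

result = compare_within_eras(toy, "Token_Count")
print(result)
```

The same function would cover the role comparisons by passing `group_col="Role", a="PRO", b="ANT"`.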

In [23]:
# Comparing female Type Count over time
stats.f_oneway(f_movies_early.Type_Count, f_movies_mid.Type_Count, f_movies_late.Type_Count)

F_onewayResult(statistic=6.726987795175322, pvalue=0.0012169467684094255)

In [24]:
stats.ttest_ind(f_movies_early.Type_Count, f_movies_mid.Type_Count, equal_var=False)

Ttest_indResult(statistic=3.314075240683437, pvalue=0.0009464450537746284)

In [25]:
stats.ttest_ind(f_movies_early.Type_Count, f_movies_late.Type_Count, equal_var=False)

Ttest_indResult(statistic=0.34030078483681353, pvalue=0.733683148770339)

In [26]:
stats.ttest_ind(f_movies_mid.Type_Count, f_movies_late.Type_Count, equal_var=False)

Ttest_indResult(statistic=-3.7793897481376506, pvalue=0.0001617232188517916)

There are significant differences in female type count per line over time. As with token count, type count goes down in the middle period and then increases again in the late period.

In [27]:
# Comparing Male Type Count per Line Over Time
stats.f_oneway(m_movies_early.Type_Count, m_movies_mid.Type_Count, m_movies_late.Type_Count)

F_onewayResult(statistic=12.606850658324856, pvalue=3.491505269476744e-06)

In [28]:
stats.ttest_ind(m_movies_early.Type_Count, m_movies_mid.Type_Count, equal_var=False)

Ttest_indResult(statistic=-1.2548702663157338, pvalue=0.20991401781630908)

In [29]:
stats.ttest_ind(m_movies_early.Type_Count, m_movies_late.Type_Count, equal_var=False)

Ttest_indResult(statistic=-3.9851329851193587, pvalue=7.154008858080245e-05)

In [30]:
stats.ttest_ind(m_movies_mid.Type_Count, m_movies_late.Type_Count, equal_var=False)

Ttest_indResult(statistic=-4.0369948049602735, pvalue=5.6162187196356854e-05)

Unlike female type counts per line, male type counts per line have not dipped in any era. The difference between the early and mid periods wasn't significant, but there was a significant increase from the middle to the late period.

In [31]:
#Comparing Male and Female type counts per line w/in each era
stats.ttest_ind(m_movies_early.Type_Count, f_movies_early.Type_Count, equal_var=False)

Ttest_indResult(statistic=-0.9816210353750426, pvalue=0.3264854283424139)

In [32]:
stats.ttest_ind(m_movies_mid.Type_Count, f_movies_mid.Type_Count, equal_var=False)

Ttest_indResult(statistic=4.951605339787872, pvalue=8.146538781519938e-07)

In [33]:
stats.ttest_ind(m_movies_late.Type_Count, f_movies_late.Type_Count, equal_var=False)

Ttest_indResult(statistic=3.946576602349502, pvalue=8.1477203619039e-05)

Again, there is no significant difference in the early era of Disney, but men do use significantly more types per line than women in the middle and late eras.

#### Gender Across Companies

In [34]:
f_movies_disney = f_movie_df[f_movie_df.Disney_Period != 'DREAMWORKS']
f_movies_dw = f_movie_df[f_movie_df.Disney_Period == 'DREAMWORKS']

m_movies_disney = m_movie_df[m_movie_df.Disney_Period != 'DREAMWORKS']
m_movies_dw = m_movie_df[m_movie_df.Disney_Period == 'DREAMWORKS']

In [35]:
## Between male and female characters in Disney films
stats.ttest_ind(m_movies_disney.Token_Count, f_movies_disney.Token_Count, equal_var=False)

Ttest_indResult(statistic=2.935244768690959, pvalue=0.0033454199090632865)

In [36]:
## Between male and female characters in Dreamworks Films
stats.ttest_ind(m_movies_dw.Token_Count, f_movies_dw.Token_Count, equal_var=False)

Ttest_indResult(statistic=5.417598704278908, pvalue=6.6336631151626e-08)

Though men have higher token counts per line than women in both companies, the evidence for this difference is much stronger in Dreamworks films (p of about 6.6e-08) than in Disney films (p of about 0.003).
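
A smaller p-value does not by itself mean a larger gap, because p-values also shrink as sample size grows. To compare how big the male/female difference is in each company, an effect size such as Cohen's d is a common choice. Below is a minimal sketch on made-up numbers. The helper `cohens_d` is my own illustration (scipy does not provide one), and it uses the pooled standard deviation.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Toy token counts per line (hypothetical values).
group_1 = [10, 12, 14, 16, 18]
group_2 = [8, 10, 12, 14, 16]

d = cohens_d(group_1, group_2)
print(round(d, 3))
```

Computing d separately for the Disney and Dreamworks slices would show whether the Dreamworks gap is actually larger or just estimated from more lines.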

In [37]:
## Between male characters in Dreamworks and Disney
stats.ttest_ind(m_movies_disney.Token_Count, m_movies_dw.Token_Count, equal_var=False)

Ttest_indResult(statistic=3.193562952618235, pvalue=0.0014112565909670273)

In [38]:
## Between female characters in Dreamworks and Disney
stats.ttest_ind(f_movies_disney.Token_Count, f_movies_dw.Token_Count, equal_var=False)

Ttest_indResult(statistic=3.9876699186327174, pvalue=6.799745918946658e-05)

In Disney films, both men and women seem to have higher token counts per line than Dreamworks men and women, but this difference is much more pronounced among female characters.

### Role
#### Overall

In [39]:
stats.ttest_ind(pro_movie_df.Token_Count, ant_movie_df.Token_Count, equal_var = False)

Ttest_indResult(statistic=-3.697383277420495, pvalue=0.00022111887110217427)

In [40]:
stats.ttest_ind(pro_movie_df.Type_Count, ant_movie_df.Type_Count, equal_var = False)

Ttest_indResult(statistic=-5.088952503221761, pvalue=3.794202353395622e-07)

Overall, there do seem to be significant differences in token counts and type counts per line based on role, with protagonists having significantly shorter lines than antagonists.

#### Role Over Time

In [41]:
pro_movies_early = pro_movie_df[pro_movie_df.Disney_Period == 'EARLY']
pro_movies_mid = pro_movie_df[pro_movie_df.Disney_Period == 'MID']
pro_movies_late = pro_movie_df[pro_movie_df.Disney_Period == 'LATE']

ant_movies_early = ant_movie_df[ant_movie_df.Disney_Period == 'EARLY']
ant_movies_mid = ant_movie_df[ant_movie_df.Disney_Period == 'MID']
ant_movies_late = ant_movie_df[ant_movie_df.Disney_Period == 'LATE']

In [42]:
# Comparing Protagonist Token Count over time
stats.f_oneway(pro_movies_early.Token_Count, pro_movies_mid.Token_Count, pro_movies_late.Token_Count)

F_onewayResult(statistic=19.086710660744, pvalue=5.7882250394582755e-09)

In [43]:
stats.ttest_ind(pro_movies_early.Token_Count, pro_movies_mid.Token_Count, equal_var=False)

Ttest_indResult(statistic=3.342640325046175, pvalue=0.0009357815220299626)

In [44]:
stats.ttest_ind(pro_movies_early.Token_Count, pro_movies_late.Token_Count, equal_var=False)

Ttest_indResult(statistic=-0.48186336399539875, pvalue=0.6301653901029498)

In [45]:
stats.ttest_ind(pro_movies_mid.Token_Count, pro_movies_late.Token_Count, equal_var=False)

Ttest_indResult(statistic=-6.735013339613751, pvalue=2.0290663308577336e-11)

Protagonists had significantly more tokens per line in the early and late periods than in the middle period. However, there's hardly any difference between token counts for protagonists in the early and late periods.

In [46]:
# Comparing Antagonist Token Count per Line Over Time
stats.f_oneway(ant_movies_early.Token_Count, ant_movies_mid.Token_Count, ant_movies_late.Token_Count)

F_onewayResult(statistic=0.7582647676266144, pvalue=0.4687129230091459)

In [47]:
stats.ttest_ind(ant_movies_early.Token_Count, ant_movies_mid.Token_Count, equal_var=False)

Ttest_indResult(statistic=1.0932941002220513, pvalue=0.2751251911959307)

In [48]:
stats.ttest_ind(ant_movies_early.Token_Count, ant_movies_late.Token_Count, equal_var=False)

Ttest_indResult(statistic=0.7742843723239474, pvalue=0.43924044542440743)

In [49]:
stats.ttest_ind(ant_movies_mid.Token_Count, ant_movies_late.Token_Count, equal_var=False)

Ttest_indResult(statistic=-0.3719478211119041, pvalue=0.7100332792278266)

The Token counts per line for antagonists have not changed significantly over time.

In [50]:
#Comparing Pros and Ants token counts per line w/in each era
stats.ttest_ind(ant_movies_early.Token_Count, pro_movies_early.Token_Count, equal_var=False)

Ttest_indResult(statistic=1.5881175499700033, pvalue=0.11308831371455882)

In [51]:
stats.ttest_ind(ant_movies_mid.Token_Count, pro_movies_mid.Token_Count, equal_var=False)

Ttest_indResult(statistic=6.112808941855052, pvalue=1.5827217314564946e-09)

In [52]:
stats.ttest_ind(ant_movies_late.Token_Count, pro_movies_late.Token_Count, equal_var=False)

Ttest_indResult(statistic=0.8860844606975122, pvalue=0.37588976269590657)

In all eras, antagonists have longer lines than protagonists, but this difference is only significant in the middle period.

In [53]:
# Comparing protagonist Type Count over time
stats.f_oneway(pro_movies_early.Type_Count, pro_movies_mid.Type_Count, pro_movies_late.Type_Count)

F_onewayResult(statistic=14.988691970679247, pvalue=3.3303849789524714e-07)

In [54]:
stats.ttest_ind(pro_movies_early.Type_Count, pro_movies_mid.Type_Count, equal_var=False)

Ttest_indResult(statistic=2.69107256948761, pvalue=0.007512767963674296)

In [55]:
stats.ttest_ind(pro_movies_early.Type_Count, pro_movies_late.Type_Count, equal_var=False)

Ttest_indResult(statistic=-0.40984064622638905, pvalue=0.682163013643241)

In [56]:
stats.ttest_ind(pro_movies_mid.Type_Count, pro_movies_late.Type_Count, equal_var=False)

Ttest_indResult(statistic=-5.8617703782441435, pvalue=5.134894151032304e-09)

There are significant differences between protagonist type counts per line over time. Like token count, we see type count go down in the middle period and then increase again in the late period.

In [57]:
# Comparing Antagonist Type Count per Line Over Time
stats.f_oneway(ant_movies_early.Type_Count, ant_movies_mid.Type_Count, ant_movies_late.Type_Count)

F_onewayResult(statistic=0.22470915182497514, pvalue=0.7987835931168682)

In [58]:
stats.ttest_ind(ant_movies_early.Type_Count, ant_movies_mid.Type_Count, equal_var=False)

Ttest_indResult(statistic=0.6046068962467481, pvalue=0.5458629063425532)

In [59]:
stats.ttest_ind(ant_movies_early.Type_Count, ant_movies_late.Type_Count, equal_var=False)

Ttest_indResult(statistic=0.32822967833693967, pvalue=0.7429124654076498)

In [60]:
stats.ttest_ind(ant_movies_mid.Type_Count, ant_movies_late.Type_Count, equal_var=False)

Ttest_indResult(statistic=-0.34853958227562953, pvalue=0.7275260595788151)

Antagonist type counts aren't significantly different over time.

In [61]:
#Comparing Protagonist and Antagonist type counts per line w/in each era
stats.ttest_ind(pro_movies_early.Type_Count, ant_movies_early.Type_Count, equal_var=False)

Ttest_indResult(statistic=-1.9103389932092292, pvalue=0.056809278586763594)

In [62]:
stats.ttest_ind(pro_movies_mid.Type_Count, ant_movies_mid.Type_Count, equal_var=False)

Ttest_indResult(statistic=-6.71022923005286, pvalue=3.777126543273302e-11)

In [63]:
stats.ttest_ind(pro_movies_late.Type_Count, ant_movies_late.Type_Count, equal_var=False)

Ttest_indResult(statistic=-2.256134803710217, pvalue=0.02441015205671323)

We see protagonists speak significantly fewer types per line than antagonists do in the middle and late periods. In the early period the difference points the same way but falls just short of significance (p of about 0.057).

#### Roles Across Companies

In [64]:
ant_movies_disney = ant_movie_df[ant_movie_df.Disney_Period != 'DREAMWORKS']
ant_movies_dw = ant_movie_df[ant_movie_df.Disney_Period == 'DREAMWORKS']

pro_movies_disney = pro_movie_df[pro_movie_df.Disney_Period != 'DREAMWORKS']
pro_movies_dw = pro_movie_df[pro_movie_df.Disney_Period == 'DREAMWORKS']

In [65]:
#Between antagonists in Disney and Dreamworks
stats.ttest_ind(ant_movies_disney.Token_Count, ant_movies_dw.Token_Count, equal_var=False)

Ttest_indResult(statistic=3.485318491992578, pvalue=0.0005020874766656302)

In [66]:
#Between protagonists in Disney and Dreamworks
stats.ttest_ind(pro_movies_disney.Token_Count, pro_movies_dw.Token_Count, equal_var=False)

Ttest_indResult(statistic=-0.16553304513676656, pvalue=0.868530290732931)

In [67]:
#Between protagonists and antagonists in Disney
stats.ttest_ind(pro_movies_disney.Token_Count, ant_movies_disney.Token_Count, equal_var=False)

Ttest_indResult(statistic=-4.166800416127634, pvalue=3.2215506266884984e-05)

In [68]:
#Between protagonists and antagonists in Dreamworks
stats.ttest_ind(pro_movies_dw.Token_Count, ant_movies_dw.Token_Count, equal_var=False)

Ttest_indResult(statistic=-0.2901875740549527, pvalue=0.7717081950509582)

There's a significant difference in how many tokens per line Disney and Dreamworks antagonists use (Disney ones use more). Also, in Disney movies protagonists speak significantly fewer tokens per line than antagonists. This difference isn't significant in Dreamworks movies.

### Gender and Role
Let's see if token or type count by line differs based on both role and gender

In [69]:
movies_gen_role = movie_df[(movie_df.Gender != 'n') & (movie_df.Role != 'N')]

In [70]:
pro_f_movies = movies_gen_role[(movies_gen_role.Gender == 'f') & (movies_gen_role.Role == 'PRO')]
pro_m_movies = movies_gen_role[(movies_gen_role.Gender == 'm') & (movies_gen_role.Role == 'PRO')]

ant_f_movies = movies_gen_role[(movies_gen_role.Gender == 'f') & (movies_gen_role.Role == 'ANT')]
ant_m_movies = movies_gen_role[(movies_gen_role.Gender == 'm') & (movies_gen_role.Role == 'ANT')]

In [71]:
stats.ttest_ind(pro_f_movies.Token_Count, pro_m_movies.Token_Count, equal_var=False)

Ttest_indResult(statistic=-4.689197546723727, pvalue=2.8095641549224497e-06)

In [72]:
stats.ttest_ind(ant_f_movies.Token_Count, ant_m_movies.Token_Count, equal_var=False)

Ttest_indResult(statistic=1.9443693139361118, pvalue=0.05233678715267543)

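Female protagonists have significantly lower token counts per line than their male counterparts, but the same isn't true for female antagonists, who have slightly higher counts than male antagonists, though not significantly so (p of about 0.052)!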

In [73]:
stats.ttest_ind(pro_f_movies.Token_Count, ant_f_movies.Token_Count, equal_var=False)

Ttest_indResult(statistic=-4.135851906611012, pvalue=4.0981566143233455e-05)

In [74]:
stats.ttest_ind(pro_m_movies.Token_Count, ant_m_movies.Token_Count, equal_var=False)

Ttest_indResult(statistic=-0.5338934710479438, pvalue=0.5934489750790087)

Also, female protagonists use fewer tokens per line than their antagonist counterparts, but there's no significant difference between token counts per line for male protagonists and antagonists!

### Total Token and Type Counts
### Gender
I decided to also look at these stats over all of a character's dialogue, to account for the short lines. Though a female character may have fewer tokens per line, if she has more lines in the movie, this will be reflected in her total token and type counts.
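
To make the per-character totals concrete, here is a minimal sketch of how line-level token lists could be rolled up into one row per character. The toy DataFrame is hypothetical and this is not necessarily how `char_df` was built; only the column names mirror my data.

```python
import pandas as pd

# Toy stand-in for movie_df: one row per line of dialogue (hypothetical values).
lines = pd.DataFrame({
    "Movie": ["A", "A", "A", "B", "B"],
    "Speaker": ["X", "X", "Y", "Z", "Z"],
    "Tokens": [
        ["hello", "there"],
        ["hello", "again"],
        ["no"],
        ["go", "go", "go"],
        ["stop"],
    ],
})

# One row per token, then count tokens and distinct types per character.
exploded = lines.explode("Tokens")
totals = (
    exploded.groupby(["Movie", "Speaker"])["Tokens"]
    .agg(Total_Tok_Count="count", Total_Type_Count="nunique")
    .reset_index()
)
print(totals)
```

Note that a character's total type count is the number of distinct words across all of their lines, not the sum of the per-line type counts.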

#### Overall

In [75]:
f_chars = char_df[char_df.Gender == 'f']
m_chars = char_df[char_df.Gender == 'm']

In [76]:
stats.ttest_ind(f_chars.Total_Tok_Count, m_chars.Total_Tok_Count, equal_var = False)

Ttest_indResult(statistic=0.5818365410871014, pvalue=0.561135928401851)

In [77]:
stats.ttest_ind(f_chars.Total_Type_Count, m_chars.Total_Type_Count, equal_var = False)

Ttest_indResult(statistic=0.6666518458606242, pvalue=0.5055478790046386)

#### Gender Over Time

In [78]:
f_chars_early = f_chars[f_chars.Disney_Period == 'EARLY']
m_chars_early = m_chars[m_chars.Disney_Period == 'EARLY']

f_chars_mid = f_chars[f_chars.Disney_Period == 'MID']
m_chars_mid = m_chars[m_chars.Disney_Period == 'MID']

f_chars_late = f_chars[f_chars.Disney_Period == 'LATE']
m_chars_late = m_chars[m_chars.Disney_Period == 'LATE']

In [79]:
# Comparing female Total Token Count over time
stats.f_oneway(f_chars_early.Total_Tok_Count, f_chars_mid.Total_Tok_Count, f_chars_late.Total_Tok_Count)

F_onewayResult(statistic=0.7565014558817752, pvalue=0.4726744433906328)

In [80]:
stats.ttest_ind(f_chars_early.Total_Tok_Count, f_chars_mid.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=1.6710482736104724, pvalue=0.10402612605697355)

In [81]:
stats.ttest_ind(f_chars_early.Total_Tok_Count, f_chars_late.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=0.36220205774072417, pvalue=0.7186796184083162)

In [82]:
stats.ttest_ind(f_chars_mid.Total_Tok_Count, f_chars_late.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-1.053741737065515, pvalue=0.296237093244603)

In terms of total token counts, there's no significant difference over time for female speakers.

In [83]:
# Comparing Male Total Token Count Over Time
stats.f_oneway(m_chars_early.Total_Tok_Count, m_chars_mid.Total_Tok_Count, m_chars_late.Total_Tok_Count)

F_onewayResult(statistic=0.4556460162458111, pvalue=0.6346665542997648)

In [84]:
stats.ttest_ind(m_chars_early.Total_Tok_Count, m_chars_mid.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-0.9457421468179764, pvalue=0.3467974076518153)

In [85]:
stats.ttest_ind(m_chars_early.Total_Tok_Count, m_chars_late.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-0.03922467420677105, pvalue=0.96878796152072)

In [86]:
stats.ttest_ind(m_chars_mid.Total_Tok_Count, m_chars_late.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=0.8212002791912885, pvalue=0.41268728447936076)

Again, no significant difference in total token count over time for men.

In [87]:
#Comparing Male and Female total token counts w/in each era
stats.ttest_ind(m_chars_early.Total_Tok_Count, f_chars_early.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-2.7989846132472493, pvalue=0.010407023906466407)

In [88]:
stats.ttest_ind(m_chars_mid.Total_Tok_Count, f_chars_mid.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-0.3940422821058243, pvalue=0.6955651444998084)

In [89]:
stats.ttest_ind(m_chars_late.Total_Tok_Count, f_chars_late.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-1.823479812519496, pvalue=0.0740212180895443)

In each era, men have fewer total tokens than women, but this difference is only significant in the earliest era.

In [90]:
# Comparing female Total Type Count over time
stats.f_oneway(f_chars_early.Total_Type_Count, f_chars_mid.Total_Type_Count, f_chars_late.Total_Type_Count)

F_onewayResult(statistic=1.2877171787064285, pvalue=0.2816268170693461)

In [91]:
stats.ttest_ind(f_chars_early.Total_Type_Count, f_chars_mid.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=2.0384464207932558, pvalue=0.049488192282371564)

In [92]:
stats.ttest_ind(f_chars_early.Total_Type_Count, f_chars_late.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=1.3087484145681199, pvalue=0.19724244472499755)

In [93]:
stats.ttest_ind(f_chars_mid.Total_Type_Count, f_chars_late.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=-0.5741075157578277, pvalue=0.567943133125478)

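As with total token count, the ANOVA shows no significant overall change in total female type count over time. The only pairwise difference that reaches significance is between the early and mid eras, and only marginally (p of about 0.049).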

In [94]:
# Comparing Male Total Type Count Over Time
stats.f_oneway(m_chars_early.Total_Type_Count, m_chars_mid.Total_Type_Count, m_chars_late.Total_Type_Count)

F_onewayResult(statistic=0.8193868326261027, pvalue=0.4421124083605513)

In [95]:
stats.ttest_ind(m_chars_early.Total_Type_Count, m_chars_mid.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=-0.6502045831434937, pvalue=0.517665071901952)

In [96]:
stats.ttest_ind(m_chars_early.Total_Type_Count, m_chars_late.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=0.46220112394368235, pvalue=0.6452486919034337)

In [97]:
stats.ttest_ind(m_chars_mid.Total_Type_Count, m_chars_late.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=1.2272732790040868, pvalue=0.22136747154684625)

No significant difference here either.

In [98]:
#Comparing Male and Female Total type counts w/in each era
stats.ttest_ind(m_chars_early.Total_Type_Count, f_chars_early.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=-2.855637599373463, pvalue=0.00824262780343056)

In [99]:
stats.ttest_ind(m_chars_mid.Total_Type_Count, f_chars_mid.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=-0.25617166334484504, pvalue=0.7990317299283365)

In [100]:
stats.ttest_ind(m_chars_late.Total_Type_Count, f_chars_late.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=-1.5908769966422909, pvalue=0.11717183933557054)

The only significant difference here is in the early era, when men used significantly fewer types than women.

#### Gender Across Companies

In [101]:
f_chars_disney = f_chars[f_chars.Disney_Period != 'DREAMWORKS']
f_chars_dw = f_chars[f_chars.Disney_Period == 'DREAMWORKS']

m_chars_disney = m_chars[m_chars.Disney_Period != 'DREAMWORKS']
m_chars_dw = m_chars[m_chars.Disney_Period == 'DREAMWORKS']

In [102]:
## Between male and female characters in Disney films
stats.ttest_ind(m_chars_disney.Total_Tok_Count, f_chars_disney.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-2.500539732650887, pvalue=0.0138384192484122)

In [103]:
## Between male and female characters in Dreamworks Films
stats.ttest_ind(m_chars_dw.Total_Tok_Count, f_chars_dw.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=2.22412198514079, pvalue=0.027319709654615342)

While in Disney films men have a significantly lower total token count than women, in Dreamworks films men have a significantly higher total token count.

In [104]:
## Between male characters in Dreamworks and Disney
stats.ttest_ind(m_chars_disney.Total_Tok_Count, m_chars_dw.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-2.677337304360268, pvalue=0.008055329898577386)

In [105]:
## Between female characters in Dreamworks and Disney
stats.ttest_ind(f_chars_disney.Total_Tok_Count, f_chars_dw.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=2.0720465303105766, pvalue=0.04014661347462381)

Male Disney characters have significantly fewer total tokens than male Dreamworks characters, but female Disney characters have significantly more total tokens than female Dreamworks characters.

### Role
#### Overall

In [106]:
pro_chars = char_df[char_df.Role == 'PRO']
ant_chars = char_df[char_df.Role == 'ANT']
helper_chars = char_df[char_df.Role == 'HELPER']

In [107]:
stats.ttest_ind(pro_chars.Total_Tok_Count, ant_chars.Total_Tok_Count, equal_var = False)

Ttest_indResult(statistic=5.5464379433307505, pvalue=4.987476883049536e-07)

In [108]:
stats.ttest_ind(pro_chars.Total_Type_Count, ant_chars.Total_Type_Count, equal_var = False)

Ttest_indResult(statistic=4.7940373205004505, pvalue=7.225031831766566e-06)

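Overall, there do seem to be significant differences in total token counts and total type counts based on role. Both t-statistics are positive with the protagonist sample listed first, so protagonists have significantly higher totals than antagonists, even though their individual lines are shorter.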

#### Role Over Time

In [109]:
pro_chars_early = pro_chars[pro_chars.Disney_Period == 'EARLY']
pro_chars_mid = pro_chars[pro_chars.Disney_Period == 'MID']
pro_chars_late = pro_chars[pro_chars.Disney_Period == 'LATE']

ant_chars_early = ant_chars[ant_chars.Disney_Period == 'EARLY']
ant_chars_mid = ant_chars[ant_chars.Disney_Period == 'MID']
ant_chars_late = ant_chars[ant_chars.Disney_Period == 'LATE']

In [110]:
# Comparing Protagonist Total Token Count over time
stats.f_oneway(pro_chars_early.Total_Tok_Count, pro_chars_mid.Total_Tok_Count, pro_chars_late.Total_Tok_Count)

F_onewayResult(statistic=1.398516240527042, pvalue=0.2616521552658346)

In [111]:
stats.ttest_ind(pro_chars_early.Total_Tok_Count, pro_chars_mid.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-1.4143696959464174, pvalue=0.189107798145821)

In [112]:
stats.ttest_ind(pro_chars_early.Total_Tok_Count, pro_chars_late.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-1.9117645485336412, pvalue=0.07228397238910902)

In [113]:
stats.ttest_ind(pro_chars_mid.Total_Tok_Count, pro_chars_late.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-0.9244706698901672, pvalue=0.36432540920568823)

Protagonist total token count hasn't changed significantly over time.

In [114]:
# Comparing Antagonist Total Token Count Over Time
stats.f_oneway(ant_chars_early.Total_Tok_Count, ant_chars_mid.Total_Tok_Count, ant_chars_late.Total_Tok_Count)

F_onewayResult(statistic=1.965322020955908, pvalue=0.1519534577399305)

In [115]:
stats.ttest_ind(ant_chars_early.Total_Tok_Count, ant_chars_mid.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=0.7503327647729069, pvalue=0.4656306038134109)

In [116]:
stats.ttest_ind(ant_chars_early.Total_Tok_Count, ant_chars_late.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-0.9265514121928716, pvalue=0.3664334726074592)

In [117]:
stats.ttest_ind(ant_chars_mid.Total_Tok_Count, ant_chars_late.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-1.821989145568108, pvalue=0.08704101024606539)

Total token counts have not changed significantly over time for antagonists.

In [118]:
#Comparing Pros and Ants token counts w/in each era
stats.ttest_ind(ant_chars_early.Total_Tok_Count, pro_chars_early.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-0.6753758994010711, pvalue=0.5202276886839079)

In [119]:
stats.ttest_ind(ant_chars_mid.Total_Tok_Count, pro_chars_mid.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-3.980850068857966, pvalue=0.001204907399219454)

In [120]:
stats.ttest_ind(ant_chars_late.Total_Tok_Count, pro_chars_late.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-2.3802152393344005, pvalue=0.026600053559682094)

In all eras, antagonists have fewer total tokens than protagonists (the t-statistics are negative with the antagonist sample listed first), but this difference is only significant in the middle and late periods.

In [121]:
# Comparing protagonist Total Type Count over time
stats.f_oneway(pro_chars_early.Total_Type_Count, pro_chars_mid.Total_Type_Count, pro_chars_late.Total_Type_Count)

F_onewayResult(statistic=1.0127044089686985, pvalue=0.3745812385412033)

In [122]:
stats.ttest_ind(pro_chars_early.Total_Type_Count, pro_chars_mid.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=-1.6311702476684713, pvalue=0.13954133765393995)

In [123]:
stats.ttest_ind(pro_chars_early.Total_Type_Count, pro_chars_late.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=-1.6051707473003827, pvalue=0.12728491170541695)

In [124]:
stats.ttest_ind(pro_chars_mid.Total_Type_Count, pro_chars_late.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=-0.4001939965138298, pvalue=0.6926291756780809)

There's no significant difference between protagonist total type count across eras.

In [125]:
# Comparing Antagonist Total Type Count Over Time
stats.f_oneway(ant_chars_early.Total_Type_Count, ant_chars_mid.Total_Type_Count, ant_chars_late.Total_Type_Count)

F_onewayResult(statistic=2.278641987658419, pvalue=0.11412029570700565)

In [126]:
stats.ttest_ind(ant_chars_early.Total_Type_Count, ant_chars_mid.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=0.8569156141966944, pvalue=0.40686196373696526)

In [127]:
stats.ttest_ind(ant_chars_early.Total_Type_Count, ant_chars_late.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=-0.9088485557336637, pvalue=0.37602356135550985)

In [128]:
stats.ttest_ind(ant_chars_mid.Total_Type_Count, ant_chars_late.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=-2.094113391130965, pvalue=0.05076353895724402)

Antagonist total type counts aren't significantly different over time.

In [129]:
#Comparing Protagonist and Antagonist total type counts w/in each era
stats.ttest_ind(pro_chars_early.Total_Type_Count, ant_chars_early.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=0.1820044756588397, pvalue=0.8593034431524984)

In [130]:
stats.ttest_ind(pro_chars_mid.Total_Type_Count, ant_chars_mid.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=4.03442086377755, pvalue=0.0005782591496875579)

In [131]:
stats.ttest_ind(pro_chars_late.Total_Type_Count, ant_chars_late.Total_Type_Count, equal_var=False)

Ttest_indResult(statistic=1.372054322578563, pvalue=0.18261046281460055)

Protagonists have significantly higher total type counts than antagonists in the middle period. The difference is not significant in the other periods.

#### Roles Across Companies

In [132]:
ant_chars_disney = ant_chars[ant_chars.Disney_Period != 'DREAMWORKS']
ant_chars_dw = ant_chars[ant_chars.Disney_Period == 'DREAMWORKS']

pro_chars_disney = pro_chars[pro_chars.Disney_Period != 'DREAMWORKS']
pro_chars_dw = pro_chars[pro_chars.Disney_Period == 'DREAMWORKS']

In [133]:
#Between antagonists in Disney and Dreamworks
stats.ttest_ind(ant_chars_disney.Total_Tok_Count, ant_chars_dw.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-1.319365341370763, pvalue=0.20050283315545955)

In [134]:
#Between protagonists in Disney and Dreamworks
stats.ttest_ind(pro_chars_disney.Total_Tok_Count, pro_chars_dw.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-2.4066727724213672, pvalue=0.023034529483623306)

In [135]:
#Between protagonists and antagonists in Disney
stats.ttest_ind(pro_chars_disney.Total_Tok_Count, ant_chars_disney.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=4.163468947239521, pvalue=0.0001513498117997193)

In [136]:
#Between protagonists and antagonists in Dreamworks
stats.ttest_ind(pro_chars_dw.Total_Tok_Count, ant_chars_dw.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=3.8163881880334496, pvalue=0.0007154369960454306)

There isn't a significant difference in how many total tokens Disney and Dreamworks antagonists use, but there is for protagonists (with Disney protagonists speaking less). Also, in both Disney and Dreamworks movies, protagonists speak significantly more total tokens than antagonists.
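Since the same subset-then-test pattern repeats for every comparison, a small helper could cut down on copy-paste mistakes. This is only a sketch: `welch_between` is a hypothetical function, and the toy table borrows column names from `char_df` with made-up values.

```python
import pandas as pd
from scipy import stats


def welch_between(df, value_col, group_col, label_a, label_b):
    """Welch's t-test on value_col between rows where group_col is label_a vs. label_b."""
    a = df.loc[df[group_col] == label_a, value_col]
    b = df.loc[df[group_col] == label_b, value_col]
    return stats.ttest_ind(a, b, equal_var=False)


# Toy character table; column names follow char_df, the values are made up.
demo_chars = pd.DataFrame({
    "Role": ["PRO", "PRO", "PRO", "ANT", "ANT", "ANT"],
    "Total_Tok_Count": [900.0, 1100.0, 1000.0, 300.0, 350.0, 250.0],
})

result = welch_between(demo_chars, "Total_Tok_Count", "Role", "PRO", "ANT")
print(result.statistic, result.pvalue)
```

The helper keeps `equal_var=False` in one place, so every comparison uses Welch's test the same way.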

### Gender and Role
Let's see if total token count differs based on both role and gender.

In [137]:
chars_gen_role = char_df[(char_df.Gender != 'n') & (char_df.Role != 'N')]

In [138]:
pro_f_chars = chars_gen_role[(chars_gen_role.Gender == 'f') & (chars_gen_role.Role == 'PRO')]
pro_m_chars = chars_gen_role[(chars_gen_role.Gender == 'm') & (chars_gen_role.Role == 'PRO')]

ant_f_chars = chars_gen_role[(chars_gen_role.Gender == 'f') & (chars_gen_role.Role == 'ANT')]
ant_m_chars = chars_gen_role[(chars_gen_role.Gender == 'm') & (chars_gen_role.Role == 'ANT')]

In [139]:
stats.ttest_ind(pro_f_chars.Total_Tok_Count, pro_m_chars.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=-2.3036127017888, pvalue=0.026241552523793867)

In [140]:
stats.ttest_ind(ant_f_chars.Total_Tok_Count, ant_m_chars.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=1.1952359203652967, pvalue=0.24708253417281606)

Female protagonists have significantly lower total token counts than their male counterparts, but the same isn't true for female antagonists!

In [141]:
stats.ttest_ind(pro_f_chars.Total_Tok_Count, ant_f_chars.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=2.206418621224415, pvalue=0.03382223433757088)

In [142]:
stats.ttest_ind(pro_m_chars.Total_Tok_Count, ant_m_chars.Total_Tok_Count, equal_var=False)

Ttest_indResult(statistic=5.105537679404112, pvalue=2.0838970182158358e-05)

Also, both female and male protagonists have significantly higher total token counts than their antagonist counterparts.

## TTR (Total Type Count/Total Token Count)
Because TTR is so sensitive to text length, I decided to compute TTR per character as Total Type Count / Total Token Count.
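To make the length sensitivity concrete, here is a minimal sketch of the ratio itself. It assumes simple whitespace tokenization, which may differ from how the tokens in this dataset were produced; `type_token_ratio` is a hypothetical helper and the sentences are made up.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    if not tokens:
        return float("nan")
    return len(set(tokens)) / len(tokens)


# Made-up examples, split on whitespace.
short_line = "the cat sat".split()
long_line = "the cat sat on the mat and the dog sat on the cat".split()

short_ttr = type_token_ratio(short_line)  # 3 types / 3 tokens
long_ttr = type_token_ratio(long_line)    # 7 types / 13 tokens
print(short_ttr, long_ttr)
```

The longer text repeats words like "the" and "sat", so its ratio falls even though it uses more distinct words in total.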
### Gender
#### Overall

In [143]:
stats.ttest_ind(f_chars.TTR, m_chars.TTR, equal_var = False)

Ttest_indResult(statistic=-0.9967546797195583, pvalue=0.31987931875373893)

In terms of Total Token Count, Total Type Count, and TTR, there's no significant difference between genders. This suggests that while men may speak in longer spurts than women, the overall variety of their speech doesn't differ much.

#### Gender Over Time

In [144]:
# Comparing female TTR over time
stats.f_oneway(f_chars_early.TTR, f_chars_mid.TTR, f_chars_late.TTR)

F_onewayResult(statistic=3.4968329357799712, pvalue=0.03506265614065239)

In [145]:
stats.ttest_ind(f_chars_early.TTR, f_chars_mid.TTR, equal_var=False)

Ttest_indResult(statistic=-2.459117418560822, pvalue=0.018470036291041714)

In [146]:
stats.ttest_ind(f_chars_early.TTR, f_chars_late.TTR, equal_var=False)

Ttest_indResult(statistic=-2.857913574828898, pvalue=0.006705283618760464)

In [147]:
stats.ttest_ind(f_chars_mid.TTR, f_chars_late.TTR, equal_var=False)

Ttest_indResult(statistic=-0.258104078500633, pvalue=0.7972602335168293)

Overall, female TTR has consistently increased over time, but not always significantly. The middle era and late era aren't significantly different.

In [148]:
# Comparing male TTR over time
stats.f_oneway(m_chars_early.TTR, m_chars_mid.TTR, m_chars_late.TTR)

F_onewayResult(statistic=5.82835310707887, pvalue=0.003442284009815076)

In [149]:
stats.ttest_ind(m_chars_early.TTR, m_chars_mid.TTR, equal_var=False)

Ttest_indResult(statistic=-1.410113236483718, pvalue=0.16412009997815027)

In [150]:
stats.ttest_ind(m_chars_early.TTR, m_chars_late.TTR, equal_var=False)

Ttest_indResult(statistic=-3.2587796654324026, pvalue=0.0018959747320116576)

In [151]:
stats.ttest_ind(m_chars_mid.TTR, m_chars_late.TTR, equal_var=False)

Ttest_indResult(statistic=-2.47232123889799, pvalue=0.01436628327530204)

Overall, male TTR has also increased with each era. But this time, the jump between the early era and the mid era is not significant.

In [152]:
#Comparing Male and Female ttr w/in each era
stats.ttest_ind(m_chars_early.TTR, f_chars_early.TTR, equal_var=False)

Ttest_indResult(statistic=1.8884068553046045, pvalue=0.06773956895602941)

In [153]:
stats.ttest_ind(m_chars_mid.TTR, f_chars_mid.TTR, equal_var=False)

Ttest_indResult(statistic=0.10439755290356918, pvalue=0.9174256477392705)

In [154]:
stats.ttest_ind(m_chars_late.TTR, f_chars_late.TTR, equal_var=False)

Ttest_indResult(statistic=1.4071864483279044, pvalue=0.16444134895688622)

Within each era, there is no significant difference between male and female TTR (though it comes closest to significance in the early period).

#### Gender Across Companies

In [155]:
## Between male and female characters in Disney films
stats.ttest_ind(m_chars_disney.TTR, f_chars_disney.TTR, equal_var=False)

Ttest_indResult(statistic=1.738372169581285, pvalue=0.08452149169924421)

In [156]:
## Between male and female characters in Dreamworks Films
stats.ttest_ind(m_chars_dw.TTR, f_chars_dw.TTR, equal_var=False)

Ttest_indResult(statistic=-0.7209148983122594, pvalue=0.4724125836363494)

Female characters actually have higher TTRs in Dreamworks, but this difference is not significant. In fact, male and female TTRs aren't significantly different within either company.

In [157]:
## Between male characters in Dreamworks and Disney
stats.ttest_ind(m_chars_disney.TTR, m_chars_dw.TTR, equal_var=False)

Ttest_indResult(statistic=1.7820311936138562, pvalue=0.07579093929932226)

In [158]:
## Between female characters in Dreamworks and Disney
stats.ttest_ind(f_chars_disney.TTR, f_chars_dw.TTR, equal_var=False)

Ttest_indResult(statistic=-0.9136619447139693, pvalue=0.3625361194432649)

There's also no significant difference in male or female TTR depending on which company the characters are from.

### Role
#### Overall

In [159]:
stats.ttest_ind(pro_chars.TTR, ant_chars.TTR, equal_var = False)
#definite difference here!!

Ttest_indResult(statistic=-4.3639708178868295, pvalue=2.8550625202274174e-05)

The overall difference in TTR between male and female characters isn't significant, but the difference between roles is. This suggests that villains speak in longer spurts AND use a wider variety of words overall.

#### Role Over Time

In [160]:
# Comparing protagonist TTR over time
stats.f_oneway(pro_chars_early.TTR, pro_chars_mid.TTR, pro_chars_late.TTR)

F_onewayResult(statistic=1.0149771898372846, pvalue=0.3737814831647466)

In [161]:
stats.ttest_ind(pro_chars_early.TTR, pro_chars_mid.TTR, equal_var=False)

Ttest_indResult(statistic=1.2268967836893614, pvalue=0.2636442324651084)

In [162]:
stats.ttest_ind(pro_chars_early.TTR, pro_chars_late.TTR, equal_var=False)

Ttest_indResult(statistic=0.2868374487924252, pvalue=0.7807674688491074)

In [163]:
stats.ttest_ind(pro_chars_mid.TTR, pro_chars_late.TTR, equal_var=False)

Ttest_indResult(statistic=-1.377998038581648, pvalue=0.18053833358472451)

Overall, protagonist TTR has not changed significantly over time.

In [164]:
# Comparing antagonist TTR over time
stats.f_oneway(ant_chars_early.TTR, ant_chars_mid.TTR, ant_chars_late.TTR)

F_onewayResult(statistic=3.269206179120879, pvalue=0.047242541211229325)

In [165]:
stats.ttest_ind(ant_chars_early.TTR, ant_chars_mid.TTR, equal_var=False)

Ttest_indResult(statistic=-1.7201220815865133, pvalue=0.10516315962438427)

In [166]:
stats.ttest_ind(ant_chars_early.TTR, ant_chars_late.TTR, equal_var=False)

Ttest_indResult(statistic=0.4894873762823616, pvalue=0.630807487691796)

In [167]:
stats.ttest_ind(ant_chars_mid.TTR, ant_chars_late.TTR, equal_var=False)

Ttest_indResult(statistic=2.4948346357229516, pvalue=0.02046301415775284)

Overall, antagonist TTR has not moved in one direction across eras, but the ANOVA shows a significant difference. The only significant pairwise difference is between the middle and late eras, where antagonist TTR drops.

In [168]:
#Comparing pro and ant ttr w/in each era
stats.ttest_ind(pro_chars_early.TTR, ant_chars_early.TTR, equal_var=False)

Ttest_indResult(statistic=-0.35232507387561846, pvalue=0.7329443287112759)

In [169]:
stats.ttest_ind(pro_chars_mid.TTR, ant_chars_mid.TTR, equal_var=False)

Ttest_indResult(statistic=-5.514977956471588, pvalue=3.5378751555015488e-06)

In [170]:
stats.ttest_ind(pro_chars_late.TTR, ant_chars_late.TTR, equal_var=False)

Ttest_indResult(statistic=-0.43593148828609407, pvalue=0.6665238247248573)

Protagonists have significantly lower TTRs than antagonists in the middle era! But this difference isn't significant in any other era.

#### Role Across Companies

In [171]:
## Between pro and ant characters in Disney films
stats.ttest_ind(pro_chars_disney.TTR, ant_chars_disney.TTR, equal_var=False)

Ttest_indResult(statistic=-3.5363873123988214, pvalue=0.0007091891629031381)

In [172]:
## Between pro and ant characters in Dreamworks Films
stats.ttest_ind(pro_chars_dw.TTR, ant_chars_dw.TTR, equal_var=False)

Ttest_indResult(statistic=-2.381271556663255, pvalue=0.022824811395266955)

In both companies, the protagonists' TTR is significantly lower than the antagonists' TTR (though this difference is more pronounced in Disney).

In [173]:
## Between pro characters in Dreamworks and Disney
stats.ttest_ind(pro_chars_disney.TTR, pro_chars_dw.TTR, equal_var=False)

Ttest_indResult(statistic=0.9802699115501557, pvalue=0.33381354005183883)

In [174]:
## Between ant characters in Dreamworks and Disney
stats.ttest_ind(ant_chars_disney.TTR, ant_chars_dw.TTR, equal_var=False)

Ttest_indResult(statistic=0.9206752516740735, pvalue=0.3642994840174262)

There's also no significant difference between protagonist or antagonist TTR depending on which company they're from.

### Gender and Role

In [175]:
stats.ttest_ind(pro_f_chars.TTR, pro_m_chars.TTR, equal_var=False)

Ttest_indResult(statistic=1.08651965020638, pvalue=0.2826315416739064)

In [176]:
stats.ttest_ind(ant_f_chars.TTR, ant_m_chars.TTR, equal_var=False)

Ttest_indResult(statistic=-2.687410158585391, pvalue=0.01180458098250379)

Female antagonists have a significantly lower TTR than male antagonists.

In [177]:
stats.ttest_ind(pro_f_chars.TTR, ant_f_chars.TTR, equal_var=False)

Ttest_indResult(statistic=-0.7943175495401144, pvalue=0.43236641767734063)

In [178]:
stats.ttest_ind(pro_m_chars.TTR, ant_m_chars.TTR, equal_var=False)

Ttest_indResult(statistic=-5.034559650067808, pvalue=4.816235332807276e-06)

Male protagonists have significantly lower TTR's than male antagonists. 

## K-Band
K-bands are another way to measure vocabulary sophistication. Before we can measure this, we need to get rid of NaN values (32 of them).
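A quick way to confirm how many rows are missing before filtering them out is to count the nulls first. This is a minimal sketch on a toy frame: `AVG_K_BAND` mirrors the column name in `char_df`, but `demo_chars` and its values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame; AVG_K_BAND mirrors the column in char_df, the values are made up.
demo_chars = pd.DataFrame({
    "Speaker": ["A", "B", "C", "D"],
    "AVG_K_BAND": [2.1, np.nan, 3.4, np.nan],
})

# Count the missing values, then keep only rows that have a k-band.
n_missing = demo_chars.AVG_K_BAND.isnull().sum()
demo_chars_clean = demo_chars[demo_chars.AVG_K_BAND.notnull()]
print(n_missing, len(demo_chars_clean))
```

The row count before filtering minus the count after should equal the number of missing values.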

In [179]:
char_df = char_df[char_df.AVG_K_BAND.notnull()]

In [180]:
char_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 600 entries, 0 to 631
Data columns (total 14 columns):
Disney_Period       600 non-null object
Gender              600 non-null object
Movie               600 non-null object
Role                600 non-null object
Speaker             600 non-null object
Speaker_Status      600 non-null object
Total_Tok_Count     600 non-null float64
Total_Toks          600 non-null object
Total_Type_Count    600 non-null float64
Total_Types         600 non-null object
Year                600 non-null object
TTR                 600 non-null float64
G_TTR               600 non-null float64
AVG_K_BAND          600 non-null float64
dtypes: float64(5), object(9)
memory usage: 49.2+ KB


In [181]:
char_df.AVG_K_BAND.describe()

count    600.000000
mean       2.597918
std        1.667153
min        1.000000
25%        1.810462
50%        2.235572
75%        2.825013
max       20.000000
Name: AVG_K_BAND, dtype: float64

In [182]:
f_chars = char_df[char_df.Gender == 'f']
m_chars = char_df[char_df.Gender == 'm']

pro_chars = char_df[char_df.Role == 'PRO']
ant_chars = char_df[char_df.Role == 'ANT']
helper_chars = char_df[char_df.Role == 'HELPER']

In [183]:
f_chars_early = f_chars[f_chars.Disney_Period == 'EARLY']
m_chars_early = m_chars[m_chars.Disney_Period == 'EARLY']

f_chars_mid = f_chars[f_chars.Disney_Period == 'MID']
m_chars_mid = m_chars[m_chars.Disney_Period == 'MID']

f_chars_late = f_chars[f_chars.Disney_Period == 'LATE']
m_chars_late = m_chars[m_chars.Disney_Period == 'LATE']

In [184]:
f_chars_disney = f_chars[f_chars.Disney_Period != 'DREAMWORKS']
f_chars_dw = f_chars[f_chars.Disney_Period == 'DREAMWORKS']

m_chars_disney = m_chars[m_chars.Disney_Period != 'DREAMWORKS']
m_chars_dw = m_chars[m_chars.Disney_Period == 'DREAMWORKS']

In [185]:
pro_chars_early = pro_chars[pro_chars.Disney_Period == 'EARLY']
pro_chars_mid = pro_chars[pro_chars.Disney_Period == 'MID']
pro_chars_late = pro_chars[pro_chars.Disney_Period == 'LATE']

ant_chars_early = ant_chars[ant_chars.Disney_Period == 'EARLY']
ant_chars_mid = ant_chars[ant_chars.Disney_Period == 'MID']
ant_chars_late = ant_chars[ant_chars.Disney_Period == 'LATE']

ant_chars_disney = ant_chars[ant_chars.Disney_Period != 'DREAMWORKS']
ant_chars_dw = ant_chars[ant_chars.Disney_Period == 'DREAMWORKS']

pro_chars_disney = pro_chars[pro_chars.Disney_Period != 'DREAMWORKS']
pro_chars_dw = pro_chars[pro_chars.Disney_Period == 'DREAMWORKS']

In [186]:
chars_gen_role = char_df[(char_df.Gender != 'n') & (char_df.Role != 'N')]

In [187]:
pro_f_chars = chars_gen_role[(chars_gen_role.Gender == 'f') & (chars_gen_role.Role == 'PRO')]
pro_m_chars = chars_gen_role[(chars_gen_role.Gender == 'm') & (chars_gen_role.Role == 'PRO')]

ant_f_chars = chars_gen_role[(chars_gen_role.Gender == 'f') & (chars_gen_role.Role == 'ANT')]
ant_m_chars = chars_gen_role[(chars_gen_role.Gender == 'm') & (chars_gen_role.Role == 'ANT')]

### Gender
#### Overall

In [188]:
stats.ttest_ind(f_chars.AVG_K_BAND, m_chars.AVG_K_BAND, equal_var = False)

Ttest_indResult(statistic=-1.5214685800338528, pvalue=0.12915173695384446)

Female characters have a lower average k-band than male characters, but the difference is not significant.

#### Gender Over Time

In [189]:
# Comparing female k-bands over time
stats.f_oneway(f_chars_early.AVG_K_BAND, f_chars_mid.AVG_K_BAND, f_chars_late.AVG_K_BAND)

F_onewayResult(statistic=0.5550447488889522, pvalue=0.5763259908527979)

In [190]:
stats.ttest_ind(f_chars_early.AVG_K_BAND, f_chars_mid.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=1.2624659995110097, pvalue=0.21671137414910116)

In [191]:
stats.ttest_ind(f_chars_early.AVG_K_BAND, f_chars_late.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=0.4352872140743068, pvalue=0.6655490643381015)

In [192]:
stats.ttest_ind(f_chars_mid.AVG_K_BAND, f_chars_late.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=-0.7856604955881593, pvalue=0.43520109428216813)

Between each era, there's no significant difference between k-band for female characters.

In [193]:
# Comparing male k-bands over time
stats.f_oneway(m_chars_early.AVG_K_BAND, m_chars_mid.AVG_K_BAND, m_chars_late.AVG_K_BAND)

F_onewayResult(statistic=1.5721235645385327, pvalue=0.21012081406469596)

In [194]:
stats.ttest_ind(m_chars_early.AVG_K_BAND, m_chars_mid.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=-0.8763342592370738, pvalue=0.3829134296971871)

In [195]:
stats.ttest_ind(m_chars_early.AVG_K_BAND, m_chars_late.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=0.9316593728992122, pvalue=0.3541216555969938)

In [196]:
stats.ttest_ind(m_chars_mid.AVG_K_BAND, m_chars_late.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=1.695411524679044, pvalue=0.09178577905498049)

There's no significant difference between male k-bands either.

In [197]:
#Comparing Male and Female k-band w/in each era
stats.ttest_ind(m_chars_early.AVG_K_BAND, f_chars_early.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=0.5588404692572434, pvalue=0.5801201372747906)

In [198]:
stats.ttest_ind(m_chars_mid.AVG_K_BAND, f_chars_mid.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=2.889049534067202, pvalue=0.00469035893209242)

In [199]:
stats.ttest_ind(m_chars_late.AVG_K_BAND, f_chars_late.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=0.24591397143813828, pvalue=0.8063531699999992)

Within each era, male characters have higher k-bands than female characters, but the only significant difference between male and female k-band is in the middle era.

#### Gender Across Companies

In [200]:
## Between male and female characters in Disney films
stats.ttest_ind(m_chars_disney.AVG_K_BAND, f_chars_disney.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=2.017418424661412, pvalue=0.04483148616005847)

In [201]:
## Between male and female characters in Dreamworks Films
stats.ttest_ind(m_chars_dw.AVG_K_BAND, f_chars_dw.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=0.134804929154227, pvalue=0.8930338538260596)

In both companies, male characters have higher k-bands, but this difference is only significant in Disney movies.

In [202]:
## Between male characters in Dreamworks and Disney
stats.ttest_ind(m_chars_disney.AVG_K_BAND, m_chars_dw.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=0.2960567231712937, pvalue=0.7673859718130868)

In [203]:
## Between female characters in Dreamworks and Disney
stats.ttest_ind(f_chars_disney.AVG_K_BAND, f_chars_dw.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=-1.0623278021471434, pvalue=0.29095151838251404)

There's also no significant difference between male or female k-bands depending on which company they're from.

### Role
#### Overall

In [204]:
stats.ttest_ind(pro_chars.AVG_K_BAND, ant_chars.AVG_K_BAND, equal_var = False)

Ttest_indResult(statistic=0.45162026995458937, pvalue=0.6525834608516885)

The overall difference in k-band between protagonists and antagonists isn't significant.

#### Role Over Time

In [205]:
# Comparing protagonist k-band over time
stats.f_oneway(pro_chars_early.AVG_K_BAND, pro_chars_mid.AVG_K_BAND, pro_chars_late.AVG_K_BAND)

F_onewayResult(statistic=1.9305376251574389, pvalue=0.16159456052400611)

In [206]:
stats.ttest_ind(pro_chars_early.AVG_K_BAND, pro_chars_mid.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=-1.0865872563918046, pvalue=0.3043142432813455)

In [207]:
stats.ttest_ind(pro_chars_early.AVG_K_BAND, pro_chars_late.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=-2.126040235119834, pvalue=0.04670118626074485)

In [208]:
stats.ttest_ind(pro_chars_mid.AVG_K_BAND, pro_chars_late.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=-1.7565482811389581, pvalue=0.09637230583616663)

Overall, protagonist k-band has gone up over time, but only the increase between the early and late periods is significant in the pairwise tests (the ANOVA across all three eras is not significant).

In [209]:
# Comparing antagonist k-band over time
stats.f_oneway(ant_chars_early.AVG_K_BAND, ant_chars_mid.AVG_K_BAND, ant_chars_late.AVG_K_BAND)

F_onewayResult(statistic=1.2934913819917113, pvalue=0.2845365325582649)

In [210]:
stats.ttest_ind(ant_chars_early.AVG_K_BAND, ant_chars_mid.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=-0.3556024751029952, pvalue=0.724931342155765)

In [211]:
stats.ttest_ind(ant_chars_early.AVG_K_BAND, ant_chars_late.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=2.2035911805753288, pvalue=0.05592876469781373)

In [212]:
stats.ttest_ind(ant_chars_mid.AVG_K_BAND, ant_chars_late.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=2.3403136089156553, pvalue=0.025816632560621153)

Overall, antagonist k-band has not changed consistently with each era, but the difference between the mid and late period is significant--antagonists' k-bands have gone down between these two periods.

In [213]:
#Comparing pro and ant k-band w/in each era
stats.ttest_ind(pro_chars_early.AVG_K_BAND, ant_chars_early.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=-2.302191356233655, pvalue=0.045407813161706334)

In [214]:
stats.ttest_ind(pro_chars_mid.AVG_K_BAND, ant_chars_mid.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=-2.0498917471898546, pvalue=0.049190264549699964)

In [215]:
stats.ttest_ind(pro_chars_late.AVG_K_BAND, ant_chars_late.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=2.0320509917553653, pvalue=0.0569016067040367)

Protagonists have significantly lower k-bands in the early and middle period. Though they have higher k-bands than antagonists in the late period, this difference is not significant. 

#### Role Across Companies

In [216]:
## Between pro and ant characters in Disney films
stats.ttest_ind(pro_chars_disney.AVG_K_BAND, ant_chars_disney.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=-0.7123319965577875, pvalue=0.47833352294552167)

In [217]:
## Between pro and ant characters in Dreamworks Films
stats.ttest_ind(pro_chars_dw.AVG_K_BAND, ant_chars_dw.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=1.4419215026164018, pvalue=0.16584992104633853)

In both companies, the difference between protagonist and antagonist k-bands is not significant.

In [218]:
## Between pro characters in Dreamworks and Disney
stats.ttest_ind(pro_chars_disney.AVG_K_BAND, pro_chars_dw.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=-0.8128334727767533, pvalue=0.4257691390935556)

In [219]:
## Between ant characters in Dreamworks and Disney
stats.ttest_ind(ant_chars_disney.AVG_K_BAND, ant_chars_dw.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=2.3208894309533283, pvalue=0.023624771660884124)

There's a significant difference between antagonists across companies--Disney antagonists have a higher average k-band than Dreamworks antagonists.

### Gender and Role

In [220]:
stats.ttest_ind(pro_f_chars.AVG_K_BAND, pro_m_chars.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=0.3014119565517507, pvalue=0.7645438754926193)

In [221]:
stats.ttest_ind(ant_f_chars.AVG_K_BAND, ant_m_chars.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=-0.13147818133675276, pvalue=0.8961701948870642)

There is no significant difference here.

In [222]:
stats.ttest_ind(pro_f_chars.AVG_K_BAND, ant_f_chars.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=0.026228542209657403, pvalue=0.9792793645428122)

In [223]:
stats.ttest_ind(pro_m_chars.AVG_K_BAND, ant_m_chars.AVG_K_BAND, equal_var=False)

Ttest_indResult(statistic=-0.3819223985057784, pvalue=0.7039886227359056)

And no significant difference here either.

## A Quick Peek at TTR by Line

In [224]:
movie_df['TTR'] = movie_df.Type_Count / movie_df.Token_Count

In [225]:
f_movie_df = movie_df[movie_df.Gender == 'f']
m_movie_df = movie_df[movie_df.Gender == 'm']

In [226]:
stats.ttest_ind(f_movie_df.TTR, m_movie_df.TTR, equal_var = False)

Ttest_indResult(statistic=5.747565438307308, pvalue=9.365973396130048e-09)

In [227]:
movie_df.groupby('Gender')['TTR'].describe()

| Gender | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| f | 4216.0 | 0.916061 | 0.126405 | 0.238095 | 0.857143 | 1.0 | 1.0 | 1.0 |
| m | 8914.0 | 0.902342 | 0.130396 | 0.214286 | 0.833333 | 1.0 | 1.0 | 1.0 |
| n | 312.0 | 0.88708 | 0.174888 | 0.166667 | 0.833333 | 1.0 | 1.0 | 1.0 |


Wow! This makes the difference seem extremely significant! But recall from earlier that females have significantly shorter lines. The caveat of TTR is that as line length goes up, you're more likely to get a lower TTR. The difference above is worth noting, but I don't think that it tells the whole story.
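One way to check how much of that gap is just line length would be to compare TTR only among lines of similar length. The sketch below was not run on the real data: the two length bins are arbitrary, `demo_lines` is a made-up frame, and only the column names (`Gender`, `Token_Count`, `Type_Count`) come from `movie_df`.

```python
import pandas as pd

# Toy lines; column names mirror movie_df, the values are made up.
demo_lines = pd.DataFrame({
    "Gender": ["f", "f", "f", "m", "m", "m"],
    "Token_Count": [3, 4, 12, 4, 11, 14],
    "Type_Count": [3, 4, 10, 4, 9, 11],
})
demo_lines["TTR"] = demo_lines.Type_Count / demo_lines.Token_Count

# Hypothetical length bins: short lines (1-5 tokens) vs. longer lines (6-20 tokens).
demo_lines["Length_Bin"] = pd.cut(
    demo_lines.Token_Count, bins=[0, 5, 20], labels=["short", "long"]
)

# Mean TTR per gender within each length bin.
binned_ttr = (
    demo_lines.groupby(["Length_Bin", "Gender"], observed=True)["TTR"]
    .mean()
    .unstack()
)
print(binned_ttr)
```

If the gender gap shrinks or disappears within bins, that would suggest line length is driving most of the line-level difference.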

In [228]:
#TTR by line for Role
ant_movie_df = movie_df[movie_df.Role == 'ANT']
pro_movie_df = movie_df[movie_df.Role == 'PRO']

In [229]:
movie_df.groupby('Role')['TTR'].describe()

| Role | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ANT | 2037.0 | 0.907082 | 0.124392 | 0.214286 | 0.846154 | 0.952381 | 1.0 | 1.0 |
| HELPER | 3549.0 | 0.898836 | 0.135681 | 0.166667 | 0.833333 | 1.0 | 1.0 | 1.0 |
| N | 2094.0 | 0.912681 | 0.131447 | 0.25 | 0.857143 | 1.0 | 1.0 | 1.0 |
| PRO | 5762.0 | 0.908279 | 0.128937 | 0.238095 | 0.833333 | 1.0 | 1.0 | 1.0 |


In [230]:
stats.ttest_ind(ant_movie_df.TTR, pro_movie_df.TTR, equal_var = False)

Ttest_indResult(statistic=-0.3698928762036763, pvalue=0.7114835463019438)

And here for role, line by line, there seems to be no significant difference at all in TTR. Maybe line by line, protagonists are just as sophisticated as antagonists, but their vocabulary doesn't vary much across lines.

# Summary: What have we Observed?