# POS Analysis

## Table of Contents
- [1st section](#Loading-in-the-data) is where I import the files
- [2nd section](#POS-tagging) is where I run the POS tagger
- [3rd section](#POS-extraction) is where I extract and count the POS for analysis
- [4th section](#POS-analysis) is where I analyze POS usage
- [Conclusion](#Conclusion) summarizes the notebook

## Loading in the data

In [None]:
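# load in the libraries
import re  # used by the POS extraction helpers
import spacy
import numpy as np
import pandas as pd
import seaborn as sns  # used for the boxplots
import matplotlib.pyplot as plt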

## POS tagging

In [None]:
# generate nlp object for spacy pos tagging
nlp = spacy.load("en_core_web_sm")

In [None]:
# do i want to run this as word, POS tuples?

In [None]:
# function that tags pos and returns a list of (word, pos) tuples
def pos_tag(x):
    # keep token.text (a plain string) rather than the spaCy Token object
    return [(token.text, token.pos_) for token in nlp(x)]

I like the readability of the (word, POS) tuples that the NLTK POS tagger returns, so I created the function above to keep that structure while using spaCy.

In [None]:
# adding POS tags to see if any trends arise
utterances_df['pos_tag'] = utterances_df.utterance.map(pos_tag)

## POS extraction

I will ignore nouns and verbs, as they are the most basic elements of phrase structure. Instead, I will look at adjectives and adverbs to see how often the speaker modifies other parts of speech, interjections to check for interruptions, and conjunctions to get an idea of sentence complexity.

The parts of speech I will look at are the following:
* ADV (adverb)
* ADJ (adjective)
* CONJ (conjunction; the `CONJ` pattern matches spaCy's `CCONJ` tag and, where the model emits it, `SCONJ`)
* INTJ (interjection)

In [None]:
# adverbs
def get_adv(x):
    pattern = r'ADV'
    advs = re.findall(pattern, ' '.join(str(z) for (y,z) in x))
    return advs

# adding data to the data frames
discourse_df['adv_count'] = discourse_df.pos_tag.apply(get_adv).str.len()

In [None]:
# adjectives
def get_adj(x):
    pattern = r'ADJ'
    adjs = re.findall(pattern, ' '.join(str(z) for (y,z) in x))
    return adjs

# adding data to the data frames
discourse_df['adj_count'] = discourse_df.pos_tag.apply(get_adj).str.len()

In [None]:
# conjunctions
def get_conj(x):
    pattern = r'CONJ'
    conjs = re.findall(pattern, ' '.join(str(z) for (y,z) in x))
    return conjs

# adding data to the data frames
discourse_df['conj_count'] = discourse_df.pos_tag.apply(get_conj).str.len()

In [None]:
# interjections
def get_intj(x):
    pattern = r'INTJ'
    intjs = re.findall(pattern, ' '.join(str(z) for (y,z) in x))
    return intjs

# adding data to the data frames
discourse_df['intj_count'] = discourse_df.pos_tag.apply(get_intj).str.len()

In [None]:
discourse_df.head()

In [None]:
print('There are',discourse_df.adv_count.sum(),'adverbs in the corpus.')
print('There are',discourse_df.adj_count.sum(),'adjectives in the corpus.')
print('There are',discourse_df.conj_count.sum(),'conjunctions in the corpus.')
print('There are',discourse_df.intj_count.sum(),'interjections in the corpus.')

Adverbs are by far the most common POS out of the four selected for analysis.
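
The four `get_*` helpers above differ only in the tag they search for, so the counting step could also be written once. The following is a minimal sketch under stated assumptions, not the approach used in this notebook: `TAG_FAMILIES`, `count_pos`, and `example_turn` are hypothetical names, and the example turn is hand-written rather than taken from the corpus. It treats `CONJ` as covering `CCONJ` and `SCONJ`, which mirrors the substring match that the regex pattern `r'CONJ'` performs.

```python
from collections import Counter

# Tag families to count. "conj" groups CONJ, CCONJ and SCONJ, mirroring the
# substring match of the regex pattern r'CONJ' (assumption for illustration).
TAG_FAMILIES = {
    "adv": ("ADV",),
    "adj": ("ADJ",),
    "conj": ("CONJ", "CCONJ", "SCONJ"),
    "intj": ("INTJ",),
}


def count_pos(tagged):
    """Count tokens per tag family in a list of (word, pos) tuples."""
    tag_counts = Counter(pos for _, pos in tagged)
    return {
        name: sum(tag_counts[tag] for tag in tags)
        for name, tags in TAG_FAMILIES.items()
    }


# Hand-written example turn (hypothetical, not from the corpus)
example_turn = [
    ("Well", "INTJ"), (",", "PUNCT"), ("I", "PRON"), ("really", "ADV"),
    ("like", "VERB"), ("cold", "ADJ"), ("and", "CCONJ"), ("rainy", "ADJ"),
    ("days", "NOUN"), ("because", "SCONJ"), ("they", "PRON"),
    ("are", "AUX"), ("quiet", "ADJ"),
]

counts = count_pos(example_turn)
print(counts)
```

Counting exact tags with a `Counter` avoids joining the tags into one string first, and the returned dictionary could then be expanded into one count column per tag family.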

## POS analysis

Let's see how each gender uses these different parts of speech. Because none of the four POS I am analyzing is required in an utterance, usage may be low and a minimum of 0 per turn can be expected, as the head of the data frame above already shows.

### Adverbs

In [None]:
discourse_df.groupby('gender').agg({'adv_count': ['mean', 'min', 'max', 'std']})

In [None]:
sns.catplot(data=discourse_df, x='gender', y='adv_count', kind='box')
plt.title('Adverb Usage by Gender')
plt.show()

Female characters have the highest mean adverb usage at 0.97, so on average a female character uses almost one adverb per turn in this corpus. However, the highest number of adverbs in a single turn, 35, comes from a male character.

As the boxplot shows, the tails are very long across all genders. The turns with 0 adverbs are bringing down the average.
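
To make the effect of zero-count turns concrete, here is a small sketch using Python's standard `statistics` module. The counts in `adv_counts` are invented for illustration and are not corpus data. The sketch compares the mean over all turns with the median, the share of turns with no adverbs, and the mean over only the turns that contain at least one adverb.

```python
from statistics import mean, median

# Hypothetical per-turn adverb counts (invented, NOT taken from the corpus):
# many turns with 0 adverbs and one very long turn.
adv_counts = [0, 0, 0, 0, 0, 1, 1, 2, 3, 13]

overall_mean = mean(adv_counts)                       # mean over every turn
overall_median = median(adv_counts)                   # robust to the long tail
zero_share = adv_counts.count(0) / len(adv_counts)    # share of turns with 0
nonzero_mean = mean([c for c in adv_counts if c > 0])  # mean over non-zero turns

print(overall_mean, overall_median, zero_share, nonzero_mean)
```

In this toy data, half of the turns have no adverbs, so the mean over all turns is half the mean over the turns that do contain one. The median stays below both because the single long turn does not affect it.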

### Adjectives

In [None]:
discourse_df.groupby('gender').agg({'adj_count': ['mean', 'min', 'max', 'std']})

In [None]:
sns.catplot(data=discourse_df, x='gender', y='adj_count', kind='box')
plt.title('Adjective Usage by Gender')
plt.show()

Characters of unknown gender have the highest mean adjective usage at 0.66. They also have the largest standard deviation, so usage varies most among these characters. Again, the most adjectives in a single turn, 53, come from a male character.

### Conjunctions

In [None]:
discourse_df.groupby('gender').agg({'conj_count': ['mean', 'min', 'max', 'std']})

In [None]:
sns.catplot(data=discourse_df, x='gender', y='conj_count', kind='box')
plt.title('Conjunction Usage by Gender')
plt.show()

'Unknown' characters have the highest mean conjunction usage, but only by 0.01. Overall usage seems to be very consistent across the board. Once again, a male character has the highest number of conjunctions in a single turn.

### Interjections

In [None]:
discourse_df.groupby('gender').agg({'intj_count': ['mean', 'min', 'max', 'std']})

In [None]:
sns.catplot(data=discourse_df, x='gender', y='intj_count', kind='box')
plt.title('Interjection Usage by Gender')
plt.show()

Usage is very low across the board, with female characters the highest at a mean of 0.08. Male and female characters both have a maximum of 7 interjections per turn.

## Conclusion