# Exploratory Data Analysis on a Natural Language Processing Task
> Authors: Caroline Schmitt, Matt Brems

---

Exploratory data analysis (EDA) is a crucial part of any data science project. EDA helps us discover interesting relationships in the data, detect outliers and errors, examine our own assumptions about the data, and prepare for modeling. During EDA we might discover that we need to clean our data more carefully, that we have more missing data than we realized, or that there aren't many patterns in the data (indicating that modeling may be challenging).

In this lab you'll bring in a natural language dataset and perform EDA. The dataset contains Facebook statuses collected between 2009 and 2011, as well as personality test results for the users who wrote those statuses.

This dataset uses results from the Big Five Personality Test, also referred to as the five-factor model, which measures a person's score on five dimensions of personality:
- **O**penness
- **C**onscientiousness
- **E**xtroversion
- **A**greeableness
- **N**euroticism

Notoriously, the political consulting group Cambridge Analytica claims to have predicted the personalities of Facebook users by using those users' data, with the goal of targeting them with political ads that would be particularly persuasive given their personality type. Cambridge Analytica claims to have considered 32 unique 'groups' in the following fashion:
- For each of the five OCEAN qualities, a user is categorized as either 'yes' or 'no'.
- This makes for 32 different potential combinations of qualities ($2^5 = 32$).

Cambridge Analytica's methodology was then, roughly, the following:
- Gather a large amount of data from Facebook.
- Use this data to predict an individual's Big Five personality "grouping."
- Design political advertisements that would be particularly effective for that "grouping." (For example, are certain advertisements particularly effective toward people with specific personality traits?)

In this lab you will perform EDA to examine many relationships in the data.

Exploratory data analysis can be a non-linear process, and you're encouraged to explore questions that occur to you as you work through the notebook.

> **Content note**: This dataset contains real Facebook statuses scraped from 2009 to 2011, and some of the statuses contain language that is not safe for work, crude, or offensive. The full dataset is available as `mypersonality.csv`, and a sanitized version containing only statuses that passed an automated profanity check is available as `mypersonality_noprofanity.csv`. Please do not hesitate to use `mypersonality_noprofanity.csv` if you would prefer to. Please note that the automated profanity check is not foolproof. If you have any concerns about working with this dataset, please get in touch with your instructional team.

---

### External resources

These resources are not required reading but may be of use or interest.

- [Python Graph Gallery](https://python-graph-gallery.com/)
- [Wikipedia page](https://en.wikipedia.org/wiki/Big_Five_personality_traits) on the Big Five test
- [A short (3-4 pages) academic paper](./celli-al_wcpr13.pdf) using the `MyPersonality` dataset to model personality

---

## Load packages

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 350

---

## Import data

This code is provided for you. Some columns with network and score-related data are dropped.

The remaining columns will be `#AUTHID`, `STATUS`, `cEXT`, `cNEU`, `cAGR`, `cCON`, `cOPN`, and `DATE`:

| Variable name | Description                                                                      |
|---------------|----------------------------------------------------------------------------------|
| `#AUTHID`     | Author ID code, unique per user                                                  |
| `STATUS`      | Text of a Facebook status                                                        |
| `cEXT`        | Author extroversion category, `y` for above median and `n` for below median      |
| `cNEU`        | Author neuroticism category, `y` for above median and `n` for below median       |
| `cAGR`        | Author agreeableness category, `y` for above median and `n` for below median     |
| `cCON`        | Author conscientiousness category, `y` for above median and `n` for below median |
| `cOPN`        | Author openness category, `y` for above median and `n` for below median          |
| `DATE`        | Time stamp of original Facebook status                                           |


In [None]:
df = pd.read_csv('data/mypersonality.csv')
# df = pd.read_csv('data/mypersonality_noprofanity.csv') # comment out above & uncomment this to use mypersonality_noprofanity.csv

dropcols = [
    # these are network-related columns:
    'NETWORKSIZE', 'BETWEENNESS', 'NBETWEENNESS', 'DENSITY',
    'BROKERAGE', 'NBROKERAGE', 'TRANSITIVITY',
    # these are score-related columns;
    # we will use the categories instead:
    'sEXT', 'sNEU', 'sAGR', 'sCON', 'sOPN'
]

df.drop(columns=dropcols, inplace=True)
df.head(3)

## Data cleaning

It's often more convenient to work with integers than strings. Convert the personality columns to 0 and 1, with 0 meaning 'below the median' and 1 meaning 'above the median.'
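
The conversion described above can be sketched on a toy dataframe. This is a minimal illustration rather than the lab solution: `toy` and its rows are invented, and only two of the five trait columns are shown.

```python
import pandas as pd

# Invented stand-in for the real data; the lab data has five c* columns.
toy = pd.DataFrame({
    'cEXT': ['y', 'n', 'y'],
    'cNEU': ['n', 'n', 'y'],
})

trait_cols = ['cEXT', 'cNEU']  # in the lab: all five personality columns

# Map 'y' -> 1 (above median) and 'n' -> 0 (below median) in each column.
for col in trait_cols:
    toy[col] = toy[col].map({'y': 1, 'n': 0})

print(toy)
```

Note that `map` with a dictionary returns `NaN` for any value that is not a key, so it is worth checking for missing values afterwards in case a column contains something other than `y` or `n`.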

-----

## A first look at the dataset

In this section, check:

- How many observations are there in the dataset?
- How many unique `#AUTHID` codes are there?
- How many `y` and `n` values (`1` and `0` after the conversion above) are in each of the personality category columns?
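
The three checks above map onto three common pandas calls. The sketch below uses an invented toy dataframe (`toy`), so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Invented rows; one row per status, as in the lab data.
toy = pd.DataFrame({
    '#AUTHID': ['a1', 'a1', 'b2', 'c3'],
    'cEXT': [1, 1, 0, 1],
    'cNEU': [0, 0, 0, 1],
})

n_rows = toy.shape[0]                     # number of observations (statuses)
n_users = toy['#AUTHID'].nunique()        # number of distinct author IDs
ext_counts = toy['cEXT'].value_counts()   # rows per category in one trait column

print(n_rows, n_users)
print(ext_counts)
```

Keep in mind that `value_counts()` here counts statuses, not users, because each row is one status.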


---

## EDA on Statuses

Before we even vectorize the text, we might look at the lengths and word counts in each Facebook status. Some personality types might be more long-winded than others!

### Create a new column called `status_char_length` that contains the character length of each status

> Note: You can do this in one line with `map`.
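
As a small sketch of the `map` approach, on invented statuses and assuming there are no missing values in `STATUS` (calling `len` on a `NaN` would raise an error):

```python
import pandas as pd

# Invented statuses for illustration only.
toy = pd.DataFrame({'STATUS': ['hello world', 'EDA is fun!', 'ok']})

# len() on a string returns its number of characters, spaces included.
toy['status_char_length'] = toy['STATUS'].map(len)

print(toy)
```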

### Create a new column called `status_word_count` that contains the number of words in each status

> Note: You can count the strings that are separated by whitespace; you're not required to check that each whitespace-delimited string is a word in the dictionary.
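
One way to count whitespace-separated strings is Python's `str.split()` with no arguments, which splits on any run of whitespace and ignores leading and trailing spaces. The statuses below are invented.

```python
import pandas as pd

# Invented statuses, including one with irregular spacing.
toy = pd.DataFrame({'STATUS': ['hello world', '  EDA   is fun!  ', 'ok']})

# split() with no arguments treats consecutive whitespace as one separator.
toy['status_word_count'] = toy['STATUS'].map(lambda s: len(s.split()))

print(toy)
```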

---

## Longest and shortest statuses

Looking at individual observations can help us get a sense of what the dataset contains.

### Show the five longest and five shortest statuses based on `status_word_count`
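
One possible approach is `nlargest()` and `nsmallest()`, sketched here on an invented four-row dataframe with `n=2` instead of five:

```python
import pandas as pd

# Invented rows; the word counts are assumed to be precomputed.
toy = pd.DataFrame({
    'STATUS': ['a', 'a b', 'a b c', 'a b c d'],
    'status_word_count': [1, 2, 3, 4],
})

# nlargest sorts descending, nsmallest ascending, by the named column.
longest = toy.nlargest(2, 'status_word_count')
shortest = toy.nsmallest(2, 'status_word_count')

print(longest)
print(shortest)
```

Sorting with `sort_values('status_word_count')` and then taking `head()` and `tail()` would work just as well.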

---

## Investigating distribution of post lengths

We've now seen some of the shortest and longest posts in the dataset. But how common are short posts, and how common are long posts? 

Use visuals to show the distributions of post lengths.

> Note: There are several types of visualizations you could use for this, and you could investigate this by looking at `status_word_count`, `status_char_length`, or both.
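
A histogram is one option. The sketch below uses matplotlib on invented word counts; the bin count and labels are arbitrary choices, not recommendations for the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented word counts with a long right tail.
toy = pd.DataFrame({'status_word_count': [1, 2, 2, 3, 3, 3, 10, 25]})

fig, ax = plt.subplots(figsize=(8, 4))

# ax.hist returns the per-bin counts and the bin edges it used.
counts, bin_edges, _ = ax.hist(toy['status_word_count'], bins=5)

ax.set_title('Distribution of status word counts (toy data)')
ax.set_xlabel('Words per status')
ax.set_ylabel('Number of statuses')
```

With skewed data like this, most observations land in the first bin, so it can help to try more bins or a box plot as well.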

---

## Exploring personality categories and individual users

Because most users contributed many posts, doing EDA on the personality columns at the status level can be misleading. For example, if 2,000 of the Facebook statuses come from one very high-conscientiousness user, a bar chart of how many statuses have `cCON` equal to `1` reflects that one prolific user more than it reflects the users as a group. We'll have to be careful about labeling and titling any visualizations we make from the dataset.

This dataset has redacted original poster names, but each user is given an `#AUTHID`. How many unique users are there? Do we have the same number of posts per user, or do we have some more posts by some users than others?
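
To see how unevenly posts are spread across users, one option is `value_counts()` on the ID column. The IDs below are invented.

```python
import pandas as pd

# Invented author IDs; each row represents one status.
toy = pd.DataFrame({'#AUTHID': ['a1', 'a1', 'a1', 'b2', 'c3', 'c3']})

# Number of statuses per author, most prolific first.
posts_per_user = toy['#AUTHID'].value_counts()

# Summary statistics (min, max, quartiles) of posts per author.
print(posts_per_user.describe())
```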

---

### Create a new dataframe called `unique_users` that only contains the `#AUTHID` and personality category columns

If you do this correctly, it should have 250 rows and 6 columns.*

(Hint: You can use the pandas [drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) method to make this easier. The only column you want to consider when deciding if a user is duplicated is the `#AUTHID` column.)

> Note: *If using the `noprofanity` dataset, this number may be different.
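
A minimal sketch of the `drop_duplicates()` hint, on an invented dataframe with two trait columns instead of five. The result is named `toy_users` here so it doesn't collide with your own `unique_users`.

```python
import pandas as pd

# Invented rows: author a1 has two statuses, b2 has one.
toy = pd.DataFrame({
    '#AUTHID': ['a1', 'a1', 'b2'],
    'STATUS': ['first post', 'second post', 'hi'],
    'cEXT': [1, 1, 0],
    'cNEU': [0, 0, 1],
})

keep_cols = ['#AUTHID', 'cEXT', 'cNEU']

# Keep one row per author; only '#AUTHID' decides what counts as a duplicate.
toy_users = (
    toy[keep_cols]
    .drop_duplicates(subset='#AUTHID')
    .reset_index(drop=True)
)

print(toy_users)
```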

### Using `unique_users`, investigate personality

For this section, perform EDA on just the unique users. Create 2-3 tables or visuals to investigate.

Here are some prompts to get you started:

- What proportion of above-median openness users also exhibit above-median extroversion vs. below-median extroversion? What about other pairs of personality traits?
- Do any two personality traits appear to be correlated?
- Are about equal numbers of users above median conscientiousness and below median conscientiousness, or is there an imbalanced split? What about the other personality traits?
- Are any users below-median across all five personality traits? How many?
- Are any users above-median across all five personality traits? How many?

For each dataframe or plot you end up with, remember to provide interpretation in markdown as well.
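
For the pairwise prompts, `pd.crosstab` is one possible tool. The sketch below uses six invented users and two traits. `normalize='index'` turns each row into proportions, which answers "of the above-median openness users, what share are above-median on extroversion?"

```python
import pandas as pd

# Invented one-row-per-user data with two of the five traits.
toy_users = pd.DataFrame({
    'cOPN': [1, 1, 1, 0, 0, 1],
    'cEXT': [1, 0, 1, 0, 0, 1],
})

# Rows: cOPN value; columns: cEXT value; cells: share within each row.
table = pd.crosstab(toy_users['cOPN'], toy_users['cEXT'], normalize='index')
print(table)

# Users above the median on every trait listed (two here, five in the lab).
traits = ['cOPN', 'cEXT']
n_all_above = (toy_users[traits].sum(axis=1) == len(traits)).sum()
print(n_all_above)
```

The row-sum trick works only because the columns are coded as 0 and 1.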

## Plots vs. Tables

(Short answer.) Explain when you might present a visualization versus when you might present a table of summary statistics. You can provide your answer using sentences or bullet points.

---

## Exploring status length and word count based on personality

### Using `groupby()`, find the mean status length and status word count for posts by users in the above-median and below-median categories of each of the personality traits

> Note: Using `groupby()` five separate times is the easiest way to do this.
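
A sketch of one of those five `groupby()` calls, on invented numbers and assuming the two length columns already exist:

```python
import pandas as pd

# Invented rows; one row per status.
toy = pd.DataFrame({
    'cEXT': [1, 1, 0, 0],
    'status_char_length': [10, 20, 30, 50],
    'status_word_count': [2, 4, 6, 10],
})

length_cols = ['status_char_length', 'status_word_count']

# Mean of each length column within the below-median (0) and above-median (1) groups.
means_by_ext = toy.groupby('cEXT')[length_cols].mean()

print(means_by_ext)
```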

### Distribution of post length for above- and below-median personality traits

Choose one of the personality category columns (i.e., `cOPN`, `cCON`, `cEXT`, `cAGR`, or `cNEU`). Visualize the distribution of status word counts separately for posts by users who are above-median on that trait and posts by users who are below-median on it.

> Note: This can be done several ways -- using seaborn or matplotlib, and as overlapping histograms or as side-by-side or stacked histograms.
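
The overlapping-histogram variant might look like the matplotlib sketch below. The data, the choice of `cOPN`, and the bin edges are all invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows; cOPN is assumed to be coded 0/1.
toy = pd.DataFrame({
    'cOPN': [1, 1, 1, 0, 0, 0],
    'status_word_count': [3, 8, 15, 2, 4, 6],
})

above = toy.loc[toy['cOPN'] == 1, 'status_word_count']
below = toy.loc[toy['cOPN'] == 0, 'status_word_count']

fig, ax = plt.subplots(figsize=(8, 4))

# Shared bin edges keep the two histograms directly comparable.
shared_bins = list(range(0, 21, 5))
ax.hist(above, bins=shared_bins, alpha=0.5, label='cOPN = 1 (above median)')
ax.hist(below, bins=shared_bins, alpha=0.5, label='cOPN = 0 (below median)')

ax.set_xlabel('Words per status')
ax.set_ylabel('Number of statuses')
ax.legend()
```

If the two groups contain very different numbers of posts, passing `density=True` to `hist` makes the shapes easier to compare.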

---

## EDA on Word Counts

### Vectorize the text

In order to perform EDA on word count data, we'll need to count-vectorize.

Create a dataframe that contains the count-vectorized text for each Facebook status in the dataset.

To do this, you might follow these steps:
- Instantiate a `CountVectorizer` object
- Fit the count vectorizer on the Facebook statuses
- Store the transformed data
- Convert to a dataframe and store
    - Don't forget that the transformed data will need to be 'densified'. The `toarray()` or `todense()` methods will allow this.
    - Don't forget that the `get_feature_names_out()` method on a fitted `CountVectorizer` object will bring you back the words learned from the dataset, which you can set as the `columns` argument when creating the dataframe. (Older scikit-learn versions called this method `get_feature_names()`; it was removed in version 1.2.)
    
It's up to you whether or not to keep stopwords in the dataset.

### Show the 15 most common words
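
One way to rank words is to sum each column of the count dataframe and sort. The sketch below uses an invented count table and takes the top 2 rather than 15.

```python
import pandas as pd

# Invented count-vectorized data: rows are statuses, columns are words.
toy_counts = pd.DataFrame({
    'cat': [1, 0, 0],
    'dog': [0, 1, 1],
    'the': [2, 1, 1],
    'sat': [1, 1, 1],
})

# Column sums give total occurrences of each word across all statuses.
top_words = toy_counts.sum().sort_values(ascending=False).head(2)

print(top_words)
```

The resulting Series can be passed straight to a plotting call such as `top_words.plot(kind='barh')`.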

### Show the frequency of the 15 most common words as a bar chart

### Investigating `propname`

The word `propname` shows up frequently in this dataset. Show 10 statuses in the dataset that contain `propname`:
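
A case-insensitive substring filter is one way to pull these out. The statuses below are invented, and the way the token is written in them is an assumption for illustration; check how it actually appears in the raw text.

```python
import pandas as pd

# Invented statuses; the spelling of the placeholder token is assumed.
toy = pd.DataFrame({'STATUS': [
    'had lunch with PROPNAME today',
    'so tired',
    'PROPNAME and PROPNAME are visiting',
]})

# case=False matches 'propname', 'PROPNAME', etc.; na=False treats missing text as no match.
mask = toy['STATUS'].str.contains('propname', case=False, na=False)

print(toy.loc[mask, 'STATUS'].head(10))
```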

#### Provide a short explanation of what you believe `propname` to be:

> Note: The attached PDF also contains an explanation.

-----

## Most common words based on personality category

In order to do more targeted EDA, we'll need to be able to reference not only the dataframe of vectorized statuses, but also the personality columns from the original dataframe.

### Create a new dataframe with the vectorized text _and_ the personality category columns

> Note: One way to do this is by using [`pd.concat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html).
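
A sketch of the `pd.concat` approach on invented data, with a single trait column for brevity:

```python
import pandas as pd

# Invented original data and matching count-vectorized data.
toy = pd.DataFrame({'STATUS': ['cat sat', 'dog sat'], 'cAGR': [1, 0]})
toy_counts = pd.DataFrame({'cat': [1, 0], 'dog': [0, 1], 'sat': [1, 1]})

trait_cols = ['cAGR']  # in the lab: all five personality columns

# axis=1 places the frames side by side, aligning rows on the index.
combined = pd.concat([toy_counts, toy[trait_cols]], axis=1)

print(combined)
```

Because `pd.concat` aligns on the index, both dataframes need the same index. If you filtered or re-sorted rows earlier, calling `reset_index(drop=True)` on the original dataframe before vectorizing avoids misaligned rows and unexpected `NaN` values.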

### Show the 25 most common words for statuses from high-cAGR users

### Show the 25 most common words for statuses from low-cAGR users
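
For both the high- and low-`cAGR` questions, one approach is to filter rows by the trait value and then sum the word columns. The sketch uses an invented combined dataframe and the top 2 words instead of 25.

```python
import pandas as pd

# Invented combined data: three word columns plus one trait column.
combined = pd.DataFrame({
    'love': [2, 1, 0, 0],
    'work': [0, 0, 2, 1],
    'today': [1, 1, 1, 3],
    'cAGR': [1, 1, 0, 0],
})

# Exclude the trait column so it isn't summed as if it were a word.
word_cols = [c for c in combined.columns if c != 'cAGR']

high_agr_top = (
    combined.loc[combined['cAGR'] == 1, word_cols]
    .sum().sort_values(ascending=False).head(2)
)
low_agr_top = (
    combined.loc[combined['cAGR'] == 0, word_cols]
    .sum().sort_values(ascending=False).head(2)
)

print(high_agr_top)
print(low_agr_top)
```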

-----

### (BONUS) Most common bigrams:

This is a bonus section and not required.

Find the 10 most common bigrams in the dataset.

> Note: The easiest way to do this involves instantiating a new `CountVectorizer`.

### (BONUS) Most common trigrams:

This is a bonus section and not required.

Find the 10 most common trigrams in the dataset.

> Note: The easiest way to do this involves instantiating a new `CountVectorizer`.

---

## Choose your own adventure

By now you've looked at a lot of visualizations and frequency counts.

Come up with 2-3 questions about the data, and try to answer them using descriptive statistics (like counts, averages, etc.) or visualizations.

Some questions you might explore:

- Have numbers been redacted, or are phone numbers, house numbers, or zip codes anywhere in the dataset?
- `PROPNAME` has been used to redact personal names. Given that this data was scraped between 2009 and 2011, investigate whether any public figures or famous people show up in the dataset, or whether their names have been redacted as well.
- Is the count of uppercase letters vs. lowercase letters per status related to any personality category or personality score?
- Is _average_ word count per status related to any personality category or personality metric?
- Is punctuation use related to personality?

Or, of course, come up with your own questions to investigate.

The focus here is on "explore" -- you might not find anything of particular interest, but don't let that discourage you.
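
As a possible starting point for the prompts about numbers and uppercase letters, pandas string methods accept regular expressions. The statuses below are invented, and the patterns are deliberately simple (for example, `[A-Z]` only counts unaccented ASCII capitals).

```python
import pandas as pd

# Invented statuses for illustration.
toy = pd.DataFrame({'STATUS': [
    'Call me at 555-0100',
    'SO EXCITED',
    'nothing here',
]})

# True where the status contains at least one digit.
toy['has_digit'] = toy['STATUS'].str.contains(r'\d', regex=True)

# Number of uppercase ASCII letters in each status.
toy['n_upper'] = toy['STATUS'].str.count(r'[A-Z]')

print(toy)
```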