# EDA & Pre-Processing

### Notebook Summary
In this notebook, I will perform basic exploratory data analysis (EDA) on the text data that I pulled from Reddit in the previous notebook. I will also prepare the data for more extensive modeling in the following notebooks by assembling the relevant text into a structured pandas dataframe, scrubbing unwanted characters, and preparing vectorizers.

In [58]:
import pandas as pd
import json, re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import numpy as np
import pickle

%matplotlib inline

First I will load in the json files containing the raw Reddit data for all the posts I collected from the baseball and Dodgers subreddits in the previous notebook.

In [26]:
with open('../data/1540850144_raw_submissions.json', 'r') as f:
    baseball_raw = json.load(f)

with open('../data/1499265759_raw_submissions.json', 'r') as f:
    dodgers_raw = json.load(f)

## EDA - Characters & Comments

For some initial exploration I want to examine the character counts of the posts from both subreddits. I also want to examine the number of comments on each post to see how much engagement it drives. The function in the following cell will receive the list of posts loaded from a json file, iterate through every post, and append each post's character count and comment count to a respective list. Once every post has been processed, the function will combine the two lists into a dataframe and return it. Posts that lack the `selftext` key entirely will trigger a KeyError, so the function will catch that error and increment a counter to keep track of how many posts were skipped. The function will print the value of this counter before returning the dataframe.

In [27]:
def chars_and_comments(posts):
    """Return a dataframe of body-text character counts and comment counts."""
    char_count_list = []
    comment_count_list = []
    KeyError_counter = 0
    for post in posts:
        try:
            # Read both fields before appending so the two lists stay aligned
            char_count = len(post['selftext'])
            num_comments = post['num_comments']
        except KeyError:
            # Posts missing either field are skipped and counted
            KeyError_counter += 1
            continue
        char_count_list.append(char_count)
        comment_count_list.append(num_comments)
    print(f'There were {KeyError_counter} KeyErrors.')
    return pd.DataFrame({'char_count': char_count_list,
                         'num_comments': comment_count_list})

In the next cell I will pass the Dodgers and baseball jsons into the `chars_and_comments` function.

In [28]:
dodgers_char_comm = chars_and_comments(dodgers_raw)
baseball_char_comm = chars_and_comments(baseball_raw)

There were 892 KeyErrors.
There were 3 KeyErrors.


Only 3 of the 20,000 baseball posts triggered errors. I'm observing a fair number of KeyErrors from the Dodgers posts, but since the original quantity of posts was so large I still have more than 95% of the Dodgers posts to work with. Furthermore, the number of discarded posts is not so large as to result in unbalanced classes.
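
The retention figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes each subreddit contributed 20,000 raw posts: that total is stated for baseball, and for the Dodgers it is inferred from the 19,108 rows in the Dodgers summary statistics plus the 892 KeyErrors. The variable names are illustrative only.

```python
# Assumed raw totals: 20,000 posts per subreddit (stated for baseball,
# inferred for the Dodgers from 19,108 retained rows + 892 KeyErrors).
raw_total = 20_000
key_errors = {'dodgers': 892, 'baseball': 3}

retained = {name: raw_total - errors for name, errors in key_errors.items()}
retained_share = {name: kept / raw_total for name, kept in retained.items()}

# Share of the combined dataset that comes from the Dodgers subreddit
dodgers_class_share = retained['dodgers'] / sum(retained.values())

for name in retained:
    print(f'{name}: {retained[name]} posts kept ({retained_share[name]:.2%})')
print(f'Dodgers share of all retained posts: {dodgers_class_share:.2%}')
```

Under these assumptions the Dodgers subreddit keeps a little over 95% of its posts and makes up just under half of the combined data.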

Now I'll take a look at the summary statistics for the Dodgers character counts and comment counts.

In [67]:
dodgers_char_comm.describe()

| | char_count | num_comments |
|---|---|---|
| count | 19108.0 | 19108.0 |
| mean | 207.302177 | 50.424848 |
| std | 795.807781 | 385.298749 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 1.0 |
| 50% | 0.0 | 6.0 |
| 75% | 138.0 | 17.0 |
| max | 14358.0 | 18846.0 |


There's unfortunately a very wide range of values here, and the statistics appear to be heavily skewed by the large number of posts with no body text. I'll take a closer look, this time at only those posts that have some text beyond the title.

In [65]:
dodgers_char_comm[dodgers_char_comm['char_count'] > 0].describe(
    percentiles=[.25, .5, .67, .75, .9])

| | char_count | num_comments |
|---|---|---|
| count | 9121.0 | 9121.0 |
| mean | 434.286811 | 88.107664 |
| std | 1108.259412 | 553.676147 |
| min | 1.0 | 0.0 |
| 25% | 28.0 | 1.0 |
| 50% | 153.0 | 6.0 |
| 67% | 278.0 | 13.0 |
| 75% | 379.0 | 18.0 |
| 90% | 894.0 | 49.0 |
| max | 14358.0 | 18846.0 |


This is encouraging from the perspective of social media outreach. Over half of the posts have no body text at all, and of those that do, two-thirds of them are within Twitter's 280-character limit. I'll take a look at the baseball posts' statistics.
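
The two proportions discussed above (posts with no body text, and posts with body text that fit within the character limit) can also be computed directly rather than read off the percentiles. The sketch below uses a small made-up dataframe in place of `dodgers_char_comm`; the values and the `TWEET_LIMIT` name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical character counts standing in for dodgers_char_comm['char_count']
toy = pd.DataFrame({'char_count': [0, 0, 0, 12, 150, 279, 280, 281, 900, 5000]})

TWEET_LIMIT = 280  # Twitter's character limit cited in the text

# Fraction of all posts that have no body text
no_body_share = (toy['char_count'] == 0).mean()

# Among posts with body text, fraction that fits within the limit
with_body = toy[toy['char_count'] > 0]
within_limit_share = (with_body['char_count'] <= TWEET_LIMIT).mean()

print(f'No body text: {no_body_share:.0%}')
print(f'Within {TWEET_LIMIT} characters (posts with body text): {within_limit_share:.0%}')
```

Taking the mean of a boolean series gives the fraction of `True` values, which avoids choosing percentile cut points by hand.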

In [68]:
baseball_char_comm.describe()

| | char_count | num_comments |
|---|---|---|
| count | 19997.0 | 19997.0 |
| mean | 261.506576 | 42.79802 |
| std | 1221.104661 | 92.357879 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 3.0 |
| 50% | 0.0 | 15.0 |
| 75% | 89.0 | 47.0 |
| max | 35914.0 | 3747.0 |


Similar to the Dodgers posts, most of the baseball posts have no body text beyond their titles. I'll take a look at the numbers once I filter out the title-only posts.

In [69]:
baseball_char_comm[baseball_char_comm['char_count'] > 0].describe(
    percentiles=[.25, .5, .67, .75, .9])

| | char_count | num_comments |
|---|---|---|
| count | 6602.0 | 6602.0 |
| mean | 792.085277 | 41.324599 |
| std | 2023.995852 | 78.420515 |
| min | 1.0 | 0.0 |
| 25% | 93.0 | 5.0 |
| 50% | 246.0 | 19.0 |
| 67% | 435.0 | 36.0 |
| 75% | 592.0 | 48.0 |
| 90% | 1773.0 | 98.0 |
| max | 35914.0 | 2230.0 |


Comparing these numbers to the Dodgers posts' numbers is still encouraging from a social media perspective. Dodgers fans seem to be less dependent on multimedia engagement, since a larger share of their posts contain body text (about 48% versus about 33% for the baseball subreddit). And while the baseball character counts quickly balloon beyond Twitter's character limit, more of the Dodgers character counts fall below 280 characters.
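
As a quick check on the body-text comparison, the shares can be worked out from the row counts in the four summary tables. This is only a sketch of the arithmetic; the dictionary layout and names are illustrative.

```python
# Row counts copied from the describe() outputs above
counts = {
    'dodgers': {'all_posts': 19108, 'with_body_text': 9121},
    'baseball': {'all_posts': 19997, 'with_body_text': 6602},
}

# Fraction of each subreddit's retained posts that contain body text
body_text_share = {
    name: c['with_body_text'] / c['all_posts'] for name, c in counts.items()
}

for name, share in body_text_share.items():
    print(f'{name}: {share:.1%} of posts contain body text')
```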

## Pre-Processing

The `combine_text` function below will take a json of posts and iterate through it, extracting the values in the `title` and `selftext` features for each post and combining them into a single string for easier NLP analysis.

In [31]:
def combine_text(posts):
    text_list = []
    KeyError_counter = 0
    for i in range(len(posts)):
        try:
            text_list.append(' '.join([posts[i]['title'], posts[i]['selftext']]))
        except KeyError:
            KeyError_counter += 1
    print(f'There were {KeyError_counter} KeyErrors.')
    return text_list

In the next cell I will pass the Dodgers and baseball jsons into the `combine_text` function.

In [32]:
dodgers_text = combine_text(dodgers_raw)
baseball_text = combine_text(baseball_raw)

There were 892 KeyErrors.
There were 3 KeyErrors.


Now that I've isolated the relevant text from each subreddit, I will pass each list of text into its own dataframe. Then I will add a target `dodgers` column to each dataframe to differentiate between the positive (Dodgers) and negative (baseball) classes. This target column will be filled with 1's in the `dodgers_df` dataframe and with 0's in the `baseball_df` dataframe.

Once those two dataframes have been created, I will concatenate them into a combined dataframe `df`.

In [33]:
dodgers_df = pd.DataFrame(dodgers_text, columns=['text'])
dodgers_df['dodgers'] = 1

baseball_df = pd.DataFrame(baseball_text, columns=['text'])
baseball_df['dodgers'] = 0

df = pd.concat([dodgers_df, baseball_df], ignore_index=True)
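
The same label-and-concatenate pattern is sketched below on a few made-up posts, followed by a class-balance check that could be run on the real `df` as well. The toy strings and the `toy_*` names are placeholders, not data from the subreddits.

```python
import pandas as pd

# Hypothetical stand-ins for dodgers_text and baseball_text
toy_dodgers = ['kershaw start tonight', 'lineup posted']
toy_baseball = ['trade rumors', 'rule change discussion', 'all-star voting']

toy_pos = pd.DataFrame(toy_dodgers, columns=['text'])
toy_pos['dodgers'] = 1  # positive class
toy_neg = pd.DataFrame(toy_baseball, columns=['text'])
toy_neg['dodgers'] = 0  # negative class

# ignore_index=True renumbers the rows 0..n-1 in the combined frame
toy_df = pd.concat([toy_pos, toy_neg], ignore_index=True)

# Proportion of each class in the combined frame
class_balance = toy_df['dodgers'].value_counts(normalize=True)
print(class_balance)
```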

To start the text cleaning I will map a lambda function to the dataframe to change all of its text to a uniform lowercase.

In [34]:
df['text'] = df.text.map(lambda x: x.lower())

I don't want the model to be biased by the unbalanced use of the words "Dodger" or "Dodgers" in the Dodgers posts. To balance the influence of every team mention, I will replace every occurrence of any team name with a dummy word. In the following cell I will use a for loop to iterate through a set containing the names of all Major League Baseball teams and any common variations of those names. For each team name the loop will map a lambda function to the dataframe's `text` column and replace every instance of that team's name with the dummy word.

In [36]:
mlb_teams = {'diamondbacks', 'diamondback', 'dbacks', 'dback',
             'braves', 'orioles', 'oriole', 'sox', 'cubs', 'reds',
             'indians', 'indian', 'rockies', 'tigers', 'tiger',
             'astros', 'astro', 'royals', 'royal', 'angels', 'angel',
             'dodgers', 'dodger', 'marlins', 'marlin', 'brewers', 'brewer',
             'twins', 'twin', 'yanks', 'yankees', 'yankee', 'mets',
             'athletics', 'phillies', 'pirates', 'pirate',
             'padres', 'padre', 'giants', 'giant', 'mariners', 'mariner',
             'cardinals', 'cardinal', 'rays', 'ray', 'rangers', 'ranger',
             'jays', 'nationals'}

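for team in mlb_teams:
    # \b limits each match to whole words, so 'ray' does not rewrite 'pray'
    # and 'angel' does not rewrite part of 'angeles'
    team_pattern = re.compile(rf'\b{team}\b')
    df['text'] = df.text.map(lambda x: team_pattern.sub('team_ref', x))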
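
When replacing team names, it matters whether a name is matched as a substring or as a whole word. The sketch below compares the two on a made-up sentence, using a small subset of the team names. The sample text and the helper names are assumptions for illustration.

```python
import re

# A small subset of the team names, for illustration only
teams_subset = {'dodgers', 'dodger', 'rays', 'ray', 'angels', 'angel'}

# Longest names first so 'dodgers' is tried before 'dodger'
ordered = sorted(teams_subset, key=len, reverse=True)
team_pattern = re.compile(rf"\b(?:{'|'.join(ordered)})\b")

sample = 'the dodgers pray for a sweep in los angeles against the rays'

# Plain substring replacement also rewrites 'pray' and 'angeles'
naive = sample
for team in ordered:
    naive = naive.replace(team, 'team_ref')

# Word-boundary replacement only touches whole-word matches
bounded = team_pattern.sub('team_ref', sample)

print(naive)
print(bounded)
```

Substring replacement turns "pray" into "pteam_ref" and "angeles" into "team_refes", while the word-boundary pattern leaves both words intact.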

Next I will clean the dataframe's text by mapping a trio of lambda functions with regex strings. The functions will search for text patterns that match the regex strings and remove them from the dataframe.

The first function will remove instances of \[removed\] and \[deleted\].

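The second function will remove other bracketed tags. For example, in posts that reference articles and breaking news, the name of the journalist who is reporting the story will often appear in brackets at the front of the post title. Note that the pattern matches a single bracketed token made up of letters, digits, or underscores, wherever it appears in the text. A bracketed tag that contains spaces or punctuation is not removed as a unit; only its brackets are stripped, by the third function.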

The third and final function will remove any remaining non-letter characters from the dataframe.

In [37]:
# Raw strings keep the regex backslashes from being read as string escapes
df['text'] = df.text.map(lambda x: re.sub(r'\[(removed|deleted)\]', ' ', x))

df['text'] = df.text.map(lambda x: re.sub(r'\[([A-Za-z0-9_]+)\]', ' ', x))

df['text'] = df.text.map(lambda x: re.sub(r'[^a-zA-Z]', ' ', x))
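
To see what the three substitutions do in sequence, here is a small sketch that applies the same patterns to one made-up post. The sample string is hypothetical, and the final whitespace collapse is added only to make the printed result easier to read.

```python
import re

# Hypothetical post text, already lowercased as in the notebook
sample = '[removed] [rosenthal] team_ref sign rhp for 3 years, $45m!'

step1 = re.sub(r'\[(removed|deleted)\]', ' ', sample)  # drop [removed]/[deleted]
step2 = re.sub(r'\[([A-Za-z0-9_]+)\]', ' ', step1)     # drop single-token tags
step3 = re.sub(r'[^a-zA-Z]', ' ', step2)               # non-letters become spaces

# Collapse repeated whitespace only for display
cleaned = ' '.join(step3.split())
print(cleaned)
```

The last pattern also replaces the underscore in the dummy word, so `team_ref` reaches the vectorizers as the two tokens `team` and `ref`.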

The text is all prepped for vectorizing and modeling in the next notebooks. I will finish by storing the text in `X` and the classification targets in `y`.

In [38]:
X = df.drop(columns='dodgers')
y = df['dodgers']

Now that the text and target data are ready, I will pickle them out so they can be easily loaded into other notebooks for modeling.

In [39]:
# with open('../data/X_data.pkl', 'wb+') as f:
#     pickle.dump(X, f)
# with open('../data/y_data.pkl', 'wb+') as f:
#     pickle.dump(y, f)

## Preparing the Vectorizers

Before I can fit models to the text data, I need to convert the text into numeric data by using vectorizers. I will prepare the vectorizers now and pickle them out to be used in the modeling notebooks.

### Stopwords

When they are instantiated, each vectorizer will receive a list of stopwords. These stopwords will be ignored during vectorization so as to not bias the models with overly-influential words or confuse them with the noise of overly-common words.

As a starting point, I will import the standard English stopwords from the Natural Language Toolkit (NLTK). These are common English words.

In [61]:
stopwords = nltk.corpus.stopwords.words('english')

I also want the vectorizers to ignore references to the cities of MLB teams and the abbreviations for each team. I have also included common words that are specific to this domain, like "baseball" and "team." "Pgt" occurs frequently in the subreddits as an abbreviation for "post-game thread," so I've included this in a list of additional stopwords.

I will extend the default list of stopwords with the list of custom stopwords.

In [62]:
custom_stopwords = ['arizona', 'atlanta', 'baltimore', 'boston', 'chicago',
                    'cincinnati', 'cinci', 'cleveland', 'colorado', 'detroit',
                    'houston', 'kansas', 'los', 'angeles', 'la', 'miami',
                    'milwaukee', 'minnesota', 'york', 'oakland', 'philadelphia',
                    'philly', 'pittsburgh', 'san', 'diego', 'francisco', 'fran',
                    'seattle', 'st', 'louis', 'tampa', 'texas', 'toronto',
                    'washington', 'ari', 'atl', 'bal', 'bos', 'chi', 'chc', 'cws',
                    'cin', 'cle', 'col', 'det', 'hou', 'kc', 'lad', 'laa', 'mia',
                    'mil', 'min', 'ny', 'nyy', 'nym', 'oak', 'phi', 'pit', 'sd',
                    'sf', 'stl', 'tb', 'tex', 'tor', 'pgt', 'game', 'team',
                    'player', 'players', 'mlb', 'baseball', 'tonight']

stopwords.extend(custom_stopwords)

Now that the custom stopwords are ready, I can use them while instantiating the vectorizers.

### Count Vectorizer

`CountVectorizer` is a simple way of converting text to numeric data. When fit to a collection of text, it counts how often each word occurs in each post and builds a matrix with one row per post and one column per word. Setting `max_features` limits the columns to that many of the most frequent words across the whole collection.

In [63]:
cvec = CountVectorizer(max_features=500, stop_words=stopwords)
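
A toy example may help show what the fitted vectorizer produces. The sketch below fits a `CountVectorizer` on three made-up sentences with a much smaller `max_features` and a two-word stop list. The corpus and settings are assumptions chosen to keep the output readable.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = [
    'walk off homer in the ninth',
    'homer in the first inning',
    'rain delay in the ninth inning',
]

# max_features=3 keeps only the three most frequent remaining terms
toy_cvec = CountVectorizer(max_features=3, stop_words=['in', 'the'])
counts = toy_cvec.fit_transform(toy_corpus)

# Columns follow the alphabetical order of the kept vocabulary
print(sorted(toy_cvec.vocabulary_))
print(counts.toarray())
```

Each row is one document and each column is one kept term, so only "homer", "inning", and "ninth" survive here because each appears twice in the corpus.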

Now I'll pickle out the instantiated vectorizer for use in the next notebook.

In [64]:
with open('../assets/cvec.pkl', 'wb+') as f:
    pickle.dump(cvec, f)
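
Since the vectorizer is pickled before it has been fit, the file stores only its configuration; the vocabulary is learned later, when a modeling notebook fits it. The sketch below illustrates this with an in-memory round trip instead of a file, using an illustrative two-word stop list.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer

# Unfitted vectorizer with illustrative settings
original = CountVectorizer(max_features=500, stop_words=['game', 'team'])

# Serialize to bytes and restore, mirroring pickle.dump / pickle.load on a file
restored = pickle.loads(pickle.dumps(original))

# An unfitted vectorizer carries its constructor parameters but no vocabulary
print(restored.get_params()['max_features'])
print(hasattr(restored, 'vocabulary_'))
```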

### TF-IDF

In contrast to the `CountVectorizer`'s fairly straightforward approach, the `TfidfVectorizer` is a little more advanced. The "term frequency" (TF) measures how often a word appears in a post relative to the overall number of words in that post. The "inverse document frequency" (IDF) then gives added weight to words that appear in few posts across the corpus and less weight to words that appear in most of them. In this instantiation of the vectorizer, a word will need to appear in at least 5 posts from the entire corpus, but in no more than 95% of them, to be considered.

In [51]:
tfidf = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=.95)
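
The effect of the `min_df` and `max_df` cutoffs is easiest to see on a tiny corpus. The sketch below uses four made-up sentences and `min_df=2` rather than 5 so the filtering is visible. The corpus and thresholds are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    'homer wins series',
    'homer ties series',
    'homer rally falls short',
    'homer rally wins',
]

# min_df=2: drop terms found in fewer than 2 documents
# max_df=0.95: drop terms found in more than 95% of documents
toy_tfidf = TfidfVectorizer(min_df=2, max_df=0.95)
weights = toy_tfidf.fit_transform(toy_corpus)

print(sorted(toy_tfidf.vocabulary_))
print(weights.toarray().round(2))
```

Here "homer" is dropped because it appears in every document, and "ties", "falls", and "short" are dropped because each appears in only one.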

Now I'll pickle out the instantiated vectorizer for use in a later modeling notebook.

In [52]:
# with open('../assets/tfidf_vec.pkl', 'wb+') as f:
#     pickle.dump(tfidf, f)
# with open('../assets/stopwords.pkl', 'wb+') as f:
#     pickle.dump(stopwords, f)

I'm now ready to do some modeling on my data in the following notebooks.