## Table of Contents:
* [Import Data and Packages](#first-bullet)
* [Counts of Words in Titles for Each Corpus](#second-bullet)
* [Combine DataFrames for Baseline Accuracy](#third-bullet)

## Import Data and Packages <a class="anchor" id="first-bullet"></a>

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In this notebook, we work with the Reddit data that was saved to CSV files in the "1_import_data" notebook. These CSV files are imported as DataFrames.

In [4]:
# import data
import pandas as pd
import numpy as np
tennis_df = pd.read_csv('./table_tennis_data.csv')
canoe_df = pd.read_csv('./canoe_data.csv')
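
The `Unnamed: 0` column that shows up later in `full_df.head()` suggests the CSVs were written with their index included. The sketch below uses a hypothetical two-row frame and an in-memory buffer (not the real files) to show how that column arises and how `index_col=0` avoids it. Whether the original files were saved this way is an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for one of the saved subreddit files
toy_df = pd.DataFrame({
    "post_sub": ["canoeing", "tabletennis"],
    "post_title": ["River trip", "New paddle"],
})

buffer = io.StringIO()
toy_df.to_csv(buffer)  # by default the index is written as an unnamed first column

buffer.seek(0)
with_extra = pd.read_csv(buffer)            # index comes back as a column named 'Unnamed: 0'

buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)    # first column is used as the index instead

print(list(with_extra.columns))
print(list(clean.columns))
```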

## Counts of Words in Titles for Each Corpus <a class="anchor" id="second-bullet"></a>

To explore the data, we want to see which words are most common in the titles of each sub-reddit. `CountVectorizer` turns the titles into a matrix of token counts, with one row per title and one column per word. We then sum each column to get a total count per word, and sort the totals to reveal the 20 most common words.

### Table Tennis

In [64]:
# Instantiate a CountVectorizer
cvec = CountVectorizer()

In [65]:
# Fit the vectorizer on our corpus
X_train = cvec.fit_transform(tennis_df['post_title'])

In [66]:
# Convert X_train into a DataFrame
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cvec_df = pd.DataFrame(X_train.toarray(), columns=cvec.get_feature_names_out())

### Count of words in tennis corpus, all words

Below are the 20 most common terms in the table tennis sub-reddit titles. Many of them are common words (is, on, this) that tell us little about the content of the text, so they should be removed.

In [67]:
pd.DataFrame(cvec_df.sum().sort_values(ascending=False), columns = [ 'Count']
            ).head(20)

| | Count |
|---|---|
| table | 235 |
| the | 204 |
| tennis | 192 |
| to | 177 |
| for | 145 |
| of | 104 |
| and | 104 |
| is | 84 |
| in | 83 |
| on | 80 |


Next, we create a new instance of the count vectorizer. This time, we add a parameter that removes common English stop words, using scikit-learn's built-in list.

In [68]:
# Instantiate a CountVectorizer
cvec = CountVectorizer(stop_words='english')
# Fit the vectorizer on our corpus
X_train = cvec.fit_transform(tennis_df['post_title'] )

In [69]:
# Convert X_train into a DataFrame
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cvec_df = pd.DataFrame(X_train.toarray(), columns=cvec.get_feature_names_out())

### Count of words in tennis corpus, English stop words removed

Below are the resulting 20 most common words from the table tennis sub-reddit titles, without the common English stop-words.  Many of the words are very specific to table tennis.

In [70]:
pd.DataFrame(cvec_df.sum().sort_values(ascending=False), columns = [ 'Count']
            ).head(20)

| | Count |
|---|---|
| table | 235 |
| tennis | 192 |
| advice | 71 |
| rubber | 71 |
| 2018 | 60 |
| paddle | 59 |
| weekly | 47 |
| new | 46 |
| vs | 41 |
| best | 40 |


### Canoeing

Now that we have an idea of the common words in the table tennis forum, we will run the same steps on the canoeing forum. First, we create a sparse count matrix with `CountVectorizer`, keeping all words. Second, we create another sparse matrix, this time removing English stop words. Each matrix is summed by column and turned into a table to reveal the 20 most common words.

In [74]:
# Instantiate a CountVectorizer
cvec = CountVectorizer()
# Fit the vectorizer on our corpus
X_train = cvec.fit_transform(canoe_df['post_title'] )

In [75]:
# Convert X_train into a DataFrame
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cvec_df = pd.DataFrame(X_train.toarray(), columns=cvec.get_feature_names_out())

In [76]:
pd.DataFrame(cvec_df.sum().sort_values(ascending=False), columns = [ 'Count']
            ).head(20)

| | Count |
|---|---|
| canoe | 369 |
| the | 353 |
| to | 210 |
| in | 198 |
| for | 186 |
| on | 171 |
| my | 155 |
| of | 126 |
| and | 122 |
| this | 119 |


In [79]:
# Instantiate a CountVectorizer
cvec = CountVectorizer(stop_words='english')
# Fit the vectorizer on our corpus
X_train = cvec.fit_transform(canoe_df['post_title'] )

In [80]:
# Convert X_train into a DataFrame
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cvec_df = pd.DataFrame(X_train.toarray(), columns=cvec.get_feature_names_out())

Just like we saw with the table tennis sub-reddit, many of the most common words in the corpus are specific to the canoeing activity.

In [81]:
pd.DataFrame(cvec_df.sum().sort_values(ascending=False), columns = [ 'Count']
            ).head(20)

| | Count |
|---|---|
| canoe | 369 |
| river | 103 |
| canoeing | 90 |
| paddle | 87 |
| trip | 84 |
| lake | 73 |
| old | 62 |
| day | 50 |
| new | 48 |
| just | 40 |


## Combine DataFrames for Baseline Accuracy <a class="anchor" id="third-bullet"></a>

To determine the baseline accuracy for the model we build, we need to see how many posts we have from each sub-reddit. Below, we build a DataFrame that combines the canoeing and table tennis posts. Then we will see what share of the posts comes from each, expressed as a proportion.

In [23]:
full_df = pd.concat([tennis_df, canoe_df], axis=0)

In [24]:
full_df = full_df.reset_index(drop=True)

In [25]:
full_df.head()

| | Unnamed: 0 | post_sub | post_title |
|---|---|---|---|
| 0 | 0 | tabletennis | Need a new paddle? Check here first! |
| 1 | 1 | tabletennis | Weekly Table Tennis Advice - March 24, 2019 |
| 2 | 2 | tabletennis | Need opinions on current setup and maybe recom... |
| 3 | 3 | tabletennis | How to stop getting tendonitis and tennis elbo... |
| 4 | 4 | tabletennis | Should I master the pendulum serve first or ca... |


In [26]:
full_df.post_sub.value_counts(normalize=True)

canoeing       0.501641
tabletennis    0.498359
Name: post_sub, dtype: float64

Rounded to two digits, we have 50% canoeing posts and 50% table tennis posts. If our model comes back with an accuracy of about 50%, then we know it is performing no better than randomly guessing which sub-reddit a post belongs to, or always predicting the slightly larger class.
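
The baseline can be read off as the share of the most common class, which is the accuracy of a model that always predicts that class. The sketch below shows the calculation on a small, deliberately unbalanced, made-up frame (hypothetical data, not the real posts) so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical toy labels: 3 canoeing posts and 2 table tennis posts
toy_df = pd.DataFrame({"post_sub": ["canoeing"] * 3 + ["tabletennis"] * 2})

# Proportion of posts per sub-reddit
class_shares = toy_df["post_sub"].value_counts(normalize=True)

# Always predicting the majority class would be right this often
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

print(majority_class, baseline_accuracy)
```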

Because we have the sub-reddit name and post title for roughly 1,000 posts from each of our two sub-reddits, this data is appropriate for answering our research question: can we classify posts using NLP?