# <span style="color:Purple">Project 3 :  Web APIs & NLP</span> <img src="../resources/reddit_logo.png" width="110" height="110" />
---
## <span style="color:Orange">EDA </span>      

#### Ryan McDonald


**Imports**

In [1]:
import numpy as np
import pandas as pd

from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer


## 1. Read the Data

In [2]:
subs = pd.read_csv('../datasets/submissions')
subs

Unnamed: 0,subreddit,title,author
0,VanLife,Treasure at the end of the rainbow. Gonzaga Ba...,TheKombiChronicles
1,VanLife,Boulder Colorado :),Germscout805
2,VanLife,If you like YouTube Poop you're gonna love my ...,roadkamper
3,VanLife,Love the boondocking near Ventura on the coast,josiahq
4,VanLife,What is this silver box? It's in a 1992 SMB,lucky2bthe1
...,...,...,...
7995,camping,Website for camping?,arstanash
7996,camping,Camping near Wild Willy's Hot Spring/eastern s...,SpicyChickenDinner
7997,camping,Camping in the boundary waters,_spicygin_
7998,camping,Favorite camping spot in El Dorado National Fo...,2themoonanback


### Baseline statistics

In [6]:
# Showing 'object datatypes'
subs.dtypes

subreddit    object
title        object
author       object
dtype: object

In [4]:
# looks like some titles aren't unique!
subs.describe()

Unnamed: 0,subreddit,title,author
count,8000,8000,8000
unique,2,7809,5217
top,VanLife,survival pen,[deleted]
freq,4000,5,156


In [5]:
# some posts are duplicated
subs['title'].value_counts(ascending = False).head(10)

survival pen                                                                                5
Glamping in Yamanashi, Japan a few weekends ago under a full moon with co-workers.          5
Live in 10 minutes                                                                          4
Live now                                                                                    4
Finishing the last pieces to take the van on its first voyage after 15 years of sitting!    4
My camping setup!!                                                                          4
The Modern Nomad | Van Life                                                                 4
August 2020. Bald Eagle State Forest, PA                                                    4
Park Life                                                                                   4
Arnous Village,Lebanon.                                                                     4
Name: title, dtype: int64
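The notebook keeps these repeated titles as they are. If they ever needed to be removed, a minimal sketch (on a made-up toy frame, not the real `subs` data) could count and drop rows that repeat an earlier subreddit/title pair. Treating the pair as the duplicate key is an assumption; the same title posted in both subreddits is kept.

```python
import pandas as pd

# Toy stand-in for `subs`; these rows are invented for illustration only.
toy = pd.DataFrame({
    'subreddit': ['VanLife', 'VanLife', 'camping', 'camping'],
    'title': ['Live now', 'Live now', 'Live now', 'My camping setup!!'],
    'author': ['a', 'a', 'b', 'c'],
})

# Count rows that repeat an earlier (subreddit, title) pair.
n_repeats = toy.duplicated(subset=['subreddit', 'title']).sum()

# Keep only the first occurrence of each (subreddit, title) pair.
deduped = toy.drop_duplicates(subset=['subreddit', 'title']).reset_index(drop=True)

print(n_repeats, len(deduped))
```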

In [6]:
# no missing 'title' entries.
subs['title'].isnull().sum()

0

In [7]:
# verifying equal distribution of data between each subreddit
subs['subreddit'].value_counts(normalize = True)

VanLife    0.5
camping    0.5
Name: subreddit, dtype: float64

In [8]:
# number of unique authors
subs['author'].nunique()

5217

In [37]:
subs['author'].value_counts(ascending = False)

# The top 'author' is [deleted], i.e. removed accounts pooled together, not one prolific poster.. drunkbackpacker???  Haha!

[deleted]             156
vanlifewithgpayne     144
ArchieandMe            42
drunkbackpacker        27
yellowmoose52          21
                     ... 
joannieoconnells14      1
Boonina                 1
linwemes                1
eggzndbacon             1
QuietInNature           1
Name: author, Length: 5217, dtype: int64

## 2. Preprocessing


**I'll start by breaking the titles down into single sentences for further processing. If the models don't perform as well as they could, I may re-segment the text into smaller or larger groups of words.**

    - VanLife and camping titles are broken down separately to preserve each sentence's subreddit label.
    - The full list of titles for each subreddit is split into sentences.
    - A new DataFrame is built from the sentence breakdown for further analysis.

In [10]:
# starting with VanLife Titles
vl_titles = list(subs['title'][0:4000])
len(vl_titles)

4000

In [11]:
# VanLife tokenized sentences
vl_sent = " ".join(vl_titles)
vl_stoken = sent_tokenize(vl_sent)
len(vl_stoken)

3485

In [12]:
# camping titles
cmp_titles = list(subs['title'][4000:8000])
len(cmp_titles)

4000

In [13]:
# camping tokenized sentences
cmp_sent = " ".join(cmp_titles)
cmp_stoken = sent_tokenize(cmp_sent)
len(cmp_stoken)

2886

**Tokenizing into sentences merges some titles: the titles are joined with spaces first, so a title with no closing punctuation runs into the next one. That leaves fewer unique data points, but it may, in turn, still produce a better model.**

    - Just to have it available, I'll preserve a list of the titles unaltered
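
For comparison, here is a sketch of an alternative that the notebook does not use: tokenizing each title on its own, so a sentence can never span two titles. The helper `split_titles` and the regex splitter `simple_sentences` are hypothetical names introduced for this example. The splitter is only a rough stand-in that keeps the snippet runnable without NLTK data, and `sent_tokenize` could be passed in its place.

```python
import re
import pandas as pd

def split_titles(df, tokenizer):
    """Split every title separately and return one row per sentence."""
    out = df[['subreddit', 'title']].copy()
    out['title'] = out['title'].apply(tokenizer)  # list of sentences per title
    # One row per list element; the subreddit label is repeated alongside it.
    return out.explode('title').reset_index(drop=True)

def simple_sentences(text):
    """Hypothetical stand-in for sent_tokenize: split after ., ! or ?"""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

# Toy rows for illustration only.
toy = pd.DataFrame({
    'subreddit': ['camping', 'camping', 'VanLife'],
    'title': ['Last summer. We were playing war.',
              'Website for camping?',
              'Boulder Colorado :)'],
})

per_title = split_titles(toy, simple_sentences)
print(per_title)
```

With this approach the unpunctuated title stays a single row instead of being glued to its neighbor, at the cost of a different row count than the joined-text version above.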
    
**Building a DataFrame with tokenized titles**

In [14]:
token_df = pd.DataFrame(
    {'subreddit':'camping',
    'title': cmp_stoken})
vl_df = pd.DataFrame(
    {'subreddit':'VanLife',
     'title':vl_stoken})

token_df = pd.concat([token_df, vl_df], ignore_index= True)
token_df.head()


Unnamed: 0,subreddit,title
0,camping,First tim camping Camping Tricks: A few of the...
1,camping,Last summer.
2,camping,We were playing war.
3,camping,I caught the moment the cards were read.
4,camping,Joshua Tree National Park back country!


In [15]:
token_df.shape

(6371, 2)

In [16]:
# Save to CSV!

token_df.to_csv('../datasets/tokenized_df', index= False )

### Baseline Score
**Given the amount of data pulled from Reddit and the low baseline score, I expect the models to perform much better than this.**

In [17]:
token_df['subreddit'].value_counts(normalize= True)
# This shows the baseline 'majority' case.
# 'Guessing' VanLife each time would be correct 54.7% of the time!

VanLife    0.54701
camping    0.45299
Name: subreddit, dtype: float64

**I prefer to start with the tokenized sentences because they skew the originally balanced (50/50) data. Having this baseline score in VanLife's favor may produce more interesting results down the line.**
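
The baseline is just the share of the largest class. As a small sketch, it can be rebuilt from the sentence counts reported earlier (3485 VanLife and 2886 camping sentences). The variable names here are my own and not part of the notebook.

```python
import pandas as pd

# Rebuild the label column from the sentence counts reported above.
labels = pd.Series(['VanLife'] * 3485 + ['camping'] * 2886, name='subreddit')

shares = labels.value_counts(normalize=True)
majority_class = shares.idxmax()    # label a "always guess the majority" model predicts
baseline_accuracy = shares.max()    # fraction of rows that guess gets right

print(majority_class, round(baseline_accuracy, 3))
```

Any classifier trained later has to beat this number to be doing better than always guessing the majority class.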

#### Quick Sentiment Check! (curiosity strikes)

**Does either subreddit as a whole have a more positive sentiment?**

In [18]:
# SIA's results depend on text 'length', and each subreddit is scored here
# as one long string, so this is for 'entertainment' purposes only!

sia = SentimentIntensityAnalyzer()

vl_text= '-'.join(subs['title'][0:4000])
camp_text = '-'.join(subs['title'][4000:8000])

In [19]:
sia.polarity_scores(vl_text)

{'neg': 0.031, 'neu': 0.832, 'pos': 0.138, 'compound': 1.0}

In [20]:
sia.polarity_scores(camp_text)

{'neg': 0.035, 'neu': 0.812, 'pos': 0.153, 'compound': 1.0}

**An interesting finding above! Both subreddits have VERY similar sentiment polarity scores. I assume that's because they are closely related and typically involve optimistic people. That was good to see! HOWEVER... a compound score of 1.0 for both most likely reflects the sheer amount of text rather than the subreddit, so to rule out any bias from total word count, I'll break the SIA down per title in the DataFrame below.**
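
Why do both compound scores land on exactly 1.0? As far as I understand VADER's implementation, the compound score is the summed word valence squashed into the range -1 to 1 by a normalization of the form `x / sqrt(x**2 + alpha)` with `alpha = 15`. The sketch below re-implements only that step, as an assumption-labeled illustration and not the library code itself. The raw sums fed in are made-up numbers.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into (-1, 1), VADER-style (assumed form)."""
    return score / math.sqrt(score * score + alpha)

# Made-up raw sums: a short title, a longer one, and thousands of joined titles.
for raw in (1.0, 4.0, 400.0):
    print(raw, round(normalize(raw), 4))
```

Once thousands of mostly positive titles are joined into one string, the raw sum is so large that the result rounds to 1.0 for either subreddit, which is why the per-title breakdown is more informative.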

### Applying SIA to the entire DataFrame!

In [24]:
sentiment = subs['title'].apply(sia.polarity_scores)
sentiment_df = pd.DataFrame(sentiment.tolist())
sentiment_df.sort_values(by= ['compound'], ascending = False)

Unnamed: 0,neg,neu,pos,compound
4823,0.000,0.404,0.596,0.9739
1953,0.000,0.627,0.373,0.9720
5726,0.000,0.619,0.381,0.9690
550,0.000,0.504,0.496,0.9669
2083,0.000,0.624,0.376,0.9650
...,...,...,...,...
7993,0.349,0.651,0.000,-0.8619
2239,0.526,0.474,0.000,-0.8750
2761,0.526,0.474,0.000,-0.8750
1217,0.526,0.474,0.000,-0.8750


**Average Sentiment Per Subreddit!**

In [34]:
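# Average VanLife Sentiment:
# Note: .loc slicing is label-inclusive, so 0:4000 also picks up row 4000
# (the first camping title); .iloc[0:4000] would select exactly the VanLife rows.
sentiment_df.loc[0:4000]['compound'].mean()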

0.14622314421394667

In [35]:
# Average Camping Sentiment:
sentiment_df.iloc[4000:8000]['compound'].mean()

0.15359095000000042

**It's not by much, but the 'camping' subreddit's mean sentiment (about 0.154) is higher than the 'VanLife' subreddit's (about 0.146). Perhaps a few posts about broken vans or getting lost were written in the 'VanLife' subreddit!**
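
The per-subreddit averages can also be computed without relying on row positions. This is a sketch on a made-up toy frame: the scores below are invented, and only the `groupby` pattern is the point.

```python
import pandas as pd

# Toy frame standing in for the sentiment scores with their subreddit labels.
toy = pd.DataFrame({
    'subreddit': ['VanLife', 'VanLife', 'camping', 'camping'],
    'compound': [0.0, 0.5, 0.25, 0.75],  # invented scores, not real results
})

# Mean compound score per subreddit, keyed by label instead of by slice bounds.
mean_by_sub = toy.groupby('subreddit')['compound'].mean()
print(mean_by_sub)
```

In the notebook this would mean first attaching the labels, for example `sentiment_df['subreddit'] = subs['subreddit']`, which assumes both frames share the same default index.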