# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png)  Project 3: Web APIs & Classification
Reddit's API:  Data Wrangling, Natural Language Processing, and Classification Modeling


This project covers three of the biggest concepts in Data Science:
- Data Wrangling/Acquisition
- Natural Language Processing
- Classification Modeling


---
## Technical Report:   *SubReddit Data Cleaning (NLP I)*
This notebook, just one component of the overall project, covers the collection, import, and cleaning of two subreddits of my choosing.

Part 1 of the project focuses on **Data wrangling/gathering/acquisition**. 
The expectation is that not all acquired data will be clean or in a structured/organized format (like a single .csv file or SQL table). While an API request for data is ideal, some scraping may be required if the website of interest does not have an API (or its API is poorly documented).

At the end of this notebook, the scraped and cleaned data is saved to .csv datasets, which can be referenced here:
- `dataset1.csv`:  [subreddit_Today_I_Learned](../data/dataset1.csv)
- `subreddit_Health.csv`:  [subreddit topic: Health](../data/subreddit_Health.csv)

Ultimately this data will be used with NLP to train a classifier on which subreddit a given post came from. **This is a binary classification problem**.



**Data Cleaning and EDA**
- Are missing values imputed/handled appropriately?
- Are distributions examined and described?
- Are outliers identified and addressed?
- Are appropriate summary statistics provided?
- Are steps taken during data cleaning and EDA framed appropriately?
- Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?

**Preprocessing and Modeling**
- Is text data successfully converted to a matrix representation?
- Are methods such as stop words, stemming, and lemmatization explored?
- Does the student properly split and/or sample the data for validation/training purposes?
- Does the student test and evaluate a variety of models to identify a production algorithm (**AT MINIMUM:** Bayes and one other model)?
- Does the student defend their choice of production model relevant to the data at hand and the problem?
- Does the student explain how the model works and evaluate its performance successes/downfalls?

**Evaluation and Conceptual Understanding**
- Does the student accurately identify and explain the baseline score?
- Does the student select and use metrics relevant to the problem objective?
- Does the student interpret the results of their model for purposes of inference?
- Is domain knowledge demonstrated when interpreting results?
- Does the student provide appropriate interpretation with regards to descriptive and inferential statistics?



Goal in modeling:  "generalize estimates well"

### X / Y relationship
 - "x & y relationship is the signal; all else is noise..."
 - X = "predictors"
 - y = "target" (the model's predictions are estimates of y)
 
         - regression:  y is continuous
         - classification: y is discrete  *(probability that y is one of two binary classes)*


### Bias / Variance Trade-off
- under-fit = high bias
- over-fit = high variance (too specific)

    variance mitigation:
    - more data
    - fewer features
    - REGULARIZATION  ("alpha" = the strength of regularization)
        - always scale features before you regularize
        

    

##### (3) types of model errors:
 - Bias:  "how bad our model is at predicting the true y" ("too simple")...
          - bias gets smaller as our PREDICTION gets closer to the true value
 - Variance:  "how bad our model is at generalizing to new data" / "how spread out our predictions are"
          - variance tends to get higher the more variables we have ("too complex")...
 - Irreducible:  noise in the data that no model can remove
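
The three error types above correspond to the standard decomposition of expected squared error. As a sketch, assume the data follow $y = f(x) + \varepsilon$ with $\mathrm{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and let $\hat{f}$ be the fitted model (these symbols are introduced here for illustration and do not appear elsewhere in the notebook):

$$
\mathrm{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathrm{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathrm{E}\big[(\hat{f}(x) - \mathrm{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$

The last term does not depend on $\hat{f}$, which is why no choice of model can reduce it.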
 
 

Goal in model evaluation:  Minimize errors to find the "line of best fit" (between features & target)

### (4) steps to build a model:
 - 1. instantiate model
 - 2. fit model to train data:
      - model.fit(X, y)
      - get coefficients (to determine bias)
      - get intercepts (to determine bias)
 - 3. generate predictions
      - model.predict(X)
 - 4. evaluate model
      - decide which model evaluation metrics (loss functions) to use...
          - y.mean = naive baseline prediction
          - MSE = loss function
          - if the model's MSE is smaller than the baseline's MSE, it means "on average" our residuals (errors) are smaller and we have a better fit...
          - if R2 is greater than 0 (the R2 of the baseline), we have a better fit...

         - regression:  evaluation metrics are for ERRORS
         - classification:  evaluation metrics are for ACCURACY
             - accuracy
             - misclassification
             - sensitivity ("recall")
             - specificity
             - precision
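
The classification metrics listed above can all be computed from the four cells of a confusion matrix. The following is a minimal sketch in plain Python; the function name `classification_metrics` and the example labels are hypothetical, and it assumes both classes appear in the true and predicted labels (otherwise a denominator would be zero).

```python
def classification_metrics(y_true, y_pred):
    """Return common binary-classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    return {
        "accuracy": accuracy,
        "misclassification": 1 - accuracy,
        "sensitivity": tp / (tp + fn),   # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Which metric matters most depends on the cost of each kind of mistake; accuracy alone can be misleading when the classes are unbalanced.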



### Logistic Regression
 - logit link function:  "the log of the odds of success"
 - the null model = the most frequent class
 - interpret the coefficient for a given variable by "exponentiating" it: `np.exp(lr.coef_)`
 - deal with UNBALANCED CLASSES via bias correction, weighting observations, stratified cross validation...
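
To make the "log of the odds" and the exponentiated coefficient concrete, here is a small sketch using only the standard library. The intercept and coefficient values are made up for illustration; they are not fitted to any data in this project.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1 / (1 + math.exp(-z))


def logit(p):
    """Map a probability to the log of the odds."""
    return math.log(p / (1 - p))


# Hypothetical fitted values for a one-feature model.
intercept = -1.0
coef = 0.7

# Predicted probabilities when the feature is 0 and when it is 1.
p_at_0 = sigmoid(intercept)
p_at_1 = sigmoid(intercept + coef)

# Odds = p / (1 - p). A one-unit increase multiplies the odds by exp(coef).
odds_at_0 = p_at_0 / (1 - p_at_0)
odds_at_1 = p_at_1 / (1 - p_at_1)
odds_ratio = math.exp(coef)

print(f"odds ratio from coefficient: {odds_ratio:.4f}")
print(f"odds ratio from predictions: {odds_at_1 / odds_at_0:.4f}")
```

The two printed values agree, which is the reason exponentiating a coefficient gives the multiplicative change in odds for a one-unit increase in that feature.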
 
 



---
## Natural language processing (NLP) 

Natural language processing (NLP) is the field concerned with getting computers to understand language the way humans do. It has many applications, including:
- voice-to-text services for people who are hard of hearing.
- text-to-voice services for people who have difficulty reading.
- automated chatbots for organizations.
- translation services.

Generally when we get text data, strings aren't broken out into individual words or even sentences. We might have a full tweet, full chapter of a book, or full .pdf file all in one long string.

Today, we're diving into the practical side of NLP - taking text data and breaking it out into words that we can then leverage into $n$-grams or $tf$-$idf$ vectorizers.

#### Agenda
- Pre-Processing
- Sentiment Analysis

#### Learning Objectives
- Define and implement tokenizing, lemmatizing, and stemming.
- Describe what RegEx does.
- Apply sentiment analysis.
- Preprocess text data.



## NLP I: Tokenizing/Lemmatization and Sentiment Analysis

In [2]:
# libraries required for Natural Language Processing
import nltk

# Import Tokenizer
from nltk.tokenize import RegexpTokenizer

# Import Regular Expressions
import regex as re

# import Lemmatizer
from nltk.stem import WordNetLemmatizer
# lemmatizer = WordNetLemmatizer()

# Import Stemmer
from nltk.stem.porter import PorterStemmer

# import BeautifulSoup
from bs4 import BeautifulSoup             

# Import stopwords.
from nltk.corpus import stopwords


# ------------------------------------
# Import CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

# ------------------------------------
# Import logistic regression.
from sklearn.linear_model import LogisticRegression


In [None]:
# module required for Regular Expressions
# !pip install regex


In [1]:
# required to make API requests
# import requests
# # required to throttle your scraping loop... 
# import time

In [2]:
# Python libraries used for data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Pre-Processing
When dealing with text data, there are some common pre-processing steps we might use. However, we won't necessarily use all of them every time we deal with text data.
- Tokenizing
- Regular Expression
- Lemmatizing/Stemming
- Cleaning (i.e. removing HTML)


#### Tokenizing
When we "tokenize" data, we take it and split it up into distinct chunks based on some pattern. A RegexpTokenizer splits a string into substrings using regular expressions.

In [None]:
# Instantiate Tokenizer
tokenizer = RegexpTokenizer(r'\w+') ## We'll talk about this in a moment.

In [None]:
# "Run" Tokenizer
spam_tokens = tokenizer.tokenize(spam.lower())

In [None]:
# Show Results
spam_tokens
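
The variable `spam` is not defined in this excerpt, so here is a self-contained sketch of the same tokenizing step. It uses a made-up `sample_text` and `re.findall` from the standard library, which for the pattern `\w+` should produce the same tokens as `RegexpTokenizer(r'\w+')`.

```python
import re

# Hypothetical text standing in for the `spam` string used in the cells above.
sample_text = "Congratulations! You have WON 2 free tickets. Reply now."

# Lowercase first, then keep every run of word characters as one token.
tokens = re.findall(r"\w+", sample_text.lower())

print(tokens)
```

Punctuation disappears because `\w` matches only letters, digits, and underscores.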

#### Regular Expressions
Regular Expressions, or RegEx, are an extraordinarily helpful way for us to detect patterns in text.
This is a tool of which you should be aware.

Using RegEx can be incredibly helpful if you want to find text matching a specific pattern.
- People used to use two spaces after a period to split sentences up; you could use RegEx to detect that pattern and tokenize on entire sentences.
- Chapters in a book could be titled "Chapter" followed by a number; you could use RegEx to detect that pattern and tokenize a book by its chapters.
- When Python libraries are upgraded, syntax changes! Perhaps you want to detect a certain pattern of syntax so you can update your code efficiently.
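
The first two ideas above can be sketched with the standard `re` module. The `book` string is invented for illustration, and the patterns are one possible way to express "Chapter followed by a number" and "two spaces after a period".

```python
import re

# Hypothetical text with chapter headings and two spaces between sentences.
book = "Chapter 1 It began.  Nothing happened. Chapter 2 It ended.  The end."

# Split on "Chapter <number>" and drop the empty piece before the first heading.
chapters = [part.strip() for part in re.split(r"Chapter \d+", book) if part.strip()]

# Split the first chapter wherever a period is followed by exactly two spaces.
# The lookbehind (?<=\.) keeps the period attached to its sentence.
sentences = re.split(r"(?<=\.) {2}", chapters[0])

print(chapters)
print(sentences)
```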

In [None]:
for i in spam_tokens:
    print(re.findall(r'\d+', i), i)

# In RegEx, \d+ matches one or more numeric digits. 
# Therefore, the above code searches each token in spam_tokens for numeric digits.
# The raw string (r'...') keeps Python from treating \d as a string escape.

In [None]:
# Instantiate tokenizer.
tokenizer_1 = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

# Run tokenizer.
tokenizer_1.tokenize(s)

# tokenizer_1 keeps words, dollar amounts (like $3.50), and any other run of
# non-space characters (such as punctuation) as separate tokens.

In [None]:
# Instantiate tokenizer.
tokenizer_2 = RegexpTokenizer(r'\s+', gaps=True)

# Run tokenizer.
tokenizer_2.tokenize(s)

# tokenizer_2 will identify the spaces. By setting gaps = True, we're grabbing everything else: thus, we're splitting our tokens up by spaces.

In [None]:
# Instantiate tokenizer.
tokenizer_3 = RegexpTokenizer(r'[A-Z]\w+')

# Run tokenizer.
tokenizer_3.tokenize(s)

# tokenizer_3 returns only runs that start with a capital letter followed by at
# least one more word character, i.e. words that begin with a capital letter.
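
The string `s` is not defined in this excerpt. As a sketch, the three patterns can be tried on a made-up sentence with the standard `re` module: `re.findall` stands in for a `RegexpTokenizer` with the default `gaps=False`, and `re.split` (with empty strings dropped) stands in for `gaps=True`.

```python
import re

# Hypothetical sentence standing in for `s`.
s = "Send $3.50 to Bob today. Thanks, Alice!"

# Pattern 1: words, dollar amounts, or any other run of non-space characters.
tokens_1 = re.findall(r"\w+|\$[\d\.]+|\S+", s)

# Pattern 2: split on whitespace, keeping everything in between.
tokens_2 = [t for t in re.split(r"\s+", s) if t]

# Pattern 3: a capital letter followed by one or more word characters.
tokens_3 = re.findall(r"[A-Z]\w+", s)

print(tokens_1)
print(tokens_2)
print(tokens_3)
```

Note how the first pattern keeps `$3.50` together but separates trailing punctuation, while the second leaves punctuation attached to the neighboring word.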

#### Lemmatizing & Stemming
- "He is running really fast!"
- "He ran the race."
- "He runs a five-minute mile."

If we wanted a computer to interpret these sentences, we might count up how many instances of each word we observe. The computer will treat words like "running," "ran," and "runs" differently, although they mean about the same thing (in this context).
Lemmatizing and stemming are two forms of shortening words so we can combine similar forms of the same word.

When we "lemmatize" data, we take words and attempt to return their lemma, or the base/dictionary form of a word.

Lemmatizing is usually the more correct and precise way of handling things from a grammatical/morphological point of view, but also might not have much of an effect.

In [None]:
# Instantiate lemmatizer. 
lemmatizer = WordNetLemmatizer()

In [None]:
# Lemmatize tokens.
tokens_lem = [lemmatizer.lemmatize(i) for i in spam_tokens]

In [None]:
# Compare tokens to lemmatized version.
list(zip(spam_tokens, tokens_lem))

In [None]:
# Print only those lemmatized tokens that are different.
for i in range(len(spam_tokens)):
    if spam_tokens[i] != tokens_lem[i]:
        print((spam_tokens[i], tokens_lem[i]))
        

In [None]:
# We can also do this on individual words.
# Lemmatize the word "computers."
lemmatizer.lemmatize("computers")

When we "stem" data, we take words and attempt to return a base form of the word. Stemming tends to be cruder than lemmatization. The stemmer used below implements the algorithm Porter published in 1980.

In [None]:
# Instantiate object of class PorterStemmer.
p_stemmer = PorterStemmer()

In [None]:
# Stem tokens.
stem_spam = [p_stemmer.stem(i) for i in spam_tokens]

In [None]:
# Compare tokens to stemmed version.
list(zip(spam_tokens, stem_spam))

In [None]:
# Print only those stemmed tokens that are different.

for i in range(len(spam_tokens)):
    if spam_tokens[i] != stem_spam[i]:
        print((spam_tokens[i], stem_spam[i]))

In [None]:
# We can also do this on individual words.

# Stem the word "computer."
p_stemmer.stem("computer")

# Stem the word "computation."
p_stemmer.stem("computation")
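
To illustrate why stemming is cruder than lemmatizing, here is a deliberately tiny toy comparison. Neither function is NLTK's implementation: `toy_lemmatize` is a hypothetical dictionary lookup standing in for WordNet, and `toy_stem` is a hypothetical suffix stripper that is far simpler than the Porter algorithm.

```python
# Toy lookup table standing in for a real lemma dictionary such as WordNet.
TOY_LEMMAS = {"running": "run", "ran": "run", "runs": "run"}


def toy_lemmatize(word):
    """Return the dictionary form if we know it, otherwise the word itself."""
    return TOY_LEMMAS.get(word, word)


def toy_stem(word):
    """Strip one of a few suffixes; this is NOT the Porter algorithm."""
    for suffix in ("ning", "ing", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


words = ["running", "ran", "runs"]
lemmas = [toy_lemmatize(w) for w in words]
stems = [toy_stem(w) for w in words]

print(list(zip(words, lemmas, stems)))
```

The suffix stripper cannot connect the irregular form "ran" to "run", whereas a dictionary-based approach can. One caveat when using the real tool: NLTK's `WordNetLemmatizer.lemmatize` treats a word as a noun unless a part-of-speech argument such as `pos='v'` is passed, so verb forms may come back unchanged by default.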

## Sentiment Analysis

Let's start with a very simple example.
Sentiment analysis is an area of natural language processing in which we seek to classify text as having positive or negative emotion.
Let's build a simple function that can classify text as either having positive or negative sentiment.
What words tell us whether certain text is positive?

In [None]:
# Let's come up with a list of positive and negative words we might observe.

positive_words = ['delight', 'good', 'great', 'awesome', 'tremendous', 'fabulous', 'amazing', 'stellar']
negative_words = ['garbage', 'sad', 'trash', 'ugly', 'bad', 'disgusting', 'terrible', 'gross']

In [None]:
# create function
def simple_sentiment(text):
    # Instantiate tokenizer.
    tokenizer = RegexpTokenizer(r'\w+')
    
    # Tokenize text.
    tokens = tokenizer.tokenize(text.lower())
    
    # Instantiate stemmer.
    p_stemmer = PorterStemmer()
    
    # Stem words.
    stemmed_words = [p_stemmer.stem(i) for i in tokens]
    
    # Stem our positive/negative words.
    positive_stems = [p_stemmer.stem(i) for i in positive_words]
    negative_stems = [p_stemmer.stem(i) for i in negative_words]

    # Count "positive" words.
    positive_count = sum([1 for i in stemmed_words if i in positive_stems])
    
    # Count "negative" words
    negative_count = sum([1 for i in stemmed_words if i in negative_stems])
    
    # Calculate Sentiment Percentage 
    # (Positive Count - Negative Count) / (Total Count)
    # Guard against empty text, which would otherwise divide by zero.
    if not tokens:
        return 0.0

    return round((positive_count - negative_count) / len(tokens), 2)

In [None]:
# Run our sentiment analyzer on our spam email.
simple_sentiment(spam)

In [None]:
# for given text (ex: yelp review)
# Calculate sentiment of yelp_1.
simple_sentiment(yelp_1)
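
Neither `spam` nor `yelp_1` is defined in this excerpt, so here is a worked example of the same scoring formula on a made-up review. This sketch skips the stemming step and uses only a subset of the word lists, so it is a simplified stand-in for `simple_sentiment`, not a replacement for it.

```python
import re

# Subsets of the positive/negative word lists, chosen for illustration.
POSITIVE = {"good", "great", "amazing"}
NEGATIVE = {"bad", "terrible", "gross"}


def toy_sentiment(text):
    """Score = (positive count - negative count) / total token count."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    positive_count = sum(token in POSITIVE for token in tokens)
    negative_count = sum(token in NEGATIVE for token in tokens)
    return round((positive_count - negative_count) / len(tokens), 2)


# Hypothetical review: 10 tokens, 2 positive words, 1 negative word.
review = "Great food, amazing staff, terrible parking, and a long wait."
score = toy_sentiment(review)

print(score)  # (2 - 1) / 10 = 0.1
```

Because most tokens carry no sentiment, scores from this formula tend to sit close to zero even for strongly worded text.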

## Sorting Positive from Negative Reviews
The easiest way to do sentiment classification or analysis is by training a model on data we've already labeled.

This is a huge consideration for your capstone, even if it isn't related to NLP!

Today, we will begin by reviewing the basic NLP techniques we learned yesterday to create a sentiment analyzer from Rotten Tomatoes movie reviews. This code-along is adapted from Kaggle's tutorial, available here.

#### Step One: Import The Data

In [None]:
# Import pandas.
import pandas as pd       

# Read in training data.
train = pd.read_csv("../labeledTrainData.tsv",
                    header=0,
                    delimiter="\t",
                    quoting=3)

In [None]:
# View the first five rows.
train.head()

In [None]:
# Examine the first review.
train['review'][0]

#### Train/Test Split

In [None]:
# Read in testing data.
test = pd.read_csv("../testData.tsv",
                   header=0, 
                   delimiter="\t",
                   quoting=3)

In [None]:
# View the first five rows.
test.head()

Remember that our Kaggle data was organized into a training file and a testing file. However, the testing file didn't contain any labels!

Let's do a train/test split by splitting up the labeled training data.

In [None]:
# Import train_test_split.
from sklearn.model_selection import train_test_split

# Create train_test_split.
X_train, X_test, y_train, y_test = train_test_split(train[['id','review']],
                                                    train['sentiment'],
                                                    test_size = 0.25,
                                                    random_state = 42)

There are a few steps we'll take to clean up the text data before it's ready for processing.
- Remove the HTML code artifacts from the text.
- Remove punctuation.
- Remove stopwords. (We'll cover these shortly.)

## Step One: Remove HTML code artifacts

In [None]:
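# Initialize the BeautifulSoup object on a single movie review.
# After train_test_split the row labels are shuffled, so use .iloc[0]
# (first row by position) rather than [0] (row labeled 0, which may be in X_test).
first_review = X_train['review'].iloc[0]
example1 = BeautifulSoup(first_review, 'html.parser')

# Print the raw review and then the output of get_text(), for 
# comparison
print(first_review)
print()
print(example1.get_text())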

## Step Two: Remove Punctuation
Punctuation can be removed using regular expressions

In [None]:
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The pattern to replace it with
                      example1.get_text())   # The text to search


In [None]:
# Show first fifty letters of letters_only.
letters_only[0:50]

In [None]:
# Convert letters_only to lower case.
lower_case = letters_only.lower()

# Split lower_case up at each space.
words = lower_case.split() # This is like a manual tokenizer!

In [None]:
# Check first ten words.
words[0:10]
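
To see what the `[^a-zA-Z]` substitution does beyond removing punctuation, here is a small self-contained sketch on a made-up string (the variable names with the `example_` prefix are hypothetical and chosen so they do not overwrite the notebook's own variables).

```python
import re

# Hypothetical raw text with punctuation, digits, and contractions.
example_raw = "Wow!!! This movie's plot-twist was 10/10 -- can't wait for part 2."

# Replace every character that is not a letter with a space.
example_letters = re.sub(r"[^a-zA-Z]", " ", example_raw)

# Lowercase and split on whitespace.
example_words = example_letters.lower().split()

print(example_words)
```

Two side effects are worth noticing: digits are dropped along with punctuation, and contractions break into fragments such as "s" and "t". Whether that matters depends on the text being modeled.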

## Step Three: Remove Stop Words
With our Yelp reviews above, you noticed that our sentiment scores were right around zero. While there were some positive and negative words, the vast majority of the words had neither a positive sentiment nor negative sentiment!
- Examples include "the," "of," "and," "a," "to," and "in."

Stopwords are very common words that are often removed because they amount to unnecessary information and removing them can dramatically speed up processing.
If you didn't complete the NLTK download, you may run into some issues here.

In [None]:
# Print English stopwords.
print(stopwords.words("english"))

In [None]:
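# Remove stop words from "words."
# Build the stop word set once; calling stopwords.words() inside the
# comprehension would reload the list for every word.
stops = set(stopwords.words('english'))
words = [w for w in words if w not in stops]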

In [None]:
# Check "words" to make sure we did this properly.
print(words)
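
The following sketch shows the same filtering step without requiring the NLTK download. The stop list here is a hypothetical six-word set built from the examples mentioned above; NLTK's English list is much longer.

```python
# Hypothetical mini stop list (NLTK's real list has many more entries).
TOY_STOPS = {"the", "of", "and", "a", "to", "in"}

# Hypothetical already-tokenized, lowercased text.
example_tokens = "the plot of the movie was a joy to watch".split()

# Keep only tokens that are not stop words; set membership checks are fast.
meaningful = [w for w in example_tokens if w not in TOY_STOPS]

print(meaningful)
```

Notice that "was" survives here only because it is missing from the toy list, which shows how much the result depends on the stop list chosen.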

## Step Four: Combine our cleaning into one function
Check: Why should we do everything with one function?

In [None]:
def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    
    # 1. Remove HTML. (Naming the parser explicitly avoids a BeautifulSoup warning.)
    review_text = BeautifulSoup(raw_review, 'html.parser').get_text()
    
    # 2. Remove non-letters.
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    
    # 3. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    
    # 4. In Python, searching a set is much faster than searching
    # a list, so convert the stop words to a set.
    stops = set(stopwords.words('english'))
    
    # 5. Remove stop words.
    meaningful_words = [w for w in words if w not in stops]
    
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(meaningful_words)

## Step Five (Finally!) Applying our Function

In [None]:
# Get the number of reviews based on the dataframe size.
total_reviews = train.shape[0]
print(f'There are {total_reviews} reviews.')

# Initialize an empty list to hold the clean reviews.
clean_train_reviews = []
clean_test_reviews = []

In [None]:
print("Cleaning and parsing the training set movie reviews...")

j = 0
for train_review in X_train['review']:
    # Convert review to words, then append to clean_train_reviews.
    clean_train_reviews.append(review_to_words(train_review))
    
    # If the index is divisible by 1000, print a message
    if (j + 1) % 1000 == 0:
        print(f'Review {j + 1} of {total_reviews}.')
    
    j += 1

# Let's do the same for our testing set.

print("Cleaning and parsing the testing set movie reviews...")

for test_review in X_test['review']:
    # Convert review to words, then append to clean_test_reviews.
    clean_test_reviews.append(review_to_words(test_review))
    
    # If the index is divisible by 1000, print a message
    if (j + 1) % 1000 == 0:
        print(f'Review {j + 1} of {total_reviews}.')
        
    j += 1

## Our data is finally ready.....

## Vectorizer...
We'll describe this in greater detail in a later lesson, but CountVectorizer will transform the lists of the cleaned reviews above into features that we can pass into a model.
It will create columns (also known as vectors), where each column counts how many times a given word is observed in each review.
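
To show what "one column per word, one count per review" means, here is a bag-of-words sketch in plain Python. It is a conceptual stand-in, not scikit-learn's implementation: the two documents are made up, and details such as tokenization and `max_features` are ignored.

```python
from collections import Counter

# Hypothetical cleaned reviews (already lowercased, space-separated).
docs = ["good movie good cast", "bad movie"]

# The vocabulary is every distinct word, sorted so column order is stable.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
count_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row has the same length regardless of how long the review is, which is what lets a model treat the rows as feature vectors.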

In [None]:
# Import CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)

In [None]:
# fit_transform() does two things: First, it fits the vectorizer
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.

train_data_features = vectorizer.fit_transform(clean_train_reviews)

test_data_features = vectorizer.transform(clean_test_reviews)

# Numpy arrays are easy to work with, so convert the result to an 
# array.
train_data_features = train_data_features.toarray()

In [None]:
print(train_data_features.shape)

In [None]:
print(test_data_features.shape)

In [None]:
train_data_features[0:6]

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
vocab = vectorizer.get_feature_names_out()
print(vocab)

## Now we have an array that we can use for classification!

In [None]:
# Import logistic regression.

from sklearn.linear_model import LogisticRegression

In [None]:
# Instantiate logistic regression model.

lr = LogisticRegression()

In [None]:
# Fit model to training data.

lr.fit(train_data_features, y_train)

In [None]:
# Evaluate model on training data.
lr.score(train_data_features, y_train)

In [None]:
# Evaluate model on testing data.

lr.score(test_data_features, y_test)
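
An accuracy score is only meaningful next to a baseline. For classification, the baseline (null model) predicts the most frequent class for every observation. The sketch below uses made-up labels in `example_labels`; in the notebook, the same calculation would be applied to `y_test`.

```python
from collections import Counter

# Hypothetical 0/1 sentiment labels standing in for y_test.
example_labels = [1, 0, 1, 1, 0, 1, 0, 1]

# Find the most frequent class and how often it occurs.
majority_class, majority_count = Counter(example_labels).most_common(1)[0]

# Accuracy of always predicting the majority class.
baseline_accuracy = majority_count / len(example_labels)

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

A model is only adding value if its test accuracy is clearly above this number.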

A couple things to note:

NLP broadly describes:
- how we can get unstructured text data into a more structured form that can be interpreted by computers and
- algorithms for interpreting text data.

That does not mean the tools we used today work to the exclusion of other methods. You can and should include other variables in your model!
- For example, maybe the length of a review tells us something about how much people liked/disliked the movie, or maybe additional information about the reviewer (e.g. geography, age, how many reviews they had submitted) has predictive value.

## SubReddit Data Cleaning (NLP I) is complete

Proceed to the next notebook:
- [SubReddit_NLP_Vectorizing](03_SubReddit_NLP_Vectorizing.ipynb)


Vectorizing is also known as "text feature extraction".

---
## Do all cleaning / de-duping before vectorizing...

#### Regex Cleaning

Let's use regex to remove the words `snake, snakes, spider, spiders`. Let's also remove *any* mention of a subreddit, as well as all URLs.

In [None]:
# remove any URLs
# subreddits['title'] = subreddits.title.map(lambda x: re.sub(r'https?://\S*', ' ', x))

In [None]:
# remove punctuation
# subreddits['title'] = subreddits.title.map(lambda x: re.sub("[^a-zA-Z]", " ", x))

In [None]:
# remove specific text (square brackets are escaped; unescaped, [ ... ] is a character class)
# subreddits['title'] = subreddits.title.map(lambda x: re.sub(r'\[?Weekly thread\]?', ' ', x))

In [None]:
# lowercase everything (a pandas Series needs the .str accessor)
# subreddits['title'] = subreddits['title'].str.lower()

# # per lesson 5.03:
# # Convert titles to lower case.
# lower_case = subreddits['title'].str.lower()
# # Split lower_case up at each space.
# words = lower_case.str.split() # This is like a manual tokenizer!
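
The cleaning steps described in this section can be combined into one function. This is a minimal sketch on plain Python strings rather than on the `subreddits` DataFrame. The function name `clean_title`, the example titles, and the exact patterns are assumptions, and the subreddit pattern only covers mentions written as `r/name` or `/r/name`.

```python
import re


def clean_title(title):
    """Remove URLs, subreddit mentions, and selected words; keep letters only."""
    # 1. URLs first, so later steps do not leave fragments of them behind.
    title = re.sub(r"https?://\S+", " ", title)
    # 2. Subreddit mentions such as r/snakes or /r/spiders.
    title = re.sub(r"(?<!\w)/?r/\w+", " ", title)
    # 3. The words snake, snakes, spider, spiders (any capitalization).
    title = re.sub(r"\b(?:snake|spider)s?\b", " ", title, flags=re.IGNORECASE)
    # 4. Anything that is not a letter.
    title = re.sub(r"[^a-zA-Z]", " ", title)
    # 5. Lowercase and collapse repeated spaces.
    return " ".join(title.lower().split())


# Hypothetical post titles.
titles = [
    "My snake escaped! See https://example.com/pic via r/snakes",
    "Spiders are NOT insects (x-post /r/spiders)",
]
cleaned = [clean_title(t) for t in titles]

print(cleaned)
```

With a DataFrame, the same function could be applied with something like `subreddits['title'].map(clean_title)`. Removing words that name the subreddit itself keeps the classifier from relying on giveaway terms.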

