In [1]:
import numpy as np
import pandas as pd
import json
import io
import gc
import jsonlines
import requests

from IPython.display import Math
from IPython.display import Latex

# gc.get_objects()
# locals()
# globals()


# Lab 1: Logistic Regression

In this lab, we will use Twitter data to train a model that classifies the sentiment of new tweets. The model will label each tweet as happy or sad, and we will use the Python Natural Language Toolkit (NLTK) to do so. The process for classifying the sentiment of documents, in this case tweets, is:

1. Import Functions and Data
2. Prepare the Data
3. Define a Sigmoid for Logistic Regression
4. Define a Model for Logistic Regression using a Cost Function
5. Implement a Gradient Descent Function
6. Extract the Twitter Features
7. Train the Model
8. Test the Logistic Regression
9. Error Analysis
10. Predict out of Sample using New Tweets


## 1. Import functions and data

In [22]:
# Run this cell to import nltk and getcwd (used to locate the working directory)
import nltk
from os import getcwd

### Imported functions

Download the data needed for this Lab. Check out the [documentation for the twitter_samples dataset](http://www.nltk.org/howto/twitter.html).

* twitter_samples: if you're running this notebook on your local computer, you will need to download it using:
```Python
nltk.download('twitter_samples')
```

* stopwords: if you're running this notebook on your local computer, you will need to download it using:
```python
nltk.download('stopwords')
```

In [3]:
# Run this cell to download the NLTK corpora (only needs to run once)

nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/JosephNavelski/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/JosephNavelski/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### Import some helper functions that we provided in the utils.py file:
* `process_tweet()`: cleans the text, tokenizes it into separate words, removes stopwords, and converts words to stems.
* `build_freqs()`: this counts how often a word in the 'corpus' (the entire set of tweets) was associated with a positive label '1' or a negative label '0', then builds the `freqs` dictionary, where each key is a (word,label) tuple, and the value is the count of its frequency within the corpus of tweets.
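
To make the cleaning steps concrete, here is a minimal sketch of what a function like `process_tweet()` does, written with only the standard library. It is not the code in `utils.py`: the name `toy_process_tweet`, the tiny `TOY_STOPWORDS` set, and the whitespace tokenizer are all assumptions made for illustration, and the sketch skips stemming entirely.

```python
import re
import string

# Hypothetical, tiny stopword set for illustration only; the lab's
# process_tweet() is described as removing stopwords, presumably from a fuller list.
TOY_STOPWORDS = {"a", "an", "the", "is", "are", "for", "in", "my", "this", "being", "to", "and"}
PUNCTUATION = set(string.punctuation)  # single punctuation characters

def toy_process_tweet(tweet):
    """Roughly mimic the cleaning steps (no stemming) using the standard library."""
    tweet = re.sub(r"https?://\S+", "", tweet)  # drop hyperlinks
    tweet = re.sub(r"@\w+", "", tweet)          # drop @handles
    tweet = tweet.replace("#", "")              # keep hashtag text, drop the '#'
    tokens = tweet.lower().split()              # naive whitespace tokenization
    # drop stopwords and tokens that are a single punctuation character
    return [tok for tok in tokens if tok not in TOY_STOPWORDS and tok not in PUNCTUATION]

example = "#FollowFriday @France_Inte for being top engaged members in my community this week :)"
print(toy_process_tweet(example))
```

Because this sketch does not stem, words such as "engaged" and "members" stay in their full form, whereas the lab's `process_tweet()` reduces them to stems.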

In [23]:
# add folder, tmp2, from our local workspace containing pre-downloaded corpora files to nltk's data path
# this enables importing of these files without downloading it again when we refresh our workspace

filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

filePath

'/Users/JosephNavelski/../tmp2/'

In [5]:
import numpy as np
import pandas as pd

from nltk.corpus import twitter_samples 
from utils import process_tweet, build_freqs

### Prepare the data
* The `twitter_samples` corpus contains subsets of 5,000 positive tweets, 5,000 negative tweets, and the full set of 10,000 tweets.  
    * If you used all three datasets, you would introduce duplicates of the positive and negative tweets.  
    * You will select just the 5,000 positive tweets and the 5,000 negative tweets.

In [6]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')


* Train/test split: 80% of the tweets go in the training set and 20% in the test set.

In [7]:
# split the data into two pieces, one for training and one for testing (validation set) 
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg 
test_x = test_pos + test_neg

print("length of training data =", len(train_x), "observations")
print("length of test data =", len(test_x), "observations")

type(train_x)
type(test_x)

# Look at the data!
print(train_x[1:10])
print(train_x[(len(train_x)-10):len(train_x)])
print("------------------------------------------")
print(test_x[1:10])
print(test_x[(len(test_x)-10):len(test_x)])

length of training data = 8000 observations
length of test data = 2000 observations
['@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!', '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!', '@97sides CONGRATS :)', 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days', '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM', "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI", '@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.', 'Jgh , but we have to go to Bayan :D bye', 'As an act of mischievousness, am calling the ETL layer of our in-house warehousing app Katamari.\n\nWell… as the name impli

* Create the numpy array of positive labels and negative labels.

In [8]:
# combine positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

In [9]:
# Print the shapes of the train and test label arrays (Note: each is a column vector of 1's and 0's)
print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))

train_y.shape = (8000, 1)
test_y.shape = (2000, 1)


* Create the frequency dictionary using the imported `build_freqs()` function.  
    * It is recommended that you open `utils.py` and read the `build_freqs()` function to understand what it is doing.

```Python
    for y,tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
```
* Notice how the outer for loop goes through each tweet, and the inner for loop steps through each word in a tweet.
* The `freqs` dictionary is the frequency dictionary that's being built. 
* The key is the tuple (word, label), such as ("happy",1) or ("happy",0).  The value stored for each key is the count of how many times the word "happy" was associated with a positive label, or how many times "happy" was associated with a negative label.
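
The counting logic above can be tried on a toy corpus. The sketch below is an assumption-laden stand-in, not the function in `utils.py`: `toy_build_freqs` is a hypothetical name, it splits on whitespace instead of calling `process_tweet()`, and it takes a plain list of integer labels rather than the NumPy column vector used in the lab.

```python
def toy_build_freqs(tweets, ys, tokenize=str.split):
    """Count (word, label) pairs; `tokenize` stands in for process_tweet()."""
    freqs = {}
    for y, tweet in zip(ys, tweets):      # outer loop: one tweet and its label
        for word in tokenize(tweet):      # inner loop: each word in that tweet
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

toy_tweets = ["happy happy day", "sad day", "happy ending"]
toy_labels = [1, 0, 1]                    # 1 = positive, 0 = negative
toy_freqs = toy_build_freqs(toy_tweets, toy_labels)

# Look up counts with a default of 0 for pairs that never occurred
print(toy_freqs.get(("happy", 1), 0))
print(toy_freqs.get(("happy", 0), 0))
```

Using `dict.get(pair, 0)` for lookups avoids a `KeyError` when a word never appeared with a given label, which happens often with a vocabulary built from tweets.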

In [10]:
# create frequency dictionary
freqs = build_freqs(train_x, train_y)

# check the output
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))

# freqs
# type(freqs)
# freqsdf = pd.DataFrame.from_dict(freqs, orient = 'index',columns=freqs.keys())
# str(freqsdf.shape)

type(freqs) = <class 'dict'>
len(freqs) = 11340


### Process tweet using `process_tweet()` function
The given function `process_tweet()` tokenizes the tweet into individual words, removes stop words and applies stemming.

In [11]:
# test the function below
print('This is an example of a positive tweet: \n', train_x[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[0]))

This is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processed version of the tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


#### Expected output
```
This is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
 
This is an example of the processed version of the tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
```

## 2. Define a Sigmoid for Logistic Regression


### Part 1.1: Sigmoid
You will learn to use logistic regression for text classification. 
* The sigmoid function is defined as: 

$$ h(z) = \frac{1}{1+e^{-z}} \tag{1}$$

It maps the input 'z' to a value strictly between 0 and 1, so the output can be treated as a probability. 

<div style="width:image width px; font-size:100%; text-align:center;"><img src='../tmp2/sigmoid_plot.jpg.png' alt="alternate text" width="width" height="height" style="width:300px;height:200px;" /> Figure 1 </div>
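
Equation (1) translates almost directly into NumPy. The following is one possible sketch rather than the lab's reference solution; the function name `sigmoid` and the choice to accept either a scalar or an array are assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function from equation (1); accepts a scalar or a NumPy array."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))                           # the midpoint of the curve
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # applied elementwise to an array
```

Because `np.exp` works elementwise, the same function can later be applied to a whole vector of scores at once. For very large negative inputs `np.exp(-z)` can overflow and emit a warning, so a numerically safer variant may be worth considering for real data.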