# Sentiment Analysis with Logistic Regression

I will implement logistic regression for sentiment analysis on tweets. Given a tweet, I will decide whether its sentiment is positive or negative. Specifically, I will:

* Learn how to extract features for logistic regression given some text
* Implement logistic regression from scratch
* Apply logistic regression on a natural language processing task
* Test my logistic regression
* Perform error analysis

### Import Functions and Data

In [2]:
# Import nltk
import nltk
from os import getcwd

# download twitter_samples and stopwords
nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\tarza\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\twitter_samples.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tarza\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

#### Import some helper functions that are provided in the utils.py file:
* process_tweet: cleans the text, tokenizes it into separate words, removes stopwords, and converts words to stems.

* build_freqs: counts how often each word in the 'corpus' (the entire set of tweets) is associated with a positive label '1' or a negative label '0', then builds the 'freqs' dictionary, where each key is a (word, label) tuple and each value is the number of times that pair occurs in the corpus of tweets.
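
To make the shape of the 'freqs' dictionary concrete, here is a minimal sketch of the counting idea. It is not the code from utils.py: `build_freqs_sketch` and `simple_tokenize` are hypothetical names, the whitespace tokenizer only stands in for process_tweet, and the labels are assumed to be a flat list rather than the column array used later in this notebook.

```python
def simple_tokenize(text):
    # Hypothetical stand-in for process_tweet: lowercase and split on whitespace.
    return text.lower().split()


def build_freqs_sketch(tweets, labels, tokenize=simple_tokenize):
    """Count how often each (word, label) pair occurs.

    Assumes `labels` is a flat sequence with one label per tweet. An (m, 1)
    NumPy array would need flattening first, e.g. with np.squeeze.
    """
    freqs = {}
    for label, tweet in zip(labels, tweets):
        for word in tokenize(tweet):
            pair = (word, label)
            # Start unseen pairs at 0, then add one occurrence.
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs


toy_tweets = ["happy happy day", "sad day"]
toy_labels = [1, 0]
toy_freqs = build_freqs_sketch(toy_tweets, toy_labels)
print(toy_freqs)
```

In this toy corpus the word "day" ends up under two separate keys, ('day', 1) and ('day', 0), which is what lets the same word contribute to both a positive and a negative count.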

In [3]:
filePath = f"{getcwd()}/../tmp2"
nltk.data.path.append(filePath)

In [4]:
import numpy as np
import pandas as pd
from nltk.corpus import twitter_samples
from utils import process_tweet, build_freqs

### Prepare the Data
* The `twitter_samples` corpus contains a subset of five thousand positive tweets, a subset of five thousand negative tweets, and the full set of 10,000 tweets.
    * I will select just the five thousand positive tweets and the five thousand negative tweets.

In [6]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

* Train/test split: 20% of the tweets go to the test set and 80% to the training set.

In [7]:
# split the data into two pieces, one for training and one for testing (validation set)
train_pos = all_positive_tweets[:4000]
test_pos = all_positive_tweets[4000:]
train_neg = all_negative_tweets[:4000]
test_neg = all_negative_tweets[4000:]

X_train = train_pos + train_neg
X_test = test_pos + test_neg
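
The hard-coded index 4000 is 80% of the 5,000 tweets in each class. As a sketch of how the same cut could be derived from a fraction instead, the hypothetical helper below (`split_by_fraction` is not part of the notebook or of utils.py) slices a list in order, without shuffling, just like the cell above.

```python
def split_by_fraction(items, train_fraction=0.8):
    """Split a list in order: the first part for training, the rest for testing."""
    # Number of items that go to the training part.
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


toy_items = list(range(10))
train_part, test_part = split_by_fraction(toy_items)
print(len(train_part), len(test_part))
```

Applying it to each class separately, as the notebook does, keeps the training and test sets balanced between positive and negative tweets.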

* Create the numpy arrays of positive and negative labels

In [8]:
# combine positive and negative labels
y_train = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
y_test = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

In [9]:
# print the shapes of the train and test label arrays
print('y_train.shape = '+str(y_train.shape))
print('y_test.shape = '+str(y_test.shape))

y_train.shape = (8000, 1)
y_test.shape = (2000, 1)


* Create the frequency dictionary using the imported build_freqs function.

In [11]:
# Create frequency dictionary
freqs = build_freqs(X_train, y_train)

# check the output
print('type(freqs) = '+str(type(freqs)))
print('len(freqs) = '+str(len(freqs.keys())))

type(freqs) = <class 'dict'>
len(freqs) = 11427


### Process Tweet
The given function 'process_tweet' tokenizes the tweet into individual words, removes stop words and applies stemming.

In [12]:
# test the function below
print('This is an example of a positive tweet: \n', X_train[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(X_train[0]))

This is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processed version of the tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
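
The cleaning and stopword steps can be approximated with the standard library alone. The sketch below is not the process_tweet from utils.py: `clean_and_tokenize` is a hypothetical name, `TOY_STOPWORDS` is a tiny illustrative list rather than the NLTK stopword list downloaded earlier, and stemming is deliberately left out, so 'engaged' is not reduced to 'engag'.

```python
import re
import string

# Tiny illustrative stopword list (an assumption, not NLTK's English list).
TOY_STOPWORDS = {"for", "being", "in", "my", "this", "a", "an", "the", "is", "of", "and"}
PUNCTUATION = set(string.punctuation)


def clean_and_tokenize(tweet):
    """Strip Twitter-specific markup, lowercase, split, and drop stopwords."""
    tweet = re.sub(r"^RT\s+", "", tweet)        # drop a leading retweet marker
    tweet = re.sub(r"https?://\S+", "", tweet)  # drop URLs
    tweet = re.sub(r"@\w+", "", tweet)          # drop @handles
    tweet = tweet.replace("#", "")              # keep the hashtag word, drop '#'
    tokens = tweet.lower().split()
    # Single punctuation characters are dropped; emoticons such as ':)' survive.
    return [t for t in tokens if t not in TOY_STOPWORDS and t not in PUNCTUATION]


example = ("#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being "
           "top engaged members in my community this week :)")
tokens = clean_and_tokenize(example)
print(tokens)
```

Keeping the emoticon as a token matters for this task, because ':)' and ':(' are strong sentiment signals in the tweets.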
