# Logistic Regression in Natural Language Processing (NLP)

In this work we explore the use of Logistic Regression (LR) in NLP. The dataset is available in the [NLTK](https://www.nltk.org/howto/twitter.html) library and consists of positive and negative tweets. We want to predict the sentiment (positive or negative) of each tweet. This is a supervised learning approach, since every tweet comes with a label: we use 80% of the data to train the model and the remaining 20% to test it.

In [97]:
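import nltk
from os import getcwd
import pandas as pd
from nltk.corpus import twitter_samples
import matplotlib.pyplot as plt
import numpy as np

# If the corpora are missing locally, download them once:
# nltk.download('twitter_samples'); nltk.download('stopwords')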

Apart from the Python libraries we need two helper functions: process_tweet and build_freqs. The process_tweet function cleans up the tweets; it is defined in the section "Cleaning up the tweets".

## Load the dataset

In [98]:
pos_twts = twitter_samples.strings('positive_tweets.json')
neg_twts = twitter_samples.strings('negative_tweets.json')

twts = pos_twts + neg_twts

Tweets with positive sentiment are labeled 1 and tweets with negative sentiment are labeled 0. The labels are created below as a NumPy array.

In [99]:
labels = np.append( np.ones((len(pos_twts),1)), np.zeros((len(neg_twts),1)), axis = 0 )
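To make the shape of the label array concrete, here is a minimal sketch on toy counts (3 positive and 2 negative items, chosen only for illustration). It uses `np.concatenate`, which should behave the same as the `np.append(..., axis=0)` call above for two column vectors. The names `n_pos`, `n_neg`, `toy_labels` and `flat` are hypothetical.

```python
import numpy as np

# Toy counts, standing in for len(pos_twts) and len(neg_twts)
n_pos, n_neg = 3, 2

# Stack a column of ones on top of a column of zeros
toy_labels = np.concatenate([np.ones((n_pos, 1)), np.zeros((n_neg, 1))], axis=0)
print(toy_labels.shape)  # (5, 1)

# Flatten the (n, 1) column into a plain Python list of floats
flat = toy_labels.ravel().tolist()
print(flat)  # [1.0, 1.0, 1.0, 0.0, 0.0]
```

The result is a column vector with one row per tweet, positives first, which is why the labels line up with `pos_twts + neg_twts`.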

## Cleaning up the tweets

In [100]:
import re
import string
import numpy as np

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

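def process_tweet(tweet):
    """Clean one tweet and return a list of stemmed tokens."""
    # Instantiate stemming class and load the English stop words
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    
    # remove the old-style retweet marker "RT" at the start of the tweet
    tweet = re.sub(r'^RT[\s]+','', tweet)
    # remove hyperlinks (everything from the URL to the end of the line)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove only the hash sign from hashtags, keeping the word
    tweet = re.sub(r'#', '', tweet)
    # remove only the @ sign from mentions, keeping the handle text
    tweet = re.sub(r'@', '', tweet)
    # remove stock tickers such as $GE
    tweet = re.sub(r'\$\w*','', tweet)
    
    # lowercase, shorten repeated characters and split into tokens
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    twt_tokened = tokenizer.tokenize(tweet)
    
    # drop stop words and punctuation tokens
    clean_twt = []
    
    for word in twt_tokened:
        if (word not in stopwords_english and word not in string.punctuation):
            clean_twt.append(word)
    
    # reduce every remaining word to its stem
    stemed_twt = []
    for word in clean_twt:
        word_stemmed = stemmer.stem(word)
        stemed_twt.append(word_stemmed)
    
    
    return stemed_twt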

# create an empty list for the clean tweets
clean_twts_lst = []

for i in twts:
    clean_twts_lst.append(process_tweet(i))
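To see what the regular expressions in process_tweet do before tokenization, the sketch below applies the same patterns one at a time to a made-up tweet. It uses only the standard `re` module; the sample text and the `step1`..`step5` names are assumptions for illustration, and the tokenizing, stop-word and stemming stages are left out.

```python
import re

# Made-up tweet used only to illustrate the cleaning patterns
sample = "RT @user: Loving #NLP today!! $AAPL https://example.com/page"

# 1. Drop the leading retweet marker "RT"
step1 = re.sub(r'^RT[\s]+', '', sample)

# 2. Drop the hyperlink and everything after it on the same line
step2 = re.sub(r'https?:\/\/.*[\r\n]*', '', step1)

# 3. Drop the hash sign but keep the hashtag word
step3 = re.sub(r'#', '', step2)

# 4. Drop the @ sign but keep the handle text
step4 = re.sub(r'@', '', step3)

# 5. Drop stock tickers such as $AAPL
step5 = re.sub(r'\$\w*', '', step4)

print(step5.split())  # ['user:', 'Loving', 'NLP', 'today!!']
```

Because the `@` sign is already removed at step 4, the handle text itself ("user") stays in the tweet, so the tokenizer's `strip_handles=True` option likely has nothing left to strip. This matches the handle names that remain visible in the cleaned tweets.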

After processing, each tweet is a list of tokens, so we need to join the tokens back into a single string.

In [101]:
# join each token list back into one space-separated string
clean_twts_lst2 = []

for i in clean_twts_lst:
    clean_twts_lst2.append(" ".join(i))

The tweets are cleaned up and stored in **clean_twts_lst2**. The next step is to combine the tweet strings with their corresponding 1 and 0 labels. 
There are 10,000 tweets and 10,000 labels.

In [102]:
print('The length of the tweets is: ', len(clean_twts_lst2), ', which is the same \
as the length of labels. The length of labels is: ', len(labels) )

The length of the tweets is:  10000 , which is the same as the length of labels. The length of labels is:  10000


In [103]:
# convert the labels arrays to a list

# labels list
labels2 = []

for i in labels:
    labels2.append(i[0])
    
    
# create the columns of a dataframe
columns = {'Tweets':clean_twts_lst2, 'Label': labels2}

# This is the dataframe which has the tweets and their corresponding labels

df_twts_lbls = pd.DataFrame(columns)

In [104]:
df_twts_lbls

Unnamed: 0,Tweets,Label
0,followfriday france_int pkuchli 57 milipol_par...,1.0
1,lamb 2ja hey jame odd :/ pleas call contact ce...,1.0
2,despiteoffici listen last night :) bleed amaz ...,1.0
3,97side congrat :),1.0
4,yeaaah yipppi accnt verifi rqst succeed got bl...,1.0
...,...,...
9995,wanna chang avi usanel :(,0.0
9996,puppi broke foot :(,0.0
9997,where' jaebum babi pictur :(,0.0
9998,mr ahmad maslan cook :(,0.0


## Make the data ready for training

We split the data into a training set and a test set. The x data are the tweet strings and the y data are the labels. The steps are:

    1- Train & test split the data
    2- Vectorize the strings of the tweets.
    3- Fit and evaluate the model with different solvers. 
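The three steps above can be sketched end to end on a tiny made-up corpus. This is only an illustration under stated assumptions: the toy texts and labels are invented, `make_pipeline` is used to chain the vectorizer and the classifier (the notebook itself calls them separately), and `LinearSVC` is the classifier the notebook fits later on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: 1 = positive, 0 = negative
toy_texts = ["great day", "love this", "happy happy", "so good",
             "bad day", "hate this", "sad sad", "so awful"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step 1: train/test split (stratified so both classes appear in each part)
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=40, stratify=toy_labels
)

# Steps 2 and 3: vectorize the strings, then fit the classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
model.fit(X_tr, y_tr)

# Predict on the held-out texts
toy_pred = model.predict(X_te)
print(len(X_tr), len(X_te))  # 6 2
```

A pipeline fits the vectorizer on the training texts only and reuses that vocabulary for the test texts, which is the same discipline the separate `fit_transform` and `transform` calls follow.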

### 1- Train & test split the data

In [105]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

x_train, x_test, y_train, y_test = train_test_split(df_twts_lbls['Tweets'], df_twts_lbls['Label'],
                                                    test_size = 0.2, random_state=40)

print('shape of x_train: ', x_train.shape, 'shape of y_train: ', y_train.shape,'\n',
     'shape of x_test: ', x_test.shape, 'shape of y_test: ', y_test.shape,'\n')

shape of x_train:  (8000,) shape of y_train:  (8000,) 
 shape of x_test:  (2000,) shape of y_test:  (2000,) 



### 2- Vectorize the strings of the tweets.

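Now we use the TfidfVectorizer to turn the tweet strings into TF-IDF feature vectors. The vectorizer is fitted on the training tweets only and then applied to the test tweets, so both matrices share the same columns.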

In [106]:
print("Vectorizing the tweets....\n")

vectorizer= TfidfVectorizer()
tf_x_train = vectorizer.fit_transform(x_train)
tf_x_test = vectorizer.transform(x_test)

print('The shape of the x_test: ', x_test.shape, ' and the shape of the x_train is: ', x_train.shape,'\n')

print('The shape of the vectorized x_train is: ', tf_x_train.shape,' and the shape of the vectorized x-test is: ', 
     tf_x_test.shape)



Vectorizing the tweets....

The shape of the x_test:  (2000,)  and the shape of the x_train is:  (8000,) 

The shape of the vectorized x_train is:  (8000, 14607)  and the shape of the vectorized x-test is:  (2000, 14607)
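The shapes above show that the train and test matrices have the same number of columns, one per vocabulary term learned from the training tweets. The sketch below reproduces that behavior on a few invented strings; `train_docs`, `test_docs` and the other variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented documents in the style of the cleaned tweets
train_docs = ["happi day :)", "sad day :(", "happi friend :)"]
test_docs = ["happi unseenword"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test = vectorizer.transform(test_docs)        # reuses that vocabulary

vocab = sorted(vectorizer.vocabulary_)
print(vocab)                        # ['day', 'friend', 'happi', 'sad']
print(X_train.shape, X_test.shape)  # (3, 4) (1, 4)
print(X_test.nnz)                   # 1
```

Two details are worth noting. Words that appear only in the test data ("unseenword" here) are ignored, because the vocabulary is fixed at fit time. Also, with its default token pattern the vectorizer keeps only tokens of two or more word characters, so emoticons such as `:)` and `:(` do not become features unless a custom tokenizer or `token_pattern` is supplied.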


### 3- Fit and evaluate the model with different solvers. 

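Here the model is a linear support vector machine (**SVM**), fitted with scikit-learn's LinearSVC classifier and evaluated on the test set with a classification report.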

In [107]:
from sklearn.svm import LinearSVC
clf = LinearSVC(random_state=0)

clf.fit(tf_x_train, y_train)
y_pred = clf.predict(tf_x_test)

from sklearn.metrics import classification_report

report=classification_report(y_test, y_pred,output_dict=True)

report

{'0.0': {'precision': 0.7521697203471552,
  'recall': 0.7507218479307026,
  'f1-score': 0.7514450867052023,
  'support': 1039},
 '1.0': {'precision': 0.731048805815161,
  'recall': 0.7325702393340271,
  'f1-score': 0.7318087318087318,
  'support': 961},
 'accuracy': 0.742,
 'macro avg': {'precision': 0.7416092630811582,
  'recall': 0.7416460436323649,
  'f1-score': 0.7416269092569671,
  'support': 2000},
 'weighted avg': {'precision': 0.7420211209145321,
  'recall': 0.742,
  'f1-score': 0.7420098181774484,
  'support': 2000}}
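The report above comes from LinearSVC. Since the title names logistic regression and the heading mentions different solvers, the following is a minimal sketch of how scikit-learn's `LogisticRegression` could be fitted with several solvers. It runs on synthetic data from `make_classification`, not on the tweets, so the scores it prints say nothing about this dataset; in the notebook, `tf_x_train`, `y_train`, `tf_x_test` and `y_test` would take the place of the synthetic arrays. The `scores` dictionary and the `max_iter` value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a stand-in for the TF-IDF features)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=40)

# Fit the same model with different solvers and record test accuracy
scores = {}
for solver in ("lbfgs", "liblinear", "newton-cg", "saga"):
    clf = LogisticRegression(solver=solver, max_iter=1000, random_state=0)
    clf.fit(X_tr, y_tr)
    scores[solver] = clf.score(X_te, y_te)

for solver, acc in scores.items():
    print(f"{solver}: accuracy = {acc:.3f}")
```

All four solvers optimize the same regularized objective, so their accuracies are usually close; they differ mainly in speed and in which penalties they support.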