
# Logistic Regression

## Basic NLP Flow

- **Step 1**: Clean data
    - Remove all irrelevant characters, such as any non-alphanumeric characters
    - Tokenize your text by separating it into individual words
    - Remove words that are not relevant, such as “@” Twitter mentions or URLs
    - Convert all characters to **lowercase**, in order to treat words such as “hello”, “Hello”, and “HELLO” the same 
    - Consider **lemmatization** (reduce words such as “am”, “are”, and “is” to a common form such as “be”)
- **Step 2**: Representation
    - Bag of Words or TFIDF
- **Step 3**: Classification
    - Naive Bayes
    - Logistic Regression

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# Import numpy and pandas, then load the dataset into an object called 'sentiment'
# Your code here
import numpy as np
import pandas as pd
# Read the CSV file from the mounted drive; the file is encoded as latin-1
sentiment = pd.read_csv('/content/drive/My Drive/Student Files/FTMLE - Tonga/Data/sentiment.csv', encoding='latin-1')

# Let's check sentiment.head(10) and sample(10)
# Your code here
sentiment.sample(10)


| Unnamed: 0 | ItemID | Sentiment | SentimentText |
|------------|--------|-----------|---------------|
| 84404 | 84416 | 1 | @catieronquillo, the world is wide. |
| 99603 | 99615 | 0 | @Crossbow1 YOU'RE TELLING ME. |
| 9666 | 9678 | 0 | &quot;Lars and the Real Girl&quot; is such a s... |
| 8154 | 8157 | 1 | #unfollowdiddy and follow me instead! i am coo... |
| 99044 | 99056 | 1 | @craftygirljen I shall consider it. |
| 65203 | 65215 | 1 | @breakinporcelan I'm looking for it right now.... |
| 5417 | 5420 | 1 | #followfriday go follow @brendonuriesays and @... |
| 88715 | 88727 | 1 | @circus_clown In passing you may have |
| 25422 | 25434 | 0 | @3b1srobinson my dogs get me up at 6am |
| 41300 | 41312 | 0 | @andrzejkala Me too. Shall we arrange somethin... |


## Sentiment analysis
This contest is based on a real text processing task.

The task is to build a model that determines the tone (positive or negative) of a text. To do this, you will need to train the model on the existing data (train.csv). The resulting model will then have to determine the class (positive or negative) of new texts. The dataset contains the following fields:

| Field name | Meaning |
|------------|-----------|
| ItemID  | id of tweet|
| Sentiment | sentiment (1-positive, 0-negative)|
| SentimentText | text of the tweet|

Let's first have a look at the data.

As we can see, the structure varies a lot from tweet to tweet. They have different lengths, letters, numbers, strange characters, etc.

It is also important to note that **a lot** of words are not correctly spelled, for example the word _"Juuuuuuuuuuuuuuuuussssst"_ or the word _"sooo"_.

This makes it hard to measure how positive or negative the words within the tweets are.

So we need a way of scoring the words such that words that appear in positive tweets have a greater score than those that appear in negative tweets.

But first... how do we represent the tweets as vectors we can input to our algorithm?

### Bag of words

One thing we could do to represent the tweets as equal-sized vectors of numbers is the following:

* Create a list (vocabulary) with all the unique words in the whole corpus of tweets. 
* We construct a feature vector from each tweet that contains the counts of how often each word occurs in the particular tweet

_Note that since the unique words in each tweet represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will mostly consist of zeros_
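The two steps above can be sketched in plain Python before reaching for a library. This is only a minimal illustration: the three short texts and the `tokenize()` helper are made up for this example, and the tokenization rule (lowercase, runs of word characters) is an assumption rather than the exact rule a library would apply.

```python
import re

# Made-up mini corpus, for illustration only
texts = ['good movie', 'bad movie', 'good good fun']

def tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r'\w+', text.lower())

tokenized = [tokenize(text) for text in texts]

# Step 1: vocabulary of all unique words, each mapped to a column index
unique_words = sorted({word for doc in tokenized for word in doc})
vocabulary = {word: index for index, word in enumerate(unique_words)}

# Step 2: one count vector per text, with one slot per vocabulary word
bag = []
for doc in tokenized:
    vector = [0] * len(vocabulary)
    for word in doc:
        vector[vocabulary[word]] += 1
    bag.append(vector)

print(vocabulary)
print(bag)
```

Every vector has the same length as the vocabulary, regardless of how long the original text was.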

Let's construct the bag of words. We will work with a smaller example for illustrative purposes, and at the end we will work with our real data.

In [0]:
tweets = [
    'This is amazing!',
    'ML is the best, yes it is',
    'I am not sure about how this is going to end...'
]

Let's import [CountVectorizer.](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) It'll help us to convert a collection of text documents to a matrix of token counts.

In [0]:
# Define a CountVectorizer() object called count
# It will convert our documents into a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()

# With the count object, fit and transform your tweets and save the result in a variable named 'bag'
# Your code here
bag = count.fit_transform(tweets)

In [0]:
# Find in the CountVectorizer documentation a function that shows us the list of feature names
# hint: get_feature_names (renamed to get_feature_names_out in scikit-learn 1.0 and removed in 1.2)
# Your code here
count.get_feature_names()

['about',
 'am',
 'amazing',
 'best',
 'end',
 'going',
 'how',
 'is',
 'it',
 'ml',
 'not',
 'sure',
 'the',
 'this',
 'to',
 'yes']

As we can see from executing the preceding command, the feature names are returned as an alphabetically sorted Python list; the position of each word in the list is the integer index of its column in the feature vectors. Next, let's print the feature vectors that we just created:

In [0]:
# Call toarray() on your 'bag' to see the feature vectors
# Your code here
bag.toarray()

array([[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1, 0, 0, 1],
       [1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0]])

In [0]:
# What is the index of the word 'is' and how many times does it occur in all three tweets?
# Hint: You can count directly on the feature vectors
# Your answer here
count.vocabulary_

{'about': 0,
 'am': 1,
 'amazing': 2,
 'best': 3,
 'end': 4,
 'going': 5,
 'how': 6,
 'is': 7,
 'it': 8,
 'ml': 9,
 'not': 10,
 'sure': 11,
 'the': 12,
 'this': 13,
 'to': 14,
 'yes': 15}

Each index position in the feature vectors corresponds to the integer value stored for a word in the CountVectorizer vocabulary dictionary. For example, the first feature at index position 0 represents the count of the word 'about', which only occurs in the last document. These values in the feature vectors are also called the **raw term frequencies**: `tf(t,d)`, the number of times a term `t` occurs in a document `d`.
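To answer the question about the word 'is' without counting by eye, the lookup can be sketched in plain Python. The `vocabulary` dictionary and the `feature_vectors` rows below are copied by hand from the outputs shown above; the variable names themselves are only illustrative.

```python
# Copied from the count.vocabulary_ output above
vocabulary = {'about': 0, 'am': 1, 'amazing': 2, 'best': 3, 'end': 4,
              'going': 5, 'how': 6, 'is': 7, 'it': 8, 'ml': 9, 'not': 10,
              'sure': 11, 'the': 12, 'this': 13, 'to': 14, 'yes': 15}

# Copied from the bag.toarray() output above (one row per tweet)
feature_vectors = [
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0],
]

# The vocabulary gives the column index of the word
index_of_is = vocabulary['is']

# Summing that column over all rows gives the total number of occurrences
total_is = sum(row[index_of_is] for row in feature_vectors)

print(index_of_is, total_is)
```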


### How relevant are words? Term frequency-inverse document frequency

We could use these raw term frequencies to score the words in our algorithm. There is a problem though: if a word is very frequent in _all_ documents, then it probably doesn't carry a lot of information. In order to tackle this problem we can use **term frequency-inverse document frequency**, which reduces the score the more frequent the word is across all tweets. It is calculated like this:

\begin{equation*}
\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d)
\end{equation*}

_tf(t,d)_ is the raw term frequency described above. _idf(t,d)_ is the inverse document frequency, which can be calculated as follows:

\begin{equation*}
\text{idf}(t,d) = \log \frac{n_d}{1+\text{df}(t,d)}
\end{equation*}

where `n_d` is the total number of documents and _df(t,d)_ is the number of documents in which the term `t` appears.

The `1` added to the denominator keeps the idf of a term that appears in all documents from being exactly zero, so such terms are not entirely ignored; it also avoids a division by zero for terms that appear in no document. The `log` ensures that terms with a low document frequency don't get too much weight.
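As a rough check of the formula above, here is a small plain-Python sketch that applies it to the three example tweets. The helper names (`tokenize`, `df`, `idf`, `tf_idf`) are made up for this example, the natural logarithm is an assumption (the text does not fix the base), and the tokenization rule (lowercase, tokens of at least two word characters) is chosen to reproduce the vocabulary shown earlier.

```python
import math
import re
from collections import Counter

tweets = [
    'This is amazing!',
    'ML is the best, yes it is',
    'I am not sure about how this is going to end...',
]

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r'\b\w\w+\b', text.lower())

# One Counter per document: the raw term frequencies tf(t, d)
docs = [Counter(tokenize(tweet)) for tweet in tweets]
n_d = len(docs)

def df(term):
    """Number of documents in which the term appears."""
    return sum(1 for doc in docs if term in doc)

def idf(term):
    """Inverse document frequency, log(n_d / (1 + df))."""
    return math.log(n_d / (1 + df(term)))

def tf_idf(term, doc_index):
    """Raw term frequency times inverse document frequency."""
    return docs[doc_index][term] * idf(term)

print('amazing in tweet 0:', tf_idf('amazing', 0))
print('is in tweet 1:', tf_idf('is', 1))
```

With this formula, 'is' (present in every tweet) ends up with a lower score than 'amazing' (present in only one), even though 'is' occurs twice in the second tweet.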

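Fortunately, `scikit-learn` does all those calculations for us (its `TfidfVectorizer` uses a smoothed variant of the idf formula and L2-normalizes each vector by default, so its numbers differ from the formula above):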

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

# Show numbers with 2 digits after the decimal point in this notebook
np.set_printoptions(precision=2)

# Feed the tf-idf Vectorizer with tweets using fit_transform()
tfidf_vec = tfidf.fit_transform(tweets)

In [0]:
tfidf.get_feature_names()

['about',
 'am',
 'amazing',
 'best',
 'end',
 'going',
 'how',
 'is',
 'it',
 'ml',
 'not',
 'sure',
 'the',
 'this',
 'to',
 'yes']

In [0]:
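# Now what are the weights of the words 'is' and 'amazing'?
# hint: use
# tfidf.get_feature_names()
# tfidf_vec.toarray()
# Your answer here
# Column 7 is 'is' and column 2 is 'amazing'; each weight is averaged over the three tweets.
# A notebook cell only displays its last expression, so the output below is for 'amazing'.
tfidf_vec.toarray()[:,7].mean()
tfidf_vec.toarray()[:,2].mean()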

0.24011114968499644

## Step 1: Data clean up

### Removing stop words

Now that we know how to format and score our input, let's look at our **real** vocabulary. Specifically, the most common words:

In [0]:
from collections import Counter

# Example
count = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    count[word] += 1
print(count)
print(count.most_common(2))

In [0]:
vocab = Counter()

# Let's apply the example above to count words in our SentimentText
# Your code here
for document in sentiment['SentimentText']:
  for word in document.split(' '):
    vocab[word] += 1

In [0]:
vocab.most_common(20)

[('', 123916),
 ('I', 32879),
 ('to', 28810),
 ('the', 28087),
 ('a', 21321),
 ('you', 21180),
 ('i', 15995),
 ('and', 14565),
 ('it', 12818),
 ('my', 12385),
 ('for', 12149),
 ('in', 11199),
 ('is', 11185),
 ('of', 10326),
 ('that', 9181),
 ('on', 9020),
 ('have', 8991),
 ('me', 8255),
 ('so', 7612),
 ('but', 7220)]

As you can see, the most common words are meaningless in terms of sentiment: _I, to, the, and_... they don't give any information on positiveness or negativeness. They're basically **noise** that can most probably be eliminated. These kinds of words are called _stop words_, and it is common practice to remove them when doing text analysis. (The empty string at the top of the list is a by-product of splitting on single spaces: consecutive spaces in a tweet produce empty tokens.)

In [0]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

vocab_reduced = Counter()
# Go through all of the items of vocab using vocab.items() and pick only words that are not in 'stop' 
# and save them in vocab_reduced
# Your code here
for word, freq in vocab.items():
  if word not in stop:
    vocab_reduced[word] = freq

vocab_reduced.most_common(20)

[('', 123916),
 ('I', 32879),
 ("I'm", 6416),
 ('like', 5086),
 ('-', 4922),
 ('get', 4864),
 ('u', 4194),
 ('good', 3953),
 ('love', 3494),
 ('know', 3472),
 ('go', 2990),
 ('see', 2868),
 ('one', 2787),
 ('got', 2774),
 ('think', 2613),
 ('&amp;', 2556),
 ('lol', 2419),
 ('going', 2396),
 ('really', 2287),
 ('im', 2200)]

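This looks better: among the 20 most common words we already see words that make sense, such as good, love, really... Note that `I` and the empty string are still there, because the stop word list is lowercase and we have not cleaned the text yet.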
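One way to deal with the leftover `I` and the empty tokens is to compare words in lowercase and to split on any whitespace. The sketch below shows the idea on a made-up mini corpus; the `stop_words` set is a hypothetical hand-picked list standing in for the NLTK one, and the texts are invented for illustration.

```python
from collections import Counter

# Hypothetical mini stop list; the notebook uses NLTK's English list instead
stop_words = {'i', 'to', 'the', 'a', 'you', 'and', 'it', 'is'}

# Made-up texts; note the double space in the first one
texts = ['I love  the movie', 'i think it is good', 'You and I love it']

vocab = Counter()
for text in texts:
    # split() without arguments collapses repeated whitespace,
    # so no empty-string tokens are produced
    for word in text.split():
        vocab[word] += 1

# Compare in lowercase so that 'I' and 'You' are filtered out as well
vocab_reduced = Counter({
    word: freq for word, freq in vocab.items()
    if word.lower() not in stop_words
})

print(vocab_reduced.most_common())
```

Using a `set` for the stop words also makes each membership test faster than scanning a list.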

### Removing special characters and "trash"

If you look closer, you'll see that we're also taking into consideration punctuation marks ('-', ',', etc.) and HTML entities like `&amp;`. We can definitely remove them for the sentiment analysis, but we will try to keep the emoticons, since those _do_ have a sentiment load:

In [0]:
import re

def preprocessor(text):
    """ Return a cleaned version of text
    """
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Remove any non-word character and append the emoticons,
    # removing the nose character for standardization. Convert to lower case
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))
    
    return text

# Create some random texts for testing the function preprocessor()
print(preprocessor(''))

 

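The empty string above does not exercise much of `preprocessor()`. As a rough check, the sketch below repeats the function so that it runs on its own and applies it to a made-up tweet; the sample text is invented for illustration.

```python
import re

def preprocessor(text):
    """Return a cleaned version of text (same logic as the function above)."""
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Replace non-word characters with spaces, lowercase the text,
    # and append the emoticons without their nose character
    text = (re.sub(r'[\W]+', ' ', text.lower())
            + ' ' + ' '.join(emoticons).replace('-', ''))
    return text

# Made-up sample with a tag, an HTML entity and two emoticons
sample = '<b>This</b> is SO good :-) &amp; fun ;)'
cleaned = preprocessor(sample)
print(cleaned.split())
```

The tag is removed and both emoticons survive at the end of the text, but the entity `&amp;` leaves the token `amp` behind, because only its non-word characters are stripped.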

We are almost ready! There is another trick we can use to reduce our vocabulary and consolidate words. If you think about it, words like love, loving, etc. _could_ express the same positivity. If that were the case, we would have two words in our vocabulary when we could have only one: love. This process of reducing a word to its root is called **stemming**.

We also need a _tokenizer_ to break down our tweets into individual words. We will implement two tokenizers, a regular one and one that does stemming:

In [0]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Write a function called `tokenizer()` that splits a text into a list of words
def tokenizer(text):
    # Your code here
    return text.split()


# Write a function named `tokenizer_porter()` that splits a text into a list of words and applies stemming
# Hint: porter.stem(word)
def tokenizer_porter(text):
    # Your code here
    return [porter.stem(word) for word in text.split()]

# Testing
print(tokenizer('Hi there, I am loving this, like with a lot of love'))
print(tokenizer_porter('Hi there, I am loving this, like with a lot of love'))

['Hi', 'there,', 'I', 'am', 'loving', 'this,', 'like', 'with', 'a', 'lot', 'of', 'love']
['Hi', 'there,', 'I', 'am', 'love', 'this,', 'like', 'with', 'a', 'lot', 'of', 'love']


## Step 2: Representation

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

stop = stopwords.words('english')

tfidf = TfidfVectorizer(stop_words=stop,
                        tokenizer=tokenizer_porter,
                        preprocessor=preprocessor)

## Step 3: Classification

We are finally ready to train our algorithm. 

In [0]:
# split the dataset in train and test
# Your code here

from sklearn.model_selection import train_test_split

X = sentiment['SentimentText']

y = sentiment['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# A pipeline chains several steps together once the initial exploration is done.
# Some steps transform features (normalise numerical values, turn text into vectors,
# or fill in missing data): these are transformers. Other steps predict a variable by
# fitting an algorithm: these are estimators. A Pipeline chains them so that they can
# be applied to the training data in one call.
clf = Pipeline([('vect', tfidf),
                ('clf', LogisticRegression(random_state=0))])

clf.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=<function preprocessor at 0x7f19fb484488>,
                                 smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', '...
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_porter at 0x7f19fb44eb70>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
         

In [0]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Now apply the metrics imported above to evaluate your model
# Your code here
predictions = clf.predict(X_test)
accuracy_score(y_test, predictions)

0.7546254625462546

In [0]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.75      0.67      0.71      8804
           1       0.76      0.82      0.79     11194

    accuracy                           0.75     19998
   macro avg       0.75      0.75      0.75     19998
weighted avg       0.75      0.75      0.75     19998
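To make the columns of the report easier to read, the sketch below computes precision, recall, F1 and support by hand for each class. The labels are hypothetical values made up for illustration (they are not taken from the dataset), and `report_for_class()` is a helper invented for this example.

```python
# Hypothetical true and predicted labels, made up for illustration only
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

def report_for_class(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # Precision: share of predictions of this class that were right
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of true members of this class that were found
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Support: number of true members of this class
    support = tp + fn
    return precision, recall, f1, support

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

for label in (0, 1):
    print(label, report_for_class(y_true, y_pred, label))
print('accuracy', accuracy)
```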



Finally, let's run some tests :-)

In [0]:
tweets = [
    "This is really bad",
    "I love this!",
    ":)",
]

preds = clf.predict_proba(tweets)

for i in range(len(tweets)):
    print(f'{tweets[i]} --> Negative, Positive = {preds[i]}')

This is really bad --> Negative, Positive = [0.96 0.04]
I love this! --> Negative, Positive = [0.07 0.93]
:) --> Negative, Positive = [0.39 0.61]
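The two numbers printed for each tweet are class probabilities. As a rough sketch of where such numbers come from, logistic regression passes a weighted sum of the features through the sigmoid function. The weights, bias and feature values below are made up for illustration; they are not the ones learned by `clf`, and the `predict_proba()` function here is a toy stand-in, not the scikit-learn method.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for a few stemmed tokens, and a hypothetical bias
weights = {'love': 2.1, 'bad': -2.6, 'realli': 0.2}
bias = 0.1

def predict_proba(features):
    """Return [P(negative), P(positive)] for a dict of token -> feature value."""
    # Weighted sum of the features; unknown tokens contribute nothing
    z = bias + sum(weights.get(token, 0.0) * value
                   for token, value in features.items())
    p_positive = sigmoid(z)
    return [1.0 - p_positive, p_positive]

# Hypothetical tf-idf-like feature values
print(predict_proba({'love': 0.8}))
print(predict_proba({'realli': 0.5, 'bad': 0.7}))
```

A large positive weighted sum pushes the second probability towards 1, a large negative one pushes it towards 0, and the two probabilities always add up to 1.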


If we would like to use the classifier somewhere else, or just avoid training it again every time, we can save the model in a pickle file:

In [0]:
import pickle

# Use a context manager so the file is closed after writing
with open('logisticRegression.pkl', 'wb') as f:
    pickle.dump(clf, f)

In [0]:
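# pickle stores the custom tokenizer and preprocessor by name,
# so those functions must be defined before loading the model
with open('logisticRegression.pkl', 'rb') as f:
    model = pickle.load(f)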

**Good Job!**