<h1><center> Logistic Regression Classification </center></h1>

*https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/*

## *Written by Nathanael Hitch*

## Example
***
This model will look at the review text of Amazon products and predict whether it is positive or negative.
***

The information is a tab-separated-value file (.tsv) with 5 columns:

- `rating`: rating each user gave the Alexa (out of 5)
- `date`: date of the review
- `variation`: the model the user is reviewing
- `verified_reviews`: text of each review
- `feedback`: contains a sentiment label; 1 = positive, 0 = negative

The feedback column already records whether each review was positive or negative, so we can use it to test the model.

Start by loading the necessary data:

In [10]:
# Importing needed packages
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

df_amazon = pd.read_csv("Files/amazon_alexa.tsv", sep="\t")
# Loading the Amazon .tsv file.
    # 'sep' is the delimiter to use - the default is a comma, so a tab must be given explicitly
    # '\t' = tab
    
""" Useful DataFrame (df) functions """

print(df_amazon.head(),"\n")
# The first 5 records from the DataFrame

print(df_amazon.shape,"\n")
# Returns a tuple of the dimensions of the DataFrame

df_amazon.info()
print()
# View data information (info() prints directly and returns None, so it is not wrapped in print)

print(df_amazon.feedback.value_counts())
# Counts of each feedback value (i.e. 1 or 0)

   rating       date         variation  \
0       5  31-Jul-18  Charcoal Fabric    
1       5  31-Jul-18  Charcoal Fabric    
2       4  31-Jul-18    Walnut Finish    
3       5  31-Jul-18  Charcoal Fabric    
4       5  31-Jul-18  Charcoal Fabric    

                                    verified_reviews  feedback  
0                                      Love my Echo!         1  
1                                          Loved it!         1  
2  Sometimes while playing a game, you can answer...         1  
3  I have had a lot of fun with this thing. My 4 ...         1  
4                                              Music         1   

(3150, 5) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   ve

Next we'll use spaCy to tokenise the data, strip information we don't need (stopwords, punctuation etc.), perform lemmatisation and lowercase the text.

`The print statements have been commented out so they stay quiet when the function is used to build the Logistic Regression model at the end.`

In [9]:
import string # Contains a useful list of punctuation marks.
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")

punctuations = string.punctuation
# Create list of punctuation marks

def spacy_cleaner(sentence):
    
    #print("Input sentence:\n", sentence,"\n")
    
    doc = nlp(sentence)
    # Pass text into model's pipeline.
    
    myTokens = [token for token in doc]
    # Creating a list of the words in the sentence.
    #print("Sentence tokenised:\n", myTokens,"\n")
    
    myTokens = [token for token in myTokens if not token.is_stop and token.text not in punctuations]
    # List of words without stopwords or punctuations.
    #print("Sentence without stopwords or punctuations:\n", myTokens, "\n")
    
    myTokens = [token.lemma_.strip().lower() if token.pos_ != "PROPN" else token.lemma_.strip() \
                for token in myTokens]
    # Words are lemmatised, spaces at end removed and (if not a proper noun) lowercased.
    
    #print("Sentence lemmatised, no spaces and lowercase (except Proper Noun):\n", myTokens, "\n")
    
    return myTokens
    
spacy_cleaner("This is a test sentence, for testing tests from London.")

['test', 'sentence', 'test', 'test', 'London']

To further clean our text data we will create a **custom transformer** that removes leading and trailing spaces and converts text to lowercase. Here, we create a custom predictors class which inherits from the TransformerMixin class (used to transform data/text). This class defines the transform, fit and get_params methods. We'll also create a clean_text() function that strips spaces and converts text to lowercase.

In [3]:
# Custom transformer using spaCy

from sklearn.base import TransformerMixin
class predictors(TransformerMixin):
# Inheriting from TransformerMixin
    
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        # fit: nothing is learned from the data here, so it simply returns the transformer itself
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()
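
As a quick check of what the custom transformer does, here is a minimal sketch that repeats the class from the cell above and runs it on two made-up review strings (the sample strings are assumptions, not rows from the data set):

```python
from sklearn.base import TransformerMixin


def clean_text(text):
    # Strip leading/trailing whitespace and lowercase the text
    return text.strip().lower()


class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Apply clean_text to every document
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn, so just return the transformer
        return self

    def get_params(self, deep=True):
        return {}


# Made-up sample reviews, for illustration only
sample_reviews = ["  Love my Echo!  ", "LOVED it!\n"]

cleaned = predictors().transform(sample_reviews)
print(cleaned)
# ['love my echo!', 'loved it!']
```

Note that punctuation is left in place at this stage; only the surrounding whitespace and the letter case change.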

Classifying text as positive or negative is called sentiment analysis; to do that we need to represent our text numerically.<br>
We can, amongst other ways, use the Bag-of-Words (BoW) model to convert the text into a matrix of how many times each word has occurred.
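
To make the idea of a word-count matrix concrete, here is a small sketch on a made-up two-sentence corpus (the corpus is an assumption for illustration; it uses CountVectorizer with its default settings rather than the spaCy tokeniser):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus, for illustration only
corpus = ["love my echo", "love love music"]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)

# vocabulary_ maps each word to its column index; list the words in column order
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
# ['echo', 'love', 'music', 'my']

# One row per document, one column per word, each cell a count
print(bow_matrix.toarray())
# [[1 1 0 1]
#  [0 2 1 0]]
```

Each row is one document and each column one word, so the second row records that "love" appears twice in the second sentence.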

We can generate a BoW matrix by using *Scikit-Learn*'s **CountVectorizer**. We can tell the CountVectorizer to use our function *spacy_cleaner* as its tokeniser, and define the n-gram range. For this one, we will be using unigrams (single words), and the vectoriser will be assigned to 'bow_vector'. In the cell below the *spacy_cleaner* version is commented out, so 'bow_vector' falls back on CountVectorizer's default tokeniser.<br>
It would also be good to look at TF-IDF; this can be generated with *Scikit-Learn*'s **TfidfVectorizer**, using our *spacy_cleaner* function as the tokeniser and assigning the result to 'tfidf_vector':

In [4]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

#bow_vector = CountVectorizer(tokenizer = spacy_cleaner, ngram_range=(1,1))
bow_vector = CountVectorizer(ngram_range=(1,1))
# Bag-of-Words n-gram matrix

tfidf_vector = TfidfVectorizer(tokenizer = spacy_cleaner)
# TF-IDF result
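
To see how TF-IDF differs from plain counts, here is a rough sketch on a made-up corpus (the corpus is an assumption, and the default tokeniser is used instead of *spacy_cleaner* to keep it self-contained). A word that appears in every document should receive a lower weight than a word that appears in only one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "love" appears in every document, the other words in one each
corpus = ["love echo", "love music", "love sound"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus).toarray()

# Look up the columns for a common word and a rare word
love_col = tfidf.vocabulary_["love"]
echo_col = tfidf.vocabulary_["echo"]

# Compare the two weights within the first document
print("love:", weights[0, love_col])
print("echo:", weights[0, echo_col])
```

In the first document both words occur once, yet "echo" gets the larger weight because it is rarer across the corpus.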

### Splitting the data between **'Training'** and **'Test'** sets

We will use 70% of the data set as a **training** set, and the remaining 30% as a **testing** set.<br>
Fortunately, *scikit-learn* has a built-in function for doing this, **train_test_split()**:

- X = what we want to split
- ylabels = labels we want to test against
- test_size = the size of the test set, as a fraction (0.3 = 30%)

In [5]:
""" Splitting between Training and Testing sets """

from sklearn.model_selection import train_test_split

# df_amazon is the .tsv file that has been previously loaded:
    # df_amazon = pd.read_csv("Files/amazon_alexa.tsv", sep="\t")

X = df_amazon['verified_reviews'] # 'verified_reviews' is what we want to analyse

ylabels = df_amazon['feedback'] # 'feedback' is the label/answer to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)
# Applying X, ylabels and the test size as needed
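
The split above is random, so the rows in each set (and the scores later on) change from run to run. As an optional variation, not part of the original notebook, the sketch below passes `random_state` to make the split repeatable and `stratify` to keep the positive/negative ratio similar in both sets. The toy data and the seed value 42 are assumptions:

```python
from sklearn.model_selection import train_test_split

# Toy data: 20 made-up reviews, 16 positive (1) and 4 negative (0)
reviews = [f"review {i}" for i in range(20)]
labels = [1] * 16 + [0] * 4


def split(seed):
    # Same call as in the notebook, plus a fixed seed and stratification
    return train_test_split(
        reviews, labels, test_size=0.3, random_state=seed, stratify=labels
    )


X_tr, X_te, y_tr, y_te = split(42)
X_tr2, X_te2, y_tr2, y_te2 = split(42)

print(len(X_tr), len(X_te))   # sizes of the training and test sets
print(X_te == X_te2)          # same seed, same test rows
```

With a fixed seed, rerunning the notebook gives the same training and test rows, which makes the evaluation numbers comparable between runs.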

### Building the **Logistic Regression Model**

Start by importing the LogisticRegression class; then create a LogisticRegression classifier object. 

*For this example*, we will build a pipeline with 3 components:

- Cleaner: cleans and preprocesses the text
- Vectoriser: creates a BoW matrix for our text
- Classifier: performs a logistic regression to classify sentiment

In [6]:
# Logistic Regression Classifier

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

classifier = LogisticRegression()
# Creating the LogisticRegression classifier object

# Create pipeline using Bag of Words
    # These use the previously made class and vectoriser
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

pipe.fit(X_train,y_train)
# Generating/training the model using the previously stated training sets

Pipeline(memory=None,
         steps=[('cleaner', <__main__.predictors object at 0x000001248707A848>),
                ('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
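
To show what the pipeline does behind the scenes, here is a simplified sketch that compares a two-step pipeline with the same steps run by hand. The toy texts and labels are made up, and the cleaner step is left out for brevity, so treat this as an illustration rather than the notebook's exact pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training and test texts, for illustration only
train_texts = ["love it", "great sound", "love the music",
               "terrible device", "stopped working", "terrible sound"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["love the sound", "terrible, stopped working"]

# Pipeline version: fit() vectorises the text and then trains the classifier
pipe = Pipeline([("vectorizer", CountVectorizer()),
                 ("classifier", LogisticRegression())])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

# Manual version: the same two steps written out explicitly
vec = CountVectorizer()
clf = LogisticRegression()
clf.fit(vec.fit_transform(train_texts), train_labels)  # learn vocabulary, then train
manual_pred = clf.predict(vec.transform(test_texts))   # reuse the learned vocabulary

print(pipe_pred, manual_pred)
```

The main benefit of the pipeline is that the vocabulary learned from the training data is reused automatically at prediction time, so the test text is never used to fit the vectoriser.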
 

### Evaluation

Let's see how our model actually performs now that it has been trained.<br>
This is done by using the *metrics* module; we will put our test data through the pipeline to come up with predictions and use the various functions from the metrics module to look at aspects of the pipeline:

- Accuracy: proportion of all predictions that are correct.
- Precision: ratio of true positives to true positives plus false positives.
- Recall: ratio of true positives to true positives plus false negatives.
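
The three definitions above can be worked through by hand on a small example. The labels below are made up for illustration; the sketch counts true/false positives and negatives directly and then checks the results against the *metrics* functions:

```python
from sklearn import metrics

# Made-up true labels and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, tn, fp, fn)               # 4 2 1 1
print(accuracy, precision, recall)  # 0.75 0.8 0.8

# The scikit-learn functions give the same values
print(metrics.accuracy_score(y_true, y_pred),
      metrics.precision_score(y_true, y_pred),
      metrics.recall_score(y_true, y_pred))
```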

In [7]:
from sklearn import metrics
# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted)) # Accuracy
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted)) # Precision
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted)) # Recall

Logistic Regression Accuracy: 0.9365079365079365
Logistic Regression Precision: 0.9442586399108138
Logistic Regression Recall: 0.9883313885647608


Basically, in this run the model correctly identified the review's sentiment **93.7%** of the time.<br>
When a review was predicted as positive, it was actually positive **94.4%** of the time.<br>
When given a positive review, the model identified it as positive **98.8%** of the time.