<h1><center> Logistic Regression Classification </center></h1>

*https://realpython.com/logistic-regression-python/#multi-variate-logistic-regression*<br>
*https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/*

## *Written by Nathanael Hitch*

<hr>

<span style="background-color:DeepPink; color:white; font-size:20px">Appends Added:</span>

1. Text_cleaner without using a spaCy model
2. Model testing and training with different files

<hr>

## What is it?

Classification refers to supervised machine learning that *tries* to predict which category an entity belongs to based on its features. In standard classification, these categories are discrete and finite, e.g. true/false or positive/negative.<br>
Regression refers to continuous, unbounded outputs, e.g. estimating an employee's salary from features of their job.

Despite its name, logistic regression is a classification method: it estimates the probability that an entity belongs to one of 2 outcomes, or more if the model needs to.

## How does it work?

The classification is done by a *Sigmoid function*:

<img src="Images\Sigmoids.png">

The logistic regression's goal is to calculate its function such that its predicted outcome, $p(x_i)$, is as close as possible to the actual outcome. **Remember**, the actual response is usually binary, e.g. 1 or 0. This means that each $p(x_i)$ should be close to either 0 or 1, which is why it is convenient to use the sigmoid function.

`Detailed mathematical methodology:`<br>
https://realpython.com/logistic-regression-python/#multi-variate-logistic-regression

Using data where the actual outcome is known, the model needs to be trained/fitted and tested to show that it can accurately predict outcomes for new inputs. The training determines the best predicted weights for the predicted function, $p(x_i)$. Once determined, the predicted outcome = 1 when $p(x_i)$ > 0.5, and 0 otherwise. [The threshold value doesn't have to be 0.5 (it usually is) and can be altered depending on the situation.]<br>
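
A minimal sketch of the sigmoid and the threshold rule described above, using only the standard library. The logit values and the helper names `sigmoid` and `classify` are made up for illustration; they are not part of the model built later in this notebook.

```python
import math

def sigmoid(z):
    """Map any real-valued logit z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, threshold=0.5):
    """Return 1 when the probability is above the threshold, otherwise 0."""
    return 1 if probability > threshold else 0

logits = [-4.0, 0.0, 4.0]  # hypothetical logit values
probabilities = [sigmoid(z) for z in logits]

# Default threshold of 0.5
labels = [classify(p) for p in probabilities]

# A stricter threshold: only very confident predictions count as positive
strict_labels = [classify(p, threshold=0.99) for p in probabilities]

print(probabilities)
print(labels, strict_labels)
```

Raising the threshold makes the model more reluctant to predict 1, which can be useful when false positives are costly.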

## Classification Performance

There are four possible types of results; as an example, a machine detecting whether a patient DOES (1) or DOES NOT (0) have cancer:

1. **True Negative** = Correct Prediction: correctly predicted negatives (0)<br>
Correctly predicted that the patient DOES NOT have cancer.

2. **True Positive** = Correct Prediction: correctly predicted positives (1)<br>
Correctly predicted that the patient DOES have cancer.

3. **False Positive** = Incorrect Prediction: incorrectly predicted positives (1)<br>
Incorrectly predicted that the patient DOES have cancer when they DO NOT.

4. **False Negative** = Incorrect Prediction: incorrectly predicted negatives (0)<br>
Incorrectly predicted that the patient DOES NOT have cancer when they DO.

While the simplest indicator of the model's accuracy is the ratio of the number of correct predictions to the total number of predictions, there are other indicators of binary classifiers:

- The positive/negative predictive value: ratio of the number of true positives/negatives to the sum of the numbers of true and false positives/negatives.
- The sensitivity (recall or true positive rate): the ratio of the number of true positives to the number of actual positives.
- The specificity (or true negative rate): ratio of the number of true negatives to the number of actual negatives.
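
The indicators above can be written out directly from the four result counts. This is a small sketch with made-up counts; the function name `binary_indicators` is hypothetical.

```python
def binary_indicators(tp, tn, fp, fn):
    """Compute common binary classification indicators from the four counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),  # recall / true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Made-up counts: 100 patients, 50 of whom actually have cancer
scores = binary_indicators(tp=40, tn=45, fp=5, fn=10)

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```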

### Variates in Logistic Regression

**Single-variate** logistic regression has one independent variable; one variable to make the predictive outcome on.

<img src="Images\single-variate.png" style="width:750px">

There is a given set of input-output (x-y) pairs, represented by green circles, that are your observations. Remember, the outcome is binary and can only be 0 or 1: e.g. for the leftmost green circle, x = 0 and the actual output y = 0. The rightmost observation has x = 9 and y = 1.

Logistic regression finds the predicted weights, which define the *logit*, $f(x)$, the dashed black line. This in turn defines the predicted probability, $p(x) = 1 / (1 + exp(-f(x)))$, the solid black line.<br>
In the above case, the threshold $p(x)$ = 0.5 and $f(x)$ = 0 corresponds to $x$ being slightly higher than 3. This value is the limit between the inputs with the predicted outputs of 0 and 1.

In the *real world*, an observation with $x$ < 3 can have an actual outcome of 1 rather than the predicted value of 0 for this model.<br>
This is where the Logistic Regression model can be weak.
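
To make the relationship between the logit and the threshold concrete, here is a sketch with hypothetical weights `b0` and `b1`. They are chosen so the boundary lands a little above 3, and are not the weights fitted in the figure. With a single input, $f(x) = b_0 + b_1 x$, so $f(x) = 0$ (and therefore $p(x) = 0.5$) at $x = -b_0 / b_1$.

```python
import math

# Hypothetical weights, chosen only for illustration
b0, b1 = -3.5, 1.0

def logit(x):
    """Linear function f(x) = b0 + b1 * x."""
    return b0 + b1 * x

def probability(x):
    """Predicted probability p(x) = 1 / (1 + exp(-f(x)))."""
    return 1.0 / (1.0 + math.exp(-logit(x)))

# The input value where f(x) = 0 and p(x) = 0.5
boundary = -b0 / b1

# Predicted outputs for x = 0 .. 9
predictions = [1 if probability(x) > 0.5 else 0 for x in range(10)]

print(boundary)
print(predictions)
```

Every input below the boundary is predicted as 0 and every input above it as 1, regardless of what the actual outcome turns out to be.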

**Multi-variate** logistic regression has more than one input variable.

<img src="Images\multi-variate.png" style="width:750px">

The above graph is different from the single-variate graph as both axes represent the inputs. The outputs differ in color: white circles = 0, the green circles = 1.

For a multi-variate function, more predicted weights are required for the logit, and the predicted probability for the logistic function is $p(x_1,x_2) = 1 / (1 + exp(-f(x_1,x_2)))$.<br>
The dash-dotted black line (i.e. the logit) linearly separates the two classes with the line corresponding to $p(x_1,x_2)$ = 0.5 and $f(x_1,x_2)$ = 0.

### Overfit & Regularisation

Overfitting occurs when a model learns the training data too well; the model learns not only the relationships among data but also the noise in the dataset.<br>
Overfitting usually occurs with complex models; such models tend to perform well with the data used to fit them (the training data) while performing poorly with unseen data (the test data).

Regularisation can significantly improve model performance on unseen data. This is done by reducing the complexity of the model; the techniques applied with logistic regression mostly penalise large predicted weights.
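
A hedged sketch of how this penalty shows up in *scikit-learn*: `LogisticRegression` applies L2 regularisation by default, and its `C` parameter is the inverse of the regularisation strength (smaller `C` means a stronger penalty). The data below is synthetic, generated with `make_classification`, and is only there to show the effect on the size of the weights.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Small C = strong penalty on large weights
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)

# Large C = weak penalty on large weights
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)

# Overall size of the learned weights under each setting
strong_norm = float(np.linalg.norm(strong.coef_))
weak_norm = float(np.linalg.norm(weak.coef_))

print("Weight norm with C=0.01:", strong_norm)
print("Weight norm with C=100: ", weak_norm)
```

The strongly regularised model ends up with smaller weights; whether that improves results on unseen data has to be checked on a test set.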

# Example
***
This model will look at the review text of Amazon products and predict whether it is positive or negative.
***

The information is a tab-separated-value file (.tsv) with 5 columns:

- Rating: rating each user gave the Alexa (out of 5)
- Date: date of the review
- Variation: the model the user is reviewing
- Verified_Reviews: text of each review
- Feedback: contains a sentiment label; 1 = positive, 0 = negative

The feedback column already states whether the review was positive or negative, so we can use that to train and test the model.

Start by loading the necessary data:

In [1]:
# Importing needed packages
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

df_amazon = pd.read_csv("Files/amazon_alexa.tsv", sep="\t")
# Loading the Amazon .tsv file.
    # 'sep' is the delimiter to use - can't automatically detect the separator
    # '\t' = tab
    
""" Useful DataFrame (df) functions """

print(df_amazon.head(),"\n")
# The first 5 records from the DataFrame

print(df_amazon.shape,"\n")
# Returns a tuple of the dimensions of the DataFrame

print(df_amazon.info(),"\n")
# View data information

print(df_amazon.feedback.value_counts())
# Number of reviews for each feedback value (i.e. 1 or 0)

   rating       date         variation  \
0       5  31-Jul-18  Charcoal Fabric    
1       5  31-Jul-18  Charcoal Fabric    
2       4  31-Jul-18    Walnut Finish    
3       5  31-Jul-18  Charcoal Fabric    
4       5  31-Jul-18  Charcoal Fabric    

                                    verified_reviews  feedback  
0                                      Love my Echo!         1  
1                                          Loved it!         1  
2  Sometimes while playing a game, you can answer...         1  
3  I have had a lot of fun with this thing. My 4 ...         1  
4                                              Music         1   

(3150, 5) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   ve

Next we'll use spaCy to tokenise the data, strip information we don't need (stopwords, punctuation etc.), perform lemmatisation and lowercase the text.

`Print statements have been commented out for when creating the Logistic Regression Model at the end.`

In [2]:
import string # Contains a useful list of punctuation marks.
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")

punctuations = string.punctuation
# Create list of punctuation marks

def spacy_cleaner(sentence):
    
    #print("Input sentence:\n", sentence,"\n")
    
    doc = nlp(sentence)
    # Pass text into model's pipeline.
    
    myTokens = [token for token in doc]
    # Creating a list of the words in the sentence.
    #print("Sentence tokenised:\n", myTokens,"\n")
       
    myTokens = [token for token in myTokens if not token.is_stop and token.text not in punctuations]
    # List of words without stopwords or punctuations.
    #print("Sentence without stopwords or punctuations:\n", myTokens, "\n")
    
    myTokens = [token.lemma_.strip().lower() if token.pos_ != "PROPN" else token.lemma_.strip() \
                for token in myTokens]
    # Words are lemmatised, spaces at end removed and (if not a proper noun) lowercased.
    
    #print("Sentence lemmatised, no spaces and lowercase (except Proper Noun):\n", myTokens, "\n")
    
    return myTokens
    
spacy_cleaner("This is not a test sentence, for testing tests from London.")

['test', 'sentence', 'test', 'test', 'London']

To further clean our text data we will create a **custom transformer**, removing end spaces and converting text into lower case. Here, we will create a custom predictors class which inherits from the TransformerMixin class (used to transform data/text). This class overrides the transform, fit and get_params methods. We’ll also create a clean_text() function that removes leading and trailing spaces and converts text into lowercase.

In [3]:
# Custom transformer using spaCy

from sklearn.base import TransformerMixin
class predictors(TransformerMixin):
# Inheriting from TransformerMixin
    
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
    # fit: used for training your model without any pre-processing on the data    
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()

Classifying text as positive or negative is called Sentiment Analysis; to do that we need to represent our text numerically.<br>
We can, amongst other ways, use the Bag-of-Words (BoW) model to convert the text into a matrix of how many times each word has occurred.

We can generate a BoW matrix by using *Scikit-Learn*'s **CountVectorizer**. We can tell the CountVectorizer to use our function *spacy_cleaner* as its tokeniser, and define the n-gram range. For this one, we will be using uni-grams (single words), and the vectoriser will be assigned to 'bow_vector'. In the cell below the *spacy_cleaner* version is commented out, so 'bow_vector' falls back to the default tokeniser.<br>
It would also be good to look at the TF-IDF (term frequency-inverse document frequency); this can be generated by using *Scikit-Learn*'s **TfidfVectorizer**, with our *spacy_cleaner* function as the tokeniser and the result assigned to 'tfidf_vector':

In [4]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

#bow_vector = CountVectorizer(tokenizer = spacy_cleaner, ngram_range=(1,1))
bow_vector = CountVectorizer(ngram_range=(1,1))
# Bag-of-Words n-gram matrix

tfidf_vector = TfidfVectorizer(tokenizer = spacy_cleaner)
# TF-IDF result
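
To see what a Bag-of-Words matrix looks like, here is a small sketch on two made-up reviews, using `CountVectorizer` with its default tokeniser (not *spacy_cleaner*). Each row is a review, each column is a word from the vocabulary, and each cell is how many times that word occurs in that review.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, for illustration only
reviews = ["love my echo", "love love it"]

vectorizer = CountVectorizer(ngram_range=(1, 1))
bow_matrix = vectorizer.fit_transform(reviews)

# vocabulary_ maps each word to its column index; sort the words by that index
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Convert the sparse matrix to plain lists so it is easy to read
counts = bow_matrix.toarray().tolist()

print(vocabulary)
print(counts)
```

The second review contains "love" twice, so its "love" column holds a 2. A TF-IDF vectoriser would produce a matrix of the same shape, but with weighted values rather than raw counts.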

### Splitting the data between **'Training'** and **'Test'** sets

We will use 70% of the data set as a **training** set, and the remaining 30% as a **testing** set.<br>
Fortunately, *scikit-learn* has a built-in function for doing this, **train_test_split()**:

- X = what we want to split
- ylabels = labels we want to test against
- Plus the size of the test set, as a fraction of the whole data set (here 0.3)

In [5]:
""" Splitting between Training and Testing sets """

from sklearn.model_selection import train_test_split

# df_amazon is the .tsv file that has been previously loaded:
    # df_amazon = pd.read_csv("Files/amazon_alexa.tsv", sep="\t")

X = df_amazon['verified_reviews'] # 'verified_reviews' is what we want to analyse

ylabels = df_amazon['feedback'] # 'feedback' is the label/answer to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)
# Applying X, ylabels and the test size as needed
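
The split above is random, so the scores later on will change from run to run. As a sketch on made-up data, two optional `train_test_split` arguments that the cell above does not use can help: `random_state` makes the split repeatable, and `stratify` keeps the proportion of each label the same in both sets. The test size of 0.25 here is only chosen to give round numbers.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 20 "reviews", 16 positive (1) and 4 negative (0)
texts = [f"review {i}" for i in range(20)]
labels = [1] * 16 + [0] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.25,     # 25% of the data goes to the test set
    random_state=42,    # fixed seed so the split is repeatable
    stratify=labels,    # keep the 80/20 label balance in both sets
)

print(len(X_train), len(X_test))
print("Positives in test set:", sum(y_test))
```

Stratifying matters for unbalanced data such as the Alexa reviews, where most of the first few rows shown earlier are positive.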

### Building the **Logistic Regression Model**

Start by importing the LogisticRegression class; then create a LogisticRegression classifier object. 

*For this example*, we will build a pipeline with 3 components:

- Cleaner: clean and preprocess the text
- Vectoriser: creates a BoW matrix for our text
- Classifier: performs a logistic regression to classify sentiment

In [6]:
# Logistic Regression Classifier

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

classifier = LogisticRegression()
# Creating the LogisticRegression classifier object

# Create a pipeline:
    # This uses the previously made class and objects
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector), # Bag-of-Words CountVectoriser
                 ('classifier', classifier)])

pipe.fit(X_train,y_train)
# Generating/training the model using the previously stated training sets

Pipeline(memory=None,
         steps=[('cleaner', <__main__.predictors object at 0x000002068601CBC8>),
                ('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
 

### Evaluation

Let's see how our model actually performs now that it has been trained.<br>
This is done by using the *metrics* module; we will put our test data through the pipeline to come up with predictions and use the various functions from the metrics module to look at aspects of the pipeline:

- Accuracy: proportion of all predictions that are correct.
- Precision: ratio of true positives to true positives plus false positives.
- Recall: ratio of true positives to true positives plus false negatives.

In [7]:
from sklearn import metrics
# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted)) # Accuracy
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted)) # Precision
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted)) # Recall

Logistic Regression Accuracy: 0.9365079365079365
Logistic Regression Precision: 0.9415656008820287
Logistic Regression Recall: 0.991869918699187


Basically, the model correctly identified the review's sentiment **93.7%** of the time.<br>
When a review was predicted as positive, it was positive **94.2%** of the time.<br>
When given a positive review, the model identified it as positive **99.2%** of the time.
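
These three scores can be traced back to the four result types from the Classification Performance section. The sketch below uses made-up labels (not the Alexa data) and `metrics.confusion_matrix`, which for binary 0/1 labels can be unpacked into the four counts with `.ravel()`.

```python
from sklearn import metrics

# Made-up actual labels and predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1]

# For labels 0 and 1 the flattened matrix is ordered: TN, FP, FN, TP
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = metrics.accuracy_score(y_true, y_pred)
precision = metrics.precision_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy: ", accuracy)   # (TP + TN) / all predictions
print("Precision:", precision)  # TP / (TP + FP)
print("Recall:   ", recall)     # TP / (TP + FN)
```

On unbalanced data, a high accuracy can hide a poor result on the smaller class, so it is worth looking at the counts as well as the scores.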

## Multi-class Classification - One Vs. Rest (OVR)

The standard Logistic Regression model is for use in binary classification. In the case of more than 2 classes, a heuristic method is needed to classify the data. There are 2 methods:

- One-vs-Rest (OVR)
- One-vs-One (OVO)

Both methods split the multi-class dataset into separate binary classification problems. **One-vs-Rest** splits the problem into one class versus the rest; in the scenario used for the NLP project, there are 3 classifications: Positive, Neutral and Negative. Hence **OVR** splits the problem into:

1. Positive vs [Neutral, Negative]
2. Neutral vs [Positive, Negative]
3. Negative vs [Positive, Neutral]

**One-vs-One** splits the problem into each classification against each other individual classification:

1. Positive vs Neutral
2. Positive vs Negative
3. Neutral vs Negative

In datasets with more than 3 classifications, this leads to far more binary problems.<br>
We will be looking at OVR, as the OVO approach is primarily suggested for Support Vector Machines (SVM).
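
The two ways of splitting the problem can be enumerated with a few lines of standard-library Python. This is only a sketch of the bookkeeping, not of any model; the helper name `count_problems` is made up.

```python
from itertools import combinations

classes = ["Positive", "Neutral", "Negative"]

# One-vs-Rest: each class against all of the others together
ovr_problems = [(c, [other for other in classes if other != c]) for c in classes]

# One-vs-One: every unordered pair of classes
ovo_problems = list(combinations(classes, 2))

def count_problems(n_classes):
    """Number of binary problems each approach creates for n_classes."""
    return {"ovr": n_classes, "ovo": n_classes * (n_classes - 1) // 2}

print(ovr_problems)
print(ovo_problems)
print(count_problems(10))
```

With 3 classes both approaches give 3 binary problems, but with 10 classes OVR needs 10 while OVO needs 45.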

The code for OVR is very similar to a standard Logistic Regression. The **difference** comes when declaring the classifier; the *OneVsRestClassifier* needs to be imported from sklearn, then declared with a Logistic Regression classifier injected into it:

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

LogReg = LogisticRegression()
# A Logistic Regression object needs to be injected into OvR

ovr = OneVsRestClassifier(LogReg)

print(dir(ovr))
# Lists the object's attributes and methods

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_estimator_type', '_first_estimator', '_get_param_names', '_get_tags', '_more_tags', '_pairwise', '_required_parameters', 'coef_', 'decision_function', 'estimator', 'fit', 'get_params', 'intercept_', 'multilabel_', 'n_classes_', 'n_jobs', 'partial_fit', 'predict', 'predict_proba', 'score', 'set_params']
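
As a quick check of what the wrapper does once it is fitted, here is a sketch on synthetic 3-class data from `make_blobs` (not the tweet data). After `fit`, the `estimators_` attribute holds one fitted binary Logistic Regression per class.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic data with 3 classes, for illustration only
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

# One binary classifier is trained per class
n_binary_models = len(ovr.estimators_)

# Predicted class and per-class probabilities for the first 5 samples
predictions = ovr.predict(X[:5])
probabilities = ovr.predict_proba(X[:5])

print("Classes:", ovr.classes_)
print("Binary models:", n_binary_models)
print(predictions)
print(probabilities.shape)
```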


The code below is from the NLP project with the same classification possibilities.

`Code below taken from LR-OVR_Model-NH_MT.ipynb`

In [None]:
                        # Importing needed packages
    
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import string
import spacy
from sklearn.multiclass import OneVsRestClassifier

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Reading .csv file
    
################################################################################################################
                ##### Filepaths will need to be changed #####
################################################################################################################

df_train = pd.read_csv("Files/raw/test/tweets-test_1.csv")

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Creating custom cleaner function

nlp = spacy.load("en_core_web_sm")

punctuations = string.punctuation
# Create list of punctuation marks

def spacy_cleaner(sentence):
    
    #print("Input sentence:\n", sentence,"\n")
    
    doc = nlp(sentence.strip())
    # Pass text into model's pipeline.
    
    myTokens = [token for token in doc]
    # Creating a list of the words in the sentence.
    #print("Sentence tokenised:\n", myTokens,"\n")
    
    myTokens = [token for token in myTokens if not token.is_stop and token.text not in punctuations]
    # List of words without stopwords or punctuations.
    #print("Sentence without stopwords or punctuations:\n", myTokens, "\n")
    
    myTokens = [token.lemma_.strip().lower() if token.pos_ != "PROPN" else token.lemma_.strip() \
                for token in myTokens]
    # Words are lemmatised, spaces at end removed and (if not a proper noun) lowercased.
    
    myTokens = [token for token in myTokens if token != ""]
    
    #print("Sentence lemmatised, no spaces and lowercase (except Proper Noun):\n", myTokens, "\n")
    
    return myTokens

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Creating Bag-of-Words Vectoriser

bow_vector = CountVectorizer(tokenizer = spacy_cleaner, ngram_range=(1,1))

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Splitting into Training and Testing sets

X_tr = df_train['text']
Y_tr = df_train['sentiment']

# Below needed if splitting one .csv file into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X_tr, Y_tr, test_size=0.3)

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Building OVR Logistic Regression Classifier
    
LogReg = LogisticRegression()
# Logistic Regression classifier declared

ovr = OneVsRestClassifier(LogReg)
# One-vs-Rest classifier declared with 'LogReg' injected

pipe = Pipeline([('vectorizer', bow_vector)
                 ,('classifier', ovr)])

pipe.fit(X_train, Y_train)

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Evaluating the model

predicted = pipe.predict(X_test)

# Evaluation
print("RAW DATA - BoW:\n")
print("Logistic Regression Accuracy:\n",metrics.accuracy_score(Y_test, predicted),"\n") # Accuracy
print("Logistic Regression Precision:\n",metrics.precision_score(Y_test, predicted, average='macro'),"\n") # Precision
print("Logistic Regression Recall:\n",metrics.recall_score(Y_test, predicted, average='macro'),"\n") # Recall
print("Logistic Regression F1 Score:\n",metrics.f1_score(Y_test, predicted, average='macro')) # F1 Score

<h1><center>APPEND 1</center></h1>

# Text_cleaner without using a spaCy model

As mentioned, the *spacy_cleaner* developed for the Logistic Regression model uses a spaCy model. This has its advantages, as it returns token objects with attributes that can be used directly, e.g. Lemmatisation (*lemma_*) or Stop Words (*is_stop*). NLTK, on the other hand, works on plain strings that are passed through NLTK classes, like a Lemmatiser.<br>
**However**, the spaCy model can slow down training and prediction, as it is applied to every string that comes from the file.

While it doesn't slow down the Logistic Regression model, using the spaCy model with the 'Random Forest' classifier slowed it down drastically.<br>
Below is code for tokenising and cleaning the text in a similar way to the 'spacy_cleaner' in this notebook, but without using a spaCy model.

In [1]:
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import string
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
import nltk
from spacy.lang.en.stop_words import STOP_WORDS

Lemmatiser = nltk.stem.WordNetLemmatizer()
# Instantiating the NLTK Lemmatiser
    # Needs the WordNet corpus, e.g. nltk.download("wordnet")

punctuations = string.punctuation
# Putting punctuation symbols into an object

stopwords = STOP_WORDS
# A set of spaCy stopwords that can be filtered out (no spaCy model needs to be loaded)
    # NLTK also has a stop words object but it has fewer words

def text_cleaner(sentence):

    sentence = "".join([char for char in sentence.strip() if char not in punctuations])
    # Getting rid of any punctuation characters

    myTokens = re.split(r'\W+', sentence)
    # Tokenising the words

    myTokens = [token.lower() for token in myTokens if token and token.lower() not in stopwords]
    # Lowercasing the words, removing empty strings and stop words

    myTokens = [Lemmatiser.lemmatize(token) for token in myTokens]
    # Lemmatising the words

    return myTokens
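
To show the punctuation, tokenising and stop word steps on their own, here is a sketch that uses only the standard library. It assumes a deliberately tiny, made-up stop word set (`demo_stopwords`) and leaves out lemmatisation, so it is not a replacement for `text_cleaner`; the name `simple_cleaner` is hypothetical.

```python
import re
import string

# Hypothetical, deliberately tiny stop word set for illustration only;
# the notebook uses spaCy's much larger STOP_WORDS set.
demo_stopwords = {"this", "is", "not", "a", "for", "from"}

def simple_cleaner(sentence):
    # Remove every punctuation character in one pass
    no_punct = sentence.strip().translate(str.maketrans("", "", string.punctuation))
    # Split on runs of non-word characters (here, the spaces)
    tokens = re.split(r"\W+", no_punct)
    # Lowercase, then drop empty strings and stop words
    return [t.lower() for t in tokens if t and t.lower() not in demo_stopwords]

cleaned = simple_cleaner("This is not a test sentence, for testing tests from London.")
print(cleaned)
```

Compared with the `spacy_cleaner` output earlier in this notebook, "testing" and "tests" are not reduced to "test" because there is no lemmatiser here, and "London" is lowercased because nothing identifies it as a proper noun.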

<h1><center>APPEND 2</center></h1>

# Model testing and training with different files

You can use different files to train and then test the model. In order to do this, when initialising the variables for the dataframe's columns, you need to cast them as type 'string' (here with `.astype(str)`).<br>
This appears to be something that *train_test_split* does automatically.

In [None]:
                        # Importing needed packages
    
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn import preprocessing
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import nltk
import re
from sklearn.multiclass import OneVsRestClassifier

import winsound # Windows-only module; used to play a sound when the run has finished

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Reading .csv file

df_train = pd.read_csv("Files/raw/tweets-train.csv")

df_test = pd.read_csv("Files/raw/tweets-test.csv")

X_train = df_train['text'].astype(str)
Y_train = df_train['sentiment'].astype(str)

X_test = df_test['text'].astype(str)
Y_test = df_test['sentiment'].astype(str)

# The X/Y train/test columns are cast to strings
    # Otherwise the .fit() function would come up with an error

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Creating custom cleaner function

Lemmatiser = nltk.stem.WordNetLemmatizer()
# Instantiating the NLTK Lemmatiser

punctuations = string.punctuation
# Putting punctuation symbols into an object

nlp = spacy.load("en_core_web_sm")
# Import spacy model

stopwords = spacy.lang.en.stop_words.STOP_WORDS
# A list of stopwords that can be filtered out
    # NLTK also has a stop words object but it has fewer words

def text_cleaner(sentence):    
                
    sentence = "".join([char for char in sentence.strip() if char not in punctuations])
    # Getting rid of any punctuation characters
    
    myTokens = re.split(r'\W+', sentence)
    # Tokenising the words
    
    myTokens = [token.lower() for token in myTokens if token and token.lower() not in stopwords]
    # Lowercasing the words, removing empty strings and stop words
    
    myTokens = [Lemmatiser.lemmatize(token) for token in myTokens]
    # Lemmatising the words
    
    return myTokens 

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Creating Bag-of-Words Vectoriser

bow_vector = CountVectorizer(tokenizer = text_cleaner, ngram_range=(1,1))

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Building OVR Logistic Regression Classifier

LogReg = LogisticRegression(max_iter=1000)

ovr = OneVsRestClassifier(LogReg)

pipe = Pipeline([('vectorizer', bow_vector)
                 ,('classifier', ovr)])

pipe.fit(X_train, Y_train)

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Evaluating the model

predicted = pipe.predict(X_test)

# Evaluation
print("RAW DATA - BoW:\n")
print("Logistic Regression Accuracy:\n",metrics.accuracy_score(Y_test, predicted),"\n") # Accuracy
print("Logistic Regression Precision:\n",metrics.precision_score(Y_test, predicted, average='macro'),"\n") # Precision
print("Logistic Regression Recall:\n",metrics.recall_score(Y_test, predicted, average='macro'),"\n") # Recall
print("Logistic Regression F1 Score:\n",metrics.f1_score(Y_test, predicted, average='macro')) # F1 Score

winsound.PlaySound("Files/Alarm07.wav", winsound.SND_FILENAME)