<h1><center> K-Fold Analysis </center></h1>

LinkedIn Learning: https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/cross-validation-and-evaluation-metrics?u=78163626<br>
https://machinelearningmastery.com/k-fold-cross-validation/<br>
https://scikit-learn.org/stable/modules/cross_validation.html

## *Written by Nathanael Hitch*

<div style="color:purple">Background on classification and other metrics can be found in:</div>
<ul style="color:purple">
    <li>NLP_Logisitic-Regression.ipynb</li>
    <li>Random_Forest.ipynb</li>
</ul>

Previous notebooks on NLP sentiment models covered ways of measuring how good a model is when run on testing data. For example:

- Accuracy: the percentage of all predictions that are correct.
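
As a quick illustration of the accuracy metric, the sketch below compares a handful of predictions against the true labels. Both lists are made up for this example and are not taken from any of the notebooks.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels, invented for illustration only
y_true = ["pos", "neg", "pos", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]

# Fraction of predictions that match the true labels
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)
# 4 of the 5 predictions match, so this prints 0.8
```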

# What is it?

K-fold cross-validation is when "the full data set is divided into *k*-subsets (*k* being a number) with the holdout method repeated *k* times. Each time, one of the subsets is used to test the model while all the other subsets are put together to train the model".

For a **5 fold validation**, k = 5, with a data set of 10,000 examples:

1. The data set is split into **5** subsets of 2,000 examples each.
2. The first subset is put aside for testing; the other 4 subsets train the model.
3. The accuracy (or another metric) of the model is determined using the testing subset.
4. This is repeated so that each subset is used once as the testing subset, with the other 4 training a fresh model.
5. An accuracy score is recorded for each of the 5 models.
6. The full array of scores is output, along with their average as the model's overall accuracy.
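
The steps above can be sketched as a manual loop. This is only an illustration: the data is synthetic (generated with `make_classification`) and the Logistic Regression model is a stand-in, both chosen here as assumptions rather than taken from the sentiment work.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption, not the sentiment data set)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Step 1: split the data set into 5 subsets
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kfold.split(X):
    # Steps 2 and 4: train a fresh model on the 4 training subsets
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])

    # Steps 3 and 5: score the model on the held-out testing subset
    predictions = model.predict(X[test_idx])
    scores.append(accuracy_score(y[test_idx], predictions))

# Step 6: one score per fold, plus their average
scores = np.array(scores)
print(scores)
print(scores.mean())
```

In practice 'cross_val_score' wraps this loop in a single call.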

You can then see how far the lowest and highest scores lie from the model's average score.<br>
Be aware that even small drops in accuracy can have large consequences, e.g. in a business setting where millions of pounds are involved.
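
A minimal sketch of that comparison, using the five fold scores printed by the complete example at the end of this notebook (the variable names are chosen here for illustration):

```python
import numpy as np

# Fold scores reported by the complete example at the end of this notebook
scores = np.array([0.69365108, 0.69505095, 0.69359534, 0.68941048, 0.69978166])

mean_score = scores.mean()
spread = scores.max() - scores.min()   # gap between the best and worst fold

print(f"mean:    {mean_score:.4f}")
print(f"lowest:  {scores.min():.4f}")
print(f"highest: {scores.max():.4f}")
print(f"spread:  {spread:.4f}")
print(f"std:     {scores.std():.4f}")
```

For these scores the folds sit within roughly one percentage point of each other.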

## Usefulness

Testing the model over a number of different subsets gives a better indication of how the model will perform in the real world.<br>
The size of the differences between the fold scores shows how consistently the model is likely to perform once it is put into production.

We can import the KFold class from sklearn for ease of use. Using the KFold object, you can see how the data set is split:

In [3]:
from numpy import array
from sklearn.model_selection import KFold

data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
# data sample

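kfold = KFold(n_splits=3, shuffle=True, random_state=1)
# prepare cross validation: 3 folds, shuffled with a fixed seed
# (recent sklearn versions require shuffle and random_state as keyword arguments)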

for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))
# enumerate splits

train: [0.1 0.4 0.5 0.6], test: [0.2 0.3]
train: [0.2 0.3 0.4 0.6], test: [0.1 0.5]
train: [0.1 0.2 0.3 0.5], test: [0.4 0.6]


# Example - sentiment analysis

We first need to load the data; here a csv file with the required information is read in using pandas.<br>
The necessary columns are then separated into variables for training and testing the model:

- body_text = text to be analysed
- sentiment = sentiment of the text

In [None]:
import pandas as pd

df = pd.read_csv("test_data.csv")
# csv data file

X_data = df['body_text']
# Text to be analysed

Y_labels = df['sentiment']
# Sentiment of the text

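The text then needs to be cleaned, tokenised and converted into numeric features (vectorised); that code is left out here and appears in the complete example below. The resulting feature matrix is referred to as X in the next cell.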
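
A minimal sketch of the vectorising step, assuming a plain Bag-of-Words CountVectorizer with its default settings. The three texts are hypothetical stand-ins for the 'body_text' column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the 'body_text' column
texts = [
    "I love this phone",
    "I hate this phone",
    "this film was great",
]

# Default settings: lower-cases the text and keeps tokens of 2+ characters
vectoriser = CountVectorizer()

# Sparse matrix with one row per text and one column per vocabulary word
X = vectoriser.fit_transform(texts)

print(sorted(vectoriser.vocabulary_))
print(X.shape)
```

Fitting the vectoriser on the whole data set before cross-validating lets the test folds influence the vocabulary; the complete example further down avoids this by placing the vectoriser inside a Pipeline.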

The k-fold analysis is then set up using sklearn modules:

In [None]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_jobs=-1)

k_fold = KFold(n_splits=5)
# n_splits is the number of splits of the data needed, 5 in this case

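cross_val_score(estimator=rfc, X=X, y=Y_labels, cv=k_fold, scoring='accuracy', n_jobs=-1)
# Cross-validating the model: returns one accuracy score per fold
# X is the vectorised (numeric) form of X_data, produced by the tokenising step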

There are a number of parameters that can be used with 'cross_val_score'. The ones used above are the most important:

- estimator: **which classifier are you using?**<br>
In this example, it is the Random Forest Classifier; however, other classifiers can be used, such as Logistic Regression, SVM or Naive Bayes.

- X: **the variable that is being analysed.**<br>
E.g. the 'body_text' data

- y: **what the sentiment of the text should be.**<br>
E.g. the 'sentiment' data

- cv: **the cross-validation splitting strategy.**<br>
Pass a cross-validation generator (as above), an integer for the number of folds wanted, or an iterable of train/test splits.

- scoring: **what metric do you want analysed in the models?**<br>
This can be accuracy, precision, recall etc.

- n_jobs = -1: **folds to be run in parallel.**<br>
The models are independent of each other, so they can be trained and tested on all available processors at once, which is quicker.
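
The sketch below shows how the 'cv' and 'scoring' parameters can be varied. It runs on synthetic two-class data from `make_classification`, which is an assumption made here so the example is self-contained; the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the vectorised text and its labels (an assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

rfc_demo = RandomForestClassifier(n_estimators=50, random_state=0)

# cv given as a KFold object
scores_kfold = cross_val_score(rfc_demo, X_demo, y_demo,
                               cv=KFold(n_splits=5), scoring='accuracy')

# cv given as an integer; for a classifier sklearn then uses stratified folds
scores_int = cross_val_score(rfc_demo, X_demo, y_demo, cv=5, scoring='accuracy')

# A different metric, selected through the scoring parameter
scores_recall = cross_val_score(rfc_demo, X_demo, y_demo, cv=5, scoring='recall')

print(scores_kfold)
print(scores_int)
print(scores_recall)
```

Plain 'recall' works here because the synthetic labels are binary (0 and 1). With multi-class or string labels, as in the sentiment data, an averaged variant such as 'recall_macro' is the usual choice.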

## Example Code

Below is the complete code for k-fold analysis of a Random Forest Classifier using a Bag-of-Words vectoriser, with some additional code for cleaning the data.

In [2]:
import spacy
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import string
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
import nltk
from sklearn.pipeline import Pipeline

import winsound
# Windows-only standard library module, used to play a sound when the run finishes

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Reading .csv file

df = pd.read_csv("Files/raw/tweets-train.csv")

X = df['text'].astype(str)
# Convert to type 'string' as pandas converts inputs to their most relevant type
    # The issue is sometimes pandas converts data to a 'float'; this doesn't work with the evaluation functions

Y = df['sentiment'].astype(str)

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Creating custom tokeniser and cleaning function

Lemmatiser = nltk.stem.WordNetLemmatizer()
# Instantiating the NLTK Lemmatiser

punctuations = string.punctuation
# Putting punctuation symbols into an object

nlp = spacy.load("en_core_web_sm")
# Import spacy model

stopwords = spacy.lang.en.stop_words.STOP_WORDS
# A list of stopwords that can be filtered out
    # NLTK also has a stop words object but it has fewer words

def text_cleaner(sentence):    
                
    sentence = "".join([char for char in sentence.strip() if char not in punctuations])
    # Getting rid of any punctuation characters
    
    myTokens = re.split(r'\W+', sentence)
    # Tokenising the words
    
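    myTokens = [token.lower() for token in myTokens if token and token.lower() not in stopwords]
    # Lower-casing, then removing stop words and empty strings
    # (the stop word check is made on the lower-cased token so that e.g. "The" is also removed)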
    
    myTokens = [Lemmatiser.lemmatize(token) for token in myTokens]
    # Lemmatising the (already lower-cased) words
    
    return myTokens    

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Creating Vectoriser and Classifier

bow_vector = CountVectorizer(tokenizer = text_cleaner, ngram_range=(1,1))

rfc = RandomForestClassifier(n_jobs=-1)

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-

                        # Evaluating the model

pipe = Pipeline([('vectorizer', bow_vector)
                 ,('classifier', rfc)])

k_fold = KFold(n_splits=5)
# 5 splits

print(cross_val_score(estimator=pipe, X=X, y=Y, cv=k_fold, scoring='accuracy', n_jobs=-1))

winsound.PlaySound("Files/Alarm07.wav", winsound.SND_FILENAME)

[0.69365108 0.69505095 0.69359534 0.68941048 0.69978166]
