# Introduction to NLP: Opinion Mining

## Table of contents <br>
* ### Preface
* ### How to run the program
* ### Import libraries
* ### Pipelines Hyperparameters
* ### Pipeline 1 step 1: Analyse the data and the task
* ### Pipeline 1 step 2: Data preprocessing
* ### Pipeline 1 step 3: Product feature extraction
* ### Pipeline 1 step 4: Sentiment analysis
* ### Pipeline 1 step 5: Evaluation and discussion
* ### Pipeline 2 step 3: Product feature extraction
* ### Pipeline 2 step 4: Sentiment analysis
* ### Pipeline 2 step 5: Evaluation and discussion
* ### References




# Preface

This notebook presents two opinion mining pipelines that ingest raw unstructured text data, extract features and predict their sentiment polarity.
<br>

The pipelines output a series of evaluation metrics covering the quality of the extracted features and the prediction accuracy of the machine learning models, together with a summary table of the most important features and their sentiment counts.
<br> 

The user is invited to run them both to experience the seamless end-to-end process in which all files are collected from their directories and processed through the pipeline, and statistical outputs for each product are presented.
<br>

The data analysed in this notebook consists of Amazon customer reviews across 17 different products.

# How to run the program

To run the program, enter below the path to the directory that contains the `Data` folder. The `Data` sub-directory is already part of the file paths in the `load_data` function in Step 1, so there is no need to include it.<br>
Afterwards, all cells can be run and the outputs of the two pipelines will be shown in sequential order. Enjoy!

In [None]:
your_path_to_data = ''  # Please add the path to the folder containing 'Data' (no trailing slash)

# Import libraries

The main libraries used in this assignment are NLTK and spaCy in the data pre-processing and feature extraction phases (pipeline steps 2 and 3). Scikit-learn is used for sentiment analysis and evaluation (pipeline steps 4 and 5).

In [23]:
import re 
import nltk 
import string 
import warnings 

import Levenshtein 
import numpy as np 
import pandas as pd 
import contractions 
from pprint import pprint

import spacy 
from spacy.scorer import Scorer

from nltk import UnigramTagger 
from nltk.corpus import stopwords 
from nltk.corpus import conll2000 
from wordfreq import get_frequency_dict 
from nltk.stem.snowball import SnowballStemmer 
from nltk.stem.lancaster import LancasterStemmer 
from nltk import word_tokenize, pos_tag, pos_tag_sents, ne_chunk

from sklearn.svm import SVC 
from sklearn import metrics 
from sklearn.model_selection import KFold 
from sklearn.pipeline import make_pipeline 
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import SGDClassifier 
from sklearn.model_selection import GridSearchCV 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import FunctionTransformer 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.model_selection import RepeatedStratifiedKFold 
from sklearn.metrics import confusion_matrix,accuracy_score,make_scorer,classification_report

warnings.filterwarnings('ignore') 
nlp = spacy.load("en_core_web_sm") 
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/davidesecoli/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

# Pipelines Hyperparameters

The pipeline hyperparameters below are set to test model performance and output quality of different components across both pipelines. All tests have been set to be reproducible using a random seed. More details on each of them will be provided later on. Stay tuned!

In [3]:
global RANDOM_SEED, chosen_feature, levenshtein_ratio

RANDOM_SEED = 37

# Word similarity pipe 1
levenshtein_ratio_one = 0.7

# Word similarity pipe 2
levenshtein_ratio_two = 0.6

# List of available features to train models on
chosen_feature = ['Review_without_stopwords','Str_Stemmed_tokens','Extracted_feature',
                  'Extracted_adjectives','Str_PoS_tokens']

# Select index num of chosen_feature above
feature_num = 0

# Pipeline 1 step 1: Analyse the data and the task

As part of the assignment a `Data` folder was provided which contains three sub-folders:
<br>
* Customer_review_data
* CustomerReviews-3_domains
* Reviews-9-products.
<br>

The first two folders contain 5 and 3 txt files respectively of different product reviews, plus a readme.txt file detailing the format used in those files. <br>
The last folder contains 9 txt files of product reviews and no readme file.

Function `load_data` parses the files using pandas `read_fwf` with `delimiter='##'` and `skiprows = 10` for the first two folders, and with `delimiter='##'`, `skiprows = 1` and `encoding="ISO-8859-1"` for the third folder so that its files parse correctly. It then drops a few unnecessary rows and returns all parsed dataframes as a list. <br>

In [4]:
def load_data(your_path_to_data):
    
    # First two folders in Data 
    files_dir = ['/Data/Customer_review_data/Apex AD2600 Progressive-scan DVD player.txt',
                 '/Data/Customer_review_data/Canon G3.txt',
                 '/Data/Customer_review_data/Creative Labs Nomad Jukebox Zen Xtra 40GB.txt',
                 '/Data/Customer_review_data/Nikon coolpix 4300.txt',
                 '/Data/Customer_review_data/Nokia 6610.txt',
                 '/Data/CustomerReviews-3_domains/Computer.txt',
                 '/Data/CustomerReviews-3_domains/Router.txt',
                 '/Data/CustomerReviews-3_domains/Speaker.txt']
    
    # Third folder which requires ISO-8859-1 encoding 
    files_dir_two = ['/Data/Reviews-9-products/Canon PowerShot SD500.txt',
                     '/Data/Reviews-9-products/Canon S100.txt',
                     '/Data/Reviews-9-products/Diaper Champ.txt',
                     '/Data/Reviews-9-products/Hitachi router.txt',
                     '/Data/Reviews-9-products/ipod.txt',
                     '/Data/Reviews-9-products/Linksys Router.txt',
                     '/Data/Reviews-9-products/MicroMP3.txt',
                     '/Data/Reviews-9-products/Nokia 6600.txt',
                     '/Data/Reviews-9-products/norton.txt']

    items = ['apex_ad_2600', 'canon_g3', 'Creative Labs','Nikkon Coolpix 4300',
             'Nokia 6610','Computer','Router','Speaker']
    
    # ISO-8859-1 encoding files
    items_two = ['Canon Powershot SD500','Canon S100','Diaper Champ','Hitachi Router',
                 'Ipod','Linksys Router','Micro MP3','Nokia 6600','Norton']
    
    global tot_items
    tot_items = ['Apex AD 2600','Canon G3','Creative Labs','Nikkon Coolpix 4300',
                 'Nokia 6610','Computer','Router','Speaker','Canon Powershot SD500',
                 'Canon S100','Diaper Champ','Hitachi Router','Ipod','Linksys Router',
                 'Micro MP3','Nokia 6600','Norton']
    
    df_dict = {}
    for idx, file in enumerate(files_dir):      
        df_dict[items[idx]] = pd.read_fwf(your_path_to_data+file,
                                          skiprows = 10,header=None, delimiter='##',
                                          skip_blank_lines=True)
        # Drop meaningless rows
        if items[idx] == 'Creative Labs':
            df_dict[items[idx]] = df_dict[items[idx]].drop([1564,1568,1570,1574,1576,1596,1598,1602,1604,1606,
                                                            1609,1613,1615,1617,1622,1624,1627,1632,1634,1636])
        elif items[idx] == 'Computer':
            df_dict[items[idx]] = df_dict[items[idx]].drop([306,309,313,316,393,391])
        elif items[idx] == 'Router':
            df_dict[items[idx]] = df_dict[items[idx]].drop([789,794,798,810,814,816,820,827,840,850])
    
    # ISO-8859-1 encoding files
    for idx, file in enumerate(files_dir_two):

        df_dict[items_two[idx]] = pd.read_fwf(your_path_to_data+file,
                                          skiprows = 1,header=None, delimiter='##',
                                          encoding="ISO-8859-1", skip_blank_lines=True)
        # Exclude [t] rows and dropna 
        df_dict[items_two[idx]] = df_dict[items_two[idx]][df_dict[items_two[idx]] != '[t]'].dropna()
        
        # Drop meaningless rows
        if items_two[idx] == 'Linksys Router':
            df_dict[items_two[idx]] = df_dict[items_two[idx]].drop([195,217])
        elif items_two[idx] == 'Micro MP3':
            df_dict[items_two[idx]] = df_dict[items_two[idx]].drop([19,37,196,218])
        elif items_two[idx] == 'Nokia 6600':
            df_dict[items_two[idx]] = df_dict[items_two[idx]].drop([19,37,196,218])
    
    return list(df_dict.values())

# Pipeline 1 step 2: Data preprocessing

Function `data_preprocessing` receives a list of dataframes from `load_data` and applies the following pre-processing methods:
<br>

* Removes capital letters
* Removes digits
* Removes extra spaces
* Fixes word contractions
* Removes punctuation
* Splits concatenated words (Viterbi algorithm)
* Applies tokenization
* Removes stop words
* Applies stemming
<br>

The end goal of this second step in the pipeline is to clean the data and have it ready for downstream consumption.
<br>

The first five points above should be self-explanatory and are achieved using conventional Python methods, except for fixing word contractions, which relies on the imported `contractions` library.
<br>
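As a rough illustration of those first cleaning steps, here is a minimal sketch using only the standard library. The helper name `basic_clean` and the sample sentence are made up for this example, and the contraction step is left out because it depends on the third-party `contractions` package.

```python
import string

def basic_clean(text):
    """Illustrative version of the first cleaning steps (contractions omitted)."""
    text = text.lower()                                     # remove capital letters
    text = "".join(ch for ch in text if not ch.isdigit())   # remove digits
    text = " ".join(text.split())                           # collapse extra spaces
    # remove punctuation
    return text.translate(str.maketrans("", "", string.punctuation))

print(basic_clean("The  iPod 2 is GREAT!!"))  # the ipod is great
```

The order mirrors the list above: digits are removed before whitespace is collapsed, so the gap left by a deleted digit does not survive as a double space.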

To split concatenated words the Viterbi algorithm is used. Given an observation sequence, it computes the maximum a posteriori estimate of the most likely hidden sequence, here the split into words, and selects the one with the highest probability [1][2].
<br>
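To make the idea concrete, the sketch below segments a string with a tiny hand-made vocabulary. It is a simplified variant, not the notebook's implementation: it sums log-probabilities instead of applying a length reward, and the `toy_probs` values and the `segment` name are assumptions chosen purely for illustration.

```python
import math

# Hypothetical unigram probabilities, for illustration only
toy_probs = {"battery": 0.004, "life": 0.006, "bat": 0.001, "a": 0.02, "if": 0.01}

def segment(text, probs):
    """Return the most probable split of `text` into known words."""
    max_len = max(map(len, probs))
    # best[i] = (log-probability of the best split of text[:i], start of its last word)
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in probs and best[j][0] > -math.inf:
                candidates.append((best[j][0] + math.log(probs[word]), j))
        best.append(max(candidates) if candidates else (-math.inf, 0))

    if best[-1][0] == -math.inf:
        return [text]  # no valid split: keep the original string

    # Walk the back-pointers from the end of the string to the start
    words, k = [], len(text)
    while k > 0:
        j = best[k][1]
        words.append(text[j:k])
        k = j
    return words[::-1]

print(segment("batterylife", toy_probs))  # ['battery', 'life']
```

Working in log space avoids numerical underflow on long strings, which is one reason it is a common choice for this kind of dynamic programme.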

Tokenization breaks the raw text into individual tokens. These tokens help in interpreting the meaning of the text by analysing the sequence of words, and tokenization is an essential step for the subsequent methods touched upon in the next section.
<br>

Stop words are a set of commonly used words in a language. Examples of English stop words are “a”, “the” and “is”. In the context of Natural Language Processing (NLP) these commonly used words are eliminated because they carry a high noise-to-signal ratio when training models.
<br>

Stemming is a technique used to extract the base form of words by removing affixes from them. It is just like cutting down the branches of a tree to its stems. For example, the words eating and eats both reduce to the stem eat. This is also a method used by search engines to index words.

In [5]:
def data_preprocessing(data):
    
    all_datasets_pre = []
    for review_dataset in data:

        # Keep the original text so review labels can be stripped out in the feature_extraction function
        review_data = review_dataset.copy()

        # Remove capital letters
        data = review_dataset.iloc[:,0].str.lower()

        # Remove digits        
        data = data.apply(lambda x: ''.join([i for i in x if not i.isdigit()]))

        # Remove extra spaces
        data = data.apply(lambda x: " ".join(x.split()))

        # Fix word contractions (i.e: I'd like -> I would like)
        data = data.apply(lambda x: contractions.fix(x))

        # Define punctuation string library
        english_punctuations = string.punctuation

        # Punctuation function 
        def remove_punctuations(text):
            translator = str.maketrans('', '', english_punctuations)
            return text.translate(translator)

        # Remove punctuation 
        data = data.apply(remove_punctuations)

        # Apply Viterbi algorithm to split concatenated words
        def split_concat_words(text):
            new_text = ' '.join(viterbi_algorithm(wordmash) for wordmash in text.split())
            return new_text

        data = data.apply(lambda x: split_concat_words(x))

        # Use DataFrame format
        dataframe = pd.DataFrame(data)

        dataframe['Raw_text'] = review_data

        # Apply tokenization on raw but cleaned data 
        dataframe["PoS_Tokens"] = dataframe.iloc[:,0].apply(word_tokenize)
        dataframe['Str_PoS_tokens'] = dataframe['PoS_Tokens'].apply(lambda x: ' '.join(x))
        
        # Remove stop words
        stop = stopwords.words('english')
        dataframe['Review_without_stopwords'] = dataframe.iloc[:,0].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

        # Apply word tokenize
        dataframe["Tokens"] = dataframe["Review_without_stopwords"].apply(word_tokenize)

        # Apply stemming
        stemmer = SnowballStemmer("english")
        dataframe['Stemmed_tokens'] = dataframe['Tokens'].apply(lambda x: [stemmer.stem(y) for y in x])

        # Transform stemmed tokens back to string format 
        dataframe['Str_Stemmed_tokens'] = dataframe['Stemmed_tokens'].apply(lambda x: ' '.join(x))

        all_datasets_pre.append(dataframe)
        
    return all_datasets_pre


word_prob = get_frequency_dict(lang='en', wordlist='large')
max_word_len = max(map(len, word_prob)) 


def viterbi_algorithm(text):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        new_probs = []
        for j in range(max(0, i - max_word_len), i):
            substring = text[j:i]
            length_reward = np.exp(len(substring))
            freq = word_prob.get(substring, 0) * length_reward
            compounded_prob = probs[j] * freq
            new_probs.append((compounded_prob, j))
        
        # max over tuples compares the first elements first, so it picks the highest compounded probability
        prob_k, k = max(new_probs)
        probs.append(prob_k)
        lasts.append(k)

    # when text is a word that doesn't exist, the algorithm breaks it into individual letters.
    # in that case, return the original word instead
    if len(set(lasts)) == len(text):
        return text
    
    words = []
    k = len(text)
    while 0 < k:
        word = text[lasts[k]:k]
        words.append(word)
        k = lasts[k]
    words.reverse()
    return ' '.join(words)


# Pipeline 1 step 3: Product feature extraction

In the third step of the pipeline features are extracted from the data cleaned in step two.<br>
Function `feature_extraction` gets a list of dataframes from `data_preprocessing` and performs the following operations:<br>
* Extracts Part of Speech (PoS) tags from tokenised words using NLTK
* Applies NLTK Chunker on PoS tags and extracts noun chunks
* Cleans chunks to extract product features
* Extracts adjectives from PoS tags and cleans them
* Extracts product reviews from original text.

PoS tags are labels assigned to each token (word) in a text corpus to indicate the grammatical role of each word in the sentence (adjectives, nouns, etc.). PoS tagging is applied to the tokenised words using a supervised learning algorithm that uses features such as the previous and next word to determine the statistically most likely tag.<br>
On top of the PoS tags a chunker is applied. Chunking is the process of extracting phrases from PoS-tagged text. This pipeline explores, through the `grammar` variable, the use of regex parsing to filter out noun phrases and adjectives. The following pipeline, among other things, explores the quality of noun phrases extracted using the `spaCy` library.<br>
Regex is also used to extract the customer product reviews, which are denoted by a `+` or `-` sign and a `1,2,3` magnitude inside `[]` brackets. Note that the magnitude was not taken into account, as it is more likely to be subjective to the user.
<br>
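The noun-phrase chunking described above can be tried in isolation. The sketch below applies the first grammar rule used in this pipeline to a hand-tagged sentence, so no tagger model has to be downloaded; the sentence and its tags are invented for illustration.

```python
import nltk

# Hand-tagged tokens (illustrative), standing in for the output of pos_tag
tagged = [("the", "DT"), ("long", "JJ"), ("battery", "NN"), ("life", "NN"),
          ("is", "VBZ"), ("great", "JJ")]

# Optional determiner, any number of adjectives, then one or more nouns
parser = nltk.RegexpParser("NP: {<DT>?<JJ.*>*<NN.*>+}")
tree = parser.parse(tagged)

# Collect the words of every NP subtree found by the chunker
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
print(noun_phrases)  # ['the long battery life']
```

The trailing adjective `great` is not chunked because the rule requires at least one noun after any adjectives.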

As a last step, rows containing `NaN` values are dropped to prepare the datasets for vectorisation, after which they are ultimately fed to machine learning models. This is a pivotal point where crucial design and engineering decisions had to be made, since the number of NaN rows (the majority attributed to missing reviews) is considerably large. Because of this, pipeline 2 expands on this approach by exploring a different engineering solution in an attempt to minimise data wastage.
<br>
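The label extraction, including the missing-label case that leads to dropped rows, can be sketched as follows. The regular expression is the one used in `split_review`; the sample lines and the helper name `polarity` are hypothetical, and the sign check is a simplification of the notebook's lookup against `["-1","-2","-3"]`.

```python
import re

LABEL_PATTERN = re.compile(r"\[\s*\+?(-?\d+)\s*\]")

def polarity(raw_line):
    """Return 1 for a positive label, 0 for a negative one, None if no label exists."""
    scores = LABEL_PATTERN.findall(raw_line)
    if not scores:
        return None                      # missing review label, later dropped as NaN
    # Only the first label on the line is considered; its magnitude is ignored
    return 0 if scores[0].startswith("-") else 1

# Invented lines in the assumed "feature[score]##sentence" layout
print(polarity("battery life[+2]##the battery life is great"))  # 1
print(polarity("size[-1]##it is too bulky"))                    # 0
print(polarity("##i bought it last week"))                      # None
```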

Back to the current workflow, the list of dataframes is passed to `split_data_apply_tfidf`, which splits the data into an 80/20 training and testing set, applies the `TfidfVectorizer` `fit_transform` method to the training set on the column selected by the settable pipeline parameter `chosen_feature[feature_num]`, and applies `transform` to the testing set with the same settings. The available columns include the product features and the sentiment-bearing sentences. Several combinations were tried and, in our case, `Review_without_stopwords` yielded the highest overall accuracy as well as recall, precision and F1 score. The pipeline is currently set to use this feature via the variable `feature_num = 0`.<br>
Term frequency-inverse document frequency (TF-IDF) combines two concepts, Term Frequency (TF) and Inverse Document Frequency (IDF). Term frequency is the number of occurrences of a specific term in a document. Inverse document frequency weights a term by reducing its weight when the term occurs in many documents of the corpus. <br>The role of TF-IDF is to transform text into vectors that can be consumed by machine learning models.
To conclude, once the training and testing datasets of all product reviews are vectorised, they are added to the `all_datasets_split` dictionary, concluding the feature extraction step.
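For intuition, the textbook TF-IDF weighting can be computed by hand on a toy corpus. This is a sketch of the plain formula only: scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalisation by default, so its numbers will differ. The documents and the function name `tf_idf` are made up for the example.

```python
import math
from collections import Counter

docs = [
    "battery life is great",
    "battery died quickly",
    "great screen",
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({term: (count / len(tokens)) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

weights = tf_idf(docs)
# "life" appears in one document, "battery" in two, so "life" gets the larger weight
print(weights[0]["life"] > weights[0]["battery"])  # True
```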


In [6]:
def feature_extraction(all_datasets_pre):
    
    
    global pos_eval, chunks_eval, adj_eval, df_global_features
    pos_eval = []
    chunks_eval = []
    adj_eval = []
    df_global_features = []
    
    all_datasets_extract = {}
    
    for idx, review_dataset in enumerate(all_datasets_pre):
        
        dataframe = review_dataset.copy()

        # extract Part of Speech tags from tokenised words
        dataframe['PoS_Full'] = dataframe['PoS_Tokens'].apply(lambda x: pos_tag(x))

        # Define chunk grammar 
        grammar = ["NP: {<DT>?<JJ.*>*<NN.*>+}","JJ: {<VB>*<JJ.*>+}","NP: {<[CDJNP].*>+}",
                   "NP: {<NN.?>+<NN.?>}"]


        # Create chunk parser
        chunkParser = nltk.RegexpParser(grammar[0])

        # Apply NLTK chunk parser
        dataframe['Chunk'] = dataframe['PoS_Full'].apply(lambda x: chunkParser.parse(x))

        def extract_chunks(chunks):
            for a in chunks:
                if isinstance(a, nltk.tree.Tree):
                    if a.label() == "NP":
                        return a

        dataframe['Extracted_Chunks'] = dataframe['Chunk'].apply(lambda x: extract_chunks(x))

        def clean_extracted_chunks(chunks):

            if chunks is None:
                return None

            feature = ""
            extracted_feature = ""
            for num, idx in enumerate(list(chunks)):

                # Include only words longer than 2 letters
                if len(idx[0]) >= 3:            
                    feature += idx[0]
                    feature += " "
            feature = feature[:-1] # Get rid of last space 
            extracted_feature = feature
            return extracted_feature

        dataframe['Extracted_feature'] = dataframe['Extracted_Chunks'].apply(lambda x: clean_extracted_chunks(x))

        # Create adjective chunk parser
        chunkParser = nltk.RegexpParser(grammar[1])

        def extract_chunks_jj(chunks):
            for a in chunks:
                if isinstance(a, nltk.tree.Tree):
                    if a.label() == "JJ":
                        return a

        dataframe['Adjectives'] = dataframe['PoS_Full'].apply(lambda x: (extract_chunks_jj(chunkParser.parse(x))))
        dataframe['Extracted_adjectives'] = dataframe['PoS_Full'].apply(lambda x: clean_extracted_chunks(extract_chunks_jj(chunkParser.parse(x))))
        
        # Store values for each dataset for evaluation metrics
        pos_eval.append(dataframe['PoS_Full'])
        chunks_eval.append(dataframe['Chunk'])
        adj_eval.append(dataframe['Adjectives'])

        # Store negative reviews
        negative_reviews = ["-1","-2","-3"]

        # Extract reviews
        def split_review(raw_review):

            # Regex to get [x] from text
            review = re.findall(r"\[\s*\+?(-?\d+)\s*\]", raw_review)

            if len(review) == 0: # Missing review
                return None
            elif review[0] in negative_reviews:
                return 0
            return 1

        # Apply review extraction
        dataframe['Review'] = dataframe['Raw_text'].apply(lambda x: split_review(x))

        # Drop rows with None/NaN values
        df_global = dataframe.dropna() 
        
        # Append dataframes to dict of lists
        all_datasets_extract[idx] = []
        all_datasets_extract[idx].append(df_global)
    
    return all_datasets_extract


def split_data_apply_tfidf(all_datasets_extract):

    # Use also to vectorize nan reviews
    global vectorizer_list
    vectorizer_list = []
    all_datasets_split = {}

    for key, review_dataset in all_datasets_extract.items():

        df_global = pd.DataFrame(review_dataset[0])       
            
        # Dataset
        X = df_global.iloc[:,:-1]

        # Labels
        y = df_global.iloc[:,-1:]

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
        
        # Apply tfidf vectorizer
        vectorizer = TfidfVectorizer(stop_words='english',ngram_range=(1, 2))

        def vectorized_reviews(vectorizer, X_train, X_test):
            X_train = vectorizer.fit_transform(X_train[chosen_feature[feature_num]])
            
            # Append fit transform state for nan review
            vectorizer_list.append(vectorizer)
            
            X_test = vectorizer.transform(X_test[chosen_feature[feature_num]])

            return X_train, X_test

        X_train, X_test = vectorized_reviews(vectorizer, X_train, X_test)
    
        all_datasets_split[key] = []
        all_datasets_split[key].extend([X_train, X_test, y_train, y_test, df_global])
    
    return all_datasets_split 

# Pipeline 1 step 4: Sentiment analysis

Step four is the modelling phase. This is where all the hard work performed in steps one through three (data parsing, data pre-processing and feature extraction) comes to fruition.
<br>

This section of the pipeline explores the performance of a baseline Random Forest Classifier (RFC) and compares it against a hyperparameter optimised version of itself. Random forest was chosen because of its good performance with high dimensional datasets thanks to the splitting of data into subsets. Since `TfidfVectorizer` was used, which greatly increases the dimensionality of the data, Random Forest seemed to be the right choice to start with. Pipeline two will expand on this by applying several different machine learning models and comparing their performance.
<br>

Back to RFC: this classification algorithm is formed by many decision trees. It uses bagging and feature randomness when building each individual tree in an attempt to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. Bagging is what makes Random Forests strong classifiers. At its core, it is an ensemble technique that fits multiple models on different subsets of the training dataset and then combines their predictions into a more accurate and stable forecast.
<br>
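The two ingredients of bagging mentioned above, sampling with replacement and voting by committee, can be sketched in a few lines. This is only a conceptual illustration with hypothetical helper names and made-up votes; it does not train any trees.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the training subset for one tree)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Combine per-tree class predictions into the committee prediction."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(37)
rows = list(range(10))
sample = bootstrap_sample(rows, rng)   # some rows repeat, others are left out
tree_votes = [1, 0, 1, 1, 0]           # hypothetical predictions from five trees
print(majority_vote(tree_votes))       # 1
```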

In terms of implementation, function `models_training` reads in a dictionary of lists from `split_data_apply_tfidf`, each containing the vectorised training and test datasets of one product's reviews.
An RFC baseline model is fitted on each product's training set and the predictions for each product are stored for downstream comparison.
Next, a new RFC model is trained using GridSearchCV, a technique that searches through a predefined parameter space to find the optimal values. It can be thought of as a cross-validation method: each parameter combination is trained and scored on different folds, the best parameter set is selected, and predictions are made using this set.
<br>

Below are the parameters used [3]: 

* `bootstrap`: Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree
* `n_estimators`: Number of trees in the forest
* `criterion`: Function to measure the quality of a split.<br>
Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.<br>
Note: this parameter is tree-specific.
* `min_samples_leaf`: The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.<br>
If int, then consider min_samples_leaf as the minimum number.<br>
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
* `max_features`: Number of features to consider when looking for the best split:<br>
If “sqrt”, then max_features=sqrt(n_features).<br>
If “log2”, then max_features=log2(n_features).



<br>
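To get a feel for the cost of the grid search over the parameters listed above, the sketch below enumerates the grid used in `models_training` and counts the model fits needed with 3-fold cross-validation. It only counts combinations and does not train anything.

```python
from itertools import product

grid_params = {
    "bootstrap": [True, False],
    "n_estimators": list(range(100, 600, 100)),
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": list(range(1, 5, 2)),
    "max_features": ["sqrt", "log2"],
}

# Every combination of one value per parameter
combinations = list(product(*grid_params.values()))
cv_folds = 3
n_fits = len(combinations) * cv_folds
print(len(combinations), n_fits)  # 80 240
```

This count applies to each product dataset, and `GridSearchCV` refits the best combination once more on the full training set by default. Switching to the commented-out `range(100,2000,200)` doubles the number of `n_estimators` values from 5 to 10, and with it the number of fits.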

In [19]:

def models_training(all_datasets_split):
    
    
    global model, grid_search, grid_params
    model = []
    grid_search = []
    
    all_datasets_training = {}
    for key, review_dataset in all_datasets_split.items():
        
        X_train = review_dataset[0]
        X_test = review_dataset[1]
        y_train = review_dataset[2]
        y_test = review_dataset[3]
        df_global = review_dataset[4]

        model.append(RandomForestClassifier())
                
        # Train baseline model for comparison
        baseline_model = model[key].fit(X_train, y_train)
        baseline_pred = baseline_model.predict(X_test)
        
        # Set parameter grid
        bootstrap_v = [True, False]
        n_estimators_v = list(range(100,600,100)) 
        # n_estimators_v = list(range(100,2000,200))
        criterion = ['gini', 'entropy']
        min_sample_leaf_v = list(range(1,5,2))
        max_features_v = ['sqrt', 'log2']

        grid_params = {
            'bootstrap' : bootstrap_v,
            'n_estimators' : n_estimators_v,
            'criterion' : criterion,
            'min_samples_leaf' : min_sample_leaf_v,
            'max_features' : max_features_v
        }

        # Search best set of params
        grid_search.append(GridSearchCV(estimator=model[key], param_grid=grid_params, cv=3, verbose=1))

        # Fit model using best params
        grid_search[key].fit(X_train, y_train)
        
        # Predict
        predictions = grid_search[key].predict(X_test)
        
        all_datasets_training[key] = []
        all_datasets_training[key].extend([y_test, baseline_pred, predictions, df_global])

    return all_datasets_training

# Pipeline 1 step 5: Evaluation and discussion

In this final step of the pipeline, performance metrics are output for the following:
<br>
* PoS tags: Breakdown of Precision, Recall, F-1 and Accuracy score across all tags (e.g. DT, IN, JJ, NN)
* Adjective phrases: Breakdown of Precision, Recall, F-1 score and Accuracy across adjective tags
* Chunked noun sentences: IOB, Precision, Recall, F-measure and Accuracy
* Baseline RFC model: Precision, Recall, F-measure and Accuracy
* Hyperparameter optimised RFC: Precision, Recall, F-measure and Accuracy
<br>

`Precision` quantifies the number of positive class predictions that actually belong to the positive class. It is calculated as `TruePositives / (TruePositives + FalsePositives)`
<br>

`Recall` quantifies the number of positive class predictions made out of all positive examples in the dataset. It is calculated as `TruePositives / (TruePositives + FalseNegatives)`
<br>

`F-Measure` provides a single score that balances both the concerns of precision and recall in one number. It is calculated as `(2 * Precision * Recall) / (Precision + Recall)`
<br>
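The three formulas above can be checked on a small example. The labels below are made up, and the helper name `precision_recall_f1` is hypothetical; in the notebook these figures come from scikit-learn's `classification_report`.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F-measure for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
# Two true positives, one false positive and one false negative
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```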

To calculate these stats a `UnigramTagger`, which is a tagger that uses a single word as its context for determining PoS tags, is trained on the `conll2000.tagged_sents` train.txt corpus, which contains 270k words of Wall Street Journal text.<br>
The tags produced by this trained tagger serve as the 'gold standard' against which the PoS tags, adjectives and chunked noun sentences from step three are compared.
<br>
The results across the board are quite encouraging, with Precision showing slightly higher figures (about 85%) than Recall on PoS tags and adjective phrases, indicating that most of the returned results are relevant. <br>
Recall is higher on chunked sentences (about 85%) than Precision, which averages about 60%. This is a considerable difference driven by many false positives and, because of this, unlike for the PoS tags and adjectives, the F-measure also takes a performance hit, settling around 70% on average.
<br>
<br>
Moving on to model performance, it is worth noting the relatively high score of the baseline Random Forest Classifier, the model trained without any hyperparameter optimisation. It does, however, show considerable precision variance: on some datasets scores are in the 60% range versus 80% or more on the others. <br>
The hyperparameter optimised models do perform better but require more computational resources to achieve their full potential. In the interest of keeping the running time manageable, `n_estimators_v = list(range(100,2000,200))` was left commented out. Enabling it considerably increases the search space and gives the grid search a better chance of finding a stronger parameter set.

In [8]:
# Code borrowed from NLTK [4]
class UnigramChunker(nltk.ChunkParserI):
        def __init__(self, train_sents):
            train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                          for sent in train_sents]
            self.tagger = nltk.UnigramTagger(train_data)

        def parse(self, sentence):
            pos_tags = [pos for (word,pos) in sentence]
            tagged_pos_tags = self.tagger.tag(pos_tags)
            chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
            conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                         in zip(sentence, chunktags)]
            return nltk.chunk.conlltags2tree(conlltags)

In [9]:
def evaluate(all_datasets_training):
    
    all_datasets_eval = all_datasets_training.copy()

    for idx, (key, review_dataset) in enumerate(all_datasets_training.items()):
        
        # Unpack dict
        y_test = review_dataset[0]
        baseline_pred = review_dataset[1]
        predictions = review_dataset[2]
        df_global = review_dataset[3]
        
        train_sentences = conll2000.tagged_sents('train.txt')

        test_sentences = {'Part of Speech tags': pos_eval[key].dropna().to_list(),
                          'Adjectives phrases': adj_eval[key].dropna().to_list()}

        # Train the tagger
        unigram_tagger = UnigramTagger(train_sentences)
        
        print('===='*27)
        print(' '*33,f'Evaluation metrics for {tot_items[idx]} dataset')
        print('===='*27)

        # Use a separate loop variable so the outer dataset key is not overwritten
        for label, test_sentence in test_sentences.items():

            print(f'\nEvaluation metrics for {tot_items[idx]}: {label}')
            print('='*52)
            tagged_test_sentences = unigram_tagger.tag_sents([[token for token,tag in sent] for sent in test_sentence])
            gold = [str(tag) for sentence in test_sentence for token,tag in sentence]
            pred = [str(tag) for sentence in tagged_test_sentences for token,tag in sentence]

            print(metrics.classification_report(gold, pred),'\n\n')

        print(f'Evaluation metrics for {tot_items[idx]}: Chunked sentences')
        print('='*53)
        train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
        test_sentences = chunks_eval[idx].dropna()

        unigram_chunker = UnigramChunker(train_sents)
        print(unigram_chunker.evaluate(test_sentences))
        print('='*53)
        print(f'\n\nBest Parameters for {tot_items[idx]} classsifier')
        print('='*53)
        print(grid_search[idx].best_params_)
        print('='*53)

        #pprint(grid_search[idx].get_params())
        report = classification_report(y_test, predictions)
        baseline_report = classification_report(y_test, baseline_pred)

        score = accuracy_score(y_true=y_test, y_pred=predictions)
        baseline_score = accuracy_score(y_true=y_test, y_pred=baseline_pred)
        
        print(f'\n\nReport of the baseline Random Forest Classifier model')
        print('=='*27)
        print(baseline_report)
        print('=='*27)
        print("{} {:0.2f}%".format("Accuracy Score: ", baseline_score*100))
        print('=='*27)
        print(f'\n\nReport of the Optimised Random Forest Classifier model')
        print('=='*27)
        print(report)
        print('=='*27)
        print("{} {:0.2f}%".format("Accuracy Score: ", score*100))
        print('=='*27)
        print('\n\n\n')
    
    return all_datasets_eval


To complete the evaluation step, the function `review_similarity_count` outputs, for each dataset, a summary table of the top features and their review polarity counts. <br> 

As a first step the function reads in the reviews and transforms them from binary `1`, `0` (used in training) to `Positive` and `Negative` strings. <br>
Then all extracted features, which are derived from noun phrases, are compared against each other using Levenshtein distance. Levenshtein distance is a metric that measures the difference between two strings: it is the minimum number of single-character edits required to change one string into the other [5], and can be thought of as a distance between two words. Various tests were carried out to find the right threshold for merging two words together. In this pipeline the best balance was found at a similarity ratio of 0.7, which is also set as a tunable pipeline hyperparameter under the `levenshtein_ratio_one` variable. 
<br>
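
To make the idea concrete, the following is a minimal sketch of the edit distance computed with dynamic programming. The `similarity` helper is a hypothetical normalisation added for illustration only; the `Levenshtein.ratio` function used in the pipeline is normalised differently, so its values will not match this helper exactly.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalised score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))     # 3
print(levenshtein_distance("battery", "batteries"))  # 3
print(round(similarity("battery", "batteries"), 2))
```

With a threshold such as 0.7, near-duplicates that differ by one or two characters would be merged, while a pair like the second example above sits close to the boundary, which is why the threshold is worth tuning.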

The main constraint encountered was the scarcity of features, mainly because a large percentage of the data in each dataset was dropped due to missing reviews. The engineering solution of pipeline two was designed to overcome this shortage.
<br>
<br>

This brings us to the end of this pipeline! In the notebook cell below the `review_similarity_count` function, evaluation summaries split by product review can be found, followed by summary tables of the top features for each product review. Enjoy!


In [10]:

def review_similarity_count(all_datasets_eval):
    
    for idx, (key, review_dataset) in enumerate(all_datasets_eval.items()):
                
        # Unpack df_global
        review_count = review_dataset[3].copy().dropna()
        
        # Convert bool review to string 
        review_count['Review'] = review_count['Review'].apply(lambda x: 'Positive' if x == 1 else 'Negative')

        print('===='*27)
        print(' '*33,f'Top Features extracted from {tot_items[idx]} dataset')
        print('===='*27,'\n')
        
        # Column position, used to write merged names back without chained assignment
        feature_col = review_count.columns.get_loc('Extracted_feature')

        # Compare every pair of extracted features
        for idx_a, word1 in enumerate(review_count['Extracted_feature']):
            for idx_b, word2 in enumerate(review_count['Extracted_feature']):
                # Assess word similarity
                ratio = Levenshtein.ratio(word1, word2)
                if ratio > levenshtein_ratio_one:
                    # Keep the shorter of the two names
                    if len(word1) < len(word2):
                        review_count.iloc[idx_b, feature_col] = word1.capitalize()
                    else:
                        review_count.iloc[idx_a, feature_col] = word2.capitalize()

        # Group product feature review and count 
        review_count = review_count.groupby(['Extracted_feature', 'Review']).agg({'Review': ['count']})
        review_count = review_count.sort_values(('Review', 'count'), 
                                                ascending=False).groupby(level=1).head(5).sort_index().reset_index()
        cols =[('Extracted_feature',''),
               ('Review',''),
               ('Review', 'count')]

        review_count = review_count.groupby(['Extracted_feature',('Review','')])[cols].sum()
        review_count.index.names = ['Extracted Features','Reviews']
        review_count.columns = review_count.columns.set_levels(['Count'], level=1)

        print(review_count,'\n\n')

In [20]:
pipeline_one = review_similarity_count(evaluate(
         models_training(
         split_data_apply_tfidf(
         feature_extraction(
         data_preprocessing(
         load_data(your_path_to_data)))))))

Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   41.0s finished


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   35.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:  1.5min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   29.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   36.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   35.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   36.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   37.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   28.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   30.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   30.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   31.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   31.4s finished


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   31.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   59.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   45.6s finished


Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   32.5s finished


                                  Evaluation metrics for Apex AD 2600 dataset

Evaluation metrics for Apex AD 2600: Part of Speech tags
              precision    recall  f1-score   support

          CC       1.00      1.00      1.00       423
          CD       0.88      1.00      0.93        91
          DT       0.99      0.98      0.99      1172
          EX       0.69      1.00      0.82        20
          FW       0.20      0.50      0.29         2
          IN       0.91      0.97      0.94      1133
          JJ       0.89      0.58      0.71       986
         JJR       0.70      0.95      0.81        44
         JJS       1.00      0.58      0.73        31
          MD       1.00      1.00      1.00       212
          NN       0.90      0.53      0.67      2938
         NNP       0.00      0.00      0.00         7
         NNS       0.94      0.63      0.76       502
        None       0.00      0.00      0.00         0
         PDT       0.00      0.00      0.00         8

# Pipeline 2 

The first two steps of pipeline two, data parsing and pre-processing, are shared with pipeline one, because the data ingestion and cleaning phases are carried out in the same way. <br>
The two pipelines start branching off from step three: feature extraction. The following is a holistic view of pipeline two's main features and design characteristics: 
<br><br>
* Spacy `noun_chunks` is used to extract noun phrases. This method was chosen because our research suggested that it tends to return richer, higher-quality results, in an attempt to obtain a larger noun sample than with NLTK
* Dataframes with missing reviews are kept to extract valuable product features
* On each dataset five machine learning models are trained, both as a standard baseline and through an optimised hyperparameter search. The five models are:<br> `Naive Bayes Classifier`, `Support Vector Machine Classifier`, `Random Forest Classifier`, `Logistic Regression` and `Stochastic Gradient Descent Classifier`<br> This results in five + five (baseline + optimised) models being trained on each of the seventeen datasets, for a total of one hundred and seventy models
* The best out-of-sample performing model on each dataset is used to predict the sentiment of sentences with missing reviews
* Extracted features with reviews are combined with features that had no reviews and now have predicted reviews. Feature evaluation summaries with review counts split by product are printed out 

# Pipeline 2 step 3: Product feature extraction

The feature extraction step of pipeline two uses the Spacy `noun_chunks` method to extract noun chunks. These are “base noun phrases” that have a noun as their head and can be thought of as a noun plus the words describing it. 
<br>

As highlighted in the section above, the reason for using this method is to collect a semantically rich noun set and so present a denser feature review output table. 
<br>

The other key difference in this step is that copies of the dataframes, taken before NaN rows get dropped, are stored under the `dataframe_full` variable to be consumed by downstream functions. 
<br>

The remaining procedures (stripping out reviews, splitting datasets for testing and training, and vectorisation) are carried out in the same way as in pipeline 1. Please refer to that section for a more detailed explanation. 
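
As a quick illustration of the review-stripping step, the sketch below applies the same regular expression used by `split_review` to a few lines. The helper name `label_from_raw` and the sample strings are hypothetical and were written only to exercise the three possible outcomes; they are not taken from the datasets.

```python
import re

# Same pattern as in `split_review`: captures the number inside [x], [+x] or [-x]
RATING_PATTERN = re.compile(r"\[\s*\+?(-?\d+)\s*\]")
NEGATIVE_RATINGS = {"-1", "-2", "-3"}


def label_from_raw(raw_review):
    """Return 1 (positive), 0 (negative) or None when no rating is present."""
    ratings = RATING_PATTERN.findall(raw_review)
    if not ratings:
        return None
    # Only the first rating found in the line decides the label
    return 0 if ratings[0] in NEGATIVE_RATINGS else 1


# Hypothetical lines, one per branch
print(label_from_raw("picture quality[+2]##the picture quality is great"))  # 1
print(label_from_raw("battery[-1]##the battery drains quickly"))            # 0
print(label_from_raw("##i bought it last week"))                            # None
```

Rows that come back as `None` are the ones dropped before training and later kept in `dataframe_full` so their sentiment can be predicted.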

In [12]:
def feature_extraction_two(all_datasets_pre):    
        
    all_datasets_extract = {}
    
    for idx, review_dataset in enumerate(all_datasets_pre):
        
        dataframe = review_dataset.copy()

        # Apply Spacy chunk parser 
        dataframe['Spacy_Chunk'] = dataframe['Str_PoS_tokens'].apply(lambda x: [chunk for chunk in nlp(x).noun_chunks])    

        # Store negative reviews
        negative_reviews = ["-1","-2","-3"]

        # Extract reviews
        def split_review(raw_review):

            # Regex to get [x] from text
            review = re.findall(r"\[\s*\+?(-?\d+)\s*\]", raw_review)

            if len(review) == 0: # Missing review
                return None
            elif review[0] in negative_reviews:
                return 0
            return 1

        # Apply review extraction
        dataframe['Review'] = dataframe['Raw_text'].apply(lambda x: split_review(x))
        
        # NOTE: The difference with pipeline 1 at this stage is that a copy of dataframe 
        # with NaN rows (mostly missing reviews) is kept to extract valuable features
        # and forecast their opinion tendency in subsequent steps 

        # Keep copy of dataframe for predicting missing reviews and retain more features
        dataframe_full = dataframe.copy()
        
        # Drop nan values for training
        dataframe = dataframe.dropna() 
        
        # Append dataframes to dict of lists
        all_datasets_extract[idx] = []
        all_datasets_extract[idx].extend([dataframe,dataframe_full])
    
    return all_datasets_extract


def split_data_apply_tfidf_two(all_datasets_extract):
         
    # Use also to vectorize nan reviews
    global vectorizer_list
    vectorizer_list = []
    all_datasets_split = {}
    
    for key, review_dataset in all_datasets_extract.items():
        
        # Unpack 
        dataframe = review_dataset[0]
        dataframe_full = review_dataset[1]
        
        # To pandas dataframe
        dataframe = pd.DataFrame(dataframe)       
        
        # Dataset
        X = dataframe.iloc[:,:-1]

        # Labels
        y = dataframe.iloc[:,-1:]

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
        
        # Apply tfidf vectorizer
        vectorizer = TfidfVectorizer(stop_words='english',ngram_range=(1, 2))

        
        def vectorized_reviews(vectorizer, X_train, X_test):
            X_train = vectorizer.fit_transform(X_train[chosen_feature[feature_num]])
            
            # Append fit transform state for nan review
            vectorizer_list.append(vectorizer)
            
            X_test = vectorizer.transform(X_test[chosen_feature[feature_num]])

            return X_train, X_test

        X_train, X_test = vectorized_reviews(vectorizer, X_train, X_test)
    
        all_datasets_split[key] = []
        all_datasets_split[key].extend([X_train, X_test, y_train, y_test, dataframe_full])
    
    return all_datasets_split 

# Pipeline 2 step 4: Sentiment analysis

The function `models_training_two`, as the name suggests, trains various machine learning models to compare their performance across datasets.
<br>

The following classifiers are trained: `Naive Bayes Classifier`, `Support Vector Machine Classifier`, `Random Forest Classifier`, `Logistic Regression` and `Stochastic Gradient Descent Classifier`. 
<br>

I've chosen these models because they are known to be good performers and because I was interested in expanding my knowledge in this area. LSTM + GloVe, as well as other neural networks, could have been another interesting approach, but I've researched and applied them in previous assignments.
<br>

For each of the above models, a base version is fitted to get a minimum threshold measure of performance. From there a hyperparametrised version is trained using sklearn `GridSearchCV` for performance improvements. 
<br>

Note: due to the number of models trained (170, counting baseline and hyperparametrised versions), the search space has been constrained to a reasonable configuration. Commented-out parameter settings can be found on each grid for a performance boost. This choice was made to let the user see the output in a reasonable time. Using the current settings, the whole pipeline takes about 9 minutes to complete on an 8GB MacBook Pro with an Apple M1 chip.
<br>
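
To give a feel for the size of the constrained search, the sketch below counts the candidate combinations in each grid. The grids and the cross-validation settings are copied from `models_training_two` and `get_best_model` in this notebook (active, uncommented values only); the rest is plain arithmetic and does not train anything.

```python
from math import prod

# Grids copied from `models_training_two` (uncommented settings only)
param_grids = {
    "MultinomialNB": {"alpha": [0.8, 1.0, 1.2, 1.4], "fit_prior": [True, False]},
    "SVC": {"C": [0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01, 0.001],
            "kernel": ["rbf", "linear"]},
    "RandomForestClassifier": {"bootstrap": [True, False],
                               "n_estimators": list(range(100, 200, 100)),
                               "criterion": ["gini", "entropy"],
                               "min_samples_leaf": list(range(1, 5, 2)),
                               "max_features": ["sqrt", "log2"]},
    "LogisticRegression": {"penalty": ["l2"], "C": [0.01, 0.1, 1, 10],
                           "solver": ["newton-cg", "lbfgs", "liblinear"]},
    "SGDClassifier": {"loss": ["hinge", "log"], "penalty": ["l2", "elasticnet"],
                      "alpha": [0.001, 0.01, 0.1]},
}

# RepeatedStratifiedKFold(n_splits=10, n_repeats=3) gives 10 * 3 fits per candidate
fits_per_candidate = 10 * 3

# A grid search tries every combination, so the count is the product of list lengths
candidates = {name: prod(len(values) for values in grid.values())
              for name, grid in param_grids.items()}
total_candidates = sum(candidates.values())

for name, count in candidates.items():
    print(f"{name}: {count} candidates, {count * fits_per_candidate} fits")
print(f"Per dataset: {total_candidates} candidates, "
      f"{total_candidates * fits_per_candidate} cross-validation fits")
```

Because the count is a product, uncommenting even one longer list (for example the wider `n_estimators` range) multiplies the number of fits for that model, which is the trade-off behind the constrained defaults.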

The following section goes through the inner workings of each model. 
<br>
<br>
`MultinomialNB`:
<br>

Multinomial Naive Bayes is a specialized version of Naive Bayes designed to handle text documents, using word counts as its underlying method of calculating probability. Below is the Bayes equation: 

\begin{equation*}
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
\end{equation*}

This is a probabilistic equation that infers a probability from previous data where: 

\begin{equation*}
P(A|B)
\end{equation*}
<br>
Is the probability of `A` given `B`, also called Posterior probability.
<br>

\begin{equation*}
P(B|A)
\end{equation*}
<br>

Is the probability of `B` given `A`, also called likelihood or Conditional probability.
<br>

\begin{equation*}
P(A)
\end{equation*}
<br>

Is the probability of `A`, called Prior or Class probability.
<br>

\begin{equation*}
P(B)
\end{equation*}
<br>

Is the probability of `B` or Evidence.
<br>
<br>

The `MultinomialNB` hyperparameter search space has the following parameters [6]:
<br>
 
* `alpha`: float, default=1.0
Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

* `fit_prior`: bool, default=True
Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
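
The sketch below shows, on a hypothetical toy corpus, how the Bayes equation and the two hyperparameters fit together: `alpha` is added to every word count before normalising, and `fit_prior` decides whether class priors come from the data or are uniform. It is a simplified from-scratch illustration using raw word counts, not the sklearn implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, label) with 1 = positive, 0 = negative
train = [
    (["great", "picture", "great", "sound"], 1),
    (["good", "battery"], 1),
    (["poor", "battery", "poor", "sound"], 0),
]
alpha = 1.0       # additive smoothing, mirrors the `alpha` hyperparameter
fit_prior = True  # False would use a uniform prior over the classes

vocabulary = sorted({token for tokens, _ in train for token in tokens})
classes = sorted({label for _, label in train})
word_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for tokens, label in train:
    word_counts[label].update(tokens)
    doc_counts[label] += 1


def log_prior(c):
    """log P(A): learned from class frequencies, or uniform."""
    if fit_prior:
        return math.log(doc_counts[c] / len(train))
    return math.log(1 / len(classes))


def log_likelihood(token, c):
    """log P(token | class) with additive smoothing over the vocabulary."""
    total = sum(word_counts[c].values())
    return math.log((word_counts[c][token] + alpha) / (total + alpha * len(vocabulary)))


def predict(tokens):
    # P(B) is identical for every class, so comparing log P(A) + log P(B|A) is enough
    scores = {c: log_prior(c) + sum(log_likelihood(t, c) for t in tokens if t in vocabulary)
              for c in classes}
    return max(scores, key=scores.get)


print(predict(["great", "battery"]))  # 1
print(predict(["poor", "sound"]))     # 0
```

Without smoothing (`alpha = 0`), a word never seen in a class would give a zero probability and the logarithm would fail, which is the practical reason the parameter exists.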

<br>
<br>

`Support Vector Machine Classifier`:
<br>

The objective of support vector machine is to find a hyperplane in an N-dimensional space that distinctly classifies data points. Several different hyperplanes could be chosen to separate the two classes of data point. The objective is to find a plane that maximises the distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
<br>

Also, thanks to the so called ‘kernel trick’,  SVMs can efficiently perform a non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces.

These are the parameters used in the optimisation phase [7]:
<br>

* `C`: float, default=1.0
Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared `l2` penalty.
* `kernel`: {`linear`, `poly`, `rbf`, `sigmoid`, `precomputed`}, default=`rbf`
Specifies the kernel type to be used in the algorithm. It must be one of `linear`, `poly`, `rbf`, `sigmoid`, `precomputed` or a callable. If none is given, `rbf` will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
* `gamma`: {`scale`, `auto`} or float, default=`scale`
Kernel coefficient for `rbf`, `poly` and `sigmoid`.
If gamma=`scale` (default) is passed, it uses `1 / (n_features * X.var())` as the value of gamma;
if `auto`, it uses `1 / n_features`
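
As a small worked example of the `gamma` options, the sketch below computes both values for a hypothetical feature matrix and plugs them into the standard RBF kernel formula. The matrix `X` is invented for illustration, and the variance is the population variance over all entries, which is what `X.var()` returns for a NumPy array.

```python
import math

# Hypothetical training matrix: 3 samples, 2 features
X = [[0.0, 1.0],
     [1.0, 0.0],
     [1.0, 1.0]]

n_features = len(X[0])
values = [v for row in X for v in row]
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)  # population variance

gamma_scale = 1 / (n_features * variance)  # gamma='scale'
gamma_auto = 1 / n_features                # gamma='auto'


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


print(gamma_scale, gamma_auto)
print(rbf_kernel(X[0], X[1], gamma_scale))  # distinct points: between 0 and 1
print(rbf_kernel(X[0], X[0], gamma_scale))  # identical points: 1.0
```

A larger `gamma` makes the kernel value fall off faster with distance, so each training point influences a smaller neighbourhood.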

<br>
<br>

`Logistic Regression`:
<br>

Contrary to what the name might suggest, logistic regression is a classification method used to predict binary outcomes. It is used when the dependent variable (target) is categorical.
<br>
Since the outcome is a probability, the dependent variable is bounded between 0 and 1, and therefore a logit transformation is applied to the odds. In practice this means that the probability of success is divided by the probability of failure to obtain the odds, whose natural logarithm is commonly known as the log odds. The following are the logistic regression formulas:

 
\begin{equation*}
\pi = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)\right)}
\end{equation*}

<br>

\begin{equation*}
\operatorname{logit}(\pi) = \ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\end{equation*}

<br>

Where `pi` is the predicted probability of the positive class, `logit(pi)` is the dependent variable and the `x` terms are the independent variables.
<br>

Maximum likelihood estimation (MLE) is used to find the beta parameters, iterating to optimise for the best fit of the log odds. This produces the log likelihood function, which logistic regression seeks to maximise to find the best parameter estimates.
<br>
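
The relationship between probabilities, log odds and the log likelihood can be sketched in a few lines. The code below uses the conventional definitions (the sigmoid maps a linear score to a probability and the logit is its inverse); the data points and the candidate beta values are hypothetical and chosen only to show that a better fit yields a higher log likelihood.

```python
import math


def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))


def logit(p):
    """Log odds of a probability p; the inverse of the sigmoid."""
    return math.log(p / (1 - p))


def log_likelihood(betas, rows, labels):
    """Sum of log P(y | x) under the model; betas[0] is the intercept."""
    total = 0.0
    for features, y in zip(rows, labels):
        z = betas[0] + sum(b * x for b, x in zip(betas[1:], features))
        p = sigmoid(z)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


# Hypothetical one-feature data: larger x tends to go with label 1
rows = [[-2.0], [-1.0], [1.0], [2.0]]
labels = [0, 0, 1, 1]

print(round(logit(sigmoid(0.8)), 6))             # recovers the linear score 0.8
print(log_likelihood([0.0, 0.0], rows, labels))  # uninformative betas
print(log_likelihood([0.0, 1.0], rows, labels))  # better fit, higher value
```

The solvers listed in this section differ only in how they search for the betas that maximise this quantity.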

The following are the solver parameters used [8]:
<br>

* `newton-cg`: Uses a Hessian matrix. It's slow for large datasets, because it computes the second derivatives.

* `lbfgs`: Approximates the second derivative matrix updates with gradient evaluations. It stores only the last few updates therefore saving memory

* `liblinear`: Uses a coordinate descent algorithm which tries to minimise a multivariate function by solving univariate optimization problems in a loop

* `sag`: Stochastic Average Gradient descent uses a random sample of previous gradient values and works well on  large datasets

* `saga`: Extension of sag that also allows for L1 regularization. Should generally train faster than sag

<br>
<br>

`Stochastic Gradient Descent`:

<br>

The Stochastic Gradient Descent (SGD) Classifier is a linear classifier (SVM, logistic regression, among others) optimised with SGD. Gradient descent is used to minimise a cost function by repeatedly stepping towards its minimum.
<br>

SGD computes the gradient using a single sample at a time and also allows for minibatch learning, so it works well on large-scale problems. Another advantage is that the cost function in logistic regression has no closed-form minimiser, so it is minimised iteratively via Stochastic Gradient Descent, also known as Online Gradient Descent. The goal is to descend along the cost function towards its minimum with each new training observation.
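
The per-sample update can be illustrated with a minimal from-scratch sketch for the logistic loss. The data, learning rate and number of epochs below are hypothetical choices for illustration; sklearn's `SGDClassifier` uses its own learning-rate schedule and stopping rules, so this is not a reproduction of it.

```python
import math
import random

# Hypothetical one-feature data set: label 1 when x is positive
data = [(-2.0, 0), (-1.5, 0), (-0.5, 0), (0.5, 1), (1.5, 1), (2.0, 1)]


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def mean_log_loss(weight, bias):
    """Average logistic loss over the whole data set."""
    return -sum(y * math.log(sigmoid(weight * x + bias))
                + (1 - y) * math.log(1 - sigmoid(weight * x + bias))
                for x, y in data) / len(data)


rng = random.Random(1)   # fixed seed so the run is repeatable
weight, bias = 0.0, 0.0
learning_rate = 0.1      # assumed constant step size
alpha = 0.0001           # L2 regularisation strength, mirrors the `alpha` parameter

loss_before = mean_log_loss(weight, bias)
for epoch in range(50):
    samples = data[:]
    rng.shuffle(samples)
    for x, y in samples:
        # Gradient of the log loss for this single sample only
        error = sigmoid(weight * x + bias) - y
        weight -= learning_rate * (error * x + alpha * weight)
        bias -= learning_rate * error
loss_after = mean_log_loss(weight, bias)

print(loss_before, loss_after)
```

Each update looks at one observation, so the cost per step does not depend on the size of the data set, which is what makes the method attractive for large problems.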

These are the parameters used [9]:

* `loss`: str, default=`hinge`
The loss function to be used. Defaults to `hinge`, which gives a linear SVM.
The possible options are `hinge`, `log`, `modified_huber`, `squared_hinge`, `perceptron`, or a regression loss: `squared_error`, `huber`, `epsilon_insensitive`, or `squared_epsilon_insensitive`
The `log` loss gives logistic regression, a probabilistic classifier. `modified_huber` is another smooth loss that brings tolerance to outliers as well as probability estimates. `squared_hinge` is like hinge but is quadratically penalized. `perceptron` is the linear loss used by the perceptron algorithm. The other losses are designed for regression but can be useful in classification as well; see SGDRegressor for a description

* `penalty`: {`l2`, `l1`, `elasticnet`}, default=`l2`
The penalty (aka regularization term) to be used. Defaults to `l2` which is the standard regularizer for linear SVM models. `l1` and `elasticnet` might bring sparsity to the model (feature selection) not achievable with `l2`

* `alpha`: float, default=0.0001
Constant that multiplies the regularization term. The higher the value, the stronger the regularization. Also used to compute the learning rate when `learning_rate` is set to `optimal`

<br>
<br>

Please refer to step 4 of pipeline 1 for an explanation of `Random Forest Classifier`

In [13]:
def models_training_two(all_datasets_split):
        
    # To store models created
    best_models = {}
    
    for key, review_dataset in all_datasets_split.items():
                
        # Unpack
        X_train = review_dataset[0]
        X_test = review_dataset[1]
        y_train = np.ravel(review_dataset[2])
        y_test = np.ravel(review_dataset[3])
        dataframe_full = review_dataset[4]
        
        # Model 1 of 5: Multinomial Naive Bayes     
        
        bayes_est_base = MultinomialNB()
        
        # Train MultinomialNB baseline model for comparison
        bayes_baseline = bayes_est_base.fit(X_train, y_train)
        bayes_baseline_pred = bayes_baseline.predict(X_test)
        
        bayes_hyperparams = {
                'alpha'     : [0.8, 1.0, 1.2, 1.4],
                'fit_prior' : [True, False],
        }
        
        bayes_estimator = MultinomialNB()
        best_model_multinominal_nb = get_best_model(bayes_estimator, bayes_hyperparams, 
                                                    X_train, y_train, X_test, y_test, best_models, key,
                                                    'MultinominalNB')
        
        
        # Store best model and baseline predictions
        best_models[key].extend([best_model_multinominal_nb.best_estimator_,bayes_baseline_pred])
        
        
        # Model 2 of 5: Support Vector Machine
        
        svc_est_base = SVC()
        
        # Train SVM Classifier baseline model for comparison
        svc_baseline = svc_est_base.fit(X_train, y_train)
        svc_baseline_pred = svc_baseline.predict(X_test)
                
        svc_hyperparams = {
                        'C': [0.1, 1, 10, 100],
                    'gamma': [1, 0.1, 0.01, 0.001],
                   'kernel': ['rbf','linear']
        }

        svc_estimator = SVC(random_state=1)
        best_model_svc = get_best_model(svc_estimator, svc_hyperparams,
                                        X_train, y_train, X_test, y_test, best_models, key,
                                       'SVMClassifier')
        
        # Store best model and baseline predictions
        best_models[key].extend([best_model_svc.best_estimator_,svc_baseline_pred])
                
        # Model 3 of 5: Random Forest Classifier
        
        rfc_est_base = RandomForestClassifier()
        
        # Train Random Forest Classifier baseline model for comparison
        rfc_baseline = rfc_est_base.fit(X_train, y_train)
        rfc_baseline_pred = rfc_baseline.predict(X_test)

        rfc_hyperparams = {
            'bootstrap' : [True, False],
            'n_estimators' : list(range(100,200,100)), 
            #'n_estimators' : list(range(200,2000,100))
            'criterion' : ['gini', 'entropy'],
            'min_samples_leaf' : list(range(1,5,2)),
            'max_features' : ['sqrt', 'log2']
        }
        
        rfc_estimator = RandomForestClassifier(random_state=1)
        best_model_rfc = get_best_model(rfc_estimator, rfc_hyperparams,
                                                 X_train, y_train, X_test, y_test, best_models, key,
                                                 'RandomForestClassifier')
        
        # Store best model and baseline predictions
        best_models[key].extend([best_model_rfc.best_estimator_, rfc_baseline_pred])
         
            
            
        # Model 4 of 5: Logistic Regression
        
        log_est_base = LogisticRegression()
        
        # Train Logistic Regression baseline model for comparison
        log_baseline = log_est_base.fit(X_train, y_train)
        log_baseline_pred = log_baseline.predict(X_test)
        
        log_hyperparams = [{
                'penalty' : ['l2'],
                #'penalty' : ['l2', 'elasticnet'],
                'C' : [0.01, 0.1, 1,10],
                'solver'  : ['newton-cg', 'lbfgs', 'liblinear'],
                #'solver' : ['lbfgs','newton-cg','liblinear','sag','saga'],
                }
            ]
                
        log_estimator = LogisticRegression(random_state=1)
        best_model_log = get_best_model(log_estimator, log_hyperparams,
                                                 X_train, y_train, X_test, y_test, best_models, key,
                                                 'LogisticRegression')
        
        # Store best model and baseline predictions
        best_models[key].extend([best_model_log.best_estimator_, log_baseline_pred])
        
        
        # Model 5 of 5: SGD Classifier
        sgd_est_base = SGDClassifier()
        
        # Train Stochastic Gradient Descent baseline model for comparison
        sgd_baseline = sgd_est_base.fit(X_train, y_train)
        sgd_baseline_pred = sgd_baseline.predict(X_test)
        
        sgd_hyperparams = {
                # NOTE: recent scikit-learn releases name the logistic loss 'log_loss' instead of 'log'
                'loss'    : ['hinge', 'log'],
                #'loss'    : ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'],
                'penalty' : ['l2', 'elasticnet'],
                #'penalty' : ['l1', 'l2', 'elasticnet'],
                'alpha'   : [0.001, 0.01, 0.1]
            }
        
        sgd_estimator = SGDClassifier(random_state=1, early_stopping=True)
        best_model_sgd = get_best_model(sgd_estimator, sgd_hyperparams, 
                                        X_train, y_train, X_test, y_test, best_models, key,
                                        'SGDClassifier')

        # Store best model and baseline predictions
        best_models[key].extend([best_model_sgd.best_estimator_, sgd_baseline_pred, dataframe_full])
    
    return best_models
        


def get_best_model(estimator, hyperparams, X_train, y_train, X_test, y_test, 
                   best_models, key, name, fit_params=None):
    
    # Avoid a mutable default argument
    fit_params = fit_params or {}
    
    # Stratified 10-fold cross-validation, repeated 3 times (30 fits per candidate)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    
    # Go through search space with chosen params
    grid_search = GridSearchCV(estimator=estimator, param_grid=hyperparams, n_jobs=-1, cv=cv, scoring="accuracy")
    
    # Fit train data
    best_model = grid_search.fit(X_train, y_train, **fit_params)
    
    # Get best params
    best_params = best_model.best_estimator_.get_params()
    
    # Predict on the held-out test split with the refitted best estimator
    best_predict = best_model.predict(X_test)
    
    # Create the entry for this dataset on first use, then append the results
    best_models.setdefault(key, []).extend([name, best_params, best_predict, y_test])
        
    return best_model


# Pipeline 2 step 5: Evaluation and discussion

In this final step, the function `evaluate_two` processes each dataset sequentially and, for every model trained on that dataset (five baseline and five hyperparametrised models), prints evaluation reports containing the following statistics: `precision`, `recall`, `f1-score`, `accuracy`, `macro average` and `weighted average`. 
<br>
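
For readers less familiar with these statistics, the sketch below computes them by hand for a hypothetical set of labels and predictions. The function name `classification_summary` and the data are illustrative assumptions; in the notebook the figures come from sklearn's `classification_report` and `accuracy_score`.

```python
def classification_summary(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical labels: 1 = positive review, 0 = negative review
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

positive = classification_summary(y_true, y_pred, positive=1)
negative = classification_summary(y_true, y_pred, positive=0)

# Macro average: plain mean over classes
macro_f1 = (positive["f1"] + negative["f1"]) / 2
# Weighted average: mean over classes weighted by support (true samples per class)
support_pos, support_neg = y_true.count(1), y_true.count(0)
weighted_f1 = (positive["f1"] * support_pos + negative["f1"] * support_neg) / len(y_true)

print(positive)
print(negative)
print(macro_f1, weighted_f1)
```

The gap between the macro and weighted averages grows when the classes are imbalanced, which is worth keeping in mind when positive reviews outnumber negative ones.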

It is worth noting that some baseline models perform better than the ones tuned with `GridSearchCV`. This is most likely because of the constrained parameter grids, which do not always contain the best-performing settings. Since the scope of this assignment is not to optimise for accuracy but rather to demonstrate an understanding of NLP, I've chosen not to devote substantial computational resources and training time to it, in favour of content quality. As explained above, should the user wish to exploit the full potential of the search, uncommenting the parameters on each parameter grid in step 4 will help achieve that. 
<br>

In terms of accuracy, the following is the model ranking averaged across datasets where, for each model, the higher score between the baseline and the parametrised version is taken:
* Support Vector Machine: 78.91% 
* Logistic Regression: 78.87% 
* Multinomial Naive Bayes: 78.66%
* Stochastic Gradient Descent: 78.33%
* Random Forest Classifier: 77.16%
<br>

The highest average score across models is registered on the Apex AD 2600 dataset with an accuracy of 88.4%, whereas the lowest is on the Nokia 6600 dataset with an average accuracy of 68.22%. 
<br>

In terms of a single model on a single dataset, `Stochastic Gradient Descent` gets the highest score with an accuracy of 93.02% on the Diaper Champ dataset, whereas the lowest goes to `Random Forest Classifier` with a score of 57.53%. 
<br>

In [15]:
def evaluate_two(best_models):
    
    all_datasets_eval = best_models.copy()
    model_scores = {}

    for idx, (key, review_dataset) in enumerate(best_models.items()):
        
        print('===='*27)
        print(' '*33,f'Evaluation metrics for {tot_items[idx]} dataset')
        print('===='*27)

        # Each model has 6 variables. [:-1] because contains dataset_full (used later)
        for idx, num in enumerate(range(0,len(review_dataset[:-1]),6)):
            
            name = review_dataset[0+num]
            best_params = review_dataset[1+num]
            best_predict = review_dataset[2+num]
            y_test = review_dataset[3+num]
            best_model = review_dataset[4+num]
            baseline_pred = review_dataset[5+num]

            print('\n',' '*12,f'MODEL {1+idx}: {name}')
            print(f'\nBest Parameters for {name} classsifier')
            print('='*53)
            print(best_params)
            print('='*53)

            report = classification_report(y_test, best_predict)
            baseline_report = classification_report(y_test, baseline_pred)
            
            score = accuracy_score(y_true=y_test, y_pred=best_predict)

            # Store score to select best model in review_similarity_count_two.
            # The score is an immutable float, so no copy is needed.
            model_scores.setdefault(idx, []).append(score)
            
            baseline_score = accuracy_score(y_true=y_test, y_pred=baseline_pred)

            print(f'\n\nPerformance report of the baseline {name} model')
            print('=='*27)
            print(baseline_report)
            print('=='*27)
            print("{} {:0.2f}%".format("Accuracy Score: ", baseline_score*100))
            print('=='*27)
            print(f'\n\nPerformance report of the Optimised {name} model')
            print('=='*27)
            print(report)
            print('=='*27)
            print("{} {:0.2f}%".format("Accuracy Score: ", score*100))
            print('=='*27)
            print('\n')
    
    return all_datasets_eval, model_scores
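
The loop in `evaluate_two` relies on each dataset entry being a flat list in which every model occupies six consecutive slots, followed by the full dataset as the final element. The sketch below illustrates that indexing scheme with placeholder values; the helper `iter_model_records` and the constant `FIELDS_PER_MODEL` are hypothetical names introduced here, not part of the notebook.

```python
# Assumed slot order, taken from the unpacking in evaluate_two:
# name, best_params, best_predict, y_test, best_model, baseline_pred
FIELDS_PER_MODEL = 6


def iter_model_records(review_dataset):
    """Yield one 6-item tuple per model, skipping the trailing full dataset."""
    records = review_dataset[:-1]
    for start in range(0, len(records), FIELDS_PER_MODEL):
        yield tuple(records[start:start + FIELDS_PER_MODEL])


# Placeholder values standing in for real estimators and predictions
review_dataset = [
    "MultinominalNB", {"alpha": 1.0}, [1, 0], [1, 0], "nb_model", [1, 1],
    "SVM", {"C": 1.0}, [0, 0], [1, 0], "svm_model", [1, 0],
    "dataset_full",
]

names = [record[0] for record in iter_model_records(review_dataset)]
models = [record[4] for record in iter_model_records(review_dataset)]
print(names, models)
```

With this layout the trained model of the n-th group sits at index `4 + 6 * n`, which is why indices 4, 10, 16, 22 and 28 appear later in the pipeline.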

The feature summary reports are printed by `review_similarity_count_two`. It takes `all_datasets_eval`, a dictionary of datasets that also contains the rows with NaN reviews, and `model_scores`, which holds the scores of all `GridSearchCV` models from the `evaluate_two` function.
<br>
<br>
The `review_similarity_count_two` function performs the following steps: 

* Selects the best parametrised model among the trained models for each product dataset
* Splits each dataset into rows with reviews and rows without reviews
* Vectorises the rows with missing reviews by calling the `transform` method of the vectoriser previously trained on that dataset
* Predicts the semantic orientation of the missing product reviews using the best model for that specific dataset
* Merges the dataset with original reviews and the one with forecasted reviews
* Creates two dictionaries of positive and negative reviews and performs a review count for each feature
* Checks for word similarity using the Levenshtein distance method seen in pipeline 1, although fine-tuned with a different pipeline hyperparameter ratio (0.6 instead of 0.7, due to the larger number of features)
* Prints the feature review summary table for each dataset, along with details of the model used to forecast missing reviews, its out-of-sample accuracy and the percentage of the original dataset that the forecasted missing reviews represented
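
The similarity-merge step listed above can be illustrated with a simplified sketch. It is not the notebook's implementation: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for `Levenshtein.ratio`, so the ratio values (and therefore a suitable threshold) will differ from the ones in the pipeline. The function name `merge_similar_features` and the sample counts are made up for the example.

```python
from difflib import SequenceMatcher


def merge_similar_features(counts, threshold=0.6):
    """Fold each feature into a shorter, similar feature and sum their counts."""
    merged = {}
    # Shortest keys first, so the shorter spelling becomes the canonical one
    for feature in sorted(counts, key=len):
        for canonical in merged:
            if SequenceMatcher(None, feature, canonical).ratio() > threshold:
                merged[canonical] += counts[feature]
                break
        else:
            # No sufficiently similar feature seen yet: keep it as its own key
            merged[feature] = counts[feature]
    return merged


# Hypothetical positive-review counts per extracted feature
positive_counts = {"battery": 4, "battery life": 3, "screen": 2}
merged = merge_similar_features(positive_counts)
print(merged)
```

Lowering the threshold merges more aggressively, which reduces near-duplicate features but raises the risk of folding together unrelated ones.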

Interestingly, `MultinominalNB` was selected as the forecasting model on 11 out of 17 datasets. Logistic Regression was selected three times, Random Forest twice and Support Vector Machine once. This model selection is diametrically opposite to what was observed in the step above, where SVM was the highest-scoring model. The difference arises because the models chosen here are all `GridSearchCV` models, whereas before the better of the baseline and parametrised models was picked. This in turn suggests that the `SVM` baseline model performs better on average than the others, whereas `MultinominalNB` is on average the best hyperparametrised model.
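
The per-dataset selection and the resulting tally can be sketched as below. The structure of `model_scores` (model index mapped to one score per dataset) follows what `evaluate_two` returns, but the scores themselves and the helper name `best_model_per_dataset` are hypothetical.

```python
from collections import Counter

# Hypothetical GridSearchCV accuracies: model index -> one score per dataset
model_scores = {
    0: [0.91, 0.70, 0.85],  # e.g. Naive Bayes
    1: [0.88, 0.72, 0.80],  # e.g. Support Vector Machine
    2: [0.86, 0.69, 0.84],  # e.g. Random Forest
}


def best_model_per_dataset(model_scores):
    """Return, for each dataset position, the index of the top-scoring model."""
    n_datasets = len(next(iter(model_scores.values())))
    return [
        # On ties, max keeps the first model encountered
        max(model_scores, key=lambda idx: model_scores[idx][i])
        for i in range(n_datasets)
    ]


winners = best_model_per_dataset(model_scores)
tally = Counter(winners)  # how many datasets each model index wins
print(winners, tally)
```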

The engineering solution of retaining data with missing reviews and using the spaCy `noun_chunks` method to extract noun phrases has paid dividends in terms of the quality and number of extracted reviews. This approach has delivered a much more semantically rich feature set than the solution seen in pipeline 1.
<br>
<br>

In [17]:
def review_similarity_count_two(all_datasets_eval, model_scores):
    
    
    # store models and index values 
    ml_models = {0:'MultinominalNB',1:'Support Vector Machine Classifier',2:'Random Forest Classifier',
                 3:'Logistic Regression',4:'Stocastic Gradient Decent Classifier'}
        
    # Get best model among trained models for each product review        
    highest_score_idx = []
    highest_score_model = []
    for i in range(len(all_datasets_eval)):
        model_idx = 0
        model_score = 0
        for key, value in model_scores.items():
            if value[i] > model_score:
                model_score = value[i]
                model_idx = key
        highest_score_idx.append(model_idx)
        highest_score_model.append(model_score)
        
    for idx, (key, review_dataset) in enumerate(all_datasets_eval.items()):
        
        print('===='*27)
        print(' '*33,f'Top Features extracted from {tot_items[idx]} dataset')
        print('===='*27)
        
        print(f'\nSemantic orientation of missing reviews is forecasted using {ml_models[highest_score_idx[idx]]}')
        print(f' since this is the best performing model on this dataset with an out of sample accuracy of {round((highest_score_model[idx]*100),2)}%\n')
        
        # The best trained model for each ML group (MultinominalNB, SVM, RFC, LR, SGD)
        # is stored at index 4 in steps of 6: 4, 10, 16, 22, 28
        best_model = review_dataset[4 + 6 * highest_score_idx[idx]]
        
        
        df_global_features = review_dataset[30] 
                
        
        # Dataframe with reviews
        complete_reviews = df_global_features[df_global_features['Review'].notna()]
        
        # Dataframe without reviews
        missing_reviews = df_global_features[df_global_features['Review'].isna()]
        
        tot_reviews = complete_reviews.shape[0]+missing_reviews.shape[0]
        
        print(f'The number of forecasted (missing) reviews is {missing_reviews.shape[0]} ' \
              f'which equates to {round((missing_reviews.shape[0]/tot_reviews)*100,2)}% of all reviews\n')

        # Apply tfidf vectorizer
        X_test = vectorizer_list[idx].transform(missing_reviews[chosen_feature[feature_num]])

        # Predict missing reviews
        pred_miss_reviews = best_model.predict(X_test)
        
        # Drop Review col (Nan)
        missing_reviews = missing_reviews.drop(['Review'],axis=1)

        # Add predicted reviews
        missing_reviews['Review'] = pred_miss_reviews

        # Concat all reviews
        all_reviews = pd.concat([complete_reviews,missing_reviews]).reset_index()
                               
        # Get spacy noun chunked reviews
        spacy_chunk = all_reviews['Spacy_Chunk'].tolist()
        
        spacy_dict_pos = {}
        spacy_dict_neg = {}
        spacy_list_reviews = []
        
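        # Build the stop-word set once; calling stopwords.words() for every token is very slow
        stop_words = set(stopwords.words())

        for idx_review, noun_sequence in enumerate(spacy_chunk):
            for single_noun in noun_sequence:
                container = ''
                for token in single_noun:
                    # Keep tokens longer than 3 characters that are not stop words
                    if len(token) > 3 and str(token) not in stop_words:
                        container += str(token) + " "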
                # Add positive review
                if all_reviews['Review'][idx_review] == 1:
                    if container[:-1] not in spacy_dict_pos.keys() and container != '':
                        # Append pos review
                        spacy_dict_pos[container[:-1]] = []
                        spacy_dict_pos[container[:-1]].append(all_reviews['Review'][idx_review])
                    elif container != '':
                        # Append positive review if key already present
                        spacy_dict_pos[container[:-1]].append(all_reviews['Review'][idx_review])
                
                # Add negative review 
                elif all_reviews['Review'][idx_review] == 0:
                    if container[:-1] not in spacy_dict_neg.keys() and container != '':
                        # Append neg review
                        spacy_dict_neg[container[:-1]] = []
                        spacy_dict_neg[container[:-1]].append(1)
                    elif container != '':
                        # Append negative review if key already present
                        spacy_dict_neg[container[:-1]].append(1)

        # Sum values in dicts
        spacy_dict_pos = {k: sum(v) for k, v in spacy_dict_pos.items()}
        spacy_dict_neg = {k: sum(v) for k, v in spacy_dict_neg.items()}

        # Check for word similarity in positive dict
        tot = 0
        for idx, key in enumerate(spacy_dict_pos.copy()):
            for idy, key_two in enumerate(spacy_dict_pos.copy()):
                # Check keys for name similarity
                ratio = Levenshtein.ratio(key,key_two)
                if ratio > levenshtein_ratio_one and idx != idy:
                    # Get the shorter and longer keys
                    longer = key if len(key)>= len(key_two) else key_two
                    shorter = key if key != longer else key_two

                    # Delete if key in dic and not the only key left 
                    if longer in spacy_dict_pos.keys() and list(spacy_dict_pos.keys()).index(longer) != idy:
                        # Store values before deleting
                        tot = spacy_dict_pos[longer]
                        # Delete
                        del spacy_dict_pos[longer]

                        if shorter in spacy_dict_pos.keys():
                            # Add value back to same key
                            spacy_dict_pos[shorter] += tot
                        else:
                            spacy_dict_pos[shorter] = tot

        # Check for word similarity in negative dict
        tot = 0
        for idx, key in enumerate(spacy_dict_neg.copy()):
            for idy, key_two in enumerate(spacy_dict_neg.copy()):
                # Check keys for name similarity
                ratio = Levenshtein.ratio(key,key_two)
                if ratio > levenshtein_ratio_one and idx != idy:
                    # Get the shorter and longer keys
                    longer = key if len(key)>= len(key_two) else key_two
                    shorter = key if key != longer else key_two

                    # Delete if key in dic and not the only key left 
                    if longer in spacy_dict_neg.keys() and list(spacy_dict_neg.keys()).index(longer) != idy:
                        # Store values before deleting
                        tot = spacy_dict_neg[longer]
                        # Delete
                        del spacy_dict_neg[longer]

                        if shorter in spacy_dict_neg.keys():
                            # Add value back to same key
                            spacy_dict_neg[shorter] += tot
                        else:
                            spacy_dict_neg[shorter] = tot

        # Sort dicts
        spacy_dict_pos = dict(sorted(spacy_dict_pos.items(), reverse=True, key=lambda item: item[1]))
        spacy_dict_neg = dict(sorted(spacy_dict_neg.items(), reverse=True, key=lambda item: item[1]))

        # Dict to dataframe
        spacy_pos_df = pd.DataFrame(spacy_dict_pos.items(), columns=['Features', 'Positive Review Count'])
        spacy_neg_df = pd.DataFrame(spacy_dict_neg.items(), columns=['Features', 'Negative Review Count'])

        # Concat dataframes on common features
        spacy_pos_neg_df = pd.concat([spacy_pos_df.set_index('Features'),spacy_neg_df.set_index('Features')],
                                     axis=1, join='inner').reset_index()

        # Capitalise first letter
        spacy_pos_neg_df['Features'] = spacy_pos_neg_df['Features'].apply(lambda x: str.capitalize(x))
        
        print(spacy_pos_neg_df)
        print('\n\n\n')
    


In [18]:
a = review_similarity_count_two(
    *evaluate_two(models_training_two(split_data_apply_tfidf_two(
         feature_extraction_two(
         data_preprocessing(
         load_data(your_path_to_data)))))))


                                  Evaluation metrics for Apex AD 2600 dataset

              MODEL 1: MultinominalNB

Best Parameters for MultinominalNB classifier
{'alpha': 1.0, 'class_prior': None, 'fit_prior': True}


Performance report of the baseline MultinominalNB model
              precision    recall  f1-score   support

         0.0       0.90      0.98      0.93        44
         1.0       0.95      0.80      0.87        25

    accuracy                           0.91        69
   macro avg       0.92      0.89      0.90        69
weighted avg       0.92      0.91      0.91        69

Accuracy Score:  91.30%


Performance report of the Optimised MultinominalNB model
              precision    recall  f1-score   support

         0.0       0.90      0.98      0.93        44
         1.0       0.95      0.80      0.87        25

    accuracy                           0.91        69
   macro avg       0.92      0.89      0.90        69
weighted avg       0.92      0.91      0

# References


* [1] https://en.wikipedia.org/wiki/Viterbi_algorithm
* [2] https://stackoverflow.com/questions/195010/how-can-i-split-multiple-joined-words
* [3] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
* [4] https://www.nltk.org/book/ch07.html
* [5] https://en.wikipedia.org/wiki/Levenshtein_distance
* [6] https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
* [7] https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=support+vector+machine
* [8] https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
* https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/
* https://www.cs.uic.edu/~liub/publications/aaai04-featureExtract.pdf
* https://www.cs.uic.edu/~liub/publications/kdd04-revSummary.pdf

Thanks for your attention, we hope you have enjoyed reading through this assignment!