# A Multiclass Classification Pipeline for Text Observations
## The pipeline consists of:  
1. Gathering the data  
1. Processing the data  
1. Exploring the data  
1. Feature engineering  
1. Preliminary feature selection based on univariate proxies  
1. ML model selection  
1. Training the chosen model  
1. Generating performance metrics  

## The Data:
A data set from ??? that holds texts and a class for each.  
The set has 8 classes in total, so the task calls for a multiclass NLP-ML classifier.

## Approach:
A simple, generic ML approach.  
That is, I don't use any NLP-dedicated models (such as BERT or GPT-3).  
Instead, I design numerical features for the text terms (one-hot encoding / bag of words / TF-IDF) and pipe them into generic ML models.  
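
As a minimal sketch of this approach (not the notebook's actual pipeline), the snippet below turns a few made-up sentences into bag-of-words counts with `CountVectorizer` and feeds them to a generic `LogisticRegression`. The toy texts, labels, and variable names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus and labels: hypothetical, not taken from the data set.
toy_texts = [
    "board approves stock split",
    "company completes sale of shares",
    "net profit rises in fourth quarter",
    "quarterly net loss widens",
]
toy_labels = [1, 1, 2, 2]

# Bag-of-words counts (unigrams and bigrams) feed a generic classifier.
toy_model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
toy_model.fit(toy_texts, toy_labels)

# Inspect the numerical feature matrix the classifier actually sees.
toy_features = toy_model.named_steps["countvectorizer"].transform(toy_texts)
print(toy_features.shape)  # (number of texts, vocabulary size)
print(toy_model.predict(["net profit"]))
```

Swapping `CountVectorizer` for `TfidfVectorizer` in the same pipeline would give the TF-IDF variant without touching the classifier.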

In [12]:
import numpy as np
import pandas as pd
import matplotlib
import scipy
import re

# ML imports: 
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
import sklearn.linear_model as lm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier



In [13]:
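# file_name: The file from HackerRank that holds the raw data
# do_preprocessing: Boolean, whether preprocessing should be performed
# do_feature_eng: Boolean, whether feature engineering should be performed
# do_cleaning: Boolean, whether cleaning should be performed
# maximize_a_priori: Boolean, whether the univariate preliminary feature selection is based on a priori (True) or a posteriori (False) stats
# num_chosen_features_per_class: Int, how many features the preliminary feature selection picks per class
# test_size: Float between 0 and 1, the fraction of the data held out for testing
# feature_eng_details: Either "TfidfVectorizer" (for TF-IDF feature engineering) or "CountVectorizer" (for one-hot / bag-of-words encoding)
# ngram_range_min, ngram_range_max: Int, smallest and largest n-gram sizes (presumably the vectorizer's ngram_range)
# max_features: Int, presumably the cap on the vectorizer's vocabulary size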
config_dict = {'file_name': "trainingdata_2.txt",
               'do_preprocessing': True,
               'do_feature_eng': True,
               'do_cleaning': True,
               'maximize_a_priori': True,
               'num_chosen_features_per_class': 4,
               'test_size': 0.2,
               'feature_eng_details': "CountVectorizer",
               'ngram_range_min': 1,
               'ngram_range_max': 2,
               'max_features': 2000}

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
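
One way the feature-engineering keys in `config_dict` might be turned into a vectorizer is sketched below. The helper name `build_vectorizer` and the `demo_config` dictionary are hypothetical; the notebook's own wiring may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def build_vectorizer(config):
    """Hypothetical helper: pick a vectorizer class from the config keys."""
    vectorizers = {
        "CountVectorizer": CountVectorizer,
        "TfidfVectorizer": TfidfVectorizer,
    }
    name = config["feature_eng_details"]
    if name not in vectorizers:
        raise ValueError(f"Unsupported feature_eng_details: {name!r}")
    return vectorizers[name](
        ngram_range=(config["ngram_range_min"], config["ngram_range_max"]),
        max_features=config["max_features"],
    )


# Same feature-engineering keys and values as in config_dict above.
demo_config = {
    "feature_eng_details": "CountVectorizer",
    "ngram_range_min": 1,
    "ngram_range_max": 2,
    "max_features": 2000,
}
demo_vectorizer = build_vectorizer(demo_config)
print(type(demo_vectorizer).__name__, demo_vectorizer.ngram_range)
```

Raising on an unknown name keeps a typo in `feature_eng_details` from silently falling back to a default vectorizer.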


In [14]:
dataset_raw = pd.read_csv(config_dict["file_name"])

In [15]:
dataset_parsed = pd.DataFrame([])
# Split each raw line on its first space only: the class label, then the full text.
dataset_parsed[["class", "text"]] = dataset_raw.iloc[:, 0].str.split(" ", n=1, expand=True)

dataset_parsed.head().style.set_properties(**{'text-align': 'left'})

index,class,text
0,1,champion products ch approves stock split champion products inc said its board of directors approved a two for one stock split of its common shares for shareholders of record as of april the company also said its board voted to recommend to shareholders at the annual meeting april an increase in the authorized capital stock from five mln to mln shares reuter
1,2,computer terminal systems cpml completes sale computer terminal systems inc said it has completed the sale of shares of its common stock and warrants to acquire an additional one mln shares to sedio n v of lugano switzerland for dlrs the company said the warrants are exercisable for five years at a purchase price of dlrs per share computer terminal said sedio also has the right to buy additional shares and increase its total holdings up to pct of the computer terminal s outstanding common stock under certain circumstances involving change of control at the company the company said if the conditions occur the warrants would be exercisable at a price equal to pct of its common stock s market price at the time not to exceed dlrs per share computer terminal also said it sold the technolgy rights to its dot matrix impact technology including any future improvements to woodco inc of houston tex for dlrs but it said it would continue to be the exclusive worldwide licensee of the technology for woodco the company said the moves were part of its reorganization plan and would help pay current operation costs and ensure product delivery computer terminal makes computer generated labels forms tags and ticket printers and terminals reuter
2,1,cobanco inc cbco year net shr cts vs dlrs net vs assets mln vs mln deposits mln vs mln loans mln vs mln note th qtr not available year includes extraordinary gain from tax carry forward of dlrs or five cts per shr reuter
3,1,am international inc am nd qtr jan oper shr loss two cts vs profit seven cts oper shr profit vs profit revs mln vs mln avg shrs mln vs mln six mths oper shr profit nil vs profit cts oper net profit vs profit revs mln vs mln avg shrs mln vs mln note per shr calculated after payment of preferred dividends results exclude credits of or four cts and or nine cts for qtr and six mths vs or six cts and or cts for prior periods from operating loss carryforwards reuter
4,1,brown forman inc bfd th qtr net shr one dlr vs cts net mln vs mln revs mln vs mln nine mths shr dlrs vs dlrs net mln vs mln revs billion vs mln reuter
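
The parsing step above splits each raw line on its first space, so the leading token becomes the class and the remainder becomes the text. A small self-contained sketch of that split follows; the two sample lines are shortened, made-up stand-ins for the real file, and passing `n=1` by keyword reflects my understanding that recent pandas versions no longer accept it positionally.

```python
import pandas as pd

# Hypothetical raw lines in "<class> <text>" form, standing in for the real file.
raw_lines = pd.Series([
    "1 champion products approves stock split",
    "2 computer terminal systems completes sale",
])

# Split only on the first space: column 0 is the class, column 1 the full text.
parts = raw_lines.str.split(" ", n=1, expand=True)
parsed = parts.rename(columns={0: "class", 1: "text"})

print(parsed)
```

Note that the class column comes out as strings here; converting it with `astype(int)` would be a separate step if numeric labels are needed.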
