# A Multiclass Classification Pipeline for Text Observations
## The pipeline consists of:  
1. Gathering the data  
1. Processing the data  
1. Exploring the data  
1. Feature engineering  
1. Preliminary feature selection based on univariate proxies  
1. ML model selection  
1. Training the chosen model  
1. Generating performance metrics  

## The Data:
A data set from ??? that holds texts and a class for each.  
The set has 8 classes in total, so the task calls for a multiclass NLP classifier.

## Approach:
A simple, generic ML approach.  
That is, I don't use any NLP-dedicated models (such as BERT or GPT-3).  
Instead, I build numerical features from the text terms (one-hot encoding, bag of words, or TF-IDF) and feed them into generic ML models.  
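
As a rough illustration of this approach (not the notebook's actual pipeline), the sketch below turns a few texts into n-gram counts and feeds them to a generic linear classifier. The toy corpus, the labels, and the choice of `LogisticRegression` are assumptions made only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, invented purely for illustration (not the notebook's data).
texts = [
    "the match ended in a draw",
    "the team won the final match",
    "stocks fell as markets closed",
    "the bank raised interest rates",
]
labels = [0, 0, 1, 1]

# Text -> sparse unigram/bigram counts -> generic linear classifier.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Predict classes for two unseen texts.
predictions = model.predict(["the team lost the match", "markets and interest rates"])
```

Swapping `CountVectorizer` for `TfidfVectorizer` would give the TF-IDF variant without changing the rest of the sketch.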

In [3]:
import numpy as np
import pandas as pd
import matplotlib
import scipy
import re

# ML imports: 
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
import sklearn.linear_model as lm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier



In [4]:
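# file_name: The file from HackerRank that holds the raw data
# do_preprocessing: Logical, should preprocessing be performed
# do_feature_eng: Logical, should feature engineering be performed
# do_cleaning: Logical, should cleaning be performed
# maximize_a_priori: Logical, should the univariate preliminary feature selection be based on a priori or a posteriori statistics
# num_chosen_features_per_class: Int, for the preliminary feature selection, how many features should be selected per class
# test_size: Float between 0 and 1, the fraction of the data held out for testing
# feature_eng_details: Either "TfidfVectorizer" (for TF-IDF features) or "CountVectorizer" (for bag-of-words counts; one-hot only if used with binary=True)
# ngram_range_min, ngram_range_max: Int, lower and upper bounds of the n-gram range
# max_features: Int, maximum number of features (vocabulary size) to keep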
config_dict = {'file_name': "trainingdata_2.txt",
               'do_preprocessing': True,
               'do_feature_eng': True,
               'do_cleaning': True,
               'maximize_a_priori': True,
               'num_chosen_features_per_class': 4,
               'test_size': 0.2,
               'feature_eng_details': "CountVectorizer",
               'ngram_range_min': 1,
               'ngram_range_max': 2,
               'max_features': 2000}

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
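
One way the vectorizer-related keys above might be consumed is sketched below. `build_vectorizer` and `example_config` are hypothetical names introduced for this example; the mapping of `ngram_range_min`, `ngram_range_max`, and `max_features` onto the vectorizer's `ngram_range` and `max_features` arguments is an assumption about how the config is meant to be used.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def build_vectorizer(config):
    """Hypothetical helper: map the config keys onto a scikit-learn vectorizer."""
    vectorizers = {
        "CountVectorizer": CountVectorizer,
        "TfidfVectorizer": TfidfVectorizer,
    }
    name = config["feature_eng_details"]
    if name not in vectorizers:
        raise ValueError(f"Unsupported feature_eng_details: {name!r}")
    return vectorizers[name](
        ngram_range=(config["ngram_range_min"], config["ngram_range_max"]),
        max_features=config["max_features"],
    )


# A reduced copy of the config with only the keys this helper reads.
example_config = {
    "feature_eng_details": "CountVectorizer",
    "ngram_range_min": 1,
    "ngram_range_max": 2,
    "max_features": 2000,
}

vectorizer = build_vectorizer(example_config)
# Two invented texts, just to show that the vectorizer yields a sparse feature matrix.
features = vectorizer.fit_transform(["a small example text", "another small text"])
```

Keeping the choice of vectorizer in the config means the rest of the pipeline can stay unchanged when switching between count-based and TF-IDF features.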


In [5]:
dataset_raw = pd.read_csv(config_dict["file_name"])

In [7]:
# dataset_parsed = pd.DataFrame([])
# dataset_parsed[["class", "text"]] = dataset_raw.iloc[:, 0].str.split(" ", n=1, expand=True)  # n must be passed by keyword in pandas 2.x

# dataset_parsed.head().style.set_properties(**{'text-align': 'left'})
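
The commented-out cell above suggests that each raw line holds a class label, a space, and then the text. Under that assumption, a minimal parsing sketch could look like the following; the sample lines are invented for illustration and the integer cast assumes numeric class labels.

```python
import pandas as pd

# Invented sample in the assumed "<class> <text>" layout; not taken from the real file.
sample_lines = [
    "1 first example text",
    "2 second example text",
    "1 third one",
]

# Split each line on the first space only, so the text keeps its internal spaces.
parsed = pd.Series(sample_lines).str.split(" ", n=1, expand=True)
parsed.columns = ["class", "text"]

# Assumption: class labels are integers.
parsed["class"] = parsed["class"].astype(int)
```

Passing `n=1` by keyword keeps the call valid on recent pandas versions, where only the pattern may be given positionally.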