# 0. Intro

The sentiment analysis process in this notebook consists of two steps:

- Feature extraction
- Modeling

#### Feature Extraction

This phase relies on statistical bag-of-words methods (CountVectorizer, TfidfVectorizer).

#### Modeling

The models are traditional machine-learning classifiers: DecisionTree, RandomForest, and HistGradientBoosting (scikit-learn's histogram-based gradient boosting).


Both steps have hyperparameters. To maximize performance, some of these hyperparameters are tuned; this tuning is the core of the notebook.



# 1. Packages & Basic Settings

In [1]:
import numpy as np
import pandas as pd
import os
import pickle
import json

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import make_scorer, matthews_corrcoef

In [2]:
from utilities import NlpPipeBayesSearch

In [3]:
data_path = '../data'

In [4]:
seed = 0


# 2. Data Import

In [5]:
preprocessed_df_filename = 'df_preprocessed.parquet'

df = pd.read_parquet(os.path.join(data_path, 'intermediate', preprocessed_df_filename))

# 3. Data Segmentation

In [6]:
overall_test_size = 0.2

split_point = int(round(len(df)*(1-overall_test_size)))


In [7]:
X_train=df['clean_bow'].values[:split_point]
y_bin_train=df['binary_label'].values[:split_point]
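# NOTE: the ternary target is read from 'binary_label', the same column as y_bin_train.
# If df has a separate ternary label column, select it here instead.
y_ter_train=df['binary_label'].values[:split_point]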
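
The split above slices by position instead of shuffling, so the last 20% of rows are held out. A minimal sketch on a toy frame, assuming the rows are already in chronological order (the later use of TimeSeriesSplit suggests this, but the notebook does not state it); `toy_df` and `X_test` are hypothetical names:

```python
import pandas as pd

# Hypothetical stand-in for df: 10 rows, assumed sorted from oldest to newest.
toy_df = pd.DataFrame({
    "clean_bow": [f"doc {i}" for i in range(10)],
    "binary_label": [0, 1] * 5,
})

overall_test_size = 0.2
split_point = int(round(len(toy_df) * (1 - overall_test_size)))

# Everything before split_point is used for tuning; the tail is held out.
X_train = toy_df["clean_bow"].values[:split_point]
X_test = toy_df["clean_bow"].values[split_point:]

print(split_point, len(X_train), len(X_test))
```

Because no shuffling takes place, no held-out row precedes a training row.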

# 4. Parameters Search

Vectorizer hyperparameters:
- max_features -> number of most frequent tokens to keep as features
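
To make the effect of max_features concrete, the sketch below caps a vocabulary at two tokens. The documents are invented for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents: corpus-wide counts are good=5, bad=2, neutral=1.
toy_docs = ["good good good bad", "good bad", "good neutral"]

full = CountVectorizer().fit(toy_docs)
capped = CountVectorizer(max_features=2).fit(toy_docs)

# The capped vectorizer keeps only the two most frequent tokens.
print(sorted(full.vocabulary_))
print(sorted(capped.vocabulary_))
```

With ngram_range=(1,4), as used in this notebook, the same cap applies to n-grams as well as single tokens.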

Model hyperparameters:

DecisionTree: [max_depth, min_samples_leaf]

RandomForest: [n_estimators, max_depth, max_features]

HistGradientBoosting: [max_iter, max_depth, learning_rate]


#### Search method: BayesSearch
#### Evaluation metric: Matthews correlation coefficient (MCC)

## 4.1. Search params initialization

#### Vectorizers search space

In [8]:
vects_space = {'cv':{'object':CountVectorizer(ngram_range=(1,4)), 'space':{'max_features':(2**8,2**11)}},
        'tfidf':{'object':TfidfVectorizer(ngram_range=(1,4)), 'space':{'max_features':(2**8,2**11)}}}

#### Models search space

In [9]:
models_space = {'DT':{'object':DecisionTreeClassifier(random_state=seed), 'space':{'max_depth':(2,4), 'min_samples_leaf':(2**3, 2**5)}},
    'RF':{'object':RandomForestClassifier(random_state=seed), 'space':{'max_depth':(2,4), 'n_estimators':(2**5, 2**8), 'max_features':(1/2**6,1/2**2)}},
    'HGB':{'object':HistGradientBoostingClassifier(random_state=seed), 'space':{'max_depth':(2,4), 'max_iter':(2**5, 2**8), 'learning_rate':(1/2**9,1/2**3)}}}

#### Scorer

In [10]:
mcc = make_scorer(matthews_corrcoef)
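
For reference, the binary MCC can be computed directly from the confusion matrix. The sketch below uses invented labels and checks the manual formula against scikit-learn's matthews_corrcoef.

```python
import math

from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Invented labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
mcc_manual = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)
mcc_sklearn = matthews_corrcoef(y_true, y_pred)

print(mcc_manual, mcc_sklearn)  # both 0.5
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level predictions, which makes it less sensitive to class imbalance than accuracy.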

#### CV object

In [11]:
n_cv = 4
cv_test_size = 250

ts_split = TimeSeriesSplit(n_splits=n_cv, test_size=cv_test_size)
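
The sketch below shows the folds this splitter generates. The sample count of 2000 is a hypothetical value chosen for illustration; the real training set size is not shown in the notebook.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 2000  # hypothetical size, for illustration only
ts_split = TimeSeriesSplit(n_splits=4, test_size=250)

folds = list(ts_split.split(np.arange(n_samples)))
for fold, (train_idx, test_idx) in enumerate(folds):
    # Each training window starts at index 0 and ends right before its test window.
    print(fold, (train_idx[0], train_idx[-1]), (test_idx[0], test_idx[-1]))
```

Each fold validates on the 250 samples that immediately follow its training window, and the training window grows by 250 samples from one fold to the next.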

#### N jobs

In [12]:
# Leave one core free; fall back to a single job if the core count is unknown or 1
nj = max(1, (os.cpu_count() or 1) - 1)


## 4.2. Bayes Search

In [13]:
n_iterations = 16

In [14]:
binary_pipe_bay_search = NlpPipeBayesSearch(vects_dict=vects_space,
                                     clfs_dict=models_space,
                                     cv_object=ts_split,
                                     n_iter=n_iterations,
                                     random_state=seed,
                                     n_jobs=nj,
                                     scoring=mcc,
                                     std_penalty=True)


binary_pipe_bay_search.search(X_train, y_bin_train)
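
NlpPipeBayesSearch comes from the notebook's own utilities module, and its implementation is not shown here. The sketch below is therefore not that class: it only illustrates, with plain scikit-learn, how a single vectorizer and classifier combination at one fixed hyperparameter point could be scored with the same scorer and the same kind of CV splitter. Reading std_penalty as "mean fold score minus standard deviation" is an assumption, and the toy texts and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Invented toy data: alternating labels, two phrasings per class.
pos = ["strong profit growth", "strong record gains"]
neg = ["weak loss decline", "weak sharp drop"]
labels = [i % 2 for i in range(120)]
texts = [(pos if y else neg)[(i // 2) % 2] for i, y in enumerate(labels)]

# One point of the search space: TF-IDF features feeding a shallow decision tree.
pipeline = Pipeline([
    ("vect", TfidfVectorizer(ngram_range=(1, 4), max_features=2**8)),
    ("clf", DecisionTreeClassifier(max_depth=3, min_samples_leaf=2**3, random_state=0)),
])

scores = cross_val_score(
    pipeline,
    texts,
    labels,
    cv=TimeSeriesSplit(n_splits=4, test_size=20),
    scoring=make_scorer(matthews_corrcoef),
)

# ASSUMPTION: a std penalty that subtracts the fold-score spread from the mean,
# so that unstable configurations rank lower.
penalised_score = scores.mean() - scores.std()
print(scores, penalised_score)
```

A Bayesian search would repeat an evaluation of this kind for n_iter candidate points per combination, picking each new point based on the scores observed so far.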


In [19]:
ternary_pipe_bay_search = NlpPipeBayesSearch(vects_dict=vects_space,
                                     clfs_dict=models_space,
                                     cv_object=ts_split,
                                     n_iter=n_iterations,
                                     random_state=seed,
                                     n_jobs=nj,
                                     scoring=mcc,
                                     std_penalty=True)


ternary_pipe_bay_search.search(X_train, y_ter_train)


# 5. Parameters selection

In [20]:
pipelines_instructions = {}

pipelines_instructions['binary'] = binary_pipe_bay_search.pipelines_instructions

pipelines_instructions['ternary'] = ternary_pipe_bay_search.pipelines_instructions

# 6. Export

In [27]:
# Use a context manager so the file handle is closed after writing
with open(os.path.join(data_path, 'output', 'NLP_FSA_pipelines_instructions.pkl'), 'wb') as f:
    pickle.dump(pipelines_instructions, f)
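
A downstream notebook can read the exported file back with pickle.load. The sketch below shows the round trip in a temporary directory; the dictionary contents are hypothetical placeholders, since the real structure of pipelines_instructions is defined by the utilities module.

```python
import os
import pickle
import tempfile

# Hypothetical placeholder; the real values come from the Bayes search objects.
pipelines_instructions = {"binary": {"example": 1}, "ternary": {"example": 2}}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "NLP_FSA_pipelines_instructions.pkl")

    # Write, then read back, closing each handle via a context manager.
    with open(out_path, "wb") as f:
        pickle.dump(pipelines_instructions, f)
    with open(out_path, "rb") as f:
        restored = pickle.load(f)

print(restored.keys())
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.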