# 0. Intro

The sentiment analysis process consists of two steps:

- Feature extraction
- Modeling

#### Feature Extraction

This phase relies on statistical bag-of-words methods (CountVectorizer, TfidfVectorizer).

#### Modeling

The models are traditional machine learning classifiers: DecisionTree, RandomForest, and GradientBoosting.


Both steps have hyperparameters. To maximize performance, some of them are tuned. This tuning is the core of this notebook.



# 1. Packages & Basic Settings

In [1]:
import numpy as np
import pandas as pd
import yfinance as yf
from datetime import timedelta
import os
from collections import defaultdict
from ast import literal_eval
import pickle
import json

In [2]:
from utilities import EvalClfPipeParams

In [3]:
data_path = '../data'

In [4]:
seed = 0

test_size = 0.25

In [5]:
pipelines_instructions = {}

# 2. Data Import

In [6]:
preprocessed_df_filename = 'df_preprocessed.parquet'

df = pd.read_parquet(os.path.join(data_path, 'intermediate', preprocessed_df_filename))

# 3. Data Segmentation

In [7]:
overall_test_size = 0.25

split_point = int(round(len(df)*(1-overall_test_size)))


In [8]:
df_train = df.iloc[:split_point].copy()

# 4. Evaluate Parameters

Vectorizer hyperparameters:
- ngram range -> minimum and maximum n-gram length, in number of words
- max features -> number of n-grams (the most frequent ones) to keep as features
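
To make the two vectorizer hyperparameters concrete, here is a minimal pure-Python sketch of what an n-gram range and a feature cap do. It is an illustration, not scikit-learn's implementation: `extract_ngrams`, `build_vocabulary`, and the toy corpus are hypothetical, and the alphabetical tie-breaking between equally frequent n-grams is an assumption.

```python
from collections import Counter


def extract_ngrams(tokens, ngram_range):
    """Return all word n-grams with n between min_n and max_n (inclusive)."""
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(docs, ngram_range, max_features):
    """Keep the max_features n-grams with the highest total count."""
    counts = Counter()
    for doc in docs:
        counts.update(extract_ngrams(doc.split(), ngram_range))
    # Descending count first, then alphabetical order so ties are deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_features]]


docs = ["stock price rises", "stock price falls"]  # toy corpus, not from the dataset

unigrams_bigrams = extract_ngrams(docs[0].split(), (1, 2))
vocabulary = build_vocabulary(docs, (1, 2), max_features=3)

print(unigrams_bigrams)
print(vocabulary)
```

A wider ngram range grows the candidate vocabulary quickly, which is why it is tuned together with the cap on the number of features.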

Model hyperparameters:

DecisionTree: [max_depth, min_samples_leaf]

RandomForest: [n_estimators, max_depth, max_features]

GradientBoosting: [n_estimators, max_depth, learning_rate]


#### Evaluation method: GridSearch
#### Evaluation metrics: Matthews correlation coefficient (MCC), accuracy
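
As a reminder of why MCC is reported alongside accuracy, the sketch below computes both for a binary problem from the confusion counts. It covers the binary case only (the ternary labels need the multiclass generalization), the helper names are hypothetical, and the labels are invented for illustration.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denom if denom else 0.0


y_true = [1] * 9 + [0]  # imbalanced toy labels
y_pred = [1] * 10       # a classifier that always predicts the majority class

print(accuracy(y_true, y_pred), mcc(y_true, y_pred))
```

The always-positive classifier looks good on accuracy but scores an MCC of zero, which is the behaviour that makes MCC the primary ranking metric here.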

In [9]:
vectorizers = ['cv','tfidf']

vectorizers_grid = {'ngram_range':[(1,3), (1,4)],'max_features':[2**9, 2**10]}

models_grids = {'DecisionTreeClassifier': {'fixed':{'random_state':seed}, 'tuning':{'max_depth':[3,4], 'min_samples_leaf':[2**4,2**5]}},
                'RandomForestClassifier': {'fixed':{'random_state':seed}, 'tuning':{'n_estimators':[2**6, 2**7], 'max_depth':[2,3,4], 'max_features':['sqrt','log2']}},
                'GradientBoostingClassifier': {'fixed':{'random_state':seed}, 'tuning':{'n_estimators':[2**6, 2**7], 'max_depth':[2,3,4], 'learning_rate':[0.01,0.05]}}}



cv_test_size = 300

n_cv = 3
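
To get a feel for the size of this search, the sketch below expands the grids from the cell above and counts the candidate pipelines. It assumes every vectorizer setting is crossed with every model setting and refit once per CV split; how `EvalClfPipeParams` actually schedules the fits is not shown in this notebook. `expand_grid` is a hypothetical helper.

```python
from itertools import product


def expand_grid(grid):
    """Turn {'a': [1, 2], 'b': [3]} into [{'a': 1, 'b': 3}, {'a': 2, 'b': 3}]."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


vectorizers = ['cv', 'tfidf']
vectorizers_grid = {'ngram_range': [(1, 3), (1, 4)], 'max_features': [2**9, 2**10]}

# Only the 'tuning' part of each model grid matters for the count
tuning_grids = {
    'DecisionTreeClassifier': {'max_depth': [3, 4], 'min_samples_leaf': [2**4, 2**5]},
    'RandomForestClassifier': {'n_estimators': [2**6, 2**7], 'max_depth': [2, 3, 4],
                               'max_features': ['sqrt', 'log2']},
    'GradientBoostingClassifier': {'n_estimators': [2**6, 2**7], 'max_depth': [2, 3, 4],
                                   'learning_rate': [0.01, 0.05]},
}
n_cv = 3

vect_settings = [(v, p) for v in vectorizers for p in expand_grid(vectorizers_grid)]
model_settings = [(m, p) for m, g in tuning_grids.items() for p in expand_grid(g)]

n_pipelines = len(vect_settings) * len(model_settings)
n_fits = n_pipelines * n_cv

print(len(vect_settings), len(model_settings), n_pipelines, n_fits)
```

Under these assumptions the grid yields 8 feature extraction settings and 28 model settings, so each label type requires 224 pipelines and 672 fits.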

In [10]:
binary_process_eval = EvalClfPipeParams(vectorizers=vectorizers,
                 vects_grid=vectorizers_grid,
                 models_grid=models_grids,
                 text=df_train['clean_bow'].tolist(),
                 y=df_train['binary_label'].tolist(),
                 test_size=cv_test_size,
                 cv=n_cv)

binary_process_eval.eval()

In [11]:
ternary_process_eval = EvalClfPipeParams(vectorizers=vectorizers,
                 vects_grid=vectorizers_grid,
                 models_grid=models_grids,
                 text=df_train['clean_bow'].tolist(),
                 y=df_train['ternary_label'].tolist(),
                 test_size=cv_test_size,
                 cv=n_cv)

ternary_process_eval.eval()

# 5. Parameters selection

In [12]:
eval_l = [('binary', binary_process_eval), ('ternary',ternary_process_eval)]


## 5.1. Vectorizer parameters

For each vectorizer, the selected feature extraction hyperparameters are the ones with the best MCC averaged over all models and model settings.

In [13]:
for clf, pipe_eval in eval_l:

    pipelines_instructions[clf] = {'FeatureExtraction':{}, 'Models': {}}

    df_pipelines_eval = pipe_eval.results.copy()

    df_pipelines_eval['v_par_str'] = df_pipelines_eval['Vect_parameters'].astype('str')

    by_fe = df_pipelines_eval.groupby(['Vectorizer','v_par_str'], as_index=False).agg({'MCC':'mean','accuracy':'mean'}).sort_values(['MCC','accuracy'], ascending=False).reset_index(drop=True)

    for vect in vectorizers:

        pipelines_instructions[clf]['FeatureExtraction'][vect] = literal_eval(by_fe[by_fe['Vectorizer']==vect]['v_par_str'].iloc[0])
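
The group-then-rank logic of the cell above can be sketched without pandas as follows. The rows and scores are invented for illustration and `best_vect_params` is a hypothetical helper. Only the column names mirror the ones used in the notebook.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical evaluation results: one row per (vectorizer setting, model setting)
rows = [
    {'Vectorizer': 'cv', 'Vect_parameters': {'ngram_range': (1, 3), 'max_features': 512}, 'MCC': 0.10},
    {'Vectorizer': 'cv', 'Vect_parameters': {'ngram_range': (1, 3), 'max_features': 512}, 'MCC': 0.20},
    {'Vectorizer': 'cv', 'Vect_parameters': {'ngram_range': (1, 4), 'max_features': 1024}, 'MCC': 0.30},
    {'Vectorizer': 'cv', 'Vect_parameters': {'ngram_range': (1, 4), 'max_features': 1024}, 'MCC': 0.12},
]


def best_vect_params(rows, vectorizer):
    """Return the vectorizer setting with the highest mean MCC across models."""
    scores = defaultdict(list)
    for row in rows:
        if row['Vectorizer'] == vectorizer:
            # Dicts are not hashable, so use a sorted tuple of items as the group key
            key = tuple(sorted(row['Vect_parameters'].items()))
            scores[key].append(row['MCC'])
    best_key = max(scores, key=lambda k: mean(scores[k]))
    return dict(best_key)


best = best_vect_params(rows, 'cv')
print(best)
```

Averaging over models favours a vectorizer setting that works reasonably well everywhere over one that shines with a single model. The notebook achieves the same grouping by casting the parameter dict to a string and parsing it back with `literal_eval`.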


## 5.2. Model parameters

For each model, the selected hyperparameters are the ones with the highest score (MCC) when combined with the vectorizer settings picked in the cell above.

In [14]:
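for clf, pipe_eval in eval_l:

    # Rank rows by score so that .iloc[0] below returns the best-scoring setting
    df_pipelines_eval = pipe_eval.results.sort_values(['MCC','accuracy'], ascending=False)

    for vect in vectorizers:

        selected_fe = pipelines_instructions[clf]['FeatureExtraction'][vect]

        pipelines_instructions[clf]['Models'][vect] = {}

        for model in df_pipelines_eval['Model'].unique():

            # Restrict to the current vectorizer, its selected settings and the current model
            mask = (df_pipelines_eval['Vectorizer']==vect) & \
                   (df_pipelines_eval['Vect_parameters']==selected_fe) & \
                   (df_pipelines_eval['Model']==model)

            pipelines_instructions[clf]['Models'][vect][model] = df_pipelines_eval.loc[mask, 'Model_parameters'].iloc[0]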


In [15]:
print(json.dumps(pipelines_instructions, indent=3))

{
   "binary": {
      "FeatureExtraction": {
         "cv": {
            "ngram_range": [
               1,
               4
            ],
            "max_features": 1024
         },
         "tfidf": {
            "ngram_range": [
               1,
               4
            ],
            "max_features": 1024
         }
      },
      "Models": {
         "cv": {
            "GradientBoostingClassifier": {
               "n_estimators": 128,
               "max_depth": 2,
               "learning_rate": 0.05,
               "random_state": 0
            },
            "RandomForestClassifier": {
               "n_estimators": 128,
               "max_depth": 4,
               "max_features": "sqrt",
               "random_state": 0
            },
            "DecisionTreeClassifier": {
               "max_depth": 4,
               "min_samples_leaf": 32,
               "random_state": 0
            }
         },
         "tfidf": {
            "GradientBoostingClassifier": {
  

# 6. Export

In [16]:
# Use a context manager so the file handle is closed once the dump completes
with open(os.path.join(data_path, 'output', 'NLP_FSA_pipelines_instructions.pkl'), 'wb') as f:
    pickle.dump(pipelines_instructions, f)
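
One practical reason to export with pickle rather than JSON is that the `ngram_range` tuples survive the round trip. The sketch below shows the difference in memory, using a trimmed-down, hypothetical copy of the instructions dictionary.

```python
import json
import pickle

# Trimmed-down, hypothetical copy of the structure built above
instructions = {
    'binary': {
        'FeatureExtraction': {
            'cv': {'ngram_range': (1, 4), 'max_features': 1024}
        }
    }
}

# Round trip through each serialization format without touching the disk
restored_pickle = pickle.loads(pickle.dumps(instructions))
restored_json = json.loads(json.dumps(instructions))

pickle_ngram = restored_pickle['binary']['FeatureExtraction']['cv']['ngram_range']
json_ngram = restored_json['binary']['FeatureExtraction']['cv']['ngram_range']

print(type(pickle_ngram).__name__, type(json_ngram).__name__)
```

This is also why the JSON printout above shows `ngram_range` as a list: a consumer reading a JSON export would need to convert it back to a tuple before passing it to a vectorizer that expects one.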