<h1>Part 5: fine-tuning</h1>

**Loading the necessary libraries and setting display settings**

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import roc_auc_score, fbeta_score, roc_curve, auc, confusion_matrix, ConfusionMatrixDisplay
from xgboost import XGBClassifier
from sklearn.preprocessing import label_binarize, MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

pd.set_option('display.max_colwidth', None)

**Preparing the data for model training (analogous to the previous notebook)**

In [2]:
# Loading the data
df = pd.read_csv("train_final.csv")

# Applying MinMaxScaler so that Logistic Regression converges more quickly
minmax = MinMaxScaler()
features = df.drop(columns=["sentiment"])
df2 = pd.DataFrame(minmax.fit_transform(features), columns=features.columns)
df2 = pd.concat([df2, df[["sentiment"]]], axis=1)

# Splitting the data into training and test sets
train_df, test_df = train_test_split(df2, test_size=0.3, random_state=2092)

# Feature and target selection
X_train = train_df.drop(columns=['sentiment'])
y_train = train_df['sentiment']

X_test = test_df.drop(columns=['sentiment'])
y_test = test_df['sentiment']
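
For reference, min-max scaling maps each feature to the range [0, 1] via x' = (x - min) / (max - min). The sketch below is a plain-Python illustration rather than the notebook's code: it takes the minimum and maximum from the training rows only and reuses them for the test rows, which is one common way to keep test data out of the fitting step. The helper names `fit_min_max` and `apply_min_max` and the toy data are hypothetical.

```python
def fit_min_max(rows):
    """Return per-column (min, max) computed from the given rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, bounds):
    """Scale rows with previously fitted bounds; constant columns map to 0.0."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, (lo, hi) in zip(row, bounds)
        ])
    return scaled


# Toy data (made up for illustration): two features per row
train_rows = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
test_rows = [[2.0, 40.0]]

bounds = fit_min_max(train_rows)            # learned from training rows only
train_scaled = apply_min_max(train_rows, bounds)
test_scaled = apply_min_max(test_rows, bounds)

print(train_scaled)
print(test_scaled)
```

With scikit-learn, the same idea would be `minmax.fit(X_train)` followed by `minmax.transform(...)` on both splits. Note that test values scaled this way can fall outside [0, 1] when they exceed the training range.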

**Fine-tuning Logistic Regression** <br>
Due to computational limitations, we cannot apply more thorough techniques such as cross-validation. Instead, we fine-tune the model with simple loops over penalties and C values, keeping the `saga` solver fixed. First, we cap training at 100 iterations (`max_iter=100`) to see how different penalties and C values affect the per-class F2 score.
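
To make the metric concrete: for each class, F-beta combines precision P and recall R as (1 + β²)·P·R / (β²·P + R), so β = 2 weights recall more heavily than precision. The following plain-Python sketch computes per-class scores on made-up labels. It is meant to mirror what `fbeta_score(..., beta=2, average=None)` reports; `per_class_fbeta` is a hypothetical helper, not part of scikit-learn.

```python
def per_class_fbeta(y_true, y_pred, beta=2.0):
    """Return {label: F-beta}, computed one-vs-rest for every label seen."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # Equivalent to (1 + b^2) * P * R / (b^2 * P + R), written with raw counts
        denom = (1 + beta**2) * tp + beta**2 * fn + fp
        scores[label] = (1 + beta**2) * tp / denom if denom else 0.0
    return scores


# Toy labels (made up): class 2 is the majority class
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 2, 1, 2, 2, 2, 2, 0]

scores = per_class_fbeta(y_true, y_pred)
print(scores)
```

Because missed positives (false negatives) are weighted four times as heavily as false positives when β = 2, a model that rarely predicts a minority class scores poorly on that class even if its overall accuracy is high.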


In [7]:
results = []

# Sweep both penalties over the same grid of C values (saga supports L1 and L2)
for penalty in ["l1", "l2"]:
    for C in [0.01, 0.1, 1, 10, 100]:

        model = LogisticRegression(
            penalty=penalty,
            C=C,
            solver='saga',
            max_iter=100,
            random_state=42,
            n_jobs=-1
        )

        model.fit(X_train, y_train)

        # average=None returns one F2 score per class
        preds_train = model.predict(X_train)
        f2_train = fbeta_score(y_train, preds_train, beta=2, average=None)

        preds_test = model.predict(X_test)
        f2_test = fbeta_score(y_test, preds_test, beta=2, average=None)

        print(f"With {penalty.upper()}, C = {C} we get F2 = {f2_train} on training data, F2 = {f2_test} on testing data")

        results.append({
            "Penalty": penalty.upper(),
            "C": C,
            "F2 Score (Train)": f2_train,
            "F2 Score (Test)": f2_test
        })



With L1, C = 0.01 we get F2 = [0.         0.         0.95880423] on training data, F2 = [0.        0.        0.9593406] on testing data
With L1, C = 0.1 we get F2 = [0.32020285 0.13037142 0.96195167] on training data, F2 = [0.27996071 0.10611561 0.95960616] on testing data
With L1, C = 1 we get F2 = [0.60567291 0.45296977 0.96105124] on training data, F2 = [0.51481143 0.41673647 0.95366142] on testing data
With L1, C = 10 we get F2 = [0.63286015 0.48416146 0.95995554] on training data, F2 = [0.53081305 0.44707049 0.95095301] on testing data
With L1, C = 100 we get F2 = [0.63575088 0.48675532 0.95978674] on training data, F2 = [0.52944984 0.45008251 0.95055134] on testing data
With L2, C = 0.01 we get F2 = [0.01744924 0.025356   0.95949513] on training data, F2 = [0.0112765  0.01960784 0.95977626] on testing data
With L2, C = 0.1 we get F2 = [0.40591367 0.28528245 0.96349664] on training data, F2 = [0.34480376 0.2636493  0.95944404] on testing data
With L2, C = 1 we get F2 = [0.6044825  0.46285083 0.96112061] on training data, F2 = [0.50933786 0.42804797 0.95279096] on testing data
With L2, C = 10 we get F2 = [0.63205357 0.4847308  0.95989238] on training data, F2 = [0.52907654 0.44880264 0.95086193] on testing data
With L2, C = 100 we get F2 = [0.6352459  0.48689733 0.95976973] on training data, F2 = [0.52938131 0.45004538 0.95052096] on testing data

**Displaying the results**

In [8]:
pd.DataFrame(results)

index,Penalty,C,F2 Score (Train),F2 Score (Test)
0,L1,0.01,"[0.0, 0.0, 0.958804228946409]","[0.0, 0.0, 0.9593405964396787]"
1,L1,0.1,"[0.3202028541101545, 0.13037142069400134, 0.9619516710191398]","[0.27996070726915523, 0.10611561016475844, 0.9596061576330996]"
2,L1,1.0,"[0.6056729094076655, 0.4529697662783218, 0.9610512384193609]","[0.5148114315542216, 0.4167364717708159, 0.9536614223273915]"
3,L1,10.0,"[0.6328601500512765, 0.4841614577815447, 0.9599555389013561]","[0.5308130502330399, 0.4470704900421453, 0.950953010926552]"
4,L1,100.0,"[0.6357508762469668, 0.48675531727275934, 0.959786742132326]","[0.5294498381877023, 0.4500825082508251, 0.9505513354569959]"
5,L2,0.01,"[0.017449238578680203, 0.025355998214937725, 0.9594951341690374]","[0.011276499774470004, 0.0196078431372549, 0.9597762590202742]"
6,L2,0.1,"[0.40591366739960655, 0.28528244851690915, 0.9634966418406703]","[0.3448037589828635, 0.26364929752801, 0.9594440360583433]"
7,L2,1.0,"[0.6044824953648162, 0.4628508334520587, 0.9611206139337063]","[0.5093378607809848, 0.4280479680213191, 0.9527909593464117]"
8,L2,10.0,"[0.6320535666072682, 0.48473080317740513, 0.9598923839316916]","[0.5290765444890558, 0.44880264244426094, 0.9508619253422711]"
9,L2,100.0,"[0.6352459016393442, 0.4868973300885268, 0.9597697276992142]","[0.529381309862801, 0.4500453757940764, 0.9505209609597836]"


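As the results show, the small C values (0.01 and 0.1) yield clearly inferior F2 scores on the first two classes: with L1 and C = 0.01, both scores are 0. Let's train the models again with more iterations (`max_iter=500`) for C = 1, 10 and 100.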
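
The effect of C can be read off the penalized objective. Schematically (this is the general shape of a regularized logistic loss as described in the scikit-learn user guide, not a formula taken from this notebook), with $\ell$ the log-loss on sample $i$ and $w$ the coefficients:

$$
\min_{w}\; r(w) + C \sum_{i=1}^{n} \ell\big(y_i, x_i; w\big),
\qquad
r(w) = \lVert w \rVert_1 \ \text{(L1)}
\quad\text{or}\quad
r(w) = \tfrac{1}{2}\lVert w \rVert_2^2 \ \text{(L2)}
$$

C multiplies the data term, so a small C lets the penalty dominate and pushes the coefficients toward zero. This is consistent with the L1, C = 0.01 run, where the F2 score is 0 for the first two classes.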

In [9]:
results2 = []

# Same sweep as before, restricted to C >= 1 and with more iterations
for penalty in ["l2", "l1"]:
    for C in [1, 10, 100]:

        model = LogisticRegression(
            penalty=penalty,
            C=C,
            solver='saga',
            max_iter=500,
            random_state=42,
            n_jobs=-1
        )

        model.fit(X_train, y_train)

        # average=None returns one F2 score per class
        preds_train = model.predict(X_train)
        f2_train = fbeta_score(y_train, preds_train, beta=2, average=None)

        preds_test = model.predict(X_test)
        f2_test = fbeta_score(y_test, preds_test, beta=2, average=None)

        print(f"With {penalty.upper()}, C = {C} we get F2 = {f2_train} on training data, F2 = {f2_test} on testing data")

        results2.append({
            "Penalty": penalty.upper(),
            "C": C,
            "F2 Score (Train)": f2_train,
            "F2 Score (Test)": f2_test
        })



With L2, C = 1 we get F2 = [0.64500485 0.49348033 0.96162915] on training data, F2 = [0.54336603 0.45815087 0.95207755] on testing data
With L2, C = 10 we get F2 = [0.69217336 0.55279352 0.96136378] on training data, F2 = [0.58229127 0.50764103 0.94893311] on testing data
With L2, C = 100 we get F2 = [0.69871761 0.55856325 0.96113123] on training data, F2 = [0.5879017  0.51357925 0.94852362] on testing data
With L1, C = 1 we get F2 = [0.66844208 0.49987662 0.96055022] on training data, F2 = [0.57417793 0.46170985 0.95112162] on testing data
With L1, C = 10 we get F2 = [0.6948848  0.55327655 0.96116468] on training data, F2 = [0.58764187 0.50836446 0.94872375] on testing data
With L1, C = 100 we get F2 = [0.69875449 0.55916031 0.96112635] on training data, F2 = [0.5879758  0.51402627 0.94860337] on testing data




**Displaying the results**

In [10]:
pd.DataFrame(results2)

index,Penalty,C,F2 Score (Train),F2 Score (Test)
0,L2,1,"[0.6450048496605237, 0.4934803349941694, 0.9616291505612679]","[0.5433660299432111, 0.45815087168470625, 0.9520775541810839]"
1,L2,10,"[0.6921733608509287, 0.5527935194520739, 0.9613637803924883]","[0.5822912719464444, 0.5076410339782149, 0.9489331142831846]"
2,L2,100,"[0.698717610427991, 0.5585632483081728, 0.961131234915952]","[0.5879017013232514, 0.5135792460478313, 0.948523622047244]"
3,L1,1,"[0.6684420772303595, 0.49987661719603765, 0.9605502227955466]","[0.5741779250573541, 0.46170985311107443, 0.9511216218868369]"
4,L1,10,"[0.6948848023673642, 0.5532765513645055, 0.9611646793371249]","[0.587641866330391, 0.5083644632126035, 0.9487237517958003]"
5,L1,100,"[0.6987544859615791, 0.5591603053435115, 0.9611263509829365]","[0.5879758003529115, 0.5140262688503324, 0.9486033739493317]"


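Logistic Regression with the L1 penalty and C = 100 achieved the best test F2 on the first two classes (0.588 and 0.514), although only marginally ahead of L2 with C = 100. Let's train it again alongside XGBoost and try different ensemble models.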
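
The choice can also be checked programmatically. The sketch below ranks the six test-set results from the `max_iter=500` run by the unweighted mean of the three per-class F2 scores. This macro average is an assumed selection criterion, not one stated in the notebook; the scores are copied from the output above and rounded to five decimals.

```python
# Test-set F2 per class, copied from the results above (rounded to 5 decimals)
test_scores = {
    ("L2", 1): [0.54337, 0.45815, 0.95208],
    ("L2", 10): [0.58229, 0.50764, 0.94893],
    ("L2", 100): [0.58790, 0.51358, 0.94852],
    ("L1", 1): [0.57418, 0.46171, 0.95112],
    ("L1", 10): [0.58764, 0.50836, 0.94872],
    ("L1", 100): [0.58798, 0.51403, 0.94860],
}


def macro_mean(per_class):
    """Unweighted mean over classes (assumed selection criterion)."""
    return sum(per_class) / len(per_class)


# Sort configurations from best to worst macro F2
ranking = sorted(test_scores, key=lambda cfg: macro_mean(test_scores[cfg]), reverse=True)
best_penalty, best_C = ranking[0]
print(f"Best by macro F2: penalty={best_penalty}, C={best_C}")
```

Under this criterion the margin over L2 with C = 100 is about 0.0002 in macro F2, so the two configurations are practically tied.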

**Creating ensemble models**

In [None]:
linear_model = LogisticRegression(
        penalty="l1",
        C=100,
        solver='saga',
        max_iter=500,
        random_state=42,
        n_jobs=-1
    )

print("Training linear model")
linear_model.fit(X_train, y_train)
print("Finished training linear model")
linear_preds = linear_model.predict(X_test)
linear_f2 = fbeta_score(y_test, linear_preds, beta=2, average=None)
print(f"F2 for linear model: {linear_f2}")

xg_model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42)
print("Training XGBoost")
xg_model.fit(X_train, y_train)
print("Finished training XGBoost")
xg_preds = xg_model.predict(X_test)
xg_f2 = fbeta_score(y_test, xg_preds, beta=2, average=None)
print(f"F2 for XGBoost: {xg_f2}")

Training linear model
Finished training linear model
F2 for linear model: [0.5879758  0.51402627 0.94860337]
Training XGBoost
Finished training XGBoost
F2 for XGBoost: [0.46598998 0.52673021 0.96063354]


In [6]:
import itertools

weights_list = [(x, y) for x, y in itertools.product(
    [round(i * 0.1, 1) for i in range(1, 10)],
    [round(i * 0.1, 1) for i in range(1, 10)]
)]

proba_lr = linear_model.predict_proba(X_test)
proba_xgb = xg_model.predict_proba(X_test)

results = []

for w_lr, w_xgb in weights_list:

    ensemble_proba = w_lr * proba_lr + w_xgb * proba_xgb

    y_pred = np.argmax(ensemble_proba, axis=1)

    fbeta_each = fbeta_score(y_test, y_pred, beta=2, average=None, zero_division=0)
    fbeta_macro = fbeta_score(y_test, y_pred, beta=2, average='macro', zero_division=0)

    results.append({
        'w_lr': w_lr,
        'w_xgb': w_xgb,
        'fbeta_0': fbeta_each[0],
        'fbeta_1': fbeta_each[1],
        'fbeta_2': fbeta_each[2],
        'fbeta_macro': fbeta_macro
    })

res = pd.DataFrame(results)

In [8]:
top5_fbeta0 = res.nlargest(5, 'fbeta_0')
top5_fbetamacro = res.nlargest(5, 'fbeta_macro')

In [9]:
top5_fbeta0

index,w_lr,w_xgb,fbeta_0,fbeta_1,fbeta_2,fbeta_macro
72,0.9,0.1,0.58998,0.517312,0.951058,0.686116
45,0.6,0.1,0.589944,0.516411,0.95196,0.686105
54,0.7,0.1,0.589646,0.517199,0.951348,0.686065
63,0.8,0.1,0.589646,0.517635,0.951137,0.68614
73,0.9,0.2,0.587193,0.515266,0.953111,0.68519


In [10]:
top5_fbetamacro

index,w_lr,w_xgb,fbeta_0,fbeta_1,fbeta_2,fbeta_macro
63,0.8,0.1,0.589646,0.517635,0.951137,0.68614
72,0.9,0.1,0.58998,0.517312,0.951058,0.686116
45,0.6,0.1,0.589944,0.516411,0.95196,0.686105
54,0.7,0.1,0.589646,0.517199,0.951348,0.686065
73,0.9,0.2,0.587193,0.515266,0.953111,0.68519


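The chosen weights are [w_lr, w_xgb] = [0.9, 0.1], the combination with the highest class-0 F2 (0.58998) and a macro F2 (0.686116) within 0.00003 of the best.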
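
For clarity, the weighted soft vote used in the weight grid can be sketched in plain Python as follows. The probabilities are made up and `soft_vote` is a hypothetical helper; the notebook itself performs the same blend with NumPy arrays and `np.argmax`. The last call illustrates that only the ratio of the two weights affects the predicted class, since scaling both weights by the same positive factor does not change the argmax.

```python
def soft_vote(proba_a, proba_b, w_a, w_b):
    """Blend two lists of class-probability rows and return the argmax class per row."""
    predictions = []
    for row_a, row_b in zip(proba_a, proba_b):
        blended = [w_a * pa + w_b * pb for pa, pb in zip(row_a, row_b)]
        predictions.append(max(range(len(blended)), key=blended.__getitem__))
    return predictions


# Made-up probabilities for three samples and three classes
proba_lr = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
proba_xgb = [[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.3, 0.4]]

preds_lr_heavy = soft_vote(proba_lr, proba_xgb, 0.9, 0.1)
preds_xgb_heavy = soft_vote(proba_lr, proba_xgb, 0.1, 0.9)
preds_rescaled = soft_vote(proba_lr, proba_xgb, 9.0, 1.0)  # same ratio as (0.9, 0.1)

print(preds_lr_heavy, preds_xgb_heavy, preds_rescaled)
```

One consequence is that the 9 × 9 weight grid contains redundant pairs: for example, (0.1, 0.1) and (0.9, 0.9) yield the same predictions, up to floating-point ties.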

**Note:** the final ensemble model is trained in `train_model.py`,
and all of its performance metrics are written to `results.txt`.