## Modeling

Our goal is to predict the score of a comment as accurately as possible, given:
* Its usual features (day, time, etc.; see Notebook 1)
* Its network features (score of the parent comment, etc.; see Notebook 01)
* Its textual content: see the text-mining notebooks 21 (sentiment analysis), 2 (preprocessing), 22 (word association) and 3 (TF-IDF).

We chose a LightGBM regression model, as gradient-boosted tree models of this kind are frequently among the top performers in Kaggle competitions.

### Set-up

In [None]:
#If the lightgbm library is not installed:
#!pip install lightgbm

In [1]:
import pickle

import numpy as np
import pandas as pd

from scipy.stats import norm
from scipy.sparse import csr_matrix
from scipy.sparse import hstack

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import mean_squared_log_error

import lightgbm as lgb

Input your repository path here:

In [6]:
repsource = "/Users/alicetourret/Downloads/au_secours/data/"

In [7]:
# The context manager closes the file even if unpickling fails.
with open(repsource + "df_body_cleaned", "rb") as file1:
    df_cleaned = pickle.load(file1)

### To be executed

In [8]:
df_cleaned=df_cleaned.drop(["author","link_id","name","parent_id","popularity","top_comment","active_user2","active_user3",
                            "active_user4","parent_seg1","parent_seg2","parent_seg3","parent_seg4","monday","tuesday",
                            "thursday","friday", "saturday","sunday","wednesday"], axis=1)

In [None]:
df_cleaned.sample(1)

In [None]:
df_cleaned.dtypes

In [None]:
df_cleaned["day_week"] = df_cleaned["day_week"].astype("int64")

In [None]:
def test_split(dataset):
    # Rows without a target (`ups` is NaN) are the ones to predict.
    return dataset[dataset.ups.isna()]

def train_split(dataset):
    # Rows with a known target are used for training.
    return dataset[dataset.ups.notna()]

In [None]:
Train_DF = train_split(df_cleaned)
Test_DF = test_split(df_cleaned)

In [None]:
print(Train_DF.shape)
print(Test_DF.shape)
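
The split above relies on the convention that rows with a missing `ups` value are the ones to predict. A minimal sketch of that convention on a toy frame (the values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data: rows with a missing target stand in for the unlabeled test comments.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "ups": [3.0, np.nan, 12.0, np.nan],
})

has_target = toy["ups"].notna()
toy_train = toy[has_target]    # labeled rows, used for fitting
toy_test = toy[~has_target]    # unlabeled rows, to be predicted

# The two parts should partition the original frame.
assert len(toy_train) + len(toy_test) == len(toy)
print(toy_train["id"].tolist(), toy_test["id"].tolist())
```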

In [None]:
test_id = Test_DF.id

In [None]:
train_target = Train_DF['ups']
train_feats = Train_DF[Train_DF.columns.difference(['ups', 'body', 'id'])]
test_target = Test_DF['ups']
test_feats = Test_DF[Test_DF.columns.difference(['ups', 'body', 'id'])]

In [None]:
print(train_feats.shape)
print(test_feats.shape)

Convert the dense feature frames to sparse matrices so that they can be combined with the TF-IDF features.

In [None]:
train_feats_mat = csr_matrix(train_feats.values)
test_feats_mat = csr_matrix(test_feats.values)

Load the TF-IDF word features (train matrix first, then test matrix):

In [None]:
# Both matrices were pickled one after the other into the same file.
with open(repsource + "word_features", "rb") as file2:
    train_word_features = pickle.load(file2)
    test_word_features = pickle.load(file2)

In [None]:
X = hstack([train_feats_mat,train_word_features])
testing = hstack([test_feats_mat,test_word_features])
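
To make the combination step concrete, here is a small sketch of what `hstack` does with two sparse blocks. The matrices are toy stand-ins (3 comments, 2 tabular features, 4 TF-IDF terms), not the real features:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins: 3 comments, 2 tabular features, 4 TF-IDF terms.
tabular = csr_matrix(np.array([[1.0, 0.0],
                               [0.0, 2.0],
                               [3.0, 0.0]]))
tfidf = csr_matrix(np.array([[0.0, 0.5, 0.0, 0.0],
                             [0.1, 0.0, 0.0, 0.0],
                             [0.0, 0.0, 0.0, 0.9]]))

# hstack requires the same number of rows; columns are concatenated in order.
assert tabular.shape[0] == tfidf.shape[0]
combined = hstack([tabular, tfidf], format="csr")

print(combined.shape)  # (3, 6)
```

Note that `hstack` only checks the row counts. Row i of both blocks must describe the same comment, so the TF-IDF matrices have to be built in the same row order as `Train_DF` and `Test_DF`.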

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X,train_target,test_size=0.33, random_state=2021)

In [None]:
print(X_train.shape)
print(X_val.shape)
print(y_train.shape)
print(y_val.shape)

In [None]:
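hyper_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    # 'auc' is a classification/ranking metric, so it is dropped for this regression target.
    'metric': 'l2',
    'learning_rate': 0.01,
    'verbose': 0,
    "max_depth": 35,
    "num_leaves": 500,
    # `num_iterations` and `n_estimators` are aliases for the same setting;
    # only one is kept so the number of boosting rounds is unambiguous.
    "n_estimators": 1000
}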

In [None]:
gbm = lgb.LGBMRegressor(**hyper_params)

In [None]:
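gbm.fit(X_train, y_train,
        eval_set=[(X_val, y_val)],
        eval_metric='l1',
        # The `early_stopping_rounds` argument of fit() was removed in LightGBM 4.0;
        # the callback form works in recent 3.x releases and in 4.x.
        callbacks=[lgb.early_stopping(stopping_rounds=500)])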
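
Before predicting on the test set, it can help to score the fitted model on the held-out validation split. The sketch below computes RMSE and MAE with NumPy only. The helper name `regression_report` is hypothetical, and the choice of metrics is an assumption, since the notebook does not state which metric the leaderboard uses:

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return RMSE and MAE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    return {
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "mae": float(np.mean(np.abs(errors))),
    }

# Toy values; in the notebook these would be y_val and gbm.predict(X_val).
report = regression_report([1, 5, 10], [2, 5, 7])
print(report)
```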

**SUBMISSION**

In [None]:
test_pred = gbm.predict(testing)

In [None]:
submission = pd.DataFrame({'id': test_id,'predicted': test_pred}).astype(dtype={'id': 'string','predicted': 'float64'})

In [None]:
submission.head()

In [None]:
submission.dtypes

In [None]:
submission.to_csv(repsource+"submission.csv",index=False)
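
A quick sanity check before uploading can catch a malformed file early. This is a minimal sketch: the helper `check_submission` is hypothetical, and the checks assume the expected format is exactly the two columns `id` and `predicted` built above:

```python
import pandas as pd

def check_submission(sub, expected_rows):
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(sub.columns) == ["id", "predicted"], "unexpected columns"
    assert len(sub) == expected_rows, "row count differs from the test set"
    assert sub["id"].is_unique, "duplicate ids"
    assert sub["predicted"].notna().all(), "missing predictions"
    return True

# Toy frame; in the notebook this would be `submission` and len(Test_DF).
toy_submission = pd.DataFrame({"id": ["t1", "t2"], "predicted": [4.2, 0.7]})
ok = check_submission(toy_submission, expected_rows=2)
print(ok)
```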

**Remarks**
* We also trained the model without the sentiment variables (positive, neutral, negative comment). Removing them does not degrade performance: these scores are interesting for the textual analysis but contribute little to the current model.
* We could also tune the hyperparameters, but we think the current model would benefit most from better features. This model obtains a private score of 14.75235 and a public score of 14.90260.
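
If hyperparameter tuning is attempted later, a random search is one simple starting point. The sketch below is not what was run for the scores above. The function `random_search`, the search space values and the stand-in objective `toy_score` are all hypothetical; in the notebook, the scoring function would fit `lgb.LGBMRegressor(**params)` on `(X_train, y_train)` and return the error on `(X_val, y_val)`:

```python
import random

def random_search(score_fn, space, n_trials=20, seed=0):
    """Sample n_trials configurations from `space`; return the lowest-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        # Draw one candidate value per hyperparameter, independently.
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space; the candidate values are illustrative only.
space = {
    "learning_rate": [0.01, 0.05, 0.1],
    "num_leaves": [31, 127, 500],
    "max_depth": [-1, 10, 35],
}

# Stand-in objective so the sketch runs without training anything.
def toy_score(params):
    return abs(params["learning_rate"] - 0.05) + abs(params["num_leaves"] - 127) / 1000

best_params, best_score = random_search(toy_score, space, n_trials=50)
print(best_params, best_score)
```

Fixing the seed makes the search reproducible, which matters when comparing runs with and without a given feature group.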