
## Task 1: Logistic Regression

### Key Task Deliverables
1. Code implementation of the Logistic Regression model.
2. Prediction made by your Logistic Regression on the Test set. Note that you are welcome to submit your predicted labels to Kaggle but you will need to submit the final prediction output in the final project submission. Please label the file as "LogRed_Prediction.csv".

### Notations

- `n` : number of features
- `m` : number of training examples
- `X` : input data matrix of shape (`m` x `n`)
- `y` : true/target value (0 or 1)
- `x(i), y(i)` : ith training example
- `w` : weights (parameters) of shape (`n` x 1)
- `b` : bias (parameter), a real number that can be broadcasted
- `y_hat`: Hypothesis (output values between 0 and 1)
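
With this notation, the model, its loss, and the gradients can be written out as below. This is the standard binary cross-entropy formulation, given here as a reference; the symbols $\sigma$ and $L$ are introduced for this summary and do not appear in the notation list above. The two gradient expressions correspond to `dw` and `db` in `gradients(X, y, y_hat)`.

```latex
\hat{y} = \sigma(Xw + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

L(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)}
          + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]

\frac{\partial L}{\partial w} = \frac{1}{m} X^{\top} (\hat{y} - y), \qquad
\frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)
```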

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
import pandas as pd
from sklearn.metrics import f1_score
 
def sigmoid(z):
  return 1.0/(1 + np.exp(-z))

def loss(y, y_hat):
  # Binary cross-entropy: the two log terms are added, then the mean is negated.
  loss = -np.mean(y*(np.log(y_hat)) + (1-y)*np.log(1-y_hat))
  return loss

def gradients(X, y, y_hat):
  # X --> Input.
  # y --> true/target value.
  # y_hat --> hypothesis/predictions.
  # w --> weights (parameter).
  # b --> bias (parameter).
  
  # m-> number of training examples.
  m = X.shape[0]
  
  # Gradient of loss w.r.t weights.
  dw = (1/m)*np.dot(X.T, (y_hat - y))
  
  # Gradient of loss w.r.t bias.
  db = (1/m)*np.sum((y_hat - y)) 
  
  return dw, db

def plot_decision_boundary(X, w, b):
    
    # X --> Inputs
    # w --> weights
    # b --> bias
    
    # The Line is y=mx+c
    # So, Equate mx+c = w.X + b
    # Solving we find m and c
    x1 = [min(X[:,0]), max(X[:,0])]
    m = -w[0]/w[1]
    c = -b/w[1]
    x2 = m*x1 + c
    
    # Plotting
    fig = plt.figure(figsize=(10,8))
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "g^")
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs")
    plt.xlim([-2, 2])
    plt.ylim([0, 2.2])
    plt.xlabel("feature 1")
    plt.ylabel("feature 2")
    plt.title('Decision Boundary')
    plt.plot(x1, x2, 'y-')

def normalize(X):
    
    # X --> Input.
    
    # m-> number of training examples
    # n-> number of features 
    m, n = X.shape
    
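    # Normalizing all n features of X in one vectorized step (column-wise z-score).
    # A loop over features is unnecessary because mean/std are taken along axis 0.
    X = (X - X.mean(axis=0))/X.std(axis=0)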
        
    return X

def train(X, y, bs, epochs, lr):
  # X --> Input.
  # y --> true/target value.
  # bs --> Batch Size.
  # epochs --> Number of iterations.
  # lr --> Learning rate.
      
  # m-> number of training examples
  # n-> number of features 
  m, n = X.shape
  
  # Initializing weights and bias to zeros.
  w = np.zeros((n,1))
  b = 0
  
  # Reshaping y.
  y = y.reshape(m,1)
  
  # Normalizing the inputs.
  # x = normalize(X)
  
  # Empty list to store losses.
  losses = []
  
  # Training loop.
  for epoch in range(epochs):
      for i in range((m-1)//bs + 1):
          
          # Defining batches. SGD.
          start_i = i*bs
          end_i = start_i + bs
          xb = X[start_i:end_i]
          yb = y[start_i:end_i]
          
          # Calculating hypothesis/prediction.
          y_hat = sigmoid(np.dot(xb, w) + b)
          
          # Getting the gradients of loss w.r.t parameters.
          dw, db = gradients(xb, yb, y_hat)
          
          # Updating the parameters.
          w -= lr*dw
          b -= lr*db
      
      # Calculating loss and appending it in the list.
      l = loss(y, sigmoid(np.dot(X, w) + b))
      # print(l)
      losses.append(l)
      
  # returning weights, bias and losses(List).
  return w, b, losses

def predict(X):
  # X --> Input.
  # Normalizing the inputs.
  # x = normalize(X)
  
  # Calculating predictions/y_hat using the global w and b returned by train().
  preds = sigmoid(np.dot(X, w) + b)
  
  # if y_hat > 0.5 --> predict class 1
  # otherwise      --> predict class 0
  pred_class = [1 if i > 0.5 else 0 for i in preds]
  
  return np.array(pred_class)
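
One way to gain confidence in hand-written gradients is a finite-difference check: perturb each weight slightly, measure the change in the cross-entropy loss, and compare against the analytic gradient. The sketch below is self-contained and runs on small synthetic data; the helper names `bce_loss`, `analytic_grads`, and `numeric_grad_w` are hypothetical and are not part of the notebook's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y, y_hat):
    # Binary cross-entropy: both log terms are added inside the mean.
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_grads(X, y, w, b):
    # Same formulas as gradients(): (1/m) X^T (y_hat - y) and (1/m) sum(y_hat - y).
    m = X.shape[0]
    y_hat = sigmoid(X @ w + b)
    dw = (X.T @ (y_hat - y)) / m
    db = np.sum(y_hat - y) / m
    return dw, db

def numeric_grad_w(X, y, w, b, eps=1e-6):
    # Central differences, one weight at a time.
    grad = np.zeros_like(w)
    for j in range(w.shape[0]):
        step = np.zeros_like(w)
        step[j, 0] = eps
        plus = bce_loss(y, sigmoid(X @ (w + step) + b))
        minus = bce_loss(y, sigmoid(X @ (w - step) + b))
        grad[j, 0] = (plus - minus) / (2 * eps)
    return grad

# Small synthetic problem (assumed data, not the TF-IDF features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (rng.random((50, 1)) < 0.4).astype(float)
w_demo = rng.normal(size=(3, 1)) * 0.1
b_demo = 0.05

dw, db = analytic_grads(X_demo, y_demo, w_demo, b_demo)
max_abs_diff = np.max(np.abs(dw - numeric_grad_w(X_demo, y_demo, w_demo, b_demo)))
print(max_abs_diff)  # should be tiny if loss and gradients agree
```

If the loss and the gradient formulas disagreed (for example through a sign error in one of the log terms), the difference would be far from zero.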


In [2]:
train_df = pd.read_csv("train_tfidf_features.csv")
test_df = pd.read_csv("test_tfidf_features.csv")


In [3]:
test_df.head(5)
# Fraction of training examples labelled 1 (class balance)
print(sum(train_df["label"])/len(train_df["label"]))

0.38122672253258844


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

X = train_df.drop(['id','label'], axis = 1).values
y = train_df['label'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

In [5]:
from sklearn.linear_model import LogisticRegression
# sk-learn's implementation

reg = LogisticRegression().fit(X_train, y_train)
y_pred_SK = reg.predict(X_test)
f1_SK = f1_score(y_test, y_pred_SK, average='macro')
print("F1 Score for scikitlearn's logistic regression: ")
print(f1_SK)

F1 Score for scikitlearn's logistic regression: 
0.6752934368606008


In [13]:
w, b, l = train(X_train, y_train, bs=100, epochs=1000, lr=0.1)
y_pred = predict(X_test)
f1 = f1_score(y_test, y_pred, average='macro')
print("F1 Score for our logistic regression: ")
print(f1)

F1 Score for our logistic regression: 
0.6822884427261123


In [14]:
idColDf = test_df["id"]
X_submit_test = test_df.drop(['id'], axis = 1).values

# Submission using OUR logistic regression weights
print("training...")
w, b, l = train(X, y, bs=100, epochs=1000, lr=0.1)
print("predicting...")
y_submit_pred = predict(X_submit_test)
y_submit_pred_df = pd.DataFrame(y_submit_pred)
print("concatenating...")
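submission = pd.concat([idColDf,y_submit_pred_df],axis=1)
# Name the columns explicitly; otherwise the prediction column is written as "0".
submission.columns = ['id','label']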
print("saving...")
submission.to_csv('LogRed_Prediction.csv', index=False)

# Takes approximately 25+ minutes to run


predicting...
concatenating...
saving...



## Task 2: PCA

### Key Task Deliverables
1. Code implementation of PCA on the train and test sets. Note that you are allowed to use the sklearn package for this task.
2. Report the Macro F1 scores for applying 2000, 1000, 500, and 100 components on the test set. Note that you will have to submit your predicted labels to Kaggle to retrieve the Macro F1 scores for the test set and report the results in your final report. Use KNN as the machine learning model for your training and prediction (You are also allowed to use the sklearn package for KNN implementation) (set n_neighbors=2).
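
Since every score in this report is a Macro F1, a small worked example may help: the F1 score is computed separately for each class and the per-class scores are averaged without weighting by class frequency. The sketch below is pure Python with made-up labels; `f1_for_class` and `macro_f1` are hypothetical helpers meant to mirror what `f1_score(..., average='macro')` computes, not a replacement for it.

```python
def f1_for_class(y_true, y_pred, positive):
    # Treat `positive` as the positive class and everything else as negative.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    # Unweighted mean of the per-class F1 scores.
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Made-up labels: four negatives, two positives.
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 0]

f1_class0 = f1_for_class(y_true_demo, y_pred_demo, 0)  # 0.75
f1_class1 = f1_for_class(y_true_demo, y_pred_demo, 1)  # 0.5
score_demo = macro_f1(y_true_demo, y_pred_demo)        # 0.625
print(f1_class0, f1_class1, score_demo)
```

Because both classes count equally, a model that does well on the majority class but poorly on the minority class is penalized more than it would be under plain accuracy.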







In [11]:
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

for num in [2000,1000,500,100] :
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
  neigh = KNeighborsClassifier(n_neighbors=2)
  pca = PCA(n_components=num)
  pca.fit(X_train)
  
  X_train = pca.transform(X_train)
  X_test = pca.transform(X_test)
  
  neigh.fit(X_train, y_train)
  y_pred = neigh.predict(X_test)
  f1 = f1_score(y_test, y_pred, average='macro')
  print(f"F1 Score for {num} Components: {f1}")

# Report the Macro F1 scores for applying 2000, 1000, 500, and 100 components on the test set
# Use KNN as the machine learning model for your training and prediction (You are also allowed to use the sklearn package for KNN implementation) (set n_neighbors=2)

F1 Score for 2000 Components: 0.48337877074774893
F1 Score for 1000 Components: 0.5665376732277406
F1 Score for 500 Components: 0.5415714030120706
F1 Score for 100 Components: 0.5489194994537527


### Results:
- F1 Score for 2000 Components: 0.48337877074774893
- F1 Score for 1000 Components: 0.5665376732277406
- F1 Score for 500 Components: 0.5415714030120706
- F1 Score for 100 Components: 0.5489194994537527

### Analysis:
- F1_score is lowest at 2000 Components, and highest at 1000 Components.
- For our train-test split, the number of components did not appear to have a strong correlation with the F1_score.
- Higher scores with fewer components could be due to the removal of substantial noise, even though we might expect the opposite because fewer components retain less of the explained variance.
- This may be because the dataset has a high amount of noise that can be removed to yield more predictive features.
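
To make the link between component count and explained variance concrete, the sketch below computes the explained-variance ratio of each principal component and finds how many components are needed to reach a target share. It uses a NumPy SVD on synthetic data as a stand-in for `sklearn.decomposition.PCA` (whose `explained_variance_ratio_` attribute reports the same quantity); the helper names and the data are assumptions for illustration only.

```python
import numpy as np

def explained_variance_ratio(X):
    # Center the data, then use singular values: variance along each
    # principal axis is proportional to the squared singular value.
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var / var.sum()

def components_for_variance(ratios, target):
    # Smallest k such that the first k components reach `target`.
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, target) + 1)

# Synthetic data: 5 latent factors spread over 40 features, plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 40))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 40))

ratios = explained_variance_ratio(X_demo)
k95 = components_for_variance(ratios, 0.95)
print(k95, ratios[:6])
```

On data like this, the components beyond the true latent dimension carry almost only noise, which is one plausible reason a nearest-neighbour model can score better after they are dropped.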


## Task 3: Machine Learning Model for Hate Text Classification

### Key Task Deliverables

1. Code implementation of all the models that you have tried. Please include comments on your implementation (i.e., tell us the models you have used and list the key hyperparameter settings).

2. Submit your predicted labels for the test set to Kaggle. You will be able to see your model performance on the public leaderboard. Please make your submission under your registered team name! We will award the points according to the ranking of the registered team name.



### 💡 Multinomial NB   


 ##### 👉    What is the model ? 

  * **Multinomial NB** implements the naive Bayes algorithm for multinomially distributed data and is one of the two classic naive Bayes variants used in text classification.

           

    

    
 ##### 👉    Why we chose to test this model ? 
  
 * We tested this model because of the paper 'A comparison of classification algorithms for hate speech detection' (**Putri et al., 2020**), which reports that the Multinomial Naive Bayes algorithm produced the best model for hate speech classification, with the highest recall of **93.2%** and an accuracy of **71.2%**.

### 💡 Hyperparameters of the model 
* **alpha:** float, default = 1.0
<br> Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

* **fit_prior:** bool, default = True
<br> Whether to learn class prior probabilities or not. If false, a uniform prior will be used.

* **class_prior:** array-like of shape (n_classes,), default = None <br> Prior probabilities of the classes. If specified, the priors are not adjusted according to the data
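
The effect of `alpha` can be seen in a small example. In multinomial naive Bayes, the smoothed probability of feature i in a class is (N_i + alpha) / (N + alpha * n), where N_i is the summed feature value for that class, N is the sum over all features, and n is the number of features. The sketch below applies this formula to made-up counts; `smoothed_feature_probs` is a hypothetical helper, and with TF-IDF inputs the "counts" would be fractional sums rather than integers.

```python
def smoothed_feature_probs(feature_counts, alpha):
    # Additive (Laplace/Lidstone) smoothing for one class.
    n_features = len(feature_counts)
    total = sum(feature_counts)
    return [(count + alpha) / (total + alpha * n_features)
            for count in feature_counts]

# Made-up per-class feature totals; the second feature never occurs.
counts_demo = [3, 0, 1]

probs_small = smoothed_feature_probs(counts_demo, alpha=1.0)  # [4/7, 1/7, 2/7]
probs_large = smoothed_feature_probs(counts_demo, alpha=5.0)  # [8/19, 5/19, 6/19]
print(probs_small)
print(probs_large)
```

A larger `alpha` pulls the estimates towards a uniform distribution, so unseen features are penalized less and frequent features count for less.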

### 💡 Hyperparameters tuning 

   *  We are utilizing the **Optuna** hyperparameter optimization framework to tune our hyperparameters.

   
 ##### 👉    Why we chose this framework ?

  *  Optuna boasts the following features: 
      - **Lightweight, versatile, and platform agnostic architecture**
           - Handle a wide variety of tasks with a simple installation that has few requirements.
       - **Pythonic search spaces**
            - Define search spaces using familiar Python syntax including conditionals and loops.
       - **Efficient optimization algorithms**
            - Adopt state-of-the-art algorithms for sampling hyperparameters and efficiently pruning unpromising trials.
       - **Easy parallelization**
            - Scale studies to tens or hundreds of workers with little or no changes to the code.
       - **Quick visualization**
            - Inspect optimization histories from a variety of plotting functions.
     
     Optuna also allowed us to find good parameters faster than grid search, though with a possible trade-off in accuracy. 



### 💡 Results 

##### 👉    F1 score of Model on public dataset (20%) 

* **0.7100**
 

Kaggle Notebook Link: https://www.kaggle.com/visshalnatarajan/multinomial-nb-with-optuna-920b9f62



In [None]:
# Model Implementation

# Imports
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
import pandas as pd
import optuna
import matplotlib.pyplot as plt
plt.style.use("bmh")

In [None]:
# Reading the data
print("reading CSVs")
train_df = pd.read_csv("train_tfidf_features.csv")
test_df = pd.read_csv("test_tfidf_features.csv")

print("creating X and y")
X = train_df.drop(['id','label'], axis = 1).values
y = train_df['label'].values


In [None]:
def objective(trial):
    cv = RepeatedStratifiedKFold(n_splits=3,n_repeats=3,random_state=1)
    
    ### define the search space (the study maximizes mean macro F1)
    alpha = trial.suggest_float('alpha',2.0,5.0)
    fit_prior =  trial.suggest_categorical('fit_prior', [True, False])

    ### modeling with suggested params
    model = MultinomialNB(alpha = alpha,fit_prior = fit_prior)

    ## cross validation score
    score = cross_val_score(model, X, y, n_jobs=2, cv=cv, scoring="f1_macro")
    f1_mean = score.mean()
    return f1_mean
    
study = optuna.create_study(direction='maximize') # maximize mean macro F1
study.optimize(objective, n_trials=None, timeout=42000, n_jobs = 2,)

In [None]:
study.best_trial.params

In [None]:
FILE_NAME= '/kaggle/working/Winner.csv'
model =  MultinomialNB(alpha =study.best_trial.params['alpha'],fit_prior = study.best_trial.params['fit_prior'],)
model.fit(X,y)
idColDf = test_df["id"]
X_submit_test = test_df.drop(['id'], axis = 1).values

# Submission using the tuned MultinomialNB model
y_submit_pred = model.predict(X_submit_test)
y_submit_pred_df = pd.DataFrame(y_submit_pred)
submission = pd.concat([idColDf,y_submit_pred_df],axis=1)
submission.columns = ['id','label']
submission.to_csv(FILE_NAME, index=False)

---
### 💡 XGBoost


 ##### 👉    What is the model ?

  * **XGBoost** (eXtreme Gradient Boosting) is a tree-based ensemble model that implements gradient-boosted decision trees (GBDT). Unlike a random forest, XGBoost uses “boosting”, which establishes a connection between trees: each new tree is fitted to correct the errors of the trees before it. The trees in an XGBoost model are therefore no longer independent of each other, and the model becomes an ordered, collective decision-making system.


 ##### 👉    Why we chose to test this model ?
 * XGBoost is an ensemble algorithm with strong predictive power and performance, achieved by improving on the gradient boosting framework with accurate approximation algorithms. We speculate that the **gradient boosting mechanism** and the **regularization** built into the model can increase prediction accuracy and reduce the risk of overfitting that tree-based models face.


### 💡 Hyperparameters of the model
* **eta:** float, default = 0.3
<br> The step size in fitting.

* **objective:** default = reg:squarederror
<br> Specify the learning task.

* **num_round:** integer
<br> Same as “n_estimators”, the number for boosting, which also decides the number of trees in the ensemble forest.

* **subsample:** float, default = 1
<br> Subsample ratio of the training instances (prevent overfitting).

* **min_child_weight:** float, default = 1
<br> Minimum sum of instance weight needed in a child. The larger, the more conservative the model will be.

* **max_depth:** integer, default = 6
<br> Maximum depth of a tree.

* **gamma:** float, default = 0
<br> Minimum loss reduction required to make a further partition on a leaf node.

* **colsample_bytree:** default = 1
<br> The subsample ratio of columns when constructing each tree.


### 💡 Parameters tuning

   *  We are utilizing **RandomizedSearchCV**, provided by scikit-learn, to tune our parameters.


 ##### 👉    Why we chose this framework ?

  * RandomizedSearchCV is useful when there are **many parameters** to try and training takes a long time. The **training time** for XGBoost is relatively long compared to non-tree-based models, and cross-validation multiplies that cost by the number of folds.
  * The **number of parameters** to consider for XGBoost trees is particularly high, and they do not all influence the result equally. Compared to an exhaustive grid search, a randomized search is more suitable in this situation.
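
A quick count shows the scale of the saving. The sketch below takes the sizes of the search space defined in the XGBoost code of this section (3 `subsample` values, 4 `n_estimators` values, 3 `min_child_weight` values, 10 `gamma` values, 3 `colsample_bytree` values) together with 5 folds and `n_iter=30`, and compares the number of model fits an exhaustive grid would need against the randomized search. The variable names are for illustration only.

```python
from math import prod

# Number of candidate values per parameter, taken from the search space
# used in this section.
search_space_sizes = {
    "subsample": 3,
    "n_estimators": 4,
    "min_child_weight": 3,
    "gamma": 10,
    "colsample_bytree": 3,
}
n_folds = 5   # StratifiedKFold(n_splits=5)
n_iter = 30   # RandomizedSearchCV(n_iter=30)

grid_candidates = prod(search_space_sizes.values())  # 1080 combinations
grid_fits = grid_candidates * n_folds                 # 5400 model fits
random_fits = n_iter * n_folds                        # 150 model fits
print(grid_candidates, grid_fits, random_fits)
```

The randomized search therefore trains 36 times fewer models than the full grid, at the cost of evaluating only a sample of the combinations.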


### 💡 Results

##### 👉    F1 score of Model on public dataset (20%)
* **0.6607**

In [None]:
# Model Implementation

# Imports
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

In [None]:
# read the data
print("reading CSVs...")
train_df = pd.read_csv("train_tfidf_features.csv")
test_df = pd.read_csv("test_tfidf_features.csv")

print("creating training X and y...")
X = train_df.drop(['id','label'], axis = 1).values
Y = train_df['label'].values

In [None]:
# cross validation
folds = 5
skf = StratifiedKFold(n_splits=folds, shuffle = True)

# xgb classifier
xgb = XGBClassifier(objective='binary:logistic', silent=True, nthread=1, eta=0.2)

# searching parameters
params = {
    'subsample': np.linspace(0.6, 1, 3),
    'n_estimators': np.linspace(200, 600, 4, dtype=int),
    'min_child_weight': [1, 2, 5],
    'gamma': np.linspace(1, 3, 10),
    'colsample_bytree': np.linspace(0.6, 0.8, 3)
    }

In [None]:
# random search
random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=30, scoring='f1', cv=skf.split(X,Y), verbose=1, random_state=1001)
random_search.fit(X, Y)

# best parameter
# {'subsample': 0.7333333333333333,
#  'n_estimators': 600,
#  'min_child_weight': 2,
#  'gamma': 2.7777777777777777,
#  'colsample_bytree': 0.6}

In [None]:
# retrain the model with a smaller eta
cv = RepeatedStratifiedKFold(n_splits=5, random_state=3, n_repeats=1)
xgb_final_model = XGBClassifier(eta=0.01, objective='binary:logistic', silent=True, nthread=1, use_label_encoder=False, subsample=0.73, n_estimators=600, min_child_weight=2, max_depth=200, gamma=2.7, colsample_bytree=0.6)

score_rf1 = cross_val_score(xgb_final_model, X, Y, scoring='f1_macro', cv=cv, n_jobs=2, verbose=3, error_score="raise")

### 💡 ExtraTreesClassifier


 ##### 👉    What is the model ? 

  * **ExtraTreesClassifier** is an ensemble machine learning model derived from decision trees. Also known as "Extremely Randomized Trees", it fits a number of randomized decision trees on various sub-samples of the dataset and averages them to improve predictive accuracy and control over-fitting.

           

    

    
 ##### 👉    Why we chose to test this model ? 
  
 * Random Forest is generally quite capable of capturing the complex non-linear relationships of certain datasets while keeping variance at a moderate level. However, due to the large number of features in this hate speech dataset, there is a much greater risk of over-fitting. Extra Trees' increased randomness compared to Random Forest tends to give it lower variance, which is well suited to the high dimensionality of our dataset.
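
The extra randomness comes mainly from how split thresholds are chosen. The sketch below is a deliberately simplified, single-feature illustration with made-up data: a Random-Forest-style search scans candidate thresholds and keeps the one with the lowest Gini impurity, while an Extra-Trees-style split draws a threshold uniformly at random between the feature's minimum and maximum. All helper names are hypothetical; the real implementation draws one random threshold per candidate feature and keeps the best of those.

```python
import random

def gini(labels):
    # Gini impurity for binary 0/1 labels: 2 * p * (1 - p).
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def weighted_gini(values, labels, threshold):
    # Impurity after splitting on `threshold`, weighted by child size.
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    # Random-Forest-style: try every midpoint between sorted unique values.
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: weighted_gini(values, labels, t))

def random_threshold(values, rng):
    # Extra-Trees-style: a single uniform draw between min and max.
    return rng.uniform(min(values), max(values))

# Made-up, perfectly separable feature.
values_demo = [0.1, 0.2, 0.35, 0.6, 0.8, 0.9]
labels_demo = [0, 0, 0, 1, 1, 1]

rng = random.Random(0)
t_best = best_threshold(values_demo, labels_demo)
t_rand = random_threshold(values_demo, rng)
impurity_best = weighted_gini(values_demo, labels_demo, t_best)
impurity_rand = weighted_gini(values_demo, labels_demo, t_rand)
print(t_best, impurity_best, t_rand, impurity_rand)
```

A single random split is never better than the optimized one, but averaging many such trees decorrelates them, which is where the variance reduction is expected to come from.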

### 💡 Hyperparameters of the model 
* **n_estimators:** int, default = 100
<br> The number of trees in the forest.

* **criterion:** {“gini”, “entropy”, “log_loss”}, default = "gini"
<br> The function to measure the quality of a split. 

* **max_depth:** int, default = None <br> The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

* **min_samples_split:** int or float, default = 2 <br> The minimum number of samples required to split an internal node. If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

* **min_samples_leaf:** int or float, default = 1 <br> The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. 
  
* **min_weight_fraction_leaf:** float, default = 0.0 <br> The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

* **max_features:** {“sqrt”, “log2”, None}, int or float, default = “sqrt” <br> The number of features to consider when looking for the best split. If int, then consider max_features features at each split. If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split. If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features.

* **max_leaf_nodes:** int, default = None <br> Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

* **min_impurity_decrease:** float, default = 0.0 <br> A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

* **bootstrap:** bool, default=False <br> Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.

* **random_state:** int, RandomState instance or None, default=None <br> Seed to control sources of randomness

* **class_weight:** {“balanced”, “balanced_subsample”}, dict or list of dicts, default=None <br> Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. 

* **ccp_alpha:** non-negative float, default=0.0 <br> Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. 

* **max_samples:** int or float, default=None <br> If bootstrap is True, the number of samples to draw from X to train each base estimator.

### 💡 Hyperparameters tuning 

   *  We are utilizing the **Optuna** hyperparameter optimization framework to tune our hyperparameters.

   
 ##### 👉    Why we chose this framework ?

  *  Optuna boasts the following features: 
      - **Lightweight, versatile, and platform agnostic architecture**
           - Handle a wide variety of tasks with a simple installation that has few requirements.
       - **Pythonic search spaces**
            - Define search spaces using familiar Python syntax including conditionals and loops.
       - **Efficient optimization algorithms**
            - Adopt state-of-the-art algorithms for sampling hyperparameters and efficiently pruning unpromising trials.
       - **Easy parallelization**
            - Scale studies to tens or hundreds of workers with little or no changes to the code.
       - **Quick visualization**
            - Inspect optimization histories from a variety of plotting functions.
     
     Optuna also allowed us to find good parameters faster than grid search, though with a possible trade-off in accuracy. 



### 💡 Results 

##### 👉    F1 score of Model on public dataset (20%) 

* **0.72273**
 

In [None]:
# Model Implementation

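# Imports (repeated here so this section runs on its own)
import numpy as np
import pandas as pd
import optuna
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, RepeatedStratifiedKFold, cross_val_score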

In [None]:
print("reading CSVs")
train_df = pd.read_csv("train_tfidf_features.csv")
test_df = pd.read_csv("test_tfidf_features.csv")

print("creating X and y")
X = train_df.drop(['id', 'label'], axis=1).values
y = train_df['label'].values
print("done")


### 💡 Hyperparameters tuning 
- The following trial was run on a Kaggle notebook for 12 hours:

In [None]:
def objective(trial):
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

    ### define the search space (the study maximizes mean macro F1)
    n_estimators = trial.suggest_int('n_estimators', 50, 250)
    max_depth = trial.suggest_int('max_depth', 500, 1100)
    criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
    class_weight = trial.suggest_categorical(
        'class_weight', ['balanced', 'balanced_subsample'])
    ccp_alpha = trial.suggest_loguniform('ccp_alpha', 1e-5, 1e-1)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 256)
    min_weight_fraction_leaf = trial.suggest_loguniform(
        'min_weight_fraction_leaf', 1e-6, 1e-2)
    min_impurity_decrease = trial.suggest_float(
        "min_impurity_decrease", 0, 0.1)
    max_features = trial.suggest_loguniform('max_features', 1e-3, 1)
    max_samples = trial.suggest_float("max_samples", 0.5, 0.9)
    bootstrap = trial.suggest_categorical('bootstrap', [True, False])
    ### modeling with suggested params
    model = ExtraTreesClassifier(n_estimators=n_estimators,
                                 max_depth=max_depth,
                                 class_weight=class_weight,
                                 ccp_alpha=ccp_alpha,
                                 min_samples_split=min_samples_split,
                                 min_weight_fraction_leaf=min_weight_fraction_leaf,
                                 min_impurity_decrease=min_impurity_decrease,
                                 max_features=max_features,
                                 # max_samples is only valid with bootstrap=True
                                 max_samples=max_samples if bootstrap else None,
                                 criterion=criterion,
                                 bootstrap=bootstrap,
                                 n_jobs=2,
                                 random_state=1)  # do not tune the seed

    ## cross validation score
    score = cross_val_score(model, X, y, n_jobs=2, cv=cv, scoring="f1_macro")
    f1_mean = score.mean()

    return f1_mean


study = optuna.create_study(direction='maximize')  # maximize mean macro F1
study.optimize(objective, n_trials=None, timeout=42000, n_jobs=2,)

print(study.best_trial.params)
print(study.best_value)

### 💡 Result of Hyperparameter tuning with Optuna: 

In [None]:
# Cross-validated results: 0.71643547
# Test results: 0.72273
model = ExtraTreesClassifier(n_estimators=144,
                            max_depth=589,
                            criterion="entropy",
                            class_weight="balanced_subsample",
                            ccp_alpha=6.267622143679782e-05,
                            min_samples_split=157,
                            min_weight_fraction_leaf=4.8022857076483334e-05,
                            min_impurity_decrease=1.5576259402879695e-05,
                            max_features=0.00502175457189458,
                            max_samples=0.8999810323985775,
                            bootstrap=True)



In [None]:
# Cross validation and Submission

FILE_NAME = '/kaggle/working/etc2.csv'
np.seterr(all="ignore")

cv = RepeatedStratifiedKFold(n_splits=5, random_state=3, n_repeats=3)

score = cross_val_score(model, X, y, scoring='f1_macro',
                        cv=cv, n_jobs=2, verbose=3, error_score="raise")

print("LATEST")
print(score)
print(score.mean())
# 0.7164
print(score.std())
#0.00810

model.fit(X, y)
idColDf = test_df["id"]
X_submit_test = test_df.drop(['id'], axis=1).values
# Submission using the tuned ExtraTreesClassifier model
y_submit_pred = model.predict(X_submit_test)
y_submit_pred_df = pd.DataFrame(y_submit_pred)
submission = pd.concat([idColDf, y_submit_pred_df], axis=1)
submission.columns = ['id', 'label']
submission.to_csv(FILE_NAME, index=False)


### 💡 AutoML + Hyper-parameter tuning with TPOT


 ##### 👉    What is TPOT ? 

  * **TPOT** stands for Tree-based Pipeline Optimization Tool. It is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

           

    

    
 ##### 👉    Why we chose to use TPOT ? 
  
 * TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for the given data. 
 * It is built on scikit-learn (familiar interface)

### 💡 Key parameters of TPOT 
* **generations:** int or None, optional. default = 100
<br> Number of iterations to run the pipeline optimization process.

* **population_size:** int, optional (default=100)
<br> Number of individuals to retain in the genetic programming population every generation. Must be a positive number.

* **scoring:** string or callable, optional (default='accuracy') <br> Function used to evaluate the quality of a given pipeline for the classification problem. Built-in scoring names such as 'accuracy' or 'f1_macro' can be used.

* **cv:** int, cross-validation generator, or an iterable, optional (default=5) <br> Cross-validation strategy used when evaluating pipelines.

* **config_dict:** Python dictionary, string, or None, optional (default=None) <br> A configuration dictionary for customizing the operators and parameters that TPOT searches in the optimization process.
  

* **random_state:** int, RandomState instance or None, default=None <br> The seed of the pseudo random number generator used in TPOT.


### Imports: 

In [None]:
from tpot import TPOTClassifier
import time
print("reading CSVs")
train_df = pd.read_csv("/kaggle/input/50007-dataset/train_tfidf_features.csv")
test_df = pd.read_csv("/kaggle/input/50007-dataset/test_tfidf_features.csv")

print("creating X and y")
X = train_df.drop(['id', 'label'], axis=1).values
y = train_df['label'].values


### TPOT Search Space:

In [None]:
searchDict = {

    # Classifiers
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1.882, 5, 0.1],
        'fit_prior': [False]
    },
    
    'sklearn.ensemble.ExtraTreesClassifier': {
        'n_estimators': [100],
        'criterion': ["gini"],
        'max_features': np.arange(0.05, 1.01, 0.05),
        'min_samples_split': range(2, 21),
        'min_samples_leaf': range(1, 21),
        "max_depth": range(1, 2002,100),
        'bootstrap': [True, False]
    },

    'sklearn.tree.DecisionTreeClassifier': {
        'criterion': ["gini", "entropy"],
        'max_depth': range(1, 11),
        'min_samples_split': range(2, 21),
        'min_samples_leaf': range(1, 21),
        'ccp_alpha':[0.0001,0.001,0.005,0.01,0.05]
    },
    
    'sklearn.svm.LinearSVC': {
        'penalty': ["l1", "l2"],
        'loss': ["hinge", "squared_hinge"],
        'dual': [True, False],
        'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
        'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.]
    },


    'sklearn.linear_model.SGDClassifier': {
        'loss': ['log', 'hinge', 'modified_huber', 'squared_hinge', 'perceptron'],
        'penalty': ['elasticnet'],
        'alpha': [3e-06, 0.01, 0.001],
        'learning_rate': ['invscaling'],
        'fit_intercept': [True, False],
        'l1_ratio': [0.25, 0.0, 1.0, 0.75, 0.5],
        'eta0': [0.1, 1.0, 0.01,0.0012],
        'power_t': [0.5, 0.0, 1.0, 0.1, 100.0, 10.0, 50.0]
    },

    # Preprocessors
    'sklearn.preprocessing.Binarizer': {
        'threshold': np.arange(0.0, 1.01, 0.05)
    },

    'sklearn.cluster.FeatureAgglomeration': {
        'linkage': ['ward', 'complete', 'average'],
        'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']
    },

    'sklearn.preprocessing.MaxAbsScaler': {
    },

    'sklearn.preprocessing.MinMaxScaler': {
    },

    'sklearn.preprocessing.Normalizer': {
        'norm': ['l1', 'l2', 'max']
    },

    'sklearn.kernel_approximation.RBFSampler': {
        'gamma': np.arange(0.0, 1.01, 0.05)
    },

    'sklearn.preprocessing.RobustScaler': {
    },

    'sklearn.preprocessing.StandardScaler': {
    },

    'tpot.builtins.ZeroCount': {
    },

    # Selectors
    'sklearn.feature_selection.SelectFwe': {
        'alpha': np.arange(0, 0.05, 0.001),
        'score_func': {
            'sklearn.feature_selection.f_classif': None
        }
    },

    'sklearn.feature_selection.SelectPercentile': {
        'percentile': range(1, 100),
        'score_func': {
            'sklearn.feature_selection.f_classif': None
        }
    },

    'sklearn.feature_selection.VarianceThreshold': {
        'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]
    }

}

### Code that is run on Kaggle (set to approximately 12 hours each run):

In [None]:
import random

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)
# define search
print("Starting with highly custom light config dict")
start_time = time.time()
periodic_checkpoint_folder = "/kaggle/working/checkpoints/"
model = TPOTClassifier(generations=500, population_size=30, cv=cv, config_dict=searchDict,
                       scoring='f1_macro', verbosity=3, random_state=random.randint(1, 999999999),
                       n_jobs=2, periodic_checkpoint_folder=periodic_checkpoint_folder,
                       max_time_mins=700, max_eval_time_mins=15)
# perform the search
model.fit(X, y)
# export the best model
model.export('tpot_best_model.py')
end_time = time.time()

# Results
print('TPOT classifier finished in %s seconds' % (end_time - start_time))
# Note: this scores the best pipeline on the training data (X, y), not on a held-out test set
print('Best pipeline training score: %.3f' % model.score(X, y))


### Some of the best cross-validated **Results** from multiple runs of TPOT on Kaggle:

In [None]:
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import Normalizer, FunctionTransformer, RobustScaler, MaxAbsScaler, Binarizer, MinMaxScaler
from sklearn.feature_selection import VarianceThreshold, SelectPercentile, f_classif
from tpot.export_utils import set_param_recursive
from tpot.builtins import StackingEstimator
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from copy import copy

# Average CV score on the training set was: 0.7165325962375743
cl0 = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        make_pipeline(
            make_union(
                Normalizer(norm="max"),
                SelectPercentile(score_func=f_classif, percentile=5)
            ),
            MaxAbsScaler()
        )
    ),
    StackingEstimator(estimator=SGDClassifier(alpha=0.001, eta0=1.0, fit_intercept=False, l1_ratio=0.5,
                                              learning_rate="invscaling", loss="modified_huber", penalty="elasticnet", power_t=0.5)),
    BernoulliNB(alpha=1.0, fit_prior=True)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(cl0.steps, 'random_state', 2)


# Average CV score on the training set was: 0.7167734749214567
cl2 = make_pipeline(
    StackingEstimator(estimator=DecisionTreeClassifier(
        criterion="gini", max_depth=8, min_samples_leaf=9, min_samples_split=6)),
    StackingEstimator(estimator=SGDClassifier(alpha=0.001, eta0=1.0, fit_intercept=False, l1_ratio=0.75,
                                              learning_rate="invscaling", loss="perceptron", penalty="elasticnet", power_t=0.1)),
    BernoulliNB(alpha=1.0, fit_prior=False)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(cl2.steps, 'random_state', 222)


# Average CV score on the training set was: 0.7192365888770997
cl12 = make_pipeline(
    SelectPercentile(score_func=f_classif, percentile=77),
    SelectPercentile(score_func=f_classif, percentile=68),
    StackingEstimator(estimator=SGDClassifier(alpha=0.001, eta0=0.0012, fit_intercept=False,
                                              l1_ratio=0.0, learning_rate="invscaling", loss="hinge", penalty="elasticnet", power_t=0.5)),
    BernoulliNB(alpha=2.300000000000001, fit_prior=False)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(cl12.steps, 'random_state', 422)




### 💡 **Final Model**: VotingClassifier (Kaggle Link: https://www.kaggle.com/overspleen/submit/edit)


 ##### 👉    What is the model ? 

  * **VotingClassifier** is a machine learning model that trains an ensemble of user-defined models and predicts an output based on the weights (voting power) assigned to each model.
    
 ##### 👉    Motivation for using VotingClassifier 
  
 * Assuming that the classifiers make largely independent errors, adding more classifiers can improve validation and test scores, especially when the weight (voting power) of each classifier is tuned. This can hold even if some of the individual models do not have a high validation score.

### 💡 Main Hyperparameters of the model 
* **estimators:** list of (str, estimator) tuples
<br> Invoking the `fit` method on the VotingClassifier will fit clones of those original estimators.

* **voting:** {‘hard’, ‘soft’}, default=’hard’
<br> If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.

* **weights:** array-like of shape (n_classifiers,), default=None<br> Sequence of weights (float or int) to weight the occurrences of predicted class labels (hard voting) or class probabilities before averaging (soft voting). Uses uniform weights if None.
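
To make the difference between the two voting modes concrete, here is a minimal NumPy sketch of how weighted hard and soft voting combine the outputs of four classifiers for a single sample. The probabilities are made up for illustration; only the weights `[1, 2, 3.5, 10]` come from this notebook. It mimics the behaviour described above and is not scikit-learn's actual implementation.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE sample (illustrative values only).
# Rows: four classifiers; columns: P(class 0), P(class 1).
probas = np.array([
    [0.60, 0.40],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.35, 0.65],
])
weights = np.array([1, 2, 3.5, 10])  # the voting weights used later in this notebook

# Soft voting: weighted average of the class probabilities, then argmax.
soft_avg = np.average(probas, axis=0, weights=weights)
soft_label = int(np.argmax(soft_avg))

# Hard voting: each classifier casts its weight for its own predicted label.
per_clf_labels = np.argmax(probas, axis=1)
hard_votes = np.bincount(per_clf_labels, weights=weights, minlength=2)
hard_label = int(np.argmax(hard_votes))

# For comparison: soft voting with uniform weights (weights=None).
uniform_label = int(np.argmax(probas.mean(axis=0)))

print("soft (weighted):", soft_avg, "->", soft_label)
print("hard (weighted):", hard_votes, "->", hard_label)
print("soft (uniform): ", probas.mean(axis=0), "->", uniform_label)
```

With these made-up numbers the heavily weighted fourth classifier decides the weighted vote, while uniform weights would pick the other class. This is why the choice of weights matters.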

### 💡 Hyperparameters tuning (Kaggle Link: https://www.kaggle.com/code/overspleen/xtra-voting/edit/run/102572783)

   *  We utilised the **Optuna** hyperparameter optimization framework to tune the weights.
   

### Reading Data:

In [None]:
print("reading CSVs")
train_df = pd.read_csv("train_tfidf_features.csv")
test_df = pd.read_csv("test_tfidf_features.csv")
print("creating X and y")
X = train_df.drop(['id', 'label'], axis=1).values
y = train_df['label'].values


### Models Chosen for Voting Classifier: ExtraTreesClassifier and a mix of dissimilar models/pipelines from TPOT

The reason for this approach is to prevent the ExtraTreesClassifier from overfitting too much to the public test set on Kaggle. The TPOT models therefore have a regularising effect on the overall model (we assess that the reduction in variance is worth the potential increase in bias).

In [None]:
# Kaggle link: https://www.kaggle.com/overspleen/submit/edit

from sklearn.ensemble import VotingClassifier, ExtraTreesClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import Normalizer, FunctionTransformer, RobustScaler, MaxAbsScaler, Binarizer, MinMaxScaler
from sklearn.feature_selection import VarianceThreshold, SelectPercentile, f_classif
from tpot.export_utils import set_param_recursive
from tpot.builtins import StackingEstimator
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from copy import copy

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Weights chosen after Optuna hyperparameter tuning and manual adjustments
w1 = 1
w2 = 2
w3 = 3.5
w4 = 10
weights = [w1, w2, w3, w4]

cl0 = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        make_pipeline(
            make_union(
                Normalizer(norm="max"),
                SelectPercentile(score_func=f_classif, percentile=5)
            ),
            MaxAbsScaler()
        )
    ),
    StackingEstimator(estimator=SGDClassifier(alpha=0.001, eta0=1.0, fit_intercept=False, l1_ratio=0.5,
                                              learning_rate="invscaling", loss="modified_huber", penalty="elasticnet", power_t=0.5)),
    BernoulliNB(alpha=1.0, fit_prior=True)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(cl0.steps, 'random_state', 2)

# Average CV score on the training set was: 0.7167734749214567
cl2 = make_pipeline(
    StackingEstimator(estimator=DecisionTreeClassifier(
        criterion="gini", max_depth=8, min_samples_leaf=9, min_samples_split=6)),
    StackingEstimator(estimator=SGDClassifier(alpha=0.001, eta0=1.0, fit_intercept=False, l1_ratio=0.75,
                                              learning_rate="invscaling", loss="perceptron", penalty="elasticnet", power_t=0.1)),
    BernoulliNB(alpha=1.0, fit_prior=False)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(cl2.steps, 'random_state', 222)

# Average CV score on the training set was: 0.7192365888770997
cl12 = make_pipeline(
    SelectPercentile(score_func=f_classif, percentile=77),
    SelectPercentile(score_func=f_classif, percentile=68),
    StackingEstimator(estimator=SGDClassifier(alpha=0.001, eta0=0.0012, fit_intercept=False,
                                              l1_ratio=0.0, learning_rate="invscaling", loss="hinge", penalty="elasticnet", power_t=0.5)),
    BernoulliNB(alpha=2.300000000000001, fit_prior=False)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(cl12.steps, 'random_state', 422)

etc = ExtraTreesClassifier(n_estimators=144,
                           max_depth=589,
                           criterion='entropy',
                           class_weight='balanced_subsample',
                           ccp_alpha=6.267622143679782e-05,
                           min_samples_split=157,
                           min_weight_fraction_leaf=4.8022857076483334e-05,
                           min_impurity_decrease=1.5576259402879695e-05,
                           max_features=0.00502175457189458,
                           max_samples=0.8999810323985775,
                           bootstrap=True)
estimatorsLast = [("cl0", cl0), ("cl2", cl2), ("cl12", cl12), ("etc", etc)]


model = VotingClassifier(estimators=estimatorsLast,
                         voting='soft',
                         verbose=False,
                         n_jobs=2, weights=weights)


### Validation Results:

In [None]:
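## cross validation score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# 5-fold stratified CV repeated 3 times, giving 15 scores in total
cv2 = RepeatedStratifiedKFold(n_splits=5, random_state=3, n_repeats=3)
score = cross_val_score(model, X, y, n_jobs=2, cv=cv2, scoring="f1_macro")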

print("LATEST")
print(score)
print(score.mean())
print(score.std())

# LATEST
# [0.73517317 0.71984801 0.70466551 0.71652156 0.72720915 0.71664385
#  0.71723571 0.71669589 0.72687481 0.72986298 0.72181864 0.7291815
#  0.72164984 0.7171528  0.71593628]

# 0.7210979799320694
# 0.007306143600200346
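
The 15 scores printed above can be summarised per repeat as well as overall. The sketch below does this with NumPy on the printed values (copied at the 8-decimal precision shown). It assumes that `cross_val_score` returns the scores grouped by repeat, i.e. the first five entries belong to repeat 1, the next five to repeat 2, and so on.

```python
import numpy as np

# Scores copied from the printed output above (3 repeats x 5 folds).
scores = np.array([
    0.73517317, 0.71984801, 0.70466551, 0.71652156, 0.72720915,
    0.71664385, 0.71723571, 0.71669589, 0.72687481, 0.72986298,
    0.72181864, 0.7291815, 0.72164984, 0.7171528, 0.71593628,
])

# Assumption: scores are ordered repeat by repeat, so each row is one repeat.
per_repeat = scores.reshape(3, 5)
repeat_means = per_repeat.mean(axis=1)

overall_mean = scores.mean()
overall_std = scores.std()  # population std (ddof=0), as printed by score.std()

print("mean per repeat:", repeat_means)
print("overall: %.5f +/- %.5f" % (overall_mean, overall_std))
```

The spread across folds (about 0.007) is larger than the gap between the two public leaderboard scores discussed in the conclusion, which is worth keeping in mind when comparing models.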


### Create CSV:

In [None]:
FILE_NAME = '/kaggle/working/etcBasicCouncil7.csv'
model.fit(X,y)
idColDf = test_df["id"]
X_submit_test = test_df.drop(['id'], axis = 1).values
# Submission using the fitted VotingClassifier
y_submit_pred = model.predict(X_submit_test)
y_submit_pred_df = pd.DataFrame(y_submit_pred)
submission = pd.concat([idColDf,y_submit_pred_df],axis=1)
submission.columns = ['id','label']
submission.to_csv(FILE_NAME, index=False)


### Test Results: 0.72046

### Comparison with ExtraTreesClassifier and Conclusion:

Although ExtraTreesClassifier scored higher on the public leaderboard (**0.72273**) than the VotingClassifier (**0.72046**), we decided to use the results from the VotingClassifier because its 3-repeat, 5-fold stratified cross-validated macro F1 score is higher (**0.7211**) than that of the ExtraTreesClassifier (**0.7164**). This suggests that the VotingClassifier may generalise better to unseen test examples than the ExtraTreesClassifier.