# classification with xgboost
model v5
- preliminary model (no hyperparameter tuning)
- data with all 1352 important features based on pca and threshold
- imbalanced classification dealt with using over- and under-sampling techniques

Source: [10 Techniques To Deal With Imbalanced Classes in Machine Learning](https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/)

In [1]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

# data
import pandas as pd
import ast
import numpy as np
from numpy import mean

# visualization
import matplotlib.pyplot as plt

# chosen models
from xgboost import XGBClassifier

# feature engineering
from xgboost import plot_importance

# imbalanced data
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, NearMiss
from imblearn.over_sampling import RandomOverSampler, SMOTE


# model training selection
from sklearn.model_selection import train_test_split

## model evaluation metrics
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import cohen_kappa_score

In [2]:
df = pd.read_csv('../data/feature_engineering/combined_feng_dropna.csv', index_col=0)
df.shape

(2197, 7353)

In [3]:
df.head(3)

Organization Name,Number of Founders,Number of Funding Rounds,Trend Score (7 Days),Trend Score (30 Days),Trend Score (90 Days),Early Stage Venture,M&A,Seed,Made Acquisitions,Made Acquisitions; Was Acquired,...,last_funding_amount,cvr,last_funding_amount_usd,last_equity_funding_amount,last_equity_funding_amount_usd,total_equity_funding_amount,total_equity_funding_amount_usd,total_funding_amount,total_funding_amount_usd,female_led
CMC,1,1,-0.4,-0.8,-1.2,1.0,0.0,0.0,0.0,0.0,...,10000000000,0.16,1600000000.0,10000000000,1600000000.0,10000000000,1600000000.0,10000000000,1600000000.0,0
Ping An Healthcare Management,1,1,-0.1,-0.2,-0.4,1.0,0.0,0.0,0.0,0.0,...,1150000000,1.0,1150000000.0,1150000000,1150000000.0,1150000000,1150000000.0,1150000000,1150000000.0,0
LeSee,1,1,-0.4,0.0,-0.5,1.0,0.0,0.0,0.0,0.0,...,1080000000,1.0,1080000000.0,1080000000,1080000000.0,1080000000,1080000000.0,1080000000,1080000000.0,0


### get all top features based on pca

In [4]:
with open('high_var_org_col_index.txt', 'r') as reader:
    high_var_org_col_index = reader.read()

In [5]:
high_var_org_col_index = ast.literal_eval(high_var_org_col_index)

In [6]:
df1352 = df[df.columns[high_var_org_col_index]]

### get data

In [7]:
X = df1352
y = df['female_led']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Imbalanced Data

In [9]:
Counter(y)

Counter({0: 2055, 1: 142})

### 1. random under-sampling

In [10]:
rus = RandomUnderSampler(random_state=42, replacement=True)

In [11]:
X_rus, y_rus = rus.fit_resample(X, y)

In [12]:
Counter(y_rus)

Counter({0: 142, 1: 142})

In [13]:
X_rus_train, X_rus_test, y_rus_train, y_rus_test = train_test_split(
    X_rus, y_rus, test_size=0.33, random_state=42)

### 2. random over-sampling

In [14]:
ros = RandomOverSampler(random_state=42)

In [15]:
X_ros, y_ros = ros.fit_resample(X, y)

In [16]:
Counter(y_ros)

Counter({0: 2055, 1: 2055})

In [17]:
X_ros_train, X_ros_test, y_ros_train, y_ros_test = train_test_split(
    X_ros, y_ros, test_size=0.33, random_state=42)

### 3. tomek links under-sampling

A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbor. Removing the majority-class sample of each pair widens the gap between the two classes, which makes the class boundary easier to learn.

![img](https://miro.medium.com/max/700/1*KxFmI15rxhvKRVl-febp-Q.png)
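To make the definition concrete, here is a minimal NumPy sketch that finds Tomek links on a toy 1-D dataset. It is an illustration only, not the `imblearn` implementation; `tomek_links`, `X_toy` and `y_toy` are hypothetical names, and plain Euclidean distance is assumed.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # pairwise squared Euclidean distances; a sample is never its own neighbor
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)  # nearest neighbor of every sample
    # a Tomek link: mutual nearest neighbors with different labels
    return [(i, int(j)) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

# toy data (hypothetical): class 0 is the majority class
X_toy = np.array([[0.0], [1.0], [1.2], [3.0], [3.1]])
y_toy = np.array([0, 0, 1, 1, 1])

links = tomek_links(X_toy, y_toy)
# the majority-class member of each link is the one that would be removed
to_drop = [i if y_toy[i] == 0 else j for i, j in links]
```

Samples 3 and 4 are also mutual nearest neighbors, but they share a label, so they do not form a link.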

In [18]:
tl = TomekLinks(sampling_strategy='majority')

In [19]:
X_tl, y_tl = tl.fit_resample(X, y)

In [20]:
Counter(y_tl)

Counter({0: 1968, 1: 142})

In [21]:
X_tl_train, X_tl_test, y_tl_train, y_tl_test = train_test_split(
    X_tl, y_tl, test_size=0.33, random_state=42)

### 4. synthetic minority over-sampling (SMOTE)

This technique generates synthetic data for the minority class.

SMOTE (Synthetic Minority Oversampling Technique) works by randomly picking a point from the minority class and finding its k nearest neighbors within the minority class. Synthetic points are then added between the chosen point and its neighbors.

![img](https://miro.medium.com/max/734/1*yRumRhn89acByodBz0H7oA.png)

The SMOTE algorithm works in four steps:

1. Choose a minority-class sample as the input vector
2. Find its k nearest minority-class neighbors (k is set with the `k_neighbors` argument of `SMOTE()`)
3. Choose one of these neighbors and place a synthetic point anywhere on the line segment joining the sample under consideration and the chosen neighbor
4. Repeat the steps until the classes are balanced
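The steps above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions (Euclidean distance, uniform random interpolation), not the `imblearn` implementation; `smote_sketch` and `X_min_toy` are hypothetical names.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolation (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # step 2: k nearest minority-class neighbors of every minority sample
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)          # exclude the sample itself
    k = min(k, len(X_min) - 1)
    neighbors = np.argsort(d, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):             # step 4: repeat until enough points exist
        i = rng.integers(len(X_min))     # step 1: pick a minority sample
        j = neighbors[i, rng.integers(k)]  # step 3: pick one of its neighbors
        gap = rng.random()               # position along the segment, in [0, 1)
        synthetic[row] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

# toy minority class (hypothetical): the four corners of the unit square
X_min_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sketch(X_min_toy, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies inside the bounding box of the minority class.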

In [22]:
smote = SMOTE()

In [23]:
X_smote, y_smote = smote.fit_resample(X, y)

In [24]:
Counter(y_smote)

Counter({0: 2055, 1: 2055})

In [25]:
X_smote_train, X_smote_test, y_smote_train, y_smote_test = train_test_split(
    X_smote, y_smote, test_size=0.33, random_state=42)

### 5. NearMiss
NearMiss is an under-sampling technique. Instead of resampling the minority class, it uses distances to the minority samples to decide which majority samples to keep, shrinking the majority class to the size of the minority class.
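As a rough illustration of the distance rule, the sketch below follows the NearMiss-1 variant: keep the majority samples whose average distance to their k closest minority samples is smallest. It is a simplified sketch, not the `imblearn` code; `nearmiss1_sketch` and the toy arrays are hypothetical names, and Euclidean distance is assumed.

```python
import numpy as np

def nearmiss1_sketch(X_maj, X_min, k=3):
    """Return indices of the majority samples to keep (illustrative NearMiss-1 sketch)."""
    X_maj = np.asarray(X_maj, dtype=float)
    X_min = np.asarray(X_min, dtype=float)
    # distance from every majority sample to every minority sample
    d = np.sqrt(((X_maj[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2))
    k = min(k, X_min.shape[0])
    # average distance to the k closest minority samples
    avg = np.sort(d, axis=1)[:, :k].mean(axis=1)
    # keep as many majority samples as there are minority samples
    keep = np.argsort(avg)[:X_min.shape[0]]
    return np.sort(keep)

# toy data (hypothetical): two minority samples, four majority samples
X_min_toy = np.array([[0.0], [1.0]])
X_maj_toy = np.array([[0.5], [2.0], [5.0], [10.0]])
kept = nearmiss1_sketch(X_maj_toy, X_min_toy, k=2).tolist()
```

The two majority samples closest to the minority class are kept, so the retained majority points sit right at the class boundary.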

In [26]:
nm = NearMiss()

In [27]:
X_nm, y_nm = nm.fit_resample(X, y)

In [28]:
Counter(y_nm)

Counter({0: 142, 1: 142})

In [29]:
X_nm_train, X_nm_test, y_nm_train, y_nm_test = train_test_split(
    X_nm, y_nm, test_size=0.33, random_state=42)
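Each technique above calls `fit_resample(X, y)` on the full dataset and splits afterwards. With over-sampling, copies (or interpolations) of the same minority rows can then land in both the train and the test split, which may inflate the test scores. A possible alternative is to split first and resample only the training part. The sketch below shows the idea with plain NumPy random over-sampling on hypothetical toy data (`X_toy`, `y_toy`); the stratified split is an assumption made here to keep the example deterministic.

```python
import numpy as np

rng = np.random.default_rng(42)

# toy imbalanced data (hypothetical): 20 positives, 180 negatives
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 20 + [0] * 180)

# 1) split first (stratified, roughly 33% test per class)
train_parts, test_parts = [], []
for label in (0, 1):
    cls_idx = rng.permutation(np.flatnonzero(y_toy == label))
    n_test = int(round(0.33 * len(cls_idx)))
    test_parts.append(cls_idx[:n_test])
    train_parts.append(cls_idx[n_test:])
train_idx = np.concatenate(train_parts)
test_idx = np.concatenate(test_parts)

# 2) over-sample the minority class inside the training split only
min_idx = train_idx[y_toy[train_idx] == 1]
maj_idx = train_idx[y_toy[train_idx] == 0]
extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
train_bal_idx = np.concatenate([train_idx, extra])

X_train_bal, y_train_bal = X_toy[train_bal_idx], y_toy[train_bal_idx]
X_test_raw, y_test_raw = X_toy[test_idx], y_toy[test_idx]  # test split keeps the original imbalance
```

With `imblearn`, the same effect can be had by wrapping the sampler and the classifier in `imblearn.pipeline.Pipeline`, which applies the sampler only during `fit`.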

## model evaluation helpers

In [30]:
def metrics(y_true, y_pred):
    """Return [accuracy, precision, recall, f1, mcc, kappa]; 1 is the perfect score for each."""
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    mcc = matthews_corrcoef(y_true, y_pred)
    kappa = cohen_kappa_score(y_true, y_pred)
    return [accuracy, precision, recall, f1, mcc, kappa]

In [31]:
def evaluate(model, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return metrics(y_test, y_pred)

In [32]:
def important_cols(xgb):
    # 'weight' counts how often a feature is used to split; features never used in a split are omitted
    important_features_dict = xgb.get_booster().get_score(importance_type='weight')
    return list(important_features_dict.keys())

In [33]:
def plot_important_cols(xgb, n=50):
    plt.rcParams["figure.figsize"] = (20, 15)
    plot_importance(xgb, max_num_features=n)

## machine learning

In [34]:
xgb = XGBClassifier()

## model evaluation

### 1. random under-sampling

In [35]:
metrics_rus = evaluate(xgb, X_rus_train, X_rus_test, y_rus_train, y_rus_test)
metrics_rus



[0.6914893617021277,
 0.6818181818181818,
 0.6666666666666666,
 0.6741573033707865,
 0.3813850356982369,
 0.3812982296867907]

In [36]:
cols_rus = important_cols(xgb)
len(cols_rus)

25

In [37]:
# plot_important_cols(xgb)

### 2. random over-sampling

In [38]:
metrics_ros = evaluate(xgb, X_ros_train, X_ros_test, y_ros_train, y_ros_test)
metrics_ros



[0.9815770081061165,
 0.9633967789165446,
 1.0,
 0.9813571961222969,
 0.9638156088600836,
 0.9631613981403158]

In [39]:
cols_ros = important_cols(xgb)
len(cols_ros)

78

In [40]:
# plot_important_cols(xgb)

### 3. tomek links

In [41]:
metrics_tl = evaluate(xgb, X_tl_train, X_tl_test, y_tl_train, y_tl_test)
metrics_tl



[0.9253945480631277,
 0.0,
 0.0,
 0.0,
 -0.02285726195738519,
 -0.013138033208475397]
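High accuracy next to zero precision and recall is the pattern to expect when almost every prediction is the majority class. As a reference point, the sketch below scores a hypothetical classifier that always predicts 0, using the class counts from `Counter(y_tl)` above. Note that these are counts for the whole Tomek-resampled dataset, not the test split, and `scores_from_counts` is a hypothetical helper.

```python
from math import sqrt

def scores_from_counts(tp, fp, fn, tn):
    """Accuracy, recall and MCC from confusion-matrix counts (MCC set to 0 when undefined)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, recall, mcc

# class counts after Tomek-link removal, from Counter(y_tl)
n_majority, n_minority = 1968, 142

# always predicting the majority class: no true or false positives
baseline = scores_from_counts(tp=0, fp=0, fn=n_minority, tn=n_majority)
```

The always-0 baseline reaches about 0.93 accuracy with zero recall and zero MCC, so accuracy alone says little here; MCC and kappa are the more informative columns.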

In [42]:
cols_tl = important_cols(xgb)
len(cols_tl)

47

In [43]:
# plot_important_cols(xgb)

### 4. smote

In [44]:
metrics_smote = evaluate(xgb, X_smote_train, X_smote_test, y_smote_train, y_smote_test)
metrics_smote



[0.9550478997789241,
 0.9502262443438914,
 0.9574468085106383,
 0.9538228614685844,
 0.9100584481338556,
 0.9100337032613149]

In [45]:
cols_smote = important_cols(xgb)
len(cols_smote)

46

In [46]:
# plot_important_cols(xgb)

### 5. nearmiss

In [47]:
metrics_nearmiss = evaluate(xgb, X_nm_train, X_nm_test, y_nm_train, y_nm_test)
metrics_nearmiss



[0.8191489361702128,
 0.8333333333333334,
 0.7777777777777778,
 0.8045977011494253,
 0.6379658352924606,
 0.6366530241018645]

In [48]:
cols_nearmiss = important_cols(xgb)
len(cols_nearmiss)

21

In [49]:
# plot_important_cols(xgb)

In [50]:
# set up
algos = ['rus', 'ros', 'tomek_links', 'smote', 'nearmiss']
metrics_lst = [metrics_rus, metrics_ros, metrics_tl, metrics_smote, metrics_nearmiss]
cols_lst = [cols_rus, cols_ros, cols_tl, cols_smote, cols_nearmiss]
all_cols = list(df1352.columns)

# set up
metrics_data = []
cols_data = []

# loop through the resampling methods
for i in range(len(algos)):

    # get metrics
    metrics_data.append(['xgboost_v5', algos[i]] + metrics_lst[i])

    # get important columns (named selected_cols so the important_cols() helper is not shadowed)
    selected_cols = set(cols_lst[i])
    # one hot encode important cols in all columns dataframe
    encoding = [1 if col in selected_cols else 0 for col in all_cols]
    cols_data.append(encoding)

In [51]:
# build from the list directly: np.array() on mixed str/float rows would cast every metric to a string
df_metrics5 = pd.DataFrame(metrics_data,
                columns=['model', 'setting',
                'accuracy', 'precision', 'recall', 'f1', 'mcc', 'kappa'])

In [52]:
# df_metrics5

In [53]:
df_cols5 = pd.DataFrame(np.array(cols_data), columns=all_cols)

In [54]:
# df_cols5

In [55]:
# identified as important features by more than two methods (at least 3 of 5)
len(df_cols5.columns[df_cols5.sum()>2])

37
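The cell above keeps columns whose 0/1 indicators sum to more than 2, i.e. features used under at least three of the five resampling methods. The toy NumPy sketch below (hypothetical feature names and votes) shows how the threshold changes the selection.

```python
import numpy as np

# hypothetical one-hot "votes": rows = resampling methods, columns = features
feature_names = np.array(['f_a', 'f_b', 'f_c', 'f_d'])
votes = np.array([
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
])

counts = votes.sum(axis=0)                            # number of methods that used each feature
at_least_three = feature_names[counts > 2].tolist()   # same threshold as the cell above
more_than_one = feature_names[counts > 1].tolist()    # looser threshold, for comparison
```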

In [56]:
df_metrics4 = pd.read_csv('../data/results_df/metrics_v4_dna.csv')
df_cols4 = pd.read_csv('../data/results_df/important_cols_v4_dna.csv')

In [57]:
df_metrics = pd.concat([df_metrics4, df_metrics5])

In [58]:
df_cols = pd.concat([df_cols4, df_cols5])

In [59]:
# df_metrics.to_csv('../data/results_df/metrics_v45_dna.csv', index=False)

In [60]:
# df_cols.to_csv('../data/results_df/important_cols_v45_dna.csv', index=False)

### Feature Selection
With the PCA-selected columns, the set of features the xgboost model actually uses again differs considerably from the first version. Features used by the model out of the features supplied:
- v1: 92/7532
- v2: 6/100
- v3: 48/1352
- v4:  79/1352
- v5: 
    - random under-sampling: 25/1352
    - random over-sampling: 78/1352
    - tomek links: 47/1352
    - smote: 46/1352
    - nearmiss: 21/1352

v1-v3 do not address the class imbalance of the dataset.

### Version Differences

model_version | feature selection | imbalanced class | hyperparameter tuning | cross validation
--------------|-------------------|------------------|-----------------------|-----------------
xgboost_v1 | no | no | no | no
xgboost_v2 | yes, top 100 | no | no | no
xgboost_v3 | yes, all 1352 top features | no | no | no
xgboost_v4 | yes, all 1352 top features | yes (weighted 20-50) | no | no
xgboost_v5 | yes, all 1352 top features | yes (5 under-/over-sampling methods) | no | no