# **Approach Description**

The dataset is heavily imbalanced. Problem Category counts are: (1, 6001), (2, 301), (0, 137).

I build a smaller dataset by undersampling Category 1 and oversampling Category 0, aiming for roughly 300 samples per category. The oversampling is applied to the training split only. I then check how several classifiers perform on this dataset.

# Loading data

In [5]:
import pandas as pd

train = pd.read_csv('Train.csv')
train.head()

Unnamed: 0,L_Id,Date of Creation,Agent Category Assigned,Type of Request,Description of the Request,Location,Street Type,Region Type,Ward No,Estimated Date of Completion,Request Solution Category,Actual Date of Completion,Team Assigned,A_1,A_2,Problem Category
0,LM_1,18-12-2018 00:29,1,10,2,13,11223.0,0,7,18-12-2018 09:29,4,18-12-2018 05:34,45,1,1,1
1,LM_2,30-11-2018 22:19,1,15,17,13,10019.0,0,30,01-12-2018 07:19,8,01-12-2018 00:35,22,2,2,1
2,LM_3,12-12-2018 02:37,1,11,22,0,11237.0,2,7,12-12-2018 11:37,8,12-12-2018 04:29,17,1,1,1
3,LM_4,26-12-2018 01:45,1,11,22,12,11213.0,0,7,26-12-2018 10:45,7,26-12-2018 05:26,33,1,1,1
4,LM_5,24-12-2018 03:22,1,2,27,13,10460.0,0,6,24-12-2018 12:22,4,24-12-2018 10:20,24,0,0,1


# **Numerical Variables**

In [6]:
train.dtypes

L_Id                             object
Date of Creation                 object
Agent Category Assigned           int64
Type of Request                   int64
Description of the Request        int64
Location                          int64
Street Type                     float64
Region Type                       int64
Ward No                           int64
Estimated Date of Completion     object
Request Solution Category         int64
Actual Date of Completion        object
Team Assigned                     int64
A_1                               int64
A_2                               int64
Problem Category                  int64
dtype: object

We keep the numeric features and drop the remaining columns (the ID and the three date columns).

---



In [7]:
train.drop(['L_Id', 'Date of Creation', 'Estimated Date of Completion', 'Actual Date of Completion'], axis=1, inplace=True)

In [8]:
train.head()

Unnamed: 0,Agent Category Assigned,Type of Request,Description of the Request,Location,Street Type,Region Type,Ward No,Request Solution Category,Team Assigned,A_1,A_2,Problem Category
0,1,10,2,13,11223.0,0,7,4,45,1,1,1
1,1,15,17,13,10019.0,0,30,8,22,2,2,1
2,1,11,22,0,11237.0,2,7,8,17,1,1,1
3,1,11,22,12,11213.0,0,7,7,33,1,1,1
4,1,2,27,13,10460.0,0,6,4,24,0,0,1


In [9]:
#train['Street Type'] = train['Street Type'].fillna(train['Street Type'].mean())

train['Street Type'] = train['Street Type'].fillna(train['Street Type'].mode()[0])

In [10]:
train.isnull().sum()

Agent Category Assigned       0
Type of Request               0
Description of the Request    0
Location                      0
Street Type                   0
Region Type                   0
Ward No                       0
Request Solution Category     0
Team Assigned                 0
A_1                           0
A_2                           0
Problem Category              0
dtype: int64

# **Checking Dataset Skewness**

In [11]:
# Check the frequency and count of each Problem Category
print('Problem Category frequency :')
print(train['Problem Category'].value_counts(normalize=True))
print(' ')
print('Problem Category count :')
print(train['Problem Category'].value_counts())

Problem Category frequency :
1    0.931977
2    0.046746
0    0.021277
Name: Problem Category, dtype: float64
 
Problem Category count :
1    6001
2     301
0     137
Name: Problem Category, dtype: int64
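
To put numbers on the skew, the proportions and the majority-to-class ratios can be recomputed from the counts printed above. This is a small sketch in plain Python; the variable names are mine and are not used elsewhere in the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 6001, 2: 301, 0: 137}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Rows of the largest class per row of each class.
largest = max(counts.values())
imbalance_ratio = {label: largest / n for label, n in counts.items()}

for label in counts:
    print(label, round(proportions[label], 6), round(imbalance_ratio[label], 1))
```

Category 1 outnumbers Category 0 by roughly 44 to 1, which is why plain accuracy would be misleading here.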


# **Undersampling Category 1**

The dataset is highly imbalanced, so I will resample it before training.

We have two options here:
(1) Undersampling
(2) Oversampling

This time, I will combine them: undersample Category 1 and oversample Category 0.

In [12]:
# Problem Category count :
#1    6001
#2     301
#0     137

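ProbCategory_0 = train.loc[train['Problem Category'] == 0]
# Undersample Category 1 by keeping its first 300 rows (file order, not a random draw)
ProbCategory_1 = train.loc[train['Problem Category'] == 1][:300]
ProbCategory_2 = train.loc[train['Problem Category'] == 2]

# Combine the three subsets ('normal_distributed' here just means roughly balanced)
normal_distributed_df = pd.concat([ProbCategory_0, ProbCategory_1, ProbCategory_2])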

# Shuffle dataframe rows
new_df = normal_distributed_df.sample(frac=1, random_state=42)

new_df.head()

Unnamed: 0,Agent Category Assigned,Type of Request,Description of the Request,Location,Street Type,Region Type,Ward No,Request Solution Category,Team Assigned,A_1,A_2,Problem Category
170,1,15,5,13,10065.0,2,30,10,34,2,2,1
216,1,10,34,13,11691.0,2,14,4,55,3,3,1
100,1,2,31,13,10457.0,0,6,8,20,0,0,1
1638,1,13,22,7,11385.0,4,31,12,0,5,5,2
23,1,11,22,0,10003.0,0,30,7,8,2,2,1


In [13]:
new_df.shape

(738, 12)
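
The cell above keeps the first 300 Category 1 rows in file order. If the file happens to be sorted (by date, ward, and so on), that slice may not be representative. A possible alternative is to draw the rows at random. Below is a minimal sketch on a toy frame: the `feature` column and the class sizes are made up, and only the `Problem Category` column name comes from the notebook.

```python
import pandas as pd

# Toy frame standing in for `train`; the values are invented for illustration.
toy = pd.DataFrame({
    "feature": range(1000),
    "Problem Category": [1] * 900 + [2] * 60 + [0] * 40,
})

majority = toy[toy["Problem Category"] == 1]
rest = toy[toy["Problem Category"] != 1]

# Draw 60 majority rows at random instead of taking the first 60.
majority_sample = majority.sample(n=60, random_state=42)

# Recombine and shuffle, as the notebook does with sample(frac=1).
balanced = pd.concat([majority_sample, rest]).sample(frac=1, random_state=42)
print(balanced["Problem Category"].value_counts().to_dict())
```

Fixing `random_state` keeps the draw reproducible. Switching the notebook to this approach would change every number reported below.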

In [14]:
# Splitting Independent and Dependent Variables
X = new_df.loc[:, new_df.columns != 'Problem Category'].values
y = new_df['Problem Category'].values

from sklearn.model_selection import train_test_split

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

print(X_train.shape)

(553, 11)


In [15]:
# Check the frequency and count of each Problem Category after undersampling
print('Problem Category frequency :')
print(new_df['Problem Category'].value_counts(normalize=True))
print(' ')
print('Problem Category count :')
print(new_df['Problem Category'].value_counts())

Problem Category frequency :
2    0.407859
1    0.406504
0    0.185637
Name: Problem Category, dtype: float64
 
Problem Category count :
2    301
1    300
0    137
Name: Problem Category, dtype: int64


# **Oversampling Category 0**

In [16]:
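from imblearn.over_sampling import SMOTE

# Oversample only the smallest class (Category 0), and only in the training split
smote = SMOTE(sampling_strategy='minority')
x_sm, y_sm = smote.fit_resample(X_train, y_train)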

print(X_train.shape)
print(x_sm.shape)



(553, 11)
(675, 11)
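
The shapes above show that SMOTE added 122 synthetic rows (553 to 675). The idea behind SMOTE is to place each new point on the line segment between a minority sample and one of its nearest minority neighbours. The function below is a simplified NumPy sketch of that idea, not imblearn's implementation; `smote_like_samples` and its parameters are hypothetical names.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    k = min(k, n - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                   # random minority row
        nb = neighbours[base, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                       # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[nb] - X_minority[base])
    return synthetic

# Four minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like_samples(X_min, n_new=6, k=2)
print(new_rows.shape)
```

Every synthetic row lies between two existing minority rows, so no new region of feature space is covered.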




# **Applying Classifiers**

In [17]:
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# Precision for a class is TP / (TP + FP). With more than two classes the per-class
# values have to be averaged: 'micro' pools true and false positives across classes
# before dividing, while 'macro' (the default used here) takes the unweighted mean
# of the per-class precisions.
#
# labels=[0, 2] restricts the average to the two originally rare classes.
def multiclass_precision_score(y_test, y_pred, average="macro"):
    precision = precision_score(y_test, y_pred, labels=[0, 2], average=average)
    return precision

# Recall for a class is TP / (TP + FN): the share of actual members of the class
# that the model found. No labels are passed here, so the average runs over all
# three classes.
def multiclass_recall_score(y_test, y_pred, average='macro'):
    recall = recall_score(y_test, y_pred, average=average)
    return recall

In [18]:
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
logisticRegr.fit(x_sm, y_sm)

predictions = logisticRegr.predict(X_test)
score = logisticRegr.score(X_test, y_test)
print(score)

0.6918918918918919


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
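
The warning above suggests either scaling the features or raising `max_iter`. One way to do both is to put a `StandardScaler` in front of the model in a pipeline. The sketch below uses synthetic data from `make_classification` rather than the notebook's data, so the score it prints says nothing about this problem; the names `X_demo`, `scaled_logreg` and `demo_score` are mine.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 11 features and 3 classes, like the notebook's data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
# Inflate one column to mimic a large-valued feature such as 'Street Type'.
X_demo[:, 0] = X_demo[:, 0] * 1000 + 11000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# The scaler sits inside the pipeline, so it is fitted on the training split only.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

demo_score = scaled_logreg.score(X_te, y_te)
print(demo_score)
```

In the notebook this would mean fitting the pipeline on `x_sm, y_sm` and scoring on `X_test, y_test`. I have not rerun it that way, so the accuracy above is unchanged.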


In [19]:
import xgboost as xgb
import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

dtree_model = DecisionTreeClassifier(max_depth=2).fit(x_sm, y_sm)   #Optimum Value for max_depth : 10
dtree_predictions_2 = dtree_model.predict(X_test)

#print('Accuracy : ', accuracy_score(y_test, dtree_predictions_2))
#print('ROC for DecisionTreeClassifier : ', multiclass_roc_auc_score(y_test, dtree_predictions_2))

print('DecisionTreeClassifier + SMOTE')
print('Precision: ', multiclass_precision_score(y_test, dtree_predictions_2))
print('Recall: ', multiclass_recall_score(y_test, dtree_predictions_2))

DecisionTreeClassifier + SMOTE
Precision:  0.14953271028037382
Recall:  0.6470588235294118


In [20]:
# Creates a confusion matrix
cm = confusion_matrix(y_test, dtree_predictions_2)
cm_df = pd.DataFrame(cm)
cm_df.style.background_gradient(cmap='coolwarm')

Unnamed: 0,0,1,2
0,32,2,0
1,0,75,0
2,75,1,0
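
The two scores reported for the decision tree can be reproduced by hand from this confusion matrix. Macro recall over all three classes comes to about 0.647, and macro precision restricted to labels 0 and 2 comes to about 0.150. A plain-Python sketch follows; the helper names are mine, and a class that is never predicted is given precision 0, which I believe matches scikit-learn's default behaviour.

```python
# Decision-tree confusion matrix from the output above:
# rows are true classes, columns are predicted classes.
cm = [
    [32, 2, 0],
    [0, 75, 0],
    [75, 1, 0],
]

def per_class_recall(cm):
    # Diagonal over row sum: share of each true class predicted correctly.
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

def per_class_precision(cm):
    # Diagonal over column sum: share of each predicted class that was correct.
    col_sums = [sum(row[j] for row in cm) for j in range(len(cm))]
    return [cm[j][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(len(cm))]

recall = per_class_recall(cm)
precision = per_class_precision(cm)

macro_recall = sum(recall) / len(recall)                 # all three classes
macro_precision_0_2 = (precision[0] + precision[2]) / 2  # labels=[0, 2] only

print(round(macro_recall, 4), round(macro_precision_0_2, 4))
# prints: 0.6471 0.1495
```

The matrix also shows why: with `max_depth=2` the tree never predicts Category 2, so its recall and precision for that class are both 0.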


In [21]:
rf_clf = RandomForestClassifier(random_state=0, n_jobs=-1)   
rf_clf.fit(x_sm, y_sm)
rf_predictions_2 = rf_clf.predict(X_test)
    
#print('Accuracy : ', accuracy_score(y_test, rf_predictions_2))
#print('ROC for RandomForestClassifier : ', multiclass_roc_auc_score(y_test, rf_predictions_2))

print('RandomForestClassifier + SMOTE')
print('Precision: ', multiclass_precision_score(y_test, rf_predictions_2))
print('Recall: ', multiclass_recall_score(y_test, rf_predictions_2))

RandomForestClassifier + SMOTE
Precision:  0.532520325203252
Recall:  0.6859580323357414


In [22]:
# Creates a confusion matrix
cm = confusion_matrix(y_test, rf_predictions_2)
cm_df = pd.DataFrame(cm)
cm_df.style.background_gradient(cmap='coolwarm')

Unnamed: 0,0,1,2
0,23,0,11
1,1,74,0
2,45,1,30


In [23]:
# XGB Classifier
xg_cl = xgb.XGBClassifier(objective="multi:softprob",
                          n_estimators=5500,   # note: the printed label below says 5000
                          learning_rate=0.1,
                          seed=123,
                          max_depth=2)

xg_cl.fit(x_sm, y_sm)

preds_5000_2 = xg_cl.predict(X_test)
        
#print('Accuracy : ', accuracy_score(y_test, preds_5000_2))
#print('ROC for XGBClassifier(5000) : ', multiclass_roc_auc_score(y_test, preds_5000_2))

print('XGBClassifier(5000) + SMOTE')
print('Precision: ', multiclass_precision_score(y_test, preds_5000_2))
print('Recall: ', multiclass_recall_score(y_test, preds_5000_2))

XGBClassifier(5000) + SMOTE
Precision:  0.5130788264404382
Recall:  0.6762125902992776


In [24]:
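# Creates a confusion matrix
# Note: rf_predictions_2 is passed here, so the matrix below repeats the random forest
# result; pass preds_5000_2 instead to get the XGBoost matrix.
cm = confusion_matrix(y_test, rf_predictions_2)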
cm_df = pd.DataFrame(cm)
cm_df.style.background_gradient(cmap='coolwarm')

Unnamed: 0,0,1,2
0,23,0,11
1,1,74,0
2,45,1,30


# **Precision vs. Recall for Imbalanced Classification**
You may decide to use precision or recall on your imbalanced classification problem. <br>

Maximizing precision will minimize the number of false positives, whereas maximizing recall will minimize the number of false negatives. <br>

Precision: Appropriate when minimizing false positives is the focus. <br>
Recall: Appropriate when minimizing false negatives is the focus. <br>
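
The trade-off is easiest to see on a small binary example where the decision threshold is moved. The scores and labels below are hypothetical and not taken from the notebook, whose models all use the default `predict()`. Raising the threshold removes false positives (precision goes up) but misses more true positives (recall goes down).

```python
# Hypothetical model scores for a binary problem (1 = positive class).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

def precision_recall_at(threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for threshold in (0.25, 0.50, 0.85):
    print(threshold, precision_recall_at(threshold))
```

At 0.85 there are no false positives but half the positives are missed. At 0.25 every positive is found, at the cost of three false positives.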