# ***BALANCING THE DATA USING SMOTE*** ✔

***If you liked my notebook, upvote it! Also leave your feedback or any questions in the comments.***

***Whether we use a deep learning model, a machine learning model, or any other powerful model, imbalanced data keeps us from reaching the accuracy we are aiming for. The accuracy may look good, but to get the best result we must tackle this problem.***

# **EXPLANATION:**

1. *The challenge of working with imbalanced datasets is that most machine learning techniques will ignore, and in turn perform poorly on, the minority class, although performance on the minority class is typically what matters most.*
2. *For example, in this dataset the target column, named `failure`, contains only two values: 0 and 1.*
3. *When I applied a machine learning model, it reached an accuracy of 1.0 on the train data and 0.50 on the test data, so this is a case of overfitting.*
4. *Overfitting means the model performs very well on the train data but poorly on the test data.*
5. *To tackle this kind of problem, I called `value_counts()` on the target, and the output clearly shows that the data is imbalanced.*
6. *It shows 0: 20921 and 1: 5649.*
7. *We oversample the minority class, which in this example is 1.*
8. *One widely used technique for this is SMOTE (Synthetic Minority Oversampling Technique).*
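
To make point 1 concrete, here is a minimal sketch using the class counts quoted above (20921 zeros, 5649 ones). It assumes a hypothetical baseline that always predicts 0; that baseline is an illustration only, not the model used in this notebook. It shows how accuracy can look acceptable while the minority class is missed entirely.

```python
# Class counts quoted in the explanation above.
n_zero, n_one = 20921, 5649

# Hypothetical baseline (an assumption for illustration): always predict class 0.
y_true = [0] * n_zero + [1] * n_one
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on class 1: share of actual 1s that were predicted as 1.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_one = true_pos / n_one

print(f"accuracy = {accuracy:.3f}")
print(f"recall on class 1 = {recall_one:.3f}")
```

The baseline is right about 79% of the time simply because 0 is the majority class, yet it never finds a single failure. This is why a per-class metric is worth checking next to accuracy.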


#### First, we need to know **what SMOTE is and how it works.**


SMOTE is an oversampling technique that generates synthetic samples from the minority class. It is used to obtain a synthetically class-balanced or nearly class-balanced training set, which is then used to train the classifier. The SMOTE samples are linear combinations of two similar samples from the minority class ($x$ and $x^R$) and are defined as

### $s = x + u \cdot (x^R - x),$


with $0 \le u \le 1$; $x^R$ is randomly chosen among the 5 minority class nearest neighbors of $x$.
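
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the `imblearn` implementation: the function name `smote_sample`, the toy data, and the use of Euclidean distance are assumptions made for the example.

```python
import numpy as np

def smote_sample(minority, i, k=5, rng=None):
    """Build one synthetic point from minority[i] and one of its k nearest minority neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    x = minority[i]
    # Euclidean distance from x to every minority sample (assumed metric).
    dist = np.linalg.norm(minority - x, axis=1)
    # Sort by distance and skip position 0, which is x itself
    # (assuming the data has no duplicate rows).
    neighbors = np.argsort(dist)[1:k + 1]
    x_r = minority[rng.choice(neighbors)]
    # u is drawn uniformly between 0 and 1.
    u = rng.uniform(0.0, 1.0)
    # s = x + u * (x_r - x): a point on the segment between x and x_r.
    return x + u * (x_r - x), x_r, u

# Toy minority class: 20 random points in 2 dimensions (illustrative data only).
rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))

s, x_r, u = smote_sample(minority, i=0, k=5, rng=rng)
print("x   =", minority[0])
print("x_r =", x_r)
print("u   =", u)
print("s   =", s)
```

Because `u` lies between 0 and 1, the synthetic sample `s` always falls on the line segment joining `x` and its chosen neighbor, so every coordinate of `s` lies between the corresponding coordinates of the two original points.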

#### **SMOTE does not change the expected value of the (SMOTE-augmented) minority class and it decreases its variability**

SMOTE samples have the same expected value as the original minority class samples, $E(X_j^{SMOTE}) = E(X_j)$, but smaller variance, $\mathrm{var}(X_j^{SMOTE}) = \frac{2}{3}\,\mathrm{var}(X_j)$.
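
The 2/3 factor can be checked numerically. The sketch below rests on a simplifying assumption of mine that is not stated above: `x` and `x_r` are treated as independent draws from the same distribution, and `u` is uniform on [0, 1]. Under that assumption, writing `s = (1 - u) * x + u * x_r` gives `var(s) = E[(1 - u)^2 + u^2] * var(X) = (1/3 + 1/3) * var(X)`. Real nearest neighbors are correlated with `x`, so treat this as an idealized check rather than a statement about any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Assumption for this check: x and x_r are independent draws from the same distribution.
x = rng.normal(loc=3.0, scale=2.0, size=n)
x_r = rng.normal(loc=3.0, scale=2.0, size=n)
u = rng.uniform(0.0, 1.0, size=n)

# SMOTE interpolation applied element-wise.
s = x + u * (x_r - x)

mean_gap = abs(s.mean() - x.mean())
var_ratio = s.var() / x.var()

print(f"mean gap       = {mean_gap:.4f}")
print(f"variance ratio = {var_ratio:.4f}  (theory under the assumption: 2/3)")
```

The mean of the synthetic samples stays close to the original mean, while the variance ratio lands near 2/3.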

# **AFTER APPLYING THIS APPROACH TO MY DATA, MY MODEL IMPROVED** 😎


#### **Import Libraries:**

In [None]:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,cross_val_score
import xgboost as xgb

In [None]:
train_csv = pd.read_csv('../input/tabular-playground-series-aug-2022/train.csv')
train_csv.head(2)

In [None]:
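# Fill missing values in each column with that column's median.
for col in train_csv.columns:
    if train_csv[col].isnull().sum() > 0:
        # Assign back instead of using inplace=True on a .loc selection,
        # which may act on a temporary copy and leave train_csv unchanged.
        train_csv[col] = train_csv[col].fillna(train_csv[col].median())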

train_data = train_csv.copy()

# ***Do read this:***

***As explained above, class 0 has many samples while class 1 has few, so the data is imbalanced. If our model trains on this data, it learns the 0 output too strongly because 0 is so frequent. When the actual value is 0, it predicts 0 correctly; but when the actual value is 1, it still tends to predict 0 because it has seen so many 0s, which is incorrect.***

In [None]:
train_csv.failure.value_counts()

In [None]:

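# Drop rows with extreme high values: anything above Q3 + 3*IQR in a numeric column.
# Note: the loop also visits the numeric `id` and `failure` columns. With about 79% zeros
# in `failure` (see value_counts above), Q1 = Q3 = 0 for that column, so the threshold
# is 0 and the filter likely removes every failure == 1 row from train_csv.
for col in train_csv.columns:
    if train_csv[col].dtype != 'object':
        first_quartile = train_csv[col].quantile(0.25)
        third_quartile = train_csv[col].quantile(0.75)
        IQR = third_quartile - first_quartile
        out = third_quartile + 3*IQR
        train_csv.drop(train_csv[train_csv[col] > out].index,axis=0,inplace=True)

# Log-transform two columns, then drop the identifier and attribute columns.
train_csv[['loading','measurement_17']] = np.log(train_csv[['loading','measurement_17']])
train = train_csv.drop(['id','product_code','attribute_0','attribute_1','attribute_2','attribute_3'],axis=1)
input = train.drop('failure',axis=1)
target=train.failure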

In [None]:
x_train,x_test,y_train,y_test = train_test_split(input,target,test_size=0.15,random_state=34)
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
# Reuse the statistics fitted on the training split; do not refit on the test split.
x_test = ss.transform(x_test)

***As you can see, it outputs 100% accuracy, but on test.csv it gave 0.5, which is poor compared with our train accuracy. So we will apply SMOTE to balance the data.***

In [None]:
model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.7,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0.3,grow_policy='depthwise',
              importance_type=None,
              learning_rate=0.005, max_bin=25, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=5,n_estimators=50,
              verbosity=1)

model.fit(x_train,y_train)
score = cross_val_score(model,input,target,cv=10)
print(score)
print(model.score(x_test,y_test))

***Now we apply SMOTE to balance the data: we increase the sample size of the minority class (1 in this problem) to match the sample size of class 0.***

In [None]:
X = train_data.drop(['id','failure','product_code','attribute_0','attribute_1'],axis=1)
Y = train_data['failure']

sm = SMOTE(k_neighbors=143)
X_new,Y_new = sm.fit_resample(X,Y)
X_new['failure'] = Y_new

In [None]:
X.shape

# ***Balanced Data We Get***  😎

In [None]:
X_new.failure.value_counts()

# ***Same PreProcessing***

In [None]:

# Same outlier rule as before: drop rows above Q3 + 3*IQR in each numeric column.
for col in X_new.columns:
    if X_new[col].dtype != 'object':
        first_quartile = X_new[col].quantile(0.25)
        third_quartile = X_new[col].quantile(0.75)
        IQR = third_quartile - first_quartile
        out = third_quartile + 3*IQR
        X_new.drop(X_new[X_new[col] > out].index,axis=0,inplace=True)

# Same log transform and column drops as before.
X_new[['loading','measurement_17']] = np.log(X_new[['loading','measurement_17']])
X_new.drop(['attribute_2','attribute_3'],axis=1,inplace=True)
input = X_new.drop('failure',axis=1)
target=X_new.failure

In [None]:
x_train,x_test,y_train,y_test = train_test_split(input,target,test_size=0.15,random_state=34)
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
# Reuse the statistics fitted on the training split; do not refit on the test split.
x_test = ss.transform(x_test)

***Now, as you can see, the accuracy improved, and we can say that this is the true accuracy of the model, not the figure above. It also improved on the test.csv data, from 0.50 to 0.559.*** 🔥

In [None]:
model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.7,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0.3,grow_policy='depthwise',
              importance_type=None,
              learning_rate=0.005, max_bin=25, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=5,n_estimators=50,
              verbosity=1)

model.fit(x_train,y_train)
score = cross_val_score(model,input,target,cv=10)
print(score)
print(model.score(x_test,y_test))