### Imbalanced data is a common issue in machine learning, where the distribution of classes in the target variable is uneven. In this scenario, machine learning models tend to favor the majority class while neglecting the minority class, leading to poor performance and biased predictions.
### The following are two methods to handle imbalanced data
### 1) Random under sampling: Decrease the number of instances in the majority class by randomly removing instances.
### 2) Over sampling: Increase the number of instances in the minority class, either by randomly duplicating them (random over sampling) or by generating new synthetic instances with techniques like SMOTE (Synthetic Minority Over-sampling Technique).
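
To make the two ideas concrete, here is a minimal sketch in plain Python of what random under- and over-sampling do to the class counts. The helper names `random_under_sample` and `random_over_sample` and the toy row ids are hypothetical, and the 257/143 split is chosen only to mirror the dataset used in this notebook; this is an illustration of the idea, not how any particular library implements it.

```python
import random
from collections import Counter


def group_by_class(rows, labels):
    """Collect the rows of each class into its own list."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows until every class is as small as the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        for row in rng.sample(group, target):  # drawn without replacement
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def random_over_sample(rows, labels, seed=0):
    """Randomly duplicate rows until every class is as large as the largest class."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        # extra copies are drawn with replacement from the existing rows
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# hypothetical data: 400 row ids, 257 labelled 0 and 143 labelled 1
rows = list(range(400))
labels = [0] * 257 + [1] * 143

under_rows, under_labels = random_under_sample(rows, labels)
over_rows, over_labels = random_over_sample(rows, labels)

print(Counter(under_labels))
print(Counter(over_labels))
```

Under sampling leaves 143 rows per class and discards information from the majority class; over sampling leaves 257 rows per class, but the added minority rows are exact copies of existing ones.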

In [34]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [35]:
# load data
d_set=pd.read_csv("Social_Network_Ads.csv")
d_set.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [36]:
# count how many rows fall into each class of the Purchased column
d_set['Purchased'].value_counts()

0    257
1    143
Name: Purchased, dtype: int64

### result: the dataset has 257 rows with Purchased = 0 and 143 rows with Purchased = 1. A model trained on this data sees far more 0s than 1s, so it will tend to favor predicting 0.
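
As a point of reference for the accuracy scores in this notebook, the class counts alone determine how well a "model" that always predicts the majority class would do. The short calculation below is a sketch that uses only the two counts printed above; the variable names are illustrative, and the figure is computed on the full dataset rather than on a particular test split.

```python
# class counts taken from the value_counts() output above
counts = {0: 257, 1: 143}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# accuracy of a trivial model that always predicts the majority class
baseline_accuracy = counts[majority_class] / total * 100

# how many 0s there are for every 1
imbalance_ratio = counts[0] / counts[1]

print(f"always predicting {majority_class} gives {baseline_accuracy:.2f}% accuracy")
print(f"imbalance ratio (0s per 1): {imbalance_ratio:.2f}")
```

Any accuracy close to this baseline says little about whether the model has learned to recognize the minority class.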

In [37]:
# train a model before applying any technique to handle the imbalanced data

# separate the input and output data
x=d_set[['Age','EstimatedSalary']]
y=d_set['Purchased']

# split the data
xtrain, xtest, ytrain, ytest=train_test_split(x,y, test_size=0.2)

# model
lr=LogisticRegression()

# fit the data
lr.fit(xtrain, ytrain)

In [38]:
# score
lr.score(xtest, ytest)*100

68.75

In [39]:
# now sample predictions
# the dataset row with Age 19 and EstimatedSalary 19000 has Purchased = 0, and the model predicts 0
lr.predict([[19,19000]])



array([0], dtype=int64)

In [40]:
# the dataset row with Age 46 and EstimatedSalary 28000 has Purchased = 1
# but the model predicts 0, as shown below, which is wrong

lr.predict([[46, 28000]])



array([0], dtype=int64)

### result: the model predicts the wrong class for this sample, most likely because 0s outnumber 1s in the training data.

### hence we need to apply a technique to handle the imbalanced data so that the model treats both classes more evenly.
### for this purpose we need to install the imbalanced-learn package (imported as imblearn) by running --> pip install imbalanced-learn

# 1) Random Under sampling technique
##### Decrease the number of instances in the majority class by randomly removing instances.

In [41]:
# import the random under-sampler from imblearn
from imblearn.under_sampling import RandomUnderSampler

In [42]:
# technique
rus=RandomUnderSampler()

# resample the data
rus_x, rus_y=rus.fit_resample(x, y)

# check the values of input x
rus_x.head()

Unnamed: 0,Age,EstimatedSalary
270,43,133000
3,27,57000
317,35,55000
349,38,61000
295,36,63000


In [43]:
# y values
rus_y.value_counts()

0    143
1    143
Name: Purchased, dtype: int64

### result: the number of 0s has been reduced from 257 to 143, which is equal to the number of 1s

In [44]:
# now training the model again
x_train, x_test, y_train, y_test=train_test_split(rus_x, rus_y, test_size=0.2)

# model
rulr=LogisticRegression()

# fit the data
rulr.fit(x_train, y_train)

In [45]:
# score
rulr.score(x_test, y_test)*100

41.37931034482759

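### result: the accuracy (41.4%) is lower than that of the previous model (68.75%). With test_size=0.2 on 286 rows the model is scored on only 58 rows, so the score from a single random split is noisy.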

In [46]:
# same sample as before (Age 46, EstimatedSalary 28000, Purchased = 1 in the dataset), which the first model got wrong
rulr.predict([[46, 28000]]) 



array([1], dtype=int64)

### result: despite the lower accuracy, the model now predicts the right class for this sample
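
A single sample and an overall accuracy figure tell only part of the story. Per-class recall, the fraction of each class's rows that the model gets right, shows directly whether the minority class is being found. Below is a minimal sketch in plain Python; the helper names and the toy labels are hypothetical and are not results from this notebook. For real models, scikit-learn's `sklearn.metrics.classification_report` reports the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of all rows predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_per_class(y_true, y_pred):
    """Return {class: fraction of that class's rows predicted correctly}."""
    hits, totals = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        if t == p:
            hits[t] = hits.get(t, 0) + 1
    return {c: hits.get(c, 0) / totals[c] for c in totals}


# hypothetical ground truth: 8 rows of class 0 followed by 4 rows of class 1
y_true = [0] * 8 + [1] * 4

# hypothetical model A: always predicts the majority class
always_zero = [0] * 12

# hypothetical model B: makes mistakes on both classes
mixed = [0, 0, 0, 0, 0, 1, 1, 1] + [1, 1, 1, 0]

for name, y_pred in [("always_zero", always_zero), ("mixed", mixed)]:
    print(name, accuracy(y_true, y_pred), recall_per_class(y_true, y_pred))
```

Both hypothetical models have the same accuracy (8 of 12 rows correct), yet the first never finds a single 1 while the second finds three of the four. This is why a model with lower accuracy can still be the more useful one on the minority class.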

# 2) Random over sampling

In [47]:
# import the random over-sampler from imblearn
from imblearn.over_sampling import RandomOverSampler

In [48]:
ros=RandomOverSampler()

# fit the data
ros_x, ros_y=ros.fit_resample(x, y)

In [49]:
ros_x

Unnamed: 0,Age,EstimatedSalary
0,19,19000
1,35,20000
2,26,43000
3,27,57000
4,19,76000
...,...,...
509,59,76000
510,60,46000
511,47,144000
512,47,144000


In [50]:
ros_y.value_counts()

0    257
1    257
Name: Purchased, dtype: int64

### result: the number of 1s has been increased from 143 to 257, which is equal to the number of 0s
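
One caveat with over sampling: when rows are duplicated before `train_test_split`, copies of the same row can land in both the training set and the test set, which can make the test score look better than it should. A common remedy is to split first and resample only the training part; with imblearn that would mean calling `fit_resample` on the training split rather than on the full `x, y`. The sketch below illustrates the idea in plain Python on hypothetical row ids (not the notebook's data), so that the absence of overlap is easy to check.

```python
import random

rng = random.Random(0)

# hypothetical data: row ids 0..399, the first 257 labelled 0 and the rest labelled 1
labels = {i: (0 if i < 257 else 1) for i in range(400)}
ids = list(labels)
rng.shuffle(ids)

# 1) split first: hold out 20% of the original rows as the test set
n_test = len(ids) // 5
test_ids, train_ids = ids[:n_test], ids[n_test:]

# 2) over-sample the minority class inside the training set only
train_major = [i for i in train_ids if labels[i] == 0]
train_minor = [i for i in train_ids if labels[i] == 1]
extra = [rng.choice(train_minor) for _ in range(len(train_major) - len(train_minor))]
balanced_train = train_major + train_minor + extra

# 3) no test row can appear in the balanced training set
overlap = set(balanced_train) & set(test_ids)

print("training rows per class:", len(train_major), len(train_minor) + len(extra))
print("rows shared between train and test:", len(overlap))
```

The test set keeps the original class proportions, so the score measured on it reflects the data the model would actually face.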

In [51]:
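# train the model
xtrain, xtest, ytrain, ytest=train_test_split(ros_x, ros_y, test_size=0.2)

# model
rolr=LogisticRegression()

# fit the data on the over-sampled training split
rolr.fit(xtrain, ytrain)

In [52]:
# score on the over-sampled test split
rolr.score(xtest, ytest)*100

41.37931034482759

### result: the recorded score is identical to the under-sampling score in In [45] because it was produced while rolr was fit and scored on x_train/x_test, the under-sampled split. With the over-sampled split used above, the accuracy has to be measured again before comparing it with the previous models.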

In [53]:
# same sample as before (Age 46, EstimatedSalary 28000, Purchased = 1 in the dataset), which the first model got wrong
rolr.predict([[46, 28000]]) 



array([1], dtype=int64)

### result: the model predicts 1 for this sample, which matches the dataset, even though the reported accuracy is low