# Machine Learning - Supervised Learning

## Imbalanced Data Handling

* Class imbalance is a common problem in machine learning, where the number of instances in each class is not equally distributed. This imbalance can lead to poor performance of the model, especially for the minority class. Here are some techniques to handle class imbalance:

### Resampling Techniques
* **Oversampling**: Increase the number of instances in the minority class.

    * **Random Oversampling**: Randomly replicate minority class instances.
    * **SMOTE (Synthetic Minority Over-sampling Technique)**: Generate synthetic instances by interpolating between existing minority class instances.
* **Undersampling**: Reduce the number of instances in the majority class.

    * **Random Undersampling**: Randomly remove majority class instances.
    * **Cluster Centroids**: Cluster the majority class and replace each cluster with its centroid (the average of its points).
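
The two random resampling strategies above can be sketched in plain Python. This is only an illustration of the idea on toy data, not how `imblearn` implements it; the helper names `random_oversample` and `random_undersample` are made up for this example.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Replicate rows of smaller classes at random until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        extra = [rng.choice(rows) for _ in range(target - n)]
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so every class matches the smallest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [row for row, lab in zip(X, y) if lab == label]
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out

# Toy data (hypothetical): 6 majority rows (label 1) and 2 minority rows (label 0).
X = [[i, i * 2] for i in range(8)]
y = [1, 1, 1, 1, 1, 1, 0, 0]

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))   # both classes now have 6 rows
print(Counter(y_under))  # both classes now have 2 rows
```

Oversampling keeps all the original rows and only adds copies, while undersampling throws information away, which matters when the dataset is small.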

In [8]:
import pandas as pd
from imblearn.over_sampling import SMOTE, ADASYN
# If imblearn is missing, install it from a notebook cell with:
# %pip install imbalanced-learn

In [5]:
my_data = pd.read_csv("Bank_Imbalance.csv")
print(my_data.head())

   age  duration  emp_var_rate  cons_price_idx  cons_conf_idx  euribor3m  \
0   44       210           1.4          93.444          -36.1      4.963   
1   53       138          -0.1          93.200          -42.0      4.021   
2   28       339          -1.7          94.055          -39.8      0.729   
3   39       185          -1.8          93.075          -47.1      1.405   
4   55       137          -2.9          92.201          -31.4      0.869   

   nr_employed  y  
0       5228.1  1  
1       5195.8  1  
2       4991.6  1  
3       5099.1  1  
4       5076.2  1  


In [8]:
x = my_data.iloc[:, :7]  # the first 7 columns are the features
y = my_data.iloc[:, 7]   # the last column is the target

In [13]:
y.value_counts()

y
1    2396
0     603
Name: count, dtype: int64

In [14]:
y.value_counts().sum()

2999

In [15]:
(y.value_counts() / y.value_counts().sum()) * 100

y
1    79.893298
0    20.106702
Name: count, dtype: float64

### SMOTE

* SMOTE (Synthetic Minority Over-sampling Technique) is a widely used technique for addressing class imbalance. It generates synthetic samples for the minority class: it picks a minority sample, selects one of its nearest minority-class neighbours in feature space, and creates a new sample at a random point on the line segment between the two.
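
The interpolation step can be sketched in plain Python as follows. This is a simplified illustration of the idea, not the `imblearn` implementation; the function name `smote_sample`, the toy points, and the choice of `k=2` are assumptions made for the example.

```python
import math
import random

def smote_sample(minority, k, rng):
    """Create one synthetic point between a minority sample and one of its k nearest minority neighbours."""
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Toy minority class (hypothetical values).
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [8.0, 9.0]]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Every synthetic point lies on a segment between two real minority points, so new samples stay inside the region the minority class already occupies.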

In [18]:
smt = SMOTE()

In [29]:
x_smote, y_smote = smt.fit_resample(x, y)

In [31]:
y_smote.value_counts()  # class 0 previously had 603 samples; SMOTE added synthetic ones, so it now has 2396, the same as class 1

y
1    2396
0    2396
Name: count, dtype: int64

In [32]:
x.shape

(2999, 7)

In [33]:
x_smote.shape  # the row count grew because of oversampling

(4792, 7)

### ADASYN

* ADASYN (Adaptive Synthetic Sampling Approach) is an oversampling technique for handling class imbalance that builds on SMOTE. Like SMOTE, it generates synthetic data points for the minority class, but it generates more of them around difficult-to-learn examples near the decision boundary, which can help classifier performance.

* How ADASYN Works
    * **Identify Difficult Samples**: It identifies minority class samples that are difficult to classify.
    * **Generate Synthetic Samples**: It generates synthetic samples for the minority class, giving more weight to samples that are harder to learn.
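
The weighting step can be sketched as below, assuming the usual description of ADASYN: a minority sample counts as "difficult" in proportion to how many of its k nearest neighbours belong to the majority class. The sketch covers only the allocation of synthetic samples, not their generation; the function name `adasyn_allocation` and the toy data are hypothetical.

```python
import math

def adasyn_allocation(X, y, minority_label, k=3):
    """Return, for each minority sample, how many synthetic points it would be assigned."""
    counts = {label: y.count(label) for label in set(y)}
    n_min = counts[minority_label]
    n_maj = max(n for label, n in counts.items() if label != minority_label)
    total_needed = n_maj - n_min  # aim for full balance

    ratios = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # k nearest neighbours among all other samples, both classes.
        dists = sorted(
            (math.dist(xi, xj), yj)
            for j, (xj, yj) in enumerate(zip(X, y)) if j != i
        )
        majority_neighbours = sum(1 for _, yj in dists[:k] if yj != minority_label)
        ratios.append(majority_neighbours / k)

    total = sum(ratios)
    if total == 0:
        return [0] * len(ratios)  # no minority sample has majority neighbours
    # Normalise the ratios and round each share to a whole number of samples.
    return [round(r / total * total_needed) for r in ratios]

# Toy data: 8 majority points (label 1) and 4 minority points (label 0).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [2, 2], [0, 2], [2, 0],
     [0.6, 0.6], [5, 5], [5.2, 5.1], [5.1, 5.3]]
y = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

allocation = adasyn_allocation(X, y, minority_label=0, k=3)
print(allocation, sum(allocation))
```

The minority point surrounded by majority points receives the largest share. Because each share is rounded separately, the total does not have to equal the exact class gap, which is one plausible reason an ADASYN result can be close to balanced without matching the majority count exactly.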

In [36]:
ads = ADASYN()

In [37]:
x_ads, y_ads = ads.fit_resample(x, y)

In [42]:
y_ads.value_counts()  # now roughly balanced, but the counts differ slightly from SMOTE's

y
0    2404
1    2396
Name: count, dtype: int64

# Hyperparameter Optimization

* Hyperparameter optimization is the process of finding the best set of hyperparameters for a machine learning model. The goal is to identify the combination of hyperparameters that results in the best performance of the model, typically measured by some validation metric.

Methods of Hyperparameter Optimization:

**1) Grid Search**:

* A systematic way of working through multiple combinations of hyperparameter values, specified in a grid.
* The model is evaluated for each combination using cross-validation.
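
To make the cross-validation step concrete, here is a minimal plain-Python sketch of k-fold splitting with contiguous, unshuffled folds. The helper names `k_fold_indices` and `cv_error`, the toy target values, and the "predict the training mean" baseline are all assumptions chosen to keep the example short.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs over contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def cv_error(y, n_splits):
    """Average validation MSE of a baseline that always predicts the training mean."""
    fold_errors = []
    for train_idx, val_idx in k_fold_indices(len(y), n_splits):
        prediction = sum(y[i] for i in train_idx) / len(train_idx)
        mse = sum((y[i] - prediction) ** 2 for i in val_idx) / len(val_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)

# Toy target values (hypothetical).
y_toy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
folds = list(k_fold_indices(len(y_toy), n_splits=5))
print(folds[0])
print(cv_error(y_toy, n_splits=5))
```

Each sample is used for validation exactly once, so the averaged error uses all of the data without ever scoring a model on rows it was trained on.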

**2) Random Search**:

* Instead of trying all combinations, it randomly samples from the hyperparameter space.
* Can be more efficient than grid search for large parameter spaces.
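
A minimal sketch of random search, using only the standard library. The `validation_error` function is a hypothetical stand-in for a real cross-validated error, and the two hyperparameter names and their ranges are illustrative assumptions.

```python
import random

def validation_error(n_neighbors, weight_power):
    """Hypothetical stand-in for a cross-validated error (lower is better)."""
    return (n_neighbors - 7) ** 2 / 50 + (weight_power - 1.5) ** 2

rng = random.Random(0)
n_iter = 20  # fixed budget, independent of how large the search space is
trials = []
for _ in range(n_iter):
    params = {
        "n_neighbors": rng.randint(1, 30),      # integer range, sampled uniformly
        "weight_power": rng.uniform(0.0, 3.0),  # continuous range, sampled uniformly
    }
    trials.append((validation_error(**params), params))

best_error, best_params = min(trials, key=lambda t: t[0])
print(best_params, round(best_error, 4))
```

The budget (`n_iter`) is chosen up front, whereas a full grid over the same ranges grows multiplicatively with every added hyperparameter.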

**3) Bayesian Optimization**:

* Uses a probabilistic model to predict the performance of different hyperparameter combinations.
* More efficient by focusing on promising areas of the hyperparameter space.
  
Tools:

* Hyperopt
* BayesianOptimization
  
**4) Genetic Algorithms**:

* Inspired by the process of natural selection.
* Combines exploration and exploitation by iteratively selecting, recombining, and mutating a population of hyperparameter combinations.

Tools:

* TPOT
* DEAP
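
The select, recombine, mutate loop can be sketched in a few lines of plain Python. This is a toy illustration and does not reflect how TPOT or DEAP are implemented; the `fitness` function, the two hyperparameters (`depth`, `rate`), and all constants are hypothetical.

```python
import random

def fitness(params):
    """Hypothetical score to maximise; it peaks at depth=6, rate=0.1."""
    depth, rate = params
    return -((depth - 6) ** 2) - 100 * (rate - 0.1) ** 2

def mutate(params, rng):
    """Nudge each hyperparameter slightly, staying inside its allowed range."""
    depth, rate = params
    depth = min(12, max(1, depth + rng.choice([-1, 0, 1])))
    rate = min(1.0, max(0.001, rate * rng.uniform(0.8, 1.25)))
    return (depth, rate)

def crossover(p1, p2, rng):
    """Take each hyperparameter from one of the two parents at random."""
    return tuple(rng.choice(pair) for pair in zip(p1, p2))

rng = random.Random(1)
population = [(rng.randint(1, 12), rng.uniform(0.001, 1.0)) for _ in range(12)]
initial_best = max(fitness(p) for p in population)

for generation in range(15):
    population.sort(key=fitness, reverse=True)
    parents = population[:4]  # selection: keep the 4 fittest
    children = []
    while len(children) < len(population) - len(parents):
        p1, p2 = rng.sample(parents, 2)
        children.append(mutate(crossover(p1, p2, rng), rng))  # recombination + mutation
    population = parents + children  # parents survive unchanged (elitism)

best = max(population, key=fitness)
print(best, round(fitness(best), 4))
```

Because the fittest candidates are carried over unchanged, the best score never gets worse from one generation to the next; mutation supplies exploration, selection supplies exploitation.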
  
**5) Gradient-based Optimization**:

* Uses gradients to optimize hyperparameters.
* Suitable for differentiable hyperparameters.

## Grid Search
* When there are two hyperparameters, every combination (pair) of their values has to be tried.
* Example: with a = 1, 2, 3 and b = 10, 20, the candidates are (1, 10), (1, 20), (2, 10), ..., (3, 20). Each combination is fitted and cross-validated separately, and the one with the lowest cross-validation error (equivalently, the best cross-validation score) is chosen.
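
The enumeration in this example can be written out with `itertools.product`. The `validation_error` function below is a hypothetical stand-in for a real cross-validated error, chosen so that the best pair is easy to check by hand.

```python
from itertools import product

param_grid = {"a": [1, 2, 3], "b": [10, 20]}

def validation_error(a, b):
    """Hypothetical stand-in for a cross-validated error; lowest at a=2, b=20."""
    return (a - 2) ** 2 + abs(b - 20) / 10

# Build every (a, b) combination: 3 values of a times 2 values of b.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Score each combination separately and keep the one with the lowest error.
scores = [(validation_error(**combo), combo) for combo in combos]
best_error, best_params = min(scores, key=lambda pair: pair[0])

print(len(combos))   # 6
print(best_params)   # {'a': 2, 'b': 20}
```

The number of combinations is the product of the list lengths, so adding a third hyperparameter with 5 values would raise the count from 6 to 30.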

In [10]:
from sklearn.model_selection import train_test_split, GridSearchCV, KFold # "KFold" technique is used for cross validation
from sklearn.neighbors import KNeighborsRegressor

In [11]:
data_hyper = pd.read_csv("Boston.csv")
data_hyper.head()

Unnamed: 0,crim,zn,indus,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,0.00632,18.0,2.31,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [12]:
x = data_hyper.iloc[:, :12].values
y = data_hyper.iloc[:, 12].values

In [13]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

In [14]:
param_s = {"n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}  # a single hyperparameter
model = KNeighborsRegressor()
c_vals = KFold(n_splits=10)

In [15]:
gsearch = GridSearchCV(model, param_s, cv=c_vals)

In [16]:
results = gsearch.fit(x_train, y_train)

In [17]:
results.best_params_  # nothing to do by hand: GridSearchCV picks the best value itself (with cross_val_score we would have to loop over the candidates and compare the scores ourselves)

{'n_neighbors': 5}