# Table of Contents
- [Classification: a first example](#Classification:-a-first-example)
- [Model Evaluation](#Model-evaluation)
    - Holdout evaluation
    - *k*-fold Cross-Validation
- [Evaluation Metrics for classification problems](#Evaluation-Metrics-for-classification-problems)
- [Evaluation on Iris Dataset](#Evaluation-on-Iris-Dataset)
- [Feature Selection](#Feature-Selection)
- [Sampling and rebalancing with `imblearn`](#Sampling-and-rebalancing-with-imblearn)


In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

## Classification: a first example

K nearest neighbors (kNN) is one of the simplest learning strategies: given a new, unknown observation, look up the *k* observations in your reference database with the closest features and assign the predominant class among them.

![KNN](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/355px-KnnClassification.svg.png)
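
The lookup-and-vote rule described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not how scikit-learn implements it: the helper `knn_predict`, the toy reference points, and the choice of Euclidean distance are assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_ref, y_ref, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest reference points."""
    # Euclidean distance from x_new to every reference observation
    distances = np.linalg.norm(X_ref - x_new, axis=1)
    # Indices of the k closest reference observations
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those neighbors
    return Counter(y_ref[nearest].tolist()).most_common(1)[0][0]

# Toy reference database (made-up values): two well-separated classes
X_ref = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_ref = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_ref, y_ref, np.array([1.1, 1.0]))
pred_b = knn_predict(X_ref, y_ref, np.array([5.1, 5.0]))
print(pred_a, pred_b)
```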

Let's try it out on the IRIS dataset.

In [None]:
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# print(iris.DESCR)

Let's have a look at the documentation of the sklearn implementation of the KNN classifier.

Typically, for a sklearn estimator, you can rely on the following sources:
- API: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
- User guide: https://scikit-learn.org/stable/modules/neighbors.html#classification
- Examples: https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py

In [None]:
from sklearn.neighbors import KNeighborsClassifier
KNeighborsClassifier?

**1. Create the "model"**


In [None]:
knn = KNeighborsClassifier(n_neighbors = 5)

**2. Fit the model**

In [None]:
knn.fit(X, y)

**3. Use the model**: what kind of iris has a 3cm x 5cm sepal and a 4cm x 2cm petal? Call the `predict` method.

In [None]:
result = knn.predict([[3, 5, 4, 2],])
print(result)
print(iris.target_names[result])

## Model evaluation
The final goal of a classification algorithm is to correctly classify a previously unseen example. In other words we want to assess the **generalization capability** of our model. Therefore it is not sufficient to solve an optimization problem on the examples used for training. 

### Holdout method
In the holdout method, the input dataset is split into two separate sets:  **training set** and **test set**. 

- the **training set** is the experience the model learns from: an optimization procedure finds the parameter configuration that minimizes the training error.
- the **test set** is used to measure the actual performance of the model, thus its generalization capability. 

The ability to make inferences on previously unseen examples rests on an assumption about the data-generating process (the i.i.d. assumption): examples in the training and test sets are assumed to be independent of each other and identically distributed.




**It is of the utmost importance that the actual test set is not used to make choices about the model and its parameters/hyperparameters.**
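
One common way to respect this rule is to carve a separate validation set out of the training data and use it for every model choice. The sketch below assumes a 60/20/20 split and a hypothetical list of candidate values for `n_neighbors`; both are illustrative choices, not a prescription.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# First carve out the test set, then split the remainder into training and validation
X_trval, X_te, y_trval, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_trval, y_trval, test_size=0.25, stratify=y_trval, random_state=0)

# Choose the hyperparameter using the validation set only
candidate_k = [1, 3, 5, 7, 9]
val_scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_va, y_va)
              for k in candidate_k}
best_k = max(val_scores, key=val_scores.get)

# The test set is touched once, at the very end
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_trval, y_trval)
test_accuracy = final_model.score(X_te, y_te)
print(best_k, test_accuracy)
```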


### *k*-fold cross-validation

The **simple holdout evaluation approach can be problematic if the resulting test set is small**: the sampled test examples may not be representative of the actual distribution of our dataset. Training (and evaluating) our model on different random splits would result in different values of model performance. In other words, we may observe statistical uncertainty around the estimated average generalization error.

To address this issue, *k*-fold cross validation is commonly adopted.
The procedure consists of the following steps:
- the dataset is split into *k* non-overlapping subsets,
- for each subset *i*, the model is trained on the union of the remaining *k-1* subsets and evaluated on the *i-th* subset itself,
- after *k* iterations, we have *k* values of model performance: the final score is obtained by averaging the scores across the *k* trials.

Using *k*-fold cross-validation allows us to estimate the average generalization error using all the examples, at the price of an increased runtime (*k* training procedures).
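
The steps above can be sketched by hand before turning to the library helpers. This is a minimal illustration; the fold count, the random seed and the kNN classifier are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
n_folds = 5

# Shuffle the indices once, then cut them into k non-overlapping subsets
rng = np.random.default_rng(0)
folds = np.array_split(rng.permutation(len(X_demo)), n_folds)

fold_scores = []
for i, test_idx in enumerate(folds):
    # Train on the union of the other k-1 subsets, evaluate on subset i
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    model = KNeighborsClassifier(n_neighbors=5).fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# Final score: average over the k trials
mean_score = np.mean(fold_scores)
print(fold_scores, mean_score)
```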

![kfoldcv](https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg)


For both *single holdout validation* and *k-fold cross-validation* we can rely on the implementations in the `model_selection` module of the `scikit-learn` (sklearn) library.

- `train_test_split`: [docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- `KFold` [docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn-model-selection-kfold), or (better) `StratifiedKFold` [docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html?highlight=stratified#sklearn.model_selection.StratifiedKFold), that is a variation of k-fold in which each set contains approximately the same percentage of samples of each target class as the complete set.
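
To see what stratification buys, the sketch below counts the minority-class examples that end up in each test fold. The 90/10 label vector is made up for illustration, and `minority_per_fold` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 examples of class 0, 10 of class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

def minority_per_fold(splitter):
    """Count class-1 examples in each test fold."""
    return [int(y_demo[test_idx].sum()) for _, test_idx in splitter.split(X_demo, y_demo)]

plain_counts = minority_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = minority_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print("KFold:          ", plain_counts)
print("StratifiedKFold:", strat_counts)
```

With plain `KFold` the minority count may vary from fold to fold, whereas `StratifiedKFold` keeps it constant.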


## Evaluation Metrics for classification problems

Machine learning models are often used to predict the outcomes of a classification problem. Predictive models rarely predict everything perfectly, so there are many performance metrics that can be used to analyze our models.

When you run a prediction on your data to distinguish among two classes (*positive* and *negative* classes, for simplicity), your results can be broken down into 4 parts:

<img src="images/classification_report.png" alt="drawing" width="450"/>

* **True Positives**: data in class *positive* that the model predicts will be in class *positive*;
* **True Negatives**: data in class *negative* that the model predicts will be in class *negative*;
* **False Positives**: data in class *negative* that the model predicts will be in class *positive*;
* **False Negatives**: data in class *positive* that the model predicts will be in class *negative*.

The most common performance metrics in this binary classification scenario are the following:

* **accuracy**: the fraction of observations (both positive and negative) predicted correctly:

$$ Accuracy = \frac{(TP+TN)}{(TP+FP+TN+FN)} $$

* **precision** (or **positive predictive value**): the fraction of predicted positive observations that are actually positive:

$$ Precision = \frac{TP}{(TP+FP)} $$

* **recall** (or **sensitivity** or **True Positive Rate (TPR)**): the fraction of positive observations that are predicted correctly:

$$ Recall = \frac{TP}{(TP+FN)} $$

* **False Positive Rate (FPR)** (or **1-specificity**): the fraction of negative observations that are wrongly predicted as positive (false alarm rate):

$$ FPR = \frac{FP}{(FP+TN)} $$


* **f1-score**: a composite measure that combines both precision and recall (harmonic mean):

$$ F_1 = \frac{2 \cdot P \cdot R}{(P+R)}$$
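
As a quick check of the formulas above, the sketch below counts TP, TN, FP and FN on a small made-up label vector and evaluates each metric by hand. The results can be compared with the corresponding functions in `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up binary labels (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))
tn = int(np.sum((y_true == 0) & (y_hat == 0)))
fp = int(np.sum((y_true == 0) & (y_hat == 1)))
fn = int(np.sum((y_true == 1) & (y_hat == 0)))

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
fpr = fp / (fp + tn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, fpr, f1)
```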

The **confusion matrix** is useful for quickly calculating precision and recall given the predicted labels from a model. A confusion matrix for binary classification shows the four different outcomes: true positive, false positive, true negative, and false negative. The actual values form the columns, and the predicted values (labels) form the rows. The intersection of a row and a column shows one of the four outcomes. 

![confusion-matrix.png](images/confusion-matrix.png)

What if we have more than two classes?
We can still plot the confusion matrix, as shown in the following example based on the Iris dataset (in scikit-learn's plots, rows are the true labels and columns are the predicted labels).
![cm_multiclass](https://scikit-learn.org/stable/_images/sphx_glr_plot_confusion_matrix_001.png)



We can also still evaluate the **metrics per class**, in a One-vs-Rest fashion. For instance:
- $precision_{virginica}$: number of correctly predicted virginica records (9) out of all predicted virginica records (9+6+0=15), which amounts to 9/15=0.6
- $recall_{virginica}$: number of correctly predicted virginica records (9) out of the number of actual virginica records (9+0+0=9), which amounts to 9/9=1
- $precision_{versicolor}$: number of correctly predicted versicolor records (10) out of all predicted versicolor records (10+0+0=10), which amounts to 10/10=1
- ...

and so on, for the other classes.
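
The same per-class arithmetic can be read directly off the matrix: the diagonal divided by the column sums gives precision, and divided by the row sums gives recall. In the sketch below the versicolor and virginica counts are the ones quoted above, while the setosa count (13) is an assumption made only to complete the matrix.

```python
import numpy as np

# Rows = true class, columns = predicted class (setosa, versicolor, virginica).
# The setosa count (13) is assumed for illustration.
cm = np.array([[13, 0, 0],
               [0, 10, 6],
               [0, 0, 9]])
class_names = ["setosa", "versicolor", "virginica"]

precision_per_class = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted as that class
recall_per_class = np.diag(cm) / cm.sum(axis=1)     # correct / all actually in that class

for name, p, r in zip(class_names, precision_per_class, recall_per_class):
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```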

The ROC (Receiver Operating Characteristic) curve is another useful tool to evaluate classifier output quality.
ROC curves feature **true positive rate** on the Y axis, and **false positive rate** on the X axis.

ROC curves are typically used in binary classification to study the output of a classifier. In order to extend ROC curve and ROC area to multi-label classification, it is necessary to binarize the output. 

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X_b, y_b = make_classification(random_state = 2)

plt.scatter(X_b[:, 0], 
            X_b[:, 1], 
            c = y_b, 
            edgecolors = "k")

In [None]:
X_b_train, X_b_test, y_b_train, y_b_test = train_test_split(X_b, y_b, random_state = 0) 

In [None]:
clf = SVC(random_state = 0, probability = True).fit(X_b_train, y_b_train)

In [None]:
y_b_pred = clf.predict_proba(X_b_test) 

An actual curve (multiple points in the FPR-TPR space) can be obtained if we predict *probabilities*.

If we predict *labels* directly, we simply obtain a single point in the FPR-TPR space.
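
The difference can be made concrete with a tiny made-up example: sweeping a threshold over predicted probabilities yields one (FPR, TPR) point per threshold, while hard labels correspond to a single fixed threshold. The helper `roc_point` and the 0.5 cut-off are assumptions for this sketch.

```python
import numpy as np

# Made-up true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def roc_point(y_true, y_hat):
    """Return (FPR, TPR) for one set of hard label predictions."""
    tp = np.sum((y_true == 1) & (y_hat == 1))
    fp = np.sum((y_true == 0) & (y_hat == 1))
    return float(fp / np.sum(y_true == 0)), float(tp / np.sum(y_true == 1))

# Probabilities: one (FPR, TPR) point per threshold, from strictest to most permissive
thresholds = np.sort(np.unique(scores))[::-1]
curve = [roc_point(y_true, (scores >= t).astype(int)) for t in thresholds]

# Labels only: the threshold is fixed (0.5 here), so a single point remains
single_point = roc_point(y_true, (scores >= 0.5).astype(int))
print(curve)
print(single_point)
```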


In [None]:
y_b_pred

In [None]:
RocCurveDisplay.from_predictions(y_b_test, y_b_pred[:, 1]) # display from predictions. There is also the "from_estimator" method.
plt.plot([0, 1], [0, 1], color = "navy", lw = 2, linestyle = "--")
plt.show()


## Evaluation on Iris Dataset

### Holdout evaluation

In [None]:
pd.Series(y).value_counts()

Split the dataset into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split
train_test_split?

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3, 
                                                    random_state = 0)


In [None]:
pd.Series(y_train).value_counts()

In [None]:
pd.Series(y_test).value_counts()

Train the classification model based on the training set and predict the labels for the instances in the test set.

In [None]:
knn = KNeighborsClassifier(n_neighbors = 5).fit(X_train, y_train)
y_pred = knn.predict(X_test)

Evaluate the performance of the model.

In [None]:
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay
print(accuracy_score(y_test, y_pred))

Sanity check: how is accuracy defined?

In [None]:
np.sum(y_test == y_pred) / len(y_test)

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, 
                                        y_pred,
                                        display_labels = iris.target_names)
plt.show()

In [None]:
print(classification_report(y_test, y_pred, target_names = iris.target_names))

Optionally, you can get a dictionary from `classification_report`:

In [None]:
classification_report(y_test, y_pred, target_names = iris.target_names, output_dict = True)

Note on metrics averaging:
- **macro**: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- **weighted**: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance.
- **micro**: Calculate metrics globally by counting the total true positives, false negatives and false positives.
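
A small made-up example may help tell the three averages apart. The sketch computes per-class precision once and then asks `precision_score` for each averaging mode; macro should equal the plain mean of the per-class values, weighted the support-weighted mean, and micro the global ratio.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up imbalanced labels: class 0 has 6 examples, classes 1 and 2 have 2 each
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 2, 2, 2])

per_class = precision_score(y_true, y_hat, average=None)
macro = precision_score(y_true, y_hat, average="macro")
weighted = precision_score(y_true, y_hat, average="weighted")
micro = precision_score(y_true, y_hat, average="micro")

support = np.bincount(y_true)  # number of true instances per class
print(per_class, macro, weighted, micro)
```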


### Cross-validation

In [None]:
from sklearn.model_selection import KFold
kf = KFold(n_splits = 5,
           shuffle = True,
           random_state = 123
          )

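fold_accuracies = []
for enu, (train_index, test_index) in enumerate(kf.split(iris.data, iris.target)):
    print(f'------------------------fold {enu}------------------------')
    
    print("TRAIN:", train_index.shape)
    print(train_index)
    print("TEST:", test_index.shape)
    print(test_index)
    print("TEST LABELS:")
    print(iris.target[test_index])
    
    # train a model on the current training set
    fold_knn = KNeighborsClassifier(n_neighbors = 5).fit(iris.data[train_index], iris.target[train_index])
    # test the model on the current test set
    fold_accuracies.append(accuracy_score(iris.target[test_index], fold_knn.predict(iris.data[test_index])))
    print("ACCURACY:", fold_accuracies[-1])
    
    print(f'----------------------end fold {enu}----------------------','\n')

# compute average metrics
print("Mean accuracy:", np.mean(fold_accuracies))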

**Shuffling** may be essential to get a meaningful cross-validation result if data ordering is not arbitrary (e.g., samples with the same class label are contiguous).
By default no shuffling occurs, including for the (stratified) K fold cross-validation.
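
The Iris data is a convenient worst case, since its rows are sorted by class. In the sketch below (3 folds chosen on purpose, so that each unshuffled test fold holds exactly one class), the unshuffled scores collapse because the class being tested never appears in the training set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)  # rows are sorted by class: 50 + 50 + 50

# Without shuffling, each of the 3 test folds holds exactly one class,
# which is then absent from the corresponding training set
scores_ordered = cross_val_score(KNeighborsClassifier(), X_demo, y_demo,
                                 cv=KFold(n_splits=3))
scores_shuffled = cross_val_score(KNeighborsClassifier(), X_demo, y_demo,
                                  cv=KFold(n_splits=3, shuffle=True, random_state=0))
print(scores_ordered)
print(scores_shuffled)
```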

Actually, we do not need to implement the loop above. Indeed, `sklearn` offers specific functions for evaluating model performance in cross-validation.

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(KNeighborsClassifier(), 
                X, 
                y, 
                cv = KFold(5, shuffle = True, random_state = 123)) 
# note that the default scorer for KNeighborsClassifier is the accuracy score

The `cross_validate` function differs from `cross_val_score` in two ways:
- It allows specifying multiple metrics for evaluation.
- It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, f1_score, accuracy_score
results = cross_validate(KNeighborsClassifier(), 
                         iris.data,
                         iris.target,
                         scoring = {
                             'fscore': make_scorer(f1_score, average = 'macro'),
                             'accuracy': make_scorer(accuracy_score)},
                         return_estimator = True,
                         cv = KFold(5, shuffle = True, random_state = 123),
                         n_jobs = -1) # Number of jobs to run in parallel. 
                                      # Training the estimator and computing the score are parallelized over the cross-validation splits.
results

How to obtain the ***average* classification report** over cross-validation?

In [None]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 123)

list_df = []
list_accuracy = []

k = 1
for train, val in skf.split(iris.data, iris.target):
    print(f'FOLD {k}')

    # fit and predict using classifier
    X_tr = iris.data[train]
    y_tr = iris.target[train]
    X_val = iris.data[val]
    y_val = iris.target[val]
    clf = KNeighborsClassifier()
    clf.fit(X_tr,y_tr)
    y_pred = clf.predict(X_val)

    # compute classification report
    cr = classification_report(y_val, y_pred, output_dict = True, zero_division = np.nan) # important!
    print(classification_report(y_val, y_pred, zero_division = np.nan))

    # store accuracy
    list_accuracy.append(cr['accuracy'])

    # store per-class metrics as a dataframe
    df = pd.DataFrame({label: metrics for label, metrics in cr.items() if label != 'accuracy'}) # distinct names avoid confusion with the fold counter k
    display(df)
    list_df.append(df)
    k+=1
    

# compute average per-class metrics    
df_concat = pd.concat(list_df)
grouped_by_row_index = df_concat.groupby(df_concat.index)
df_avg = grouped_by_row_index.mean()

# compute average accuracy
accuracy_avg = np.mean(list_accuracy)

In [None]:
accuracy_avg

In [None]:
df_avg

We can also generate cross-validated estimates for each input data point, by juxtaposing the predictions obtained on the held-out fold at each iteration of the CV (remember that each sample belongs to exactly one test set).

Passing these predictions into an evaluation metric may not be a valid way to measure generalization performance. 
Results can differ from `cross_validate` and `cross_val_score` unless all test sets have equal size and the metric decomposes over samples.
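
The sketch below compares the two ways of aggregating on Iris, under the assumption of 5 stratified folds of equal size. Accuracy decomposes over samples, so the pooled value and the mean over folds should coincide; macro-F1 does not decompose, so its two values are not guaranteed to match.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
model = KNeighborsClassifier()

# Pooled: one metric computed on the juxtaposed out-of-fold predictions
y_oof = cross_val_predict(model, X_demo, y_demo, cv=cv)
pooled_acc = accuracy_score(y_demo, y_oof)
pooled_f1 = f1_score(y_demo, y_oof, average="macro")

# Averaged: one metric per fold, then the mean
mean_acc = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy").mean()
mean_f1 = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_macro").mean()

print(f"accuracy  pooled={pooled_acc:.4f}  fold mean={mean_acc:.4f}")
print(f"macro-F1  pooled={pooled_f1:.4f}  fold mean={mean_f1:.4f}")
```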

In [None]:
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(clf, iris.data, iris.target, cv = skf)
ConfusionMatrixDisplay.from_predictions(iris.target, y_pred)
plt.show()

## Feature Selection
Feature selection / dimensionality reduction can be used either to improve estimators' accuracy scores or to boost their performance on very high-dimensional datasets.



### A baseline: Variance threshold



Removes all features whose variance doesn't meet some threshold. 

In [None]:
X_train

In [None]:
from sklearn.preprocessing import MinMaxScaler
X_normalized = MinMaxScaler().fit_transform(X_train)
X_normalized[:10, :] # display just the first ten rows

In [None]:
X_normalized.var(axis = 0)

In [None]:
from sklearn.feature_selection import VarianceThreshold
fsel = VarianceThreshold(threshold = 0.05)
fsel.fit_transform(X_normalized)[:10, :] # display just the first ten rows


### Univariate feature selection
Univariate feature selection works by examining each feature individually: the best features are selected based on univariate statistical tests. 

Scikit-learn exposes feature selection routines as objects that implement the *transform* method. The most popular *selection routines* are the following:
- `SelectKBest`: Select features according to the *k* highest scores.
- `SelectPercentile`: Select features according to a percentile of the highest scores (percent of features to keep).

These selection routines take as input a *scoring function* that returns univariate scores:
- classification
    - `chi2`: Compute chi-squared stats between each non-negative feature and class.
    - `f_classif`: Compute the ANOVA F-value for the provided sample. (*Note: F-test captures only linear dependency*)
    - `mutual_info_classif`: Estimate mutual information for a discrete target variable. (*Note: mutual information can capture any kind of dependency between variables*)
- regression
    - `f_regression`: Univariate linear regression tests returning F-statistic and p-values.
    - `mutual_info_regression`: Estimate mutual information for a continuous target variable.
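
As a minimal sketch of how a selection routine and a scoring function fit together, the example below pairs `SelectKBest` with `chi2` on the original four Iris features (all non-negative, as `chi2` requires). The choice of `k=2` is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X_demo, y_demo = load_iris(return_X_y=True)  # all four features are non-negative

# Pair a selection routine (SelectKBest) with a scoring function (chi2)
kbest = SelectKBest(score_func=chi2, k=2)
X_reduced = kbest.fit_transform(X_demo, y_demo)

print(kbest.scores_)                    # one univariate score per feature
print(kbest.get_support(indices=True))  # indices of the retained features
print(X_reduced.shape)
```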



A practical example.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# The iris dataset
X, y = load_iris(return_X_y = True)

In [None]:
# Some noisy features, not correlated with the target
E = np.random.RandomState(32).uniform(0, 0.1, size = (X.shape[0], 20))

In [None]:
# Add the noisy data to the informative features
X = np.hstack((X, E))
display(pd.DataFrame(X))

In [None]:
# Split dataset to select feature and evaluate the classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 0)

In [None]:
pd.DataFrame(X)[4].plot(kind = 'hist', bins = 7)
plt.title('column 4')

In [None]:
X_train.shape, X_test.shape

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k = 4)
selector.fit(X_train, y_train)


In [None]:
f_classif?

The *f_classif* function evaluates the one-way ANOVA test.

The one-way ANOVA tests the null hypothesis that 2 or more groups have the same population mean. The test is applied to samples from two or more groups, possibly with differing sizes. 

In our setting, the *groups* are the sets of samples belonging to a given class. Intuitively, if, for a given feature, the sets of samples belonging to different classes have the same mean, then that feature is "**not useful**".
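
To make the statistic less of a black box, here is a sketch that computes the one-way ANOVA F value by hand (between-group variance over within-group variance) on the original Iris features and compares it with `f_classif`. The helper `anova_f` is written for this illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X_demo, y_demo = load_iris(return_X_y=True)

def anova_f(feature, labels):
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    groups = [feature[labels == c] for c in np.unique(labels)]
    grand_mean = feature.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(feature) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

manual_f = np.array([anova_f(X_demo[:, j], y_demo) for j in range(X_demo.shape[1])])
sklearn_f, _ = f_classif(X_demo, y_demo)
print(manual_f)
print(sklearn_f)
```

A large F value means the class means are far apart relative to the spread within each class.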

In [None]:
f_statistics, p_values = f_classif(X[:, :2], y)
print(f'f-stat Att.0: {f_statistics[0]:.4}  -  p-value Att.0: {p_values[0]:.3}')
print(f'f-stat Att.1: {f_statistics[1]:.4}  -  p-value Att.1: {p_values[1]:.3}')


For the first two attributes, the p-value is close to zero. At a significance level $\alpha$ (e.g. $\alpha = 0.05$) we can reject the null hypothesis: the three groups (one per class) do not all have the same mean, so the feature is "**useful**" and should be retained.

In [None]:
X_indices = np.arange(X.shape[-1])
plt.bar(X_indices - 0.05, selector.pvalues_, width = 0.2)
plt.title("Feature univariate score")
plt.xlabel("Feature number")
plt.ylabel(r"Univariate score ($p_{value}$)")
plt.show()

In [None]:
scores = - np.log10(selector.pvalues_)
scores /= scores.max()

In [None]:
X_indices = np.arange(X.shape[-1])
plt.bar(X_indices - 0.05, scores, width = 0.2)
plt.title("Feature univariate score")
plt.xlabel("Feature number")
plt.ylabel(r"Univariate score ($-Log(p_{value})$)")
plt.show()

In [None]:
X_out = selector.transform(X_train)
display(pd.DataFrame(X_train).head())
display(pd.DataFrame(X_out).head())

In [None]:
selector.get_feature_names_out()

### Recursive feature elimination (RFE)
Suppose you can rely on an external estimator (classification/regression model) that assigns weights to features.

In RFE, first the estimator is trained on the initial set of features and the importance of each feature is obtained either through a specific attribute (e.g., `coef_`, `feature_importances_`) or a callable. Then, the least important feature (or features, if `step` > 1 is used) is pruned from the current set of features. This procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter = 1000) 

# LogisticRegression has attribute "coef_", indicating coefficient of the features in the decision function.

rfe = RFE(estimator = lr, n_features_to_select = 1, step = 1)
rfe.fit(X_train, y_train)
rfe.ranking_


### Feature selection using `SelectFromModel`

`SelectFromModel` is a meta-transformer that can be used alongside any estimator that assigns importance to each feature through a specific attribute (such as `coef_`, `feature_importances_`) or via an `importance_getter` callable after fitting. Features are considered unimportant and removed if their importance falls below the provided `threshold` parameter (other built-in heuristics for setting the threshold are also available).

Unlike RFE, `SelectFromModel` involves no iteration.
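
A minimal sketch of `SelectFromModel` follows. It rebuilds the Iris-plus-noise matrix used in this section and assumes a logistic regression as the underlying estimator with the `"mean"` threshold heuristic; both are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Iris plus 20 uninformative columns, mirroring the setup used in this section
X_iris, y_demo = load_iris(return_X_y=True)
noise = np.random.RandomState(32).uniform(0, 0.1, size=(X_iris.shape[0], 20))
X_demo = np.hstack((X_iris, noise))

# Fit the estimator once, then keep the features whose importance
# (derived from coef_) reaches the mean importance
sfm = SelectFromModel(LogisticRegression(max_iter=1000), threshold="mean")
X_selected = sfm.fit_transform(X_demo, y_demo)

print(sfm.get_support(indices=True))
print(X_selected.shape)
```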


### Sequential Feature Selection (SFS)

It can be either forward or backward:
- *Forward-SFS* is a greedy procedure that iteratively finds the best new feature to add to the set of selected features. Concretely, we initially start with zero features and find the one feature that **maximizes a cross-validated score** when an estimator is trained on this single feature. Once that first feature is selected, we repeat the procedure by adding a new feature to the set of selected features. The procedure stops when the desired number of selected features is reached, as determined by the `n_features_to_select` parameter.
- *Backward-SFS* follows the same idea but works in the opposite direction: instead of starting with no features and greedily adding features, we start with all the features and greedily remove features from the set. The `direction` parameter controls whether forward or backward SFS is used.


As it relies on a cross-validated score for selecting the best feature at each iteration, SFS differs from RFE and SelectFromModel in that it does not require the underlying model to expose a `coef_` or `feature_importances_` attribute. It may however be slower considering that more models need to be evaluated, compared to the other approaches.
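
A back-of-the-envelope count shows where the extra cost comes from. The sketch assumes the greedy scheme described above, with every candidate subset scored by cross-validation, and uses this section's setting (24 features, 4 to select, 5 folds). `sfs_candidate_evaluations` is a hypothetical helper, not part of scikit-learn.

```python
def sfs_candidate_evaluations(n_features, n_to_select, direction):
    """Count the candidate feature subsets scored by a greedy sequential search."""
    if direction == "forward":
        # Step s (0-based) tries each of the n_features - s features not yet selected
        return sum(n_features - s for s in range(n_to_select))
    # Backward: each step tries dropping each of the features still in the set
    return sum(range(n_to_select + 1, n_features + 1))

n_features, n_to_select, n_folds = 24, 4, 5
forward = sfs_candidate_evaluations(n_features, n_to_select, "forward")
backward = sfs_candidate_evaluations(n_features, n_to_select, "backward")
print("forward: ", forward, "subsets,", forward * n_folds, "model fits")
print("backward:", backward, "subsets,", backward * n_folds, "model fits")
```

When only a few features are kept out of many, the backward direction has to score far more candidate subsets, and each of them is trained on more features.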




In [None]:
pd.DataFrame(X_train)

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector

In [None]:
skf

In [None]:
lr = LogisticRegression(max_iter = 1000)
sfs = SequentialFeatureSelector(lr, 
                                cv = skf,
                                scoring = 'accuracy', 
                                direction = 'forward', 
                                n_features_to_select = 4)
sfs.fit(X_train, y_train)

sfs.get_support()

In [None]:
lr = LogisticRegression(max_iter = 1000)
sfs = SequentialFeatureSelector(lr,
                                cv = skf, 
                                scoring = 'accuracy',  
                                direction = 'backward', 
                                n_features_to_select = 4)
sfs.fit(X_train, y_train)

sfs.get_support()

**Note 1**: Which one takes the longest and why?

In [None]:
%%timeit # what sort of magic is this? a brief aside at the end of the notebook

lr = LogisticRegression(max_iter = 1000)
sfs = SequentialFeatureSelector(lr, 
                                cv = skf,
                                scoring = 'accuracy', 
                                direction ='forward', 
                                n_features_to_select = 4)
sfs.fit(X_train, y_train)

sfs.get_support()

In [None]:
%%timeit

lr = LogisticRegression(max_iter = 1000)
sfs = SequentialFeatureSelector(lr, 
                                cv = skf, 
                                scoring = 'accuracy', 
                                direction = 'backward', 
                                n_features_to_select = 4)
sfs.fit(X_train, y_train)

sfs.get_support()

**Note 2**: Why do we have noisy features among the selected ones?

Check manually the first step, e.g. in the forward fashion.

In [None]:
for i in range(X_train.shape[1]):
    print(np.mean(cross_val_score(LogisticRegression(max_iter = 1000), 
                                  X_train[:, i].reshape(-1, 1),
                                  y_train, 
                                  cv = skf)))


The highest score (around 0.96) is obtained for the fourth attribute (index = 3).

Evaluate the second step: which feature should be added? (i.e., which one most improves the cross-validation score?)

In [None]:
for i in range(X_train.shape[1]):
    if i == 3:
        continue
    X_new = np.concatenate((X_train[:, i].reshape(-1, 1), (X_train[:, 3].reshape(-1, 1))),
                           axis = 1)
    print(i, np.mean(cross_val_score(LogisticRegression(max_iter = 1000), 
                                    X_new,
                                    y_train, 
                                    cv = skf)))

It seems that, in our setting, the other actual features from the Iris dataset (X0, X1, X2) worsen the prediction, rather than improving it. That's why, if we set the target number of features to be selected to 4, the algorithm favors the noisy ones.

By setting even a minimal tolerance value (features are added only as long as the score improvement exceeds this threshold), the behavior is as expected.

In [None]:
lr = LogisticRegression(max_iter = 1000)
sfs = SequentialFeatureSelector(lr, 
                                cv = skf, 
                                tol = 0.0000001, 
                                scoring = 'accuracy',  
                                direction ='forward', 
                                n_features_to_select = 'auto')
sfs.fit(X_train, y_train)

sfs.get_support()

## Sampling and rebalancing with `imblearn`



Imbalanced-learn (imported as `imblearn`) is a library that relies on scikit-learn and provides tools for dealing with classification with imbalanced classes. 
As such, it provides several samplers, which follow the scikit-learn API (building on its base estimator) and implement sampling functionality through the `fit_resample` method.



Generate an imbalanced dataset using sklearn utilities.

In [None]:
from sklearn.datasets import make_blobs

plt.figure(figsize = (8, 8))
plt.title("Synthetic normally distributed dataset")
X1, Y1 = make_blobs(n_samples = [1000, 100, 500], n_features = 2,random_state = 1)
plt.scatter(X1[:, 0], X1[:, 1], marker = "o", c = Y1, s = 25, edgecolor = "k")
plt.show()


In [None]:
pd.Series(Y1).value_counts()

In [None]:
from imblearn.under_sampling import RandomUnderSampler
sampler = RandomUnderSampler(random_state = 42)
X1_RUS, Y1_RUS = sampler.fit_resample(X1, Y1)
print(pd.Series(Y1_RUS).value_counts())

In [None]:
from imblearn.over_sampling import RandomOverSampler
sampler = RandomOverSampler(random_state = 42)
X1_ROS, Y1_ROS = sampler.fit_resample(X1, Y1)
print(pd.Series(Y1_ROS).value_counts())

In [None]:
from imblearn.over_sampling import SMOTE
sampler = SMOTE(random_state = 42)
X1_SMOTE, Y1_SMOTE = sampler.fit_resample(X1, Y1)
print(pd.Series(Y1_SMOTE).value_counts())

All rebalancing methods offer flexibility in the sampling strategy and the target ratio between classes: check the docs, specifically the `sampling_strategy` parameter.
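
It may help to see how the two over-samplers differ in principle: random over-sampling repeats existing minority points, while SMOTE creates new points by interpolating between a minority point and one of its nearest minority neighbors. The NumPy sketch below illustrates that idea only. It is not imblearn's implementation, and the toy points and `k = 2` are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority class with 5 points in 2D
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
n_new = 4

# Random over-sampling: draw existing minority points again (exact duplicates)
ros_new = X_min[rng.integers(0, len(X_min), size=n_new)]

# SMOTE-style interpolation: move a random fraction of the way from a minority
# point towards one of its k nearest minority neighbors
k = 2
smote_new = []
for _ in range(n_new):
    i = rng.integers(0, len(X_min))
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]  # skip position 0, the point itself
    j = rng.choice(neighbors)
    gap = rng.uniform(0, 1)
    smote_new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
smote_new = np.array(smote_new)

print(ros_new)
print(smote_new)
```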

Visual comparison of the three methods:

In [None]:
fig,ax = plt.subplots(1, 4, figsize = (20, 5))

ax[0].scatter(X1[:, 0], X1[:, 1], marker = "o", c = Y1, s = 25, edgecolor = "k")
ax[0].set_title('original')

ax[1].scatter(X1_RUS[:, 0], X1_RUS[:, 1], marker = "o", c = Y1_RUS, s = 25, edgecolor = "k")
ax[1].set_title('Random UnderSampling')

ax[2].scatter(X1_ROS[:, 0], X1_ROS[:, 1], marker = "o", c = Y1_ROS, s = 25, edgecolor = "k")
ax[2].set_title('Random OverSampling')

ax[3].scatter(X1_SMOTE[:, 0], X1_SMOTE[:, 1], marker = "o", c = Y1_SMOTE, s = 25, edgecolor = "k")
ax[3].set_title('SMOTE')

plt.show()


Several other methods/extensions are available in imblearn (see the [docs](https://imbalanced-learn.org/stable/references/index.html#api)).


# Aside: *Magic* Functions

[IPython's 'magic' functions](https://ipython.readthedocs.io/en/stable/interactive/magics.html)

- The magic function system provides a series of **functions which allow you to control the behavior of IPython itself**, plus a lot of system-type features. There are two kinds of magics, **line-oriented** and **cell-oriented**:
    - **Line magics are prefixed with the % character** and work much like OS command-line calls. They get as an argument the rest of the line, where arguments are passed without parentheses or quotes.
    - **Cell magics are prefixed with %% (a double % character)**, and they are functions that get as an argument not only the rest of the line, but also the lines below it in a separate argument. These magics are called with two arguments: the rest of the call line and the body of the cell, consisting of the lines below the first.

In [None]:
%lsmagic

In [None]:
%whos

In [None]:
%whos int

In [None]:
%timeit enu * i