<div class="alert alert-block alert-success">
    <h1 align="center">Scikit-Learn Tips</h1>
    <h3 align="center">Tip 19 : Synthetic data - Part 1</h3>
    <h4 align="center"><a href="https://github.com/AliBinary">Ali Ghanbari</a></h4>
</div>

Generate a random n-class classification problem.

Imagine you just learned about a new classification algorithm and want to explore it further. Maybe you'd like to try out its hyperparameters to see how they affect performance.

The only problem is that you can't find a good dataset to experiment with.

Don't fret. Scikit-Learn provides a function for exactly this.

You can use make_classification() to create a variety of classification datasets. Here are a few possibilities:

* Generate binary or multiclass labels.
* Create labels with balanced or imbalanced classes.

Let's create a few such datasets. We'll also build RandomForestClassifier models to classify a few of them.

In [None]:
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, # 1000 observations 
    n_features=5, # 5 total features
    n_informative=3, # 3 'useful' features
    n_classes=2, # binary target/label 
    random_state=85 # if you want the same results as mine
)

Here are the basic input parameters for the function make_classification():

* n_samples: The number of observations to generate.
* n_features: The total number of numerical features.
* n_informative: The number of features that are 'useful'. Only these features carry the signal that your model will use to classify the dataset. The remaining features are redundant (linear combinations of the informative ones), repeated, or pure noise.
* n_classes: The number of unique classes (values) for the target label.
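
As a quick sanity check on these parameters, the following sketch generates a small dataset and inspects its shape and labels. It also spells out n_redundant and n_repeated, which the call above leaves at their defaults (2 and 0 in current scikit-learn versions). The names X_demo and y_demo and the sample size of 200 are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification

X_demo, y_demo = make_classification(
    n_samples=200,
    n_features=5,
    n_informative=3,
    n_redundant=2,   # linear combinations of the informative features
    n_repeated=0,    # no duplicated columns
    n_classes=2,
    random_state=85,
)

print(X_demo.shape)       # (n_samples, n_features)
print(np.unique(y_demo))  # the distinct class labels
```

Because n_informative + n_redundant + n_repeated equals n_features here, no pure-noise columns are left over.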


In [None]:
import pandas as pd
dataset = pd.DataFrame(X)
dataset.columns = ['X1', 'X2', 'X3', 'X4', 'X5']
dataset['y'] = y
dataset.info()

In [None]:
dataset['y'].value_counts()

In [None]:
dataset.head()

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# initialize classifier
classifier = RandomForestClassifier() 

# Run cross validation with 10 folds
scores = cross_validate(
    classifier, X, y, cv=10, 
    # measure score for a list of classification metrics
    scoring=['accuracy', 'precision', 'recall', 'f1']
)

scores = pd.DataFrame(scores)
scores.mean().round(4)
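
Since the point of a synthetic dataset is to experiment, here is a minimal sketch of comparing one hyperparameter across a few values. The choice of n_estimators, the candidate values, and the use of 5 folds are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_hp, y_hp = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_classes=2,
    random_state=85,
)

results = {}
for n_trees in (10, 50, 100):  # illustrative candidate values
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    # mean accuracy over 5 cross-validation folds
    fold_scores = cross_val_score(model, X_hp, y_hp, cv=5, scoring="accuracy")
    results[n_trees] = fold_scores.mean()

for n_trees, acc in results.items():
    print(f"n_estimators={n_trees}: mean accuracy={acc:.4f}")
```

Fixing random_state on both the dataset and the model keeps the comparison repeatable, so any difference between runs comes from the hyperparameter alone.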

## Imbalanced Dataset

In [None]:
X, y = make_classification(
    # the usual parameters
    n_samples=1000, n_features=5, n_informative=3, n_classes=2, 
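    # Assign label 0 to about 97% of observations and label 1 to the remaining 3%
    # (approximate: by default, flip_y=0.01 reassigns about 1% of labels at random)
    weights=[0.97], 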
)

In [None]:
pd.DataFrame(y).value_counts()
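
Two details are worth checking when working with imbalanced data. The sketch below sets flip_y=0 so that no labels are reassigned at random (the default is 0.01), and uses a stratified train/test split so that both parts keep roughly the same class ratio. The variable names and the 25% test size are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X_imb, y_imb = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_classes=2,
    weights=[0.97],
    flip_y=0,          # disable random label reassignment
    random_state=85,
)
class_counts = np.bincount(y_imb)
print(class_counts)  # observations per class

# stratify=y_imb keeps the class ratio similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, test_size=0.25, stratify=y_imb, random_state=85
)
print(np.bincount(y_train), np.bincount(y_test))
```

With so few minority samples, accuracy alone can look high even for a model that ignores class 1, so metrics such as recall or F1 are usually more informative here.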

## Multiclass Dataset

In [None]:
X, y = make_classification(
    # same parameters as usual 
    n_samples=1000, n_features=5, n_informative=3,
    # create target label with 3 classes
    n_classes=3, 
)

In [None]:
pd.DataFrame(y).value_counts()
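
The scorer names 'precision', 'recall', and 'f1' used earlier assume a binary target, so a multiclass dataset needs averaged variants. A minimal sketch, assuming macro averaging is what you want (other schemes, such as weighted averaging, also exist). The variable names and the 5 folds are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X_mc, y_mc = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_classes=3,
    random_state=85,
)

mc_scores = cross_validate(
    RandomForestClassifier(random_state=0), X_mc, y_mc, cv=5,
    # macro-averaged metrics treat every class equally
    scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"],
)
mc_summary = pd.DataFrame(mc_scores).mean().round(4)
print(mc_summary)
```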