<a href="https://colab.research.google.com/github/Deepan-mn/Machine_Learning_Techniques/blob/main/Categorical_Data/Handling_Categorical_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Categorical Data**

Categorical Data is a collection of information that is divided into groups.Categorical data can take on numerical values(such as "1" indicating yes and "2" indicating NO),but those numbers don't have mathematical meaning. One can neither add them together nor subtract them from each other.<br>
**Types of Categorical Data**<br>
&nbsp;&nbsp;&nbsp;&nbsp; 1.Nominal Data <br>
&nbsp;&nbsp; &nbsp;&nbsp;2.Ordinal Data<br>  

**Nominal Data**<br>
&nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp;  Nominal data is sometimes called "labelled" or "named" data. Nominal data cannot be ordered in a meaningful way and does not have hierarchy.It is just a collection of data under one label

**Ordinal Data**<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;This is a data type with a set order or scale to it. Ordinal data can be arranged in a order or hierarchy.

##**Encoding Nominal Categorical Features**

We have taken a feature with Nominal classes that has no intrinsic ordering(eg.hair color-white, brown, black)

**One-hot encode** the feature using scikit-learn's **LabelBinarizer**

In [None]:
#Import libraries 
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

In [None]:
#Create feature
feature = np.array([["Texas"],
                    ["California"],
                    ["Texas"],
                    ["Delaware"],
                    ["Texas"]])

In [None]:
#Create One-hot encoder
one_hot = LabelBinarizer()
#one-hot encode feature
one_hot.fit_transform(feature)

array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1]])

We can use the **Classes_method** to the output classes

In [None]:
#View feature classes
one_hot.classes_

array(['California', 'Delaware', 'Texas'], dtype='<U10')

If we want to reverse the one-hot encoding, we can use **inverse_transform**

In [None]:
#Reverse one-hot encoding
one_hot.inverse_transform(one_hot.transform(feature))

array(['Texas', 'California', 'Texas', 'Delaware', 'Texas'], dtype='<U10')

We can even use pandas to one-hot encode the feature

In [None]:
#Import library
import pandas as pd
#Create dummy varibles from feature
pd.get_dummies(feature[:,0])

Unnamed: 0,California,Delaware,Texas
0,0,0,1
1,1,0,0
2,0,0,1
3,0,1,0
4,0,0,1


One of the advantage of scikit-learn is to handle a situation where each observation lists multiple classes:

In [None]:
#Create multiclass feature
multiclass_feautre=[("Texas", "Florida"),
                    ("California", "Alabama"),
                    ("Texas", "Florida"),
                   ("Delware", "Florida"),
                   ("Texas", "Alabama")]
#create multiclass one-hot encoder
one_hot_multiclass = MultiLabelBinarizer()
one_hot_multiclass.fit_transform(multiclass_feautre)

array([[0, 0, 0, 1, 1],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1]])

Once again, we can see the classes with the classes_method:


In [None]:
#View classes
one_hot_multiclass.classes_

array(['Alabama', 'California', 'Delware', 'Florida', 'Texas'],
      dtype=object)

##**Encoding Ordinal Categorical Features**

Use pandas DataFrame's replace method to transform string labels to numerical equivalents:

In [None]:
#Load library
import pandas as pd
#Create features
dataframe = pd.DataFrame({"Score":["Low","Low","Medium","Medium","High"]})
#create mapper
scale_mapper={
              "Low":1,
              "Medium":2,
              "High":3
            }
#Replace feature values with scale
dataframe["Score"].replace(scale_mapper)

0    1
1    1
2    2
3    2
4    3
Name: Score, dtype: int64

In [None]:
dataframe = pd.DataFrame({"Score":["Low","Low","Medium","Medium","High"]})
dataframe

Unnamed: 0,Score
0,Low
1,Low
2,Medium
3,Medium
4,High


In [None]:
dataframe = pd.DataFrame({"Score": ["Low",
                                    "Low",
                                    "Medium",
                                    "Medium",
                                    "High",
                                    "Barely More Than Medium"]})
scale_mapper = {"Low":1,
                "Medium":2,
                "Barely More Than Medium": 3,
                "High":4}
dataframe["Score"].replace(scale_mapper)

0    1
1    1
2    2
3    2
4    4
5    3
Name: Score, dtype: int64

In [None]:
dataframe = pd.DataFrame({"Score": ["Low",
                                    "Low",
                                    "Medium",
                                    "Medium",
                                    "High",
                                    "Barely More Than Medium"]})
dataframe

Unnamed: 0,Score
0,Low
1,Low
2,Medium
3,Medium
4,High
5,Barely More Than Medium


In this example, the distance between **Low** and **Medium** is the same as the  distance between Medium and Barely More than Medium, which is almost certainly not accurate.The best approach is to be conscious about the numerical values mapped to  classess:

In [None]:
scale_mapper = {"Low":1,
                "Medium":2,
                "Barely More Than Medium": 2.1,
                "High":3}
dataframe["Score"].replace(scale_mapper)

0    1.0
1    1.0
2    2.0
3    2.0
4    3.0
5    2.1
Name: Score, dtype: float64

##**Encoding Dictionaries of Features**

In [None]:
#import Library
from sklearn.feature_extraction import DictVectorizer

#create dictionary
data_dict =[{"Red":2,"Blue":4},
            {"Red":4,"Blue":3},
            {"Red":1,"Yellow":2},
            {"Red":2,"Yellow":2}
            
]

#Create dictionary vectorizer
dictvectorizer =DictVectorizer(sparse=False)

#Convert dictionary to feature_matrix
features = dictvectorizer.fit_transform(data_dict)

#view feature matrix
features

array([[4., 2., 0.],
       [3., 4., 0.],
       [0., 1., 2.],
       [0., 2., 2.]])

By default **DictVectorizer** output a sparse matrix that only stores elements with a value other than 0.This can be very helpful when we have massive matrices(often  encountered in natural language processing) and want to minimize the memory requirements. We can force **DictVectorizer** to output a dense matrix using  **sparse=False**. We can get the names of each generated feature using the **get_feature_names** method.

In [None]:
#Get Feature names
feature_names = dictvectorizer.get_feature_names()

#view feature names
feature_names



['Blue', 'Red', 'Yellow']

We can use pandas For the better understanding

In [None]:
#Import Library
import pandas as pd

#Create dataframe from features
pd.DataFrame(features, columns=feature_names)

Unnamed: 0,Blue,Red,Yellow
0,4.0,2.0,0.0
1,3.0,4.0,0.0
2,0.0,1.0,2.0
3,0.0,2.0,2.0


In [None]:
#Create word counts dictionaries for four documnets
doc_1_word_count ={"Red":2,"Blue":4}
doc_2_word_count ={"Red":4,"Blue":3}
doc_3_word_count ={"Red":1,"Yellow":2}
doc_4_word_count ={"Red":2,"Yellow":2}

#Create list
doc_word_count=[doc_1_word_count,
                doc_2_word_count,
                doc_3_word_count,
                doc_4_word_count]

#convert list of word count dictionaries into feature matrix
dictvectorizer.fit_transform(doc_word_count)

array([[4., 2., 0.],
       [3., 4., 0.],
       [0., 1., 2.],
       [0., 2., 2.]])

In our toy example there are only  three unique words(Red,Yellow,Blue)so there are only three features in our matrix: however, you can imagine that if each documents was actually a book in a university library our features matrix would be very large(and then **we would want to set sparse to True**)

##**Imputing Missing Class Values**

In [2]:

# Load libraries
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
# Create feature matrix with categorical feature
X = np.array([[0, 2.10, 1.45],
              [1, 1.18, 1.33],
              [0, 1.22, 1.27],
              [1, -0.21, -1.19]])
# Create feature matrix with missing values in the categorical feature
X_with_nan = np.array([[np.nan, 0.87, 1.31],
                      [np.nan, -0.67, -0.22]])
# Train KNN learner
clf = KNeighborsClassifier(3, weights='distance')
trained_model = clf.fit(X[:,1:], X[:,0])
# Predict missing values' class
imputed_values = trained_model.predict(X_with_nan[:,1:])

In [None]:
imputed_values.reshape(-1,1)

array([[0.],
       [1.]])

In [None]:
#join Column of predicted class with their other features
X_with_imputed =np.hstack((imputed_values.reshape(-1,1),X_with_nan[:,1:]))

In [None]:
X_with_imputed

array([[ 0.  ,  0.87,  1.31],
       [ 1.  , -0.67, -0.22]])

In [None]:
#join two features matrices
np.vstack((X_with_imputed,X))

array([[ 0.  ,  0.87,  1.31],
       [ 1.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

##**difference between hstack and vstack**

An alternative solution is to fill in missing values with the feature's most frequent value:

In [None]:
a =np.ones((3,3))
np.vstack((a,np.array((2,2,2))))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [2., 2., 2.]])

Adding a column requires a bit more work, however.You can't use **np.hstack** directly.

In [None]:
a = np.ones((3, 3))
np.hstack( (a, np.array((2,2,2))) )

This is because **np.hstack** cannot concatenate two arrays with different numbers of rows. Schematically:<br>
We can't simply transpose our new row , either because its one-dimensional array and its transpose is the same shape as the original. So we need to reshape it first.

In [None]:
np.ones((3,3))
b=np.array((2,2,2)).reshape(3,1)

In [None]:
b

array([[2],
       [2],
       [2]])

In [None]:
np.hstack((a,b))

array([[1., 1., 1., 2.],
       [1., 1., 1., 2.],
       [1., 1., 1., 2.]])

An alternative solution is to fill in missing values with the feature's most frequent value:

In [4]:
from sklearn.impute import SimpleImputer

In [6]:
#Join the two features matrices
X_complete = np.vstack((X_with_nan,X))
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X_complete)

array([[ 0.  ,  0.87,  1.31],
       [ 0.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

##**Handling Imbalance Classes**

We have a target vecotr with highly imbalanced classes.<br>
Collect more data.If that isn't possible, change the metrics used to evaluate your model. If that doesn't work, consider using a model's built-in class weight parameters(if  available), downsampling or upsampling.
The following are used to handled imbalanced classes<br>


1.   Evaluation metrics
2.   class weight parameters
3.   downsampling
4.   upsampling<br>
To demonstrate  our solutions, we need to create some data with imbalanced classes. Fisher's Iris dataset contains three balanced classes of 50 observations, each indicating the species of flower(Iris setosa, Iris virginica, Iris veriscolor).To unbalance the dataset, we remove 40 of the 50 Iris setosa observations and then merge the Iris Virginica and Iris versicolor classes. The end result is a binary target vector indicating if an oservation is an iris setosa flower or not. The result is 10 observations of Iris setosa(clas 0 ) and 100 observations of not Iris setosa(calss 1) 



In [8]:
#Load Libraries
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

#Load iris data
iris = load_iris()

#Create feature matrix
features =iris.data

#Create target vector
target=iris.target

In [9]:
target


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [10]:
#Remove first 40 observations
features = features
target = target[40:]

#Create binary target vector indicating if class 0
target =np.where((target ==0),0,1)
#Look at the imbalanced target vector
target


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Many algorithms in scikit-learn offer a parameter to weight classes during training to counteract the effect of their imbalance. **RandomForestClassifier** is a popular classification algorithm and includes a **class_weight** parameter. We can pass an argument specifying the desired class weights explicitly:

In [11]:
#Create Weights
weights={0: 0.9,1:0.1}

#Create random forest classifier with weights
RandomForestClassifier(class_weight= weights)

RandomForestClassifier(class_weight={0: 0.9, 1: 0.1})

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;class_weight={0: 0.9, 1: 0.1}, criterion='gini',<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;max_depth=None, max_features='auto', max_leaf_nodes=None,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;max_samples=None, min_impurity_decrease=0.0,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;min_impurity_split=None, min_samples_leaf=1,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;min_samples_split=2, min_weight_fraction_leaf=0.0,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;n_estimators=100, n_jobs=None, oob_score=False,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;random_state=None, verbose=0, warm_start=False)<br>
Or you can pass balanced, which automatically creates weights inversely proportional to class frequencies:

In [12]:
#Train a random forest with balanced class_weights
RandomForestClassifier(class_weight="balanced")

RandomForestClassifier(class_weight='balanced')

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;criterion='gini', max_depth=None, max_features='auto',<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;max_leaf_nodes=None, max_samples=None,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;min_impurity_decrease=0.0, min_impurity_split=None,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;min_samples_leaf=1, min_samples_split=2,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;min_weight_fraction_leaf=0.0, n_estimators=100,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;n_jobs=None, oob_score=False, random_state=None,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;verbose=0, warm_start=False)<br>
Alternatively, we can downsample the majority class or upsample the minority class. In downsampling, we randomly sample without replacement from the majority class (i.e., the class with more observations) to create a new subset of observations equal in size to the minority class. For example, if the minority class has 10 observations, we will randomly select 10 observations from the majority class and use those 20 observations as our data. Here we do exactly that using our unbalanced Iris data:

In [14]:
# Indicies of each class' observations
i_class0 = np.where(target == 0)[0]
i_class1 = np.where(target == 1)[0]

# Number of observations in each class
n_class0 = len(i_class0)
n_class1 = len(i_class1)

In [15]:
i_class0

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [16]:
i_class1

array([ 10,  11,  12,  13,  14,  15,  16,  17,  18,  19,  20,  21,  22,
        23,  24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,
        36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,
        49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,
        62,  63,  64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,
        75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,
        88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100,
       101, 102, 103, 104, 105, 106, 107, 108, 109])

In [17]:
#For every observation of class 0, randomly smaple
# from class 1 without replacement
i_class1_downsampled = np.random.choice(i_class1,size=n_class0,replace=False)

In [18]:
i_class1_downsampled

array([ 87,  78,  10, 104,  24,  49,  41,  83,  91,  40])

In [19]:
# Join together class 0's target vector with the
# downsampled class 1's target vector
np.hstack((target[i_class0], target[i_class1_downsampled]))

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [20]:
# Join together class 0's feature matrix with the
# downsampled class 1's feature matrix
np.vstack((features[i_class0,:], features[i_class1_downsampled,:]))

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [6.3, 2.3, 4.4, 1.3],
       [6. , 2.9, 4.5, 1.5],
       [5.4, 3.7, 1.5, 0.2],
       [6.5, 3. , 5.8, 2.2],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3.3, 1.4, 0.2],
       [4.5, 2.3, 1.3, 0.3],
       [6. , 2.7, 5.1, 1.6],
       [6.1, 3. , 4.6, 1.4],
       [5. , 3.5, 1.3, 0.3]])

Our Other option is to upsample the minority class. In upsampling, for every observation in the majority calss, we randomly select an observation fromt the minority class with replacement. The end result is the same number of observations from the minority and majority classes. Upsampling is implemented very similarly to downsampling,just in reverse.

In [21]:

# For every observation in class 1, randomly sample from class 0 with replacement
i_class0_upsampled = np.random.choice(i_class0, size=n_class1, replace=True)

In [22]:
# Join together class 0's upsampled target vector with class 1's target vector
np.concatenate((target[i_class0_upsampled], target[i_class1]))


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1])

In [23]:
# Join together class 0's upsampled feature matrix with class 1's feature matrix
np.vstack((features[i_class0_upsampled,:], features[i_class1,:]))[0:5]

array([[5.1, 3.5, 1.4, 0.2],
       [5.1, 3.5, 1.4, 0.2],
       [4.6, 3.4, 1.4, 0.3],
       [5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2]])

Summary<br><br>
In the real world, imbalanced classes are everywhere-most visitors dont click the buy button and detecting the cancer. For this reason, handling imbalanced classes is a common activity in machine learning.Our best strategy is simply to collect more observations-especially observations from the minority class. However, this often just not possible,so we have to resort to other options. A second strategy is to use a model evaluation metric better suited to imbalanced classes. Accuracy is often used as a metric for evaluating the performance of a model,but when imbalanced classes are present accuracy can be ill suited. For example, if only 0.5% of observations have some rare cancer, then even a naive model that predicts nobody has cancer will be 99.5% accurate. Clearly this is not ideal. Some better evaluation of metrics are **confusion matrices, precision, recall, F1 score and ROC curves**.A third strategy is to use class weighing parameters included in implementations of some models. This allows us to have the algorithm adjust for imbalanced classes. Fortunately, many scikit-learn classifiers have a **class_weight** parameter, making it a good option.The fourth and fifth strategies are related :**Downsampling and upsampling**. In downsampling we create a random subset of the majority class of equal size to the minority class. In upsampling we repeatedly sample with replacement from the minority class to make it of equal size as the majority class. The decision between using downsampling and upsampling is context-specific and in general we should try both to see which better results.