# Data Cleaning - Noisy Data

## Clustering: Unsupervised learning - K-means algorithm

**Question**

Given the following data: 13,15,16,16,19,20,20,21,22,22,25,25,25,25,30,33,33,35,35,35,35,36,40,45,46,52,70.<br/>
Apply the K-means clustering algorithm to smooth the data, with k = 3.



In [1]:
import numpy as np
from sklearn.cluster import KMeans

In [2]:
random_state = 2**16
np.random.seed(random_state)

In [3]:
X = np.array([13,15,16,16,19,20,20,21,22,22,25,25,25,25,30,33,33,35,35,35,35,36,40,45,46,52,70])
# scikit-learn expects a 2D array of shape (n_samples, n_features), so this 1D array must be reshaped first.
print('Before:', X.shape)

# Expand the dimension, so we can have a 27x1 matrix (27 samples, each sample has 1 attribute).
X = np.expand_dims(X, axis=1)
print('After:', X.shape)
print('Print the top 5 data points:')
print(X[:5])

Before: (27,)
After: (27, 1)
Print the top 5 data points:
[[13]
 [15]
 [16]
 [16]
 [19]]
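
The same column shape can be obtained in a few equivalent ways. The sketch below uses only NumPy and a shortened copy of the data to compare `np.expand_dims` with `reshape(-1, 1)` and `np.newaxis` indexing; the variable names are illustrative and not part of the notebook.

```python
import numpy as np

# First five values of the data set, kept short for illustration.
values = np.array([13, 15, 16, 16, 19])

col_expand = np.expand_dims(values, axis=1)   # the approach used in the cell above
col_reshape = values.reshape(-1, 1)           # -1 lets NumPy infer the number of rows
col_newaxis = values[:, np.newaxis]           # indexing with an extra axis

print(col_expand.shape, col_reshape.shape, col_newaxis.shape)
```

All three produce a column with one row per sample, so the choice is mostly a matter of readability.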


In [4]:
kmean = KMeans(n_clusters=3, random_state=random_state)
kmean.fit(X)

KMeans(n_clusters=3, random_state=65536)

In [5]:
# Note: the cluster labels are arbitrary; they do not follow the order of the values.
pred = kmean.predict(X)
pred

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1])

In [6]:
# Compute the mean of each cluster, rounded to the nearest integer
means = np.zeros(3, dtype=np.int32)
for i in range(3):
    idx = np.where(pred == i)[0]
    mean = X[idx].mean()
    means[i] = np.round(mean, decimals=0)

means

array([37, 61, 20])

In [7]:
# Replace each value with the mean of its cluster
X_smoothed = np.array([means[x] for x in pred])

X_smoothed

array([20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 61, 61])
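
The two steps above (per-cluster mean, then mapping each value to its cluster mean) can also be written without an explicit loop. The following is a minimal sketch, assuming the labels are the ones printed by `kmean.predict` above and that every label from 0 to the largest one occurs at least once; `smooth_by_cluster_mean` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def smooth_by_cluster_mean(values, labels):
    """Return (rounded cluster means, values replaced by their cluster mean)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    # Sum and count per label; assumes no label in 0..labels.max() is empty.
    sums = np.bincount(labels, weights=values)
    counts = np.bincount(labels)
    means = np.round(sums / counts).astype(int)
    # Indexing the means with the labels maps every value to its cluster mean.
    return means, means[labels]

data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
# Labels copied from the prediction output above.
labels = [2] * 14 + [0] * 11 + [1] * 2

means, smoothed = smooth_by_cluster_mean(data, labels)
print(means)
print(smoothed)
```

With these labels the helper reproduces the means and the smoothed array shown above.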

# Imbalanced Data

## Generating data using SMOTE - Synthetic Minority Over-sampling Technique

<img src="imgs/smote.png" style="height:200px;">

The algorithm of SMOTE:<br/>
<img src="imgs/smote_algorithm.png" style="width:800px;"><br/>
Source: Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
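
The core step of SMOTE is a linear interpolation between a minority sample and one of its nearest neighbours. The sketch below shows that single step in plain Python; `interpolate` is a hypothetical helper and the two points are illustrative iris-like values. The `smote` function in this notebook interpolates from the neighbour towards the sample instead, which covers the same line segment.

```python
def interpolate(sample, neighbour, gap):
    """Return a point on the segment from sample to neighbour, with 0 <= gap <= 1."""
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

# Illustrative values, not taken from a real run.
sample = [5.1, 3.5, 1.4, 0.2]
neighbour = [4.9, 3.0, 1.4, 0.2]

synthetic = interpolate(sample, neighbour, gap=0.5)
print(synthetic)
```

Attributes on which the two points agree (the last two here) are left unchanged, because their difference is 0.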

In [8]:
from sklearn.neighbors import KDTree

In [9]:
def smote(X, p, k):
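    """Generate synthetic samples with SMOTE (Synthetic Minority Over-sampling Technique).

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_attributes)
        Samples of the minority class.
    p : float
        Amount of SMOTE as a fraction of len(X): int(len(X) * p) synthetic
        samples are generated. If p is less than 1, at most that many samples
        of X, chosen at random, serve as starting points.
    k : int
        Number of nearest neighbours, counting the sample itself.

    Returns
    -------
    outputs : ndarray of shape (int(len(X) * p), n_attributes)
        Synthetic minority-class samples.
    """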
    n_X = len(X)
    n_out = int(n_X * p)
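    # Randomly choose n_out samples from X, with replacement, as starting points.
    indices_outputs = np.random.choice(n_X, size=n_out, replace=True)
    # Fancy indexing already returns a copy; cast to float so interpolated values are not truncated.
    outputs = X[indices_outputs].astype(float)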

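    # Find the k nearest neighbours of every sample in X.
    # Each sample is queried against the tree built from X, so its neighbours include itself.
    tree = KDTree(X)
    k_indices = tree.query(X, k=k, return_distance=False)
    # One interpolation factor in [0, 1) per synthetic sample.
    gaps = np.random.rand(n_out)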
    
    for i, i_x in enumerate(indices_outputs):
        # Pick one of the k neighbours at random and interpolate between it and the sample.
        nn = np.random.choice(k_indices[i_x])
        dif = X[i_x] - X[nn]
        outputs[i] = X[nn] + gaps[i] * dif
    
    return outputs

## Testing SMOTE

In [10]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [11]:
X, y = datasets.load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_state)

# Select all samples where class = 0
indices = np.where(y_train==0)[0]
X0 = X_train[indices]

print(X0.shape)

print(X0[:5])

(36, 4)
[[5.4 3.9 1.3 0.4]
 [4.8 3.4 1.9 0.2]
 [5.1 3.4 1.5 0.2]
 [4.9 3.  1.4 0.2]
 [4.4 2.9 1.4 0.2]]


In [12]:
rf = RandomForestClassifier(n_estimators=10, max_depth=5)
rf.fit(X_train, y_train)

RandomForestClassifier(max_depth=5, n_estimators=10)

In [13]:
acc_train = rf.score(X_train, y_train)
print('Accuracy on training set: {:2.2%}\n'.format(acc_train))
acc_test = rf.score(X_test, y_test)
print('Accuracy on test set: {:2.2%}\n'.format(acc_test))

Accuracy on training set: 99.05%

Accuracy on test set: 97.78%



In [14]:
# Generate 200% of synthetic data, with k = 5

# If an attribute has the same value in the target point and its neighbour,
# the difference for that attribute is 0, so the synthetic sample keeps that value.
X0_smote = smote(X0, 2.0, 5)

print(X0_smote.shape)

print(X0_smote[:10])

pred = rf.predict(X0_smote)
print(pred)

(72, 4)
[[5.15969776 3.78010075 1.5        0.28010075]
 [5.1042635  3.5        1.4042635  0.2       ]
 [5.51435389 4.09058981 1.62376408 0.4       ]
 [4.83795007 3.1        1.56204993 0.2       ]
 [5.37358708 3.89119569 1.71760861 0.4       ]
 [5.1        3.7        1.5        0.4       ]
 [5.21254359 3.76248547 1.56248547 0.2       ]
 [5.04990696 3.52504652 1.4        0.24990696]
 [4.5        2.3        1.3        0.3       ]
 [4.86709413 3.03290587 1.46581175 0.2       ]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [15]:
# Generate 50% of synthetic data, with k = 3

# The k nearest neighbours include the target point itself,
# so k = 3 means the target and 2 of its neighbours.
# The smaller k is, the more likely a synthetic sample
# is an exact copy of a real one.
X0_smote = smote(X0, 0.5, 3)

print(X0_smote.shape)

print(X0_smote[:10])

pred = rf.predict(X0_smote)
print(pred)

(18, 4)
[[5.57003184 3.84332272 1.7        0.34332272]
 [4.4        2.98507207 1.31492793 0.2       ]
 [5.1        3.79034507 1.59034507 0.21930985]
 [5.1        3.8        1.5620414  0.2379586 ]
 [5.25188632 3.55188632 1.5        0.2       ]
 [5.40969028 3.70323009 1.50646019 0.20323009]
 [4.95038597 3.6        1.4        0.15038597]
 [4.9        3.1        1.5        0.1728397 ]
 [5.1        3.5        1.4        0.2       ]
 [5.4        3.9        1.3        0.4       ]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
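
The claim that the neighbour list contains the target point itself can be checked without a KD-tree. The sketch below is a brute-force stand-in for the `KDTree` query, using only NumPy and the first five rows of `X0` printed earlier; `knn_indices` is a hypothetical helper written for this illustration.

```python
import numpy as np

def knn_indices(X, k):
    """Brute-force k nearest neighbours of every row of X, queried against X itself."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances, shape (n, n).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # A stable sort keeps the lower index first when distances tie.
    return np.argsort(d2, axis=1, kind="stable")[:, :k]

points = np.array([
    [5.4, 3.9, 1.3, 0.4],
    [4.8, 3.4, 1.9, 0.2],
    [5.1, 3.4, 1.5, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.4, 2.9, 1.4, 0.2],
])

neighbours = knn_indices(points, k=3)
# The first column is each point's nearest neighbour: the point itself, at distance 0.
print(neighbours[:, 0])
```

Because every point is at distance 0 from itself, it always comes first when the points are distinct. A variant that should never return a plain copy could, under that assumption, query `k + 1` neighbours and drop the first column.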


In [16]:
# Generate 100% of synthetic data, with k = 1

# When k = 1, the only neighbour is the point itself, so the difference is always 0.
# SMOTE is then equivalent to plain resampling
# (randomly choosing n samples with replacement).
X0_smote = smote(X0, 1, 1)

print(X0_smote.shape)

print(X0_smote[:10])

pred = rf.predict(X0_smote)
print(pred)

(36, 4)
[[5.1 3.4 1.5 0.2]
 [5.1 3.5 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [4.8 3.1 1.6 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.6 1.4 0.1]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [4.4 2.9 1.4 0.2]
 [5.1 3.5 1.4 0.2]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
