<hr/>

# Data Mining  [EN.550.636.02]

03/09/2018

**TA** - Cong Mu (cmu2@jhu.edu)   <br/>
**Office Hours** - Monday 9:00am ~ 11:00am

- **Python:** scikit-learn
- **Classification:** NB, LDA, QDA
- **Q & A**

<hr/>


[Install Python](https://www.python.org/) <br/>
[Install Anaconda](https://www.continuum.io/downloads)

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
import pandas as pd

<h2><font color="darkblue">Python</font></h2>
<hr/>

### scikit-learn
[Tutorial](http://scikit-learn.org/stable/tutorial/index.html)

#### Preprocessing
[Reference](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing)

In [3]:
from sklearn import preprocessing

- **Standardization**

In [4]:
np.random.seed(2018)
X = np.random.rand(20, 3)
X

array([[ 0.88234931,  0.10432774,  0.90700933],
       [ 0.3063989 ,  0.44640887,  0.58998539],
       [ 0.8371111 ,  0.69780061,  0.80280284],
       [ 0.10721508,  0.75709253,  0.99967101],
       [ 0.725931  ,  0.14144824,  0.3567206 ],
       [ 0.94270411,  0.61016189,  0.22757747],
       [ 0.66873237,  0.69290455,  0.41686251],
       [ 0.17180956,  0.97689051,  0.33022414],
       [ 0.62904415,  0.16061095,  0.08995264],
       [ 0.97082236,  0.81657757,  0.57136573],
       [ 0.34585315,  0.403744  ,  0.13738304],
       [ 0.90093449,  0.93393613,  0.04737714],
       [ 0.67150688,  0.03483186,  0.25269136],
       [ 0.55712505,  0.52582348,  0.35296779],
       [ 0.09298297,  0.30450898,  0.86242986],
       [ 0.71693654,  0.96407149,  0.53970186],
       [ 0.95053982,  0.66798156,  0.87424103],
       [ 0.48120492,  0.13739854,  0.69022154],
       [ 0.50211855,  0.07451108,  0.52351229],
       [ 0.91856772,  0.5274287 ,  0.36424787]])

In [5]:
X_scaled = preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)

In [6]:
# Check whether mean = 0
X_scaled.mean(axis=0)

array([ -6.66133815e-17,   3.19189120e-17,   3.60822483e-17])

In [7]:
# Check whether std = 1
X_scaled.std(axis=0)

array([ 1.,  1.,  1.])
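
`preprocessing.scale` standardizes one array in a single call. When the same shift and scale must be reused on data that arrives later, `StandardScaler` stores the training statistics instead. The sketch below is only an illustration: `X_train` and `X_new` are made-up arrays, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(2018)
X_train = rng.rand(20, 3)   # hypothetical training data
X_new = rng.rand(5, 3)      # hypothetical data seen only after fitting

# fit() learns the per-column mean and std from the training data only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_new_scaled = scaler.transform(X_new)

# The same transformation written by hand with the training statistics
manual = (X_new - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_new_scaled, manual))
```

The new data is scaled with the training mean and standard deviation, so its own column means are generally not exactly zero.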

- **Binarization**

In [8]:
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
X

array([[ 1., -1.,  2.],
       [ 2.,  0.,  0.],
       [ 0.,  1., -1.]])

In [9]:
binarizer = preprocessing.Binarizer(threshold=0.0, copy=True)
binarizer.transform(X)

array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])

In [10]:
binarizer = preprocessing.Binarizer(threshold=1.5, copy=True)
binarizer.transform(X)

array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])
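
Binarization is a plain thresholding rule: entries strictly greater than `threshold` become 1 and all others become 0. A minimal NumPy sketch of the same rule, where `binarize` is a hypothetical helper written for this illustration:

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

def binarize(X, threshold=0.0):
    """Return 1.0 where X > threshold and 0.0 elsewhere (illustrative helper)."""
    return (X > threshold).astype(float)

print(binarize(X, threshold=0.0))
print(binarize(X, threshold=1.5))
```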

- **Custom transformers**

In [11]:
X = np.array([[0, 1], [2, 3]])
X

array([[0, 1],
       [2, 3]])

In [12]:
# FunctionTransformer wraps a plain function so it can be used as a Pipeline step
transformer = preprocessing.FunctionTransformer(np.log1p) # log(1 + x)
transformer.transform(X)

array([[ 0.        ,  0.69314718],
       [ 1.09861229,  1.38629436]])

In [13]:
np.log1p(X)

array([[ 0.        ,  0.69314718],
       [ 1.09861229,  1.38629436]])
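
One way such a transformer might be chained with other steps is sketched below. The choice of `StandardScaler` as the second step is an assumption made for illustration only.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X = np.array([[0., 1.], [2., 3.]])

# Step 1 applies log(1 + x); step 2 standardizes the transformed columns
pipe = make_pipeline(FunctionTransformer(np.log1p), StandardScaler())
X_out = pipe.fit_transform(X)

# The pipeline gives the same result as applying the two steps by hand
expected = StandardScaler().fit_transform(np.log1p(X))
print(np.allclose(X_out, expected))
```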

<h2><font color="darkblue">Model</font></h2>
<hr/>

In [14]:
from sklearn import datasets

In [15]:
iris = datasets.load_iris()
c = np.unique(iris.target)
c

array([0, 1, 2])

### Naive Bayes
[Reference](http://scikit-learn.org/stable/modules/naive_bayes.html)

#### Procedure of Naive Bayes

- Fit
> Estimate the parameters of each class: the prior, and the mean and variance of every feature

- Predict
> For each unlabeled observation, compute the posterior of each class
>
> Assign the observation to the class k with the largest posterior

- Assumption
> Features are conditionally independent given the class
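
For the Gaussian case, the prediction rule can be written out as follows. The notation is an assumption introduced here: $p_k$ is the class prior, $\mu_{kj}$ and $\sigma_{kj}^2$ are the mean and variance of feature $j$ in class $k$, and $d$ is the number of features. The evidence $p(x)$ is the same for every class, so it can be dropped from the arg max.

```latex
\hat{k}(x)
  = \arg\max_{k} \; P(k \mid x)
  = \arg\max_{k} \; p_k \prod_{j=1}^{d}
    \frac{1}{\sqrt{2\pi\sigma_{kj}^{2}}}
    \exp\!\left( -\frac{(x_j - \mu_{kj})^{2}}{2\sigma_{kj}^{2}} \right)
```

The product over features is exactly where the conditional independence assumption enters.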

In [16]:
# Toy example for Gaussian Naive Bayes

class GNB(dict):
    
    def fit(self, X, C):
        for k in np.unique(C):
            # Boolean mask of the observations in class k
            members = (C == k)
            # Number of observations in class k
            num = members.sum()
            # Use the class frequency as the prior
            prior = num / float(C.size)
            # Select the observations in class k (boolean indexing returns a copy)
            XX = X[members, :]
            # Per-feature mean for class k
            mu = XX.mean(axis=0)
            # Center (in place on the copy, so X is unchanged)
            XX -= mu
            # Per-feature unbiased sample variance for class k
            var = (XX * XX).sum(axis=0) / (XX.shape[0] - 1)
            # Save the result for class k
            self[k] = (prior, num, mu, var)
    
    def predict(self, Y):
        pred = -1 * np.ones(Y.shape[0])
        for i in range(pred.size):
            # Initialization
            pmax, kmax = -1, None
            # Calculate the (unnormalized) posterior for each class
            for k in self:
                prior, num, mu, var = self[k]
                diff = Y[i, :] - mu
                d2 = diff * diff / (2 * var)
                posterior = prior * np.exp(-d2.sum()) / np.sqrt(np.prod(2 * np.pi * var))
                # Keep the class with the largest posterior seen so far
                if posterior > pmax:
                    pmax = posterior
                    kmax = k
            # Record the best class once all classes have been scored
            pred[i] = kmax
        return pred

In [17]:
clf = GNB()
clf.fit(iris.data, iris.target)
pred = clf.predict(iris.data)

print('Classifier: GNB')
print('Number of mislabeled points out of a total %d points : %d' % (iris.target.size, (iris.target!=pred).sum()))
print('Accuracy: ', mean(iris.target==pred))

Classifier: GNB
Number of mislabeled points out of a total 150 points : 6
Accuracy:  0.96
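
The toy `predict` multiplies densities directly, which can underflow to zero when there are many features. A common workaround is to compare log-posteriors instead. The following is a vectorized sketch under the same modelling choices as the toy class (frequency priors, unbiased per-feature variances), not a replacement for the library implementation.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target
classes = np.unique(y)

# Per-class parameters: log-prior, feature means, unbiased feature variances
log_prior = np.log(np.array([np.mean(y == k) for k in classes]))     # (K,)
mu = np.array([X[y == k].mean(axis=0) for k in classes])             # (K, d)
var = np.array([X[y == k].var(axis=0, ddof=1) for k in classes])     # (K, d)

# Gaussian log-density summed over features for every (sample, class) pair
diff = X[:, None, :] - mu[None, :, :]                                # (n, K, d)
log_lik = -0.5 * (np.log(2 * np.pi * var) + diff ** 2 / var).sum(axis=2)

# Unnormalized log-posterior; the arg max is unaffected by the missing constant
log_post = log_lik + log_prior                                       # (n, K)
pred = classes[np.argmax(log_post, axis=1)]
print('Accuracy: ', np.mean(pred == y))
```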


In [18]:
from sklearn.naive_bayes import GaussianNB

In [19]:
# Specify the model
clf = GaussianNB(priors=None)

# Fit
clf.fit(iris.data, iris.target)

# Predict
pred = clf.predict(iris.data)

print('Classifier: GNB')
print('Number of mislabeled points out of a total %d points : %d' % (iris.target.size, (iris.target!=pred).sum()))
print('Accuracy: ', mean(iris.target==pred))

Classifier: GNB
Number of mislabeled points out of a total 150 points : 6
Accuracy:  0.96
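
The posteriors behind these predictions can be inspected with `predict_proba`. A short sketch, refitting the model so the block runs on its own:

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
clf = GaussianNB().fit(iris.data, iris.target)

# One row per observation, one column per class; each row sums to 1
proba = clf.predict_proba(iris.data)

# Picking the class with the largest posterior reproduces predict()
pred_from_proba = clf.classes_[np.argmax(proba, axis=1)]
agreement = np.mean(pred_from_proba == clf.predict(iris.data))

print(proba[:3].round(3))
print('Agreement with predict(): ', agreement)
```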


### Linear Discriminant Analysis & Quadratic Discriminant Analysis
[Reference](http://scikit-learn.org/stable/modules/lda_qda.html)

#### Procedure of LDA & QDA

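- Fit
> Estimate the parameters of each class: the prior, the mean vector, and the covariance matrix

- Predict
> For each unlabeled observation, compute the log-posterior (log-likelihood plus log-prior) of each class
>
> Assign the observation to the class k with the largest log-posterior

- Difference
> LDA: all classes share one covariance matrix, which gives linear decision boundaries
>
> QDA: each class has its own covariance matrix, which gives quadratic decision boundaries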
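
The difference shows up in how the covariance is estimated. The sketch below computes per-class covariances (as QDA would use) and one pooled within-class covariance (one common estimator for LDA). The library's solvers may normalize differently, so treat this as an illustration rather than a description of scikit-learn's internals.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target
classes = np.unique(y)

# QDA: one unbiased covariance matrix per class (divides by n_k - 1)
class_covs = {k: np.cov(X[y == k], rowvar=False) for k in classes}

# LDA: a single pooled covariance, the within-class scatter divided by (n - K)
n, K = X.shape[0], classes.size
pooled = sum((np.sum(y == k) - 1) * class_covs[k] for k in classes) / (n - K)

print(pooled.shape)
```

With equally sized classes, as in iris, the pooled estimate reduces to the average of the per-class covariance matrices.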


In [20]:
# Toy example for Quadratic Discriminant Analysis

class QDA(dict):
    
    def fit(self, X, C):
        for k in np.unique(C):
            # Boolean mask of the observations in class k
            members = (C == k)
            # Number of observations in class k
            num = members.sum()
            # Use the class frequency as the prior
            prior = num / float(C.size)
            # Select the observations in class k
            S = X[members, :]
            # Mean vector for class k
            mu = S.mean(axis=0)
            # Center; Z has shape (n_features, n_k)
            Z = (S - mu).T
            # Unbiased sample covariance matrix for class k
            cov = Z.dot(Z.T) / (Z.shape[1] - 1)
            # Save the result for class k
            self[k] = (num, prior, mu, cov)

            
    def predict(self, Y):
        pred = -1 * np.ones(Y.shape[0])
        for i in range(pred.size):
            # Initialization
            d2min, kbest = np.inf, None
            # Calculate the negative log-posterior (up to a constant) for each class
            for k in self:
                num, prior, mu, cov = self[k]
                diff = (Y[i, :] - mu).T
                d2 = diff.T.dot(np.linalg.inv(cov)).dot(diff) / 2
                d2 += np.log(np.linalg.det(cov)) / 2 - np.log(prior)
                # Keep the class with the smallest negative log-posterior,
                # i.e. the largest log-posterior
                if d2 < d2min:
                    d2min, kbest = d2, k
            pred[i] = kbest
        return pred

In [21]:
clf = QDA()
clf.fit(iris.data, iris.target)
pred = clf.predict(iris.data)

print('Classifier: QDA')
print('Number of mislabeled points out of a total %d points : %d' % (iris.target.size, (iris.target!=pred).sum()))
print('Accuracy: ', mean(iris.target==pred))

Classifier: QDA
Number of mislabeled points out of a total 150 points : 3
Accuracy:  0.98
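
The toy class forms `inv(cov)` and `det(cov)` explicitly, which is fine for four features but can become numerically fragile in higher dimensions. One possible alternative, sketched here with the hypothetical helper `qda_neg_log_posterior`, solves a linear system and uses the log-determinant directly.

```python
import numpy as np
from sklearn import datasets

def qda_neg_log_posterior(x, prior, mu, cov):
    """Negative log-posterior of one class for observation x, up to a constant."""
    diff = x - mu
    # Solve cov @ z = diff instead of forming the inverse explicitly
    maha = diff @ np.linalg.solve(cov, diff)
    # slogdet returns (sign, log|det|) without computing det itself
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * maha + 0.5 * logdet - np.log(prior)

iris = datasets.load_iris()
X, y = iris.data, iris.target
classes = np.unique(y)

# (prior, mean vector, unbiased covariance matrix) for each class
params = {k: (np.mean(y == k), X[y == k].mean(axis=0), np.cov(X[y == k], rowvar=False))
          for k in classes}

# Score every observation against every class and keep the smallest score
scores = np.array([[qda_neg_log_posterior(x, *params[k]) for k in classes] for x in X])
pred = classes[np.argmin(scores, axis=1)]
print('Accuracy: ', np.mean(pred == y))
```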


In [22]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [23]:
# Specify the model
clf = LinearDiscriminantAnalysis(priors=None)

# Fit
clf.fit(iris.data, iris.target)

# Predict
pred = clf.predict(iris.data)

print('Classifier: LDA')
print('Number of mislabeled points out of a total %d points : %d' % (iris.target.size, (iris.target!=pred).sum()))
print('Accuracy: ', mean(iris.target==pred))

Classifier: LDA
Number of mislabeled points out of a total 150 points : 3
Accuracy:  0.98


In [24]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [25]:
# Specify the model
clf = QuadraticDiscriminantAnalysis(priors=None)

# Fit
clf.fit(iris.data, iris.target)

# Predict
pred = clf.predict(iris.data)

print('Classifier: QDA')
print('Number of mislabeled points out of a total %d points : %d' % (iris.target.size, (iris.target!=pred).sum()))
print('Accuracy: ', mean(iris.target==pred))

Classifier: QDA
Number of mislabeled points out of a total 150 points : 3
Accuracy:  0.98
