<hr/>

# Data Mining
**Tamás Budavári** - budavari@jhu.edu <br/>

- Classification exercises
- Cross-validation

<hr/>

In [None]:
%pylab inline

<h1><font color="darkblue">Classification</font></h1>
<hr/>

- Based on a **training set** of labeled points, assign class labels to unknown vectors in the **query set**.  

> **Training set**

>$T = \big\{ (x_i, C_i) \big\}$ 

> where $x_i\in \mathbb{R}^d$ are feature sets and $C_i$ are the known class memberships

<nbsp/>

> **Query set**

>$Q = \big\{ x_i \big\}$ where $x_i\in \mathbb{R}^d$ consist of the kind of features in $T$

- As before, the $x_i$ are in general not true vectors but **feature sets**: collections of scalars

### Bayes with Covariance Matrix

- Estimate the full covariance matrix for the classes

>$\displaystyle {\cal{}L}_{\boldsymbol{x}}(C_k) =  G(\boldsymbol{x};\mu_k, \Sigma_k)$

> Handles correlated features well

- Consider binary problem with 2 classes

> Taking the negative logarithm of the likelihoods (times two, dropping constants), we compare

>$\displaystyle (x\!-\!\mu_1)^T\,\Sigma_1^{-1}(x\!-\!\mu_1) + \ln\,\lvert \Sigma_1 \rvert $ vs.

>$\displaystyle (x\!-\!\mu_2)^T\,\Sigma_2^{-1}(x\!-\!\mu_2) + \ln \, \lvert\Sigma_2\rvert $

> If the difference of the two is below a threshold (zero for equal priors), we assign $x$ to class 1, otherwise to class 2

- This is called [**Quadratic Discriminant Analysis**](https://scikit-learn.org/stable/modules/lda_qda.html)
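
As a minimal sketch of the comparison above, assume two made-up Gaussian classes. The means `mu1`, `mu2` and covariances `S1`, `S2` below are illustrative values, not taken from the course data. The score of each class is the quadratic term plus the log-determinant, and with equal priors the point goes to the class with the lower score:

```python
import numpy as np

def qda_score(x, mu, cov):
    """Mahalanobis distance squared plus log-determinant of the covariance."""
    diff = x - mu
    # solve() avoids forming the explicit inverse
    maha = diff @ np.linalg.solve(cov, diff)
    # slogdet returns (sign, log|det|) and is stable for small determinants
    _, logdet = np.linalg.slogdet(cov)
    return maha + logdet

# Hypothetical class parameters, chosen only for illustration
mu1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, S2 = np.array([3.0, 3.0]), np.array([[2.0, 0.5], [0.5, 1.0]])

def classify(x):
    """Return 1 or 2, whichever class has the lower score (equal priors)."""
    return 1 if qda_score(x, mu1, S1) < qda_score(x, mu2, S2) else 2

print(classify(np.array([0.2, -0.1])))
print(classify(np.array([2.8, 3.1])))
```

Unequal priors would shift the threshold away from zero, as the `np.log(prior)` term does in the implementation further down.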

### Same Covariance Matrix

- When $\Sigma_1=\Sigma_2=\Sigma$, the quadratic terms and the log-determinants cancel from the difference
 
>$\displaystyle (x\!-\!\mu_1)^T\,\Sigma^{-1}(x\!-\!\mu_1) $ 
>$\displaystyle -\ (x\!-\!\mu_2)^T\,\Sigma^{-1}(x\!-\!\mu_2) $

- Hence this is called [**Linear Discriminant Analysis**](https://scikit-learn.org/stable/modules/lda_qda.html)

> Fewer parameters to estimate during the learning process

> Useful when, for example, there is not enough data to estimate a full covariance matrix for each class

> Think linear vs quadratic fitting and how you decide between those
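
> To see why the boundary is linear, expand both quadratic forms in the difference above; the $x^T\Sigma^{-1}x$ terms cancel and, using the symmetry of $\Sigma^{-1}$, what remains is

>$\displaystyle -2\,(\mu_1\!-\!\mu_2)^T\,\Sigma^{-1}x + \mu_1^T\,\Sigma^{-1}\mu_1 - \mu_2^T\,\Sigma^{-1}\mu_2$

> This is linear in $x$, so comparing it to a threshold defines a hyperplane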

### Exercise: QDA 

- Use the provided [training](Class-Train.csv) and [query](Class-Query.csv) sets to perform classification

> **Training** set consists of 3 columns of ($x_i$, $y_i$, $C_i$)

> **Query** set only has 2 columns of ($x_i$, $y_i$)



In [2]:
class MyQDA(object):
    """ Template for classifier
    """
    def fit(self,X,C):
        # your code here
        return self

    def predict(self,Y):
        Cpred = None
        # your code here
        # use linalg.det(matrix)
        # and linalg.inv(matrix)
        return Cpred

In [3]:
class MyQDA(object):
    """ Simple implementation for illustration purposes
    """       
    def fit(self,X,C):
        self.param = dict()
        for k in np.unique(C):
            members = (C==k)
            prior = members.sum() / float(C.size)
            S = X[members,:] # subset of class 
            mu = S.mean(axis=0)    
            Z = (S-mu).T # centered column vectors
            cov = Z.dot(Z.T) / (Z[0,:].size-1)
            self.param[k] = (mu,cov,prior)
        return self
            
    def predict(self,Y):
        Cpred = -1 * ones(Y[:,0].size)
        for i in range(Cpred.size):
            d2min, kbest = 1e99, None
            for k in self.param:
                mu, cov, prior = self.param[k]
                diff = (Y[i,:]-mu).T
                d2 = diff.T.dot(linalg.inv(cov)).dot(diff) / 2
                d2 += np.log(linalg.det(cov)) / 2 - np.log(prior) 
                if d2<d2min: d2min,kbest = d2,k
            Cpred[i] = kbest
        return Cpred

In [4]:
# reference implementation
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA

D = np.loadtxt('files/Class-Train.csv', delimiter=',')
Q = np.loadtxt('files/Class-Query.csv', delimiter=',')
X, C = D[:,0:2], D[:,2]

Cpred = MyQDA().fit(X,C).predict(Q)
Cskit =   QDA().fit(X,C).predict(Q)

print ('Number of different estimates:', (Cpred!=Cskit).sum())

Number of different estimates: 0


<h1><font color="darkblue">Cross-Validation</font></h1>
<hr/>

- How to evaluate the quality of an estimator?

> $k$-NN method's parameter affects the results

- We saw on the IRIS data that 1-NN was overfitting

> We discussed excluding the point itself

### Partitions of the Training set

- Random complementary subsets 

> Train on the larger subset, test on the smaller one

> Multiple rounds to decrease variance

### Leave-One-Out

- For each point, we train on the others and test

> Testing on $n$ points requires $n$ trainings 

- Expensive!

### A Relaxed Variant

- $k$-fold cross-validation 

> 0. Create $k$ partitions of equal sizes, e.g., $k=2$ yields two subsets
> 0. Hold out a single partition for testing and train on the other $(k\!-\!1)$ 
> 0. Repeat for all $k$ partitions - requires $k$ trainings

- Leave-One-Out is a special case with $k=n$
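
The steps above can be sketched with plain NumPy. This is only an illustration: the function name `kfold_indices` and its `seed` argument are made up here, and the partitions differ in size by at most one when $k$ does not divide $n$.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train, test) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)            # shuffle the indices once
    folds = np.array_split(perm, k)      # k nearly equal partitions
    for i in range(k):
        test = folds[i]                  # the held-out partition
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

for train, test in kfold_indices(n=10, k=3):
    print(len(train), len(test))
```

Calling it with `k=n` produces one test point per round, which is the Leave-One-Out case.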


### Exercise: Cross-Validation

- Evaluate "QDA" on the [training](files/Class-Train.csv) set using 2-fold cross-validation

>0. What is the fraction of correct estimates? 
>0. What is the uncertainty of that fraction?
 
> The **training** set consists of 3 columns of ($x_i$, $y_i$, $C_i$)


In [5]:
Dc = D.copy()
# randomize and split to D1 + D2
np.random.seed(seed=42)
np.random.shuffle(Dc)
split = int(Dc[:,0].size/2)
D1, D2 = Dc[:split,:], Dc[split:,:]

# train on one, estimate on the other
# ... your code here ...

In [6]:
Dc = D.copy()
# randomize and split to D1 + D2
np.random.seed(seed=42)
np.random.shuffle(Dc)
split = int(Dc[:,0].size/2)
D1, D2 = Dc[:split,:], Dc[split:,:]
# train on one, estimate on the other
for i,(T,Q) in enumerate([(D1,D2),(D2,D1)]):
    X, C = T[:,0:2], T[:,2]
    Cpred, Ctrue = MyQDA().fit(X,C).predict(Q[:,:2]), Q[:,2]
    print ("Case #%d - Number of mislabeled points out of a total %3d points : %2d" \
        % (i, Q.shape[0],(Ctrue!=Cpred).sum()))

Case #0 - Number of mislabeled points out of a total 157 points : 19
Case #1 - Number of mislabeled points out of a total 156 points : 20
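
One way to answer the two questions of the exercise from the counts printed above is sketched below. It pools both folds and uses the binomial standard error $\sqrt{p(1-p)/n}$, which assumes independent errors with a common success probability; the scatter between the folds is an alternative, cruder estimate.

```python
import numpy as np

# Counts copied from the output above: (mislabeled, total) per fold
folds = [(19, 157), (20, 156)]

wrong = sum(w for w, _ in folds)
total = sum(n for _, n in folds)
p = 1 - wrong / total                     # pooled fraction of correct labels
sigma = np.sqrt(p * (1 - p) / total)      # binomial standard error
print('correct fraction: %.3f +/- %.3f' % (p, sigma))

# Per-fold fractions, to see the scatter between the two rounds
per_fold = [1 - w / n for w, n in folds]
print('per fold:', np.round(per_fold, 3))
```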


### Done already?

- Visualize the results in the 2D features space
- Make this simple code run faster 
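
One possible speed-up, sketched here under the assumption that the per-class parameters are stored as `(mu, cov, prior)` tuples as in `MyQDA`: invert each covariance once per class and score all query points at once with `np.einsum`. The names `fit_params` and `predict_vectorized` and the synthetic data are made up for illustration.

```python
import numpy as np

def fit_params(X, C):
    """Per-class (mean, covariance, prior), the same estimates as MyQDA.fit."""
    param = {}
    for k in np.unique(C):
        S = X[C == k]
        param[k] = (S.mean(axis=0), np.cov(S, rowvar=False), len(S) / len(C))
    return param

def predict_vectorized(param, Y):
    """Score all query points per class without a Python loop over points."""
    keys = list(param)
    scores = np.empty((Y.shape[0], len(keys)))
    for j, k in enumerate(keys):
        mu, cov, prior = param[k]
        diff = Y - mu                                     # shape (n, d)
        inv = np.linalg.inv(cov)                          # inverted once per class
        maha = np.einsum('ij,jk,ik->i', diff, inv, diff)  # row-wise quadratic form
        _, logdet = np.linalg.slogdet(cov)
        scores[:, j] = maha / 2 + logdet / 2 - np.log(prior)
    # the class with the smallest score wins
    return np.array(keys)[scores.argmin(axis=1)]

# Synthetic two-class data, for illustration only
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
C = np.repeat([0.0, 1.0], 50)
pred = predict_vectorized(fit_params(X, C), X)
print('training accuracy:', (pred == C).mean())
```

The remaining loop runs over classes only, so the cost of inverting the covariance no longer grows with the number of query points.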


### 3-fold CV - quick hack

In [7]:
Dc = D.copy()

# randomize and split to D1 + D2 + D3
np.random.seed(seed=42)
np.random.shuffle(Dc)
split = int(Dc[:,0].size/3)
split2 = 2*split
D1, D2, D3 = Dc[:split,:], Dc[split:split2,:], Dc[split2:]

# train on two, estimate on the third
for T,Q in [ (np.vstack([D1,D2]),D3), (np.vstack([D2,D3]),D1), (np.vstack([D3,D1]),D2)]:
    Cpred = QDA().fit(T[:,:2],T[:,2]).predict(Q[:,:2])
    Ctrue = Q[:,2]
    print ((Cpred!=Ctrue).sum(), T.shape)

11 (208, 3)
10 (209, 3)
17 (209, 3)


### Unhomework

- Implement LDA and compare to sklearn

>0. Write code without using sklearn 
>0. Apply to [training](Class-Train.csv) and [query](Class-Query.csv) sets 
>0. Compare your results to sklearn's 

- Perform 10-fold cross-validation of *MyQDA* on [this](Class-Train.csv) file

>0. Write code without using sklearn 
>0. Calculate the average and variance of the fraction of correct classifications 
>0. Compare to sklearn 

In [8]:
from sklearn.model_selection import cross_val_score
clf = QDA()
# note: X, C still hold the last training half (D2) from the 2-fold loop above
cross_val_score(clf, X,C, cv=10)

array([0.82352941, 0.82352941, 0.82352941, 0.875     , 0.73333333,
       0.8       , 1.        , 1.        , 1.        , 0.93333333])

### What does this mean?

In [9]:
from sklearn.model_selection import KFold
k_fold = KFold(n_splits=10, shuffle=False) 

for k, (train, test) in enumerate(k_fold.split(X)):
    clf.fit(X[train],C[train])
    Cpred = clf.predict(X[test])
    print (k, ':\t', (C[test]==Cpred).sum() / float(test.size),
        '  =  ', clf.score(X[test],C[test]) )

0 :	 0.8125   =   0.8125
1 :	 0.875   =   0.875
2 :	 0.75   =   0.75
3 :	 0.875   =   0.875
4 :	 0.8125   =   0.8125
5 :	 0.75   =   0.75
6 :	 1.0   =   1.0
7 :	 1.0   =   1.0
8 :	 0.9333333333333333   =   0.9333333333333333
9 :	 1.0   =   1.0


<h1><font color="darkblue">Summary</font></h1>
<hr/>

In [14]:
from sklearn import datasets

In [15]:
iris = datasets.load_iris()
c = np.unique(iris.target)
c

array([0, 1, 2])

### Procedure of [LDA & QDA](http://scikit-learn.org/stable/modules/lda_qda.html)

- Fit
> Estimate the parameters in each class

- Predict
> For each unlabeled data point, calculate the log-likelihood for each class
>
> Assign the point to the class $k$ with the largest log-likelihood

- Difference
> LDA: the same covariance matrix for all classes
>
> QDA: a separate covariance matrix for each class
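
The difference can be made concrete with a small sketch. For LDA, a common choice (assumed here) is the pooled within-class covariance: the class scatter matrices summed and divided by $n-K$. QDA keeps one covariance per class. The helper names and the synthetic data are illustrative, and sklearn's own estimators may differ in details such as normalization.

```python
import numpy as np

def class_covariances(X, C):
    """One covariance matrix per class, as QDA uses."""
    return {k: np.cov(X[C == k], rowvar=False) for k in np.unique(C)}

def pooled_covariance(X, C):
    """Single shared covariance, as LDA assumes: pooled within-class scatter."""
    classes = np.unique(C)
    d = X.shape[1]
    scatter = np.zeros((d, d))
    for k in classes:
        Z = X[C == k] - X[C == k].mean(axis=0)   # center within the class
        scatter += Z.T @ Z
    return scatter / (len(C) - len(classes))

# Synthetic data with three classes, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 3))
C = np.repeat([0, 1, 2], 30)

covs = class_covariances(X, C)
pooled = pooled_covariance(X, C)

d = X.shape[1]
n_cov = d * (d + 1) // 2       # free parameters of one symmetric d x d matrix
print('QDA covariance parameters:', len(covs) * n_cov)
print('LDA covariance parameters:', n_cov)
```

The parameter count shows why LDA needs less data: it estimates one symmetric matrix instead of one per class.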


In [20]:
# Toy example for Quadratic Discriminant Analysis

class QDA(dict):
    
    def fit(self, X, C):
        for k in np.unique(C):
            # Boolean mask of the observations in class k
            members = (C==k)
            # Number of observations in class k
            num = members.sum() 
            # Use the class frequency as prior
            prior = num / float(C.size)
            # Select the observations in class k
            S = X[members,:] 
            # Calculate the mean of class k
            mu = S.mean(axis=0)    
            # Center; the columns of Z are the centered vectors
            Z = (S-mu).T
            # Calculate the covariance matrix of class k
            cov = Z.dot(Z.T) / (Z[0,:].size-1)
            # Save the result for class k
            self[k] = (num, prior, mu, cov)

            
    def predict(self, Y):
        pred = -1 * ones(Y.shape[0])
        for i in range(pred.size):
            # Initialization
            d2min, kbest = 1e99, None
            # Calculate the negative log-posterior (up to a constant) for each class
            for k in self: 
                num, prior, mu, cov = self[k]
                diff = (Y[i,:]-mu).T
                d2 = diff.T.dot(linalg.inv(cov)).dot(diff) / 2
                d2 += np.log(linalg.det(cov)) / 2 - np.log(prior) 
                # Keep the class with the smallest value, i.e., the largest posterior
                if d2 < d2min: 
                    d2min, kbest = d2,k
            pred[i] = kbest
        return pred

In [21]:
clf = QDA()
clf.fit(iris.data, iris.target)
pred = clf.predict(iris.data)

print('Classifier: QDA')
print('Number of mislabeled points out of a total %d points : %d' % (iris.target.size, (iris.target!=pred).sum()))
print('Accuracy: ', mean(iris.target==pred))

Classifier: QDA
Number of mislabeled points out of a total 150 points : 3
Accuracy:  0.98


In [22]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [23]:
# Specify the model
clf = LinearDiscriminantAnalysis(priors=None)

# Fit
clf.fit(iris.data, iris.target)

# Predict
pred = clf.predict(iris.data)

print('Classifier: LDA')
print('Number of mislabeled points out of a total %d points : %d' % (iris.target.size, (iris.target!=pred).sum()))
print('Accuracy: ', mean(iris.target==pred))

Classifier: LDA
Number of mislabeled points out of a total 150 points : 3
Accuracy:  0.98


In [24]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [25]:
# Specify the model
clf = QuadraticDiscriminantAnalysis(priors=None)

# Fit
clf.fit(iris.data, iris.target)

# Predict
pred = clf.predict(iris.data)

print('Classifier: QDA')
print('Number of mislabeled points out of a total %d points : %d' % (iris.target.size, (iris.target!=pred).sum()))
print('Accuracy: ', mean(iris.target==pred))

Classifier: QDA
Number of mislabeled points out of a total 150 points : 3
Accuracy:  0.98


### [Split Training Data and Test Data](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [21]:
from sklearn.model_selection import train_test_split

In [22]:
Y = np.arange(10)
Y

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [23]:
X = np.arange(20).reshape(10, 2)
X

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15],
       [16, 17],
       [18, 19]])

In [24]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=2018)
print('X_train: \n', X_train)
print('X_test: \n', X_test)
print('Y_train: \n', Y_train)
print('Y_test: \n', Y_test)

X_train: 
 [[16 17]
 [18 19]
 [ 8  9]
 [10 11]
 [ 2  3]
 [ 4  5]
 [12 13]]
X_test: 
 [[ 0  1]
 [ 6  7]
 [14 15]]
Y_train: 
 [8 9 4 5 1 2 6]
Y_test: 
 [0 3 7]


In [25]:
# If you don't want to shuffle
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=2018, shuffle=False)
print('X_train: \n', X_train)
print('X_test: \n', X_test)
print('Y_train: \n', Y_train)
print('Y_test: \n', Y_test)

X_train: 
 [[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]]
X_test: 
 [[14 15]
 [16 17]
 [18 19]]
Y_train: 
 [0 1 2 3 4 5 6]
Y_test: 
 [7 8 9]


### [Cross-Validation](http://scikit-learn.org/stable/modules/cross_validation.html)

In [26]:
from sklearn.model_selection import cross_val_score

In [27]:
wine = datasets.load_wine()

In [28]:
wine.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

In [29]:
# Features
wine.data.shape

(178, 13)

In [30]:
# Label
wine.target.shape

(178,)

In [31]:
# Number of classes and number of observations in each class
np.unique(wine.target, return_counts=True)

(array([0, 1, 2]), array([59, 71, 48]))

In [32]:
from sklearn import neighbors

In [33]:
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
cvscores = cross_val_score(estimator=knn, X=wine.data, y=wine.target, cv=10)
print(cvscores)

[ 0.68421053  0.55555556  0.72222222  0.66666667  0.66666667  0.66666667
  0.72222222  0.77777778  0.88235294  0.875     ]


In [34]:
mean(cvscores)

0.72193412452700367

In [35]:
# Alternatively

# Split dataset into k consecutive folds (without shuffling by default).
from sklearn.model_selection import KFold

# The folds are made by preserving the percentage of samples for each class
from sklearn.model_selection import StratifiedKFold 

In [36]:
wine.data.shape

(178, 13)

In [37]:
wine.target.shape

(178,)

In [38]:
skf = StratifiedKFold(n_splits=2, shuffle=False) # random_state only applies when shuffle == True; recent sklearn versions reject it otherwise
for fold1_index, fold2_index in skf.split(wine.data, wine.target):
    print('Fold1:', fold1_index)
    print('Fold2:', fold2_index)
    print('\n')

Fold1: [ 30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47
  48  49  50  51  52  53  54  55  56  57  58  95  96  97  98  99 100 101
 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
 120 121 122 123 124 125 126 127 128 129 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177]
Fold2: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  59  60  61  62  63  64
  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82
  83  84  85  86  87  88  89  90  91  92  93  94 130 131 132 133 134 135
 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153]


Fold1: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  59  60  61  62  63  64
  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82
  83  84  85  86  87  88  89  90  

In [39]:
# The distribution of the classes in each fold (indices from the last split of the loop above)
y_fold1 = wine.target[fold1_index]
y_fold2 = wine.target[fold2_index]
print('Fold1:')
print(np.unique(y_fold1, return_counts=True))
print('Fold2:')
print(np.unique(y_fold2, return_counts=True))

Fold1:
(array([0, 1, 2]), array([30, 36, 24]))
Fold2:
(array([0, 1, 2]), array([29, 35, 24]))


In [40]:
kf = KFold(n_splits=2, shuffle=True, random_state=2018) # random_state used when shuffle == True.
for fold1_index, fold2_index in kf.split(wine.data, wine.target):
    print('Fold1:', fold1_index)
    print('Fold2:', fold2_index)
    print('\n')

Fold1: [  0   4   6   9  16  19  22  25  26  27  29  31  34  38  40  42  43  44
  45  46  47  49  50  55  59  60  61  65  67  70  71  72  73  74  75  76
  77  79  83  86  87  88  93  96  97  98  99 100 102 103 104 105 109 113
 114 117 119 120 122 123 126 127 128 130 134 135 137 141 142 145 147 148
 149 150 151 152 153 154 156 157 159 160 163 164 170 171 172 175 177]
Fold2: [  1   2   3   5   7   8  10  11  12  13  14  15  17  18  20  21  23  24
  28  30  32  33  35  36  37  39  41  48  51  52  53  54  56  57  58  62
  63  64  66  68  69  78  80  81  82  84  85  89  90  91  92  94  95 101
 106 107 108 110 111 112 115 116 118 121 124 125 129 131 132 133 136 138
 139 140 143 144 146 155 158 161 162 165 166 167 168 169 173 174 176]


Fold1: [  1   2   3   5   7   8  10  11  12  13  14  15  17  18  20  21  23  24
  28  30  32  33  35  36  37  39  41  48  51  52  53  54  56  57  58  62
  63  64  66  68  69  78  80  81  82  84  85  89  90  91  92  94  95 101
 106 107 108 110 111 112 115 116 1

In [41]:
# The distribution of the classes in each fold (indices from the last split of the loop above)
y_fold1 = wine.target[fold1_index]
y_fold2 = wine.target[fold2_index]
print('Fold1:')
print(np.unique(y_fold1, return_counts=True))
print('Fold2:')
print(np.unique(y_fold2, return_counts=True))

Fold1:
(array([0, 1, 2]), array([35, 32, 22]))
Fold2:
(array([0, 1, 2]), array([24, 39, 26]))


- <font color="red">**NOTE**: </font> Be careful about unbalanced classes when splitting the data
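
A minimal sketch of one way to guard against this with a single split: `train_test_split` accepts a `stratify` argument that keeps the class proportions nearly the same in both parts. The 50/50 split and the reuse of the wine data are choices made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()

# stratify=... keeps the class proportions (nearly) equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data, wine.target, test_size=0.5, random_state=2018,
    stratify=wine.target)

# Class fractions in the full set and in each part
print('all  :', np.bincount(wine.target) / wine.target.size)
print('train:', np.bincount(y_tr) / y_tr.size)
print('test :', np.bincount(y_te) / y_te.size)
```

For cross-validation, `StratifiedKFold` plays the same role, as the class counts printed for it above show.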