## Subject 1: MNIST Clustering

1. Download the MNIST data set using the code below.
2. Ignoring the labels normally associated with the dataset, construct a clustering of the data. Your clustering should maximize the V-measure score relative to the true data labels.
    Paper describing the measure: https://www.aclweb.org/anthology/D07-1043.pdf
    Implementation available in Python: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.v_measure_score.html#sklearn.metrics.v_measure_score
    Note that failing to ignore the labels during training will void your score for this subject.
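
As a rough illustration of what the V-measure rewards, the sketch below computes it from the definitions in the paper linked above (homogeneity and completeness, combined by their harmonic mean with equal weighting). It is a minimal pure-Python version written for illustration only; the helper names `entropy`, `conditional_entropy`, and `v_measure` are hypothetical, and the scikit-learn function should be preferred for real scoring.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """Conditional entropy H(target | given), computed from joint counts."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (_, g), c in joint.items())


def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


truth = [0, 0, 1, 1]
# Same partition with the cluster ids swapped: the score ignores label names
print(v_measure(truth, [1, 1, 0, 0]))
# Everything lumped into one cluster: complete but not homogeneous
print(v_measure(truth, [0, 0, 0, 0]))
```

Because the score depends only on how points are grouped, cluster ids never need to be matched to digit labels before scoring.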

In [10]:
from sklearn.datasets import fetch_openml
# Load data from https://www.openml.org/d/554
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

print(X)
print(y)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
['5' '0' '4' ... '4' '5' '6']

In [11]:
import pandas as pd
df = pd.DataFrame(X, index=y)  # digit labels become the row index (for inspection only)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,774,775,776,777,778,779,780,781,782,783
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
import matplotlib.pyplot as plt
from matplotlib.cm import binary

some_digit_image = X[1].reshape(28, 28)
plt.imshow(some_digit_image, cmap=binary, interpolation="nearest")
plt.axis('off')
plt.show()

<Figure size 640x480 with 1 Axes>

In [13]:
X_norm = X / 255.  # normalization: scale pixel values from [0, 255] to [0, 1]

In [16]:
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import v_measure_score
kmeans = KMeans(n_clusters=3, random_state=0, max_iter=500, n_init=15).fit(X_norm)
v_measure_score(y, kmeans.labels_)  # for 3 clusters

0.28592992067939343

In [15]:
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import v_measure_score

# KMeans with 10 clusters, one per digit class
kmeans = KMeans(n_clusters=10, random_state=0, max_iter=500, n_init=15).fit(X_norm)
v_measure_score(y, kmeans.labels_)

0.49974303091937844

In [None]:
kmeans = KMeans(n_clusters=15, random_state=0,max_iter=500, n_init=15).fit(X_norm)
v_measure_score(y, kmeans.labels_) 

In [24]:
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import v_measure_score

# Sweep the number of clusters and record the V-measure for each
v_score = []
for j in range(6, 12):
    kmeans = KMeans(n_clusters=j, random_state=0).fit(X_norm)
    v_score.append(v_measure_score(y, kmeans.labels_))

In [25]:
print(dict(zip(range(6,12),v_score)))

{6: 0.45175031159301016, 7: 0.47041202409755334, 8: 0.5063185078368722, 9: 0.49246848728550235, 10: 0.49974303091937844, 11: 0.4971195507583373}


Among the tested values (6 to 11 clusters), 8 clusters gives the highest V-measure (0.506). 10 clusters is a close second at 0.500.
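
The selection can be reproduced directly from the printed scores. The snippet below is a small sketch that copies the dictionary from the sweep output and picks the cluster count with the highest V-measure; the variable names `v_scores`, `best_k`, and `gap_to_10` are illustrative only.

```python
# V-measure per cluster count, copied from the sweep output above
v_scores = {
    6: 0.45175031159301016,
    7: 0.47041202409755334,
    8: 0.5063185078368722,
    9: 0.49246848728550235,
    10: 0.49974303091937844,
    11: 0.4971195507583373,
}

# Cluster count with the highest score
best_k = max(v_scores, key=v_scores.get)
print(best_k, round(v_scores[best_k], 3))

# How far the 10-cluster run trails the best one
gap_to_10 = v_scores[best_k] - v_scores[10]
print(round(gap_to_10, 4))
```

The gap between 8 and 10 clusters is small, so the ranking could plausibly change with a different `random_state` or `n_init`.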

## Subject 2: Text NewsGroup Classification
1. Download the newsgroups data set using the code below. 
2. Construct a text classifier that predicts the target variable (newsgroups.target) from the input data (newsgroups.data).
3. We will evaluate your classifier against a hold-out data set, so be sure to construct a classification function that can receive a single string.

In [1]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='train')

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


count_vect = CountVectorizer(strip_accents='ascii', token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b', 
                     lowercase=True, stop_words='english')

X_transformed = count_vect.fit_transform(newsgroups.data)



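mod = MultinomialNB()
mod.fit(X_transformed, newsgroups.target)
# Predictions on the training data itself, so the report below measures training fit, not hold-out accuracy
predictions = mod.predict(X_transformed)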

In [3]:
from sklearn.metrics import classification_report
print(classification_report(y_true=newsgroups.target,y_pred=predictions))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97       480
           1       0.84      0.97      0.90       584
           2       0.97      0.24      0.39       591
           3       0.72      0.97      0.83       590
           4       0.93      0.98      0.96       578
           5       0.84      0.98      0.91       593
           6       0.96      0.93      0.95       585
           7       0.97      0.98      0.98       594
           8       1.00      0.99      0.99       598
           9       0.99      0.99      0.99       597
          10       0.98      0.99      0.99       600
          11       0.96      0.99      0.98       595
          12       0.96      0.97      0.97       591
          13       0.99      0.99      0.99       594
          14       0.98      1.00      0.99       593
          15       0.98      0.99      0.99       599
          16       0.97      0.99      0.98       546
          17       0.99    

In [4]:
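# Accepts a single string or a list of strings (like newsgroups.data)
def predict_val(text):
    # Wrap a lone string in a list, since the vectorizer expects an iterable of documents
    if isinstance(text, str):
        text = [text]
    X_trans = count_vect.transform(text)
    predictions = mod.predict(X_trans)
    return predictions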

In [5]:
# testing the function
predict_val([newsgroups.data[0]])

array([7])
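
Since the task asks for a classification function that receives a single string, one possible alternative is to bundle the vectorizer and the classifier into a scikit-learn pipeline so that raw text goes in and a class index comes out. The sketch below assumes a tiny hypothetical corpus (`toy_docs`, `toy_targets`) in place of `newsgroups.data` and `newsgroups.target`, and the function name `classify` is illustrative; it is not the notebook's own method.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for newsgroups.data / newsgroups.target
toy_docs = [
    "the engine and wheels of the car",
    "car engine oil change",
    "the pitcher threw the baseball",
    "baseball game home run",
]
toy_targets = [0, 0, 1, 1]  # 0 = autos, 1 = baseball (made-up classes)

# The pipeline applies the vectorizer, then the classifier, in one object
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(toy_docs, toy_targets)


def classify(text: str) -> int:
    """Return the predicted class index for one raw document."""
    # predict expects an iterable of documents, so wrap the string in a list
    return int(text_clf.predict([text])[0])


print(classify("engine oil for my car"))
```

Keeping both steps in one fitted object means the same preprocessing is applied at training and prediction time without relying on global variables.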