### Introducing the 20 Newsgroups data set
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.


### Classifying the 20 Newsgroups data set with the k-nearest neighbors (KNN) algorithm

In [1]:
# import required modules
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
# importing the required module
from sklearn import metrics

In [3]:
# import 20 newsgroup dataset
from sklearn.datasets import fetch_20newsgroups

#categories = ['alt.atheism', 'comp.graphics', 'sci.space']
categories = None
data_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories)
data_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=categories)

In [4]:
type(data_train)
type(data_test)

sklearn.utils._bunch.Bunch

In [5]:
print("Train data target labels:", data_train.target)
print("Test data target labels:", data_test.target)

Train data target labels: [7 4 4 ... 3 1 8]
Test data target labels: [ 7  5  0 ...  9  6 15]


In [6]:
print("Train data target names:", data_train.target_names)
print("Test data target names:", data_test.target_names)

Train data target names: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Test data target names: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [7]:
print("Total train data:", len(data_train.data))
print("Total test data:", len(data_test.data))

Total train data: 11314
Total test data: 7532


In [8]:
# Train data type
print(type(data_train.data))
print(type(data_train.target))

# Test data type
print(type(data_test.data))
print(type(data_test.target))

<class 'list'>
<class 'numpy.ndarray'>
<class 'list'>
<class 'numpy.ndarray'>


### Requirements for working with data in scikit-learn
1. Features and response are separate objects
2. Features and response should be numeric
3. Features and response should be NumPy arrays (for features, SciPy sparse matrices such as the output of TfidfVectorizer are also accepted)
4. Features and response should have specific shapes
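
These requirements can be checked programmatically. The following is a minimal sketch, assuming hypothetical toy inputs (`X_demo`, `y_demo`) rather than the newsgroup data; it uses scikit-learn's `check_X_y` helper to validate a sparse feature matrix and a response vector together.

```python
import numpy as np
from scipy import sparse
from sklearn.utils import check_X_y

# Hypothetical toy inputs: 3 observations, 2 features.
X_demo = sparse.csr_matrix(np.array([[0.0, 1.0], [0.5, 0.0], [0.0, 0.2]]))
y_demo = np.array([0, 1, 0])

# check_X_y validates that X is 2-D and numeric, y is 1-D, and the lengths match.
X_ok, y_ok = check_X_y(X_demo, y_demo, accept_sparse=True)
print(X_ok.shape, y_ok.shape)

# A response vector of the wrong length is rejected.
try:
    check_X_y(X_demo, np.array([0, 1]), accept_sparse=True)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print("mismatch rejected:", mismatch_rejected)
```

The features have shape (observations, features) and the response has one entry per observation, which is the shape convention checked later in this notebook.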

In [9]:
# Convert the text data into numeric feature vectors using TF-IDF.
# Fit the vocabulary on the training set only, then reuse it for the test set.
vectorizer = TfidfVectorizer()
data_train_vectors = vectorizer.fit_transform(data_train.data)
data_test_vectors = vectorizer.transform(data_test.data) 
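
To illustrate what `fit_transform` and `transform` do, here is a small sketch on a hypothetical three-document corpus (the documents and variable names are assumptions for illustration). The vocabulary is learned from the training documents only, so words that appear only in the test document are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the newsgroup data.
train_docs = ["the cat sat", "the dog barked", "the cat and the dog"]
test_docs = ["the bird sang"]

vec = TfidfVectorizer()
X_train_demo = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
X_test_demo = vec.transform(test_docs)        # reuses the fitted vocabulary

print(sorted(vec.vocabulary_))
print(X_train_demo.shape, X_test_demo.shape)
print("non-zero entries in the test row:", X_test_demo.nnz)
```

Both matrices have the same number of columns, one per vocabulary term, which is what lets a model trained on one be applied to the other.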

In [10]:
# Train data type
print(type(data_train_vectors.data))
print(type(data_train.target))

# Test data type
print(type(data_test_vectors.data))
print(type(data_test.target))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [11]:
# check the shape of the features matrix
print(data_train_vectors.shape)

(11314, 101631)


In [12]:
# check the shape of the response (single dimension matching the number of observations)
print(data_train.target.shape)

(11314,)


### Train / Test data

In [13]:
# store training feature matrix in "Xtr"
Xtr = data_train_vectors
print("Xtr:\n", Xtr)

# store training response vector in "ytr"
ytr = data_train.target
print("ytr:", ytr)

Xtr:
   (np.int32(0), np.int32(95844))	0.2085823901983838
  (np.int32(0), np.int32(97181))	0.11904550931918896
  (np.int32(0), np.int32(48754))	0.08965702221604545
  (np.int32(0), np.int32(18915))	0.14367199533485261
  (np.int32(0), np.int32(68847))	0.058045106782124184
  (np.int32(0), np.int32(88638))	0.05191891673444119
  (np.int32(0), np.int32(30074))	0.06757106887546716
  (np.int32(0), np.int32(37335))	0.17894975360023155
  (np.int32(0), np.int32(60560))	0.056847192482369406
  (np.int32(0), np.int32(68080))	0.08295129558221048
  (np.int32(0), np.int32(88767))	0.16403190525380046
  (np.int32(0), np.int32(25775))	0.4046469916999255
  (np.int32(0), np.int32(80623))	0.11038705606471814
  (np.int32(0), np.int32(88532))	0.1642840263996455
  (np.int32(0), np.int32(68781))	0.06213700528329297
  (np.int32(0), np.int32(31990))	0.09144124612366937
  (np.int32(0), np.int32(51326))	0.07095431715969416
  (np.int32(0), np.int32(34809))	0.1289860835335575
  (np.int32(0), np.int32(84538))	0.1421950

In [14]:
# store testing feature matrix in "Xtt"
Xtt = data_test_vectors
print("Xtt:\n", Xtt)

# store testing response vector in "ytt"
ytt = data_test.target
print("ytt:", ytt)

Xtt:
   (np.int32(0), np.int32(12725))	0.15335704976717565
  (np.int32(0), np.int32(12796))	0.3002667198928725
  (np.int32(0), np.int32(17936))	0.05759434615712483
  (np.int32(0), np.int32(18091))	0.0701782380881259
  (np.int32(0), np.int32(18165))	0.15201468957423658
  (np.int32(0), np.int32(18521))	0.03497274159244273
  (np.int32(0), np.int32(19443))	0.09797137915272827
  (np.int32(0), np.int32(19756))	0.052231473530768874
  (np.int32(0), np.int32(22494))	0.09613590599615258
  (np.int32(0), np.int32(23622))	0.22430112916077485
  (np.int32(0), np.int32(24849))	0.11147946738587439
  (np.int32(0), np.int32(25590))	0.05344626240468926
  (np.int32(0), np.int32(29214))	0.15626686058683387
  (np.int32(0), np.int32(30074))	0.07389720168367159
  (np.int32(0), np.int32(31040))	0.14924087861925034
  (np.int32(0), np.int32(32737))	0.15335704976717565
  (np.int32(0), np.int32(33605))	0.14647368373226027
  (np.int32(0), np.int32(35974))	0.12034350910056273
  (np.int32(0), np.int32(39524))	0.096759

### K-Nearest Neighbors (KNN) classification 
## Hyperparameters: n_neighbors=5, weights='uniform' (default)
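
To make the two hyperparameters concrete, here is a minimal sketch on a hypothetical one-dimensional toy set (not the newsgroup data). `n_neighbors` sets how many training points vote, and `weights` decides whether each vote counts equally (`'uniform'`, the default) or in proportion to the inverse of its distance (`'distance'`).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one point of class 0, two points of class 1.
X_toy = np.array([[0.0], [1.0], [1.2]])
y_toy = np.array([0, 1, 1])
query = np.array([[0.1]])  # very close to the single class-0 point

# Uniform weights: the three neighbors vote equally, so class 1 wins 2 to 1.
knn_uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
pred_uniform = int(knn_uniform.predict(query)[0])

# Distance weights: each vote is scaled by 1/distance, so the nearby class-0 point dominates.
knn_distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)
pred_distance = int(knn_distance.predict(query)[0])

print("uniform:", pred_uniform, "distance:", pred_distance)
```

With `n_neighbors=1` only the single nearest point votes, so the `weights` setting cannot change the prediction.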

In [15]:
# import the required module from scikit learn
from sklearn.neighbors import KNeighborsClassifier

In [17]:
# Implement the classification model using KNeighborsClassifier

# Instantiate the estimator (weights defaults to 'uniform')
clf_knn = KNeighborsClassifier(n_neighbors=5)

# Fit the model with the training data (aka "model training")
clf_knn.fit(Xtr, ytr)

# Predict class labels for the test set
y_pred = clf_knn.predict(Xtt)
print("Predicted Class Labels:", y_pred)

# Predict per-class probabilities for the test set
y_pred_score_knn = clf_knn.predict_proba(Xtt)
# print("Predicted Score:\n", y_pred_score_knn)
print("Classification Accuracy:", metrics.accuracy_score(ytt, y_pred))
print("Classification Error of KNN:", 1 - metrics.accuracy_score(ytt, y_pred))

Predicted Class Labels: [1 1 7 ... 2 2 1]
Classification Accuracy: 0.07939458311205523
Classification Error of KNN: 0.9206054168879447


### Confusion matrix
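A table that describes the performance of a classification model. Each row is a true class, each column is a predicted class, and the diagonal counts the correct predictions.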
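
A small example can help when reading the 20 x 20 matrix printed in this section. The sketch below uses hypothetical labels for three classes; entry `[i, j]` counts observations whose true class is `i` and whose predicted class is `j`.

```python
from sklearn import metrics

# Hypothetical labels for a 3-class problem.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

# First argument is true values, second argument is predicted values.
cm_demo = metrics.confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# The diagonal holds the correct predictions, so trace / total is the accuracy.
correct = int(cm_demo.trace())
total = int(cm_demo.sum())
print("accuracy from the matrix:", correct / total)
```

Large off-diagonal values in a column, as in the first few columns of the matrix printed in this section, indicate that the model predicts those classes far more often than it should.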

In [19]:
# first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(ytt, y_pred))

[[43 46 51 24 44  4  4 46  6  5  5  1  5  0  9  5  3  5  6  7]
 [40 76 55 34 44  7  7 45  7 16  8  4  8  6  9  0  8  3  9  3]
 [41 56 74 27 51  3 11 53  6 16  3  4  7  3  9  2  9  8  6  5]
 [45 54 71 40 39  1  7 42 13 10  8  1 10  7  5  2  6 10 16  5]
 [46 48 64 26 53  4  9 52  7 15  7  5  4  6  4  4  8  8  8  7]
 [42 68 55 32 47 16  6 39 14 14  2  4 11  4  5  0  8  5 14  9]
 [40 55 59 19 46  5 31 48  4 15  2  2 10 11 11  1  4 10  9  8]
 [45 59 55 33 60  6  6 54 10 11 10  3  9  5  8  0  2  5  8  7]
 [47 49 82 25 46  4 10 44 15 26  8  1  9  5  8  2  3  6  4  4]
 [38 58 57 21 57  4 11 54  7 21  6  5  8 11  7  1  4 10  6 11]
 [36 57 70 25 45  5  9 38 14 23 29  3 11  2  8  1  5  9  3  6]
 [40 46 68 21 51  2  6 65 12 18  6 14  4  7  7  2 10  4  7  6]
 [45 67 76 22 54  8  6 38  7 10  7  0 13  5 13  1  5  3  8  5]
 [47 58 52 26 64  5  7 52 11 14 11  3 13  6  7  2  5  8  1  4]
 [40 53 56 26 58  4  8 47 10 20  4  5 11  7 16  0  5  1  8 15]
 [51 53 63 25 59  1  4 43  7 20  3  4 10  8 10 13  6  3

### Sensitivity: When the actual value is positive, how often is the prediction correct?

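How "sensitive" is the classifier to detecting positive instances?
Also known as "True Positive Rate" or "Recall": TP / (TP + FN).
With 20 classes, `average='weighted'` computes recall per class and averages the results weighted by class support, which for single-label problems equals the overall accuracy.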
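
In a multi-class setting, recall is computed per class and then averaged. The following sketch, using hypothetical labels, shows that the support-weighted average of per-class recall is the same number as the overall accuracy, which is why the sensitivity and accuracy printed for this model coincide.

```python
from sklearn import metrics

# Hypothetical labels for a 3-class problem with unequal class sizes.
y_true_demo = [0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 1, 1, 2, 2]

# Recall for each class separately: TP / (TP + FN).
recall_per_class = metrics.recall_score(y_true_demo, y_pred_demo, average=None)

# Support-weighted average of the per-class recalls.
recall_weighted = metrics.recall_score(y_true_demo, y_pred_demo, average="weighted")

accuracy = metrics.accuracy_score(y_true_demo, y_pred_demo)

print("per-class recall:", recall_per_class)
print("weighted recall:", recall_weighted)
print("accuracy:", accuracy)
```

If a per-class view is more useful than a single number, `average=None` or `average='macro'` are alternatives to the weighted average.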

In [20]:
print("Sensitivity of KNN:", metrics.recall_score(ytt, y_pred, average='weighted'))

Sensitivity of KNN: 0.07939458311205523


### Precision: When a positive value is predicted, how often is the prediction correct?

How "precise" is the classifier when predicting positive instances?
TP / (TP + FP)
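
As a worked example of the formula, the sketch below (hypothetical labels again) derives per-class precision directly from a confusion matrix: the diagonal gives TP, and each column sum gives TP + FP for that class. The result is compared with `precision_score`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for a 3-class problem.
y_true_demo = [0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 1, 1, 2, 2]

cm_demo = metrics.confusion_matrix(y_true_demo, y_pred_demo)

true_positives = np.diag(cm_demo)        # correct predictions per class
predicted_totals = cm_demo.sum(axis=0)   # TP + FP: everything predicted as each class
precision_manual = true_positives / predicted_totals

precision_sklearn = metrics.precision_score(y_true_demo, y_pred_demo, average=None)

print("manual: ", precision_manual)
print("sklearn:", precision_sklearn)
```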

In [21]:
print("Precision of KNN:", metrics.precision_score(ytt, y_pred, average='weighted'))

Precision of KNN: 0.12100733740836499


### F-measure:

F = 2 P R / (P + R), where P is precision and R is recall
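
The formula applies to one class at a time. With `average='weighted'`, the F-measure is computed per class and the per-class values are then averaged by support, so the reported figure need not equal the formula applied to the weighted precision and weighted recall. A minimal pure-Python sketch with hypothetical per-class values:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: 2 P R / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-class precision, recall and support for three classes.
precisions = [1.0, 0.5, 0.5]
recalls = [2 / 3, 0.5, 1.0]
supports = [3, 2, 1]
total = sum(supports)

# Weighted F-measure: average the per-class F values by support.
per_class_f = [f_measure(p, r) for p, r in zip(precisions, recalls)]
weighted_f = sum(f * s for f, s in zip(per_class_f, supports)) / total

# For comparison: the formula applied to the weighted precision and recall.
weighted_p = sum(p * s for p, s in zip(precisions, supports)) / total
weighted_r = sum(r * s for r, s in zip(recalls, supports)) / total
f_of_averages = f_measure(weighted_p, weighted_r)

print("weighted F-measure:", round(weighted_f, 4))
print("F of weighted P and R:", round(f_of_averages, 4))
```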

In [22]:
print("F-measure of KNN:", metrics.f1_score(ytt, y_pred, average='weighted'))

F-measure of KNN: 0.07557918282968247


### KNN Classification 2
## Hyperparameters: n_neighbors=1, weights='distance'

In [23]:
# Instantiate the estimator
clf_knn = KNeighborsClassifier(n_neighbors=1, weights='distance')

# Fit the model with the training data (aka "model training")
clf_knn.fit(Xtr, ytr)

# Predict class labels for the test set
y_pred_knn = clf_knn.predict(Xtt)
print("Predicted Class Labels:", y_pred_knn)

# Calculate accuracy and error
print("Classification Accuracy:", metrics.accuracy_score(ytt, y_pred_knn))
print("Classification Error of KNN:", 1 - metrics.accuracy_score(ytt, y_pred_knn))

Predicted Class Labels: [ 9  1  3 ...  1 19  3]
Classification Accuracy: 0.1090015932023367
Classification Error of KNN: 0.8909984067976633


### Confusion matrix
Table that describes the performance of a classification model

In [24]:
# first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(ytt, y_pred_knn))

[[24 17 25 12 28 11 11 36 14 15 10 13 16  9 19  6 11 10 12 20]
 [16 40 20 23 23 15  9 46 15 26 15 10 17 17 20  9 20 21 11 16]
 [13 15 45 30 29 19  8 52  8 19  9  9 23 19 22 12 12 17  9 24]
 [13 21 32 59 28  8  4 41 21 21 13 10 19 19 20 14 10 11 14 14]
 [11 17 23 18 40 20 10 42 14 20 20 13 22 11 25 11 15 19  8 26]
 [10 22 16 23 33 33  6 54 15 27 12  8 15 12 27 12 15 15 13 27]
 [13 20 26 18 28 10 48 33 15 26 21  7 23 14 21  5 14 21  9 18]
 [18 10 26 17 39 29  7 53 14 24 10 15 20 13 27  5 12 18 19 20]
 [ 9 15 28 19 28 19 10 42 44 33 13  6 24 15 22  8  8 14 14 27]
 [13 19 22 18 32 22  4 45 15 41 13 13 23 16 22 12 12 23 14 18]
 [ 4 22 23 12 32 10  4 49 22 24 61 12 17 21 20  9 12 10 15 20]
 [10 18 30 16 27 18  6 41 14 37 20 34 21 16 18  8 15 15 10 22]
 [11 23 30 21 32 22  8 49 15 15 11  8 37 16 18  9  7 18 25 18]
 [11 13 21 23 33 21  8 51 18 21 22  9 17 30 27  9 14 15 16 17]
 [11 14 17 17 35 22  4 36  7 29 15 17 19 21 44 14 11 23 12 26]
 [21 20 27 18 32 21  7 31 14 19 14 13  6 21 22 41 16 18

### Sensitivity: When the actual value is positive, how often is the prediction correct?

How "sensitive" is the classifier to detecting positive instances?
Also known as "True Positive Rate" or "Recall"

In [25]:
print("Sensitivity of KNN:", metrics.recall_score(ytt, y_pred_knn, average='weighted'))

Sensitivity of KNN: 0.1090015932023367


### Precision: When a positive value is predicted, how often is the prediction correct?

How "precise" is the classifier when predicting positive instances?
TP / (TP + FP)

In [26]:
print("Precision of KNN:", metrics.precision_score(ytt, y_pred_knn, average='weighted'))

Precision of KNN: 0.12167713712643455


### F-measure:

2 P R / (P + R)

In [27]:
print("F-measure of KNN:", metrics.f1_score(ytt, y_pred_knn, average='weighted'))

F-measure of KNN: 0.11149546811846324


### KNN Classification 3
## Hyperparameters: n_neighbors=2, weights='distance'

In [36]:
# Instantiate the estimator
clf_knn = KNeighborsClassifier(n_neighbors=2, weights='distance')

# Fit the model with the training data (aka "model training")
clf_knn.fit(Xtr, ytr)

# Predict class labels for the test set
y_pred_knn = clf_knn.predict(Xtt)
print("Predicted Class Labels:", y_pred_knn)

# Calculate accuracy and error
print("Classification Accuracy:", metrics.accuracy_score(ytt, y_pred_knn))
print("Classification Error of KNN:", 1 - metrics.accuracy_score(ytt, y_pred_knn))

Predicted Class Labels: [1 1 3 ... 1 7 3]
Classification Accuracy: 0.11364843335103558
Classification Error of KNN: 0.8863515666489644


### Confusion matrix
Table that describes the performance of a classification model

In [29]:
# first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(ytt, y_pred_knn))

[[35 27 49 15 37 13 10 57 12 12 10  7  7  8 10  5  2  0  2  1]
 [24 63 37 32 37 15 12 56 25 26 13  7 12 10  4  0  7  5  4  0]
 [26 33 63 40 41 19 19 49 22 22  9  4 12 10 11  2  3  3  2  4]
 [29 45 50 63 38  9  7 41 14 26 11  6 12 13 12  7  1  3  3  2]
 [23 33 47 26 58 22 14 44 22 20 16  9 11  7 10  5 10  7  0  1]
 [22 37 35 36 45 31 14 53 26 24 13 12  9  5 15  3  9  2  2  2]
 [31 35 47 20 43 13 42 55 11 21 12  7 14  6  9  5  7  7  2  3]
 [29 29 41 32 50 32  9 56 17 28 14 10 12  9 10  5  6  3  3  1]
 [25 32 53 32 34 22 11 49 35 31 10  7 22 10  9  2  4  5  1  4]
 [29 42 32 30 46 23  8 57 15 41 14 11 12 11 10  1  6  5  2  2]
 [16 37 50 21 47 13  9 49 22 32 44 13 13 13 10  2  1  5  1  1]
 [24 27 51 25 45 22 14 51 22 25 21 25 15 11  5  3  4  3  1  2]
 [26 39 52 33 53 26 11 51 16 17  9 10 22  9  8  3  2  1  4  1]
 [27 33 44 28 41 24 16 48 23 29 25  5 19 13  8  3  4  5  1  0]
 [27 31 37 37 44 27  6 45 12 25 12 14 18 14 27  6  1  6  3  2]
 [39 41 46 31 46 18 13 41 10 23 18  9 13 13 12 13  4  6

### Sensitivity: When the actual value is positive, how often is the prediction correct?

How "sensitive" is the classifier to detecting positive instances?
Also known as "True Positive Rate" or "Recall"

In [30]:
print("Sensitivity of KNN:", metrics.recall_score(ytt, y_pred_knn, average='weighted'))

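Sensitivity of KNN: 0.09399893786510886

Note: weighted recall should equal the classification accuracy, as it does in the two earlier runs, but here it is 0.0940 against an accuracy of 0.1136. The cell numbers (In [36] for the classifier, In [29] to In [33] for the metrics) suggest that the confusion matrix and metric cells were executed before the classifier cell was last re-run, so they likely describe an earlier fit.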

### Precision: When a positive value is predicted, how often is the prediction correct?

How "precise" is the classifier when predicting positive instances?
TP / (TP + FP)

In [31]:
print("Precision of KNN:", metrics.precision_score(ytt, y_pred_knn, average='weighted'))

Precision of KNN: 0.12395712469852152


### F-measure:

2 P R / (P + R)

In [33]:
print("F-measure of KNN:", metrics.f1_score(ytt, y_pred_knn, average='weighted'))

F-measure of KNN: 0.09189560967526793


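### Conclusion

### Comparison between hyperparameters
Accuracy improves slightly when moving from n_neighbors=5 with uniform weights to n_neighbors=1 or 2 with distance weights, but all three configurations stay close to 90% error, only modestly better than the 5% accuracy expected from random guessing across 20 roughly balanced classes.

| Run | n_neighbors | weights | Accuracy | Error |
|-----|-------------|---------|----------|-------|
| 1 | 5 | uniform (default) | 0.07939458311205523 | 0.9206054168879447 |
| 2 | 1 | distance | 0.1090015932023367 | 0.8909984067976633 |
| 3 | 2 | distance | 0.11364843335103558 | 0.8863515666489644 |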
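
A comparison like the one above can also be produced in a single loop, which keeps every metric tied to the model that generated it. The sketch below is an illustration only: it assumes synthetic stand-in data from `make_classification` instead of the newsgroup vectors, so its numbers are not comparable to the results reported here.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration; not the newsgroup corpus).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The same three hyperparameter settings compared in this notebook.
settings = [(5, "uniform"), (1, "distance"), (2, "distance")]

results = []
for k, w in settings:
    clf = KNeighborsClassifier(n_neighbors=k, weights=w).fit(X_tr, y_tr)
    acc = metrics.accuracy_score(y_te, clf.predict(X_te))
    results.append((k, w, acc, 1 - acc))

for k, w, acc, err in results:
    print(f"n_neighbors={k:<2} weights={w:<9} accuracy={acc:.4f} error={err:.4f}")
```

To apply the same loop to this notebook, `Xtr`, `ytr`, `Xtt` and `ytt` would take the place of the synthetic split.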