# Lab 9: Document Analysis

In this assignment, we will learn how to do document classification and clustering.



## 1. Example

In this example, we use the [20newsgroups](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset) dataset. Each sample is a document, and there are 20 classes in total.

### 1.1 Load data

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

print("Train data target labels: {}".format(data_train.target))
print("Train data target names: {}".format(data_train.target_names))

print('#training samples: {}'.format(len(data_train.data)))
print('#testing samples: {}'.format(len(data_test.data)))


Train data target labels: [7 4 4 ... 3 1 8]
Train data target names: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
#training samples: 11314
#testing samples: 7532


### 1.2 Represent documents with TF-IDF representation

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

#TF-IDF representation for each document
vectorizer = TfidfVectorizer()
data_train_vectors = vectorizer.fit_transform(data_train.data)
data_test_vectors = vectorizer.transform(data_test.data) 

print(data_train_vectors.shape, data_test_vectors.shape)


(11314, 101631) (7532, 101631)


### 1.3 Use KNN to do document classification

Here, we use the cross-validation method to select $K$.

In [4]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score


Xtr = data_train_vectors
Ytr = data_train.target

Xte = data_test_vectors
Yte = data_test.target

k_range = range(1, 5)
param_grid = dict(n_neighbors=k_range)

clf_knn =  KNeighborsClassifier(n_neighbors=1)

grid = GridSearchCV(clf_knn, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_score_)
print(grid.best_params_)

0.16855203045338205
{'n_neighbors': 1}
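
The grid search above amounts to scoring every candidate $K$ with 5-fold cross-validation and keeping the one with the highest mean accuracy. The sketch below spells out that loop with `cross_val_score`. It runs on synthetic data from `make_classification`, which is only an assumed stand-in for `Xtr` and `Ytr` so that the example runs quickly; the names `X_toy`, `y_toy`, and `cv_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (Xtr, Ytr); an assumption for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)

cv_scores = {}
for k in range(1, 5):
    clf = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 stratified folds for this value of K.
    cv_scores[k] = cross_val_score(clf, X_toy, y_toy, cv=5, scoring="accuracy").mean()

# Keep the K with the highest mean cross-validated accuracy.
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best K:", best_k)
```

With an integer `cv` and a classifier, both `cross_val_score` and `GridSearchCV` use stratified folds, so the best mean score from this loop should match `grid.best_score_` from an equivalent grid search on the same data.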


### 1.4 Use Logistic Regression to do document classification
Here, we also use the cross-validation method to select the regularization coefficient. 

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

#=====training with cross validation======
coeff = range(1, 10)
param_grid = dict(C=coeff)

clf_lr = LogisticRegression(penalty='l2')

grid = GridSearchCV(clf_lr, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_params_)

#=====testing======
clf_lr = LogisticRegression(penalty='l2', C=grid.best_params_['C'])
clf_lr.fit(Xtr, Ytr)

y_pred = clf_lr.predict(Xte)

acc = accuracy_score(Yte, y_pred)
macro_f1 = f1_score(Yte, y_pred, average='macro')
micro_f1 = f1_score(Yte, y_pred, average='micro')

print(acc, macro_f1, micro_f1)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

KeyboardInterrupt
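
The warning above says the solver stopped at its iteration limit, and the message itself suggests raising `max_iter`. `LogisticRegression` defaults to `max_iter=100`. The sketch below shows one way to pass a higher limit through the grid search and to check afterwards that the solver converged. It uses a tiny made-up corpus instead of 20newsgroups so that it runs in a moment; `texts`, `labels`, `X_small`, the value `max_iter=1000`, and the `C` grid are all assumptions for illustration, not tuned settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tiny hypothetical corpus standing in for data_train.data / data_train.target.
texts = [
    "the team won the match", "a late goal decided the game",
    "the striker scored twice", "the coach praised the players",
    "shares fell on the stock market", "the bank raised interest rates",
    "profits grew at the company", "investors sold their shares",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_small = TfidfVectorizer().fit_transform(texts)

# max_iter=1000 is an assumed value; the library default is 100.
lr = LogisticRegression(max_iter=1000)
search = GridSearchCV(lr, {"C": [0.1, 1, 10]}, cv=2, scoring="accuracy")
search.fit(X_small, labels)

print(search.best_params_)
# n_iter_ below max_iter means the solver stopped because it converged.
print(search.best_estimator_.n_iter_)
```

On the full 20newsgroups matrix the required number of iterations may be much larger than on this toy example, so the right limit there has to be found by experiment.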



## 2. Task: Document Classification and Clustering

In this task, we use the [BBCNews](BBC_News_Train.csv) dataset. It contains 1490 articles from 5 topics: tech, business, sport, entertainment, and politics.

* Task 1: Please use KNN and logistic regression to do classification, and compare their performance.

* Task 2: Please use K-means to partition this dataset into 5 clusters and find the representative words in each cluster. 

### 2.1 Load data and represent it with TF-IDF representation

In [71]:
import pandas as pd
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer


train = pd.read_csv('./BBC_News_Train.csv')

# change Category to label encoding
LE = LabelEncoder()
LE.fit(train['Category'].values) 
train['Category'] = LE.transform(train['Category'].values) # 'y'

# find TF_IDF representation of docs
TFID_V = TfidfVectorizer(strip_accents='ascii',
                         lowercase=True,
                         stop_words='english')
TF_IDF_rep = TFID_V.fit_transform(train['Text'].values) # 'X'

# split data
X_train, X_test, y_train, y_test = train_test_split(TF_IDF_rep, 
                                                    train['Category'].values, 
                                                    test_size=0.20, 
                                                    random_state=12)

train.head()

| | ArticleId | Text | Category |
|---|---|---|---|
| 0 | 1833 | worldcom ex-boss launches defence lawyers defe... | 0 |
| 1 | 154 | german business confidence slides german busin... | 0 |
| 2 | 1101 | bbc poll indicates economic gloom citizens in ... | 0 |
| 3 | 1976 | lifestyle governs mobile choice faster bett... | 4 |
| 4 | 917 | enron bosses in $168m payout eighteen former e... | 0 |
| ... | ... | ... | ... |
| 1485 | 857 | double eviction from big brother model caprice... | 1 |
| 1486 | 325 | dj double act revamp chart show dj duo jk and ... | 1 |
| 1487 | 1590 | weak dollar hits reuters revenues at media gro... | 0 |
| 1488 | 1587 | apple ipod family expands market apple has exp... | 4 |
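
The integers in the `Category` column come from `LabelEncoder`, which assigns codes in sorted order of the class names. The sketch below shows how to read the mapping and how to turn codes back into topic names. It assumes the category strings in the CSV are exactly the five topic names listed in the task description; `topics` and `encoder` are hypothetical names used only here.

```python
from sklearn.preprocessing import LabelEncoder

# The five topic names listed in the task description (assumed spelling).
topics = ["tech", "business", "sport", "entertainment", "politics"]

encoder = LabelEncoder().fit(topics)

# classes_ is sorted, and each class's position is its integer code.
for code, name in enumerate(encoder.classes_):
    print(code, name)

# Map integer codes (for example, predictions) back to topic names.
print(encoder.inverse_transform([0, 4, 1]))
```

In the notebook itself, the same lookup would be `LE.classes_` and `LE.inverse_transform(y_pred)`.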


### 2.2 Use KNN to do document classification

In [72]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score

param_grid = {
    # started with [3, 10, 20, 50]
    'n_neighbors': [8, 9, 10, 11, 12] # for the K in KNN
}

clf_knn =  KNeighborsClassifier(n_neighbors=1)
grid = GridSearchCV(clf_knn, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

results = grid.best_params_

# train model
clf_knn =  KNeighborsClassifier(n_neighbors=results['n_neighbors'])
clf_knn.fit(X_train, y_train)
y_pred = clf_knn.predict(X_test)

acc = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average='macro')
micro_f1 = f1_score(y_test, y_pred, average='micro')

print(f'Num Neighbors: {results["n_neighbors"]}\nacc:{acc}')
print(f'macro_f1: {macro_f1}')
print(f'micro_f1: {micro_f1}')

Num Neighbors: 11
acc:0.9395973154362416
macro_f1: 0.9371516520659446
micro_f1: 0.9395973154362416
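
In the output above, `micro_f1` equals `acc`. That is expected when every sample has exactly one label: micro-averaging pools true positives, false positives, and false negatives over all classes, which reduces to accuracy. Macro-averaging instead computes one F1 score per class and takes their unweighted mean, so small classes count as much as large ones. The sketch below illustrates the difference on hand-made labels that are not taken from the BBC data.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hand-made labels for illustration (not taken from the BBC data).
y_true_toy = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 0, 1, 1, 1, 2, 0]

acc_toy = accuracy_score(y_true_toy, y_pred_toy)
micro_toy = f1_score(y_true_toy, y_pred_toy, average="micro")  # pooled over all classes
macro_toy = f1_score(y_true_toy, y_pred_toy, average="macro")  # mean of per-class F1

print(acc_toy, micro_toy, macro_toy)
```

Here 6 of 8 predictions are correct, so accuracy and micro F1 are both 0.75, while the per-class F1 scores (0.75, 0.8, and 2/3) average to a slightly lower macro F1.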


### 2.3 Use Logistic Regression to do document classification

In [73]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Find Optimal Value for Hyperparam C -------------------
param_grid = {'C': range(1, 10)}

clf_lr = LogisticRegression(penalty='l2')

grid = GridSearchCV(clf_lr, param_grid, cv=5, scoring='accuracy')
# search on the training split only, so the test split stays unseen during tuning
grid.fit(X_train, y_train)

_C = grid.best_params_['C']

# Print Stats after training ----------------------------
clf_lr = LogisticRegression(penalty='l2', C=_C)
clf_lr.fit(X_train, y_train)

y_pred = clf_lr.predict(X_test)

acc = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average='macro')
micro_f1 = f1_score(y_test, y_pred, average='micro')

print(f'C: {_C}\nacc:{acc}')
print(f'macro_f1: {macro_f1}')
print(f'micro_f1: {micro_f1}')

C: 9
acc:0.9731543624161074
macro_f1: 0.9723516288584557
micro_f1: 0.9731543624161074
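
One design point worth noting: in section 2.1 the TF-IDF vocabulary and IDF weights are fitted on all articles before the split, so the test articles influence the features. A possible alternative, sketched below under stated assumptions, is to split the raw text first and wrap the vectorizer and classifier in a `Pipeline`, so that the vectorizer is refit on the training part of every cross-validation fold. The mini-corpus, the step names `tfidf` and `clf`, the `C` grid, and `max_iter=1000` are all hypothetical choices for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical mini-corpus standing in for train['Text'] and train['Category'].
texts = [
    "the team won the football match", "a late goal decided the cup final",
    "the striker scored twice in the game", "the coach praised his players",
    "fans cheered as the club won the league", "the goalkeeper saved a penalty",
    "shares fell on the stock market", "the bank raised interest rates",
    "profits grew at the company", "investors sold their shares",
    "the firm reported strong sales growth", "oil prices pushed inflation higher",
]
labels = ["sport"] * 6 + ["business"] * 6

# Split the raw text first, before any feature extraction.
text_train, text_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=12, stratify=labels)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The vectorizer is refit inside every CV fold, so held-out text never shapes the vocabulary.
search = GridSearchCV(pipe, {"clf__C": [1, 3, 9]}, cv=2, scoring="accuracy")
search.fit(text_train, lab_train)

print(search.best_params_)
print(search.score(text_test, lab_test))  # accuracy on the untouched test split
```

The same pattern works for KNN by swapping the `clf` step for `KNeighborsClassifier` and searching over `clf__n_neighbors`, which keeps the comparison between the two classifiers on equal footing.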


### 2.4 Use K-means to do document clustering and find the 10 most representative words in each cluster

In [74]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=0, n_init="auto").fit(TF_IDF_rep)

In [75]:
import numpy as np
centroids = kmeans.cluster_centers_  # shape: (5 clusters, vocabulary size)

word_bag = TFID_V.get_feature_names_out()

for ind, center in enumerate(centroids):
    # indices of the 10 largest centroid weights; argpartition returns them unordered
    top_10_words_ind = np.argpartition(center, -10)[-10:]
    top_10_words = word_bag[top_10_words_ind]
    
    print(f'Cluster:{ind}\n{top_10_words}\n\n')

Cluster:0
['search' 'phones' 'technology' 'digital' 'broadband' 'said' 'phone'
 'mobile' 'music' 'people']


Cluster:1
['minister' 'howard' 'brown' 'labour' 'party' 'government' 'mr' 'said'
 'election' 'blair']


Cluster:2
['economy' 'company' 'market' 'mr' 'growth' 'firm' 'said' 'government'
 'new' 'year']


Cluster:3
['season' 'team' 'match' 'chelsea' 'cup' 'players' 'said' 'game' 'win'
 'england']


Cluster:4
['star' 'festival' 'band' 'award' 'album' 'awards' 'actor' 'films' 'film'
 'best']


