# Tutorial - Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

### Import common packages

In [101]:
import pandas as pd
import numpy as np

np.random_seed = 1

### Load data

In [102]:
news = pd.read_csv('news.csv')

news.shape


(597, 5)

In [103]:
news.head(5)

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup
0,I have a few reprints left of chapters from my...,1,0,0,graphics
1,"gnuplot, etc. make it easy to plot real valued...",1,0,0,graphics
2,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1,0,0,graphics
3,"Hello, I am looking to add voice input capabil...",1,0,0,graphics
4,I recently got a file describing a library of ...,1,0,0,graphics


### Check for missing values

In [104]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [105]:
X = news['TEXT']

This is a multi-class classification problem. There are three categories we will predict:<br>
Whether a post is "graphics," "hockey," or "medical" related

In [106]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [107]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

## Split the data

In [108]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [109]:
X_train.shape, y_train.shape

((417,), (417,))

In [110]:
X_test.shape, y_test.shape

((180,), (180,))

In [111]:
X_train.head(5)

214    In < 1993Apr5.182124.17415@ists.ists.ca> dchha...
469    In article < werner-240393161954@tol7mac15.soe...
17     Does anybody know if there are any good 2d-gra...
495    In article < 206@ky3b.UUCP> km@ky3b.pgh.pa.us ...
108    Robert G. Carpenter writes: I too would like a...
Name: TEXT, dtype: object

In [112]:
y_train[:5]

array([1, 2, 0, 2, 0])

## Sklearn: Text preparation

For simplicity (and focus), we will not do any text cleaning or preprocessing. We will just use the raw text as input to the model. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [113]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern="[^\W\d_]+")

X_train = tfidf_vect.fit_transform(X_train)

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**

In [114]:
# Perform the TfidfVectorizer transformation
# Be careful: We are using the train fit to transform the test data set. Otherwise, the test data 
# features will be very different and match the train set!!!

X_test = tfidf_vect.transform(X_test)


In [115]:
X_train.shape, X_test.shape

((417, 10400), (180, 10400))

In [116]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train

<417x10400 sparse matrix of type '<class 'numpy.float64'>'
	with 31753 stored elements in Compressed Sparse Row format>

In [117]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## Latent Semantic Analysis (Singular Value Decomposition)

Singular Value Decomposition (SVD) is a technique used to reduce the dimensions of data by decomposing the data matrix into three matrices. The three matrices consist of a left singular matrix, a diagonal matrix, and a right singular matrix. The n_components parameter in SVD represents the number of components (or dimensions) that we want to keep after dimensionality reduction. In this analysis, we will explore the impacts of changing the n_components parameter to 100, 300, and 500, and discuss the results.

## First LSA model with n_components=100

In [118]:
from sklearn.decomposition import TruncatedSVD

svd_1 = TruncatedSVD(n_components=100, n_iter=10)#n_components is the number of topics, which should be less than the number of features

X_train_1 = svd_1.fit_transform(X_train)
X_test_1 = svd_1.transform(X_test)

In [119]:
X_train_1.shape, X_test_1.shape

((417, 100), (180, 100))

## Random Forest

In [120]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf_1 = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf_1.fit(X_train_1, y_train)

### Evaluating Model Performance

In [121]:
from sklearn.metrics import accuracy_score

In [122]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train_1 = rnd_clf_1.predict(X_train_1)
acc = accuracy_score(y_train, y_pred_train_1)
print(f"Train acc_1: {accuracy_score(y_train, y_pred_train_1):.4f}")

Train acc_1: 0.9832


In [123]:
#Test accuracy
y_pred_test_1 = rnd_clf_1.predict(X_test_1)
acc = accuracy_score(y_test, y_pred_test_1)
print(f"Test acc_1: {accuracy_score(y_test, y_pred_test_1):.4f}")

Test acc_1: 0.8778


In [124]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test_1)

array([[55,  2,  5],
       [ 1, 49,  3],
       [11,  0, 54]], dtype=int64)

## Stochastic Gradient Descent Classifier

In [125]:
from sklearn.linear_model import SGDClassifier

sgd_clf_1 = SGDClassifier(max_iter=100)
_ = sgd_clf_1.fit(X_train_1, y_train)

### Evaluating Model Performance

In [126]:
#Train accuracy
y_pred_train_SGD1 = sgd_clf_1.predict(X_train_1)
print(f"Train acc_SGD1: {accuracy_score(y_train, y_pred_train_SGD1):.4f}")

Train acc_SGD1: 0.9856


In [127]:
#Test accuracy
y_pred_test_SGD1 = sgd_clf_1.predict(X_test_1)
print(f"Test acc_SGD1: {accuracy_score(y_test, y_pred_test_SGD1):.4f}")

Test acc_SGD1: 0.9056


In [128]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test_SGD1)

array([[55,  0,  7],
       [ 2, 46,  5],
       [ 3,  0, 62]], dtype=int64)

Analysis:
    
The first model can capture the most crucial aspects of the data when n components is set to 100 while still drastically lowering the dimensions. This value of n components results in a computation that is more quickly completed and requires less memory. For complicated datasets with plenty of characteristics, this might not be enough, and the model's performance might decrease. Here the derived values for random forest are Train accuracy with 98.32% and Test accuracy: 87.78%, values for SGD classifier are Train accuracy with 98.56% and Test accuracy with 90.56%.

## Second LSA model with n_components=300

In [129]:
svd_2 = TruncatedSVD(n_components=300, n_iter=10)#n_components is the number of topics, which should be less than the number of features

X_train_2 = svd_2.fit_transform(X_train)
X_test_2 = svd_2.transform(X_test)

In [130]:
X_train_2.shape, X_test_2.shape

((417, 300), (180, 300))

## Random Forest

In [131]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf_2 = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf_2.fit(X_train_2, y_train)

### Evaluating Model Performance

In [132]:
from sklearn.metrics import accuracy_score

In [133]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train_2 = rnd_clf_2.predict(X_train_2)
acc = accuracy_score(y_train, y_pred_train_2)
print(f"Train acc_2: {accuracy_score(y_train, y_pred_train_2):.4f}")

Train acc_2: 0.9904


In [134]:
#Test accuracy
y_pred_test_2 = rnd_clf_2.predict(X_test_2)
acc = accuracy_score(y_test, y_pred_test_2)
print(f"Test acc_2: {accuracy_score(y_test, y_pred_test_2):.4f}")

Test acc_2: 0.8611


In [135]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test_2)

array([[54,  2,  6],
       [ 1, 46,  6],
       [ 8,  2, 55]], dtype=int64)

## Stochastic Gradient Descent Classifier

In [136]:
from sklearn.linear_model import SGDClassifier

sgd_clf_2 = SGDClassifier(max_iter=100)
_ = sgd_clf_2.fit(X_train_2, y_train)

### Evaluating Model Performance

In [137]:
#Train accuracy
y_pred_train_SGD2 = sgd_clf_2.predict(X_train_2)
print(f"Train acc_SGD2: {accuracy_score(y_train, y_pred_train_SGD2):.4f}")

Train acc_SGD2: 0.9952


In [138]:
#Test accuracy
y_pred_test_SGD2 = sgd_clf_2.predict(X_test_2)
print(f"Test acc_SGD2: {accuracy_score(y_test, y_pred_test_SGD2):.4f}")

Test acc_SGD2: 0.8556


In [139]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test_SGD2)

array([[49,  0, 13],
       [ 0, 42, 11],
       [ 2,  0, 63]], dtype=int64)

Analysis:
    
In second model with n_components set to 300, the model is able to capture more features of the data, resulting in better performance. However, the computation time and memory requirements increase with this value, and it may not be suitable for larger datasets.For this model we can see that the train and test accuracies for random forest are increased to 99.04% and 86.11%, train accuracy for SGD classifier is improved to 99.52% and but test accuracy dropped to 85.56%

## Third LSA model with n_components=500

In [140]:
svd_3 = TruncatedSVD(n_components=500, n_iter=10)#n_components is the number of topics, which should be less than the number of features

X_train_3 = svd_3.fit_transform(X_train)
X_test_3 = svd_3.transform(X_test)

In [141]:
X_train_3.shape, X_test_3.shape

((417, 417), (180, 417))

## Random Forest

In [142]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf_3 = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf_3.fit(X_train_3, y_train)

### Evaluating Model Performance

In [143]:
from sklearn.metrics import accuracy_score

In [144]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train_3 = rnd_clf_3.predict(X_train_3)
acc = accuracy_score(y_train, y_pred_train_3)
print(f"Train acc_3: {accuracy_score(y_train, y_pred_train_3):.4f}")

Train acc_3: 0.9904


In [145]:
#Test accuracy
y_pred_test_3 = rnd_clf_3.predict(X_test_3)
acc = accuracy_score(y_test, y_pred_test_3)
print(f"Test acc_3: {accuracy_score(y_test, y_pred_test_3):.4f}")

Test acc_3: 0.8444


In [146]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test_3)

array([[49,  4,  9],
       [ 1, 48,  4],
       [ 6,  4, 55]], dtype=int64)

## Stochastic Gradient Descent Classifier

In [147]:
from sklearn.linear_model import SGDClassifier

sgd_clf_3 = SGDClassifier(max_iter=100)
_ = sgd_clf_3.fit(X_train_3, y_train)

### Evaluating Model Performance

In [148]:
#Train accuracy
y_pred_train_SGD3 = sgd_clf_3.predict(X_train_3)
print(f"Train acc_SGD3: {accuracy_score(y_train, y_pred_train_SGD3):.4f}")

Train acc_SGD3: 0.9976


In [149]:
#Test accuracy
y_pred_test_SGD3 = sgd_clf_3.predict(X_test_3)
print(f"Test acc_SGD3: {accuracy_score(y_test, y_pred_test_SGD3):.4f}")

Test acc_SGD3: 0.9222


In [150]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test_SGD3)

array([[60,  1,  1],
       [ 4, 49,  0],
       [ 8,  0, 57]], dtype=int64)

Analysis:
    
The third model captures even more aspects of the data when n components is set to 500, providing improved performance over the earlier values. However, some datasets might not be able to handle this value due to its high computational cost and memory requirements.The train and test accuracies for this model's random forest are almost same as previous model which are 99.04% and 84.44%, but the train and test accuracies for the SGD classifier are increased to 99.76% and 92.22%

Firstly, SVD may not be suitable for all types of datasets. For example, if the dataset is already low-dimensional or if there are only a few important features, then SVD may not provide any significant benefits.
and this technique can be computationally expensive, particularly if the number of features is large. This may lead to longer computation times and higher memory requirements, making it impractical for some datasets.

In conclusion, SVD is a powerful technique that can help us to extract important features from complex datasets. However, the value of the n_components parameter can significantly impact the performance of the model, and we need to carefully choose the appropriate value based on our dataset and computational resources.

Overall, when we look into the accuracies derived from all of the three components the values for n_components 500 showcased best performance in comparison with the other two n_component values and SGD Classifier model performed more accurately than the Random Forest model.But, the gap between train and test accuracy showed that the Random Forest model was more susceptible to overfitting.