# WE06

We will predict the category of discussion posts in a newsgroup.

### Import common packages

In [2]:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Seed NumPy's global RNG, which scikit-learn uses when random_state is not set
np.random.seed(1)

### Load data

In [3]:
news = pd.read_csv('news.csv')

news.shape


(597, 5)

In [4]:
news.head(5)

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup
0,I have a few reprints left of chapters from my...,1,0,0,graphics
1,"gnuplot, etc. make it easy to plot real valued...",1,0,0,graphics
2,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1,0,0,graphics
3,"Hello, I am looking to add voice input capabil...",1,0,0,graphics
4,I recently got a file describing a library of ...,1,0,0,graphics


### Check for missing values

In [5]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [6]:
X = news['TEXT']

This is a multi-class classification problem. There are three categories we will predict:<br>
Whether a post is "graphics," "hockey," or "medical" related

In [8]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [9]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
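Once the labels are integers, it is often handy to map predictions back to category names. The following is a minimal sketch on a made-up label list (`labels`, `encoder`, `codes`, and `decoded` are illustrative names, not part of the notebook); it assumes only that `LabelEncoder` sorts classes alphabetically, which matches the `le.classes_` output above.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the `newsgroup` column
labels = ["graphics", "hockey", "medical", "hockey", "graphics"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)   # classes are sorted alphabetically

# Map integer codes (or model predictions) back to readable names
decoded = encoder.inverse_transform(codes)

print(codes)
print(list(decoded))
```

The same `inverse_transform` call could be applied to the output of a classifier's `predict` to report category names instead of 0, 1, 2.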

## Split the data

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [11]:
X_train.shape, y_train.shape

((417,), (417,))

In [12]:
X_test.shape, y_test.shape

((180,), (180,))
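The split above does not pass `random_state` or `stratify`, so the exact rows in each set can change between runs. As a hedged sketch of one alternative, the toy example below fixes the seed and stratifies on the labels; the arrays `X_toy` and `y_toy` are synthetic stand-ins, and the seed value is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 30 "documents" with three balanced classes
X_toy = np.array([f"post {i}" for i in range(30)])
y_toy = np.repeat([0, 1, 2], 10)

# random_state makes the split repeatable; stratify keeps class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print(np.bincount(y_tr), np.bincount(y_te))
```

With stratification, each class keeps the same share in the train and test sets, which makes accuracy figures from small test sets a little easier to compare.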

In [13]:
X_train.head(5)

84     scrowe@hemel.bull.co.uk (Simon Crowe) writes: ...
517    In article < 19604@pitt.UUCP> geb@cs.pitt.edu ...
534    sasghm@theseus.unx.sas.com (Gary Merrill) writ...
164    In article < 1993Apr15.164940.11632@mercury.un...
564    My mom has just been diagnosed with cystic bre...
Name: TEXT, dtype: object

In [14]:
y_train[:5]

array([0, 2, 2, 0, 2])

## Sklearn: Text preparation

For simplicity (and focus), we will not do any text cleaning or preprocessing. We will just use the raw text as input to the model. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [15]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

# Raw string avoids invalid-escape warnings; the pattern keeps letter-only tokens
tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+", max_features=10000)

X_train = tfidf_vect.fit_transform(X_train)

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**

In [16]:
# Perform the TfidfVectorizer transformation
# Be careful: we use the vectorizer fitted on the train set to transform the test set.
# Otherwise, the test features would be very different and would not match the train set.

X_test = tfidf_vect.transform(X_test)


In [17]:
X_train.shape, X_test.shape

((417, 9544), (180, 9544))
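The effect of `fit_transform` on train versus `transform` on test can be seen on a tiny made-up corpus. This is only an illustrative sketch: the sentences and the names `vect`, `train_matrix`, `test_matrix`, and `refit_matrix` are invented here, and the vectorizer uses default settings rather than the ones in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the goalie made a save", "the doctor prescribed rest"]
test_docs = ["the goalie saw a doctor"]

vect = TfidfVectorizer()
train_matrix = vect.fit_transform(train_docs)   # learns vocabulary and IDF weights
test_matrix = vect.transform(test_docs)         # reuses them; unseen words are ignored

print(train_matrix.shape, test_matrix.shape)

# Refitting on the test set learns a different vocabulary,
# so the columns no longer line up with the train matrix
refit_matrix = TfidfVectorizer().fit_transform(test_docs)
print(refit_matrix.shape)
```

The first two matrices share a column count because they share a vocabulary; the refit matrix does not, which is why a model trained on the train features could not score it.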

In [18]:
# X_train is a SciPy sparse matrix; displaying it shows only a summary, not the values
X_train

<417x9544 sparse matrix of type '<class 'numpy.float64'>'
	with 29002 stored elements in Compressed Sparse Row format>

In [19]:
# Convert the sparse matrix to a dense array with toarray() to see the values
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
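The dense array is almost entirely zeros, which is why the sparse format is used. A quick back-of-the-envelope check, using only the figures from the sparse-matrix summary above (417 rows, 9544 columns, 29002 stored elements), could look like this:

```python
# Figures copied from the sparse-matrix summary shown earlier
n_rows, n_cols = 417, 9544
stored = 29002

total_cells = n_rows * n_cols
density = stored / total_cells          # fraction of non-zero cells
avg_terms_per_post = stored / n_rows    # distinct vocabulary terms per post

print(f"{total_cells:,} cells, {density:.2%} non-zero")
print(f"about {avg_terms_per_post:.1f} distinct terms per post")
```

Fewer than 1% of the cells hold a value, so storing only the non-zero entries saves a great deal of memory.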

## Latent Semantic Analysis (Singular Value Decomposition)

In [29]:
from sklearn.decomposition import TruncatedSVD

# n_components is the number of topics to keep; it should be less than the number of features
    
# n_components = 100
svd_100 = TruncatedSVD(n_components=100, n_iter=10)
X_train_svd_100 = svd_100.fit_transform(X_train)
X_test_svd_100 = svd_100.transform(X_test)

# n_components = 300
svd_300 = TruncatedSVD(n_components=300, n_iter=10)
X_train_svd_300 = svd_300.fit_transform(X_train)
X_test_svd_300 = svd_300.transform(X_test)

# n_components = 500
svd_500 = TruncatedSVD(n_components=500, n_iter=10)
X_train_svd_500 = svd_500.fit_transform(X_train)
X_test_svd_500 = svd_500.transform(X_test)



In [30]:
X_train.shape, X_test.shape

((417, 9544), (180, 9544))
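One way to judge how many components to keep is to look at how much variance they retain. The sketch below uses a synthetic random sparse matrix, not the newsgroup data, so the numbers it prints are only illustrative; `toy`, `reduced`, and `cumulative` are names introduced here.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse matrix standing in for the TF-IDF features
toy = sparse_random(60, 200, density=0.05, format="csr", random_state=1)

svd = TruncatedSVD(n_components=20, n_iter=10, random_state=1)
reduced = svd.fit_transform(toy)

# Running total of the variance explained by the first k components
cumulative = np.cumsum(svd.explained_variance_ratio_)

print(reduced.shape)
print(f"variance retained by 20 components: {cumulative[-1]:.3f}")
```

Applied to the fitted `svd_100`, `svd_300`, and `svd_500` objects, the same attribute would show how much of the TF-IDF variance each setting keeps.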

## Random Forest and Model Performance Evaluation

In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix  # needed by the loop below

# Define n_components to try
n_components_list = [100, 300, 500]

for n in n_components_list:
    print(f"n_components = {n}")
    
    # Apply TruncatedSVD to reduce dimensionality
    svd = TruncatedSVD(n_components=n, n_iter=10)
    X_train_svd = svd.fit_transform(X_train)
    X_test_svd = svd.transform(X_test)
    
    # Train a Random Forest Classifier on the reduced data
    rf_clf = RandomForestClassifier()
    _ = rf_clf.fit(X_train_svd, y_train)
    
    # Evaluate the model on the train set
    y_pred_train = rf_clf.predict(X_train_svd)
    train_acc = accuracy_score(y_train, y_pred_train)
    print(f"Train acc: {train_acc:.4f}")
    
    # Evaluate the model on the test set
    y_pred_test = rf_clf.predict(X_test_svd)
    test_acc = accuracy_score(y_test, y_pred_test)
    print(f"Test acc: {test_acc:.4f}")
    
    # Print the confusion matrix
    print(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred_test)}\n")


n_components = 100
Train acc: 0.9952
Test acc: 0.8000
Confusion matrix:
[[50  0 21]
 [ 1 54  9]
 [ 4  1 40]]

n_components = 300
Train acc: 0.9952
Test acc: 0.7722
Confusion matrix:
[[46  0 25]
 [ 0 51 13]
 [ 2  1 42]]

n_components = 500
Train acc: 0.9952
Test acc: 0.7944
Confusion matrix:
[[50  1 20]
 [ 0 53 11]
 [ 4  1 40]]
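Accuracy alone hides where the errors fall. As a small sketch, the confusion matrix printed above for n_components = 100 can be turned into per-class recall and precision; the class order (graphics, hockey, medical) follows the label encoding shown earlier, and the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above for n_components = 100
# (rows = true class, columns = predicted class)
cm = np.array([[50, 0, 21],
               [1, 54, 9],
               [4, 1, 40]])

accuracy = np.trace(cm) / cm.sum()
recall = np.diag(cm) / cm.sum(axis=1)      # share of each true class recovered
precision = np.diag(cm) / cm.sum(axis=0)   # share of each predicted class that is correct

for name, r, p in zip(["graphics", "hockey", "medical"], recall, precision):
    print(f"{name:9s} recall={r:.3f} precision={p:.3f}")
print(f"accuracy={accuracy:.4f}")
```

The third column of the matrix shows the pattern: 21 graphics posts and 9 hockey posts were predicted as medical, which pulls down the precision of the medical class.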



## Stochastic Gradient Descent Classifier and Model Performance Evaluation

In [34]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.decomposition import TruncatedSVD

for n in [100, 300, 500]:
    print(f"n_components = {n}")
    svd = TruncatedSVD(n_components=n, n_iter=10)
    X_train_svd = svd.fit_transform(X_train)
    X_test_svd = svd.transform(X_test)

    sgd_clf = SGDClassifier(max_iter=100)
    _ = sgd_clf.fit(X_train_svd, y_train)

    # Train accuracy
    y_pred_train = sgd_clf.predict(X_train_svd)
    print(f"Train acc (n_components={n}): {accuracy_score(y_train, y_pred_train):.4f}")

    # Test accuracy
    y_pred_test = sgd_clf.predict(X_test_svd)
    print(f"Test acc (n_components={n}): {accuracy_score(y_test, y_pred_test):.4f}")

    # Confusion Matrix
    print(f"Confusion Matrix (n_components={n}): \n{confusion_matrix(y_test, y_pred_test)}\n")

    
    

n_components = 100
Train acc (n_components=100): 0.9904
Test acc (n_components=100): 0.9222
Confusion Matrix (n_components=100): 
[[64  0  7]
 [ 1 59  4]
 [ 2  0 43]]

n_components = 300
Train acc (n_components=300): 0.9952
Test acc (n_components=300): 0.9722
Confusion Matrix (n_components=300): 
[[70  0  1]
 [ 1 62  1]
 [ 2  0 43]]

n_components = 500
Train acc (n_components=500): 0.9952
Test acc (n_components=500): 0.9389
Confusion Matrix (n_components=500): 
[[65  1  5]
 [ 2 61  1]
 [ 2  0 43]]
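The vectorizer, SVD, and classifier steps can also be chained with `make_pipeline`, which the notebook imports but does not use. The following is a sketch on a tiny invented two-class corpus, so it says nothing about accuracy on the newsgroup data; the documents, labels, and the `random_state` values are assumptions made for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; 0 = hockey, 1 = medical
docs = [
    "the goalie stopped the puck", "a hat trick in the third period",
    "the team won in overtime", "playoff hockey starts tonight",
    "the doctor ordered a blood test", "symptoms include fever and cough",
    "the patient needs surgery", "a new vaccine trial began",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Each step is fitted on the training text only, then reused at predict time
pipe = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=3, random_state=1),
    SGDClassifier(max_iter=1000, random_state=1),
)
pipe.fit(docs, labels)

pred = pipe.predict(["the goalie lost the puck in overtime"])
print(pred)
```

A pipeline applies `fit_transform` during `fit` and `transform` during `predict` automatically, which removes the risk of refitting the vectorizer or the SVD on test data.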



In [52]:
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ["Model", "n_components", "Train Accuracy", "Test Accuracy"]
table.add_row(["Random Forest", "LSA 100", "0.9952" , "0.8000"])
table.add_row(["Random Forest", "LSA 300","0.9952","0.7722"])
table.add_row(["Random Forest", "LSA 500","0.9952", "0.7944"])
table.add_row(["SGD", "LSA 100",  "0.9904", "0.9222"])
table.add_row(["SGD", "LSA 300", "0.9952","0.9722"])
table.add_row(["SGD", "LSA 500", "0.9952", "0.9389"])

print(table)


+---------------+--------------+----------------+---------------+
|     Model     | n_components | Train Accuracy | Test Accuracy |
+---------------+--------------+----------------+---------------+
| Random Forest |   LSA 100    |     0.9952     |     0.8000    |
| Random Forest |   LSA 300    |     0.9952     |     0.7722    |
| Random Forest |   LSA 500    |     0.9952     |     0.7944    |
|      SGD      |   LSA 100    |     0.9904     |     0.9222    |
|      SGD      |   LSA 300    |     0.9952     |     0.9722    |
|      SGD      |   LSA 500    |     0.9952     |     0.9389    |
+---------------+--------------+----------------+---------------+
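For comparing the runs, the same figures can be held in a pandas DataFrame, which makes it easy to compute the train/test gap and to reshape the table. This is a sketch that simply copies the accuracies reported above; the column names and the `gap` measure are choices made here, not part of the original analysis.

```python
import pandas as pd

# Accuracies copied from the table above
results = pd.DataFrame({
    "model": ["Random Forest"] * 3 + ["SGD"] * 3,
    "n_components": [100, 300, 500] * 2,
    "train_acc": [0.9952, 0.9952, 0.9952, 0.9904, 0.9952, 0.9952],
    "test_acc": [0.8000, 0.7722, 0.7944, 0.9222, 0.9722, 0.9389],
})

# Gap between train and test accuracy as a rough overfitting indicator
results["gap"] = results["train_acc"] - results["test_acc"]

# One row per n_components, one column per model
wide = results.pivot(index="n_components", columns="model", values="test_acc")
print(wide)
print(results.sort_values("gap", ascending=False).head(3))
```

The three largest gaps all belong to the Random Forest runs, where train accuracy is near 1.0 and test accuracy is around 0.8.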


The table shows that the test accuracy of both models (Random Forest and SGD) varies with the value of n_components used for LSA, while the train accuracy stays at or near 0.995.

For the Random Forest model, increasing n_components from 100 to 300 slightly decreased the test accuracy (0.8000 to 0.7722), and increasing it further to 500 raised it again (0.7944). This suggests that the best value of n_components is not fixed and can vary with the dataset and the task at hand. With only 180 test posts and no fixed random seed, differences of this size may also reflect run-to-run variation. One further point is worth noting: the training set has only 417 observations, so its matrix has rank at most 417. Passing n_components=500 therefore cannot yield more than 417 informative components, and the model is effectively working with 417.
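The limit imposed by the 417 observations can be checked on a small scale with NumPy. The sketch below uses a random toy matrix with fewer rows than columns; the sizes and names are illustrative, and it uses a plain SVD rather than scikit-learn's `TruncatedSVD`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy matrix with fewer rows (samples) than columns (features),
# mirroring the 417 x 9544 training matrix on a small scale
n_samples, n_features = 5, 20
A = rng.random((n_samples, n_features))

# The number of singular values is min(n_samples, n_features)
singular_values = np.linalg.svd(A, compute_uv=False)

print(len(singular_values))
print(np.linalg.matrix_rank(A))
```

However many components are requested, a matrix with 5 rows has only 5 singular values, so any components beyond that number cannot add information.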

For the SGD model, on the other hand, test accuracy rose from 0.9222 at n_components = 100 to 0.9722 at 300, then fell to 0.9389 at 500, so 300 performed best in this run. SGDClassifier fits a linear model, and it may benefit from having more LSA dimensions available than 100, although the drop at 500 shows that adding components does not always help.

In general, SVD (the decomposition behind LSA) can be useful for reducing the dimensionality of our data and removing noise, which can improve model performance by lowering the risk of overfitting. However, it is not always necessary or beneficial. If the dataset is already small and well structured, or if the task requires keeping all the original features, SVD may add little. It can also be computationally expensive, which may make it impractical for very large datasets.