# Tutorial - Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

## 1.0 Import common packages and load data

In [1]:
import pandas as pd
import numpy as np

np.random.seed(1)

In [2]:
news = pd.read_csv('./data/news.csv')

news.shape


(597, 5)

In [3]:
news.head(5)

| | TEXT | graphics | hockey | medical | newsgroup |
|---|---|---|---|---|---|
| 0 | I have a few reprints left of chapters from my... | 1 | 0 | 0 | graphics |
| 1 | gnuplot, etc. make it easy to plot real valued... | 1 | 0 | 0 | graphics |
| 2 | Article-I.D.: snoopy.1pqlhnINN8k1 References: ... | 1 | 0 | 0 | graphics |
| 3 | Hello, I am looking to add voice input capabil... | 1 | 0 | 0 | graphics |
| 4 | I recently got a file describing a library of ... | 1 | 0 | 0 | graphics |


## 2.0 Preprocessing

### 2.1 Check for missing values

In [4]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

### 2.2 Assign the input variable to X and the target variable to y

In [5]:
X = news['TEXT']

In [6]:
type(X)

pandas.core.series.Series

This is a multi-class classification problem. We will predict which of three categories a post belongs to: "graphics," "hockey," or "medical."

In [7]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [8]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
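The integer codes can be mapped back to category names when reporting predictions. A minimal sketch, assuming a small hypothetical label list (`demo_labels`) rather than the tutorial's data; `LabelEncoder` assigns codes in sorted order of the class names:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, used only for illustration
demo_labels = ["hockey", "graphics", "medical", "graphics"]

demo_encoder = LabelEncoder().fit(demo_labels)
demo_codes = demo_encoder.transform(demo_labels)          # names -> integers
demo_decoded = demo_encoder.inverse_transform(demo_codes)  # integers -> names

print(list(demo_encoder.classes_))
print(list(demo_codes))
print(list(demo_decoded))
```

In the tutorial itself, the same idea would apply to the fitted `le` object, for example `le.inverse_transform(...)` on an array of predicted codes.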

### 2.3 Lemmatization

In [9]:
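import nltk
# Run these downloads once (resource names may differ in newer NLTK releases)
# nltk.download('averaged_perceptron_tagger')  # used by pos_tag
# nltk.download('punkt')  # used by word_tokenize
# nltk.download('wordnet')  # used by WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize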


In [10]:
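transformed_X = []
wnl = WordNetLemmatizer()
for document in X:
    transformed_document = ""
    for word, tag in pos_tag(word_tokenize(document)):
        # Reduce the Penn Treebank tag to its first letter to get a WordNet POS code.
        # Note: adjective tags start with 'J', so 'a' never matches here and
        # adjectives are left unchanged.
        wntag = tag[0].lower()
        wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
        if not wntag:
            lemma = word  # no usable POS code: keep the token as-is
        else:
            lemma = wnl.lemmatize(word, wntag)
        transformed_document += lemma + " "
    transformed_X += [transformed_document]

# Convert back to a Series so it can be split like the original X
transformed_X = pd.Series(transformed_X)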

In [11]:
transformed_X.head()

0    I have a few reprint leave of chapter from my ...
1    gnuplot , etc . make it easy to plot real valu...
2    Article-I.D . : snoopy.1pqlhnINN8k1 References...
3    Hello , I be look to add voice input capabilit...
4    I recently get a file describe a library of re...
dtype: object

In [12]:
X.head()

0    I have a few reprints left of chapters from my...
1    gnuplot, etc. make it easy to plot real valued...
2    Article-I.D.: snoopy.1pqlhnINN8k1 References: ...
3    Hello, I am looking to add voice input capabil...
4    I recently got a file describing a library of ...
Name: TEXT, dtype: object
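The lemmatizer expects a WordNet part-of-speech code (`a`, `r`, `n`, or `v`), while `pos_tag` returns Penn Treebank tags. Taking the first letter works for nouns, verbs, and adverbs, but Penn Treebank adjective tags begin with `J`. The helper below (`penn_to_wordnet`, a hypothetical name that is not part of this tutorial's code) is one possible way to make the mapping explicit. It is only a sketch: swapping it into the loop would change the lemmatized text and therefore the results that follow.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code, or None if unmapped."""
    first_letter_to_wordnet = {"J": "a", "R": "r", "N": "n", "V": "v"}
    return first_letter_to_wordnet.get(tag[:1].upper())


# A few common Penn Treebank tags: adjective, adverb, plural noun, past verb, determiner
for tag in ["JJ", "RB", "NNS", "VBD", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```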

### 2.4 Split the data

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.3)

In [14]:
X_train.shape, y_train.shape

((417,), (417,))

In [15]:
X_test.shape, y_test.shape

((180,), (180,))
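The 417/180 split is consistent with the test share being rounded up and the remainder going to the train set. A short arithmetic check, using only the row count reported by `news.shape`:

```python
import math

n_samples = 597   # rows in news.csv, from news.shape above
test_size = 0.3

n_test = math.ceil(test_size * n_samples)   # 179.1 is rounded up
n_train = n_samples - n_test                # the remainder goes to train

print(n_train, n_test)
```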

In [16]:
X_train.head(5)

139    cptully @ med.unc.edu ( Christopher P. Tully ,...
157    I 'm a mac user who want to use some of the ra...
16     Article-I.D . : DIALix.1praaa $ pqv Organizati...
572    I love the FAQ . The comment about contact len...
365    In a previous article , steveg @ cadkey.com ( ...
dtype: object

In [17]:
y_train[:5]

array([0, 0, 0, 2, 1])

### 2.5 Sklearn: Text preparation

For simplicity (and focus), we will not do any manual text cleaning beyond the lemmatization in section 2.3. `TfidfVectorizer` handles lowercasing, tokenization, and stop-word removal for us. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [18]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

# Raw string so that \W and \d reach the regex engine without invalid-escape warnings
tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")

X_train = tfidf_vect.fit_transform(X_train)
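The `token_pattern` used above keeps runs of letters only: it matches word characters (`\w`) while excluding digits and underscores. The sketch below applies the same regular expression with Python's `re` module to a made-up sample string (`sample` is an assumption for illustration), lowercased first as the vectorizer does:

```python
import re

token_pattern = r"[^\W\d_]+"   # word characters, minus digits and underscore

# Hypothetical sample text, loosely modeled on a newsgroup header
sample = "Article-I.D.: snoopy.1pqlhnINN8k1 voice_input 2nd"

tokens = re.findall(token_pattern, sample.lower())
print(tokens)
```

Digits and underscores act as separators, so an identifier such as `snoopy.1pqlhnINN8k1` is broken into letter-only fragments, and single letters are kept as tokens.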

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**
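A toy illustration of this point, assuming two made-up training sentences and one test sentence (none of them from `news.csv`) and a default `TfidfVectorizer`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, for illustration only
toy_train = ["the goalie made a save", "render the polygon mesh"]
toy_test = ["the goalie rendered a polygon quickly"]

toy_vect = TfidfVectorizer()
toy_train_matrix = toy_vect.fit_transform(toy_train)  # learns the vocabulary
toy_test_matrix = toy_vect.transform(toy_test)        # reuses that vocabulary

print(toy_train_matrix.shape, toy_test_matrix.shape)
print("quickly" in toy_vect.vocabulary_)
```

Both matrices have the same number of columns because the vocabulary comes from the train data only. Words that appear only in the test sentence, such as "quickly", are simply ignored.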

In [19]:
# Perform the TfidfVectorizer transformation
# Be careful: we use the vectorizer fitted on the train set to transform the test set.
# Otherwise, the test features would be very different and would NOT match the train set!

X_test = tfidf_vect.transform(X_test)


In [20]:
X_train.shape, X_test.shape

((417, 9229), (180, 9229))

In [21]:
# These data sets are sparse matrices. Displaying one shows only a summary, not the values
X_train

<417x9229 sparse matrix of type '<class 'numpy.float64'>'
	with 30421 stored elements in Compressed Sparse Row format>

In [22]:
# To see the values of a sparse matrix, convert it to a dense array using toarray()
X_train.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.41217195, 0.        , ..., 0.        , 0.        ,
        0.        ]])
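Almost every entry in the dense view is zero. A quick arithmetic sketch of how sparse the train matrix is, using only the numbers reported in the sparse matrix summary above (417 rows, 9229 columns, 30421 stored elements):

```python
n_rows, n_cols, n_stored = 417, 9229, 30421   # from the sparse matrix summary

density = n_stored / (n_rows * n_cols)        # share of non-zero entries
avg_terms_per_doc = n_stored / n_rows         # distinct terms per document

print(f"density: {density:.4%}")
print(f"average non-zero terms per document: {avg_terms_per_doc:.1f}")
```

Fewer than 1% of the entries are non-zero, which is why the sparse format is used and why `toarray()` should be applied with care on larger corpora.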

## 3.0 Latent Semantic Analysis (Singular Value Decomposition)

In [23]:
from sklearn.decomposition import TruncatedSVD

### n_components = 100

In [24]:
svd = TruncatedSVD(n_components=100, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train_100 = svd.fit_transform(X_train)
X_test_100 = svd.transform(X_test)

In [25]:
X_train_100.shape, X_test_100.shape

((417, 100), (180, 100))
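One way to judge how much information a given `n_components` retains is the cumulative `explained_variance_ratio_` of the fitted `TruncatedSVD`. The sketch below uses a small random sparse matrix (`X_demo`, an assumption for illustration) instead of the TF-IDF matrix, so the printed percentage says nothing about the newsgroup data:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse data: 50 rows, 200 columns, roughly 5% non-zero
rng = np.random.default_rng(0)
dense = rng.random((50, 200))
dense[dense < 0.95] = 0.0
X_demo = csr_matrix(dense)

svd_demo = TruncatedSVD(n_components=20, n_iter=10, random_state=0)
X_demo_reduced = svd_demo.fit_transform(X_demo)

# Running total of the variance share captured by the first k components
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)
print(X_demo_reduced.shape)
print(f"variance retained by 20 components: {cumulative[-1]:.2%}")
```

Applied to the tutorial's `svd` object after fitting, the same cumulative sum would show how much of the TF-IDF variance each setting keeps.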

### n_components = 300

In [26]:
svd = TruncatedSVD(n_components=300, n_iter=10) 

X_train_300 = svd.fit_transform(X_train)
X_test_300 = svd.transform(X_test)

In [27]:
X_train_300.shape, X_test_300.shape

((417, 300), (180, 300))

### n_components = 500

In [28]:
svd = TruncatedSVD(n_components=500, n_iter=10) 

X_train_500 = svd.fit_transform(X_train)
X_test_500 = svd.transform(X_test)


In [29]:
X_train_500.shape, X_test_500.shape #only 417 components are returned: SVD cannot produce more components than there are training documents (417)

((417, 417), (180, 417))
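The shape above has 417 columns even though 500 components were requested. This follows from a general property of SVD: a matrix has at most min(rows, columns) singular values, and the train matrix has only 417 rows. A small NumPy sketch on a made-up 10 x 50 matrix (`small` is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.random((10, 50))   # 10 "documents", 50 "features"

# Thin SVD: the number of singular values is min(n_rows, n_cols)
U, singular_values, Vt = np.linalg.svd(small, full_matrices=False)
print(U.shape, singular_values.shape, Vt.shape)
```

So the "500" setting in this tutorial is effectively a 417-component decomposition.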

## 4.0 Fitting classifiers

In [30]:
from sklearn.metrics import accuracy_score

performance = pd.DataFrame({"model": [], "Accuracy": []})

def performance_metric(y_test, model_preds, performance, model_name):
    performance = pd.concat([performance, pd.DataFrame({'model': str(model_name), 
                                                        'Accuracy': accuracy_score(y_test, model_preds)}, index=[0])])
    return performance

### 4.1 Random Forest

In [31]:
from sklearn.ensemble import RandomForestClassifier 

In [32]:
rnd_clf_100 = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf_100.fit(X_train_100, y_train)

In [33]:
rnd_clf_300 = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf_300.fit(X_train_300, y_train)

In [34]:
rnd_clf_500 = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf_500.fit(X_train_500, y_train)

#### 4.1.1 Evaluating Model Performance

In [35]:
#Train accuracy - not a good measure of model performance, because the model is evaluated on the same data it was trained on
y_pred_train_100 = rnd_clf_100.predict(X_train_100)
y_pred_train_300 = rnd_clf_300.predict(X_train_300)
y_pred_train_500 = rnd_clf_500.predict(X_train_500)
print(f"Train acc 100: {accuracy_score(y_train, y_pred_train_100):.4f}")
print(f"Train acc 300: {accuracy_score(y_train, y_pred_train_300):.4f}")
print(f"Train acc 500: {accuracy_score(y_train, y_pred_train_500):.4f}")


Train acc 100: 0.9760
Train acc 300: 0.9928
Train acc 500: 0.9880


In [36]:
#Test accuracy
y_pred_test_100 = rnd_clf_100.predict(X_test_100)
y_pred_test_300 = rnd_clf_300.predict(X_test_300)
y_pred_test_500 = rnd_clf_500.predict(X_test_500)

print(f"Test acc 100: {accuracy_score(y_test, y_pred_test_100):.4f}")
print(f"Test acc 300: {accuracy_score(y_test, y_pred_test_300):.4f}")
print(f"Test acc 500: {accuracy_score(y_test, y_pred_test_500):.4f}")

performance = performance_metric(y_test, y_pred_test_100, performance, "random forest 100")
performance = performance_metric(y_test, y_pred_test_300, performance, "random forest 300")
performance = performance_metric(y_test, y_pred_test_500, performance, "random forest 500")



Test acc 100: 0.9278
Test acc 300: 0.8944
Test acc 500: 0.9111
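Comparing the train and test accuracies gives a rough sense of how much each random forest overfits. The sketch below only does arithmetic on the accuracies printed above; nothing is refit:

```python
# Accuracies copied from the random forest outputs above
train_acc = {100: 0.9760, 300: 0.9928, 500: 0.9880}
test_acc = {100: 0.9278, 300: 0.8944, 500: 0.9111}

for n_components in train_acc:
    gap = train_acc[n_components] - test_acc[n_components]
    print(f"{n_components} components: train-test gap = {gap:.4f}")
```

The model with the highest train accuracy (300 components) also has the largest gap, which is why train accuracy alone is not used to compare models.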


In [37]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

print("100", "\n", confusion_matrix(y_test, y_pred_test_100), "\n")
print("300", "\n", confusion_matrix(y_test, y_pred_test_300), "\n")
print("500", "\n", confusion_matrix(y_test, y_pred_test_500))

100 
 [[58  0  2]
 [ 1 57  0]
 [ 9  1 52]] 

300 
 [[56  0  4]
 [ 1 54  3]
 [ 9  2 51]] 

500 
 [[56  1  3]
 [ 1 55  2]
 [ 8  1 53]]
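In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so accuracy and per-class recall can be read directly from it. A sketch using the 100-component matrix printed above, with class names taken from the label encoder order:

```python
import numpy as np

# Confusion matrix for the 100-component random forest, copied from above
cm = np.array([[58, 0, 2],
               [1, 57, 0],
               [9, 1, 52]])

accuracy = np.trace(cm) / cm.sum()               # correct predictions / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # correct / true count, per row

print(f"accuracy: {accuracy:.4f}")
for name, recall in zip(["graphics", "hockey", "medical"], recall_per_class):
    print(f"{name}: recall = {recall:.4f}")
```

The accuracy matches the 0.9278 reported earlier. Most of the errors are medical posts predicted as graphics (the 9 in the bottom-left cell).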


### 4.2 Stochastic Gradient Descent Classifier

In [38]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix

In [39]:
sgd_clf_100 = SGDClassifier(max_iter=100)
_ = sgd_clf_100.fit(X_train_100, y_train)

In [40]:
sgd_clf_300 = SGDClassifier(max_iter=100)
_ = sgd_clf_300.fit(X_train_300, y_train)

In [41]:
sgd_clf_500 = SGDClassifier(max_iter=100)
_ = sgd_clf_500.fit(X_train_500, y_train)

#### 4.2.1 Evaluating Model Performance

In [42]:
#Train accuracy
y_pred_train_100 = sgd_clf_100.predict(X_train_100)
y_pred_train_300 = sgd_clf_300.predict(X_train_300)
y_pred_train_500 = sgd_clf_500.predict(X_train_500)
print(f"Train acc 100: {accuracy_score(y_train, y_pred_train_100):.4f}")
print(f"Train acc 300: {accuracy_score(y_train, y_pred_train_300):.4f}")
print(f"Train acc 500: {accuracy_score(y_train, y_pred_train_500):.4f}")

Train acc 100: 0.9928
Train acc 300: 0.9976
Train acc 500: 0.9976


In [43]:
#Test accuracy
y_pred_test_100 = sgd_clf_100.predict(X_test_100)
y_pred_test_300 = sgd_clf_300.predict(X_test_300)
y_pred_test_500 = sgd_clf_500.predict(X_test_500)

print(f"Test acc 100: {accuracy_score(y_test, y_pred_test_100):.4f}")
print(f"Test acc 300: {accuracy_score(y_test, y_pred_test_300):.4f}")
print(f"Test acc 500: {accuracy_score(y_test, y_pred_test_500):.4f}")

performance = performance_metric(y_test, y_pred_test_100, performance, "SGD classifier 100")
performance = performance_metric(y_test, y_pred_test_300, performance, "SGD classifier 300")
performance = performance_metric(y_test, y_pred_test_500, performance, "SGD classifier 500")

Test acc 100: 0.9444
Test acc 300: 0.9611
Test acc 500: 0.9500


In [44]:
performance

| | model | Accuracy |
|---|---|---|
| 0 | random forest 100 | 0.927778 |
| 0 | random forest 300 | 0.894444 |
| 0 | random forest 500 | 0.911111 |
| 0 | SGD classifier 100 | 0.944444 |
| 0 | SGD classifier 300 | 0.961111 |
| 0 | SGD classifier 500 | 0.95 |


In [45]:
# Confusion Matrix
print("100", "\n", confusion_matrix(y_test, y_pred_test_100),"\n")
print("300", "\n", confusion_matrix(y_test, y_pred_test_300),"\n")
print("500", "\n", confusion_matrix(y_test, y_pred_test_500))

100 
 [[60  0  0]
 [ 5 52  1]
 [ 3  1 58]] 

300 
 [[58  1  1]
 [ 1 57  0]
 [ 2  2 58]] 

500 
 [[60  0  0]
 [ 1 57  0]
 [ 3  5 54]]
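The vectorizing, SVD, and classification steps can also be chained in a single scikit-learn `Pipeline`, which applies `fit_transform` on the train data and `transform` on new data automatically. The sketch below is not the workflow used in this tutorial: it assumes a tiny made-up corpus (`toy_texts`, `toy_labels`) and small parameter values chosen only so that the example runs.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier

# Hypothetical mini-corpus, two posts per category
toy_texts = [
    "render the polygon mesh with shading",
    "the image viewer displays the bitmap",
    "the goalie stopped the puck in overtime",
    "the team scored on the power play",
    "the doctor prescribed a new treatment",
    "patients reported fewer symptoms after surgery",
]
toy_labels = ["graphics", "graphics", "hockey", "hockey", "medical", "medical"]

text_clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=3, random_state=0),
    SGDClassifier(max_iter=100, random_state=0),
)
text_clf.fit(toy_texts, toy_labels)

# New text goes through the same fitted vectorizer and SVD before prediction
predictions = text_clf.predict(["the puck hit the goalie"])
print(predictions)
```

With so little data the prediction itself is not meaningful; the point is that the pipeline keeps the train-only fitting discipline in one object.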


## 5.0 Summary

In [46]:
performance.sort_values(by=['Accuracy'])

| | model | Accuracy |
|---|---|---|
| 0 | random forest 300 | 0.894444 |
| 0 | random forest 500 | 0.911111 |
| 0 | random forest 100 | 0.927778 |
| 0 | SGD classifier 100 | 0.944444 |
| 0 | SGD classifier 500 | 0.95 |
| 0 | SGD classifier 300 | 0.961111 |


For random forest, the smallest number of components (100) gave the best test accuracy, and 300 gave the worst. For the SGD classifier (for this random seed), the intermediate setting of 300 components achieved the best accuracy. Note that the "500" setting actually kept only 417 components, as the shapes in section 3.0 show.

The best number of SVD components for SGD is therefore the worst for random forest. SVD may be helpful when the data is sparse and highly correlated; in that case, a smaller number of features/components can lead to better classification.

While SVD can reduce the number of features and improve speed, it also makes the model less interpretable. It extracts the dominant patterns in the training data, so with very noisy data those patterns may reflect noise and the model can end up overfitting. Whether to use SVD depends on the particular problem/application.

