# Tutorial - Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

### Import common packages

In [1]:
import pandas as pd
import numpy as np

np.random_seed = 1

### Load data

In [2]:
news = pd.read_csv('news.csv')

news.shape


(597, 5)

In [3]:
news.head(5)

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup
0,I have a few reprints left of chapters from my...,1,0,0,graphics
1,"gnuplot, etc. make it easy to plot real valued...",1,0,0,graphics
2,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1,0,0,graphics
3,"Hello, I am looking to add voice input capabil...",1,0,0,graphics
4,I recently got a file describing a library of ...,1,0,0,graphics


### Check for missing values

In [4]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [5]:
X = news['TEXT']

In [6]:
X

0      I have a few reprints left of chapters from my...
1      gnuplot, etc. make it easy to plot real valued...
2      Article-I.D.: snoopy.1pqlhnINN8k1 References: ...
3      Hello, I am looking to add voice input capabil...
4      I recently got a file describing a library of ...
                             ...                        
592    carl@SOL1.GPS.CALTECH.EDU (Carl J Lydick) writ...
593    In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...
594    Article-I.D.: kestrel.1993Apr16.172052.27843 R...
595    In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...
596    I have a 42 yr old male friend, misdiagnosed a...
Name: TEXT, Length: 597, dtype: object

This is a multi-class classification problem. There are three categories we will predict:<br>
Whether a post is "graphics," "hockey," or "medical" related

In [7]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

## Implementing Lemmatizer

In [8]:
import nltk
from nltk.stem import WordNetLemmatizer

In [9]:
lemmatizer = WordNetLemmatizer()

In [10]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    
    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Remove stopwords
    stopwords = set(nltk.corpus.stopwords.words('english'))
    filtered_tokens = [token for token in lemmatized_tokens if token not in stopwords]
    
    # Join the tokens back into a string
    preprocessed_text = ' '.join(filtered_tokens)
    
    return preprocessed_text


In [11]:
X = X.apply(preprocess_text)

In [12]:
X

0      reprint left chapter book `` vision future '' ...
1      gnuplot , etc . make easy plot real valued fun...
2      article-i.d . : snoopy.1pqlhninn8k1 reference ...
3      hello , looking add voice input capability use...
4      recently got file describing library rendering...
                             ...                        
592    carl @ sol1.gps.caltech.edu ( carl j lydick ) ...
593    article < 1qmlgainnjab @ hp-col.col.hp.com > ,...
594    article-i.d . : kestrel.1993apr16.172052.27843...
595    article < 1qmlgainnjab @ hp-col.col.hp.com > ,...
596    42 yr old male friend , misdiagnosed osteoppor...
Name: TEXT, Length: 597, dtype: object

In [13]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
  

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ankit\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ankit\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ankit\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ankit\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

## Split the data

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [16]:
X_train.shape, y_train.shape

((417,), (417,))

In [17]:
X_test.shape, y_test.shape

((180,), (180,))

In [18]:
X_train.head(5)

54     gchen @ essex.ecn.uoknor.edu writes article < ...
595    article < 1qmlgainnjab @ hp-col.col.hp.com > ,...
381    flyer closed season last night 8th straight vi...
206    article < c4zhj1.7xb @ idacom.hp.com > andrew ...
310    long ha cherry student sfu ? ( arche bunker ? ...
Name: TEXT, dtype: object

In [19]:
y_train[:5]

array([0, 2, 1, 1, 1])

## Sklearn: Text preparation

For simplicity (and focus), we will not do any text cleaning or preprocessing. We will just use the raw text as input to the model. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [20]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern="[^\W\d_]+")

X_train = tfidf_vect.fit_transform(X_train)

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**

In [21]:
# Perform the TfidfVectorizer transformation
# Be careful: We are using the train fit to transform the test data set. Otherwise, the test data 
# features will be very different and match the train set!!!

X_test = tfidf_vect.transform(X_test)


In [22]:
X_train.shape, X_test.shape

((417, 9433), (180, 9433))

In [23]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train

<417x9433 sparse matrix of type '<class 'numpy.float64'>'
	with 30714 stored elements in Compressed Sparse Row format>

In [24]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## The outcomes of the modeling can be significantly impacted by altering the n components parameter in SVD. The number of singular values and vectors to retain is specified by the n components option. The higher the value of n_components, the more information is retained, but also the higher the computational cost. Conversely, reducing the value of n_components leads to a loss of information but reduces the computational cost.

## Below we are going to see if and how how different "n_components values" impact the performance of the models.

## Latent Semantic Analysis (Singular Value Decomposition)

In [25]:
from sklearn.decomposition import TruncatedSVD

for num in [100, 300, 500]:

    svd = TruncatedSVD(n_components=num, random_state=42) #n_components is the number of topics, which should be less than the number of features

    X_train_svd= svd.fit_transform(X_train)
    X_test_svd = svd.transform(X_test)
    
    print(f"for n_components={num}")
    
    from sklearn.ensemble import RandomForestClassifier 

    rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42) 
    rnd_clf.fit(X_train_svd, y_train)
    
    #Evaluating model performance.
    from sklearn.metrics import accuracy_score
    
    # Train accuracy
    y_pred_train = rnd_clf.predict(X_train_svd)
    acc = accuracy_score(y_train, y_pred_train)
    print(f" RF Train acc score: {accuracy_score(y_train, y_pred_train):.4f}")
    
    #Test accuracy
    y_pred_test = rnd_clf.predict(X_test_svd)
    acc = accuracy_score(y_test, y_pred_test)
    print(f" Rf Test acc score: {accuracy_score(y_test, y_pred_test):.4f}")
    
    # Confusion Matrix
    from sklearn.metrics import confusion_matrix
    confusion_matrix(y_test, y_pred_test)
    
    from sklearn.linear_model import SGDClassifier
    
    sgd_clf = SGDClassifier(random_state=42)
    sgd_clf.fit(X_train_svd, y_train)
    
    #Train accuracy
    y_pred_train = sgd_clf.predict(X_train_svd)
    print(f"SGD Train acc score: {accuracy_score(y_train, y_pred_train):.4f}")
    
    #Test accuracy
    y_pred_test = sgd_clf.predict(X_test_svd)
    print(f"SGD Test acc score: {accuracy_score(y_test, y_pred_test):.4f}")
    
    # Confusion Matrix
    from sklearn.metrics import confusion_matrix
    confusion_matrix(y_test, y_pred_test)
    
    print(f"   ")

for n_components=100
 RF Train acc score: 0.9976
 Rf Test acc score: 0.9222
SGD Train acc score: 0.9880
SGD Test acc score: 0.9167
   
for n_components=300
 RF Train acc score: 0.9976
 Rf Test acc score: 0.8944
SGD Train acc score: 0.9976
SGD Test acc score: 0.9222
   
for n_components=500
 RF Train acc score: 0.9976
 Rf Test acc score: 0.8833
SGD Train acc score: 0.9976
SGD Test acc score: 0.9444
   


## The n_components values of 100,300,500 will retain the top 100,300 and 500 singular values and vectors. By reducing the number of components, we can reduce the dimensionality of the data, making it easier to analyze and model but may also result in a loss of information. 

# For Random Forest,

Accuracy with n_components=100: 0.9222

Accuracy with n_components=300: 0.8944

Accuracy with n_components=500: 0.8833

# For SGDClassifier,

Accuracy with n_components=100: 0.9167

Accuracy with n_components=300: 0.9222

Accuracy with n_components=500: 0.9444

# As per my results, increasing n_components leads to an improvement in test data accuracy for SGDClassifier model, while increasing n_components   leads to an decrease in test data accuracy for Random Forest model.

# The exact value of n_components that provides the best performance depends on the specific dataset and model being used.

# Without the SVD approach in our analysis, it gave less accuracy(about 80%) as compared to while using it(~90%). So, i would prefer this approach in this case. But it is important to carefully consider the advantages and disadvantages before using it in your analysis.