## Hema Sai Ari

# Tutorial - Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

### Import common packages

In [1]:
import pandas as pd
import numpy as np

# Seed NumPy's global random number generator for reproducibility
np.random.seed(1)

### Load data

In [2]:
news = pd.read_csv('news.csv')

news.shape


(597, 5)

In [3]:
news.head(5)

| Unnamed: 0 | TEXT | graphics | hockey | medical | newsgroup |
|---|---|---|---|---|---|
| 0 | I have a few reprints left of chapters from my... | 1 | 0 | 0 | graphics |
| 1 | gnuplot, etc. make it easy to plot real valued... | 1 | 0 | 0 | graphics |
| 2 | Article-I.D.: snoopy.1pqlhnINN8k1 References: ... | 1 | 0 | 0 | graphics |
| 3 | Hello, I am looking to add voice input capabil... | 1 | 0 | 0 | graphics |
| 4 | I recently got a file describing a library of ... | 1 | 0 | 0 | graphics |


### Check for missing values

In [4]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [5]:
X = news['TEXT']

This is a multi-class classification problem. There are three categories we will predict:<br>
Whether a post is "graphics," "hockey," or "medical" related

In [6]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [7]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
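
The encoding step can be checked on a handful of made-up labels. The sketch below uses toy labels (not the real `newsgroup` column) to show that `LabelEncoder` assigns integer codes in alphabetical order of the class names and that `inverse_transform` maps the codes back to the original names.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only; the real data has 597 posts
labels = ["graphics", "hockey", "medical", "hockey"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)        # classes are sorted alphabetically, then numbered
decoded = encoder.inverse_transform(codes)   # map the integer codes back to class names

print(list(encoder.classes_))
print(list(codes))
print(list(decoded))
```

Keeping the fitted encoder around is useful later, because model predictions come back as integers and `inverse_transform` turns them into readable category names.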

In [8]:
X

0      I have a few reprints left of chapters from my...
1      gnuplot, etc. make it easy to plot real valued...
2      Article-I.D.: snoopy.1pqlhnINN8k1 References: ...
3      Hello, I am looking to add voice input capabil...
4      I recently got a file describing a library of ...
                             ...                        
592    carl@SOL1.GPS.CALTECH.EDU (Carl J Lydick) writ...
593    In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...
594    Article-I.D.: kestrel.1993Apr16.172052.27843 R...
595    In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...
596    I have a 42 yr old male friend, misdiagnosed a...
Name: TEXT, Length: 597, dtype: object

In [9]:
import nltk
nltk.download('averaged_perceptron_tagger')  # you only need to run this once
nltk.download('punkt')  # you only need to run this once
nltk.download('wordnet')  # you only need to run this once
from nltk.stem import WordNetLemmatizer 
from nltk import pos_tag, word_tokenize

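# Lemmatize every document, using each word's part-of-speech tag
corpus = X
transformed_corpus = []
wnl = WordNetLemmatizer()
for document in corpus:
    lemmas = []
    for word, tag in pos_tag(word_tokenize(document)):
        # Penn Treebank adjective tags start with 'J', but WordNet uses 'a' for adjectives
        wntag = tag[0].lower()
        wntag = 'a' if wntag == 'j' else wntag
        if wntag in ['a', 'r', 'n', 'v']:
            lemmas.append(wnl.lemmatize(word, wntag))
        else:
            lemmas.append(word)  # keep words with any other tag unchanged
    transformed_corpus.append(" ".join(lemmas))

# Note: this assigns to lowercase `x`, so `X` (used in the rest of the notebook) still holds the raw text
x = transformed_corpus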

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\arihe\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\arihe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\arihe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
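
The tags returned by `pos_tag` follow the Penn Treebank convention, which does not line up letter-for-letter with the part-of-speech codes that `WordNetLemmatizer.lemmatize` expects (adjectives are tagged `JJ`, `JJR`, `JJS`, while WordNet uses `'a'`). A minimal sketch of an explicit mapping follows; the helper name `penn_to_wordnet` is hypothetical and not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code, or None if there is no match.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    if tag.startswith("J"):
        return "a"   # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"   # verbs: VB, VBD, VBG, ...
    if tag.startswith("N"):
        return "n"   # nouns: NN, NNS, NNP, NNPS
    if tag.startswith("RB"):
        return "r"   # adverbs: RB, RBR, RBS (but not RP, the particle tag)
    return None


for tag in ["JJ", "RB", "NNS", "VBD", "DT", "RP"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Words whose tag maps to `None` (determiners, prepositions, punctuation and so on) would simply be kept unchanged rather than lemmatized.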


In [10]:
X

0      I have a few reprints left of chapters from my...
1      gnuplot, etc. make it easy to plot real valued...
2      Article-I.D.: snoopy.1pqlhnINN8k1 References: ...
3      Hello, I am looking to add voice input capabil...
4      I recently got a file describing a library of ...
                             ...                        
592    carl@SOL1.GPS.CALTECH.EDU (Carl J Lydick) writ...
593    In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...
594    Article-I.D.: kestrel.1993Apr16.172052.27843 R...
595    In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...
596    I have a 42 yr old male friend, misdiagnosed a...
Name: TEXT, Length: 597, dtype: object

## Split the data

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [12]:
X_train.shape, y_train.shape

((417,), (417,))

In [13]:
X_test.shape, y_test.shape

((180,), (180,))
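
The split above is random, so the class proportions in train and test can drift slightly from run to run. As a sketch on toy data (not the newsgroup posts), the `stratify` and `random_state` arguments of `train_test_split` can be used to keep the class balance identical in both parts and to make the split repeatable. The values `random_state=1` and the toy sizes are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 30 fake posts, 10 per class
docs = np.array([f"post {i}" for i in range(30)])
labels = np.repeat([0, 1, 2], 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    docs, labels,
    test_size=0.3,
    random_state=1,    # fixes the shuffle so the split is repeatable
    stratify=labels,   # keeps the class proportions the same in train and test
)

print(len(X_tr), len(X_te))
print(np.bincount(y_tr), np.bincount(y_te))
```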

In [14]:
X_train.head(5)

573    Hello to everybody, I write here because I am ...
33     i am sorry, but this genoa card does nothing t...
457    Article-I.D.: pitt.19425 References: < 1993Mar...
540    In article < 1993Apr14.122647.16364@tms390.mic...
110    Hello... Does anyone know of any C or C++ func...
Name: TEXT, dtype: object

In [15]:
y_train[:5]

array([2, 0, 2, 2, 0])

## Sklearn: Text preparation

For simplicity (and focus), we will not do any further text cleaning or preprocessing. The lemmatized corpus built earlier is stored in `x` and is not used; the raw text in `X` is the input to the model. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [16]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

# Raw string avoids invalid-escape warnings; the pattern keeps tokens made of letters only
tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")

X_train = tfidf_vect.fit_transform(X_train)

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**

In [17]:
# Perform the TfidfVectorizer transformation
# Be careful: We are using the train fit to transform the test data set. Otherwise, the test data
# features will be very different and will not match the train set!

X_test = tfidf_vect.transform(X_test)


In [18]:
X_train.shape, X_test.shape

((417, 9631), (180, 9631))
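
To see why the column counts match, it helps to look at a tiny made-up example. In the sketch below (toy sentences, default `TfidfVectorizer` settings rather than the ones used above), the vocabulary is learned from the training documents only, so words that appear only in the test document are ignored by `transform`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents for illustration only
train_docs = ["the goalie made a save", "the doctor prescribed medicine"]
test_docs = ["the goalie saw a doctor about a polygon"]  # contains words never seen in training

vect = TfidfVectorizer()
train_matrix = vect.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_matrix = vect.transform(test_docs)        # reuses them; unseen words are dropped

print(train_matrix.shape, test_matrix.shape)
print("polygon" in vect.vocabulary_)
```

Both matrices have one column per vocabulary term learned from the training documents, which is what lets a model trained on one be applied to the other.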

In [19]:
# These data sets are sparse matrices. Displaying one shows only a summary, not the values
X_train

<417x9631 sparse matrix of type '<class 'numpy.float64'>'
	with 29683 stored elements in Compressed Sparse Row format>

In [20]:
# To see the values of a sparse matrix, convert it to a dense array using toarray()
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
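
The dense array is almost entirely zeros, which is why the data is stored in sparse form. A short calculation from the numbers reported above (417 rows, 9631 columns, 29683 stored elements) makes this concrete; the variable names are only for illustration.

```python
# Figures taken from the sparse matrix summary shown above
n_rows, n_cols = 417, 9631
stored = 29683

total_cells = n_rows * n_cols
density = stored / total_cells   # fraction of cells that hold a non-zero value
dense_bytes = total_cells * 8    # float64 uses 8 bytes per cell

print(f"non-zero cells: {density:.2%}")
print(f"average non-zero terms per post: {stored / n_rows:.1f}")
print(f"dense array size: {dense_bytes / 1e6:.1f} MB")
```

Fewer than 1% of the cells are non-zero, so calling `toarray()` is fine for a quick look at this size but becomes wasteful on larger corpora.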

## Latent Semantic Analysis (Singular Value Decomposition)

In [21]:
np.random.seed(1)  # seed NumPy's global RNG, which scikit-learn uses when random_state is not set

from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

n_components = [100, 300, 500]

# NOTE: the printed labels say "n_estimator", but n is the number of SVD components (n_components)
for n in n_components:
    ## Latent Semantic Analysis (Singular Value Decomposition)
    # n_components is the number of topics, which should be less than the number of features
    svd = TruncatedSVD(n_components=n, n_iter=10)

    # Fit the SVD on the training data only, then apply the same projection to the test data
    X_train_new = svd.fit_transform(X_train)
    X_test_new = svd.transform(X_test)

    print("Summary of Results for n_estimator=", {n})
    print("*******************************************")

    ## Random Forest
    rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1)
    _ = rnd_clf.fit(X_train_new, y_train)

    # Train accuracy: not a good measure of model performance,
    # because the model is scored on the same data it was trained on
    y_pred_train = rnd_clf.predict(X_train_new)

    print(" Random Forest")
    print(".........................")
    print(f"Train accuracy of Random Forest for n_estimator:{n} is  {accuracy_score(y_train, y_pred_train):.4f}")

    # Test accuracy
    y_pred_test = rnd_clf.predict(X_test_new)
    print(f"Test accuracy of Random Forest for n_estimator:{n} is  {accuracy_score(y_test, y_pred_test):.4f}")

    print("\n")

    ## Stochastic Gradient Descent Classifier
    sgd_clf = SGDClassifier(max_iter=100)
    _ = sgd_clf.fit(X_train_new, y_train)

    # Train accuracy
    y_pred_train = sgd_clf.predict(X_train_new)

    print(" Stochastic Gradient Descent")
    print(".....................................")
    print(f"Train accuracy of Stochastic Gradient Descent Classifier  for n_estimator:{n} is {accuracy_score(y_train, y_pred_train):.4f}")

    # Test accuracy
    y_pred_test = sgd_clf.predict(X_test_new)
    print(f"Test accuracy of Stochastic Gradient Descent Classifier for n_estimator:{n} is  {accuracy_score(y_test, y_pred_test):.4f}")
    print("--------------------------------------------------------------------------------------------")
    print("\n")


Summary of Results for n_estimator= {100}
*******************************************
 Random Forest
.........................
Train accuracy of Random Forest for n_estimator:100 is  0.9832
Test accuracy of Random Forest for n_estimator:100 is  0.8556


 Stochastic Gradient Descent
.....................................
Train accuracy of Stochastic Gradient Descent Classifier  for n_estimator:100 is 0.9832
Test accuracy of Stochastic Gradient Descent Classifier for n_estimator:100 is  0.9167
--------------------------------------------------------------------------------------------


Summary of Results for n_estimator= {300}
*******************************************
 Random Forest
.........................
Train accuracy of Random Forest for n_estimator:300 is  0.9880
Test accuracy of Random Forest for n_estimator:300 is  0.8500


 Stochastic Gradient Descent
.....................................
Train accuracy of Stochastic Gradient Descent Classifier  for n_estimator:300 is 0.9976


## Conclusion

The best number of SVD components (`n_components`, printed as n_estimator in the output above) differs between the two models:

- For the Random Forest model, n_components=500 has the highest test accuracy, i.e. 0.8611.
- For the Stochastic Gradient Descent model, n_components=300 has the highest test accuracy, i.e. 0.9278.

The Stochastic Gradient Descent model therefore does not improve when going from 300 to 500 components. This could be because the training data has only 417 observations: the TF-IDF matrix then has rank at most 417, so components beyond that cannot carry additional information.
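
The role of the number of observations can be illustrated with a small NumPy sketch. A matrix with m rows has at most m non-zero singular values, so asking an SVD for more components than there are training documents cannot add information. The matrix below is random toy data with assumed sizes (5 "documents", 20 "features"), not the TF-IDF matrix from this tutorial.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((5, 20))   # toy stand-in: 5 documents, 20 features

# The SVD of a 5 x 20 matrix yields min(5, 20) = 5 singular values
singular_values = np.linalg.svd(A, compute_uv=False)
rank = np.linalg.matrix_rank(A)

print(len(singular_values))
print(rank)
```

By the same argument, a 417-row training matrix supports at most 417 informative components, which is below the largest setting tried here (500).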

Singular Value Decomposition (SVD) is a technique used for dimensionality reduction.

- Reasons to use SVD:
    - Dimensionality reduction: SVD shrinks the data while keeping most of the important information. This reduces the computational cost of working with large datasets.
    - Reducing noise: truncating the SVD drops the components with the smallest singular values, which often capture noise rather than signal.
    - Compressing data: SVD gives a compact low-rank approximation of the data.

- Reasons not to use SVD:
    - Robustness: SVD assumes that there are no errors in the data, so outliers and noisy entries can distort the components.
    - Computational complexity: SVD is computationally expensive, especially for large datasets.
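
One way to weigh these trade-offs when choosing `n_components` is to look at how much variance the retained components explain. The sketch below runs `TruncatedSVD` on a small random sparse matrix (toy data with assumed sizes, not the TF-IDF matrix above) and prints the cumulative `explained_variance_ratio_`. The choice of 20 components is arbitrary.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy sparse matrix: 60 rows, 200 columns, roughly 5% non-zero entries
rng = np.random.default_rng(1)
dense = rng.random((60, 200)) * (rng.random((60, 200)) < 0.05)
toy = csr_matrix(dense)

svd = TruncatedSVD(n_components=20, n_iter=10, random_state=1)
reduced = svd.fit_transform(toy)

# Running total of the share of variance captured by the first k components
cumulative = np.cumsum(svd.explained_variance_ratio_)

print(reduced.shape)
print(f"variance explained by 20 components: {cumulative[-1]:.2%}")
```

If the cumulative share flattens out well before the last component, the extra components add cost without adding much information, and a smaller `n_components` is probably enough.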