
# Week 10-11: Document Classification

*CUNY SPS DATA 620*  

*April 22, 2022*

*Bonnie Cooper, George Deschamps, Rob Hodde*

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  UCI Machine Learning Repository: Spambase Data Set.
For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.


<br>
<br>

We use the data from the Kaggle competition **Natural Language Processing with Disaster Tweets** (https://www.kaggle.com/competitions/nlp-getting-started/overview).  

The challenge is to build a machine learning model to distinguish between Tweets that are about real disasters and those that are not. 

<br>

We start by importing packages for data collection and cleansing, processing and finally prediction modeling:

In [None]:
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np 
from numpy import mean
from numpy import std
import os
import pandas as pd 
from pathlib import Path
import re
import string

import cleantext  
from emoji import demojize

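import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # lexical database used by WordNetLemmatizer
nltk.download('punkt')    # tokenizer models used by word_tokenize (newer NLTK releases may need 'punkt_tab' instead)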

from sklearn import svm
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD


We create a helper function to scrub the Tweets so that they are easier for a model to process. This includes lowercasing the text and removing numbers, punctuation, e-mail addresses, and stopwords (words that carry little information). We also translate emoji into words.


In [63]:
# Changes text to lower case
# Removes:
#    numbers and punctuation
#    stopwords
#    e-mail addresses
#    extra spaces
# Translates emojis into their text aliases
def clean_text(x):
    x = demojize(x, language='alias') 
    x = re.sub(r"[:]+\ *", " ", x) #removes emoji colons and separates them with a space
    return cleantext.clean(x, extra_spaces=True, lowercase=True, numbers=True, punct=True, stopwords=True,
                     reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace=' ')
                     

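The lemmatize function converts variations of words into their dictionary (root) form. With its default settings, WordNetLemmatizer treats every token as a noun, so plurals such as "cities" become "city" while verb forms such as "happened" are left unchanged: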

In [64]:
#Function to Lemmatize text (convert various forms to root words) 
def lemmatize_word(text):
    lemmatizer = WordNetLemmatizer()
    lemma = [lemmatizer.lemmatize(word) for word in text]
    return lemma
    

Finally, we combine these functions to clean, tokenize (split into individual words), and lemmatize each Tweet, then join the words back into a single string.

In [65]:
#Rationalize the text: clean, tokenize and lemmatize
def rationalize_text(txt):
    return (
        txt.apply(clean_text)
           .apply(word_tokenize)
           .apply(lemmatize_word)
           .apply(lambda tokens: ''.join(token + ' ' for token in tokens))  # rejoin tokens, each followed by a space
    )
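To make the clean-and-tokenize steps concrete, here is a minimal standard-library sketch. It is not the notebook's pipeline: `TOY_STOPWORDS`, `toy_clean`, and `toy_rationalize` are hypothetical names, the stopword list is a tiny stand-in for the one cleantext uses, tokenization is a plain whitespace split, and lemmatization is left out.

```python
import re

# Hypothetical, tiny stopword list for illustration only;
# the real cleaning step relies on a much larger list.
TOY_STOPWORDS = {"a", "an", "the", "is", "in", "at", "are", "just", "about"}

def toy_clean(text):
    """Lowercase, drop digits and punctuation, split on whitespace, remove toy stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # keep only letters and whitespace
    tokens = text.split()                    # naive whitespace tokenization
    return [t for t in tokens if t not in TOY_STOPWORDS]

def toy_rationalize(text):
    """Return the cleaned tokens joined back into a single string."""
    return " ".join(toy_clean(text))

cleaned = toy_rationalize("Just happened a terrible car crash")
print(cleaned)  # happened terrible car crash
```

The real pipeline additionally reduces plural nouns to their singular form, which this sketch does not attempt.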
    

We import the Tweets, which are provided as separate training and test sets.

In [66]:
#import train and test data into dataframes
os.chdir('C:\\Users\\TRADE\\Documents\\GitHub\\DATA620-Week11\\')

df = pd.read_csv('train.csv')
train_df = df[['text','target']].copy()

df = pd.read_csv('test.csv')
test_df = df[['id','text']].copy()

test_df.head()

| | id | text |
|---|---|---|
| 0 | 0 | Just happened a terrible car crash |
| 1 | 2 | Heard about #earthquake is different cities, s... |
| 2 | 3 | there is a forest fire at spot pond, geese are... |
| 3 | 9 | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | Typhoon Soudelor kills 28 in China and Taiwan |


We transform the Tweets to be more legible to a computer:

In [67]:
train_df["text"] = rationalize_text(train_df["text"])
test_df["text"] = rationalize_text(test_df["text"])

test_df.head()

| | id | text |
|---|---|---|
| 0 | 0 | happened terrible car crash |
| 1 | 2 | heard earthquake different city stay safe ever... |
| 2 | 3 | forest fire spot pond goose fleeing across str... |
| 3 | 9 | apocalypse lighting spokane wildfire |
| 4 | 11 | typhoon soudelor kill china taiwan |


We will use 80% of the training Tweets to build a prediction model and hold out the remaining 20% for evaluation:

In [68]:
#Split training set so that a model can be built and evaluated
X_train, X_test, y_train, y_test = train_test_split(train_df['text'], train_df['target'], test_size=0.2)
X_train.head()

5701    zakbagans pet r like part family love animal l...
2591    black eye space battle occurred star involving...
5835    china stock market crash gem rubble chinaûªs ...
2085                             thats val dead im suing 
6251    new photo oak snowstorm http tcojhscgdag south...
Name: text, dtype: object
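As a rough illustration of what a hold-out split does, the sketch below shuffles row indices and sets aside a fraction of them. `toy_train_test_split` is a hypothetical helper written for this example, not scikit-learn's implementation, and the toy data is made up.

```python
import math
import random

def toy_train_test_split(texts, labels, test_size=0.2, seed=0):
    """Shuffle row indices, then hold out a fraction of the rows for testing."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)          # seeded so the split is repeatable
    n_test = math.ceil(len(texts) * test_size)    # number of held-out rows, rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [texts[i] for i in train_idx]
    X_test = [texts[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Made-up data: ten placeholder tweets with alternating labels.
texts = [f"tweet {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te = toy_train_test_split(texts, labels)
print(len(X_tr), len(X_te))  # 8 2
```

Fixing the seed (the counterpart of passing `random_state` to `train_test_split`) makes the split, and therefore any downstream scores, reproducible between runs.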

To help separate actual emergency Tweets from non-emergency ones, we use two tools: Support Vector Machine (SVM) and Term Frequency Inverse Document Frequency (TFIDF).

An SVM attempts to find a boundary (a hyperplane) in a high-dimensional space that separates one class of documents from another (in our case, Emergency / Non-Emergency), leaving as wide a margin as possible between the two classes.
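The decision rule of a linear SVM can be sketched in a few lines, assuming each document has already been turned into a numeric vector. The weights and bias below are hypothetical values chosen by hand; a real SVM learns them from the training data, in part by penalizing the hinge loss shown here.

```python
def decision_value(weights, bias, x):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(weights, bias, x):
    """Class 1 (emergency) if the point lies on the positive side, else class 0."""
    return 1 if decision_value(weights, bias, x) >= 0 else 0

def hinge_loss(weights, bias, x, label):
    """Zero when the point is on the correct side with margin >= 1, growing linearly otherwise."""
    y = 1 if label == 1 else -1
    return max(0.0, 1.0 - y * decision_value(weights, bias, x))

weights, bias = [2.0, -1.0], -0.5           # hypothetical, hand-picked parameters

print(predict(weights, bias, [1.0, 0.5]))        # 1
print(predict(weights, bias, [0.0, 1.0]))        # 0
print(hinge_loss(weights, bias, [0.0, 1.0], 1))  # 2.5
```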

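In order for the documents to be placed in that space, they need coordinates. TFIDF supplies these coordinates by embedding each document as a vector of word weights: a word scores highly when it is frequent in the document but rare across the corpus.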
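A hand-rolled version of the weighting may help show where the coordinates come from. This sketch assumes whitespace tokenization, raw counts as term frequency, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector, which to my understanding mirrors the defaults of scikit-learn's `TfidfVectorizer`. The function name `tfidf_vectors` and the three sample documents are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, one L2-normalized TF-IDF vector per document)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed inverse document frequency: rarer terms get larger weights
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0   # avoid dividing by zero for empty docs
        vectors.append([v / norm for v in row])
    return vocab, vectors

docs = ["forest fire spot pond", "typhoon kill china taiwan", "forest wildfire"]
vocab, vectors = tfidf_vectors(docs)

# In the third document, "wildfire" (one document) outweighs "forest" (two documents).
third = dict(zip(vocab, vectors[2]))
print(third["wildfire"] > third["forest"])  # True
```

Each document becomes one row with a column per vocabulary word, which is why the matrix is wide and sparse before any dimensionality reduction.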

In [73]:
#Using Term Frequency Inverse Document Frequency, vectorize the text so that it can be mapped in high dimensional space
#Then use truncated Singular Value Decomposition (a PCA-like reduction) to project the sparse TFIDF matrix onto its strongest components
train_text = TruncatedSVD(n_components=750).fit_transform(TfidfVectorizer().fit_transform(X_train))
train_text.shape

# For the TruncatedSVD step above we tried a range of n_components values before settling on 750:
# Components   Precision   Recall
# 200          79.6%       75.6%
# 500          80.3%       77.1%
# 700          81.3%       78.5%
# 750          81.7%       78.8%
# 800          81.4%       78.7%
# 1000         81.6%       78.4%
# 3000         80.0%       77.8%

(6090, 800)

Now we are ready to create a prediction model. We use repeated 10-fold cross-validation (3 repeats) so that the reported scores do not depend on a single split of the data:

In [74]:
#adapted from https://towardsdatascience.com/write-a-document-classifier-in-less-than-30-minutes-2d96a8a8820c
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
model = svm.SVC(kernel='linear', C=1, decision_function_shape='ovo')
metrics = cross_validate(model, train_text, y_train.values, scoring=['precision_macro', 'recall_macro'], cv=cv, n_jobs=-1)

print('Precision: %.3f (%.3f)' % (mean(metrics["test_precision_macro"]), std(metrics["test_precision_macro"])))
print('Recall: %.3f (%.3f)' % (mean(metrics["test_recall_macro"]), std(metrics["test_recall_macro"])))

Precision: 0.814 (0.016)
Recall: 0.787 (0.017)
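For readers unfamiliar with repeated k-fold cross-validation, the sketch below shows how the train/test index pairs are generated. `repeated_kfold_indices` is a hypothetical helper, not scikit-learn's `RepeatedKFold`, and it uses smaller settings than the notebook (5 folds and 2 repeats instead of 10 folds and 3 repeats).

```python
import random

def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=2, seed=1):
    """Yield (train_indices, test_indices) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                      # a fresh shuffle for every repeat
        # Deal the shuffled indices into n_splits folds of (almost) equal size
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test_idx = folds[k]                   # fold k is held out
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_idx, test_idx

splits = list(repeated_kfold_indices(n_samples=10, n_splits=5, n_repeats=2))
print(len(splits))  # 10
```

With the notebook's settings the model is fitted 30 times (10 folds times 3 repeats), and the printed mean and standard deviation summarize those 30 scores.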


The model performs reasonably well: averaged over both classes, roughly four out of five of its predictions are correct (precision of about 0.81), and it recovers roughly four out of five Tweets of each class (recall of about 0.79).  However, this leaves a lot of room for improvement, as this level of performance is not good enough to deploy in an emergency response organization. 
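To make the two reported metrics concrete, here is a small sketch of macro-averaged precision and recall on made-up labels. The helper names and the toy `y_true` / `y_pred` lists are assumptions for illustration; the notebook itself relies on scikit-learn's `precision_macro` and `recall_macro` scorers.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall when `positive` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of the per-class precision and recall."""
    classes = sorted(set(y_true))
    per_class = [precision_recall(y_true, y_pred, c) for c in classes]
    macro_p = sum(p for p, _ in per_class) / len(classes)
    macro_r = sum(r for _, r in per_class) / len(classes)
    return macro_p, macro_r

# Made-up labels: 1 = real disaster, 0 = not a disaster
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

macro_p, macro_r = macro_precision_recall(y_true, y_pred)
print(round(macro_p, 3), round(macro_r, 3))  # 0.792 0.792
```

Because macro averaging weights both classes equally, a model cannot score well simply by favoring the more common class.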