<h1 align=center><font size=5>Text Classification with NLTK</font></h1>

### Table of contents

- [Objective](#obj)
- [Data](#data)
- [Data Cleaning](#data_cleaning)
- [Bag of Words Model](#bow)
- [Bag of N-Grams Model](#ngrams)
- [TF-IDF Model](#tfidf)

### Objective <a id='obj'></a>

In this notebook, we will learn how to use the NLTK library for text cleaning. We will then use the Bag of Words (BoW), Bag of N-Grams, and TF-IDF models for text classification.

### Data <a id='data'></a>

Let us consider the 20 newsgroups dataset. You can access this dataset at the following URL: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html

In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism','talk.religion.misc','comp.graphics','sci.space']

newsgroups_train = fetch_20newsgroups(subset= "train",
                                remove= ("headers", "footers", "quotes"),
                                categories= categories, 
                                shuffle= True, random_state= 123)

newsgroups_test = fetch_20newsgroups(subset= "test",
                                remove= ("headers", "footers", "quotes"),
                                categories= categories, 
                                shuffle= True, random_state= 123)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [None]:
# Training set
X_train = newsgroups_train.data
y_train = newsgroups_train.target

# Test set
X_test = newsgroups_test.data
y_test = newsgroups_test.target

### Data Cleaning <a id='data_cleaning'></a>

&#x270d; Consider the following libraries and packages for further usage.

In [None]:
#!pip install --user -U nltk

import nltk
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
# Add or remove any additional stopwords
stopwords_list.extend(['the'])
#stopwords_list.remove('no')
#stopwords_list.remove('not')

from nltk.tokenize import word_tokenize

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

import string

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


&#x270d; Let us define a helper function that cleans the text data by performing the following operations:
1. Convert text to all lowercase letters
2. Remove leading and trailing whitespace
3. Tokenize text
4. Remove all tokens that are not alphabetic
5. Remove stopwords
6. Remove tokens shorter than four letters
7. Remove punctuation
8. Lemmatize or stem tokens

In [None]:
def clean_text(text):
    
    # Convert text to all lowercase letters
    text = text.lower()
    
    # Remove leading and trailing whitespace
    text = text.strip()
    
    # Tokenize text
    tokens = word_tokenize(text)
    
    # Remove all tokens that are not alphabetic
    tokens = [token for token in tokens if token.isalpha()]
    
    # Remove stopwords
    tokens = [token for token in tokens if token not in stopwords_list]
    
    # Remove tokens shorter than four letters
    tokens = [token for token in tokens if len(token) >= 4]
    
    # Remove punctuation (already covered by the alphabetic filter; kept as a safeguard)
    tokens = [token for token in tokens if token not in string.punctuation]
    
    # Lemmatize tokens, treating each one as a verb (pos='v')
    tokens = [lemmatizer.lemmatize(token, pos='v') for token in tokens]
    
    # Stem tokens
    #tokens = [stemmer.stem(token) for token in tokens]
    
    # Re-create text from filtered tokens, so that vectorizer won't complain
    text = ' '.join(tokens)
    return text
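To see what the filtering steps do without downloading any NLTK resources, here is a minimal standard-library sketch. It is only an approximation of `clean_text`: the regex tokenizer, the tiny `STOPWORDS` set, and the name `clean_text_sketch` are assumptions made for illustration, and lemmatization is left out.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook itself uses NLTK's full English stopword list.
STOPWORDS = {"is", "the", "for", "a", "an", "and", "of", "to"}

def clean_text_sketch(text, min_len=4):
    """Approximate the lowercasing, tokenizing, and filtering steps with the stdlib."""
    text = text.lower().strip()
    # Assumption: a simple regex split stands in for nltk.word_tokenize
    tokens = re.findall(r"\w+", text)
    # Keep alphabetic tokens only (drops '3ds', '3d', ...)
    tokens = [t for t in tokens if t.isalpha()]
    # Drop stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Drop tokens shorter than min_len letters
    tokens = [t for t in tokens if len(t) >= min_len]
    return tokens

sample = 'Is the ".3ds" file format for Autodesk\'s 3D Animation Studio available?\n\nThanks,\nGary'
cleaned_tokens = clean_text_sketch(sample)
print(cleaned_tokens)
```

For this sample the sketch keeps `file`, `format`, `autodesk`, `animation`, `studio`, `available`, `thanks`, and `gary`. The full `clean_text` function additionally lemmatizes, which is what turns `thanks` into `thank`.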

&#x270d; Apply the helper function to the training and test texts.

In [None]:
X_train_cleaned = []
for text in X_train:
    X_train_cleaned.append(clean_text(text))

X_test_cleaned = []
for text in X_test:
    X_test_cleaned.append(clean_text(text))

&#x270d; Print the first 5 samples of the training data before and after cleaning.

In [None]:
X_train_cleaned[3]

'understand difference studio mainly ipas interface along small fix ipas code run faster newest version'

In [None]:
print(X_train[:5])

['Is the ".3ds" file format for Autodesk\'s 3D Animation Studio available?\n\nThanks,\nGary', '\n[...stuff deleted...]\n\nComputers are an excellent example...of evolution without "a" creator.\nWe did not "create" computers.  We did not create the sand that goes\ninto the silicon that goes into the integrated circuits that go into\nprocessor board.  We took these things and put them together in an\ninteresting way. Just like plants "create" oxygen using light through \nphotosynthesis.  It\'s a much bigger leap to talk about something that\ncreated "everything" from nothing.  I find it unfathomable to resort\nto believing in a creator when a much simpler alternative exists: we\nsimply are incapable of understanding our beginnings -- if there even\nwere beginnings at all.  And that\'s ok with me.  The present keeps me\nperfectly busy.', "I am trying to configure Zsoft's PC Paintbrush IV+ for use with my\nLogitech Scanman 32 (hand scanner), but I can't get Paintbrush to\nacknowledge the s

In [None]:
print(X_train_cleaned[:5])

['file format autodesk animation studio available thank gary', 'stuff delete computers excellent example evolution without creator create computers create sand go silicon go integrate circuit processor board take things together interest like plant create oxygen use light photosynthesis much bigger leap talk something create everything nothing find unfathomable resort believe creator much simpler alternative exist simply incapable understand beginnings even beginnings present keep perfectly busy', 'try configure zsoft paintbrush logitech scanman hand scanner paintbrush acknowledge scanner anybody use paintbrush scanner help thank luis nobrega', 'understand difference studio mainly ipas interface along small fix ipas code run faster newest version', 'stuff delete mean like second minutes hours days months years remember fahrenheit temperature scale also centigrade scale revisionists tell history something like coldest point particular russian winter mark thermometer body temperature vol

### Bag of Words Model <a id='bow'></a>

In this part, we consider the Bag of Words (BoW) model for text classification. The BoW model extracts features from text based on how often each word occurs, disregarding grammar and word order. For further information, see: https://en.wikipedia.org/wiki/Bag-of-words_model
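As a rough illustration of the idea, the following sketch builds BoW vectors by hand for three toy documents. The documents and the helper name `bow_vector` are made up for this example; `CountVectorizer` performs the same kind of counting at scale.

```python
from collections import Counter

# Toy corpus, invented for illustration
toy_docs = [
    "space shuttle launch",
    "graphics card render graphics",
    "launch graphics demo",
]

# Vocabulary: every distinct word, sorted so the column order is deterministic
vocabulary = sorted({word for doc in toy_docs for word in doc.split()})

def bow_vector(doc):
    """Count how often each vocabulary word occurs in doc; word order is ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

bow_matrix = [bow_vector(doc) for doc in toy_docs]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row has one entry per vocabulary word, so the second document gets a `2` in the `graphics` column. Shuffling the words inside a document leaves its vector unchanged.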

&#x270d; Convert the training/test text data to BoW.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

# Training set
X_train_cv = cv.fit_transform(X_train_cleaned)
# Test set
X_test_cv = cv.transform(X_test_cleaned)

In [None]:
print(X_train_cv[2]) # (row index, feature/term index) value (that is, term frequency)

  (0, 15405)	1
  (0, 16344)	1
  (0, 15926)	1
  (0, 3082)	1
  (0, 17319)	1
  (0, 11039)	3
  (0, 8962)	1
  (0, 13555)	1
  (0, 6663)	1
  (0, 13556)	3
  (0, 144)	1
  (0, 761)	1
  (0, 6856)	1
  (0, 9042)	1
  (0, 10446)	1


In [None]:
type(X_train_cv)  # CountVectorizer returns a SciPy sparse matrix

In [None]:
X_train_cv = X_train_cv.toarray()
X_test_cv = X_test_cv.toarray()

In [None]:
X_train_cv.shape

(2034, 17334)
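A quick back-of-the-envelope calculation shows what the dense conversion costs for this shape. The sketch assumes 8 bytes per entry, which corresponds to the `int64` counts that `CountVectorizer` produces by default.

```python
# Shape reported above for the dense training matrix
n_docs, n_terms = 2034, 17334

# Assumption: 8 bytes per entry (int64 counts)
bytes_per_value = 8

dense_bytes = n_docs * n_terms * bytes_per_value
print(f"{dense_bytes / 1e6:.0f} MB")  # prints: 282 MB
```

Almost all of these entries are zeros. `MultinomialNB` also accepts sparse input, so `.toarray()` is mainly needed here for displaying the matrix as a DataFrame.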

&#x270d; Show the table of feature vectors for the training corpus.

In [None]:
import pandas as pd

# get all unique words in the corpus
feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
# show document feature vectors
docs = pd.DataFrame(X_train_cv, columns= feature_names)
docs.head()

Unnamed: 0,aangeboden,aangegeven,aantal,aarseth,aasked,aavso,abandon,abba,abbasids,abberation,abbreviate,abbreviation,abduct,abdullah,abekas,aberdeen,aberrations,abhor,abhorrent,abide,abilities,ability,abiliy,abingdon,abiogenesis,ablazing,able,aboard,abolish,abolishment,abolition,abolitionist,abolitionists,abomb,abord,abort,abortion,abortions,abound,abraham,...,zeus,zeven,zhao,zien,zijn,zillion,zillions,zion,zionist,zip,zogeheten,zombie,zond,zone,zonker,zoology,zoom,zopfi,zorastrian,zorg,zorn,zoroaster,zoroastrian,zoroastrianism,zoroastrians,zsoft,zubin,zuck,zues,zullen,zulu,zurbrin,zurich,zurvanism,zwaartepunten,zwak,zwakke,zware,zwarte,zyxel
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


&#x270d; Build a Naive Bayes classifier for the BoW model. 

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

clf = MultinomialNB(alpha=.01)
clf.fit(X_train_cv, y_train)

y_pred = clf.predict(X_test_cv)

accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: %0.1f%%" % (accuracy * 100))

Accuracy: 67.5%
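To make the role of `alpha` concrete, here is a small pure-Python sketch of multinomial Naive Bayes with additive smoothing. It is a simplified illustration, not scikit-learn's implementation; the function names and the toy count vectors are hypothetical.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """X: list of count vectors, y: list of class labels."""
    n_features = len(X[0])
    classes = sorted(set(y))
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        # Class prior: fraction of training documents in class c
        log_prior[c] = math.log(len(rows) / len(X))
        # Per-term totals in class c, with alpha added so no probability is zero
        totals = [sum(row[j] for row in rows) + alpha for j in range(n_features)]
        denom = sum(totals)
        log_likelihood[c] = [math.log(t / denom) for t in totals]
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(model, x):
    """Return the class with the highest log posterior score for count vector x."""
    classes, log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
        for c in classes
    }
    return max(scores, key=scores.get)

# Toy counts over a hypothetical vocabulary: [graphics, launch, render, space]
X_toy = [[2, 0, 1, 0], [1, 0, 1, 0], [0, 2, 0, 1], [0, 1, 0, 2]]
y_toy = [0, 0, 1, 1]

model = fit_multinomial_nb(X_toy, y_toy, alpha=1.0)
pred_graphics = predict_multinomial_nb(model, [1, 0, 0, 0])
pred_space = predict_multinomial_nb(model, [0, 1, 0, 1])
print(pred_graphics, pred_space)
```

A smaller `alpha`, such as the `.01` used above, moves the estimates closer to the raw term frequencies, while a larger value pulls all term probabilities toward a uniform distribution.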


### Bag of N-Grams Model <a id='ngrams'></a>

&#x270d; Now, build a classifier based on the bigram model. For a recap of n-gram models, see: https://en.wikipedia.org/wiki/N-gram
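The sketch below shows, in plain Python, which features a given n-gram range produces for a single cleaned text. The helper `word_ngrams` is hypothetical and only mimics the word-level behaviour of the `ngram_range` parameter.

```python
def word_ngrams(text, n_min, n_max):
    """Return all word n-grams of text for n_min <= n <= n_max."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens over the text
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

sample = "zsoft paintbrush logitech scanman"

bigrams_only = word_ngrams(sample, 2, 2)
unigrams_and_bigrams = word_ngrams(sample, 1, 2)

print(bigrams_only)
print(unigrams_and_bigrams)
```

With the range `(2, 2)` a text of four words yields three features, each a pair of adjacent words. The range `(1, 2)` keeps the four single words as well.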

In [None]:
import pandas as pd

cv = CountVectorizer(ngram_range=(2,2)) 
# ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams

X_train_cv = cv.fit_transform(X_train_cleaned)
X_test_cv = cv.transform(X_test_cleaned)

X_train_cv = X_train_cv.toarray()
X_test_cv = X_test_cv.toarray()

In [None]:
feature_names = cv.get_feature_names_out()
docs = pd.DataFrame(X_train_cv, columns= feature_names)
docs.head()

Unnamed: 0,aangeboden binnen,aangegeven inschrijving,aantal voordrachten,aarseth pioneer,aasked come,aavso american,abandon alpha,abandon atheism,abandon child,abandon colour,abandon develop,abandon fine,abandon moral,abandon pulpit,abandon sexuality,abandon theism,abandon woman,abba father,abbasids seem,abbasids tyre,abberation contrary,abberation well,abbreviate amorc,abbreviation quarterdeck,abduct ufos,abdullah salam,abekas code,abekas function,abekas smpte,aberdeen irit,aberdeen satellite,aberrations like,abhor even,abhor publish,abhorrent say,abide consequences,abide righteous,abide truth,abilities anyone,abilities piece,...,zoroastrianism throughout,zoroastrianism time,zoroastrians arrive,zoroastrians believe,zoroastrians call,zoroastrians claim,zoroastrians come,zoroastrians entire,zoroastrians especially,zoroastrians faith,zoroastrians fear,zoroastrians hundreds,zoroastrians india,zoroastrians many,zoroastrians much,zoroastrians point,zoroastrians procedure,zoroastrians raise,zoroastrians sometimes,zoroastrians value,zoroastrians year,zoroastrians zoroastrians,zsoft paintbrush,zubin mehta,zuck reply,zues odin,zullen alles,zullen ingaan,zulu time,zurbrin compact,zurich workbench,zurvanism thank,zurvanism zurvanism,zwaartepunten huidige,zwak waarnemingsperiode,zwakke radiosignaal,zware sterren,zwarte gaten,zyxel access,zyxel epimntl
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
clf = MultinomialNB()
clf.fit(X_train_cv, y_train)

# Recompute predictions with the newly fitted bigram classifier before scoring
y_pred = clf.predict(X_test_cv)

accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: %0.1f%%" % (accuracy * 100))


### TF-IDF Model <a id='tfidf'></a>

&#x270d; Build a classifier based on the term frequency–inverse document frequency (TF-IDF) model. For a recap of TF-IDF models, see: https://en.wikipedia.org/wiki/Tf–idf
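As a minimal sketch of the weighting, the code below computes TF-IDF by hand for three toy documents. It uses the smoothed form idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is how `TfidfVectorizer` behaves with its default settings; the toy documents and the helper name `tfidf_matrix` are made up for illustration.

```python
import math

def tfidf_matrix(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word
    df = {w: sum(w in tokens for tokens in tokenized) for w in vocabulary}
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}
    rows = []
    for tokens in tokenized:
        weights = [tokens.count(w) * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in weights))
        rows.append([v / norm if norm else 0.0 for v in weights])
    return vocabulary, rows

toy_docs = ["space launch", "space graphics", "space render"]
vocab, weights = tfidf_matrix(toy_docs)

print(vocab)
for row in weights:
    print([round(v, 3) for v in row])
```

The word `space` occurs in every document, so in each row it receives a lower weight than the word that is specific to that document.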

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)

X_train_tv = tv.fit_transform(X_train_cleaned)
X_test_tv = tv.transform(X_test_cleaned)

X_train_tv = X_train_tv.toarray()
X_test_tv = X_test_tv.toarray()

In [None]:
feature_names = tv.get_feature_names_out()
docs = pd.DataFrame(X_train_tv, columns= feature_names)
docs.head()

Unnamed: 0,aangeboden,aangegeven,aantal,aarseth,aasked,aavso,abandon,abba,abbasids,abberation,abbreviate,abbreviation,abduct,abdullah,abekas,aberdeen,aberrations,abhor,abhorrent,abide,abilities,ability,abiliy,abingdon,abiogenesis,ablazing,able,aboard,abolish,abolishment,abolition,abolitionist,abolitionists,abomb,abord,abort,abortion,abortions,abound,abraham,...,zeus,zeven,zhao,zien,zijn,zillion,zillions,zion,zionist,zip,zogeheten,zombie,zond,zone,zonker,zoology,zoom,zopfi,zorastrian,zorg,zorn,zoroaster,zoroastrian,zoroastrianism,zoroastrians,zsoft,zubin,zuck,zues,zullen,zulu,zurbrin,zurich,zurvanism,zwaartepunten,zwak,zwakke,zware,zwarte,zyxel
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.213483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
clf = MultinomialNB()
clf.fit(X_train_tv, y_train)

# Recompute predictions with the newly fitted TF-IDF classifier before scoring
y_pred = clf.predict(X_test_tv)

accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: %0.1f%%" % (accuracy * 100))


In [None]:
y_test

array([1, 1, 2, ..., 3, 1, 1])