# TEXT CLASSIFICATION

### **This project applies Natural Language Processing to text classification in Python using NLTK and scikit-learn. As the name suggests, we will classify SMS text messages as either spam or not spam (ham).**


### 1. Import Necessary Libraries

In [15]:
import sys
import nltk
import sklearn 
import pandas as pd
import numpy as np

### 2. Loading the Dataset

### **Now that our libraries are imported, let's load the data set as a Pandas DataFrame. We will also extract some useful information, such as the column information and the class distribution.**

### **The data set we will be using comes from the UCI Machine Learning Repository. It contains over 5,000 labeled SMS messages that have been collected for mobile phone spam research. It can be downloaded from the following URL:**

https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

In [16]:
## loading the dataset of SMS messages

Text_Data = pd.read_table('SMSSPamCollection', header=None, encoding='utf-8')

Text_Data

| | 0 | 1 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [17]:
## print useful information about the dataset

Text_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       5572 non-null   object
 1   1       5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [18]:
## checking the class distribution

Classes = Text_Data[0]

Classes.value_counts()

ham     4825
spam     747
Name: 0, dtype: int64
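
The class counts above show a strong imbalance, which matters when reading accuracy figures later on. As a rough sketch in plain Python (using only the two counts printed above, nothing else from the notebook), the share of each class and the accuracy of a trivial "always predict the majority class" baseline can be computed like this:

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class is right
# exactly as often as that class occurs in the data.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Always-'{majority_label}' baseline accuracy: {baseline_accuracy:.1%}")
```

Any model worth keeping should beat this baseline by a clear margin.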

## 2. Preprocessing the Data

### **Preprocessing the text data is an essential step in natural language processing. In the following cells, we will convert our class labels to binary values using the LabelEncoder from sklearn.**

In [19]:
from sklearn.preprocessing import LabelEncoder

##  Converting the class labels to binary values, 0 = ham and 1 = spam

Encoder = LabelEncoder()

Binary_Labels = Encoder.fit_transform(Classes)

Binary_Labels[:20] ## The first 20 instances

array([0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1])
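
LabelEncoder assigns integers to the distinct labels in sorted order, which is why 'ham' becomes 0 and 'spam' becomes 1 (the fitted order can be inspected through `Encoder.classes_`). The following is a minimal plain-Python sketch of that idea, not the library's implementation; `fit_label_mapping` and the sample labels are hypothetical names used only for illustration.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, following sorted label order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


sample_labels = ["ham", "ham", "spam", "ham", "spam"]  # hypothetical labels

mapping = fit_label_mapping(sample_labels)
encoded = [mapping[label] for label in sample_labels]

print(mapping)
print(encoded)
```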

## 2.1 Regular Expressions

### **Using regular expressions, we will replace email addresses, URLs, money symbols, phone numbers, and other numbers with placeholder tokens.**

In [20]:
Text_Messages = Text_Data[1]


## 1. Replacing email addresses with 'emailaddr'
##    (the ^ and $ anchors mean this matches only when the whole message is an email address)

Email_replaced = Text_Messages.str.replace(r'^.+@[^\.].*\.[a-z]{2,}$', 'emailaddr', regex=True)


## 2. Replacing URLs with 'webaddr' (anchored as well, so only whole-message URLs match)

Urls_replaced = Email_replaced.str.replace(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$', 'webaddr', regex=True)


## 3. Replacing money symbols with 'moneysymb'

Moneysymb_replaced = Urls_replaced.str.replace(r'£|\$', 'moneysymb', regex=True)


## 4. Replacing 10 digit phone numbers (formats include parentheses, spaces, no spaces, dashes) with 'phonenumbr'
##    (anchored as well, so only whole-message phone numbers match)

phnum_replaced = Moneysymb_replaced.str.replace(r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$', 'phonenumbr', regex=True)


## 5. Replacing numbers with 'numbr'

Num_replaced = phnum_replaced.str.replace(r'\d+(\.\d+)?', 'numbr', regex=True)


## 6. Removing punctuation, or to be specific, replacing it with a single space

Punc_Removed = Num_replaced.str.replace(r'[^\w\d\s]', ' ', regex=True)


## 7. Replacing whitespace between terms with a single space

Preprocessed = Punc_Removed.str.replace(r'\s+', ' ', regex=True)


## 8. Removing leading and trailing whitespace

Preprocessed = Preprocessed.str.replace(r'^\s+|\s+?$', '', regex=True)


## 9. Changing every word to lower case - Hello, HELLO, hello are all the same word

Preprocessed = Preprocessed.str.lower()


Preprocessed





0       go until jurong point crazy available only in ...
1                                 ok lar joking wif u oni
2       free entry in numbr a wkly comp to win fa cup ...
3             u dun say so early hor u c already then say
4       nah i don t think he goes to usf he lives arou...
                              ...                        
5567    this is the numbrnd time we have tried numbr c...
5568                  will ü b going to esplanade fr home
5569    pity was in mood for that so any other suggest...
5570    the guy did some bitching but i acted like i d...
5571                            rofl its true to its name
Name: 1, Length: 5572, dtype: object
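
To see the substitution steps on a single message, here is a small sketch using Python's built-in `re` module. It is not the notebook's code: the patterns are simplified, unanchored variants chosen for illustration (an assumption), so that emails, URLs, and phone numbers inside a longer message are replaced too. The `normalize` function and the sample message are hypothetical.

```python
import re

# Illustrative, unanchored patterns (assumptions, not the notebook's exact regexes).
# Order matters: specific tokens are replaced before the generic number pattern.
SUBSTITUTIONS = [
    (re.compile(r'\S+@\S+\.[a-z]{2,}'), 'emailaddr'),
    (re.compile(r'https?://\S+'), 'webaddr'),
    (re.compile(r'£|\$'), 'moneysymb'),
    (re.compile(r'\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}'), 'phonenumbr'),
    (re.compile(r'\d+(\.\d+)?'), 'numbr'),
    (re.compile(r'[^\w\s]'), ' '),   # punctuation -> space
    (re.compile(r'\s+'), ' '),       # collapse runs of whitespace
]


def normalize(message):
    """Apply each substitution in order, then trim and lower-case."""
    for pattern, token in SUBSTITUTIONS:
        message = pattern.sub(token, message)
    return message.strip().lower()


sample = "WIN £100 now! Call 555-123-4567 or visit http://example.com/prize"
print(normalize(sample))
# win moneysymbnumbr now call phonenumbr or visit webaddr
```

Note how '£100' turns into the single token 'moneysymbnumbr', the same token that shows up in the notebook's output.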

### **Next, we will remove stop words and reduce the remaining words to their stems, using the stopwords corpus and the Porter stemmer from the nltk library.**

In [21]:
from nltk.corpus import stopwords

## Removing stop words from text messages

stop_words = set(stopwords.words('english'))

## For each message, keep only the terms that are not in the stop_words set
## imported from the corpus, then join the remaining terms back into one string

Stopwords_removed = Preprocessed.apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))

Stopwords_removed ## Stop words have been removed

0       go jurong point crazy available bugis n great ...
1                                 ok lar joking wif u oni
2       free entry numbr wkly comp win fa cup final tk...
3                     u dun say early hor u c already say
4                  nah think goes usf lives around though
                              ...                        
5567    numbrnd time tried numbr contact u u moneysymb...
5568                          ü b going esplanade fr home
5569                                pity mood suggestions
5570    guy bitching acted like interested buying some...
5571                                       rofl true name
Name: 1, Length: 5572, dtype: object
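
The filtering step can be illustrated on one message without NLTK. This sketch assumes a tiny, hand-picked stop-word set (hypothetical; the notebook uses NLTK's full English list) and a helper named `remove_stop_words` introduced only for this example.

```python
# Tiny hypothetical stop-word set; the notebook uses NLTK's full English list
stop_words = {"so", "then", "i", "to", "in", "a"}


def remove_stop_words(message, stop_words):
    """Drop every whitespace-separated term that is a stop word."""
    return " ".join(term for term in message.split() if term not in stop_words)


message = "u dun say so early hor u c already then say"
print(remove_stop_words(message, stop_words))
# u dun say early hor u c already say
```

Using a set rather than a list for `stop_words` keeps each membership test constant-time, which adds up over thousands of messages.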

In [22]:
# Reducing each word to its stem using a Porter stemmer

Stemmer = nltk.PorterStemmer()

Wordstem_removed = Stopwords_removed.apply(lambda x: ' '.join(Stemmer.stem(term) for term in x.split()))


Wordstem_removed  ## Words have been reduced to their stems

0       go jurong point crazi avail bugi n great world...
1                                   ok lar joke wif u oni
2       free entri numbr wkli comp win fa cup final tk...
3                     u dun say earli hor u c alreadi say
4                    nah think goe usf live around though
                              ...                        
5567    numbrnd time tri numbr contact u u moneysymbnu...
5568                              ü b go esplanad fr home
5569                                    piti mood suggest
5570    guy bitch act like interest buy someth els nex...
5571                                       rofl true name
Name: 1, Length: 5572, dtype: object

## 3. Generating Features

### **This step is part of feature engineering: the words in each text message will be our features. For this purpose, it is necessary to tokenize each message into words. We will use 2000 words from the vocabulary as features. (The choice of 2000 is arbitrary; any number up to the vocabulary size works. Keep in mind that more words means more features and a longer training time.)**

In [23]:
from nltk.tokenize import word_tokenize

Preprocessed = Wordstem_removed

# creating bag-of-words

All_words = [] ## empty list


for message in Preprocessed:
    
    Words = word_tokenize(message)
    
    for word in Words:
        
        All_words.append(word) ## Appending each word to the list All_words
        

All_words = nltk.FreqDist(All_words) ## Building a frequency distribution of all words,
                                     ## i.e. how many times each word occurs.

All_words

FreqDist({'numbr': 2648, 'u': 1207, 'call': 674, 'go': 456, 'get': 451, 'ur': 391, 'gt': 318, 'lt': 316, 'come': 304, 'moneysymbnumbr': 303, ...})

In [24]:
# Let us print the total number of words and the 10 most common words.

print('Number of words: {}'.format(len(All_words)))

print('Most common words: {}'.format(All_words.most_common(10)))

Number of words: 6579
Most common words: [('numbr', 2648), ('u', 1207), ('call', 674), ('go', 456), ('get', 451), ('ur', 391), ('gt', 318), ('lt', 316), ('come', 304), ('moneysymbnumbr', 303)]


In [25]:
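# using 2000 words as features; note that keys() returns words in first-seen order,
# so this takes the first 2000 distinct words rather than the 2000 most frequent ones
# (All_words.most_common(2000) would give the latter)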

Word_features = list(All_words.keys())[:2000]

Word_features[:10] 

['go',
 'jurong',
 'point',
 'crazi',
 'avail',
 'bugi',
 'n',
 'great',
 'world',
 'la']
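
The list printed above starts with the words of the first message, not with the most frequent words such as 'numbr' or 'u'. That is because slicing `keys()` follows first-seen order. The sketch below shows the difference using `collections.Counter` (which `nltk.FreqDist` subclasses); the token list is a made-up toy example.

```python
from collections import Counter

# Toy token stream (hypothetical); counts: go=1, free=1, call=2, u=3, numbr=4
tokens = "go free call u call u u numbr numbr numbr numbr".split()
freq = Counter(tokens)

# Slicing the keys keeps first-seen order
first_three = list(freq.keys())[:3]

# most_common() sorts by descending count
top_three = [word for word, _ in freq.most_common(3)]

print(first_three)  # ['go', 'free', 'call']
print(top_three)    # ['numbr', 'u', 'call']
```

Switching the notebook to `most_common(2000)` would change the feature set, so the accuracy figures reported later would not carry over unchanged.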

### **Next, we define a function called find_features that determines which of these 2000 word features are contained in each message.**

In [26]:
# The find_features function determines which of the 2000 word features are contained in a message

def find_features(message):
    
    Words = word_tokenize(message)
    
    Features = {}
    
    for word in Word_features:
        
        Features[word] = (word in Words)
    
    return Features



In [27]:
# Let's see an example!

features = find_features(Preprocessed[0])

for key, value in features.items():
    if value:
        print(key)

## Works fine!

go
jurong
point
crazi
avail
bugi
n
great
world
la
e
buffet
cine
got
amor
wat
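
Each call to find_features tests 2000 words against the token list of one message, and a membership test on a list scans it element by element. A possible variant, sketched here under the assumption that plain whitespace splitting is an acceptable stand-in for `word_tokenize` on already-cleaned text, converts the tokens to a set first. The names `find_features_fast` and `word_features` are hypothetical.

```python
def find_features_fast(message, word_features, tokenize=str.split):
    """Return {word: bool} marking which feature words occur in the message."""
    token_set = set(tokenize(message))  # set lookup instead of list scan
    return {word: (word in token_set) for word in word_features}


word_features = ["go", "jurong", "free", "call"]  # toy feature list
features = find_features_fast("go jurong point crazi", word_features)

present = [word for word, flag in features.items() if flag]
print(present)  # ['go', 'jurong']
```

The returned dictionary has the same shape as the one produced by find_features, so it could be passed to the same downstream code.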


In [28]:
## Now let's do it for all the messages

Messages = list(zip(Preprocessed,Binary_Labels))


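## setting a seed for reproducibility (the seed must be passed by calling
## np.random.seed; assigning to it would replace the function and seed nothing)

seed = 1

np.random.seed(seed)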

np.random.shuffle(Messages)


##  Calling the  find_features function for each SMS message

Feature_sets = [(find_features(text), label) for (text, label) in Messages]
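
A seed only makes a shuffle reproducible if it is actually handed to the random number generator through a call (for NumPy's global generator that call is `np.random.seed(seed)`). As a small standard-library sketch of the same idea, with hypothetical names and toy data, two shuffles driven by generators created from the same seed produce the same order:

```python
import random


def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a generator seeded with `seed`."""
    rng = random.Random(seed)
    result = list(items)
    rng.shuffle(result)
    return result


# Toy (text, label) pairs in the same shape as Messages
messages = [("free entri numbr", 1), ("ok lar joke", 0),
            ("u dun say", 0), ("call numbr", 1)]

first = shuffled_copy(messages, seed=1)
second = shuffled_copy(messages, seed=1)

print(first == second)  # True
```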


### Now we will split the Feature_sets into training and testing datasets using sklearn

In [29]:
from sklearn.model_selection import train_test_split

# splitting the data into training and testing datasets

Training,Testing = train_test_split(Feature_sets, test_size = 0.25, random_state=1)

In [31]:
print(len(Training))
print(len(Testing))

4179
1393
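
The two sizes follow directly from the 0.25 test fraction. As a quick arithmetic check (assuming, as scikit-learn does for a fractional test_size, that the test count is rounded up and the remainder goes to training):

```python
import math

n_samples = 5572   # total number of messages
test_size = 0.25   # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 4179 1393
```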


### 4. Scikit-Learn Classifiers with NLTK

Now that we have our training dataset, we need to import each algorithm we plan to use from sklearn. We also need to import some performance metrics, such as accuracy_score and classification_report.

In [39]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.linear_model import RidgeClassifier


In [40]:
# Defining the models to train

names = ["K Nearest Neighbors", "Decision Tree", "Random Forest", "Logistic Regression", "SGD Classifier",
         "Naive Bayes", "SVM Linear","AdaBoost Classifier","Gradient Boosting Classifier","Bagging Classifier","Extra Trees Classifier",
         "Ridge Classifier"]

classifiers = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    LogisticRegression(),
    SGDClassifier(max_iter = 100),
    MultinomialNB(),
    SVC(kernel = 'linear'),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    BaggingClassifier(),
    ExtraTreesClassifier(),
    RidgeClassifier(max_iter=100)]


models = zip(names, classifiers)


for name, model in models:
    
    nltk_model = SklearnClassifier(model)
    
    nltk_model.train(Training)
    
    accuracy = nltk.classify.accuracy(nltk_model, Testing)*100
    
    print("{} Accuracy: {}".format(name, accuracy))

K Nearest Neighbors Accuracy: 92.96482412060301
Decision Tree Accuracy: 97.20028715003589
Random Forest Accuracy: 98.85139985642498
Logistic Regression Accuracy: 98.63603732950466
SGD Classifier Accuracy: 98.34888729361091
Naive Bayes Accuracy: 97.77458722182341
SVM Linear Accuracy: 98.7078248384781
AdaBoost Classifier Accuracy: 98.06173725771716
Gradient Boosting Classifier Accuracy: 98.34888729361091
Bagging Classifier Accuracy: 98.1335247666906
Extra Trees Classifier Accuracy: 98.92318736539842
Ridge Classifier Accuracy: 97.4156496769562


## Ensemble methods - Voting classifier (combining all of the classifiers above into a single classification algorithm)

In [49]:
from sklearn.ensemble import VotingClassifier

names2 = ["K Nearest Neighbors", "Decision Tree", "Random Forest", "Logistic Regression", "SGD Classifier",
         "Naive Bayes", "SVM Linear","AdaBoost Classifier","Gradient Boosting Classifier","Bagging Classifier","Extra Trees Classifier",
         "Ridge Classifier"]

classifiers2 = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    LogisticRegression(),
    SGDClassifier(max_iter = 100),
    MultinomialNB(),
    SVC(kernel = 'linear'),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    BaggingClassifier(),
    ExtraTreesClassifier(),
    RidgeClassifier(max_iter=100)]


models2 = list(zip(names2,classifiers2))

nltk_ensemble = SklearnClassifier(VotingClassifier(estimators = models2, voting = 'hard', n_jobs = -1))
nltk_ensemble.train(Training)
accuracy = nltk.classify.accuracy(nltk_ensemble, Testing)*100  # evaluate the ensemble, not the last model from the loop
print("Voting Classifier: Accuracy: {}".format(accuracy))

Voting Classifier: Accuracy: 98.7078248384781
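
With voting = 'hard', each classifier casts one vote per message and the most frequent label wins. The following is a minimal sketch of that rule in plain Python, not scikit-learn's implementation; `hard_vote` and the vote lists are hypothetical. Ties are resolved in favour of the lowest label here, which matches the tie-breaking behaviour described in scikit-learn's documentation (ascending label order).

```python
from collections import Counter


def hard_vote(predictions):
    """Return the majority label; ties go to the lowest label."""
    counts = Counter(predictions)
    # max() keeps the first maximal item, so iterating labels in
    # ascending order makes the lowest label win a tie.
    return max(sorted(counts), key=counts.get)


# Each inner list holds the labels predicted by several classifiers
# for one message (0 = ham, 1 = spam); toy data.
votes_per_message = [
    [1, 1, 0],
    [0, 0, 1],
    [0, 1],     # a tie
]

ensemble = [hard_vote(votes) for votes in votes_per_message]
print(ensemble)  # [1, 0, 0]
```

With twelve classifiers, as in this notebook, a 6-6 tie is possible, so the tie-breaking rule can affect individual predictions.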


In [50]:
## make class label prediction for testing set

txt_features, labels = zip(*Testing)

prediction = nltk_ensemble.classify_many(txt_features)

In [51]:
# print a confusion matrix and a classification report

print(classification_report(labels, prediction))

pd.DataFrame(
    confusion_matrix(labels, prediction),
    index = [['actual', 'actual'], ['ham', 'spam']],
    columns = [['predicted', 'predicted'], ['ham', 'spam']])

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1205
           1       1.00      0.90      0.95       188

    accuracy                           0.99      1393
   macro avg       0.99      0.95      0.97      1393
weighted avg       0.99      0.99      0.99      1393



| | predicted ham | predicted spam |
|---|---|---|
| actual ham | 1205 | 0 |
| actual spam | 18 | 170 |


### 0 is our ham class and 1 is our spam class.
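
The figures in the classification report for the spam class can be re-derived from the four cells of the confusion matrix above. This short sketch uses only those four numbers:

```python
# Cells of the confusion matrix above, with spam (1) as the positive class
tn, fp = 1205, 0    # actual ham:  predicted ham, predicted spam
fn, tp = 18, 170    # actual spam: predicted ham, predicted spam

precision = tp / (tp + fp)   # how many predicted spam messages are spam
recall = tp / (tp + fn)      # how many spam messages were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=1.00 recall=0.90 f1=0.95 accuracy=0.99
```

No ham message was flagged as spam, while 18 spam messages slipped through, which is why precision is perfect and recall is lower.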
