# Text Feature Extraction using ML models 

# Installation and Setup

### From the command line or terminal:
> `conda install scikit-learn`
> <br>*or*<br>
> `pip install -U scikit-learn`

Scikit-learn additionally requires that NumPy and SciPy be installed. For more info visit http://scikit-learn.org/stable/install.html

# Perform Imports and Load Data
For this exercise we'll be using the **SMSSpamCollection** dataset from [UCI datasets](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) that contains more than 5 thousand SMS phone messages.<br>You can check out the [**sms_readme**](../TextFiles/sms_readme.txt) file for more info.

The file is a [tab-separated-values](https://en.wikipedia.org/wiki/Tab-separated_values) (tsv) file with four columns:
> **label** - every message is labeled as either ***ham*** or ***spam***<br>
> **message** - the message itself<br>
> **length** - the number of characters in each message<br>
> **punct** - the number of punctuation characters in each message
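
As a rough illustration of the last two columns, the sketch below derives both counts for a single message in plain Python. It assumes that "punctuation" means the characters in `string.punctuation`; the dataset's readme may define it differently. `count_features` and the sample message are hypothetical and not part of the dataset.

```python
import string

def count_features(message):
    """Return (length, punct) for one message.

    Assumption: punctuation is any character found in string.punctuation.
    """
    length = len(message)
    punct = sum(1 for ch in message if ch in string.punctuation)
    return length, punct

# Made-up message, for illustration only
sample = "Free entry!!! Reply now, it is easy."
print(count_features(sample))
```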

# Classification of SMS messages using the length and punct columns

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df=pd.read_csv('smsspamcollection.tsv',sep='\t')
df.head()

## Check for missing values:
Machine learning models usually require complete data.

In [None]:
df.isnull()
# Returns a DataFrame of booleans: True where a value is missing, False otherwise
# When summed, True counts as 1 and False as 0

In [None]:
df.isnull().sum()
# Every column sums to 0, so no values are missing

In [None]:
len(df) # number of rows: 5572

## Take a quick look at the *ham* and *spam* `label` column:

In [None]:
df['label'].unique()

In [None]:
df['label'].value_counts()

<font color=green>We see that 4825 out of 5572 messages, or 86.6%, are ham.<br>This means that any machine learning model we create has to perform **better than 86.6%** to beat the trivial strategy of labeling every message as ham.</font>
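
The 86.6% figure is simply the accuracy of a classifier that always answers *ham*. A minimal sketch of that arithmetic, using only the counts quoted above:

```python
from collections import Counter

# Class counts quoted above: 4825 ham out of 5572 messages
label_counts = Counter({"ham": 4825, "spam": 5572 - 4825})

# The most frequent label is what an "always guess the majority" model predicts
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(label_counts.values())

print(f"Always predicting '{majority_label}' scores {baseline_accuracy:.1%}")
```

Any model that scores below this number is doing worse than ignoring the features altogether.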

In [None]:
# Goal: build a simple ML model that predicts whether a message is ham or spam from its length and punctuation count


## Visualize the data:
Since we're not ready to do anything with the message text, let's see if we can predict ham/spam labels based on message length and punctuation counts. We'll look at message `length` first:

In [None]:
df['length'].describe()

<font color=green>This dataset is extremely skewed. The mean value is 80.5 and yet the max length is 910. Let's plot this on a logarithmic x-axis.</font>

In [None]:
# Compare the length distributions of ham and spam; spam messages may tend to be longer
%matplotlib inline

plt.xscale('log')
bins=1.5**(np.arange(0,15))
plt.hist(df[df['label']=='ham']['length'],bins=bins,alpha=0.8)
plt.hist(df[df['label']=='spam']['length'],bins=bins,alpha=0.8)
plt.legend(('ham','spam'))
plt.show()
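
To see what the `bins` expression produces, the sketch below rebuilds the same edges in plain Python. Each edge is 1.5 times the previous one, so the bars have equal width on a logarithmic axis. Note that the last edge is roughly 292, and histogram functions ignore values outside the outermost edges, so the few messages longer than that do not appear in the plot.

```python
# Same edges as 1.5**np.arange(0, 15), built without NumPy
edges = [1.5 ** k for k in range(15)]

print(len(edges))           # number of edges (one more than the number of bins)
print(edges[0], edges[-1])  # smallest and largest edge

# The ratio between consecutive edges is constant, which is what makes
# the bins equally wide on a log-scaled x-axis
ratios = [b / a for a, b in zip(edges, edges[1:])]
```

Raising the upper limit of the range (for example `range(18)`) would extend the edges past the longest message, at the cost of a few nearly empty bins.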

<font color=green>It looks like there's a small range of values where a message is more likely to be spam than ham.</font>

Now let's look at the `punct` column:

In [None]:
df['punct'].describe()

In [None]:
plt.xscale('log')
bins=1.5**(np.arange(0,15))
plt.hist(df[df['label']=='ham']['punct'],bins=bins,alpha=0.8)
plt.hist(df[df['label']=='spam']['punct'],bins=bins,alpha=0.8)
plt.legend(('ham','spam'))
plt.show()
# No clear separation between ham and spam on the basis of punct

<font color=green>This looks even worse - there seem to be no values where one would pick spam over ham. We'll still try to build a machine learning classification model, but we should expect poor results.</font>

___
# Split the data into train & test sets:



In [None]:
# Create feature and label sets
X=df[['length','punct']] # double brackets: we pass a list of column names, so X is a DataFrame
y=df['label']
# X is our feature data
# y is our label

## Additional train/test/split arguments:
The default test size for `train_test_split` is 25%. Here we'll assign 30% of the data for testing.<br>
Also, we can set a `random_state` seed value to ensure that everyone uses the same "random" training & testing sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)

In [None]:
X_train.shape #3900 rows with 2 columns

In [None]:
X_test.shape #1672 rows with 2 columns
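
The two shapes follow from how a fractional `test_size` is turned into row counts. The sketch below assumes that scikit-learn rounds the test portion up and gives the remainder to training, which matches the shapes shown above. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Row counts for a fractional test_size.

    Assumption: the test set is rounded up and the training set gets the rest.
    """
    n_test = math.ceil(n_samples * test_size)
    n_train = n_samples - n_test
    return n_train, n_test

# 5572 rows with test_size=0.3, as in the cell above
print(split_sizes(5572, 0.3))
```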

Now we can pass these sets into a series of different training & testing algorithms and compare their results.

___
# Train a Logistic Regression classifier
One of the simplest classification tools is [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Our ham/spam task is a binary problem, although the same model also extends to multi-class data. Scikit-learn offers a variety of algorithmic solvers; we'll use [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS).

In [None]:
# Step 1: import the model class
# Step 2: create an instance of the model
# Step 3: fit the model to the training data
# Step 4: use the fitted model to predict labels for the test data
# solver selects the optimization algorithm; 'lbfgs' is the default in recent scikit-learn versions and also handles multiclass problems


In [None]:
from sklearn.linear_model import LogisticRegression
lr_model=LogisticRegression(solver='lbfgs')
lr_model.fit(X_train,y_train)
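
Under the hood, a fitted binary logistic regression scores a message with a weighted sum of its features and squashes that score into a probability with the sigmoid function. A minimal sketch of that prediction step follows. The weights and intercept are made-up placeholders, not values learned from this dataset, and `predict_spam_probability` is a hypothetical helper.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_spam_probability(length, punct, weights, intercept):
    """P(spam) for one message under a binary logistic regression model."""
    score = weights[0] * length + weights[1] * punct + intercept
    return sigmoid(score)

# Hypothetical parameters, for illustration only
weights = (0.01, -0.05)
intercept = -2.5

p = predict_spam_probability(length=120, punct=4, weights=weights, intercept=intercept)
label = "spam" if p >= 0.5 else "ham"  # 0.5 is the usual decision threshold
print(round(p, 3), label)
```

After fitting, the learned counterparts of these placeholders are available as `lr_model.coef_` and `lr_model.intercept_`.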

## Test the Accuracy of the Model

In [None]:
from sklearn import metrics
# Create a prediction set:
predications=lr_model.predict(X_test)
#Print a confusion matrix:
print(metrics.confusion_matrix(y_test,predications))

In [None]:
# You can make the confusion matrix less confusing by adding labels:
df_confusion=pd.DataFrame(metrics.confusion_matrix(y_test,predications),index=['ham','spam'],columns=['ham','spam'])
df_confusion
# 1404 messages were correctly classified as ham and 5 as spam

<font color=green>These results are terrible! More spam messages were confused as ham (241) than correctly identified as spam (5), although a relatively small number of ham messages (46) were confused as spam.</font>

In [None]:
# Print a classification report
print(metrics.classification_report(y_test,predications))
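
The report's columns can be reproduced by hand from confusion-matrix counts. The sketch below does so for the spam class, using the logistic regression figures quoted in this notebook: 5 spam messages caught, 241 missed, and 46 ham messages flagged as spam (the 46 comes from the 241+46=287 tally). `class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1

# Spam class: tp = spam caught, fp = ham flagged as spam, fn = spam missed
precision, recall, f1 = class_metrics(tp=5, fp=46, fn=241)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recall is the more telling number here: only about 2% of the spam messages are caught.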

In [None]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predications))
#84.3% accuracy

<font color=green>This model performed *worse* than a classifier that assigned all messages as "ham" would have!</font>

___
# Train a naïve Bayes classifier:
One of the most common - and successful - classifiers is [naïve Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes).

In [None]:
from sklearn.naive_bayes import MultinomialNB
nb_model=MultinomialNB()
nb_model.fit(X_train,y_train)

## Run predictions and report on metrics

In [None]:
predications=nb_model.predict(X_test)
print(metrics.confusion_matrix(y_test,predications))
# The model no longer identifies any spam messages correctly

<font color=green>The total number of confusions dropped from **287** to **256**. [241+46=287, 246+10=256]</font>

In [None]:
print(metrics.classification_report(y_test,predications))

In [None]:
print(metrics.accuracy_score(y_test,predications))

<font color=green>Better, but still less accurate than 86.6%</font>

___
# Train a support vector machine (SVM) classifier
Among the SVM options available, we'll use [C-Support Vector Classification (SVC)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)

In [None]:
from sklearn.svm import SVC
svc_model=SVC(gamma='auto') # 'auto' was the default before scikit-learn 0.22; the current default is 'scale'
svc_model.fit(X_train,y_train)
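
`SVC` uses an RBF kernel by default, and `gamma` controls how quickly that kernel's similarity falls off with distance. The sketch below shows how the two string settings translate into numbers: `'auto'` is 1 / n_features, while `'scale'` also divides by the variance of all feature values. The helper names and the toy data are hypothetical.

```python
import math
import statistics

def gamma_auto(X):
    """gamma='auto': 1 / n_features."""
    return 1.0 / len(X[0])

def gamma_scale(X):
    """gamma='scale': 1 / (n_features * variance of all values in X)."""
    flat = [value for row in X for value in row]
    return 1.0 / (len(X[0]) * statistics.pvariance(flat))

def rbf_kernel(a, b, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Toy (length, punct) rows, for illustration only
X_toy = [[20, 1], [150, 8], [60, 3]]

print(gamma_auto(X_toy))   # 0.5, because there are two features
print(gamma_scale(X_toy))  # much smaller, because lengths vary widely
print(rbf_kernel(X_toy[0], X_toy[1], gamma_auto(X_toy)))
```

With unscaled features such as raw message lengths, `gamma='auto'` makes distant points look almost completely dissimilar, which is one reason feature scaling is often recommended before fitting an SVM.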

## Run predictions and report on metrics

In [None]:
predications=svc_model.predict(X_test)
print(metrics.confusion_matrix(y_test,predications))

<font color=green>The total number of confusions dropped even further to **209**.</font>

In [None]:
print(metrics.classification_report(y_test,predications))

In [None]:
print(metrics.accuracy_score(y_test,predications))

<font color=green>And finally we have a model that performs *slightly* better than the 86.6% baseline of labeling every message as ham.</font>