[back](./02-count-vectorization-and-tfidf.ipynb)

---
## `NLP Example with Spam`

In [1]:
import pandas as pd
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix


In [2]:
# revisit spam-ham example

df = pd.read_table('../../assets/SMSSpamCollection', header=None)
df.head(3)

|   | 0 | 1 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |


### `Data setup`

In [3]:
df.columns=['spam', 'msg']
df.head(2)

|   | spam | msg |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |


In [4]:
nltk.download('stopwords')
nltk.download('punkt')
stopwords = set(nltk.corpus.stopwords.words('english'))
punctuation_set = set(string.punctuation)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/goutham/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/goutham/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
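# Drop stopwords and stand-alone punctuation tokens, splitting on whitespace.
# The NLTK stopword list is lowercase and this runs before lowercasing, so
# capitalised stopwords (e.g. "The") and punctuation attached to words remain.
df['msg_cleaned'] = df.msg.apply(
    lambda x: ' '.join(
        word for word in x.split()
        if word not in stopwords and word not in punctuation_set
    )
)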
df.head(2)

|   | spam | msg | msg_cleaned |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | Go jurong point, crazy.. Available bugis n gre... |
| 1 | ham | Ok lar... Joking wif u oni... | Ok lar... Joking wif u oni... |


In [6]:
df['msg_cleaned'] = df.msg_cleaned.str.lower()
df.head(2)

|   | spam | msg | msg_cleaned |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | go jurong point, crazy.. available bugis n gre... |
| 1 | ham | Ok lar... Joking wif u oni... | ok lar... joking wif u oni... |
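
The cleaning cells filter stopwords before lowercasing and only drop tokens that consist of a single punctuation character. A possible variant, sketched below, lowercases first and strips punctuation characters from inside tokens. The stopword set, the sample message, and the `clean_message` helper are hypothetical stand-ins so that the snippet runs without NLTK data; this is not the notebook's method.

```python
import string

# Hypothetical, tiny stand-in for nltk.corpus.stopwords.words('english').
STOPWORDS = {"the", "is", "only", "until", "so"}

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def clean_message(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation characters, then drop stopwords."""
    tokens = text.lower().translate(STRIP_PUNCT).split()
    return " ".join(tok for tok in tokens if tok not in stopwords)


# Hypothetical message, not taken from the SMS dataset.
cleaned = clean_message("The offer is valid ONLY until Friday, so reply now!!")
print(cleaned)  # offer valid friday reply now
```

With a real DataFrame this could be applied as `df.msg.apply(clean_message)`. Note that deleting punctuation also merges contractions (for example `don't` becomes `dont`), which may or may not be desirable.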


### `Count Vectorization`

In [7]:
count_vect = CountVectorizer()

In [8]:
X = count_vect.fit_transform(df.msg_cleaned)
X

<5572x8703 sparse matrix of type '<class 'numpy.int64'>'
	with 54276 stored elements in Compressed Sparse Row format>

In [9]:
X.shape

(5572, 8703)
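
The shape `(5572, 8703)` means one row per message and one column per distinct term in the vocabulary. The sketch below illustrates that idea in plain Python. It is a rough approximation, not scikit-learn's implementation: it builds dense lists instead of a sparse matrix, the `count_vectorize` helper and the toy documents are hypothetical, and the token pattern is the default documented for `CountVectorizer` (runs of two or more word characters, which is why a single-letter token such as `u` is dropped).

```python
import re
from collections import Counter

# Default token pattern documented for CountVectorizer:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and return its tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def count_vectorize(docs):
    """Return (vocabulary, rows) with one row of term counts per document."""
    vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {term: col for col, term in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term, n in Counter(tokenize(doc)).items():
            row[index[term]] = n
        rows.append(row)
    return vocabulary, rows


# Hypothetical toy corpus: 3 documents.
docs = ["ok lar joking wif u oni", "free entry free prize", "ok free"]
vocabulary, rows = count_vectorize(docs)

print(vocabulary)
for row in rows:
    print(row)
```

Here the result has 3 rows and 8 columns, the toy analogue of the 5572 messages and 8703 terms above. Most entries are zero, which is why scikit-learn stores the real matrix in a sparse format.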

In [10]:
y = df.spam

In [11]:
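# Default split: 75% train, 25% test. No random_state is set, so the split
# (and the scores below) change from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y)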

In [12]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)
lr.score(X_test, y_test)

0.9777458722182341
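
In the cells above the vectorizer is fitted on every message before the split, so the vocabulary already includes words that only occur in the test set. One way to avoid this, sketched here under assumptions, is to split the raw text first and wrap the vectorizer and classifier in a pipeline. The toy `messages` and `labels` are hypothetical stand-ins for `df.msg_cleaned` and `df.spam`, and the `stratify` and `random_state` arguments are additions for a reproducible, class-balanced split rather than settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df.msg_cleaned and df.spam.
messages = [
    "free prize claim now", "win cash free entry", "urgent free offer call",
    "claim your free reward", "free tickets win now", "cash prize urgent call",
    "see you at lunch", "ok talk later", "are you coming home",
    "call me when free", "lunch at noon ok", "home late tonight",
]
labels = ["spam"] * 6 + ["ham"] * 6

# Split the raw text first; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=42
)

# The vectorizer inside the pipeline only ever sees the training messages.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(list(predictions))
print(model.score(X_test, y_test))
```

Words that appear only in the test messages are simply ignored at prediction time, which mirrors what happens when the model meets unseen text.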

In [13]:
y_test.head(5)

3867     ham
4092     ham
5137    spam
3992     ham
4510     ham
Name: spam, dtype: object

In [14]:
y_pred[0:5]

array(['ham', 'ham', 'spam', 'ham', 'ham'], dtype=object)

### `Conclusion`

In [15]:
confusion_matrix(y_test, y_pred)

array([[1204,    1],
       [  30,  158]])

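Rows are the true labels and columns are the predicted labels, in the order *ham*, *spam*. The diagonal holds the correct predictions: cell $(0,0)$ counts *ham* predicted as *ham* ($1204$) and cell $(1,1)$ counts *spam* predicted as *spam* ($158$).  
The off-diagonal cells are the errors. Treating *spam* as the positive class, cell $(0,1)$ is the *type I error* (false positive) and cell $(1,0)$ is the *type II error* (false negative). That is, we mislabelled *ham* as *spam* once and mislabelled *spam* as *ham* $30$ times.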
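
Because *ham* is far more common than *spam* in this test set, accuracy alone can look flattering. As a small worked example, the snippet below derives accuracy, precision, and recall for the *spam* class directly from the four counts in the confusion matrix, along with the accuracy of a baseline that always predicts *ham*. The variable names are illustrative.

```python
# Counts copied from the confusion matrix above
# (rows: true ham, true spam; columns: predicted ham, predicted spam).
tn, fp = 1204, 1   # ham predicted as ham, ham predicted as spam
fn, tp = 30, 158   # spam predicted as ham, spam predicted as spam

total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all messages labelled correctly
precision = tp / (tp + fp)     # of messages flagged as spam, share that are spam
recall = tp / (tp + fn)        # of actual spam, share that was caught
baseline = (tn + fp) / total   # accuracy of always predicting ham

print(f"accuracy  = {accuracy:.4f}")   # 0.9777
print(f"precision = {precision:.4f}")  # 0.9937
print(f"recall    = {recall:.4f}")     # 0.8404
print(f"baseline  = {baseline:.4f}")   # 0.8650
```

The accuracy matches the value returned by `lr.score`. Recall is the weaker number: about 16% of the spam messages in this split slip through as *ham*.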

