[back](./03-nlp-example-with-spam.ipynb)

---
## `Tweak NLP model with Spam data`

### `Imports and Setup`

In [1]:
import pandas as pd
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
df = pd.read_table('../../assets/SMSSpamCollection', header=None)
df.columns=['spam', 'msg']
nltk.download('stopwords')
nltk.download('punkt')
stopwords = set(nltk.corpus.stopwords.words('english'))
punctuation_set = set(string.punctuation)
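# Drop stopwords and stand-alone punctuation tokens. The filter runs on the original
# casing, so capitalized stopwords such as "The" are kept; lowercasing happens afterwards.
df['msg_cleaned'] = df.msg.apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords \
  and word not in punctuation_set]))
df['msg_cleaned'] = df.msg_cleaned.str.lower()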

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/goutham/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/goutham/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
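
A minimal sketch of an alternative cleaning step that lowercases first, strips punctuation characters, and only then drops stopwords. The `clean_message` helper and the tiny `STOPWORDS` set are hypothetical stand-ins for illustration; the notebook itself uses the NLTK English stopword list.

```python
import string

# Tiny stand-in stopword list (an assumption for this sketch); the notebook uses
# nltk.corpus.stopwords.words('english') instead.
STOPWORDS = {"the", "is", "a", "to", "until", "only"}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def clean_message(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, then drop stopwords."""
    tokens = text.lower().translate(_STRIP_PUNCT).split()
    return " ".join(tok for tok in tokens if tok not in stopwords)


print(clean_message("The offer is FREE!!! Reply to WIN."))
# prints: offer free reply win
```

Because the text is lowercased before filtering, "The" and "the" are treated alike. With a pandas column this could be applied as `df.msg.apply(clean_message)`.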


### `Trying TFIDF`

In [2]:
tfidf = TfidfVectorizer()

In [3]:
df.head(2)

| | spam | msg | msg_cleaned |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | go jurong point, crazy.. available bugis n gre... |
| 1 | ham | Ok lar... Joking wif u oni... | ok lar... joking wif u oni... |


In [4]:
X = tfidf.fit_transform(df.msg_cleaned)
y = df.spam
X_train, X_test, y_train, y_test = train_test_split(X, y)
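
The split above uses no `random_state`, so each run scores a different test set, and the vectorizer is fitted on all messages before splitting. A hedged sketch of one alternative: split the raw text with a fixed seed, then fit TF-IDF on the training part only. The toy `messages` and `labels` below are made up for illustration and stand in for `df.msg_cleaned` and `df.spam`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy messages standing in for df.msg_cleaned / df.spam (illustrative data only).
messages = [
    "free prize claim now", "win cash today", "urgent free entry", "claim your free cash",
    "see you at lunch", "ok call me later", "are we still meeting", "thanks for the lift",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text first; random_state makes the split repeatable and
# stratify keeps the spam/ham ratio the same in both parts.
msg_train, msg_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(msg_train)  # vocabulary and IDF learned from training text only
X_test = tfidf.transform(msg_test)        # test text is mapped onto that fixed vocabulary

print(X_train.shape, X_test.shape)
```

Both matrices end up with the same number of columns, because the test messages are projected onto the vocabulary learned from the training messages.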

#### `01 - Random Forest`

In [5]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
rf.score(X_test, y_test)

0.9791816223977028

In [6]:
confusion_matrix(y_test, y_pred)

array([[1223,    0],
       [  29,  141]])
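
Accuracy alone hides how the errors are distributed. As a small worked example, precision and recall for the spam class can be derived from the matrix above. This assumes scikit-learn's default label ordering (sorted labels, so `ham` comes before `spam`), with rows as true labels and columns as predictions.

```python
# Confusion matrix copied from the Random Forest run above.
# scikit-learn sorts string labels, so row/column 0 is 'ham' and 1 is 'spam';
# rows are true labels, columns are predicted labels.
cm = [[1223, 0],
      [29, 141]]

(tn, fp), (fn, tp) = cm  # treating 'spam' as the positive class

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the real spam messages, how many were caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# prints: accuracy=0.9792 precision=1.0000 recall=0.8294
```

So this run never flagged a ham message as spam, but it let 29 of the 170 spam messages in the test set through.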

#### `02 - Gradient Boost`

In [7]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

y_pred = gb.predict(X_test)
gb.score(X_test, y_test)


0.968413496051687

In [8]:
confusion_matrix(y_test, y_pred)

array([[1218,    5],
       [  39,  131]])

### `Trying TFIDF with bigrams and trigrams`

In [9]:
tfidf = TfidfVectorizer(ngram_range=(1, 3))
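
To make `ngram_range=(1, 3)` concrete, here is a plain-Python illustration of the word n-grams that such a range covers for one message. The `word_ngrams` helper is hypothetical and written only for this sketch; it is not scikit-learn's implementation, which also applies its own tokenization rules.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams for n between min_n and max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams("free entry to win".split()))
# prints: ['free', 'entry', 'to', 'win', 'free entry', 'entry to', 'to win',
#          'free entry to', 'entry to win']
```

A four-word message already yields nine candidate features, so the feature matrix grows quickly once bigrams and trigrams are included.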

In [10]:
X = tfidf.fit_transform(df.msg_cleaned)
y = df.spam
X_train, X_test, y_train, y_test = train_test_split(X, y)


#### `01 - Gradient Boost`

In [11]:
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

y_pred = gb.predict(X_test)
gb.score(X_test, y_test)


0.9619526202440776

In [12]:
confusion_matrix(y_test, y_pred)


array([[1186,    7],
       [  46,  154]])

### `TFIDF with Logistic Regression`

In [13]:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df.msg_cleaned)
y = df.spam
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [14]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)
lr.score(X_test, y_test)

0.9562096195262024

In [15]:
confusion_matrix(y_test, y_pred)


array([[1205,    3],
       [  58,  127]])

### `Conclusion`

The conclusion is that **CountVectorizer** + **LogisticRegression** *(our initial try)* still gives us the best results!  
Swapping the vectorizer, the n-gram range, and the classifier, as done above, are the kinds of tweaks we can try to refine a model, along with different parameters for each. Note that every experiment here uses a fresh random `train_test_split`, so small differences between scores are partly due to the split itself. As we can see, modeling takes many iterations to finalize.
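
One way to keep such iterations comparable is to score every candidate on the same folds. The sketch below does this with `cross_val_score` on a made-up toy corpus; the data, the candidate names, and the fold setup are all assumptions for illustration, so the printed scores say nothing about the SMS dataset.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for df.msg_cleaned / df.spam (illustrative data only).
messages = [
    "free prize claim now", "win cash today", "urgent free entry",
    "claim your free cash", "free cash prize waiting", "win a free phone now",
    "see you at lunch", "ok call me later", "are we still meeting",
    "thanks for the lift", "running late see you soon", "call me after lunch",
]
labels = ["spam"] * 6 + ["ham"] * 6

# Each candidate pairs a vectorizer with a classifier, mirroring the runs above.
candidates = {
    "count + logreg": make_pipeline(CountVectorizer(), LogisticRegression()),
    "tfidf + logreg": make_pipeline(TfidfVectorizer(), LogisticRegression()),
    "tfidf + random forest": make_pipeline(
        TfidfVectorizer(), RandomForestClassifier(random_state=0)
    ),
    "tfidf 1-3 grams + gradient boost": make_pipeline(
        TfidfVectorizer(ngram_range=(1, 3)), GradientBoostingClassifier(random_state=0)
    ),
}

# A seeded splitter, so every candidate is evaluated on identical folds.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

scores = {
    name: cross_val_score(model, messages, labels, cv=cv).mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Wrapping the vectorizer and classifier together means the vectorizer is refitted on each training fold, so no test text influences the vocabulary.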


---
[next](./05-pipeline-with-spam-data.ipynb)