[back](./04-tweak-nlp-model.ipynb)

---
## `Pipeline with Spam data`

### `Imports`

In [1]:
import pandas as pd
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

df = pd.read_table('../../assets/SMSSpamCollection', header=None)
df.columns = ['spam', 'msg']
nltk.download('stopwords')
nltk.download('punkt')
stopwords = set(nltk.corpus.stopwords.words('english'))
punctuation_set = set(string.punctuation)
# Lowercase first so capitalized stopwords ("The", "And") are also filtered out
df['msg_cleaned'] = df.msg.apply(lambda x: ' '.join([word for word in x.lower().split()
                                                     if word not in stopwords
                                                     and word not in punctuation_set]))
df.head(2)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/goutham/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/goutham/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


| | spam | msg | msg_cleaned |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | go jurong point, crazy.. available bugis n gre... |
| 1 | ham | Ok lar... Joking wif u oni... | ok lar... joking wif u oni... |


### `Pipeline`

The previous few sections repeated a lot of code. A **Pipeline** chains the preprocessing and modeling steps into a single object, so we can define the workflow once and reuse it without rewriting it.
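
To make the saving concrete, here is a minimal sketch that runs the same two steps by hand and then through a `Pipeline`. The toy messages and the names `vect`, `clf`, and `toy_pipeline` are made up for illustration and are not part of the SMS dataset; `LogisticRegression` is used here only because it is deterministic, which makes the two versions easy to compare.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy messages, made up for illustration (not from the SMS dataset)
train_msgs = ["win a free prize now", "free cash prize",
              "see you at lunch", "call me later"]
train_labels = ["spam", "spam", "ham", "ham"]
test_msgs = ["free prize", "lunch later"]

# Manual version: the vectorizer and the model are fitted and applied separately
vect = CountVectorizer()
X_train_counts = vect.fit_transform(train_msgs)  # learn the vocabulary on training data only
X_test_counts = vect.transform(test_msgs)        # reuse that vocabulary on the test data
clf = LogisticRegression()
clf.fit(X_train_counts, train_labels)
manual_pred = clf.predict(X_test_counts)

# Pipeline version: the same two steps behind a single fit/predict interface
toy_pipeline = Pipeline([("countvect", CountVectorizer()),
                         ("lg", LogisticRegression())])
toy_pipeline.fit(train_msgs, train_labels)
pipeline_pred = toy_pipeline.predict(test_msgs)

print(list(manual_pred), list(pipeline_pred))
```

Both versions should produce the same predictions; the pipeline just removes the bookkeeping of calling `fit_transform` on the training data and `transform` on the test data in the right order.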

In [2]:
pipeline = Pipeline([
    # stop_words is documented as 'english' or a list, so convert the NLTK set
    ('countvect', CountVectorizer(stop_words=list(stopwords))),
    # ('tfidf', TfidfVectorizer(stop_words=list(stopwords))),
    ('rf', RandomForestClassifier()),
])

In [3]:
X = df.msg_cleaned
y = df.spam
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(pipeline.score(X_test, y_test))

print(confusion_matrix(y_test, y_pred))

0.9770279971284996
[[1198    0]
 [  32  163]]
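
To read this output: `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, in sorted label order, so `ham` comes before `spam`. The sketch below copies the four counts from the matrix above and derives accuracy, precision, and recall for the spam class by hand; the variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix printed above.
# Rows are true labels, columns are predicted labels, in sorted order: ham, spam.
cm = [[1198, 0],
      [32, 163]]

ham_as_ham, ham_as_spam = cm[0]    # ham_as_spam: legitimate messages flagged as spam
spam_as_ham, spam_as_spam = cm[1]  # spam_as_ham: spam that slipped through as ham

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
accuracy = (ham_as_ham + spam_as_spam) / total
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

print(f"accuracy={accuracy:.4f} precision={spam_precision:.4f} recall={spam_recall:.4f}")
# accuracy=0.9770 precision=1.0000 recall=0.8359
```

The accuracy matches the score printed by `pipeline.score`. On this split no ham message was flagged as spam, but 32 of the 195 spam messages were missed, which the single accuracy number hides.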


#### `Suppose we want to use TF-IDF`

In [4]:
pipeline_tfidf = Pipeline([
    # ('countvect', CountVectorizer(stop_words=list(stopwords))),
    # stop_words is documented as 'english' or a list, so convert the NLTK set
    ('tfidf', TfidfVectorizer(stop_words=list(stopwords))),
    ('rf', RandomForestClassifier()),
])


In [5]:
X = df.msg_cleaned
y = df.spam
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline_tfidf.fit(X_train, y_train)
y_pred = pipeline_tfidf.predict(X_test)

print(pipeline_tfidf.score(X_test, y_test))

print(confusion_matrix(y_test, y_pred))


0.9748743718592965
[[1188    2]
 [  33  170]]


#### `What if we want to use Count Vectorizer with Logistic Regression?`

In [6]:
pipeline_cv = Pipeline([
    # stop_words is documented as 'english' or a list, so convert the NLTK set
    ('countvect', CountVectorizer(stop_words=list(stopwords))),
    ('lg', LogisticRegression()),
])


In [7]:
X = df.msg_cleaned
y = df.spam
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline_cv.fit(X_train, y_train)
y_pred = pipeline_cv.predict(X_test)

print(pipeline_cv.score(X_test, y_test))

print(confusion_matrix(y_test, y_pred))


0.9777458722182341
[[1203    3]
 [  28  159]]


### `Conclusion`

You can create multiple **pipelines**, give each a different name, and run them through the same steps to see which one gives the best score and the best trade-off between Type I (false positive) and Type II (false negative) errors.
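
One way to run that comparison is to keep the pipelines in a dictionary and loop over them. Note that each cell above calls `train_test_split` again without a fixed seed, so the three printed scores come from different splits; the sketch below shares one split across all pipelines instead. The toy `messages` and `labels` are made up stand-ins for `df.msg_cleaned` and `df.spam`, and the `random_state` values, `test_size`, and the `pipelines` / `results` names are assumptions added for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for df.msg_cleaned / df.spam; these messages are made up
messages = [
    "win free prize now", "free cash offer", "claim free prize today",
    "urgent cash prize waiting", "free entry win cash", "win prize claim now",
    "see you at lunch", "call me later", "are you home yet",
    "lunch tomorrow then", "running late see you soon", "thanks call you later",
]
labels = ["spam"] * 6 + ["ham"] * 6

# One split shared by every pipeline, so the scores are comparable
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.33, stratify=labels, random_state=0
)

pipelines = {
    "countvect + rf": Pipeline([("countvect", CountVectorizer()),
                                ("rf", RandomForestClassifier(random_state=0))]),
    "tfidf + rf": Pipeline([("tfidf", TfidfVectorizer()),
                            ("rf", RandomForestClassifier(random_state=0))]),
    "countvect + lg": Pipeline([("countvect", CountVectorizer()),
                                ("lg", LogisticRegression())]),
}

results = {}
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    # Fix the label order so every matrix reads: rows/columns = ham, spam
    cm = confusion_matrix(y_test, y_pred, labels=["ham", "spam"])
    results[name] = (pipe.score(X_test, y_test), cm)
    print(name, results[name][0])
    print(cm)
```

With the real data, replacing `messages` and `labels` with `df.msg_cleaned` and `df.spam` should be enough; the toy set is far too small for its scores to mean anything.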


---
[next](../08-pyspark-df-sql/00-index.ipynb)