In [1]:
# Perform imports and load the dataset:
import numpy as np
import pandas as pd

df = pd.read_csv('smsspamcollection.tsv', sep='\t')
df.head()

   label                                            message  length  punct
0    ham  Go until jurong point, crazy.. Available only ...     111      9
1    ham                      Ok lar... Joking wif u oni...      29      6
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...     155      6
3    ham  U dun say so early hor... U c already then say...      49      6
4    ham  Nah I don't think he goes to usf, he lives aro...      61      2


In [2]:
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [3]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [4]:
4825/len(df)

0.8659368269921034

4825 out of 5572 messages, or 86.6%, are ham. A model that always predicts ham would therefore already be 86.6% accurate, so any text classification model we create has to perform **better than 86.6%** to beat this majority-class baseline.
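As a rough check on that baseline, the sketch below rebuilds the label column from the counts shown above and scores scikit-learn's `DummyClassifier` with `strategy='most_frequent'`. The arrays `y_counts` and `X_dummy` are stand-ins invented for this illustration; the real notebook would use `df['label']` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels reconstructed from the value_counts() output (assumption: no other classes)
y_counts = np.array(['ham'] * 4825 + ['spam'] * 747)
X_dummy = np.zeros((len(y_counts), 1))  # placeholder features; the dummy model ignores them

# Always predict the most frequent class ("ham")
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_dummy, y_counts)

baseline_accuracy = baseline.score(X_dummy, y_counts)
print(round(baseline_accuracy, 4))
```

The score equals 4825/5572, the same ratio computed in the cell above.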

In [5]:
from sklearn.model_selection import train_test_split

X = df['message']  # this time we want to look at the text
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Scikit-learn's CountVectorizer
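Text preprocessing, tokenizing and the ability to filter out stopwords are all included in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which builds a dictionary of features and transforms documents to feature vectors.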

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

In [7]:
count_vect = CountVectorizer()

In [8]:
X_train_counts = count_vect.fit_transform(X_train)

In [9]:
X_train_counts.shape

(3733, 7082)

This shows that our training set consists of 3733 documents and 7082 features, where each feature is one distinct token in the learned vocabulary.
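To make the "documents by features" shape concrete, here is a minimal sketch on a three-message toy corpus. The messages and the names `toy_docs`, `toy_vect` and `toy_counts` are made up for illustration and are not part of the SMS dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the dataset
toy_docs = ["free entry to win", "ok see you then", "win a free prize"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)  # sparse matrix: rows = docs, columns = tokens

# vocabulary_ maps each token to its column index
print(sorted(toy_vect.vocabulary_))
print(toy_counts.shape)
print(toy_counts.toarray())
```

With the default settings, single-character tokens such as "a" are dropped, so the vocabulary here has nine entries and the matrix has three rows.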

## Transform Counts to Frequencies with Tf-idf
While counting words is helpful, longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid this we can simply divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called **tf** for Term Frequencies.

Another refinement on top of **tf** is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called **tf–idf** for “Term Frequency times Inverse Document Frequency”.

Both tf and tf–idf can be computed as follows using [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html):

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer

In [11]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts) # passing the result of count vectorizer
X_train_tfidf.shape

(3733, 7082)

Note that the `fit_transform()` method performs two operations: it fits the transformer to the data (learning the idf weights) and then transforms our count matrix to a tf-idf representation.
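The following sketch reproduces that transformation by hand on a small made-up count matrix. It assumes the default `TfidfTransformer` settings (`smooth_idf=True`, `norm='l2'`), under which the idf of a term is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length; `toy_count_matrix` and the other names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents x 3 terms
toy_count_matrix = np.array([[3, 0, 1],
                             [2, 0, 0],
                             [3, 2, 0]])

n_docs = toy_count_matrix.shape[0]
doc_freq = (toy_count_matrix > 0).sum(axis=0)        # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1      # smoothed inverse document frequency

weighted = toy_count_matrix * idf                    # raw count times idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

sk_tfidf = TfidfTransformer().fit_transform(toy_count_matrix).toarray()
matches = np.allclose(manual_tfidf, sk_tfidf)
print(matches)
```

Terms that appear in every document (the first column here) receive the smallest idf, which is the downscaling described above.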

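## Combine Steps with TfidfVectorizer
In the future, we can combine the CountVectorizer and TfidfTransformer steps into one using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html):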

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train)  # use the original X_train (raw text) here

X_train_tfidf.shape

(3733, 7082)
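The matching shape suggests the two routes are equivalent. A quick way to check, sketched here on a made-up toy corpus (`toy_docs` is hypothetical) and with default settings for all three classes:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical toy corpus, not taken from the dataset
toy_docs = ["free entry to win", "ok see you then", "win a free prize"]

# Route 1: count first, then reweight
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(toy_docs))

# Route 2: both steps in a single estimator
one_step = TfidfVectorizer().fit_transform(toy_docs)

same_result = np.allclose(two_step.toarray(), one_step.toarray())
print(same_result)
```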

## Train a Classifier

In [15]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

LinearSVC()

## Build a Pipeline
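Remember that only our training set has been vectorized into a full vocabulary. To evaluate on the test set, we have to run it through the same fitted steps. scikit-learn's `Pipeline` class chains the vectorizer and the classifier so that both are applied consistently to any data we pass in.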
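Without a pipeline, "the same procedures" would mean calling `transform()` (not `fit_transform()`) on the test text, so that it is mapped onto the vocabulary learned from the training text. The sketch below shows this on a tiny made-up dataset; all names and messages in it are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data, not taken from the dataset
toy_train = ["win a free prize now", "free entry win cash",
             "see you at lunch", "ok talk later then"]
toy_train_labels = ["spam", "spam", "ham", "ham"]
toy_test = ["free cash prize", "lunch later"]

toy_vec = TfidfVectorizer()
toy_X_train = toy_vec.fit_transform(toy_train)  # learn vocabulary and idf from training text
toy_X_test = toy_vec.transform(toy_test)        # reuse them; do NOT refit on test text

toy_clf = LinearSVC().fit(toy_X_train, toy_train_labels)
toy_preds = toy_clf.predict(toy_X_test)

# Both matrices share the same columns because the vocabulary was fitted once
print(toy_X_train.shape[1], toy_X_test.shape[1])
print(toy_preds)
```

Refitting the vectorizer on the test text would produce a different vocabulary, and the classifier's learned weights would no longer line up with the columns.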

In [16]:
from sklearn.pipeline import Pipeline

In [17]:
text_clf = Pipeline([('tf_idf', TfidfVectorizer()), ('clf', LinearSVC())])

In [19]:
# Feed the training data through the pipeline
text_clf.fit(X_train,y_train)

Pipeline(steps=[('tf_idf', TfidfVectorizer()), ('clf', LinearSVC())])

## Test the classifier and display results

In [20]:
predictions = text_clf.predict(X_test)

In [22]:
from sklearn import metrics

In [23]:
print(metrics.confusion_matrix(y_test,predictions))

[[1586    7]
 [  12  234]]


In [24]:
print(metrics.accuracy_score(y_test,predictions))

0.989668297988037


In [25]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839
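The numbers in the report can be traced back to the confusion matrix. This sketch recomputes accuracy and the spam-row metrics from the four counts printed above, assuming the usual scikit-learn layout (rows are true labels, columns are predictions, classes in sorted order `ham`, `spam`) and treating spam as the positive class.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[1586, 7],
               [12, 234]])

# With spam as the positive class: ham->ham, ham->spam, spam->ham, spam->spam
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
spam_precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
spam_recall = tp / (tp + fn)      # of actual spam messages, how many were caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(round(float(accuracy), 4))
print(round(float(spam_precision), 2),
      round(float(spam_recall), 2),
      round(float(spam_f1), 2))
```

The rounded values agree with the accuracy score and the `spam` row of the classification report: 7 ham messages were flagged as spam and 12 spam messages slipped through as ham.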

