## Data prep

In [1]:
import pandas as pd
import numpy as np

spam_data = pd.read_csv('data/spam.csv')

spam_data['target'] = np.where(spam_data['target']=='spam',1,0)
spam_data.head(10)

| Unnamed: 0 | text | target |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | Ok lar... Joking wif u oni... | 0 |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | U dun say so early hor... U c already then say... | 0 |
| 4 | Nah I don't think he goes to usf, he lives aro... | 0 |
| 5 | FreeMsg Hey there darling it's been 3 week's n... | 1 |
| 6 | Even my brother is not like to speak with me. ... | 0 |
| 7 | As per your request 'Melle Melle (Oru Minnamin... | 0 |
| 8 | WINNER!! As a valued network customer you have... | 1 |
| 9 | Had your mobile 11 months or more? U R entitle... | 1 |


In [2]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], 
                                                    spam_data['target'], 
                                                    random_state=0)

## Questions

### Question 1: What percentage of the documents in spam_data are spam?

In [3]:
def answer_one():
    return len(spam_data[spam_data["target"] == 1]) / len(spam_data) * 100

answer_one()

13.406317300789663
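
Because `target` is encoded as 0/1, its mean is the fraction of spam rows, which gives a shorter route to the same percentage. A minimal sketch on a hypothetical four-row frame (`toy` is an illustration, not part of the notebook's data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for spam_data before label encoding.
toy = pd.DataFrame({
    "text": ["win cash now", "see you soon", "ok lar", "call me later"],
    "target": ["spam", "ham", "ham", "ham"],
})

# Same encoding step as the notebook: 1 for spam, 0 otherwise.
toy["target"] = np.where(toy["target"] == "spam", 1, 0)

# The mean of a 0/1 column is the fraction of ones.
spam_percentage = toy["target"].mean() * 100

# Count-based version, as in answer_one(), for comparison.
spam_percentage_by_count = len(toy[toy["target"] == 1]) / len(toy) * 100
```

Both expressions compute the same quantity; the mean-based form avoids building a filtered copy of the frame.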

### Question 2: Fit the training data X_train using a Count Vectorizer with default parameters.

What is the longest token in the vocabulary?

In [6]:
def answer_two():
    from sklearn.feature_extraction.text import CountVectorizer

    vect = CountVectorizer().fit(X_train)

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    return str(max(vect.get_feature_names_out(), key=len))

answer_two()

'com1win150ppmx3age16subscription'
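
A token this long is possible because a default `CountVectorizer` lowercases the text and keeps every run of two or more word characters (token pattern `(?u)\b\w\w+\b`), so letters and digits with no separator form one token. The sketch below approximates that tokenization with the standard `re` module only. The `tokenize` helper and the sample message are hypothetical, and the real analyzer has further options that are ignored here.

```python
import re

# Default token pattern documented for CountVectorizer: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Approximate the default analyzer: lowercase, then extract regex matches."""
    return TOKEN_PATTERN.findall(text.lower())


# Hypothetical message; the original source text is not shown in the notebook.
message = "Reply to Box42WR29C a COM1WIN150PPMX3AGE16SUBSCRIPTION offer"

tokens = tokenize(message)
longest = max(tokens, key=len)
```

Single-character words such as `a` are dropped by the pattern, while the unbroken alphanumeric run survives as one vocabulary entry.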

### Question 3: Fit and transform the training data X_train using a Count Vectorizer with default parameters.

Next, fit a multinomial Naive Bayes classifier model with smoothing alpha=0.1. Find the area under the curve (AUC) score using the transformed test data.



In [9]:
def answer_three():
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import roc_auc_score

    vect = CountVectorizer().fit(X_train)
    X_train_vectorized = vect.transform(X_train).toarray()

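    # MultinomialNB() uses the default smoothing alpha=1.0; the question
    # specifies MultinomialNB(alpha=0.1), which would give a different score.
    model = MultinomialNB()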
    model.fit(X_train_vectorized, y_train)

    predictions = model.predict(vect.transform(X_test).toarray())

    return roc_auc_score(y_test, predictions)

answer_three()

0.9581366823421557
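
`roc_auc_score` accepts either hard 0/1 predictions or continuous scores. With hard labels the ROC curve has a single operating point, so scoring with `predict_proba` is a common alternative. The sketch below runs on a hypothetical toy corpus (the documents and labels are made up for illustration). It sets `alpha=0.1` as the question specifies, and passes the sparse matrices straight to the model, since `MultinomialNB` does not need `.toarray()`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for X_train / X_test (1 = spam, 0 = ham).
train_docs = ["win a free prize now", "free entry win cash",
              "are you coming home", "see you at lunch"]
train_labels = [1, 1, 0, 0]
test_docs = ["free cash prize", "coming to lunch", "win now", "see you home"]
test_labels = [1, 0, 1, 0]

# Fit the vocabulary on the training documents only, then reuse it for the test set.
vect = CountVectorizer()
X_tr = vect.fit_transform(train_docs)  # sparse matrix
X_te = vect.transform(test_docs)       # same columns as X_tr

model = MultinomialNB(alpha=0.1).fit(X_tr, train_labels)

# AUC from hard class predictions, as in answer_three().
auc_from_labels = roc_auc_score(test_labels, model.predict(X_te))

# AUC from the predicted probability of the positive (spam) class.
auc_from_scores = roc_auc_score(test_labels, model.predict_proba(X_te)[:, 1])
```

On real data the two values generally differ, because the probability-based score ranks every test document rather than splitting them at one threshold.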

### Question 4: Fit and transform the training data X_train using a Tfidf Vectorizer with default parameters.

What 20 features have the smallest tf-idf and what 20 have the largest tf-idf?

Put these features in two series, where each series is sorted by tf-idf value and then alphabetically by feature name. The index of each series should be the feature name, and the data should be the tf-idf.

The series of 20 features with the smallest tf-idfs should be sorted smallest tf-idf first; the series of 20 features with the largest tf-idfs should be sorted largest first.

In [28]:
def answer_four():
    from sklearn.feature_extraction.text import TfidfVectorizer

    vect = TfidfVectorizer()
    X_train_vectorized = vect.fit_transform(X_train)

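    # get_feature_names() was removed in scikit-learn 1.2; the replacement
    # already returns a NumPy array.
    feature_names = vect.get_feature_names_out()
    # argsort() returns feature positions, not scores, so `s` holds integers
    # (a permutation of 0..n-1) rather than tf-idf weights, and it is a single
    # series rather than the two requested.
    sorted_coef_index = X_train_vectorized.max(0).toarray()[0].argsort()

    s = pd.Series(sorted_coef_index, index = feature_names).sort_values()

    return s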

answer_four()

accommodationvouchers       0
lately                      1
6wu                         2
box42wr29c                  3
aint                        4
                         ... 
thm                      7349
dileep                   7350
crickiting               7351
ldn                      7352
jet                      7353
Length: 7354, dtype: int64