# Tutorial: Machine Learning with Text in scikit-learn

# Note from Donny

This notebook combines:<br>
Kevin Markham's Text Learning Tutorial<br>
Some information from Wikipedia<br>
Additional material on TF-IDF

## Agenda

1. Model building in scikit-learn (refresher)
2. Representing text as numerical data
3. Reading a text-based dataset into pandas
4. Vectorizing our dataset
5. Building and evaluating a model
6. Comparing models
7. Performing Cross-Validation to select C
8. Tuning the vectorizer (discussion)
9. TF-IDF Vectorizer plus model

## Part 1: Model building in scikit-learn (refresher)

In [316]:
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()

In [317]:
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

**"Features"** are also known as predictors, inputs, or attributes. The **"response"** is also known as the target, label, or output.

In [318]:
# check the shapes of X and y
print(X.shape)
print(y.shape)

(150, 4)
(150,)


**"Observations"** are also known as samples, instances, or records.

In [319]:
# examine the first 5 rows of the feature matrix (including the feature names)
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [320]:
# examine the response vector
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**.

In [321]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model (with the default parameters)
knn = KNeighborsClassifier()

# fit the model with data (occurs in-place)
knn.fit(X, y)

In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

In [322]:
# predict the response for a new observation
knn.predict([[3, 5, 4, 2]])

array([1])

## Part 2: Representing text as numerical data

In [323]:
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [324]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [325]:
# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(simple_train)

In [326]:
simple_train

['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [327]:
# examine the fitted vocabulary
vect.get_feature_names_out()

array(['cab', 'call', 'me', 'please', 'tonight', 'you'], dtype=object)

In [328]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [330]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]], dtype=int64)

In [331]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names_out())

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,0,0,1,1
1,1,1,1,0,0,0
2,0,1,1,2,0,0


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

In [332]:
# check the type of the document-term matrix
type(simple_train_dtm)

scipy.sparse._csr.csr_matrix

In [333]:
# examine the sparse matrix contents
print(simple_train_dtm)

  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
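As a rough illustration of what "sparse" means here, the sketch below rebuilds the small 3x6 document-term matrix from `simple_train` with `scipy.sparse.csr_matrix` and measures its density (the fraction of non-zero cells). The variable names are only for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# the dense 3x6 document-term matrix shown earlier for simple_train
dense = np.array([[0, 1, 0, 0, 1, 1],
                  [1, 1, 1, 0, 0, 0],
                  [0, 1, 1, 2, 0, 0]])

sparse_dtm = csr_matrix(dense)

n_cells = sparse_dtm.shape[0] * sparse_dtm.shape[1]
density = sparse_dtm.nnz / n_cells  # fraction of cells that are non-zero

print(f'stored elements: {sparse_dtm.nnz} of {n_cells} cells')
print(f'density: {density:.2f}')
```

This toy matrix is half full, so sparse storage gains little. With a realistic vocabulary of thousands of words, almost every cell is zero and the sparse format only needs to store the few non-zero entries.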

In [334]:
# example text for model testing
simple_test = ["please don't call me"]

In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

In [335]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 1, 1, 0, 0]], dtype=int64)

In [336]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names_out())

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,1,1,0,0


**Summary:**

- `vect.fit(train)` **learns the vocabulary** of the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
- `vect.transform(valid)` uses the **fitted vocabulary** to build a document-term matrix from the validation data (and **ignores tokens** it hasn't seen before)
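The summary can be checked with a short, self-contained sketch. It uses a separate vectorizer (`vect_demo`, a name chosen here for illustration) so that it does not interfere with `vect`, and it uses `fit_transform`, which is equivalent to calling `fit` followed by `transform` on the same data.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
new_docs = ["please don't call me"]

vect_demo = CountVectorizer()

# fit_transform learns the vocabulary and builds the training matrix in one step
train_dtm = vect_demo.fit_transform(train_docs)

# transform reuses the fitted vocabulary; it never adds new columns
new_dtm = vect_demo.transform(new_docs)

# tokens in the new document that are not in the fitted vocabulary are dropped
tokens = vect_demo.build_analyzer()(new_docs[0])
ignored = [tok for tok in tokens if tok not in vect_demo.vocabulary_]

print(train_dtm.shape)
print(new_dtm.toarray())
print(ignored)
```

With the default tokenizer, "don't" is split into "don" and "t"; the single-character token is discarded, and "don" is ignored because it was never seen during fitting.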

## Part 3: Reading a text-based dataset into pandas

Now it is your turn: read the SMS dataset into pandas and prepare it for modelling.

In [337]:
# read file into pandas using a relative path sms.tsv
# your code here
path = '../Week 8 Files-20250310/sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])

Check `.head()` as usual. Did `read_table` work correctly? Do you need to set any parameters, such as `header` or `names`?

We want the columns to be named <i>label</i> and <i>message</i>.

In [338]:
# your code here, read the tsv correctly
sms.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [339]:
# examine the shape
sms.shape

(5572, 2)

Examine the first 10 rows

In [340]:
# examine the first 10 rows
sms.head(10)

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [341]:
# examine the class distribution
sms.label.value_counts()

label
ham     4825
spam     747
Name: count, dtype: int64

In [342]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})

In [343]:
# check that the conversion worked
sms.head(10)

Unnamed: 0,label,message,label_num
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
5,spam,FreeMsg Hey there darling it's been 3 week's n...,1
6,ham,Even my brother is not like to speak with me. ...,0
7,ham,As per your request 'Melle Melle (Oru Minnamin...,0
8,spam,WINNER!! As a valued network customer you have...,1
9,spam,Had your mobile 11 months or more? U R entitle...,1


In [344]:
# Identify what will be the response variable and what will be the set of features, the X and y
X = sms['message']
y = sms['label_num']
print(X.shape)
print(y.shape)

(5572,)
(5572,)


In [345]:
from sklearn.model_selection import train_test_split

# split X and y into training and validation sets - train_test_split with random_state=1138 so we all get the same
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=1138)

print(X_train.shape)
print(X_valid.shape)
print(y_train.shape)
print(y_valid.shape)

(4179,)
(1393,)
(4179,)
(1393,)


## Part 4: Vectorizing our dataset

In [346]:
# instantiate the vectorizer
vect = CountVectorizer()

### Notice

The vectorizer (the vocabulary) is built only on the training set. Can anyone tell me why?

The vectorizer needs to be "fit" like an ML algorithm. Fit <i>vect</i> with X_train.

In [347]:
# your code here
vect.fit(X_train)


You now need to create a DTM (document-term matrix), as above, by transforming the text into a matrix.

In [348]:
# transform training data into a document-term matrix
X_train_dtm = vect.transform(X_train)

In [349]:
# examine the document-term matrix
X_train_dtm

<4179x7533 sparse matrix of type '<class 'numpy.int64'>'
	with 56069 stored elements in Compressed Sparse Row format>

In [350]:
import numpy as np

In [351]:
for row in X_train_dtm.toarray():
     print(row)

[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...

In [352]:
X_train_dtm.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

The test data is transformed to the already fitted vocabulary. Words that were not in the original training set are ignored (our model would not know what to do with them). 

We could, of course, build all of this as a pipeline using `make_pipeline`.

In [353]:
# transform testing data (using fitted vocabulary) into a document-term matrix
# your code, make the X_test_dtm, you have already fit vect, so just transform
X_test_dtm = vect.transform(X_valid)
X_test_dtm

<1393x7533 sparse matrix of type '<class 'numpy.int64'>'
	with 16820 stored elements in Compressed Sparse Row format>
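The pipeline idea mentioned above can be sketched as follows. This is a minimal example on a made-up toy corpus (the messages and labels are hypothetical, not taken from the SMS data), and it assumes the default settings of both steps.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical toy data: 1 = spam, 0 = ham
toy_messages = [
    'win a free prize now', 'free entry call now', 'claim your free cash prize',
    'are you coming home tonight', 'see you at lunch', 'call me when you get home',
]
toy_labels = [1, 1, 1, 0, 0, 0]

# the pipeline vectorizes the raw text and then fits the model in one call
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(toy_messages, toy_labels)

# new raw text goes straight in; the fitted vocabulary is reused automatically
pred = pipe.predict(['free prize waiting', 'lunch tonight'])
print(pred)
```

The advantage is that the vectorizer is always fitted on exactly the data the model is trained on, so there is no way to accidentally fit the vocabulary on validation data.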

## Part 5: Building and evaluating a model

We will use LogisticRegression to start with the default settings:

In [354]:
# import and instantiate a LogisticRegression model
from sklearn.linear_model import LogisticRegression
nb = LogisticRegression()

Note: X_train_dtm is a sparse matrix. LogisticRegression accepts sparse matrices directly, but some models may need the data converted first.

In [355]:
# train the model using X_train_dtm (timing it with an IPython "magic command")
# Fill in below
%time nb.fit(X_train_dtm, y_train)

CPU times: total: 31.2 ms
Wall time: 16 ms

In [356]:
# make class predictions for X_test_dtm
# Fill in below
y_pred_class = nb.predict(X_test_dtm)

In [357]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_valid, y_pred_class)

0.9877961234745154

In [358]:
nb.score(X_test_dtm, y_valid)

0.9877961234745154

In [359]:
# print the confusion matrix
metrics.confusion_matrix(y_valid, y_pred_class)

array([[1215,    3],
       [  14,  161]], dtype=int64)
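To see how the confusion matrix relates to the other metrics, the sketch below recomputes accuracy, precision and recall for the spam class by hand. It copies the four counts from the output above and assumes scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# confusion matrix copied from the output above: rows = actual, columns = predicted
cm = np.array([[1215, 3],
               [14, 161]])

# for a binary problem, ravel() yields the counts in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the messages flagged as spam, how many really were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught

print(f'accuracy:  {accuracy:.4f}')
print(f'precision: {precision:.4f}')
print(f'recall:    {recall:.4f}')
```

The model rarely flags ham as spam (3 false positives) but misses 14 of the 175 spam messages, which is why recall for the spam class is lower than precision.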

In [360]:
from sklearn.metrics import classification_report

# print the classification report (import it if you have not already)
print(classification_report(y_valid, y_pred_class))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1218
           1       0.98      0.92      0.95       175

    accuracy                           0.99      1393
   macro avg       0.99      0.96      0.97      1393
weighted avg       0.99      0.99      0.99      1393



In [361]:
# print out all the "wrong" ones
X_valid[y_pred_class != y_valid]

1777                    Call FREEPHONE 0800 542 0578 now!
2965    Do you ever notice that when you're driving, a...
4144    In The Simpsons Movie released in July 2007 na...
763     Urgent Ur £500 guaranteed award is still uncla...
1097    Dear Subscriber ur draw 4 £100 gift voucher wi...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
2402    Babe: U want me dont u baby! Im nasty and have...
1449    As a registered optin subscriber ur draw 4 £10...
3460    Not heard from U4 a while. Call me now am here...
495                      Are you free now?can i call now?
2774    How come it takes so little time for a child w...
4729    I (Career Tel) have added u as a contact on IN...
1430    For sale - arsenal dartboard. Good condition b...
788     Ever thought about living a good life with a p...
2521    Misplaced your number and was sending texts to...
4394    RECPT 1/3. You have ordered a Ringtone. Your o...
4676    Hi babe its Chloe, how r u? I was smashed on s...
Name: message, dtype: object

In [362]:
# print message text for the false positives (ham incorrectly classified as spam)
X_valid[(y_pred_class==1) & (y_valid==0)] # or X_valid[y_pred_class > y_valid]

495                      Are you free now?can i call now?
4729    I (Career Tel) have added u as a contact on IN...
2521    Misplaced your number and was sending texts to...
Name: message, dtype: object

In [363]:
# print message text for the false negatives (spam incorrectly classified as ham)
X_valid[(y_pred_class==0) & (y_valid==1)]

1777                    Call FREEPHONE 0800 542 0578 now!
2965    Do you ever notice that when you're driving, a...
4144    In The Simpsons Movie released in July 2007 na...
763     Urgent Ur £500 guaranteed award is still uncla...
1097    Dear Subscriber ur draw 4 £100 gift voucher wi...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
2402    Babe: U want me dont u baby! Im nasty and have...
1449    As a registered optin subscriber ur draw 4 £10...
3460    Not heard from U4 a while. Call me now am here...
2774    How come it takes so little time for a child w...
1430    For sale - arsenal dartboard. Good condition b...
788     Ever thought about living a good life with a p...
4394    RECPT 1/3. You have ordered a Ringtone. Your o...
4676    Hi babe its Chloe, how r u? I was smashed on s...
Name: message, dtype: object

In [364]:
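# example message, looked up by index label (3132 is in X_train, so it is not one of the validation-set false negatives listed above)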
X_train[3132]

"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

In [365]:
# calculate predicted probabilities for X_test_dtm (logistic regression probabilities are typically reasonably well calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([0.00121215, 0.00561094, 0.99440214, ..., 0.51716479, 0.98330475,
       0.00917443])

In [366]:
# calculate AUC, another metric for measuring the performance of a classifier; it relies on predicted probabilities.
# I briefly mentioned it in lectures
metrics.roc_auc_score(y_valid, y_pred_prob)

0.9826741731175228
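One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this on four made-up scores (hypothetical values, not from the SMS model) by comparing `roc_auc_score` with a direct count over all positive/negative pairs.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, scores)

# pairwise view: how often does a positive outrank a negative?
pos = [s for s, y in zip(scores, y_true) if y == 1]
neg = [s for s, y in zip(scores, y_true) if y == 0]
pairs = list(product(pos, neg))
pairwise = sum(p > n for p, n in pairs) / len(pairs)  # ties would count as 0.5; there are none here

print(auc, pairwise)
```

Because AUC depends only on how the scores rank the examples, it does not change if the threshold for calling a message spam is moved away from 0.5.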

## Part 6: Comparing models

We will compare LogisticRegression with a linear Support Vector Classifier and a Random Forest.

Setting `probability=True` makes the SVC take a bit longer to train, but it lets us call `predict_proba` and so calculate the ROC AUC score.

In [367]:
# import and instantiate a SVC model
from sklearn.svm import SVC
svcmodel = SVC(kernel='linear', probability=True)


In [368]:
# train the model using X_train_dtm
%time svcmodel.fit(X_train_dtm, y_train)

CPU times: total: 1.69 s
Wall time: 1.56 s


In [369]:
# make class predictions for X_test_dtm
y_pred_class = svcmodel.predict(X_test_dtm)

In [370]:
# calculate predicted probabilities for X_test_dtm (Platt-scaled estimates, available because probability=True)
y_pred_prob = svcmodel.predict_proba(X_test_dtm)[:, 1]

In [371]:
# calculate accuracy
metrics.accuracy_score(y_valid, y_pred_class)

0.9885139985642498

In [372]:
# calculate AUC
metrics.roc_auc_score(y_valid, y_pred_prob)

0.9848135116115411

Now do the same with a Random Forest.

In [373]:
from sklearn.ensemble import RandomForestClassifier

# import and instantiate a RandomForestClassifier model
rfmodel = RandomForestClassifier(random_state=1138)

# train the model using X_train_dtm
%time rfmodel.fit(X_train_dtm, y_train)

# make class predictions for X_test_dtm
y_pred_class = rfmodel.predict(X_test_dtm)

# calculate predicted probabilities for X_test_dtm
y_pred_prob = rfmodel.predict_proba(X_test_dtm)[:, 1]

# calculate accuracy
accuracy = metrics.accuracy_score(y_valid, y_pred_class)
print(f'Accuracy: {accuracy}')

# calculate AUC
auc = metrics.roc_auc_score(y_valid, y_pred_prob)
print(f'AUC: {auc}')

CPU times: total: 766 ms
Wall time: 779 ms
Accuracy: 0.9827709978463748
AUC: 0.9878817733990147


## Part 7: Cross Validation

Now do some form of cross-validation with SVC to find the best `C`, using GridSearchCV or whichever method is your favourite, and report the best cross-validation score. You can set `probability=False` this time and compare models on accuracy alone.

The same approach works for Random Forest, although it has no `C`, so you would tune its own hyperparameters instead.

In [374]:
from sklearn.model_selection import GridSearchCV

# your code here
# define the parameter grid
param_grid = {'C': [0.1, 1, 10, 100, 1000]}

# instantiate the grid search model
grid_search = GridSearchCV(SVC(kernel='linear', probability=False), param_grid, cv=5, scoring='accuracy')

# fit the grid search to the data
grid_search.fit(X_train_dtm, y_train)

# print the best parameters and the best cross-validation score
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best cross-validation score: {grid_search.best_score_}')

Best parameters: {'C': 10}
Best cross-validation score: 0.9822934991261498


Now find the best 'C' with LogisticRegression using cross-validation, again find the best cross-validation score.

In [375]:
# your code here
# define the parameter grid
param_grid_lr = {'C': [0.1, 1, 10, 100, 1000]}

# instantiate the grid search model
grid_search_lr = GridSearchCV(LogisticRegression(), param_grid_lr, cv=5, scoring='accuracy')

# fit the grid search to the data
grid_search_lr.fit(X_train_dtm, y_train)

# print the best parameters and the best cross-validation score
print(f'Best parameters: {grid_search_lr.best_params_}')
print(f'Best cross-validation score: {grid_search_lr.best_score_}')

Best parameters: {'C': 1000}
Best cross-validation score: 0.9834896713749535


## Part 8: Tuning the vectorizer (discussion)

All of this could go into a pipeline, with the vectorizer options treated as hyperparameters to be chosen by cross-validation.

Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):

In [376]:
# show default parameters for CountVectorizer
vect

However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:

- **stop_words:** string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used.

In [377]:
# remove English stop words
vect = CountVectorizer(stop_words='english')

In [378]:
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<4179x7270 sparse matrix of type '<class 'numpy.int64'>'
	with 32985 stored elements in Compressed Sparse Row format>

- **ngram_range:** tuple (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
    - All values of n such that min_n <= n <= max_n will be used.

In [379]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))

In [380]:
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<4179x41361 sparse matrix of type '<class 'numpy.int64'>'
	with 112052 stored elements in Compressed Sparse Row format>

- **max_df:** float in range [0.0, 1.0] or int, default=1.0
    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

In [381]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)

In [382]:
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<4179x7533 sparse matrix of type '<class 'numpy.int64'>'
	with 56069 stored elements in Compressed Sparse Row format>

- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

In [383]:
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)

In [384]:
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<4179x3506 sparse matrix of type '<class 'numpy.int64'>'
	with 52042 stored elements in Compressed Sparse Row format>

**Guidelines for tuning CountVectorizer:**

- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.
- **Experiment**, and let the data tell you the best approach!
- We could use cross-validation to make our choices! 

We can also use a pipeline such as

`make_pipeline(CountVectorizer(), LogisticRegression())`

and then cross-validate over the CountVectorizer parameters as well.
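A minimal sketch of that idea is shown below, on a made-up toy corpus (the messages, labels and grid values are hypothetical and chosen only to keep the example fast). Parameters of a pipeline step are addressed as `<step name>__<parameter>`, where `make_pipeline` names each step after its lowercased class name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# hypothetical toy data: 1 = spam, 0 = ham
toy_messages = [
    'win a free prize now', 'free entry call now', 'claim your free cash prize',
    'urgent free ringtone offer', 'cash prize waiting call now', 'free voucher claim today',
    'are you coming home tonight', 'see you at lunch', 'call me when you get home',
    'running late see you soon', 'lunch tomorrow sounds good', 'coming home late tonight',
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipe = make_pipeline(CountVectorizer(), LogisticRegression())

# vectorizer options and the model's C are searched together
param_grid = {
    'countvectorizer__ngram_range': [(1, 1), (1, 2)],
    'countvectorizer__stop_words': [None, 'english'],
    'logisticregression__C': [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')
search.fit(toy_messages, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refitted on the training portion of every fold, so the vocabulary never sees the held-out messages.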

# TF-IDF

A commonly used weighting technique is tf-idf, short for “term frequency-inverse document frequency”, which reflects how important a word is to a document (for example, an email) in a collection or corpus (our set of emails or documents).

## Term frequency

Suppose we have a set of English text documents and wish to rank which document is most relevant to the query, "the brown cow". A simple way to start out is by eliminating documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document; the number of times a term occurs in a document is called its term frequency. However, in the case where the length of documents varies greatly, adjustments are often made (see definition below). The first form of term weighting is due to Hans Peter Luhn (1957) which may be summarized as:

> The weight of a term that occurs in a document is simply proportional to the term frequency.

## Inverse document frequency
Because the term "the" is so common, term frequency will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less-common words "brown" and "cow". Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely. 


## TF-IDF

The tf-idf is a statistic that increases with the number of times a word appears in the document, penalized by the number of documents in the corpus that contain the word.

Fortunately for us, scikit-learn has a class that does just this: `sklearn.feature_extraction.text.TfidfVectorizer`. See the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [385]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [386]:
#TfidfVectorizer?

In [387]:
tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')

## Why these options?

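TfidfVectorizer sets the vectorizer up. Here we set `sublinear_tf=True`, which replaces tf with 1 + log(tf). This addresses the issue that “twenty occurrences of a term in a document” does not represent “twenty times the significance of a single occurrence” [link](https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html). It therefore reduces the importance of high-frequency words: with the natural logarithm that scikit-learn uses, 1 + log(1) = 1, while 1 + log(20) ≈ 4.0 (or about 2.3 with a base-10 logarithm). As in Part 8, `max_df=0.5` ignores terms that appear in more than 50% of the documents, and `stop_words='english'` removes the built-in English stop words.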

In [388]:
corpus = [
    "This is my first email.",
    "I'm trying to learn machine learning.",
    "This is the second email",
    "Learning is fun"
]

In [389]:
corpus_M = tfidf_vectorizer.fit_transform(corpus)

In [390]:
print(corpus_M)

  (0, 0)	1.0
  (1, 6)	0.5254727492640658
  (1, 2)	0.5254727492640658
  (1, 4)	0.5254727492640658
  (1, 3)	0.41428875116588965
  (2, 0)	0.6191302964899972
  (2, 5)	0.7852882757103967
  (3, 3)	0.6191302964899972
  (3, 1)	0.7852882757103967


In [391]:
vocabulary = tfidf_vectorizer.get_feature_names_out()

In [392]:
print(vocabulary)

['email' 'fun' 'learn' 'learning' 'machine' 'second' 'trying']


In [393]:
pd.DataFrame(data=corpus_M.toarray(), columns=vocabulary)

Unnamed: 0,email,fun,learn,learning,machine,second,trying
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.525473,0.414289,0.525473,0.0,0.525473
2,0.61913,0.0,0.0,0.0,0.0,0.785288,0.0
3,0.0,0.785288,0.0,0.61913,0.0,0.0,0.0
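The weights in the last row ("Learning is fun") can be reproduced by hand. The sketch below is not scikit-learn's implementation; it assumes the default smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, sublinear tf scaling 1 + ln(tf), and L2 normalisation of each row. The token lists are written out manually and reflect the vocabulary printed above, after lowercasing and English stop-word removal.

```python
import math
from collections import Counter

# tokens per document after stop-word removal (assumed from the printed vocabulary)
docs = [
    ['email'],
    ['trying', 'learn', 'machine', 'learning'],
    ['second', 'email'],
    ['learning', 'fun'],
]
n_docs = len(docs)

# document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    """Return L2-normalised tf-idf weights for one tokenised document."""
    weights = {}
    for term, tf in Counter(doc).items():
        sub_tf = 1 + math.log(tf)                           # sublinear tf scaling
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1   # smoothed idf
        weights[term] = sub_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row = tfidf_row(docs[3])
print(row)
```

"fun" appears in only one document while "learning" appears in two, so "fun" receives the larger weight, which agrees with the 0.785288 and 0.61913 shown in the table.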


In [394]:
test = ["I’m also trying to learn python"]

In [395]:
corpus_test = tfidf_vectorizer.transform(test)
pd.DataFrame(data=corpus_test.toarray(), columns=vocabulary)

Unnamed: 0,email,fun,learn,learning,machine,second,trying
0,0.0,0.0,0.707107,0.0,0.0,0.0,0.707107


## SMS Set

Use the SMS set to build a classifier, using TF-IDF as our vectorizer instead of raw bag-of-words counts.

In [396]:
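#your code here
# NOTE: despite its name, X_train_dtm is only an alias for the vectorizer object here, not a document-term matrix
X_train_dtm = tfidf_vectorizer

In [397]:
X_train_dtm

In [399]:
# fit the vectorizer on the training messages and build the TF-IDF document-term matrix
X_train_dtm_transformed = X_train_dtm.fit_transform(X_train)
X_train_dtm_transformed.toarray()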

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [403]:
from sklearn.linear_model import LogisticRegression

# Transform the training data using the tfidf_vectorizer
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the validation data using the same tfidf_vectorizer
X_valid_tfidf = tfidf_vectorizer.transform(X_valid)

# Import and instantiate a LogisticRegression model
logreg = LogisticRegression()

# Train the model using X_train_tfidf
logreg.fit(X_train_tfidf, y_train)

# Make class predictions for X_valid_tfidf
y_pred_class_tfidf = logreg.predict(X_valid_tfidf)

# Calculate accuracy of class predictions
accuracy_tfidf = metrics.accuracy_score(y_valid, y_pred_class_tfidf)
print(f'Accuracy: {accuracy_tfidf}')

# Calculate predicted probabilities for X_valid_tfidf
y_pred_prob_tfidf = logreg.predict_proba(X_valid_tfidf)[:, 1]

# Calculate AUC
auc_tfidf = metrics.roc_auc_score(y_valid, y_pred_prob_tfidf)
print(f'AUC: {auc_tfidf}')

Accuracy: 0.964824120603015
AUC: 0.9844944874501524


In [404]:
tfidf_vectorizer.fit(X_train)

In [406]:
X_valid_dtm = tfidf_vectorizer.transform(X_valid)

In [407]:
X_valid_dtm

<1393x7270 sparse matrix of type '<class 'numpy.float64'>'
	with 9319 stored elements in Compressed Sparse Row format>

In [409]:
y_pred_class = logreg.predict(X_valid_tfidf)

In [412]:
metrics.accuracy_score(y_valid, y_pred_class)

0.964824120603015

In [413]:
logreg.score(X_valid_tfidf, y_valid)

0.964824120603015

Take whichever model was best above (LogisticRegression, SVC, Random Forest) and see how it compares when paired with a TF-IDF vectorizer.

In [414]:
# Transform the training data using the tfidf_vectorizer
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the validation data using the same tfidf_vectorizer
X_valid_tfidf = tfidf_vectorizer.transform(X_valid)

# Import and instantiate a RandomForestClassifier model
rfmodel_tfidf = RandomForestClassifier(random_state=1138)

# Train the model using X_train_tfidf
rfmodel_tfidf.fit(X_train_tfidf, y_train)

# Make class predictions for X_valid_tfidf
y_pred_class_tfidf_rf = rfmodel_tfidf.predict(X_valid_tfidf)

# Calculate accuracy of class predictions
accuracy_tfidf_rf = metrics.accuracy_score(y_valid, y_pred_class_tfidf_rf)
print(f'Accuracy: {accuracy_tfidf_rf}')

# Calculate predicted probabilities for X_valid_tfidf
y_pred_prob_tfidf_rf = rfmodel_tfidf.predict_proba(X_valid_tfidf)[:, 1]

# Calculate AUC
auc_tfidf_rf = metrics.roc_auc_score(y_valid, y_pred_prob_tfidf_rf)
print(f'AUC: {auc_tfidf_rf}')

Accuracy: 0.9834888729361091
AUC: 0.9887309406521229
