
<div style="background-color: lightblue; padding: 60px;">
    <h1><b>Modeling
</b></h1>
</div>


In this lesson, we'll do a bit of feature engineering, and then model our text data. We'll be aiming to predict whether a given text message is spam or not, and trying to predict the category of news articles.

Feature Extraction: TF-IDF
 - TF: Term Frequency; how often a word appears in a document.
 - IDF: Inverse Document Frequency; a measure based on how many documents a word appears in.
 - TF-IDF: A combination of the two measures above.


Term Frequency (TF)
 - Term frequency can be calculated in a number of ways, all of which reflect how frequently a word appears in a document.

 - Raw Count: This is simply the number of occurrences of each word.

 - Frequency: The number of times each word appears divided by the total number of words.

 - Augmented Frequency: The frequency of each word divided by the maximum frequency. This can help prevent bias towards larger documents.

In [2]:
from pprint import pprint
import pandas as pd
import numpy as np
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder

%matplotlib inline
import matplotlib.pyplot as plt

import unicodedata
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from prepare import basic_clean, lemmatize

from math import log

from env import user, password, host


In [None]:
document = 'Mary had a little lamb, a little lamb, a little lamb.'

# clean up the text
document = document.lower().replace(',', '').replace('.', '')
# transform into a series
words = pd.Series(document.split())

# From the Series we can extract the value_counts, which is our raw count
# for term frequency. Once we have the raw counts, we can calculate the
# other measures.
(pd.DataFrame({'raw_count': words.value_counts()})
 .assign(frequency=lambda df: df.raw_count / df.raw_count.sum())
 .assign(augmented_frequency=lambda df: df.frequency / df.frequency.max()))

!!!tip "TF inputs" The calculation for an individual TF score requires a word and a body of text (a document).

### Inverse Document Frequency (IDF)

Inverse Document Frequency also provides information about individual words, but, in order to use this measure, we must have multiple documents, i.e. several different bodies of text.

Inverse Document Frequency tells us how much information a word provides. It is based on how commonly a word appears across multiple documents. The metric is devised such that the more documents a word appears in, the lower the IDF for that word will be.

$$ \mathrm{idf}(\text{word}) = \log\left(\frac{\text{number of documents}}{\text{number of documents containing the word}}\right) $$

!!!note "idf calculation" If a given word doesn't appear in any documents, the denominator in the equation above would be zero, so some definitions of idf will add 1 to the denominator.
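
The adjustment described in the note can be sketched as follows. This is a minimal illustration of the add-one-to-the-denominator variant only; the function name `smoothed_idf` and its parameters are assumptions made for this example, not part of any library.

```python
from math import log

def smoothed_idf(n_documents, n_containing):
    """IDF with 1 added to the denominator so it is never zero.

    n_documents: total number of documents in the corpus
    n_containing: number of documents that contain the word
    """
    return log(n_documents / (1 + n_containing))

# A word that appears in no documents no longer causes a ZeroDivisionError
unseen = smoothed_idf(20, 0)
# A word that appears in 19 of 20 documents gets log(20 / 20) = 0
common = smoothed_idf(20, 19)
```

One consequence of this variant is that a word appearing in every document gets a slightly negative score, since the denominator is then larger than the number of documents.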

For example, imagine we have 20 documents. We can visualize what the idf score looks like with the code below:

In [None]:
n_documents = 20

x = np.arange(1, n_documents + 1)
y = np.log(n_documents / x)

plt.figure(figsize=(12, 8))
plt.plot(x, y, marker='.')

plt.xticks(x)
plt.xlabel('# of Documents the word appears in')
plt.ylabel('IDF')
plt.title('IDF for a given word')

Now let's walk through an example of calculating IDF for multiple words. We'll use a small example dataset.

First we'll prepare the data:

In [3]:
# our 3 example documents
documents = {
    'news': 'Codeup announced last thursday that they just launched a new data science program. It is 18 weeks long.',
    'description': 'Codeup\'s data science program teaches hands on skills using Python and pandas.',
    'context': 'Codeup\'s data science program was created in response to a percieved lack of data science talent, and growing demand.'
}
pprint(documents)

print('\nCleaning and lemmatizing...\n')

documents = {topic: lemmatize(basic_clean(documents[topic])) for topic in documents}
pprint(documents)

{'context': "Codeup's data science program was created in response to a "
            'percieved lack of data science talent, and growing demand.',
 'description': "Codeup's data science program teaches hands on skills using "
                'Python and pandas.',
 'news': 'Codeup announced last thursday that they just launched a new data '
         'science program. It is 18 weeks long.'}

Cleaning and lemmatizing...

{'context': "codeup's data science program wa created in response to a "
            'percieved lack of data science talent and growing demand',
 'description': "codeup's data science program teach hand on skill using "
                'python and panda',
 'news': 'codeup announced last thursday that they just launched a new data '
         'science program it is 18 week long'}


Then we can calculate the inverse document frequency metric for each word.



In [4]:
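# A simple way to calculate idf for demonstration. Note that this
# function relies on the globally defined documents variable.
# Caveat: `word in doc` is a substring test on each document string, so a
# short word such as 'a' is also counted in a document containing 'panda'.
def idf(word):
    n_occurrences = sum([1 for doc in documents.values() if word in doc])
    return log(len(documents) / n_occurrences)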

# Get a list of the unique words
unique_words = pd.Series(' '.join(documents.values()).split()).unique()

# put the unique words into a data frame
(pd.DataFrame(dict(word=unique_words))
 # calculate the idf for each word
 .assign(idf=lambda df: df.word.apply(idf))
 # sort the data for presentation purposes
 .set_index('word')
 .sort_values(by='idf', ascending=False)
 .head(5))

word,idf
teach,1.098612
created,1.098612
hand,1.098612
skill,1.098612
using,1.098612


A higher IDF means that a word provides more information. That is, it is more relevant within a single document.

!!!tip "IDF inputs" The calculation for an individual IDF score requires a word and a set of documents.
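
One subtlety in the demonstration `idf` function is that `word in doc` is a substring test, because each document is a single string. The sketch below compares that behavior with whole-word matching on the three cleaned documents from the example. The names `idf_substring` and `idf_token` are hypothetical and used only for this comparison.

```python
from math import log

# The three cleaned documents from the example above
documents = {
    'news': "codeup announced last thursday that they just launched a new data science program it is 18 week long",
    'description': "codeup's data science program teach hand on skill using python and panda",
    'context': "codeup's data science program wa created in response to a percieved lack of data science talent and growing demand",
}

def idf_substring(word):
    # `word in doc` on a string matches anywhere, e.g. 'a' inside 'panda'
    n = sum(1 for doc in documents.values() if word in doc)
    return log(len(documents) / n)

def idf_token(word):
    # splitting first means only whole words are matched
    n = sum(1 for doc in documents.values() if word in doc.split())
    return log(len(documents) / n)

print(idf_substring('a'), idf_token('a'))
print(idf_substring('codeup'), idf_token('codeup'))
```

With substring matching, both `'a'` and `'codeup'` are found in all three documents and get an IDF of 0. With whole-word matching, `'a'` is found in two documents and `'codeup'` in one, so both get a positive IDF.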

## TF-IDF

TF-IDF is simply the product of the two metrics we've discussed above. Let's calculate a TF-IDF value for every word in every document:



In [5]:
tfs = []

# We'll calculate the tf-idf value for every word across every document

# Start by iterating over all the documents
for doc, text in documents.items():
    # We'll make a data frame that contains the tf for every word in every document
    df = (pd.Series(text.split())
          .value_counts()
          .reset_index()
          .set_axis(['word', 'raw_count'], axis=1)
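          # note: df.shape[0] is the number of distinct words in the document,
          # so tf here is raw_count / (number of distinct words)
          .assign(tf=lambda df: df.raw_count / df.shape[0])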
          .drop(columns='raw_count')
          .assign(doc=doc))
    # Then add that data frame to our list
    tfs.append(df)

In [6]:
df

Unnamed: 0,word,tf,doc
0,science,0.117647,context
1,data,0.117647,context
2,codeup's,0.058824,context
3,percieved,0.058824,context
4,growing,0.058824,context
5,and,0.058824,context
6,talent,0.058824,context
7,of,0.058824,context
8,lack,0.058824,context
9,to,0.058824,context


In [7]:
# We'll then concatenate all the tf values together.
(pd.concat(tfs)
 # calculate the idf value for each word
 .assign(idf=lambda df: df.word.apply(idf))
 # then use the tf and idf values to calculate tf-idf
 .assign(tf_idf=lambda df: df.idf * df.tf)
 .drop(columns=['tf', 'idf'])
 .sort_values(by='tf_idf', ascending=False))

Unnamed: 0,word,doc,tf_idf
5,hand,description,0.091551
4,teach,description,0.091551
11,panda,description,0.091551
9,python,description,0.091551
8,using,description,0.091551
7,skill,description,0.091551
14,wa,context,0.064624
13,created,context,0.064624
11,response,context,0.064624
9,to,context,0.064624
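
As a check on the table above, two of the values can be reproduced by hand. This sketch assumes the same definitions as the code in this lesson: tf is the raw count divided by the number of distinct words in the document, and idf is the natural log of the number of documents divided by the number of documents containing the word.

```python
from math import log

n_documents = 3

# 'teach' occurs once in 'description', which has 12 distinct words,
# and appears in 1 of the 3 documents
tf_teach = 1 / 12
idf_teach = log(n_documents / 1)
tfidf_teach = tf_teach * idf_teach

# 'created' occurs once in 'context', which has 17 distinct words,
# and also appears in 1 of the 3 documents
tfidf_created = (1 / 17) * log(n_documents / 1)

print(round(tfidf_teach, 6))    # 0.091551
print(round(tfidf_created, 6))  # 0.064624
```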


It's more common to see the data presented with the words as features, and the documents as observations, like this:

In [8]:
# We'll then concatenate all the tf values together.
(pd.concat(tfs)
 # calculate the idf value for each word
 .assign(idf=lambda df: df.word.apply(idf))
 # then use the tf and idf values to calculate tf-idf
 .assign(tf_idf=lambda df: df.idf * df.tf)
 .drop(columns=['tf', 'idf'])
 .sort_values(by='tf_idf', ascending=False)
 # each (doc, word) pair occurs once, so 'sum' just passes the value through
 .pipe(lambda df: pd.crosstab(df.doc, df.word, values=df.tf_idf, aggfunc='sum'))
 .fillna(0))

doc,18,a,and,announced,codeup,codeup's,created,data,demand,growing,...,skill,talent,teach,that,they,thursday,to,using,wa,week
context,0.0,0.0,0.023851,0.0,0.0,0.023851,0.064624,0.0,0.064624,0.064624,...,0.0,0.064624,0.0,0.0,0.0,0.0,0.064624,0.0,0.064624,0.0
description,0.0,0.0,0.033789,0.0,0.0,0.033789,0.0,0.0,0.0,0.0,...,0.091551,0.0,0.091551,0.0,0.0,0.0,0.0,0.091551,0.0,0.0
news,0.061034,0.0,0.0,0.061034,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.061034,0.061034,0.061034,0.0,0.0,0.0,0.061034


## TF-IDF with scikit-learn

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidfs = tfidf.fit_transform(documents.values())
tfidfs

<3x36 sparse matrix of type '<class 'numpy.float64'>'
	with 45 stored elements in Compressed Sparse Row format>

We get back a sparse matrix, that is, a matrix in which most of the entries are 0. SciPy provides special sparse matrix types (here, Compressed Sparse Row) that store only the non-zero entries, which makes some manipulations and operations faster.
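
To give a rough idea of what "Compressed Sparse Row" means, here is a minimal pure-Python sketch that converts a small dense matrix into the three arrays a CSR matrix stores. The function `to_csr` and the example matrix are made up for illustration; in practice SciPy builds these arrays (exposed as `data`, `indices`, and `indptr`) for you.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it lives in
        # indptr[i]:indptr[i + 1] is the slice of data belonging to row i
        indptr.append(len(data))
    return data, indices, indptr

dense = [
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.7],
]
data, indices, indptr = to_csr(dense)
print(data)     # [0.5, 0.3, 0.7]
print(indices)  # [1, 0, 3]
print(indptr)   # [0, 1, 1, 3]
```

Only 3 of the 12 entries are stored. The saving grows with the vocabulary size, since each document uses only a small fraction of all the words.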

Because our data set is pretty small, we can convert our sparse matrix to a regular one, and put everything in a dataframe. If our data were larger, the operation below might take much longer.

In [10]:
pd.DataFrame(tfidfs.todense(), columns=tfidf.get_feature_names_out())


Unnamed: 0,18,and,announced,codeup,created,data,demand,growing,hand,in,...,skill,talent,teach,that,they,thursday,to,using,wa,week
0,0.263566,0.0,0.263566,0.155666,0.0,0.155666,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.263566,0.263566,0.263566,0.0,0.0,0.0,0.263566
1,0.0,0.25388,0.0,0.19716,0.0,0.19716,0.0,0.0,0.333821,0.0,...,0.333821,0.0,0.333821,0.0,0.0,0.0,0.0,0.333821,0.0,0.0
2,0.0,0.195932,0.0,0.152159,0.257627,0.304317,0.257627,0.257627,0.0,0.257627,...,0.0,0.257627,0.0,0.0,0.0,0.0,0.257627,0.0,0.257627,0.0


## Modeling

Now we'll use the computed TF-IDF values as features in a model. We'll take a look at the spam data set first.

Because of the way we are modeling the data, we have a lot of columns, and it is not uncommon to have more columns than rows. Also, our data is very imbalanced in the class distribution, that is, there are many more ham messages than spam messages.
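
Because of this imbalance, it helps to know what accuracy a trivial model would achieve before judging a real one. The sketch below computes that baseline using the training-set class counts reported in this lesson's classification report (3859 ham, 598 spam). The variable names are illustrative only.

```python
from collections import Counter

# Training-set class counts from this lesson's spam data: 3859 ham, 598 spam
labels = ['ham'] * 3859 + ['spam'] * 598

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the majority class
baseline_accuracy = majority_count / len(labels)

print(f'Always predicting {majority_class!r}: {baseline_accuracy:.2%}')
# Always predicting 'ham': 86.58%
```

Any classifier on this data should be compared against that figure rather than against 50%.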

Other than these considerations, we can treat this as a standard classification problem. We'll use logistic regression as an example:

In [11]:
def get_db_url(database, host=host, user=user, password=password):
    return f'mysql+pymysql://{user}:{password}@{host}/{database}'

url = get_db_url("spam_db")
sql = "SELECT * FROM spam"

df = pd.read_sql(sql, url, index_col="id")
df.head()

id,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [12]:
def clean(text: str) -> list:
    'A simple function to cleanup text data'
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = (text.encode('ascii', 'ignore')
             .decode('utf-8', 'ignore')
             .lower())
    words = re.sub(r'[^\w\s]', '', text).split() # tokenization
    return [wnl.lemmatize(word) for word in words if word not in stopwords]


In [13]:
df['clean_text'] = df.text.apply(clean).apply(' '.join)


In [14]:
X = df.clean_text
y = df.label

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=.2)


In [16]:
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)

train = pd.DataFrame(dict(actual=y_train))
test = pd.DataFrame(dict(actual=y_test))

lm = LogisticRegression().fit(X_train, y_train)

train['predicted'] = lm.predict(X_train)
test['predicted'] = lm.predict(X_test)
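
The cell above fits the vectorizer on the training text and only transforms the test text, so no information from the test set leaks into the features. One way to make that pattern harder to get wrong is a scikit-learn pipeline. The sketch below is an alternative arrangement, not the approach used in this lesson, and the tiny corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only
train_texts = ['free entry win prize now', 'ok see you at lunch',
               'win cash prize call now', 'are you home yet']
train_labels = ['spam', 'ham', 'spam', 'ham']
test_texts = ['win a free prize', 'see you at the meeting']

# The pipeline fits the vectorizer on the training text only, then
# reuses that fitted vocabulary when predicting on new text.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(train_texts, train_labels)

predictions = pipeline.predict(test_texts)

# Words that appear only in the test text never enter the vocabulary
vocabulary = pipeline.named_steps['tfidfvectorizer'].vocabulary_
print('meeting' in vocabulary)  # False
```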


In [17]:
print('Accuracy: {:.2%}'.format(accuracy_score(train.actual, train.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(train.predicted, train.actual))
print('---')
print(classification_report(train.actual, train.predicted))

Accuracy: 96.52%
---
Confusion Matrix
actual      ham  spam
predicted            
ham        3852   148
spam          7   450
---
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98      3859
        spam       0.98      0.75      0.85       598

    accuracy                           0.97      4457
   macro avg       0.97      0.88      0.92      4457
weighted avg       0.97      0.97      0.96      4457



In [18]:
print('Accuracy: {:.2%}'.format(accuracy_score(test.actual, test.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(test.predicted, test.actual))
print('---')
print(classification_report(test.actual, test.predicted))

Accuracy: 95.96%
---
Confusion Matrix
actual     ham  spam
predicted           
ham        966    45
spam         0   104
---
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       966
        spam       1.00      0.70      0.82       149

    accuracy                           0.96      1115
   macro avg       0.98      0.85      0.90      1115
weighted avg       0.96      0.96      0.96      1115
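
The spam row of the report can be derived directly from the test-set confusion matrix above. This short sketch treats spam as the positive class and recomputes the metrics by hand. The variable names are illustrative.

```python
# Test-set confusion matrix from above (rows: predicted, columns: actual)
true_positive = 104   # predicted spam, actually spam
false_positive = 0    # predicted spam, actually ham
false_negative = 45   # predicted ham, actually spam
true_negative = 966   # predicted ham, actually ham

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (true_positive + true_negative) / (
    true_positive + false_positive + false_negative + true_negative)

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2%}')
# precision=1.00 recall=0.70 f1=0.82 accuracy=95.96%
```

The high accuracy hides the fact that 45 of the 149 spam messages were missed, which is why recall for the minority class is worth checking separately.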



# Exercises

 - Take the work we did in the lessons further:

    - What other types of models (i.e. different classification algorithms) could you use?
    - How do the models compare when trained on term frequency data alone, instead of TF-IDF values?

In [31]:
#Modeling

# Set up your functions for text cleaning
def clean(text):
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = (text.encode('ascii', 'ignore')
             .decode('utf-8', 'ignore')
             .lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]




# Load your dataset and preprocess the text
url = get_db_url("spam_db")
sql = "SELECT * FROM spam"
df = pd.read_sql(sql, url, index_col="id")
df['clean_text'] = df.text.apply(clean).apply(' '.join)

# Split the dataset into features and target variable
X = df.clean_text
y = df.label

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)

# Initialize different classification models
models = {
    'Logistic Regression (TF-IDF)': LogisticRegression(),
    'Multinomial Naive Bayes (TF-IDF)': MultinomialNB(),
    'SVM (TF-IDF)': SVC(),
    'Random Forest (TF-IDF)': RandomForestClassifier(),
}

# Encode target classes
y_train_encoded = y_train.map({'ham': 0, 'spam': 1})
y_test_encoded = y_test.map({'ham': 0, 'spam': 1})

# TF-IDF Vectorization
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# TF Vectorization
tf_vectorizer = CountVectorizer()
X_train_tf = tf_vectorizer.fit_transform(X_train)
X_test_tf = tf_vectorizer.transform(X_test)

# Keep predictions in a dict keyed by (model name, feature type) rather than
# adding columns to the `test` DataFrame from the earlier split, whose rows
# no longer line up with this cell's new train/test split.
predictions = {}

# Evaluate models with TF-IDF features
for model_name, model in models.items():
    predictions[model_name, 'TF-IDF'] = model.fit(X_train_tfidf, y_train_encoded).predict(X_test_tfidf)
    accuracy = accuracy_score(y_test_encoded, predictions[model_name, 'TF-IDF'])
    print(f'{model_name} Accuracy (TF-IDF): {accuracy:.2%}')

# Refit the same models on raw term-count (TF) features
for model_name, model in models.items():
    predictions[model_name, 'TF'] = model.fit(X_train_tf, y_train_encoded).predict(X_test_tf)
    accuracy = accuracy_score(y_test_encoded, predictions[model_name, 'TF'])
    print(f'{model_name} Accuracy (TF): {accuracy:.2%}')

# Print classification reports for the models
for model_name in models:
    for features in ('TF-IDF', 'TF'):
        print(f'Classification Report for {model_name} ({features}):')
        print(classification_report(y_test_encoded, predictions[model_name, features]))


Logistic Regression (TF-IDF) Accuracy (TF-IDF): 95.78%
Multinomial Naive Bayes (TF-IDF) Accuracy (TF-IDF): 95.96%
SVM (TF-IDF) Accuracy (TF-IDF): 98.03%
Random Forest (TF-IDF) Accuracy (TF-IDF): 97.49%
Logistic Regression (TF-IDF) Accuracy (TF): 98.12%
Multinomial Naive Bayes (TF-IDF) Accuracy (TF): 98.30%
SVM (TF-IDF) Accuracy (TF): 98.21%
Random Forest (TF-IDF) Accuracy (TF): 97.76%
Classification Report for Logistic Regression (TF-IDF) (TF-IDF):
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       966
           1       0.96      0.71      0.82       149

    accuracy                           0.96      1115
   macro avg       0.96      0.85      0.90      1115
weighted avg       0.96      0.96      0.96      1115

Classification Report for Logistic Regression (TF-IDF) (TF):
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       0.99      0.87      0.92       

Summary:

Models Trained on TF-IDF Features:
 - Logistic Regression: Accuracy - 95.78%
 - Multinomial Naive Bayes: Accuracy - 95.96%
 - SVM: Accuracy - 98.03%
 - Random Forest: Accuracy - 97.49%

Models Trained on TF Features:
 - Logistic Regression: Accuracy - 98.12%
 - Multinomial Naive Bayes: Accuracy - 98.30%
 - SVM: Accuracy - 98.21%
 - Random Forest: Accuracy - 97.76%

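For both feature types (TF-IDF and TF), the models achieved high accuracy. On this particular train/test split, every model scored slightly higher on TF features than on TF-IDF features: Logistic Regression and Multinomial Naive Bayes gained about 2.3 percentage points each, while SVM and Random Forest gained less than 0.3 points.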

Note: TF-IDF takes into account the importance of words in the corpus, which can be valuable for certain tasks. However, TF features can be simpler and still yield excellent results.

## Using XGBoost:

In [33]:

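# XGBClassifier comes from the separate xgboost package (pip install xgboost)
from xgboost import XGBClassifier

# Initialize the label encoder
label_encoder = LabelEncoder()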

# Encode the target variable
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Train an XGBoost classifier
model = XGBClassifier()
model.fit(X_train_tfidf, y_train_encoded)

# Make predictions on the test set
y_pred_encoded = model.predict(X_test_tfidf)

# Decode the predictions back to original labels (if needed)
y_pred = label_encoder.inverse_transform(y_pred_encoded)

# Evaluate the model
accuracy = accuracy_score(y_test_encoded, y_pred_encoded)
print(f'Accuracy: {accuracy:.2%}')

# Print a classification report
print(classification_report(y_test_encoded, y_pred_encoded))


Accuracy: 96.41%
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       966
           1       0.98      0.74      0.85       149

    accuracy                           0.96      1115
   macro avg       0.97      0.87      0.91      1115
weighted avg       0.96      0.96      0.96      1115



Summary:
The model identifies 'ham' messages well (precision 0.96, recall 1.00). For 'spam' messages, precision is high (0.98) but recall is only 0.74, so roughly a quarter of the spam messages are misclassified as ham. Overall accuracy is 96.41%.