<a href="https://colab.research.google.com/github/BrajanNieto/MISTI/blob/main/2026MISTIPeru_Evergreen_FeatureEngineering_EXERCISES.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**If you haven't already, please select:**

`File` -> `Save a Copy in Drive`

**to copy this notebook to your Google Drive and work on the copy. If you don't do this, your changes won't be saved!**


# Feature Engineering with the StumbleUpon Evergreen Classification Challenge Dataset

[The StumbleUpon Evergreen Classification Challenge](https://www.kaggle.com/c/stumbleupon/data?select=train.tsv) is a data science competition that was held in the past. The goal of the challenge was to predict whether a given web page would be classified as "evergreen" or "ephemeral".

An evergreen web page is one that is always relevant and maintains its value over time, while an ephemeral web page is one that is only relevant for a short period of time before becoming outdated.

The challenge was hosted on Kaggle and provided a dataset of web pages along with their corresponding labels. Participants were tasked with developing a machine learning model that could accurately classify new web pages as evergreen or ephemeral based on their features.

### Imports and Setup

In [1]:
%%bash
# Copy the data over from GitHub (the %%bash cell magic must be the first line of the cell)
git clone https://github.com/caboonie/gsl-uruguay.git

Cloning into 'gsl-uruguay'...

In [2]:
# data manipulation
import pandas as pd
import numpy as np
import scipy.stats as st

# plots
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pylab as pl

# scaling
from sklearn.preprocessing import StandardScaler

# classification algorithms
from sklearn.linear_model import LogisticRegression

# dimension reduction
from sklearn.decomposition import PCA

# cross-validation
from sklearn.model_selection import train_test_split

# model evaluation
from sklearn.metrics import roc_auc_score

# text mining
import re
from nltk import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

import warnings
warnings.filterwarnings("ignore")

### Explore the dataset
We will now load our data into a pandas DataFrame, which stores the data as a table.

Each row of the table is a datapoint and each column represents a feature. We can check the shape of the data to see how many datapoints and features we have. This dataset has many features, but we will focus mostly on these:
1. **alchemy_category_score** - the score of how likely the alchemy category is correct, from 0 to 1.
2. **alchemy_category** - type of page, such as "health", "sports", etc.
3. **linkwordscore** - Percentage of words on the page that are in links
4. **news_front_page** - 0 or 1 representing whether this webpage is front-page news
5. **boilerplate** - the extracted text of the webpage, stored as a JSON-encoded string
6. **spelling_errors_ratio** - Percentage of words that are misspelled.
7. **numberOfLinks** - number of links in the page
8. **numwords_in_url** - number of words in the url link

You can see descriptions of each feature here: https://www.kaggle.com/c/stumbleupon/data?select=train.tsv


In [37]:
# load the data as a pandas dataframe
dataset = pd.read_table("gsl-uruguay/content/w1d2/train.tsv", sep= "\t")
print("Data dimensions:" + str(dataset.shape))
# we will narrow our focus to the 8 features listed above, plus the label
dataset = dataset.filter(["alchemy_category_score", "alchemy_category", "linkwordscore", "news_front_page", "boilerplate",
                          "spelling_errors_ratio", "numberOfLinks", "numwords_in_url", "label"])
# display the first 10 lines
display(dataset.head(10))

Data dimensions:(7395, 27)


Unnamed: 0,alchemy_category_score,alchemy_category,linkwordscore,news_front_page,boilerplate,spelling_errors_ratio,numberOfLinks,numwords_in_url,label
0,0.789131,business,24,0,"{""title"":""IBM Sees Holographic Calls Air Breat...",0.07913,170,8,0
1,0.574147,recreation,40,0,"{""title"":""The Fully Electronic Futuristic Star...",0.125448,187,9,1
2,0.996526,health,55,0,"{""title"":""Fruits that Fight the Flu fruits tha...",0.057613,258,11,1
3,0.801248,health,24,0,"{""title"":""10 Foolproof Tips for Better Sleep ""...",0.100858,120,5,1
4,0.719157,sports,14,0,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",0.082569,162,10,0
5,?,?,12,?,"{""url"":""conveniencemedical genital herpes home...",0.087356,55,3,0
6,0.22111,arts_entertainment,21,0,"{""title"":""fashion lane American Wild Child "",""...",0.064327,93,3,1
7,?,?,5,?,"{""url"":""insidershealth article racing for reco...",0.148551,132,4,0
8,?,?,17,0,"{""title"":""Valet The Handbook 31 Days 31 days"",...",0.125,194,7,1
9,?,?,14,?,"{""url"":""howsweeteats 2010 03 24 cookies and cr...",0.094412,326,4,1


#### EXERCISE: Explore the StumbleUpon Dataset

Tasks:
1. Explore the StumbleUpon dataset using standard pandas functions.

##### TASK 1: Explore your Dataset

In [4]:
# TASK 1 EXERCISE

''' ADD YOUR CODE HERE '''

In [38]:
print(dataset.info())
print("Missing values in category score:", (dataset['alchemy_category_score'] == '?').sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7395 entries, 0 to 7394
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   alchemy_category_score  7395 non-null   object 
 1   alchemy_category        7395 non-null   object 
 2   linkwordscore           7395 non-null   int64  
 3   news_front_page         7395 non-null   object 
 4   boilerplate             7395 non-null   object 
 5   spelling_errors_ratio   7395 non-null   float64
 6   numberOfLinks           7395 non-null   int64  
 7   numwords_in_url         7395 non-null   int64  
 8   label                   7395 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 520.1+ KB
None
Missing values in category score: 2342


In [39]:
display(dataset.describe())

Unnamed: 0,linkwordscore,spelling_errors_ratio,numberOfLinks,numwords_in_url,label
count,7395.0,7395.0,7395.0,7395.0,7395.0
mean,30.077079,0.101221,178.754564,4.960649,0.51332
std,20.393101,0.079231,179.466198,3.233111,0.499856
min,0.0,0.0,1.0,0.0,0.0
25%,14.0,0.068739,82.0,3.0,0.0
50%,25.0,0.089312,139.0,5.0,1.0
75%,43.0,0.112376,222.0,7.0,1.0
max,100.0,1.0,4997.0,22.0,1.0


In [40]:
label_counts = dataset['label'].value_counts()
print("Label Distribution:\n", label_counts)

evergreen_percentage = dataset['label'].mean() * 100
print(f"Percentage of evergreen pages: {evergreen_percentage:.2f}%")

Label Distribution:
 label
1    3796
0    3599
Name: count, dtype: int64
Percentage of evergreen pages: 51.33%


In [41]:
print(dataset['alchemy_category'].value_counts())

alchemy_category
?                     2342
recreation            1229
arts_entertainment     941
business               880
health                 506
sports                 380
culture_politics       343
computer_internet      296
science_technology     289
gaming                  76
religion                72
law_crime               31
unknown                  6
weather                  4
Name: count, dtype: int64


### Data Preprocessing

#### Addressing Missing Values

The code below walks through a data preprocessing workflow: it handles missing values and converts categorical features to numerical formats so that the dataset can be used for modeling.

Missing entries in this dataset are stored as the string `"?"`, so the first `isnull()` check reports zero missing values. The code replaces `"?"` with `NaN` for easier handling.

Because of the `"?"` placeholders, the `news_front_page` and `alchemy_category_score` columns are loaded as strings. The code converts both to floats and then imputes their missing values.

Missing values in `alchemy_category_score` are filled with the mean of the feature, while missing values in `news_front_page` are set to 0.

Missing values in `alchemy_category` are replaced with a filler category (`"_M"`), and one-hot encoding creates a new binary column for each unique category.

A final check confirms that no missing values remain before the analysis continues.
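The same steps can be sketched on a tiny frame, which makes each intermediate result easy to inspect. The frame `toy` and its values are hypothetical and exist only for illustration; the column names mirror the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the "?" placeholders in the real data
toy = pd.DataFrame({
    "alchemy_category_score": ["0.8", "?", "0.4"],
    "alchemy_category": ["health", "?", "sports"],
    "news_front_page": ["0", "?", "1"],
})

# Step 1: turn the "?" placeholder into a real missing value
toy = toy.replace("?", np.nan)

# Step 2: cast the string columns to floats (NaN survives the cast)
toy["alchemy_category_score"] = toy["alchemy_category_score"].astype(float)
toy["news_front_page"] = toy["news_front_page"].astype(float)

# Step 3: impute (mean for the score, 0 for the flag, "_M" for the category)
score_mean = toy["alchemy_category_score"].mean()
toy["alchemy_category_score"] = toy["alchemy_category_score"].fillna(score_mean)
toy["news_front_page"] = toy["news_front_page"].fillna(0)
toy["alchemy_category"] = toy["alchemy_category"].fillna("_M")

# Step 4: one-hot encode the category and attach the new columns
toy = toy.join(pd.get_dummies(toy["alchemy_category"], prefix="category"))
print(toy)
```

Note that a mean computed on the full dataset uses information from rows that later end up in the test set; computing it on the training rows only would be a stricter alternative.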

In [42]:
# count missing values per column (all zeros here, because missing entries are still stored as "?")
print(np.sum(dataset.isnull()))

# set "?" as missing values
dataset = dataset.replace("?", np.nan)
# dataset.head(6)

# convert the news_front_page feature to numerical
dataset[["news_front_page"]] = dataset[["news_front_page"]].astype(float)
# dataset.dtypes

# convert the alchemy_category_score feature to numerical
dataset[["alchemy_category_score"]] = dataset[["alchemy_category_score"]].astype(float)
# dataset.dtypes

# replace missing values with the average of that feature
dataset["alchemy_category_score"] = dataset["alchemy_category_score"].fillna(np.mean(dataset["alchemy_category_score"]))

# address missing values for the "news_front_page" column
dataset['news_front_page'] = dataset['news_front_page'].replace(np.nan, 0)

# address missing values in alchemy category by replacing them with the filler category, "_M" to represent missing
dataset["alchemy_category"] = dataset["alchemy_category"].fillna("_M")

# we'll now create a one-hot column for each category
alch_dataset = pd.get_dummies(dataset["alchemy_category"], prefix= "category")
dataset = dataset.join(alch_dataset)

# find variables with missing values
print('\n')
print(np.sum(dataset.isnull()))

alchemy_category_score    0
alchemy_category          0
linkwordscore             0
news_front_page           0
boilerplate               0
spelling_errors_ratio     0
numberOfLinks             0
numwords_in_url           0
label                     0
dtype: int64


alchemy_category_score         0
alchemy_category               0
linkwordscore                  0
news_front_page                0
boilerplate                    0
spelling_errors_ratio          0
numberOfLinks                  0
numwords_in_url                0
label                          0
category__M                    0
category_arts_entertainment    0
category_business              0
category_computer_internet     0
category_culture_politics      0
category_gaming                0
category_health                0
category_law_crime             0
category_recreation            0
category_religion              0
category_science_technology    0
category_sports                0
category_unknown               0
categor

### Split data into training, validation, and test sets

A full workflow divides the data into three groups: training, validation, and test. In this notebook, we will only work with the train and test sets.

In [43]:
train, test = train_test_split(dataset, test_size= 0.1, train_size= 0.9, random_state= 234)
print("Train data size: " + str(train.shape))
print("Test data size: " + str(test.shape))

Train data size: (6655, 23)
Test data size: (740, 23)
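For reference, a three-way split can be sketched by calling `train_test_split` twice. This is only an illustration on a hypothetical stand-in frame; the names `toy`, `train_part`, `val_part`, and `test_part` and the 10% validation fraction are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset: 100 rows, one feature, a binary label
toy = pd.DataFrame({"x": np.arange(100), "label": np.arange(100) % 2})

# First carve off the test set, then split the remainder into train and validation
train_val, test_part = train_test_split(toy, test_size=0.1, random_state=234)
train_part, val_part = train_test_split(train_val, test_size=0.1, random_state=234)

print("Train:", train_part.shape, "Validation:", val_part.shape, "Test:", test_part.shape)
```

The validation set would be used to compare feature sets and hyperparameters, leaving the test set untouched until the final evaluation.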


## Train a Baseline Model

The code below defines a function `evaluate_logistic_regression()` that trains and evaluates a logistic regression model using a specified set of features from a dataset. The function splits the dataset into training and testing sets, with 90% of the data used for training and 10% for testing. It fits the model on the training set and predicts class labels for the test set. The function returns the ROC-AUC score computed from those predicted labels, a metric that evaluates how well the model distinguishes between the classes, along with the trained model itself.

Following the function definition, a baseline model is trained using a comprehensive set of features, including numerical attributes such as `numberOfLinks` and `spelling_errors_ratio`, as well as one-hot encoded categorical variables from the `alchemy_category` column. This baseline evaluation provides a starting point to assess the effectiveness of the chosen features in predicting the target variable.

In [44]:
from sklearn.metrics import roc_curve

def plot_roc_curve(y_true, y_pred):
  fpr, tpr, thresholds = roc_curve(y_true, y_pred)
  # drop the first threshold, which represents a classifier that always predicts the negative class
  thresholds = thresholds[1:]
  auc_score = roc_auc_score(y_true, y_pred)

  plt.figure(figsize=(8, 6))
  plt.plot(fpr, tpr, color='blue', label=f'ROC curve (AUC = {auc_score:.2f})')
  plt.plot([0, 1], [0, 1], color='red', linestyle='--', label='Random guess')
  plt.xlabel('False Positive Rate (FPR)')
  plt.ylabel('True Positive Rate (TPR)')
  plt.title('Receiver Operating Characteristic (ROC) Curve')
  plt.legend()
  plt.grid(True)
  plt.show()

In [45]:
def evaluate_logistic_regression(dataset, feat):
  train, test = train_test_split(dataset, test_size= 0.1, train_size= 0.9, random_state= 234)
  model = LogisticRegression()
  model.fit(train[feat], train["label"])
  predictions = model.predict(test[feat])
  return roc_auc_score(test["label"], predictions), model

# Fit a baseline model
feat = ["numberOfLinks", "spelling_errors_ratio", "linkwordscore", "alchemy_category_score", "news_front_page",
        'category__M', 'category_arts_entertainment', 'category_business', 'category_computer_internet',
        'category_culture_politics', 'category_gaming', 'category_health', 'category_law_crime', 'category_recreation',
        'category_religion', 'category_science_technology', 'category_sports', 'category_unknown', 'category_weather']

evaluate_logistic_regression(dataset, feat)

(np.float64(0.6556056041416829), LogisticRegression())
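The function above scores the hard 0/1 predictions from `model.predict()`. ROC-AUC is usually computed from continuous scores, such as the positive-class probabilities from `predict_proba()`, because the metric measures how well the model ranks examples across all thresholds. The sketch below contrasts the two on synthetic data; the dataset from `make_classification` and the variable names are assumptions for illustration, and the numbers will not match the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a stand-in for the real features)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard labels: only a single threshold (0.5) is represented
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from probabilities: every threshold contributes to the curve
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("AUC from predicted labels:", auc_from_labels)
print("AUC from predicted probabilities:", auc_from_scores)
```

The same probabilities can be passed to `plot_roc_curve()` to draw a full curve rather than a single operating point.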

## Text Feature Engineering



### Text processing

In the **StumbleUpon Evergreen Dataset**, the **`boilerplate`** feature refers to a **JSON-encoded string** that contains metadata about the web page's content, such as its **title, body text, and description**. This metadata is extracted from the HTML boilerplate of the web page and is used to provide contextual information about the page's content, which can help classify whether the page is **evergreen** (i.e., content that remains relevant over time) or **ephemeral** (i.e., time-sensitive or short-lived content).

Run the following code to see a sample of the 'boilerplate' feature.

In [46]:
dataset['boilerplate'].sample().values

array(['{"title":"Carolina Herrera on Bella s Wedding Dress The Pressure and Inspiration News carolina herrera on bella\\u2019s wedding dress: the pressure and inspiration","body":"Carolina Herrera on Bella s Wedding Dress The Pressure and Inspiration By Emily Gyben 11 30 11 at 01 55 PM Carolina Herrera s created wedding dresses for everyone from Christina Hendricks to Renee Zellweger but for Bella Swan That was a completely different kind of challenge It is a huge pressure and also it s fabulous Herrera told the International Herald Tribune Can you imagine That film is seen by everybody Twilight is something very important and they re all waiting for the wedding gown By all accounts the long sleeved dress which six months to design for Kristen Stewart s Bella Swan was a success The gown will retail for 35 000 at four of the designer s boutiques come January and has already sparked at least one replica a 799 version by Alfred Angelo I took inspiration from the 20s Herrera said of the w

### EXERCISE: Train a `CountVectorizer()`

The goal of this exercise is to learn how to use the CountVectorizer() class to process text data and then build a model pipeline using the extracted features. You'll also practice integrating these features with other numerical features for a comprehensive model.

Tasks:
1. Train a `CountVectorizer()` on the `"boilerplate"` feature of the StumbleUpon dataset. Make sure to look at the arguments accepted when instantiating the `CountVectorizer()` class to strengthen your understanding of it.
2. After training the `CountVectorizer()`, inspect the extracted vocabulary using the `.vocabulary_` attribute. This will give you a dictionary where keys are the words or n-grams and values are their corresponding feature indices.
3. Use the trained `CountVectorizer()` to transform the `"boilerplate"` text data into feature vectors. Train a `LogisticRegression()` model using these vectors and evaluate the model's performance using the `roc_auc_score()` function.
4. Convert the transformed feature vectors into a Pandas DataFrame to inspect and visualize the word features.
5. Join the text features extracted from `CountVectorizer()` with the existing numerical features in your dataset. Train a model using this combined feature set, and evaluate its performance.
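Before working on the full dataset, it can help to see what `CountVectorizer()` produces on a few short documents. The three sentences below are made up for illustration; the parameters `ngram_range=(1, 2)` and `binary=True` match the ones used later in this exercise.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents, for illustration only
docs = ["easy chicken recipe", "chicken soup recipe", "breaking news today"]

# Unigrams and bigrams; binary=True records presence (1) rather than counts
vec = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vec.fit_transform(docs)

# vocabulary_ maps each word or n-gram to its column index in X
print(vec.vocabulary_)

# One row per document, one column per vocabulary entry
print(X.toarray())
```

Each row is the feature vector for one document: a 1 marks that the word or bigram in that column appears in the document.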

#### TASK 1: Train a `CountVectorizer()`

In [47]:
# TASK 1 EXERCISE

# Create the CountVectorizer with the specified parameters
string_vectorizer = CountVectorizer(
    min_df=10,
    max_features=100,
    strip_accents='unicode',
    analyzer="word",
    token_pattern=r"\w{1,}",
    ngram_range=(1, 2),
    binary=True
)

# Fit the vectorizer to the text data
string_vectorizer.fit(dataset['boilerplate'])

#### TASK 2: Visualize the `CountVectorizer()` Vocabulary

In [48]:
# TASK 2 EXERCISE

# View the vocabulary extracted by the CountVectorizer
print("Learned Features:", string_vectorizer.get_feature_names_out()[:100])

Learned Features: ['1' '10' '2' '3' '4' '5' 'a' 'about' 'all' 'also' 'an' 'and' 'are' 'as'
 'at' 'be' 'body' 'but' 'by' 'can' 'do' 'don' 'don t' 'even' 'for'
 'for the' 'from' 'from the' 'get' 'has' 'have' 'here' 'how' 'i' 'if'
 'if you' 'in' 'in a' 'in the' 'into' 'is' 'is a' 'it' 'it s' 'just'
 'like' 'make' 'minutes' 'more' 'most' 'my' 'new' 'no' 'not' 'of' 'of the'
 'on' 'on the' 'one' 'or' 'other' 'out' 'over' 'recipe' 's' 'so' 'some'
 't' 'than' 'that' 'the' 'their' 'them' 'then' 'there' 'these' 'they'
 'this' 'time' 'title' 'to' 'to the' 'until' 'up' 'url' 'use' 'was' 'we'
 'well' 'what' 'when' 'which' 'who' 'will' 'with' 'with a' 'with the'
 'you' 'you can' 'your']


#### TASK 3: Train a Model with the `CountVectorizer()` Features

In [49]:
# TASK 3 EXERCISE

# Transform the text data using the trained CountVectorizer
train_string_vectors = string_vectorizer.transform(train['boilerplate'])
test_string_vectors = string_vectorizer.transform(test['boilerplate'])

# Initialize and train a Logistic Regression model
model = LogisticRegression()
model.fit(train_string_vectors,train['label'])

# Make predictions on the test set
predictions = model.predict(test_string_vectors)

# Evaluate the model using ROC-AUC
print("Model performance (ROC-AUC):", roc_auc_score(test['label'], predictions))

Model performance (ROC-AUC): 0.7777282166778787


#### TASK 4: Visualize the `CountVectorizer()` Features

In [50]:
# TASK 4 EXERCISE

# Convert the Features to a DataFrame
dataset_word_vectors = pd.DataFrame(train_string_vectors.todense())
print(dataset_word_vectors.head())

   0   1   2   3   4   5   6   7   8   9   ...  90  91  92  93  94  95  96  \
0   1   1   1   1   1   1   1   0   1   0  ...   0   0   0   0   1   0   0   
1   0   0   0   0   0   0   1   0   1   0  ...   0   0   0   1   1   0   0   
2   1   1   1   1   0   1   1   1   1   1  ...   0   1   0   0   1   1   1   
3   0   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   
4   1   1   1   1   1   1   1   1   1   1  ...   1   1   1   1   1   1   1   

   97  98  99  
0   1   1   0  
1   1   0   0  
2   1   1   0  
3   0   0   0  
4   1   1   1  

[5 rows x 100 columns]


#### TASK 5: Join the `CountVectorizer()` Features with the Other Features



In [52]:
# TASK 5 EXERCISE

# Define the list of existing numerical features
feat = [
    "numberOfLinks", "spelling_errors_ratio", "linkwordscore", "alchemy_category_score", "news_front_page",
    "category__M", "category_arts_entertainment", "category_business", "category_computer_internet",
    "category_culture_politics", "category_gaming", "category_health", "category_law_crime", "category_recreation",
    "category_religion", "category_science_technology", "category_sports", "category_unknown", "category_weather"
]

# Transform the text data using the CountVectorizer
string_vectors = string_vectorizer.transform(dataset['boilerplate'])

# Convert the transformed text features to a DataFrame
dataset_word_vectors = pd.DataFrame(string_vectors.todense())

# Join the text features with the numerical features
dataset_word_and_feat = dataset_word_vectors.join(dataset[feat])

# Add the label column
dataset_word_feat_label = dataset_word_and_feat.join(dataset['label'])

# Ensure all column names are strings
dataset_word_feat_label.columns = dataset_word_feat_label.columns.astype(str)

# Evaluate the model using a custom function
evaluate_logistic_regression(dataset_word_feat_label, dataset_word_feat_label.columns[:-1])

(np.float64(0.7838339816900171), LogisticRegression())
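Converting the vectorizer output with `.todense()` is fine for 100 features, but it can use a lot of memory with larger vocabularies. One alternative, sketched here on made-up data, is to keep the text features sparse and attach the numerical columns with `scipy.sparse.hstack`. The documents, the two numeric columns, and the labels are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up documents with two hypothetical numeric features per document
docs = ["easy chicken recipe", "simple soup recipe", "breaking news today", "latest sports news"]
numeric = np.array([[12, 0.10], [40, 0.05], [7, 0.20], [55, 0.02]])
labels = np.array([1, 1, 0, 0])

# Text features stay sparse
text_features = CountVectorizer(binary=True).fit_transform(docs)

# Stack the sparse text columns and the numeric columns side by side
combined = hstack([text_features, csr_matrix(numeric)]).tocsr()
print(combined.shape)

# scikit-learn estimators accept sparse input directly
clf = LogisticRegression(max_iter=1000).fit(combined, labels)
```

With this approach the model is trained on the stacked matrix directly, so a helper that expects a DataFrame, such as `evaluate_logistic_regression()`, would need a small adaptation.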

### EXERCISE: Train a `TfidfVectorizer()`

The goal of this exercise is to train a TfidfVectorizer() to transform text data into a term-frequency inverse document frequency (TF-IDF) matrix and integrate these features with other numerical features. You will also practice training and evaluating a Logistic Regression model using the combined features.

Tasks:

1. Train a `TfidfVectorizer()` on the "boilerplate" feature of the StumbleUpon dataset.
2. Inspect the resulting vocabulary and consider what types of words are prioritized in the TF-IDF representation.
3. Transform the dataset using the trained TfidfVectorizer().
4. Convert the transformed TF-IDF features into a Pandas DataFrame and evaluate model performance using a Logistic Regression model.
5. Combine the TF-IDF features with additional numerical features like alchemy_category and evaluate the impact on model performance.
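To build intuition for the inverse document frequency part, the sketch below recomputes the IDF weights by hand on three made-up documents and compares them with the `idf_` attribute of a fitted `TfidfVectorizer()`. With `smooth_idf=True`, scikit-learn documents the weight as `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents, for illustration only
docs = ["the cat sat", "the dog sat", "the bird flew"]

vec = TfidfVectorizer(smooth_idf=True)
vec.fit(docs)

n_docs = len(docs)
manual = {}
fitted = {}
for term in ["the", "sat", "bird"]:
    # document frequency: number of documents that contain the term
    df = sum(term in doc.split() for doc in docs)
    manual[term] = np.log((1 + n_docs) / (1 + df)) + 1
    fitted[term] = vec.idf_[vec.vocabulary_[term]]
    print(term, "df =", df, "manual idf =", manual[term], "sklearn idf =", fitted[term])
```

Terms that appear in every document, such as "the", receive the lowest weight, while rarer terms such as "bird" are weighted more heavily.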


#### TASK 1: Train a `TfidfVectorizer()`

In [57]:
# TASK 1 EXERCISE

# Create the TfidfVectorizer with specified parameters
idf_dtm = TfidfVectorizer(
    min_df=10,
    max_features=1000,
    strip_accents="unicode",
    analyzer="word",
    token_pattern=r"\w{1,}",
    ngram_range=(1, 2),
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=True
)

# Split the dataset into training and testing sets
train, test = train_test_split(dataset, test_size=0.2, random_state=42)

# Fit the TfidfVectorizer to the training data
idf_dtm.fit(train['boilerplate'])

#### TASK 2: Inspect the Vocabulary

In [58]:
# TASK 2 EXERCISE

# View the vocabulary extracted by the TfidfVectorizer
print(idf_dtm.get_feature_names_out())

['0' '000' '08' '09' '1' '1 1' '1 2' '1 4' '1 cup' '10' '100' '11' '12'
 '13' '14' '15' '16' '17' '18' '19' '2' '2 cup' '2 cups' '2 tablespoons'
 '20' '2007' '2008' '2009' '2010' '2011' '2012' '21' '22' '23' '24' '25'
 '27' '3' '3 4' '30' '350' '4' '4 cup' '40' '5' '50' '6' '7' '8' '9' 'a'
 'a bit' 'a few' 'a good' 'a great' 'a large' 'a little' 'a lot' 'a new'
 'a small' 'able' 'able to' 'about' 'about the' 'above' 'according'
 'according to' 'actually' 'add' 'add the' 'added' 'after' 'again'
 'against' 'age' 'ago' 'air' 'all' 'all of' 'all the' 'allow' 'almost'
 'along' 'already' 'also' 'always' 'am' 'amazing' 'american' 'amount'
 'amount of' 'an' 'and' 'and a' 'and i' 'and it' 'and more' 'and other'
 'and the' 'and then' 'and you' 'another' 'any' 'anything' 'app' 'apple'
 'april' 'are' 'are the' 'around' 'around the' 'art' 'article' 'as' 'as a'
 'as the' 'as well' 'as you' 'aside' 'at' 'at a' 'at least' 'at the'
 'available' 'away' 'baby' 'back' 'bacon' 'bad' 'bag' 'bake' 'baked'
 '

#### TASK 3: Transform the Dataset Using the Trained TfidfVectorizer()

In [60]:
# TASK 3 EXERCISE

# Transform the text data using the trained TfidfVectorizer
dataset_idf_dtm = idf_dtm.transform(dataset['boilerplate'].astype(str))

# Check the shape of the transformed dataset
print(dataset_idf_dtm.shape)

(7395, 1000)


#### TASK 4: Convert TF-IDF Features to a DataFrame and Evaluate the Model

In [69]:
# TASK 4 EXERCISE

# Convert the transformed TF-IDF features to a DataFrame
dataset_idf = pd.DataFrame(dataset_idf_dtm.toarray(), columns=idf_dtm.get_feature_names_out())

# Add the label column to the DataFrame
dataset_idf['label'] = dataset['label'].values
dataset_idf.columns = dataset_idf.columns.astype(str)


evaluate_logistic_regression(dataset_idf, dataset_idf.columns[:-1])

(np.float64(0.8153353417765948), LogisticRegression())

#### TASK 5: Add Additional Features to the Model

In [70]:
# TASK 5 EXERCISE

# Define the list of existing numerical features
feat = [
    "numberOfLinks", "spelling_errors_ratio", "linkwordscore", "alchemy_category_score", "news_front_page",
    "category__M", "category_arts_entertainment", "category_business", "category_computer_internet",
    "category_culture_politics", "category_gaming", "category_health", "category_law_crime", "category_recreation",
    "category_religion", "category_science_technology", "category_sports", "category_unknown", "category_weather"
]

# Convert the transformed TF-IDF features to a DataFrame
dataset_idf = pd.DataFrame(dataset_idf_dtm.toarray(), columns=idf_dtm.get_feature_names_out())

# Join the text features with the numerical features
dataset_idf = dataset_idf.join(dataset[feat].reset_index(drop=True))

# Add the label column
dataset_idf['label'] = dataset['label'].values

# Ensure all column names are strings
dataset_idf.columns = dataset_idf.columns.astype(str)

# Evaluate the model using a custom function
evaluate_logistic_regression(dataset_idf, dataset_idf.columns[:-1])

(np.float64(0.8142019362952997), LogisticRegression())

### EXERCISE: Applying Text Pre-Processing Techniques

The goal of this exercise is to integrate lemmatization and stemming into your text processing pipeline using both `CountVectorizer()` and `TfidfVectorizer()`. Reducing words to their base form can help the model generalize across variations of a word, which may lead to better performance.

You'll build on the previous exercises, testing how lemmatization and stemming affect the size of the vocabulary, model performance, and feature representations.

Tasks:
1. Apply Lemmatization with `CountVectorizer()`.
  - Explore the custom preprocessor function that lemmatizes each word in the text.
  - Pass the custom preprocessor to the CountVectorizer and fit it to the "boilerplate" feature of the training set.
  - Evaluate the Logistic Regression model using the lemmatized features and compare the performance to the previous exercises.
2. Apply stemming with `TfidfVectorizer()`.
  - Create a custom preprocessor function that stems each word in the text.
  - Pass the custom preprocessor to the TfidfVectorizer and fit it to the "boilerplate" feature of the training set.
  - Evaluate the Logistic Regression model using the stemmed features and compare the performance to the lemmatized features.
3. Compare the sizes of the vocabularies before and after stemming and lemmatization.
4. Look at some samples from the vocabularies before and after stemming and lemmatization to see how these approaches have altered what is learned by `CountVectorizer()` or `TfidfVectorizer()`.
5. Compare the ROC-AUC scores across the original, lemmatized, and stemmed models. Reflect on which approach performed better and why.



In [71]:
%pip install nltk



In [72]:
# Import the necessary Libraries
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
import nltk

nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

#### TASK 1: Apply Lemmatization with `CountVectorizer()`

In [74]:
# TASK 1 EXERCISE

# Import the necessary library
from nltk.stem import WordNetLemmatizer

# Create the lemmatizer
lemmatizer = WordNetLemmatizer()

# Define the custom lemmatization function
def lemmatize_text(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

# Create the CountVectorizer with custom preprocessor
lemmatized_vectorizer = CountVectorizer(
    preprocessor=lemmatize_text,
    min_df=10,
    max_features=100,
    strip_accents="unicode",
    analyzer="word",
    token_pattern=r"\w{1,}",
    ngram_range=(1, 2),
    binary=True
)

# Fit the vectorizer to the text data
lemmatized_vectorizer.fit(train["boilerplate"].astype(str))

# Transform the text data and train a model
train_string_vectors = lemmatized_vectorizer.transform(train["boilerplate"])
test_string_vectors = lemmatized_vectorizer.transform(test["boilerplate"])

# Train and evaluate a Logistic Regression model
model = LogisticRegression()
model.fit(train_string_vectors, train["label"])
predictions = model.predict(test_string_vectors)

# Evaluate the model using ROC-AUC
print("Model performance (ROC-AUC):", roc_auc_score(test["label"], predictions))

Model performance (ROC-AUC): 0.7660744005706238
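One detail worth knowing: passing a custom `preprocessor` replaces scikit-learn's built-in preprocessing step, which is where lowercasing and accent stripping normally happen. Capitalized and lowercase forms of the same word can then end up as separate features. The sketch below shows the effect on two made-up documents; the lambda preprocessors are hypothetical stand-ins for `lemmatize_text`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents that differ only in capitalization
docs = ["The cat", "the cat"]

# A custom preprocessor replaces the built-in one, so the default lowercase=True has no effect
passthrough = CountVectorizer(preprocessor=lambda text: text)
passthrough.fit(docs)

# Lowercasing inside the custom preprocessor restores the usual behavior
lowercasing = CountVectorizer(preprocessor=lambda text: text.lower())
lowercasing.fit(docs)

print(sorted(passthrough.vocabulary_))  # ['The', 'cat', 'the']
print(sorted(lowercasing.vocabulary_))  # ['cat', 'the']
```

If case-insensitive features are wanted, one option is to lowercase the text inside the custom function before lemmatizing or stemming each word.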


#### TASK 2: Apply Stemming with `TfidfVectorizer()`


In [75]:
# Create the stemmer
stemmer = PorterStemmer()

# Define the custom stemming function
def stem_text(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

# Create the TfidfVectorizer with custom preprocessor
stemmed_vectorizer = TfidfVectorizer(
    preprocessor=stem_text,
    min_df=10,
    max_features=1000,
    strip_accents="unicode",
    analyzer="word",
    token_pattern=r"\w{1,}",
    ngram_range=(1, 2),
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=True
)

# Fit the vectorizer to the text data
stemmed_vectorizer.fit(train["boilerplate"].astype(str))

# Transform the text data and train a model
train_string_vectors = stemmed_vectorizer.transform(train["boilerplate"])
test_string_vectors = stemmed_vectorizer.transform(test["boilerplate"])

# Train and evaluate a Logistic Regression model
model = LogisticRegression()
model.fit(train_string_vectors, train["label"])
predictions = model.predict(test_string_vectors)

# Evaluate the model using ROC-AUC
print("Model performance (ROC-AUC):", roc_auc_score(test["label"], predictions))


Model performance (ROC-AUC): 0.8041598844120929


#### TASK 3: Compare Vocabulary Sizes

In [76]:
# TASK 3 EXERCISE

# Print the size of the original vectorizers' vocabularies
print("Original CountVectorizer vocabulary size:", len(string_vectorizer.vocabulary_))
print("Original TfidfVectorizer vocabulary size:", len(idf_dtm.vocabulary_))

# Print the size of the lemmatized and stemmed vocabularies
print("Lemmatized CountVectorizer vocabulary size:", len(lemmatized_vectorizer.vocabulary_))
print("Stemmed TfidfVectorizer vocabulary size:", len(stemmed_vectorizer.vocabulary_))

Original CountVectorizer vocabulary size: 100
Original TfidfVectorizer vocabulary size: 1000
Lemmatized CountVectorizer vocabulary size: 100
Stemmed TfidfVectorizer vocabulary size: 1000
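All four sizes equal the `max_features` setting, because the vectorizers keep only the most frequent terms up to that cap. To see whether normalization actually shrinks the vocabulary, the comparison can be repeated with `max_features=None`. The sketch below illustrates this on made-up documents; `crude_stem` is a hypothetical stand-in for a real stemmer such as `PorterStemmer` and only strips two suffixes.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents containing several forms of the same words
docs = ["cooks pancakes", "cook pancake", "cooked pancakes"]

def crude_stem(text):
    # Hypothetical stand-in for a real stemmer: lowercases and strips "ed" or "s"
    words = []
    for word in text.lower().split():
        for suffix in ("ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        words.append(word)
    return " ".join(words)

def vocab_size(preprocessor, max_features):
    # Fit a vectorizer and report how many terms it kept
    vec = CountVectorizer(preprocessor=preprocessor, max_features=max_features)
    vec.fit(docs)
    return len(vec.vocabulary_)

# With a cap, both vocabularies have the same size
print("Capped:  ", vocab_size(None, 2), vocab_size(crude_stem, 2))

# Without a cap, the stemmed vocabulary is smaller
print("Uncapped:", vocab_size(None, None), vocab_size(crude_stem, None))
```

On the real dataset, removing the cap while keeping `min_df=10` would give vocabulary sizes that reflect the effect of lemmatization and stemming.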


#### TASK 4: Explore a sample of the vocabularies before and after Stemming or Lemmatization

In [77]:
# Print a sample of the vocabulary before and after lemmatization
print("Original CountVectorizer vocabulary sample:", list(string_vectorizer.vocabulary_.keys())[:50])
print("Lemmatized CountVectorizer vocabulary sample:", list(lemmatized_vectorizer.vocabulary_.keys())[:50])

print("Original TF-IDF vocabulary sample:", list(idf_dtm.vocabulary_.keys())[:50])
print("Stemmed TF-IDF vocabulary sample:", list(stemmed_vectorizer.vocabulary_.keys())[:50])

Original CountVectorizer vocabulary sample: ['title', 'body', 'a', 'the', 'in', 'at', 'of', 'by', 'your', 'will', '3', 'who', 'and', 'be', 'that', 's', 'what', 'are', 'an', 'for', 'new', 'which', 'to', 'from', 'this', 'also', 'can', 'when', 'these', 'all', 'is', 'not', 'it', 'just', 'one', 'on', 'out', 'was', 'about', 'do', 'they', 'have', 'as', 'well', 'other', '5', '1', '10', 'than', '2']
Lemmatized CountVectorizer vocabulary sample: ['title', 'body', 'If', 'you', 'like', 'this', 'to', '10', 'of', 'into', 'your', 'The', 't', 'that', 'which', 'is', 'more', 'in', 'with', 'This', 'but', 'it', 's', 'the', 'on', 'a', 'for', 'and', 'if', 'than', '2', 'only', 'ha', 'up', 'are', 'one', 'there', 'at', 'just', 'make', 'not', 'so', 'time', 'can', 'be', 'have', 'some', 'You', 'or', 'all']
Original TF-IDF vocabulary sample: ['title', 'cool', 'story', 'summer', 'sports', 'body', 'if', 'you', 'like', 'this', 'feel', 'free', 'to', 'share', '10', 'when', 'think', 'of', 'may', 'men', 'into', 'your', '