PROBLEM 3 : Pairwise Feature selection for text
* On 20NG, run feature selection using scikit-learn's built-in "chi2" criterion to select the top 200 features.

* Rerun the classification task and compare performance with HW3A-PB1. Then repeat the whole pipeline with the "mutual-information" criterion.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
# Load 20 Newsgroups dataset
categories = ["alt.atheism", "sci.med", "sci.electronics", "comp.graphics", "talk.politics.guns", "sci.crypt"]
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Convert text data into numerical features
vectorizer = TfidfVectorizer(stop_words='english', max_features=30000)
X = vectorizer.fit_transform(data.data)

In [None]:
import numpy as np

# Labels: TfidfVectorizer emits one row per input document, so X and y already align
y = np.asarray(data.target)

In [None]:
# Apply chi2 feature selection (keep top 200 features)
selector = SelectKBest(chi2, k=200)
X_selected = selector.fit_transform(X, y)  # Transform X while keeping top features

In [None]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)


In [None]:
# Train classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

In [None]:
# Evaluate performance
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with chi2-selected features: {accuracy:.4f}")

Accuracy with chi2-selected features: 0.7336
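
Note that the cells above fit the vectorizer and the selector on all documents before splitting, so the test documents influence which features are kept. A possible alternative, sketched here on a made-up corpus (the documents, `docs_train`, `y_tr`, and `pipe` are illustrative names, not part of the assignment), is to split the raw text first and wrap every step in a `Pipeline` so that all fitting happens on the training split only. The numbers reported in this notebook were not produced this way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up two-class corpus standing in for 20NG (assumption for illustration)
graphics = [
    "render the image with a pixel shader",
    "texture mapping on the gpu render pass",
    "pixel image format and color palette",
    "shader compiles for the gpu texture unit",
    "image render uses polygon mesh",
    "polygon mesh texture and pixel color",
]
medicine = [
    "the doctor gave the patient a dose",
    "patient symptoms and medicine dose",
    "clinic doctor treats infection",
    "medicine for infection symptoms",
    "the patient visited the clinic",
    "dose of medicine from the doctor",
]
docs = graphics + medicine
labels = [0] * len(graphics) + [1] * len(medicine)

# Split the raw text first so the test documents never influence
# the vocabulary, the IDF weights, or the chi2 scores.
docs_train, docs_test, y_tr, y_te = train_test_split(
    docs, labels, test_size=0.25, random_state=42, stratify=labels
)

# Vectorizer, selector and classifier are all fitted on the training split only
pipe = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    SelectKBest(chi2, k=5),
    MultinomialNB(),
)
pipe.fit(docs_train, y_tr)

acc = accuracy_score(y_te, pipe.predict(docs_test))
print(f"Held-out accuracy (toy data): {acc:.4f}")
```

For 20NG the same pipeline would take `data.data` and `data.target` as inputs, with `k=200`.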


Mutual information

In [None]:
from sklearn.feature_selection import mutual_info_classif

# Apply Mutual Information feature selection (keep top 200 features)
selector_mi = SelectKBest(mutual_info_classif, k=200)
X_selected_mi = selector_mi.fit_transform(X, y)  # Transform X with top features


In [None]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_selected_mi, y, test_size=0.2, random_state=42)


In [None]:
# Train classifier
clf_mi = MultinomialNB()
clf_mi.fit(X_train, y_train)

In [None]:
# Evaluate performance
y_pred_mi = clf_mi.predict(X_test)
accuracy_mi = accuracy_score(y_test, y_pred_mi)
print(f"Accuracy with mutual information-selected features: {accuracy_mi:.4f}")


Accuracy with mutual information-selected features: 0.5504


20NG with full features

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Load the 20 Newsgroups dataset
categories = ["alt.atheism", "sci.med", "sci.electronics", "comp.graphics", "talk.politics.guns", "sci.crypt"]
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Vectorize text data (no feature selection, using full feature set)
vectorizer = TfidfVectorizer(stop_words='english', max_features=30000)
X_full = vectorizer.fit_transform(data.data)
y_full = np.asarray(data.target)  # one label per row of X_full



In [None]:
# Train/test split
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X_full, y_full, test_size=0.2, random_state=42)

# Train classifier on full dataset (without feature selection)
clf_full = MultinomialNB()
clf_full.fit(X_train_full, y_train_full)

# Evaluate performance
y_pred_full = clf_full.predict(X_test_full)
accuracy_full = accuracy_score(y_test_full, y_pred_full)
print(f"Baseline Accuracy (Full Feature Set - No Selection): {accuracy_full:.4f}")


Baseline Accuracy (Full Feature Set - No Selection): 0.8646


# Results
* 20NG CHI2 and Multinomial NB : accuracy 0.73

* 20NG MI and Multinomial NB : accuracy 0.55

* 20NG and Multinomial NB with all features : accuracy 0.86

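Reducing the feature set from 30,000 to 200 discards useful information. MultinomialNB also tends to do better with a large vocabulary, because it pools evidence from many weakly informative words. These are the likely reasons why accuracy dropped after feature selection (0.86 with all features versus 0.73 with CHI2 and 0.55 with MI).
CHI2 also performs better than Mutual Information (MI) in this experiment. CHI2 scores each term by how strongly its occurrence depends on the class label, so the 200 terms it keeps tend to separate the classes well. MI also measures dependence between a term and the label, but the terms it ranked highest here appear to be less discriminative for MultinomialNB. Since MultinomialNB treats features as conditionally independent given the class, it relies on each retained word being informative on its own, and the words CHI2 preserved seem to meet that requirement better, which would explain the higher accuracy.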
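
To see how much accuracy depends on the number of features kept, one option is to sweep `k` and cross-validate each setting. The helper `accuracy_by_k` below is a hypothetical sketch, and the Poisson count matrix is a synthetic stand-in for a document-term matrix (an assumption so the cell runs offline); its numbers say nothing about 20NG.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def accuracy_by_k(X, y, ks, cv=5):
    """Mean cross-validated accuracy of chi2 selection + MultinomialNB for each k."""
    results = {}
    for k in ks:
        # Selection sits inside the pipeline, so it is refitted on each training fold
        pipe = make_pipeline(SelectKBest(chi2, k=k), MultinomialNB())
        results[k] = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean()
    return results


# Synthetic stand-in for a document-term count matrix (assumption, not 20NG)
rng = np.random.default_rng(0)
n_docs, n_terms, n_classes = 300, 100, 3
y_demo = rng.integers(0, n_classes, size=n_docs)

# Every term has a low base rate; each class has 10 terms with a higher rate
rates = np.full((n_classes, n_terms), 0.5)
for c in range(n_classes):
    rates[c, c * 10:(c + 1) * 10] = 1.5
X_demo = rng.poisson(rates[y_demo])

sweep = accuracy_by_k(X_demo, y_demo, ks=[5, 20, 100])
for k, acc in sweep.items():
    print(f"k={k:3d}  mean CV accuracy={acc:.4f}")
```

On the real data, `X` and `y` from the cells above could be passed in with values such as `ks=[200, 1000, 5000, 30000]` to trace the gap between the 200-feature runs and the full-feature baseline.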