<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

_Authors: Dave Yerrington (SF)_

---

In this lab, we'll explore scikit-learn and NLTK's capabilities for processing text even further. We'll use the 20 newsgroups data set, which is provided by scikit-learn.

In [1]:
# Standard data science imports:
import pandas as pd
import numpy as np
import string
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
%matplotlib inline
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Chonn\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# Getting the scikit-learn data set:
from sklearn.datasets import fetch_20newsgroups

### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

The "20 Newsgroups" dataset is described [here](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

For this lab let's choose 4 categories to analyze.  The full list is given below.


```python
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
```

Note that the solution code will use these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [3]:
# Fetch the training and testing subsets for the chosen categories

categories = ['comp.sys.mac.hardware','comp.windows.x','misc.forsale','rec.autos']  # Fill in whatever categories you want to use!

# Setting our training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting our testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

**Question:** What does the `shuffle` argument do?  Why are we setting a `random_state`?

In [4]:
# A: The shuffle argument in fetch_20newsgroups() controls whether the documents are shuffled before being returned,
#    instead of being kept in their stored order.
# A: The random_state argument sets the seed used for that shuffle, so the same order is reproduced on every run.
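
To make the role of a seed concrete, here is a minimal sketch that uses Python's standard `random` module instead of scikit-learn. The `shuffled_copy` helper and the toy `docs` list are hypothetical names introduced only for illustration.

```python
import random

docs = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a dedicated, seeded generator."""
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out

first = shuffled_copy(docs, seed=42)
second = shuffled_copy(docs, seed=42)

print(first == second)                # same seed -> same order
print(sorted(first) == sorted(docs))  # shuffling only reorders the items
```

Fixing the seed is what lets two people running the same notebook compare results on identical data order.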

### 2) Inspect the data.

We've downloaded a few `newsgroups` categories and removed their headers, footers, and quotes.

Because this is a scikit-learn data set, it comes with pre-split training and testing sets (note: we were able to call "train" and "test" in subset).

Let's inspect them.

1) What data type is `data_train`?
- Is it a list? A dictionary? What else?
- How many data points does it contain?
- Inspect the first data point. What does it look like?

In [5]:
# A:
type(data_train)

sklearn.utils._bunch.Bunch
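
The type printed above, `Bunch`, is a dictionary subclass whose keys can also be read as attributes, which is why `data_train.data` and `data_train['data']` both work. The sketch below imitates that behaviour with a hypothetical `MiniBunch` class and made-up toy values. It is a stand-in for illustration, not scikit-learn's own code.

```python
class MiniBunch(dict):
    """Hypothetical stand-in: a dict whose keys are also readable as attributes."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

toy = MiniBunch(
    data=["first post text", "second post text"],
    target=[0, 1],
    target_names=["rec.autos", "misc.forsale"],
)

print(isinstance(toy, dict))            # it is still a dictionary
print(toy.data is toy["data"])          # attribute and key access return the same object
print(len(toy.data))                    # number of documents
print(toy.target_names[toy.target[1]])  # label name of the second document
```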

In [6]:
# Number of data points
num_data_points = len(data_train.data)
print(f"Number of data points: {num_data_points}")

Number of data points: 2350


In [7]:
# Inspect the first data point
first_data_point = data_train.data[0]
print("First data point:")
print(first_data_point)

First data point:
Dumbest options? Well here in the UK, BMW offer a 'no-smokers' option...
It just means they take the fag lighter out.... big deal....

BTW - I just bought a Honda CRX F1..... its neat... did consider an MR2 targa,
MX5 (you guys call it Miata?).... but that CRX just one my heart with that 
body kit and 8-spokes.... 


### 3) Create a bag-of-words model.

Let's train a model using a simple count vectorizer.

1) Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Eliminate English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer.
    - You will have to transform the `test_set`, too. Be careful to use the trained vectorizer without refitting it.
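
The "fit on train, transform test without refitting" rule can be sketched in plain Python. The helpers `tokenize`, `fit_vocabulary`, and `transform` are hypothetical names. The token pattern only roughly mirrors `CountVectorizer`'s default, and the real class returns a sparse matrix, not lists.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs):
    # Learn the term -> column mapping from the training documents only.
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def transform(docs, vocabulary):
    # Count terms per document; terms missing from the vocabulary are ignored.
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for term, n in counts.items():
            if term in vocabulary:
                row[vocabulary[term]] = n
        rows.append(row)
    return rows

train_docs = ["the car is fast", "the mac is for sale"]
test_docs = ["a fast mac and a slow bike"]

vocab = fit_vocabulary(train_docs)
X_train_toy = transform(train_docs, vocab)
X_test_toy = transform(test_docs, vocab)

print(len(vocab))
print(X_test_toy)
```

Because the test rows reuse the training vocabulary, they have the same number of columns as the training rows, and words never seen in training ("slow", "bike") simply contribute nothing.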

**Bonus**
- Try a couple of modifications:
    - Restrict the `max_features`.
    - Change the `max_df` and `min_df`.
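
As a rough picture of what these three parameters do, the sketch below filters a toy vocabulary by document frequency and then caps its size. `select_terms` is a hypothetical helper. It assumes `min_df` is an absolute document count and `max_df` a proportion, and its tie-breaking is an arbitrary choice that need not match scikit-learn's.

```python
from collections import Counter

def select_terms(tokenized_docs, min_df=1, max_df=1.0, max_features=None):
    """Pick vocabulary terms by document frequency, then by total count."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    total = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
        total.update(tokens)          # count every occurrence
    kept = [t for t in doc_freq
            if doc_freq[t] >= min_df and doc_freq[t] <= max_df * n_docs]
    # Keep the most frequent terms first; ties are broken alphabetically.
    kept.sort(key=lambda t: (-total[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

toy_docs = [
    ["car", "engine", "for", "sale"],
    ["mac", "for", "sale"],
    ["car", "for", "repair"],
    ["window", "manager", "for", "mac"],
]

filtered = select_terms(toy_docs, min_df=2, max_df=0.75)
capped = select_terms(toy_docs, min_df=2, max_df=0.75, max_features=2)
print(filtered)
print(capped)
```

Here "for" is dropped because it appears in every document, and the terms that appear in only one document are dropped by `min_df=2`.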

In [9]:
# A:
# Initialize the CountVectorizer
count_vectorizer = CountVectorizer()

# Fit the vectorizer on the training data
X_train = count_vectorizer.fit_transform(data_train.data)

# Get the feature dictionary (vocabulary) size
feature_dictionary_size = len(count_vectorizer.vocabulary_)



# Initialize a new CountVectorizer with English stop words removed
count_vectorizer = CountVectorizer(stop_words='english')

# Fit the new vectorizer on the training data
X_train_no_stopwords = count_vectorizer.fit_transform(data_train.data)

# Get the feature dictionary (vocabulary) size without stop words
feature_dictionary_size_no_stopwords = len(count_vectorizer.vocabulary_)


print(f"Size of the feature dictionary (with stop words): {feature_dictionary_size}")
print(f"Size of the feature dictionary (without stop words): {feature_dictionary_size_no_stopwords}")

Size of the feature dictionary (with stop words): 23525
Size of the feature dictionary (without stop words): 23231


In [10]:
X_train = count_vectorizer.fit_transform(data_train.data)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, data_train.target, test_size=0.2, random_state=42)

In [11]:
logistic_regression_model = LogisticRegression()
logistic_regression_model.fit(X_train, y_train)

y_pred = logistic_regression_model.predict(X_valid)

In [12]:
accuracy = accuracy_score(y_valid, y_pred)
report = classification_report(y_valid, y_pred, target_names=data_train.target_names)
print(f"Accuracy: {accuracy}")
print(report)

Accuracy: 0.8297872340425532
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.88      0.81      0.85       124
       comp.windows.x       0.91      0.90      0.91       107
         misc.forsale       0.85      0.75      0.80       131
            rec.autos       0.70      0.88      0.78       108

             accuracy                           0.83       470
            macro avg       0.84      0.83      0.83       470
         weighted avg       0.84      0.83      0.83       470
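
The columns of the report can be reproduced by hand from counts of true positives, false positives, and false negatives. The sketch below does this for made-up toy labels. `per_class_scores` is a hypothetical helper, not part of scikit-learn.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn  # support = number of true instances

y_true_toy = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 0, 1, 1, 1, 2, 0, 2]

accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
scores_toy = {label: per_class_scores(y_true_toy, y_pred_toy, label) for label in (0, 1, 2)}

print(accuracy_toy)
for label, (p, r, f1, support) in scores_toy.items():
    print(label, round(p, 2), round(r, 2), round(f1, 2), support)
```

The "macro avg" row is the unweighted mean of the per-class values, while "weighted avg" weights each class by its support.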



In [13]:
# Transform the test set using the trained vectorizer with no stop words
X_test_transformed = count_vectorizer.transform(data_test.data)

# Make predictions on the test set
y_test_pred = logistic_regression_model.predict(X_test_transformed)

# Evaluate the model's performance on the test set
accuracy_test = accuracy_score(data_test.target, y_test_pred)
report_test = classification_report(data_test.target, y_test_pred, target_names=data_test.target_names)

print(f"Accuracy on the test set (without refitting the vectorizer): {accuracy_test}")
print(report_test)

Accuracy on the test set (without refitting the vectorizer): 0.8339719029374202
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.85      0.79      0.82       385
       comp.windows.x       0.94      0.82      0.88       395
         misc.forsale       0.86      0.79      0.83       390
            rec.autos       0.73      0.93      0.82       396

             accuracy                           0.83      1566
            macro avg       0.85      0.83      0.83      1566
         weighted avg       0.85      0.83      0.84      1566



In [14]:
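# Initialize CountVectorizer with restricted features and custom max_df and min_df.
# NOTE: the max_features, max_df, and min_df values are missing from this cell;
# only stop_words='english' survives, so the output below may not reproduce exactly.
count_vectorizer = CountVectorizer(stop_words='english')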

# Fit the vectorizer on the training data
X_train = count_vectorizer.fit_transform(data_train.data)

# Get the size of the feature dictionary
feature_dictionary_size = len(count_vectorizer.vocabulary_)

# Split the data into a training and validation set
X_train_split, X_valid, y_train_split, y_valid = train_test_split(X_train, data_train.target, test_size=0.2, random_state=42)

# Initialize the logistic regression model
logistic_regression_model = LogisticRegression()

# Fit the logistic regression model on the training data
logistic_regression_model.fit(X_train_split, y_train_split)

# Make predictions on the validation set
y_pred = logistic_regression_model.predict(X_valid)

# Evaluate the model's performance on the validation set
accuracy = accuracy_score(y_valid, y_pred)
report = classification_report(y_valid, y_pred, target_names=data_train.target_names)

print(f"Accuracy: {accuracy}")
print(report)

Accuracy: 0.8212765957446808
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.89      0.79      0.84       124
       comp.windows.x       0.89      0.90      0.89       107
         misc.forsale       0.85      0.76      0.80       131
            rec.autos       0.69      0.86      0.77       108

             accuracy                           0.82       470
            macro avg       0.83      0.83      0.82       470
         weighted avg       0.83      0.82      0.82       470



### 4) Test Out Hashing and TF-IDF.

Let's see if hashing or TF-IDF improves our accuracy.

1) Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the CountVectorizer?
- Print out the number of features for this model.
- Initialize a TF-IDF vectorizer and repeat the analysis above.
- Print out the number of features for this model.
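
The idea behind a hashing vectorizer can be sketched without any vocabulary at all: each token is hashed straight to a column index. The sketch below uses MD5 from the standard library and plain unsigned counts purely for illustration. `HashingVectorizer` itself uses a different hash function and, by default, alternating signs, so the actual indices and values will differ. `hashed_counts` is a hypothetical helper.

```python
import hashlib

def hashed_counts(tokens, n_features=16):
    """Map tokens to a fixed-length count vector with no stored vocabulary."""
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "little") % n_features
        vec[idx] += 1  # different tokens may collide on the same index
    return vec

tokens = ["honda", "crx", "body", "kit", "crx"]
vec_a = hashed_counts(tokens)
vec_b = hashed_counts(tokens)

print(len(vec_a))      # always n_features, whatever the text contains
print(vec_a == vec_b)  # stateless: no fitting step, same input -> same output
print(2 ** 20)         # HashingVectorizer's default n_features
```

Since nothing is learned from the data, the number of columns is fixed in advance; this is also why there is no `vocabulary_` to inspect afterwards.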

**Bonus**
- Change the parameters of either (or both) models to improve your score.

In [15]:
# A:
# Initialize HashingVectorizer with its default number of features (n_features=2**20)
hashing_vectorizer = HashingVectorizer(stop_words='english')

# Transform the training data using HashingVectorizer
X_train_hashing = hashing_vectorizer.transform(data_train.data)

# Split the data into a training and validation set
X_train_split, X_valid, y_train_split, y_valid = train_test_split(X_train_hashing, data_train.target, test_size=0.2, random_state=42)

# Initialize the logistic regression model
logistic_regression_model = LogisticRegression()

# Fit the logistic regression model on the training data
logistic_regression_model.fit(X_train_split, y_train_split)

# Make predictions on the validation set
y_pred = logistic_regression_model.predict(X_valid)

# Evaluate the model's performance on the validation set
accuracy_hashing = accuracy_score(y_valid, y_pred)
report_hashing = classification_report(y_valid, y_pred, target_names=data_train.target_names)

# Print the number of features for HashingVectorizer
num_features_hashing = X_train_hashing.shape[1]
print(f"Number of features (HashingVectorizer): {num_features_hashing}")
print(f"Accuracy (HashingVectorizer): {accuracy_hashing}")
print(report_hashing)


Number of features (HashingVectorizer): 1048576
Accuracy (HashingVectorizer): 0.851063829787234
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.88      0.84      0.86       124
       comp.windows.x       0.88      0.93      0.91       107
         misc.forsale       0.89      0.77      0.83       131
            rec.autos       0.75      0.88      0.81       108

             accuracy                           0.85       470
            macro avg       0.85      0.86      0.85       470
         weighted avg       0.86      0.85      0.85       470



In [16]:
# Transform the test set with the same HashingVectorizer (it is stateless, so there is nothing to refit)
X_test_transformed = hashing_vectorizer.transform(data_test.data)

# Make predictions on the test set
y_test_pred = logistic_regression_model.predict(X_test_transformed)

# Evaluate the model's performance on the test set
accuracy_test = accuracy_score(data_test.target, y_test_pred)
report_test = classification_report(data_test.target, y_test_pred, target_names=data_test.target_names)

print(f"Accuracy on the test set (without refitting the vectorizer): {accuracy_test}")
print(report_test)

Accuracy on the test set (without refitting the vectorizer): 0.8448275862068966
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.85      0.77      0.81       385
       comp.windows.x       0.93      0.87      0.90       395
         misc.forsale       0.90      0.80      0.85       390
            rec.autos       0.74      0.93      0.82       396

             accuracy                           0.84      1566
            macro avg       0.85      0.84      0.85      1566
         weighted avg       0.85      0.84      0.85      1566



TF-IDF
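
As a reminder of what TF-IDF computes, the sketch below weights raw term counts by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit length. This follows the smoothed formula described in scikit-learn's documentation, but it is a simplified illustration on made-up toy documents, and `tfidf_rows` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (vocabulary, idf weights, L2-normalised TF-IDF rows)."""
    n_docs = len(tokenized_docs)
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    # Smoothed IDF: rarer terms receive larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, idf, rows

toy_docs = [["car", "for", "sale"], ["mac", "for", "sale"], ["car", "engine"]]
vocab, idf, rows = tfidf_rows(toy_docs)

print(vocab)
print(idf["engine"] > idf["car"])  # "engine" is rarer, so it is weighted more heavily
```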

In [17]:
# Initialize TF-IDF Vectorizer with no restriction on the number of features
tfidf_vectorizer = TfidfVectorizer(max_features=None, stop_words='english')

# Transform the training data using TF-IDF Vectorizer
X_train_tfidf = tfidf_vectorizer.fit_transform(data_train.data)

# Split the data into a training and validation set
X_train_split, X_valid, y_train_split, y_valid = train_test_split(X_train_tfidf, data_train.target, test_size=0.2, random_state=42)

# Initialize the logistic regression model
logistic_regression_model = LogisticRegression()

# Fit the logistic regression model on the training data
logistic_regression_model.fit(X_train_split, y_train_split)

# Make predictions on the validation set
y_pred = logistic_regression_model.predict(X_valid)

# Evaluate the model's performance on the validation set
accuracy_tfidf = accuracy_score(y_valid, y_pred)
report_tfidf = classification_report(y_valid, y_pred, target_names=data_train.target_names)

# Print the number of features for TF-IDF Vectorizer
num_features_tfidf = X_train_tfidf.shape[1]
print(f"Number of features (TF-IDF Vectorizer): {num_features_tfidf}")
print(f"Accuracy (TF-IDF Vectorizer): {accuracy_tfidf}")
print(report_tfidf)

Number of features (TF-IDF Vectorizer): 23231
Accuracy (TF-IDF Vectorizer): 0.8638297872340426
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.93      0.86      0.90       124
       comp.windows.x       0.94      0.94      0.94       107
         misc.forsale       0.88      0.76      0.81       131
            rec.autos       0.73      0.92      0.81       108

             accuracy                           0.86       470
            macro avg       0.87      0.87      0.87       470
         weighted avg       0.87      0.86      0.86       470



In [18]:
# Transform the test set using the already-fitted TF-IDF vectorizer (no refitting)
X_test_transformed = tfidf_vectorizer.transform(data_test.data)


# Make predictions on the test set
y_test_pred = logistic_regression_model.predict(X_test_transformed)

# Evaluate the model's performance on the test set
accuracy_test = accuracy_score(data_test.target, y_test_pred)
report_test = classification_report(data_test.target, y_test_pred, target_names=data_test.target_names)

print(f"Accuracy on the test set (without refitting the vectorizer): {accuracy_test}")
print(report_test)

Accuracy on the test set (without refitting the vectorizer): 0.8671775223499362
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.90      0.80      0.85       385
       comp.windows.x       0.94      0.90      0.92       395
         misc.forsale       0.89      0.81      0.85       390
            rec.autos       0.77      0.96      0.85       396

             accuracy                           0.87      1566
            macro avg       0.88      0.87      0.87      1566
         weighted avg       0.88      0.87      0.87      1566



Tuning
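
Two of the options used in the tuning cells in this section are `ngram_range` and `sublinear_tf`. The sketch below shows, on a toy token list, how word n-grams enlarge the feature set and how sublinear scaling replaces a raw count `tf` with `1 + ln(tf)`. `word_ngrams` and `sublinear` are hypothetical helpers written for illustration.

```python
import math

def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams with lengths in the inclusive range."""
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def sublinear(tf):
    # Dampens repeated terms: 10 occurrences count far less than 10x one occurrence.
    return 1 + math.log(tf) if tf > 0 else 0.0

tokens = ["honda", "crx", "body", "kit"]
up_to_bigrams = word_ngrams(tokens, (1, 2))
up_to_trigrams = word_ngrams(tokens, (1, 3))

print(up_to_bigrams)
print(len(up_to_bigrams), len(up_to_trigrams))
print(sublinear(1), round(sublinear(10), 2))
```

Longer n-grams capture short phrases such as "body kit", at the cost of many more (and sparser) features.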

In [19]:
# A:
# Initialize HashingVectorizer with word n-grams of length 1 to 3 (n_features set explicitly to 2**20)
hashing_vectorizer = HashingVectorizer(n_features=2**20, ngram_range=(1, 3),stop_words='english')

# Transform the training data using HashingVectorizer
X_train_hashing = hashing_vectorizer.transform(data_train.data)

# Split the data into a training and validation set
X_train_split, X_valid, y_train_split, y_valid = train_test_split(X_train_hashing, data_train.target, test_size=0.2, random_state=42)

# Initialize the logistic regression model
logistic_regression_model = LogisticRegression()

# Fit the logistic regression model on the training data
logistic_regression_model.fit(X_train_split, y_train_split)

# Make predictions on the validation set
y_pred = logistic_regression_model.predict(X_valid)

# Evaluate the model's performance on the validation set
accuracy_hashing = accuracy_score(y_valid, y_pred)
report_hashing = classification_report(y_valid, y_pred, target_names=data_train.target_names)

# Print the number of features for HashingVectorizer
num_features_hashing = X_train_hashing.shape[1]
print(f"Number of features (HashingVectorizer): {num_features_hashing}")
print(f"Accuracy (HashingVectorizer): {accuracy_hashing}")
print(report_hashing)


Number of features (HashingVectorizer): 1048576
Accuracy (HashingVectorizer): 0.8468085106382979
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.91      0.82      0.86       124
       comp.windows.x       0.87      0.93      0.90       107
         misc.forsale       0.90      0.76      0.82       131
            rec.autos       0.73      0.91      0.81       108

             accuracy                           0.85       470
            macro avg       0.85      0.85      0.85       470
         weighted avg       0.86      0.85      0.85       470



In [20]:
# Transform the test set with the same HashingVectorizer (it is stateless, so there is nothing to refit)
X_test_transformed = hashing_vectorizer.transform(data_test.data)


# Make predictions on the test set
y_test_pred = logistic_regression_model.predict(X_test_transformed)

# Evaluate the model's performance on the test set
accuracy_test = accuracy_score(data_test.target, y_test_pred)
report_test = classification_report(data_test.target, y_test_pred, target_names=data_test.target_names)

print(f"Accuracy on the test set (without refitting the vectorizer): {accuracy_test}")
print(report_test)

Accuracy on the test set (without refitting the vectorizer): 0.8448275862068966
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.88      0.76      0.82       385
       comp.windows.x       0.92      0.88      0.90       395
         misc.forsale       0.91      0.79      0.84       390
            rec.autos       0.73      0.94      0.82       396

             accuracy                           0.84      1566
            macro avg       0.86      0.84      0.85      1566
         weighted avg       0.86      0.84      0.85      1566



In [21]:
# Initialize TF-IDF Vectorizer restricted to 1,000 features, with unigrams and bigrams
tfidf_vectorizer = TfidfVectorizer(
    max_features=1000,  # You can adjust this parameter
    ngram_range=(1, 2),  # Consider unigrams and bigrams
    stop_words='english',  # Remove common English stop words
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=True,
)


# Transform the training data using TF-IDF Vectorizer
X_train_tfidf = tfidf_vectorizer.fit_transform(data_train.data)

# Split the data into a training and validation set
X_train_split, X_valid, y_train_split, y_valid = train_test_split(X_train_tfidf, data_train.target, test_size=0.2, random_state=42)

# Initialize the logistic regression model
logistic_regression_model = LogisticRegression()

# Fit the logistic regression model on the training data
logistic_regression_model.fit(X_train_split, y_train_split)

# Make predictions on the validation set
y_pred = logistic_regression_model.predict(X_valid)

# Evaluate the model's performance on the validation set
accuracy_tfidf = accuracy_score(y_valid, y_pred)
report_tfidf = classification_report(y_valid, y_pred, target_names=data_train.target_names)

# Print the number of features for TF-IDF Vectorizer
num_features_tfidf = X_train_tfidf.shape[1]
print(f"Number of features (TF-IDF Vectorizer): {num_features_tfidf}")
print(f"Accuracy (TF-IDF Vectorizer): {accuracy_tfidf}")
print(report_tfidf)

Number of features (TF-IDF Vectorizer): 1000
Accuracy (TF-IDF Vectorizer): 0.8404255319148937
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.90      0.83      0.87       124
       comp.windows.x       0.90      0.91      0.90       107
         misc.forsale       0.88      0.76      0.81       131
            rec.autos       0.71      0.89      0.79       108

             accuracy                           0.84       470
            macro avg       0.85      0.85      0.84       470
         weighted avg       0.85      0.84      0.84       470



In [22]:
# Transform the test set using the already-fitted TF-IDF vectorizer (no refitting)
X_test_transformed = tfidf_vectorizer.transform(data_test.data)


# Make predictions on the test set
y_test_pred = logistic_regression_model.predict(X_test_transformed)

# Evaluate the model's performance on the test set
accuracy_test = accuracy_score(data_test.target, y_test_pred)
report_test = classification_report(data_test.target, y_test_pred, target_names=data_test.target_names)

print(f"Accuracy on the test set (without refitting the vectorizer): {accuracy_test}")
print(report_test)

Accuracy on the test set (without refitting the vectorizer): 0.8326947637292464
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.84      0.77      0.81       385
       comp.windows.x       0.91      0.84      0.88       395
         misc.forsale       0.87      0.81      0.83       390
            rec.autos       0.74      0.91      0.82       396

             accuracy                           0.83      1566
            macro avg       0.84      0.83      0.83      1566
         weighted avg       0.84      0.83      0.83      1566



### 5. [Bonus] Robust Text Preprocessing

Your mission, should you choose to accept it, is to write a preprocessing function for all of your text. This function should:

- convert all text to lowercase,
- remove punctuation,
- stem or lemmatize each word of the text,
- remove stopwords.

The function should receive one string of text and return the processed text.

Once you have built your function, use it to process your train and test data, then fit a Logistic Regression model to see how it performs.
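
Before reaching for NLTK, the lowercase, punctuation, and stop-word steps can be tried with the standard library alone. The sketch below leaves stemming out and uses a tiny, hypothetical `TOY_STOPWORDS` set in place of NLTK's English list. `basic_clean` is likewise a made-up helper name.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
TOY_STOPWORDS = {"the", "a", "is", "it", "and", "to"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def basic_clean(text, stop_words=TOY_STOPWORDS):
    """Lowercase, strip punctuation, and drop stop words (no stemming)."""
    text = text.lower().translate(PUNCT_TABLE)
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned = basic_clean("It's neat... and the CRX is FAST!")
print(cleaned)
```

Note that stripping the apostrophe turns "it's" into "its", which is not in the toy stop-word set, so it survives; the order of the steps affects which words are removed.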

In [23]:
stemmer = SnowballStemmer('english')

# Build the stop-word set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove punctuation
    text = ''.join([char for char in text if char not in string.punctuation])

    # Tokenize the text (split it into words)
    words = text.split()

    # Apply stemming to each word
    words = [stemmer.stem(word) for word in words]

    # Remove stopwords (checked against the stemmed form of each word)
    words = [word for word in words if word not in stop_words]

    # Join the processed words back into a single string
    processed_text = ' '.join(words)

    return processed_text

In [24]:
# Preprocess and update the text data in data_train
for i in range(len(data_train.data)):
    data_train.data[i] = preprocess_text(data_train.data[i])

# Preprocess and update the text data in test
for i in range(len(data_test.data)):
    data_test.data[i] = preprocess_text(data_test.data[i])

In [25]:
# Initialize TF-IDF Vectorizer with no restriction on the number of features
tfidf_vectorizer = TfidfVectorizer(max_features=None, stop_words='english')

# Transform the training data using TF-IDF Vectorizer
X_train_tfidf = tfidf_vectorizer.fit_transform(data_train.data)

# Split the data into a training and validation set
X_train_split, X_valid, y_train_split, y_valid = train_test_split(X_train_tfidf, data_train.target, test_size=0.2, random_state=42)

# Initialize the logistic regression model
logistic_regression_model = LogisticRegression()

# Fit the logistic regression model on the training data
logistic_regression_model.fit(X_train_split, y_train_split)

# Make predictions on the validation set
y_pred = logistic_regression_model.predict(X_valid)

# Evaluate the model's performance on the validation set
accuracy_tfidf = accuracy_score(y_valid, y_pred)
report_tfidf = classification_report(y_valid, y_pred, target_names=data_train.target_names)

# Print the number of features for TF-IDF Vectorizer
num_features_tfidf = X_train_tfidf.shape[1]
print(f"Number of features (TF-IDF Vectorizer): {num_features_tfidf}")
print(f"Accuracy (TF-IDF Vectorizer): {accuracy_tfidf}")
print(report_tfidf)

Number of features (TF-IDF Vectorizer): 21845
Accuracy (TF-IDF Vectorizer): 0.8595744680851064
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.91      0.85      0.88       124
       comp.windows.x       0.92      0.93      0.92       107
         misc.forsale       0.88      0.76      0.82       131
            rec.autos       0.75      0.93      0.83       108

             accuracy                           0.86       470
            macro avg       0.87      0.87      0.86       470
         weighted avg       0.87      0.86      0.86       470



In [26]:
# Transform the preprocessed test set using the already-fitted TF-IDF vectorizer (no refitting)
X_test_transformed = tfidf_vectorizer.transform(data_test.data)


# Make predictions on the test set
y_test_pred = logistic_regression_model.predict(X_test_transformed)

# Evaluate the model's performance on the test set
accuracy_test = accuracy_score(data_test.target, y_test_pred)
report_test = classification_report(data_test.target, y_test_pred, target_names=data_test.target_names)

print(f"Accuracy on the test set (without refitting the vectorizer): {accuracy_test}")
print(report_test)

Accuracy on the test set (without refitting the vectorizer): 0.8607918263090677
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.87      0.80      0.84       385
       comp.windows.x       0.95      0.88      0.91       395
         misc.forsale       0.88      0.82      0.85       390
            rec.autos       0.77      0.94      0.85       396

             accuracy                           0.86      1566
            macro avg       0.87      0.86      0.86      1566
         weighted avg       0.87      0.86      0.86      1566

