<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

_Authors: Dave Yerrington (SF)_

---

In this lab we will further explore sklearn and NLTK's capabilities for processing text. We will use the 20 Newsgroup dataset, which is provided by sklearn.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Getting SKLearn dataset and other NLP tools
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

Look up the function documentation for how to grab the data.

You should pull these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [3]:
# Set up the categories for loading
categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space"
]

In [4]:
print("Loading 20 newsgroups dataset for categories:")
print(categories)

Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']


In [5]:
# Load the data using function fetch_20newsgroups()
data_train = fetch_20newsgroups(subset="train", categories=categories, remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset="test", categories=categories, remove=('headers', 'footers', 'quotes'))

In [6]:
print("%d documents" % len(data_train.filenames))
print("%d categories" % len(data_train.target_names))
print()

2034 documents
4 categories



In [7]:
list(data_train.keys())

['data', 'filenames', 'target_names', 'target', 'DESCR']

### 2. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is an sklearn dataset, it comes with pre-split train and test sets (note we were able to call 'train' and 'test' in subset).

Let's inspect them.

1. What data type is `data_train`? It is a `Bunch` object, which behaves like a dictionary
- There are 2034 data points (documents), approximately 508 documents per category
- The real data lies in the 'data' and 'target' attributes
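
To make the dictionary-like behaviour concrete, here is a minimal sketch that builds a small `Bunch` by hand. The field values are made up for illustration and are not taken from the newsgroups data.

```python
from sklearn.utils import Bunch

# Hypothetical toy container; the real one is returned by fetch_20newsgroups
toy = Bunch(data=["first post", "second post"], target=[0, 1])

# Each key is reachable both as a dictionary item and as an attribute
print(toy["data"] is toy.data)

# Bunch subclasses dict, so the usual dictionary methods work
print(isinstance(toy, dict))
print(sorted(toy.keys()))
```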

In [8]:
# 'Bunch' object that behaves like extended dictionary
print(type(data_train))

<class 'sklearn.utils.Bunch'>


In [9]:
# Category name
data_train['target_names'][0]

'alt.atheism'

In [10]:
# Integer index of category
data_train['target'][0]

1

In [11]:
# File path of the document on disk
data_train['filenames'][0]

'C:\\Users\\shmel\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.graphics\\38816'

In [12]:
# Data to machine-learn
data_train['data'][0]

"Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation.  But if you save\nto a .PRJ file their positions/orientation are preserved.  Does anyone\nknow why this information is not stored in the .3DS file?  Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych"

In [13]:
# DESCR is a string, so [0] returns only its first character; print(data_train['DESCR']) shows the full description
data_train['DESCR'][0]

'.'

In [14]:
# No. of data points
data_train.filenames.shape

(2034,)

In [15]:
# First 10 target values (integer indices into target_names)
data_train.target[:10]

array([1, 3, 2, 0, 2, 0, 2, 1, 2, 1], dtype=int64)
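
The integers in `target` are indices into `target_names`. The sketch below decodes the ten labels printed above, assuming the four category names are stored in alphabetical order (consistent with index 0 being `alt.atheism` and the first document being a `comp.graphics` post).

```python
from collections import Counter

# Assumed order of data_train.target_names (alphabetical)
target_names = ["alt.atheism", "comp.graphics", "sci.space", "talk.religion.misc"]

# The first ten labels from data_train.target[:10]
first_ten = [1, 3, 2, 0, 2, 0, 2, 1, 2, 1]

# Translate each integer label into its category name
named = [target_names[i] for i in first_ten]
print(named[0])

# Count how many of the ten documents fall in each category
counts = Counter(named)
for name in target_names:
    print(name, counts[name])
```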

### 3. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit the training data
- how big is the feature dictionary?
- repeat, eliminating English stop words
- is the dictionary smaller?
- transform the training data using the trained vectorizer
- evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer
    - you will have to transform the test set too. Be careful to use the trained vectorizer, without re-fitting it

**BONUS:**
- try a couple modifications:
    - restrict the max_features
    - change max_df and min_df
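
Before running this on the full dataset, the steps can be tried on a toy corpus. This is a minimal sketch with made-up sentences: it compares vocabulary size with and without English stop words, then transforms unseen text with the already-fitted vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature train/test split
train_docs = ["the rocket is on the pad", "the image is a bitmap"]
test_docs = ["a rocket image"]

# Fit one vectorizer with the default settings and one that drops stop words
plain = CountVectorizer().fit(train_docs)
no_stop = CountVectorizer(stop_words="english").fit(train_docs)
print(len(plain.vocabulary_), len(no_stop.vocabulary_))

# Reuse the fitted vocabulary on the test text: transform, never fit_transform
X_test = no_stop.transform(test_docs)
print(X_test.shape)
print(sorted(no_stop.vocabulary_))
```

Calling `fit_transform` on the test text instead would build a different vocabulary, so the columns would no longer line up with what the classifier was trained on.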

In [16]:
# Using Count Vectorizer and Logistic Regression
count_vect = CountVectorizer(stop_words='english', min_df=2, ngram_range=(1, 2))

# fit_transform learns the vocabulary from the training documents and returns the document-term matrix
count_train = count_vect.fit_transform(data_train.data)

In [17]:
count_train.shape

(2034, 25590)

In [18]:
# transform test set using trained vectorizer (to evaluate it)
count_test = count_vect.transform(data_test.data)

In [19]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)
logreg.fit(count_train, data_train.target)

LogisticRegression(max_iter=1000)

In [20]:
count_pred = logreg.predict(count_test)

In [21]:
from sklearn import metrics

print("Accuracy score from count vectorization:", np.round(metrics.accuracy_score(data_test.target, count_pred), 2))
print("F1 score from count vectorization:", np.round(metrics.f1_score(data_test.target, count_pred, average='macro'), 2))

Accuracy score from count vectorization: 0.74
F1 score from count vectorization: 0.71
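
Macro-averaged F1 weights every class equally, so it can fall below accuracy when one class is predicted poorly. The sketch below computes both by hand on made-up labels; `macro_f1` is a helper written for this illustration, not an sklearn function.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels: class 1 is rarer and half of it is misclassified
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 2), round(macro_f1(y_true, y_pred), 2))
```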


### 4. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features
- does the score improve with respect to the count vectorizer?
- print out the number of features for this model
- Initialize a TF-IDF Vectorizer and repeat the analysis above
- print out the number of features for this model

**BONUS:**
- Change the parameters of either (or both!) models to improve your score
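
To see what TF-IDF weighting does to raw counts, here is a small pure-Python sketch. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the two tokenized documents are made up.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized documents
docs = [["space", "shuttle", "launch"], ["space", "image", "image"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for doc in docs:
    print({term: round(w, 3) for term, w in tfidf(doc).items()})
```

A term that appears in every document ("space") ends up with a lower weight than a term unique to one document ("shuttle"), even though both occur once.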

In [57]:
# TF-IDF vectorizer & SGD Classifier
tfidf_vect = TfidfVectorizer(stop_words='english')

tfidf_train = tfidf_vect.fit_transform(data_train.data)

In [58]:
# Transform test set with trained vectorizer
tfidf_test = tfidf_vect.transform(data_test.data)

In [59]:
# Number of features extracted from TF-IDF
features = tfidf_vect.get_feature_names_out()
print("Total number of features:", len(features))

Total number of features: 26576


In [24]:
# Note: loss='log' was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3
sgd_log = SGDClassifier(loss='log', class_weight='balanced', n_jobs=-1, random_state=42)

sgd_log.fit(tfidf_train, data_train.target)

sgd_log.score(tfidf_test, data_test.target)

0.7812269031781227

In [49]:
tfidf_pred = sgd_log.predict(tfidf_test)

print("Accuracy score from TF-IDF vectorization:", np.round(metrics.accuracy_score(data_test.target, tfidf_pred), 2))
print("F1 score from TF-IDF vectorization:", np.round(metrics.f1_score(data_test.target, tfidf_pred, average='macro'), 2))

Accuracy score from TF-IDF vectorization: 0.78
F1 score from TF-IDF vectorization: 0.76


In [27]:
# Hashing vectorizer & SGD Classifier
from sklearn.feature_extraction.text import HashingVectorizer

hash_vect = HashingVectorizer(stop_words='english', ngram_range=(1, 2))

# HashingVectorizer is stateless: fit_transform hashes each token to a column index and returns the document-term matrix
hash_train = hash_vect.fit_transform(data_train.data)

In [28]:
# transform test set using the same vectorizer (hashing needs no fitting, so the columns always match)
hash_test = hash_vect.transform(data_test.data)

In [55]:
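# HashingVectorizer stores no vocabulary, so feature names cannot be recovered.
# The number of columns is fixed by n_features (default 2**20) and equals hash_train.shape[1]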
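
Although a `HashingVectorizer` keeps no vocabulary, its output width is still known up front. The sketch below, on two made-up sentences, reads the default `n_features` and shows that a smaller value directly sets the number of columns.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical toy documents
docs = ["the rocket reached orbit", "render the polygon mesh"]

# The default width of the hashed feature space
default_hv = HashingVectorizer()
print(default_hv.n_features)

# A smaller table: more hash collisions, but a much narrower matrix
small_hv = HashingVectorizer(n_features=2**10)
X = small_hv.transform(docs)  # no fit needed, the vectorizer is stateless
print(X.shape)

# There is no learned vocabulary to inspect
print(hasattr(small_hv, "vocabulary_"))
```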

In [29]:
sgd = SGDClassifier(loss='log', class_weight='balanced', n_jobs=-1, random_state=42)

sgd.fit(hash_train, data_train.target)

sgd.score(hash_test, data_test.target)

0.7634885439763488

In [30]:
hash_pred = sgd.predict(hash_test)

print("Accuracy score from Hashing vectorization:", np.round(metrics.accuracy_score(data_test.target, hash_pred), 2))
print("F1 score from Hashing vectorization:", np.round(metrics.f1_score(data_test.target, hash_pred, average='macro'), 2))

Accuracy score from Hashing vectorization: 0.76
F1 score from Hashing vectorization: 0.74


In [None]:
# Vectorizers in order of best to worst performance:
# TF-IDF, Hashing, Count

In [40]:
# Tuning TF-IDF vectorizer with SGD classifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')), 
    ('sgd', SGDClassifier(loss='log', class_weight='balanced', n_jobs=-1, random_state=42))])

# max_df values are proportions of documents: terms appearing in more than this fraction are ignored
parameters = {'tfidf__max_df': (0.25, 0.5, 0.75), 
              'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)]}

grid_search = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=3)
grid_search.fit(data_train.data, data_train.target)

print("Best parameters set:")
print(grid_search.best_estimator_.steps)

Fitting 2 folds for each of 9 candidates, totalling 18 fits
Best parameters set:
[('tfidf', TfidfVectorizer(max_df=0.25, ngram_range=(1, 2), stop_words='english')), ('sgd', SGDClassifier(class_weight='balanced', loss='log', n_jobs=-1, random_state=42))]
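
The winning settings can also be read from `best_params_` and `best_score_`, which is more compact than printing the pipeline steps. This is a minimal sketch on a made-up eight-document corpus; it swaps in `LogisticRegression` for the SGD classifier purely to keep the toy run stable, and the scores it prints say nothing about the newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical two-class toy corpus (0 = space, 1 = graphics)
docs = [
    "the rocket reached orbit", "the shuttle launch was delayed",
    "orbit insertion burn for the probe", "a new rocket engine test",
    "render the polygon mesh", "the shader draws each pixel",
    "texture mapping on a polygon", "a new pixel shader trick",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameter names are "<step name>__<parameter>"
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_df": [0.5, 1.0],
}

search = GridSearchCV(toy_pipeline, param_grid, cv=2)
search.fit(docs, labels)

print(search.best_params_)  # dict of the winning parameter values
print(search.best_score_)   # mean cross-validated accuracy for that setting
```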


In [60]:
# Re-train new TF-IDF vectorizer with tuned parameters
tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.25)

# New features matrix for classification tasks
X_train_vect = tfidf.fit_transform(data_train.data)

# New matrix for testing vectorizer performance
X_test_vect = tfidf.transform(data_test.data)

In [61]:
# Number of features extracted from TF-IDF
tfidf_features = tfidf.get_feature_names_out()
print("Total number of features:", len(tfidf_features))

Total number of features: 189472


In [62]:
# Re-fit the SGD classifier on the tuned TF-IDF features
sgd.fit(X_train_vect, data_train.target)

print("Accuracy score from TF-IDF vectorization:", np.round(sgd.score(X_test_vect, data_test.target), 2))

Accuracy score from TF-IDF vectorization: 0.77


In [63]:
y_pred = sgd.predict(X_test_vect)

print("F1 score from TF-IDF vectorization:", np.round(metrics.f1_score(data_test.target, y_pred, average='macro'), 2))

F1 score from TF-IDF vectorization: 0.75


In [None]:
# The SGD classifier's test accuracy dropped slightly after tuning (0.78 to 0.77),
# while adding bigrams grew the feature count from 26,576 to 189,472
# Other hyperparameters to tune: max_features, min_df
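
As a rough illustration of how `min_df` and `max_features` shrink the vocabulary, here is a sketch on four made-up documents. The specific values are arbitrary and would need tuning on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "rocket launch orbit",
    "rocket engine test",
    "polygon mesh render",
    "polygon shader pixel",
]

# Default: keep every term
full = TfidfVectorizer().fit(docs)

# min_df=2: keep only terms that occur in at least two documents
frequent = TfidfVectorizer(min_df=2).fit(docs)

# max_features=3: keep only the three most frequent terms across the corpus
capped = TfidfVectorizer(max_features=3).fit(docs)

for name, vec in [("default", full), ("min_df=2", frequent), ("max_features=3", capped)]:
    print(name, len(vec.vocabulary_))
```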