# Unsupervised Learning with Vox Politics

For this capstone I have found a data set that contains many articles from Vox. You can see the data set for yourself here:

[Vox Articles on data.world](https://data.world/elenadata/vox-articles)

Using the main body text of these articles, I would like to build clustering models that group the articles by author. To make this more challenging, I will use only articles that have been categorized as "Politics & Policy", so that the content is broadly similar. Let's begin by importing the necessary modules and taking a look at our data.

In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
%matplotlib inline

# Modules for making the model
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Data cleaning and feature importance
from bs4 import BeautifulSoup
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.feature_selection import SelectKBest, chi2

# K-Means module
from sklearn.cluster import KMeans

# Spectral Clustering module
from sklearn.cluster import SpectralClustering

# Mean Shift modules
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.preprocessing import FunctionTransformer

# Metrics to evaluate models
import time
from sklearn import metrics
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cdist
from sklearn.metrics import classification_report

In [2]:
# Read in the full TSV file
raw_df = pd.read_csv('dsjVoxArticles.tsv', sep='\t', header=0)

# Take a look at it
raw_df.head()

| | title | author | category | published_date | updated_on | slug | blurb | body |
|---|---|---|---|---|---|---|---|---|
| 0 | Bitcoin is down 60 percent this year. Here's w... | Timothy B. Lee | Business & Finance | 2014-03-31 14:01:30 | 2014-12-16 16:37:36 | http://www.vox.com/2014/3/31/5557170/bitcoin-b... | Bitcoins have lost more than 60 percent of the... | `<p>The markets haven't been kind to<span> </sp...` |
| 1 | 6 health problems marijuana could treat better... | German Lopez | War on Drugs | 2014-03-31 15:44:21 | 2014-11-17 00:20:33 | http://www.vox.com/2014/3/31/5557700/six-probl... | Medical marijuana could fill gaps that current... | `<p>Twenty states have so far legalized the med...` |
| 2 | 9 charts that explain the history of global we... | Matthew Yglesias | Business & Finance | 2014-04-10 13:30:01 | 2014-12-16 15:47:02 | http://www.vox.com/2014/4/10/5561608/9-charts-... | These nine charts from Thomas Piketty's new bo... | `<p>Thomas Piketty's book <i>Capital in the 21s...` |
| 3 | Remember when legal marijuana was going to sen... | German Lopez | Criminal Justice | 2014-04-03 23:25:55 | 2014-05-06 21:58:42 | http://www.vox.com/2014/4/3/5563134/marijuana-... | Three months after legalizing marijuana, Denve... | `<p><span>When Colorado legalized recreational ...` |
| 4 | Obamacare succeeded for one simple reason: it'... | Sarah Kliff | Health Care | 2014-04-01 20:26:14 | 2014-11-18 15:09:14 | http://www.vox.com/2014/4/1/5570780/the-two-re... | After a catastrophic launch, Obamacare still s... | `<p>There's a very simple reason that Obamacare...` |


Now that we have the data loaded, let's pick out the 10 most prolific authors in the Politics & Policy category to build our models on.

In [3]:
# Take just the politics and policy articles
main_df = raw_df[raw_df['category'] == 'Politics & Policy']

# Print out the top ten authors from that list and our total # of articles
top10 = main_df.author.value_counts()[:10]
print("Top 10 P&P authors and how many articles they've written:")
print(top10)
print('\n', "Total number of articles by the top 10:", top10.sum())

Top 10 P&P authors and how many articles they've written:
Matthew Yglesias    198
Andrew Prokop       174
German Lopez        168
Sarah Kliff         142
Dara Lind           103
Dylan Matthews      100
Libby Nelson         96
Timothy B. Lee       73
Ezra Klein           70
Jonathan Allen       61
Name: author, dtype: int64

 Total number of articles by the top 10: 1185


Now we can make a dataframe with just these authors and the main body of their articles. We'll also have to strip out the HTML markup that was picked up when the creator of this data set scraped the articles.

In [4]:
# Make a list from the names of the top 10 and take their works for a new dataframe
t10_list = top10.keys()

top10_df = main_df[main_df['author'].isin(t10_list)].copy()

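# Strip the HTML markup out of the "body" column using Beautiful Soup
# (naming the parser explicitly avoids a warning and keeps results consistent across machines)
top10_df['body'] = [BeautifulSoup(body, 'html.parser').get_text() for body in top10_df['body']]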

# Drop the columns I won't need
top10_df = top10_df[['author', 'body']]

# Take a look at it
top10_df.head()

| | author | body |
|---|---|---|
| 21 | Matthew Yglesias | Patricia Arquette's speech about the gender pa... |
| 53 | Andrew Prokop | Who really matters in our democracy — the gene... |
| 141 | Andrew Prokop | We've written about gerrymandering here on Vox... |
| 193 | Ezra Klein | Presidents consistently overpromise and underd... |
| 209 | Dylan Matthews | Let's imagine Daniel and Henry are vacationing... |
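As a small illustration of the cleaning step, the sketch below runs `get_text()` on a made-up HTML fragment (it is not taken from the data set). It also shows one detail worth checking: by default `get_text()` joins text nodes with no separator, so words on either side of a tag boundary can fuse together. Passing `separator=" "` is one possible way to avoid that.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, shaped like the scraped article bodies
html = "<p>First paragraph.</p><p>Second one.</p>"

soup = BeautifulSoup(html, "html.parser")

# Default: text nodes are concatenated directly
fused = soup.get_text()

# With a separator, text from adjacent tags stays apart
spaced = soup.get_text(separator=" ")

print(repr(fused))
print(repr(spaced))
```

Fused words would show up as spurious tokens in a bag-of-words representation, so it may be worth inspecting a few cleaned bodies by eye.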


Since our first two models tend to favor clusters of similar size, we will take just the first 61 articles from each of these authors, which matches the count of the least prolific author in the top 10. This still leaves us with 610 articles in total, which should serve our purposes.

In [5]:
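# Take the first 61 articles from each author and stack them into one dataframe
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
top10_unif_df = pd.concat(
    [top10_df[top10_df['author'] == author].iloc[0:61, :] for author in t10_list]
)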
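The same "first n rows per author" selection can also be sketched with `groupby(...).head(n)`. The toy dataframe below is a stand-in for `top10_df`; its author names and counts are invented for illustration.

```python
import pandas as pd

# Toy stand-in for top10_df; author names and counts are made up
toy_df = pd.DataFrame({
    "author": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "body": [f"article {i}" for i in range(9)],
})

n_per_author = 2  # the notebook uses 61

# head(n) on a groupby keeps the first n rows of each group
balanced_df = toy_df.groupby("author").head(n_per_author)

counts = balanced_df["author"].value_counts()
print(counts)
```

One difference from stacking per-author slices: `head` keeps rows in their original order rather than grouping them author by author. That should not matter here, since the data is shuffled by `train_test_split` anyway.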

Looking good! Now let's make sure we aren't missing any fields before we dive into the models.

In [6]:
# Make sure we aren't missing any fields
top10_unif_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 610 entries, 21 to 10650
Data columns (total 2 columns):
author    610 non-null object
body      610 non-null object
dtypes: object(2)
memory usage: 14.3+ KB


Now that we're certain our data is ready to use, we can create our variables.

In [7]:
# Create our feature variable
X = top10_unif_df.body

# Create our target categories
Y = top10_unif_df.author

# Split our data up into 75% for training and 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state = 42)

### K-Means Model

The data is ready to be modeled, so we'll begin with a K-Means model. K-Means needs the number of clusters up front, and since we know there are 10 authors, it is a logical model to start with.

In [8]:
# Start a timer so that we can see how long it takes
km_start = time.time()

In [9]:
# Create a pipeline that will take us through the modeling process
kmeans_pipe = Pipeline([
    # Use TfidfVectorizer to turn the text into TF-IDF features and drop English stop words
    ('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words='english', sublinear_tf=True)),
    
    # Keep the best features by chi-squared score (k defaults to 10; this step uses the author labels)
    ('chi', SelectKBest(chi2)),
    
    # Choose our clustering algorithm
    ('clf', KMeans(n_clusters=10))
])

# Fit the data
kmeans_model = kmeans_pipe.fit(X_train, y_train)
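Two details of the feature steps are easy to miss, and the sketch below tries to make them visible on a made-up four-document corpus with hypothetical author labels. First, `SelectKBest` keeps only `k=10` features unless told otherwise. Second, the chi-squared score is computed against the labels, so this step is supervised even though the final clustering step is not.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Made-up documents and labels, for illustration only
docs = [
    "senate votes on the health care law",
    "health insurance premiums rise again",
    "court rules on marijuana legalization",
    "police reform and marijuana arrests",
]
authors = ["A", "A", "B", "B"]

# Same vectorizer settings as in the notebook
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", sublinear_tf=True)
X_tfidf = tfidf.fit_transform(docs)

# No k given, so the default of 10 features applies
selector = SelectKBest(chi2)
X_sel = selector.fit_transform(X_tfidf, authors)  # labels are required here

print("TF-IDF features:", X_tfidf.shape[1])
print("Features kept:  ", X_sel.shape[1])
```

With thousands of unigrams and bigrams in the real corpus, ten features is a very narrow view of each article, so trying a larger `k` could be a reasonable experiment.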

In [10]:
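# Make predictions
# Note: fit_predict refits the whole pipeline on the test set,
# so the clusters scored below are learned from X_test rather than X_train
kmeans_pred = kmeans_model.fit_predict(X_test, y_test)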

# See how well it did
print("Adjusted Rand Index score:", metrics.adjusted_rand_score(y_test, kmeans_pred))
print("The K-Means model took", time.time() - km_start, "seconds")

Adjusted Rand Index score: 0.04730302833673588
The K-Means model took 1.8170316219329834 seconds


An Adjusted Rand Index of 1 means the clusters match the authors perfectly, while 0 is what random labeling would be expected to score. At about 0.05, this model is barely doing better than random. That's no good! Let's try something else.
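To make the score more concrete, here is a hand-rolled sketch of the Adjusted Rand Index built from pair counts. It is only an illustration of the idea; the notebook itself relies on `metrics.adjusted_rand_score`, and the function name and toy labels below are my own.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts; assumes at least two items."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)

    # Pairs of items grouped together in both labelings, and in each one alone
    sum_both = sum(comb(c, 2) for c in contingency.values())
    sum_true = sum(comb(c, 2) for c in true_sizes.values())
    sum_pred = sum(comb(c, 2) for c in pred_sizes.values())

    # Agreement expected by chance, and the maximum possible agreement
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_both - expected) / (max_index - expected)


authors = ["A", "A", "A", "B", "B", "B"]
renamed = [1, 1, 1, 0, 0, 0]    # same grouping, different cluster names
scrambled = [0, 1, 2, 0, 1, 2]  # every author split across three clusters

ari_renamed = adjusted_rand_index(authors, renamed)
ari_scrambled = adjusted_rand_index(authors, scrambled)
print(ari_renamed, ari_scrambled)
```

The renamed labeling scores 1.0 because only the grouping matters, not the cluster names, which is why the metric suits clustering output. The scrambled labeling comes out negative, meaning it agrees with the authors less than chance would.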

### Spectral Clustering Model

Spectral clustering also lets us specify the number of clusters, so let's see how this algorithm performs instead.
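One practical difference from K-Means is worth keeping in mind. A fitted `KMeans` can label unseen points with `predict`, while scikit-learn's `SpectralClustering` offers only `fit` and `fit_predict`, so it can only label the data it is fitted on. The sketch below uses made-up 2-D points to show this.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

# Made-up points: two obvious groups for fitting, two new points to label
X_fit = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
X_new = np.array([[0.05, 0.1], [4.9, 5.2]])

# K-Means learns centroids once, then assigns new points to the nearest one
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_fit)
fit_labels = km.predict(X_fit)
new_labels = km.predict(X_new)  # no refit happens here

# Spectral clustering has no predict method at all
km_can_predict = hasattr(km, "predict")
sc_can_predict = hasattr(SpectralClustering(n_clusters=2), "predict")

print(new_labels, km_can_predict, sc_can_predict)
```

This is why calling `fit_predict` on the test set is the only way to get spectral clustering labels for it, at the cost of learning the clusters from the test articles themselves.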

In [11]:
# Start a timer so that we can see how long it takes
sc_start = time.time()

In [12]:
# Create a pipeline that will take us through the modeling process
sc_pipe = Pipeline([
    # Use TfidfVectorizer to turn the text into TF-IDF features and drop English stop words
    ('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words='english', sublinear_tf=True)),
    
    # Keep the best features by chi-squared score (k defaults to 10; this step uses the author labels)
    ('chi', SelectKBest(chi2)),
    
    # Choose our clustering algorithm
    ('clf', SpectralClustering(n_clusters=10))
])

# Fit the data
sc_model = sc_pipe.fit(X_train, y_train)

In [13]:
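# Make predictions
# Note: SpectralClustering has no predict method, so fit_predict refits
# the whole pipeline on the test set and labels those articles directly
sc_pred = sc_model.fit_predict(X_test, y_test)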

# See how well it did
print("Adjusted Rand Index score:", metrics.adjusted_rand_score(y_test, sc_pred))
print("The Spectral Clustering model took", time.time() - sc_start, "seconds")

Adjusted Rand Index score: 0.05347068950096642
The Spectral Clustering model took 1.7456982135772705 seconds


Well, that's a little better, but not by much. Let's try another approach.

### Mean-Shift Model

Mean-shift does not need the number of clusters in advance and does not assume that clusters are the same size, so it may suit our data better. Since we no longer need uniform cluster sizes, we can go back to the full data set of 1,185 articles.
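As a rough sketch of that behavior, the toy example below runs `MeanShift` on two made-up blobs of different sizes without telling it how many clusters to find. The bandwidth of 2.0 is hand-picked for this toy data; when no bandwidth is given, scikit-learn estimates one with `estimate_bandwidth`. The sparse-to-dense conversion mirrors what the text pipeline needs, since `MeanShift` works on dense arrays.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import MeanShift

# Two tight, well-separated toy blobs of different sizes (made-up points)
points = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.05, 0.05],
    [10.0, 10.0], [10.1, 10.0], [10.0, 10.1],
])

# Pretend the features arrive as a sparse matrix, as they do after TF-IDF
X_sparse = sparse.csr_matrix(points)
X_dense = X_sparse.toarray()  # plain ndarray, not np.matrix

# No n_clusters argument: the number of clusters falls out of the bandwidth
ms = MeanShift(bandwidth=2.0).fit(X_dense)
n_found = len(ms.cluster_centers_)

print("Clusters found:", n_found)
print("Labels:", ms.labels_)
```

On real TF-IDF features the result depends heavily on the bandwidth, so the number of clusters found is not guaranteed to land anywhere near the number of authors.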

In [14]:
# Create our feature variable
X = top10_df.body

# Create our target categories
Y = top10_df.author

# Split our data up into 75% for training and 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state = 42)

In [15]:
# Start a timer so that we can see how long it takes
ms_start = time.time()

In [16]:
# Create a pipeline that will take us through the modeling process
ms_pipe = Pipeline([
    # Use TfidfVectorizer to turn the text into TF-IDF features and drop English stop words
    ('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words='english', sublinear_tf=True)),
    
    # Keep the best features by chi-squared score (k defaults to 10; this step uses the author labels)
    ('chi', SelectKBest(chi2)),
    
    # Convert the sparse matrix to a dense ndarray so that MeanShift can use it
    # (toarray() rather than todense(), since recent scikit-learn rejects np.matrix input)
    ('ftrans', FunctionTransformer(lambda x: x.toarray(), validate=False)),
    
    # Choose our clustering algorithm
    ('clf', MeanShift())
])

# Fit the data
ms_model = ms_pipe.fit(X_train, y_train)

In [18]:
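# Make predictions
# Note: fit_predict refits the whole pipeline on the test set,
# so the clusters scored below are learned from X_test rather than X_train
ms_pred = ms_model.fit_predict(X_test, y_test)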

# See how well it did
print("Adjusted Rand Index score:", metrics.adjusted_rand_score(y_test, ms_pred))
print("The Mean-Shift model took", time.time() - ms_start, "seconds")

Adjusted Rand Index score: 0.04338273133219622
The Mean-Shift model took 16.928893566131592 seconds
