In this project, we'll examine executive orders from presidents Clinton through Trump. We'll look at the order lengths, and see whether the contents of each order can reliably predict the party of the president or the term in which the order was given (first or second). We'll start by reading in the dataset and examining the data.

In [1]:
import pandas as pd

import numpy as np

file = './documents_of_type_presidential_document_and_of_presidential_document_type_executive_order.csv'

ex_dataset = pd.read_csv(file)

In [2]:
print(type(ex_dataset))
ex_dataset.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,citation,document_number,end_page,html_url,pdf_url,type,subtype,publication_date,signing_date,start_page,title,disposition_notes,executive_order_number
0,83 FR 61505,2018-26156,61507,https://www.federalregister.gov/documents/2018...,https://www.gpo.gov/fdsys/pkg/FR-2018-11-29/pd...,Presidential Document,Executive Order,11/29/2018,11/27/2018,61505,Blocking Property of Certain Persons Contribut...,,13851
1,83 FR 55243,2018-24254,55245,https://www.federalregister.gov/documents/2018...,https://www.gpo.gov/fdsys/pkg/FR-2018-11-02/pd...,Presidential Document,Executive Order,11/02/2018,11/01/2018,55243,Blocking Property of Additional Persons Contri...,,13850
2,83 FR 48195,2018-20816,48200,https://www.federalregister.gov/documents/2018...,https://www.gpo.gov/fdsys/pkg/FR-2018-09-21/pd...,Presidential Document,Executive Order,09/21/2018,09/20/2018,48195,Authorizing the Implementation of Certain Sanc...,"See: EO 13694 of 4/1/2015, EO 13757 of 12/28/2...",13849
3,83 FR 46843,2018-20203,46848,https://www.federalregister.gov/documents/2018...,https://www.gpo.gov/fdsys/pkg/FR-2018-09-14/pd...,Presidential Document,Executive Order,09/14/2018,09/12/2018,46843,Imposing Certain Sanctions in the Event of For...,"See: 13694 of 4/1/2015, EO 13757 of 12/28/2016...",13848
4,83 FR 45321,2018-19514,45323,https://www.federalregister.gov/documents/2018...,https://www.gpo.gov/fdsys/pkg/FR-2018-09-06/pd...,Presidential Document,Executive Order,09/06/2018,08/31/2018,45321,Strengthening Retirement Security in America,,13847


Before we try and extract the text from the URLs, I want to create another column called "President" and one called "Party" so that we have those pre-labeled before we make any transformations. I can see online that orders 13765-13851 were signed by Trump, 13490-13764 were signed by Obama, 13199-13487 were signed by Bush and 12890-13197 were signed by Clinton. This will help us assign each order to the president that gave it.

In [3]:
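def president(x):
    # Lower bounds are inclusive so that 13490 and 13199 land with Obama and Bush, matching the ranges above
    if x >= 13765:
        return "Trump"
    elif x >= 13490:
        return "Obama"
    elif x >= 13199:
        return "Bush"
    else:
        return "Clinton"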
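
One way to double-check the cutoffs is to restate the ranges quoted above as bin edges and let pandas do the lookup. This is only a sketch: `order_numbers` is a made-up sample sitting on the range boundaries, and the bin edges are my restatement of the ranges in the text, not something taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of order numbers on the boundaries of each range
order_numbers = pd.Series([12890, 13197, 13199, 13487, 13490, 13764, 13765, 13851])

# Right-inclusive bin edges restating the ranges quoted in the text
edges = [-np.inf, 13198, 13489, 13764, np.inf]
labels = ["Clinton", "Bush", "Obama", "Trump"]

# pd.cut returns a categorical; convert to plain strings for easy comparison
presidents = pd.cut(order_numbers, bins=edges, labels=labels).astype(str)

print(pd.DataFrame({"order": order_numbers, "President": presidents}))
```

Checking the first and last number of each range like this makes an off-by-one cutoff easy to spot.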

In [4]:
ex_dataset["President"] = ex_dataset["executive_order_number"].apply(president)

print(ex_dataset.head())

      citation document_number  end_page  \
0  83 FR 61505      2018-26156     61507   
1  83 FR 55243      2018-24254     55245   
2  83 FR 48195      2018-20816     48200   
3  83 FR 46843      2018-20203     46848   
4  83 FR 45321      2018-19514     45323   

                                            html_url  \
0  https://www.federalregister.gov/documents/2018...   
1  https://www.federalregister.gov/documents/2018...   
2  https://www.federalregister.gov/documents/2018...   
3  https://www.federalregister.gov/documents/2018...   
4  https://www.federalregister.gov/documents/2018...   

                                             pdf_url                   type  \
0  https://www.gpo.gov/fdsys/pkg/FR-2018-11-29/pd...  Presidential Document   
1  https://www.gpo.gov/fdsys/pkg/FR-2018-11-02/pd...  Presidential Document   
2  https://www.gpo.gov/fdsys/pkg/FR-2018-09-21/pd...  Presidential Document   
3  https://www.gpo.gov/fdsys/pkg/FR-2018-09-14/pd...  Presidential Document   
4  

Now we create the function that will add a column for the different political parties

In [5]:
def party(x):
    if x == "Trump":
        return "Republican"
    elif x == "Obama":
        return "Democrat"
    elif x == "Bush":
        return "Republican"
    else:
        return "Democrat"
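
The same lookup can also be written as a dictionary passed to `Series.map`. This is a small sketch on a made-up series; the `party_of` name is mine. One difference to keep in mind is that `map` leaves a missing value for any name not in the dictionary, whereas the function above falls back to "Democrat".

```python
import pandas as pd

# Hypothetical lookup table mirroring the if/elif branches above
party_of = {
    "Trump": "Republican",
    "Obama": "Democrat",
    "Bush": "Republican",
    "Clinton": "Democrat",
}

presidents = pd.Series(["Trump", "Obama", "Bush", "Clinton"])
parties = presidents.map(party_of)

print(parties.tolist())
```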

In [6]:
ex_dataset["Party"] = ex_dataset["President"].apply(party)

print(ex_dataset.head())


      citation document_number  end_page  \
0  83 FR 61505      2018-26156     61507   
1  83 FR 55243      2018-24254     55245   
2  83 FR 48195      2018-20816     48200   
3  83 FR 46843      2018-20203     46848   
4  83 FR 45321      2018-19514     45323   

                                            html_url  \
0  https://www.federalregister.gov/documents/2018...   
1  https://www.federalregister.gov/documents/2018...   
2  https://www.federalregister.gov/documents/2018...   
3  https://www.federalregister.gov/documents/2018...   
4  https://www.federalregister.gov/documents/2018...   

                                             pdf_url                   type  \
0  https://www.gpo.gov/fdsys/pkg/FR-2018-11-29/pd...  Presidential Document   
1  https://www.gpo.gov/fdsys/pkg/FR-2018-11-02/pd...  Presidential Document   
2  https://www.gpo.gov/fdsys/pkg/FR-2018-09-21/pd...  Presidential Document   
3  https://www.gpo.gov/fdsys/pkg/FR-2018-09-14/pd...  Presidential Document   
4  

Now we're going to test how to get the text from one URL, and once we get that working we can nest the code in a for loop and apply it to the whole dataset

In [7]:
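# Select the URL by column name rather than by position, so a change in column order can't break it
test_url = ex_dataset.iloc[0]["html_url"]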
print(test_url)

https://www.federalregister.gov/documents/2018/11/29/2018-26156/blocking-property-of-certain-persons-contributing-to-the-situation-in-nicaragua


In [8]:
from urllib.request import Request, urlopen


In [9]:
import urllib.request
req = urllib.request.Request(test_url, headers={'User-Agent' : "Magic Browser"}) 
con = urllib.request.urlopen( req )
html = con.read()
# print(html)


Next we'll use Beautiful Soup to parse the HTML and pull out its text, which is easier to read than the raw markup

In [10]:
from bs4 import BeautifulSoup

# Name the parser explicitly so Beautiful Soup doesn't have to guess (and warn about it)
soup = BeautifulSoup(html, "lxml")

soup_text = soup.get_text()

# print(soup_text)



In [11]:
print(soup.title)

<title>
      Federal Register
       :: 
      Blocking Property of Certain Persons Contributing to the Situation in Nicaragua
    </title>


In [12]:
pretty_soup = soup.prettify()

# print(pretty_soup)

In [13]:
print(type(pretty_soup))

<class 'str'>


Now I looked through the documents to try and find a phrase that all the executive orders start with (full texts omitted here for brevity, but you can just remove the #s to print the text in full). I will use this to split the document so we only get the text of the order, not all the surrounding page text. I chose "Use the PDF linked in the document sidebar for the official electronic format". It's long, but I had issues with short phrases appearing earlier in the text. Below, I do the same with the end of the order and the phrase "[FR Doc."

In [14]:
doc1 = soup_text.split("Use the PDF linked in the document sidebar for the official electronic format",1)[1]
# print(doc1)



In [15]:
a,b = doc1.split("[FR Doc.")

print(a)


.





Start Printed Page 61505
Executive Order 13851 of November 27, 2018
        Blocking Property of Certain Persons Contributing to the Situation in Nicaragua
By the authority vested in me as President by the Constitution and the laws of the United States of America, including the International Emergency Economic Powers Act (50 U.S.C. 1701 et seq.) (IEEPA), the National Emergencies Act (50 U.S.C. 1601 et seq.) (NEA), section 212(f) of the Immigration and Nationality Act of 1952 (8 U.S.C. 1182(f)), and section 301 of title 3, United States Code,
I, DONALD J. TRUMP, President of the United States of America, find that the situation in Nicaragua, including the violent response by the Government of Nicaragua to the protests that began on April 18, 2018, and the Ortega regime's systematic dismantling and undermining of democratic institutions and the rule of law, its use of indiscriminate violence and repressive tactics against civilians, as well as its corruption leading to the destabi
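
The two splits can be wrapped in one helper that reports a missing marker instead of raising an error. The sketch below uses `str.partition`; the function name `extract_order_text` and the sample page are made up for illustration, and only the two marker phrases come from the cells above.

```python
START_MARKER = "Use the PDF linked in the document sidebar for the official electronic format"
END_MARKER = "[FR Doc."


def extract_order_text(page_text):
    """Return the text between the two markers, or None if either marker is missing."""
    _, found_start, rest = page_text.partition(START_MARKER)
    if not found_start:
        return None
    body, found_end, _ = rest.partition(END_MARKER)
    if not found_end:
        return None
    return body


# Made-up page text, standing in for soup.get_text()
sample_page = (
    "site menu "
    + START_MARKER
    + ". Executive Order text goes here "
    + END_MARKER
    + " placeholder filing line"
)

print(extract_order_text(sample_page))
print(extract_order_text("a page without the markers"))
```

Returning `None` for pages without the markers makes it possible to flag those rows while looping, rather than having the loop stop at the first one.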

In [None]:
ex_text = []

ex_text.append(a)

print(ex_text)


[".\n\n\n\n\n\nStart Printed Page 61505\nExecutive Order 13851 of November 27, 2018\n        Blocking Property of Certain Persons Contributing to the Situation in Nicaragua\nBy the authority vested in me as President by the Constitution and the laws of the United States of America, including the International Emergency Economic Powers Act (50 U.S.C. 1701 et seq.) (IEEPA), the National Emergencies Act (50 U.S.C. 1601 et seq.) (NEA), section 212(f) of the Immigration and Nationality Act of 1952 (8 U.S.C. 1182(f)), and section 301 of title 3, United States Code,\nI, DONALD J. TRUMP, President of the United States of America, find that the situation in Nicaragua, including the violent response by the Government of Nicaragua to the protests that began on April 18, 2018, and the Ortega regime's systematic dismantling and undermining of democratic institutions and the rule of law, its use of indiscriminate violence and repressive tactics against civilians, as well as its corruption leading to

Since that worked for the example, we just wrap the code in a for loop to execute it

In [None]:
orders_text = []

for entry in range(len(ex_dataset)):
    url = ex_dataset.iloc[entry]["html_url"]
    req = urllib.request.Request(url, headers={'User-Agent' : "Magic Browser"})
    con = urllib.request.urlopen(req)
    html = con.read()
    soup = BeautifulSoup(html, "lxml")
    soup_text = soup.get_text()
    doc1 = soup_text.split("Use the PDF linked in the document sidebar for the official electronic format",1)[1]
    # This unpacking raises a ValueError for pages that don't contain "[FR Doc."
    a,b = doc1.split("[FR Doc.")
    orders_text.append(a)



Here we have an issue. I did some investigating and found out that not all the orders are available in HTML format. The text of those documents just links to a flat file. I need to remove those documents. It will reduce the number of observations in our dataset, but there's no other way

In [None]:
exec_orders_len = len(ex_dataset)

print(exec_orders_len)

In [None]:
# I'm going to look at each text and create a boolean to see whether the text contains the phrase "[FR Doc."

test_cleaning = []

for entry in range(len(ex_dataset)):
    url = ex_dataset.iloc[entry]["html_url"]
    req = urllib.request.Request(url, headers={'User-Agent' : "Magic Browser"})
    con = urllib.request.urlopen(req)
    html = con.read()
    soup = BeautifulSoup(html, "lxml")
    soup_text = soup.get_text()
    test_cleaning.append(soup_text)
    



In [None]:
substring = "[FR Doc."

check = []

for entry in test_cleaning:
    if substring in entry:
        value = "Yes"
        check.append(value)
    else:
        value = "No"
        check.append(value)

print(check)

In [None]:
ind = range(0, 928)
df = pd.DataFrame(index = ind, data = check, columns=["In_HTML"])

#print(df.head())

print((df[df["In_HTML"] == "No"]).head())

Okay, now we know which orders aren't in HTML format, and we can remove those. There are 223

In [None]:
ex_dataset["In_HTML"] = df["In_HTML"]

print(ex_dataset.head())

Okay, now we know which ones are in HTML. We can make a new dataset that contains only HTML orders, and work off that

In [None]:
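# .copy() gives us an independent dataframe, so adding columns later doesn't trigger a SettingWithCopyWarning
cleaned_ex_data = ex_dataset[ex_dataset["In_HTML"] == "Yes"].copy()

print(928-223)

# These numbers should be the same if the above code is successful
print(len(cleaned_ex_data))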

Great! We now have 705 executive orders in HTML ready to be parsed into our dataframe.

In [None]:
exec_text = []

for entry in range(len(cleaned_ex_data)):
    url = cleaned_ex_data.iloc[entry]["html_url"]
    req = urllib.request.Request(url, headers={'User-Agent' : "Magic Browser"})
    con = urllib.request.urlopen(req)
    html = con.read()
    soup = BeautifulSoup(html, "lxml")
    soup_text = soup.get_text()
    doc1 = soup_text.split("Use the PDF linked in the document sidebar for the official electronic format",1)[1]
    # Split only on the first "[FR Doc." so the unpacking can't fail on a repeated marker
    a,b = doc1.split("[FR Doc.", 1)
    exec_text.append(a)

# I'm going to print a random order (100) to test that the for loop worked
print(exec_text[100])

Next we're going to append this list of strings to our dataset to get our final result

In [None]:
cleaned_ex_data["Order_Text"] = exec_text

print(cleaned_ex_data.head())

Next, we're going to strip the new line \n from the text

In [None]:
# The vectorized string method does the same replacement as a row-by-row loop
cleaned_ex_data["Order_Text"] = cleaned_ex_data["Order_Text"].str.replace('\n', '', regex=False)
    
print(cleaned_ex_data.head())
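
One thing to watch when stripping newlines: replacing `\n` with an empty string fuses the last word of one line with the first word of the next, which can matter later when the text is split into words. The sketch below compares that with replacing `\n` by a space, using a one-row made-up series built from a fragment of the order printed earlier. Note that the space version changes the character counts slightly, so it is a trade-off rather than a drop-in fix.

```python
import pandas as pd

# One made-up row, using a fragment of the order text shown earlier
sample = pd.Series(["United States Code,\nI, DONALD J. TRUMP"])

joined = sample.str.replace("\n", "", regex=False)   # what the cell above does
spaced = sample.str.replace("\n", " ", regex=False)  # alternative that keeps words apart

print(joined[0])
print(spaced[0])
```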

Now our data is clean and ready for analysis! Let's get some exploratory statistics, such as the number of orders per president

In [None]:
print("Trump:",len(cleaned_ex_data[cleaned_ex_data["President"] == "Trump"]))
print("Obama:",len(cleaned_ex_data[cleaned_ex_data["President"] == "Obama"]))
print("Bush:",len(cleaned_ex_data[cleaned_ex_data["President"] == "Bush"]))
print("Clinton:",len(cleaned_ex_data[cleaned_ex_data["President"] == "Clinton"]))

We can see that Clinton has the least. This is likely because we discarded many of the executive orders that weren't in HTML format, which are the older orders. Trump also doesn't have very many, likely because he hasn't been in office very long. Now let's compare Dems and Republicans:

In [None]:
print("Democrats:",len(cleaned_ex_data[cleaned_ex_data["Party"] == "Democrat"]))
print("Republicans:",len(cleaned_ex_data[cleaned_ex_data["Party"] == "Republican"]))

These numbers are pretty close, which will be good for comparisons. Next let's add a column that contains the length of the different orders

In [None]:
lengths = []

for index, row in cleaned_ex_data.iterrows():
    num = len(cleaned_ex_data.loc[index, "Order_Text"])
    lengths.append(num)

# this looks good, now we just add this to the main dataframe   
print(lengths)


In [None]:
cleaned_ex_data["Order_Length"] = lengths

print(cleaned_ex_data.head())

Now let's compare the average length of each executive order across parties

In [None]:
import numpy as np

dems = cleaned_ex_data[cleaned_ex_data["Party"] == "Democrat"]
reps = cleaned_ex_data[cleaned_ex_data["Party"] == "Republican"]

print("Democrats:", np.mean(dems["Order_Length"]))
print("Republicans:", np.mean(reps["Order_Length"]))
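
The length column and the per-party summary can also be produced in two lines with `str.len()` and `groupby`. The sketch below runs on a tiny made-up dataframe (`toy`), so the numbers mean nothing; it only shows the shape of the calls. Including the median alongside the mean is useful here because a few very long orders can pull the mean up.

```python
import pandas as pd

# Made-up stand-in for cleaned_ex_data
toy = pd.DataFrame({
    "Party": ["Democrat", "Democrat", "Republican", "Republican"],
    "Order_Text": ["aaaa", "aaaaaa", "aa", "aaaa"],
})

# Character count of each order
toy["Order_Length"] = toy["Order_Text"].str.len()

# Count, mean and median length per party in one table
summary = toy.groupby("Party")["Order_Length"].agg(["count", "mean", "median"])
print(summary)
```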


Now let's compare these two groups with some boxplots

In [None]:
import seaborn as sns

import matplotlib.pyplot as plt

len_box_adj = sns.boxplot(x="Order_Length", y = "Party", data = cleaned_ex_data, palette = "Set2", notch = True)

# I'm adjusting the x axis here to make the plot easier to see
plt.xlim(0, 21000)

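The notches approximate a 95% confidence interval around each median. Because the notches for the two parties do not overlap, the plot suggests that executive orders from Democratic presidents have a longer median length than those from Republican presidents. Let's look at individual presidents as well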
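
Reading notches off a plot is a visual check, so it can help to back it up with a number. One option is a permutation test on the difference in medians, sketched below with NumPy only. The two groups here are synthetic (drawn from normal distributions I picked arbitrarily), and `permutation_test_median` is a hypothetical helper, so this shows the technique rather than a result for the executive orders.

```python
import numpy as np


def permutation_test_median(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in medians between a and b."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = np.median(a) - np.median(b)
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        diff = np.median(shuffled[:len(a)]) - np.median(shuffled[len(a):])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction so the p-value is never exactly zero
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic stand-ins for the two groups of order lengths
rng = np.random.default_rng(1)
group_a = rng.normal(9000, 2000, size=200)
group_b = rng.normal(7000, 2000, size=200)

observed_diff, p_value = permutation_test_median(group_a, group_b)
print(observed_diff, p_value)
```

With the real data, `group_a` and `group_b` would be the `Order_Length` columns of `dems` and `reps`.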

In [None]:

obama = cleaned_ex_data[cleaned_ex_data["President"] == "Obama"]
trump = cleaned_ex_data[cleaned_ex_data["President"] == "Trump"]
bush = cleaned_ex_data[cleaned_ex_data["President"] == "Bush"]
clinton = cleaned_ex_data[cleaned_ex_data["President"] == "Clinton"]

print("Obama:", np.mean(obama["Order_Length"]))
print("Trump:", np.mean(trump["Order_Length"]))
print("Bush:", np.mean(bush["Order_Length"]))
print("Clinton:", np.mean(clinton["Order_Length"]))


It appears that Bush's are by far the shortest, and surprisingly Trump's are the longest. Let's create another boxplot

In [None]:
len_box_adj = sns.boxplot(x="Order_Length", y = "President", data = cleaned_ex_data, palette = "Set2", notch = True)

# I'm adjusting the x axis here to make the plot easier to see
plt.xlim(0, 21000)

It appears that the only president with significantly shorter executive orders is Bush. This likely explains why orders by Republicans appear to be much shorter. Next let's see how the length of all orders changed over time

In [None]:
# we need to change the date to a datetime object
cleaned_ex_data["publication_date"] = pd.to_datetime(cleaned_ex_data["publication_date"])


In [None]:
plt.plot(cleaned_ex_data["publication_date"], cleaned_ex_data["Order_Length"])

We can see several spikes here, probably from unusually long orders, but there doesn't appear to be any kind of trend persisting over time. We can also only look at the data below 21000, which is where we limited the boxplots

In [None]:
smaller_data = cleaned_ex_data[cleaned_ex_data["Order_Length"] < 21000]

plt.plot(smaller_data["publication_date"], smaller_data["Order_Length"])

Here we can see a possible positive trend, so let's fit a linear regression of order length on publication date and look at the coefficient and R-squared to see how strong it is

In [None]:
import datetime as dt

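# Work on a copy; a plain assignment would overwrite publication_date in cleaned_ex_data as well
date_graph_data = cleaned_ex_data.copy()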

date_graph_data["publication_date"]=date_graph_data["publication_date"].map(dt.datetime.toordinal)

In [None]:
import statsmodels.api as sm

X = date_graph_data["publication_date"]
y = date_graph_data["Order_Length"]

model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()


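Okay, so the summary reports a medium-sized R-sq of 0.447 and a coefficient on publication date of just 0.0098, which is quite small. One caveat: sm.OLS does not add an intercept unless we put one in X, so this model is forced through the origin. That makes the R-sq the uncentered version, and the coefficient mostly reflects the ratio of the average order length to the average ordinal date rather than a trend over time. Let's try this with the less extreme data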
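
To see how much the missing intercept matters, here is a small NumPy-only sketch on synthetic data. The lengths are generated with no time trend at all (the mean of 7000 and spread of 3000 are arbitrary choices of mine), and the dates are random ordinals between 1994 and 2018. The first fit has no intercept; the second adds one.

```python
import datetime as dt

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: random ordinal dates and lengths that do NOT depend on the date
start = dt.date(1994, 1, 1).toordinal()
stop = dt.date(2019, 1, 1).toordinal()
x = rng.integers(start, stop, size=700).astype(float)
y = rng.normal(7000, 3000, size=700)

# Fit 1: no intercept (a regression through the origin)
slope_origin = np.sum(x * y) / np.sum(x * x)
resid_origin = y - slope_origin * x
r2_uncentered = 1 - np.sum(resid_origin ** 2) / np.sum(y ** 2)

# Fit 2: intercept plus slope
design = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - (intercept + slope * x)
r2_centered = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

print("through origin:", slope_origin, r2_uncentered)
print("mean(y) / mean(x):", y.mean() / x.mean())
print("with intercept:", slope, r2_centered)
```

Even though there is no trend in this synthetic data, the fit through the origin gives a large R-squared and a small positive coefficient close to `mean(y) / mean(x)`. With statsmodels, the usual way to include an intercept is `sm.add_constant(X)` before calling `sm.OLS`.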

In [None]:
# Copy first so that smaller_data keeps its datetime column
date_graph_data2 = smaller_data.copy()

date_graph_data2["publication_date"]=date_graph_data2["publication_date"].map(dt.datetime.toordinal)

X2 = date_graph_data2["publication_date"]
y2 = date_graph_data2["Order_Length"]

model2 = sm.OLS(y2, X2).fit()
predictions2 = model2.predict(X2) # make the predictions by the model

# Print out the statistics
model2.summary()


We have a higher R-sq (0.667), but still quite a small coeff for publication date (0.0084). Since this model also has no intercept, these figures are weak evidence at best, but nothing here suggests that the mean lengths of the executive orders have gone up over the years by a practically significant amount

In [None]:
# if the coefficient is read as a per-day slope, this means that the average executive order length
# goes up about 3 characters a year. Likely not a practically significant difference, especially given our limited data
0.0084*365

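We should do some textual cleaning before we try and create a model. Let's remove the numbers from the texts using re (Python's regular expression module). The pattern below drops every whitespace-delimited token that contains a digit, along with any punctuation attached to that token. The rest of the punctuation is left in place for now; the vectorizer's tokenizer ignores it later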

In [None]:
import re 
output = []
for i in cleaned_ex_data["Order_Text"]:
    # Raw string so that \S and \d reach the regex engine unchanged
    output.append(re.sub(r"\S*\d\S*","", i).strip())
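
To make it clearer what this pattern does and doesn't remove, here it is applied to a short sample built from the order printed earlier. The second step, which strips the remaining punctuation with `str.translate`, is not part of the notebook; it's shown only as one possible way to do it.

```python
import re
import string

sample = "Executive Order 13851 of November 27, 2018 (50 U.S.C. 1701 et seq.)"

# Step 1: drop every whitespace-delimited token that contains a digit
no_digits = re.sub(r"\S*\d\S*", "", sample)
# Collapse the runs of spaces left behind by the removed tokens
no_digits = " ".join(no_digits.split())

# Step 2 (optional, not in the notebook): remove all remaining punctuation
no_punct = no_digits.translate(str.maketrans("", "", string.punctuation))

print(no_digits)
print(no_punct)
```

Notice that "(50" disappears entirely, opening parenthesis included, while the periods in "U.S.C." survive step 1.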
    
    

Now we'll append the clean data to a new column, Cleaned_Text, in our dataset

In [None]:
se = pd.Series(output)
cleaned_ex_data['Cleaned_Text'] = se.values

We'll also change the words to all lower case, so that words with different cases will still be categorized together for analysis

In [None]:
cleaned_ex_data['Cleaned_Text'] = cleaned_ex_data['Cleaned_Text'].astype(str).str.lower()

In [None]:
# let's examine our changes

cleaned_ex_data.head()

In [None]:
# Next we'll start building our first prediction model: Party. We'll use test train split to split the data so we can test
# our model later

from sklearn.model_selection import train_test_split

y = cleaned_ex_data["Party"]

X_train, X_test, y_train, y_test = train_test_split(cleaned_ex_data['Cleaned_Text'], y, test_size = 0.33, random_state = 53)

Now that the data is split, we'll use a TF-IDF vectorizer to transform the text and get some feature names

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)

tfidf_train = tfidf_vectorizer.fit_transform(X_train)

tfidf_test = tfidf_vectorizer.transform(X_test)

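# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
print(tfidf_vectorizer.get_feature_names_out()[:10])

print(tfidf_train.toarray()[:5])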

We have a few entries that are blanks ( the '____' ones) but it's only a few so we won't worry about those

In [None]:
# now we inspect the vectorizer

tfidf_df = pd.DataFrame(tfidf_train.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

print(tfidf_df.head())

It's a very large dataset! Next I'll fit the classifier, which is what we'll use to classify the test executive orders as Democrat or Republican in our predictive model

In [None]:
from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import accuracy_score, confusion_matrix

nb_classifier = MultinomialNB()

# Fit the classifier
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score
score = accuracy_score(y_test, pred)
print(score)


So we have an accuracy score of 0.65, which is better than the roughly 0.5 we'd expect from guessing, given that the two parties have similar numbers of orders! Next we'll test a few different alphas and see which works best
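
A slightly stricter yardstick than "chance" is the accuracy we'd get by always predicting the most common class in the test set. The sketch below computes that baseline on a made-up label series (`y_test_toy`); with the real data, the same line would be run on `y_test`.

```python
import pandas as pd

# Made-up test labels: 6 Democrat, 4 Republican
y_test_toy = pd.Series(["Democrat"] * 6 + ["Republican"] * 4)

# Share of the most common class = accuracy of always guessing that class
baseline_accuracy = y_test_toy.value_counts(normalize=True).max()
print(baseline_accuracy)
```

A model is only adding something if its accuracy is clearly above this figure.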

In [None]:
alphas = np.arange(0, 1, 0.1)

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

It looks like the best option we have here is Alpha = 0.3, which produces a score of 0.66
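
One caution about choosing alpha this way: the scores above are computed on the test set, so the best one is a slightly optimistic estimate. A common alternative is to choose alpha by cross-validation on the training data only. The sketch below shows the idea with a scikit-learn `Pipeline` and `GridSearchCV` on a tiny made-up corpus; the texts, labels and alpha grid are all illustrative, not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for the training texts
texts = [
    "expand health care access",
    "protect labor rights workers",
    "health insurance for families",
    "strengthen labor and health programs",
    "improve access to health services",
    "support workers and families",
    "reduce tax burden business",
    "secure the border",
    "tax relief for business",
    "strengthen border security",
    "cut regulations on business",
    "border and tax enforcement",
]
labels = ["Democrat"] * 6 + ["Republican"] * 6

# Putting the vectorizer inside the pipeline means it is refit on each training fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

param_grid = {"nb__alpha": [0.1, 0.3, 0.5, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

The test set is then used once, at the end, to score the pipeline with the chosen alpha.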

Next we'll look at the features with the most weight, determining their importance

In [None]:

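class_labels = nb_classifier.classes_

feature_names = tfidf_vectorizer.get_feature_names_out()

# coef_ is no longer available on MultinomialNB in recent scikit-learn releases;
# for a two-class model, feature_log_prob_[1] holds the same values (log probabilities under the second class)
feat_with_weights = sorted(zip(nb_classifier.feature_log_prob_[1], feature_names))

print(class_labels[0], feat_with_weights[:20])

print(class_labels[1], feat_with_weights[-20:])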


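Here we can see which words/features the model treats as indicators. Both lists are ranked by each word's log probability under the second class (Republican): the second set holds the words most typical of Republican orders, and the first set holds the words least typical of them, which we read here as pointing towards a Democrat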

Now we'll work on the next predictive model, based on term. First, we'll label the different orders by first or second term

In [None]:
#First, I'll change the signing date to a datetime object, so we can use it to determine the terms

import datetime

cleaned_ex_data['signing_date'] = pd.to_datetime(cleaned_ex_data['signing_date'])

In [None]:
def term(x, y):
    y = pd.to_datetime(y)
    if x == "Obama":
        date = datetime.datetime(2013, 1, 20)
        if y > date:
            return 'Second'
        else:
            return 'First'
    if x == "Bush":
        date = datetime.datetime(2005, 1, 20)
        if y > date:
            return 'Second'
        else:
            return 'First'
    if x == "Clinton":
        date = datetime.datetime(1997, 1, 20)
        if y > date:
            return 'Second'
        else:
            return 'First'
    else:
        return "First" #because Trump has only had one term
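
The same labeling can be done without a row-by-row function by mapping each president to the start date of their second term and comparing whole columns. This is a sketch on a four-row made-up dataframe; the dates in `second_term_start` are the ones used in the function above, and the other names are mine. Trump has no entry, so his cutoff is missing and the comparison comes out False, which gives "First".

```python
import numpy as np
import pandas as pd

# Second-term start dates, as used in the term() function above
second_term_start = {
    "Clinton": pd.Timestamp(1997, 1, 20),
    "Bush": pd.Timestamp(2005, 1, 20),
    "Obama": pd.Timestamp(2013, 1, 20),
}

# Made-up rows for illustration
toy = pd.DataFrame({
    "President": ["Clinton", "Bush", "Obama", "Trump"],
    "signing_date": pd.to_datetime(["1996-05-01", "2006-03-15", "2013-01-20", "2018-11-27"]),
})

# Presidents without an entry get a missing cutoff (NaT)
cutoff = toy["President"].map(second_term_start)

# Comparisons against NaT are False, so those rows fall through to "First"
toy["Term"] = np.where(toy["signing_date"] > cutoff, "Second", "First")

print(toy)
```

As in the original function, an order signed exactly on January 20 counts as first term because the comparison is strict.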

In [None]:
cleaned_ex_data["Term"] = np.vectorize(term)(cleaned_ex_data['President'], cleaned_ex_data['signing_date'])

print(cleaned_ex_data.head())

Now let's do some basic counts, so we know how many orders were given by each president in each term

In [None]:
print("First Term")
print("Trump:",len(cleaned_ex_data[(cleaned_ex_data["President"] == "Trump") & (cleaned_ex_data["Term"] == "First")]))
print("Obama:",len(cleaned_ex_data[(cleaned_ex_data["President"] == "Obama") & (cleaned_ex_data["Term"] == "First")]))
print("Bush:",len(cleaned_ex_data[(cleaned_ex_data["President"] == "Bush") & (cleaned_ex_data["Term"] == "First")]))
print("Clinton:",len(cleaned_ex_data[(cleaned_ex_data["President"] == "Clinton") & (cleaned_ex_data["Term"] == "First")]))

print(" ")
print("Second Term")
print("Trump:",len(cleaned_ex_data[(cleaned_ex_data["President"] == "Trump") & (cleaned_ex_data["Term"] == "Second")]))
print("Obama:",len(cleaned_ex_data[(cleaned_ex_data["President"] == "Obama") & (cleaned_ex_data["Term"] == "Second")]))
print("Bush:",len(cleaned_ex_data[(cleaned_ex_data["President"] == "Bush") & (cleaned_ex_data["Term"] == "Second")]))
print("Clinton:",len(cleaned_ex_data[(cleaned_ex_data["President"] == "Clinton") & (cleaned_ex_data["Term"] == "Second")]))





In [None]:
y = cleaned_ex_data["Term"]

X_train, X_test, y_train, y_test = train_test_split(cleaned_ex_data['Cleaned_Text'], y, test_size = 0.33, random_state = 53)

Now we've created the new variable, redefined Y, and we've done the test train split. Now we create the new model based on the term rather than the party

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)

tfidf_train = tfidf_vectorizer.fit_transform(X_train)

tfidf_test = tfidf_vectorizer.transform(X_test)

# now we inspect the vectorizer

tfidf_df = pd.DataFrame(tfidf_train.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

print(tfidf_df.head())

In [None]:
nb_classifier = MultinomialNB()

# Fit the classifier
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score
score = accuracy_score(y_test, pred)
print(score)

An accuracy score of 56%, which is lower than the party model's and suggests there isn't a huge difference in the contents of the executive orders depending on term

In [None]:
alphas = np.arange(0, 1, 0.1)

for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

It looks like the highest score comes with an alpha of 0, which produces an accuracy score of almost 61%. Now let's look at the features with their weights, so we can see what words are being used as predictors

In [None]:
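class_labels = nb_classifier.classes_

feature_names = tfidf_vectorizer.get_feature_names_out()

# coef_ is no longer available on MultinomialNB in recent scikit-learn releases;
# for a two-class model, feature_log_prob_[1] holds the same values (log probabilities under the second class)
feat_with_weights = sorted(zip(nb_classifier.feature_log_prob_[1], feature_names))

print(class_labels[0], feat_with_weights[:20])

print('')

print(class_labels[1], feat_with_weights[-20:])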

Here we can see again what words or features indicate whether an executive order was written in the first term (first set) or the second term (second set). As the predictive model wasn't as good for the terms, these features likely have much less significance. This is exemplified by the fact that the spaces are seen as indicators that the order is from the first term. This could be a formatting issue, but it likely also indicates that no extremely good predictive words were found