# Text Classification

## 0. Prerequisites

This notebook assumes that you have a basic command of the Python programming language. If you have no experience with Python, there are innumerable great tutorials and introductions out there. You can find a good overview of some of them [here](https://wiki.python.org/moin/BeginnersGuide/Programmers).

It also assumes that you have gone through the following sessions in the [Text Analysis with Python series](https://git.dartmouth.edu/lib-digital-strategies/RDS/workshops/text-analysis/text-analysis-with-python):
- Strings and Files
- Word counts
- TF/IDF
- Topics and Emotions

Finally, we will use the machine learning toolkit `scikit-learn` and assume that you have gone through the [Intro to Machine Learning with scikit-learn](https://git.dartmouth.edu/lib-digital-strategies/RDS/workshops/machine-learning/intro-to-machine-learning-with-scikit-learn).

## 1. Introduction

Let's say you are dealing with a number of different pieces of writing and want to categorize them into a set of predefined groups. For example, you might want to find out the language a particular text is written in. Or you are dealing with an email at work and are trying to figure out which department to forward it to. In text classification, you assign a *class* to each piece of text. So, for example, you assign the class `English` or `German` to a document, or the class `accounting` or `customer service` to the email.

<div class="alert alert-block alert-info"> 

How would you, as a human, solve these examples? Think about it step-by-step: How would you mentally process the text, what would inform your decision?
</div>

We can do this manually for smaller amounts of text, but we quickly run into problems at a larger scale: You would not want to classify every single incoming email at Dartmouth as `spam` or `not spam` by hand. This is where algorithmic text classification using machine learning can help!

Text classification is a three-step process:
1. Extract descriptive features from a sufficiently large number of texts belonging to known categories/classes
2. Train a classifier using these features to discriminate between these classes
3. Use the trained classifier to classify new text pieces
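
Before we turn to the real dataset, the three steps can be sketched on a toy problem. The following is a minimal illustration using only the standard library. It assumes a made-up training set of four labeled sentences and a simple nearest-neighbor rule; the function names `extract_features`, `train`, and `classify` are hypothetical and not part of any library.

```python
from collections import Counter
import math


def extract_features(text):
    """Step 1: turn a text into word counts (a bag of words)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Similarity between two word-count dictionaries (0.0 = no shared words)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_texts):
    """Step 2: 'training' a nearest-neighbor model just stores the features."""
    return [(extract_features(text), label) for text, label in labeled_texts]


def classify(model, text):
    """Step 3: return the label of the most similar training example."""
    features = extract_features(text)
    _, best_label = max(model, key=lambda item: cosine_similarity(features, item[0]))
    return best_label


# Made-up training examples with known classes
training_data = [
    ("the cat sat on the mat", "English"),
    ("this is a short sentence", "English"),
    ("die katze sitzt auf der matte", "German"),
    ("das ist ein kurzer satz", "German"),
]
model = train(training_data)

print(classify(model, "this is the cat"))    # English
print(classify(model, "das ist die katze"))  # German
```

The classifier we build in this session follows the same outline, only with richer features and a library implementation of the nearest-neighbor rule.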

In this session, we will walk through this process by building up a classifier that can tell us if any given State of the Union address was written by a Republican or a Democratic president. If you are running this notebook on Dartmouth's JupyterHub, the dataset is already available to you under `~/shared/RR-workshop-data/state-of-the-union-dataset/txt`. Otherwise you can download the dataset [here](https://git.dartmouth.edu/lib-digital-strategies/RDS/datasets/state-of-the-union-dataset/-/archive/main/state-of-the-union-dataset-main.zip) and put it in a folder of your choosing.

<div class="alert alert-block alert-info"> 

**Caveat emptor:** This particular system we are building here is most likely not the optimal system for this task. There are literally thousands of models and algorithms we could choose from and even more feature sets we could consider. The main purpose of this notebook, however, is to give you a relatively simple example that will hopefully give you a good idea of how text classification works *in principle*. Maybe you even feel inspired to engineer your own features or try out different classifiers?

If you do, please [let us know how it went](mailto:simon.stone@dartmouth.edu?subject=Text%20classification%20workshop)!
</div>

As mentioned above, the examples in our training set need to be already labeled (i.e., marked as *Republican* or *Democrat*). Fortunately, the State of the Union dataset includes some meta information we can use to do that. Since this kind of processing is outside the scope of this notebook, we moved this task to a separate notebook `add-meta.ipynb`. If you are interested, you can open that notebook and see exactly how it works, but for now we will simply run that other notebook using a [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html):

In [1]:
%run add-meta.ipynb

<div class="alert alert-block alert-info"> 

**Note:** If your State of the Union dataset is not at the default location (e.g., you are not working on Dartmouth's JupyterHub), you may have to open `add-meta.ipynb` and change the variable `dataset_folder` accordingly!
</div>

That notebook produced a CSV file we can now read using pandas:

In [2]:
import pandas as pd

sotu = pd.read_csv('data/sotu-extended.csv')
# Look at five random samples from the dataframe
sotu.sample(5)

Unnamed: 0,Year,Name,First Name,Last Name,Party,Text
218,2009,Barack Obama,Barack,Obama,Democratic,"Madame Speaker, Mr. Vice President, Members of..."
84,1874,Ulysses S. Grant,Ulysses S.,Grant,Republican,To the Senate and House of Representatives:\n\...
193,1984,Ronald Reagan,Ronald,Reagan,Republican,"Mr. Speaker, Mr. President, distinguished Memb..."
204,1995,Bill Clinton,Bill,Clinton,Democratic,"Mr. President, Mr. Speaker, members of the 104..."
130,1920,Woodrow Wilson,Woodrow,Wilson,Democratic,GENTLEMEN OF THE CONGRESS:\n\nWhen I addressed...


## 2. Feature Extraction

### 2.1 Numeric features and encoding

The goal of feature extraction is to *define* and *extract* features that hopefully best help to distinguish the classes of interest. Basically anything that describes the content of the analyzed texts can be a feature, *as long as you can express it as a number*. 

This limitation is imposed by the fact that most machine learning algorithms work with numeric computations and have no way of dealing with more abstract concepts like "meaning" or "context".

We already talked about some numeric features in previous sessions: 
- word count
- word frequencies
- TF/IDF
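
As a quick refresher, all three features can be computed with a few lines of standard-library Python. This is only a sketch on a made-up three-sentence corpus, and the `tf_idf` helper is a hypothetical name. It uses the plain, unsmoothed definition (term frequency times the logarithm of the inverse document frequency); libraries such as `scikit-learn` use a smoothed and normalized variant, so their numbers will differ.

```python
from collections import Counter
import math

# A made-up mini corpus, already lowercase and free of punctuation
documents = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [doc.split() for doc in documents]

# Word count: total number of tokens per document
word_counts = [len(tokens) for tokens in tokenized]

# Word frequencies: how often each word occurs in each document
word_frequencies = [Counter(tokens) for tokens in tokenized]


def tf_idf(word, doc_index):
    """Unsmoothed TF/IDF; the word must occur in at least one document."""
    tf = word_frequencies[doc_index][word] / word_counts[doc_index]
    df = sum(1 for freqs in word_frequencies if word in freqs)
    return tf * math.log(len(documents) / df)


print(word_counts)                   # [5, 5, 3]
print(word_frequencies[0]["the"])    # 2
print(round(tf_idf("the", 0), 3))    # 0.0, because 'the' occurs in every document
print(round(tf_idf("mouse", 0), 3))  # 0.22, because 'mouse' occurs only in the first document
```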

But even features that are not immediately numeric can be expressed in numbers (i.e., *encoded*). Let's take for example the feature *emotion*. Intuitively, this feature would have some descriptive levels like *sad*, *happy*, or *angry*. But we could also express these as numbers, if we mark the presence of one of these emotions in a text using `1` (for present) and `0` (for absent):


<style type="text/css" >
table {
    border-collapse: collapse;
    text-align: center;
    border-top: 3px solid;
    border-bottom: 3px solid;
}

tr, td, th {
    border-bottom: none !important;
    border-left: none !important;
    border-right: none !important;
}

</style>

<table>
  <tr>
    <th>Text</th>
    <th>Sad</th>
    <th>Happy</th>
    <th>Angry</th>
  </tr>
  <tr>
    <td>"My favorite show has been cancelled."</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
  </tr>  
  <tr>
    <td>"I got a promotion at work."</td>
    <td>0</td>
    <td>1</td>
    <td>0</td>
  </tr>
  <tr>
  <td>"My phone battery died just when I needed it the most."</td>
  <td>0</td>
  <td>0</td>
  <td>1</td>
  </tr>
  <tr>
    <td>"I was happy to see my childhood home one last time before it was sold."</td>
    <td>1</td>
    <td>1</td>
    <td>0</td>
  </tr>
</table>
  


<div class="alert alert-block alert-info"> 

**Note:** We can even express mixed emotions this way! Just take a look at the last example sentence.
</div>
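
The encoding in the table can be sketched in a few lines of Python. Note that this sketch only *encodes* emotions that someone has already assigned by hand; it does not detect them in the text. The names `EMOTIONS` and `encode_emotions` are hypothetical and chosen for this illustration.

```python
# The order of this list defines the meaning of each position in the encoding
EMOTIONS = ["sad", "happy", "angry"]


def encode_emotions(present):
    """Return one 0/1 entry per emotion, in the order given by EMOTIONS."""
    return [1 if emotion in present else 0 for emotion in EMOTIONS]


print(encode_emotions({"sad"}))           # [1, 0, 0]
print(encode_emotions({"angry"}))         # [0, 0, 1]
print(encode_emotions({"sad", "happy"}))  # [1, 1, 0], a mixed emotion
```

Because every emotion gets its own position, several of them can be switched on at once, which is exactly what the last row of the table does.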

### 2.2 Document- versus sentence-level features

The next important thing to consider when you want to extract features is the *level* at which they are extracted.

So far, we have only extracted features at a *document level*: Counting the words in the entire text, determining the word frequencies in the entire text, and so on.

But a *document* consists of *paragraphs*, a *paragraph* consists of *sentences*, and a *sentence* consists of *words*. You could even find more units, like clauses or even letters.

So almost every unit of text is actually a sequence of smaller units. You could consider extracting features at any of these levels. For example, you could extract the word count in every sentence instead of for an entire document.

The *representation* of each document in that case also becomes a sequence. For example:

In [3]:
from collections import Counter

a_text = 'This is a text about cats and dogs . But not about bunnies . I cannot stress enough , how little this text is about bunnies . I like bunnies , but this is not about them .'

# Document-level feature extraction
word_frequencies = Counter(a_text.lower().split()).most_common()


print("At a document level, we observe the following features:")
print(f"{word_frequencies = }")

At a document level, we observe the following features:
word_frequencies = [('about', 4), ('.', 4), ('this', 3), ('is', 3), ('bunnies', 3), ('text', 2), ('but', 2), ('not', 2), ('i', 2), (',', 2), ('a', 1), ('cats', 1), ('and', 1), ('dogs', 1), ('cannot', 1), ('stress', 1), ('enough', 1), ('how', 1), ('little', 1), ('like', 1), ('them', 1)]


<div class="alert alert-block alert-info"> 

As you can see, document-level features lose all the structure and context inherent in a document. The basic assumption here is that the relative order of the words does not matter for the intended purposes. This perspective on a text is therefore often called the *bag-of-words* model.
</div>

So if we look at this text at a document level, we might conclude that it is about bunnies! The word bunnies appears 3 times, after all. But if we look at it at a sentence level:

In [4]:
print("At a sentence level, we observe the following sequence of features:")
# Splitting at '.' leaves an empty string after the final period, so we skip it
sentences = [s for s in a_text.split('.') if s.strip()]
for idx, sentence in enumerate(sentences):
    print(f'Sentence {idx}:')
    print(Counter(sentence.lower().split()).most_common())


At a sentence level, we observe the following sequence of features:
Sentence 0:
[('this', 1), ('is', 1), ('a', 1), ('text', 1), ('about', 1), ('cats', 1), ('and', 1), ('dogs', 1)]
Sentence 1:
[('but', 1), ('not', 1), ('about', 1), ('bunnies', 1)]
Sentence 2:
[('i', 1), ('cannot', 1), ('stress', 1), ('enough', 1), (',', 1), ('how', 1), ('little', 1), ('this', 1), ('text', 1), ('is', 1), ('about', 1), ('bunnies', 1)]
Sentence 3:
[('i', 1), ('like', 1), ('bunnies', 1), (',', 1), ('but', 1), ('this', 1), ('is', 1), ('not', 1), ('about', 1), ('them', 1)]


Now we see that the word `bunnies` always occurs together with some form of `not` in the same sentence. So maybe this text is not about bunnies, after all?

<div class="alert alert-block alert-info"> 

This (somewhat crude) example demonstrates that sentence-level features are much better at capturing *local context*. However, since sentence-level features are more complex to process (see below), there are other techniques like N-grams and collocations that try to achieve the same thing while still remaining at the document level.
</div>
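
To give an idea of what N-grams look like, here is a minimal sketch that counts bigrams (runs of two consecutive words) in a shortened version of the example text. The helper `ngrams` is a hypothetical name written for this illustration, not a library function.

```python
from collections import Counter

# Shortened version of the example text
short_text = 'This is a text about cats and dogs . But not about bunnies . I like bunnies , but this is not about them .'


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = short_text.lower().split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))

print(unigram_counts['bunnies'])         # 2
print(bigram_counts[('not', 'about')])   # 2
print(bigram_counts[('about', 'cats')])  # 1
```

The bigram `('not', 'about')` keeps the negation attached to its neighbor, even though we still end up with a single set of counts for the whole document.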

Knowing this, you might be tempted to always go with sentence-level features. However, where a document-level approach converts a document into one set of features (a *feature vector*), a sentence-level approach converts the document into a *sequence* of feature vectors. The problem here is that conventional machine learning models can only process fixed-size feature vectors, not sequences of them. Neural networks, on the other hand, can be very good at handling sequences, which is why Large Language Models (like ChatGPT) are so good at what they do. 

Using neural networks, though, comes with many challenges regarding the complexity of the models involved and, above all, the amount of data needed to train them. For many tasks, it is therefore advisable to follow [the principle of parsimony](https://en.wikipedia.org/wiki/Occam's_razor) and use document-level features. They can still get the job done!
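
The difference between one feature vector and a sequence of feature vectors can be made concrete with a small sketch. It assumes a made-up vocabulary of four words and two made-up documents; `to_vector` is a hypothetical helper.

```python
# A fixed vocabulary defines the length of every feature vector
vocabulary = ["cats", "dogs", "bunnies", "not"]


def to_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]


doc_a = "I like cats . I do not like dogs ."
doc_b = "Bunnies are fine . Cats are fine . Dogs are not ."

for doc in (doc_a, doc_b):
    # Document level: always exactly one vector of length len(vocabulary)
    document_level = to_vector(doc)
    # Sentence level: one vector per sentence, so the length depends on the document
    sentence_level = [to_vector(s) for s in doc.split(".") if s.strip()]
    print(document_level, len(sentence_level))
```

Both documents yield a document-level vector with four entries, but the sentence-level representation has two vectors for the first document and three for the second. A model that expects a fixed number of inputs cannot consume the latter without further processing.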

### 2.3 Extracting features from the State of the Union addresses



In [5]:
import string

print('Replace all occurrences of these punctuation marks: ', string.punctuation)
for symbol in string.punctuation:
    sotu['Text'] = sotu['Text'].str.replace(symbol, ' ', regex=False)
   
sotu['Text']    

Replace all occurrences of these punctuation marks:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


0      Fellow Citizens of the Senate  and House of Re...
1      Fellow Citizens of the Senate and House of Rep...
2      Fellow Citizens of the Senate and House of Rep...
3      Fellow Citizens of the Senate and House of Rep...
4      Fellow Citizens of the Senate and House of Rep...
                             ...                        
223    Mr  Speaker  Mr  Vice President  Members of Co...
224    Mr  Speaker  Mr  Vice President  Members of Co...
225    Mr  Speaker  Mr  Vice President  Members of Co...
226    Thank you very much  Mr  Speaker  Mr  Vice Pre...
227    Mr  Speaker  Mr  Vice President  Members of Co...
Name: Text, Length: 228, dtype: object

In [6]:
from nltk.tokenize import WhitespaceTokenizer

sotu['Tokens'] = sotu['Text'].apply(WhitespaceTokenizer().tokenize)
sotu

Unnamed: 0,Year,Name,First Name,Last Name,Party,Text,Tokens
0,1790,George Washington,George,Washington,Unaffiliated,Fellow Citizens of the Senate and House of Re...,"[Fellow, Citizens, of, the, Senate, and, House..."
1,1791,George Washington,George,Washington,Unaffiliated,Fellow Citizens of the Senate and House of Rep...,"[Fellow, Citizens, of, the, Senate, and, House..."
2,1792,George Washington,George,Washington,Unaffiliated,Fellow Citizens of the Senate and House of Rep...,"[Fellow, Citizens, of, the, Senate, and, House..."
3,1793,George Washington,George,Washington,Unaffiliated,Fellow Citizens of the Senate and House of Rep...,"[Fellow, Citizens, of, the, Senate, and, House..."
4,1794,George Washington,George,Washington,Unaffiliated,Fellow Citizens of the Senate and House of Rep...,"[Fellow, Citizens, of, the, Senate, and, House..."
...,...,...,...,...,...,...,...
223,2014,Barack Obama,Barack,Obama,Democratic,Mr Speaker Mr Vice President Members of Co...,"[Mr, Speaker, Mr, Vice, President, Members, of..."
224,2015,Barack Obama,Barack,Obama,Democratic,Mr Speaker Mr Vice President Members of Co...,"[Mr, Speaker, Mr, Vice, President, Members, of..."
225,2016,Barack Obama,Barack,Obama,Democratic,Mr Speaker Mr Vice President Members of Co...,"[Mr, Speaker, Mr, Vice, President, Members, of..."
226,2017,Donald Trump,Donald,Trump,Republican,Thank you very much Mr Speaker Mr Vice Pre...,"[Thank, you, very, much, Mr, Speaker, Mr, Vice..."


In [7]:
from nltk import corpus

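# A set makes the membership test below much faster than a list
stopwords = set(corpus.stopwords.words('english'))

def remove_stopwords(tokens):
    # The stopword list is all lowercase and the comparison is case-sensitive,
    # so capitalized tokens such as 'The' are kept
    return [word for word in tokens if word not in stopwords]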

sotu['Tokens w/o stopwords'] = sotu['Tokens'].apply(remove_stopwords)
sotu


Unnamed: 0,Year,Name,First Name,Last Name,Party,Text,Tokens,Tokens w/o stopwords
0,1790,George Washington,George,Washington,Unaffiliated,Fellow Citizens of the Senate and House of Re...,"[Fellow, Citizens, of, the, Senate, and, House...","[Fellow, Citizens, Senate, House, Representati..."
1,1791,George Washington,George,Washington,Unaffiliated,Fellow Citizens of the Senate and House of Rep...,"[Fellow, Citizens, of, the, Senate, and, House...","[Fellow, Citizens, Senate, House, Representati..."
2,1792,George Washington,George,Washington,Unaffiliated,Fellow Citizens of the Senate and House of Rep...,"[Fellow, Citizens, of, the, Senate, and, House...","[Fellow, Citizens, Senate, House, Representati..."
3,1793,George Washington,George,Washington,Unaffiliated,Fellow Citizens of the Senate and House of Rep...,"[Fellow, Citizens, of, the, Senate, and, House...","[Fellow, Citizens, Senate, House, Representati..."
4,1794,George Washington,George,Washington,Unaffiliated,Fellow Citizens of the Senate and House of Rep...,"[Fellow, Citizens, of, the, Senate, and, House...","[Fellow, Citizens, Senate, House, Representati..."
...,...,...,...,...,...,...,...,...
223,2014,Barack Obama,Barack,Obama,Democratic,Mr Speaker Mr Vice President Members of Co...,"[Mr, Speaker, Mr, Vice, President, Members, of...","[Mr, Speaker, Mr, Vice, President, Members, Co..."
224,2015,Barack Obama,Barack,Obama,Democratic,Mr Speaker Mr Vice President Members of Co...,"[Mr, Speaker, Mr, Vice, President, Members, of...","[Mr, Speaker, Mr, Vice, President, Members, Co..."
225,2016,Barack Obama,Barack,Obama,Democratic,Mr Speaker Mr Vice President Members of Co...,"[Mr, Speaker, Mr, Vice, President, Members, of...","[Mr, Speaker, Mr, Vice, President, Members, Co..."
226,2017,Donald Trump,Donald,Trump,Republican,Thank you very much Mr Speaker Mr Vice Pre...,"[Thank, you, very, much, Mr, Speaker, Mr, Vice...","[Thank, much, Mr, Speaker, Mr, Vice, President..."


In [31]:
import numpy as np
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.base import BaseEstimator, TransformerMixin


class MySentimentIntensityAnalyzer(BaseEstimator, TransformerMixin):
    """Transformer that turns each text into its VADER sentiment scores."""

    def __init__(self):
        super().__init__()
        self.sia = SentimentIntensityAnalyzer()

    def fit(self, X, y=None):
        # Nothing to learn: the analyzer is lexicon-based
        return self

    def transform(self, X, y=None):
        # One row per text with the positive, neutral, and negative scores
        out = []
        for x in X:
            ps = self.sia.polarity_scores(x)
            out.append([ps['pos'], ps['neu'], ps['neg']])
        return np.array(out)
    


In [32]:
""" Prepare the Pipeline """

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

vectorizer = TfidfVectorizer(input='content', 
                             preprocessor=lambda x: x,   # We did the preprocessing ourselves, so just pass everything through
                             tokenizer=lambda x: x,      # We also did the tokenization ourselves
                             token_pattern=None,         # Since we do not tokenize, we can avoid a warning by setting this to None
                             max_features=1000)          # Limit the vocabulary to 1000 words

feature_extractor = ColumnTransformer([
    ('tfidf', vectorizer, 'Tokens w/o stopwords'),
    ('sentiment', MySentimentIntensityAnalyzer(), 'Text')
])

pipe = Pipeline(steps=[
    ('feature_extraction', feature_extractor), 
    ('classifier', KNeighborsClassifier())
    ])

In [33]:
feature_extractor.fit_transform(sotu)

AttributeError: 'list' object has no attribute 'size'

In [161]:
""" Holdout validate pipeline """
from sklearn.metrics import classification_report
from sklearn.model_selection import GroupShuffleSplit

subset = sotu.query("Party == 'Democratic' | Party == 'Republican'")

splits = GroupShuffleSplit(1, test_size=0.1).split(subset, groups=subset['Name'])
for train_idx, test_idx in splits:
    sotu_train = subset.iloc[train_idx]
    sotu_test = subset.iloc[test_idx]


In [162]:
from sklearn.model_selection import GroupShuffleSplit

cv_splits = GroupShuffleSplit(n_splits=5, test_size=0.2)

In [165]:
""" Grid search hyperparameters """
from sklearn.model_selection import GridSearchCV

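# Keys follow the pattern <pipeline step>__<sub-step>__<parameter>
param_grid = {
    'feature_extraction__tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'classifier__n_neighbors': [1, 3],
    'classifier__weights': ['uniform', 'distance']
    # The following parameters only apply to a decision tree classifier:
    # 'classifier__criterion': ['gini', 'entropy'],
    # 'classifier__max_depth': [3, 5, 10, 20],
}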


search = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=cv_splits, verbose=3)
search.fit(sotu_train, sotu_train['Party'], groups=sotu_train['Name'])
print(f"Best parameter (CV score={search.best_score_:0.3f}):")
print(search.best_params_)


Fitting 5 folds for each of 48 candidates, totalling 240 fits


[CV 1/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=10;, score=0.525 total time=   3.9s
[CV 2/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=10;, score=0.606 total time=   4.3s
[CV 5/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=10;, score=0.722 total time=   3.9s
[CV 3/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=10;, score=0.568 total time=   4.3s
[CV 4/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=10;, score=0.282 total time=   4.3s
[CV 1/5] END classifier__n_neighbors=1, class



[CV 2/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=10;, score=0.576 total time=   8.3s
[CV 3/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=10;, score=0.595 total time=   8.0s
[CV 1/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=15;, score=0.600 total time=   7.3s
[CV 5/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=10;, score=0.722 total time=   7.7s
[CV 4/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=10;, score=0.282 total time=   8.3s
[CV 3/5] END classifier__n_neighbors=1, class

[CV 1/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=20;, score=0.500 total time=   6.0s
[CV 2/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=20;, score=0.606 total time=   6.2s
[CV 3/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=20;, score=0.541 total time=   6.0s


[CV 4/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=20;, score=0.359 total time=   6.3s


[CV 5/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=20;, score=0.694 total time=   5.9s


[CV 1/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=25;, score=0.575 total time=   6.2s


[CV 2/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=25;, score=0.576 total time=   6.6s


[CV 3/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=25;, score=0.514 total time=   6.4s
[CV 1/5] END classifier__n_neighbors=1, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=10;, score=0.525 total time=   3.6s
[CV 4/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=25;, score=0.385 total time=   7.4s
[CV 5/5] END classifier__n_neighbors=1, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=25;, score=0.556 total time=   6.7s
[CV 3/5] END classifier__n_neighbors=1, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=10;, score=0.568 total time=   4.0s
[CV 2/5] END classifier__n_neighbors=1, cla

[CV 3/5] END classifier__n_neighbors=1, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=20;, score=0.541 total time=   7.4s
[CV 2/5] END classifier__n_neighbors=1, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=20;, score=0.606 total time=   7.8s
[CV 5/5] END classifier__n_neighbors=1, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=20;, score=0.694 total time=   7.2s


[CV 4/5] END classifier__n_neighbors=1, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=20;, score=0.359 total time=   8.2s
[CV 1/5] END classifier__n_neighbors=1, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=25;, score=0.575 total time=   8.2s
[CV 2/5] END classifier__n_neighbors=1, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=25;, score=0.576 total time=   8.5s
[CV 3/5] END classifier__n_neighbors=1, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=25;, score=0.514 total time=   8.3s
[CV 1/5] END classifier__n_neighbors=3, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=10;, score=0.475 total time=   5.3s
[CV 2/5] END classifier__n_neighbors=3, c

[CV 4/5] END classifier__n_neighbors=1, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=25;, score=0.385 total time=  12.7s
[CV 1/5] END classifier__n_neighbors=3, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=15;, score=0.475 total time=   8.3s
[CV 2/5] END classifier__n_neighbors=3, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=15;, score=0.636 total time=   8.6s
[CV 5/5] END classifier__n_neighbors=1, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=25;, score=0.556 total time=  13.1s
[CV 3/5] END classifier__n_neighbors=3, classifier__weights=uniform, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=15;, score=0.405 total time=   8.3s
[CV 4/5] END classifier__n_neighbors=3, cla

[CV 2/5] END classifier__n_neighbors=3, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=20;, score=0.636 total time=   4.1s
[CV 3/5] END classifier__n_neighbors=3, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=20;, score=0.486 total time=   4.0s
[CV 5/5] END classifier__n_neighbors=3, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=20;, score=0.556 total time=   3.7s
[CV 4/5] END classifier__n_neighbors=3, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=20;, score=0.385 total time=   4.1s
[CV 1/5] END classifier__n_neighbors=3, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 1), feature_extraction__topics__num_topics=25;, score=0.450 total time=   4.2s
[CV 2/5] END classifier__n_neighbors=3, 

[CV 4/5] END classifier__n_neighbors=3, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 2), feature_extraction__topics__num_topics=25;, score=0.436 total time=   5.7s
[CV 5/5] END classifier__n_neighbors=3, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 2), feature_extraction__topics__num_topics=25;, score=0.417 total time=   4.9s


[CV 1/5] END classifier__n_neighbors=3, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=10;, score=0.450 total time=   7.1s
[CV 2/5] END classifier__n_neighbors=3, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=10;, score=0.576 total time=   7.6s
[CV 3/5] END classifier__n_neighbors=3, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=10;, score=0.568 total time=   7.3s
[CV 5/5] END classifier__n_neighbors=3, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=10;, score=0.694 total time=   6.9s
[CV 4/5] END classifier__n_neighbors=3, classifier__weights=distance, feature_extraction__tfidf__ngram_range=(1, 3), feature_extraction__topics__num_topics=10;, score=0.385 total time=   7.9s
[CV 1/5] END classifier__n_neighbors=3, 

In [158]:
print(classification_report(sotu_test['Party'], search.predict(sotu_test)))

              precision    recall  f1-score   support

  Democratic       0.20      0.25      0.22         4
  Republican       0.80      0.75      0.77        16

    accuracy                           0.65        20
   macro avg       0.50      0.50      0.50        20
weighted avg       0.68      0.65      0.66        20
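
To see how the numbers in this report relate to each other, the following sketch recomputes them from confusion counts. The counts are not part of the original output. They are inferred here from the supports, precisions, and recalls printed above, so please treat them as an assumption.

```python
# Confusion counts inferred from the report above (an assumption, not original output):
# 1 of the 4 Democratic speeches and 12 of the 16 Republican speeches were classified correctly
true_positives = {"Democratic": 1, "Republican": 12}
support = {"Democratic": 4, "Republican": 16}

# With two classes, every misclassified speech of one party is predicted as the other party
predicted = {
    "Democratic": true_positives["Democratic"] + (support["Republican"] - true_positives["Republican"]),
    "Republican": true_positives["Republican"] + (support["Democratic"] - true_positives["Democratic"]),
}

for party in support:
    precision = true_positives[party] / predicted[party]  # correct among all predicted as this party
    recall = true_positives[party] / support[party]       # correct among all speeches of this party
    f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two
    print(f"{party}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

accuracy = sum(true_positives.values()) / sum(support.values())
print(f"accuracy={accuracy:.2f}")
```

Under these inferred counts, always answering `Republican` would be correct for 16 of the 20 test speeches, which is more than the 13 the classifier gets right. This is worth keeping in mind when reading the accuracy of 0.65.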



In [159]:
# Find the most similar training speech for every speech in the test set
best_pipe = search.best_estimator_
features_test = best_pipe['feature_extraction'].transform(sotu_test)
distances, neighbors = best_pipe['classifier'].kneighbors(features_test)

for idx, (_, speech) in enumerate(sotu_test.iterrows()):
    # Look up columns by name so the result does not depend on the column order
    nearest = sotu_train.iloc[neighbors[idx][0]]
    print('Speech to classify:')
    print(f"year={speech['Year']}, president={speech['Name']}, party={speech['Party']}")
    print('Most similar speech:')
    print(f"year={nearest['Year']}, president={nearest['Name']}, party={nearest['Party']}")
    print('-'*10)

Speech to classify:
year=1889, president=Benjamin Harrison, party=Republican
Most similar speech:
year=1885, president=Grover Cleveland, party=Democratic
----------
Speech to classify:
year=1890, president=Benjamin Harrison, party=Republican
Most similar speech:
year=1879, president=Rutherford B. Hayes, party=Republican
----------
Speech to classify:
year=1891, president=Benjamin Harrison, party=Republican
Most similar speech:
year=1885, president=Grover Cleveland, party=Democratic
----------
Speech to classify:
year=1892, president=Benjamin Harrison, party=Republican
Most similar speech:
year=1894, president=Grover Cleveland, party=Democratic
----------
Speech to classify:
year=1901, president=Theodore Roosevelt, party=Republican
Most similar speech:
year=1912, president=William Howard Taft, party=Republican
----------
Speech to classify:
year=1902, president=Theodore Roosevelt, party=Republican
Most similar speech:
year=1912, president=William Howard Taft, party=Republican
----------

<table >
<tbody>
  <tr>
    <td style="padding:0px;border-width:0px;vertical-align:center">    
    Created by Simon Stone for Dartmouth College Library under <a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons CC BY-NC 4.0 License</a>.<br>For questions, comments, or improvements, email <a href="mailto:researchdatahelp@groups.dartmouth.edu">Research Data Services</a>.
    </td>
    <td style="padding:0 0 0 1em;border-width:0px;vertical-align:center"><img alt="Creative Commons License" src="https://i.creativecommons.org/l/by/4.0/88x31.png"/></td>
  </tr>
</tbody>
</table>
