# Introduction to Natural Language Processing
## 3. Supervised Learning

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from util import clean_tweet, plot_confusion_matrix, evaluate_model

# dictionary of colours for making nice plots later
PARTY_COLOURS = {"trump": "#E91D0E", "obama": "#00A6EF"}

%matplotlib inline

## Data Cleaning

In [None]:
df = pd.read_pickle("tweets.pkl")
df["text"] = df["text"].map(clean_tweet)

In [None]:
df.head()

## Supervised Learning

With vector representations of our text data in hand, we can train any number of machine learning models to classify Trump vs. Obama tweets. Because these feature vectors tend to be very large, it is generally best to use models that are not prone to overfitting when there are more features than samples. Naive Bayes and SVMs are good choices.
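To make the "more features than samples" point concrete, here is a minimal sketch on a made-up toy corpus. The texts and labels are invented for illustration and are not taken from the tweet dataset. `MultinomialNB` and `LinearSVC` are used here as one possible Naive Bayes and SVM implementation, not the only reasonable ones.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; the real notebook would use df["text"] and df["label"].
texts = [
    "we will make the economy great again",
    "fake news media is a total disaster",
    "affordable health care for every family",
    "climate change demands action from congress",
]
labels = ["trump", "trump", "obama", "obama"]

# Bag-of-words features: far more columns (vocabulary terms) than rows (tweets).
X = CountVectorizer().fit_transform(texts)
n_samples, n_features = X.shape
print(f"{n_samples} samples, {n_features} features")

# Two models that usually cope well with sparse, high-dimensional input.
candidates = {
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

fitted = {}
for name, model in candidates.items():
    # Each pipeline gets its own vectorizer so the models stay independent.
    pipe = Pipeline(steps=[("count", CountVectorizer()), (name, model)])
    pipe.fit(texts, labels)
    fitted[name] = pipe
    print(name, pipe.predict(["health care for congress"]))
```

Even with four short texts the vocabulary is several times larger than the number of samples, and on real tweets the gap is typically much wider.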

**Exercise: Train the best classifier you can to distinguish Obama tweets from Trump tweets. We have held out 600 tweets from the dataset, which will be used to test your models.**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

In [None]:
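# create a vectorizer: unigrams only, keep the 10 most frequent terms,
# and ignore terms that appear in more than 60% of tweets
count_vectorizer = CountVectorizer(
    ngram_range=(1, 1), max_features=10, max_df=0.6
)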

# create some classifier
knn = KNeighborsClassifier()

# create a pipeline
clf = Pipeline(steps=[("count", count_vectorizer), ("knn", knn)])

# do a train/test split to assess our model
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

# train the model
clf.fit(df_train["text"], df_train["label"])

# assess the model
preds = clf.predict(df_test["text"])
score = clf.score(df_test["text"], df_test["label"])
print(f"Model's accuracy is {score:.3f}")

f, ax = plot_confusion_matrix(
    df_test["label"], preds, classes=["Obama", "Trump"]
)
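A single train/test split can give a noisy accuracy estimate, so when comparing several candidate pipelines it may help to cross-validate on the training data. The sketch below assumes a small made-up corpus standing in for `df_train["text"]` and `df_train["label"]`; the choice of `MultinomialNB` and of 3 folds is illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for df_train["text"] and df_train["label"].
train_texts = [
    "make america great again",
    "the fake news media is a disaster",
    "we will build a great wall",
    "health care for every american family",
    "climate change is real and demands action",
    "congress should act on health care",
]
train_labels = ["trump", "trump", "trump", "obama", "obama", "obama"]

# A candidate pipeline to compare against the baseline.
candidate = Pipeline(
    steps=[("count", CountVectorizer()), ("nb", MultinomialNB())]
)

# 3-fold cross-validation: the whole pipeline, including the vectorizer,
# is refit on the training part of each fold.
scores = cross_val_score(candidate, train_texts, train_labels, cv=3)
print(f"Mean CV accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Because the vectorizer sits inside the pipeline, its vocabulary is learned only from the training part of each fold, which avoids leaking information from the validation part.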

### Evaluate your model

When you're happy with your classifier, **and only then**, you can evaluate its performance against the hold-out set. To do so, run the cell below.

In [None]:
evaluate_model(clf)
plt.show()