In [None]:
# Cyberthon Data Science Training Materials
# Author: Ragul Balaji <ragulbalaji@ctf.sg>
# Dataset: Public Domain
# ALT-TAB LABS LLP (C) 2019-present

If you're opening this notebook locally, make sure your environment has the following package versions installed. Uncomment the cell below and run it.

In [None]:
# ! pip install pandas==2.2.1 scikit-learn==1.0.2 matplotlib==3.5.1

# Part I: The Problem

Can we predict who sent a tweet? 🤔

This is a classification problem: the target variable belongs to one of two categories. In our example, the categories are `Donald Trump` and `Justin Trudeau`.

Let's dive in and explore!

## Loading the Dataset

We will load the file `train.csv` into a `DataFrame` object using pandas' `read_csv()`.

Before doing any analysis, we should always understand our dataset.
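A few quick checks usually help here: the shape of the table, how many tweets each author has, and whether any values are missing. The sketch below uses a tiny made-up table (`toy` and its rows are hypothetical stand-ins), assuming the columns are named `author` and `status` as they are later in this notebook.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data comes from train.csv
toy = pd.DataFrame({
    "author": ["Donald Trump", "Justin Trudeau", "Donald Trump"],
    "status": ["Thank you!", "Merci à tous", "Great rally tonight"],
})

n_rows, n_cols = toy.shape                    # size of the table
class_counts = toy["author"].value_counts()   # number of tweets per author
missing = toy.isna().sum()                    # missing values per column

print(n_rows, n_cols)
print(class_counts)
print(missing)
```

The same three calls can be run on `data` once it is loaded. A heavily imbalanced `value_counts()` would be worth knowing about before trusting accuracy as a metric.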

In [None]:
import pandas as pd

data = pd.read_csv("train.csv")
data.head()

In [None]:
# Inspect a single tweet in more detail
data['status'].iloc[2]

## Target variable and Predictor(s)

We first identify that our target variable is `"author"`, and our predictor variable is `"status"`.

You can also apply stemming/lemmatization (covered earlier), or engineer new features that could be useful in model prediction (not covered).

In [None]:
# identify target and predictor
y = data['author']
X = data['status']

## Training, Validation and Test set

### Training set

- Data used to train our model
- Data and labels are provided to the model, and the model tunes its parameters to fit the data

### Validation set

- Data used to tune the hyperparameters of our model
- Optional

### Test Set

- Data used to test our model after it has been trained
- Predicted labels are compared against the true labels to compute the accuracy, which measures the model's performance.
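To make "compute the accuracy" concrete, here is a minimal sketch on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical). Accuracy is simply the fraction of predictions that match the true labels, and `accuracy_score` from `sklearn` computes the same number.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for four tweets
y_true_toy = ["Donald Trump", "Justin Trudeau", "Donald Trump", "Justin Trudeau"]
y_pred_toy = ["Donald Trump", "Donald Trump", "Donald Trump", "Justin Trudeau"]

# Count matching positions and divide by the number of samples
correct = sum(t == p for t, p in zip(y_true_toy, y_pred_toy))
manual_acc = correct / len(y_true_toy)

# The library function should agree with the manual computation
sk_acc = accuracy_score(y_true_toy, y_pred_toy)

print(manual_acc, sk_acc)
```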

### Why shouldn't the model be trained using the test set too?

- We want our model to generalize well on unseen data.
- A model trained on the test set has already seen that data, so its predictions on it are biased and its measured performance is overestimated.
- In practice, data in the training set must NOT appear in the test set. Such overlap is called "data leakage".

### Using exams as an analogy

- Train Set: Learning materials for students (Lecture notes, tutorials, Ten-year Series, ...)
- Test Set: Exams

You will not know the exam questions before the exam, unless the questions are leaked in advance (data leakage).

Here, we specify some parameters
- `random_state=42` : for reproducible results
- `test_size=0.25` : dataset will be split into 75% training and 25% test
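The effect of these two parameters can be checked on a small made-up dataset. In this sketch (`toy_X` and `toy_y` are hypothetical), 8 samples are split into 6 for training and 2 for testing, and repeating the call with the same `random_state` gives the same split.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 "tweets" with alternating labels
toy_X = [f"tweet {i}" for i in range(8)]
toy_y = ["A", "B"] * 4

# First split
a_train, a_test, ya_train, ya_test = train_test_split(
    toy_X, toy_y, random_state=42, test_size=0.25
)
# Second split with the same random_state
b_train, b_test, yb_train, yb_test = train_test_split(
    toy_X, toy_y, random_state=42, test_size=0.25
)

print(len(a_train), len(a_test))          # sizes of the two parts
print(a_test == b_test)                   # same seed, same split
print(set(a_train).isdisjoint(a_test))    # no sample lands in both parts
```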

In [None]:
from sklearn.model_selection import train_test_split

# split dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42,test_size=0.25)

Great! Now that we have split our data, are we able to train the model using the training set yet? No!

## Text Vectorization

As covered earlier, machine learning models require numerical inputs. Raw text unfortunately doesn't work here.😞

Let's use `CountVectorizer` to convert the tweets into numerical vectors!

We first create an instance of `CountVectorizer`.

We want the model to learn the vocabulary from the training set, so we fit the `CountVectorizer` using only the training set, not the test set.

We then transform both the training set and the test set to encode each tweet as a vector.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform/vectorize the training set
vec_train = vectorizer.fit_transform(X_train)

# Transform/vectorize the test set (stored as vec_validation) using the fitted vocabulary
vec_validation = vectorizer.transform(X_test)

Confused? 😓 Here's a simple demo of `CountVectorizer` using the 2 documents mentioned in the training slides.

`doc1 = "Data Science taught during Cyberthon"`

`doc2 = "CSIT focused on data science and data analytics"`

We fit `CountVectorizer` using `doc1` and `doc2`, so the vocabulary contains words from both documents.

In [None]:
doc1 = "Data Science taught during Cyberthon"
doc2 = "CSIT focused on data science and data analytics"

demo_vec = CountVectorizer()
matrix = demo_vec.fit_transform([doc1, doc2])

# CountVectorizer learns the vocabulary from doc1 and doc2
demo_vec.get_feature_names_out()

Generate the document-term matrix.

There are 2 documents and 10 unique terms, hence a 2x10 matrix.

Each cell gives the count of a term in a document.

In [None]:
pd.DataFrame(matrix.toarray(), columns=demo_vec.get_feature_names_out(), index = ['doc1','doc2'])

## Machine Learning Model

We will use the library `sklearn` due to its wide selection of models. Machine learning models in `sklearn` are objects which need to be initialized first.

Let's use what we learnt earlier - Decision 🌲!

Here, we specify some parameters
- `random_state=42` : for reproducible results
- `max_depth=8` : to limit tree depth to 8

For detailed documentation, refer here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Some useful methods
- `.fit()` : to pass in our training set to train a machine learning model
- `.predict()` : to pass in our test set to make predictions using our model
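Here is how `.fit()` and `.predict()` fit together on a tiny made-up example. The documents, labels and `toy_` names below are hypothetical and only illustrate the call pattern, not the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training tweets and their authors
docs = ["thank you hannity", "thank you america", "merci canada", "bonjour canada"]
labels = ["Donald Trump", "Donald Trump", "Justin Trudeau", "Justin Trudeau"]

# Text must be vectorized before it reaches the model
toy_vec = CountVectorizer()
toy_features = toy_vec.fit_transform(docs)

# Initialize, then train
toy_model = DecisionTreeClassifier(random_state=42, max_depth=8)
toy_model.fit(toy_features, labels)

# Predict on new text, vectorized with the same fitted vectorizer
toy_pred = toy_model.predict(toy_vec.transform(["merci canada"]))
print(toy_pred[0])
```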

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Initialize our machine learning model
model = DecisionTreeClassifier(random_state=42, max_depth=8)

In [None]:
# Train the model
model.fit(vec_train,y_train)

In [None]:
# Make predictions on test set
y_pred = model.predict(vec_validation)

In [None]:
# Score the model using accuracy_score
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_true = y_test, y_pred = y_pred)

print(f"Accuracy: {acc}")

In [None]:
# Visualise the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

ConfusionMatrixDisplay.from_predictions(y_true = y_test, y_pred = y_pred)
plt.show()

### Challenge
Retrain the model, but use TF-IDF as features instead of CountVectorizer.

# Part II: Model Interpretability

How does the model make its predictions? Let's visualize the 🌲!

### Example: Model predicts Justin Trudeau sent the tweet
Starting at the root node (the node where the 🌲 begins), <br>
if `de` is absent in the tweet, go to the left branch, <br>
if `rt` is present in the tweet, go to the right branch, <br>
if `thank` and `hannity` and `seanhannity` are absent in the tweet, <br>
the model predicts the tweet sender as `Justin Trudeau`.

In [None]:
# RUN THIS PART AFTER YOU ARE DONE WITH PART I

from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(16,16))
tree.plot_tree(model, fontsize=10, filled=True, class_names=['Donald', 'Justin'], feature_names=vectorizer.get_feature_names_out())
plt.show()

The _absence_ of `de`, _presence_ of `rt`, _absence_ of `thank` and `hannity` and `seanhannity` result in the model predicting `Justin Trudeau` as the tweet sender.

# Part III: Model Attack

Now, you are able to identify what features the model uses to make its predictions. 

Are you able to make minimal modifications to a tweet to fool the model?

We have established in Part II that `de`, `rt`, `thank`, `hannity` and `seanhannity` are features used by the model to make its predictions.

Specifically, the absence of `de`, presence of `rt`, absence of `thank` and `hannity` and `seanhannity` result in the model predicting `Justin Trudeau` as the tweet sender.

Using the same features, what would fool the model into classifying `Donald Trump` as the tweet sender? The presence of `thank` or `hannity` in the tweet!

Let's explore modifying a sample tweet from the test set!

In [None]:
sampleid = 1
truth = y_test.iloc[sampleid]

print('tweet:', repr(X_test.iloc[sampleid]))
print('\n   classifies as:', y_pred[sampleid])
print(' ground truth is:', truth)

In [None]:
# we want to modify an input so that the model misclassifies

original = X_test.iloc[sampleid]
attack1 = X_test.iloc[sampleid] + ' thank'
attack2 = X_test.iloc[sampleid] + ' hannity'

print('\n original:', repr(original))
print('\n attack1:', repr(attack1))
print('\n attack2:', repr(attack2))

Same concept as before: vectorize the text before feeding it to the machine learning model.

In [None]:
# vectorize the attacks
vec_attack = vectorizer.transform([original, attack1, attack2])

# feed attacks into model for prediction
attack_pred = model.predict(vec_attack)

print('\n   truth:', truth)
print('\n   original:', attack_pred[0])
print('\n   attack1:', attack_pred[1])
print('\n   attack2:', attack_pred[2])

Modifying the sample tweet by adding `thank` or `hannity` fooled the model into predicting `Donald Trump` instead of `Justin Trudeau` as the tweet sender.

Can you identify another modification to fool the model? 😁