## Train a logistic regression model for binary classification

### Q1

In [29]:
# load twitter dataset into pandas and display
import pandas as pd

df_train = pd.read_csv('data/twitter_binary_classification_dataset/train.csv')
df_test = pd.read_csv('data/twitter_binary_classification_dataset/test.csv')

display(df_train.head())

Unnamed: 0.1,Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,13248,569906532277731328,negative,1.0,Flight Attendant Complaints,0.3855,American,,nic_tudobem,,0,@AmericanAir She could even see that I had tri...,,2015-02-23 09:08:00 -0800,New York,
1,13246,569906807696551936,positive,1.0,,,American,,KaraAtDell,,0,@AmericanAir those were snacks we left on purp...,,2015-02-23 09:09:05 -0800,"Round Rock, TX",
2,4748,569878685723049985,neutral,0.6648,,,Southwest,,SaraAMartens,,0,@SouthwestAir thanks for linking to #Passbook....,,2015-02-23 07:17:21 -0800,"Omaha, NE",Central Time (US & Canada)
3,8249,568558887290441728,negative,1.0,Bad Flight,0.699,Delta,,superhilarious,,0,@JetBlue :/ he was trying to take stuff from t...,,2015-02-19 15:52:56 -0800,,Central Time (US & Canada)
4,5016,569538524321419265,positive,1.0,,,Southwest,,dirtytweetbacon,,0,@SouthwestAir last week I flew from DAL to LAX...,,2015-02-22 08:45:40 -0800,,


In [30]:
# get tweet counts
print("There are ", df_train['text'].count() + df_test['text'].count(), "tweets")

There are  14640 tweets


In [31]:
# get earliest and latest tweet
# (timestamps are compared as strings, which orders correctly only while
# every value shares the same format and UTC offset, as in the rows shown)
e_train = df_train['tweet_created'].min()
l_train = df_train['tweet_created'].max()
e_test = df_test['tweet_created'].min()
l_test = df_test['tweet_created'].max()

earliest = min(e_train, e_test)
latest = max(l_train, l_test)

print("Earliest tweet : ", earliest)
print("Latest tweet: ", latest)

Earliest tweet :  2015-02-16 23:36:05 -0800
Latest tweet:  2015-02-24 11:53:37 -0800
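
Taking the min and max of `tweet_created` as raw strings relies on every timestamp having the same layout and offset. A minimal sketch of a more robust variant, assuming the `YYYY-MM-DD HH:MM:SS ±HHMM` layout seen in the preview, parses the strings first. The `created` series below is a hypothetical stand-in for the real column.

```python
import pandas as pd

# Hypothetical sample in the same layout as the `tweet_created` column
created = pd.Series([
    "2015-02-23 09:08:00 -0800",
    "2015-02-16 23:36:05 -0800",
    "2015-02-24 11:53:37 -0800",
])

# Parse into timezone-aware timestamps, normalised to UTC
parsed = pd.to_datetime(created, format="%Y-%m-%d %H:%M:%S %z", utc=True)

earliest_ts = parsed.min()
latest_ts = parsed.max()
print("Earliest tweet:", earliest_ts)
print("Latest tweet:", latest_ts)
```

Because the values are normalised to UTC, the comparison stays correct even if different rows carried different offsets.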


In [32]:
# list the unique airlines
df_train['airline'].unique()

array(['American', 'Southwest', 'Delta', 'US Airways', 'United',
       'Virgin America'], dtype=object)

In [33]:
# get numbers of tweets per airline
df_train['airline'].value_counts() + df_test['airline'].value_counts()

United            3822
US Airways        2913
American          2759
Southwest         2420
Delta             2222
Virgin America     504
Name: airline, dtype: int64

In [34]:
# get number of tweets per sentiment label
df_train['airline_sentiment'].value_counts() + df_test['airline_sentiment'].value_counts()

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64
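
Adding two `value_counts()` results with `+` aligns them by label, so a label that is missing from one split would come out as `NaN`. That did not happen here, but a hedged sketch of a safer way to combine the splits, using small hypothetical label series, looks like this:

```python
import pandas as pd

# Hypothetical label columns; "neutral" is missing from the second split
train_labels = pd.Series(["negative", "negative", "positive", "neutral"])
test_labels = pd.Series(["negative", "positive"])

# Plain addition: a label absent from one side becomes NaN
naive_total = train_labels.value_counts() + test_labels.value_counts()

# Series.add with fill_value treats a missing label as a count of 0
safe_total = (
    train_labels.value_counts()
    .add(test_labels.value_counts(), fill_value=0)
    .astype(int)
)
print(safe_total)
```

Concatenating the two columns with `pd.concat` before calling `value_counts()` would give the same totals.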

### Q2

In [35]:
# remove rows with neutral sentiment label
df_train = df_train.drop(df_train[df_train['airline_sentiment'] == 'neutral'].index)
df_test = df_test.drop(df_test[df_test['airline_sentiment'] == 'neutral'].index)
# get numbers of tweets per airline
df_train['airline'].value_counts() + df_test['airline'].value_counts()

United            3125
US Airways        2532
American          2296
Southwest         1756
Delta             1499
Virgin America     333
Name: airline, dtype: int64

In [36]:
# get number of tweets per sentiment label in binary dataset
df_train['airline_sentiment'].value_counts() + df_test['airline_sentiment'].value_counts()

negative    9178
positive    2363
Name: airline_sentiment, dtype: int64

After removing all neutral tweets, the binary dataset contains no neutral label, while the numbers of positive (2363) and negative (9178) tweets are unchanged. The tweet count for every airline drops, since each airline had some neutral tweets.

### Q3

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

# get a list of lower-cased airline names with spaces removed, prefixed with "@"
air = []
for i in df_test.airline.unique():
    air.append("@" + i.lower().replace(" ", ""))
air

['@southwest',
 '@american',
 '@usairways',
 '@delta',
 '@united',
 '@virginamerica']

In [41]:
# remove the airline handles listed in `air` from "text" (case-insensitive)
for i in air:
    df_test["text"] = df_test.text.str.replace(i, "", case = False)

for i in air:
    df_train["text"] = df_train.text.str.replace(i, "", case = False)

# use TF-IDF vectorizer to convert texts into weighted vectors
vectorizer = TfidfVectorizer()
# fit_transform learns the vocab and produces vectors at the same time
train_vectors = vectorizer.fit_transform(df_train['text'])
# transform uses the same vocab and produces vectors for the new data (test)
test_vectors = vectorizer.transform(df_test['text'])
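
One thing to watch in the handle removal: the preview shows handles such as `@AmericanAir` and `@SouthwestAir`, so removing `@american` leaves the suffix `Air` behind as a token. A hedged sketch of an alternative is to strip every `@mention` with a regular expression. This is not the approach used in this notebook, and it also removes mentions of non-airline accounts. The `tweets` series is a hypothetical sample.

```python
import pandas as pd

# Hypothetical tweets in the style of the dataset preview
tweets = pd.Series([
    "@AmericanAir thanks for the upgrade",
    "@SouthwestAir last week I flew to LAX",
    "Great crew today @JetBlue!",
])

# Removing only the lower-cased airline name leaves part of the handle behind
prefix_stripped = tweets.str.replace("@american", "", case=False, regex=True)

# Removing "@" followed by word characters drops the whole handle
handle_stripped = tweets.str.replace(r"@\w+", "", regex=True).str.strip()

print(prefix_stripped[0])
print(handle_stripped[0])
```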

### Q4

In [42]:
df_train['airline_sentiment'] = df_train['airline_sentiment'].map({'positive': 1, 'negative': 0})
df_test['airline_sentiment'] = df_test['airline_sentiment'].map({'positive': 1, 'negative': 0})

### Q5

In [43]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# TF-IDF features and binary labels for the training split
X_train = train_vectors
y_train = df_train['airline_sentiment']

# train a LR classifier on train data
lr = LogisticRegression()
lr.fit(X_train, y_train)

# predict on train data
yhat_train = lr.predict(X_train)

# calculate performance on the train data
precision = precision_score(y_train, yhat_train)
f1score = f1_score(y_train, yhat_train)
recall = recall_score(y_train, yhat_train)
accuracy = accuracy_score(y_train, yhat_train)

print("Precision:" + str(precision))
print("Recall:" + str(recall))
print("F1-score: " + str(f1score))
print("Accuracy: " + str(accuracy))

Precision:0.9657273419649657
Recall:0.6705446853516658
F1-score: 0.7915106117353308
Accuracy: 0.9276508177190512


### Q6

In [44]:
# TF-IDF features and binary labels for the test split
X_test = test_vectors
y_test = df_test['airline_sentiment']

# predict on test data
yhat_test = lr.predict(X_test)

# calculate performance on the test data
precision = precision_score(y_test, yhat_test)
f1score = f1_score(y_test, yhat_test)
recall = recall_score(y_test, yhat_test)
accuracy = accuracy_score(y_test, yhat_test)

print("Precision:" + str(precision))
print("Recall:" + str(recall))
print("F1-score: " + str(f1score))
print("Accuracy: " + str(accuracy))

Precision:0.9352750809061489
Recall:0.6122881355932204
F1-score: 0.7400768245838668
Accuracy: 0.9120450606585788


From the results, we can see that every performance score on the test data is lower than on the training data (for example, recall drops from 0.67 to 0.61 and F1 from 0.79 to 0.74). The gap between the train and test splits suggests that the classifier is slightly overfitting to the training data. Overfitting occurs when a model fits patterns specific to the training data that do not carry over to new, unseen data, which is why the metrics drop on the test split.

### Q7

In [45]:
# show confusion matrix
cm = confusion_matrix(y_test, yhat_test)
print()
print('Confusion matrix')
print(cm)


Confusion matrix
[[1816   20]
 [ 183  289]]


True Positives (TP): 289 - the number of positive tweets correctly classified as positive.
True Negatives (TN): 1816 - the number of negative tweets correctly classified as negative.
False Positives (FP): 20 - the number of negative tweets incorrectly classified as positive.
False Negatives (FN): 183 - the number of positive tweets incorrectly classified as negative.
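
The four counts above are enough to recompute the test metrics reported in Q6. The sketch below hard-codes the confusion matrix printed above (rows are actual labels, columns are predicted labels, in scikit-learn's layout) and derives each score by hand.

```python
import numpy as np

# Confusion matrix as printed above: [[TN, FP], [FN, TP]]
cm_reported = np.array([[1816, 20],
                        [183, 289]])
tn, fp, fn, tp = cm_reported.ravel()

precision_from_cm = tp / (tp + fp)                 # 289 / 309
recall_from_cm = tp / (tp + fn)                    # 289 / 472
f1_from_cm = 2 * tp / (2 * tp + fp + fn)           # 578 / 781
accuracy_from_cm = (tp + tn) / cm_reported.sum()   # 2105 / 2308

print("Precision:", precision_from_cm)
print("Recall:", recall_from_cm)
print("F1-score:", f1_from_cm)
print("Accuracy:", accuracy_from_cm)
```

The results agree with the Q6 output, which confirms that the low recall comes from the 183 false negatives.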

Class Imbalance: There's a class imbalance in the dataset, with more negative tweets (9178) than positive tweets (2363). This can bias the model toward the majority class (negative sentiment). To mitigate this, techniques like resampling (for example, oversampling the minority class) or class weights in the classifier would be useful. The TF-IDF weights from question 3 reweight tokens rather than classes, so they do not correct the imbalance on their own.

False Negatives: There are 183 FN, which are potentially problematic because they are positive tweets that the model failed to recognize. They account for the low recall (289 / (289 + 183) ≈ 0.61). To reduce FN, it would be useful to tune the model's hyperparameters, lower the decision threshold, or use cost-sensitive learning to prioritize recall.

False Positives: There are 20 FP. While this number is relatively low, it is still worth minimizing, because misclassifying negative sentiment as positive could lead to incorrect insights. Improving the precision of the model would reduce FP.
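
As one illustration of cost-sensitive learning, scikit-learn's `LogisticRegression` accepts `class_weight='balanced'`, which weights each class inversely to its frequency. The sketch below uses hypothetical synthetic data rather than the tweet features, so it only shows the mechanism and makes no claim about the effect on this dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced toy data (about 80% class 0, 20% class 1);
# it stands in for the TF-IDF features and is not the tweet dataset.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=20, weights=[0.8, 0.2], random_state=0
)

# 'balanced' gives each class the weight n_samples / (n_classes * class_count)
toy_class_weights = compute_class_weight(
    "balanced", classes=np.array([0, 1]), y=y_toy
)
print("Class weights:", toy_class_weights)

# Errors on the minority class now cost more during training
weighted_lr = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted_lr.fit(X_toy, y_toy)
```

A weighted model typically trades some precision for recall, so the confusion matrix should be checked again after retraining.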

## Implement a KNN model for multi-class classification using BERT document embeddings

### Q8

In [46]:
news_train = pd.read_csv('data/ag_news_multiclass_classification_dataset/train.csv')
news_test = pd.read_csv('data/ag_news_multiclass_classification_dataset/test.csv')

display(news_train.head())

Unnamed: 0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
0,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
1,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
2,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
3,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."
4,3,"Stocks End Up, But Near Year Lows (Reuters)",Reuters - Stocks ended slightly higher on Frid...


In [47]:
# Calculate the number of news articles in each split
print("There are ", len(news_train), "news in the training split")
print("There are ", len(news_test), "news in the testing split")

There are  119999 news in the training split
There are  7599 news in the testing split
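
The preview above shows an article ("Wall St. Bears Claw Back Into the Black") in the header position, and both counts are one short of a round number, which suggests that `read_csv` consumed the first record as column names. A minimal sketch of the difference, assuming the files have no header row (an inference from the preview, not something confirmed here) and using a hypothetical two-row CSV:

```python
import io
import pandas as pd

# Hypothetical two-record CSV without a header row
raw_csv = io.StringIO('3,"Title A","Text A"\n4,"Title B","Text B"\n')

# Default behaviour: the first record is used as the column names
with_inferred_header = pd.read_csv(raw_csv)

# header=None keeps every record as data; names= sets the column names
raw_csv.seek(0)
with_explicit_names = pd.read_csv(
    raw_csv, header=None, names=["label", "title", "text"]
)

print(len(with_inferred_header), len(with_explicit_names))
```

If the assumption holds, loading with `header=None` would also restore the one article missing from label 3 in each split.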


In [48]:
# rename columns
news_train.columns = ["label", "title", "text"]
news_test.columns = ["label", "title", "text"]

# concatenate title and text into a single news article
news_train['document'] = news_train['title'] + ' ' + news_train['text']
news_test['document'] = news_test['title'] + ' ' + news_test['text']

# Calculate document lengths
train_doc_lengths = news_train['document'].apply(len)
test_doc_lengths = news_test['document'].apply(len)

# Calculate summary statistics for document lengths
train_avg_length = train_doc_lengths.mean()
train_min_length = train_doc_lengths.min()
train_max_length = train_doc_lengths.max()

test_avg_length = test_doc_lengths.mean()
test_min_length = test_doc_lengths.min()
test_max_length = test_doc_lengths.max()

print('Average training document length:', train_avg_length, 'characters')
print('Minimum training document length:', train_min_length, 'characters')
print('Maximum training document length:', train_max_length, 'characters')
print()
print('Average testing document length:', test_avg_length, 'characters')
print('Minimum testing document length:', test_min_length, 'characters')
print('Maximum testing document length:', test_max_length, 'characters')

Average training document length: 236.47829565246377 characters
Minimum training document length: 100 characters
Maximum training document length: 1012 characters

Average testing document length: 235.3089880247401 characters
Minimum testing document length: 100 characters
Maximum testing document length: 892 characters


In [49]:
# calculate the distribution of labels in each split
news_train['label'].value_counts(), news_test['label'].value_counts()

(4    30000
 2    30000
 1    30000
 3    29999
 Name: label, dtype: int64,
 4    1900
 2    1900
 1    1900
 3    1899
 Name: label, dtype: int64)

The label distributions of the train and test splits are nearly identical, and each label has nearly the same count within each split. This indicates that the dataset is very balanced.

### Q9

In [50]:
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(news_train.document)
test_vectors = vectorizer.transform(news_test.document)

# reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# get feature names
feature_names = vectorizer.get_feature_names_out()

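# find the highest tf-idf weighted tokens for each label
for label in news_train.label.unique():
    # rows of the TF-IDF matrix that belong to this label
    label_docs = train_vectors[news_train['label'] == label]
    # maximum weight each token reaches in any document of this label
    label_tfidf_max = label_docs.max(0).toarray()[0]
    # indices of the 10 largest weights, in descending order
    top_indices = label_tfidf_max.argsort()[-10:][::-1]
    top_features = [feature_names[idx] for idx in top_indices]
    print("Top 10 TF-IDF weighted tokens for label", label, ':', top_features)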

Top 10 TF-IDF weighted tokens for label 3 : ['evansville', 'cox', 'axa', 'geico', 'giuliani', 'yo', 'anz', 'steel', 'accoona', 'sohu']
Top 10 TF-IDF weighted tokens for label 4 : ['logger', 'dilithium', 'squip', 'blinkx', 'sulphur', 'gigaset', 'fpd', 'picasa', 'oddworld', 'sda']
Top 10 TF-IDF weighted tokens for label 2 : ['lua', 'petke', 'trotter', 'maddox', 'bowl', 'numbers', 'distraction', 'brockton', 'quot', 'rostock']
Top 10 TF-IDF weighted tokens for label 1 : ['azam', 'wrong', 'azzam', 'aceh', 'shipyard', 'lebanese', 'comprehensive', 'eritrea', 'nauru', 'anarchists']


Label 1: The top tokens in this category include terms like "azam," "aceh," "eritrea," and "anarchists".  "aceh" might indicate a geographic location. This class may represent news related to global or regional events, possibly with a focus on conflict or social issues.

Label 2: The top tokens in this category include terms like "lua," "trotter," "bowl," and "rostock." These terms may relate to a variety of subjects, but with a potential focus on sports.

Label 3: The top tokens in this category include terms like "evansville," "cox," "axa," "geico," and "giuliani." These terms seem to be related to names and companies. This class may represent news related to finance, insurance, and politics.

Label 4: The top tokens in this category include terms like "logger," "dilithium," "squip," and "picasa." These terms appear to be more technical or related to software and technology.

The actual labels are "World", "Sports", "Business", and "Sci/Tech" for labels 1 to 4 respectively. The guesses based on the top tokens appear to be quite accurate in identifying the general topic represented by each class label.


### Q11

In [66]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, f1_score

# TF-IDF features and labels for the train and test splits
Xt = train_vectors
yt = news_train.label
Xv = test_vectors
yv = news_test.label

# create a KNN classifier with k=5 and cosine distance
knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')

# train the KNN classifier on the training data
knn.fit(Xt, yt)

# predict labels for test data
yv_hat = knn.predict(Xv)

# calculate micro- and macro-F1 scores
micro_f1 = f1_score(yv, yv_hat, average='micro')
macro_f1 = f1_score(yv, yv_hat, average='macro')

print("Micro-F1:", micro_f1)
print("Macro-F1:", macro_f1)


Micro-F1: 0.90248716936439
Macro-F1: 0.9021785029974404


### Q12

When test documents are compared against both train and test data, the evaluation is no longer trustworthy. The test documents become part of the neighbor pool, so each one can match itself (or near-duplicates from the test set) and be labeled with its own true label. This is data leakage: the scores become overly optimistic and say little about how the model generalizes to new, unseen data. Keeping the training and testing data separate ensures a more accurate and unbiased assessment of the model's ability to classify new documents.
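
The effect of letting test documents into the neighbor pool can be shown on synthetic data. In the sketch below the labels are random, so no classifier can genuinely do better than chance, yet a 1-nearest-neighbor model that was also fitted on the test points scores perfectly because every test point retrieves itself. All names and data here are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical features with purely random binary labels
X_all = rng.normal(size=(300, 10))
y_all = rng.integers(0, 2, size=300)
X_tr, y_tr = X_all[:200], y_all[:200]
X_te, y_te = X_all[200:], y_all[200:]

# Honest setup: neighbors come from the training split only
honest_knn = KNeighborsClassifier(n_neighbors=1, metric="cosine").fit(X_tr, y_tr)

# Leaky setup: the test points are part of the neighbor pool
leaky_knn = KNeighborsClassifier(n_neighbors=1, metric="cosine").fit(X_all, y_all)

honest_acc = honest_knn.score(X_te, y_te)
leaky_acc = leaky_knn.score(X_te, y_te)
print("Honest accuracy:", honest_acc)
print("Leaky accuracy:", leaky_acc)
```

With k=5, as used in Q11, the leak is diluted but still present, since one of the five votes always comes from the test document itself.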

### Q13

In [65]:
# print the classification report
print(classification_report(yv, yv_hat))

              precision    recall  f1-score   support

           1       0.90      0.90      0.90      1900
           2       0.94      0.97      0.96      1900
           3       0.87      0.87      0.87      1899
           4       0.89      0.87      0.88      1900

    accuracy                           0.90      7599
   macro avg       0.90      0.90      0.90      7599
weighted avg       0.90      0.90      0.90      7599



In [64]:
# identify mis_classified documents
mis_class = pd.DataFrame({"document": news_test.document, "actual": yv, "predicted": yv_hat})
mis_class = mis_class[mis_class.actual != mis_class.predicted]
sample_mis_class = mis_class.sample(3, random_state = 498)

for index, row in sample_mis_class.iterrows():
    print("Actual Label: " + str(row.actual))
    print("Predicted Label: " + str(row.predicted))
    print(row.document)
    print()

Actual Label: 4
Predicted Label: 1
The Shockwaves of Sumatra The Indian Ocean earthquake of December 2004 produced     a shockwave that created tsunamis all across the Indian Ocean. The tsunamis hammered nearby Indonesia and struck as far as     the coast of East Africa. The death toll has climbed over 100,000 and continues to grow.    It also created social shockwaves.  

Actual Label: 2
Predicted Label: 4
Big Game Hunting Virginia, Navy and Maryland face season-defining games, perhaps &lt;em&gt;program-defining &lt;/em&gt;games for the Cavaliers and Midshipmen as they play against Florida State and Notre Dame, respectively on Saturday.

Actual Label: 1
Predicted Label: 3
Vietnam Opens Bunker Used by Ho Chi Minh (AP) AP - Behind thick concrete walls and iron doors, Ho Chi Minh and other top Vietnamese leaders hid in secret underground tunnels during U.S. B-52 bombing raids to plot key military strategies that led to America's defeat in the Vietnam War.



1. Document 1: Actual Label - Sci/Tech, Predicted Label - World

This document primarily focuses on the social impact of a natural disaster. While it contains scientific elements related to earthquakes and tsunamis, most of the text describes the affected regions, the death toll, and the societal effects. Given that emphasis, it's reasonable to categorize it as "World" rather than "Sci/Tech."

2. Document 2: Actual Label - Sports, Predicted Label - Sci/Tech

This document mentions games involving several locations. The misclassification could be due to the mention of "games" and "program-defining." However, it's more likely that these terms relate to sports rather than technology or science. This misclassification is less justified, and it would be more appropriate to categorize it as "Sports."

3. Document 3: Actual Label - World, Predicted Label - Business

This document discusses a bunker used by Ho Chi Minh and other Vietnamese leaders during the Vietnam War. While there's a mention of B-52 bombing raids and military strategies, it primarily pertains to historical and political events. The misclassification as "Business" is less justifiable and likely due to a lack of clear context. The document is more appropriately categorized as "World."

In summary, documents 2 and 3 appear to have been truly misclassified, while the prediction for document 1 is defensible.

### Q14

Logistic Regression is simple, efficient with large datasets, and interpretable. It's suitable for linearly separable data but may not perform well when the class boundary is non-linear. KNN is flexible and captures non-linear patterns, but it can be computationally expensive at prediction time and is sensitive to feature scaling. The choice between the two depends on the data and the requirements. Normally, Logistic Regression is a good fit when interpretability and large datasets matter, while KNN suits complex non-linear data.