## **Assignment 3 - Modeling**
## CS585 - NLP
## Oleksandr Shashkov

### The goal of this final phase of the project is to build a text categorization model on your primary dataset, and to evaluate it on both your primary and your secondary dataset.  Follow the steps below.

**1. Data partitioning:** Create a Training set for your model by randomly selecting 70% of the texts in your PRIMARY dataset.  Use the remaining 30% of texts from the PRIMARY dataset as your Test (PRIMARY) set.  Designate 100% of your SECONDARY dataset as the Test (SECONDARY) dataset.  So you should have one Training set (drawn from the PRIMARY data), and two different Test sets (one from PRIMARY and one from SECONDARY).

In [1]:
import re
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


In [2]:
primary_set_path = "/content/drive/MyDrive/Education/CS585/HW3/PRIMARY/nyt_topic_vaccination.csv"
secondary_set_path = "/content/drive/MyDrive/Education/CS585/HW3/SECONDARY/twitter_topic_vaccination.csv"
# load primary dataset
primary_dataset = pd.read_csv(primary_set_path)
# shuffle the dataset (fixed seed for reproducibility)
primary_dataset = primary_dataset.sample(frac=1,random_state=55).reset_index(drop=True)
# load secondary dataset
secondary_dataset = pd.read_csv(secondary_set_path)
# quickly explore datasets:
print('Primary dataset shape: ' + str(primary_dataset.shape))
print(primary_dataset.head(5))
print('\n')
print('Secondary dataset shape: ' + str(secondary_dataset.shape))
print(secondary_dataset.head(5))

Primary dataset shape: (813, 2)
                                                text  label
0  I would be much more likely to patronize a bus...   True
1  Show me the real research that supports this. ...  False
2  One of the commenters said it best.   Business...  False
3  Start requiring antibody titers on all employe...   True
4  The Nashville picture says it all. Unmasked pe...  False


Secondary dataset shape: (900, 2)
                                                text  label
0  Putin After Announcing #CovidVaccine #Russian ...   True
1  Courtesy: WA! #WhatsApp #COVID #CovidVaccine h...   True
2  4 of the vaccines Jared bought are expected to...   True
3  One day you will realize CDC Guidelines magica...  False
4  Im far from lying.  Current CDC guidelines is ...   True


In [3]:
train_fraction = 0.7  # share of the primary dataset used for training
num_of_train_samples = int(primary_dataset.shape[0] * train_fraction)
# primary dataset:
x_primary = primary_dataset.text
y_primary = primary_dataset.label.astype(int)
# positional slices: first 70% for training, remaining 30% for testing
x_prim_train = x_primary.iloc[:num_of_train_samples]
y_prim_train = y_primary.iloc[:num_of_train_samples]
x_prim_test = x_primary.iloc[num_of_train_samples:]
y_prim_test = y_primary.iloc[num_of_train_samples:]
# secondary dataset:
x_sec_test = secondary_dataset.text
y_sec_test = secondary_dataset.label.astype(int)
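
The same 70/30 partition can also be produced with scikit-learn's `train_test_split`. The sketch below is only an illustration on a hypothetical toy DataFrame (the rows and the name `toy` are assumptions, not the real data); `stratify` is an optional extra that keeps the class ratio equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the primary dataset (hypothetical rows, not the real data).
toy = pd.DataFrame({
    "text": [f"comment {i}" for i in range(10)],
    "label": [True, False] * 5,
})

# 70/30 split; stratify keeps the class ratio the same in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    toy["text"],
    toy["label"].astype(int),
    train_size=0.7,
    random_state=55,
    stratify=toy["label"],
)

print(len(x_train), len(x_test))
```

Unlike manual slicing, this shuffles and splits in one call, so the separate `sample(frac=1)` step would not be needed.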

**2. Baseline model training:** Train a simple bag-of-words classifier on your Training dataset.  If your data comes from the stance task, you will build a multiclass model (one which can assign one of three labels - pro-mitigation, anti-mitigation, or unclear). **If your data comes from the topic task, choose only one of the topics (masking and distancing, lockdowns, vaccination) to model as a binary classification task.  (You should avoid topics with low numbers of positive examples.)**  An example of how to use scikit-learn to build a simple text categorization model is [here](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).

In [4]:
# vectorize and normalize the data
vectorizer = TfidfVectorizer(stop_words="english",lowercase=True).fit(x_prim_train)
print("Vocabulary size: {}\n".format(len(vectorizer.vocabulary_)))
x_prim_train_transformed = vectorizer.transform(x_prim_train)
x_prim_test_transformed = vectorizer.transform(x_prim_test)
x_sec_test_transformed = vectorizer.transform(x_sec_test)
#build and train the model on training data from primary dataset
lr_model = LogisticRegression().fit(x_prim_train_transformed, y_prim_train)

Vocabulary size: 4917
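
One possible way to keep the vectorizer and the classifier together is a scikit-learn pipeline, which guarantees that test texts go through exactly the transformation fitted on the training texts. This is a minimal sketch on hypothetical toy sentences (`toy_texts`, `toy_labels`, and `baseline` are illustrative names), not the notebook's actual model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy texts and labels, used only to show the mechanics.
toy_texts = [
    "the vaccine rollout starts next week",
    "vaccination rates are rising again",
    "restaurants reopened downtown today",
    "the game was postponed due to rain",
]
toy_labels = [1, 1, 0, 0]

# Vectorizer and classifier are fitted together on the training texts.
baseline = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LogisticRegression(),
)
baseline.fit(toy_texts, toy_labels)

# Raw strings go in; the pipeline applies the fitted vectorizer itself.
predictions = baseline.predict(["a new vaccine was approved"])
print(predictions)
```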



**3. Model evaluation 1:** Calculate your baseline model's accuracy for your model's predictions on the Test (PRIMARY) set, and on the Test (SECONDARY) set.  Enter these values in the answer boxes provided.

In [5]:
# baseline performance on primary dataset
baseline_accuracy_prim = lr_model.score(x_prim_test_transformed,y_prim_test)
print("Baseline accuracy of the model on primary dataset: {:.2%}\n".format(baseline_accuracy_prim))
print(classification_report(y_prim_test, lr_model.predict(x_prim_test_transformed)))
# baseline performance on secondary dataset
baseline_accuracy_sec = lr_model.score(x_sec_test_transformed,y_sec_test)
print("Baseline accuracy of the model on secondary dataset: {:.2%}\n".format(baseline_accuracy_sec))
print(classification_report(y_sec_test, lr_model.predict(x_sec_test_transformed)))

Baseline accuracy of the model on primary dataset: 80.74%

              precision    recall  f1-score   support

           0       0.93      0.56      0.70        98
           1       0.77      0.97      0.86       146

    accuracy                           0.81       244
   macro avg       0.85      0.77      0.78       244
weighted avg       0.83      0.81      0.79       244

Baseline accuracy of the model on secondary dataset: 72.44%

              precision    recall  f1-score   support

           0       0.66      0.70      0.68       375
           1       0.78      0.74      0.76       525

    accuracy                           0.72       900
   macro avg       0.72      0.72      0.72       900
weighted avg       0.73      0.72      0.73       900



Note: 
>  On the primary test data, class 0 has low recall (0.56) and class 1 has the lower precision (0.77): the model over-predicts class 1.  
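
To see where a low class-0 recall comes from, a confusion matrix can help. The sketch below uses small made-up label vectors (`y_true` and `y_pred` are assumptions for illustration, not the notebook's predictions) and shows how recall for class 0 and precision for class 1 both depend on the same off-diagonal cell.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels and predictions, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Class-0 texts predicted as 1 (fp) lower class-0 recall
# and class-1 precision at the same time.
recall_class0 = recall_score(y_true, y_pred, pos_label=0)
precision_class1 = tp / (tp + fp)

print(cm)
print(recall_class0, precision_class1)
```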

**4. Feature engineering:** In order to try to improve your model, think about what features of the text might be associated with the category you are trying to predict.  What attributes of a text besides the presence of individual words might be good predictors (for example, regular expression patterns or specific word sequences)?  Create at least three new features that represent attributes of the text.  Add them to your model and retrain.  An example of how to add a set of features (defined as a vector of 1/0 values indicating whether the attribute is present or absent for a given text) is shown [here](https://gist.github.com/DerrickHiggins/20c77745b080e3d493231424d7da9a2f).

Let's retrieve the misclassified examples to see how the model could be helped to classify them correctly.

In [6]:
y_test = np.asarray(y_prim_test)
y_hat_prim_test = lr_model.predict(x_prim_test_transformed)
misclassified_indx = np.where(y_test != y_hat_prim_test)
misclassified_data = pd.DataFrame()
misclassified_indx_corrected = misclassified_indx[0]+x_prim_test.first_valid_index()
misclassified_data['text'] = x_prim_test[misclassified_indx_corrected]
misclassified_data['label'] = y_prim_test[misclassified_indx_corrected]
misclassified_data['prediction'] = y_hat_prim_test[misclassified_indx]
misclassified_data.reset_index(drop=True, inplace=True)
misclassified_data.to_csv("/content/drive/MyDrive/Education/CS585/HW3/misclassified_primary_data.csv",index=False)
print(misclassified_data.head())

y_test = np.asarray(y_sec_test)
y_hat_sec_test = lr_model.predict(x_sec_test_transformed)
misclassified_indx = np.where(y_test != y_hat_sec_test)
misclassified_data = pd.DataFrame()
misclassified_data['text'] = x_sec_test[misclassified_indx[0]]
misclassified_data['label'] = y_sec_test[misclassified_indx[0]]
misclassified_data['prediction'] = y_hat_sec_test[misclassified_indx[0]]
misclassified_data.reset_index(drop=True, inplace=True)
misclassified_data.to_csv("/content/drive/MyDrive/Education/CS585/HW3/misclassified_secondary_data.csv",index=False)
print(misclassified_data.head())

                                                text  label  prediction
0  Closing things because of Covid-19 has hurt a ...      0           1
1  Why is this sliding into 2nd amendment kind of...      0           1
2  I’m fully vaxed and gladly don my very comfort...      0           1
3  Expose the antivaxxer crowd as the disruptive ...      0           1
4  We have not eaten at a restaurant since early ...      1           0
                                                text  label  prediction
0  Putin After Announcing #CovidVaccine #Russian ...      1           0
1  4 of the vaccines Jared bought are expected to...      1           0
2  I don’t see masks. In fact, my child was there...      0           1
3  will the sense touch 40k after a #CovidVaccine...      1           0
4  These patients are slowly being tortured and t...      0           1
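
Collecting misclassified rows can also be done with a boolean mask, which avoids adjusting for the index offset of the test slice. This is a hedged sketch on toy Series (`texts`, `labels`, and `preds` are hypothetical stand-ins whose index deliberately does not start at 0).

```python
import numpy as np
import pandas as pd

# Toy stand-ins; the index starts at 10 to mimic a sliced test set.
texts = pd.Series(["a", "b", "c", "d"], index=[10, 11, 12, 13])
labels = pd.Series([1, 0, 1, 0], index=[10, 11, 12, 13])
preds = np.array([1, 1, 0, 0])

# True where the prediction disagrees with the label (positional).
mask = labels.to_numpy() != preds

misclassified = pd.DataFrame({
    "text": texts[mask].to_numpy(),
    "label": labels[mask].to_numpy(),
    "prediction": preds[mask],
})
print(misclassified)
```

Because the mask is positional, the same code works for both the primary test slice and the secondary dataset.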


We will use the following enhancement features:
1.   The presence of a hashtag symbol (`#`)
2.   The presence of both 'anti' and 'vaccine' (matched as substrings)
3.   The presence of the sequence 'vax'
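
Whether a feature tests for a substring or for a whole word changes what it fires on. The sketch below contrasts the two on a made-up sentence; `has_substring` and `has_word` are hypothetical helpers written for this illustration, and the word-boundary variant is an alternative, not what the notebook's cell does.

```python
import re

def has_substring(text: str, piece: str) -> int:
    """1 if `piece` occurs anywhere in `text` (case-sensitive)."""
    return int(piece in text)

def has_word(text: str, word: str) -> int:
    """1 if `word` occurs as a whole word in `text` (case-insensitive)."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return int(re.search(pattern, text, flags=re.IGNORECASE) is not None)

sample = "We anticipate a new Vaccine soon."

# The substring test fires on "anticipate"; the whole-word test does not.
print(has_substring(sample, "anti"), has_word(sample, "anti"))
# The case-sensitive substring test misses "Vaccine"; the regex finds it.
print(has_substring(sample, "vaccine"), has_word(sample, "vaccine"))
```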



In [7]:
# Feature 1: does the text contain a hashtag symbol?
pat = re.compile(r"#")
def detect_hashtag(text):
  return int(pat.search(text) is not None)

# helper for whole-word matching (not used for the features below)
def contains_word(text, word):
  return (' ' + word + ' ') in (' ' + text + ' ')

# Feature 2: are both strings present? (substring match, case-sensitive)
def word1_and_word2(text,word1,word2):
  return int((word1 in text) and (word2 in text))

# Feature 3: does the text contain the character sequence?
def contains_sequence(text, sequence):
  return int(sequence in text)

hash_tags_present_array_prim_train = x_prim_train.apply(detect_hashtag)
hash_tags_present_array_prim_test = x_prim_test.apply(detect_hashtag)
hash_tags_present_array_sec_test = x_sec_test.apply(detect_hashtag)

w1 = 'anti'
w2 = 'vaccine'
w1_and_w2_prim_train = x_prim_train.apply(word1_and_word2,args=(w1,w2))
w1_and_w2_prim_test = x_prim_test.apply(word1_and_word2,args=(w1,w2))
w1_and_w2_sec_test = x_sec_test.apply(word1_and_word2,args=(w1,w2))

seq1 = 'vax'
seq1_prim_train = x_prim_train.apply(contains_sequence,args=(seq1,))
seq1_prim_test = x_prim_test.apply(contains_sequence,args=(seq1,))
seq_sec_test = x_sec_test.apply(contains_sequence,args=(seq1,))

# Add new features to the document representation:
# train dataset
x_prim_train_final = np.asarray(np.insert(x_prim_train_transformed.todense(), x_prim_train_transformed.shape[-1], 
  (hash_tags_present_array_prim_train,
   w1_and_w2_prim_train,
   seq1_prim_train
   ), axis=1))
# primary test dataset
x_prim_test_final = np.asarray(np.insert(x_prim_test_transformed.todense(), x_prim_test_transformed.shape[-1], 
  (hash_tags_present_array_prim_test,
   w1_and_w2_prim_test,
   seq1_prim_test
   ), axis=1))
# secondary test dataset
x_sec_test_final = np.asarray(np.insert(x_sec_test_transformed.todense(), x_sec_test_transformed.shape[-1], 
  (hash_tags_present_array_sec_test,
   w1_and_w2_sec_test,
   seq_sec_test
   ), axis=1))
# train new model
lr_model_enhanced = LogisticRegression().fit(x_prim_train_final, y_prim_train)
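
Converting the TF-IDF matrix to a dense array works at this dataset size, but the extra columns can also be appended while staying sparse. The following is a sketch of that alternative using `scipy.sparse.hstack` on a tiny made-up matrix (`tfidf_like`, `has_hashtag`, and `has_vax` are illustrative names); `LogisticRegression` accepts the resulting sparse matrix directly.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Tiny stand-in for a TF-IDF matrix: 2 documents, 3 vocabulary terms.
tfidf_like = csr_matrix(np.array([[0.0, 0.5, 0.0],
                                  [0.7, 0.0, 0.1]]))

# Made-up 1/0 indicator features, one value per document.
has_hashtag = np.array([1, 0])
has_vax = np.array([0, 1])

# Stack the indicators as columns, then append them on the right.
extra = csr_matrix(np.column_stack([has_hashtag, has_vax]))
combined = hstack([tfidf_like, extra], format="csr")

print(combined.shape)
print(combined.toarray())
```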

**5. Model evaluation 2:** Calculate overall model accuracy for your new model's predictions on the Test (PRIMARY) set, and on the Test (SECONDARY) set.  Enter these values in the answer boxes provided.

In [8]:
# enhanced model performance on primary dataset
enhanced_accuracy_prim = lr_model_enhanced.score(x_prim_test_final,y_prim_test)
print("Enhanced model accuracy on primary dataset: {:.2%}\n".format(enhanced_accuracy_prim))
print(classification_report(y_prim_test, lr_model_enhanced.predict(x_prim_test_final)))
# enhanced model performance on secondary dataset
enhanced_accuracy_sec = lr_model_enhanced.score(x_sec_test_final,y_sec_test)
print("Enhanced model accuracy on secondary dataset: {:.2%}\n".format(enhanced_accuracy_sec))
print(classification_report(y_sec_test, lr_model_enhanced.predict(x_sec_test_final)))

Enhanced model accuracy on primary dataset: 84.84%

              precision    recall  f1-score   support

           0       0.92      0.68      0.78        98
           1       0.82      0.96      0.88       146

    accuracy                           0.85       244
   macro avg       0.87      0.82      0.83       244
weighted avg       0.86      0.85      0.84       244

Enhanced model accuracy on secondary dataset: 76.00%

              precision    recall  f1-score   support

           0       0.68      0.81      0.74       375
           1       0.84      0.73      0.78       525

    accuracy                           0.76       900
   macro avg       0.76      0.77      0.76       900
weighted avg       0.77      0.76      0.76       900
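
To compare the two evaluations side by side, the accuracy change can be expressed both in percentage points and as a relative change. The sketch below uses the rounded accuracies printed above; `accuracy_gain` is a hypothetical helper introduced for this illustration.

```python
def accuracy_gain(before: float, after: float) -> tuple[float, float]:
    """Return (gain in percentage points, relative gain in percent)."""
    gain_pp = (after - before) * 100
    relative_pct = (after - before) / before * 100
    return gain_pp, relative_pct

# Rounded accuracies as printed by the evaluation cells.
baseline_acc = {"primary": 0.8074, "secondary": 0.7244}
enhanced_acc = {"primary": 0.8484, "secondary": 0.7600}

gains = {
    name: accuracy_gain(baseline_acc[name], enhanced_acc[name])
    for name in baseline_acc
}

for name, (gain_pp, relative_pct) in gains.items():
    print(f"{name}: +{gain_pp:.2f} percentage points "
          f"({relative_pct:.2f}% relative)")
```

The difference between two accuracies is a number of percentage points; the relative change is a different, slightly larger figure.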



**6. Reflection:** Answer the questions on model performance

---
QUESTION 1
What is your baseline model's overall accuracy for the Test (PRIMARY) and Test (SECONDARY) data sets?
> Answer:  
> Baseline accuracy of the model on primary dataset: 80.74%  
> Baseline accuracy of the model on secondary dataset: 72.44%  
---
QUESTION 2
What is your enhanced (with new engineered features) model's overall accuracy for the Test (PRIMARY) and Test (SECONDARY) data sets?
> Answer:  
> Enhanced model accuracy on primary dataset: 84.84%  
> Enhanced model accuracy on secondary dataset: 76.00%
---
QUESTION 3
What new features did you add to your model?
> Answer:  
> 1.   The presence of a hashtag symbol (`#`)  
> 2.   The presence of both words 'anti' and 'vaccine'  
> 3.   The presence of the sequence 'vax'
---
QUESTION 4
Did your new features improve model performance?  Why do you think they did (or did not)?
> Answer:  
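> Yes. The new features improved the overall accuracy of the model by 4.10 percentage points on the primary test dataset (80.74% to 84.84%) and by 3.56 percentage points on the secondary dataset (72.44% to 76.00%). Most of the gain comes from class 0, whose recall rose from 0.56 to 0.68 on the primary test set and from 0.70 to 0.81 on the secondary set. A likely reason is that these indicators add signal the bag-of-words representation does not carry directly: the default tokenizer drops the `#` symbol, and the 'vax' sequence groups variants such as "vaxed" and "antivaxxer" that are otherwise separate, rare tokens.  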
---
QUESTION 5
Did your model have better accuracy on the Test (PRIMARY) or Test (SECONDARY) data set?  Why?
> Answer:  
> The model has better accuracy on the Test (PRIMARY) dataset. This is likely because the model was trained on texts sampled from the PRIMARY dataset, so the training texts are similar in style and features to the primary test set. Even though the SECONDARY dataset covers the same topic, it comes from a conceptually different source (Twitter rather than NYT), which differs significantly from what the model was trained on. For example, the PRIMARY dataset had very few hashtags, while the SECONDARY dataset had many. The model therefore did not see enough comparable examples during training to process the SECONDARY dataset with the same accuracy.
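
The hashtag difference between the two sources can be quantified with a simple rate. This is a sketch on made-up example texts (`news_like` and `tweet_like` are hypothetical lists, not the real datasets); `hashtag_rate` is an illustrative helper.

```python
import re

HASHTAG = re.compile(r"#\w+")

def hashtag_rate(texts):
    """Fraction of texts that contain at least one hashtag."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(1 for t in texts if HASHTAG.search(t)) / len(texts)

# Made-up examples in the style of each source.
news_like = ["Show me the research.", "Businesses should require proof."]
tweet_like = ["Got my shot #CovidVaccine", "Stay safe #COVID", "No tags here"]

print(hashtag_rate(news_like), hashtag_rate(tweet_like))
```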
---
QUESTION 6
Please attach the zip file with your modeling code (as a Jupyter notebook) and any supporting files needed to run it.
> Answer:  
> See submission

**7. Code submission:** Create a zip file including your code as a Jupyter notebook and any necessary supporting files.  Submit the file as requested within the assignment.