# Data Preprocessing and Analysis

This pipeline preprocesses and analyzes the data file generated in part 1 of the Scotia AICP Project, found [here](https://colab.research.google.com/drive/1Y8OWYB6i9uTX8QUqICe3HS1aRBKEOtC-?usp=sharing). Our goal is to prepare the data for NLP and predictive ML. The columns of the provided dataframe are 'Article', 'Date_Published', 'Source', and 'GPT_Topic_Classification'.

This pipeline automates the collection and classification of textual data, reducing the manual effort typically involved in such tasks. Our integrated dataframe enables quick retrieval and cross-referencing of articles from various sources, providing a bird's-eye view of the discourse in different media outlets.

## Document Etiquette


When interacting with APIs within our Google Colab file, it's crucial to adhere to a code of conduct that ensures we do not overwhelm the API servers, use up our free API keys or exhaust our request limits. Avoid using the "Run All" feature and test your code using individual cells. Only run cells necessary for your current task.
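
One way to avoid overwhelming an API is to space out consecutive calls. The following is a minimal sketch, not part of the original notebook: `fake_request` is a hypothetical stand-in for a real API call, and the interval is an arbitrary value chosen for illustration.

```python
import time

def throttled(func, min_interval_s=1.0):
    """Wrap `func` so consecutive calls are at least `min_interval_s` apart."""
    last_call = [None]  # mutable cell holding the time of the previous call

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_interval_s - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper

# Hypothetical stand-in for a real API request.
def fake_request(query):
    return f"result for {query}"

polite_request = throttled(fake_request, min_interval_s=0.05)
```

Calling `polite_request` in a loop then pauses between requests instead of sending them back to back.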

## Saving Tips in Colab Shared Documents

For those new to working in a shared Google Colab document, it's important to understand how Colab edits and saves. Colab provides real-time collaboration, but when two people edit the same cell at the same time, a conflict can occur. To manage this, Colab shows a message prompting you to choose between the changes. Click the message and review both versions so that you do not overwrite an important change, then confirm that your own change has been saved.

Always keep an eye on the top of the document for the 'All changes saved in Drive' message to appear after you've made edits. This message confirms that your work has been successfully saved to Google Drive. If you're about to leave the Colab file or take a break, manually trigger a save and wait for this confirmation message to ensure that none of your work is lost.  

In [None]:
# install the libraries first so that the imports below succeed
!pip install transformers datasets evaluate accelerate

In [None]:
import pandas as pd
import numpy as np
import nltk
import evaluate
from datasets import Dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer



# CSV Instructions

Please use the CSV created in part 1 of the data pipeline, with the New York Times articles manually added. After adding the relevant information for the New York Times articles, delete their title and URL columns from the spreadsheet. Be sure to change the spreadsheet name in the line of code below to the one you are using!

*Please note: this pipeline expects a clean spreadsheet with the NYT amendments discussed above. Using a spreadsheet without the required changes from step 1 may produce unpredictable results.*

In [None]:
df = pd.read_csv("/content/pipeline_2024-01-15_with_nyt.csv")
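
Since the pipeline expects a specific set of columns, it can help to check the header before going further. This is a minimal sketch under assumptions: the helper name `missing_columns` is hypothetical, and the "Title" and "URL" names in the toy header are placeholders for whatever the leftover columns are actually called.

```python
EXPECTED_COLUMNS = ["Article", "Date_Published", "Source", "GPT_Topic_Classification"]

def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in expected if name not in present]

# Toy header standing in for `df.columns`; "Title" and "URL" are hypothetical leftovers.
example_header = ["Article", "Date_Published", "Title", "URL", "Source"]
print(missing_columns(example_header))
```

In the notebook, the same helper could be called as `missing_columns(df.columns)`; an empty list would mean all expected columns are present.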

## Global Label Variable
The variable below is a global variable that names the column holding the labels used to train and test the model. The default is currently `GPT_Topic_Classification`, but I recommend switching to human-generated labels at a later step for more accurate results.

To change the global variable, simply replace it with the name of the column containing the labels you would like to train on.

In [None]:
label_source = 'GPT_Topic_Classification'

In [None]:
df.head()

Unnamed: 0,Article,Date_Published,Source,GPT_Topic_Classification
0,The Field Museum in Chicago has covered up sev...,2024-01-11,NYT,miscellaneous
1,The Justice Department charged eBay on Thursda...,2024-01-11,NYT,Miscellaneous
2,The Federal Aviation Administration said on Th...,2024-01-11,NYT,Miscellaneous
3,One of the most hopeful proposals involving po...,2024-01-02,NYT,AI and Machine Learning
4,When Microsoft opened an advanced research lab...,2024-01-10,NYT,AI and Machine Learning


In [None]:
df.describe()

Unnamed: 0,Article,Date_Published,Source,GPT_Topic_Classification
count,349,349,349,349
unique,172,140,16,13
top,The greater of the mortgage contract rate\nplu...,2024-01-02,OSFI,miscellaneous
freq,10,193,192,226


In [None]:
# remove duplicates
df.drop_duplicates(inplace=True)

In [None]:
# classified topics
df['GPT_Topic_Classification'].unique()

array(['miscellaneous', 'Miscellaneous', 'AI and Machine Learning',
       'Record Keeping', 'Generative AI', 'Regulatory Compliance',
       'Cryptocurrency', 'Cybersecurity', 'Fraud Prevention',
       'Risk Assessment', 'International Cooperation',
       'Audit and Assurance', 'Continuous Improvement'], dtype=object)
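
The output above lists 'miscellaneous' and 'Miscellaneous' as separate topics even though they differ only in casing. A minimal sketch of one way to merge them follows; the helper name `normalize_label` and the choice of 'Miscellaneous' as the canonical spelling are assumptions, not part of the original notebook.

```python
def normalize_label(label, canonical="Miscellaneous"):
    """Map any casing of 'miscellaneous' to one canonical spelling."""
    cleaned = label.strip()
    return canonical if cleaned.lower() == canonical.lower() else cleaned

# Toy labels taken from the values shown above.
labels = ["miscellaneous", "Miscellaneous", "Generative AI"]
normalized = [normalize_label(x) for x in labels]
print(normalized)
```

On the dataframe, this could be applied with `df['GPT_Topic_Classification'].apply(normalize_label)`.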

In [None]:
# sources
df['Source'].unique()

array(['NYT', 'BBC News', 'Engadget', 'Wired', 'Gizmodo.com',
       'Ars Technica', 'Android Central', 'CNET', 'NPR', 'Slashdot.org',
       'Business Insider', 'Yahoo Entertainment', 'The Verge',
       'Financial Times', 'Investing.com', 'OSFI'], dtype=object)

In [None]:
# convert to lowercase
df['Article'] = df['Article'].str.lower()

In [None]:
df.head()

Unnamed: 0,Article,Date_Published,Source,GPT_Topic_Classification
0,the field museum in chicago has covered up sev...,2024-01-11,NYT,miscellaneous
1,the justice department charged ebay on thursda...,2024-01-11,NYT,Miscellaneous
2,the federal aviation administration said on th...,2024-01-11,NYT,Miscellaneous
3,one of the most hopeful proposals involving po...,2024-01-02,NYT,AI and Machine Learning
4,when microsoft opened an advanced research lab...,2024-01-10,NYT,AI and Machine Learning


In [None]:
# inspect a sample article: characters like newline still need to be stripped
df['Article'][25]

'ngun hình nh, getty images\nchp li hình nh, ai cp, quc gia ông dân nht ti bc phi, cng s gia nhp khi brics\nkhi brics s có thêm nm quc gia thành viên mi gia nhp, t châu phi và trung ông.\nt chc này mu… [+3953 chars]'
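
The sample above still contains newline characters and ends with a truncation marker (`… [+3953 chars]`). The following is a minimal sketch of a cleaning step, assuming the marker always has that shape; the helper name `clean_article` is hypothetical and the regular expression is based only on this one sample.

```python
import re

# Matches a trailing marker such as "… [+3953 chars]" (assumed format, based on the sample above).
TRUNCATION_MARKER = re.compile(r"\s*…?\s*\[\+\d+ chars\]\s*$")

def clean_article(text):
    """Drop the trailing truncation marker and collapse whitespace, including newlines."""
    text = TRUNCATION_MARKER.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

# Toy string, not taken from the dataset.
sample = "first line\nsecond line… [+3953 chars]"
print(clean_article(sample))
```

It could be applied to the whole column with `df['Article'].apply(clean_article)` before tokenization.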

In [None]:
# POS tagging
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_tag_article(article):
    """Tokenize an article and return its (token, POS tag) pairs."""
    tokens = word_tokenize(article)
    return pos_tag(tokens)

# Apply POS tagging to the (already lowercased) articles in the DataFrame
df['POS_Tags'] = df['Article'].apply(pos_tag_article)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
# stop word removal - use nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

nltk.download('stopwords')
nltk.download('punkt')

# Build the stop-word set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    """Tokenize the text and drop English stop words."""
    tokens = word_tokenize(text)
    return [word for word in tokens if word.lower() not in stop_words]

df['Filtered_Article'] = df['Article'].apply(lambda x: ' '.join(remove_stop_words(x)))

df['Filtered_Article']

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


0      field museum chicago covered several display c...
1      justice department charged ebay thursday stalk...
2      federal aviation administration said thursday ...
3      one hopeful proposals involving police surveil...
4      microsoft opened advanced research lab beijing...
                             ...                        
188                                           december 8
190                                          december 15
191          minimum qualifying rate uninsured mortgages
192    # scam alert : careful phishing emails appear ...
262    “ risk environment operate become volatile , c...
Name: Filtered_Article, Length: 179, dtype: object

# NLP Modelling

We will be using [DistilBERT for text classification](https://huggingface.co/docs/transformers/tasks/sequence_classification) to categorize topics.

The potential topics are 'Due Diligence', 'Regulatory Compliance', 'Beneficial Ownership', 'AML Technology', 'AI and Machine Learning', 'Generative AI', 'Program Effectiveness', 'International Cooperation', 'Staff Training', 'Risk Assessment', 'Transaction Monitoring', 'Reporting Obligations', 'Record Keeping', 'Customer Identification', 'PEP Screening', 'Sanctions Compliance', 'Fraud Prevention', 'Cybersecurity', 'Third-Party Management', 'Audit and Assurance', 'Ethics and Governance' and/or 'Continuous Improvement'.

We are interested in adding a 'Not Relevant'/'Miscellaneous' topic in the future, but have struggled to implement it effectively: it quickly becomes the majority category, even for articles where it should not apply.
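
To see how dominant a single category is, it can be useful to look at each label's share of the data. The sketch below is an illustration on toy labels only, not the project data, and the helper name `label_shares` is hypothetical.

```python
from collections import Counter

def label_shares(labels):
    """Return each label's share of the total, largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, count / total) for label, count in counts.most_common()]

# Toy labels, not the project data.
toy = ["Miscellaneous"] * 6 + ["Cybersecurity"] * 3 + ["Generative AI"]
for label, share in label_shares(toy):
    print(f"{label}: {share:.0%}")
```

In the notebook, the label column could be passed in directly, for example `label_shares(df['GPT_Topic_Classification'])`.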

In [None]:
topics = ['Due Diligence', 'Regulatory Compliance', 'Beneficial Ownership', 'AML Technology',
          'AI and Machine Learning', 'Generative AI', 'Program Effectiveness', 'International Cooperation',
          'Staff Training', 'Risk Assessment', 'Transaction Monitoring', 'Reporting Obligations',
          'Record Keeping', 'Customer Identification', 'PEP Screening', 'Sanctions Compliance',
          'Fraud Prevention', 'Cybersecurity', 'Third-Party Management', 'Audit and Assurance',
          'Ethics and Governance', 'Continuous Improvement']


In [None]:
label2id = {label: idx for idx, label in enumerate(topics)}
id2label = {idx: label for label, idx in label2id.items()}

In [None]:
label2id

{'Due Diligence': 0,
 'Regulatory Compliance': 1,
 'Beneficial Ownership': 2,
 'AML Technology': 3,
 'AI and Machine Learning': 4,
 'Generative AI': 5,
 'Program Effectiveness': 6,
 'International Cooperation': 7,
 'Staff Training': 8,
 'Risk Assessment': 9,
 'Transaction Monitoring': 10,
 'Reporting Obligations': 11,
 'Record Keeping': 12,
 'Customer Identification': 13,
 'PEP Screening': 14,
 'Sanctions Compliance': 15,
 'Fraud Prevention': 16,
 'Cybersecurity': 17,
 'Third-Party Management': 18,
 'Audit and Assurance': 19,
 'Ethics and Governance': 20,
 'Continuous Improvement': 21}

In [None]:
id2label

{0: 'Due Diligence',
 1: 'Regulatory Compliance',
 2: 'Beneficial Ownership',
 3: 'AML Technology',
 4: 'AI and Machine Learning',
 5: 'Generative AI',
 6: 'Program Effectiveness',
 7: 'International Cooperation',
 8: 'Staff Training',
 9: 'Risk Assessment',
 10: 'Transaction Monitoring',
 11: 'Reporting Obligations',
 12: 'Record Keeping',
 13: 'Customer Identification',
 14: 'PEP Screening',
 15: 'Sanctions Compliance',
 16: 'Fraud Prevention',
 17: 'Cybersecurity',
 18: 'Third-Party Management',
 19: 'Audit and Assurance',
 20: 'Ethics and Governance',
 21: 'Continuous Improvement'}

In [None]:
def convert_labels_to_ids(label):
    return label2id[label]

In [None]:
def preprocess_function(examples):
    # Tokenize the stop-word-filtered text; truncate to the model's maximum length
    return tokenizer(examples["Filtered_Article"], truncation=True)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
df.head()

Unnamed: 0,Article,Date_Published,Source,GPT_Topic_Classification,POS_Tags,Filtered_Article
0,the field museum in chicago has covered up sev...,2024-01-11,NYT,miscellaneous,"[(the, DT), (field, NN), (museum, NN), (in, IN...",field museum chicago covered several display c...
1,the justice department charged ebay on thursda...,2024-01-11,NYT,Miscellaneous,"[(the, DT), (justice, NN), (department, NN), (...",justice department charged ebay thursday stalk...
2,the federal aviation administration said on th...,2024-01-11,NYT,Miscellaneous,"[(the, DT), (federal, JJ), (aviation, NN), (ad...",federal aviation administration said thursday ...
3,one of the most hopeful proposals involving po...,2024-01-02,NYT,AI and Machine Learning,"[(one, CD), (of, IN), (the, DT), (most, RBS), ...",one hopeful proposals involving police surveil...
4,when microsoft opened an advanced research lab...,2024-01-10,NYT,AI and Machine Learning,"[(when, WRB), (microsoft, NN), (opened, VBD), ...",microsoft opened advanced research lab beijing...


In [None]:
df.drop(columns=['Article', 'Source', 'POS_Tags', 'Date_Published'], inplace=True)

In [None]:
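# Keep only rows whose label is one of the known topics; labels such as
# 'miscellaneous' or 'Cryptocurrency' are not in label2id and would raise a KeyError
df = df[df['GPT_Topic_Classification'].isin(topics)].copy()
df['GPT_Topic_Classification'] = df['GPT_Topic_Classification'].apply(convert_labels_to_ids)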
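
The label column contains values such as 'miscellaneous' and 'Cryptocurrency' that do not appear in `topics`, so they have no entry in `label2id`. A minimal sketch for listing such labels before mapping them is shown below; the helper name `unknown_labels` and the toy data are assumptions for illustration.

```python
def unknown_labels(labels, mapping):
    """Return the distinct labels that have no entry in `mapping`, in first-seen order."""
    seen = []
    for label in labels:
        if label not in mapping and label not in seen:
            seen.append(label)
    return seen

# Toy mapping and labels; in the notebook these would be `label2id` and the label column.
toy_label2id = {"Generative AI": 0, "Cybersecurity": 1}
toy_labels = ["miscellaneous", "Generative AI", "Cryptocurrency", "miscellaneous"]
print(unknown_labels(toy_labels, toy_label2id))
```

An empty result would indicate that every label can be converted to an id.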

In [None]:
# Convert the pandas DataFrame into a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

In [None]:
tokenized_topics = dataset.map(preprocess_function, batched=True)

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
train_test_split = tokenized_topics.train_test_split(test_size=0.3)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']

In [None]:
train_dataset[0]

In [None]:
accuracy = evaluate.load("accuracy")

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
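
To make the metric concrete, the same argmax-then-compare logic can be sketched in plain Python. This is an illustration on toy values, not a replacement for the `evaluate` metric used above, and the helper names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy logits for three examples and three classes.
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]]
toy_labels = [1, 0, 1]
print(accuracy_from_logits(toy_logits, toy_labels))
```

Here the first two predictions match their labels and the third does not, so two of the three examples count as correct.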

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(topics), id2label=id2label, label2id=label2id
)

In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

In [None]:
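trainer = Trainer(
    model=model,
    args=training_args,
    # Trainer has no `label_names` argument; instead, rename the label column
    # to "labels", the name the model's forward pass expects
    train_dataset=train_dataset.rename_column("GPT_Topic_Classification", "labels"),
    eval_dataset=test_dataset.rename_column("GPT_Topic_Classification", "labels"),
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)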

In [None]:
trainer.train()