# DATA EXPLORATION

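This notebook explores the **data.csv** file: it loads the saved classifier, predicts ESG / NOT ESG labels for single sentences and whole dataframes, and reports metrics per language.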

In [1]:
import numpy as np
import pandas as pd

from transformers import AutoTokenizer, AutoModelForSequenceClassification


from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

pd.set_option('display.max_columns', None)

In [2]:
import sys
sys.path.append("../")

In [11]:
import src.constants as const
from src.preprocessing import preprocessing
from src.predict import predict_dataframe, predict_sentence
from src.utils import compute_metrics_by_category

from sklearn.model_selection import train_test_split

In [4]:
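# Load the raw data and apply the project's preprocessing
df = pd.read_csv("../data/data.csv")
df = preprocessing(df)

# Hold out 20% for testing, then 20% of the remainder for validation
# (about 64% / 16% / 20% overall), stratifying both splits by language
train_df, test_df = train_test_split(df, test_size=.2, random_state=17, stratify=df["lang"].values)
train_df, val_df = train_test_split(train_df, test_size=.2, random_state=17, stratify=train_df["lang"].values)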
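
The two chained splits can be checked with a little arithmetic. The sketch below is only an illustration: `split_sizes` is a hypothetical helper, not part of `src`, and it assumes that `train_test_split` rounds a fractional `test_size` up to the next whole row. The row count of 9118 is also an assumption, obtained by summing the `support` columns of the per-language tables in this notebook.

```python
import math


def split_sizes(n_rows, test_size=0.2, val_size=0.2):
    """Return (train, val, test) row counts for two chained splits.

    Assumes the held-out share is rounded up, which is how
    scikit-learn is understood to treat a float test_size.
    """
    n_test = math.ceil(test_size * n_rows)
    n_rest = n_rows - n_test
    n_val = math.ceil(val_size * n_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test


# Assumption: total rows after preprocessing, taken from the summed supports
n_rows = 9118
n_train, n_val, n_test = split_sizes(n_rows)
print(n_train, n_val, n_test)  # 5835 1459 1824
print(round(n_train / n_rows, 2), round(n_val / n_rows, 2), round(n_test / n_rows, 2))
```

These counts match the summed supports of the train, validation, and test tables, which suggests the saved CSV files come from the same split.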

In [5]:
saved_model = AutoModelForSequenceClassification.from_pretrained("../model/")

In [6]:
tokenizer = AutoTokenizer.from_pretrained(const.MODEL)

In [8]:
print(test_df["title+sentences"].iloc[0])

Distrust of TEPCO Hampers Decommissioning. The latest round of public hearings was held only in Fukushima and Tokyo and this didn't seem sufficient to regain public support. Decommissioning of the crippled Fukushima Daiichi nuclear plant is a prerequisite for the reconstruction of areas devastated by the nuclear disaster. To this end, treatment of contaminated water is a must, and it needs be done swiftly. However, there will not be progress, no matter which method is taken, without the consent of the people affected by the nuclear disaster. TEPCO and government officials must offer truthful updates as soon as they happen.


## PREDICT FOR SENTENCES

In [10]:
idx = 10
sentence = test_df["title+sentences"].iloc[idx]
predict_sentence(saved_model, tokenizer, sentence)
print(test_df["label"].iloc[idx])

['NOT ESG']

NOT ESG


## PREDICT FOR DATAFRAMES

In [None]:
# This cell needs a GPU: run it on Google Colab

# Predictions for the TRAINING dataset
train_predictions = predict_dataframe(saved_model, tokenizer, train_df)
train_df["pred"] = train_predictions

# Predictions for the VALIDATION dataset
val_predictions = predict_dataframe(saved_model, tokenizer, val_df)
val_df["pred"] = val_predictions

# Predictions for the TEST dataset
test_predictions = predict_dataframe(saved_model, tokenizer, test_df)
test_df["pred"] = test_predictions
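
The following cells read `../data/train.csv`, `../data/val.csv`, and `../data/test.csv`, but the step that writes those files is not shown in this notebook. One plausible way to persist predictions is sketched here on toy data. `save_predictions` is a hypothetical helper, and writing without the index is an assumption, not necessarily what was done for the real files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_predictions(df, predictions, path):
    """Attach predictions as a `pred` column and write the frame to CSV."""
    out = df.copy()
    out["pred"] = list(predictions)
    out.to_csv(path, index=False)  # assumption: the index is not needed later
    return out


# Toy stand-ins for a dataframe and the output of a prediction function
toy_df = pd.DataFrame({"lang": ["english", "french"], "label": ["ESG", "NOT ESG"]})
toy_predictions = ["ESG", "ESG"]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "toy_predictions.csv"
    saved = save_predictions(toy_df, toy_predictions, csv_path)
    reloaded = pd.read_csv(csv_path)

print(reloaded)
```

Saving the predictions once means the metrics cells can be rerun on a machine without a GPU.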

In [14]:
train_pred_df = pd.read_csv("../data/train.csv")
val_pred_df = pd.read_csv("../data/val.csv")
test_pred_df = pd.read_csv("../data/test.csv")

In [15]:
# TRAIN
compute_metrics_by_category(df=train_pred_df, category_column="lang", true_column=const.TARGET, pred_column="pred", pos_label="ESG")

| | category | support | accuracy | precision | recall | f1_score |
|---|---|---|---|---|---|---|
| 0 | english | 5083 | 0.964785 | 0.968717 | 0.967685 | 0.9682 |
| 1 | french | 289 | 0.972318 | 0.975806 | 0.960317 | 0.968 |
| 2 | portuguese | 100 | 0.95 | 0.928571 | 0.981132 | 0.954128 |
| 3 | japanese | 92 | 0.98913 | 0.98 | 1.0 | 0.989899 |
| 4 | german | 88 | 0.965909 | 0.944444 | 1.0 | 0.971429 |
| 5 | spanish | 130 | 0.992308 | 1.0 | 0.987342 | 0.993631 |
| 6 | italian | 53 | 0.943396 | 0.969697 | 0.941176 | 0.955224 |
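
`compute_metrics_by_category` lives in `src.utils` and its code is not shown here. As a rough guide to what the columns mean, the sketch below computes the same kind of per-category table on toy data. The function `metrics_by_category` and the column name `label` are assumptions made for illustration; the real helper may differ in details such as zero-division handling or row order.

```python
import pandas as pd


def metrics_by_category(df, category_column, true_column, pred_column, pos_label):
    """Binary classification metrics per category, with pos_label as the positive class."""
    rows = []
    for category, group in df.groupby(category_column, sort=False):
        y_true = group[true_column] == pos_label
        y_pred = group[pred_column] == pos_label
        tp = int((y_true & y_pred).sum())    # true positives
        fp = int((~y_true & y_pred).sum())   # false positives
        fn = int((y_true & ~y_pred).sum())   # false negatives
        tn = int((~y_true & ~y_pred).sum())  # true negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append({
            "category": category,
            "support": len(group),
            "accuracy": (tp + tn) / len(group),
            "precision": precision,
            "recall": recall,
            "f1_score": f1,
        })
    return pd.DataFrame(rows)


# Toy data: four English rows (half of them predicted correctly), two French rows (both correct)
toy = pd.DataFrame({
    "lang": ["en", "en", "en", "en", "fr", "fr"],
    "label": ["ESG", "ESG", "NOT ESG", "NOT ESG", "ESG", "NOT ESG"],
    "pred": ["ESG", "NOT ESG", "NOT ESG", "ESG", "ESG", "NOT ESG"],
})
result = metrics_by_category(toy, "lang", "label", "pred", pos_label="ESG")
print(result)
```

With small supports, a single misclassified row moves these metrics a lot, which is worth keeping in mind when reading the rows for the less frequent languages.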


In [16]:
# VAL
compute_metrics_by_category(df=val_pred_df, category_column="lang", true_column=const.TARGET, pred_column="pred", pos_label="ESG")

| | category | support | accuracy | precision | recall | f1_score |
|---|---|---|---|---|---|---|
| 0 | german | 22 | 0.818182 | 0.923077 | 0.8 | 0.857143 |
| 1 | english | 1271 | 0.856019 | 0.894578 | 0.84017 | 0.866521 |
| 2 | japanese | 23 | 0.826087 | 0.846154 | 0.846154 | 0.846154 |
| 3 | portuguese | 25 | 0.84 | 0.9375 | 0.833333 | 0.882353 |
| 4 | italian | 13 | 0.769231 | 1.0 | 0.5 | 0.666667 |
| 5 | spanish | 32 | 0.84375 | 1.0 | 0.705882 | 0.827586 |
| 6 | french | 73 | 0.780822 | 0.857143 | 0.580645 | 0.692308 |


In [13]:
# TEST
compute_metrics_by_category(df=test_pred_df, category_column="lang", true_column=const.TARGET, pred_column="pred", pos_label="ESG")

| | category | support | accuracy | precision | recall | f1_score |
|---|---|---|---|---|---|---|
| 0 | english | 1589 | 0.869729 | 0.900599 | 0.858447 | 0.879018 |
| 1 | japanese | 29 | 0.827586 | 0.818182 | 0.75 | 0.782609 |
| 2 | italian | 17 | 0.705882 | 1.0 | 0.545455 | 0.705882 |
| 3 | french | 90 | 0.844444 | 0.857143 | 0.769231 | 0.810811 |
| 4 | spanish | 41 | 0.926829 | 0.9 | 1.0 | 0.947368 |
| 5 | portuguese | 31 | 0.741935 | 0.764706 | 0.764706 | 0.764706 |
| 6 | german | 27 | 0.888889 | 0.866667 | 0.928571 | 0.896552 |


In [18]:
test_pred_df.head()

| | level_0 | index | date | article_id | company | lang | source | virality | all_esg_keywords | title+sentences | label | nb_words | nb_labels | pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5860 | 5965 | 2023-07-10 | 2258ee546486bcf1a94879ab5e0282f878460537d5e319... | Tokyo Electric Power Company Holding | english | | 4 | ['nuclear disaster'] | Distrust of TEPCO Hampers Decommissioning. The... | ESG | 100 | 1 | ESG |
| 1 | 2826 | 2880 | 2022-01-14 | f6c1d13f6b17a0bde6c22de3c1a8158632325abd29816f... | TEXTRON INC | english | nasdaq.com | 3 | ['violation'] | Textron Specialized Vehicles Recalls Personal ... | NOT ESG | 94 | 1 | NOT ESG |
| 2 | 2420 | 2468 | 2020-11-29 | f18d49e49cc976d1bf126eb57e8dd2d4f94813426301c4... | Coca-Cola Co | english | bdnews24.com | 3 | ['forced labour', 'forced labor', 'scorn'] | Nike and Coca-Cola lobby against Xinjiang forc... | ESG | 1189 | 1 | ESG |
| 3 | 4368 | 4444 | 2019-10-09 | ffbd346c5396cbd201449b1fc94490e60968fa05804c7b... | easyJet PLC | english | irishmirror.ie | 1 | ['strike'] | Nightmare for Irish pensioners after easyJet p... | NOT ESG | 120 | 1 | NOT ESG |
| 4 | 8027 | 8177 | 2023-07-03 | 29282a5daf78a6d0c1ce248f66d4b7f5788da53bf85122... | GRANT THORNTON | english | | 2 | ['sanction', 'laundering', 'bribery', 'fraud',... | Risk management that gets to “yes”. The causes... | NOT ESG | 509 | 1 | NOT ESG |
