# Notebook 2: Labelling the sentiment of the standardised sentences

In [1]:
import pandas as pd



Read the dataset of standardised sentences extracted from Notebook 1.

In [2]:
# Read the csv of the standardised sentences
df_sentence = pd.read_csv("dataset/ar_30companies.csv")
df_sentence.head(6)

Unnamed: 0.1,Unnamed: 0,standardised sentences
0,0,mothercare plc annual report accounts www
1,1,mothercareplc
2,2,com transformation growth
3,3,financial highlights worldwide network sales
4,4,group sales
5,5,operating pro


Combine the financial and environmental word lists. First, read each word list CSV file and clean it.

In [3]:
# Read the csv of financial word list with the sentiment
df_fin_word = pd.read_csv("word_list/Loughran_McDonald_Sentiment_Word_List.csv")
df_fin_word.head(6)

Unnamed: 0,word,sentiment
0,abandon,Negative
1,abandoned,Negative
2,abandoning,Negative
3,abandonment,Negative
4,abandonments,Negative
5,abandons,Negative


In [4]:
# Check for missing values
df_fin_word.isnull().sum()

word         0
sentiment    0
dtype: int64

In [5]:
# Check the sentiment summary
categorycount = df_fin_word["sentiment"].value_counts()
categorycount

sentiment
Negative        5646
Litigious       1630
Positive        1231
Uncertainty      767
Constraining     432
WeakModal         27
StrongModal       19
Name: count, dtype: int64

Remove the Litigious, Uncertainty, Constraining, WeakModal and StrongModal categories so that only the Positive and Negative words remain. This also reduces the size of the word list.

In [6]:
# Create a mask for positive and negative sentiment only
fin_mask = (df_fin_word["sentiment"] == "Positive") | (df_fin_word["sentiment"] == "Negative")

# Drop the unused categories: Litigious, Uncertainty, Constraining, WeakModal, StrongModal.
# .copy() makes an independent dataframe, so the assignment below does not raise
# pandas' SettingWithCopyWarning.
filtered_df_fin_word = df_fin_word[fin_mask].copy()
filtered_df_fin_word["sentiment"] = filtered_df_fin_word["sentiment"].str.lower()

# Show the negative and positive counts of the financial word list
df_fin_word_sum = filtered_df_fin_word["sentiment"].value_counts()
df_fin_word_sum

sentiment
negative    5646
positive    1231
Name: count, dtype: int64

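Repeat the above wrangling for the environmental word list. Its sentiment labels are already lower case and limited to positive and negative, so no filtering is needed.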

In [7]:
# Read the environmental word list
df_env_word = pd.read_csv("word_list/environmental_word_list.csv")
df_env_word.head(6)

Unnamed: 0,word,sentiment
0,biofuels,positive
1,carbon dioxide,negative
2,co2,negative
3,carbon offsets,positive
4,fossil fuels,negative
5,carcinogens,negative


In [8]:
# Check any missing and null values
df_env_word.isnull().sum()

word         0
sentiment    0
dtype: int64

In [9]:
# Show the summary of sentiment categories in the environmental word list
categorycount = df_env_word["sentiment"].value_counts()
categorycount

sentiment
positive    123
negative     61
Name: count, dtype: int64

Next, combine the two word lists into a single dataframe.

In [10]:
# Combine both cleaned word lists
combined_word_list = pd.concat([filtered_df_fin_word, df_env_word], ignore_index = True)
combined_word_list

Unnamed: 0,word,sentiment
0,abandon,negative
1,abandoned,negative
2,abandoning,negative
3,abandonment,negative
4,abandonments,negative
...,...,...
7056,seg,positive
7057,environmental permit,positive
7058,anaerobic digestion,positive
7059,energy from waste,positive


In [11]:
# Check for missing values
combined_word_list.isnull().sum()

word         0
sentiment    0
dtype: int64

In [12]:
# Showing the summary of sentiment category of the combined word list
categorycount = combined_word_list["sentiment"].value_counts()
categorycount

sentiment
negative    5707
positive    1354
Name: count, dtype: int64
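
Because the combined list comes from two independent sources, it may be worth checking whether any word appears in both lists, possibly with opposite sentiments. The notebook does not perform this check; the following is a minimal sketch on a small made-up dataframe (`words` is illustrative and stands in for `combined_word_list`).

```python
import pandas as pd

# Illustrative stand-in for combined_word_list (made-up rows, not the real data).
words = pd.DataFrame({
    "word": ["loss", "gain", "waste", "waste"],
    "sentiment": ["negative", "positive", "negative", "positive"],
})

# Rows whose word occurs more than once, regardless of sentiment.
duplicated = words[words.duplicated(subset="word", keep=False)]

# Words that carry more than one distinct sentiment.
n_sentiments = words.groupby("word")["sentiment"].nunique()
conflicting = sorted(n_sentiments[n_sentiments > 1].index)

print(duplicated)
print(conflicting)
```

If `conflicting` is not empty for the real word list, those entries would need a decision (keep one sentiment or drop the word) before labelling.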

In [13]:
# Save as CSV for backup
combined_word_list.to_csv('word_list_final.csv', index = False)

Next, use this word list to label the standardised sentences that were extracted in Notebook 1.

In [14]:
# Drop the unnecessary column from the standardised sentences dataframe
df_sentence_1 = df_sentence.drop("Unnamed: 0", axis = 1)

In [15]:
# Show the dataframe after dropping the unnecessary column
df_sentence_1.head(6)

Unnamed: 0,standardised sentences
0,mothercare plc annual report accounts www
1,mothercareplc
2,com transformation growth
3,financial highlights worldwide network sales
4,group sales
5,operating pro
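
The "Unnamed: 0" column usually appears when a CSV was saved together with its index and then read back with default settings. As an alternative to dropping the column afterwards, `pd.read_csv` can be told to treat the first column as the index. The sketch below uses a small in-memory CSV (`csv_text` is made up for illustration) rather than the real file.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for the real file (illustrative data only).
csv_text = ",standardised sentences\n0,group sales\n1,uk operating loss\n"

# Default read: the saved index comes back as an "Unnamed: 0" column.
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 reuses the first column as the index, so nothing needs dropping.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Writing the file with `to_csv(..., index=False)`, as the backup cells in this notebook do, avoids the extra column in the first place.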


Next, convert the sentence column of the above dataframe to a list, because the global function expects a list of standardised sentences.

In [16]:
# Convert the dataframe column to a list
sentence_list = df_sentence_1["standardised sentences"].tolist()
print("First 6 sentences of the list:")
print(sentence_list[:6])
print(f"\nData type: {type(sentence_list)}")

First 6 sentences of the list:
['mothercare plc annual report accounts www', 'mothercareplc', 'com transformation growth', 'financial highlights worldwide network sales', 'group sales', 'operating pro']

Data type: <class 'list'>


Then import the global function "label_sentences" and use it to label the sentences.

In [17]:
# Import the global nlp function to label the sentences
from nlp_functions import label_sentences
df_sen_labels = label_sentences(sentence_list,combined_word_list)
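
The source of `nlp_functions.label_sentences` is not shown in this notebook, so its exact rule is unknown. As a rough illustration of how a lexicon-based labeller could work, the sketch below counts positive and negative word matches per sentence and labels by majority, with ties and no matches labelled neutral. The function name `label_sentences_sketch`, the majority rule, and the demo data are all assumptions, not the author's implementation.

```python
import pandas as pd

def label_sentences_sketch(sentences, word_list):
    """Hypothetical stand-in for nlp_functions.label_sentences (assumed rule)."""
    positive = set(word_list.loc[word_list["sentiment"] == "positive", "word"])
    negative = set(word_list.loc[word_list["sentiment"] == "negative", "word"])

    labels = []
    for sentence in sentences:
        # Whitespace tokens only: multi-word entries such as
        # "carbon dioxide" are NOT matched by this simple sketch.
        tokens = str(sentence).split()
        pos_hits = sum(token in positive for token in tokens)
        neg_hits = sum(token in negative for token in tokens)
        if pos_hits > neg_hits:
            labels.append("positive")
        elif neg_hits > pos_hits:
            labels.append("negative")
        else:
            labels.append("neutral")  # tie, or no lexicon word found
    return pd.DataFrame({"sentences": list(sentences), "labels": labels})

# Made-up word list and sentences for demonstration.
demo_words = pd.DataFrame({
    "word": ["loss", "improved"],
    "sentiment": ["negative", "positive"],
})
demo = label_sentences_sketch(
    ["uk operating loss", "group sales", "improved margins"], demo_words
)
print(demo)
```

The real function may differ, for example in how it handles multi-word terms or ties, so its output should be treated as the reference.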

In [25]:
# print the labels and sentences
df_sen_labels.head(20)

Unnamed: 0,sentences,labels
0,mothercare plc annual report accounts www,neutral
1,mothercareplc,neutral
2,com transformation growth,neutral
3,financial highlights worldwide network sales,neutral
4,group sales,neutral
5,operating pro,neutral
6,uk operating loss,negative
7,n vs,neutral
8,pro,neutral
9,million last year international operating pro,neutral


In [19]:
# Check any missing and null values
df_sen_labels.isnull().sum()

sentences    0
labels       0
dtype: int64

In [20]:
# Summary of labels for the sentences
categorysum = df_sen_labels["labels"].value_counts()
categorysum

labels
positive    396692
neutral     301895
negative    205911
Name: count, dtype: int64

In [21]:
# Save as CSV for backup
df_sen_labels.to_csv('sentences_labels.csv', index = False)

Next, convert the above labels into numeric values before tokenization.

In [23]:
from nlp_functions import convert_labels_to_numeric
# Mapping the labels with numeric values
label_mapping = {'positive': 1, 'negative': -1, 'neutral': 0}

# Converting the labels into numeric values
converted_sen_labels = convert_labels_to_numeric(df_sen_labels,label_mapping)
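
The body of `convert_labels_to_numeric` is likewise not shown here. A conversion of this kind can be sketched with `Series.map`, as below. The function name `convert_labels_sketch`, the `column` parameter, and the check for unmapped labels are assumptions added for illustration.

```python
import pandas as pd

def convert_labels_sketch(df, mapping, column="labels"):
    """Hypothetical stand-in for nlp_functions.convert_labels_to_numeric."""
    # Fail early if a label has no numeric code; map() would silently give NaN.
    unknown = set(df[column]) - set(mapping)
    if unknown:
        raise ValueError(f"Labels without a numeric code: {sorted(unknown)}")
    converted = df.copy()  # leave the input dataframe unchanged
    converted[column] = converted[column].map(mapping)
    return converted

# Made-up rows for demonstration.
demo = pd.DataFrame({
    "sentences": ["uk operating loss", "group sales"],
    "labels": ["negative", "neutral"],
})
demo_numeric = convert_labels_sketch(
    demo, {"positive": 1, "negative": -1, "neutral": 0}
)
print(demo_numeric)
```

Returning a copy keeps the text labels available in the original dataframe, which matches the notebook's later display of `df_sen_labels` with its string labels intact.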

In [26]:
# Print the converted dataframe
converted_sen_labels.head(20)

Unnamed: 0,sentences,labels
0,mothercare plc annual report accounts www,0
1,mothercareplc,0
2,com transformation growth,0
3,financial highlights worldwide network sales,0
4,group sales,0
5,operating pro,0
6,uk operating loss,-1
7,n vs,0
8,pro,0
9,million last year international operating pro,0


In [27]:
# Show the numeric labels summary
numeric_labels_counts = converted_sen_labels['labels'].value_counts()
numeric_labels_counts

labels
 1    396692
 0    301895
-1    205911
Name: count, dtype: int64

In [29]:
# Save as CSV for backup
converted_sen_labels.to_csv("sen_with_numeric_labels.csv", index = False)