# Data Preprocessing

In [37]:
import numpy as np
import pandas as pd
from pathlib import Path
import warnings

## Setup

In [38]:
# Set random seed
np.random.seed(1040)

In [39]:
DIR = Path("data")
input_dir = DIR / "input"
stream_1_path = input_dir / "Stream1.xlsx"
stream_2_path = input_dir / "Stream2.xlsx"
stream_3_path = input_dir / "Stream3.xlsx"

In [40]:
# this is ok and will not cause problems
warnings.filterwarnings("ignore", message="Workbook contains no default style, apply openpyxl's default")

stream_1_data = pd.read_excel(stream_2_path, engine="openpyxl") # stream 1 contains the chronologically second part
stream_2_data = pd.read_excel(stream_1_path, engine="openpyxl") # stream 2 contains the chronologically first part
stream_3_data = pd.read_excel(stream_3_path, engine="openpyxl")

data = pd.concat([stream_1_data, stream_2_data, stream_3_data], ignore_index=True);

## First Look

The data contains a large number of attributes for each observation.
Therefore, we start by looking at the attributes to get an idea of what to keep and what to discard.

First, we convert all attribute names to lowercase and replace spaces with underscores to make them easier to handle.

In [41]:
# normalize the columns of the concatenated frame itself, not those of a single stream
data.columns = [c.replace(' ', '_').lower() for c in data.columns]
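
A minimal sketch of the same normalization on a toy frame. The header spellings `Post ID` and `Sound Bite Text` are assumed from the attribute list in this notebook, and the values are placeholders.

```python
import pandas as pd

# toy frame; header spellings are assumed, values are placeholders
toy = pd.DataFrame({"Post ID": ["a1"], "Sound Bite Text": ["hello"]})

# lowercase every column name and replace spaces with underscores
toy.columns = [c.replace(" ", "_").lower() for c in toy.columns]

print(list(toy.columns))
```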

In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150024 entries, 0 to 150023
Data columns (total 63 columns):
 #   Column                                       Non-Null Count   Dtype  
---  ------                                       --------------   -----  
 0   post_id                                      143071 non-null  object 
 1   sound_bite_text                              150015 non-null  object 
 2   ratings_and_scores                           0 non-null       float64
 3   title                                        71124 non-null   object 
 4   source_type                                  143053 non-null  object 
 5   post_type                                    85476 non-null   object 
 6   is_paid                                      150000 non-null  object 
 7   media_type                                   143053 non-null  object 
 8   url                                          143053 non-null  object 
 9   media_link                                   17428 non-null

Important:

- post_id (ID)
- Sound Bite Text (main text corpus)
- Published Date (GMT+01:00) London (used to create dynamic embeddings)
- Sentiment (used for extrinsic evaluation)

In the following, I first focus on these attributes to keep things clear and simple.

In [43]:
important_attributes = ["post_id", "sound_bite_text", "published_date_(gmt+01:00)_london", "sentiment"]

# select the columns directly; a misspelled name raises a KeyError instead of silently creating an empty column
data = data[important_attributes].copy()

For simplicity's sake I chose to rename the attributes to a more readable and manageable form

In [44]:
# "source_type" is not among the selected attributes, so it is not renamed here
data.rename(columns={"sound_bite_text": "raw_text", "published_date_(gmt+01:00)_london": "date", "post_id": "id"}, inplace=True)

Let's look at how many values are actually present for the selected attributes.

In [45]:
# number of observations
n = len(data)

# Display relative counts of missing values
data.isnull().sum().divide(n).sort_values(ascending=False)

sentiment    0.046466
id           0.046346
date         0.000160
raw_text     0.000060
dtype: float64

Both the date and the text have almost no missing values, which matters most because both are required to build the embeddings.
The sentiment attribute is only used for part of the evaluation of the embeddings and is therefore less important.
I therefore adopt the following strategy:

Remove observations:

- with missing date
- with missing text

Keep observations:
- with missing sentiment
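
The strategy above can be sketched on a toy frame as follows. The rows are made up for illustration; only the column names follow this notebook.

```python
import numpy as np
import pandas as pd

# made-up rows: one without text, one without date, one without sentiment
toy = pd.DataFrame({
    "raw_text": ["some text", None, "more text", "last one"],
    "date": ["Sep 16, 2022 6:29:09 AM", "Sep 17, 2022 1:00:00 PM", None, "Sep 18, 2022 8:15:00 AM"],
    "sentiment": ["Neutrals", "Positives", "Negatives", np.nan],
})

# drop rows lacking a date or a text; a missing sentiment is tolerated
kept = toy.dropna(subset=["date", "raw_text"])

print(kept)
```

Only the first and the last row survive, and the last one keeps its missing sentiment.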

In [46]:
data.dtypes

id           object
raw_text     object
date         object
sentiment    object
dtype: object

Let us now describe the key characteristics of our (remaining) data

In [47]:
data.head()

Unnamed: 0,id,raw_text,date,sentiment
0,101043269988443_685479222937141_641761047285352,Check this guy out at a school board meeting. ...,"Sep 16, 2022 6:29:09 AM",Neutrals
1,102479025168087_499693558822805_1195202924611670,Trump's one race theory like that of Hitler is...,"Jul 31, 2022 1:10:33 AM",Neutrals
2,10643211755_10161745802786756_627308188838071,Jon C Treleaven Seems right! I believe in pare...,"Sep 26, 2022 5:17:42 PM",Neutrals
3,10643211755_10161748775911756_1235289853921186,This happening in my town. We’re having school...,"Sep 27, 2022 4:23:12 AM",Positives
4,101043269988443_687503842734679_3330045383891840,The blame for all these perverted lifestyles b...,"Sep 5, 2022 6:10:30 PM",Neutrals


In [48]:
data.shape

(150024, 4)

In [49]:
data.dtypes

id           object
raw_text     object
date         object
sentiment    object
dtype: object

The datatypes are mostly as we would like.
We only convert the date attribute from object to datetime, since we are working with a time series.

In [50]:
data['date'].isnull().sum()

24

Given that temporal word embeddings rely heavily on dates, I consider the date crucial. Since 24 of the 150,024 observations lack a date, I remove them from the dataset.

In [51]:
data.dropna(subset=['date'], inplace=True)

In [52]:
data['date'] = pd.to_datetime(data['date'])
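
The date strings shown in the head output look like `Sep 16, 2022 6:29:09 AM`. Assuming every row follows this layout (not verified here), passing an explicit format makes the parsing stricter. A minimal sketch on two of the displayed values:

```python
import pandas as pd

# two date strings copied from the head output above
dates = pd.Series(["Sep 16, 2022 6:29:09 AM", "Sep 5, 2022 6:10:30 PM"])

# month abbreviation, day, year, 12-hour clock with AM/PM marker
parsed = pd.to_datetime(dates, format="%b %d, %Y %I:%M:%S %p")

print(parsed)
```

With an explicit format, a row that deviates from the layout raises an error instead of being parsed by guesswork.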

Since I focus on the temporal change of words, I chose to sort the observations by date because that makes a manual inspection later on more convenient

In [53]:
data.sort_values('date', inplace=True);

The text of the posts is the main source of information, so let's look at how many missing values we encounter here.

In [54]:
null_texts = data['raw_text'].isnull().sum()
empty_texts = data[data['raw_text'].str.len() < 2].count().iloc[0]
print(f"Observations with no text: {null_texts}")
print(f"Observations with empty text: {empty_texts}")

Observations with no text: 0
Observations with empty text: 0


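The output shows that no observations with missing or empty text remain, presumably because the rows without text were among those already dropped for lacking a date. The following call is therefore kept only as a safeguard against errors later on.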

In [55]:
data.dropna(subset=['raw_text'], inplace=True)

Now let's take a look at the different attributes. Since the task at hand is a sentiment analysis, we focus on this attribute first.

In [56]:
data["sentiment"].unique()

array(['Positives', 'Neutrals', nan, 'Negatives', 'Mixed'], dtype=object)

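So our target is to predict the sentiment from the text (raw_text, formerly sound_bite_text).
Apart from missing values, the sentiment takes one of four values (stored in plural form, e.g. 'Positives'):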

- Positive
- Negative
- Neutral
- Mixed
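
A short sketch of how the label distribution could be inspected, including the unlabeled observations. The `label_map` dictionary is a hypothetical helper that maps the stored plural labels to the names used in the list above, and the series holds made-up values.

```python
import numpy as np
import pandas as pd

# made-up labels in the form returned by data["sentiment"].unique()
sentiment = pd.Series(["Positives", "Neutrals", np.nan, "Negatives", "Mixed", "Neutrals"])

# hypothetical mapping from the stored labels to singular names
label_map = {"Positives": "Positive", "Negatives": "Negative", "Neutrals": "Neutral", "Mixed": "Mixed"}
normalized = sentiment.map(label_map)

# dropna=False keeps the unlabeled observations visible in the counts
counts = normalized.value_counts(dropna=False)

print(counts)
```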

In [57]:
# Get the range of dates
period = (data['date'].min(), data['date'].max())

# Format the output
formatted_range = tuple(date.strftime("%Y-%m-%d") for date in period)
print("Period of time:", formatted_range)

Period of time: ('2022-06-01', '2023-04-28')


## Text

### 1. Convert to lowercase

In [58]:
data["text"] = data["raw_text"].str.lower()

# rearrange columns
data = data[['id', 'text', "raw_text", 'date', 'sentiment']]

### 2. Remove Unicode Characters

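Remove @-mentions, URLs, and punctuation. Every character that is not an ASCII letter, digit, space, or tab is dropped.

In [59]:
import re

def clean_text(text):
    # Alternatives are tried in this order at each position:
    #   @mentions, any character that is not an ASCII letter, digit, space or tab,
    #   URLs with a scheme (e.g. https://...), a leading "rt", leftover "http" fragments
    # Note: the mention group is "@[A-Za-z0-9]+"; an escaped "\[" would only match a literal bracket
    text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text)

    return text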
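
A quick check of such a pattern on a made-up string. The sketch assumes the mention alternative is written as `@[A-Za-z0-9]+`, without a backslash before the bracket. The sample text is invented for illustration.

```python
import re

# pattern with an unescaped character class for mentions (assumption, see above)
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"

sample = "rt @user123 check this: https://example.com/page now!"
cleaned = re.sub(PATTERN, "", sample)

# removed spans leave their surrounding spaces behind, so split on whitespace to inspect the tokens
print(cleaned.split())
```

The leftover runs of spaces are harmless here because the later steps tokenize the text anyway.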

In [60]:
data["text"] = data["text"].apply(clean_text)

### 3. Remove Stopwords

In [61]:
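import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
# the stopword list is a separate NLTK resource and must be available as well
nltk.download('stopwords', quiet=True)

# build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords(text):
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in STOP_WORDS]
    return " ".join(filtered_text)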

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/paulschmitt/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [62]:
data["text"] = data["text"].apply(remove_stopwords)

### 4. Stemming

In [63]:
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# create the stemmer once instead of on every call
stemmer = SnowballStemmer(language="english")

def perform_stemming(text):
    word_tokens = word_tokenize(text)
    stemmed_text = [stemmer.stem(word) for word in word_tokens]
    return " ".join(stemmed_text)

In [64]:
data["text"] = data["text"].apply(perform_stemming)
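
To illustrate what the stemmer does to individual words, here is a minimal sketch. It splits on whitespace instead of calling `word_tokenize`, purely so that the example needs no downloaded NLTK resources; that simplification is an assumption of the sketch, not the notebook's method.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer(language="english")

# lowercased, stopword-free sample in the style of the processed text
sample = "ford government lacked data"

# whitespace split used here instead of word_tokenize (simplification)
stemmed = " ".join(stemmer.stem(word) for word in sample.split())

print(stemmed)
```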

In [65]:
data.head(10)

Unnamed: 0,id,text,raw_text,date,sentiment
9343,https://educationactiontoronto.com/articles/sc...,ford govern lack data back claim school spread...,The Ford Government lacked data to back its cl...,2022-06-01 23:00:00,Positives
47887,1115616568456201_7946683995349390_112899859794...,teach hs 1516 get first job everywher hire get...,I teach HS. They are all 15/16 getting their f...,2022-06-01 23:04:54,Neutrals
19611,https://www.natchitochestimes.com/2022/06/01/s...,grassroot effort lsba invalu entir state could...,The grass-roots efforts of the LSBA are invalu...,2022-06-01 23:05:00,Neutrals
42718,BRDRDT2-t1_iau9al6,yes sever parent right us absolv financi oblig...,"Yes, you can sever your parental rights in the...",2022-06-01 23:05:35,Neutrals
42677,BRDRDT2-t1_iau9rdh,let teach crt sex ed gunpoint,Let's teach CRT and Sex Ed at gunpoint now.,2022-06-01 23:09:11,Neutrals
37354,1115616568456201_7948036835214106_131162148936...,million thing right never say word god forbid ...,You can do a million things right and they nev...,2022-06-01 23:09:53,Neutrals
18884,,post delet author,Post deleted by the author.,2022-06-01 23:12:20,
37309,1115616568456201_7948036835214106_567073701495104,,This! ??,2022-06-01 23:13:54,Neutrals
41030,BRDRDT2-t1_iauajjr,think want go florida desanti ban crt school w...,I think they want to go to Florida because DeS...,2022-06-01 23:15:11,Negatives
31734,95475020353_545516537034360_1350369588773474,lisa mari sure guess attack school age childre...,Lisa Marie not sure but I guess attacking scho...,2022-06-01 23:16:28,Neutrals


In [66]:
data.dtypes

id                   object
text                 object
raw_text             object
date         datetime64[ns]
sentiment            object
dtype: object

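Check whether the preprocessing inadvertently introduced missing values in the processed text column. Note that isna and isnull are aliases, so both cells below report the same count, and that empty strings such as the text of row 37309 above are not counted as missing.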

In [67]:
data["text"].isna().sum()

0

In [68]:
data["text"].isnull().sum()

0
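
Since empty strings are not reported by `isna`, a separate check can be useful. A minimal sketch on a made-up series; counting whitespace-only strings as empty is an assumption of this example.

```python
import pandas as pd

# made-up processed texts: one empty string and one whitespace-only string
text = pd.Series(["ford govern lack data", "", "let teach crt", "   "])

n_missing = text.isna().sum()

# strip whitespace first so that blank strings count as empty too
n_empty = (text.str.strip() == "").sum()

print(n_missing, n_empty)
```

Keep in mind that an empty string written with `to_csv` is read back as a missing value by `read_csv` under its default settings.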

In [69]:
def save_to_csv(data, splits: list, sub_dir: str):

    output_dir = Path("data/split/") / sub_dir
    # Create output directory if it doesn't exist
    output_dir.mkdir(parents=True, exist_ok=True)
    # range of the observations
    lower = data["date"].min()
    upper = data["date"].max()

    for split in splits:
        split_df = data[(lower <= data['date']) & (data['date'] < split)]
        split_filename = output_dir / f"{lower.strftime('%d_%b')}_to_{split.strftime('%d_%b')}.csv"
        # Save the filtered data to csv, overwrite if exists
        split_df.to_csv(split_filename, index=False, mode='w')
        # Update the lower date for the next iteration
        lower = split

    # take care of second half of the last split
    split_df = data[(lower <= data['date']) & (data['date'] <= upper)]
    split_filename = output_dir / f"{lower.strftime('%d_%b')}_to_{upper.strftime('%d_%b')}.csv"
    split_df.to_csv(split_filename, index=False, mode='w')
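
The interval logic of the function can be sketched without writing any files. The toy dates below are made up; the split date matches the one used for the Colorado Springs split further down.

```python
import pandas as pd

# made-up observations around a single split date
toy = pd.DataFrame({"date": pd.to_datetime(["2022-11-01", "2022-11-17", "2022-11-18", "2022-12-05"])})
splits = [pd.Timestamp("2022-11-18")]

lower, upper = toy["date"].min(), toy["date"].max()
parts = []
for split in splits:
    # half-open interval [lower, split): the split date itself goes to the next part
    parts.append(toy[(lower <= toy["date"]) & (toy["date"] < split)])
    lower = split

# final, closed interval [lower, upper]
parts.append(toy[(lower <= toy["date"]) & (toy["date"] <= upper)])

print([len(p) for p in parts])
```

Each observation lands in exactly one part, because every interval but the last excludes its upper bound.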

In [70]:
# List of notable events
griner_release = pd.Timestamp('2022-10-07')
musk_twitter_takeover = pd.Timestamp('2022-10-01')
pelosi_attacked = pd.Timestamp("2022-10-26")
colorado_springs_shooting = pd.Timestamp("2022-11-18")

# did not work
world_cup = pd.Timestamp('2022-11-01')
seoul_halloween = pd.Timestamp('2022-10-28')

In [71]:
# ------------------------------------------------------------------ #
# ------------------------------------------------------------------ #
# ------------------------------------------------------------------ #

In [72]:
# PARAMS TO MODIFY MANUALLY
sub_dir = "colorado_springs"
splits = [colorado_springs_shooting]

save_to_csv(data, splits, sub_dir)