## Imports

In [56]:
# Imports
import pandas as pd
from collections import defaultdict
import re
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

## Dataset Overview

In [None]:
# Loading the Twitter dataset
df = pd.read_csv('/Users/diegolemos/Masters/Theses/code/data/raw/twcs.csv')

In [None]:
# Displaying basic information about the dataset
print(df.shape)
df.head()

(2811774, 7)


|  | tweet_id | author_id | inbound | created_at | text | response_tweet_id | in_response_to_tweet_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | sprintcare | False | Tue Oct 31 22:10:47 +0000 2017 | @115712 I understand. I would like to assist y... | 2.0 | 3.0 |
| 1 | 2 | 115712 | True | Tue Oct 31 22:11:45 +0000 2017 | @sprintcare and how do you propose we do that | NaN | 1.0 |
| 2 | 3 | 115712 | True | Tue Oct 31 22:08:27 +0000 2017 | @sprintcare I have sent several private messag... | 1.0 | 4.0 |
| 3 | 4 | sprintcare | False | Tue Oct 31 21:54:49 +0000 2017 | @115712 Please send us a Private Message so th... | 3.0 | 5.0 |
| 4 | 5 | 115712 | True | Tue Oct 31 21:49:35 +0000 2017 | @sprintcare I did. | 4.0 | 6.0 |


In [None]:
# Checking for null values
df.isnull().sum()

tweet_id                         0
author_id                        0
inbound                          0
created_at                       0
text                             0
response_tweet_id          1040629
in_response_to_tweet_id     794335
dtype: int64

### Null Values
As we can see above, a significant number of values are missing in 'response_tweet_id' (1,040,629) and 'in_response_to_tweet_id' (794,335). A missing 'in_response_to_tweet_id' marks a tweet that is not a reply to anything, i.e. the start of a conversation, while a missing 'response_tweet_id' marks a tweet that received no reply. Instead of removing or imputing these nulls, we keep them and use them as anchor points to reconstruct complete conversation threads by backward traversal. This will enable us to create a new thread_id column that groups all related tweets together, which is essential for tracking how customer sentiment progresses across interactions.

## Conversation Thread Reconstruction

In [28]:
# Reconstructing conversation threads

# Creating a lookup from each tweet ID to its parent tweet ID
id_to_parent = dict(zip(df['tweet_id'], df['in_response_to_tweet_id']))

# Tracing the root tweet (finding the start of the conversation)
def find_root_tweet(tweet_id, id_to_parent_cache):
    while pd.notnull(id_to_parent_cache.get(tweet_id)):
        tweet_id = id_to_parent_cache[tweet_id]
    return tweet_id

# Applying to all tweets
thread_ids = []
for tweet_id in df['tweet_id']:
    thread_id = find_root_tweet(tweet_id, id_to_parent)
    thread_ids.append(thread_id)

df['thread_id'] = thread_ids

### Reconstructing conversation threads
To understand how the messages connect into conversations, we add a thread_id column that groups all the tweets belonging to the same conversation. Using the in_response_to_tweet_id column, we follow each reply back, parent by parent, until we reach the original tweet of its thread. The ID of that root tweet becomes the thread_id, which lets us group the messages that belong to the same support interaction.
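
The backward traversal can be illustrated on a toy example. This is only a sketch: the `parent_of` dictionary below is made-up data loosely modelled on the first rows of the dataset, `None` stands in for a missing parent, and the `seen` set is an extra safeguard against reply cycles that the cell above does not include.

```python
def find_root(tweet_id, parent_of):
    """Follow parent links until reaching a tweet with no known parent."""
    seen = set()  # guard against accidental cycles (assumption, not in the notebook)
    while True:
        parent = parent_of.get(tweet_id)
        if parent is None or parent in seen:
            return tweet_id
        seen.add(tweet_id)
        tweet_id = parent


# Hypothetical reply links: tweet_id -> in_response_to_tweet_id
parent_of = {1: 3, 2: 1, 3: 4, 4: 5, 5: 6, 6: None, 7: None, 8: 7}

# Every tweet is mapped to the ID of the tweet that started its conversation
thread_of = {tweet_id: find_root(tweet_id, parent_of) for tweet_id in parent_of}
print(thread_of)
```

Tweets 1 to 6 end up in the thread rooted at tweet 6, while tweets 7 and 8 form a second thread rooted at tweet 7. As in the notebook cell, a reply whose parent is absent from the lookup is assigned the ID of that absent parent as its thread.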

In [29]:
# Checking the number of unique threads
print('Unique threads: ', df['thread_id'].nunique())

Unique threads:  798012


## Customer Message Filtering

In [None]:
# Checking the number of messages from customers (True) and agents (False)
df['inbound'].value_counts()

inbound
True     1537843
False    1273931
Name: count, dtype: int64

In [None]:
# Filtering customer messages
customer_df = df[df['inbound'] == True].copy()

In [None]:
# Dropping rows with no text
customer_df.dropna(subset = ['text'], inplace = True)
customer_df = customer_df[customer_df['text'].str.strip() != '']

# Displaying samples
customer_df['text'].sample(5).tolist()


['@AmazonHelp it took you 8 hours to figure that out when i had already mentioned that. both the products are billed in my name and i have provided the...',
 '@115913 I would like to private message you how would I go about that',
 '@AmazonHelp We have also replay on this link please check and get back update me ASAP',
 '@115955 - keep making those silly commercials w the dummy, they r a scream!',
 "@AppleSupport I'm so mad 😡😡😡 My past iMessages are deleted. Why??????"]

### Customers Inbound
Here we created a copy of the dataset containing only customer messages, as the objective of this project is to detect customer dissatisfaction.

## Text Cleaning

In [58]:
# Creating a function to clean the text
def clean_text(text):
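    # Removing URLs (the dot after "www" is escaped so it only matches a literal dot)
    text = re.sub(r'http\S+|www\.\S+', '', text)

    # Removing mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)

    # Collapsing repeated whitespace and trimming both ends
    text = re.sub(r'\s+', ' ', text).strip()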
    
    return text

# Creating a function to detect language
def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        return 'unknown'

In [38]:
# Cleaning customer messages
customer_df['clean_text'] = customer_df['text'].apply(clean_text)

## Language Detection & Filtering

In [60]:
# Detecting language
customer_df['lang'] = customer_df['clean_text'].apply(detect_language)

In [61]:
# Keeping only English messages
customer_df = customer_df[customer_df['lang'] == 'en'].copy()

In [71]:
# Printing a sample of cleaned messages
customer_df[['text', 'clean_text', 'author_id']].sample(5)

|  | text | clean_text | author_id |
|---|---|---|---|
| 1019430 | @AmericanAir It's stressful enough flying with... | It's stressful enough flying with cancer this ... | 386091 |
| 30397 | The crooks @115873 trying to charge me almost ... | The crooks trying to charge me almost $100 for... | 124491 |
| 398443 | @222781 @XboxSupport I got banned for calling ... | I got banned for calling someone "awful" using... | 222780 |
| 1074308 | @AmericanAir not yet! The #coffeedebacle is st... | not yet! The is still a debacle, but I'm on th... | 396735 |
| 2569094 | @sainsburys is it really necessary to have you... | is it really necessary to have your packaging ... | 767667 |


In [72]:
# Saving the processed dataset
customer_df.to_csv('/Users/diegolemos/Masters/Theses/code/data/processed/customer_english.csv', index=False)

### Clean text and language detection
In this step, we cleaned the raw tweet texts by removing unnecessary elements such as URLs, mentions, hashtags and extra whitespace. Numbers, punctuation and emojis were kept, as they can have an important impact on the sentiment expressed in a message. The result is stored in a new column called 'clean_text', which ensures that the data is ready for tokenisation and sentiment analysis.
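
As a quick illustration of these cleaning rules, the sketch below applies the same three substitutions to a single message. The sample is based on one of the customer tweets shown earlier, with a made-up URL and hashtag appended for demonstration purposes only.

```python
import re


def clean_text(text):
    # Remove URLs, then mentions and hashtags, then collapse whitespace
    text = re.sub(r'http\S+|www\.\S+', '', text)
    text = re.sub(r'@\w+|#\w+', '', text)
    return re.sub(r'\s+', ' ', text).strip()


# The URL and the hashtag are hypothetical additions to a real sample tweet
sample = ("@AppleSupport I'm so mad 😡😡😡 My past iMessages are deleted. "
          "Why?????? https://t.co/abc123 #fail")

cleaned = clean_text(sample)
print(cleaned)
```

The mention, the URL and the hashtag are dropped, while the emojis and the repeated question marks survive. One side effect worth keeping in mind is that a hashtag used as a word inside a sentence is removed entirely, as in the '#coffeedebacle' row of the sample output above, which can leave a gap in the cleaned sentence.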

Additionally, we implemented a function to detect the language of each message and then kept only the English messages, as models like VADER and DistilBERT are very English-centric.