# MS5001 Group Project

#### **Done by:**
TANG WEN YUE; e0920872@u.nus.edu  
ZHANG SHUANG; e1582514@u.nus.edu  
GAN LEA WAH IVOR; e1538123@u.nus.edu  
GAN ZHENG WEI, TIMOTHY; e1561523@u.nus.edu  

#### **Chosen dataset:**
Airline Sentiments

This section imports all required libraries. For convenience and organisation, all imports are kept at the top of the notebook.

In [1]:
#Import libraries
import pandas as pd
import re
import emoji

## Step 1: Data Exploration

In this step, we load the data into a pandas DataFrame and perform some exploratory data analysis to get an initial look at the dataset.

In [10]:
#Import data into a pandas dataframe
#A forward slash keeps the path portable and avoids the invalid "\T" escape sequence
tweet_data = pd.read_csv("Airline Sentiment/Tweets.csv")
display(tweet_data.head(10))
print(f"Number of rows {tweet_data.shape[0]}")

Unnamed: 0,airline_sentiment,sentiment_confidence,text
0,neutral,1.0,@VirginAmerica What @dhepburn said.
1,positive,0.3486,@VirginAmerica plus you've added commercials t...
2,neutral,0.6837,@VirginAmerica I didn't today... Must mean I n...
3,negative,1.0,@VirginAmerica it's really aggressive to blast...
4,negative,1.0,@VirginAmerica and it's a really big bad thing...
5,negative,1.0,@VirginAmerica seriously would pay $30 a fligh...
6,positive,0.6745,"@VirginAmerica yes, nearly every time I fly VX..."
7,neutral,0.634,@VirginAmerica Really missed a prime opportuni...
8,positive,0.6559,"@virginamerica Well, I didn't…but NOW I DO! :-D"
9,positive,1.0,"@VirginAmerica it was amazing, and arrived an ..."


Number of rows 14639


In [None]:
#wrap text so we can see the entire tweet better

# Apply CSS styling for text wrapping for the first 30 rows
styled_df = tweet_data.head(30).style.set_properties(
    **{
        'inline-size': '200px',  # Set a fixed width for the column
        'overflow-wrap': 'break-word',
        'white-space': 'normal'
    },
    subset=['text']
)

# Display the styled DataFrame (will render as HTML in compatible environments)
styled_df

Unnamed: 0,airline_sentiment,sentiment_confidence,text
0,neutral,1.0,@VirginAmerica What @dhepburn said.
1,positive,0.3486,@VirginAmerica plus you've added commercials to the experience... tacky.
2,neutral,0.6837,@VirginAmerica I didn't today... Must mean I need to take another trip!
3,negative,1.0,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces & they have little recourse"
4,negative,1.0,@VirginAmerica and it's a really big bad thing about it
5,negative,1.0,@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA
6,positive,0.6745,"@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)"
7,neutral,0.634,"@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP"
8,positive,0.6559,"@virginamerica Well, I didn't…but NOW I DO! :-D"
9,positive,1.0,"@VirginAmerica it was amazing, and arrived an hour early. You're too good to me."


The exploration shows that the tweets are mostly plain text, with structural elements such as handles (@XXX) and hashtags (#XXX). We may also need to consider emojis and how to encode them.
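
To put rough numbers on these structural elements, they can be counted with a few regular expressions. The following is a minimal sketch rather than part of the original analysis: the helper name `count_structural_elements` and the three patterns are assumptions, and the third sample tweet is made up for illustration. It works on any iterable of strings, so it could also be called on `tweet_data["text"]`.

```python
import re

# Assumed patterns: a handle is '@' plus word characters,
# a hashtag is '#' plus word characters, a URL starts with http(s)://
HANDLE_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
URL_RE = re.compile(r"https?://\S+")

def count_structural_elements(tweets):
    """Count how many tweets contain at least one handle, hashtag or URL."""
    counts = {"handles": 0, "hashtags": 0, "urls": 0}
    for tweet in tweets:
        if not isinstance(tweet, str):  # skip missing values such as NaN
            continue
        if HANDLE_RE.search(tweet):
            counts["handles"] += 1
        if HASHTAG_RE.search(tweet):
            counts["hashtags"] += 1
        if URL_RE.search(tweet):
            counts["urls"] += 1
    return counts

sample = [
    "@VirginAmerica What @dhepburn said.",
    "@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP",
    "great flight today #happy",  # hypothetical tweet, not from the dataset
]
print(count_structural_elements(sample))
# {'handles': 2, 'hashtags': 1, 'urls': 1}
```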

## Step 2: Data Cleaning

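In this step, we clean the tweet text by converting it to lowercase and replacing emojis with their text descriptions.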

In [20]:
#Clean tweet data

#Set all to lowercase
tweet_data["cleaned_text"]=tweet_data["text"].str.lower()

In [21]:
#convert emojis to text so that vector text encoding can pick up the sentiment

def emoji_to_text(text):
    return emoji.demojize(text)

tweet_data["cleaned_text"]=tweet_data["cleaned_text"].apply(emoji_to_text)
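
To illustrate what the emoji-to-text conversion does, here is a rough standard-library stand-in. It is not how `emoji.demojize` is implemented: the helper `demojize_approx` is hypothetical, it uses Unicode character names (which may differ from the names the `emoji` package produces), it ignores multi-codepoint emojis, and it also converts non-emoji symbols such as `°` because it relies on the broad "Symbol, other" category.

```python
import unicodedata

def demojize_approx(text):
    """Replace each 'Symbol, other' character with its Unicode name in :snake_case: form."""
    parts = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            if name:
                parts.append(":" + name.lower().replace(" ", "_") + ":")
                continue
        parts.append(ch)  # leave ordinary characters untouched
    return "".join(parts)

print(demojize_approx("flight cancelled again \U0001F621"))
# flight cancelled again :pouting_face:
```

Once the emoji is a word-like token, a text vectoriser can treat it like any other term in the tweet.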

In [22]:
#inspect the cleaned text

# Apply CSS styling for text wrapping for the first 30 rows
styled_df = tweet_data.head(30).style.set_properties(
    **{
        'inline-size': '200px',  # Set a fixed width for the column
        'overflow-wrap': 'break-word',
        'white-space': 'normal'
    },
    subset=['cleaned_text']
)

# Display the styled DataFrame (will render as HTML in compatible environments)
styled_df

Unnamed: 0,airline_sentiment,sentiment_confidence,text,cleaned_text
0,neutral,1.0,@VirginAmerica What @dhepburn said.,@virginamerica what @dhepburn said.
1,positive,0.3486,@VirginAmerica plus you've added commercials to the experience... tacky.,@virginamerica plus you've added commercials to the experience... tacky.
2,neutral,0.6837,@VirginAmerica I didn't today... Must mean I need to take another trip!,@virginamerica i didn't today... must mean i need to take another trip!
3,negative,1.0,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces & they have little recourse","@virginamerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces & they have little recourse"
4,negative,1.0,@VirginAmerica and it's a really big bad thing about it,@virginamerica and it's a really big bad thing about it
5,negative,1.0,@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA,@virginamerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying va
6,positive,0.6745,"@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)","@virginamerica yes, nearly every time i fly vx this “ear worm” won’t go away :)"
7,neutral,0.634,"@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP","@virginamerica really missed a prime opportunity for men without hats parody, there. https://t.co/mwpg7grezp"
8,positive,0.6559,"@virginamerica Well, I didn't…but NOW I DO! :-D","@virginamerica well, i didn't…but now i do! :-d"
9,positive,1.0,"@VirginAmerica it was amazing, and arrived an hour early. You're too good to me.","@virginamerica it was amazing, and arrived an hour early. you're too good to me."



## Step 3: Feature Extraction

In this step, we extract features from the tweets, starting with handles and emojis.

In [None]:
#Extract features

#Initialise a new dataframe to hold the extracted features
extracted_features=pd.DataFrame()

#Extract the handles
extracted_features["handles"]=tweet_data['cleaned_text'].str.findall(r'@(\w+)')

#Extract emojis
#Note: emojis are read from the original 'text' column, because demojize has
#already replaced them with text descriptions in 'cleaned_text'

#example of angry face emoji within the dataset
emojis = emoji.distinct_emoji_list(tweet_data['text'].iloc[23])
print(emojis)

# Function to extract emojis from a string
def extract_emojis(text):
    if isinstance(text, str):  # Ensure it's a string to avoid errors with non-string types like NaN
        return emoji.distinct_emoji_list(text)
    return []  # Return an empty list for non-string values, matching the list type returned above

# Apply the function to the 'text' column to create a new 'emojis' column
extracted_features['emojis'] = tweet_data['text'].apply(extract_emojis)

#named entity recognition?
#how to vectorise the rest of the text?

['😡']


In [None]:
#Inspect the extracted features dataframe
extracted_features.head()

Unnamed: 0,handles,emojis
0,"[virginamerica, dhepburn]",[]
1,[virginamerica],[]
2,[virginamerica],[]
3,[virginamerica],[]
4,[virginamerica],[]
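
One possible answer to the open question in the cell above, how to vectorise the rest of the text, is a bag-of-words count. The sketch below uses only the standard library; `tokenize` and `bag_of_words` are hypothetical helpers, and dropping handles and URLs before counting is an assumption, not a decision made in this notebook. A library vectoriser would likely replace this in the real pipeline.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, drop URLs and @handles, then split into word tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"@\w+", " ", text)
    return re.findall(r"[a-z']+", text)

def bag_of_words(texts):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in texts[i]."""
    token_lists = [tokenize(t) for t in texts]
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    index = {tok: j for j, tok in enumerate(vocabulary)}
    rows = []
    for toks in token_lists:
        row = [0] * len(vocabulary)
        for tok, n in Counter(toks).items():
            row[index[tok]] = n
        rows.append(row)
    return vocabulary, rows

vocab, matrix = bag_of_words([
    "@VirginAmerica and it's a really big bad thing about it",
    "@VirginAmerica it was amazing, and arrived an hour early.",
])
print(vocab)
print(matrix)
```

Each row has one entry per vocabulary term, so the rows can be used directly as numeric feature vectors.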


## Step 4: Clustering

In this step, we cluster the tweets using the extracted features.

In [None]:
#Perform clustering based on features

#use "extracted_features" dataframe
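
The clustering method has not been chosen yet. As one illustration, the sketch below implements plain k-means in pure Python. It assumes the features have already been converted into equal-length numeric vectors (the list-valued `handles` and `emojis` columns would need encoding first), and it seeds the centroids with the first `k` points, which is deterministic but crude. The function name `kmeans` and the toy data are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length numeric vectors.

    Assumption: the first k points are used as the initial centroids.
    """
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Toy 2-D vectors standing in for real feature rows
toy = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
labels, centres = kmeans(toy, k=2)
print(labels)
# [0, 1, 0, 1]
```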

Next, we check the accuracy of the clustering against the dataset's `airline_sentiment` column.

In [None]:
#Check accuracy of clustering against dataset "airline_sentiment" column
#Check whether the accuracy correlates with the "sentiment_confidence" column,
# i.e. misclassified tweets should tend to have low confidence scores
# and correctly classified tweets should tend to have high confidence
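
The check described in these comments could be sketched as follows. Clusters carry no sentiment names, so this version assumes each cluster is assigned the majority label of its members, which means two clusters can end up with the same label. The helper `evaluate_clusters` and all toy values are hypothetical.

```python
from collections import Counter, defaultdict

def _mean(values):
    """Mean of a list, or NaN when the list is empty."""
    return sum(values) / len(values) if values else float("nan")

def evaluate_clusters(cluster_ids, labels, confidences):
    """Map each cluster to its majority label and compare against the true labels."""
    by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, labels):
        by_cluster[cid].append(label)
    majority = {cid: Counter(members).most_common(1)[0][0]
                for cid, members in by_cluster.items()}

    # Split the confidence scores by whether the tweet matched its cluster's label
    correct_conf, wrong_conf = [], []
    for cid, label, conf in zip(cluster_ids, labels, confidences):
        (correct_conf if majority[cid] == label else wrong_conf).append(conf)

    total = len(correct_conf) + len(wrong_conf)
    return {
        "accuracy": len(correct_conf) / total if total else float("nan"),
        "mean_conf_correct": _mean(correct_conf),
        "mean_conf_wrong": _mean(wrong_conf),
    }

# Made-up values for illustration only
result = evaluate_clusters(
    cluster_ids=[0, 0, 0, 1, 1],
    labels=["negative", "negative", "positive", "positive", "positive"],
    confidences=[1.0, 1.0, 0.35, 0.67, 1.0],
)
print(result)
```

If the hypothesis holds, `mean_conf_wrong` should come out noticeably lower than `mean_conf_correct` on the real data.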

## Step 5: Something special

From the literature review, we found that a single tweet can express both positive and negative sentiment about different aspects of the experience. We therefore investigate how often this occurs in the dataset, and whether it explains some of the misclassified samples.
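
One simple way to flag candidate mixed-sentiment tweets is to split each tweet into clauses and score every clause separately. The sketch below is only a starting point under strong assumptions: the two word lists are tiny, hypothetical lexicons, the clause boundaries (contrastive words and sentence punctuation) are a guess, and negation is ignored. The names `clause_polarities` and `has_mixed_sentiment` are made up for this example, as is the first test sentence.

```python
import re

# Hypothetical, deliberately tiny lexicons; a real analysis needs a proper sentiment resource
POSITIVE = {"amazing", "good", "great", "early", "love"}
NEGATIVE = {"bad", "late", "rude", "lost", "tacky"}
# Assumed clause boundaries: contrastive words and sentence punctuation
CLAUSE_SPLIT = re.compile(r"\bbut\b|\bhowever\b|\balthough\b|[.;!?]+")

def clause_polarities(text):
    """Return one polarity per clause: 'positive', 'negative', 'mixed' or 'neutral'."""
    polarities = []
    for clause in CLAUSE_SPLIT.split(text.lower()):
        words = set(re.findall(r"[a-z']+", clause))
        if not words:
            continue
        has_pos = bool(words & POSITIVE)
        has_neg = bool(words & NEGATIVE)
        if has_pos and has_neg:
            polarities.append("mixed")
        elif has_pos:
            polarities.append("positive")
        elif has_neg:
            polarities.append("negative")
        else:
            polarities.append("neutral")
    return polarities

def has_mixed_sentiment(text):
    """True if the tweet contains both positive and negative evidence."""
    found = set(clause_polarities(text))
    return "mixed" in found or {"positive", "negative"} <= found

print(has_mixed_sentiment("The crew was great but my bag was lost."))  # hypothetical tweet
print(has_mixed_sentiment("@VirginAmerica it was amazing, and arrived an hour early."))
```

Tweets flagged this way could then be cross-checked against the misclassified samples and their confidence scores.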