This notebook removes duplicate tweets as part of the preprocessing pipeline for both the traditional machine learning (ML) models and DistilBERT. Singh & Kumar (2023) demonstrated the strong performance of DistilBERT for Twitter sentiment classification, but they did not explore how data preprocessing, particularly duplicate removal, might influence model outcomes.

**Addressing Limitations:**

To address this gap, controlled experiments were conducted across the traditional ML and DistilBERT pipelines using dataset variants with and without duplicates. These experiments examine the effect of redundancy on model performance.

**Reference:**

Singh, A., & Kumar, S. (2023). A comparison of machine learning algorithms and transformer-based methods for multiclass sentiment analysis on Twitter. *2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT)*, 1-9. https://doi.org/10.1109/icccnt56998.2023.10306507

In [None]:
import pandas as pd
from google.colab import drive

---
# **1. Load and Inspect the Dataset**

In [None]:
# Load dataset
drive.mount("/content/drive")
data_path = "/content/drive/My Drive/IT1244_Team1_Project/Model & Dataset/dataset.csv"
columns = ["label", "tweet"]
df = pd.read_csv(data_path, header=None, names=columns)  # the raw CSV has no header row

# Check dataset information
print(df.info(), "\n")
print(df.head(), "\n")
print(df["label"].value_counts())  # Check class distribution

Mounted at /content/drive
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   label   100000 non-null  int64 
 1   tweet   100000 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.5+ MB
None 

   label                                              tweet
0      0  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1      0  is upset that he can't update his Facebook by ...
2      0  @Kenichan I dived many times for the ball. Man...
3      0    my whole body feels itchy and like its on fire 
4      0  @nationwideclass no, it's not behaving at all.... 

label
0    50000
1    50000
Name: count, dtype: int64


---
# **2. Exploratory Data Analysis & Removing Duplicates**

## Duplicated Tweets Analysis

In [None]:
# Detect conflicting tweets (same tweet text with different labels)
label_counts = df.groupby('tweet')['label'].nunique()
conflicting_tweets = label_counts[label_counts > 1].index

# DataFrame: conflicting tweets
conflict_df = df[df['tweet'].isin(conflicting_tweets)].sort_values('tweet')
num_conflict = conflict_df.shape[0]
print(f"Conflicting tweets (same text, different labels): {num_conflict}")
display(conflict_df)

# Remove all rows with conflicting tweets
df_no_conflict = df[~df['tweet'].isin(conflicting_tweets)].copy()
after_conflict_removal = df_no_conflict.shape[0]

Conflicting tweets (same text, different labels): 273


index,label,tweet
53176,1,"3 days leave then Easter, no work for a week, ..."
2252,0,"3 days leave then Easter, no work for a week, ..."
48314,0,@ work
99586,1,@ work
33103,0,@ work
...,...,...
74503,1,went to the mall and had dinner out yay tori-...
61258,1,woken up v early by a big pair of brown eyes. ...
7959,0,woken up v early by a big pair of brown eyes. ...
2375,0,yay no work todayyy but working for the rest...
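
The conflict check above can be illustrated on a small, made-up example. This is only a sketch: the `toy` frame and its rows are hypothetical and not taken from the project dataset. It shows that `groupby(...).nunique()` counts the distinct labels attached to each tweet text, and that every copy of a conflicting text is dropped, presumably because neither label can be trusted.

```python
import pandas as pd

# Hypothetical toy data, not taken from the project dataset
toy = pd.DataFrame({
    "label": [0, 1, 0, 0, 1],
    "tweet": ["@ work", "@ work", "good morning", "good morning", "yay"],
})

# Number of distinct labels attached to each tweet text
label_counts = toy.groupby("tweet")["label"].nunique()

# A text is conflicting when it appears with more than one label
conflicting = label_counts[label_counts > 1].index

# Drop every copy of a conflicting text, as the cell above does
toy_no_conflict = toy[~toy["tweet"].isin(conflicting)]

print(list(conflicting))
print(len(toy_no_conflict))
```

Note that "good morning" is a same-label duplicate, so it survives this step and is handled by the next one.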


In [None]:
# Detect remaining duplicates (same tweet and same label)
dupe_same_label_df = df_no_conflict[df_no_conflict.duplicated(subset=['tweet'], keep=False)]
num_dupe_same_label = dupe_same_label_df.shape[0]
print(f"\nDuplicate tweets with same label (non-conflicting): {num_dupe_same_label}")
display(dupe_same_label_df.sort_values('tweet'))

# Remove duplicate tweets (keep only one per tweet)
df_cleaned = df_no_conflict.drop_duplicates(subset=['tweet']).copy()
after_full_clean = df_cleaned.shape[0]


Duplicate tweets with same label (non-conflicting): 941


index,label,tweet
88987,1,I'm smiling because I lost! I lost 24 lbs. so...
95660,1,I'm smiling because I lost! I lost 24 lbs. so...
30108,0,(@SHUTUP) i failed to celebrate my 1337th scro...
30132,0,(@SHUTUP) i failed to celebrate my 1337th scro...
87182,1,+ today was awsome.been outside with friends. ...
...,...,...
10080,0,where did the sun go?
20525,0,where did the sun go?
36570,0,wowwww. today is literally the craziest day ev...
34494,0,wowwww. today is literally the craziest day ev...
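
The cell above uses two different pandas calls, and they count different things. A minimal sketch on hypothetical toy data (the `toy` frame is an illustration, not project data): `duplicated(keep=False)` flags every copy of a repeated text, while `drop_duplicates` keeps the first copy of each text and removes only the surplus rows. This is why the number of flagged rows is larger than the number of rows actually removed.

```python
import pandas as pd

# Hypothetical toy data: one text repeated 3 times, one repeated twice, one unique
toy = pd.DataFrame({
    "label": [0, 0, 0, 1, 1, 1],
    "tweet": ["back to work", "back to work", "back to work",
              "good morning", "good morning", "hello"],
})

# keep=False marks ALL copies of a repeated text, including the first one
flagged = toy[toy.duplicated(subset=["tweet"], keep=False)]

# drop_duplicates keeps the first copy of each text by default
deduped = toy.drop_duplicates(subset=["tweet"])

n_flagged = len(flagged)                 # rows that belong to a repeated text
n_groups = flagged["tweet"].nunique()    # distinct repeated texts
n_removed = len(toy) - len(deduped)      # rows actually dropped

print(n_flagged, n_groups, n_removed)
```

In general, rows removed = rows flagged minus the number of distinct repeated texts.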


In [None]:
# Analyze repetition frequency of duplicate tweets (same label)
dupe_counts = dupe_same_label_df['tweet'].value_counts()

print("\nRepetition Frequency of Duplicate Tweets (Same Label):")
print(dupe_counts.describe())

print("\nTop 10 Most Frequently Repeated Tweets:")
display(dupe_counts.head(10).to_frame(name='Occurrences'))


Repetition Frequency of Duplicate Tweets (Same Label):
count    309.000000
mean       3.045307
std        2.166275
min        2.000000
25%        2.000000
50%        2.000000
75%        3.000000
max       23.000000
Name: count, dtype: float64

Top 10 Most Frequently Repeated Tweets:


tweet,Occurrences
Vote your opinion on Susan Boyle! http://tinyurl.com/SusanBoylePoll,23
@chromachris Clean Me!,13
@reatlas Clean Me!,13
@tweetchild Clean Me!,12
@adlantis Clean Me!,9
Good morning everyone,8
back to work,8
@bridgettegreen Clean Me!,8
good morning,8
@mikead Clean Me!,8


In [None]:
# Summary Stats
original_size = df.shape[0]
removed_conflict = num_conflict
removed_dupes_same_label = after_conflict_removal - after_full_clean
total_removed = removed_conflict + removed_dupes_same_label
percent_removed = (total_removed / original_size) * 100

print("Cleaning Summary:")
print(f"Original dataset size                       : {original_size}")
print(f"Removed tweets with conflicting labels      : {removed_conflict}")
print(f"Removed duplicate tweets with same label    : {removed_dupes_same_label}")
print(f"Final cleaned dataset size                  : {after_full_clean}")
print(f"Total tweets removed                        : {total_removed}")
print(f"Percentage of tweets removed                : {percent_removed:.3f}%")

Cleaning Summary:
Original dataset size                       : 100000
Removed tweets with conflicting labels      : 273
Removed duplicate tweets with same label    : 632
Final cleaned dataset size                  : 99095
Total tweets removed                        : 905
Percentage of tweets removed                : 0.905%
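
As a sanity check, the figures in the summary can be reconciled with the counts printed earlier in this notebook. The sketch below hard-codes those reported numbers (100000 rows, 273 conflicting rows, 941 flagged same-label rows, 309 distinct repeated texts); the variable names are chosen here for illustration.

```python
# Counts copied from the outputs reported in this notebook
original_size = 100_000
conflict_rows = 273          # rows whose text appears with both labels
flagged_same_label = 941     # rows flagged by duplicated(keep=False)
distinct_repeated = 309      # distinct texts among those flagged rows

# One copy of each repeated text is kept, so only the surplus rows are removed
removed_same_label = flagged_same_label - distinct_repeated
total_removed = conflict_rows + removed_same_label
final_size = original_size - total_removed
percent_removed = 100 * total_removed / original_size

print(removed_same_label, total_removed, final_size, round(percent_removed, 3))
```

The derived values agree with the cleaning summary: 632 same-label duplicates removed, 905 rows removed in total, and 99095 rows remaining.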


In [None]:
# Check dataset information
print(df_cleaned.info(), "\n")
print(df_cleaned.head(), "\n")
print(df_cleaned["label"].value_counts())  # Check class distribution

<class 'pandas.core.frame.DataFrame'>
Index: 99095 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   99095 non-null  int64 
 1   tweet   99095 non-null  object
dtypes: int64(1), object(1)
memory usage: 2.3+ MB
None 

   label                                              tweet
0      0  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1      0  is upset that he can't update his Facebook by ...
2      0  @Kenichan I dived many times for the ball. Man...
3      0    my whole body feels itchy and like its on fire 
4      0  @nationwideclass no, it's not behaving at all.... 

label
1    49683
0    49412
Name: count, dtype: int64


In [None]:
# Save the cleaned data to a new CSV file
cleaned_file_path = '/content/drive/My Drive/IT1244_Team1_Project/Model & Dataset/removed_dups_dataset.csv'
df_cleaned.to_csv(cleaned_file_path,index=False)
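
One detail worth noting for downstream notebooks: because the file is written with `index=False` and default settings, it contains a header row, unlike the raw dataset that was loaded with `header=None`. The sketch below demonstrates this on a hypothetical toy frame written to a temporary directory (the file name `toy_cleaned.csv` is made up for the example).

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for df_cleaned
toy = pd.DataFrame({"label": [0, 1], "tweet": ["back to work", "good morning"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_cleaned.csv")
    toy.to_csv(path, index=False)  # writes a "label,tweet" header row

    # Default read: the first line is treated as the header
    reloaded = pd.read_csv(path)

    # Reading as if there were no header turns the header into a data row
    misread = pd.read_csv(path, header=None, names=["label", "tweet"])

print(len(reloaded), len(misread))
```

Reading the cleaned file with the default `pd.read_csv(path)` recovers the original rows, whereas reusing the `header=None` call from the loading cell would add a spurious first row.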