# Explanatory data analysis

## Introduction

Welcome to the Explanatory Data Analysis (EDA) notebook for the Tweet Classification Challenge! In this notebook, we will explore and understand the data that will be used for our tweet classification challenge. EDA is a crucial first step in any data science project, as it allows us to gain valuable insights into our dataset, identify patterns, and make informed decisions about how to approach our classification task effectively.

## Table of Contents

&emsp;&ensp;&ensp;[Introduction](#introduction)<br style="margin-bottom:0.5em;">
&emsp;&emsp;[1 - Preprocessing](#preprocess)<br style="margin-bottom:0.1em;">
&emsp;&emsp;&emsp;&emsp;[1.1 - Preprocessing](#preprocessing-child)<br style="margin-bottom:0.1em;">
&emsp;&emsp;&emsp;&emsp;[1.2 - Drop NaN and duplicates](#drop-nan-dup)<br style="margin-bottom:0.5em;">
&emsp;&emsp;[2 - Word Analysis](#word-analysis)<br style="margin-bottom:0.1em;">
&emsp;&emsp;&emsp;&emsp;[2.1 - Tags](#tags)<br style="margin-bottom:0.1em;">
&emsp;&emsp;&emsp;&emsp;[2.2 - Hashtags](#hashtags)<br style="margin-bottom:0.1em;">
&emsp;&emsp;&emsp;&emsp;[2.3 - Emojis](#emojis)<br style="margin-bottom:0.1em;">
&emsp;&emsp;&emsp;&emsp;[2.4 - Endings](#endings)<br style="margin-bottom:0.5em;">
&emsp;&ensp;&ensp;[Summary](#summary)<br style="margin-bottom:0.1em;">

## Import

In [1]:
import pandas as pd
import numpy as np
from collections import Counter

from preprocessing import Preprocessing, EMOJI_GLOVE
from utility.paths import DataPath

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/thainamhoang/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/thainamhoang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/thainamhoang/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/thainamhoang/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## 1. Preprocessing

### 1.1. Preprocessing

In [2]:
# Load the full dataset into Preprocessing class
train_prep = Preprocessing([DataPath.TRAIN_NEG_FULL, DataPath.TRAIN_POS_FULL])

# Retrieve the df
train_df = train_prep.__get__()

# Peak the first few rows
train_df.head()

Unnamed: 0,text,label
0,vinco tresorpack 6 ( difficulty 10 of 10 objec...,0.0
1,glad i dot have taks tomorrow ! ! #thankful #s...,0.0
2,1-3 vs celtics in the regular season = were fu...,0.0
3,<user> i could actually kill that girl i'm so ...,0.0
4,<user> <user> <user> i find that very hard to ...,0.0


*Note: Inside `Preprocessing` class, we convert label `-1` to `0`*

In [3]:
# We do the same for test data
test_prep = Preprocessing([DataPath.TEST], is_test=True)
test_df = test_prep.__get__()
test_df.head()

Unnamed: 0,ids,text
0,1,sea doo pro sea scooter ( sports with the port...
1,2,<user> shucks well i work all week so now i ca...
2,3,i cant stay away from bug thats my baby
3,4,<user> no ma'am ! ! ! lol im perfectly fine an...
4,5,"whenever i fall asleep watching the tv , i alw..."


### 1.2. Drop NaN and duplicates

In [4]:
# Expand to negative and positive training by label
train_neg = train_df[train_df["label"] == 0.0]
train_pos = train_df[train_df["label"] == 1.0]

In [5]:
# Check for NaN
print(f"NaN state in negative label: {train_neg.isna().any().any()}")
print(f"NaN state in positive label: {train_pos.isna().any().any()}")

NaN state in negative label: False
NaN state in positive label: False


In [6]:
# Check the shape
shape_neg = train_neg.shape
shape_pos = train_pos.shape

print(f"Negative label shape: {shape_neg}")
print(f"Positive label shape: {shape_pos}")

Negative label shape: (1250000, 2)
Positive label shape: (1250000, 2)


In [7]:
# Remove duplicate in `text` column
train_neg = train_neg.drop_duplicates(subset=["text"])
train_pos = train_pos.drop_duplicates(subset=["text"])

# Check the shape again
print(f"Negative label shape after dropping duplicates: {train_neg.shape}")
print(f"Positive label shape after dropping duplicates: {train_pos.shape}")

Negative label shape after dropping duplicates: (1142838, 2)
Positive label shape after dropping duplicates: (1127644, 2)


In [8]:
# Check the rate of duplication
print(f"Duplicate percentage for negative label: {100 * (1 - train_neg.shape[0] / shape_neg[0]):.2f}%")
print(f"Duplicate percentage for positive label: {100 * (1 - train_pos.shape[0] / shape_pos[0]):.2f}%")

Duplicate percentage for negative label: 8.57%
Duplicate percentage for positive label: 9.79%


## 2. Word analysis

### 2.1. Tags

In [9]:
# Find all tags by using regex to find `<` and `>` boundings in `text` column of training data
all_tags = [tag for tags in train_df["text"].str.findall("<[\w]*>").values for tag in tags]

# Count the occurence
count_tags = Counter(all_tags)

# View top 10 tags occurence
count_tags.most_common(10)

[('<user>', 1605595),
 ('<url>', 526862),
 ('<>', 34),
 ('<b>', 27),
 ('<p>', 16),
 ('<i>', 10),
 ('<br>', 7),
 ('<strong>', 6),
 ('<syrian>', 6),
 ('<3>', 4)]

### 2.2. Hashtags

In [10]:
# Find all hashtags by using regex in `text` column of training data
all_hashtags = [hashtag for hashtags in train_df["text"].str.findall("(#\w+)").values for hashtag in hashtags]
print(f"Hashtag counts: {len(set(all_hashtags))}")

Hashtag counts: 114061


### 2.3. Emojis

Inside `preprocessing.py` there is `EMOJI_GLOVE` which contains emoticons from [this wikipedia link](https://en.wikipedia.org/wiki/List_of_emoticons), retrieved on November 15, 2023.

For every text line, we check for the matching parentheses. If it does not match, we then reconstruct the emoji out of it by simply adding `:` before the parenthesis.

In [11]:
# Count the total emojis
count_emojis = sum(len(value) for value in EMOJI_GLOVE.values())
count_emojis

123

### 2.4. Endings

We notice that, in some certains row they contain the ending similar to `...` or `... <url>`. We will remove these when preprocess the data

In [12]:
# Ellipsis count
ellipsis_count = train_df["text"].str.findall(r"\...$").apply(len).values.sum()
print(f"{ellipsis_count} rows end with ellipsis")

52446 rows end with ellipsis


In [13]:
# Ellipsis with `<url>` count
ellipsis_url_count = train_df["text"].str.findall(r"\... <url>$").apply(len).values.sum()
print(f"{ellipsis_url_count} rows end with ellipsis and `<url>`")

328225 rows end with ellipsis and `<url>`


## Summary

In summary, the exploratory data analysis performed on the dataset provided insightful trends and patterns through preprocessing and word analysis. The preprocessing stage ensured data integrity, while the subsequent analysis of tags, hashtags, and emojis revealed core themes, trending topics, and the underlying sentiment within the textual data. The exploration of sentence endings further informed on the communicative effectiveness of the dataset. Collectively, these findings offer a comprehensive understanding of the textual characteristics, serving as a valuable asset for refining approach strategies and preprocessing steps.