# Data Exploration

In this notebook, we explore a dataset of Malay and English tweets to understand its characteristics, its label distribution, and any issues that need to be addressed during preprocessing.

In [1]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_theme(style='whitegrid')

In [2]:
# Load the dataset
data_path = '../data/raw/semisupervised-bert-xlnet.csv'
tweets_df = pd.read_csv(data_path)

# Display the first few rows of the dataset
tweets_df.head()

In [3]:
# Check the shape of the dataset
tweets_df.shape

In [4]:
# Check for missing values
tweets_df.isnull().sum()
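
Raw missing-value counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of such a summary; the helper name `missing_summary` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        # Percentage of rows with a missing value in each column
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)
```

Calling `missing_summary(tweets_df)` in a cell would display the table for the loaded dataset.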

In [5]:
# Distribution of labels
plt.figure(figsize=(10, 6))
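# Assumes seaborn >= 0.13, where `palette` without `hue` is deprecated;
# mapping hue to the same column keeps the colors and avoids the warning.
sns.countplot(data=tweets_df, x='label', hue='label', palette='viridis', legend=False)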
plt.title('Distribution of Sentiment Labels')
plt.xlabel('Sentiment Label')
plt.ylabel('Count')
plt.show()
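
The count plot shows the imbalance visually, but exact counts and proportions are useful when deciding on resampling or class weights later. A small sketch, assuming only the `label` column used above; the helper name `label_balance` is hypothetical.

```python
import pandas as pd


def label_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and share of each label, most frequent first."""
    # dropna=False keeps missing labels visible as their own row
    counts = labels.value_counts(dropna=False)
    return pd.DataFrame({
        "count": counts,
        "share": (counts / counts.sum()).round(3),
    })
```

For this dataset, the call would be `label_balance(tweets_df['label'])`.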

In [6]:
# Display the distribution of probabilities
plt.figure(figsize=(10, 6))
sns.histplot(tweets_df['prob'], bins=30, kde=True)
plt.title('Distribution of Prediction Probabilities')
plt.xlabel('Probability')
plt.ylabel('Frequency')
plt.show()
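
The histogram pools all labels together, so it can hide a label whose probabilities are systematically lower. The sketch below summarizes `prob` per label and reports the share of rows under a cutoff. It assumes `prob` is a confidence score for the assigned label, which the notebook does not state; the helper name `prob_by_label` and the default cutoff of 0.9 are illustrative choices, not values from the source.

```python
import pandas as pd


def prob_by_label(
    df: pd.DataFrame,
    label_col: str = "label",
    prob_col: str = "prob",
    threshold: float = 0.9,  # hypothetical cutoff, chosen only for illustration
) -> pd.DataFrame:
    """Summarize probabilities per label and the share of rows below a cutoff."""
    summary = df.groupby(label_col)[prob_col].agg(["count", "mean", "median", "min"])
    # Mean of a boolean mask is the fraction of rows below the threshold
    below = df[prob_col].lt(threshold).groupby(df[label_col]).mean()
    summary["share_below_threshold"] = below
    return summary
```

Running `prob_by_label(tweets_df)` would show whether low-probability rows are concentrated in one label.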

In [7]:
# Count rows that exactly duplicate an earlier row (all columns equal)
duplicates = tweets_df.duplicated().sum()
duplicates
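
`duplicated()` only flags rows that match in every column, so tweets that differ only in case or spacing are not counted. The sketch below counts duplicates on a lightly normalized text column. The name of the text column is not shown in this notebook, so it is passed in as the `text_col` argument; the helper name `count_text_duplicates` is hypothetical.

```python
import pandas as pd


def count_text_duplicates(df: pd.DataFrame, text_col: str) -> int:
    """Count rows whose text repeats an earlier row after light normalization."""
    normalized = (
        df[text_col]
        .astype("string")
        .str.lower()                            # ignore case differences
        .str.replace(r"\s+", " ", regex=True)   # collapse runs of whitespace
        .str.strip()
    )
    # Missing texts are dropped so they are not counted as duplicates of each other
    return int(normalized.dropna().duplicated().sum())
```

If this count is noticeably higher than the full-row count, the near-duplicates are worth handling during preprocessing.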

## Summary

In this notebook, we checked the dataset's shape, counted missing values and duplicate rows, and visualized the distributions of sentiment labels and prediction probabilities.

These findings inform the data cleaning and preprocessing steps in the next notebook.