# 📊 Test Exploratory Data Analysis — Social Media Sentiment Insights

**Objective:**  
Understand the dataset structure, detect missing values, and explore distributions and early patterns to guide future modeling or the presentation of insights.

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Visual settings
pd.set_option("display.max_colwidth", 120)
plt.style.use("default")

print("✅ Libraries loaded correctly.")

## 📥 1. Load Dataset & Initial Overview

We'll load the dataset and quickly explore its structure to understand the number of rows, columns, and get a glimpse of the data.

In [None]:
# Load dataset
df = pd.read_csv("../data/sentimentdataset.csv")
print("✅ Dataset loaded. Shape:", df.shape)

# Quick peek
df.head(30)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
df.nunique()

In [None]:
df.sort_values(by="Likes", ascending=False).head(10)

In [None]:
df.sort_values(by="Retweets", ascending=False).head(10)

In [None]:
# Correlation matrix of the numeric columns, rounded and color-coded
df_corr = df.corr(numeric_only=True).round(2)
df_corr.style.format("{:.2f}").background_gradient(cmap="coolwarm")

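**Observations:**
- The `Unnamed` columns (most likely leftover index columns from an earlier CSV export) carry no information and should be removed.
- Likes and Retweets appear to be almost perfectly correlated.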
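A minimal sketch of how such columns could be dropped, assuming they are leftover index columns whose names start with `Unnamed`. The toy frame and its values are made up for illustration; on the real data the same filter would be applied to `df`.

```python
import pandas as pd

# Toy frame with made-up values; "Unnamed: 0" mimics a leftover CSV index column.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Likes": [30, 10, 20],
    "Retweets": [15, 5, 10],
})

# Keep only the columns whose name does not start with "Unnamed".
cleaned = toy.loc[:, ~toy.columns.str.startswith("Unnamed")]
print(cleaned.columns.tolist())
```

Filtering by name prefix avoids hard-coding the exact column names, which can vary depending on how often the file was re-exported.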

In [None]:
df_corr = df.corr(numeric_only=True).round(2)
df_corr.style.format("{:.2f}").background_gradient(cmap="magma")

When applying a background gradient in pandas or Matplotlib, the `cmap` argument (colormap) specifies the color scheme. `coolwarm` is a diverging colormap ranging from cool blues to warm reds, but many other options exist, each with a distinct aesthetic and purpose.

Colormap categories and examples:

1. **Sequential colormaps** show a progression of values, typically from low to high, using a single hue or a gradual change in lightness or saturation.
   Examples: `viridis`, `plasma`, `inferno`, `magma`, `gray`, `Blues`, `Greens`, `Reds`, `Purples`, `Oranges`.
2. **Diverging colormaps**, like `coolwarm`, emphasize a central neutral value and diverge to two distinct colors at the extremes. They suit deviations from a mean or a zero point.
   Examples: `seismic`, `RdBu`, `PiYG`, `PRGn`, `BrBG`, `bwr`.
3. **Qualitative colormaps** distinguish between discrete categories or groups using distinct, easily differentiable colors.
   Examples: `tab10`, `tab20`, `Paired`, `Set1`, `Set2`, `Dark2`.
4. **Cyclic colormaps** are useful for data that wraps around, such as angles or phases, where the start and end colors are the same or very similar.
   Examples: `twilight`, `hsv`.

To choose the best colormap, consider:

- **Data type:** Is your data sequential, diverging, or categorical?
- **Clarity:** Does the colormap convey the information without causing misinterpretation?
- **Accessibility:** Is the colormap colorblind-friendly? (`viridis` is often recommended for this.)
- **Aesthetics:** Does the colormap align with the overall design and purpose of your visualization?

In [None]:
df['text_len'] = df['Text'].str.len()
df['word_count'] = df['Text'].str.split().str.len()
df['hashtag_count'] = df['Hashtags'].str.count('#')
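# Rough proxy for emojis/symbols: counts every character that is not a word
# character, whitespace, or comma, so ordinary punctuation is counted too
df['emoji_count'] = df['Text'].str.count(r'[^\w\s,]')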

df[['text_len', 'word_count', 'hashtag_count', 'emoji_count', 'Likes', 'Retweets']].corr()

In [None]:
# Set the figure size before drawing so that it applies to this heatmap
plt.figure(figsize=(20, 7))
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()

In [None]:
df_groupby = df.groupby('Platform')[['Likes', 'Retweets']].mean().sort_values(by='Likes', ascending=False)
df_groupby.style.format("{:.1f}")

In [None]:
df['Country'].value_counts()


The string values might contain empty or whitespace-padded entries, which distorts the `Country` counts.
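One way to check for this is to compare the category counts before and after normalizing the strings. The sketch below uses a toy column with made-up values and assumes the problem is stray whitespace or empty strings; on the real data the same steps would be applied to `df['Country']`.

```python
import pandas as pd

# Toy column with made-up values: padded duplicates and one empty string.
toy = pd.DataFrame({"Country": ["USA", " USA  ", "Canada ", "Canada", ""]})

# Before cleaning, every padded variant counts as its own category.
unique_before = toy["Country"].nunique()

# Strip surrounding whitespace and treat empty strings as missing values.
country_clean = toy["Country"].str.strip().replace("", pd.NA)
counts = country_clean.value_counts()  # missing values are excluded by default

print(unique_before, "categories before,", len(counts), "after")
print(counts)
```

If the number of categories drops after stripping, the original counts were splitting the same country across several spellings.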

In [None]:
df['Country'].value_counts(normalize=True) * 100

In [None]:
df.groupby('Country')[['Likes', 'Retweets', 'emoji_count', 'word_count']].mean().sort_values(by='Likes', ascending=False)

In [None]:
df_groupby = df.groupby('Sentiment')[['Likes', 'Retweets']].mean().sort_values(by='Retweets', ascending=False)
df_groupby.style.format("{:.1f}")