<p style="font-family:Times New Roman; font-size:30px;font-weight:bold; color:purple;">
Social Media Viral Content & Engagement Analysis
</p>

<p style="font-family:Times New Roman; font-size:20px;">
This notebook contains the exploratory data analysis (EDA) for the final project. <br>
The goal is to explore engagement patterns across platforms and content types to support data-driven decisions for digital marketing stakeholders.
</p>

In [32]:
# Import libraries
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

%matplotlib inline

# Show every column when displaying DataFrames
pd.set_option("display.max_columns", None)

# Suppress all warnings to keep cell output readable
warnings.filterwarnings("ignore")

In [33]:
# Load the dataset (comma-separated, which is the pandas default)
df = pd.read_csv("../Data/social_media_viral_content_dataset.csv")

# Preview the first five rows
df.head()

,post_id,platform,content_type,topic,language,region,post_datetime,hashtags,views,likes,comments,shares,engagement_rate,sentiment_score,is_viral
0,SM_100000,Instagram,text,Sports,ur,UK,2024-12-10 00:00:00,#tech #funny #music,2319102,122058,15800,861,0.0598,0.464,1
1,SM_100001,Instagram,carousel,Sports,ur,Brazil,2024-10-13 00:00:00,#news #fyp #funny #ai #trending,2538464,110368,11289,54887,0.0695,-0.8,1
2,SM_100002,YouTube Shorts,video,Technology,ur,UK,2024-05-03 00:00:00,#ai #news,1051176,87598,47196,44132,0.1702,0.416,0
3,SM_100003,X,text,Politics,ur,US,2024-08-04 00:00:00,#ai #funny,5271440,329465,774,59736,0.074,0.877,1
4,SM_100004,YouTube Shorts,text,Education,es,US,2024-03-28 00:00:00,#news #ai #viral #funny #fyp,3186256,199141,5316,83105,0.0903,0.223,1


In [34]:
# Basic information: column names, non-null counts, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   post_id          2000 non-null   object 
 1   platform         2000 non-null   object 
 2   content_type     2000 non-null   object 
 3   topic            2000 non-null   object 
 4   language         2000 non-null   object 
 5   region           2000 non-null   object 
 6   post_datetime    2000 non-null   object 
 7   hashtags         2000 non-null   object 
 8   views            2000 non-null   int64  
 9   likes            2000 non-null   int64  
 10  comments         2000 non-null   int64  
 11  shares           2000 non-null   int64  
 12  engagement_rate  2000 non-null   float64
 13  sentiment_score  2000 non-null   float64
 14  is_viral         2000 non-null   int64  
dtypes: float64(2), int64(5), object(8)
memory usage: 234.5+ KB
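
<p style="font-family:Times New Roman; font-size:20px;">
The <code>df.info()</code> output lists <code>post_datetime</code> as <code>object</code>, which means the timestamps are stored as strings. Below is a minimal sketch of converting them, assuming the <code>YYYY-MM-DD HH:MM:SS</code> format seen in the preview holds for every row. The derived columns <code>post_month</code> and <code>post_weekday</code> are hypothetical names used for illustration, not part of the dataset.
</p>

```python
import pandas as pd

# Two post_datetime strings copied from the df.head() preview.
sample = pd.DataFrame(
    {
        "post_id": ["SM_100000", "SM_100001"],
        "post_datetime": ["2024-12-10 00:00:00", "2024-10-13 00:00:00"],
    }
)

# Parse the strings into datetime64 values. The explicit format is an
# assumption based on the rows shown in the preview.
sample["post_datetime"] = pd.to_datetime(
    sample["post_datetime"], format="%Y-%m-%d %H:%M:%S"
)

# Hypothetical derived columns for later time-based grouping.
sample["post_month"] = sample["post_datetime"].dt.month
sample["post_weekday"] = sample["post_datetime"].dt.day_name()

print(sample.dtypes)
```

<p style="font-family:Times New Roman; font-size:20px;">
On the full DataFrame the same <code>pd.to_datetime</code> call would be applied to <code>df["post_datetime"]</code>.
</p>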


In [35]:
# Descriptive statistics for the numeric columns
df.describe()

,views,likes,comments,shares,engagement_rate,sentiment_score,is_viral
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,4284860.0,245329.244,24786.929,49936.9815,0.157852,-0.000566,0.699
std,3246193.0,145032.423582,14433.288364,29012.818697,0.535457,0.574911,0.458807
min,4380.0,292.0,14.0,127.0,0.0357,-1.0,0.0
25%,1652742.0,118903.75,12337.75,25698.75,0.057975,-0.507,0.0
50%,3469408.0,239831.0,24519.5,50212.0,0.0845,0.001,1.0
75%,6348078.0,372323.5,37116.25,75433.0,0.142525,0.49525,1.0
max,14371790.0,499983.0,49989.0,99977.0,12.5732,0.999,1.0
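
<p style="font-family:Times New Roman; font-size:20px;">
In the summary above, the maximum <code>engagement_rate</code> (12.5732) is far larger than the 75th percentile (0.142525). Below is a small sketch of how that gap could be quantified using Tukey's 1.5 × IQR rule of thumb. The choice of rule is an assumption for illustration, and the inputs are the quartiles printed by <code>df.describe()</code>.
</p>

```python
# Quartiles and maximum of engagement_rate, copied from df.describe().
q1, q3, observed_max = 0.057975, 0.142525, 12.5732

# Interquartile range and the upper fence from Tukey's rule of thumb.
# Using this rule is an assumption, not a step from the original analysis.
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print(f"IQR = {iqr:.5f}, upper fence = {upper_fence:.5f}")
print("Maximum exceeds the fence:", observed_max > upper_fence)
```

<p style="font-family:Times New Roman; font-size:20px;">
On the full DataFrame, rows above the fence could be inspected with <code>df[df["engagement_rate"] > upper_fence]</code> before deciding whether they are genuine posts or data errors.
</p>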


In [36]:
# Dataset shape: (rows, columns)
df.shape

(2000, 15)


<p style="font-family:Times New Roman; font-size:24px;font-weight:bold; color:darkred;">
Initial Observations, Data Quality Check & Feature Engineering
</p>

<ul style="font-family:Times New Roman; font-size:20px; list-style-type:square;">
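  <li>The dataset contains 2,000 posts and 15 columns, including the engagement metrics views, likes, comments, and shares.</li>
  <li>Every column has 2,000 non-null values, so the data is structured and appears mostly clean; note that <code>post_datetime</code> is stored as text (<code>object</code>) rather than as a datetime.</li>
  <li>Likes, comments, and shares will be combined into a single engagement metric.</li>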
</ul>

In [None]:
# Number of unique values in each column
df.nunique()

post_id            2000
platform              4
content_type          4
topic                 6
language              5
region                5
post_datetime       366
hashtags           1229
views              2000
likes              1996
comments           1954
shares             1978
engagement_rate    1176
sentiment_score    1251
is_viral              2
dtype: int64

In [38]:
# Check for missing values
df.isnull().sum()

post_id            0
platform           0
content_type       0
topic              0
language           0
region             0
post_datetime      0
hashtags           0
views              0
likes              0
comments           0
shares             0
engagement_rate    0
sentiment_score    0
is_viral           0
dtype: int64

In [39]:
# Count fully duplicated rows
df.duplicated().sum()

np.int64(0)

<p style="font-family:Times New Roman; font-size:24px;font-weight:bold; color:darkred;">
Data Quality Summary
</p>

<ul style="font-family:Times New Roman; font-size:20px; list-style-type:square;">
  <li>The dataset does not contain any missing values.</li>
  <li>No duplicate rows were found.</li>
  <li>Overall, the dataset is clean and suitable for exploratory analysis.</li>
</ul>
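
<p style="font-family:Times New Roman; font-size:20px;">
The checks summarised above can be bundled into one reusable function so they are easy to re-run after any transformation. This is a minimal sketch: <code>quality_report</code> is a hypothetical helper, and the extra check for repeated <code>post_id</code> values is an assumption about what counts as a duplicate post.
</p>

```python
import pandas as pd


def quality_report(frame: pd.DataFrame, id_col: str = "post_id") -> dict:
    """Return basic data quality counts for a DataFrame (hypothetical helper)."""
    return {
        # Total number of missing cells across all columns
        "missing_values": int(frame.isnull().sum().sum()),
        # Rows that are exact copies of an earlier row
        "duplicate_rows": int(frame.duplicated().sum()),
        # Identifier values that appear more than once
        "duplicate_ids": int(frame[id_col].duplicated().sum()),
    }


# Three rows copied from the df.head() preview, reduced to a few columns.
sample = pd.DataFrame(
    {
        "post_id": ["SM_100000", "SM_100001", "SM_100002"],
        "platform": ["Instagram", "Instagram", "YouTube Shorts"],
        "views": [2319102, 2538464, 1051176],
    }
)

report = quality_report(sample)
print(report)
```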

In [None]:
# Feature Engineering - Creating a new feature 'total_engagement'
df["total_engagement"] = (
    df["likes"] +
    df["comments"] +
    df["shares"]
)

In [None]:
# Validate Feature Engineering
df[["likes", "comments", "shares", "total_engagement"]].head()

,likes,comments,shares,total_engagement
0,122058,15800,861,138719
1,110368,11289,54887,176544
2,87598,47196,44132,178926
3,329465,774,59736,389975
4,199141,5316,83105,287562


In [None]:
# Recompute engagement rate as total engagement per view (overwrites the existing column)
df["engagement_rate"] = df["total_engagement"] / df["views"]
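
<p style="font-family:Times New Roman; font-size:20px;">
Because this step overwrites a column that already exists in the dataset, it is worth checking how the recomputed values compare with the supplied ones. The sketch below does this for the five preview rows only. The supplied rates are printed to four decimals, so the comparison uses an absolute tolerance of 0.0001, which is an assumption about how they were rounded.
</p>

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(), reduced to the columns used here.
sample = pd.DataFrame(
    {
        "views": [2319102, 2538464, 1051176, 5271440, 3186256],
        "likes": [122058, 110368, 87598, 329465, 199141],
        "comments": [15800, 11289, 47196, 774, 5316],
        "shares": [861, 54887, 44132, 59736, 83105],
        "engagement_rate": [0.0598, 0.0695, 0.1702, 0.074, 0.0903],
    }
)

total_engagement = sample["likes"] + sample["comments"] + sample["shares"]
recomputed = total_engagement / sample["views"]

# Compare with a tolerance rather than exact equality, since the supplied
# values appear to be rounded to four decimals.
matches = np.isclose(recomputed, sample["engagement_rate"], atol=1e-4)
print(matches.all())
```

<p style="font-family:Times New Roman; font-size:20px;">
For these five rows the recomputed rate agrees with the supplied one within that tolerance. Whether the same holds for all 2,000 rows would need the same check on the full DataFrame.
</p>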


<p style="font-family:Times New Roman; font-size:24px;font-weight:bold; color:darkred;">
Feature Engineering
</p>

<ul style="font-family:Times New Roman; font-size:20px; list-style-type:square;">
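  <li>A new feature <code>total_engagement</code> was created by summing likes, comments, and shares.</li>
  <li>This metric represents overall user interaction with content.</li>
  <li>It will be used as the primary KPI throughout the analysis.</li>
  <li><code>engagement_rate</code> was recomputed as <code>total_engagement</code> divided by <code>views</code>, replacing the column supplied with the dataset.</li>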
</ul>
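
<p style="font-family:Times New Roman; font-size:20px;">
As an illustration of how <code>total_engagement</code> can serve as a KPI, the sketch below aggregates it by platform. It uses only the five preview rows, so the resulting ranking is not representative of the full dataset. Sorting by the median is an assumption, made here because engagement counts are often skewed.
</p>

```python
import pandas as pd

# The five rows shown by df.head(), with total_engagement already computed.
sample = pd.DataFrame(
    {
        "platform": [
            "Instagram",
            "Instagram",
            "YouTube Shorts",
            "X",
            "YouTube Shorts",
        ],
        "total_engagement": [138719, 176544, 178926, 389975, 287562],
    }
)

# Post count, median, and sum of the KPI per platform, highest median first.
kpi_by_platform = (
    sample.groupby("platform")["total_engagement"]
    .agg(["count", "median", "sum"])
    .sort_values("median", ascending=False)
)
print(kpi_by_platform)
```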

<p style="font-family:Times New Roman; font-size:24px;font-weight:bold; color:darkred;">
From Data to Business Questions
</p>

<p style="font-family:Times New Roman; font-size:20px;">
After preparing the data, the next step is to align the analysis with stakeholder needs.
Based on the responsibilities of the digital marketing team, we define key business questions
and formulate hypotheses to guide the exploratory analysis.</p>