<h1>Meme trends analysis</h1>
This project analyzes trends in meme data, visualizing the dataset and looking for correlations between its variables.

In [42]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import datetime as dt
from utils import download_dataset

memes = pd.read_csv(download_dataset.get_path())



Dataset URL: https://www.kaggle.com/datasets/podsyp/a-lot-of-memes-info-stats

<h2>Dataset preparation</h2>
We begin by removing impossible entries from the dataset, e.g. memes with a negative or missing number of photos.

In [43]:
print(f"Dataset size before dropping impossible entries: {len(memes)}")
# Parentheses around each comparison are required: & binds more tightly than >=
memes = memes[(memes["views"] >= 0) & memes["views"].notna()]
memes = memes[(memes["videos"] >= 0) & memes["videos"].notna()]
memes = memes[(memes["photos"] >= 0) & memes["photos"].notna()]
memes = memes[(memes["comments"] >= 0) & memes["comments"].notna()]
print(f"Dataset size after dropping impossible entries: {len(memes)}")

Dataset size before dropping impossible entries: 21453
Dataset size after dropping impossible entries: 21451
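
The four filters above share the same shape, so they could also be folded into a single mask. The following is a minimal sketch on an invented toy frame that reuses the notebook's column names; the helper name `drop_impossible_rows` is hypothetical and not part of the project.

```python
import pandas as pd

COUNT_COLUMNS = ["views", "videos", "photos", "comments"]


def drop_impossible_rows(df, columns=COUNT_COLUMNS):
    """Keep only rows whose count columns are all present and non-negative."""
    # NaN >= 0 evaluates to False, so rows with missing counts are dropped too.
    valid = (df[columns] >= 0).all(axis=1)
    return df[valid]


# Invented data: one valid row, one negative view count,
# one negative photo count and one missing view count.
toy = pd.DataFrame({
    "views": [10, -1, 5, None],
    "videos": [0, 0, 2, 1],
    "photos": [3, 1, -4, 2],
    "comments": [1, 1, 1, 1],
})
cleaned = drop_impossible_rows(toy)
print(f"{len(toy)} rows before, {len(cleaned)} rows after")
```

Building one boolean mask keeps the rule in a single place, which makes it harder to forget a column or to misplace a parenthesis.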


Next, we make some changes to the dataset which will be needed later.

In [44]:
memes["datetime_added"] = memes["date_added"].copy()  # keep the raw timestamp string; only run this once, because date_added is overwritten below

In [45]:
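# Local wall-clock time: the HH:MM:SS slice of the raw string, with the UTC offset ignored
memes["time_added_local"] = pd.to_datetime(memes["datetime_added"].str[-14:-6], format="%H:%M:%S").dt.time
# Parse the full timestamp once, converted to UTC, then split it into time and date
added_utc = pd.to_datetime(memes["datetime_added"], utc=True)
memes["time_added_utc"] = added_utc.dt.time
memes["date_added"] = added_utc.dt.date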
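
To make the difference between the local and UTC columns concrete, here is a small sketch on two invented timestamps. It assumes the raw strings are ISO 8601 values ending in a UTC offset such as `-05:00`, which is what the `[-14:-6]` slice suggests; the sample values are not taken from the dataset.

```python
import datetime as dt

import pandas as pd

# Invented timestamps in the assumed "YYYY-MM-DDTHH:MM:SS+HH:MM" layout.
sample = pd.Series(["2019-01-15T03:24:51-05:00", "2020-06-30T22:10:05+02:00"])

# Characters -14 to -6 hold HH:MM:SS as written, i.e. the poster's local clock time.
local_time = pd.to_datetime(sample.str[-14:-6], format="%H:%M:%S").dt.time

# utc=True applies each offset, so every value ends up on the same UTC clock.
parsed_utc = pd.to_datetime(sample, utc=True)
utc_time = parsed_utc.dt.time
utc_date = parsed_utc.dt.date

print(pd.DataFrame({"raw": sample, "local": local_time, "utc": utc_time, "date": utc_date}))
```

The local column is useful for questions about the time of day at which memes are added, while the UTC column puts all entries on one comparable timeline.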

In [46]:
memes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21451 entries, 0 to 21452
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   name              21451 non-null  object
 1   status            21451 non-null  object
 2   type              6588 non-null   object
 3   origin_year       21451 non-null  object
 4   origin_place      21438 non-null  object
 5   date_added        21451 non-null  object
 6   views             21451 non-null  int64 
 7   videos            21451 non-null  int64 
 8   photos            21451 non-null  int64 
 9   comments          21451 non-null  int64 
 10  tags              21448 non-null  object
 11  about             19209 non-null  object
 12  origin            11823 non-null  object
 13  other_text        16893 non-null  object
 14  datetime_added    21451 non-null  object
 15  time_added_local  21451 non-null  object
 16  time_added_utc    21451 non-null  object
dtypes: int64(4), object(13)

<h2>Numerical analysis</h2>

<h5>Views</h5>

In [47]:
# Single quotes inside the f-strings keep them valid on Python versions before 3.12
print(f"Total views generated by memes in the dataset: {memes['views'].sum()}")
print(f"Mean views: {memes['views'].mean()}")
print(f"Median views: {memes['views'].median()}")

Total views generated by memes in the dataset: 1634021082
Mean views: 76174.58775814647
Median views: 13011.0


<h5>Photos</h5>

In [48]:
print(f"Number of photos reported in the dataset: {memes['photos'].sum()}")
print(f"Mean number of photos: {memes['photos'].mean()}")
print(f"Median number of photos: {memes['photos'].median()}")

Number of photos reported in the dataset: 1327172
Mean number of photos: 61.86993613351359
Median number of photos: 8.0


<h5>Comments</h5>

In [49]:
print(f"Total comments generated by memes in the dataset: {memes['comments'].sum()}")
print(f"Mean comments: {memes['comments'].mean()}")
print(f"Median comments: {memes['comments'].median()}")

Total comments generated by memes in the dataset: 915904
Mean comments: 42.697496620204184
Median comments: 13.0


For views, photos and comments alike, the mean is several times larger than the median. This points to a strongly right-skewed distribution: most memes generate relatively low engagement, while a small number of very popular memes pull the mean up. Based on this, we will later analyze the more "viral" memes separately rather than all memes at once.
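
The effect of a few very popular entries on the mean can be illustrated on invented numbers. The sketch below is not the notebook's later analysis; in particular, treating everything above the 90th percentile as "viral" is an assumed cutoff chosen only for illustration.

```python
import pandas as pd

# Invented view counts: many small values and one very large one.
toy_views = pd.Series([120, 300, 450, 800, 1_000, 1_500, 2_000, 3_000, 5_000, 900_000])

mean_views = toy_views.mean()
median_views = toy_views.median()
skewness = toy_views.skew()  # positive for a right-skewed distribution

# Assumed cutoff: entries above the 90th percentile count as "viral".
viral_threshold = toy_views.quantile(0.9)
viral = toy_views[toy_views > viral_threshold]

print(f"mean={mean_views}, median={median_views}, skewness={skewness:.2f}")
print(f"viral entries: {viral.tolist()}")
```

A single outlier is enough to push the mean far above the median, which is why the median is the more representative "typical" value for data like this.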

<h2>Univariate analysis</h2>

In [50]:
px.histogram(memes, x="date_added")

In [51]:
px.pie(memes, names="status")

In [52]:
# Restrict the data to memes that originated on a few large platforms
memes_big_platforms = memes[memes["origin_place"].isin(["Twitter", "Facebook", "Instagram", "Reddit",
                                                        "TikTok", "YouTube", "iFunny", "Tumblr"])]
# count() returns the number of non-null values per column, so the "views" count is the number of memes
memes_big_platforms_counts = memes_big_platforms.groupby("origin_place", as_index=False).count()
memes_big_platforms_sums = memes_big_platforms.groupby("origin_place", as_index=False).sum(numeric_only=True)

# Two pies side by side: share of memes and share of views per platform
fig = make_subplots(cols=2, subplot_titles=["Number of memes", "Sum of memes' views"], specs=[[{"type": "pie"}, {"type": "pie"}]])
fig.add_trace(
    go.Pie(labels=memes_big_platforms_counts["origin_place"], values=memes_big_platforms_counts["views"]),
    row=1, col=1
)
fig.add_trace(
    go.Pie(labels=memes_big_platforms_sums["origin_place"], values=memes_big_platforms_sums["views"]),
    row=1, col=2
)
fig.show()
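
The two pie charts compare each platform's share of memes with its share of views. As a sketch of how both quantities could be computed in one pass, the example below uses pandas named aggregation on an invented toy frame; the result column names `meme_count`, `total_views` and `view_share` are hypothetical.

```python
import pandas as pd

# Invented rows reusing the notebook's column names.
toy = pd.DataFrame({
    "origin_place": ["Reddit", "Reddit", "Twitter", "YouTube", "4chan"],
    "views": [100, 300, 50, 1000, 70],
})

big_platforms = ["Twitter", "Reddit", "YouTube"]
subset = toy[toy["origin_place"].isin(big_platforms)]

# Named aggregation yields both statistics from a single groupby.
per_platform = subset.groupby("origin_place", as_index=False).agg(
    meme_count=("views", "count"),
    total_views=("views", "sum"),
)
# Fraction of all views (within the selected platforms) per platform.
per_platform["view_share"] = per_platform["total_views"] / per_platform["total_views"].sum()

print(per_platform)
```

Keeping counts and sums in one table makes it easy to spot platforms whose share of views is much larger or smaller than their share of memes.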

In [53]:
px.histogram(memes["time_added_utc"].apply(lambda x: x.hour))

In [54]:
px.histogram(memes["time_added_local"].apply(lambda x: x.hour))
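
The hourly histograms can also be backed by an explicit table of counts, which guarantees that hours with no entries still appear as zeros. This is a minimal sketch on invented `datetime.time` values rather than the dataset's columns.

```python
import datetime as dt

import pandas as pd

# Invented times of day standing in for time_added_utc or time_added_local.
toy_times = pd.Series([dt.time(0, 5), dt.time(13, 30), dt.time(13, 59), dt.time(23, 1)])

# Extract the hour, count occurrences, and fill in the hours that never occur.
hours = toy_times.apply(lambda t: t.hour)
per_hour = hours.value_counts().reindex(range(24), fill_value=0).sort_index()

print(per_hour)
```

The resulting series always has 24 entries, so the UTC and local distributions can be compared hour by hour or plotted as bar charts with identical axes.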