# An In-Depth Analysis of the Social Media Platform Trell

*** 
## Introduction

Welcome to our data story project, where we embark on an exciting journey to explore the depths of the social media platform Trell. Through a series of interlinked visualizations and explanatory text, we aim to unravel the intricate relationships within Trell's user data and shed light on the factors influencing user engagement.

Trell, a popular social media platform, offers users a unique space to discover, create, and share their experiences through captivating visual content. In this project, we dive into a comprehensive dataset that encompasses a wide range of attributes related to Trell's users. From user demographics and activity patterns to engagement metrics and content preferences, our dataset provides a rich foundation for uncovering fascinating insights.

Before we delve into the analysis, we diligently preprocess the dataset to ensure data quality and relevance. Cleaning the dataset, handling missing values, and transforming variables where necessary form the crucial groundwork for our exploration. By employing best practices in data preprocessing, we ensure that our subsequent analyses and visualizations are accurate and informative.

Throughout the project, we actively seek feedback from our Teaching Assistant (TA) and peers, recognizing the value of diverse perspectives in refining our analysis and improving the clarity of our visualizations. This iterative process enables us to present a compelling data story that effectively communicates the insights derived from the Trell dataset.

Join us on this captivating journey as we uncover the correlations between various attributes within Trell and unravel the secrets behind user engagement patterns. Through the fusion of data, visualizations, and explanatory text, we hope to empower researchers, marketers, and enthusiasts with a deeper understanding of the dynamic landscape of Trell.

In [None]:
# Imports

import pandas as pd
from scipy.stats import pearsonr
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as pyo
import numpy as np
pyo.init_notebook_mode()

## Our perspectives


### Perspective 1: Content Creator on Trell

As a content creator on Trell, you play a vital role in shaping the platform's landscape and engaging with its user base. Through this perspective, we aim to provide insights into the factors that contribute to your success and help you optimize your content creation strategy.

By analyzing the dataset, we explore the correlation between various attributes and the content creator's performance on Trell. We investigate factors such as user activity, the age groups of users, and audience sizes to understand their impact on content reach and engagement. Through visualizations and data-driven analysis, we aim to empower content creators with actionable insights to enhance their content's visibility and impact.

<u> Argument #1: A content creator shouldn't upload at night. </u>
- The best time to upload a video is during the day, with effectiveness increasing towards the evening. Even though the differences between the parts of the day are less than 2%, the pie chart suggests that it is better not to upload during the night, as the video would then show up in fewer people's feeds.

In [None]:
# Graph 1

data = pd.read_csv('train_age_dataset.csv')

slot1_sum = int(data['slot1_trails_watched_per_day'].sum())
slot2_sum = int(data['slot2_trails_watched_per_day'].sum())
slot3_sum = int(data['slot3_trails_watched_per_day'].sum())
slot4_sum = int(data['slot4_trails_watched_per_day'].sum())

slot_sums = [slot1_sum, slot2_sum, slot3_sum, slot4_sum]

slots = ['00:00-05:59', '06:00-11:59', '12:00-17:59', '18:00-23:59']

fig = go.Figure(data=[go.Pie(labels=slots, values=slot_sums)])

fig.update_layout(
    title='Videos watched per time slot',
    height=500
)

fig.show()

> Graph 1: Each slice of the pie chart represents a 6-hour interval of the day. It suggests that users watch the fewest videos during the night.
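
The shares behind the pie chart can also be printed as percentages, which makes small differences between slots easier to judge. The sketch below is a minimal illustration and assumes a tiny made-up table (`demo`) in place of `train_age_dataset.csv`; only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical per-user averages; the real values come from train_age_dataset.csv.
demo = pd.DataFrame({
    'slot1_trails_watched_per_day': [0.1, 0.0, 0.3],
    'slot2_trails_watched_per_day': [0.4, 0.2, 0.2],
    'slot3_trails_watched_per_day': [0.3, 0.3, 0.4],
    'slot4_trails_watched_per_day': [0.5, 0.4, 0.3],
})

slot_columns = [f'slot{i}_trails_watched_per_day' for i in range(1, 5)]

# Sum each slot over all users, then express it as a share of the total.
slot_totals = demo[slot_columns].sum()
slot_shares = (slot_totals / slot_totals.sum() * 100).round(1)

print(slot_shares)
```

Replacing `demo` with the real DataFrame should give the same percentages that Plotly shows on the pie slices.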

<u> Argument #2: We should make videos aimed at a young audience. </u>

To maximize views and channel growth, a content creator should focus on creating content for users under 18, because the younger audience is by far the biggest one. Even when all 18+ age groups are taken together, they are still smaller than the under-18 audience.

In [None]:
# Graph 2

# Read the data from the CSV file
data = pd.read_csv('train_age_dataset.csv')

# Map the age group values to the corresponding labels
age_labels = {
    1: '<18',
    2: '18-24',
    3: '24-30',
    4: '>30'
}

data['age_group'] = data['age_group'].map(age_labels)
data['age_group'] = pd.Categorical(data['age_group'], categories=age_labels.values(), ordered=True)

# Group the data by age group and count the users in each group
grouped_data = data.groupby('age_group')['content_views'].count().reset_index()

# Sort the grouped data by age group
grouped_data = grouped_data.sort_values('age_group')

# Create lists for the age groups and the number of users in each
age_groups = grouped_data['age_group'].tolist()
user_amount = grouped_data['content_views'].tolist()

# Create the Plotly bar chart
fig = go.Figure(data=[go.Bar(x=age_groups, y=user_amount)])

# Update the layout
fig.update_layout(
    xaxis_title='Age',
    yaxis_title='Amount of users',
    title='Amount of users per age group',
    height=500
)

# Display the plot
fig.show()

> Graph 2: This bar chart has the number of users on the y-axis. It shows that there are a lot more users below 18 compared to other age groups.
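
To check the comparison between the under-18 group and all older groups together, the two counts can be computed directly. This is a minimal sketch on made-up `age_group` codes (`demo` is hypothetical); the coding 1 = `<18` follows the labels used in Graph 2.

```python
import pandas as pd

# Hypothetical age_group codes (1: <18, 2: 18-24, 3: 24-30, 4: >30).
demo = pd.DataFrame({'age_group': [1, 1, 1, 1, 2, 3, 4, 1, 2, 1]})

# Count users under 18 and users in all other age groups combined.
under_18 = int((demo['age_group'] == 1).sum())
adults = int((demo['age_group'] != 1).sum())

print(f'<18: {under_18} users, 18+: {adults} users')
```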

<u> Argument #3: Viewer retention is lower for younger audiences. </u>

The boxplot shows that, on average, the video completion rate of younger audiences is lower. Keeping a video engaging from start to finish can increase viewer retention for younger audiences.

It might also be a good idea to try out different video lengths, as this could account for the attention spans of different audiences.

In [None]:
# Graph 3

# Load the data from the CSV file
df = pd.read_csv('train_age_dataset.csv')

# Calculate the lower and upper bounds for outliers using Tukey's fences
Q1 = np.percentile(df['avgCompletion'], 25)
Q3 = np.percentile(df['avgCompletion'], 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out the outliers
df_filtered = df[(df['avgCompletion'] >= lower_bound) & (df['avgCompletion'] <= upper_bound)].copy()

filtered_age_groups = [1, 4]
# Copy the subset so the mapping below does not write to a view of df_filtered
df_filtered_subset = df_filtered[df_filtered['age_group'].isin(filtered_age_groups)].copy()

# Map age group codes to labels
age_labels = {
    1: '<18',
    4: '>30'
}
df_filtered_subset['age_group'] = df_filtered_subset['age_group'].map(age_labels)

# Create the boxplot
fig = px.box(df_filtered_subset, x='age_group', y='avgCompletion', color='age_group',
             labels={'age_group': 'Age', 'avgCompletion': 'Average completion'},
             title='Average completion of a video by age group')

# Set the height of the figure
fig.update_layout(height=500)

# Show the boxplot
fig.show()

> Graph 3: A boxplot with outliers removed using Tukey's fences. It suggests that older users have a longer attention span.

*** 
### Perspective 2: Trell's Perspective

From Trell's standpoint, understanding user behavior and preferences is crucial for increasing the platform's ability to produce a profit. Through this perspective, we delve into the dataset to uncover valuable insights that can inform strategic decisions and shape Trell's future development.

We examine the correlations between attributes such as user demographics, content viewing patterns, and engagement metrics to gain a comprehensive understanding of Trell's user base. By analyzing trends related to video uploads, video completion rate, and different audiences by category, we aim to provide Trell with valuable insights to optimize the user experience and drive platform growth.

<u> Argument #1: We should increase the incentive for male content creators to upload more videos. </u>

By diversifying our content feed, we can attract more advertisers to Trell, which benefits both our content creators and Trell. It gives advertisers more opportunities to target the male audience when buying advertisements on Trell.

In [None]:
# Graph 4

data = pd.read_csv('train_age_dataset.csv')

# Define the age group labels
age_labels = {
    1: '<18',
    2: '18-24',
    3: '24-30',
    4: '>30'
}

# Map the age group labels to the age_group column
data['age_group'] = data['age_group'].map(age_labels)
data['age_group'] = pd.Categorical(data['age_group'], categories=age_labels.values(), ordered=True)

# Group the data by age group and gender and calculate the average videos uploaded per person
grouped_data = data.groupby(['age_group', 'gender'])['creations'].mean().reset_index()

# Separate data for each gender
male_data = grouped_data[grouped_data['gender'] == 1]
female_data = grouped_data[grouped_data['gender'] == 2]

# Create bar traces for male and female genders
male_trace = go.Bar(
    x=male_data['age_group'],
    y=male_data['creations'],
    name='Male',
    visible=True  # The male trace is visible initially
)
female_trace = go.Bar(
    x=female_data['age_group'],
    y=female_data['creations'],
    name='Female',
    visible=False,  # Hidden here; switched on further down before the figure is shown
    marker=dict(color='red')
)

# Create the layout
layout = go.Layout(
    title='Average videos uploaded by gender and age',
    xaxis=dict(title='Age'),
    yaxis=dict(title='Average videos uploaded'),
    height=500
)

# Create the figure and add the traces
fig = go.Figure(data=[male_trace, female_trace], layout=layout)

# Create dropdown menu buttons
buttons = [
    dict(
        args=[
            {'visible': [True, True]},
            {'yaxis': {'range': [0, 0.064]}}
        ],  # Show both traces
        label='Both',
        method='update'
    ),
    dict(
        args=[
            {'visible': [True, False]},
            {'yaxis': {'range': [0, 0.064]}}
        ],  # Show only male trace
        label='Male',
        method='update'
    ),
    dict(
        args=[
            {'visible': [False, True]},
            {'yaxis': {'range': [0, 0.064]}}
        ],  # Show only female trace
        label='Female',
        method='update'
    )
]


# Create the updatemenus property
updatemenus = [
    dict(
        buttons=buttons,
        direction='down',
        pad={'r': 10, 't': 10},
        showactive=True,
        x=0.9,
        xanchor='left',
        y=1.2,
        yanchor='top'
    )
]

# Update the figure layout with updatemenus
fig.update_layout(updatemenus=updatemenus)

# Add annotation
fig.update_layout(
    annotations=[
        dict(
            text='',
            showarrow=False,
            x=0,
            y=1.085,
            yref='paper',
            align='left'
        )
    ]
)

# Make the female trace visible as well, so both traces are shown initially
fig.update_traces(visible=True, selector=dict(name='Female'))

# Show the figure
fig.show()

> Graph 4: This bar chart has the average number of daily video uploads on the y-axis. It shows that male content creators above the age of 18 are less likely to upload videos, and that younger people are less likely to upload videos as well.

<u> Argument #2: We should separate short and long content. </u>
- Short content viewers watch shorter videos, tend to have a higher completion rate, and watch more videos. As a result, they spend more time on Trell.
- Long content viewers tend to watch fewer but longer videos. They also tend to spend less time on Trell and have a lower completion rate.

A similar split exists on YouTube, between YouTube Shorts and regular YouTube videos. The data shows that Trell's users are likewise divided into those who prefer short content and those who prefer long content. By targeting each audience properly, Trell can tap into this market as well.

In [None]:
# Graph 5

data = pd.read_csv('train_age_dataset.csv')

content_views_categories = pd.qcut(data['content_views'], q=3, labels=['Low', 'Medium', 'High'])
avgCompletion_categories = pd.qcut(data['avgCompletion'], q=3, labels=['Low', 'Medium', 'High'])
avgTimeSpent_categories = pd.qcut(data['avgTimeSpent'], q=3, labels=['Low', 'Medium', 'High'])
avgDuration_categories = pd.qcut(data['avgDuration'], q=3, labels=['Low', 'Medium', 'High'])

colors = {
    'Low': '#b0c4de',
    'Medium': '#3cb371',
    'High': '#e9967a'
}

fig = go.Figure(data=go.Parcats(
    dimensions=[
        {'label': 'Average videos watched', 'values': content_views_categories, 'categoryorder': 'array', 'categoryarray': ['High', 'Medium', 'Low']},
        {'label': 'Completion rate', 'values': avgCompletion_categories, 'categoryorder': 'array', 'categoryarray': ['High', 'Medium', 'Low']},
        {'label': 'Average duration watched videos', 'values': avgDuration_categories, 'categoryorder': 'array', 'categoryarray': ['High', 'Medium', 'Low']},
        {'label': 'Average time spent', 'values': avgTimeSpent_categories, 'categoryorder': 'array', 'categoryarray': ['High', 'Medium', 'Low']}
    ],

    line={
        'color': avgCompletion_categories.cat.codes,
        'colorscale': [[0, '#b0c4de'], [0.5, '#3cb371'], [1, '#e9967a']]
    }
))

fig.update_layout(title='Metrics concerning user engagement', height=500)
fig.show()

> Graph 5: This parallel categories diagram uses `qcut` to split the users into three roughly equal-sized groups (Low, Medium, High), once for each attribute. It suggests that users with a high completion rate usually watch shorter videos and usually spend a lot of time on Trell.
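
How `pd.qcut` forms these groups is easiest to see on a small example. The sketch below uses nine made-up values (the series `views` is hypothetical, not taken from the dataset) and the same `q=3` and labels as Graph 5.

```python
import pandas as pd

# Hypothetical values standing in for a column such as content_views.
views = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9], name='content_views')

# qcut places the bin edges at the quantiles, so each bin gets about a third of the rows.
levels = pd.qcut(views, q=3, labels=['Low', 'Medium', 'High'])

print(levels.value_counts().sort_index())
```

With real data the groups are only approximately equal in size when many users share the same value, and `qcut` raises a `ValueError` if tied values make two bin edges identical (unless `duplicates='drop'` is passed).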

<u> Argument #3: Trell should market towards larger cities. </u>

Users from larger cities tend to watch more videos. By marketing to this audience and growing it, Trell can collect more ad revenue, because users who watch more videos are shown more pre-roll advertisements.

In [None]:
# Graph 6

# Read the data from the CSV file
data = pd.read_csv('train_age_dataset.csv')

# Define the tier labels
tier_labels = {
    1: '100,000+',
    2: '50,000 - 99,999',
    3: '20,000 - 49,999'
}

# Map the tier labels to the tier column
data['tier'] = data['tier'].map(tier_labels)
data['tier'] = pd.Categorical(data['tier'], categories=tier_labels.values(), ordered=True)

# Calculate the mean content_views per tier
mean_data = data.groupby('tier')['content_views'].mean().reset_index()

# Sort the data by the tier labels
mean_data = mean_data.sort_values('tier')

# Create bar trace for mean content_views
mean_trace = go.Bar(
    x=mean_data['tier'],
    y=mean_data['content_views'],
    name='Average Videos Watched',
    marker=dict(color='orange')
)

# Create the layout for mean content_views graph
mean_layout = go.Layout(
    title='Average daily videos watched per person',
    xaxis=dict(title='City population'),
    yaxis=dict(title='Daily videos watched'),
    height=400
)

# Create the figure for mean content_views graph
mean_fig = go.Figure(data=[mean_trace], layout=mean_layout)

# Show the mean content_views graph
mean_fig.show()

> Graph 6: This bar chart shows the number of videos watched per day per user compared to the size of the city in which the user resides. It implies that there is a positive correlation between the two.
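
Because `tier` is an ordered category rather than a measured population, a rank correlation is one way to put a number on this relationship. The sketch below is an illustration only: the values in `demo` are made up, and Spearman's rho is our choice here, not something computed in the original analysis. Note that tier 1 stands for the largest cities, so a positive relation between city size and videos watched appears as a negative correlation with the tier code.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical users: tier 1 = largest cities, tier 3 = smallest cities.
demo = pd.DataFrame({
    'tier': [1, 1, 2, 2, 3, 3],
    'content_views': [9.0, 8.0, 6.0, 5.0, 3.0, 2.0],
})

# Spearman's rho only uses the ranks, which suits an ordinal variable like tier.
rho, p_value = spearmanr(demo['tier'], demo['content_views'])

print(f"Spearman's rho: {rho:.3f} (p = {p_value:.3f})")
```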

## Dataset and preprocessing

Our dataset 'train_age_dataset.csv' can be found at: https://www.kaggle.com/datasets/adityak80/trell-social-media-usage-data?resource=download&select=train_age_dataset.csv.
It can be used to find correlations between certain data about users and how many videos they watch or how long they look at a certain post on average.
The only form of preprocessing we used was Tukey's fences in our boxplot. We used the standard k value of 1.5 to filter out outliers, as we wanted to show the difference between the age groups more clearly.
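
As a compact illustration of Tukey's fences with k = 1.5, the sketch below wraps the calculation in a small helper. The function name `tukey_fences` and the sample values are ours and only serve as an example.

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) bounds Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical completion rates with one extreme value.
completion = np.array([0.40, 0.45, 0.50, 0.55, 0.60, 0.99])

lower, upper = tukey_fences(completion)

# Keep only the values that lie within the fences.
kept = completion[(completion >= lower) & (completion <= upper)]

print(lower, upper, kept)
```

Values outside the two bounds are treated as outliers and dropped, which is the same filtering applied to `avgCompletion` in Graph 3.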

The attributes we used to create our graphs:
- "slot1_trails_watched_per_day": How many videos the user watches per day on average between 00:00 and 05:59.
- "slot2_trails_watched_per_day": How many videos the user watches per day on average between 06:00 and 11:59.
- "slot3_trails_watched_per_day": How many videos the user watches per day on average between 12:00 and 17:59.
- "slot4_trails_watched_per_day": How many videos the user watches per day on average between 18:00 and 23:59.
- "avgCompletion": The fraction of a video that the user watches on average, a value between 0 and 1.
- "age_group": A value between 1 and 4, each value meaning a different age group.
- "content_views": How many videos the user watches daily, on average.
- "creations": How many videos the user uploads daily, on average.
- "gender": A value, 1 or 2, representing male or female.
- "avgTimeSpent": Average time spent by the user watching a video in seconds.
- "avgDuration": Average duration of the videos that the user has watched.
- "tier": A metric for the population of the city the user lives in.

In [None]:
# Calculate Pearson's r for every pair of numeric columns

from itertools import combinations

# Read the CSV file into a pandas DataFrame
df = pd.read_csv('train_age_dataset.csv')
list_corr = []

# Pearson's r is only defined for numeric data
numeric_columns = df.select_dtypes(include='number').columns

# combinations() yields each unordered pair once, so no duplicates need removing
for column, target_column in combinations(numeric_columns, 2):
    # Drop rows where either attribute is missing
    df_cleaned = df.dropna(subset=[column, target_column])

    # Calculate Pearson's correlation coefficient and p-value
    corr, p_value = pearsonr(df_cleaned[column], df_cleaned[target_column])

    # Store the coefficient together with the column names
    list_corr.append([corr, column, target_column])

# Sort ascending so the strongest positive correlations come last
list_corr.sort()

print(list_corr[-10:])
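
As a cross-check, pandas can produce the full correlation matrix in a single call. The sketch below is a hedged alternative rather than the method used above: it assumes all columns of interest are numeric and uses a small made-up table (`demo`) with three of the dataset's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names are taken from the dataset.
demo = pd.DataFrame({
    'content_views': [1.0, 2.0, 3.0, 4.0, 5.0],
    'avgTimeSpent': [2.0, 4.1, 5.9, 8.2, 9.8],
    'avgDuration': [5.0, 3.0, 4.0, 1.0, 2.0],
})

# Pairwise Pearson correlations; missing values are excluded per pair.
corr_matrix = demo.corr(method='pearson')

# Keep only the upper triangle so every pair of columns appears once.
upper = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
pairs = corr_matrix.where(upper).stack().dropna().sort_values()

# The strongest positive correlations are at the end.
print(pairs.tail(10))
```

Unlike `pearsonr`, this route does not return p-values, so it is mainly useful for a quick overview.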

## Reflection

During our recent work session, we received valuable feedback from both the TA and our peers. We have carefully examined all the feedback and reflected on how we could enhance our data story.

The primary feedback we received was concerning our data story's coherence and the relationship between the plots and the overarching narrative. The TA pointed out that instead of having a single cohesive story, we presented several small, independent stories tied to each plot. To address this, we have unified our data story and ensured that all the plots collectively substantiate this singular narrative. This means finding a common thread that connects each plot, offering a more comprehensive insight when they come together.

Another crucial aspect emphasized in the feedback is the role of the plots in supporting the data story rather than the other way around. The data story should remain the focal point of our presentation, and the plots should serve as supporting arguments, helping to clarify and reinforce our narrative effectively.
Moreover, the feedback regarding the plots involving age groups was valuable. We were reminded of the importance of accounting for the number of people in each age group while analyzing the data. By dividing the totals by the number of individuals in each age group, we gained deeper insights, such as understanding the behavior of the average person within an age group. For example, when we considered the age group of users under 18, it initially appeared that they uploaded the most videos. However, upon plotting the data per person, we discovered that this age group actually uploaded the fewest videos per individual of all age groups.
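
The difference between group totals and per-person averages can be illustrated with a small sketch. The numbers in `demo` are made up and chosen so that the larger group has the higher total but the lower average.

```python
import pandas as pd

# Hypothetical users: a large <18 group with few uploads each, and a small >30 group.
demo = pd.DataFrame({
    'age_group': ['<18'] * 6 + ['>30'] * 2,
    'creations': [0.01, 0.02, 0.01, 0.02, 0.01, 0.03, 0.04, 0.04],
})

# 'sum' is driven by group size, 'mean' describes the average person in the group.
summary = demo.groupby('age_group')['creations'].agg(['sum', 'mean'])

print(summary)
```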

Additionally, the feedback emphasized the need to be cautious about drawing definitive conclusions, as there might be other factors at play that we are unaware of. This reminded us that our findings should always be presented with some degree of uncertainty. For instance, while females spent more time on Trell on average than males, we cannot directly conclude that they watch more videos. Other variables, like video duration, could influence this observation. Therefore, it's essential to present our conclusions with the appropriate nuance and acknowledge the limitations of our analysis.

In conclusion, the feedback we received has been immensely beneficial in shaping our data story. We now understand the importance of creating a coherent and unified narrative, with the plots working in harmony to support it. Additionally, accounting for the number of individuals in age groups and being mindful of potential confounding variables allows us to present a more insightful and accurate analysis. By addressing these areas of improvement, we are confident that our data story is more compelling and robust, leading to a better understanding of our dataset.

## Work distribution

- Creating the visualizations: Desmond
- Adding labels to the visualizations: Jermain, Desmond
- Overall layout updates of the visualizations and descriptions: Simon
- Arguments / perspectives: Jermain, with support from Reinout
- Writing the base reflection: Reinout
- Rewriting / editing the reflection: Simon
- Writing the dataset and preprocessing description: Simon
- Preprocessing Pearson's R: Jermain

## Appendix

Generative AI usage, all with ChatGPT, version Legacy 3.5:
- Increasing the readability of the reflection text: https://chat.openai.com/share/857de284-ae62-4708-bbcb-e17c2afb53bb
- Creating the dropdown menu in graph 4: https://chat.openai.com/share/d239bbe0-a450-42f2-ad95-b2bc11930293
- Bugfixing some code / general information (ignore the first part, it was for an assignment): https://chat.openai.com/share/bb298020-dcea-44f1-aca5-623f02534107
- Helping with calculating Pearson's R: https://chat.openai.com/share/07427d69-5879-4ca4-989b-61d909061156
- Transitioning from matplotlib to plotly: https://chat.openai.com/share/83b2a229-867c-4383-a2c7-5edc06e700b1