
# Day 9 - Instagram
You are a Product Analyst on the **Instagram** Stories team investigating story creation patterns. The team wants to understand the distribution of stories created by users daily. You will analyze user storytelling behavior to optimize engagement strategies.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np

# Import data files
url = "https://raw.githubusercontent.com/AnamHJ24/datascience-python-challenges/refs/heads/main/Data/Day9.txt"
stories_data = pd.read_csv(url)
stories_data.head()

Unnamed: 0,user_id,story_date,story_count
0,user_001,2024-07-03,3.0
1,user_001,2024-07-03,3.0
2,user_001,2024-08-15,5.0
3,user_001,2024-09-10,0.0
4,user_001,2024-10-05,20.0


## Question 1
Take a look at the data in the story_date column. Correct any data type inconsistencies in that column.

## Solution

In [2]:
stories_data.sample(10)

Unnamed: 0,user_id,story_date,story_count
21,user_004,2024-11-11,3.0
53,user_009,,6.0
46,user_008,2025-01-05,15.0
23,user_004,2025-03-01,0.0
54,user_010,2025-03-15,7.0
4,user_001,2024-10-05,20.0
1,user_001,2024-07-03,3.0
47,user_008,2025-01-06,0.0
14,user_003,2024-08-20,8.0
42,user_008,2025-01-01,11.0


In [3]:
# Convert the column to datetime; values that cannot be parsed become NaT
stories_data['story_date'] = pd.to_datetime(stories_data['story_date'], errors="coerce")
stories_data.sample(10)

Unnamed: 0,user_id,story_date,story_count
15,user_003,2025-03-11,5.0
42,user_008,2025-01-01,11.0
25,user_005,2024-08-02,2.0
35,user_006,NaT,7.0
5,user_001,NaT,2.0
29,user_005,2024-08-06,5.0
43,user_008,2025-01-02,12.0
26,user_005,2024-08-03,3.0
8,user_002,NaT,6.0
10,user_002,2025-01-15,7.0
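
To make the effect of `errors="coerce"` concrete, here is a minimal sketch on a small hand-made Series. The values in `raw_dates` are hypothetical and are not taken from the challenge dataset; the point is only that unparseable or missing entries end up as `NaT` and can then be counted.

```python
import pandas as pd

# Hypothetical sample values, not taken from the challenge dataset
raw_dates = pd.Series(["2024-07-03", "not a date", None, "2025-01-15"])

# Unparseable or missing entries become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Count how many values could not be turned into a date
n_missing = int(parsed.isna().sum())
print(parsed)
print("Unparseable or missing dates:", n_missing)
```

Counting the `NaT` values after conversion is a quick way to see how many rows the later cleaning step is likely to drop.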


## Question 2
Calculate the 25th, 50th, and 75th percentiles of the number of stories created per user per day.

## Solution


In [9]:
# Drop rows with missing values (including the NaT dates from Question 1),
# then normalize user_id by trimming whitespace and lowercasing
stories_data = stories_data.dropna().copy()
stories_data['user_id'] = stories_data['user_id'].str.strip().str.lower()
stories_data.sample(10)

Unnamed: 0,user_id,story_date,story_count
6,user_002,2024-07-03,4.0
50,user_009,2024-12-03,3.0
27,user_005,2024-08-04,4.0
34,user_006,2024-09-05,8.0
37,user_007,2024-10-11,4.0
26,user_005,2024-08-03,3.0
45,user_008,2025-01-04,14.0
12,user_003,2024-07-10,2.0
44,user_008,2025-01-03,13.0
10,user_002,2025-01-15,7.0
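
Note that `dropna()` with no arguments removes a row if any column is missing, not only `story_date`. The sketch below, on a hypothetical toy frame `df`, shows one possible alternative: inspecting missing values per column first and dropping only on the columns used as grouping keys. Whether rows with a missing `story_count` should be kept is a judgment call that the challenge does not specify, so treat this as an illustration rather than the notebook's method.

```python
import pandas as pd

# Hypothetical toy data with messy ids and missing values
df = pd.DataFrame({
    "user_id": [" User_001", "user_002 ", None, "USER_003"],
    "story_date": pd.to_datetime(["2024-07-03", None, "2024-07-04", "2024-07-05"]),
    "story_count": [3.0, 4.0, 5.0, None],
})

# Inspect missing values per column before dropping anything
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop only rows that lack a user or a date; .copy() avoids modifying a view
cleaned = df.dropna(subset=["user_id", "story_date"]).copy()

# Normalize ids: trim surrounding whitespace and lowercase
cleaned["user_id"] = cleaned["user_id"].str.strip().str.lower()
print(cleaned)
```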


In [13]:
# Calculate stories per user per day
stories_per_user = stories_data.groupby(['user_id', 'story_date'])['story_count'].sum()

# Calculate the 25th, 50th, and 75th percentiles in one call
percent_25, percent_50, percent_75 = np.percentile(stories_per_user, [25, 50, 75])

print("Percentiles of the number of stories created per user per day:")
print("25th Percentile:", percent_25)
print("50th Percentile:", percent_50)
print("75th Percentile:", percent_75)

Percentiles of the number of stories created per user per day:
25th Percentile: 3.0
50th Percentile: 5.0
75th Percentile: 10.0
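
As a cross-check, the same percentiles can be obtained with pandas' own `Series.quantile`, which takes fractions between 0 and 1 rather than percentages. The sketch below uses a hypothetical Series `daily_totals` in place of the real per-user, per-day totals; with default settings both functions use linear interpolation, so the results should agree.

```python
import numpy as np
import pandas as pd

# Hypothetical per-user, per-day totals (not the challenge data)
daily_totals = pd.Series([1, 2, 3, 4, 5], dtype="float64")

# pandas: quantiles are given as fractions
quartiles = daily_totals.quantile([0.25, 0.50, 0.75])

# NumPy: percentiles are given on a 0-100 scale
np_quartiles = np.percentile(daily_totals, [25, 50, 75])

print(quartiles)
print(np_quartiles)
```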


## Question 3
What percentage of users have had at least one day on which they posted more than 10 stories?

## Solution

In [16]:
# Calculate total number of unique users
total_users = stories_data['user_id'].nunique()

# Find the unique users who have at least one day with more than 10 stories
stories_per_user = stories_data.groupby(['user_id', 'story_date'])['story_count'].sum().reset_index()
more_than_10 = stories_per_user[stories_per_user['story_count'] > 10]['user_id'].unique()

# Calculate percentage
percentage = (len(more_than_10)/total_users)*100
print("Percentage of users with more than 10 stories a day:",percentage,"%")

Percentage of users with more than 10 stories a day: 60.0 %
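
The same question can also be answered by flagging each day and asking, per user, whether any day is flagged. The sketch below is an alternative formulation on a hypothetical toy frame `daily` (already aggregated to one row per user per day), not the notebook's data. Note that a day with exactly 10 stories does not count, because the condition is strictly greater than 10.

```python
import pandas as pd

# Hypothetical per-user, per-day totals (not the challenge data)
daily = pd.DataFrame({
    "user_id": ["a", "a", "b", "b", "c", "d"],
    "story_count": [3, 12, 5, 6, 11, 10],
})

# Flag days above the threshold, then check whether each user has any such day
has_heavy_day = (daily["story_count"] > 10).groupby(daily["user_id"]).any()

# The mean of a boolean Series is the share of True values
pct_heavy_users = has_heavy_day.mean() * 100
print(has_heavy_day)
print("Percentage of users with at least one day above 10 stories:", pct_heavy_users, "%")
```

This version computes the denominator from the same grouped data, so users are counted consistently in both the numerator and the denominator.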
