Day 9 of Python Summer Party

by Interview Master

Meta

Instagram Stories Daily User Creation Patterns

You are a Product Analyst on the Instagram Stories team investigating story creation patterns. The team wants to understand the distribution of stories created by users daily. You will analyze user storytelling behavior to optimize engagement strategies.

In [1]:
import pandas as pd
import numpy as np


In [2]:
# Load the CSV file into a DataFrame and display it
stories_data = pd.read_csv('stories_data.csv')
stories_df = stories_data.copy()
stories_df


    user_id  story_date  story_count
0  user_001  2024-07-03          3.0
1  user_001  2024-07-03          3.0
2  user_001  2024-08-15          5.0
3  user_001  2024-09-10          0.0
4  user_001  2024-10-05         20.0
5  user_001         NaN          2.0
6  user_002  2024-07-03          4.0
7  user_002  2024-07-04          3.0
8  user_002         NaN          6.0
9  user_002  2024-12-25          1.0


Question 1 of 3

Take a look at the data in the `story_date` column. Correct any data type inconsistencies in that column.


In [3]:
# Inspect column dtypes and non-null counts
stories_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   user_id      59 non-null     object 
 1   story_date   55 non-null     object 
 2   story_count  59 non-null     float64
dtypes: float64(1), object(2)
memory usage: 1.5+ KB


In [4]:
# The frame has 60 rows, and all three columns contain missing values.
# First, convert the story_date column to datetime format
stories_df['story_date'] = pd.to_datetime(stories_df['story_date'], format='%Y-%m-%d', errors='coerce')
print(stories_df.info())
print()

# Checking missing values on story_date column
sd_missing_values = stories_df["story_date"].isnull().sum()
print('The number of missing values on "story_date" column of the data set is:', sd_missing_values)



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   user_id      59 non-null     object        
 1   story_date   55 non-null     datetime64[ns]
 2   story_count  59 non-null     float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 1.5+ KB
None

The number of missing values on "story_date" column of the data set is: 5
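
The `errors='coerce'` argument turns any value that does not match the format into NaT instead of raising an error. In the run above the non-null count is 55 both before and after the conversion, so the 5 NaT values were already missing in the raw column. A minimal sketch of how one might confirm that on other data, using a made-up Series (the sample values and the names `raw`, `parsed`, and `failed` are illustrative, not taken from stories_data.csv):

```python
import pandas as pd

# Hypothetical raw values: two valid dates, one missing, one in the wrong format
raw = pd.Series(['2024-07-03', None, '2024/08/15', '2024-12-25'])

# Same call as in the notebook: non-matching values become NaT
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# Values that were present in the raw data but could not be parsed
failed = raw[parsed.isna() & raw.notna()]

print('NaT after conversion:', parsed.isna().sum())
print('Present but unparseable:', failed.tolist())
```

If `failed` is empty, every NaT comes from a value that was missing to begin with, and no date was lost to a format mismatch.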


Question 2 of 3

Calculate the 25th, 50th, and 75th percentiles of the number of stories created per user per day.

In [5]:
# Printing the dataframe to see the data
print(stories_df)


      user_id story_date  story_count
0    user_001 2024-07-03          3.0
1    user_001 2024-07-03          3.0
2    user_001 2024-08-15          5.0
3    user_001 2024-09-10          0.0
4    user_001 2024-10-05         20.0
5    user_001        NaT          2.0
6    user_002 2024-07-03          4.0
7    user_002 2024-07-04          3.0
8    user_002        NaT          6.0
9    user_002 2024-12-25          1.0
10   user_002 2025-01-15          7.0
11   user_002 2025-06-29         10.0
12   user_003 2024-07-10          2.0
13   user_003 2024-08-20          8.0
14   user_003 2024-08-20          8.0
15   user_003 2025-03-11          5.0
16        NaN 2025-03-12          3.0
17   USER_003 2025-04-01          4.0
18   user_004 2024-07-15          6.0
19   user_004 2024-09-30          7.0
20   user_004        NaT          4.0
21   user_004 2024-11-11          3.0
22   user_004 2025-02-28         12.0
23   user_004 2025-03-01          0.0
24   user_005 2024-08-01          1.0
25   user_00

In [6]:
# Normalize user_id: trim surrounding whitespace and lowercase
stories_df['user_id'] = stories_df['user_id'].str.strip().str.lower()
print(stories_df['user_id'].unique())


['user_001' 'user_002' 'user_003' nan 'user_004' 'user_005' 'user_006'
 'user_007' 'user_008' 'user_009' 'user_010']
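
Without this step, `USER_003` (row 17 in the printout above) would be grouped as a separate user from `user_003`. A small sketch of the effect on a made-up Series; the padded value `' user_003 '` is a hypothetical example, since the notebook output does not show whether the real data contains stray whitespace:

```python
import numpy as np
import pandas as pd

# Hypothetical variants of the same id, plus one missing value
ids = pd.Series(['user_003', 'USER_003', ' user_003 ', np.nan])

# .str methods skip missing values, so NaN stays NaN
normalized = ids.str.strip().str.lower()

print('distinct ids before:', ids.nunique())
print('distinct ids after: ', normalized.nunique())
print('missing ids after:  ', normalized.isna().sum())
```

The missing id is left as NaN here; it is dropped later, when rows without a user_id or story_date are filtered out.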


In [7]:
# Keep only rows usable for this metric (drop rows missing user_id or story_date)
clean = stories_df.dropna(subset=['user_id', 'story_date']).copy()

# Make sure story_count is numeric (and treat NaN as 0 stories)
clean['story_count'] = pd.to_numeric(clean['story_count'], errors='coerce').fillna(0)
print(clean.info())
print(clean)


<class 'pandas.core.frame.DataFrame'>
Index: 54 entries, 0 to 59
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   user_id      54 non-null     object        
 1   story_date   54 non-null     datetime64[ns]
 2   story_count  54 non-null     float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 1.7+ KB
None
     user_id story_date  story_count
0   user_001 2024-07-03          3.0
1   user_001 2024-07-03          3.0
2   user_001 2024-08-15          5.0
3   user_001 2024-09-10          0.0
4   user_001 2024-10-05         20.0
6   user_002 2024-07-03          4.0
7   user_002 2024-07-04          3.0
9   user_002 2024-12-25          1.0
10  user_002 2025-01-15          7.0
11  user_002 2025-06-29         10.0
12  user_003 2024-07-10          2.0
13  user_003 2024-08-20          8.0
14  user_003 2024-08-20          8.0
15  user_003 2025-03-11          5.0
17  user_003 2025-04-01    

In [8]:
# Group by user_id and story_date and sum story_count to get each user's total stories per day
stories_per_user_per_day = (
    clean
    .groupby(['user_id', 'story_date'])
    .agg(total_story_count=('story_count', 'sum'))
    .reset_index()
    .sort_values(by=['user_id', 'story_date'])
)
print(stories_per_user_per_day)


     user_id story_date  total_story_count
0   user_001 2024-07-03                6.0
1   user_001 2024-08-15                5.0
2   user_001 2024-09-10                0.0
3   user_001 2024-10-05               20.0
4   user_002 2024-07-03                4.0
5   user_002 2024-07-04                3.0
6   user_002 2024-12-25                1.0
7   user_002 2025-01-15                7.0
8   user_002 2025-06-29               10.0
9   user_003 2024-07-10                2.0
10  user_003 2024-08-20               16.0
11  user_003 2025-03-11                5.0
12  user_003 2025-04-01                4.0
13  user_004 2024-07-15                6.0
14  user_004 2024-09-30                7.0
15  user_004 2024-11-11                3.0
16  user_004 2025-02-28               12.0
17  user_004 2025-03-01                0.0
18  user_005 2024-08-01                1.0
19  user_005 2024-08-02                2.0
20  user_005 2024-08-03                3.0
21  user_005 2024-08-04                4.0
22  user_00
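
Summing treats repeated rows as separate logs: rows 0 and 1 (user_001, 2024-07-03, 3.0) become 6.0, and rows 13 and 14 (user_003, 2024-08-20, 8.0) become 16.0. Whether those rows are genuine separate entries or accidental duplicates is not stated in the problem, so the following is only a sketch of the alternative, assuming exact duplicates should be dropped first. The helper `daily_totals` and the frame `sample` are hypothetical names; the sample rows are copied from the printout above.

```python
import pandas as pd

# A few rows copied from the cleaned data shown above
sample = pd.DataFrame({
    'user_id': ['user_001', 'user_001', 'user_001', 'user_003', 'user_003'],
    'story_date': pd.to_datetime(
        ['2024-07-03', '2024-07-03', '2024-08-15', '2024-08-20', '2024-08-20']
    ),
    'story_count': [3.0, 3.0, 5.0, 8.0, 8.0],
})

def daily_totals(df):
    """Sum story_count per user per day (same aggregation as the notebook)."""
    return (
        df.groupby(['user_id', 'story_date'], as_index=False)
          .agg(total_story_count=('story_count', 'sum'))
    )

summed = daily_totals(sample)                     # repeated rows add up
deduped = daily_totals(sample.drop_duplicates())  # exact duplicates removed first

print(summed)
print(deduped)
```

The choice matters for the later questions: under the deduplicated reading, user_003's 2024-08-20 total is 8.0 rather than 16.0, which is no longer above 10.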

In [9]:
# Now we can calculate the 25th, 50th, and 75th percentiles of the number of stories created per user per day
percentiles = stories_per_user_per_day['total_story_count'].quantile([0.25, 0.5, 0.75])
percentiles.index = ['25th', '50th', '75th']
print("\nThe 25th, 50th, and 75th percentiles of the number of stories created per user per day are:")
print(percentiles)



The 25th, 50th, and 75th percentiles of the number of stories created per user per day are:
25th     3.0
50th     5.0
75th    10.0
Name: total_story_count, dtype: float64


In [10]:
per_user_percentiles = (
    stories_per_user_per_day
    .groupby('user_id')['total_story_count']
    .quantile([0.25, 0.5, 0.75])
    .unstack()              # columns: 0.25, 0.5, 0.75
    .rename(columns={0.25:'p25', 0.5:'p50', 0.75:'p75'})
    .reset_index()
)
per_user_percentiles.head()


    user_id   p25  p50   p75
0  user_001  3.75  5.5  9.50
1  user_002  3.00  4.0  7.00
2  user_003  3.50  4.5  7.75
3  user_004  3.00  6.0  7.00
4  user_005  1.25  2.5  3.75
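
Values such as 3.75 and 9.5 are not totals that any user actually posted; they come from the default linear interpolation in `quantile`. A short sketch using user_001's four daily totals from the table above (6, 5, 0, 20) reproduces that row and shows `interpolation='lower'` as one alternative that only returns observed values:

```python
import pandas as pd

# user_001's daily totals, copied from stories_per_user_per_day above
totals = pd.Series([6.0, 5.0, 0.0, 20.0])

qs = [0.25, 0.5, 0.75]

# Default: linear interpolation between the two nearest sorted values
linear = totals.quantile(qs)

# Alternative: take the lower of the two neighbouring observations
lower = totals.quantile(qs, interpolation='lower')

print(linear.tolist())
print(lower.tolist())
```

With so few days per user, the interpolation method visibly changes the result, so it is worth stating which one is used when reporting per-user percentiles.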


Question 3 of 3

What percentage of users have had at least one day where they posted more than 10 stories on that day?

In [11]:
# Display the dataframe to see the data again
print(stories_per_user_per_day)


     user_id story_date  total_story_count
0   user_001 2024-07-03                6.0
1   user_001 2024-08-15                5.0
2   user_001 2024-09-10                0.0
3   user_001 2024-10-05               20.0
4   user_002 2024-07-03                4.0
5   user_002 2024-07-04                3.0
6   user_002 2024-12-25                1.0
7   user_002 2025-01-15                7.0
8   user_002 2025-06-29               10.0
9   user_003 2024-07-10                2.0
10  user_003 2024-08-20               16.0
11  user_003 2025-03-11                5.0
12  user_003 2025-04-01                4.0
13  user_004 2024-07-15                6.0
14  user_004 2024-09-30                7.0
15  user_004 2024-11-11                3.0
16  user_004 2025-02-28               12.0
17  user_004 2025-03-01                0.0
18  user_005 2024-08-01                1.0
19  user_005 2024-08-02                2.0
20  user_005 2024-08-03                3.0
21  user_005 2024-08-04                4.0
22  user_00

In [12]:
# Filter the daily totals to days with more than 10 stories, then count the distinct users that appear
users_with_more_than_10_stories = stories_per_user_per_day[stories_per_user_per_day['total_story_count'] > 10]['user_id'].nunique()
print('The number of users who have had at least one day where they posted more than 10 stories on that day is:', users_with_more_than_10_stories)

# Now we can calculate the percentage of users who have had at least one day where they posted more than 10 stories on that day
total_users = stories_per_user_per_day['user_id'].nunique()
percentage = (users_with_more_than_10_stories / total_users) * 100
print(f"\nThe percentage of users who have had at least one day where they posted more than 10 stories on that day is: {percentage:.2f}%")



The number of users who have had at least one day where they posted more than 10 stories on that day is: 6

The percentage of users who have had at least one day where they posted more than 10 stories on that day is: 60.00%
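
An equivalent way to express the same metric is to take each user's busiest day and average the resulting True/False flags. The sketch below uses a handful of daily totals copied from the table above for four users, so its percentage applies to this subset only and is not the answer for the full data set. The names `daily` and `has_heavy_day` are illustrative.

```python
import pandas as pd

# A subset of daily totals copied from stories_per_user_per_day above
daily = pd.DataFrame({
    'user_id': ['user_001', 'user_001', 'user_002', 'user_002', 'user_003', 'user_004'],
    'total_story_count': [6.0, 20.0, 7.0, 10.0, 16.0, 12.0],
})

# True when the user's busiest day is strictly above 10 stories
has_heavy_day = daily.groupby('user_id')['total_story_count'].max() > 10

# The mean of a boolean Series is the share of True values
share = has_heavy_day.mean() * 100

print(has_heavy_day)
print(f"{share:.2f}%")
```

Two details carry over to the full calculation. The comparison is strict, so user_002's day with exactly 10 stories does not count. The denominator contains only users with at least one row that survived cleaning, so any user whose rows were all dropped would not appear in it.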


