Day 7 of Python Summer Party

by Interview Master

Nike

Celebrity Product Drops Sales Performance Analysis

You are a Product Analyst working on Nike's marketing performance team. Your team wants to evaluate the effectiveness of celebrity product collaborations by analyzing sales data. You will investigate the performance of celebrity product drops to inform future marketing strategies.

Question 1

For Q1 2025 (January 1st through March 31st, 2025), can you identify all records of celebrity collaborations from the sales data where the sale_amount is missing? This will help us flag incomplete records that could impact the analysis of Nike's product performance.

In [1]:
import numpy as np
import pandas as pd


In [2]:
# Load the sales data
fct_sales = pd.read_csv('fct_sales.csv')
q1_fct_sales_df = fct_sales.copy()
display(q1_fct_sales_df)


Unnamed: 0,sale_id,sale_date,product_id,sale_amount,celebrity_id
0,1,2025-01-10,901,,101
1,2,2025-01-15,901,1500.0,101
2,3,2025-02-03,902,2000.5,102
3,4,2025-03-12,903,2500.75,103
4,5,2025-03-20,904,,104
5,6,2025-02-28,901,1000.0,101
6,7,2025-03-25,902,300.0,102
7,8,2025-03-30,905,1800.0,105
8,9,2025-01-20,903,1200.0,103
9,10,2025-02-05,906,500.0,106


In [3]:
# Sanity checks and initial exploration
# info() prints its summary directly and returns None, so it does not need display()
q1_fct_sales_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sale_id       20 non-null     int64  
 1   sale_date     20 non-null     object 
 2   product_id    20 non-null     int64  
 3   sale_amount   17 non-null     float64
 4   celebrity_id  20 non-null     int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 932.0+ bytes

We can see that sale_amount has 17 non-null values out of 20 rows, so 3 values are missing. pandas reports them as NaN rather than "Null".
Let's start by converting sale_date to a datetime type.
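
As a quick cross-check on that reading of info(), the missing values can be counted directly. This is a minimal sketch on a made-up miniature table (the `sample` frame and its values are assumptions for illustration, not the real fct_sales data):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of fct_sales; the values are invented for illustration.
sample = pd.DataFrame({
    "sale_id": [1, 2, 3, 4],
    "sale_amount": [np.nan, 1500.0, 2000.5, np.nan],
})

non_null = sample["sale_amount"].count()      # same figure info() shows as "Non-Null Count"
missing = sample["sale_amount"].isna().sum()  # number of rows where sale_amount is NaN

print(non_null, missing, len(sample))
```

The non-null count and the missing count always add up to the number of rows, which is how 17 non-null out of 20 implies 3 missing.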

In [4]:
# First we need to change the sale_date column to datetime format
q1_fct_sales_df["sale_date"] = pd.to_datetime(q1_fct_sales_df["sale_date"], format="%Y-%m-%d")
print(q1_fct_sales_df.info())
print()
print(q1_fct_sales_df.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   sale_id       20 non-null     int64         
 1   sale_date     20 non-null     datetime64[ns]
 2   product_id    20 non-null     int64         
 3   sale_amount   17 non-null     float64       
 4   celebrity_id  20 non-null     int64         
dtypes: datetime64[ns](1), float64(1), int64(3)
memory usage: 932.0 bytes
None

   sale_id  sale_date  product_id  sale_amount  celebrity_id
0        1 2025-01-10         901          NaN           101
1        2 2025-01-15         901      1500.00           101
2        3 2025-02-03         902      2000.50           102
3        4 2025-03-12         903      2500.75           103
4        5 2025-03-20         904          NaN           104


In [5]:
# Now we sort the dataframe by sale_date in ascending order for better readability
q1_fct_sales_df = q1_fct_sales_df.sort_values(['sale_date'], ascending=True).reset_index(drop=True)
print(q1_fct_sales_df)


    sale_id  sale_date  product_id  sale_amount  celebrity_id
0        18 2024-07-30         902      1000.00           102
1        20 2024-09-05         903      1100.00           103
2        17 2024-11-15         901       800.00           101
3         1 2025-01-10         901          NaN           101
4         2 2025-01-15         901      1500.00           101
5         9 2025-01-20         903      1200.00           103
6        14 2025-01-25         910       900.00           108
7         3 2025-02-03         902      2000.50           102
8        10 2025-02-05         906       500.00           106
9        12 2025-02-15         908      1300.00           101
10       15 2025-02-20         905       700.00           105
11        6 2025-02-28         901      1000.00           101
12       11 2025-03-01         907      2200.00           107
13        4 2025-03-12         903      2500.75           103
14       13 2025-03-15         909          NaN           102
15      

In [6]:
# Now we filter the dataframe for Q1 2025 which is January 1st through March 31st, 2025
Q1_df = q1_fct_sales_df[(q1_fct_sales_df['sale_date'] >= '2025-01-01') & (q1_fct_sales_df['sale_date'] <= '2025-03-31')]
print(Q1_df)


    sale_id  sale_date  product_id  sale_amount  celebrity_id
3         1 2025-01-10         901          NaN           101
4         2 2025-01-15         901      1500.00           101
5         9 2025-01-20         903      1200.00           103
6        14 2025-01-25         910       900.00           108
7         3 2025-02-03         902      2000.50           102
8        10 2025-02-05         906       500.00           106
9        12 2025-02-15         908      1300.00           101
10       15 2025-02-20         905       700.00           105
11        6 2025-02-28         901      1000.00           101
12       11 2025-03-01         907      2200.00           107
13        4 2025-03-12         903      2500.75           103
14       13 2025-03-15         909          NaN           102
15        5 2025-03-20         904          NaN           104
16        7 2025-03-25         902       300.00           102
17       16 2025-03-28         902      1500.00           102
18      

In [7]:
# The Q1 2025 output above shows 3 records with a missing sale_amount
# Now we keep only those records
Q1_missing_df = Q1_df[Q1_df['sale_amount'].isnull()]
print(Q1_missing_df)


    sale_id  sale_date  product_id  sale_amount  celebrity_id
3         1 2025-01-10         901          NaN           101
14       13 2025-03-15         909          NaN           102
15        5 2025-03-20         904          NaN           104


In [8]:
# Question 1 Answer: report how many Q1 2025 records have a missing sale_amount and list them
print("Question 1 Answer: There are", len(Q1_missing_df), "records with missing sale_amount in Q1 2025.")
print("These records are:")
print(Q1_missing_df)



Question 1 Answer: There are 3 records with missing sale_amount in Q1 2025.
These records are:
    sale_id  sale_date  product_id  sale_amount  celebrity_id
3         1 2025-01-10         901          NaN           101
14       13 2025-03-15         909          NaN           102
15        5 2025-03-20         904          NaN           104
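
The date filter and the missing-value filter can also be combined into a single boolean mask. The sketch below assumes a small made-up table (`sample` is hypothetical) and uses `Series.between`, which is inclusive on both ends by default, in place of the two comparisons used above:

```python
import pandas as pd

# Hypothetical miniature of fct_sales; the values are invented for illustration.
sample = pd.DataFrame({
    "sale_id": [1, 2, 3, 4],
    "sale_date": ["2024-11-15", "2025-01-10", "2025-02-03", "2025-04-02"],
    "sale_amount": [None, None, 2000.5, None],
})
sample["sale_date"] = pd.to_datetime(sample["sale_date"], format="%Y-%m-%d")

# Rows dated within Q1 2025 (both endpoints included)
in_q1 = sample["sale_date"].between(pd.Timestamp("2025-01-01"), pd.Timestamp("2025-03-31"))
# Rows where sale_amount is NaN
is_missing = sample["sale_amount"].isna()

q1_missing = sample.loc[in_q1 & is_missing]
print(q1_missing)
```

Only sale_id 2 survives both conditions here: the other missing amounts fall outside the quarter.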



Question 2

For Q1 2025 (January 1st through March 31st, 2025), can you list the unique combinations of celebrity_id and product_id from the sales table? This will ensure that each collaboration is accurately accounted for in the analysis of Nike's marketing performance.

In [10]:
# Printing Q1 dataframe and its info for verification
print(Q1_df)
print(Q1_df.info())


    sale_id  sale_date  product_id  sale_amount  celebrity_id
3         1 2025-01-10         901          NaN           101
4         2 2025-01-15         901      1500.00           101
5         9 2025-01-20         903      1200.00           103
6        14 2025-01-25         910       900.00           108
7         3 2025-02-03         902      2000.50           102
8        10 2025-02-05         906       500.00           106
9        12 2025-02-15         908      1300.00           101
10       15 2025-02-20         905       700.00           105
11        6 2025-02-28         901      1000.00           101
12       11 2025-03-01         907      2200.00           107
13        4 2025-03-12         903      2500.75           103
14       13 2025-03-15         909          NaN           102
15        5 2025-03-20         904          NaN           104
16        7 2025-03-25         902       300.00           102
17       16 2025-03-28         902      1500.00           102
18      

In [11]:
# Grouping by celebrity_id and product_id gives one row per unique combination; size() adds the number of sales for each
Q1_unique_combinations = Q1_df.groupby(['celebrity_id', 'product_id']).size().reset_index(name='count')
print(Q1_unique_combinations)


   celebrity_id  product_id  count
0           101         901      3
1           101         908      1
2           102         902      3
3           102         909      1
4           103         903      2
5           104         904      1
6           105         905      2
7           106         906      1
8           107         907      1
9           108         910      1


In [12]:
# Question 2 Answer: The following table shows the unique combinations of celebrity_id and product_id in Q1 2025.
print("Question 2 Answer: The following table shows the unique combinations of celebrity_id and product_id in Q1 2025.")
print(Q1_unique_combinations)


Question 2 Answer: The following table shows the unique combinations of celebrity_id and product_id in Q1 2025.
   celebrity_id  product_id  count
0           101         901      3
1           101         908      1
2           102         902      3
3           102         909      1
4           103         903      2
5           104         904      1
6           105         905      2
7           106         906      1
8           107         907      1
9           108         910      1
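
If only the pairs themselves are needed, without the count column, `drop_duplicates` on the two columns is a more direct alternative to the groupby above. A minimal sketch on a made-up table (`sample` is hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the Q1 sales rows; the values are invented for illustration.
sample = pd.DataFrame({
    "celebrity_id": [101, 101, 102, 101],
    "product_id": [901, 901, 902, 908],
})

unique_pairs = (
    sample[["celebrity_id", "product_id"]]
    .drop_duplicates()                            # keep the first row of each repeated pair
    .sort_values(["celebrity_id", "product_id"])  # sort for readability
    .reset_index(drop=True)
)
print(unique_pairs)
```

The groupby version is still useful when the number of sales per collaboration matters; this one simply answers "which pairs exist".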




Question 3

For Q1 2025 (January 1st through March 31st, 2025), can you rank the unique celebrity collaborations based on their total sales amounts and list the top 3 collaborations in descending order? This will help recommend the most successful partnerships for Nike's future product drop strategies.

In [14]:
# Printing Q1 dataframe and its info for verification
print(Q1_df)
print(Q1_df.info())


    sale_id  sale_date  product_id  sale_amount  celebrity_id
3         1 2025-01-10         901          NaN           101
4         2 2025-01-15         901      1500.00           101
5         9 2025-01-20         903      1200.00           103
6        14 2025-01-25         910       900.00           108
7         3 2025-02-03         902      2000.50           102
8        10 2025-02-05         906       500.00           106
9        12 2025-02-15         908      1300.00           101
10       15 2025-02-20         905       700.00           105
11        6 2025-02-28         901      1000.00           101
12       11 2025-03-01         907      2200.00           107
13        4 2025-03-12         903      2500.75           103
14       13 2025-03-15         909          NaN           102
15        5 2025-03-20         904          NaN           104
16        7 2025-03-25         902       300.00           102
17       16 2025-03-28         902      1500.00           102
18      

In [15]:
# Sum sale_amount for each celebrity_id and product_id combination
# Note: sum() skips NaN, so collaborations with only missing amounts show a total of 0.00
Q1_sales_collabs = (
    Q1_df.groupby(['celebrity_id', 'product_id'])
    .agg(total_sales_volume=('sale_amount', 'sum'))
    .sort_values(by=['celebrity_id', 'product_id'])
    .reset_index()
)
print(Q1_sales_collabs)


   celebrity_id  product_id  total_sales_volume
0           101         901             2500.00
1           101         908             1300.00
2           102         902             3800.50
3           102         909                0.00
4           103         903             3700.75
5           104         904                0.00
6           105         905             2500.00
7           106         906              500.00
8           107         907             2200.00
9           108         910              900.00


In [16]:
# Now we rank the collaborations based on total sales volume in descending order
Q1_ranked_collabs = Q1_sales_collabs.copy()
Q1_ranked_collabs = Q1_ranked_collabs.sort_values(by=['total_sales_volume'], ascending=False).reset_index(drop=True)
print(Q1_ranked_collabs)


   celebrity_id  product_id  total_sales_volume
0           102         902             3800.50
1           103         903             3700.75
2           105         905             2500.00
3           101         901             2500.00
4           107         907             2200.00
5           101         908             1300.00
6           108         910              900.00
7           106         906              500.00
8           104         904                0.00
9           102         909                0.00


In [17]:
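# We can use .head() to get the top 3 collaborations now that it is ranked
# Note: 105/905 and 101/901 tie at 2500.00, so head(3) keeps only whichever of the two sorted first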
top_3_collabs = Q1_ranked_collabs.head(3)
print(top_3_collabs)


   celebrity_id  product_id  total_sales_volume
0           102         902             3800.50
1           103         903             3700.75
2           105         905             2500.00


In [18]:
# Answer to Question 3: print the top 3 celebrity-product collaborations by total sales volume in Q1 2025
print("The top 3 collaborations based on total sales volume in Q1 2025 are:")
print(top_3_collabs)


The top 3 collaborations based on total sales volume in Q1 2025 are:
   celebrity_id  product_id  total_sales_volume
0           102         902             3800.50
1           103         903             3700.75
2           105         905             2500.00
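
Two details in the ranking above may be worth handling explicitly: the tie at 2500.00 for third place, and the 0.00 totals for collaborations whose only sales have a missing amount. The sketch below is one possible way to treat both, not part of the original solution. It assumes a made-up table (`sample` is hypothetical), uses `sum(min_count=1)` so that all-missing groups stay NaN, and uses `nlargest(..., keep="all")` so that ties at the cutoff are all kept:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Q1 sales rows; the values are invented for illustration.
sample = pd.DataFrame({
    "celebrity_id": [101, 101, 102, 103, 104, 105],
    "product_id":   [901, 901, 902, 903, 904, 905],
    "sale_amount":  [1500.0, 1000.0, 3800.5, 3700.75, np.nan, 2500.0],
})

# min_count=1 leaves the total as NaN when a group has no valid amounts,
# instead of reporting 0.00 as a plain sum() would.
totals = (
    sample.groupby(["celebrity_id", "product_id"])["sale_amount"]
    .sum(min_count=1)
    .reset_index(name="total_sales")
)

# keep="all" returns every row tied with the third-largest total,
# so the result can contain more than 3 rows.
top3 = totals.nlargest(3, "total_sales", keep="all")
print(top3)
```

Whether tied collaborations should all be reported, or broken by some secondary rule, is a judgment call for the analysis; the point here is only that `head(3)` makes that choice silently.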


