Day 5 of Python Summer Party

by Interview Master

Nintendo

Switch 2 Pre-sales Demand Forecasting

You are a Product Analyst working with the Nintendo Switch 2 pre-sales team to analyze regional pre-order patterns and customer segmentation. Your team needs to understand how different demographics influence pre-sale volumes across regions. You will leverage historical pre-sale transaction data to extract meaningful insights that can guide marketing strategies.


In [1]:
import numpy as np
import pandas as pd


In [2]:
# Creating a copy of the pre-sale data to avoid modifying the original dataset
data = pd.read_csv("pre_sale_data.csv")
ns2_psd_df = data.copy()

ns2_psd_df


Unnamed: 0,region,customer_id,pre_order_date,demographic_group,pre_order_quantity
0,North America,C001,2024-07-02,Gamer,1
1,Europe,C002,2024-07-03,Casual,2
2,Asia,C003,2024-07-04,Tech Enthusiast,1
3,Latin America,C004,2024-07-05,Family,3
4,Oceania,C005,2024-07-06,Student,2
5,North America,C006,2024-07-07,Gamer,5
6,Europe,C007,2024-07-08,,2
7,,C008,2024-07-09,Casual,1
8,Asia,C009,2024-07-10,Family,4
9,North America,C010,2024-07-11,Gamer,1


Question 1 of 3

What percentage of records have missing values in at least one column? Handle the missing values, so that we have a cleaned dataset to work with.


In [3]:
# Getting initial data info
ns2_psd_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   region              59 non-null     object
 1   customer_id         60 non-null     object
 2   pre_order_date      60 non-null     object
 3   demographic_group   57 non-null     object
 4   pre_order_quantity  60 non-null     int64 
dtypes: int64(1), object(4)
memory usage: 2.5+ KB


In [4]:
# Counting the missing values in each column
missing_values = ns2_psd_df.isnull().sum()
print(missing_values)


region                1
customer_id           0
pre_order_date        0
demographic_group     3
pre_order_quantity    0
dtype: int64


In [5]:
# Calculating the percentage of missing values per column
percent_missing_values = ((missing_values * 100) / len(ns2_psd_df)).round(2)
print(percent_missing_values)


region                1.67
customer_id           0.00
pre_order_date        0.00
demographic_group     5.00
pre_order_quantity    0.00
dtype: float64


In [6]:
# Calculating the percentage of rows with missing values
percent_rows_with_missing = (ns2_psd_df.isnull().any(axis=1).mean() * 100).round(2)
print(percent_rows_with_missing)


6.67
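
The per-column percentages and the per-row percentage answer different questions, and they only add up when no row is missing more than one field. In this data set they do add up (1.67 + 5.00 = 6.67), which suggests each affected row is missing exactly one value. A minimal sketch on a made-up frame (the `toy` values are hypothetical, not taken from the data set) shows a case where they diverge:

```python
import pandas as pd

# Hypothetical toy frame: row 1 is missing both fields, row 2 only one
toy = pd.DataFrame(
    {
        "region": ["Asia", None, "Europe", "Asia"],
        "demographic_group": ["Gamer", None, None, "Family"],
    }
)

# Share of missing cells in each column (25% and 50%)
pct_missing_per_column = toy.isnull().mean() * 100

# Share of rows with at least one missing cell (rows 1 and 2 out of 4)
pct_rows_with_missing = toy.isnull().any(axis=1).mean() * 100

print(pct_missing_per_column)
print(pct_rows_with_missing)
```

Here the column percentages sum to 75 while only 50% of rows are affected, because one row is counted in both columns.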


In [7]:
# Counting fully duplicated records (rows identical in every column)
duplicate_records = ns2_psd_df.duplicated().sum()
print("The number of duplicate records in the data set is:", duplicate_records)


The number of duplicate records in the data set is: 2


In [8]:
# Identify all duplicate rows, including the first occurrence
all_duplicate_rows = ns2_psd_df[ns2_psd_df.duplicated(keep=False)]
# Display all duplicate rows
print(all_duplicate_rows)


           region customer_id pre_order_date demographic_group  \
9   North America        C010     2024-07-11             Gamer   
10  North America        C010     2024-07-11             Gamer   
19        Oceania        C019     2024-07-20   Tech Enthusiast   
20        Oceania        C019     2024-07-20   Tech Enthusiast   

    pre_order_quantity  
9                    1  
10                   1  
19                   2  
20                   2  
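
The duplicates are identified here, but the later cells still work with all 60 rows, so any duplicated order is counted twice in the totals. Whether these rows are data-entry duplicates or genuine repeat orders is not stated in the task. A minimal sketch, assuming they should be removed before aggregating (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with one exact duplicate of the first row
toy = pd.DataFrame(
    {
        "customer_id": ["C010", "C010", "C011"],
        "pre_order_date": ["2024-07-11", "2024-07-11", "2024-07-12"],
        "pre_order_quantity": [1, 1, 2],
    }
)

# Keep the first occurrence of each fully identical row and rebuild the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(len(toy), "->", len(deduped))
print(deduped["pre_order_quantity"].sum())
```

Applying the same call to the notebook's DataFrame would change the downstream totals, so it is a modelling decision rather than a pure cleanup step.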


In [9]:
# Handling missing values
# Both columns with missing values are categorical (object dtype), so each is filled with its mode (most frequent value)
region_mode = ns2_psd_df["region"].mode()[0]
demographic_group_mode = ns2_psd_df["demographic_group"].mode()[0]

print("Region Mode:", region_mode)
print("Demographic Group Mode:", demographic_group_mode)
print()

Clean_psd_df = ns2_psd_df.fillna(
    {
        "region": region_mode,
        "demographic_group": demographic_group_mode,
    }
)
Clean_psd_df.info()


Region Mode: North America
Demographic Group Mode: Gamer

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   region              60 non-null     object
 1   customer_id         60 non-null     object
 2   pre_order_date      60 non-null     object
 3   demographic_group   60 non-null     object
 4   pre_order_quantity  60 non-null     int64 
dtypes: int64(1), object(4)
memory usage: 2.5+ KB


In [10]:
# Printing the answer for question 1
print("Number of missing values per column:")
print(missing_values)
print()

print("Percentage of missing values per column:")
print(percent_missing_values)
print()

# The "%" has to be passed inside print(); placed after the call it is silently discarded
print(
    "The percentage of rows with at least one missing value in the data set is",
    percent_rows_with_missing,
    "%",
)
print()

print("The count of duplicate records in the data set is:")
print(duplicate_records)
print()

print("Clean version of the data set with handled missing values:")
Clean_psd_df


Number of missing values per column:
region                1
customer_id           0
pre_order_date        0
demographic_group     3
pre_order_quantity    0
dtype: int64

Percentage of missing values per column:
region                1.67
customer_id           0.00
pre_order_date        0.00
demographic_group     5.00
pre_order_quantity    0.00
dtype: float64

The percentage of rows with at least one missing value in the data set is 6.67 %

The count of duplicate records in the data set is:
2

Clean version of the data set with handled missing values:


Unnamed: 0,region,customer_id,pre_order_date,demographic_group,pre_order_quantity
0,North America,C001,2024-07-02,Gamer,1
1,Europe,C002,2024-07-03,Casual,2
2,Asia,C003,2024-07-04,Tech Enthusiast,1
3,Latin America,C004,2024-07-05,Family,3
4,Oceania,C005,2024-07-06,Student,2
5,North America,C006,2024-07-07,Gamer,5
6,Europe,C007,2024-07-08,Gamer,2
7,North America,C008,2024-07-09,Casual,1
8,Asia,C009,2024-07-10,Family,4
9,North America,C010,2024-07-11,Gamer,1


Question 2 of 3

Using the cleaned data, calculate the total pre-sale orders per month for each region and demographic group.

In [11]:
Clean_psd_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   region              60 non-null     object
 1   customer_id         60 non-null     object
 2   pre_order_date      60 non-null     object
 3   demographic_group   60 non-null     object
 4   pre_order_quantity  60 non-null     int64 
dtypes: int64(1), object(4)
memory usage: 2.5+ KB


We can see that "pre_order_date" is stored as an object (string) column rather than a datetime, because pd.read_csv does not parse dates unless asked to.

We will convert it to datetime for further analysis.
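
As an alternative to converting after loading, the column can be parsed while the file is read. A minimal sketch, assuming the same column layout as the data set; the two-row CSV text below is an inline stand-in for the real file so the example runs on its own:

```python
import io

import pandas as pd

# Inline stand-in for pre_sale_data.csv (two rows copied from the preview above)
csv_text = """region,customer_id,pre_order_date,demographic_group,pre_order_quantity
North America,C001,2024-07-02,Gamer,1
Europe,C002,2024-07-03,Casual,2
"""

# parse_dates asks read_csv to convert the listed columns to datetime
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["pre_order_date"])

print(df["pre_order_date"].dtype)
print(df["pre_order_date"].dt.month.tolist())
```

With the real file, the same `parse_dates` argument would be passed alongside the file name.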

In [12]:
Clean_psd_df["pre_order_date"] = pd.to_datetime(Clean_psd_df["pre_order_date"], format="%Y-%m-%d")
Clean_psd_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   region              60 non-null     object        
 1   customer_id         60 non-null     object        
 2   pre_order_date      60 non-null     datetime64[ns]
 3   demographic_group   60 non-null     object        
 4   pre_order_quantity  60 non-null     int64         
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 2.5+ KB


In [13]:
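# We first need to group the data by month.
# "ME" (month end) replaces the "M" alias, which is deprecated since pandas 2.2; use "M" on older versions.
month_grouper = pd.Grouper(key="pre_order_date", freq="ME")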
print(month_grouper)


TimeGrouper(key='pre_order_date', freq=<MonthEnd>, axis=0, sort=True, dropna=True, closed='right', label='right', how='mean', convention='e', origin='start_day')


In [14]:
Grouping_df = (
    Clean_psd_df.groupby(
        [month_grouper, "region", "demographic_group"], as_index=False)
    .agg(total_orders=("pre_order_quantity", "sum"))
    .sort_values(by=["pre_order_date", "region", "demographic_group"])
)
Grouping_df


Unnamed: 0,pre_order_date,region,demographic_group,total_orders
0,2024-07-31,Asia,Casual,4
1,2024-07-31,Asia,Family,4
2,2024-07-31,Asia,Gamer,2
3,2024-07-31,Asia,Student,3
4,2024-07-31,Asia,Tech Enthusiast,1
5,2024-07-31,Europe,Casual,2
6,2024-07-31,Europe,Family,4
7,2024-07-31,Europe,Gamer,4
8,2024-07-31,Europe,Student,7
9,2024-07-31,Latin America,Casual,3


Question 3 of 3

Predict the total pre-sales quantity for each region for September 2024. Assume that the growth rate from August to September is the same as the growth rate from July to August in each region.

Think about how you might:

- Calculate total pre-sales per region for July and August,
- Compute the growth rate between these two months,
- Use that growth rate to estimate September’s pre-sales.
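
The three steps above amount to: September = August × (August / July). A minimal sketch on made-up totals (regions "A" and "B" and their numbers are hypothetical, not taken from the data set):

```python
import pandas as pd

# Hypothetical monthly totals per region
totals = pd.DataFrame(
    {"region": ["A", "B"], "july": [10, 20], "august": [15, 10]}
)

# Growth factor: August relative to July (1.0 means no change)
totals["growth_factor"] = totals["august"] / totals["july"]

# Apply the same factor once more to project September
totals["september_predicted"] = totals["august"] * totals["growth_factor"]

print(totals)
```

Region "A" grows by a factor of 1.5 (a growth rate of +50%), so 15 becomes 22.5; region "B" halves, so 10 becomes 5.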

In [15]:
Clean_psd_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   region              60 non-null     object        
 1   customer_id         60 non-null     object        
 2   pre_order_date      60 non-null     datetime64[ns]
 3   demographic_group   60 non-null     object        
 4   pre_order_quantity  60 non-null     int64         
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 2.5+ KB


In [16]:
# To filter the data more easily, we add separate year and month columns
Clean_psd_df["pre_order_year"] = Clean_psd_df["pre_order_date"].dt.year
Clean_psd_df["pre_order_month"] = Clean_psd_df["pre_order_date"].dt.month
Clean_psd_df


Unnamed: 0,region,customer_id,pre_order_date,demographic_group,pre_order_quantity,pre_order_year,pre_order_month
0,North America,C001,2024-07-02,Gamer,1,2024,7
1,Europe,C002,2024-07-03,Casual,2,2024,7
2,Asia,C003,2024-07-04,Tech Enthusiast,1,2024,7
3,Latin America,C004,2024-07-05,Family,3,2024,7
4,Oceania,C005,2024-07-06,Student,2,2024,7
5,North America,C006,2024-07-07,Gamer,5,2024,7
6,Europe,C007,2024-07-08,Gamer,2,2024,7
7,North America,C008,2024-07-09,Casual,1,2024,7
8,Asia,C009,2024-07-10,Family,4,2024,7
9,North America,C010,2024-07-11,Gamer,1,2024,7


In [17]:
Grouping_df = (
    Clean_psd_df.groupby(
        ["pre_order_month", "region", "demographic_group"], as_index=False)
    .agg(total_orders=("pre_order_quantity", "sum"))
    .sort_values(by=["pre_order_month", "region", "demographic_group"])
)
Grouping_df


Unnamed: 0,pre_order_month,region,demographic_group,total_orders
0,7,Asia,Casual,4
1,7,Asia,Family,4
2,7,Asia,Gamer,2
3,7,Asia,Student,3
4,7,Asia,Tech Enthusiast,1
5,7,Europe,Casual,2
6,7,Europe,Family,4
7,7,Europe,Gamer,4
8,7,Europe,Student,7
9,7,Latin America,Casual,3


In [18]:
# Summing over demographic groups to get one total per month and region
new_grouping = (
    Grouping_df.groupby(["pre_order_month", "region"])["total_orders"]
    .sum()
    .reset_index()
)
new_grouping


Unnamed: 0,pre_order_month,region,total_orders
0,7,Asia,14
1,7,Europe,17
2,7,Latin America,13
3,7,North America,14
4,7,Oceania,14
5,8,Asia,17
6,8,Europe,10
7,8,Latin America,14
8,8,North America,19
9,8,Oceania,14


In [19]:
# One row per region, one column per month
Pivot_df = new_grouping.pivot(index="region", columns="pre_order_month", values=["total_orders"]).reset_index()
Pivot_df


Unnamed: 0_level_0,region,total_orders,total_orders
pre_order_month,Unnamed: 1_level_1,7,8
0,Asia,14,17
1,Europe,17,10
2,Latin America,13,14
3,North America,14,19
4,Oceania,14,14


In [20]:
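# Growth from July to August, expressed as the ratio August / July
# (1.0 means no change; subtract 1 to get the percentage growth rate)
JulAug_Growth_rate = Pivot_df['total_orders', 8] / Pivot_df['total_orders', 7]
print("Growth rate from July to August:")
print(JulAug_Growth_rate)

# Adding the ratio to the pivot DataFrame, rounded to 2 decimals
# (the September prediction below therefore uses the rounded value)
Pivot_df["Jul-Aug_Growth_Rate"] = JulAug_Growth_rate.round(2)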

Pivot_df_growth = Pivot_df
Pivot_df_growth



Growth rate from July to August:
0    1.214286
1    0.588235
2    1.076923
3    1.357143
4    1.000000
dtype: float64


Unnamed: 0_level_0,region,total_orders,total_orders,Jul-Aug_Growth_Rate
pre_order_month,Unnamed: 1_level_1,7,8,Unnamed: 4_level_1
0,Asia,14,17,1.21
1,Europe,17,10,0.59
2,Latin America,13,14,1.08
3,North America,14,19,1.36
4,Oceania,14,14,1.0


In [21]:
Predicted_September = Pivot_df_growth['total_orders', 8] * Pivot_df_growth['Jul-Aug_Growth_Rate']
print("Predicted pre-sales quantity for September 2024:")
print(Predicted_September)

# Adding the predicted September values to the DataFrame
Pivot_df_growth["September_Predicted_presales"] = Predicted_September.round(0)

Predicted_September_presales = Pivot_df_growth
print("Predicted pre-sales quantity for September 2024 by region:")
Predicted_September_presales


Predicted pre-sales quantity for September 2024:
0    20.57
1     5.90
2    15.12
3    25.84
4    14.00
dtype: float64
Predicted pre-sales quantity for September 2024 by region:


Unnamed: 0_level_0,region,total_orders,total_orders,Jul-Aug_Growth_Rate,September_Predicted_presales
pre_order_month,Unnamed: 1_level_1,7,8,Unnamed: 4_level_1,Unnamed: 5_level_1
0,Asia,14,17,1.21,21.0
1,Europe,17,10,0.59,6.0
2,Latin America,13,14,1.08,15.0
3,North America,14,19,1.36,26.0
4,Oceania,14,14,1.0,14.0
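
Because the growth ratio was rounded to two decimals before the multiplication, the intermediate predictions differ slightly from what the full-precision ratio would give. A small check, using the July and August totals copied from the table above, suggests that the whole-number predictions are unaffected for this data:

```python
import pandas as pd

# July and August totals per region, copied from the pivot table above
totals = pd.DataFrame(
    {
        "region": ["Asia", "Europe", "Latin America", "North America", "Oceania"],
        "july": [14, 17, 13, 14, 14],
        "august": [17, 10, 14, 19, 14],
    }
)

factor = totals["august"] / totals["july"]

# Prediction with the ratio rounded to 2 decimals (as in the notebook)
pred_rounded_factor = totals["august"] * factor.round(2)
# Prediction with the full-precision ratio
pred_full_factor = totals["august"] * factor

comparison = totals[["region"]].assign(
    rounded_factor=pred_rounded_factor.round(2),
    full_factor=pred_full_factor.round(2),
)
print(comparison)

# True when both variants agree after rounding to whole units
print(pred_rounded_factor.round(0).equals(pred_full_factor.round(0)))
```

With larger volumes the two variants could round to different whole numbers, so keeping the unrounded ratio for the calculation and rounding only for display is the safer habit.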


In [22]:
Pivot_df.columns = [
    c if isinstance(c, str) else f"{c[0]}{c[1]}"
    for c in Pivot_df.columns
]
Pivot_df.columns


Index(['region', 'total_orders7', 'total_orders8', 'Jul-Aug_Growth_Rate',
       'September_Predicted_presales'],
      dtype='object')

In [23]:
final_df = Pivot_df.rename(columns={
    "total_orders7": "July",
    "total_orders8": "August",
    "Jul-Aug_Growth_Rate": "growth_rate",
    "September_Predicted_presales": "September (predicted)"
})
final_df


Unnamed: 0,region,July,August,growth_rate,September (predicted)
0,Asia,14,17,1.21,21.0
1,Europe,17,10,0.59,6.0
2,Latin America,13,14,1.08,15.0
3,North America,14,19,1.36,26.0
4,Oceania,14,14,1.0,14.0


In [24]:
# Answering the final question by selecting the relevant columns
final_df = final_df[["region", "July", "August", "growth_rate", "September (predicted)"]]
final_df


Unnamed: 0,region,July,August,growth_rate,September (predicted)
0,Asia,14,17,1.21,21.0
1,Europe,17,10,0.59,6.0
2,Latin America,13,14,1.08,15.0
3,North America,14,19,1.36,26.0
4,Oceania,14,14,1.0,14.0


In [None]:
#FINAL CODE

# # Creating a copy of the pre-sale data to avoid modifying the original dataset
# data = pre_sale_data
# ns2_psd_df = data.copy()

# print(ns2_psd_df)
# print("-" * 100)
# print()

# # Getting initial data info
# ns2_psd_df.info()
# print("-" * 100)
# print()

# # Counting the missing values in each column
# missing_values = ns2_psd_df.isnull().sum()
# print(missing_values)
# print("-" * 100)
# print()

# # Calculating the percentage of missing values per column
# percent_missing_values = ((missing_values * 100) / len(ns2_psd_df)).round(2)
# print(percent_missing_values)
# print("-" * 100)
# print()

# # Calculating the percentage of rows with missing values
# percent_rows_with_missing = (ns2_psd_df.isnull().any(axis=1).mean() * 100).round(2)
# print(percent_rows_with_missing)
# print("-" * 100)
# print()

# # Calculating the complete duplicate records
# duplicate_records = ns2_psd_df.duplicated().sum()
# print("The number of duplicate values on the data set is:", duplicate_records)
# print("-" * 100)
# print()

# # Identify all duplicate rows, including the first occurrence
# all_duplicate_rows = ns2_psd_df[ns2_psd_df.duplicated(keep=False)]
# # Display all duplicate rows
# print(all_duplicate_rows)
# print("-" * 100)
# print()

# # Handling missing values
# # Since the missing values are objects, I will use the Mode to fill them
# region_mode = ns2_psd_df["region"].mode()[0]
# demographic_group_mode = ns2_psd_df["demographic_group"].mode()[0]

# print("Region Mode:", region_mode)
# print("Demographic Group Mode:", demographic_group_mode)
# print("-" * 100)
# print()

# Clean_psd_df = ns2_psd_df.fillna(
#     {
#         "region": region_mode,
#         "demographic_group": demographic_group_mode,
#     }
# )
# print(Clean_psd_df.info())
# print("-" * 100)
# print()

# # Printing the answer for question 1
# print("The number of missing values in the data set are:")
# print(missing_values)
# print()

# print("The percentage of records with missing values in the data set are:")
# print(percent_missing_values)
# print()

# print(
#     "The percentage of rows with at least one missing value in the data set is",
#     percent_rows_with_missing,
#     "%",
# )
# print()

# print("The count of duplicate records in the data set are:")
# print(duplicate_records)
# print()

# print("Clean version of the data set with handled missing values:")
# print(Clean_psd_df)

# #####################################################################

# print(Clean_psd_df.info())
# print("-" * 100)
# print()

# # We can see that "pre_order_date" is stored as an object (string) column rather than a datetime.
# # We will convert it to datetime for further analysis
# Clean_psd_df["pre_order_date"] = pd.to_datetime(Clean_psd_df["pre_order_date"], format="%Y-%m-%d")
# print(Clean_psd_df.info())
# print("-" * 100)
# print()

# #We first need to group the data by month 
# month_grouper = pd.Grouper(key="pre_order_date", freq="M")
# print(month_grouper)
# print("-" * 100)
# print()

# # Answer
# Grouping_df = (
#     Clean_psd_df.groupby(
#         [month_grouper, "region", "demographic_group"], as_index=False)
#     .agg(total_orders=("pre_order_quantity", "sum"))
#     .sort_values(by=["pre_order_date", "region", "demographic_group"])
# )
# print(Grouping_df)
# print("-" * 100)
# print()

# ################################################################
# print(Clean_psd_df.info())
# print()

# #To answer this question a bit more efficiently we will create a month column so we can easily filter the data
# Clean_psd_df["pre_order_year"] = Clean_psd_df["pre_order_date"].dt.year
# Clean_psd_df["pre_order_month"] = Clean_psd_df["pre_order_date"].dt.month
# print(Clean_psd_df)
# print("-" * 100)
# print()

# Grouping_df = (
#     Clean_psd_df.groupby(
#         ["pre_order_month", "region", "demographic_group"], as_index=False)
#     .agg(total_orders=("pre_order_quantity", "sum"))
#     .sort_values(by=["pre_order_month", "region", "demographic_group"])
# )
# print(Grouping_df)
# print("-" * 100)
# print()

# new_grouping = Grouping_df.groupby(
#         ["pre_order_month", "region"])["total_orders"].sum().reset_index()
# print(new_grouping)
# print("-" * 100)
# print()

# Pivot_df = new_grouping.pivot(index= 'region', columns='pre_order_month', values=['total_orders']).reset_index()
# print(Pivot_df)
# print("-" * 100)
# print()

# JulAug_Growth_rate = Pivot_df['total_orders', 8] / Pivot_df['total_orders', 7]  # Growth rate from July to August
# print("Growth rate from July to August:")
# print(JulAug_Growth_rate)
# print()

# # Adding the growth rate to the pivot DataFrame
# Pivot_df["Jul-Aug_Growth_Rate"] = JulAug_Growth_rate

# Pivot_df_growth = Pivot_df
# print(Pivot_df_growth)
# print("-" * 100)
# print()

# Predicted_September = Pivot_df_growth['total_orders', 8] * Pivot_df_growth['Jul-Aug_Growth_Rate']
# print("Predicted pre-sales quantity for September 2024:")
# print(Predicted_September)

# #Adding the predicted September values to the DataFrame
# Pivot_df_growth["September_Predicted_presales"] = Predicted_September.round(0)

# Predicted_September_presales = Pivot_df_growth
# print("Predicted pre-sales quantity for September 2024 by region:")
# print(Predicted_September_presales)
# print("-" * 100)
# print()



# Pivot_df.columns = [
#     c if isinstance(c, str) else f"{c[0]}{c[1]}"
#     for c in Pivot_df.columns
# ]
# print(Pivot_df.columns)
# print()

# final_df = Pivot_df.rename(columns={
#     "total_orders7": "July 2024",
#     "total_orders8": "August 2024",
#     "Jul-Aug_Growth_Rate": "growth_rate",
#     "September_Predicted_presales": "September 2024 (predicted)"
# })
# print(final_df)
# print()

# # Answering the final question by selecting the relevant columns
# final_df = final_df[["region", "July 2024", "August 2024", "growth_rate", "September 2024 (predicted)"]]
# final_df
