Step 1 | Setup and Initialization

I'll start by installing the libraries used throughout the project. These generally include libraries for data manipulation and data visualization, plus others that depend on the specific needs of the project.

In [5]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


Step 1.2 | Loading the Dataset

Load the survey responses into a pandas DataFrame, which makes the data easy to manipulate and analyze:

In [None]:
import os

import pandas as pd  # needed for pd.read_csv below

# Change directory
project_dir = r'D:\Documents\projects\SurveyAnalysis\Checkout'
os.chdir(project_dir)

# Check if files exist
survey_file = 'data/processed/outputs/results/1-checkout_survey_responses.csv'

if os.path.exists(survey_file):
    df = pd.read_csv(survey_file)
    print(f"Loaded {survey_file}")
else:
    print(f"File not found: {survey_file}")
    print("Current directory contents:")
    print(os.listdir('.'))


Loaded data/processed/outputs/results/1-checkout_survey_responses.csv
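
The loading cell above prints a message when the file is missing, which leaves `df` undefined for every later cell. A minimal sketch of an alternative that fails loudly instead, assuming a hypothetical helper named `load_survey` (not part of the original notebook) and using `pathlib` so that the working directory does not need to change:

```python
from pathlib import Path

import pandas as pd


def load_survey(csv_path, base_dir="."):
    """Read a survey CSV located under base_dir; raise if it is missing.

    `load_survey` is a hypothetical helper introduced for illustration.
    """
    path = Path(base_dir) / csv_path
    if not path.is_file():
        # Raising here avoids a confusing NameError on `df` in later cells.
        raise FileNotFoundError(f"File not found: {path.resolve()}")
    return pd.read_csv(path)
```

It could be called as `df = load_survey(survey_file, base_dir=project_dir)`, reusing the two variables defined in the cell above.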


Step 2 | Initial Data Analysis

Summarize the responses so that we can see what is typical, how much variation there is, and whether there are clear differences between groups.
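
One way to look for differences between groups is a per-group mean. The sketch below is illustrative only: it uses the first 10 rows shown in the Step 2.1 output rather than all 200 responses, so the numbers it produces say nothing reliable about the full survey.

```python
import pandas as pd

# First 10 survey rows, copied from the df.head(10) output in Step 2.1.
sample = pd.DataFrame({
    "Ease_of_Checkout": [4, 2, 3, 5, 4, 1, 2, 4, 1, 3],
    "Time_Spent_Minutes": [5.6, 4.0, 4.8, 3.6, 4.4, 3.8, 4.1, 4.7, 3.0, 6.5],
    "Payment_Method": ["BNPL", "BNPL", "BNPL", "Card", "Card",
                       "PayPal", "BNPL", "PayPal", "PayPal", "Card"],
})

numeric_cols = ["Ease_of_Checkout", "Time_Spent_Minutes"]

# Mean of each numeric column within each payment method.
by_method = sample.groupby("Payment_Method")[numeric_cols].mean()

# Group sizes, so small groups are not over-interpreted.
by_method["n"] = sample.groupby("Payment_Method").size()

print(by_method)
```

On the full dataset the same pattern applies with `df` in place of `sample`, and `Age_Group` can be swapped in as the grouping column.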

In [None]:
#Step 2.1 | Survey Response Overview
## Preliminary analysis to understand the structure and content of the dataset
df.head(10)

,Age_Group,Ease_of_Checkout,Time_Spent_Minutes,Payment_Method
0,25-34,4,5.6,BNPL
1,55-64,2,4.0,BNPL
2,35-44,3,4.8,BNPL
3,35-44,5,3.6,Card
4,18-24,4,4.4,Card
5,18-24,1,3.8,PayPal
6,18-24,2,4.1,BNPL
7,45-54,4,4.7,PayPal
8,35-44,1,3.0,PayPal
9,35-44,3,6.5,Card


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age_Group           200 non-null    object 
 1   Ease_of_Checkout    200 non-null    int64  
 2   Time_Spent_Minutes  200 non-null    float64
 3   Payment_Method      200 non-null    object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.4+ KB


In [13]:
print("Shape of complete survey data : ", df.shape)

Shape of complete survey data :  (200, 4)


Inferences:

The survey responses dataset consists of 200 entries and 4 columns. Here is a brief overview of each column:

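- Age_Group: An object (string) column that contains the age range of each respondent. Each group represents a common age range.

- Ease_of_Checkout: An int64 column that contains each respondent's ease-of-checkout score on a 1 to 5 scale.

- Time_Spent_Minutes: A float64 column that contains the time spent, in minutes.

- Payment_Method: An object (string) column that contains the payment method: BNPL, Card, or PayPal.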
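
The column descriptions can be checked directly against the data. A minimal sketch, using only the first 10 rows from the `df.head(10)` output above as sample data (on the real dataset, `df` would replace `sample`):

```python
import pandas as pd

# First 10 survey rows, copied from the df.head(10) output above.
sample = pd.DataFrame({
    "Age_Group": ["25-34", "55-64", "35-44", "35-44", "18-24",
                  "18-24", "18-24", "45-54", "35-44", "35-44"],
    "Ease_of_Checkout": [4, 2, 3, 5, 4, 1, 2, 4, 1, 3],
    "Time_Spent_Minutes": [5.6, 4.0, 4.8, 3.6, 4.4, 3.8, 4.1, 4.7, 3.0, 6.5],
    "Payment_Method": ["BNPL", "BNPL", "BNPL", "Card", "Card",
                       "PayPal", "BNPL", "PayPal", "PayPal", "Card"],
})

# Distinct values in each categorical column, sorted for readability.
distinct = {
    col: sorted(sample[col].unique())
    for col in ["Age_Group", "Payment_Method"]
}
for col, values in distinct.items():
    print(col, values)

# Observed minimum and maximum of each numeric column.
bounds = sample[["Ease_of_Checkout", "Time_Spent_Minutes"]].agg(["min", "max"])
print(bounds)
```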



In [None]:
#Step 2.2 | Summary Statistics
#Generate summary statistics to gain initial insights into the data distribution.

#Summary statistics for numerical variables
df.describe().T

,count,mean,std,min,25%,50%,75%,max
Ease_of_Checkout,200.0,3.68,1.159449,1.0,3.0,4.0,4.0,5.0
Time_Spent_Minutes,200.0,3.9945,1.185049,1.2,3.1,4.0,4.725,7.7


In [None]:
#Range for numerical variables
data_range = df.describe().T[['min', 'max']]
range_ease = df["Ease_of_Checkout"].max() - df["Ease_of_Checkout"].min()
range_time = df["Time_Spent_Minutes"].max() - df["Time_Spent_Minutes"].min()
print("Data Range for Numerical Variables:\n", data_range)
print("Range for Ease of Checkout:", range_ease)
print("Range for Time Spent (in minutes):", range_time)

Data Range for Numerical Variables:
                     min  max
Ease_of_Checkout    1.0  5.0
Time_Spent_Minutes  1.2  7.7
Range for Ease of Checkout: 4
Range for Time Spent (in minutes): 6.5


In [24]:
# Interquartile Range (IQR) for numerical variables
iqr_ease = df["Ease_of_Checkout"].quantile(0.75) - df["Ease_of_Checkout"].quantile(0.25)
iqr_time = df["Time_Spent_Minutes"].quantile(0.75) - df["Time_Spent_Minutes"].quantile(0.25)
print("Interquartile Range (IQR) for Ease of Checkout:", iqr_ease)
print("Interquartile Range (IQR) for Time Spent (in minutes):", iqr_time)

Interquartile Range (IQR) for Ease of Checkout: 1.0
Interquartile Range (IQR) for Time Spent (in minutes): 1.6249999999999996
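
The range and IQR can also be computed for every numeric column at once instead of one column at a time. A minimal sketch, again using only the first 10 rows shown in Step 2.1 as sample data, so its results will differ from the full-dataset values printed above:

```python
import pandas as pd

# Numeric columns from the first 10 survey rows shown in Step 2.1.
sample = pd.DataFrame({
    "Ease_of_Checkout": [4, 2, 3, 5, 4, 1, 2, 4, 1, 3],
    "Time_Spent_Minutes": [5.6, 4.0, 4.8, 3.6, 4.4, 3.8, 4.1, 4.7, 3.0, 6.5],
})

# 25th and 75th percentiles for all columns in one call.
quartiles = sample.quantile([0.25, 0.75])

# One row per variable, one column per spread measure.
spread = pd.DataFrame({
    "range": sample.max() - sample.min(),
    "iqr": quartiles.loc[0.75] - quartiles.loc[0.25],
})

print(spread.round(3))
```

Rounding at print time also hides floating-point noise such as `1.6249999999999996`.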


In [26]:
#Mean Absolute Deviation for numerical variables
# Series.mad() was removed in pandas 2.0, so compute it directly:
# the mean of the absolute deviations from the mean.
mad_ease = (df["Ease_of_Checkout"] - df["Ease_of_Checkout"].mean()).abs().mean()
mad_time = (df["Time_Spent_Minutes"] - df["Time_Spent_Minutes"].mean()).abs().mean()
print("Mean Absolute Deviation for Ease of Checkout:", mad_ease)
print("Mean Absolute Deviation for Time Spent (in minutes):", mad_time)
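
Mean absolute deviation is the average distance of each value from the mean. `Series.mad()` was removed in pandas 2.0, so a small helper is one way to compute it for every numeric column. The sketch below assumes a hypothetical helper named `mean_abs_dev` and uses only the first 10 rows shown in Step 2.1 as sample data:

```python
import pandas as pd


def mean_abs_dev(series):
    """Mean of the absolute deviations from the series mean."""
    return (series - series.mean()).abs().mean()


# Numeric columns from the first 10 survey rows shown in Step 2.1.
sample = pd.DataFrame({
    "Ease_of_Checkout": [4, 2, 3, 5, 4, 1, 2, 4, 1, 3],
    "Time_Spent_Minutes": [5.6, 4.0, 4.8, 3.6, 4.4, 3.8, 4.1, 4.7, 3.0, 6.5],
})

# apply() calls the helper once per column and returns a Series.
mad = sample.apply(mean_abs_dev)
print(mad)
```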

In [15]:
# Summary statistics for categorical variables
df.describe(include='object').T

,count,unique,top,freq
Age_Group,200,5,25-34,56
Payment_Method,200,3,Card,98


Inferences:

- <b>Ease of Checkout</b>:
    - The average ease of checkout score is 3.68.
    - The ease of checkout score is based on a 1 to 5 scale.
    - The standard deviation is about 1.16, a moderate spread on a 5-point scale; the middle 50% of scores fall between 3 and 4.

- <b>Time Spent in Minutes</b>:
    - The average time spent is approximately 3.99 minutes.
    - Time spent ranges from 1.2 minutes (72 seconds) to 7.7 minutes (462 seconds), a range of 6.5 minutes.
    - Similar to the Ease_of_Checkout column, the standard deviation is about 1.19, and the middle 50% of responses fall between 3.1 and 4.725 minutes.

- <b>Age Group</b>:
    - There are 5 unique age groups, representing different audiences.
    - The most frequent age group is 25-34, with 56 of 200 respondents (28%).

- <b>Payment Method</b>:
    - There are 3 unique payment methods: Buy Now Pay Later (BNPL), PayPal, and Card.
    - The most frequent payment method is Card, with 98 of 200 respondents (49%).
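
The share of the most frequent category is its frequency divided by the number of responses. A minimal sketch of both the arithmetic on the reported counts and a reusable calculation, assuming a hypothetical helper named `top_share` and using only the first 10 rows shown in Step 2.1 as sample data:

```python
import pandas as pd

# Shares from the describe(include='object') output: freq / count.
age_share = round(56 / 200 * 100, 1)   # 25-34 age group
card_share = round(98 / 200 * 100, 1)  # Card payment method
print(age_share, card_share)


def top_share(series):
    """Return the most frequent value and its share of responses in percent."""
    shares = series.value_counts(normalize=True)  # sorted, largest first
    return shares.index[0], round(shares.iloc[0] * 100, 1)


# Categorical columns from the first 10 survey rows shown in Step 2.1.
sample = pd.DataFrame({
    "Age_Group": ["25-34", "55-64", "35-44", "35-44", "18-24",
                  "18-24", "18-24", "45-54", "35-44", "35-44"],
    "Payment_Method": ["BNPL", "BNPL", "BNPL", "Card", "Card",
                       "PayPal", "BNPL", "PayPal", "PayPal", "Card"],
})

for col in sample.columns:
    print(col, top_share(sample[col]))
```

The top categories in this 10-row sample differ from those in the full dataset, which is expected for such a small slice.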