# Introduction


The goal of this project is to uncover key insights through a statistical analysis of Katingos' sales data from January to September 2025. The client also wants to determine, with statistical evidence, which e-commerce platform generates the highest sales for their baby product line and their business product line. The insights from this analysis will help the marketing team optimize its advertising strategies and improve overall campaign performance.

# Coding

## Loading libraries


Let's first import all the libraries we are going to use.

In [5]:
import pandas as pd
import numpy as np
import math as mt
from scipy import stats as st
from matplotlib import pyplot as plt


## Loading data

Now we load the data from a CSV file containing all sales from January to September 2025.

In [7]:
sales = pd.read_csv('jan_sep_2025_sales.csv', encoding='latin-1')


## Exploratory Data Analysis (EDA)


First, let's take a quick look at the data.

In [10]:

print(sales.info())
print(sales.sample(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1384 entries, 0 to 1383
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Month                    1384 non-null   int64  
 1   Week                     1384 non-null   int64  
 2   Channel                  1384 non-null   object 
 3   State                    1366 non-null   object 
 4   Baby Peanut 250g         700 non-null    float64
 5   Crunchy Peanut 250g      118 non-null    float64
 6   Baby Almond 250g         646 non-null    float64
 7   Vainilla Almond 250g     109 non-null    float64
 8   Chocolate Hazelnut 250g  102 non-null    float64
 9   Baby  Peanut 1kg         170 non-null    float64
 10  Baby Almond 1kg          11 non-null     float64
 11  Crunchy Peanut 1kg       155 non-null    float64
 12  Vainilla Almond 1kg      11 non-null     float64
 13  Chocolate Hazelnut 1kg   12 non-null     float64
 14  Non Crunchy Peanut 4kg  


Looking at the dataset, we can confirm that it is a sales table in which each row is a separate ticket for a specific customer. That is why there are NaN values: each customer buys only some of the products. There are a few things we can do to improve the dataset:  
  
1. For consistency, we'll standardize the column names by converting them to lowercase and replacing spaces with underscores.  
2. To simplify data wrangling, we can replace the NaN values with zeros, meaning that the customer didn't buy that product.  
3. Also to simplify data wrangling, and because customers do not buy fractions of a product, we can convert the product columns to integer type.

In [None]:
# First we convert the column names to lowercase and replace spaces with underscores.
new_col_name = []

for old_name in sales.columns:
    low_name = old_name.lower()
    # split() with no arguments also collapses repeated spaces (e.g. 'Baby  Peanut 1kg').
    stripped_name = '_'.join(low_name.split())
    new_col_name.append(stripped_name)

sales.columns = new_col_name

# Now let's impute zeros in the products columns.
exclude_col = ['month', 'week', 'channel', 'state']
col_to_fill = sales.columns.difference(exclude_col)
sales[col_to_fill] = sales[col_to_fill].fillna(0)

# Finally we change the products columns type to integer.
sales[col_to_fill] = sales[col_to_fill].astype('int')
print(sales.head())

   month  week channel      state  baby_peanut_250g  crunchy_peanut_250g  \
0      1     1      AZ    Jalisco                 1                    0   
1      1     1      AZ  Querétaro                 1                    0   
2      1     1      AZ       León                 1                    0   
3      1     1      AZ    Jalisco                 0                    0   
4      1     1      AZ       CDMX                 1                    0   

   baby_almond_250g  vainilla_almond_250g  chocolate_hazelnut_250g  \
0                 1                     0                        0   
1                 1                     0                        0   
2                 1                     0                        0   
3                 0                     0                        0   
4                 1                     0                        0   

   baby_peanut_1kg  baby_almond_1kg  crunchy_peanut_1kg  vainilla_almond_1kg  \
0                0                0       
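Step 3 above relies on every quantity being a whole number. A minimal sketch of how that assumption could be checked before casting is shown here on a toy table; the values and the `ML` channel label are hypothetical and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sales table (hypothetical values, not the real data).
toy = pd.DataFrame({
    "channel": ["AZ", "AZ", "ML"],
    "baby_peanut_250g": [1.0, np.nan, 2.0],
    "crunchy_peanut_1kg": [np.nan, 3.0, np.nan],
})

# Every column except the descriptive one is treated as a product column.
product_cols = toy.columns.difference(["channel"])

# NaN means the product was not on the ticket, so it becomes zero.
filled = toy[product_cols].fillna(0)

# True only if every quantity equals its rounded value, i.e. casting loses nothing.
is_whole = bool((filled == filled.round()).all().all())

if is_whole:
    toy[product_cols] = filled.astype("int64")

print(is_whole)
print(toy.dtypes)
```

If the check came back `False`, the fractional rows would be worth inspecting before any conversion, since `astype('int')` truncates silently.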


## Data Wrangling


For the purposes of this project, we need to convert the number of products sold into monetary value. To simplify the analysis, it is also a good idea to calculate the total sales for each product line we want to A/B test. We'll define three product lines and calculate the sales amount for each of them.

In [22]:

# Let's calculate the sales revenue for the baby product line.
sales['baby_sales'] = (sales['baby_peanut_250g']*100)+(sales['baby_almond_250g']*200)+(sales['baby_peanut_1kg']*280)+(sales['baby_almond_1kg']*600)

# Now we do the same for the business product line.
sales['business_sales'] = (sales['non_crunchy_peanut_4kg']*500)+(sales['crunchy_peanut_4kg']*500)+(sales['vainilla_almond_4kg']*2000)+(sales['chocolate_hazelnut_4kg']*2000)

# The rest of the products are considered as the wellness product line.
sales['wellness_sales'] = (sales['crunchy_peanut_250g']*100)+(sales['vainilla_almond_250g']*200)+(sales['chocolate_hazelnut_250g']*200)+(sales['crunchy_peanut_1kg']*280)+(sales['vainilla_almond_1kg']*600)+(sales['chocolate_hazelnut_1kg']*600)
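The same calculation can also be written with a price dictionary, which keeps each unit price next to its column name. The sketch below is one possible alternative, not the notebook's own method: the helper `line_revenue` and the toy table are hypothetical, and the prices are copied from the baby product line in the cell above.

```python
import pandas as pd

# Unit prices for the baby product line, copied from the cell above.
baby_prices = {
    "baby_peanut_250g": 100,
    "baby_almond_250g": 200,
    "baby_peanut_1kg": 280,
    "baby_almond_1kg": 600,
}


def line_revenue(df, prices):
    """Return per-row revenue: quantity times unit price, summed over the priced columns."""
    cols = list(prices)
    # mul() aligns the price Series with the column labels before multiplying.
    return df[cols].mul(pd.Series(prices)).sum(axis=1)


# Toy tickets (hypothetical quantities, not the real data).
toy = pd.DataFrame({
    "baby_peanut_250g": [1, 0, 2],
    "baby_almond_250g": [1, 0, 0],
    "baby_peanut_1kg": [0, 1, 0],
    "baby_almond_1kg": [0, 0, 1],
})

toy["baby_sales"] = line_revenue(toy, baby_prices)
print(toy["baby_sales"].tolist())  # → [300, 280, 800]
```

With this layout, changing a price or adding a product to a line means editing one dictionary entry rather than a long arithmetic expression.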


## Statistical Analysis


We aim to uncover key insights for Katingos’ Marketing Department by analyzing various aspects of sales performance, such as