# **TASK**

## Analyze a small, publicly available dataset using Python's NumPy or Pandas libraries.                                                                                                
1. Explore the data:

Print the first few rows using data.head().
Check the data types of each column using data.info().
Get summary statistics (mean, median, standard deviation, etc.) using data.describe().

2. Calculate basic statistics:

Calculate the mean, median, mode, and standard deviation for each numerical feature.
Calculate the correlation between different features using data.corr().

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
df = pd.read_csv("snicker_dataset_with_dates.csv")

### I downloaded this dataset from Kaggle. It was uploaded by a Kaggle user named ComHek. The dataset is licensed under MIT.

# About Dataset
This dataset contains detailed information related to sneakers product data, enriched with date annotations. It is designed to support exploratory data analysis, time series forecasting and trend analysis involving consumer behavior or product availability over time.

Key Features
1). Comprehensive Product Data: Includes multiple attributes potentially relevant to sneakers inventory, sales, or distribution.

2). Time-Based Structure: A dedicated date column allows for temporal analysis, making the dataset suitable for trend detection, forecasting models, and seasonality insights.

3). Clean and Structured Format: The dataset has been preprocessed for ease of use in data science projects.

# *Task:* Explore the data:

## *Task:* Print the first few rows using data.head().

In [28]:
print(df.shape)
df.head()

(500, 20)


Unnamed: 0,name,type,total_produced,total_sold,damaged,month,year,edition,price,gender,sell_through_rate,damage_rate,unsold_inventory,estimated_revenue,quarter,date,is_limited_edition,price_bucket,manufacturing_date,selling_date
0,Nike Air Force 1 '07 Sneakers,Creamy,24592,5819,9012,September,2017,Limited,115,Men,23.662167,36.646064,9761,669185,Q3,2017-09-01,1,High,2017-09-01,2017-11-06
1,Adidas Originals Samba OG Shoes,Creamy,81482,16395,38698,September,2019,Standard,100,Men,20.121008,47.492698,26389,1639500,Q3,2019-09-01,0,Mid,2019-09-01,2019-09-30
2,Air Jordan 1 Mid Shoes,Peanut Butter,76237,8478,26062,January,2023,Special Release,110,Women,11.120584,34.1855,41697,932580,Q1,2023-01-01,1,High,2023-01-01,2023-03-28
3,Red Tape Casual Sneakers,Brownie,46463,5425,10463,October,2022,Anniversary,35,Men,11.675957,22.518994,30575,189875,Q4,2022-10-01,0,Low,2022-10-01,2022-12-15
4,Nike Court Vision Low Shoes,Peanut Butter,54118,11698,6078,April,2019,Standard,80,Women,21.615729,11.231014,36342,935840,Q2,2019-04-01,0,Mid,2019-04-01,2019-05-06


### There are 500 rows and 20 columns in this dataset, so it qualifies as a small dataset.

### From the overview of the first 5 rows of the data, this is a dataset about the sale of sneakers of various types and editions, as can be seen from the 'name', 'type' and 'edition' columns.
* The dataset shows how many units of each product were sold, how many were damaged, and how many remain unsold (probably still in the storehouse). It also shows the rates at which sales and damage occur, as percentages of the total produced. The 'total_produced', 'total_sold', 'damaged', 'sell_through_rate', 'damage_rate' and 'unsold_inventory' columns hold this info.
* There is also time series info about each product in the dataset. This info can be found in the 'month', 'year', 'quarter', 'date', 'manufacturing_date' and 'selling_date' columns.
* The 'price', 'estimated_revenue' and 'price_bucket' columns contain some financial info about the products.
* The 'gender' column shows that the products are for either the male or female gender.
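
The relationships described above can be checked against the five rows printed by `df.head()`. The following is a minimal sketch, not part of the original notebook: it re-types those five rows into a small frame (the name `sample` is an assumption) and tests whether the rate, inventory and revenue columns can be derived from the raw counts.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(), re-typed so the check runs without the CSV.
sample = pd.DataFrame({
    "total_produced":    [24592, 81482, 76237, 46463, 54118],
    "total_sold":        [5819, 16395, 8478, 5425, 11698],
    "damaged":           [9012, 38698, 26062, 10463, 6078],
    "price":             [115, 100, 110, 35, 80],
    "sell_through_rate": [23.662167, 20.121008, 11.120584, 11.675957, 21.615729],
    "damage_rate":       [36.646064, 47.492698, 34.1855, 22.518994, 11.231014],
    "unsold_inventory":  [9761, 26389, 41697, 30575, 36342],
    "estimated_revenue": [669185, 1639500, 932580, 189875, 935840],
})

# Rates as a percentage of total_produced (tolerance covers the rounded display).
sell_rate_ok = np.allclose(
    sample["total_sold"] / sample["total_produced"] * 100,
    sample["sell_through_rate"], atol=1e-4)
damage_rate_ok = np.allclose(
    sample["damaged"] / sample["total_produced"] * 100,
    sample["damage_rate"], atol=1e-4)

# Unsold inventory is what remains after sales and damage.
unsold_ok = (sample["total_produced"] - sample["total_sold"] - sample["damaged"]
             == sample["unsold_inventory"]).all()

# Estimated revenue is price times units sold.
revenue_ok = (sample["price"] * sample["total_sold"]
              == sample["estimated_revenue"]).all()

print(sell_rate_ok, damage_rate_ok, unsold_ok, revenue_ok)
```

All four checks hold for these five rows; running the same comparisons on the full `df` would show whether they hold for all 500.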

## *Task:* Check the data types of each column using data.info().

In [19]:
df.dtypes

name                   object
type                   object
total_produced          int64
total_sold              int64
damaged                 int64
month                  object
year                    int64
edition                object
price                   int64
gender                 object
sell_through_rate     float64
damage_rate           float64
unsold_inventory        int64
estimated_revenue       int64
quarter                object
date                   object
is_limited_edition      int64
price_bucket           object
manufacturing_date     object
selling_date           object
dtype: object

### Almost all the columns are of the right data type. The exceptions are the 'date', 'manufacturing_date' and 'selling_date' columns, which should be of the datetime64[ns] data type but are stored as _object_ (strings).
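
Converting such columns is usually a one-liner per column with `pd.to_datetime`. The sketch below is an illustration, not part of the original notebook: it uses three date strings copied from the sample rows, and the frame name `dates` and the column `days_to_sell` are assumptions introduced here.

```python
import pandas as pd

# Three date pairs copied from the sample rows printed earlier.
dates = pd.DataFrame({
    "manufacturing_date": ["2017-09-01", "2019-09-01", "2023-01-01"],
    "selling_date":       ["2017-11-06", "2019-09-30", "2023-03-28"],
})

# Parse the strings into real datetime values.
for col in ["manufacturing_date", "selling_date"]:
    dates[col] = pd.to_datetime(dates[col], format="%Y-%m-%d")

# With datetime columns, date arithmetic becomes possible.
# 'days_to_sell' is a hypothetical helper column, not in the dataset.
dates["days_to_sell"] = (dates["selling_date"] - dates["manufacturing_date"]).dt.days

print(dates.dtypes)
print(dates["days_to_sell"].tolist())  # [66, 29, 86]
```

On the full frame, the same loop over `['date', 'manufacturing_date', 'selling_date']` should work, assuming every value follows the `YYYY-MM-DD` pattern seen in the first rows.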

## *Task:* Get summary statistics (mean, median, standard deviation, etc.) using data.describe().

In [21]:
df.describe()

Unnamed: 0,total_produced,total_sold,damaged,year,price,sell_through_rate,damage_rate,unsold_inventory,estimated_revenue,is_limited_edition
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,53757.092,29288.5,12631.328,2019.558,86.0,55.946654,22.413722,11837.264,2494873.0,0.38
std,26087.657543,21573.376256,13908.38519,2.812553,23.023035,25.77211,19.458792,13373.964644,1970713.0,0.485873
min,10191.0,5035.0,6.0,2015.0,35.0,6.381266,0.013924,9.0,189000.0,0.0
25%,30015.0,12550.75,2366.0,2017.0,70.0,33.568112,5.892475,2492.5,1011160.0,0.0
50%,53357.5,22277.0,7993.0,2020.0,82.5,56.490418,17.676436,6646.5,1863760.0,0.0
75%,76635.0,40691.5,17290.5,2022.0,100.0,77.71783,32.808431,17282.75,3338479.0,1.0
max,99922.0,97181.0,78238.0,2024.0,140.0,99.831256,89.979847,66063.0,10689910.0,1.0


### From the statistical overview, we can see that the mean (average) of total_sold is higher than that of the damaged products. This overturns my first impression from an initial glance at some rows in the dataset. It means that, on average, more of the total produced gets sold than gets damaged. A deeper analysis will make this clearer.

* #### The significant difference between the 75th percentile (17290.50) and the maximum value (78238.00) suggests that there might be some highly damaged items that are outliers.
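
One common way to make the outlier suspicion concrete is the 1.5 × IQR rule. The sketch below applies it to the 'damaged' quartiles copied from the `df.describe()` output above. The choice of this rule is an assumption made for illustration, and other outlier definitions are equally valid.

```python
# Quartiles and maximum of 'damaged', copied from the df.describe() output.
q1, q3, max_damaged = 2366.0, 17290.5, 78238.0

# Interquartile range and the conventional upper fence.
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr}")                  # 14924.5
print(f"Upper fence: {upper_fence}")  # 39677.25
print(f"Maximum is above the fence: {max_damaged > upper_fence}")  # True
```

The maximum (78238) is roughly twice the upper fence, which supports the reading that at least some SKUs have unusually high damage counts. On the full frame, `df[df['damaged'] > upper_fence]` would list them.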

## *Task:* Calculate basic statistics:

Calculate the mean, median, mode, and standard deviation for each numerical feature.

In [236]:
numeric_columns = df[[
    'total_produced',
    'total_sold',
    'damaged',
    'price',
    'sell_through_rate',
    'damage_rate',
    'unsold_inventory',
    'estimated_revenue'
]]


summary_df = numeric_columns.agg(['mean', 'median', 'std'])



for col in summary_df.columns:
    mean_val = summary_df.loc['mean', col]
    median_val = summary_df.loc['median', col]
    std_val = summary_df.loc['std', col]
    print(f"Basic statistics for {col}:")
    print('-' * 60)
    print(f"The Mean for {col} is: {mean_val}")
    print(f"The Median for {col} is: {median_val}")
    print(f"The Standard Deviation for {col} is: {std_val}")
    print('-' * 60, "\n")


Basic statistics for total_produced:
------------------------------------------------------------
The Mean for total_produced is: 53757.092
The Median for total_produced is: 53357.5
The Standard Deviation for total_produced is: 26087.657543208432
------------------------------------------------------------ 

Basic statistics for total_sold:
------------------------------------------------------------
The Mean for total_sold is: 29288.5
The Median for total_sold is: 22277.0
The Standard Deviation for total_sold is: 21573.37625602836
------------------------------------------------------------ 

Basic statistics for damaged:
------------------------------------------------------------
The Mean for damaged is: 12631.328
The Median for damaged is: 7993.0
The Standard Deviation for damaged is: 13908.385189718714
------------------------------------------------------------ 

Basic statistics for price:
------------------------------------------------------------
The Mean for price is: 86.0
...
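
The task also asks for the mode, which the loop above does not report. The sketch below shows how `DataFrame.mode()` behaves. The values in `toy` are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Illustrative values only, not taken from the dataset.
toy = pd.DataFrame({
    "price":   [80, 100, 100, 115, 35],   # one mode: 100
    "damaged": [10, 20, 20, 30, 30],      # two tied modes: 20 and 30
})

# DataFrame.mode() returns one row per tied mode;
# columns with fewer modes are padded with NaN.
modes = toy.mode()
print(modes)

# The first row gives one mode per column (the smallest when there are ties).
first_mode = modes.iloc[0]
print(first_mode)
```

For continuous columns such as 'sell_through_rate', every value may be unique, in which case the mode is not very informative. On the full data, `numeric_columns.mode().iloc[0]` would give one mode per column under the same convention.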

* Total Produced has a mean of 53.8 K units and a median of 53.4 K, with a standard deviation of 26.1 K. The proximity of the mean and median suggests a roughly symmetric distribution around 53 K, though the spread is considerable (±26 K).

* Total Sold shows a lower central tendency (mean 29.3 K, median 22.3 K) and a substantial spread (Std = 21.6 K). Because the mean exceeds the median by about 7 K, sales volumes are right-skewed: a subset of high-volume SKUs pulls the average upward.

* Damaged units average 12.6 K per SKU, with a median of 7.99 K and Std = 13.9 K. The mean is substantially larger than the median, which indicates a right skew: half of the SKUs incur fewer than 8 K damaged units, but a minority suffer very large damage counts.

* Unsold Inventory has a mean of 11.8 K and a median of 6.65 K, with Std = 13.4 K. Again, mean > median implies right skew: some SKUs hold unusually large unsold backlogs.

* Taken together, the sold, damaged, and unsold counts account for production: on average, about 53.8 K units are produced per SKU, of which 29.3 K are sold, 12.6 K are damaged, and ~11.8 K remain unsold.

* Sell-Through Rate (i.e., total sold / total produced, as a percentage) averages 55.95% (median 56.49%) with Std = 25.77%. The mean and median are nearly equal, indicating a fairly symmetric distribution of rates around 56%. The large standard deviation (±25.8 percentage points) shows wide variation: some SKUs sell nearly 100%, others sell very little.

* Damage Rate averages 22.41% (median 17.68%) with Std = 19.46%. Here the mean exceeds the median by 4.7 points, signifying a right-skewed distribution: half of the SKUs have a damage rate below about 18%, but a tail of SKUs sees much higher rates.

* Price per unit has mean \$86.00, median \$82.50, and Std = \$23.02. Since the mean is slightly higher than the median, there are some higher-priced SKUs pulling the average upward, but overall prices cluster in the \$60–\$110 range.

* Estimated Revenue (price × total sold, which holds for the five sample rows shown earlier) averages \$2.495 million, median \$1.864 million, with Std = \$1.971 million. The mean > median pattern implies right skew: a small number of very high-revenue SKUs drive up the average.

### In summary, most SKUs cluster near the medians: ~53 K produced, ~22 K sold, ~8 K damaged, ~6.6 K unsold, ~56% sell-through, ~18% damage, and a \$82.50–\$86 price range. However, wide standard deviations and mean > median patterns highlight that a minority of SKUs with very large volumes, high damage, or high revenue create a pronounced right-tail effect in most metrics.
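
The "mean > median implies right skew" reasoning used throughout can be checked directly with a skewness coefficient. The sketch below uses a small made-up series (`toy_sales`, not from the dataset) in which one large value pulls the mean above the median.

```python
import pandas as pd

# Illustrative right-skewed values (not from the dataset):
# most are small, one is very large.
toy_sales = pd.Series([5, 6, 7, 8, 9, 10, 60])

mean_val = toy_sales.mean()      # 15.0, pulled up by the value 60
median_val = toy_sales.median()  # 8.0, unaffected by the extreme value
skewness = toy_sales.skew()      # sample skewness; positive means a right tail

print(f"mean={mean_val}, median={median_val}, skew={skewness:.2f}")
```

On the real data, `numeric_columns.skew()` would return one coefficient per column. Values near zero would be expected for 'total_produced' and 'sell_through_rate', and clearly positive values for the count and revenue columns, if the interpretation above is right.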

## SKU is an acronym for Stock-Keeping Unit. It is a unique identifier that a retailer or manufacturer assigns to each distinct product variant. Each SKU represents a specific item in an inventory.

### Common alternatives to SKU include:
* Item code
* Product code
* Item number
* Stock code
* Part number

## *Task:* Calculate the correlation between different features using data.corr().

In [151]:
correlation = numeric_columns.corr()
correlation

Unnamed: 0,total_produced,total_sold,damaged,price,sell_through_rate,damage_rate,unsold_inventory,estimated_revenue
total_produced,1.0,0.652516,0.470316,-0.01506,-0.117259,0.114951,0.408955,0.599904
total_sold,0.652516,1.0,-0.141293,-0.048292,0.619053,-0.386761,-0.193332,0.916316
damaged,0.470316,-0.141293,1.0,0.01502,-0.593323,0.86696,0.105371,-0.130416
price,-0.01506,-0.048292,0.01502,1.0,-0.023003,0.013586,0.032902,0.273046
sell_through_rate,-0.117259,0.619053,-0.593323,-0.023003,1.0,-0.662217,-0.610284,0.570843
damage_rate,0.114951,-0.386761,0.86696,0.013586,-0.662217,1.0,-0.053498,-0.357559
unsold_inventory,0.408955,-0.193332,0.105371,0.032902,-0.610284,-0.053498,1.0,-0.172281
estimated_revenue,0.599904,0.916316,-0.130416,0.273046,0.570843,-0.357559,-0.172281,1.0


* The strongest correlation (r ≈ 0.92) is between the total sold and the estimated revenue. This is expected because the estimated revenue is the product of the total sold and the price, so the more units sold, the higher the estimated revenue.

* Another strong correlation (r ≈ 0.87) is between the 'damaged' column and the 'damage_rate' column. This is understandable because the damage rate is a function of the number of damaged units. The 'total_sold' and 'sell_through_rate' columns share the same kind of relationship, but with less strength (r ≈ 0.62).

* The 'sell_through_rate' and the 'damage_rate' are negatively correlated (r ≈ -0.66) because both are shares of the same total produced: sold, damaged, and unsold units add up to the total, so a larger share for one leaves less for the others.

* It's disappointing to see that no strong correlation exists between the price and any other feature; the largest is with estimated revenue (r ≈ 0.27). Price therefore has at most a weak linear relationship with the other features, so a linear model trained to predict prices from them would likely struggle.
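
To make the last point concrete, the price correlations can be ranked by absolute value. The sketch below copies the 'price' row from the correlation matrix above into a Series. The names `price_corr` and `ranked` are introduced here for illustration.

```python
import pandas as pd

# The 'price' row of the correlation matrix above (self-correlation omitted).
price_corr = pd.Series({
    "total_produced":    -0.01506,
    "total_sold":        -0.048292,
    "damaged":            0.01502,
    "sell_through_rate": -0.023003,
    "damage_rate":        0.013586,
    "unsold_inventory":   0.032902,
    "estimated_revenue":  0.273046,
})

# Rank features by the strength of their linear relationship with price,
# ignoring the sign.
ranked = price_corr.abs().sort_values(ascending=False)
strongest_feature = ranked.index[0]

print(ranked)
print(f"Strongest: {strongest_feature} (|r| = {ranked.iloc[0]:.3f})")
```

With the matrix available as `correlation`, the equivalent one-liner would be `correlation['price'].drop('price').abs().sort_values(ascending=False)`. Pearson correlation only captures linear association, so a low value here does not rule out a non-linear relationship.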