## Quality Assessment

##### To achieve quality data, I followed these steps:

##### 1: Defined Pandas display formats:
*Set the display formats to better view and understand the data.*
##### 2: Excluded unwanted data: 
*Removed irrelevant or unnecessary data to focus on what's important.*
##### 3: Merged different tables: 
*Combined tables based on my goals and business questions to get a complete dataset.*

### Import libraries

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

1: Defined Pandas display formats:

In [4]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.set_option('display.max_rows', 1000)
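
If the two-decimal format should apply only to a single view rather than the whole notebook, `pd.option_context` can scope it temporarily. This is a minimal sketch with made-up values, not part of the original workflow:

```python
import pandas as pd

# Hypothetical sample values, not taken from the project data.
sample = pd.DataFrame({"total_paid": [44.99, 136.1534, 15.7]})

# The two-decimal format applies only inside this block.
with pd.option_context("display.float_format", "{:.2f}".format):
    formatted = sample.to_string()

print(formatted)
```

Outside the `with` block, the previous display settings are restored.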

2,3: Excluded unwanted data & Merged different tables:

*Let's review our business questions:*

**1: How should products be classified into different categories to simplify reports and analysis?**

**2: What is the distribution of product prices across different categories?**

**3: How many products are being discounted?**

**4: How big are the offered discounts as a percentage of the product prices?**

**5: How do seasonality and special dates (Christmas, Black Friday) affect sales?**

**6: How could data collection be improved?**

*Let's import our data*

In [73]:
orders = pd.read_csv("C:\\My projects\\pandas\\cleaned_data\\orders_cleaned")
order_lines = pd.read_csv("C:\\My projects\\pandas\\cleaned_data\\orderlines_cleaned")
products = pd.read_csv("C:\\My projects\\pandas\\cleaned_data\\products_cleaned")
brands = pd.read_csv("C:\\My projects\\pandas\\cleaned_data\\brands_cleaned")

*When we want to remove or manipulate data, or when we need to merge tables, we should have domain knowledge about the data or consult someone who is an expert in the business concepts.*

Here’s a description of each table and its columns:

**orders.csv** – Every row in this file represents an order.

`order_id` – a unique identifier for each order

`created_date` – a timestamp for when the order was created

`total_paid` – the total amount paid by the customer for this order, in euros

`state` – the status of the order, one of:

`“Shopping basket”` – products have been placed in the shopping basket

`“Place Order”` – the order has been placed, but is awaiting shipment details 

`“Pending”` – the order is awaiting payment confirmation

`“Completed”` – the order has been placed and paid, and the transaction is completed.

`“Cancelled”` – the order has been cancelled and the payment returned to the customer.

**orderlines.csv** – Every row represents each one of the different products involved in an order.

`id` – a unique identifier for each row in this file

`id_order` – corresponds to orders.order_id

`product_id` – an old identifier for each product, nowadays not in use

`product_quantity` – how many units of that product were purchased on that order

`sku` – stock keeping unit: a unique identifier for each product

`unit_price` – the unitary price (in euros) of each product at the moment of placing that order

`date` – timestamp for the processing of that product

**products.csv**

`sku` – stock keeping unit: a unique identifier for each product

`name` – product name

`desc` – product description

`price` – base price of the product, in euros

`promo_price` – promotional price, in euros

`in_stock` – whether or not the product was in stock at the moment of the data extraction

`type` – a numerical code for product type

**brands.csv**

`short` – the 3-character code by which the brand can be identified in the first 3 characters of products.sku

`long` – brand name
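
The relationships described above (`id_order` to `order_id`, and the first three characters of `sku` to `short`) suggest how the tables could be joined. The following is only a sketch on tiny made-up tables that mimic the column names; the actual merge may need different join types or extra filtering:

```python
import pandas as pd

# Tiny hypothetical tables that only mimic the column names described above.
orders = pd.DataFrame({
    "order_id": [1, 2],
    "state": ["Completed", "Cancelled"],
})
order_lines = pd.DataFrame({
    "id_order": [1, 1, 2],
    "sku": ["APP0023", "KIN0007", "APP0025"],
    "unit_price": [59.0, 31.99, 59.0],
})
brands = pd.DataFrame({"short": ["APP", "KIN"], "long": ["Apple", "Kingston"]})

# orderlines.id_order corresponds to orders.order_id.
merged = order_lines.merge(orders, left_on="id_order", right_on="order_id", how="left")

# The first three characters of sku identify the brand.
merged["short"] = merged["sku"].str[:3]
merged = merged.merge(brands, on="short", how="left")

print(merged[["id_order", "sku", "state", "long"]])
```

A left join keeps every order line even when no matching order or brand exists, which makes unmatched rows easy to spot afterwards.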



Let's go through it step by step for each table.

#### 1: orders

In [77]:
orders.head()

| | order_id | created_date | total_paid | state |
|---|---|---|---|---|
| 0 | 241319 | 2017-01-02 13:35:40 | 44.99 | Cancelled |
| 1 | 241423 | 2017-11-06 13:10:02 | 136.15 | Completed |
| 2 | 242832 | 2017-12-31 17:40:03 | 15.76 | Completed |
| 3 | 243330 | 2017-02-16 10:59:38 | 84.98 | Completed |
| 4 | 243784 | 2017-11-24 13:35:19 | 157.86 | Cancelled |


For the first step, I want to extract the months and years from `created_date` because it helps me interpret the effect of seasonality and special days on sales.

In [83]:
orders['month'] = orders['created_date'].dt.month

AttributeError: Can only use .dt accessor with datetimelike values

What is this error?

Let's check `orders.info()`

In [85]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226904 entries, 0 to 226903
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226904 non-null  int64  
 1   created_date  226904 non-null  object 
 2   total_paid    226904 non-null  float64
 3   state         226904 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


If you remember, in the previous notebook (cleaning), we changed the data type of `created_date` to datetime. However, now it has become "object". Why is that?

When we save the DataFrame as a CSV file and load it back, Pandas does not preserve the `datetime64[ns]` data type: the `created_date` column comes back as "object" (typically strings).

**Why this happens**:

CSV files store data as text, so the datetime column is written out as strings when the DataFrame is saved.

When we load the CSV back, Pandas does not automatically convert those strings into datetime objects unless explicitly instructed to do so.

A related side effect of the same round trip: by default, Pandas also writes the DataFrame index to the CSV as an extra column, and on reload that column is treated as regular data instead of the index.
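
The round trip can be reproduced without touching any files. This is a small sketch with a made-up two-row frame and an in-memory buffer in place of the real CSV:

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned orders table.
original = pd.DataFrame({
    "order_id": [241319, 241423],
    "created_date": pd.to_datetime(["2017-01-02 13:35:40", "2017-11-06 13:10:02"]),
})

# Write to an in-memory CSV instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)

is_datetime = pd.api.types.is_datetime64_any_dtype
print(is_datetime(original["created_date"]))  # datetime before saving
print(is_datetime(restored["created_date"]))  # no longer datetime after reloading
```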








**How to fix it**:

We convert the `created_date` column back to datetime after loading the CSV.

In [87]:
orders['created_date'] = pd.to_datetime(orders['created_date'])

In [89]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226904 entries, 0 to 226903
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   order_id      226904 non-null  int64         
 1   created_date  226904 non-null  datetime64[ns]
 2   total_paid    226904 non-null  float64       
 3   state         226904 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 6.9+ MB
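
As an alternative to converting after loading, `pd.read_csv` accepts a `parse_dates` argument that does the conversion while reading. This is a sketch on an in-memory stand-in for the file, with made-up contents:

```python
import io

import pandas as pd

# In-memory stand-in for a cleaned CSV file (hypothetical values).
csv_text = "order_id,created_date,total_paid\n241319,2017-01-02 13:35:40,44.99\n"

# parse_dates asks read_csv to convert the listed columns while loading.
orders_sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["created_date"])

print(orders_sample["created_date"].dt.month_name())
```

Either approach gives a datetime column, so the `.dt` accessor works afterwards.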


In [97]:
orders['month'] = orders['created_date'].dt.month_name()

In [99]:
orders['year'] = orders['created_date'].dt.year

In [101]:
orders.head()

| | order_id | created_date | total_paid | state | month | year |
|---|---|---|---|---|---|---|
| 0 | 241319 | 2017-01-02 13:35:40 | 44.99 | Cancelled | January | 2017 |
| 1 | 241423 | 2017-11-06 13:10:02 | 136.15 | Completed | November | 2017 |
| 2 | 242832 | 2017-12-31 17:40:03 | 15.76 | Completed | December | 2017 |
| 3 | 243330 | 2017-02-16 10:59:38 | 84.98 | Completed | February | 2017 |
| 4 | 243784 | 2017-11-24 13:35:19 | 157.86 | Cancelled | November | 2017 |


In [105]:
orders.drop(columns='created_date', inplace=True)

In [107]:
orders.head()

| | order_id | total_paid | state | month | year |
|---|---|---|---|---|---|
| 0 | 241319 | 44.99 | Cancelled | January | 2017 |
| 1 | 241423 | 136.15 | Completed | November | 2017 |
| 2 | 242832 | 15.76 | Completed | December | 2017 |
| 3 | 243330 | 84.98 | Completed | February | 2017 |
| 4 | 243784 | 157.86 | Cancelled | November | 2017 |
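
One thing to keep in mind for the seasonality question: month names are plain strings, so they sort alphabetically (April, August, ...) rather than chronologically. A possible remedy, sketched here on made-up dates, is an ordered categorical:

```python
import pandas as pd

# Hypothetical dates, deliberately out of calendar order.
dates = pd.to_datetime(["2017-11-06", "2017-01-02", "2017-12-31", "2017-02-16"])
sample = pd.DataFrame({"created_date": dates})

# Build the January..December order from pandas itself.
month_order = pd.date_range("2017-01-01", periods=12, freq="MS").month_name().tolist()

# An ordered categorical sorts by calendar position instead of alphabetically.
sample["month"] = pd.Categorical(
    sample["created_date"].dt.month_name(),
    categories=month_order,
    ordered=True,
)

print(sample.sort_values("month")["month"].tolist())
```

Note that a categorical column saved to CSV comes back as plain text, so the ordering would need to be reapplied after loading.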


#### 2: order_lines

In [110]:
order_lines.head()

| | id | id_order | product_id | product_quantity | sku | unit_price | date |
|---|---|---|---|---|---|---|---|
| 0 | 1119109 | 299539 | 0 | 1 | OTT0133 | 18.99 | 2017-01-01 00:07:19 |
| 1 | 1119110 | 299540 | 0 | 1 | LGE0043 | 399.0 | 2017-01-01 00:19:45 |
| 2 | 1119111 | 299541 | 0 | 1 | PAR0071 | 474.05 | 2017-01-01 00:20:57 |
| 3 | 1119112 | 299542 | 0 | 1 | WDT0315 | 68.39 | 2017-01-01 00:51:40 |
| 4 | 1119113 | 299543 | 0 | 1 | JBL0104 | 23.74 | 2017-01-01 01:06:38 |


We have this information:

1- `product_id` is an old identifier for each product, nowadays not in use.

2- `id` is a unique identifier for each row in this file.

3- `date` is the timestamp for the processing of that product.


For this analysis I don't need the row identifier, the unused product identifier, or the processing timestamp, so I simply drop these three columns.



In [118]:
order_lines.drop(columns=['product_id', 'id', 'date'], inplace=True)

In [120]:
order_lines.head()

| | id_order | product_quantity | sku | unit_price |
|---|---|---|---|---|
| 0 | 299539 | 1 | OTT0133 | 18.99 |
| 1 | 299540 | 1 | LGE0043 | 399.0 |
| 2 | 299541 | 1 | PAR0071 | 474.05 |
| 3 | 299542 | 1 | WDT0315 | 68.39 |
| 4 | 299543 | 1 | JBL0104 | 23.74 |


#### 3: products

In [123]:
products.head()

| | sku | name | desc | price | promo_price | in_stock |
|---|---|---|---|---|---|---|
| 0 | RAI0007 | Silver Rain Design mStand Support | Aluminum support compatible with all MacBook | 59.99 | 499.9 | 1 |
| 1 | APP0023 | Apple Mac Keyboard Keypad Spanish | USB ultrathin keyboard Apple Mac Spanish. | 59.0 | 590.0 | 0 |
| 2 | APP0025 | Mighty Mouse Apple Mouse for Mac | mouse Apple USB cable. | 59.0 | 569.9 | 0 |
| 3 | APP0072 | Apple Dock to USB Cable iPhone and iPod white | IPhone dock and USB Cable Apple iPod. | 25.0 | 230.0 | 0 |
| 4 | KIN0007 | Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM | 2GB RAM Mac mini and iMac (2006/07) MacBook Pr... | 34.99 | 31.99 | 1 |


In [125]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10477 entries, 0 to 10476
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   sku          10477 non-null  object 
 1   name         10477 non-null  object 
 2   desc         10477 non-null  object 
 3   price        10477 non-null  float64
 4   promo_price  10477 non-null  float64
 5   in_stock     10477 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 491.2+ KB
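
`info()` only confirms data types and non-null counts. Since the business questions ask how large the discounts are, a value-level check may also be worth running: in the `head()` output above, several rows show a `promo_price` higher than `price`. The sketch below assumes the discount is defined as `(price - promo_price) / price`, which is my assumption rather than something stated in the data description:

```python
import pandas as pd

# Three rows copied from products.head() above; the other columns are omitted.
sample = pd.DataFrame({
    "sku": ["RAI0007", "APP0023", "KIN0007"],
    "price": [59.99, 59.0, 34.99],
    "promo_price": [499.9, 590.0, 31.99],
})

# Assumed definition: discount as a percentage of the base price.
sample["discount_pct"] = (sample["price"] - sample["promo_price"]) / sample["price"] * 100

# A promotional price above the base price yields a negative discount.
suspicious = sample[sample["promo_price"] > sample["price"]]

print(sample.round(2))
print(suspicious["sku"].tolist())
```

Rows flagged this way are candidates for a closer look before any discount statistics are computed.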


*This looks good. Let's look at the last table.*

#### 4: brands

In [129]:
brands.head()

| | short | long |
|---|---|---|
| 0 | 8MO | 8Mobility |
| 1 | ACM | Acme |
| 2 | ADN | Adonit |
| 3 | AII | Aiino |
| 4 | AKI | Akitio |


This table also looks good. Now I want to save all four tables in the `quality_data` folder.

In [None]:
# Save each table to the quality_data folder.
orders.to_csv('C:\\My projects\\pandas\\quality_data\\orders_q')
order_lines.to_csv('C:\\My projects\\pandas\\quality_data\\orderlines_q')
products.to_csv('C:\\My projects\\pandas\\quality_data\\products_q')
# Assumed file name, following the pattern of the other three tables.
brands.to_csv('C:\\My projects\\pandas\\quality_data\\brands_q')
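
By default, `to_csv` also writes the DataFrame index, which typically reappears as an `Unnamed: 0` column when the file is read back. Passing `index=False` avoids that. This is a sketch using a temporary directory and a hypothetical file name rather than the project paths:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for one of the tables.
brands_sample = pd.DataFrame({"short": ["8MO", "ACM"], "long": ["8Mobility", "Acme"]})

# A temporary directory keeps the sketch self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "brands_q.csv"
    # index=False leaves the row index out of the file.
    brands_sample.to_csv(target, index=False)
    reloaded = pd.read_csv(target)

print(reloaded.columns.tolist())
```

With `index=False`, the reloaded frame has exactly the columns that were saved.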