# orders_items_df -> SAILA

## LOAD the data to Jupyter

In [1]:
import pandas as pd

orders_items_df = pd.read_csv("data/olist_order_items_dataset.csv")

orders_items_df.head()

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.0,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.9,18.14


## Exploratory Data Analysis
- Exploratory Data Analysis to understand the data

In [2]:
orders_items_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   order_id             112650 non-null  object 
 1   order_item_id        112650 non-null  int64  
 2   product_id           112650 non-null  object 
 3   seller_id            112650 non-null  object 
 4   shipping_limit_date  112650 non-null  object 
 5   price                112650 non-null  float64
 6   freight_value        112650 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 6.0+ MB


In [26]:
orders_items_df.shape

(112650, 8)

## Columns


**order_id** 
- order unique identifier

**order_item_id** 
- sequential number identifying number of items included in the same order.

**product_id**
- product unique identifier

**seller_id**
- seller unique identifier

**shipping_limit_date**
- Shows the seller shipping limit date for handling the order over to the logistic partner.

**price**
- item price

**freight_value**
- item freight value item (if an order has more than one item the freight value is splitted between items)

In [29]:
orders_items_df.columns

Index(['order_id', 'order_item_id', 'product_id', 'seller_id',
       'shipping_limit_date', 'price', 'freight_value', 'total_freight_cost'],
      dtype='object')

## Data Transformation Overview :
  - Conducted thorough data cleaning, wrangling, and transformation on the `orders_items_df` DataFrame to enhance data quality and prepare it for analysis.
  - Employed various techniques to address statistical anomalies, missing values, and data type inconsistencies, ensuring the dataset's integrity and usability.
  - Implemented strategies to handle duplicate entries, standardize date formats, and derive new insights from the existing dataset.
  - The transformation process involved statistical analysis, data validation, and structural modifications to optimize the dataset for downstream analytical tasks.
  - Through meticulous data processing and refinement, the transformed DataFrame now exhibits improved clarity, consistency, and readiness for advanced analytical exploration.

- **Column Statistics**:
  - Total count of 'order_id' is higher than the count of unique 'order_id' due to multiple products ordered together in the same order.
  - Total count of 'product_id' is higher than the count of unique 'product_id' due to repeated purchases of the same product.
  - Total count of 'seller_id' is higher than the count of unique 'seller_id' due to repeated purchases from the same seller.

- **Missing Values Check**:
  - Checked for missing values in all columns of the DataFrame.

- **Duplicate Values Check**:
  - Checked for duplicate values in every single column and row of the DataFrame.

- **Data Type Conversion**:
  - Converted the 'shipping_limit_date' column from object to datetime format for better analysis and operations.

- **New Column Creation**:
  - Created a new column named 'total_freight_cost' in the DataFrame to store the total freight cost for each order, calculated by summing the freight values for all items associated with the same order_id.

- **Discovery of Total Count Discrepancy**:
  - Investigated the reason behind the discrepancy in the total count, noting that it could be due to duplicate entries for the same order_id with multiple product_ids, resulting in repeated 'total_freight_cost' values.

- **Primary Key Addition**:
  - Added a primary key as a unique identifier to each row in the DataFrame to ensure data integrity and facilitate database operations.

#### Statistic of the columns -- Different between total_count compared to unique_count 
- **order_id** - Reason being order_id will be repeated for each different product_id ordered together with in the same order
- **product_id** - Reason being repeated purchase for same product
- **seller_id** - Reason being repeated purchase thru same seller

In [25]:
orders_items_df.describe()
orders_items_df.describe(include='object')


Unnamed: 0,order_id,product_id,seller_id
count,112650,112650,112650
unique,98666,32951,3095
top,8272b63d03f5f79c56e9e4120aec44ef,aca2eb7d00ea1a7b8ebd4e68314663af,6560211a19b47992c3666cc44a7e94c0
freq,21,527,2033


#### Checking for missing values

In [5]:
orders_items_df.isnull().sum()


order_id               0
order_item_id          0
product_id             0
seller_id              0
shipping_limit_date    0
price                  0
freight_value          0
dtype: int64

#### Checking for duplicate values in the columns-- No duplicates values

In [6]:
duplicates = orders_items_df.duplicated()
print(duplicates.value_counts())

False    112650
Name: count, dtype: int64


#### Checking for duplicate values in each rows-- No duplicates values
*  `keep=False`, every row in the DataFrame is marked as a duplicate if it's identical to another row, regardless of whether it's the first occurrence or not. This allows you to easily identify all rows with duplicate values across all columns.
* `keep=True`, only rows that come after the first occurrence of a duplicate are marked as duplicates, while the initial occurrence is considered unique.

In [7]:
duplicates = orders_items_df.duplicated(keep=False)
print(duplicates.value_counts())


False    112650
Name: count, dtype: int64


#### Converting DataType of the 'shipping_limit_date' column from object --> datetime
- **shipping_limit_date** -->Shows the seller shipping limit date for handling the order over to the logistic partner.
- Convert to datetime: Since the column represents dates and times, converting it to datetime format would be beneficial for various analyses and operations.

In [8]:
orders_items_df['shipping_limit_date'] = pd.to_datetime(orders_items_df['shipping_limit_date'])
orders_items_df['shipping_limit_date'].info

<bound method Series.info of 0        2017-09-19 09:45:35
1        2017-05-03 11:05:13
2        2018-01-18 14:48:30
3        2018-08-15 10:10:18
4        2017-02-13 13:57:51
                 ...        
112645   2018-05-02 04:11:01
112646   2018-07-20 04:31:48
112647   2017-10-30 17:14:25
112648   2017-08-21 00:04:32
112649   2018-06-12 17:10:13
Name: shipping_limit_date, Length: 112650, dtype: datetime64[ns]>

#### 'order_item_id' --> sequential number identifying number of items included in the same order.
* This tallies with **payment_sequential** column from order_payments_df

In [10]:
sequential_counts = orders_items_df.groupby('order_item_id').size()
print(sequential_counts)

order_item_id
1     98666
2      9803
3      2287
4       965
5       460
6       256
7        58
8        36
9        28
10       25
11       17
12       13
13        8
14        7
15        5
16        3
17        3
18        3
19        3
20        3
21        1
dtype: int64


## Create a new COLUMN 'total_freight_cost'
- DataFrame **orders_items_df** will have a new column named 'total_freight_cost' indicating the total freight cost for each order, calculated by summing the freight values for all items associated with the same order_id.
- First calculate the total freight cost for each order by summing the freight values for all items associated with the same order_id. Then create a new column in the DataFrame to store these total freight costs


In [12]:
# Calculate total freight cost for each order
total_freight_per_order = orders_items_df.groupby('order_id')['freight_value'].sum().reset_index()

# Merge total freight cost with the original DataFrame based on order_id
orders_items_df = pd.merge(orders_items_df, total_freight_per_order, on='order_id', suffixes=('', '_total_freight'))

# Drop the extra freight_value column and rename the new column to indicate total freight cost

orders_items_df.rename(columns={'freight_value_total_freight': 'total_freight_cost'}, inplace=True)




In [13]:
orders_items_df.head(20)


Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value,total_freight_cost
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.0,17.87,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.9,18.14,18.14
5,00048cc3ae777c65dbb7d2a0634bc1ea,1,ef92defde845ab8450f9d70c526ef70f,6426d21aca402a131fc0a5d0960a3c90,2017-05-23 03:55:27,21.9,12.69,12.69
6,00054e8431b9d7675808bcb819fb4a32,1,8d4f2bb7e93e6710a28f34fa83ee7d28,7040e82f899a04d1b434b795a43b4617,2017-12-14 12:10:31,19.9,11.85,11.85
7,000576fe39319847cbb9d288c5617fa6,1,557d850972a7d6f792fd18ae1400d9b6,5996cddab893a4652a15592fb58ab8db,2018-07-10 12:30:45,810.0,70.75,70.75
8,0005a1a1728c9d785b8e2b08b904576c,1,310ae3c140ff94b03219ad0adc3c778f,a416b6a846a11724393025641d4edd5e,2018-03-26 18:31:29,145.95,11.65,11.65
9,0005f50442cb953dcd1d21e1fb923495,1,4535b0e1091c278dfd193e5a1d63b39f,ba143b05f0110f0dc71ad71b4466ce92,2018-07-06 14:10:56,53.99,11.4,11.4


### Displays Display rows where total_freight_cost is higher than freight_value

In [14]:
# Display rows where total_freight_cost is higher than freight_value
rows_with_higher_total_freight_cost = orders_items_df[orders_items_df['total_freight_cost'] > orders_items_df['freight_value']]

print(rows_with_higher_total_freight_cost)



                                order_id  order_item_id  \
13      0008288aa423d2a3f00fcb17cd7d8719              1   
14      0008288aa423d2a3f00fcb17cd7d8719              2   
32      00143d0f86d6fbd9f9b38ab440ac16f5              1   
33      00143d0f86d6fbd9f9b38ab440ac16f5              2   
34      00143d0f86d6fbd9f9b38ab440ac16f5              3   
...                                  ...            ...   
112635  fff8287bbae429a99bb7e8c21d151c41              2   
112640  fffb9224b6fc7c43ebb0904318b10b5f              1   
112641  fffb9224b6fc7c43ebb0904318b10b5f              2   
112642  fffb9224b6fc7c43ebb0904318b10b5f              3   
112643  fffb9224b6fc7c43ebb0904318b10b5f              4   

                              product_id                         seller_id  \
13      368c6c730842d78016ad823897a372db  1f50f920176fa81dab994f9023523100   
14      368c6c730842d78016ad823897a372db  1f50f920176fa81dab994f9023523100   
32      e95ee6822b66ac6058e2e4aff656071a  a17f621c590ea0f

In [15]:
# Display count of rows
print("Number of rows with higher total freight cost:", rows_with_higher_total_freight_cost.shape[0])


Number of rows with higher total freight cost: 23706


### Discovery : Why the total count is 23706 when orders with > 1 purchase (Total:112650 - Unique:98666 = 13984)
* Look at the rows below for 1 same order_id with 4 product_id the 'total_freight_cost' is repeated thus the additional duplicated count

In [19]:
# Group by order_id and count the number of unique product_ids for each order_id
order_product_count = orders_items_df.groupby('order_id')['product_id'].nunique()

# Filter orders with multiple product_ids
orders_with_multiple_products = order_product_count[order_product_count > 2]

# Display all rows for orders with multiple product_ids
orders_with_multiple_products_df = orders_items_df[orders_items_df['order_id'].isin(orders_with_multiple_products.index)]

orders_with_multiple_products_df


Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value,total_freight_cost
296,00bcee890eba57a9767c7b5ca12d3a1b,1,6c90c0f6c2d89eb816b9e205b9d6a36a,3bb548e3cb7f70f28e3f11ee9dce0e59,2017-07-26 21:05:07,165.50,15.80,100.09
297,00bcee890eba57a9767c7b5ca12d3a1b,2,b7d94dc0640c7025dc8e3b46b52d8239,9c0e69c7bf2619675bbadf47b43f655a,2017-07-26 21:05:07,175.91,52.69,100.09
298,00bcee890eba57a9767c7b5ca12d3a1b,3,d143bf43abb18593fa8ed20cc990ae84,3bb548e3cb7f70f28e3f11ee9dce0e59,2017-07-26 21:05:07,165.50,15.80,100.09
299,00bcee890eba57a9767c7b5ca12d3a1b,4,55939df5d8d2b853fbc532bf8a00dc32,3bb548e3cb7f70f28e3f11ee9dce0e59,2017-07-26 21:05:07,165.50,15.80,100.09
505,012a238ab54294a3b365812ccc82b135,1,07c055536ebf10dfbb6c6db6dbfc36e5,cca3071e3e9bb7d12640c9fbe2301306,2017-09-21 18:35:10,45.90,12.65,38.07
...,...,...,...,...,...,...,...,...
112532,ffb8f7de8940249a3221252818937ecb,3,bd0ac51dc93e62c4dbe6ca9d70a9b311,1d4587203296c8f4ad134dc286fa6db0,2018-07-27 09:04:32,64.50,42.47,55.63
112533,ffb9a9cd00c74c11c24aa30b3d78e03b,1,fec565c4e3ad965c73fb1a21bb809257,da8622b14eb17ae2831f4ac5b9dab84a,2017-03-22 17:20:21,89.90,18.34,77.35
112534,ffb9a9cd00c74c11c24aa30b3d78e03b,2,fec565c4e3ad965c73fb1a21bb809257,da8622b14eb17ae2831f4ac5b9dab84a,2017-03-22 17:20:21,89.90,18.34,77.35
112535,ffb9a9cd00c74c11c24aa30b3d78e03b,3,03bb06cda40712fb8473f7962fb7d198,da8622b14eb17ae2831f4ac5b9dab84a,2017-03-22 17:20:21,129.90,18.49,77.35


### Add Primary key as unique identifier to each rows

In [31]:
# Add a new column 'id_pk' to the DataFrame
orders_items_df['id_pk'] = range(1, len(orders_items_df) + 1)

# Set the 'id_pk' column as the index
orders_items_df.set_index('id_pk', inplace=True)

In [32]:
orders_items_df

Unnamed: 0_level_0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value,total_freight_cost
id_pk,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.90,13.29,13.29
2,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.90,19.93,19.93
3,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.00,17.87,17.87
4,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79,12.79
5,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.90,18.14,18.14
...,...,...,...,...,...,...,...,...
112646,fffc94f6ce00a00581880bf54a75a037,1,4aa6014eceb682077f9dc4bffebc05b0,b8bc237ba3788b23da09c0f1f3a3288c,2018-05-02 04:11:01,299.99,43.41,43.41
112647,fffcd46ef2263f404302a634eb57f7eb,1,32e07fd915822b0765e448c4dd74c828,f3c38ab652836d21de61fb8314b69182,2018-07-20 04:31:48,350.00,36.53,36.53
112648,fffce4705a9662cd70adb13d4a31832d,1,72a30483855e2eafc67aee5dc2560482,c3cfdc648177fdbbbb35635a37472c53,2017-10-30 17:14:25,99.90,16.95,16.95
112649,fffe18544ffabc95dfada21779c9644f,1,9c422a519119dcad7575db5af1ba540e,2b3e4a2a3ea8e01938cabda2a3e5cc79,2017-08-21 00:04:32,55.99,8.72,8.72


## Converteing Dataframe to .csv file

In [35]:

# Define the file path where you want to save the CSV file
csv_file_path = "cleaned_orders_items.csv"

# Convert the DataFrame to a CSV file
orders_items_df.to_csv(csv_file_path, index= True)

# Confirmation message
print("DataFrame converted to CSV successfully.")


DataFrame converted to CSV successfully.
