# Data Consolidation Process

# Introduction

In the initial phase of this Exploratory Data Analysis (EDA), the focus is on consolidating several separate CSV files into a single dataset. Each CSV file is first loaded into its own DataFrame. The DataFrames are then merged on their shared key columns: `order_id`, `customer_id`, `product_id`, and `seller_id`. The merged result is held in a single DataFrame and saved as one CSV file, which serves as the foundation for further analysis.

In [1]:
# import the relevant libraries
import numpy as np
import pandas as pd
from pandas import DataFrame

To begin, each CSV file is loaded into the notebook as a pandas DataFrame, as shown below.

In [2]:
# Load the datasets 
customer_info = pd.read_csv('customer_info.csv')
product_info = pd.read_csv('product_info.csv')
employee_info = pd.read_csv('employee_info.csv')
invoice_product_info = pd.read_csv('invoice_products_info.csv')
invoice_payment_info = pd.read_csv('invoice_payment_info.csv')
invoice_porducts_status = pd.read_csv('invoice_porducts_status.csv')
invoice_reviews = pd.read_csv('invoice_reviews.csv')
lat_log = pd.read_csv('lat&lon.csv')
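
As a hedged alternative to loading each file by name, the sketch below reads every CSV file in a folder into a dictionary and reports the shape of each table. The helper names `load_csv_folder` and `summarize` are hypothetical and are not part of the original notebook; the sketch also assumes all files sit in one folder and can be read with the default `pd.read_csv` settings.

```python
from pathlib import Path

import pandas as pd


def load_csv_folder(folder):
    """Read every *.csv file in `folder` into a dict keyed by file stem.

    Hypothetical helper: the notebook itself loads each file by name.
    """
    frames = {}
    for path in sorted(Path(folder).glob("*.csv")):
        frames[path.stem] = pd.read_csv(path)
    return frames


def summarize(frames):
    """Return (name, n_rows, n_cols) for each loaded table."""
    return [(name, df.shape[0], df.shape[1]) for name, df in frames.items()]
```

Checking the row and column counts of each table at this point makes it easier to notice unexpected row growth or loss after the merge.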

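In the next step, the invoice tables are merged on `order_id`, and the result is then joined with `customer_info` on `customer_id`, `product_info` on `product_id`, and `employee_info` on `seller_id`. The `lat_log` table is loaded but is not part of this merge.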

In [3]:
# Merge the datasets
# Invoice-level tables are joined on order_id
df1 = pd.merge(invoice_product_info, invoice_payment_info, on='order_id')
df2 = pd.merge(df1, invoice_porducts_status, on='order_id')
df3 = pd.merge(df2, invoice_reviews, on='order_id')
# Customer, product, and seller details are joined on their own keys
df4 = pd.merge(df3, customer_info, on='customer_id')
df5 = pd.merge(df4, product_info, on='product_id')
Total_Olist_df = pd.merge(df5, employee_info, on='seller_id')
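
Because `pd.merge` performs an inner join by default, rows whose key is missing from either table are dropped, and a key that appears several times on one side repeats the matching row from the other side. The sketch below illustrates this on two toy tables; the data is invented for illustration and is not taken from the Olist files. It also shows two optional checks, `indicator=True` and `validate=`, that the original notebook does not use.

```python
import pandas as pd

# Toy tables (invented for illustration, not taken from the Olist files).
items = pd.DataFrame({
    "order_id": ["a", "a", "b", "c"],
    "product_id": ["p1", "p2", "p1", "p3"],
})
payments = pd.DataFrame({
    "order_id": ["a", "b", "d"],
    "payment_type": ["credit_card", "boleto", "credit_card"],
})

# Default inner join: only order_ids present in both tables survive.
inner = pd.merge(items, payments, on="order_id")

# An outer join with indicator=True adds a `_merge` column that records
# whether each row came from the left table, the right table, or both.
audit = pd.merge(items, payments, on="order_id", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

# validate= raises pandas.errors.MergeError when the key relationship
# differs from the one stated (here: many items per single payment row).
checked = pd.merge(items, payments, on="order_id", validate="many_to_one")
```

Here `unmatched` lists the orders that the inner join would silently drop, which is a quick way to see how many rows each merge step loses.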

In [4]:
Total_Olist_df

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value,payment_sequential,payment_type,payment_installments,...,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,seller_zip_code_prefix,seller_city,seller_state
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 9:45,58.9,13.29,1,credit_card,2,...,58.0,598.0,4.0,650.0,28.0,9.0,14.0,27277,volta redonda,SP
1,130898c0987d1801452a8ed92a670612,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-07-05 2:44,55.9,17.96,1,boleto,1,...,58.0,598.0,4.0,650.0,28.0,9.0,14.0,27277,volta redonda,SP
2,532ed5e14e24ae1f0d735b91524b98b9,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2018-05-23 10:56,64.9,18.33,1,credit_card,2,...,58.0,598.0,4.0,650.0,28.0,9.0,14.0,27277,volta redonda,SP
3,6f8c31653edb8c83e1a739408b5ff750,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-08-07 18:55,58.9,16.17,1,credit_card,3,...,58.0,598.0,4.0,650.0,28.0,9.0,14.0,27277,volta redonda,SP
4,7d19f4ef4d04461989632411b7e588b9,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-08-16 22:05,58.9,13.29,1,credit_card,4,...,58.0,598.0,4.0,650.0,28.0,9.0,14.0,27277,volta redonda,SP
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117324,fecc4ea5a3e06ce3192ae2f05b7a8439,2,70adb75b3b2e86cffbb697c90867c3f3,4e2627090e6e5b9fabba883a37897683,2018-01-21 22:09,39.9,14.10,1,credit_card,1,...,31.0,437.0,2.0,100.0,65.0,11.0,11.0,31565,belo horizonte,MG
117325,fecc4ea5a3e06ce3192ae2f05b7a8439,3,70adb75b3b2e86cffbb697c90867c3f3,4e2627090e6e5b9fabba883a37897683,2018-01-21 22:09,39.9,14.10,1,credit_card,1,...,31.0,437.0,2.0,100.0,65.0,11.0,11.0,31565,belo horizonte,MG
117326,fecc4ea5a3e06ce3192ae2f05b7a8439,4,70adb75b3b2e86cffbb697c90867c3f3,4e2627090e6e5b9fabba883a37897683,2018-01-21 22:09,39.9,14.10,1,credit_card,1,...,31.0,437.0,2.0,100.0,65.0,11.0,11.0,31565,belo horizonte,MG
117327,ff701a7c869ad21de22a6994237c8a00,1,5ff4076c0f01eeba4f728c9e3fa2653c,3e35a8bb43569389d3cebef0ce820f69,2018-04-18 20:10,27.9,14.44,1,credit_card,1,...,28.0,242.0,1.0,2000.0,19.0,38.0,19.0,3124,sao paulo,SP


In [5]:
Total_Olist_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 117329 entries, 0 to 117328
Data columns (total 39 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   order_id                       117329 non-null  object 
 1   order_item_id                  117329 non-null  int64  
 2   product_id                     117329 non-null  object 
 3   seller_id                      117329 non-null  object 
 4   shipping_limit_date            117329 non-null  object 
 5   price                          117329 non-null  float64
 6   freight_value                  117329 non-null  float64
 7   payment_sequential             117329 non-null  int64  
 8   payment_type                   117329 non-null  object 
 9   payment_installments           117329 non-null  int64  
 10  payment_value                  117329 non-null  float64
 11  customer_id                    117329 non-null  object 
 12  order_status                  

The summary of the above DataFrame confirms that the datasets were combined successfully, giving a total of 39 columns and 117,329 rows.
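
Beyond `info()`, a few extra checks can help confirm that a merge behaved as intended. The sketch below gathers row and column counts, fully duplicated rows, null counts, and the number of distinct values per key column. The helper `merge_report` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def merge_report(df, key_columns):
    """Collect basic post-merge checks in a dict (hypothetical helper)."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        # Rows that are exact copies of an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Number of missing values in each column
        "null_counts": {col: int(n) for col, n in df.isna().sum().items()},
        # Number of distinct values in each key column
        "unique_keys": {col: int(df[col].nunique()) for col in key_columns},
    }
```

Assuming the merged DataFrame from this notebook, a call might look like `merge_report(Total_Olist_df, ['order_id', 'customer_id', 'product_id', 'seller_id'])`.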

In [7]:
# Save the merged data to a new file for further analysis
Total_Olist_df.to_csv('Olist_CombinedData.csv', index=False)

Lastly, the merged DataFrame is exported as a CSV file, `Olist_CombinedData.csv`, in the working directory.
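
The `index=False` argument matters when the file is read back later. The sketch below compares the two settings on a toy DataFrame, using in-memory buffers instead of files on disk; the data is invented for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({"order_id": ["a", "b"], "price": [58.9, 55.9]})

# With index=False only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
round_trip = pd.read_csv(without_index)

# With the default index=True the row index is written as an unnamed first
# column, which pandas labels 'Unnamed: 0' when the file is read back.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

print(list(round_trip.columns))
print(list(reloaded.columns))
```

Writing without the index keeps the saved file limited to the data columns, so later notebooks do not need to drop an extra column after loading.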

# Conclusion

The process outlined above integrates several separate CSV files into a single DataFrame. After merging on the relevant keys, we obtained a dataset containing 39 columns and 117,329 rows. This structured dataset, saved as a new CSV file, lays the groundwork for in-depth analysis in the subsequent stages.