# Customers Dataset
Contains a hash that identifies each customer, plus information about the customer's location.

## Initial Column Description


|**Column Title**|**customer_id -> str** |**customer_unique_id -> str** |**customer_zip_code_prefix -> int** |**customer_city -> str**| **customer_state -> str**|
|--|--|--|--|--|--|
|Description |Primary key for this table |Customer Identifier Number |Zip Code from Customer Location |City Name from Customer |State Code from Customer |
|Example |274fa6071e5e17fe303b9748641082c8 |84732c5050c01db9b23e19ba39899398 |06703 |cotia |SP |

### Errors found
+ The raw data for this table contained no null or empty values.
+ City names contain spelling variations and special characters, such as:
    + "santana do livramento" / "sant ana do livramento"
    + "varre-sai", "xique-xique"
    + "jaragua do sul" / "jaragua d sul" / "jaragua da sul"
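
One possible way to detect such variants is fuzzy matching against a list of canonical names. The sketch below uses `difflib` from the Python standard library; the `canonical` list, the helper name `closest_canonical`, and the `0.85` cutoff are assumptions made for illustration, not the method used in this notebook.

```python
import difflib

# Hypothetical canonical names; in practice these would come from a trusted reference list.
canonical = ["santana do livramento", "jaragua do sul", "varre-sai", "xique-xique"]

def closest_canonical(name, choices=canonical, cutoff=0.85):
    """Return the closest canonical city name, or the input unchanged if none is close enough."""
    matches = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else name

variants = ["sant ana do livramento", "jaragua d sul", "jaragua da sul", "cotia"]
fixed = [closest_canonical(v) for v in variants]
print(fixed)
```

A cutoff that is too low would merge distinct cities with similar names, so any automatic match should be reviewed before it is applied.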

## Required Libraries

In [5]:
# Import libraries here
# pandas: load and manage the dataset
import pandas as pd
# numpy: numerical operations on the data
import numpy as np
# matplotlib: data visualization
import matplotlib.pyplot as plt
# seaborn: statistical data visualization
import seaborn as sns

## Data Preprocessing


We decided to create three new tables for cities, states, and zip codes. The values in these three columns must therefore be replaced with their ids from the respective tables:

|External Table |External Column with New ID |Column to Replace |
|--|--|--|
|code_zip_prefix_dataset |code_zip_prefix_id |customer_zip_code_prefix |
|city_state_dataset |city_state_id |customer_city |
|state_dataset |state_id |customer_state|

Example:

For the first row, the values of these three columns are:

|customer_zip_code_prefix |customer_city |customer_state |
|--|--|--|
|14409 |franca |SP |

Looking in the external table **state_dataset**, we find that id **1** corresponds to state **SP**, so we replace **SP** with **1**.

|customer_zip_code_prefix |customer_city |customer_state |
|--|--|--|
|14409 |franca |1 |

We follow the same process for the zip code. Looking in **code_zip_prefix_dataset**, we replace **14409** with the new id **5353**.

|customer_zip_code_prefix |customer_city |customer_state |
|--|--|--|
|5353 |franca |1 |

The city replacement process is slightly different. Because Brazil has cities that share the same name, we append the state code to the city name, so the original **franca** becomes **franca/SP**.

|customer_zip_code_prefix |customer_city |customer_state |
|--|--|--|
|5353 |franca/SP |1 |

After this change we look up the new id in **city_state_dataset**. The corresponding id for **franca/SP** is **1842**. The row is now ready; **customer_id** and **customer_unique_id** keep their original values.

|customer_zip_code_prefix |customer_city |customer_state |
|--|--|--|
|5353 |1842 |1 |
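
The replacement steps above can be sketched in pandas with `Series.map`. This is a minimal illustration on toy one-row lookup tables: the id columns (`state_id`, `code_zip_prefix_id`, `city_state_id`) come from the table above, while the value columns (`state`, `zip_code_prefix`, `city_state`) are assumed names, since the real layout of the external tables is not shown here.

```python
import pandas as pd

# Toy lookup tables; the value column names are assumptions for illustration.
state_dataset = pd.DataFrame({"state_id": [1], "state": ["SP"]})
code_zip_prefix_dataset = pd.DataFrame(
    {"code_zip_prefix_id": [5353], "zip_code_prefix": [14409]}
)
city_state_dataset = pd.DataFrame(
    {"city_state_id": [1842], "city_state": ["franca/SP"]}
)

customers = pd.DataFrame({
    "customer_zip_code_prefix": [14409],
    "customer_city": ["franca"],
    "customer_state": ["SP"],
})

# Build the "city/STATE" key first, while the state code is still a string.
city_key = customers["customer_city"] + "/" + customers["customer_state"]

# Turn each lookup table into a value -> id Series.
state_map = state_dataset.set_index("state")["state_id"]
zip_map = code_zip_prefix_dataset.set_index("zip_code_prefix")["code_zip_prefix_id"]
city_map = city_state_dataset.set_index("city_state")["city_state_id"]

# Replace each column with the id from its lookup table.
customers["customer_state"] = customers["customer_state"].map(state_map)
customers["customer_zip_code_prefix"] = customers["customer_zip_code_prefix"].map(zip_map)
customers["customer_city"] = city_key.map(city_map)

print(customers)
```

Values with no match in a lookup table become `NaN` under `map`, so checking the result with `isna()` after each replacement is a cheap way to spot city names that still need correction.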

### Data Correction

#### Replace city name with its corresponding ID

#### Create CSV

When saving the dataset, always pass **"index = False"**; otherwise pandas writes the row index as an extra column of consecutive numbers. This small script removes that useless column.
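
A minimal sketch of both situations, using an in-memory buffer instead of a real file. The toy DataFrame and variable names are assumptions; `to_csv`, `read_csv`, and `drop` are standard pandas calls.

```python
import io
import pandas as pd

df = pd.DataFrame({"customer_id": ["a", "b"], "customer_state": [1, 2]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)  # gains an extra "Unnamed: 0" column

# Repair an already-saved file by dropping that column.
repaired = reloaded.drop(columns=["Unnamed: 0"])

# Preferred: do not write the index in the first place.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)

print(list(reloaded.columns))
print(list(clean.columns))
```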

## Final Column Description

|**Column Title**|**customer_id -> str** |**customer_unique_id -> str** |**customer_zip_code_prefix -> int** |**customer_city -> int**| **customer_state -> int**|
|--|--|--|--|--|--|
|Description |Primary key for this table |Customer Identifier Number |code_zip_prefix_id from code_zip_prefix_dataset |city_state_id from city_state_dataset |state_id from state_dataset |
|Before Preprocessing |274fa6071e5e17fe303b9748641082c8 |84732c5050c01db9b23e19ba39899398 |06703 |cotia |SP |
|After Preprocessing |274fa6071e5e17fe303b9748641082c8 |84732c5050c01db9b23e19ba39899398 |3354 |1437 |1 |

In [11]:
# Import data
df = pd.read_csv("../../data/raw/olist_order_payments_dataset.csv")

In [12]:
# Use head() to display the first rows
df.head()

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99.33
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24.39
2,25e8ea4e93396b6fa0d3dd708e76c1bd,1,credit_card,1,65.71
3,ba78997921bbcdc1373bb41e913ab953,1,credit_card,8,107.78
4,42fdf880ba16b47b59251dd489d4441a,1,credit_card,2,128.45


In [24]:
df.shape

(103886, 5)

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103886 entries, 0 to 103885
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   order_id              103886 non-null  object 
 1   payment_sequential    103886 non-null  int64  
 2   payment_type          103886 non-null  object 
 3   payment_installments  103886 non-null  int64  
 4   payment_value         103886 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 4.0+ MB


In [28]:
df.describe()

Unnamed: 0,payment_sequential,payment_installments,payment_value
count,103886.0,103886.0,103886.0
mean,1.092679,2.853349,154.10038
std,0.706584,2.687051,217.494064
min,1.0,0.0,0.0
25%,1.0,1.0,56.79
50%,1.0,1.0,100.0
75%,1.0,4.0,171.8375
max,29.0,24.0,13664.08


In [35]:
len(df['order_id'].unique())

99440

In [20]:
df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
103881    False
103882    False
103883    False
103884    False
103885    False
Length: 103886, dtype: bool

In [22]:
df[df.duplicated(subset=["order_id"])]

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
1456,683bf306149bb869980b68d48a1bd6ab,1,credit_card,1,8.58
2324,e6a66a8350bb88497954d37688ab123e,2,voucher,1,10.51
2393,8e5148bee82a7e42c5f9ba76161dc51a,1,credit_card,1,0.67
2414,816ccd9d21435796e8ffa9802b2a782f,1,credit_card,1,5.65
2497,2cbcb371aee438c59b722a21d83597e0,2,voucher,1,7.80
...,...,...,...,...,...
103778,fd86c80924b4be8fb7f58c4ecc680dae,1,credit_card,1,76.10
103817,6d4616de4341417e17978fe57aec1c46,1,credit_card,1,19.18
103860,31bc09fdbd701a7a4f9b55b5955b8687,6,voucher,1,77.99
103869,c9b01bef18eb84888f0fd071b8413b38,1,credit_card,6,238.16
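
Rows that repeat an `order_id` are not necessarily exact duplicates: `df.duplicated()` compares every column, while `duplicated(subset=["order_id"])` flags only the second and later rows of each order. The sketch below, on a small made-up frame, shows one way to separate the two cases; the toy data and variable names are assumptions.

```python
import pandas as pd

# Made-up payments: order "a" has two distinct rows, order "c" has an exact duplicate.
payments = pd.DataFrame({
    "order_id": ["a", "a", "b", "c", "c"],
    "payment_sequential": [1, 2, 1, 1, 1],
    "payment_value": [10.0, 5.0, 20.0, 7.5, 7.5],
})

# Rows identical to an earlier row in every column.
full_dupes = payments[payments.duplicated()]

# Every row of any order that appears more than once (keep=False marks all occurrences).
repeated_orders = payments[payments.duplicated(subset=["order_id"], keep=False)]

# Number of rows per order.
rows_per_order = payments.groupby("order_id").size()

print(full_dupes)
print(repeated_orders)
print(rows_per_order)
```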
