# A Look into the Sellers Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('../datasets/sellers_dataset.csv')
df

,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP
...,...,...,...,...
3090,98dddbc4601dd4443ca174359b237166,87111,sarandi,PR
3091,f8201cab383e484733266d1906e2fdfa,88137,palhoca,SC
3092,74871d19219c7d518d0090283e03c137,4650,sao paulo,SP
3093,e603cf3fec55f8697c9059638d6c8eb5,96080,pelotas,RS


## Data Exploration

### Data Description

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3095 entries, 0 to 3094
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   seller_id               3095 non-null   object
 1   seller_zip_code_prefix  3095 non-null   int64 
 2   seller_city             3095 non-null   object
 3   seller_state            3095 non-null   object
dtypes: int64(1), object(3)
memory usage: 96.8+ KB


As shown above, `sellers_dataset` has no missing values: every column has 3095 non-null entries.

We can also consider converting `seller_zip_code_prefix` to the `object` dtype, because its values are not numerical quantities. They merely encode a geographic area, which makes the column essentially categorical.
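The conversion can be sketched as follows. This is a minimal example rather than part of the original notebook: the helper name `zip_prefix_to_str` is hypothetical, and the width of five digits is an assumption (the maximum of 99730 is consistent with it, but the dataset itself does not state it). Under that assumption, zero-padding would also restore any leading zero lost when the column was read as `int64`, as may be the case for the value 4195 in the preview.

```python
import pandas as pd


def zip_prefix_to_str(prefix: pd.Series, width: int = 5) -> pd.Series:
    """Convert integer zip code prefixes to zero-padded strings.

    `width=5` is an assumption about the prefix length, not a documented fact.
    """
    return prefix.astype(str).str.zfill(width)


# Small illustrative frame; the values are copied from the preview above.
sample = pd.DataFrame({
    "seller_city": ["campinas", "sao paulo", "pelotas"],
    "seller_zip_code_prefix": [13023, 4195, 96080],
})
sample["seller_zip_code_prefix"] = zip_prefix_to_str(sample["seller_zip_code_prefix"])
print(sample)
```

On the real data the same call would be `df['seller_zip_code_prefix'] = zip_prefix_to_str(df['seller_zip_code_prefix'])`. Passing `dtype={'seller_zip_code_prefix': str}` to `pd.read_csv` is an alternative that avoids the integer round trip altogether.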

In [4]:
df.describe()

,seller_zip_code_prefix
count,3095.0
mean,32291.059451
std,32713.45383
min,1001.0
25%,7093.5
50%,14940.0
75%,64552.5
max,99730.0


In [5]:
df.describe(include=[object])

,seller_id,seller_city,seller_state
count,3095,3095,3095
unique,3095,611,23
top,9e25199f6ef7e7c347120ff175652c3b,sao paulo,SP
freq,1,694,1849


### Seller City Count

In [6]:
df.seller_city.value_counts().reset_index(name='counts')

,seller_city,counts
0,sao paulo,694
1,curitiba,127
2,rio de janeiro,96
3,belo horizonte,68
4,ribeirao preto,52
...,...,...
606,ipua,1
607,muqui,1
608,timoteo,1
609,pouso alegre,1


### Seller State Count

In [7]:
df.seller_state.value_counts().reset_index(name='counts')

,seller_state,counts
0,SP,1849
1,PR,349
2,MG,244
3,SC,190
4,RJ,171
5,RS,129
6,GO,40
7,DF,30
8,ES,23
9,BA,19
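Raw counts are easier to compare as shares of the total. From the table, SP accounts for 1849 of 3095 sellers, roughly 59.7%. The sketch below shows one way to compute such shares alongside the counts. The helper name `share_table` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def share_table(values: pd.Series, name: str = "counts") -> pd.DataFrame:
    """Return value counts together with each value's share of the total."""
    out = values.value_counts().rename(name).to_frame()
    out["share"] = out[name] / out[name].sum()
    return out


# Toy data for illustration only; these are not the real seller counts.
states = pd.Series(["SP", "SP", "SP", "PR"], name="seller_state")
table = share_table(states)
print(table)
```

On the real data, `share_table(df.seller_state)` would give the same counts as above with an added `share` column. `df.seller_state.value_counts(normalize=True)` returns the shares alone.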


### Seller Zip Code Prefix Count

In [8]:
df.seller_zip_code_prefix.astype('category').value_counts().reset_index(name='counts')

,seller_zip_code_prefix,counts
0,14940,49
1,13660,10
2,16200,9
3,13920,9
4,1026,8
...,...,...
2241,97050,1
2242,96816,1
2243,96530,1
2244,96503,1
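The most frequent prefix, 14940, appears 49 times, well ahead of the next one at 10, so it may be worth looking at the rows behind it. A minimal sketch of that lookup follows. The helper name `rows_for_top_value` and the toy frame, including its placeholder city names, are hypothetical.

```python
import pandas as pd


def rows_for_top_value(frame: pd.DataFrame, column: str):
    """Return the most frequent value in `column` and the rows that carry it."""
    top = frame[column].value_counts().idxmax()
    return top, frame[frame[column] == top]


# Toy frame for illustration only; cities and prefixes are placeholders.
toy = pd.DataFrame({
    "seller_city": ["city_a", "city_a", "city_b"],
    "seller_zip_code_prefix": [100, 100, 200],
})
top_prefix, top_rows = rows_for_top_value(toy, "seller_zip_code_prefix")
print(top_prefix)
print(top_rows)
```

Applied to the real data, `rows_for_top_value(df, 'seller_zip_code_prefix')` would return the sellers sharing the most common prefix, and their `seller_city` values could then be checked with `value_counts()`.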
