# Sellers dataset

This dataset includes data about the sellers that fulfilled orders made at Olist. Use it to find the seller location and to identify which seller fulfilled each product.

## Initial Column Description


|**Column Title**|**seller_id -> str**|**seller_zip_code_prefix -> int** |**seller_city -> str** |**seller_state -> str** |
|--|--|--|--|--|
|Description |seller unique identifier - PK |first 5 digits of seller zip code |seller city name |seller state |
|Example |3442f8959a84dea7ee197c632cb2df15 |13023 |campinas |SP |

## Required libraries

In [31]:
import pandas as pd
import numpy as np

## Data Preprocessing

#### Importing main and auxiliary csv files

In [32]:
sellers_csv = '../../data/raw/olist_sellers_dataset.csv'
sellers_df = pd.read_csv(sellers_csv)

In [33]:
brazil_cities = '../../data/external/brazil_cities.csv'
brazil_cities = pd.read_csv(brazil_cities)

#### Displaying entries of each dataframe

In [34]:
sellers_df

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP
...,...,...,...,...
3090,98dddbc4601dd4443ca174359b237166,87111,sarandi,PR
3091,f8201cab383e484733266d1906e2fdfa,88137,palhoca,SC
3092,74871d19219c7d518d0090283e03c137,4650,sao paulo,SP
3093,e603cf3fec55f8697c9059638d6c8eb5,96080,pelotas,RS
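
The column description speaks of the first 5 digits of the zip code, yet values such as `4195` show only four because the column is read as an integer. As a minimal sketch, assuming a five-character text form is wanted for display only, the prefix can be padded back with zeros. The toy frame `sample_df` and the column `zip_prefix_text` are hypothetical names used for illustration.

```python
import pandas as pd

# Toy frame standing in for sellers_df; the ids are made up for illustration.
sample_df = pd.DataFrame({
    'seller_id': ['a1', 'b2'],
    'seller_zip_code_prefix': [4195, 13023],
})

# Convert to text and left-pad with zeros up to 5 characters.
sample_df['zip_prefix_text'] = (
    sample_df['seller_zip_code_prefix'].astype(str).str.zfill(5)
)
print(sample_df)
```

The rest of this notebook keeps the integer column, since the later merge on the zip code prefix relies on it.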


In [35]:
brazil_cities

Unnamed: 0,code,name,state
0,5200050,abadia de goias,GO
1,3100104,abadia dos dourados,MG
2,5200100,abadiania,GO
3,1500107,abaetetuba,PA
4,3100203,abaete,MG
...,...,...,...
5565,4301552,aurea,RS
5566,4101150,angulo,PR
5567,2900504,erico cardoso,BA
5568,1505106,obidos,PA


From the earlier exploratory analysis in `notebooks/exploratory_analysis/olist_sellers_dataset.ipynb`, we can extract the following point:

- `seller_id` contains only unique values, so we can treat it as our primary key.
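
The uniqueness claim can be re-checked in a couple of lines. This is a minimal sketch on a toy frame (`sample_df` and its ids are made up); on the real data the same two checks would be run against `sellers_df`.

```python
import pandas as pd

# Toy frame standing in for sellers_df; the ids are made up for illustration.
sample_df = pd.DataFrame({
    'seller_id': ['a1', 'b2', 'c3'],
    'seller_city': ['campinas', 'sao paulo', 'sao paulo'],
})

# A primary key candidate must have no repeated and no missing values.
id_is_unique = sample_df['seller_id'].is_unique
id_has_nulls = sample_df['seller_id'].isna().any()
print(id_is_unique, id_has_nulls)
```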

The first goal is to clean up the `seller_city` column so that it matches the `name` column of the second csv.

We start with the `SP` values of both dataframes. Before comparing them, we remove the special characters of the Portuguese language from the `seller_city` column so that the comparison is more reliable.

#### Defining and substituting special characters

In [36]:
# Map accented Portuguese characters to their ASCII base letter,
# turn '-' and "'" into spaces, and drop '£', '.', '(' and ')'.
char_map = str.maketrans({
    'ã': 'a', 'â': 'a', 'á': 'a', 'à': 'a',
    'é': 'e', 'ê': 'e',
    'í': 'i', 'î': 'i',
    'ó': 'o', 'ô': 'o', 'õ': 'o',
    'ú': 'u', 'û': 'u', 'ü': 'u',
    'ç': 'c',
    '-': ' ', "'": ' ',
    '£': None, '.': None, '(': None, ')': None,
})

# Apply the mapping first and lowercase afterwards, as in the original loop.
city_modify = [city.translate(char_map).lower() for city in sellers_df.seller_city.to_list()]

Finally we replace the `sellers_df.seller_city` column with the `city_modify` list.

In [37]:
sellers_df['seller_city'] = city_modify
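
As an alternative to listing each accented character by hand, the standard library can decompose characters and drop the accent marks. This is only a sketch: `strip_accents` is a hypothetical helper, and it does not cover the extra rules used above (replacing `-` and `'` with spaces, dropping `£`, `.`, `(` and `)`).

```python
import unicodedata

def strip_accents(text):
    """Return text with combining diacritical marks removed."""
    # NFKD splits each accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep only the characters that are not combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('são joão da boa vista'))
```

Because the decomposition covers every accented letter, upper-case ones included, no character has to be enumerated explicitly.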

#### Defining and replacing misspelled or non-standard values in sellers_df['seller_city']

In [38]:
andira = [['andira'],['andira pr']]
angra_dos_reis = [['angra dos reis'],['angra dos reis rj']]
sao_bernardo_do_campo = [['sao bernardo do campo'],['ao bernardo do campo','sao bernardo do capo','sbc','sbc/sp']]
auriflama = [['auriflama'],['auriflama/sp']]
balneario_camboriu = [['balneario camboriu'],['balenario camboriu']]
barbacena = [['barbacena'],['barbacena/ minas gerais']]
belo_horizonte = [['belo horizonte'],['belo horizont']]
brasilia = [['brasilia'],['brasilia df']]
carapicuiba = [['carapicuiba'],['carapicuiba / sao paulo']]
cariacica = [['cariacica'],['cariacica / es']]
cascavel = [['cascavel'],['cascavael']]
castro = [['castro'],['castro pires']]
ferraz_de_vasconcelos = [['ferraz de vasconcelos'],['ferraz de  vasconcelos']]
florianopolis = [['florianopolis'],['floranopolis']]
novo_gama = [['novo gama'],['gama']]
jacarei = [['jacarei'],['jacarei / sao paulo']]
lages = [['lages'],['lages   sc']]
maua = [['maua'],['maua/sao paulo']]
mogi_das_cruzes = [['mogi das cruzes'],['mogi das cruses','mogi das cruzes / sp']]
novo_hamburgo = [['novo hamburgo'],['"novo hamburgo, rio grande do sul, brasil"','novo hamburgo, rio grande do sul, brasil']]
sao_jose_do_rio_pardo = [['sao jose do rio pardo'],['scao jose do rio pardo']]
sao_sebastiao_da_grama = [['sao sebastiao da grama'],['sao sebastiao da grama/sp']]
sao_paulo = [['sao paulo'],['sp','sp / sp','são paulo','sao pauo','sao paulop','sao paulo sp',r'sao paulo / sao paulo','sao paulo   sp','sao paluo','sao  paulo']]
sao_miguel_do_oeste = [['sao miguel do oeste'],['sao miguel d oeste']]
sao_jose_dos_pinhais = [['sao jose dos pinhais'],['sao jose dos pinhas','sao  jose dos pinhais']]
sao_jose_do_rio_preto = [['sao jose do rio preto'],['sao jose do rio pret','s jose do rio preto']]
santo_andre = [['santo andre'],['santo andre/sao paulo','sando andre']]
santa_barbara_d_oeste = [['santa barbara d oeste'],['santa barbara d´oeste']]
ribeirao_preto = [['ribeirao preto'],['robeirao preto','riberao preto','ribeirao pretp','ribeirao preto / sao paulo']]
rio_de_janeiro = [['rio de janeiro'],['04482255',r'rio de janeiro / rio de janeiro',r'rio de janeiro \rio de janeiro',r'"rio de janeiro, rio de janeiro, brasil"','rio de janeiro, rio de janeiro, brasil']]
porto_ferreira = [['porto ferreira'],['portoferreira']]
pirpirituba = [['pirpirituba'],['pirituba']]
pinhais = [['pinhais'],['pinhais/pr']]
balneario_picarras = [['balneario picarras'],['picarras']]
paicandu = [['paicandu'],['paincandu']]
taboao_da_serra = [['taboao da serra'],['tabao da serra']]
taguatinga = [['taguatinga'],['aguas claras df']]
guarulhos = [['guarulhos'],['garulhos']]
juazeiro_do_norte = [['juazeiro do norte'],['juzeiro do norte']]
guaruja = [['guaruja'],['vicente de carvalho']]
porto_seguro = [['porto seguro'],['arraial d ajuda porto seguro']]
campo_do_meio = [['campo do meio'],['minas gerais']]
palhoca = [['palhoca'],['santa catarina']]
para_de_minas = [['para de minas'],['centro']]
maringa = [['maringa'],['vendas@creditpartscombr']]
itaguacu_da_bahia = [['itaguacu da bahia'],['bahia']]

words_to_change = [sao_jose_do_rio_pardo,itaguacu_da_bahia,maringa,para_de_minas,palhoca,campo_do_meio,porto_seguro,guaruja,andira,angra_dos_reis,sao_bernardo_do_campo,auriflama,balneario_camboriu,barbacena,belo_horizonte,brasilia,carapicuiba,cariacica,cascavel,castro,ferraz_de_vasconcelos,florianopolis,novo_gama,jacarei,lages,maua,mogi_das_cruzes,novo_hamburgo,sao_sebastiao_da_grama,sao_paulo,sao_miguel_do_oeste,sao_jose_dos_pinhais,sao_jose_do_rio_preto,santo_andre,santa_barbara_d_oeste,ribeirao_preto,rio_de_janeiro,porto_ferreira,pirpirituba,pinhais,balneario_picarras,paicandu,taboao_da_serra,taguatinga,guarulhos,juazeiro_do_norte]

In [39]:
for word_to_change in words_to_change:
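    # Assign the result back instead of using inplace=True on a selected column,
    # which recent pandas versions warn about and may not propagate to the dataframe.
    sellers_df['seller_city'] = sellers_df['seller_city'].replace(word_to_change[1], word_to_change[0][0])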
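
The same `[[canonical], [variants]]` layout can also be flattened into one dictionary, so that a single `replace` call covers every entry. The sketch below uses two sample entries and a toy series; `pairs`, `variant_to_canonical` and `cities` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Same [[canonical], [variants...]] layout as the lists above, with two sample entries.
pairs = [
    [['sao paulo'], ['sp', 'sao pauo']],
    [['guarulhos'], ['garulhos']],
]

# Flatten into {variant: canonical} so one replace call handles every entry.
variant_to_canonical = {
    variant: canonical[0]
    for canonical, variants in pairs
    for variant in variants
}

cities = pd.Series(['sp', 'garulhos', 'campinas', 'sao pauo'])
cleaned = cities.replace(variant_to_canonical)
print(cleaned.to_list())
```

Values that are not keys of the dictionary, such as `campinas` here, pass through unchanged.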

Some cities share the same name but belong to different states, so we concatenate the `seller_city` and `seller_state` columns into a single `city_state` value. This combined value later takes the place of the `seller_city` column.

#### Creating city_state columns for sellers_df and brazil_cities

In [40]:
seller_city_list = sellers_df.seller_city.to_list()
seller_state_list = sellers_df.seller_state.to_list()

In [41]:
seller_city_new = [seller_city_list[i] + '/' + seller_state_list[i] for i in range(len(seller_city_list))]
seller_city_new = pd.DataFrame(seller_city_new,columns=['city_state'])
seller_city_new

Unnamed: 0,city_state
0,campinas/SP
1,mogi guacu/SP
2,rio de janeiro/RJ
3,sao paulo/SP
4,braganca paulista/SP
...,...
3090,sarandi/PR
3091,palhoca/SC
3092,sao paulo/SP
3093,pelotas/RS
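
The concatenation can also be written without intermediate Python lists, since pandas adds string columns element by element. A minimal sketch on a toy frame (`sample_df` is made up for illustration):

```python
import pandas as pd

# Toy frame with the two columns that get combined.
sample_df = pd.DataFrame({
    'seller_city': ['campinas', 'rio de janeiro'],
    'seller_state': ['SP', 'RJ'],
})

# String columns can be concatenated element-wise with '+'.
city_state = sample_df['seller_city'] + '/' + sample_df['seller_state']
print(city_state.to_list())
```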


In [42]:
brazil_city_list = brazil_cities.name.to_list()
brazil_state_list = brazil_cities.state.to_list()

In [43]:
brazil_city_new = [brazil_city_list[i] + '/' + brazil_state_list[i] for i in range(len(brazil_city_list))]
brazil_city_new = pd.DataFrame(brazil_city_new, columns=['city_state'])
brazil_city_new = brazil_city_new.sort_values(by='city_state')
brazil_city_new['city_state_id'] = np.arange(1,len(brazil_city_new)+1)
brazil_city_new = brazil_city_new.reindex(columns=['city_state_id','city_state'])
brazil_city_new

Unnamed: 0,city_state_id,city_state
0,1,abadia de goias/GO
1,2,abadia dos dourados/MG
2,3,abadiania/GO
4,4,abaete/MG
3,5,abaetetuba/PA
...,...,...
5528,5566,xique xique/BA
5529,5567,zabele/PB
5530,5568,zacarias/SP
5532,5569,ze doca/MA
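
Before correcting anything, it helps to list the seller values that have no counterpart in the reference table. The notebook does not show how its corrections were found, so this is only one possible approach, sketched on toy data; `seller_sample` and `reference_sample` are hypothetical stand-ins for `seller_city_new` and `brazil_city_new`.

```python
import pandas as pd

# Toy stand-ins for seller_city_new and brazil_city_new.
seller_sample = pd.DataFrame({'city_state': ['campinas/SP', 'curitiba/SP', 'pelotas/RS']})
reference_sample = pd.DataFrame({'city_state': ['campinas/SP', 'curitiba/PR', 'pelotas/RS']})

# Flag seller values that exist in the reference table.
is_known = seller_sample['city_state'].isin(reference_sample['city_state'])

# The remaining values are candidates for manual correction.
unmatched = sorted(seller_sample.loc[~is_known, 'city_state'].unique())
print(unmatched)
```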


#### Correcting city_state values in seller_city_new to be equal to brazil_city_new

In [44]:
caxias_do_sul = [['caxias do sul/RS'],['caxias do sul/SP']]
rio_de_janeiro = [['rio de janeiro/RJ'],['rio de janeiro/RN','rio de janeiro/SP']]
laranjeiras_do_sul = [['laranjeiras do sul/PR'],['laranjeiras do sul/SP']]
goioere = [['goioere/PR'],['goioere/SP']]
sertanopolis = [['sertanopolis/PR'],['sertanopolis/SP']]
marechal_candido_rondon = [['marechal candido rondon/PR'],['marechal candido rondon/SP','marechal candido rondon/PA']]
tocantins = [['tocantins/MG'],['tocantins/SP']]
volta_redonda = [['volta redonda/RJ'],['volta redonda/SP']]
belo_horizonte = [['belo horizonte/MG'],['belo horizonte/SP']]
vila_velha = [['vila velha/ES'],['vila velha/SP']]
juiz_de_fora = [['juiz de fora/MG'],['juiz de fora/SP']]
porto_alegre = [['porto alegre/RS'],['porto alegre/SP']]
laguna = [['laguna/SC'],['laguna/SP']]
curitiba = [['curitiba/PR'],['curitiba/SP']]
florianopolis = [['florianopolis/SC'],['florianopolis/SP']]
ipira = [['ipira/SC'],['ipira/SP']]
parana = [['parana/TO'],['parana/PR']]
blumenau = [['blumenau/SC'],['blumenau/SP']]
pirpirituba = [['pirpirituba/PB'],['pirpirituba/SP']]
taguatinga = [['taguatinga/TO'],['taguatinga/SP']]
rio_bonito = [['rio bonito/RJ'],['rio bonito/SP']]
novo_gama = [['novo gama/GO'],['novo gama/DF']]
pinhais = [['pinhais/PR'],['pinhais/SP']]
sao_jose_dos_pinhais = [['sao jose dos pinhais/PR'],['sao jose dos pinhais/SP']]
itajai = [['itajai/SC'],['itajai/SP']]
palhoca = [['palhoca/SC'],['palhoca/SP']]
chapeco = [['chapeco/SC'],['chapeco/SP']]
andradas = [['andradas/MG'],['andradas/SP']]
castro = [['castro/PR'],['castro/MG']]
londrina = [['londrina/PR'],['londrina/SP']]

city_states_to_change = [londrina,castro,andradas,chapeco,palhoca,itajai,sao_jose_dos_pinhais,caxias_do_sul,rio_de_janeiro,laranjeiras_do_sul,goioere,sertanopolis,marechal_candido_rondon,tocantins,volta_redonda,belo_horizonte,vila_velha,
                         juiz_de_fora,porto_alegre,laguna,curitiba,florianopolis,ipira,parana,blumenau,pirpirituba,taguatinga,rio_bonito,novo_gama,pinhais]

In [45]:
for city_state_to_change in city_states_to_change:
    # Assign the result back instead of using inplace=True on a selected column.
    seller_city_new['city_state'] = seller_city_new['city_state'].replace(city_state_to_change[1], city_state_to_change[0][0])

In [46]:
seller_city_new

Unnamed: 0,city_state
0,campinas/SP
1,mogi guacu/SP
2,rio de janeiro/RJ
3,sao paulo/SP
4,braganca paulista/SP
...,...
3090,sarandi/PR
3091,palhoca/SC
3092,sao paulo/SP
3093,pelotas/RS


#### Recreating sellers_df in sellers_new_df


In [47]:
sellers_new_df = seller_city_new.copy()
sellers_new_df['seller_id'] = sellers_df.seller_id
sellers_new_df['seller_zip_code_prefix'] = sellers_df.seller_zip_code_prefix
sellers_new_df = sellers_new_df.reindex(columns=['seller_id','seller_zip_code_prefix','city_state'])
sellers_new_df.rename(columns={'city_state':'seller_city_state'},inplace=True)
sellers_new_df

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas/SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu/SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro/RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo/SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista/SP
...,...,...,...
3090,98dddbc4601dd4443ca174359b237166,87111,sarandi/PR
3091,f8201cab383e484733266d1906e2fdfa,88137,palhoca/SC
3092,74871d19219c7d518d0090283e03c137,4650,sao paulo/SP
3093,e603cf3fec55f8697c9059638d6c8eb5,96080,pelotas/RS


#### Creating column seller_state and seller_state_id

In [48]:
sellers_new_df['seller_state'] = sellers_new_df['seller_city_state'].apply(lambda x: x[-2:])
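
Taking the last two characters works because every state code has exactly two letters. A sketch of an alternative that splits on the last `/` instead, and so does not depend on the code length, is shown below on a toy series (`city_state` and `parts` are illustrative names).

```python
import pandas as pd

city_state = pd.Series(['campinas/SP', 'rio de janeiro/RJ'])

# Split once, from the right, so only the last '/' separates city and state.
parts = city_state.str.rsplit('/', n=1, expand=True)
parts.columns = ['city', 'state']
print(parts)
```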

In [49]:
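# astype returns a new dataframe, so the result is assigned back to keep the conversion.
sellers_new_df = sellers_new_df.astype({'seller_id':str,'seller_city_state':str,'seller_state':str})
sellers_new_df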

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city_state,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas/SP,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu/SP,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro/RJ,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo/SP,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista/SP,SP
...,...,...,...,...
3090,98dddbc4601dd4443ca174359b237166,87111,sarandi/PR,PR
3091,f8201cab383e484733266d1906e2fdfa,88137,palhoca/SC,SC
3092,74871d19219c7d518d0090283e03c137,4650,sao paulo/SP,SP
3093,e603cf3fec55f8697c9059638d6c8eb5,96080,pelotas/RS,RS


#### Merging the file `state_dataset.csv` with the dataframe `brazil_city_new`

In [50]:
state_df = pd.read_csv('../../data/interim/state_dataset.csv')

In [51]:
brazil_city_new['state'] = brazil_city_new.city_state.apply(lambda x : x[-2:])
state_df = state_df.reindex(columns=['state','state_id'])
city_dataset = pd.merge(brazil_city_new,state_df,on='state', how='left')
city_dataset_copy = city_dataset.copy().drop(columns=['state'])
city_dataset_copy

Unnamed: 0,city_state_id,city_state,state_id
0,1,abadia de goias/GO,20
1,2,abadia dos dourados/MG,6
2,3,abadiania/GO,20
3,4,abaete/MG,6
4,5,abaetetuba/PA,15
...,...,...,...
5565,5566,xique xique/BA,7
5566,5567,zabele/PB,11
5567,5568,zacarias/SP,1
5568,5569,ze doca/MA,14
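
A left merge like this one can be guarded with two optional arguments of `pd.merge`: `validate` raises an error if a state appears more than once in the lookup table, and `indicator` adds a `_merge` column that shows which rows found no match. A minimal sketch on toy frames (`cities` and `states` are made up; the two state ids are taken from the output above):

```python
import pandas as pd

cities = pd.DataFrame({'city_state': ['campinas/SP', 'abaete/MG'], 'state': ['SP', 'MG']})
states = pd.DataFrame({'state': ['SP', 'MG'], 'state_id': [1, 6]})

# validate fails loudly on duplicate keys in `states`;
# indicator records whether each row matched.
merged = pd.merge(cities, states, on='state', how='left',
                  validate='many_to_one', indicator=True)

# States present in `cities` but missing from the lookup table.
missing_states = merged.loc[merged['_merge'] == 'left_only', 'state'].to_list()
print(merged)
```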


#### Creating city_state_dataset.csv

In [52]:
city_dataset_copy.to_csv('../../data/interim/city_state_dataset.csv', index=False)

#### Merging `city_dataset` with `sellers_new_df`

In [53]:
#city_dataset.drop(columns=['state_id'], inplace=True)

In [54]:
city_dataset

Unnamed: 0,city_state_id,city_state,state,state_id
0,1,abadia de goias/GO,GO,20
1,2,abadia dos dourados/MG,MG,6
2,3,abadiania/GO,GO,20
3,4,abaete/MG,MG,6
4,5,abaetetuba/PA,PA,15
...,...,...,...,...
5565,5566,xique xique/BA,BA,7
5566,5567,zabele/PB,PB,11
5567,5568,zacarias/SP,SP,1
5568,5569,ze doca/MA,MA,14


In [55]:
city_dataset = city_dataset.reindex(columns=['city_state','state','city_state_id','state_id'])
city_dataset.drop(columns=['state'],inplace=True)

In [56]:
sellers_new_df.rename(columns={'seller_city_state':'city_state','seller_state':'state'},inplace=True)

In [57]:
sellers_database = pd.merge(sellers_new_df, city_dataset, on=['city_state'], how='left')
sellers_database.drop(columns=['city_state','state'],inplace=True)
sellers_database.rename(columns={'city_state_id':'seller_city_state_id', 'state_id':'seller_state_id'},inplace=True)

In [58]:
sellers_database

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city_state_id,seller_state_id
0,3442f8959a84dea7ee197c632cb2df15,13023,953,1
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,3110,1
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,4207,4
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,4856,1
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,743,1
...,...,...,...,...
3090,98dddbc4601dd4443ca174359b237166,87111,4936,26
3091,f8201cab383e484733266d1906e2fdfa,88137,3523,27
3092,74871d19219c7d518d0090283e03c137,4650,4856,1
3093,e603cf3fec55f8697c9059638d6c8eb5,96080,3732,25
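
With `how='left'`, any seller whose `city_state` is absent from the city table receives a missing id, and pandas then stores the id column as float. The sketch below, on toy frames with made-up seller ids, counts such rows and converts the ids to the nullable `Int64` dtype so they stay integers.

```python
import pandas as pd

sellers = pd.DataFrame({'seller_id': ['a1', 'b2'],
                        'city_state': ['campinas/SP', 'unknown/XX']})
cities = pd.DataFrame({'city_state': ['campinas/SP'],
                       'city_state_id': [953], 'state_id': [1]})

merged = pd.merge(sellers, cities, on='city_state', how='left')

# Rows without a match carry NaN in the id columns.
unmatched_count = int(merged['city_state_id'].isna().sum())

# Nullable integer dtype keeps whole-number ids alongside missing values.
merged['city_state_id'] = merged['city_state_id'].astype('Int64')
print(unmatched_count)
print(merged)
```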


#### Merging `code_zip_prefix_dataset.csv` with the `sellers_database` dataframe

In [59]:
code_zip_prefix_dataset = pd.read_csv('../../data/interim/code_zip_prefix_dataset.csv')

In [60]:
code_zip_prefix_dataset = code_zip_prefix_dataset.reindex(columns=['code_zip_prefix','code_zip_prefix_id'])
sellers_database = sellers_database.reindex(columns=['seller_id','seller_city_state_id','seller_state_id','seller_zip_code_prefix'])
sellers_database.rename(columns={'seller_zip_code_prefix':'code_zip_prefix'},inplace=True)

In [61]:
sellers_database = pd.merge(sellers_database,code_zip_prefix_dataset, on='code_zip_prefix', how='left')
sellers_database.drop(columns='code_zip_prefix',inplace=True)
sellers_database = sellers_database.reindex(columns=['seller_id','code_zip_prefix_id','seller_city_state_id','seller_state_id'])
sellers_database.rename(columns={'code_zip_prefix_id':'seller_code_zip_prefix_id'},inplace=True)
sellers_database

Unnamed: 0,seller_id,seller_code_zip_prefix_id,seller_city_state_id,seller_state_id
0,3442f8959a84dea7ee197c632cb2df15,4806,953,1
1,d1b65fc7debc3361ea86b5f14c68d2e2,5197,3110,1
2,ce3ad9de960102d0677a81f5d0bb7b2d,6358,4207,4
3,c0f3eea2e14555b6faeea3dd58c1b1c3,1772,4856,1
4,51a04a8a6bdcb23deccc82b0b80742cf,4778,743,1
...,...,...,...,...
3090,98dddbc4601dd4443ca174359b237166,17108,4936,26
3091,f8201cab383e484733266d1906e2fdfa,17327,3523,27
3092,74871d19219c7d518d0090283e03c137,2169,4856,1
3093,e603cf3fec55f8697c9059638d6c8eb5,18505,3732,25


#### Creating final csv: sellers_database.csv

When saving the dataset, always pass **"index = False"**. Otherwise pandas writes the row index to the file as an extra column of consecutive numbers. The call below passes it so that this useless column is never created.

In [62]:
sellers_database.to_csv('../../data/interim/sellers_database.csv',index=False)
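
The effect of `index=False` can be seen by writing a toy frame both ways and reading the result back. This sketch writes to in-memory buffers rather than files; `sample_df` and its values are made up for illustration.

```python
import io
import pandas as pd

sample_df = pd.DataFrame({'seller_id': ['a1', 'b2'], 'seller_state_id': [1, 4]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
sample_df.to_csv(with_index)

# With index=False only the real columns are written.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)

cols_with_index = pd.read_csv(io.StringIO(with_index.getvalue())).columns.to_list()
cols_without_index = pd.read_csv(io.StringIO(without_index.getvalue())).columns.to_list()
print(cols_with_index)
print(cols_without_index)
```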

## Final Column Description

|**Column Title**|**seller_id -> str**|**seller_code_zip_prefix_id -> int** |**seller_city_state_id -> int** |**seller_state_id -> int** |
|--|--|--|--|--|
|Example |3442f8959a84dea7ee197c632cb2df15 |4806 |953 |1 |