# Customers Dataset
Contains a hash that identifies each customer and information about the customer's location.

## Initial Column Description


|**Column Title**|**customer_id -> str** |**customer_unique_id -> str** |**customer_zip_code_prefix -> int** |**customer_city -> str**| **customer_state -> str**|
|--|--|--|--|--|--|
|Description |Primary key for this table |Customer Identifier Number |Zip Code from Customer Location |City Name from Customer |State Code from Customer |
|Example |274fa6071e5e17fe303b9748641082c8 |84732c5050c01db9b23e19ba39899398 |06703 |cotia |SP |

### Errors found
+ The raw data for this table contained no null or empty values.
+ City names contain spelling variations and special characters, such as:
    + "santana do livramento" / "sant ana do livramento"
    + "varre-sai", "xique-xique"
    + "jaragua do sul" / "jaragua d sul" / "jaragua da sul"
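
The check for null and empty values mentioned above can be sketched as follows. This is only a minimal illustration: the `sample` frame and the `count_missing` helper are assumptions made for this example and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample with two of the text columns of the customers table.
sample = pd.DataFrame({
    'customer_city': ['cotia', 'franca', 'xique-xique'],
    'customer_state': ['SP', 'SP', 'BA'],
})

def count_missing(frame):
    """Count null or blank (whitespace-only) values in each column."""
    counts = {}
    for column in frame.columns:
        nulls = frame[column].isna()
        blanks = frame[column].astype(str).str.strip() == ''
        counts[column] = int((nulls | blanks).sum())
    return counts

print(count_missing(sample))
```

A count of zero in every column corresponds to the situation described in the first bullet.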

## Required Libraries

In [None]:
# pandas makes it easy to read and write CSV files.
import pandas as pd

## Data Preprocessing


We decided to create three new tables for cities, states, and zip codes, so the values in these three columns must be replaced with their ids from the respective tables:

|External Table |External Column with New ID |Column to Replace |
|--|--|--|
|code_zip_prefix_dataset |code_zip_prefix_id |customer_zip_code_prefix |
|city_state_dataset |city_state_id |customer_city |
|state_dataset |state_id |customer_state|

Example:

For the first row, these three columns contain:

|customer_zip_code_prefix |customer_city |customer_state |
|--|--|--|
|14409 |franca |SP |

Looking in the external table **state_dataset**, we find that id **1** corresponds to state **SP**, so we replace **SP** with **1**.

|customer_zip_code_prefix |customer_city |customer_state |
|--|--|--|
|14409 |franca |1 |

We follow the same process for the zip code. Looking in **code_zip_prefix_dataset**, we replace **14409** with its new id, **5353**.

|customer_zip_code_prefix |customer_city |customer_state |
|--|--|--|
|5353 |franca |1 |

The city replacement is a little different. Because Brazil has cities with the same name in different states, we append the state code to the city name, turning the original **franca** into **franca/SP**.

|customer_zip_code_prefix |customer_city |customer_state |
|--|--|--|
|5353 |franca/SP |1 |

After this change, we look up the new id in **city_state_dataset**. The id for **franca/SP** is **1842**, and with that the row is ready. The values of **customer_id** and **customer_unique_id** do not change.

|customer_zip_code_prefix |customer_city |customer_state |
|--|--|--|
|5353 |1842 |1 |
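
The worked example above can be reproduced with a short sketch. The three lookup tables below are hypothetical one-row stand-ins that contain only the ids quoted in the example. The column names follow the ones used by the cells later in this notebook.

```python
import pandas as pd

# One-row stand-ins for the external tables (hypothetical contents,
# holding only the ids quoted in the example).
state_dataset = pd.DataFrame({'state': ['SP'], 'state_id': [1]})
code_zip_prefix_dataset = pd.DataFrame(
    {'code_zip_prefix': [14409], 'code_zip_prefix_id': [5353]})
city_state_dataset = pd.DataFrame(
    {'city_state': ['franca/SP'], 'city_state_id': [1842]})

row = pd.DataFrame({
    'customer_zip_code_prefix': [14409],
    'customer_city': ['franca'],
    'customer_state': ['SP'],
})

# Build the city/state key before the state code is replaced by its id.
city_key = row['customer_city'] + '/' + row['customer_state']

state_ids = dict(zip(state_dataset['state'], state_dataset['state_id']))
zip_ids = dict(zip(code_zip_prefix_dataset['code_zip_prefix'],
                   code_zip_prefix_dataset['code_zip_prefix_id']))
city_ids = dict(zip(city_state_dataset['city_state'],
                    city_state_dataset['city_state_id']))

row['customer_state'] = row['customer_state'].map(state_ids)
row['customer_zip_code_prefix'] = row['customer_zip_code_prefix'].map(zip_ids)
row['customer_city'] = city_key.map(city_ids)

print(row)
```

Note that the city key has to be built while `customer_state` still holds the state code, which is why the sketch computes it first.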

### Data Correction

#### Clean city names

Delete special characters ('-', ',', "'").

In [None]:
dataset_path = '../../data/interim/customers_dataset.csv'
column_to_replace = 'customer_city'

not_valid_char = ['-', ',', "'"]

def clean(word):
    # Keep every character that is not in the list of invalid characters.
    return ''.join(letter for letter in word if letter not in not_valid_char)

dataset = pd.read_csv(dataset_path)
dataset[column_to_replace] = dataset[column_to_replace].map(clean)
dataset.to_csv('../../data/interim/customers_dataset.csv', encoding='utf-8', index=False)
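
To make the effect of the cleaning step concrete, here is a small standalone sketch. It uses `str.translate` as an alternative way to delete the same characters. The first two names come from the examples listed under "Errors found"; the third is a hypothetical input.

```python
# Characters to delete, as in the cell above.
not_valid_char = ['-', ',', "'"]
delete_table = str.maketrans('', '', ''.join(not_valid_char))

def clean(word):
    """Remove every character listed in not_valid_char."""
    return word.translate(delete_table)

examples = ['varre-sai', 'xique-xique', "d'oeste"]  # last one is hypothetical
cleaned = [clean(name) for name in examples]
for before, after in zip(examples, cleaned):
    print(before, '->', after)
```

Because the characters are deleted rather than replaced with a space, hyphenated names are joined into a single word.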

#### Change city names to city_state format (city/state)

Change the city from "franca" to "franca/SP".

In [None]:
dataset_path = '../../data/interim/customers_dataset.csv'
column_to_replace = 'customer_city'
column_state = 'customer_state'

dataset = pd.read_csv(dataset_path)
print(dataset[column_to_replace][0])
# Build "city/state" for every row at once; assigning cell by cell through
# chained indexing is slow and may not update the DataFrame.
dataset[column_to_replace] = dataset[column_to_replace] + '/' + dataset[column_state]

dataset.to_csv('../../data/interim/customers_dataset.csv', encoding='utf-8', index=False)

#### Replace city name with its corresponding ID

Replace each city name with its id from **city_state_dataset**.

In [8]:
replace_dataset = '../../data/interim/city_state_dataset.csv'
replace_id = 'city_state'
replace_column = 'city_state_id'

dataset_path = '../../data/interim/customers_dataset.csv'
column_to_replace = 'customer_city'

replace_table = pd.read_csv(replace_dataset)
replace_keys = list(replace_table[replace_id])
replace_values = list(replace_table[replace_column])
replace_dict = dict(zip(replace_keys, replace_values))

not_find = []

def replace(word):
    try:
        return replace_dict[word]
    except KeyError:
        if type(word) is not int:
            not_find.append(word)
        return word

dataset = pd.read_csv(dataset_path)
dataset[column_to_replace] = dataset[column_to_replace].map(replace)
dataset.to_csv('../../data/interim/customers_dataset.csv', encoding='utf-8', index=False)

#### Correct small cities

We took an official list of Brazilian cities from Wikipedia. However, the raw dataset treats some small towns as cities, and these are not in that list. For each of them, we searched the official list for the closest larger city and used it as the replacement. The replacements are stored in the **small_cities_replace** dict.

In [None]:
small_cities_dict = {name:'' for name in not_find}
print(small_cities_dict)
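
Part of the search for replacements could be assisted by code. The sketch below uses `difflib.get_close_matches` from the Python standard library to suggest the closest official name within the same state. It is an assumption about how the lookup might be supported, not the method used to build the dict in this notebook. `official_names` is a short hypothetical excerpt, and `suggest` is a helper introduced only for this example. It can only catch spelling variants; towns that must be replaced by a nearby city still need a manual lookup.

```python
import difflib

# Hypothetical excerpt of the official city/state list.
official_names = ['paraty/RJ', 'piumhi/MG', 'itapaje/CE', 'florinea/SP']

# Names from the raw data that were not found in the official list.
unmatched = ['parati/RJ', 'piumhii/MG', 'itapage/CE', 'florinia/SP']

def suggest(name, candidates, cutoff=0.8):
    """Return the closest candidate in the same state, or None if none is similar enough."""
    state = name.rsplit('/', 1)[-1]
    same_state = [c for c in candidates if c.endswith('/' + state)]
    matches = difflib.get_close_matches(name, same_state, n=1, cutoff=cutoff)
    return matches[0] if matches else None

suggestions = {name: suggest(name, official_names) for name in unmatched}
print(suggestions)
```

Any suggestion should be reviewed by hand before it is added to the replacement dict.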

In [52]:
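# Put each dictionary entry on its own line: the character that follows
# every comma (normally a space) is replaced with a line break.
with open('./bored_dict.txt', 'r+') as boredtext:
    base = boredtext.read()
    edit = ''

    mem = 0
    for character in base:
        if mem == 1:
            edit = edit + '\n'
            mem = 0
        else:
            edit = edit + character

        if character == ',':
            mem = 1

    # Go back to the start so the edited text replaces the original
    # content instead of being appended after it.
    boredtext.seek(0)
    boredtext.write(edit)
    boredtext.truncate()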

In [5]:
small_cities_replace = {'parati/RJ': 'paraty/RJ',
    'embu/SP': 'embu das artes/SP',
    'bom jesus/GO': 'bom jesus de goias/GO',
    'espigao do oeste/RO': 'espigao d oeste/RO',
    'passa tres/RJ': 'rio claro/RJ',
    'ipiabas/RJ': 'barra do pirai/RJ',
    'vitoria/PR': 'porto vitoria/PR',
    'nucleo residencial pilar/BA': 'jaguarari/BA',
    'taperuaba/CE': 'sobral/CE',
    'itaipava/ES': 'itapemirim/ES',
    'glaura/MG': 'ouro preto/MG',
    'bonfim paulista/SP': 'ribeirao preto/SP',
    'vila muriqui/RJ': 'mangaratiba/RJ',
    'nossa senhora do remedio/SP': 'salesopolis/SP',
    'piumhii/MG': 'piumhi/MG',
    'maioba/MA': 'paco do lumiar/MA',
    'monnerat/RJ': 'duas barras/RJ',
    'desembargador otoni/MG': 'diamantina/MG',
    'tocos/RJ': 'campos dos goytacazes/RJ',
    'bom jesus do querendo/RJ': 'natividade/RJ',
    'rainha do mar/RS': 'xangri la/RS',
    'cipo guacu/SP': 'embu guacu/SP',
    'taguatinga/DF': 'brasilia/DF',
    'santana do sobrado/BA': 'casa nova/BA',
    'anta/RJ': 'sapucaia/RJ',
    'arraial d ajuda/BA': 'porto seguro/BA',
    'sacra familia do tingua/RJ': 'engenheiro paulo de frontin/RJ',
    'braco do rio/ES': 'conceicao da barra/ES',
    'california da barra/RJ': 'barra do pirai/RJ',
    'sao joao de petropolis/ES': 'santa teresa/ES',
    'santa isabel do para/PA': 'santa izabel do para/PA',
    'vila nova/PR': 'toledo/PR',
    'portela/RJ': 'itaocara/RJ',
    'caraiba/PE': 'ibimirim/PE',
    'vila dos cabanos/PA': 'barcarena/PA',
    'arace/ES': 'domingos martins/ES',
    'barra do tarrachil/BA': 'chorrocho/BA',
    'capao da porteira/RS': 'viamao/RS',
    'sao thome das letras/MG': 'sao tome das letras/MG',
    'goitacazes/RJ': 'campos dos goytacazes/RJ',
    'quilometro 14 do mutum/ES': 'baixo guandu/ES',
    'sao francisco do humaita/MG': 'mutum/MG',
    'cuite velho/MG': 'conselheiro pena/MG',
    'papucaia/RJ': 'cachoeiras de macacu/RJ',
    'angelo frechiani/ES': 'colatina/ES',
    'praia grande/ES': 'fundao/ES',
    'barra de sao joao/RJ': 'casimiro de abreu/RJ',
    'arrozal/RJ': 'pirai/RJ',
    'santana do livramento/RS': 'sant ana do livramento/RS',
    'valao do barro/RJ': 'sao sebastiao do alto/RJ',
    'engenheiro passos/RJ': 'resende/RJ',
    'primavera/SP': 'rosana/SP',
    'conservatoria/RJ': 'valenca/RJ',
    'colonia vitoria/PR': 'guarapuava/PR',
    'vermelho/MG': 'muriae/MG',
    'sobradinho/DF': 'brasilia/DF',
    'aparecida de sao manuel/SP': 'sao manuel/SP',
    'itapage/CE': 'itapaje/CE',
    'catu de abrantes/BA': 'lauro de freitas/BA',
    'espigao/SP': 'regente feijo/SP',
    'domiciano ribeiro/GO': 'cristalina/GO',
    'alto alegre do iguacu/PR': 'capitao leonidas marques/PR',
    'itaoca/ES': 'itapemirim/ES',
    'macuco de minas/MG': 'itumirim/MG',
    'sao jose do turvo/RJ': 'barra do pirai/RJ',
    'sao sebastiao de campos/RJ': 'campos dos goytacazes/RJ',
    'jacare/SP': 'cabreuva/SP',
    'santa cruz do timbo/SC': 'porto uniao/SC',
    'araguaia/ES': 'marechal floriano/ES',
    'celina/ES': 'alegre/ES',
    'pureza/RJ': 'sao fidelis/RJ',
    'conrado/RJ': 'miguel pereira/RJ',
    'jamapara/RJ': 'sapucaia/RJ',
    'jamaica/SP': 'praia grande/SP',
    'arembepe/BA': 'camacari/BA',
    'santa cruz do prata/MG': 'guaranesia/MG',
    'glicerio/RJ': 'macae/RJ',
    'sao jorge do oeste/PR': 'curitiba/PR',
    'cocais/MG': 'barao de cocais/MG',
    'bemposta/RJ': 'tres rios/RJ',
    'raposo/RJ': 'itaperuna/RJ',
    'japuiba/RJ': 'angra dos reis/RJ',
    'hidreletrica tucurui/PA': 'tucurui/PA',
    'palmeirinha/PR': 'guarapuava/PR',
    'luizlandia do oeste/MG': 'sao goncalo do abaete/MG',
    'cachoeira do campo/MG': 'ouro preto/MG',
    'santo amaro de campos/RJ': 'campos dos goytacazes/RJ',
    'santa rita do ibitipoca/MG': 'santa rita de ibitipoca/MG',
    'polo petroquimico de triunfo/RS': 'triunfo/RS',
    'bataipora/MS': 'bataypora/MS',
    'extrema/RO': 'porto velho/RO',
    'werneck/RJ': 'paraiba do sul/RJ',
    'irape/SP': 'chavantes/SP',
    'antonio pereira/MG': 'ouro preto/MG',
    'vargem alegre/RJ': 'barra do pirai/RJ',
    'serra dos dourados/PR': 'umuarama/PR',
    'avelar/RJ': 'paty do alferes/RJ',
    'sao joao do paraiso/RJ': 'cambuci/RJ',
    'quatro bocas/PA': 'nova timboteua/PA',
    'guara/DF': 'brasilia/DF',
    'porto trombetas/PA': 'oriximina/PA',
    'sao mateus de minas/MG': 'camanducaia/MG',
    'santo antonio dos campos/MG': 'divinopolis/MG',
    'travessao/RJ': 'campos dos goytacazes/RJ',
    'santa margarida/PR': 'bela vista do paraiso/PR',
    'queixada/MG': 'novo cruzeiro/MG',
    'barao de juparana/RJ': 'valenca/RJ',
    'silveira carvalho/MG': 'barao de monte alto/MG',
    'trancoso/BA': 'porto seguro/BA',
    'couto de magalhaes/TO': 'couto magalhaes/TO',
    'lidice/RJ': 'rio claro/RJ',
    'itamira/BA': 'apora/BA',
    'santanesia/RJ': 'pirai/RJ',
    'nossa senhora de caravaggio/SC': 'nova veneza/SC',
    'ibiraja/BA': 'itanhem/BA',
    'abrantes/BA': 'lauro de freitas/BA',
    'sao goncalo do rio das pedras/MG': 'serro/MG',
    'florinia/SP': 'florinea/SP',
    'azurita/MG': 'mateus leme/MG',
    'andrequice/MG': 'tres marias/MG',
    'nossa senhora do o/PE': 'ipojuca/PE',
    'purilandia/RJ': 'porciuncula/RJ',
    'morro vermelho/MG': 'caete/MG',
    'santa rita da floresta/RJ': 'cantagalo/RJ',
    'morro do ferro/MG': 'oliveira/MG',
    'sao benedito/MG': 'santa luzia/MG',
    'vila nelita/ES': 'agua doce do norte/ES',
    'osvaldo kroeff/RS': 'cambara do sul/RS',
    'rio verde/PR': 'colombo/PR',
    'morro de sao paulo/BA': 'cairu/BA',
    'holambra ii/SP': 'holambra/SP',
    'areia branca dos assis/PR': 'mandirituba/PR',
    'conceicao da ibitipoca/MG': 'lima duarte/MG',
    'morada nova/PA': 'maraba/PA',
    'sao joao do sobrado/ES': 'pinheiros/ES',
    'monte verde/MG': 'camanducaia/MG',
    'pocoes de paineiras/MG': 'paineiras/MG',
    'maracana/SC': 'palhoca/SC',
    'sao vitor/MG': 'belo horizonte/MG',
    'ravena/MG': 'sabara/MG',
    'sao luis do paraitinga/SP': 'sao luiz do paraitinga/SP',
    'sao sebastiao da serra/SP': 'brotas/SP',
    'mussurepe/RJ': 'campos dos goytacazes/RJ',
    'cambiasca/RJ': 'sao fidelis/RJ',
    'caldas do jorro/BA': 'feira de santana/BA',
    'aguas claras/RS': 'viamao/RS',
    'santana/PR': 'candoi/PR',
    'monte gordo/BA': 'camacari/BA',
    'santa maria/RJ': 'santa maria madalena/RJ',
    'jacigua/ES': 'vargem alta/ES',
    'antunes/MG': 'igaratinga/MG',
    'boa esperanca/RJ': 'rio bonito/RJ',
    'bandeirantes d oeste/SP': 'sud mennucci/SP',
    'santo antonio das queimadas/PE': 'jurema/PE',
    'perpetuo socorro/MG': 'belo oriente/MG',
    'estevao de araujo/MG': 'araponga/MG',
    'luziapolis/AL': 'campo alegre/AL',
    'quixada/PE': 'quixaba/PE',
    'pirapo/PR': 'arapongas/PR',
    'aribice/BA': 'euclides da cunha/BA',
    'doce grande/PR': 'quitandinha/PR',
    'piacu/ES': 'muniz freire/ES',
    'maristela/SP': 'laranjal paulista/SP',
    'alto alegre/PR': 'cascavel/PR',
    'fragosos/SC': 'campo alegre/SC',
    'colonia jordaozinho/PR': 'guarapuava/PR',
    'missi/CE': 'iraucuba/CE',
    'murucupi/PA': 'barcarena/PA',
    'lagoa do mato/CE': 'itatira/CE',
    'quatituba/MG': 'itueta/MG',
    'campo alegre de minas/MG': 'resplendor/MG',
    'sao domingos/MG': 'uba/MG',
    'engenheiro balduino/SP': 'monte aprazivel/SP',
    'agisse/SP': 'rancharia/SP',
    'alexandrita/MG': 'iturama/MG',
    'perola independente/PR': 'maripa/PR',
    'tecainda/SP': 'martinopolis/SP',
    'central de santa helena/MG': 'divino das laranjeiras/MG',
    'tuparece/MG': 'medina/MG',
    'aparecida de monte alto/SP': 'monte alto/SP',
    'sao geraldo do baguari/MG': 'sao joao evangelista/MG',
    'sao domingos/PE': 'brejo da madre de deus/PE',
    'botelho/SP': 'santa adelia/SP',
    'padre gonzales/RS': 'tres passos/RS',
    'taboquinhas/BA': 'itacare/BA',
    'ceilandia/DF': 'brasilia/DF',
    'visconde de maua/RJ': 'resende/RJ',
    'governador portela/RJ': 'miguel pereira/RJ',
    'humildes/BA': 'feira de santana/BA',
    'ibitioca/RJ': 'campos dos goytacazes/RJ',
    'alto sao joao/PR': 'laranjeiras do sul/PR',
    'chaveslandia/MG': 'santa vitoria/MG',
    'carajas/PA': 'parauapebas/PA',
    'brejo bonito/MG': 'cruzeiro da fortaleza/MG',
    'itaguacu/GO': 'sao simao/GO',
    'colonia castrolanda/PR': 'castro/PR',
    'amanari/CE': 'maranguape/CE',
    'jaguarembe/RJ': 'itaocara/RJ',
    'santana do capivari/MG': 'pouso alto/MG',
    'novo brasil/ES': 'cariacica/ES',
    'ribeiro junqueira/MG': 'leopoldina/MG',
    'cachoeira do brumado/MG': 'mariana/MG',
    'pau d arco/AL': 'lagoa da canoa/AL',
    'posto da mata/BA': 'nova vicosa/BA',
    'carnaiba do sertao/BA': 'juazeiro/BA',
    'sapucaia/MG': 'caratinga/MG',
    'ibiajara/BA': 'rio do pires/BA',
    'ponto do marambaia/MG': 'carai/MG',
    'rechan/SP': 'itapetininga/SP',
    'salobro/BA': 'canarana/BA',
    'santa maria/DF': 'brasilia/DF',
    'guarapua/SP': 'dois corregos/SP',
    'piao/RJ': 'sao jose do vale do rio preto/RJ',
    'flores/CE': 'russas/CE',
    'santo eduardo/RJ': 'campos dos goytacazes/RJ',
    'alexandra/PR': 'paranagua/PR',
    'paraju/ES': 'domingos martins/ES',
    'jacuipe/BA': 'camacari/BA',
    'tres irmaos/RJ': 'cambuci/RJ',
    'ipiranga/RS': 'viamao/RS',
    'boa ventura/RJ': 'itaperuna/RJ',
    'angustura/MG': 'alem paraiba/MG',
    'martinesia/MG': 'uberlandia/MG',
    'prudencio thomaz/MS': 'rio brilhante/MS',
    'major porto/MG': 'patos de minas/MG',
    'conceicao do formoso/MG': 'santos dumont/MG',
    'silvano/MG': 'patrocinio/MG',
    'fonseca/MG': 'alvinopolis/MG',
    'santo antonio do canaa/ES': 'santa teresa/ES',
    'anhandui/MS': 'campo grande/MS',
    'guinda/MG': 'diamantina/MG',
    'lages/CE': 'maranguape/CE',
    'amparo da serra/MG': 'amparo do serra/MG',
    'poco de pedra/RN': 'sao goncalo do amarante/RN',
    'sao clemente/PR': 'santa helena/PR',
    'pinhotiba/MG': 'eugenopolis/MG',
    'guassusse/CE': 'oros/CE',
    'monte alverne/RS': 'santa cruz do sul/RS',
    'ibitira/BA': 'rio do antonio/BA',
    'sao jose do ribeirao/RJ': 'bom jardim/RJ',
    'vargem grande do soturno/ES': 'cachoeiro de itapemirim/ES',
    'vila pereira/MG': 'nanuque/MG',
    'bacaxa/RJ': 'saquarema/RJ',
    'sucesso/CE': 'tamboril/CE',
    'barao ataliba nogueira/SP': 'itapira/SP',
    'santa teresinha/BA': 'santa terezinha/BA',
    'ajapi/SP': 'rio claro/SP',
    'guariroba/SP': 'taquaritinga/SP',
    'itabatan/BA': 'mucuri/BA',
    'baguari/MG': 'governador valadares/MG',
    'sede alvorada/PR': 'cascavel/PR',
    'jaua/BA': 'camacari/BA',
    'mariental/PR': 'lapa/PR',
    'pedra menina/MG': 'espera feliz/MG',
    'ibitiuva/SP': 'pitangueiras/SP',
    'tapinas/SP': 'itapolis/SP',
    'planaltina de goias/GO': 'planaltina/GO',
    'graccho cardoso/SE': 'gracho cardoso/SE',
    'ilha dos valadares/PR': 'paranagua/PR',
    'penedo/RJ': 'itatiaia/RJ',
    'mutum parana/RO': 'porto velho/RO',
    'picarras/SC': 'penha/SC',
    'sao miguel do cambui/PR': 'marialva/PR',
    'vitorinos/MG': 'alto rio doce/MG',
    'brasopolis/MG': 'brazopolis/MG',
    'corrego do ouro/RJ': 'macae/RJ',
    'pitanga de estrada/PB': 'mamanguape/PB',
    'termas de ibira/SP': 'ibira/SP',
    'vila reis/PR': 'apucarana/PR',
    'itacurussa/RJ': 'mangaratiba/RJ',
    'palmital de minas/MG': 'cabeceira grande/MG',
    'jardim abc de goias/GO': 'cidade ocidental/GO',
    'dourado/CE': 'morada nova/CE',
    'santa isabel do rio preto/RJ': 'valenca/RJ',
    'pinheiros/SP': 'sao paulo/SP',
    'mendonca/MG': 'veredinha/MG',
    'adhemar de barros/PR': 'terra rica/PR',
    'sao sebastiao do paraiba/RJ': 'cantagalo/RJ',
    'pacotuba/ES': 'itapemirim/ES',
    'serra bonita/MG': 'buritis/MG',
    'venda branca/SP': 'casa branca/SP',
    'sanga puita/MS': 'ponta pora/MS',
    'siriji/PE': 'sao vicente ferrer/PE',
    'monte bonito/RS': 'pelotas/RS'}

In [7]:
dataset_path = '../../data/interim/customers_dataset.csv'
column_to_replace = 'customer_city'

not_find = []

def replace(word):
    try:
        return small_cities_replace[word]
    except KeyError:
        if type(word) is not int:
            not_find.append(word)
        return word

dataset = pd.read_csv(dataset_path)
dataset[column_to_replace] = dataset[column_to_replace].map(replace)
dataset.to_csv('../../data/interim/customers_dataset.csv', encoding='utf-8', index=False)

In [None]:
# Run the city-name-to-ID replacement cell above again, now with the corrected names.

#### Replace state with its ID

Replace each state code with its corresponding id from **state_dataset**.

In [9]:
replace_dataset = '../../data/interim/state_dataset.csv'
replace_id = 'state'
replace_column = 'state_id'

dataset_path = '../../data/interim/customers_dataset.csv'
column_to_replace = 'customer_state'

replace_table = pd.read_csv(replace_dataset)
replace_keys = list(replace_table[replace_id])
replace_values = list(replace_table[replace_column])
replace_dict = dict(zip(replace_keys, replace_values))

not_find = []

def replace(word):
    try:
        return replace_dict[word]
    except KeyError:
        if type(word) is not int:
            not_find.append(word)
        return word

dataset = pd.read_csv(dataset_path)
dataset[column_to_replace] = dataset[column_to_replace].map(replace)
dataset.to_csv('../../data/interim/customers_dataset.csv', encoding='utf-8', index=False)
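
The same lookup pattern is repeated for cities, states, and zip codes, so it could be factored into one helper. The following is a minimal sketch under that assumption. `replace_with_ids` is a hypothetical function name, and the miniature tables are illustrative (only the id 1 for SP comes from this document). Unlike the cells above, it reports each unmatched value once instead of once per row.

```python
import pandas as pd

def replace_with_ids(values, lookup, key_column, id_column):
    """Map each value to its id; return the mapped Series and the sorted values with no id."""
    mapping = dict(zip(lookup[key_column], lookup[id_column]))
    missing = sorted(set(values.dropna()) - set(mapping))
    # Values without an id are kept unchanged, as in the cells above.
    replaced = values.map(lambda value: mapping.get(value, value))
    return replaced, missing

# Hypothetical miniature tables for illustration.
states = pd.DataFrame({'state': ['SP', 'RJ'], 'state_id': [1, 2]})
customers = pd.DataFrame({'customer_state': ['SP', 'RJ', 'XX']})

customers['customer_state'], not_found = replace_with_ids(
    customers['customer_state'], states, 'state', 'state_id')
print(customers)
print(not_found)
```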

#### Exploration to replace zipcode

We check whether some zip codes in the dataset are missing from **code_zip_prefix_dataset**. Some are.

In [11]:
dataset = pd.read_csv('../../data/interim/customers_dataset.csv')
zipcode = pd.read_csv('../../data/interim/code_zip_prefix_dataset.csv')
print(len(set(dataset['customer_zip_code_prefix']) ))
print(len(set(zipcode['code_zip_prefix']) ))

14994
19022
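
Comparing the number of unique values does not by itself show which prefixes are missing. A set difference lists them directly. The sketch below uses small hypothetical tables with the same column names as the cell above.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
customers = pd.DataFrame({'customer_zip_code_prefix': [14409, 6703, 99999]})
zipcodes = pd.DataFrame({'code_zip_prefix': [14409, 6703, 1000]})

customer_prefixes = set(customers['customer_zip_code_prefix'])
known_prefixes = set(zipcodes['code_zip_prefix'])

# Prefixes used by customers that have no row in the zip code table.
missing_prefixes = sorted(customer_prefixes - known_prefixes)
print(missing_prefixes)
```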


#### Replace zipcode

We replace each zip code with its id from **code_zip_prefix_dataset** and collect in a list named **not_find** the zip codes that are not in that dataset. We then add them to **code_zip_prefix_dataset** with sequential ids and replace them in **customers_dataset** to finish this step.

In [23]:
replace_dataset = '../../data/interim/code_zip_prefix_dataset.csv'
replace_id = 'code_zip_prefix'
replace_column = 'code_zip_prefix_id'

dataset_path = '../../data/interim/customers_dataset.csv'
column_to_replace = 'customer_zip_code_prefix'

replace_table = pd.read_csv(replace_dataset)
replace_keys = list(replace_table[replace_id])
replace_values = list(replace_table[replace_column])
replace_dict = dict(zip(replace_keys, replace_values))

not_find = []

def replace(word):
    try:
        return replace_dict[word]
    except KeyError:
        not_find.append(word)
        return word

dataset = pd.read_csv(dataset_path)
dataset[column_to_replace] = dataset[column_to_replace].map(replace)
dataset.to_csv('../../data/interim/customers_dataset.csv', encoding='utf-8', index=False)

# Be careful with this part: run it only after you have identified the zip codes that are missing from code_zip_prefix_dataset. Otherwise ids can be confused with zip codes, which can ruin the dataset.
# Add the zip codes that were not found.
# zipcode = pd.read_csv(replace_dataset)
# id_counter = 19023
# for element in not_find:
#     zipcode.loc[id_counter] = [id_counter, element]
#     id_counter += 1
# zipcode.to_csv('../../data/interim/code_zip_prefix_dataset.csv', encoding='utf-8')
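
As a hedged alternative to the commented-out loop, the missing prefixes could be appended with ids that continue from the current maximum, so no id has to be hard-coded. The table contents and the column order below are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical miniature zip code table and list of missing prefixes.
zipcodes = pd.DataFrame({'code_zip_prefix_id': [1, 2, 3],
                         'code_zip_prefix': [1000, 6703, 14409]})
not_find = [99999, 88888, 99999]

# Drop duplicates while keeping first-seen order, and skip prefixes already present.
known = set(zipcodes['code_zip_prefix'])
new_prefixes = [p for p in dict.fromkeys(not_find) if p not in known]

# Continue the id sequence from the current maximum instead of a fixed number.
first_id = int(zipcodes['code_zip_prefix_id'].max()) + 1
new_rows = pd.DataFrame({
    'code_zip_prefix_id': range(first_id, first_id + len(new_prefixes)),
    'code_zip_prefix': new_prefixes,
})
zipcodes = pd.concat([zipcodes, new_rows], ignore_index=True)
print(zipcodes)
```

Removing duplicates first matters because **not_find** receives one entry per row, so a prefix used by several customers appears several times.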

#### Fix Index

Always pass **index=False** when saving the dataset. Otherwise pandas adds a new column with a consecutive number. This small script removes that unneeded column.

In [None]:
dataset = pd.read_csv('../../data/interim/order_reviews_dataset.csv')
del(dataset['Unnamed: 0'])
print(dataset)
dataset.to_csv('../../data/interim/order_reviews_dataset.csv', encoding='utf-8', index=False)
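
The origin of the extra column can be shown with an in-memory round trip. This sketch uses `io.StringIO` in place of a file, and the tiny `frame` is a hypothetical example.

```python
import io
import pandas as pd

frame = pd.DataFrame({'customer_state': ['SP', 'RJ']})

# Saving with the default index=True writes the row index as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)
print(columns_with_index)

# Saving with index=False keeps only the real columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)
print(columns_without_index)
```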

## Final Column Description

|**Column Title**|**customer_id -> str** |**customer_unique_id -> str** |**customer_zip_code_prefix -> int** |**customer_city -> int**| **customer_state -> int**|
|--|--|--|--|--|--|
|Description |Primary key for this table |Customer Identifier Number |code_zip_prefix_id from code_zip_prefix_dataset |city_state_id from city_state_dataset |state_id from state_dataset |
|Before Preprocessing |274fa6071e5e17fe303b9748641082c8 |84732c5050c01db9b23e19ba39899398 |06703 |cotia |SP |
|After Preprocessing |274fa6071e5e17fe303b9748641082c8 |84732c5050c01db9b23e19ba39899398 |3354 |1437 |1 |