# Data Preparation

In [1]:
# "magic commands" to enable autoreload of your imported packages
%load_ext autoreload
%autoreload 2

Our goal is to load all 9 `.csv` files, each as a `pandas.DataFrame`, into a single dict named `data`. Each key is derived from the name of a csv file, and each value is the DataFrame created from that file.

### 1. Create the variable `csv_path`, which stores the path to your `csv` folder as a string

In [2]:
# Check your current working directory using `os.getcwd()` below 
import os
os.getcwd()
# `getcwd`, a.k.a. "get current working directory", returns the absolute
# path _from which this notebook is being executed_

'/Users/elohokemi/code/Olist/notebooks'

In [3]:
csv_path = os.path.join(os.getcwd(), "..", "raw_data", "csv")

In [4]:
import pandas as pd
pd.read_csv(os.path.join(csv_path, 'olist_sellers_dataset.csv')).head()

|   | seller_id | seller_zip_code_prefix | seller_city | seller_state |
|---|---|---|---|---|
| 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | campinas | SP |
| 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | mogi guacu | SP |
| 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | rio de janeiro | RJ |
| 3 | c0f3eea2e14555b6faeea3dd58c1b1c3 | 4195 | sao paulo | SP |
| 4 | 51a04a8a6bdcb23deccc82b0b80742cf | 12914 | braganca paulista | SP |


### 2. Create a list containing all csv file names in the csv directory

In [5]:
# Returns a list of all the file names in the specified directory
os.listdir(csv_path)

['olist_sellers_dataset.csv',
 'product_category_name_translation.csv',
 'olist_orders_dataset.csv',
 '.gitkeep',
 'olist_order_items_dataset.csv',
 'olist_customers_dataset.csv',
 'olist_geolocation_dataset.csv',
 'olist_order_payments_dataset.csv',
 'olist_order_reviews_dataset.csv',
 'olist_products_dataset.csv']

In [6]:
# Keep only names ending in ".csv" (this drops files such as `.gitkeep`)
file_names = [file for file in os.listdir(csv_path) if file.endswith('.csv')]
file_names

['olist_sellers_dataset.csv',
 'product_category_name_translation.csv',
 'olist_orders_dataset.csv',
 'olist_order_items_dataset.csv',
 'olist_customers_dataset.csv',
 'olist_geolocation_dataset.csv',
 'olist_order_payments_dataset.csv',
 'olist_order_reviews_dataset.csv',
 'olist_products_dataset.csv']
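`os.listdir` returns names in arbitrary order, so the list above may come out in a different order on another machine. A minimal sketch of a deterministic alternative, assuming `pathlib` is acceptable here; the temporary folder and its file names are made up for the demo and are not part of the Olist data:

```python
import tempfile
from pathlib import Path

# Build a throwaway folder with made-up files so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["b_dataset.csv", "a_dataset.csv", ".gitkeep"]:
        (folder / name).touch()

    # glob("*.csv") keeps only csv files; sorted() fixes the order
    demo_file_names = sorted(path.name for path in folder.glob("*.csv"))

print(demo_file_names)  # ['a_dataset.csv', 'b_dataset.csv']
```

In the notebook itself, the same expression would be applied to `Path(csv_path)` instead of the temporary folder.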

### 3. Create the list of dict keys, `key_names`
Start from `file_names` and, for each name:
- Remove the suffix `.csv` when it exists
- Remove the suffix `_dataset.csv` when it exists
- Remove the prefix `olist_` when it exists

In [11]:
key_names = [name.replace('_dataset.csv','').replace('.csv','').replace('olist_','') for name in file_names]
key_names

['sellers',
 'product_category_name_translation',
 'orders',
 'order_items',
 'customers',
 'geolocation',
 'order_payments',
 'order_reviews',
 'products']
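The chained `str.replace` calls above remove the given substrings wherever they occur in the name, not only at its start or end. That is harmless for these file names, but a stricter variant is easy to sketch. The version below assumes Python 3.9+ (for `str.removeprefix` and `str.removesuffix`); `make_key` is a hypothetical helper name, not something defined elsewhere in this notebook:

```python
def make_key(file_name: str) -> str:
    """Turn a csv file name into a short dict key (hypothetical helper)."""
    key = file_name.removesuffix(".csv")   # drop the extension if present
    key = key.removesuffix("_dataset")     # drop the "_dataset" suffix if present
    key = key.removeprefix("olist_")       # drop the "olist_" prefix if present
    return key


example_names = [
    "olist_sellers_dataset.csv",
    "product_category_name_translation.csv",
    "olist_order_items_dataset.csv",
]
example_keys = [make_key(name) for name in example_names]
print(example_keys)
# ['sellers', 'product_category_name_translation', 'order_items']
```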

### 4. Construct the dictionary `data`

In [18]:
pd.read_csv(os.path.join(csv_path, file_names[0])).head()

|   | seller_id | seller_zip_code_prefix | seller_city | seller_state |
|---|---|---|---|---|
| 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | campinas | SP |
| 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | mogi guacu | SP |
| 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | rio de janeiro | RJ |
| 3 | c0f3eea2e14555b6faeea3dd58c1b1c3 | 4195 | sao paulo | SP |
| 4 | 51a04a8a6bdcb23deccc82b0b80742cf | 12914 | braganca paulista | SP |


In [20]:
# Pair each key with its file name and read the matching csv into a DataFrame
data = {key: pd.read_csv(os.path.join(csv_path, file_name)) for key, file_name in zip(key_names, file_names)}
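
The steps in this notebook can also be bundled into a single function, which is convenient if the loading logic later moves into a package. The sketch below is one possible shape, not a prescribed one: `load_csv_folder` is a hypothetical name, and the demo writes two tiny made-up csv files to a temporary folder so that it runs without the Olist data.

```python
import os
import tempfile

import pandas as pd


def load_csv_folder(folder: str) -> dict:
    """Return {key: DataFrame} for every .csv file in `folder` (hypothetical helper)."""
    frames = {}
    for file_name in sorted(os.listdir(folder)):
        if not file_name.endswith(".csv"):
            continue  # skip files such as .gitkeep
        # Same renaming rule as in step 3
        key = (file_name.replace("_dataset.csv", "")
                        .replace(".csv", "")
                        .replace("olist_", ""))
        frames[key] = pd.read_csv(os.path.join(folder, file_name))
    return frames


# Demo on a throwaway folder with made-up data
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"seller_id": ["a", "b"]}).to_csv(
        os.path.join(tmp, "olist_sellers_dataset.csv"), index=False)
    pd.DataFrame({"name": ["x"]}).to_csv(
        os.path.join(tmp, "product_category_name_translation.csv"), index=False)
    open(os.path.join(tmp, ".gitkeep"), "w").close()
    demo = load_csv_folder(tmp)

# Quick sanity check: one line per loaded table
for key, df in demo.items():
    print(key, df.shape)
```

With the real data, `load_csv_folder(csv_path)` would be expected to return the same dictionary as `data`, with its keys in sorted file-name order.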
