# Data Preparation

My goal is to load all 9 `.csv` files as `pandas.DataFrame` objects into a single dict named `data`, where:
- each **key** is the **cleaned name** of the csv file
- each **value** is the **DataFrame** created from the csv

```python
data = { 
    'sellers': DataFrame1,
    'orders': DataFrame2,
    ...
    }
```

### Import modules

In [1]:
import os
import pandas as pd

### 1. Creating the variable `csv_path`, which stores the path to the `csv` folder as a string

In [11]:
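# Path to the csv folder, relative to the notebook's working directory
csv_path = os.path.join(os.getcwd(), '..', 'data', 'csv')

# Quick check: read one file and preview its first rows
pd.read_csv(os.path.join(csv_path, 'olist_sellers_dataset.csv')).head()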

|   | seller_id | seller_zip_code_prefix | seller_city | seller_state |
|---|---|---|---|---|
| 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | campinas | SP |
| 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | mogi guacu | SP |
| 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | rio de janeiro | RJ |
| 3 | c0f3eea2e14555b6faeea3dd58c1b1c3 | 4195 | sao paulo | SP |
| 4 | 51a04a8a6bdcb23deccc82b0b80742cf | 12914 | braganca paulista | SP |


### 2. Creating the list `file_names` containing all csv file names in the csv directory

In [12]:
file_names = [file for file in os.listdir(csv_path) if file.endswith('.csv')]
file_names

['olist_orders_dataset.csv',
 'olist_customers_dataset.csv',
 'olist_order_items_dataset.csv',
 'olist_products_dataset.csv',
 'product_category_name_translation.csv',
 'olist_sellers_dataset.csv',
 'olist_order_payments_dataset.csv',
 'olist_geolocation_dataset.csv',
 'olist_order_reviews_dataset.csv']
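
Note that `os.listdir` returns entries in arbitrary order, so the list above may come out in a different order on another machine. A minimal sketch of a deterministic variant, using `pathlib` and `sorted`; the helper name `list_csv_files` and the demo files are hypothetical and only serve the illustration:

```python
import tempfile
from pathlib import Path


def list_csv_files(folder):
    """Return the names of the .csv files in `folder`, sorted alphabetically."""
    return sorted(path.name for path in Path(folder).glob("*.csv"))


# Demo on a throwaway folder (stand-in for the real csv directory)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.csv", "a.csv", "notes.txt"):
        (Path(tmp) / name).touch()
    demo_files = list_csv_files(tmp)

print(demo_files)  # ['a.csv', 'b.csv']
```

Sorting is optional here since the final result is a dict, but it makes the printed lists reproducible.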

### 3. Creating the list of dict keys `key_names`
Starting from `file_names`, each name is cleaned by:
- Removing the `.csv` extension
- Removing the `_dataset` suffix when it exists
- Removing the `olist_` prefix when it exists

In [13]:
key_names = [file_name.replace(".csv", "").replace("_dataset", "").replace("olist_", "") for file_name in file_names]
key_names

['orders',
 'customers',
 'order_items',
 'products',
 'product_category_name_translation',
 'sellers',
 'order_payments',
 'geolocation',
 'order_reviews']
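
One caveat: chained `str.replace` calls remove the substring wherever it occurs, not only at the start or end of the name. That is harmless for these 9 files. A stricter variant could look like the sketch below, assuming Python 3.9+ (for `str.removeprefix` and `str.removesuffix`); the helper name `clean_name` is hypothetical:

```python
def clean_name(file_name: str) -> str:
    """Strip the '.csv' extension, then the '_dataset' suffix and the 'olist_' prefix if present."""
    return (
        file_name
        .removesuffix(".csv")
        .removesuffix("_dataset")
        .removeprefix("olist_")
    )


sample_files = [
    "olist_sellers_dataset.csv",
    "product_category_name_translation.csv",
]
sample_keys = [clean_name(name) for name in sample_files]
print(sample_keys)  # ['sellers', 'product_category_name_translation']
```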

### 4. Constructing the dictionary `data`

```python
data = { 
    'sellers': DataFrame1,
    'orders': DataFrame2,
    'order_items': DataFrame3,
    ...
    }
```

In [15]:
data = {key: pd.read_csv(os.path.join(csv_path, csv)) for key, csv in zip(key_names, file_names)}
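
A quick sanity check on the resulting dict is to list the shape of every DataFrame by key. The sketch below is only an illustration: the helper `summarize` is hypothetical, and `toy_data` is a small stand-in for the real `data` dict.

```python
import pandas as pd


def summarize(data: dict) -> pd.DataFrame:
    """Return one row per key with the row and column counts of its DataFrame."""
    return pd.DataFrame(
        [(key, *df.shape) for key, df in data.items()],
        columns=["key", "rows", "columns"],
    ).set_index("key")


# Stand-in for the real `data` dict (values are made up for the demo)
toy_data = {
    "sellers": pd.DataFrame({"seller_id": ["a", "b"], "seller_state": ["SP", "RJ"]}),
    "orders": pd.DataFrame({"order_id": ["x"]}),
}

summary = summarize(toy_data)
print(summary)
```

On the real dict, `summarize(data)` should show 9 rows, one per csv file.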

### 5. Implementing the method `get_data()` in `utils/data.py`

It returns the dictionary `data` when called as shown below, provided the root folder has been added to `sys.path`:

```python
import os
import sys

# Add the project root (one level above the notebook) to the import path
root_path = os.path.join(os.getcwd(), '..')
if root_path not in sys.path:
    sys.path.append(root_path)

from utils.data import Olist
olist = Olist()
data = olist.get_data()
```
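
For reference, here is a minimal sketch of what `Olist.get_data()` could look like when it bundles steps 1 to 4. This is not necessarily the actual content of `utils/data.py`: the optional `csv_path` argument and its default location are assumptions made for the illustration, and the demo runs on a temporary folder with a made-up file.

```python
import os
import tempfile

import pandas as pd


class Olist:
    """Hypothetical sketch of the loader class in utils/data.py."""

    def __init__(self, csv_path=None):
        # Assumption: same default folder as in the notebook (../data/csv)
        if csv_path is None:
            csv_path = os.path.join(os.getcwd(), '..', 'data', 'csv')
        self.csv_path = csv_path

    def get_data(self):
        """Return a dict mapping cleaned file names to DataFrames."""
        # Step 2: list the csv files (sorted for a reproducible order)
        file_names = sorted(
            f for f in os.listdir(self.csv_path) if f.endswith('.csv')
        )
        # Step 3: clean the names to build the dict keys
        key_names = [
            f.replace('.csv', '').replace('_dataset', '').replace('olist_', '')
            for f in file_names
        ]
        # Step 4: read each csv into a DataFrame
        return {
            key: pd.read_csv(os.path.join(self.csv_path, f))
            for key, f in zip(key_names, file_names)
        }


# Demo on a temporary folder holding one small, made-up csv file
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'seller_id': ['a', 'b']}).to_csv(
        os.path.join(tmp, 'olist_sellers_dataset.csv'), index=False
    )
    demo_data = Olist(csv_path=tmp).get_data()

print(list(demo_data))  # ['sellers']
```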
