# Data Preparation

In [26]:
# "magic commands" to enable autoreload of your imported packages
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Our goal is to load all 9 `.csv` files into 9 `pandas.DataFrame`s in a single dict named `data` where:
- each **key** is the **cleaned name** of the csv file
- each **value** is the **DataFrame** created from the csv

```python
data = { 
    'sellers': DataFrame1,
    'orders': DataFrame2,
    ...
    }
```

### 1. Create the variable `csv_path`, which stores the path to your `csv` folder as a string

- When calling `pd.read_csv(csv_path)`, `csv_path` can be absolute or relative:
    - A **`relative path`** can start with `.` or `..`; it is always computed with respect to your current working directory
        - *Reminder: you can use `!pwd` in your notebook or `pwd` in your terminal to know where you are located*
    - An **`absolute path`** starts with `/` 
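As a minimal sketch of the difference, the standard-library helpers `os.path.isabs` and `os.path.abspath` can tell the two kinds of path apart and convert one into the other. The folder name `some_folder` is hypothetical and does not need to exist.

```python
import os

# Hypothetical relative path: one level up, then into `some_folder`
relative_path = os.path.join("..", "some_folder")
print(os.path.isabs(relative_path))   # False

# `abspath` resolves the relative path against the current working directory
absolute_path = os.path.abspath(relative_path)
print(os.path.isabs(absolute_path))   # True
```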

In [36]:
# Check your current working directory using `os.getcwd()` below
import os
os.getcwd()

'/home/baska/code/Svvcm/04-Decision-Science/01-Project-Setup/data-data-preparation'

☝️ `getcwd` (short for `get current working directory`) returns the absolute path of the directory _from which this notebook is being executed_

Create a relative `csv_path` from your current folder to the csv folder.

Try to use [`os.path.join`](https://docs.python.org/3/library/os.path.html), which joins path components with the separator of the current operating system. It replaces both:
* Linux/macOS syntax (e.g. `../folder_name`)
* and Windows syntax (e.g. `..\\folder_name`)

and is therefore more robust!
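To illustrate, here is a small sketch with hypothetical folder names (they are placeholders, not the actual folders of this challenge). It only builds a string, so nothing needs to exist on disk.

```python
import os

# Hypothetical folder names, for illustration only
example_path = os.path.join("..", "parent_folder", "child_folder")

# The separator is chosen for the current operating system:
# "/" on Linux/macOS, "\" on Windows
print(example_path)
print(example_path == os.sep.join(["..", "parent_folder", "child_folder"]))  # True
```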

Have a look at the image below to see how you can get from your current location to the csv folder.
<img alt="Folder structure of olist with relative path" src="https://wagon-public-datasets.s3.amazonaws.com/04-Decision-Science/01-Project-Setup/folders-olist-data-relative.png" width=500>

In [28]:
# Relative path from the current working directory to the csv folder
csv_path = os.path.join("..", "data-context-and-setup", "data", "csv")

In [29]:
# Test your code below
import pandas as pd
pd.read_csv(os.path.join(csv_path, 'olist_sellers_dataset.csv')).head()

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP


### 2. Create the list `file_names` containing all csv file names in the csv directory

- It should look like this `file_names = ['olist_sellers_dataset.csv', ....]`
- You can use `os.listdir()`
- Make sure it only lists csv files!

In [None]:
# Keep only the csv files (this skips other files such as `.keep`)
file_names = [file_name for file_name in os.listdir(csv_path)
              if file_name.endswith(".csv")]
file_names

['olist_products_dataset.csv',
 'olist_customers_dataset.csv',
 'olist_sellers_dataset.csv',
 'olist_geolocation_dataset.csv',
 'olist_order_items_dataset.csv',
 'olist_order_payments_dataset.csv',
 'product_category_name_translation.csv',
 'olist_orders_dataset.csv',
 'olist_order_reviews_dataset.csv']

### 3. Create the list of dict keys `key_names`
Start from `file_names` and, for each file name:
- Remove its suffix `.csv` when it exists
- Remove its suffix `_dataset.csv` when it exists
- Remove its prefix `olist_` when it exists

<details>
    <summary>- Hint - </summary>

- `.replace()`
    
- `str`ings are sequences you can slice with `[ ]`
</details>

In [31]:
key_names = [file_name.replace("_dataset.csv","").replace("olist_","").replace(".csv","")
             for file_name in file_names]

key_names

['products',
 'customers',
 'sellers',
 'geolocation',
 'order_items',
 'order_payments',
 'product_category_name_translation',
 'orders',
 'order_reviews']
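Note that `.replace()` removes a substring wherever it appears in the name, not only at the start or end. As an alternative sketch, assuming Python 3.9 or later, `str.removeprefix` and `str.removesuffix` only touch the ends of the string. The helper name `clean_name` is hypothetical.

```python
def clean_name(file_name: str) -> str:
    """Strip the '.csv' and '_dataset' suffixes, then the 'olist_' prefix."""
    name = file_name.removesuffix(".csv")
    name = name.removesuffix("_dataset")
    return name.removeprefix("olist_")

print(clean_name("olist_sellers_dataset.csv"))              # sellers
print(clean_name("product_category_name_translation.csv"))  # product_category_name_translation
```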

### 4. Construct the dictionary `data`

```python
data = { 
    'sellers': DataFrame1,
    'orders': DataFrame2,
    'order_items': DataFrame3,
    ...
    }
```
Where `DataFrame1`, `DataFrame2`, ... should be actual `pandas.DataFrame`s, not strings containing the file paths to the csv files.

<details>
    <summary>▸ Hint</summary>

The `zip()` function is very useful to iterate over two lists in parallel
```python
for (x, y) in zip(['a','b','c'], [1,2,3]):
    print(x, y)

# prints:
# a 1
# b 2
# c 3
```
</details>
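As a further sketch of the same idea, `zip()` can also feed `dict()` or a dict comprehension directly, which pairs each key with its value in one step. The variable names below are illustrative only.

```python
letters = ['a', 'b', 'c']
numbers = [1, 2, 3]

# Build a dict by pairing each letter with the number at the same position
letter_to_number = dict(zip(letters, numbers))
print(letter_to_number)   # {'a': 1, 'b': 2, 'c': 3}

# Equivalent dict comprehension, handy when the value needs transforming
doubled = {letter: number * 2 for letter, number in zip(letters, numbers)}
print(doubled)            # {'a': 2, 'b': 4, 'c': 6}
```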

In [None]:
data = {}

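# Load each csv file in full; `.head()` is only used for display below
for (key_name, file_name) in zip(key_names, file_names):
    data[key_name] = pd.read_csv(os.path.join(csv_path, file_name))

data["sellers"].head()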


Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP


### 5. Implement the method `get_data()` in `olist/data.py`

Time to move our logic from the notebook into our `.py` files. This will allow us to easily load the data in the new notebooks we'll create throughout this module.

Go and open the `olist/data.py` file in the previous challenge's folder, and start moving the code you have written in this notebook to the `get_data()` method. Along the way you will need to make some changes (read further 👇 for some hints).

It should return the dictionary `data` when called as shown below:

```python
from olist.data import Olist
Olist().get_data()
```
- Take time to understand what happens when `Olist().get_data()` is called
- Your method `get_data()` needs to be callable from various places (e.g. your terminal, this notebook, another notebook located elsewhere, etc.)
- You can't use a relative path this time, because the current working directory `os.getcwd()` depends on where you run the code from
- You also can't use a _hardcoded_ absolute path, because that won't work on someone else's system.
- So we will have to let Python create the path for us, starting from `__file__`, which gives us the location of our `data.py` file. Once we know that, we can construct the path to our csv folder from it. Explore the image below to see how that works.
   <img alt="Folder structure of olist" src="https://wagon-public-datasets.s3.amazonaws.com/04-Decision-Science/01-Project-Setup/folders-olist-data.png" width=500>
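The path construction could be sketched as follows. This is only an illustration under an assumed layout, where the `data/csv` folder sits next to the `olist` package folder; check it against the image above. The helper name `csv_dir_from` is hypothetical and not part of the provided `olist/data.py`.

```python
import os

def csv_dir_from(module_file: str) -> str:
    """Return the csv folder path, given the path of a file inside `olist/`."""
    # Assumed layout: <project_root>/olist/data.py and <project_root>/data/csv
    package_dir = os.path.dirname(os.path.abspath(module_file))
    project_root = os.path.dirname(package_dir)
    return os.path.join(project_root, "data", "csv")

# Illustrative call with a made-up location
print(csv_dir_from(os.path.join("some_project", "olist", "data.py")))
```

Inside `data.py` itself, the equivalent call would pass `__file__`, so the result no longer depends on the current working directory.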

### Test your code

In [None]:
from nbresult import ChallengeResult
from olist.data import Olist
data = Olist().get_data()
result = ChallengeResult('get_data',
    keys_len=len(data),
    keys=sorted(list(data.keys())),
    columns=sorted(list(data['sellers'].columns)),
    vars_used=Olist.get_data.__code__.co_names
    )
result.write()
print(result.check())

In [None]:
from olist.data import Olist
Olist().get_data()['sellers'].head()

❓This piece of code needs to work from anywhere on your machine, not only in this notebook.
- Open a new terminal
- Go to your home folder with `cd`
- Launch an `ipython` session
- Test the two lines of code above 👆

🏁 Congratulations!

💾 Don't forget to commit & push: 
* this `data_preparation.ipynb` notebook
* as well as `data.py`