# Exploratory Analysis

In [9]:
%load_ext autoreload
%autoreload 2

## 1 - Code get_data

Our goal is to load all 8 csv files as `pandas.DataFrame` objects into a single dict named `data`, where each key is the name of a csv file and each value is the DataFrame created from that file.

#### 1.1 Create the variable `csv_path`, which stores as a string the relative path to your csv folder

- A relative path starts with `.` or `..` to indicate that the path is relative to your **current working directory**
- Your **current working directory** is accessible via `os.getcwd()`
- Use [`os.path.join`](https://docs.python.org/3/library/os.path.html) to build paths. For instance, `os.path.join('..', a_path)` produces `../a_path` on Linux and macOS, and `..\a_path` on Windows
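
As a minimal sketch of the bullets above, the snippet below builds a relative path with `os.path.join` and resolves it against the current working directory. The folder names `data` and `csv` are hypothetical placeholders and may not match your project layout.

```python
import os

# Hypothetical folder names, used only for illustration
relative_path = os.path.join('..', 'data', 'csv')

# os.path.join inserts the separator of the current OS (os.sep)
print(relative_path)

# A relative path is resolved against the current working directory
print(os.path.abspath(relative_path))
```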

In [1]:
import os
import pandas as pd

os.getcwd()

In [None]:
# Your code here


In [None]:
# Test your code
pd.read_csv(os.path.join(csv_path, 'olist_sellers_dataset.csv'))

#### 1.2 Create the list `file_names` containing all csv file names in the csv directory

- It should look like this `file_names = ['olist_sellers_dataset.csv', ....]`
- You can use `os.listdir()`
- Make sure it only lists csv files!
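
One possible way to filter a directory listing down to csv files is sketched below. It runs against a throwaway temporary folder with made-up file names, so it does not depend on where your real csv folder lives.

```python
import os
import tempfile

# Build a throwaway folder so the example runs anywhere
with tempfile.TemporaryDirectory() as folder:
    for name in ['a_dataset.csv', 'b_dataset.csv', 'notes.txt']:
        open(os.path.join(folder, name), 'w').close()

    # Keep only entries whose name ends with ".csv"
    csv_files = sorted(f for f in os.listdir(folder) if f.endswith('.csv'))

print(csv_files)
```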

#### 1.3 Construct a function that takes a file name `string` as input and outputs a cleaner `string` by:
- Removing its suffix ".csv" when it exists
- Removing its suffix "_dataset.csv" when it exists
- Removing its prefix "olist_" when it exists

<details>
    <summary>Hint</summary>

`strings` are sequences you can slice with [ ]
</details>
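
To illustrate the slicing hint, here is a rough sketch of removing a known suffix and prefix from a string. The helper name `strip_affixes` and its parameters are hypothetical; treat it as one possible approach rather than the expected solution.

```python
def strip_affixes(name, prefix='olist_', suffixes=('_dataset.csv', '.csv')):
    """Hypothetical helper: drop one known suffix and the prefix, if present."""
    # Check the longest suffix first so "_dataset.csv" wins over ".csv"
    for suffix in suffixes:
        if name.endswith(suffix):
            name = name[:-len(suffix)]  # slice off the suffix
            break
    if name.startswith(prefix):
        name = name[len(prefix):]  # slice off the prefix
    return name

print(strip_affixes('olist_sellers_dataset.csv'))
print(strip_affixes('product_category_name_translation.csv'))
```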

In [13]:
def key_from_file_name(f):
    pass

#### 1.4 Construct the dictionary `data`

- Its keys should look like `'sellers'` `'orders'` `'order_items'` etc...
- Its values should be `pandas.DataFrame` objects

Don't hesitate to re-use `file_names` and `key_from_file_name` defined above
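
The general pattern (one cleaned key per file, one DataFrame per key) can be sketched with a dict comprehension. The example below writes two tiny stand-in csv files to a temporary folder and stores the result in `toy_data`, a hypothetical name chosen so it does not overwrite your own `data`. The key cleaning is done inline with `str.replace` purely for brevity.

```python
import os
import tempfile
import pandas as pd

with tempfile.TemporaryDirectory() as folder:
    # Two tiny hypothetical csv files standing in for the real ones
    pd.DataFrame({'seller_id': ['s1', 's2']}).to_csv(
        os.path.join(folder, 'olist_sellers_dataset.csv'), index=False)
    pd.DataFrame({'order_id': ['o1']}).to_csv(
        os.path.join(folder, 'olist_orders_dataset.csv'), index=False)

    names = sorted(f for f in os.listdir(folder) if f.endswith('.csv'))
    # Map a cleaned key to the DataFrame read from each file
    toy_data = {
        name.replace('olist_', '').replace('_dataset.csv', ''):
            pd.read_csv(os.path.join(folder, name))
        for name in names
    }

print(list(toy_data.keys()))
```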

In [2]:
data = {}

#### 1.5 Implement and save the function `get_data()` in `04-Decision-Science/olist/data.py` so that it returns the dictionary `data` when called as below
```python
from olist.data import Olist
Olist().get_data()
```
- First, take time to understand step by step what happens when calling `Olist().get_data()`
- Your method `get_data()` needs to be callable from anywhere (the Command Terminal, this notebook, another notebook in a different folder, etc...)
- Using a relative path will not work this time, because the current working directory `os.getcwd()` from which relative paths are computed depends on where you run the code

👉 You will therefore have to build the absolute path to the csv folder. **However, do not hard-code the absolute path manually**: it would contain your computer username and would therefore not work for any other team member working on the same project!


- Have a look at the `__file__` variable that Python defines in each module, which can act as an "anchor".
- Make extensive use of `import ipdb; ipdb.set_trace()` to investigate what the `__file__` variable really is. This is a great exercise to learn debugging!
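
As a hedged sketch of the "anchor" idea, the helper below derives an absolute folder path from the path of a module file. The function name `csv_folder_from` and the `data/csv` layout one level above the module folder are assumptions for illustration; adapt them to your actual project tree.

```python
import os

def csv_folder_from(module_file):
    """Hypothetical helper: locate <project>/data/csv from a module path."""
    # Absolute path of the folder containing the module (e.g. .../olist)
    module_dir = os.path.dirname(os.path.abspath(module_file))
    # One level up is assumed to be the project root
    project_root = os.path.dirname(module_dir)
    return os.path.join(project_root, 'data', 'csv')

# Inside data.py you would pass __file__; a made-up path is used here
example = csv_folder_from(os.path.join('project', 'olist', 'data.py'))
print(example)
```

Because the result is computed from the location of the module itself, it does not depend on the directory from which the code is launched.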



In [None]:
# Test your code
from olist.data import Olist
Olist().get_data().keys()

In [None]:
# Test your code (bis)
Olist().get_data()['order_reviews'].equals(data['order_reviews'])

## 2 - Run an exploratory analysis with pandas profiling

Run an exploratory analysis for the sub-list of datasets below, using [pandas-profiling](https://github.com/pandas-profiling/pandas-profiling). Create and save one HTML report per dataset under a new `04-Decision-Science/reports` folder.

Don't forget to `pip install pandas-profiling` and import it.

In [1]:
import pandas_profiling
profiling_data = ['orders', 'products', 'sellers',
                  'customers', 'order_reviews',
                  'order_items']

# We create a new "reports" folder in which we will store the html reports
!mkdir -p ../../data/reports


In [None]:
# We'll create and save below one html report per dataset (it takes some time to run!)
for d in profiling_data:
    print('exporting: '+d)
    profile = data[d].profile_report(title='Report for '+d)
    profile.to_file(output_file="../../data/reports/"+d+'.html')

Take some time to read the reports.
Note which columns have missing data (feel free to complete the list below), and add any other insights of your choice to your db.lewagon.org schema if needed.
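
Besides reading the reports, missing data can also be spotted programmatically. The sketch below uses a toy DataFrame with hypothetical column names, not the real Olist data.

```python
import pandas as pd

# Toy DataFrame with hypothetical column names
df = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o3'],
    'delivered_at': ['2018-01-02', None, '2018-01-05'],
})

# Number of missing values per column
missing_counts = df.isna().sum()
# Names of the columns that contain at least one missing value
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(columns_with_missing)
```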

In [92]:
# columns that have missing data
columns_missing_data = [
]

## 3 - Create main matching table

Looking at our schema, it seems wise to create, ahead of our in-depth analysis, a central `matching_table` that joins the most important foreign keys together. We may reuse it often this week.

❓Create the `matching_table` below. The DataFrame should have the following columns.

In [93]:
columns_matching_table = [
    "order_id",
    "review_id",
    "customer_id",
    "product_id",
    "seller_id",
]

👇 During this week, we suggest you name your various DataFrames in Python using the following convention.  
Make heavy use of "Tab" to auto-complete the key names of your dictionaries!

In [94]:
# orders = data['orders']
# sellers = data['sellers']
# products = data['products']
# items = data['order_items']
# reviews = data['order_reviews']

We don't want to risk losing any information at this stage of pre-processing, so make sure to merge with outer joins.
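
To see why the join type matters, the toy example below compares an inner and an outer merge on two small DataFrames. The names `orders_toy` and `reviews_toy` and their contents are made up for illustration.

```python
import pandas as pd

orders_toy = pd.DataFrame({'order_id': ['o1', 'o2'],
                           'customer_id': ['c1', 'c2']})
reviews_toy = pd.DataFrame({'order_id': ['o1', 'o3'],
                            'review_id': ['r1', 'r3']})

# Inner join keeps only the order_id values present in both tables
inner = orders_toy.merge(reviews_toy, on='order_id', how='inner')
# Outer join keeps every order_id and fills the gaps with NaN
outer = orders_toy.merge(reviews_toy, on='order_id', how='outer')

print(inner.shape, outer.shape)
```

The outer merge keeps `o2` (no review) and `o3` (no order row), which the inner merge would silently drop.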

In [5]:
# Select only the columns of interest in the relevant DataFrames before proceeding to any merge

In [6]:
# Inspect the cardinality of each DataFrame using pd.DataFrame.shape and pd.Series.nunique()

In [7]:
# Carefully merge DataFrames

In [8]:
# Inspect the cardinality and `nunique` of the final DataFrame. It should match (114100, 5)

___
❓Copy your logic into `get_matching_table()` in `data.py` and test that it outputs exactly the same DataFrame as the `matching_table` variable in your notebook

In [99]:
from olist.data import Olist
Olist().get_matching_table()

| | customer_id | order_id | review_id | product_id | seller_id |
|---|---|---|---|---|---|
| 0 | 9ef432eb6251297304e76186b10a928d | e481f51cbdc54678b7cc49136f2d6af7 | a54f0611adc9ed256b57ede6b6eb5114 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 |
| 1 | b0830fb4747a6c6d20dea0b8c802d7ef | 53cdb2fc8bc7dce0b6741e2150273451 | 8d5266042046a06655c8db133d120ba5 | 595fac2a385ac33a80bd5114aec74eb8 | 289cdb325fb7e7f891c38608bf9e0962 |
| 2 | 41ce2a54c0b03bf3443c3d931a367089 | 47770eb9100c2d0c44946d9cf07ec65d | e73b67b67587f7644d5bd1a52deb1b01 | aa4383b373c6aca5d8797843e5594415 | 4869f7a5dfa277a7dca6462dcf3b52b2 |
| 3 | f88197465ea7920adcdbec7375364d82 | 949d5b44dbf5de918fe9c16f97b45f8a | 359d03e676b3c069f62cadba8dd3f6e8 | d0b61bfb1de832b15ba9d266ca96e5b0 | 66922902710d126a0e7d26b0e3805106 |
| 4 | 8ab97904e6daea8866dbdbc4fb7aad2c | ad21c59c0840e6cb83a9ceb5573f8159 | e50934924e227544ba8246aeb3770dd4 | 65266b2da20d04dbe00c5c2d3bb7859e | 2c9e548be18521d1c43cde1c582c6de8 |
| ... | ... | ... | ... | ... | ... |
| 114095 | 1fca14ff2861355f6e5f14306ff977a7 | 63943bddc261676b46f01ca7ac2f7bd8 | 29bb71b2760d0f876dfa178a76bc4734 | f1d4ce8c6dd66c47bbaa8c6781c2a923 | 1f9ab4708f3056ede07124aad39a2554 |
| 114096 | 1aa71eb042121263aafbe80c1b562c9c | 83c1379a015df1e13d02aae0204711ab | 371579771219f6db2d830d50805977bb | b80910977a37536adeddd63663f916ad | d50d79cb34e38265a8649c383dcffd48 |
| 114097 | b331b74b18dc79bcdf6532d51e1637c1 | 11c177c8e97725db2631073c19f07b62 | 8ab6855b9fe9b812cd03a480a25058a1 | d1c427060a0f73f6b889a5c7c61f2ac4 | a1043bafd471dff536d0c462352beb48 |
| 114098 | b331b74b18dc79bcdf6532d51e1637c1 | 11c177c8e97725db2631073c19f07b62 | 8ab6855b9fe9b812cd03a480a25058a1 | d1c427060a0f73f6b889a5c7c61f2ac4 | a1043bafd471dff536d0c462352beb48 |


In [100]:
matching_table.equals(Olist().get_matching_table())

True