# TP1 Framing

In this practical you will work through two data-framing exercises. Both rely mainly on `groupby` to aggregate data over a particular column. If you are not yet familiar with `pd.DataFrame.groupby`, study the following tutorial **before** attempting the exercises:
>["pandas GroupBy: Your Guide to Grouping Data in Python" by Brad Solomon, at RealPython.com](https://realpython.com/pandas-groupby)

## Imports

In [24]:
import numpy as np
import pandas as pd


import matplotlib.pyplot as plt

## Exercise 1 - eCommerce data

### Problem statement
1. This data is mostly formatted. You may need to set dtypes when loading the csv. 
2. From this data, select a sample of dirty data (`sample_dirty`) that you will frame manually.
3. Create a target sample dataframe corresponding to the following framing constraints (`sample_clean_invoice`)
    - Group data by `InvoiceNo`. 
    - Add a column `NumItems` with the number of items for a given `InvoiceNo`
    - Add a column `Total` corresponding to the total value of the purchase associated with `InvoiceNo`
    - Also keep `CustomerID` and `date` (the invoice date, taken from `InvoiceDate`). Drop the rest.
4. Create a target sample dataframe corresponding to the following framing constraints (`sample_clean_customer`)
    - Group data by `CustomerID`. 
    - Add a column `Total` corresponding to the total value of all purchases associated with `CustomerID`
    - Add a column `NumItems` with the number of items for a given `CustomerID`
    - Add a column `NumInvoices` with the number of invoices for a given `CustomerID`
    - Add a column `FirstInvoiceDate` with the date of the earliest invoice associated with `CustomerID`
    - Drop the remaining columns
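
Step 2 leaves the choice of `sample_dirty` open. One possible approach, sketched here on a toy frame, is to keep every line of a few chosen invoices so that per-invoice totals computed on the sample match those of the full data. The toy values and the `chosen_invoices` list are assumptions made for illustration; they are not taken from `ecommerce.csv`.

```python
import pandas as pd

# Toy stand-in for the `ecommerce` frame (values are made up for illustration).
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3", "A3"],
    "Quantity": [6, 8, 2, 1, 4],
    "UnitPrice": [2.5, 3.0, 10.0, 1.0, 0.5],
    "InvoiceDate": pd.to_datetime([
        "2010-12-01 08:26", "2010-12-01 08:26", "2010-12-02 09:00",
        "2010-12-03 10:15", "2010-12-03 10:15",
    ]),
    "CustomerID": pd.array(["17850", "17850", "13047", pd.NA, pd.NA], dtype="string"),
})

# Keep every line of a few chosen invoices, so that totals computed on the
# sample stay identical to totals computed on the full data.
chosen_invoices = ["A1", "A3"]  # hypothetical choice
sample_dirty = toy[toy["InvoiceNo"].isin(chosen_invoices)].copy()
print(sample_dirty)
```

Including an invoice with a missing `CustomerID` in the sample is worthwhile: `groupby` drops rows whose key is missing by default (`dropna=True`), so such rows test how your customer framing treats them.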

### Definition of DONE
1. You have a framing function such that:
   `frame_by_invoice(sample_dirty).equals(sample_clean_invoice)`
2. You have a framing function such that:
   `frame_by_customer(sample_dirty).equals(sample_clean_customer)`
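
`pd.DataFrame.equals` is strict: values, column dtypes, column order and index must all match. The short sketch below, on made-up frames, shows typical reasons why two frames that look alike still compare as different.

```python
import pandas as pd

expected = pd.DataFrame({"InvoiceNo": ["A1", "A2"], "NumItems": [2, 1]})

# Same values, same dtypes, same index: equal.
same = pd.DataFrame({"InvoiceNo": ["A1", "A2"], "NumItems": [2, 1]})
print(expected.equals(same))

# Same values stored as float instead of int: not equal.
as_float = same.astype({"NumItems": "float"})
print(expected.equals(as_float))

# Same rows in reverse order (index 1, 0): not equal ...
shuffled = same.iloc[::-1]
print(expected.equals(shuffled))

# ... until the rows are sorted and the index is rebuilt.
restored = shuffled.sort_values("InvoiceNo").reset_index(drop=True)
print(expected.equals(restored))
```

When your framing function fails the check, comparing `.dtypes` and `.index` of both frames is usually the quickest way to find the mismatch.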

### Implementation
Play around with your `sample_dirty` data until you reach the targets `sample_clean_invoice` and `sample_clean_customer`. The following pandas functions may be useful to you:
- [`pd.DataFrame.groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)
- [`pd.Series.sort_index`](https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_index.html)
- [`pd.Series.value_counts`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)
- [`pd.DataFrame.reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html)
- [`pd.core.groupby.GroupBy.apply`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.apply.html)
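
As a starting point, here is a minimal sketch of invoice-level framing with named aggregation on a toy frame. It is one possible approach, not the expected solution. The toy values and the helper column `LineTotal` are assumptions, and so is the reading of `NumItems` as the sum of `Quantity` (counting invoice lines would be another valid reading).

```python
import pandas as pd

# Toy data with the same column names as the eCommerce file (values made up).
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2"],
    "Quantity": [6, 8, 2],
    "UnitPrice": [2.5, 3.0, 10.0],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01 08:26", "2010-12-01 08:26", "2010-12-02 09:00"]
    ),
    "CustomerID": ["17850", "17850", "13047"],
})

# Value of each invoice line, computed before grouping (hypothetical helper column).
lines = toy.assign(LineTotal=toy["Quantity"] * toy["UnitPrice"])

by_invoice = (
    lines.groupby("InvoiceNo")
    .agg(
        NumItems=("Quantity", "sum"),        # assumption: items = summed quantities
        Total=("LineTotal", "sum"),
        CustomerID=("CustomerID", "first"),  # one customer per invoice in this toy data
        date=("InvoiceDate", "first"),
    )
    .reset_index()
)
print(by_invoice)
```

The customer-level framing can follow the same pattern, for example with `"nunique"` on `InvoiceNo` for `NumInvoices` and `"min"` on `InvoiceDate` for `FirstInvoiceDate`.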


In [25]:
FILE_PATH = 'data/ecommerce.csv'
ecommerce = pd.read_csv(
    FILE_PATH, 
    sep=';',
    dtype=dict(
        Quantity='int',
        UnitPrice='float',
        CustomerID='string',
        InvoiceNo='string',
    ),
    parse_dates=[4],
    encoding='latin1',
)
ecommerce.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  string        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int32         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  string        
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int32(1), object(3), string(2)
memory usage: 31.0+ MB


In [26]:
ecommerce.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |


### Your work here

## Exercise 2 - Gas data
### Problem statement
1. This data is mostly formatted. You may need to set dtypes when loading the csv. 
2. From this data, select a sample of dirty data (`sample_dirty`) that you will frame manually.
3. Create a target sample dataframe corresponding to the following framing constraints (`sample_clean`)
   - Group data by month (to be extracted from `Date` column)
   - Aggregate `Price` in two forms
      - in a column `mean` containing the mean price of the corresponding month
      - in a column `median` containing the median price of the corresponding month
   - Add a column `under_mean` with the number of days in the month on which the price was below the month's mean


### Definition of DONE
You have a framing function such that:
 `frame_by_month(sample_dirty).equals(sample_clean)`


### Implementation
Play around with your `sample_dirty` data until you reach the target `sample_clean`. The following pandas functions may be useful to you:
- [`pd.DataFrame.groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)
- [`pd.Series.sort_index`](https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_index.html)
- [`pd.Series.value_counts`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)
- [`pd.DataFrame.reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html)
- [`pd.core.groupby.GroupBy.apply`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.apply.html)
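
The monthly framing can be sketched on a toy frame as follows. This is one possible approach under stated assumptions, not the expected solution: the toy prices are made up, and the helper names `month`, `month_mean` and `below` are hypothetical. It uses `transform` to align each month's mean with the daily rows before counting the days below it.

```python
import pandas as pd

# Toy daily prices with the same column names as the gas file (values made up).
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-02-01", "2020-02-02"]
    ),
    "Price": [1.0, 2.0, 6.0, 4.0, 4.0],
})

# Monthly period (e.g. 2020-01) used as the grouping key.
month = toy["Date"].dt.to_period("M").rename("month")

# Mean of each row's own month, aligned with the original daily rows.
month_mean = toy.groupby(month)["Price"].transform("mean")

by_month = (
    toy.assign(below=toy["Price"] < month_mean)
    .groupby(month)
    .agg(
        mean=("Price", "mean"),
        median=("Price", "median"),
        under_mean=("below", "sum"),  # summing booleans counts the True values
    )
)
print(by_month)
```

The real file has one missing `Price`. `mean` and `median` skip missing values by default, and a comparison against a missing value is `False`, so that day would not be counted in `under_mean` with this approach.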

In [28]:
GAZ_FILE_PATH = 'data/gaz_prices_daily_csv.csv'
gaz = pd.read_csv(GAZ_FILE_PATH,
                  sep=';',
                  dtype={'Price':'float'},
                  parse_dates=[0])
gaz.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5953 entries, 0 to 5952
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    5953 non-null   datetime64[ns]
 1   Price   5952 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 93.1 KB


### Your work here