# TP 1 : Formatting and Sanitizing

For this first practical assignment, we'll be working with data from Montpellier Métropole. You will download the raw "Defibrillators of Montpellier" dataset (the one you've played with in CodinGame) and write a loader script that cleans it and loads a sane dataframe.

To build it, you will proceed with the following steps: 
1. Problem specification
2. Definition of Done
3. Implementation of the cleaning functions

## Technical requisites
Throughout this assignment you'll use multiple pandas functionalities, such as:
- dropping useless columns
- converting string types to numeric or datetime types
- manipulating strings with `.str` (aka the string accessor)
- checking strings for patterns using regular expressions
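
As a rough illustration of these four operations, here is a minimal sketch on a made-up dataframe. The column names and values are invented for the example and are not taken from the Montpellier dataset.

```python
import pandas as pd

# Toy data: every column name and value here is invented for illustration.
toy = pd.DataFrame({
    "name": ["  town hall ", "Pool", "gym"],
    "count": ["1", "2", "n/a"],
    "checked": ["2021-03-01", "2020-12-15", "not a date"],
    "useless": [None, None, None],
})

# 1. Drop a useless column.
toy = toy.drop(columns=["useless"])

# 2. Convert strings to numeric / datetime types.
#    With errors="coerce", unparseable values become NaN / NaT.
toy["count"] = pd.to_numeric(toy["count"], errors="coerce")
toy["checked"] = pd.to_datetime(toy["checked"], format="%Y-%m-%d", errors="coerce")

# 3. Manipulate strings with the .str accessor.
toy["name"] = toy["name"].str.strip().str.title()

# 4. Check strings against a regular expression (here: starts with "P" or "G").
starts_with_p_or_g = toy["name"].str.match(r"[PG]")
```

Whether an unparseable value should be coerced to a missing value or handled differently is a design decision you will have to make for each column.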

To learn how to do this in pandas, you may refer to the following documents:
- ["Pythonic Data Cleaning With pandas and NumPy",  by Malay Agarwal at RealPython.com](https://realpython.com/python-data-cleaning-numpy-pandas/)
- ["Working with text data", at pandas docs](https://pandas.pydata.org/docs/user_guide/text.html#)
- ["Time series / date functionality", at pandas docs](https://pandas.pydata.org/docs/user_guide/timeseries.html)

## Problem specification

1. Select a handful of examples from your data that cover the problems you have identified
2. With these examples, compose a sample dirty dataframe (`sample_dirty`)
3. Manually compose an equivalent dataframe where all the formatting problems in the first one got fixed (`sample_formatted`)
4. Manually compose an equivalent dataframe where all the sanity problems in the second one got fixed (`sample_sane`)
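
The three dataframes could look like the sketch below. The rows are hypothetical (invented values and index labels, not copied from the dataset), and the split between "formatting" and "sanity" problems shown here is only one possible reading: whitespace and casing are treated as formatting, while an impossible postal code is treated as a sanity problem.

```python
import pandas as pd

# Hypothetical rows: these values are invented, not copied from the dataset.
sample_dirty = pd.DataFrame(
    {"com_cp": ["34000", " 34070 ", "3400"],
     "com_nom": ["MONTPELLIER", "montpellier ", "Montpellier"]},
    index=[3, 45, 78],
)

# Formatting problems fixed by hand: whitespace removed, consistent casing.
sample_formatted = pd.DataFrame(
    {"com_cp": ["34000", "34070", "3400"],
     "com_nom": ["Montpellier", "Montpellier", "Montpellier"]},
    index=[3, 45, 78],
)

# Sanity problems fixed by hand: a 4-digit postal code cannot be valid,
# so it is replaced by a missing value.
sample_sane = pd.DataFrame(
    {"com_cp": ["34000", "34070", None],
     "com_nom": ["Montpellier", "Montpellier", "Montpellier"]},
    index=[3, 45, 78],
)
```

Keeping the same index in all three dataframes makes it easy to compare a given row before and after each cleaning stage.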

## Definition of Done
Prepare some tests that ensure that your cleaning functions work as expected:
1. You have a formatting function that satisfies
`format_dataframe(sample_dirty).equals(sample_formatted)`
2. You have a sanitizing function that satisfies
`sanitize_dataframe(sample_formatted).equals(sample_sane)`
3. You have a loading function that satisfies
`load_clean_dataframe(sample_dirty).equals(sample_sane)`
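
When one of these checks fails, `.equals` only returns `False` without saying why. The sketch below, on invented values, shows that `.equals` is strict about dtypes, and that `pandas.testing.assert_frame_equal` can be used as an optional debugging aid because it reports what differs. It is not a required part of the assignment.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Two frames that display the same values but have different dtypes
# (the values are invented for illustration).
expected = pd.DataFrame({"adr_num": [1.0, 2.0]})
result = pd.DataFrame({"adr_num": [1, 2]})

# .equals returns a bare boolean and requires matching column dtypes.
same = result.equals(expected)
print(same)

# assert_frame_equal raises an AssertionError describing the mismatch.
message = ""
try:
    assert_frame_equal(result, expected)
except AssertionError as err:
    message = str(err)
    print(message)
```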

## Implementation of the cleaning functions
Your loading function should be in a file `loader.py`. It should be composed of smaller formatting and sanitizing functions. For example:
```python
import pandas as pd


def format_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """One function to do all the formatting."""
    df = (df.pipe(format_adr_num)
            .pipe(format_adr_voie)
            .pipe(format_com_cp)
            .pipe(format_com_nom)
            # ... add others (or remove) if necessary
          )
    return df


def sanitize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """One function to do all the sanitizing."""
    df = (df.pipe(sanitize_adr_num)
            .pipe(sanitize_adr_voie)
            .pipe(sanitize_com_cp)
            .pipe(sanitize_com_nom)
            # ... add others (or remove) if necessary
          )
    return df


def load_clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """One function to run it all and return a clean dataframe."""
    df = (df.pipe(format_dataframe)
            .pipe(sanitize_dataframe)
          )
    return df
```
- This notebook can be used as a sandbox for you to explore the data and identify formatting and sanitizing problems.
- You may also use it to compose the sample dataframes and test if the cleaning function output matches the expected result.
- As you validate your methods, create the corresponding function in `loader.py`
- Once your script is ready, you can import the loading function and use it to load a clean dataframe.
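
To give an idea of what one of the smaller functions might look like, here is a minimal sketch for the `com_cp` column. The cleaning rules themselves (strip whitespace, then keep only codes made of exactly five digits) are assumptions made for the example. The rules you actually need depend on the problems you find in the data.

```python
import pandas as pd


def format_com_cp(df: pd.DataFrame) -> pd.DataFrame:
    """Format postal codes as stripped strings (hypothetical rule)."""
    return df.assign(com_cp=df["com_cp"].astype("string").str.strip())


def sanitize_com_cp(df: pd.DataFrame) -> pd.DataFrame:
    """Replace postal codes that are not exactly five digits
    with a missing value (hypothetical rule)."""
    valid = df["com_cp"].str.fullmatch(r"\d{5}", na=False)
    return df.assign(com_cp=df["com_cp"].where(valid))


# Invented values, for illustration only.
toy = pd.DataFrame({"com_cp": [" 34000", "34070 ", "3400"]})
clean = toy.pipe(format_com_cp).pipe(sanitize_com_cp)
```

Each function takes a dataframe and returns a new one through `assign`, so the input is left untouched and the functions can be chained with `.pipe` in any order you choose.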


# Exploration space
## Imports

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt


In [None]:
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 51)
pd.set_option('display.width', 1000)

## (Down)Load data

In [None]:

from loader import download_data
url_mtp = "https://data.montpellier3m.fr/sites/default/files/ressources/MMM_MMM_DAE.csv"


data = download_data(url_mtp)

## Drop empty columns

In [None]:
data.dropna(
    axis='columns',
    how='all',
    inplace=True)
data

## Keeping only columns of interest

We want to build a new dataframe containing:
- ID (the dataframe's index)
- Name
- Address (including postal code (com_cp) and city name (com_nom))
- Contact phone number
- Maintenance frequency
- Latest maintenance date
- Longitude
- Latitude

In [None]:
kept_columns = [
    'nom', 'adr_num','adr_voie',
    'com_cp', 'com_nom', 
    'tel1',
    'freq_mnt', 'dermnt',
    'lat_coor1', 'long_coor1']
data_filter = data.filter(items=kept_columns)
data_filter


In [None]:
data_filter.info()

## Extract problematic cases to specify cleaning functions

Use this space to explore the dataframe and identify formatting and sanitizing problems. As you explore the data, save the indexes of interesting problematic cases in the set `idx_problem_cases`.

For instance, if you identify examples 3 and 45 to be dirty, add them to the set with:
`idx_problem_cases.update([3, 45])`
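
Instead of typing indexes by hand, you can also collect them from a boolean mask. The sketch below uses a small stand-in dataframe with invented phone numbers, and the "exactly ten digits" rule is only an assumption chosen for the example.

```python
import pandas as pd

# Stand-in for data_filter: invented values, for illustration only.
toy = pd.DataFrame(
    {"tel1": ["0467000000", "04 67 00 00 00", None, "+33 4 67 00 00 00"]},
    index=[3, 12, 45, 78],
)

idx_problem_cases = set()

# True where the phone number is exactly ten digits; missing values count as False.
is_ten_digits = toy["tel1"].str.fullmatch(r"\d{10}", na=False)

# Add the index labels of every row that fails the check.
idx_problem_cases.update(toy[~is_ten_digits].index.tolist())
```

Since `idx_problem_cases` is a set, adding the same index from several checks does not create duplicates.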

In [None]:
idx_problem_cases = set()

### Address data

In [None]:
data_filter.filter(regex=r"adr_|com_")

#### adr_num field

In [None]:
data_filter.adr_num

#### adr_voie field

In [None]:
data_filter.adr_voie

#### com_cp field

In [None]:
data_filter.com_cp

#### com_nom field

In [None]:
data_filter.com_nom

### Contact info

In [None]:
data_filter.tel1

### Latest maintenance date

In [None]:
data_filter.dermnt

### Latitude and longitude

In [None]:
data_filter.filter(regex=r"_coor")

## Review selected cases and save as sample dirty data

In [None]:
sample_dirty = data_filter.loc[list(idx_problem_cases)]
sample_dirty

In [None]:
sample_dirty.to_csv('data/sample_dirty.csv')

# Creating test target dataframes

## Create sample_formatted from sample_dirty


In [None]:
sample_formatted = ...

sample_formatted.info()

sample_formatted

## Create sample_sane from sample_formatted


In [None]:
sample_sane = ...


sample_sane.info()

In [None]:
sample_sane

# TEST: Compare dataframe after cleaning to sample_sane

In [None]:
from loader import load_clean_dataframe

loaded_data = load_clean_dataframe(sample_dirty)

loaded_data.equals(sample_sane)