# TP1 Sandbox
- Use this notebook as a sandbox to explore the data and identify formatting and sanitizing problems.
- You may also use it to compose the sample dataframes and test whether the cleaning function output matches the expected result.
- As you validate your methods, create the corresponding function in `loader.py`.
- Once your script is ready, you can import the loading function and use it to load a clean dataframe.
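
One possible way to test whether a cleaning function's output matches the expected result is `pandas.testing.assert_frame_equal`. The sketch below is only an illustration: `clean_postal_code` is a hypothetical cleaning step, and the sample values are invented rather than taken from the dataset.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def clean_postal_code(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: strip whitespace around postal codes."""
    cleaned = df.copy()
    cleaned["com_cp"] = cleaned["com_cp"].str.strip()
    return cleaned


# Hand-written dirty sample and the result we expect after cleaning.
toy_dirty = pd.DataFrame({"com_cp": [" 34000", "34070 "]}, index=[3, 45])
toy_expected = pd.DataFrame({"com_cp": ["34000", "34070"]}, index=[3, 45])

# Raises AssertionError with a readable diff if the two frames differ.
assert_frame_equal(clean_postal_code(toy_dirty), toy_expected)
```

The same pattern should carry over to the real sample files once the actual cleaning functions exist in `loader.py`.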

# Exploration space
## Imports

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt


In [None]:
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 51)
pd.set_option('display.width', 1000)

## (Down)Load data

In [None]:
from loader import download_data

url_mtp = "https://data.montpellier3m.fr/sites/default/files/ressources/MMM_MMM_DAE.csv"
data = download_data(url_mtp)

## Drop empty columns

In [None]:
data.dropna(
    axis='columns',
    how='all',
    inplace=True)
data

## Keeping only columns of interest

We want to build a new dataframe containing:
- ID (the dataframe's index)
- Name
- Address (including postal code (com_cp) and city name (com_nom))
- Contact phone number
- Maintenance frequency
- Latest maintenance date
- Longitude
- Latitude

In [None]:
kept_columns = [
    'nom', 'adr_num', 'adr_voie',
    'com_cp', 'com_nom',
    'tel1',
    'freq_mnt', 'dermnt',
    'lat_coor1', 'long_coor1']
data_filter = data.filter(items=kept_columns)
data_filter
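
Note that `DataFrame.filter(items=...)` keeps the labels it finds and skips the others without raising an error, so a misspelled column name can go unnoticed. The sketch below shows one way to check for that on a toy frame; the names `toy`, `wanted` and `missing` are illustrative and the values are invented.

```python
import pandas as pd

# Toy frame standing in for the downloaded data; 'tel1' is deliberately absent.
toy = pd.DataFrame({"nom": ["A"], "com_cp": ["34000"]})
wanted = ["nom", "com_cp", "tel1"]

# filter(items=...) keeps the labels it finds and skips the rest without raising.
toy_filter = toy.filter(items=wanted)

# List the requested columns that did not make it into the result.
missing = [col for col in wanted if col not in toy_filter.columns]
print(missing)
```

Running the same check with `kept_columns` against `data_filter.columns` would reveal any column name that does not exist in the downloaded file.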


In [None]:
data_filter.info()

## Extract problematic cases to specify cleaning functions

Use this space to explore the dataframe and identify formatting and sanitizing problems. As you explore the data, save the indexes of interesting problematic cases in the set `idx_problem_cases`.

For instance, if you find that rows 3 and 45 are dirty, add them to the set with:
`idx_problem_cases.update([3, 45])`

In [None]:
idx_problem_cases = set()
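
Besides adding indexes by hand, problematic rows can be collected with a boolean mask. The sketch below works on a toy frame with invented phone numbers, and the rule it applies (a clean number is exactly ten digits) is only an assumption made for illustration, not a specification of the expected format.

```python
import pandas as pd

# Toy stand-in for data_filter; the values are invented for illustration.
toy = pd.DataFrame(
    {"tel1": ["0467000000", "04 67 00 00 00", None, "+33467000000"]},
    index=[3, 12, 45, 78],
)
toy_problem_cases = set()

# Assumed rule: a clean number is exactly ten digits with no separators.
# na=False makes missing values count as "not clean".
is_clean = toy["tel1"].str.fullmatch(r"\d{10}", na=False)

# Add the index labels of every row that breaks the rule.
toy_problem_cases.update(toy.loc[~is_clean].index.tolist())
print(sorted(toy_problem_cases))
```

Because the collection is a set, running such a cell several times does not create duplicate entries.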

### Address data

In [None]:
data_filter.filter(regex=r"adr_|com_")

#### adr_num field

In [None]:
data_filter.adr_num

#### adr_voie field

In [None]:
data_filter.adr_voie

#### com_cp field

In [None]:
data_filter.com_cp

#### com_nom field

In [None]:
data_filter.com_nom

### Contact info

In [None]:
data_filter.tel1

### Latest maintenance date

In [None]:
data_filter.dermnt

### Latitude and longitude

In [None]:
data_filter.filter(regex=r"_coor")

## Review selected cases and save as sample dirty data

In [None]:
# Sort the indexes so the sample has a reproducible row order.
sample_dirty = data_filter.loc[sorted(idx_problem_cases)]
sample_dirty

In [None]:
sample_dirty.to_csv('data/sample_dirty.csv')
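
`to_csv` does not create missing folders, so the `data/` directory has to exist before the sample is saved. The sketch below shows one way to guard against that and to reload the file afterwards. It uses a temporary directory and an invented toy frame so that it can run anywhere; the names `toy_sample`, `out_path` and `reloaded` are illustrative.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Toy stand-in for sample_dirty; the values are invented for illustration.
toy_sample = pd.DataFrame({"com_cp": ["34000", "34 070"]}, index=[3, 45])

# A temporary directory stands in for the project root in this sketch.
out_path = Path(tempfile.mkdtemp()) / "data" / "sample_dirty.csv"

# Create the target folder if needed, then write the sample with its index.
out_path.parent.mkdir(parents=True, exist_ok=True)
toy_sample.to_csv(out_path)

# Reload with the first column as the index and keep 'com_cp' as text,
# so that codes such as '34000' are not converted to integers.
reloaded = pd.read_csv(out_path, index_col=0, dtype={"com_cp": str})
```

Reading the dirty sample back as text keeps the formatting problems intact, which matters if the file is later used as input for testing the cleaning functions.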