In [1]:
import numpy as np
import pandas as pd

# Load the data

In [2]:
data_path = '../data/input/household_power_consumption.txt'

data = pd.read_csv(data_path, sep=';')


In [3]:
data.dtypes

Date                      object
Time                      object
Global_active_power       object
Global_reactive_power     object
Voltage                   object
Global_intensity          object
Sub_metering_1            object
Sub_metering_2            object
Sub_metering_3           float64
dtype: object

The dataset description states that some values are missing. I display one of the affected days below.

In [4]:
data.loc[data.Date == '28/4/2007']

Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
190476,28/4/2007,00:00:00,1.368,0.086,233.050,5.800,0.000,1.000,0.0
190477,28/4/2007,00:01:00,1.370,0.086,233.220,5.800,0.000,1.000,0.0
190478,28/4/2007,00:02:00,1.372,0.088,233.570,5.800,0.000,2.000,0.0
190479,28/4/2007,00:03:00,1.370,0.086,233.400,5.800,0.000,1.000,0.0
190480,28/4/2007,00:04:00,1.368,0.086,233.250,5.800,0.000,1.000,0.0
190481,28/4/2007,00:05:00,1.368,0.086,233.170,5.800,0.000,1.000,0.0
190482,28/4/2007,00:06:00,1.370,0.086,233.370,5.800,0.000,1.000,0.0
190483,28/4/2007,00:07:00,1.362,0.084,232.550,5.800,0.000,1.000,0.0
190484,28/4/2007,00:08:00,1.362,0.084,232.430,5.800,0.000,1.000,0.0
190485,28/4/2007,00:09:00,1.366,0.086,233.060,5.800,0.000,2.000,0.0


We can see that missing values are represented by `?` or `NaN`.
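To quantify both kinds of missing marker, the two can be counted per column. This is a minimal sketch on a small made-up frame (`sample` and its values are hypothetical, not taken from the file); the same two calls should apply to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample that mimics the two kinds of missing marker.
sample = pd.DataFrame({
    "Global_active_power": ["1.368", "?", "1.372"],  # text column using '?'
    "Sub_metering_3": [0.0, np.nan, 0.0],            # float column using NaN
})

# Count the '?' placeholders in each column.
question_marks = sample.isin(["?"]).sum()

# Count the real NaN values in each column.
nans = sample.isna().sum()

print(question_marks)
print(nans)
```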

# Clean the data

## Change the format of `Date` and `Time`

I want to merge the first and second columns into a single column.

The new column must be in `datetime` format. It also serves as the index of the data, which makes processing easier.

In [5]:
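# Merge columns 0 (Date) and 1 (Time) into a single 'datetime' index.
# low_memory=False reads the file in one pass so that column types are inferred consistently.
df = pd.read_csv(data_path, sep=';', low_memory=False,
                 infer_datetime_format=True, parse_dates={'datetime':[0,1]}, index_col=['datetime'])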
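Recent pandas releases deprecate `infer_datetime_format` and the dictionary form of `parse_dates`, so the call above may emit warnings or stop working depending on the installed version. As an alternative, here is a sketch that builds the same index with `pd.to_datetime` and an explicit day-first format. The in-memory text `sample_text` and the names `raw`, `timestamps` and `parsed` are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical two-line excerpt in the same layout as the input file.
sample_text = (
    "Date;Time;Global_active_power\n"
    "16/12/2006;17:24:00;4.216\n"
    "1/2/2007;00:00:00;1.500\n"
)

raw = pd.read_csv(io.StringIO(sample_text), sep=";")

# Join the two text columns and parse them with an explicit day/month/year format.
timestamps = pd.to_datetime(raw["Date"] + " " + raw["Time"],
                            format="%d/%m/%Y %H:%M:%S")

# Drop the original columns and use the parsed timestamps as the index.
parsed = raw.drop(columns=["Date", "Time"])
parsed.index = pd.DatetimeIndex(timestamps, name="datetime")

print(parsed)
```

Stating the format explicitly removes any ambiguity between day-first and month-first dates such as `1/2/2007`.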

In [6]:
df.head()

datetime,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
2006-12-16 17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
2006-12-16 17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2006-12-16 17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
2006-12-16 17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
2006-12-16 17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0


## Fill in the missing values

To do this, I first replace `?` with `NaN` and convert all the numeric values to `float`.

Then I fill each missing value with the value recorded at the same time on the previous day.

In [7]:
df.replace('?', np.nan, inplace=True)

In [8]:
df = df.astype('float64')
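The two cells above can also be written as a single conversion. The sketch below compares both approaches on a small made-up frame (`sample` is hypothetical). Note that `pd.to_numeric` with `errors="coerce"` turns every unparseable string into `NaN`, not only `?`, which is a slightly broader behaviour than the explicit replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in which '?' marks a missing measurement.
sample = pd.DataFrame({
    "Voltage": ["233.050", "?", "233.570"],
    "Global_intensity": ["5.800", "5.800", "?"],
})

# Approach used in this notebook: explicit replacement, then a cast to float.
as_float = sample.replace("?", np.nan).astype("float64")

# Alternative: coerce every column to numeric; unparseable text becomes NaN.
coerced = sample.apply(pd.to_numeric, errors="coerce")

print(coerced.dtypes)
```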

In [9]:
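def remplir_donnees(values):
    """Fill each NaN in place with the value recorded 24 hours earlier."""
    one_day = 60 * 24  # number of one-minute rows in a day
    for row in range(values.shape[0]):
        for col in range(values.shape[1]):
            # Skip the first day: a negative index would wrap around
            # to the end of the array instead of pointing at the day before.
            if row >= one_day and np.isnan(values[row, col]):
                values[row, col] = values[row - one_day, col]

In [10]:
# Modifies df in place, assuming df.values returns a view
# (all columns are float64 and copy-on-write is disabled).
remplir_donnees(df.values)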
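The double loop is slow on a file of this size. As a possible alternative, here is a sketch that applies the same "same time, previous day" rule with `DataFrame.shift`. It matches on timestamps instead of row positions, so it assumes a `DatetimeIndex`. The function name `fill_from_previous_day` and the `max_passes` parameter are hypothetical, and the demo frame is made up.

```python
import numpy as np
import pandas as pd

def fill_from_previous_day(frame, max_passes=10):
    """Return a copy where each NaN takes the value from 24 hours earlier."""
    filled = frame.copy()
    for _ in range(max_passes):
        missing_before = int(filled.isna().sum().sum())
        if missing_before == 0:
            break
        # Move every row forward by one day, then realign on the original index.
        previous_day = filled.shift(freq=pd.Timedelta(days=1)).reindex(filled.index)
        filled = filled.fillna(previous_day)
        # Stop when a pass fills nothing (for example, gaps in the first day).
        if int(filled.isna().sum().sum()) == missing_before:
            break
    return filled

# Hypothetical hourly demo: the same hour is missing on days 2 and 3.
index = pd.date_range("2007-01-01", periods=72, freq="60min")
demo = pd.DataFrame({"power": np.arange(72, dtype="float64")}, index=index)
demo.iloc[30, 0] = np.nan  # day 2, 06:00
demo.iloc[54, 0] = np.nan  # day 3, 06:00

result = fill_from_previous_day(demo)
print(result.iloc[[6, 30, 54]])
```

Several passes are needed because a gap longer than one day can only be filled one day at a time. Unlike the loop, this version returns a new frame and leaves its input untouched.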

## Create the `Sub_metering_4` column

The dataset description also states that the active energy consumed every minute (in watt-hours) by household electrical equipment not measured by `Sub_metering_1`, `Sub_metering_2` and `Sub_metering_3` is given by `(global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3)`.

I will create this column.

In [11]:
values = df.values
df['Sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])
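The positional indices above depend on the column order. As a check of the formula, here is a sketch that selects the columns by name and applies it to the first row of the data (4.216 kW, with sub-meterings of 0, 1 and 17 Wh). The one-row frame `row` is built by hand for illustration.

```python
import pandas as pd

# First row of the dataset, copied by hand for illustration.
row = pd.DataFrame({
    "Global_active_power": [4.216],  # kilowatts, averaged over one minute
    "Sub_metering_1": [0.0],         # watt-hours
    "Sub_metering_2": [1.0],
    "Sub_metering_3": [17.0],
})

# Convert kW over one minute to Wh (x 1000 / 60), then remove the metered part.
row["Sub_metering_4"] = (
    row["Global_active_power"] * 1000 / 60
    - row["Sub_metering_1"]
    - row["Sub_metering_2"]
    - row["Sub_metering_3"]
)

print(row["Sub_metering_4"].iloc[0])
```

Numerically, 4.216 * 1000 / 60 is about 70.27 Wh, and subtracting the 18 Wh measured by the three sub-meters leaves about 52.27 Wh.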

In [12]:
df.head()

datetime,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3,Sub_metering_4
2006-12-16 17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0,52.266667
2006-12-16 17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0,72.333333
2006-12-16 17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0,70.566667
2006-12-16 17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0,71.8
2006-12-16 17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0,43.1


# Save the cleaned data

In [13]:
out_dir = '../data/output/'
out_name = 'cleaned_household_power_consumption.csv'
df.to_csv(out_dir + out_name)