# Step 1 - Data Engineering

The climate data for Hawaii is provided through two CSV files. Start by using Python and Pandas to inspect the content of these files and clean the data.

Create a Jupyter Notebook file called data_engineering.ipynb and use this to complete all of your Data Engineering tasks.

In [1]:
# Import dependencies
import pandas as pd
import numpy as np
import csv

Use Pandas to read in the measurement and station CSV files as DataFrames.

In [2]:
# Import CSV files into pandas.
df_measurements = pd.read_csv("hawaii_measurements.csv")
df_stations = pd.read_csv("hawaii_stations.csv")

Inspect the data for NaNs and missing values. You must decide what to do with this data.

In [3]:
# Count non-null values per column; a count below the row total means missing values
df_measurements.count()

station    19550
date       19550
prcp       18103
tobs       19550
dtype: int64
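
`count()` reports non-null entries, so missing values have to be inferred by comparing columns. A minimal sketch of a more direct check uses `isna().sum()`. It assumes a small hypothetical sample frame (`sample`) in place of the real CSV:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for df_measurements (not the real data)
sample = pd.DataFrame({
    "station": ["USC00519397"] * 4,
    "date": ["2010-01-01", "2010-01-02", "2010-01-06", "2010-01-07"],
    "prcp": [0.08, 0.0, np.nan, 0.06],
    "tobs": [65, 63, 73, 70],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = sample[sample.isna().any(axis=1)]
print(rows_with_missing)
```

Applied to the counts above, the same idea gives 19550 - 18103 = 1447 missing `prcp` values and none in the other columns.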

In [4]:
# Double-check the data to see if NaNs exist
df_measurements.head(10)

       station        date  prcp  tobs
0  USC00519397  2010-01-01  0.08    65
1  USC00519397  2010-01-02  0.00    63
2  USC00519397  2010-01-03  0.00    74
3  USC00519397  2010-01-04  0.00    76
4  USC00519397  2010-01-06   NaN    73
5  USC00519397  2010-01-07  0.06    70
6  USC00519397  2010-01-08  0.00    64
7  USC00519397  2010-01-09  0.00    68
8  USC00519397  2010-01-10  0.00    73
9  USC00519397  2010-01-11  0.01    64


In [5]:
# Name the existing index 'id'
df_measurements.index.name = 'id'

In [6]:
# Fill in NaN spots with 0
df_measurements['prcp'] = df_measurements['prcp'].fillna(0)
df_measurements.head()

        station        date  prcp  tobs
id
0   USC00519397  2010-01-01  0.08    65
1   USC00519397  2010-01-02  0.00    63
2   USC00519397  2010-01-03  0.00    74
3   USC00519397  2010-01-04  0.00    76
4   USC00519397  2010-01-06  0.00    73
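
Filling with 0 treats every missing reading as a dry day. That keeps all rows and their `tobs` values, but it may understate rainfall. The sketch below compares it with the alternative of dropping rows that lack `prcp`. It assumes a hypothetical sample frame (`sample`), not the real measurements:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for df_measurements (not the real data)
sample = pd.DataFrame({
    "station": ["USC00519397"] * 4,
    "prcp": [0.08, 0.0, np.nan, 0.06],
    "tobs": [65, 63, 73, 70],
})

# Option 1: keep every row and treat missing precipitation as 0
filled = sample.fillna({"prcp": 0})

# Option 2: discard rows where precipitation was not recorded
dropped = sample.dropna(subset=["prcp"])

# Compare row counts and mean precipitation under each choice
print(len(filled), filled["prcp"].mean())
print(len(dropped), dropped["prcp"].mean())
```

Filling pulls the mean toward zero, while dropping discards temperature observations that were otherwise valid. Which trade-off is acceptable depends on the analysis that follows.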


In [7]:
# Check the next dataframe for null values
df_stations.count()

station      9
name         9
latitude     9
longitude    9
elevation    9
dtype: int64

The stations DataFrame needs no cleaning: every column has 9 non-null entries, so there are no missing values.

Save your cleaned CSV files with the prefix clean_.

In [8]:
# Save the cleaned file
df_measurements.to_csv("clean_hawaii_measurements.csv", encoding="utf-8")
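
Because the index was named `id`, `to_csv` writes it as the first column of the output. A minimal sketch of a round-trip check follows. It assumes a hypothetical two-row frame (`cleaned`) and an in-memory buffer instead of the real file:

```python
import io

import pandas as pd

# Hypothetical cleaned frame standing in for df_measurements
cleaned = pd.DataFrame({
    "station": ["USC00519397", "USC00519397"],
    "date": ["2010-01-01", "2010-01-02"],
    "prcp": [0.08, 0.0],
    "tobs": [65, 63],
})
cleaned.index.name = "id"

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer)
buffer.seek(0)

# Read it back, restoring 'id' as the index
reloaded = pd.read_csv(buffer, index_col="id")

# Confirm no missing values survived the round trip
print(reloaded.isna().sum().sum())
```

Passing `index_col="id"` on reload keeps `id` as the index instead of turning it into an ordinary data column.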