# Step 1 - Data Engineering

The climate data for Hawaii is provided through two CSV files. Start by using Python and Pandas to inspect the content of these files and clean the data.

- Create a Jupyter Notebook file called data_engineering.ipynb and use this to complete all of your Data Engineering tasks.
- Use Pandas to read in the measurement and station CSV files as DataFrames.
- Inspect the data for NaNs and missing values. You must decide what to do with this data.
- Save your cleaned CSV files with the prefix clean_.

In [1]:
# Dependencies
import pandas as pd
import os

In [2]:
# Creating the filepaths
filepath1=os.path.join("..","Resources","hawaii_measurements.csv")
filepath2=os.path.join("..","Resources","hawaii_stations.csv")

In [3]:
# Reading the CSV files into DataFrames
Hawaii_Measurement=pd.read_csv(filepath1)
Hawaii_Stations=pd.read_csv(filepath2)

In [4]:
# Displaying the Hawaii Measurement table to inspect
Hawaii_Measurement.head(10)

Unnamed: 0,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73
5,USC00519397,2010-01-07,0.06,70
6,USC00519397,2010-01-08,0.0,64
7,USC00519397,2010-01-09,0.0,68
8,USC00519397,2010-01-10,0.0,73
9,USC00519397,2010-01-11,0.01,64


In [5]:
# Counting the non-null values in each column of the DataFrame
Hawaii_Measurement.count()

station    19550
date       19550
prcp       18103
tobs       19550
dtype: int64

In [6]:
# prcp has fewer non-null values than the other columns, so inspecting its unique values to confirm that NaN is present
Hawaii_Measurement["prcp"].unique()

array([  8.00000000e-02,   0.00000000e+00,              nan,
         6.00000000e-02,   1.00000000e-02,   4.00000000e-02,
         1.20000000e-01,   3.00000000e-02,   2.00000000e-02,
         4.30000000e-01,   1.70000000e-01,   1.50000000e-01,
         2.70000000e-01,   2.00000000e-01,   5.00000000e-02,
         9.00000000e-02,   7.00000000e-02,   5.70000000e-01,
         3.10000000e-01,   2.30000000e-01,   1.58000000e+00,
         7.70000000e-01,   1.40000000e+00,   1.30000000e-01,
         5.50000000e-01,   2.20000000e-01,   1.29000000e+00,
         1.30000000e+00,   2.90000000e-01,   1.72000000e+00,
         8.20000000e-01,   1.00000000e-01,   8.80000000e-01,
         4.00000000e-01,   1.11000000e+00,   3.60000000e-01,
         2.10000000e+00,   1.10000000e-01,   5.60000000e-01,
         8.90000000e-01,   1.40000000e-01,   1.60000000e+00,
         5.80000000e-01,   2.80000000e-01,   7.00000000e-01,
         1.80000000e-01,   1.03000000e+00,   3.70000000e-01,
         1.08000000e+00,
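
The `unique()` output confirms that `nan` occurs in `prcp`, but not how many rows are affected. A minimal sketch of a more direct check is shown below. It uses a small hypothetical `sample` frame (illustrative values, not the real Hawaii data) and counts missing values per column with `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Hawaii_Measurement; values are illustrative only
sample = pd.DataFrame({
    "station": ["USC00519397"] * 4,
    "date": ["2010-01-01", "2010-01-02", "2010-01-06", "2010-01-07"],
    "prcp": [0.08, 0.0, np.nan, 0.06],
    "tobs": [65, 63, 73, 70],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = sample[sample.isna().any(axis=1)]
print(rows_with_missing)
```

Applied to the real measurement data, the same call should report 19550 - 18103 = 1447 missing `prcp` values, matching the gap in the `count()` output above.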

In [7]:
# Calculating the mean precipitation value to replace the NaN entries in prcp
mean_prcp=Hawaii_Measurement["prcp"].mean()

# Filling the NaNs with the mean value (assigning back avoids the chained inplace call, which newer pandas versions warn about)
Hawaii_Measurement["prcp"]=Hawaii_Measurement["prcp"].fillna(mean_prcp)
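
Filling with the overall mean is only one of several reasonable ways to handle the missing precipitation values. As a rough sketch of the trade-off, the example below compares three options on a hypothetical `toy` frame (made-up stations and values, not the Hawaii data): a global-mean fill, dropping incomplete rows, and a per-station mean fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two stations, each with one missing reading
toy = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B"],
    "prcp": [0.0, 0.2, np.nan, 1.0, np.nan],
})

# Option 1: fill every gap with the overall mean (the approach used in this notebook)
global_fill = toy["prcp"].fillna(toy["prcp"].mean())

# Option 2: drop the rows where prcp is missing
dropped = toy.dropna(subset=["prcp"])

# Option 3: fill each gap with the mean of its own station
station_means = toy.groupby("station")["prcp"].transform("mean")
station_fill = toy["prcp"].fillna(station_means)

print(global_fill.tolist())
print(len(dropped))
print(station_fill.tolist())
```

The global mean keeps every row but pulls dry and wet stations toward the same value. Dropping rows keeps only observed data at the cost of losing the temperature readings on those dates. The per-station fill sits between the two.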

In [8]:
# Checking the per-column non-null counts after filling the NaNs
Hawaii_Measurement.count()

station    19550
date       19550
prcp       19550
tobs       19550
dtype: int64

In [9]:
# Viewing the table after filling the NaNs
Hawaii_Measurement.head(10)

Unnamed: 0,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,0.160644,73
5,USC00519397,2010-01-07,0.06,70
6,USC00519397,2010-01-08,0.0,64
7,USC00519397,2010-01-09,0.0,68
8,USC00519397,2010-01-10,0.0,73
9,USC00519397,2010-01-11,0.01,64


In [10]:
# Viewing the Hawaii Station table
Hawaii_Stations.head(10)

Unnamed: 0,station,name,latitude,longitude,elevation
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0
1,USC00513117,"KANEOHE 838.1, HI US",21.4234,-157.8015,14.6
2,USC00514830,"KUALOA RANCH HEADQUARTERS 886.9, HI US",21.5213,-157.8374,7.0
3,USC00517948,"PEARL CITY, HI US",21.3934,-157.9751,11.9
4,USC00518838,"UPPER WAHIAWA 874.3, HI US",21.4992,-158.0111,306.6
5,USC00519523,"WAIMANALO EXPERIMENTAL FARM, HI US",21.33556,-157.71139,19.5
6,USC00519281,"WAIHEE 837.5, HI US",21.45167,-157.84889,32.9
7,USC00511918,"HONOLULU OBSERVATORY 702.2, HI US",21.3152,-157.9992,0.9
8,USC00516128,"MANOA LYON ARBO 785.2, HI US",21.3331,-157.8025,152.4


In [11]:
# Counting the non-null values in each column of the Hawaii Station DataFrame
Hawaii_Stations.count()

station      9
name         9
latitude     9
longitude    9
elevation    9
dtype: int64

In [12]:
# Saving the cleaned DataFrames as CSV files with the clean_ prefix
outputfilepath1=os.path.join("..","Resources","clean_hawaii_measurements.csv")
outputfilepath2=os.path.join("..","Resources","clean_hawaii_stations.csv")
Hawaii_Measurement.to_csv(outputfilepath1,index=False)
Hawaii_Stations.to_csv(outputfilepath2,index=False)
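
One way to confirm that an export like this worked is to read the saved file back and check it for missing values. The sketch below does this under stated assumptions: it writes to a temporary directory rather than the real `Resources` folder, and the `cleaned` frame and `clean_example.csv` file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame standing in for Hawaii_Measurement
cleaned = pd.DataFrame({
    "station": ["USC00519397", "USC00519397"],
    "date": ["2010-01-01", "2010-01-02"],
    "prcp": [0.08, 0.0],
    "tobs": [65, 63],
})

with tempfile.TemporaryDirectory() as tmpdir:
    # Write without the index, as in the cell above, then read the file back
    out_path = os.path.join(tmpdir, "clean_example.csv")
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The reloaded frame should have no missing values and the same columns
no_missing = not reloaded.isna().any().any()
same_columns = list(reloaded.columns) == list(cleaned.columns)
print(no_missing, same_columns)
```

Passing `index=False` matters here: if the index were written to the file, reading it back with the default settings would add an extra `Unnamed: 0` column.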