
# Data Exploration and Preprocessing

In this tutorial we will use the San Diego Weather dataset included in the "daily_weather.csv" file.
We will explore and then preprocess the dataset, dealing with missing values.


We start by reading the CSV file into a dataframe (adjust the path to wherever the file is stored when you run this).

In [None]:
import numpy as np
import pandas as pd

# SimpleImputer replaces the Imputer class that was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer

# read the CSV into a dataframe
df = pd.read_csv('daily_weather.csv')


In [None]:
#print the contents of the dataframe
df

We observe that some of the cells have missing values, denoted as 'NaN'. Since it's impossible to see how many of those exist by visually exploring the dataset, we instead print out a total count of the non-empty cells for each attribute. 

In [None]:
df.count()

We observe that most of the attributes do indeed have missing values.
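
The missing cells can also be counted directly rather than inferred from the non-empty counts. The sketch below uses a small made-up dataframe (`toy`, with a hypothetical `humidity_9am` column) in place of the real file, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weather data; the values are made up for illustration.
toy = pd.DataFrame({
    'air_temp_9am': [74.8, np.nan, 60.1, 70.2],
    'humidity_9am': [42.4, 24.3, np.nan, np.nan],
})

# isnull() marks each missing cell as True; sum() then counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# For every column, non-missing count + missing count equals the number of rows.
print(toy.count() + missing_per_column)
```

On the real data, the same call would be `df.isnull().sum()`.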

We can also view the summary statistics.

In [None]:
df.describe()

#or for an alternative view...
#df.describe().transpose() 



Time to do something about all these missing values.

Let's look closer at one variable, 'air_temp_9am':
 

In [None]:
df['air_temp_9am'].describe()


One approach is to drop the rows that contain missing values.

(Note that, to avoid modifying the original dataframe, we save the result in a new dataframe, df2.)



In [None]:
df2 = df.dropna(subset=['air_temp_9am'])

Now let's see how many non-missing values remain in each column. The count for 'air_temp_9am' now equals the total number of rows.

In [None]:
df2.count()

Now let's drop every row that has a missing value in any column.

In [None]:
df3 = df.dropna()
df3.count()
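
To see how many rows each strategy removes, it can help to compare row counts directly with `len()`. This is a minimal sketch on a made-up dataframe (`toy`, with a hypothetical `humidity_9am` column), not the real weather file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weather data; the values are made up for illustration.
toy = pd.DataFrame({
    'air_temp_9am': [74.8, np.nan, 60.1, 70.2],
    'humidity_9am': [42.4, 24.3, np.nan, np.nan],
})

rows_before = len(toy)
# Drop only the rows where 'air_temp_9am' is missing.
rows_after_subset = len(toy.dropna(subset=['air_temp_9am']))
# Drop the rows where any column is missing.
rows_after_all = len(toy.dropna())

print(rows_before, rows_after_subset, rows_after_all)
```

Dropping on all columns can never keep more rows than dropping on a single column, which is why the last number is the smallest.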

How was the dataset affected by this? Let's look at the mean and standard deviation of the attributes. 

In [None]:
df3.describe()
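
One way to make the comparison easier to read is to put the before and after statistics side by side. The sketch below does this on a made-up dataframe (`toy`, with a hypothetical `humidity_9am` column); the name `comparison` is only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weather data; the values are made up for illustration.
toy = pd.DataFrame({
    'air_temp_9am': [74.8, np.nan, 60.1, 70.2, 55.0],
    'humidity_9am': [42.4, 24.3, np.nan, 30.0, 80.0],
})
complete = toy.dropna()

# mean() and std() skip NaN, so the "before" columns use the values that are present.
comparison = pd.DataFrame({
    'mean_before': toy.mean(),
    'mean_after': complete.mean(),
    'std_before': toy.std(),
    'std_after': complete.std(),
})
print(comparison)
```

With the real data, `toy` and `complete` would be `df` and `df3`.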

## Imputation

An alternative approach to missing values is to impute them. 

The following snippet demonstrates how to replace missing values, encoded as np.nan, with the mean of the column that contains them.

In [None]:
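# SimpleImputer (scikit-learn >= 0.20) replaces the removed Imputer class and always imputes column-wise
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# rebuild a dataframe with the original column names and index
imputed_DF = pd.DataFrame(imp.fit_transform(df), columns=df.columns, index=df.index)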

Now let's see how much the attributes' statistics changed.

In [None]:
imputed_DF.describe()
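
Mean imputation has a predictable effect on these statistics: each column's mean stays the same, while its standard deviation shrinks because the filled-in cells sit exactly at the mean. The sketch below illustrates this on a made-up dataframe (`toy`, with a hypothetical `humidity_9am` column). It uses the pandas call `fillna(toy.mean())` as a stand-in for the scikit-learn imputer, which is an assumption made to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weather data; the values are made up for illustration.
toy = pd.DataFrame({
    'air_temp_9am': [74.8, np.nan, 60.1, 70.2, 55.0],
    'humidity_9am': [42.4, 24.3, np.nan, 30.0, 80.0],
})

# Fill each missing cell with the mean of its own column.
imputed = toy.fillna(toy.mean())

# The column means are unchanged by mean imputation.
print(np.allclose(imputed.mean(), toy.mean()))

# The standard deviations shrink: more rows, same total squared deviation.
print(toy.std())
print(imputed.std())
```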