## California Housing Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split


**Get the data. Make a copy.**

In [3]:
housing_data = pd.read_csv("housing.csv")
data = housing_data.copy()

**Split the data into test and training sets.**<br/> We also considered splitting the data into training and test sets after the "preprocessing" stage, since that stage drops observations based on certain conditions. We concluded that postponing the split would be more efficient. <br/>This decision is still under consideration.

In [None]:
#data_train, data_test = train_test_split(data, test_size=0.20, random_state=11)
data_train = data  # no split yet: data_train is just another name for data
# TODO: either restore the split above, or rename the variables used before the final split to 'data' instead of 'data_train'
data_train.columns
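
One point that may help with the decision above: any statistic fitted before the split (for example the scaler's mean and standard deviation) is computed from the test rows as well. The sketch below shows the split-first alternative on a small toy frame. The toy values and the names `toy`, `toy_train`, `toy_test` and `toy_scaler` are made up for illustration and are not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the housing data (hypothetical values, not the real CSV).
rng = np.random.default_rng(11)
toy = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=100),
    "housing_median_age": rng.integers(1, 53, size=100),
})

# Split first, so the test rows play no part in any fitted statistic.
toy_train, toy_test = train_test_split(toy, test_size=0.20, random_state=11)

# Fit the scaler on the training rows only, then reuse it on the test rows.
toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)
test_scaled = toy_scaler.transform(toy_test)

print(train_scaled.shape, test_scaled.shape)
```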

## Getting a grasp on our data

In [None]:
data_train.head()

In [None]:
data_train.hist(bins=50,figsize=(20,15))
plt.show()

<br/>**Visualizing high-density areas**<br/>

In [None]:
data_train.plot(kind='scatter',x='longitude',y='latitude',alpha=0.1,figsize=(12,8));

**Looking for correlations**

In [None]:
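# Restrict to numeric columns: recent pandas versions raise an error if the
# text column ocean_proximity is passed to corr()
corr_matrix = data_train.select_dtypes(include='number').corr()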
corr_matrix

In [None]:
corr_matrix['median_house_value'].sort_values(ascending=False)

## Cleaning the data
<br/> Deal with NaN or missing values. <br/> Deal with categorical data.

Do we have any missing values?

In [None]:
data_train.info()

Mariano inspected the CSV file and searched for empty values. We confirmed that total_bedrooms has 207 missing values in the given data set. Let's get rid of these!

In [None]:
data_train = data_train.dropna(axis=0,how='any')
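
The manual count can also be cross-checked in pandas before dropping anything. This is a minimal sketch on a toy frame with made-up values; on the real data the same two calls would be applied to `data_train`.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing total_bedrooms entries (hypothetical values).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
toy_clean = toy.dropna(axis=0, how="any")
print(len(toy), "->", len(toy_clean))
```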

The median_house_value column is capped: every value above the cap is recorded as the cap itself, which can distort our experiment/model/results. Since we cannot reconstruct the true labels behind the capped values, we opted to drop those rows from the training data. The same applies to the capped values of housing_median_age.

In [None]:
data_train['median_house_value'].max()

In [None]:
data_train = data_train.drop(data_train[data_train['median_house_value'] > 500000].index)

In [None]:
data_train['housing_median_age'].max()

In [None]:
data_train = data_train.drop(data_train[data_train['housing_median_age'] > 51].index)

In [None]:
data_train.head()
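
A quick way to judge how much data a cap affects is to count the rows sitting exactly at the column maximum. The sketch below does this on a toy column. The values are made up for illustration, and the approach assumes the cap shows up as the maximum of the column.

```python
import pandas as pd

# Toy column in which three of five rows sit at the cap (hypothetical values).
toy = pd.DataFrame(
    {"median_house_value": [452600.0, 500001.0, 341300.0, 500001.0, 500001.0]}
)

# Assumption: the cap is the largest value present in the column.
cap = toy["median_house_value"].max()
n_at_cap = (toy["median_house_value"] == cap).sum()
share_at_cap = n_at_cap / len(toy)
print(cap, n_at_cap, share_at_cap)

# Keep only the rows strictly below the cap.
toy_uncapped = toy[toy["median_house_value"] < cap]
```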

**One-hot encoding for the ocean_proximity feature**<br/>

In [None]:
data_train['ocean_proximity'].value_counts()

In [None]:
housing_cat = data_train['ocean_proximity']
# factorize() returns integer codes plus the categories in order of first appearance
housing_cat_encoded, housing_categories = housing_cat.factorize()

In [None]:
housing_categories

In [None]:
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))

In [None]:
housing_cat_1hot

In [None]:
housing_cat_1hot = housing_cat_1hot.toarray()
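
As an alternative to the factorize-then-encode route above, `OneHotEncoder` can be fitted on the string column directly, and its `categories_` attribute then supplies the column names. This is a sketch on a toy frame, not what this notebook does. The toy rows, the index values and the name `toy_encoder` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with a non-contiguous index, as left behind by dropping rows.
toy = pd.DataFrame(
    {"ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND"]},
    index=[3, 7, 8, 12],
)

# Fit on the string column itself; the result is sparse, so densify it.
toy_encoder = OneHotEncoder()
onehot = toy_encoder.fit_transform(toy[["ocean_proximity"]]).toarray()

# categories_[0] holds the (sorted) categories, one per output column.
onehot_df = pd.DataFrame(onehot, columns=toy_encoder.categories_[0], index=toy.index)
print(onehot_df)
```

Taking the column names from the encoder avoids hard-coding a category order that may change when rows are dropped or reordered.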

In [None]:
data_train.head()

## Feature Scaling

<br/> We opted for standardization (zero mean and unit variance per column) as opposed to normalization (min-max scaling). <br/>

In [None]:
scaler = StandardScaler()
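# Scale every column except the last one (ocean_proximity).
# Note that this also scales the target, median_house_value.
data_scaled = scaler.fit_transform(data_train.iloc[:,:-1])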
data_scaled
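
For reference, standardization replaces each value x with (x - mean) / std, computed per column. The toy check below compares that formula with `StandardScaler`, which uses the population standard deviation (NumPy's default, `ddof=0`). The array values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One toy column of four values.
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual standardization: subtract the column mean, divide by the column std.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same transformation via scikit-learn.
via_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(manual, via_sklearn))
```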

In [None]:
data_scaled.shape

## Putting the pieces together
<br/> Create a dataframe with the scaled values. <br/> Add the 5 encoded columns to said dataframe. <br/> Check it out roughly to see if it makes sense.

In [None]:
df_scaled = pd.DataFrame(data_scaled)

In [None]:
df_scaled.columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value']

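#data_scaled.columns = (data.iloc[:,:-5]).columns
# The line above fails because data_scaled is a NumPy array, which has no .columns attribute.
# Column names have to be set on the DataFrame instead, e.g. df_scaled.columns = data_train.columns[:-1]
# (iloc[:,:-5] would also select too few columns; only the last column, ocean_proximity, was left out).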

In [None]:
df_scaled.shape

In [None]:
ocean_proximity_cat_1hot = pd.DataFrame(housing_cat_1hot)

In [None]:
# Name column i after category i from factorize(), rather than hard-coding the order
ocean_proximity_cat_1hot = ocean_proximity_cat_1hot.rename(
    columns=dict(enumerate(housing_categories)))

In [None]:
df_scaled = pd.concat([df_scaled, ocean_proximity_cat_1hot], axis=1, sort=False)
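
The concatenation above lines up row by row because both frames were built from NumPy arrays and therefore share a fresh 0..n-1 index. `pd.concat` with `axis=1` aligns on index labels, not on position, so a frame that still carried the filtered index of `data_train` would produce extra rows filled with NaN. The toy sketch below illustrates the difference. The frames `left` and `right` are made up for illustration.

```python
import pandas as pd

# 'left' keeps a gappy index, as a frame would after rows were dropped.
left = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[0, 2, 5])
# 'right' has a fresh 0..2 index, as a frame built from a NumPy array would.
right = pd.DataFrame({"b": [10.0, 20.0, 30.0]})

# Aligned on labels: the union of both indexes is used and gaps become NaN.
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape, int(misaligned.isna().sum().sum()))

# Resetting the index first restores positional, row-by-row alignment.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)
print(aligned.shape, int(aligned.isna().sum().sum()))
```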

In [None]:
df_scaled.head()

In [None]:
#df_scaled = df_scaled.drop(columns=['ocean_proximity'])

In [None]:
df_scaled.hist(bins=50,figsize=(20,15));

In [None]:
df_scaled['ISLAND'].value_counts()

**Where did the other 3 islands go?**<br/>(And can we go there, too?)<br/>They were dropped because their housing_median_age was greater than 51.

## FINALLY.... let's split it up, yo!

In [None]:
df_scaled.shape

In [None]:
data_training, data_testing = train_test_split(df_scaled, test_size=0.20, random_state=11)

In [None]:
data_training.shape

In [None]:
data_testing.shape

**Whew! That feels much better.**