# California Housing Prices 
## Pre-Processing and Training Data Development
### by Anthony Medina

# Pre-Processing and Training Data Development

## Table of Contents
1. Introduction and Notebook Objectives
2. Imports and Loading the data
3. Dealing with Categorical Variables
4. Normalizing and Scaling Data
5. Splitting the data into testing and training sets
6. Conclusions and Further Questions

### 1. Introduction and Notebook Objectives
This notebook prepares a cleaned-up version of the [Kaggle California Housing Prices](https://www.kaggle.com/datasets/camnugent/california-housing-prices) data set for modeling.

The objective is to get the data ready for machine learning.
By the end of this notebook, the categorical variables will have been converted to numeric columns, the numeric features will have been scaled or transformed, and the data will have been split into training and testing sets.

### 2. Imports and Loading the data

In [65]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split

In [66]:
# Import the data from the cleaned data folder
house_data = pd.read_csv('../cleaned_data/ready_for_EDA.csv', index_col = 0)

In [67]:
house_data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880,129,322,126,83252.0,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099,1106,2401,1138,83014.0,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467,190,496,177,72574.0,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274,235,558,219,56431.0,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627,280,565,259,38462.0,342200.0,NEAR BAY


In [68]:
house_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20433 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20433 non-null  float64
 1   latitude            20433 non-null  float64
 2   housing_median_age  20433 non-null  float64
 3   total_rooms         20433 non-null  int64  
 4   total_bedrooms      20433 non-null  int64  
 5   population          20433 non-null  int64  
 6   households          20433 non-null  int64  
 7   median_income       20433 non-null  float64
 8   median_house_value  20433 non-null  float64
 9   ocean_proximity     20433 non-null  object 
dtypes: float64(5), int64(4), object(1)
memory usage: 1.7+ MB


In [69]:
house_data.dtypes

longitude             float64
latitude              float64
housing_median_age    float64
total_rooms             int64
total_bedrooms          int64
population              int64
households              int64
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object

### 3. Dealing with Categorical Variables

In [70]:
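# Ocean Proximity: one-hot encode the only categorical column.
# drop_first=True drops the first category ('<1H OCEAN'), which becomes the baseline.
df = pd.get_dummies(house_data, columns=['ocean_proximity'], drop_first=True)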

In [71]:
df.dtypes

longitude                     float64
latitude                      float64
housing_median_age            float64
total_rooms                     int64
total_bedrooms                  int64
population                      int64
households                      int64
median_income                 float64
median_house_value            float64
ocean_proximity_INLAND          uint8
ocean_proximity_ISLAND          uint8
ocean_proximity_NEAR BAY        uint8
ocean_proximity_NEAR OCEAN      uint8
dtype: object
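To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up four-row frame. The toy values are assumptions for illustration only; the column names are borrowed from the data set. A row whose dummy columns are all zero belongs to the dropped baseline category.

```python
import pandas as pd

# Toy frame with a hypothetical subset of the ocean_proximity categories
toy = pd.DataFrame({
    "ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR BAY", "INLAND"],
    "households": [126, 1138, 177, 219],
})

# drop_first=True drops the first category in sorted order ("<1H OCEAN"),
# which then acts as the implicit baseline
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], drop_first=True)
print(list(encoded.columns))

# Rows with zeros in every dummy column belong to the baseline category
dummy_cols = [c for c in encoded.columns if c.startswith("ocean_proximity_")]
baseline_rows = encoded[dummy_cols].sum(axis=1) == 0
print(baseline_rows.tolist())
```

Dropping one level avoids perfectly collinear dummy columns, which matters mostly for linear models; tree-based models are usually indifferent to it.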

### 4. Normalizing and Scaling Data

In [72]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PowerTransformer

# Note: median_house_value is the target, so it ends up on the [0, 1] scale too
min_max_col = ['housing_median_age', 'median_income', 'median_house_value']
standard_col = ['total_rooms', 'total_bedrooms', 'population', 'households']
log_trans_col = ['longitude', 'latitude']

In [73]:
# MinMaxScaler: rescale each column to the [0, 1] range
scaler = MinMaxScaler()
df[min_max_col] = scaler.fit_transform(df[min_max_col])

In [74]:
# Standardization: zero mean and unit variance per column
scaler = StandardScaler()
df[standard_col] = scaler.fit_transform(df[standard_col])

In [75]:
# PowerTransformer defaults to Yeo-Johnson (not a plain log transform)
# and standardizes the result
log = PowerTransformer()
df[log_trans_col] = log.fit_transform(df[log_trans_col])
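The three transformers used above behave quite differently, so a small side-by-side comparison may help. This is a sketch on a synthetic right-skewed sample (an assumption, not the housing data). Note that `PowerTransformer` defaults to the Yeo-Johnson method, which accepts negative inputs such as longitude; a plain logarithm would not be defined for them.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

# Synthetic right-skewed sample (made up for illustration)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

min_max = MinMaxScaler().fit_transform(x)     # (x - min) / (max - min)
standard = StandardScaler().fit_transform(x)  # (x - mean) / std
power = PowerTransformer().fit_transform(x)   # Yeo-Johnson, then standardize

for name, values in [("min-max", min_max), ("standard", standard), ("power", power)]:
    print(
        f"{name:>8}: min={values.min():.2f} max={values.max():.2f} "
        f"mean={values.mean():.2f} std={values.std():.2f} "
        f"skew={skew(values.ravel()):.2f}"
    )
```

Min-max and standard scaling only shift and rescale a column, so its shape (including any skew) is unchanged. The power transform is the only one of the three that changes the shape of the distribution.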

### 5. Splitting the data into testing and training sets

In [76]:
X = df.drop('median_house_value', axis =1).values
y = df['median_house_value'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 21)
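In this notebook the scalers are fit on the full data set before the split, so statistics from the test rows leak into the training features. A common alternative, sketched below, is to split first and fit the transformers on the training rows only. This is not what the cells above do; the synthetic frame and its values are assumptions, and `ColumnTransformer` is just one convenient way to apply different scalers to different columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the housing frame (values are made up)
rng = np.random.default_rng(21)
toy = pd.DataFrame({
    "housing_median_age": rng.integers(1, 53, size=200).astype(float),
    "total_rooms": rng.integers(100, 8000, size=200).astype(float),
    "median_house_value": rng.uniform(50_000, 500_000, size=200),
})

X = toy.drop(columns="median_house_value")
y = toy["median_house_value"]

# Split first, so the test rows never influence the scaler statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21
)

preprocess = ColumnTransformer([
    ("min_max", MinMaxScaler(), ["housing_median_age"]),
    ("standard", StandardScaler(), ["total_rooms"]),
])

X_train_scaled = preprocess.fit_transform(X_train)  # fit on training rows only
X_test_scaled = preprocess.transform(X_test)        # reuse the training statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```

The target is left unscaled here so that predictions stay in dollars; if it is scaled, the fitted scaler needs to be kept so predictions can be mapped back with `inverse_transform`.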

### 6. Conclusions and Further Questions

1. Did I convert the categorical variables into appropriate numeric types?
2. Did I normalize where I should have scaled instead?
3. How do I decide which method is best for each feature?
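One rough way to start on the last question is to look at the skewness of each feature: strongly skewed columns are candidates for a power transform, while roughly symmetric ones may only need rescaling. The sketch below uses synthetic columns and an arbitrary cut-off of 1.0, both of which are assumptions; it is a heuristic, not a rule.

```python
import numpy as np
import pandas as pd

# Synthetic columns (made up): one roughly symmetric, one right-skewed
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "roughly_symmetric": rng.normal(loc=30, scale=10, size=2000),
    "right_skewed": rng.lognormal(mean=7, sigma=1, size=2000),
})

SKEW_THRESHOLD = 1.0  # arbitrary cut-off, chosen only for illustration

skewness = toy.skew()
suggestion = {
    col: "power transform" if abs(value) > SKEW_THRESHOLD
    else "standard or min-max scaling"
    for col, value in skewness.items()
}

for col, advice in suggestion.items():
    print(f"{col}: skew={skewness[col]:.2f} -> {advice}")
```

Whichever choice looks reasonable on paper, comparing model scores under cross-validation is a more reliable way to decide between preprocessing options.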