# 3. Pre-Processing <a id="data_wrangling"></a>

<a id="contents"></a>
# Table of Contents  
3.1. [Introduction](#introduction) <br>
3.2. [Imports](#imports)  <br>
3.3. [Data Processing](#process)<br>
3.4. [Data Splitting](#split)<br>
3.5. [Save Updated Data](#save)

## 3.1 Introduction<a id="introduction"></a>

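The goal of this notebook is to turn the dataset produced in the EDA step into a cleaned development dataset for the modeling step of my project. The categorical columns are encoded, the data is split into training and testing sets, the features are standardized, and the split data is saved to CSV.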

## 3.2 Imports<a id="imports"></a>

In [1]:
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import seaborn as sns

In [2]:
# Load the dataset
df = pd.read_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Capstone 3/df_eda.csv')
print("Dataset loaded.")

Dataset loaded.


## 3.3 Data Processing<a id="process"></a>

In [3]:
# Encode categorical variables with LabelEncoder for simplicity.
# Each fitted encoder is kept so the integer codes can be mapped back later.
label_encoders = {}
for col in df.select_dtypes(include=["object"]).columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
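
scikit-learn documents `LabelEncoder` as an encoder for target values, and fitting it on the full dataframe means the mapping also sees rows that later land in the test set. A minimal sketch of one alternative, assuming a small made-up dataframe (the values below are illustrative, not taken from the project data): `OrdinalEncoder` encodes several feature columns at once and can map categories it did not see during fitting to a sentinel value.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames standing in for the real training and test data (hypothetical values)
train = pd.DataFrame({"city": ["Austin", "Dallas", "Austin"],
                      "house_type": ["Condo", "House", "House"]})
test = pd.DataFrame({"city": ["Dallas", "Houston"],  # "Houston" never appears in train
                     "house_type": ["Condo", "House"]})

cat_cols = ["city", "house_type"]

# Categories not seen during fit are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training rows only, then reuse the same mapping for the test rows
train_enc = pd.DataFrame(encoder.fit_transform(train[cat_cols]),
                         columns=cat_cols, index=train.index)
test_enc = pd.DataFrame(encoder.transform(test[cat_cols]),
                        columns=cat_cols, index=test.index)

print(test_enc)
```

Like `LabelEncoder`, this assigns integer codes in sorted category order, so it implies an ordering that tree-based models tolerate better than linear ones.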

In [4]:
# Define the target variable column and feature columns
target = 'price'
features = df.columns[df.columns != target]
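
One thing worth checking at this point is whether every feature actually varies. In the standardized summaries further down, `state` has a standard deviation of 0, which suggests it holds a single value and so carries no information for a model. A minimal sketch of how such columns could be detected, using a made-up dataframe (the helper name `constant_columns` and the toy values are hypothetical):

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that contain at most one distinct value."""
    n_unique = frame.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()


# Toy data: "state" is identical in every row, the other columns vary
toy = pd.DataFrame({
    "state": [0, 0, 0, 0],
    "bedrooms": [2, 3, 3, 4],
    "price": [250000, 310000, 305000, 420000],
})

print(constant_columns(toy))
```

Whether to drop a column flagged this way is a modeling decision; the sketch only reports it.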

## 3.4 Data Splitting<a id="split"></a>

In [5]:
# Split the data into training and testing sets
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
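
The row counts reported in the summaries below (30,521 training rows and 7,631 testing rows) follow from `test_size=0.2` applied to 38,152 rows. scikit-learn rounds the test share up, so ceil(0.2 × 38,152) = ceil(7,630.4) = 7,631, and the remaining 30,521 rows go to training. A quick check of that arithmetic on a placeholder array of the same length (the array is a stand-in, not the project data):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 30521 + 7631  # total implied by the two summary tables
placeholder = np.arange(n_rows)

train_part, test_part = train_test_split(placeholder, test_size=0.2, random_state=42)

print(len(train_part), len(test_part))
print(math.ceil(0.2 * n_rows))
```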

In [6]:
# Fit the scaler on the training set only, then reuse it for the testing set
std = StandardScaler()

print('\033[1mStandardization on Training set'.center(120))
Train_X_std = std.fit_transform(X_train)
# Keep the original row index so the scaled features stay aligned with y_train
Train_X_std = pd.DataFrame(Train_X_std, columns=X.columns, index=X_train.index)
display(Train_X_std.describe())

print('\n', '\033[1mStandardization on Testing set'.center(120))
Test_X_std = std.transform(X_test)
Test_X_std = pd.DataFrame(Test_X_std, columns=X.columns, index=X_test.index)
display(Test_X_std.describe())

Standardization on Training set


| | city | street_address | state | zipcode | house_type | bathrooms | bedrooms | yearBuilt | latitude | longitude | sqft | school_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 30521.0 | 30521.0 | 30521.0 | 30521.0 | 30521.0 | 30521.0 | 30521.0 | 30521.0 | 30521.0 | 30521.0 | 30521.0 | 30521.0 |
| mean | 1.096509e-16 | 1.306033e-16 | 0.0 | 3.941381e-16 | 1.839156e-17 | -4.656091e-18 | 7.100538e-17 | 2.900745e-16 | -4.765509e-15 | 2.42736e-14 | 1.210584e-17 | -3.850587e-16 |
| std | 1.000016 | 1.000016 | 0.0 | 1.000016 | 1.000016 | 1.000016 | 1.000016 | 1.000016 | 1.000016 | 1.000016 | 1.000016 | 1.000016 |
| min | -1.361832 | -1.731565 | 0.0 | -1.684084 | -2.343718 | -1.328184 | -1.393687 | -53.40519 | -1.940348 | -2.214108 | -0.2010654 | -1.760053 |
| 25% | -0.9788169 | -0.8657537 | 0.0 | -1.071468 | -0.9436615 | -0.4522164 | -0.4589373 | -0.6555934 | -0.8012181 | -0.9863026 | -0.1068113 | -0.5377261 |
| 50% | -0.4425961 | -0.004678381 | 0.0 | 0.6797071 | -0.0102906 | 0.131762 | -0.4589373 | -0.2404602 | -0.0880879 | 0.1959151 | -0.07053181 | -0.2955106 |
| 75% | 1.395875 | 0.8675998 | 0.0 | 0.998819 | 0.9230803 | 0.131762 | 0.4758124 | 0.6174819 | 0.6648476 | 0.7700175 | -0.01981308 | 0.2950307 |
| max | 1.549081 | 1.731863 | 0.0 | 1.650832 | 1.856451 | 68.45725 | 56.5608 | 6.955183 | 2.252052 | 1.757649 | 71.38852 | 4.405198 |



Standardization on Testing set


| | city | street_address | state | zipcode | house_type | bathrooms | bedrooms | yearBuilt | latitude | longitude | sqft | school_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7631.0 | 7631.0 | 7631.0 | 7631.0 | 7631.0 | 7631.0 | 7631.0 | 7631.0 | 7631.0 | 7631.0 | 7631.0 | 7631.0 |
| mean | -0.012009 | -9.9e-05 | 0.0 | 0.015483 | 0.017719 | -0.02293 | -0.025371 | -0.007907 | -0.001151 | 0.002552 | 0.006812 | -0.002343 |
| std | 0.996944 | 0.999111 | 0.0 | 1.00091 | 0.997708 | 0.875589 | 0.929191 | 1.299746 | 0.999637 | 0.992114 | 0.970562 | 0.985877 |
| min | -1.361832 | -1.731474 | 0.0 | -1.684084 | -2.343718 | -1.036195 | -1.393687 | -53.903352 | -1.939643 | -2.218058 | -0.201065 | -1.760053 |
| 25% | -0.978817 | -0.866801 | 0.0 | -1.071468 | -0.943662 | -0.452216 | -0.458937 | -0.655593 | -0.807167 | -0.96595 | -0.106811 | -0.537726 |
| 50% | -0.442596 | 0.019276 | 0.0 | 0.685617 | -0.010291 | -0.452216 | -0.458937 | -0.24046 | -0.080416 | 0.205351 | -0.070532 | -0.304369 |
| 75% | 1.395875 | 0.859539 | 0.0 | 0.998819 | 0.92308 | 0.131762 | 0.475812 | 0.589806 | 0.681834 | 0.757866 | -0.020648 | 0.295031 |
| max | 1.549081 | 1.731225 | 0.0 | 1.650832 | 1.856451 | 17.067138 | 21.507681 | 58.791488 | 2.241914 | 1.762257 | 49.829874 | 4.405198 |
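
Two details in these summaries are expected rather than errors. The training `std` row shows 1.000016 instead of exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`) while `describe()` reports the sample standard deviation (`ddof=1`); the ratio between the two is sqrt(n / (n - 1)), which for n = 30,521 is about 1.000016. The testing-set means and standard deviations are only close to 0 and 1 because the scaler was fitted on the training set alone. A small sketch on synthetic data (random numbers, not the project data) illustrating the first point:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 30521  # number of training rows in the summary above
rng = np.random.default_rng(0)
synthetic = pd.DataFrame({"feature": rng.normal(loc=50.0, scale=10.0, size=n)})

scaled = pd.DataFrame(StandardScaler().fit_transform(synthetic),
                      columns=synthetic.columns)

sample_std = scaled["feature"].std()            # pandas default, ddof=1
population_std = scaled["feature"].std(ddof=0)  # what StandardScaler normalizes to 1
expected_ratio = np.sqrt(n / (n - 1))

print(round(sample_std, 6), round(population_std, 6), round(expected_ratio, 6))
```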


## 3.5 Save Updated Data<a id="save"></a>

In [7]:
# Save the split data (encoded but unscaled; the standardized frames above are not written out)
X_train.to_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Capstone 3/X_train.csv', index=False)
X_test.to_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Capstone 3/X_test.csv', index=False)
y_train.to_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Capstone 3/y_train.csv', index=False)
y_test.to_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Capstone 3/y_test.csv', index=False)

print("\nPre-processing Complete. Split datasets saved to CSV files.")


Pre-processing Complete. Split datasets saved to CSV files.
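
The absolute paths above only work on one machine. A minimal sketch of an alternative, assuming a hypothetical output directory (`output_dir`, here a temporary folder so the example runs anywhere) and toy frames in place of the real splits: build the file names with `pathlib` and write each split in a loop.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the real splits (hypothetical values)
X_train_toy = pd.DataFrame({"sqft": [1200, 1850], "bedrooms": [2, 3]})
y_train_toy = pd.Series([250000, 410000], name="price")

# Hypothetical location; in the project this could be a folder next to the notebook
output_dir = Path(tempfile.mkdtemp())

splits = {"X_train": X_train_toy, "y_train": y_train_toy}
for name, data in splits.items():
    data.to_csv(output_dir / f"{name}.csv", index=False)

# Read one file back to confirm the round trip preserved the data
reloaded = pd.read_csv(output_dir / "X_train.csv")
print(reloaded.equals(X_train_toy))
```

Keeping the folder in one variable means only a single line changes when the project moves to another machine.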
