**Exploratory Data Analysis**

To start exploring the data, I will first import the libraries I need.

In [1]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error

**House Data Info**

The dataframe that I will be working with is a house data set. The first step is to read the CSV file.

In [2]:
# Read the dataset into a data table using Pandas and show the first 5 rows
house_df = pd.read_csv('Data/house_data_set.csv')
house_df.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_type,garage_sqft,carport_sqft,has_fireplace,has_pool,has_central_heating,has_central_cooling,house_number,street_name,unit_number,city,zip_code,sale_price
0,1978,1,4,1,1,1689,1859,attached,508,0,True,False,True,True,42670,Lopez Crossing,,Hallfort,10907,270897.0
1,1958,1,3,1,1,1984,2002,attached,462,0,True,False,True,True,5194,Gardner Park,,Hallfort,10907,302404.0
2,2002,1,3,2,0,1581,1578,none,0,625,False,False,True,True,4366,Harding Islands,,Lake Christinaport,11203,2519996.0
3,2004,1,4,2,0,1829,2277,attached,479,0,True,False,True,True,3302,Michelle Highway,,Lake Christinaport,11203,197193.0
4,2006,1,4,2,0,1580,1749,attached,430,0,True,False,True,True,582,Jacob Cape,,Lake Christinaport,11203,207897.0


Next, I will get additional information on the dataframe to check for null values and see the data type of each column.

In [3]:
# Basic info on the data
house_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42703 entries, 0 to 42702
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   year_built           42703 non-null  int64  
 1   stories              42703 non-null  int64  
 2   num_bedrooms         42703 non-null  int64  
 3   full_bathrooms       42703 non-null  int64  
 4   half_bathrooms       42703 non-null  int64  
 5   livable_sqft         42703 non-null  int64  
 6   total_sqft           42703 non-null  int64  
 7   garage_type          42703 non-null  object 
 8   garage_sqft          42703 non-null  int64  
 9   carport_sqft         42703 non-null  int64  
 10  has_fireplace        42703 non-null  bool   
 11  has_pool             42703 non-null  bool   
 12  has_central_heating  42703 non-null  bool   
 13  has_central_cooling  42703 non-null  bool   
 14  house_number         42703 non-null  int64  
 15  street_name          42703 non-null 
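
As a complement to `info()`, the null counts can also be read off directly per column. The sketch below uses a small made-up frame (`toy_df` and its values are assumptions for illustration, not rows from the real file) so that it runs on its own; the same two calls should work on `house_df`.

```python
import pandas as pd

# Toy frame standing in for house_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "year_built": [1978, 1958, 2002],
    "unit_number": [None, "4B", None],
    "sale_price": [270897.0, 302404.0, 197193.0],
})

# Count the missing values in each column.
null_counts = toy_df.isnull().sum()
print(null_counts)

# Show the data type pandas inferred for each column.
print(toy_df.dtypes)
```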

**Data Cleaning & Preparation**

In this section, I will clean the house dataframe before drawing any conclusions. Cleaning makes the data easier to inspect and gives a more accurate general understanding of it.

**Step 1:** Check all columns in the dataframe

In [4]:
house_df.columns

Index(['year_built', 'stories', 'num_bedrooms', 'full_bathrooms',
       'half_bathrooms', 'livable_sqft', 'total_sqft', 'garage_type',
       'garage_sqft', 'carport_sqft', 'has_fireplace', 'has_pool',
       'has_central_heating', 'has_central_cooling', 'house_number',
       'street_name', 'unit_number', 'city', 'zip_code', 'sale_price'],
      dtype='object')

Now that I have seen all the columns and the features the dataframe contains, I want to select only the features that are relevant and useful to my model. In this next section I will prepare those features.

Some features will not be useful in my model, so I will delete the columns that I won't need.

The house number isn't useful to include in the model: it's unlikely that anyone buys a house because of the street number assigned to it, which is effectively arbitrary. So I will drop this field from the model.
The same goes for unit number. As for the street name, city, and zip code columns, the location of a house has a big influence on its value, so the model needs at least some of this information. However, these columns provide overlapping information. For example, if we know the zip code of a house, we already know what city it's in, so I don't need both city and zip code in my model; I will keep city and drop zip code. I will also delete street name.
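
The claim that a zip code determines the city can be checked before dropping either column. This is a minimal sketch on a toy frame (`toy_df` is an assumption for illustration; its city and zip code pairs are taken from the first rows shown above). Run against `house_df` before the drop, the same `groupby` would show whether the assumption holds for the full data set.

```python
import pandas as pd

# Toy frame with the city / zip_code pairs seen in the first rows of the data.
toy_df = pd.DataFrame({
    "city": ["Hallfort", "Hallfort", "Lake Christinaport", "Lake Christinaport"],
    "zip_code": [10907, 10907, 11203, 11203],
})

# Number of distinct cities that appear under each zip code.
cities_per_zip = toy_df.groupby("zip_code")["city"].nunique()

# True only if every zip code maps to exactly one city.
zip_determines_city = bool((cities_per_zip == 1).all())
print(zip_determines_city)
```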

In [5]:
# Remove the 4 columns that will not be included in my model
house_df.drop(['house_number', 'street_name', 'unit_number','zip_code'], axis = 1, inplace=True)
house_df.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_type,garage_sqft,carport_sqft,has_fireplace,has_pool,has_central_heating,has_central_cooling,city,sale_price
0,1978,1,4,1,1,1689,1859,attached,508,0,True,False,True,True,Hallfort,270897.0
1,1958,1,3,1,1,1984,2002,attached,462,0,True,False,True,True,Hallfort,302404.0
2,2002,1,3,2,0,1581,1578,none,0,625,False,False,True,True,Lake Christinaport,2519996.0
3,2004,1,4,2,0,1829,2277,attached,479,0,True,False,True,True,Lake Christinaport,197193.0
4,2006,1,4,2,0,1580,1749,attached,430,0,True,False,True,True,Lake Christinaport,207897.0


The columns with True and False values are fine to use in my model as they are: they are treated as one or zero automatically when converted to numbers, so no extra work is needed to prepare them.
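
To illustrate how booleans become ones and zeros, here is a small sketch on made-up data (`toy_df` is an assumption for illustration, not part of the notebook). It shows the conversion explicitly, which is roughly what happens when a frame is turned into a numeric array for a model.

```python
import pandas as pd

# Toy frame with one boolean and one integer column; values are illustrative.
toy_df = pd.DataFrame({
    "has_pool": [True, False, True],
    "livable_sqft": [1689, 1984, 1581],
})

# Explicit cast of a boolean column: True becomes 1 and False becomes 0.
as_int = toy_df["has_pool"].astype(int)
print(as_int.tolist())

# Converting the whole frame to a float array applies the same mapping.
X = toy_df.to_numpy(dtype=float)
print(X)
```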

I will apply one-hot encoding to the columns with categorical values so that I can use them in my model: garage_type and city. One-hot encoding is the process of creating dummy variables for categorical variables. For every distinct value of a categorical feature, a new column is created that marks whether a row has that value.

In [6]:
# Replace categorical data with one-hot encoded data
features_df = pd.get_dummies(house_df, columns=['garage_type', 'city'])
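
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a toy frame (`toy_df` is an assumption for illustration; the garage types are ones that appear in the data above). The original `garage_type` column is replaced by one indicator column per distinct value.

```python
import pandas as pd

# Toy frame with a single categorical column; values are illustrative.
toy_df = pd.DataFrame({
    "garage_type": ["attached", "none", "attached"],
    "sale_price": [270897.0, 2519996.0, 197193.0],
})

# Replace garage_type with one indicator column per distinct value.
encoded = pd.get_dummies(toy_df, columns=["garage_type"])
print(encoded.columns.tolist())
print(encoded)
```

Each new column is named after the original column and the value it stands for, so a row with an attached garage is marked in `garage_type_attached` and not in `garage_type_none`.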