
# Wake County, NC Housing

## Introduction
Wake County is one of the most livable counties in North Carolina. Raleigh, one of the cities in the county, was ranked 2nd (https://realestate.usnews.com/places/rankings/best-places-to-live) among the best places to live in the U.S. in 2021-2022, and its suburbs rank among the best cities in the United States. The county attracts people from around the world who hope to make it their home, and it is also a great place for families with children.

- Because of the many great things about Wake County, I decided to do a comprehensive analysis of its housing market.


### Data Description and Data Processing
The dataset contains residential (single-family) homes in Wake County, NC, with records spanning the 1800s to July 2021. The original data was accessed on 07/20/2021 from the Wake County website (http://www.wakegov.com/tax/realestate/redatafile/Pages/default.aspx). It contained all properties in Wake County, both commercial and residential, with 420,971 rows and 87 variables ranging from owners' personal information to housing information.


### Hypotheses
- What is the average house price in the different cities of Wake County?
- Which cities have the most expensive homes?
- In which months are the most homes sold in Wake County?
- What is the average age of homes?
- What are the common types (design styles) of homes in Wake County?

### Exploratory Analysis
I used the Excel filter tool to narrow the data down to single-family homes and to the variables directly related to property sales. I then loaded the pre-processed data into Python to explore the columns and check for missing values.
- For Year_Remodeled, if no remodel was done, Year_Built was used.
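
The Year_Remodeled fallback was done during pre-processing, but it could also be expressed in pandas. The following is only a sketch on made-up rows, and it assumes that homes without a remodel carry a 0 or a missing value in Year_Remodeled (the raw encoding is not shown in this notebook):

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; not taken from the Wake County file.
toy = pd.DataFrame({
    "Year_Built": [1985, 2004, 1962],
    "Year_Remodeled": [0, 2015, np.nan],  # assumption: 0 or NaN means "never remodeled"
})

# Keep real remodel years; otherwise fall back to the year built.
has_remodel = toy["Year_Remodeled"].fillna(0) > 0
toy["Year_Remodeled"] = (
    toy["Year_Remodeled"].where(has_remodel, toy["Year_Built"]).astype(int)
)
print(toy)
```

`Series.where` keeps values where the condition holds and substitutes the second argument elsewhere, so the fallback needs no row-by-row loop.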


In [None]:
#Importing Libraries and setting the path to data
import pandas as pd
import sklearn as sk
from sklearn import preprocessing
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import tree
import warnings
pd.options.display.float_format = '{:,.2f}'.format

In [None]:
wake_houses = pd.read_csv('../input/wake-county-housing-nc/WakeCountyHousing.csv')
wake_houses.head()

In [None]:
#Looking at the number of rows and columns
wake_houses.shape

In [None]:
#looking at the dtypes
wake_houses.dtypes

In [None]:
# fix the Real_Estate and Physical_zip to be non numeric
wake_houses.Real_Estate_Id = wake_houses.Real_Estate_Id.astype(str)
wake_houses.Physical_Zip = wake_houses.Physical_Zip.astype(str)

In [None]:
#Checking to make sure the Real_Estate and Physical_zip are non numeric
wake_houses.dtypes

In [None]:
#Checking for null values
wake_houses.isnull().sum()

In [None]:
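#Fill nulls with Other for Bath and ALL for Utilities
# Assign the result back: fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas
wake_houses['Bath'] = wake_houses['Bath'].fillna('Other')
wake_houses['Utilities'] = wake_houses['Utilities'].fillna('ALL')
wake_houses.head()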

In [None]:
wake_houses.isnull().sum()

In [None]:
wake_houses = wake_houses.dropna()
wake_houses.head()

In [None]:
#checking to make sure we have no nulls 
wake_houses.isnull().sum()

## Creating New Variables

In [None]:
# creating a function to group the homes by year built/remodeled
def func(x):
    if x <= 1939:
        return '1939 or older'
    elif x <= 1969:
        return '1940-1969'
    elif x <= 1999:
        return '1970-1999'
    else:
        return '2000 or Newer'

In [None]:
wake_houses['Age_Of_Home']=wake_houses['Year_Remodeled'].apply(func)
wake_houses.head()

In [None]:
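# Group these design styles under a single 'other' category (assign back instead of inplace=True)
wake_houses['Design_Style'] = wake_houses['Design_Style'].replace(['Split Foyer','Condo',
       'Contemporary', 'Modular', 'Colonial', 'Conversion',
       'Log', 'Other', 'Cape', 'Duplex', 'Manuf Multi'],'other')
wake_houses.head()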

In [None]:
#Evaluating Numeric Distributions
wake_houses.describe()

In [None]:
wake_houses.describe(percentiles=[0.01,0.05,0.1,0.25,0.75,0.9,0.95,0.99])

From the above, we can see several issues that need to be addressed:

1. Deeded_Acreage ranges from 0.00 to 307.91
2. Year_Remodeled ranges from 0 to 2209
3. Total_Sale_Price has a minimum of 0 (no house can be sold for $0)
4. Heated_Area has a minimum of 0 (no home has 0 square feet)
5. Year_Built has a maximum of 2022

*These outliers will be fixed by filtering them out.
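
The cut-offs used for this kind of filtering can be read from the percentiles rather than typed by hand. The following is a rough sketch on synthetic prices; the data, the 1%/99% choice, and the names `demo` and `trimmed` are illustrative assumptions, not the values used in this notebook:

```python
import numpy as np
import pandas as pd

# Synthetic sale prices for illustration only (not the county data).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"Total_Sale_Price": rng.lognormal(mean=12.3, sigma=0.6, size=1000)}
)
demo.loc[:4, "Total_Sale_Price"] = 0  # five invalid $0 sales

# Read the cut-offs from the 1st and 99th percentiles.
low, high = demo["Total_Sale_Price"].quantile([0.01, 0.99])

# Strict inequalities keep only rows between the two cut-offs.
mask = (demo["Total_Sale_Price"] > low) & (demo["Total_Sale_Price"] < high)
trimmed = demo[mask].copy()
print(len(demo), "->", len(trimmed))
```

Deriving the thresholds from `quantile` keeps the filter in step with the data if the file is refreshed, at the cost of less readable round numbers.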

In [None]:
# Remove the bottom 5% (0.03 acres or less) and values above the 99th percentile (2.06 acres or more) of Deeded_Acreage
filter1 = (wake_houses.Deeded_Acreage > 0.03) & (wake_houses.Deeded_Acreage < 2.06)

# Remove the extreme low and high values of Year_Remodeled (keep years strictly between 1930 and 2021)
filter2 = (wake_houses.Year_Remodeled > 1930) & (wake_houses.Year_Remodeled < 2021)

# Remove the bottom 1% ($13,000 or less) and sales of $900,000 or more
filter3 = (wake_houses.Total_Sale_Price > 13000) & (wake_houses.Total_Sale_Price < 900000)

# Keep houses between 820 and 5451 square feet
filter4 = (wake_houses.Heated_Area > 820) & (wake_houses.Heated_Area < 5451)

# Keep homes built before 2021 (drops the Year_Built 2022 records)
filter5 = wake_houses.Year_Built < 2021


In [None]:
# Use the filters to create a new dataframe
wake_houses2 = wake_houses[filter1 & filter2 & filter3 & filter4 & filter5].copy() 

In [None]:
# Checking to make sure the outliers have been removed
wake_houses2.describe(percentiles=[0.01,0.05,0.1,0.25,0.75,0.9,0.95,0.99])

In [None]:
wake_houses2.shape

## Visualization

In [None]:
wake_houses2.hist(column= ['Heated_Area','Deeded_Acreage'])

In [None]:
wake_houses2.hist(column= ['Year_of_Sale','Year_Remodeled'])

### Histogram Interpretation 

- Most homes are built on lots between 0.2 and 0.5 acres
- Most Wake County homes are between 1,500 and 2,500 square feet
- Most homes in Wake County were built in the 2000s or later
- The number of homes increases as the years go by

In [None]:
#Checking the relationship between Heated_Area and Total_Sale_Price
wake_houses2.plot.scatter(x='Heated_Area', y='Total_Sale_Price', s=0.01, c = wake_houses2.Year_Built,cmap='viridis')

- There is a positive relationship between the total sale price and the heated area.


In [None]:
#check Year_Built distribution by boxplots
wake_houses2.boxplot(column= ['Year_Built'], vert = False)

In [None]:
#check heated_Area distribution by boxplots
wake_houses2.boxplot(column= ['Heated_Area'], vert = False)

### Interpretation
- Houses built before 1960 are outliers
- Houses above 4,500 square feet are considered outliers, which means houses of this size are uncommon in Wake County
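
A boxplot with default settings marks as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below computes those limits directly; the helper name `boxplot_fences` and the sample areas are hypothetical and only illustrate the rule:

```python
import pandas as pd

def boxplot_fences(values: pd.Series, whis: float = 1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

# Made-up heated areas in square feet, for illustration only.
areas = pd.Series([1200, 1500, 1800, 2000, 2200, 2500, 2800, 5200])
lower, upper = boxplot_fences(areas)
outliers = areas[(areas < lower) | (areas > upper)]
print(lower, upper, outliers.tolist())
```

Applied to a column such as `wake_houses2['Heated_Area']`, the same function would give the numeric thresholds behind the plot.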


In [None]:
wake_houses3 = wake_houses2.groupby(['Year_Built']).size().reset_index(name='House_Count')
wake_houses3.head()

In [None]:
#Time series of house count and year built
wake_houses3.plot(x='Year_Built',y='House_Count',figsize=(20,6),linestyle='-',color='b',title='Count of Houses Built')

In [None]:
wake_houses4 = wake_houses2.groupby(['Year_of_Sale']).size().reset_index(name='House_Count')
wake_houses4.head()

In [None]:
#Time series of house count and year Sold
wake_houses4.plot(x='Year_of_Sale',y='House_Count',figsize=(20,6),linestyle='-',color='b', title='Count of Houses Sold')

- In this time series, the trend turns downward in 2020, which can be attributed to COVID-19 and the business shutdowns that affected the economy.
- The time series show a relationship between the number of houses built and the number of houses sold in a year
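
One way to put a number on the built-versus-sold relationship is to align the two yearly counts and compute their correlation. This is a sketch with made-up counts; with the tables in this notebook, the Year_Built and Year_of_Sale columns would first need a common name such as the hypothetical `Year` used here:

```python
import pandas as pd

# Made-up yearly counts for illustration only.
built = pd.DataFrame({"Year": [2016, 2017, 2018, 2019, 2020],
                      "Built_Count": [900, 950, 1000, 1100, 700]})
sold = pd.DataFrame({"Year": [2016, 2017, 2018, 2019, 2020],
                     "Sold_Count": [2000, 2100, 2300, 2400, 1500]})

# Align the two tables on the calendar year, then measure how they move together.
by_year = built.merge(sold, on="Year", how="inner")
r = by_year["Built_Count"].corr(by_year["Sold_Count"])
print(by_year)
print(f"Pearson r = {r:.2f}")
```

A high correlation would support the visual impression, though it does not by itself show that one series drives the other.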

#### Checking the correlation of variables with the Total_Sale_Price 

In [None]:
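# numeric_only=True skips text columns, which otherwise raise an error in pandas 2.x
wake_houses2.corr(numeric_only=True)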

#### Heated_Area has the highest correlation with Total_Sale_Price
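
To read the ranking without scanning the full matrix, the Total_Sale_Price column can be pulled out and sorted. This is a minimal sketch on made-up rows, assuming pandas 1.5 or newer for the `numeric_only` argument:

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Heated_Area": [1200, 1800, 2400, 3000, 3600],
    "Deeded_Acreage": [0.30, 0.20, 0.50, 0.25, 0.40],
    "Total_Sale_Price": [150000, 210000, 290000, 350000, 420000],
    "Physical_City": ["Raleigh", "Cary", "Apex", "Raleigh", "Cary"],  # text, skipped below
})

# Correlate numeric columns only, then keep the Total_Sale_Price column of the matrix.
price_corr = (
    sample.corr(numeric_only=True)["Total_Sale_Price"]
    .drop("Total_Sale_Price")
    .sort_values(ascending=False)
)
print(price_corr)
```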

In [None]:
#Average by city
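# numeric_only=True averages only the numeric columns (text columns raise an error in pandas 2.x)
wake_houses2.groupby(['Physical_City']).mean(numeric_only=True)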

### Interpretation
- The most expensive city in Wake County is New Hill, with an average Total_Sale_Price of 426,149, and the cheapest is Youngsville, with an average Total_Sale_Price of 126,197. The most expensive Physical_City is about 3.4 times the price of the cheapest one. (Total_Sale_Price is in USD.)
- New Hill also has large houses (an average of 2,870 square feet). A deeper look suggests its homes are newer (mostly built in 2016/17), which explains why the city has pricey houses.
- Youngsville's homes are older and smaller, which explains the low average Total_Sale_Price. The area can be described as country/farm because of its large deeded acreage.
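
The gap between the two cities can be checked with a short calculation. The sketch below uses only the two averages quoted in this interpretation; the Series name `avg_price` is illustrative:

```python
import pandas as pd

# The two city averages quoted in the interpretation (USD).
avg_price = pd.Series({"New Hill": 426149, "Youngsville": 126197})

# Ratio of the highest average price to the lowest.
ratio = avg_price.max() / avg_price.min()
print(f"{avg_price.idxmax()} is about {ratio:.1f}x {avg_price.idxmin()}")
```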

In [None]:
wake_houses5 = wake_houses2.groupby(['Physical_City'])["Total_Sale_Price"].mean().reset_index(name='House_Average_Price_by_City')
wake_houses5.head()

In [None]:
wake_houses5 = wake_houses5.sort_values(by='House_Average_Price_by_City', ignore_index=True, ascending = False)

In [None]:
wake_houses5.plot.bar(x='Physical_City', y= 'House_Average_Price_by_City', title = 'Average House Price by City',legend ='upper left' )

In [None]:
wake_houses6 = wake_houses2.groupby(['Design_Style'])["Total_Sale_Price"].mean().reset_index(name='House_Average_Price_by_Design_Style') 
wake_houses6.head()

- Conventional homes average $287,509, followed by Townhouses at $233,384. The cheapest homes in Wake County are Ranch style, at an average price of $150,828.

In [None]:
wake_houses6 = wake_houses6.sort_values(by='House_Average_Price_by_Design_Style', ignore_index=True, ascending = False)

In [None]:
wake_houses6.plot.bar(x = 'Design_Style', y='House_Average_Price_by_Design_Style', title = 'Average House Price by Design Style',legend ='upper left' )

In [None]:
wake_houses7 = wake_houses2.groupby(['Year_of_Sale','Month_Year_of_Sale'])["Total_Sale_Price"].size().reset_index(name='House_Count')
wake_houses7.head()

In [None]:
#wake_houses7['Month_Year_of_Sale']=pd.to_datetime(wake_houses7['Month_Year_of_Sale'])
#wake_houses7.sort_values(by=['Month_Year_of_Sale'], inplace=True, ascending=False)

In [None]:
wake_houses8 = wake_houses7[wake_houses7.Year_of_Sale > 2016].sort_index()
wake_houses8.head()

In [None]:
wake_houses8.plot.bar(x = 'Month_Year_of_Sale', y='House_Count', title = 'Monthly Trends',legend ='upper left',figsize=(20,6))

In [None]:
wake_houses9 = wake_houses2.groupby(['Design_Style'])["Heated_Area"].mean()
wake_houses9.head()

In [None]:
wake_houses9.plot.bar(x = 'Design_Style', y='Heated_Area', title = 'Heated_Area  Vs. Design Style',legend ='upper left',figsize=(10,5))