# Data Science in Python - Assignment 2

Student Number: 20757091

### In this assignment I will perform four Tasks involving Historic House Sales Data
#### 1. Data Collection and Initial Characterisation
#### 2. Time Series Analysis
#### 3. Correlation and Regression
#### 4. Classification

## Step 1: Data Collection and Initial Characterisation

In this section I will first scrape all relevant house data from the unique webpage given to me. I will parse the page to extract the relevant data, performing any necessary pre-processing and cleansing steps along the way. After this I will load the data into a DataFrame and perform an initial characterisation of the data.

In [None]:
# importing necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import urllib.request
import bs4
from datetime import datetime
import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import classification_report
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

I will first download relevant data from the webpage

In [None]:
#given link to HTML page containing house data
link = "http://mlg.ucd.ie/modules/COMP30760/assign2/20757091.html"
response = urllib.request.urlopen(link)
html = response.read().decode()

In [None]:
# brief overview of data
lines = html.strip().split("\n")
for l in lines:
    print(l)

By looking at our downloaded data, we can see all relevant data we need is under the class "sales". I will now parse the data by this attribute using Beautiful Soup

In [None]:
parser = bs4.BeautifulSoup(html, "html.parser")
for match in parser.find_all(attrs={'class':'sales'}):
    text = match.get_text()
    print(text)

We can see all relevant data is shown after filtering by the class sales. I will now perform a number of steps to process and clean the data before placing it in a DataFrame

Our data is split underneath the headings of Date of Sale, Price, Location, Year built, Size and Description. I will create lists to help sort the data according to these headings

In [None]:
# creating lists for each relevant heading to store data
Date_of_sale = []
Price = []
Location = []
Year_built = []
Size = []
Description = []

# using the modulo operator and incrementing a variable to split the data into the relevant six headings
i = 0
for match in parser.find_all(attrs={'class':'sales'}):
    for data_piece in match:
        text = data_piece.get_text()
        
        if(i % 6 == 0):
            Date_of_sale.append(text)
        if(i % 6 == 1):
            Price.append(text)
        if(i % 6 == 2):
            Location.append(text)
        if(i % 6 == 3):
            Year_built.append(text)
        if(i % 6 == 4):
            Size.append(text)
        if(i % 6 == 5):
            Description.append(text)
        
        i+=1

I will now look at a brief overview of our created lists

In [None]:
for i in range(3):
    print(Date_of_sale[i])
    print(Price[i])
    print(Location[i])
    print(Year_built[i])
    print(Size[i])
    print(Description[i] + "\n")

All of our lists appear in the correct format, apart from Description, which holds several different attributes under a single heading. These will be split up accordingly and each given their own list. The attributes to be extracted are House Type, Style (Stories), Bedrooms and Bathrooms

In [None]:
# examination of Description list
for i in range(0,30):
    print(Description[i])

As we can see from above, not only is the data split into separate attributes under the Description heading, but the attributes are not always in the same order! This will require further processing to place each piece of data under the correct heading
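
One way to cope with the varying order is to route each `;`-separated segment by keyword rather than by position. The sketch below is only an illustration: the function name `parse_description` is hypothetical, and the sample strings assume segments of the form `Type: ...`, `Style: ...-Story`, `N Bedrooms` and `N Bathrooms`, which may not match the real page exactly.

```python
import re

def parse_description(description):
    """Route each ';'-separated segment to a field by keyword, ignoring order."""
    fields = {}
    for segment in description.split(";"):
        segment = segment.strip()
        if "Type" in segment:
            # keep the text after the colon
            fields["type"] = segment.split(":", 1)[1].strip()
        elif "Style" in segment:
            # keep only the leading number, e.g. 1.5 from "1.5-Story"
            fields["stories"] = float(re.search(r"\d+(?:\.\d+)?", segment).group())
        elif "Bedroom" in segment:
            fields["bedrooms"] = int(re.search(r"\d+", segment).group())
        elif "Bathroom" in segment:
            fields["bathrooms"] = int(re.search(r"\d+", segment).group())
    return fields

# hypothetical samples holding the same attributes in a different order
first = parse_description("Type: Detached; Style: 1.5-Story; 3 Bedrooms; 2 Bathrooms")
second = parse_description("2 Bathrooms; Type: Detached; 3 Bedrooms; Style: 1.5-Story")
```

Both calls produce the same dictionary, which is the property the keyword-based approach relies on.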

In [None]:
# start of filtering of Description column
# Due to the column not always having the same structure we will split up the data based on keywords
# This raw data will then be further processed in the next step

# lists to hold raw data
raw_type = []
raw_Story = []
raw_Bedrooms = []
raw_Bathrooms = []

# looping through Description column and placing data into correct list based off keywords
for description in Description:
    # only the first four ";"-separated parts hold the attributes of interest
    for part in description.split(";")[:4]:
        if "Type" in part:
            raw_type.append(part)
        if "Style" in part:
            raw_Story.append(part)
        if "Bedroom" in part:
            raw_Bedrooms.append(part)
        if "Bathroom" in part:
            raw_Bathrooms.append(part)

We now have our data in the correct lists so can process each further and isolate necessary items from them

In [None]:
# lists to hold improved data
Type = []
Story = []
Bedrooms = []
Bathrooms = []

# looping through length of Description column and extracting only relevant data which is appended to our lists
for i in range(0, len(Description)):
    
    split_raw_type = raw_type[i]
    split = split_raw_type.split(";")
    split_type = split[0].split(":")
    type_result = split_type[1].replace(" ", "")
    Type.append(type_result)
    
    split_story = raw_Story[i].split(":")
    type_split_story = split_story[1]
    type_split_story = type_split_story.split("-")
    type_split_story = type_split_story[0]
    story_result = type_split_story.replace(" ", "")
    Story.append(story_result)
    
    
    split_bedroom = raw_Bedrooms[i]
    split_bedroom = split_bedroom.replace(" ", "")
    split_bedroom = split_bedroom.split("B")
    bedroom_result = split_bedroom[0]
    Bedrooms.append(bedroom_result)
    
    
    split_bathroom = raw_Bathrooms[i]
    split_bathroom = split_bathroom.replace(" ", "")
    split_bathroom = split_bathroom.split("B")
    bathroom_result = split_bathroom[0]
    Bathrooms.append(bathroom_result)

In [None]:
# checking data is as required
for i in range(0,5):
    print("Type: " + str(Type[i]))
    print("Story: " + str(Story[i]))
    print("Bedrooms: " + str(Bedrooms[i]))
    print("Bathrooms: " + str(Bathrooms[i]) + "\n")

Description Data is now expanded into different sections and cleansed. I will now begin cleaning and examining different columns.

#### Columns to be cleansed: Date of Sale, Price, Location, Year Built,  Size

1. Date of Sale

In [None]:
NEW_Date_of_sale = []

# some entries had random commas, these are removed with the step below and placed in a new list
for i in range(0, len(Date_of_sale)):
    NEW_Date_of_sale.append(Date_of_sale[i].replace(",",""))

2. Price

In [None]:
# lists to house edited data
Price_without_euro = []
Price_final = []

# loops to remove euro signs and commas
for i in range(0, len(Price)):
    Price_without_euro.append(Price[i].replace("€",""))
    
for i in range(0, len(Price_without_euro)):
    Price_final.append(Price_without_euro[i].replace(",",""))

3. Location & 4. Year Built - These columns appeared to be fine from initial checks so they will remain unchanged for now

5. Size

In [None]:
Size_without_sqft = []
Size_without_commas = []
Size_final = []

# size column had two different versions of sqft so both must be removed - along with commas
# chaining the replacements guarantees exactly one entry per house, even if a unit is missing
for i in range(0, len(Size)):
    Size_without_sqft.append(Size[i].replace("sq ft","").replace("sqft",""))
        
for i in range(0, len(Size_without_sqft)):
    Size_without_commas.append(Size_without_sqft[i].replace(",",""))
    
for i in range(0, len(Size_without_commas)):
    Size_final.append(Size_without_commas[i].replace(" ",""))

Our data has been cleansed and is now in a more useable format. This will now be uploaded to a DataFrame
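
The Price and Size cleaning steps above both come down to removing everything that is not part of the number. As a compact sketch of that idea, the hypothetical helper `to_number` below strips every character except digits and decimal points with a regular expression. The sample strings are made up for illustration and only assume the euro sign, comma and `sq ft` / `sqft` patterns described above.

```python
import re

def to_number(raw):
    """Drop every character that is not a digit or a decimal point, then convert."""
    return float(re.sub(r"[^\d.]", "", raw))

# hypothetical raw values in the styles described above
price = to_number("€325,000")
size_a = to_number("1,450 sq ft")
size_b = to_number("1450sqft")
```

A single helper like this handles both spellings of the unit, since neither contains a digit.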

In [None]:
# creating DataFrame
House_DF = pd.DataFrame()

In [None]:
# adding our data to dataframe
House_DF["Date of Sale"] = NEW_Date_of_sale
House_DF["Price"] = Price_final
House_DF["Location"] = Location
House_DF["Year Built"] = Year_built
House_DF["Size (in Sq Ft)"] = Size_final
House_DF["House Type"] = Type
House_DF["Story no."] = Story
House_DF["Bedrooms no."] = Bedrooms
House_DF["Bathrooms no."] = Bathrooms

In [None]:
# quick look at dataframe
House_DF.head()

In [None]:
# current shape of dataframe with 945 rows and 9 columns
House_DF.shape

Our Data has been scraped and placed into a DataFrame. This is good but I will now create further columns based off this data that can hopefully be used later to help us examine the housing data

In [None]:
# creating new columns to isolate day, month and year of house sale

# lists for each day, month and year from each Date of Sale values
day = []
month = []
year = []

# splitting data into day, month and year while also appending to relevant lists
for i in range(0, len(NEW_Date_of_sale)):
    data = NEW_Date_of_sale[i].split(" ")
    
    # removing only the leading 0 from the day value, so that 10, 20 and 30 stay intact
    data_day = data[0].lstrip("0")
    day.append(data_day)
    month.append(data[1])
    year.append(data[2])

I will also upload the month as a number from 1-12 alongside its string representation to dataframe

In [None]:
# list to house number representation of month
months_num = []

# lookup from month abbreviation to its number
month_numbers = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
                 "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

for entry in month:
    months_num.append(month_numbers[entry])

In [None]:
# uploading to DataFrame
House_DF["Day of Sale"] = day
House_DF["Month of Sale"] = month
House_DF["Month of Sale (Number)"] = months_num
House_DF["Year of Sale"] = year

In [None]:
House_DF.head()

Next I will also split the Date of Sale Column into relevant quarters and also quarters with their respective years
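
As an alternative sketch, the day, month, year and quarter can all be derived from a single parsed date instead of from string comparisons. The helper name `describe_sale_date` is hypothetical, and the format string assumes dates written as day, abbreviated English month name and year (for example `10 Aug 2017`), which matches how the dates are split on spaces above.

```python
from datetime import datetime

def describe_sale_date(raw):
    """Parse a 'DD Mon YYYY' string and derive its calendar parts."""
    # stray commas are removed first, as in the cleaning step above
    parsed = datetime.strptime(raw.replace(",", ""), "%d %b %Y")
    # months 1-3 map to quarter 1, months 4-6 to quarter 2, and so on
    quarter = (parsed.month - 1) // 3 + 1
    return {
        "day": parsed.day,
        "month": parsed.month,
        "year": parsed.year,
        "quarter": f"Q{quarter}",
        "quarter_year": f"Q{quarter} {parsed.year}",
    }

# hypothetical sample containing a stray comma
sample = describe_sale_date("10 Aug, 2017")
```

Because the day is parsed as a number, a value such as 10 keeps its trailing zero.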

In [None]:
quarters = []

# loop to append the correct value to lists quarters based on result
for mon in month:
    if (mon == "Jan" or mon == "Feb" or mon == "Mar"):
        quarters.append("Q1")
    if (mon == "Apr" or mon == "May" or mon == "Jun"):
        quarters.append("Q2")
    if (mon == "Jul" or mon == "Aug" or mon == "Sep"):
        quarters.append("Q3")
    if (mon == "Oct" or mon == "Nov" or mon == "Dec"):
        quarters.append("Q4")

In [None]:
House_DF["Quarter of Sale"] = quarters

In [None]:
month_year = []

# extracting only month and year from Date of Sale (ignoring day)
for sale in NEW_Date_of_sale:
    data = sale.split(" ")
    string = data[1] + " " + data[2]
    month_year.append(string)

In [None]:
quarters_and_year = []

# pairing each sale's quarter with its year, e.g. "Q1 2016"
for quarter, sale_year in zip(quarters, year):
    quarters_and_year.append(quarter + " " + sale_year)

In [None]:
House_DF["Quarter & Year of Sale"] = quarters_and_year

In [None]:
House_DF.head()

In [None]:
# current shape of dataframe with 945 rows and 15 columns (6 new columns were created from existing data)
House_DF.shape

In [None]:
# changing data types of numeric columns
House_DF["Price"] = House_DF["Price"].astype(str).astype(float)
House_DF["Year Built"] = House_DF["Year Built"].astype(str).astype(int)
House_DF["Size (in Sq Ft)"] = House_DF["Size (in Sq Ft)"].astype(str).astype(int)
House_DF["Story no."] = House_DF["Story no."].astype(str).astype(float)
House_DF["Bedrooms no."] = House_DF["Bedrooms no."].astype(str).astype(int)
House_DF["Bathrooms no."] = House_DF["Bathrooms no."].astype(str).astype(int)
House_DF["Day of Sale"] = House_DF["Day of Sale"].astype(str).astype(int)
House_DF["Month of Sale (Number)"] = House_DF["Month of Sale (Number)"].astype(str).astype(int)
House_DF["Year of Sale"] = House_DF["Year of Sale"].astype(str).astype(int)

In [None]:
House_DF.dtypes

### I will now begin basic initial characterisation of the data under each of the following headings:
- Date of Sale              
- Price                   
- Location                
- Year Built                
- Size (in Sq Ft)           
- House Type                 
- Story no.                 
- Bedrooms no.                
- Bathrooms no.              
- Day of Sale                 
- Month of Sale                    
- Year of Sale              
- Quarter of Sale            
- Quarter & Year of Sale   

**Note:** No Month of Sale (Number) as for this step it acts the same as Month of Sale

For each feature, I will look at the range of the data and the mean value. I will also graph the data and note my interpretations

#### 1. Date of Sale

A graph did not supply useful information due to the large number of unique values, so I will examine this feature without one

In [None]:
# ordering by number of houses sold per day
House_DF["Date of Sale"].value_counts()

In [None]:
print("Range of Data [%d - %d]" % (House_DF["Date of Sale"].value_counts().min(),  House_DF["Date of Sale"].value_counts().max()))
print("Mean of Data: %.2f" % House_DF["Date of Sale"].value_counts().mean())

Of course this is only dates on which a house was sold, so realistically there were multiple days of the four years where no house was sold

In [None]:
# number of unique dates
House_DF["Date of Sale"].nunique()

Out of 945 observations, only 500 unique dates were observed. This helps to explain our mean of 1.89 sales per day
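
The mean of a `value_counts()` result is simply the number of observations divided by the number of distinct values. The short sketch below illustrates this with a made-up list of dates (the dates themselves are hypothetical) and then applies the same arithmetic to the figures reported above.

```python
from collections import Counter

# hypothetical miniature of the "Date of Sale" column
sale_dates = ["01 Jun 2016", "01 Jun 2016", "02 Jun 2016",
              "05 Jul 2016", "05 Jul 2016", "05 Jul 2016"]

# sales per distinct date, the plain-Python counterpart of value_counts()
counts = Counter(sale_dates)

# mean of the counts = total observations / number of distinct dates
mean_sales_per_date = sum(counts.values()) / len(counts)

# the same identity applied to the reported figures: 945 sales on 500 dates
reported_mean = 945 / 500
```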

###### *Key Points:*
- Large number of unique dates (500) with a decent range
- All of our top 5 dates for sales fell between the beginning and the end of summer, which could signal a trend
- More information about potential trends will be seen when looking at a monthly/quarterly level along with useful graphs

#### 2. Price

In [None]:
ax = House_DF["Price"].hist(bins=10, figsize=(12,6), color='green', grid=False, rwidth=0.9)
plt.title("Histogram of number of houses per Price point")
plt.ticklabel_format(style='plain')
plt.ylabel("Number of Houses", fontsize=13)
plt.xlabel("Price per House (in Euro)", fontsize=13);

In [None]:
print("Range of Data [%d - %d]" % (House_DF["Price"].min(),  House_DF["Price"].max()))
print("Mean of Data: %d" % House_DF["Price"].mean())

###### *Key Points:*
- Most house prices come in the range of 150,000 - 700,000
- Very few entries come over 900,000
- The most filled histogram bin is in the range of around 250,000 - 400,000
- The three points above could signal that people prioritise affordable housing over other features

#### 3. Location

In [None]:
plt.figure(figsize=(11,8))
House_DF["Location"].value_counts(ascending=True).plot(kind="bar")
plt.title("Bar Chart of number of houses per Location")
plt.xlabel("Locations in our Dataframe", fontsize=13)
plt.ylabel("Number of houses sold", fontsize=13)
plt.show()

In [None]:
House_DF["Location"].value_counts()

In [None]:
print("Range of Data [%d - %d]" % (House_DF["Location"].value_counts().min(),  House_DF["Location"].value_counts().max()))
print("Mean of Data: %d" % House_DF["Location"].value_counts().mean())

###### *Key Points:*
- 7 unique locations with a wide spread of values
- The top location (Oakbrook) has almost the same number of houses sold as the bottom three locations combined
- As we are not also factoring in house pricing, we are unable to determine the attribute which makes Oakbrook so desirable
- Could be due to affordable house prices, or just quantity of available houses in the area

#### 4. Year Built

In [None]:
ax = House_DF["Year Built"].hist(bins=10, figsize=(12,6), color='green', grid=False, rwidth=0.9)
plt.ticklabel_format(style='plain')
plt.title("Histogram of number of houses sold by Year they were built")
plt.xlabel("Year Built", fontsize=13);
plt.ylabel("Number of Houses", fontsize=13)
plt.show()

In [None]:
House_DF["Year Built"].value_counts()

In [None]:
print("Range of Data [%d - %d]" % (House_DF["Year Built"].value_counts().min(),  House_DF["Year Built"].value_counts().max()))
print("Mean of Data: %d" % House_DF["Year Built"].value_counts().mean())

###### *Key Points:*
- The number of houses sold is greatly biased towards newer houses
- The top 5 years for sales based on the year the house was built all came in the range of [2010 - 2018]
- This could be due to the wish for more modern designs
- The mean number of sales per build year is dragged down considerably by the sparsely represented years between 1880 and 1950

#### 5. Size (in Sq Ft)

In [None]:
House_DF["Size (in Sq Ft)"].value_counts()

In [None]:
ax = House_DF["Size (in Sq Ft)"].hist(bins=10, figsize=(12,6), color='green', grid=False, rwidth=0.9)
plt.ticklabel_format(style='plain')
plt.title("Histogram of number of houses sold by their Size")
plt.xlabel("Size of House (in Sq Ft)", fontsize=13);
plt.ylabel("Number of Houses", fontsize=13)
plt.show()

In [None]:
print("Range of Data [%d - %d]" % (House_DF["Size (in Sq Ft)"].min(),  House_DF["Size (in Sq Ft)"].max()))
print("Mean sales per distinct size: %.2f" % House_DF["Size (in Sq Ft)"].value_counts().mean())
print("Mean size: %.2f" % House_DF["Size (in Sq Ft)"].mean())

###### *Key Points:*
- Large number of unique values (704)
- Most data comes in the range [1000-1800]
- Each distinct size accounts for 1.34 sales on average, while the average house size is 1475.0
- Very few data points exceed 2500, although a few extend up to ~3500

#### 6. House Type

In [None]:
House_DF["House Type"].value_counts()

In [None]:
plt.figure(figsize=(11,8))
House_DF["House Type"].value_counts(ascending=True).plot(kind="bar")
plt.title("Bar chart of number of houses sold by House Type")
plt.xlabel("Type of House", fontsize=13);
plt.ylabel("Number of Houses", fontsize=13)
plt.show()

In [None]:
print("Range of Data [%d - %d]" % (House_DF["House Type"].value_counts().min(),  House_DF["House Type"].value_counts().max()))
print("Mean of Data: %d" % House_DF["House Type"].value_counts().mean())

###### *Key Points:*
- Detached is by far the best-selling House Type
- In fact, it accounts for more sales than every other House Type combined
- The other 5 House Types all fail to break 100 sales, yet the Detached type brings the mean up to 157

#### 7. Story no. 

In [None]:
House_DF["Story no."].value_counts()

In [None]:
plt.figure(figsize=(11,8))
House_DF["Story no."].value_counts().sort_index().plot(kind="bar")
plt.title("Bar chart of number of houses sold by Story number")
plt.xlabel("Number of stories in house", fontsize=13);
plt.ylabel("Number of Houses", fontsize=13)
plt.show()

In [None]:
print("Range of Data [%d - %d]" % (House_DF["Story no."].value_counts().min(),  House_DF["Story no."].value_counts().max()))
print("Mean of Data: %d" % House_DF["Story no."].value_counts().mean())

###### *Key Points:*
- Only three possible values for this feature, with 1.0 stories being the clear favourite
- The low count for 1.5 may reflect a lack of houses with this attribute rather than buyers disliking it
- It still greatly lowers the mean of the counts

#### 8. Bedrooms no.

In [None]:
House_DF["Bedrooms no."].value_counts()

In [None]:
plt.figure(figsize=(11,8))
House_DF["Bedrooms no."].value_counts().sort_index().plot(kind="bar")
plt.title("Bar chart of number of houses sold by Bedroom number")
plt.xlabel("Number of bedrooms in house", fontsize=13);
plt.ylabel("Number of Houses", fontsize=13)
plt.show()

In [None]:
print("Range of Data [%d - %d]" % (House_DF["Bedrooms no."].value_counts().min(),  House_DF["Bedrooms no."].value_counts().max()))
print("Mean of Data: %d" % House_DF["Bedrooms no."].value_counts().mean())

###### *Key Points:*
- This graph is close to a symmetric distribution
- Values of 1 and 5 are very unpopular
- Values 2 and 4 are closer to the mean, but 3 is still the clear favourite
- The mean count is 189, which only values 2 and 3 exceed

#### 9. Bathrooms no.

In [None]:
House_DF["Bathrooms no."].value_counts()

In [None]:
plt.figure(figsize=(11,8))
House_DF["Bathrooms no."].value_counts().sort_index().plot(kind="bar")
plt.title("Bar chart of number of houses sold by Bathroom number")
plt.xlabel("Number of bathrooms in house", fontsize=13);
plt.ylabel("Number of Houses", fontsize=13)
plt.show()

In [None]:
print("Range of Data [%d - %d]" % (House_DF["Bathrooms no."].value_counts().min(),  House_DF["Bathrooms no."].value_counts().max()))
print("Mean of Data: %d" % House_DF["Bathrooms no."].value_counts().mean())

###### *Key Points:*
- Only 3 distinct values for this feature, with counts of 410, 508 and 27 respectively
- The count of 27 is the main reason the mean drops to 315
- Shows that having 3 bathrooms is not popular
- Alternatively, 3 bathrooms may be linked with more expensive properties, which would explain the low number of sales

#### 10. Day of Sale

In [None]:
House_DF["Day of Sale"].value_counts().head()

In [None]:
plt.figure(figsize=(17,8))
House_DF["Day of Sale"].value_counts().sort_index().plot(kind="bar")
plt.title("Bar chart of number of houses sold per Day")
plt.xlabel("Days", fontsize=13);
plt.ylabel("Number of Houses sold", fontsize=13)
plt.show()

In [None]:
print("Range of Data [%d - %d]" % (House_DF["Day of Sale"].value_counts().min(),  House_DF["Day of Sale"].value_counts().max()))
print("Mean of Data: %.2f" % House_DF["Day of Sale"].value_counts().mean())

###### *Key Points:*
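- The mean of the data is 33.75, yet the mean for days 1-3 is ~70
- On the surface this suggests that the start of a month is more popular
- However, a mean of 33.75 implies only 28 distinct day values (945 / 28), so days 10, 20 and 30 appear to have been merged into days 1, 2 and 3 during cleaning; the peak at days 1-3 is therefore likely an artefact rather than a real preference
- Outside of this, the other days follow a similar pattern close to the mean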

#### 11. Month of Sale

In [None]:
House_DF["Month of Sale (Number)"].value_counts()

In [None]:
plt.figure(figsize=(17,8))
House_DF["Month of Sale (Number)"].value_counts().sort_index().plot(kind="bar")
plt.title("Bar chart of number of houses sold per Month")
plt.xlabel("Months", fontsize=13);
plt.ylabel("Number of Houses sold", fontsize=13)
plt.show()

In [None]:
print("Range of Data [%d - %d]" % (House_DF["Month of Sale"].value_counts().min(),  House_DF["Month of Sale"].value_counts().max()))
print("Mean of Data: %.2f" % House_DF["Month of Sale"].value_counts().mean())

###### *Key Points:*
- This graph confirms the trend first spotted when looking at feature one, Date of Sale
- Summer months are clearly the most popular for buying a house
- This graph also follows a roughly symmetric distribution, with the start and end of each year being less popular among buyers

#### 12. Quarter of Sale

In [None]:
House_DF["Quarter of Sale"].value_counts()

In [None]:
plt.figure(figsize=(17,8))
House_DF["Quarter of Sale"].value_counts().sort_index().plot(kind="bar")
plt.title("Bar chart of number of houses sold per Quarter")
plt.xlabel("Quarters", fontsize=13);
plt.ylabel("Number of Houses sold", fontsize=13)
plt.show()

In [None]:
print("Range of Data [%d - %d]" % (House_DF["Quarter of Sale"].value_counts().min(),  House_DF["Quarter of Sale"].value_counts().max()))
print("Mean of Data: %.2f" % House_DF["Quarter of Sale"].value_counts().mean())

###### *Key Points:*
- Similar to the graph above, shows the popularity of buying a house in the middle of the year

#### 13. Year of Sale

In [None]:
House_DF["Year of Sale"].value_counts()

In [None]:
plt.figure(figsize=(17,8))
House_DF["Year of Sale"].value_counts().sort_index().plot(kind="bar")
plt.title("Bar chart of number of houses sold per Year")
plt.xlabel("Years", fontsize=13);
plt.ylabel("Number of Houses sold", fontsize=13)
plt.show()

In [None]:
print("Range of Data [%d - %d]" % (House_DF["Year of Sale"].value_counts().min(),  House_DF["Year of Sale"].value_counts().max()))
print("Mean of Data: %.2f" % House_DF["Year of Sale"].value_counts().mean())

###### *Key Points:*
- This graph shows house sales increasing from year to year
- This could be a sign of an increased need for housing in recent years

#### 14. Quarter & Year of Sale

In [None]:
House_DF["Quarter & Year of Sale"].value_counts()

In [None]:
plt.figure(figsize=(17,8))
House_DF["Quarter & Year of Sale"].value_counts().sort_index().plot(kind="bar")
plt.title("Bar chart of number of houses sold per Quarter & Year")
plt.xlabel("Quarters & Years", fontsize=13);
plt.ylabel("Number of Houses sold", fontsize=13)
plt.show()

In [None]:
print("Range of Data [%d - %d]" % (House_DF["Quarter & Year of Sale"].value_counts().min(),  House_DF["Quarter & Year of Sale"].value_counts().max()))
print("Mean of Data: %.2f" % House_DF["Quarter & Year of Sale"].value_counts().mean())

###### *Key Points:*
- This graph combines the values of Quarter and Year of Sale
- Reinforces the two points previously made:
  1. Summer months are more popular for buyers
  2. House sales have increased steadily over the four years covered

# 2. Time Series Analysis

#### a) Construct a time series from the data, representing the number of house sales per day. Visualise this series at daily, monthly, and quarterly frequencies. Discuss how the number of sales is changing over time.

For this question relating to time series analysis, each entry in our dataframe needs to be in datetime form. I will start by creating a new Datetime column from our existing data
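
As a rough sketch of the overall idea, pandas can assemble dates directly from separate year, month and day columns and then count sales per date. The small DataFrame below is a hypothetical stand-in for the corresponding columns of `House_DF`; the column names `year`, `month` and `day` are chosen because `pd.to_datetime` recognises them. Grouping by calendar month via `to_period("M")` is used here as one possible way to aggregate; the cells that follow use `resample` instead.

```python
import pandas as pd

# hypothetical stand-in for the day / month / year columns of House_DF
parts = pd.DataFrame({
    "year": [2016, 2016, 2016, 2016],
    "month": [1, 1, 2, 2],
    "day": [10, 10, 3, 20],
})

# pd.to_datetime assembles dates from columns named year, month and day
sale_dates = pd.to_datetime(parts)

# one entry per distinct date, sorted chronologically
daily_sales = sale_dates.value_counts().sort_index()

# monthly totals, grouped by calendar month
monthly_sales = daily_sales.groupby(daily_sales.index.to_period("M")).sum()
```

Sorting by the index keeps the dates and their counts aligned in time order, which matters when the series is drawn as a line plot.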

In [None]:
House_DF.head()

In [None]:
# placing data from dataframe into related lists
days = House_DF["Day of Sale"]
months = House_DF["Month of Sale (Number)"]
years = House_DF["Year of Sale"]

In [None]:
# placing datetime values from our dataframe into list date_list
date_list = [datetime(y, m, d) for y, m, d in zip(years, months, days)]

In [None]:
# checking our data is as required
date_list

In [None]:
# uploading this list to our dataframe
House_DF["Datetime"] = date_list

In [None]:
House_DF.head()

##### I will start by looking at this data in a daily frequency

In [None]:
# counting sales per date and sorting chronologically so dates and counts stay aligned
dates_with_counts = House_DF["Datetime"].value_counts().sort_index()

In [None]:
dates_unique = dates_with_counts.index

In [None]:
counts = dates_with_counts.tolist()

In [None]:
plt.figure(figsize=(20,6))
plt.plot(dates_unique, counts)
plt.title("Time Series Analysis of House Sales at a daily frequency")
plt.xlabel("Time in days", fontsize = 13)
plt.ylabel("Number of houses sold", fontsize = 13)
plt.show()

It is hard to take much from this graph due to the large number of unique days. However, there are clear spikes in the data, with most days falling in the range of 1-3 sales

##### I will now look at this data in a monthly frequency

In [None]:
ts = pd.Series(counts, index=dates_unique)

In [None]:
# changing time series to monthly frequency
time_series_monthly = ts.resample("M").sum()

In [None]:
plt.figure(figsize=(20,6))
plt.plot(time_series_monthly, marker = "x", markersize=10)
plt.title("Time Series Analysis of House Sales at a monthly frequency")
plt.xlabel("Time in months", fontsize = 13)
plt.ylabel("Number of houses sold", fontsize = 13)
plt.show()

We can see a clear trend of sales peaking in the summer months, regardless of the year. There are also clear slumps at the start of each year

##### I will now look at this data in a quarterly frequency

In [None]:
# changing time series to quarterly frequency
time_series_quarterly = ts.resample("Q").sum()

In [None]:
plt.figure(figsize=(20,6))
plt.plot(time_series_quarterly, marker = "x", markersize=10)
plt.title("Time Series Analysis of House Sales at a quarterly frequency")
plt.xlabel("Time in quarters", fontsize = 13)
plt.ylabel("Number of houses sold", fontsize = 13)
plt.show()

The results are similar to our analysis at a monthly frequency, with clear peaks in the summer months and slumps in the winter months

#### b) Construct another time series from the data, showing how the overall average monthly sale price of houses is changing over time. Discuss the trends in this series.

In [None]:
# isolating datetime and price columns
df = pd.DataFrame(House_DF["Datetime"])
df["Price"] = House_DF["Price"]

In [None]:
print(df.groupby('Datetime').sum())

In [None]:
daily_prices = df.groupby('Datetime').sum()

In [None]:
# averaging the daily price totals within each month
# note: this is the mean of the per-day sums, not the mean over individual sales
new_monthly_prices = daily_prices.resample("M").mean()

In [None]:
# average house prices per month
new_monthly_prices

In [None]:
plt.figure(figsize=(20,6))
plt.plot(new_monthly_prices, marker="o", markersize=10)
plt.title("Time Series Analysis of average House Sale Prices at a monthly frequency")
plt.xlabel("Time in months", fontsize = 13)
plt.ylabel("Average sale prices of houses per month", fontsize = 13)
plt.show()

The graph shows a wide range of values with many drops and peaks. The series hits its lowest point in late 2017 but immediately recovers to around the average the following month. The highest three-month stretch came in mid 2019
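
One point worth keeping in mind when reading this series: the cells above sum the prices per day before averaging per month, which gives the mean of the daily totals rather than the mean price of an individual sale. The sketch below uses three made-up sales (all values are hypothetical) to show how the two quantities can differ.

```python
import pandas as pd

# hypothetical sales: two on 1 March and one on 15 March
sales = pd.DataFrame({
    "Datetime": pd.to_datetime(["2016-03-01", "2016-03-01", "2016-03-15"]),
    "Price": [200000.0, 400000.0, 300000.0],
})
month = sales["Datetime"].dt.to_period("M")

# mean over the individual sales in the month
mean_per_sale = sales.groupby(month)["Price"].mean()

# mean of the daily totals: sum per day first, then average the days
daily_totals = sales.groupby("Datetime")["Price"].sum()
mean_of_daily_totals = daily_totals.groupby(daily_totals.index.to_period("M")).mean()
```

In this toy case the per-sale mean is 300,000 while the mean of the daily totals is 450,000, because the day with two sales contributes a total of 600,000. Months with many multi-sale days would be inflated in the same way.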

#### c) For each unique location in the data, construct a separate time series representing the average monthly price of houses sold in that location. Compare and discuss the differences between the trends across the locations.

I will start by creating a separate dataframe containing just Datetime, House Prices and Locations

In [None]:
D_P_Ldf = pd.DataFrame(House_DF["Datetime"])
D_P_Ldf["Location"] = House_DF["Location"]
D_P_Ldf["Price"] = House_DF["Price"]

In [None]:
D_P_Ldf.head()

In [None]:
D_P_Ldf["Location"].value_counts()

I will now go through each of the locations shown above one by one and perform a time series analysis representing the average monthly house price for that location
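
For a side-by-side comparison, the same monthly averages can also be built for all locations in one table. The sketch below uses a few invented rows with the same three column names; the `Month` column is a helper I add for illustration.

```python
import pandas as pd

# Hypothetical rows with the same three columns as the DataFrame above
sales = pd.DataFrame({
    "Datetime": pd.to_datetime(
        ["2016-01-10", "2016-01-25", "2016-02-03", "2016-02-14", "2016-02-20"]
    ),
    "Location": ["Oakbrook", "Avoca", "Oakbrook", "Oakbrook", "Avoca"],
    "Price": [300_000, 450_000, 320_000, 340_000, 470_000],
})

# One row per month, one column per location, mean sale price in each cell
sales["Month"] = sales["Datetime"].dt.to_period("M")
monthly_by_location = sales.pivot_table(
    index="Month", columns="Location", values="Price", aggfunc="mean"
)
print(monthly_by_location)
```

Each column of the resulting table is one location's time series, so the locations can be drawn on a single set of axes or summarised with `describe()`.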

#### 1. Oakbrook

In [None]:
# isolating specific location
Oakbrook=D_P_Ldf.loc[D_P_Ldf['Location'] == "Oakbrook"]
# keeping one row per sale so the monthly mean averages individual sale prices
Oakbrook = Oakbrook.set_index('Datetime')[['Price']]

In [None]:
new_monthly_prices_Oakbrook = Oakbrook.resample("M").mean()

In [None]:
prices = []
for i in range(len(new_monthly_prices_Oakbrook)):
    prices.append(new_monthly_prices_Oakbrook.iloc[i][0])

In [None]:
# finding months with sales and without sales
months_with_sales = 0
months_without_sales = 0
j = 0
for i in range(len(prices)):
    if(prices[i] > 0):
        j+=1
months_with_sales = j
months_without_sales = len(prices) - j

In [None]:
data = {'Months with house sales':months_with_sales, 'Months without house sales':months_without_sales}
courses = list(data.keys())
values = list(data.values())
  
fig = plt.figure(figsize = (5, 5))
 
# creating the bar plot
plt.bar(courses, values, color ='maroon',
        width = 0.8)
plt.title("Bar chart of Months with sales vs. Months with no sales")
plt.show()
print("Total months: " + str(months_with_sales + months_without_sales))
print("Months with House sales: " + str(months_with_sales))
print("Months without House sales: " + str(months_without_sales))

In the output above, the total month count is 48 because the first sale in this location was in January 2016. This will not be the case for every location: if the first sale came in March 2016, for example, the total would be 46. The number of months without house sales is reported to explain the gaps in the time series plots below: the higher the value, the more gaps in the plot
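
The link between months without sales and gaps in a plot comes from missing values: a month with no sales has no mean price, so it appears as NaN and the line is broken there. Here is a small sketch on invented data that makes this explicit by grouping on calendar month and then reindexing over the full span, which is my own way of showing the step.

```python
import pandas as pd

# Hypothetical sales for one location; no sale in February or April 2016
sales = pd.DataFrame(
    {"Price": [300_000, 500_000, 420_000, 610_000]},
    index=pd.to_datetime(["2016-01-10", "2016-01-25", "2016-03-05", "2016-05-20"]),
)

# Mean price per month, then reindex so that empty months appear as NaN
by_month = sales["Price"].groupby(sales.index.to_period("M")).mean()
full_range = pd.period_range(by_month.index.min(), by_month.index.max(), freq="M")
monthly_avg = by_month.reindex(full_range)

months_with_sales = int(monthly_avg.notna().sum())
months_without_sales = int(monthly_avg.isna().sum())
print(months_with_sales, months_without_sales)  # 3 2
```

Counting with `notna()` and `isna()` gives the same two numbers as the loop over prices, since a comparison of NaN with 0 is always False.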

In [None]:
plt.figure(figsize=(20,6))
plt.plot(new_monthly_prices_Oakbrook, marker="o", markersize=10)
plt.title("Time Series Analysis of average house price in Oakbrook at a monthly frequency")
plt.xlabel("Time in months", fontsize = 13)
plt.ylabel("Average sale price of houses per month", fontsize = 13)
plt.ylim(150000, 1850000)
plt.show()

#### 2. Brookville

In [None]:
# isolating specific location
Brookville=D_P_Ldf.loc[D_P_Ldf['Location'] == "Brookville"]
# keeping one row per sale so the monthly mean averages individual sale prices
Brookville = Brookville.set_index('Datetime')[['Price']]

In [None]:
new_monthly_prices_Brookville = Brookville.resample("M").mean()

In [None]:
prices = []
for i in range(len(new_monthly_prices_Brookville)):
    prices.append(new_monthly_prices_Brookville.iloc[i][0])

In [None]:
months_with_sales = 0
months_without_sales = 0
j = 0
for i in range(len(prices)):
    if(prices[i] > 0):
        j+=1
months_with_sales = j
months_without_sales = len(prices) - j

In [None]:
data = {'Months with house sales':months_with_sales, 'Months without house sales':months_without_sales}
courses = list(data.keys())
values = list(data.values())
  
fig = plt.figure(figsize = (5, 5))
 
# creating the bar plot
plt.bar(courses, values, color ='maroon',
        width = 0.8)
plt.title("Bar chart of Months with sales vs. Months with no sales")
plt.show()
print("Total months: " + str(months_with_sales + months_without_sales))
print("Months with House sales: " + str(months_with_sales))
print("Months without House sales: " + str(months_without_sales))

In [None]:
plt.figure(figsize=(20,6))
plt.plot(new_monthly_prices_Brookville, marker="o", markersize=10)
plt.title("Time Series Analysis of average house price in Brookville at a monthly frequency")
plt.xlabel("Time in months", fontsize = 13)
plt.ylabel("Average sale price of houses per month", fontsize = 13)
# same y-axis limits as the other locations so that the plots are comparable
plt.ylim(150000, 1850000)
plt.show()

#### 3. Rivermont

In [None]:
# isolating specific location
Rivermont=D_P_Ldf.loc[D_P_Ldf['Location'] == "Rivermont"]
# keeping one row per sale so the monthly mean averages individual sale prices
Rivermont = Rivermont.set_index('Datetime')[['Price']]

In [None]:
new_monthly_prices_Rivermont = Rivermont.resample("M").mean()

In [None]:
prices = []
for i in range(len(new_monthly_prices_Rivermont)):
    prices.append(new_monthly_prices_Rivermont.iloc[i][0])

In [None]:
months_with_sales = 0
months_without_sales = 0
j = 0
for i in range(len(prices)):
    if(prices[i] > 0):
        j+=1
months_with_sales = j
months_without_sales = len(prices) - j

In [None]:
data = {'Months with house sales':months_with_sales, 'Months without house sales':months_without_sales}
courses = list(data.keys())
values = list(data.values())
  
fig = plt.figure(figsize = (5, 5))
 
# creating the bar plot
plt.bar(courses, values, color ='maroon',
        width = 0.8)
plt.title("Bar chart of Months with sales vs. Months with no sales")
plt.show()
print("Total months: " + str(months_with_sales + months_without_sales))
print("Months with House sales: " + str(months_with_sales))
print("Months without House sales: " + str(months_without_sales))

In [None]:
plt.figure(figsize=(20,6))
plt.plot(new_monthly_prices_Rivermont, marker="o", markersize=10)
plt.title("Time Series Analysis of average house price in Rivermont at a monthly frequency")
plt.xlabel("Time in months", fontsize = 13)
plt.ylabel("Average sale price of houses per month", fontsize = 13)
plt.ylim(150000, 1850000)
plt.show()

#### 4. East End

In [None]:
# isolating specific location
East_End=D_P_Ldf.loc[D_P_Ldf['Location'] == "East End"]
# keeping one row per sale so the monthly mean averages individual sale prices
East_End = East_End.set_index('Datetime')[['Price']]

In [None]:
new_monthly_prices_East_End = East_End.resample("M").mean()

In [None]:
prices = []
for i in range(len(new_monthly_prices_East_End)):
    prices.append(new_monthly_prices_East_End.iloc[i][0])

In [None]:
months_with_sales = 0
months_without_sales = 0
j = 0
for i in range(len(prices)):
    if(prices[i] > 0):
        j+=1
months_with_sales = j
months_without_sales = len(prices) - j

In [None]:
data = {'Months with house sales':months_with_sales, 'Months without house sales':months_without_sales}
courses = list(data.keys())
values = list(data.values())
  
fig = plt.figure(figsize = (5, 5))
 
# creating the bar plot
plt.bar(courses, values, color ='maroon',
        width = 0.8)
plt.title("Bar chart of Months with sales vs. Months with no sales")
plt.show()
print("Total months: " + str(months_with_sales + months_without_sales))
print("Months with House sales: " + str(months_with_sales))
print("Months without House sales: " + str(months_without_sales))

In [None]:
plt.figure(figsize=(20,6))
plt.plot(new_monthly_prices_East_End, marker="o", markersize=10)
plt.title("Time Series Analysis of average house price in East End at a monthly frequency")
plt.xlabel("Time in months", fontsize = 13)
plt.ylabel("Average sale price of houses per month", fontsize = 13)
plt.ylim(150000, 1850000)
plt.show()

#### 5. West End

In [None]:
# isolating specific location
West_End=D_P_Ldf.loc[D_P_Ldf['Location'] == "West End"]
# keeping one row per sale so the monthly mean averages individual sale prices
West_End = West_End.set_index('Datetime')[['Price']]

In [None]:
new_monthly_prices_West_End = West_End.resample("M").mean()

In [None]:
prices = []
for i in range(len(new_monthly_prices_West_End)):
    prices.append(new_monthly_prices_West_End.iloc[i][0])

In [None]:
months_with_sales = 0
months_without_sales = 0
j = 0
for i in range(len(prices)):
    if(prices[i] > 0):
        j+=1
months_with_sales = j
months_without_sales = len(prices) - j

In [None]:
data = {'Months with house sales':months_with_sales, 'Months without house sales':months_without_sales}
courses = list(data.keys())
values = list(data.values())
  
fig = plt.figure(figsize = (5, 5))
 
# creating the bar plot
plt.bar(courses, values, color ='maroon',
        width = 0.8)
plt.title("Bar chart of Months with sales vs. Months with no sales")
plt.show()
print("Total months: " + str(months_with_sales + months_without_sales))
print("Months with House sales: " + str(months_with_sales))
print("Months without House sales: " + str(months_without_sales))

In [None]:
plt.figure(figsize=(20,6))
plt.plot(new_monthly_prices_West_End, marker="o", markersize=10)
plt.title("Time Series Analysis of average house price in West End at a monthly frequency")
plt.xlabel("Time in months", fontsize = 13)
plt.ylabel("Average sale price of houses per month", fontsize = 13)
plt.ylim(150000, 1850000)
plt.show()

#### 6. Avoca

In [None]:
# isolating specific location
Avoca=D_P_Ldf.loc[D_P_Ldf['Location'] == "Avoca"]
# keeping one row per sale so the monthly mean averages individual sale prices
Avoca = Avoca.set_index('Datetime')[['Price']]

In [None]:
new_monthly_prices_Avoca = Avoca.resample("M").mean()

In [None]:
prices = []
for i in range(len(new_monthly_prices_Avoca)):
    prices.append(new_monthly_prices_Avoca.iloc[i][0])

In [None]:
months_with_sales = 0
months_without_sales = 0
j = 0
for i in range(len(prices)):
    if(prices[i] > 0):
        j+=1
months_with_sales = j
months_without_sales = len(prices) - j

In [None]:
data = {'Months with house sales':months_with_sales, 'Months without house sales':months_without_sales}
courses = list(data.keys())
values = list(data.values())
  
fig = plt.figure(figsize = (5, 5))
 
# creating the bar plot
plt.bar(courses, values, color ='maroon',
        width = 0.8)
plt.title("Bar chart of Months with sales vs. Months with no sales")
plt.show()
print("Total months: " + str(months_with_sales + months_without_sales))
print("Months with House sales: " + str(months_with_sales))
print("Months without House sales: " + str(months_without_sales))

In [None]:
plt.figure(figsize=(20,6))
plt.plot(new_monthly_prices_Avoca, marker="o", markersize=10)
plt.title("Time Series Analysis of average house price in Avoca at a monthly frequency")
plt.xlabel("Time in months", fontsize = 13)
plt.ylabel("Average sale price of houses per month", fontsize = 13)
plt.ylim(150000, 1850000)
plt.show()

#### 7. Beacon Hill

In [None]:
# isolating specific location
Beacon_Hill=D_P_Ldf.loc[D_P_Ldf['Location'] == "Beacon Hill"]
# keeping one row per sale so the monthly mean averages individual sale prices
Beacon_Hill = Beacon_Hill.set_index('Datetime')[['Price']]

In [None]:
new_monthly_prices_Beacon_Hill = Beacon_Hill.resample("M").mean()

In [None]:
prices = []
for i in range(len(new_monthly_prices_Beacon_Hill)):
    prices.append(new_monthly_prices_Beacon_Hill.iloc[i][0])

In [None]:
months_with_sales = 0
months_without_sales = 0
j = 0
for i in range(len(prices)):
    if(prices[i] > 0):
        j+=1
months_with_sales = j
months_without_sales = len(prices) - j

In [None]:
data = {'Months with house sales':months_with_sales, 'Months without house sales':months_without_sales}
courses = list(data.keys())
values = list(data.values())
  
fig = plt.figure(figsize = (5, 5))
 
# creating the bar plot
plt.bar(courses, values, color ='maroon',
        width = 0.8)
plt.title("Bar chart of Months with sales vs. Months with no sales")
plt.show()
print("Total months: " + str(months_with_sales + months_without_sales))
print("Months with House sales: " + str(months_with_sales))
print("Months without House sales: " + str(months_without_sales))

In [None]:
plt.figure(figsize=(20,6))
plt.plot(new_monthly_prices_Beacon_Hill, marker="o", markersize=10)
plt.title("Time Series Analysis of average house price in Beacon Hill at a monthly frequency")
plt.xlabel("Time in months", fontsize = 13)
plt.ylabel("Average sale price of houses per month", fontsize = 13)
plt.ylim(150000, 1850000)
plt.show()

# 3. Correlation and Regression

### a) Analyse how house sale prices correlate with the other numeric features in the data.

In [None]:
House_DF.dtypes

To look at strength of correlation, I will generate a scatter plot against price for each feature and also calculate the correlation coefficient

###### Numeric Features:
- Year Built
- Size (in SQFT)
- Story No.
- Bedrooms No.
- Bathrooms No.
- Year of Sale

**1. Year Built**

In [None]:
ax = House_DF.plot(kind="scatter", figsize=(10, 6), color='darkblue', s=30, fontsize=13, x="Year Built", y="Price")
plt.xlabel("Year Built", fontsize=12)
plt.ylabel("Price", fontsize=12);

x = np.array(House_DF["Year Built"])
y = np.array(House_DF["Price"])
score = np.corrcoef(x,y)[1][0]
score = round(score, 2)
print("Correlation Coefficient: " + str(score))

**2. Size**

In [None]:
ax = House_DF.plot(kind="scatter", figsize=(10, 6), color='darkblue', s=30, fontsize=13, x="Size (in Sq Ft)", y="Price")
plt.xlabel("Size (in Sq Ft)", fontsize=12)
plt.ylabel("Price", fontsize=12);

x = np.array(House_DF["Size (in Sq Ft)"])
y = np.array(House_DF["Price"])
score = np.corrcoef(x,y)[1][0]
score = round(score, 2)
print("Correlation Coefficient: " + str(score))

**3. Story No.**

In [None]:
ax = House_DF.plot(kind="scatter", figsize=(10, 6), color='darkblue', s=30, fontsize=13, x="Story no.", y="Price")
plt.xlabel("Story no.", fontsize=12)
plt.ylabel("Price", fontsize=12);

x = np.array(House_DF["Story no."])
y = np.array(House_DF["Price"])
score = np.corrcoef(x,y)[1][0]
score = round(score, 2)
print("Correlation Coefficient: " + str(score))

**4. Bedrooms No.**

In [None]:
ax = House_DF.plot(kind="scatter", figsize=(10, 6), color='darkblue', s=30, fontsize=13, x="Bedrooms no.", y="Price")
plt.xlabel("Bedrooms no.", fontsize=12)
plt.ylabel("Price", fontsize=12);

x = np.array(House_DF["Bedrooms no."])
y = np.array(House_DF["Price"])
score = np.corrcoef(x,y)[1][0]
score = round(score, 2)
print("Correlation Coefficient: " + str(score))

**5. Bathrooms No.**

In [None]:
ax = House_DF.plot(kind="scatter", figsize=(10, 6), color='darkblue', s=30, fontsize=13, x="Bathrooms no.", y="Price")
plt.xlabel("Bathrooms no.", fontsize=12)
plt.ylabel("Price", fontsize=12);

x = np.array(House_DF["Bathrooms no."])
y = np.array(House_DF["Price"])
score = np.corrcoef(x,y)[1][0]
score = round(score, 2)
print("Correlation Coefficient: " + str(score))

**6. Year of Sale**

In [None]:
ax = House_DF.plot(kind="scatter", figsize=(10, 6), color='darkblue', s=30, fontsize=13, x="Year of Sale", y="Price")
plt.xlabel("Year of Sale", fontsize=12)
plt.ylabel("Price", fontsize=12);

x = np.array(House_DF["Year of Sale"])
y = np.array(House_DF["Price"])
score = np.corrcoef(x,y)[1][0]
score = round(score, 2)
print("Correlation Coefficient: " + str(score))

The strongest correlations with price come from Size, Bathrooms No. and Year Built, in that order
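
The same ranking can be produced in one step with `DataFrame.corr`, which computes Pearson coefficients for all numeric columns at once. The sketch below runs on synthetic data: the `houses` frame and the way its prices are generated are assumptions for illustration, with only the column names borrowed from this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on size, bedrooms are unrelated noise
rng = np.random.default_rng(2)
n = 200
size = rng.uniform(800, 4000, n)
houses = pd.DataFrame({
    "Size (in Sq Ft)": size,
    "Bedrooms no.": rng.integers(1, 6, n),
    "Price": 150 * size + rng.normal(0, 50_000, n),
})

# Pearson correlation of every numeric column with Price, strongest first
corr_with_price = (
    houses.corr(numeric_only=True)["Price"]
    .drop("Price")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price.round(2))
```

Sorting by absolute value matters because a strong negative correlation is as informative as a strong positive one.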

### b) Analyse how house sale prices relate to each of the categorical features in the data.

In [None]:
House_DF.dtypes

**Categorical Features:**
- Date of Sale
- Location
- House Type

As these are categorical features, I will not be able to calculate a correlation coefficient, so I will examine their relationship with price through plots
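
Alongside the plots, a per-category summary table gives the numbers behind them. This is a minimal sketch with invented rows; the house type names used here are hypothetical and may not match the categories in the real data.

```python
import pandas as pd

# Hypothetical house types and prices, for illustration only
houses = pd.DataFrame({
    "House Type": ["Detached", "Terraced", "Detached",
                   "Semi-Detached", "Terraced", "Semi-Detached"],
    "Price": [900_000, 350_000, 700_000, 500_000, 450_000, 540_000],
})

# Count, mean and median price per category, most expensive first
summary = (
    houses.groupby("House Type")["Price"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Including the count helps to judge how much weight each category mean deserves, and the median is less sensitive to a few very expensive houses.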

**1. Date of Sale**

In [None]:
ax = House_DF.plot(kind="scatter", figsize=(10, 6), color='darkblue', s=30, fontsize=13, x="Date of Sale", y="Price")
plt.xlabel("Date of Sale", fontsize=12)
plt.ylabel("Price", fontsize=12);

The data is well spread, with no clear relationship between date of sale and price

In [None]:
answer = House_DF["Price"].groupby(House_DF["Date of Sale"])

In [None]:
answer = answer.mean().sort_values()

In [None]:
plt.figure(figsize=(20,6))
plt.plot(answer, marker="x", markersize=12)
plt.show()

**2. Location**

In [None]:
ax = House_DF.plot(kind="scatter", figsize=(10, 6), color='darkblue', s=30, fontsize=13, x="Location", y="Price")
plt.xlabel("Location", fontsize=12)
plt.ylabel("Price", fontsize=12);

In [None]:
answer = House_DF["Price"].groupby(House_DF["Location"])

In [None]:
answer = answer.mean().sort_values()

In [None]:
plt.figure(figsize=(20,6))
plt.plot(answer, marker="x", markersize=12)
plt.show()

**3. House Type**

In [None]:
ax = House_DF.plot(kind="scatter", figsize=(10, 6), color='darkblue', s=30, fontsize=13, x="House Type", y="Price")
plt.xlabel("House Type", fontsize=12)
plt.ylabel("Price", fontsize=12);

In [None]:
answer = House_DF["Price"].groupby(House_DF["House Type"])

In [None]:
answer = answer.mean().sort_values()

In [None]:
plt.figure(figsize=(20,6))
plt.plot(answer, marker="x", markersize=12)
plt.show()

### c) Investigate the use of simple linear regression to predict house sale prices, based on each of the individual numeric features in the data. Which numeric feature appears to be most useful when predicting prices?

**Features:**
- Year Built
- Size (in SQFT)
- Story No.
- Bedrooms No.
- Bathrooms No.
- Year of Sale

For each feature I will add a regression line to the previous scatter plots and calculate the score of the regression
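
As a compact illustration of this step, the sketch below fits a simple linear regression on synthetic data, gets the two end points of the regression line from `model.predict`, and checks that the score returned by `model.score` (R squared) equals the squared correlation coefficient. The data and the variable names `size` and `price` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: price grows linearly with size, plus noise
rng = np.random.default_rng(3)
size = rng.uniform(800, 4000, 300)
price = 150 * size + rng.normal(0, 60_000, 300)

X = size.reshape(-1, 1)  # scikit-learn expects a 2D feature array
model = LinearRegression().fit(X, price)

# Two end points are enough to draw a straight regression line
x_line = np.array([[size.min()], [size.max()]])
y_line = model.predict(x_line)

# For one feature, R squared is the square of the Pearson correlation
r_squared = model.score(X, price)
r = np.corrcoef(size, price)[0, 1]
print(round(r_squared, 4), round(r ** 2, 4))
```

This relationship is why a ranking of single features by regression score follows the ranking by absolute correlation.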

**1. Year Built**

In [None]:
x = np.array(House_DF["Year Built"])
y = np.array(House_DF["Price"])

x = x.reshape(-1, 1)
y = y.reshape(-1, 1)

model = LinearRegression()
model.fit(x,y)

In [None]:
model.intercept_

In [None]:
model.coef_[0]

In [None]:
min_val = (House_DF["Year Built"].min())
max_val = (House_DF["Year Built"].max())
print("Min: %d  Max: %d" % (min_val, max_val))

In [None]:
values = []
years = []
for j in range(min_val, max_val):
    values.append((model.intercept_ + (model.coef_[0] * j)))   
    years.append(j)

In [None]:
print(House_DF["Price"].mean())

In [None]:
r_sq = model.score(x, y)
print("Regression Score: " + str(r_sq))
year_Built = []
year_Built.append(round(r_sq, 2))
plt.figure(figsize=(13, 6))
plt.scatter(House_DF["Year Built"],House_DF["Price"])
plt.plot(years, values, color="red", linewidth=3)
plt.title("Scatter Plot of Year Built vs. Price with Linear Regression Line")
plt.xlabel("Year House was Built", size=14)
plt.ylabel("Price it was sold for", size=14)
plt.show()

**2. Size**

In [None]:
x = np.array(House_DF["Size (in Sq Ft)"])
y = np.array(House_DF["Price"])

x = x.reshape(-1, 1)
y = y.reshape(-1, 1)

model = LinearRegression()
model.fit(x,y)

In [None]:
model.intercept_

In [None]:
model.coef_[0]

In [None]:
min_val = (House_DF["Size (in Sq Ft)"].min())
max_val = (House_DF["Size (in Sq Ft)"].max())
print("Min: %d  Max: %d" % (min_val, max_val))

In [None]:
values = []
years = []
for j in range(min_val, max_val):
    values.append((model.intercept_ + (model.coef_[0] * j)))   
    years.append(j)

In [None]:
print(House_DF["Size (in Sq Ft)"].mean())

In [None]:
r_sq = model.score(x, y)
print("Regression Score: " + str(r_sq))
Size = []
Size.append(round(r_sq, 2))
plt.figure(figsize=(13, 6))
plt.scatter(House_DF["Size (in Sq Ft)"],House_DF["Price"])
plt.plot(years, values, color="red", linewidth=3)
plt.title("Scatter Plot of Size of house vs. Price with Linear Regression Line")
plt.xlabel("Size of House", size=14)
plt.ylabel("Price it was sold for", size=14)
plt.show()

**3. Story No.**

In [None]:
x = np.array(House_DF["Story no."])
y = np.array(House_DF["Price"])

x = x.reshape(-1, 1)
y = y.reshape(-1, 1)

model = LinearRegression()
model.fit(x,y)

In [None]:
model.intercept_

In [None]:
model.coef_[0]

In [None]:
min_val = (House_DF["Story no."].min())
max_val = (House_DF["Story no."].max())
print("Min: %d  Max: %d" % (min_val, max_val))

In [None]:
House_DF["Story no."].value_counts()

In [None]:
values = []
sizes = []
i=1
for j in range(3):
    values.append((model.intercept_ + (model.coef_[0] * i)))   
    sizes.append(i)
    i += 0.5

In [None]:
r_sq = model.score(x, y)
print("Regression Score: " + str(r_sq))
Story = []
Story.append(round(r_sq, 2))
plt.figure(figsize=(13, 6))
plt.scatter(House_DF["Story no."], House_DF["Price"])
plt.plot(sizes, values, color="red", linewidth=3)
plt.title("Scatter Plot of Story No. vs. Price with Linear Regression Line")
plt.xlabel("Stories in House", size=14)
plt.ylabel("Price it was sold for", size=14)
plt.show()

**4. Bedrooms No.**

In [None]:
x = np.array(House_DF["Bedrooms no."])
y = np.array(House_DF["Price"])

x = x.reshape(-1, 1)
y = y.reshape(-1, 1)

model = LinearRegression()
model.fit(x,y)

print(model.intercept_)
print(model.coef_[0])

print(House_DF["Bedrooms no."].min())
print(House_DF["Bedrooms no."].max())

In [None]:
values = []
size = []
for j in range(1, 6):
    values.append((model.intercept_ + (model.coef_[0] * j)))   
    size.append(j)

In [None]:
r_sq = model.score(x, y)
print("Regression Score: " + str(r_sq))
Bedrooms = []
Bedrooms.append(round(r_sq, 2))
plt.figure(figsize=(13, 6))
plt.scatter(House_DF["Bedrooms no."], House_DF["Price"])
plt.plot(size, values, color="red", linewidth=3)
plt.title("Scatter Plot of Bedrooms No. vs. Price with Linear Regression Line")
plt.xlabel("Bedrooms in House", size=14)
plt.ylabel("Price it was sold for", size=14)
plt.show()

**5. Bathrooms No.**

In [None]:
x = np.array(House_DF["Bathrooms no."])
y = np.array(House_DF["Price"])

x = x.reshape(-1, 1)
y = y.reshape(-1, 1)

model = LinearRegression()
model.fit(x,y)

print(model.intercept_)
print(model.coef_[0])

print(House_DF["Bathrooms no."].min())
print(House_DF["Bathrooms no."].max())

In [None]:
values = []
size = []
for j in range(1, 4):
    values.append((model.intercept_ + (model.coef_[0] * j)))   
    size.append(j)

In [None]:
r_sq = model.score(x, y)
print("Regression Score: " + str(r_sq))
Bathrooms = []
Bathrooms.append(round(r_sq, 2))
plt.figure(figsize=(13, 6))
plt.scatter(House_DF["Bathrooms no."], House_DF["Price"])
plt.plot(size, values, color="red", linewidth=3)
plt.title("Scatter Plot of Bathrooms No. vs. Price with Linear Regression Line")
plt.xlabel("Bathrooms in House", size=14)
plt.ylabel("Price it was sold for", size=14)
plt.show()

**6. Year of Sale**

In [None]:
x = np.array(House_DF["Year of Sale"])
y = np.array(House_DF["Price"])

x = x.reshape(-1, 1)
y = y.reshape(-1, 1)

model = LinearRegression()
model.fit(x,y)

print(model.intercept_)
print(model.coef_[0])

print(House_DF["Year of Sale"].min())
print(House_DF["Year of Sale"].max())

In [None]:
values = []
size = []
for j in range(2016, 2020):
    values.append((model.intercept_ + (model.coef_[0] * j)))   
    size.append(j)

In [None]:
r_sq = model.score(x, y)
print("Regression Score: " + str(r_sq))
year_Sale = []
year_Sale.append(round(r_sq, 2))
plt.figure(figsize=(13, 6))
plt.scatter(House_DF["Year of Sale"], House_DF["Price"])
plt.plot(size, values, color="red", linewidth=3)
plt.title("Scatter Plot of Year of Sale vs. Price with Linear Regression Line")
plt.xlabel("Year of Sale", size=14)
plt.ylabel("Price it was sold for", size=14)
plt.show()

##### Results of Regression Testing

Looking at each graph and its regression line is a good way of seeing how well the model performs, but I will use each feature's regression score to rank them

In [None]:
print("Year Built: " + str(year_Built))
print("House Size: " + str(Size))
print("Story number: " + str(Story))
print("Bedrooms number: " + str(Bedrooms))
print("Bathrooms number: " + str(Bathrooms))
print("Year of Sale: " + str(year_Sale))

These scores allow us to rank each feature in terms of how well it performed in our regression testing:
1. House Size
2. Bathrooms number
3. Year Built
4. Story number
5. Bedrooms number
6. Year of Sale

# 4. Classification

#### a) The price of a property is often said to be linked closely to its location, while different areas will have different types of housing stock. Investigate whether it is possible to classify the location of a house, based on the other descriptive features in the house sale dataset. You can use any classification algorithm of your choice. You should evaluate the performance of the classifier using an appropriate strategy.

In this section I will attempt to classify the Location of houses based on other descriptive features in the data. I will do this using a K Nearest Neighbour Classifier. For each feature, I will analyse its performance using a hold back strategy, generating a confusion matrix and finally using k-fold cross validation. I will also store the respective scores in a list and analyse them at the end

To start, it is a good idea to have a look at the value counts for our Location parameter. I will also look at the number of data points to help choose our value of K

In [None]:
House_DF["Location"].value_counts()

In [None]:
len(House_DF)

We have 7 unique house locations with values ranging from 82 - 238. The number of data points is 945

As a rule of thumb, K should be no larger than sqrt(n), where n is the number of observations in our dataset (945). This gives an upper bound of roughly 30
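
The bound can be worked out directly. This is only the arithmetic behind the rule of thumb, not a guarantee that the resulting value is the best choice of K.

```python
import math

n_observations = 945

# Largest whole number K with K * K <= n
k_upper = math.isqrt(n_observations)
print(math.sqrt(n_observations), k_upper)
```

Here `k_upper` is 30, since 30 squared is 900 and 31 squared is 961.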

I will decide on a value of K when I begin examining the first descriptive feature

#### Features to be used in classification:
- Price
- Year Built
- Size
- Story No. 
- Bedrooms No.
- Bathrooms No.

Let us start by isolating our target attribute which is Location

In [None]:
target = House_DF.iloc[:,2]
target

##### 1. Price

In [None]:
data = House_DF.iloc[:,1:2]
data

In [None]:
# splitting price data into training and test data with an 80/20 split
# also including list to hold scores of hold back strategy and k-fold cross validation
price = []
train_data, test_data, train_target, test_target = train_test_split(data, target, test_size=0.2)

In [None]:
print("Training set has %d examples" % train_data.shape[0])
print("Test set has %d examples" % test_data.shape[0])

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
# looking at values for k from 1 - 45
for k in range(1, 46):
    # train a classifier with this parameter value
    model = KNeighborsClassifier(n_neighbors=k)
    m = model.fit(train_data, train_target)
    # make predictions
    predicted = model.predict(test_data)
    # evaluate the predictions
    acc = accuracy_score(test_target, predicted)
    print("K=%02d neighbours: Accuracy=%.3f" % (k, acc))

Accuracy values seem quite similar for K in the range 15 - 40. I will continue by using K = 30 for every feature
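
An alternative way of choosing K, which I sketch here on synthetic data, is to run cross validation on the training set only and keep the test set for the final check, so that the test accuracy is not used twice. The data, the grid of odd K values and the fixed `random_state` are all my own assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: one numeric feature, three classes with shifted means
rng = np.random.default_rng(4)
labels = rng.integers(0, 3, 600)
X = (labels * 2.0 + rng.normal(0, 1.0, 600)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels
)

# 5-fold cross validation on the training data for each candidate K
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 46, 2))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
print("Test accuracy:", round(search.score(X_test, y_test), 3))
```

Odd values of K are a common choice because they reduce the chance of tied votes, although ties can still happen with more than two classes.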

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
predicted = model.predict(test_data)
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

In [None]:
accuracy = accuracy_score(test_target, predicted)
print("Accuracy=%.3f" % accuracy)
price.append(round(accuracy, 3))

**Note:** For all confusion matrices, the correctly classified counts lie on the diagonal from top left to bottom right. Any off-diagonal value counts houses that were assigned to the wrong location
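
This also means the accuracy can be read straight from the matrix: it is the sum of the diagonal divided by the sum of all entries. A tiny sketch with made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted locations for six houses
y_true = ["Avoca", "Avoca", "Oakbrook", "Oakbrook", "Rivermont", "Rivermont"]
y_pred = ["Avoca", "Oakbrook", "Oakbrook", "Oakbrook", "Avoca", "Rivermont"]

cm = confusion_matrix(y_true, y_pred)  # rows = true labels, columns = predictions
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```

Four of the six houses are on the diagonal, so both values are 4/6.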

In [None]:
cm = confusion_matrix(test_target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax, cmap=plt.cm.Blues)
plt.show()

In [None]:
acc_scores = cross_val_score(model, data, target, cv=5, scoring="accuracy")
print(acc_scores)

In [None]:
print("Final accuracy score: %.2f" % acc_scores.mean())
price.append(round(acc_scores.mean(), 3))

In [None]:
price

##### 2. Year Built

In [None]:
data = House_DF.iloc[:,3:4]
data

In [None]:
# splitting year built data into training and test data with an 80/20 split
# also including list to hold scores of hold back strategy and k-fold cross validation
year_built = []
train_data, test_data, train_target, test_target = train_test_split(data, target, test_size=0.2)

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
predicted = model.predict(test_data)
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

In [None]:
accuracy = accuracy_score(test_target, predicted)
print("Accuracy=%.3f" % accuracy)
year_built.append(round(accuracy, 3))

In [None]:
cm = confusion_matrix(test_target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax, cmap=plt.cm.Blues)
plt.show()

In [None]:
acc_scores = cross_val_score(model, data, target, cv=5, scoring="accuracy")
print(acc_scores)

In [None]:
print("Final accuracy score: %.2f" % acc_scores.mean())
year_built.append(round(acc_scores.mean(), 3))

##### 3. Size

In [None]:
data = House_DF.iloc[:,4:5]
data

In [None]:
# splitting size data into training and test data with an 80/20 split
# also including list to hold scores of hold back strategy and k-fold cross validation
size = []
train_data, test_data, train_target, test_target = train_test_split(data, target, test_size=0.2)

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
predicted = model.predict(test_data)
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

In [None]:
accuracy = accuracy_score(test_target, predicted)
print("Accuracy=%.3f" % accuracy)
size.append(round(accuracy, 3))

In [None]:
cm = confusion_matrix(test_target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax, cmap=plt.cm.Blues)
plt.show()

In [None]:
acc_scores = cross_val_score(model, data, target, cv=5, scoring="accuracy")
print(acc_scores)

In [None]:
print("Final accuracy score: %.2f" % acc_scores.mean())
size.append(round(acc_scores.mean(), 3))

##### 4. Story No.

In [None]:
data = House_DF.iloc[:,6:7]
data

In [None]:
# splitting story no. data into training and test data with an 80/20 split
# also including list to hold scores of hold back strategy and k-fold cross validation
story = []
train_data, test_data, train_target, test_target = train_test_split(data, target, test_size=0.2)

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
predicted = model.predict(test_data)
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

In [None]:
accuracy = accuracy_score(test_target, predicted)
print("Accuracy=%.3f" % accuracy)
story.append(round(accuracy, 3))

In [None]:
cm = confusion_matrix(test_target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax, cmap=plt.cm.Blues)
plt.show()

In [None]:
acc_scores = cross_val_score(model, data, target, cv=5, scoring="accuracy")
print(acc_scores)

In [None]:
print("Final accuracy score: %.2f" % acc_scores.mean())
story.append(round(acc_scores.mean(), 3))

##### 5. Bedrooms No.

In [None]:
data = House_DF.iloc[:,7:8]
data

In [None]:
# splitting bedrooms no. data into training and test data with an 80/20 split
# also including list to hold scores of hold back strategy and k-fold cross validation
bedrooms = []
train_data, test_data, train_target, test_target = train_test_split(data, target, test_size=0.2)

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
predicted = model.predict(test_data)
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

In [None]:
accuracy = accuracy_score(test_target, predicted)
print("Accuracy=%.3f" % accuracy)
bedrooms.append(round(accuracy, 3))

In [None]:
cm = confusion_matrix(test_target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax, cmap=plt.cm.Blues)
plt.show()

In [None]:
acc_scores = cross_val_score(model, data, target, cv=5, scoring="accuracy")
print(acc_scores)

In [None]:
print("Final accuracy score: %.2f" % acc_scores.mean())
bedrooms.append(round(acc_scores.mean(), 3))

##### 6. Bathrooms No.

In [None]:
data = House_DF.iloc[:,8:9]
data

In [None]:
# splitting bathrooms data into training and test data with an 80/20 split
# also including list to hold scores of hold back strategy and k-fold cross validation
bathrooms = []
train_data, test_data, train_target, test_target = train_test_split(data, target, test_size=0.2)

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
predicted = model.predict(test_data)
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

In [None]:
accuracy = accuracy_score(test_target, predicted)
print("Accuracy=%.3f" % accuracy)
bathrooms.append(round(accuracy, 3))

In [None]:
cm = confusion_matrix(test_target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax, cmap=plt.cm.Blues)
plt.show()

In [None]:
acc_scores = cross_val_score(model, data, target, cv=5, scoring="accuracy")
print(acc_scores)

In [None]:
print("Final accuracy score: %.2f" % acc_scores.mean())
bathrooms.append(round(acc_scores.mean(), 3))

##### Final Results

I will now print the hold back and k-fold cross validation scores for every descriptive feature.

In [None]:
print("Price - hold back score: %.3f    k-fold score: %.3f" % (price[0], price[1]))
print("Year Built - hold back score: %.3f    k-fold score: %.3f" % (year_built[0], year_built[1]))
print("Size - hold back score: %.3f    k-fold score: %.3f" % (size[0], size[1]))
print("Story - hold back score: %.3f    k-fold score: %.3f" % (story[0], story[1]))
print("Bedrooms - hold back score: %.3f    k-fold score: %.3f" % (bedrooms[0], bedrooms[1]))
print("Bathrooms - hold back score: %.3f    k-fold score: %.3f\n" % (bathrooms[0], bathrooms[1]))

Results will differ between runs because `train_test_split` shuffles the data randomly, but for my run I received the following scores, listed as [hold back, k-fold]:
- Price: [0.439, 0.438]
- Year Built: [0.519, 0.546]
- Size: [0.259, 0.308]
- Story: [0.291, 0.313]
- Bedrooms: [0.228, 0.251]
- Bathrooms: [0.143, 0.298]
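The run-to-run variation comes from the random shuffle inside `train_test_split`. As a minimal sketch, assuming reproducible numbers are wanted, passing a fixed `random_state` pins the split. The toy `rows` and `labels` lists and the seed value 42 are placeholders for illustration and are not part of my analysis.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for a feature column and the target (illustrative only)
rows = [[i] for i in range(20)]
labels = [i % 2 for i in range(20)]

# The same seed gives an identical 80/20 split on every call
split_a = train_test_split(rows, labels, test_size=0.2, random_state=42)
split_b = train_test_split(rows, labels, test_size=0.2, random_state=42)

# Each result is [train_data, test_data, train_target, test_target]
print(split_a[1] == split_b[1])
print(len(split_a[0]), len(split_a[1]))
```

Leaving `random_state` unset, as in the cells above, keeps the split random, which is why the hold back scores change from run to run.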

I will now calculate a combined score for each feature by summing its hold back and k-fold scores.

In [None]:
print("Price: %.2f" % sum(price))
print("Year Built: %.2f" % sum(year_built))
print("Size: %.2f" % sum(size))
print("Story: %.2f" % sum(story))
print("Bedrooms: %.2f" % sum(bedrooms))
print("Bathrooms: %.2f" % sum(bathrooms))

Again, results will differ between runs, but for my run I received the following combined scores:
- Price: 0.88
- Year Built: 1.06
- Size: 0.57
- Story: 0.60
- Bedrooms: 0.48
- Bathrooms: 0.44

###### This gives an overall ranking of best features for classification in the following order:
1. Year Built
2. Price
3. Story
4. Size
5. Bedrooms
6. Bathrooms
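The ranking above can be reproduced directly from the reported scores. This is a minimal sketch: the `scores` dictionary is copied by hand from my run's output rather than read from the `price`, `year_built`, etc. lists, and the variable names are illustrative.

```python
# Scores from my run: [hold back, k-fold]
scores = {
    "Price": [0.439, 0.438],
    "Year Built": [0.519, 0.546],
    "Size": [0.259, 0.308],
    "Story": [0.291, 0.313],
    "Bedrooms": [0.228, 0.251],
    "Bathrooms": [0.143, 0.298],
}

# Sort feature names by the sum of their two scores, highest first
ranking = sorted(scores, key=lambda name: sum(scores[name]), reverse=True)

for position, name in enumerate(ranking, start=1):
    print("%d. %s (%.2f)" % (position, name, sum(scores[name])))
```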

#### b) Experiment with applying the same classifier in combination with different subsets of descriptive features. Which feature(s) appear to be most useful for classification?

As there are six features, there are 63 possible non-empty combinations of them, so it is impractical to test every one here.
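To make the size of the search space concrete, the sketch below enumerates every non-empty subset of the six features with `itertools.combinations`. The column names are the ones used in `House_DF`; the `features` and `subsets` names are illustrative.

```python
from itertools import combinations

features = ["Price", "Year Built", "Size (in Sq Ft)",
            "Story no.", "Bedrooms no.", "Bathrooms no."]

# Every non-empty subset: choose r features for r = 1..6
subsets = [combo
           for r in range(1, len(features) + 1)
           for combo in combinations(features, r)]

print(len(subsets))
```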

Instead, I will choose combinations of features with the highest classification scores and also those which seem interesting

The combinations of features to be tested are:
- Year Built and Price
- Year Built and Story
- Price and Story
- Price, Bedrooms and Bathrooms
- Year Built, Price and Story
- Size, Bedrooms, Bathrooms
- Year Built, Price, Story and Size

At the end I will analyse the selected features using the same methods and scoring as in part a)
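Each combination below repeats the same steps: scale the features, split 80/20, fit a 30-nearest-neighbour classifier, then score with hold back and 5-fold cross validation. A minimal sketch of those steps as one function follows. It is not the code I ran: the `evaluate` name, the synthetic data from `make_classification`, and the fixed seeds are assumptions for illustration. It also places the scaler inside a pipeline so that it is refit on each training fold rather than on the full data set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate(data, target, k=30, seed=0):
    """Return [hold back accuracy, mean 5-fold accuracy], rounded to 3 places."""
    # Scaler and classifier are chained so scaling is learned from training rows only
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))

    # Hold back strategy: 80/20 split
    train_data, test_data, train_target, test_target = train_test_split(
        data, target, test_size=0.2, random_state=seed)
    model.fit(train_data, train_target)
    hold_back = model.score(test_data, test_target)  # accuracy for classifiers

    # 5-fold cross validation on the full data
    k_fold = cross_val_score(model, data, target, cv=5, scoring="accuracy").mean()
    return [round(float(hold_back), 3), round(float(k_fold), 3)]


# Synthetic stand-in for two feature columns and a class label (hypothetical data)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
result = evaluate(X, y)
print(result)
```

With the real data, a call such as `evaluate(House_DF[["Year Built", "Price"]].values, target)` would return a list in the same [hold back, k-fold] form as `YB_P`.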

###### 1. Year Built and Price

In [None]:
data = House_DF[["Year Built", "Price"]]
data

In [None]:
# standardising data values (zero mean, unit variance per feature)
normalizer = StandardScaler()
data_scaled = normalizer.fit_transform(data.values)
data_scaled

In [None]:
# splitting data into training and test data with an 80/20 split
# also including list to hold scores of hold back strategy and k-fold cross validation
YB_P = []
train_data, test_data, train_target, test_target = train_test_split(data_scaled, target, test_size=0.2)

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
predicted = model.predict(test_data)
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

In [None]:
accuracy = accuracy_score(test_target, predicted)
print("Accuracy=%.3f" % accuracy)
YB_P.append(round(accuracy, 3))

In [None]:
cm = confusion_matrix(test_target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax, cmap=plt.cm.Blues)
plt.show()

In [None]:
acc_scores = cross_val_score(model, data_scaled, target, cv=5, scoring="accuracy")
print(acc_scores)

In [None]:
print("Final accuracy score: %.2f" % acc_scores.mean())
YB_P.append(round(acc_scores.mean(), 3))

###### 2. Year Built and Story

In [None]:
data = House_DF[["Year Built", "Story no."]]
data

In [None]:
# standardising data values (zero mean, unit variance per feature)
normalizer = StandardScaler()
data_scaled = normalizer.fit_transform(data.values)
data_scaled

In [None]:
# splitting data into training and test data with an 80/20 split
# also including list to hold scores of hold back strategy and k-fold cross validation
YB_S = []
train_data, test_data, train_target, test_target = train_test_split(data_scaled, target, test_size=0.2)

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
predicted = model.predict(test_data)
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

In [None]:
accuracy = accuracy_score(test_target, predicted)
print("Accuracy=%.3f" % accuracy)
YB_S.append(round(accuracy, 3))

In [None]:
cm = confusion_matrix(test_target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax, cmap=plt.cm.Blues)
plt.show()

In [None]:
acc_scores = cross_val_score(model, data_scaled, target, cv=5, scoring="accuracy")
print(acc_scores)

In [None]:
print("Final accuracy score: %.2f" % acc_scores.mean())
YB_S.append(round(acc_scores.mean(), 3))

###### 3. Price and Story

In [None]:
data = House_DF[["Price", "Story no."]]
data

In [None]:
# standardising data values (zero mean, unit variance per feature)
normalizer = StandardScaler()
data_scaled = normalizer.fit_transform(data.values)
data_scaled

In [None]:
# splitting data into training and test data with an 80/20 split
# also including list to hold scores of hold back strategy and k-fold cross validation
P_S = []
train_data, test_data, train_target, test_target = train_test_split(data_scaled, target, test_size=0.2)

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
predicted = model.predict(test_data)
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

In [None]:
accuracy = accuracy_score(test_target, predicted)
print("Accuracy=%.3f" % accuracy)
P_S.append(round(accuracy, 3))

In [None]:
cm = confusion_matrix(test_target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax, cmap=plt.cm.Blues)
plt.show()

In [None]:
acc_scores = cross_val_score(model, data_scaled, target, cv=5, scoring="accuracy")
print(acc_scores)

In [None]:
print("Final accuracy score: %.2f" % acc_scores.mean())
P_S.append(round(acc_scores.mean(), 3))

###### 4. Price, Bedrooms and Bathrooms

In [None]:
data = House_DF[["Price", "Bedrooms no.", "Bathrooms no."]]
data

In [None]:
# standardising data values (zero mean, unit variance per feature)
normalizer = StandardScaler()
data_scaled = normalizer.fit_transform(data.values)
data_scaled

In [None]:
# splitting data into training and test data with an 80/20 split
# also including list to hold scores of hold back strategy and k-fold cross validation
P_BE_BA = []
train_data, test_data, train_target, test_target = train_test_split(data_scaled, target, test_size=0.2)

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
predicted = model.predict(test_data)
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

In [None]:
accuracy = accuracy_score(test_target, predicted)
print("Accuracy=%.3f" % accuracy)
P_BE_BA.append(round(accuracy, 3))

In [None]:
cm = confusion_matrix(test_target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax, cmap=plt.cm.Blues)
plt.show()

In [None]:
acc_scores = cross_val_score(model, data_scaled, target, cv=5, scoring="accuracy")
print(acc_scores)

In [None]:
print("Final accuracy score: %.2f" % acc_scores.mean())
P_BE_BA.append(round(acc_scores.mean(), 3))

###### 5. Year Built, Price and Story

In [None]:
data = House_DF[["Year Built", "Price", "Story no."]]
data

In [None]:
# standardising data values (zero mean, unit variance per feature)
normalizer = StandardScaler()
data_scaled = normalizer.fit_transform(data.values)
data_scaled

In [None]:
# splitting data into training and test data with an 80/20 split
# also including list to hold scores of hold back strategy and k-fold cross validation
YB_P_S = []
train_data, test_data, train_target, test_target = train_test_split(data_scaled, target, test_size=0.2)

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
predicted = model.predict(test_data)
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

In [None]:
accuracy = accuracy_score(test_target, predicted)
print("Accuracy=%.3f" % accuracy)
YB_P_S.append(round(accuracy, 3))

In [None]:
cm = confusion_matrix(test_target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax, cmap=plt.cm.Blues)
plt.show()

In [None]:
acc_scores = cross_val_score(model, data_scaled, target, cv=5, scoring="accuracy")
print(acc_scores)

In [None]:
print("Final accuracy score: %.2f" % acc_scores.mean())
YB_P_S.append(round(acc_scores.mean(), 3))

###### 6. Size, Bedrooms, Bathrooms

In [None]:
data = House_DF[["Size (in Sq Ft)", "Bedrooms no.", "Bathrooms no."]]
data

In [None]:
# standardising data values (zero mean, unit variance per feature)
normalizer = StandardScaler()
data_scaled = normalizer.fit_transform(data.values)
data_scaled

In [None]:
# splitting data into training and test data with an 80/20 split
# also including list to hold scores of hold back strategy and k-fold cross validation
S_BE_BA = []
train_data, test_data, train_target, test_target = train_test_split(data_scaled, target, test_size=0.2)

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
predicted = model.predict(test_data)
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

In [None]:
accuracy = accuracy_score(test_target, predicted)
print("Accuracy=%.3f" % accuracy)
S_BE_BA.append(round(accuracy, 3))

In [None]:
cm = confusion_matrix(test_target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax, cmap=plt.cm.Blues)
plt.show()

In [None]:
acc_scores = cross_val_score(model, data_scaled, target, cv=5, scoring="accuracy")
print(acc_scores)

In [None]:
print("Final accuracy score: %.2f" % acc_scores.mean())
S_BE_BA.append(round(acc_scores.mean(), 3))

###### 7. Year Built, Price, Story and Size

In [None]:
data = House_DF[["Year Built", "Price", "Story no.", "Size (in Sq Ft)"]]
data

In [None]:
# standardising data values (zero mean, unit variance per feature)
normalizer = StandardScaler()
data_scaled = normalizer.fit_transform(data.values)
data_scaled

In [None]:
# splitting data into training and test data with an 80/20 split
# also including list to hold scores of hold back strategy and k-fold cross validation
Y_P_S_S = []
train_data, test_data, train_target, test_target = train_test_split(data_scaled, target, test_size=0.2)

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_data, train_target)

In [None]:
predicted = model.predict(test_data)
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

In [None]:
accuracy = accuracy_score(test_target, predicted)
print("Accuracy=%.3f" % accuracy)
Y_P_S_S.append(round(accuracy, 3))

In [None]:
cm = confusion_matrix(test_target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax, cmap=plt.cm.Blues)
plt.show()

In [None]:
acc_scores = cross_val_score(model, data_scaled, target, cv=5, scoring="accuracy")
print(acc_scores)

In [None]:
print("Final accuracy score: %.2f" % acc_scores.mean())
Y_P_S_S.append(round(acc_scores.mean(), 3))

##### Final Results

I will now print the hold back and k-fold cross validation scores obtained for each combination of descriptive features.

In [None]:
print("Year Built and Price - hold back score: %.3f    k-fold score: %.3f" % (YB_P[0], YB_P[1]))
print("Year Built and Story - hold back score: %.3f    k-fold score: %.3f" % (YB_S[0], YB_S[1]))
print("Price and Story - hold back score: %.3f    k-fold score: %.3f" % (P_S[0], P_S[1]))
print("Price, Bedrooms and Bathrooms - hold back score: %.3f    k-fold score: %.3f" % (P_BE_BA[0], P_BE_BA[1]))
print("Year Built, Price and Story - hold back score: %.3f    k-fold score: %.3f" % (YB_P_S[0], YB_P_S[1]))
print("Size, Bedrooms, Bathrooms - hold back score: %.3f    k-fold score: %.3f" % (S_BE_BA[0], S_BE_BA[1]))
print("Year Built, Price, Story and Size - hold back score: %.3f    k-fold score: %.3f" % (Y_P_S_S[0], Y_P_S_S[1]))

Results will differ between runs because `train_test_split` shuffles the data randomly, but for my run I received the following scores, listed as [hold back, k-fold]:
- Year Built and Price: [0.603, 0.603]
- Year Built and Story: [0.577, 0.563]
- Price and Story: [0.503, 0.461]
- Price, Bedrooms and Bathrooms: [0.466, 0.497]
- Year Built, Price and Story: [0.635, 0.589]
- Size, Bedrooms, Bathrooms: [0.492, 0.426]
- Year Built, Price, Story and Size: [0.508, 0.605]

I will now calculate a combined score for each combination by summing its hold back and k-fold scores.

In [None]:
print("Year Built and Price: %.2f" % sum(YB_P))
print("Year Built and Story: %.2f" % sum(YB_S))
print("Price and Story: %.2f" % sum(P_S))
print("Price, Bedrooms and Bathrooms: %.2f" % sum(P_BE_BA))
print("Year Built, Price and Story: %.2f" % sum(YB_P_S))
print("Size, Bedrooms, Bathrooms: %.2f" % sum(S_BE_BA))
print("Year Built, Price, Story and Size: %.2f" % sum(Y_P_S_S))

Again, results will differ between runs, but for my run I received the following combined scores:
- Year Built and Price: 1.21
- Year Built and Story: 1.14
- Price and Story: 0.96
- Price, Bedrooms and Bathrooms: 0.96
- Year Built, Price and Story: 1.22
- Size, Bedrooms, Bathrooms: 0.92
- Year Built, Price, Story and Size: 1.11

###### This gives an overall ranking of best combination of features for classification in the following order:
1. Year Built, Price and Story
2. Year Built and Price
3. Year Built and Story
4. Year Built, Price, Story and Size
5. Price and Story
6. Price, Bedrooms and Bathrooms
7. Size, Bedrooms, Bathrooms

**Note:** Positions 5 and 6 were a tie at 0.96, so I placed them alphabetically.
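As a minimal sketch, the combined scores can be recomputed from the per-strategy values reported above, flagging any neighbouring entries that coincide at two decimal places. The `combo_scores` dictionary is copied by hand from my run's output, and the variable names are illustrative.

```python
# Scores from my run: [hold back, k-fold]
combo_scores = {
    "Year Built and Price": [0.603, 0.603],
    "Year Built and Story": [0.577, 0.563],
    "Price and Story": [0.503, 0.461],
    "Price, Bedrooms and Bathrooms": [0.466, 0.497],
    "Year Built, Price and Story": [0.635, 0.589],
    "Size, Bedrooms, Bathrooms": [0.492, 0.426],
    "Year Built, Price, Story and Size": [0.508, 0.605],
}

# Combined score per combination, then names sorted from highest to lowest
totals = {name: sum(pair) for name, pair in combo_scores.items()}
ranked = sorted(totals, key=totals.get, reverse=True)

# Report neighbouring entries whose totals agree at two decimal places
ties = [(a, b) for a, b in zip(ranked, ranked[1:])
        if round(totals[a], 2) == round(totals[b], 2)]

for name in ranked:
    print("%-35s %.3f" % (name, totals[name]))
print("Tied at two decimals:", ties)
```

Printing three decimal places shows how close the tied pair is, which is why I treat it as a tie rather than as a meaningful difference.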