# STINTSY GROUP 8 MACHINE PROJECT
## PIANO, TAHIMIC, TAMAYO, TIPAN

## I. INTRODUCTION

The dataset selected for this project is the "House Prices" dataset, encompassing house sale data in King County from May 2014 to May 2015. This project aims to tackle a significant problem in the real estate domain: predicting house prices. By employing regression analysis, we seek to establish a predictive model that estimates house prices based on various features like size, condition, location, etc. This endeavor is crucial as it aids in real estate valuation, informs investment decisions, and can be a valuable tool for urban planning and housing market analysis. Specifically, the project will explore whether a more complex regression model (such as multiple or polynomial regression) provides a more accurate prediction compared to a simple linear approach, given the multifaceted nature of real estate pricing.

## II. Description of the Dataset

This dataset offers a comprehensive snapshot of the real estate market in King County, covering a one-year period from May 2014 to May 2015. It includes data on 21,613 properties, each characterized by 20 features. These features range from basic descriptors like the number of bedrooms and bathrooms (‘bedrooms’, ‘bathrooms’) to more nuanced ones like the quality of view (‘view’) and construction grade (‘grade’). Each row in the dataset represents a unique property, while columns encapsulate these diverse characteristics.

In terms of data collection, while specific details are not provided, it is typical for such datasets to be compiled from property sales records, real estate listings, and county tax assessments. This method implies potential biases - the dataset may not represent unlisted properties or those not involved in transactions during this period. Furthermore, the geographic and temporal scope of the data may affect the generalizability of our findings, as housing markets can vary significantly across regions and over time.

Each feature in the dataset potentially influences the property's price. For instance, ‘sqft_living’ and ‘sqft_lot’ directly relate to the size of the property, a significant price determinant. The presence of a waterfront (‘waterfront’) or a high construction grade (‘grade’) likely elevates property values due to their desirability and quality. Understanding these relationships will be crucial in building our predictive model.

### Tabularized List of Variables

| [house_prices.csv] **Variable Name** | **Description**|
|--------------------------------------|----------------|
|**id** | A notation for the house|
|**date** | Date the house is sold|
|**price** | The Sale price of the house|
|**bedrooms** | The Number of bedrooms |
|**bathrooms** | The Number of bathrooms |
|**sqft_living** | Size of the living area in square feet |
|**sqft_lot** | Size of the lot in square feet |
|**floors** | Total floors in the house |
|**waterfront** | '1' if the property has waterfront, '0' if none |
|**view** | An index from 0 to 4 of how good the view of the property was |
|**condition** | Condition of the house, ranked from 1 to 5 |
|**grade** | Classification by construction quality which refers to the types of materials used and the quality of workmanship. Buildings of better quality (higher grade) cost more to build per unit of measure and command higher value. |
|**sqft_above** | Square feet above ground |
|**sqft_basement** | Square feet below ground |
|**yr_built** | Year the house was built |
|**yr_renovated** | Year the house was renovated, 0 if never renovated |
|**zipcode** | 5 Digit zip code |
|**lat** | Latitude Coordinate |
|**long** | Longitude Coordinate |
|**sqft_living15** | Average size of interior housing living space for the closest 15 houses, in square feet |
|**sqft_lot15** | Average size of land lots for the closest 15 houses, in square feet |

## III. Libraries Required

For this project, the following Python libraries have been utilized:

- `pandas`: For efficient data handling and manipulation.
- `numpy`: For numerical computations and handling array-type data structures.
- `matplotlib` and `seaborn`: For data visualization, crucial in understanding data distributions and patterns.
- `scikit-learn`: For implementing and evaluating various regression models. This library provides a comprehensive suite of tools for machine learning tasks.

## IV. DATA PREPROCESSING AND CLEANING

Perform the necessary steps before using the data. In this section of the notebook, please take note of the following:

- If needed, perform preprocessing techniques to transform the data to the appropriate representation. This may include binning, log transformations, conversion to one-hot encoding, normalization, standardization, interpolation, truncation, and feature engineering, among others. There should be a correct and proper justification of the use of each preprocessing technique used in the project.
- Make sure that the data is clean, especially features that are used in the project. This may include checking for misrepresentations, checking the data type, dealing with missing data, dealing with duplicate data, and dealing with outliers, among others. There should be a correct and proper justification of the application (or non-application) of each data cleaning method used in the project. Clean only the variables utilized in the study.

In [1]:
import pandas as pd
house = pd.read_csv('house_prices.csv')
house.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


## Checking for Missing Values
Missing values in a dataset can greatly influence the outcome of analyses and predictive modeling. They can lead to biased estimates, reduced statistical power, and erroneous conclusions. It's crucial to identify and appropriately handle these missing values to ensure the integrity and reliability of our analysis.

In [2]:
# Checking for missing values in each column
missing_values = house.isnull().sum()

# Display the columns with missing values and their count
missing_values = missing_values[missing_values > 0]
print("Columns with missing values:\n", missing_values)

Columns with missing values:
 Series([], dtype: int64)


## Handling Duplicate Entries Across All Features

Duplicate entries, when considering all features of a dataset, may indicate exact replicas of data points. These exact duplicates could be a result of data entry errors or data collection issues and can lead to skewed analyses and biased model training. Unlike duplicates based on a single identifier like 'id', which might be legitimate in certain contexts (such as a house being sold multiple times), complete feature-wise duplicates typically do not add value and can distort the true distribution of the data.

In [5]:
duplicate_ids = house.duplicated(keep=False)
duplicates = house[duplicate_ids]

# Display the duplicate rows
print("Duplicate Entries:\n", duplicates)

# Removing duplicates if found
if not duplicates.empty:
    # Keep the first occurrence of each exact duplicate in the `house` DataFrame
    house = house.drop_duplicates(keep='first')
    print(f"Duplicates removed. New dataset size: {house.shape[0]} rows.")
else:
    print("No complete duplicates found across all features.")

Duplicate Entries:
 Empty DataFrame
Columns: [id, date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zipcode, lat, long, sqft_living15, sqft_lot15]
Index: []

[0 rows x 21 columns]
No complete duplicates found across all features.
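
The difference between full-row duplicates and repeated `id` values, described at the start of this section, can be sketched on a small made-up frame. The rows and values below are hypothetical and are not taken from `house_prices.csv`; they only illustrate how the two checks behave.

```python
import pandas as pd

# Hypothetical toy rows, not taken from house_prices.csv
toy = pd.DataFrame({
    "id":    [101, 101, 102, 102],
    "date":  ["20140601", "20150301", "20140715", "20140715"],
    "price": [300000.0, 340000.0, 450000.0, 450000.0],
})

# Rows identical across every column (likely data-entry errors)
full_dupes = toy[toy.duplicated(keep=False)]

# Rows that share an id with another row (could be legitimate resales)
repeated_ids = toy[toy.duplicated(subset="id", keep=False)]

# Drop only the exact duplicates; the resale of house 101 is kept
deduped = toy.drop_duplicates(keep="first")

print(len(full_dupes), len(repeated_ids), len(deduped))
```

In this toy frame, house 101 appears twice with different dates and prices, so it is flagged by the `id` check but survives `drop_duplicates`.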


## Ensuring Data Type Consistency

Proper data type alignment is critical for precise and effective data analysis and modeling. Incorrect data types can lead to erroneous calculations and modeling outcomes. It's essential to ensure that each feature in the dataset is represented by the most appropriate data type for its nature and the intended analytical use.

We assess each feature individually to determine whether its data type aligns with its real-world representation and analysis requirements. For instance, the 'bathrooms' feature represents a count that can include half-bathrooms (e.g., a toilet and sink, but no shower), hence it is appropriately kept as a floating-point value.

In [6]:
print("Current Data Types:\n", house.dtypes)

# Converting 'date' from object to datetime
house['date'] = pd.to_datetime(house['date'])

# Converting 'zipcode' to a categorical variable
house['zipcode'] = house['zipcode'].astype('category')

print("\nUpdated Data Types:\n", house.dtypes)

Current Data Types:
 id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object

Updated Data Types:
 id                        int64
date             datetime64[ns]
price                   float64
bedrooms                  int64
bathrooms               float64
sqft_living               int64
sqft_lot                  int64
floors                  float64
waterfront                int64
view                      int64
condition                 int64
grade                     int64
sqft_above                int64
sq

To improve the performance of our machine learning models, it is important to remove features that may add noise or that do not seem relevant to the goal. The id and the date are dropped because this information does not appear to be related to the price of the house. Additionally, locational data such as zipcode, lat, and long is hard to relate to the price of the house: the lat and long values are almost all the same, while the zipcode only identifies which part of King County the house is in and does not describe the features of the house itself.

In [None]:
house = house.drop('id' , axis = 1)
house = house.drop('date' , axis = 1)
house = house.drop('zipcode', axis=1)
house = house.drop('lat', axis=1)
house = house.drop('long', axis=1)

The columns sqft_living15 and sqft_lot15 describe the neighboring houses, specifically the living space and lot sizes of the nearest 15 houses. Because they describe the surroundings rather than the house itself, dropping these columns may help the model predict the price of a house from its own features instead of those of its neighbors.

In [None]:
house = house.drop('sqft_living15', axis=1)
house = house.drop('sqft_lot15', axis=1)

In [None]:
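# After the drops above, 'price' is the first remaining column,
# so every column after it is a feature
X = house.iloc[:, 1:]
y = house['price']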

Standardization is an important preprocessing technique that can improve the performance of an ML model. It rescales each feature to zero mean and unit variance, which is what `StandardScaler` does. It is particularly important for this dataset because the features are on very different scales. For example, the values that represent the square footage of the houses are in the thousands, while the values for other features, such as the number of bedrooms, are in the ones or tens.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
scaled_X = scaler.transform(X)
print(scaled_X)
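
As a minimal sketch of what the scaler computes, the example below uses synthetic data. The array `X_demo` and its values are hypothetical stand-ins, not the notebook's `X`. It also fits the scaler on the training split only, which is a common alternative to fitting on the full feature matrix and keeps test-set statistics out of preprocessing. The notebook itself fits on all of `X` before splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features on very different scales
# (hypothetical values, e.g. square footage vs. bedroom count)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(2000, 900, size=200),   # "sqft"-like column
    rng.integers(1, 6, size=200),      # "bedrooms"-like column
]).astype(float)

X_tr, X_te = train_test_split(X_demo, random_state=0)

# Fit on the training split only, then reuse those statistics on the test split
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

# StandardScaler computes z = (x - mean) / std using the population std (ddof=0)
manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(manual, X_tr_scaled))
```

After scaling, both columns of the training split have mean 0 and standard deviation 1, so neither dominates distance- or gradient-based models purely because of its units.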

## V. EXPLORATORY DATA ANALYSIS

Import the plotting libraries and preview the House Prices dataset.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

In [None]:
house.head()

### What is the Mean Price of the Houses?

In [None]:
mean_price = y.mean()
print (mean_price)

### Find What Feature Affects the Price of the House the Most through correlation

In [None]:
correlation_matrix = house.corr()
corr_price = correlation_matrix['price']

plt.figure(figsize=(12, 8)) 
plt.bar(corr_price.index, corr_price.values)
plt.xlabel('Features')
plt.ylabel('Correlation Value')
plt.title('Correlation of Features and Price')
plt.xticks(rotation=45, ha='right')
plt.show()

According to the correlation chart, the size of the living space (`sqft_living`) has the strongest correlation with price, followed by the grade, or build quality, of the house.
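
A ranked list can be easier to read than a bar chart when comparing correlations. The sketch below uses a small hypothetical frame (`toy`, with made-up values) in place of `house`, and orders features by the absolute value of their correlation with `price` so that strong negative correlations would not be overlooked.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use `house` instead
toy = pd.DataFrame({
    "price":       [200000, 350000, 500000, 650000, 800000],
    "sqft_living": [900, 1500, 2100, 2600, 3400],
    "grade":       [6, 7, 7, 8, 10],
    "condition":   [4, 3, 5, 3, 4],
})

# Correlation of every feature with price, excluding price itself
corr_with_price = toy.corr()["price"].drop("price")

# Order by absolute strength, strongest first
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked)
```

Note that correlation only measures linear association; it does not by itself show that a feature causes a change in price.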

### What Is the Relationship Between Living Space and the Price of the House?

In [None]:
Price_per_livingspace = house.groupby('price')['sqft_living'].mean()

plt.figure(figsize=(12, 8)) 
plt.plot(Price_per_livingspace.index, Price_per_livingspace.values)
plt.xlabel('price')
plt.ylabel('sqft_living')
plt.title('Average living size per price')
plt.show()

The graph above shows an upward trend between living size and the price of the houses. Although the line is unstable at the lower end, both the minimum and maximum prices increase as the living size of the houses increases.

### Do Homes That Have a Lot of Living Space Usually Have Good Build Quality?

In [None]:
mean_size_per_Grade = house.groupby('grade')['sqft_living'].mean()

plt.figure(figsize=(12, 8)) 
plt.plot(mean_size_per_Grade.index, mean_size_per_Grade.values)
plt.xlabel('grade')
plt.ylabel('sqft_living')
plt.title('Average Size per Grade')
plt.show()

According to the graph, the relationship between grade and living space is positive: the mean living space increases as the grade increases.


## VI. MODEL TRAINING

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split (scaled_X, y, random_state = 0)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

## Ordinary Least Squares Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit (X_train, y_train)

In [None]:
y_pred = model.predict (X_train)

In [None]:
def compute_RMSE(y_true, y_pred):
    rmse = (y_pred - y_true) ** 2
    rmse = np.mean(rmse)
    rmse = np.sqrt(rmse)
    return rmse
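
As a quick sanity check, the RMSE helper can be run on values small enough to verify by hand. The arrays below are made up for illustration, and the helper is restated so the example runs on its own.

```python
import numpy as np

def compute_RMSE(y_true, y_pred):
    # Same computation as the notebook's helper: sqrt(mean((y_pred - y_true)^2))
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

# Hand-checkable example: the errors are 0, 0 and 3, so MSE = 9 / 3 = 3
y_true_demo = np.array([100.0, 200.0, 300.0])
y_pred_demo = np.array([100.0, 200.0, 303.0])
rmse_demo = compute_RMSE(y_true_demo, y_pred_demo)
print(rmse_demo)  # sqrt(3), roughly 1.732
```

Because RMSE is expressed in the same units as the target, the values reported for the house models can be read directly as a typical error in the sale price.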

In [None]:
print (compute_RMSE(y_train, y_pred))

In [None]:
y_pred = model.predict (X_test)
print (compute_RMSE(y_test, y_pred))

## Regression with Stochastic Gradient Descent 

The model is trained with stochastic gradient descent using the default `SGDRegressor` parameters:
* loss = "squared_error"
* penalty = "l2" (ridge)
* alpha = 0.0001
* max_iter = 1000
* learning_rate = "invscaling"
* eta0 (initial learning rate) = 0.01
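
To give a feel for the `invscaling` schedule, the sketch below evaluates the formula given in the scikit-learn documentation, `eta = eta0 / pow(t, power_t)`. The helper `invscaling_eta` is a hypothetical illustration rather than part of scikit-learn, and `power_t = 0.25` is assumed to be the `SGDRegressor` default.

```python
# Sketch of the 'invscaling' schedule: eta_t = eta0 / t ** power_t
# power_t = 0.25 is assumed here as the SGDRegressor default
def invscaling_eta(t, eta0=0.01, power_t=0.25):
    return eta0 / (t ** power_t)

# The step size shrinks slowly as the update counter t grows
for t in (1, 16, 10_000):
    print(t, invscaling_eta(t))
```

The learning rate starts at `eta0` and decays gradually, halving by the 16th update and falling to a tenth of its initial value by the 10,000th.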

In [None]:
from sklearn.linear_model import SGDRegressor
model = SGDRegressor()
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict (X_train)

In [None]:
print (compute_RMSE(y_train, y_pred))

In [None]:
y_pred = model.predict (X_test)
print (compute_RMSE(y_test, y_pred))

## Regression with Support Vector Machines

In [None]:
from sklearn.svm import SVR
model = SVR()
print (model.get_params())
model.fit (X_train, y_train)
y_pred = model.predict (X_train)
print (compute_RMSE(y_train, y_pred))

In [None]:
y_pred = model.predict (X_test)
print (compute_RMSE(y_test, y_pred))
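
One thing worth checking with a default `SVR` is the scale of the target. With `C = 1.0`, the fitted function can only move a limited distance from its intercept, which may be far too little when prices are in the hundreds of thousands. The sketch below, on synthetic data with a price-like target, compares a plain `SVR` with one wrapped in `TransformedTargetRegressor` so that the target is standardized during fitting. The data and variable names are hypothetical, and this is an optional variation rather than the approach used in this notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in: standardized features, target on a price-like scale
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 500_000 + 100_000 * X_demo[:, 0] + rng.normal(0, 10_000, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Default SVR trained directly on the raw target
plain = SVR().fit(X_tr, y_tr)

# Same SVR, but the target is standardized for fitting and
# mapped back to the original units at prediction time
scaled_target = TransformedTargetRegressor(
    regressor=SVR(), transformer=StandardScaler()
).fit(X_tr, y_tr)

rmse_plain = rmse(y_te, plain.predict(X_te))
rmse_scaled = rmse(y_te, scaled_target.predict(X_te))
print(rmse_plain, rmse_scaled)
```

On this synthetic data the target-scaled model has the lower test RMSE; whether the same holds for the house data would need to be checked by running it.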

## VII. HYPERPARAMETER TUNING

To find the best model for predicting house prices from the features of the house, the hyperparameters of each model are tuned.

### Hyperparameter Tuning by RandomizedSearchCV

#### Ordinary Least Squares Linear Regression

In [None]:
from sklearn.model_selection import RandomizedSearchCV
model = LinearRegression()
print (model.get_params())
model.get_params().keys()

In [None]:
hyperparameters = [
    {
        "copy_X" : [ True, False ],
        "fit_intercept" : [True, False],
        "positive" : [True, False]
    }
]

In [None]:
rsc_house = RandomizedSearchCV (estimator = model , param_distributions = hyperparameters , n_iter = 8, cv = 5, random_state = 42)

In [None]:
rsc_house.fit (X_train, y_train)

In [None]:
best_params = rsc_house.best_params_
print(best_params)

In [None]:
best_estimator = rsc_house.best_estimator_
y_pred = best_estimator.predict (X_test)
print (compute_RMSE(y_test, y_pred))
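
By default, `RandomizedSearchCV` ranks candidates with the estimator's own `score` method, which is R² for regressors. Since this notebook reports RMSE, one option is to pass `scoring="neg_root_mean_squared_error"` so that selection and reporting use the same metric. The sketch below shows this on synthetic data; `X_demo`, `y_demo`, and the reduced parameter grid are illustrative assumptions, not the notebook's actual setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for X_train / y_train
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

param_grid = {"fit_intercept": [True, False], "positive": [True, False]}

search = RandomizedSearchCV(
    estimator=LinearRegression(),
    param_distributions=param_grid,
    n_iter=4,                               # the grid has only 2 * 2 = 4 combinations
    cv=5,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    random_state=42,
)
search.fit(X_demo, y_demo)

best_cv_rmse = -search.best_score_          # flip the sign back to a plain RMSE
print(search.best_params_, best_cv_rmse)
```

The score is negated because scikit-learn always maximizes the scoring function, so flipping the sign of `best_score_` recovers the cross-validated RMSE.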

#### Regression with Stochastic Gradient Descent

In [None]:
model = SGDRegressor()
print (model.get_params())
model.get_params().keys()

In [None]:
hyperparameters = [
    {
        "alpha" : [0.0001 , 0.001, 0.01, 0.1 , 1.0],
        'l1_ratio': [0.0, 0.1, 0.5, 0.9, 1.0],  # only used when penalty='elasticnet'
        'learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'],
        'eta0': [0.01, 0.1, 0.5]
    }
]

In [None]:
rsc_house = RandomizedSearchCV (estimator = model , param_distributions = hyperparameters , n_iter = 50, cv = 5, random_state = 42)

In [None]:
rsc_house.fit(X_train, y_train)

In [None]:
rsc_house.best_params_

In [None]:
best_estimator = rsc_house.best_estimator_
y_pred = best_estimator.predict (X_test)
print (compute_RMSE(y_test, y_pred))

#### Support Vector Regression

In [None]:
model = SVR()
print (model.get_params())
model.get_params().keys()

In [None]:
hyperparameters = [
    {
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'C': [0.1, 1, 10],
        'gamma': ["scale", "auto", 0.01, 0.1, 1],
        'epsilon': [0.1, 0.2, 0.5]
    }
]

In [None]:
rsc_house = RandomizedSearchCV (estimator = model , param_distributions = hyperparameters , n_iter = 50, n_jobs = -2, cv = 5, random_state = 42)
# To make processing faster, adjust n_jobs, however it will still take a while.
# -1 (all cpu cores) , -2(all cores but 1) , -3(all cores but two)
rsc_house.fit(X_train, y_train)

In [None]:
rsc_house.best_params_

In [None]:
best_estimator = rsc_house.best_estimator_
y_pred = best_estimator.predict (X_test)
print (compute_RMSE(y_test, y_pred))