# COGS 118A - Project Checkpoint

# Names

- Sean Perry
- Alberto Valencia
- Kevin Hu
- Justin Huang

# Abstract 

We want to predict house prices in metropolitan areas of India so that potential homebuyers can estimate how much they will have to spend given their list of home requirements. The data describes aspects of a house (location, size, number of bedrooms, etc.) that a buyer may weigh when deciding on a home. The dataset can be found at https://www.kaggle.com/datasets/ruchi798/housing-prices-in-metropolitan-areas-of-india. Our plan is to use regression to predict the house price from the features in the dataset and to evaluate our models with $R^2$ and MSE.

# Background

Among other metrics, house prices are often considered when evaluating the economic performance of a country. At both the macro and micro level, housing prices can have significant consequences, ranging from an individual deciding to purchase their first home to the evaluation of a country's Gross Domestic Product<a name="Congressional"></a>[<sup>[1]</sup>](#Congressionalnote). Given this importance, home price prediction has received a fair amount of study.
Researchers have attempted to predict house prices since at least the 1990s. The article Predicting House Prices Using Multiple Listings Data argues that accurate house price prediction matters for administering mortgages and homeowner insurance<a name="Dubin"></a>[<sup>[2]</sup>](#Dubinnote). To do so, the author discusses three techniques: ordinary least squares (OLS), maximum likelihood, and kriging. Noting that geographical location is difficult to take into account with ordinary linear regression, the author describes how each technique is implemented and which is appropriate for each scenario; in particular, the author notes that maximum likelihood is useful for variables that are difficult to account for. The goal is to use the correlations between the prices of nearby homes to estimate a given home's price more accurately than OLS alone allows. The article then works through a practical example using listings from Baltimore, comparing the techniques it explores. In particular, it uses grid search over the parameters of the likelihood function to find the values that give the best prediction. While the $R^2$ score for the algorithms was around $0.7$, the author saw significant improvements (as much as 65.3%) over traditional OLS when using methods that infer relative housing price from nearby houses. A table comparing the estimation sample with the prediction samples shows that the estimates are fairly accurate. However, given the age of the article, it would be interesting to see whether such techniques still hold up today in the world of deep networks.
More recent research uses a generalized linear regression model, in conjunction with other statistical measures, as a baseline for predicting housing prices. The aim is to improve the accuracy of price prediction, which matters to both consumers and suppliers as each considers the risks involved in purchasing real estate<a name="li"></a>[<sup>[3]</sup>](#Linote). In particular, the article cites controlling real estate investment, which has historically caused long-term economic issues, as a motivating factor for accurately predicting housing prices. The article gives a detailed explanation of how the model is trained and the parameters it considers. Although the accuracy of the model is not stated, the group's use of nonparametric methods appears to have produced a more accurate model than previous ones.
With respect to the dataset we are currently using, a handful of Kaggle users have already attempted to predict house price from it, with limited success. There are 15 notebooks, a handful of which use ML with varying degrees of quality and accuracy. One of the more complete notebooks is<a name="Masghiff"></a>[<sup>[4]</sup>](#Masghiffnote). This notebook looked at 5 different models (Decision Trees, Random Forests, Gradient Boosting, Ridge CV, and ElasticNetCV) with one-hot encoding and scaling transformations. There was no cross validation, model selection, or hyperparameter tuning. The best scoring model was random forests, with an $R^2$ of 0.72 and an MSE of 0.27. To differentiate our work from previous work, we will focus much more on model improvements (and on expanding the dataset if possible).


# Problem Statement

Given the area, number of bedrooms, location, city, resale status, and amenities of a house in one of several metropolitan cities in India, we want to predict the price of that house as closely to its actual price as possible. This will be done using regression models (such as linear regression, random forests, etc.) on the Housing Prices in Metropolitan Areas of India Kaggle dataset, with performance measured by $R^2$ and MSE (and potentially other evaluation metrics for regression problems).

# Data

In [8]:
#Data Cleaning
import pandas as pd
import os
import numpy as np

#Data is saved in a different CSV for each city
#To make things easier, we combine the data into one dataframe
dfs = []
for file in os.listdir("data"):
    df = pd.read_csv(os.path.join("data", file))
    df["City"] = file.replace(".csv", "")
    dfs.append(df)
df = pd.concat(dfs)

#As documented on kaggle, 9 implies that this information was not found for a home.
#Therefore we replace all 9s with np.nan, as is standard for empty values.
#"No. of Bedrooms" is saved and restored so that a genuine count of 9 there is not erased.

bedrooms = df["No. of Bedrooms"].copy()
df = df.replace(9, np.nan)
df["No. of Bedrooms"] = bedrooms
df

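|  | Price | Area | Location | No. of Bedrooms | Resale | MaintenanceStaff | Gymnasium | SwimmingPool | LandscapedGardens | JoggingTrack | ... | BED | VaastuCompliant | Microwave | GolfCourse | TV | DiningTable | Sofa | Wardrobe | Refrigerator | City |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30000000 | 3340 | JP Nagar Phase 1 | 4 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Bangalore |
| 1 | 7888000 | 1045 | Dasarahalli on Tumkur Road | 2 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Bangalore |
| 2 | 4866000 | 1179 | Kannur on Thanisandra Main Road | 2 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Bangalore |
| 3 | 8358000 | 1675 | Doddanekundi | 3 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Bangalore |
| 4 | 6845000 | 1670 | Kengeri | 3 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Bangalore |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7714 | 14500000 | 1180 | Mira Road East | 2 | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Mumbai |
| 7715 | 14500000 | 530 | Naigaon East | 1 | 1 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Mumbai |
| 7716 | 4100000 | 700 | Shirgaon | 1 | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Mumbai |
| 7717 | 2750000 | 995 | Mira Road East | 2 | 0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Mumbai |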


In [9]:
#A quick check for missingness shows that the fraction of NaN values is identical across the amenity columns
df.isnull().sum()/df.shape[0]

Price                  0.000000
Area                   0.000000
Location               0.000000
No. of Bedrooms        0.000000
Resale                 0.000000
MaintenanceStaff       0.693808
Gymnasium              0.693808
SwimmingPool           0.693808
LandscapedGardens      0.693808
JoggingTrack           0.693808
RainWaterHarvesting    0.693808
IndoorGames            0.693808
ShoppingMall           0.693808
Intercom               0.693808
SportsFacility         0.693808
ATM                    0.693808
ClubHouse              0.693808
School                 0.693808
24X7Security           0.693808
PowerBackup            0.693808
CarParking             0.693808
StaffQuarter           0.693808
Cafeteria              0.693808
MultipurposeRoom       0.693808
Hospital               0.693808
WashingMachine         0.693808
Gasconnection          0.693808
AC                     0.693808
Wifi                   0.693808
Children'splayarea     0.693808
LiftAvailable          0.693808
BED     

In [10]:
#Since every amenity column has the same missing fraction, a row with any NaN is likely missing all of its amenity values, so we simply drop all rows containing NaN
cleaned_df = df[~df.isnull().any(axis=1)]
cleaned_df

|  | Price | Area | Location | No. of Bedrooms | Resale | MaintenanceStaff | Gymnasium | SwimmingPool | LandscapedGardens | JoggingTrack | ... | BED | VaastuCompliant | Microwave | GolfCourse | TV | DiningTable | Sofa | Wardrobe | Refrigerator | City |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30000000 | 3340 | JP Nagar Phase 1 | 4 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Bangalore |
| 1 | 7888000 | 1045 | Dasarahalli on Tumkur Road | 2 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Bangalore |
| 2 | 4866000 | 1179 | Kannur on Thanisandra Main Road | 2 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Bangalore |
| 3 | 8358000 | 1675 | Doddanekundi | 3 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Bangalore |
| 4 | 6845000 | 1670 | Kengeri | 3 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Bangalore |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1393 | 62000000 | 1450 | Worli | 3 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Mumbai |
| 1394 | 2500000 | 540 | Virar East | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Mumbai |
| 1395 | 19000000 | 1267 | Belapur | 3 | 1 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Mumbai |
| 1396 | 14900000 | 1245 | Airoli | 2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Mumbai |


# Proposed Solution

This problem is a regression problem: given some input features relating to a house, we predict a real-valued output, the price of that house. Our proposed solution is therefore to search for the model that predicts housing prices with the lowest error. To achieve this, we will try a variety of techniques and models. The steps involved in our solution are as follows:

Step 1 Data processing: We will process the housing data by handling missing values (filling them in or dropping them), scaling features, and encoding categorical variables where needed. This will make the data easier to analyze. Specifically, we will at least attempt:

- Scaling: It may be appropriate to scale, say, geographical distances to the range [0,1]. We are not yet sure how to scale, but our research suggests it may come in handy when considering geographical distances. Scaling may also be useful when measuring similarity between houses, which could help in predicting price. For other features such as acreage and house size, logarithmic scaling may be appropriate to get a better sense of magnitude.
- One-hot encoding: categorical variables need to be one-hot encoded before they can be used in a model such as linear regression.
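
The preprocessing ideas above could be sketched roughly as follows. This is only a minimal illustration on a tiny made-up frame: the toy values, the choice of `np.log1p` followed by min-max scaling, and the names `toy`, `log_then_minmax`, and `preprocess` are assumptions for the example, not decisions we have finalized.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder

# Toy stand-in for the housing data (values are made up for illustration).
toy = pd.DataFrame({
    "Area": [530, 995, 1180, 3340],
    "No. of Bedrooms": [1, 2, 2, 4],
    "City": ["Mumbai", "Mumbai", "Delhi", "Bangalore"],
})

# Log-scale the skewed size feature, then squeeze it into [0, 1].
log_then_minmax = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("minmax", MinMaxScaler()),
])

preprocess = ColumnTransformer(
    [
        ("area", log_then_minmax, ["Area"]),
        ("city", OneHotEncoder(handle_unknown="ignore"), ["City"]),
    ],
    remainder="passthrough",  # leave the bedroom counts untouched
)

X = preprocess.fit_transform(toy)
# The result may be sparse depending on the data; densify it for inspection.
X = X.toarray() if hasattr(X, "toarray") else np.asarray(X)
print(X.shape)  # one scaled column, three city indicator columns, one passthrough column
```

Wrapping the transformations in a `ColumnTransformer` means the same object can later be placed in front of any candidate model inside a `Pipeline`, so the scaling is always fit on training data only.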

Step 2 Model Selection: we will try the regression models that we covered in class and, from there, find the best one. These models may include:

- Linear regression: A linear regression model seems appropriate given that we want to predict house price, say Y, from a dataset of houses and their respective attributes, say X. Thus we can model this problem as $Y = Xw$.
- Random forests: Each decision tree in the random forest predicts the house price based on a random sample of houses and features. We then take the mean of the trees' predictions and use that as our prediction.
- Gradient boosting: When training a model there will inevitably be some prediction error, so each boosting iteration fits a new weak learner that focuses on correcting the error left by the previous iterations. Each step follows the gradient of the loss, much like gradient descent, which moves the ensemble toward the most accurate prediction even when the relationship between features and price is not a smooth continuous function.
- Neural network: We vectorize all the data and use that as our input. Each layer of the neural network transforms and condenses the input data. The last layer is a single output representing the predicted price. We can then train the model by minimizing the MSE between the output value and the real price of the house.
- Other possible alternatives: Based on background research, we have noticed that regression based ML systems may struggle with geographically important information. Thus we may want to look into methods that can handle geographic data.
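
As a rough sketch of how several of these candidates could be fit side by side in sklearn, the snippet below trains three of them on synthetic data. The synthetic features, the coefficient values, and the hyperparameters shown are assumptions chosen only to make the example self-contained; they are not results on our dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic stand-in: "price" depends linearly on two features plus a little noise.
X = rng.uniform(0, 1, size=(200, 2))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 0.05, size=200)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

train_r2 = {}
for name, model in candidates.items():
    model.fit(X, y)
    train_r2[name] = model.score(X, y)  # R^2 on the training data only
    print(f"{name}: train R^2 = {train_r2[name]:.3f}")
```

Training scores like these only show that each model can fit the data; comparing models fairly requires held-out data.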

Step 3 Cross Validation: Given the size of the dataset, we can use a train/test split (80/20) and run cross validation on the training portion for model selection.
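
One way this step might look in sklearn is sketched below. The synthetic data, the choice of 5 folds, and the use of `LinearRegression` as the model being validated are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(1)
# Synthetic regression data standing in for the housing features and prices.
X = rng.uniform(0, 1, size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=100)

# Hold out 20% as a final test set that cross validation never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5-fold cross validation on the remaining 80%.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(LinearRegression(), X_train, y_train, cv=cv, scoring="r2")
print(cv_scores.mean())
```

Keeping the test set out of the cross validation loop means the final reported score is not influenced by the model selection process.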

Step 4 Hyperparameter Tuning: Using techniques such as grid search, we can find out the best set of hyperparameters to be used for each model and optimize their performance.
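
A minimal grid search sketch is shown below, assuming a `Ridge` regressor and a small hypothetical grid over its `alpha` parameter; the data is synthetic and the grid values are placeholders rather than the ranges we will actually search.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
# Synthetic regression data; the true coefficients are arbitrary.
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(0, 0.2, size=120)

# Hypothetical grid: each alpha is evaluated with 5-fold cross validation.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

# sklearn maximizes scores, so MSE is reported negated; flip the sign to read it.
print(search.best_params_, -search.best_score_)
```

The same pattern extends to pipelines by prefixing parameter names with the step name, for example `regressor__alpha` for a step called `regressor`.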

Step 5: Model Selection: The final model will be the one with the lowest MSE or MAE, or the highest $R^2$ value. That model is then our machine learning predictor.

Much of this work can be easily implemented in sklearn using pandas to hold the data, numpy for extra transformations if needed, and pytorch for any neural network based model we are interested in trying out. 

A baseline model for this dataset, as discussed in the background section, is a Kaggle notebook that tackled our problem on this dataset<a name="Masghiff"></a>[<sup>[4]</sup>](#Masghiffnote). Since that notebook did not do any kind of cross validation for model selection, we can use their pipeline for transforming the data, together with the best model they found, as a point of comparison for our techniques. Even if we expand the dataset via web scraping, we can still use that model as a baseline.

# Evaluation Metrics


**Mean Squared Error (MSE)**  
MSE is defined as: $$MSE = \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y_i})^2$$
This metric measures the average of the squared differences between predicted and actual values. This is an appropriate metric because it measures how accurate our predictions are with respect to the actual values. Based on our MSE value, we can quantify how close or far our predicted values are from the actual values. The lower the MSE, the better the model fits the data. 
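
As a small worked example of the formula, with four made-up values (not taken from our data), the errors are 1, 0, -2, and -1, so the squared errors sum to 6 and the MSE is 6 / 4:

```python
import numpy as np

# Made-up actual and predicted values for illustration.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

# Mean of the squared differences, exactly as in the definition above.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 1.5
```

Because the errors are squared, the single error of 2 contributes four times as much as each error of 1, which is why MSE is sensitive to large misses.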
  
**R-squared ($R^2$)**    
$R^2$ is defined as: $$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y_i})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$  
This metric measures the fit of a regression model. Its value is at most 1 and typically falls between 0 and 1; it can be negative when the model predicts worse than always guessing the mean. An $R^2$ value of 1 implies a perfect fit, while a value of 0 means the model explains none of the variation in the dependent variable. This is an appropriate metric because it quantifies the proportion of variance in the dependent variable that is explained by the independent variables. Additionally, comparing $R^2$ on the training and test sets helps us judge whether our model is overfit or underfit.
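
The definition can be checked by hand on a few made-up values. The helper `r_squared` below is a hypothetical name for this sketch; it shows a perfect prediction, a prediction that always guesses the mean, and a prediction bad enough to drive the value below zero.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Compute R^2 directly from its definition."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_true, y_true)                       # 1.0
mean_only = r_squared(y_true, np.full(4, y_true.mean()))  # 0.0
reversed_pred = r_squared(y_true, y_true[::-1])           # 1 - 20/5 = -3.0

print(perfect, mean_only, reversed_pred)
```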

In [11]:
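## Given a fitted pipeline, get the score metrics for it
from sklearn.metrics import mean_squared_error

def score_suite(pipe, X_test, y_test):
    """Return (R^2, MSE) of a fitted pipeline on the test set."""
    y_pred = pipe.predict(X_test)
    return (
        pipe.score(X_test, y_test),  # R^2 for sklearn regressors
        mean_squared_error(y_test, y_pred),
    )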

# Preliminary results

## EDA

In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.figure_factory as ff
import scipy.stats as stats
import seaborn as sns

In [13]:
df.columns.values

array(['Price', 'Area', 'Location', 'No. of Bedrooms', 'Resale',
       'MaintenanceStaff', 'Gymnasium', 'SwimmingPool',
       'LandscapedGardens', 'JoggingTrack', 'RainWaterHarvesting',
       'IndoorGames', 'ShoppingMall', 'Intercom', 'SportsFacility', 'ATM',
       'ClubHouse', 'School', '24X7Security', 'PowerBackup', 'CarParking',
       'StaffQuarter', 'Cafeteria', 'MultipurposeRoom', 'Hospital',
       'WashingMachine', 'Gasconnection', 'AC', 'Wifi',
       "Children'splayarea", 'LiftAvailable', 'BED', 'VaastuCompliant',
       'Microwave', 'GolfCourse', 'TV', 'DiningTable', 'Sofa', 'Wardrobe',
       'Refrigerator', 'City'], dtype=object)

In [14]:
len(df["Location"].unique())

1776

Distributions of all numeric columns as box plots

In [None]:
for col in cleaned_df:
    if cleaned_df[col].dtype != object:
        print(col)
        sns.boxplot(data=cleaned_df[col].values)
        plt.show()

Below we illustrate the number of houses from each city

In [None]:
sns.countplot(x=cleaned_df['City'])

Using a pie chart, we can break down the percentage of homes within each city

In [None]:
fig = plt.figure(figsize=(10, 7))
#Take the labels from value_counts itself so each wedge is paired with the right city
city_counts = cleaned_df['City'].value_counts()
plt.pie(city_counts, labels=city_counts.index, autopct='%.1f%%')
plt.show()

We visualize each city's average price per home

In [None]:
ax = cleaned_df.groupby('City')['Price'].mean().plot(kind='bar')
ax.set_ylabel("average price")

We examine the correlations between price, number of bedrooms, and area

In [None]:
vars_to_consider = ['Price','No. of Bedrooms','Area']
corr = cleaned_df[vars_to_consider].corr()
sns.heatmap(corr, annot = True)

Below we try to find which amenities have the strongest correlation with price.

In [None]:
amenities = ['MaintenanceStaff', 'Gymnasium', 'SwimmingPool', 'LandscapedGardens',
       'JoggingTrack', 'RainWaterHarvesting', 'IndoorGames', 'ShoppingMall',
       'Intercom', 'SportsFacility', 'ATM', 'ClubHouse', 'School',
       '24X7Security', 'PowerBackup', 'CarParking', 'StaffQuarter',
       'Cafeteria', 'MultipurposeRoom', 'Hospital', 'WashingMachine',
       'Gasconnection', 'AC', 'Wifi', "Children'splayarea", 'LiftAvailable',
       'BED', 'Microwave', 'GolfCourse', 'TV',
       'DiningTable', 'Sofa', 'Wardrobe', 'Refrigerator']

#Correlation of each amenity with price, computed on the cleaned data only
#and sorted from the strongest positive correlation downward
corr_price_amenities = (
    cleaned_df[amenities + ['Price']]
    .corr()['Price']
    .drop('Price')
    .sort_values(ascending=False)
)
corr_price_amenities.to_frame()


In [15]:
city_list = []
hist_list = []
for city in cleaned_df['City'].unique():
    city_list.append(city)
    hist_list.append(cleaned_df.loc[cleaned_df['City'] == city, 'Price'])

fig = ff.create_distplot(
    hist_data = hist_list,
    group_labels = city_list,
    show_rug = False,
    show_hist = False,
)

fig.update_xaxes(range=[0, 50000000])

fig


In [None]:
grouped_df = cleaned_df.groupby('City')

for city, data in grouped_df:
    plt.scatter(data['Area'], data['Price']/1e6, label=city)

plt.xlabel('Area')
plt.ylabel('Price (millions)')
plt.title('Scatterplot of Area vs. Price for Each City')

plt.legend()

plt.show() 


In [None]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats


# In[3]:


train = pd.read_csv(os.path.join("data", "Bangalore.csv"))
train.columns.values


# In[4]:


train = pd.read_csv(os.path.join("data", "Chennai.csv"))
train.columns.values


# In[5]:


dfs = []
for file in os.listdir("data"):
    df = pd.read_csv(os.path.join("data", file))
    df["City"] = file.replace(".csv", "")
    dfs.append(df)
df = pd.concat(dfs)


# In[6]:


df["Location"].nunique()


# In[7]:


df


# In[8]:


df.to_csv("india_homes.csv")


# In[9]:


df = df[~(df == 9).any(axis=1)]
len(df)


# In[10]:


cities = df['City'].unique()
cities


# In[64]:


# List of cities
cities = ['Kolkata', 'Delhi', 'Chennai', 'Bangalore', 'Hyderabad', 'Mumbai']

# Create an empty list to store the price data for each city
city_prices = []

# Iterate over each city
for city in cities:
    # Filter the prices of homes for the current city
    prices = df[df['City'] == city]['Price']
    
    # Add the prices to the city_prices list
    city_prices.append(prices)

# Create the box plot
plt.figure(figsize=(10, 6))
plt.boxplot(city_prices, labels=cities)
plt.title('Price Distribution by City')
plt.xlabel('City')
plt.ylabel('Price')
plt.show()


# In[56]:


# Create subplots for histograms
fig, axs = plt.subplots(1, 4, figsize=(12, 4))

# Plot histogram for Price
axs[0].hist(df['Price'], bins=10, edgecolor='k', color='#20A5DF')
axs[0].set_title('Price')
axs[0].set_xlabel('Price')
axs[0].set_ylabel('Frequency')

# Plot histogram for Area
axs[1].hist(df['Area'], bins=10, edgecolor='k', color='#45DF20')
axs[1].set_title('Area')
axs[1].set_xlabel('Area')
axs[1].set_ylabel('Frequency')

# Plot histogram for No. of Bedrooms
axs[2].hist(df['No. of Bedrooms'], bins=10, edgecolor='k', color='#DF5A20')
axs[2].set_title('Bedrooms')
axs[2].set_xlabel('No. of Bedrooms')
axs[2].set_ylabel('Frequency')

# Compute No. of Amenities for each row
sum_values = df.loc[:, 'MaintenanceStaff':'Refrigerator'].sum(axis=1)

# Plot histogram for No. of Amenities
axs[3].hist(sum_values, bins=10, edgecolor='k', color='#BA20DF')
axs[3].set_title('Amenities')
axs[3].set_xlabel('No. of Amenities')
axs[3].set_ylabel('Frequency')

# Plot Title
fig.suptitle('Histograms of Feature Distribution ', fontsize=14, fontweight='bold')

# Adjust spacing between subplots
plt.tight_layout()

# Display the histograms
plt.show()


# In[62]:


# Compute the sum of columns from MaintenanceStaff to Refrigerator
sum_values = df.loc[:, 'MaintenanceStaff':'Refrigerator'].sum(axis=1)

# Create a new DataFrame with the sum of amenities included
correlation_df = df.assign(**{'Sum of Amenities': sum_values})

# Select the numerical columns for correlation analysis
numerical_columns = ['Price', 'Area', 'No. of Bedrooms', 'Sum of Amenities']

# Calculate the correlation matrix
correlation_matrix = correlation_df[numerical_columns].corr()

# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', square=True)

# Rotate x-axis tick labels
plt.xticks(rotation=25)

# Set the title and display the plot
plt.title('Correlation Heatmap')
plt.show()



## Baseline Pipeline

In [16]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
ct = ColumnTransformer(
    [
        ("Location_One_Hot",  OneHotEncoder(handle_unknown="ignore"), ["Location", "City"]),
    ],
    remainder="passthrough"
)

In [17]:
pipe = Pipeline([('transformers', ct), ('regressor', LinearRegression())])
pipe


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    cleaned_df.drop("Price", axis=1), 
    cleaned_df["Price"], 
    test_size=0.33, 
    random_state=42
)

In [None]:
pipe.fit(X_train, y_train)

In [None]:
score_suite(pipe, X_test, y_test)

# Ethics & Privacy

There are three primary ethics issues with our dataset: 1) lack of informed consent, 2) possible collection and/or dataset bias, and 3) issues with the future generalizability of the model.

Since the data we are planning to use is web scraped from publicly available websites, homeowners are unlikely to be aware of our project. Thus a rough description of their home may be included in our data analysis, similar to the information available to anyone with access to realtor.com or zillow.com. Fortunately, we are unable to determine who the current homeowners are from the data we have, so all other elements of a homeowner's identity are hidden. However, this also means we are unable to contact these homeowners to gain consent to use data about their house, which leads to a lack of informed consent regarding our project. The best solution for this issue is to allow people to message us online via GitHub, where this project is located, and raise an issue with the data so that we may remove their house from the analysis.

We may also have a collection and/or dataset bias if we choose to use the current Kaggle dataset. While the Kaggle dataset advertises itself as relating to the whole USA, it is limited to only Massachusetts and Puerto Rico. This means that if we wish to keep the scope of the project to the entire USA, we will be biasing our model by only considering these 2 states/territories. The best remedy we have for this is to gather more data, most likely by building our own web scraper to gather information about the houses on realtor.com (which the Kaggle dataset scraped its data from) or by getting in contact with the original creator of the dataset and seeing if a more encompassing dataset can be produced. That way the location of the house can more accurately and ethically predict the price of the house.

Finally, there is the potential for users to be harmed by this model in the future, since housing prices can change with the housing market, policy decisions, and other factors that the model will not be able to account for given the data we have. If a user in 10 years wishes to use the model to predict what their house price will be, the prediction has the potential to be very wrong if there has been a large change in the housing market of a particular area. If that user makes a decision based on that information, it will likely harm their ability to search for a good house. Thus the model will likely need a pipeline to continuously update itself as the market changes if it is put into production.


# Team Expectations 

* Hold in-person meetings to discuss ideas and progress
* Communication will be handled remotely through Discord
* Ensure minimum consensus of 3/4 members when making decisions
* Expected individual contribution of at least 25% for each checkpoint/deliverable
* Individual tasks will have earlier deadlines prior to submission

# Project Timeline Proposal


| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 5/12 | 4 PM | Determine best form of communication | Discuss each individual's interests to find a topic for the project |
| 5/17 | Before 11:59 PM | Look for interesting datasets; finalize Research Proposal | Webscraping and additional data plans |
| 5/20 | 12 PM | Finalize Web Scraper If Used | Discuss EDA plan |
| 5/25 | 12 PM | EDA Finished | Begin planning Baseline Model |
| 5/30 | 12 PM | Baseline Model Finished | Prep checkpoint notebook for submission |
| 5/31 | Before 11:59 PM | Checkpoint Due | Discuss model selection plans |
| 6/10 | 12 PM | Finalize Model Selection | Plan out remaining tasks |
| 6/14 | Before 11:59 PM | Project Due | NA |

# Footnotes
<a name="Congressionalnote"></a>1.[^](#Congressional) Congressional Research Service. (2023, January 3). Introduction to U.S. economy: Housing market - federation of American ... Introduction to U.S. Economy: Housing Market. https://sgp.fas.org/crs/misc/IF11327.pdf

<a name="Dubinnote"></a>2.[^](#Dubin)  Dubin, R. A. (1998). Predicting House Prices Using Multiple Listings Data. The Journal of Real Estate Finance and Economics, 17(1), 35–59. https://doi.org/10.1023/a:1007751112669

<a name="Linote"></a>3.[^](#li)  Li, X. (2022). Prediction and analysis of housing price based on the generalized linear regression model. Computational Intelligence and Neuroscience, 2022, 1–9. https://doi.org/10.1155/2022/3590224

<a name="Masghiffnote"></a>4.[^](#Masghiff) Masghiff. (2023, May 5). Predicting housing prices EDA + ml. Kaggle. https://www.kaggle.com/code/masghiff/predicting-housing-prices-eda-ml 