# Capstone - Predictions

## 1. Problem and sourcing
House price predictions help investors decide which houses to buy. Many features influence a house's price. Using historical data on house sales across different regions of the U.S., we want to predict house prices from those features.

Which features significantly affect housing prices?

Goal: build a model within one month that captures the relationship between house prices and these features.

In [1]:
# Import relevant libraries and packages.
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns # For all our visualization needs.
import statsmodels.api as sm # Statistical models (e.g. OLS) with detailed summary output.
from statsmodels.graphics.api import abline_plot # Draws a line from an intercept and slope, e.g. a fitted regression line.
from sklearn.metrics import mean_squared_error, r2_score # Regression evaluation metrics.
from sklearn.model_selection import train_test_split # Splits data into random train and test subsets.
from sklearn import linear_model, preprocessing # Linear models and feature scaling/encoding utilities.
import warnings # For controlling warning messages.

Content
The dataset is a single CSV file, realtor-data.csv (1.4 million+ entries), with 10 columns:
    status (housing status: a. ready for sale or b. ready to build)
    bed (number of beds)
    bath (number of bathrooms)
    acre_lot (property/land size in acres)
    city (city name)
    state (state name)
    zip_code (postal code of the area)
    house_size (house area/living space in square feet)
    prev_sold_date (previously sold date)
    price (housing price: either the current listing price or, if the house sold recently, the sale price)
NB: acre_lot is the total land area, and house_size is the living space/building area.

In [62]:
df = pd.read_csv('updated_file.csv',low_memory=False)

In [63]:
df.status.unique()

array(['for_sale', 'ready_to_build'], dtype=object)

In [64]:
# Convert the status column into dummy (indicator) variables
dummy=pd.get_dummies(df['status'])
dummy.tail()

Unnamed: 0,for_sale,ready_to_build
1110638,1,0
1110639,1,0
1110640,1,0
1110641,1,0
1110642,1,0


In [65]:
dummy.head()

Unnamed: 0,for_sale,ready_to_build
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0
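With only two status values, the two indicator columns carry the same information (one is always 1 minus the other). A minimal sketch of keeping a single indicator, assuming a toy frame in place of the real data; `toy` and `status_dummy` are illustrative names, not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the real `status` column (values taken from the output above).
toy = pd.DataFrame({"status": ["for_sale", "ready_to_build", "for_sale"]})

# drop_first=True keeps k-1 indicator columns for k categories,
# so only `ready_to_build` remains here.
status_dummy = pd.get_dummies(toy["status"], drop_first=True, dtype=int)
print(status_dummy)
```

Dropping one level matters mostly for linear models, where both columns together are perfectly collinear; tree-based models are less sensitive to it.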


In [66]:
# Filter rows where status is 'ready_to_build'
ready_to_build_df = df[df['status'] == 'ready_to_build']

# Display the filtered DataFrame
print(ready_to_build_df)

# If you only want to count the number of such rows
ready_to_build_count = len(ready_to_build_df)
print(f"Number of 'ready_to_build' entries: {ready_to_build_count}")

                status  bed      bath   acre_lot                city  \
62055   ready_to_build  2.0  2.487426  32.150131          Boxborough   
62057   ready_to_build  2.0  2.487426  32.150131          Boxborough   
62058   ready_to_build  2.0  2.487426  32.150131          Boxborough   
62059   ready_to_build  2.0  2.487426  32.150131          Boxborough   
62060   ready_to_build  2.0  2.487426  32.150131          Boxborough   
...                ...  ...       ...        ...                 ...   
993519  ready_to_build  3.0  2.487426  32.150131  Upper Saddle River   
993520  ready_to_build  3.0  2.487426  32.150131      Franklin Lakes   
993522  ready_to_build  3.0  2.487426  32.150131  Upper Saddle River   
993523  ready_to_build  3.0  2.487426  32.150131      Franklin Lakes   
993524  ready_to_build  3.0  2.487426  32.150131      Franklin Lakes   

                state zip_code  house_size      price  years_since_sold  \
62055   Massachusetts   1719.0      2516.0   852995.0       
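If only the counts are needed, the full filtered frame does not have to be printed. A small sketch on toy data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Toy stand-in for the real `status` column.
toy = pd.DataFrame(
    {"status": ["for_sale", "ready_to_build", "for_sale", "for_sale"]}
)

# Count every status value at once.
counts = toy["status"].value_counts()
print(counts)

# Or count a single value with a boolean mask.
ready_to_build_count = int((toy["status"] == "ready_to_build").sum())
print(f"Number of 'ready_to_build' entries: {ready_to_build_count}")
```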

In [67]:
df=pd.concat([df,dummy],axis=1)
df.head()

Unnamed: 0,status,bed,bath,acre_lot,city,state,zip_code,house_size,price,years_since_sold,average_acre_lot,average_house_size,bath/bed,acre_lot/house_size,for_sale,ready_to_build
0,for_sale,3.0,2.0,0.12,Adjuntas,Puerto Rico,601.0,920.0,105000.0,1,0.04,306.666667,0.666667,0.00013,1,0
1,for_sale,4.0,2.0,0.08,Adjuntas,Puerto Rico,601.0,1527.0,80000.0,1,0.02,381.75,0.5,5.2e-05,1,0
2,for_sale,2.0,1.0,0.15,Juana Diaz,Puerto Rico,795.0,748.0,67000.0,1,0.075,374.0,0.5,0.000201,1,0
3,for_sale,4.0,2.0,0.1,Ponce,Puerto Rico,731.0,1800.0,145000.0,1,0.025,450.0,0.5,5.6e-05,1,0
4,for_sale,6.0,2.0,0.05,Mayaguez,Puerto Rico,680.0,2178.642539,65000.0,1,0.008333,363.10709,0.333333,2.3e-05,1,0


In [68]:
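# Note: this joins the same dummy columns a second time, so they now appear twice with _x/_y suffixes.
df=df.merge(dummy,left_index=True,right_index=True)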
df.head()

Unnamed: 0,status,bed,bath,acre_lot,city,state,zip_code,house_size,price,years_since_sold,average_acre_lot,average_house_size,bath/bed,acre_lot/house_size,for_sale_x,ready_to_build_x,for_sale_y,ready_to_build_y
0,for_sale,3.0,2.0,0.12,Adjuntas,Puerto Rico,601.0,920.0,105000.0,1,0.04,306.666667,0.666667,0.00013,1,0,1,0
1,for_sale,4.0,2.0,0.08,Adjuntas,Puerto Rico,601.0,1527.0,80000.0,1,0.02,381.75,0.5,5.2e-05,1,0,1,0
2,for_sale,2.0,1.0,0.15,Juana Diaz,Puerto Rico,795.0,748.0,67000.0,1,0.075,374.0,0.5,0.000201,1,0,1,0
3,for_sale,4.0,2.0,0.1,Ponce,Puerto Rico,731.0,1800.0,145000.0,1,0.025,450.0,0.5,5.6e-05,1,0,1,0
4,for_sale,6.0,2.0,0.05,Mayaguez,Puerto Rico,680.0,2178.642539,65000.0,1,0.008333,363.10709,0.333333,2.3e-05,1,0,1,0
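The `_x`/`_y` columns above come from attaching the same dummy frame twice: once with `pd.concat` and again with `merge`. The sketch below reproduces the effect on a two-row toy frame (the `toy`, `once`, and `twice` names are illustrative):

```python
import pandas as pd

# Toy stand-in with two rows, one per status value.
toy = pd.DataFrame(
    {"status": ["for_sale", "ready_to_build"], "price": [105000.0, 852995.0]}
)
dummy = pd.get_dummies(toy["status"], dtype=int)

# Attaching the dummies once gives one copy of each indicator column.
once = pd.concat([toy, dummy], axis=1)
print(list(once.columns))

# Merging the same dummies again creates overlapping column names,
# which pandas disambiguates with the default _x / _y suffixes.
twice = once.merge(dummy, left_index=True, right_index=True)
print(list(twice.columns))
```

Either join alone is enough; running both only adds redundant columns.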


In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1110643 entries, 0 to 1110642
Data columns (total 18 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   status               1110643 non-null  object 
 1   bed                  1110643 non-null  float64
 2   bath                 1110643 non-null  float64
 3   acre_lot             1110643 non-null  float64
 4   city                 1110643 non-null  object 
 5   state                1110643 non-null  object 
 6   zip_code             1110643 non-null  object 
 7   house_size           1110643 non-null  float64
 8   price                1110643 non-null  float64
 9   years_since_sold     1110643 non-null  int64  
 10  average_acre_lot     1110643 non-null  float64
 11  average_house_size   1110643 non-null  float64
 12  bath/bed             1110643 non-null  float64
 13  acre_lot/house_size  1110643 non-null  float64
 14  for_sale_x           1110643 non-null  uint8  
 15

In [70]:
from sklearn.preprocessing import StandardScaler

# List the numeric columns to standardize
numeric_columns = ['bed', 'bath', 'acre_lot', 'house_size', 'price','years_since_sold','average_acre_lot',
                   'average_house_size','bath/bed','acre_lot/house_size']

# Select only the numeric columns for standardization
numeric_df = df[numeric_columns]

# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform the numeric data
scaled_features = scaler.fit_transform(numeric_df)

# Create a DataFrame with the scaled features
scaled_df = pd.DataFrame(scaled_features, columns=numeric_columns)

# Display the first few rows of the scaled DataFrame
print(scaled_df.head())

        bed      bath  acre_lot  house_size     price  years_since_sold  \
0 -0.040498 -0.134003 -0.686209   -1.427912 -1.097422         -0.736721   
1  0.836912 -0.134003 -0.688933   -0.450514 -1.181282         -0.736721   
2 -0.917907 -1.357138 -0.684166   -1.704869 -1.224889         -0.736721   
3  0.836912 -0.134003 -0.687571   -0.010926 -0.963246         -0.736721   
4  2.591731 -0.134003 -0.690975    0.598769 -1.231598         -0.736721   

   average_acre_lot  average_house_size  bath/bed  acre_lot/house_size  
0         -0.578197           -1.010219 -0.268967            -0.555330  
1         -0.580284           -0.794734 -0.844653            -0.560792  
2         -0.574544           -0.816976 -0.844653            -0.550425  
3         -0.579762           -0.598861 -0.844653            -0.560570  
4         -0.581501           -0.848238 -1.420339            -0.562852  
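The scaler above is fit on the full dataset before any train/test split, so test-set statistics influence the scaling. One common alternative, sketched here on random toy data, is to split first and fit the scaler on the training rows only. The toy columns and the `train`/`test` names are assumptions for illustration, not the notebook's actual pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy data standing in for two of the numeric columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "bed": rng.integers(1, 6, size=50).astype(float),
        "house_size": rng.normal(2000, 500, size=50),
    }
)

# Split first, so the test rows stay unseen during fitting.
train, test = train_test_split(toy, test_size=0.2, random_state=42)

# Fit on the training rows only, then apply the same transform to both parts.
scaler = StandardScaler().fit(train)
train_scaled = pd.DataFrame(
    scaler.transform(train), columns=toy.columns, index=train.index
)
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=toy.columns, index=test.index
)

print(train_scaled.mean().round(6))  # approximately 0 for each column
print(test_scaled.mean().round(3))   # generally not exactly 0
```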


In [71]:
from sklearn.preprocessing import LabelEncoder

# Cast the categorical columns to string to avoid issues with LabelEncoder
df['zip_code'] = df['zip_code'].astype(str)
df['city'] = df['city'].astype(str)
df['state'] = df['state'].astype(str)
# df['prev_sold_date'] = df['prev_sold_date'].astype(str)
df['status'] = df['status'].astype(str)

# Label encode the zip_code column
le = LabelEncoder()
df['zip_code_encoded'] = le.fit_transform(df['zip_code'])
zip_code_labelencode_df = df['zip_code_encoded']

# Label encode the city column
df['city_encoded'] = le.fit_transform(df['city'])
city_labelencode_df = df['city_encoded']

# Label encode the state column
df['state_encoded'] = le.fit_transform(df['state'])
state_labelencode_df = df['state_encoded']

# Label encode the prev_sold_date column
# df['prev_sold_date_encoded'] = le.fit_transform(df['prev_sold_date'])
# prev_sold_date_labelencode_df = df['prev_sold_date_encoded']

# Label encode the status column
df['status_encoded'] = le.fit_transform(df['status'])
status_labelencode_df = df['status_encoded']
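Reusing a single `le` object works for producing the encoded columns, but each `fit_transform` call overwrites the fitted classes, so only the last column (`status`) can be decoded afterwards. A minimal sketch of keeping one encoder per column, on toy data; the `encoders` dictionary is a hypothetical helper, not part of the notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for two of the categorical columns.
toy = pd.DataFrame(
    {
        "city": ["Ponce", "Adjuntas", "Ponce"],
        "state": ["Puerto Rico", "Massachusetts", "Puerto Rico"],
    }
)

# One fitted encoder per column, kept so the codes can be mapped back later.
encoders = {}
for col in ["city", "state"]:
    encoders[col] = LabelEncoder()
    toy[f"{col}_encoded"] = encoders[col].fit_transform(toy[col])

print(toy)

# Recover the original labels from the integer codes.
decoded_city = encoders["city"].inverse_transform(toy["city_encoded"])
print(list(decoded_city))
```

Note that label encoding assigns integers in sorted order of the labels, so the numeric codes imply an ordering that has no real meaning for cities or zip codes.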

In [72]:
# Combine the scaled numeric columns with the label-encoded zip_code, city, state, and status columns
df = pd.concat([scaled_df, zip_code_labelencode_df,city_labelencode_df,state_labelencode_df,status_labelencode_df], axis=1)

# Display the updated DataFrame
print("Updated DataFrame with Label Encoded zip_code:")
print(df)

Updated DataFrame with Label Encoded zip_code:
              bed      bath  acre_lot  house_size     price  years_since_sold  \
0       -0.040498 -0.134003 -0.686209   -1.427912 -1.097422         -0.736721   
1        0.836912 -0.134003 -0.688933   -0.450514 -1.181282         -0.736721   
2       -0.917907 -1.357138 -0.684166   -1.704869 -1.224889         -0.736721   
3        0.836912 -0.134003 -0.687571   -0.010926 -0.963246         -0.736721   
4        2.591731 -0.134003 -0.690975    0.598769 -1.231598         -0.736721   
...           ...       ...       ...         ...       ...               ...   
1110638 -0.040498 -0.134003 -0.490108   -0.384495 -0.812633          2.072122   
1110639 -0.040498 -1.357138 -0.687571   -1.295875 -1.349337         -0.034510   
1110640  0.836912 -0.134003 -0.669867    0.352982 -0.819342          1.955087   
1110641 -0.917907 -0.134003 -0.684847   -0.384495 -1.141029          1.603981   
1110642  1.714322 -0.134003 -0.684847    0.162977 -1.114529   

In [73]:
df.to_csv('final_data_labelencode.csv', index=False)

In [74]:
from sklearn.model_selection import train_test_split

# df = pd.read_csv('final_data_labelencode.csv')
df = pd.read_csv('final_data_labelencode.csv',low_memory=False)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1110643 entries, 0 to 1110642
Data columns (total 14 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   bed                  1110643 non-null  float64
 1   bath                 1110643 non-null  float64
 2   acre_lot             1110643 non-null  float64
 3   house_size           1110643 non-null  float64
 4   price                1110643 non-null  float64
 5   years_since_sold     1110643 non-null  float64
 6   average_acre_lot     1110643 non-null  float64
 7   average_house_size   1110643 non-null  float64
 8   bath/bed             1110643 non-null  float64
 9   acre_lot/house_size  1110643 non-null  float64
 10  zip_code_encoded     1110643 non-null  int64  
 11  city_encoded         1110643 non-null  int64  
 12  state_encoded        1110643 non-null  int64  
 13  status_encoded       1110643 non-null  int64  
dtypes: float64(10), int64(4)
memory usage: 118.6 MB


In [75]:
# Use 'price' as the target variable
X = df.drop('price', axis=1)  # Features
y = df['price']  # Target variable


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting splits
print("\nShapes of the resulting splits:")
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}")
print(f"y_train: {y_train.shape}, y_test: {y_test.shape}")


Shapes of the resulting splits:
X_train: (888514, 13), X_test: (222129, 13)
y_train: (888514,), y_test: (222129,)
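The split sizes above can be checked by hand. This sketch assumes the test count is rounded up and the training set takes the remaining rows, which is consistent with the shapes printed above:

```python
import math

n_samples = 1110643  # row count reported by df.info()
test_size = 0.2

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)  # 888514 222129
```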


## Use Random Forest Model

In [77]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

In [78]:
# Use 'price' as the target variable (this repeats the split from the previous cell; the same random_state gives the same split)
X = df.drop('price', axis=1)  # Features
y = df['price']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the Random Forest model
model = RandomForestRegressor(random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\nModel Performance:")
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")


Model Performance:
Mean Squared Error: 0.014193229245588218
R^2 Score: 0.9857668849909297
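Because `price` was standardized together with the other numeric columns, the reported MSE is in squared standard-deviation units, not dollars: an MSE of about 0.0142 corresponds to an RMSE of roughly 0.119 standard deviations of price. R^2 is unaffected by this rescaling. The sketch below shows one way to express the error in price units, assuming a separate scaler fitted on the price column alone. `price_scaler` and the toy predictions are hypothetical; the notebook itself uses a single scaler for all numeric columns.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Toy prices taken from the first rows shown earlier.
prices = np.array([[105000.0], [80000.0], [67000.0], [145000.0], [65000.0]])

# Hypothetical scaler fitted on the price column only.
price_scaler = StandardScaler().fit(prices)
y_true_scaled = price_scaler.transform(prices).ravel()

# Hypothetical predictions: each off by 0.1 standard deviations.
y_pred_scaled = y_true_scaled + 0.1

# RMSE in standardized units, then rescaled to price units.
rmse_scaled = np.sqrt(mean_squared_error(y_true_scaled, y_pred_scaled))
rmse_dollars = rmse_scaled * price_scaler.scale_[0]

# Equivalent check: invert the scaling first, then compute the RMSE.
y_pred_dollars = price_scaler.inverse_transform(
    y_pred_scaled.reshape(-1, 1)
).ravel()
rmse_check = np.sqrt(mean_squared_error(prices.ravel(), y_pred_dollars))

print(rmse_scaled, rmse_dollars, rmse_check)
```

With the notebook's single `scaler`, the same factor would be the entry of `scaler.scale_` at the position of `price` in `numeric_columns`.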
