# RenoStrategix: Transforming Homes, Elevating Values.

# <img src="kings.PNG" alt="RenoStrategix: Transforming Homes, Elevating Values" length ="100" width="1100">

## Authors.
* Caroline Njeri.
* Amadi Growman.
* Lynns Waswa.
* Robert Gasembe.
* David Kirianja.
* James Nyamu.


## Business Overview.

Welcome to RenoStrategix, where the world of real estate meets the art and science of optimizing home renovations for increased property value. In this project, we address a real-world problem faced by our visionary real estate agency known as "CasaCrafters Realty Solutions."
CasaCrafters aims to revolutionize the way homeowners approach renovations by providing data-driven advice on how specific home improvements impact property values. The primary challenge is to assist homeowners in making informed decisions about renovations that yield the highest return on investment. To achieve this objective, we will use multiple linear regression modeling to analyze house sales data in the King County area. Our focus is not just on increasing value but also on enhancing the overall appeal for potential buyers.




## Business Problem

Homeowners often find themselves at a crossroads when deciding on renovations. The lack of clear guidance on which improvements will significantly enhance their property values becomes a real-world challenge. CasaCrafters seeks to bridge this gap by offering tailored advice based on a comprehensive analysis of the King County real estate market.

In the vibrant landscape of King County, CasaCrafters endeavors to create a narrative that extends beyond buying and selling homes. Imagine a homeowner named Alex, who dreams of transforming their house into a haven. However, Alex is uncertain about which renovations will not only fulfill personal desires but also enhance the property's market value.

Through the journey of Alex's home transformation, CasaCrafters navigates the complex realm of real estate, uncovering insights that go beyond the expected. The story unfolds as Alex learns about the potential return on investment for various renovations, transforming not just the home but the overall real estate experience.


# Objective 1: Identify Key Features Impacting Property Values

**Objective:** Develop a multiple linear regression model to identify and quantify the influence of various features (e.g., square footage, number of bedrooms, location) on the sale prices of houses in King County.

**Rationale:** Understanding the key features that significantly affect property values is crucial for providing targeted recommendations to homeowners. By analyzing historical sales data, the model will reveal which features have the most substantial impact on sale prices. This information will empower CasaCrafters to guide homeowners on prioritizing renovations that are likely to yield the highest returns.

---

# Objective 2: Build a Predictive Model for Property Valuation

**Objective:** Construct a robust multiple linear regression model that accurately predicts house sale prices based on selected features, allowing for personalized property valuation.

**Rationale:** The predictive model will serve as a valuable tool for CasaCrafters to estimate the potential impact of specific renovations on a property's value. By inputting proposed changes into the model, homeowners can receive personalized predictions of how these renovations might affect the sale price. This enables informed decision-making and helps homeowners focus on improvements that align with their goals while maximizing return on investment.

---

# Objective 3: Provide Renovation Recommendations for Maximum ROI

**Objective:** Utilize the developed multiple linear regression model to generate personalized recommendations for homeowners, suggesting specific renovations that are predicted to have the highest positive impact on property values.

**Rationale:** CasaCrafters aims to be a trusted advisor for homeowners seeking to enhance their properties. By leveraging the regression model's insights, the agency can offer tailored recommendations, outlining the renovations that are not only aligned with the homeowner's vision but also expected to yield the greatest return on investment. This proactive approach adds significant value to CasaCrafters' services, fostering trust and satisfaction among homeowners.

## Data Understanding.
This project uses the King County House Sales dataset, which can be found in kc_house_data.csv in the data folder in this assignment's GitHub repository. The description of the column names can be found in column_names.md in the same folder. As with most real world data sets, the column names are not perfectly described, so you'll have to do some research or use your best judgment if you have questions about what the data means.

It is up to you to decide what data from this dataset to use and how to use it. If you are feeling overwhelmed or behind, we recommend you ignore some or all of the following features:

* date
* view
* sqft_above
* sqft_basement
* yr_renovated
* zipcode
* lat
* long
* sqft_living15
* sqft_lot15

#Data Pre-Processing#

The first task to read the data file into our working environment then explore it to gain an initial
understanding of the dataset. Moreover, this step also allows us to determine what data wrangling techniques to apply in order to
transform the data into a form that can be analysed and modeled.

In [4]:
# Import the necessary libraries for the project

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as stats

# from matplotlib import style    # NB: We can determine which style to use
# style.use('ggplot')
# style.use('seaborn-v0_8-whitegrid')

%matplotlib inline

In [5]:
# Load the file by using a file path into the notebook's working memory
import os

file_path = os.path.join('data', 'kc_house_data.csv')
data = pd.read_csv(file_path)



#Exploratory data analysis

In [6]:
# Check the shape of the dataset
data.shape

(21597, 21)

In [7]:
# Inspect the properties of the dataset

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  object 
 9   view           21534 non-null  object 
 10  condition      21597 non-null  object 
 11  grade          21597 non-null  object 
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

In [8]:
# Dataset description
data.describe()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,sqft_above,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,17755.0,21597.0,21597.0,21597.0,21597.0,21597.0
mean,4580474000.0,540296.6,3.3732,2.115826,2080.32185,15099.41,1.494096,1788.596842,1970.999676,83.636778,98077.951845,47.560093,-122.213982,1986.620318,12758.283512
std,2876736000.0,367368.1,0.926299,0.768984,918.106125,41412.64,0.539683,827.759761,29.375234,399.946414,53.513072,0.138552,0.140724,685.230472,27274.44195
min,1000102.0,78000.0,1.0,0.5,370.0,520.0,1.0,370.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,651.0
25%,2123049000.0,322000.0,3.0,1.75,1430.0,5040.0,1.0,1190.0,1951.0,0.0,98033.0,47.4711,-122.328,1490.0,5100.0
50%,3904930000.0,450000.0,3.0,2.25,1910.0,7618.0,1.5,1560.0,1975.0,0.0,98065.0,47.5718,-122.231,1840.0,7620.0
75%,7308900000.0,645000.0,4.0,2.5,2550.0,10685.0,2.0,2210.0,1997.0,0.0,98118.0,47.678,-122.125,2360.0,10083.0
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,9410.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0


In [9]:
# Sample the dataset

data.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,10/13/2014,221900.0,3,1.0,1180,5650,1.0,,NONE,...,7 Average,1180,0.0,1955,0.0,98178,47.5112,-122.257,1340,5650
1,6414100192,12/9/2014,538000.0,3,2.25,2570,7242,2.0,NO,NONE,...,7 Average,2170,400.0,1951,1991.0,98125,47.721,-122.319,1690,7639
2,5631500400,2/25/2015,180000.0,2,1.0,770,10000,1.0,NO,NONE,...,6 Low Average,770,0.0,1933,,98028,47.7379,-122.233,2720,8062
3,2487200875,12/9/2014,604000.0,4,3.0,1960,5000,1.0,NO,NONE,...,7 Average,1050,910.0,1965,0.0,98136,47.5208,-122.393,1360,5000
4,1954400510,2/18/2015,510000.0,3,2.0,1680,8080,1.0,NO,NONE,...,8 Good,1680,0.0,1987,0.0,98074,47.6168,-122.045,1800,7503


##Data cleaning

In [10]:
#Checking for null values
pd.DataFrame(data.isna().sum()).T

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,0,0,0,0,0,0,0,0,2376,63,...,0,0,0,0,3842,0,0,0,0,0


In [11]:
# Replacing the Null values at Waterfront with "No"
data['waterfront'].fillna('NO', inplace=True)

In [12]:
# Replacing the null values in the year renovated with the
# values in the year built
data['yr_renovated'].fillna(data['yr_built'], inplace=True)

The assumption made in this case was that the datapoints where there was a NaN value represented houses that had never been renovated.

In [13]:
# Dropping the null values in the view column
data.dropna(inplace=True)

In [14]:
# Replacing "?" with 0 in sqft_basement column
data['sqft_basement'] = data['sqft_basement'].replace('?', 'None', inplace=True)

The assumption made in this case was that the datapoints where there was a '?' represented houses that have no basement.

In [15]:
# Checking for null values after cleaning the data
pd.DataFrame(data.isna().sum()).T

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,0,0,0,0,0,0,0,0,0,0,...,0,0,21534,0,0,0,0,0,0,0


In [16]:
data.info

<bound method DataFrame.info of                id        date     price  bedrooms  bathrooms  sqft_living  \
0      7129300520  10/13/2014  221900.0         3       1.00         1180   
1      6414100192   12/9/2014  538000.0         3       2.25         2570   
2      5631500400   2/25/2015  180000.0         2       1.00          770   
3      2487200875   12/9/2014  604000.0         4       3.00         1960   
4      1954400510   2/18/2015  510000.0         3       2.00         1680   
...           ...         ...       ...       ...        ...          ...   
21592   263000018   5/21/2014  360000.0         3       2.50         1530   
21593  6600060120   2/23/2015  400000.0         4       2.50         2310   
21594  1523300141   6/23/2014  402101.0         2       0.75         1020   
21595   291310100   1/16/2015  400000.0         3       2.50         1600   
21596  1523300157  10/15/2014  325000.0         2       0.75         1020   

       sqft_lot  floors waterfront  view  .

In [17]:
# Creating a new column to find the difference between the year renovated and year built
# 0 Has been replaced for the rows that returned -ve years.
data['Years_Since_Renovation'] = data.apply(lambda row: row['yr_renovated'] - row['yr_built'] if row['yr_renovated'] > 0 else 0, axis=1)


In [18]:
# Checking for any negative values in the new column.
if (data['Years_Since_Renovation'] < 0).any():
    print("There are negative values in the 'Years_Since_Renovation' column.")
else:
    print("There are no negative values in the 'Years_Since_Renovation' column.")


There are no negative values in the 'Years_Since_Renovation' column.


In [19]:
#Checking the data types in the data frame
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21534 entries, 0 to 21596
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      21534 non-null  int64  
 1   date                    21534 non-null  object 
 2   price                   21534 non-null  float64
 3   bedrooms                21534 non-null  int64  
 4   bathrooms               21534 non-null  float64
 5   sqft_living             21534 non-null  int64  
 6   sqft_lot                21534 non-null  int64  
 7   floors                  21534 non-null  float64
 8   waterfront              21534 non-null  object 
 9   view                    21534 non-null  object 
 10  condition               21534 non-null  object 
 11  grade                   21534 non-null  object 
 12  sqft_above              21534 non-null  int64  
 13  sqft_basement           0 non-null      object 
 14  yr_built                21534 non-null

In [20]:
#Changing the years data type to int
data['yr_built'] = data['yr_built'].astype(int)
data['yr_renovated'] = data['yr_renovated'].astype(int)
data['Years_Since_Renovation'] = data['Years_Since_Renovation'].astype(int)

#Changing the date column data type to datetime.
data['date'] = pd.to_datetime(data['date'])


In [21]:
#Checking data types after cleaning
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21534 entries, 0 to 21596
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   id                      21534 non-null  int64         
 1   date                    21534 non-null  datetime64[ns]
 2   price                   21534 non-null  float64       
 3   bedrooms                21534 non-null  int64         
 4   bathrooms               21534 non-null  float64       
 5   sqft_living             21534 non-null  int64         
 6   sqft_lot                21534 non-null  int64         
 7   floors                  21534 non-null  float64       
 8   waterfront              21534 non-null  object        
 9   view                    21534 non-null  object        
 10  condition               21534 non-null  object        
 11  grade                   21534 non-null  object        
 12  sqft_above              21534 non-null  int64 

In [22]:
data.head(30)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,Years_Since_Renovation
0,7129300520,2014-10-13,221900.0,3,1.0,1180,5650,1.0,NO,NONE,...,1180,,1955,0,98178,47.5112,-122.257,1340,5650,0
1,6414100192,2014-12-09,538000.0,3,2.25,2570,7242,2.0,NO,NONE,...,2170,,1951,1991,98125,47.721,-122.319,1690,7639,40
2,5631500400,2015-02-25,180000.0,2,1.0,770,10000,1.0,NO,NONE,...,770,,1933,1933,98028,47.7379,-122.233,2720,8062,0
3,2487200875,2014-12-09,604000.0,4,3.0,1960,5000,1.0,NO,NONE,...,1050,,1965,0,98136,47.5208,-122.393,1360,5000,0
4,1954400510,2015-02-18,510000.0,3,2.0,1680,8080,1.0,NO,NONE,...,1680,,1987,0,98074,47.6168,-122.045,1800,7503,0
5,7237550310,2014-05-12,1230000.0,4,4.5,5420,101930,1.0,NO,NONE,...,3890,,2001,0,98053,47.6561,-122.005,4760,101930,0
6,1321400060,2014-06-27,257500.0,3,2.25,1715,6819,2.0,NO,NONE,...,1715,,1995,0,98003,47.3097,-122.327,2238,6819,0
8,2414600126,2015-04-15,229500.0,3,1.0,1780,7470,1.0,NO,NONE,...,1050,,1960,0,98146,47.5123,-122.337,1780,8113,0
9,3793500160,2015-03-12,323000.0,3,2.5,1890,6560,2.0,NO,NONE,...,1890,,2003,0,98038,47.3684,-122.031,2390,7570,0
10,1736800520,2015-04-03,662500.0,3,2.5,3560,9796,1.0,NO,NONE,...,1860,,1965,0,98007,47.6007,-122.145,2210,8925,0


#Model Buliding#

#Model Evaluation#

#Recommendation System#

#Conclusion#