## Final Project Submission

Please fill out:
* Student names: CALVIN OMWEGA, COLLINS BIWOTT, INGAVI KILAVUKA, MERCY KIRAGU
* Student pace: Full time-Hybrid
* Scheduled project review date/time: 
* Instructor name: Maryann Mwikali
* Blog post URL: 


## BUSINESS UNDERSTANDING

### BUSINESS OVERVIEW

A real estate agency aims to provide its clients with detailed pricing models that show how different features affect home sale prices. By understanding these effects, the agency can advise sellers on how to enhance their properties to maximize sale price and help buyers evaluate potential homes.

#### CHALLENGES:

1. The King County real estate market is subject to fluctuations driven by economic conditions, interest rates, and other external factors, which makes it hard to predict property values accurately.

2. Numerous real estate agencies and agents compete for clients' attention, so the agency needs innovative strategies to stand out and attract business.

3. Developing robust pricing models and predictive analytics depends on accurate, complete data sources and on access to relevant datasets, both of which are difficult to guarantee.

4. Ongoing regulatory changes, such as zoning regulations, tax policies, and housing laws, can affect market dynamics and require adaptation to ensure compliance and mitigate risks.

#### PROPOSED SOLUTION:

1. Implement sophisticated data analytics techniques, including machine learning algorithms, to analyze historical sales data, identify trends, and predict future property values with greater accuracy.

2. Develop customized pricing models that consider a wide range of factors, including property features, neighborhood characteristics, market demand, and buyer preferences, to provide tailored pricing recommendations for each property.

3. Establish a framework for continuous monitoring of market trends and model performance, allowing for timely adjustments to pricing strategies and recommendations as market conditions change.

4. Foster collaborations with data providers, industry experts, and technology partners to access additional data sources, enhance analytical capabilities, and stay abreast of best practices in real estate valuation and predictive modeling.

In conclusion, by leveraging advanced analytics and innovative strategies, the real estate agency can overcome challenges in the dynamic King County market and provide clients with valuable insights and recommendations to optimize their real estate transactions. Through continuous adaptation, collaboration, and a client-centric approach, the agency can achieve sustainable growth, enhance competitiveness, and deliver exceptional value to clients in the ever-evolving real estate landscape.

### PROBLEM STATEMENT



The real estate agency seeks to develop sophisticated pricing models that analyze the King County housing market data to determine how different features influence home sale prices. By understanding these impacts, the agency aims to:
1. Assist sellers in optimizing their property to maximize sale price: Sellers will benefit from tailored recommendations on which features to enhance or highlight to increase the value of their homes. By leveraging insights from the pricing models, sellers can make informed decisions about renovations, upgrades, or staging strategies to attract potential buyers and achieve optimal sale prices.
2. Empower buyers to make informed purchasing decisions: Buyers will gain valuable insights into how various property features correlate with sale prices. Armed with this knowledge, buyers can prioritize their preferences and make informed decisions when evaluating potential homes. Additionally, the agency can provide guidance on negotiating strategies based on the perceived value of different features.
3. Enhance the agency's competitive advantage: By offering advanced pricing models that provide granular insights into the factors influencing home sale prices, the agency can differentiate itself in the market. This will attract both sellers seeking to maximize their returns and buyers seeking expert guidance in their property search process.

Overall, the development of detailed pricing models will enable the real estate agency to provide superior value to its clients, facilitate more informed decision-making processes, and maintain a competitive edge in the dynamic King County housing market.

## DATA UNDERSTANDING

The King County house sales dataset contains information on houses sold during the one-year period from May 2014 to May 2015.

The data dictionary below describes what each column in the data frame represents:

##### TARGET/DEPENDENT VARIABLE:

* price: sale price of each home

##### PREDICTORS/INDEPENDENT VARIABLES:

* id: unique identifier for a house
* date: date of the home sale
* bedrooms: number of bedrooms
* bathrooms: number of bathrooms
* sqft_living: square footage of the house's interior living space
* sqft_lot: square footage of the land space
* floors: number of floors
* waterfront: whether the house has a view of the waterfront
* view: an index from 0 to 4 of how good the view from the property is
* condition: an index from 1 to 5 rating the condition of the house
* grade: an index from 1 to 13, where 1–3 falls short of building construction and design standards, 7 is an average level of construction and design, and 11–13 is a high-quality level of construction and design
* sqft_above: square footage above ground
* sqft_basement: square footage below ground
* yr_built: the year the house was initially built
* yr_renovated: the year of the house's last renovation (0 if never renovated)
* zipcode: ZIP code of the property
* lat: latitude coordinate
* long: longitude coordinate
* sqft_living15: average interior living space of the closest 15 houses, in square feet
* sqft_lot15: average land lot size of the closest 15 houses, in square feet
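
The ranges documented in the dictionary can double as a sanity check on the data. The following is a minimal sketch on a small hand-made DataFrame rather than the real file. The helper `out_of_range` and the `EXPECTED_RANGES` mapping are illustrative names and are not part of the original notebook.

```python
import pandas as pd

# Inclusive bounds taken from the data dictionary above.
EXPECTED_RANGES = {
    "view": (0, 4),
    "condition": (1, 5),
    "grade": (1, 13),
}

def out_of_range(df, ranges):
    """Count values per column that fall outside the documented range.

    Missing values are ignored here because they are handled separately.
    """
    counts = {}
    for col, (low, high) in ranges.items():
        values = df[col].dropna()
        counts[col] = int((~values.between(low, high)).sum())
    return counts

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "view": [0, 4, None, 2],
    "condition": [3, 5, 1, 6],   # 6 is outside 1-5
    "grade": [7, 13, 0, 8],      # 0 is outside 1-13
})

violations = out_of_range(toy, EXPECTED_RANGES)
print(violations)
```

Running the same helper on the loaded `dataset` would show whether any column contradicts its documented range before modeling starts.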



### DATA PREPROCESSING

In [60]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [61]:


dataset = pd.read_csv("data/kc_house_data.csv")



In [62]:
dataset.head()

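| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 10/13/2014 | 221900 | 3 | 1 | 1180 | 5650 | 1 | NaN | 0 | ... | 7 | 1180 | 0.0 | 1955 | 0.0 | 98178 | 48 | -122 | 1340 | 5650 |
| 1 | 6414100192 | 12/9/2014 | 538000 | 3 | 2 | 2570 | 7242 | 2 | 0.0 | 0 | ... | 7 | 2170 | 400.0 | 1951 | 1991.0 | 98125 | 48 | -122 | 1690 | 7639 |
| 2 | 5631500400 | 2/25/2015 | 180000 | 2 | 1 | 770 | 10000 | 1 | 0.0 | 0 | ... | 6 | 770 | 0.0 | 1933 | NaN | 98028 | 48 | -122 | 2720 | 8062 |
| 3 | 2487200875 | 12/9/2014 | 604000 | 4 | 3 | 1960 | 5000 | 1 | 0.0 | 0 | ... | 7 | 1050 | 910.0 | 1965 | 0.0 | 98136 | 48 | -122 | 1360 | 5000 |
| 4 | 1954400510 | 2/18/2015 | 510000 | 3 | 2 | 1680 | 8080 | 1 | 0.0 | 0 | ... | 8 | 1680 | 0.0 | 1987 | 0.0 | 98074 | 48 | -122 | 1800 | 7503 |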


In [64]:
dataset.describe()

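| | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21597 | 21597 | 21597 | 21597 | 21597 | 21597 | 21597 | 19221 | 21534 | 21597 | 21597 | 21597 | 21597 | 17755 | 21597 | 21597 | 21597 | 21597 | 21597 |
| mean | 4580474288 | 540297 | 3 | 2 | 2080 | 15099 | 1 | 0 | 0 | 3 | 8 | 1789 | 1971 | 84 | 98078 | 48 | -122 | 1987 | 12758 |
| std | 2876735716 | 367368 | 1 | 1 | 918 | 41413 | 1 | 0 | 1 | 1 | 1 | 828 | 29 | 400 | 54 | 0 | 0 | 685 | 27274 |
| min | 1000102 | 78000 | 1 | 0 | 370 | 520 | 1 | 0 | 0 | 1 | 3 | 370 | 1900 | 0 | 98001 | 47 | -123 | 399 | 651 |
| 25% | 2123049175 | 322000 | 3 | 2 | 1430 | 5040 | 1 | 0 | 0 | 3 | 7 | 1190 | 1951 | 0 | 98033 | 47 | -122 | 1490 | 5100 |
| 50% | 3904930410 | 450000 | 3 | 2 | 1910 | 7618 | 2 | 0 | 0 | 3 | 7 | 1560 | 1975 | 0 | 98065 | 48 | -122 | 1840 | 7620 |
| 75% | 7308900490 | 645000 | 4 | 2 | 2550 | 10685 | 2 | 0 | 0 | 4 | 8 | 2210 | 1997 | 0 | 98118 | 48 | -122 | 2360 | 10083 |
| max | 9900000190 | 7700000 | 33 | 8 | 13540 | 1651359 | 4 | 1 | 4 | 5 | 13 | 9410 | 2015 | 2015 | 98199 | 48 | -121 | 6210 | 871200 |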


In [65]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  float64
 9   view           21534 non-null  float64
 10  condition      21597 non-null  int64  
 11  grade          21597 non-null  int64  
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

In [66]:
dataset.shape

(21597, 21)

In [67]:
dataset.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [68]:
# count how many columns have each data type
column_data_types_counts = dataset.dtypes.value_counts()

print(column_data_types_counts)


int64      11
float64     8
object      2
dtype: int64
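
Two columns are stored as `object`: according to the `info()` output these are `date` and `sqft_basement`. The following is a minimal sketch, on toy data, of how such columns could be converted to typed columns. The `"?"` entry in the toy frame is an assumption about why a numeric column might load as text. It is not confirmed by anything shown in this notebook.

```python
import pandas as pd

# Toy frame mimicking the two object columns (values are illustrative;
# the "?" entry is an assumed placeholder, not taken from the real file).
toy = pd.DataFrame({
    "date": ["10/13/2014", "12/9/2014", "2/25/2015"],
    "sqft_basement": ["0.0", "400.0", "?"],
})

# Parse month/day/year strings into datetime64 values.
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%Y")

# Convert to numbers; entries that cannot be parsed become NaN.
toy["sqft_basement"] = pd.to_numeric(toy["sqft_basement"], errors="coerce")

print(toy.dtypes)
print(toy["sqft_basement"].isna().sum())
```

Note that `errors="coerce"` turns unparseable entries into missing values, so a conversion like this would need to happen before the missing-value checks in the data cleaning step.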


#### DATA CLEANING

##### CHECKING FOR MISSING VALUES

In [69]:
# checking if there are any missing values in the features
dataset.isnull().any()

id               False
date             False
price            False
bedrooms         False
bathrooms        False
sqft_living      False
sqft_lot         False
floors           False
waterfront        True
view              True
condition        False
grade            False
sqft_above       False
sqft_basement    False
yr_built         False
yr_renovated      True
zipcode          False
lat              False
long             False
sqft_living15    False
sqft_lot15       False
dtype: bool

In [70]:
# percentage of missing values per column
dataset.isnull().sum() / len(dataset) * 100


id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront      11
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated    18
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: float64
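
The percentages above are shown as whole numbers, so `view` appears as 0 even though `info()` reports 21534 non-null values out of 21597, which means 63 entries are missing. Putting counts and percentages side by side avoids that ambiguity. The following is a minimal sketch on toy data, where `missing_summary` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values for affected columns."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first.
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("percent", ascending=False)

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "waterfront": [np.nan, 0.0, 0.0, 1.0],
    "yr_renovated": [0.0, 1991.0, np.nan, np.nan],
})

summary = missing_summary(toy)
print(summary)
```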

##### FIXING MISSING VALUES

In [71]:
# iterate over each column and print most common values
for col in ["waterfront", "view", "yr_renovated"]:
    print(col)
    print(dataset[col].value_counts(normalize = True).sort_values(ascending = False).head())
    print("------------------")

waterfront
0   1
1   0
Name: waterfront, dtype: float64
------------------
view
0   1
2   0
3   0
1   0
4   0
Name: view, dtype: float64
------------------
yr_renovated
0       1
2,014   0
2,003   0
2,013   0
2,007   0
Name: yr_renovated, dtype: float64
------------------

Note: the proportions above appear to be rounded to whole numbers by the notebook's display settings, so the dominant value in each column (0) shows as 1 and every other value shows as 0.


In [72]:

def replace_missing(val, probs):
    """Return val unchanged, or a random draw from probs if val is missing."""
    if pd.isnull(val):
        # sample an observed value, weighted by its relative frequency
        return np.random.choice(probs.index, p=probs.values)
    return val

for col in ["waterfront", "view", "yr_renovated"]:
    # relative frequency of each observed (non-missing) value
    unique_p = dataset[col].value_counts(normalize=True)
    if unique_p.empty:  # value_counts drops NaN, so empty means every value is missing
        print("Skipping column {} because it contains only missing values".format(col))
        continue
    # fill each missing entry with a frequency-weighted random draw
    dataset[col] = dataset[col].apply(replace_missing, args=(unique_p,))
    print("The number of missing values in {} is:".format(col), dataset[col].isna().sum())

print("--------------------------------------")
print("Missing values per column:")
# final check for remaining missing values
dataset.isna().sum()


The number of missing values in waterfront is: 0
The number of missing values in view is: 0
The number of missing values in yr_renovated is: 0
--------------------------------------
Missing values per column:


id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64
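
Because the fill values are drawn at random, two runs of the notebook can produce different datasets unless the random state is fixed. The following is a minimal sketch of the same frequency-weighted idea with a seeded generator and one vectorized draw per column. The function name `impute_by_frequency` and the seed value are illustrative choices and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def impute_by_frequency(series, rng):
    """Fill NaNs with draws from the observed value distribution."""
    probs = series.value_counts(normalize=True)
    if probs.empty:
        # Nothing observed to sample from; return the column unchanged.
        return series
    missing = series.isna()
    filled = series.copy()
    # One vectorized draw covering all missing entries in the column.
    filled[missing] = rng.choice(
        probs.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy()
    )
    return filled

rng = np.random.default_rng(42)  # fixed seed so reruns give the same fill

# Toy column standing in for the real data (hypothetical values).
toy = pd.Series([0.0, 0.0, np.nan, 1.0, np.nan, 0.0], name="waterfront")
filled = impute_by_frequency(toy, rng)
print(filled.isna().sum())
```

This kind of imputation keeps each column's overall value proportions roughly intact, but the filled entries carry no information about the individual house, which is worth keeping in mind when interpreting model coefficients for these features.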

##### CHECKING FOR DUPLICATED VALUES

In [73]:
dataset.duplicated().any()

False

There are no fully duplicated rows in the dataset.
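
Called without arguments, `duplicated()` only flags rows that match another row in every column. A house sold more than once would share an `id` but differ in `date` and `price`, so it would not be caught. The following is a minimal sketch on toy data (the values are hypothetical) of checking repeated identifiers separately:

```python
import pandas as pd

# Toy data: two sales of the same house on different dates (hypothetical).
toy = pd.DataFrame({
    "id": [1001, 1002, 1001],
    "date": ["5/2/2014", "6/1/2014", "3/15/2015"],
    "price": [300000.0, 450000.0, 340000.0],
})

# No row is an exact copy of another row...
full_row_duplicates = int(toy.duplicated().sum())

# ...but the same id appears more than once.
repeated_ids = int(toy.duplicated(subset="id", keep=False).sum())

print(full_row_duplicates, repeated_ids)
```

Whether repeated identifiers should be kept as separate sales or reduced to the most recent one is a modeling decision that depends on how the agency intends to use the prices.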