In [2]:
import pandas as pd

In [24]:
df = pd.read_csv('kc_house_data.csv')
df.head()
df.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [66]:
df.shape

(21613, 21)

In [67]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

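Our data is clean: every column listed reports 21613 non-null values, matching the 21613 rows, so there are no missing values to handle.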
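A quick way to back up that statement is to count missing values and duplicated rows explicitly. The sketch below is only illustrative: toy is a small made-up frame (not the King County data) so the cell runs on its own, and the same two calls would be applied to df on the real data.

```python
import pandas as pd

# Toy stand-in for df (hypothetical values, not taken from kc_house_data.csv)
toy = pd.DataFrame({
    "price": [221900.0, 538000.0, None, 604000.0],
    "bedrooms": [3, 3, 2, 4],
})

# Missing values per column: a clean column reports 0
missing_per_column = toy.isna().sum()

# Number of rows that are exact duplicates of an earlier row
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

Note that info() only reports nulls; it would not reveal placeholder values such as 0 standing in for "unknown".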

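CORRELATION:
Correlation is a statistical measure that indicates how two variables change together.
Correlation values range from -1 to 1: the sign gives the direction of the relationship between two sets of data points and the magnitude gives its strength. The corr method used below computes the Pearson coefficient by default, which measures linear association.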
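To make the -1 to 1 range concrete, here is a minimal sketch of the Pearson coefficient written out by hand. The helper name pearson and the sample lists are made up for illustration; in the notebook itself the same quantity comes from the pandas corr method.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations (unnormalized spreads)
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up values for illustration only
bedrooms = [1, 2, 3, 4, 5]
rising = [10, 20, 30, 40, 50]    # moves exactly with bedrooms
falling = [50, 40, 30, 20, 10]   # moves exactly against bedrooms

r_up = pearson(bedrooms, rising)
r_down = pearson(bedrooms, falling)
print(r_up, r_down)
```

A perfectly increasing linear relationship lands at the top of the range and a perfectly decreasing one at the bottom; real data such as bedrooms vs price falls somewhere in between.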

PRICE / BEDROOMS

In [8]:
corrBedrooms = df['bedrooms'].corr(df['price'])
corrBedrooms

0.3083383686880968

The correlation between bedrooms and price (~0.31) is positive but relatively weak.
=> while a higher number of bedrooms is associated with higher prices, other factors play a larger role in determining the price

In [12]:
resultBR = df.groupby('bedrooms').agg({'price': 'max', 'bedrooms': 'count'}).rename(columns={'bedrooms': 'count'})
resultBR


bedrooms,price,count
0,1300000.0,13
1,1250000.0,199
2,3280000.0,2760
3,3800000.0,9824
4,4490000.0,6882
5,7060000.0,1601
6,7700000.0,272
7,3200000.0,38
8,3300000.0,13
9,1400000.0,6


The table above shows that houses with 4 to 6 bedrooms reach the highest maximum prices (4,490,000 to 7,700,000), while the other bedroom counts shown top out at 3,800,000 or less.

PRICE / BATHROOMS

In [13]:
corrBathrooms = df['bathrooms'].corr(df['price'])
corrBathrooms

0.5251340727456004

A result of ~0.5 indicates a moderate positive relationship between bathrooms and price.
=> there's a noticeable but not very strong relationship between the number of bathrooms and house prices

In [23]:
resultBT = df.groupby('bathrooms').agg({'price':'max','bathrooms':'count'}).rename(columns={'bathrooms': 'count'})
resultBT = resultBT.sort_values(by='price', ascending=True)
resultBT

bathrooms,price,count
0.5,312500.0,4
7.5,450000.0,1
0.75,785000.0,72
0.0,1300000.0,10
1.0,1300000.0,3852
1.25,1390000.0,9
1.5,1500000.0,1446
2.0,2200000.0,1930
6.5,2240000.0,2
2.25,2400000.0,2047


From the grouped results, the most expensive house also has the largest number of bathrooms. However, some of the least expensive groups have a high number of bathrooms (7.5, 6.5, ...), and on the more expensive side there are houses with 3 bathrooms priced at the higher end (4,490,000).

In [25]:
corrSQFT = df['sqft_living'].corr(df['price'])
corrSQFT

0.7020437212325276

In [26]:
corrFLRS = df['floors'].corr(df['price'])
corrFLRS

0.25678570497551195

Comparing the two correlations, sqft_living / price (~0.70) is strong and positive while floors / price (~0.26) is weak, so it makes more sense to analyze the price against sqft_living.
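Rather than computing one correlation per cell, the comparison can be done in a single step. The sketch below uses corrwith on a made-up frame (toy and its values are hypothetical, not the King County data); on the real data the same call would be applied to df with whichever numeric columns are of interest.

```python
import pandas as pd

# Toy stand-in for df (hypothetical values, not taken from kc_house_data.csv)
toy = pd.DataFrame({
    "price": [200000, 350000, 400000, 650000, 900000],
    "sqft_living": [900, 1500, 1800, 2600, 3500],
    "floors": [1.0, 2.0, 1.0, 1.0, 2.0],
})

features = ["sqft_living", "floors"]

# Correlation of each feature column with price, strongest first
corr_with_price = toy[features].corrwith(toy["price"]).sort_values(ascending=False)
print(corr_with_price)
```

Sorting the result puts the most promising feature at the top, which is the same decision made above by comparing two numbers by eye.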

SQFT LIVING / PRICE

In [55]:
resultSQF = df.groupby('sqft_living').agg({'price':'max','sqft_living':'count'}).rename(columns={'sqft_living':'count'}).sort_values(by='price', ascending=False).head(20)
resultSQF

sqft_living,price,count
12050,7700000.0,1
10040,7060000.0,1
9890,6890000.0,1
9200,5570000.0,1
8000,5350000.0,1
7390,5300000.0,1
8010,5110000.0,1
9640,4670000.0,1
6430,4490000.0,1
7440,4210000.0,1


Looking at the extremes of the data set (sorted by price ascending and descending), we can see a strong relationship between the area of the house and its price.
=> house price tends to rise with sqft_living (correlation ~0.70), although the relationship is not exactly proportional

GRADE / PRICE

In [51]:
corrGrade = df['grade'].corr(df['price'])
corrGrade

0.6674627402178585

In [56]:
resultGD = df.groupby('grade').agg({'price':'max','grade':'count'}).round(1).rename(columns={'grade':'count'})
resultGD

grade,price,count
1,142000.0,1
3,280000.0,3
4,435000.0,29
5,795000.0,242
6,1200000.0,2038
7,2050000.0,8981
8,3070000.0,6068
9,2700000.0,2615
10,3600000.0,1134
11,7060000.0,399


The correlation (~0.67) indicates a relatively strong positive relationship between grade and price, so knowing one variable gives some predictive information about the other.
=> as the grade increases, the mean price increases too. However, when extracting the max price for each grade, we find that the most expensive grade 11 house costs more than the most expensive grade 12 house
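Because the note above mixes mean and max prices, it can help to compute both side by side. The following is a sketch using named aggregation on a made-up frame (toy and its values are hypothetical); the output column names mean_price, median_price, max_price and count are chosen here for illustration.

```python
import pandas as pd

# Toy stand-in for df (hypothetical values, not taken from kc_house_data.csv)
toy = pd.DataFrame({
    "grade": [7, 7, 7, 8, 8, 8],
    "price": [300000, 320000, 1000000, 600000, 650000, 700000],
})

# One row per grade with several summaries of price
summary = toy.groupby("grade").agg(
    mean_price=("price", "mean"),
    median_price=("price", "median"),
    max_price=("price", "max"),
    count=("price", "count"),
)
print(summary)
```

In this toy example the mean rises from grade 7 to grade 8 while the max falls, because a single outlier drives the max. The same effect can explain why the max price does not increase steadily with grade in the real data.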

YEAR BUILT VS YEAR RENOVATED

In [59]:
corrYBuilt = df['yr_built'].corr(df['price'])

corrYReno = df['yr_renovated'].corr(df['price'])
print(corrYBuilt," ",corrYReno)

0.05398182517966194   0.12644222895207075


Comparing the correlations for yr_built (~0.05) and yr_renovated (~0.13), we can see that the year a house was built or renovated has a very limited impact on its price; however, yr_renovated appears to have a bigger impact than yr_built.

WATERFRONT AND VIEW / PRICE

In [71]:
corrWT = df['waterfront'].corr(df['price'])
corrWT

0.2663305105222563

The positive correlation of ~0.266 indicates that having a waterfront is associated with a higher price; however, the relationship is only weak to moderate.

In [72]:
corrVW = df['view'].corr(df['price'])
corrVW

0.3973464743789386

The stronger correlation of ~0.397 indicates that a higher view score tends to be associated with higher prices.
=> the view score plays a bigger role in the price than having a waterfront

In [73]:
# rename the count column (the original renamed 'grade', which is not in the result)
resultWV = df.groupby('waterfront').agg({'price':'mean','waterfront':'count'}).round(1).rename(columns={'waterfront':'count'})
resultWV

waterfront,price,count
0,531653.4,21450
1,1662524.2,163


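On average, houses with a waterfront cost about 3.1 times as much as houses without one (1,662,524 vs 531,653); put the other way, the non-waterfront mean is ~32% of the waterfront mean.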
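The comparison can be checked directly from the two mean prices in the waterfront table. This is a small arithmetic sketch; the variable names are chosen here for illustration.

```python
# Mean prices copied from the waterfront table above
mean_no_waterfront = 531653.4
mean_waterfront = 1662524.2

# How many times as expensive the waterfront mean is
ratio = mean_waterfront / mean_no_waterfront

# Percentage by which the waterfront mean exceeds the non-waterfront mean
pct_more = (mean_waterfront - mean_no_waterfront) / mean_no_waterfront * 100

# Non-waterfront mean expressed as a percentage of the waterfront mean
share = mean_no_waterfront / mean_waterfront * 100

print(f"{ratio:.2f}x, {pct_more:.0f}% more, non-waterfront is {share:.0f}% of waterfront")
```

The ~32% figure is the share (non-waterfront as a fraction of waterfront), not the premium; the premium is a little over 200%.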

In [74]:
# count rows per view score (the original counted 'waterfront' and renamed a missing 'view' column)
resultV = df.groupby('view').agg({'price':'mean','view':'count'}).round(1).rename(columns={'view':'count'})
resultV

view,price,count
0,496623.5,19489
1,812518.6,332
2,792746.2,963
3,972468.4,510
4,1464362.6,319


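Houses with a higher view score are more expensive on average: the mean price rises from ~497K at view 0 to ~1.46M at view 4, although view 1 and view 2 are close (view 2 is slightly lower).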

FINAL CONCLUSION

In summary, sqft_living and grade play a major role in the house's price.
Bathrooms, bedrooms, floors, view, and waterfront play an intermediate role.
In contrast, yr_renovated plays a minor role, and yr_built plays almost no role in housing prices.
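The grouping in this conclusion can be reproduced by ranking the correlations reported in the cells above. In the sketch below the values are the notebook's own results rounded to three decimals; the bucket function and its cut-offs (0.6 and 0.2) are assumptions made here for illustration, not a statistical standard.

```python
# Correlations with price as reported in the cells above (rounded to 3 decimals)
corr_with_price = {
    "sqft_living": 0.702,
    "grade": 0.667,
    "bathrooms": 0.525,
    "view": 0.397,
    "bedrooms": 0.308,
    "waterfront": 0.266,
    "floors": 0.257,
    "yr_renovated": 0.126,
    "yr_built": 0.054,
}

def bucket(r):
    # Cut-offs are illustrative assumptions, not a statistical standard
    if abs(r) >= 0.6:
        return "major"
    if abs(r) >= 0.2:
        return "intermediate"
    return "minor"

# Strongest relationship first
ranked = sorted(corr_with_price.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, r in ranked:
    print(f"{name:13s} {r:.3f} {bucket(r)}")
```

With these cut-offs, sqft_living and grade come out as major, the middle five as intermediate, and the two year columns as minor, matching the summary above. Different thresholds would shift the boundaries.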