<h3>Importing the necessary libraries for completing the project</h3>

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from scipy.stats import zscore


<h1>Data Loading and Cleaning</h1>

<h3>Loading the data and displaying the first few rows. This gives an idea of the columns and values being dealt with.</h3>

In [18]:
df = pd.read_csv('/workspaces/Practise-Code/Kaggle Project/Data/kc_house_data.csv').sort_values('price', ascending=False)
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
7252,6762700020,20141013T000000,7700000.0,6,8.0,12050,27600,2.5,0,3,...,13,8570,3480,1910,1987,98102,47.6298,-122.323,3940,8800
3914,9808700762,20140611T000000,7062500.0,5,4.5,10040,37325,2.0,1,2,...,11,7680,2360,1940,2001,98004,47.65,-122.214,3930,25449
9254,9208900037,20140919T000000,6885000.0,6,7.75,9890,31374,2.0,0,4,...,13,8860,1030,2001,0,98039,47.6305,-122.24,4540,42730
4411,2470100110,20140804T000000,5570000.0,5,5.75,9200,35069,2.0,0,0,...,13,6200,3000,2001,0,98039,47.6289,-122.233,3560,24345
1448,8907500070,20150413T000000,5350000.0,5,5.0,8000,23985,2.0,0,4,...,12,6720,1280,2009,0,98004,47.6232,-122.22,4600,21750


<h3>Using '.info()' to check for missing values and see the data types used.</h3>

In this instance, the data has no missing values, and most columns use numeric data types, which makes analysis easier (categorical values such as 'view' or 'waterfront' have already been converted to numbers). The 'date' column, however, is stored as 'object' and would benefit from being converted to 'datetime', which allows more control when carrying out EDA (Exploratory Data Analysis).
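
The same two checks can also be made directly rather than read off the '.info()' table. The snippet below is a minimal sketch on a small made-up frame (the `sample` data and the variable names are illustrative assumptions, not part of this notebook); it counts missing values per column and lists the columns that are not stored as numbers.

```python
import pandas as pd

# Toy frame standing in for the housing data (values are illustrative only)
sample = pd.DataFrame({
    "date": ["20141013T000000", "20140611T000000"],
    "price": [7700000.0, 7062500.0],
    "waterfront": [0, 1],
})

# Count missing values per column; all zeros means nothing is missing
missing_per_column = sample.isna().sum()

# List the columns that are not numeric (here only the raw date strings)
non_numeric_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric_columns)
```

On the real data, the same calls would be made on `df` instead of `sample`.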

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21613 entries, 7252 to 1149
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long     

<h3>The cell below converts the 'date' column to 'datetime'; calling 'info()' again shows how the data type of 'date' changes.</h3>
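
The raw dates look like '20141013T000000'. As a hedged variant of the conversion, the format can be spelled out explicitly so that any value not matching it raises an error instead of being guessed. The sketch below uses two date strings copied from the rows displayed earlier; the names `raw_dates` and `parsed` are assumptions made for illustration.

```python
import pandas as pd

# Two raw date strings in the same layout as the 'date' column
raw_dates = pd.Series(["20141013T000000", "20140611T000000"])

# Explicit format: year, month, day, literal 'T', then hours, minutes, seconds
parsed = pd.to_datetime(raw_dates, format="%Y%m%dT%H%M%S")

# Once parsed, the .dt accessor exposes parts of the date such as the month
print(parsed.dt.month.tolist())
```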

In [20]:
df['date'] = pd.to_datetime(df['date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21613 entries, 7252 to 1149
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             21613 non-null  int64         
 1   date           21613 non-null  datetime64[ns]
 2   price          21613 non-null  float64       
 3   bedrooms       21613 non-null  int64         
 4   bathrooms      21613 non-null  float64       
 5   sqft_living    21613 non-null  int64         
 6   sqft_lot       21613 non-null  int64         
 7   floors         21613 non-null  float64       
 8   waterfront     21613 non-null  int64         
 9   view           21613 non-null  int64         
 10  condition      21613 non-null  int64         
 11  grade          21613 non-null  int64         
 12  sqft_above     21613 non-null  int64         
 13  sqft_basement  21613 non-null  int64         
 14  yr_built       21613 non-null  int64         
 15  yr_renovated   21613 n

<h3>Outliers are handled below by computing z-scores for the numeric columns and keeping only the rows that fall within a threshold of 3 standard deviations.</h3>

<h5> - For normally distributed data, about 99.7% of values fall within 3 standard deviations of the mean. Removing the outliers beyond that range should give a much more consistent EDA.</h5>
<h5> - 'waterfront' and 'yr_renovated' are not included in outlier analysis/removal. As only 6 properties are waterfront properties and only a small number of properties have been renovated, these properties would be removed if the columns were included, even though they could provide valuable insights.</h5>
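
One way to keep columns such as 'waterfront' out of the outlier check is to drop them from the list of checked columns before the z-scores are computed. The sketch below illustrates that idea on toy data and is not the notebook's own cell: the helper `drop_zscore_outliers`, its `exclude` parameter, and the `toy` frame are all hypothetical names introduced here. It computes z-scores with pandas using the population standard deviation (ddof=0), which is also the default of `scipy.stats.zscore`.

```python
import pandas as pd

def drop_zscore_outliers(frame, exclude=(), threshold=3.0):
    """Single pass: drop rows with |z| > threshold in any checked numeric column."""
    numeric = frame.select_dtypes(include="number").columns
    # Hypothetical 'exclude' argument: columns that must not trigger removal
    checked = [col for col in numeric if col not in exclude]
    values = frame[checked]
    z = (values - values.mean()) / values.std(ddof=0)
    # A constant column gives NaN z-scores; NaN > threshold is False, so it is ignored
    is_outlier = (z.abs() > threshold).any(axis=1)
    return frame[~is_outlier]

# Toy data: one extreme price (last row) and one waterfront property (first row)
toy = pd.DataFrame({
    "price": list(range(100, 120)) + [10000],
    "waterfront": [1] + [0] * 20,
})

kept = drop_zscore_outliers(toy, exclude=["waterfront"])
print(len(toy), len(kept), int(kept["waterfront"].sum()))
```

In this toy frame the single waterfront row is itself more than 3 standard deviations from the column mean, so leaving 'waterfront' in the checked columns would remove it along with the extreme price.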

In [26]:
# Select only numerical columns
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Calculate z-scores for numerical columns only
df_zscores = df[numeric_cols].apply(zscore)

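# Z-score threshold (in standard deviations) beyond which a value counts as an outlier
threshold = 3

# Perform outlier removal in a loop
for _ in range(6):  # Repeat 6 times; z-scores are recomputed on the remaining rows each pass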
    # Calculate z-scores for numerical columns
    df_zscores = df[numeric_cols].apply(zscore)
    
    # Identify outliers based on the z-score threshold
    outliers = (df_zscores > threshold) | (df_zscores < -threshold)
    
    # Identify rows with any outliers
    outliers_any = outliers.any(axis=1)
    
    # Filter out the rows with outliers
    df = df[~outliers_any]

# Display the cleaned DataFrame
print(df[numeric_cols].apply(zscore))


             id     price  bedrooms  bathrooms  sqft_living  sqft_lot  \
9826   1.698120  2.997893  1.035750   2.458845     2.283836 -1.204479   
9314  -0.104445  2.985600 -1.512547  -0.645900    -1.046027  0.331437   
8060  -0.279864  2.979453  1.035750   0.906472     2.374321  0.557664   
1184   1.111812  2.979453  1.035750   0.518379     1.469467  0.779412   
1786  -0.769852  2.973306  1.035750   2.070752    -0.285950 -0.858898   
...         ...       ...       ...        ...          ...       ...   
13756 -1.016548 -2.033402 -0.238398  -1.422087    -1.588940  0.869007   
10253 -0.794459 -2.042623 -1.512547  -1.422087    -1.607037  0.715416   
16714 -1.180304 -2.042623 -1.512547  -1.422087    -1.462260  0.956362   
18468  1.161969 -2.054917 -1.512547  -1.422087    -1.480357  0.581023   
8274  -0.281723 -2.061064 -0.238398  -1.422087    -1.552745  1.171710   

         floors  waterfront  view  condition     grade  sqft_above  \
9826   1.004339         NaN   NaN  -0.651104  2.11805

In [31]:
df['month'] = df['date'].dt.month
df['decade_built'] = (df['yr_built']//10) * 10

In [30]:
df.to_csv('/workspaces/Practise-Code/Kaggle Project/Data/df_no_outliers.csv')

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13896 entries, 9826 to 8274
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             13896 non-null  int64         
 1   date           13896 non-null  datetime64[ns]
 2   price          13896 non-null  float64       
 3   bedrooms       13896 non-null  int64         
 4   bathrooms      13896 non-null  float64       
 5   sqft_living    13896 non-null  int64         
 6   sqft_lot       13896 non-null  int64         
 7   floors         13896 non-null  float64       
 8   waterfront     13896 non-null  int64         
 9   view           13896 non-null  int64         
 10  condition      13896 non-null  int64         
 11  grade          13896 non-null  int64         
 12  sqft_above     13896 non-null  int64         
 13  sqft_basement  13896 non-null  int64         
 14  yr_built       13896 non-null  int64         
 15  yr_renovated   13896 n

In [32]:
final_waterfront_distribution = df_no_outliers['waterfront'].value_counts()
print("Final 'waterfront' distribution:\n", final_waterfront_distribution)

final_year_renovated_distribution = df_no_outliers['yr_renovated'].value_counts()
print("Final 'yr_renovated' distribution\n", final_year_renovated_distribution)

Final 'waterfront' distribution:
 waterfront
0    18702
Name: count, dtype: int64
Final 'yr_renovated' distribution
 yr_renovated
0    18702
Name: count, dtype: int64


In [12]:
df_no_outliers.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,month,decade_built
15924,7533800170,2014-07-07,1636000.0,3,2.5,3110,6765,2.0,0,1,...,560,1946,0,98115,47.6886,-122.276,2630,7626,7,1940
17324,9809000010,2015-01-06,1629000.0,5,2.5,3090,16583,2.0,0,0,...,0,1964,0,98004,47.6458,-122.218,3740,17853,1,1960
14233,5318101565,2014-07-03,1625000.0,4,3.25,2980,3600,2.0,0,0,...,830,1999,0,98112,47.6352,-122.284,2980,4800,7,1990
11843,2450500060,2014-08-26,1620000.0,4,3.25,3820,8114,2.0,0,0,...,0,2005,0,98004,47.5837,-122.194,2440,9195,8,2000
16268,3025300250,2015-05-13,1620000.0,4,2.25,2350,17709,2.0,0,0,...,0,1977,0,98039,47.6232,-122.236,3360,19855,5,1970
