# Exploratory Data Analysis and Visualization Masterclass with Python

The dataset comes from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). It was created for a competition on advanced regression techniques.

## Data fields

Here's a brief version of what features you'll find in the data:

- **SalePrice** - The property's sale price in dollars. This is the target variable that you're trying to predict.
- **MSSubClass** - The building class  
- **MSZoning** - The general zoning classification  
- **LotFrontage** - Linear feet of street connected to property  
- **LotArea** - Lot size in square feet  
- **Street** - Type of road access  
- **Alley** - Type of alley access  
- **LotShape** - General shape of property  
- **LandContour** - Flatness of the property  
- **Utilities** - Type of utilities available  
- **LotConfig** - Lot configuration  
- **LandSlope** - Slope of property  
- **Neighborhood** - Physical locations within Ames city limits  
- **Condition1** - Proximity to main road or railroad  
- **Condition2** - Proximity to main road or railroad (if a second is present)  
- **BldgType** - Type of dwelling  
- **HouseStyle** - Style of dwelling  
- **OverallQual** - Overall material and finish quality  
- **OverallCond** - Overall condition rating  
- **YearBuilt** - Original construction date  
- **YearRemodAdd** - Remodel date  
- **RoofStyle** - Type of roof  
- **RoofMatl** - Roof material  
- **Exterior1st** - Exterior covering on house  
- **Exterior2nd** - Exterior covering on house (if more than one material)  
- **MasVnrType** - Masonry veneer type  
- **MasVnrArea** - Masonry veneer area in square feet  
- **ExterQual** - Exterior material quality  
- **ExterCond** - Present condition of the material on the exterior  
- **Foundation** - Type of foundation  
- **BsmtQual** - Height of the basement  
- **BsmtCond** - General condition of the basement  
- **BsmtExposure** - Walkout or garden level basement walls  
- **BsmtFinType1** - Quality of basement finished area  
- **BsmtFinSF1** - Type 1 finished square feet  
- **BsmtFinType2** - Quality of second finished area (if present)  
- **BsmtFinSF2** - Type 2 finished square feet  
- **BsmtUnfSF** - Unfinished square feet of basement area  
- **TotalBsmtSF** - Total square feet of basement area  
- **Heating** - Type of heating  
- **HeatingQC** - Heating quality and condition  
- **CentralAir** - Central air conditioning  
- **Electrical** - Electrical system  
- **1stFlrSF** - First floor square feet  
- **2ndFlrSF** - Second floor square feet  
- **LowQualFinSF** - Low quality finished square feet (all floors)  
- **GrLivArea** - Above grade (ground) living area square feet  
- **BsmtFullBath** - Basement full bathrooms  
- **BsmtHalfBath** - Basement half bathrooms  
- **FullBath** - Full bathrooms above grade  
- **HalfBath** - Half baths above grade  
- **BedroomAbvGr** - Number of bedrooms above basement level  
- **KitchenAbvGr** - Number of kitchens  
- **KitchenQual** - Kitchen quality  
- **TotRmsAbvGrd** - Total rooms above grade (does not include bathrooms)  
- **Functional** - Home functionality rating  
- **Fireplaces** - Number of fireplaces  
- **FireplaceQu** - Fireplace quality  
- **GarageType** - Garage location  
- **GarageYrBlt** - Year garage was built  
- **GarageFinish** - Interior finish of the garage  
- **GarageCars** - Size of garage in car capacity  
- **GarageArea** - Size of garage in square feet  
- **GarageQual** - Garage quality  
- **GarageCond** - Garage condition  
- **PavedDrive** - Paved driveway  
- **WoodDeckSF** - Wood deck area in square feet  
- **OpenPorchSF** - Open porch area in square feet  
- **EnclosedPorch** - Enclosed porch area in square feet  
- **3SsnPorch** - Three season porch area in square feet  
- **ScreenPorch** - Screen porch area in square feet  
- **PoolArea** - Pool area in square feet  
- **PoolQC** - Pool quality  
- **Fence** - Fence quality  
- **MiscFeature** - Miscellaneous feature not covered in other categories  
- **MiscVal** - Value of miscellaneous feature ($)  
- **MoSold** - Month sold  
- **YrSold** - Year sold  
- **SaleType** - Type of sale  
- **SaleCondition** - Condition of sale  


In [25]:
# Load Libraries
import pandas as pd

In [None]:
# Load the dataset from a relative path
df = pd.read_csv("housing_data.csv")
# Show results
print(df)

        Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0        1          60       RL         65.0     8450   Pave   NaN      Reg   
1        2          20       RL         80.0     9600   Pave   NaN      Reg   
2        3          60       RL         68.0    11250   Pave   NaN      IR1   
3        4          70       RL         60.0     9550   Pave   NaN      IR1   
4        5          60       RL         84.0    14260   Pave   NaN      IR1   
...    ...         ...      ...          ...      ...    ...   ...      ...   
1455  1456          60       RL         62.0     7917   Pave   NaN      Reg   
1456  1457          20       RL         85.0    13175   Pave   NaN      Reg   
1457  1458          70       RL         66.0     9042   Pave   NaN      Reg   
1458  1459          20       RL         68.0     9717   Pave   NaN      Reg   
1459  1460          20       RL         75.0     9937   Pave   NaN      Reg   

     LandContour Utilities  ... PoolArea PoolQC  Fe

## Explore the Data

In [27]:
# Explore shape
df.shape

(1460, 81)

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

This dataset describes 1460 houses with 81 columns each, including `Id` and the target `SalePrice`. Most columns are complete, but a few are missing many entries, so let's take a closer look.

In [29]:
# Get the total number of features
total_features = df.shape[1]  

# Identify columns that are NOT 100% non-null
missing_features = df.columns[df.isnull().any()].tolist()

# Calculate the percentage of features that are NOT 100% non-null
percentage_not_full = (len(missing_features) / total_features) * 100

# Display results
print("Features that are not 100% non-null:", missing_features)
print(f"Percentage of features that are not 100% non-null: {percentage_not_full:.2f}%")

Features that are not 100% non-null: ['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']
Percentage of features that are not 100% non-null: 23.46%
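
The list above shows which columns have gaps, but not how large those gaps are. A minimal sketch of ranking columns by their share of missing values follows; the `toy` frame and its values are made up for illustration and are not the housing data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
})

# isnull() yields booleans; the column mean is the fraction of missing entries
missing_share = toy.isnull().mean().mul(100)

# Keep only columns that actually have gaps, largest share first
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

On the real data the same expression would start from `df.isnull().mean()`. According to the `df.info()` output above, `Alley` has only 91 non-null entries out of 1460, so about 94% of its values are missing.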


The features with missing values seem expendable for now, so let's drop them and work with the remaining columns, which still form a solid base.

In [30]:
# Drop every column that contains at least one NaN
df = df.dropna(axis=1)

# Check whether it worked
# Get the total number of features
total_features = df.shape[1]  

# Identify columns that are NOT 100% non-null
missing_features = df.columns[df.isnull().any()].tolist()

# Calculate the percentage of features that are NOT 100% non-null
percentage_not_full = (len(missing_features) / total_features) * 100

# Display results
print("Features that are not 100% non-null:", missing_features)
print(f"Percentage of features that are not 100% non-null: {percentage_not_full:.2f}%")

Features that are not 100% non-null: []
Percentage of features that are not 100% non-null: 0.00%
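
Dropping every column with a single gap is the strictest option. As a hedged alternative, `dropna` also accepts a `thresh` argument: the minimum number of non-null values a column needs in order to be kept. The sketch below compares both variants; the 50% cutoff and the `toy` frame are assumptions for illustration and not part of the analysis in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
})

# Strict: drop any column with at least one NaN (what the cell above does)
strict = toy.dropna(axis=1)

# Lenient: keep columns with at least 50% non-null values (assumed cutoff)
min_non_null = int(len(toy) * 0.5)
lenient = toy.dropna(axis=1, thresh=min_non_null)

print(list(strict.columns))
print(list(lenient.columns))
```

The lenient variant keeps mostly complete columns such as `LotFrontage` in the toy frame, at the cost of having to impute or otherwise handle the remaining gaps later.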


Now let's look at how to analyze the features of our DataFrame, starting with the main one: `SalePrice`. It is the target feature, and we want to find out which other features influence the sale price most.

First, we split the features into numerical and categorical ones.

In [31]:
# Separate numerical and categorical features into two DataFrames
df_numerical = df.select_dtypes(include=['number'])  # DataFrame with numerical features
df_categorical = df.select_dtypes(include=['object', 'category'])  # DataFrame with categorical features
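
One caveat with a dtype-based split: a column stored as integers counts as numerical even when its values are category codes. `MSSubClass` (the building class) appears to be such a case. A minimal sketch on a made-up frame follows; casting the column to `str` before splitting is one possible workaround and an assumption here, not a step the notebook takes:

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70],
    "LotArea": [8450, 9600, 11250, 9550],
    "MSZoning": ["RL", "RL", "RL", "RL"],
})

# Cast the code column to strings so it no longer has a numeric dtype
toy["MSSubClass"] = toy["MSSubClass"].astype(str)

# Split by dtype: numeric columns versus everything else
toy_numerical = toy.select_dtypes(include=["number"])
toy_categorical = toy.select_dtypes(exclude=["number"])

print(list(toy_numerical.columns))
print(list(toy_categorical.columns))
```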

In [32]:
print("Numerical Features DataFrame:")
df_numerical.info()

Numerical Features DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 35 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1460 non-null   int64
 1   MSSubClass     1460 non-null   int64
 2   LotArea        1460 non-null   int64
 3   OverallQual    1460 non-null   int64
 4   OverallCond    1460 non-null   int64
 5   YearBuilt      1460 non-null   int64
 6   YearRemodAdd   1460 non-null   int64
 7   BsmtFinSF1     1460 non-null   int64
 8   BsmtFinSF2     1460 non-null   int64
 9   BsmtUnfSF      1460 non-null   int64
 10  TotalBsmtSF    1460 non-null   int64
 11  1stFlrSF       1460 non-null   int64
 12  2ndFlrSF       1460 non-null   int64
 13  LowQualFinSF   1460 non-null   int64
 14  GrLivArea      1460 non-null   int64
 15  BsmtFullBath   1460 non-null   int64
 16  BsmtHalfBath   1460 non-null   int64
 17  FullBath       1460 non-null   int64
 18  HalfBath       146

In [33]:
print("Categorical Features DataFrame:")
df_categorical.info()

Categorical Features DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 27 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   MSZoning       1460 non-null   object
 1   Street         1460 non-null   object
 2   LotShape       1460 non-null   object
 3   LandContour    1460 non-null   object
 4   Utilities      1460 non-null   object
 5   LotConfig      1460 non-null   object
 6   LandSlope      1460 non-null   object
 7   Neighborhood   1460 non-null   object
 8   Condition1     1460 non-null   object
 9   Condition2     1460 non-null   object
 10  BldgType       1460 non-null   object
 11  HouseStyle     1460 non-null   object
 12  RoofStyle      1460 non-null   object
 13  RoofMatl       1460 non-null   object
 14  Exterior1st    1460 non-null   object
 15  Exterior2nd    1460 non-null   object
 16  ExterQual      1460 non-null   object
 17  ExterCond      1460 non-null   object
 

Both the numerical and the categorical DataFrames still contain plenty of complete features (35 and 27 columns), so for now we leave all incomplete features out.
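
A first look at which numerical features move together with `SalePrice` could be sketched as below. The values in `toy_numerical` are made up; on the real data one would presumably start from `df_numerical` instead. Pearson correlation only captures linear association, so a high value hints at a relationship rather than proving influence:

```python
import pandas as pd

# Hypothetical toy values; not taken from the housing data
toy_numerical = pd.DataFrame({
    "SalePrice": [100, 200, 300, 400, 500],
    "GrLivArea": [10, 22, 28, 41, 50],
    "YrSold": [2008, 2006, 2010, 2007, 2009],
})

# Pearson correlation of every column with the target, target itself removed,
# sorted by absolute strength so strong negative correlations rank high too
corr_with_price = (
    toy_numerical.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```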