# Exploratory Data Analysis (EDA) with Housing Price Dataset

Exploratory Data Analysis (EDA) is a crucial step in the data analysis pipeline. It helps us understand the data, discover patterns, spot anomalies, and frame hypotheses. In this lesson, we'll use a housing price dataset to explore various EDA techniques.

## Initial Steps for Data Analysis

The initial steps for data analysis in Python include:

1. **Data Acquisition:** This involves gathering data from various sources such as local files, databases, APIs, websites, etc.
 
2. **Loading the Data:** Common formats to consider are CSV (Comma Separated Values), JSON, XLS, HTML, XML, and more.

3. **Exploratory Data Analysis (EDA):** EDA is a systematic approach to initial data inspection. It leverages **descriptive analysis** techniques to understand the data better, identify outliers, highlight significant variables, and generally uncover underlying data patterns. Additionally, EDA helps in organizing the data, spotting errors, and assessing missing values.

4. **Data Cleaning:** It's crucial to check the available data and perform tasks such as removing empty columns, standardizing terms, imputing missing data where appropriate, and more.

5. After cleaning, you should conduct a more in-depth exploratory data analysis to further understand the data.
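
As a minimal sketch of the loading step, the snippet below reads the same kind of tabular data from CSV and JSON text. The in-memory strings and the column values are hypothetical stand-ins for real files or API responses; `pd.read_csv` and `pd.read_json` accept file paths and URLs in the same way.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for real files or API responses
csv_text = "Id,LotArea,SalePrice\n1,8450,208500\n2,9600,181500\n"
json_text = '[{"Id": 1, "LotArea": 8450}, {"Id": 2, "LotArea": 9600}]'

# Both readers return a DataFrame, so later EDA steps do not depend on the source format
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.shape)   # rows and columns read from the CSV text
print(df_json.shape)  # rows and columns read from the JSON text
```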

## Methods in EDA

EDA methodologies can be broadly categorized into:

- **Numerical Measures:** These can include coefficients, frequency counts, and other statistical metrics.
  
- **Visual Representations:** Examples are histograms, scatter plots, pie charts, and more.

Additionally, based on the number of variables in focus, methods can be:

- **Univariate:** Describing the characteristics of a single variable at a time.
  
- **Bivariate:** Analyzing the relationship between two variables, either jointly or by examining how one (independent) variable influences the other (dependent) variable.
  
- **Multivariate:** An extension of bivariate analysis but for multiple variables. It explores the relationships among them or the impact of two or more independent variables (sometimes along with associated variables or covariates) on one or more dependent variables.
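
The three levels above can be sketched on a small toy table. The values below are made up for illustration and only borrow column names from the housing dataset; the choice of `describe()` and Pearson correlation is one possible option per level, not the only one.

```python
import pandas as pd

# Hypothetical toy data (not taken from the real dataset)
toy = pd.DataFrame({
    "GrLivArea": [850, 1200, 1500, 1800, 2400],
    "OverallQual": [4, 5, 6, 7, 9],
    "SalePrice": [95000, 140000, 175000, 210000, 320000],
})

# Univariate: summary statistics for a single variable
univariate = toy["SalePrice"].describe()

# Bivariate: Pearson correlation between two variables
bivariate = toy["GrLivArea"].corr(toy["SalePrice"])

# Multivariate: pairwise correlations among all numerical variables
multivariate = toy.corr()

print(univariate)
print(round(bivariate, 3))
print(multivariate)
```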

**Note:**
It's crucial to ensure that all our analytical methods are tailored to the type of variable under consideration.


## Loading the Dataset

Before we dive into EDA, let's gather our data. In this case, we will load our dataset and take a quick look at its structure.


The dataset can be found [here](https://raw.githubusercontent.com/data-bootcamp-v4/data/main/housing_price_eda.csv) and the information about the dataset [here](https://github.com/data-bootcamp-v4/data/blob/main/housing_price_dataset_info.md).

In [34]:
import pandas as pd
import plotly.express as px
import numpy as np

%matplotlib inline

pd.options.display.max_columns = 500

In [35]:
# Loading the housing price dataset directly from its URL
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/housing_price_eda.csv")

## Initial Exploration

Before diving into the specifics of univariate analysis, it's essential to get acquainted with our dataset. For reference, below is a copy of the column descriptions:

<details>
    <summary>Review columns</summary><br>
    <b>Id</b>: Unique identifier for each property.<br>
    <b>MSSubClass</b>: Classification of the dwelling based on its type.<br>    
    <b>MSZoning</b>: Zoning classification which determines the type of use permitted on the land (e.g., residential, commercial).<br>    
    <b>LotFrontage</b>: Linear feet of street that the property is connected to.<br>    
    <b>LotArea</b>: Size of the lot in square feet.<br>    
    <b>Street</b>: Type of road access to the property (e.g., paved, gravel).<br>    
    <b>Alley</b>: Type of alley access to the property.<br>    
    <b>LotShape</b>: General shape of the property (e.g., regular, slightly irregular).<br>    
    <b>LandContour</b>: Contour or flatness of the property.<br>    
    <b>Utilities</b>: Type of utilities available (e.g., all public utilities, electricity and gas only).<br>    
    <b>LotConfig</b>: Lot configuration or layout.<br>    
    <b>LandSlope</b>: Slope of the property.<br>    
    <b>Neighborhood</b>: Physical location within a city or town.<br>    
    <b>Condition1</b>: Proximity to main road or railroad.<br>    
    <b>Condition2</b>: Proximity to a second main road or railroad (if any).<br>    
    <b>BldgType</b>: Type of dwelling (e.g., single-family, duplex).<br>    
    <b>HouseStyle</b>: Style of the dwelling (e.g., one story, two-story).<br>    
    <b>OverallQual</b>: Overall quality of the house.<br>    
    <b>OverallCond</b>: Overall condition of the house.<br>    
    <b>YearBuilt</b>: Original construction date.<br>    
    <b>YearRemodAdd</b>: Remodeling date.<br>    
    <b>RoofStyle</b>: Type of roof.<br>    
    <b>RoofMatl</b>: Roof material.<br>    
    <b>Exterior1st</b>: Exterior covering on the house.<br>    
    <b>Exterior2nd</b>: Second exterior covering on the house (if any).<br>    
    <b>MasVnrType</b>: Type of masonry veneer.<br>    
    <b>MasVnrArea</b>: Masonry veneer area in square feet.<br>    
    <b>ExterQual</b>: Quality of the exterior material.<br>    
    <b>ExterCond</b>: Condition of the exterior material.<br>    
    <b>Foundation</b>: Type of foundation.<br>    
    <b>BsmtQual</b>: Quality of the basement.<br>    
    <b>BsmtCond</b>: Condition of the basement.<br>    
    <b>BsmtExposure</b>: Walkout or garden-level walls in the basement.<br>    
    <b>BsmtFinType1</b>: Quality of basement finished area.<br>    
    <b>BsmtFinSF1</b>: Type 1 finished square feet.<br>    
    <b>BsmtFinType2</b>: Secondary finished area (if multiple types).<br>    
    <b>BsmtFinSF2</b>: Type 2 finished square feet.<br>    
    <b>BsmtUnfSF</b>: Unfinished square feet of the basement area.<br>    
    <b>TotalBsmtSF</b>: Total square feet of the basement area.<br>    
    <b>Heating</b>: Type of heating system.<br>    
    <b>HeatingQC</b>: Heating quality and condition.<br>    
    <b>CentralAir</b>: Central air conditioning availability.<br>    
    <b>Electrical</b>: Electrical system type.<br>    
    <b>1stFlrSF</b>: First-floor square footage.<br>    
    <b>2ndFlrSF</b>: Second-floor square footage.<br>    
    <b>LowQualFinSF</b>: Low-quality finished square footage.<br>    
    <b>GrLivArea</b>: Above-ground living area in square feet.<br>    
    <b>BsmtFullBath</b>: Number of full bathrooms in the basement.<br>    
    <b>BsmtHalfBath</b>: Number of half bathrooms in the basement.<br>    
    <b>FullBath</b>: Full bathrooms above ground.<br>    
    <b>HalfBath</b>: Half bathrooms above ground.<br>    
    <b>BedroomAbvGr</b>: Number of bedrooms above ground.<br>    
    <b>KitchenAbvGr</b>: Number of kitchens.<br>    
    <b>KitchenQual</b>: Kitchen quality.<br>    
    <b>TotRmsAbvGrd</b>: Total rooms above ground (excluding bathrooms).<br>    
    <b>Functional</b>: Home functionality rating.<br>    
    <b>Fireplaces</b>: Number of fireplaces.<br>    
    <b>FireplaceQu</b>: Fireplace quality.<br>    
    <b>GarageType</b>: Garage location or type.<br>    
    <b>GarageYrBlt</b>: Year the garage was built.<br>    
    <b>GarageFinish</b>: Interior finish of the garage.<br>    
    <b>GarageCars</b>: Size of the garage in terms of car capacity.<br>    
    <b>GarageArea</b>: Size of the garage in square feet.<br>    
    <b>GarageQual</b>: Garage quality.<br>    
    <b>GarageCond</b>: Garage condition.<br>    
    <b>PavedDrive</b>: Paved driveway availability.<br>    
    <b>WoodDeckSF</b>: Wood deck area in square feet.<br>    
    <b>OpenPorchSF</b>: Open porch area in square feet.<br>    
    <b>EnclosedPorch</b>: Enclosed porch area in square feet.<br>    
    <b>3SsnPorch</b>: Three-season porch area in square feet.<br>    
    <b>ScreenPorch</b>: Screened porch area in square feet.<br>    
    <b>PoolArea</b>: Pool area in square feet.<br>    
    <b>PoolQC</b>: Pool quality.<br>    
    <b>Fence</b>: Fence quality.<br>    
    <b>MiscFeature</b>: Miscellaneous feature not covered in other categories.<br>    
    <b>MiscVal</b>: Value of the miscellaneous feature.<br>    
    <b>MoSold</b>: Month the house was sold.<br>    
    <b>YrSold</b>: Year the house was sold.<br>    
    <b>SaleType</b>: Type of sale.<br>    
    <b>SaleCondition</b>: Condition of sale.<br>    
    <b>SalePrice</b>: Price at which the house was sold.    
</details>


In [36]:
# Display the first few rows of the dataset
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [37]:
df.sample(30)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
587,588,85,RL,74.0,8740,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,1Fam,SFoyer,5,6,1982,1982,Hip,CompShg,HdBoard,HdBoard,,0.0,TA,TA,CBlock,TA,TA,Av,ALQ,672,Unf,0,168,840,GasA,TA,Y,SBrkr,860,0,0,860,1,0,1,0,2,1,TA,4,Typ,0,,Detchd,1996.0,Unf,2,528,TA,TA,Y,0,0,0,0,0,0,,,,0,7,2009,WD,Normal,137000
304,305,75,RM,87.0,18386,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,2.5Fin,7,9,1880,2002,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,1470,1470,GasA,Ex,Y,SBrkr,1675,1818,0,3493,0,0,3,0,3,1,Gd,10,Typ,1,Ex,Attchd,2003.0,Unf,3,870,TA,TA,Y,302,0,0,0,0,0,,,,0,5,2008,WD,Normal,295000
662,663,20,RL,120.0,13560,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,3,1968,1968,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,216.0,TA,TA,CBlock,Fa,Fa,No,Unf,0,Unf,0,1392,1392,GasA,Gd,Y,SBrkr,1392,0,0,1392,1,0,1,0,2,1,TA,5,Maj2,2,TA,Attchd,1968.0,RFn,2,576,TA,TA,Y,0,0,240,0,0,0,,,,0,7,2009,WD,Normal,110000
1195,1196,60,RL,51.0,8029,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,6,5,2005,2005,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0,Unf,0,728,728,GasA,Ex,Y,SBrkr,728,728,0,1456,0,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2005.0,Fin,2,400,TA,TA,Y,100,24,0,0,0,0,,,,0,7,2008,WD,Normal,176000
1351,1352,60,RL,70.0,9247,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,2Story,6,6,1962,1962,Gable,CompShg,HdBoard,HdBoard,BrkFace,318.0,TA,TA,CBlock,TA,TA,No,Rec,319,Unf,0,539,858,GasA,Ex,Y,SBrkr,858,858,0,1716,0,0,1,1,4,1,TA,8,Typ,1,Gd,Attchd,1962.0,Fin,2,490,TA,TA,Y,0,84,0,0,120,0,,,,0,3,2008,WD,Normal,171000
325,326,45,RM,50.0,5000,Pave,,Reg,Lvl,AllPub,Inside,Gtl,IDOTRR,RRAe,Norm,1Fam,1.5Unf,5,6,1941,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,TA,TA,Av,BLQ,116,Unf,0,604,720,GasA,Po,N,FuseF,803,0,0,803,0,0,1,0,2,1,TA,5,Typ,0,,Detchd,1941.0,Unf,2,360,TA,TA,Y,0,0,244,0,0,0,,,,0,12,2007,WD,Normal,87000
1360,1361,70,RL,51.0,9842,Pave,,Reg,Lvl,AllPub,Inside,Gtl,SWISU,Feedr,Norm,1Fam,2Story,5,6,1921,1998,Gable,CompShg,MetalSd,Wd Sdng,,0.0,TA,TA,BrkTil,TA,Fa,No,Unf,0,Unf,0,612,612,GasA,Ex,Y,SBrkr,990,1611,0,2601,0,0,3,1,4,1,TA,8,Typ,0,,BuiltIn,1998.0,RFn,2,621,TA,TA,Y,183,0,301,0,0,0,,,,0,5,2008,WD,Normal,189000
32,33,20,RL,85.0,11049,Pave,,Reg,Lvl,AllPub,Corner,Gtl,CollgCr,Norm,Norm,1Fam,1Story,8,5,2007,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Ex,TA,Av,Unf,0,Unf,0,1234,1234,GasA,Ex,Y,SBrkr,1234,0,0,1234,0,0,2,0,3,1,Gd,7,Typ,0,,Attchd,2007.0,RFn,2,484,TA,TA,Y,0,30,0,0,0,0,,,,0,1,2008,WD,Normal,179900
822,823,60,RL,,12394,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Gd,TA,Gd,Unf,0,Unf,0,847,847,GasA,Ex,Y,SBrkr,847,886,0,1733,0,0,2,1,3,1,Gd,7,Typ,1,Gd,BuiltIn,2003.0,Fin,2,433,TA,TA,Y,100,48,0,0,0,0,,,,0,10,2007,WD,Family,225000
77,78,50,RM,50.0,8635,Pave,,Reg,Lvl,AllPub,Inside,Gtl,BrkSide,Norm,Norm,1Fam,1.5Fin,5,5,1948,2001,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,TA,TA,CBlock,TA,TA,No,BLQ,336,GLQ,41,295,672,GasA,TA,Y,SBrkr,1072,213,0,1285,1,0,1,0,2,1,TA,6,Min1,0,,Detchd,1948.0,Unf,1,240,TA,TA,Y,0,0,0,0,0,0,,MnPrv,,0,1,2008,WD,Normal,127000


In [38]:
# Retrieving the number of rows and columns in the dataframe
df.shape

(1460, 81)

In [39]:
# Displaying the data types of each column in the dataframe
df.dtypes

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 81, dtype: object

### Exploring numerical and categorical variables

We'll explore numerical and categorical variables, and create two dataframes, one for each type of variable.

**Note**: 
- **Numerical variables**: These can encompass both quantitative and qualitative information. Discrete numerical variables with only a few distinct values often hint at qualitative (categorical) variables encoded as numbers.

- **Object variables**: Typically, these consist of qualitative data, numeric data in a string format, or data that might not be directly relevant to the analysis. Examples include identifiers like 'ID' numbers or 'Names'. Variables with a broad range of unique values, especially in string format, often fall into this category. 
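
One way to check whether a text column is really numeric data in string format is to attempt a conversion and see how many values survive. The sketch below uses a hypothetical toy frame; `errors="coerce"` turns values that cannot be parsed into `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical toy data: "LotArea" holds numbers stored as text, plus one bad entry
toy = pd.DataFrame({
    "LotArea": ["8450", "9600", "11250", "unknown"],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor"],
})

# Try to parse the text as numbers; unparseable entries become NaN
converted = pd.to_numeric(toy["LotArea"], errors="coerce")

# Share of values that parsed successfully; a high share suggests a numeric column in disguise
share_numeric = converted.notna().mean()

print(converted)
print(share_numeric)
```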


In [40]:
# Retrieving the unique data types present in the dataframe columns
list(set(df.dtypes.tolist()))

[dtype('int64'), dtype('float64'), dtype('O')]

In [41]:
# Extracting column names with numerical data types from the dataframe
df.select_dtypes("number").columns

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')

In [42]:
# Counting and sorting the unique values for each numerical column in descending order
df.select_dtypes("number").nunique().sort_values(ascending=False)

Id               1460
LotArea          1073
GrLivArea         861
BsmtUnfSF         780
1stFlrSF          753
TotalBsmtSF       721
SalePrice         663
BsmtFinSF1        637
GarageArea        441
2ndFlrSF          417
MasVnrArea        327
WoodDeckSF        274
OpenPorchSF       202
BsmtFinSF2        144
EnclosedPorch     120
YearBuilt         112
LotFrontage       110
GarageYrBlt        97
ScreenPorch        76
YearRemodAdd       61
LowQualFinSF       24
MiscVal            21
3SsnPorch          20
MSSubClass         15
TotRmsAbvGrd       12
MoSold             12
OverallQual        10
OverallCond         9
PoolArea            8
BedroomAbvGr        8
GarageCars          5
YrSold              5
KitchenAbvGr        4
Fireplaces          4
BsmtFullBath        4
FullBath            4
HalfBath            3
BsmtHalfBath        3
dtype: int64

In [43]:
# Separating between discrete and continuous variables, as discrete ones could potentially be treated as categorical.
# Remember to adjust the threshold (in this case, < 20) based on your dataset's specific characteristics and domain knowledge.
potential_categorical_from_numerical = df.select_dtypes("number").loc[:, df.select_dtypes("number").nunique() < 20]
potential_categorical_from_numerical

Unnamed: 0,MSSubClass,OverallQual,OverallCond,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageCars,PoolArea,MoSold,YrSold
0,60,7,5,1,0,2,1,3,1,8,0,2,0,2,2008
1,20,6,8,0,1,2,0,3,1,6,1,2,0,5,2007
2,60,7,5,1,0,2,1,3,1,6,1,2,0,9,2008
3,70,7,5,1,0,1,0,3,1,7,1,3,0,2,2006
4,60,8,5,1,0,2,1,4,1,9,1,3,0,12,2008
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,60,6,5,0,0,2,1,3,1,7,1,2,0,8,2007
1456,20,6,6,1,0,2,0,3,1,7,2,2,0,2,2010
1457,70,7,9,0,0,2,0,4,1,9,2,1,0,5,2010
1458,20,5,6,1,0,1,0,2,1,5,0,1,0,4,2010


In [44]:
# Retrieving column names with object (typically string) data types from the dataframe
df.select_dtypes("object").columns

Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object')

In [45]:
# Counting and sorting the unique values for each object (string) column in descending order
df.select_dtypes("object").nunique().sort_values(ascending=False)

# All columns seem categorical, as there isn't a wide variability of values.

Neighborhood     25
Exterior2nd      16
Exterior1st      15
SaleType          9
Condition1        9
Condition2        8
HouseStyle        8
RoofMatl          8
Functional        7
BsmtFinType2      6
Heating           6
RoofStyle         6
SaleCondition     6
BsmtFinType1      6
GarageType        6
Foundation        6
Electrical        5
FireplaceQu       5
HeatingQC         5
GarageQual        5
GarageCond        5
MSZoning          5
LotConfig         5
ExterCond         5
BldgType          5
BsmtExposure      4
MiscFeature       4
Fence             4
LotShape          4
LandContour       4
BsmtCond          4
KitchenQual       4
ExterQual         4
BsmtQual          4
LandSlope         3
GarageFinish      3
MasVnrType        3
PavedDrive        3
PoolQC            3
Utilities         2
CentralAir        2
Street            2
Alley             2
dtype: int64

Based on domain knowledge and the explorations above, decide which numerical columns are better treated as categorical, and vice versa.

For demonstration purposes, let's assume the columns in *potential_categorical_from_numerical* are categorical, even though this might not be the case in a real scenario.
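
If a numerically encoded column is judged to be categorical, one option (not used in the rest of this lesson) is to make that decision explicit with pandas' `category` dtype. The toy values below are hypothetical; marking the categories as ordered is an assumption that fits rating-style columns such as `OverallQual`.

```python
import pandas as pd

# Hypothetical toy ratings encoded as integers
toy = pd.DataFrame({"OverallQual": [7, 6, 7, 8, 5]})

# Convert to a categorical dtype and mark the categories as ordered (5 < 6 < 7 < 8)
quality = toy["OverallQual"].astype("category").cat.as_ordered()

print(quality.dtype)
print(list(quality.cat.categories))
print(quality.value_counts().sort_index())
```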

In [46]:
# Extracting columns with object (typically string) data types to create a categorical dataframe
# For demonstration purposes, let's consider the columns in potential_categorical_from_numerical as categorical variables.
df_categorical = pd.concat([df.select_dtypes("object"), potential_categorical_from_numerical], axis=1)

# Adjusting the numerical dataframe by removing the moved columns
df_numerical = df.select_dtypes("number").drop(columns=potential_categorical_from_numerical.columns)

In [48]:
for col in df_categorical.columns:
    fig = px.histogram(df_categorical, x=col)
    fig.show()

In [50]:
for col in df_numerical.columns:
    fig = px.histogram(df_numerical, x=col)
    fig.show()

In [None]:
# Verifying that the total number of columns in the dataframe is the sum of object (string) and numerical columns
len(df.columns) == len(df.select_dtypes("object").columns) + len(df.select_dtypes("number").columns)

In the data cleaning phase, it's important to focus on several essential aspects. First, verify that data is in the correct type and format (**Data Typing/Formatting**). Then, identify and address any duplicates to eliminate redundancy (**Duplicates**). Next, tackle missing values by finding and managing null or absent data (**Missing Values**). For categorical variables, like gender, review the categories (e.g., M, F, Masculine) and their distributions to determine if cleaning is needed (**Categorical Variables**). Similarly, assess numerical data for consistency. Finally, evaluate outliers to decide how to handle these extreme values (**Outliers**). Effective exploration of these areas is crucial for comprehensive data cleaning.
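
As a brief reminder, two of these checks (duplicates and inconsistent category labels) might look like this on toy data. The frame and the `gender_map` dictionary are hypothetical and chosen to mirror the M / F / Masculine example above.

```python
import pandas as pd

# Hypothetical toy data with one duplicated row and inconsistent gender labels
toy = pd.DataFrame({
    "Id": [1, 2, 2, 3, 4],
    "gender": ["M", "F", "F", "Masculine", "f"],
})

# Duplicates: count fully repeated rows, then drop them
n_duplicates = toy.duplicated().sum()
deduped = toy.drop_duplicates()

# Categorical variables: map every spelling variant to one canonical label
gender_map = {"m": "M", "masculine": "M", "f": "F"}
deduped = deduped.assign(gender=deduped["gender"].str.lower().map(gender_map))
counts = deduped["gender"].value_counts()

print(n_duplicates)
print(counts)
```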

We won't be delving into this now, as we already reviewed it in the data cleaning lessons (except for outliers, which we'll review later in this lesson). For now, we will just handle null values, so we can focus on univariate analysis.

## Data Cleaning


### Checking for Missing Data

Missing data can influence our analysis. It's essential to identify and handle it appropriately.


In [51]:
# Checking for missing data
df.isnull().sum().sort_values(ascending=False)

PoolQC         1453
MiscFeature    1406
Alley          1369
Fence          1179
MasVnrType      872
               ... 
ExterQual         0
Exterior2nd       0
Exterior1st       0
RoofMatl          0
SalePrice         0
Length: 81, dtype: int64

In [53]:
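# Boolean mask of missing values (True where a value is null);
# taking the column-wise mean of this mask gives the share of missing values per column
df.isnull()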

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False,False
1,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False,False
3,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False,False
4,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False,False
1456,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False,False,False,False
1457,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False
1458,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False,False


In [54]:
# Identifying columns in the dataframe where over 80% of the values are missing
df.columns[df.isnull().mean() > 0.8]

Index(['Alley', 'PoolQC', 'Fence', 'MiscFeature'], dtype='object')

In [55]:
# Filtering out columns in the dataframe where more than 80% of the values are missing
df = df[df.columns[df.isnull().mean() < 0.8]]
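
The filter above works because the mean of a boolean column is the fraction of `True` values. A minimal sketch on hypothetical toy data (column names borrowed from the dataset, values made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column is entirely missing, one is partly missing
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# isnull() gives True/False per cell; the column mean is the share of missing values
missing_share = toy.isnull().mean()

# Keep only the columns whose share of missing values is below the threshold
kept = toy.loc[:, missing_share < 0.8]

print(missing_share)
print(list(kept.columns))
```

Note that a column with exactly 80% missing values is not listed by the `> 0.8` check but is still removed by the `< 0.8` filter, so it is worth deciding which side of the threshold such a column should fall on.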

In [56]:
# Removing the "Id" column from the dataframe
df = df.drop(columns="Id")

## Univariate Analysis

Univariate analysis, as its name suggests, concentrates on one variable at a time, giving us a deep understanding of its characteristics. This fundamental step in Exploratory Data Analysis (EDA) lays the groundwork for subsequent analyses involving multiple variables. Let's explore various techniques for both categorical and numerical variables.

**Categorical variables**:
- Frequency tables. Counts and proportions.
- Visualizations: Bar charts, pie charts

**Numerical variables**: 
- Measures of centrality: Mean, median, mode
- Measures of dispersion: Variance, standard deviation, minimum, maximum, range, quantiles
- Shape of the distribution: Skewness (symmetry) and kurtosis, covered in the **Extra Stats notebook**
- Visualizations: Histograms, box plots
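
The numerical measures listed above map directly onto pandas `Series` methods. The sketch below computes them on a small hypothetical series; note that `var()` and `std()` use the sample formulas (dividing by n - 1) by default.

```python
import pandas as pd

# Hypothetical toy values
prices = pd.Series([100, 120, 120, 150, 210], name="price")

summary = {
    # Measures of centrality
    "mean": prices.mean(),
    "median": prices.median(),
    "mode": prices.mode().tolist(),  # mode() returns a Series because ties are possible
    # Measures of dispersion
    "variance": prices.var(),
    "std": prices.std(),
    "range": prices.max() - prices.min(),
    "quartiles": prices.quantile([0.25, 0.5, 0.75]).tolist(),
    # Shape of the distribution
    "skewness": prices.skew(),
    "kurtosis": prices.kurt(),
}

for name, value in summary.items():
    print(name, value)
```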

### Categorical Variables

Categorical variables represent categories or labels, like types or groups. Analyzing categorical data involves understanding the frequency or proportion of each category.

#### Frequency Tables

Frequency tables are tabular representations that display the number of occurrences of each category. They help in understanding the distribution of categories in a dataset.

In Python, we can use:
- `value_counts()`
- `pd.crosstab()`

Let's consider *MSZoning* as our categorical variable of interest, which represents the general zoning classification of the sale.

We'll look at `value_counts()`.

In [57]:
# Frequency table for 'MSZoning'
frequency_table = df['MSZoning'].value_counts()

# Calculating the proportion of each unique value in the 'MSZoning'
proportion_table = df['MSZoning'].value_counts(normalize=True)

frequency_table, proportion_table

(MSZoning
 RL         1151
 RM          218
 FV           65
 RH           16
 C (all)      10
 Name: count, dtype: int64,
 MSZoning
 RL         0.788356
 RM         0.149315
 FV         0.044521
 RH         0.010959
 C (all)    0.006849
 Name: proportion, dtype: float64)

The frequency table gives the count of each zoning type, while the proportion table gives each category's share of the dataset as a fraction between 0 and 1 (multiply by 100 for a percentage). This helps to quickly identify dominant and minority categories.

Let's look at `pd.crosstab()`. The crosstab function can be useful to compute a cross-tabulation of two (or more) factors. Here, it's used to count the occurrences of each 'MSZoning' type.

In [58]:
# Creating a crosstab table for the 'MSZoning' column, counting occurrences for each unique value
my_table = pd.crosstab(
    index=df_categorical["MSZoning"],  # rows: one per zoning category
    columns="count",                   # a single column named "count"
)
my_table

| MSZoning | count |
|----------|-------|
| C (all)  | 10    |
| FV       | 65    |
| RH       | 16    |
| RL       | 1151  |
| RM       | 218   |


We can also get the proportion table the following way:

In [59]:
# Calculating the proportion of each 'MSZoning' value.
# Stored under a new name so that 'my_table' keeps holding the counts.
my_proportion_table = pd.crosstab(
    index=df_categorical["MSZoning"],
    columns="count",
    normalize=True,  # divide each count by the total number of rows
)
my_proportion_table

| MSZoning | count    |
|----------|----------|
| C (all)  | 0.006849 |
| FV       | 0.044521 |
| RH       | 0.010959 |
| RL       | 0.788356 |
| RM       | 0.149315 |


The crosstab table displays the number of occurrences of each 'MSZoning' type, just like the frequency table. Passing `normalize=True` returns each category's share of the total instead of the raw count.
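Since `pd.crosstab()` is designed for two or more factors, a short two-way example may help show what it adds over `value_counts()`. This is a sketch on made-up data: the rows below are hypothetical, and `Street` is used only as an illustrative second categorical column.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM", "RL", "RM", "FV"],
    "Street":   ["Pave", "Pave", "Pave", "Grvl", "Pave", "Pave"],
})

# Rows are zoning types, columns are street types;
# margins=True adds row and column totals labelled "All"
two_way = pd.crosstab(index=toy["MSZoning"], columns=toy["Street"], margins=True)
print(two_way)
```

The "All" column of the result reproduces the one-variable frequency table for `MSZoning`.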

**Insights** for 'MSZoning':

- The most common zoning classification is 'RL', which stands for Residential Low Density, comprising approximately 78.8% of the properties in the dataset.
- The second most frequent zoning classification is 'RM' (Residential Medium Density), making up roughly 14.9%.

#### Visualizations

Visualizations offer a more intuitive understanding of categorical data distribution. Bar charts and pie charts are common methods to visually represent categorical data.

##### Bar charts

Bar charts can display the frequency or proportion of each category using bars of varying lengths. Here we draw them with `px.histogram()`, passing the categories as `x` and the precomputed counts as `y`.

Let's see how to use the `px.histogram()` function with the result from `value_counts()` and `pd.crosstab()`. Both cells below plot the same counts, so the bar heights should match.

In [60]:
# Plotting a bar chart from the frequency table: categories on x, counts on y
px.histogram(x=frequency_table.index, y=frequency_table.values)

In [61]:
# Plotting a bar chart from the 'count' column of 'my_table'
px.histogram(x=my_table.index, y=my_table["count"])

**Insights** from the Bar Charts:

1. Both bar charts confirm the dominance of the `RL` zoning classification within the dataset. 
2. The bar representing `RL` is significantly taller than the others, emphasizing its higher frequency.
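3. The two charts show the same bar heights, confirming that `value_counts()` and `pd.crosstab()` produce the same counts. Only the category order may differ: `value_counts()` sorts by frequency, while `pd.crosstab()` sorts by label.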

##### Pie charts



Pie charts provide a circular representation of the data, showing the proportion of each category as slices of a pie. However, they can be challenging to interpret when there are many categories or when categories have similar proportions.

Seaborn, as of 2023, does not have a dedicated function for pie charts; with it, pie charts are usually drawn through `matplotlib`, which Seaborn is built upon. Here we use Plotly Express instead: `px.pie()` counts the rows for each value of the column passed as `names`.

In [62]:
# Plotting a pie chart of the 'MSZoning' column value counts, with percentage labels
px.pie(df, names='MSZoning')

**Insights**:

- The pie chart provides a clear visual representation of the dominance of the 'RL' (Residential Low Density) zoning classification, occupying a significant portion of the chart.
- Other zoning types like 'RM', 'FV', 'RH', and 'C (all)' occupy much smaller slices, emphasizing the skewed distribution.
- While pie charts can illustrate proportions effectively, the dominance of 'RL' makes it somewhat challenging to discern differences between the smaller categories. This underscores why alternative visualizations, like bar charts, can sometimes be more informative for such distributions.

### Numerical Variables

Numerical variables are quantitative, and their values can be measured. Analyzing numerical data involves understanding its distribution, central tendency, and variability.


#### Summary Statistics

**Centrality and Dispersion Measures**

Let's start by getting some basic statistics on our dataset to understand its scale, centrality, and spread.


- The `.describe()` method provides key statistics for numerical columns (by default) in a dataframe, excluding NaN values; although it primarily targets numeric data, the `include` parameter allows for the selection of other data types.

In [64]:
# Summary statistics for the dataset
df.describe().round(2)

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1379.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,56.9,70.05,10516.83,6.1,5.58,1971.27,1984.87,103.69,443.64,46.55,567.24,1057.43,1162.63,346.99,5.84,1515.46,0.43,0.06,1.57,0.38,2.87,1.05,6.52,0.61,1978.51,1.77,472.98,94.24,46.66,21.95,3.41,15.06,2.76,43.49,6.32,2007.82,180921.2
std,42.3,24.28,9981.26,1.38,1.11,30.2,20.65,181.07,456.1,161.32,441.87,438.71,386.59,436.53,48.62,525.48,0.52,0.24,0.55,0.5,0.82,0.22,1.63,0.64,24.69,0.75,213.8,125.34,66.26,61.12,29.32,55.76,40.18,496.12,2.7,1.33,79442.5
min,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1900.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,223.0,795.75,882.0,0.0,0.0,1129.5,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1961.0,1.0,334.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,477.5,991.5,1087.0,0.0,0.0,1464.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1980.0,2.0,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,808.0,1298.25,1391.25,728.0,0.0,1776.75,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2002.0,2.0,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,2336.0,6110.0,4692.0,2065.0,572.0,5642.0,3.0,2.0,3.0,2.0,8.0,3.0,14.0,3.0,2010.0,4.0,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


From `describe()` we get:
- Measures of centrality: mean, median (indicated as 50%)
- Measures of dispersion: standard deviation (std), minimum, maximum, quartiles (Q1, Q2, Q3, indicated as 25%, 50%, and 75% respectively)
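To illustrate the `include` parameter mentioned above, here is a small sketch on a made-up two-column DataFrame (the rows are hypothetical). With `include="all"`, `describe()` reports `unique`, `top`, and `freq` for non-numeric columns alongside the usual numeric statistics.

```python
import pandas as pd

# Hypothetical toy data: one categorical and one numerical column
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM", "RL"],
    "SalePrice": [200_000, 150_000, 120_000, 250_000],
})

# include="all" summarizes every column; statistics that do not apply are NaN
full_summary = toy.describe(include="all")
print(full_summary)
```

Here `top` is the most frequent category and `freq` is how often it occurs, which is effectively a one-line frequency summary.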

**Insights** from Summary Statistics for 'SalePrice':

- The average (mean) sale price of the houses in the dataset is approximately `$180,921`.

- The median sale price (middle value when sorted) stands at `$163,000`. Notably, the median is lower than the mean, suggesting a skew in the distribution of sale prices towards higher values.

- The standard deviation, a measure of the amount of variation or dispersion in the sale prices, is approximately `$79,442`. This indicates that sale prices can vary significantly from the average.

- The minimum and maximum sale prices are `$34,900` and `$755,000`, respectively, highlighting a wide range of property values in the dataset.

- The middle 50% of the houses sold for between `$129,975` (Q1, the 25% value) and `$214,000` (Q3, the 75% value). The interquartile range (IQR) is the width of that interval: `$214,000 - $129,975 = $84,025`.

#### More Centrality and Dispersion Measures

Now, suppose we want to calculate individual statistical measures without using the `.describe()` method. Here are some ways to do it:

- `df[column].mean()`: Computes the mean of the selected column.
- `df[column].median()`: Calculates the median of the selected column.
- `df[column].mode()`: Identifies the mode of the selected column.
- `df[column].std()`: Determines the standard deviation of the selected column.
- `df[column].var()`: Computes the variance of the selected column.
- `df[column].min()`: Finds the minimum value in the selected column.
- `df[column].max()`: Finds the maximum value in the selected column.
- `df[column].count()`: Counts the number of non-NaN entries in the selected column.

In these examples, replace `column` with the name of the column you want to analyze.
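Several of these measures can also be requested in one call with `Series.agg()`. The sketch below uses a small made-up series (hypothetical values, not the real 'SalePrice' column).

```python
import pandas as pd

# Hypothetical toy prices
prices = pd.Series([100, 200, 200, 300, 700], name="SalePrice")

# agg() accepts a list of statistic names and returns one value per name
stats = prices.agg(["mean", "median", "std", "var", "min", "max", "count"])
print(stats)
```

Note that `std` and `var` use the sample formulas (dividing by n - 1) by default, so `std` squared equals `var`.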

For this section, we'll focus on 'SalePrice' as our numerical variable of interest, which represents the price at which the house was sold.

**Measures of Centrality**

In [65]:
mean_price = df['SalePrice'].mean()
median_price = df['SalePrice'].median()
mode_price = df['SalePrice'].mode()[0]

mean_price, median_price, mode_price

(180921.19589041095, 163000.0, 140000)
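The `[0]` after `.mode()` is there because `mode()` returns a Series, not a single number: when several values tie for most frequent, all of them are returned in sorted order. A small sketch with made-up prices shows the tie case.

```python
import pandas as pd

# Hypothetical toy prices where two values tie for most frequent
prices = pd.Series([140_000, 140_000, 155_000, 155_000, 180_000])

modes = prices.mode()        # Series containing every tied mode, sorted ascending
first_mode = modes.iloc[0]   # taking the first one keeps only the smallest mode

print(modes.tolist(), first_mode)
```

It can be worth checking `len(modes)` before reporting a single mode, since picking the first value silently hides ties.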

**Measures of Dispersion**

In [66]:
variance_price = df['SalePrice'].var()
std_dev_price = df['SalePrice'].std()
min_price = df['SalePrice'].min()
max_price = df['SalePrice'].max()
range_price = max_price - min_price
quantiles_price = df['SalePrice'].quantile([0.25, 0.5, 0.75])

variance_price, std_dev_price, min_price, max_price, range_price, quantiles_price

(6311111264.297448,
 79442.50288288662,
 34900,
 755000,
 720100,
 0.25    129975.0
 0.50    163000.0
 0.75    214000.0
 Name: SalePrice, dtype: float64)

In [None]:
df['SalePrice'].quantile(0.1) # We can get any quantile value, not just quartiles

**Insights** from Measures of Centrality and Dispersion for 'SalePrice', for those metrics not calculated in `describe()`:

- **Centrality**:
  - The most frequent (mode) sale price is $140,000. This value appears more frequently than any other price in the dataset.
  
- **Dispersion**:
  - The variance, the average squared deviation of the sale prices from the mean, is approximately \(6,311,111,264\) (in squared dollars). The squared units make it hard to interpret directly, which is why its square root, the standard deviation, is usually reported instead.
  - The range of sale prices is $720,100, calculated as the difference between the maximum and minimum prices. This wide range underscores the diversity in property prices within the dataset.

#### Shape of the Distribution

Skewness and kurtosis provide insights into the shape of the data distribution. Skewness indicates the asymmetry, and kurtosis tells us about the "tailedness" or how peaked the distribution is.

In [68]:
skewness_price = df['SalePrice'].skew()
kurtosis_price = df['SalePrice'].kurtosis()

skewness_price, kurtosis_price

(1.8828757597682129, 6.536281860064529)

- Skewness of 'SalePrice': \(1.88\)
- Kurtosis of 'SalePrice': \(6.54\)

**Insights**:

1. **Skewness**: The positive value of skewness (1.88) for the 'SalePrice' indicates that the distribution is right-skewed. This means that the tail on the right side (higher prices) is longer than the left side (lower prices). In practical terms, a relatively small number of houses sold at much higher prices than the rest, and they pull the mean above the median.
  
2. **Kurtosis**: pandas' `.kurtosis()` reports *excess* kurtosis (Fisher's definition), for which a normal distribution scores 0. The value of 6.54 is therefore well above the normal benchmark, which indicates that the 'SalePrice' distribution has heavier tails than a normal distribution. This means that there are more extreme values in 'SalePrice' than one would expect in a normally distributed set.

The skewness and kurtosis values suggest that there are some houses that are sold at significantly higher prices than the majority, and these are affecting the overall distribution of house prices in the dataset.
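To get a feel for these two numbers, it can help to compute them on simulated data whose shape is known. The sketch below draws a normal sample and a right-skewed (lognormal) sample with NumPy; the seed and sample size are arbitrary choices. pandas documents `Series.kurtosis()` as using Fisher's definition, so the normal sample should land near 0 for both measures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# A symmetric reference sample and a right-skewed sample
normal_sample = pd.Series(rng.normal(size=100_000))
skewed_sample = pd.Series(rng.lognormal(size=100_000))

normal_skew = normal_sample.skew()
normal_kurtosis = normal_sample.kurtosis()   # excess kurtosis: about 0 for normal data
skewed_skew = skewed_sample.skew()           # clearly positive: long right tail
skewed_kurtosis = skewed_sample.kurtosis()   # large: heavy right tail

print(normal_skew, normal_kurtosis)
print(skewed_skew, skewed_kurtosis)
```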

#### Visualizations

Visual tools like histograms and box plots offer insights into the distribution, variability, and potential outliers in numerical data.

##### Histograms

Histograms display the frequency distribution of a dataset. The height of each bar represents the number of data points in each bin.

In [None]:
# Plotting a histogram for the 'SalePrice' column of 'df', using about 30 bins
px.histogram(df, x="SalePrice", nbins=30)

**Insights:**
- The histogram reveals that the majority of the houses are sold in the price range of approximately `$100,000` to `$250,000`. However, there's a long tail on the right side, confirming our earlier inference from the skewness value that there are houses sold at much higher prices.

If we wanted to plot histograms for all the numerical variables at once, without a for loop, we could do so using pandas' `DataFrame.hist()`, which draws with matplotlib:

In [None]:
# Creating histograms for each numerical column in 'df_numerical'
df_numerical.hist(figsize=(15, 20), bins=60, xlabelsize=10, ylabelsize=10);

Just by looking at it, which ones do you think could be correlated to SalePrice?

##### Box plots

Box plots, also called box-and-whisker plots, show the median, the central 50% of the data (interquartile range), and potential outliers.

In [None]:
# Plotting a horizontal box plot for the 'SalePrice' column
px.box(df, x="SalePrice")


**Insights:**
- The box plot gives us a visual representation of the central 50% of the data (the interquartile range) with the median price shown as a line inside the box. Each whisker extends to the most extreme data point that lies within 1.5 times the interquartile range of the box edge (Q1 or Q3), and points beyond the whiskers are drawn individually as potential outliers. As observed, there are several outlier points on the higher end of the sale prices, which aligns with our earlier insights about houses sold at significantly higher prices.

Both visualizations underscore the presence of outliers in the higher price range. These outliers might be luxury homes or properties in prime locations, and special attention might be needed when building predictive models, as these outliers can influence model performance.
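The outlier rule used by box plots can also be applied numerically. The sketch below computes the 1.5 × IQR fences on a small made-up series (hypothetical values); `lower_fence` and `upper_fence` are illustrative names. Plotting libraries may compute quartiles slightly differently from `Series.quantile()`, so the points flagged here need not match a chart exactly.

```python
import pandas as pd

# Hypothetical toy prices with one unusually high value
prices = pd.Series([100, 120, 130, 140, 150, 160, 170, 180, 400], name="SalePrice")

q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond these fences are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(iqr, lower_fence, upper_fence, outliers.tolist())
```

On the real data, the same steps with `df['SalePrice']` would give a count of how many sales fall outside the fences.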

## Converting continuous to discrete variables: Discretization

Discretization is the process of converting continuous variables into discrete ones by creating a set of contiguous intervals (or bins) and then categorizing the variables into these intervals. This can be particularly useful when you want to categorize a continuous variable into different groups based on ranges. Note that we usually lose information in this process.

For our dataset, let's take the 'SalePrice' column, which is continuous, and discretize it into categories like 'Low', 'Medium', 'High', and 'Very High'.

In [74]:
# Discretizing 'SalePrice' into 4 categories
bins = [0, 100_000, 200_000, 300_000, df['SalePrice'].max()]
labels = ['Low', 'Medium', 'High', 'Very High']
df['SalePrice_category'] = pd.cut(df['SalePrice'], bins=bins, labels=labels, include_lowest=True)

In [73]:
df.SalePrice_category.value_counts()

SalePrice_category
Medium       910
High         312
Low          123
Very High    115
Name: count, dtype: int64
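By default `pd.cut()` builds right-closed intervals, so a value that sits exactly on a bin edge goes into the lower bin. The sketch below shows this with a few made-up prices (hypothetical values) and the same bin edges and labels as above.

```python
import pandas as pd

# Hypothetical toy prices, including values exactly on the bin edges
prices = pd.Series([34_900, 100_000, 100_001, 300_000, 755_000])

bins = [0, 100_000, 200_000, 300_000, prices.max()]
labels = ["Low", "Medium", "High", "Very High"]

# Same call twice: once with labels, once without, to reveal the actual intervals
with_labels = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)
intervals = pd.cut(prices, bins=bins, include_lowest=True)

print(pd.DataFrame({"price": prices, "category": with_labels, "interval": intervals}))
```

A price of exactly 100,000 is labelled 'Low', while 100,001 is 'Medium'. Passing `right=False` would make the intervals left-closed instead.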

Another useful option is **discretizing by quantiles**. This means dividing the data into intervals based on specific quantile values. This ensures that each bin has (approximately) the same number of data points. The `pandas` library provides a convenient method, `qcut()`, for this purpose.

Discretizing by quantiles can be particularly useful when you want to create categories that represent relative rankings (like low, medium, high, etc.) based on the distribution of the data, rather than fixed numeric ranges.

**Step 1**: Choose the number of quantiles (or bins). For example, if you want quartiles, you would choose 4 bins. 

**Step 2**: Use the `qcut()` function from `pandas`.

In [71]:
# Discretizing 'SalePrice' into quartiles
df['SalePrice_quantile'] = pd.qcut(df['SalePrice'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

df.SalePrice_quantile.value_counts()

In the above code:
- `q=4` indicates that we want to divide the data into 4 quantiles (quartiles).
- `labels=['Q1', 'Q2', 'Q3', 'Q4']` provides custom labels for each quantile bin.

The resulting 'SalePrice_quantile' column will categorize each house's sale price into one of the four quartiles.

By discretizing 'SalePrice', we have transformed a continuous variable into categorical bins. This can simplify the analysis by grouping houses into broad price categories. For example, you can now easily analyze the number of houses in each price range or determine if certain features are more common in high-priced houses compared to low-priced ones.
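To see that quantile-based bins really are balanced, and where their edges fall, `qcut()` can return the computed edges through `retbins=True`. A minimal sketch on made-up data (the integers 1 to 100 rather than real prices):

```python
import pandas as pd

# Hypothetical toy values: the integers 1..100
values = pd.Series(range(1, 101))

# retbins=True also returns the bin edges that qcut computed from the data
quartile, edges = pd.qcut(values, q=4, labels=["Q1", "Q2", "Q3", "Q4"], retbins=True)

counts = quartile.value_counts().sort_index()
print(counts)
print(edges)
```

With real data that contains many repeated values, the bins can be slightly uneven, because identical values always land in the same bin.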

### 💡 Check for understanding

Discretize the '1stFlrSF' column (first-floor square feet) into three categories: 'Small', 'Medium', 'Large'. Set the bins such that 'Small' includes sizes up to the 33rd percentile, 'Medium' includes sizes from the 33rd to the 66th percentile, and 'Large' includes sizes from the 66th percentile onward. How many houses fall into each category?

In [36]:
# Your code goes here

## Summary

In this lesson, we've conducted a comprehensive univariate analysis:

- For **categorical variables**, we visualized the distribution of our zoning classifications with bar and pie charts, backed by frequency tables.
- For **numerical variables**, we explored the central tendencies, dispersions and shape of distribution of our sale prices, visualized through histograms and box plots.

This analysis allows us to deeply understand each variable, laying a strong foundation for subsequent multivariate analyses.

## 💡 Check for understanding

**Scenario**:
Given the 'TotRmsAbvGrd' column (total rooms above ground), let's dive deep into its univariate characteristics.

**Tasks**:

1. **Data Aggregation**:
    - Create a frequency table for 'TotRmsAbvGrd' to understand the distribution of the number of rooms in houses.
    - Calculate the mean, median, mode, variance, and standard deviation of 'TotRmsAbvGrd'.

2. **Visualization**:
    - Plot a histogram for 'TotRmsAbvGrd' to understand its distribution.
    - Plot a box plot for 'TotRmsAbvGrd' to visualize its central tendency, spread, and potential outliers.

3. **Interpretation**:
    - Is the distribution of the number of rooms skewed? If so, in which direction?
    - Based on the histogram and box plot, what can you infer about the common number of rooms above ground in houses? 
    - Are there any noticeable outliers in the number of rooms? If so, are there more houses with unusually many rooms or unusually few?


In [37]:
# Your code goes here

In [67]:
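# Keep only the houses that sold for more than $500,000
df_expensive = df[df["SalePrice"] > 500_000]

# Count of each kitchen quality rating among those houses
px.histogram(df_expensive, x="KitchenQual")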