# Exploratory Data Analysis

Let's import the necessary libraries to do some data exploration

In [None]:
import pandas as pd
import seaborn as sns
sns.set_theme(font_scale=1.2, rc={"figure.figsize": (8, 6)})

## Iris dataset

In [None]:
from IPython.display import IFrame
url = "https://en.wikipedia.org/wiki/Iris_flower_data_set"
IFrame(url, width="100%", height=400)

Let's load the Iris dataset using the scikit-learn library

In [None]:
from sklearn import datasets

iris = datasets.load_iris()

The `load_iris()` function returns a dict-like `Bunch` object containing the following keys

In [None]:
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [None]:
print(f"Number of samples: {iris.data.shape[0]}")
print(f"Number of features: {iris.data.shape[1]}")
print(f"Feature names: {iris.feature_names}")
print(f"Target classes: {iris.target_names}")

Number of samples: 150
Number of features: 4
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target classes: ['setosa' 'versicolor' 'virginica']


You can get a summary of the dataset with the *DESCR* key

In [None]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

Let's create a DataFrame so that we can manipulate the loaded data

In [None]:
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris["species"] = iris.target_names[iris.target]

In [None]:
df_iris.head(5)

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
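
The `species` column above is built with NumPy integer-array indexing: indexing the array of class names with the array of integer labels returns one name per sample. A minimal sketch, using a short made-up label array rather than the real `iris.target`:

```python
import numpy as np

# Class names in the same order as the integer labels (0, 1, 2)
target_names = np.array(["setosa", "versicolor", "virginica"])

# Hypothetical labels for four samples (not taken from the dataset)
target = np.array([0, 2, 1, 0])

# Integer-array indexing picks target_names[i] for every label i
species = target_names[target]
print(species)
```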


### Dataset statistics

Let's compute the summary statistics for the continuous attributes

In [None]:
df_iris.describe()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


And the summary statistics of the *sepal length* for each *species*

In [None]:
df_iris.groupby("species")["sepal length (cm)"].describe()

species,count,mean,std,min,25%,50%,75%,max
setosa,50.0,5.006,0.35249,4.3,4.8,5.0,5.2,5.8
versicolor,50.0,5.936,0.516171,4.9,5.6,5.9,6.3,7.0
virginica,50.0,6.588,0.63588,4.9,6.225,6.5,6.9,7.9


Another interesting statistic is the interquartile range

In [None]:
df_iris.groupby("species").quantile(q=0.75) - df_iris.groupby("species").quantile(q=0.25)

species,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
setosa,0.4,0.475,0.175,0.1
versicolor,0.7,0.475,0.6,0.3
virginica,0.675,0.375,0.775,0.5
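
The interquartile range can also be computed in a single group-by pass by handing a small helper to `agg`. The sketch below is one possible way to do it; the `iqr` helper and the toy DataFrame are made up for illustration and are not part of the Iris data.

```python
import pandas as pd


def iqr(values: pd.Series) -> float:
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)


# Hypothetical toy data with two groups of five measurements each
toy = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "length": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# One IQR value per group
iqr_by_group = toy.groupby("group")["length"].agg(iqr)
print(iqr_by_group)
```

On the notebook's data, the same helper could be applied as `df_iris.groupby("species").agg(iqr)`, which avoids grouping twice.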


Or the input features' range

In [None]:
df_iris.groupby("species").max() - df_iris.groupby("species").min()

species,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
setosa,1.5,2.1,0.9,0.5
versicolor,2.1,1.4,2.1,0.8
virginica,3.0,1.6,2.4,1.1


## Ames Housing dataset

We can now look at a more complex dataset, consisting of 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa.

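First, download the dataset:

In [None]:
# NOTE: this URL serves Wisconsin_breast_cancer_dataset.csv, not the Ames Housing data;
# point it at the Ames Housing CSV before running the rest of this section.
!wget "https://www.dropbox.com/s/84fnsub46ddt6hs/Wisconsin_breast_cancer_dataset.csv?dl=0" -O Ames_housing_dataset.csv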


Next we can load the dataset and see some information

In [None]:
df_house = pd.read_csv("Ames_housing_dataset.csv")
df_house.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

Let's remove the `Id` column which we do not need

In [None]:
del df_house["Id"]

KeyError: ignored

In [None]:
df_house.head(5)

### Missing values

There are many features with missing values which should be taken into account. 

We can compute the percentage of missing values for each feature

In [None]:
percent_missing = df_house.isnull().sum() / len(df_house) * 100
percent_missing = percent_missing[percent_missing != 0].sort_values(ascending=False)
percent_missing.name = "Missing ratio"
percent_missing

Series([], Name: Missing ratio, dtype: float64)
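
To make the computation above concrete, here is a small sketch on a made-up DataFrame (the column names `a`, `b`, `c` are hypothetical). It uses `isnull().mean()`, which is equivalent to dividing the count of missing values by the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a" is half missing, "b" a quarter missing, "c" complete
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# The mean of a boolean mask is the fraction of True values, i.e. of missing entries
percent_missing_toy = toy.isnull().mean() * 100

# Keep only the columns that actually have missing values, largest first
percent_missing_toy = percent_missing_toy[percent_missing_toy > 0].sort_values(ascending=False)
print(percent_missing_toy)
```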

These features are both numerical and categorical

In [None]:
df_house[percent_missing.index].dtypes

Series([], dtype: object)

Let's plot these values for a clear visualization

In [None]:
_ = sns.barplot(x=percent_missing, y=percent_missing.index)

ValueError: ignored

Some of the attributes have the majority of their values unknown (>80%).

We need to carefully assess the **meaning** of these **missing values before** applying **imputation**, otherwise we could end up filling our features with uninformative or wrong data.

In a real-world scenario we would ask the domain experts whether some missing values have a special meaning. In this case, we have a description for the majority of the dataset's features given in *data_description*.

We can see that the meaning of the unknown values is:

*   *PoolQC*: Pool quality, NA means "No Pool"
*   *MiscFeature*: Miscellaneous feature not covered in other categories, NA means "None"
*   *Alley*: Type of alley access to property, NA means "No alley access"
*   *Fence*: Fence quality, NA means "No Fence"
*   *FireplaceQu*: Fireplace quality, NA means "No Fireplace"
*   *GarageType*: Garage location, NA means "No Garage"
*   *GarageFinish*: Interior finish of the garage, NA means "No Garage"
*   *GarageQual*: Garage quality, NA means "No Garage"
*   *GarageCond*: Garage condition, NA means "No Garage"
*   *BsmtExposure*: Refers to walkout or garden level walls, NA means "No Basement"
*   *BsmtFinType1*: Rating of basement finished area, NA means "No Basement"
*   *BsmtFinType2*: Rating of basement finished area (if multiple types), NA means "No Basement"
*   *BsmtCond*: Evaluates the general condition of the basement, NA means "No Basement"
*   *BsmtQual*: Evaluates the height of the basement, NA means "No Basement"
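
One reason these categories show up as missing is that `pd.read_csv` treats the literal string `NA` as a missing value by default. The sketch below illustrates this on a tiny in-memory CSV; the two-column content is made up, and it assumes the real file encodes "No Pool" as the string `NA`, as the data description suggests.

```python
import io

import pandas as pd

# Hypothetical CSV excerpt: "NA" is a category label here, not a missing measurement
csv_text = "PoolArea,PoolQC\n0,NA\n512,Gd\n0,NA\n"

# Default parsing converts the string "NA" to NaN
parsed_default = pd.read_csv(io.StringIO(csv_text))

# Disabling the default NA strings keeps "NA" as text; only empty cells become NaN
parsed_raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(parsed_default["PoolQC"].isna().sum())
print(parsed_raw["PoolQC"].tolist())
```

Keeping the default parsing and imputing explicitly, as this notebook does, has the advantage that genuinely unknown values and "not applicable" values can be told apart feature by feature.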


However, we do not know the meaning of unknown values for:

*   *LotFrontage*: Linear feet of street connected to property
*   *GarageYrBlt*: Year garage was built
*   *MasVnrArea*: Masonry veneer area in square feet
*   *MasVnrType*: Masonry veneer type
*   *Electrical*: Electrical system

Accordingly, we need to take this information into account when dealing with the missing values of these variables.



### Impute missing values

Let's impute the missing values for the categorical features for which we know the meaning

In [None]:
# Assign the result back: calling fillna(..., inplace=True) on a selected column
# may act on a copy and leave df_house unchanged under pandas copy-on-write
df_house["PoolQC"] = df_house["PoolQC"].fillna("NoPool")
df_house["MiscFeature"] = df_house["MiscFeature"].fillna("None")
df_house["Alley"] = df_house["Alley"].fillna("NoAccess")
df_house["Fence"] = df_house["Fence"].fillna("NoFence")
df_house["FireplaceQu"] = df_house["FireplaceQu"].fillna("NoFireplace")

for col in ("GarageType", "GarageFinish", "GarageQual", "GarageCond"):
    df_house[col] = df_house[col].fillna("NoGarage")

for col in ("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"):
    df_house[col] = df_house[col].fillna("NoBasement")

For the features whose missing values have no documented meaning, we can apply some heuristics.

For *LotFrontage*, since the street frontage of a property is most likely similar to that of other houses in its neighborhood, we can fill in missing values using the median *LotFrontage* of the *Neighborhood*

In [None]:
df_house["LotFrontage"] = (
    df_house.groupby("Neighborhood")["LotFrontage"]
    .transform(lambda x: x.fillna(x.median()))
)
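
To see what the group-wise median imputation does, here is a minimal sketch on made-up data (the neighborhoods `A` and `B` and their values are hypothetical). It uses an equivalent formulation: `transform("median")` broadcasts each group's median back to the original rows, and `fillna` then picks it up only where a value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing frontage per neighborhood
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 20.0, 30.0, np.nan],
})

# Per-row median of the row's own neighborhood (NaN values are skipped)
group_median = toy.groupby("Neighborhood")["LotFrontage"].transform("median")

# Replace only the missing entries with their neighborhood median
toy["LotFrontage"] = toy["LotFrontage"].fillna(group_median)
print(toy)
```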

We can see that missing values for *GarageYrBlt* correspond to garage features whose value is *NoGarage*

In [None]:
df_house.loc[df_house.GarageYrBlt.isna(), ["GarageType", "GarageFinish", "GarageQual", "GarageCond"]].value_counts()

We can therefore set its value to 0

In [None]:
df_house["GarageYrBlt"] = df_house["GarageYrBlt"].fillna(0)

*MasVnrArea* and *MasVnrType* have missing values for the same houses (rows), which most likely means there is no masonry veneer

In [None]:
# Compare the two missing-value masks row by row; this also works when the counts differ
(df_house.MasVnrArea.isna() == df_house.MasVnrType.isna()).all()

We can therefore fill in 0 for *MasVnrArea* and *NoMasVnr* for *MasVnrType*

In [None]:
df_house["MasVnrArea"] = df_house["MasVnrArea"].fillna(0)
df_house["MasVnrType"] = df_house["MasVnrType"].fillna("NoMasVnr")

Finally, *Electrical* has only one missing value

In [None]:
df_house.Electrical.value_counts(dropna=False)

We can set the missing value to the most frequent one, i.e., *SBrkr*

In [None]:
df_house.Electrical.mode()

In [None]:
df_house["Electrical"] = df_house["Electrical"].fillna(df_house["Electrical"].mode()[0])
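
The mode-based imputation can be checked on a small made-up Series (the values below are hypothetical, not taken from the dataset). Note that `mode()` ignores missing values by default and returns a Series, since several values can tie for most frequent; `[0]` takes the first of them in sorted order.

```python
import pandas as pd

# Hypothetical toy column with one missing entry
electrical = pd.Series(["SBrkr", "FuseA", "SBrkr", None, "FuseF"])

# Most frequent non-missing value
most_frequent = electrical.mode()[0]

# Fill the gap with it
filled = electrical.fillna(most_frequent)
print(filled.tolist())
```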

We can check that our DataFrame has no missing values left

In [None]:
df_house.isnull().values.any()

What would have happened if we did not use or did not have the data description?

For example, a missing value of *PoolQC* might have indicated that no quality control was performed.

However, on closer inspection, you can see that *PoolArea* is zero whenever *PoolQC* is missing, so it is safe to assume that a missing value of *PoolQC* means *NoPool*.
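
Such an inspection could look like the sketch below, shown here on a made-up four-row DataFrame rather than the real data. On the notebook's data, the same check would have to run before *PoolQC* is imputed.

```python
import pandas as pd

# Hypothetical toy data: three houses without a pool, one with a pool
toy = pd.DataFrame({
    "PoolArea": [0, 0, 512, 0],
    "PoolQC": [None, None, "Ex", None],
})

# Select the pool areas of the rows where the quality is missing
area_when_missing = toy.loc[toy["PoolQC"].isna(), "PoolArea"]

# If every one of them is zero, a missing PoolQC plausibly means "no pool"
no_pool_when_missing = (area_when_missing == 0).all()
print(no_pool_when_missing)
```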

# References

1.   pandas `melt`: https://pandas.pydata.org/docs/reference/api/pandas.melt.html
2.   Ames Housing dataset: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview
3.   pandas groupby transformation: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#transformation
