# Week 1 - Data Exploration


In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_theme()

In [None]:
sample_submission = pd.read_csv("../data/submission_format.csv")

X_test = pd.read_csv("../data/test_values.csv")
X_train = pd.read_csv("../data/train_values.csv")
y_train = pd.read_csv("../data/train_labels.csv")

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
X_train.info()

In [None]:
df = X_train.merge(y_train, on="building_id")  # noqa: PD901
df.info()

## Missing Values

Analyze missing values and find uninformative columns.


In [None]:
df.isna().sum().any()
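
The check above only reports whether any value is missing at all. A minimal sketch of a more detailed look, assuming a small made-up frame (`toy` and its values are illustrative, not the competition data), counts missing values per column and lists columns that hold a single value and are therefore uninformative:

```python
import numpy as np
import pandas as pd

# Toy data standing in for df; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "age": [10, 25, np.nan, 40],
        "count_floors": [2, 2, 2, 2],  # constant column
        "roof_type": ["n", "q", "n", None],
    }
)

# Number of missing values per column.
missing_per_col = toy.isna().sum()

# Columns with at most one distinct non-missing value cannot separate rows.
constant_cols = toy.columns[toy.nunique() <= 1].tolist()

print(missing_per_col)
print(constant_cols)
```

The same two expressions could be applied to `df` directly.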

## Univariate Analysis

Analyze the distribution of individual features. Are there any imbalances or outliers?


In [None]:
df.damage_grade.value_counts(normalize=True).plot.pie(autopct="%1.1f%%")
plt.title("Damage Grade Distribution")
plt.show()

The target class is imbalanced: more than half of the buildings fall into damage grade 2.

In [None]:
cat_cols = X_train.select_dtypes(include="object").columns
numeric_cols = X_train.select_dtypes(include="int64").columns
binary_cols = [col for col in X_train.columns if col.startswith("has")]
numeric_cols = [col for col in numeric_cols if col not in binary_cols]

df[cat_cols].nunique()

In [None]:
n = len(binary_cols)
fig = plt.figure(figsize=(20, 40))
for i, col in enumerate(binary_cols):
    ax = plt.subplot(n // 3 + 1, 3, i + 1)
    df[col].value_counts(normalize=True).plot.pie(autopct="%1.1f%%")
    ax.set_title(col)

The binary columns indicate which kinds of superstructure exist (first 11 features), whether a building has a secondary use, and which type of secondary use that is (last 10 features). The binary columns are all imbalanced. has_superstructure_mud_mortar_stone and has_superstructure_timber are the least imbalanced; their less frequent value still accounts for over 20%. The general feature has_secondary_use shows that 11.2% of buildings have a secondary use. Of the specific types of secondary use, agriculture occurs most often (6.4% of all buildings, which is more than half of the buildings with a secondary use). Most of the others occur very rarely: 8 of them occur in less than 1% of cases, and 4 are shown as 0.0%. Features that do not occur at all carry no information and can be removed. It might also make sense to remove others that occur rarely.
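
One way to turn this observation into a list of removal candidates is sketched below. It assumes toy 0/1 columns with made-up values, and the cut-off `min_share` is a hypothetical choice rather than a recommendation:

```python
import pandas as pd

# Toy binary flags standing in for the has_* columns (made-up values).
toy = pd.DataFrame(
    {
        "has_secondary_use": [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        "has_secondary_use_agriculture": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        "has_secondary_use_other": [0] * 10,
    }
)

# For 0/1 columns the mean equals the share of rows with value 1.
positive_share = toy.mean()

# Hypothetical cut-off; a sensible value depends on the model and data size.
min_share = 0.15
rare_cols = positive_share[positive_share < min_share].index.tolist()

# Flags that are never set are constant and carry no information.
never_set = positive_share[positive_share == 0].index.tolist()

print(rare_cols)
print(never_set)
```

Checking for an exact share of zero avoids relying on the rounded percentages in the pie charts.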

In [None]:
# correlation with target (taken from df so that rows stay aligned after the merge)
corr_target = df[numeric_cols + binary_cols].corrwith(df["damage_grade"])
corr_target.sort_values().plot.barh()

The strongest correlations between a feature and the target are found for has_superstructure_mud_mortar_stone (about 0.3) and has_superstructure_cement_mortar_brick (about -0.25). Most features correlate only weakly with the target (absolute value below 0.1). Building id and geo_level_id have correlations very close to 0, so they could be removed.

In [None]:
print(corr_target.loc[lambda x: (abs(x) < 0.05)].sort_values())

These are the features whose absolute correlation with the target is below 0.05. It might make sense to remove some of them as well.
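
The filtering step can be sketched on toy data as follows. The frame, its column names and the 0.05 cut-off are assumptions for illustration; `corrwith` computes the Pearson correlation by default:

```python
import pandas as pd

# Toy data: feature_a rises with the target, feature_c falls,
# and feature_b is unrelated to it (all values are made up).
toy = pd.DataFrame(
    {
        "feature_a": [1, 2, 3, 4, 5, 6],
        "feature_b": [0, 1, 0, 1, 0, 1],
        "feature_c": [6, 5, 4, 3, 2, 1],
        "damage_grade": [1, 1, 2, 2, 3, 3],
    }
)

# Correlation of every feature with the target column.
features = toy.drop(columns="damage_grade")
corr_target = features.corrwith(toy["damage_grade"])

# Candidates for removal: absolute correlation below the cut-off.
weak_cols = corr_target[corr_target.abs() < 0.05].index.tolist()

print(corr_target)
print(weak_cols)
```

A correlation near zero only rules out a linear relationship, so the resulting list is a set of candidates to inspect rather than a final decision.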

In [None]:
# visualize categorical columns

n = len(cat_cols)
fig = plt.figure(figsize=(20, 15))
for i, col in enumerate(cat_cols):
    ax = plt.subplot(n // 3 + 1, 3, i + 1)
    sns.countplot(data=df, x=col, ax=ax)

The categorical features are all imbalanced. For plan_configuration and legal_ownership_status, the counts of all values other than the most frequent one are very close to zero, so only a few buildings differ in their value for these columns.
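
The dominance of a single category can also be expressed as a number. The sketch below assumes toy columns with made-up values and a hypothetical threshold of 0.9:

```python
import pandas as pd

# Toy categorical columns (made-up values): one is dominated by a
# single category, the other is more spread out.
toy = pd.DataFrame(
    {
        "plan_configuration": ["d"] * 19 + ["q"],
        "roof_type": ["n"] * 12 + ["q"] * 5 + ["x"] * 3,
    }
)

# value_counts sorts by frequency, so the first entry is the top category.
top_share = toy.apply(lambda s: s.value_counts(normalize=True).iloc[0])

# Hypothetical threshold for calling a column "dominated".
threshold = 0.9
dominated_cols = top_share[top_share > threshold].index.tolist()

print(top_share)
print(dominated_cols)
```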

## Multivariate Analysis

Analyze the relationships between features. Are there any redundancies?

Analyze the relationship between the features and the target variable. Are there any features that are highly correlated with the target variable?


In [None]:
corr = df[binary_cols].corr()
sns.heatmap(corr, cmap="coolwarm")

Most of the binary columns are not correlated with each other. Correlation is visible between has_secondary_use and has_secondary_use_agriculture (around 0.8), has_secondary_use and has_secondary_use_hotel (around 0.4), has_superstructure_bamboo and has_superstructure_timber (around 0.4), has_superstructure_cement_mortar_brick and has_superstructure_mud_mortar_stone (around -0.4), and has_superstructure_mud_mortar_brick and has_superstructure_adobe_mud (around 0.3).
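
Reading such pairs off a heatmap is error-prone, so they can also be listed programmatically. This is a minimal sketch on toy flags with made-up values; the threshold of 0.5 is a hypothetical choice:

```python
from itertools import combinations

import pandas as pd

# Toy binary flags (made-up values).
toy = pd.DataFrame(
    {
        "has_secondary_use": [1, 1, 1, 0, 0, 0, 0, 0],
        "has_secondary_use_agriculture": [1, 1, 0, 0, 0, 0, 0, 0],
        "has_superstructure_timber": [0, 1, 0, 1, 0, 1, 0, 1],
    }
)

corr = toy.corr()

# Visit every unordered column pair once, skipping the diagonal.
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
)

# Hypothetical threshold on the absolute correlation.
threshold = 0.5
strong_pairs = pairs[pairs.abs() >= threshold]

print(strong_pairs)
```

Using the absolute value keeps strongly negative pairs in the list as well.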

In [None]:
df[numeric_cols].hist(figsize=(20, 20))

Building_id is a unique and random identifier for each building. For geo_level_1_id (the largest subarea), differences are visible: some regions (6-8) occur more often than others. The other two geo_level ids are more evenly distributed. The other columns are all imbalanced. The count_floors value is between 1 and 3 for most buildings, with 2 occurring by far most often. Almost all buildings are between 1 and 100 years old; only a few are up to 200 years old, and a few values reach up to 1000 years. Area percentage is under 10 for most buildings, height percentage is between 5 and 8 for most buildings, and a count_families of 1 is most frequent.

In [None]:
df[numeric_cols].plot(kind="box", subplots=True, layout=(4, 4), figsize=(20, 20))

For count_floors_pre_eq all values except 2, and for count_families all values except 1, are drawn as outliers in the boxplot. This means that the median and both quartiles (between which 50% of the data is located) have the same value, so the interquartile range is zero and the whiskers collapse onto the box. When many buildings have the same value for these features, they might be less suitable for distinguishing between buildings. Age, area percentage and height percentage also have boxplots where the whiskers are close together. For 50% of the buildings the value lies between the two quartiles, so within a relatively small range for those features.
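
The boxplot behaviour follows from the whisker rule that Matplotlib applies by default: points further than 1.5 times the interquartile range (IQR) from the quartiles are drawn as outliers. A small sketch with made-up values standing in for count_families shows what happens when both quartiles coincide:

```python
import pandas as pd

# Toy stand-in for count_families: most buildings house exactly one family.
count_families = pd.Series([1] * 16 + [0, 0, 2, 3])

q1 = count_families.quantile(0.25)
q3 = count_families.quantile(0.75)
iqr = q3 - q1  # zero, because both quartiles equal 1

# Whisker limits under the default 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# With an IQR of zero, every value other than 1 falls outside the limits.
outliers = count_families[(count_families < lower) | (count_families > upper)]

print(iqr)
print(sorted(outliers.tolist()))
```

In this situation the outlier markers reflect how concentrated the distribution is, not necessarily that the flagged values are erroneous.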