# 1. Import libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sn
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# 2. Explore data

## Read data

In [None]:
raw_data = pd.read_csv('../data/data.csv')
raw_data

## How many rows and how many columns does the raw data have?

In [None]:
# Todo
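
One way to answer this question is with `DataFrame.shape`, which returns a `(rows, columns)` tuple. The sketch below runs on a small made-up frame; `toy_df` and its values are hypothetical and only stand in for `raw_data`.

```python
import pandas as pd

# Hypothetical stand-in for raw_data; the values are made up for illustration.
toy_df = pd.DataFrame({
    "SleepHours": [7.0, 6.0, 8.0],
    "BMI": [22.5, 30.1, 27.3],
})

# shape is a (number_of_rows, number_of_columns) tuple
n_rows, n_cols = toy_df.shape
print(f"Rows: {n_rows}, columns: {n_cols}")
```

Applied to the notebook, the same call would be `raw_data.shape`.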

## What does each row mean?

## Does the raw data have duplicate rows? (if it has, handle it)

In [None]:
# Todo
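
A possible approach, sketched on a made-up frame (`toy_df` is hypothetical, not the real data): count fully duplicated rows with `duplicated()`, then keep only the first occurrence of each with `drop_duplicates()`. Whether dropping is the right way to handle duplicates depends on what a row represents, so treat this as an assumption rather than the notebook's decision.

```python
import pandas as pd

# Hypothetical stand-in for raw_data; row 2 repeats row 0 on purpose.
toy_df = pd.DataFrame({
    "SleepHours": [7.0, 6.0, 7.0, 8.0],
    "BMI": [22.5, 30.1, 22.5, 27.3],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy_df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped_df = toy_df.drop_duplicates().reset_index(drop=True)

print(f"Duplicate rows: {n_duplicates}, rows after removal: {len(deduped_df)}")
```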

## What does each column mean?

In [None]:
raw_data.columns

Describe the meaning of each column.

In [None]:
col_meaning_df = pd.read_csv('../data/schema.csv')
pd.set_option("display.max_colwidth", None)

col_meaning_df

## What data type does each column currently have? Are there any columns whose data types are not suitable for further processing?

To answer this question, we first use the **info()** method to see general information about each column.

In [None]:
raw_data.info()

Upon examining the **Column** and **Dtype**, we can see that all columns have appropriate data types, so we don't need to convert them.
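
A visual check of `info()` can be complemented with a programmatic one. The sketch below is only an illustration on made-up data: `toy_df`, its column names, and the "every non-missing value parses as a number" criterion are assumptions, not part of the original analysis. It flags non-numeric columns that might really hold numbers stored as text.

```python
import pandas as pd

# Hypothetical frame: "WeightText" holds numbers stored as strings.
toy_df = pd.DataFrame({
    "SleepHours": [7.0, 6.0, 8.0],
    "WeightText": ["70.5", "82", "64.2"],
    "Sex": ["Female", "Male", "Female"],
})

# Columns that pandas does not already treat as numeric
non_numeric_cols = [
    col for col in toy_df.columns
    if not pd.api.types.is_numeric_dtype(toy_df[col])
]

# A column is a conversion candidate when every non-missing value parses as a number
convertible = [
    col for col in non_numeric_cols
    if pd.to_numeric(toy_df[col], errors="coerce").notna().sum()
    == toy_df[col].notna().sum()
]

print(convertible)
```

An empty list from such a check would be consistent with the conclusion that no conversion is needed.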

## For each column with numeric data type, how are the values distributed?

For each numeric column, we want to know:
- What is the percentage of missing values?
- If there are missing values, how should we handle them?
- What are the minimum and maximum values? Are they abnormal?

### Select numeric columns

In [None]:
num_col_df = raw_data.select_dtypes(include='float64')
num_col_df

### Explore the distribution using descriptive statistics

In [None]:
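def missing_ratio(col):
    # Percentage of missing values in the column, rounded to one decimal
    return (col.isna().sum() * 100 / len(col)).round(1)

def lower_quartile(col):
    # 25th percentile, rounded to one decimal
    return col.quantile(0.25).round(1)

def upper_quartile(col):
    # 75th percentile, rounded to one decimal
    return col.quantile(0.75).round(1)

# One row per statistic, one column per numeric column
num_col_info_df = num_col_df.agg([missing_ratio, 'min', lower_quartile, 'median', upper_quartile, 'max'])
num_col_info_df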

**Observation:**
- The percentage of missing values in each numeric column is low, so we won't drop any of these columns. Instead, we will fill in the missing values.
- The minimum and maximum values of each numeric column are within normal ranges:
    - There are no negative numbers.
    - PhysicalHealthDays and MentalHealthDays both have values less than or equal to 30.
    - SleepHours has values less than or equal to 24.
    - The three remaining columns also have reasonable values.
- For PhysicalHealthDays, MentalHealthDays, SleepHours, WeightInKilograms and BMI, the upper quartile lies far below the maximum, which suggests that these columns have right-skewed distributions.
- Because of that, we will fill missing values with the median, a statistic that is insensitive to outliers.
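
The reasoning about skewness and the median can be illustrated with a small sketch. The values in `toy_days` are made up (they are not taken from the dataset); the point is only that a long right tail pulls the mean upward while the median stays close to the bulk of the data. `Series.skew()` is used here as one possible numeric check of skew direction.

```python
import pandas as pd

# Made-up values with a long right tail (one extreme value of 30)
toy_days = pd.Series([0, 0, 0, 1, 2, 2, 3, 30], dtype="float64")

skewness = toy_days.skew()        # positive values indicate right skew
mean_value = toy_days.mean()      # pulled upward by the extreme value
median_value = toy_days.median()  # stays near the bulk of the data

print(f"skew > 0: {skewness > 0}, mean: {mean_value}, median: {median_value}")
```

In the notebook itself, `num_col_df.skew()` would give the same kind of check for every numeric column at once.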

### Visualize missing ratio

In [None]:
data = num_col_info_df.loc['missing_ratio']
fig = px.bar(x=data.index, y=data.values, width=1000, height=500, labels={'x': 'Numeric column', 'y': 'Percentage(%)'}, 
             title='Missing ratio of numeric columns')
fig.show()

### Handle missing values

In [None]:
raw_data[num_col_df.columns] = raw_data[num_col_df.columns].fillna(num_col_df.median())

After handling missing values, we check the missing ratio again to confirm that they have been handled successfully.

In [None]:
non_nan_num_cols = raw_data[num_col_df.columns] 
non_nan_num_col_info_df = non_nan_num_cols.agg([missing_ratio, 'min', lower_quartile, 'median', upper_quartile, 'max'])
non_nan_num_col_info_df

Now there are no missing values. 

### Visualize the distribution 

We will use **histograms** to visualize the distributions of the numeric columns and note the insights we can gain from them.

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(14, 6))
axes = axes.flatten()
plt.subplots_adjust(hspace=0.4)

bin_nums = [10, 10, 23, 20, 20, 20]
for i in range(len(axes)):
    axes[i].hist(raw_data[non_nan_num_cols.columns[i]], bins=bin_nums[i])
    axes[i].set_title(non_nan_num_cols.columns[i]);

**Observation**:
- The physical health of people in California is generally good, as most of them experienced physical health problems for fewer than 6 days.
- The number of people experiencing mental health problems for more than 6 days is noticeably higher than for physical health problems. This suggests that mental health problems often persist longer than physical health problems.
- The distribution of the SleepHours column is narrow, indicating that most people in California sleep around 6 to 9 hours per day, which is good for health.
- The height of people in California varies, but is concentrated mainly in the range of 1.5 to 1.85 meters.
- Similarly, weight and BMI are concentrated mainly in the ranges of 50 to 112 kilograms and 20 to 35, respectively.

### Handle outliers

First, we will see if there are any outliers in numerical columns.

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(14, 6))
axes = axes.flatten()
plt.subplots_adjust(hspace=0.3)

non_nan_num_cols = raw_data[num_col_df.columns]
for i in range(len(axes)):
    axes[i].boxplot(non_nan_num_cols.iloc[:, i]);
    axes[i].set_title(non_nan_num_cols.columns[i])
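
By default, matplotlib box plots draw whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and show points outside them as individual markers. The sketch below counts such points numerically using the same 1.5 × IQR rule. The frame `toy_df`, its values, and the helper `count_iqr_outliers` are hypothetical, and the rule is only one common convention, not necessarily the one this notebook goes on to use.

```python
import pandas as pd

# Made-up data: SleepHours contains one extreme value (20.0)
toy_df = pd.DataFrame({
    "SleepHours": [6.0, 7.0, 7.0, 8.0, 8.0, 7.0, 6.0, 20.0],
    "BMI": [22.0, 24.0, 26.0, 25.0, 23.0, 27.0, 24.0, 25.0],
})

def count_iqr_outliers(col, k=1.5):
    """Count values lying more than k * IQR outside the quartiles."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((col < lower) | (col > upper)).sum())

# One outlier count per column
outlier_counts = toy_df.apply(count_iqr_outliers)
print(outlier_counts)
```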

## Outlier detection