# Week 2 - Pre-processing, part 2

# <font color='orangered'>ATTENTION
### * This notebook is best viewed with a dark background
### * If you `Run All`, each dataset takes about 1 min to download and another 30 sec to generate and render its Profile Report

### * Most answers are in <font color='plum'> this color.
#### *powered by Arun&trade;* 

In [23]:
# %pip install kagglehub
# %pip install --upgrade pip

In [24]:
import pandas as pd
import numpy as np
import kagglehub 
import matplotlib.pyplot as plt 
%matplotlib inline

from kagglehub              import KaggleDatasetAdapter
from datetime               import datetime, timedelta
from ydata_profiling        import ProfileReport
from ydata_profiling.config import Settings

# custom settings object
custom_settings = Settings()


# 1. Lesson: None

# 2. Weekly graph question

The Storytelling With Data book mentions planning on a "Who, What, and How" for your data story.  Write down a possible Who, What, and How for your data, using the ideas in the book.

<font color='plum'>

Dataset:  Hypertension Risk Prediction, https://www.kaggle.com/datasets/ankushpanday1/hypertension-risk-prediction-dataset

The level of detail mostly depends on whether the story is being presented live or in a document. 

### WHO
Public health officials and policy-makers within a specific region or country, the kinds of people responsible for allocating resources and developing public health campaigns to address chronic diseases like hypertension.

### WHAT
These officials need to be convinced that targeted screening programs for hypertension are necessary, and that specific demographics are at higher risk than others.  The story will need to establish the prevalence of hypertension and its correlation with key risk factors, as identified in the dataset.  The tone is one of 'call to action': approve funding for a pilot program focused on early detection and prevention in high-risk populations.

### HOW
A 'slideument' that presents the problem (the prevalence and scale of hypertension, as shown in the data), the risk factors (correlations between hypertension and factors such as age, BMI, family history, and behaviors), and a solution (visualizations of model results showing the impact of screening programs).

# 3. Homework - work with your own data

This week, you will do the same types of exercises as last week, but on the datasets you chose from those your classmates found last semester. (They likely will not be the particular datasets that you found yourself.)

## 3. Guidelines

### Here are some types of analysis you can do.  Use Google, documentation, and ChatGPT to help you:

- Summarize the datasets using `info()` and `describe()`

- Are there any duplicate rows?

- Are there any duplicate values in a given column (where duplicates would be inappropriate)?

- What are the mean, median, and mode of each column?

- Are there any missing or null values?

    - Do you want to fill in the missing value with a mean value?  A value of your choice?  Remove that row?

- Identify any other inconsistent data (e.g. someone seems to be taking an action before they are born.)

- Encode any categorical variables (e.g. with one-hot encoding.)
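
The checks listed above can be sketched on a small toy table. This is only an illustration, not a prescribed workflow: the DataFrame and its column names (`patient_id`, `age`, `smoker`) are made up for the example, and filling missing ages with the median is just one of the options mentioned in the list.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real dataset; every column name here is hypothetical.
toy = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4],
    "age":        [34, 51, 51, np.nan, -5],   # one missing value, one impossible value
    "smoker":     ["no", "yes", "yes", "no", "never"],
})

# Duplicate rows, and duplicate values in a column that should be unique
n_duplicate_rows = int(toy.duplicated().sum())
n_duplicate_ids = int(toy["patient_id"].duplicated().sum())

# Missing values per column
missing_per_column = toy.isna().sum()

# Mean, median, and mode of a numeric column
age_mean = toy["age"].mean()
age_median = toy["age"].median()
age_mode = toy["age"].mode().tolist()

# Inconsistent data: an age below zero cannot be right
impossible_age = toy[toy["age"] < 0]

# One possible clean-up: drop duplicate rows, drop impossible ages,
# then fill the remaining missing ages with the median
clean = toy.drop_duplicates()
clean = clean[~(clean["age"] < 0)].copy()   # rows with a missing age are kept here
clean["age"] = clean["age"].fillna(clean["age"].median())

# One-hot encode the categorical column
encoded = pd.get_dummies(clean, columns=["smoker"])

print(n_duplicate_rows, n_duplicate_ids)
print(missing_per_column)
print(encoded)
```

On a real dataset, replace `toy` with the loaded DataFrame and decide column by column whether dropping or filling is the better choice.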



### Conclusions:

- Are the data usable?  If not, find some new data!

- Do you need to modify or correct the data in some way?

- Is there any class imbalance?  (Categories that have many more items than other categories).
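
One quick way to look for class imbalance is to compare the share of each category in the target column. The sketch below uses a made-up `risk` label series; the column name and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical target column; substitute the label column of your own dataset.
labels = pd.Series(["low"] * 8 + ["high"] * 2, name="risk")

# Share of rows in each class (the shares sum to 1)
class_share = labels.value_counts(normalize=True)

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```

How large a ratio counts as a problem depends on the task, so treat the number as a prompt to look closer rather than a pass/fail threshold.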

## <font color='plum'> 3. Description

This project focuses on understanding and predicting the development of three major chronic illnesses, namely hypertension, chronic kidney disease (CKD), and diabetes, from lifestyle, demographic, and clinical risk factors. The goal is to compare multiple real-world datasets to uncover overlapping risk indicators and investigate how preventive strategies can reduce disease onset.

##  <font color='plum'> 3.1. Hypertension Risk Prediction Dataset

Includes lifestyle, demographic, and clinical data (e.g., BMI, cholesterol, stress,
salt intake, smoking, family history) from individuals across multiple countries. It is
labeled for classification tasks (low vs. high risk of hypertension).
https://www.kaggle.com/datasets/ankushpanday1/hypertension-risk-prediction-dataset

In [None]:

file_path = "ankushpanday1/hypertension-risk-prediction-dataset"
file_name = "hypertension_dataset.csv"

df_1 = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  file_path,
  file_name,
)

df_1.head()

### <font color='plum'> 3.1.a. Exploratory Data Analysis

In [None]:
df_1.info()

In [None]:
df_1.describe()

In [None]:
profile = ProfileReport(
    df_1,
    html={'style': {'primary_color': 'magenta'}},
    minimal=False,
    plot={"dpi": 100, "image_format": "svg", "tight_layout": True},
)

profile

### <font color='plum'> 3.1.b. Commentary

## <font color='plum'> 3.2. Chronic Kidney Disease Dataset

Contains comprehensive data for 1,659 patients, including 54 variables spanning
medical history, lab results, medication usage, quality of life, and environmental
exposure. Ideal for regression, classification, and clustering analyses.
https://www.kaggle.com/datasets/rabieelkharoua/chronic-kidney-disease-dataset-analysis

In [None]:
file_path = "rabieelkharoua/chronic-kidney-disease-dataset-analysis"
file_name = "Chronic_Kidney_Dsease_data.csv"

# Load the latest version
df_2 = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  file_path,
  file_name,
)

df_2.head()

### <font color='plum'> 3.2.a. Exploratory Data Analysis

In [None]:
df_2.info()

In [None]:
df_2.describe()

In [None]:
profile = ProfileReport(
    df_2,
    html    = {'style': {'primary_color': 'firebrick'}},
    minimal = False,
    plot    = {"dpi": 100, "image_format": "svg", "tight_layout": True},
)
profile

### <font color='plum'> 3.2.b. Commentary

## <font color='plum'> 3.3. Diabetes Health Indicators Dataset
Over 250,000 responses from a U.S. national health survey with demographic and
lifestyle variables (e.g., BMI, activity level, smoking, sleep, general health).
Designed for predicting diabetes status (none, pre-diabetic, diabetic).
https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

In [None]:
file_path = "alexteboul/diabetes-health-indicators-dataset"
file_name = "diabetes_012_health_indicators_BRFSS2015.csv"

# Load the latest version
df_3 = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  file_path,
  file_name,
)

df_3.head()

### <font color='plum'> 3.3.a. Exploratory Data Analysis

In [None]:
df_3.info()

In [None]:
df_3.describe()

In [None]:
profile = ProfileReport(
    df_3,
    html    = {'style': {'primary_color': 'firebrick'}},
    minimal = False,
    plot    = {"dpi": 100, "image_format": "svg", "tight_layout": True},
)
profile

### <font color='plum'> 3.3.b. Commentary

# 4. <font color='plum'>Storytelling With Data graph

Just like last week: choose any graph in the Introduction of <u>Storytelling With Data</u>. 

Use `matplotlib` to reproduce it in a rough way. I don't expect you to spend an enormous amount of time on this; I understand that you likely will not have time to re-create every feature of the graph. However, if you're excited about learning to use matplotlib, this is a good way to do that. You don't have to duplicate the exact values on the graph; just the same rough shape will be enough.  If you don't feel comfortable using matplotlib yet, do the best you can and write down what you tried or what Google searches you did to find the answers.

<font color='plum'> This is from page 4 of <u>Storytelling With Data</u> 

In [None]:

months      = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
received    = [160, 180, 235, 150, 180, 170, 140, 202, 160, 139, 149, 177]
processed   = [160, 180, 235, 150, 180, 170, 125, 156, 126, 104, 124, 140]

plt.figure(figsize = (8, 5))

plt.plot(months, received, marker='o', color='gray', label='Received', linewidth = 2, markersize = 4)
plt.plot(months, processed, marker='o', color='navy', label='Processed', linewidth = 2, markersize = 4 )

plt.title('Ticket volume over time', fontsize = 14, pad = 20, loc = 'left')
plt.ylabel('Number of tickets', fontsize = 10, loc = 'top')
# plt.grid(False, linestyle='--', alpha=0.7)

# vertical line at May
plt.axvline(x           = months.index('May'), 
            color       = 'gray', 
            linestyle   = '-', 
            alpha       = 0.5)

# annotations
plt.text(-0.5, -40, '2014', fontsize = 10)
plt.text(months.index('Apr')-0.25, 250, 
         '2 employees quit in May. We nearly kept up with incoming volume\nin the following two months, but fell behind with the increase in Aug\nand haven\'t been able to catch up since.', 
         fontsize = 8, 
         color = 'gray')

# data labels for Aug (index 7) onward only
for i, (r, p) in enumerate(zip(received, processed)):
    if i >= 7:
        plt.text(i, r+5, str(r), ha='center', va='bottom', color='gray')
        plt.text(i, p-10, str(p), ha='center', va='top', color='navy')

# Customize axis
plt.ylim(0, 300)

# legend
plt.legend(frameon = False)

plt.tight_layout()
plt.show()