# August Perez Capstone Two Project:
## Asthma prediction model

The goal is to build a model with at least 90% sensitivity (recall), i.e. one that prioritizes reducing false negatives.
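
For reference, sensitivity (also called recall or the true positive rate) is TP / (TP + FN), so raising it means reducing false negatives. The following is a minimal sketch on made-up labels, not results from this dataset; the helper name `sensitivity` is hypothetical. `sklearn.metrics.recall_score` computes the same quantity.

```python
def sensitivity(y_true, y_pred):
    """Return TP / (TP + FN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("y_true contains no positive cases")
    return tp / (tp + fn)


# Made-up labels for illustration only (1 = asthma, 0 = no asthma)
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]

sens = sensitivity(y_true, y_pred)
print(f"sensitivity: {sens:.3f}")
print("meets 90% target:", sens >= 0.90)
```

Note that the false positive in the toy predictions does not affect sensitivity; only missed positive cases lower it.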

Data source: Asthma Disease Dataset (https://www.kaggle.com/datasets/rabieelkharoua/asthma-disease-dataset?resource=download)

#### About the dataset: (from the Kaggle description)
- health information for 2,392 patients
  - includes demographic details, lifestyle factors, environmental and allergy factors, medical history, clinical measurements, symptoms, and a diagnosis indicator


Column Info: (from Kaggle description)
Patient ID

    PatientID: A unique identifier assigned to each patient (5034 to 7425).

Demographic Details

    Age: The age of the patients ranges from 5 to 80 years.
    Gender: Gender of the patients, where 0 represents Male and 1 represents Female.
    Ethnicity: The ethnicity of the patients, coded as follows:
        0: Caucasian
        1: African American
        2: Asian
        3: Other
    EducationLevel: The education level of the patients, coded as follows:
        0: None
        1: High School
        2: Bachelor's
        3: Higher

Lifestyle Factors

    BMI: Body Mass Index of the patients, ranging from 15 to 40.
    Smoking: Smoking status, where 0 indicates No and 1 indicates Yes.
    PhysicalActivity: Weekly physical activity in hours, ranging from 0 to 10.
    DietQuality: Diet quality score, ranging from 0 to 10.
    SleepQuality: Sleep quality score, ranging from 4 to 10.

Environmental and Allergy Factors

    PollutionExposure: Exposure to pollution, score from 0 to 10.
    PollenExposure: Exposure to pollen, score from 0 to 10.
    DustExposure: Exposure to dust, score from 0 to 10.
    PetAllergy: Pet allergy status, where 0 indicates No and 1 indicates Yes.

Medical History

    FamilyHistoryAsthma: Family history of asthma, where 0 indicates No and 1 indicates Yes.
    HistoryOfAllergies: History of allergies, where 0 indicates No and 1 indicates Yes.
    Eczema: Presence of eczema, where 0 indicates No and 1 indicates Yes.
    HayFever: Presence of hay fever, where 0 indicates No and 1 indicates Yes.
    GastroesophagealReflux: Presence of gastroesophageal reflux, where 0 indicates No and 1 indicates Yes.

Clinical Measurements

    LungFunctionFEV1: Forced Expiratory Volume in 1 second (FEV1), ranging from 1.0 to 4.0 liters.
    LungFunctionFVC: Forced Vital Capacity (FVC), ranging from 1.5 to 6.0 liters.

Symptoms

    Wheezing: Presence of wheezing, where 0 indicates No and 1 indicates Yes.
    ShortnessOfBreath: Presence of shortness of breath, where 0 indicates No and 1 indicates Yes.
    ChestTightness: Presence of chest tightness, where 0 indicates No and 1 indicates Yes.
    Coughing: Presence of coughing, where 0 indicates No and 1 indicates Yes.
    NighttimeSymptoms: Presence of nighttime symptoms, where 0 indicates No and 1 indicates Yes.
    ExerciseInduced: Presence of symptoms induced by exercise, where 0 indicates No and 1 indicates Yes.

Diagnosis Information

    Diagnosis: Diagnosis status for Asthma, where 0 indicates No and 1 indicates Yes.

Confidential Information

    DoctorInCharge: This column contains confidential information about the doctor in charge, with "Dr_Confid" as the value for all patients.
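
Because the categorical columns are stored as integer codes, it can help to keep the code-to-label mappings from the description above in one place for plots and tables. A small sketch, assuming pandas; the dictionary names and the toy frame are hypothetical, and the column names follow the Kaggle description.

```python
import pandas as pd

# Code-to-label mappings copied from the column description
ETHNICITY_LABELS = {0: "Caucasian", 1: "African American", 2: "Asian", 3: "Other"}
EDUCATION_LABELS = {0: "None", 1: "High School", 2: "Bachelor's", 3: "Higher"}

# Toy rows for illustration only, not taken from the dataset
toy = pd.DataFrame({"Ethnicity": [0, 2, 3], "EducationLevel": [1, 3, 0]})

# Add readable label columns next to the integer codes
labeled = toy.assign(
    EthnicityLabel=toy["Ethnicity"].map(ETHNICITY_LABELS),
    EducationLabel=toy["EducationLevel"].map(EDUCATION_LABELS),
)
print(labeled)
```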


#### Imports:

In [None]:
%matplotlib inline

# data manipulation and math

import numpy as np
import scipy as sp
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# plotting and visualization

import matplotlib.pyplot as plt
import seaborn as sns

# modeling & pre-processing
import sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
import sklearn.preprocessing
import sklearn.metrics


#### Set random seed for reproducibility
Note that this should not be done for models used in real-world applications

In [None]:
np.random.seed(9)
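
As an alternative to seeding NumPy's global state, a seed can be passed explicitly to the object that needs it. A minimal sketch, assuming NumPy's `default_rng`; the variable names are illustrative and this is not what the notebook itself does. scikit-learn functions such as `train_test_split` accept a `random_state` argument for the same purpose.

```python
import numpy as np

SEED = 9  # same seed value as used above

# Two generators built from the same seed produce identical draws
rng_a = np.random.default_rng(SEED)
rng_b = np.random.default_rng(SEED)

draws_a = rng_a.integers(0, 100, size=5)
draws_b = rng_b.integers(0, 100, size=5)

print(draws_a)
print("identical:", np.array_equal(draws_a, draws_b))
```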

#### Load the data into a pandas df

In [None]:
df_adhd = pd.read_csv('asthma_disease_data.csv')
df_adhd.head()

In [None]:
print('shape:', df_adhd.shape, '\n')
# sep='\n' puts each column name on its own line without a stray leading space
print('columns:', *df_adhd.columns, sep='\n')
print('\nnumber of cols:', len(df_adhd.columns))

#### Data Wrangling

In [None]:
df_adhd.info()

AP: No null values found, making my job here easier than expected.
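
`info()` reports non-null counts per column; an explicit count of missing values can make the same check easier to read. A sketch on a toy frame with made-up values, assuming the same calls would be applied to `df_adhd`.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value, for illustration only
toy = pd.DataFrame({"age": [34, 51, 8], "bmi": [22.1, np.nan, 30.4]})

null_counts = toy.isna().sum()        # missing values per column
total_nulls = int(null_counts.sum())  # missing values in the whole frame

print(null_counts)
print("total nulls:", total_nulls)
```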

In [None]:
# AP: Looking at col names to ensure they make sense
df_adhd.head()

##### AP: Col names to edit:
- Change all to lower case (mainly makes my typing easier/faster)

In [None]:
# AP: change col names to lower case
df_adhd.columns = df_adhd.columns.str.lower()
df_adhd.head()

#### Check the data types, ensure they make sense for the column

In [None]:
df_adhd.dtypes

##### AP: Data types check out. doctorincharge col has every value as "Dr_Confid" so object type makes sense.

#### Visualize the data
Detection of possible duplicates and outliers
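
Histograms do not show duplicate rows directly, so an explicit check can complement the plots. A minimal sketch on a toy frame, assuming lower-cased column names as used in this notebook; the values are made up.

```python
import pandas as pd

# Toy frame for illustration only; the second and third rows are identical
toy = pd.DataFrame(
    {"patientid": [5034, 5035, 5035, 5036], "age": [63, 26, 26, 57]}
)

n_dup_rows = int(toy.duplicated().sum())              # fully identical rows
n_dup_ids = int(toy["patientid"].duplicated().sum())  # repeated identifiers

print("duplicate rows:", n_dup_rows)
print("duplicate patient IDs:", n_dup_ids)
```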

In [None]:
df_adhd.hist(figsize=(8,8))
plt.tight_layout()
plt.show()

AP: Most of the columns are categorical, and most of those are binary: 0 or 1 (no or yes).
The continuous values look roughly flat, with no distinctive distribution shape.
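
One rough way to put a number on how flat a histogram looks is to compare each bin count with the count expected under a uniform distribution. The sketch below uses synthetic data only, and the helper name `max_relative_bin_deviation` is hypothetical; it is an informal check, not a statistical test.

```python
import numpy as np

def max_relative_bin_deviation(values, bins=10):
    """Largest relative gap between a bin count and the count expected if flat."""
    counts, _ = np.histogram(values, bins=bins)
    expected = len(values) / bins
    return float(np.max(np.abs(counts - expected)) / expected)

rng = np.random.default_rng(0)
flat = rng.uniform(0, 10, size=2392)   # synthetic, roughly uniform
peaked = rng.normal(5, 1, size=2392)   # synthetic, bell-shaped

dev_flat = max_relative_bin_deviation(flat)
dev_peaked = max_relative_bin_deviation(peaked)
print(f"flat: {dev_flat:.2f}, peaked: {dev_peaked:.2f}")
```

A small value suggests the bins are close to evenly filled, while a bell-shaped sample gives a much larger value.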


In [None]:
# AP: Make list of cols that are non-categorical in nature

non_cat_cols = ['age', 'bmi', 'physicalactivity', 'dietquality', 'sleepquality', 'pollutionexposure', 'pollenexposure', 'dustexposure', 'lungfunctionfev1', 'lungfunctionfvc']

In [None]:
#AP: box & whisker plot to see potential outliers
df_adhd[non_cat_cols].boxplot(figsize=(9,9), rot=34)
plt.tight_layout()
plt.show()

In [None]:
# Boxplot without age & bmi so other columns can be inspected more closely

non_cat_cols_noagebmi = [col for col in non_cat_cols if col not in ('age', 'bmi')]
df_adhd[non_cat_cols_noagebmi].boxplot(figsize=(9,9), rot=35)
plt.tight_layout()
plt.show()

AP: No outliers seen; minimum and maximum values seem reasonable for the non-categorical cols
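
The visual check can be backed by a count. By default, box plot whiskers extend to 1.5 times the interquartile range (IQR) beyond the quartiles, and points outside them are drawn as outliers. A minimal sketch of that rule on toy values; the helper name `iqr_outlier_counts` is hypothetical, and quartile interpolation may differ slightly from the plotting library's.

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()

# Toy values for illustration only; 40.0 is far from the rest of 'bmi'
toy = pd.DataFrame(
    {"bmi": [18.0, 19.0, 20.0, 21.0, 22.0, 40.0],
     "sleepquality": [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]}
)

counts = iqr_outlier_counts(toy)
print(counts)
```

Applied to `df_adhd[non_cat_cols]`, all-zero counts would agree with the box plots above.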

#### A look at consistency in values within columns
e.g. categorical values are within specified ranges, and numerical values are within reasonable ranges for the metric measured

In [None]:
df_adhd.nunique(axis=0)

#### AP: The count of unique values within each column is consistent with expectations
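
Unique counts show how many distinct values a column has, but not whether those values fall inside the documented ranges. A sketch of an explicit range check, using a few of the ranges from the Kaggle column description; the names `EXPECTED_RANGES` and `out_of_range_counts` are hypothetical and the toy rows are made up.

```python
import pandas as pd

# Expected (min, max) per column, taken from the Kaggle column description
EXPECTED_RANGES = {
    "age": (5, 80),
    "bmi": (15, 40),
    "sleepquality": (4, 10),
    "gender": (0, 1),
    "ethnicity": (0, 3),
}

def out_of_range_counts(frame, ranges):
    """Return, per column, how many values fall outside the documented range."""
    return {
        col: int((~frame[col].between(lo, hi)).sum())
        for col, (lo, hi) in ranges.items()
        if col in frame.columns
    }

# Toy rows for illustration only; the age of 95 is deliberately out of range
toy = pd.DataFrame(
    {"age": [12, 95, 40], "bmi": [21.5, 33.0, 18.2],
     "sleepquality": [4.5, 9.9, 7.0], "gender": [0, 1, 1],
     "ethnicity": [3, 0, 2]}
)

violations = out_of_range_counts(toy, EXPECTED_RANGES)
print(violations)
```

`between` is inclusive on both ends by default, so boundary values such as an age of 5 or 80 count as valid.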

## Data Wrangling Conclusions:

- Data is tidy (each observation is a row, each variable is a col)
- No null values found, making my job here easier than expected.
- Data types for each col check out: all cols are numeric (int or float) except doctorincharge, which holds "Dr_Confid" for every row, so object type makes sense.
- Continuous values look roughly flat, with no distinctive distribution shape. Most of the data is categorical (17 out of 29 cols), and most of those cols are binary: 0 or 1 (no or yes).
- No outliers detected; minimum and maximum values seem reasonable for non-categorical cols.
- The count of unique values within each column is consistent with expectations: categorical cols have as many unique values as they have categories, and the non-categorical cols other than 'age' have as many unique values as there are samples.