# D209 - Data Mining I
by Desiree McElroy

In [1]:
# imports
import pandas as pd
import numpy as np

import wrangle as w

import pprint

# Part I: Research Question

A.  Describe the purpose of this data mining report by doing the following:

## A1. Research Question

My research question is: what factors contribute to a patient being diagnosed with diabetes, and can those features serve as reliable predictors for identifying who is at higher risk?

## A2.  Goal of the Data Analysis.

According to the Centers for Disease Control and Prevention (CDC), diabetes affects more than 38 million Americans and remains a serious illness across the population. By exploring this data set with Naive Bayes methods, I may be able to uncover underlying patterns and correlations that contribute to this disease. Demographic, location, and lifestyle features are known to be associated with the risk of diabetes.
By incorporating these variables into a Naive Bayes model, I can predict the likelihood of diabetes and identify which factors are most strongly associated with a diagnosis. Naive Bayes suits this task because it assumes feature independence, which lets it model the contribution of each factor efficiently. In summary, my main goal is to identify how important each feature is in relation to diabetes.

# Part II: Method Justification

## B1.  Naive Bayes Justification

Naive Bayes is an effective algorithm for classifying a target variable like diabetes because of its simplicity, speed, and ability to handle high-dimensional data. Its variants, such as Gaussian Naive Bayes (for continuous variables like age) and Bernoulli Naive Bayes (for binary features, including one-hot-encoded categorical features), make it adaptable to datasets with a mix of data types. This flexibility makes it well suited to this dataset.
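To make the two variants concrete, here is a minimal sketch that fits each one to a single feature. The data is synthetic and made up for illustration (it is not the medical dataset), and the feature names `age` and `overweight` are borrowed only as labels.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)
n = 200

# Synthetic, illustrative data only (not the medical dataset).
y = rng.integers(0, 2, size=n)                       # binary target
age = rng.normal(loc=50 + 10 * y, scale=12, size=n)  # continuous feature
overweight = rng.random(n) < (0.4 + 0.3 * y)         # binary feature

# GaussianNB models each continuous feature with a per-class normal distribution.
gnb = GaussianNB().fit(age.reshape(-1, 1), y)

# BernoulliNB models each binary feature with a per-class Bernoulli distribution.
bnb = BernoulliNB().fit(overweight.reshape(-1, 1).astype(int), y)

print(gnb.theta_.ravel())                      # per-class mean of age
print(np.exp(bnb.feature_log_prob_).ravel())   # per-class P(overweight = 1)
```

Each model stores one set of parameters per class, which is what makes the fitted values easy to inspect feature by feature.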

Naive Bayes can still perform well even when its assumption of feature independence is violated. Additionally, while Naive Bayes does not explicitly correct for imbalanced data, its probabilistic predictions can cope reasonably well with imbalanced datasets. This matters for features like gender that contain severe class imbalances.

While dimensionality isn’t a major concern in this dataset (it lacks high-cardinality features), Naive Bayes performs well in high-dimensional spaces, particularly in text classification tasks. Lastly, Naive Bayes is computationally efficient, which makes predictions fast and is beneficial for larger healthcare systems and real-world applications.

Given the nature of the dataset, Naive Bayes is expected to perform well in identifying contributing factors for diabetes, such as age, area, and medical conditions (e.g., being overweight). The model will provide probabilistic outputs, which allow me to estimate the likelihood that a patient is diagnosed with diabetes based on the features.
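As a rough sketch of what "probabilistic outputs" means in scikit-learn, the example below fits a `GaussianNB` model to synthetic stand-in data and reads the estimated probability of the positive class from `predict_proba`. The data and the variable names (`X`, `y`, `risk`) are assumptions for illustration, not results from this analysis.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)

# Synthetic stand-in data: two numeric features and a binary label.
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

model = GaussianNB().fit(X, y)

# predict_proba returns one column per class, ordered as in model.classes_.
proba = model.predict_proba(X[:5])
positive_col = list(model.classes_).index(1)
risk = proba[:, positive_col]  # estimated P(label = 1 | features)
print(np.round(risk, 3))
```

Looking up the positive class through `model.classes_` avoids assuming which column holds it.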

## B2.  Summarize one assumption of the chosen classification method.
Naive Bayes is called ***naive*** because it assumes that all feature variables are completely independent of one another given the class. It naively assumes the "presence or absence of a particular feature does not affect the presence or absence of any other feature" (Spot Intelligence 2024). Although this assumption is unrealistic and often violated, the algorithm still performs fairly well in practice.
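Written out, the independence assumption lets the joint likelihood factor into one term per feature. For a class $y$ and features $x_1, \dots, x_n$, the standard form of the Naive Bayes rule is:

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The predicted class is the one that maximizes this product:

$$
\hat{y} = \underset{y}{\arg\max} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$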

## B3.  Python Packages & Libraries

Python has many versatile libraries, including NumPy (numerical Python operations) and pandas (manipulation of dataframes). During the wrangling phase, pandas proves incredibly useful for cleaning, transforming, and creating new features. It provides many of the functions needed to prepare dataframes for modeling, such as filling nulls, changing data types, and one-hot encoding variables. NumPy is also a favorite for its powerful performance with numerical operations on arrays, which is especially useful when preparing data for machine learning models like Naive Bayes.
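A small sketch of the kind of pandas cleanup described above: filling a null and changing data types. The frame below is made up for illustration. The "Yes"/"No" encoding and the choice to treat a missing answer as `False` are assumptions, not a description of the raw file or of the `wrangle` module.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; column names mirror the dataset but values are made up.
raw = pd.DataFrame({
    "age": [53, 51, np.nan, 78],
    "overweight": ["Yes", "No", "Yes", None],
})

clean = raw.copy()
# Fill the missing numeric value with the column median.
clean["age"] = clean["age"].fillna(clean["age"].median())
# Age has no fractional part, so store it as an integer.
clean["age"] = clean["age"].astype(int)
# Convert Yes/No text to booleans; a missing answer becomes False (an assumption).
clean["overweight"] = clean["overweight"].eq("Yes")

print(clean.dtypes)
```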

Scikit-learn (sklearn) is a great library for machine learning because it provides a wide range of efficient tools for model building, data preprocessing, and evaluation. It offers user-friendly functions for classification tasks. Additionally, its integration with NumPy and pandas makes data manipulation straightforward, and it supports techniques such as cross-validation and model fine-tuning.
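For instance, cross-validation takes a single call in scikit-learn. The sketch below scores a `GaussianNB` model with 5-fold cross-validation on synthetic data. The data, the fold count, and the `roc_auc` scoring choice are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)

# Synthetic data: three numeric features and a label driven by two of them.
X = rng.normal(size=(400, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# One AUC score per fold; the model is refit on each training split.
scores = cross_val_score(GaussianNB(), X, y, cv=5, scoring="roc_auc")
print(scores.round(3), scores.mean().round(3))
```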

In summary, the versatility, ease of use, and large community behind these packages make them strong tools for this project.

# Part III: Data Preparation
## C1.  Data Preprocessing
While data generally requires many preprocessing steps before machine learning, one of the most important is one-hot encoding. Models such as Naive Bayes require numerical data, so when working with tabular data, any non-numerical variables must be converted to a numerical representation. One-hot encoding converts a categorical variable into numerical columns: each value in the column is transformed into its own *flag* column (binary 1/0). For example, gender has three values: female, male, and nonbinary. To represent these numerically, each value becomes its own flag column, and the first column is often dropped to avoid redundancy. The result is two binary columns representing `male` and `nonbinary`; a 0 in both columns represents the female gender.
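The gender example can be sketched with `pd.get_dummies`. The toy frame is made up for illustration, and `drop_first=True` drops the first category in sorted order (here `Female`), which becomes the baseline.

```python
import pandas as pd

# Toy frame with the three gender values described above.
toy = pd.DataFrame({"gender": ["Female", "Male", "Nonbinary", "Female"]})

# One flag column per category, minus the first (baseline) category.
encoded = pd.get_dummies(toy, columns=["gender"], drop_first=True, dtype=int)
print(encoded)
```

A row with 0 in both `gender_Male` and `gender_Nonbinary` corresponds to the dropped `Female` category.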

## C2.  Initial data set variables that you will use to perform the analysis for the classification question from part A1 and classify each variable as numeric or categorical.

The initial dataset variables below cover a variety of descriptors, including patient medical history, geographical information, and hospital stay details. Medical history information includes history of back pain, obesity, and age. Geographical information describes the area in which patients live, including the local population count and area type. Lastly, the details of their medical stay include the type of admission and the services received.

I chose to remove identifier variables such as case order, customer id, interaction, and unique id. These columns only identify the patient and carry no predictive value. Next, I removed excess geographical features such as city, zip, and latitude/longitude. These data would require more robust exploration and transformation and are not suitable for an initial minimum viable product model. I also excluded personal patient details such as income and job, and removed survey answers item 1 through item 8, as these are known to exhibit multicollinearity, which would violate the basic assumption of Naive Bayes.
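A minimal sketch of this column removal is shown here. The raw column names in the lists (for example `case_order`, `uid`, `lat`, `lng`, `item1`) are hypothetical placeholders, and `drop_unused_columns` is an illustrative helper, not the function used in the `wrangle` module.

```python
import pandas as pd

# Hypothetical raw column names, used only for illustration.
ID_COLS = ["case_order", "customer_id", "interaction", "uid"]
GEO_COLS = ["city", "zip", "lat", "lng"]
PERSONAL_COLS = ["income", "job"]
SURVEY_COLS = [f"item{i}" for i in range(1, 9)]  # item1 ... item8


def drop_unused_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without identifier, geographic, personal, and survey columns."""
    to_drop = ID_COLS + GEO_COLS + PERSONAL_COLS + SURVEY_COLS
    # errors="ignore" skips any listed name that is not present in df.
    return df.drop(columns=to_drop, errors="ignore")


toy = pd.DataFrame(
    {"customer_id": ["a1"], "city": ["Eva"], "item1": [3], "age": [53], "diabetes": [True]}
)
kept = drop_unused_columns(toy)
print(list(kept.columns))
```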

My **numerical** variables consist of:
- ***ordinal***\
`age`\
`vitd_levels`\
`daily_charges`\
`additional_charges`\
`population`\
`children`\
`doc_visits`\
`full_meals_eaten`\
`hospital_stay_days`\
`vitd_supplement`


- ***boolean***\
`readmission`\
`high_blood`\
`stroke`\
`overweight`\
`arthritis`\
`diabetes`\
`hyperlipidemia`\
`backpain`\
`anxiety`\
`allergic_rhinitis`\
`reflux_esophagitis`\
`asthma`\
`soft_drink`



My **categorical** variables are:\
`gender`\
`marital`\
`area`\
`state`\
`initial_admin`\
`complication_risk`\
`services_received`

In [2]:
df = w.wrangle_df(filepath='medical_raw_df.csv')
df.head()

| | state | population | area | children | age | marital | gender | readmission | vitd_levels | doc_visits | ... | hyperlipidemia | backpain | anxiety | allergic_rhinitis | reflux_esophagitis | asthma | services_received | hospital_stay_days | daily_charges | additional_charges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AL | 2951 | Suburban | 1 | 53 | Divorced | Male | False | 19.14 | 6 | ... | False | True | True | True | False | True | Blood Work | 10 | 3726.7 | 17939.4 |
| 1 | FL | 11303 | Urban | 3 | 51 | Married | Female | False | 18.94 | 4 | ... | False | False | False | False | True | False | Intravenous | 15 | 4193.19 | 17613.0 |
| 2 | SD | 17125 | Suburban | 3 | 53 | Widowed | Female | False | 18.06 | 4 | ... | False | False | False | False | False | False | Blood Work | 4 | 2434.23 | 17505.19 |
| 3 | MN | 2162 | Suburban | 0 | 78 | Married | Male | False | 16.58 | 4 | ... | False | False | False | False | True | True | Blood Work | 1 | 2127.83 | 12993.44 |
| 4 | VA | 5287 | Rural | 1 | 22 | Widowed | Female | False | 17.44 | 5 | ... | True | False | False | True | False | False | CT Scan | 1 | 2113.07 | 3716.53 |


In [3]:
print(f'Columns used for initial analysis: {sorted(df.columns)}')

Columns used for initial analysis: ['additional_charges', 'age', 'allergic_rhinitis', 'anxiety', 'area', 'arthritis', 'asthma', 'backpain', 'children', 'complication_risk', 'daily_charges', 'diabetes', 'doc_visits', 'full_meals_eaten', 'gender', 'high_blood', 'hospital_stay_days', 'hyperlipidemia', 'initial_admin', 'marital', 'overweight', 'population', 'readmission', 'reflux_esophagitis', 'services_received', 'soft_drink', 'state', 'stroke', 'vitd_levels', 'vitd_supplement']


In [4]:
# separate exploratory variables into type for ease of exploring

# numerical/ordinal variables
num_vars = ['age',
             'vitd_levels',
             'daily_charges',
             'additional_charges',
             'population',
             'children',
             'doc_visits',
             'full_meals_eaten',
             'hospital_stay_days',
             'vitd_supplement']

# categorical variables
cat_vars = ['gender', 
            'marital',
            'area',
            'state',
            'initial_admin',
            'complication_risk',
            'services_received']


# List of boolean health-related variables
bool_vars = ['readmission',
             'high_blood', 
             'stroke',  
             'overweight', 
             'arthritis', 
             'diabetes', 
             'hyperlipidemia', 
             'backpain', 
             'anxiety', 
             'allergic_rhinitis', 
             'reflux_esophagitis', 
             'asthma',
             'soft_drink']

In [5]:
len(df.columns)

30
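The column count can be cross-checked against the variable lists to catch names that are listed twice or left out. The helper below is a hypothetical sketch (`check_variable_lists` is not part of the notebook or the `wrangle` module), demonstrated on toy lists.

```python
from collections import Counter


def check_variable_lists(columns, *groups):
    """Return (duplicates, missing) for the grouped variable names."""
    listed = [name for group in groups for name in group]
    # Names that appear more than once across all groups.
    duplicates = sorted(name for name, count in Counter(listed).items() if count > 1)
    # Columns that no group mentions.
    missing = sorted(set(columns) - set(listed))
    return duplicates, missing


# Toy example: "age" is listed twice and "hospital_stay_days" is not listed at all.
columns = ["age", "gender", "diabetes", "hospital_stay_days"]
duplicates, missing = check_variable_lists(columns, ["age", "age"], ["gender"], ["diabetes"])
print(duplicates, missing)
```

In the notebook, the same check could be run as `check_variable_lists(df.columns, num_vars, cat_vars, bool_vars)`; two empty lists would mean every column is assigned to exactly one group.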

## C3.  Explain each of the steps used to prepare the data for the analysis. Identify the code segment for each step.

## C4.  Provide a copy of the cleaned data set.

# Part IV: Analysis

D.  Perform the data analysis and report on the results by doing the following:

1.  Split the data into training and test data sets and provide the file(s).

2.  Describe the analysis technique you used to appropriately analyze the data. Include screenshots of the intermediate calculations you performed.

3.  Provide the code used to perform the classification analysis from part D2.

# Part V: Data Summary and Implications

E.  Summarize your data analysis by doing the following:

1.  Explain the accuracy and the area under the curve (AUC) of your classification model.

2.  Discuss the results and implications of your classification analysis.

3.  Discuss one limitation of your data analysis.

4.  Recommend a course of action for the real-world organizational situation from part A1 based on your results and implications discussed in part E2.

# Web Sources
- https://www.cdc.gov/diabetes/php/data-research/index.html

- https://www.ibm.com/topics/naive-bayes

- https://scikit-learn.org/stable/modules/naive_bayes.html

- https://www.youtube.com/watch?v=O2L2Uv9pdDA

- https://iq.opengenus.org/types-of-naive-bayes/

- https://spotintelligence.com/2024/05/31/naive-bayes-classification/