# **Decision Tree Analysis**
by Desiree McElroy
- Using the medical data set.

# Part I: Research Question

## A1.  Propose one question relevant to a real-world organizational situation that you will answer using one of the following prediction methods:

As in the Task 1 project using Naive Bayes, I would like to explore once again which factors contribute to someone being diagnosed with diabetes. Could these features be solid predictors for identifying who is more at risk for diabetes?


## A2.  Define one goal of the data analysis. Ensure that your goal is reasonable within the scope of the scenario and is represented in the available data.

As highlighted in my previous analysis using Naive Bayes, diabetes is a prevalent disease affecting millions of Americans and continues to pose a significant public health challenge. My last project showed that Naive Bayes was not a good predictor of diabetes. Characteristics such as a person's demographics, location, and lifestyle choices are often linked to diabetes risk. I want to challenge this data set again, this time employing decision tree methods to help identify the most important predictors of diabetes.

By applying the Decision Tree algorithm to this data set, I aim to uncover the underlying patterns and relationships that contribute to the development of diabetes. Because Decision Trees are adept at capturing complex, non-linear feature interactions, I hope to gain valuable insights into the factors influencing diabetes and to uncover patterns in the data that were not originally noticed.

# Part II: Method Justification

## B1.  Explain how the prediction method you chose analyzes the selected data set. Include expected outcomes.

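Decision trees work by recursively splitting the data into smaller groups. At each node, the algorithm chooses the feature and threshold that best separate the target classes, typically by minimizing an impurity measure such as Gini impurity or entropy, and it repeats this until a stopping rule is met. Each leaf then predicts the majority class of the training observations that reach it.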
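
One common way a tree scores a candidate split is to compare Gini impurity before and after the split. The sketch below is a minimal, pure-Python illustration using made-up labels. The function names `gini` and `split_impurity` are hypothetical helpers, not part of this project's code, and the "age <= 50" split is only an example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Weighted average Gini impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Made-up diabetes labels (1 = yes, 0 = no) for eight patients
parent = [1, 1, 1, 0, 0, 0, 0, 0]

# A hypothetical split (e.g. "age <= 50") sends four patients to each side
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

parent_gini = gini(parent)
child_gini = split_impurity(left, right)
print(f"impurity before split: {parent_gini:.4f}")
print(f"impurity after split:  {child_gini:.4f}")
```

A split is attractive when the weighted impurity of the children is well below the impurity of the parent, as it is here.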

## B2.  Summarization: One assumption of Decision Trees.

Unlike Naive Bayes or linear models, Decision Trees generally make no assumptions about the data's distribution or about multicollinearity. This is ideal because the earlier D207 Data Exploration project revealed no linear relationships among the features or with the target variable. Only a handful of variables (`doc_visits` and `vitd_levels`) were approximately normally distributed.

## B3.  Python Packages & Libraries

Python has many versatile libraries, including NumPy (for numerical operations) and Pandas (for manipulating dataframes), which are particularly useful in preparing data for decision tree algorithms. During the data wrangling phase, Pandas proves invaluable for cleaning, transforming, and creating the features needed by decision tree models. Operations such as filling null values, changing data types, and one-hot encoding categorical variables are critical when preparing data for the branching structure of decision trees. NumPy provides fast numerical operations on arrays, which is especially handy when computing splits based on various criteria in decision trees.

The Scikit-learn (sklearn) library is used to build the decision tree models. It also provides utilities for cross-validation, feature importance analysis, and model optimization, all of which are critical for fine-tuning decision trees to improve their performance.
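
As a rough illustration of those Scikit-learn utilities, the sketch below fits a small tree, cross-validates it, and reads its feature importances. It runs on synthetic data from `make_classification`, not on the medical data set, and `max_depth=3` is an arbitrary illustrative value rather than a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the medical data set): 200 rows, 5 features
X, y = make_classification(
    n_samples=200,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    random_state=42,
)

# A shallow tree; max_depth=3 is only an example value
tree = DecisionTreeClassifier(max_depth=3, random_state=42)

# 5-fold cross-validated accuracy
scores = cross_val_score(tree, X, y, cv=5, scoring="accuracy")
print("mean CV accuracy:", scores.mean())

# Fit on all rows to inspect the impurity-based feature importances
tree.fit(X, y)
importances = tree.feature_importances_
print("feature importances:", importances)
```

The importances are normalized to sum to 1, so each value can be read as a feature's share of the total impurity reduction in the fitted tree.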

Visualization libraries such as Matplotlib and Seaborn provide clear, insightful plots for tasks like feature reduction, feature performance analysis, and examining distributions. These quick, intuitive visuals are helpful during data preprocessing, making it easier to identify trends, outliers, and relationships in the data, all of which are crucial for refining models.

In summary, Pandas, NumPy, and Scikit-learn are essential for handling data and building efficient decision tree models, thanks to their ease of use, powerful functionality, and strong community support.

# Part III: Data Preparation

## C1.  Data Pre-processing

Data pre-processing is an essential step in preparing a data set for use with decision trees. It ensures the model can learn accurately from the data and make informed predictions. Although decision trees are less sensitive than many other algorithms to certain data issues (e.g. missing values, categorical data), proper preparation can still significantly improve their performance.

When preparing data for decision tree algorithms, one important pre-processing step is one-hot encoding. Although decision trees can in principle handle categorical data directly, scikit-learn's implementation expects numeric input, and one-hot encoding also helps with interpretability and consistency. One-hot encoding converts categorical variables into a numerical format by creating a new binary column for each category. For example, for the `gender` column, which has the values female, male, and nonbinary, one-hot encoding creates two new columns: one for male and one for nonbinary (female is dropped to avoid redundancy). Each retained category is then represented as either 1 (present) or 0 (absent), and a row with 0 in both columns is female.
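
The `gender` example can be sketched with `pandas.get_dummies`. This is a minimal illustration on a made-up four-row frame, not the project's actual cleaning code, and the capitalized category labels are assumed.

```python
import pandas as pd

# Toy frame for illustration only (not the project's data)
toy = pd.DataFrame({"gender": ["Female", "Male", "Nonbinary", "Female"]})

# drop_first=True drops the first category in sorted order ("Female"),
# leaving one 0/1 indicator column for each remaining category
encoded = pd.get_dummies(toy, columns=["gender"], drop_first=True, dtype=int)
print(encoded)
```

Dropping one category loses no information, because a row with 0 in every indicator column must belong to the dropped category.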

## C2. Variables Used to Perform Analysis

The initial data set variables listed below cover a variety of descriptors, including patient medical history, geographical information, and hospital stay details. Medical history includes features such as back pain, obesity, and age. Geographical information describes the area in which patients live, including the local population count and area type. Lastly, the hospital stay details include the type of admission and the services received.

My **numerical** and **categorical** variables consist of:

| Feature       | Type   |
| :------------- | :----------: |
|`age` | numerical |
|`vitd_levels` | numerical |
|`daily_charges` | numerical |
|`additional_charges` | numerical |
|`population` | numerical |
|`children` | numerical |
|`doc_visits` | numerical |
|`full_meals_eaten` | numerical |
|`vitd_supplement` | numerical |
|`gender` | categorical |
|`marital` | categorical |
|`area` | categorical |
|`initial_admin` | categorical |
|`complication_risk` | categorical |
|`services_received` | categorical |
|`readmission` | categorical |
|`high_blood` | categorical |
|`stroke` | categorical |
|`overweight` | categorical |
|`arthritis` | categorical |
|`diabetes` | categorical |
|`hyperlipidemia` | categorical |
|`backpain` | categorical |
|`anxiety` | categorical |
|`allergic_rhinitis` | categorical |
|`reflux_esophagitis` | categorical |
|`asthma` | categorical |
|`soft_drink` | categorical |

The numerical variables all have an inherent order. Excluding `additional_charges` and `daily_charges`, these variables are primarily counts, which makes them discrete. `age` and `vitd_levels` can arguably be considered continuous, but for this project they will be treated as discrete. `additional_charges` and `daily_charges` are continuous numerical variables. Because decision trees split on thresholds rather than distances, it is not necessary to scale the numerical values.

The categorical variables listed above are categorical because each value represents a category. The same holds for the Boolean variables: even where they are stored as a numeric data type, the values 1 and 0 have no ordinal meaning. A value of 1 indicates positive or True, while 0 indicates negative or False. Across these features the raw values vary between (1, 0), (Yes, No), and (True, False), but in every case there is one value for True and one for False. These variables need to be converted to numerical values so that the model can read them.
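
A minimal sketch of that conversion, assuming the three raw encodings mentioned above, is shown here. The `to_binary` helper and the toy values are hypothetical. The project's own `clean_df` function may handle this differently.

```python
import pandas as pd

# Toy columns for illustration only; the raw encodings in the real data may differ
toy = pd.DataFrame({
    "diabetes": ["Yes", "No", "Yes"],
    "stroke": [True, False, False],
    "asthma": [1, 0, 1],
})


def to_binary(series):
    """Convert a Yes/No, True/False, or 1/0 column to integer 1/0."""
    # Boolean and numeric columns can be cast directly
    if pd.api.types.is_bool_dtype(series) or pd.api.types.is_numeric_dtype(series):
        return series.astype(int)
    # Otherwise assume the column holds the strings "Yes" and "No"
    return series.map({"Yes": 1, "No": 0}).astype(int)


# apply() calls to_binary once per column
binary = toy.apply(to_binary)
print(binary)
```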

## C3.  Explain the steps used to prepare the data for the analysis. Identify the code segment for each step.

### imports

In [5]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from preprocess_model_DecisionTree import acquire_df, clean_df

In [8]:
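# Load the raw medical data, then apply the project's cleaning helper
raw_df = acquire_df(filepath='medical_raw_df.csv')
df = clean_df(raw_df)
# Preview three rows of the cleaned data
df.sample(3)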

| CaseOrder | population | area | children | age | marital | gender | readmission | vitd_levels | doc_visits | full_meals_eaten | ... | hyperlipidemia | backpain | anxiety | allergic_rhinitis | reflux_esophagitis | asthma | services_received | hospital_stay_days | daily_charges | additional_charges |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | 2951 | Suburban | 1 | 53 | Divorced | Male | False | 19.14 | 6 | 0 | ... | False | True | True | True | False | True | Blood Work | 10 | 3726.7 | 17939.4 |
| 2 | 11303 | Urban | 3 | 51 | Married | Female | False | 18.94 | 4 | 2 | ... | False | False | False | False | True | False | Intravenous | 15 | 4193.19 | 17613.0 |
| 3 | 17125 | Suburban | 3 | 53 | Widowed | Female | False | 18.06 | 4 | 1 | ... | False | False | False | False | False | False | Blood Work | 4 | 2434.23 | 17505.19 |


## C4.  Provide a copy of the cleaned data set.

# Part IV: Analysis

D.  Perform the data analysis and report on the results by doing the following:

1.  Split the data into training and test data sets and provide the file(s).

2.  Describe the analysis technique you used to appropriately analyze the data. Include screenshots of the intermediate calculations you performed.

3.  Provide the code used to perform the prediction analysis from part D2.
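
One possible way to produce the training and test sets for step 1 is `train_test_split` from Scikit-learn. The sketch below uses a made-up ten-row frame as a stand-in for the cleaned data. The 80/20 ratio, the stratification on `diabetes`, and the `random_state` value are assumptions, not settings taken from this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the cleaned, encoded data
toy = pd.DataFrame({
    "age": [53, 51, 53, 78, 22, 76, 50, 40, 48, 78],
    "doc_visits": [6, 4, 4, 4, 5, 6, 6, 7, 6, 7],
    "diabetes": [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
})

# Separate the features from the target
X = toy.drop(columns="diabetes")
y = toy["diabetes"]

# Hold out 20% of rows for testing; stratify keeps the class balance similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Each piece can then be written to disk with `DataFrame.to_csv`, for example `X_train.to_csv("X_train.csv")`, where the file name is only a placeholder.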


# Part V: Data Summary and Implications

E.  Summarize your data analysis by doing the following:

1.  Explain the accuracy and the mean squared error (MSE) of your prediction model.

2.  Discuss the results and implications of your prediction analysis.

3.  Discuss one limitation of your data analysis.

4.  Recommend a course of action for the real-world organizational situation from part A1 based on your results and implications discussed in part E2.
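
For item 1, accuracy and MSE can both be computed directly from the true and predicted labels. The sketch below uses made-up labels, not results from this project, to show the arithmetic. For a target coded as 0/1, every wrong prediction contributes a squared error of exactly 1, so MSE equals the misclassification rate.

```python
# Made-up true labels and predictions (1 = diabetes, 0 = no diabetes)
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

n = len(y_true)

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

# MSE: average squared difference between true and predicted labels
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

print(f"accuracy: {accuracy:.2f}")
print(f"MSE:      {mse:.2f}")
```

With 0/1 labels the two metrics carry the same information, since accuracy + MSE = 1.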
