**Question 1**:
Which health indicator(s) are most predictive of future diabetes diagnosis?

**Variables/Attributes**:
Diabetes_binary (target), HighBP, HighChol, CholCheck, BMI, Smoker, Stroke, HeartDiseaseorAttack, PhysActivity, Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare, NoDocbcCost, GenHlth, MentHlth, PhysHlth, DiffWalk, Sex, Age, Education, Income.

**Managerial Decision Making**:
Analyzing these indicators can help healthcare providers and policymakers identify high-risk individuals and target interventions more effectively. Possible decisions include lifestyle intervention programs, more frequent screening of high-risk individuals, and policies aimed at lowering the prevalence of risk factors such as obesity and high cholesterol.

**Analytical Method**:
Because the outcome is binary (diabetes or no diabetes), a Logistic Regression model is an appropriate first choice: it estimates the probability of diabetes as a function of the predictor variables, and each coefficient gives the change in log-odds associated with that predictor.
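
As a reference for reading the coefficients later, the model can be written as follows. This is the standard logistic regression form, with $k = 21$ predictors matching the feature columns in this dataset; the symbols $p$, $\beta_j$ and $x_j$ are notation introduced here for illustration.

$$
\log\frac{p}{1-p} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j,
\qquad
p = P(\text{Diabetes\_binary} = 1 \mid x_1, \dots, x_k)
$$

Holding the other predictors fixed, a one-unit increase in $x_j$ multiplies the odds $p/(1-p)$ by $e^{\beta_j}$, which is the odds ratio for that predictor.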

**Reason for Chosen Methods**:
Logistic Regression is chosen for its simplicity and interpretability, which make it well suited to understanding how each health indicator is associated with diabetes.



In [1]:
import pandas as pd
import statsmodels.api as sm
import numpy as np

In [2]:
# Load the dataset (adjust file_path to wherever the CSV is stored)
file_path = '/content/sample_data/diabetes_binary_health_indicators.csv'
data = pd.read_csv(file_path)

In [3]:
data.shape

(253680, 22)

In [4]:
# Check for missing values
missing_values = data.isnull().sum()
print("Missing Values:\n", missing_values)

Missing Values:
 Diabetes_binary         0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
Education               0
Income                  0
dtype: int64


In [5]:
# Check for infinite values
has_infinite = np.isinf(data).any().any()
print("Has Infinite Values:", has_infinite)

Has Infinite Values: False


In [6]:
# Checking for duplicates
duplicates = data[data.duplicated()]
print("Duplicate Rows: ", len(duplicates))
duplicates.head()

Duplicate Rows:  24206


Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
1242,1.0,1.0,1.0,1.0,27.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,0.0,30.0,1.0,0.0,10.0,4.0,5.0
1563,0.0,0.0,0.0,1.0,21.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,6.0,8.0
2700,0.0,0.0,0.0,1.0,32.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,5.0,6.0,8.0
3160,0.0,0.0,0.0,1.0,21.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,6.0,8.0
3332,0.0,0.0,0.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,9.0,6.0,8.0


In [7]:
# Drop duplicates
data.drop_duplicates(inplace=True)

In [8]:
data.shape

(229474, 22)
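
The row counts are consistent: 253680 − 24206 = 229474. As a minimal sketch of why, the toy example below (made-up rows, not taken from the dataset) shows that `duplicated()` flags every repeat after the first occurrence, so the number it reports equals the number of rows that `drop_duplicates()` removes. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame: two distinct rows, each repeated
toy = pd.DataFrame({
    "HighBP": [1.0, 0.0, 0.0, 1.0, 0.0],
    "BMI":    [27.0, 21.0, 21.0, 27.0, 21.0],
})

# duplicated() marks repeats only; the first occurrence of each row is kept
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print("rows before:", len(toy))
print("flagged as duplicates:", n_dupes)
print("rows after:", len(deduped))
```

In survey data with no respondent ID, identical rows may be distinct people who gave the same answers, so dropping them is a modelling choice rather than a pure cleaning step.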

In [9]:
# Define the dependent variable (target) and the independent variables (features):
# the target is 'Diabetes_binary'; every other column is used as a predictor.
y = data['Diabetes_binary']
X = data.drop(['Diabetes_binary'], axis=1)

In [10]:
# Adding a constant to the independent variable set (required for statsmodels)
X = sm.add_constant(X)

In [11]:
# Fit the logistic regression model
model = sm.Logit(y, X)
result = model.fit()


Optimization terminated successfully.
         Current function value: 0.346123
         Iterations 8


In [12]:
# Display the model summary to review coefficients, p-values, etc.
print(result.summary())


                           Logit Regression Results                           
Dep. Variable:        Diabetes_binary   No. Observations:               229474
Model:                          Logit   Df Residuals:                   229452
Method:                           MLE   Df Model:                           21
Date:                Wed, 08 May 2024   Pseudo R-squ.:                  0.1909
Time:                        04:55:19   Log-Likelihood:                -79426.
converged:                       True   LL-Null:                       -98165.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -7.7514      0.093    -83.258      0.000      -7.934      -7.569
HighBP                   0.7344      0.015     49.695      0.000       0.705       0.763
HighChol    
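
The headline numbers in the fit log and the summary can be cross-checked by hand. The sketch below assumes that statsmodels reports "Current function value" as the mean negative log-likelihood per observation; all inputs are copied from the output above, and the variable names are illustrative.

```python
import math

# Values copied from the fit log and the summary table
n_obs = 229474
mean_neg_loglik = 0.346123      # "Current function value"
log_lik = -79426.0              # "Log-Likelihood"
log_lik_null = -98165.0         # "LL-Null"
highbp_coef = 0.7344
highbp_ci = (0.705, 0.763)      # 95% interval on the log-odds scale

# Log-likelihood recovered from the per-observation value (assumption above)
ll_from_mean = -n_obs * mean_neg_loglik

# McFadden's pseudo R-squared: 1 - LL / LL-Null
pseudo_r2 = 1 - log_lik / log_lik_null

# Odds ratio and its interval: exponentiate the coefficient and its bounds
highbp_or = math.exp(highbp_coef)
highbp_or_ci = tuple(math.exp(b) for b in highbp_ci)

print(f"log-likelihood from mean: {ll_from_mean:.1f}")
print(f"pseudo R-squared: {pseudo_r2:.4f}")
print(f"HighBP odds ratio: {highbp_or:.2f} "
      f"(95% CI {highbp_or_ci[0]:.2f} to {highbp_or_ci[1]:.2f})")
```

Read this way, the HighBP coefficient says that, with the other predictors held fixed, reporting high blood pressure roughly doubles the odds of diabetes in this model.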

In [13]:
# Identify and print the significant variables based on p-values
print("\nSignificant Variables with p-value less than 0.05:")
significant_variables = result.pvalues[result.pvalues < 0.05]
print(significant_variables)



Significant Variables with p-value less than 0.05:
const                   0.000000e+00
HighBP                  0.000000e+00
HighChol                0.000000e+00
CholCheck               3.465687e-78
BMI                     0.000000e+00
Smoker                  3.999960e-02
Stroke                  2.222010e-07
HeartDiseaseorAttack    3.808593e-32
PhysActivity            2.948876e-02
Fruits                  3.099103e-02
HvyAlcoholConsump       5.718075e-98
AnyHealthcare           3.332720e-03
GenHlth                 0.000000e+00
MentHlth                5.885703e-06
PhysHlth                1.705591e-19
DiffWalk                1.350690e-14
Sex                     1.233464e-84
Age                     0.000000e+00
Education               3.410599e-03
Income                  2.092298e-36
dtype: float64


In [15]:
# Identify and print the non-significant variables based on p-values
print("\nVariables with p-value greater than 0.05:")
insignificant_variables = result.pvalues[result.pvalues > 0.05]
print(insignificant_variables)


Variables with p-value greater than 0.05:
Veggies        0.335876
NoDocbcCost    0.633205
dtype: float64
