# **Preventive Care and Health Screening System**

### **Problem Statement**

In modern healthcare, preventive care and health screenings are essential for early detection and effective management of chronic diseases. However, many patients miss their routine screenings due to a lack of timely reminders or awareness. This results in late diagnoses, higher healthcare costs, and poorer health outcomes.

Healthcare providers face challenges in identifying and reaching patients who are due or overdue for preventive screenings. They need a systematic, automated solution to manage patient data, track screening schedules, and provide timely notifications.

---

### **Core Problem Statement**

**"How can we build an automated and cost-effective system that identifies patients overdue for preventive care and health screenings, predicts their risk levels, and sends timely reminders to improve adherence and health outcomes?"**

---

### **Challenges Addressed**
1. **Identification of Overdue Patients**  
   Patients who miss their screenings are often not identified due to fragmented or outdated patient records.

2. **Risk Prediction**  
   Healthcare providers need to prioritize high-risk patients for follow-ups, but manual analysis is time-consuming and error-prone.

3. **Timely Notifications**  
   Many patients do not receive reminders about their preventive care, leading to missed opportunities for early intervention.

4. **Resource Constraints**  
   Many healthcare organizations have limited budgets and resources to develop sophisticated systems.

---

### **Goals of the Solution**
1. **Automate Screening Identification**  
   Develop a system to automatically flag patients overdue for screenings.

2. **Predict Risk Levels**  
   Use machine learning to predict the likelihood of complications based on patient history.

3. **Send Automated Notifications**  
   Integrate notification mechanisms (SMS, email) to remind patients of due or overdue screenings.

4. **Improve Patient Outcomes**  
   Increase adherence to preventive care protocols, leading to early detection and better disease management.

5. **Cost-Effective Implementation**  
   Leverage free and open-source tools to ensure accessibility and affordability for healthcare providers.
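
As a rough illustration of the first goal, the sketch below flags patients whose follow-up date has passed without a completed screening. The column names (`screening_status`, `follow_up_date`), the helper `flag_overdue`, and the sample rows are hypothetical and chosen only for this example.

```python
import pandas as pd

def flag_overdue(patients: pd.DataFrame, today: pd.Timestamp) -> pd.DataFrame:
    """Return a copy with a boolean 'is_overdue' column (hypothetical helper)."""
    out = patients.copy()
    follow_up = pd.to_datetime(out["follow_up_date"])
    # Overdue: the follow-up date is in the past and the screening is not completed.
    out["is_overdue"] = (follow_up < today) & (out["screening_status"] != "Completed")
    return out

# Made-up sample rows for illustration only.
patients = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "screening_status": ["Completed", "Due", "Due"],
    "follow_up_date": ["2024-01-10", "2024-01-10", "2024-12-01"],
})
flagged = flag_overdue(patients, today=pd.Timestamp("2024-06-01"))
print(flagged[["patient_id", "is_overdue"]])
```

Passing `today` explicitly keeps the rule deterministic, which makes it easier to test than a call to the system clock.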

**Let's do some data processing**

In [14]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("data.csv")
df.head()

| | Patient ID | Age | Gender | Chronic Disease | Preventive Measures | Quality of Life Score | Risk Factors | Lifestyle Interventions | BMI | Screening Status | ... | Follow-Up Schedule | Risk Score | Priority Level | Health Advice | Diagnosis Date | Current Medications | Blood Pressure | Blood Sugar Levels | Doctor Name | Healthcare Provider |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S0001 | 80 | Female | Cancer Risk | Diet Plan | 12.5 | Smoking | Exercise | 20.5 | Completed | ... | 2/25/2026 | 70 | Medium | Consider increasing physical activity. | 2018-02-24 | Atorvastatin | 121/77 | NaN | Catherine Thompson | ABC Health |
| 1 | S0002 | 36 | Male | Heart Disease | Routine Screening | 64.8 | Sedentary Lifestyle | Yoga | 27.1 | Overdue | ... | 11/23/2026 | 4 | Low | Consider quitting smoking. | 2017-03-22 | Atorvastatin | 140/87 | 160.5 | Katherine Strong | ABC Health |
| 2 | S0003 | 74 | Male | Heart Disease | Smoking Cessation | 57.1 | Obesity | Exercise | 31.2 | Due | ... | 10/16/2024 | 97 | High | Consider quitting smoking. | 2023-11-25 | Omeprazole | 100/63 | NaN | Joseph Collins | CarePlus |
| 3 | S0004 | 33 | Male | Cancer Risk | Diet Plan | 69.5 | Smoking | Yoga | 25.7 | Overdue | ... | 8/13/2024 | 55 | Medium | Consider quitting smoking. | 2016-12-06 | Metformin | 98/66 | 78.4 | Joshua Schmidt | CarePlus |
| 4 | S0005 | 67 | Male | Heart Disease | Smoking Cessation | 68.8 | Family History | Yoga | 25.8 | Overdue | ... | 3/6/2026 | 46 | Medium | Consider quitting smoking. | 2020-06-09 | Metformin | 117/73 | 76.4 | Ashlee Zimmerman | CarePlus |


In [3]:
print("Column Names:")
print(df.columns)

Column Names:
Index(['Patient ID', 'Age', 'Gender', 'Chronic Disease', 'Preventive Measures',
       'Quality of Life Score', 'Risk Factors', 'Lifestyle Interventions',
       'BMI', 'Screening Status', 'Screening Completion Date',
       'Follow-Up Schedule', 'Risk Score', 'Priority Level', 'Health Advice',
       'Diagnosis Date', 'Current Medications', 'Blood Pressure',
       'Blood Sugar Levels', 'Doctor Name', 'Healthcare Provider'],
      dtype='object')


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Patient ID                 6000 non-null   object 
 1   Age                        6000 non-null   int64  
 2   Gender                     6000 non-null   object 
 3   Chronic Disease            6000 non-null   object 
 4   Preventive Measures        6000 non-null   object 
 5   Quality of Life Score      6000 non-null   float64
 6   Risk Factors               6000 non-null   object 
 7   Lifestyle Interventions    6000 non-null   object 
 8   BMI                        6000 non-null   float64
 9   Screening Status           6000 non-null   object 
 10  Screening Completion Date  3038 non-null   object 
 11  Follow-Up Schedule         6000 non-null   object 
 12  Risk Score                 6000 non-null   int64  
 13  Priority Level             6000 non-null   objec

In [5]:
df.shape

(6000, 21)

In [6]:
df.describe()

| | Age | Quality of Life Score | BMI | Risk Score | Blood Sugar Levels |
|---|---|---|---|---|---|
| count | 6000.0 | 6000.0 | 6000.0 | 6000.0 | 4794.0 |
| mean | 49.025167 | 70.193533 | 25.011167 | 50.461 | 134.520317 |
| std | 18.112322 | 14.877405 | 5.010807 | 28.903444 | 37.141933 |
| min | 18.0 | 12.5 | 5.4 | 1.0 | 70.0 |
| 25% | 33.0 | 60.1 | 21.7 | 25.0 | 102.6 |
| 50% | 49.0 | 70.3 | 25.1 | 50.0 | 134.4 |
| 75% | 65.0 | 80.6 | 28.4 | 75.0 | 166.3 |
| max | 80.0 | 130.3 | 42.1 | 100.0 | 200.0 |


**Handling missing values**

In [9]:
df['Blood Sugar Levels'] = df['Blood Sugar Levels'].fillna(df['Blood Sugar Levels'].mean())
df['Screening Completion Date'] = df['Screening Completion Date'].fillna('Not Available')
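
Filling with the mean of the whole column uses statistics from rows that later end up in the test set. A minimal sketch of one alternative, assuming scikit-learn's `SimpleImputer` is acceptable here, fits the imputer on the training rows only and reuses that mean on the test rows. The toy values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for the 'Blood Sugar Levels' column (values are made up).
toy = pd.DataFrame({"Blood Sugar Levels": [100.0, np.nan, 140.0, 160.0, np.nan, 120.0]})

train, test = train_test_split(toy, test_size=0.33, random_state=0)

# Fit the imputer on the training rows only, then reuse its mean on the test rows.
imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)
test_filled = imputer.transform(test)

print("training mean used for imputation:", imputer.statistics_[0])
```

The same imputer could be placed in front of the scaler inside a `Pipeline`, so that cross-validation refits it on each training fold.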

**Feature encoding for categorical variables**

In [15]:
categorical_cols = ['Gender', 'Chronic Disease', 'Preventive Measures', 'Risk Factors', 'Lifestyle Interventions', 'Screening Status', 'Priority Level', 'Health Advice', 'Current Medications', 'Doctor Name', 'Healthcare Provider']
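# handle_unknown='ignore' avoids errors when a validation fold contains a
# category (for example a doctor name) that was not seen during fitting
encoder = OneHotEncoder(drop='first', handle_unknown='ignore')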

**Feature Scaling**

In [16]:
scaler = StandardScaler()
numerical_cols = ['Age', 'Quality of Life Score', 'BMI', 'Risk Score', 'Blood Sugar Levels']

**Preprocessing pipeline**

In [17]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', scaler, numerical_cols),
        ('cat', encoder, categorical_cols)
    ])

**Split the dataset into features and target**

In [18]:
X = df.drop(['Patient ID', 'Diagnosis Date', 'Screening Completion Date'], axis=1)  # Exclude identifier and date columns
y = df['Risk Score']  # Assuming Risk Score is the target (can be adjusted); note that X still contains this column

**Train-test split**

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**XGBoost model pipeline**

In [21]:
xgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(eval_metric='logloss'))  # use_label_encoder is deprecated in recent XGBoost releases
])

**Hyperparameter tuning using GridSearchCV**

In [22]:
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__learning_rate': [0.01, 0.1, 0.3],
    'classifier__max_depth': [3, 6, 10]
}
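
The size of the search follows directly from the grid: 2 × 3 × 3 = 18 parameter combinations. A quick way to check this before launching a long search is scikit-learn's `ParameterGrid`. The fold count of 3 is taken as an assumption in this sketch.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "classifier__n_estimators": [100, 200],
    "classifier__learning_rate": [0.01, 0.1, 0.3],
    "classifier__max_depth": [3, 6, 10],
}

n_candidates = len(ParameterGrid(param_grid))
n_folds = 3  # assumed number of cross-validation folds
print(n_candidates, n_candidates * n_folds)  # 18 54
```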

In [23]:
grid_search = GridSearchCV(xgb_pipeline, param_grid, cv=3, verbose=1)
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 18 candidates, totalling 54 fits


ValueError: 
All the 54 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
54 fits failed with the following error:
Traceback (most recent call last):
  File "D:\Anaconda\envs\data_analysis\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "D:\Anaconda\envs\data_analysis\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Anaconda\envs\data_analysis\Lib\site-packages\sklearn\pipeline.py", line 473, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "D:\Anaconda\envs\data_analysis\Lib\site-packages\xgboost\core.py", line 726, in inner_f
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "D:\Anaconda\envs\data_analysis\Lib\site-packages\xgboost\sklearn.py", line 1491, in fit
    raise ValueError(
ValueError: Invalid classes inferred from unique values of `y`.  Expected: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99], got [-1.71139201 -1.67679117 -1.64219033 -1.6075895  -1.57298866 -1.53838782
 -1.50378699 -1.46918615 -1.43458531 -1.39998447 -1.36538364 -1.3307828
 -1.29618196 -1.26158113 -1.22698029 -1.19237945 -1.15777861 -1.12317778
 -1.08857694 -1.0539761  -1.01937526 -0.98477443 -0.95017359 -0.91557275
 -0.88097192 -0.84637108 -0.81177024 -0.7771694  -0.74256857 -0.70796773
 -0.67336689 -0.63876606 -0.60416522 -0.56956438 -0.53496354 -0.50036271
 -0.46576187 -0.43116103 -0.3965602  -0.36195936 -0.32735852 -0.29275768
 -0.25815685 -0.22355601 -0.18895517 -0.15435433 -0.1197535  -0.08515266
 -0.05055182 -0.01595099  0.01864985  0.05325069  0.08785153  0.12245236
  0.1570532   0.19165404  0.22625487  0.26085571  0.29545655  0.33005739
  0.36465822  0.39925906  0.4338599   0.46846073  0.50306157  0.53766241
  0.57226325  0.60686408  0.64146492  0.67606576  0.7106666   0.74526743
  0.77986827  0.81446911  0.84906994  0.88367078  0.91827162  0.95287246
  0.98747329  1.02207413  1.05667497  1.0912758   1.12587664  1.16047748
  1.19507832  1.22967915  1.26427999  1.29888083  1.33348166  1.3680825
  1.40268334  1.43728418  1.47188501  1.50648585  1.54108669  1.57568753
  1.61028836  1.6448892   1.67949004  1.71409087]
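
The traceback indicates that `XGBClassifier` expects integer class labels `0..n-1`, whereas `y` contained 100 distinct standardized `Risk Score` values. One possible way forward, sketched here under assumptions, is to predict the three-level `Priority Level` column instead and label-encode it. The sketch also drops `Risk Score` from the features, on the assumption that the priority is derived from the score and would otherwise leak the target. The rows are made up; only the column names mirror the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Age": [80, 36, 74, 33],
    "BMI": [20.5, 27.1, 31.2, 25.7],
    "Risk Score": [70, 4, 97, 55],
    "Priority Level": ["Medium", "Low", "High", "Medium"],
})

# Encode the categorical target as integers 0..n_classes-1, as XGBClassifier requires.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(toy["Priority Level"])

# Drop the target, and Risk Score too, assuming the priority is derived from it.
X = toy.drop(columns=["Priority Level", "Risk Score"])

print(dict(zip(label_encoder.classes_, range(len(label_encoder.classes_)))))
print(y)
```

With this target, `Risk Score` and `Priority Level` would also have to be removed from `numerical_cols` and `categorical_cols` before refitting `grid_search`. If the numeric score itself is the quantity of interest, `XGBRegressor` from the same package is the regression counterpart. Passing `error_score='raise'` to `GridSearchCV`, as the error message suggests, surfaces the first failure immediately instead of after all 54 fits.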
