# Medical Appointment No‑Show Prediction & Demand Forecasting
**Interview-Level Notebook – EDA & Understanding**


## 1. Objective of this Notebook
This notebook aims to:

1. Understand the dataset structure  
2. Analyze target imbalance  
3. Identify missing values and their business impact  
4. Explore relationships between:
   - patient demographics  
   - appointment characteristics  
   - health conditions  
   - weather factors  
5. Prepare the foundation for preprocessing and modeling


## 2. Import Required Libraries

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Show every column when displaying wide DataFrames
pd.set_option('display.max_columns', None)


## 3. Load Dataset

In [2]:

data_path = 'Medical_appointment_data.csv'
df = pd.read_csv(data_path)

df.head()


Unnamed: 0,specialty,appointment_time,gender,no_show,disability,place,appointment_shift,age,under_12_years_old,over_60_years_old,patient_needs_companion,average_temp_day,average_rain_day,max_temp_day,max_rain_day,rainy_day_before,storm_day_before,rain_intensity,heat_intensity,appointment_date_continuous,Hipertension,Diabetes,Alcoholism,Handcap,Scholarship,SMS_received
0,psychotherapy,17,F,yes,intellectual,Lake Marvinville,afternoon,9.0,1,0,1,23.18,0.0,27.5,0.0,1,1,no_rain,warm,2020-01-01,0,0,0,0,0,0
1,,7,M,no,intellectual,ITAPEMA,morning,11.0,1,0,1,14.31,0.02,16.5,0.6,1,1,no_rain,cold,2020-01-01,0,0,0,0,0,0
2,speech therapy,16,M,no,intellectual,ITAJAÍ,afternoon,8.0,1,0,1,21.61,0.01,29.9,0.2,1,1,no_rain,warm,2020-01-01,0,0,0,0,0,0
3,speech therapy,14,M,yes,intellectual,Sarahside,afternoon,9.0,1,0,1,21.39,0.11,24.1,1.4,1,1,moderate,mild,2020-01-01,0,0,0,0,0,1
4,physiotherapy,8,M,no,motor,ITAJAÍ,morning,,0,0,0,20.15,0.02,23.1,0.2,1,1,no_rain,mild,2020-01-01,0,0,0,0,0,0


## 4. Basic Dataset Overview

In [3]:

print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
df.info()


Shape: (109593, 26)

Columns: ['specialty', 'appointment_time', 'gender', 'no_show', 'disability', 'place', 'appointment_shift', 'age', 'under_12_years_old', 'over_60_years_old', 'patient_needs_companion', 'average_temp_day', 'average_rain_day', 'max_temp_day', 'max_rain_day', 'rainy_day_before', 'storm_day_before', 'rain_intensity', 'heat_intensity', 'appointment_date_continuous', 'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap', 'Scholarship', 'SMS_received']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109593 entries, 0 to 109592
Data columns (total 26 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   specialty                    89466 non-null   object 
 1   appointment_time             109593 non-null  int64  
 2   gender                       109593 non-null  object 
 3   no_show                      109593 non-null  object 
 4   disability                   92992 non-null   object 
 5   pl


## 5. Target Variable Analysis – No Show

Why this step?
- To check for class imbalance in the target  
- To choose an evaluation metric that suits the class balance  
- To plan a resampling (SMOTE) or class-weight strategy


In [4]:

# Share of each target class, in percent
df['no_show'].value_counts(normalize=True) * 100


no_show
no     68.216948
yes    31.783052
Name: proportion, dtype: float64
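
The roughly 68% / 32% split above is a moderate imbalance: a model that always predicts `no` would already reach about 68% accuracy, which is why accuracy alone is a weak metric here. As a minimal sketch of the class-weight option, the snippet below computes "balanced" weights, n_samples / (n_classes * class_count), on a toy label column. The toy data and the helper name `balanced_class_weights` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy labels that only mimic the ~68/32 split reported above (not the real data)
labels = pd.Series(["no"] * 68 + ["yes"] * 32, name="no_show")


def balanced_class_weights(y: pd.Series) -> dict:
    """Hypothetical helper: weight = n_samples / (n_classes * class_count)."""
    counts = y.value_counts()
    return (len(y) / (len(counts) * counts)).to_dict()


# Accuracy of a trivial model that always predicts the majority class
majority_baseline = labels.value_counts(normalize=True).max()

# The minority class ("yes") receives the larger weight
weights = balanced_class_weights(labels)

print("Majority-class baseline accuracy:", majority_baseline)
print("Balanced class weights:", weights)
```

Weights of this form can be passed to estimators that accept per-class weights; whether weighting or resampling works better for this dataset would need to be checked empirically.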


## 6. Missing Value Analysis

Business reasoning:
- Missing age may indicate incomplete registration  
- Missing specialty may represent walk‑in patients  
- Missing weather values can be imputed from nearby dates


In [5]:

# Count nulls per column, keep only columns with at least one missing value
missing = df.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(ascending=False)


age                 22960
specialty           20127
disability          16601
place               11539
max_rain_day         2263
average_rain_day     2245
max_temp_day         2227
average_temp_day     2211
dtype: int64
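
One possible reading of "imputed using nearby dates" is sketched below: take the daily mean of a weather column, interpolate across dates that have no reading at all, and use that daily series to fill the missing rows. This is an assumption about the approach, not the notebook's confirmed method. The toy frame and the output column `average_temp_day_imputed` are hypothetical; only `appointment_date_continuous` and `average_temp_day` come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (not the real dataset): one date has a partial reading, one has none
toy = pd.DataFrame({
    "appointment_date_continuous": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-03"]
    ),
    "average_temp_day": [20.0, np.nan, np.nan, 24.0, 24.0],
})

# Daily mean per date; a date with no readings stays NaN at this point
daily = toy.groupby("appointment_date_continuous")["average_temp_day"].mean()

# Fill dates without readings from neighbouring dates (time-aware interpolation),
# then forward/backward fill in case gaps sit at the very start or end
daily = daily.interpolate(method="time").ffill().bfill()

# Fill only the missing rows; observed values are left untouched
toy["average_temp_day_imputed"] = toy["average_temp_day"].fillna(
    toy["appointment_date_continuous"].map(daily)
)

print(toy)
```

The same pattern could be repeated for `average_rain_day`, `max_temp_day`, and `max_rain_day`. Rain is far more skewed than temperature in the descriptive statistics, so a median may be a safer daily aggregate for those columns.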

## 7. Descriptive Statistics

In [6]:

df.describe(include='all')

Unnamed: 0,specialty,appointment_time,gender,no_show,disability,place,appointment_shift,age,under_12_years_old,over_60_years_old,patient_needs_companion,average_temp_day,average_rain_day,max_temp_day,max_rain_day,rainy_day_before,storm_day_before,rain_intensity,heat_intensity,appointment_date_continuous,Hipertension,Diabetes,Alcoholism,Handcap,Scholarship,SMS_received
count,89466,109593.0,109593,109593,92992,98054,109593,86633.0,109593.0,109593.0,109593.0,107382.0,107348.0,107366.0,107330.0,109593.0,109593.0,109593,109593,109593,109593.0,109593.0,109593.0,109593.0,109593.0,109593.0
unique,8,,3,2,3,26289,2,,,,,,,,,,,4,5,498,,,,,,
top,psychotherapy,,M,no,intellectual,ITAJAÍ,afternoon,,,,,,,,,,,no_rain,mild,2021-04-02,,,,,,
freq,28645,,82269,74761,62852,20515,59334,,,,,,,,,,,76415,46903,1512,,,,,,
mean,,12.120966,,,,,,18.632138,0.446424,0.071328,0.519823,20.346642,0.183537,24.03291,2.048093,0.937396,0.937533,,,,0.058088,0.023952,0.018541,0.009116,0.055113,0.311808
std,,3.281623,,,,,,17.666999,0.497124,0.257372,0.499609,3.446079,0.416267,3.959696,4.352247,0.242251,0.242004,,,,0.23391,0.152901,0.134899,0.09504,0.228202,0.463234
min,,7.0,,,,,,2.0,0.0,0.0,0.0,8.94,0.0,13.3,0.0,0.0,0.0,,,,0.0,0.0,0.0,0.0,0.0,0.0
25%,,9.0,,,,,,8.0,0.0,0.0,0.0,18.06,0.0,21.4,0.0,1.0,1.0,,,,0.0,0.0,0.0,0.0,0.0,0.0
50%,,13.0,,,,,,12.0,0.0,0.0,1.0,20.6,0.01,23.9,0.2,1.0,1.0,,,,0.0,0.0,0.0,0.0,0.0,0.0
75%,,15.0,,,,,,18.0,1.0,0.0,1.0,22.72,0.15,26.8,1.9,1.0,1.0,,,,0.0,0.0,0.0,0.0,0.0,1.0



## 8. Feature Grouping (Conceptual)

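We categorize the variables into the following groups (using the column names from the dataset):

1. Patient Features  
   - gender, age, disability, patient_needs_companion

2. Appointment Features  
   - specialty, appointment_shift, appointment_date_continuous

3. Location  
   - place (city)

4. Health Conditions  
   - Hipertension (hypertension), Diabetes, Alcoholism

5. Weather  
   - temperature (average_temp_day, max_temp_day), rain (average_rain_day, max_rain_day), intensity (rain_intensity, heat_intensity)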
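
The grouping above can also be written down as a plain dictionary, which makes it easy to confirm that no column has been left out before preprocessing. The sketch below uses the column names printed in Section 4. Where the remaining columns go (`appointment_time`, `under_12_years_old`, `over_60_years_old`, `Handcap`, `Scholarship`, `SMS_received`, and the `*_day_before` flags) is an assumption made here for illustration, as is the helper name `ungrouped_columns`.

```python
# Column names as printed by df.columns in Section 4
ALL_COLUMNS = [
    "specialty", "appointment_time", "gender", "no_show", "disability", "place",
    "appointment_shift", "age", "under_12_years_old", "over_60_years_old",
    "patient_needs_companion", "average_temp_day", "average_rain_day",
    "max_temp_day", "max_rain_day", "rainy_day_before", "storm_day_before",
    "rain_intensity", "heat_intensity", "appointment_date_continuous",
    "Hipertension", "Diabetes", "Alcoholism", "Handcap", "Scholarship",
    "SMS_received",
]

# Conceptual groups; columns not named in the list above are placed by assumption
FEATURE_GROUPS = {
    "patient": ["gender", "age", "under_12_years_old", "over_60_years_old",
                "disability", "patient_needs_companion", "Scholarship"],
    "appointment": ["specialty", "appointment_time", "appointment_shift",
                    "appointment_date_continuous", "SMS_received"],
    "location": ["place"],
    "health": ["Hipertension", "Diabetes", "Alcoholism", "Handcap"],
    "weather": ["average_temp_day", "average_rain_day", "max_temp_day",
                "max_rain_day", "rainy_day_before", "storm_day_before",
                "rain_intensity", "heat_intensity"],
}


def ungrouped_columns(columns, groups, target):
    """Hypothetical helper: columns that are neither the target nor in any group."""
    grouped = {col for cols in groups.values() for col in cols}
    return [col for col in columns if col != target and col not in grouped]


leftover = ungrouped_columns(ALL_COLUMNS, FEATURE_GROUPS, target="no_show")
print("Ungrouped columns:", leftover)
```

With the real data, `ALL_COLUMNS` would be replaced by `df.columns.tolist()`, so the check keeps working if columns are added or renamed.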
