# Data Mining/Machine Learning Project: Medical Appointments - No Show

## Goals
1. Given a set of attributes/factors, predict if a person will miss their appointment or not.
2. Determine what factors contribute the most to a person missing their appointment.
3. Compare the performance of the 2 data mining/analysis methods implemented for this project.

## I. Business Understanding

Missed appointments are costly for medical institutions. Understanding the factors that cause no-shows is therefore vital in the search for potential solutions. The insights from this dataset offer the following benefits:

1. Hospitals can send additional reminders to patients at higher risk of missing appointments.
2. Understand whether the reminder methods (in this case: SMS) are effective, and adjust the strategy as necessary.
3. Inform the appointment management/scheduling strategy (for example, whether to offer more same-day or more routine appointments).

## II. Data Understanding
### Dataset:
The dataset contains information about medical appointments and has 14 variables. In the raw file they are named PatientId, AppointmentID, Gender, ScheduledDay, AppointmentDay, Age, Neighbourhood, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, SMS_received, and No-show. The misspelled names are corrected when the columns are renamed.

### Tasks:

1. Explore the dataset to understand its structure, size, and features.
2. Check for missing values, outliers, and data types.
3. Understand the distribution of the target variable (NoShow).
4. Explore and analyze the relationships between features and the target variable.

In [None]:
# Load the required libraries
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
plt.style.use('fivethirtyeight')
pd.set_option('display.width', 1000)
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the data into a pandas dataframe
df = pd.read_csv('dataset.csv')


## Data Size, Dimensionality, and Data Types
The dataset provided by [source] has 110527 rows and 14 columns (m x n). We can identify the following columns and their data types (nominal, ordinal, continuous, or date):
1. PatientId: nominal
2. AppointmentID: nominal
3. Gender: nominal
4. ScheduledDay: date type
5. AppointmentDay: date type
6. Age: continuous
7. Neighbourhood: nominal
8. Scholarship: nominal
9. Hypertension: nominal
10. Diabetes: nominal
11. Alcoholism: nominal
12. Handcap: nominal
13. SMS_received: nominal
14. No-show: nominal

In [None]:
# Dataset shape
df.shape

In [None]:
# First 5 rows of the dataset
df.head()

The dataset has 14 columns or characteristics.

In [None]:
df.tail()

In [None]:
# List columns in the dataset
df.columns

In [None]:
# Duplication check
df.duplicated().sum()

## Dataframe overall information
The dataset has no missing values across all rows and columns.

In [None]:
df.info()
df.isna().sum()

In [None]:

# Drop the appointment identifier: it is an ID, not a predictive feature
df.drop(columns=['AppointmentID'], inplace=True)

In [None]:
# Rename columns to fix typos and follow Pythonic (snake_case) naming conventions
column_rename_dict = {
    column: column.lower().replace(' ', '_').replace('-', '_')
    for column in df.columns
}

column_rename_dict['Hipertension'] = 'hypertension'
column_rename_dict['Handcap'] = 'handicap'
column_rename_dict['AppointmentDay'] = 'appointment_day'
column_rename_dict['ScheduledDay'] = 'scheduled_day'
column_rename_dict['PatientId'] = 'patient_id'

df.rename(columns=column_rename_dict, inplace=True)
# Check
df.columns


## Descriptive Statistics
1. The minimum age is -1, which is not possible.
2. Scholarship, Hypertension, and Diabetes are binary for all rows, but Handicap has a maximum value of 4. Either the attribute should be binary and the values above 1 are errors, or it records the number of handicaps the patient has. The description provided by the source on Kaggle states that it is a True/False flag, while the Kaggle Discussions indicate that it is the number of handicaps.
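
Both anomalies above can also be surfaced with a simple range check. The following is a minimal sketch on a made-up frame; the `toy` values and the `expected_ranges` mapping are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the real dataframe (values are made up)
toy = pd.DataFrame({
    "age": [34, -1, 0, 62],
    "handicap": [0, 1, 4, 0],
    "diabetes": [0, 1, 0, 0],
})

# Assumed valid (min, max) range per column; a hypothetical helper mapping
expected_ranges = {"age": (0, 120), "handicap": (0, 1), "diabetes": (0, 1)}

# Count the rows that fall outside the expected range for each column
out_of_range = {
    col: int((~toy[col].between(low, high)).sum())
    for col, (low, high) in expected_ranges.items()
}
print(out_of_range)
```

A non-zero count flags a column that needs a closer look before modeling.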

In [None]:
df[df.select_dtypes(exclude='object').columns.drop(["patient_id"])].describe().T

In [None]:
num_cols_no_age = df.select_dtypes(exclude='object').columns.drop(['patient_id', 'age'])

# Concatenate the percentage distribution data for all columns
perc_dist = pd.concat([pd.DataFrame({f"{column} value": df[column].value_counts(normalize=True).index,
                                                f"{column} percentage %": (df[column].value_counts(normalize=True) * 100).round(4).values})
                                  for column in num_cols_no_age], axis=1)
perc_dist = perc_dist.fillna(0)

perc_dist

## Data Cleaning
The goal is to remove anomalies in order to improve data quality. The descriptive statistics revealed anomalies in the age and handicap columns. The age column is cleaned here, while the handicap column is handled at the modeling stage. We also ensure the date columns are converted correctly to datetime.
### Steps:
1. Remove the row with age = -1. It is removed manually because only 1 record has this issue, so dropping it will not significantly affect the relationship between age and the target variable during data modeling.
2. Convert scheduled_day and appointment_day columns to datetime.


In [None]:
# Inspect the row(s) with a negative age
df.query("age == -1")

The row with the negative age belongs to the ROMÃO neighbourhood. Check that ROMÃO still has enough rows in the dataset. If that's the case, the loss of this one record is negligible.

In [None]:
len(df[df['neighbourhood'] ==  "ROMÃO"])

In [None]:
# Drop the row with negative age
negative_age_idx = df[df["age"] == -1].index
df.drop(negative_age_idx, inplace=True)

In [None]:
# Convert the scheduled_day and appointment_day columns to datetime
df['scheduled_day'] = pd.to_datetime(df['scheduled_day'])
df['appointment_day'] = pd.to_datetime(df['appointment_day'])

In [None]:
nominal_columns = df.select_dtypes(include='object').columns
numerical_columns =df.select_dtypes(exclude='object').columns

nominal_cols_list = nominal_columns.tolist()
num_cols_list = numerical_columns.tolist()
numerical_columns

In [None]:
df[df.columns.drop(['patient_id', 'appointment_day', 'scheduled_day'])].hist(figsize=(16,8))

## Age Group Distribution
From plotting the age distribution, infants (0 years) are the most frequent single age. The distribution is slightly right-skewed (its tail extends toward older ages), meaning only a minority of the patients in the dataframe belong to the senior/elderly population.
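
The direction of skew can be checked numerically rather than by eye. The sketch below uses made-up ages (an assumption for illustration only): a positive `Series.skew()` together with a mean above the median indicates a tail toward older ages, while a negative value would indicate a tail toward younger ages.

```python
import pandas as pd

# Made-up ages: many young patients and a thin tail of older ones
toy_ages = pd.Series([0, 0, 1, 2, 3, 5, 8, 12, 40, 90])

skewness = toy_ages.skew()  # sample skewness; > 0 means the tail is on the right
mean_age, median_age = toy_ages.mean(), toy_ages.median()
print(f"skew={skewness:.2f}, mean={mean_age:.1f}, median={median_age:.1f}")
```

On the real data the same check would be `df['age'].skew()`.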

In [None]:
# Discretize the Age column into bins
# The last bin is open-ended so that ages of 100 and above are not turned into NaN
age_bins = [0, 18, 30, 45, 60, 75, np.inf]
age_labels = ['0-17', '18-29', '30-44', '45-59', '60-74', '75+']
age_group_df = pd.cut(df['age'], bins=age_bins, labels=age_labels, right=False)

# Calculate age group counts and percentages
age_group_counts = age_group_df.value_counts().sort_index()
age_group_percentages = (age_group_counts / age_group_counts.sum()) * 100

# Prepare the data for plotting
plot_data = pd.DataFrame({'AgeGroup': age_group_counts.index, 'Count': age_group_counts.values, 'Percentage': age_group_percentages.values})

# Plot the bar chart
plt.figure(figsize=(14, 7))
sns.barplot(data=plot_data, x='AgeGroup', y='Count', palette='pastel', hue='AgeGroup', dodge=False)
plt.legend([],[], frameon=False)  # Hide the legend

# Add percentages on top of bars
for i, (count, percentage) in enumerate(zip(plot_data['Count'], plot_data['Percentage'])):
    plt.text(i, count, f'{percentage:.1f}%', ha='center', va='bottom')

plt.title('Age Group Distribution')
plt.xlabel('Age Group')
plt.ylabel('Number of Patients')
plt.show()

In [None]:
# Boolean masks: `show` marks attended appointments, `no_show` marks missed ones
show = df['no_show'] == 'No'
no_show = df['no_show'] == 'Yes'

In [None]:
# Plot histograms for age based on attendance
plt.figure(figsize=(10, 6))

# Histogram for age of patients who showed up
plt.hist(df['age'][show], bins=18, color='green', alpha=0.5,  label='Showed up')
# Histogram for age of patients who didn't show up
plt.hist(df['age'][no_show], bins=18, alpha=0.5, color='red', label='No show')

plt.xlabel('Age')
plt.ylabel('Number of Patients')
plt.title('Attendance by Age')
plt.legend()

plt.show()

## Comparison of attendance between genders (Male and Female)

Among patients who attended their appointments, 64.9% were female and 35.1% were male. Among patients who did not attend, 65.39% were female and 34.61% were male. The dataset is skewed towards females, but because the gender split is almost identical in both groups, gender is not a strong predictor of no-show behavior here. Consequently, to achieve the project's aim of improving attendance rates, it is important to explore other variables such as age, medical conditions, and SMS reminders, which may relate more strongly to patient attendance.
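
The percentages above describe the gender split within each attendance outcome, which is not the same as the no-show rate within each gender. A minimal sketch of the difference, using a made-up ten-row frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "gender":  ["F", "F", "F", "F", "F", "F", "M", "M", "M", "M"],
    "no_show": ["No", "No", "No", "No", "Yes", "Yes", "No", "No", "No", "Yes"],
})

# Share of each gender within each attendance outcome (columns sum to 1)
gender_given_outcome = pd.crosstab(toy["gender"], toy["no_show"], normalize="columns")

# No-show rate within each gender (rows sum to 1)
outcome_given_gender = pd.crosstab(toy["gender"], toy["no_show"], normalize="index")

print(gender_given_outcome)
print(outcome_given_gender)
```

For prediction, the row-normalized table is usually the more direct view, since it compares the no-show rate of one gender with that of the other.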

In [None]:



# Gender split within each attendance outcome (shares sum to 1 per outcome)
gender_showed = df[show]['gender'].value_counts(normalize=True)
gender_no_show = df[no_show]['gender'].value_counts(normalize=True)

colors = ['lightgreen', 'lightcoral']
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
gender_showed.plot(kind='pie', autopct='%1.1f%%', ax=axes[0], colors=colors)
axes[0].set_title('Gender split among patients who attended', fontdict={'fontsize': 12})
axes[0].set_ylabel('')

gender_no_show.plot(kind='pie', autopct='%1.1f%%', ax=axes[1], colors=colors)
axes[1].set_title('Gender split among patients who did not show up', fontdict={'fontsize': 12})
axes[1].set_ylabel('')

plt.show()

## Comparison of attendance by chronic disease
(Note to self): Get reference for alcoholism definition as a chronic disease.
In exploring the relationship between chronic diseases and appointment attendance, our objective is to understand whether patients with chronic conditions show different attendance patterns from those without. Attendance is slightly higher among patients with a chronic disease: 82.23% attended versus 79.09% of those without. Conversely, 17.77% of patients with chronic diseases missed appointments, compared with 20.91% of those without.

This difference of 3.14 percentage points is small, but it hints that ongoing health management may influence attendance behavior, which could help healthcare providers tailor interventions and improve appointment adherence across patient groups. The scale of this influence cannot be determined yet because the data was collected over a short period. A longer collection window may give a clearer picture. For the purposes of this data exploration and modeling, chronic diseases such as hypertension, diabetes, and alcoholism do not show a noteworthy correlation with appointment adherence.

In [None]:
# Create a new column to indicate if a patient has any chronic disease
dfc = df.copy(deep=True)
dfc['has_chronic_disease'] = dfc[['hypertension', 'diabetes', 'alcoholism']].sum(axis=1) > 0

# Calculate the counts of no-shows and shows for patients with and without chronic diseases
comparison = dfc.groupby(['has_chronic_disease', 'no_show']).size().unstack().fillna(0)

# Plot the bar chart
ax = comparison.plot(kind='bar', stacked=True, color=['lightgreen', 'lightcoral'], figsize=(9, 7))

# Set labels and title
ax.set_xlabel('Chronic Disease Status')
ax.set_ylabel('Count')
ax.set_title('Attendance Comparison: Patients with and without Chronic Diseases')
ax.set_xticklabels(['No Chronic Disease', 'Has Chronic Disease'], rotation=0)


# Show the plot
plt.show()

In [None]:
# Calculate the counts of no-shows and shows for patients with and without chronic diseases
attendance_comparison = dfc.groupby(['has_chronic_disease', 'no_show']).size().unstack().fillna(0)

# Calculate percentages
attendance_percentages = attendance_comparison.div(attendance_comparison.sum(axis=1), axis=0) * 100

# Prepare data for printing as a table
table_data = [
    ["Chronic Disease", "No-show", "Show"],
    [False, attendance_comparison.loc[False, 'Yes'], attendance_comparison.loc[False, 'No']],
    [True, attendance_comparison.loc[True, 'Yes'], attendance_comparison.loc[True, 'No']],
]

# Print table headers
print("Counts of Attendance for Patients with and without Chronic Diseases:")
# Print table rows
for row in table_data:
    print("{:<17} | {:<7} | {:<5}".format(*row))

print("\nPercentages of Attendance for Patients with and without Chronic Diseases:")
# Print percentages
print(attendance_percentages.round(4))


## Attendance comparison based on SMS received
In analyzing the relationship between SMS reminders and appointment attendance, our aim is to determine whether patients who receive an SMS reminder behave differently from those who do not. The results show a notable difference: 83.30% of patients who did not receive an SMS reminder attended their appointments, while 16.70% did not, and patients who did receive a reminder attended at a lower rate. Taken at face value, this is the opposite of the expectation that reminders improve attendance. However, we need to investigate how same-day appointments contribute to this finding.

In [None]:
plt.figure(figsize=(8, 6))
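# Fix the hue order so the legend labels below line up: 'No' (attended) first, then 'Yes' (missed)
sns.countplot(x='sms_received', hue='no_show', hue_order=['No', 'Yes'], data=df, palette='pastel')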

# Set labels and title
plt.xlabel('SMS Received', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Attendance comparison based on SMS Reception', fontsize=14)

# Show plot
plt.legend(title='Attended', labels=['Yes', 'No'])
plt.show()

In [None]:
# Calculate the percentages of show and noshow instances for each category of SMS reception
perc_sms_show = df.groupby('sms_received')['no_show'].value_counts(normalize=True)[:, 'No'] * 100
perc_sms_noshow = df.groupby('sms_received')['no_show'].value_counts(normalize=True)[:, 'Yes'] * 100

# Prepare data for printing as a table
table_data = []
for sms_received, show_percentage, noshow_percentage in zip(perc_sms_show.index, perc_sms_show.values, perc_sms_noshow.values):
    table_data.append([sms_received, show_percentage, noshow_percentage])

# Print table headers
print("SMS Received | Show Percentage | Noshow Percentage")
# Print table rows
for row in table_data:
    print("{:<12} | {:<15.2f}% | {:<15.2f}%".format(*row))


## Same day appointments statistics
Roughly 35% of all appointments recorded were same-day appointments. This share is large enough to influence the results gathered earlier, presumably because a same-day appointment leaves little or no time for a reminder to matter. It is therefore necessary to filter out same-day appointments to get a fairer test of the impact of the SMS campaign.
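
To see how same-day appointments can distort the pooled comparison, consider the sketch below. The counts are made up, and it assumes for illustration that same-day appointments are almost always attended and never receive an SMS. Neither assumption is taken from the dataset.

```python
import pandas as pd

# Made-up aggregate counts chosen only to illustrate the effect
toy = pd.DataFrame({
    "same_day":     [True, False, False],
    "sms_received": [0, 0, 1],
    "appointments": [100, 100, 100],
    "shows":        [95, 70, 75],
})

def show_rate(frame):
    """Return the show rate (%) for each sms_received group."""
    totals = frame.groupby("sms_received")[["shows", "appointments"]].sum()
    return totals["shows"] / totals["appointments"] * 100

pooled = show_rate(toy)                         # all appointments together
later_only = show_rate(toy[~toy["same_day"]])   # same-day appointments removed

print(pooled)
print(later_only)
```

In this toy example the no-SMS group looks better when everything is pooled, yet the SMS group does better once same-day appointments are removed. That reversal is why the filtered comparison is the more informative one.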

In [None]:
# An appointment is same-day when it is scheduled on the calendar date on which it takes place
# (comparing full dates already covers the month, so no separate month check is needed)
is_same_day = df['scheduled_day'].dt.date == df['appointment_day'].dt.date
same_day_appts = df[is_same_day]
same_day_appts_count = len(same_day_appts)
# Non-same day appointments
not_same_day_appts = df[~is_same_day]

print(f"Number of appointments scheduled on the same day: {same_day_appts_count}")
print(f"Percentage of appointments scheduled on the same day: {(same_day_appts_count / df.shape[0]) * 100:.4f}%")

In [None]:
not_same_day_appts_count = len(not_same_day_appts)
print(f"Number of appointments scheduled on different days: {not_same_day_appts_count}")
not_same_day_appts.shape

In [None]:
plt.figure(figsize=(10, 9))
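# Fix the hue order so the legend labels below line up: 'No' (attended) first, then 'Yes' (missed)
sns.countplot(x='sms_received', hue='no_show', hue_order=['No', 'Yes'], data=not_same_day_appts, palette='pastel')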

# Set labels and title
plt.xlabel('SMS Received', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Attendance comparison based on SMS Reception (non-same-day appointments)', fontsize=14)

# Show plot
plt.legend(title='Attended', labels=['Yes', 'No'])
plt.show()

After filtering out same-day appointments, the new analysis revealed that patients who did not receive an SMS had a show percentage of 70.55% and a no-show percentage of 29.45%. Those who received an SMS showed a slight increase in attendance, with a show percentage of 72.43% and a no-show percentage of 27.57%. This suggests that, for non-same-day appointments, receiving an SMS has a modest positive impact on attendance, improving the show rate by approximately 2 percentage points compared to those who did not receive an SMS.
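
The size of the improvement can be expressed either as an absolute difference in percentage points or as a relative change. A short sketch using the two show percentages quoted above (the variable names are illustrative):

```python
no_sms_show = 70.55   # show percentage without an SMS (non-same-day appointments)
sms_show = 72.43      # show percentage with an SMS (non-same-day appointments)

# Absolute difference, in percentage points
point_diff = sms_show - no_sms_show

# Relative change with respect to the no-SMS show rate, in percent
relative_change = point_diff / no_sms_show * 100

print(f"{point_diff:.2f} percentage points, {relative_change:.2f}% relative")
```

Stating which of the two is meant avoids ambiguity when the effect is reported.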

In [None]:
# Calculate the percentages of show and noshow instances for each category of SMS reception for non-same day appointments
perc_sms_show = not_same_day_appts.groupby('sms_received')['no_show'].value_counts(normalize=True)[:, 'No'] * 100
perc_sms_noshow = not_same_day_appts.groupby('sms_received')['no_show'].value_counts(normalize=True)[:, 'Yes'] * 100

# Prepare data for printing as a table
table_data = []
for sms_received, show_percentage, noshow_percentage in zip(perc_sms_show.index, perc_sms_show.values, perc_sms_noshow.values):
    table_data.append([sms_received, show_percentage, noshow_percentage])

# Print table headers
print("SMS Received | Show Percentage | Noshow Percentage")
# Print table rows
for row in table_data:
    print("{:<12} | {:<15.4f}% | {:<15.4f}%".format(*row))


## Attendance comparison by Handicap

Based on the analysis of attendance comparison based on the level of handicap, we observe varying trends. The majority of appointments involve patients with no reported handicap, comprising approximately 97.97% of the dataset. Among these appointments, the no-show rate is 20.24%, indicating a moderate but notable proportion of missed appointments. Interestingly, appointments involving patients with a reported handicap level of 1 or 2 exhibit slightly lower no-show rates compared to those with no reported handicap, suggesting a potential correlation between a mild level of handicap and increased appointment attendance. However, caution is warranted in interpreting these findings due to the relatively small sample sizes of patients with higher levels of handicap (levels 3 and 4), which may not be representative. Further investigation with larger datasets or stratified analyses by handicap severity may provide deeper insights into the relationship between handicap level and appointment attendance.

In [None]:
# Calculate the counts for show and no-show for each handicap level
handicap_attendance_counts = df.groupby(['handicap', 'no_show']).size().unstack().fillna(0)

# Plot the bar chart
handicap_attendance_counts.plot(kind='bar', stacked=True, figsize=(10, 6), color=['lightgreen', 'lightcoral'])

# Add labels and title
plt.xlabel('Handicap Level')
plt.ylabel('Count')
plt.title('Attendance Comparison Based on Handicap')
plt.legend(title='Attendance', loc='upper right', labels=['Show', 'No-show'])
plt.xticks(rotation=0)

plt.show()

In [None]:
# Calculate the total count per handicap level
handicap_total_counts = df.groupby('handicap').size()


handicap_attendance_counts = df.groupby(['handicap', 'no_show']).size().unstack().fillna(0)

handicap_attendance_percentages = handicap_attendance_counts.div(handicap_total_counts, axis=0) * 100

handicap_summary = pd.DataFrame({
    'Handicap Level': handicap_total_counts.index,
    'Total Count': handicap_total_counts.values,
    'Show Count': handicap_attendance_counts['No'].values,
    'Noshow Count': handicap_attendance_counts['Yes'].values,
    'Show Percentage': handicap_attendance_percentages['No'].values,
    'Noshow Percentage': handicap_attendance_percentages['Yes'].values
})

# Print the summary table
print(handicap_summary.to_string(index=False))


## Attendance Comparison based on Scholarship status

We observe that the majority of patients without scholarship status attended their appointments, with an attendance rate of 80.19%. Conversely, patients with scholarship status had a slightly lower attendance rate of 76.26%.

In [None]:
# Calculate the counts for show and no-show for each scholarship status
scholarship_attendance_counts = df.groupby(['scholarship', 'no_show']).size().unstack()

# Plot the bar chart
scholarship_attendance_counts.plot(kind='bar', stacked=True, figsize=(10, 6), color=['lightgreen', 'lightcoral'])

# Add labels and title
plt.xlabel('Scholarship Status')
plt.ylabel('Count')
plt.title('Attendance Comparison Based on Scholarship Status')
plt.legend(title='Attendance', loc='upper right', labels=['Show', 'No-show'])
plt.xticks(rotation=0)

plt.show()
scholarship_attendance_counts.head()

In [None]:
# Sanity check: the per-status counts should add up to the total number of rows
print(df.groupby('scholarship').size().sum())
df.groupby('scholarship').size()

In [None]:
# Calculate the total count per scholarship status
scholarship_total_counts = df.groupby('scholarship').size()
scholarship_attendance_perc = scholarship_attendance_counts.div(scholarship_total_counts, axis=0) * 100

scholarship_summary = pd.DataFrame({
    'Scholarship Status': scholarship_total_counts.index,
    'Total Count': scholarship_total_counts.values,
    'Show Count': scholarship_attendance_counts['No'].values,
    'Noshow Count': scholarship_attendance_counts['Yes'].values,
    'Show Percentage': scholarship_attendance_perc['No'].values,
    'Noshow Percentage': scholarship_attendance_perc['Yes'].values
})

print(scholarship_summary.to_string(index=False))

## Attendance comparison based on Neighborhood

The attendance percentages vary considerably between neighbourhoods, which suggests that neighbourhood may be related to attendance, perhaps more strongly than the other features explored in this analysis. Part of this variability may simply come from neighbourhoods with very few appointments, so this needs further investigation, although that is beyond the scope of this data analysis.
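
Rates computed from only a handful of appointments are noisy, so it can help to look at the no-show rate alongside the group size. The sketch below uses made-up data, and the `min_appointments` threshold is a hypothetical choice, not a value used elsewhere in this analysis.

```python
import pandas as pd

toy = pd.DataFrame({
    "neighbourhood": ["A"] * 8 + ["B"] * 2,
    "no_show": ["No"] * 6 + ["Yes"] * 2 + ["Yes"] * 2,
})

# Number of appointments and no-show rate per neighbourhood
summary = (
    toy.assign(missed=toy["no_show"].eq("Yes"))
       .groupby("neighbourhood")["missed"]
       .agg(appointments="size", no_show_rate="mean")
)

min_appointments = 5  # hypothetical threshold for a rate to be considered stable
reliable = summary[summary["appointments"] >= min_appointments]

print(summary)
print(reliable)
```

Here neighbourhood B has a 100% no-show rate, but it rests on only two appointments, so it is excluded from the `reliable` view.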

In [None]:
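# Does neighbourhood affect attendance?
# Build both series on one shared index so every bar lines up with its own neighbourhood label
neighbourhood_counts = df.groupby(['neighbourhood', 'no_show']).size().unstack(fill_value=0)
neighbourhood_counts = neighbourhood_counts.sort_values('No', ascending=False)

ax = neighbourhood_counts['No'].plot(kind='bar', color='lightgreen', label='show', figsize=(20, 10))
neighbourhood_counts['Yes'].plot(kind='bar', color='lightcoral', label='no show', ax=ax)
plt.legend()
plt.title('Comparison according to Neighbourhood')
plt.xlabel('Neighbourhood')
plt.ylabel('Number of Patients')
plt.show()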

In [None]:
# Group by neighborhood and no_show, calculate counts
neighborhood_counts = df.groupby(['neighbourhood', 'no_show']).size().unstack().fillna(0)

# Calculate total count per neighborhood
total_counts = neighborhood_counts.sum(axis=1)

# Sort neighborhoods by total count in descending order
sorted_neighborhood_counts = neighborhood_counts.loc[total_counts.sort_values(ascending=False).index]

# Calculate percentages
neighborhood_percentages = (sorted_neighborhood_counts.div(sorted_neighborhood_counts.sum(axis=1), axis=0) * 100).round(2)

# Prepare the data for printing
data = {
    'Neighborhood': sorted_neighborhood_counts.index,
    'Total Count': sorted_neighborhood_counts.sum(axis=1),
    'Show Count': sorted_neighborhood_counts['No'],
    'Noshow Count': sorted_neighborhood_counts['Yes'],
    'Show Percentage': neighborhood_percentages['No'],
    'Noshow Percentage': neighborhood_percentages['Yes']
}

# Create DataFrame
result_df = pd.DataFrame(data)

# Print the table
print(result_df.head(15).to_string(index=False))


## Class Imbalance Investigation
There is a significant imbalance between the classes: over 88k appointments were attended versus over 22k missed. A similar, though somewhat smaller, imbalance remains after filtering out same-day appointments, since the roughly 35% of appointments that were same-day were mostly shows (No in the no_show class). This imbalance must be considered during data modeling. It also means that accuracy alone may be a misleading metric for model quality, and metrics such as F1 score and ROC AUC may be more appropriate. Random undersampling is another technique that could be applied.
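
The sketch below illustrates both points on made-up data with an 80/20 class split (an assumption chosen to roughly mirror the imbalance described above). It shows the accuracy that a trivial majority-class predictor achieves, and one way random undersampling could be done with pandas.

```python
import pandas as pd

# Toy frame with an 80/20 class split (made-up data)
toy = pd.DataFrame({"no_show": ["No"] * 80 + ["Yes"] * 20, "row_id": range(100)})

# Always predicting the majority class already reaches this accuracy
majority_baseline_accuracy = toy["no_show"].value_counts(normalize=True).max()

# Random undersampling: sample every class down to the size of the minority class
minority_size = toy["no_show"].value_counts().min()
balanced = (
    toy.groupby("no_show", group_keys=False)
       .sample(n=minority_size, random_state=42)
)

print(majority_baseline_accuracy)
print(balanced["no_show"].value_counts())
```

If undersampling is used, it would normally be applied to the training split only, so that the validation and testing sets keep the original class distribution.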

In [None]:
df['no_show'].value_counts()

In [None]:
# Investigate the class imbalance in the dataset (no-shows vs shows) and plot the distribution on one bar chart

sns.countplot(x='no_show', data=df, palette='pastel')
plt.title("Classes distribution")
plt.show()



In [None]:
# Check class imbalance if same day appointments are removed  and plot the distribution on one bar chart
sns.countplot(x='no_show', data=not_same_day_appts, palette='pastel')
plt.title("Classes distribution")
plt.show()

In [None]:

# Ratio of shows to no-shows
class_counts = df['no_show'].value_counts()
shows_per_no_show = class_counts['No'] / class_counts['Yes']
print(f'Ratio of shows to no-shows: {shows_per_no_show:.2f}:1')

# Data Modelling
For the data modeling stage, two techniques were chosen to predict whether a patient will miss their appointment:
1. Logistic Regression Classifier
2. Neural Network Classifier

This is the order of steps that will be followed:
1. Perform feature engineering.
    - Convert categorical features to numerical. This can be done via one hot encoding.
    - Create new features as needed.
    - Perform feature selection.
2. Split the dataset into training, validation, and testing sets.
3. Design and train the models on the training set and tune hyperparameters with the validation set.
4. Evaluate each model's performance on the testing set via accuracy, F1 score, confusion matrix, and ROC AUC.
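
The splitting step can be sketched as follows. The 60/20/20 proportions, the random feature matrix, and the variable names are assumptions for illustration. The split is stratified so that each subset keeps the same class ratio, which matters for an imbalanced target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up feature matrix and an imbalanced binary target (20% positives)
rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = np.array([1] * 200 + [0] * 800)

# First split off 40% of the data, then halve it into validation and test sets
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Any scaler or encoder would then be fitted on the training set only and applied unchanged to the validation and testing sets.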

## 1. Convert categorical features to numerical

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
dfm = df.copy(deep=True)

In [None]:
# Drop rows where the values of the handicap column are greater than 1
dfm.drop(dfm[dfm['handicap'] > 1].index, inplace=True)

In [None]:
# Create Male and Female numerical columns from gender column
dfm['is_male'] = dfm['gender'].map({'M': 1, 'F': 0})

# Convert the target variable column
dfm['no_show'] = dfm['no_show'].map({'Yes': 1, 'No': 0})

# Drop the gender column
dfm.drop(columns=['gender'], inplace=True)

### Frequency encoding of the neighbourhood column
The neighbourhood column has high cardinality, so one-hot encoding would increase the number of columns significantly. Therefore, we use frequency encoding to encode the neighbourhood column.
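
A minimal sketch of frequency encoding on made-up labels (the `toy` frame is an assumption for illustration). Each category is replaced by its relative frequency, which yields a single numeric column instead of one column per category. The trade-off is that categories with the same count become indistinguishable.

```python
import pandas as pd

toy = pd.DataFrame({"neighbourhood": ["A", "A", "A", "B", "B", "C", "D", "D"]})

# Relative frequency of each category
freq = toy["neighbourhood"].value_counts(normalize=True)

# Replace each label with its frequency
toy["neighbourhood_encoded"] = toy["neighbourhood"].map(freq)

# Categories that share a frequency collapse onto the same encoded value
collisions = sorted(freq[freq.duplicated(keep=False)].index)

# For comparison: one-hot encoding needs one column per category
one_hot_width = pd.get_dummies(toy["neighbourhood"]).shape[1]

print(toy)
print(collisions, one_hot_width)
```

In this toy example B and D both occur twice, so both are encoded as 0.25.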

In [None]:
# First check whether any labels in the neighbourhood column have the same count.
# Such labels receive the same frequency code and can no longer be told apart.
shared_counts = dfm['neighbourhood'].value_counts().value_counts()
track_count = int((shared_counts > 1).sum())

if track_count > 0:
    print(f"{'-'*10}There are labels with the same count. Valuable information may be lost.{'-'*10}")

In [None]:
# Compute the frequency of each category
freq = dfm['neighbourhood'].value_counts(normalize=True)

# Map the frequencies to the dataframe
dfm['neighbourhood_encoded'] = dfm['neighbourhood'].map(freq)

# Drop the neighbourhood column
dfm.drop(columns=['neighbourhood'], inplace=True)

In [None]:
# Normalize the Age column to the [0, 1] range
scaler = MinMaxScaler()

# Assign the scaled array directly. Wrapping it in a new DataFrame would reset the index
# and misalign the values, because rows were dropped from dfm earlier.
dfm['age_normalized'] = scaler.fit_transform(dfm[['age']]).ravel()

dfm.drop(columns=['age'], inplace=True)

In [None]:
dfm.head()

## Feature Selection

Before modeling, we need to select the features that are likely to contribute the most information to the model, i.e., those that correlate with the target variable. Most of the features in the dataset are binary, including the target variable. For categorical input features with a categorical target, a well-known method for testing the association is the Chi-Square test. However, we have converted the neighbourhood and age features to non-binary numerical features. For these columns, we apply a different measure, the Pearson correlation coefficient.

## Chi Square
I will select the 3 features with the highest importance from the results of the Chi-Square test, i.e., the three features with the highest chi-square values and the lowest p-values.
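
Selecting the top-scoring features can also be done with scikit-learn's `SelectKBest`. The sketch below uses made-up binary features with `k=1` to keep the example small. The feature names and the data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)

# Made-up binary features: one related to the target, two unrelated
X = pd.DataFrame({
    "informative": np.where(rng.random(n) < 0.8, y, 1 - y),  # agrees with y about 80% of the time
    "noise_a": rng.integers(0, 2, size=n),
    "noise_b": rng.integers(0, 2, size=n),
})

# Keep the k features with the highest chi-square statistic
selector = SelectKBest(score_func=chi2, k=1).fit(X, y)
selected = X.columns[selector.get_support()].tolist()

print(pd.DataFrame({"chi2": selector.scores_, "p_value": selector.pvalues_}, index=X.columns))
print(selected)
```

With the real data, `k=3` would reproduce the selection rule described above.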

## Pearson Correlation Coefficient
As there are only 2 such input features, adding them to the 3 features selected by the Chi-Square test gives 5 features, which is not too many for the model. So after visualization, we can add both features as inputs for the models and evaluate the performance.
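
A minimal sketch of reading the Pearson correlation of each continuous feature with a binary target, using made-up data (the generated values are assumptions and say nothing about the real dataset). With a 0/1 target, the Pearson coefficient reflects how far apart the feature means of the two classes are.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
no_show = rng.integers(0, 2, size=n)

toy = pd.DataFrame({
    # Shifted upward for shows (no_show == 0), so it correlates negatively with no_show
    "age_normalized": rng.random(n) * 0.5 + 0.3 * (1 - no_show),
    # Generated independently of the target
    "neighbourhood_encoded": rng.random(n),
    "no_show": no_show,
})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr(method="pearson")["no_show"].drop("no_show")
print(target_corr)
```

A coefficient near zero, as for the independent column here, suggests the feature carries little linear information about the target.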

In [None]:
dfm.columns

In [None]:
# Implement Chi-Square for checking correlation between categorical variables and the target variable
from sklearn.feature_selection import chi2

# Keep only the binary features (chi2 requires non-negative inputs)
X = dfm.drop(columns=['no_show', 'patient_id', 'scheduled_day', 'appointment_day', 'neighbourhood_encoded', 'age_normalized'])
y = dfm['no_show']

In [None]:
chi_scores = chi2(X, y)

In [None]:
chi_values = pd.Series(chi_scores[0], index=X.columns)
chi_values.sort_values(ascending=False, inplace=True)


In [None]:
# The higher the p-value, the weaker the evidence that the feature and the target variable are dependent

p_values = pd.Series(chi_scores[1], index=X.columns)
p_values.sort_values(ascending=False, inplace=True)


In [None]:
# Print Chi and P-values for each feature
print("Chi and P-values for each feature:")
# Print the table
print(pd.concat([chi_values, p_values], axis=1, keys=['Chi', 'P-value']))

In [None]:
# The higher the chi-squared value, the more strongly the feature is related to the target variable
chi_values.plot(kind='bar', figsize=(12, 6), color='skyblue')
plt.title('Chi value importance for each feature')
plt.show()

In [None]:
# Plot the p-values using a bar chart
p_values.plot(kind='bar', figsize=(12, 6), color='lightcoral')
plt.title('P-value importance for each feature')
plt.show()

In [None]:
sns.set_theme(style="white")
non_binary_cols = ['neighbourhood_encoded', 'age_normalized', 'no_show']
corr = dfm[non_binary_cols].corr()  # Create the Pearson correlation matrix

fig, ax = plt.subplots() # create the figure

sns.heatmap(corr, annot=True, cmap='Greens', annot_kws={'rotation':45}) # Draw the heatmap

plt.title("Correlation Matrix")
plt.show()

## Model Design Thought process