## Data loading

### Subtask:
Load the dataset "noshowappointments.csv" into a pandas DataFrame.


In [1]:
import pandas as pd

try:
    df = pd.read_csv('noshowappointments.csv')
    display(df.head())
    print(df.shape)
except FileNotFoundError:
    print("Error: 'noshowappointments.csv' not found.")
    df = None
except pd.errors.ParserError:
    print("Error: Could not parse the CSV file.")
    df = None
except Exception as e:
    print(f"An unexpected error occurred: {e}")
    df = None

,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


(110527, 14)
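
The load step can also fix column types at read time. The sketch below is illustrative only: it uses a two-row, six-column stand-in built from the preview above rather than the real file, and it assumes the real columns parse the same way. `dtype` and `parse_dates` are standard `pd.read_csv` parameters; the names `sample_csv` and `sample` are made up for this example.

```python
import io
import pandas as pd

# Two-row stand-in for the real file (a subset of columns from the preview above).
sample_csv = io.StringIO(
    "PatientId,AppointmentID,ScheduledDay,AppointmentDay,Age,No-show\n"
    "29872500000000.0,5642903,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,No\n"
    "558997800000000.0,5642503,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,No\n"
)

sample = pd.read_csv(
    sample_csv,
    dtype={"PatientId": str},                         # keep identifiers as text, not float64
    parse_dates=["ScheduledDay", "AppointmentDay"],   # parse timestamps while loading
)
print(sample.dtypes)
```

Reading identifiers as text keeps them exactly as written in the file, and parsing the dates up front removes the need for a separate conversion step later.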


## Data exploration

### Subtask:
Explore the loaded dataset to understand its characteristics.


In [2]:
# Data Types and Missing Values
print("Data Types:\n", df.dtypes)
print("\nMissing Values:\n", df.isnull().sum())
missing_percentage = (df.isnull().sum() / len(df)) * 100
print("\nPercentage of Missing Values:\n", missing_percentage)

Data Types:
 PatientId         float64
AppointmentID       int64
Gender             object
ScheduledDay       object
AppointmentDay     object
Age                 int64
Neighbourhood      object
Scholarship         int64
Hipertension        int64
Diabetes            int64
Alcoholism          int64
Handcap             int64
SMS_received        int64
No-show            object
dtype: object

Missing Values:
 PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64

Percentage of Missing Values:
 PatientId         0.0
AppointmentID     0.0
Gender            0.0
ScheduledDay      0.0
AppointmentDay    0.0
Age               0.0
Neighbourhood     0.0
Scholarship       0.0
Hipertension      0.0
Diabetes          0.0
Alcoholism        0.0
Handcap           0.0
SM

In [3]:
# Descriptive Statistics for Numerical Features
numerical_features = df.select_dtypes(include=['number'])
print("\nDescriptive Statistics for Numerical Features:\n", numerical_features.describe())


Descriptive Statistics for Numerical Features:
           PatientId  AppointmentID            Age    Scholarship  \
count  1.105270e+05   1.105270e+05  110527.000000  110527.000000   
mean   1.474963e+14   5.675305e+06      37.088874       0.098266   
std    2.560949e+14   7.129575e+04      23.110205       0.297675   
min    3.921784e+04   5.030230e+06      -1.000000       0.000000   
25%    4.172614e+12   5.640286e+06      18.000000       0.000000   
50%    3.173184e+13   5.680573e+06      37.000000       0.000000   
75%    9.439172e+13   5.725524e+06      55.000000       0.000000   
max    9.999816e+14   5.790484e+06     115.000000       1.000000   

        Hipertension       Diabetes     Alcoholism        Handcap  \
count  110527.000000  110527.000000  110527.000000  110527.000000   
mean        0.197246       0.071865       0.030400       0.022248   
std         0.397921       0.258265       0.171686       0.161543   
min         0.000000       0.000000       0.000000       0.000

In [4]:
# Unique Value Counts for Categorical Features
categorical_features = df.select_dtypes(exclude=['number'])
for col in categorical_features.columns:
    print(f"\nUnique values and counts for {col}:\n{categorical_features[col].value_counts()}")


Unique values and counts for Gender:
Gender
F    71840
M    38687
Name: count, dtype: int64

Unique values and counts for ScheduledDay:
ScheduledDay
2016-05-06T07:09:54Z    24
2016-05-06T07:09:53Z    23
2016-04-25T17:18:27Z    22
2016-04-25T17:17:46Z    22
2016-04-25T17:17:23Z    19
                        ..
2016-05-09T10:17:48Z     1
2016-05-02T09:50:06Z     1
2016-05-13T14:28:22Z     1
2016-05-09T08:12:56Z     1
2016-05-06T09:28:08Z     1
Name: count, Length: 103549, dtype: int64

Unique values and counts for AppointmentDay:
AppointmentDay
2016-06-06T00:00:00Z    4692
2016-05-16T00:00:00Z    4613
2016-05-09T00:00:00Z    4520
2016-05-30T00:00:00Z    4514
2016-06-08T00:00:00Z    4479
2016-05-11T00:00:00Z    4474
2016-06-01T00:00:00Z    4464
2016-06-07T00:00:00Z    4416
2016-05-12T00:00:00Z    4394
2016-05-02T00:00:00Z    4376
2016-05-18T00:00:00Z    4373
2016-05-17T00:00:00Z    4372
2016-06-02T00:00:00Z    4310
2016-05-10T00:00:00Z    4308
2016-05-31T00:00:00Z    4279
2016-05-05T00:0

In [5]:
# Shape of the Data
print(f"\nShape of the data: {df.shape}")



Shape of the data: (110527, 14)


In [6]:
# Summary
print("\nSummary of Findings:")
print("1. Data Types: See above.")
print("2. Missing Values: See above. No missing values were found.")
print("3. Descriptive Statistics: See above.")
print("4. Unique Value Counts: See above.")
print("5. Shape of data:", df.shape)


Summary of Findings:
1. Data Types: See above.
2. Missing Values: See above. No missing values were found.
3. Descriptive Statistics: See above.
4. Unique Value Counts: See above.
5. Shape of data: (110527, 14)


## Data cleaning


In [7]:
# 1. Rename Columns (lowercase, replace dashes/spaces)
df.columns = [col.strip().lower().replace('-', '_').replace(' ', '_') for col in df.columns]

In [8]:
# 2. Convert ScheduledDay and AppointmentDay to datetime
df['scheduledday'] = pd.to_datetime(df['scheduledday'])
df['appointmentday'] = pd.to_datetime(df['appointmentday'])

In [9]:
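# 3. Remove invalid age values (the minimum age in the raw data is -1)
# .copy() makes the filtered frame independent, so later column assignments
# do not trigger pandas' SettingWithCopyWarning.
df = df[df['age'] >= 0].copy()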

In [10]:
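# 4. Create new column: waiting_days (whole days from scheduling to appointment)
# Note: appointmentday is stored at midnight while scheduledday keeps its time of day,
# so an appointment booked for the same day evaluates to -1 here and is removed by
# the negative-waiting-time filter in the next step.
df['waiting_days'] = (df['appointmentday'] - df['scheduledday']).dt.days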


In [11]:
# 5. Remove rows with negative waiting time
df = df[df['waiting_days'] >= 0]
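
The effect of timestamp granularity on the waiting-time calculation can be checked on two rows taken from the previews in this notebook. This is a minimal sketch, not the notebook's method: `toy`, `raw_days`, and `calendar_days` are names introduced here, and comparing calendar dates via `dt.normalize()` is one possible alternative.

```python
import pandas as pd

# Row 0: booked and attended on the same day. Row 1: booked two calendar days ahead.
toy = pd.DataFrame({
    "scheduledday": pd.to_datetime(["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z"]),
    "appointmentday": pd.to_datetime(["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"]),
})

# Subtracting full timestamps: the midnight appointment time makes row 0 negative.
raw_days = (toy["appointmentday"] - toy["scheduledday"]).dt.days

# Subtracting calendar dates only: both timestamps are truncated to midnight first.
calendar_days = (
    toy["appointmentday"].dt.normalize() - toy["scheduledday"].dt.normalize()
).dt.days

print(raw_days.tolist())       # [-1, 1]
print(calendar_days.tolist())  # [0, 2]
```

With the calendar-day version, a `>= 0` filter would keep same-day appointments and remove only rows where the appointment date precedes the scheduling date.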


In [12]:
# Drop unnecessary columns
df = df.drop(['patientid', 'appointmentid'], axis=1)

In [13]:
# 6. Reset index after cleaning
df.reset_index(drop=True, inplace=True)

In [14]:
# Confirm changes
print("✅ Data cleaning complete. Current shape:", df.shape)
df.head()

✅ Data cleaning complete. Current shape: (71959, 13)


,gender,scheduledday,appointmentday,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show,waiting_days
0,F,2016-04-27 08:36:51+00:00,2016-04-29 00:00:00+00:00,76,REPÚBLICA,0,1,0,0,0,0,No,1
1,F,2016-04-27 15:05:12+00:00,2016-04-29 00:00:00+00:00,23,GOIABEIRAS,0,0,0,0,0,0,Yes,1
2,F,2016-04-27 15:39:58+00:00,2016-04-29 00:00:00+00:00,39,GOIABEIRAS,0,0,0,0,0,0,Yes,1
3,F,2016-04-27 12:48:25+00:00,2016-04-29 00:00:00+00:00,19,CONQUISTA,0,0,0,0,0,0,No,1
4,F,2016-04-27 14:58:11+00:00,2016-04-29 00:00:00+00:00,30,NOVA PALESTINA,0,0,0,0,0,0,No,1
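
Cleaning reduces the row count from 110527 to 71959, so it helps to see how many rows each filter removes. The sketch below shows one way to audit that with boolean masks; the four-row frame is invented for illustration, and the mask names are assumptions rather than part of the notebook.

```python
import pandas as pd

# Invented rows: one invalid age, two negative waiting times.
toy = pd.DataFrame({"age": [62, -1, 8, 30], "waiting_days": [-1, 3, 0, -1]})

valid_age = toy["age"] >= 0
valid_wait = toy["waiting_days"] >= 0

print("rows total:            ", len(toy))
print("dropped by age filter: ", int((~valid_age).sum()))
# Count only rows that passed the age filter, mirroring the order of the cleaning steps.
print("dropped by wait filter:", int((valid_age & ~valid_wait).sum()))

cleaned = toy[valid_age & valid_wait].reset_index(drop=True)
print("rows kept:             ", len(cleaned))
```

Running the same masks on the real frame before filtering would show which step accounts for most of the removed rows.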


## Data analysis

### Subtask:
Analyze the data to identify potential relationships between features and potential predictive factors for appointment attendance.


In [15]:
# Convert 'no_show' to numeric (1 for 'Yes', 0 for 'No')
df['no_show'] = (df['no_show'] == 'Yes').astype(int)

# 1. Analyze correlation between numerical features and 'no_show'
numerical_cols = ['age', 'scholarship', 'hipertension', 'diabetes', 'alcoholism', 'handcap', 'sms_received']
correlations = df[numerical_cols + ['no_show']].corr()['no_show'].drop('no_show')
print("Correlations with no_show:\n", correlations)


Correlations with no_show:
 age            -0.101042
scholarship     0.045687
hipertension   -0.056859
diabetes       -0.022412
alcoholism      0.019864
handcap        -0.007184
sms_received   -0.020631
Name: no_show, dtype: float64
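
Most of the features above are 0/1 flags, and a Pearson correlation between two binary columns is often harder to read than the no-show rate in each group. The sketch below computes both on six invented rows; `toy`, `rates`, and `r` are illustrative names, and the numbers say nothing about the real dataset.

```python
import pandas as pd

# Invented data: 1 of 3 no-shows without an SMS, 2 of 3 with one.
toy = pd.DataFrame({
    "sms_received": [0, 0, 0, 1, 1, 1],
    "no_show":      [0, 0, 1, 0, 1, 1],
})

# No-show rate within each group of the binary flag.
rates = toy.groupby("sms_received")["no_show"].mean()

# Pearson correlation between the same two binary columns.
r = toy["sms_received"].corr(toy["no_show"])

print(rates)
print("correlation:", r)
```

Reporting the group rates next to the coefficient makes the size of an effect easier to judge than the coefficient alone.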


In [16]:
# 2. Investigate relationship between categorical features and 'no_show'
categorical_cols = ['gender', 'neighbourhood']
no_show_rates = {}
for col in categorical_cols:
    # Mean of the 0/1 target within each category is that category's no-show rate
    no_show_rates[col] = df.groupby(col)['no_show'].mean()
print("\nNo-show rates by category:\n", no_show_rates)



No-show rates by category:
 {'gender': gender
F    0.284460
M    0.286659
Name: no_show, dtype: float64, 'neighbourhood': neighbourhood
AEROPORTO              0.200000
ANDORINHAS             0.322178
ANTÔNIO HONÓRIO        0.238889
ARIOVALDO FAVALESSA    0.325714
BARRO VERMELHO         0.277193
                         ...   
SÃO JOSÉ               0.271076
SÃO PEDRO              0.284722
TABUAZEIRO             0.273389
UNIVERSITÁRIO          0.276786
VILA RUBIM             0.227425
Name: no_show, Length: 80, dtype: float64}
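
With 80 neighbourhoods, some rates may rest on very few appointments, so it can help to report the group size next to each rate. The sketch below uses invented data and a hypothetical `MIN_ROWS` threshold; neither comes from the notebook.

```python
import pandas as pd

# Invented data: neighbourhood B has only two rows.
toy = pd.DataFrame({
    "neighbourhood": ["A"] * 5 + ["B"] * 2 + ["C"] * 4,
    "no_show": [1, 0, 0, 0, 0,  1, 1,  0, 1, 0, 1],
})

# Rate and group size side by side, highest rate first.
summary = (
    toy.groupby("neighbourhood")["no_show"]
       .agg(rate="mean", n="count")
       .sort_values("rate", ascending=False)
)

MIN_ROWS = 4  # hypothetical cut-off for a rate to be considered stable
reliable = summary[summary["n"] >= MIN_ROWS]

print(summary)
print(reliable)
```

In this toy example B has the highest rate but is filtered out because it is based on two rows only.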


In [17]:
# 3. Examine the gap between 'scheduledday' and 'appointmentday'
# Both columns were already converted to datetime during cleaning, so no
# further conversion is needed; 'timedifference' repeats the 'waiting_days' calculation.
df['timedifference'] = (df['appointmentday'] - df['scheduledday']).dt.days
print("\nTime Difference distribution:\n", df['timedifference'].describe())


Time Difference distribution:
 count    71959.000000
mean        14.642018
std         16.494334
min          0.000000
25%          3.000000
50%          8.000000
75%         21.000000
max        178.000000
Name: timedifference, dtype: float64
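
The distribution alone does not show how waiting time relates to attendance. One possible follow-up, sketched here on invented rows, is to bucket the waiting time and compute the no-show rate per bucket. The bin edges loosely follow the quartiles printed above (3, 8, 21) and are an assumption, as are the names `toy`, `bins`, and `rate_by_bucket`.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "waiting_days": [0, 2, 5, 10, 20, 40, 90],
    "no_show":      [0, 0, 1, 0, 1, 1, 1],
})

# Left-closed buckets: [0, 3), [3, 8), [8, 21), [21, 180)
bins = [0, 3, 8, 21, 180]
toy["wait_bucket"] = pd.cut(toy["waiting_days"], bins=bins, right=False)

# observed=True reports only buckets that actually contain rows.
rate_by_bucket = toy.groupby("wait_bucket", observed=True)["no_show"].mean()
print(rate_by_bucket)
```

Using `right=False` keeps a waiting time of 0 inside the first bucket instead of dropping it.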


In [18]:
# 4. Investigate potential interactions between features (example: scholarship and age group)
# Note: the bins are right-closed by default, so age 0 falls outside (0, 18] and gets no age group.
df['agegroup'] = pd.cut(df['age'], bins=[0, 18, 65, 120], labels=['Child', 'Adult', 'Senior'])
# observed=False keeps the current grouping behaviour and avoids the pandas FutureWarning
scholarship_age_interaction = df.groupby(['scholarship', 'agegroup'], observed=False)['no_show'].mean().unstack()
print("\nScholarship and AgeGroup interaction:\n", scholarship_age_interaction)

display(df.head())


Scholarship and AgeGroup interaction:
 agegroup        Child     Adult    Senior
scholarship                              
0            0.328408  0.278642  0.208509
1            0.357958  0.350666  0.234899



,gender,scheduledday,appointmentday,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show,waiting_days,timedifference,agegroup
0,F,2016-04-27 08:36:51+00:00,2016-04-29 00:00:00+00:00,76,REPÚBLICA,0,1,0,0,0,0,0,1,1,Senior
1,F,2016-04-27 15:05:12+00:00,2016-04-29 00:00:00+00:00,23,GOIABEIRAS,0,0,0,0,0,0,1,1,1,Adult
2,F,2016-04-27 15:39:58+00:00,2016-04-29 00:00:00+00:00,39,GOIABEIRAS,0,0,0,0,0,0,1,1,1,Adult
3,F,2016-04-27 12:48:25+00:00,2016-04-29 00:00:00+00:00,19,CONQUISTA,0,0,0,0,0,0,0,1,1,Adult
4,F,2016-04-27 14:58:11+00:00,2016-04-29 00:00:00+00:00,30,NOVA PALESTINA,0,0,0,0,0,0,0,1,1,Adult
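
The age bins used above can be checked on a handful of boundary values. This sketch compares the default `pd.cut` behaviour with `include_lowest=True`, which is one way to keep age 0 in the first group; the `ages` series is invented for illustration.

```python
import pandas as pd

# Boundary ages around each bin edge.
ages = pd.Series([0, 1, 18, 19, 65, 66, 115])
labels = ["Child", "Adult", "Senior"]

# Default: intervals are (0, 18], (18, 65], (65, 120], so age 0 is left unassigned.
default_cut = pd.cut(ages, bins=[0, 18, 65, 120], labels=labels)

# include_lowest=True closes the first interval on the left, so age 0 becomes 'Child'.
inclusive_cut = pd.cut(ages, bins=[0, 18, 65, 120], labels=labels, include_lowest=True)

print(int(default_cut.isna().sum()))  # 1
print(inclusive_cut.tolist())
```

Rows without an age group are silently left out of any `groupby` on that column, so the choice affects the 'Child' rate in the interaction table.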
