Name: Neha Gurudatta Devarkonda

email_id: devarkondaneha02@gmail.com

Task1 - Completed Data Cleaning and Preprocessing.

Using Python-pandas

Dataset:-  Medical Appointment No Shows Dataset




In [11]:
import pandas as pd
import numpy as np

df = pd.read_csv("/content/noshowappointments.csv")

print("Initial Rows:")
print(df.head())

print("\nDataset Info:")
print(df.info())

# Move column renaming to an earlier stage
df.columns = df.columns.str.lower().str.replace(" ", "_")
print("\nNew Column Names (after initial standardization):")
print(df.columns)

print("\nMissing Values:")
print(df.isnull().sum())

# Example handling:
# Filling missing age with median
if "age" in df.columns:
    df["age"].fillna(df["age"].median(), inplace=True)

# Fill missing gender with mode
if "gender" in df.columns:
    df["gender"].fillna(df["gender"].mode()[0], inplace=True)

# Drop rows where any appointment date is missing
date_columns = ["scheduledday", "appointmentday"]
for col in date_columns:
    if col in df.columns:
        df.dropna(subset=[col], inplace=True)

initial_count = df.shape[0]
df.drop_duplicates(inplace=True)
final_count = df.shape[0]

print(f"\nRemoved {initial_count - final_count} duplicate rows.")

if "gender" in df.columns:
    df["gender"] = df["gender"].str.lower().str.strip()
    df["gender"] = df["gender"].replace({
        "m": "male",
        "f": "female"
    })

if "no-show" in df.columns:
    df["no-show"] = df["no-show"].str.strip().str.lower()
    df["no-show"] = df["no-show"].replace({
        "no": "no",
        "yes": "yes",
        "n": "no",
        "y": "yes"
    })

# Fix negative ages
if "age" in df.columns:
    df = df[df["age"] >= 0]

# Optional: remove extreme values
if "age" in df.columns:
    df = df[df["age"] <= 120]

for col in ["scheduledday", "appointmentday"]:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors="coerce")

df.dropna(subset=["scheduledday", "appointmentday"], inplace=True)

df["scheduledday"] = df["scheduledday"].dt.strftime("%d-%m-%Y")
df["appointmentday"] = df["appointmentday"].dt.strftime("%d-%m-%Y")

# This line was moved up
# df.columns = df.columns.str.lower().str.replace(" ", "_")

print("\nNew Column Names (final):")
print(df.columns)

if "age" in df.columns:
    df["age"] = df["age"].astype(int)

if "no-show" in df.columns:
    df["no-show"] = df["no-show"].astype("category")

print("\nFinal Dataset Info:")
print(df.info())

print("\nFinal Summary Statistics:")
print(df.describe(include="all"))

df.to_csv('/content/sample_data/noshowappointments.csv', index=False)
print("\nCleaned dataset exported successfully!")

Initial Rows:
      PatientId  AppointmentID Gender          ScheduledDay  \
0  2.987250e+13        5642903      F  2016-04-29T18:38:08Z   
1  5.589978e+14        5642503      M  2016-04-29T16:08:27Z   
2  4.262962e+12        5642549      F  2016-04-29T16:19:04Z   
3  8.679512e+11        5642828      F  2016-04-29T17:29:31Z   
4  8.841186e+12        5642494      F  2016-04-29T16:07:23Z   

         AppointmentDay  Age      Neighbourhood  Scholarship  Hipertension  \
0  2016-04-29T00:00:00Z   62    JARDIM DA PENHA            0             1   
1  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             0   
2  2016-04-29T00:00:00Z   62      MATA DA PRAIA            0             0   
3  2016-04-29T00:00:00Z    8  PONTAL DE CAMBURI            0             0   
4  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             1   

   Diabetes  Alcoholism  Handcap  SMS_received No-show  
0         0           0        0             0      No  
1         0           0 

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df["age"].fillna(df["age"].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df["gender"].fillna(df["gender"].mode()[0], inplace=True)



Removed 0 duplicate rows.

New Column Names (final):
Index(['patientid', 'appointmentid', 'gender', 'scheduledday',
       'appointmentday', 'age', 'neighbourhood', 'scholarship', 'hipertension',
       'diabetes', 'alcoholism', 'handcap', 'sms_received', 'no-show'],
      dtype='object')

Final Dataset Info:
<class 'pandas.core.frame.DataFrame'>
Index: 110526 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype   
---  ------          --------------   -----   
 0   patientid       110526 non-null  float64 
 1   appointmentid   110526 non-null  int64   
 2   gender          110526 non-null  object  
 3   scheduledday    110526 non-null  object  
 4   appointmentday  110526 non-null  object  
 5   age             110526 non-null  int64   
 6   neighbourhood   110526 non-null  object  
 7   scholarship     110526 non-null  int64   
 8   hipertension    110526 non-null  int64   
 9   diabetes        110526 non-null  int64   
 10  alcoholism   