
In [14]:
# 📘 Task 1 - Data Cleaning and Preprocessing: Medical Appointment No Shows Dataset
"""
This notebook performs data cleaning and preprocessing on the 'Medical Appointment No Shows' dataset.
Key Steps Covered:
1. Loaded the dataset and examined its structure and data types.
2. Checked for duplicate records (none were found) and dropped any that exist to ensure data uniqueness.
3. Converted 'ScheduledDay' and 'AppointmentDay' columns to datetime format.
4. Handled invalid entries (e.g., removed negative ages).
5. Standardized categorical columns like 'Gender' and 'No-show' for consistency.
6. Renamed all column headers to lowercase, replacing hyphens with underscores (e.g., 'No-show' becomes 'no_show').
7. Verified data cleanliness:
   - Confirmed no missing or duplicate values remain.
   - Validated correct data types for each column.
   - Checked unique values in key fields for consistency.
Final Output:
A cleaned version of the dataset was saved as `cleaned_medical_appointments.csv`, ready for further analysis.
This task demonstrates core data preprocessing techniques using Python and Pandas.
"""

In [2]:
import pandas as pd
df = pd.read_csv("KaggleV2-May-2016.csv")
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [3]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0
mean,147496300000000.0,5675305.0,37.088874,0.098266,0.197246,0.071865,0.0304,0.022248,0.321026
std,256094900000000.0,71295.75,23.110205,0.297675,0.397921,0.258265,0.171686,0.161543,0.466873
min,39217.84,5030230.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172614000000.0,5640286.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31731840000000.0,5680573.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,94391720000000.0,5725524.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,999981600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0


In [4]:
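# Missing values per column. Only a cell's last expression is echoed,
# so this result is not shown; wrap it in print() to display it.
df.isnull().sum()
# Number of fully duplicated rows (this is the value displayed below).
df.duplicated().sum()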

np.int64(0)
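Because a notebook cell echoes only its last expression, the missing-value summary above is computed but never shown. A minimal sketch of displaying both checks together, using a tiny made-up frame (`sample` and its values are illustrative assumptions, not the appointments data):

```python
import pandas as pd

# Tiny stand-in frame with made-up values, not the real appointments data.
sample = pd.DataFrame(
    {
        "Age": [62, 56, 56, None],
        "Gender": ["F", "M", "M", "F"],
    }
)

# Explicit print calls show every result, not just the last expression.
missing_per_column = sample.isnull().sum()
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

The same two `print` calls could be applied to `df` in the cell above if both summaries are wanted in the output.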

In [5]:
df.drop_duplicates(inplace=True)

In [6]:
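# Lowercase the column names and replace hyphens with underscores
# ('No-show' -> 'no_show'); names such as 'ScheduledDay' become 'scheduledday'.
df.columns = df.columns.str.lower().str.replace('-', '_')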
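The rename above only lowercases and swaps hyphens, so multi-word names such as `ScheduledDay` end up as `scheduledday`. If word-separated names were preferred, one possible approach is sketched below; `to_snake_case` is a hypothetical helper, not something the notebook or pandas provides.

```python
import re

def to_snake_case(name: str) -> str:
    """Hypothetical helper: 'ScheduledDay' -> 'scheduled_day'."""
    # Insert an underscore wherever a lowercase letter is followed by an uppercase one.
    with_breaks = re.sub(r"(?<=[a-z])(?=[A-Z])", "_", name)
    # Then apply the same hyphen replacement and lowercasing as the notebook.
    return with_breaks.replace("-", "_").lower()

original = ["PatientId", "AppointmentID", "ScheduledDay", "SMS_received", "No-show"]
renamed = [to_snake_case(column) for column in original]
print(renamed)
```

Adopting this would change the column names used by the later cells (for example `scheduledday` would become `scheduled_day`), so those references would need updating as well.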

In [7]:
df['scheduledday'] = pd.to_datetime(df['scheduledday'])
df['appointmentday'] = pd.to_datetime(df['appointmentday'])
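`pd.to_datetime` raises an error by default if a value cannot be parsed. A minimal sketch of a more defensive conversion, assuming it is acceptable to turn bad values into `NaT` and count them afterwards (the `raw` series below is made up for illustration):

```python
import pandas as pd

# Made-up input: two ISO 8601 timestamps in the dataset's style and one bad value.
raw = pd.Series(["2016-04-29T18:38:08Z", "2016-04-29T00:00:00Z", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising;
# utc=True yields a timezone-aware UTC result.
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

print(parsed.dtype)
print("unparseable values:", int(parsed.isna().sum()))
```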

In [8]:
df['gender'] = df['gender'].str.upper().str.strip()
df['no_show'] = df['no_show'].str.upper().str.strip()
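After standardizing case and whitespace, it can help to confirm that only the expected labels remain. A small sketch on made-up labels follows; the `expected` set is an assumption about which values are valid.

```python
import pandas as pd

# Made-up labels with inconsistent case and padding.
labels = pd.Series([" no", "Yes ", "NO", "yes", "No"])

# Strip whitespace and upper-case so equivalent labels collapse to one spelling.
cleaned = labels.str.strip().str.upper()

expected = {"YES", "NO"}  # assumed set of valid labels
unexpected = set(cleaned.unique()) - expected

print(cleaned.value_counts())
print("unexpected labels:", unexpected)
```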

In [9]:
df = df[df['age'] >= 0]
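Before filtering, it may be worth looking at the rows that are about to be dropped and recording how many there are. A minimal sketch with made-up ages (the `ages` frame is illustrative only):

```python
import pandas as pd

# Made-up ages, including one invalid negative value.
ages = pd.DataFrame({"age": [62, -1, 8, 0, 115]})

# Rows that the filter would drop, kept aside for inspection.
invalid = ages[ages["age"] < 0]
# .copy() gives an independent frame, which avoids SettingWithCopyWarning on later edits.
valid = ages[ages["age"] >= 0].copy()

print("rows dropped:", len(invalid))
print("rows kept:", len(valid))
```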

In [10]:
df.to_csv("cleaned_medical_appointments.csv", index=False)
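CSV files do not store dtypes, so the datetime columns come back as plain text unless the reader is asked to parse them. The sketch below illustrates this with an in-memory buffer standing in for the file; the one-row `frame` is made up.

```python
import io

import pandas as pd

# One made-up row with a timezone-aware timestamp column.
frame = pd.DataFrame(
    {"scheduledday": pd.to_datetime(["2016-04-29T18:38:08Z"], utc=True), "age": [62]}
)

buffer = io.StringIO()  # in-memory stand-in for a file on disk
frame.to_csv(buffer, index=False)

# Read once without date parsing and once with it.
buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["scheduledday"])

print(plain["scheduledday"].dtype)
print(restored["scheduledday"].dtype)
```

When reloading `cleaned_medical_appointments.csv`, passing `parse_dates=["scheduledday", "appointmentday"]` to `pd.read_csv` would likely be the equivalent step.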

In [12]:
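# Label counts for 'no_show'. Not displayed, because only a cell's last
# expression is echoed; wrap it in print() to see it.
df['no_show'].value_counts()
# Label counts for 'gender' (this is the table displayed below).
df['gender'].value_counts()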

gender,count
F,71839
M,38687


In [11]:
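# The first two results are computed but not displayed, because only a cell's
# last expression is echoed; wrap them in print() to see them.
df.isnull().sum()
df.duplicated().sum()
# info() prints its summary directly; head() is the cell's last expression.
df.info()
df.head()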

<class 'pandas.core.frame.DataFrame'>
Index: 110526 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype              
---  ------          --------------   -----              
 0   patientid       110526 non-null  float64            
 1   appointmentid   110526 non-null  int64              
 2   gender          110526 non-null  object             
 3   scheduledday    110526 non-null  datetime64[ns, UTC]
 4   appointmentday  110526 non-null  datetime64[ns, UTC]
 5   age             110526 non-null  int64              
 6   neighbourhood   110526 non-null  object             
 7   scholarship     110526 non-null  int64              
 8   hipertension    110526 non-null  int64              
 9   diabetes        110526 non-null  int64              
 10  alcoholism      110526 non-null  int64              
 11  handcap         110526 non-null  int64              
 12  sms_received    110526 non-null  int64              
 13  no_show         110

Unnamed: 0,patientid,appointmentid,gender,scheduledday,appointmentday,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,29872500000000.0,5642903,F,2016-04-29 18:38:08+00:00,2016-04-29 00:00:00+00:00,62,JARDIM DA PENHA,0,1,0,0,0,0,NO
1,558997800000000.0,5642503,M,2016-04-29 16:08:27+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,0,0,0,0,0,NO
2,4262962000000.0,5642549,F,2016-04-29 16:19:04+00:00,2016-04-29 00:00:00+00:00,62,MATA DA PRAIA,0,0,0,0,0,0,NO
3,867951200000.0,5642828,F,2016-04-29 17:29:31+00:00,2016-04-29 00:00:00+00:00,8,PONTAL DE CAMBURI,0,0,0,0,0,0,NO
4,8841186000000.0,5642494,F,2016-04-29 16:07:23+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,1,1,0,0,0,NO
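The verification steps can also be written as assertions, so that a failed check stops the notebook instead of relying on someone reading the output. This is a sketch only: `check_clean` is a hypothetical helper, the `sample` frame is made up, and the valid `no_show` labels are assumed to be `YES` and `NO`.

```python
import pandas as pd

def check_clean(frame: pd.DataFrame) -> None:
    """Hypothetical helper: raise AssertionError if a basic cleanliness check fails."""
    assert frame.isnull().sum().sum() == 0, "missing values remain"
    assert frame.duplicated().sum() == 0, "duplicate rows remain"
    assert (frame["age"] >= 0).all(), "negative ages remain"
    assert set(frame["no_show"].unique()) <= {"YES", "NO"}, "unexpected no_show labels"

# Made-up rows in the shape of the cleaned data (subset of columns).
sample = pd.DataFrame(
    {"age": [62, 56, 8], "gender": ["F", "M", "F"], "no_show": ["NO", "NO", "YES"]}
)

check_clean(sample)
print("all checks passed")
```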
