8.	Perform the following operations in Python on the given datasets
[patients.csv: patient demographic info; visits.csv: doctor visits and diagnosis codes]
a.	Fill or drop missing diagnosis codes and ages.
b.	Standardize gender values (e.g., “M”, “Male”, “F” → “Male”, “Female”).
c.	Merge patient info with visits.
d.	Group data to get total visits and unique diagnoses per patient.
e.	Correct out-of-range values (e.g., age > 120).


In [20]:
import pandas as pd

In [22]:
patients_df = pd.read_csv("patients.csv")
visits_df = pd.read_csv("visits.csv")

In [24]:
visits_df

Unnamed: 0,VisitID,PatientID,VisitDate,Diagnosis
0,201,1001,2023-02-01,Flu
1,202,1002,2023-02-10,
2,203,1003,2023-02-12,Diabetes
3,204,1006,2023-03-01,Cold
4,205,1005,2023-03-05,Allergy


In [26]:
patients_df

Unnamed: 0,PatientID,Name,Age,Gender
0,1001,Tom,30.0,M
1,1002,Jerry,122.0,F
2,1003,Anna,28.0,Male
3,1004,Sam,,female
4,1005,Kate,25.0,F


# a. Fill or drop missing diagnosis codes and ages

In [29]:
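# Drop rows with a missing Age / Diagnosis. .copy() makes the filtered
# frames independent of the originals before they are modified later.
patients_df = patients_df.dropna(subset=['Age']).copy()
visits_df = visits_df.dropna(subset=['Diagnosis']).copy()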
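
Dropping rows is one of the two options the task allows. As a rough sketch of the other option (filling), the example below uses small stand-in frames that mirror the tables shown above. The names `patients_demo`, `visits_demo`, `patients_filled` and `visits_filled`, the use of the median, and the placeholder label 'Unknown' are all assumptions made for illustration.

```python
import pandas as pd

# Stand-in data mirroring the tables above (illustrative only).
patients_demo = pd.DataFrame({
    "PatientID": [1001, 1002, 1003, 1004, 1005],
    "Age": [30.0, 122.0, 28.0, None, 25.0],
})
visits_demo = pd.DataFrame({
    "VisitID": [201, 202, 203, 204, 205],
    "Diagnosis": ["Flu", None, "Diabetes", "Cold", "Allergy"],
})

# Fill missing ages with the median, which is less sensitive to outliers than the mean.
median_age = patients_demo["Age"].median()
patients_filled = patients_demo.assign(Age=patients_demo["Age"].fillna(median_age))

# Fill missing diagnoses with an explicit placeholder label (assumed here).
visits_filled = visits_demo.assign(
    Diagnosis=visits_demo["Diagnosis"].fillna("Unknown")
)

print(patients_filled)
print(visits_filled)
```

Filling keeps all five patients and all five visits, at the cost of introducing imputed values that downstream analysis should be aware of.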

# b. Standardize gender values

In [32]:
def standardize_gender(g):
    g = str(g).strip().lower()
    if g in ['m', 'male']:
        return 'Male'
    elif g in ['f', 'female']:
        return 'Female'
    else:
        return 'Other'
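
The same rule can also be written without a Python-level function, using pandas string methods and a lookup dictionary. This is only a sketch: `gender_demo`, `gender_map` and `gender_clean` are hypothetical names, and the sample values are made up to cover the spellings seen in patients.csv plus a few edge cases.

```python
import pandas as pd

# Hypothetical sample: the spellings from patients.csv plus edge cases.
gender_demo = pd.Series(["M", "F", "Male", "female", " m ", None, "x"])

# Lookup from normalized (stripped, lower-cased) value to canonical label.
gender_map = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

# Normalize the text, map known values, and label everything else 'Other'.
gender_clean = (
    gender_demo.str.strip().str.lower().map(gender_map).fillna("Other")
)

print(gender_clean.tolist())
```

Missing and unrecognized values both end up as 'Other', which matches what `standardize_gender` returns for them.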

In [37]:
patients_df['Gender'] = patients_df['Gender'].apply(standardize_gender)


# c. Merge patient info with visits

In [45]:
merged_df = pd.merge(visits_df, patients_df, on='PatientID', how='inner')
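
An inner join silently discards visits whose PatientID has no patient record (visit 204 for PatientID 1006 in this data). One way to see what gets dropped is a left join with `indicator=True`. The sketch below uses stand-in frames; `visits_demo`, `patients_demo`, `checked` and `unmatched_visits` are illustrative names.

```python
import pandas as pd

# Stand-ins for the two frames after step a (illustrative only).
visits_demo = pd.DataFrame({
    "VisitID": [201, 203, 204, 205],
    "PatientID": [1001, 1003, 1006, 1005],
})
patients_demo = pd.DataFrame({
    "PatientID": [1001, 1002, 1003, 1005],
    "Name": ["Tom", "Jerry", "Anna", "Kate"],
})

# A left join keeps every visit; indicator=True adds a '_merge' column
# saying whether a matching patient row was found.
checked = pd.merge(visits_demo, patients_demo, on="PatientID",
                   how="left", indicator=True)

# Visits whose PatientID has no patient record.
unmatched_visits = checked[checked["_merge"] == "left_only"]
print(unmatched_visits[["VisitID", "PatientID"]])
```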

# d. Group data: total visits and unique diagnoses per patient

In [57]:
visit_counts = merged_df.groupby('PatientID').agg(
    total_visits=pd.NamedAgg(column='VisitID', aggfunc='count'),
    unique_diagnoses=pd.NamedAgg(column='Diagnosis', aggfunc='nunique')
).reset_index()

In [59]:
print(visit_counts.head())

   PatientID  total_visits  unique_diagnoses
0       1001             1                 1
1       1003             1                 1
2       1005             1                 1
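
Patients with no surviving visit (PatientID 1002 here) do not appear in the grouped result at all. If a row with zero counts is wanted for them, one possible approach is to reindex the result over the full list of patient IDs. The sketch below assumes stand-in data; `merged_demo`, `all_patient_ids` and `summary` are hypothetical names.

```python
import pandas as pd

# Stand-in for merged_df after the inner join (illustrative only).
merged_demo = pd.DataFrame({
    "PatientID": [1001, 1003, 1005],
    "VisitID": [201, 203, 205],
    "Diagnosis": ["Flu", "Diabetes", "Allergy"],
})
# IDs remaining in the patient table after step a.
all_patient_ids = [1001, 1002, 1003, 1005]

summary = (
    merged_demo.groupby("PatientID")
    .agg(total_visits=("VisitID", "count"),
         unique_diagnoses=("Diagnosis", "nunique"))
    # Add a row for every known patient; missing ones get 0 for both counts.
    .reindex(all_patient_ids, fill_value=0)
    .rename_axis("PatientID")
    .reset_index()
)
print(summary)
```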


# e. Correct out-of-range values (e.g., age > 120)

In [73]:
patients_df.loc[patients_df['Age'] > 120, 'Age'] = 120  # cap age

In [75]:
patients_df

Unnamed: 0,PatientID,Name,Age,Gender
0,1001,Tom,30.0,Male
1,1002,Jerry,120.0,Female
2,1003,Anna,28.0,Male
4,1005,Kate,25.0,Female
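
Capping at 120 keeps the row but stores an age the patient probably does not have. An alternative, sketched below under stated assumptions, treats any value outside a plausible range as missing. `ages_demo`, `MIN_AGE` and `MAX_AGE` are illustrative names, and the -4.0 entry is a made-up value added only to show the lower bound.

```python
import pandas as pd

ages_demo = pd.DataFrame({
    "PatientID": [1001, 1002, 1003, 1005],
    "Age": [30.0, 122.0, 28.0, -4.0],  # -4.0 is invented for this example
})

MIN_AGE, MAX_AGE = 0, 120  # assumed plausible range

# between() is inclusive on both ends; where() keeps in-range values
# and replaces everything else with NaN.
in_range = ages_demo["Age"].between(MIN_AGE, MAX_AGE)
ages_demo["Age"] = ages_demo["Age"].where(in_range)

print(ages_demo)
```

The resulting NaN values can then be dropped or filled in the same way as the other missing ages.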
