What is a Random Forest?
- Combines multiple decision trees into a single predictive ML model (an ensemble)
- Can be used for classification (predicting categories) or regression (predicting continuous values)
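
A minimal sketch of the "combines multiple decision trees" idea. It assumes synthetic data from make_classification rather than the patient data, and the names X_demo, y_demo and forest are made up for illustration. In scikit-learn, the forest's predicted class is the one with the highest average probability across its trees, which the code reproduces by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data (an assumption, used only for this demo)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is stored in forest.estimators_
tree_probs = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average the per-tree class probabilities, then pick the most likely class
avg_probs = tree_probs.mean(axis=0)
manual_pred = forest.classes_[avg_probs.argmax(axis=1)]

print("Trees in the forest:", len(forest.estimators_))
print("Matches forest.predict:", bool((manual_pred == forest.predict(X_demo)).all()))
```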

Hyperparameters to know: 
- n_estimators: Number of trees in the forest (more trees = more stable predictions but slower training).
- max_depth: Maximum depth of each decision tree (a lower limit gives shallower trees, which helps reduce overfitting).
- min_samples_split: Minimum number of samples required to split a node (higher = simpler trees).
- max_features: Number of features considered at each split (controls how different the trees are from one another).
- random_state: Seeds the random number generator so results are reproducible.
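
A minimal sketch of how the hyperparameters above are passed to RandomForestClassifier. The values (50 trees, depth 5, and so on) and the synthetic data are assumptions chosen for the demo, not tuned settings for the patient data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption, used only for this demo)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

demo_model = RandomForestClassifier(
    n_estimators=50,        # number of trees
    max_depth=5,            # no tree may grow deeper than 5 levels
    min_samples_split=10,   # a node needs at least 10 samples to be split
    max_features="sqrt",    # consider sqrt(n_features) features at each split
    random_state=42,        # same seed gives the same forest
)
demo_model.fit(X_demo, y_demo)

# Check that the limits were respected
n_trees = len(demo_model.estimators_)
deepest = max(tree.get_depth() for tree in demo_model.estimators_)
print("Number of trees:", n_trees)
print("Deepest tree:", deepest)
```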


In [23]:
import pandas as pd
df_patients = pd.read_csv("patient_data.csv")
print(df_patients.head())
df_patients.dropna(inplace=True)  # drop rows with missing values; df_patients.fillna(...) would fill them instead

                 Name  Age  Gender Blood Type Medical Condition  \
0     Heather Bennett   66  Female         B-              COPD   
1        Mike Brennan   62  Female        AB+    Kidney Disease   
2      Rhonda Gilbert   49  Female         B+            Anemia   
3      Dylan Campbell   30    Male        AB+     Liver Disease   
4  Stephanie Gonzalez   40    Male         A+              COPD   

  Date of Admission  
0        2024-10-15  
1        2024-08-01  
2        2023-10-14  
3        2023-08-23  
4        2024-07-12  
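
A small sketch of the two ways to handle missing values mentioned in the cell above: dropping rows versus filling them. The tiny df_demo frame and the fill values (median age, "Unknown" gender) are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with one incomplete row
df_demo = pd.DataFrame({
    "Age": [66, np.nan, 49],
    "Gender": ["Female", None, "Male"],
})

# Option 1: drop any row that has a missing value
dropped = df_demo.dropna()

# Option 2: keep every row and fill the gaps per column
filled = df_demo.fillna({"Age": df_demo["Age"].median(), "Gender": "Unknown"})

print(dropped)
print(filled)
```

Dropping is simplest but loses rows; filling keeps them at the cost of inserting values that were never observed.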


In [24]:
df_patients

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission
0,Heather Bennett,66,Female,B-,COPD,2024-10-15
1,Mike Brennan,62,Female,AB+,Kidney Disease,2024-08-01
2,Rhonda Gilbert,49,Female,B+,Anemia,2023-10-14
3,Dylan Campbell,30,Male,AB+,Liver Disease,2023-08-23
4,Stephanie Gonzalez,40,Male,A+,COPD,2024-07-12
...,...,...,...,...,...,...
295,Michael Scott,55,Female,AB+,COPD,2023-12-20
296,Matthew Velasquez,83,Male,O-,Asthma,2024-06-11
297,Alexa Ramirez,84,Female,O-,Diabetes,2024-06-22
298,Michael Whitehead,55,Male,A+,Kidney Disease,2023-07-16


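Now do feature selection (choose the most useful columns for making predictions) and encoding (convert text categories such as Gender and Blood Type to numbers). Scaling (adjusting features so they are on the same scale) is not needed for tree-based models like random forests, since their splits depend only on the ordering of values within a feature.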
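
A minimal sketch of what scaling looks like, assuming scikit-learn's StandardScaler and a made-up array of ages and day counts. It is shown only to illustrate the idea; the cell that follows does not scale anything.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 is an age, column 1 is a number of days
raw = np.array([
    [66, 450.0],
    [62, 375.0],
    [49, 83.0],
    [30, 31.0],
])

# Rescale each column to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled = scaler.fit_transform(raw)

print(scaled.mean(axis=0))  # close to 0 for every column
print(scaled.std(axis=0))   # close to 1 for every column
```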

In [30]:
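from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Encode categorical columns as integers, one encoder per column so each mapping can be reversed later
gender_encoder = LabelEncoder()
blood_type_encoder = LabelEncoder()
df_patients['Gender'] = gender_encoder.fit_transform(df_patients['Gender'])
df_patients['Blood Type'] = blood_type_encoder.fit_transform(df_patients['Blood Type'])

# Name identifies a patient rather than describing them, so it is dropped as a feature
df_patients.drop('Name', axis=1, inplace=True)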

# Convert Date of Admission to numerical
df_patients['Date of Admission'] = pd.to_datetime(df_patients['Date of Admission'])
df_patients['Days_Since_Admission'] = (df_patients['Date of Admission'] - df_patients['Date of Admission'].min()).dt.days
df_patients.drop('Date of Admission', axis=1, inplace=True)

X = df_patients.drop("Medical Condition", axis=1)  # Features (everything except the target)
y = df_patients["Medical Condition"]  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("\nModel accuracy:", model.score(X_test, y_test))

# Print feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance)

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices