In [1]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
import pandas as pd

In [2]:
df = pd.read_csv('owid_covid_data_us_subset_modified.csv')

In [3]:
features = [
    'total_cases', 'new_cases', 'total_deaths', 'new_deaths', 'total_cases_per_million',
    'total_deaths_per_million', 'icu_patients', 'hosp_patients', 'weekly_hosp_admissions',
    'daily_case_change_rate', 'daily_death_change_rate', 'hospitalization_rate', 'icu_rate',
    'case_fatality_rate'
]
target = 'icu_requirement_num'

# Dropping rows with missing values in selected features and target
data_clean = df.dropna(subset=features + [target])

# Splitting data into training and testing sets
X = data_clean[features]
y = data_clean[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Random Forest Regressor

In [4]:
model = RandomForestRegressor(random_state=42, n_estimators=100)
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

mse, r2

(np.float64(0.0031309178743961358), 0.9954676790540541)

## Linear Regression

In [5]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Make predictions using the Linear Regression model
y_pred_linear = linear_model.predict(X_test)

mse_linear = mean_squared_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)

mse_linear, r2_linear

(np.float64(0.07445555448573032), 0.8922180387108426)

### **Model Performance Analysis**

1. **Random Forest Regressor**:
   - **Mean Squared Error (MSE)**: 0.0031
   - **R² Score**: 0.9955
   - **Performance**: The Random Forest model performed exceptionally well, explaining 99.55% of the variance in ICU requirement on the held-out test set. Its low MSE indicates that its predictions closely track the observed values.
   - **Strengths**:
     - Handles complex, non-linear relationships effectively.
     - Provides robust predictions due to ensemble averaging.
     - Offers feature importance, enabling insights into key factors influencing ICU requirements.
   - **Limitations**: 
     - May require significant computational resources.
     - Can be less interpretable compared to simpler models.

2. **Linear Regression**:
   - **Mean Squared Error (MSE)**: 0.0745
   - **R² Score**: 0.8922
   - **Performance**: The Linear Regression model performed reasonably well, explaining 89.22% of the variance on the held-out test set. It captured the general trend, but its MSE was roughly 24 times that of Random Forest, likely because it cannot model complex, non-linear relationships.
   - **Strengths**:
     - Simplicity and interpretability.
     - Computational efficiency.
   - **Limitations**:
     - Struggles with non-linear patterns.
     - Assumes linear relationships between features and the target variable.
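
The feature-importance strength noted for Random Forest can be illustrated with a minimal sketch. The data below is synthetic and hypothetical (the column names only echo the notebook's features, and `X_demo`, `y_demo`, and `rf_demo` are illustrative names), so the printed ranking says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: NOT the notebook's CSV).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "hosp_patients": rng.uniform(0, 1000, 300),
    "new_cases": rng.uniform(0, 5000, 300),
    "noise": rng.normal(size=300),
})
# Target driven mostly by hosp_patients, slightly by new_cases, not at all by noise.
y_demo = (
    0.3 * X_demo["hosp_patients"]
    + 0.001 * X_demo["new_cases"]
    + rng.normal(scale=1.0, size=300)
)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

In the notebook itself, the equivalent call would presumably be `pd.Series(model.feature_importances_, index=features)`. Impurity-based importances can be misleading when features are strongly correlated, so they are best read as a rough ranking.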

---

### **Real-World Applicability**

#### **Random Forest Regressor**:
- **Applicability**: 
  - The high accuracy and ability to model non-linear interactions make it suitable for real-world scenarios like predicting ICU requirements during a pandemic.
  - By identifying important features (e.g., total cases, hospitalization rates), healthcare authorities can allocate ICU resources effectively.
- **Challenges**:
  - Computational cost could be a concern when scaling to larger datasets or real-time predictions.
  - Results need to be validated against diverse and evolving conditions (e.g., new virus variants).
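
One way to approach validation under evolving conditions is to train only on earlier observations and test on later ones. The sketch below uses scikit-learn's `TimeSeriesSplit` on synthetic, hypothetical data whose rows are assumed to be in chronological order; it is not the notebook's evaluation, which used a random 80/20 split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in data (assumption: rows are already sorted by date).
rng = np.random.default_rng(0)
n = 400
X_ts = rng.uniform(0, 1, size=(n, 3))
y_ts = 2.0 * X_ts[:, 0] + X_ts[:, 1] ** 2 + rng.normal(scale=0.05, size=n)

tscv = TimeSeriesSplit(n_splits=4)
fold_scores = []
ordered = []  # records whether every training row precedes every test row
for train_idx, test_idx in tscv.split(X_ts):
    rf = RandomForestRegressor(n_estimators=50, random_state=42)
    rf.fit(X_ts[train_idx], y_ts[train_idx])
    fold_scores.append(r2_score(y_ts[test_idx], rf.predict(X_ts[test_idx])))
    ordered.append(train_idx.max() < test_idx.min())

for i, score in enumerate(fold_scores, start=1):
    print(f"fold {i}: R2 = {score:.3f}")
```

Applying this to the notebook's data would require sorting `data_clean` by a date column first; whether the modified CSV retains one is an assumption.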

#### **Linear Regression**:
- **Applicability**:
  - Useful for scenarios where interpretability is critical, such as explaining the relationship between a few features and ICU needs.
  - Suitable for quick, approximate estimates in resource-constrained settings.
- **Challenges**:
  - Limited performance in capturing complex interactions.
  - May not provide accurate predictions for nuanced datasets.
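
The interpretability point can be made concrete by reading the fitted coefficients. The sketch below standardizes the features first so that coefficient magnitudes are comparable across features; the data is synthetic and hypothetical, and the scaling step is an addition that the notebook's `linear_model` does not include.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the notebook's CSV).
rng = np.random.default_rng(1)
X_lin = pd.DataFrame({
    "hosp_patients": rng.uniform(0, 1000, 200),
    "new_deaths": rng.uniform(0, 50, 200),
})
y_lin = (
    0.25 * X_lin["hosp_patients"]
    + 2.0 * X_lin["new_deaths"]
    + rng.normal(scale=1.0, size=200)
)

# Scale each feature to zero mean and unit variance, then fit ordinary least squares.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_lin, y_lin)

# Each coefficient is the change in the target per one standard deviation of the feature.
coefs = pd.Series(
    pipe.named_steps["linearregression"].coef_, index=X_lin.columns
)
print(coefs.sort_values(ascending=False))
```

With unscaled inputs, as in the notebook, `linear_model.coef_` is expressed in each feature's own units, so magnitudes are not directly comparable between features.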

---

### **Conclusion**:
- The **Random Forest Regressor** is the preferred model for ICU prediction: on the held-out test set it reached an R² of 0.9955 versus 0.8922 for Linear Regression, and it handles complex, non-linear data more robustly.
- **Linear Regression**, while less accurate, serves as a good baseline model when simplicity and interpretability are prioritized.
- In real-world applications, Random Forest can guide resource allocation, while Linear Regression can support simpler decision-making frameworks. Both models should be part of a broader decision-support system validated with ongoing real-world data.
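
Because the comparison rests on a single train/test split, a k-fold cross-validation of both models gives a sense of how stable the ranking is. The following is a minimal sketch on synthetic, hypothetical data (the target is deliberately non-linear), using scikit-learn's `cross_val_score`; the numbers it prints are not results for the ICU dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data with a non-linear term and an interaction term (assumption).
rng = np.random.default_rng(7)
X_cv = rng.uniform(-1, 1, size=(300, 4))
y_cv = (
    np.sin(3 * X_cv[:, 0])
    + X_cv[:, 1] * X_cv[:, 2]
    + rng.normal(scale=0.05, size=300)
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "linear_regression": LinearRegression(),
}

# One R2 score per fold for each model.
cv_r2 = {
    name: cross_val_score(est, X_cv, y_cv, cv=cv, scoring="r2")
    for name, est in candidates.items()
}
for name, scores in cv_r2.items():
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

In the notebook, `X` and `y` could be passed in place of `X_cv` and `y_cv`; reporting the per-fold spread alongside the mean makes it easier to judge whether the gap between the two models is consistent.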
