# Random Forest Model On Health Care Industry

## Import Modules

In [25]:
import pandas as pd
import seaborn as sns

from sklearn import metrics
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor, VotingRegressor
from sklearn.svm import SVR

## Search and Load Dataset From Seaborn Library

In [3]:
sns.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

In [4]:
df = sns.load_dataset('healthexp')

## Explore the Dataset

In [5]:
df

| | Year | Country | Spending_USD | Life_Expectancy |
|---|---|---|---|---|
| 0 | 1970 | Germany | 252.311 | 70.6 |
| 1 | 1970 | France | 192.143 | 72.2 |
| 2 | 1970 | Great Britain | 123.993 | 71.9 |
| 3 | 1970 | Japan | 150.437 | 72.0 |
| 4 | 1970 | USA | 326.961 | 70.9 |
| ... | ... | ... | ... | ... |
| 269 | 2020 | Germany | 6938.983 | 81.1 |
| 270 | 2020 | France | 5468.418 | 82.3 |
| 271 | 2020 | Great Britain | 5018.700 | 80.4 |
| 272 | 2020 | Japan | 4665.641 | 84.7 |


In [7]:
df.shape

(274, 4)

In [9]:
df.dtypes

Year                 int64
Country             object
Spending_USD       float64
Life_Expectancy    float64
dtype: object

### Check for NaN Values

In [10]:
df.isnull().sum()

Year               0
Country            0
Spending_USD       0
Life_Expectancy    0
dtype: int64

## Model Preparation 

### Create Binary Indicator Columns for the Categorical Column

In [12]:
df = pd.get_dummies(df)

In [13]:
df

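| | Year | Spending_USD | Life_Expectancy | Country_Canada | Country_France | Country_Germany | Country_Great Britain | Country_Japan | Country_USA |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1970 | 252.311 | 70.6 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1970 | 192.143 | 72.2 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1970 | 123.993 | 71.9 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1970 | 150.437 | 72.0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1970 | 326.961 | 70.9 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 269 | 2020 | 6938.983 | 81.1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 270 | 2020 | 5468.418 | 82.3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 271 | 2020 | 5018.700 | 80.4 | 0 | 0 | 0 | 1 | 0 | 0 |
| 272 | 2020 | 4665.641 | 84.7 | 0 | 0 | 0 | 0 | 1 | 0 |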
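
To make the encoding step concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are hypothetical and only mirror the column layout of `healthexp`. Passing `dtype=int` is an assumption made here to keep the 0/1 display; pandas 2.0 and later return boolean indicator columns by default.

```python
import pandas as pd

# Toy frame with hypothetical values, mirroring the healthexp column layout.
toy = pd.DataFrame({
    "Year": [1970, 1970, 1971],
    "Country": ["Germany", "France", "Germany"],
    "Spending_USD": [252.3, 192.1, 298.3],
})

# One indicator column is created per distinct country.
# dtype=int keeps a 0/1 display; pandas 2.0+ would otherwise return booleans.
encoded = pd.get_dummies(toy, columns=["Country"], dtype=int)
print(encoded)
```

The original `Country` column is dropped and the indicator columns are appended after the numeric ones.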


### Assign the Features to 'X' and the Target Value to 'y'

In [14]:
X = df.drop(['Life_Expectancy'], axis=1)

In [15]:
y = df['Life_Expectancy']

### Split the Data into Training and Test Sets

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
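
As a quick check of what this split produces, the sketch below applies the same `test_size` and `random_state` to placeholder arrays with the 274 rows reported by `df.shape`. The arrays `X_demo` and `y_demo` are stand-ins introduced here, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 274  # row count reported by df.shape above

# Placeholder data (hypothetical), used only to inspect the split sizes.
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))

# Fixing random_state makes the split reproducible across runs.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_te, X_te2)
```

With a fractional `test_size`, scikit-learn rounds the test set up, so the 20% share of 274 rows becomes 55 test rows and 219 training rows.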

## Random Forest Regressor Model

In [17]:
rfr = RandomForestRegressor(random_state = 13)

In [20]:
rfr.fit(X_train, y_train)

RandomForestRegressor(random_state=13)

In [22]:
y_pred = rfr.predict(X_test)

In [26]:
mean_absolute_error(y_test, y_pred)

0.27990909090908

In [23]:
mean_squared_error(y_test, y_pred)

0.12957590909089994

In [24]:
r2_score(y_test, y_pred)

0.9893864896823702

1. Mean Squared Error (MSE) & Mean Absolute Error (MAE):

Lower is Better: MSE is the average squared difference between the predicted values and the actual values, so it penalizes large errors more heavily. MAE is the average absolute difference, expressed in the same units as the target (years of life expectancy here). For both metrics, a smaller value means the model's predictions are closer to the actual data points.

2. R-squared (R^2) Score:

Higher R^2 is Better: R-squared measures how much of the variation in the dependent variable the model explains. An R^2 score of 1 means that the model explains all the variability in the data, while an R^2 score of 0 means that the model does no better than always predicting the mean. On held-out data the score can also be negative, which means the model performs worse than that mean baseline.

In summary:

For MSE & MAE, lower values are better.

For R^2, higher values (closer to 1) are better.
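
The definitions above can be checked by hand. The following is a minimal sketch that computes the three metrics with NumPy and compares them with scikit-learn; the arrays `y_true`, `y_hat`, and `y_bad` hold hypothetical values chosen for easy arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted life expectancies.
y_true = np.array([70.0, 72.0, 75.0, 80.0])
y_hat = np.array([70.5, 71.0, 75.5, 79.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mean_absolute_error(y_true, y_hat))
print(mse, mean_squared_error(y_true, y_hat))
print(r2, r2_score(y_true, y_hat))

# Predictions that are worse than always predicting the mean give a negative R^2.
y_bad = y_true[::-1]
r2_bad = r2_score(y_true, y_bad)
print(r2_bad)
```

The last case shows that R^2 is not bounded below by 0 once predictions are worse than the mean baseline.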

Our random forest model already performs well, with MSE 0.1296, MAE 0.2799, and R^2 0.9894. Next, we will see whether hyperparameter tuning can improve these results.

In [28]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [ 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

In [32]:
rfr_cv = GridSearchCV(estimator=rfr,
                      param_grid=param_grid,
                      cv=3,
                      scoring='neg_mean_squared_error',
                      n_jobs=-1  # Use all available CPU cores
                      )

In [33]:
%%time
rfr_cv.fit(X_train, y_train)

Wall time: 16.2 s


GridSearchCV(cv=3, estimator=RandomForestRegressor(random_state=13), n_jobs=-1,
             param_grid={'max_depth': [10, 20, 30],
                         'min_samples_leaf': [1, 2, 4],
                         'min_samples_split': [2, 5, 10],
                         'n_estimators': [100, 200, 300]},
             scoring='neg_mean_squared_error')
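
To see how large this search is and how to read its outcome, here is a hedged sketch. It counts the candidates in the grid above with `ParameterGrid`, then runs a much smaller search on synthetic data to show the `best_params_` and `best_score_` attributes. The synthetic arrays and the reduced `small_grid` are assumptions made for illustration and say nothing about the best parameters for `healthexp`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Same grid as in the notebook: 3 values for each of 4 parameters.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 3  # one fit per candidate per fold (cv=3)
print(n_candidates, n_fits)

# Synthetic regression data (hypothetical), only to keep the demo fast.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 0.5, size=120)

# Reduced grid so the sketch runs in a few seconds.
small_grid = {'n_estimators': [20, 40], 'max_depth': [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=13),
    param_grid=small_grid,
    cv=3,
    scoring='neg_mean_squared_error',
)
search.fit(X_demo, y_demo)

print(search.best_params_)   # winning parameter combination
print(-search.best_score_)   # mean cross-validated MSE (sign flipped back)
```

Because the scorer is the negated MSE, `best_score_` is zero or negative, and flipping the sign gives the cross-validated MSE of the selected candidate.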

In [34]:
y_pred = rfr_cv.predict(X_test)

In [35]:
mean_absolute_error(y_test, y_pred)

0.27253333333335944

In [36]:
mean_squared_error(y_test, y_pred)

0.12117737373739787

In [37]:
r2_score(y_test, y_pred)

0.990074410317099

With hyperparameter tuning, the results improve slightly: MSE 0.1212 (down from 0.1296), MAE 0.2725, and R^2 0.9901.

To get better results, we can consider the following:

1. Cross-Validation:

K-Fold Cross-Validation: Implement K-fold cross-validation to evaluate the model's performance on different subsets of the data. This gives a more reliable estimate of how well the model generalizes and helps detect overfitting.

2. Ensemble Methods:

Ensemble Learning: Combine multiple models using techniques like stacking or bagging (a random forest is itself a bagging ensemble of decision trees). Ensemble methods often lead to improved performance.

3. Data Expansion:

Collect More Data: If possible, increase the size of the dataset, which has only 274 rows here. More data can lead to better model performance.
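
The first two suggestions can be sketched together. The example below scores a plain random forest and a stacked ensemble with 5-fold cross-validation on synthetic data. The data, the choice of `SVR` and `Ridge` as stacking components, and all parameter values are assumptions for illustration, not results for `healthexp`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic regression data (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(150, 3))
y_demo = 1.5 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(0, 0.5, size=150)

# Shuffled 5-fold split, fixed seed for reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

forest = RandomForestRegressor(n_estimators=50, random_state=13)

# Stacking: base models make predictions, a final Ridge model combines them.
stack = StackingRegressor(
    estimators=[
        ('forest', RandomForestRegressor(n_estimators=50, random_state=13)),
        ('svr', SVR()),
    ],
    final_estimator=Ridge(),
)

results = {}
for name, model in [('forest', forest), ('stack', stack)]:
    # One negated-MSE score per fold.
    scores = cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring='neg_mean_squared_error'
    )
    results[name] = scores
    print(name, -scores.mean(), scores.std())
```

Comparing the mean and spread of the fold scores gives a steadier picture than a single train/test split. Whether stacking helps depends on the data.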