# Introduction

Decision trees leave you with a difficult choice. A deep tree with many leaves tends to overfit, because each prediction comes from the historical data of only the few rows that land in its leaf. A shallow tree with few leaves tends to underfit, because it fails to capture as many distinctions in the raw data.
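The trade-off described above can be seen in a small sketch. It uses synthetic data (a noisy sine curve, which is an assumption made purely for illustration and not the dataset used in this notebook) and varies `max_leaf_nodes` to compare training and validation error. The names `X_demo`, `y_demo`, and `mae_by_leaves` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

# Record (training MAE, validation MAE) for trees of increasing size.
mae_by_leaves = {}
for max_leaf_nodes in (2, 10, 50, 200):
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(X_tr, y_tr)
    train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
    val_mae = mean_absolute_error(y_va, tree.predict(X_va))
    mae_by_leaves[max_leaf_nodes] = (train_mae, val_mae)
    print(f"leaves={max_leaf_nodes:>3}  train MAE={train_mae:.3f}  val MAE={val_mae:.3f}")
```

Training error keeps shrinking as the tree grows, while validation error is the number to watch: a tree that is too small or too large will usually do worse on held-out rows than one in between.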

Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. However, many models use clever ideas that can lead to better performance. We'll look at the random forest as an example.

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree, and it works well with default parameters. As you keep modeling, you will encounter models with even better performance, but many of them are sensitive to parameter choices.
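The averaging step can be checked directly. The sketch below, on synthetic data (an assumption for illustration), fits a small `RandomForestRegressor`, collects the prediction of every tree in its `estimators_` attribute, and compares their mean with the forest's own prediction. The names `X_demo`, `y_demo`, and `X_new` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption for illustration).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=1)
forest.fit(X_demo, y_demo)

# Five new rows to predict on.
X_new = rng.normal(size=(5, 3))

# One row of predictions per component tree, shape (n_trees, n_rows).
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_new)

print("Average of the trees:", manual_average)
print("Forest prediction:   ", forest_prediction)
print("Match:", np.allclose(manual_average, forest_prediction))
```

Individual trees can disagree quite a bit on a given row; averaging them smooths out the errors that any single tree makes.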

You've already seen the code to load the data a few times. After loading and splitting the data, we have the following variables:

- `train_X`
- `val_X`
- `train_y`
- `val_y`


In [25]:
import pandas as pd
data_set = pd.read_csv("Latest Covid-19 India Status.csv")
data_set.describe()

,Total Cases,Active,Discharged,Deaths,Active Ratio (%),Discharge Ratio (%),Death Ratio (%)
count,36.0,36.0,36.0,36.0,36.0,36.0,36.0
mean,911412.4,10505.027778,888712.3,12195.0,1.255278,97.478611,1.266111
std,1334291.0,37159.139184,1290595.0,23546.148094,2.661486,2.556711,0.563541
min,7566.0,4.0,7431.0,4.0,0.01,84.6,0.04
25%,73153.25,145.0,70212.25,809.75,0.0475,97.6325,0.955
50%,468646.5,839.0,459735.0,5396.0,0.535,98.225,1.3
75%,1005276.0,6034.5,991171.8,13630.5,0.945,98.6525,1.59
max,6464876.0,219441.0,6272800.0,137313.0,15.03,99.92,2.74


In [26]:
data_set.columns

Index(['State/UTs', 'Total Cases', 'Active', 'Discharged', 'Deaths',
       'Active Ratio (%)', 'Discharge Ratio (%)', 'Death Ratio (%)'],
      dtype='object')

In [27]:
data_set = data_set.dropna(axis=0)  # dropna returns a new DataFrame, so keep the result
data_set.head(10)

,State/UTs,Total Cases,Active,Discharged,Deaths,Active Ratio (%),Discharge Ratio (%),Death Ratio (%)
0,Andaman and Nicobar,7566,6,7431,129,0.08,98.22,1.7
1,Andhra Pradesh,2014116,14693,1985566,13857,0.73,98.58,0.69
2,Arunachal Pradesh,53031,863,51908,260,1.63,97.88,0.49
3,Assam,589426,6901,576865,5660,1.17,97.87,0.96
4,Bihar,725708,100,715955,9653,0.01,98.66,1.33
5,Chandigarh,65105,40,64252,813,0.06,98.69,1.25
6,Chhattisgarh,1004451,412,990484,13555,0.04,98.61,1.35
7,Dadra and Nagar Haveli and Daman and Diu,10663,4,10655,4,0.04,99.92,0.04
8,Delhi,1437764,349,1412333,25082,0.02,98.23,1.74
9,Goa,173955,877,169877,3201,0.5,97.66,1.84


In [28]:
#data_set.drop('State/UTs',axis = 1)

In [42]:
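# Prediction target: total confirmed cases per state/UT
y = data_set['Total Cases']
# Feature columns used as model inputs
data_fea = ['Active', 'Discharged', 'Deaths', 'Active Ratio (%)', 'Discharge Ratio (%)', 'Death Ratio (%)']
# Display the target column
data_set[['Total Cases']]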

    Total Cases
0          7566
1       2014116
2         53031
3        589426
4        725708
5         65105
6       1004451
7         10663
8       1437764
9        173955
10       825422
11       770486
12       213548
13       325419
14       347867
15      2949445
16      4057233
17        20560
18        10347
19       792175
20      6464876
21       113933
22        75836
23        59119
24        30083
25      1007750
26       123572
27       600614
28       954095
29        29878
30      2614872
31       658054
32        82961
33      1709335
34       342976
35      1548604


In [30]:
X = data_set[data_fea]
X.describe()

,Active,Discharged,Deaths,Active Ratio (%),Discharge Ratio (%),Death Ratio (%)
count,36.0,36.0,36.0,36.0,36.0,36.0
mean,10505.027778,888712.3,12195.0,1.255278,97.478611,1.266111
std,37159.139184,1290595.0,23546.148094,2.661486,2.556711,0.563541
min,4.0,7431.0,4.0,0.01,84.6,0.04
25%,145.0,70212.25,809.75,0.0475,97.6325,0.955
50%,839.0,459735.0,5396.0,0.535,98.225,1.3
75%,6034.5,991171.8,13630.5,0.945,98.6525,1.59
max,219441.0,6272800.0,137313.0,15.03,99.92,2.74


In [43]:
from sklearn.tree import DecisionTreeRegressor

# Fit a single decision tree on the full dataset; random_state makes the result reproducible
data_model = DecisionTreeRegressor(random_state=1)
data_model.fit(X, y)

DecisionTreeRegressor(random_state=1)

In [44]:
print("The given data is: ")
print(X.head())
print("The predictions are: ")
print(data_model.predict(X.head()))

The given data is: 
   Active  Discharged  Deaths  Active Ratio (%)  Discharge Ratio (%)  \
0       6        7431     129              0.08                98.22   
1   14693     1985566   13857              0.73                98.58   
2     863       51908     260              1.63                97.88   
3    6901      576865    5660              1.17                97.87   
4     100      715955    9653              0.01                98.66   

   Death Ratio (%)  
0             1.70  
1             0.69  
2             0.49  
3             0.96  
4             1.33  
The predictions are: 
[   7566. 2014116.   53031.  589426.  725708.]
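The predictions above reproduce the `Total Cases` values exactly, which is what to expect when an unconstrained tree is asked about rows it was fitted on. The sketch below illustrates this on synthetic data (an assumption for illustration, not this notebook's dataset): the in-sample error is zero, while the error on held-out rows is not. The names `X_demo`, `y_demo`, `in_sample_mae`, and `held_out_mae` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration): a linear signal plus noise.
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 1, size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=100)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

# A tree with no depth or leaf limit can give every training row its own leaf.
tree = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)

in_sample_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
held_out_mae = mean_absolute_error(y_va, tree.predict(X_va))

print("MAE on rows used for fitting:", in_sample_mae)
print("MAE on held-out rows:        ", held_out_mae)
```

This is why in-sample predictions say little about model quality, and why a validation split is used before scoring.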


In [47]:
# Split the data into training and validation sets
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
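When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows. The sketch below uses placeholder arrays with 36 rows (the row count reported by `describe()` above; the values themselves are made up) to show the resulting sizes and that a fixed `random_state` reproduces the same split. All variable names here are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 36 rows; the values are made up for illustration.
X_demo = np.arange(36 * 2).reshape(36, 2)
y_demo = np.arange(36)

# Default split: 75% training rows, 25% validation rows.
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)
print("Training rows:  ", len(tr_X))
print("Validation rows:", len(va_X))

# Repeating the call with the same random_state yields the same rows.
tr_X2, va_X2, tr_y2, va_y2 = train_test_split(X_demo, y_demo, random_state=0)
print("Same split:", np.array_equal(va_X, va_X2))
```

With so few validation rows, a single unusual row can move the validation score noticeably, so scores from one split are best read as rough estimates.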

In [50]:
# Fit a random forest on the training split and score it on the validation split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

data_model_random = RandomForestRegressor(random_state=1)
data_model_random.fit(train_X, train_y)
y_pred = data_model_random.predict(val_X)
print("The Mean Absolute Error is: ", mean_absolute_error(val_y, y_pred))

The Mean Absolute Error is:  92163.53444444448
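The mean absolute error is the average of the absolute differences between actual and predicted values. As a minimal sketch, the numbers below are made up for illustration and the variable names are hypothetical; the manual calculation is compared with `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted values, for illustration only.
actual = np.array([100.0, 250.0, 400.0])
predicted = np.array([110.0, 240.0, 430.0])

# Absolute errors are 10, 10 and 30; their mean is the MAE.
manual_mae = np.mean(np.abs(actual - predicted))
sklearn_mae = mean_absolute_error(actual, predicted)

print("Manual MAE: ", manual_mae)
print("sklearn MAE:", sklearn_mae)
```

For scale, the validation MAE printed above (about 92,164) is roughly a tenth of the mean `Total Cases` value (about 911,412) shown in the summary statistics.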


