## Random Forest

Random Forest is an ensemble of Decision Trees. With a few exceptions, a `RandomForestClassifier` has all the hyperparameters of a `DecisionTreeClassifier` (to control how trees are grown), plus all the hyperparameters of a `BaggingClassifier` to control the ensemble itself.

The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which trades a higher bias for a lower variance and generally yields a better overall model. A `BaggingClassifier` built on a `DecisionTreeClassifier` that samples features at each split is therefore roughly equivalent to a `RandomForestClassifier`. Run the cell below to visualize a single estimator from a random forest model, using the Iris dataset to classify the data into the appropriate species.

In [3]:
from sklearn.datasets import load_iris
iris = load_iris()

# Model (can also use single decision tree)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10)

# Train
model.fit(iris.data, iris.target)
# Extract single tree
estimator = model.estimators_[5]

from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = iris.feature_names,
                class_names = iris.target_names,
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')

FileNotFoundError: [WinError 2] The system cannot find the file specified
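
The `FileNotFoundError` above is what `call` raises when the Graphviz `dot` executable is not on the system path. As a possible workaround, the sketch below draws the same kind of tree with `sklearn.tree.plot_tree`, which only needs matplotlib. The `random_state=0` and the output filename `tree_matplotlib.png` are assumptions made for this example.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree

iris = load_iris()
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(iris.data, iris.target)

# Draw one tree of the forest with matplotlib only (no external `dot` binary)
fig, ax = plt.subplots(figsize=(12, 8))
annotations = plot_tree(
    model.estimators_[5],
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
    rounded=True,
    precision=2,
    ax=ax,
)
# Hypothetical output path, chosen for this example
fig.savefig("tree_matplotlib.png", dpi=150, bbox_inches="tight")
```

In a notebook with `%matplotlib inline`, the figure also renders directly under the cell.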

Notice how each split separates the data into buckets of similar observations. This is a single tree on a relatively simple classification dataset, but the same method applies to more complex datasets, where the trees grow deeper.
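
To make the relationship between bagging and random forests concrete, here is a minimal sketch that fits both on Iris. The split, the seeds, and `max_features="sqrt"` are assumptions for illustration; the two models are expected to score similarly, not identically.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

# Random forest: bootstrap samples plus a random feature subset at each split
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Bagging over trees that also sample features at each split
bag = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=100,
    random_state=42,
).fit(X_tr, y_tr)

rf_acc = rf.score(X_te, y_te)
bag_acc = bag.score(X_te, y_te)
print(f"random forest: {rf_acc:.3f}, bagging: {bag_acc:.3f}")
```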

## Coronavirus
Coronavirus disease (COVID-19) is an infectious disease caused by a new virus.
The disease causes respiratory illness (like the flu) with symptoms such as a cough, fever, and in more severe cases, difficulty breathing. You can protect yourself by washing your hands frequently, avoiding touching your face, and avoiding close contact (1 meter or 3 feet) with people who are unwell. An outbreak of COVID-19 started in December 2019 and, at the time this project was created, was continuing to spread throughout the world. Many governments recommended only essential outings to public places and closed most businesses that did not serve food or sell essential items. An excellent [spatial dashboard](https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6) built by Johns Hopkins shows the daily confirmed cases by country.

This case study was designed to drive home the important role that data science plays in real-world situations like this pandemic. This case study uses the Random Forest Classifier and a dataset from the South Korean cases of COVID-19 provided on [Kaggle](https://www.kaggle.com/kimjihoo/coronavirusdataset) to encourage research on this important topic. The goal of the case study is to build a Random Forest Classifier to predict the 'state' of the patient.

First, please load the needed packages and modules into Python. Next, load the data into a pandas dataframe for ease of use.

In [1]:
import os
import pandas as pd
from datetime import datetime,timedelta
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import plotly.graph_objects as go
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

In [2]:
url ='SouthKoreacoronavirusdataset/PatientInfo.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,patient_id,global_num,sex,birth_year,age,country,province,city,disease,infection_case,infection_order,infected_by,contact_number,symptom_onset_date,confirmed_date,released_date,deceased_date,state
0,1000000001,2.0,male,1964.0,50s,Korea,Seoul,Gangseo-gu,,overseas inflow,1.0,,75.0,2020-01-22,2020-01-23,2020-02-05,,released
1,1000000002,5.0,male,1987.0,30s,Korea,Seoul,Jungnang-gu,,overseas inflow,1.0,,31.0,,2020-01-30,2020-03-02,,released
2,1000000003,6.0,male,1964.0,50s,Korea,Seoul,Jongno-gu,,contact with patient,2.0,2002000000.0,17.0,,2020-01-30,2020-02-19,,released
3,1000000004,7.0,male,1991.0,20s,Korea,Seoul,Mapo-gu,,overseas inflow,1.0,,9.0,2020-01-26,2020-01-30,2020-02-15,,released
4,1000000005,9.0,female,1992.0,20s,Korea,Seoul,Seongbuk-gu,,contact with patient,2.0,1000000000.0,2.0,,2020-01-31,2020-02-24,,released


In [3]:
df.shape

(2218, 18)

In [4]:
#Counts of null values 
na_df=pd.DataFrame(df.isnull().sum().sort_values(ascending=False)).reset_index()
na_df.columns = ['VarName', 'NullCount']
na_df[(na_df['NullCount']>0)]

Unnamed: 0,VarName,NullCount
0,disease,2199
1,deceased_date,2186
2,infection_order,2176
3,symptom_onset_date,2025
4,released_date,1995
5,contact_number,1807
6,infected_by,1749
7,infection_case,1055
8,global_num,904
9,birth_year,454


In [5]:
#counts of response variable values
df.state.value_counts()

isolated    1791
released     307
deceased      32
Name: state, dtype: int64

 **<font color='teal'> Create a new column named 'n_age' which is the calculated age based on the birth year column.</font>**

In [9]:
df['n_age']=2021-df['birth_year']
df.head()

Unnamed: 0,patient_id,global_num,sex,birth_year,age,country,province,city,disease,infection_case,infection_order,infected_by,contact_number,symptom_onset_date,confirmed_date,released_date,deceased_date,state,n_age
0,1000000001,2.0,male,1964.0,50s,Korea,Seoul,Gangseo-gu,,overseas inflow,1.0,,75.0,2020-01-22,2020-01-23,2020-02-05,,released,57.0
1,1000000002,5.0,male,1987.0,30s,Korea,Seoul,Jungnang-gu,,overseas inflow,1.0,,31.0,,2020-01-30,2020-03-02,,released,34.0
2,1000000003,6.0,male,1964.0,50s,Korea,Seoul,Jongno-gu,,contact with patient,2.0,2002000000.0,17.0,,2020-01-30,2020-02-19,,released,57.0
3,1000000004,7.0,male,1991.0,20s,Korea,Seoul,Mapo-gu,,overseas inflow,1.0,,9.0,2020-01-26,2020-01-30,2020-02-15,,released,30.0
4,1000000005,9.0,female,1992.0,20s,Korea,Seoul,Seongbuk-gu,,contact with patient,2.0,1000000000.0,2.0,,2020-01-31,2020-02-24,,released,29.0


### Handle Missing Values

 **<font color='teal'> Print the number of missing values by column.</font>**

In [10]:
print(df.isnull().sum() )

patient_id               0
global_num             904
sex                    145
birth_year             454
age                    261
country                  0
province                 0
city                    65
disease               2199
infection_case        1055
infection_order       2176
infected_by           1749
contact_number        1807
symptom_onset_date    2025
confirmed_date         141
released_date         1995
deceased_date         2186
state                   88
n_age                  454
dtype: int64


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   patient_id          2218 non-null   int64  
 1   global_num          1314 non-null   float64
 2   sex                 2073 non-null   object 
 3   birth_year          1764 non-null   float64
 4   age                 1957 non-null   object 
 5   country             2218 non-null   object 
 6   province            2218 non-null   object 
 7   city                2153 non-null   object 
 8   disease             19 non-null     object 
 9   infection_case      1163 non-null   object 
 10  infection_order     42 non-null     float64
 11  infected_by         469 non-null    float64
 12  contact_number      411 non-null    float64
 13  symptom_onset_date  193 non-null    object 
 14  confirmed_date      2077 non-null   object 
 15  released_date       223 non-null    object 
 16  deceas

 **<font color='teal'> Fill the 'disease' missing values with 0 and remap the True values to 1.</font>**

In [12]:
print(df['disease'].value_counts(), '\n')

True    19
Name: disease, dtype: int64 



In [13]:
df['disease'] = df['disease'].map( {np.nan: 0, True: 1})
print(df['disease'].value_counts())

0    2199
1      19
Name: disease, dtype: int64


 **<font color='teal'> Fill null values in the following columns with their mean: 'global_num', 'birth_year', 'infection_order', 'infected_by' and 'contact_number'.</font>**

In [15]:
cols=['global_num','birth_year','infection_order','infected_by','contact_number']
print(df[cols].info(),'\n')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   global_num       1314 non-null   float64
 1   birth_year       1764 non-null   float64
 2   infection_order  42 non-null     float64
 3   infected_by      469 non-null    float64
 4   contact_number   411 non-null    float64
dtypes: float64(5)
memory usage: 86.8 KB
None 



In [16]:
means=df[cols].mean()
print(means,'\n')

global_num         4.664817e+03
birth_year         1.974989e+03
infection_order    2.285714e+00
infected_by        2.600789e+09
contact_number     2.412895e+01
dtype: float64 



In [17]:
df[cols] = df[cols].fillna(value=means)
print(df[cols].info(),'\n')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   global_num       2218 non-null   float64
 1   birth_year       2218 non-null   float64
 2   infection_order  2218 non-null   float64
 3   infected_by      2218 non-null   float64
 4   contact_number   2218 non-null   float64
dtypes: float64(5)
memory usage: 86.8 KB
None 



In [18]:
[print(df[col].value_counts()) for col in cols] 

4664.816591    904
907.000000       2
1753.000000      2
7982.000000      2
2769.000000      2
              ... 
7829.000000      1
8134.000000      1
7151.000000      1
5397.000000      1
2.000000         1
Name: global_num, Length: 1304, dtype: int64
1974.988662    454
1969.000000     54
1995.000000     51
1998.000000     47
1996.000000     45
              ... 
2019.000000      2
1927.000000      2
1925.000000      1
1931.000000      1
1916.000000      1
Name: birth_year, Length: 97, dtype: int64
2.285714    2176
2.000000      19
1.000000      11
3.000000       6
5.000000       3
4.000000       2
6.000000       1
Name: infection_order, dtype: int64
2.600789e+09    1749
2.000000e+09      44
4.100000e+09      27
4.100000e+09      21
2.000000e+09      17
                ... 
2.000000e+09       1
6.023000e+09       1
4.100000e+09       1
1.100000e+09       1
2.000000e+09       1
Name: infected_by, Length: 207, dtype: int64
24.128954      1807
0.000000         47
2.000000         44
3.0

[None, None, None, None, None]
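
Mean imputation collapses every missing entry onto a single value, as the counts above show (2176 of the 2218 `infection_order` rows now share the mean). The notebook imports `IterativeImputer` and `ExtraTreesRegressor` without using them, so here is a minimal sketch of how they could be combined as an alternative. It runs on a small synthetic frame (`demo`, with made-up columns `a` and `b`), not on `df`, and the estimator settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in data: two correlated numeric columns, with gaps in one of them
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(50, 10, size=200)})
demo["b"] = 2 * demo["a"] + rng.normal(0, 1, size=200)
demo.loc[rng.choice(200, size=40, replace=False), "b"] = np.nan

# Each column with missing values is predicted from the other columns
imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
    max_iter=5,
    random_state=0,
)
filled = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)
print(filled.isnull().sum())
```

Model-based imputation only helps when the other columns carry information about the missing one; for an identifier such as `infected_by`, neither the mean nor a model prediction has a natural meaning.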

 **<font color='teal'> Fill the rest of the missing values with any method.</font>**

In [19]:
print(df.isnull().sum().sort_values(ascending=False))

deceased_date         2186
symptom_onset_date    2025
released_date         1995
infection_case        1055
n_age                  454
age                    261
sex                    145
confirmed_date         141
state                   88
city                    65
contact_number           0
infected_by              0
infection_order          0
disease                  0
province                 0
country                  0
birth_year               0
global_num               0
patient_id               0
dtype: int64


In [20]:
cols2=['deceased_date', 'symptom_onset_date', 'released_date', 'infection_case', 
       'n_age', 'age', 'sex', 'confirmed_date', 'state', 'city']
[print(df[col].value_counts(),'\n') for col in cols2] 

2020-02-23    4
2020-03-02    3
2020-03-09    3
2020-03-04    3
2020-03-05    3
2020-03-01    3
2020-02-25    2
2020-03-07    2
2020-03-19    2
2020-02-19    1
2020-02-21    1
2020-02-26    1
2020-03-06    1
2020-03-03    1
2020-02-24    1
2020-02-27    1
Name: deceased_date, dtype: int64 

2020-02-27    19
2020-02-24    12
2020-03-09    12
2020-02-22    12
2020-02-25    12
2020-03-10    10
2020-02-26     9
2020-03-11     8
2020-03-06     8
2020-03-07     8
2020-02-23     8
2020-02-21     6
2020-03-08     6
2020-03-12     6
2020-03-04     5
2020-03-16     4
2020-02-20     4
2020-03-05     4
2020-02-29     4
2020-03-15     4
2020-03-02     4
2020-02-18     3
2020-03-13     3
2020-03-01     2
2020-02-13     2
2020-02-15     2
2020-02-28     2
2020-03-17     2
2020-02-11     2
2020-02-19     2
2020-01-27     1
2020-03-03     1
2020-02-06     1
2020-01-26     1
2020-01-22     1
2020-03-14     1
2020-01-19     1
2020-01-31     1
Name: symptom_onset_date, dtype: int64 

2020-03-13    23
2020

[None, None, None, None, None, None, None, None, None, None]

##### For these columns, fill na with XXXXXXXXXXXXXXXXXXXXXXXXXXX
'deceased_date', 'symptom_onset_date', and 'released_date'

In [26]:
cols3 = ['deceased_date', 'symptom_onset_date', 'released_date']
print(df[cols3].info(),'\n')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   deceased_date       32 non-null     object
 1   symptom_onset_date  193 non-null    object
 2   released_date       223 non-null    object
dtypes: object(3)
memory usage: 52.1+ KB
None 



In [29]:
[print( df.loc[df[col].isnull()].sample(15)  ) for col in cols3]

      patient_id   global_num     sex   birth_year  age country  \
902   3009000005  4664.816591  female  1970.000000  50s   Korea   
1709  6006000009  4664.816591    male  1994.000000  20s   Korea   
645   2000000057  4944.000000    male  1989.000000  30s   Korea   
221   1000000222  7965.000000    male  1964.000000  50s   Korea   
1693  6004000058  4664.816591    male  1974.988662  60s   Korea   
1076  6001000002    73.000000  female  2001.000000  10s   Korea   
1085  6001000011   238.000000    male  1966.000000  50s   Korea   
1350  6001000276  4870.000000    male  1997.000000  20s   Korea   
1722  6007000001  4664.816591    male  1960.000000  60s   Korea   
596   2000000008    25.000000  female  1946.000000  70s   Korea   
1090  6001000016   532.000000    male  1957.000000  60s   Korea   
411   1200000070    70.000000  female  1972.000000  40s   Korea   
334   1100000053  4664.816591  female  1992.000000  20s   Korea   
737   2000000149  7262.000000    male  1997.000000  20s   Kore

[None, None, None]

##### For this column, fill na with XXXXXXXXXXXXXXXXXXXXXXXXXXX
'infection_case'

In [25]:
cols4 = ['infection_case']
print(df[cols4].info(),'\n')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 1 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   infection_case  1163 non-null   object
dtypes: object(1)
memory usage: 17.5+ KB
None 
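
The fill method for `infection_case` is left open above. One option, sketched here on a toy frame, is to give missing values their own category so that missingness stays visible to the model. The label `'unknown'` is an assumption, not a value from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'infection_case' column
demo = pd.DataFrame(
    {"infection_case": ["overseas inflow", np.nan, "contact with patient", np.nan]}
)

# Keep missingness as information by giving it its own level
demo["infection_case"] = demo["infection_case"].fillna("unknown")
counts = demo["infection_case"].value_counts()
print(counts)
```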



##### For this column, fill na with XXXXXXXXXXXXXXXXXXXXXXXXXXX
 'n_age'

In [26]:
cols5 = ['n_age']
print(df[cols5].info(),'\n')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   n_age   1764 non-null   float64
dtypes: float64(1)
memory usage: 17.5 KB
None 
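
`n_age` was derived before `birth_year` was imputed, which is why it still has 454 nulls. One simple option, sketched on a toy frame below, is to recompute it from the filled `birth_year`. In the notebook, where `birth_year` has already been mean-filled, the equivalent would be rerunning `df['n_age'] = 2021 - df['birth_year']`.

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2021  # same constant the notebook uses for n_age

demo = pd.DataFrame({"birth_year": [1964.0, np.nan, 1992.0]})
demo["n_age"] = REFERENCE_YEAR - demo["birth_year"]  # NaN propagates to n_age

# Fill birth_year first (here with the column mean), then derive n_age again
demo["birth_year"] = demo["birth_year"].fillna(demo["birth_year"].mean())
demo["n_age"] = REFERENCE_YEAR - demo["birth_year"]
print(demo)
```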



##### For these columns, fill na with the mode (most frequently occurring value)
'age', 'sex', 'confirmed_date', 'state', and 'city'

In [141]:
mode_cols=['age', 'sex', 'confirmed_date', 'state', 'city']
modes=df[mode_cols].mode().T
modes=pd.Series(modes[0])
print(modes,'\n')

print(df[mode_cols].info())

age                        20s
sex                     female
confirmed_date      2020-03-01
state                 isolated
city              Gyeongsan-si
Name: 0, dtype: object 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             1957 non-null   object
 1   sex             2073 non-null   object
 2   confirmed_date  2077 non-null   object
 3   state           2130 non-null   object
 4   city            2153 non-null   object
dtypes: object(5)
memory usage: 86.8+ KB
None


In [142]:
df[mode_cols] = df[mode_cols].fillna(value=modes)
df[mode_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             2218 non-null   object
 1   sex             2218 non-null   object
 2   confirmed_date  2218 non-null   object
 3   state           2218 non-null   object
 4   city            2218 non-null   object
dtypes: object(5)
memory usage: 86.8+ KB


In [143]:
[print( df[col].value_counts(),'\n' ) for col in mode_cols]

20s     728
50s     385
40s     303
30s     251
60s     229
70s     117
80s      84
10s      74
0s       29
90s      17
100s      1
Name: age, dtype: int64 

female    1316
male       902
Name: sex, dtype: int64 

2020-03-01    274
2020-02-28    125
2020-02-26    123
2020-03-03    113
2020-02-27    106
2020-03-04    106
2020-03-05    102
2020-02-29     99
2020-03-06     97
2020-03-10     94
2020-02-25     93
2020-02-23     71
2020-02-22     70
2020-03-09     69
2020-03-02     66
2020-02-24     56
2020-03-07     54
2020-03-08     53
2020-03-16     53
2020-03-11     49
2020-02-20     41
2020-03-12     37
2020-03-15     36
2020-03-13     35
2020-02-21     33
2020-03-17     33
2020-03-18     32
2020-03-14     31
2020-02-19     28
2020-02-18      9
2020-02-05      5
2020-01-31      4
2020-02-09      3
2020-01-30      3
2020-02-02      3
2020-02-16      2
2020-02-06      2
2020-02-04      1
2020-01-20      1
2020-01-27      1
2020-01-26      1
2020-02-10      1
2020-01-23      1
2020-02-07  

[None, None, None, None, None]

 **<font color='teal'> Check for any remaining null values.</font>**

In [None]:
df.isnull().sum().sort_values(ascending=False)

Remove date columns from the data.


In [None]:
df = df.drop(['symptom_onset_date','confirmed_date','released_date','deceased_date'],axis =1)

Review the count of unique values by column.

In [None]:
print(df.nunique())

Review the percent of unique values by column.

In [None]:
print(df.nunique()/df.shape[0])

Review the range of values per column.

In [None]:
df.describe().T

### Check for duplicated rows

In [None]:
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF

Print the categorical columns and their associated levels.

In [None]:
dfo = df.select_dtypes(include=['object'], exclude=['datetime'])
# Get the number of levels for each categorical variable
vn = pd.DataFrame(dfo.nunique()).reset_index()
vn.columns = ['VarName', 'LevelsCount']
# sort_values returns a new frame, so assign the result back
vn = vn.sort_values(by='LevelsCount', ascending=False)
vn

**<font color='teal'> Plot the correlation heat map for the features.</font>**
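
One possible way to approach the heat map, shown on a synthetic frame with hypothetical column names, is to compute the correlation matrix over the numeric columns and draw it with matplotlib:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data; the column names are placeholders
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
demo["label"] = "a"  # non-numeric column, excluded from the correlation

corr = demo.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Pearson correlation")
fig.tight_layout()
```

Since the notebook already imports seaborn, `sns.heatmap(corr, annot=True)` would give a similar plot in one call.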

**<font color='teal'> Plot the boxplots to check for outliers. </font>**
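
A minimal sketch of the outlier check, on synthetic data with one planted extreme value. The 1.5 × IQR rule used to flag rows is a common convention, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for a numeric column such as 'contact_number'
rng = np.random.default_rng(2)
demo = pd.DataFrame({"contact_number": rng.poisson(5, size=100).astype(float)})
demo.loc[0, "contact_number"] = 500.0  # planted outlier

fig, ax = plt.subplots()
ax.boxplot(demo["contact_number"])
ax.set_ylabel("contact_number")

# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles
q1 = demo["contact_number"].quantile(0.25)
q3 = demo["contact_number"].quantile(0.75)
iqr = q3 - q1
is_outlier = (demo["contact_number"] < q1 - 1.5 * iqr) | (demo["contact_number"] > q3 + 1.5 * iqr)
outliers = demo[is_outlier]
print(outliers)
```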

**<font color='teal'> Create dummy features for object type features. </font>**
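
A short sketch of one-hot encoding with `pd.get_dummies`, using a toy frame whose columns mirror a few from the dataset. `drop_first=True` is an optional choice made here to avoid one redundant column per feature.

```python
import pandas as pd

demo = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age": ["50s", "20s", "20s"],
    "contact_number": [75.0, 2.0, 31.0],
})

# Encode every non-numeric column; numeric columns pass through unchanged
categorical_cols = demo.select_dtypes(exclude="number").columns
encoded = pd.get_dummies(demo, columns=list(categorical_cols), drop_first=True)
print(encoded.columns.tolist())
```

Leave the target column (`state`) out of the encoding so it can be used as `y` unchanged.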

### Split the data into test and train subsamples

In [None]:
from sklearn.model_selection import train_test_split

# Don't forget to define X (the features) and y (the target) first

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)
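
A minimal sketch of defining `X` and `y` before the split, on a toy frame. The `stratify=y` argument is an addition not present in the cell above; it keeps the class proportions the same in both subsamples, which matters when classes are imbalanced.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the encoded dataframe; 'state' is the target
demo = pd.DataFrame({
    "contact_number": [75, 31, 17, 9, 2, 40, 12, 3, 8, 20],
    "disease": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    "state": ["released", "isolated"] * 5,
})

y = demo["state"]                  # target
X = demo.drop(columns=["state"])   # everything else is a feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```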

### Scale data to prep for model creation

In [None]:
# Scale data
from sklearn import preprocessing
# Fit the scaler on the training data only, then apply the same transformation to the test data
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
from sklearn.metrics import precision_recall_curve, f1_score, auc
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score, log_loss
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

### Fit Random Forest Classifier
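The fitted model shows an overall accuracy of about 80%, which suggests it can identify the state of a patient in the South Korean dataset reasonably well. Because most patients in this dataset are 'isolated', read the accuracy alongside the weighted F1 score and the confusion matrix rather than on its own.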

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=300, random_state = 1,n_jobs=-1)
model_res = clf.fit(X_train_scaled, y_train)
y_pred = model_res.predict(X_test_scaled)
y_pred_prob = model_res.predict_proba(X_test_scaled)
lr_probs = y_pred_prob[:,1]
ac = accuracy_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred, average='weighted')
cm = confusion_matrix(y_test, y_pred)

print('Random Forest: Accuracy=%.3f' % (ac))

print('Random Forest: f1-score=%.3f' % (f1))
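
With imbalanced classes, accuracy is easier to interpret next to a majority-class baseline and a per-class report. The sketch below illustrates this on synthetic data from `make_classification`; the class weights only loosely mimic an imbalanced target and are assumptions, not the case-study data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic three-class problem with one dominant class
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4, n_classes=3,
    weights=[0.8, 0.15, 0.05], flip_y=0, random_state=1,
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(Xtr, ytr)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(Xtr, ytr)

baseline_acc = accuracy_score(yte, baseline.predict(Xte))
forest_acc = accuracy_score(yte, forest.predict(Xte))
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"forest accuracy:   {forest_acc:.3f}")

# Per-class precision, recall, and F1
print(classification_report(yte, forest.predict(Xte), zero_division=0))
```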

### Create Confusion Matrix Plots
Confusion matrices are a great way to review model performance for a multi-class classification problem. Seeing which class the misclassified observations end up in helps you decide whether you need to build additional features to improve the overall model. In the example below we plot a regular counts confusion matrix as well as a normalized (row-percent) confusion matrix. The normalized confusion matrix is particularly helpful when you have unbalanced class sizes.

In [None]:
# Use the label order of the fitted model, which matches the sorted order confusion_matrix uses
class_names = list(clf.classes_)

In [None]:
import itertools
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()


# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')
#plt.savefig('figures/RF_cm_multi_class.png')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
#plt.savefig('figures/RF_cm_proportion_multi_class.png', bbox_inches="tight")
plt.show()

### Plot feature importances
The random forest algorithm can be used as a regression or classification model. In either case it tends to be a bit of a black box, where understanding what's happening under the hood can be difficult. Plotting the feature importances is one way that you can gain a perspective on which features are driving the model predictions.

In [None]:
feature_importance = clf.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
# keep the 30 most important features (argsort is ascending, so take the tail)
sorted_idx = np.argsort(feature_importance)[-30:]

pos = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(10,10))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
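
Impurity-based importances such as `feature_importances_` are computed on the training data and can favor features with many distinct values. As a complementary view, here is a sketch of permutation importance on held-out data, using Iris and an assumed split and seed:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_iris()
Xtr, Xte, ytr, yte = train_test_split(
    data.data, data.target, random_state=1, stratify=data.target
)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(Xtr, ytr)

# Shuffle one feature at a time and measure the drop in test accuracy
result = permutation_importance(forest, Xte, yte, n_repeats=10, random_state=1)

ranked = sorted(
    zip(data.feature_names, result.importances_mean, result.importances_std),
    key=lambda item: -item[1],
)
for name, mean, std in ranked:
    print(f"{name:<20} {mean:.3f} +/- {std:.3f}")
```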

The popularity of random forest is primarily due to how well it performs across a wide range of data situations. It tends to handle highly correlated features well, whereas a linear regression model would not. This case study demonstrates that ability with only a few features, almost all of them highly correlated with each other.
Random Forest is also an efficient way to investigate the importance of a set of features in a large data set. Consider random forest one of your first choices when building a tree-based model, especially for multiclass classification.