## Data Mining of Solar Flares

The data comes from the Reuven Ramaty High Energy Solar Spectroscopic Imager (RHESSI, originally the High Energy Solar Spectroscopic Imager or HESSI), a NASA solar flare observatory. <br>
The description and analysis of the data are in the report. This notebook handles the data manipulation: filtering, cleaning, and transforming the data so that it can be used to classify flares based on their energy output, using MLP and RF classifiers.


In [None]:
# Imports
%matplotlib inline
import csv  # Needed by the txt-to-csv conversion cell.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import scikitplot as skplt

from datetime import datetime

from sklearn.model_selection import train_test_split
# Import Model Classifiers.
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
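# Import evaluation libraries.
# plot_confusion_matrix is not imported: it was removed from sklearn.metrics in scikit-learn 1.2, and skplt is used for the plots.
from sklearn.metrics import classification_report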
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# Data preparation
### Data ingestion.

How to convert txt to csv: https://stackoverflow.com/questions/39642082/convert-txt-to-csv-python-script <br>

How to remove excess white space: https://stackoverflow.com/questions/2077897/substitute-multiple-whitespace-with-single-whitespace-in-python

In [56]:
# Converting txt to csv.
with open('../Data Sets/V1 Solar Flares from RHESSI Mission/hessi_2018.txt', 'r') as in_file:
    stripped = (' '.join(line.split()) for line in in_file)    
    lines = (line.split(" ") for line in stripped if line)
    # newline='' stops the csv module from writing blank rows on Windows.
    with open('../Data Sets/V1 Solar Flares from RHESSI Mission/hessi_2018.csv', 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('Flare', 'Start_Date', 'Start_Time', 'Peak_Time', 'End_Time', 'Dur_S', 'Peak_c/s', 'Total_Counts', 'Energy_keV', 'X_Pos_Arcsec', 'Y_Pos_Arcsec', 'Radial_Pos_Arcsec', 'Active_Region', 'F01', 'F02', 'F03', 'F04', 'F05', 'F06', 'F07', 'F08', 'F09', 'F10', 'F11'))
        writer.writerows(lines)
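
The whitespace-collapsing step can be checked on a single line before running it over the whole file. The following is a minimal sketch that works on an in-memory string; the sample line is made up for illustration and only mimics the irregular padding of the raw .txt file.

```python
import csv
import io

# Hypothetical sample input: one padded data line followed by a blank line.
sample = "  2021213   12-Feb-2002  21:29:56   21:33:38  \n\n"

rows = []
for line in io.StringIO(sample):
    stripped = " ".join(line.split())  # Collapse runs of whitespace to one space.
    if stripped:                       # Skip blank lines.
        rows.append(stripped.split(" "))

out = io.StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue())
```

Working on `io.StringIO` objects keeps the check independent of the file paths used in the cell above.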

In [43]:
# Function for converting string to a datetime, specifying the format.
# Datetime Formatting Codes: http://bit.ly/python-dt-fmt
# Working with Dates and Time Series Data: https://youtu.be/UFuo7EHI8zc?t=641
# Pandas can recognise the datetime format of this rhessi data, so it doesn't need this 'date_parser' method.
# But doing this in case I need it for other data.
d_parser = lambda x: datetime.strptime(x, '%d-%b-%Y')
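
Recent pandas releases deprecate the `date_parser` argument of `read_csv`, so a parser like the one above may trigger a warning when it is passed in. One alternative, sketched here on a hypothetical two-row sample rather than the real file, is to read the column as text and convert it afterwards with `pd.to_datetime` and an explicit format.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the converted CSV.
csv_text = "Flare,Start_Date\n2021213,12-Feb-2002\n2021332,13-Feb-2002\n"

df_sample = pd.read_csv(io.StringIO(csv_text))
# Parse explicitly after loading instead of passing date_parser to read_csv.
df_sample["Start_Date"] = pd.to_datetime(df_sample["Start_Date"], format="%d-%b-%Y")
print(df_sample.dtypes)
```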

In [57]:
# Read the original data in.
df_rhessi_orig = pd.read_csv('../Data Sets/V1 Solar Flares from RHESSI Mission/hessi_2018.csv',
                             parse_dates=['Start_Date'], date_parser=d_parser)
df_rhessi = df_rhessi_orig.copy() # Create a copy to work with.
print(type(df_rhessi)) # Check it's in the expected DataFrame format.

<class 'pandas.core.frame.DataFrame'>


### Initial inspection of the data.

In [45]:
df_rhessi.shape # Rows and Columns.

(121206, 24)

In [58]:
df_rhessi.head(10) # Making sure the data has been ingested correctly.

Unnamed: 0,Flare,Start_Date,Start_Time,Peak_Time,End_Time,Dur_S,Peak_c/s,Total_Counts,Energy_keV,X_Pos_Arcsec,...,F02,F03,F04,F05,F06,F07,F08,F09,F10,F11
0,2021213,2002-02-12,21:29:56,21:33:38,21:41:48,712,136,167304,12-25,592,...,P1,,,,,,,,,
1,2021228,2002-02-12,21:44:08,21:45:06,21:48:56,288,7,9504,6-12,604,...,P1,PE,Q1,,,,,,,
2,2021332,2002-02-13,00:53:24,00:54:54,00:57:00,216,15,11448,6-12,-310,...,P1,,,,,,,,,
3,2021308,2002-02-13,04:22:52,04:23:50,04:26:56,244,20,17400,12-25,-277,...,P1,,,,,,,,,
4,2021310,2002-02-13,07:03:52,07:05:14,07:07:48,236,336,313392,25-50,-272,...,GS,P1,PE,Q2,,,,,,
5,2021353,2002-02-13,07:07:48,07:09:14,07:20:56,788,272,524304,12-25,-271,...,P1,,,,,,,,,
6,2021354,2002-02-13,07:20:56,07:22:42,07:30:04,548,28,52488,6-12,-267,...,P1,,,,,,,,,
7,2021312,2002-02-13,08:53:20,08:55:18,09:05:08,708,92,125352,25-50,-362,...,P1,,,,,,,,,
8,2021339,2002-02-13,10:02:56,10:04:42,10:04:44,108,26,10368,6-12,-235,...,P1,PE,Q2,SE,,,,,,
9,2021313,2002-02-13,12:29:32,12:30:58,12:33:24,232,26,16920,12-25,-905,...,P1,,,,,,,,,


In [51]:
df_rhessi['Start_Date'].dt.day_name() 

0           Tuesday
1           Tuesday
2         Wednesday
3         Wednesday
4         Wednesday
            ...    
121201       Friday
121202       Friday
121203     Saturday
121204       Monday
121205     Saturday
Name: Start_Date, Length: 121206, dtype: object

In [53]:
df_rhessi['Start_Date'].min()

Timestamp('2002-02-12 00:00:00')

In [52]:
df_rhessi['Start_Date'].max()

Timestamp('2018-03-03 00:00:00')

In [50]:
# How many days worth of data do we have?
df_rhessi['Start_Date'].max() - df_rhessi['Start_Date'].min()

Timedelta('5863 days 00:00:00')

In [None]:
# Filter on the start date; the lower bound has not been filled in yet.
filt = (df_rhessi['Start_Date'] >= '')
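
A date filter of this kind is usually written as a boolean mask over the datetime column. The sketch below uses a three-row stand-in frame and hypothetical bounds (2005 to 2015), since the notebook does not say which period is of interest.

```python
import pandas as pd

# Stand-in frame with the same column names as df_rhessi.
df_sample = pd.DataFrame({
    "Flare": [1, 2, 3],
    "Start_Date": pd.to_datetime(["2002-02-12", "2010-06-01", "2018-03-03"]),
})

# Hypothetical bounds; substitute whichever period is of interest.
start, end = pd.Timestamp("2005-01-01"), pd.Timestamp("2015-12-31")
filt = (df_sample["Start_Date"] >= start) & (df_sample["Start_Date"] <= end)
df_window = df_sample.loc[filt]
print(df_window)
```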


### Look at the data types of each column/attribute.

In [59]:
# They do not match the brief; this will need to be rectified later.
# IPSI and Contra should be of type int.
df_rhessi.dtypes

Flare                         int64
Start_Date           datetime64[ns]
Start_Time                   object
Peak_Time                    object
End_Time                     object
Dur_S                         int64
Peak_c/s                      int64
Total_Counts                  int64
Energy_keV                   object
X_Pos_Arcsec                  int64
Y_Pos_Arcsec                  int64
Radial_Pos_Arcsec             int64
Active_Region                 int64
F01                          object
F02                          object
F03                          object
F04                          object
F05                          object
F06                          object
F07                          object
F08                          object
F09                          object
F10                          object
F11                          object
dtype: object

### Inspect the Unique values of the categorical (dtype object), attributes.

In [79]:
# Create a new df_uniques and read in the columns unique values.
df_uniques = pd.DataFrame(columns= ['Buffer', 'Energy_keV', 'F01', 'F02', 'F03', 'F04', 'F05', 'F06', 'F07', 'F08', 'F09', 'F10', 'F11'])

df_uniques.Buffer = pd.Series(range(1,25))
# When creating a df, the number of rows is set by the first column.
# I needed to add this buffer to show all values.
df_uniques.Energy_keV = pd.Series(df_rhessi.Energy_keV.unique())
df_uniques.F01 = pd.Series(df_rhessi.F01.unique())
df_uniques.F02 = pd.Series(df_rhessi.F02.unique())
df_uniques.F03 = pd.Series(df_rhessi.F03.unique())
df_uniques.F04 = pd.Series(df_rhessi.F04.unique())
df_uniques.F05 = pd.Series(df_rhessi.F05.unique())
df_uniques.F06 = pd.Series(df_rhessi.F06.unique())
df_uniques.F07 = pd.Series(df_rhessi.F07.unique())
df_uniques.F08 = pd.Series(df_rhessi.F08.unique())
df_uniques.F09 = pd.Series(df_rhessi.F09.unique())
df_uniques.F10 = pd.Series(df_rhessi.F10.unique())
df_uniques.F11 = pd.Series(df_rhessi.F11.unique())

df_uniques

Unnamed: 0,Buffer,Energy_keV,F01,F02,F03,F04,F05,F06,F07,F08,F09,F10,F11
0,1,12-25,A1,P1,,,,,,,,,
1,2,6-12,A0,GS,PE,Q1,Q2,PE,Q1,Q4,Q5,SS,SE
2,3,25-50,a0,GE,P1,PE,SE,Q3,SE,Q3,SE,Q6,SD
3,4,3-6,a1,PS,PS,Q2,Q1,Q2,Q3,SE,SD,Q5,
4,5,50-100,A3,A1,GE,PS,P1,SE,PE,PE,PE,SD,
5,6,100-300,,a1,a2,a3,PS,Q4,Q4,SD,Q4,SE,
6,7,300-800,,GD,GS,P1,PE,P1,Q2,Q7,Q6,,
7,8,800-7000,,ES,Q1,SE,Q3,PS,SD,Q5,PS,,
8,9,7000-20000,,EE,ES,GE,SS,SS,PS,P1,SS,,
9,10,,,PE,EE,GS,GE,Q1,SS,PS,,,
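
The unique values can also be collected without a buffer column or one assignment per column. This is a minimal sketch on a small stand-in frame: `pd.concat` aligns the per-column Series on their index and pads the shorter ones with NaN.

```python
import pandas as pd

# Stand-in frame; the real notebook would use df_rhessi and its object columns.
df_sample = pd.DataFrame({
    "Energy_keV": ["12-25", "6-12", "3-6"],
    "F01": ["A1", "A0", "A1"],
})

# One Series of unique values per column; concat pads shorter columns with NaN.
uniques = {col: pd.Series(df_sample[col].unique()) for col in df_sample.columns}
df_uniques_sample = pd.concat(uniques, axis=1)
print(df_uniques_sample)
```

On the real data the column list could come from `df_rhessi.select_dtypes(include='object').columns`, which avoids typing the flag names by hand.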


## Cleaning

First we look for nulls and drop them for now; later revisions will do some imputing. <br>

Fortunately, the data from Kaggle is clean: the only nulls are in the flag columns F03 to F11.

### Drop Null entries.

In [80]:
print(df_rhessi.isnull().sum()) # Show nulls.

Flare                     0
Start_Date                0
Start_Time                0
Peak_Time                 0
End_Time                  0
Dur_S                     0
Peak_c/s                  0
Total_Counts              0
Energy_keV                0
X_Pos_Arcsec              0
Y_Pos_Arcsec              0
Radial_Pos_Arcsec         0
Active_Region             0
F01                       0
F02                       0
F03                   19925
F04                   20220
F05                   62832
F06                  101142
F07                  117178
F08                  120306
F09                  121078
F10                  121184
F11                  121204
dtype: int64
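
The counts above show that the nulls sit in the later flag columns, and F11 is empty for almost every row, so dropping every row that contains a null would discard nearly the whole data set. A hedged alternative, sketched on a made-up frame, is to look at the share of missing values per column and at how many flag slots each row actually fills.

```python
import pandas as pd

# Made-up flag columns; None marks an unused flag slot.
df_sample = pd.DataFrame({
    "F01": ["A1", "A0", "A1"],
    "F02": ["P1", "P1", "GS"],
    "F03": [None, "PE", "P1"],
    "F04": [None, "Q1", None],
})

null_fraction = df_sample.isnull().mean()         # Share of missing values per column.
flags_per_row = df_sample.notnull().sum(axis=1)   # Number of flag slots each row fills.
print(null_fraction)
print(flags_per_row)
```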


### Fixing the data types.

I was able to parse the Start_Date column to type datetime as I read the .csv file in. <br>

For the time values, I would like them as plain datetime.time objects.

In [81]:
# From our initial inspection, we know that some of our columns are of the wrong data type.
# We want the date time to be of type date time.
df_rhessi.dtypes

Flare                         int64
Start_Date           datetime64[ns]
Start_Time                   object
Peak_Time                    object
End_Time                     object
Dur_S                         int64
Peak_c/s                      int64
Total_Counts                  int64
Energy_keV                   object
X_Pos_Arcsec                  int64
Y_Pos_Arcsec                  int64
Radial_Pos_Arcsec             int64
Active_Region                 int64
F01                          object
F02                          object
F03                          object
F04                          object
F05                          object
F06                          object
F07                          object
F08                          object
F09                          object
F10                          object
F11                          object
dtype: object

In [None]:
# https://medium.com/@vincentteyssier/optimizing-the-size-of-a-pandas-dataframe-for-low-memory-environment-5f07db3d72e
# I found that using a smaller int subtype means pandas requires less memory.
# Consuming less memory is always better!
# Parse the 'HH:MM:SS' strings and keep only the time-of-day part.
df_rhessi['Start_Time'] = pd.to_datetime(df_rhessi['Start_Time'], format='%H:%M:%S').dt.time
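
Another option is to combine the date and time columns into full timestamps, which also allows a consistency check against Dur_S. The sketch below uses the first and third rows from the `head()` output earlier in the notebook. It assumes the flare ends on the same calendar day as it starts, so flares that cross midnight would need extra handling.

```python
import pandas as pd

# First and third rows of the head() output shown earlier.
df_sample = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2002-02-12", "2002-02-13"]),
    "Start_Time": ["21:29:56", "00:53:24"],
    "End_Time": ["21:41:48", "00:57:00"],
    "Dur_S": [712, 216],
})

# pd.to_timedelta reads 'HH:MM:SS' strings as offsets from midnight.
start = df_sample["Start_Date"] + pd.to_timedelta(df_sample["Start_Time"])
end = df_sample["Start_Date"] + pd.to_timedelta(df_sample["End_Time"])
computed = (end - start).dt.total_seconds().astype(int)
print(computed.tolist(), df_sample["Dur_S"].tolist())
```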

In [None]:
# Because I know the values for IPSI and Contra range from 0 to 101, it makes sense to use int8, only consuming 1 byte of memory.
# To be safe I will cast to int and then downcast safely.
df_cvd_cleaned.IPSI = df_cvd_cleaned.IPSI.astype(int)
df_cvd_cleaned.IPSI = pd.to_numeric(df_cvd_cleaned.IPSI, downcast=('unsigned'))
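
The effect of the downcast can be measured directly. This sketch uses a synthetic Series covering 0 to 100 as a stand-in for the real column. Note that `downcast='unsigned'` picks the smallest unsigned type, so the result is `uint8` rather than `int8`; both use one byte per value.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a percentage column with values 0..100.
values = pd.Series(np.arange(0, 101), dtype="int64")
small = pd.to_numeric(values, downcast="unsigned")

# memory_usage(index=False) reports bytes used by the values only.
print(small.dtype, values.memory_usage(index=False), small.memory_usage(index=False))
```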

In [None]:
# Error when casting Contra of type Object to int, there is an empty entry.
# Find it here.  
s_empty_contra = df_cvd_cleaned[df_cvd_cleaned.Contra == " "]
s_empty_contra

In [None]:
# Drop empty Contra here.
df_cvd_cleaned = df_cvd_cleaned.drop(index=642)

In [None]:
# Cast Contra from object, to string, strip leading and trailing white spaces, cast to int, then downcast.
df_cvd_cleaned.Contra = df_cvd_cleaned.Contra.astype(str).str.strip().astype(int)
df_cvd_cleaned.Contra = pd.to_numeric(df_cvd_cleaned.Contra, downcast=('unsigned'))

In [None]:
# Convert objects to nominal category, we don't care about order. Again this saves memory. 
df_cvd_cleaned.Indication = df_cvd_cleaned.Indication.astype('category')
df_cvd_cleaned.Diabetes = df_cvd_cleaned.Diabetes.astype('category')
df_cvd_cleaned.IHD = df_cvd_cleaned.IHD.astype('category')
df_cvd_cleaned.Hypertension = df_cvd_cleaned.Hypertension.astype('category')
df_cvd_cleaned.Arrhythmia = df_cvd_cleaned.Arrhythmia.astype('category')
df_cvd_cleaned.History = df_cvd_cleaned.History.astype('category')
df_cvd_cleaned.label = df_cvd_cleaned.label.astype('category')

In [None]:
# Now our data types are more representative of the data.
df_cvd_cleaned.dtypes

## Data visualisation.

In [None]:
# I have created my own palette to clearly convey risk in red and no risk in blue.
NoRisk_Risk = ["#2F70F2", "#D83A37"]
sns.set_palette(NoRisk_Risk)
sns.palplot(sns.color_palette())

### Count plots, stacked with label.

In [None]:
plt.title("Count of Risk/NoRisk")
sns.countplot(data=df_cvd_cleaned, x='label')

In [None]:
# plt.title("Count of Diabetes")
df_plot = df_cvd_cleaned.groupby(['label', 'Diabetes']).size().reset_index().pivot(columns='label', index='Diabetes', values=0)
df_plot.plot(kind='bar', stacked=True)
#sns.countplot(data=df_cvd_cleaned, x='Diabetes')

In [None]:
# plt.title("Count of IHD")
df_plot = df_cvd_cleaned.groupby(['label', 'IHD']).size().reset_index().pivot(columns='label', index='IHD', values=0)
df_plot.plot(kind='bar', stacked=True)

In [None]:
#plt.title("Count of Hypertension")
df_plot = df_cvd_cleaned.groupby(['label', 'Hypertension']).size().reset_index().pivot(columns='label', index='Hypertension', values=0)
df_plot.plot(kind='bar', stacked=True)

In [None]:
# plt.title("Count of Arrhythmia")
df_plot = df_cvd_cleaned.groupby(['label', 'Arrhythmia']).size().reset_index().pivot(columns='label', index='Arrhythmia', values=0)
df_plot.plot(kind='bar', stacked=True)

In [None]:
# plt.title("Count of History")
df_plot = df_cvd_cleaned.groupby(['label', 'History']).size().reset_index().pivot(columns='label', index='History', values=0)
df_plot.plot(kind='bar', stacked=True)

In [None]:
# plt.title("Count of Indication")
df_plot = df_cvd_cleaned.groupby(['label', 'Indication']).size().reset_index().pivot(columns='label', index='Indication', values=0)
df_plot.plot(kind='bar', stacked=True)
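
The groupby/size/pivot chain repeated in the cells above produces a table of counts per category and label. `pd.crosstab` builds the same kind of table in one call; the sketch below uses a made-up four-row frame and leaves the plotting call as a comment.

```python
import pandas as pd

# Made-up rows with the same column names as df_cvd_cleaned.
df_sample = pd.DataFrame({
    "label": ["Risk", "NoRisk", "NoRisk", "Risk"],
    "Diabetes": ["yes", "no", "no", "no"],
})

# Rows are the feature's categories, columns are the labels, cells are counts.
counts = pd.crosstab(df_sample["Diabetes"], df_sample["label"])
print(counts)
# counts.plot(kind="bar", stacked=True) would draw the stacked bars.
```

Unlike the pivot approach, combinations that never occur show up as 0 rather than NaN.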

### Looking at IPSI and Contra

In [None]:
plt.title('IPSI vs Indication vs label')
sns.boxplot(data=df_cvd_cleaned, x='Indication', y='IPSI', hue='label')

In [None]:
plt.title('Contra vs Indication vs label')
sns.boxplot(data=df_cvd_cleaned, x='Indication', y='Contra', hue='label')

In [None]:
sns.set_context('paper', font_scale=1.5)

# Set up variables in a matrix format.
df_ipsi_contra = df_cvd_cleaned.drop(columns=['Random', 'Id', 'Indication', 'Diabetes', 'IHD', 'Hypertension', 'Arrhythmia', 'History', 'label'])
cvd_mx = df_ipsi_contra.corr() 

sns.heatmap(cvd_mx, annot=True, cmap='RdBu')
# Slightly positively correlated.

In [None]:
# Plotting IPSI against Contra, with the label being hue.
sns.set_context('paper', font_scale=1.5)
sns.jointplot(data=df_cvd_cleaned, x='IPSI', y='Contra', kind='scatter', hue='label')
# Strong positive correlation between Contra and IPSI leading to Risk.

# Encoding data

In [None]:
# One-Hot Encoding Indication.
onehot_indication = pd.get_dummies(df_cvd_cleaned.Indication, prefix='Indication')
onehot_indication

In [None]:
# The following can be dummy variables.
dummy_diabetes = pd.get_dummies(df_cvd_cleaned.Diabetes, drop_first=True)
dummy_diabetes.rename(columns={'yes' : 'Diabetes'}, inplace=True)

In [None]:
dummy_ihd = pd.get_dummies(df_cvd_cleaned.IHD, drop_first=True)
dummy_ihd.rename(columns={'yes' : 'IHD'}, inplace=True)

In [None]:
dummy_hypertension = pd.get_dummies(df_cvd_cleaned.Hypertension, drop_first=True)
dummy_hypertension.rename(columns={'yes' : 'Hypertension'}, inplace=True)

In [None]:
dummy_arrhythmia = pd.get_dummies(df_cvd_cleaned.Arrhythmia, drop_first=True)
dummy_arrhythmia.rename(columns={'yes' : 'Arrhythmia'}, inplace=True)

In [None]:
dummy_history = pd.get_dummies(df_cvd_cleaned.History, drop_first=True)
dummy_history.rename(columns={'yes' : 'History'}, inplace=True)

In [None]:
dummy_label = pd.get_dummies(df_cvd_cleaned.label, drop_first=True)
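
For a two-valued feature, `drop_first=True` keeps a single indicator column, which is why each dummy frame above is renamed to the feature's own name. A minimal sketch on a made-up yes/no Series:

```python
import pandas as pd

# Made-up binary feature.
s = pd.Series(["no", "yes", "no"], name="Diabetes")

# Categories are sorted, so 'no' is dropped and only the 'yes' column remains.
dummy = pd.get_dummies(s, drop_first=True).rename(columns={"yes": "Diabetes"})
print(dummy)
```

Depending on the pandas version, the indicator column is boolean or integer; calling `.astype(int)` gives a consistent 0/1 encoding either way.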

## Feature Discretisation - Chunking
Feature discretisation helps reduce the search space. <br>
pd.cut lets us bin the IPSI and Contra features into new Series. <br>

In [None]:
# Define bins and labels here as both features need to be chunked the same way to keep consistency.
bins_to_chunk = [0,4,9,14,19,24,29,34,39,44,49,54,59,64,69,74,79,84,89,94,99,100]
bin_labels = ['0', '5','10','15','20','25','30','35','40','45','50','55','60','65','70','75','80','85','90','95','100']

In [None]:
bin_ipsi = df_cvd_cleaned.IPSI
bin_ipsi = pd.cut(bin_ipsi, bins=bins_to_chunk, labels=bin_labels) # Bin ipsilateral.
bin_ipsi.rename('IPSI_%', inplace=True) # Name the axis.
bin_ipsi = pd.to_numeric(bin_ipsi, downcast=('unsigned')) # Downcast for efficiency.
bin_ipsi.value_counts().sort_index() # Look at bins.

In [None]:
print(bin_ipsi.isnull().sum()) # Check we haven't lost any data.

In [None]:
# Contra already follows this pattern, but binning will reduce the search space.
bin_contra = df_cvd_cleaned.Contra
bin_contra = pd.cut(bin_contra, bins=bins_to_chunk, labels=bin_labels) # Bin contralateral.
bin_contra.rename('Contra_%', inplace=True) # Name the axis.
bin_contra = pd.to_numeric(bin_contra, downcast=('unsigned')) # Downcast for efficiency.
bin_contra.value_counts().sort_index() # Look at bins.

In [None]:
print(bin_contra.isnull().sum()) # Check we haven't lost any data.
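
One detail worth checking with these bin edges: `pd.cut` uses right-closed intervals by default, so the first bin is (0, 4] and a value of exactly 0 falls outside it and becomes NaN. The sketch below rebuilds the same edges and labels and compares the default with `include_lowest=True`; the five input values are made up.

```python
import pandas as pd

# Same edges and labels as bins_to_chunk / bin_labels, built programmatically.
bins = [0] + list(range(4, 100, 5)) + [100]
labels = [str(v) for v in range(0, 101, 5)]

values = pd.Series([0, 4, 5, 75, 100])  # Made-up inputs.
default = pd.cut(values, bins=bins, labels=labels)
with_lowest = pd.cut(values, bins=bins, labels=labels, include_lowest=True)

print(default.isna().sum())   # Number of values lost with the default.
print(with_lowest.tolist())
```

If the null check above reports missing values, the left edge is a likely cause.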

## Amalgamate encoded features into a new df.
While getting the data ready for training, I am also going to drop the Random and Id columns, as they don't hold relevant information for the model. Multiple sessions could hold a pattern: they could show how early symptoms develop into a high risk of CVD over time.

In [None]:
# drop old columns.
df_cvd_encoded_set_0 = df_cvd_cleaned.drop(columns=['Random', 'Id', 'Indication', 'Diabetes', 'IHD', 'Hypertension', 'Arrhythmia', 'History', 'IPSI', 'Contra', 'label'])
# concat new encoded columns.
df_cvd_encoded_set_0 = pd.concat([df_cvd_encoded_set_0, onehot_indication, dummy_diabetes, dummy_ihd, dummy_hypertension, dummy_arrhythmia, dummy_history, bin_ipsi, bin_contra, dummy_label], axis=1)
df_cvd_encoded_set_0

### Now that the data is encoded, we can visualise some more patterns.

In [None]:
plt.figure(figsize=(10,8))
plt.title('Heatmap of all features')
sns.set_context('paper', font_scale=1.1)
sns.heatmap(data=df_cvd_encoded_set_0.corr(), annot=True)

In [None]:
sns.catplot(data=df_cvd_encoded_set_0, x='Arrhythmia', y='Contra_%', hue='Risk')
# Identified possible outlier, there is a no risk data point that has high contra and arrhythmia.

In [None]:
# Attribute access (df.Contra_%) is not valid Python because of the '%', so use bracket indexing.
df_cvd_encoded_set_0[(df_cvd_encoded_set_0['Arrhythmia'] == 1) & (df_cvd_encoded_set_0['Contra_%'] > 70)]

In [None]:
sns.catplot(data=df_cvd_encoded_set_0, x='Arrhythmia', y='IPSI_%', hue='Risk')

## Feature Selection
I am creating 3 new data sets based on learning what features can be dropped.

In [None]:
df_cvd_encoded_set_1 = df_cvd_encoded_set_0.drop(columns=['History'])
df_cvd_encoded_set_1.head(1)

In [None]:
# Based off Set 1 that already has history dropped.
df_cvd_encoded_set_2 = df_cvd_encoded_set_1.drop(columns=['Indication_A-F', 'Indication_CVA'])
df_cvd_encoded_set_2.head(1)

In [None]:
# Based off Set 1 that already has history dropped.
# Set 3 added after Random Forest feature importance.
df_cvd_encoded_set_3 = df_cvd_encoded_set_1.drop(columns=['Indication_ASX'])
df_cvd_encoded_set_3.head(1)

## Sanity checks

First I would like to manually look at a few entries and compare the raw data to the encoded data to ensure that the encoding has been done correctly. 

In [None]:
# Picked because of IPSI and Contra, wanted to ensure binning was correct. It initially wasn't.
df_cvd_encoded_set_0.loc[[922]] # A-F,no,no,yes,no,no,75.0,50 ,NoRisk

In [None]:
df_cvd_encoded_set_0.dtypes # Sanity check, ensure data is all numerical and ready for model training.

In [None]:
print(df_cvd_encoded_set_0.isnull().sum()) # Sanity check, ensure there are no nulls.

# Modelling
## Split the data

Now our data is ready, we want to create our training and testing sets. <br>

Our truth, target value y, will be Risk, as that's what we want our model to predict. <br>

Our training data, inputs x, will be everything other than Risk. <br>

We split our data 70 / 30. We train on 70% of the data and then test on 30%.



In [None]:
# Change data sets here.
df_model_data = df_cvd_encoded_set_3

In [None]:
# Create target set Y - Risk.
y = df_model_data.Risk
y # is series, 1d.

In [None]:
# Create training set x - Everything BUT Risk.
x = df_model_data.drop(columns=['Risk'])
x # is data frame, 2d.

In [None]:
# Split the data.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, shuffle=True)
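
Because the split above is shuffled without a seed, each run gives a different train/test partition, and the class balance can drift between the two sets. A hedged variant, shown on synthetic data with an assumed 80/20 class ratio, fixes `random_state` for reproducibility and uses `stratify` to keep the class proportions equal in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 20 of them in the positive class.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 80 + [1] * 20)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)
print(len(y_tr), len(y_te), y_tr.sum(), y_te.sum())
```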

## Multi Layer Perceptron (MLP) classifier.

Now that we have our data sets split, we can pass it to our model for training. <br>

In [None]:
# Create the MLP model.
# https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

# model = MLPClassifier(hidden_layer_sizes=(11,14,2), max_iter=1000, activation='relu', solver='sgd', learning_rate='adaptive') # MLP 1
# model = MLPClassifier(hidden_layer_sizes=(11,14,2), max_iter=1000, activation='identity', solver='lbfgs') # MLP 2,3,4,5
model = MLPClassifier(hidden_layer_sizes=(11,14,2), max_iter=1000, activation='tanh', solver='adam', batch_size=500, beta_1=0.8, beta_2=0.75) # MLP 6 

model.fit(x_train,y_train) # Training - Fit our data to the model.

pred_y = model.predict(x_test) # Predict. 

accuracy_score(y_test, pred_y, normalize=True)
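
Multi-layer perceptrons are sensitive to feature scaling, so it can help to standardise the inputs before training. The following is a sketch on synthetic data, not the notebook's features: it wraps a scaler and an MLP with the same hidden layer sizes in one pipeline, so the scaling statistics are learned from the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the notebook's features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The scaler is fitted on the training data, then applied to the test data.
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(11, 14, 2), max_iter=1000, random_state=0),
)
pipe.fit(X_tr, y_tr)
demo_accuracy = pipe.score(X_te, y_te)
print(demo_accuracy)
```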

## Evaluation metrics - Confusion matrix and F2 score

In [None]:
mean_squared_error(y_train, model.predict(x_train))

In [None]:
mean_absolute_error(y_train, model.predict(x_train))

In [None]:
print(classification_report(y_test, pred_y))

In [None]:
# We used our model to predict (x_test) above, now we are comparing that with the truth (y_test).
sns_plot = skplt.metrics.plot_confusion_matrix(y_test, pred_y, normalize=True) 
sns_plot.figure.savefig("Second model.png") # Save it as we will go through and change some things.
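
The section heading mentions an F2 score, which the classification report does not include. It can be computed with `fbeta_score` and `beta=2`, which weights recall more heavily than precision. A small sketch with made-up labels:

```python
from sklearn.metrics import fbeta_score

# Made-up labels: 2 true positives, 1 false negative, 2 false positives.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0]

# Precision is 0.5 and recall is 2/3, so F2 = 5 * P * R / (4 * P + R).
f2 = fbeta_score(y_true, y_pred, beta=2)
print(f2)
```

In the notebook the same call would take `y_test` and `pred_y`, assuming the positive class is encoded as 1.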

# Random Forest Classifier
https://medium.com/analytics-vidhya/evaluating-a-random-forest-model-9d165595ad56


In [None]:
# Create a list to label the feature importance.
# Need to make df with labels for each of the 4 different data sets otherwise they don't match up.
df_cvd_feature_names_set_0 = ['Indication A-F', 'Indication ASX', 'Indication CVA', 'Indication TIA', 'Diabetes', 'IHD', 'Hypertension', 'Arrhythmia', 'History', 'IPSI', 'Contra']

df_cvd_feature_names_set_3 = ['Indication A-F', 'Indication CVA', 'Indication TIA', 'Diabetes', 'IHD', 'Hypertension', 'Arrhythmia', 'IPSI', 'Contra']

In [None]:
rf = RandomForestClassifier(n_estimators=200, criterion='gini')
rf.fit( x_train, y_train ) # Use the same split data as above.
y_pred_test = rf.predict(x_test) # Predict using random forest.
rf.score( x_test, y_test ) # View accuracy score 

In [None]:
print(classification_report(y_test, y_pred_test)) # View the classification report for test data and predictions

In [None]:
# Print out the feature importance, labeled.
# Only working for data set 0 atm.
for name, score in zip( df_cvd_feature_names_set_3, rf.feature_importances_ ):
    print(name, score)
# From this I can see that ASX and History have low importance.
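
The hand-written name lists have to be kept in step with whichever data set is being modelled. When the model is fitted on a DataFrame, the column names can be taken from the training frame instead. The sketch below shows this on synthetic data with hypothetical column names f0 to f3.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
x_demo = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_demo, y)

# Pair each importance with its column name and sort, highest first.
importances = pd.Series(rf_demo.feature_importances_, index=x_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook, `x_train.columns` would play the role of `x_demo.columns`, so the labels follow the chosen data set automatically.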

In [None]:
sns_plot = skplt.metrics.plot_confusion_matrix(y_test, y_pred_test, normalize=True) 