# Stroke Thrombolysis Dataset: Logistic Regression Exercise (Solution)

The data loaded in this exercise covers seven acute stroke units and records whether each patient received clot-busting treatment (thrombolysis) for stroke. There are many features; a description of each can be found in the file stroke_data_feature_descriptions.csv.

Train a Logistic Regression model to try to predict whether or not a stroke patient receives clot-busting treatment.  Use the prompts below to write each section of code.

What do you conclude are the most important features for predicting whether a patient receives clot-busting treatment? Can you improve accuracy by changing the size of your train / test split? If you have time, consider dropping some features from your data based on your outputs (in the same way you dropped passengerID in the Titanic example). If you make changes like that, remember to rerun all subsequent cells.

In [1]:
import pandas as pd
import numpy as np
# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Download data
# (not required if running locally and have previously downloaded data)

download_required = True

if download_required:

    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '2004_titanic/master/jupyter_notebooks/data/hsma_stroke.csv'
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data to data subfolder
    data.to_csv(data_directory + 'hsma_stroke.csv', index=False)

# Load data
data = pd.read_csv('data/hsma_stroke.csv')
# Make all data 'float' type
data = data.astype(float)
# Show data
data.head()

index,Clotbuster given,Hosp_1,Hosp_2,Hosp_3,Hosp_4,Hosp_5,Hosp_6,Hosp_7,Male,Age,...,S2NihssArrivalFacialPalsy,S2NihssArrivalMotorArmLeft,S2NihssArrivalMotorArmRight,S2NihssArrivalMotorLegLeft,S2NihssArrivalMotorLegRight,S2NihssArrivalLimbAtaxia,S2NihssArrivalSensory,S2NihssArrivalBestLanguage,S2NihssArrivalDysarthria,S2NihssArrivalExtinctionInattention
0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,63.0,...,3.0,4.0,0.0,4.0,0.0,0.0,0.0,0.0,1.0,1.0
1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,85.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,91.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,90.0,...,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,69.0,...,2.0,0.0,4.0,1.0,4.0,0.0,1.0,2.0,2.0,1.0
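
The `download_required` flag in the first cell has to be switched by hand. One possible alternative, sketched here under the assumption that a missing local file is the only reason to download, is to test whether the file already exists. The helper name `download_required_for` is hypothetical, and the demonstration writes to a temporary folder rather than to `./data/`.

```python
from pathlib import Path
import tempfile

def download_required_for(local_path):
    """Return True when no local copy of the file exists yet."""
    return not Path(local_path).is_file()

# Demonstration in a temporary folder, so no real data file is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / 'hsma_stroke.csv'
    before = download_required_for(demo_file)   # no file yet
    demo_file.write_text('Clotbuster given,Age\n1.0,63.0\n')
    after = download_required_for(demo_file)    # file now present

print(before, after)
```

In the notebook, the result of such a check could be assigned to `download_required` in place of the hard-coded `True`.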


In [2]:
# Look at overview of data
data.describe()

statistic,Clotbuster given,Hosp_1,Hosp_2,Hosp_3,Hosp_4,Hosp_5,Hosp_6,Hosp_7,Male,Age,...,S2NihssArrivalFacialPalsy,S2NihssArrivalMotorArmLeft,S2NihssArrivalMotorArmRight,S2NihssArrivalMotorLegLeft,S2NihssArrivalMotorLegRight,S2NihssArrivalLimbAtaxia,S2NihssArrivalSensory,S2NihssArrivalBestLanguage,S2NihssArrivalDysarthria,S2NihssArrivalExtinctionInattention
count,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,...,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0
mean,0.40333,0.159506,0.14232,0.154672,0.165414,0.055854,0.113319,0.208915,0.515575,74.553706,...,1.11493,1.002148,0.96348,0.96348,0.910849,0.216971,0.610097,0.944146,0.739527,0.566595
std,0.490698,0.366246,0.349472,0.361689,0.371653,0.229701,0.317068,0.406643,0.499892,12.280576,...,0.930527,1.479211,1.441594,1.406501,1.380606,0.522643,0.771932,1.121379,0.731083,0.794
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,76.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,83.0,...,2.0,2.0,2.0,2.0,2.0,0.0,1.0,2.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,100.0,...,3.0,4.0,4.0,4.0,4.0,2.0,2.0,3.0,2.0,2.0
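
The mean of `Clotbuster given` in the overview above (0.40333) is the proportion of patients who received treatment. That gives a useful reference point for any later accuracy figure: a model that always predicts the more common outcome would already be right for the majority of patients. A small sketch of that arithmetic, using the value copied from the table:

```python
# Proportion of patients given a clot-buster, copied from the describe() output
proportion_given = 0.40333

# Accuracy obtained by always predicting the majority class ('not given')
baseline_accuracy = max(proportion_given, 1 - proportion_given)

print(f'Baseline accuracy = {baseline_accuracy:.3f}')
```

A trained model is only adding value to the extent that its test accuracy exceeds this baseline.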


In [3]:
# Look at mean feature values for those who were given a clotbuster vs those
# that weren't
mask = data['Clotbuster given'] == 1
given = data[mask]

mask = data['Clotbuster given'] == 0
not_given = data[mask]

summary = pd.DataFrame()
summary['given'] = given.mean()
summary['not given'] = not_given.mean()

summary

feature,given,not given
Clotbuster given,1.0,0.0
Hosp_1,0.203728,0.129613
Hosp_2,0.122503,0.155716
Hosp_3,0.182423,0.135914
Hosp_4,0.13715,0.184518
Hosp_5,0.067909,0.047705
Hosp_6,0.123835,0.106211
Hosp_7,0.16245,0.240324
Male,0.515313,0.515752
Age,73.303595,75.39874
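
The same comparison can also be produced in a single step with `groupby`. The sketch below uses a tiny made-up DataFrame (the values are illustrative and are not taken from the stroke data), and adds a difference column to make the gaps between groups easier to scan.

```python
import pandas as pd

# Tiny illustrative DataFrame; values are made up, not from hsma_stroke.csv
toy = pd.DataFrame({
    'Clotbuster given': [1.0, 1.0, 0.0, 0.0],
    'Age': [60.0, 70.0, 80.0, 90.0],
    'Male': [1.0, 0.0, 1.0, 1.0],
})

# Mean of every feature within each label group, transposed so that
# features are rows and the two groups are columns
group_means = toy.groupby('Clotbuster given').mean().T
group_means = group_means.rename(columns={1.0: 'given', 0.0: 'not given'})

# Difference in means highlights features that separate the groups
group_means['difference'] = group_means['given'] - group_means['not given']

print(group_means)
```

Note that a raw difference in means depends on each feature's units, so it is a quick screening aid rather than a measure of importance.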


In [4]:
# Divide into features and labels
X = data.drop('Clotbuster given', axis=1)
y = data['Clotbuster given']

In [5]:
# Divide into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
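
Because `train_test_split` shuffles the rows randomly, the split (and therefore the accuracies reported later) will differ slightly on each run. A minimal sketch of one way to make the split repeatable and keep the class balance equal in both sets, using synthetic data as a stand-in for the stroke data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 3 features, 40% positive labels
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 80 + [0] * 120)

# random_state makes the split repeatable; stratify keeps the class
# balance the same in the training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

print(f'Positive rate, train: {y_tr.mean():.2f}')
print(f'Positive rate, test:  {y_te.mean():.2f}')
```

Whether to fix the seed is a design choice: a fixed seed helps when comparing code changes, while varying it shows how much the results depend on the particular split.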

In [6]:
# Standardise data
def standardise_data(X_train, X_test):

    # Initialise a new scaling object for normalising input data
    sc = StandardScaler()

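    # Fit the scaler on the training set only, then apply the same
    # transformation to the test set (refitting on the test set would
    # put the two sets on different scales)
    train_std = sc.fit_transform(X_train)
    test_std = sc.transform(X_test)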

    return train_std, test_std

X_train_std, X_test_std = standardise_data(X_train, X_test)
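
Scaling parameters are normally learned from the training set alone and then reused for the test set, so that both sets share one scale and no information from the test set leaks into preprocessing. A small sketch with made-up numbers (not from the stroke dataset) shows how the two approaches differ:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data (not from the stroke dataset)
train = np.array([[60.0], [70.0], [80.0]])
test = np.array([[90.0], [100.0]])

sc = StandardScaler()
train_std = sc.fit_transform(train)   # learn mean and std from training data
test_std = sc.transform(test)         # reuse the training mean and std

# Refitting on the test set instead would centre it on its own mean,
# hiding the fact that the test values are larger than the training ones
test_refit = StandardScaler().fit_transform(test)

print(test_std.ravel())
print(test_refit.ravel())
```

With the training statistics reused, both test values come out well above zero, reflecting that they sit above the training mean. After refitting, they are centred on zero and that information is lost.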

In [7]:
# Fit (train) Logistic Regression model
model = LogisticRegression()
model.fit(X_train_std, y_train)

In [8]:
# Predict training and test labels, and calculate accuracy
y_pred_train = model.predict(X_train_std)
y_pred_test = model.predict(X_test_std)

accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)

print(f'Accuracy of predicting training data = {accuracy_train}')
print(f'Accuracy of predicting test data = {accuracy_test}')

Accuracy of predicting training data = 0.8202005730659025
Accuracy of predicting test data = 0.7939914163090128
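
The introduction asks whether changing the train / test split size improves accuracy. A single split is noisy, so one way to explore this is to repeat the split several times for each size and compare the averages. The sketch below does this on synthetic data from `make_classification` (an assumption for illustration; it is not the stroke data), with the scaler and model wrapped in a pipeline so that scaling is always fitted on the training portion only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stroke data (assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0)

results = {}
for test_size in [0.1, 0.25, 0.5]:
    scores = []
    # Repeat each split several times to average out the random shuffling
    for seed in range(10):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=test_size, random_state=seed)
        # The pipeline fits the scaler on the training data only
        model_demo = make_pipeline(StandardScaler(), LogisticRegression())
        model_demo.fit(X_tr, y_tr)
        scores.append(model_demo.score(X_te, y_te))
    results[test_size] = (np.mean(scores), np.std(scores))

for test_size, (mean_acc, std_acc) in results.items():
    print(f'test_size={test_size}: accuracy {mean_acc:.3f} +/- {std_acc:.3f}')
```

If the differences between split sizes are smaller than the spread across repeats, the split size is probably not what limits accuracy.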


In [9]:
# Examine feature weights and sort by most influential
co_eff = model.coef_[0]

co_eff_df = pd.DataFrame()
co_eff_df['feature'] = list(X)
co_eff_df['co_eff'] = co_eff
co_eff_df['abs_co_eff'] = np.abs(co_eff)
co_eff_df.sort_values(by='abs_co_eff', ascending=False, inplace=True)

co_eff_df

index,feature,co_eff,abs_co_eff
32,Stroke Type_I,1.121156,1.121156
33,Stroke Type_PIH,-1.121156,1.121156
28,Stroke severity group_2. Minor,-0.709358,0.709358
29,Stroke severity group_3. Moderate,0.594452,0.594452
35,S2NihssArrival,-0.468261,0.468261
34,S2RankinBeforeStroke,-0.458139,0.458139
47,S2NihssArrivalBestLanguage,0.407136,0.407136
8,Age,-0.351431,0.351431
10,Onset Time Known Type_BE,-0.311359,0.311359
17,Atrial Fib,-0.304925,0.304925
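
Because the features were standardised before fitting, each coefficient is the change in the log-odds of receiving treatment for a one standard deviation increase in that feature. Exponentiating a coefficient gives an odds ratio, which can be easier to read. The sketch below uses three coefficients copied from the table above; the helper name `odds_ratio` is hypothetical.

```python
import math

# Coefficients copied from the table above
coefficients = {
    'Stroke Type_I': 1.121156,
    'Stroke severity group_2. Minor': -0.709358,
    'Age': -0.351431,
}

def odds_ratio(coefficient):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(coefficient)

for feature, coefficient in coefficients.items():
    # Values above 1 raise the odds of treatment; values below 1 lower them
    print(f'{feature}: odds ratio = {odds_ratio(coefficient):.2f}')
```

Note also that `Stroke Type_I` and `Stroke Type_PIH` have equal and opposite coefficients, which is what would be expected if the two columns are complementary one-hot encodings of the same information. If so, one of them is a natural candidate to drop when experimenting with removing features.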
