# Data preprocessing

These are the factors used to predict a stroke:<br>
A: gender (0: male, 1: female)<br>
B: age<br>
C: hypertension (0: doesn't have, 1: has)<br>
D: heart_disease (0: doesn't have, 1: has)<br>
E: ever_married (0: no, 1: yes)<br>
   work_type is ignored<br>
F: Residence_type (0: Rural, 1: Urban)<br>
G: avg_glucose_level<br>
H: bmi<br>
I: smoking_status (0: never smoked, 1: formerly smoked, 2: smokes); "Unknown" is ignored<br>
J: stroke (0: no stroke, 1: has a stroke)<br>
<br>
"ignored" means the data is taken out of the dataset: the work_type column is not used, and rows with an "Unknown" smoking status are dropped. Rows with gender "Other" or a missing bmi are dropped as well.<br>

In [391]:
import pandas as pd
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

gender = pd.Series(df["gender"])
gender = gender.replace({"Male":0, "Female":1,"Other":pd.NA})
age = pd.Series(df["age"])
hypertension = pd.Series(df["hypertension"])
heart_disease = pd.Series(df["heart_disease"])
ever_married = pd.Series(df["ever_married"])
ever_married = ever_married.replace({"No":0, "Yes":1})
residence = pd.Series(df["Residence_type"])
residence = residence.replace({"Rural":0,"Urban":1})
avg_glucose_level = pd.Series(df["avg_glucose_level"])
bmi = pd.Series(df["bmi"])
bmi = bmi.replace("N/A",pd.NA)
smoking_status = pd.Series(df["smoking_status"])
smoking_status = smoking_status.replace({"never smoked":0,"formerly smoked":1,"smokes":2,"Unknown":pd.NA})
stroke = pd.Series(df["stroke"])

dataset = pd.DataFrame({"Gender": gender, "Age": age, "Hypertension": hypertension, "Heart_Disease": heart_disease, "Ever_Married": ever_married, "Residence": residence, "Avg_Glucose_Level": avg_glucose_level, "BMI": bmi, "Smoking_Status": smoking_status, "Stroke": stroke})
dataset = dataset.dropna(how='any')
dataset = dataset.sample(frac=1)
dataset.reset_index(drop=True, inplace=True)
dataset.to_csv("FinalDataset.csv", index=False)
dataset



Unnamed: 0,Gender,Age,Hypertension,Heart_Disease,Ever_Married,Residence,Avg_Glucose_Level,BMI,Smoking_Status,Stroke
0,0,56.0,1,0,1,0,249.31,35.8,0,1
1,0,27.0,0,0,0,0,63.53,26.9,0,0
2,1,28.0,0,0,1,0,95.52,28.9,0,0
3,0,63.0,0,0,1,1,66.13,46.2,0,0
4,1,43.0,0,0,1,1,75.77,20.4,1,0
...,...,...,...,...,...,...,...,...,...,...
3420,1,64.0,0,1,1,0,114.71,30.6,0,0
3421,1,29.0,0,0,1,1,112.08,27.4,0,0
3422,1,34.0,0,0,1,1,113.26,27.6,0,0
3423,1,23.0,0,0,0,1,105.28,27.1,1,0


The cleaned dataset contains 3425 patients. After shuffling the rows, the first 3200 rows are used as training data, the next 220 rows as validation data, and the last 5 rows are held out as the patients we are trying to predict.

In [392]:
x = dataset.iloc[:,:-1]
y= dataset.iloc[:,-1]
x_train = x.iloc[:3200]
y_train = y.iloc[:3200]
x_valid = x.iloc[3200:3420]
y_valid = y.iloc[3200:3420]
x_test = x.iloc[3420:]
y_test = y.iloc[3420:]
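
Because `dataset.sample(frac=1)` draws a new random order on every run, the rows that land in each split change between runs. The following is a minimal sketch of a repeatable version, assuming a fixed `random_state` is acceptable. The toy frame, the seed value 0, and the helper name `split_rows` are illustrative and not part of the notebook.

```python
import pandas as pd


def split_rows(frame, n_train, n_valid, seed=0):
    """Shuffle rows with a fixed seed, then cut train/validation/test blocks."""
    shuffled = frame.sample(frac=1, random_state=seed).reset_index(drop=True)
    features = shuffled.iloc[:, :-1]  # every column except the last
    target = shuffled.iloc[:, -1]     # the last column is the label
    train_end = n_train
    valid_end = n_train + n_valid
    return (
        (features.iloc[:train_end], target.iloc[:train_end]),
        (features.iloc[train_end:valid_end], target.iloc[train_end:valid_end]),
        (features.iloc[valid_end:], target.iloc[valid_end:]),
    )


# Toy stand-in for the cleaned dataset: 10 rows, label in the last column.
toy = pd.DataFrame({"Age": range(10), "BMI": range(10, 20), "Stroke": [0, 1] * 5})
train, valid, test = split_rows(toy, n_train=6, n_valid=3)
print(len(train[0]), len(valid[0]), len(test[0]))  # 6 3 1
```

With the notebook's sizes, the call would be `split_rows(dataset, n_train=3200, n_valid=220)`, which leaves the remaining rows as the test block.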


# Train

Train our Logistic Regression without regularization

In [393]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty="none")  # scikit-learn >= 1.2 spells this penalty=None
model.fit(x_train,y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
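
The warning above points at two remedies: raising `max_iter` or scaling the features. The following is a minimal sketch of both, assuming scikit-learn's `StandardScaler` and `make_pipeline`. The synthetic data, the inflated first column, and `max_iter=1000` are illustrative choices, and the estimator here keeps the default penalty rather than the notebook's unregularized setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:, 0] = X[:, 0] * 50 + 100  # one column on a much larger scale than the rest

# Noisy binary label that depends on the first two columns.
signal = (X[:, 0] - 100) / 50 + X[:, 1] + rng.normal(scale=0.5, size=500)
y = (signal > 0).astype(int)

# Standardize every column, then fit with a higher iteration limit.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X, y)

n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print(f"lbfgs iterations used: {n_iter}")
print(f"training accuracy: {scaled_model.score(X, y):.3f}")
```

Putting the scaler inside the pipeline means the same scaling learned on the training rows is reused whenever the pipeline predicts on validation or test rows.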


Train our Logistic Regression with l2 regularization

In [394]:
model_l2 = LogisticRegression(penalty='l2', C=0.001,dual=True,solver="liblinear")
model_l2.fit(x_train,y_train)



Train our Logistic Regression with l1 regularization

In [395]:
model_l1 = LogisticRegression(penalty='l1',C=0.001, solver = "liblinear")
model_l1.fit(x_train,y_train)

# Evaluation

In [396]:
print(f"Training accuracy of model: {model.score(x_train,y_train)*100:.2f}%")
print(f"Training accuracy of model_l2: {model_l2.score(x_train,y_train)*100:.2f}%")
print(f"Training accuracy of model_l1: {model_l1.score(x_train,y_train)*100:.2f}%")

Training accuracy of model: 94.81%
Training accuracy of model_l2: 94.81%
Training accuracy of model_l1: 94.81%
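
All three models report the same training score, so it is worth comparing that score against a trivial classifier that always predicts the more common class. If the two match, accuracy alone is not separating the models. The following is a minimal sketch in plain Python. The toy labels and the helper names are illustrative and not taken from the notebook.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def recall_positive(y_true, y_pred):
    """Share of actual positives (label 1) that were predicted as 1."""
    hits = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(hits) / len(hits) if hits else 0.0


# Toy imbalanced labels: 19 negatives and 1 positive.
y_true = [0] * 19 + [1]
y_pred = [0] * 20  # a model that never predicts a stroke

print(f"baseline accuracy: {majority_baseline_accuracy(y_true):.2f}")  # 0.95
print(f"recall on positives: {recall_positive(y_true, y_pred):.2f}")   # 0.00
```

On the notebook's data, the same baseline could be read off with `y_train.value_counts(normalize=True).max()`.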


# Validation

In [397]:

y_noReg_pred = pd.Series(model.predict(x_valid))
y_l2Reg_pred = pd.Series(model_l2.predict(x_valid))
y_l1Reg_pred = pd.Series(model_l1.predict(x_valid))
count_non = 0
count_l2 = 0
count_l1 = 0
leng = len(y_noReg_pred)
for i in range(leng):
    if y_noReg_pred.iloc[i] == y_valid.iloc[i]:
        count_non+=1
    if y_l2Reg_pred.iloc[i] == y_valid.iloc[i]:
        count_l2+=1
    if y_l1Reg_pred.iloc[i] == y_valid.iloc[i]:
        count_l1+=1
        
        
print(f"The accuracy of the model with no regularization on the validation data is: {count_non}/{leng}. {(count_non/leng)*100:.2f}%")
print(f"The accuracy of the model with l2 regularization on the validation data is: {count_l2}/{leng}. {(count_l2/leng)*100:.2f}%")
print(f"The accuracy of the model with l1 regularization on the validation data is: {count_l1}/{leng}. {(count_l1/leng)*100:.2f}%")

The accuracy of the model with no regularization on the validation data is: 207/220. 94.09%
The accuracy of the model with l2 regularization on the validation data is: 207/220. 94.09%
The accuracy of the model with l1 regularization on the validation data is: 207/220. 94.09%
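
The counting loop above can also be written as a single vectorized comparison. The following is a minimal sketch, assuming the predictions and labels are available as NumPy arrays. The toy arrays are illustrative.

```python
import numpy as np

y_true = np.array([0, 0, 1, 0, 1, 0])
y_pred = np.array([0, 0, 0, 0, 1, 0])

correct = int((y_pred == y_true).sum())  # number of positions that match
total = len(y_true)
print(f"{correct}/{total}. {correct / total * 100:.2f}%")  # 5/6. 83.33%
```

With the notebook's objects, `model.score(x_valid, y_valid)` returns the same fraction directly.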


# Final testing

In [398]:
y_final_pred = pd.Series(model.predict(x_test))
y_final_prob = pd.DataFrame(model.predict_proba(x_test))
leng = len(y_final_pred)

for i in range(leng):
    print(f"For patient #{i+1},",end=" ")
    print(f"we predict a {y_final_prob.iloc[i,1]*100:.2f}% probability of stroke,",end=" ")
    if y_final_pred.iloc[i] == 0:
        print("so the predicted label is no stroke.",end=" ")
    elif y_final_pred.iloc[i] == 1:
        print("so the predicted label is stroke.",end=" ")
    if y_final_pred.iloc[i]==y_test.iloc[i]:
        print("The prediction is correct.")
    else:
        print("The prediction is incorrect.")

For patient #1, we predict a 10.21% probability of stroke, so the predicted label is no stroke. The prediction is correct.
For patient #2, we predict a 0.51% probability of stroke, so the predicted label is no stroke. The prediction is correct.
For patient #3, we predict a 0.71% probability of stroke, so the predicted label is no stroke. The prediction is correct.
For patient #4, we predict a 0.66% probability of stroke, so the predicted label is no stroke. The prediction is correct.
For patient #5, we predict a 24.98% probability of stroke, so the predicted label is no stroke. The prediction is incorrect.
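
`predict` labels a patient as a stroke case only when the predicted probability exceeds 0.5, which is why patient #5 (24.98%) is labeled as no stroke. The following is a minimal sketch of applying a different cut-off to the probabilities printed above. The 0.2 threshold is purely illustrative, the helper name `classify` is hypothetical, and the true labels are inferred from the correctness messages in that output.

```python
# Stroke probabilities copied from the output above, as fractions.
probabilities = [0.1021, 0.0051, 0.0071, 0.0066, 0.2498]
# Inferred: patients 1-4 were correctly predicted as 0, patient 5 was wrongly predicted as 0.
true_labels = [0, 0, 0, 0, 1]


def classify(probs, threshold):
    """Label a patient 1 when the stroke probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]


for threshold in (0.5, 0.2):
    predicted = classify(probabilities, threshold)
    hits = sum(p == t for p, t in zip(predicted, true_labels))
    print(f"threshold {threshold}: predictions {predicted}, {hits}/5 correct")
```

A lower threshold generally also produces more false alarms, so any cut-off would need to be chosen on the validation rows rather than on these five test patients.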
