# Ensemble Learning

## Imports and Setting up the Kaggle API
### Create a .env File and Set KAGGLE_USERNAME and KAGGLE_KEY to Your Kaggle Username and API Key
### Example:
KAGGLE_KEY=API_KEY
KAGGLE_USERNAME=USERNAME

load_dotenv reads the .env file and sets each key-value pair as an environment variable for the running Python process

In [71]:
import os
from dotenv import load_dotenv
# Load the credentials before importing kaggle, which looks for them when it is imported
load_dotenv()
import kaggle
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import matplotlib.pyplot as plt
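
If either variable is missing from the .env file, the failure tends to show up later as an authentication error that is harder to read. The sketch below is one way to check for the two variable names used above. The helper `missing_kaggle_vars` and the constant `REQUIRED_VARS` are hypothetical names introduced here for illustration, not part of the kaggle or python-dotenv packages.

```python
import os

# The two variable names set in the .env file above.
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    # Default to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Demonstration with explicit dictionaries instead of the real environment.
print(missing_kaggle_vars({"KAGGLE_USERNAME": "USERNAME", "KAGGLE_KEY": "API_KEY"}))  # []
print(missing_kaggle_vars({"KAGGLE_USERNAME": "USERNAME"}))  # ['KAGGLE_KEY']
```

Calling `missing_kaggle_vars()` with no argument right after `load_dotenv()` checks the actual environment. An empty list means both values were picked up.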



Setting up the API instance and downloading the dataset

In [72]:
apiInstance=kaggle.KaggleApi()
apiInstance.dataset_download_files('fedesoriano/stroke-prediction-dataset', unzip=True)

Dataset URL: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset


## Preprocessing


In [73]:
strokeData=pd.read_csv('healthcare-dataset-stroke-data.csv')
#strokeData.info()
strokeDataFeatures=strokeData.iloc[:,1:-1].values
#iloc[rows,columns]: a bare : on the rows has no lower or upper bound, so it selects every row
#1:-1 on the columns runs from index 1 (skipping the first column) up to, but not including, the last column
#the first column is the ID, which has no predictive power and could lead a learner to fit spurious patterns to it
#the last column is excluded because it holds the labels, not features
strokeDataLabels=strokeData.iloc[:,-1].values
#-1 on the columns selects only the last column, which holds the labels
print(strokeDataFeatures)

[['Male' 67.0 0 ... 228.69 36.6 'formerly smoked']
 ['Female' 61.0 0 ... 202.21 nan 'never smoked']
 ['Male' 80.0 0 ... 105.92 32.5 'never smoked']
 ...
 ['Female' 35.0 0 ... 82.99 30.6 'never smoked']
 ['Male' 51.0 0 ... 166.29 25.6 'formerly smoked']
 ['Female' 44.0 0 ... 85.28 26.2 'Unknown']]
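
The slicing in the cell above can be tried on a small frame first. This is a minimal sketch on a toy DataFrame whose values are made up for illustration. It only mirrors the layout of the stroke data, with an id column first and the label column last.

```python
import pandas as pd

# Toy frame with the same layout: id first, label last (values are made up).
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, None, 32.5],
    "stroke": [1, 1, 0],
})

# All rows; columns from index 1 up to, but not including, the last one.
features = toy.iloc[:, 1:-1].values
# All rows; only the last column.
labels = toy.iloc[:, -1].values

print(features.shape)  # (3, 2)
print(labels)          # [1 1 0]
```

The id and stroke columns are both absent from `features`, which is the same effect the cell above has on the full dataset.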


In [79]:
print(strokeData.isnull().any())
#bmi is the only column with NaNs
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(strategy='mean')
#bmi is at index 8 of the feature array (the ninth column, since the id column was dropped)
#slicing with 8:9 rather than 8 keeps the selection two-dimensional, as the imputer expects
imputer.fit(strokeDataFeatures[:,8:9])
strokeDataFeatures[:,8:9]=imputer.transform(strokeDataFeatures[:,8:9])
#print the imputed bmi column to confirm the NaNs were replaced
print(strokeDataFeatures[:,8:9])


id                   False
gender               False
age                  False
hypertension         False
heart_disease        False
ever_married         False
work_type            False
Residence_type       False
avg_glucose_level    False
bmi                   True
smoking_status       False
stroke               False
dtype: bool
[[36.6]
 [28.893236911794666]
 [32.5]
 ...
 [30.6]
 [25.6]
 [26.2]]
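
With `strategy='mean'`, the imputer replaces each missing entry with the mean of the observed values in that column, which is why the second row above now shows 28.89. The following is a minimal NumPy sketch of the same idea on a made-up column. It is meant to illustrate the arithmetic, not to reproduce scikit-learn's implementation.

```python
import numpy as np

# Made-up BMI values with one missing entry (not taken from the dataset).
bmi = np.array([[36.6], [np.nan], [32.5], [30.6]])

# Mean over the observed values only; NaNs are ignored.
column_mean = np.nanmean(bmi, axis=0)

# Keep observed values and substitute the column mean where a value is missing.
filled = np.where(np.isnan(bmi), column_mean, bmi)

print(filled.ravel())
```

The missing entry becomes the average of 36.6, 32.5, and 30.6, and the observed values are left unchanged.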
