<a href="https://colab.research.google.com/github/Lucy-code-tech/100-Days-of-Code-Data-Science/blob/main/%5BSample_Notebook%5D_AfterWork_Feature_Engineering_for_Healthcare_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Sample Notebook] AfterWork: Feature Engineering for Healthcare with Python

# Prerequisites

In [3]:
# Import pandas for data manipulation
import pandas as pd

In [2]:
# Import numpy for scientific computations
import numpy as np

# 1. Encoding Features

## 1.1 Label Encoding

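Label encoding converts categorical data into numerical format by assigning each category an integer. This is important because many machine learning algorithms require numerical input data. Because the integers imply an order that may not exist, label encoding is safest for target labels, binary features, or tree-based models; for unordered features fed to linear or distance-based models, one-hot encoding (section 1.3) is usually the better choice. For example, scikit-learn's LabelEncoder sorts the categories alphabetically, so a feature 'Color' with categories 'Blue', 'Green', and 'Red' receives the labels 0, 1, and 2 respectively.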
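
As a minimal sketch of how the labels are assigned, the example below encodes a small hypothetical 'Color' column (not part of the healthcare dataset) and shows that the mapping can be read back from the fitted encoder.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, not taken from the healthcare dataset
colors = pd.Series(['Red', 'Blue', 'Green', 'Blue'])

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; the position is the label
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original category names
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around is what makes it possible to decode predictions or apply the same mapping to new data later.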

In [None]:
# Data Importation
data = pd.read_csv("https://afterwork.ai/ds/e/healthcare_942b5.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Technique
from sklearn.preprocessing import LabelEncoder

# Create separate LabelEncoder instances for each categorical column
gender_encoder = LabelEncoder()
blood_type_encoder = LabelEncoder()
insurance_type_encoder = LabelEncoder()

# Perform label encoding for each column
data['Gender'] = gender_encoder.fit_transform(data['Gender'])
data['Blood_Type'] = blood_type_encoder.fit_transform(data['Blood_Type'])
data['Insurance_Type'] = insurance_type_encoder.fit_transform(data['Insurance_Type'])

# Display the first few rows of the dataframe
data.head()

### <font color="green">Challenge</font>

Apply label encoding to the 'Insurance_Type' column in the healthcare dataset from the URL: https://afterwork.ai/ds/ch/healthcare_kzo4n.csv. Remember to use the 'LabelEncoder' class from the 'sklearn.preprocessing' module in Python.


In [None]:
# Data Importation
data = pd.read_csv("https://afterwork.ai/ds/ch/healthcare_kzo4n.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Technique
from sklearn.preprocessing import LabelEncoder

# Write your code here


data.head()

## 1.2 Ordinal Encoding

Ordinal encoding is a method of encoding categorical variables where each unique category is assigned a unique integer value based on the order or rank of the category. This encoding is suitable for variables where the categories have a meaningful order or hierarchy.
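
The sketch below illustrates the idea on a hypothetical 'severity' column; the category names and their ranking are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy column; the category names and their order are assumptions
severity = pd.Series(['Mild', 'Severe', 'Moderate', 'Mild'])

# State the ranking explicitly so the integers follow the intended order
severity_order = {'Mild': 0, 'Moderate': 1, 'Severe': 2}
severity_encoded = severity.map(severity_order)

# Categories missing from the mapping become NaN, so count them as a check
unmapped = int(severity_encoded.isna().sum())
print(severity_encoded.tolist(), unmapped)
```

Checking for unmapped values is worthwhile because `map` silently produces NaN for any category that the dictionary does not list.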



In [None]:
# Data Importation
data = pd.read_csv("https://afterwork.ai/ds/e/healthcare_7eptm.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Technique
# One shared mapping is reused for all three columns; any category missing from it becomes NaN
ordinal_mapping = {'Former': 0, 'Current': 1, 'Negative': 0, 'Positive': 1}
data['Smoking_Status'] = data['Smoking_Status'].map(ordinal_mapping)
data['Diabetes_Status'] = data['Diabetes_Status'].map(ordinal_mapping)
data['Heart_Disease_Status'] = data['Heart_Disease_Status'].map(ordinal_mapping)

data.head()

## 1.3 One-Hot Encoding

One-hot encoding is a technique used in machine learning to convert categorical data into a numerical format. We do this by creating one binary column per category; in each row, only the column that matches the row's category has a value of 1 while the rest have a value of 0.
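
A minimal sketch on a hypothetical toy frame (not the healthcare dataset) shows the columns that `get_dummies` produces.

```python
import pandas as pd

# Hypothetical toy frame, not taken from the healthcare dataset
toy = pd.DataFrame({'Blood_Type': ['A', 'B', 'O', 'A']})

# dtype=int requests 0/1 columns (recent pandas versions default to True/False)
toy_encoded = pd.get_dummies(toy, columns=['Blood_Type'], dtype=int)
print(toy_encoded)
```

Each row of the result has exactly one 1 across the new columns, which is where the name "one-hot" comes from.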



In [None]:
# Load the dataset from the provided URL
data = pd.read_csv("https://afterwork.ai/ds/e/healthcare_fqkrh.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Technique
data_encoded = pd.get_dummies(data, columns=['Gender', 'Blood_Type', 'Diabetes', 'Hypertension', 'Heart_Disease', 'Smoker'])
data_encoded.head()

### <font color="green">Challenge</font>

Apply one hot encoding to the 'Blood Type' column in the healthcare dataset from the URL: https://afterwork.ai/ds/ch/healthcare_gzwmj.csv. Remember to use pandas get_dummies() function to create binary columns for each category.

In [None]:
# Load the dataset from the provided URL
data = pd.read_csv("https://afterwork.ai/ds/ch/healthcare_gzwmj.csv")

# Data Exploration
data.head()

In [None]:
# Apply One Hot Encoding to the 'Blood Type' column
# Write your code here


# 2. Feature Scaling

## 2.1 Standardization

Standardization rescales each independent variable or feature so that it has a mean of 0 and a standard deviation of 1, by subtracting the feature's mean and dividing by its standard deviation. This brings all the features to a similar scale, which is important for many machine learning algorithms that are sensitive to the scale of the input data.
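
To make the formula concrete, the sketch below computes the z-score by hand on a few hypothetical ages and compares the result with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages, not taken from the healthcare dataset
ages = np.array([[20.0], [30.0], [40.0], [50.0]])

# Manual z-score: subtract the mean, divide by the population standard deviation
manual = (ages - ages.mean()) / ages.std()

# StandardScaler applies the same formula to each column
scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
```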

In [None]:
# Data Importation
data = pd.read_csv("https://afterwork.ai/ds/e/healthcare_46o8b.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Technique
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data[['Age', 'Height', 'Weight', 'Glucose', 'Heart_Rate', 'Exercise_Hours']])

data[['Age', 'Height', 'Weight', 'Glucose', 'Heart_Rate', 'Exercise_Hours']] = data_scaled

data.head()

## 2.2 Min-Max Scaling

Min-Max scaling is a feature scaling technique that rescales the data to a fixed range, usually between 0 and 1. This is done by subtracting the minimum value of the feature and then dividing by the range of the feature (maximum - minimum). This concept is important because it helps to normalize the data and bring all features to a similar scale, which can improve the performance of machine learning algorithms that are sensitive to the scale of the input data.
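
The formula described above can be sketched by hand and checked against `MinMaxScaler`; the heights below are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical heights in cm, not taken from the healthcare dataset
heights = np.array([[150.0], [160.0], [170.0], [190.0]])

# Manual Min-Max scaling: (x - min) / (max - min)
manual = (heights - heights.min()) / (heights.max() - heights.min())

# MinMaxScaler applies the same formula to each column
scaled = MinMaxScaler().fit_transform(heights)

print(manual.ravel())
```

The smallest value maps to 0 and the largest to 1, so a single extreme value can compress every other value into a narrow band.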

In [None]:
# Data Importation
data = pd.read_csv("https://afterwork.ai/ds/e/healthcare_s6lr2.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Technique
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data[['Age', 'Weight', 'Height', 'Heart_Rate', 'Temperature', 'Cholesterol', 'Glucose', 'Exercise_Hours']])
data[['Age', 'Weight', 'Height', 'Heart_Rate', 'Temperature', 'Cholesterol', 'Glucose', 'Exercise_Hours']] = data_scaled

data.head()

### <font color="green">Challenge</font>

Perform Min-Max scaling on the 'Height' and 'Weight' columns of the healthcare dataset from the URL: https://afterwork.ai/ds/ch/healthcare_xdymn.csv.

In [None]:
# Data Importation
data = pd.read_csv("https://afterwork.ai/ds/ch/healthcare_xdymn.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Technique
from sklearn.preprocessing import MinMaxScaler

# Write your code here


data.head()

## 2.3 Robust Scaling

Robust scaling rescales our features in a way that is much less affected by outliers or extreme values: each feature is centered on its median and divided by its interquartile range (IQR), instead of using the mean and standard deviation. This is important because a few extreme values would otherwise dominate the scale of a feature and degrade the performance of machine learning models.
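
As a minimal sketch, the example below scales a few hypothetical heights that include one extreme value, using the median and the interquartile range (IQR), and compares the result with scikit-learn's `RobustScaler` under its default settings.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical heights in cm with one extreme value (250)
heights = np.array([[150.0], [160.0], [170.0], [180.0], [250.0]])

# Manual robust scaling: (x - median) / (75th percentile - 25th percentile)
median = np.median(heights)
q1, q3 = np.percentile(heights, [25, 75])
manual = (heights - median) / (q3 - q1)

# RobustScaler uses the median and the IQR by default
scaled = RobustScaler().fit_transform(heights)

print(manual.ravel())
```

The extreme value still ends up far from the others, but it does not change the median or the IQR much, so the remaining values keep a sensible spread.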



In [None]:
# Data Importation
data = pd.read_csv("https://afterwork.ai/ds/e/healthcare_grjf5.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Technique
from sklearn.preprocessing import RobustScaler

# Write your code here

data.head()

### <font color="green">Challenge</font>

Apply Robust Scaling to the 'Height' feature in the healthcare dataset from the URL: https://afterwork.ai/ds/ch/healthcare_bd82k.csv


In [None]:
data = pd.read_csv("https://afterwork.ai/ds/ch/healthcare_bd82k.csv")

data.head()

In [None]:
robust_scaler = RobustScaler()
data['Height'] = robust_scaler.fit_transform(data[['Height']])

data.head()

# 3. Date/time feature extraction

## 3.1 Basic Date/Time Features

Extracting basic date/time features involves extracting fundamental information from date/time data such as day of the week, month, year, hour, minute, and second. This allows us to gain insights and patterns from the temporal aspect of the data.
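
A short sketch on two hypothetical timestamps shows what the `.dt` accessor returns; the weekend flag is an illustrative derived feature, not something the dataset provides.

```python
import pandas as pd

# Hypothetical admission timestamps (a Monday and a Saturday)
admissions = pd.Series(pd.to_datetime(['2024-03-04 09:30:00', '2024-03-09 22:15:00']))

day_of_week = admissions.dt.dayofweek   # Monday is 0, Sunday is 6
day_name = admissions.dt.day_name()     # Human-readable weekday
hour = admissions.dt.hour

# Illustrative derived feature: flag admissions on Saturday or Sunday
is_weekend = day_of_week >= 5

print(day_name.tolist(), hour.tolist(), is_weekend.tolist())
```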



In [None]:
# Data Importation
data = pd.read_csv("https://afterwork.ai/ds/e/healthcare_0ikm9.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Techniques
data['DOB'] = pd.to_datetime(data['DOB'])
data['Admission_Date'] = pd.to_datetime(data['Admission_Date'])
data['Discharge_Date'] = pd.to_datetime(data['Discharge_Date'])

data['DOB_day'] = data['DOB'].dt.day
data['DOB_month'] = data['DOB'].dt.month
data['DOB_year'] = data['DOB'].dt.year

data['Admission_dayofweek'] = data['Admission_Date'].dt.dayofweek
data['Admission_month'] = data['Admission_Date'].dt.month
data['Admission_year'] = data['Admission_Date'].dt.year

data['Discharge_hour'] = data['Discharge_Date'].dt.hour
data['Discharge_minute'] = data['Discharge_Date'].dt.minute
data['Discharge_second'] = data['Discharge_Date'].dt.second

data.head()

## 3.2 Calculating Elapsed Time

Calculating elapsed time in healthcare data analysis involves determining the time that has passed between two specific dates or times. This concept is important for tracking patient wait times, medication administration intervals, or the duration of a medical procedure.
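
Subtracting two datetime columns yields a Timedelta column, which usually needs converting to a plain number before modelling. The sketch below shows two common conversions on hypothetical medication timestamps.

```python
import pandas as pd

# Hypothetical medication intervals, not taken from the healthcare dataset
meds = pd.DataFrame({
    'Medication_Start': pd.to_datetime(['2024-01-01 08:00', '2024-01-05 12:00']),
    'Medication_End': pd.to_datetime(['2024-01-03 20:00', '2024-01-06 00:00']),
})

elapsed = meds['Medication_End'] - meds['Medication_Start']

# Whole days only (the fractional part is dropped)
meds['Elapsed_Days'] = elapsed.dt.days

# Fractional days, derived from the total number of seconds
meds['Elapsed_Days_Exact'] = elapsed.dt.total_seconds() / 86400

print(meds[['Elapsed_Days', 'Elapsed_Days_Exact']])
```

Which conversion is appropriate depends on whether partial days matter for the analysis.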



In [None]:
# Data Importation
df = pd.read_csv("https://afterwork.ai/ds/e/healthcare_5r3du.csv")
df.head()

In [None]:
# Datetime conversions
df['Admission_Date'] = pd.to_datetime(df['Admission_Date'])
df['Discharge_Date'] = pd.to_datetime(df['Discharge_Date'])
df['Medication_Start'] = pd.to_datetime(df['Medication_Start'])
df['Medication_End'] = pd.to_datetime(df['Medication_End'])

# Feature Engineering Technique
df['Medication_Elapsed_Time'] = df['Medication_End'] - df['Medication_Start']

df.head()

### <font color="green">Challenge</font>

Calculate the average length of stay (in days) for patients in the healthcare dataset from the URL: https://afterwork.ai/ds/ch/healthcare_rlev5.csv. Remember to calculate the elapsed time by subtracting the Admission Date from the Discharge Date for each patient.

In [None]:
# Data Importation
df = pd.read_csv("https://afterwork.ai/ds/ch/healthcare_rlev5.csv")
df.head()

In [None]:
# Datetime conversions
# Write your code here


# Calculate Length of Stay
# Write your code here

# Calculate Average Length of Stay
average_length_of_stay = df['Length of Stay'].mean()

print("Average Length of Stay (in days) for patients: ", average_length_of_stay)

# 4. Feature Construction

## 4.1 Feature Construction

Feature construction involves creating new features from existing data to improve the performance of machine learning models. We construct features by combining, transforming, or extracting information from the original data. This process is important because it can help us uncover hidden patterns, reduce noise, and enhance the predictive power of our models.



In [None]:
# Read the dataset from the URL
data = pd.read_csv("https://afterwork.ai/ds/e/healthcare_phc4t.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Technique
data['BMI'] = data['Weight'] / ((data['Height']/100) ** 2)
data['Risk_Factor'] = data['Cholesterol_Level'] * data['Heart_Rate']

data.head()

### <font color="green">Challenge</font>

Create new features by calculating the Body Mass Index (BMI) for each patient in the healthcare dataset from the URL: https://afterwork.ai/ds/ch/healthcare_rx4fo.csv. Remember that BMI is calculated as weight (kg) divided by height squared (m^2).


In [None]:
# Read the dataset from the URL
data = pd.read_csv("https://afterwork.ai/ds/ch/healthcare_rx4fo.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Technique
# Write your code here


data.head()

# 5. Reducing Skewness

## 5.1 Log Transformation


Log transformation is a technique used to reduce right (positive) skewness in data by applying the natural logarithm to each data point, which requires the values to be strictly positive. This helps to make the data more normally distributed, which can improve the performance of machine learning models that assume normality in the data distribution.
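
The effect can be checked numerically. The sketch below uses a small hypothetical right-skewed sample and compares the skewness before and after the transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values with a long upper tail
values = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 50.0, 200.0])

# The natural log compresses large values far more than small ones
log_values = np.log(values)

skew_before = values.skew()
skew_after = log_values.skew()
print(skew_before, skew_after)
```

If a column can contain zeros, `np.log1p` (which computes log(1 + x)) is a common alternative, since the plain logarithm of 0 is undefined.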



In [None]:
# Data Importation
data = pd.read_csv("https://afterwork.ai/ds/e/healthcare_n0rg6.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Technique
# np.log is vectorized, so it can be applied to a whole column at once
data['Log_Age'] = np.log(data['Age'])
data['Log_Weight'] = np.log(data['Weight'])
data['Log_Height'] = np.log(data['Height'])
# Blood_Pressure is a 'systolic/diastolic' string; take the mean of the two logs
# (parentheses added so that the sum, not only the second term, is divided by 2)
data['Log_Blood_Pressure'] = data['Blood_Pressure'].apply(lambda x: (np.log(int(x.split('/')[0])) + np.log(int(x.split('/')[1]))) / 2)
data['Log_Cholesterol'] = np.log(data['Cholesterol'])
data['Log_Glucose'] = np.log(data['Glucose'])
data['Log_Heart_Rate'] = np.log(data['Heart_Rate'])
data['Log_Body_Temperature'] = np.log(data['Body_Temperature'])
data['Log_Respiratory_Rate'] = np.log(data['Respiratory_Rate'])

In [None]:
# Display Distributions
data.hist(figsize=(15, 10));

## 5.2 Box-Cox Transformation

Box-Cox Transformation is a statistical technique that allows us to normalize non-normal data by applying a power transformation. This is important because many statistical methods assume that the data is normally distributed, and the Box-Cox Transformation helps us meet this assumption.

To apply the Box-Cox Transformation, we first identify the lambda parameter that maximizes the log-likelihood function. Then, we use this lambda value to transform each value x as (x^lambda - 1) / lambda when lambda is not 0, or as log(x) when lambda is 0. The data must be strictly positive. This step helps us achieve a more normal distribution in our data, making it suitable for further analysis.
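
As a minimal sketch, the example below fits lambda on a few hypothetical weights with `scipy.stats.boxcox` and then reproduces the transformed values by hand using the Box-Cox formula, (x^lambda - 1) / lambda, with log(x) as the special case for lambda equal to 0.

```python
import numpy as np
from scipy import stats

# Hypothetical strictly positive, right-skewed weights
weights = np.array([50.0, 55.0, 60.0, 62.0, 70.0, 85.0, 120.0])

# boxcox returns the transformed values and the fitted lambda
transformed, fitted_lambda = stats.boxcox(weights)

# Reproduce the transform by hand with the Box-Cox formula
if fitted_lambda == 0:
    manual = np.log(weights)
else:
    manual = (weights ** fitted_lambda - 1) / fitted_lambda

print(fitted_lambda, np.allclose(manual, transformed))
```

Keeping the fitted lambda is useful when the same transformation has to be applied to new data or inverted later.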

In [None]:
# Data Importation
df = pd.read_csv("https://afterwork.ai/ds/e/healthcare_rels7.txt")

# Data Exploration
df.head()

In [None]:
# Feature Engineering Technique
from scipy import stats

# Apply Box-Cox Transformation to 'Weight' column
df['Weight_BoxCox'], _ = stats.boxcox(df['Weight'])

df[['Weight', 'Weight_BoxCox']]

In [None]:
df.hist(column='Weight');

In [None]:
df.hist(column='Weight_BoxCox');

### <font color="green">Challenge</font>

Apply the Box-Cox Transformation to the 'Glucose_Level' data from the dataset located at https://afterwork.ai/ds/ch/healthcare_hdu0o.csv. Remember that the data must be strictly positive, and that the lambda parameter is chosen to maximize the log-likelihood function before the power transformation is applied.


In [None]:
# Data Importation
df = pd.read_csv("https://afterwork.ai/ds/ch/healthcare_hdu0o.csv")

# Data Exploration
df.head()

In [None]:
# Apply Box-Cox Transformation to 'Glucose_Level' column
# Write your code here


In [None]:
# Write your code here


In [None]:
# Write your code here


# 6. Feature Selection

## 6.1 Recursive Feature Elimination

Recursive Feature Elimination (RFE) is a feature selection technique that repeatedly fits a model, ranks the features by importance (for example, by coefficients or feature importances), and removes the least important ones until the desired number of features remains. This process helps to identify the most important features in a dataset.
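
The sketch below runs RFE on synthetic data generated with `make_classification` (not the healthcare dataset) to show how the selection mask and the ranking can be read from the fitted selector. The sample size, feature counts, and seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 6 features, 3 of which carry signal (arbitrary settings)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Drop one feature per round until 3 remain
estimator = RandomForestClassifier(n_estimators=50, random_state=0)
selector = RFE(estimator, n_features_to_select=3)
selector.fit(X_toy, y_toy)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features
print(selector.support_)
print(selector.ranking_)
```

Features eliminated earlier receive higher ranking numbers, so `ranking_` also gives a rough ordering of the discarded features.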

In [None]:
# Data Importation
data = pd.read_csv("https://afterwork.ai/ds/e/healthcare_minvr.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Technique
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

X = data.drop(['Patient_ID', 'Gender', 'Blood_Pressure', 'Diabetes_Status', 'Height', 'Cholesterol_Level'], axis=1)
y = data['Diabetes_Status']

model = RandomForestClassifier(random_state=42)  # fixed seed (arbitrary value) so the selection is reproducible
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, y)

print("Selected Features:")
print(X.columns[fit.support_])

# 7. Dimensionality Reduction

## 7.1 Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a technique used for dimensionality reduction in which we transform our data into a new coordinate system whose axes (the principal components) are ordered by how much variance they capture, and then keep only the first few components. This is important because it helps us deal with the curse of dimensionality, reduce computational complexity, and improve model performance by compressing correlated features into fewer components. Because PCA is driven by variance, features on very different scales are usually standardized first.
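
As a minimal sketch, the example below standardizes some randomly generated data (not the healthcare dataset), reduces it to three components, and inspects how much variance each component retains. The data shape and the seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Randomly generated data: 100 rows, 5 features (arbitrary shape and seed)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))

# Standardize first so that no feature dominates because of its scale
X_scaled = StandardScaler().fit_transform(X_toy)

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by each kept component
ratios = pca.explained_variance_ratio_
print(X_reduced.shape, ratios, ratios.sum())
```

The sum of `explained_variance_ratio_` indicates how much information the reduced representation keeps, which is one way to choose the number of components.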



In [None]:
# Read the dataset from the URL
data = pd.read_csv("https://afterwork.ai/ds/e/healthcare_y2kpc.csv")

# Data Exploration
data.head()

In [None]:
# Feature Engineering Technique
from sklearn.decomposition import PCA

# Separate the features from the ID column (PCA is unsupervised, so y is not used in the fit)
X = data.drop(['ID', 'Blood Pressure'], axis=1)
y = data['ID']

# Apply PCA
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

# Display the transformed data
print(X_pca)

### <font color="green">Challenge</font>

Perform Principal Component Analysis (PCA) on the healthcare dataset from the URL: https://afterwork.ai/ds/ch/healthcare_l31hx.csv to reduce the dimensionality of the data. Remember that PCA is used to transform the data into a new coordinate system to retain the most important information while reducing the number of features.

In [None]:
# Read the dataset from the URL
data = pd.read_csv("https://afterwork.ai/ds/ch/healthcare_l31hx.csv")

# Data Exploration
data.head()

In [None]:
# Data Preparation
from sklearn.preprocessing import LabelEncoder

# Create separate LabelEncoder instances for each categorical column
gender_encoder = LabelEncoder()
cholesterol_encoder = LabelEncoder()
smoking_encoder = LabelEncoder()

# Perform label encoding for each column
data['Gender'] = gender_encoder.fit_transform(data['Gender'])
data['Cholesterol'] = cholesterol_encoder.fit_transform(data['Cholesterol'])
data['Smoking Status'] = smoking_encoder.fit_transform(data['Smoking Status'])

# Display the first few rows of the dataframe
data.head()

In [None]:
# Separate the features from the ID column (PCA is unsupervised, so y is not used in the fit)
X = data.drop(['ID', 'Blood Pressure', 'Cholesterol'], axis=1)
y = data['ID']

# Apply PCA
# Write your code here


# Display the transformed data
# Write your code here
