
# [Practice Notebook] AfterWork: Handling Missing Data for Healthcare with Python Course

# 1. Finding Missing Data

In [None]:
# Import the Pandas library
import pandas as pd

## 1.1 Finding Missing Values

Handling missing values in a dataset is crucial for accurate analysis and modeling. Missing values can skew results and lead to incorrect conclusions. We need to identify missing values, decide how to handle them (e.g. impute or remove), and then proceed with our analysis.



In [None]:
# Load the dataset
df = pd.read_csv("https://afterwork.ai/ds/e/healthcare_and_pharmaceuticals_uotn.csv")

# Check for missing values
missing_values = df.isnull().sum()
missing_values

**Explanation**: We load a dataset from a URL using pd.read_csv() and store it in a DataFrame called df. We check for missing values in the dataset by using df.isnull().sum() and store the count of missing values in a variable called missing_values. We then print out the missing values count.
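
Raw counts are easier to judge next to the share of each column that is missing. The sketch below uses a small made-up DataFrame (the column names and values are assumptions, not the course dataset) to show the count per column, the percentage per column, and the number of rows with at least one gap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the course dataset
toy = pd.DataFrame({
    "Age": [34, np.nan, 51, 29],
    "Income": [52000, 61000, np.nan, np.nan],
    "Gender": ["F", "M", None, "F"],
})

missing_counts = toy.isnull().sum()                  # missing cells per column
missing_share = toy.isnull().mean() * 100            # percentage missing per column
rows_with_missing = toy.isnull().any(axis=1).sum()   # rows with at least one gap

print(missing_counts)
print(missing_share.round(1))
print(rows_with_missing)
```

The percentage view helps when deciding between dropping and imputing: a column that is mostly empty is a different problem from one with a handful of gaps.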

### <font color="green">Challenge</font>

Identify missing values in the healthcare and pharmaceuticals dataset from https://afterwork.ai/ds/ch/healthcare_and_pharmaceuticals_gnsc.csv.

In [None]:
# Write your code here


# 2. Dropping Missing Data

## 2.1 Drop Rows With Any Missing Values

Dropping rows with any missing values in Pandas allows us to remove observations with missing data from our dataset. This is important because missing data can lead to biased or inaccurate analysis results. To drop rows with any missing values, we can use the dropna() method with the how='any' parameter.


In [None]:
# Load the dataset
df = pd.read_csv("https://afterwork.ai/ds/e/healthcare_and_pharmaceuticals_2zbd1.csv")

# Check for missing values
missing_values = df.isnull().sum()
missing_values

In [None]:
# Drop rows with any missing values
df.dropna(how='any', inplace=True)

# Check for missing values
missing_values = df.isnull().sum()
missing_values

**Explanation**: We load a dataset from a URL into a DataFrame using pd.read_csv(). Next, we drop rows with any missing values from the DataFrame using the dropna() method with the how='any' parameter. Because inplace=True is set, df itself is modified. Finally, we recount the missing values to confirm that none remain.
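
To see what how='any' does compared with the alternatives, here is a minimal sketch on a made-up four-row DataFrame (the data is hypothetical). It also shows the subset parameter, which restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Age": [34, np.nan, 51, np.nan],
    "Income": [52000, 61000, np.nan, np.nan],
})

dropped_any = toy.dropna(how="any")          # keeps only fully observed rows
dropped_all = toy.dropna(how="all")          # drops only rows where every value is missing
dropped_subset = toy.dropna(subset=["Age"])  # only 'Age' decides which rows are dropped

print(len(toy), len(dropped_any), len(dropped_all), len(dropped_subset))
```

Without inplace=True, dropna() returns a new DataFrame and leaves the original untouched, which makes it easy to compare how many rows each option costs.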

### <font color="green">Challenge</font>

Drop rows with any missing values in the healthcare and pharmaceuticals dataset from https://afterwork.ai/ds/ch/healthcare_and_pharmaceuticals_12z8.csv using Pandas.

In [None]:
# Write your code here


## 2.2 Dropping Columns With Any Missing Values

Calling dropna() with axis=1 drops every column that contains at least one missing value. This is useful when a feature is too incomplete to be reliable, but it also discards columns that are missing only a few values, so check the missing counts before dropping.



In [None]:
# Load the dataset
df = pd.read_csv("https://afterwork.ai/ds/e/healthcare_and_pharmaceuticals_0fyxm.csv")

# Check for missing values
missing_values = df.isnull().sum()
missing_values

In [None]:
# Drop columns with missing values
df.dropna(axis=1, inplace=True)

# Check for missing values
missing_values = df.isnull().sum()
missing_values
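
The following sketch, on made-up data, contrasts three ways of removing columns: dropping every column with any gap, dropping only sparse columns through the thresh parameter, and dropping a column by name. The 50% cut-off is an arbitrary assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Patient": ["A", "B", "C", "D"],
    "Age": [34, np.nan, 51, 29],                 # 1 of 4 values missing
    "Duration": [np.nan, np.nan, np.nan, 12],    # 3 of 4 values missing
})

# Drop every column that has at least one missing value
drop_any_missing = toy.dropna(axis=1)

# Keep only columns with at least 50% observed values (assumed cut-off)
min_observed = int(0.5 * len(toy))
drop_sparse = toy.dropna(axis=1, thresh=min_observed)

# Drop a specific column by name
drop_named = toy.drop(columns=["Duration"])

print(list(drop_any_missing.columns))
print(list(drop_sparse.columns))
print(list(drop_named.columns))
```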

### <font color="green">Challenge</font>

Remove the 'Duration' column from the dataset located at https://afterwork.ai/ds/ch/healthcare_and_pharmaceuticals_zgjs.csv.

In [None]:
# Write your code here


In [None]:
# Write your code here


# 3. Replacing Missing Values

## 3.1 Arbitrary Value Imputation

Arbitrary value imputation is a technique used to replace missing values in a dataset with a specific value that is chosen arbitrarily. This can be a constant value, the mean, median, or mode of the column, or any other value that makes sense in the context of the data.



In [None]:
# Load the dataset
df = pd.read_csv("https://afterwork.ai/ds/e/healthcare_and_pharmaceuticals_c5mxu.csv")

# Check for missing values
missing_values = df.isnull().sum()
missing_values

In [None]:
# Replace empty values with an arbitrary value of 0
df.fillna(0, inplace=True)

# Check for missing values
missing_values = df.isnull().sum()
missing_values
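
A single value such as 0 rarely suits every column, especially when numeric and text columns are mixed. As a minimal sketch on made-up data, fillna() also accepts a dictionary that maps each column to its own placeholder. The placeholder values -1 and "Unknown" are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one text column
toy = pd.DataFrame({
    "Income": [52000.0, np.nan, 61000.0],
    "Education": ["BSc", None, "MSc"],
})

# One placeholder per column: a sentinel number for 'Income', a label for 'Education'
placeholders = {"Income": -1, "Education": "Unknown"}
filled = toy.fillna(value=placeholders)

print(filled)
```

A sentinel outside the valid range (such as -1 for income) keeps the imputed rows easy to identify later, but it will distort statistics such as the mean unless those rows are excluded.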

### <font color="green">Challenge</font>

In the dataset from https://afterwork.ai/ds/ch/healthcare_and_pharmaceuticals_5fbg.csv, there are missing values in the 'Income' and 'Education' columns. Implement arbitrary value imputation by replacing the missing values in the 'Income' column with the mean of the column and in the 'Education' column with the mode of the column.

In [None]:
# Write your code here


In [None]:
# Write your code here


## 3.2 Forward Fill Replacement

Forward fill replacement is a method used to replace missing values in a dataset with the last known value in the column. This is particularly useful when dealing with time series data where missing values can be filled in with the most recent observation.



In [None]:
# Load the dataset
df = pd.read_csv("https://afterwork.ai/ds/ch/healthcare_and_pharmaceuticals_5fbg.csv")

# Check for missing values
missing_values = df.isnull().sum()
missing_values

In [None]:
# Perform forward fill replacement
# (fillna(method='ffill') is deprecated in recent pandas releases; ffill() is the direct equivalent)
df.ffill(inplace=True)

# Check for missing values
missing_values = df.isnull().sum()
missing_values

**Explanation**: We call the ffill() method to perform forward fill replacement on the DataFrame df. Each missing value is replaced with the last known value above it in the same column. A missing value in the first row has no earlier value to copy, so it stays missing.
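
The behaviour is easiest to see on a few rows. This sketch uses a made-up dosing table (hypothetical data) and DataFrame.ffill(); it also shows the limit parameter, which caps how many consecutive gaps are filled from one observation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data ordered by day
visits = pd.DataFrame({
    "Day": [1, 2, 3, 4, 5],
    "Dose": [np.nan, 10.0, np.nan, np.nan, 20.0],
})

filled = visits.ffill()          # Dose becomes NaN, 10, 10, 10, 20
limited = visits.ffill(limit=1)  # Dose becomes NaN, 10, 10, NaN, 20

print(filled)
print(limited)
```

The first row stays missing in both results because nothing precedes it. Forward fill also depends on row order, so sort the data (for example by date) before applying it.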

### <font color="green">Challenge</font>

Replace missing values in the 'End Date' column of the healthcare and pharmaceuticals dataset from the URL: https://afterwork.ai/ds/ch/healthcare_and_pharmaceuticals_m5lg.csv using forward fill replacement. Remember to fill in missing values with the most recent observation in the column.

In [None]:
# Write your code here


In [None]:
# Write your code here


## 3.3 Backward Fill Replacement

Backward fill replacement is a technique used to replace missing values in a dataset with the next available value in the column. This method is useful when a later observation is a better stand-in for the gap than an earlier one. A missing value in the last row has no later value to copy, so it stays missing.



In [None]:
# Load the dataset
df = pd.read_csv("https://afterwork.ai/ds/ch/healthcare_and_pharmaceuticals_m5lg.csv")

# Check for missing values
missing_values = df.isnull().sum()
missing_values

In [None]:
# Perform backward fill replacement
# (fillna(method='bfill') is deprecated in recent pandas releases; bfill() is the direct equivalent)
df.bfill(inplace=True)

# Check for missing values
missing_values = df.isnull().sum()
missing_values
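
To make the direction of the fill concrete, the sketch below applies ffill() and bfill() to the same made-up Series (hypothetical values) and places the results side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
dose = pd.Series([np.nan, 10.0, np.nan, np.nan, 20.0, np.nan], name="Dose")

forward = dose.ffill()    # copies the previous known value downwards
backward = dose.bfill()   # copies the next known value upwards

comparison = pd.DataFrame({"original": dose, "ffill": forward, "bfill": backward})
print(comparison)
```

Backward fill leaves the trailing gap empty and forward fill leaves the leading gap empty, so a column can still contain missing values after either method.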

### <font color="green">Challenge</font>

Replace missing values in the 'Age' and 'Income' columns of the dataset from the URL: https://afterwork.ai/ds/ch/healthcare_and_pharmaceuticals_5fbg.csv using the backward fill replacement technique. Remember to fill missing values with the next known value in the column.

In [None]:
# Write your code here


In [None]:
# Write your code here


# 4. Mean/Median/Mode Imputation

In [None]:
from sklearn.impute import SimpleImputer

## 4.1 Mean Imputation

Mean imputation is a method of handling missing data by replacing the missing values with the mean of the non-missing values in the same column. This keeps the column mean unchanged, but it shrinks the variance and can weaken correlations with other columns, so it is best suited to columns with only a small share of missing values.



In [None]:
# Load the dataset
df = pd.read_csv("https://afterwork.ai/ds/e/healthcare_and_pharmaceuticals_owfhv.csv")

# Check for missing values
missing_values = df.isnull().sum()
missing_values

In [None]:
# Initialize SimpleImputer with strategy as 'mean'
imputer = SimpleImputer(strategy='mean')

# Apply the imputer to fill missing values in the 'Age' column
df['Age'] = imputer.fit_transform(df[['Age']])

# Check for missing values
missing_values = df.isnull().sum()
missing_values
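
As a quick check of what mean imputation does to a column, this sketch uses a made-up 'Age' column (hypothetical values) and compares the mean and standard deviation before and after imputation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data
ages = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 50.0, np.nan, 60.0]})

imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(
    imputer.fit_transform(ages), columns=ages.columns, index=ages.index
)

mean_before, mean_after = ages["Age"].mean(), imputed["Age"].mean()
std_before, std_after = ages["Age"].std(), imputed["Age"].std()

print(imputer.statistics_)       # the value used to fill the gaps
print(mean_before, mean_after)   # the mean is unchanged
print(std_before, std_after)     # the spread is smaller after imputation
```

The mean is preserved, while the standard deviation drops because every filled value sits exactly at the centre of the data.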

### <font color="green">Challenge</font>

Perform mean imputation using SimpleImputer on the 'Income' column of the dataset from the URL: https://afterwork.ai/ds/ch/healthcare_and_pharmaceuticals_khzi.csv. Remember to replace missing values with the mean value of the column.

In [None]:
# Write your code here


In [None]:
# Write your code here


## 4.2 Median Imputation

Median imputation using SimpleImputer is a method of filling in missing values in a dataset by replacing them with the median value of the respective column. This approach is useful when the data is skewed or contains outliers, because the median is then a more representative measure of central tendency than the mean.



In [None]:
# Load the dataset
df = pd.read_csv("https://afterwork.ai/ds/e/healthcare_and_pharmaceuticals_owfhv.csv")

# Check for missing values
missing_values = df.isnull().sum()
missing_values

In [None]:
# Initialize SimpleImputer with strategy as 'median'
imputer = SimpleImputer(strategy='median')

# Apply the imputer to fill missing values in the 'Age' column
df['Age'] = imputer.fit_transform(df[['Age']])

# Check for missing values
missing_values = df.isnull().sum()
missing_values
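
The effect of an outlier on the fill value can be seen in a small sketch. The income figures below are made up (in thousands), with one extreme value; the fitted statistics_ attribute of SimpleImputer holds the value each strategy would use.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one extreme income
incomes = pd.DataFrame({"Income": [30.0, 32.0, 35.0, np.nan, 400.0]})

mean_fill = SimpleImputer(strategy="mean").fit(incomes).statistics_[0]
median_fill = SimpleImputer(strategy="median").fit(incomes).statistics_[0]

print(mean_fill)    # pulled upwards by the outlier
print(median_fill)  # stays close to the typical incomes
```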

### <font color="green">Challenge</font>

Perform median imputation using SimpleImputer on the 'Income' column of the dataset from the URL: https://afterwork.ai/ds/ch/healthcare_and_pharmaceuticals_khzi.csv. Remember to replace missing values with the median value of the column.

In [None]:
# Write your code here


In [None]:
# Write your code here


## 4.3 Mode Imputation

Mode imputation fills in missing values by replacing them with the most frequent value in the column. With SimpleImputer, this is the 'most_frequent' strategy. It is useful when dealing with categorical data, where the mean and median are not defined.



In [None]:
# Load the dataset
df = pd.read_csv("https://afterwork.ai/ds/e/healthcare_and_pharmaceuticals_jtrx8.csv")

# Check for missing values
missing_values = df.isnull().sum()
missing_values

In [None]:
# Initialize SimpleImputer with strategy as 'most_frequent' for mode imputation
imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform the data with the imputer
df['Income'] = imputer.fit_transform(df[['Income']])

# Check for missing values
missing_values = df.isnull().sum()
missing_values
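
Since mode imputation is mainly aimed at categorical data, here is a minimal sketch on a made-up text column (the 'BloodType' column and its values are assumptions). The missing entry is written as np.nan, which is the default missing-value marker that SimpleImputer looks for.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: a categorical column stored as Python objects
toy = pd.DataFrame({"BloodType": ["O", "A", np.nan, "O", "B"]}, dtype=object)

imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a 2D array; ravel() flattens it to one value per row
toy["BloodType"] = imputer.fit_transform(toy[["BloodType"]]).ravel()

print(toy["BloodType"].tolist())
```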

### <font color="green">Challenge</font>

Perform mode imputation using SimpleImputer to fill in missing values in the 'Income' column of the dataset from the URL: https://afterwork.ai/ds/ch/healthcare_and_pharmaceuticals_4tog.csv. Remember that mode imputation replaces missing values with the most frequent value in a column.

In [None]:
# Write your code here


In [None]:
# Write your code here


# 5. Machine Learning for Missing Data

## 5.1 K-Nearest Neighbours Imputation

K-Nearest Neighbours Imputation is a machine learning technique used to fill in missing values in a dataset based on the values of its nearest neighbors. This approach calculates the distance between data points and uses the values of the k nearest neighbors to impute missing values.



In [None]:
# Load the dataset
df = pd.read_csv("https://afterwork.ai/ds/e/healthcare_and_pharmaceuticals_zulr.csv")

# Check for missing values
missing_values = df.isnull().sum()
missing_values

In [None]:
from sklearn.impute import KNNImputer

# Initialize KNNImputer
imputer = KNNImputer(n_neighbors=2)

# Impute missing values
df_imputed = imputer.fit_transform(df[['Age', 'Income']])

# Update the original dataset with imputed values
df[['Age', 'Income']] = df_imputed

# Check for missing values
missing_values = df.isnull().sum()
missing_values
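
A small worked example helps to see where the imputed number comes from. In this sketch the data is made up: the last row is missing 'Income', and its two nearest neighbours by 'Age' are the rows with ages 30 and 25, so the imputed income is the average of their incomes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data; the last row has no 'Income'
toy = pd.DataFrame({
    "Age": [25.0, 30.0, 50.0, 28.0],
    "Income": [30.0, 40.0, 90.0, np.nan],
})

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Nearest neighbours of age 28 are ages 30 and 25, so Income = (40 + 30) / 2
print(imputed.loc[3, "Income"])
```

Distances are computed on the raw values, so columns measured on very different scales can dominate the neighbour search. Scaling the features first is a common precaution.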

### <font color="green">Challenge</font>

Use the dataset from the URL: https://afterwork.ai/ds/ch/healthcare_and_pharmaceuticals_8fdh.csv. Implement K-Nearest Neighbours Imputation to fill in missing values in the 'Income' column.  

In [None]:
# Write your code here


In [None]:
# Write your code here


## 5.3 Decision Tree Imputation

Decision tree imputation is a machine learning technique used to fill in missing values in a dataset by predicting the missing values based on other variables in the dataset. This approach is important because it allows us to retain valuable data that would otherwise be lost due to missing values, leading to more accurate and reliable analysis results.



In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Load the dataset
df = pd.read_csv("https://afterwork.ai/ds/e/healthcare_and_pharmaceuticals_vcy7u.csv")

# Check for missing values
missing_values = df.isnull().sum()
missing_values

In [None]:
# Identify the numeric columns
numeric_cols = df.select_dtypes(include='number').columns.tolist()

# Keep only the numeric columns
numeric_data = df[numeric_cols]

# Identify the numerical variables with missing values
numeric_missing_cols = numeric_data.columns[numeric_data.isnull().any()].tolist()

# Split the data into complete and incomplete datasets
complete_data = numeric_data.dropna()
incomplete_data = numeric_data[numeric_data.isnull().any(axis=1)]

# Choose the appropriate decision tree algorithm
dt = DecisionTreeRegressor()

# Train the decision tree model
X = complete_data.drop(numeric_missing_cols, axis=1)
y = complete_data[numeric_missing_cols]
dt.fit(X, y)

# Predict the missing values
X_pred = incomplete_data.drop(numeric_missing_cols, axis=1)
y_pred = dt.predict(X_pred)

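# Fill in the missing values, selecting rows from the numeric columns only so the
# row count matches y_pred even when non-numeric columns also contain missing values
rows_to_fill = numeric_data.isnull().any(axis=1)
df.loc[rows_to_fill, numeric_missing_cols] = y_pred.reshape(len(incomplete_data), -1)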

# Check for missing values
missing_values = df.isnull().sum()
missing_values

**Explanation**: We import the required modules from the scikit-learn library and load the dataset from a URL into a pandas DataFrame. We keep only the numeric columns, identify which of them contain missing values, and split the rows into a complete dataset (no missing values) and an incomplete dataset (at least one missing value). We then train a Decision Tree Regressor on the complete dataset, using the fully observed numeric columns as predictors and the columns with missing values as targets. We use the trained model to predict the targets for the incomplete rows and write the predictions back into the original DataFrame. Finally, we check for any remaining missing values. The output is the count of missing values per column, which should now be zero for the numeric columns.

### <font color="green">Challenge</font>

Apply the Decision Tree Imputer to fill in the missing values in the dataset located at https://bit.ly/49mNlOn.

In [None]:
# Write your code here


## 5.4 Gradient Boosting Imputation

Gradient Boosting Imputation is a machine learning technique that we use to impute missing values in a dataset. We train a gradient boosting model on the non-missing values and use it to predict the missing values. This approach takes into account the relationships between the features in the dataset, resulting in more accurate imputations.

In [None]:
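# IterativeImputer is experimental and must be enabled explicitly before it is imported.
# HistGradientBoostingRegressor is stable since scikit-learn 1.0 and needs no enabling import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401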
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.impute import IterativeImputer

# Load the dataset
df = pd.read_csv("https://afterwork.ai/ds/e/healthcare_and_pharmaceuticals_vcy7u.csv")

# Check for missing values
missing_values = df.isnull().sum()
missing_values

In [None]:
# Initialize the Gradient Boosting Imputer
imputer = IterativeImputer(estimator=HistGradientBoostingRegressor())

# Drop non-numeric columns for imputation
data_numeric = df.select_dtypes(include=['float64', 'int64'])

# Fit the imputer on the dataset
imputer.fit(data_numeric)

# Impute missing values
imputed_data = imputer.transform(data_numeric)

# Convert back to DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=data_numeric.columns)

# Check for missing values
missing_values = imputed_df.isnull().sum()
missing_values

### <font color="green">Challenge</font>

Implement Gradient Boosting Imputation to fill in the missing values in the 'healthcare_and_pharmaceuticals_f4u3.csv' dataset from the URL: https://afterwork.ai/ds/ch/healthcare_and_pharmaceuticals_f4u3.csv.

In [None]:
# Write your code here


In [None]:
# Write your code here


# 6. Evaluating the Effect of Handling Missing Data

Evaluating the performance of imputation is crucial to ensure that the imputed data are reliable and accurate. This can be done by comparing the statistical properties of the original and imputed data. To evaluate the performance of imputation, we first calculate the mean, median, mode, variance, and other statistical properties of the original and imputed data using the appropriate functions in Pandas and NumPy. Then, we compare these properties to see if there are any significant differences.

If the mean values for the original and imputed data are the same, it suggests that the imputation strategy did not significantly alter the central tendency of the data. This can be seen as a desirable outcome in the context of missing data imputation because it means that the imputed values are consistent with the rest of the data.

However, it's important to note that assessing imputation solely based on the means may not capture potential differences in the overall distribution of the data. Consider conducting a more comprehensive analysis, such as examining the variance, skewness, and other statistical properties, and visualizing the data distribution to ensure that imputation did not introduce any unintended biases.

In [None]:
# Load the dataset
df = pd.read_csv('https://bit.ly/43MOXjp')

# Preview missing values
df.isnull().sum()

In [None]:
# Select only the numeric columns
df_numeric = df.select_dtypes(include='number')

# Create a copy for original data comparison
df_original = df_numeric.copy()

# Impute missing values with the median
imputer = SimpleImputer(strategy='median')
df_imputed = imputer.fit_transform(df_numeric)
df_imputed = pd.DataFrame(df_imputed, columns=df_numeric.columns, index=df_numeric.index)

# Replace the numeric columns in the original DataFrame with the imputed values
df[df_numeric.columns] = df_imputed

# Preview missing values
df.isnull().sum()

In [None]:
# Calculate statistical properties for original and imputed data
original_statistics = df_original.describe()
imputed_statistics = df_imputed.describe()

# Print statistical properties for both original and imputed data
print("Original Data Statistics:")
print(original_statistics)
print("\nImputed Data Statistics:")
print(imputed_statistics)
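
describe() does not report variance or skewness, which the text above recommends checking. As a minimal sketch on a made-up column (hypothetical values, median imputation assumed), the statistics can be placed side by side with their change:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two gaps and one large value
original = pd.Series([20.0, 30.0, np.nan, 50.0, np.nan, 60.0, 200.0], name="Age")
imputed = original.fillna(original.median())

stats = ["mean", "median", "var", "skew"]
comparison = pd.DataFrame({
    "original": original.agg(stats),
    "imputed": imputed.agg(stats),
})
comparison["change"] = comparison["imputed"] - comparison["original"]

print(comparison)
```

Here the median is unchanged, while the mean and the variance both fall, because the filled values sit below the mean and add no spread. A histogram of the original and imputed columns is a useful complement to the table.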

### <font color="green">Challenge</font>

Using the dataset from https://bit.ly/3TIvdbP, impute the missing values in the 'Experience' column using the mean, median, or mode, and then calculate the statistical properties for the original and imputed data. Compare these properties to see if there are any significant differences.

In [None]:
# Load the dataset
df = pd.read_csv('https://bit.ly/3TIvdbP')

# Preview missing values
df.isnull().sum()

In [None]:
# Create a copy for original data comparison
df_original = df.copy()
# Write your code here

# Impute missing values in 'Experience' column with mean


# Preview missing values


In [None]:
# Impute missing values in 'Experience' column with mode


# Preview missing values


In [None]:
# Calculate and compare statistical properties

# Print statistical properties for original and imputed data
