# Checkpoint 2 - Missing Data Analysis

## Update and Checkpoint
1. The data were analysed in SPSS to evaluate the nature of the missing data using Little's MCAR test.
2. Totals of missing data were:
    2a) Cases = 175 (11.1%)
    2b) Questions = 71 (51%)
    2c) Responses =
3. Little's MCAR test was significant, indicating that the data are not missing completely at random (MCAR): χ²(1762) = 2084.73, p < 0.001.
4. Therefore, as more than 5% of cases had missing data, multiple imputation was performed in Python.
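
The totals in step 2 can be reproduced in pandas. The sketch below is illustrative only: the function name `summarise_missing` and the toy data are assumptions, not part of the original analysis (which was run in SPSS). It counts missing data at the level of cases (rows), questions (columns), and responses (cells).

```python
import numpy as np
import pandas as pd


def summarise_missing(df: pd.DataFrame) -> dict:
    """Count missing data by case (row), question (column), and response (cell).

    Hypothetical helper for illustration; returns (count, percentage) pairs.
    """
    n_cases, n_questions = df.shape
    missing = df.isna()
    cases_missing = int(missing.any(axis=1).sum())      # rows with at least one missing value
    questions_missing = int(missing.any(axis=0).sum())  # columns with at least one missing value
    responses_missing = int(missing.to_numpy().sum())   # individual missing cells
    return {
        "cases": (cases_missing, 100 * cases_missing / n_cases),
        "questions": (questions_missing, 100 * questions_missing / n_questions),
        "responses": (responses_missing, 100 * responses_missing / df.size),
    }


# Toy data for illustration only (not the study data)
toy = pd.DataFrame({
    "q1": [1.0, np.nan, 3.0, 4.0],
    "q2": [1.0, 2.0, 3.0, 4.0],
    "q3": [np.nan, np.nan, 3.0, 4.0],
})
summary = summarise_missing(toy)
for level, (count, pct) in summary.items():
    print(f"{level}: {count} ({pct:.1f}%)")
```

Applied to the loaded study data frame, the same function would report the three levels in one pass, which is one way to fill in the response-level total.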


# Checkpoint 3 - Multiple Imputation

1. Multiple imputation of missing data is performed with the miceforest package.
2. A total of 5 MICE iterations are run on each of 4 imputed datasets; one completed dataset is then extracted for analysis.
3. Imputed values are checked visually and with descriptive statistics to ensure they fall within an appropriate range.
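
The range check in step 3 can also be done programmatically. The following is a minimal sketch, assuming the pre-imputation and completed data frames share the same index and columns; the helper name `out_of_range_imputations` and the toy data are hypothetical. For each column it counts imputed values that fall outside the observed minimum and maximum.

```python
import numpy as np
import pandas as pd


def out_of_range_imputations(original: pd.DataFrame, completed: pd.DataFrame) -> pd.DataFrame:
    """Count imputed values lying outside the observed range of each column.

    Hypothetical helper: `original` holds the data before imputation (with NaN),
    `completed` holds the same rows and columns after imputation.
    """
    rows = []
    for col in original.columns:
        was_missing = original[col].isna()
        if not was_missing.any():
            continue  # nothing was imputed in this column
        observed = original.loc[~was_missing, col]
        imputed = completed.loc[was_missing, col]
        outside = (imputed < observed.min()) | (imputed > observed.max())
        rows.append({
            "column": col,
            "n_imputed": int(was_missing.sum()),
            "observed_min": observed.min(),
            "observed_max": observed.max(),
            "n_outside_range": int(outside.sum()),
        })
    return pd.DataFrame(rows)


# Toy data for illustration only (not the study data)
before = pd.DataFrame({"q1": [1.0, np.nan, 3.0, 5.0], "q2": [np.nan, 2.0, 4.0, 4.0]})
after = pd.DataFrame({"q1": [1.0, 2.0, 3.0, 5.0], "q2": [9.0, 2.0, 4.0, 4.0]})
report = out_of_range_imputations(before, after)
print(report)
```

A non-zero `n_outside_range` flags a column to inspect more closely; it complements, rather than replaces, the visual check of the imputed distributions.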

In [None]:
import os

#check and set working directory
cwd = os.getcwd()  # Get the current working directory (cwd)
files = os.listdir(cwd)  # Get all the files in that directory
print("Files in %r: %s" % (cwd, files))
if files:
    print(os.path.join(cwd, files[0]))  # full path of the first file, if any

# Raw string so that backslashes (e.g. '\15') are not read as escape sequences
os.chdir(r'c:\Users\j_m289\Pictures\PHD\15. Data Analysis\HMR\Raw')  # Provide the new path here

#try to use relative path for files (tied to directory and nesting within folders)
path = r"/Users/j_m289\Pictures\PHD\15. Data Analysis\HMR\Raw"

start = "/Users/j_m289"

relative_path = os.path.relpath(path, start)

#Load packages
#pandas for data frame management and descriptives
#missingno for visualising missing data
#miceforest for multiple imputation (MICE)

import pandas as pd
import missingno as mn
import miceforest as mf

## Multiple Imputation with MICE

In [None]:
#Load raw_clean dataset and check missing data
missing_values =['n/a', 'na', '9999'] # missing values list is added so that pandas can read these

df_clean = pd.read_excel('raw_clean.xlsx', na_values = missing_values, engine='openpyxl')   # Load file and set missing values
df_clean.drop(columns= ['Unnamed: 0'], inplace=True)  # drop unnecessary column

# Check the number of missing values in each column
for column in df_clean:
    print('Column Name : ', column)
    print('Missing Values : ', df_clean[column].isnull().sum())

#set the string columns aside so they can be re-attached to the completed dataset in the last step
df_MH_diag = df_clean['MH_Diag_type']
df_ED_sub = df_clean['ED_subtype']
df_pastED_sub = df_clean['past_ED_subtype']

#columns that contain strings are dropped because the imputation step requires numeric data
df_clean = df_clean.drop(columns=['MH_Diag_type', 'ED_subtype', 'past_ED_subtype'])

In [None]:
# convert all columns to floats
df_clean = df_clean.astype('float64')

print(df_clean.dtypes.value_counts())  # confirm every column is now float64

#Run multiple imputation with mice (random state = 1)
kernel = mf.ImputationKernel(
  df_clean,
  datasets=4,
  save_all_iterations=True,
  random_state=1
)
# Run the MICE algorithm for 5 iterations on each of the datasets
kernel.mice(5)

# Printing the kernel will show you some high level information.
print(kernel)

In [None]:
# extract one of the imputed datasets (dataset index 2) as the completed dataset
completed_dataset = kernel.complete_data(dataset=2)
#check missing values
print(completed_dataset.isnull().sum(0))
# visual inspection
display(completed_dataset) 
kernel.plot_imputed_distributions()
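
`complete_data(dataset=2)` extracts a single imputed dataset. If estimates are instead to be combined across all imputed datasets, Rubin's rules are the standard approach. The sketch below pools the mean of one column; the function name `pool_mean` and the toy data are assumptions for illustration, and the list of data frames is assumed to be built separately (for example, one `complete_data` call per dataset).

```python
import numpy as np
import pandas as pd


def pool_mean(datasets: list, column: str) -> dict:
    """Pool the mean of one column across imputed datasets using Rubin's rules.

    Hypothetical helper: `datasets` is a list of completed DataFrames.
    """
    m = len(datasets)
    # Point estimate (mean) from each imputed dataset
    estimates = np.array([d[column].mean() for d in datasets])
    # Squared standard error of the mean within each dataset
    variances = np.array([d[column].var(ddof=1) / len(d) for d in datasets])
    pooled = estimates.mean()          # pooled point estimate
    within = variances.mean()          # average within-imputation variance
    between = estimates.var(ddof=1)    # between-imputation variance
    total = within + (1 + 1 / m) * between
    return {"estimate": pooled, "se": float(np.sqrt(total))}


# Toy data for illustration only (not the study data)
imputed = [pd.DataFrame({"x": [1.0, 2.0, 3.0]}), pd.DataFrame({"x": [1.0, 2.0, 6.0]})]
result = pool_mean(imputed, "x")
print(result)
```

The between-imputation term is what a single extracted dataset cannot capture: it reflects the extra uncertainty introduced by the imputation itself.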

In [None]:
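#Export 'completedDataset.xlsx' to the 'Clean' folder
# leftover 'Unnamed' index columns may not exist at this point, so ignore any that are missing
completed_dataset.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'], inplace=True, errors='ignore')
# re-attach the string columns that were set aside before imputation
completed_dataset = completed_dataset.assign(MH_Diag_type=df_MH_diag, ED_subtype=df_ED_sub, past_ED_subtype=df_pastED_sub)
os.chdir(r'c:\Users\j_m289\Pictures\PHD\15. Data Analysis\HMR\Clean')  # Provide the new path here
completed_dataset.to_excel('completedDataset.xlsx', index=True)  # re-save dataset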