<a href="https://colab.research.google.com/github/MelGalera/HDS-Blog/blob/main/Final_Assignment_Name_Surname.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](https://drive.google.com/uc?export=view&id=1DXUVHxd4t15mfuqMgMCLnsP4jWVI5EWz)

---

<br>
© 2022 Copyright The University of New South Wales - CRICOS 00098G

**Authors**: Oscar Perez-Concha and Zhisheng (Sandy) Sa.

**Communications**: If you have any questions, please email Oscar and Sandy at: o.perezconcha@unsw.edu.au and z.sa@unsw.edu.au

**Please use email exclusively for communication with us regarding this assignment.**

# Final Assignment 



---



#####################################################################################

Double-click to write down your name and surname.

**Name and Surname:** Melvin Galera

**Honour Pledge** <p>
    
    
Declaration: <p>
    
    
I declare that this assessment item is my own work, except where acknowledged, and has not been submitted for academic credit elsewhere or previously, or produced independently of this course (e.g. for a third party such as your place of employment) and acknowledge that the assessor of this item may, for the purpose of assessing this item: 

1. Reproduce this assessment item and provide a copy to another member of the University; and/or 

2. Communicate a copy of this assessment item to a plagiarism checking service (which may then retain a copy of the assessment item on its database for the purpose of future plagiarism checking). 

#####################################################################################

---
# 1.  Health Data Science Scenario
 
## 1a. Research Question

Our hospital has been very proactive in analysing the data in its electronic medical record (EMR). This analysis has led to some very interesting discoveries:

1. Readmitted patients are not only those who were sicker at their first admission, but also those who had less support at home after discharge or who did not have a medical follow-up after discharge.
2. Readmitted patients experienced high levels of emotional stress.
3. Readmitted patients were at much higher risk of acquiring new infections while in hospital.
4. Readmissions are very costly. Patients who were readmitted for the reasons cited above tended to be sicker than when they were first admitted, and their length of stay was significantly longer than in their first admission.

Our hospital has trialled a "specialised unit" that coordinates the patient's discharge. A team of nurses and occupational therapists (OTs) visits the patient's home after discharge. The frequency of the visits depends on an assessment made before discharge, but the average number of visits is 5 across the board. Among other things, the team makes sure that wounds are healing properly, that medication is taken, that coordination with a GP is in place, and that basic daily activities can be done, such as moving around the home, using the toilet and cooking.
In addition, the specialised unit regularly contacts the patients via telephone to check that everything is going well. The patients can also contact the specialised unit if they need any help. 

This pilot has drastically reduced the number of readmissions. 

In addition, the cost of operating the specialised unit was much lower than the cost of readmissions.

In terms of budget:
1. A day in hospital costs 5,000 dollars on average; readmitted patients tend to stay an average of 4 days.
2. A home visit by the specialised unit costs 400 dollars; the average number of visits is 5.
3. Patients who are not targeted: 0 dollars.
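
Reading the budget figures as per-patient averages, the unit costs can be worked out directly. The short sketch below only does that arithmetic; the constant names are illustrative, and it deliberately ignores how effective the home visits are, because the scenario does not quantify that.

```python
# Unit costs quoted in the budget list (dollars).
COST_PER_HOSPITAL_DAY = 5_000
AVG_READMISSION_DAYS = 4
COST_PER_HOME_VISIT = 400
AVG_HOME_VISITS = 5

# Average cost of one readmission and of one full course of home visits.
cost_readmission = COST_PER_HOSPITAL_DAY * AVG_READMISSION_DAYS
cost_intervention = COST_PER_HOME_VISIT * AVG_HOME_VISITS

# Number of patients that can be targeted for the price of one readmission.
interventions_per_readmission = cost_readmission / cost_intervention

print(f"Average cost of one readmission:  ${cost_readmission:,}")
print(f"Average cost of one intervention: ${cost_intervention:,}")
print(f"Interventions per readmission:    {interventions_per_readmission:.0f}")
```

On these averages, one avoided readmission pays for about ten targeted patients, which is why a moderate number of false positives can be tolerated.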

Our hospital is now ready to roll out and implement the specialised unit service to all patients at risk for readmission. 
The problem is that a priori, they do not know which patients are at risk of readmission. Thus, they do not know the patients they should be targeting.


<b><font color=green>Goal/Research Question: </font></b>
1. The hospital needs a machine learning algorithm to predict which patients should participate in this intervention. The prediction will be done just before discharge. They are hiring you to build this algorithm.

2. You will need to explain to the hospital managers the performance of your algorithm, so they can make an informed decision whether to use your algorithm or not.



---



## 1b. Instructions

1. We are going to deliver one predictive model to predict readmission to hospital within 30 days of discharge. 

2. Check and study the data provided. Read the data dictionary carefully and check again the plots and graphs created in the exercises corresponding to weeks 1, 2 and 3. 

3. Since the data have been already provided by the hospital, we are going to skip the next steps of the health data science workflow (see image below). 
  - Step 3 and substeps 3a,3b,3c,3d,3e: Data Gathering.

4. Step 4, substep 4a: Data visualization. Visualize the data in your draft, but do not include the graphs in the final submission.

5. The hospital wants the model to capture as many "readmissions" as possible (true positives), even at the expense of <font color='green'><b> moderately</b> </font> increasing the number of false positives <font color='green'><b>  within reason</b></font> (that is, patients who are not at risk of readmission but are classified as at risk). 

6. Nevertheless, we must take into account that the hospital does not have unlimited resources, so it cannot consider a high number of patients to be at risk of readmission, because the cost of the "specialised unit" would be very high.  

7. You will design several machine learning algorithms, choose one, and give a rationale explaining why you chose that algorithm. 

    * Build some predictive models to predict readmission to hospital within 30 days of discharge using:  
        *  'Logistic Regression Classifiers' 
        * 'Random Forest Classifier'. 
        
8. Use the `classification_report` and confusion matrix metrics to evaluate your models. If you use other metrics, such as the ROC curve, do not include them in the final submission.

9. <b> Very Important: Justify your decisions and why they were made in the space provided.</b>   Write <b>one or two sentences</b> for each section before each block of Python code, describing what the block does. Explanations must be clear. Do not include “sanity checks” (although these checks are encouraged during the construction of the algorithms).

10. Print clear labels in the printed results and explain in short and concise sentences the steps that you followed. 

11. Comment and document your code, as you will most likely later work in teams developing such algorithms.

12. Format: Jupyter Notebook.
13. Programming Language: Python.
14. Platform: Google Colab.
15. The assignment will be marked using the rubric provided in the outline.

16. **Submission: Upload the Jupyter Notebook in OpenLearning, in the section provided for that, and on your GitHub space. The application will close at exactly the date and time of the deadline.**

17. Marks will be deducted if these instructions are not followed.

18. <b><font color=green>Only the final version of the document must be submitted. </font></b>

19. Add your name and surname to the name of this document. For example: `Final-Assignment-Oscar-PerezConcha.ipynb`

20. Each question is 10 marks, except question 8 that is 20 marks.

21. If you have any questions, please email Oscar and Sandy at:
o.perezconcha@unsw.edu.au and z.sa@unsw.edu.au

![alt text](https://drive.google.com/uc?export=view&id=105SGqeyo8RgLhSO8mN7ZE5OsG0YiLPKt)



---



In [1]:
# Check that the required libraries are installed; install any that are missing
import sys
import subprocess
import pkg_resources

required = {'numpy', 'pandas', 'plotnine', 'matplotlib', 'seaborn', 
            'grid', 'shap', 'scikit-learn'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    print('Installing: ', missing)
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)
# delete unwanted variables
del required 
del installed 
del missing

Installing:  {'grid', 'shap'}


In [2]:
# Mount Google Drive
# You do not need to run this cell if you are not running this notebook in Google Colab

if 'google.colab' in str(get_ipython()):
    from google.colab import drive # import drive from Google Colab
    root = '/content/drive'     # default location for the drive
    # print(root)                 # print content of ROOT (Optional)
    drive.mount(root)
else:
    print('Not running on CoLab')

Mounted at /content/drive


Change the project path according to where you have placed your files:

In [3]:
from pathlib import Path

if 'google.colab' in str(get_ipython()):
    # EDIT THE PROJECT PATH IF IT DIFFERS FROM YOURS
    project_path = Path(root) / 'MyDrive' / 'HDAT9500' / 'final-assignment'

    # OPTIONAL - set working directory according to your google drive project path
    # import os
    # Change directory to the location defined in project_path
    # os.chdir(project_path)
else:
    project_path = Path()



---
Function to plot the confusion matrix:


In [4]:
import matplotlib.pyplot as plt
import seaborn as sns


def plot_confusion_matrix(confusion_matrix):
  """Plot a 2x2 confusion matrix as an annotated heatmap."""
  # visualise the confusion matrix
  labels = ['No', 'Yes']
  ax = plt.subplot()
  sns.heatmap(confusion_matrix, annot=True, fmt='.0f', ax=ax, cmap='viridis')

  # labels, titles and ticks
  ax.set_xlabel('Predicted labels')
  ax.set_ylabel('True labels')
  ax.set_title('Confusion Matrix')
  ax.xaxis.set_ticklabels(labels)
  ax.yaxis.set_ticklabels(labels)

---
# Question 1: Docstring

Create a docstring with:
1. The final aim of our program (50 words limit).
2. The variables and constants that you will use in this exercise (data dictionary). It is expected that you choose informative variable names and document your program (both docstrings and comments). 
3. Divide the docstring into sections. One section per question.
---

<b> Final aim of our program:</b>

################################################################################

(double-click here)


################################################################################

<b> Constants and variables in alphabetical order:</b>

################################################################################

(double-click here)
For example:


---
<b> Aim of this program </b>

(...)

---

---
<b> Question 1: </b>

* `variable_1`: description
* `CONSTANT`: description
* ...

---

(...)

---
<b> Question 4: Training, and hyper-parameter tuning of the Logistic Regression model.</b>

* `variable_10`: description
* `CONSTANT`: description
* ...

---

<b>Question 5: Evaluation of the Logistic Regression: </b>

* `confusion_LR_test`: confusion matrix derived from `y_test` and `y_pred_LR_test` for the logistic regression model `grid_search_LR` evaluated on the test set. 
* `confusion_RF_test`: confusion matrix derived from `y_test` and `y_pred_RF_test` for the random forest model `grid_search_RF` evaluated on the test set. 

---

<b>Question 5: ... </b>

* `variable_200`: description
* ...
---

################################################################################



---



---
# Question 2: Read and check the `pickle` provided. Prepare the data so they can be used by the algorithms that you are going to create.
---

<b> Rationale: What are you doing to solve this question?-75 words limit:</b>


################################################################################

We load the dataset stored in the pickle file, which was saved in Google Drive and is accessed through `project_path`.
We also import the libraries that the rest of the notebook uses.
We then drop `Patient_ID` and `Admission_ID`: they are only identifiers and are of no interest for prediction.


################################################################################

In [6]:
# Python code here (15 lines limit)
import pickle

# Path to the pickled dataset inside the project folder
pickle_data_path = Path(project_path) / 'data' / 'hospital_data_final_assignment.pickle'
# Load the pickled DataFrame into memory
with open(pickle_data_path, 'rb') as data:
  hospital = pickle.load(data)

In [7]:
# Sanity check:
print(hospital.columns)
print(hospital.shape)

Index(['Patient_ID', 'Admission_ID', 'los', 'Age', 'number_diagnoses',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_emergency', 'number_inpatient', 'number_outpatient',
       'sex_Female', 'sex_Male', 'max_glu_serum_>200', 'max_glu_serum_>300',
       'max_glu_serum_None', 'max_glu_serum_Norm', 'A1Cresult_>7',
       'A1Cresult_>8', 'A1Cresult_None', 'A1Cresult_Norm',
       'group_name_1_Blood_&_immune', 'group_name_1_CNS',
       'group_name_1_Cancer', 'group_name_1_Cardiac_&_circulatory',
       'group_name_1_Digestive', 'group_name_1_Endocrine',
       'group_name_1_Infectious', 'group_name_1_Mental_&_Substance',
       'group_name_1_Other', 'group_name_1_Respiratory',
       'group_name_2_Blood_&_immune', 'group_name_2_CNS',
       'group_name_2_Cancer', 'group_name_2_Cardiac_&_circulatory',
       'group_name_2_Digestive', 'group_name_2_Endocrine',
       'group_name_2_Infectious', 'group_name_2_Mental_&_Substance',
       'group_name_2_Other'

In [9]:
hospital.head()

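| | Patient_ID | Admission_ID | los | Age | number_diagnoses | num_lab_procedures | num_procedures | num_medications | number_emergency | number_inpatient | ... | admission_source_grouped_11 | admission_source_grouped_12 | admission_source_grouped_13 | admission_source_grouped_14 | admission_type_grouped_Elective | admission_type_grouped_Emergency | admission_type_grouped_Not Available/Null | admission_type_grouped_Trauma Centre | admission_type_grouped_Urgent | readmission |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 880 | 2 | 79 | 9 | 38 | 0 | 12 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | no |
| 1 | 2 | 881 | 5 | 59 | 8 | 49 | 0 | 16 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | no |
| 2 | 3 | 882 | 2 | 33 | 5 | 62 | 0 | 15 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | no |
| 3 | 4 | 883 | 6 | 42 | 9 | 77 | 0 | 30 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | no |
| 4 | 5 | 884 | 1 | 62 | 7 | 13 | 5 | 6 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | no |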


In [10]:
# Drop Patient_ID and Admission_ID
hospital = hospital.drop(['Patient_ID', 'Admission_ID'], axis = 1) 

In [11]:
# Sanity Check
hospital.head()

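| | los | Age | number_diagnoses | num_lab_procedures | num_procedures | num_medications | number_emergency | number_inpatient | number_outpatient | sex_Female | ... | admission_source_grouped_11 | admission_source_grouped_12 | admission_source_grouped_13 | admission_source_grouped_14 | admission_type_grouped_Elective | admission_type_grouped_Emergency | admission_type_grouped_Not Available/Null | admission_type_grouped_Trauma Centre | admission_type_grouped_Urgent | readmission |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 79 | 9 | 38 | 0 | 12 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | no |
| 1 | 5 | 59 | 8 | 49 | 0 | 16 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | no |
| 2 | 2 | 33 | 5 | 62 | 0 | 15 | 1 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | no |
| 3 | 6 | 42 | 9 | 77 | 0 | 30 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | no |
| 4 | 1 | 62 | 7 | 13 | 5 | 6 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | no |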


In [13]:
# Sanity check: check whether there are any missing (NaN) values
hospital.describe(include = 'all')

| | los | Age | number_diagnoses | num_lab_procedures | num_procedures | num_medications | number_emergency | number_inpatient | number_outpatient | sex_Female | ... | admission_source_grouped_11 | admission_source_grouped_12 | admission_source_grouped_13 | admission_source_grouped_14 | admission_type_grouped_Elective | admission_type_grouped_Emergency | admission_type_grouped_Not Available/Null | admission_type_grouped_Trauma Centre | admission_type_grouped_Urgent | readmission |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 69267.0 | 69267.0 | 69267.0 | 69267.0 | 69267.0 | 69267.0 | 69267.0 | 69267.0 | 69267.0 | 69267.0 | ... | 69267.0 | 69267.0 | 69267.0 | 69267.0 | 69267.0 | 69267.0 | 69267.0 | 69267.0 | 69267.0 | 69267 |
| unique | | | | | | | | | | | ... | | | | | | | | | | 2 |
| top | | | | | | | | | | | ... | | | | | | | | | | no |
| freq | | | | | | | | | | | ... | | | | | | | | | | 57348 |
| mean | 6.155413 | 67.281028 | 7.620382 | 43.111698 | 1.618375 | 16.062974 | 0.282963 | 0.221014 | 0.774886 | 0.531869 | ... | 0.062815 | 0.00179 | 1.4e-05 | 2.9e-05 | 0.350354 | 0.374377 | 0.105115 | 0.000245 | 0.169908 | |
| std | 5.240195 | 16.149325 | 2.326673 | 19.925355 | 1.866568 | 8.37729 | 0.883817 | 0.650029 | 2.022099 | 0.498987 | ... | 0.242631 | 0.042273 | 0.0038 | 0.005373 | 0.477084 | 0.483965 | 0.306704 | 0.015664 | 0.375554 | |
| min | 1.0 | 20.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 25% | 2.0 | 56.0 | 6.0 | 31.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 50% | 4.0 | 69.0 | 8.0 | 44.0 | 1.0 | 15.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 75% | 9.0 | 79.0 | 9.0 | 57.0 | 3.0 | 20.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | |


---
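Beyond dropping the identifiers, preparing the data usually also means separating the features from the target and turning the `readmission` labels into numbers. The following is a minimal sketch on a made-up frame: the column names mirror the real data, but the values are invented, and coding `yes` as 1 is an assumption rather than something the data dictionary states here.

```python
import pandas as pd

# Toy stand-in for the hospital data: column names mirror the real frame,
# but the values are invented for illustration only.
toy = pd.DataFrame({
    "Patient_ID": [1, 2, 3, 4],
    "Admission_ID": [880, 881, 882, 883],
    "los": [2, 5, 2, 6],
    "Age": [79, 59, 33, 42],
    "readmission": ["no", "yes", "no", "no"],
})

# The identifiers are not of interest for prediction, so drop them.
toy = toy.drop(columns=["Patient_ID", "Admission_ID"])

# Separate the features from the target and encode the target as 0/1
# (assumption: 'yes' is the positive class, coded as 1).
X = toy.drop(columns="readmission")
y = toy["readmission"].map({"no": 0, "yes": 1})

# describe() only hints at missing data through its counts;
# isna().sum() counts the missing values per column explicitly.
print("Missing values per column:")
print(X.isna().sum())
print("Class balance:")
print(y.value_counts())
```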

---
## Question 3: Divide the data into 80% training, 20% test, random seed of 42. Set the other hyper-parameters as you consider appropriate
---
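As an illustration of the split this question asks for, here is a hedged sketch on synthetic data with roughly the class imbalance seen in the `describe` output (57,348 "no" out of 69,267 rows). The names `RANDOM_SEED` and `TEST_SIZE` and the use of `stratify` are assumptions; the question leaves the other settings open.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42   # seed required by the question
TEST_SIZE = 0.20   # 20% test, 80% training

# Synthetic stand-in data: 1,000 rows, 5 features, about 17% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.17).astype(int)

# stratify=y keeps the class proportions the same in both subsets
# (an assumed choice, useful when the positive class is rare).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_SEED, stratify=y
)

print("Training rows:", X_train.shape[0])
print("Test rows:    ", X_test.shape[0])
print(f"Positive rate, train: {y_train.mean():.3f}, test: {y_test.mean():.3f}")
```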


<b> Rationale: What are you doing to solve this question?-75 words limit:</b>


################################################################################

(double-click here)


################################################################################

In [None]:
# Python code here (2 lines limit, and 1 cell limit)




---



---
## Question 4: Training, and hyper-parameter tuning of the Logistic Regression model.  

Hyper-parameters: 

- `C` values: 10, 100.
- `class_weight`: only two pairs. Choose one of these two combinations and explain why.

    * A weight of (80% for class 1, 20% for class 0) and (70% for class 1, 30% for class 0)
    * A weight of (80% for class 0, 20% for class 1) and (70% for class 0, 30% for class 1). 

- `penalty` values: l1, l2.
- 3-fold cross-validation for the grid search.
- `f1` as the score to choose the best model in the grid search.
- `n_jobs`=-1.
- do not include the heatmaps in the final submission.
- do not change these hyper-parameters.
- keep the remaining hyper-parameters at their default values.
---
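One way the grid above could be wired up is sketched below on synthetic data. Two choices are assumptions and not part of the brief: the first `class_weight` option is picked because it up-weights class 1 (readmission), and `solver='liblinear'` is set because scikit-learn's default `lbfgs` solver does not support the `l1` penalty. Whether changing the solver is acceptable under "keep the remaining hyper-parameters in the default state" is for the author to decide.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (about 17% positives).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.83],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Grid from the question; class 1 (readmission) gets the larger weight.
param_grid_LR = {
    "C": [10, 100],
    "penalty": ["l1", "l2"],
    "class_weight": [{0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7}],
}

# Assumption: liblinear solver, so that both l1 and l2 penalties can be fitted.
grid_search_LR = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid_LR, cv=3, scoring="f1", n_jobs=-1)
grid_search_LR.fit(X_train, y_train)

print("Best parameters:", grid_search_LR.best_params_)
print(f"Best cross-validated F1: {grid_search_LR.best_score_:.3f}")
```

The grid has 2 x 2 x 2 = 8 combinations, so with 3 folds the search fits 24 models before refitting the best one on the full training set.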


<b> Rationale: What are you doing to solve this question?-75 words limit:</b>


################################################################################

(double-click here)


################################################################################

In [None]:
# Python code here (15 lines limit, and 2 cells limit)




---



---
## Question 5: Evaluation of the Logistic Regression model on the training and test sets. Use the function `plot_confusion_matrix(confusion_matrix)` provided above.
---
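The evaluation could follow the pattern sketched here: predict on each set, print the `classification_report`, and keep the confusion matrix for plotting. The helper `evaluate` is hypothetical, and the data and the fitted model are synthetic stand-ins for the real `grid_search_LR`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a simple fitted model, for illustration only.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.83],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
model = LogisticRegression(class_weight={0: 0.2, 1: 0.8}).fit(X_train, y_train)


def evaluate(model, X, y, set_name):
    """Print the classification report and return the confusion matrix."""
    y_pred = model.predict(X)
    print(f"--- {set_name} set ---")
    print(classification_report(y, y_pred, target_names=["No", "Yes"]))
    # Rows are true labels, columns are predicted labels: [[TN, FP], [FN, TP]].
    return confusion_matrix(y, y_pred)


confusion_LR_train = evaluate(model, X_train, y_train, "Training")
confusion_LR_test = evaluate(model, X_test, y_test, "Test")
print("Test confusion matrix:")
print(confusion_LR_test)
```

Each returned matrix can then be passed to `plot_confusion_matrix`. A large gap between the training and test recall for the "Yes" class would suggest overfitting.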

<b> Rationale: What are you doing to solve this question?-75 words limit:</b>


################################################################################

(double-click here)


################################################################################

In [None]:
# Python code here (10 lines limit, and 2 cells limit)




---



---
## Question 6: Training, and hyper-parameter tuning of the Random Forest model.  

Fixed hyper-parameters (do not change the values of these hyper-parameters): 

- `n_estimators`: 150, 200
- `class_weight`: only two pairs. Choose one of these two combinations and explain why.

    * A weight of (80% for class 1, 20% for class 0) and (70% for class 1, 30% for class 0)
    * A weight of (80% for class 0, 20% for class 1) and (70% for class 0, 30% for class 1).

- 3-fold cross-validation for the grid search.
- `f1` as the score to choose the best model in the grid search.
- `n_jobs`=-1
- do not include the heatmaps in the final submission

Other hyper-parameters: 

- `max_features`: 20, 30
- `min_samples_split`: 20, 25
- you can change the previous hyper-parameters (`max_features` and `min_samples_split`) or add other hyper-parameters if you wish, but you must explain why you made that decision. 

---
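A sketch of the corresponding Random Forest search follows, again on synthetic data. The grid values come from the question; the choice of the class-1-heavy `class_weight` pair and the fixed `random_state` are assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in with 40 features, comfortably more than the
# largest max_features value in the grid (30).
X_train, y_train = make_classification(
    n_samples=300, n_features=40, n_informative=10, weights=[0.83],
    random_state=42)

param_grid_RF = {
    "n_estimators": [150, 200],
    "class_weight": [{0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7}],
    "max_features": [20, 30],
    "min_samples_split": [20, 25],
}

# random_state is an assumed addition so that repeated runs build the same forest.
grid_search_RF = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_RF, cv=3, scoring="f1", n_jobs=-1)
grid_search_RF.fit(X_train, y_train)

print("Best parameters:", grid_search_RF.best_params_)
print(f"Best cross-validated F1: {grid_search_RF.best_score_:.3f}")
```

With four hyper-parameters of two values each, the search covers 16 combinations, or 48 fits with 3-fold cross-validation, so on the full dataset it will take noticeably longer than the logistic regression search.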


<b> Rationale: What are you doing to solve this question?-75 words limit:</b>


################################################################################

(double-click here)


################################################################################

In [None]:
# Python code here (15 lines limit, and 2 cells limit)


---

---
## Question 7: Evaluation of the Random Forest model on the training and test sets. Use the function `plot_confusion_matrix(confusion_matrix)` provided above.
---

<b> Rationale: What are you doing to solve this question?-50 words limit:</b>

################################################################################

(double-click here)


################################################################################

In [None]:
# Python code here (10 lines limit, and 2 cells limit)




---



---
## Question 8: Based on the research question and the instructions, which model would you choose, if any? Would you deploy this model?  - 300 words limit.
---
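One way to support the choice is to translate a confusion matrix into dollars using the budget figures from section 1a. The sketch below is illustrative only: `policy_cost` and `prevented_fraction` are hypothetical names, the counts are made up, and the default assumes that every targeted readmission is prevented, which the scenario does not claim.

```python
COST_READMISSION = 5_000 * 4   # dollars: 4 days at $5,000 per day
COST_INTERVENTION = 400 * 5    # dollars: 5 visits at $400 per visit


def policy_cost(tn, fp, fn, tp, prevented_fraction=1.0):
    """Total cost of targeting every patient the model flags.

    prevented_fraction is a hypothetical parameter: the share of flagged
    true readmissions that the home visits actually prevent.
    """
    targeted = tp + fp
    readmitted = fn + tp * (1 - prevented_fraction)
    return targeted * COST_INTERVENTION + readmitted * COST_READMISSION


# Made-up confusion-matrix counts for 1,000 discharges, 170 of them readmitted.
no_model = policy_cost(tn=830, fp=0, fn=170, tp=0)
some_model = policy_cost(tn=630, fp=200, fn=50, tp=120)
target_all = policy_cost(tn=0, fp=830, fn=0, tp=170)

print(f"Target nobody:   ${no_model:,.0f}")
print(f"Use a model:     ${some_model:,.0f}")
print(f"Target everyone: ${target_all:,.0f}")
```

Under these assumptions, false negatives dominate the cost, which matches the hospital's preference for recall. The limit on the unit's capacity (instruction 6) is not modelled here and would need to be considered separately.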

<b> 300 words limit </b>

################################################################################

(double-click here)


################################################################################

In [None]:
# Python code here (10 lines limit, and 2 cells limit)




---



---
## Question 9: Use SHAP for the final model and give some explanation of what you observe. If you haven't chosen any model in question 8, choose the best model according to the evaluation metrics. Comment on the results.
---

<b> Explanation of what you observe - 100 words limit:</b>

################################################################################

(double-click here)


################################################################################

In [None]:
# Python code here (10 lines limit, and 2 cells limit)




---


