Group 1: Factors Affecting Medical Expenses

QUICK LINK: To see our deployed dashboard on Heroku, please click here!

Introduction

This repository holds the STAT 547 Group Project for Group 1: Diana Lin and Nima Jamshidi. The dataset we have chosen to work with is the "Medical Expenses" dataset used in the book Machine Learning with R, by Brett Lantz. GitHub user @meperezcuello extracted this dataset from Kaggle, and the information about the dataset comes from their GitHub Gist.

Usage

Prerequisites

Clone this repo

```
git clone https://github.com/STAT547-UBC-2019-20/group_01_dlin_njamshidi.git
```

Ensure the following packages are installed:
- RCurl
- base64enc
- bookdown
- broom
- corrplot
- crayon
- dash
- dashCoreComponents
- dashDaq
- dashHtmlComponents
- dashTable
- docopt
- devtools
- fiery
- glue
- grid
- gridExtra
- hablar
- here
- htmltools
- knitr
- mime
- plotly
- png
- psych
- rmarkdown
- reqres
- reshape2
- routr
- scales
- testthat
- tidyverse: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
- tinytex
- viridis
To install all these packages:
```
make install
```

Running the whole pipeline

Clean the repository to undo any residual incomplete analysis
```
make clean
```
Install all required packages:
```
make install
```
Run the entire analysis pipeline
```
make all
```

Running each step using the Makefile

Download the data
```
make data/raw/data.csv
```
Process the data
```
make data/processed/processed_data.csv
```

Perform exploratory analysis
```
make images/age_histogram.png images/corrplot.png images/facet.png images/region_barchart.png data/explore/correlation.rds
```
Perform linear regression
```
make data/linear_model/model.rds data/linear_model/tidied.rds data/linear_model/glanced.rds data/linear_model/augmented.rds images/lmplot001.png images/lmplot002.png images/lmplot003.png images/lmplot004.png images/lmplot005.png
```
Knit the final report
```
make docs/milestone3.html docs/milestone3.pdf
```
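
After running the steps above, it can help to confirm that every target was actually produced. The following is a minimal sketch and not part of the repository: `check_outputs` and `EXPECTED_OUTPUTS` are hypothetical names, and the file list is simply copied from the `make` targets listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): report which pipeline outputs are missing.
# The paths below are copied from the make targets listed in this README.

EXPECTED_OUTPUTS="
data/raw/data.csv
data/processed/processed_data.csv
images/age_histogram.png
images/corrplot.png
images/facet.png
images/region_barchart.png
data/explore/correlation.rds
data/linear_model/model.rds
data/linear_model/tidied.rds
data/linear_model/glanced.rds
data/linear_model/augmented.rds
images/lmplot001.png
images/lmplot002.png
images/lmplot003.png
images/lmplot004.png
images/lmplot005.png
docs/milestone3.html
docs/milestone3.pdf
"

# Print one line per missing file; succeed only if every output exists.
check_outputs() {
  missing=0
  for f in $EXPECTED_OUTPUTS; do
    if [ ! -f "$f" ]; then
      echo "missing: $f"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Calling `check_outputs` from the repository root prints nothing and returns success when all targets exist. Otherwise it lists the missing files, which shows which `make` step still needs to run.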

Running each R script individually

Run the following scripts (in order) with the appropriate arguments specified

Install required packages
```
Rscript scripts/install.R
```

Download the data
```
Rscript scripts/load_data.R --data_to_url="https://gist.github.com/meperezcuello/82a9f1c1c473d6585e750ad2e3c05a41/raw/d42d226d0dd64e7f5395a0eec1b9190a10edbc03/Medical_Cost.csv"
```
Wrangle/clean/process the data
```
Rscript scripts/process_data.R --file_path="data/raw/data.csv" --filename="processed_data.csv"
```
Conduct exploratory data analysis
```
Rscript scripts/explore_data.R --processed_data="data/processed/processed_data.csv" --path_to_images="images" --path_to_data="data/explore"
```
Conduct linear regression
```
Rscript scripts/linear_model.R --processed_data="data/processed/processed_data.csv" --path_to_images="images" --path_to_lmdata="data/linear_model"
```
Knit the final report
```
Rscript scripts/knit.R --finalreport="docs/milestone3.Rmd"
```
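
Since the scripts must run in order, the commands above could be chained in a small wrapper that stops at the first failure. This is only a sketch under stated assumptions: `run_step`, `run_pipeline`, and the `DRY_RUN` variable are hypothetical names that do not exist in the repository, and the arguments are copied from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not in the repository): run the R scripts in the documented order.

DATA_URL="https://gist.github.com/meperezcuello/82a9f1c1c473d6585e750ad2e3c05a41/raw/d42d226d0dd64e7f5395a0eec1b9190a10edbc03/Medical_Cost.csv"
PROCESSED="data/processed/processed_data.csv"

# With DRY_RUN=1, print the command instead of executing it.
run_step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Each '|| return 1' aborts the pipeline as soon as one step fails.
run_pipeline() {
  run_step Rscript scripts/install.R || return 1
  run_step Rscript scripts/load_data.R --data_to_url="$DATA_URL" || return 1
  run_step Rscript scripts/process_data.R --file_path="data/raw/data.csv" --filename="processed_data.csv" || return 1
  run_step Rscript scripts/explore_data.R --processed_data="$PROCESSED" --path_to_images="images" --path_to_data="data/explore" || return 1
  run_step Rscript scripts/linear_model.R --processed_data="$PROCESSED" --path_to_images="images" --path_to_lmdata="data/linear_model" || return 1
  run_step Rscript scripts/knit.R --finalreport="docs/milestone3.Rmd" || return 1
}
```

Running `DRY_RUN=1 run_pipeline` prints the six commands without executing them. Calling `run_pipeline` alone from the repository root executes them. In practice `make all` already covers this, so the wrapper is mainly useful for seeing the order of the individual steps.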

Milestones

Milestone 1

For Milestone 1, you can find our initial exploratory data analysis at the link below:

https://stat547-ubc-2019-20.github.io/group_01_dlin_njamshidi/milestone1.html

Our progress is outlined in issue #4.

Milestone 2

For Milestone 2, you can find the scripts to load, process, and conduct exploratory data analysis in the scripts/ directory. The first draft of our report can be found here.

Our progress is outlined in issue #8.

load_data.R
```
Rscript scripts/load_data.R --data_to_url=https://gist.github.com/meperezcuello/82a9f1c1c473d6585e750ad2e3c05a41/raw/d42d226d0dd64e7f5395a0eec1b9190a10edbc03/Medical_Cost.csv
```
process_data.R
```
Rscript scripts/process_data.R --file_path="data/raw/data.csv" --filename="processed_data.csv"
```
explore_data.R
```
Rscript scripts/explore_data.R --processed_data="data/processed/processed_data.csv" --path_to_images="images"
```

Milestone 3

For Milestone 3, the script to knit the final report is scripts/knit.R. The final report can be found here in HTML and PDF.

Our progress is outlined in issue #24.

linear_model.R
```
Rscript scripts/linear_model.R --processed_data="data/processed/processed_data.csv" --path_to_images="images" --path_to_lmdata="data/linear_model"
```
knit.R
```
Rscript scripts/knit.R --finalreport="docs/milestone3.Rmd"
```

Makefile
```
make
```

Milestone 4

For Milestone 4, we have addressed the feedback from the TAs (issues #9 and #25) and from our peers (issues #35 and #39). All of the feedback in these four issues was implemented except for one item, which has been filed under future work in issue #41.

Our progress is outlined in issue #40.

Milestone 5

For Milestone 5, we have finished our dashboard in app.R and implemented the TA feedback from issue #46.

Our progress is outlined in issue #44.

To run the dashboard locally:
```
Rscript app.R
```

Milestone 6

For Milestone 6, we have implemented the TA feedback from issue #54.

Our progress is outlined in issue #52.

To access our dashboard deployed on Heroku, click here!

Dashboard Proposal

Description

This app has two main pages: an exploration page and a page that shows the results of a linear regression conducted on the dataset.

On the exploration page, the user finds four graphs, each showing some statistics about the dataset. The upper-left graph shows the correlations between the dataset factors. The user can choose color, shade, circle, or pie as the style used to display the correlations. Since the correlation matrix is symmetric, the user can display it as a full, upper triangular, or lower triangular matrix, with the option to hide the diagonal values (which are all equal to 1). Next to this graph is a faceted plot that shows how BMI and charges are distributed for each region and sex. The user can choose one of smoker, age, or children to be encoded as color in this plot. The bottom-left and bottom-right graphs show the distribution of the data across age groups and regions, respectively. They are color-coded by the sex, smoker, or children factor, as chosen by the user.

On the linear regression page, the user chooses, at the top of the page, the factors to be used in the linear regression and sees the results below. The R-squared value and the diagnostic graphs are shown there. At the bottom of this page, the user can enter their own value for each factor to see what medical charges the linear regression model would estimate for them.

Usage Scenario

Ron is taking The Fundamentals of Public Health Care as an undergraduate course. As an assignment, he needs to estimate the medical expenses of a group of his classmates. He plans to send them a form asking for information, but he is not sure what information to request. He logs in to the Medical Expenses app to learn more about the factors affecting medical expenses. On the exploration page, he can look at the visualizations to get an idea of what the dataset looks like, learn about the correlations between the factors in the dataset, and see how the data is distributed across various variables. He might want to check whether the sexes form visually distinct clusters in the BMI vs. charges graph. He can look at the bar charts to see what type of distribution the factors follow in this dataset. Next, he can go to the linear regression page and experiment with the factors to find which combination best explains the charges. Finally, he can enter his own information to check whether the regression model based on the available variables estimates his expenses well. He might decide to include some of the variables from this dataset in his form and add others, such as occupation or the health status of parents.

Sketch

If the images are not loading, please refresh the page.
