Adult Income and Personal Attributes
This repo is for UBC STAT 547 Group project for Group-8, The team members are Jimmy Liu and Hannah McSorley. The dataset we're working with is adult income data from the 1994 US Census Bureau available through University California Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/adult).
This repo will evolve with project milestones throughout the course (until April 8, 2020) and each milestone submission will be a tagged release.
View the GitHub Page for Milestone 1 here: https://stat547-ubc-2019-20.github.io/group_08_ADULT-INCOME/docs/milestone01.html
Milestone 2 focused on running scripts through command line (or RStudio terminal). Below are the instuctions for use:
-
Clone this repository
-
Check that you have the follow R packages installed:
- tidyverse
- here
- glue
- docopt
- ggpmisc
- cowplot
- ggpubr
- scales
-
Run the scripts under in the following order via command-line
a. Download raw data
Rscript scripts/load_data.R --data_url="https://raw.githubusercontent.com/STAT547-UBC-2019-20/data_sets/master/adult_data.csv" --output data/downloaded_datafile.csv
Note: a copy of the datafile will also be available by cloning the repo to an RStudio Project through git version control
b. Process raw data
load and process the data from the previously downloaded file
Rscript scripts/data_processing.R --input data/downloaded_datafile.csv --output data/processed_adult-data.csv
c. Visualize processed data
generate four different exploratory analysis plots from the processed data
Rscript scripts/EDA_plots.R --input data/processed_adult-data.csv
-
To generate a linear model on filtered data, create a plot, and knit a final report - run the following scripts:
a. Linear model
- Generate .RDS file containing linear model object
Rscript scripts/linear_regression.R --input data/processed_adult-data.csv --output data/lm_age-hrs.RDS
- Linear regression data visualization
Rscript scripts/plot_linear-regression.R --input data/processed_adult-data.csv --output images/plot_linear-regression.png
b. Produce a complete final report
Rscript scripts/knit.R --finalreport docs/FinalReport_milestone03.Rmd
the above command knits a report via R markdown file
-
Run the entire analysis pipeline using
make
after cloning the repositoryIn the terminal, type
make all
to produce final report. View the product here:Note, to remove files produced by the pipeline, run
make clean
in the terminal. Only run 'make clean' if you are certain it is necessary (we discourage running this command).
Description
Our app will have a landing page that presents both a summary of the (1994 adult income census) data demographics as well as bi-variate data comparisons. The 'summary' section will provide an overview of the dataset where the user can choose (from a dropdown menu) which key variable to summarize in a plot and a table of min/med/max net gain. In addition to the dropdown menu, there will be a radio-button/radiometer option to toggle between options for the selected key variable (e.g., male vs female, under or over $50K income). The 'analysis' portion of the app will show the distribution of each person's education, age, sex, ethnicity as they relate to each other and to annual net gain (monetary gains over employment income). These plots will likely be a combination of boxplots, violin plots and/or density ridge plots. Users will be able to compare variables by toggling between different explanatory variables to visualize their relationship to net income and/or to eachother. From a drop-down list, users will be able to select different demographic variables to visualize the metadata distribution of the census participants (i.e. only show females, or specific ethnicities). Via a slider bar, users will be able to narrow the range of data displayed based on filterable variables (e.g., age, education level, average hours worked per week). This app will allow users to compare two variables such as education level, age, sex, hours worked, ethnicity or annual net gain, and also to rank personal attributes relative to whether annual income is predicted to be above or below $50,000.
Sketch of dashboard plan
Usage Scenario
Pat is interested in understanding how personal attributes and socioeconomic position are associated with annual financial gains. They want to [explore] a census dataset to [compare] attributes (e.g., age, sex, marital status, education and work hours) with annual net gain amounts to [identify] patterns relating monetary benefits to personal and socioeconomic details. When Pat logs on to the “1994 Adult Income Census Data App”, they will see an overview of all the available variables in the dataset, according to the number of instances (persons) associated with each demographic category. They can filter variables to narrow the range of personal attributes to compare to net annual gains, and/or rank census participants according to their predicted annual income (over or under $50K). When Pat does so, they may notice that there are no apparent links between the available personal attributes and net annual gain. They may also notice that the data is dominated by a specific demographic, which may encourage them to pursue the collection of more representative and diverse demographic data in future census.
To run our Dash app, clone or download this project repository and run the command Rscript app.R
Our Dash app was updated based on feedback, bug fixes were finalized, and the app was deployed online.
View our finalized app here: https://group08-1994-adult-income.herokuapp.com/