[WIP] Add script to clean and combine data, and add data (#29)

* Add script to clean and combine data, and add data
  - Update survey data dictionary with left out questions
  - Update survey data dictionary with variable/column names for questions
  - Add script `clean-data.R` to clean and combine the two survey datasets into one for ease of analysis
  - Create the combined survey dataset after running `clean-data.R`
  - Create README.md file to explain cleaned data and the script to produce it
  - Update root README.md file to briefly explain data
  - Change `data/` directory to `raw-data/`
* Move around functions and add more edits
  - Update date
  - Categorize functions into different categories
    - Utility functions
    - Sub-process functions
    - Main process functions
    - Main function
  - Update function descriptions
  - Add function to check survey data uses only one ID from each
* Move cleaning of code events to own function
* Create function to search and add col + formatting
  - Create function to search in a given column for search terms, then creates a new column labeling rows containing search terms
  - Reformat input data comments
  - Reformat NSE functions e.g. mutate_()
* Create temp helper function to look at columns
* Move reading data function to main processes
* Create draft full dataset
* Rename cleaning function and update joining key
  - The cleaning function `clean_part_1` was written for the first dataset. I've changed the function, along with the variables, to attend to the joined dataset.
  - Removing outliers for hours learning per week was simplified
  - Added usage case for `search_and_create()` function
* Add feedback to user on script actions
* Separate other job interests cleaning to function
* Fix inconsistent indenting in helper function
* Move cleaning other podcasts to separate function
* Reorganize sub-cleaning functions to own category
* Update helper function with flexible use
  Allow helper function to either default view the data, print data to console (printYes=1), or to print the number of instances
* Create new columns for significant other podcasts
  - Update description of `clean_podcasts` function
  - Add more variations to “None” response
  - Add feedback to user on start and finish of function
  - Add new columns for podcasts that were mentioned >15 times
* Separate a function for cleaning hours learned
* Add feedback in cleaning code events & exp earning
* Separate function for cleaning months programming
* Separate function cleaning post bootcamp salary
  - Retain previous cleaning
  - Add in same normalizations from expected income
* Separate function for cleaning money for learning
* Add description to entire script
* Floor values and remove outliers in money to learning
* Create function for cleaning age
* Initialize functions for columns needing cleaning
* Create new boolean column for PodcastOther
* Fix feedback message for cleaning hours learning
* Update draft of complete data
* Remove boolean Podcast Other column
* Finish cleaning income and remove extras
  - Finished cleaning income function
  - Removed changing ExpectedEarning to integer
  - Remove unnecessary cleaning
* Remove "Other" from new podcast cols
* Finish cleaning commute times
* Update code events cleaning to make new cols
* Clean other resources
* Update code events threshold to 1.5% frequency
* Update detail on cutoff for other podcasts is 1.5%
* Add Bootcamp Name into joining key
* Add back in podcast and events from 2nd dataset
* Make ages less than 10 to NA
* Convert resources to boolean
* Finish cleaning data with consistency check
  - Check for inconsistencies between job role interests
  - Remove unnecessary columns
* Remove "Other" from new Podcast columns
* Clean student debt owed
* Add CodeEvent column to columns removed
* Write final polish of data
* Fix small spelling mistakes
* Update final dataset
* Remove first dataset
* Update script date
freeCodeCamp · May 18, 2016 · 4c903b0
1 parent 97ba361
commit 4c903b0
Showing 8 changed files with 17,801 additions and 37 deletions.
diff --git a/README.md b/README.md
@@ -5,9 +5,18 @@ We announced on [March 29th,
 
 Survey development was led by [Quincy Larson](https://twitter.com/ossia) with Free Code Camp and [Saron Yitbarek](https://twitter.com/saronyitbarek) with Code Newbie. For more about why we made this survey: ["How we crafted a survey for thousands of people who are learning to code"](https://medium.freecodecamp.com/we-just-launched-the-biggest-ever-survey-of-people-learning-to-code-cac81dadf1ea#.8g9ts8gm5).
 
+## Table of Contents
+
+- [About the Data](#about-the-data)
+- [How to Contribute](#how-to-contribute)
+- [Analysis of other relevant recent data](#analysis-of-other-relevant-recent-data)
+- [License](#license)
+
 ## About the Data
 
-The survey results are located in the [`data/`](data/) directory, in .csv format.
+The raw survey results are located in the [`raw-data/`](raw-data/) directory, in `.csv` format.
+
+We have cleaned and combined the data for convenience of downstream analyses and visualizations. The cleaned data is located in the [`clean-data/`](clean-data/) directory.
 
 ## How to Contribute
 

diff --git a/clean-data/2016-FCC-New-Coders-Survey-Data.csv b/clean-data/2016-FCC-New-Coders-Survey-Data.csv
diff --git a/clean-data/README.md b/clean-data/README.md
@@ -0,0 +1,65 @@
+# Clean and Combine Free Code Camp Survey Data
+
+## Introduction
+
+The survey data was split into two parts, which need to be combined into one
+for ease of downstream analyses. Additionally, both data sets need some
+cleaning because free-text survey answers are inconsistent.
+
+## Notable Data Transformations
+
+### Obvious Outliers
+
+In some of the numeric free-text answers, values were filtered out if they
+were beyond a reasonable threshold. For example, an answer saying you've coded
+for 100,000 months would be removed.
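A minimal sketch of this kind of outlier filter, in base R. The helper name `remove_outliers` and the cutoff of 744 months are illustrative assumptions, not taken from `clean-data.R`, and the sketch assumes "removed" means replacing the value with `NA` rather than dropping the whole row.

```r
# Hypothetical helper: replace values above a cutoff with NA.
remove_outliers <- function(x, max_value) {
  x[!is.na(x) & x > max_value] <- NA
  x
}

months_programming <- c(6, 24, 100000, NA, 36)

# The cutoff (744 months, i.e. 62 years) is an assumed value for illustration.
remove_outliers(months_programming, max_value = 744)
# returns 6, 24, NA, NA, 36
```

Keeping the row and blanking only the offending value preserves the respondent's other answers.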
+
+### Numeric Ranges
+
+Some answers were given as ranges. For example, a range of "9-10" months of
+programming might have been given as an answer. The average of the range was
+taken when possible.
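One way to average a range can be sketched in base R as follows. The function `average_range` is a hypothetical name, and the pattern only recognizes two non-negative numbers separated by a hyphen; the actual script may handle more cases.

```r
# Hypothetical helper: turn "a-b" ranges into their midpoint and leave
# plain numbers as they are. Anything else becomes NA.
average_range <- function(x) {
  x <- trimws(x)
  num <- "[0-9]+(\\.[0-9]+)?"
  is_range <- grepl(paste0("^", num, "[[:space:]]*-[[:space:]]*", num, "$"), x)

  # Plain numbers convert directly; ranges and text become NA for now.
  out <- suppressWarnings(as.numeric(x))

  # Split each range on the hyphen and average its two ends.
  parts <- strsplit(x[is_range], "[[:space:]]*-[[:space:]]*")
  out[is_range] <- vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
  out
}

average_range(c("9-10", "12", "3 - 5", "a while"))
# returns 9.5, 12, 4, NA
```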
+
+### Years to Months
+
+Some answers to a question asking about months were given in years. These were
+converted to months if possible.
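As an illustration, a conversion like this could look as follows in base R. The helper `years_to_months` and the recognized suffixes ("year", "years", "yr", "yrs", "month", "months") are assumptions made for this sketch.

```r
# Hypothetical helper: read answers such as "2 years" as months.
years_to_months <- function(x) {
  x <- tolower(trimws(x))
  is_years <- grepl("^[0-9]+(\\.[0-9]+)?[[:space:]]*(years?|yrs?)$", x)

  # Strip a trailing unit, then convert; unparseable answers become NA.
  value <- suppressWarnings(
    as.numeric(sub("[[:space:]]*(years?|yrs?|months?)$", "", x))
  )

  # Only answers explicitly given in years are multiplied by 12.
  ifelse(is_years, value * 12, value)
}

years_to_months(c("2 years", "18", "1.5 yrs", "18 months", "a long time"))
# returns 24, 18, 18, 18, NA
```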
+
+### Normalization of Answers
+
+Some of the free-text answers were very similar to each other, differing only
+by a space or two. These will register as different answers if you
+aren't looking for them. Answers like "Cybersecurity" and "Cyber Security" are
+the same and were changed to a consistent form. Some may have been
+missed.
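A rough sketch of this normalization in base R: answers are compared after lowercasing and dropping everything except letters and digits, and matches are rewritten to a chosen canonical spelling. The function `normalize_answer` and the choice of "Cyber Security" as the canonical form are assumptions for illustration only.

```r
# Hypothetical helper: map near-duplicate answers onto canonical spellings.
normalize_answer <- function(x, canonical) {
  # Comparison key: lowercase, letters and digits only.
  key <- function(s) tolower(gsub("[^[:alnum:]]", "", s))
  idx <- match(key(x), key(canonical))

  # Answers without a canonical match are left untouched.
  ifelse(is.na(idx), x, canonical[idx])
}

answers <- c("Cybersecurity", "Cyber Security", "cyber  security", "Data Science")
normalize_answer(answers, canonical = "Cyber Security")
# returns "Cyber Security", "Cyber Security", "Cyber Security", "Data Science"
```

Matching on a stripped key catches spacing and capitalization variants, but not misspellings, which is consistent with the note that some answers may have been missed.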
+
+
+## Prerequisites to Rerun Data Manipulations
+
+- [R][RProj] (>= 3.2.3)
+- [dplyr][dplyrGH] (>= 0.4.3) [CRAN][dplyrCRAN]
+- [Rcpp][RcppGH] (>= 0.12.4) [CRAN][RcppCRAN]
+
+[RProj]: https://www.r-project.org/
+[dplyrGH]: https://github.com/hadley/dplyr
+[RcppGH]: https://github.com/RcppCore/Rcpp
+[dplyrCRAN]: https://cran.r-project.org/web/packages/dplyr/index.html
+[RcppCRAN]: https://cran.r-project.org/web/packages/Rcpp/index.html
+
+
+## Reproduce Cleaning and Combining of Data
+
+Running the following script will create a new file,
+`2016-FCC-New-Coders-Survey-Data.csv`, in this directory (`clean-data/`).
+
+```shell
+git clone https://github.com/FreeCodeCamp/2016-new-coder-survey.git
+cd 2016-new-coder-survey/clean-data
+Rscript clean-data.R
+```
+
+
+## Cleaning Pipeline
+
+1. Rename columns
+2. Clean free-text fields for the appropriate questions