This repository has been archived by the owner on Jun 23, 2020. It is now read-only.

[WIP] Add script to clean and combine data, and add data (#29)
* Add script to clean and combine data, and add data

- Update survey data dictionary with left out questions
- Update survey data dictionary with variable/column names for questions
- Add script `clean-data.R` to clean and combine the two survey datasets into
  one for ease of analysis
- Create the combined survey dataset after running `clean-data.R`
- Create README.md file to explain cleaned data and the script to produce it
- Update root README.md file to briefly explain data
- Change `data/` directory to `raw-data/`

* Move around functions and add more edits

- Update date
- Categorize functions into different categories
    - Utility functions
    - Sub-process functions
    - Main process functions
    - Main function
- Update function descriptions
- Add function to check survey data uses only one ID from each

* Move cleaning of code events to own function

* Create function to search and add col + formatting

- Create function to search a given column for search terms, then create a
  new column labeling rows that contain the search terms
- Reformat input data comments
- Reformat NSE functions e.g. mutate_()

* Create temp helper function to look at columns

* Move reading data function to main processes

* Create draft full dataset

* Rename cleaning function and update joining key

- The cleaning function `clean_part_1` was written for the first dataset. I've
  changed the function, along with the variables, to attend to the joined
  dataset.
- Removing outliers for hours learning per week was simplified
- Added usage case for `search_and_create()` function

* Add feedback to user on script actions

* Separate other job interests cleaning to function

* Fix inconsistent indenting in helper function

* Move cleaning other podcasts to separate function

* Reorganize sub-cleaning functions to own category

* Update helper function with flexible use

Allow helper function to view the data by default, print the data to the
console (printYes=1), or print the number of instances

* Create new columns for significant other podcasts

- Update description of `clean_podcasts` function
- Add more variations to “None” response
- Add feedback to user on start and finish of function
- Add new columns for podcasts that were mentioned >15 times

* Separate a function for cleaning hours learned

* Add feedback in cleaning code events & exp earning

* Separate function for cleaning months programming

* Separate function cleaning post bootcamp salary

- Retain previous cleaning
- Add in same normalizations from expected income

* Separate function for cleaning money for learning

* Add description to entire script

* Floor values and remove outliers in money to learning

* Create function for cleaning age

* Initialize functions for columns needing cleaning

* Create new boolean column for PodcastOther

* Fix feedback message for cleaning hours learning

* Update draft of complete data

* Remove boolean Podcast Other column

* Finish cleaning income and remove extras

- Finished cleaning income function
- Removed changing ExpectedEarning to integer
- Remove unnecessary cleaning

* Remove "Other" from new podcast cols

* Finish cleaning commute times

* Update code events cleaning to make new cols

* Clean other resources

* Update code events threshold to 1.5% frequency

* Update detail on cutoff for other podcasts is 1.5%

* Add Bootcamp Name into joining key

* Add back in podcast and events from 2nd dataset

* Set ages less than 10 to NA

* Convert resources to boolean

* Finish cleaning data with consistency check

- Check for inconsistencies between job role interests
- Remove unnecessary columns

* Remove "Other" from new Podcast columns

* Clean student debt owed

* Add CodeEvent column to columns removed

* Write final polish of data

* Fix small spelling mistakes

* Update final dataset

* Remove first dataset

* Update script date
erictleung authored and QuincyLarson committed May 18, 2016
1 parent 97ba361 commit 4c903b0
Showing 8 changed files with 17,801 additions and 37 deletions.
11 changes: 10 additions & 1 deletion README.md
@@ -5,9 +5,18 @@ We announced on [March 29th,

Survey development was led by [Quincy Larson](https://twitter.com/ossia) with Free Code Camp and [Saron Yitbarek](https://twitter.com/saronyitbarek) with Code Newbie. For more about why we made this survey: ["How we crafted a survey for thousands of people who are learning to code"](https://medium.freecodecamp.com/we-just-launched-the-biggest-ever-survey-of-people-learning-to-code-cac81dadf1ea#.8g9ts8gm5).

## Table of Contents

- [About the Data](#about-the-data)
- [How to Contribute](#how-to-contribute)
- [Analysis of other relevant recent data](#analysis-of-other-relevant-recent-data)
- [License](#license)

## About the Data

The survey results are located in the [`data/`](data/) directory, in .csv format.
The raw survey results are located in the [`raw-data/`](raw-data/) directory, in `.csv` format.

We have cleaned and combined the data for convenience of downstream analyses and visualizations. The cleaned data is located in the [`clean-data/`](clean-data/) directory.

## How to Contribute

15,621 changes: 15,621 additions & 0 deletions clean-data/2016-FCC-New-Coders-Survey-Data.csv

Large diffs are not rendered by default.

65 changes: 65 additions & 0 deletions clean-data/README.md
@@ -0,0 +1,65 @@
# Cleaning and Combining Free Code Camp Survey Data

## Introduction

The survey data was collected in two parts, which need to be combined into one
dataset for ease of downstream analyses. Both parts also need some cleaning,
because many of the survey questions accepted free-text answers.

## Notable Data Transformations

### Obvious Outliers

In some of the numeric free-text answers, values were filtered out if they were
beyond a reasonable threshold. For example, an answer saying you've coded for
100,000 months would be removed.
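
A minimal sketch of this kind of filter is shown here. The helper name `remove_outliers`, the cutoff of 744 months, and the choice to replace a removed value with `NA` are all assumptions made for illustration; they are not taken from `clean-data.R`.

```r
# Hypothetical helper: replace values above a cutoff with NA.
# The cutoff is supplied by the caller and is an assumption, not the
# threshold used by the actual cleaning script.
remove_outliers <- function(x, max_value) {
  x[!is.na(x) & x > max_value] <- NA
  x
}

# Example: months of programming, with one implausible answer.
months_programming <- c(6, 24, 100000, NA, 120)
remove_outliers(months_programming, max_value = 744)
```

Setting the value to `NA` rather than dropping the row keeps the respondent's other answers available for analysis.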

### Numeric Ranges

Some answers were given as ranges. For example, "9-10" months of programming
might have been given as the answer to a question. The average of the range was
taken when possible.
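
One way to average a range is sketched below in base R. The function name `average_range` and the accepted pattern (two numbers separated by a hyphen) are assumptions for illustration, not the script's actual implementation.

```r
# Hypothetical helper: convert free-text answers to numbers, replacing
# ranges such as "9-10" with the mean of their endpoints.
average_range <- function(x) {
  x <- trimws(x)
  # Assumed range format: number, hyphen, number (optional spaces).
  is_range <- grepl("^[0-9.]+[[:space:]]*-[[:space:]]*[0-9.]+$", x)
  # Plain numbers convert directly; everything else becomes NA for now.
  out <- suppressWarnings(as.numeric(x))
  parts <- strsplit(x[is_range], "-", fixed = TRUE)
  out[is_range] <- vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
  out
}

average_range(c("9-10", "12", "a while"))
```

Answers that are neither a number nor a recognizable range come back as `NA`, which matches the "when possible" caveat above.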

### Years to Months

Some answers to a question asking about months were given in years. These were
converted to months if possible.
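
The conversion could look roughly like the following sketch. The function name `years_to_months` and the recognized unit spellings ("year", "years", "yr", "yrs") are assumptions; the actual script may detect years differently.

```r
# Hypothetical helper: interpret answers such as "2 years" as months.
years_to_months <- function(x) {
  x <- tolower(trimws(x))
  # Assumed spellings of the unit; anything else is treated as months.
  is_years <- grepl("^[0-9.]+[[:space:]]*(years?|yrs?)$", x)
  # Strip a trailing unit word, then parse what is left as a number.
  number <- suppressWarnings(as.numeric(sub("[[:space:]]*[a-z]+$", "", x)))
  ifelse(is_years, number * 12, number)
}

years_to_months(c("2 years", "18", "1.5 yrs"))
```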

### Normalization of Answers

Some of the free-text answers were nearly identical to each other, differing
only by a space or two. These register as different answers unless you look for
them. For example, "Cybersecurity" and "Cyber Security" mean the same thing and
were changed to a single consistent form. Some variants may have been missed.
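
A rough sketch of this normalization follows. The function name `normalize_answer`, the lookup table, and the choice of "Cyber Security" as the canonical spelling are assumptions for illustration; the README does not say which form the script settles on.

```r
# Hypothetical helper: map spelling variants to one canonical form.
normalize_answer <- function(x) {
  # Compare answers with whitespace and case differences removed.
  key <- tolower(gsub("[[:space:]]+", "", x))
  # Assumed canonical spellings, keyed by the whitespace-free form.
  canonical <- c(cybersecurity = "Cyber Security")
  matched <- key %in% names(canonical)
  x[matched] <- unname(canonical[key[matched]])
  x
}

normalize_answer(c("Cybersecurity", "Cyber Security", "cyber  security", "Data Science"))
```

Answers without an entry in the lookup table are returned unchanged, so new variants only need a new table entry.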


## Prerequisites to Rerun Data Manipulations

- [R][RProj] (>= 3.2.3)
- [dplyr][dplyrGH] (>= 0.4.3) [CRAN][dplyrCRAN]
- [Rcpp][RcppGH] (>= 0.12.4) [CRAN][RcppCRAN]

[RProj]: https://www.r-project.org/
[dplyrGH]: https://github.com/hadley/dplyr
[RcppGH]: https://github.com/RcppCore/Rcpp
[dplyrCRAN]: https://cran.r-project.org/web/packages/dplyr/index.html
[RcppCRAN]: https://cran.r-project.org/web/packages/Rcpp/index.html


## Reproduce Cleaning and Combining of Data

Running the following commands will create a new file,
`2016-New-Coders-Survey.csv`, in the `clean-data/` directory.

```shell
git clone https://github.com/FreeCodeCamp/2016-new-coder-survey.git
cd 2016-new-coder-survey/clean-data
Rscript clean-data.R
```


## Cleaning Pipeline

1. Rename columns
2. Clean free-text fields for the appropriate questions