Spr2017-proj5-grp5 created by GitHub Classroom

ADS Project 5:

Term: Spring 2017

  • Team 5

  • Project title: The Fragile Families Challenge

  • Team members

    Please be advised: to protect the privacy of the fragile families, the organizers at Princeton University do not allow us to upload any of the relevant datasets.

  • Project summary: In this project, we use data from the Fragile Families and Child Wellbeing Study, which has followed thousands of American families for more than 15 years. During this time, the study collected information about the children, their parents, their schools, and their larger environments.

Given all the background data from birth to year 9 and some training data from year 15, we infer six key outcomes (gpa, grit, material_hardship, eviction, layoff, jobtraining) in the year 15 test data. In the data cleaning process, we handle categorical and continuous variables separately. For each continuous variable, we replace the NAs with the median value of that variable and create a new categorical variable that flags the NAs (since missingness itself may carry some information); we then attach this indicator to the original categorical features. For each categorical feature, we replace the NAs with a number that does not appear in the original dataset and transform the feature into a dummy matrix whose elements are either 0 or 1. From every dummy matrix we keep the 2nd through last columns (that is, we drop the first column) to avoid collinearity.
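The cleaning steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in this repository: the function names, the use of `None` to stand for NA, and the sentinel value `-999` are all assumptions made for the example.

```python
from statistics import median


def impute_continuous(values):
    """Fill missing entries (None) with the median of the observed values.

    Returns the imputed column together with a 0/1 indicator column that
    records which entries were originally missing.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing


def encode_categorical(values, na_code=-999):
    """Recode missing entries to a sentinel, then build dummy columns.

    na_code is a hypothetical sentinel; it must not already occur in the
    data. The first level is dropped so the remaining dummy columns are
    not collinear with an intercept.
    """
    if na_code in values:
        raise ValueError("na_code already appears in the data")
    recoded = [na_code if v is None else v for v in values]
    levels = sorted(set(recoded))
    kept_levels = levels[1:]  # keep the 2nd through last columns
    rows = [[1 if v == level else 0 for level in kept_levels] for v in recoded]
    return kept_levels, rows


imputed, was_missing = impute_continuous([1.0, None, 3.0, 5.0])
kept_levels, dummies = encode_categorical([2, None, 1, 2])
```

Here `was_missing` plays the role of the new indicator variable, and it would itself be treated as a categorical feature in the later steps.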

Given the cleaned data, our team split the work in two directions: one subgroup explored different feature sets and the other compared different machine learning tools. Once we learned that xgboost clearly outperformed the other methods, we worked together on features selected from various angles. Case 1: data obtained when the children are at age 9, using only the continuous variables. Case 2: data obtained when the children are at age 9, using the categorical variables. We then bag the two models by taking a weighted average of their predictions. We apply the same strategy to the other continuous outcomes (grit, material hardship).
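The weighted-average step can be illustrated with a short sketch. The prediction values and the weights below are made up for the example; the README does not state which weights were used or how they were chosen.

```python
def weighted_average(predictions, weights):
    """Combine per-model prediction lists into one list of blended predictions.

    predictions: one list of predicted values per model, all of equal length.
    weights: one non-negative weight per model; they are normalized here,
    so they do not need to sum to 1.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all models must predict the same number of rows")
    return [
        sum(w * p[i] for p, w in zip(predictions, weights)) / total
        for i in range(n)
    ]


# Hypothetical predictions for two children from the two cases.
case1_continuous = [3.0, 2.0]
case2_categorical = [2.0, 4.0]
blended = weighted_average([case1_continuous, case2_categorical], [3, 1])
```

In practice the weights would normally be tuned on held-out data, for example in proportion to each model's validation performance.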

Contribution statement: (default) All team members contributed equally in all stages of this project. All team members approve our work presented in this GitHub repository including this contributions statement.

Following suggestions by Rich FitzJohn (@richfitz), this folder is organized as follows.

proj/
├── lib/
├── data/
├── doc/
├── figs/
└── output/

Please see each subfolder for a README file.