It doesn't look like there's a ton of info in the features in this dataset. I'd be interested to know how Grouper's best algorithm does with these features. I tried lots of different things (feature engineering, various linear models, SVMs, random forests, etc.). Most were around 56% accurate on cross-validation sets. The best random forest model was about 57% accurate and finds, probably unsurprisingly, that the Facebook activity variables are most predictive of whether the users become Facebook friends.
My code includes an IPython notebook for exploratory plotting, model building, and variable importance. It uses grid search and cross-validation to find the best type of model. You can run the regular Python code to train that model:
/bin/bash cleanDataSets.sh
virtualenv .
source bin/activate
pip install -r requirements.txt
python grouper_model.py
Final predictions are in test_data_withpreds.csv
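For anyone who wants a feel for the grid search plus cross-validation step without opening the notebook, here is a minimal sketch. It assumes scikit-learn, uses synthetic data from make_classification as a stand-in for the cleaned Grouper features, and the parameter grid is hypothetical. It is not the exact search in grouper_model.py.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); the real features come from the cleaned CSVs.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

# Hypothetical grid; the values actually searched in the notebook may differ.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# 5-fold cross-validated grid search, scored by accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated accuracy of the best setting
# Per-feature importances from the refit best model (one value per column of X).
importances = search.best_estimator_.feature_importances_

print(best_params, round(best_score, 3))
```

Sorting importances in descending order gives the same kind of variable ranking the notebook uses to compare the Facebook activity variables against the rest.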