In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

from scripts.dataset_explorer import SODataSetExplorer

## Exploring Survey Features

This notebook identifies similar features across the yearly surveys and tracks in which years each one is present.

The following steps are performed per feature:
* Display similar columns across years based on a substring. For example, look for feature columns with *edu* in their name.
* Rename columns to standardize feature names.
* Summarize feature presence across years.
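
The three steps above can be sketched in plain pandas as follows. This is only an illustration: the `surveys` dictionary, the toy data, and the three function bodies are assumptions, not the actual implementation of `SODataSetExplorer`, whose source is not shown in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the survey data: one small DataFrame per year.
surveys = {
    2018: pd.DataFrame({"hobby": ["Yes"], "gender": ["Male"]}),
    2019: pd.DataFrame({"hobbyist": ["No"], "gender": ["Woman"]}),
}

def similar_columns(surveys, substrings):
    """Return, per year, the column names containing any of the substrings."""
    return {
        year: {col for col in df.columns if any(s in col for s in substrings)}
        for year, df in surveys.items()
    }

def rename_columns(surveys, mapping):
    """Apply the same old-name -> new-name mapping to every year."""
    return {year: df.rename(columns=mapping) for year, df in surveys.items()}

def feature_presence(surveys, feature):
    """List the years in which a column named `feature` exists."""
    return sorted(year for year, df in surveys.items() if feature in df.columns)

found = similar_columns(surveys, ["hobby"])
surveys = rename_columns(surveys, {"hobby": "hobbyist"})
years = feature_presence(surveys, "hobbyist")
```

Plain substring matching is simple but returns unrelated columns whenever the substring happens to occur inside a longer name, so short search terms need manual filtering.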

Features:
* [Job Satisfaction](#6)
* [Remote Status](#5)
* [Main Branch](#8)
* [Compensation](#20)
* [Developer Type](#17)
* [Education](#0)
* [Developer Major](#1)
* [Job Factors](#2)
* [Employment](#3)
* [Organization Size](#4)
* [Ethnicity](#7)
* [Hobby](#9)
* [Age](#10)
* [Gender](#11)
* [Sexuality](#12)
* [Learning related features](#13)
* [Code related features](#14)
* [Hours](#15)
* [Industry](#16)
* [Occupation - years_experience](#18)

## Datasets load

A helper class handles loading the survey datasets and applying common feature transformations.

In [3]:
df_explorer = SODataSetExplorer()

The helper class implements methods such as the following. *dataset_cols_difference* takes a base year and a list of years to compare against, and shows the columns present in the base year but absent from the others.

In [4]:
df_explorer.dataset_cols_difference(base_year=2011, years_against=[2012]);

2011 columns not present in 2012
{'stackoverflow_sites_most_visited', 'recommendation_likely_acted_upon'}
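
Conceptually this comparison is a set difference. The sketch below shows the idea on toy column lists; `cols_difference` and `columns_by_year` are hypothetical names, and only the two 2011-only column names are taken from the output above.

```python
def cols_difference(columns_by_year, base_year, years_against):
    """Columns of base_year that appear in none of the years_against."""
    others = set()
    for year in years_against:
        others |= set(columns_by_year[year])
    return set(columns_by_year[base_year]) - others

# Toy column lists, not the full survey schemas.
columns_by_year = {
    2011: ["age", "job_satisfaction",
           "stackoverflow_sites_most_visited",
           "recommendation_likely_acted_upon"],
    2012: ["age", "job_satisfaction"],
}
diff = cols_difference(columns_by_year, base_year=2011, years_against=[2012])
```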




### Job Satisfaction <a id="6"></a>

In [5]:
df_explorer.similar_columns(['job_sat',]);

2011
{'job_satisfaction'}


2012
{'job_satisfaction'}


2013
{'job_satisfaction'}


2015
{'job_satisfaction'}


2016
{'job_satisfaction'}


2017
{'job_satisfaction'}


2018
{'job_satisfaction'}


2019
{'job_sat'}


2020
{'job_sat'}




In [6]:
job_satisfaction_rename = {
    'job_sat': 'job_satisfaction', 
}
df_explorer.rename_columns(job_satisfaction_rename)

* `job_satisfaction`. 2011-2020 except 2014. categorical.

### Remote Status <a id="5"></a>

In [7]:
df_explorer.similar_columns(['remote', 'work', 'loc']);

2012
{'add_rate_what_ads_i_use_an_ad_blocker'}


2013
{'add_rate_what_ads_i_use_an_ad_blocker',
 'importance_40_hour_work_week',
 'importance_high_caliber_team_is_everyone_else_smart_hardworking',
 'importance_limited_night_weekend_work',
 'importance_lots_of_control_over_your_own_work',
 'importance_quality_of_workstation_dream_machine_30inch_monitors_etc'}


2014
{'add_rate_what_ads_i_use_an_ad_blocker',
 'enjoy_working_remotely',
 'job_opportunity_email_details_importance_for_response_describes_benefits_perks_of_the_work_environment',
 'job_opportunity_email_details_importance_for_response_describes_the_team_i_will_work_on',
 'remote_location',
 'remote_status'}


2015
{'how_important_is_remote_when_evaluating_new_job_opportunity',
 'remote_status',
 'want_work_language',
 'want_work_language_other'}


2016
{'remote', 'agree_adblocker'}


2017
{'assess_job_remote',
 'collaborate_remote',
 'have_worked_database',
 'have_worked_framework',
 'have_worked_language',
 'have_worked_platfo

* `remote_status`, `remote`, `home_remote`, `assess_job_remote`, `collaborate_remote`, `work_remote`. 2014-2017,2019
* `remote_location`, `work_loc`. 2014, 2019

In [8]:
remote_rename = {
    'remote_status': 'remote', 
    'home_remote': 'remote', 
    'work_remote': 'remote', 
}
df_explorer.rename_columns(remote_rename)

* `remote`. 2014-2017, 2019. categorical. indicates the frequency of remote working.

### Developer Main Branch <a id="8"></a>

In [9]:
df_explorer.similar_columns(['branch']);

2019
{'main_branch'}


2020
{'main_branch'}




In [10]:
df_explorer.display_feature_across_years('main_branch', display, year_summary=False, feature_per_year=True)

Unnamed: 0,year,main_branch
0,2019,I am a developer by profession
1,2019,I am a student who is learning to code
2,2019,"I am not primarily a developer, but I write co..."
3,2019,I code primarily as a hobby
4,2019,"I used to be a developer by profession, but no..."
5,2020,I am a developer by profession
6,2020,I am a student who is learning to code
7,2020,"I am not primarily a developer, but I write co..."
8,2020,I code primarily as a hobby
9,2020,"I used to be a developer by profession, but no..."


In [11]:
df_explorer.display_feature_across_years('professional', display)

Unnamed: 0_level_0,Unnamed: 1_level_0,count,percentage
year,professional,Unnamed: 2_level_1,Unnamed: 3_level_1
2017,None of these,914,0.0178
2017,Professional developer,36131,0.703
2017,Professional non-developer who sometimes writes code,5140,0.1
2017,Student,8224,0.16
2017,Used to be a professional developer,983,0.0191


* `main_branch`. 2019-2020. categorical.
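
A count and percentage table like the one above can be computed with a grouped count. The following is a minimal sketch on toy data, assuming one long DataFrame with a `year` column; `count_and_percentage` is a hypothetical function, not the helper's own code. Missing answers are excluded from the totals because `groupby` drops them by default.

```python
import pandas as pd

def count_and_percentage(df, feature):
    """Per-year answer counts and their share of that year's answers."""
    counts = df.groupby(["year", feature]).size().rename("count")
    totals = counts.groupby(level="year").transform("sum")
    return pd.DataFrame({
        "count": counts,
        "percentage": (counts / totals).round(4),
    })

# Toy data: four respondents in a single year.
toy = pd.DataFrame({
    "year": [2017, 2017, 2017, 2017],
    "professional": [
        "Student",
        "Professional developer",
        "Professional developer",
        "Professional developer",
    ],
})
summary = count_and_percentage(toy, "professional")
```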

### Compensation <a id="20"></a>

### Developer Type <a id="17"></a>

In [12]:
df_explorer.similar_columns(['occupation', 'dev_type', 'developer_type']);

2011
{'occupation'}


2012
{'occupation'}


2013
{'occupation'}


2014
{'occupation'}


2015
{'occupation'}


2016
{'occupation', 'occupation_group'}


2017
{'developer_type',
 'mobile_developer_type',
 'non_developer_type',
 'web_developer_type'}


2018
{'dev_type'}


2019
{'dev_type'}


2020
{'dev_type'}




In [13]:
df_explorer.display_feature_across_years('occupation', display);

Unnamed: 0_level_0,Unnamed: 1_level_0,count,percentage
year,occupation,Unnamed: 2_level_1,Unnamed: 3_level_1
2011,Database Administrator,23,0.0087
2011,Desktop Application Developer,419,0.1584
2011,Embedded Application Developer,115,0.0435
2011,"Executive (VP of Eng, CTO, CIO, etc.)",70,0.0265
2011,IT Manager,28,0.0106
...,...,...,...
2016,Product manager,333,0.0067
2016,Quality Assurance,379,0.0077
2016,Student,5619,0.1135
2016,System administrator,745,0.0150


### Education <a id="0"></a>

In [14]:
df_explorer.similar_columns(['ed_', 'education']);

2011
{'techn_related_purchases_last_year', 'recommendation_likely_acted_upon'}


2012
{'add_rate_ive_taken_a_trial/purchased_a_product_from_ads',
 'techn_related_purchases_last_year'}


2013
{'add_rate_ive_taken_a_trial/purchased_a_product_from_ads',
 'changed_job_last_year',
 'importance_limited_night_weekend_work',
 'preferred_mobile_support',
 'preferred_software_business_model',
 'techn_related_purchases_last_year'}


2014
{'add_rate_ive_taken_a_trial/purchased_a_product_from_ads',
 'changed_job_last_year',
 'contacted_about_job_opportunities_email',
 'contacted_about_job_opportunities_linkedin_inmail',
 'contacted_about_job_opportunities_phone',
 'contacted_about_job_opportunities_stackedoverflow_careers_message',
 'contacted_about_job_opportunities_twitter',
 'contacted_by_recruiters_frecuency',
 'job_opportunity_email_details_importance_for_response_message_is_personalized_to_me',
 'preferred_mobile_support'}


2015
{'changed_jobs_in_last_12_months',
 'education',
 'education_ot

In [15]:
df_explorer.display_feature_across_years('ed_level', display, year_summary=False, feature_per_year=True)

Unnamed: 0,year,ed_level
0,2019,Associate degree
1,2019,"Bachelor’s degree (BA, BS, B.Eng., etc.)"
2,2019,I never completed any formal education
3,2019,"Master’s degree (MA, MS, M.Eng., MBA, etc.)"
4,2019,"Other doctoral degree (Ph.D, Ed.D., etc.)"
5,2019,Primary/elementary school
6,2019,"Professional degree (JD, MD, etc.)"
7,2019,"Secondary school (e.g. American high school, G..."
8,2019,Some college/university study without earning ...
9,2020,"Associate degree (A.A., A.S., etc.)"


A brief summary of the feature could be made as follows:

* `education`. 2015-2016. multiple choice categorical feature.
* `education_types`. 2017-2018. multiple choice categorical features.
* `ed_level`. 2019-2020. categorical.

The same steps will be applied to the rest of the features.

### Developer Major <a id="1"></a>

In [16]:
df_explorer.similar_columns(['major']);

2017
{'major_undergrad'}


2018
{'undergrad_major'}


2019
{'undergrad_major'}


2020
{'undergrad_major'}




Here we first rename the columns to standardize the name, so that they can be treated as a single feature when displaying how it differs across years.

In [17]:
major_rename = {'major_undergrad': 'undergrad_major'}

In [18]:
df_explorer.rename_columns(major_rename)

In [19]:
df_explorer.display_feature_across_years('undergrad_major', display, year_summary=False, feature_per_year=True)

Unnamed: 0,year,undergrad_major
0,2017,A business discipline
1,2017,A health science
2,2017,A humanities discipline
3,2017,A natural science
4,2017,A non-computer-focused engineering discipline
5,2017,A social science
6,2017,Computer engineering or electrical/electronics...
7,2017,Computer programming or Web development
8,2017,Computer science or software engineering
9,2017,Fine arts or performing arts


* `undergrad_major`. 2017 onwards. categorical.

### Job Factors <a id="2"></a>

In [20]:
df_explorer.similar_columns(['factor']);

2013
{'time_per_week_refactoring_code_quality'}


2014
{'time_per_week_refactoring_code_quality'}


2019
{'job_factors'}


2020
{'job_factors'}




In [21]:
df_explorer.display_feature_across_years('job_factors', display, year_summary=False, feature_per_year=True)

Unnamed: 0,year,job_factors
0,2019,Diversity of the company or organization
1,2019,Diversity of the company or organization;Flex ...
2,2019,Diversity of the company or organization;How w...
3,2019,Diversity of the company or organization;How w...
4,2019,Financial performance or funding status of the...
...,...,...
400,2020,Specific department or team I’d be working on;...
401,2020,Specific department or team I’d be working on;...
402,2020,Specific department or team I’d be working on;...
403,2020,Specific department or team I’d be working on;...


* `job_factors`. 2019-2020. multiple choice categorical feature.
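
Because each answer stores all selected options in one semicolon-separated string, every combination shows up as its own category. One way to count individual options is to split and explode the strings. This is a sketch on toy answers; `Placeholder option` is an invented label, not a real survey option.

```python
import pandas as pd

answers = pd.Series([
    "Diversity of the company or organization;Placeholder option",
    "Placeholder option",
    None,  # respondent skipped the question
])

# One row per selected option; missing answers are dropped first.
options = answers.dropna().str.split(";").explode().str.strip()
option_counts = options.value_counts()
```

The same approach would apply to the other multiple choice features in this notebook, as long as they use the same separator.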

### Employment <a id="3"></a>

In [22]:
df_explorer.similar_columns(['employ']);

2015
{'employment_status'}


2016
{'employment_status'}


2017
{'employment_status'}


2018
{'employment'}


2019
{'employment'}


2020
{'employment'}




In [23]:
employment_rename = {'employment_status': 'employment'}

In [24]:
df_explorer.rename_columns(employment_rename)

In [25]:
df_explorer.display_feature_across_years('employment', display, year_summary=False, feature_per_year=True)

Unnamed: 0,year,employment
0,2015,Employed full-time
1,2015,Employed part-time
2,2015,Freelance / Contractor
3,2015,I'm a student
4,2015,Other
5,2015,Prefer not to disclose
6,2015,Retired
7,2015,Unemployed
8,2016,Employed full-time
9,2016,Employed part-time


* `employment`. 2015 onwards. categorical

### Organization Size <a id="4"></a>

In [26]:
df_explorer.similar_columns(['size']);

2011
{'company_size'}


2012
{'company_size'}


2013
{'company_developer_size', 'company_size', 'team_size'}


2014
{'company_developer_size'}


2016
{'company_size_range', 'team_size_range'}


2017
{'company_size'}


2018
{'company_size'}


2019
{'org_size'}


2020
{'org_size'}




In [27]:
org_size_rename = {'company_size': 'org_size'}
df_explorer.rename_columns(org_size_rename)

In [28]:
df_explorer.display_feature_across_years('org_size', display, year_summary=False, feature_per_year=True)

Unnamed: 0,year,org_size
0,2011,"Fortune 1000 (1,000+)"
1,2011,Mature Small Business (25-100)
2,2011,Mid Sized (100-999)
3,2011,"Other (not working, consultant, etc.)"
4,2011,Start Up (1-25)
5,2011,Student
6,2012,"Fortune 1000 (1,000+)"
7,2012,Mature Small Business (25-100)
8,2012,Mid Sized (100-999)
9,2012,"Other (not working, consultant, etc.)"


* `org_size`. 2017 onwards. categorical range.

### Ethnicity <a id="7"></a>

In [29]:
df_explorer.similar_columns(['ethnicity']);

2018
{'race_ethnicity'}


2019
{'ethnicity'}


2020
{'ethnicity'}




In [30]:
ethnicity_rename = {'race_ethnicity': 'ethnicity'}
df_explorer.rename_columns(ethnicity_rename)

In [31]:
df_explorer.display_feature_across_years('ethnicity', display)

Unnamed: 0_level_0,Unnamed: 1_level_0,count,percentage
year,ethnicity,Unnamed: 2_level_1,Unnamed: 3_level_1
2018,Black or of African descent,1224,0.0213
2018,Black or of African descent;East Asian,7,0.0001
2018,Black or of African descent;East Asian;Hispanic or Latino/Latina,2,0.0000
2018,"Black or of African descent;East Asian;Hispanic or Latino/Latina;Middle Eastern;Native American, Pacific Islander, or Indigenous Australian",1,0.0000
2018,"Black or of African descent;East Asian;Hispanic or Latino/Latina;Middle Eastern;Native American, Pacific Islander, or Indigenous Australian;South Asian",1,0.0000
...,...,...,...
2020,White or of European descent;Multiracial,66,0.0014
2020,White or of European descent;Multiracial;Southeast Asian,16,0.0003
2020,White or of European descent;South Asian,11,0.0002
2020,White or of European descent;South Asian;Multiracial,7,0.0002


* `ethnicity`. 2018 onwards. multiple choice categorical feature

### Hobby <a id="9"></a>

In [32]:
df_explorer.similar_columns(['hobby']);

2015
{'how_many_hours_programming_as_hobby_per_week'}


2016
{'hobby'}


2017
{'program_hobby'}


2018
{'hobby'}


2019
{'hobbyist'}


2020
{'hobbyist'}




In [33]:
hobbyist_rename = {'new_column_name': 'hobbyist', 'year_map': {2018: 'hobby'}}
df_explorer.rename_columns(hobbyist_rename, by_year=True)

In [34]:
df_explorer.display_feature_across_years('hobbyist', display)

Unnamed: 0_level_0,Unnamed: 1_level_0,count,percentage
year,hobbyist,Unnamed: 2_level_1,Unnamed: 3_level_1
2018,No,18958,0.1918
2018,Yes,79897,0.8082
2019,No,17626,0.1983
2019,Yes,71257,0.8017
2020,No,14028,0.2178
2020,Yes,50388,0.7822


* `hobbyist`. 2018 onwards. boolean feature.
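
The rename above is restricted to a single year, presumably because a column with the same name in another year holds a different kind of answer. A per-year rename could look like the sketch below; `rename_by_year`, the `surveys` dictionary, and the 2016 toy value are assumptions, not the helper's actual behaviour.

```python
import pandas as pd

def rename_by_year(surveys, new_column_name, year_map):
    """Rename a column only in the years listed in year_map."""
    renamed = dict(surveys)
    for year, old_name in year_map.items():
        renamed[year] = surveys[year].rename(columns={old_name: new_column_name})
    return renamed

surveys = {
    2016: pd.DataFrame({"hobby": ["1-2 hours per week"]}),  # hypothetical value
    2018: pd.DataFrame({"hobby": ["Yes"]}),
    2019: pd.DataFrame({"hobbyist": ["No"]}),
}
surveys = rename_by_year(surveys, "hobbyist", {2018: "hobby"})
```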

### Age <a id="10"></a>

In [35]:
df_explorer.similar_columns(['age']);

2011
{'programming_languages_other', 'age', 'programming_languages'}


2012
{'programming_languages_other', 'age', 'programming_languages'}


2013
{'age',
 'importance_positive_organization_structure_not_much_bureaucracy_helpful_management',
 'programming_languages',
 'programming_languages_other'}


2014
{'age',
 'contacted_about_job_opportunities_stackedoverflow_careers_message',
 'job_opportunity_email_details_importance_for_response_link_to_a_stack_overflow_careers_company_page_or_other_source_of_more_information_about_the_company_videos_articles_etc',
 'job_opportunity_email_details_importance_for_response_message_is_personalized_to_me',
 'programming_languages',
 'programming_languages_other'}


2015
{'age',
 'appealing_message_traits',
 'how_many_caffeinated_beverages_per_day',
 'programming_languages',
 'programming_languages_other',
 'want_work_language',
 'want_work_language_other'}


2016
{'age_midpoint', 'age_range'}


2017
{'equipment_satisfied_storage',
 'have_worked_lang

In [36]:
df_explorer.display_feature_across_years('age', display, year_summary=False, feature_per_year=True)

Unnamed: 0,year,age
0,2011,20-24
1,2011,25-29
2,2011,30-34
3,2011,35-39
4,2011,40-50
...,...,...
279,2020,95
280,2020,96
281,2020,97
282,2020,98


* `age`. 2019-2020. continuous feature. In earlier years it is defined as a categorical range.
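
To compare the continuous ages with the earlier ranges, the numeric values could be binned. The sketch below uses only the five 2011 ranges visible in the truncated output above; the remaining ranges and the way they changed between years are unknown here, so the bin edges are an assumption.

```python
import pandas as pd

ages = pd.Series([22, 29, 50, 19])

# Left-closed bins: [20, 25), [25, 30), ..., [40, 51).
bins = [20, 25, 30, 35, 40, 51]
labels = ["20-24", "25-29", "30-34", "35-39", "40-50"]
age_range = pd.cut(ages, bins=bins, labels=labels, right=False)
```

Ages outside the listed bins, such as 19 in this toy series, come back as missing values.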

In [37]:
df_explorer.display_feature_across_years('age1st_code', display)

Unnamed: 0_level_0,Unnamed: 1_level_0,count,percentage
year,age1st_code,Unnamed: 2_level_1,Unnamed: 3_level_1
2019,10,5061,0.0578
2019,11,3515,0.0401
2019,12,7735,0.0883
2019,13,6377,0.0728
2019,14,8452,0.0964
...,...,...,...
2020,83,1,0.0000
2020,85,4,0.0001
2020,9,1231,0.0213
2020,Older than 85,13,0.0002


* `age1st_code`. 2019-2020. continuous feature (with exceptions).
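
The exceptions are text answers such as `Older than 85` mixed in with numeric strings. A possible way to obtain a numeric column is sketched below. Mapping the text answer to its boundary value is an assumption and only one of several reasonable choices; anything that still cannot be parsed becomes a missing value.

```python
import pandas as pd

raw = pd.Series(["10", "14", "Older than 85", None])

# Assumption: replace the text answer with its boundary value.
replacements = {"Older than 85": "85"}
cleaned = raw.map(lambda value: replacements.get(value, value))

# Unparseable or missing entries are coerced to NaN.
numeric = pd.to_numeric(cleaned, errors="coerce")
```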

### Gender <a id="11"></a>

In [38]:
df_explorer.similar_columns(['gender']);

2014
{'gender'}


2015
{'gender'}


2016
{'gender'}


2017
{'gender'}


2018
{'gender'}


2019
{'gender'}


2020
{'gender'}




In [39]:
df_explorer.display_feature_across_years('gender', display)

Unnamed: 0_level_0,Unnamed: 1_level_0,count,percentage
year,gender,Unnamed: 2_level_1,Unnamed: 3_level_1
2014,Female,352,0.0479
2014,Male,6864,0.9344
2014,Prefer not to disclose,130,0.0177
2015,Female,1480,0.0575
2015,Male,23699,0.9206
...,...,...,...
2020,"Non-binary, genderqueer, or gender non-conforming",385,0.0076
2020,Woman,3844,0.0760
2020,Woman;Man,76,0.0015
2020,"Woman;Man;Non-binary, genderqueer, or gender non-conforming",26,0.0005


* `gender`. 2014-2016: categorical single choice. 2017 onwards: categorical multiple choice.

### Sexuality <a id="12"></a>

In [40]:
df_explorer.similar_columns(['sex']);

2018
{'sexual_orientation'}


2019
{'sexuality'}


2020
{'sexuality'}




In [41]:
sexuality_rename = {'sexual_orientation': 'sexuality'}
df_explorer.rename_columns(sexuality_rename)

In [42]:
df_explorer.display_feature_across_years('sexuality', display)

Unnamed: 0_level_0,Unnamed: 1_level_0,count,percentage
year,sexuality,Unnamed: 2_level_1,Unnamed: 3_level_1
2018,Asexual,717,0.012
2018,Bisexual or Queer,1950,0.0326
2018,Bisexual or Queer;Asexual,51,0.0009
2018,Gay or Lesbian,1181,0.0198
2018,Gay or Lesbian;Asexual,28,0.0005
2018,Gay or Lesbian;Bisexual or Queer,95,0.0016
2018,Gay or Lesbian;Bisexual or Queer;Asexual,15,0.0003
2018,Straight or heterosexual,55013,0.9205
2018,Straight or heterosexual;Asexual,235,0.0039
2018,Straight or heterosexual;Bisexual or Queer,351,0.0059


* `sexuality`. 2018 onwards. multiple choice categorical feature.

In [43]:
df_explorer.similar_columns(['trans']);

2019
{'trans'}


2020
{'trans'}




In [44]:
df_explorer.display_feature_across_years('trans', display)

Unnamed: 0_level_0,Unnamed: 1_level_0,count,percentage
year,trans,Unnamed: 2_level_1,Unnamed: 3_level_1
2019,No,82576,0.9877
2019,Yes,1031,0.0123
2020,No,48871,0.9904
2020,Yes,474,0.0096


* `trans`. 2019-2020. boolean.

### Learning related features <a id="13"></a>

In [45]:
df_explorer.similar_columns(['learn']);

2013
{'importance_opportunity_to_use_learn_new_technologies',
 'time_per_week_learning_new_skills'}


2014
{'time_per_week_learning_new_skills'}


2016
{'why_learn_new_tech'}


2017
{'learning_new_tech', 'learned_hiring'}


2020
{'new_learn'}




In [46]:
df_explorer.display_feature_across_years('why_learn_new_tech', display)

Unnamed: 0_level_0,Unnamed: 1_level_0,count,percentage
year,why_learn_new_tech,Unnamed: 2_level_1,Unnamed: 3_level_1
2016,I learn new technology when my job requires it,2766,0.0599
2016,I want to be a better developer,16236,0.3518
2016,I'm curious,13231,0.2867
2016,Other (please specify),770,0.0167
2016,To build a specific product I have in mind,3647,0.079
2016,To keep my skills up to date,7400,0.1604
2016,To pursue career goals,2095,0.0454


* Learning-related features don't seem to be consistent across years.

### Code related features <a id="14"></a>

In [47]:
df_explorer.similar_columns(['code']);

2013
{'time_per_week_refactoring_code_quality'}


2014
{'job_opportunity_email_details_importance_for_response_mentions_my_code_or_stack_overflow_activity',
 'time_per_week_refactoring_code_quality'}


2016
{'agree_nightcode'}


2017
{'check_in_code',
 'ex_coder10_years',
 'ex_coder_active',
 'ex_coder_balance',
 'ex_coder_belonged',
 'ex_coder_not_for_me',
 'ex_coder_return',
 'ex_coder_skills',
 'ex_coder_will_not_code',
 'other_peoples_code',
 'stack_overflow_copied_code',
 'years_coded_job',
 'years_coded_job_past'}


2018
{'check_in_code'}


2019
{'years_code', 'age1st_code', 'code_rev', 'years_code_pro', 'code_rev_hrs'}


2020
{'age1st_code', 'years_code', 'years_code_pro'}




In [48]:
df_explorer.display_feature_across_years('years_code', display)

Unnamed: 0_level_0,Unnamed: 1_level_0,count,percentage
year,years_code,Unnamed: 2_level_1,Unnamed: 3_level_1
2019,1,1814,0.0206
2019,10,6777,0.0771
2019,11,2265,0.0258
2019,12,3530,0.0401
2019,13,2036,0.0232
...,...,...,...
2020,7,3477,0.0603
2020,8,3407,0.0591
2020,9,2344,0.0406
2020,Less than 1 year,757,0.0131


* `years_coded_job`, `years_coded_job_past`. 2017. categorical range.
* `years_code`, `years_code_pro`. 2019-2020. continuous feature (with exceptions).

### Hours <a id="15"></a>

In [49]:
df_explorer.similar_columns(['hrs', 'hour']);

2013
{'importance_40_hour_work_week'}


2015
{'how_many_hours_programming_as_hobby_per_week'}


2017
{'hours_per_week'}


2018
{'hours_outside', 'hours_computer'}


2019
{'work_week_hrs', 'code_rev_hrs'}


2020
{'work_week_hrs'}




In [50]:
df_explorer.display_feature_across_years('hours_per_week', display)

Unnamed: 0_level_0,Unnamed: 1_level_0,count,percentage
year,hours_per_week,Unnamed: 2_level_1,Unnamed: 3_level_1
2017,0,5129,0.249
2017,1,5901,0.2865
2017,10,559,0.0271
2017,11,58,0.0028
2017,12,126,0.0061
2017,13,37,0.0018
2017,14,77,0.0037
2017,15,134,0.0065
2017,16,41,0.002
2017,17,25,0.0012


* `work_week_hrs`. 2019-2020. continuous feature.

### Industry <a id="16"></a>

In [51]:
df_explorer.similar_columns(['industry']);

2011
{'industry'}


2012
{'industry'}


2013
{'industry'}


2014
{'industry'}


2015
{'industry'}


2016
{'industry'}


2017
{'assess_job_industry'}




In [52]:
df_explorer.display_feature_across_years('industry', display)

Unnamed: 0_level_0,Unnamed: 1_level_0,count,percentage
year,industry,Unnamed: 2_level_1,Unnamed: 3_level_1
2011,Advertising,43,0.0158
2011,Consulting,322,0.1180
2011,Education,157,0.0575
2011,Finance / Banking,189,0.0693
2011,Foundation / Non-Profit,50,0.0183
...,...,...,...
2016,Other (please specify),3802,0.0948
2016,Retail,1023,0.0255
2016,Software Products,8916,0.2223
2016,Telecommunications,1407,0.0351


### Occupation - years_experience <a id="18"></a>

In [53]:
df_explorer.similar_columns(['years_experience']);

2011
{'years_experience'}


2012
{'years_experience'}


2013
{'years_experience'}


2014
{'years_experience'}


2015
{'years_experience'}




In [54]:
df_explorer.display_feature_across_years('years_experience', display)

Unnamed: 0_level_0,Unnamed: 1_level_0,count,percentage
year,years_experience,Unnamed: 2_level_1,Unnamed: 3_level_1
2011,11,1044,0.3826
2011,41310,717,0.2627
2011,41435,822,0.3012
2011,<2,146,0.0535
2012,11,1673,0.2805
2012,40944,1934,0.3243
2012,41070,1663,0.2788
2012,<2,694,0.1164
2013,11,3047,0.3229
2013,2/5/2013,2892,0.3065


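* Wrongly formatted feature: some range answers appear to have been converted to dates or date serial numbers (e.g. `41310`, `2/5/2013`).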
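
Values such as `41310` look like spreadsheet date serial numbers: counted in days from 1899-12-30, `41310` is 2013-02-05, which matches the `2/5/2013` entry seen for 2013. A best-effort recovery is sketched below. It assumes the original answers were ranges written as month-day pairs (for example `2-5`), which should be checked against the original questionnaire; `restore_range` is a hypothetical function.

```python
from datetime import date, timedelta

EXCEL_EPOCH = date(1899, 12, 30)  # conventional origin for spreadsheet serial dates

def restore_range(value):
    """Best-effort recovery of an 'a-b' range that was turned into a date."""
    text = str(value)
    if text.isdigit() and len(text) == 5:
        # Looks like a serial date, e.g. 41310 -> 2013-02-05 -> '2-5'.
        d = EXCEL_EPOCH + timedelta(days=int(text))
        return f"{d.month}-{d.day}"
    if text.count("/") == 2:
        # Looks like a month/day/year string, e.g. '2/5/2013' -> '2-5'.
        month, day, _year = text.split("/")
        return f"{int(month)}-{int(day)}"
    return text  # values such as '11' or '<2' are left untouched

samples = ["41310", "41435", "40944", "41070", "2/5/2013", "<2", "11"]
restored = [restore_range(v) for v in samples]
```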