# DataDrive2030 Early Learning Predictors Challenge
**Can you identify which features of an early learning programme predict better learning outcomes for children?**



The Thrive by Five Index (2021) found that fewer than half of the children attending an early learning programme (such as a preschool or creche) in South Africa start school with the right early learning foundation. Many factors influence whether a child thrives by five, including access to a quality early learning programme, poverty, gender, malnutrition, and emotional well-being. Children without a good foundation struggle to keep up at school and are at a major learning disadvantage.

The Index used the Early Learning Outcomes Measure (ELOM) to assess children and categorised their development as “on track”, “falling behind” or “falling far behind”.

In this competition, your challenge is to use machine learning techniques to identify which early learning programme factors contribute to better learning outcomes in children, by predicting a child’s ELOM score.

This will allow DataDrive2030 to design better interventions that make optimal use of limited resources to ensure South Africa’s children are thriving.

## About DataDrive2030 (datadrive2030.co.za)


Established in March 2022, DataDrive2030 is a South Africa-based social enterprise that supports the collection and use of high-quality data to drive improved child outcomes in the first 6 years of life. Our suite of early learning measurement tools accurately measures a range of developmental outcomes in young children and provides an indication of the quality of the early learning environment in home and programme settings. The tools are digitised and designed for affordable use at scale in all official South African languages, with built-in data quality assurance mechanisms.

We aim to make these tools widely accessible and to ensure that the information they generate is easy to understand and, most importantly, actionable. Our goal is to use data to drive tangible improvements in early childhood development services in South Africa by 2030.

In [None]:
%%time
import pandas as pd
import numpy as np
import requests

# Zindi API token sent with every download request (replace with your own)
myobj = {'auth_token': 'DxZPLjPYHdvGLYPdE26m54hP'}
data_list = ['Train.csv', 'Test.csv', 'SampleSubmission.csv', 'VariableDescription.csv']
target_dir = ''
base_path = 'https://api.zindi.africa/v1/competitions/datadrive2030-early-learning-predictors-challenge/files/'

def load_zindi_data(data_list, base_path, target_dir):
    """Download each competition file into target_dir."""
    for data in data_list:
        target_path = target_dir + data
        data_path = base_path + data
        response = requests.post(data_path, data=myobj, stream=True)
        response.raise_for_status()  # stop early on a failed or unauthorised request
        # The context manager closes the file even if a write fails
        with open(target_path, "wb") as handle:
            for chunk in response.iter_content(chunk_size=512):
                if chunk:  # filter out keep-alive new chunks
                    handle.write(chunk)

load_zindi_data(data_list, base_path, target_dir)

CPU times: user 1.48 s, sys: 150 ms, total: 1.63 s
Wall time: 3.39 s
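
The download cell sends a token that is written directly into the notebook. A minimal sketch of reading it from the environment instead is shown here; the variable name `ZINDI_AUTH_TOKEN` and the helper `get_auth_payload` are assumptions made for illustration, not part of the Zindi API.

```python
import os

def get_auth_payload(env_var="ZINDI_AUTH_TOKEN", environ=None):
    """Build the POST payload from an environment variable (hypothetical name)."""
    # Allow a plain dict to be passed in so the helper is easy to test
    environ = os.environ if environ is None else environ
    token = environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"auth_token": token}

# Example with an explicit mapping rather than the real environment
example_payload = get_auth_payload(environ={"ZINDI_AUTH_TOKEN": "my-token"})
print(example_payload)
```

The returned dictionary has the same shape as `myobj`, so it could be passed as the `data` argument of `requests.post` in the same way.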


In [None]:
import pandas as pd
import numpy as np

In [None]:
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')
ss = pd.read_csv('SampleSubmission.csv')  # sample submission, previewed below
var = pd.read_csv('VariableDescription.csv')

In [None]:
# Preview train
train.head()

Unnamed: 0,child_id,data_year,child_date,child_age,child_enrolment_date,child_months_enrolment,child_grant,child_years_in_programme,child_height,child_observe_attentive,...,obs_cooking_5,obs_cooking_6,obs_heating_1,obs_heating_2,obs_heating_3,obs_heating_4,obs_heating_5,obs_heating_6,obs_heating_7,target
0,ID_SYSJ2FM0D,2022.0,2022-02-03,59.0,,,,,,Sometimes,...,,,,,,,,,,51.5
1,ID_J5BTFOZR3,2019.0,,60.163933,,,,1st year in the programme,103.0,Sometimes,...,,,,,,,,,,55.869999
2,ID_R00SN7AUD,2022.0,2022-03-11,69.0,,,,,108.400002,Often,...,,,,,,,,,,47.52
3,ID_BSSK60PAZ,2021.0,2021-10-13,53.0,2020-01-15,20.0,No,1st year in the programme,98.099998,Almost always,...,,,,,,,,,,58.599998
4,ID_IZTY6TC4D,2021.0,2021-10-13,57.0,2021-10-13,0.0,,2nd year in programme,114.0,Almost always,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,76.599998


In [None]:
# Preview test
test.head()

Unnamed: 0,child_id,data_year,child_date,child_age,child_enrolment_date,child_months_enrolment,child_grant,child_years_in_programme,child_height,child_observe_attentive,...,obs_cooking_4,obs_cooking_5,obs_cooking_6,obs_heating_1,obs_heating_2,obs_heating_3,obs_heating_4,obs_heating_5,obs_heating_6,obs_heating_7
0,ID_0I0999N6S,2021.0,2021-09-20,57.0,,,Yes,2nd year in programme,108.0,Almost always,...,,,,,,,,,,
1,ID_GQ6ONJ4FP,2021.0,2021-10-21,54.0,2021-01-10,9.0,Yes,1st year in the programme,105.0,Almost always,...,,,,,,,,,,
2,ID_YZ76CVRW3,2021.0,2021-05-17,57.0,,,Yes,,101.5,Often,...,,,,,,,,,,
3,ID_BNINCRXH8,2022.0,2022-09-09,59.334702,,,,3rd year in programme,,Almost always,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,ID_1U7GDTLRI,2021.0,2021-10-12,54.0,2021-01-15,8.0,Yes,1st year in the programme,103.5,Often,...,,,,,,,,,,


In [None]:
# Preview submission file
ss.head()

Unnamed: 0,child_id,target,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,feature_11,feature_12,feature_13,feature_14,feature_15
0,ID_0I0999N6S,0,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature
1,ID_GQ6ONJ4FP,0,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature
2,ID_YZ76CVRW3,0,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature
3,ID_BNINCRXH8,0,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature
4,ID_1U7GDTLRI,0,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature,feature
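
The sample submission pairs each `child_id` with a `target` and fifteen `feature_*` slots. The sketch below fills such a frame under the assumption (not confirmed by the preview) that the slots take the names of the features judged most influential for each child, in order. The helper `fill_submission`, the toy template, and the values are hypothetical.

```python
import pandas as pd

def fill_submission(template, predictions, top_features):
    """Return a copy of the template with predictions and feature names filled in.

    top_features holds one list of feature names per row, ordered from most to
    least influential (an assumption about what the slots mean).
    """
    out = template.copy()
    out["target"] = list(predictions)
    feature_cols = [c for c in out.columns if c.startswith("feature_")]
    for i, col in enumerate(feature_cols):
        out[col] = [names[i] for names in top_features]
    return out

# Toy template with two feature slots instead of fifteen
toy_template = pd.DataFrame({
    "child_id": ["ID_A", "ID_B"],
    "target": [0, 0],
    "feature_1": ["feature", "feature"],
    "feature_2": ["feature", "feature"],
})
toy_filled = fill_submission(
    toy_template,
    predictions=[51.5, 47.5],
    top_features=[["child_age", "child_height"], ["child_height", "child_age"]],
)
print(toy_filled)
```

Working on a copy keeps the original template untouched, which makes it easy to rebuild the submission after retraining.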


In [None]:
var.head(49)

Unnamed: 0,Variable Name,Variable Label,Answer Label
0,child_id,Unique child ID,Open ended
1,data_year,Year data was collected,Open ended
2,child_date,ELOM date,Open ended
3,child_age,Child age in months,Open ended
4,child_enrolment_date,Date enrolled in ELP,Open ended
5,child_months_enrolment,Months enrolled at ELP,Open ended
6,child_grant,Does the childs primary caretaker receive the ...,
7,child_years_in_programme,For how many years has this child been in the ...,
8,child_height,Child height in cm,Open ended
9,target,Total for Elom over all items,Open ended
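
Some labels in the preview are cut off with `...`. A small lookup helper, sketched below on a toy stand-in for `var`, can return the full label for one variable. The column names `Variable Name` and `Variable Label` come from the output above; the helper name `describe_variable` is hypothetical.

```python
import pandas as pd

def describe_variable(var_df, name):
    """Return the full label for one variable, or None if it is not listed."""
    match = var_df.loc[var_df["Variable Name"] == name, "Variable Label"]
    return None if match.empty else match.iloc[0]

# Toy stand-in for the variable description table
toy_var = pd.DataFrame({
    "Variable Name": ["child_age", "child_height"],
    "Variable Label": ["Child age in months", "Child height in cm"],
})
print(describe_variable(toy_var, "child_age"))
```

With the real table loaded, `describe_variable(var, 'child_grant')` would return the untruncated question text.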


# Drop any column with at least 30% null values

In [None]:
# Percentage of missing values in each train column
percent_missing = train.isnull().sum() * 100 / len(train)
missing_value_train = pd.DataFrame({'column_name': train.columns,
                                  'percent_missing': percent_missing})
# Columns where at least 30% of the values are missing
columns_to_drop = list(percent_missing[percent_missing >= 30].index)
missing_value_train

Unnamed: 0,column_name,percent_missing
child_id,child_id,0.000000
data_year,data_year,0.000000
child_date,child_date,21.211415
child_age,child_age,0.000000
child_enrolment_date,child_enrolment_date,69.470006
...,...,...
obs_heating_4,obs_heating_4,73.663366
obs_heating_5,obs_heating_5,73.663366
obs_heating_6,obs_heating_6,73.663366
obs_heating_7,obs_heating_7,73.663366
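
The same threshold rule can be wrapped in a small function so the cut-off is easy to vary. This is a sketch on toy data; the function name `columns_over_missing_threshold` is hypothetical, and it relies on `isnull().mean()` giving the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

def columns_over_missing_threshold(df, threshold_pct=30.0):
    """List the columns whose share of missing values is at least threshold_pct."""
    pct_missing = df.isnull().mean() * 100
    return pct_missing[pct_missing >= threshold_pct].index.tolist()

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],        # 0% missing
    "b": [np.nan, np.nan, 3.0, 4.0],  # 50% missing
    "c": [1.0, np.nan, 3.0, 4.0],     # 25% missing
})
print(columns_over_missing_threshold(toy))  # ['b']
```

Lowering `threshold_pct` to 25 would also flag column `c`, since the comparison is inclusive.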


In [None]:
train = train.drop(columns = columns_to_drop)
test = test.drop(columns = columns_to_drop)

print(train.shape, test.shape)

(8585, 49) (3680, 48)


In [None]:
print(f'We have {train.shape[0]} rows and {train.shape[1]} columns in the train dataset')
print(f'We have {test.shape[0]} rows and {test.shape[1]} columns in the test dataset')

We have 8585 rows and 49 columns in the train dataset
We have 3680 rows and 48 columns in the test dataset
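
The one-column difference between the two shapes is presumably the `target` column, which the train preview shows and the test preview does not. A quick way to check which columns differ is sketched below on toy frames; the helper name `extra_columns` is hypothetical.

```python
import pandas as pd

def extra_columns(train_df, test_df):
    """Return (columns only in train, columns only in test), in original order."""
    only_train = [c for c in train_df.columns if c not in test_df.columns]
    only_test = [c for c in test_df.columns if c not in train_df.columns]
    return only_train, only_test

# Toy frames standing in for train and test
toy_train = pd.DataFrame({"child_id": ["ID_A"], "child_age": [59.0], "target": [51.5]})
toy_test = pd.DataFrame({"child_id": ["ID_B"], "child_age": [57.0]})
print(extra_columns(toy_train, toy_test))  # (['target'], [])
```

Running `extra_columns(train, test)` on the real frames would confirm whether `target` is the only difference.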
