# Box Plots for Education

### Summary

Budgets for schools and school districts are huge, complex, and unwieldy. It's no easy task to digest where and how schools are using their resources. Education Resource Strategies (ERS) is a non-profit that tackles just this task with the goal of letting districts be smarter, more strategic, and more effective in their spending.

Your task is a **multi-class-multi-label classification** problem with the goal of attaching canonical labels to the freeform text in budget line items. There are 9 broad categories that each take on many possible sub-label instances.

These labels let ERS understand how schools are spending money and tailor their strategy recommendations to improve outcomes for students, teachers, and administrators.

### Competition Description

In order to compare budget or expenditure data across districts, ERS assigns every line item to certain categories in a comprehensive financial spending framework. For instance, Object_Type describes what the spending "is"—Base Salary/Compensation, Benefits, Stipends & Other Compensation, Equipment & Equipment Lease, Property Rental, and so on. Other categories describe what the spending "does," which groups of students benefit, and where the funds come from.

Once this process is complete, we can finally offer cross-district insight into a partner's finances. We might observe that a particular partner spends more on facilities and maintenance than peer districts, or staffs teaching assistants more richly. These findings are not in themselves good or bad—they depend on the context, goals, and strategy of the partner district.

This task (which we call financial coding) is very time- and labor-intensive, which limits our ability to provide this analysis to districts. It typically takes us several weeks to reliably code a financial file. Furthermore, the challenges of financial coding limit the quality of our comparisons, since the only districts in our comparison database are those with whom we've gone through this lengthy, laborious process.

The right algorithm, paired with some human checks, will allow us to code financial files more accurately, more quickly, and more cheaply. As a result, we will be able to offer these valuable insights to many more districts at a much lower cost, greatly extending our impact. Eventually, we hope to offer a free self-service version of the algorithm through our website, which would allow any district to upload their data and receive comparisons to similar districts on a time scale of days or even hours.

### Dataset features

Your goal is to predict the probability that a certain label is attached to a budget line item. Each row in the budget consists mostly of free-form text features, except for the two below that are noted as float. Any of the fields may be empty.

- FTE float - If an employee, the percentage of full-time that the employee works.
- Facility_or_Department - If expenditure is tied to a department/facility, that department/facility.
- Function_Description - A description of the function the expenditure was serving.
- Fund_Description - A description of the source of the funds.
- Job_Title_Description - If this is an employee, a description of that employee's job title.
- Location_Description - A description of where the funds were spent.
- Object_Description - A description of what the funds were used for.
- Position_Extra - Any extra information about the position that we have.
- Program_Description - A description of the program that the funds were used for.
- SubFund_Description - More detail on Fund_Description.
- Sub_Object_Description - More detail on Object_Description.
- Text_1 - Any additional text supplied by the district.
- Text_2 - Any additional text supplied by the district.
- Text_3 - Any additional text supplied by the district.
- Text_4 - Any additional text supplied by the district.
- Total float - The total cost of the expenditure.
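
Because most of these features are free-form text and any of them may be empty, one common first step is to merge the text fields of each row into a single string before vectorizing. The sketch below is illustrative only: the helper name `combine_text_columns` and the toy rows are assumptions made for this example, not part of the dataset or of any official solution.

```python
import pandas as pd

# The two float features named in the list above.
NUMERIC_COLUMNS = ["FTE", "Total"]


def combine_text_columns(df, numeric_columns=NUMERIC_COLUMNS):
    """Join the non-numeric fields of each row into one space-separated string."""
    # Drop the numeric features if they are present.
    text_df = df.drop(columns=[c for c in numeric_columns if c in df.columns])
    # Empty fields become empty strings so they can be skipped when joining.
    text_df = text_df.fillna("").astype(str)
    return text_df.apply(lambda row: " ".join(v for v in row if v), axis=1)


# Toy rows with made-up values, only to show the behavior on empty fields.
sample = pd.DataFrame({
    "FTE": [1.0, None],
    "Object_Description": ["Teacher salaries", None],
    "Job_Title_Description": ["TEACHER", None],
    "Text_1": [None, "bus fuel"],
    "Total": [50000.0, 1200.5],
})

combined = combine_text_columns(sample)
print(combined.tolist())
```

The resulting series can then be passed to a text vectorizer, while `FTE` and `Total` are handled separately as numeric inputs.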



### Dataset labels

For each of these rows, ERS attaches one label from each of 9 different categories:

- `Function`
- `Object_Type`
- `Operating_Status`
- `Position_Type`
- `Pre_K`
- `Reporting`
- `Sharing`
- `Student_Type`
- `Use`

For a list of `sub_labels`, check the [competition page](https://www.drivendata.org/competitions/46/box-plots-for-education-reboot/page/86/#labels_list).
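
Since each row carries exactly one label per category, the targets can be expanded into binary indicator columns, one per category and sub-label pair. Here is a minimal sketch using `pandas.get_dummies`. The `Object_Type` values come from the description above, while the `Pre_K` values are placeholders invented for this example rather than real sub-labels.

```python
import pandas as pd

# Toy label frame with two of the nine categories.
# "placeholder_a" / "placeholder_b" are NOT real sub-labels (assumption for illustration).
toy_labels = pd.DataFrame({
    "Object_Type": ["Benefits", "Base Salary/Compensation", "Benefits"],
    "Pre_K": ["placeholder_a", "placeholder_b", "placeholder_a"],
})

# One indicator column per (category, sub-label) pair, named Category__SubLabel.
indicators = pd.get_dummies(toy_labels, prefix_sep="__", dtype=int)

print(indicators.columns.tolist())
# Each row has exactly one active indicator per category.
print(indicators.sum(axis=1).tolist())
```

With all nine categories included, every row would have nine active indicators, one for each category.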

### Submission format

Your goal is to predict a probability for each possible label in the dataset given a row of new data. Each of these probabilities goes in a separate column in the submission file. The submission must be `50064 x 104`, where `50064` is the number of rows in the test dataset (excluding the header) and `104` is the number of columns (excluding a first column of row ids). The columns in the submission have the format `ColumnName__PossibleLabel`, which is simply a flattening of the categories listed above and their sub-labels; the full list of sub-labels is on the competition page.
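
Before uploading, it can help to sanity-check a submission against these requirements. The function below is a rough sketch, not an official validator: the name `check_submission` is hypothetical, and it assumes the row ids are stored in the DataFrame index so that they do not count toward the 104 columns.

```python
import pandas as pd

EXPECTED_ROWS = 50064  # rows in the test set, excluding the header
EXPECTED_COLS = 104    # label columns, excluding the row-id column


def check_submission(sub, n_rows=EXPECTED_ROWS, n_cols=EXPECTED_COLS):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if sub.shape != (n_rows, n_cols):
        problems.append(f"shape is {sub.shape}, expected {(n_rows, n_cols)}")
    # Column names should look like ColumnName__PossibleLabel.
    bad_names = [c for c in sub.columns if "__" not in c]
    if bad_names:
        problems.append(f"columns without '__' separator: {bad_names}")
    # Every entry should be a probability.
    if sub.isna().any().any() or ((sub < 0) | (sub > 1)).any().any():
        problems.append("values must be probabilities between 0 and 1")
    return problems


# Tiny stand-in for a real submission; the sub-label names are placeholders.
toy_submission = pd.DataFrame(
    {"Pre_K__placeholder_a": [0.7, 0.2], "Pre_K__placeholder_b": [0.3, 0.8]},
    index=pd.Index([101, 102], name="row_id"),
)

print(check_submission(toy_submission, n_rows=2, n_cols=2))  # no problems
print(check_submission(toy_submission))  # reports the shape mismatch
```

If the checks pass, something like `sub.to_csv("submission.csv")` writes the index as the first column of row ids; the file name here is only an example.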

[Competition Page](https://www.drivendata.org/competitions/46/box-plots-for-education-reboot/)

[GitHub repo for the 1st, 2nd, and 3rd place submissions](https://github.com/drivendataorg/box-plots-for-education)

[Course resources repo](https://github.com/datacamp/course-resources-ml-with-experts-budgets)