# Practice Project 2.1 - Preparing School Data
As always, we start with the Problem Solving Framework.

## Business Understanding
A school district wants to predict the per pupil costs of a school based on some high-level summary data about the school. This way they’ll have a good estimate of how well a school is managing its costs relative to what the model would predict. You’ve been asked to prepare the data for modeling.

## Data Understanding
You’ve been given four CSV files that contain data for two different school districts. You can find these files at the bottom of the page.

- DistrictA_Attendance: This file contains average daily attendance, percent attendance, and pupil-teacher ratio data for the 25 schools in district A.
- DistrictA_Finance: This file contains average monthly teacher salary and per pupil cost data for the 25 schools in district A.
- DistrictB_Attendance: This file contains average daily attendance, percent attendance, and pupil-teacher ratio data for the 21 schools in district B.
- DistrictB_Finance: This file contains average monthly teacher salary and per pupil cost data for the 21 schools in district B.

## Data Preparation
### Step 1: Combine the Data
First you’ll need to combine the data from the various files into one sheet, with one row per school. To do this, you’ll use the skills you learned in the Formatting Data and Blending Data lessons.

### Step 2: Clean the Data
Next you’ll clean the data, which includes addressing duplicate data, missing data, and any other data issues. To do this, you’ll use the skills you learned in the Data Issues lesson.

### Step 3: Identify and Deal with Outliers
Lastly, you’ll look for outliers and determine the best way to address them. To do this, you’ll use the skills you learned in the Data Issues lesson.

## Self-Assessment
Do your best to complete the practice project on your own. Once you are done, or if you get stuck, take a look at the solution. We’ve provided the solution dataset, an Alteryx workflow, and a walkthrough of how to complete the project. There's not necessarily one perfectly right answer.


## Supporting Materials
- DistrictA Attendance
- DistrictA Finance
- DistrictB Attendance
- DistrictB Finance

# Resources for Self Assessment
You have three resources to assess how you did on the practice project and how ready you are for a project you will submit.

## Solution Dataset
At the bottom of the page you will see the Cleaned School Data file. There are a few judgement calls to make regarding outliers and missing data, so your solution may look a little different. Take a look and compare it to your cleaned dataset.

## Alteryx Solution Workflow
At the bottom of the page you will see an Alteryx workflow that provides a sample solution. Your solution may not look exactly the same. Take a look at the workflow to see how I approached it. You can also see a snapshot of the workflow below. Don't worry, it's not as complicated as it looks.

## Detailed Walkthrough
See the next section for a more detailed walkthrough of the process. You will also learn how to use the Score tool, which makes applying the model results much easier.


## Supporting Materials
- Practice P2 Solution Workflow
- Cleaned School Data

# Data Understanding
To begin, let’s take a look at the data. The data is in four different CSV files, which we’ll have to combine in order to do the analysis.

To import, we’ll bring in four input tools, each one bringing in one of the files. There are two files for each district: an attendance file and a finance file.

Starting with District A’s attendance file, we have one record for each school, and there are three numeric fields: average attendance, percent attendance, and pupil-teacher ratio.

District A’s finance file is structured differently. It has multiple rows for each school because the numeric fields are stacked on top of each other. So this data will have to be transformed before we can merge it with the attendance data.

District B’s two datasets are structured similarly.

# Formatting and Blending Data
To build the dataset, we’ll have to merge each of these datasets into one.

1. First, we’ll have to transform the finance datasets.
2. Then we’ll merge the finance and attendance datasets for each district.
3. And lastly we’ll combine the data for the two districts together.

## Transforming with Crosstab
To transform the finance datasets, connect a Crosstab tool to each of the finance input tools.

In the configuration window, select School as the group by value, metric as the column headers, and value for the Values for New Columns. This will format the data in the same way the attendance data is formatted.
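For readers who want to see the same reshaping outside Alteryx, here is a minimal Python sketch of what the Crosstab step does. It is an illustration under assumptions, not part of the course solution: the column names `School`, `metric`, and `value` follow the field names mentioned above, while the metric labels and all numbers are made up.

```python
from collections import defaultdict

# Hypothetical long-format finance rows: one row per (school, metric) pair.
# The metric labels and numbers are invented for illustration only.
finance_long = [
    {"School": "School 1", "metric": "Teacher Salary", "value": 3000.0},
    {"School": "School 1", "metric": "Per Pupil Cost", "value": 80.0},
    {"School": "School 2", "metric": "Teacher Salary", "value": 3200.0},
    {"School": "School 2", "metric": "Per Pupil Cost", "value": 75.0},
]


def crosstab(rows, group_by, header, value):
    """Pivot long rows into one record per group, with one field per header value."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[group_by]][row[header]] = row[value]
    # Re-attach the grouping key so each result is a complete record.
    return [{group_by: key, **cols} for key, cols in wide.items()]


finance_wide = crosstab(finance_long, "School", "metric", "value")
for record in finance_wide:
    print(record)
```

Each school ends up with a single record, which is the same shape as the attendance data.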

## Join
Next, bring in a join tool and connect both datasets.

In the configuration window, select the school variable in each of the datasets to join on. Then select the variables you want to keep. You can drop one of the school variables, but you should keep the 5 numeric variables. You’ll want to do the same for the other district.
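The join logic can be sketched in plain Python as follows. This is a rough equivalent of keeping only the matched records from a join, under the assumption that each school appears once in the finance data. The short field names and every value are hypothetical, and `inner_join` is a helper written for this example.

```python
def inner_join(left, right, key):
    """Merge records whose key appears in both inputs; unmatched records are dropped."""
    # Assumes the key is unique within `right`.
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # The shared key appears once in the merged record.
            joined.append({**row, **match})
    return joined


# Hypothetical attendance and finance records for one district.
attendance = [
    {"School": "School 1", "ATT": 300.5, "PATT": 94.1, "PTR": 22.3},
    {"School": "School 2", "ATT": 280.2, "PATT": 92.7, "PTR": 24.8},
    {"School": "School 3", "ATT": 150.0, "PATT": 90.0, "PTR": 20.0},
]
finance = [
    {"School": "School 1", "SAL": 3000.0, "PPC": 80.0},
    {"School": "School 2", "SAL": 3200.0, "PPC": 75.0},
]

joined = inner_join(attendance, finance, key="School")
for record in joined:
    print(record)
```

In this toy input, "School 3" has no finance record, so it does not appear in the joined result. Checking for unmatched records like this is a useful sanity test after any join.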

## Union
Next, bring in a union tool, which will stack the datasets on top of each other. Since the datasets have the same variables, you don’t have to set any configurations.

This is a good time to rename variables, since the variable names are pretty long. To do this, bring in a select tool. I’ve changed the names to ATT, PATT, PTR, SAL, and PPC. Let’s attach a browse tool and take a look at what we’ve got. Now the data looks like it's in a good format for analysis and modeling.
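As a rough Python analogue of the union and rename steps, the sketch below stacks two lists of records and maps long field names to the short ones used in this walkthrough. The long names in `RENAMES` are assumptions about what the source columns might be called, and the records are invented.

```python
# Hypothetical mapping from long source names to the short names used here.
RENAMES = {
    "Average Daily Attendance": "ATT",
    "Percent Attendance": "PATT",
    "Pupil Teacher Ratio": "PTR",
    "Teacher Salary": "SAL",
    "Per Pupil Cost": "PPC",
}

# One made-up record per district, both with the same fields.
district_a = [{
    "School": "A1", "Average Daily Attendance": 300.5, "Percent Attendance": 94.1,
    "Pupil Teacher Ratio": 22.3, "Teacher Salary": 3000.0, "Per Pupil Cost": 80.0,
}]
district_b = [{
    "School": "B1", "Average Daily Attendance": 280.2, "Percent Attendance": 92.7,
    "Pupil Teacher Ratio": 24.8, "Teacher Salary": 3200.0, "Per Pupil Cost": 75.0,
}]

# Union: stack the records from both districts.
combined = district_a + district_b

# Rename: fields not listed in RENAMES (such as School) keep their names.
renamed = [{RENAMES.get(name, name): value for name, value in row.items()}
           for row in combined]

print(renamed)
```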

Many analysts will output the dataset at this point so that later data cleaning steps run faster. In this case, it's not as important because we don't have a lot of data. But imagine doing this on datasets with millions of records, where the run time can be long.

# Cleaning Data
Now let’s check for data issues. Specifically, let’s look for duplicates and missing data first.

## Visualizing with the Field Summary Tool
Visualizing the data is a good way of doing this. The field summary tool produces a few helpful reports, including a histogram of each variable.


## Duplicate Data
First, we see a potential duplicate because there’s one school with two records. Let’s take a look at the observations to confirm. All the data is the same for these two records, so it appears to be a duplicate. So let’s delete one of the observations.

There are several ways to do this. An easy way is to use the select records tool. The record we want to delete is number 39, so in the configuration window of the select records tool, we can select 1-38 and everything 40 and after.
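Selecting by record number works when you already know which row to drop. A content-based check is another option, sketched here in Python: it keeps the first copy of each fully identical record. This swaps in a different technique from the one in the walkthrough, and the records below are made up.

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each record whose fields all match exactly."""
    seen = set()
    unique = []
    for row in rows:
        # A sorted tuple of (field, value) pairs identifies the record's content.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


# Hypothetical records: the second and third are exact copies of each other.
schools = [
    {"School": "School 1", "ATT": 300.5, "PPC": 80.0},
    {"School": "School 2", "ATT": 280.2, "PPC": 75.0},
    {"School": "School 2", "ATT": 280.2, "PPC": 75.0},
]

deduplicated = drop_exact_duplicates(schools)
print(len(schools), "->", len(deduplicated))
```

Records that share a school name but differ in any value are kept, so they can still be reviewed by hand.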

## Missing Data
You can see on the field summary report that we are missing a few observations for the two attendance variables (the red underneath the histogram indicates this). We can either delete or impute. Since there are only a few, let’s delete the records, but we’ll make a note of it: we may come back to run the model with the data imputed, or restore the records if we end up not using those variables in the model.

An easy way to remove missing data is to use a filter tool.

In the configuration window, you can select the ATT variable, and filter out records that are NULL. To do this, you want to keep all records that are not NULL, so you can use the drop downs to select ATT Is Not Null, or type in the formula !IsNull([ATT]).
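The same filter can be sketched in Python. Here `None` stands in for NULL, and the records are invented. Splitting into kept and removed lists makes it easy to keep the note about what was dropped.

```python
# Hypothetical records; None marks a missing attendance value.
schools = [
    {"School": "School 1", "ATT": 300.5, "PPC": 80.0},
    {"School": "School 2", "ATT": None, "PPC": 75.0},
    {"School": "School 3", "ATT": 150.0, "PPC": 90.0},
]

# Keep records where ATT is present; set the others aside for later review.
kept = [row for row in schools if row["ATT"] is not None]
removed = [row for row in schools if row["ATT"] is None]

print("kept:", [row["School"] for row in kept])
print("removed:", [row["School"] for row in removed])
```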


# Identifying and Dealing with Outliers
We are working on providing more detail for this section. In the meantime, here's an overview.

To check for outliers, let’s start with some scatterplots to visualize the relationship between each predictor variable and the target variable. Since there are four predictor variables, let’s drag in four scatterplot tools, and connect them to the cleaned dataset. Then we configure each one with a different predictor variable, attach browse tools, and run the workflow.

Let’s start with attendance. The box and whisker plots on either axis use the interquartile range to determine whether a point is an outlier. You can see here there are two dramatic outliers for attendance, so let’s address those first, then come back to the others. If you need a quick refresher on what to do with outliers, take a look at the notes below the video.
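To make the interquartile range idea concrete, here is a small Python sketch. It assumes the conventional rule of flagging points more than 1.5 times the IQR beyond the quartiles; the text above does not state the exact multiplier the plots use, so treat 1.5 as an assumption. The attendance values are made up.

```python
from statistics import quantiles


def iqr_outliers(values, multiplier=1.5):
    """Return the values lying outside the IQR fences."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical attendance values with one extreme observation.
attendance = [10, 12, 13, 14, 15, 16, 18, 150]
outliers = iqr_outliers(attendance)
print(outliers)
```

A flagged point is only a candidate for review. Whether it is an error or a legitimately unusual school still needs the kind of judgment described in this section.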

For these outliers, it looks like the data may not have been recorded properly. Let’s look at the observations by attaching a sort tool and browse tool to the dataset. You can see that almost every observation has a decimal, and these two observations seem to be about 10X the average, so it’s highly likely that the values are errors and we should divide them by 10. Normally we would validate with the source, but for now we’ll just make the assumption. I’ll use a formula tool to correct these records, and create another scatterplot.
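The divide-by-10 correction could be expressed as a conditional formula. The Python sketch below flags values that are several times the median and rescales them. The threshold of 5 times the median is a hypothetical choice for this illustration, not something taken from the workflow, and the values are invented.

```python
from statistics import median


def fix_scale_errors(values, ratio_threshold=5.0, divisor=10.0):
    """Divide values that are far above the median by `divisor`; leave the rest alone."""
    typical = median(values)
    return [v / divisor if v > ratio_threshold * typical else v for v in values]


# Hypothetical attendance values; 3105.0 looks like a misplaced decimal point.
attendance = [300.5, 280.2, 3105.0, 310.4]
corrected = fix_scale_errors(attendance)
print(corrected)
```

Using the median rather than the mean as the reference keeps the erroneous values from inflating the baseline they are compared against.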

ATTENDANCE: There is still one more outlier. For this one, it seems like the data is probably right, and it’s just a larger school. My concern with keeping this observation is that it may skew the data, creating or masking a relationship with PPC. If I knew that none of the schools I was predicting for were going to be that size, I’d delete it. Otherwise, I’d keep it in. Let’s plan on building a model with and without this observation.

PERCENT ATTENDANCE: Now let’s look at percent attendance. There don’t appear to be any outliers. Great.

PUPIL TEACHER RATIO: For pupil-teacher ratio, there are 2 outliers, which are also the outliers for PPC. This makes sense since we’d expect the relationship to behave this way. Based on the fitted line, the outliers are in line with the relationship, so we’d leave them in.

TEACHER SALARIES: Two records are outliers, with very low salaries. Just like for PTR, these seem to be in line with the trend, and not dramatically different, so it’s probably best to keep them in.

NOTE: In this example, because of the small size of the dataset, we could look at each outlier and make decisions. For larger data sets, you’d likely have to make more systematic decisions, such as removing all outliers, or removing the top 1 or 2 percent of observations for each variable.
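As one illustration of a systematic rule, the Python sketch below drops everything above the 98th percentile of a variable, which removes roughly the top 2 percent. The cutoff and the data are hypothetical, and `trim_top` is a helper written for this example.

```python
from statistics import quantiles


def trim_top(values, keep_percentile=98):
    """Keep values at or below the given percentile cut point."""
    # quantiles(..., n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cut = quantiles(values, n=100, method="inclusive")[keep_percentile - 1]
    return [v for v in values if v <= cut]


# Hypothetical variable with 100 observations, 1 through 100.
observations = list(range(1, 101))
trimmed = trim_top(observations)
print(len(observations), "->", len(trimmed))
```

A blanket rule like this trades the case-by-case judgment used above for consistency, so it is worth checking how many records it removes before modeling.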

## Outlier Summary
So, we kept 4 of the 5 outliers, and we’ll run the model with and without the fifth one. So now we are ready for modeling. Nice work.

## How to Deal with Outliers
As a reminder, let’s quickly review how to handle an outlier. There are three main methods:

- Delete: When data is erroneous or when the outlier hurts the model's ability to make predictions (perhaps the value is very unlikely to appear again, so keeping it in the model will skew all other predictions).
- Impute: Also for when data is erroneous, we could use the average or median value in its place.
- Leave it: If the data is good data, it may be best to leave the data in and try with and without to see the difference.
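The impute option can be sketched in Python as follows. This example replaces one value judged to be erroneous with the median of the remaining values. Which position to replace is passed in by hand, since deciding that a value is erroneous is a judgment call. The numbers are made up.

```python
from statistics import median


def impute_with_median(values, bad_index):
    """Replace the value at `bad_index` with the median of all the other values."""
    others = [v for i, v in enumerate(values) if i != bad_index]
    imputed = list(values)  # Copy so the original list is left unchanged.
    imputed[bad_index] = median(others)
    return imputed


# Hypothetical salaries; the last value is assumed to be a data entry error.
salaries = [3000.0, 3100.0, 3200.0, 9.0]
imputed = impute_with_median(salaries, bad_index=3)
print(imputed)
```

The median is computed without the erroneous value so the error does not influence its own replacement.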

# Project Overview
This project is the first part of a two-part series. In the first part, you will blend and format data and deal with outliers.

For the second part, you will use your cleaned up dataset to create another linear regression model. The difference this time is that you will have to choose which variable(s) are the most important for the model using new techniques learned in the More on Predictor Variables lesson.

## Scenario
Pawdacity is a leading pet store chain in Wyoming with 13 stores throughout the state. This year, Pawdacity would like to expand and open a 14th store. Your manager has asked you to perform an analysis to recommend the city for Pawdacity’s newest store, based on predicted yearly sales.

## How Do I Complete this Project?
This project uses skills learned throughout the "Data Preparation" lessons. To complete this project:

1. Go through the course.
2. Apply the skills learned in the course to solve the business problem given in the project details.
3. Use our guidelines and rubric to help build your project.
4. When you're ready, submit it to us for review using the submission template found in the supporting materials section.

## Skills Required
In order to complete this project, you must be able to:

- Understand different data types
- Deal with a variety of data issues
- Format data appropriately
- Blend data together using joins and unions