SGP Data Preparation

dbetebenner edited this page May 2, 2017 · 11 revisions

Data Formatting

This wiki page provides SGP data formatting/preparation instructions for running SGP analyses. To help illustrate these formatting specifications there are exemplar data sets, sgpData and sgpData_LONG, embedded within the SGPdata Package. Ensuring your data is set up in the proper format will minimize later problems often encountered in running SGP analyses.

WIDE versus LONG data format

There are two formats for representing longitudinal (time-dependent) student assessment data: WIDE and LONG. In WIDE format, each case/row represents a unique student, and the columns hold the variables associated with that student at different times. In LONG format, a student's time-dependent data is spread across multiple rows of the data set. The SGPdata Package, installed when one installs the SGP Package, includes exemplar WIDE and LONG data sets (sgpData and sgpData_LONG, respectively) to assist in setting up your data.

In general, the lower level functions in the SGP package that do the calculations, studentGrowthPercentiles and studentGrowthProjections, require WIDE formatted data, whereas the higher level functions built for operational analyses require LONG data. If running anything but the basic analyses, we recommend setting up your data in the LONG format, as much of the capability of the package is built around the user supplying data in that way.
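As a rough illustration of the difference between the two layouts, the sketch below converts a small WIDE data frame to LONG using base R's reshape. The data and the wide column names (GRADE_2012, SS_2012, and so on) are made up for this example and are not the layout of sgpData; adapt the varying and times arguments to your own file.

```r
# Toy WIDE data: one row per student, one column per variable per year.
# Column names here are hypothetical, chosen only for illustration.
wide <- data.frame(
  ID = c("1001", "1002"),
  GRADE_2012 = c("3", "4"),
  GRADE_2013 = c("4", "5"),
  SS_2012 = c(435, 510),
  SS_2013 = c(461, 525),
  stringsAsFactors = FALSE
)

# Stack the per-year columns so each row is one student by year.
long <- reshape(
  wide,
  direction = "long",
  idvar = "ID",
  varying = list(c("GRADE_2012", "GRADE_2013"), c("SS_2012", "SS_2013")),
  v.names = c("GRADE", "SCALE_SCORE"),
  timevar = "YEAR",
  times = c("2012", "2013")
)

# This toy file holds a single content area, so it is added as a constant.
long$CONTENT_AREA <- "MATHEMATICS"

# Tidy up: sort by student and year, and drop the generated row names.
long <- long[order(long$ID, long$YEAR), ]
rownames(long) <- NULL
long
```

The result has one row per student per year, which is the shape the higher level functions expect.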

Setting up long format data

For our purposes, long format means that each row represents a unique student by content area by year combination. Thus, in the final long file, each combination of student identifier, content area, and year must be unique.

By contrast, in wide formatted data a row represents a unique student and contains all available information for that student. As an example of the long format, here are the first four rows (and only the first 7 columns) of the sample data set sgpData_LONG:

> library(SGPdata)
> sgpData_LONG[1:4,1:7]
       ID LAST_NAME FIRST_NAME CONTENT_AREA      YEAR GRADE SCALE_SCORE
1 1000372   Daniels      Corey  MATHEMATICS 2011_2012     3         435
2 1000372   Daniels      Corey  MATHEMATICS 2012_2013     4         461
3 1000372   Daniels      Corey  MATHEMATICS 2013_2014     5         444
4 1000372   Daniels      Corey      READING 2011_2012     3         523

Notice that the same student is in each row, but that the rows represent different year and content area combinations. This is what is meant by long formatted data.
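One way to check the uniqueness requirement before running an analysis is to look for repeated ID by CONTENT_AREA by YEAR combinations. The following is a minimal base R sketch on made-up data (the fifth row is a deliberately added duplicate and does not come from sgpData_LONG):

```r
# Toy LONG data; the last row repeats an ID/CONTENT_AREA/YEAR combination.
long_data <- data.frame(
  ID = c("1000372", "1000372", "1000372", "1000372", "1000372"),
  CONTENT_AREA = c("MATHEMATICS", "MATHEMATICS", "MATHEMATICS", "READING", "READING"),
  YEAR = c("2011_2012", "2012_2013", "2013_2014", "2011_2012", "2011_2012"),
  SCALE_SCORE = c(435, 461, 444, 523, 530),
  stringsAsFactors = FALSE
)

key <- c("ID", "CONTENT_AREA", "YEAR")

# Mark every row that shares its key with another row (both directions,
# so the first occurrence is flagged along with the later ones).
dup_rows <- duplicated(long_data[, key]) |
  duplicated(long_data[, key], fromLast = TRUE)

# Inspect the offending rows; an empty result means the key is unique.
long_data[dup_rows, ]
```

Any rows returned need to be resolved or flagged before the file is treated as properly formatted long data.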

Required Variables

The following list gives the variables that are required for the calculation of Student Growth Percentiles and how they should be formatted (if applicable).

  • ID This column contains the unique student identifiers. This variable is of class character.

  • CONTENT_AREA This column describes the content area for a given row. Most data sets would presumably contain MATHEMATICS and READING, but other values are possible. These values must be capitalized. If analyses utilize embedded meta-data contained in SGPstateData, then these names must match the states’ assessment information contained in the SGPstateData object that is embedded within the SGP Package. Please contact @dbetebenner to have meta-data added to this object.

  • YEAR This column gives either the academic year (e.g., 2011_2012 as in the sample data) or the year in which the assessment took place (e.g., 2011). This variable is of class character.

  • GRADE The grade in which the assessment was administered. The class of this column should be set to character.

  • SCALE_SCORE The assessment scale score for each observation. This column’s class should be set to integer or numeric.

  • VALID_CASE This column identifies those students who should be included in subsequent analyses (value set to VALID_CASE) and those who should not be included (value set to INVALID_CASE). Duplicate cases are often left in the data and flagged as INVALID_CASE. If your data contains only valid cases, this variable can be set to VALID_CASE for every row.
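The class requirements above can be applied with a few base R coercions. The sketch below uses a made-up raw extract; the rule it uses for duplicates (keep the first occurrence, flag later ones) is only an assumption for illustration, and which record to keep is a decision for your own data.

```r
# Toy raw extract with the wrong classes and lower-case content areas.
raw <- data.frame(
  ID = c(1000372L, 1000372L, 1000411L),
  CONTENT_AREA = c("mathematics", "mathematics", "reading"),
  YEAR = c(2012L, 2012L, 2012L),
  GRADE = c(3L, 3L, 4L),
  SCALE_SCORE = c("435", "440", "523"),
  stringsAsFactors = FALSE
)

prepared <- raw
prepared$ID <- as.character(prepared$ID)                    # character
prepared$CONTENT_AREA <- toupper(prepared$CONTENT_AREA)     # capitalized
prepared$YEAR <- as.character(prepared$YEAR)                # character
prepared$GRADE <- as.character(prepared$GRADE)              # character
prepared$SCALE_SCORE <- as.numeric(prepared$SCALE_SCORE)    # numeric

# Assumed rule: the first record for a student/content area/year is valid,
# any later duplicate is kept in the data but flagged as INVALID_CASE.
key <- c("ID", "CONTENT_AREA", "YEAR")
prepared$VALID_CASE <- ifelse(
  duplicated(prepared[, key]), "INVALID_CASE", "VALID_CASE"
)

str(prepared)
```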

Additional Variables

Although these variables are not required for Student Growth Percentile analyses, they are required for Student Growth Projection (i.e., Growth to Standard) analyses and/or the visualization and reporting functionality:

  • ACHIEVEMENT_LEVEL The achievement or proficiency category associated with each observed scale score. Values in this column should match the assessment program information included in the SGPstateData object.

  • FIRST_NAME Student first name. A character or a factor. (Only required for individual student reports)

  • LAST_NAME Student last name. A character or a factor. (Only required for individual student reports)

  • SCHOOL_NUMBER Unique identifier for the school/institution in which a student is enrolled for the given year and content area. Either an integer or character. (Only required for aggregations and bubble plots)

  • SCHOOL_NAME Name of the school/institution in which a student is enrolled in a given year. Either a factor or character. (Only required for aggregations and bubble plots)

  • DISTRICT_NUMBER A unique identifier for the district/educational authority in which a student is enrolled in a given year. Either an integer or factor. (Only required for aggregations and bubble plots)

  • DISTRICT_NAME District/educational authority name in which a student is enrolled in a given year. Either a factor or character. (Only required for aggregations and bubble plots)

  • STATE_ENROLLMENT_STATUS Binary indicator of whether the student was continuously enrolled in the state and should be included in state summary statistics. Indicator must be a factor, preferably with informative labels such as Enrolled State: Yes and Enrolled State: No. (Only required for aggregations and bubble plots)

  • DISTRICT_ENROLLMENT_STATUS Binary indicator of whether the student was continuously enrolled in the district and should be included in district summary statistics. Indicator must be a factor, preferably with informative labels such as Enrolled District: Yes and Enrolled District: No. (Only required for aggregations and bubble plots)

  • SCHOOL_ENROLLMENT_STATUS Binary indicator of whether the student was continuously enrolled in the school and should be included in school summary statistics. Indicator must be a factor, preferably with informative labels such as Enrolled School: Yes and Enrolled School: No. (Only required for aggregations and bubble plots)

  • ETHNICITY Ethnicity and other demographic variables if summarization by those groups is desired via summarizeSGP. (Only required for aggregations and bubble plots)
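The enrollment status variables are often delivered as logical or 0/1 flags and need to be converted to labeled factors. A minimal sketch, assuming a hypothetical logical column named enrolled_state in a made-up data frame:

```r
# Toy data: enrolled_state is a hypothetical raw flag, not an SGP variable.
students <- data.frame(
  ID = c("1001", "1002", "1003"),
  enrolled_state = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Convert the flag to a factor with informative labels.
students$STATE_ENROLLMENT_STATUS <- factor(
  students$enrolled_state,
  levels = c(TRUE, FALSE),
  labels = c("Enrolled State: Yes", "Enrolled State: No")
)

# Drop the raw flag once the factor has been created.
students$enrolled_state <- NULL

table(students$STATE_ENROLLMENT_STATUS)
```

The same pattern applies to DISTRICT_ENROLLMENT_STATUS and SCHOOL_ENROLLMENT_STATUS with the corresponding labels.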

## Data Preparation