# SCR Data Profile
> Kamiku Xue(yx3494@nyu.edu)

In [1]:
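import org.apache.spark.sql.DataFrame
// column helpers used in the cells below: col, when, expr, sum, avg, min, max, stddev, desc
import org.apache.spark.sql.functions._
// define the project root
val root_folder = "/user/yx3494_nyu_edu/scr_data/"
// report years covered by this profile
val year = 2018 to 2023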

## 1 BOCES and Need-to-Resource Capacity Categories (N/RC)

The need-to-resource capacity (N/RC) index, a measure of a district’s ability to meet the needs of its students with local
resources, is the ratio of the estimated poverty percentage (expressed in standard score form) to the Combined Wealth
Ratio (expressed in standard score form). A district with both estimated poverty and Combined Wealth Ratio equal to
the State average would have an N/RC index of 1.0. N/RC categories are determined from this index; the category codes
1 to 7 appear in the lookup in the N/RC Clean Step.
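
As a rough illustration of the ratio described above, here is a minimal sketch. The source does not spell out how the standard scores are computed, so the hypothetical helper `nrcIndex` takes both scores as inputs and assumes they are scaled so that the State average equals 1.0. The sample values are illustrative only.

```scala
// Hypothetical helper: both arguments are assumed to be already in
// standard score form, scaled so that the State average equals 1.0.
def nrcIndex(povertyScore: Double, wealthRatioScore: Double): Double = {
  require(wealthRatioScore != 0.0, "Combined Wealth Ratio score must be non-zero")
  povertyScore / wealthRatioScore
}

// A district at the State average on both measures
val averageDistrict = nrcIndex(1.0, 1.0)
// Illustrative values only: more poverty and less wealth than average
val highNeedDistrict = nrcIndex(1.5, 0.75)

println(s"average: $averageDistrict, high need: $highNeedDistrict")
```

An index above 1.0 means the student need is high relative to local resources; an index below 1.0 means the opposite.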

In [3]:
// First see the data schema
for (i <- year){
    var df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/BOCES_NRC.csv")

        println("Year " + i)
        println("Total Columns " + df.columns.length)
        
        // see data structure
        df.printSchema
        // peek some data
        z.show(df.limit(3))
}

 

We will use the following columns:

- `ENTITY_CD`: Unique identifier for the entity, used as the foreign key
- `SCHOOL_NAME`: The name of the school
- `YEAR`: School Year (2021 for 2020-21, 2022 for 2021-22, 2023 for 2022-23)
- `DISTRICT_NAME`: The name of the district
- `COUNTY_NAME`: The name of the county
- `NEEDS_INDEX`: N/RC category code (1 to 7)

Now profile each column.

In [5]:
// loop from 2018 to 2023
for (i <- year){
    var df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/BOCES_NRC.csv")

        println("============== Year " + i + " ==============")
        println("Total entities " + df.count)

        // profile the YEAR
        println("YEAR Profile")
        z.show(df.groupBy("YEAR").count().orderBy("YEAR")) // -> 3, this explains some entities occur more than once, because they are in different years


        // profile the ENTITY_CD
        println("ENTITY_CD Profile, distinct entities: " + df.select("ENTITY_CD").distinct.count)
        println("empty, null values: " + df.filter("ENTITY_CD is null or ENTITY_CD = ''").count)
        val groups_ids = df.groupBy("ENTITY_CD").count()
        z.show(groups_ids.select("count").describe()) // max: 3, min: 2, avg: 2.99 -> most of the entities occur 3 times

        // profile the SCHOOL_NAME
        println("SCHOOL_NAME Profile, distinct schools:" + df.select("SCHOOL_NAME").distinct.count)
        val groups_schools = df.groupBy("SCHOOL_NAME").count()
        z.show(groups_schools.describe()) // some values occur 12 times, no empty names

        // group by ENTITY_CD and SCHOOL_NAME together to check whether the same school has different entity codes
        println("SCHOOL_NAME + ENTITY_CD Profile, Distribution of entities")
        val groups_schools_entity = df.groupBy("ENTITY_CD", "SCHOOL_NAME").count().select("count")
        z.show(groups_schools_entity.describe()) // max: 3, min: 2, avg: 2.99 -> match the entity_cd

        // profile the DISTRICT_NAME
        println("DISTRICT_NAME Profile, unique districts: " + df.select("DISTRICT_NAME").distinct.count)
        
        // count the rows per DISTRICT_NAME
        val groups_districts = df.groupBy("DISTRICT_NAME").count()
        z.show(groups_districts.describe()) //DISTRICT max occur 800, min 5 times

        // profile the NEEDS_INDEX
        println("NEEDS_INDEX Profile")
        z.show(df.describe("NEEDS_INDEX")) // max:7, min: 1, avg: 3.55, stddev: 2.09
        
        // NEEDS_INDEX distribution per school year
        println("NEEDS_INDEX Profile in different years")
        z.show(df
        .groupBy("YEAR")
        .agg(
            sum("NEEDS_INDEX"),
            avg("NEEDS_INDEX"), 
            min("NEEDS_INDEX"), 
            max("NEEDS_INDEX"), 
            stddev("NEEDS_INDEX"))
            .orderBy("YEAR"))
}
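
The occurrence checks in the cell above (`groupBy(...).count()` followed by `describe()`) can be pictured on a plain Scala collection. This is a minimal sketch with made-up entity codes, not the Spark code itself; it only shows what the min, max and average of the per-entity counts mean.

```scala
// Illustrative ENTITY_CD values, not taken from the data set
val entityCodes = Seq("A", "A", "A", "B", "B", "B", "C", "C")

// Equivalent of groupBy("ENTITY_CD").count()
val occurrences: Map[String, Int] =
  entityCodes.groupBy(identity).map { case (code, rows) => (code, rows.size) }

// Equivalent of describe() on the "count" column
val counts = occurrences.values.toSeq
val minCount = counts.min
val maxCount = counts.max
val avgCount = counts.sum.toDouble / counts.size

println(s"min=$minCount max=$maxCount avg=$avgCount")
```

A maximum of 3 with an average close to 3 indicates that most entities appear once per year in a three-year file, which matches the comments in the profiling cell.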

 
### N/RC Clean Step

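We will do the following steps to clean the data:

- Keep only the rows of the current year in each file
- Drop rows with a null or empty `SCHOOL_NAME`
- Replace a null `DISTRICT_NAME` with `UNAVAILABLE`
- Add a text description for each `NEEDS_INDEX` value
- Select the columns we need and rename them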

In [7]:
%spark
// create a dataframe to store the data for all years
var nrcDF : DataFrame = null

// UDF that maps the N/RC category code to its description
val getNeedIndex = (index: Int) => {
    index match {
    case 1 => "High N/RC: New York City"
    case 2 => "High N/RC: Large City Districts"
    case 3 => "High N/RC: Urban-Suburban Districts"
    case 4 => "High N/RC: Rural Districts"
    case 5 => "Average N/RC Districts"
    case 6 => "Low N/RC Districts"
    case 7 => "Charter Schools"
    // any other code would otherwise throw a MatchError
    case _ => "Unknown"
    }
}
spark.udf.register("nrcStr", getNeedIndex)

for (i <- year){
    var df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/BOCES_NRC.csv")
        // filter out other years' data
        .filter("YEAR = " + i)
        // drop null values in SCHOOL_NAME
        .filter("SCHOOL_NAME is not null and SCHOOL_NAME != ''")
        // in DISTRICT_NAME, if null, replace with 'UNAVAILABLE'
        .withColumn("DISTRICT_NAME", when(col("DISTRICT_NAME").isNull, "UNAVAILABLE").otherwise(col("DISTRICT_NAME")))
        .withColumn("NEEDS_DESCRIPTION", expr("nrcStr(NEEDS_INDEX)"))
        // only select the columns we need
        .select("ENTITY_CD", "SCHOOL_NAME", "YEAR", "NEEDS_INDEX", "NEEDS_DESCRIPTION")
        // rename the columns
        .withColumnRenamed("ENTITY_CD", "School_BEDS_Code")
        .withColumnRenamed("SCHOOL_NAME", "School_Name")
        .withColumnRenamed("YEAR", "Year")
        .withColumnRenamed("NEEDS_INDEX", "N/RC_Index")
        .withColumnRenamed("NEEDS_DESCRIPTION", "N/RC_Index_Description")


    println("Year " + i + " (Total Entities: " + df.count + ")")
    if (nrcDF == null){
        nrcDF = df
    } else {
        nrcDF = nrcDF.union(df)
    }
}

println("============== Final N/RC Dataframe (Total: " + nrcDF.count + ") ==============")
z.show(nrcDF.limit(10))

// save the dataframe
nrcDF.write.mode("overwrite").parquet(root_folder + "nrc_cleaned.parquet")
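
The row-level logic of the clean step can also be sketched without Spark. The example below is a simplified illustration: `NrcRow`, `nrcLabel` and `cleanRows` are hypothetical names, the sample rows are made up, and the `"Unknown"` fallback for codes outside 1 to 7 is an assumption rather than something the data requires.

```scala
// Hypothetical row type holding only the fields used here
case class NrcRow(entityCd: Long, schoolName: String, year: Int, needsIndex: Int)

// Category labels taken from the lookup in the cleaning cell
val nrcLabels: Map[Int, String] = Map(
  1 -> "High N/RC: New York City",
  2 -> "High N/RC: Large City Districts",
  3 -> "High N/RC: Urban-Suburban Districts",
  4 -> "High N/RC: Rural Districts",
  5 -> "Average N/RC Districts",
  6 -> "Low N/RC Districts",
  7 -> "Charter Schools"
)

// Unknown codes get a placeholder label (an assumption of this sketch)
def nrcLabel(index: Int): String = nrcLabels.getOrElse(index, "Unknown")

// Keep rows of the file's own year that have a non-empty school name
def cleanRows(rows: Seq[NrcRow], fileYear: Int): Seq[(NrcRow, String)] =
  rows
    .filter(_.year == fileYear)
    .filter(r => r.schoolName != null && r.schoolName.nonEmpty)
    .map(r => (r, nrcLabel(r.needsIndex)))

// Illustrative rows, not taken from the data set
val sample = Seq(
  NrcRow(1L, "School A", 2021, 1),
  NrcRow(1L, "School A", 2020, 1), // earlier year, dropped
  NrcRow(2L, "", 2021, 5),         // empty name, dropped
  NrcRow(3L, "School C", 2021, 9)  // code outside 1..7
)

val cleaned = cleanRows(sample, 2021)
cleaned.foreach { case (row, label) => println(s"${row.schoolName}: $label") }
```

Filtering on the file's own year is what removes the repeated entities seen during profiling, since each yearly file also carries rows from earlier years.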

## 2 Teaching Staff Data

The Teaching Staff data provides information on the number of experienced and inexperienced teachers and principals, including those in high-poverty, low-performing schools.

> Starting in 2020, the Staff Qualifications data was published in a new version, so we need to check the columns before merging.

In [9]:
// 2018 and 2019 have Staff_Qualifications.csv
for (i <- 2018 to 2019){
    var df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/Staff_Qualifications.csv")
    println("============== Year " + i + " ==============")
    // show the schema
    println("Columns: " + df.columns.length)
    df.printSchema
    // peek some data
    z.show(df.limit(3))
}

In [10]:
// 2020 - 2023 have Inexperienced_Teachers_Principals.csv
for (i <- 2020 to 2023){
    var df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/Inexperienced_Teachers_Principals.csv")
    println("============== Year " + i + " ==============")
    // show the schema
    println("Columns: " + df.columns.length)
    df.printSchema
    // peek some data
    z.show(df.limit(3))
}
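
The two loops above differ only in the file name, which changes in 2020. One way to capture that rule in a single place is sketched below; the helper names `staffFileFor` and `staffPath` are hypothetical, while the file names and the root folder are the ones used in this notebook.

```scala
// File names and the 2020 switch come from the cells above;
// the helper names are hypothetical
def staffFileFor(year: Int): String =
  if (year <= 2019) "Staff_Qualifications.csv"
  else "Inexperienced_Teachers_Principals.csv"

// Build the full path the same way the notebook does: root + year + "/" + file
def staffPath(rootFolder: String, year: Int): String =
  rootFolder + year + "/" + staffFileFor(year)

val root = "/user/yx3494_nyu_edu/scr_data/"
(2018 to 2023).foreach(y => println(staffPath(root, y)))
```

With such a helper, a single loop over 2018 to 2023 could read every staff file, at the cost of hiding the version change that the separate loops make explicit.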

The Staff Qualifications files and the Inexperienced Teachers and Principals files differ in the following columns:

`INSTITUTION_ID, TOT_TEACH_LOW, TOT_TEACH_HIGH, TOT_PRINC_LOW, TOT_PRINC_HIGH, TEACH_DATA_REP_FLAG, PRIN_DATA_REP_FLAG`

We will not use these columns; we will use the following columns:

- ENTITY_CD - Unique identifier for the entity, used as the foreign key
- ENTITY_NAME - The name of the school / district
- YEAR - School Year (e.g. 2021 for 2020-21)
- NUM_TEACH - Total number of teachers in the Student Information Repository System (SIRS)
- NUM_TEACH_INEXP - Number of inexperienced teachers
- NUM_TEACH_LOW - Number of teachers in low-poverty schools statewide
- NUM_TEACH_HIGH - Number of teachers in high-poverty schools statewide
- NUM_PRINC - Total number of principals
- NUM_PRINC_INEXP - Number of inexperienced principals
- NUM_PRINC_LOW - Number of principals in low-poverty schools statewide
- NUM_PRINC_HIGH - Number of principals in high-poverty schools statewide
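
A column comparison like the one described above can be done by splitting two column lists into "only in the first", "only in the second" and "shared". The sketch below uses a hypothetical helper `compareColumns` and shortened, illustrative column lists; which file carries which extra column is an assumption made only for this example.

```scala
// Split two column lists: only in left, only in right, shared by both
def compareColumns(left: Seq[String], right: Seq[String]): (Seq[String], Seq[String], Seq[String]) = {
  val onlyLeft = left.filterNot(c => right.contains(c))
  val onlyRight = right.filterNot(c => left.contains(c))
  val shared = left.filter(c => right.contains(c))
  (onlyLeft, onlyRight, shared)
}

// Shortened, illustrative column lists; the placement of the extra
// columns is an assumption made for this example
val oldColumns = Seq("ENTITY_CD", "ENTITY_NAME", "YEAR", "NUM_TEACH", "TOT_TEACH_LOW")
val newColumns = Seq("INSTITUTION_ID", "ENTITY_CD", "ENTITY_NAME", "YEAR", "NUM_TEACH")

val (onlyOld, onlyNew, shared) = compareColumns(oldColumns, newColumns)
println(s"only old: $onlyOld, only new: $onlyNew, shared: $shared")
```

In the notebook the inputs would come from the `columns` array of each DataFrame, converted with `toSeq`.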

Next, profile this data.

In [12]:
// for 2018 and 2019 Profile
for (i <- 2018 to 2019){
    var df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/Staff_Qualifications.csv")
    
    println("============== Year " + i + " ==============")
    println("Total entities " + df.count)    

    // profile the YEAR
    println("YEAR Profile")
    z.show(df.groupBy("YEAR").count().orderBy("YEAR"))

    // profile the ENTITY_CD
    println("ENTITY_CD Profile, distinct entities: " + df.select("ENTITY_CD").distinct.count)

    // profile the ENTITY_NAME
    println("ENTITY_NAME Profile, distinct schools:" + df.select("ENTITY_NAME").distinct.count)

    // profile the ENTITY_CD + ENTITY_NAME
    println("ENTITY_CD + ENTITY_NAME Profile, Distribution of entities")
    val groups_schools_entity = df.groupBy("ENTITY_CD", "ENTITY_NAME").count().select("count")
    z.show(groups_schools_entity.describe())

    // profile the NUM_TEACH
    println("NUM_TEACH Profile")
    z.show(df.describe("NUM_TEACH")) 

    // profile the NUM_TEACH_INEXP
    println("NUM_TEACH_INEXP Profile")
    z.show(df.describe("NUM_TEACH_INEXP"))

    // profile the NUM_TEACH_LOW
    println("NUM_TEACH_LOW Profile")
    z.show(df.describe("NUM_TEACH_LOW"))

    // profile the NUM_PRINC
    println("NUM_PRINC Profile")
    z.show(df.describe("NUM_PRINC"))

    // profile the NUM_PRINC_INEXP
    println("NUM_PRINC_INEXP Profile")
    z.show(df.describe("NUM_PRINC_INEXP"))

    // profile the NUM_PRINC_LOW
    println("NUM_PRINC_LOW Profile")
    z.show(df.describe("NUM_PRINC_LOW"))

    // profile the NUM_PRINC_HIGH
    println("NUM_PRINC_HIGH Profile")
    z.show(df.describe("NUM_PRINC_HIGH"))
}

In [13]:
// for 2020 to 2023 Inexperienced Teachers Principals Profile
for (i <- 2020 to 2023){
    var df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/Inexperienced_Teachers_Principals.csv")
    
    println("============== Year " + i + " ==============")
    println("Total entities " + df.count)    

    // profile the YEAR
    println("YEAR Profile")
    z.show(df.groupBy("YEAR").count().orderBy("YEAR"))
    // profile the ENTITY_CD
    println("ENTITY_CD Profile, distinct entities: " + df.select("ENTITY_CD").distinct.count)

    // profile the ENTITY_NAME
    println("ENTITY_NAME Profile, distinct schools:" + df.select("ENTITY_NAME").distinct.count)

    // profile the ENTITY_CD + ENTITY_NAME
    println("ENTITY_CD + ENTITY_NAME Profile, Distribution of entities")
    val groups_schools_entity = df.groupBy("ENTITY_CD", "ENTITY_NAME").count().select("count")
    z.show(groups_schools_entity.describe())

    // profile the NUM_TEACH
    println("NUM_TEACH Profile")
    z.show(df.describe("NUM_TEACH")) 

    // profile the NUM_TEACH_INEXP
    println("NUM_TEACH_INEXP Profile")
    z.show(df.describe("NUM_TEACH_INEXP"))

    // profile the NUM_TEACH_LOW
    println("NUM_TEACH_LOW Profile")
    z.show(df.describe("NUM_TEACH_LOW"))

    // profile the NUM_PRINC
    println("NUM_PRINC Profile")
    z.show(df.describe("NUM_PRINC"))

    // profile the NUM_PRINC_INEXP
    println("NUM_PRINC_INEXP Profile")
    z.show(df.describe("NUM_PRINC_INEXP"))

    // profile the NUM_PRINC_LOW
    println("NUM_PRINC_LOW Profile")
    z.show(df.describe("NUM_PRINC_LOW"))

    // profile the NUM_PRINC_HIGH
    println("NUM_PRINC_HIGH Profile")
    z.show(df.describe("NUM_PRINC_HIGH"))
}

### Teaching Staff Clean Step

This data is almost clean; we only need to pick the required columns and merge all years into one table.

In [15]:
%spark

// create a dataframe to store the data for 2018 and 2019
var oldStaffDF : DataFrame = null

// 2018 and 2019 have Staff_Qualifications.csv
for (i <- 2018 to 2019){
    var df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/Staff_Qualifications.csv")
        // select the current year data
        .filter("YEAR = " + i)
        // select the columns
        .select("ENTITY_CD", "ENTITY_NAME", "YEAR", "NUM_TEACH", "NUM_TEACH_INEXP", "NUM_PRINC", "NUM_PRINC_INEXP")

    println("============== Year " + i + " (Total: " + df.count + ") ==============")
    if (oldStaffDF == null){
        oldStaffDF = df
    } else {
        oldStaffDF = oldStaffDF.union(df)
    }
}


In [16]:
%spark

// create a dataframe to store the data for 2020 to 2023
var newStaffDF : DataFrame = null
// 2020 to 2023 have Inexperienced_Teachers_Principals.csv
for (i <- 2020 to 2023){
    var df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/Inexperienced_Teachers_Principals.csv")
        // select the current year data
        .filter("YEAR = " + i)
        // select the columns
        .select("ENTITY_CD", "ENTITY_NAME", "YEAR", "NUM_TEACH", "NUM_TEACH_INEXP", "NUM_PRINC", "NUM_PRINC_INEXP")

    println("============== Year " + i + " (Total: " + df.count + ") ==============")
    if (newStaffDF == null){
        newStaffDF = df
    } else {
        newStaffDF = newStaffDF.union(df)
    }
}

In [17]:
%spark

// final staff dataframe for 2018 - 2023
val finalStaffDF = oldStaffDF
.union(newStaffDF)
// rename the columns
.withColumnRenamed("ENTITY_CD", "School_BEDS_Code")
.withColumnRenamed("ENTITY_NAME", "School_Name")
.withColumnRenamed("YEAR", "Year")
.withColumnRenamed("NUM_TEACH", "Total_Teachers")
.withColumnRenamed("NUM_TEACH_INEXP", "4-_years_Teachers")
.withColumnRenamed("NUM_PRINC", "Total_Principals")
.withColumnRenamed("NUM_PRINC_INEXP", "4-_years_Principals")

println("============== Final Staff Dataframe (Total: " + finalStaffDF.count + ") ==============")
// peek some data row
z.show(finalStaffDF.limit(10))
// save the dataframe
finalStaffDF.write.mode("overwrite").parquet(root_folder + "staff_cleaned.parquet")

finalStaffDF.printSchema

## 3 Graduation Rate (new)

In [19]:
for (i <- 2018 to 2023){
    var df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/Graduation_Rate.csv")

    println("============== Year " + i + " ==============")
    // show the schema
    println("Columns: " + df.columns.length)
    df.printSchema
}

### Graduation Rate Clean Step

- Merge the 2018-2023 data
- Rename the columns

In [21]:
// define the root df
var gradDF : DataFrame = null
for (i <- 2018 to 2023){
    var df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/Graduation_Rate.csv")
        // filter the current year data
        .filter("YEAR = " + i)
        // select the columns we need
        .select("YEAR", "ENTITY_CD", "ENTITY_NAME", "SUBGROUP_NAME", "COHORT", "COHORT_COUNT", "GRAD_RATE")
    if (gradDF == null){
        gradDF = df
    } else {
        gradDF = gradDF.union(df)
    }
}

println("============== Final Graduation Rate Dataframe (Total: " + gradDF.count + ") ==============")

In [22]:
// combined-cohort graduation rate summary for every entity
val entityGradRate = gradDF
.filter("SUBGROUP_NAME = 'All Students' AND COHORT = 'Combined'")
// 's' marks a suppressed value; it is replaced with 0 here, which pulls MIN and AVG down
.withColumn("GRAD_RATE", when(col("GRAD_RATE") === "s", 0).otherwise(col("GRAD_RATE").cast("int")))
.groupBy("ENTITY_NAME")
.agg(
    min("GRAD_RATE").alias("MIN"),
    max("GRAD_RATE").alias("MAX"),
    avg("GRAD_RATE").alias("AVG"),
    stddev("GRAD_RATE").alias("STDDEV")
)

// show the 10 entities with the highest average rate, in descending order
z.show(entityGradRate.orderBy(desc("AVG")).limit(10))

In [23]:
// graduation rate per school per year
val schoolYearGradRate = gradDF
.filter("SUBGROUP_NAME = 'All Students' AND COHORT = 'Combined' AND GRAD_RATE != 's'")
// suppressed 's' values are already filtered out, so only the integer cast is needed
.withColumn("GRAD_RATE", col("GRAD_RATE").cast("int"))
.groupBy("ENTITY_CD", "ENTITY_NAME", "YEAR")
.agg(
    avg("GRAD_RATE").alias("Graduation_Rate")
)
// rename the columns
.withColumnRenamed("ENTITY_CD", "School_BEDS_Code")
.withColumnRenamed("ENTITY_NAME", "School_Name")
.withColumnRenamed("YEAR", "Year")

// show the first rows ordered by school and year
z.show(schoolYearGradRate.orderBy("School_BEDS_Code", "Year").limit(10))
schoolYearGradRate.count

// save the dataframe
gradDF.write.mode("overwrite").parquet(root_folder + "grad_rate_cleaned.parquet")
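
The two graduation rate cells treat the suppressed marker `s` differently: the per-entity summary counts it as 0, while the per-school, per-year table filters it out. The sketch below shows the effect of that choice on an average, using plain Scala and made-up values. `parseGradRate` and `average` are hypothetical helpers, and treating any non-numeric value as missing is an assumption about the raw format.

```scala
import scala.util.Try

// "s" marks a suppressed value; anything that is not a whole number is
// treated as missing as well (an assumption about the raw format)
def parseGradRate(raw: String): Option[Int] =
  if (raw == null || raw.trim == "s") None
  else Try(raw.trim.toInt).toOption

def average(values: Seq[Int]): Option[Double] =
  if (values.isEmpty) None else Some(values.sum.toDouble / values.size)

// Illustrative raw values, not taken from the data set
val rawRates = Seq("90", "s", "80")

// Suppressed values skipped, as in the per-school, per-year table
val skipping = average(rawRates.flatMap(r => parseGradRate(r)))
// Suppressed values counted as 0, as in the per-entity summary
val zeroFilled = average(rawRates.map(r => parseGradRate(r).getOrElse(0)))

println(s"skipping: $skipping, zero-filled: $zeroFilled")
```

Skipping suppressed values keeps the average on the scale of the reported rates, whereas zero-filling lowers both the minimum and the average for any entity with suppressed years.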

## Final Data Merge

Merge the N/RC, Teaching Staff, and Graduation Rate data into one table, and cast `School_BEDS_Code` to a string so it can serve as a foreign key when joining with other tables.

In [25]:
val combinedDF = nrcDF
.join(finalStaffDF, Seq("School_BEDS_Code", "School_Name", "Year"), "inner")
.join(schoolYearGradRate, Seq("School_BEDS_Code", "School_Name", "Year"), "inner")
.withColumn("School_BEDS_Code", col("School_BEDS_Code").cast("string"))

println("============== Combined Dataframe (Total: " + combinedDF.count + ") ==============")
combinedDF.printSchema
z.show(combinedDF.limit(10))

// save the dataframe into parquet
combinedDF.write.mode("overwrite").parquet(root_folder + "combined_cleaned.parquet")

// also save the dataframe into csv
combinedDF.write.mode("overwrite").csv(root_folder + "combined_cleaned.csv")
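
Because both joins are inner joins on the composite key (`School_BEDS_Code`, `School_Name`, `Year`), a row reaches the combined table only if all three inputs contain that key. The sketch below illustrates this with plain Scala maps; `innerJoin3`, the `Key` alias and the sample values are hypothetical and only stand in for the three DataFrames.

```scala
// Composite key used by the merge: (School_BEDS_Code, School_Name, Year)
type Key = (String, String, Int)

// Inner join of three keyed tables: a key survives only if all three have it
def innerJoin3[A, B, C](a: Map[Key, A], b: Map[Key, B], c: Map[Key, C]): Map[Key, (A, B, C)] = {
  val sharedKeys = a.keySet.intersect(b.keySet).intersect(c.keySet)
  sharedKeys.map(k => (k, (a(k), b(k), c(k)))).toMap
}

// Illustrative keys and values, not taken from the data set
val keyA: Key = ("1", "School A", 2021)
val keyB: Key = ("2", "School B", 2021)

val nrc = Map[Key, Int](keyA -> 1, keyB -> 5)        // N/RC category code
val staff = Map[Key, Int](keyA -> 40, keyB -> 25)    // total teachers
val grad = Map[Key, Double](keyA -> 88.0)            // graduation rate, School A only

val combined = innerJoin3(nrc, staff, grad)
println(combined)
```

Here School B drops out because it has no graduation rate entry, so comparing the combined row count with the counts of the three inputs shows how many rows the inner joins discard.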