# SCR Data Profile
> Kamiku Xue (yx3494@nyu.edu)

In [1]:
// imports and project root
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
val root_folder = "/user/yx3494_nyu_edu/scr_data/"

## 1 BOCES and Need-to-Resource Capacity Categories (N/RC)

The need-to-resource capacity (N/RC) index, a measure of a district’s ability to meet the needs of its students with local
resources, is the ratio of the estimated poverty percentage (expressed in standard score form) to the Combined Wealth
Ratio (expressed in standard score form). A district with both estimated poverty and Combined Wealth Ratio equal to
the State average would have an N/RC index of 1.0. N/RC categories are determined from this index; the category codes
that appear in the data (1 to 7) are mapped to their descriptions in the clean step below.
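
To make the ratio concrete, here is a minimal sketch in plain Scala. It assumes both inputs have already been converted to standard score form. The names `nrcIndex`, `povertyScore`, and `cwrScore` are hypothetical (they are not columns in the data), and using 1.0 as the State-average score is an assumption for illustration only.

```scala
// Minimal sketch: both inputs are assumed to already be in standard score form.
// `nrcIndex`, `povertyScore`, and `cwrScore` are hypothetical names.
def nrcIndex(povertyScore: Double, cwrScore: Double): Double = {
  require(cwrScore != 0.0, "Combined Wealth Ratio score must be non-zero")
  povertyScore / cwrScore
}

// Equal scores on both measures (assumed State average: 1.0) give an index of 1.0.
val atStateAverage = nrcIndex(1.0, 1.0)
// Higher poverty relative to wealth pushes the index above 1.0.
val higherNeed = nrcIndex(1.5, 0.75)
```

Note that the `NEEDS_INDEX` column profiled later holds the category code, not this ratio.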

In [3]:
// First, inspect the schema of each year's file
for (i <- 2018 to 2023) {
    val df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/BOCES_NRC.csv")

    println("Year " + i)
    println("Total Columns " + df.columns.length)

    // see the data structure
    df.printSchema
    // peek at a few rows
    z.show(df.limit(3))
}

 

We will use the following columns:

- `ENTITY_CD`: Unique identifier for the entity, used as a foreign key
- `SCHOOL_NAME`: The name of the school
- `YEAR`: School Year (2021 for 2020-21, 2022 for 2021-22, 2023 for 2022-23)
- `DISTRICT_NAME`: The name of the district
- `COUNTY_NAME`: The name of the county
- `NEEDS_INDEX`: N/RC index

Now profile each column.

In [5]:
// loop from 2018 to 2023 and profile the N/RC data
for (i <- 2018 to 2023){
    val df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/BOCES_NRC.csv")

        println("============== Year " + i + " ==============")
        println("Total entities " + df.count)

        // profile the YEAR
        println("YEAR Profile")
        z.show(df.groupBy("YEAR").count().orderBy("YEAR")) // -> 3 years per file, which explains why some entities occur more than once

        // profile the ENTITY_CD
        println("ENTITY_CD Profile, distinct entities: " + df.select("ENTITY_CD").distinct.count)
        println("empty, null values: " + df.filter("ENTITY_CD is null or ENTITY_CD = ''").count)
        val groups_ids = df.groupBy("ENTITY_CD").count()
        z.show(groups_ids.select("count").describe()) // max: 3, min: 2, avg: 2.99 -> most of the entities occur 3 times

        // profile the SCHOOL_NAME
        println("SCHOOL_NAME Profile, distinct schools: " + df.select("SCHOOL_NAME").distinct.count)
        val groups_schools = df.groupBy("SCHOOL_NAME").count()
        z.show(groups_schools.describe()) // some values occur 12 times, no empty names

        // pair SCHOOL_NAME with ENTITY_CD to check whether the same school appears under different entity codes
        println("SCHOOL_NAME + ENTITY_CD Profile, Distribution of entities")
        val groups_schools_entity = df.groupBy("ENTITY_CD", "SCHOOL_NAME").count().select("count")
        z.show(groups_schools_entity.describe()) // max: 3, min: 2, avg: 2.99 -> match the entity_cd

        // profile the DISTRICT_NAME
        println("DISTRICT_NAME Profile, unique districts: " + df.select("DISTRICT_NAME").distinct.count)
        
        // count the occurrences of each district
        val groups_districts = df.groupBy("DISTRICT_NAME").count()
        z.show(groups_districts.describe()) //DISTRICT max occur 800, min 5 times

        // profile the NEEDS_INDEX
        println("NEEDS_INDEX Profile")
        z.show(df.describe("NEEDS_INDEX")) // max:7, min: 1, avg: 3.55, stddev: 2.09
        
        // NEEDS_INDEX distribution by school year
        println("NEEDS_INDEX Profile in different years")
        z.show(df
        .groupBy("YEAR")
        .agg(
            sum("NEEDS_INDEX"),
            avg("NEEDS_INDEX"), 
            min("NEEDS_INDEX"), 
            max("NEEDS_INDEX"), 
            stddev("NEEDS_INDEX")
        )
        .orderBy("YEAR"))
}
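
The per-column checks in the loop above (distinct values, null or empty values, and how often each value occurs) follow one pattern. Below is a minimal sketch of that pattern on a plain Scala collection, which stands in for a DataFrame column. `ColumnProfile` and `profileColumn` are hypothetical names, and unlike Spark's `distinct`, this sketch leaves blank values out of the distinct count.

```scala
// Hypothetical summary of one column; the field names are illustrative only.
case class ColumnProfile(distinct: Int, blanks: Int, minOccurrences: Int, maxOccurrences: Int)

def profileColumn(values: Seq[String]): ColumnProfile = {
  // Treat null and empty strings as blank, like the "is null or = ''" filter
  val (blank, present) = values.partition(v => v == null || v.isEmpty)
  // Occurrences per distinct value, like groupBy(...).count()
  val occurrences = present.groupBy(identity).values.map(_.size).toSeq
  ColumnProfile(
    distinct = occurrences.size,
    blanks = blank.size,
    minOccurrences = if (occurrences.isEmpty) 0 else occurrences.min,
    maxOccurrences = if (occurrences.isEmpty) 0 else occurrences.max
  )
}

// Made-up entity codes: one occurs three times, one twice, plus two blanks.
val sample = Seq("A001", "A001", "A001", "B002", "B002", "", null)
val profile = profileColumn(sample)
```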

 
### N/RC Clean Step

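We will clean the data with the following steps:

- Keep only the rows whose `YEAR` matches the year of the file
- Drop rows with a null or empty `SCHOOL_NAME`
- Replace null `DISTRICT_NAME` values with `UNAVAILABLE`
- Add a `NEEDS_DESCRIPTION` column derived from `NEEDS_INDEX`
- Keep only the columns we need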
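
The row-level rules applied in the cleaning cell below can be sketched on plain Scala collections. This is an illustration only: `NrcRow` and `cleanYear` are hypothetical names, the sample rows are made up, and an `Option` stands in for a nullable `DISTRICT_NAME`.

```scala
// Hypothetical row type; only the fields touched by the cleaning rules are modeled.
case class NrcRow(entityCd: String, schoolName: String, year: Int, districtName: Option[String])

def cleanYear(rows: Seq[NrcRow], fileYear: Int): Seq[NrcRow] = {
  rows
    // keep only the rows that belong to the file's own year
    .filter(_.year == fileYear)
    // drop rows without a school name
    .filter(r => r.schoolName != null && r.schoolName.nonEmpty)
    // fill missing district names with a placeholder
    .map(r => r.copy(districtName = r.districtName.orElse(Some("UNAVAILABLE"))))
}

val raw = Seq(
  NrcRow("0001", "School A", 2021, Some("District 1")),
  NrcRow("0001", "School A", 2020, Some("District 1")), // other year, filtered out
  NrcRow("0002", "", 2021, Some("District 2")),         // empty name, dropped
  NrcRow("0003", "School C", 2021, None)                // district filled in
)
val cleaned = cleanYear(raw, 2021)
```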

In [7]:
%spark

// Define a UDF that maps the N/RC index to its description
val getNeedIndex = (index: Int) => {
    index match {
    case 1 => "High N/RC: New York City"
    case 2 => "High N/RC: Large City Districts"
    case 3 => "High N/RC: Urban-Suburban Districts"
    case 4 => "High N/RC: Rural Districts"
    case 5 => "Average N/RC Districts"
    case 6 => "Low N/RC Districts"
    case 7 => "Charter Schools"
    case _ => "Other"
    }
}

spark.udf.register("nrcStr", getNeedIndex)

In [8]:
%spark
// create a dataframe to store the data for all years
var nrcDF : DataFrame = null
for (i <- 2018 to 2023){
    val df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/BOCES_NRC.csv")
        // keep only the current year's rows
        .filter("YEAR = " + i)
        // drop null values in SCHOOL_NAME
        .filter("SCHOOL_NAME is not null and SCHOOL_NAME != ''")
        // in DISTRICT_NAME, if null, replace with 'UNAVAILABLE'
        .withColumn("DISTRICT_NAME", when(col("DISTRICT_NAME").isNull, "UNAVAILABLE").otherwise(col("DISTRICT_NAME")))
        .withColumn("NEEDS_DESCRIPTION", expr("nrcStr(NEEDS_INDEX)"))
        .select("ENTITY_CD", "SCHOOL_NAME", "YEAR", "NEEDS_INDEX", "NEEDS_DESCRIPTION", "COUNTY_NAME", "DISTRICT_NAME")
    if (nrcDF == null){
        nrcDF = df
    } else {
        nrcDF = nrcDF.union(df)
    }
}

nrcDF.printSchema
val nrcCount = nrcDF.count
z.show(nrcDF.groupBy("YEAR").count())
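
The cell above accumulates the per-year results in a `var` that starts as `null`. One possible alternative, shown here as a sketch only, is to collect the per-year pieces into a sequence and reduce them. `combineAll` is a hypothetical helper, and plain sequences stand in for the per-year DataFrames so the example runs without Spark.

```scala
// Combine per-year pieces without a null-initialized var.
// `combineAll` is a hypothetical helper; `union` is whatever merges two pieces.
def combineAll[A](parts: Seq[A])(union: (A, A) => A): Option[A] = {
  parts.reduceOption(union)
}

// Plain sequences stand in for the per-year DataFrames here.
val perYear = Seq(Seq("2018-a"), Seq("2019-a", "2019-b"), Seq("2020-a"))
val combined = combineAll(perYear)(_ ++ _)
// An empty input yields None instead of a null DataFrame.
val nothing = combineAll(Seq.empty[Seq[String]])(_ ++ _)
```

With Spark, the same call would look like `combineAll(frames)(_ union _)`, where `frames` is a hypothetical `Seq[DataFrame]` built by mapping over `2018 to 2023`.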

### N/RC Analysis (Post Profile)

In [10]:
z.show(nrcDF.groupBy("COUNTY_NAME").count().orderBy("COUNTY_NAME"))

In [11]:
// distribution of each N/RC level across counties
for(i <- 1 to 7){
    println(getNeedIndex(i))
    
    z.show(nrcDF
    .filter(s"NEEDS_INDEX = ${i}")
    .groupBy("COUNTY_NAME")
    .count()
    .orderBy(desc("count"))
    )
}

### Output Cleaned Data

In [13]:
println("============== Final N/RC Dataframe (Total: " + nrcDF.count + ") ==============")
val nrc_cleaned_Df = nrcDF 
// only select the columns we need
.drop("COUNTY_NAME", "DISTRICT_NAME")
// rename the columns
.withColumnRenamed("ENTITY_CD", "School_BEDS_Code")
.withColumnRenamed("SCHOOL_NAME", "School_Name")
.withColumnRenamed("YEAR", "Year")
.withColumnRenamed("NEEDS_INDEX", "N/RC_Index")
.withColumnRenamed("NEEDS_DESCRIPTION", "N/RC_Index_Description")
z.show(nrc_cleaned_Df.limit(10))
nrc_cleaned_Df.write.mode("overwrite").parquet(root_folder + "nrc_cleaned.parquet")

## 2 Graduation Rate (High School Only)

In [15]:
// see the structure and data
for (i <- 2018 to 2023){
    val df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/Graduation_Rate.csv")
    println("============== Year " + i + " ==============")
    // show the schema
    println("Columns: " + df.columns.length)
    df.printSchema
    z.show(df.limit(5))
}

### Clean Data

- Merge the 2018-2023 data
- Rename the columns

In [17]:
// define the root df
var gradDF : DataFrame = null
for (i <- 2018 to 2023){
    val df = spark.read.option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("escape", "\"")
        .csv(root_folder + i + "/Graduation_Rate.csv")
        // filter the current year data
        .filter("YEAR = " + i)
        // select the columns we need
        .select("YEAR", "ENTITY_CD", "ENTITY_NAME", "SUBGROUP_NAME", "COHORT", "GRAD_RATE")
        
    if (gradDF == null){
        gradDF = df
    } else {
        gradDF = gradDF.union(df)
    }
}

println("============== Final Graduation Rate Dataframe (Total: " + gradDF.count + ") ==============")
gradDF.printSchema

 

### Profile

In [19]:
//  by year
z.show(gradDF.groupBy("YEAR").count().orderBy("YEAR"))

In [20]:
//  by subgroup types
z.show(gradDF.groupBy("SUBGROUP_NAME").count().orderBy("SUBGROUP_NAME"))

In [21]:
// merge the "English Language Learner" label into "English Language Learners"
gradDF = gradDF.withColumn(
    "SUBGROUP_NAME",
    when(col("SUBGROUP_NAME") === "English Language Learner", "English Language Learners")
      .otherwise(col("SUBGROUP_NAME"))
)
z.show(gradDF.groupBy("SUBGROUP_NAME").count().orderBy("SUBGROUP_NAME"))

In [22]:
// by cohort types
z.show(gradDF.groupBy("COHORT").count())

In [23]:

val _4_year_rate = gradDF.filter($"SUBGROUP_NAME" === "All Students" && $"GRAD_RATE" =!= "s" && $"COHORT" === "4-Year").count
val _5_year_rate = gradDF.filter($"SUBGROUP_NAME" === "All Students" && $"GRAD_RATE" =!= "s" && $"COHORT" === "5-Year").count
val _6_year_rate = gradDF.filter($"SUBGROUP_NAME" === "All Students" && $"GRAD_RATE" =!= "s" && $"COHORT" === "6-Year").count

// invalid records
val invalid_rate = gradDF.filter($"SUBGROUP_NAME" === "All Students" && $"GRAD_RATE" === "s" && $"COHORT" === "Combined").count
// valid records
val valid_Rate = gradDF.filter($"SUBGROUP_NAME" === "All Students" && $"GRAD_RATE" =!= "s" && $"COHORT" === "Combined").count

This means that for records with an invalid Combined rate, we can derive the rate by averaging the 4-year, 5-year, and 6-year values.
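
As a sketch of that fallback on plain Scala collections: skip the Combined rows and the rows marked `s`, then average what remains per entity and year. `GradRecord` and `averageCohortRates` are hypothetical names, the sample values are made up, and the sketch assumes that valid `GRAD_RATE` values are plain numeric strings.

```scala
// Hypothetical record type mirroring the columns used in the filter.
case class GradRecord(entityCd: String, year: Int, cohort: String, gradRate: String)

def averageCohortRates(records: Seq[GradRecord]): Map[(String, Int), Double] = {
  records
    // keep only the per-cohort rows that carry a usable rate
    .filter(r => r.cohort != "Combined" && r.gradRate != "s")
    .groupBy(r => (r.entityCd, r.year))
    .map { case (key, rs) =>
      // assumption: valid rates parse as numbers
      val rates = rs.map(_.gradRate.toDouble)
      key -> rates.sum / rates.size
    }
}

val records = Seq(
  GradRecord("X", 2021, "4-Year", "80"),
  GradRecord("X", 2021, "5-Year", "84"),
  GradRecord("X", 2021, "6-Year", "s"),   // skipped
  GradRecord("X", 2021, "Combined", "s"), // skipped
  GradRecord("Y", 2021, "4-Year", "90"),
  GradRecord("Y", 2021, "5-Year", "90"),
  GradRecord("Y", 2021, "6-Year", "90")
)
val averaged = averageCohortRates(records)
```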

### Clean and output

In [26]:
// average graduation rate per entity and year
val schoolYearGradRate = gradDF
.filter("SUBGROUP_NAME = 'All Students' AND COHORT != 'Combined' AND GRAD_RATE != 's'")
.groupBy("ENTITY_CD", "YEAR")
// get the average grad rate
.agg(avg("GRAD_RATE").alias("Graduation_Rate"))
// rename the column
.withColumnRenamed("ENTITY_CD", "School_BEDS_Code")
.withColumnRenamed("YEAR", "Year")

schoolYearGradRate.printSchema

// show the distribution in descending order
z.show(schoolYearGradRate.orderBy(desc("Graduation_Rate")).limit(10))
z.show(schoolYearGradRate.groupBy("Year").count())
z.show(schoolYearGradRate.describe())

// save the dataframe
schoolYearGradRate.write.mode("overwrite").parquet(root_folder + "grad_rate_cleaned.parquet")