### Data Suppressions and Small Group Insights
1. How many data points are suppressed (s) across all tables, and what subgroups are most affected?
2. How does combining small groups (e.g., Small Group Total) affect overall proficiency rates?
3. Is there a notable performance difference between suppressed and non-suppressed groups?
4. Are suppressed subgroups more common in specific grades or subjects?
5. How often is gender data suppressed in comparison to race/ethnicity data?


#### How many data points are suppressed (s) across all tables, and what subgroups are most affected?

In [6]:
import pandas as pd

In [8]:
# Load datasets
data_ela = pd.read_excel(r'E:\Data Analytics\NYSE Report Card\Annual_EM_ELA.xlsx')

In [10]:
data_math = pd.read_excel(r'E:\Data Analytics\NYSE Report Card\Annual_EM_MATH.xlsx')

In [12]:
data_science = pd.read_excel(r'E:\Data Analytics\NYSE Report Card\Annual_EM_SCIENCE.xlsx')

In [14]:
# Count suppressed cells ('s') across every column of each subject's dataset
suppressions_ela = (data_ela == "s").sum().sum()
suppressions_math = (data_math == "s").sum().sum()
suppressions_science = (data_science == "s").sum().sum()

In [16]:
# Total suppressions
total_suppressions = suppressions_ela + suppressions_math + suppressions_science

In [18]:
print(f"Total Suppressed Data Points Across All Subjects: {total_suppressions}")

Total Suppressed Data Points Across All Subjects: 4486928
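
The total above counts individual cells equal to "s", so one suppressed row contributes once for each of its suppressed columns. The following is a minimal sketch on an invented frame that separates cell counts from row counts. The values are made up, and `NUM_PROF` is a hypothetical column name used only for illustration.

```python
import pandas as pd

# Toy frame standing in for one subject file; all values are invented.
toy = pd.DataFrame({
    "SUBGROUP_NAME": ["Female", "Male", "Multiracial", "Homeless"],
    "NUM_PROF": ["12", "s", "s", "7"],   # hypothetical column name
    "PER_PROF": ["48", "s", "s", "35"],
})

is_suppressed = toy.eq("s")                              # boolean mask, same shape as toy
cells_suppressed = int(is_suppressed.sum().sum())        # every suppressed cell
rows_suppressed = int(is_suppressed.any(axis=1).sum())   # rows with at least one "s"
per_column = is_suppressed.sum()                         # suppressed cells per column

print(f"Suppressed cells: {cells_suppressed}")
print(f"Rows with any suppression: {rows_suppressed}")
print(per_column[per_column > 0])
```

Reporting both numbers makes it clearer whether a large total reflects many suppressed records or a few records with many suppressed columns.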


In [20]:
# Identify subgroups most affected in each dataset
def subgroup_suppressions(data, column):
    suppressed_subgroups = data[data[column] == "s"]["SUBGROUP_NAME"].value_counts()
    return suppressed_subgroups

In [22]:
ela_suppressions = subgroup_suppressions(data_ela, "PER_PROF")
math_suppressions = subgroup_suppressions(data_math, "PER_PROF")
science_suppressions = subgroup_suppressions(data_science, "PER_PROF")

In [24]:
print("ELA Suppressed Subgroups:")
print(ela_suppressions)
print("Math Suppressed Subgroups:")
print(math_suppressions)
print("Science Suppressed Subgroups:")
print(science_suppressions)

ELA Suppressed Subgroups:
SUBGROUP_NAME
Multiracial                                        16843
Black or African American                          12977
Asian or Native Hawaiian/Other Pacific Islander    12784
Not Homeless                                       12475
Homeless                                           12237
Hispanic or Latino                                 11462
Non-English Language Learner                       10715
White                                              10512
English Language Learner                           10476
American Indian or Alaska Native                    7383
Not in Foster Care                                  6474
In Foster Care                                      6221
Students with Disabilities                          5449
General Education Students                          5429
Economically Disadvantaged                          4686
Not Economically Disadvantaged                      4654
Parent Not in Armed Forces                      
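
Raw counts favor subgroups that simply appear in more rows. One possible refinement, sketched here on invented data, is to report the share of each subgroup's rows that are suppressed alongside the count. The frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Invented rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "SUBGROUP_NAME": ["Multiracial", "Multiracial", "White", "White", "White", "Homeless"],
    "PER_PROF": ["s", "s", "55", "s", "61", "s"],
})

suppressed = toy["PER_PROF"].eq("s")

# Per subgroup: number of suppressed rows and total rows.
summary = suppressed.groupby(toy["SUBGROUP_NAME"]).agg(
    suppressed_rows="sum",
    total_rows="count",
)
summary["suppressed_share"] = summary["suppressed_rows"] / summary["total_rows"]
summary = summary.sort_values("suppressed_share", ascending=False)

print(summary)
```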

#### How does combining small groups (e.g., Small Group Total) affect overall proficiency rates?

In [32]:
# Convert 'PER_PROF' to numeric; non-numeric entries such as 's' become NaN.
# Note: this overwrites the column in place, so later checks for PER_PROF == "s" match nothing.
data_ela["PER_PROF"] = pd.to_numeric(data_ela["PER_PROF"], errors="coerce")
data_math["PER_PROF"] = pd.to_numeric(data_math["PER_PROF"], errors="coerce")
data_science["PER_PROF"] = pd.to_numeric(data_science["PER_PROF"], errors="coerce")

In [34]:
# Filter for rows with "Small Group Total" in subgroup names
small_group_ela = data_ela[data_ela["SUBGROUP_NAME"].str.contains("Small Group Total", na=False)]
small_group_math = data_math[data_math["SUBGROUP_NAME"].str.contains("Small Group Total", na=False)]
small_group_science = data_science[data_science["SUBGROUP_NAME"].str.contains("Small Group Total", na=False)]

In [36]:
# Unweighted mean of row-level PER_PROF values (NaN rows are skipped).
# The overall figure also includes the "Small Group Total" rows.
small_group_proficiency = pd.concat([small_group_ela, small_group_math, small_group_science])["PER_PROF"].mean()
overall_proficiency = pd.concat([data_ela, data_math, data_science])["PER_PROF"].mean()

In [38]:
print(f"Small Group Proficiency Rate: {small_group_proficiency:.2f}%")
print(f"Overall Proficiency Rate: {overall_proficiency:.2f}%")

Small Group Proficiency Rate: 47.07%
Overall Proficiency Rate: 45.21%
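
Both figures above are unweighted means of row-level percentages, so a row covering ten students counts as much as a row covering two hundred. The sketch below contrasts that with a pooled rate. It assumes the data carries per-row counts, here named `NUM_TESTED` and `NUM_PROF`; those column names and all values are hypothetical.

```python
import pandas as pd

# Invented rows; NUM_TESTED and NUM_PROF are hypothetical column names.
toy = pd.DataFrame({
    "SUBGROUP_NAME": ["All Students", "Small Group Total", "All Students"],
    "NUM_TESTED": [200, 10, 50],
    "NUM_PROF": [80, 7, 30],
})
toy["PER_PROF"] = 100 * toy["NUM_PROF"] / toy["NUM_TESTED"]

# Each row counts equally, regardless of how many students it covers.
unweighted = toy["PER_PROF"].mean()

# Each student counts equally: total proficient over total tested.
pooled = 100 * toy["NUM_PROF"].sum() / toy["NUM_TESTED"].sum()

print(f"Unweighted mean of row percentages: {unweighted:.2f}%")
print(f"Pooled proficiency rate: {pooled:.2f}%")
```

If such counts are available, the pooled rate is usually the better basis for judging how small-group rows shift the overall figure.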


#### Is there a notable performance difference between suppressed and non-suppressed groups?

In [44]:
all_subjects = pd.concat([data_ela, data_math, data_science])

In [46]:
# Split rows into those with at least one suppressed cell and those with none.
# Explicit copies avoid pandas' SettingWithCopyWarning on the assignments below.
has_suppressed = (all_subjects == "s").any(axis=1)
suppressed_data = all_subjects[has_suppressed].copy()
non_suppressed_data = all_subjects[~has_suppressed].copy()

In [48]:
# Convert PER_PROF to numeric where applicable
suppressed_data["PER_PROF"] = pd.to_numeric(suppressed_data["PER_PROF"], errors="coerce")
non_suppressed_data["PER_PROF"] = pd.to_numeric(non_suppressed_data["PER_PROF"], errors="coerce")


In [50]:
# Calculate mean proficiency rates.
# The suppressed mean is NaN when no suppressed row has a numeric PER_PROF.
suppressed_rate = suppressed_data["PER_PROF"].mean()
non_suppressed_rate = non_suppressed_data["PER_PROF"].mean()

In [52]:
print(f"Mean Proficiency Rate for Suppressed Groups: {suppressed_rate:.2f}%")
print(f"Mean Proficiency Rate for Non-Suppressed Groups: {non_suppressed_rate:.2f}%")

Mean Proficiency Rate for Suppressed Groups: nan%
Mean Proficiency Rate for Non-Suppressed Groups: 45.21%
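
The `nan` result follows from the data itself: a row whose proficiency is suppressed has no numeric value to average. A minimal sketch on invented data shows the effect and one way to keep the suppression status available after coercion. The `STATUS` and `PER_PROF_NUM` columns are hypothetical helpers, not part of the source files.

```python
import pandas as pd

# Invented rows; only SUBGROUP_NAME and PER_PROF mirror the notebook.
toy = pd.DataFrame({
    "SUBGROUP_NAME": ["Female", "Male", "Multiracial", "Homeless"],
    "PER_PROF": ["48", "52", "s", "s"],
})

# Record the suppression status before the "s" markers are coerced to NaN.
toy["STATUS"] = toy["PER_PROF"].eq("s").map({True: "suppressed", False: "reported"})
toy["PER_PROF_NUM"] = pd.to_numeric(toy["PER_PROF"], errors="coerce")

means = toy.groupby("STATUS")["PER_PROF_NUM"].mean()
usable = toy.groupby("STATUS")["PER_PROF_NUM"].count()  # non-missing values per group

print(means)
print(usable)
```

Comparing the two groups on performance would need a measure that is still reported for suppressed rows; the percentage itself cannot supply it.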


#### Are suppressed subgroups more common in specific grades or subjects?

In [56]:
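# Count suppressions by assessment name (grade/subject).
# Caveat: PER_PROF was coerced to numeric in In [32], so no "s" values remain
# in that column and the counts below come back empty. Run this on freshly
# loaded data (or before the coercion) to get real counts.
def count_suppressions_by_subject(data, column):
    suppressions_by_subject = data[data[column] == "s"]["ASSESSMENT_NAME"].value_counts()
    return suppressions_by_subject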

In [58]:
ela_suppressions_by_grade = count_suppressions_by_subject(data_ela, "PER_PROF")
math_suppressions_by_grade = count_suppressions_by_subject(data_math, "PER_PROF")
science_suppressions_by_grade = count_suppressions_by_subject(data_science, "PER_PROF")

In [60]:
print("Suppressions by Grade/Subject for ELA:")
print(ela_suppressions_by_grade)
print("Suppressions by Grade/Subject for Math:")
print(math_suppressions_by_grade)
print("Suppressions by Grade/Subject for Science:")
print(science_suppressions_by_grade)

Suppressions by Grade/Subject for ELA:
Series([], Name: count, dtype: int64)
Suppressions by Grade/Subject for Math:
Series([], Name: count, dtype: int64)
Suppressions by Grade/Subject for Science:
Series([], Name: count, dtype: int64)
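
The empty results come from the order of operations: `PER_PROF` was already converted to numeric, so it no longer contains "s". A minimal sketch on invented data builds the mask from the raw strings first and stores the numeric version in a separate column. The assessment labels and the `PER_PROF_NUM` column are assumptions for illustration.

```python
import pandas as pd

# Invented rows; the assessment labels are placeholders.
raw = pd.DataFrame({
    "ASSESSMENT_NAME": ["ELA3", "ELA3", "ELA4", "ELA4", "ELA4"],
    "PER_PROF": ["s", "41", "s", "s", "63"],
})

# Build the mask from the raw strings, then coerce into a separate column.
suppressed_mask = raw["PER_PROF"].eq("s")
raw["PER_PROF_NUM"] = pd.to_numeric(raw["PER_PROF"], errors="coerce")

# Counts taken from the raw markers.
by_assessment = raw.loc[suppressed_mask, "ASSESSMENT_NAME"].value_counts()

# The same check against the coerced column finds nothing.
after_coercion = raw.loc[raw["PER_PROF_NUM"].eq("s"), "ASSESSMENT_NAME"].value_counts()

print(by_assessment)
print(f"Matches after coercion: {len(after_coercion)}")
```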


#### How often is gender data suppressed in comparison to race/ethnicity data?

In [64]:
# Select all rows (suppressed or not) for gender and race/ethnicity subgroups
gender_suppressions = all_subjects[all_subjects["SUBGROUP_NAME"].str.contains("Female|Male|Non-Binary", na=False)]
race_suppressions = all_subjects[all_subjects["SUBGROUP_NAME"].str.contains("American Indian|Asian|Black|Hispanic|White|Multiracial", na=False)]

In [66]:
# Count suppressed cells ('s') across all columns in each selection
gender_suppressed_count = (gender_suppressions == "s").sum().sum()
race_suppressed_count = (race_suppressions == "s").sum().sum()

In [68]:
print(f"Gender Suppressions: {gender_suppressed_count}")
print(f"Race/Ethnicity Suppressions: {race_suppressed_count}")

Gender Suppressions: 92136
Race/Ethnicity Suppressions: 1744358
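
These totals are not directly comparable, because the race/ethnicity pattern covers more subgroups and therefore more rows than the gender pattern. One way to put them on the same footing, sketched on invented data, is to compare the share of rows suppressed in each category. The frame, the `category_patterns` mapping, and the anchored gender pattern are assumptions for illustration.

```python
import pandas as pd

# Invented rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "SUBGROUP_NAME": [
        "Female", "Male", "White", "Multiracial",
        "Hispanic or Latino", "Black or African American",
    ],
    "PER_PROF": ["47", "44", "52", "s", "s", "39"],
})

# Hypothetical mapping from category to a regex over SUBGROUP_NAME.
category_patterns = {
    "gender": r"^(?:Female|Male|Non-Binary)$",
    "race_ethnicity": r"American Indian|Asian|Black|Hispanic|White|Multiracial",
}

suppressed = toy["PER_PROF"].eq("s")
rows = {}
for category, pattern in category_patterns.items():
    in_category = toy["SUBGROUP_NAME"].str.contains(pattern, na=False)
    rows[category] = {
        "rows": int(in_category.sum()),
        "suppressed_rows": int((in_category & suppressed).sum()),
    }

rates = pd.DataFrame.from_dict(rows, orient="index")
rates["suppressed_share"] = rates["suppressed_rows"] / rates["rows"]

print(rates)
```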
