<a href="https://colab.research.google.com/github/Sageh9/MSSP607/blob/main/Assignment1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
df = pd.read_excel('/content/sample_data/data_academic_performance.xlsx', sheet_name="SABER11_SABERPRO")

In [None]:
print(df)

                cod_s11 gender                         edu_father  \
0      SB11201210000129      F  Incomplete Professional Education   
1      SB11201210000137      F                 Complete Secundary   
2      SB11201210005154      M                           Not sure   
3      SB11201210007504      F                           Not sure   
4      SB11201210007548      M    Complete professional education   
...                 ...    ...                                ...   
12406  SB11201420568705      M                            Ninguno   
12407  SB11201420573045      M    Complete professional education   
12408  SB11201420578809      M   Complete technique or technology   
12409  SB11201420578812      F    Complete professional education   
12410  SB11201420583232      M                 Complete Secundary   

                             edu_mother  \
0      Complete technique or technology   
1       Complete professional education   
2                              Not sure   

In [None]:
# === Basic Cleaning ===
df = df.rename(columns=lambda x: x.strip())  # strip stray whitespace from column names
df["percentile"] = pd.to_numeric(df["percentile"], errors="coerce")  # non-numeric values become NaN
df = df.dropna(subset=["gender", "percentile"])  # drop rows missing essential values, including coerced NaN

# === 1. Gender Comparison ===
gender_summary = df.groupby("gender")["percentile"].agg(["mean", "std", "count"])
print("=== Performance by Gender =====")
print(gender_summary, "\n")

# t-test between genders
male_scores = df.loc[df["gender"] == "M", "percentile"].dropna()
female_scores = df.loc[df["gender"] == "F", "percentile"].dropna()
t_stat, p_val = stats.ttest_ind(male_scores, female_scores, equal_var=False)
print(f"T-test Gender difference: t={t_stat:.3f}, p={p_val:.3f}\n")

# === 2. Parental Education Influence ===
edu_effect = df.groupby("edu_father")["percentile"].mean().sort_values(ascending=False)
print("=== Average Percentile by Father's Education ===")
print(edu_effect, "\n")

# ANOVA to test overall difference
groups = [group["percentile"].dropna() for name, group in df.groupby("edu_father")]
f_stat, p_val_edu = stats.f_oneway(*groups)  # separate name so the t-test p_val is not overwritten
print(f"ANOVA Father Education Effect: F={f_stat:.3f}, p={p_val_edu:.3f}\n")


# === 3. Socioeconomic Stratum Performance ===
stratum_effect = df.groupby("stratum")["percentile"].mean().sort_values(ascending=False)
print("=== Average Percentile by Stratum ===")
print(stratum_effect, "\n")

groups_stratum = [group["percentile"].dropna() for name, group in df.groupby("stratum")]
f_stat, p_val_stratum = stats.f_oneway(*groups_stratum)  # separate name so the t-test p_val is not overwritten
print(f"ANOVA Stratum Effect: F={f_stat:.3f}, p={p_val_stratum:.3f}\n")

# === 4. Summary Insights ===
print("=== Insights & Recommendations ===")
if p_val_stratum < 0.05:
    print("- Socioeconomic stratum shows significant differences in performance.")
else:
    print("- No strong statistical difference between strata in performance.")

# Base this message on the ANOVA result; the mere presence of a category label says nothing about an effect
if p_val_edu < 0.05:
    print("- Father's education level shows significant differences in performance.")
else:
    print("- No significant difference in performance across father's education levels.")


if p_val < 0.05 and t_stat < 0:
    print("- Female students outperform male students on average.")
elif p_val < 0.05 and t_stat > 0:
    print("- Male students outperform female students on average.")
else:
    print("- Gender performance differences are not statistically significant.")

print("\nRecommendations:")
print("1. Strengthen support for students from lower strata via targeted aid and mentoring.")
print("2. Encourage parental involvement and awareness campaigns about education’s impact.")
print("3. Further investigate why gender or socioeconomic differences persist.")

=== Performance by Gender =====
             mean        std  count
gender                             
F       67.133254  25.801336   5043
M       69.345277  25.876128   7368 

T-test Gender difference: t=4.685, p=0.000

=== Average Percentile by Father's Education ===
edu_father
Postgraduate education                   82.417512
Incomplete Professional Education        73.818824
Complete professional education          73.712865
Complete technique or technology         70.152429
Incomplete technical or technological    68.339350
Not sure                                 65.405405
Complete Secundary                       64.488217
Ninguno                                  62.138211
Incomplete Secundary                     61.899175
Incomplete primary                       61.137415
0                                        61.089514
Complete primary                         60.347087
Name: percentile, dtype: float64 

ANOVA Father Education Effect: F=76.171, p=0.000

=== Average Percentil

*   **Gender Comparison:** The code calculated the mean, standard deviation, and count of the 'percentile' for each gender. An independent samples t-test was performed to compare the means of male and female students' percentiles.
*   **Parental Education Influence:** The average 'percentile' was calculated for each category of 'edu_father' (father's education level). An ANOVA test was conducted to determine if there were significant differences in mean percentiles across the different father's education levels.
*   **Socioeconomic Stratum Performance:** The average 'percentile' was calculated for each socioeconomic 'stratum'. An ANOVA test was conducted to determine if there were significant differences in mean percentiles across the different strata.
*   **Summary Insights and Recommendations:** Based on the p-values from the statistical tests, the code printed insights about the significance of gender, parental education, and socioeconomic stratum on academic performance and provided recommendations.

In essence, the code performs descriptive statistics and inferential statistical tests (t-test and ANOVA) to analyze the relationship between academic performance (measured by percentile) and demographic and socioeconomic factors like gender, parental education, and socioeconomic stratum.
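
To make the ANOVA step less of a black box, here is a minimal sketch of how a one-way F statistic is built from between-group and within-group variation. The three groups are hypothetical toy values, not rows from the dataset, and `one_way_anova_f` is an illustrative helper name; the notebook itself relies on `stats.f_oneway`.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical toy data: three small groups with means 2, 3 and 6
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_anova_f(toy_groups)
print(f"F = {f_value:.2f}")  # F = 13.00
```

A large F means the group means differ by more than the within-group spread would suggest. It does not say which groups differ or in which direction, so the grouped means still have to be read alongside it.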



Based on the analysis above, the insights and recommendations follow, including answers to the specific questions posed:

**Insights and Recommendations**

The analysis reveals significant relationships between academic performance (measured by percentile) and several socioeconomic and demographic factors.

1.  **Father's education is significantly associated with outcomes.** The "Average Percentile by Father's Education" output and the ANOVA result (F=76.171, p < 0.001) show significant differences in academic performance across levels of father's education. Students whose fathers have postgraduate education have the highest average percentile (82.42), while those whose fathers completed only primary education or have no formal education average around 60-62. The ordering is not strictly monotonic (for example, "Incomplete Professional Education" at 73.82 sits slightly above "Complete professional education" at 73.71), but overall, higher paternal education goes with better academic performance. Only `edu_father` was tested, so conclusions about mothers' education would need the same analysis on `edu_mother`.

2.  **Socioeconomic stratum is significantly associated with performance, with clear differences between strata.** The "Average Percentile by Stratum" output and the ANOVA p-value (printed as 0.000, i.e. p < 0.001) show significant differences in performance across strata. Stratum 6 performs best with an average percentile of 85.83, followed closely by Stratum 5 (83.20). Performance generally falls as the stratum level falls, with Stratum 1 and Stratum 0 showing the lowest average percentiles (56.97 and 58.86, respectively). This indicates a strong positive association between socioeconomic stratum and academic performance.

3.  **How does each gender perform in the global assessment?** Male students average a percentile of 69.35 and female students 67.13. Welch's t-test (t=4.685, p < 0.001) indicates that this difference of about 2.2 percentile points is statistically significant.

4.  **Use your analysis to understand the results. What did you find interesting, and can you make any recommendations or inferences?**
    My analysis reinforces the strong association of socioeconomic stratum and father's education with academic performance. The differences across strata and across father's education levels are the most interesting findings, because they point to potential inequalities in educational opportunity or support. The gender difference is statistically significant, but at about 2 percentile points it is much smaller than the gaps of more than 20 points between the highest and lowest education levels and strata. With more than 12,000 students, even small differences reach significance, so the size of each gap matters more than its p-value. These are observational associations and do not by themselves establish causation.
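
The point about magnitude can be checked directly from the "Performance by Gender" summary table. The sketch below recomputes Welch's t statistic from the reported means, standard deviations, and counts, and adds Cohen's d as a standardized effect size. Cohen's d is not part of the original analysis; it is introduced here only as one common way to express the size of a mean difference, and the helper names `welch_t` and `cohens_d` are illustrative.

```python
import math

# Summary statistics copied from the "Performance by Gender" output
female = {"mean": 67.133254, "std": 25.801336, "n": 5043}
male = {"mean": 69.345277, "std": 25.876128, "n": 7368}


def welch_t(a, b):
    """Welch's t statistic for (a minus b), computed from summary statistics."""
    standard_error = math.sqrt(a["std"] ** 2 / a["n"] + b["std"] ** 2 / b["n"])
    return (a["mean"] - b["mean"]) / standard_error


def cohens_d(a, b):
    """Cohen's d for (a minus b), using the pooled standard deviation."""
    pooled_var = (
        (a["n"] - 1) * a["std"] ** 2 + (b["n"] - 1) * b["std"] ** 2
    ) / (a["n"] + b["n"] - 2)
    return (a["mean"] - b["mean"]) / math.sqrt(pooled_var)


t = welch_t(male, female)
d = cohens_d(male, female)
print(f"Welch t   = {t:.3f}")
print(f"Cohen's d = {d:.3f}")
```

The t value should agree with the reported t=4.685 up to rounding, while d should come out below 0.1. By the usual rule of thumb that treats 0.2 as a small effect, the gender gap is real but very small in practical terms.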

**Recommendations:**

*   **Targeted Support for Lower Strata:** Implement programs and provide resources specifically designed to support students from lower socioeconomic strata to help bridge the performance gap.
*   **Promote Parental Engagement:** Develop initiatives to encourage parental involvement in education, particularly for parents with lower levels of formal education, emphasizing the positive impact it can have on their children's academic success.
*   **Investigate Gender Differences:** Conduct further research to understand the underlying reasons for the observed gender difference in performance to develop targeted interventions if necessary.