# Deep Analysis of PISA 2022 Data: Interrelations of Academic Achievement with Sociocultural Factors

## Project Overview

This project analyzes data from the Programme for International Student Assessment (PISA), an initiative run by the Organisation for Economic Co-operation and Development (OECD). PISA is a worldwide study designed to evaluate the educational performance of 15-year-old students and their preparedness for real-world challenges beyond the formal school curriculum. The assessment normally takes place every three years (the 2022 cycle was postponed by one year from 2021) and measures students' abilities in three core domains: reading literacy, mathematical literacy, and scientific literacy.

PISA serves as a global benchmark for education systems by comparing the skills and knowledge of students across countries, which helps identify effective educational practices and policies. By focusing on how well 15-year-olds can apply their knowledge to real-life situations, PISA provides insight into the effectiveness of schooling in different regions and informs policymaking aimed at improving educational outcomes.

Through rigorous and standardized testing methodologies, PISA evaluates not just rote memorization, but the ability of students to think critically, solve complex problems, and make reasoned decisions. It thus provides a comprehensive picture of students' capabilities in handling the demands of future academic and occupational settings. The insights drawn from PISA data are instrumental for educators, policymakers, and stakeholders in crafting strategies that improve educational standards and foster an environment that nurtures the full potential of every learner.

## Data Source

For this project, we harness the most recent 2022 data set obtained from the "Student Questionnaire Data File" available on the official OECD website. This file is a comprehensive repository of data, meticulously compiled from the responses of students and their parents across various countries. It serves as a foundational element for our analysis, providing both performance scores and a diverse range of background variables. These variables facilitate a deep dive into several critical aspects of the educational landscape:

- **Demographic and Socio-economic Profiles**: This category includes detailed demographic information about students and their familial backgrounds, capturing data points such as age, gender, immigration status, as well as the educational levels and socio-economic statuses of their parents. These variables are crucial for understanding the diverse contexts from which students hail and how these factors might influence their educational achievements.

- **Educational Performance**: The data file provides scores across fundamental academic subjects—mathematics, reading, and science. These scores are offered as plausible values, which are multiple imputed scores reflecting students’ abilities derived from their test performances. This approach allows for more nuanced statistical analyses and helps in understanding the variability in student performance across different educational systems.

- **Learning Environment and Behaviors**: This section sheds light on the non-academic aspects of students' school life, encompassing their study habits, self-reported motivational attitudes, their sense of belonging at their school, and their experiences with bullying. Such data points are invaluable for assessing the psycho-social dimensions that influence educational outcomes.

- **Digital Literacy and Resources**: In today’s digital age, access to technology is a significant factor in educational success. This variable assesses the availability and usage of information and communication technologies (ICT) at students’ homes. It examines how these tools are integrated into the learning process and their impact on students' educational performance.

By analyzing these detailed and multi-faceted data, this project aims to uncover patterns and trends that offer insights into the factors that most significantly impact student learning outcomes. These insights can help policymakers, educators, and communities to design targeted interventions that enhance educational equity and effectiveness.
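Because each student has several plausible values per domain, a common way to use them is to compute the statistic of interest once per plausible value and then average the results, rather than averaging the plausible values per student first. The sketch below illustrates this on synthetic data. The column names follow the `PV1MATH` to `PV5MATH` pattern used later in this project, while the helper `pv_mean` and the simulated scores are assumptions made purely for illustration. It ignores survey weights and sampling variance, which a full PISA analysis would also need.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for five math plausible values (not real PISA data).
rng = np.random.default_rng(0)
ability = rng.normal(500, 90, size=1_000)
pv_cols = [f"PV{i}MATH" for i in range(1, 6)]
toy = pd.DataFrame(
    {col: ability + rng.normal(0, 30, size=ability.size) for col in pv_cols}
)


def pv_mean(frame, columns):
    """Compute the mean separately for each plausible value, then pool.

    Returns the pooled estimate and the between-PV (imputation) variance.
    """
    estimates = np.array([frame[col].mean() for col in columns])
    pooled = estimates.mean()
    between_var = estimates.var(ddof=1)
    return pooled, between_var


pooled_mean, imputation_var = pv_mean(toy, pv_cols)
print(f"Pooled mean: {pooled_mean:.1f}, between-PV variance: {imputation_var:.3f}")
```

The between-PV variance gives a rough sense of how much the estimate depends on which plausible value is used.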

## Core Analytical and Advanced Programming Methods

The project harnesses a diverse array of data analysis techniques tailored specifically to dissect the complex interactions within educational data provided by the PISA 2022 dataset. Here’s a rundown of the primary methods applied:

- **Statistical Analysis**:
  - **Descriptive Statistics**: Summarize data features like central tendency, variability, and distribution shapes.
  - **Shapiro-Wilk Test**: Assess the normality of data distributions, crucial for the validity of many other statistical tests.
  - **Correlation Analysis**: Determine the strength and direction of relationships between variables using Pearson’s correlation coefficient.
  - **Regression Analysis**: Linear regression to predict educational outcomes and polynomial regression to capture non-linear relationships.

- **Machine Learning**:
  - **Random Forest Regression**: Utilized to predict outcomes based on multiple input variables and to assess feature importance, providing insights into which factors most significantly impact student performance.
  - **Cross-Validation**: Enhance model validation through k-fold cross-validation, ensuring the model’s robustness and generalizability.

- **Deep Learning**:
  - **Neural Networks**: Deploy neural networks to model complex, non-linear interactions between variables. The multi-layer perceptron architecture facilitates the exploration of deeper patterns in the data, instrumental in uncovering hidden insights.

- **Data Visualization**:
  - **Plotly and Seaborn**: Generate interactive graphs and static plots to visualize data distributions, correlations, and regression outcomes effectively. This includes creating heatmaps for correlation matrices, scatter plots for regression analysis, and treemaps for hierarchical data exploration.

- **Advanced Data Processing**:
  - **Handling Missing Data**: Techniques such as imputation and dropping rows/columns to clean the dataset, ensuring the integrity of the analyses.
  - **Feature Engineering**: Includes generating polynomial features for regression models and encoding categorical variables to prepare the dataset for machine learning.

These methods collectively facilitate a thorough exploration of the intricate dynamics influencing educational achievements across various demographics and socio-economic backgrounds. Through the intelligent application of these techniques, the project aims to provide actionable insights that could inform policy-making and educational strategies.
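As a rough illustration of how several of these methods fit together, the following sketch runs a Shapiro-Wilk test, a Pearson correlation, and a cross-validated random forest on synthetic data. The variables `escs`, `belong`, and `score` are simulated stand-ins rather than PISA columns, and the model settings are arbitrary choices for the example, not the configuration used in this project.

```python
import numpy as np
from scipy.stats import pearsonr, shapiro
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Simulated predictors and outcome (assumed relationship, for illustration only).
rng = np.random.default_rng(42)
n = 500
escs = rng.normal(0, 1, n)
belong = rng.normal(0, 1, n)
score = 480 + 35 * escs + 5 * belong + rng.normal(0, 40, n)

# Shapiro-Wilk: a small p-value suggests departure from normality.
w_stat, p_normal = shapiro(score)

# Pearson correlation between one predictor and the outcome.
r, p_corr = pearsonr(escs, score)

# Random forest evaluated with 5-fold cross-validation (R^2 per fold).
X = np.column_stack([escs, belong])
model = RandomForestRegressor(n_estimators=100, random_state=0)
cv_r2 = cross_val_score(model, X, score, cv=5, scoring="r2")

# Fit on all data to inspect impurity-based feature importances.
model.fit(X, score)
importances = dict(zip(["escs", "belong"], model.feature_importances_))

print(f"Shapiro-Wilk p={p_normal:.3f}, Pearson r={r:.2f}")
print(f"Mean CV R^2={cv_r2.mean():.2f}, importances={importances}")
```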

## Project Aim

The overarching goal of this project is to dissect and understand the multitude of factors that influence educational outcomes for students assessed in the PISA 2022 survey. By leveraging a comprehensive dataset provided by the OECD, this analysis seeks to uncover the nuanced interplay between students' academic performances and their demographic, socio-economic, and environmental contexts.

### Objectives:
1. **Identify Key Factors**: Determine the primary demographic, socio-economic, and educational variables that significantly impact students' scores in mathematics, reading, and science.
2. **Model Educational Outcomes**: Utilize advanced statistical methods and machine learning algorithms to predict educational outcomes and interpret the relative importance of each predictor. This will include assessing the impact of factors such as access to technology, parental education levels, and school environments on student performance.
3. **Evaluate Policy Implications**: Analyze the data to provide evidence-based recommendations to educational authorities and policymakers. The aim is to identify potential areas for intervention that could lead to improvements in educational equity and effectiveness.
4. **Promote Educational Equity**: Explore how differences in educational access and quality affect performance across various groups, aiming to highlight disparities and recommend strategies for promoting inclusivity and fairness in education.

### Impact:
The insights derived from this project are intended to inform and enhance educational policies and practices worldwide. By understanding the factors that drive educational success, stakeholders can implement targeted interventions to support underperforming groups, optimize educational resources, and ultimately raise the standard of education provided to all students. Furthermore, this project aims to stimulate ongoing dialogue among educators, policymakers, and the academic community about how best to harness data-driven insights for educational planning and reform.

In summary, this project not only aims to analyze the data from the PISA 2022 survey comprehensively but also seeks to translate these analyses into practical strategies that can lead to real and sustainable improvements in educational systems globally.


In [4]:
import warnings
import numpy as np
import pandas as pd

## Foundational Libraries

- **warnings**: Utilized to manage warnings during runtime, particularly for suppressing specific warning categories that might clutter the output, ensuring a clean presentation of results. This is especially useful in a data science context to ignore routine warnings generated by third-party libraries without affecting the interpretation of code execution.

- **numpy (np)**: A cornerstone for numerical computing in Python, numpy offers comprehensive support for arrays and matrices alongside a vast library of mathematical functions to operate on these data structures. In our project, it is indispensable for handling numerical operations on arrays efficiently, which underpins various data manipulation tasks.

- **pandas (pd)**: Essential for data manipulation and analysis, pandas provides data structures and operations for manipulating numerical tables and time series. This library is crucial in our project for reading, writing, and processing data from various file formats. It enables sophisticated data manipulation capabilities such as merging, reshaping, selecting, as well as robust handling of missing data.

In [5]:
from scipy import stats
from scipy.stats import shapiro, randint as sp_randint
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.graphics.regressionplots import add_lowess

## Statistical and Machine Learning Libraries

- **scipy.stats, shapiro, sp_randint**: This suite of tools from the SciPy library provides probability distributions, statistical tests, and support for exploratory data analysis. `shapiro` tests for normality, essential for validating assumptions in many statistical models, while `sp_randint` (an alias for `scipy.stats.randint`, a discrete uniform distribution) supplies integer ranges from which hyperparameter values are sampled during tuning.

- **sklearn.model_selection**: Includes submodules like `train_test_split` for dividing data into training and test sets, `RandomizedSearchCV` for optimizing model parameters through random search, and `cross_val_score` for evaluating a model's performance using cross-validation.

- **sklearn.preprocessing**: Contains `PolynomialFeatures` for generating polynomial and interaction features and `StandardScaler` for feature scaling by standardizing variables. These preprocessing tools are vital for enhancing model performance and accuracy.

- **sklearn.linear_model**: Provides models like `LinearRegression` for standard linear regression analysis and `SGDRegressor` for linear models fitted by stochastic gradient descent, a practical approach for large datasets.

- **sklearn.ensemble**: Features `RandomForestRegressor`, an ensemble method based on randomized decision trees, known for its high accuracy and robustness against overfitting, particularly useful for regression tasks in complex datasets.

- **sklearn.metrics**: Includes performance metrics such as `mean_squared_error` for quantifying the prediction error of regression models and `r2_score` for assessing the proportion of variance explained by the model.

- **statsmodels.api**: Offers classes and functions for the estimation of different statistical models, as well as for conducting statistical tests and data exploration. An essential tool for in-depth statistical analysis, often used for regression diagnostics, time-series analysis, and hypothesis testing.

- **statsmodels' diagnostic tools**: `het_breuschpagan` tests for heteroscedasticity, `durbin_watson` assesses autocorrelation in residuals from a regression, `variance_inflation_factor` evaluates multicollinearity, and `add_lowess` (locally weighted scatterplot smoothing) is useful for trend fitting in regression diagnostics. These tools provide deeper insights into the model's assumptions and performance, ensuring robust statistical inference.

In [6]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

## Deep Learning Frameworks: TensorFlow and Keras

- **TensorFlow**: An open-source library developed by Google for numerical computation and machine learning. TensorFlow provides a broad toolkit for developing and training machine learning models, including powerful features for deep learning. It is used in this project to harness complex patterns in the data that simpler models might miss, especially beneficial for large datasets with intricate features.

- **tensorflow.keras.models**: Contains `Sequential`, which is a linear stack of layers used for creating models. The Sequential model is straightforward to understand and use, which is particularly useful for standard deep learning applications where layers are added in sequence.

- **tensorflow.keras.layers**: Provides various layers, including `Dense`, which is a regular densely-connected neural network layer. Dense layers are fundamental in neural networks for learning high-level patterns in large data sets and are used extensively in our models to process and learn from educational data. Each neuron in a Dense layer receives input from all neurons of the previous layer, thus being well-suited for pattern recognition tasks found in complex datasets like PISA.

These tools are integral to building neural network architectures, facilitating the exploration and implementation of deep learning models that can potentially reveal nonlinear relationships and interactions not detectable by traditional statistical methods. This capability is particularly valuable in educational research, where interactions between variables are complex and multidimensional.

In [7]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
import squarify

## Data Visualization Libraries

- **Matplotlib (plt)**: A powerful plotting library for Python, Matplotlib is fundamental for creating static, interactive, and animated visualizations in Python. In this project, Matplotlib is utilized primarily for generating histograms, scatter plots, and more complex visualizations like residual plots, providing a traditional and detailed approach to data visualization.

- **Seaborn (sns)**: Built on top of Matplotlib, Seaborn extends its functionality, making it easier to generate complex visualizations with more attractive and informative statistical graphics. This library is used for creating enhanced visualizations such as heatmaps for correlation matrices, which are crucial for identifying relationships between variables in the dataset.

- **Plotly (go, px, ff)**: A modern platform for creating interactive plots and dashboards. Plotly's Python graphing libraries `plotly.graph_objects` and `plotly.express` offer an extensive range of interactive plotting options that enhance user engagement with the data. Plotly is used for creating dynamic visualizations like choropleth maps, interactive scatter plots, and detailed histograms that allow stakeholders to delve deeper into the data insights through zooming, panning, and hovering to display additional data details.

- **Squarify**: This library lays out and draws treemaps, in which each rectangle's area is proportional to the value it represents, giving a space-filling view of proportions among categories. In the project, Squarify is used to produce treemaps that show the distribution of observations across categories, such as countries or educational variables.

The combined capabilities of these libraries enable a comprehensive suite of visual tools that support both exploratory data analysis and the presentation of findings in a format conducive to stakeholder understanding and decision-making.

In [8]:
from sas7bdat import SAS7BDAT
import pyreadstat
import pycountry

## Data Import and Country Code Handling Libraries

- **sas7bdat (SAS7BDAT)**: This library is specifically designed for reading SAS data files in the `.sas7bdat` format, which are often used in large-scale data analysis contexts. In our project, `sas7bdat` enables the direct importation of the PISA dataset stored in this format, ensuring that data from such specialized formats is accessible for analysis in Python without the need for conversion.

- **pyreadstat**: A library that reads SAS, SPSS, and Stata files into pandas DataFrames together with their associated metadata. For `.sas7bdat` files it plays a role similar to `sas7bdat`, but it also returns a metadata object (for example, variable labels) and can load a subset of columns through the `usecols` parameter. The metadata is useful in preprocessing, as it documents what each coded column contains.

- **pycountry**: Provides the ISO databases for countries (ISO 3166), languages, and currencies, with lookups by name or by alpha-2, alpha-3, and numeric codes. In the project, `pycountry` is used for mapping country names from the dataset to their corresponding ISO alpha-3 codes. This capability supports the integration of the data with other global datasets and aids in the creation of geographically accurate visualizations such as choropleth maps.

These libraries collectively streamline the data import process from specialized formats and enhance the geographical mapping of data, key aspects that underpin the robust analysis and visualization capabilities of the project.

In [None]:
from IPython.display import HTML

## Enhanced Display Handling with IPython

- **IPython.display.HTML**: Part of the IPython ecosystem, this class wraps a string of HTML markup so that Jupyter renders it as rich output when it is passed to `display`. This is useful in our project for presenting information in a more visually appealing format than is possible with plain text output.

In this project, `HTML` is used to enhance the presentation of data and analytics results, allowing for the custom formatting of outputs, which can include styling with CSS, embedding images or interactive elements, and more. This capability significantly improves the readability and user interaction with the project's outputs, making it easier for stakeholders to engage with and understand complex data insights directly within the Jupyter notebook environment.

The use of `HTML` in IPython effectively bridges the gap between standard data output and a more polished, professional presentation format, catering to both technical and non-technical audiences. This ensures that our data visualizations and results are not only informative but also compelling and accessible.

In [9]:
warnings.filterwarnings('ignore', category=FutureWarning)

## Managing Warnings in Python

- **warnings**: This module is a powerful tool for handling warnings in Python programs. In data science projects, particularly those involving multiple libraries that may not always align perfectly in terms of dependencies or deprecated features, managing warnings effectively is crucial to maintain a clean and readable output. Using `warnings.filterwarnings('ignore', category=FutureWarning)` specifically instructs Python to ignore warnings about future changes to the libraries' APIs or deprecated features that have not yet been removed but will be in future versions.
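A global filter hides every `FutureWarning` for the rest of the session. Where that is too broad, the suppression can be scoped to a single block with `warnings.catch_warnings()`. The sketch below shows this alternative; `noisy_helper` is a hypothetical function standing in for a library call that emits the warning.

```python
import warnings


def noisy_helper():
    """Hypothetical function standing in for a library call that warns."""
    warnings.warn("this behaviour will change in a future release", FutureWarning)
    return 42


# Suppress FutureWarning only inside this block; the previous filters
# are restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    value = noisy_helper()

# Record warnings explicitly to confirm that the helper does emit a
# FutureWarning when it is not filtered out.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_helper()

print(value, [w.category.__name__ for w in caught])
```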


In [11]:
# file_path = 'cy08msp_stu_qqq.sas7bdat'

# columns_to_load = [
#    "CNT", "OECD", "ST004D01T", "ST001D01T", "ST126Q01TA", "ST125Q01NA",
#    "PV1MATH", "PV2MATH", "PV3MATH", "PV4MATH", "PV5MATH",
#    "PV1READ", "PV2READ", "PV3READ", "PV4READ", "PV5READ",
#    "PV1SCIE", "PV2SCIE", "PV3SCIE", "PV4SCIE", "PV5SCIE",
#    "ESCS", "IMMIG", "BELONG", "BULLIED", "FEELSAFE",
#    "STUDYHMW", "DISCLIM", "ADMINMODE", "ST019CQ01T",
#    "ICTHOME", "MATHMOT"
#]

# df, metadata = pyreadstat.read_sas7bdat(file_path, usecols=columns_to_load)

# new_file_path = 'new_processed_pisa_data.csv'
# df.to_csv(new_file_path, index=False)

# html = '<ul>'
# html += ''.join(f'<li>{column}</li>' for column in df.columns)
# html += '</ul>'

# display(HTML(html))


# new_file_path_csv = 'processed_pisa_data.csv'

# df.to_csv(new_file_path_csv, index=False)

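from IPython.display import HTML, display

# Load the two CSV files derived from the original .sas7bdat file
df_first_half = pd.read_csv('first_half.csv')
df_second_half = pd.read_csv('second_half.csv')

# Recombine them into a single DataFrame with a fresh index
df = pd.concat([df_first_half, df_second_half], ignore_index=True)

# Render the loaded column names as an HTML list
html = '<ul>'
html += ''.join(f'<li>{column}</li>' for column in df.columns)
html += '</ul>'

display(HTML(html))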

## Correction After Data Reading on Local Machine
1. **Loading Data**: 
   - We used `pyreadstat` to read only the necessary columns from the `cy08msp_stu_qqq.sas7bdat` file. This approach allows us to minimize memory usage and optimize processing time.

2. **Data Conversion to CSV**:
   - The `.sas7bdat` file is too large to upload directly to GitHub, as it exceeds the platform's file size limits.
   - To address this, we exported the loaded DataFrame to CSV using `df.to_csv()`. Together with loading only the required columns, this keeps the data small enough for version control systems like GitHub. The active code reads the data back from two CSV files, `first_half.csv` and `second_half.csv`, and concatenates them.

3. **Commenting Out Initial Code**:
   - The initial lines of code responsible for loading and saving the data are commented out to indicate that these operations were performed prior to uploading the data to GitHub.
   - This ensures that anyone reviewing the code understands that the data preparation phase has been completed and the resulting CSV files are ready for analysis and further processing.
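One way such a split-and-recombine round trip could look is sketched below on a tiny synthetic DataFrame. The file names match those read by the active code, but the midpoint split and the temporary directory are assumptions for illustration; the original splitting step is not shown in this notebook.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the exported PISA columns.
toy = pd.DataFrame(
    {"CNT": ["AAA", "BBB", "CCC", "DDD", "EEE"], "PV1MATH": np.arange(5, dtype=float)}
)

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Write the rows before and after the midpoint to separate CSV files.
    midpoint = len(toy) // 2
    toy.iloc[:midpoint].to_csv(tmp_path / "first_half.csv", index=False)
    toy.iloc[midpoint:].to_csv(tmp_path / "second_half.csv", index=False)

    # Read both halves back and stack them with a fresh integer index.
    restored = pd.concat(
        [
            pd.read_csv(tmp_path / "first_half.csv"),
            pd.read_csv(tmp_path / "second_half.csv"),
        ],
        ignore_index=True,
    )

print(restored)
```

Using `ignore_index=True` avoids duplicate index labels, since each half is read back with its own index starting at zero.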

**Important Note:** The forthcoming descriptions assume that there is no commented-out part of the code. This ensures a clear understanding of our original approach, which involves directly reading from the `.sas7bdat` file into a DataFrame. The explanations and procedures outlined will proceed as if we are interacting with the data freshly loaded from the `.sas7bdat` file, reflecting the steps and methodologies initially used in our data handling process.

## Data Import and Column Selection

In this segment of the project, we are focusing on importing a specific dataset and selectively loading certain columns relevant to our analysis. This is executed through the use of the `pyreadstat` library, which provides an interface to read SAS data files directly into Python, preserving the dataset's structure and metadata.

### Code Explanation:

- **File Path Specification**: The `file_path` variable is defined to store the path of the `.sas7bdat` file. This path points to the location on the local machine where the PISA dataset is stored, making it accessible for loading.

- **Columns Selection**: `columns_to_load` is a list containing the names of the specific columns we want to import from the dataset. This selective loading helps optimize memory usage and processing speed by only importing data that is necessary for our subsequent analyses.

- **Reading the Dataset**: `pyreadstat.read_sas7bdat()` is called with `file_path` and `usecols` parameters. The `usecols` parameter takes the list `columns_to_load`, which instructs `pyreadstat` to load only those columns specified, enhancing the efficiency of the data import process.

- **Generating an HTML List of Columns**: Once the dataset is loaded into the dataframe `df`, we generate an HTML formatted list of the column names. This is done by iterating over `df.columns`, creating an HTML list item for each column name. This not only confirms the columns that have been loaded but also presents this information in a visually appealing format within the Jupyter Notebook.

The use of `pyreadstat` here bridges the gap between specialized data storage formats used in large-scale studies like PISA and the flexible, dynamic environment of Python, allowing for sophisticated data manipulation and analysis directly within Jupyter notebooks.
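The HTML list step described above can be wrapped in a small helper. The function name `columns_to_html_list` is an assumption for this sketch, and it adds one thing the notebook code does not do: it escapes each name with `html.escape`, so that characters such as `<` in a column name cannot break the markup.

```python
from html import escape


def columns_to_html_list(columns):
    """Return an HTML unordered list with one escaped item per column name."""
    items = "".join(f"<li>{escape(str(name))}</li>" for name in columns)
    return f"<ul>{items}</ul>"


markup = columns_to_html_list(["CNT", "ESCS", "A<B"])
print(markup)
```

In a notebook, the returned string would be passed to `display(HTML(markup))` to render the list.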

## Selective Variable Inclusion for Analysis

Given the extensive number of columns available in the PISA dataset, a key task was to judiciously select a subset of variables from various categories. This selection process aimed to focus on variables that are likely to influence testing outcomes, thus enabling a targeted and insightful analysis. The chosen variables span demographic details, academic performance, socio-cultural factors, educational environment, and technological access, providing a broad spectrum of data for a comprehensive evaluation.

### Organized Selection of Variables:

#### 1. **Demographic and Country-Specific Information:**
- **Country (CNT)**: Essential for identifying educational outcomes across different nations and understanding global educational disparities.
- **OECD Membership (OECD)**: Facilitates a comparative analysis of educational systems and outcomes between OECD member countries and others, offering insights into the effectiveness of educational policies within the OECD framework.

#### 2. **Student and Family Background:**
- **Gender (ST004D01T)**: Critical for assessing gender disparities in educational attainment and supporting gender-inclusive educational strategies.
- **Grade (ST001D01T)**: Helps evaluate the impact of academic progression on performance across different educational stages.
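- **Father's Education Level (ST126Q01TA)** and **Mother's Education Level (ST125Q01NA)**: Used in this project as indicators of familial socio-economic status, a known factor influencing educational access and quality. These item codes should be confirmed against the PISA 2022 codebook, since parental education is usually taken from the ST005 and ST007 items or from the derived index HISCED.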

#### 3. **Academic Performance Scores:**
- **Math, Reading, and Science Scores (PV1MATH to PV5SCIE)**: These scores are central to analyzing core academic competencies and are fundamental for educational assessments in PISA.

#### 4. **Socio-Cultural and Psychological Factors:**
- **Economic Social Cultural Status (ESCS)**: A composite measure that provides a socioeconomic context to student performance, highlighting disparities and opportunities for targeted interventions.
- **Immigration Status (IMMIG)**: Crucial for studying the challenges and performance of immigrant students, which often differ from those of native-born students.
- **Sense of Belonging (BELONG)** and **Feeling of Safety (FEELSAFE)**: Psychological well-being metrics that significantly impact student engagement and academic success.
- **Bullying Frequency (BULLIED)**: Reflects the safety of the school environment, which is associated with students' mental health and learning outcomes.

#### 5. **Educational Environment and Practices:**
- **Weekly Study Time (STUDYHMW)**: Indicates the level of academic engagement outside of school hours, which is predictive of academic success.
- **Disciplinary Climate (DISCLIM)**: A measure of classroom management and educational climate, which affects learning efficiency.
- **Test Administration Mode (ADMINMODE)**: Distinguishes between digital and paper testing environments, relevant in the context of modern educational practices.

#### 6. **Technological Access and Motivation:**
- **Access to ICT at Home (ICTHOME)**: Represents the digital divide in education, which has become increasingly important in today's technology-driven world.
- **Math Motivation (MATHMOT)**: Directly influences student performance in math and is presumed to also positively impact performance in other subjects, reflecting intrinsic and extrinsic motivations toward academic achievement.


In [None]:
column_names_map = {
    "CNT": "Country",
    "OECD": "OECD_Membership",
    "ST004D01T": "Gender",
    "ST001D01T": "Grade",
    "ST126Q01TA": "Father_Education_Level",
    "ST125Q01NA": "Mother_Education_Level",
    "PV1MATH": "Math_Score_PV1",
    "PV2MATH": "Math_Score_PV2",
    "PV3MATH": "Math_Score_PV3",
    "PV4MATH": "Math_Score_PV4",
    "PV5MATH": "Math_Score_PV5",
    "PV1READ": "Reading_Score_PV1",
    "PV2READ": "Reading_Score_PV2",
    "PV3READ": "Reading_Score_PV3",
    "PV4READ": "Reading_Score_PV4",
    "PV5READ": "Reading_Score_PV5",
    "PV1SCIE": "Science_Score_PV1",
    "PV2SCIE": "Science_Score_PV2",
    "PV3SCIE": "Science_Score_PV3",
    "PV4SCIE": "Science_Score_PV4",
    "PV5SCIE": "Science_Score_PV5",
    "ESCS": "Economic_Social_Cultural_Status",
    "IMMIG": "Immigration_Status",
    "BELONG": "Sense_of_Belonging",
    "BULLIED": "Bullying_Frequency",
    "FEELSAFE": "Feeling_of_Safety",
    "STUDYHMW": "Weekly_Study_Time",
    "DISCLIM": "Disciplinary_Climate",
    "ADMINMODE": "Test_Administration_Mode",
    "ST019CQ01T": "Student_father’s_country_of_birth",
    "ICTHOME": "Access_to_ICT_at_Home",
    "MATHMOT": "Math_Motivation"
}

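# Apply the mapping so the DataFrame uses the descriptive names
df = df.rename(columns=column_names_map)

html = '<ul>'
html += ''.join(f'<li>{column}</li>' for column in df.columns)
html += '</ul>'

# Display the HTML
display(HTML(html))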

## Data Renaming and Verification

In this section of the code, we first map the abbreviated column names from the PISA dataset to more descriptive names using a dictionary called `column_names_map`. This mapping facilitates easier understanding and manipulation of the data by replacing cryptic identifiers with clear, descriptive terms that accurately reflect the content of each column. For example, "CNT" is renamed to "Country", and "ST004D01T" to "Gender", making the dataset more intuitive for anyone analyzing the data.

After renaming the columns, we generate a formatted HTML list of the new column names to verify that all the renaming operations have been successfully applied. This visual confirmation is crucial to ensure no column has been overlooked and that each name aligns with our analytical framework.
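Besides the visual check, the renaming can be verified programmatically. The sketch below uses a short excerpt of a mapping in the style of `column_names_map` and a toy DataFrame; the extra column `EXTRA` and the variable names `unmapped`, `missing_sources`, and `duplicates` are assumptions introduced for the example.

```python
import pandas as pd

# Short excerpt of a mapping in the style of `column_names_map`.
name_map = {
    "CNT": "Country",
    "ST004D01T": "Gender",
    "ESCS": "Economic_Social_Cultural_Status",
}

# Toy frame with one column that the mapping does not cover.
toy = pd.DataFrame({"CNT": ["AAA"], "ST004D01T": [1], "ESCS": [0.3], "EXTRA": [9]})
renamed = toy.rename(columns=name_map)

# Columns present in the data but absent from the mapping (left unrenamed).
unmapped = [col for col in toy.columns if col not in name_map]

# Mapping keys that do not exist in the data (possible typos in the codes).
missing_sources = [col for col in name_map if col not in toy.columns]

# Descriptive names that ended up duplicated after renaming.
duplicates = renamed.columns[renamed.columns.duplicated()].tolist()

print(list(renamed.columns), unmapped, missing_sources, duplicates)
```

`DataFrame.rename` silently skips keys that are not present, so checking `missing_sources` is what catches a mistyped source code.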

In [None]:
missing_data = df.isnull().mean() * 100

missing_data_df = pd.DataFrame({'Column': missing_data.index, 'MissingPercentage': missing_data.values})

missing_data_df = missing_data_df[missing_data_df['MissingPercentage'] > 0].sort_values('MissingPercentage', ascending=False)

fig = px.bar(missing_data_df, x='MissingPercentage', y='Column', orientation='h',
             height=800, width=1000,
             title='Percentage of Missing Values by Column',
             labels={'MissingPercentage': 'Percentage of Missing Values', 'Column': 'Column'},
             text='MissingPercentage')

fig.update_traces(texttemplate='%{text:.2f}%', textposition='outside', marker_color='rgb(69, 117, 180)')
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)', xaxis_showgrid=True, yaxis_showgrid=True,
                  paper_bgcolor='rgba(0,0,0,0)', title_x=0.5)

fig.show()

## Handling Missing Data and Visualization

In this segment of the code, we identify and visualize the percentage of missing values for each column in the dataset. The process begins with the expression `df.isnull().mean() * 100`: `isnull()` flags each missing entry as `True`, `mean()` turns those flags into the fraction of missing entries per column, and multiplying by 100 expresses that fraction as a percentage.

Once the percentages are calculated, we construct a DataFrame named `missing_data_df` to organize these values along with the corresponding column names. This DataFrame is particularly structured to include only those columns that have missing values, which is filtered by `missing_data_df['MissingPercentage'] > 0`. Furthermore, it is sorted in descending order of missing percentage to prioritize attention to columns with the highest rates of missing data.
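The calculation and filtering described above can be traced on a small example. This is a sketch on made-up data (the frame `toy` and its columns `a`, `b`, `c` are assumptions for illustration), built in a slightly more compact way than the notebook cell.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],  # 2 of 4 values missing
    "b": [1.0, 2.0, 3.0, 4.0],        # complete
    "c": [np.nan, 2.0, 3.0, 4.0],     # 1 of 4 values missing
})

# Fraction of nulls per column, expressed as a percentage.
missing_pct = toy.isnull().mean() * 100

# Keep only columns with gaps, largest share first, as a two-column frame.
summary = (
    missing_pct[missing_pct > 0]
    .sort_values(ascending=False)
    .rename_axis("Column")
    .reset_index(name="MissingPercentage")
)

print(summary)
```

Column `b` drops out because it has no gaps, and `a` (50%) is listed before `c` (25%).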

To visually represent this data, we employ Plotly Express to create a horizontal bar chart. This chart is not only visually appealing but also interactive, allowing for a detailed examination of the data. Each bar represents a column from the dataset, with its length proportional to the percentage of missing values, providing an immediate visual indication of data completeness across the dataset.

The visual enhancement of the chart includes:
- **Text labels outside bars**: Displaying exact percentages of missing data for clear, quantitative evaluation.
- **Color customization**: Using a consistent and distinct color to ensure that the visualization is both aesthetically pleasing and easy to read.
- **Background and grid adjustments**: Setting a transparent background and enabling grid lines to focus attention on the data representation itself.


The missing data visualization provides critical insights that guide the subsequent steps in our data preprocessing strategy. For analytical purposes, it's essential to assess the impact of missing data across various features to ensure the reliability and validity of our findings. The analysis reveals a varying degree of missing data across several key variables, which will significantly influence our approach to data preparation and analysis.

The most notable is the high percentage of missing values in **Access to ICT at Home (45.97%)**, which implies a substantial data gap in an area crucial for understanding digital literacy's role in educational outcomes. This high level of missing data necessitates careful consideration, such as the potential use of imputation techniques or a sensitivity analysis to understand the impact of missing data on our study's conclusions.

Other variables with significant missing data include **Math Motivation (17.47%)** and **Feeling of Safety (15.73%)**. These factors are vital for analyzing student engagement and well-being, which are known to affect academic performance. The absence of this data could bias our analysis, highlighting the importance of addressing these gaps adequately through statistical methods that can handle missingness without compromising the study's overall integrity.

Lesser, but still noteworthy percentages in **Disciplinary Climate, Sense of Belonging, Bullying Frequency,** and **Immigration Status** suggest that while these areas are less problematic, they still require attention to ensure comprehensive data coverage.

Given these insights, our next steps will involve deciding on the appropriate techniques for handling missing data. Options include imputation where reasonable (particularly where missing data might introduce bias) or possibly excluding variables with excessively high missingness if they are likely to distort the analysis. This strategic approach will help mitigate the risk of drawing inaccurate conclusions and ensure that our analysis remains robust and reliable. The ultimate aim is to maintain the analytical validity while ensuring that our findings are reflective of the true patterns and relationships present in the data.

In [None]:
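quantitative_vars = ['Access_to_ICT_at_Home', 'Feeling_of_Safety']
for var in quantitative_vars:
    median_value = df[var].median()
    # Assign the result back instead of calling fillna(..., inplace=True)
    # on a column selection, which recent pandas versions warn about.
    df[var] = df[var].fillna(median_value)

categorical_vars = ['Math_Motivation']
for var in categorical_vars:
    # mode() returns a Series; its first element is the most frequent value.
    mode_value = df[var].mode()[0]
    df[var] = df[var].fillna(mode_value)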

low_missing_vars = ['Disciplinary_Climate', 'Sense_of_Belonging', 'Bullying_Frequency', 'Immigration_Status', 
                    'Father_Education_Level', 'Weekly_Study_Time', 'Mother_Education_Level', 'Student_father’s_country_of_birth',
                    'Economic_Social_Cultural_Status', 'Grade', 'Gender']
df.dropna(subset=low_missing_vars, inplace=True)

missing_values_summary = df.isnull().sum()

print("Summary of missing values for each column:")
print(missing_values_summary[missing_values_summary > 0])

if df.isnull().any().any():
    print("There are still missing values in the dataframe.")
else:
    print("No missing values found in the dataframe.")

### Filling Missing Values for Quantitative Variables

We used pandas' `median()` function to compute the median of 'Access_to_ICT_at_Home' and 'Feeling_of_Safety', followed by `fillna()` to apply these median values where data was missing. This is done in a loop over each quantitative variable to automate the process and reduce the risk of errors.

### Imputation for a Categorical Variable

For 'Math_Motivation', the mode (most common value) was determined using pandas' `mode()` function and then used to fill missing entries. This method ensures that the imputed values are representative of the most frequently occurring category within the dataset.

### Dropping Rows with Few Missing Values

We identified specific columns with minimal missing values and decided to remove any rows with missing data in these columns. This was achieved using `dropna()` with the subset parameter, which specifies the columns to consider for dropping rows. This approach cleans the data while retaining as much information as possible.

### Final Verification

After all imputation steps, we ran a check using `isnull().sum()` to list any columns still containing missing values and `any().any()` to confirm the absence of missing data. This verification step is crucial to ensure data integrity before moving forward with any further analysis.
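The four steps can be followed end to end on a tiny frame. The sketch below uses invented columns (`ict`, `motivation`, `grade`) as stand-ins for the real variables, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ict": [1.0, np.nan, 3.0, 10.0, np.nan],          # numeric, 2 gaps
    "motivation": ["yes", "yes", None, "no", "yes"],  # categorical, 1 gap
    "grade": [9, 10, 10, np.nan, 9],                  # few gaps: drop rows
})

# Step 1: median imputation for the numeric column (median of 1, 3, 10 is 3).
toy["ict"] = toy["ict"].fillna(toy["ict"].median())

# Step 2: mode imputation for the categorical column (most frequent: "yes").
toy["motivation"] = toy["motivation"].fillna(toy["motivation"].mode()[0])

# Step 3: drop the rows that still have a gap in the low-missingness column.
toy = toy.dropna(subset=["grade"])

# Step 4: verify that nothing is left.
remaining = toy.isnull().sum()
print(remaining)
print("Rows kept:", len(toy))
```

Note the order: imputation runs before `dropna()`, so the median and mode are computed on all available rows, including those that are dropped afterwards.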


## Comprehensive Handling of Missing Data

In our approach to managing missing data, we meticulously categorized variables based on their nature and proportion of missing values to apply the most suitable imputation techniques, enhancing the dataset's completeness without compromising its integrity.

### Handling Quantitative Variables

For quantitative variables such as 'Access_to_ICT_at_Home' and 'Feeling_of_Safety', which had significant percentages of missing data, we used median imputation. The median is less affected by outliers and skewed data than the mean, making it a reasonable choice for filling gaps in these numerical columns. It keeps the center of each distribution in place, although filling many gaps with a single value (about 46% of 'Access_to_ICT_at_Home') reduces the column's variance, which should be kept in mind when interpreting results that involve these variables.
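To see why the median is the more robust fill value, consider a short series with one extreme value. This is an illustrative sketch on invented numbers, not project data. It also reports the spread before and after imputation, since filling several gaps with one constant can narrow it.

```python
import numpy as np
import pandas as pd

# Five observed values (one outlier) and two gaps.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan, np.nan])

mean_value = s.mean()      # pulled upwards by the outlier
median_value = s.median()  # stays with the bulk of the data

median_filled = s.fillna(median_value)

std_before = s.std()             # spread of the observed values only
std_after = median_filled.std()  # spread once the gaps hold the median

print(f"mean={mean_value}, median={median_value}")
print(f"std before={std_before:.1f}, std after={std_after:.1f}")
```

Here the mean (22.0) would place imputed students far from any typical observation, while the median (3.0) does not. The standard deviation shrinks after imputation, and the effect grows with the share of missing values.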

### Handling a Categorical Variable

The variable 'Math_Motivation' is categorical and had missing entries. We filled these gaps with the mode of the column, which is the most frequent value. Using the mode ensures that every imputed value is a category that actually occurs in the data. Because roughly 17% of the entries are filled this way, the share of the modal category is inflated accordingly.

### Dealing with Variables Having Few Missing Values

For columns like 'Disciplinary_Climate', 'Sense_of_Belonging', and others with relatively low missing values, we opted for row deletion. This method was chosen because the low percentage of missing data meant that removing these rows would not lead to significant loss of information but would ensure a completely clean dataset for analysis.

### Verification of Imputation Effectiveness

Post-imputation, we performed a thorough check to ensure no remaining missing values. The results confirmed that our strategies effectively addressed all gaps, as evidenced by the absence of missing data across all considered columns.

This careful handling of missing data ensures that our dataset is primed for high-quality analyses, reflecting an accurate representation of the underlying phenomena.

In [None]:
country_counts = df['Country'].value_counts()
total_observations = country_counts.sum()
country_percentage = (country_counts / total_observations) * 100
country_data = pd.DataFrame({
    'Country': country_counts.index,
    'Observations': country_counts.values,
    'Percentage': country_percentage.values
})

fig = px.treemap(country_data, path=['Country'], values='Observations',
                 color='Percentage', hover_data=['Observations', 'Percentage'],
                 color_continuous_scale='RdYlGn', title='Distribution of Observations by Country')

fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))

fig.show()

## Data Preparation and Visualization: Country Distribution

In this section, we delve into the geographic distribution of the dataset's observations by examining the frequency of records from different countries. This analysis is pivotal for understanding the data's breadth and ensuring a balanced representation across various regions.

### Step-by-Step Code Explanation:

- **Counting Observations per Country**: We use `df['Country'].value_counts()` to generate a series containing the counts of occurrences for each country in the dataset. This method efficiently tallies the number of rows for each unique country, providing a clear view of the dataset's composition.

- **Calculating Total Observations**: The total number of observations within the dataset is calculated by summing up all entries in the `country_counts` series using `country_counts.sum()`. This total serves as the basis for further calculations, like percentage distribution.

- **Percentage Calculation**: To put the data into perspective relative to the entire dataset, we compute the percentage of observations for each country with `country_counts / total_observations * 100`. This gives a normalized view of the data, highlighting the proportionate contribution of each country.

- **Creating DataFrame for Visualization**: We then construct a DataFrame named `country_data` that contains columns for `Country`, `Observations`, and `Percentage`. This structured format is necessary for generating visualizations that require specific data arrangements, such as treemaps.

- **Generating a Treemap**: Using Plotly Express, we create an interactive treemap with `px.treemap()`. The treemap displays each country as a distinct segment, sized by its number of observations and colored by its percentage of the total. We configure the treemap with hover data to show detailed statistics when a segment is hovered over. Treemaps are designed for hierarchical (nested) data; with a single level, as here, they provide a quick, digestible view of the data's geographical distribution.

- **Layout Customization**: The layout of the treemap is customized with `fig.update_layout(margin=...)`, which tightens the margins around the chart to improve readability.
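The counting and percentage steps above can be checked on a handful of rows before plotting. The sketch below uses invented rows and, as an alternative to dividing by the total, pandas' `value_counts(normalize=True)`, which returns the shares directly.

```python
import pandas as pd

# Five hypothetical student rows from three countries.
toy = pd.DataFrame({"Country": ["ESP", "ESP", "ESP", "FIN", "MLT"]})

country_counts = toy["Country"].value_counts()
# normalize=True yields fractions that sum to 1; scale to percent.
country_percentage = toy["Country"].value_counts(normalize=True) * 100

country_data = pd.DataFrame({
    "Country": country_counts.index,
    "Observations": country_counts.values,
    # reindex guarantees both series are aligned on the same country order.
    "Percentage": country_percentage.reindex(country_counts.index).values,
})

print(country_data)
```

The resulting frame has the same three columns as `country_data` in the notebook and can be passed to `px.treemap()` unchanged.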



## Analysis of Country-Wise Observations in PISA Dataset

The data exploration stage of our project includes a comprehensive visualization of the distribution of observations by country within the PISA dataset. This analysis is pivotal as it highlights the breadth and diversity of data available across different regions, which is essential for ensuring the robustness of our subsequent analyses.

### Interpretation of the Treemap Visualization:

- **Top Observations**: The countries with the highest number of observations, such as Spain (ESP), United Arab Emirates (ARE), and Kazakhstan (KAZ), indicate extensive data collection efforts in these regions. Spain leads with 5.74% of the total observations, suggesting a very active participation in the PISA survey.

- **Significant Contributors**: Countries like Canada (CAN), Indonesia (IDN), and Australia (AUS) also contribute significantly, with more than 2% of the total observations each. These contributions are crucial as they ensure a diverse input reflective of various educational systems.

- **Moderate to Low Observations**: Countries like Italy, the United Kingdom, and Finland show moderate levels of participation. It’s notable that countries with historically strong educational outcomes like Finland still contribute valuable data (1.85%).

- **Global Participation**: The data includes countries from varying socio-economic backgrounds and geographic locations, from Argentina in South America to Vietnam in Asia, providing a global perspective on educational achievement.

- **Special Cases**: Smaller countries or regions like Malta (MLT), Iceland (ISL), and Panama (PAN) show lower participation rates, which is expected due to their smaller population sizes.

### Analytical Insights:

- **Global Coverage and Educational Insights**: The widespread coverage from 75 countries provides a robust dataset that is invaluable for cross-country comparisons and understanding global educational trends.


In [None]:
math_scores = df.groupby('Country')[['Math_Score_PV1', 'Math_Score_PV2', 'Math_Score_PV3', 'Math_Score_PV4', 'Math_Score_PV5']].mean().mean(axis=1).round(0)

reading_scores = df.groupby('Country')[['Reading_Score_PV1', 'Reading_Score_PV2', 'Reading_Score_PV3', 'Reading_Score_PV4', 'Reading_Score_PV5']].mean().mean(axis=1).round(0)

science_scores = df.groupby('Country')[['Science_Score_PV1', 'Science_Score_PV2', 'Science_Score_PV3', 'Science_Score_PV4', 'Science_Score_PV5']].mean().mean(axis=1).round(0)

average_scores = pd.DataFrame({
    'Average Math Score': math_scores,
    'Average Reading Score': reading_scores,
    'Average Science Score': science_scores
})

average_scores.reset_index()

#### Calculating Average Scores:
   - **Data Grouping**: We begin by grouping the data by 'Country' using the `groupby` method.
   - **Score Calculation**:
     - For each subject (mathematics, reading, and science), we select the plausible values and calculate their mean for each country using the `mean()` method.
     - We then compute the mean of these values across the columns, which represent different plausible values, to obtain a single average score per subject per country.
     - These scores are rounded to the nearest whole number to facilitate easier interpretation.
   - **Combining the Results**: All the calculated averages are combined into a single DataFrame named `average_scores`, which includes columns for average math, reading, and science scores, providing a comprehensive view of the performance across subjects for each country.
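   - **Displaying the Results**: The `reset_index()` call returns a copy with the index (countries) turned back into a column, which is what the cell displays. Because the result is not assigned, `average_scores` itself keeps 'Country' as its index, and the ISO mapping in the next cell relies on that.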
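The two-stage averaging is easier to follow with numbers. The sketch below uses two invented plausible-value columns and four invented students. Like the notebook cell, it is an unweighted average.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["ESP", "ESP", "FIN", "FIN"],
    "Math_Score_PV1": [480.0, 500.0, 510.0, 530.0],
    "Math_Score_PV2": [490.0, 510.0, 500.0, 520.0],
})
pv_cols = ["Math_Score_PV1", "Math_Score_PV2"]

# Stage 1: mean of each plausible value within each country.
# ESP -> PV1 490, PV2 500; FIN -> PV1 520, PV2 510
per_pv_means = toy.groupby("Country")[pv_cols].mean()

# Stage 2: mean across the plausible-value columns, then round.
math_scores = per_pv_means.mean(axis=1).round(0)

print(per_pv_means)
print(math_scores)
```

Each country ends up with one number per subject: 495 for ESP and 515 for FIN in this toy case.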

In [None]:
def get_iso_alpha(country_name):
    try:
        return pycountry.countries.lookup(country_name).alpha_3
    except LookupError:
        return None

average_scores['iso_alpha'] = average_scores.index.map(get_iso_alpha)
missing_iso = average_scores['iso_alpha'].isnull().sum()
print(f"Number of missing ISO codes: {missing_iso}")
missing_iso_countries = average_scores[average_scores['iso_alpha'].isnull()]
missing_iso_countries

## Mapping Country Names to ISO Alpha-3 Codes

The purpose of this segment is to integrate international standards into our dataset by converting country names into ISO alpha-3 codes, which are three-letter country codes defined by the International Organization for Standardization (ISO). This conversion facilitates consistent country identification across different datasets and enhances the compatibility of our data for global analysis.

### Code Explanation:

- **Function Definition (`get_iso_alpha`)**: We define a function called `get_iso_alpha` that attempts to find the ISO alpha-3 code for a given country name. This function uses the `pycountry` library, which contains data on country names and their corresponding codes.

  - **Try-Except Block**: Within the function, we use a try-except block to handle situations where a country name might not be recognized or is missing from the `pycountry` database. If the country is found, the ISO alpha-3 code is returned; otherwise, `None` is returned.

- **Mapping ISO Codes**: We apply the `get_iso_alpha` function to each country name in the index of the `average_scores` DataFrame. This is done using the `.map()` method, which transforms the values of the index according to the function defined.

- **Counting Missing Codes**: After mapping the codes, we calculate the number of missing ISO codes by counting the `None` values in the new `iso_alpha` column. This is accomplished using the `.isnull().sum()` method, providing a quick count of how many country names could not be mapped to ISO codes.

- **Identifying Countries with Missing Codes**: To further investigate and address the missing codes, we filter the `average_scores` DataFrame to list entries where the ISO code is `None`. This subset, stored in `missing_iso_countries`, helps identify specific countries that require further attention for correct ISO mapping.

### Results:

After mapping ISO codes to country names, we identified that 3 countries did not have corresponding ISO alpha-3 codes in the database: `QAZ`, `QUR`, `TAP`.


In [None]:
average_scores.loc[average_scores.index == 'QAZ', 'iso_alpha'] = 'AZE' 
average_scores.loc[average_scores.index == 'TAP', 'iso_alpha'] = 'TWN'

average_scores = average_scores.drop(index='QUR')

missing_iso_countries = average_scores[average_scores['iso_alpha'].isnull()]
missing_iso_countries

## Correcting ISO Alpha-3 Codes for Specific Countries

Following the initial mapping of country names to ISO alpha-3 codes, further scrutiny was required to ensure the accuracy of our dataset. A close examination of the PISA codebook provided insights into the correct identifiers for certain countries which were initially unrecognized.

### Code Explanation:

- **Manual Corrections**: Based on insights gained from the PISA codebook:
  
  - **Azerbaijan (QAZ)**: Initially not found under its common code, it was determined that 'QAZ' should be corrected to 'AZE', representing Azerbaijan. This correction was implemented using the `.loc[]` method on the `average_scores` DataFrame, specifying the index 'QAZ' and assigning 'AZE' to the 'iso_alpha' column.
  
  - **Taiwan (TAP)**: Similarly, 'TAP' was identified as the code used for Taiwan. It was corrected to 'TWN', aligning with the standard ISO alpha-3 code for Taiwan. This adjustment was also made using the `.loc[]` method.

- **Removing Undefined Entries**: 
  - **QUR**: Despite exhaustive checks, no corresponding entry for 'QUR' was found in the PISA codebook. To maintain the integrity and accuracy of our analyses, rows associated with 'QUR' were removed from the dataset using the `.drop()` method.

- **Final Verification**:
  - After these corrections and the removal of the undefined entries, we again checked for any remaining entries with null ISO codes using the condition `average_scores[average_scores['iso_alpha'].isnull()]`. The result confirmed that all remaining entries now have valid ISO alpha-3 codes.
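The lookup-then-correct sequence can also be expressed as a single resolver that consults a table of manual overrides first. The sketch below is one possible arrangement, not the notebook's method: `KNOWN_ISO` and `iso_lookup` are hypothetical stand-ins for the `pycountry` lookup, so that the example runs without the library.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for a real ISO database such as pycountry.
KNOWN_ISO = {"ESP", "FIN", "AZE", "TWN"}


def iso_lookup(code: str) -> Optional[str]:
    """Return the code if it is a known ISO alpha-3 code, else None."""
    return code if code in KNOWN_ISO else None


# Dataset-specific codes mapped by hand, as in the corrections above.
MANUAL_OVERRIDES: Dict[str, str] = {"QAZ": "AZE", "TAP": "TWN"}


def resolve_iso(
    code: str,
    lookup: Callable[[str], Optional[str]] = iso_lookup,
    overrides: Dict[str, str] = MANUAL_OVERRIDES,
) -> Optional[str]:
    """Prefer a manual override, then fall back to the generic lookup."""
    if code in overrides:
        return overrides[code]
    return lookup(code)


codes = ["ESP", "QAZ", "TAP", "QUR"]
resolved = {code: resolve_iso(code) for code in codes}
unresolved = [code for code, iso in resolved.items() if iso is None]

print(resolved)
print("Still unresolved:", unresolved)
```

Keeping the overrides in one dictionary makes the manual decisions visible in a single place, and any code that remains unresolved (here `QUR`) is reported rather than silently dropped.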


In [1]:
import plotly.graph_objects as go

trace1 = go.Choropleth(
    locations=average_scores['iso_alpha'],
    z=average_scores['Average Math Score'], 
    colorscale=[
        [0, 'rgb(255,0,0)'], 
        [0.5, 'rgb(255,255,0)'], 
        [1, 'rgb(0,128,0)'] 
    ],
    autocolorscale=False,
    reversescale=False,
    marker_line_color='black',
    marker_line_width=0.5,
    colorbar_title='Math Scores',
    name='Math Score',
    hoverinfo='location+z+name'
)

trace2 = go.Choropleth(
    locations=average_scores['iso_alpha'],
    z=average_scores['Average Reading Score'],
    colorscale=[
        [0, 'rgb(255,0,0)'], 
        [0.5, 'rgb(255,255,0)'], 
        [1, 'rgb(0,128,0)'] 
    ],
    autocolorscale=False,
    reversescale=False,
    marker_line_color='black',
    marker_line_width=0.5,
    colorbar_title='Reading Scores',
    visible=False,
    name='Reading Score',
    hoverinfo='location+z+name'
)

trace3 = go.Choropleth(
    locations=average_scores['iso_alpha'],
    z=average_scores['Average Science Score'],
    colorscale=[
        [0, 'rgb(255,0,0)'], 
        [0.5, 'rgb(255,255,0)'],
        [1, 'rgb(0,128,0)']
    ],
    autocolorscale=False,
    reversescale=False,
    marker_line_color='black',
    marker_line_width=0.5,
    colorbar_title='Science Scores',
    visible=False,
    name='Science Score',
    hoverinfo='location+z+name'
)

layout = go.Layout(
    title_text='Global Average Scores by Subject',
    title_x=0,
    title_y=1,
    title_xanchor='left',
    title_yanchor='top',
    geo=dict(
        showframe=True,
        showcoastlines=True,
        projection_type='natural earth'
    ),
    updatemenus=[
        dict(
            buttons=list([
                dict(
                    args=[{'visible': [True, False, False]}],
                    label='Mathematics',
                    method='update'
                ),
                dict(
                    args=[{'visible': [False, True, False]}],
                    label='Reading',
                    method='update'
                ),
                dict(
                    args=[{'visible': [False, False, True]}],
                    label='Science',
                    method='update'
                )
            ]),
            direction='down',
            pad={'r': 10, 't': 10},
            showactive=True,
            x=0.5,
            xanchor='center',
            y=1.15,
            yanchor='top'
        ),
    ]
)

fig = go.Figure(data=[trace1, trace2, trace3], layout=layout)

fig.show(renderer='browser')


## Interactive Choropleth Map Visualization

In this section, we delve into creating an interactive choropleth map to visually represent the average scores in mathematics, reading, and science across various countries using Plotly's `Choropleth` map capabilities.

### Implementation Details:

- **Choropleth Map Setup**:
  Each subject (math, reading, science) is represented by a separate `Choropleth` trace, which maps the ISO alpha-3 country codes to their respective average scores:
  
  - **Math Scores (`trace1`)**: Configured to display the average math scores. The color scale transitions from red for lower scores, through yellow for middle scores, to green for high scores, which provides a clear visual gradient of performance across countries.
  
  - **Reading Scores (`trace2`)**: Set up similarly but made invisible by default. This allows the user to switch to this view interactively without reloading the data.
  
  - **Science Scores (`trace3`)**: Also configured like the others but remains invisible initially, available for display via interactive controls.

- **Color Scale and Appearance**:
  All traces share a color scale that enhances the visual consistency across different subjects. The `autocolorscale` is set to `False` to use the custom color scale defined. The `reversescale` is also set to `False` to maintain the color direction from low to high scores.

- **Interactivity**:
  An interactive dropdown menu is implemented using `updatemenus`, allowing the viewer to switch between math, reading, and science scores. This interactivity is crucial for engaging users and enabling a comparative analysis without switching between different pages or views.

- **Geographical and Styling Options**:
  The layout of the map includes configurations like showing coastlines and a natural earth projection, which enhance geographical clarity. Additionally, titles and labels are carefully positioned to guide the viewer's understanding of the data being presented.

- **Rendering the Map**:
  The final step involves compiling the three traces into a single figure with the specified layout settings. This figure is then displayed in the browser using `fig.show(renderer='browser')` to ensure it renders interactively outside of the Jupyter environment.
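The dropdown buttons follow a regular pattern: button *i* makes trace *i* visible and hides the rest. As a sketch, the list can be generated instead of typed out. The helper name `make_buttons` is an assumption for illustration; the dictionaries it returns have the same keys as the ones written by hand in the cell above.

```python
from typing import Dict, List


def make_buttons(labels: List[str]) -> List[Dict]:
    """Build one 'update' button per label; button i shows only trace i."""
    buttons = []
    for i, label in enumerate(labels):
        # Visibility mask: True at position i, False everywhere else.
        visible = [j == i for j in range(len(labels))]
        buttons.append({
            "args": [{"visible": visible}],
            "label": label,
            "method": "update",
        })
    return buttons


buttons = make_buttons(["Mathematics", "Reading", "Science"])

for button in buttons:
    print(button["label"], button["args"][0]["visible"])
```

The returned list could be passed as the `buttons` entry of `updatemenus`. The order of the labels must match the order of the traces in the figure's `data` list.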

### Summary:

This coding setup provides a robust platform for visualizing complex geographic data in an interactive manner. By leveraging Plotly's advanced mapping functionalities, the map offers not only a dynamic data exploration tool but also a clear visual representation of educational performance metrics across different countries, facilitating global educational insights.

## Global Analysis of PISA 2022 Average Scores by Subject

### Overview
The data visualized represents the average scores in Mathematics, Reading, and Science from the PISA 2022 survey across various countries. Each score reflects the proficiency of 15-year-old students in these subjects, offering insights into the educational outcomes and the effectiveness of educational systems globally.

### Key Observations
- **High Performers**: Singapore (SGP), Macao (MAC), and Hong Kong (HKG) stand out with exceptionally high scores across all subjects, particularly in Mathematics and Science. This trend highlights the strong emphasis on STEM education in these regions, which is often supported by rigorous academic standards and substantial educational investments.
- **Subject Variability**: Countries like Australia (AUS), Canada (CAN), and Germany (DEU) show strong performances across all subjects but do not lead in any specific area. This indicates a well-rounded educational approach that does not heavily favor any particular subject.
- **Emerging Trends**: Countries like Vietnam (VNM) and Kazakhstan (KAZ) show promising results, especially in Science and Mathematics. This could be indicative of recent educational reforms or focused national strategies to enhance STEM education.
- **Underperformance Issues**: Countries such as the Dominican Republic (DOM), Guatemala (GTM), and Panama (PAN) lag significantly behind in all subjects. These lower scores might reflect broader socio-economic challenges, limited educational resources, or other systemic issues affecting educational outcomes.

### Detailed Analysis
- **Europe**: Most European countries like Estonia (EST), Finland (FIN), and the Netherlands (NLD) score well above the global average, particularly in Science. This success is often attributed to strong educational policies, early childhood education quality, and substantial public investment in education.
- **Asia**: Asian countries demonstrate high variability. While East Asia showcases top scores, Southeast Asian countries like Cambodia (KHM) and the Philippines (PHL) underperform, highlighting a significant disparity in educational quality and access within the continent.
- **Americas**: The United States (USA) shows strong Reading scores, aligning with the emphasis on literacy and critical thinking in its curriculum. Conversely, countries in South America like Brazil (BRA) and Colombia (COL) show moderate to low scores, suggesting room for curriculum development and educational reforms.
- **Challenges in the Middle East and Africa**: Countries like Qatar (QAT) and Saudi Arabia (SAU) have lower scores compared to global standards despite high economic outputs, pointing towards potential mismatches between educational system outputs and the needs of a modern economy.

### Implications for Policy and Practice
- **Focus on STEM**: Countries lagging in Mathematics and Science might consider revising their STEM curricula and teacher training programs to boost performance and ensure their youth are equipped for modern challenges.
- **Literacy and Reading**: Enhancing reading scores through comprehensive literacy programs from early grades could improve overall academic performance, as reading proficiency is foundational for success in other subjects.
- **Addressing Inequities**: Lower-performing countries need targeted interventions to address educational inequities, such as improving school infrastructure, increasing teacher-to-student ratios, and providing additional support for disadvantaged students.

### Conclusion
This analysis underscores the importance of continuous monitoring and evaluation of educational systems worldwide. By understanding the strengths and weaknesses revealed through such data, stakeholders can implement focused strategies that enhance educational outcomes and foster an environment where all students can excel in their academic endeavors.

In [None]:
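# Take an explicit copy so the column assignments below act on an
# independent DataFrame rather than on a filtered view of the original.
df = df[df['Country'] != 'QUR'].copy()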

df['Average Math Score'] = df[['Math_Score_PV1', 'Math_Score_PV2', 'Math_Score_PV3', 'Math_Score_PV4', 'Math_Score_PV5']].mean(axis=1)
df['Average Reading Score'] = df[['Reading_Score_PV1', 'Reading_Score_PV2', 'Reading_Score_PV3', 'Reading_Score_PV4', 'Reading_Score_PV5']].mean(axis=1)
df['Average Science Score'] = df[['Science_Score_PV1', 'Science_Score_PV2', 'Science_Score_PV3', 'Science_Score_PV4', 'Science_Score_PV5']].mean(axis=1)

## Integration of Key Calculations into Main DataFrame

### Exclusion of Undefined Country Entries
In our main DataFrame, we continue with data cleansing by excluding entries associated with the undefined country identifier 'QUR'. This measure is crucial for maintaining the dataset's integrity, ensuring that subsequent analyses are based on verifiable and relevant data.

### Implementation of Average Score Calculations
Previously, we had computed average scores for Math, Reading, and Science in a separate DataFrame to facilitate analysis. Recognizing the importance and utility of these calculations, we've now integrated this process directly into our main DataFrame. This integration enhances data accessibility and streamlines operations by allowing direct access to average scores during our analytical procedures.

By performing these operations in the main DataFrame, we ensure that all subsequent analyses utilize the most refined and accurate data representation. This approach minimizes redundancy and maximizes efficiency, enabling more straightforward and effective data manipulation and analysis going forward.
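The student-level averages are computed row-wise (`axis=1`), whereas the earlier country table averaged within countries first. The sketch below, on invented values, shows the row-wise step. It also checks that, when no plausible value is missing, the country mean of the per-student averages matches the country-level figure obtained the other way round.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["ESP", "ESP", "FIN", "FIN", "QUR"],
    "Math_Score_PV1": [480.0, 500.0, 510.0, 530.0, 400.0],
    "Math_Score_PV2": [490.0, 510.0, 500.0, 520.0, 410.0],
})
pv_cols = ["Math_Score_PV1", "Math_Score_PV2"]

# Filter, then copy, so the new column is written to an independent frame.
toy = toy[toy["Country"] != "QUR"].copy()

# Row-wise mean: one average score per student.
toy["Average Math Score"] = toy[pv_cols].mean(axis=1)

# Route A: country mean of the per-student averages.
route_a = toy.groupby("Country")["Average Math Score"].mean()

# Route B: country mean per plausible value, then mean across columns.
route_b = toy.groupby("Country")[pv_cols].mean().mean(axis=1)

print(toy[["Country", "Average Math Score"]])
print(route_a)
```

Both routes give 495 for ESP and 515 for FIN here. With missing plausible values the two can diverge, because `mean()` skips gaps differently in each order.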

In [None]:
sns.set(style="whitegrid")

variables = [
    'Test_Administration_Mode', 'Grade', 'Gender', 'Student_father’s_country_of_birth',
    'Mother_Education_Level', 'Father_Education_Level', 'Immigration_Status', 
    'Weekly_Study_Time', 'Math_Motivation'
]

fig, axes = plt.subplots(nrows=len(variables), ncols=1, figsize=(10, 45))

total_observations = df.shape[0]

for i, var in enumerate(variables):
    data = df[var].value_counts().sort_index()
    norm = plt.Normalize(data.min(), data.max())
    colors = plt.cm.viridis(norm(data.values))
    colors = colors.tolist() 

    bars = sns.barplot(x=data.index, y=data.values, palette=colors, ax=axes[i])
    axes[i].set_title(f'Distribution of {var}', fontsize=16)
    axes[i].set_ylabel('Counts', fontsize=14)
    axes[i].set_xlabel(var, fontsize=14)
    axes[i].tick_params(axis='x', rotation=45) 

    for bar in bars.patches:
        height = bar.get_height()
        percentage = (height / total_observations) * 100
        axes[i].text(bar.get_x() + bar.get_width() / 2., height,
                     f'{int(height)}\n({percentage:.2f}%)',
                     ha='center', va='bottom', color='black', fontsize=12)

plt.tight_layout()
plt.show()

## Data Visualization: Distribution of Key Variables in PISA 2022 Data

In this section of our analysis, we focus on visualizing the distribution of several key sociocultural and educational variables within the PISA 2022 dataset. By creating a series of bar plots, we aim to illustrate the frequency and proportions of various categories within each variable, providing insights into the demographic and educational landscapes of the students assessed.

### Code Explanation:

1. **Setting Aesthetic Style**:
   - We begin by setting the aesthetic style of our plots to "whitegrid" using `sns.set()`. This style provides a clean and clear background with grid lines that help in gauging the scale of the bar plots.

2. **Variables to Plot**:
   - We define a list named `variables` that contains the names of the sociocultural and educational variables we intend to explore. These variables include 'Test Administration Mode', 'Grade', 'Gender', 'Student father’s country of birth', 'Mother Education Level', 'Father Education Level', 'Immigration Status', 'Weekly Study Time', and 'Math Motivation'.

3. **Creating Figure for Subplots**:
   - A figure `fig` and a series of axes `axes` are created using `plt.subplots()`. This arrangement allows us to plot multiple subplots in a single column layout, with each subplot dedicated to a different variable. The figure size is set to 10 inches in width and 45 inches in height to accommodate all subplots vertically.

4. **Calculating Total Observations**:
   - `total_observations` is calculated using `df.shape[0]`, which gives the total number of rows (students) in our dataset. This value is the denominator for the percentages in the annotations. Because it counts every row, the percentages for a variable with missing values will sum to less than 100%.

5. **Iterating Over Variables**:
   - We loop through each variable using a for loop. Within each iteration, the following steps are executed for the current variable:
     - **Data Extraction**: The frequency of each category within the variable is obtained using `df[var].value_counts().sort_index()`. This function sorts the categories to maintain a consistent order across the plots.
     - **Normalization and Color Palette**: A normalization object `norm` is created using `plt.Normalize()`, which scales the counts to a range that is used to map colors. We then create a color palette `colors` that corresponds to the normalized data values, using a 'viridis' colormap which offers good color differentiation.
     - **Bar Plot Creation**: A bar plot is generated using `sns.barplot()`, with the categories on the x-axis and their counts on the y-axis. The `palette` argument is supplied with the list of colors created in the previous step.
     - **Plot Customization**: Titles and labels are set for each subplot to clearly identify the variable being displayed. The x-axis labels are rotated for better readability.
     - **Annotation**: Each bar is annotated with its count and the corresponding percentage of the total observations, providing a clear quantitative measure of each category's prevalence.

6. **Layout Adjustment**:
   - `plt.tight_layout()` is called to automatically adjust the subplots' parameters to give some padding and prevent overlap between them.

7. **Displaying the Plots**:
   - Finally, `plt.show()` is used to display the figure with all the subplots. This command ensures that all our visual configurations are rendered properly.
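
The count and percentage shown on each bar can be reproduced without any plotting. The sketch below uses a hypothetical toy column to show the difference between dividing by the total row count (as the annotation code does) and dividing by the number of non-missing values.

```python
import pandas as pd

# Toy column with one missing value (illustrative data only).
demo = pd.DataFrame({"Gender": [1, 2, 1, 1, None]})

# Category counts in a stable order; value_counts() skips missing values.
counts = demo["Gender"].value_counts().sort_index()

# Share of all rows (denominator includes the missing row).
share_of_rows = counts / len(demo) * 100

# Share of valid responses only.
share_of_valid = demo["Gender"].value_counts(normalize=True).sort_index() * 100

print(counts, share_of_rows, share_of_valid, sep="\n\n")
```

Which denominator is appropriate depends on whether missing responses should count toward the total.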

## Detailed Analysis of Variable Distributions with Extended Context

Our analysis delves deeper into the key sociodemographic and educational variables from the PISA 2022 dataset. Each variable's distribution provides a nuanced understanding of the educational landscape.

### Test Administration Mode
- **Distribution Breakdown**: 
  - `Paper` (1): 17,307 instances.
  - `Computer` (2): 454,102 instances.
- **Analysis**: The overwhelming majority of students took the computer-based test (454,102 vs. 17,307 on paper, roughly 96% vs. 4%). This reflects how the assessment was administered rather than a preference expressed by students, and it is consistent with the broader shift towards digital testing. The administration mode may itself influence test performance, potentially benefiting students who are more familiar with digital interfaces.

### Grade
- **Distribution Breakdown** (selected):
  - Grade 10: 278,958
  - Grade 9: 133,640
  - Other grades show varying counts, with significant placeholders for non-standard categories like 'Invalid' and 'Missing'.
- **Analysis**: The prevalence of students in Grades 9 and 10 is expected, since PISA samples 15-year-olds, who are most often enrolled in these grades. The presence of placeholder categories such as 'Invalid' and 'Missing' could indicate data quality issues or differences between educational systems (for example, differing grade structures across countries), and these records need to be handled before any comparative analysis.

### Gender
- **Distribution Breakdown**:
  - `Female` (1): 243,605
  - `Male` (2): 227,804
- **Analysis**: The two genders are represented in nearly equal numbers (about 51.7% female and 48.3% male). This balance means that gender-based comparisons of educational outcomes are not dominated by one group.

### Student Father’s Country of Birth
- **Distribution Breakdown**:
  - Country of test (1): 392,822
  - Other country (2): 75,865
  - Unknown or stateless (3): 2,722
- **Analysis**: Most students' fathers (about 83%) were born in the country of test. The sizeable share born in another country (about 16%) and the small 'Unknown or stateless' group (under 1%) can provide insights into the integration and educational challenges faced by students from diverse familial origins.

### Parent Education Level (Mother and Father)
- **Assumed Levels**:
  - 1: No formal education
  - 2: Primary education
  - 3: Some secondary
  - 4: Secondary completed
  - 5: Some college
  - 6: Bachelor's degree
  - 7: Master's degree
  - 8: Doctorate or higher
- **Analysis**: Parental education levels vary widely, with a higher frequency at the secondary and some college levels. Higher parental education often correlates with better educational support at home, which may in turn enhance students' access to education, attitudes, and performance.

### Immigration Status
- **Distribution Breakdown**:
  - Native (1): 414,227
  - Second-Generation (2): 29,932
  - First-Generation (3): 27,250
- **Analysis**: The predominance of native students in the dataset highlights a primarily stable resident student population, with fewer first and second-generation immigrants. This distribution is essential for analyzing the impact of immigration on educational outcomes, where native students might exhibit different performance patterns compared to their immigrant peers.

### Weekly Study Time
- **Definition**: Amount of time spent on homework per week in hours.
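- **Analysis**: The mode of 10 hours per week indicates a substantial time commitment to homework for the most common group. Since 10 is also the highest category in the coding used here, it may bundle together all students who study 10 hours or more. The mode alone says nothing about whether study time is related to outcomes; that question is examined in the correlation analysis. Variations in study time can reflect different educational pressures, personal discipline, and possibly the effectiveness of school homework policies.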

### Math Motivation
- **Distribution Breakdown**:
  - No motivation (0): 448,892
  - Has motivation (1): 22,517
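- **Analysis**: Only about 4.8% of students are flagged as having math motivation. A split this extreme is worth checking against how the variable was constructed (for example, whether missing responses were coded as 0) before interpreting it substantively. If it does reflect genuine responses, it points to a motivational gap that targeted interventions could address, as motivation is a key driver of engagement and achievement in mathematics.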


In [None]:
sns.set(style="whitegrid")

variables = [
    'Sense_of_Belonging', 'Bullying_Frequency', 
    'Feeling_of_Safety', 'Disciplinary_Climate', 
    'Economic_Social_Cultural_Status', 'Access_to_ICT_at_Home'
]

titles = [
    'Distribution of Sense of Belonging Scores',
    'Distribution of Bullying Frequency Scores',
    'Distribution of Feeling of Safety Scores',
    'Distribution of Disciplinary Climate Scores',
    'Distribution of Economic Social Cultural Status Scores',
    'Distribution of Access to ICT at Home Scores'
]

for variable, title in zip(variables, titles):
    fig = px.histogram(df, x=variable, title=title, nbins=30, labels={variable: 'Score'},
                       marginal='box', color_discrete_sequence=['#636EFA']) 
    fig.update_layout(xaxis_title=variable, yaxis_title='Frequency', title_x=0.5)
    fig.show()

### Code Explanation:

1. **Setting Aesthetic Style**:
   - The cell calls `sns.set()` with the "whitegrid" style, carried over from the previous cell. This setting affects only Matplotlib and seaborn figures; the Plotly histograms generated here use Plotly's own templates, so the call has no effect on them.

2. **Variables and Titles Setup**:
   - `variables`: A list containing the key variables of interest. These include psychosocial factors like 'Sense of Belonging', 'Bullying Frequency', 'Feeling of Safety', and educational environment indicators such as 'Disciplinary Climate', 'Economic Social Cultural Status', and 'Access to ICT at Home'.
   - `titles`: Corresponding titles for each histogram to be plotted. Each title is carefully crafted to clearly represent the variable being analyzed, enhancing the interpretability of the visualizations.

3. **Loop Through Variables for Plotting**:
   - We employ a loop to iterate over each variable and its corresponding title. This approach facilitates the efficient generation of multiple histograms in a systematic manner.
   - Within the loop, the following steps are executed:
     - **Histogram Generation**: Utilizing Plotly's `px.histogram`, we generate a histogram for each variable. The function is configured to:
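       - Relabel the variable as 'Score' via the `labels` argument. This label appears in the hover text, while the x-axis title itself is set to the column name by the later `update_layout` call.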
       - Set the number of bins to 30, optimizing the resolution of the distribution.
       - Include a box plot (`marginal='box'`) to provide a summary of the distribution's central tendency and variability.
       - Use a consistent color (`color_discrete_sequence=['#636EFA']`) to maintain visual coherence across all plots.
     - **Layout Customization**: Each plot's layout is updated to:
       - Set the titles of the x-axis and y-axis to enhance clarity.
       - Center the main title above the plot for balanced aesthetics.
     - **Interactive Display**: The `fig.show()` function is called to render each plot interactively, allowing users to engage directly with the data, such as zooming in on details or hovering to see specific values.

4. **Quantitative vs. Categorical Variables**:
   - In contrast to our earlier use of bar plots for categorical variables, these histograms are specifically designed for quantitative data. This distinction is crucial as histograms allow us to visualize the distribution of numerical data, providing insights into trends, spread, and outliers, which are not typically evident in categorical data visualization. Histograms are particularly useful for examining the distribution characteristics and central tendencies within continuous data sets, whereas bar plots are more suited to showing frequency counts in discrete categories.

5. **Execution and Display**:
   - The execution of the loop ensures that each variable is visualized in sequence, and each histogram is displayed immediately after creation, providing immediate feedback and visual data analysis.

## Detailed Analytical Insights into Psychosocial and Educational Variables

This comprehensive analysis extends our understanding of the distributions of key educational and psychosocial variables from the PISA 2022 dataset, focusing on deeper interpretations and potential educational implications.

### Sense of Belonging
- **Statistical Summary**:
  - Minimum: -3.5037, Maximum: 3.1727
  - Quartiles: Q1 = -0.6313, Median = -0.2934, Q3 = 0.3661
  - Outliers: Lower fence = -2.1267, Upper fence = 1.844
- **Distribution Insights**:
  - The distribution's central concentration between -2 and 2 suggests that most students feel a moderate sense of belonging, which is crucial for emotional and academic success. The sharp peak around 2.25 to 2.75 indicates a subset of students with exceptionally high belonging, possibly due to strong peer or school support systems. Conversely, the low distribution tails beyond -2 highlight a vulnerable group with a severe lack of belonging, potentially leading to feelings of isolation and disengagement from school activities. Schools might utilize this data to develop targeted interventions to enhance community building and inclusive practices, ensuring every student feels part of the school community.
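
The fences reported in these summaries are consistent with the usual 1.5 × IQR box-plot rule. For Sense of Belonging, IQR = 0.3661 − (−0.6313) = 0.9974, so the lower limit is −0.6313 − 1.5 × 0.9974 ≈ −2.127, and the reported lower fence of −2.1267 is the smallest observation at or above that limit. The sketch below illustrates the rule on toy data; `box_fences` is a hypothetical helper, and Plotly's quartile calculation may differ slightly from the pandas default used here.

```python
import pandas as pd

def box_fences(values: pd.Series) -> tuple[float, float]:
    """Return the whisker ends under the 1.5 * IQR rule."""
    v = values.dropna()
    q1, q3 = v.quantile(0.25), v.quantile(0.75)
    iqr = q3 - q1
    lo_limit, hi_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Fences are the most extreme observations still inside the limits.
    lower = v[v >= lo_limit].min()
    upper = v[v <= hi_limit].max()
    return float(lower), float(upper)

# Toy data with one low and one high outlier (illustrative only).
demo = pd.Series([-5.0, -1.0, -0.5, 0.0, 0.5, 1.0, 6.0])
lower_fence, upper_fence = box_fences(demo)
print(lower_fence, upper_fence)
```

Observations outside the two fences are the points a box plot draws individually as outliers.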

### Bullying Frequency
- **Statistical Summary**:
  - Minimum: -1.228, Maximum: 4.6939
  - Quartiles: Q1 = -1.228, Median = -0.391, Q3 = 0.5222
  - Outliers: Lower fence = -1.228, Upper fence = 3.1472
- **Distribution Insights**:
  - The pronounced peak in the lowest bin (-1.4 to -1.2) contains 212,788 students, the largest single group (roughly 45% of the sample), who report minimal bullying. This may reflect effective anti-bullying policies, underreporting, or both. The distribution from -0.6 to 2, with a peak around 0.2 to 0.4, indicates that a significant number of students still experience bullying at varying frequencies. The extreme values nearing 4.7 represent severe cases, which might include chronic bullying. These insights could help schools and policymakers strengthen their anti-bullying strategies, focusing on the areas and student groups where bullying is more prevalent.

### Feeling of Safety
- **Statistical Summary**:
  - Minimum: -2.7886, Maximum: 1.1687
  - Quartiles: Q1 = -0.756, Median = -0.039, Q3 = 1.1246
  - Outliers: Lower fence = -2.7886, Upper fence = 1.1687
- **Distribution Insights**:
  - The bimodal distribution, with significant accumulation at both extremes, suggests polarized student perceptions of safety. The upper fence equals the maximum and Q3 (1.1246) lies very close to it, so at least a quarter of students score at or near the top of the scale. This polarization may reflect differing school environments, where some schools are perceived as safe havens while others are viewed as lacking in safety measures. Such disparities could be worth investigating at the district level, with the aim of standardizing safety measures and improving student perceptions of safety across all schools.

### Disciplinary Climate
- **Statistical Summary**:
  - Minimum: -2.6027, Maximum: 2.2258
  - Quartiles: Q1 = -0.5304, Median = -0.0611, Q3 = 0.6888
  - Outliers: Lower fence = -2.356, Upper fence = 2.2258
- **Distribution Insights**:
  - The roughly normal distribution suggests that most students perceive the disciplinary climate as close to average, although the range of perceptions is quite broad. This variance might reflect differences in the implementation of disciplinary policies across schools or regions. The interpretation of the low-end outliers depends on the direction of the index: in PISA's disciplinary climate index, higher values generally indicate a more orderly classroom, so very low values most likely point to frequently disrupted lessons rather than excessively harsh discipline. Either way, these environments could affect student morale and engagement negatively, and disciplinary measures that are both firm and fair could help improve perceptions.

### Economic Social Cultural Status
- **Statistical Summary**:
  - Minimum: -6.3958, Maximum: 7.38
  - Quartiles: Q1 = -0.9944, Median = -0.1177, Q3 = 0.6313
  - Outliers: Lower fence = -3.4328, Upper fence = 3.0654
- **Distribution Insights**:
  - The socioeconomic status exhibits a gradual increase in frequency towards the median, with a notable peak in higher socioeconomic brackets. This trend suggests that while a significant portion of students come from middle to upper socioeconomic backgrounds, there exists a small but significant population at the extreme low end, potentially lacking basic educational resources. These insights are crucial for policymakers to allocate resources effectively, ensuring that lower socioeconomic groups receive the necessary support to bridge educational disparities.

### Access to ICT at Home
- **Statistical Summary**:
  - Most observations are concentrated around 0-0.5.
- **Distribution Insights**:
  - Assuming this is a standardized index, values near zero correspond to roughly average access rather than to a particular level of access in absolute terms. The concentration near zero therefore suggests that most students have similar, broadly adequate access to ICT, which is essential for engaging with modern educational content and methods. However, the minor peaks at extreme negative values highlight a digital divide, where a small group of students lacks adequate access. This situation calls for targeted technological interventions to ensure all students can benefit from digital learning opportunities.


In [None]:
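# Drop placeholder grade codes (96, 98, 99); .copy() avoids a
# SettingWithCopyWarning on the column assignments below.
df = df[~df['Grade'].isin([96, 98, 99])].copy()

# Ordered categorical preserves grade progression; integer codes run from 0 to 6.
df['Grade'] = pd.Categorical(df['Grade'], categories=[7, 8, 9, 10, 11, 12, 13], ordered=True)
df['Grade_code'] = df['Grade'].cat.codes

# Recode paper (1) / computer (2) to 0 / 1; any other value becomes NaN.
df['Test_Administration_Mode'] = df['Test_Administration_Mode'].map({1: 0, 2: 1})

df = pd.get_dummies(df, columns=['Student_father’s_country_of_birth'], prefix='Father_Country')

# Recode female (1) / male (2) to 0 / 1.
df['Gender'] = df['Gender'].replace({1: 0, 2: 1})

education_levels = ['Mother_Education_Level', 'Father_Education_Level']
for level in education_levels:
    df = pd.get_dummies(df, columns=[level], prefix=level)

df = pd.get_dummies(df, columns=['Immigration_Status'], prefix='Immig_Status')

df['Weekly_Study_Time'] = pd.Categorical(df['Weekly_Study_Time'], categories=range(11), ordered=True)
df['Weekly_Study_Time_code'] = df['Weekly_Study_Time'].cat.codes

# cat.codes marks values outside the category list (including NaN) as -1;
# convert those to NaN so they are not treated as real values in later statistics.
code_cols = ['Grade_code', 'Weekly_Study_Time_code']
df[code_cols] = df[code_cols].replace(-1, np.nan)

df['Math_Motivation'] = df['Math_Motivation'].astype(int)

print(df.head())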

## Detailed Data Preparation and Transformation Techniques

In our analysis of the PISA 2022 dataset, we undertake several transformations to refine the dataset for accurate and meaningful insights. Each transformation uses specific functions from the Python ecosystem, mainly pandas, which is a powerful library for data manipulation and analysis. Below is a detailed explanation of each transformation step, the functions used, and their operational mechanisms.

### 1. Exclusion of Invalid Grade Levels

- **Objective**: Remove records where the `Grade` is marked as invalid, not applicable, or missing.
- **Function Used**: `Series.isin()`
  - **Mechanics**: `isin()` checks each element of the `Grade` column against a list of values. It returns a Boolean Series indicating whether each element matches any value in the provided list.
  - **Code**: `df[~df['Grade'].isin([96, 98, 99])]`
  - **Explanation**: We use `~` to negate the Boolean Series, effectively filtering out any rows where `Grade` is 96, 98, or 99. This ensures our analysis only includes valid and relevant educational data.
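
A small illustration of this filter on toy data (the values are made up for the example):

```python
import pandas as pd

# Toy grades including two placeholder codes (96 and 99).
demo = pd.DataFrame({"Grade": [9, 10, 96, 99, 11]})

# Boolean Series: True where the grade is one of the placeholder codes.
is_placeholder = demo["Grade"].isin([96, 98, 99])

# Keep only the rows that are NOT placeholders; copy to get an independent frame.
valid = demo[~is_placeholder].copy()

print(valid)
```

Taking a `.copy()` of the filtered frame is a common way to avoid pandas warnings when new columns are assigned to it afterwards.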

### 2. Reformatting Grade as a Categorical Variable

- **Objective**: Treat `Grade` as an ordered categorical variable to preserve educational progression.
- **Function Used**: `pandas.Categorical`
  - **Mechanics**: This function converts a column to a categorical type with an optional ordered parameter. When `ordered=True`, the categorical data follows a specific order.
  - **Code**: `df['Grade'] = pd.Categorical(df['Grade'], categories=[7, 8, 9, 10, 11, 12, 13], ordered=True)`
  - **Explanation**: By setting grades in an ordered list, the model can understand the progression from lower to higher grades, which is crucial for trend analysis and ordinal comparisons.
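
The following sketch, on hypothetical toy values, shows what the ordered categorical provides: order-aware comparisons and integer codes. It also shows that a value outside the declared categories (here 14) becomes missing and receives the code -1.

```python
import pandas as pd

# Toy grades; 14 is deliberately outside the declared categories.
grades = pd.Series([9, 10, 7, 13, 14])

ordered = pd.Categorical(grades, categories=[7, 8, 9, 10, 11, 12, 13], ordered=True)

# Order-aware comparison; missing entries compare as False.
below_grade_10 = ordered < 10

# Integer position of each value in the category list; -1 marks missing.
codes = pd.Series(ordered).cat.codes

print(below_grade_10)
print(codes.tolist())
```

Because -1 is a sentinel rather than a real grade, it is usually worth converting it to a missing value before computing statistics on the codes.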

### 3. Recoding Variables for Simplification and Analysis

- **Test Administration Mode**:
  - **Function Used**: `map`
    - **Mechanics**: `map` is used to replace specified values in a Series based on a mapping dictionary.
    - **Code**: `df['Test_Administration_Mode'] = df['Test_Administration_Mode'].map({1: 0, 2: 1})`
    - **Explanation**: This transformation recodes the testing mode from 1 (paper) and 2 (computer) to 0 and 1, simplifying the data and making it more suitable for binary logistic regression or other binary classification methods.

- **Gender Recoding**:
  - **Function Used**: `replace`
    - **Mechanics**: `replace` substitutes a set of values with another set in a pandas Series.
    - **Code**: `df['Gender'] = df['Gender'].replace({1: 0, 2: 1})`
    - **Explanation**: Similar to `Test_Administration_Mode`, this recodes gender to a binary format for straightforward binary comparisons in analysis.
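
The two methods behave differently for values that are not in the mapping, which matters if a column contains unexpected codes. A brief sketch on toy data (the value 9 is a made-up stray code):

```python
import pandas as pd

# Toy column with one value (9) that is not covered by the mapping.
mode = pd.Series([1, 2, 2, 9])

mapped = mode.map({1: 0, 2: 1})        # unmapped values become NaN
replaced = mode.replace({1: 0, 2: 1})  # unmapped values are left unchanged

print(mapped.tolist())
print(replaced.tolist())
```

`map` therefore surfaces unexpected codes as missing values, while `replace` silently keeps them.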

### 4. One-Hot Encoding for Categorical Variables

- **Objective**: Transform categorical variables into a format that can be provided to machine learning algorithms.
- **Function Used**: `pandas.get_dummies`
  - **Mechanics**: This function converts categorical variable(s) into dummy/indicator variables. It creates one new column for each unique value in the original column and removes the original column. The new columns hold 1s and 0s in older pandas versions and `True`/`False` by default in pandas 2.x; passing `dtype=int` yields 1s and 0s in either case.
  - **Code Examples**:
    - `df = pd.get_dummies(df, columns=['Student_father’s_country_of_birth'], prefix='Father_Country')`
    - `df = pd.get_dummies(df, columns=['Immigration_Status'], prefix='Immig_Status')`
  - **Explanation**: For variables like `Student_father’s_country_of_birth` and `Immigration_Status`, transforming them into dummy variables allows models to "see" each category as a separate feature, improving the model's ability to learn patterns based on these categorical inputs.
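
A minimal example of the resulting columns, using toy values and requesting integer indicators explicitly:

```python
import pandas as pd

# Toy immigration status codes (1 = native, 2 = second-generation, 3 = first-generation).
demo = pd.DataFrame({"Immigration_Status": [1, 2, 3, 1]})

dummies = pd.get_dummies(
    demo,
    columns=["Immigration_Status"],
    prefix="Immig_Status",
    dtype=int,  # force 0/1 instead of the boolean default in pandas 2.x
)

print(dummies)
```

For linear models, passing `drop_first=True` is a common way to avoid perfectly collinear indicator columns.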

### 5. Handling Numerical Encodings and Transformations

- **Weekly Study Time and Math Motivation**:
  - **Objective**: Convert categorical ordinal data into codes that can be used in mathematical models.
  - **Functions Used**: `pandas.Categorical` and `astype`
  - **Code**:
    - `df['Weekly_Study_Time'] = pd.Categorical(df['Weekly_Study_Time'], categories=range(11), ordered=True)`
    - `df['Weekly_Study_Time_code'] = df['Weekly_Study_Time'].cat.codes`
    - `df['Math_Motivation'] = df['Math_Motivation'].astype(int)`
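  - **Explanation**: `Weekly_Study_Time` is first treated as an ordered categorical variable. Then `cat.codes` transforms these categories into integer codes that preserve the order, which is useful for regression analyses where ordinal relationships matter. Note that `cat.codes` assigns -1 to any value outside the category list, including missing values, so these codes should be converted to NaN before computing statistics such as correlations. `Math_Motivation` is simply converted to an integer format to ensure a consistent data type; `astype(int)` raises an error if the column contains missing values.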

These data preparation steps are vital for the integrity and effectiveness of the subsequent analysis. By ensuring each variable is appropriately cleaned, formatted, and transformed, we can conduct more reliable and meaningful statistical tests and machine learning analyses. This meticulous preparation also helps to uncover deeper insights into how different factors influence educational outcomes, guiding policy decisions and educational strategies.

In [None]:
quantitative_columns = ['Sense_of_Belonging', 'Bullying_Frequency', 
                        'Feeling_of_Safety', 'Disciplinary_Climate', 
                        'Economic_Social_Cultural_Status', 'Access_to_ICT_at_Home', 'Grade_code',
                        'Test_Administration_Mode', 'Gender', 'Weekly_Study_Time_code', 'Math_Motivation',
                        'Average Math Score', 'Average Reading Score', 'Average Science Score'] 

data_for_correlation = df[quantitative_columns]

correlation_matrix = data_for_correlation.corr().round(2)

fig = ff.create_annotated_heatmap(
    z=correlation_matrix.values,
    x=list(correlation_matrix.columns),
    y=list(correlation_matrix.index),
    annotation_text=correlation_matrix.values,
    colorscale='Viridis',
    showscale=True
)

fig.update_layout(
    title_text='Interactive Correlation Matrix of Average Scores and Various Factors',
    title_x=0.5,
    # The axes are categorical, so the default ticks already sit at the cell centers
    # (manual tickvals offset by 0.5 would shift the labels between cells).
    # Only the x-axis labels need moving from the top to the bottom.
    xaxis=dict(side='bottom'),
    margin=dict(l=200, r=200, t=50, b=150)
)

fig.show(renderer='browser')

### 1. Selecting Quantitative Variables

- **Objective**: Identify and gather all relevant quantitative variables to analyze their interrelationships.
- **Function Used**: Direct column selection from pandas DataFrame.
  - **Mechanics**: This approach involves specifying a list of column names to keep in the DataFrame.
  - **Code Example**:
    ```python
    quantitative_columns = ['Sense_of_Belonging', 'Bullying_Frequency', ...]
    ```
  - **Explanation**: This list includes both direct performance measures like average scores and related sociocultural factors, ensuring a comprehensive dataset for correlation analysis.

### 2. Creating a Focused DataFrame

- **Objective**: Filter the dataset to only include the selected quantitative variables.
- **Function Used**: Indexing with `[]` on pandas DataFrame.
  - **Mechanics**: Uses the list of column names to extract only those columns from the main DataFrame.
  - **Code Example**:
    ```python
    data_for_correlation = df[quantitative_columns]
    ```
  - **Explanation**: This step narrows down the dataset to the variables of interest, simplifying the subsequent analysis steps and focusing on the variables that are crucial for understanding correlations.

### 3. Computing the Correlation Matrix

- **Objective**: Calculate the correlation coefficients between the selected variables to identify any significant relationships.
- **Function Used**: `corr()` method from pandas.
  - **Mechanics**: Computes Pearson correlation coefficients for each pair of variables within the DataFrame.
  - **Code Example**:
    ```python
    correlation_matrix = data_for_correlation.corr().round(2)
    ```
  - **Explanation**: The correlation coefficients quantify the strength and direction of the linear relationship between each pair of variables. They provide a starting point for identifying trends, but a correlation on its own does not establish a causal relationship.
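
To make the quantity concrete, the sketch below computes the Pearson coefficient by hand (covariance divided by the product of the standard deviations, written in terms of deviations from the mean) and compares it with `Series.corr`. The data and the helper `pearson_r` are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: a socioeconomic index and a score for four students.
demo = pd.DataFrame({
    "escs": [-1.0, 0.0, 0.5, 1.5],
    "score": [420.0, 470.0, 480.0, 540.0],
})

def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson correlation from deviations around the means."""
    xd = x - x.mean()
    yd = y - y.mean()
    return float((xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum()))

r_manual = pearson_r(demo["escs"], demo["score"])
r_pandas = demo["escs"].corr(demo["score"])

print(round(r_manual, 4), round(r_pandas, 4))
```

Both values agree up to floating-point error, since pandas uses the Pearson method by default.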

### 4. Visualizing the Correlation Matrix

- **Objective**: Create an interactive heatmap to visually represent the correlation coefficients.
- **Function Used**: `create_annotated_heatmap()` from Plotly.
  - **Mechanics**: Generates a heatmap with annotations for each cell showing the correlation values.
  - **Adjustments for Enhanced Usability**:
    - **X-axis Labels**: Positioned at the bottom of the heatmap for better readability.
    - **Margins**: Adjusted to ensure that labels and titles are not cut off and are clearly visible.
  - **Explanation**: This visualization not only makes the correlations easier to understand at a glance but also allows for interactive exploration, which can be particularly useful in presentations or interactive reports.

### 5. Displaying the Interactive Heatmap

- **Objective**: Render the heatmap in a browser to enable dynamic exploration of the data.
- **Function Used**: `show()` method from Plotly.
  - **Mechanics**: Renders the Plotly figure in the specified environment, here set to a web browser for interactivity.
  - **Explanation**: Displaying the heatmap in a browser allows stakeholders to interact with the data, exploring different variables and their correlations in detail, which can aid in hypothesis formation and decision-making processes.



## Detailed Analytical Report on Correlation Matrix

This report delves into the complex interrelations among various educational and sociocultural factors within the PISA 2022 dataset. By examining the correlation coefficients between factors such as sense of belonging, bullying frequency, and educational scores, we aim to uncover patterns that could inform targeted educational strategies.

### Key Observations from the Correlation Matrix

#### 1. Sense of Belonging
- **Correlation with Other Factors**:
  - Positively correlated with `Feeling_of_Safety` (0.32) and `Disciplinary_Climate` (0.15).
  - Negatively correlated with `Bullying_Frequency` (-0.26).
- **Implications**: The moderate positive correlation with `Feeling_of_Safety` (0.32) indicates that students who report a greater sense of belonging also tend to report feeling safer. The negative correlation with bullying suggests that strengthening the sense of belonging could be one strategic focus for reducing bullying, although the correlation alone does not show which factor drives the other.

#### 2. Bullying Frequency
- **Correlation with Other Factors**:
  - Negatively correlated with `Sense_of_Belonging` (-0.26) and `Feeling_of_Safety` (-0.16).
- **Implications**: These negative correlations are consistent with bullying undermining both the sense of belonging and the feeling of safety, and they support the case for robust anti-bullying programs.

#### 3. Academic Performance (Math, Reading, Science Scores)
- **Correlations**:
  - All three scores show high inter-correlation (0.86 to 0.92), indicating consistent performance across subjects.
  - Moderate positive correlation with `Economic_Social_Cultural_Status` (about 0.45), the largest association between the scores and any background factor in the matrix.
- **Implications**: The link between socioeconomic status and scores underscores the importance of addressing economic disparities to improve educational outcomes.

#### 4. Economic Social Cultural Status
- **Correlation with Performance**:
  - Moderately correlated with `Average Math Score` (0.47), `Average Reading Score` (0.44), and `Average Science Score` (0.45).
- **Implications**: This suggests that socioeconomic factors are closely tied to educational success across different subjects, potentially guiding policy to focus on socio-economic support programs.

#### 5. Gender and Academic Performance
- **Correlation**:
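  - Slight positive correlation with `Average Math Score` (0.07) and a slight negative correlation with `Average Reading Score` (-0.09).
- **Implications**: Because `Gender` is coded 0 for female and 1 for male, these signs indicate that male students score marginally higher in math and female students marginally higher in reading. Both differences are small.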
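
When one variable is a 0/1 indicator, the sign of the Pearson correlation simply reflects which group has the higher mean. The toy example below (made-up reading scores, with `Gender` coded 0 = female and 1 = male as in the recoding step) illustrates this.

```python
import pandas as pd

# Toy data: reading scores for three female (0) and three male (1) students.
demo = pd.DataFrame({
    "Gender": [0, 0, 0, 1, 1, 1],
    "reading": [520.0, 500.0, 510.0, 480.0, 495.0, 490.0],
})

group_means = demo.groupby("Gender")["reading"].mean()
r = demo["Gender"].corr(demo["reading"])

print(group_means)
print(round(r, 3))
```

Here the group coded 1 has the lower mean, so the correlation is negative; the magnitude depends on how large the gap is relative to the overall spread of scores.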

### Subtle Yet Insightful Observations
- **Access to ICT at Home**: This variable is only weakly correlated with academic scores, so access to technology by itself explains little of the variation in performance, even though it remains part of the educational infrastructure.
- **Weekly Study Time**: Exhibits a very slight negative correlation with academic performance. This could reflect the quality of study time mattering more than its quantity, students who struggle spending longer on homework, or artifacts of how the variable is coded; the correlation alone cannot distinguish between these explanations.



In [None]:
variables_to_plot = ['Bullying_Frequency', 'Feeling_of_Safety', 'Disciplinary_Climate', 
                     'Economic_Social_Cultural_Status', 'Grade_code', 'Test_Administration_Mode']

for var in variables_to_plot:
    fig = px.scatter(
        df, x='Average Math Score', y=var,
        trendline="ols", 
        title=f'Correlation between Average Math Score and {var}',
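        # Axis labels default to the column names, so no `labels` mapping is needed.
        width=800, height=400
    )

    # Series.corr(), .max() and .min() skip NaN; the built-in max()/min() do not.
    corr = df['Average Math Score'].corr(df[var])
    fig.add_annotation(
        x=df['Average Math Score'].max(), y=df[var].min(),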
        text=f'Pearson Corr: {corr:.2f}',
        showarrow=False,
        yshift=10,
        bgcolor="white",
        bordercolor="black",
        borderpad=4
    )
    
    fig.update_traces(line=dict(color='red'))
    
    fig.show(renderer='browser')

In [None]:
for var in variables_to_plot:
    fig = px.scatter(
        df, x='Average Reading Score', y=var,
        trendline="ols", 
        title=f'Correlation between Average Reading Score and {var}',
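        # Axis labels default to the column names, so no `labels` mapping is needed.
        width=800, height=400
    )

    # Series.corr(), .max() and .min() skip NaN; the built-in max()/min() do not.
    corr = df['Average Reading Score'].corr(df[var])
    fig.add_annotation(
        x=df['Average Reading Score'].max(), y=df[var].min(),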
        text=f'Pearson Corr: {corr:.2f}',
        showarrow=False,
        yshift=10,
        bgcolor="white",
        bordercolor="black",
        borderpad=4
    )
    
    fig.update_traces(line=dict(color='red'))
    
    fig.show(renderer='browser')

In [None]:
for var in variables_to_plot:
    fig = px.scatter(
        df, x='Average Science Score', y=var,
        trendline="ols", 
        title=f'Correlation between Average Science Score and {var}',
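        # Axis labels default to the column names, so no `labels` mapping is needed.
        width=800, height=400
    )

    # Series.corr(), .max() and .min() skip NaN; the built-in max()/min() do not.
    corr = df['Average Science Score'].corr(df[var])
    fig.add_annotation(
        x=df['Average Science Score'].max(), y=df[var].min(),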
        text=f'Pearson Corr: {corr:.2f}',
        showarrow=False,
        yshift=10,
        bgcolor="white",
        bordercolor="black",
        borderpad=4
    )
    
    fig.update_traces(line=dict(color='red'))
    
    fig.show(renderer='browser')

## Detailed Analytical Approach to Visualizing Educational Data

### Introduction

Following an assessment of the correlation matrix, which compared average scores in mathematics, reading, and science against a broad set of sociocultural and educational variables, we selected a subset of variables for deeper visual analysis. The variables chosen ('Bullying_Frequency', 'Feeling_of_Safety', 'Disciplinary_Climate', 'Economic_Social_Cultural_Status', 'Grade_code', and 'Test_Administration_Mode') show the largest correlations with average academic scores, although most of these correlations are weak.

For most of these variables the absolute correlation lies between 0.1 and 0.2, indicating a weak but consistent relationship with academic performance. The exception is 'Economic_Social_Cultural_Status', whose correlation coefficient exceeds 0.4. This moderate correlation points to a substantial association between socioeconomic background and educational outcomes, although a correlation alone does not establish a causal effect. Despite the weakness of most correlations, our analysis focuses on these variables to explore potential cumulative effects that might inform educational strategies and policy-making.
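
To put these coefficient sizes in perspective, the sketch below computes a Pearson coefficient from its definition and converts a coefficient into the share of variance it accounts for in a simple linear relationship. The helper names `pearson_r` and `shared_variance` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def shared_variance(r):
    """Share of variance accounted for by a simple linear relationship (r squared)."""
    return r ** 2

# Coefficient magnitudes quoted in the text above
for r in (0.10, 0.20, 0.47):
    print(f"r = {r:.2f} -> r^2 = {shared_variance(r):.3f}")
```

A coefficient of 0.1 to 0.2 therefore corresponds to roughly 1% to 4% of shared variance, while 0.47 corresponds to roughly 22%.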

### Visual Analysis with Scatter Plots and Trend Lines

This section of our analysis uses scatter plots to visually explore the relationships between students' average scores and the selected variables. By incorporating ordinary least squares (OLS) trend lines and Pearson correlation coefficients, we aim to illustrate and quantify these relationships, providing insights that could influence educational practices and interventions.

### Process and Technical Details

#### Setting Up the Visualization Environment
- **Objective**: Configure the aesthetic style for the plots to ensure clarity and visual appeal.
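- **Method**: Use `sns.set()` from the seaborn library with the `style` parameter set to "whitegrid". This style provides a neutral background with grid lines that help in interpreting the data points and trend lines. Note that this setting applies only to Matplotlib and seaborn figures; the Plotly figures below are styled through their own layout and template options.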

#### Data Visualization with Plotly Express
- **Objective**: Create scatter plots to examine the correlation between average scores in math, reading, and science, and selected variables.
- **Tools Used**: `px.scatter` from Plotly Express.
  - **Functionality**: This function generates scatter plots with options to add ordinary least squares (OLS) trend lines, customize titles, labels, and dimensions.
  - **Customization**: Each plot is configured to display the average score (math, reading, or science) on the X-axis against one of the selected variables on the Y-axis.

#### Enhancing Plots with Trend Lines and Annotations
- **Objective**: Incorporate trend lines to identify linear relationships and annotate plots with correlation coefficients.
- **Trend Lines**: Added using the `trendline` parameter set to `"ols"` within `px.scatter`, which computes the line using the method of ordinary least squares.
- **Annotations**:
  - **Purpose**: Display the Pearson correlation coefficient on each plot to quantify the strength of the relationship.
  - **Implementation**: Use `fig.add_annotation()` to place a text box on each plot, showing the correlation coefficient rounded to two decimal places.
  - **Aesthetic Settings**: Annotations are formatted with a white background and black border to enhance readability.
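
The trend line and the annotated coefficient are two views of the same linear relationship: the OLS slope equals the Pearson coefficient multiplied by the ratio of the standard deviations. The following sketch illustrates this with NumPy on synthetic stand-in data (not PISA values); `ols_line` is a hypothetical helper, and Plotly itself is not needed for the check.

```python
import numpy as np

def ols_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)

# Synthetic stand-in data: a score on the x-axis, a standardized index on the y-axis
rng = np.random.default_rng(0)
score = rng.normal(470, 90, size=500)
status = 0.005 * (score - 470) + rng.normal(0, 1, size=500)

slope, intercept = ols_line(score, status)
r = np.corrcoef(score, status)[0, 1]

# The OLS slope can be recovered from the correlation and the two standard deviations
slope_from_r = r * status.std() / score.std()
print(f"slope = {slope:.5f}, r * sd_y / sd_x = {slope_from_r:.5f}")
```

Because the score is on the x-axis in these plots, a small slope in absolute terms is expected even when the correlation is moderate: the y-variables are indices with a much smaller scale than the scores.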

#### Plot Customizations and Display
- **Objective**: Ensure that the plots are easy to read and interact with.
- **X-axis Adjustments**:
  - **Labels at the Bottom**: The x-axis labels are positioned at the bottom of the plot for better visibility, achieved by setting the `side` attribute of `xaxis` to `"bottom"`.
- **Dimensions and Margins**:
  - **Consistency**: All plots are set to a width of 800 and a height of 400 for uniformity.
  - **Margin Adjustment**: Margins are specifically adjusted to ensure that axis labels and annotations are not truncated, providing a clean and professional presentation.

#### Interactive Rendering
- **Objective**: Allow interactive exploration of the plots.
- **Execution**: Utilize `fig.show(renderer='browser')` to render each plot in a web browser. This method enables interactive features such as zooming and panning, which are helpful for detailed examination of the data points and trends.

### Conclusion

By using scatter plots with trend lines and annotations, we effectively visualize and quantify the relationships between average scores and various influential factors. This approach not only aids in identifying significant correlations but also enhances our understanding of how different variables may impact educational outcomes.

### Comprehensive Analysis of Scatter Plots: Correlating Academic Performance with Sociocultural Variables

This analysis endeavors to uncover the intricate relationships between students' academic performance in mathematics, reading, and science with various sociocultural factors. By scrutinizing scatter plots and employing Pearson correlation coefficients, we aim to derive actionable insights and hypothesize the underlying dynamics influencing educational outcomes.

#### 1. General Observation Across All Variables

All correlations across the different variables relative to average scores in mathematics, reading, and science demonstrate nearly identical trends. This consistency suggests that the mechanisms influencing student performance are uniformly applicable across these academic disciplines.

- **Hypothesis**: The uniformity in correlation patterns across subjects may indicate that certain intrinsic and extrinsic educational factors consistently affect student learning and performance irrespective of the subject matter. This could imply a standardized impact of the educational environment and student behavior on academic success.

#### 2. Correlation with Bullying Frequency

The plots show a broad dispersion of data points and a weak inverse correlation between academic scores and bullying frequency. As average scores increase, bullying frequency declines slightly, as reflected in the Pearson coefficients (-0.11 for math and reading, -0.10 for science).

- **Interesting Observation**: At high academic performance levels (scores above 700), bullying frequency significantly drops. This could suggest that higher academic environments might be more conducive to positive social interactions or that high-performing students are less exposed or susceptible to bullying.
- **Analytical Insight**: The decrease in bullying with higher academic scores might also reflect the presence of a supportive school environment that fosters both academic excellence and positive social behavior, emphasizing the role of comprehensive educational strategies.

#### 3. Feeling of Safety

The relationship between students' feelings of safety and their academic scores is subtle but noticeable, with Pearson correlations indicating a weak positive association (0.15 for math, 0.11 for reading, 0.13 for science).

- **Analysis**: Higher feelings of safety might contribute to a learning environment where students can focus better, leading to improved academic performance. The difference in score ranges between the lower and upper safety levels underscores the potential academic impact of perceived safety, suggesting that ensuring a safe school environment could be key to boosting academic outcomes.

#### 4. Disciplinary Climate

Data points are densely packed without a clear visual pattern, yet the correlation is consistent across subjects at approximately 0.13, suggesting a weak positive association between disciplinary climate and academic performance.

- **Interpretation**: A positive disciplinary climate might subtly enhance students' ability to engage with academic content without major distractions, although the effect is not strongly marked. This reinforces the idea that while necessary, disciplinary measures alone are not sufficient to drastically influence academic outcomes but must be part of a broader array of educational improvements.

#### 5. Economic Social Cultural Status

The correlation here is more pronounced (0.47 for math, 0.44 for reading, 0.45 for science), with scores rising alongside socioeconomic status. This is a moderate positive correlation and the strongest among the selected variables.

- **Deep Dive**: This correlation likely reflects the advantages associated with higher socioeconomic status, including access to better educational resources, more supportive learning environments, and broader life experiences that contribute to educational success. The points follow the regression line more closely than for the other variables, although considerable scatter remains, as expected for coefficients of this size.

#### 6. Grade Level (Grade_code)

There is a small positive correlation (0.14 for math, 0.15 for reading, 0.14 for science) observed, which is logical but not as strong as one might expect.

- **Analytical Insight**: Considering that the dataset evaluates students of the same age across different educational systems with varying grade structures, the correlation isn't stark. However, the positive trend does suggest that higher grade levels generally correspond to better academic performance, possibly due to the cumulative effect of continued education.

#### 7. Test Administration Mode

This variable shows a binary distribution (0 for paper-based, 1 for computer-based tests) with Pearson correlations indicating a slight advantage for computer-based test takers (0.13 for math, 0.10 for reading, 0.12 for science).

- **In-depth Analysis**: The score range for paper-based tests is narrower than for computer-based tests, so students taking computer-based tests not only score higher on average but also show a broader range of outcomes. This could be attributed to factors like familiarity with digital tools or the interactive nature of computer-based tests. However, the narrower range may simply reflect a smaller number of paper-based observations, and because the administration mode is generally set at the country level, the difference may also reflect differences between education systems rather than the test format itself.
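
For a 0/1 variable such as the administration mode, the Pearson coefficient is a rescaled difference between the two group means (the point-biserial correlation). The sketch below checks this identity on synthetic data; the 90% share of computer-based tests and the 30-point gap are made-up values chosen only for illustration, and `point_biserial` is a hypothetical helper.

```python
import numpy as np

def point_biserial(binary, values):
    """Correlation between a 0/1 variable and a numeric one, from the group means."""
    binary = np.asarray(binary, dtype=float)
    values = np.asarray(values, dtype=float)
    p = binary.mean()  # share of observations coded 1
    mean_diff = values[binary == 1].mean() - values[binary == 0].mean()
    # Population standard deviation (ddof=0) keeps the identity exact
    return float(mean_diff * np.sqrt(p * (1 - p)) / values.std())

rng = np.random.default_rng(1)
mode = (rng.random(1000) < 0.9).astype(int)          # 1 = computer-based (hypothetical share)
score = 440 + 30 * mode + rng.normal(0, 90, size=1000)

r_pb = point_biserial(mode, score)
r = np.corrcoef(mode, score)[0, 1]
print(f"point-biserial = {r_pb:.4f}, Pearson = {r:.4f}")
```

The identity also shows why an unbalanced split keeps the coefficient small: the factor `sqrt(p * (1 - p))` shrinks as one group dominates, even if the gap between group means is sizeable.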

### Conclusion

This detailed exploration into how different sociocultural variables correlate with academic performance provides critical insights into potential educational interventions. By understanding these dynamics, educators and policymakers can tailor strategies that not only focus on improving academic instruction but also address the broader sociocultural factors that significantly impact student learning and well-being. This analysis underscores the complexity of education systems and the multifaceted approach needed to foster an environment where all students can achieve their full academic potential.

In [None]:
df['Overall Average Score'] = df[['Average Math Score', 'Average Reading Score', 'Average Science Score']].mean(axis=1)

X = df['Economic_Social_Cultural_Status'].values.reshape(-1, 1)
y = df['Overall Average Score'].values

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

model = LinearRegression()
model.fit(X_poly, y)

y_pred = model.predict(X_poly)

fig = px.scatter(
    df, x='Economic_Social_Cultural_Status', y='Overall Average Score',
    # labels must be keyed by column name; 'x' and 'y' keys are ignored when df columns are used
    labels={'Economic_Social_Cultural_Status': 'Economic Social Cultural Status'},
    title='Polynomial Regression: Overall Average Score vs Economic Status'
)

sorted_indices = np.argsort(X.ravel())
fig.add_traces(px.line(x=X.ravel()[sorted_indices], y=y_pred[sorted_indices], color_discrete_sequence=['red']).data)

fig.show()

### Polynomial Regression Analysis on Overall Average Scores and Economic Status

In our continued exploration of how various factors correlate with academic performance, we have decided to delve deeper into the relationship between students' overall average scores and their economic status. This analysis builds upon our earlier findings where Economic Social Cultural Status (ESCS) showed the strongest correlation with individual subject scores. To refine our understanding and potentially uncover nonlinear relationships, we employ polynomial regression.

#### Creation of an Overall Average Score Column

Given that our previous visualizations indicated similar behavior across individual subject scores in their relationships with other variables, we decided to consolidate these scores into a single metric.

- **Objective**: Calculate an overall average score to simplify our model and focus on general academic performance rather than subject-specific outcomes.
- **Code Implementation**:
  - We first calculate the mean of 'Average Math Score', 'Average Reading Score', and 'Average Science Score' using `df.mean(axis=1)`, which computes the mean across the specified axis, in this case, horizontally across each row.
  - **Code**: `df['Overall Average Score'] = df[['Average Math Score', 'Average Reading Score', 'Average Science Score']].mean(axis=1)`
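
One detail worth knowing about the row-wise mean is how it treats missing subject scores. The toy frame below (invented values, not PISA data) shows that pandas skips missing values by default, so a student with only two subject scores still receives an overall average unless `skipna=False` is passed.

```python
import numpy as np
import pandas as pd

# Toy data: the second student has no reading score
scores = pd.DataFrame({
    'Average Math Score': [500.0, 420.0],
    'Average Reading Score': [480.0, np.nan],
    'Average Science Score': [490.0, 440.0],
})
cols = ['Average Math Score', 'Average Reading Score', 'Average Science Score']

# Default behaviour: missing values are skipped, so row 2 averages two subjects
scores['Overall Average Score'] = scores[cols].mean(axis=1)

# Stricter variant: the average is missing whenever any subject is missing
scores['Overall (complete rows only)'] = scores[cols].mean(axis=1, skipna=False)

print(scores[['Overall Average Score', 'Overall (complete rows only)']])
```

Whether this distinction matters here depends on whether rows with missing scores were already removed in an earlier cleaning step.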

#### Polynomial Feature Generation

To capture potential nonlinear effects of economic status on overall academic performance, polynomial features are generated up to the second degree.

- **Objective**: Enhance the model's ability to identify complex patterns by including not only the original feature but also its square.
- **Function Used**: `PolynomialFeatures` from `sklearn.preprocessing`
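  - **Mechanics**: This transformer converts our input feature into a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. With a single input feature, degree 2, and `include_bias=False`, the result has two columns: the original value and its square.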
  - **Code**: 
    - We reshape the 'Economic_Social_Cultural_Status' into a 2D array suitable for transformation.
    - `X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)`
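
As a small illustration of what the transformation produces, the sketch below applies the same settings to three made-up ESCS values.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Three made-up ESCS values, shaped as a single-column 2D array
X_demo = np.array([[-2.0], [0.5], [3.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_demo_poly = poly.fit_transform(X_demo)

# Column 0 is the original value, column 1 is its square
print(X_demo_poly)
```

Since `include_bias=False` omits the constant column, the intercept is left to `LinearRegression`, which fits one by default.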

#### Linear Regression Model

With the polynomial features at hand, a linear regression model is applied to predict the overall average scores.

- **Objective**: Use the generated polynomial features to fit a linear regression model and predict overall average scores based on economic status.
- **Function Used**: `LinearRegression` from `sklearn.linear_model`
  - **Code**:
    - `model = LinearRegression()`
    - `model.fit(X_poly, y)`  where `y` represents the overall average scores.

#### Visualization of Results

The final step in our analysis is to visualize the relationship along with the regression line to illustrate the fit of our polynomial model.

- **Visualization Tool**: Plotly Express
- **Features**:
  - Scatter plot of actual data points for clear visibility of data distribution.
  - A line plot for the regression predictions to show the trend and fit.
    - The regression line is added by sorting the values and plotting them to ensure continuity and correct alignment.

## Detailed Analysis of Polynomial Regression Outcomes

### Exploration of Socioeconomic Impact on Academic Performance
In our exploration of the relationship between Economic Social Cultural Status (ESCS) and overall academic performance, reflected by average scores in mathematics, reading, and science, a polynomial regression model reveals compelling patterns. These insights deepen our understanding of how socioeconomic factors might influence educational outcomes.

### Visual Inspection of Polynomial Regression
The regression line, marked in red, predominantly exhibits an upward trajectory, suggesting a positive correlation between ESCS and overall academic scores. This trend supports the hypothesis that higher economic status correlates with better academic performance, likely due to increased access to resources, better educational opportunities, and a conducive learning environment.

### Analysis of Regression Line Behavior
- **Horizontal Trend in Lower ESCS Range**: From about ESCS -5 to -2, the regression line appears almost horizontal, indicating that changes in ESCS have minimal impact on academic scores within this lower economic status range. This phenomenon suggests that other factors may play a more significant role than economic status in influencing academic outcomes here.
- **Increased Slope Post ESCS -2**: Beyond ESCS -2, the regression line ascends more steeply, highlighting that improvements in economic status significantly enhance academic performance. This section illustrates a direct relationship where increments in economic status strongly correlate with better academic results.
- **Separation of Observations at Higher ESCS**: Notably, within the ESCS range of approximately 3 to 7, a distinct cluster of data points lies below the main trend line, separated from the main cloud of observations. This separation may indicate outliers or special cases where, despite high economic status, the expected academic performance is not realized, potentially due to inefficient resource utilization, lack of motivation, or personal circumstances.

### Additional Observations on Lower ESCS Values
- **Low ESCS Observations**: At very low ESCS values (-4 and below), the observed overall average scores span a much narrower range (about 250 to 500) than the main data cloud (about 180 to 850), and in particular never reach the upper part of the score distribution. This pattern suggests that extremely low economic status limits academic achievement potential, possibly due to inadequate access to basic educational resources or environments not conducive to learning, although the small number of observations in this range may also contribute to the narrower spread.

### Implications and Theoretical Considerations
- **Effectiveness of Resources**: The flat trend at lower ESCS values may indicate a threshold effect, where basic educational needs must be met before any additional economic benefits can enhance academic performance. This finding underscores the importance of ensuring that fundamental educational necessities are addressed as a priority.
- **Non-Linear Influence of ESCS**: The increasing slope past ESCS -2 is consistent with the idea that the benefits of socioeconomic status on education are not linear but accelerate with higher status levels. Because a second-degree polynomial can bend in only one direction, part of this shape is imposed by the model form, so the pattern should be confirmed with a more flexible fit before it guides policies aimed at progressive resource allocation.
- **Anomalous Lower Performances in High ESCS**: The outliers in higher ESCS ranges challenge the conventional wisdom that more resources always lead to better outcomes. Investigating these anomalies could shed light on the qualitative aspects of how resources are utilized, the role of parental involvement, and individual student characteristics that might influence these deviations.
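
The "flat, then steeper" reading of the curve can be checked numerically from the fitted coefficients rather than by eye. For a quadratic fit, the slope at a given ESCS value is `2 * a * x + b`, and the curve is flat at `x = -b / (2 * a)`. The sketch below uses synthetic data with made-up coefficients; `quadratic_summary` and `slope_at` are hypothetical helpers, and NumPy's `polyfit` stands in for the scikit-learn model used in the notebook.

```python
import numpy as np

def quadratic_summary(x, y):
    """Fit y = a*x^2 + b*x + c and report the coefficients and the flat point."""
    a, b, c = np.polyfit(x, y, deg=2)
    return {"a": float(a), "b": float(b), "c": float(c),
            "turning_point": float(-b / (2 * a))}

def slope_at(summary, x0):
    """Slope of the fitted curve at x0 (derivative of the quadratic)."""
    return 2 * summary["a"] * x0 + summary["b"]

# Synthetic data with invented coefficients, for illustration only
rng = np.random.default_rng(2)
escs = rng.uniform(-5, 4, size=2000)
score = 470 + 40 * escs + 6 * escs ** 2 + rng.normal(0, 60, size=2000)

fit = quadratic_summary(escs, score)
print(f"flat point near ESCS = {fit['turning_point']:.2f}")
print(f"slope at ESCS -3: {slope_at(fit, -3):.1f}, slope at ESCS 2: {slope_at(fit, 2):.1f}")
```

With the notebook's model, the same quantities could be read from `model.coef_`, which holds the linear and squared terms in that order when `include_bias=False` is used.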

### Conclusion
This polynomial regression analysis not only corroborates the expected positive correlation between economic status and educational performance but also unveils complex patterns that challenge simplistic linear assumptions. By dissecting different segments of the ESCS spectrum, we uncover insights crucial for designing targeted educational policies and interventions.

In [None]:
X = df[['Test_Administration_Mode', 'Grade_code', 'Gender', 'Weekly_Study_Time_code', 'Math_Motivation',
        'Sense_of_Belonging', 'Bullying_Frequency', 'Feeling_of_Safety', 'Disciplinary_Climate',
        'Economic_Social_Cultural_Status', 'Access_to_ICT_at_Home',
        'Father_Country_1.0', 'Father_Country_2.0',
        'Mother_Education_Level_1.0', 'Mother_Education_Level_2.0', 'Mother_Education_Level_3.0', 
        'Mother_Education_Level_4.0', 'Mother_Education_Level_5.0', 'Mother_Education_Level_6.0', 
        'Mother_Education_Level_7.0', 'Father_Education_Level_1.0', 
        'Father_Education_Level_2.0', 'Father_Education_Level_3.0', 'Father_Education_Level_4.0', 
        'Father_Education_Level_5.0', 'Father_Education_Level_6.0', 
        'Immig_Status_1.0', 'Immig_Status_2.0']]

y = df['Overall Average Score']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
print('Mean squared error (MSE):', mean_squared_error(y_test, y_pred))
print('Coefficient of determination (R^2):', r2_score(y_test, y_pred))

## Multivariate Linear Regression Analysis on Educational Data

In our investigation of the PISA 2022 dataset, we employ multivariate linear regression to understand the multifaceted influences on students' overall academic performance. This approach allows us to assess the combined effects of various educational and socio-demographic factors.

### Setting Up the Variables
- **Independent Variables**: The analysis incorporates a range of predictors including test administration mode, grade level, gender, study habits, socio-emotional factors like sense of belonging and safety, socioeconomic status, access to ICT, educational levels of parents, and immigration status. Each categorical variable is represented using dummy variables, excluding one category each to avoid perfect multicollinearity, in which one predictor is an exact linear combination of the others.
- **Dependent Variable**: The target variable is the `Overall Average Score`, calculated as the mean of scores in mathematics, reading, and science, which represents a holistic measure of academic achievement.
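
Dummy columns with names like `Immig_Status_1.0` can be produced with `pandas.get_dummies`. The sketch below is one plausible way to do it, not necessarily the encoding step used earlier in the notebook; the category codes are invented for the example.

```python
import pandas as pd

# Invented category codes stored as floats, as the column suffixes ('_1.0') suggest
demo = pd.DataFrame({'Immig_Status': [0.0, 1.0, 2.0, 1.0]})

# drop_first=True removes the first category (0.0), which becomes the reference level
dummies = pd.get_dummies(demo, columns=['Immig_Status'], drop_first=True, dtype=int)

print(dummies)
```

Each remaining coefficient is then interpreted relative to the dropped reference category.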

### Data Preparation and Normalization
- **Normalization**: Prior to modeling, the predictors are standardized with `StandardScaler`, which rescales each one to zero mean and unit variance. This puts the independent variables on a comparable scale, so that the magnitudes of the fitted coefficients can be compared directly.
- **Data Splitting**: The dataset is divided into an 80% training set and a 20% testing set using `train_test_split`, ensuring reproducibility with a consistent random state.
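
The cell above fits `StandardScaler` on the full dataset before splitting. A common alternative is to split first and learn the scaling from the training rows only, which a scikit-learn pipeline handles automatically. The sketch below shows this variant on synthetic data; it also illustrates that, for ordinary least squares, scaling changes the coefficients but not the predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors on very different scales (illustrative values only)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(500, 3)) * np.array([1.0, 10.0, 100.0])
y_demo = 450 + X_demo @ np.array([30.0, -2.0, 0.1]) + rng.normal(0, 20, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training rows only, then fits the regression
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(X_tr, y_tr)

# Same regression without scaling, for comparison
raw_model = LinearRegression().fit(X_tr, y_tr)

same_predictions = np.allclose(scaled_model.predict(X_te), raw_model.predict(X_te))
print("Predictions identical with and without scaling:", same_predictions)
```

For plain linear regression the test metrics are therefore unaffected by where the scaler is fitted; the ordering matters more for regularized or distance-based models.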

### Model Building and Training
- **Model Creation**: Utilizing `LinearRegression()` from the `sklearn.linear_model` library, we configure a linear model to minimize the residual sum of squares between the observed and predicted values in the dataset.
- **Training**: The model is trained on the normalized training data, learning how to predict the `Overall Average Score` based on the independent variables provided.

### Prediction and Model Evaluation
- **Prediction**: Academic performance is predicted for the test set using the trained model.
- **Performance Metrics**:
  - **Coefficients**: Reflect the influence of each independent variable on the academic scores, indicating the direction and magnitude of their impact.
  - **Intercept**: Represents the expected mean score when all predictors are at their mean value.
  - **Mean Squared Error (MSE)**: Measures the average squared difference between the predicted values and the observed values, with lower values indicating a better fit.
  - **Coefficient of Determination (R^2)**: Describes the proportion of variance in the dependent variable that can be predicted from the independent variables, with a value of 1 representing perfect prediction.
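
The two metrics can be written out directly from their definitions. The sketch below does so with NumPy on four invented observations; `regression_metrics` is a hypothetical helper that mirrors what `mean_squared_error` and `r2_score` compute.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, and R^2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = float(np.sum(residuals ** 2))                    # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))     # total variation
    mse = ss_res / len(y_true)
    return {"mse": mse, "rmse": mse ** 0.5, "r2": 1.0 - ss_res / ss_tot}

# Four invented observations and predictions
metrics = regression_metrics([400, 450, 500, 550], [420, 440, 510, 530])
print(metrics)
```

The RMSE is often easier to interpret than the MSE because it is expressed in the same units as the scores.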

### Conclusion
This detailed regression analysis underscores the importance of considering multiple factors to understand their collective impact on student performance. By analyzing the coefficients and overall model accuracy, we identify critical determinants of educational outcomes, facilitating targeted educational policies and interventions. This approach not only quantifies the direct influences but also highlights the complex interplay between various educational determinants.

## Detailed Analysis of Multivariate Linear Regression Results

The results of our multivariate linear regression model provide valuable insights into the factors that influence students' overall academic performance across mathematics, reading, and science. Below, we dissect the model outcomes and draw analytical conclusions from the regression coefficients, the intercept, and performance metrics like Mean Squared Error (MSE) and R² (Coefficient of Determination).

### Regression Model Coefficients Analysis
The coefficients derived from the regression model illustrate the influence of each predictor on the `Overall Average Score`. Notably:

- **Positive Influencers**: Positive coefficients suggest that as the value of these predictors increases, so does the overall average score. For example, `Economic_Social_Cultural_Status` shows the most substantial positive association, with a coefficient of 37.268. Because the predictors were standardized, this means that a one standard deviation increase in ESCS is associated with an increase of about 37 points in the overall average score, holding the other predictors constant. This is consistent with the theory that higher economic status often correlates with better access to educational resources.
  
- **Negative Influencers**: On the other hand, variables like `Weekly_Study_Time_code`, with a coefficient of -8.338, show a negative relationship: a one standard deviation increase is associated with a decrease of about 8 points, other predictors held constant. A possible interpretation is that a large amount of weekly study time could be indicative of struggles with academic content, which in turn goes along with lower performance.

### The Intercept
The intercept of 457.156 is the expected overall average score when all standardized predictors equal zero, that is, when every predictor sits at its sample mean. This value serves as a baseline from which the influence of other variables is measured.
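
The meaning of the intercept under standardization can be verified on a small example: when every predictor is centered on the data used for fitting, the least-squares intercept equals the mean of the target. The sketch below uses synthetic data and NumPy's `lstsq`; it assumes the standardization statistics come from the same rows used for fitting, which holds only approximately in the notebook, where the scaler was fitted before the split.

```python
import numpy as np

# Synthetic predictors and target (invented coefficients)
rng = np.random.default_rng(4)
X_raw = rng.normal(loc=[2.0, -1.0], scale=[1.5, 0.5], size=(300, 2))
y_obs = 460 + 35 * X_raw[:, 0] - 8 * X_raw[:, 1] + rng.normal(0, 50, size=300)

# Standardize each column to zero mean and unit variance
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Least squares with an explicit column of ones for the intercept
design = np.column_stack([np.ones(len(X_std)), X_std])
coef, *_ = np.linalg.lstsq(design, y_obs, rcond=None)
intercept = float(coef[0])

print(f"intercept = {intercept:.3f}, mean of y = {y_obs.mean():.3f}")
```

Read this way, the intercept is close to the average overall score of the training sample rather than the score of a student with all raw variables equal to zero.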

### Performance Metrics
- **Mean Squared Error (MSE)**: The MSE of our model is 6201.937, quantifying the average squared difference between the predicted values and the observed values. Its square root, the root mean squared error, is about 79 score points, which is the typical size of a prediction error. While a lower MSE would indicate a better fit, this value points to moderate predictive accuracy, considering the complexity and variability inherent in educational data.

- **Coefficient of Determination (R²)**: An R² value of 0.305 means that our model explains approximately 30.5% of the variability in the dependent variable, the `Overall Average Score`. Given the more than 1000 variables available in the PISA dataset, our selection was driven by theoretical relevance and empirical evidence suggesting a potential impact on academic scores. Although the R² is modest, it is reasonable for educational studies, where outcomes are influenced by a multitude of factors, many of which are difficult to quantify. It suggests that the selected variables capture a meaningful share of the influences on educational achievement, despite the limited number included in the model.

### Analytical Insights and Implications
- **Economic Status's Strong Influence**: The significant positive coefficient for `Economic_Social_Cultural_Status` reinforces the pivotal role socioeconomic factors play in educational outcomes. It underscores the potential effectiveness of policies aimed at alleviating economic disparities to boost educational achievement.

- **Complexity of Study Time**: The negative association of `Weekly_Study_Time_code` suggests that more hours spent studying do not necessarily equate to better outcomes, emphasizing the need for quality and efficiency in study practices rather than quantity alone.

- **Scope for Model Enhancement**: The moderate explanatory power indicated by the R² value suggests there are additional factors not captured by this model that significantly influence educational outcomes. This could include variables such as teacher effectiveness, student resilience, and parental engagement, suggesting directions for future research and model augmentation.


In [None]:
residuals = y_test - y_pred

residuals_df = pd.DataFrame({
    'Predicted Values': y_pred,
    'Residuals': residuals
})

fig = px.scatter(
    residuals_df, x='Predicted Values', y='Residuals',
    labels={'Predicted Values': 'Predicted Values', 'Residuals': 'Residuals'},
    title='Residual Plot'
)

fig.add_hline(y=0, line_color='red', line_dash='dash')

fig.show()

## Residual Analysis in Linear Regression

Following our multivariate linear regression analysis of the PISA 2022 dataset, a residual analysis is conducted to assess the performance of our model. This analysis helps in understanding the discrepancies between the observed values and the predictions provided by the model, crucial for identifying potential model improvements.

### Overview of Residuals

Residuals, the differences between the observed and predicted values, serve as a diagnostic measure to evaluate the accuracy of a regression model. By plotting these residuals against the predicted values, we can visually inspect the variance and patterns that may indicate issues such as heteroscedasticity or non-linearity.

### Residual Plot Construction

- **Objective**: To create a visual representation of the residuals to evaluate the fit of our linear regression model and to identify any systematic errors.
- **Procedure**:
  - **Calculation of Residuals**: Residuals are computed as the difference between the test set outcomes (`y_test`) and the predicted values (`y_pred`). This subtraction yields a series of residuals indicating the error in prediction for each observation.
  - **Visualization**: The residuals are plotted against the predicted values to visually assess the spread and distribution. An ideal residual plot shows a random dispersion of residuals around the horizontal line at zero, indicating that the model's predictions are unbiased across the range of data.
  
### Plot Features and Interpretation

- **Horizontal Zero Line**: A red dashed horizontal line at y=0 is included in the plot to act as a baseline, making it easier to identify residuals that deviate significantly from zero.
- **Scatter Plot**: Each residual is plotted as a point relative to its predicted value. This scatter helps in identifying patterns such as funnel-shaped plots (indicative of heteroscedasticity) or clear trends that suggest non-linear relationships not captured by the model.

### Analytical Benefits

- **Identification of Outliers**: Large residuals far from the zero line indicate outliers in data that the model fails to accurately predict.
- **Assessment of Homoscedasticity**: The spread of residuals should be uniform across all levels of predicted values; any systematic change in spread suggests heteroscedasticity, prompting a need for transformations or different modeling approaches.
- **Model Diagnostics**: A lack of pattern or trend in the residual plot supports the suitability of the linear model for the data. Conversely, any pattern may indicate model misspecification or the presence of influential variables not included in the model.
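
A visual impression of the residual spread can be complemented with a simple numeric check. The sketch below splits predictions into quantile bins and compares the standard deviation of the residuals in each bin; roughly equal values are consistent with homoscedasticity. The data are synthetic and `residual_spread_by_bin` is a hypothetical helper, not part of the notebook; formal tests exist but are beyond this quick check.

```python
import numpy as np

def residual_spread_by_bin(y_pred, residuals, n_bins=4):
    """Standard deviation of residuals within quantile bins of the predictions."""
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    edges = np.quantile(y_pred, np.linspace(0, 1, n_bins + 1))
    # Assign each prediction to a bin index between 0 and n_bins - 1
    bins = np.clip(np.searchsorted(edges, y_pred, side="right") - 1, 0, n_bins - 1)
    return [float(residuals[bins == b].std()) for b in range(n_bins)]

# Synthetic predictions with two kinds of residuals
rng = np.random.default_rng(5)
pred = rng.uniform(350, 600, size=2000)
homo = rng.normal(0, 80, size=2000)                      # constant spread
hetero = rng.normal(0, 1, size=2000) * (pred - 300) / 3  # spread grows with the prediction

print("constant spread:", [round(s, 1) for s in residual_spread_by_bin(pred, homo)])
print("growing spread: ", [round(s, 1) for s in residual_spread_by_bin(pred, hetero)])
```

Applied to the notebook's variables, the call would be `residual_spread_by_bin(y_pred, residuals)`.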