# UC00181_Factors_Impact_on_Student_Performance

Thai Ha NGUYEN

Email: s224082764@deakin.edu.au

## 1. Scenario

You are a data analyst working for the Victorian Department of Education. Your team has been tasked with analyzing the PISA 2022 results for Victorian students to identify key insights about student performance, engagement, and educational outcomes. This analysis will inform policy decisions and educational strategies for improving student outcomes across Victoria.

The Minister of Education has requested a comprehensive analysis that can be presented to stakeholders and integrated into the Ministry of Education Platform (MOP) for ongoing monitoring and decision-making.

## 2. What This Use Case Will Teach You

By the end of Sprint 1, you will have learned how to:

- **Data Wrangling**: Clean and prepare complex educational survey data from SPSS format
- **Environment Setup**: Establish a professional data science workflow using GitHub and Python
- **Domain Knowledge**: Understand the PISA assessment framework and educational metrics
- **Data Quality Assessment**: Identify and handle missing data, outliers, and inconsistencies
- **Exploratory Data Analysis**: Conduct initial investigations to understand data patterns
- **Documentation**: Create clear, reproducible analysis documentation
- **Collaboration**: Set up workflows for integration with web development teams
- **Project Management**: Structure a data science project for iterative development

## 3. Background and Introduction

### PISA Overview
The Programme for International Student Assessment (PISA) is a triennial international survey conducted by the OECD that evaluates education systems worldwide. PISA 2022 assessed 15-year-old students' capabilities in:
- Reading literacy
- Mathematical literacy  
- Scientific literacy
- Creative thinking (new domain in 2022)

### Victorian Education Context
Victoria participates in PISA as part of Australia's commitment to international educational benchmarking. The results help inform:
- Curriculum development
- Teacher training programs
- Resource allocation
- Educational policy decisions

### Project Significance
This analysis will provide evidence-based insights to support Victorian educational excellence and identify areas for improvement in student outcomes.

## 4. Dataset Introduction

### Dataset Overview
- **Name**: PISA 2022 Student Questionnaire - Victoria, Australia Subset
- **Format**: SPSS (.sav) file
- **Scope**: 15-year-old students from Victorian schools
- **Assessment Year**: 2022
- **Data Collection**: Conducted between March and August 2022

### Key Components
The dataset contains multiple dimensions of student data:

**Performance Measures:**
- Reading, Mathematics, and Science plausible values
- Creative thinking assessment scores
- Domain-specific subscale scores

**Student Background:**
- Socioeconomic status indicators
- Home language and cultural background
- Family structure and parental education
- Immigration status and length of residence

**School Context:**
- School type (government, Catholic, independent)
- School location (metropolitan, regional, remote)
- School size and resources
- Teacher qualifications and experience

**Learning Environment:**
- Student attitudes toward subjects
- Motivation and engagement measures
- Learning strategies and study habits
- Technology use and digital literacy

### Data Structure Expectations
- Approximately 600-800 Victorian student records
- 500+ variables covering assessment scores and questionnaire responses
- Hierarchical structure (students nested within schools)
- Complex survey design with sampling weights
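
Plausible values and sampling weights change how even a simple average should be computed. The sketch below shows one common approach: compute the weighted mean separately for each plausible value, then average those estimates. It assumes the Victorian subset keeps the OECD naming convention (`PV1MATH`, `PV2MATH`, ... for plausible values and `W_FSTUWT` for the final student weight); the helper `weighted_pv_mean` and the toy numbers are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def weighted_pv_mean(df, pv_columns, weight_column):
    """Average the weighted means computed separately for each plausible value."""
    weights = df[weight_column].to_numpy(dtype=float)
    per_pv_means = [
        np.average(df[col].to_numpy(dtype=float), weights=weights)
        for col in pv_columns
    ]
    return float(np.mean(per_pv_means))

# Tiny synthetic example; the values are made up for illustration only.
toy = pd.DataFrame({
    "PV1MATH": [500.0, 450.0, 550.0],
    "PV2MATH": [510.0, 440.0, 560.0],
    "W_FSTUWT": [1.0, 2.0, 1.0],   # assumed name of the final student weight
})
estimate = weighted_pv_mean(toy, ["PV1MATH", "PV2MATH"], "W_FSTUWT")
print(f"Weighted mean maths score: {estimate:.2f}")
```

This covers point estimates only. Standard errors need the replicate weights as well and are not attempted here.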

## 5. Importing Datasets

### 5.1 Environment Setup

#### Required Libraries

In [6]:
# Core data manipulation
import pandas as pd
import numpy as np

# SPSS file handling
import pyreadstat

# Statistical analysis
import scipy.stats as stats
from scipy.stats import chi2_contingency, pearsonr, spearmanr
from statsmodels.stats.weightstats import ttest_ind
import statsmodels.api as sm

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

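# Data quality profiling (optional; pandas_profiling was renamed to ydata_profiling)
try:
    from ydata_profiling import ProfileReport
except ImportError:
    ProfileReport = None

# dataprep is optional and may not be installed in every environment
try:
    from dataprep.eda import create_report
except ImportError:
    create_report = None

# File paths and timestamps
import os
from datetime import datetime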
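
Some of the packages listed above may not be installed in every environment, so it can help to check availability before running the import cell. The following is a minimal sketch using only the standard library. The split into `REQUIRED` and `OPTIONAL` is an assumption about which packages this notebook cannot run without, and `missing_modules` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Import names (not pip package names) used by the setup cell.
REQUIRED = ["pandas", "numpy", "pyreadstat", "scipy", "statsmodels",
            "matplotlib", "seaborn", "plotly"]
# ydata_profiling is the successor to pandas_profiling.
OPTIONAL = ["ydata_profiling", "dataprep"]

def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]

missing_required = missing_modules(REQUIRED)
missing_optional = missing_modules(OPTIONAL)
print("Missing required:", missing_required or "none")
print("Missing optional:", missing_optional or "none")
```

Anything reported as missing can then be installed before continuing, and the optional profiling steps can be skipped if those packages are unavailable.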

### 5.2 Data Import Process

#### Loading SPSS Data

In [None]:
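def load_pisa_data(file_path):
    """
    Load the PISA 2022 SPSS dataset together with its metadata.

    Returns the data frame, a dict mapping column names to variable
    labels, and a dict mapping column names to value-label dicts.
    """
    # Read SPSS file; apply_value_formats replaces coded values with their labels
    df, meta = pyreadstat.read_sav(file_path,
                                   apply_value_formats=True,
                                   formats_as_ordered_category=False)

    # Store variable labels keyed by column name for documentation
    variable_labels = meta.column_names_to_labels
    value_labels = meta.variable_value_labels

    return df, variable_labels, value_labels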

# Load the dataset
pisa_data, var_labels, val_labels = load_pisa_data('SPSS2022VIC.sav')
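
The labels returned by the loader can be turned into a small data dictionary, which is useful when working with 500+ coded variable names. The sketch below assumes a name-to-label mapping; if the labels are held as a list aligned with the columns, `dict(zip(pisa_data.columns, var_labels))` produces the same mapping. The helper `build_data_dictionary` is hypothetical, and the example labels are illustrative stand-ins rather than values read from the actual file.

```python
import pandas as pd

def build_data_dictionary(name_to_label, value_labels):
    """Tabulate each variable with its label and number of labelled codes."""
    rows = [
        {
            "variable": name,
            "label": label if label else "",
            "n_value_labels": len(value_labels.get(name, {})),
        }
        for name, label in name_to_label.items()
    ]
    return pd.DataFrame(rows, columns=["variable", "label", "n_value_labels"])

# Illustrative labels only; real ones come from the .sav metadata.
example_labels = {
    "ST004D01T": "Student gender",
    "PV1MATH": "Plausible value 1 in mathematics",
}
example_values = {"ST004D01T": {1.0: "Female", 2.0: "Male"}}

data_dictionary = build_data_dictionary(example_labels, example_values)
print(data_dictionary.to_string(index=False))
```

With the real metadata, the call would look like `build_data_dictionary(var_labels, val_labels)`, and the result can be saved alongside the analysis as documentation.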

In [None]:
# Basic dataset information
print(f"Dataset shape: {pisa_data.shape}")
print(f"Number of students: {pisa_data.shape[0]}")
print(f"Number of variables: {pisa_data.shape[1]}")

# Preview first few rows
display(pisa_data.head())

# Data types overview
print("\nData types summary:")
print(pisa_data.dtypes.value_counts())
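
After the shape and dtypes, a useful next check is how much of each variable is missing, since questionnaire items are often skipped. The following is a minimal sketch on a synthetic frame. `missing_summary` is a hypothetical helper, and the column names (`ESCS`, `PV1READ`, `CNTSTUID`) follow OECD naming but are assumed rather than confirmed for this subset.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages, worst first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)

# Synthetic data for illustration only.
toy = pd.DataFrame({
    "ESCS": [0.3, np.nan, -0.5, np.nan],
    "PV1READ": [480.0, 510.0, np.nan, 495.0],
    "CNTSTUID": [1, 2, 3, 4],
})
report = missing_summary(toy)
print(report)
```

On the real data this would be called as `missing_summary(pisa_data)`. Note that it only counts values already coded as missing (NaN), so any labelled codes such as "No Response" that survive the import would need to be recoded first.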