# 03. Exploratory Data Analysis

## Objectives

- Perform exploratory analysis of the cleaned tabular dataset
- Generate descriptive statistics and visual summaries of language proficiency scores
- Identify correlations and relationships between different test components
- Highlight trends, anomalies, and patterns relevant to CEFR classification and modelling

## Inputs

- Cleaned dataset from the Data Cleaning phase (data/processed/lang_proficiency_results.csv)
- Python libraries: pandas, numpy, matplotlib, seaborn, pandas_profiling (optional; the package has since been renamed ydata-profiling)

## Outputs

- Pandas profiling report or equivalent descriptive overview
- Correlation analysis (correlation heatmap, pair plots, or PPS study)
- Visualisations of score distributions, relationships, and CEFR class balance
- Written insights to inform feature engineering and modelling steps


## Additional information

- All plots will be interpreted in context with the business requirements
- Correlations and visual insights will help define predictive features for ML
- Any unusual patterns identified here may prompt refinement of cleaning or feature engineering


---

# Project Directory Structure

## Change working directory

The notebook runs from the `jupyter_notebooks` folder, so we change the working directory to its parent, the project root. Relative paths such as `data/processed/lang_proficiency_results.csv` then resolve correctly.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\husse\\OneDrive\\Projects\\lang-level-pred\\jupyter_notebooks'

In [2]:
from pathlib import Path

# switch to the project root directory (the parent of the notebooks folder)
project_root = Path.cwd().parent
os.chdir(project_root)
print(f"Working directory: {os.getcwd()}")

Working directory: c:\Users\husse\OneDrive\Projects\lang-level-pred
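
One caveat with `Path.cwd().parent`: running that cell a second time moves one level further up, out of the project. Below is a minimal sketch of an alternative that is safe to re-run. It assumes the `jupyter_notebooks` folder sits directly under the project root, as the paths printed above suggest. `find_project_root` and its `marker` parameter are hypothetical names and not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "jupyter_notebooks") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`.

    `marker` is an assumed landmark folder; adjust it to any directory
    that exists only at the project root.
    """
    start = start.resolve()
    # Check the starting directory first, then each ancestor in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' at or above {start}")


# Usage inside the notebook (commented out so the sketch has no side effects):
# import os
# os.chdir(find_project_root(Path.cwd()))
```

Because the function searches for a landmark rather than stepping up a fixed number of levels, it returns the same directory whether it is called from the notebooks folder or from the project root itself.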


---

# Data loading and basic exploration
The next code cell imports the core Python libraries for data analysis and visualization and prints their versions.

- pandas: For data manipulation and analysis
- numpy: For numerical computations
- matplotlib: For creating visualizations and plots

The version checks help ensure:
- Code compatibility across different environments
- Reproducibility of analysis
- Easy debugging of version-specific issues

In [3]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")

pandas version: 2.3.1
NumPy version: 2.3.1
matplotlib version: 3.10.5
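
Printing versions documents the environment, but it does not stop the notebook from running on an incompatible one. As a rough sketch, installed versions could also be compared against minimum versions programmatically. The minimums in `REQUIRED` are hypothetical placeholders rather than tested requirements of this project, and `release_tuple`, `meets_minimum`, and `check_minimums` are illustrative helper names.

```python
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string: str) -> tuple[int, ...]:
    """Turn '2.3.1' into (2, 3, 1), dropping suffixes such as 'rc1' or '.dev0'."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numeric release components, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_minimums(minimums: dict[str, str]) -> dict[str, bool]:
    """Map each package name to True if it is installed at or above its minimum."""
    results = {}
    for package, minimum in minimums.items():
        try:
            results[package] = meets_minimum(version(package), minimum)
        except PackageNotFoundError:
            results[package] = False  # not installed at all
    return results


# Hypothetical minimums, for illustration only.
REQUIRED = {"pandas": "2.0", "numpy": "2.0", "matplotlib": "3.8"}
print(check_minimums(REQUIRED))
```

This comparison looks only at the numeric release components, so it treats a pre-release such as `2.3.1rc1` the same as `2.3.1`. For full version semantics, a dedicated version-parsing library would be the safer choice.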
