# 02. Data Cleaning Notebook

## Objectives
- Remove duplicate entries from the dataset to ensure data quality
- Handle missing values using appropriate strategies
- Standardize data formats and fix inconsistent data types

## Inputs
- The raw dataset file from the data collection phase (i.e. lang_proficiency_results_raw.csv)

## Outputs
- Cleaned dataset with consistent formatting and data types
- Documentation of data cleaning decisions and transformations applied
- Summary statistics of data quality improvements

## Additional information
- All cleaning operations are documented with clear rationale for reproducibility
- Any data removed or significantly modified is logged for transparency
- Raw data backups are preserved and data quality metrics tracked throughout the process
- Missing values and outliers are handled using domain-appropriate strategies with clear documentation

---

# Project Directory Structure

## Change working directory

We need to change the working directory from the notebook's current folder to its parent folder, the project root, where this project's code and data are located.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\husse\\OneDrive\\Projects\\lang-level-pred\\jupyter_notebooks'

In [2]:
from pathlib import Path

# Switch to the project root directory
project_root = Path.cwd().parent
os.chdir(project_root)
print(f"Working directory: {os.getcwd()}")

Working directory: c:\Users\husse\OneDrive\Projects\lang-level-pred
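
Note that `Path.cwd().parent` assumes the notebook is started from the `jupyter_notebooks` folder; re-running the cell moves the working directory up one more level each time. A minimal sketch of a more defensive variant follows, assuming the project root can be recognized because it contains a `data` folder. The helper name `find_project_root` and the `marker` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` and return the first folder containing `marker`.

    Assumption: the project root is the nearest ancestor (or `start` itself)
    that has a sub-folder named `marker`.
    """
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No folder containing '{marker}' found above {start}")


# Usage in the notebook (left commented out so the sketch has no side effects):
# import os
# os.chdir(find_project_root(Path.cwd()))
```

Because the search starts from the current folder itself, running the cell a second time would leave the working directory unchanged instead of moving further up.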


---

# Data loading and basic exploration
This code block imports fundamental Python libraries for data analysis and visualization and checks their versions:

- pandas: For data manipulation and analysis
- numpy: For numerical computations
- matplotlib: For creating visualizations and plots

The version checks help ensure:
- Code compatibility across different environments
- Reproducibility of analysis
- Easy debugging of version-specific issues

In [4]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")

pandas version: 2.3.1
NumPy version: 2.3.1
matplotlib version: 3.10.5


### List Files and Folders
- This code shows which files and folders are in our raw data folder (data/raw).
- The path is relative to the project root, which is why the working directory was set to the parent folder earlier.

In [5]:
import os
from pathlib import Path

dataset_dir = Path("data/raw")
print(f"[INFO] Files/folders available in {dataset_dir}:")
os.listdir(dataset_dir)

[INFO] Files/folders available in data\raw:


['lang_proficiency_results_raw.csv']

## Load dataset
This code loads the dataset from the CSV file into a pandas DataFrame.

In [7]:
import pandas as pd
from pathlib import Path

# Define the path to the CSV file
file_path = Path("data/raw/lang_proficiency_results_raw.csv")

# Read the CSV file
df = pd.read_csv(file_path)
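
A quick first look after loading can confirm that the file was parsed as expected. The sketch below is illustrative only: it reads a few hypothetical sample rows from an in-memory string rather than the real file, and the values are invented for demonstration. The column names are taken from the missing-value output in this notebook.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for lang_proficiency_results_raw.csv
sample_csv = io.StringIO(
    "user_id,listening_score,speaking_score,reading_score,writing_score,overall_cefr\n"
    "U001,72,65,80,70,B2\n"
    "U002,,58,61,,B1\n"
    "U002,,58,61,,B1\n"
)
sample_df = pd.read_csv(sample_csv)

# Basic shape and type checks
print(f"Rows, columns: {sample_df.shape}")
print(sample_df.dtypes)

# Count fully duplicated rows (every row after the first occurrence)
n_duplicates = sample_df.duplicated().sum()
print(f"Duplicate rows: {n_duplicates}")
```

The same three checks (`shape`, `dtypes`, `duplicated().sum()`) can be run on `df` directly once the real file is loaded.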

## Identifying problems in dataset

### Check Missing Values in Dataset
This code analyzes and displays missing values in our language proficiency dataset:
- Uses pandas' isna() function to identify missing values
- Counts total missing values per column using sum()
- Sorts results in descending order to highlight columns with most missing data

Previously, this showed multiple missing values in some of the columns, but I mistakenly ran it again after fixing the issue.

In [8]:
# Check missing values
df.isna().sum().sort_values(ascending=False)

overall_cefr       16
listening_score    16
speaking_score     14
reading_score       9
writing_score       8
user_id             0
dtype: int64
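
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of a combined count and percentage summary, using a small hypothetical DataFrame (`toy_df`) in place of the real dataset; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy_df = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "listening_score": [70.0, np.nan, 55.0, np.nan],
        "overall_cefr": ["B2", None, "B1", "A2"],
    }
)

# Missing count and percentage per column, columns with most missing data first
missing_summary = pd.DataFrame(
    {
        "missing_count": toy_df.isna().sum(),
        "missing_pct": (toy_df.isna().mean() * 100).round(1),
    }
).sort_values("missing_count", ascending=False)

print(missing_summary)
```

Replacing `toy_df` with `df` would give the same summary for the loaded dataset, which can help when choosing between dropping rows and imputing values.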