# 02. Data Cleaning Notebook

## Objectives
- Remove duplicate entries from the dataset to ensure data quality
- Handle missing values using appropriate strategies
- Standardize data formats and fix inconsistent data types

## Inputs
- The raw dataset file from the data collection phase (i.e., lang_proficiency_results_raw.csv)

## Outputs
- Cleaned dataset with consistent formatting and data types
- Documentation of data cleaning decisions and transformations applied
- Summary statistics of data quality improvements

## Additional information
- All cleaning operations are documented with clear rationale for reproducibility
- Any data removed or significantly modified is logged for transparency
- Raw data backups are preserved and data quality metrics tracked throughout the process
- Missing values and outliers are handled using domain-appropriate strategies with clear documentation

---

# Project Directory Structure

## Change working directory

We need to change the working directory from the notebook's current folder (`jupyter_notebooks`) to its parent folder, the project root, so that relative paths such as `data/raw` resolve correctly.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\husse\\OneDrive\\Projects\\lang-level-pred\\jupyter_notebooks'

In [2]:
from pathlib import Path

# Switch to the project root directory
project_root = Path.cwd().parent
os.chdir(project_root)
print(f"Working directory: {os.getcwd()}")

Working directory: c:\Users\husse\OneDrive\Projects\lang-level-pred


---

# Data loading and basic exploration
This code block imports the core Python libraries for data analysis and visualization and prints their versions:

- pandas: For data manipulation and analysis
- numpy: For numerical computations
- matplotlib: For creating visualizations and plots

The version checks help ensure:
- Code compatibility across different environments
- Reproducibility of analysis
- Easy debugging of version-specific issues

In [4]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")

pandas version: 2.3.1
NumPy version: 2.3.1
matplotlib version: 3.10.5


### List Files and Folders
- This code lists the files and folders in our raw data folder (`data/raw`).
- The relative path resolves correctly because the working directory was set to the project root earlier.

In [5]:
import os
from pathlib import Path

dataset_dir = Path("data/raw")
print(f"[INFO] Files/folders available in {dataset_dir}:")
os.listdir(dataset_dir)

[INFO] Files/folders available in data\raw:


['lang_proficiency_results_raw.csv']

## Load dataset
This code reads the raw CSV file into a pandas DataFrame (`df`).

In [7]:
import pandas as pd
from pathlib import Path

# Define the path to the CSV file
file_path = Path("data/raw/lang_proficiency_results_raw.csv")

# Read the CSV file
df = pd.read_csv(file_path)

## Identifying problems in dataset

### Check Missing Values in Dataset
This code analyzes and displays missing values in our language proficiency dataset:
- Uses pandas' isna() function to identify missing values
- Counts total missing values per column using sum()
- Sorts results in descending order to highlight columns with most missing data

Previously, this showed multiple missing values in some of the columns, but I mistakenly ran it again after fixing the issue.

In [8]:
# Check missing values
df.isna().sum().sort_values(ascending=False)

overall_cefr       16
listening_score    16
speaking_score     14
reading_score       9
writing_score       8
user_id             0
dtype: int64
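
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch, assuming a small made-up DataFrame (`toy_df`, not the real dataset) and illustrative variable names, of how counts and percentages could be combined into one summary:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy_df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "speaking_score": [50.0, np.nan, 70.0, 80.0],
    "overall_cefr": ["A2", None, "B1", None],
})

# isna() yields a boolean frame: sum() counts missing cells per column,
# mean() gives the fraction of missing cells per column
missing_count = toy_df.isna().sum()
missing_pct = toy_df.isna().mean() * 100

missing_summary = pd.DataFrame(
    {"missing_count": missing_count, "missing_pct": missing_pct}
).sort_values("missing_count", ascending=False)

print(missing_summary)
```

The same expressions should work unchanged on `df` once the toy frame is swapped out.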

### Check for Duplicates
This code checks whether any fully duplicated rows exist in the dataset.

In [9]:
# Check if there are any duplicate rows
duplicates = df.duplicated()

# Count total duplicates
print(f"Number of duplicate rows: {duplicates.sum()}")

# View the duplicate rows (if any)
df[duplicates]

Number of duplicate rows: 10


| | user_id | speaking_score | reading_score | listening_score | writing_score | overall_cefr |
|---|---|---|---|---|---|---|
| 1000 | 508 | 32.0 | 31.0 | 35.0 | 39.0 | A1 |
| 1001 | 819 | 57.0 | 55.0 | 58.0 | 56.0 | B1 |
| 1002 | 453 | 61.0 | 59.0 | 57.0 | 70.0 | B1 |
| 1003 | 369 | 53.0 | 55.0 | 47.0 | 53.0 | A2 |
| 1004 | 243 | 51.0 | 47.0 | 41.0 | 48.0 | A2 |
| 1005 | 930 | 82.0 | 72.0 | 71.0 | 74.0 | B2 |
| 1006 | 263 | 89.0 | 91.0 | 90.0 | 88.0 | C1 |
| 1007 | 811 | 29.0 | 40.0 | 27.0 | 31.0 | A1 |
| 1008 | 319 | 52.0 | 50.0 | 50.0 | 47.0 | A2 |
| 1009 | 50 | 27.0 | 32.0 | 40.0 | 25.0 | A1 |
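
By default, `duplicated()` marks only the later copies of a repeated row, so the first occurrence is not shown. To inspect every copy side by side, `keep=False` can be passed instead. A minimal sketch on a made-up frame (`toy_df` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: the row for user_id 2 appears twice
toy_df = pd.DataFrame({
    "user_id": [1, 2, 3, 2],
    "speaking_score": [50.0, 60.0, 70.0, 60.0],
    "overall_cefr": ["A2", "B1", "B1", "B1"],
})

# Default keep="first": only the second and later copies are flagged
later_copies = toy_df.duplicated()

# keep=False: every row that has an identical twin is flagged
all_copies = toy_df.duplicated(keep=False)

print(f"Repeated rows (excluding first occurrence): {later_copies.sum()}")
print(toy_df[all_copies].sort_values("user_id"))
```

Sorting the flagged rows by `user_id` places each original next to its copy, which makes it easier to confirm that the rows really are identical before removing anything.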


### Check for Outliers
This code counts the scores in each skill column that fall outside the expected range of 1 to 100 (that is, x < 1 or x > 100).

In [10]:
# Check outliers for all score columns
score_columns = ['speaking_score', 'listening_score', 'reading_score', 'writing_score']

for col in score_columns:
    outliers = ((df[col] < 1) | (df[col] > 100)).sum()
    print(f"{col}: {outliers} outliers")

speaking_score: 0 outliers
listening_score: 0 outliers
reading_score: 7 outliers
writing_score: 0 outliers
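
The counts say how many out-of-range values exist but not which rows hold them. One possible way to pull those rows out for inspection is sketched below on a made-up frame; `toy_df`, `out_of_range`, and `outlier_rows` are illustrative names, and the thresholds mirror the check above:

```python
import pandas as pd

# Hypothetical toy data with two out-of-range reading scores
toy_df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "speaking_score": [50.0, 60.0, 70.0],
    "reading_score": [55.0, 120.0, -5.0],
})
score_columns = ["speaking_score", "reading_score"]

# Boolean frame: True where a score is below 1 or above 100.
# Missing values compare as False, so they are not flagged here.
out_of_range = (toy_df[score_columns] < 1) | (toy_df[score_columns] > 100)

# Keep rows where at least one score column is out of range
outlier_rows = toy_df[out_of_range.any(axis=1)]
print(outlier_rows)
```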


In [11]:
df.dtypes

user_id              int64
speaking_score     float64
reading_score      float64
listening_score    float64
writing_score      float64
overall_cefr        object
dtype: object

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010 entries, 0 to 1009
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   user_id          1010 non-null   int64  
 1   speaking_score   996 non-null    float64
 2   reading_score    1001 non-null   float64
 3   listening_score  994 non-null    float64
 4   writing_score    1002 non-null   float64
 5   overall_cefr     994 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 47.5+ KB


In [20]:
# Check if any non-numeric values exist
df['listening_score'].dtype == 'object'  # True if column contains strings

False

In [21]:
# Check if any non-numeric values exist
df['speaking_score'].dtype == 'object'  # True if column contains strings

False

In [None]:
# Check if any non-numeric values exist
df['writing_score'].dtype == 'object'  # True if column contains strings

False

In [19]:
# Check if any non-numeric values exist
df['reading_score'].dtype == 'object'  # True if column contains strings

False
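
The four per-column checks above could also be run in a single pass. A minimal sketch, assuming a made-up frame (`toy_df`) and using `pandas.api.types.is_numeric_dtype`, which reports whether a column has a numeric dtype:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data: reading_score was read in as text
toy_df = pd.DataFrame({
    "speaking_score": [50.0, 60.0],
    "reading_score": ["55", "n/a"],
})
score_columns = ["speaking_score", "reading_score"]

# Map each score column to True (numeric dtype) or False (non-numeric)
numeric_check = {col: is_numeric_dtype(toy_df[col]) for col in score_columns}

for col, is_numeric in numeric_check.items():
    print(f"{col}: {'numeric' if is_numeric else 'NOT numeric'}")
```

Note that this flips the sense of the original check: `True` here means the column is numeric, whereas `dtype == 'object'` returns `True` when the column holds strings.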

## Data Cleaning

### Handling Missing Values by Group Mean Imputation

In this dataset, each user's four skill scores (speaking, listening, reading, and writing) directly contribute to their **overall CEFR level**.  
When a score is missing, replacing it with the mean across all CEFR levels could distort the relationship between skills and the overall level.

We replace missing values in each skill **with the mean score for that skill within the same overall CEFR group**.  
This ensures that imputed values are consistent with the performance range typical of that CEFR level.

**Example:**  
If a `speaking_score` is missing for a B2-level user, we:
1. Look at all B2 users’ speaking scores.
2. Calculate the mean speaking score for the B2 group.
3. Fill the missing value with this mean.
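
The three steps above can be sketched with `groupby(...).transform("mean")`, which returns the group mean aligned to each original row. This is a minimal illustration on a made-up frame (`toy_df` and `group_means` are assumed names), not necessarily the exact code used later in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing speaking score in each CEFR group
toy_df = pd.DataFrame({
    "overall_cefr": ["B2", "B2", "B2", "A1", "A1"],
    "speaking_score": [80.0, 70.0, np.nan, 30.0, np.nan],
})

# Steps 1-2: mean speaking score per CEFR group, repeated for every row
# of that group (NaN values are skipped when computing the mean)
group_means = toy_df.groupby("overall_cefr")["speaking_score"].transform("mean")

# Step 3: fill only the missing entries with their group's mean
toy_df["speaking_score"] = toy_df["speaking_score"].fillna(group_means)

# The missing B2 score becomes 75.0 (mean of 80 and 70); the A1 one becomes 30.0
print(toy_df)
```

Rows whose `overall_cefr` is itself missing belong to no group, so this approach cannot fill them; they would need separate handling.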