# 01. Data Collection and Initial Exploration
*This notebook focuses on getting the data into a usable format and performing basic exploration.*

## Objectives
- *To load the language proficiency test results dataset from a publicly accessible source.*
- *To ensure the dataset is stored securely within the project's directory structure.*
- *To prepare the data for subsequent preprocessing and analysis steps.*
- *To document the origin and relevance of the dataset in the context of the project.*

## Inputs
- *A URL pointing to the compressed dataset file (e.g., a zip file).*
- *Python libraries: pandas, numpy, matplotlib, requests, os, pathlib, zipfile*

## Outputs
- *Loaded dataset displayed and summarized within the notebook.*
- *CSV file stored or confirmed under data/raw/*

## Additional information
* *The Data Collection section primarily involves downloading and organizing the raw dataset before further processing.*
* *Network connectivity and library installations should be verified before executing the code.*
* *Ensure sufficient disk space and consider implementing error handling for robust data acquisition.*


***

# Project Directory Structure

## Change working directory

We need to change the working directory from the notebook's folder to its parent folder, so that relative paths such as data/raw resolve from the project root.
- *We access the current directory with os.getcwd()*

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\husse\\OneDrive\\Projects\\lang-level-pred\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.

- *os.path.dirname() gets the parent directory*
- *os.chdir() sets the new current directory*

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\husse\\OneDrive\\Projects\\lang-level-pred'
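
Re-running these cells out of order can move the working directory above the project root, because current_dir is reassigned in the last cell. A minimal sketch of a guard is shown below, assuming the notebooks live in a folder named jupyter_notebooks (as in the first output above). The helper name move_to_project_root is hypothetical and not part of this notebook.

```python
import os
from pathlib import Path


def move_to_project_root(notebooks_dirname: str = "jupyter_notebooks") -> Path:
    """Change to the parent folder only when currently inside the notebooks folder."""
    cwd = Path.cwd()
    if cwd.name == notebooks_dirname:
        # Only step up when we are still inside the notebooks folder
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling move_to_project_root() a second time leaves the working directory unchanged, so the cell is safe to re-run.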

## Download dataset from GitHub

The following tasks are performed below:
- Creates a new folder called 'data/raw' to store files
- Fetches the zipped dataset from GitHub
- Checks that the download succeeded
- Extracts the downloaded archive and deletes the zip file when done

In [4]:
import requests, os, zipfile

# Create data folder
os.makedirs("data/raw", exist_ok=True)

# Define download path
zip_url = "https://raw.githubusercontent.com/Ilyas355/language-proficiency-dataset/main/lang_proficiency_results_raw.zip"
zip_path = "data/raw/lang_proficiency_results_raw.zip"

# Download the file and stop with an error if the request failed
response = requests.get(zip_url, timeout=30)
response.raise_for_status()
with open(zip_path, "wb") as f:
    f.write(response.content)

# Unzip the contents
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall("data/raw")

# Delete the zip file now that its contents have been extracted
os.remove(zip_path)
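
The download step can be made more defensive. The sketch below is one possible approach, not this notebook's own code: it uses the standard library's urllib instead of requests, and the helper names fetch_bytes and extract_zip_bytes are hypothetical. It checks that the downloaded bytes really are a zip archive before extracting, and works in memory so no temporary zip file has to be written or deleted.

```python
import io
import zipfile
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL and return its body; urlopen raises HTTPError on 4xx/5xx."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_zip_bytes(content: bytes, dest_dir: str) -> list:
    """Validate that content is a zip archive, extract it, and return member names."""
    buffer = io.BytesIO(content)
    if not zipfile.is_zipfile(buffer):
        # For example, an HTML error page saved in place of the archive
        raise ValueError("Downloaded content is not a valid zip archive")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(buffer) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

For example, extract_zip_bytes(fetch_bytes(zip_url), "data/raw") would be expected to give the same files as the cell above, assuming the URL is reachable.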

---

# Data loading and basic exploration
This code block imports the core Python libraries for data analysis and visualization and prints their versions.

- pandas: For data manipulation and analysis
- numpy: For numerical computations
- matplotlib: For creating visualizations and plots

The version checks help ensure:
- Code compatibility across different environments
- Reproducibility of analysis
- Easy debugging of version-specific issues

In [6]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")


pandas version: 2.3.1
NumPy version: 2.3.1
matplotlib version: 3.10.5
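
The version check above stops with an ImportError if a library is missing. As a hedged alternative, the sketch below reads installed versions from package metadata without importing the libraries, using importlib.metadata from the standard library (Python 3.8 or later). The helper name installed_versions is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report every library this notebook relies on, flagging any that are absent
for name, ver in installed_versions(["pandas", "numpy", "matplotlib", "requests"]).items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```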


### List Files and Folders
This code lists the files and folders in our data/raw folder.

In [7]:
import os
from pathlib import Path

dataset_dir = Path("data/raw")
print(f"[INFO] Files/folders available in {dataset_dir}:")
os.listdir(dataset_dir)

[INFO] Files/folders available in data\raw:


['lang_proficiency_results_raw.csv']

## Load dataset
This code loads the CSV file into a pandas DataFrame named df.

In [8]:
import pandas as pd
from pathlib import Path

# Define the path to the CSV file
file_path = Path("data/raw/lang_proficiency_results_raw.csv")

# Read the CSV file
df = pd.read_csv(file_path)

## DataFrame Information Display

This code generates a comprehensive summary of our DataFrame, displaying:
- Total number of entries
- Column names and their data types
- Memory usage statistics
- Non-null count per column

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010 entries, 0 to 1009
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   user_id          1010 non-null   int64  
 1   speaking_score   996 non-null    float64
 2   reading_score    1001 non-null   float64
 3   listening_score  994 non-null    float64
 4   writing_score    1002 non-null   float64
 5   overall_cefr     994 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 47.5+ KB
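
The non-null counts above imply missing values in every column except user_id (for example, 1010 - 996 = 14 missing speaking scores). A minimal sketch for tabulating this directly is shown below. The helper name missing_summary and the toy values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    missing = frame.isna().sum()
    return pd.DataFrame(
        {
            "missing": missing,
            "missing_pct": (missing / len(frame) * 100).round(2),
        }
    )


# Toy frame with made-up values, only to show the shape of the result
toy = pd.DataFrame(
    {
        "speaking_score": [26.0, None, 61.0, 65.0],
        "overall_cefr": ["A1", "C1", None, None],
    }
)
print(missing_summary(toy))
```

On the real data the call would be missing_summary(df).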


## Statistical summary

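This code computes summary statistics for every column. Because include='all' is passed, the output covers both numeric and non-numeric columns:
- count → number of non-missing (NaN excluded) values
- unique, top, freq → number of distinct values, the most frequent value, and how often it occurs (non-numeric columns only)
- mean → arithmetic mean of the non-missing values (numeric columns only)
- std → standard deviation
- min → minimum value
- 25%, 50%, 75% → quartiles (Q1, median, Q3)
- max → maximum value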


In [10]:
df.describe(include='all')

| | user_id | speaking_score | reading_score | listening_score | writing_score | overall_cefr |
|---|---|---|---|---|---|---|
| count | 1010.0 | 996.0 | 1001.0 | 994.0 | 1002.0 | 994 |
| unique | | | | | | 8 |
| top | | | | | | A2 |
| freq | | | | | | 213 |
| mean | 500.262376 | 61.194779 | 61.280719 | 61.381288 | 61.07984 | |
| std | 288.692973 | 21.339708 | 21.550594 | 21.42103 | 21.200193 | |
| min | 1.0 | 25.0 | 25.0 | 25.0 | 25.0 | |
| 25% | 251.25 | 43.0 | 43.0 | 42.0 | 43.0 | |
| 50% | 499.5 | 60.0 | 61.0 | 60.0 | 59.0 | |
| 75% | 750.75 | 79.0 | 79.0 | 80.0 | 80.0 | |
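
The summary reports 8 unique values in overall_cefr, while the CEFR scale has six levels (A1, A2, B1, B2, C1, C2), so at least two labels need checking. The output above does not show which labels these are. A minimal sketch for listing unexpected labels follows. The helper name unexpected_labels and the toy values are hypothetical.

```python
import pandas as pd

CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def unexpected_labels(labels: pd.Series) -> pd.Series:
    """Count every non-missing label that is not one of the six CEFR levels."""
    counts = labels.value_counts()  # NaN values are dropped by default
    return counts[~counts.index.isin(CEFR_LEVELS)]


# Toy labels with made-up inconsistencies (lowercase, trailing space)
toy = pd.Series(["A1", "a2", "B1 ", "C2", "A1", None])
print(unexpected_labels(toy))
```

On the real data the call would be unexpected_labels(df["overall_cefr"]).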


## View first 60 rows

The code below displays the first 60 rows in the dataset:

In [11]:
df.head(60)

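| | user_id | speaking_score | reading_score | listening_score | writing_score | overall_cefr |
|---|---|---|---|---|---|---|
| 0 | 1 | 26.0 | 40.0 | 28.0 | 33.0 | A1 |
| 1 | 2 | 91.0 | 92.0 | 89.0 | 87.0 | C1 |
| 2 | 3 | 61.0 | 66.0 | 66.0 | 57.0 | B1 |
| 3 | 4 | 65.0 | 60.0 | 55.0 | 55.0 | B1 |
| 4 | 5 | 77.0 | 76.0 | 83.0 | 78.0 | B2 |
| 5 | 6 | 58.0 | 66.0 | 61.0 | 56.0 | B1 |
| 6 | 7 | 50.0 | 46.0 | 55.0 | 52.0 | A2 |
| 7 | 8 | 78.0 | 71.0 | 82.0 | 81.0 | B2 |
| 8 | 9 | 45.0 | 51.0 | 42.0 | 49.0 | A2 |
| 9 | 10 | 64.0 | 64.0 | 68.0 | 56.0 | B1 |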
