# 01. Data Collection and Initial Exploration
*This notebook focuses on getting the data into a usable format and performing basic exploration.*

## Objectives
- *To load a language proficiency test results dataset from a publicly accessible source.*
- *To ensure the dataset is stored securely within the project's directory structure.*
- *To prepare the data for subsequent preprocessing and analysis steps.*
- *To document the origin and relevance of the dataset in the context of the project.*

## Inputs
- *A URL pointing to the compressed dataset file (e.g., a zip file).*
- *Python libraries: pandas, numpy, matplotlib, requests, os, zipfile, pathlib*

## Outputs
- *Loaded dataset displayed and summarized within the notebook.*
- *CSV file stored or confirmed under data/raw/*

## Additional information
* *The Data Collection section primarily involves downloading and organizing the raw dataset before further processing.*
* *Network connectivity and library installations should be verified before executing the code.*
* *Ensure sufficient disk space and consider implementing error handling for robust data acquisition.*


***

# Project Directory Structure

## Change working directory

We need to change the working directory from the current folder (`jupyter_notebooks`) to its parent folder, the project root.
- *We access the current directory with os.getcwd()*

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\husse\\OneDrive\\Projects\\lang-level-pred\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.

- *os.path.dirname() gets the parent directory*
- *os.chdir() sets the new current directory*

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\husse\\OneDrive\\Projects\\lang-level-pred'
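
Running the `os.chdir` cell a second time would move the working directory up one more level. A minimal sketch of a rerun-safe variant is shown here, using `pathlib`. The helper name `ensure_project_root` and its `notebook_dir_name` parameter are hypothetical, and the sketch assumes the notebooks live in a folder named `jupyter_notebooks`, as in the output above.

```python
import os
from pathlib import Path


def ensure_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Move to the parent folder only when currently inside the notebook folder.

    `notebook_dir_name` is an assumption based on the path printed above.
    """
    cwd = Path.cwd()
    # Only step up when we are still inside the notebook folder,
    # so rerunning the cell does not climb past the project root.
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `ensure_project_root()` at the top of the notebook returns the resulting working directory, so it can be displayed just like `current_dir` above.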

## Download dataset from GitHub

The cell below performs the following tasks:
- Creates a folder called 'data/raw' to store files
- Fetches the dataset from GitHub
- Checks that the download worked correctly
- Extracts the downloaded zip file and deletes it once its contents are unpacked

In [4]:
import os
import zipfile

import requests

# Create the data folder if it does not already exist
os.makedirs("data/raw", exist_ok=True)

# Define the download URL and the local path for the zip file
zip_url = "https://raw.githubusercontent.com/Ilyas355/language-proficiency-dataset/main/lang_proficiency_results_raw.zip"
zip_path = "data/raw/lang_proficiency_results_raw.zip"

# Download the file and stop with an error if the request failed
response = requests.get(zip_url, timeout=30)
response.raise_for_status()
with open(zip_path, "wb") as f:
    f.write(response.content)

# Unzip the contents into data/raw
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall("data/raw")

# Remove the zip file now that its contents are extracted
os.remove(zip_path)
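
The download steps above can also be wrapped in a more defensive helper. The following is a sketch under stated assumptions, not the notebook's actual method: it swaps `requests` for the standard library's `urllib.request` so the example has no third-party dependencies, and the function name `fetch_and_extract` is hypothetical. It validates that the downloaded file is a real zip archive and confirms that at least one CSV ends up in the target folder.

```python
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_extract(url, out_dir, timeout=30.0):
    """Download a zip archive, extract it into out_dir, and return the CSV paths found there."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    # Download into a temporary folder so no partial zip is left in out_dir on failure.
    with tempfile.TemporaryDirectory() as tmp_dir:
        zip_path = Path(tmp_dir) / "download.zip"
        # urlopen raises URLError or HTTPError on network or HTTP failures.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            zip_path.write_bytes(response.read())

        # Reject responses that are not zip archives (for example an HTML error page).
        if not zipfile.is_zipfile(zip_path):
            raise ValueError(f"File downloaded from {url} is not a valid zip archive")

        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(out_path)

    # Confirm that out_dir now contains at least one CSV file.
    csv_files = sorted(out_path.rglob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {out_path}")
    return csv_files
```

With the variables from the cell above, a call would look like `csv_files = fetch_and_extract(zip_url, "data/raw")`. The temporary folder is removed automatically, so no separate `os.remove` step is needed.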

# Data loading and basic exploration
The cells below install matplotlib, import the fundamental Python libraries for data analysis and visualization, and print their versions.

- pandas: For data manipulation and analysis
- numpy: For numerical computations
- matplotlib: For creating visualizations and plots

The version checks help ensure:
- Code compatibility across different environments
- Reproducibility of analysis
- Easy debugging of version-specific issues

In [5]:
%pip install matplotlib

Collecting matplotlib
  Using cached matplotlib-3.10.5-cp313-cp313-win_amd64.whl.metadata (11 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Using cached contourpy-1.3.3-cp313-cp313-win_amd64.whl.metadata (5.5 kB)
Collecting cycler>=0.10 (from matplotlib)
  Using cached cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.59.1-cp313-cp313-win_amd64.whl.metadata (111 kB)
Collecting kiwisolver>=1.3.1 (from matplotlib)
  Downloading kiwisolver-1.4.9-cp313-cp313-win_amd64.whl.metadata (6.4 kB)
Collecting pyparsing>=2.3.1 (from matplotlib)
  Using cached pyparsing-3.2.3-py3-none-any.whl.metadata (5.0 kB)
Using cached matplotlib-3.10.5-cp313-cp313-win_amd64.whl (8.1 MB)
Using cached contourpy-1.3.3-cp313-cp313-win_amd64.whl (226 kB)
Using cached cycler-0.12.1-py3-none-any.whl (8.3 kB)
Downloading fonttools-4.59.1-cp313-cp313-win_amd64.whl (2.3 MB)

[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [6]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")


pandas version: 2.3.1
NumPy version: 2.3.1
matplotlib version: 3.10.5
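
As a complement to the version printout above, installed versions can also be read from package metadata without importing the libraries, which makes it possible to report a missing package instead of failing on `import`. This is a minimal sketch; the helper name `report_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Return a {package: version or None} mapping without importing the packages."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


for name, ver in report_versions(["pandas", "numpy", "matplotlib"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

A `None` entry signals that the package still needs to be installed, for example with `%pip install` as in the earlier cell.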
