# 🧼 Dataset Cleaning – Mexican Labor Market Project

This notebook is part of the *Income Evolution in Mexico* project. It performs an initial data cleaning step on individual-level population datasets that span multiple census years and contain millions of observations.

Because the files are large, we draw a random sample of 8,000 individuals from each "Personas" file to speed up analysis and modeling. We then combine all samples into a single dataset, which will be used for machine learning tasks.

**Main steps:**
1. Import necessary libraries
2. Load a random sample of 8,000 rows from each "Personas" file
3. Combine all sampled data into one dataset
4. Save the resulting dataset for downstream analysis

### 📁 Importing Required Libraries

We import Python packages for file manipulation, data reading, and sampling.

- `os` and `glob`: for handling paths and locating files
- `pandas`: for data manipulation
- `collections.defaultdict`: for grouping data by year

In [6]:
import os
from glob import glob
import pandas as pd
from collections import defaultdict
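
As an illustration of the grouping role listed for `defaultdict`, the sketch below groups file names by census year. It assumes names of the form `Informacion Personas <year>_<part>.txt`, the pattern that appears in the sampling output of this notebook. The helper `group_by_year` and the example list are hypothetical, not part of the project code.

```python
import re
from collections import defaultdict

# Example names following the assumed "<year>_<part>.txt" pattern;
# "readme.txt" stands in for any file that does not match.
file_names = [
    "Informacion Personas 1990_0.txt",
    "Informacion Personas 2000_0.txt",
    "Informacion Personas 2000_1.txt",
    "readme.txt",
]


def group_by_year(names):
    """Group "Personas" file names by the four-digit year before "_<part>.txt"."""
    groups = defaultdict(list)
    for name in names:
        if "Personas" not in name:
            continue
        match = re.search(r"(\d{4})_\d+\.txt$", name)
        if match:  # skip names that do not follow the assumed pattern
            groups[int(match.group(1))].append(name)
    return dict(groups)


files_by_year = group_by_year(file_names)
```

With a mapping like this, files from the same census year could be sampled or inspected together.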

### 📦 Sampling and Combining "Personas" Data

We begin by setting the path to the raw data directory and locating all `.txt` files in it. Among these files, we focus only on those whose names include **"Personas"**, as they contain individual-level information.

For each relevant file:
- We load the full file with `pd.read_csv`, using `latin-1` encoding.
- We extract a **random sample of 8,000 rows** using `df.sample(n=8000, random_state=42)`.
  - The `random_state` ensures reproducibility of the sampling process.
- The sampled data is stored in a list for later concatenation.

Finally, we **combine all sampled datasets** into a single `DataFrame`, which will serve as our consolidated working dataset for analysis and machine learning.

We also print the shape of the resulting dataset to confirm the total number of rows and columns.
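
The sampling step described above can be sketched on a toy `DataFrame`. One caveat: `df.sample(n=8000)` raises a `ValueError` when a file has fewer than 8,000 rows, in which case the loop in this notebook would report the error and skip that file. The hypothetical helper `sample_rows` below caps `n` at the number of available rows. It is one possible variation, not the approach used in the notebook.

```python
import pandas as pd


def sample_rows(df, n=8000, seed=42):
    """Return up to n random rows; the fixed seed makes the draw repeatable."""
    # Cap n at the number of rows so that short files do not raise ValueError.
    return df.sample(n=min(n, len(df)), random_state=seed)


toy = pd.DataFrame({"id": range(100)})  # toy stand-in for a census file

first = sample_rows(toy, n=10)
second = sample_rows(toy, n=10)          # same seed, so the same rows are drawn
small = sample_rows(toy.head(5), n=10)   # only 5 rows available, so 5 are returned
```

Because every call uses the same `random_state`, rerunning the notebook on unchanged files should yield the same sample.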

In [7]:
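# Step 1: Set your folder path
# os.path.dirname() on a path ending in "/" just drops the trailing slash,
# so project_dir is ".../Introduction_to_ML/Project-data".
project_dir = os.path.dirname("/Data/anahi.reyes-miguel/Introduction_to_ML/Project-data/")
folder_path = os.path.join(project_dir, "rawdata")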

# Step 2: Find all .txt files in the folder
file_paths = glob(os.path.join(folder_path, '*.txt'))

# Step 3: Read a random 8000-row sample from each "Personas" file and combine them
sampled_personas_data = []

for path in file_paths:
    file_name = os.path.basename(path)
    
    if "Personas" not in file_name:
        continue

    print(f"📄 Sampling 8,000 rows from: {file_name}")
    
    try:
        # Read the full data from the file
        df = pd.read_csv(path, encoding='latin-1', low_memory=False)
        
        # Take a random sample of 8000 rows
        sample = df.sample(n=8000, random_state=42)
        sampled_personas_data.append(sample)
        
        print(f"✅ Sampled {sample.shape[0]:,} rows, {sample.shape[1]} columns\n")
    
    except Exception as e:
        print(f"❌ Could not process {file_name}: {e}\n")

# Step 4: Combine all the sampled data into one single DataFrame
# (pd.concat raises ValueError if no file was sampled successfully)
combined_personas_data = pd.concat(sampled_personas_data, ignore_index=True)
print(f"📊 Combined dataset shape: {combined_personas_data.shape[0]:,} rows, {combined_personas_data.shape[1]} columns")

📄 Sampling 8,000 rows from: Informacion Personas 1990_0.txt
✅ Sampled 8,000 rows, 66 columns

📄 Sampling 8,000 rows from: Informacion Personas 2000_0.txt
✅ Sampled 8,000 rows, 66 columns

📄 Sampling 8,000 rows from: Informacion Personas 2000_1.txt
✅ Sampled 8,000 rows, 66 columns

📄 Sampling 8,000 rows from: Informacion Personas 2010_0.txt
✅ Sampled 8,000 rows, 65 columns

📄 Sampling 8,000 rows from: Informacion Personas 2010_1.txt
✅ Sampled 8,000 rows, 65 columns

📄 Sampling 8,000 rows from: Informacion Personas 2015_0.txt
✅ Sampled 8,000 rows, 65 columns

📄 Sampling 8,000 rows from: Informacion Personas 2015_1.txt
✅ Sampled 8,000 rows, 65 columns

📄 Sampling 8,000 rows from: Informacion Personas 2015_2.txt
✅ Sampled 8,000 rows, 65 columns

📄 Sampling 8,000 rows from: Informacion Personas 2020_0.txt
✅ Sampled 8,000 rows, 65 columns

📄 Sampling 8,000 rows from: Informacion Personas 2020_1.txt
✅ Sampled 8,000 rows, 65 columns

📊 Combined dataset shape: 80,000 rows, 66 columns
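
The per-file output above reports 66 columns for the 1990 and 2000 files and 65 for the later years, while the combined dataset has 66. This is consistent with the default behavior of `pd.concat` (`join="outer"`), which keeps the union of all columns and fills the gaps with `NaN`. The output does not show which column is absent from the later files. The toy frames below use made-up column names purely to show the mechanism.

```python
import pandas as pd

# Toy frames; the column names are invented for illustration.
older = pd.DataFrame({"age": [30, 41], "income": [1200, 900], "extra": ["a", "b"]})
newer = pd.DataFrame({"age": [25], "income": [1500]})  # lacks the "extra" column

# Default join="outer": the result keeps the union of the input columns.
combined = pd.concat([older, newer], ignore_index=True)

# Share of missing values per column; "extra" is NaN for the rows from `newer`.
missing_share = combined.isna().mean()
```

Checking `isna().mean()` on the real combined dataset would be one way to find columns that exist in only some census years.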


### 💾 Saving the Combined Dataset to Disk

We save the resulting dataset to a `.csv` file in the `cleandata` directory. This file can be reused later without the need to reload or sample from the original massive datasets.

In [8]:
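# Step 5: Save the combined dataset to disk
# Note: project_dir already ends in "Project-data", so this resolves to
# .../Project-data/Project-data/cleandata (see the printed path below).
output_dir = os.path.join(project_dir, "Project-data", "cleandata")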
os.makedirs(output_dir, exist_ok=True)

output_path = os.path.join(output_dir, "combined_personas_sample.csv")
combined_personas_data.to_csv(output_path, index=False, encoding='utf-8')

print(f"💾 Saved combined sample to: {output_path}")

💾 Saved combined sample to: /Data/anahi.reyes-miguel/Introduction_to_ML/Project-data/Project-data/cleandata/combined_personas_sample.csv
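
A quick way to confirm that the saved file can be reused is to read it back and compare shapes. The sketch below shows such a round trip on a toy frame written to a temporary directory. The toy data and column names are illustrative assumptions, and only the file name matches the one used above.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the combined sample; column names are illustrative only.
toy = pd.DataFrame({"id": [1, 2, 3], "state": ["Oaxaca", "Yucatán", "Nuevo León"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "combined_personas_sample.csv")
    # Same options as the save step: no index column, UTF-8 encoding.
    toy.to_csv(out_path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(out_path, encoding="utf-8")

same_shape = reloaded.shape == toy.shape
```

Writing with `index=False` keeps the reloaded frame from gaining an extra unnamed index column, so the shapes should match.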
