# Phase 1: Data Engineering Pipeline (Google Colab - CPU)

This notebook executes the complete data engineering pipeline for the AI Auto Complete Code Extension project on Google Colab using CPU.

## Pipeline Steps:
1. **Setup & Upload Scripts**: Upload Phase 1 Python scripts to Colab
2. **Data Crawling & Filtering**: Download and filter code files from The Stack
3. **Secret Scrubbing**: Remove sensitive information (API keys, passwords, etc.)
4. **Advanced Transformation**: Remove comments and apply Import Dropout
5. **FIM Dataset Generation**: Create hybrid training dataset (70% Inline + 30% Block)
6. **Download Results**: Save to Google Drive or download locally



## Step 0: Setup Environment
Install required dependencies and mount Google Drive (optional, for saving results).

In [1]:
!pip install -q datasets tqdm huggingface_hub

In [2]:
from google.colab import drive
drive.mount('/content/drive')

!mkdir -p '/content/drive/MyDrive/AI-Auto-Complete/phase1_output'

Mounted at /content/drive


## Step 1: Upload Phase 1 Scripts
Upload the 4 Python scripts from your local `phase1_data engineering` folder:
- `01_crawl_filter.py`
- `02_scrubbing.py`
- `03_transform.py`
- `04_fim_gen.py` (Updated with Hybrid Mode)

**Method 1:** Use Colab's file upload UI (left sidebar)

**Method 2:** Run the cell below to open the upload dialog

In [3]:
from google.colab import files
import os

print("Please upload ALL 4 Python scripts:")
print("- 01_crawl_filter.py")
print("- 02_scrubbing.py")
print("- 03_transform.py")
print("- 04_fim_gen.py")
print("\nClick 'Choose Files' and select all 4 scripts at once.\n")

uploaded = files.upload()

# Verify all scripts are uploaded
required_scripts = ['01_crawl_filter.py', '02_scrubbing.py', '03_transform.py', '04_fim_gen.py']
missing = [s for s in required_scripts if not os.path.exists(s)]

if missing:
    print(f"\nMissing scripts: {missing}")
    print("Please upload them before continuing.")
else:
    print("\nAll scripts uploaded successfully!")

Please upload ALL 4 Python scripts:
- 01_crawl_filter.py
- 02_scrubbing.py
- 03_transform.py
- 04_fim_gen.py

Click 'Choose Files' and select all 4 scripts at once.



Saving 01_crawl_filter.py to 01_crawl_filter.py
Saving 02_scrubbing.py to 02_scrubbing.py
Saving 03_transform.py to 03_transform.py
Saving 04_fim_gen.py to 04_fim_gen.py

All scripts uploaded successfully!


## Step 2: Data Crawling & Filtering
Download code samples from `bigcode/the-stack-smol-xl` and apply filters.

**Configuration:**
- Max samples: 10000 per language (Python, Java, C++)
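
The filters applied by `01_crawl_filter.py` are not shown in this notebook. As a rough illustration of what per-file filtering can look like, the sketch below checks file size, maximum line length, and the fraction of alphanumeric characters. The function name `keep_sample` and every threshold are assumptions made for this example, not values taken from the script.

```python
def keep_sample(content: str,
                min_chars: int = 100,
                max_chars: int = 100_000,
                max_line_len: int = 1000,
                min_alnum_frac: float = 0.25) -> bool:
    """Return True if a source file passes some basic quality heuristics.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    # Drop files that are nearly empty or very large.
    if not (min_chars <= len(content) <= max_chars):
        return False
    # Very long lines often indicate minified or generated code.
    if any(len(line) > max_line_len for line in content.splitlines()):
        return False
    # Files that are mostly punctuation or binary-like data are dropped.
    alnum = sum(ch.isalnum() for ch in content)
    return alnum / max(len(content), 1) >= min_alnum_frac
```

A predicate like this would be called on the text of each downloaded record, stopping once the per-language sample limit is reached.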

In [2]:
!python 01_crawl_filter.py --output_dir raw_data --max_samples 10000

2025-12-03 21:42:44,916 - INFO - Starting optimized download from bigcode/the-stack-smol-xl. Target: 10000 samples per language.
2025-12-03 21:42:44,917 - INFO - Using 2 parallel I/O workers with batch size 100
2025-12-03 21:42:44,917 - INFO - Processing language: Python...
2025-12-03 21:42:45,667 - INFO - HTTP Request: HEAD https://huggingface.co/datasets/bigcode/the-stack-smol-xl/resolve/main/README.md "HTTP/1.1 307 Temporary Redirect"
2025-12-03 21:42:45,679 - INFO - HTTP Request: HEAD https://huggingface.co/api/resolve-cache/datasets/bigcode/the-stack-smol-xl/e782ebf35c7e4cafccb08ca680b0a76706533067/README.md "HTTP/1.1 200 OK"
2025-12-03 21:42:45,957 - INFO - HTTP Request: HEAD https://huggingface.co/datasets/bigcode/the-stack-smol-xl/resolve/e782ebf35c7e4cafccb08ca680b0a76706533067/the-stack-smol-xl.py "HTTP/1.1 404 Not Found"
2025-12-03 21:42:46,880 - INFO - HTTP Request: HEAD https://s3.amazonaws.com/datasets.huggingface.co/datasets/datasets/bigcode/the-stack-smol-xl/bigcode/the

## Step 3: Secret Scrubbing
Remove sensitive information using regex patterns.
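
To make the idea concrete, here is a minimal sketch of regex-based scrubbing. The two patterns, the placeholder strings, and the `scrub` function are assumptions for illustration; the actual patterns in `02_scrubbing.py` may differ and are likely more extensive.

```python
import re

# Hypothetical patterns; the real script may use a different set.
SECRET_PATTERNS = [
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    (re.compile(r"AKIA[0-9A-Z]{16}"), "<AWS_KEY>"),
    # Quoted values assigned to names such as password, secret, or api_key.
    (re.compile(r"(?i)\b(password|passwd|secret|api_key|token)\b(\s*[:=]\s*)(['\"])[^'\"\n]+\3"),
     r"\1\2\3<REDACTED>\3"),
]

def scrub(text: str) -> str:
    """Replace anything matching a secret pattern with a placeholder."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(scrub('password = "hunter2"'))
```

Keeping the variable name and quotes while replacing only the value leaves the code syntactically valid, which matters because the scrubbed files are later used as training data.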

In [3]:
!python 02_scrubbing.py --input_dir raw_data --output_dir scrubbed_data

2025-12-03 21:47:04,047 - INFO - Found 28420 files to process.
2025-12-03 21:47:04,047 - INFO - Using 12 parallel workers.


In [None]:
# Check output
!ls -lh scrubbed_data/*/

## Step 4: Code Transformation
Remove comments and apply Import Dropout (30% dropout rate).
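
Import Dropout can be pictured as randomly deleting import lines so that the model also sees code whose dependencies are not declared. The sketch below covers only the dropout part, not comment removal. The regex and the `import_dropout` function are assumptions for illustration; `03_transform.py` may detect imports differently.

```python
import random
import re
from typing import Optional

# Hypothetical pattern covering Python/Java `import`, Python `from ... import`,
# and C++ `#include` lines.
IMPORT_RE = re.compile(r"^\s*(import\s+.+|from\s+\S+\s+import\s+.+|#\s*include\s+.+)$")

def import_dropout(source: str, rate: float = 0.3,
                   rng: Optional[random.Random] = None) -> str:
    """Drop each import-like line independently with probability `rate`."""
    rng = rng or random.Random()
    kept = []
    for line in source.splitlines():
        if IMPORT_RE.match(line) and rng.random() < rate:
            continue  # this import line is dropped
        kept.append(line)
    return "\n".join(kept)
```

Passing a seeded `random.Random` makes a run reproducible, which helps when comparing datasets built with different dropout rates.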

In [4]:
# Step 4: Code transformation (comment removal + Import Dropout)
!python 03_transform.py --input_dir scrubbed_data --output_dir transformed_data --dropout_rate 0.3

2025-12-03 21:51:13,811 - INFO - Found 28420 files to process.
2025-12-03 21:51:13,811 - INFO - Using 12 parallel workers.


In [9]:
# Check output
!ls -lh transformed_data/*/

Streaming output truncated to the last 5000 lines.
drwxr-xr-x   3 root root 4.0K Dec  2 02:32  aioalice
drwxr-xr-x   3 root root 4.0K Dec  2 02:32  aioftx
drwxr-xr-x   2 root root 4.0K Dec  2 02:32  aiogoogle
drwxr-xr-x   2 root root 4.0K Dec  2 02:32  aioherepy
drwxr-xr-x   3 root root 4.0K Dec  2 02:32  aiohttp_socks
drwxr-xr-x   2 root root 4.0K Dec  2 02:32  aiohttp_xmlrpc
drwxr-xr-x   2 root root 4.0K Dec  2 02:32  aiopg
drwxr-xr-x   3 root root 4.0K Dec  2 02:32  aiotdlib
drwxr-xr-x   3 root root 4.0K Dec  2 02:32  airbyte-integrations
drwxr-xr-x   3 root root 4.0K Dec  2 02:32  aircrack-ng-master
drwxr-xr-x   9 root root 4.0K Dec  2 02:32  airflow
drwxr-xr-x   3 root root 4.0K Dec  2 02:32  ajaxPythonChat
drwxr-xr-x   3 root root 4.0K Dec  2 02:32  akshare
drwxr-xr-x   3 root root 4.0K Dec  2 02:32  Alchemistry_toolkits
drwxr-xr-x   2 root root 4.0K Dec  2 02:32  alchemy_mock
drwxr-xr-x   3 root root 4.0K Dec  2 02:32  alectiolite
-rw-r--r--   1 root root 5.1

## Step 5: FIM Dataset Generation (Hybrid Mode)
Create the final training dataset with:
- **70% Inline FIM**: Single-line completions (no newline in MIDDLE)
- **30% Block FIM**: Multi-line completions (1-3 lines with `<|im_end|>` stop token)
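
The two sample types can be sketched as follows. This is an illustrative approximation, not the code in `04_fim_gen.py`: the `<MID>` token, the prefix-suffix-middle ordering, the way split points are chosen, and the helper names are all assumptions. The `<PRE>` and `<SUF>` tokens, the `<|im_end|>` stop token, the metadata types, and the per-side limits of 64 lines / 2048 characters come from this notebook.

```python
import random

PRE, SUF, MID, STOP = "<PRE>", "<SUF>", "<MID>", "<|im_end|>"  # <MID> is assumed
MAX_LINES, MAX_CHARS = 64, 2048  # per-side context limits

def clip_prefix(text: str) -> str:
    """Keep the part of the prefix closest to the gap."""
    return "".join(text.splitlines(keepends=True)[-MAX_LINES:])[-MAX_CHARS:]

def clip_suffix(text: str) -> str:
    """Keep the part of the suffix closest to the gap."""
    return "".join(text.splitlines(keepends=True)[:MAX_LINES])[:MAX_CHARS]

def make_fim_sample(code: str, rng: random.Random, inline_ratio: float = 0.7):
    lines = code.splitlines(keepends=True)
    if not lines:
        return None
    if rng.random() < inline_ratio:
        # Inline: hide the tail of one line, so the middle has no newline.
        idx = rng.randrange(len(lines))
        body = lines[idx].rstrip("\r\n")
        cut = rng.randrange(len(body) + 1)
        prefix = "".join(lines[:idx]) + body[:cut]
        middle = body[cut:]
        suffix = lines[idx][len(body):] + "".join(lines[idx + 1:])
        kind = "FIM_INLINE"
    else:
        # Block: hide 1-3 whole lines and terminate with the stop token.
        n = rng.randint(1, 3)
        start = rng.randrange(len(lines))
        prefix = "".join(lines[:start])
        middle = "".join(lines[start:start + n]) + STOP
        suffix = "".join(lines[start + n:])
        kind = "FIM_BLOCK"
    text = f"{PRE}{clip_prefix(prefix)}{SUF}{clip_suffix(suffix)}{MID}{middle}"
    return {"text": text, "metadata": {"type": kind}}
```

Drawing the sample type per file with probability 0.7 gives a split that is only approximately 70/30 on a finite dataset, which is why the verification step reports the measured percentages.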

In [5]:
!python 04_fim_gen.py --input_dir transformed_data --output_file fim_dataset.jsonl

2025-12-03 22:00:16,034 - INFO - Found 28420 files to process.
2025-12-03 22:00:16,035 - INFO - Using 12 parallel workers.
2025-12-03 22:00:16,035 - INFO - Context Limits: 64 lines / 2048 chars per side.


In [None]:
# Check output size
!ls -lh fim_dataset.jsonl
!wc -l fim_dataset.jsonl

## Step 6: Verification & Sample Inspection
Verify the dataset quality and check distribution of Inline vs Block samples.

In [6]:
import json
import os

output_file = 'fim_dataset.jsonl'

if os.path.exists(output_file):
    inline_count = 0
    block_count = 0

    print("Dataset Statistics:")
    print("=" * 50)

    with open(output_file, 'r', encoding='utf-8') as f:
        for line in f:
            data = json.loads(line)
            if data['metadata']['type'] == 'FIM_INLINE':
                inline_count += 1
            elif data['metadata']['type'] == 'FIM_BLOCK':
                block_count += 1

    total = inline_count + block_count
    print(f"Total samples: {total}")
    if total:  # avoid ZeroDivisionError when no labeled samples were found
        print(f"Inline samples: {inline_count} ({inline_count/total*100:.1f}%)")
        print(f"Block samples: {block_count} ({block_count/total*100:.1f}%)")
    print("\n" + "=" * 50)

    print("\nSample Examples:")
    print("=" * 50)

    with open(output_file, 'r', encoding='utf-8') as f:
        for i in range(3):
            line = f.readline()
            if not line:
                break
            data = json.loads(line)
            print(f"\nSample {i+1} ({data['metadata']['type']})")
            print("-" * 50)
            text = data['text']
            if len(text) > 300:
                print(text[:300] + "...")
            else:
                print(text)
else:
    print("Output file not found.")

Dataset Statistics:
Total samples: 28353
Inline samples: 19791 (69.8%)
Block samples: 8562 (30.2%)


Sample Examples:

Sample 1 (FIM_INLINE)
--------------------------------------------------
<PRE>  
class Solution {
public:
    int evalRPN(vector<string>& tokens) {
        int n = tokens.size(), res = 0;
        int  <SUF>         stack<int> operand;
        
        for (int i=0; i<n; i++) {
            string tmp = tokens[i];
            if (tmp.size()==1 && (tmp[0]<'0' || tmp[0]>'9')...

Sample 2 (FIM_BLOCK)
--------------------------------------------------
<PRE>  
 
 


 

 

 
class Solution {
public:
	int minDistance(string word1, string word2) {
		 
		vector<vector<int>> distance(word1.size() + 1, vector<int>(word2.size() + 1));
		
		 
		for (int col = 0; col < distance[0].size(); ++col) {
			distance[0][col] = col;
		}
		for (int row = 0; row < di...

Sample 3 (FIM_BLOCK)
--------------------------------------------------
<PRE>  
class Solution {
public:
     TreeNode * util(

## Step 7: Save to Google Drive
Copy the final dataset to Google Drive for later use in Phase 2 (Training).

In [12]:
!cp fim_dataset.jsonl '/content/drive/MyDrive/AI-Auto-Complete/phase1_output/fim_dataset.jsonl'

print("Dataset saved to Google Drive!")
print("Location: /content/drive/MyDrive/AI-Auto-Complete/phase1_output/fim_dataset.jsonl")
print("\nYou can now use this dataset in Phase 2 (Training).")

Dataset saved to Google Drive!
Location: /content/drive/MyDrive/AI-Auto-Complete/phase1_output/fim_dataset.jsonl

You can now use this dataset in Phase 2 (Training).


## Optional: Download Dataset Locally
Run this cell if you prefer to download the dataset to your computer instead of saving it to Drive.

In [None]:
# Download dataset to your computer
from google.colab import files

files.download('fim_dataset.jsonl')
print("Download started! Check your browser's download folder.")

## Summary

✅ Phase 1 Pipeline completed successfully!

**What we created:**
- Hybrid FIM dataset optimized for Tabnine-style code completion
- 70% Inline samples (fast single-line completion)
- 30% Block samples (intelligent multi-line completion)
- Stop tokens integrated for better model control

**Next Steps:**
1. Proceed to **Phase 2: Training** using `02_training.ipynb`
2. Use the `fim_dataset.jsonl` file as training data
3. Fine-tune Qwen2.5-Coder-0.5B with QLoRA

**Important:** The training notebook will need to be updated to handle the new metadata types (`FIM_INLINE` and `FIM_BLOCK`).