# setup.ipynb

This notebook initializes the project environment for the Small Data NER project.

It performs the following steps:
1. Mounts Google Drive.
2. Creates the project folder structure if missing:
   - raw/: original E3C data
   - conll/: train/dev/test and few-shot splits
   - utils/: helper scripts (conll_io.py, metrics.py)
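   - results/: output folder, created by the setup cell at the end of this notebook
   - notebooks for preprocessing, model training, and evaluation, kept at the project root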
3. Installs required Python packages.
4. Prints the project tree so you can verify that all files and paths are accessible.

After running setup.ipynb once, all team members can open other notebooks directly (e.g., preprocessing.ipynb, prompting.ipynb).
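
The folder check described above could also be automated rather than read off a printed tree. The following is a minimal sketch, not part of the original notebook: the helper name `find_missing_subfolders` is hypothetical, and the folder names are the ones this notebook creates.

```python
from pathlib import Path

# Folder names created by this notebook (raw, conll, utils, results).
EXPECTED_SUBFOLDERS = ["raw", "conll", "utils", "results"]


def find_missing_subfolders(base_path, subfolders=EXPECTED_SUBFOLDERS):
    """Return the names in `subfolders` that are not directories under `base_path`.

    Hypothetical helper for illustration; an empty list means every folder exists.
    """
    base = Path(base_path)
    return [name for name in subfolders if not (base / name).is_dir()]
```

Calling `find_missing_subfolders("/content/drive/MyDrive/small_data_NER_project")` after Drive is mounted should return an empty list once setup has run.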

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
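
If the notebook might also be opened outside Colab, the mount can be guarded so the cell does not fail on import. This is a sketch under that assumption; `running_in_colab` and `mount_drive_if_available` are hypothetical helper names, and only `drive.mount` comes from the cell above.

```python
import importlib.util


def running_in_colab() -> bool:
    """Return True if the google.colab package can be located in this environment."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ImportError:
        # Raised when the parent `google` package itself is not installed.
        return False


def mount_drive_if_available(mount_point="/content/drive") -> bool:
    """Mount Google Drive when running in Colab; do nothing elsewhere.

    Returns True if a mount was attempted, False otherwise.
    """
    if not running_in_colab():
        return False
    from google.colab import drive  # imported lazily; only exists in Colab
    drive.mount(mount_point)
    return True
```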


In [9]:
%cd /content
!rm -rf .git

/content


In [10]:
%cd /content/drive/MyDrive/small_data_NER_project
!printf "%s\n" \
"__pycache__/" \
"*.ipynb_checkpoints/" \
"*.gdoc" "*.gsheet" "*.tmp" \
"wandb/" \
"*.bin" "*.pt" "*.safetensors" \
".DS_Store" \
".config/" "sample_data/" > .gitignore

/content/drive/MyDrive/small_data_NER_project
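
The same `.gitignore` could be written from Python instead of a shell `printf`, which keeps the patterns in one editable list. A minimal sketch, assuming the same twelve patterns as the cell above; the function name `write_gitignore` is hypothetical.

```python
from pathlib import Path

# Same patterns as the printf cell above, one per line in the output file.
IGNORE_PATTERNS = [
    "__pycache__/",
    "*.ipynb_checkpoints/",
    "*.gdoc", "*.gsheet", "*.tmp",
    "wandb/",
    "*.bin", "*.pt", "*.safetensors",
    ".DS_Store",
    ".config/", "sample_data/",
]


def write_gitignore(repo_dir, patterns=IGNORE_PATTERNS):
    """Write `patterns` to <repo_dir>/.gitignore, one per line, and return the path."""
    path = Path(repo_dir) / ".gitignore"
    path.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return path
```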


In [11]:
!git init -b main
!git config user.name "Chenxinnnn"
!git config user.email "cg3423@nyu.edu"

Initialized empty Git repository in /content/drive/MyDrive/small_data_NER_project/.git/


In [12]:
!git add -A
!git commit -m "initial commit: data prep + few-shot sampler + baseline"

[main (root-commit) eaa5d27] initial commit: data prep + few-shot sampler + baseline
 56 files changed, 349726 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 biobert_baseline.ipynb
 create mode 100644 conll/dev.conll
 create mode 100644 conll/fewshot_k10_seed42_mention/dev.conll
 create mode 100644 conll/fewshot_k10_seed42_mention/test.conll
 create mode 100644 conll/fewshot_k10_seed42_mention/train.conll
 create mode 100644 conll/fewshot_k10_seed42_sent/dev.conll
 create mode 100644 conll/fewshot_k10_seed42_sent/test.conll
 create mode 100644 conll/fewshot_k10_seed42_sent/train.conll
 create mode 100644 conll/fewshot_k1_seed42_mention/dev.conll
 create mode 100644 conll/fewshot_k1_seed42_mention/test.conll
 create mode 100644 conll/fewshot_k1_seed42_mention/train.conll
 create mode 100644 conll/fewshot_k1_seed42_sent/dev.conll
 create mode 100644 conll/fewshot_k1_seed42_sent/test.conll
 create mode 100644 conll/fewshot_k1_seed42_sent/train.conll
 create mode 100644 c

In [14]:
!git remote remove origin 2>/dev/null || true
!git remote add origin https://github.com/Chenxinnnn/small_data_NER_project.git

In [15]:
from getpass import getpass
token = getpass("Paste your GitHub Personal Access Token (classic, with repo scope): ")

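# Temporarily embed the token in the remote URL for a one-time push
!git remote set-url origin https://{token}@github.com/Chenxinnnn/small_data_NER_project.git
!git push -u origin main
# Restore the token-free URL so the token is not left in .git/config
!git remote set-url origin https://github.com/Chenxinnnn/small_data_NER_project.git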

Paste your GitHub Personal Access Token (classic, with repo scope): ··········
remote: Repository not found.
fatal: repository 'https://github.com/Chenxinnnn/small_data_NER_project.git/' not found
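
Because the push cell embeds the token in the remote URL, it can be worth checking that the URL stored in `.git/config` no longer carries credentials. The following sketch uses only the standard library; the helper names are hypothetical and the token in the usage note is a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit


def has_embedded_credentials(url: str) -> bool:
    """Return True if the URL contains a username or password component."""
    parts = urlsplit(url)
    return parts.username is not None or parts.password is not None


def strip_credentials(url: str) -> str:
    """Return the URL with any user:password@ prefix removed from the host part."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

For example, `strip_credentials("https://PLACEHOLDER@github.com/Chenxinnnn/small_data_NER_project.git")` yields the plain `https://github.com/...` URL. The string to check can be obtained with `git remote get-url origin`.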


In [None]:
import os

base_path = "/content/drive/MyDrive/small_data_NER_project"

# Define project subfolders
subfolders = [
    "raw",
    "conll",
    "utils",
    "results"
]

# Create directories
for sub in subfolders:
    os.makedirs(os.path.join(base_path, sub), exist_ok=True)

# Install basic dependencies
!pip install transformers datasets seqeval peft accelerate -q

# Verify structure
print("Project structure:")
for root, dirs, files in os.walk(base_path):
    level = root.replace(base_path, '').count(os.sep)
    indent = ' ' * 4 * level
    print(f"{indent}{os.path.basename(root)}/")
    subindent = ' ' * 4 * (level + 1)
    for f in files:
        print(f"{subindent}{f}")

print("\nSetup complete.")

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.6/43.6 kB 2.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  Building wheel for seqeval (setup.py) ... done
Project structure:
small_data_NER_project/
    raw/
    conll/
    utils/
    results/

Setup complete.
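
Since `pip install` runs with `-q`, a failed install is easy to miss. One way to confirm the packages are importable is sketched below; `find_missing_packages` is a hypothetical helper, and the package list mirrors the install command above.

```python
import importlib.util

# Top-level import names matching the pip install line above.
REQUIRED_PACKAGES = ["transformers", "datasets", "seqeval", "peft", "accelerate"]


def find_missing_packages(packages=REQUIRED_PACKAGES):
    """Return the names in `packages` that cannot be located for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]
```

An empty list from `find_missing_packages()` suggests all five packages were installed; any names returned would need to be installed again.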
