# Introduction to Google Colab and Python for LLM-Based Research

This tutorial prepares you to work productively with the Gemini API and Hugging Face models.

The goal is **operational fluency**, not software engineering expertise.

You will learn:

* How Google Colab works as an execution environment
* How Python is used as a coordination language for data and models
* Which libraries matter for LLM-based research, and why
* How to work effectively using AI-assisted (“vibe”) coding

---

## 1. Google Colab: What It Is and How It Works

### 1.1 What Google Colab Is

Google Colab is a **cloud-hosted Jupyter notebook environment**.
It allows you to write and execute Python code in your browser without installing anything locally.

Key characteristics:

* Runs on Google’s infrastructure
* Comes with Python preinstalled
* Can access CPUs and GPUs on demand
* Designed for interactive experimentation, not production systems

**Mental model**

Think of Colab as a **digital research notebook** where:

* Text cells document reasoning
* Code cells execute experiments
* Outputs (tables, plots, model responses) appear inline

This makes Colab well suited for:

* Data exploration
* Prototyping model usage
* Reproducible research workflows

---

### 1.2 Notebook Structure

A Colab notebook consists of **cells**, executed top to bottom.

There are two main cell types:

* **Text (Markdown) cells**: explanations, notes, instructions
* **Code cells**: executable Python code

Important properties:

* The notebook is **stateful**
* Variables persist across cells
* Execution order matters

If you run a cell that depends on a previous cell, that previous cell must already have been executed.

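If the runtime is restarted, **all variables and imports are lost**. Libraries installed with `pip` stay on disk through a restart, but they are lost as well once the runtime is disconnected or deleted.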
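
A minimal sketch of why execution order matters, using plain Python to imitate two notebook cells. The names `labels` and `run_cell_2` are illustrative only and are not part of Colab.

```python
# "Cell 1": create an empty list that later cells will fill
labels = []


def run_cell_2():
    """Imitates a cell that appends one result to the shared list."""
    labels.append("positive")


run_cell_2()      # first run of "cell 2"
print(labels)     # ['positive']

run_cell_2()      # same cell run again, without re-running "cell 1"
print(labels)     # ['positive', 'positive']
```

Re-running "cell 1" would reset `labels` to an empty list, which is why running all cells from the top puts the notebook back into a known state.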

---

### 1.3 Runtime and Hardware

Colab notebooks run inside a **runtime session**.

Key points:

* The runtime is temporary
* It can disconnect after inactivity
* You may manually restart it at any time

Colab allows you to select hardware:

* CPU (default)
* GPU (useful for large models)
* TPU (not used in this course)

Important clarification:

* GPUs accelerate **model computation**
* GPUs do **not** speed up normal Python or pandas operations

For most API-based LLM work, CPU is sufficient.
GPU becomes relevant when loading local Hugging Face models.
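
To confirm which hardware the current runtime offers, something like the following sketch can help. It assumes PyTorch may or may not be installed, and it treats the presence of the `nvidia-smi` tool as a rough hint that a GPU driver is available; `describe_hardware` is a hypothetical helper name.

```python
import shutil


def describe_hardware() -> str:
    """Return a short description of the accelerator visible to this runtime."""
    try:
        import torch  # optional: only used if it is installed
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        return "GPU: " + torch.cuda.get_device_name(0)
    if shutil.which("nvidia-smi"):
        # Heuristic only: the driver tool exists, but no PyTorch GPU was confirmed
        return "GPU driver tools found (nvidia-smi is on PATH)"
    return "CPU only"


summary = describe_hardware()
print(summary)
```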

---

### 1.4 Files and Storage

Colab provides a temporary file system at `/content`.

Properties:

* Files exist only for the lifetime of the runtime
* Disconnecting or deleting the runtime (including idle timeouts) deletes all files; a simple restart does not
* Uploads must be repeated in a new session unless saved externally

Common use cases:

* Uploading CSV files
* Uploading PDFs for analysis
* Saving intermediate outputs (embeddings, results)

You can also mount Google Drive if persistence is required, but this is optional for most tutorials.

### Storage structure

`/content` is the root working directory of this Colab session. Anything we upload, download, or generate will appear somewhere under this path.

**DATA** points to `/content/data`. This folder is for inputs — files that come from outside the notebook.

Examples include:

- CSV files with survey or interview text

- PDF documents for analysis

- Files copied from Google Drive

- Files downloaded from GitHub or external URLs

> If a file existed before the analysis started, it belongs in `data`.

**OUT** points to `/content/outputs`. This folder is for results produced by the notebook.

Examples include:

- Cleaned or transformed CSV files

- Model predictions or labels

- Extracted text from PDFs

- Embeddings saved for reuse

- JSON files with intermediate results

> If a file is created by running code in this notebook, it belongs in outputs.

Listing `/content` shows everything currently inside it, so we can visually confirm where our data and outputs are located.

**This separation matters because it keeps the workflow reproducible: inputs go in data, computations happen in code, and results go in outputs.**

In [None]:
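from pathlib import Path
import os, textwrap, json, time

# Root of the Colab working directory
BASE = Path("/content")
DATA = BASE / "data"      # inputs that come from outside the notebook
OUT  = BASE / "outputs"   # results produced by the notebook

# Create both folders; nothing happens if they already exist
DATA.mkdir(exist_ok=True, parents=True)
OUT.mkdir(exist_ok=True, parents=True)

print("DATA:", DATA)
print("OUT :", OUT)
print("Listing /content:", os.listdir("/content"))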
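
As an illustration of the "results go in outputs" rule, here is a minimal sketch that writes intermediate results to the outputs folder as JSON and reads them back. The records and the file name `labels_example.json` are made up, and the fallback to a temporary folder is an assumption added so the snippet also runs outside Colab.

```python
import json
import os
import tempfile
from pathlib import Path

# Assumption: use /content in Colab, otherwise fall back to a temporary folder
content = Path("/content")
usable = content.is_dir() and os.access(content, os.W_OK)
BASE = content if usable else Path(tempfile.mkdtemp())
OUT = BASE / "outputs"
OUT.mkdir(parents=True, exist_ok=True)

# Hypothetical intermediate results
results = [
    {"doc_id": 1, "label": "positive"},
    {"doc_id": 2, "label": "negative"},
]

# Save to outputs, then reload to confirm the file is readable
out_path = OUT / "labels_example.json"
out_path.write_text(json.dumps(results, indent=2), encoding="utf-8")

reloaded = json.loads(out_path.read_text(encoding="utf-8"))
print(out_path, "->", len(reloaded), "records")
```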

### Upload a file from your local machine

In [2]:
from google.colab import files

uploaded = files.upload()

Saving embd.png to embd.png


At this point, the file is in `/content`.

### Inspect /content

In [4]:
import os
os.listdir("/content")

['.config', 'embd.png', 'data', 'outputs', 'sample_data']
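
`files.upload()` returns a dictionary that maps each file name to its contents as bytes. The sketch below shows one way to place such content in the data folder; the `uploaded` dictionary here is a stand-in with made-up content, and a temporary folder replaces `/content/data` so the snippet runs anywhere.

```python
import tempfile
from pathlib import Path

# Stand-in for the return value of files.upload(): {file name: contents as bytes}
uploaded = {"notes_example.txt": b"first line\nsecond line\n"}

# Assumption: a temporary folder stands in for /content/data outside Colab
DATA = Path(tempfile.mkdtemp()) / "data"
DATA.mkdir(parents=True, exist_ok=True)

saved_paths = []
for name, content in uploaded.items():
    target = DATA / name
    target.write_bytes(content)      # write the uploaded bytes into the data folder
    saved_paths.append(target)
    print(f"Saved {name} ({len(content)} bytes) to {target}")
```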

### Google Drive (mount + read/write)

In [None]:
from google.colab import drive

# Mount Google Drive to persist files across sessions
drive.mount("/content/drive")

# Your Drive root will be available here:
DRIVE_ROOT = Path("/content/drive/MyDrive")
print("Drive root exists:", DRIVE_ROOT.exists())
print("Drive sample listing:", list(DRIVE_ROOT.iterdir())[:10])

Load a file located in the `acspri` folder

In [5]:
# Mount Google Drive
from google.colab import drive
drive.mount("/content/drive")

# Define the file path
from pathlib import Path

DRIVE = Path("/content/drive/MyDrive")
csv_path = DRIVE / "acspri" / "data.csv"

print(csv_path)
print("File exists:", csv_path.exists())

# If exists() returns True, the path is correct.

# Load the CSV
import pandas as pd

df = pd.read_csv(csv_path)
df.head()

Mounted at /content/drive
/content/drive/MyDrive/acspri/data.csv
File exists: True


| Unnamed: 0 | Document Count | Publication Year | Document Type |
|---|---|---|---|
| 0 | 1 | 2022 | Journal Article |
| 1 | 1 | 2023 | Journal Article |
| 2 | 9 | 2024 | Journal Article |
| 3 | 12 | 2025 | Journal Article |


In [None]:
# Or, as a one-line version once Google Drive is mounted
df = pd.read_csv("/content/drive/MyDrive/acspri/data.csv")
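
Because files under `/content` are temporary, it can be useful to copy the outputs folder to Drive at the end of a session. The following is a sketch only: `backup_folder` is a hypothetical helper, and temporary folders stand in for `/content/outputs` and a Drive folder so it runs without a mounted Drive.

```python
import shutil
import tempfile
from pathlib import Path


def backup_folder(src: Path, dst: Path) -> list:
    """Copy everything in src into dst; return the sorted names now present in dst."""
    shutil.copytree(src, dst, dirs_exist_ok=True)  # dirs_exist_ok needs Python 3.8+
    return sorted(p.name for p in dst.iterdir())


# Temporary folders stand in for /content/outputs and a folder on Drive
work = Path(tempfile.mkdtemp())
src = work / "outputs"
src.mkdir()
(src / "results.csv").write_text("id,label\n1,positive\n", encoding="utf-8")

copied = backup_folder(src, work / "drive_backup")
print(copied)  # ['results.csv']
```

In Colab, with Drive mounted, the same call could take `Path("/content/outputs")` as the source and a folder of your choice under `/content/drive/MyDrive` as the destination.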


### Working with an external URL

In [6]:
!wget -O /content/example.csv https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv

--2026-02-08 19:16:10--  https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3858 (3.8K) [text/plain]
Saving to: ‘/content/example.csv’


2026-02-08 19:16:10 (34.0 MB/s) - ‘/content/example.csv’ saved [3858/3858]



In [None]:
# Inspect /content
import os
os.listdir("/content")
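
The same download can be done in Python instead of `wget`, which makes it easier to place the file directly in the data folder. This is a minimal sketch using the standard library; `download` is a hypothetical helper name, and the example call is left commented out because it needs internet access.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download(url: str, dest: Path) -> Path:
    """Download url to dest, creating parent folders if needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest))
    return dest


# Example call, using the URL from the wget cell:
# download(
#     "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv",
#     Path("/content/data/iris.csv"),
# )
```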

---

### 1.5 Installing Libraries

Colab does not permanently store installed libraries.

Libraries are installed **per session** using `pip`:

In [None]:
!pip install pandas transformers datasets google-generativeai


**Important rules:**

* Library installation must be rerun whenever you connect to a new runtime (a simple restart keeps installed packages)
* Version mismatches are common causes of errors
* Restarting the runtime often fixes unexplained failures

**Operational rule**
If something breaks unexpectedly:

1. Restart runtime
2. Run all cells from the top in order
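
Since version mismatches are a common cause of errors, it helps to check which versions are actually installed before asking for help. A small sketch using the standard library follows; `installed_version` is a hypothetical helper name, and the package list simply mirrors the `pip install` line above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ["pandas", "transformers", "datasets", "google-generativeai"]:
    print(name, "->", installed_version(name) or "not installed")
```

Including these version numbers when you paste an error into an LLM usually makes its suggestions more accurate.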

---

## 2. Python: What You Need (and What You Don’t)

This course does **not** assume Python programming expertise.

Python is used here as a **control language**:

* To load data
* To call libraries
* To pass text into and out of models

You are not expected to design software systems.

---

### 2.1 Core Python Concepts You Must Recognise

#### Variables

Variables store values:

In [None]:
text = "This is a document"
year = 2024

Variables are dynamically typed:

* You do not declare types
* Python infers them automatically
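
A short illustration of dynamic typing, reusing the variables above:

```python
text = "This is a document"
year = 2024

print(type(text).__name__)  # str
print(type(year).__name__)  # int

# The same name can later hold a value of a different type
year = "2024"
print(type(year).__name__)  # str
```

This flexibility is convenient, but it also means a number read from a file may silently arrive as text, so checking `type(...)` is a useful habit.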

---

#### Lists

Lists store ordered collections:

In [None]:
texts = ["doc one", "doc two", "doc three"]

Lists are used constantly for:

* Multiple documents
* Batches of model inputs
* Collections of results
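
The list operations you will meet most often, shown on the same example list:

```python
texts = ["doc one", "doc two", "doc three"]

print(len(texts))   # 3
print(texts[0])     # doc one   (positions start at 0)
print(texts[-1])    # doc three (negative positions count from the end)

texts.append("doc four")   # add an item to the end
batch = texts[:2]          # the first two items
print(batch)               # ['doc one', 'doc two']
```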

---

#### Dictionaries

Dictionaries store key–value pairs:

In [None]:
metadata = {
    "source": "survey",
    "language": "English",
    "year": 2023
}

Dictionaries are critical because:

* API inputs are often dictionaries
* Model outputs are often dictionaries
* Configuration settings use dictionaries
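
A brief sketch of reading from and adding to a dictionary. The nested `response` dictionary is a made-up example of the kind of structure a model or API might return, not real output.

```python
metadata = {"source": "survey", "language": "English", "year": 2023}

print(metadata["source"])                  # survey
print(metadata.get("country", "unknown"))  # unknown (key is missing, so the default is used)

metadata["coder"] = "A"                    # add a new key-value pair

# Hypothetical nested structure
response = {"label": "positive", "scores": {"positive": 0.9, "negative": 0.1}}
print(response["scores"]["positive"])      # 0.9
```

Using `.get()` with a default avoids a `KeyError` when a key might be absent.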

---

#### Loops (conceptual understanding only)

In [None]:
for t in texts:
    print(t)

Interpretation:

* Apply the same operation to every element
* Common for processing many documents

You do not need to write complex logic inside loops.
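
In practice, a loop usually collects one result per document into a new list. A minimal sketch with made-up documents:

```python
texts = ["Doc One", "Doc Two", "Doc Three"]

lengths = []
for t in texts:
    lengths.append(len(t))   # one result per document

print(lengths)  # [7, 7, 9]

# enumerate() also gives the position of each item
for i, t in enumerate(texts, start=1):
    print(i, t)
```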

---

#### Functions (black-box usage)

In [None]:
def clean_text(x):
    return x.lower()

You will mostly:

* Call functions written by others
* Pass inputs
* Receive outputs

Writing functions is optional, not required.
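
Calling a function looks the same whether you wrote it or a library did. The sketch below repeats `clean_text` from above so it runs on its own, then applies it to one value and to a list of made-up documents.

```python
def clean_text(x):
    return x.lower()


cleaned = clean_text("This Is A Document")
print(cleaned)  # this is a document

# Apply the same function to every document in a list
texts = ["Doc ONE", "doc Two"]
cleaned_texts = []
for t in texts:
    cleaned_texts.append(clean_text(t))

print(cleaned_texts)  # ['doc one', 'doc two']
```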

---

### 2.2 Imports

Libraries must be imported before use:

In [None]:
import pandas as pd
from transformers import pipeline

Key idea:

* `import` makes external tools available
* Aliases (like `pd`) are conventions, not new concepts

---

### 2.3 Reading Errors Without Panic

You are not expected to debug deeply.

Learn to recognise common errors:

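* `NameError`: variable not defined, usually because the cell that defines it has not been run
* `ModuleNotFoundError`: library not installed in the current runtime
* `CUDA out of memory` (a PyTorch error message): model too large for the GPU
* `KeyError`: a dictionary key that does not exist, often a typo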

In practice:

* Copy the error message
* Ask an LLM to explain and fix it
* Re-run from the top
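
To make these errors less intimidating, the sketch below triggers three of them on purpose and prints the message Python produces for each. The variable, module, and key names are deliberately invalid examples.

```python
import importlib

messages = {}

try:
    undefined_variable  # this name was never assigned
except NameError as err:
    messages["NameError"] = str(err)

try:
    importlib.import_module("not_a_real_library_xyz")
except ModuleNotFoundError as err:
    messages["ModuleNotFoundError"] = str(err)

try:
    {"year": 2023}["yaer"]  # typo in the key
except KeyError as err:
    messages["KeyError"] = str(err)

for name, message in messages.items():
    print(f"{name}: {message}")
```

The last line of a traceback has this same "error type: message" shape, and it is usually the most useful part to read first.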

---

## 3. Core Libraries Used in This Course

Only a small number of libraries are required.

---

### 3.1 pandas — Structured Data

Used for:

* Survey data
* Interview metadata
* Tabular annotations

Example:

In [None]:
import pandas as pd
df = pd.read_csv("responses.csv")
df.head()

Conceptual model:

* Rows = observations (documents, respondents)
* Columns = variables (text, labels, attributes)
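
The rows-and-columns model can be seen on a tiny table built in memory. The column names and values below are invented for illustration and do not come from the course data.

```python
import pandas as pd

# Each row is one respondent; each column is one variable
df = pd.DataFrame(
    {
        "respondent_id": [1, 2, 3],
        "text": ["Too expensive", "Works well", "No opinion"],
        "label": ["negative", "positive", "neutral"],
    }
)

n_rows, n_cols = df.shape
print(n_rows, "rows,", n_cols, "columns")

texts = df["text"].tolist()                 # one column as a plain Python list
positive = df[df["label"] == "positive"]    # only the rows with a given label
print(texts)
print(len(positive), "positive row(s)")
```

Converting a text column to a list with `.tolist()` is a common step before sending documents to a model.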

---

### 3.2 Hugging Face `datasets`

Used for:

* Loading text datasets
* Accessing benchmark corpora
* Managing train/test splits

Example:

In [None]:
from datasets import load_dataset
dataset = load_dataset("imdb")

Key point:

* Hugging Face datasets are **not pandas DataFrames**
* They behave more like dictionaries of splits

---

### 3.3 Hugging Face `transformers`

Used to access pretrained models.

In this course, models are used via **pipelines**, not low-level APIs.

Example:


In [None]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier("This policy will increase inequality.")

You are not expected to understand:

* Attention mechanisms
* Tokenization internals
* Model training

Treat models as **research instruments**.
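
A text-classification pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. The sketch below reads such a result without loading a model; the values in `output` are hypothetical, not real model output.

```python
# Hypothetical result in the list-of-dictionaries shape returned by the pipeline
output = [{"label": "NEGATIVE", "score": 0.98}]

first = output[0]          # result for the first (and only) input text
label = first["label"]
score = first["score"]

print(f"{label} ({score:.2f})")  # NEGATIVE (0.98)
```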

---

### 3.4 Gemini API (Preview Context)

The Gemini API allows you to:

* Send text prompts
* Receive structured or unstructured responses
* Use multimodal inputs (later)

At this stage, only the **interaction pattern** matters:

* Configure API key
* Send prompt
* Receive output

Deeper usage is covered in later modules.
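
The three-step pattern can be sketched as follows, assuming the `google-generativeai` package installed earlier. The environment variable name `GEMINI_API_KEY`, the helper names, and the model name `gemini-1.5-flash` are assumptions for illustration; check the current Gemini documentation for available models. The final call is commented out because it needs a valid key and internet access.

```python
import os


def get_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Read the API key from an environment variable (variable name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def ask_gemini(prompt: str, model_name: str = "gemini-1.5-flash") -> str:
    """Configure the key, send one prompt, and return the text of the response."""
    import google.generativeai as genai  # imported here so the helper above works without it

    genai.configure(api_key=get_api_key())       # 1. configure API key
    model = genai.GenerativeModel(model_name)    # model name is a placeholder
    response = model.generate_content(prompt)    # 2. send prompt
    return response.text                         # 3. receive output


# print(ask_gemini("Summarise this sentence in five words: ..."))
```

Keeping the key in an environment variable or in Colab's secrets panel, rather than typing it into a cell, avoids sharing it by accident when the notebook is shared.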

---

## 4. Vibe Coding: How You Will Actually Work

This course assumes **AI-assisted coding as the default**.

You are not expected to memorise syntax.

---

### 4.1 The Vibe Coding Loop

1. Describe the task in natural language
2. Ask an LLM to generate Python code
3. Paste code into Colab
4. Run the cell
5. Inspect output or error
6. Paste error back to the LLM
7. Iterate

This is a **human-in-the-loop workflow**.
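
Steps 5 and 6 work best when the full error text is passed on, not a summary of it. Colab already prints the traceback under the cell; the sketch below only shows how the same text can be captured as a string if you want to save or reuse it. `run_and_capture` and `parse_year` are hypothetical helpers.

```python
import traceback


def run_and_capture(func, *args):
    """Run func(*args); return (result, None) on success or (None, traceback text) on failure."""
    try:
        return func(*args), None
    except Exception:
        return None, traceback.format_exc()


def parse_year(value):
    return int(value)


result, error_text = run_and_capture(parse_year, "20x4")
if error_text:
    print(error_text)  # this is the text to paste back to the LLM
```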

---

### 4.2 Rules for Safe Vibe Coding

* Never run large blocks blindly
* Execute cell by cell
* Always inspect outputs
* Assume the model can hallucinate
* Use errors as feedback, not failure

---

## 5. What Is Intentionally Not Covered

The following are out of scope:

* Object-oriented programming
* Software packaging
* Environment management
* Git and version control
* Performance optimisation
* Model training internals

None of these is needed for the LLM-based research workflows taught in this course.


