# Getting started with Colab

Google's Colaboratory (Colab) lets you write and run Python code directly in your browser.

The following video is an official introduction to Colab, but you can also skip it and get started below.
<center>
  <a href="https://www.youtube.com/watch?v=inN8seMm7UI" target="_blank">
  <img alt='Thumbnail for Get started with Google Colaboratory Video' src="https://ndha-public-data-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/iPRES-2025/resources/colab+icon.jpeg" width=500>
  </a>
</center>

In general, working on digitized (or born-digital) collection data includes the following six steps:

1. LOAD THE DATASET
2. EXPLORE THE DATASET
3. CLEAN AND PRE-PROCESS
4. ANALYSE OR MODEL THE DATA
5. VISUALISE RESULTS
6. DOCUMENT AND SHARE

We will briefly cover these steps in the following Colab cells. When you open a new Colab notebook and run any cell in it, Colab automatically assigns the notebook a runtime on Google's compute infrastructure.

> Note that this is a temporary Jupyter notebook environment that is not shared with any other notebooks you have open.



## 1. LOAD THE DATASET

<left>
  <img src="https://ndha-public-data-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/iPRES-2025/resources/Download+with+solid+fill.png" width=100>
</left>

Import data from local files or online sources (e.g. CSV, JSON, Excel, SQL, APIs).

- A code cell starts with a label like **[1] ✓**, which shows the execution order.
- You can write multiple lines of code in one cell.
- When you run a cell (Shift + Enter), its output appears directly below it.

In [None]:
import pandas as pd

# Load a sample CSV file from our AWS S3 bucket.
# This is a synthetic dataset in which annual income increases with age and education (plus noise).
data = pd.read_csv("https://ndha-public-data-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/iPRES-2025/resources/test_data.csv")
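
The other formats mentioned above are read in a similar way. The following is a minimal sketch, assuming two small in-memory samples: `csv_text` and `json_text` are hypothetical stand-ins for real files or URLs and are not part of the workshop dataset. pandas also provides `read_excel` and `read_sql`, which need an extra engine or a database connection and are not shown here.

```python
import io

import pandas as pd

# Hypothetical in-memory samples standing in for real files or URLs
csv_text = "id,age,city\n1,22,Auckland\n2,59,Wellington\n"
json_text = (
    '[{"id": 1, "age": 22, "city": "Auckland"},'
    ' {"id": 2, "age": 59, "city": "Wellington"}]'
)

# read_csv and read_json accept a path, a URL, or a file-like object
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df.shape, json_df.shape)
```

Both calls return an ordinary `DataFrame`, so the later steps work the same way regardless of the source format.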

## 2. EXPLORE THE DATASET

<left>
  <img src="https://ndha-public-data-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/iPRES-2025/resources/Research+with+solid+fill.png" width=100>
</left>

Preview and inspect data directly in the notebook cells.

In [None]:
data.head()

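|   | id | age | education_years | annual_income | city |
|---|----|-----|-----------------|---------------|------|
| 0 | 1 | 22.0 | 10.0 | 27351.258651 | Auckland |
| 1 | 2 | 59.0 | 10.0 | 54934.127297 | Wellington |
| 2 | 3 | 52.0 | 12.0 | 56388.642846 | Auckland |
| 3 | 4 | 41.0 | 19.0 | 45321.070504 | Wellington |
| 4 | 5 | 40.0 | 15.0 | 57242.826775 | Christchurch |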


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               300 non-null    int64  
 1   age              287 non-null    float64
 2   education_years  286 non-null    float64
 3   annual_income    274 non-null    float64
 4   city             300 non-null    object 
dtypes: float64(3), int64(1), object(1)
memory usage: 11.8+ KB
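
The `Non-Null Count` column above shows that some values are missing (for example, only 274 of 300 rows have an `annual_income`). A minimal sketch of counting the gaps directly, assuming a small hypothetical frame `toy` in place of `data`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps, standing in for `data`
toy = pd.DataFrame({
    "age": [22.0, np.nan, 52.0, 41.0],
    "annual_income": [27351.0, 54934.0, np.nan, np.nan],
    "city": ["Auckland", "Wellington", "Auckland", "Wellington"],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Share of rows that contain at least one missing value
rows_with_gaps = toy.isna().any(axis=1).mean()

print(missing_counts)
print(f"Rows with at least one gap: {rows_with_gaps:.0%}")
```

Running the same two expressions on `data` would show how many rows a later `dropna()` is likely to remove.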


In [None]:
data.describe()

|       | id | age | education_years | annual_income |
|-------|----|-----|-----------------|---------------|
| count | 300.0 | 287.0 | 286.0 | 274.0 |
| mean | 150.5 | 43.972125 | 14.982517 | 52142.855933 |
| std | 86.746758 | 14.44035 | 3.041763 | 13514.930571 |
| min | 1.0 | 18.0 | 10.0 | 16515.871748 |
| 25% | 75.75 | 32.0 | 13.0 | 43459.084312 |
| 50% | 150.5 | 44.0 | 15.0 | 52904.505524 |
| 75% | 225.25 | 56.0 | 17.0 | 60820.4822 |
| max | 300.0 | 70.0 | 20.0 | 89229.203124 |


## 3. CLEAN AND PRE-PROCESS

<left>
  <img src="https://ndha-public-data-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/iPRES-2025/resources/Clipboard+Partially+Ticked+with+solid+fill.png" width=100>
</left>

Perform transformations (filtering, renaming, filling missing values, encoding).

In [None]:
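# Drop every row that has a missing value; .copy() gives an independent DataFrame
data_clean = data.dropna().copy()
# With the gaps removed, age can be stored as whole numbers
data_clean['age'] = data_clean['age'].astype(int)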
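
The cell above only drops incomplete rows. The other transformations listed in this step (filtering, renaming, filling missing values, encoding) can be sketched as follows. This is an illustration on a hypothetical frame `toy`, not the workshop's cleaning procedure; the median fill and the 10-year threshold are arbitrary choices made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    "age": [22.0, np.nan, 52.0, 41.0],
    "education_years": [10.0, 10.0, 12.0, 19.0],
    "city": ["Auckland", "Wellington", "Auckland", "Christchurch"],
})

# Filling: replace missing ages with the median of the known ages
filled = toy.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())

# Filtering: keep rows with more than 10 years of education
filtered = filled[filled["education_years"] > 10]

# Renaming: give a column a shorter name
renamed = filtered.rename(columns={"education_years": "edu_years"})

# Encoding: turn the categorical city column into indicator columns
encoded = pd.get_dummies(renamed, columns=["city"])

print(encoded)
```

Filling keeps more rows than `dropna()` but introduces estimated values, so which approach is appropriate depends on the analysis.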

## 4. ANALYSE OR MODEL THE DATA

<left>
  <img src="https://ndha-public-data-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/iPRES-2025/resources/Bar+chart+with+solid+fill.png" width=100>
</left>

Run analysis or apply algorithms (e.g. statistics, machine learning).

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


# Select features and target
X = data_clean[["age", "education_years"]]
y = data_clean["annual_income"]

model = LinearRegression().fit(X, y)


# In-sample prediction and R²
y_pred = model.predict(X)
r2 = r2_score(y, y_pred)

print("=== Model Summary ===")
print(f"Rows (original / after cleaning): {len(data)} / {len(data_clean)}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Coef (age): {model.coef_[0]:.2f}")
print(f"Coef (education_years): {model.coef_[1]:.2f}")
print(f"R² (train): {r2:.3f}")
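
The R² above is computed on the same rows the model was fitted on. A common complement is to hold out part of the data and score the model on rows it has not seen. The following is a minimal sketch under stated assumptions: `toy` is hypothetical synthetic data, and its coefficients and noise level are made up for illustration, so the printed score says nothing about the workshop dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data; the coefficients and noise level are made up
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "age": rng.integers(18, 71, size=n),
    "education_years": rng.integers(10, 21, size=n),
})
toy["annual_income"] = (
    1000
    + 500 * toy["age"]
    + 2000 * toy["education_years"]
    + rng.normal(0, 5000, size=n)
)

X_toy = toy[["age", "education_years"]]
y_toy = toy["annual_income"]

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# Fit on the training rows only, then score on the held-out rows
toy_model = LinearRegression().fit(X_train, y_train)
r2_test = r2_score(y_test, toy_model.predict(X_test))

print(f"R² (test): {r2_test:.3f}")
```

The same pattern applies to `data_clean` by replacing `X_toy` and `y_toy` with the `X` and `y` defined above.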


## 5. VISUALISE RESULTS

<left>
  <img src="https://ndha-public-data-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/iPRES-2025/resources/Statistics+with+solid+fill.png" width=100>
</left>

Generate plots inline with libraries like Matplotlib, Seaborn, or Plotly.

In [None]:
import matplotlib.pyplot as plt

# Histogram: distribution of ages in the cleaned data
plt.hist(data_clean["age"], bins=15, color="#4C78A8", edgecolor="white")
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

In [None]:

# Scatter: age vs income
plt.scatter(
    data_clean["age"], data_clean["annual_income"],
    s=15, alpha=0.7, color="#F58518"
)
plt.title("Age vs Annual Income")
plt.xlabel("Age")
plt.ylabel("Annual Income")
plt.show()

## 6. DOCUMENT AND SHARE

<left>
  <img src="https://ndha-public-data-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/iPRES-2025/resources/Document+with+solid+fill.png" width=100>
</left>

Combine code, outputs, and markdown explanations in one document.
Export results as HTML, PDF, or slides.
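
Besides exporting the whole notebook, individual result tables can be shared on their own. The following is a minimal sketch, assuming a hypothetical `results` table with placeholder values (they are not the fitted coefficients from step 4). It uses the pandas `to_csv` and `to_html` methods, which return a string when no file path is given.

```python
import pandas as pd

# Hypothetical results table; the numbers are placeholders, not real output
results = pd.DataFrame({
    "term": ["age", "education_years"],
    "coefficient": [1.0, 2.0],
})

# With no path argument, both methods return the rendered text as a string
csv_text = results.to_csv(index=False)
html_text = results.to_html(index=False)

print(csv_text)
```

Passing a file name instead, for example `results.to_csv("results.csv", index=False)`, writes the table to the Colab runtime's file system, from where it can be downloaded.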