# Analyzing Global Data Center Distribution

## Project 2

**Josefina Piddo**

In this project, we perform a statistical analysis of the number of connected data centers per reference area, using the [PeeringDB / World Bank Data Center Connectivity dataset](https://data360.worldbank.org/en/indicator/PEERING_DB_CONN_DATA_CENT).

We will work through three parts:

- Part 1: The Pandas Way

- Part 2: The Hard Way (Standard Library Only)

- Part 3: Text-Based Visualization (Standard Library Only)

The dataset reports totals for a single year, 2023. All statistical calculations and visualizations are based on the `OBS_VALUE` column, which records the number of connected data centers listed in PeeringDB for each reference area in that year. Besides individual countries, the file also contains aggregate rows such as regions and income groups (for example `TSA`, South Asia, and `UMC`, Upper middle income).

- Source: World Bank Data 360

- File: `PEERING_DB_CONN_DATA_CENT.csv`

# Part 0: Setup

In [32]:
import csv

import pandas as pd

# Define our file name and column name as constants.
FILE_NAME = "PEERING_DB_CONN_DATA_CENT.csv"
COLUMN_NAME = "OBS_VALUE"

pd.read_csv(FILE_NAME)


Unnamed: 0,STRUCTURE,STRUCTURE_ID,ACTION,FREQ,FREQ_LABEL,REF_AREA,REF_AREA_LABEL,INDICATOR,INDICATOR_LABEL,SEX,...,UNIT_MULT,UNIT_MULT_LABEL,UNIT_TYPE,UNIT_TYPE_LABEL,TIME_FORMAT,TIME_FORMAT_LABEL,OBS_STATUS,OBS_STATUS_LABEL,OBS_CONF,OBS_CONF_LABEL
0,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,ARE,United Arab Emirates,PEERING_DB_CONN_DATA_CENT,Number of connected data centers (PeeringDB),_T,...,0,Units,COUNT,Count (Integer),602,CCYY,A,Normal value,PU,Public
1,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,ATG,Antigua and Barbuda,PEERING_DB_CONN_DATA_CENT,Number of connected data centers (PeeringDB),_T,...,0,Units,COUNT,Count (Integer),602,CCYY,A,Normal value,PU,Public
2,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,ALB,Albania,PEERING_DB_CONN_DATA_CENT,Number of connected data centers (PeeringDB),_T,...,0,Units,COUNT,Count (Integer),602,CCYY,A,Normal value,PU,Public
3,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,ARM,Armenia,PEERING_DB_CONN_DATA_CENT,Number of connected data centers (PeeringDB),_T,...,0,Units,COUNT,Count (Integer),602,CCYY,A,Normal value,PU,Public
4,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,AGO,Angola,PEERING_DB_CONN_DATA_CENT,Number of connected data centers (PeeringDB),_T,...,0,Units,COUNT,Count (Integer),602,CCYY,A,Normal value,PU,Public
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
198,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,TMN,Middle East & North Africa (IDA & IBRD),PEERING_DB_CONN_DATA_CENT,Number of connected data centers (PeeringDB),_T,...,0,Units,COUNT,Count (Integer),602,CCYY,A,Normal value,PU,Public
199,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,TSA,South Asia (IDA & IBRD),PEERING_DB_CONN_DATA_CENT,Number of connected data centers (PeeringDB),_T,...,0,Units,COUNT,Count (Integer),602,CCYY,A,Normal value,PU,Public
200,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,TSS,Sub-Saharan Africa (IDA & IBRD),PEERING_DB_CONN_DATA_CENT,Number of connected data centers (PeeringDB),_T,...,0,Units,COUNT,Count (Integer),602,CCYY,A,Normal value,PU,Public
201,datastructure,WB.DATA360:DS_DATA360(1.2),I,A,Annual,UMC,Upper middle income,PEERING_DB_CONN_DATA_CENT,Number of connected data centers (PeeringDB),_T,...,0,Units,COUNT,Count (Integer),602,CCYY,A,Normal value,PU,Public


# Part 1: The Pandas Way

In [33]:
# Read the CSV
df = pd.read_csv(FILE_NAME)

# Drop the world total (WRD). Other aggregate rows (e.g. TSA, UMC) are kept.
df = df[df["REF_AREA"] != "WRD"]

# Select numeric column
series = df[COLUMN_NAME]

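# Compute stats (pandas skips missing values by default)
mean_pd = series.mean()
median_pd = series.median()
# mode() returns every tied mode in ascending order; keep the smallest
mode_pd = series.mode()[0]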

print(f"Mean:   {mean_pd:.4f}")
print(f"Median: {median_pd}")
print(f"Mode:   {mode_pd}")


Mean:   229.9655
Median: 12.0
Mode:   1


# Part 2: The Hard Way (Standard Library Only)

We will:

- read the CSV manually

- skip the World Total (`WRD`) and rows with no value

- convert values to integers

- calculate mean, median, and mode manually

In [36]:
numbers_list = []

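with open(FILE_NAME, mode="r", encoding="utf-8", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)  # first row holds the column names

    # Look up column positions by name instead of hard-coding them
    col_index = header.index(COLUMN_NAME)
    area_index = header.index("REF_AREA")

    for row in reader:
        # Skip the world total
        if row[area_index] == "WRD":
            continue
        # Skip rows with no recorded value
        if row[col_index] == "":
            continue

        numbers_list.append(int(row[col_index]))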

# --- Basic Stats ---
count = len(numbers_list)

# Mean
mean_hw = sum(numbers_list) / count

# Median
sorted_nums = sorted(numbers_list)
mid = count // 2
if count % 2 == 0:
    median_hw = (sorted_nums[mid - 1] + sorted_nums[mid]) / 2
else:
    median_hw = sorted_nums[mid]

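# Mode: count occurrences, then take the smallest of the most frequent
# values so that ties resolve the same way as pandas' mode()[0].
value_counts = {}
for num in numbers_list:
    value_counts[num] = value_counts.get(num, 0) + 1
top_count = max(value_counts.values())
mode_hw = min(num for num, n in value_counts.items() if n == top_count)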

print(f"Mean:   {mean_hw:.4f}")
print(f"Median: {median_hw}")
print(f"Mode:   {mode_hw}")


Mean:   229.9655
Median: 12
Mode:   1
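
The hand-written calculations can also be cross-checked against Python's built-in `statistics` module, which is still standard library only. The sketch below is one possible way to do that; the `summarize` helper and the `sample` values are made up for illustration and are not taken from the dataset.

```python
import statistics


def summarize(values):
    """Return (mean, median, mode) for a non-empty list of numbers."""
    mean_value = statistics.mean(values)
    median_value = statistics.median(values)
    # multimode() returns every value tied for most frequent;
    # taking the smallest mirrors pandas' series.mode()[0].
    modes = statistics.multimode(values)
    return mean_value, median_value, min(modes)


# Toy values for illustration only (hypothetical, not from the CSV).
sample = [1, 1, 3, 3, 12, 40, 2500]
mean_check, median_check, mode_check = summarize(sample)

print(f"Mean:   {mean_check:.4f}")
print(f"Median: {median_check}")
print(f"Mode:   {mode_check}")
```

Calling `summarize(numbers_list)` after the cell above should give the same three values as the manual code, which makes it a convenient sanity check.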


# Part 3: Text-Based Visualization (Standard Library Only)

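Because the values range from 0 to several thousand and are heavily skewed (the mean is about 230 while the median is 12), equal-width bins would put almost every entry into the first bin. We use bins that widen as the values grow: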

- 0–10

- 11–50

- 51–100

- 101–500

- 501–1000

- 1001+
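
One compact way to express these bins, sketched here as an illustration, is a sorted list of inclusive upper edges searched with `bisect`. The names `UPPER_EDGES`, `LABELS`, `bin_label`, and `count_bins` are hypothetical helpers introduced for this example.

```python
from bisect import bisect_left

# Inclusive upper edge of every bin except the last, open-ended one.
UPPER_EDGES = [10, 50, 100, 500, 1000]
LABELS = ["0–10", "11–50", "51–100", "101–500", "501–1000", "1001+"]


def bin_label(value):
    """Return the label of the bin that contains value."""
    # bisect_left finds the first edge that is >= value, so a value
    # equal to an edge (e.g. 10) stays in the lower bin.
    return LABELS[bisect_left(UPPER_EDGES, value)]


def count_bins(values):
    """Count how many values fall into each bin, keeping label order."""
    counts = {label: 0 for label in LABELS}
    for value in values:
        counts[bin_label(value)] += 1
    return counts


# Toy values for illustration only (hypothetical, not from the CSV).
print(count_bins([0, 10, 11, 5000]))
```

With this layout, changing the bins only requires editing the two lists, since the lookup logic does not depend on how many edges there are.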

In [37]:
# Define bins
bins = {"0–10": 0, "11–50": 0, "51–100": 0, "101–500": 0, "501–1000": 0, "1001+": 0}

# Fill bins
for val in numbers_list:
    if val <= 10:
        bins["0–10"] += 1
    elif val <= 50:
        bins["11–50"] += 1
    elif val <= 100:
        bins["51–100"] += 1
    elif val <= 500:
        bins["101–500"] += 1
    elif val <= 1000:
        bins["501–1000"] += 1
    else:
        bins["1001+"] += 1

# Scaling
max_count = max(bins.values())
max_width = 50
scale = max_width / max_count if max_count else 0

# Chart
print("\nDistribution of Data Center Values (excluding WRD)")
print("------------------------------------------------------")

# Use n rather than count so the count variable from Part 2 is not overwritten
for label, n in bins.items():
    bar = "█" * int(n * scale)
    print(f"Data Centers {label:<10} | {n:>5} | {bar}")

print(f"\n(scale: each '█' ≈ {max_count / max_width:.1f} countries)")



Distribution of Data Center Values (excluding WRD)
------------------------------------------------------
Data Centers 0–10       |    99 | ██████████████████████████████████████████████████
Data Centers 11–50      |    38 | ███████████████████
Data Centers 51–100     |    16 | ████████
Data Centers 101–500    |    30 | ███████████████
Data Centers 501–1000   |     6 | ███
Data Centers 1001+      |    14 | ███████

(scale: each '█' ≈ 2.0 countries)
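
In this chart every bin is large enough to get at least one block. Because `int(...)` truncates, though, a bin holding a single entry next to a maximum of 99 would be drawn with an empty bar. The sketch below shows one possible alternative that rounds up instead; `bar_length` is a hypothetical helper, not part of the original notebook, and it will draw some bars one block longer than the truncating version.

```python
import math


def bar_length(count, max_count, max_width=50):
    """Scale count to a bar width, rounding up so nonzero counts stay visible."""
    if max_count <= 0 or count <= 0:
        return 0
    return min(max_width, math.ceil(count * max_width / max_count))


# Toy counts for illustration only (hypothetical, not from the CSV).
for n in (0, 1, 99):
    print(f"{n:>3} | {'█' * bar_length(n, 99)}")
```

Rounding up trades a little accuracy in bar length for the guarantee that an empty bar always means an empty bin.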
