<a href="https://colab.research.google.com/github/Terabyte007/Google_Colab/blob/main/Data_processing_of_Business_Funding_Data_in_Nigeria_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# ==============================
# Step 1 — Setup & Helpers
# ==============================
import pandas as pd
import numpy as np
import re
import ast

def banner(txt):
    print("\n" + "="*len(txt))
    print(txt)
    print("="*len(txt))

def nrows(df):
    return f"{len(df):,} rows"

In [2]:
# ==============================
# Step 2 — Load dataset (with encoding fallback)
# ==============================
banner("Step 2 — Load dataset")

path = "/content/Business Funding Data.csv"  # path for the file

try:
    df = pd.read_csv(path)
except UnicodeDecodeError:
    df = pd.read_csv(path, encoding="latin1")

print("Loaded:", nrows(df))
print("Columns:", list(df.columns))


Step 2 — Load dataset
Loaded: 26 rows
Columns: ['Website Domain', 'Effective date', 'Found At', 'Financing Type', 'Financing Type Normalized', 'Categories', 'Investors', 'Investors Count', 'Amount', 'Amount Normalized', 'Source Urls']


In [3]:
# ==============================
# Step 3 — Standardize column names
# ==============================
banner("Step 3 — Standardize column names")

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print("Renamed Columns:", list(df.columns))


Step 3 — Standardize column names
Renamed Columns: ['website_domain', 'effective_date', 'found_at', 'financing_type', 'financing_type_normalized', 'categories', 'investors', 'investors_count', 'amount', 'amount_normalized', 'source_urls']


In [4]:
# ==============================
# Step 4 — Normalize blank tokens
# ==============================
banner("Step 4 — Normalize blank tokens")

df.replace(["N/A", "NA", "na", "", " ", "—", "-"], np.nan, inplace=True)
print("Blank tokens replaced with NaN")


Step 4 — Normalize blank tokens
Blank tokens replaced with NaN


In [5]:
# ==============================
# Step 5 — Fill selected columns with 'unspecified'
# ==============================
banner("Step 5 — Fill selected columns with 'unspecified'")

for col in ["effective_date", "investors", "investors_count", "financing_type", "financing_type_normalized"]:
    if col in df.columns:
        df[col] = df[col].fillna("unspecified")

print("Filled missing values in target columns with 'unspecified'")


Step 5 — Fill selected columns with 'unspecified'
Filled missing values in target columns with 'unspecified'


In [6]:
# ==============================
# Step 6 — Normalize financing_type_normalized from categories
# ==============================
banner("Step 6 — Normalize financing_type_normalized from categories")

def normalize_financing_type(row):
    cat = str(row.get("categories")).strip()
    if row.get("financing_type_normalized") == "unspecified":
        if cat == "['private_equity']":
            return "private_equity"
        elif cat == "['debt_financing']":
            return "debt_financing"
    return row.get("financing_type_normalized")

df["financing_type_normalized"] = df.apply(normalize_financing_type, axis=1)
print("Financing type normalized based on categories")


Step 6 — Normalize financing_type_normalized from categories
Financing type normalized based on categories


In [7]:
# ==============================
# Step 7 — Parse investors into list format
# ==============================
banner("Step 7 — Parse investors into list format")

def parse_investors(val):
    if pd.isna(val) or val == "unspecified":
        return []
    return [x.strip() for x in str(val).split(",") if x.strip()]

df["investors_parsed"] = df["investors"].apply(parse_investors)
print("Investors parsed into list format")


Step 7 — Parse investors into list format
Investors parsed into list format


In [8]:
# ==============================
# Step 8 — Parse categories into list format
# ==============================
banner("Step 8 — Parse categories into list format")

def parse_categories(val):
    if pd.isna(val):
        return []
    try:
        return ast.literal_eval(val) if isinstance(val, str) else val
    except:
        return []

df["categories_parsed"] = df["categories"].apply(parse_categories)
print("Categories parsed into list format")


Step 8 — Parse categories into list format
Categories parsed into list format


In [9]:
# ==============================
# Step 9 — Reorder parsed columns
# ==============================
banner("Step 9 — Reorder parsed columns")

# Get current column order
cols = list(df.columns)

# Move categories_parsed after categories
if "categories" in cols and "categories_parsed" in cols:
    cols.remove("categories_parsed")
    idx = cols.index("categories") + 1
    cols.insert(idx, "categories_parsed")

# Move investors_parsed after investors
if "investors" in cols and "investors_parsed" in cols:
    cols.remove("investors_parsed")
    idx = cols.index("investors") + 1
    cols.insert(idx, "investors_parsed")

# Apply new column order
df = df[cols]

print("✅ Reordered columns so parsed fields follow originals")


Step 9 — Reorder parsed columns
✅ Reordered columns so parsed fields follow originals


In [10]:
# ==============================
# Step 10 — Final check
# ==============================
banner("Step 10 — Final check")

print("Final shape:", nrows(df))
df[["effective_date", "investors", "investors_parsed", "categories", "categories_parsed", "financing_type_normalized"]].head()


Step 10 — Final check
Final shape: 26 rows


Unnamed: 0,effective_date,investors,investors_parsed,categories,categories_parsed,financing_type_normalized
0,unspecified,unspecified,[],[],[],unspecified
1,unspecified,"avivainvestors.com, lloydsbankinggroup.com, sa...","[avivainvestors.com, lloydsbankinggroup.com, s...",[],[],unspecified
2,unspecified,unspecified,[],"[""private_equity""]",[private_equity],unspecified
3,unspecified,stackcapitalgroup.com,[stackcapitalgroup.com],[],[],unspecified
4,unspecified,chevychasetrust.com,[chevychasetrust.com],[],[],unspecified


In [15]:
# ==============================
# Step 11 — Save & Quick QA
# ==============================
banner("Step 11 — Save & Quick QA")

out_path = "/content/Business_Funding_Data_Cleaned_afeez.csv"
df.to_csv(out_path, index=False)
print(f"✅ Saved: {out_path}")
print("Final shape:", nrows(df))
print("\nMissing values by column:")
print(df.isna().sum())


Step 11 — Save & Quick QA
✅ Saved: /content/Business_Funding_Data_Cleaned_afeez.csv
Final shape: 26 rows

Missing values by column:
website_domain               0
effective_date               0
found_at                     0
financing_type               0
financing_type_normalized    0
categories                   0
categories_parsed            0
investors                    0
investors_parsed             0
investors_count              0
amount                       0
amount_normalized            0
source_urls                  0
dtype: int64

Top financing types:
financing_type_normalized
unspecified    18
seed            4
series_b        1
series_i        1
series_a2       1
Name: count, dtype: int64


In [13]:
# ==============================
# Step 12 — Download cleaned CSV to your computer
# ==============================
banner("Step 12 — Download cleaned CSV to your computer")

from google.colab import files
files.download("/content/Business_Funding_Data_Cleaned_afeez.csv")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

# 🧾 Assignment Reflection: Business Funding Data in Nigeria

## 🔍 Observations from Exploring the Data

- Several columns contained missing values, inconsistent formats, or placeholder strings like `"N/A"`.
- The `categories` and `investors` columns appeared to contain list-like data but were stored as raw strings, limiting usability.
- Currency values were messy, with different symbols and formats, and some entries lacked normalized amounts.
- Some fields like `financing_type_normalized` were blank even when `categories` clearly indicated the funding type.
- The dataset included both Nigerian and non-Nigerian businesses, but filtering was not the focus of this task.

---

## 🧹 Steps Taken to Clean, Preprocess, and Transform the Data

1. **Standardized column names** to lowercase with underscores for consistency.
2. **Replaced common blank tokens** (`"N/A"`, `"NA"`, `""`, etc.) with `NaN` to unify missing value handling.
3. **Filled selected columns** (`effective_date`, `investors`, `investors_count`, `financing_type`, `financing_type_normalized`) with `"unspecified"` where data was missing.
4. **Inferred financing type** from `categories` when it was clearly `"private_equity"` or `"debt_financing"`.
5. **Parsed `investors`** from comma-separated strings into clean Python lists (`investors_parsed`).
6. **Parsed `categories`** using `ast.literal_eval` to convert stringified lists into usable Python lists (`categories_parsed`).
7. **Reordered columns** so that parsed fields (`investors_parsed`, `categories_parsed`) appear directly after their originals.
8. **Saved the cleaned dataset** and provided a quick QA summary.
9. **Enabled download** of the final CSV for local use.

---

## ✅ Justifications for Each Technique

- **Standardizing column names** improves readability and prevents errors in code referencing.
- **Replacing blank tokens** ensures consistent missing value detection across the dataset.
- **Using `"unspecified"`** preserves rows while clearly marking missing or unknown data without introducing bias.
- **Inferring financing types** from `categories` adds structure and fills gaps using logical relationships.
- **Parsing list-like fields** unlocks powerful operations like filtering, counting, and grouping.
- **Reordering columns** enhances readability and keeps related data together.
- **Exporting and downloading** the cleaned file ensures portability and reproducibility.

---

## 💡 Reflections on the Importance of Preprocessing

Preprocessing is the foundation of reliable data analysis. Raw data is often messy, inconsistent and incomplete, without proper cleaning, any insights or models built on it can be misleading or outright wrong. This assignment highlights how thoughtful preprocessing transforms unusable data into a structured, analyzable asset. It’s not just about fixing errors, it’s about unlocking the full potential of the dataset.

