<a href="https://colab.research.google.com/github/Terabyte007/Google_Colab/blob/main/Data_processing_of_Business_Funding_Data_in_Nigeria.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
!pip install chardet
import chardet

with open("Business Funding Data.csv", "rb") as f:
    result = chardet.detect(f.read(50000))  # check first 50KB
    print(result)

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}


In [6]:
import pandas as pd

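# chardet reported ISO-8859-1; in Python, "latin1" is an alias for that same codec
df = pd.read_csv("Business Funding Data.csv", encoding="latin1")

# latin1 maps every byte to a character, so this read never raises a decode error.
# If characters look wrong in the output, cp1252 is a common alternative to try:
# df = pd.read_csv("Business Funding Data.csv", encoding="cp1252")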

df.head()

Unnamed: 0,Website Domain,Effective date,Found At,Financing Type,Financing Type Normalized,Categories,Investors,Investors Count,Amount,Amount Normalized,Source Urls
0,trafigura.com,,2024-03-14T01:00:00+01:00,,,[],,,$1.9b,1900000000,https://www.tradefinanceglobal.com/posts/trafi...
1,zenobe.com,,2024-05-31T02:00:00+02:00,,,[],"avivainvestors.com, lloydsbankinggroup.com, sa...",9.0,$522.7 million,522700000,https://realassets.ipe.com/news/aviva-among-le...
2,zenobe.com,,2024-07-24T02:00:00+02:00,,,"[""private_equity""]",,,£41.7m,53671000,https://www.innovationnewsnetwork.com/zenobe-a...
3,canva.com,,2024-05-01T02:00:00+02:00,,,[],stackcapitalgroup.com,1.0,US$8 million,8000000,https://www.globenewswire.com/news-release/202...
4,fidelity.com,,2024-04-11T02:00:00+02:00,,,[],chevychasetrust.com,1.0,$1.96 million,1960000,https://www.defenseworld.net/2024/04/11/chevy-...
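
chardet reported `ISO-8859-1`, and the read above uses `latin1`. As a quick check (a sketch using only the standard library), Python resolves both names to the same codec, and that codec can decode any byte sequence, so a successful read does not by itself confirm the encoding guess:

```python
import codecs

# Both spellings resolve to the same underlying codec
latin1_name = codecs.lookup("latin1").name
iso_name = codecs.lookup("ISO-8859-1").name
same_codec = latin1_name == iso_name

# Every possible byte value decodes without error under this codec
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin1")

print(same_codec, len(decoded))
```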


In [7]:
# Step 3: Explore Dataset
print("Shape of dataset:", df.shape)
print("\nColumns in dataset:\n", df.columns)
print("\nData types:\n", df.dtypes)
print("\nMissing values:\n", df.isnull().sum())
print("\nSummary statistics:\n", df.describe(include='all'))

Shape of dataset: (26, 11)

Columns in dataset:
 Index(['Website Domain', 'Effective date', 'Found At', 'Financing Type',
       'Financing Type Normalized', 'Categories', 'Investors',
       'Investors Count', 'Amount', 'Amount Normalized', 'Source Urls'],
      dtype='object')

Data types:
 Website Domain                object
Effective date                object
Found At                      object
Financing Type                object
Financing Type Normalized     object
Categories                    object
Investors                     object
Investors Count              float64
Amount                        object
Amount Normalized              int64
Source Urls                   object
dtype: object

Missing values:
 Website Domain                0
Effective date               20
Found At                      0
Financing Type               18
Financing Type Normalized    18
Categories                    0
Investors                    13
Investors Count              13
Amount     

In [9]:
# Step 4: Handle Missing Values

# Example strategy:
# - Fill numeric missing values with median
# - Fill categorical missing values with mode

for col in df.columns:
    if df[col].dtype in ['int64', 'float64']:
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

print("Missing values after cleaning:\n", df.isnull().sum())

Missing values after cleaning:
 Website Domain               0
Effective date               0
Found At                     0
Financing Type               0
Financing Type Normalized    0
Categories                   0
Investors                    0
Investors Count              0
Amount                       0
Amount Normalized            0
Source Urls                  0
dtype: int64
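
Before imputing, it can help to check what share of each column is missing, since median or mode filling a mostly empty column makes most of its values synthetic. A minimal sketch using the counts printed by the exploration step (the 50% threshold is an arbitrary assumption for illustration, not something the notebook applies):

```python
# Missing-value counts as printed for the 26-row dataset
n_rows = 26
missing_counts = {
    "Effective date": 20,
    "Financing Type": 18,
    "Financing Type Normalized": 18,
    "Investors": 13,
    "Investors Count": 13,
}

# Fraction of rows missing in each column
missing_share = {col: count / n_rows for col, count in missing_counts.items()}

# Hypothetical cutoff: flag columns where more than half the rows are missing
THRESHOLD = 0.5
mostly_missing = [col for col, share in missing_share.items() if share > THRESHOLD]

for col, share in missing_share.items():
    print(f"{col}: {share:.0%} missing")
print("Flagged:", mostly_missing)
```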


In [10]:
# Step 5: Handle Duplicates
duplicates = df.duplicated().sum()
print("Number of duplicate rows:", duplicates)

# Remove duplicates
df = df.drop_duplicates()
print("Shape after removing duplicates:", df.shape)

Number of duplicate rows: 0
Shape after removing duplicates: (26, 11)


In [11]:
# Step 6: Standardize Column Names
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print("Cleaned column names:", df.columns)

Cleaned column names: Index(['website_domain', 'effective_date', 'found_at', 'financing_type',
       'financing_type_normalized', 'categories', 'investors',
       'investors_count', 'amount', 'amount_normalized', 'source_urls'],
      dtype='object')
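
The chained `str.replace(" ", "_")` above handles single spaces only. A slightly more defensive variant, sketched here with the standard library and a hypothetical helper named `normalize_column`, also collapses repeated spaces and punctuation into one underscore:

```python
import re

def normalize_column(name: str) -> str:
    """Lowercase a column name and collapse non-alphanumeric runs to one underscore."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.strip().lower())
    return cleaned.strip("_")

columns = ["Website Domain", "Financing Type Normalized", "  Source  Urls "]
normalized = [normalize_column(c) for c in columns]
print(normalized)
```

In the notebook this could be applied with `df.columns = [normalize_column(c) for c in df.columns]`.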


In [12]:
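# Step 7: Handle Inconsistent Data (Example)
# Strip whitespace in every text column, but title-case only free-text
# categorical fields: title-casing URLs, domains, or amount strings corrupts them
case_sensitive = {
    "website_domain", "source_urls", "found_at", "effective_date",
    "categories", "investors", "amount",
}
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip()
    if col not in case_sensitive:
        df[col] = df[col].str.title()

# Ensure the funding amount is numeric. In this dataset the column is
# `amount_normalized`; there is no `funding_amount` column.
if "amount_normalized" in df.columns:
    df["amount_normalized"] = pd.to_numeric(df["amount_normalized"], errors="coerce")
    df["amount_normalized"] = df["amount_normalized"].fillna(df["amount_normalized"].median())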

In [13]:
# Step 8: Feature Engineering (Optional Example)
# Create a new column: funding_in_millions
# (derived from `amount_normalized`; this dataset has no `funding_amount` column)
if "amount_normalized" in df.columns:
    df["funding_in_millions"] = df["amount_normalized"] / 1_000_000

In [14]:
# Step 9: Save Cleaned Data
df.to_csv("Business_Funding_Data_Cleaned.csv", index=False)
print("Cleaned dataset saved as Business_Funding_Data_Cleaned.csv")

Cleaned dataset saved as Business_Funding_Data_Cleaned.csv


In [15]:
from google.colab import files

# Download the cleaned CSV to your computer
files.download("Business_Funding_Data_Cleaned.csv")


## Observations from Exploring the Data

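- The dataset contained missing values in both numeric and categorical columns: `Effective date` (20 of 26 rows), `Financing Type` and `Financing Type Normalized` (18 each), and `Investors` and `Investors Count` (13 each).
- No exact duplicate rows were found; the shape stayed at (26, 11) after `drop_duplicates()`.
- Inconsistent casing and extra spaces were present in categorical fields.
- The raw `Amount` column stores values as strings in mixed formats (e.g., `$1.9b`, `£41.7m`, `US$8 million`), while `Amount Normalized` already holds them as integers.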
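
To illustrate the string-to-number conversion mentioned in the last point, here is a minimal sketch that parses amount strings like those shown in `df.head()`. The helper name `parse_amount` and the suffix table are assumptions for illustration; the sketch ignores the currency symbol and does not convert between currencies.

```python
import re
from decimal import Decimal
from typing import Optional

# Multipliers for magnitude suffixes (an assumed, non-exhaustive list)
_MULTIPLIERS = {
    "k": 10**3, "thousand": 10**3,
    "m": 10**6, "mn": 10**6, "million": 10**6,
    "b": 10**9, "bn": 10**9, "billion": 10**9,
}
_AMOUNT_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(thousand|million|billion|mn|bn|k|m|b)?\b",
    re.IGNORECASE,
)

def parse_amount(text: str) -> Optional[int]:
    """Return the numeric magnitude of an amount string, ignoring currency."""
    match = _AMOUNT_RE.search(text.replace(",", ""))
    if match is None:
        return None
    number, suffix = match.groups()
    multiplier = _MULTIPLIERS[suffix.lower()] if suffix else 1
    # Decimal avoids float rounding artifacts such as 522.7 * 1e6
    return int(Decimal(number) * multiplier)

for raw in ["$1.9b", "$522.7 million", "US$8 million", "$1.96 million"]:
    print(raw, "->", parse_amount(raw))
```

Note that `£41.7m` would parse to 41,700,000, whereas the dataset's `Amount Normalized` lists 53,671,000 for that row. This suggests the normalized column also converts to a common currency, which this sketch does not attempt.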

---

## Steps Taken to Clean, Preprocess, and Transform the Data

1. Inspected dataset shape, columns, missing values, and summary statistics.  
2. Filled missing numeric values with **median** and categorical values with **mode**.  
3. Checked for duplicate rows (none were found) and applied `drop_duplicates()`.
4. Standardized column names to lowercase with underscores.  
5. Cleaned categorical fields by stripping spaces and applying title case.  
6. Converted funding amounts to numeric and created a derived column in millions.  
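
The imputation, deduplication, and renaming steps above can be sketched as one function. This is a minimal illustration on toy data (the values below are hypothetical, not taken from the real CSV), and `clean` is a helper name chosen here, not part of the notebook:

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Impute, deduplicate, and rename columns, in the notebook's order."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric columns: fill gaps with the median
            out[col] = out[col].fillna(out[col].median())
        else:
            # Other columns: fill gaps with the mode, if one exists
            modes = out[col].mode()
            if not modes.empty:  # an all-missing column has no mode
                out[col] = out[col].fillna(modes.iloc[0])
    out = out.drop_duplicates().reset_index(drop=True)
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    return out

# Toy data: hypothetical values for illustration only
toy = pd.DataFrame({
    "Website Domain": ["a.com", "b.com", "b.com", "c.com"],
    "Investors Count": [1.0, 2.0, None, 3.0],
    "Financing Type": ["seed", None, "seed", "seed"],
})
cleaned = clean(toy)
print(cleaned)
```

In the toy data the two `b.com` rows become identical only after imputation, so running deduplication after filling removes one of them. The order of the steps therefore affects how many rows survive.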

---

## Justification for Techniques Applied

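- **Median filling** is more robust to outliers than mean filling, which matters for skewed numeric columns.
- **Mode filling** preserves the most common categorical value, although for a mostly empty column (e.g., `Effective date`, missing in 20 of 26 rows) it assigns the same value to most rows.
- **Removing duplicates** avoids double-counting the same record in later analysis.
- **Standardized column names** make later queries and code consistent.
- **Feature engineering** (expressing amounts in millions) makes large funding values easier to read.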
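
As a small illustration of the median's robustness, the sketch below compares mean and median on the five `Amount Normalized` values visible in the `df.head()` output. These are only the first five rows, so the numbers are illustrative rather than statistics for the full dataset:

```python
import statistics

# Amount Normalized for the first five rows shown by df.head()
amounts = [1_900_000_000, 522_700_000, 53_671_000, 8_000_000, 1_960_000]

mean_amount = statistics.mean(amounts)
median_amount = statistics.median(amounts)

# The single $1.9b deal pulls the mean far above the typical value
print("mean:  ", mean_amount)
print("median:", median_amount)
```

Here the mean is several times larger than the median because one very large deal dominates the sum, while the median stays at the middle value.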

---

## Reflections on the Importance of Preprocessing

Preprocessing is critical in real-world data analysis because raw data often contains noise, inconsistencies, and missing values.  
Careful cleaning improves accuracy, can improve model performance, and leads to more reliable insights.  

Without proper preprocessing, any analysis or machine learning model would likely produce misleading results.
