# Business Funding Data Preprocessing Assignment

**Objective:** Analyze and preprocess the *Business Funding Data* dataset to prepare it for further analysis.

This notebook will explore, clean, and transform the dataset while documenting observations, preprocessing steps, justifications, and reflections.

In [17]:
import pandas as pd
# Load dataset

df = pd.read_csv(r"C:\Users\hp\Desktop\Data sceince projects\Business_Funding _Data.csv", encoding = 'latin1')  # Update path if needed

# Display first few rows
df.head()

Unnamed: 0,Website Domain,Effective date,Found At,Financing Type,Financing Type Normalized,Categories,Investors,Investors Count,Amount,Amount Normalized,Source Urls
0,trafigura.com,,2024-03-14T01:00:00+01:00,,,[],,,$1.9b,1900000000,https://www.tradefinanceglobal.com/posts/trafi...
1,zenobe.com,,2024-05-31T02:00:00+02:00,,,[],"avivainvestors.com, lloydsbankinggroup.com, sa...",9.0,$522.7 million,522700000,https://realassets.ipe.com/news/aviva-among-le...
2,zenobe.com,,2024-07-24T02:00:00+02:00,,,"[""private_equity""]",,,£41.7m,53671000,https://www.innovationnewsnetwork.com/zenobe-a...
3,canva.com,,2024-05-01T02:00:00+02:00,,,[],stackcapitalgroup.com,1.0,US$8 million,8000000,https://www.globenewswire.com/news-release/202...
4,fidelity.com,,2024-04-11T02:00:00+02:00,,,[],chevychasetrust.com,1.0,$1.96 million,1960000,https://www.defenseworld.net/2024/04/11/chevy-...


## Step 1: Observations from Exploring the Data

In [18]:
# Shape of the dataset
print('Shape:', df.shape)

# Check for missing values
print('\nMissing values:\n', df.isnull().sum())

# Data types
print('\nData types:\n', df.dtypes)

# Check duplicates
print('\nDuplicate rows:', df.duplicated().sum())

# Quick summary of numeric and categorical columns
df.describe(include='all')

Shape: (26, 11)

Missing values:
 Website Domain                0
Effective date               20
Found At                      0
Financing Type               18
Financing Type Normalized    18
Categories                    0
Investors                    13
Investors Count              13
Amount                        0
Amount Normalized             0
Source Urls                   0
dtype: int64

Data types:
 Website Domain                object
Effective date                object
Found At                      object
Financing Type                object
Financing Type Normalized     object
Categories                    object
Investors                     object
Investors Count              float64
Amount                        object
Amount Normalized              int64
Source Urls                   object
dtype: object

Duplicate rows: 0


Unnamed: 0,Website Domain,Effective date,Found At,Financing Type,Financing Type Normalized,Categories,Investors,Investors Count,Amount,Amount Normalized,Source Urls
count,26,6,26,8,8,26,13,13.0,26,26.0,26
unique,21,6,23,5,5,9,13,,26,,26
top,zenobe.com,2024-04-18T02:00:00+02:00,2024-04-24T02:00:00+02:00,Seed,seed,[],"avivainvestors.com, lloydsbankinggroup.com, sa...",,$1.9b,,https://www.tradefinanceglobal.com/posts/trafi...
freq,2,1,2,4,4,11,1,,1,,1
mean,,,,,,,,1.846154,,226468700.0,
std,,,,,,,,2.230327,,538323900.0,
min,,,,,,,,1.0,,1600000.0,
25%,,,,,,,,1.0,,4685750.0,
50%,,,,,,,,1.0,,11600000.0,
75%,,,,,,,,1.0,,47500000.0,
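
The raw missing-value counts above are easier to compare as a share of rows. The following is a minimal sketch under stated assumptions: `sample` is a hand-built frame holding a few columns from the five rows shown by `df.head()`, and `missing_report` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

# Illustrative frame copied from the five rows shown by df.head();
# only a few columns are reproduced. `sample` is a stand-in for `df`.
sample = pd.DataFrame({
    "Website Domain": ["trafigura.com", "zenobe.com", "zenobe.com", "canva.com", "fidelity.com"],
    "Effective date": [None, None, None, None, None],
    "Investors Count": [None, 9.0, None, 1.0, 1.0],
    "Amount Normalized": [1900000000, 522700000, 53671000, 8000000, 1960000],
})

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, worst first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return report.sort_values("percent", ascending=False)

report = missing_report(sample)
print(report)
```

Applied to the full 26-row dataset, the counts printed earlier correspond to roughly 76.9% missing for `Effective date`, 69.2% for the two financing-type columns, and 50% for the two investor columns.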


## Step 2: Data Cleaning, Preprocessing, and Transformation

In [19]:
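# Handle missing values: median for numeric columns, mode for text columns.
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which triggers a pandas warning and stops working in pandas 3.0.
for col in df.select_dtypes(include=['float64','int64']).columns:
    df[col] = df[col].fillna(df[col].median())

for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].fillna(df[col].mode()[0])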

# Remove duplicates
df.drop_duplicates(inplace=True)

# Standardize categorical values (template: this dataset has no 'State' column, so the block is skipped)
if 'State' in df.columns:
    df['State'] = df['State'].str.strip().str.title()

# Convert columns to correct data types (template: this dataset has no 'Date' column, so the block is skipped)
if 'Date' in df.columns:
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Outlier treatment with the IQR method (template: this dataset has no 'FundingAmount' column, so the block is skipped)
if 'FundingAmount' in df.columns:
    Q1 = df['FundingAmount'].quantile(0.25)
    Q3 = df['FundingAmount'].quantile(0.75)
    IQR = Q3 - Q1
    df = df[(df['FundingAmount'] >= Q1 - 1.5*IQR) & (df['FundingAmount'] <= Q3 + 1.5*IQR)]

# Final check
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Website Domain             26 non-null     object 
 1   Effective date             26 non-null     object 
 2   Found At                   26 non-null     object 
 3   Financing Type             26 non-null     object 
 4   Financing Type Normalized  26 non-null     object 
 5   Categories                 26 non-null     object 
 6   Investors                  26 non-null     object 
 7   Investors Count            26 non-null     float64
 8   Amount                     26 non-null     object 
 9   Amount Normalized          26 non-null     int64  
 10  Source Urls                26 non-null     object 
dtypes: float64(1), int64(1), object(9)
memory usage: 2.4+ KB


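
The type-conversion template in the cell above checks for a `Date` column that this dataset does not have, so `Found At` and `Categories` remain plain `object` columns in the `df.info()` output. The following is a minimal sketch of how the same idea might be applied to the columns that do exist. It assumes the timestamps are ISO 8601 strings with UTC offsets and that `Categories` holds JSON-encoded lists, as the `df.head()` output suggests. `sample` and `Category Count` are illustrative names.

```python
import json
import pandas as pd

# Illustrative values copied from the first three rows of df.head().
sample = pd.DataFrame({
    "Found At": [
        "2024-03-14T01:00:00+01:00",
        "2024-05-31T02:00:00+02:00",
        "2024-07-24T02:00:00+02:00",
    ],
    "Categories": ["[]", "[]", '["private_equity"]'],
})

# The offsets differ (+01:00 vs +02:00), so normalize everything to UTC.
# errors="coerce" turns unparseable strings into NaT instead of raising.
sample["Found At"] = pd.to_datetime(sample["Found At"], utc=True, errors="coerce")

# Decode each JSON-encoded list stored as text into a real Python list.
sample["Categories"] = sample["Categories"].apply(json.loads)

# Example derived feature: number of categories attached to each record.
sample["Category Count"] = sample["Categories"].apply(len)

print(sample.dtypes)
print(sample)
```

Normalizing to UTC is one design choice among several; keeping the local offsets would require handling mixed time zones explicitly.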


## Step 3: Justifications for Preprocessing Steps
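- **Missing values:** Numeric columns were filled with the median and text columns with the mode, so all 26 rows are retained. Mode-filling is a weak fit for sparsely populated columns such as `Effective date` (20 of 26 missing) and `Financing Type` (18 of 26 missing), so the filled values there should be treated with caution.
- **Duplicates:** Dropping duplicate rows guards against double counting; the check in Step 1 found none in this dataset.
- **Categorical standardization:** Trimming whitespace and normalizing case keeps labels consistent (e.g., 'lagos' vs 'Lagos').
- **Type conversions:** Storing dates as datetimes and amounts as numbers enables sorting, filtering, and arithmetic that text columns do not support.
- **Outlier handling:** The IQR rule drops values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR so that extreme values do not distort summaries.
- **Feature engineering (optional):** Could add metrics like *Funding per Year* or *Business Age*.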
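
To make the outlier rule concrete, here is a minimal sketch that applies the same 1.5 x IQR bounds to the `Amount Normalized` values shown by `df.head()`. Note one deliberate difference from the notebook cell: it flags outliers with a boolean mask rather than dropping rows, which may be preferable on a dataset of only 26 records. `iqr_outlier_mask` is a hypothetical helper name.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Amount Normalized values from the five rows shown by df.head().
amounts = pd.Series(
    [1900000000, 522700000, 53671000, 8000000, 1960000],
    name="Amount Normalized",
)

is_outlier = iqr_outlier_mask(amounts)
print(is_outlier.tolist())  # [True, False, False, False, False]
```

On these five values, Q1 is 8,000,000 and Q3 is 522,700,000, so the upper bound is 1,294,750,000 and only the $1.9b record is flagged. The result on the full dataset would differ because the quartiles change.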

## Step 4: Reflections on Preprocessing
- Real-world data is often messy and requires thorough cleaning before analysis.
- Preprocessing ensures **accuracy, consistency, and reliability** of results.
- A commonly cited rule of thumb is that **70–80% of a data scientist’s time** goes to preprocessing.
- Without preprocessing, downstream models and insights may be misleading.
- This step is the foundation for trustworthy business analytics and machine learning.