
# Business Funding Data – Cleaning and Preprocessing for Analysis

This notebook demonstrates a thorough cleaning and preparation of a business funding dataset, making it ready for downstream analysis (e.g., regression, clustering, or visualization).

**Assumptions:**  
- Dataset file: `Business Funding Data.csv` (uploaded to the Colab environment).  

---

## 1. Setup and Data Loading

In [10]:
# Import necessary libraries for data cleaning and data manipulation
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('/content/Business Funding Data.csv', encoding='latin1')

# To check the shape of the dataframe
print("Dataset shape:", df.shape)


Dataset shape: (26, 11)


In [11]:
# To check the first 5 rows of the dataframe
df.head()

Unnamed: 0,Website Domain,Effective date,Found At,Financing Type,Financing Type Normalized,Categories,Investors,Investors Count,Amount,Amount Normalized,Source Urls
0,trafigura.com,,2024-03-14T01:00:00+01:00,,,[],,,$1.9b,1900000000,https://www.tradefinanceglobal.com/posts/trafi...
1,zenobe.com,,2024-05-31T02:00:00+02:00,,,[],"avivainvestors.com, lloydsbankinggroup.com, sa...",9.0,$522.7 million,522700000,https://realassets.ipe.com/news/aviva-among-le...
2,zenobe.com,,2024-07-24T02:00:00+02:00,,,"[""private_equity""]",,,£41.7m,53671000,https://www.innovationnewsnetwork.com/zenobe-a...
3,canva.com,,2024-05-01T02:00:00+02:00,,,[],stackcapitalgroup.com,1.0,US$8 million,8000000,https://www.globenewswire.com/news-release/202...
4,fidelity.com,,2024-04-11T02:00:00+02:00,,,[],chevychasetrust.com,1.0,$1.96 million,1960000,https://www.defenseworld.net/2024/04/11/chevy-...


In [12]:
# To check the last 5 rows of the dataframe
df.tail()

Unnamed: 0,Website Domain,Effective date,Found At,Financing Type,Financing Type Normalized,Categories,Investors,Investors Count,Amount,Amount Normalized,Source Urls
21,claritisoftware.com,2024-06-26T02:00:00+02:00,2024-06-26T02:00:00+02:00,,,"[""private_equity""]",cibc.com,1.0,$10 million,10000000,https://www.marketscreener.com/quote/stock/CAN...
22,biointelligence.com,,2024-04-30T02:00:00+02:00,Seed,seed,"[""seed"", ""venture""]",,,$5 million CAD,3653000,https://betakit.com/biointelligence-technologi...
23,gaiia.com,2024-06-27T02:00:00+02:00,2024-06-27T02:00:00+02:00,Series A,series_a,"[""series_a"", ""venture""]",inovia.vc,1.0,US$13.2M,13200000,https://financialpost.com/globe-newswire/gaiia...
24,sinnstudio.com,,2024-05-22T02:00:00+02:00,,,"[""private_equity"", ""venture""]",,,$2.5M,2500000,https://www.finsmes.com/2024/05/sinn-studio-ra...
25,topicflow.com,,2024-06-25T02:00:00+02:00,,,[],,,$2.5m,2500000,https://www.streetinsider.com/Accesswire/Topic...


## Objective: Produce a well-cleaned, analysis-ready dataset for further analysis by applying various pre-processing techniques.

In [13]:
# To display basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Website Domain             26 non-null     object 
 1   Effective date             6 non-null      object 
 2   Found At                   26 non-null     object 
 3   Financing Type             8 non-null      object 
 4   Financing Type Normalized  8 non-null      object 
 5   Categories                 26 non-null     object 
 6   Investors                  13 non-null     object 
 7   Investors Count            13 non-null     float64
 8   Amount                     26 non-null     object 
 9   Amount Normalized          26 non-null     int64  
 10  Source Urls                26 non-null     object 
dtypes: float64(1), int64(1), object(9)
memory usage: 2.4+ KB


## DEALING WITH MISSING VALUES


In [14]:
# Total number of missing values per column in the dataframe
df.isnull().sum()

Unnamed: 0,0
Website Domain,0
Effective date,20
Found At,0
Financing Type,18
Financing Type Normalized,18
Categories,0
Investors,13
Investors Count,13
Amount,0
Amount Normalized,0


In [15]:
# To calculate missing values in percentage relative to its column
missing_percentage = df.isnull().sum() / len(df) * 100
missing_percentage

Unnamed: 0,0
Website Domain,0.0
Effective date,76.923077
Found At,0.0
Financing Type,69.230769
Financing Type Normalized,69.230769
Categories,0.0
Investors,50.0
Investors Count,50.0
Amount,0.0
Amount Normalized,0.0
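
The count and the percentage can also be viewed side by side. The following is a minimal sketch, assuming a small made-up frame in place of the real `df`; the helper name `missing_summary` and the sample values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the funding data; the notebook itself uses `df`.
sample = pd.DataFrame({
    "Website Domain": ["a.example", "b.example", "c.example", "d.example"],
    "Effective date": [np.nan, "2024-04-18T02:00:00+02:00", np.nan, np.nan],
    "Investors Count": [1.0, np.nan, 2.0, np.nan],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing_pct", ascending=False)

summary = missing_summary(sample)
print(summary)
```

Calling `missing_summary(df)` on the notebook's dataframe should give the same numbers as the two cells above, in a single table.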


In [16]:
# To find out the unique values in 'Effective date' because it has the highest percentage of null values
df['Effective date'].unique()

array([nan, '2024-04-18T02:00:00+02:00', '2024-04-16T02:00:00+02:00',
       '2024-06-20T02:00:00+02:00', '2024-04-24T02:00:00+02:00',
       '2024-06-26T02:00:00+02:00', '2024-06-27T02:00:00+02:00'],
      dtype=object)

In [17]:
# Drop the 'Effective date' column
df.drop(columns=['Effective date'], inplace=True)
df

Unnamed: 0,Website Domain,Found At,Financing Type,Financing Type Normalized,Categories,Investors,Investors Count,Amount,Amount Normalized,Source Urls
0,trafigura.com,2024-03-14T01:00:00+01:00,,,[],,,$1.9b,1900000000,https://www.tradefinanceglobal.com/posts/trafi...
1,zenobe.com,2024-05-31T02:00:00+02:00,,,[],"avivainvestors.com, lloydsbankinggroup.com, sa...",9.0,$522.7 million,522700000,https://realassets.ipe.com/news/aviva-among-le...
2,zenobe.com,2024-07-24T02:00:00+02:00,,,"[""private_equity""]",,,£41.7m,53671000,https://www.innovationnewsnetwork.com/zenobe-a...
3,canva.com,2024-05-01T02:00:00+02:00,,,[],stackcapitalgroup.com,1.0,US$8 million,8000000,https://www.globenewswire.com/news-release/202...
4,fidelity.com,2024-04-11T02:00:00+02:00,,,[],chevychasetrust.com,1.0,$1.96 million,1960000,https://www.defenseworld.net/2024/04/11/chevy-...
5,swtchenergy.com,2024-04-24T02:00:00+02:00,Series B,series_b,"[""series_b"", ""venture""]","alantra.com, blueearth.capital",2.0,$27.2 Million,27200000,https://www.mercomindia.com/funding-and-ma-rou...
6,carnow.com,2024-04-16T02:00:00+02:00,,,"[""debt_financing""]",runwaygrowth.com,1.0,$40 million,40000000,https://www.prnewswire.com/news-releases/runwa...
7,databricks.com,2024-08-07T02:00:00+02:00,Series I,series_i,"[""series_i"", ""venture""]",,,$685 million,685000000,https://iteuropa.com/news/large-language-model...
8,anthropic.com,2024-07-08T02:00:00+02:00,,,[],damachotelsandresorts.com,1.0,$50mn,50000000,https://www.arabianbusiness.com/industries/tec...
9,ey.com,2024-04-18T02:00:00+02:00,,,[],,,AU$10.7M,6865000,https://www.biometricupdate.com/202404/ey-secu...
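
The decision to drop a sparse column can be made explicit with a cutoff. This is only a sketch: the 70% threshold, the helper name `drop_sparse_columns`, and the sample values are assumptions chosen for illustration, not a rule stated in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names follow the notebook, values are made up.
sample = pd.DataFrame({
    "Website Domain": ["a.example", "b.example", "c.example", "d.example"],
    "Effective date": [np.nan, np.nan, np.nan, "2024-06-26T02:00:00+02:00"],
    "Financing Type": [np.nan, "Seed", np.nan, "Series A"],
})

def drop_sparse_columns(frame, max_missing_pct=70.0):
    """Drop columns whose share of missing values exceeds the cutoff.

    The 70% default is an assumed threshold, not taken from the notebook.
    Returns the trimmed copy and the list of dropped column names.
    """
    missing_pct = frame.isna().mean() * 100
    to_drop = missing_pct[missing_pct > max_missing_pct].index.tolist()
    return frame.drop(columns=to_drop), to_drop

trimmed, dropped = drop_sparse_columns(sample)
print("Dropped:", dropped)
```

With a 70% cutoff, the notebook's `Effective date` column (about 77% missing) would be dropped, while `Financing Type` (about 69% missing) would be kept for imputation.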


In [18]:
# Display Basic Info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Website Domain             26 non-null     object 
 1   Found At                   26 non-null     object 
 2   Financing Type             8 non-null      object 
 3   Financing Type Normalized  8 non-null      object 
 4   Categories                 26 non-null     object 
 5   Investors                  13 non-null     object 
 6   Investors Count            13 non-null     float64
 7   Amount                     26 non-null     object 
 8   Amount Normalized          26 non-null     int64  
 9   Source Urls                26 non-null     object 
dtypes: float64(1), int64(1), object(8)
memory usage: 2.2+ KB


In [19]:
# To find out the unique values in 'Investors Count'.
df['Investors Count'].unique()

array([nan,  9.,  1.,  2.,  3.])

In [20]:
# To find out the values in 'Investors Count'.
df['Investors Count']

Unnamed: 0,Investors Count
0,
1,9.0
2,
3,1.0
4,1.0
5,2.0
6,1.0
7,
8,1.0
9,


In [21]:
# Fill all NaN values with the median, because the column contains an outlier (9).
df['Investors Count'] = df['Investors Count'].fillna(df['Investors Count'].median())
df['Investors Count']

Unnamed: 0,Investors Count
0,1.0
1,9.0
2,1.0
3,1.0
4,1.0
5,2.0
6,1.0
7,1.0
8,1.0
9,1.0
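
To see why the median is preferred here, the mean and median can be compared on a column with one large value. The sketch below uses made-up counts that only mirror the unique values seen above (1, 2, 3, 9); it is not the actual column.

```python
import numpy as np
import pandas as pd

# Illustrative values only; they are not the notebook's actual data.
counts = pd.Series(
    [np.nan, 9.0, np.nan, 1.0, 1.0, 2.0, 1.0, np.nan, 1.0, 3.0],
    name="Investors Count",
)

mean_value = counts.mean()      # pulled upward by the single 9
median_value = counts.median()  # unaffected by how large the outlier is

# Assign the result back instead of using inplace=True on a column selection.
filled = counts.fillna(median_value)

print(f"mean={mean_value:.2f}, median={median_value}")
```

In this sample the mean is more than twice the median, so mean imputation would assign every missing row a count that almost no observed row actually has.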


In [22]:
# Display basic info
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Website Domain             26 non-null     object 
 1   Found At                   26 non-null     object 
 2   Financing Type             8 non-null      object 
 3   Financing Type Normalized  8 non-null      object 
 4   Categories                 26 non-null     object 
 5   Investors                  13 non-null     object 
 6   Investors Count            26 non-null     float64
 7   Amount                     26 non-null     object 
 8   Amount Normalized          26 non-null     int64  
 9   Source Urls                26 non-null     object 
dtypes: float64(1), int64(1), object(8)
memory usage: 2.2+ KB


In [23]:
# To find out the unique values in 'Financing Type'.
df['Financing Type'].unique()

array([nan, 'Series B', 'Series I', 'Seed', 'Series A2', 'Series A'],
      dtype=object)

In [24]:
# To calculate the mode of the column 'Financing Type'.
df['Financing Type'].mode()[0]

'Seed'

In [25]:
# Fill all missing values in 'Financing Type' with the column's mode.
# Assign the result back; calling fillna(..., inplace=True) on a column selection is deprecated.
df['Financing Type'] = df['Financing Type'].fillna(df['Financing Type'].mode()[0])
df

Unnamed: 0,Website Domain,Found At,Financing Type,Financing Type Normalized,Categories,Investors,Investors Count,Amount,Amount Normalized,Source Urls
0,trafigura.com,2024-03-14T01:00:00+01:00,Seed,,[],,1.0,$1.9b,1900000000,https://www.tradefinanceglobal.com/posts/trafi...
1,zenobe.com,2024-05-31T02:00:00+02:00,Seed,,[],"avivainvestors.com, lloydsbankinggroup.com, sa...",9.0,$522.7 million,522700000,https://realassets.ipe.com/news/aviva-among-le...
2,zenobe.com,2024-07-24T02:00:00+02:00,Seed,,"[""private_equity""]",,1.0,£41.7m,53671000,https://www.innovationnewsnetwork.com/zenobe-a...
3,canva.com,2024-05-01T02:00:00+02:00,Seed,,[],stackcapitalgroup.com,1.0,US$8 million,8000000,https://www.globenewswire.com/news-release/202...
4,fidelity.com,2024-04-11T02:00:00+02:00,Seed,,[],chevychasetrust.com,1.0,$1.96 million,1960000,https://www.defenseworld.net/2024/04/11/chevy-...
5,swtchenergy.com,2024-04-24T02:00:00+02:00,Series B,series_b,"[""series_b"", ""venture""]","alantra.com, blueearth.capital",2.0,$27.2 Million,27200000,https://www.mercomindia.com/funding-and-ma-rou...
6,carnow.com,2024-04-16T02:00:00+02:00,Seed,,"[""debt_financing""]",runwaygrowth.com,1.0,$40 million,40000000,https://www.prnewswire.com/news-releases/runwa...
7,databricks.com,2024-08-07T02:00:00+02:00,Series I,series_i,"[""series_i"", ""venture""]",,1.0,$685 million,685000000,https://iteuropa.com/news/large-language-model...
8,anthropic.com,2024-07-08T02:00:00+02:00,Seed,,[],damachotelsandresorts.com,1.0,$50mn,50000000,https://www.arabianbusiness.com/industries/tec...
9,ey.com,2024-04-18T02:00:00+02:00,Seed,,[],,1.0,AU$10.7M,6865000,https://www.biometricupdate.com/202404/ey-secu...


In [26]:
# To find out the unique values in 'Financing Type Normalized'.
df['Financing Type Normalized'].unique()

array([nan, 'series_b', 'series_i', 'seed', 'series_a2', 'series_a'],
      dtype=object)

In [27]:
# To calculate the mode of the column 'Financing Type Normalized'.
df['Financing Type Normalized'].mode()[0]

'seed'

In [28]:
# Fill all missing values in 'Financing Type Normalized' with the column's mode.
# Assign the result back; calling fillna(..., inplace=True) on a column selection is deprecated.
df['Financing Type Normalized'] = df['Financing Type Normalized'].fillna(df['Financing Type Normalized'].mode()[0])
df

Unnamed: 0,Website Domain,Found At,Financing Type,Financing Type Normalized,Categories,Investors,Investors Count,Amount,Amount Normalized,Source Urls
0,trafigura.com,2024-03-14T01:00:00+01:00,Seed,seed,[],,1.0,$1.9b,1900000000,https://www.tradefinanceglobal.com/posts/trafi...
1,zenobe.com,2024-05-31T02:00:00+02:00,Seed,seed,[],"avivainvestors.com, lloydsbankinggroup.com, sa...",9.0,$522.7 million,522700000,https://realassets.ipe.com/news/aviva-among-le...
2,zenobe.com,2024-07-24T02:00:00+02:00,Seed,seed,"[""private_equity""]",,1.0,£41.7m,53671000,https://www.innovationnewsnetwork.com/zenobe-a...
3,canva.com,2024-05-01T02:00:00+02:00,Seed,seed,[],stackcapitalgroup.com,1.0,US$8 million,8000000,https://www.globenewswire.com/news-release/202...
4,fidelity.com,2024-04-11T02:00:00+02:00,Seed,seed,[],chevychasetrust.com,1.0,$1.96 million,1960000,https://www.defenseworld.net/2024/04/11/chevy-...
5,swtchenergy.com,2024-04-24T02:00:00+02:00,Series B,series_b,"[""series_b"", ""venture""]","alantra.com, blueearth.capital",2.0,$27.2 Million,27200000,https://www.mercomindia.com/funding-and-ma-rou...
6,carnow.com,2024-04-16T02:00:00+02:00,Seed,seed,"[""debt_financing""]",runwaygrowth.com,1.0,$40 million,40000000,https://www.prnewswire.com/news-releases/runwa...
7,databricks.com,2024-08-07T02:00:00+02:00,Series I,series_i,"[""series_i"", ""venture""]",,1.0,$685 million,685000000,https://iteuropa.com/news/large-language-model...
8,anthropic.com,2024-07-08T02:00:00+02:00,Seed,seed,[],damachotelsandresorts.com,1.0,$50mn,50000000,https://www.arabianbusiness.com/industries/tec...
9,ey.com,2024-04-18T02:00:00+02:00,Seed,seed,[],,1.0,AU$10.7M,6865000,https://www.biometricupdate.com/202404/ey-secu...


In [29]:
# Display Basic Info
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Website Domain             26 non-null     object 
 1   Found At                   26 non-null     object 
 2   Financing Type             26 non-null     object 
 3   Financing Type Normalized  26 non-null     object 
 4   Categories                 26 non-null     object 
 5   Investors                  13 non-null     object 
 6   Investors Count            26 non-null     float64
 7   Amount                     26 non-null     object 
 8   Amount Normalized          26 non-null     int64  
 9   Source Urls                26 non-null     object 
dtypes: float64(1), int64(1), object(8)
memory usage: 2.2+ KB


In [30]:
# To find out the unique values in 'Investors'.
df['Investors'].unique()

array([nan,
       'avivainvestors.com, lloydsbankinggroup.com, santander.co.uk, swip.com, cibc.com, societegenerale.com, natwest.us, rabobank.com, mufg.jp',
       'stackcapitalgroup.com', 'chevychasetrust.com',
       'alantra.com, blueearth.capital', 'runwaygrowth.com',
       'damachotelsandresorts.com', 'surocap.com', 'eib.org',
       'vistaragrowth.com', 'accelia.vc',
       'edc.ca, desjardinscapital.com, fondsftq.com', 'cibc.com',
       'inovia.vc'], dtype=object)

In [31]:
# To display all rows with missing (null) values in the column 'Investors'
df[df['Investors'].isna()]

Unnamed: 0,Website Domain,Found At,Financing Type,Financing Type Normalized,Categories,Investors,Investors Count,Amount,Amount Normalized,Source Urls
0,trafigura.com,2024-03-14T01:00:00+01:00,Seed,seed,[],,1.0,$1.9b,1900000000,https://www.tradefinanceglobal.com/posts/trafi...
2,zenobe.com,2024-07-24T02:00:00+02:00,Seed,seed,"[""private_equity""]",,1.0,£41.7m,53671000,https://www.innovationnewsnetwork.com/zenobe-a...
7,databricks.com,2024-08-07T02:00:00+02:00,Series I,series_i,"[""series_i"", ""venture""]",,1.0,$685 million,685000000,https://iteuropa.com/news/large-language-model...
9,ey.com,2024-04-18T02:00:00+02:00,Seed,seed,[],,1.0,AU$10.7M,6865000,https://www.biometricupdate.com/202404/ey-secu...
10,openpipe.ai,2024-03-26T01:00:00+01:00,Seed,seed,"[""seed"", ""venture""]",,1.0,$6.7 million,6700000,https://www.geekwire.com/2024/seattle-startup-...
12,syntetica.co,2024-07-16T02:00:00+02:00,Seed,seed,"[""seed"", ""venture""]",,1.0,Û4.2 million,4581000,https://tech.eu/2024/07/16/syntetica-raises-eu...
14,sparelabs.com,2024-06-05T02:00:00+02:00,Seed,seed,[],,1.0,$10M,10000000,https://www.finsmes.com/2024/06/spare-receives...
15,e-zinc.ca,2024-06-27T02:00:00+02:00,Series A2,series_a2,"[""series_a2"", ""venture""]",,1.0,USD 31M,31000000,https://www.explorebit.io/article/e-Zinc%20Rai...
16,biointelligence.com,2024-04-25T02:00:00+02:00,Seed,seed,[],,1.0,$5M,5000000,https://www.prweb.com/releases/biointelligence...
20,topicflow.com,2024-06-25T02:00:00+02:00,Seed,seed,"[""seed"", ""venture""]",,1.0,CAD$2.5M,1823000,https://www.finsmes.com/2024/06/topicflow-rais...


In [32]:
# To display all rows with non-missing (non-null) values in the column 'Investors'
df[df['Investors'].notnull()]

Unnamed: 0,Website Domain,Found At,Financing Type,Financing Type Normalized,Categories,Investors,Investors Count,Amount,Amount Normalized,Source Urls
1,zenobe.com,2024-05-31T02:00:00+02:00,Seed,seed,[],"avivainvestors.com, lloydsbankinggroup.com, sa...",9.0,$522.7 million,522700000,https://realassets.ipe.com/news/aviva-among-le...
3,canva.com,2024-05-01T02:00:00+02:00,Seed,seed,[],stackcapitalgroup.com,1.0,US$8 million,8000000,https://www.globenewswire.com/news-release/202...
4,fidelity.com,2024-04-11T02:00:00+02:00,Seed,seed,[],chevychasetrust.com,1.0,$1.96 million,1960000,https://www.defenseworld.net/2024/04/11/chevy-...
5,swtchenergy.com,2024-04-24T02:00:00+02:00,Series B,series_b,"[""series_b"", ""venture""]","alantra.com, blueearth.capital",2.0,$27.2 Million,27200000,https://www.mercomindia.com/funding-and-ma-rou...
6,carnow.com,2024-04-16T02:00:00+02:00,Seed,seed,"[""debt_financing""]",runwaygrowth.com,1.0,$40 million,40000000,https://www.prnewswire.com/news-releases/runwa...
8,anthropic.com,2024-07-08T02:00:00+02:00,Seed,seed,[],damachotelsandresorts.com,1.0,$50mn,50000000,https://www.arabianbusiness.com/industries/tec...
11,canva.com,2024-05-09T02:00:00+02:00,Seed,seed,[],surocap.com,1.0,$2 billion,2000000000,https://www.investing.com/news/stock-market-ne...
13,zf.com,2024-07-15T02:00:00+02:00,Seed,seed,[],eib.org,1.0,Û425 million,464695000,https://thebrakereport.com/eib-funds-zf-brakin...
17,claritisoftware.com,2024-04-10T02:00:00+02:00,Seed,seed,"[""private_equity""]",vistaragrowth.com,1.0,$24.6 million CAD,18139000,https://betakit.com/government-software-provid...
18,heylist.com,2024-06-18T02:00:00+02:00,Seed,seed,"[""private_equity""]",accelia.vc,1.0,$1.6M,1600000,https://www.newswire.ca/news-releases/heylist-...


In [33]:
# To delete the column 'Investors' from the dataframe
df.drop(columns=['Investors'], inplace=True)
df

Unnamed: 0,Website Domain,Found At,Financing Type,Financing Type Normalized,Categories,Investors Count,Amount,Amount Normalized,Source Urls
0,trafigura.com,2024-03-14T01:00:00+01:00,Seed,seed,[],1.0,$1.9b,1900000000,https://www.tradefinanceglobal.com/posts/trafi...
1,zenobe.com,2024-05-31T02:00:00+02:00,Seed,seed,[],9.0,$522.7 million,522700000,https://realassets.ipe.com/news/aviva-among-le...
2,zenobe.com,2024-07-24T02:00:00+02:00,Seed,seed,"[""private_equity""]",1.0,£41.7m,53671000,https://www.innovationnewsnetwork.com/zenobe-a...
3,canva.com,2024-05-01T02:00:00+02:00,Seed,seed,[],1.0,US$8 million,8000000,https://www.globenewswire.com/news-release/202...
4,fidelity.com,2024-04-11T02:00:00+02:00,Seed,seed,[],1.0,$1.96 million,1960000,https://www.defenseworld.net/2024/04/11/chevy-...
5,swtchenergy.com,2024-04-24T02:00:00+02:00,Series B,series_b,"[""series_b"", ""venture""]",2.0,$27.2 Million,27200000,https://www.mercomindia.com/funding-and-ma-rou...
6,carnow.com,2024-04-16T02:00:00+02:00,Seed,seed,"[""debt_financing""]",1.0,$40 million,40000000,https://www.prnewswire.com/news-releases/runwa...
7,databricks.com,2024-08-07T02:00:00+02:00,Series I,series_i,"[""series_i"", ""venture""]",1.0,$685 million,685000000,https://iteuropa.com/news/large-language-model...
8,anthropic.com,2024-07-08T02:00:00+02:00,Seed,seed,[],1.0,$50mn,50000000,https://www.arabianbusiness.com/industries/tec...
9,ey.com,2024-04-18T02:00:00+02:00,Seed,seed,[],1.0,AU$10.7M,6865000,https://www.biometricupdate.com/202404/ey-secu...


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Website Domain             26 non-null     object 
 1   Found At                   26 non-null     object 
 2   Financing Type             26 non-null     object 
 3   Financing Type Normalized  26 non-null     object 
 4   Categories                 26 non-null     object 
 5   Investors Count            26 non-null     float64
 6   Amount                     26 non-null     object 
 7   Amount Normalized          26 non-null     int64  
 8   Source Urls                26 non-null     object 
dtypes: float64(1), int64(1), object(7)
memory usage: 2.0+ KB


## DEALING WITH DUPLICATES


In [35]:
# To check for duplicates
duplicates = df.duplicated().sum()
print(f"\nNumber of duplicated rows: {duplicates}")


Number of duplicated rows: 0
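
`df.duplicated()` only flags rows that match in every column. Since some domains appear more than once in this dataset, it may also be worth checking a subset of columns. The sketch below assumes that `Website Domain` plus `Amount Normalized` is a reasonable key; that choice and the sample rows are hypothetical.

```python
import pandas as pd

# Made-up rows; the URLs are placeholders.
sample = pd.DataFrame({
    "Website Domain": ["zenobe.com", "zenobe.com", "canva.com", "canva.com"],
    "Amount Normalized": [522700000, 53671000, 8000000, 8000000],
    "Source Urls": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
        "https://example.com/d",
    ],
})

# Full-row check, as in the cell above.
exact_duplicates = int(sample.duplicated().sum())

# Key-based check: the same company and amount reported by different sources.
key_columns = ["Website Domain", "Amount Normalized"]
key_duplicates = sample.duplicated(subset=key_columns, keep=False)

print("Exact duplicates:", exact_duplicates)
print(sample[key_duplicates])
```

Rows flagged this way are candidates for review rather than automatic removal, because two separate funding rounds can share an amount.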


# QUESTION 1


## Data Exploration Observations

- The dataset consists of 26 rows and 11 columns, with missing values present in 5 columns.
- Data types:
  - Numeric: Investors Count (float64) and Amount Normalized (int64)
  - Categorical/Text: All other columns are of type object (string)

Note: When describing data types, prefer precise names from pandas dtypes and, if relevant, the underlying NumPy dtype (e.g., float64, int64, object).
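
Some columns reported as `object` or `float64` hold values for which a more specific dtype exists. The following sketch shows one possible conversion, assuming that `Found At` should become a timezone-aware datetime and that `Investors Count` is integer-valued; the notebook itself does not perform these conversions.

```python
import pandas as pd

# Two rows copied in shape from the output above.
sample = pd.DataFrame({
    "Found At": ["2024-03-14T01:00:00+01:00", "2024-05-31T02:00:00+02:00"],
    "Investors Count": [1.0, 9.0],
})

typed = sample.copy()

# The strings carry mixed UTC offsets (+01:00 and +02:00), so normalise to UTC.
typed["Found At"] = pd.to_datetime(typed["Found At"], utc=True)

# Nullable integer dtype: keeps whole numbers and can still represent missing values.
typed["Investors Count"] = typed["Investors Count"].astype("Int64")

print(typed.dtypes)
```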

# QUESTION 2

## Data Cleaning, Preprocessing, and Transformation

1) Initial data quality assessment
- Checked for missing values using isnull and computed the count of missing values per column.
- Calculated the percentage of missing values for each column relative to the total number of rows (26).

2) Handling missing values
- Addressed the five columns containing NaNs on a column-by-column basis:
  - Investors Count (numeric): This column exhibited a single extreme outlier. Given the distribution, the median is the appropriate robust statistic for imputing the missing values.
  - Financing Type (object): Missing values were imputed with the mode of the respective column.
  - Financing Type Normalized (object): Missing values were imputed with the mode of the respective column.
  - Effective date (object): This column was dropped because about 77% of its values (20 of 26) are missing.
  - Investors (object): This column was dropped because half of its values (13 of 26) are missing and investor names cannot be meaningfully imputed.

3) Imputation strategies
- Numeric imputation (Investors Count): Use the median to mitigate the influence of outliers and preserve the central tendency.
- Categorical imputation (Financing Type, Financing Type Normalized): Use the mode (most frequent value) to preserve the prevailing category.
- Column-dropping decisions:
  - Columns with too many missing values to impute reliably (Effective date, Investors) were removed to avoid introducing ambiguity into downstream analyses.

4) Summary of the transformation
- The dataset has been cleaned by:
  - Dropping two columns (Effective date, Investors) whose missing values could not be reliably imputed.
  - Imputing missing values in numeric columns with the median.
  - Imputing missing values in categorical columns with the mode.
- Post-cleaning, the data are ready for downstream tasks such as feature engineering, statistical analysis, or modeling.

Notes and recommendations
- If you want to preserve more data, consider alternative imputation strategies (e.g., fillna with a sentinel category for categorical columns or use a model-based imputation).
- Validate the imputation decisions by comparing pre- and post-imputation distributions and by ensuring that the chosen strategies align with domain knowledge.
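
The two recommendations above can be tried together: fill with a sentinel category and compare category shares before and after. This is a sketch on made-up values; the label `"Unknown"` is an assumed sentinel, not something used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up values with half of the entries missing.
financing = pd.Series(
    [np.nan, np.nan, "Series B", np.nan, "Seed", "Seed", np.nan, "Series A"],
    name="Financing Type",
)

mode_filled = financing.fillna(financing.mode()[0])
sentinel_filled = financing.fillna("Unknown")  # hypothetical sentinel label

# Share of each category before and after imputation.
comparison = pd.DataFrame({
    "original": financing.value_counts(normalize=True),
    "mode_filled": mode_filled.value_counts(normalize=True),
    "sentinel_filled": sentinel_filled.value_counts(normalize=True),
}).fillna(0.0).round(3)

print(comparison)
```

In this sample, mode imputation raises the share of `Seed` from 0.5 of the observed values to 0.75 of all rows, while the sentinel keeps the observed categories in their original proportions to one another and makes the missingness visible.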


# QUESTION 3

## Justifications for each technique or decision

- Data exploration and type awareness
  - Why: Knowing the dataset's size (26 rows, 11 columns) and which columns are numeric vs. categorical guides suitable imputation and downstream modeling.
  - Rationale: Distinguishing numeric (e.g., Investors Count as float64) from object types ensures appropriate statistics and avoids invalid operations on text.

- Missing value assessment
  - Why: Quantifying per-column missingness reveals data quality issues, biases, and where remedies are needed.
  - Rationale: Per-column counts and percentages inform whether to impute, drop, or seek additional data.

- Numeric imputation with the median
  - Why: Investors Count has a single outlier; the median best preserves central tendency without being skewed.
  - Rationale: Median is robust to outliers and skew, unlike the mean.

- Categorical imputation with the mode
  - Why: For Financing Type and Financing Type Normalized, imputing with the most frequent category maintains the predominant class.
  - Rationale: Mode imputation is a simple baseline when no extra information is available, although with about 69% of values missing it makes the modal class ('Seed') dominate both columns.

- Column-dropping for critical gaps
  - Why: Effective date (about 77% missing) and Investors (50% missing) hold values that cannot be recovered from the remaining data.
  - Rationale: Dropping these columns avoids ambiguity and misinterpretation downstream.




# QUESTION 4

## Reflections on preprocessing importance

- Gatekeeper role
  - Preprocessing bridges raw data and reliable analysis; mishandling missing values or types biases results and harms models.

- Impact on model quality
  - Imputation choices shape distributions and feature inputs, influencing accuracy and generalization. Robust methods reduce sensitivity to anomalies.

- Pragmatism vs. integrity
  - Real-world data are imperfect. Pragmatic, defensible preprocessing (drop irrecoverable columns, impute with median/mode) enables timely insights while preserving data integrity.

- Reproducibility and transparency
  - Documenting decisions and parameters ensures reproducibility, audits, and collaboration. Share rationale alongside results.

- Domain knowledge integration
  - Imputation should reflect business context; sometimes model-based or domain-specific rules yield better results than simple baselines.
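
One way to support the reproducibility point above is to collect the cleaning steps in a single function that leaves its input untouched. The sketch below mirrors the steps taken in this notebook; the function name `clean_funding_data` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

def clean_funding_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the notebook's cleaning steps to a copy of `raw`."""
    cleaned = raw.copy()

    # Drop the sparsely populated columns.
    cleaned = cleaned.drop(columns=["Effective date", "Investors"])

    # Median for the numeric count, which is robust to the outlier.
    median_count = cleaned["Investors Count"].median()
    cleaned["Investors Count"] = cleaned["Investors Count"].fillna(median_count)

    # Mode for the two financing-type columns.
    for column in ["Financing Type", "Financing Type Normalized"]:
        cleaned[column] = cleaned[column].fillna(cleaned[column].mode()[0])

    return cleaned

# Made-up rows using the notebook's column names.
raw = pd.DataFrame({
    "Website Domain": ["a.example", "b.example", "c.example", "d.example"],
    "Effective date": [np.nan, np.nan, np.nan, "2024-06-26T02:00:00+02:00"],
    "Financing Type": [np.nan, "Seed", "Seed", "Series A"],
    "Financing Type Normalized": [np.nan, "seed", "seed", "series_a"],
    "Investors": [np.nan, "x.example", "z.example", "y.example"],
    "Investors Count": [np.nan, 1.0, 1.0, 9.0],
})

cleaned = clean_funding_data(raw)
print(cleaned)
```

Because the function works on a copy, the raw frame stays available for comparing distributions before and after cleaning.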
