**List of Notable Variables in this Workbook**

- combined_df : dataset dataframe after removing duplicates
- drop_df : dataset dataframe after removing duplicates + missing values
- drop_df_removed_shift : dataset dataframe after removing duplicates + missing values + shifted rows

## Import Data & Data Preprocessing

In [72]:
import pandas as pd

In [73]:
import numpy as np
import os

In [74]:
#pip install pyarrow
#pip install fastparquet

**1. Remove Duplication**

Step 1: Remove the duplicated iteration.  
In the third line of code, the argument passed to pd.read_parquet() in the original file is changed from "files" to "file", so that each pass of the loop reads only its own file.

In [75]:
num_files = 15
files = [f"files/batch_{i}.parquet" for i in range(num_files)]

dfs = [pd.read_parquet(file) for file in files]
combined_df = pd.concat(dfs, ignore_index=True)
combined_df

Unnamed: 0,Investors,Primary Contact,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags
0,Techstars,David Cohen,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t..."
1,Y Combinator,David Lieb,"Founded in 25, Y Combinator is an accelerator ...","Africa, Americas, Asia, Europe, Oceania, Unite...","Biotechnology, Commercial Transportation, Comm...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, europe, oceania, unit...","[accelerator/incubator, early stage vc, later ...","[biotechnology, commercial transportation, com..."
2,Plug and Play Tech Center,Marc Steiner,"Founded in 26, Plug and Play Tech Center is an...","Africa, Americas, Asia, Canada, Europe, Middle...","Aerospace and Defense, Animal Husbandry, Aquac...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, europe, middl...","[accelerator/incubator, early stage vc, later ...","[aerospace and defense, animal husbandry, aqua..."
3,Gaingels,Paul Grossinger,"Founded in 214, Gaingels is a venture capital ...","Africa, Americas, Asia, Canada, Europe, Middle...","Business Products and Services (B2B), Consumer...","Early Stage VC, Later Stage VC, PE Growth/Expa...",Venture Capital,"[africa, americas, asia, canada, europe, middl...","[early stage vc, later stage vc, pe growth/exp...","[business products and services (b2b), consume..."
4,Antler,Magnus Grimeland,"Founded in 217, Antler is a venture capital in...","Australia, Brazil, Canada, China, Denmark, Est...","Agriculture, Business Products and Services (B...","Accelerator/Incubator, Early Stage VC, Seed Round",Venture Capital,"[australia, brazil, canada, china, denmark, es...","[accelerator/incubator, early stage vc, seed r...","[agriculture, business products and services (..."
...,...,...,...,...,...,...,...,...,...,...
155339,ECS Tuning,Imran Jooma,Manufacturer and distributor of automotive par...,,"Commercial Products, Transportation","Add-on, Buyout/LBO, Merger/Acquisition",0,[],"[add-on, buyout/lbo, merger/acquisition]","[commercial products, transportation]"
155340,ECSEL JU,0,,,"Semiconductors, Software",0,0,[],[0],"[semiconductors, software]"
155341,Ecster,0,Operator of payment solutions for both busines...,,0,Merger/Acquisition,0,[],[merger/acquisition],[0]
155342,ECU Health,Michael Waldrum,,,0,Merger/Acquisition,0,[],[merger/acquisition],[0]


Step 2: Check for all other possible duplicates (rows with identical values in every hashable column). 590 such rows were found.

In [76]:
# If a column contains unhashable values (e.g. list / ndarray), duplicated() raises an error --> last three columns
# Thus, only use columns with hashable types --> subset of the other columns
# duplicated() returns a boolean Series indicating whether each row is a duplicate, True = duplicate

check_duplicates = combined_df.duplicated(subset=['Investors', 'Primary Contact', 'Description', 'Geography', 'Preferred Industry', 'Preferred Investment Type', 'Primary Investor Type'])
print( check_duplicates )

print( check_duplicates.sum() )           # Count of duplicates --> 590

0         False
1         False
2         False
3         False
4         False
          ...  
155339    False
155340    False
155341    False
155342    False
155343    False
Length: 155344, dtype: bool
590
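The restriction to hashable columns noted in the cell comments can be reproduced on a toy frame. This is a minimal sketch on made-up data (the frame `toy` and its values are illustrative, not taken from the dataset): `duplicated()` on a list-valued column raises `TypeError`, restricting `subset` avoids it, and converting the lists to tuples is one possible alternative if the tag columns should take part in the check.

```python
import pandas as pd

# Hypothetical toy data: 'tags' holds lists, which are unhashable.
toy = pd.DataFrame({
    "Investors": ["A", "A", "B"],
    "Description": ["x", "x", "y"],
    "tags": [["us"], ["us"], ["uk"]],
})

# duplicated() hashes the row values, so a list column raises TypeError.
try:
    toy.duplicated()
    raised = False
except TypeError:
    raised = True

# Option 1: restrict the check to hashable columns (the approach used in this workbook).
dup_subset = toy.duplicated(subset=["Investors", "Description"])

# Option 2 (an alternative, not used here): convert lists to tuples first.
toy_hashable = toy.assign(tags=toy["tags"].apply(tuple))
dup_all = toy_hashable.duplicated()

print(raised, dup_subset.tolist(), dup_all.tolist())
```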


In [77]:
duplicates_index_number = check_duplicates[check_duplicates]  # keep only the rows flagged True
duplicates_index_number.index

Index([ 11372,  26404,  26415,  26517,  32353,  32422,  32967,  33122,  33561,
        34570,
       ...
       153768, 153769, 153774, 153861, 153899, 154109, 154162, 154866, 154867,
       154873],
      dtype='int64', length=590)

In [78]:
combined_df.drop(index = duplicates_index_number.index, inplace=True)  # Drop duplicates
combined_df

Unnamed: 0,Investors,Primary Contact,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags
0,Techstars,David Cohen,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t..."
1,Y Combinator,David Lieb,"Founded in 25, Y Combinator is an accelerator ...","Africa, Americas, Asia, Europe, Oceania, Unite...","Biotechnology, Commercial Transportation, Comm...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, europe, oceania, unit...","[accelerator/incubator, early stage vc, later ...","[biotechnology, commercial transportation, com..."
2,Plug and Play Tech Center,Marc Steiner,"Founded in 26, Plug and Play Tech Center is an...","Africa, Americas, Asia, Canada, Europe, Middle...","Aerospace and Defense, Animal Husbandry, Aquac...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, europe, middl...","[accelerator/incubator, early stage vc, later ...","[aerospace and defense, animal husbandry, aqua..."
3,Gaingels,Paul Grossinger,"Founded in 214, Gaingels is a venture capital ...","Africa, Americas, Asia, Canada, Europe, Middle...","Business Products and Services (B2B), Consumer...","Early Stage VC, Later Stage VC, PE Growth/Expa...",Venture Capital,"[africa, americas, asia, canada, europe, middl...","[early stage vc, later stage vc, pe growth/exp...","[business products and services (b2b), consume..."
4,Antler,Magnus Grimeland,"Founded in 217, Antler is a venture capital in...","Australia, Brazil, Canada, China, Denmark, Est...","Agriculture, Business Products and Services (B...","Accelerator/Incubator, Early Stage VC, Seed Round",Venture Capital,"[australia, brazil, canada, china, denmark, es...","[accelerator/incubator, early stage vc, seed r...","[agriculture, business products and services (..."
...,...,...,...,...,...,...,...,...,...,...
155339,ECS Tuning,Imran Jooma,Manufacturer and distributor of automotive par...,,"Commercial Products, Transportation","Add-on, Buyout/LBO, Merger/Acquisition",0,[],"[add-on, buyout/lbo, merger/acquisition]","[commercial products, transportation]"
155340,ECSEL JU,0,,,"Semiconductors, Software",0,0,[],[0],"[semiconductors, software]"
155341,Ecster,0,Operator of payment solutions for both busines...,,0,Merger/Acquisition,0,[],[merger/acquisition],[0]
155342,ECU Health,Michael Waldrum,,,0,Merger/Acquisition,0,[],[merger/acquisition],[0]


Step 3: Check for repeated entries of the same investor (rows sharing the same "Investors" and "Description" values)

In [79]:
check_duplicates_2 = combined_df.duplicated(subset=['Investors', 'Description'])
print(check_duplicates_2)
print( check_duplicates_2.sum() )                 # Count of duplicates --> 15044

0         False
1         False
2         False
3         False
4         False
          ...  
155339    False
155340    False
155341    False
155342    False
155343    False
Length: 154754, dtype: bool
15044


In [80]:
duplicates_index_number_2 = check_duplicates_2[check_duplicates_2]  # keep only the rows flagged True
duplicates_index_number_2.index

Index([  1674,   1835,   1837,   1840,   1841,   1842,   1843,   1845,   1847,
         1855,
       ...
       155283, 155286, 155301, 155307, 155309, 155312, 155314, 155317, 155327,
       155331],
      dtype='int64', length=15044)

In [81]:
combined_df.drop(index = duplicates_index_number_2.index, inplace=True)  # Drop duplicates
combined_df

Unnamed: 0,Investors,Primary Contact,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags
0,Techstars,David Cohen,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t..."
1,Y Combinator,David Lieb,"Founded in 25, Y Combinator is an accelerator ...","Africa, Americas, Asia, Europe, Oceania, Unite...","Biotechnology, Commercial Transportation, Comm...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, europe, oceania, unit...","[accelerator/incubator, early stage vc, later ...","[biotechnology, commercial transportation, com..."
2,Plug and Play Tech Center,Marc Steiner,"Founded in 26, Plug and Play Tech Center is an...","Africa, Americas, Asia, Canada, Europe, Middle...","Aerospace and Defense, Animal Husbandry, Aquac...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, europe, middl...","[accelerator/incubator, early stage vc, later ...","[aerospace and defense, animal husbandry, aqua..."
3,Gaingels,Paul Grossinger,"Founded in 214, Gaingels is a venture capital ...","Africa, Americas, Asia, Canada, Europe, Middle...","Business Products and Services (B2B), Consumer...","Early Stage VC, Later Stage VC, PE Growth/Expa...",Venture Capital,"[africa, americas, asia, canada, europe, middl...","[early stage vc, later stage vc, pe growth/exp...","[business products and services (b2b), consume..."
4,Antler,Magnus Grimeland,"Founded in 217, Antler is a venture capital in...","Australia, Brazil, Canada, China, Denmark, Est...","Agriculture, Business Products and Services (B...","Accelerator/Incubator, Early Stage VC, Seed Round",Venture Capital,"[australia, brazil, canada, china, denmark, es...","[accelerator/incubator, early stage vc, seed r...","[agriculture, business products and services (..."
...,...,...,...,...,...,...,...,...,...,...
155339,ECS Tuning,Imran Jooma,Manufacturer and distributor of automotive par...,,"Commercial Products, Transportation","Add-on, Buyout/LBO, Merger/Acquisition",0,[],"[add-on, buyout/lbo, merger/acquisition]","[commercial products, transportation]"
155340,ECSEL JU,0,,,"Semiconductors, Software",0,0,[],[0],"[semiconductors, software]"
155341,Ecster,0,Operator of payment solutions for both busines...,,0,Merger/Acquisition,0,[],[merger/acquisition],[0]
155342,ECU Health,Michael Waldrum,,,0,Merger/Acquisition,0,[],[merger/acquisition],[0]


The variable "combined_df" here is updated with all duplicates removed.
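Steps 2 and 3 can also be expressed with `drop_duplicates`, which keeps the first occurrence by default, just like dropping the rows flagged by `duplicated()`. The sketch below uses made-up data (`toy` and its column values are illustrative only). It also shows that, because the Step 3 columns are a subset of the Step 2 columns, applying Step 3 alone gives the same final rows on this toy frame.

```python
import pandas as pd

# Hypothetical toy data: row 1 is a full duplicate of row 0,
# row 2 repeats the same investor with a different Geography.
toy = pd.DataFrame({
    "Investors": ["A", "A", "A", "B"],
    "Description": ["x", "x", "x", "y"],
    "Geography": ["US", "US", "UK", "DE"],
})

# Step 2 equivalent: drop rows identical across all checked columns.
step2 = toy.drop_duplicates(subset=["Investors", "Description", "Geography"])

# Step 3 equivalent: drop repeated (Investors, Description) pairs.
step3 = step2.drop_duplicates(subset=["Investors", "Description"])

# Applying only the narrower check to the original frame.
step3_only = toy.drop_duplicates(subset=["Investors", "Description"])

print(step2.index.tolist(), step3.index.tolist(), step3_only.index.tolist())
```

Keeping the two steps separate, as done above, still has the benefit of reporting how many rows each kind of duplicate accounts for.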

**2. Handle Missing Values**

In [82]:
# import os
# Define the path to your desktop
#desktop_path = os.path.expanduser("~/Desktop/")

# Define the file path for the CSV file on your desktop
#csv_file_path = os.path.join(desktop_path, 'output.csv')

# Export the DataFrame to a CSV file on your desktop
#combined_df.to_csv(csv_file_path, index=False)

In [83]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 139710 entries, 0 to 155343
Data columns (total 10 columns):
 #   Column                          Non-Null Count   Dtype 
---  ------                          --------------   ----- 
 0   Investors                       139710 non-null  object
 1   Primary Contact                 139709 non-null  object
 2   Description                     139710 non-null  object
 3   Geography                       139710 non-null  object
 4   Preferred Industry              139696 non-null  object
 5   Preferred Investment Type       139168 non-null  object
 6   Primary Investor Type           139090 non-null  object
 7   geography_tags                  139710 non-null  object
 8   preferred_investment_type_tags  139710 non-null  object
 9   preferred_industry_tags         139710 non-null  object
dtypes: object(10)
memory usage: 11.7+ MB


In [84]:
combined_df.drop(columns=['Primary Contact'], inplace=True, errors='ignore')

In [85]:
print(combined_df)

                        Investors  \
0                       Techstars   
1                    Y Combinator   
2       Plug and Play Tech Center   
3                        Gaingels   
4                          Antler   
...                           ...   
155339                 ECS Tuning   
155340                   ECSEL JU   
155341                     Ecster   
155342                 ECU Health   
155343              ECU Worldwide   

                                              Description  \
0       Founded in 26, Techstars is an accelerator bas...   
1       Founded in 25, Y Combinator is an accelerator ...   
2       Founded in 26, Plug and Play Tech Center is an...   
3       Founded in 214, Gaingels is a venture capital ...   
4       Founded in 217, Antler is a venture capital in...   
...                                                   ...   
155339  Manufacturer and distributor of automotive par...   
155340                                                      
155341

In [86]:
combined_df.isnull().sum()

Investors                           0
Description                         0
Geography                           0
Preferred Industry                 14
Preferred Investment Type         542
Primary Investor Type             620
geography_tags                      0
preferred_investment_type_tags      0
preferred_industry_tags             0
dtype: int64

In [87]:
drop_df = combined_df.dropna()

In [88]:
drop_df

Unnamed: 0,Investors,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t..."
1,Y Combinator,"Founded in 25, Y Combinator is an accelerator ...","Africa, Americas, Asia, Europe, Oceania, Unite...","Biotechnology, Commercial Transportation, Comm...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, europe, oceania, unit...","[accelerator/incubator, early stage vc, later ...","[biotechnology, commercial transportation, com..."
2,Plug and Play Tech Center,"Founded in 26, Plug and Play Tech Center is an...","Africa, Americas, Asia, Canada, Europe, Middle...","Aerospace and Defense, Animal Husbandry, Aquac...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, europe, middl...","[accelerator/incubator, early stage vc, later ...","[aerospace and defense, animal husbandry, aqua..."
3,Gaingels,"Founded in 214, Gaingels is a venture capital ...","Africa, Americas, Asia, Canada, Europe, Middle...","Business Products and Services (B2B), Consumer...","Early Stage VC, Later Stage VC, PE Growth/Expa...",Venture Capital,"[africa, americas, asia, canada, europe, middl...","[early stage vc, later stage vc, pe growth/exp...","[business products and services (b2b), consume..."
4,Antler,"Founded in 217, Antler is a venture capital in...","Australia, Brazil, Canada, China, Denmark, Est...","Agriculture, Business Products and Services (B...","Accelerator/Incubator, Early Stage VC, Seed Round",Venture Capital,"[australia, brazil, canada, china, denmark, es...","[accelerator/incubator, early stage vc, seed r...","[agriculture, business products and services (..."
...,...,...,...,...,...,...,...,...,...
155339,ECS Tuning,Manufacturer and distributor of automotive par...,,"Commercial Products, Transportation","Add-on, Buyout/LBO, Merger/Acquisition",0,[],"[add-on, buyout/lbo, merger/acquisition]","[commercial products, transportation]"
155340,ECSEL JU,,,"Semiconductors, Software",0,0,[],[0],"[semiconductors, software]"
155341,Ecster,Operator of payment solutions for both busines...,,0,Merger/Acquisition,0,[],[merger/acquisition],[0]
155342,ECU Health,,,0,Merger/Acquisition,0,[],[merger/acquisition],[0]


In [89]:
drop_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 139086 entries, 0 to 155343
Data columns (total 9 columns):
 #   Column                          Non-Null Count   Dtype 
---  ------                          --------------   ----- 
 0   Investors                       139086 non-null  object
 1   Description                     139086 non-null  object
 2   Geography                       139086 non-null  object
 3   Preferred Industry              139086 non-null  object
 4   Preferred Investment Type       139086 non-null  object
 5   Primary Investor Type           139086 non-null  object
 6   geography_tags                  139086 non-null  object
 7   preferred_investment_type_tags  139086 non-null  object
 8   preferred_industry_tags         139086 non-null  object
dtypes: object(9)
memory usage: 10.6+ MB


In [90]:
drop_df.columns

Index(['Investors', 'Description', 'Geography', 'Preferred Industry',
       'Preferred Investment Type', 'Primary Investor Type', 'geography_tags',
       'preferred_investment_type_tags', 'preferred_industry_tags'],
      dtype='object')

In [91]:
drop_df.isin(['', '0', '[]', '[0]'])

Unnamed: 0,Investors,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
155339,False,False,True,False,False,True,True,False,False
155340,False,True,True,False,True,True,True,False,False
155341,False,False,True,True,False,True,True,False,False
155342,False,True,True,True,False,True,True,False,False


In [92]:
drop_df.isin(['', '0', '[]', '[0]']).any(axis=1)

0         False
1         False
2         False
3         False
4         False
          ...  
155339     True
155340     True
155341     True
155342     True
155343     True
Length: 139086, dtype: bool

In [93]:
~drop_df.isin(['', '0', '[]', '[0]']).any(axis=1)

0          True
1          True
2          True
3          True
4          True
          ...  
155339    False
155340    False
155341    False
155342    False
155343    False
Length: 139086, dtype: bool

In [94]:
drop_df = drop_df[~drop_df.isin(['', '0', '[]', '[0]']).any(axis=1)]
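The three cells above build the filter one piece at a time: a cell-level mask, a row-level mask, then its negation. The sketch below repeats that on made-up data, assuming every cell is a plain string (`toy` and `PLACEHOLDERS` are illustrative names). It additionally counts placeholder hits per column, which may help to see which column is responsible for most of the dropped rows.

```python
import pandas as pd

# Placeholder strings treated as "missing" (same list as in the cell above).
PLACEHOLDERS = ["", "0", "[]", "[0]"]

# Hypothetical toy data, all cells are strings.
toy = pd.DataFrame({
    "Investors": ["A", "B", "C"],
    "Geography": ["Germany", "", "Brazil"],
    "Primary Investor Type": ["Venture Capital", "0", "0"],
})

is_placeholder = toy.isin(PLACEHOLDERS)            # True where a cell is a placeholder
row_has_placeholder = is_placeholder.any(axis=1)   # True where any cell in the row is one
clean = toy[~row_has_placeholder]                  # keep rows without placeholders

# Per-column count of placeholder cells.
per_column = is_placeholder.sum()

print(clean.index.tolist())
print(per_column)
```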


In [95]:
data1 = {
    'Category': ['A', 'B', 'A', 'C', 'B']
}

df3 = pd.DataFrame(data1)

# Convert the categorical column 'Category' into dummy variables
dummy_df = pd.get_dummies(df3['Category'], prefix='Category')

# Concatenate the dummy variables with the original DataFrame
df3 = pd.concat([df3, dummy_df], axis=1)

df3

Unnamed: 0,Category,Category_A,Category_B,Category_C
0,A,True,False,False
1,B,False,True,False
2,A,True,False,False
3,C,False,False,True
4,B,False,True,False


In [96]:
drop_df

Unnamed: 0,Investors,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t..."
1,Y Combinator,"Founded in 25, Y Combinator is an accelerator ...","Africa, Americas, Asia, Europe, Oceania, Unite...","Biotechnology, Commercial Transportation, Comm...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, europe, oceania, unit...","[accelerator/incubator, early stage vc, later ...","[biotechnology, commercial transportation, com..."
2,Plug and Play Tech Center,"Founded in 26, Plug and Play Tech Center is an...","Africa, Americas, Asia, Canada, Europe, Middle...","Aerospace and Defense, Animal Husbandry, Aquac...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, europe, middl...","[accelerator/incubator, early stage vc, later ...","[aerospace and defense, animal husbandry, aqua..."
3,Gaingels,"Founded in 214, Gaingels is a venture capital ...","Africa, Americas, Asia, Canada, Europe, Middle...","Business Products and Services (B2B), Consumer...","Early Stage VC, Later Stage VC, PE Growth/Expa...",Venture Capital,"[africa, americas, asia, canada, europe, middl...","[early stage vc, later stage vc, pe growth/exp...","[business products and services (b2b), consume..."
4,Antler,"Founded in 217, Antler is a venture capital in...","Australia, Brazil, Canada, China, Denmark, Est...","Agriculture, Business Products and Services (B...","Accelerator/Incubator, Early Stage VC, Seed Round",Venture Capital,"[australia, brazil, canada, china, denmark, es...","[accelerator/incubator, early stage vc, seed r...","[agriculture, business products and services (..."
...,...,...,...,...,...,...,...,...,...
155008,ECK.Ventures,"Founded in 217, ECK.Ventures is a venture capi...",Germany,Software,"Early Stage VC, Later Stage VC, Seed Round",<span>0.05 - 0.13</span>,[germany],"[early stage vc, later stage vc, seed round]",[software]
155039,Eclipseon Ventures,"Founded in 217, Eclipseon Ventures is a ventur...",Brazil,"Healthcare, Movies, Music and Entertainment, S...","Early Stage VC, Seed Round",<span>1.00 - 5.00</span>,[brazil],"[early stage vc, seed round]","[healthcare, movies, music and entertainment, ..."
155121,EcoElectron Ventures,"Founded in 1998, EcoElectron Ventures is a ven...",Southern California,"Communication Software, Energy","Early Stage VC, Seed Round",<span>0.15 - 0.50</span>,[southern california],"[early stage vc, seed round]","[communication software, energy]"
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,"[americas, canada, europe, united states]","[bridge, early stage vc, joint venture, later ...","[agriculture, building products, construction ..."


In [97]:
drop_df.info()

# import os
# Define the path to your desktop
desktop_path = os.path.expanduser("~/Desktop/")

# Define the file path for the CSV file on your desktop
csv_file_path = os.path.join(desktop_path, 'output2.csv')

# Export the DataFrame to a CSV file on your desktop
drop_df.to_csv(csv_file_path, index=False)

<class 'pandas.core.frame.DataFrame'>
Index: 12324 entries, 0 to 155262
Data columns (total 9 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Investors                       12324 non-null  object
 1   Description                     12324 non-null  object
 2   Geography                       12324 non-null  object
 3   Preferred Industry              12324 non-null  object
 4   Preferred Investment Type       12324 non-null  object
 5   Primary Investor Type           12324 non-null  object
 6   geography_tags                  12324 non-null  object
 7   preferred_investment_type_tags  12324 non-null  object
 8   preferred_industry_tags         12324 non-null  object
dtypes: object(9)
memory usage: 962.8+ KB


OSError: Cannot save file into a non-existent directory: 'C:\Users\leung\Desktop'
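The `OSError` above is raised because the target directory does not exist on this machine. One possible way around it is to create the directory before writing. This is a sketch only: `export_csv` is a hypothetical helper, not part of the workbook, and the usage below writes to a temporary directory rather than the Desktop.

```python
from pathlib import Path
import tempfile

import pandas as pd


def export_csv(df, directory, filename="output2.csv"):
    """Write df to directory/filename, creating the directory if needed."""
    out_dir = Path(directory).expanduser()
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    out_path = out_dir / filename
    df.to_csv(out_path, index=False)
    return out_path


# Usage on a throwaway directory that does not exist yet.
with tempfile.TemporaryDirectory() as tmp:
    saved = export_csv(pd.DataFrame({"a": [1, 2]}), Path(tmp) / "exports")
    saved_ok = saved.exists()
    round_trip = pd.read_csv(saved)

print(saved_ok, round_trip["a"].tolist())
```

In the workbook this would be called as `export_csv(drop_df, "~/Desktop")`, assuming writing to that location is allowed.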

**3. Remove shifted data.**

Detect shifted rows by looking for "www." in the "Description" column: a web address there means the row's values have been shifted into the wrong columns.

In [98]:
df_removed_shifted = drop_df[['Description']][drop_df['Description'].str.contains('www.')]
df_removed_shifted

Unnamed: 0,Description
118,www.hf.com
251,www.nhqv.com
338,www.tdpfund.com
551,www.andlinger.com
1672,www.jandjgroup.com
...,...
153206,www.dunross.com.cy
153629,www.dynamicpt.com
153758,www.ea-companies.com
154717,www.vertexventures.sg
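One caveat on the pattern: `str.contains` treats its argument as a regular expression by default, so the `.` in `'www.'` matches any character, not only a literal dot. The sketch below, on made-up strings, compares the default behaviour with a literal match (`regex=False`) and an escaped pattern.

```python
import pandas as pd

# Hypothetical descriptions: only the first one contains a literal "www."
s = pd.Series([
    "www.hf.com",
    "A firm known as wwwX Capital",
    "Founded in 1998",
])

regex_match = s.str.contains("www.")                  # '.' matches any character
literal_match = s.str.contains("www.", regex=False)   # plain substring search
escaped_match = s.str.contains(r"www\.")              # regex with an escaped dot

print(regex_match.tolist())    # [True, True, False]
print(literal_match.tolist())  # [True, False, False]
print(escaped_match.tolist())  # [True, False, False]
```

Whether the two variants differ on the real dataset is not checked here; it only matters if some description contains "www" followed by a character other than a dot.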


In [99]:
index_list_removed_shifted = df_removed_shifted.index.tolist()
index_list_removed_shifted

[118,
 251,
 338,
 551,
 1672,
 1867,
 1873,
 1950,
 1959,
 1960,
 2437,
 5767,
 6040,
 6983,
 8058,
 8590,
 9463,
 9474,
 9475,
 9536,
 10128,
 11483,
 11612,
 11826,
 12002,
 12370,
 13490,
 14180,
 14196,
 14418,
 14590,
 14766,
 15135,
 15141,
 15438,
 15733,
 17349,
 17360,
 17959,
 18049,
 18062,
 18170,
 18191,
 18576,
 18787,
 19131,
 19226,
 19972,
 20308,
 21556,
 22484,
 23489,
 24690,
 25442,
 26351,
 27752,
 29005,
 30053,
 30201,
 30466,
 30870,
 32089,
 32555,
 33316,
 36283,
 38338,
 38471,
 38614,
 39302,
 39701,
 39974,
 40586,
 40597,
 40656,
 41211,
 41622,
 41973,
 42174,
 42324,
 42890,
 44752,
 45274,
 45593,
 46461,
 47375,
 47376,
 48286,
 49408,
 49612,
 50268,
 51085,
 52212,
 52578,
 53123,
 53147,
 53535,
 54936,
 55367,
 55590,
 55608,
 55955,
 56121,
 56216,
 56271,
 56625,
 57065,
 57096,
 57695,
 59376,
 59607,
 60028,
 60340,
 60958,
 61254,
 61414,
 62254,
 62735,
 63417,
 63458,
 64478,
 65321,
 65413,
 67823,
 68437,
 70917,
 70923,
 73185,
 73414,


Drop the shifted rows and store the result in a new dataframe, "drop_df_removed_shift".

In [100]:
drop_df_removed_shift = drop_df.drop(index= index_list_removed_shifted )
drop_df_removed_shift

Unnamed: 0,Investors,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t..."
1,Y Combinator,"Founded in 25, Y Combinator is an accelerator ...","Africa, Americas, Asia, Europe, Oceania, Unite...","Biotechnology, Commercial Transportation, Comm...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, europe, oceania, unit...","[accelerator/incubator, early stage vc, later ...","[biotechnology, commercial transportation, com..."
2,Plug and Play Tech Center,"Founded in 26, Plug and Play Tech Center is an...","Africa, Americas, Asia, Canada, Europe, Middle...","Aerospace and Defense, Animal Husbandry, Aquac...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, europe, middl...","[accelerator/incubator, early stage vc, later ...","[aerospace and defense, animal husbandry, aqua..."
3,Gaingels,"Founded in 214, Gaingels is a venture capital ...","Africa, Americas, Asia, Canada, Europe, Middle...","Business Products and Services (B2B), Consumer...","Early Stage VC, Later Stage VC, PE Growth/Expa...",Venture Capital,"[africa, americas, asia, canada, europe, middl...","[early stage vc, later stage vc, pe growth/exp...","[business products and services (b2b), consume..."
4,Antler,"Founded in 217, Antler is a venture capital in...","Australia, Brazil, Canada, China, Denmark, Est...","Agriculture, Business Products and Services (B...","Accelerator/Incubator, Early Stage VC, Seed Round",Venture Capital,"[australia, brazil, canada, china, denmark, es...","[accelerator/incubator, early stage vc, seed r...","[agriculture, business products and services (..."
...,...,...,...,...,...,...,...,...,...
155008,ECK.Ventures,"Founded in 217, ECK.Ventures is a venture capi...",Germany,Software,"Early Stage VC, Later Stage VC, Seed Round",<span>0.05 - 0.13</span>,[germany],"[early stage vc, later stage vc, seed round]",[software]
155039,Eclipseon Ventures,"Founded in 217, Eclipseon Ventures is a ventur...",Brazil,"Healthcare, Movies, Music and Entertainment, S...","Early Stage VC, Seed Round",<span>1.00 - 5.00</span>,[brazil],"[early stage vc, seed round]","[healthcare, movies, music and entertainment, ..."
155121,EcoElectron Ventures,"Founded in 1998, EcoElectron Ventures is a ven...",Southern California,"Communication Software, Energy","Early Stage VC, Seed Round",<span>0.15 - 0.50</span>,[southern california],"[early stage vc, seed round]","[communication software, energy]"
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,"[americas, canada, europe, united states]","[bridge, early stage vc, joint venture, later ...","[agriculture, building products, construction ..."


**4. Work on One Hot Encoding**

For each tag column, carry out the steps below separately, to avoid duplicates when exploding in Step 1.  
Step 1: Extract the unique items in the tag column, then review and categorize them in Excel.  
Step 2: Import the Excel file back into a dataframe, look up the categories against the existing dataset dataframe, then apply one-hot encoding.  
Step 3: Combine the resulting encoded columns with the existing dataset dataframe.
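The three steps can be sketched end to end on made-up data. Everything here is illustrative: the frame `toy`, the `tag_to_category` dictionary (standing in for the lookup table reviewed in Excel) and the `geo` prefix are assumptions, not the workbook's actual mapping. Taking the `max` per original row index after `get_dummies` collapses the exploded rows back to one row per investor.

```python
import pandas as pd

# Hypothetical toy data with a list-valued tag column.
toy = pd.DataFrame({
    "Investors": ["A", "B"],
    "geography_tags": [["germany", "bay area"], ["brazil", "germany"]],
})

# Stand-in for the categorization done in Excel (assumed mapping).
tag_to_category = {
    "germany": "Europe",
    "bay area": "North America",
    "brazil": "South America",
}

# Step 1: unique tags to review and categorize.
unique_tags = toy["geography_tags"].explode().dropna().unique().tolist()

# Step 2: look up each tag's category, one-hot encode,
# then collapse back to one row per original index.
exploded = toy["geography_tags"].explode().map(tag_to_category)
dummies = pd.get_dummies(exploded, prefix="geo").groupby(level=0).max()

# Step 3: attach the encoded columns to the original frame.
encoded = toy.join(dummies)

print(unique_tags)
print(encoded.columns.tolist())
```

Tags missing from the mapping become NaN after `map` and are ignored by `get_dummies` by default, so such rows simply get no category flag.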

* Extract Tag Column - Geography Tags

In [101]:
geography_tags_list = drop_df_removed_shift['geography_tags'].explode().dropna().unique().tolist()
print("Geography Tags:", geography_tags_list)

df_to_excel = pd.DataFrame(geography_tags_list)
df_to_excel.columns = ['Location']
df_to_excel

Geography Tags: ['africa', 'americas', 'asia', 'canada', 'middle east', 'oceania', 'united kingdom', 'united states', 'europe', 'australia', 'brazil', 'china', 'denmark', 'estonia', 'france', 'germany', 'india', 'indonesia', 'japan', 'kenya', 'malaysia', 'netherlands', 'norway', 'pakistan', 'philippines', 'portugal', 'singapore', 'south korea', 'spain', 'sweden', 'thailand', 'united arab emirates', 'vietnam', 'north america', 'central america', 'south america', 'israel', 'ireland', 'bay area', 'mexico', 'new york', 'new york metro', 'costa rica', 'cuba', 'dominica', 'el salvador', 'east asia', 'mid atlantic', 'northeast', 'south asia', 'luxembourg', 'hong kong', 'southeast asia', 'taiwan', 'andorra', 'austria', 'belgium', 'czech republic', 'hungary', 'liechtenstein', 'monaco', 'northern europe', 'poland', 'slovakia', 'slovenia', 'switzerland', 'southern europe', 'western europe', 'midwest', 'southeast', 'west coast', 'eastern europe', 'south', 'new zealand', 'indiana', 'california', 'f

Unnamed: 0,Location
0,africa
1,americas
2,asia
3,canada
4,middle east
...,...
572,chad
573,serves as chairman at hivello. he is a co-foun...
574,serves as chairman at red swan ventures. he is...
575,glass bottles


In [102]:
%pip install pycountry

import pandas as pd
import zipfile
import io
import requests
import pycountry

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Step 1: Download and read the GeoNames city data
url = "http://download.geonames.org/export/dump/cities15000.zip"
response = requests.get(url)

if response.status_code == 200:
    with zipfile.ZipFile(io.BytesIO(response.content)) as z:
        with z.open('cities15000.txt') as f:
            columns = [
                'geonameid', 'name', 'asciiname', 'alternatenames', 'latitude', 'longitude',
                'feature_class', 'feature_code', 'country_code', 'cc2', 'admin1_code',
                'admin2_code', 'admin3_code', 'admin4_code', 'population', 'elevation',
                'dem', 'timezone', 'modification_date'
            ]
            df_geo = pd.read_csv(f, sep='\t', header=None, names=columns)
else:
    raise Exception("GeoNames download failed; please check your network connection.")

# Step 2: Build a place name -> country name lookup dictionary
def get_country_name(code):
    try:
        return pycountry.countries.get(alpha_2=code).name
    except Exception:
        # Unknown or missing country code
        return None

df_geo['country_name'] = df_geo['country_code'].apply(get_country_name)

# Use asciiname to map each city/region to its country name
location_to_country = dict(zip(df_geo['asciiname'].str.lower(), df_geo['country_name']))

# Add a direct mapping for every country name (e.g. "Germany" -> "Germany")
for country in pycountry.countries:
    location_to_country[country.name.lower()] = country.name

# Step 3: Define the conversion function
def standardize_location(value):
    if pd.isna(value):
        return 'N/A'
    name = value.strip().lower()
    return location_to_country.get(name, 'N/A')

# Step 4: Apply it to df_to_excel
# Note: city names can collide with region tags (in the output below, "asia"
# resolves to "Philippines"), so the results still need a manual review.
df_to_excel['Standardized_Country'] = df_to_excel['Location'].apply(standardize_location)

# Step 5: Preview the result (optional)
print(df_to_excel[['Location', 'Standardized_Country']])

# Step 6: Save the result to Excel
df_to_excel.to_excel('standardized_locations.xlsx', index=False)

                                              Location Standardized_Country
0                                               africa                  N/A
1                                             americas                  N/A
2                                                 asia          Philippines
3                                               canada               Canada
4                                          middle east                  N/A
..                                                 ...                  ...
573                                               chad                 Chad
574  serves as chairman at hivello. he is a co-foun...                  N/A
575  serves as chairman at red swan ventures. he is...                  N/A
576                                      glass bottles                  N/A
577                              mining and industrial                  N/A

[578 rows x 2 columns]
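
In the output above, the region tag "asia" is resolved to "Philippines", presumably because the city lookup contains an entry with that name. One possible guard is to check a list of known region tags before consulting the lookup. The sketch below assumes a hypothetical `REGION_TAGS` set and a hypothetical helper `standardize_with_regions`; a tiny dictionary stands in for `location_to_country` so the example runs without downloading anything.

```python
# Hypothetical set of region-level tags that should not resolve to a country
REGION_TAGS = {"africa", "americas", "asia", "europe", "middle east", "oceania"}

def standardize_with_regions(value, lookup, regions=REGION_TAGS):
    """Return a country name, 'REGION' for known region tags, or 'N/A'."""
    if value is None or value != value:  # None or NaN
        return "N/A"
    name = str(value).strip().lower()
    if name in regions:
        return "REGION"
    return lookup.get(name, "N/A")

# Tiny stand-in for location_to_country, including the city-name collision
toy_lookup = {"asia": "Philippines", "canada": "Canada", "chad": "Chad"}

results = {
    tag: standardize_with_regions(tag, toy_lookup)
    for tag in ["asia", "canada", "glass bottles"]
}
print(results)
```

Checking the region set first means a collision in the city data can no longer override a tag that is known to be a region.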


In [103]:
geography_categories = pd.read_excel('Tag Columns Classification.xlsx', sheet_name='geography_tags')
geography_categories

Unnamed: 0,Location,Standardized_Country,2nd,3rd
0,africa,,africa,Africa
1,americas,,americas,americas
2,asia,Philippines,Philippines,Philippines
3,canada,Canada,Canada,Canada
4,middle east,,middle east,
...,...,...,...,...
573,chad,Chad,Chad,
574,serves as chairman at hivello. he is a co-foun...,,,
575,serves as chairman at red swan ventures. he is...,,,
576,glass bottles,,,


* Extract Tag Column - Preferred Investment Type Tags

Extract all unique items in the column to an Excel file and review them manually.

In [104]:
investment_tags = pd.Series( drop_df_removed_shift['preferred_investment_type_tags'].explode().unique().tolist() )

# To output an excel file, remove the # sign in the next line 
# investment_tags.to_excel('preferred_investment_type_tags.xlsx', index=False, header=['Preferred Investment Type Tags'])

After the manual review, import the corresponding Excel worksheet.

In [105]:
investment_categories = pd.read_excel('Tag Columns Classification.xlsx', sheet_name='preferred_investment_type_tags')
investment_categories

Unnamed: 0,Preferred Investment Type Tags,Investment Categories
0,accelerator/incubator,Accelerator/incubator
1,early stage vc,Venture capital
2,later stage vc,Venture capital
3,seed round,Seed round
4,pe growth/expansion,Private equity
...,...,...
234,pension and provident,Pension and provident
235,served as non-executive director at unith. he ...,Not relevant
236,new york,Not relevant
237,aluminum cans,Not relevant


* Extract Tag Column - Preferred Industry Tags

Extract all unique items in the column to an Excel file and review them manually.

In [106]:
industry_tags = pd.Series( drop_df_removed_shift['preferred_industry_tags'].explode().unique().tolist() )        

# To output an excel file, remove the # sign in the next line 
# industry_tags.to_excel('preferred_industry_tags.xlsx', index=False, header=['preferred_industry_tags'])      

After the manual review, import the corresponding Excel worksheet.

In [107]:
industry_categories = pd.read_excel('Tag Columns Classification.xlsx', sheet_name='preferred_industry_tags')
industry_categories = industry_categories.loc[:,'preferred_industry_tags':'industries categories']
industry_categories

Unnamed: 0,preferred_industry_tags,industries categories
0,beverages,beverages
1,computer hardware,computer hardware
2,education and training services (b2b),education and training
3,educational and training services (b2c),
4,energy,energy
...,...,...
418,projects assistance,
419,serves as chairman at apollo crypto. he is an ...,
420,serves as a chief executive officer at pumpkin...,
421,pet bottles,


**Do lookup after all three categories are done**

1.1 Merge geography categories into dataset dataframe

In [108]:
# Explode the tags so that each tag is on its own row
exploded_geography = drop_df_removed_shift.explode('geography_tags')
# merge() resets the index, so save the exploded index to restore it afterwards
exploded_geography_index = exploded_geography.index


# Left-merge on the exploded tags
# (row order and count are preserved as long as 'Location' has no duplicates)
drop_df_removed_shift_1 = exploded_geography.merge(
    geography_categories,
    left_on='geography_tags',
    right_on='Location',
    how='left'
)
# Restore the exploded (repeated) index after the merge
drop_df_removed_shift_1.index = exploded_geography_index
drop_df_removed_shift_1

Unnamed: 0,Investors,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags,Location,Standardized_Country,2nd,3rd
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,africa,"[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t...",africa,,africa,Africa
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,americas,"[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t...",americas,,americas,americas
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,asia,"[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t...",asia,Philippines,Philippines,Philippines
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,canada,"[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t...",canada,Canada,Canada,Canada
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,middle east,"[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t...",middle east,,middle east,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,americas,"[bridge, early stage vc, joint venture, later ...","[agriculture, building products, construction ...",americas,,americas,americas
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,canada,"[bridge, early stage vc, joint venture, later ...","[agriculture, building products, construction ...",canada,Canada,Canada,Canada
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,europe,"[bridge, early stage vc, joint venture, later ...","[agriculture, building products, construction ...",europe,,europe,europe
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,united states,"[bridge, early stage vc, joint venture, later ...","[agriculture, building products, construction ...",united states,United States,United States,USA
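
Restoring the index after the merge relies on the merge returning exactly the same rows in the same order, which holds only if each tag appears once in the lookup sheet. Two ways to make that assumption explicit are sketched below on toy data (the frames and values are illustrative, not taken from the workbook): `Series.map`, which keeps the exploded index as it is, and the `validate` argument of `merge`, which raises an error if the lookup keys are not unique.

```python
import pandas as pd

# Toy data with a non-contiguous index, as in a filtered dataset
toy_df = pd.DataFrame(
    {"Investors": ["A", "B"], "geography_tags": [["canada", "europe"], ["europe"]]},
    index=[0, 7],
)
toy_categories = pd.DataFrame(
    {"Location": ["canada", "europe"], "3rd": ["Canada", "europe"]}
)

exploded = toy_df.explode("geography_tags")

# Option A: Series.map keeps the exploded index, so nothing needs restoring
lookup = toy_categories.drop_duplicates("Location").set_index("Location")["3rd"]
mapped = exploded.copy()
mapped["3rd"] = mapped["geography_tags"].map(lookup)

# Option B: keep the merge, but let pandas verify that lookup keys are unique
merged = exploded.merge(
    toy_categories,
    left_on="geography_tags",
    right_on="Location",
    how="left",
    validate="many_to_one",
)
merged.index = exploded.index

print(mapped)
```

Option A carries over only the mapped column, while Option B keeps every column of the lookup sheet, as the merge in this notebook does.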


1.2 One-hot encoding for Geography Tags

In [109]:
drop_df_removed_shift_1_encoded = pd.get_dummies(drop_df_removed_shift_1, columns=['3rd'])
drop_df_removed_shift_1_encoded


# To check the columns of the DataFrame after encoding, remove the # sign in the next two lines
# a = pd.DataFrame(drop_df_removed_shift_1_encoded.columns.to_list())
# a.to_csv('columns_list_geography.csv', index=False)

Unnamed: 0,Investors,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags,Location,...,3rd_Norway,3rd_Philippines,3rd_Taiwan,3rd_UK,3rd_UK: a British Overseas Territory,3rd_USA,3rd_america,3rd_americas,3rd_canada,3rd_europe
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,africa,"[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t...",africa,...,False,False,False,False,False,False,False,False,False,False
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,americas,"[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t...",americas,...,False,False,False,False,False,False,False,True,False,False
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,asia,"[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t...",asia,...,False,True,False,False,False,False,False,False,False,False
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,canada,"[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t...",canada,...,False,False,False,False,False,False,False,False,False,False
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,middle east,"[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t...",middle east,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,americas,"[bridge, early stage vc, joint venture, later ...","[agriculture, building products, construction ...",americas,...,False,False,False,False,False,False,False,True,False,False
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,canada,"[bridge, early stage vc, joint venture, later ...","[agriculture, building products, construction ...",canada,...,False,False,False,False,False,False,False,False,False,False
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,europe,"[bridge, early stage vc, joint venture, later ...","[agriculture, building products, construction ...",europe,...,False,False,False,False,False,False,False,False,False,True
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,united states,"[bridge, early stage vc, joint venture, later ...","[agriculture, building products, construction ...",united states,...,False,False,False,False,False,True,False,False,False,False


Combine rows with the same index to count the number of geography tags in each category for each row.  
Note that the other columns are also aggregated, which duplicates their text.  
Therefore, only the encoded columns are extracted at the end; they are combined with the original dataset dataframe later.

In [110]:
# Note: groupby().sum() also aggregates the other columns, which duplicates their text.
drop_df_removed_shift_1_encoded.groupby(level=0).sum()

# To avoid that issue, extract only the encoded columns.
# This label slice keeps only the columns from "3rd_Norway" through "3rd_europe".
geography_encode_columns = drop_df_removed_shift_1_encoded.groupby(level=0).sum().loc[:,"3rd_Norway":"3rd_europe"]
geography_encode_columns

Unnamed: 0,3rd_Norway,3rd_Philippines,3rd_Taiwan,3rd_UK,3rd_UK: a British Overseas Territory,3rd_USA,3rd_america,3rd_americas,3rd_canada,3rd_europe
0,0,1,0,1,0,1,0,1,0,0
1,0,1,0,0,0,1,0,1,0,1
2,0,1,0,0,0,1,0,1,0,1
3,0,1,0,0,0,1,0,1,0,1
4,1,1,0,1,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
155008,0,0,0,0,0,0,0,0,0,0
155039,0,0,0,0,0,0,0,0,0,0
155121,0,0,0,0,0,1,0,0,0,0
155138,0,0,0,0,0,1,0,1,0,1
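
The label slice above starts at `3rd_Norway`, so any encoded column that sorts before it would be left out of the result. A possible alternative, sketched here on toy data with illustrative values, is to pick the encoded columns by their `3rd_` prefix and to aggregate only those columns, so the text columns are never summed at all.

```python
import pandas as pd

# Toy exploded frame: the index repeats once per tag of the original row
toy = pd.DataFrame(
    {"Investors": ["A", "A", "B"], "3rd": ["Africa", "USA", "USA"]},
    index=[0, 0, 1],
)
toy_encoded = pd.get_dummies(toy, columns=["3rd"])

# Pick every dummy column by its prefix rather than by first/last label
dummy_cols = [c for c in toy_encoded.columns if c.startswith("3rd_")]

# Aggregate only those columns, one row per original index value
geo_counts = toy_encoded[dummy_cols].groupby(level=0).sum()
print(geo_counts)
```

Selecting by prefix also keeps working if the reviewed categories change and the first or last column name is no longer the same.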


Combine the encoded columns with the original dataset dataframe

In [111]:
dataset_encoded_1 = drop_df_removed_shift.merge(
    geography_encode_columns,
    left_index=True,
    right_index=True,
    how='inner'
)

dataset_encoded_1

# To output an excel file, remove the # sign in the next line
# dataset_encoded_1.to_excel('dataset_encoded_1.xlsx', index=False)

Unnamed: 0,Investors,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags,3rd_Norway,3rd_Philippines,3rd_Taiwan,3rd_UK,3rd_UK: a British Overseas Territory,3rd_USA,3rd_america,3rd_americas,3rd_canada,3rd_europe
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t...",0,1,0,1,0,1,0,1,0,0
1,Y Combinator,"Founded in 25, Y Combinator is an accelerator ...","Africa, Americas, Asia, Europe, Oceania, Unite...","Biotechnology, Commercial Transportation, Comm...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, europe, oceania, unit...","[accelerator/incubator, early stage vc, later ...","[biotechnology, commercial transportation, com...",0,1,0,0,0,1,0,1,0,1
2,Plug and Play Tech Center,"Founded in 26, Plug and Play Tech Center is an...","Africa, Americas, Asia, Canada, Europe, Middle...","Aerospace and Defense, Animal Husbandry, Aquac...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, europe, middl...","[accelerator/incubator, early stage vc, later ...","[aerospace and defense, animal husbandry, aqua...",0,1,0,0,0,1,0,1,0,1
3,Gaingels,"Founded in 214, Gaingels is a venture capital ...","Africa, Americas, Asia, Canada, Europe, Middle...","Business Products and Services (B2B), Consumer...","Early Stage VC, Later Stage VC, PE Growth/Expa...",Venture Capital,"[africa, americas, asia, canada, europe, middl...","[early stage vc, later stage vc, pe growth/exp...","[business products and services (b2b), consume...",0,1,0,0,0,1,0,1,0,1
4,Antler,"Founded in 217, Antler is a venture capital in...","Australia, Brazil, Canada, China, Denmark, Est...","Agriculture, Business Products and Services (B...","Accelerator/Incubator, Early Stage VC, Seed Round",Venture Capital,"[australia, brazil, canada, china, denmark, es...","[accelerator/incubator, early stage vc, seed r...","[agriculture, business products and services (...",1,1,0,1,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155008,ECK.Ventures,"Founded in 217, ECK.Ventures is a venture capi...",Germany,Software,"Early Stage VC, Later Stage VC, Seed Round",<span>0.05 - 0.13</span>,[germany],"[early stage vc, later stage vc, seed round]",[software],0,0,0,0,0,0,0,0,0,0
155039,Eclipseon Ventures,"Founded in 217, Eclipseon Ventures is a ventur...",Brazil,"Healthcare, Movies, Music and Entertainment, S...","Early Stage VC, Seed Round",<span>1.00 - 5.00</span>,[brazil],"[early stage vc, seed round]","[healthcare, movies, music and entertainment, ...",0,0,0,0,0,0,0,0,0,0
155121,EcoElectron Ventures,"Founded in 1998, EcoElectron Ventures is a ven...",Southern California,"Communication Software, Energy","Early Stage VC, Seed Round",<span>0.15 - 0.50</span>,[southern california],"[early stage vc, seed round]","[communication software, energy]",0,0,0,0,0,1,0,0,0,0
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,"[americas, canada, europe, united states]","[bridge, early stage vc, joint venture, later ...","[agriculture, building products, construction ...",0,0,0,0,0,1,0,1,0,1


2.1 Merge investment categories into dataset dataframe

In [112]:
# Explode the tags so that each tag is on its own row
exploded_investment = drop_df_removed_shift.explode('preferred_investment_type_tags')
# merge() resets the index, so save the exploded index to restore it afterwards
exploded_investment_index = exploded_investment.index


# Left-merge on the exploded tags
# (row order and count are preserved as long as the lookup column has no duplicates)
drop_df_removed_shift_2 = exploded_investment.merge(
    investment_categories,
    left_on='preferred_investment_type_tags',
    right_on='Preferred Investment Type Tags',
    how='left'
)
# Restore the exploded (repeated) index after the merge
drop_df_removed_shift_2.index = exploded_investment_index
drop_df_removed_shift_2

Unnamed: 0,Investors,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags,Preferred Investment Type Tags,Investment Categories
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...",accelerator/incubator,"[beverages, computer hardware, education and t...",accelerator/incubator,Accelerator/incubator
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...",early stage vc,"[beverages, computer hardware, education and t...",early stage vc,Venture capital
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...",later stage vc,"[beverages, computer hardware, education and t...",later stage vc,Venture capital
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...",seed round,"[beverages, computer hardware, education and t...",seed round,Seed round
1,Y Combinator,"Founded in 25, Y Combinator is an accelerator ...","Africa, Americas, Asia, Europe, Oceania, Unite...","Biotechnology, Commercial Transportation, Comm...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, europe, oceania, unit...",accelerator/incubator,"[biotechnology, commercial transportation, com...",accelerator/incubator,Accelerator/incubator
...,...,...,...,...,...,...,...,...,...,...,...
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,"[americas, canada, europe, united states]",pe growth/expansion,"[agriculture, building products, construction ...",pe growth/expansion,Private equity
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],accelerator/incubator,"[agriculture, consumer products and services (...",accelerator/incubator,Accelerator/incubator
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],debt refinancing,"[agriculture, consumer products and services (...",debt refinancing,Debt
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],early stage vc,"[agriculture, consumer products and services (...",early stage vc,Venture capital


2.2 One-hot encoding for Investment Tags

In [113]:
drop_df_removed_shift_2_encoded = pd.get_dummies(drop_df_removed_shift_2, columns=['Investment Categories'])
drop_df_removed_shift_2_encoded


# To check the columns of the DataFrame after encoding, remove the # sign in the next two lines
# a = pd.DataFrame(drop_df_removed_shift_2_encoded.columns.to_list())
# a.to_csv('columns_list_investment.csv', index=False)

Unnamed: 0,Investors,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags,Preferred Investment Type Tags,...,Investment Categories_Not relevant,Investment Categories_Not specific,Investment Categories_Pension and provident,Investment Categories_Private equity,Investment Categories_Privatization,Investment Categories_Public-private partnership,Investment Categories_Recapitalization,Investment Categories_Sale-leaseback,Investment Categories_Seed round,Investment Categories_Venture capital
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...",accelerator/incubator,"[beverages, computer hardware, education and t...",accelerator/incubator,...,False,False,False,False,False,False,False,False,False,False
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...",early stage vc,"[beverages, computer hardware, education and t...",early stage vc,...,False,False,False,False,False,False,False,False,False,True
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...",later stage vc,"[beverages, computer hardware, education and t...",later stage vc,...,False,False,False,False,False,False,False,False,False,True
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...",seed round,"[beverages, computer hardware, education and t...",seed round,...,False,False,False,False,False,False,False,False,True,False
1,Y Combinator,"Founded in 25, Y Combinator is an accelerator ...","Africa, Americas, Asia, Europe, Oceania, Unite...","Biotechnology, Commercial Transportation, Comm...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, europe, oceania, unit...",accelerator/incubator,"[biotechnology, commercial transportation, com...",accelerator/incubator,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,"[americas, canada, europe, united states]",pe growth/expansion,"[agriculture, building products, construction ...",pe growth/expansion,...,False,False,False,True,False,False,False,False,False,False
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],accelerator/incubator,"[agriculture, consumer products and services (...",accelerator/incubator,...,False,False,False,False,False,False,False,False,False,False
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],debt refinancing,"[agriculture, consumer products and services (...",debt refinancing,...,False,False,False,False,False,False,False,False,False,False
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],early stage vc,"[agriculture, consumer products and services (...",early stage vc,...,False,False,False,False,False,False,False,False,False,True


Combine rows with the same index to count the number of investment types in each category for each row.  
Note that the other columns are also aggregated, which duplicates their text.  
Therefore, only the encoded columns are extracted at the end; they are combined with the original dataset dataframe later.

In [114]:
# Note: groupby().sum() also aggregates the other columns, which duplicates their text.
drop_df_removed_shift_2_encoded.groupby(level=0).sum()

# To avoid that issue, extract only the encoded columns:
investment_encode_columns = drop_df_removed_shift_2_encoded.groupby(level=0).sum().loc[:,"Investment Categories_Accelerator/incubator":"Investment Categories_Venture capital"]
investment_encode_columns

Unnamed: 0,Investment Categories_Accelerator/incubator,Investment Categories_Angel investment,Investment Categories_Asset management,Investment Categories_Bridge,Investment Categories_Buyout,Investment Categories_Crowdfunding,Investment Categories_Debt,Investment Categories_Distressed / Bankruptcy,Investment Categories_Distressed M&A,Investment Categories_Divestiture,...,Investment Categories_Not relevant,Investment Categories_Not specific,Investment Categories_Pension and provident,Investment Categories_Private equity,Investment Categories_Privatization,Investment Categories_Public-private partnership,Investment Categories_Recapitalization,Investment Categories_Sale-leaseback,Investment Categories_Seed round,Investment Categories_Venture capital
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,2
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,2
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,2
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,2
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155008,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,2
155039,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
155121,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
155138,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,2
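
Because several tags can map to the same category (for example, both "early stage vc" and "later stage vc" map to "Venture capital"), the summed columns hold counts such as 2 rather than strict 0/1 indicators. If binary flags are wanted instead of counts, one option is to threshold the counts. This is a sketch on toy values, not a step taken from the notebook.

```python
import pandas as pd

# Toy counts, mirroring rows where two VC tags share one category
counts = pd.DataFrame(
    {
        "Investment Categories_Seed round": [1, 0],
        "Investment Categories_Venture capital": [2, 1],
    },
    index=[0, 4],
)

# 1 if the investor has at least one tag in the category, else 0
flags = counts.gt(0).astype(int)
print(flags)
```

Keeping the counts preserves how many tags fall into each category, while the flags follow the usual one-hot convention; which is preferable depends on the downstream model.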


Combine the encoded columns with the original dataset dataframe

In [115]:
dataset_encoded_2 = dataset_encoded_1.merge(
    investment_encode_columns,
    left_index=True,
    right_index=True,
    how='inner'
)

dataset_encoded_2

# To output an excel file, remove the # sign in the next line
# dataset_encoded_2.to_excel('dataset_encoded_2.xlsx', index=False)

Unnamed: 0,Investors,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags,3rd_Norway,...,Investment Categories_Not relevant,Investment Categories_Not specific,Investment Categories_Pension and provident,Investment Categories_Private equity,Investment Categories_Privatization,Investment Categories_Public-private partnership,Investment Categories_Recapitalization,Investment Categories_Sale-leaseback,Investment Categories_Seed round,Investment Categories_Venture capital
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t...",0,...,0,0,0,0,0,0,0,0,1,2
1,Y Combinator,"Founded in 25, Y Combinator is an accelerator ...","Africa, Americas, Asia, Europe, Oceania, Unite...","Biotechnology, Commercial Transportation, Comm...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, europe, oceania, unit...","[accelerator/incubator, early stage vc, later ...","[biotechnology, commercial transportation, com...",0,...,0,0,0,0,0,0,0,0,1,2
2,Plug and Play Tech Center,"Founded in 26, Plug and Play Tech Center is an...","Africa, Americas, Asia, Canada, Europe, Middle...","Aerospace and Defense, Animal Husbandry, Aquac...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, europe, middl...","[accelerator/incubator, early stage vc, later ...","[aerospace and defense, animal husbandry, aqua...",0,...,0,0,0,0,0,0,0,0,1,2
3,Gaingels,"Founded in 214, Gaingels is a venture capital ...","Africa, Americas, Asia, Canada, Europe, Middle...","Business Products and Services (B2B), Consumer...","Early Stage VC, Later Stage VC, PE Growth/Expa...",Venture Capital,"[africa, americas, asia, canada, europe, middl...","[early stage vc, later stage vc, pe growth/exp...","[business products and services (b2b), consume...",0,...,0,0,0,1,0,0,0,0,1,2
4,Antler,"Founded in 217, Antler is a venture capital in...","Australia, Brazil, Canada, China, Denmark, Est...","Agriculture, Business Products and Services (B...","Accelerator/Incubator, Early Stage VC, Seed Round",Venture Capital,"[australia, brazil, canada, china, denmark, es...","[accelerator/incubator, early stage vc, seed r...","[agriculture, business products and services (...",1,...,0,0,0,0,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155008,ECK.Ventures,"Founded in 217, ECK.Ventures is a venture capi...",Germany,Software,"Early Stage VC, Later Stage VC, Seed Round",<span>0.05 - 0.13</span>,[germany],"[early stage vc, later stage vc, seed round]",[software],0,...,0,0,0,0,0,0,0,0,1,2
155039,Eclipseon Ventures,"Founded in 217, Eclipseon Ventures is a ventur...",Brazil,"Healthcare, Movies, Music and Entertainment, S...","Early Stage VC, Seed Round",<span>1.00 - 5.00</span>,[brazil],"[early stage vc, seed round]","[healthcare, movies, music and entertainment, ...",0,...,0,0,0,0,0,0,0,0,1,1
155121,EcoElectron Ventures,"Founded in 1998, EcoElectron Ventures is a ven...",Southern California,"Communication Software, Energy","Early Stage VC, Seed Round",<span>0.15 - 0.50</span>,[southern california],"[early stage vc, seed round]","[communication software, energy]",0,...,0,0,0,0,0,0,0,0,1,1
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,"[americas, canada, europe, united states]","[bridge, early stage vc, joint venture, later ...","[agriculture, building products, construction ...",0,...,0,0,0,1,0,0,0,0,0,2
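
Because the merge above uses `how='inner'` on the index, any row whose index label is missing from either frame is silently dropped. The following is a minimal sketch of how that could be checked, using small made-up frames (`left` and `right` are hypothetical stand-ins for `dataset_encoded_1` and `investment_encode_columns`, not the real data):

```python
import pandas as pd

# Hypothetical stand-ins for dataset_encoded_1 and investment_encode_columns
left = pd.DataFrame({"Investors": ["A", "B", "C"]}, index=[0, 1, 3])
right = pd.DataFrame({"Investment Categories_Seed round": [1, 0]}, index=[0, 3])

# how='inner' keeps only the index labels present in both frames
merged = left.merge(right, left_index=True, right_index=True, how="inner")

# Index labels that exist on the left but did not survive the inner merge
dropped = left.index.difference(merged.index)

print(merged)
print("dropped index labels:", dropped.tolist())
```

If `dropped` is non-empty on the real data, `how='left'` would keep those rows with missing values in the encoded columns instead.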


3.1 Merge industry categories into dataset dataframe

In [116]:
# Explode the tags so that each tag gets its own row
exploded_industry = drop_df_removed_shift.explode('preferred_industry_tags')
# merge() resets the index, so save the exploded index to restore it afterwards
exploded_industry_index = exploded_industry.index


# Left-merge the category lookup onto the exploded tags
drop_df_removed_shift_3 = exploded_industry.merge(
    industry_categories,
    on='preferred_industry_tags',
    how='left'
)
# Restore the index saved above. This only works if the merge adds no rows,
# i.e. each tag appears at most once in industry_categories.
drop_df_removed_shift_3.index = exploded_industry_index
drop_df_removed_shift_3

Unnamed: 0,Investors,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags,industries categories
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...",beverages,beverages
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...",computer hardware,computer hardware
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...",education and training services (b2b),education and training
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...",educational and training services (b2c),
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...",energy,energy
...,...,...,...,...,...,...,...,...,...,...
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],"[accelerator/incubator, debt refinancing, earl...",consumer products and services (b2c),
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],"[accelerator/incubator, debt refinancing, earl...",energy,energy
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],"[accelerator/incubator, debt refinancing, earl...",metals,
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],"[accelerator/incubator, debt refinancing, earl...",minerals and mining,
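
Restoring the index by assignment relies on the merge returning exactly one output row per exploded row. A minimal sketch of the same explode-and-merge pattern with that assumption made explicit is shown below; the toy frames `df` and `categories` are hypothetical, and `validate="many_to_one"` is an optional `merge` argument that raises an error if a tag occurs more than once in the lookup table:

```python
import pandas as pd

# Hypothetical toy data: two investors, one with two tags
df = pd.DataFrame(
    {
        "Investors": ["A", "B"],
        "preferred_industry_tags": [["energy", "metals"], ["software"]],
    },
    index=[0, 5],
)
# Hypothetical lookup table; "metals" is deliberately left unmapped
categories = pd.DataFrame(
    {
        "preferred_industry_tags": ["energy", "software"],
        "industries categories": ["energy", "STEM"],
    }
)

exploded = df.explode("preferred_industry_tags")

# validate="many_to_one" fails loudly if the lookup has duplicate tags,
# which would otherwise multiply rows and break the index restoration
merged = exploded.merge(
    categories,
    on="preferred_industry_tags",
    how="left",
    validate="many_to_one",
)
merged.index = exploded.index

print(merged)
```

Tags with no match in the lookup (here `metals`) keep a missing value in `industries categories`, as in the output above.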


3.2 One-hot encoding for Preferred Industry Tags

In [117]:
drop_df_removed_shift_3_encoded = pd.get_dummies(drop_df_removed_shift_3, columns=['industries categories'])
drop_df_removed_shift_3_encoded

# In case you want to check the columns of the DataFrame after encoding, remove the # sign in the next lines
# a = pd.DataFrame(drop_df_removed_shift_3_encoded.columns.to_list())
# a.to_csv('columns_list_industry.csv', index=False)

Unnamed: 0,Investors,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags,industries categories_Cultural and creative industries,...,industries categories_healthcare,industries categories_healthcare device,industries categories_logistics,industries categories_ma,industries categories_media,industries categories_profes,industries categories_real estate services (b2c),industries categories_real estates,industries categories_telecommunications,industries categories_transportation
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...",beverages,False,...,False,False,False,False,False,False,False,False,False,False
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...",computer hardware,False,...,False,False,False,False,False,False,False,False,False,False
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...",education and training services (b2b),False,...,False,False,False,False,False,False,False,False,False,False
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...",educational and training services (b2c),False,...,False,False,False,False,False,False,False,False,False,False
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...",energy,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],"[accelerator/incubator, debt refinancing, earl...",consumer products and services (b2c),False,...,False,False,False,False,False,False,False,False,False,False
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],"[accelerator/incubator, debt refinancing, earl...",energy,False,...,False,False,False,False,False,False,False,False,False,False
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],"[accelerator/incubator, debt refinancing, earl...",metals,False,...,False,False,False,False,False,False,False,False,False,False
155262,Ecostar (Accelerator),"Founded in 216, Ecostar is an accelerator inve...",Europe,"Agriculture, Consumer Products and Services (B...","Accelerator/Incubator, Debt Refinancing, Early...",<span>0.16</span>,[europe],"[accelerator/incubator, debt refinancing, earl...",minerals and mining,False,...,False,False,False,False,False,False,False,False,False,False
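
Rows whose tag has no category (a missing value in `industries categories`) get no indicator set in any encoded column, because `pd.get_dummies` ignores missing values by default. A small sketch with hypothetical data illustrates this:

```python
import pandas as pd

# Hypothetical exploded frame; the "metals" tag has no mapped category
df = pd.DataFrame(
    {
        "tag": ["energy", "metals", "software"],
        "industries categories": ["energy", None, "STEM"],
    },
    index=[0, 0, 5],
)

encoded = pd.get_dummies(df, columns=["industries categories"])

# Encoded columns are prefixed with the original column name
dummy_cols = [c for c in encoded.columns if c.startswith("industries categories_")]

# Boolean mask of rows without a category (NumPy array, since the index has duplicates)
unmapped = df["industries categories"].isna().to_numpy()

print(encoded)
print("indicators set on unmapped rows:", int(encoded.loc[:, dummy_cols][unmapped].to_numpy().sum()))
```

Passing `dummy_na=True` to `pd.get_dummies` would add a separate indicator column for the missing values, if counting unmapped tags is useful.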


Combine rows with the same index to count how many tags fall into each industry category for each original row.  
Note that `groupby().sum()` also aggregates the other columns, which repeats their text once per exploded row!  
Thus, only the encoded columns are extracted at the end; they are merged into the original dataset dataframe later.

In [118]:
# Note: after groupby, the non-encoded text columns are aggregated too, which duplicates their text.
drop_df_removed_shift_3_encoded.groupby(level=0).sum()

# Extract only the encoded columns to avoid that aggregation issue:
industry_encode_columns = drop_df_removed_shift_3_encoded.groupby(level=0).sum().loc[:,"industries categories_Cultural and creative industries":"industries categories_transportation"]
industry_encode_columns

Unnamed: 0,industries categories_Cultural and creative industries,"industries categories_Food and beverage, Retail",industries categories_Government & Public sector,industries categories_Manufacturing & industrialisation-related,industries categories_Professional Services,industries categories_STEM,industries categories_Tourism,industries categories_agriculture,industries categories_architecture and real estate,industries categories_banks,...,industries categories_healthcare,industries categories_healthcare device,industries categories_logistics,industries categories_ma,industries categories_media,industries categories_profes,industries categories_real estate services (b2c),industries categories_real estates,industries categories_telecommunications,industries categories_transportation
0,0,0,0,0,0,1,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,1,0,1,0,0,1,0,0,1,0,...,1,0,0,0,0,0,1,0,0,1
2,0,0,0,0,0,1,0,0,1,0,...,1,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155008,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
155039,0,0,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
155121,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
155138,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
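
An alternative to slicing by the first and last column label is to select the encoded columns by their shared prefix before grouping, so the text columns are never summed. This is only a sketch on hypothetical data, not the notebook's own method:

```python
import pandas as pd

# Hypothetical encoded frame: investor A (index 0) has two STEM tags
encoded = pd.DataFrame(
    {
        "Investors": ["A", "A", "B"],
        "industries categories_STEM": [True, True, False],
        "industries categories_energy": [False, False, True],
    },
    index=[0, 0, 5],
)

# Pick the one-hot columns by prefix instead of by position
dummy_cols = encoded.filter(like="industries categories_").columns

# Summing booleans per index label counts the tags in each category
counts = encoded[dummy_cols].groupby(level=0).sum()

print(counts)
```

The result contains counts rather than 0/1 flags, which is why values such as 2 can appear in the encoded output.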


Merge the encoded columns into the original dataset dataframe, matching rows on the index

In [119]:
dataset_encoded_3 = dataset_encoded_2.merge(
    industry_encode_columns,
    left_index=True,
    right_index=True,
    how='inner'
)

dataset_encoded_3

# To output an excel file, remove the # sign in the next line
# dataset_encoded_3.to_excel('dataset_encoded_3.xlsx', index=False)

Unnamed: 0,Investors,Description,Geography,Preferred Industry,Preferred Investment Type,Primary Investor Type,geography_tags,preferred_investment_type_tags,preferred_industry_tags,3rd_Norway,...,industries categories_healthcare,industries categories_healthcare device,industries categories_logistics,industries categories_ma,industries categories_media,industries categories_profes,industries categories_real estate services (b2c),industries categories_real estates,industries categories_telecommunications,industries categories_transportation
0,Techstars,"Founded in 26, Techstars is an accelerator bas...","Africa, Americas, Asia, Canada, Middle East, O...","Beverages, Computer Hardware, Education and Tr...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, middle east, ...","[accelerator/incubator, early stage vc, later ...","[beverages, computer hardware, education and t...",0,...,1,0,0,0,0,0,0,0,0,0
1,Y Combinator,"Founded in 25, Y Combinator is an accelerator ...","Africa, Americas, Asia, Europe, Oceania, Unite...","Biotechnology, Commercial Transportation, Comm...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, europe, oceania, unit...","[accelerator/incubator, early stage vc, later ...","[biotechnology, commercial transportation, com...",0,...,1,0,0,0,0,0,1,0,0,1
2,Plug and Play Tech Center,"Founded in 26, Plug and Play Tech Center is an...","Africa, Americas, Asia, Canada, Europe, Middle...","Aerospace and Defense, Animal Husbandry, Aquac...","Accelerator/Incubator, Early Stage VC, Later S...",Accelerator/Incubator,"[africa, americas, asia, canada, europe, middl...","[accelerator/incubator, early stage vc, later ...","[aerospace and defense, animal husbandry, aqua...",0,...,1,0,0,0,0,0,1,0,0,0
3,Gaingels,"Founded in 214, Gaingels is a venture capital ...","Africa, Americas, Asia, Canada, Europe, Middle...","Business Products and Services (B2B), Consumer...","Early Stage VC, Later Stage VC, PE Growth/Expa...",Venture Capital,"[africa, americas, asia, canada, europe, middl...","[early stage vc, later stage vc, pe growth/exp...","[business products and services (b2b), consume...",0,...,1,0,0,0,0,0,0,0,0,0
4,Antler,"Founded in 217, Antler is a venture capital in...","Australia, Brazil, Canada, China, Denmark, Est...","Agriculture, Business Products and Services (B...","Accelerator/Incubator, Early Stage VC, Seed Round",Venture Capital,"[australia, brazil, canada, china, denmark, es...","[accelerator/incubator, early stage vc, seed r...","[agriculture, business products and services (...",1,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155008,ECK.Ventures,"Founded in 217, ECK.Ventures is a venture capi...",Germany,Software,"Early Stage VC, Later Stage VC, Seed Round",<span>0.05 - 0.13</span>,[germany],"[early stage vc, later stage vc, seed round]",[software],0,...,0,0,0,0,0,0,0,0,0,0
155039,Eclipseon Ventures,"Founded in 217, Eclipseon Ventures is a ventur...",Brazil,"Healthcare, Movies, Music and Entertainment, S...","Early Stage VC, Seed Round",<span>1.00 - 5.00</span>,[brazil],"[early stage vc, seed round]","[healthcare, movies, music and entertainment, ...",0,...,1,0,0,0,0,0,0,0,0,0
155121,EcoElectron Ventures,"Founded in 1998, EcoElectron Ventures is a ven...",Southern California,"Communication Software, Energy","Early Stage VC, Seed Round",<span>0.15 - 0.50</span>,[southern california],"[early stage vc, seed round]","[communication software, energy]",0,...,0,0,0,0,0,0,0,0,0,0
155138,Ecoinvestors Capital,EcoInvestors Capital is investing flexible cap...,"Americas, Canada, Europe, United States","Agriculture, Building Products, Construction (...","Bridge, Early Stage VC, Joint Venture, Later S...",<span>10.00 - 100.00</span>,"[americas, canada, europe, united states]","[bridge, early stage vc, joint venture, later ...","[agriculture, building products, construction ...",0,...,0,0,0,0,0,0,0,0,0,0
