# feature engineering

*dealing with multi-label columns, which generates the most columns*

v3
- instead of encoding the `female_led` info, just keep the percentage; if no information, encode **0.5**
- no multilabelbinarizers

v4
- only keep the `Industry Groups` multilabel column

v5
- break down `Headquarters Location` multilabel column and use OneHotEncoder instead (similar to how `Headquarters Region` was processed)

v6 
- strategically add `Top 5 Investors` information back in after pre-processing, taking investors that appear more than **10** times (adds 78 columns)

v7
- `Headquarters Location` expanded to also include cities
- take `Top 5 Investors` appearing more than **5** times

*the xgboost and ols models are still performing poorly*

v8
- take `Top 5 Investors` appearing more than **3** times
- include both `Industries` and `Industry Groups`

v8.2
- remove "cities" info from `Headquarters Location`

v9
- improved location encoding, i.e. keep different levels of location information so that the distribution is also relatively more balanced: {western us: 571, beijing: 393, europe_country: 324, united kingdom: 290, northeastern us: 258, chinese_state: 230, shanghai: 198, france: 183, guangdong: 179, southern us: 121, germany: 105, scandinavia: 98, midwestern us: 50}

v10
- add industry encoding, top level encoding!
    - "top 5 industry groups" (bool): count > 500 --> sexy hot industries
    - "out of top 50 industries how many each company belongs to" --> diversified across industries
- also add top level investor encoding with same logic!
- v10.1: only keep industry groups (turns out not as successful)
- v10.2: only keep top industries

v11 
- dropna %female for classification
- drop cb ranking and trend scores because they could be correlated with funding
- only keep top industries because keeping all would be too many cols for the smaller df we have after dropna


In [1]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

# data
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt

# feature engineering
from numpy import asarray
from scipy.sparse import csr_matrix
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MultiLabelBinarizer
from collections import Counter

## 1. Aggregate Data
combine data of regions China, Europe, and the US.

In [10]:
regions = ['China', 'Europe', 'US']

df0 = pd.read_csv(f'../data/crunchbase-aggregated/{regions[0]}-gender.csv')
df1 = pd.read_csv(f'../data/crunchbase-aggregated/{regions[1]}-gender.csv')
df2 = pd.read_csv(f'../data/crunchbase-aggregated/{regions[2]}-gender.csv')

df = pd.concat([df0, df1, df2])
df.reset_index(inplace=True, drop=True)
df.shape

(3000, 113)

## 2. Feature Transformation
data scaling, discretization, dealing with missing values, etc.

## 2.1 skip columns

1. rows already correctly labeled
2. all the same or too many NULLs
3. equivalent to name
4. equivalent to total funding amount
5. irrelevant data

In [3]:
lower_cols = ['Number of Founders']
df.rename(columns={'Number of Founders': 'number_of_founders'}, inplace=True)

In [11]:
drop_cols = ['Description', 'Full Description', 
             'Website', 'Twitter', 'Facebook', 'LinkedIn',
             'Contact Email', 'Phone Number', 'Founders',
             'Transaction Name', 'Contact Job Departments',
             'Number of Contacts', 'Number of Private Contacts',
             'api_raw', 'gender', 'prob',
             'IPO Status', 'Operating Status', 'Diversity Spotlight (US Only)',
              'Exit Date', 'Closed Date', 'Company Type', 'Hub Tags',
              'Actively Hiring', 'Investor Type', 'Investment Stage',
              'Number of Portfolio Organizations','Number of Investments',
              'Number of Lead Investments', 'Number of Diversity Investments',
              'Number of Exits', 'Number of Exits (IPO)', 'Accelerator Program Type',
              'Accelerator Application Deadline', 'Accelerator Duration (in weeks)',
              'School Type', 'School Program', 'Number of Enrollments',
              'School Method', 'Number of Founders (Alumni)', 'Number of Alumni',
              'Acquired by', 'Announced Date', 'Price', 
              'Acquisition Type', 'Acquisition Terms', 'Acquisition Status',
              'IPO Date', 'Delisted Date', 'Money Raised at IPO',
              'Valuation at IPO', 'Stock Symbol', 'Stock Exchange', 'Number of Events',
              'Last Leadership Hiring Date', 'Last Layoff Mention Date',
              'IT Spend', 'Date of Most Recent Valuation', 'Number of Private Notes', 
              'Most Popular Trademark Class', 'Most Popular Patent Class',
              'Tags', 'Unnamed: 107',
#               'Industries', 
              'Funding Status', 'Last Equity Funding Type',
              #correlated to funding
              'CB Rank (Organization)', 'CB Rank (School)', 'CB Rank (Company)',
              'Number of Funding Rounds', 
              'Trend Score (7 Days)', 'Trend Score (30 Days)', 'Trend Score (90 Days)',
              'Last Funding Amount', 'Last Equity Funding Amount', 'Total Equity Funding Amount']

df.drop(columns=drop_cols, inplace=True)
df.shape

(3000, 38)
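
The drop list above is long and maintained by hand, so a duplicated label or a label missing from one region's export is easy to introduce. A minimal sketch, on a toy frame with illustrative column names (`toy`, `toy_drop_cols` and `'Not A Column'` are assumptions, not part of the dataset), of deduplicating the list and dropping only labels that exist:

```python
import pandas as pd

# toy frame standing in for the aggregated data (column names are illustrative)
toy = pd.DataFrame({
    'Organization Name': ['a', 'b'],
    'Website': ['x.com', 'y.com'],
    'Funding Status': ['Seed', 'Seed'],
})

# hypothetical drop list with a duplicate and a label the frame does not have
toy_drop_cols = ['Website', 'Funding Status', 'Funding Status', 'Not A Column']

# dict.fromkeys removes duplicates while keeping the original order
unique_cols = list(dict.fromkeys(toy_drop_cols))

# collect labels that are absent so they can be reviewed instead of raising
missing = [c for c in unique_cols if c not in toy.columns]

# errors='ignore' skips labels that do not exist in the frame
toy_clean = toy.drop(columns=unique_cols, errors='ignore')
```

Printing `missing` before dropping is a cheap way to notice renamed columns in a new export.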

## 2.2 encoding categorical data
### 2.2.1 convert text to equal categories

**are they equal or ordinal?**
- `Funding Status`: "Early Stage Venture", "Seed", "M&A" (overlaps with `Last Funding Type`)
- `Acquisition Status`: "Was Acquired", "Made Acquisitions", "Made Acquisitions; Was Acquired" (for early stage too many NULLs)

In [5]:
def equal_cat(df, col):
    
    '''create new columns binary encoding each category'''
    
    # deal with NULL values
    new_col = col.lower().replace(' ', '_')
    df[new_col] = df[col].str.replace('—',f'{new_col}_null')
    
    # initiate binary encoder
    ohe = OneHotEncoder()
    
    # join original df with the created df with many new binary columns
    df_ohe = pd.DataFrame(ohe.fit_transform(asarray(df[new_col]).reshape(-1,1)).toarray(), 
                          columns=ohe.categories_, index=df.index)
    df_ohe.columns = df_ohe.columns.get_level_values(0)
    
    # deal with exceptions
    try:
        df = df.join(df_ohe)
    except ValueError:
        # country==state
        if col == 'hq_state':
            df = df.join(df_ohe.drop(columns=state_country))
        # state==city
        if col == 'hq_city':
            df = df.join(df_ohe.drop(columns=state_city))
    
    return df
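
For a single-label column, `pd.get_dummies` is a shorter route to the same kind of binary columns. The sketch below uses made-up values and is only an alternative under that assumption, not the method used in this notebook:

```python
import pandas as pd

# toy single-label column; '—' is the null marker used in the raw data
toy = pd.DataFrame({'hq_location': ['beijing', 'france', '—', 'beijing']})

# give the null marker an explicit category name, as equal_cat does
loc = toy['hq_location'].replace('—', 'hq_location_null')

# one binary column per category; astype(int) gives 0/1 on any pandas version
dummies = pd.get_dummies(loc).astype(int)

# the new column names do not clash with the original column, so join is safe
toy = toy.join(dummies)
```

Each row has exactly one `1` across the new columns, which is what distinguishes this case from the multi-label columns handled later.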

### Location data

- `Headquarters Location` _(city, state, country) where many city=state (e.g. New York)_ **(break it down, only take state and country, and use OneHotEncoder)**
- `Headquarters Regions` (to avoid overlap with prev, only take last region and use OneHotEncoder)

#### preprocess multi-label location columns for one-hot-encoding

In [6]:
def location_cat(df):
    
    #1 location data extraction
    ##1.1 regions
    df['hq_region'] = df['Headquarters Regions'].str.lower().str.strip().str.split('; ').str[-1]

    ##1.2 location
    df['hq_country'] = df['Headquarters Location'].str.lower().str.strip().str.split('; ').str[-1]
    df['hq_state'] = df['Headquarters Location'].str.lower().str.strip().str.split('; ').str[-2]
    # df['hq_city'] = df['Headquarters Location'].str.lower().str.strip().str.split('; ').str[-3]

    #2 specific location data level grouping
    hq_location = []
    for i in range(df.shape[0]):
        
        ##2.1 if country==us, keep region level
        # western us (424), northeastern us (196), west coast (147), 
        # southern us (121), new england (62), midwestern us (50)
        if df['hq_country'][i]=='united states':
            us_region = df['hq_region'][i]
            
            #2.1.1 combine new england into northeastern (258)
            if us_region=='new england':
                hq_location.append('northeastern us')
                
            #2.1.2 combine west coast into western (571)
            elif us_region=='west coast':
                hq_location.append('western us')
                
            #2.1.3 the rest
            else:
                hq_location.append(us_region)
    
    
        ##2.2 if country==china, keep state level
        if df['hq_country'][i]=='china':
            
            ###2.2.1 majority states
            # beijing (393), shanghai (198), guangdong (179)
            if df['hq_state'][i] in ['beijing', 'shanghai', 'guangdong']:
                china_state = df['hq_state'][i]
                hq_location.append(china_state)
            
            ###2.2.2 minority states (230)
            else:
                hq_location.append('chinese_state')

                
        ##2.3 for scandinavia, also keep region level
        if df['hq_region'][i]=='scandinavia':
            hq_location.append('scandinavia')

        ##2.4 for europe, keep country level
        ### 2.4.1 majority countries
        # uk (290), france (183), germany (105)
        if df['hq_country'][i] in ['united kingdom', 'france', 'germany']:
            eup_country = df['hq_country'][i]
            hq_location.append(eup_country)
        
        ### 2.4.2 within eu
        elif df['hq_region'][i]=='european union (eu)':
            hq_location.append('europe_country')
        
        ### 2.4.3 outside eu but within europe
        elif df['hq_country'][i] in ['switzerland', 'russian federation', 
                                     'belarus', 'liechtenstein', 'turkey']:
            hq_location.append('europe_country')


    df.drop(columns=['Headquarters Regions', 'Headquarters Location', 
                     'hq_region', 'hq_country', 'hq_state'], inplace=True)
    df['hq_location'] =  hq_location

In [12]:
location_cat(df)
# this updated location col has no null info
# df['hq_location'].value_counts()#.sum()

In [14]:
df[df['%female'].isnull()]['hq_location'].value_counts()

beijing            250
chinese_state      162
guangdong          132
shanghai           124
europe_country      38
united kingdom      26
western us          21
france              14
northeastern us     11
scandinavia         10
southern us          9
midwestern us        4
germany              2
Name: hq_location, dtype: int64

In [9]:
df['hq_location'].value_counts()

western us         550
europe_country     286
united kingdom     264
northeastern us    247
france             169
beijing            143
southern us        112
germany            103
scandinavia         88
shanghai            74
chinese_state       68
guangdong           47
midwestern us       46
Name: hq_location, dtype: int64
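
The grouping rules in `location_cat` rely on independent `if` blocks, so a row that matches none of them (or two of them) would make `hq_location` a different length from `df`. A minimal sketch of the same rules as a pure function with early returns, so every row gets exactly one label. The function name `location_bucket` and the final catch-all are assumptions made for illustration:

```python
def location_bucket(country, state, region):
    """Map one company's location fields to a single coarse bucket."""
    if country == 'united states':
        # merge the smaller US regions into their larger neighbours
        merged = {'new england': 'northeastern us', 'west coast': 'western us'}
        return merged.get(region, region)
    if country == 'china':
        # keep the three largest provinces/municipalities, pool the rest
        if state in ('beijing', 'shanghai', 'guangdong'):
            return state
        return 'chinese_state'
    if region == 'scandinavia':
        return 'scandinavia'
    if country in ('united kingdom', 'france', 'germany'):
        return country
    # assumption: every remaining row is pooled rather than left unlabeled
    return 'europe_country'
```

Applied row by row (for example through `DataFrame.apply` with `axis=1`), this would replace the explicit index loop.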

### 2.2.2 convert text to ORDINAL categories

- 'Last Funding Type'
- 'Estimated Revenue Range'
- 'Number of Employees'
- 'Last Equity Funding Type'
- 'Most Recent Valuation Range'

*Note: `.astype('category').cat.codes` is not a good method here because it assigns codes in the sorted (lexical) order of the labels, not in the intended ordinal order*
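
A small sketch of why the note matters, using a few employee-range labels as toy input. The variable names are illustrative:

```python
import pandas as pd

sizes = pd.Series(['11-50', '1-10', '101-250', '51-100'])

# default categories are sorted as strings, so '101-250' lands before '11-50'
lexical_codes = sizes.astype('category').cat.codes.tolist()

# an explicit ordered category list yields codes in the intended order
order = ['1-10', '11-50', '51-100', '101-250']
ordinal_codes = pd.Categorical(sizes, categories=order, ordered=True).codes.tolist()
```

With the explicit list, a larger code always means a larger company, which is the property an ordinal feature needs.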

In [None]:
def ordinal_cat(df, col):
    
    '''create one new column with ordinal categories'''
    
    # get text for new column name
    new_col = col.lower().replace(' ', '_')
    
    
    # specify ordinal order
    if (col=='Last Funding Type') or (col=='Last Equity Funding Type'):
        labels = ['Seed', 'Series A']
    
    if (col=='Estimated Revenue Range') or (col=='Most Recent Valuation Range'):
        labels = ['—', 'Less than $1M', '$1M to $10M', '$10M to $50M', 
                  '$50M to $100M', '$100M to $500M', '$500M to $1B', 
                  '$1B to $10B', '$10B+']
    
    if col == 'Number of Employees':
        # some '1-10' were read incorrectly and automatically converted to date formats
        df['Number of Employees'] = df['Number of Employees'].str.replace('10-Jan', '1-10')
        labels = ['—', '1-10', '11-50', '51-100', '101-250', '251-500', 
                  '501-1000', '1001-5000', '5001-10000', '10001+']
    
    
    # convert text to ordinal categories
    cat = list(np.array(labels).reshape(1,len(labels)))
    oe = OrdinalEncoder(categories=cat)
    df[new_col] = oe.fit_transform(asarray(df[col]).reshape(-1, 1))
    df[new_col] = df[new_col].astype('int')

### 2.2.3 convert text list to MULTI-LABEL categories
#1
- `Headquarters Location` (after processing moved to OneHotEncoder)
- `Headquarters Regions` (after processing moved to OneHotEncoder)

#2
- `Industry Groups` 
- `Industries` (not ignored because it does not overlap with `Industry Groups`)

*but they also cannot both be processed because of collinearity*

note: [Industry Group v Industries Table (Crunchbase)](https://support.crunchbase.com/hc/en-us/articles/360043146954-What-Industries-are-included-in-Crunchbase-)

#3
- `Top 5 Investors` (after processing moved to OneHotEncoder)

#### industry group

In [None]:
# get all industry_groups
all_industry_groups = []
df_industry_groups = df['Industry Groups'].str.lower().str.strip().str.split('; ')
for groups in df_industry_groups:
    all_industry_groups.extend(groups)
    
# len(all_industry_groups) #10477
# len(set(all_industry_groups)) #48

top_industry_groups = []
for key, val in Counter(all_industry_groups).items():
    if val >= 480:
        top_industry_groups.append(key)
        
# len(top_industry_groups) #6
top_industry_groups #['health care', 'science and engineering', 'internet services', 
                      # 'software', 'data and analytics', 'information technology']
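
The counting-and-threshold step can be sketched on toy rows as follows. The rows and the threshold of 2 are illustrative (the cell above uses 480 on 3000 rows):

```python
from collections import Counter

# toy multi-label rows in the same '; '-separated format
rows = [
    'Software; Health Care',
    'Software',
    'Health Care; Biotechnology',
    'Software; Internet Services',
]

# split each row into labels and count occurrences across all rows
counts = Counter(label for row in rows for label in row.lower().split('; '))

min_count = 2  # illustrative threshold
top_labels = sorted(label for label, n in counts.items() if n >= min_count)
```

Sorting the result keeps the list stable between runs, which helps when the labels later become column names.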

In [None]:
def top_industry_group_bool(df):
    # extract data
    df['industry_groups'] = df['Industry Groups'].str.lower().str.strip().str.split('; ')
    
    # get data
    group_lst = []
    for row in range(len(df)):
        val = 0
        for i in df['industry_groups'][row]:
            if i in top_industry_groups:
                val = 1
        group_lst.append(val)

    # create new col
#     df.drop(columns=['Industry Groups', 'industry_groups'], inplace=True)
    df['top_group_bool'] = group_lst

In [None]:
top_industry_group_bool(df)
# df['top_group_bool'].value_counts() #1: 2187, 0: 813

#### industries

In [None]:
# get all industries
all_industries = []
df_industries = df['Industries'].str.lower().str.strip().str.split('; ')
for industries in df_industries:
    all_industries.extend(industries)
    
# len(all_industries) #11036
# len(set(all_industries)) #603

top_industries = []
for key, val in Counter(all_industries).items():
    if val >= 50:
        top_industries.append(key)
        
len(top_industries) #51

In [None]:
half_industries = []
for key, val in Counter(all_industries).items():
    if val >= 5:
        half_industries.append(key)
half_industries.remove('—')
len(half_industries) #50: count=51--> 40: count=61, 30: count=77; 20: count=126; 10: count=206; 5: count=324

In [None]:
def top_industries_count(df):
    # extract data
    df['industries'] = df['Industries'].str.lower().str.strip().str.split('; ')
    
    # get data
    count_lst = []
    for row in range(len(df)):
        tot = len(df['industries'][row])
        val = 0
        for i in df['industries'][row]:
            if i in top_industries:
                val += 1
        count_lst.append(val/tot)

    # create new col
#     df.drop(columns=['Industries', 'industries'], inplace=True)
    df['top_industry_count'] = count_lst

In [None]:
top_industries_count(df)
# df['top_industry_count'].value_counts()

In [None]:
# df['top_industry_count'].value_counts()
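
Note that `top_industries_count` stores `val/tot`, i.e. the share of a company's industries that are in the top list rather than a raw count. A minimal sketch of that computation for one row. The helper name `top_share` and the zero result for an empty list are assumptions:

```python
def top_share(labels, top):
    """Fraction of a row's labels that belong to the `top` collection."""
    if not labels:
        return 0.0  # assumption: an empty label list scores zero
    top = set(top)  # set membership is O(1) per label
    return sum(label in top for label in labels) / len(labels)
```

A row whose only label is the null marker `'—'` scores 0.0 as long as the marker is not in the top list.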

#### investors

In [None]:
# get all investors
all_investors = []
df_investors = df['Top 5 Investors'].str.lower().str.strip().str.split('; ')
for investors in df_investors:
    all_investors.extend(investors)
    
# len(all_investors) #8905
# len(set(all_investors)) #4773

# sorted(Counter(all_investors).items(), key=lambda pair: pair[1], reverse=True)
# investors appearing in at least 3 (top_investors) or 5 (top_top_investors) companies out of 3000
top_investors = []
top_top_investors = []
for key, val in Counter(all_investors).items():
    if val >= 3:
        top_investors.append(key)
    if val >= 5:
        top_top_investors.append(key)

# remove nan from list
top_investors.remove('—')
top_top_investors.remove('—')

# one hot encode all these
len(top_investors) #val=2: 1318; val=3: 681; val=5: 286; val=10: 78

In [None]:
def top_investors_col(df):
    # extract data
    df['investors'] = df['Top 5 Investors'].str.lower().str.strip().str.split('; ')
    
    # get bool data
    bool_lst = []
    for row in range(len(df)):
        val = 0
        for i in df['investors'][row]:
            if i in top_top_investors:
                val = 1
        bool_lst.append(val)
        
    # get count data
    count_lst = []
    for row in range(len(df)):
        tot = len(df['investors'][row])
        val = 0
        for i in df['investors'][row]:
            if i in top_investors:
                val += 1
        count_lst.append(val/tot)

    # create new col
#     df.drop(columns=['Top 5 Investors', 'investors'], inplace=True)
    df['top_investors_bool'] = bool_lst
    df['top_investors_count'] = count_lst

In [None]:
investors_list = []
for key, val in Counter(all_investors).items():
    if val >= 15:
        investors_list.append(key)

len(investors_list)

In [None]:
top_investors_col(df)
# df['top_investors_bool'].value_counts() #0: 1517, 1: 1483
# df['top_investors_count'].value_counts()

In [None]:
# {k: v for k, v in sorted(Counter(all_industry_groups).items(), key=lambda item: item[1], reverse=True)}

In [None]:
def multilabel_cat(df, col):
    '''create multiple one-hot encoded columns for each tag/label in a row'''
    
    # dealing with null values (so that the null column created for each source column gets a different name)
    new_col = col.lower().replace(' ', '_')
    df[new_col] = df[col].str.replace('—', f'{new_col}_null')
    
    # get list of labels from text in each row
    df[f'{new_col}_lst'] = df[new_col].str.lower().str.strip().str.split('; ')
    
    # initiate multi-label binary encoder
    mlb = MultiLabelBinarizer()
    
    # join original df with the created df with many new binary columns
    df_mlb = pd.DataFrame(mlb.fit_transform(df[f'{new_col}_lst']),
                          columns=mlb.classes_, index=df.index)
    
    # only take top info to add back to table because investors info is sparse
    if col=='Top 5 Investors':
#         df_mlb = df_mlb[top_top_investors]
        df_mlb = df_mlb[top_investors]
        
    if col=='Industries':
        df_mlb = df_mlb[half_industries]
    
    df = df.join(df_mlb)
    
    return df
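
As a cross-check on `multilabel_cat`, pandas can build the same kind of 0/1 label matrix directly from the delimited strings with `Series.str.get_dummies`. The investor names and the `keep` list below are made up for illustration:

```python
import pandas as pd

# toy '; '-separated multi-label column; '—' is the null marker
investors = pd.Series(['fund a; fund b', 'fund b', '—'])

# one 0/1 column per label found between the '; ' separators
wide = investors.str.get_dummies(sep='; ')

# keep only a chosen subset, mirroring the top-investor filter
keep = ['fund b']  # illustrative subset
wide_top = wide[keep]
```

Filtering to a short `keep` list is what stops the sparse investor labels from producing thousands of columns.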

## 2.4 convert text to separate dates
two cases: (1) every row has a full date (format, e.g. "Dec 31; 1999"), 
(2) some rows have a full date but most only have the year

- `Last Funding Date`: (1)
- `Founded Date`: (2)

In [None]:
def text_date(df, col):
    
    '''create new columns separating date into day, month, year'''
    
    # (1) have full date info (format, e.g. "Dec 31; 1999")
    if all(df[col].str.len()>10):
    
        # get text for new column name
        new_col1 = col.lower().replace(' ', '_').replace('date', 'day')
        new_col2 = col.lower().replace(' ', '_').replace('date', 'month')
        new_col3 = col.lower().replace(' ', '_').replace('date', 'year')

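        # convert day and year
        df[new_col3] = df[col].str[-4:]
        # day is the 1-2 digits between the month abbreviation and the ';'
        df[new_col1] = df[col].str.extract(r'^\w{3} (\d{1,2});', expand=False)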

        # convert month
        # df[new_col2] = df[col].str[:3] #text
        df[new_col2] = pd.to_datetime(df[col].str[:3], format='%b').dt.month
    
    
    # (2) some rows have full date but most only have year info
    else:
        
        # get text for new column name
        new_col = col.lower().replace(' ', '_').replace('date', 'year')

        # extract year only
        df[new_col] = df[col].str[-4:]
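
Fixed-position slicing is fragile when the day can have one or two digits. A minimal sketch of parsing both date shapes with the standard library, assuming the `"Dec 31; 1999"` format described above and an English month abbreviation. The helper name `parse_cb_date` is hypothetical:

```python
from datetime import datetime

def parse_cb_date(text):
    """Return (year, month, day); month and day are None for year-only values."""
    text = text.strip()
    if text == '—':
        return None  # null marker used in the raw data
    try:
        # full date, e.g. "Dec 31; 1999" or "Jan 5; 2020"
        parsed = datetime.strptime(text, '%b %d; %Y')
        return parsed.year, parsed.month, parsed.day
    except ValueError:
        # assumption: any other value ends in a four-digit year
        return int(text[-4:]), None, None
```

Returning integers also keeps the resulting columns numeric, so they survive a later numeric-only column selection.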

## 2.5 convert text to number

1. integer
2. float (percentage)
3. currency (multiply and union)

In [None]:
def text_num(df, col, type='int'):
    
    '''update original column converting text to appropriate numerical format'''
    
    # get new column name
    new_col = col.lower().replace(' ', '_')
    
    # common cleaning: deal with NULL values
    df[new_col] = df[col].str.replace('—','0')
    
    # (1) integer
    if type=='int':
        
        # convert text to int
        df[new_col] = df[new_col].str.replace(';','').astype('int')
        
    # (2) float (percentage)
    if type=='float':
        
        # additional step to strip sign
        df[new_col] = df[new_col].str.replace('%','')
        
        # convert text to float (assign back, otherwise the result is discarded)
        df[new_col] = df[new_col].str.replace(';','').astype('float')
        

In [None]:
def text_curr(df, col):
    '''create new column converting all amount to USD'''
    
    # get new column name
    new_col = col.lower().replace(' ', '_')
    
    # clean text
    df[new_col] = df[col].str.replace(';','')
    
    # add new col "conversion rate": 1 unit of the currency = cvr usd (float so fractional rates fit)
    df['cvr'] = 0.0
    
    # strip currency signs and update conversion rate
    # us dollar ('$' is a regex anchor, so match it literally)
    df[new_col] = df[new_col].str.replace('$','', regex=False)
    df.loc[df[col].str[0]=='$', 'cvr'] = 1
    
    # euro
    df[new_col] = df[new_col].str.replace('€','')
    df.loc[df[col].str[0]=='€', 'cvr'] = 1.1
    
    # uk pound
    df[new_col] = df[new_col].str.replace('£','')
    df.loc[df[col].str[0]=='£', 'cvr'] = 1.34
    
    # japanese yen
    df[new_col] = df[new_col].str.replace('¥','')
    df.loc[df[col].str[0]=='¥', 'cvr'] = 0.0087
    
    # chinese yuan ('CN¥')
    df[new_col] = df[new_col].str.replace('CN','')
    df.loc[df[col].str[0:2]=='CN', 'cvr'] = 0.16
    
    # canadian dollar ('CA$')
    df[new_col] = df[new_col].str.replace('CA','')
    df.loc[df[col].str[0:2]=='CA', 'cvr'] = 0.79
    
    # swiss franc
    df[new_col] = df[new_col].str.replace('CHF','')
    df.loc[df[col].str[0:3]=='CHF', 'cvr'] = 1.09
    
    # swedish krona
    df[new_col] = df[new_col].str.replace('SEK','')
    df.loc[df[col].str[0:3]=='SEK', 'cvr'] = 0.1
    
    # russian ruble
    df[new_col] = df[new_col].str.replace('RUB','')
    df.loc[df[col].str[0:3]=='RUB', 'cvr'] = 0.01
        
    # norwegian krone
    df[new_col] = df[new_col].str.replace('NOK','')
    df.loc[df[col].str[0:3]=='NOK', 'cvr'] = 0.11
    
    # new zealand dollar ('NZ$')
    df[new_col] = df[new_col].str.replace('NZ','')
    df.loc[df[col].str[0:2]=='NZ', 'cvr'] = 0.69
    
    # polish zloty
    df[new_col] = df[new_col].str.replace('PLN','')
    df.loc[df[col].str[0:3]=='PLN', 'cvr'] = 0.24
        
    # icelandic krona
    df[new_col] = df[new_col].str.replace('ISK','')
    df.loc[df[col].str[0:3]=='ISK', 'cvr'] = 0.008
    
    # hungarian forint
    df[new_col] = df[new_col].str.replace('HUF','')
    df.loc[df[col].str[0:3]=='HUF', 'cvr'] = 0.003
    
    # null value
    df[new_col] = df[new_col].str.replace('—','0')
    
    
    '''stripping the currency sign, converting to int, and multiplying cannot be done 
       for only part of the data at once, so the conversion is split into two steps'''
    
    # multiply number by conversion rate to get amount all in usd
    df[new_col] = df[new_col].astype('int')
    df[f'{new_col}_usd'] = df[new_col]*df['cvr']
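
The strip-then-multiply logic of `text_curr` can also be expressed as a prefix lookup table. The sketch below copies a subset of the hard-coded rates above and works on one string at a time. The names `RATES` and `to_usd`, and raising on an unknown prefix, are choices made for illustration:

```python
# USD value of one unit of each currency, copied from the rates in text_curr
RATES = {
    '$': 1.0, '€': 1.1, '£': 1.34, '¥': 0.0087,
    'CN¥': 0.16, 'CA$': 0.79, 'NZ$': 0.69, 'CHF': 1.09, 'SEK': 0.1,
}

def to_usd(text):
    """Convert a string such as '€1;500;000' to USD; the null marker maps to 0."""
    if text == '—':
        return 0.0
    for prefix, rate in RATES.items():
        if text.startswith(prefix):
            # drop the prefix and the ';' thousands separators
            amount = int(text[len(prefix):].replace(';', ''))
            return amount * rate
    # assumption: fail loudly instead of silently using a rate of 0
    raise ValueError(f'unknown currency prefix in {text!r}')
```

Raising on an unknown prefix differs from `text_curr`, where an unrecognized currency keeps `cvr` at 0 and ends up as zero funding.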

## 2.6 convert text to NLP (bag of words?)
- `Description`
- `Full Description`

## run all conversions

In [None]:
# headquarters info all moved to equal_cats
equal_cats = ['hq_location'] #'hq_region', 'hq_country', 'hq_state']#, 'hq_city']
for cat1 in equal_cats:
    df = equal_cat(df, cat1)
    print(df.shape)

In [None]:
ord_cats = ['Last Funding Type', 'Estimated Revenue Range', 'Number of Employees', 
            'Most Recent Valuation Range']
for cat2 in ord_cats:
    ordinal_cat(df, cat2)
df.shape

In [None]:
date_cols = ['Last Funding Date', 'Founded Date']
for date_col in date_cols:
    text_date(df, date_col)
df.shape

In [None]:
int_cols = ['Number of Articles', 'Number of Lead Investors', 
            'Number of Investors', 'Number of Acquisitions', 'Monthly Visits', 
            'Visit Duration', 'Global Traffic Rank', 'Monthly Rank Change (#)', 
            'Active Tech Count', 'Number of Apps', 'Downloads Last 30 Days',
            'Total Products Active', 'Patents Granted', 'Trademarks Registered']
for num1 in int_cols:
    text_num(df, num1, type='int')
df.shape

In [None]:
float_cols = ['Monthly Visits Growth', 'Visit Duration Growth', 'Page Views / Visit', 
              'Page Views / Visit Growth', 'Bounce Rate', 'Bounce Rate Growth', 
              'Monthly Rank Growth', 'Average Visits (6 months)']
for num2 in float_cols:
    text_num(df, num2, type='float')
df.shape

In [None]:
curr_cols = ['Total Funding Amount']
for num3 in curr_cols:
    text_curr(df, num3)
df.shape

In [None]:
multi_cats = ['Top 5 Investors', 'Industries']#, 'Industry Groups']
for cat3 in multi_cats:
    df = multilabel_cat(df, cat3)
    print(df.shape)

In [None]:
# redundant cols generated from feature engineering
multi_cats_lst = []
for col in multi_cats:
    new_col = col.lower().replace(' ', '_')
    multi_cats_lst.append(f'{new_col}_lst')

### remove additional cols
- old cols that are no longer needed after new processing
- intermediate cols used to produce new cols

In [None]:
old_cols = equal_cats + ord_cats + multi_cats + multi_cats_lst + date_cols + int_cols + float_cols + curr_cols + ['cvr', 'total_funding_amount']

In [None]:
df.drop(columns=old_cols, inplace=True)
df.shape

## 3. Data Post-Processing

since this is no longer a classification task, `%female` can be kept as a variable!

In [None]:
# also drop the col that would give away `%female`
df.drop(columns=['#female'], inplace=True)

<span style="color:red">
encode missing info as 0.5 for the company (a neutral value, possibly better than encoding 0?)
</span>

In [None]:
# df['%female'].fillna(0.5, inplace=True)
df.dropna(subset=['%female'], inplace=True)
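
The two options for missing `%female` values behave quite differently, which a toy column makes visible. The extra `female_missing` indicator is a suggestion, not something this notebook does:

```python
import pandas as pd

toy = pd.DataFrame({'%female': [0.0, None, 1.0, None]})

# option 1: treat unknown as neutral, keeps every row
filled = toy['%female'].fillna(0.5)

# option 2: keep only rows with known values, shrinks the data
known = toy.dropna(subset=['%female'])

# optional indicator so a model can tell imputed values from observed ones
toy['female_missing'] = toy['%female'].isna().astype(int)
```

Filling keeps all rows but mixes imputed and observed values, while dropping keeps only observed values at the cost of sample size.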

In [None]:
# encode as bool
# df['female_led'] = (df['%female']>0.5).astype(int)
# df.drop(columns=['%female'], inplace=True)

### dealing with missing data

In [None]:
df['total_funding_amount_usd'].isnull().sum()#value_counts()

In [None]:
df[df['total_funding_amount_usd']==0].shape

### set name as index 
so that the rest of the columns are all numerical data that could fit in the model

In [None]:
df.set_index('Organization Name', inplace=True)
num_cols = df.describe().columns #this takes a while to load
new_df = df[num_cols]
new_df.shape #2197*0.5=1098 so cols should be less than that (.5 is test data size)
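
`df.describe().columns` selects the numeric columns only as a side effect of computing summary statistics. A sketch of a more direct selection with `select_dtypes`, on a toy frame with illustrative column names:

```python
import pandas as pd

toy = pd.DataFrame({'name': ['a', 'b'], 'year': ['2019', '2020'], 'visits': [10, 20]})

# numeric columns only, without computing summary statistics
num_only = toy.select_dtypes(include='number')

# string-typed columns are excluded until they are converted
toy['year'] = toy['year'].astype(int)
num_with_year = toy.select_dtypes(include='number')
```

The same caveat applies to either approach: columns built with string slicing, such as the year columns from `text_date`, stay out of the numeric set unless they are cast first.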

In [None]:
# set(df.columns).difference(set(new_df.columns))

### export data

In [None]:
# new_df.to_csv('../data/feature_engineering/combined_feng_v11.csv')