# 1. Introduction

### 1.1 Problem definition
In this project, our primary goal was to construct a predictive model capable of estimating the total funding amount in USD that startups receive, making 'funding_total_usd' our target variable. 

### 1.2 Overview of models
To tackle this predictive challenge, we employed a variety of machine learning models, each with distinct characteristics and complexity levels:
- K-Nearest Neighbors (K-NN): A simple, intuitive model that makes predictions based on the 'closeness' of data points in a feature space.
- Random Forest: An ensemble approach that leverages multiple decision trees to enhance prediction accuracy and robustness, effectively reducing overfitting.
- LightGBM: A gradient boosting framework that uses tree-based learning algorithms, known for its efficiency and effectiveness, particularly on large datasets.
- Lasso Regression: A linear model with regularization to prevent overfitting, especially useful when the dataset has a high dimensionality due to many encoded categorical features.

### 1.3 Feature engineering
Prior to modelling, we had already conducted extensive data preprocessing and feature engineering. Nevertheless, some models required further modifications tailored to their specific requirements:
- Data cleaning: Removed irrelevant columns and handled missing values to prepare a clean dataset.
- Feature engineering: Calculated the duration between different funding rounds to potentially capture the momentum of funding interest.
- Categorical encoding: Transformed categorical variables using binary encoding to prepare them for machine learning models, allowing models to better understand the patterns in categorical data.
- Feature selection: Identified and dropped features that did not contribute to the predictive power of the model, focusing on those most relevant to the target variable.

### 1.4 Evaluation metrics
For the evaluation of our models, we chose R² as the primary metric due to its clarity and independence from scale. R² measures the proportion of variance in the dependent variable that can be predicted from the independent variables, providing a standardized approach that remains unaffected by the magnitude of the funding amounts or the complexity introduced by numerous predictors. This characteristic allowed for straightforward comparisons across different modeling approaches, enhancing our understanding of each model's explanatory power.

While R² served as our metric of choice, we did consider alternative metrics such as Mean Squared Error (MSE) and Mean Absolute Error (MAE). These metrics provide valuable insights into the magnitude of prediction errors and the models' robustness to outliers. However, we ultimately favored R² for its ability to measure performance in a relative context, which was particularly aligned with the objectives of our project. This decision was based on the importance of quantifying how effectively our models could explain variations in funding amounts, rather than just the scale of their errors.
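
The scale argument can be illustrated with a small sketch. The funding amounts below are hypothetical values chosen for illustration, not rows from our dataset. Expressing the same predictions in USD and in millions of USD leaves R² unchanged, while MSE and MAE change with the unit.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical funding amounts in USD (illustrative values, not from the dataset)
y_true = np.array([1.0e6, 2.5e6, 4.0e6, 8.0e6, 1.2e7])
y_pred = np.array([1.5e6, 2.0e6, 5.0e6, 7.0e6, 1.0e7])


def report(actual, predicted):
    """Return R², MSE and MAE for one set of predictions."""
    return {
        "r2": r2_score(actual, predicted),
        "mse": mean_squared_error(actual, predicted),
        "mae": mean_absolute_error(actual, predicted),
    }


in_usd = report(y_true, y_pred)
# The same data expressed in millions of USD
in_millions = report(y_true / 1e6, y_pred / 1e6)

print(in_usd)
print(in_millions)
```

Both calls give the same R² (0.91875 for these numbers), whereas the MAE differs by a factor of 10⁶ and the MSE by a factor of 10¹².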

# 2. K-Nearest Neighbors
### 2.1 Model introduction
K-NN is a simple, non-parametric algorithm used for both classification and regression tasks. In regression, as in this case, it predicts the output variable by averaging the target values of the k nearest neighbours. The "nearest" neighbours are determined by a distance metric; here we used the Euclidean distance. We started with K-NN because it is a simple model that provides a baseline for comparison with the other models. Its simplicity makes the model relatively intuitive and easy to understand. Another advantage is that K-NN can capture complex patterns in the data without imposing rigid assumptions.
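
The averaging step can be sketched in a few lines of NumPy. This is a minimal illustration on made-up points, assuming Euclidean distance and unweighted averaging; the function name `knn_predict` is hypothetical and is not part of the notebook.

```python
import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()


# Toy data: three points near the origin and two points far away
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0, 110.0])

prediction = knn_predict(X_train, y_train, np.array([0.2, 0.2]), k=3)
print(prediction)  # 20.0, the mean of 10, 20 and 30
```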

The following code first loads the necessary libraries and data, after which the data is preprocessed. The data had already been preprocessed by the team, yet some further adjustments were made below, namely the creation of a new variable (funding_duration), the dropping of some columns that became redundant and the standardization of several variable names. Secondly, the categorical variables were encoded using binary encoding. Then the data was split into a training and a test set in order to validate the model predictions. To make full use of the training data, the team used k-fold cross-validation. Lastly, the model was defined, trained and tested. The model was evaluated based on R².

### 2.2 Loading libraries

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from category_encoders import BinaryEncoder
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from sklearn.impute import SimpleImputer

### 2.3 Loading the dataset

In [9]:
# Read the CSV file
startup_df = pd.read_csv('cleaned.csv')

print(startup_df)

       permalink  Sequoia Capital  Kleiner Perkins Caufield & Byers  \
0              0            False                             False   
1              1            False                             False   
2              2            False                             False   
3              3            False                             False   
4              4            False                             False   
...          ...              ...                               ...   
28714      28714            False                             False   
28715      28715            False                             False   
28716      28716            False                             False   
28717      28717            False                             False   
28718      28718            False                             False   

       New Enterprise Associates  Accel Partners  Intel Capital  \
0                          False           False          False   
1            

### 2.4 Data engineering
In this step, some feature engineering was required to make the data machine processible and to improve model accuracy. The first code block creates a new variable, funding_duration, that holds the number of days between the funding date (funded_at) and the date of the last funding (last_funding_at). It serves as an indication of the period over which funding was raised.

In [10]:
# Convert the two funding date columns to datetime so their difference can be computed
startup_df['funded_at'] = pd.to_datetime(startup_df['funded_at'])
startup_df['last_funding_at'] = pd.to_datetime(startup_df['last_funding_at'])

# Calculate the funding duration in days
startup_df['funding_duration'] = (startup_df['last_funding_at'] - startup_df['funded_at']).dt.days
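
As a quick check of what this computation yields, the sketch below applies the same subtraction to a small made-up frame. The dates are hypothetical, and `errors="coerce"` is an assumption added here so that unparseable dates become missing values rather than raising an error.

```python
import pandas as pd

# Hypothetical dates for illustration only
toy = pd.DataFrame({
    "funded_at": ["2012-01-15", "2013-06-01", None],
    "last_funding_at": ["2012-01-15", "2014-06-01", "2015-03-10"],
})

for col in ["funded_at", "last_funding_at"]:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Difference of two datetime columns is a timedelta; .dt.days extracts whole days
toy["funding_duration"] = (toy["last_funding_at"] - toy["funded_at"]).dt.days
print(toy)
```

A single funding date gives a duration of 0, a one-year gap gives 365, and a missing date propagates as NaN, which is one reason an imputation step is still needed later in the pipeline.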

The dataset still contains some variables that are not machine processible because of their datatype. The variables below are date indicators, but since the team created new variables from them during data cleaning, they have become redundant. Therefore, the team decided to remove them.

In [11]:
# List of columns to drop
columns_to_drop = ['founded_at', 'first_funding_at', 'last_funding_at', 'acquired_at', 'funded_at']

# Drop the specified columns from the DataFrame
startup_df.drop(columns=columns_to_drop, inplace=True)

In the data cleaning process, the variable investor_name was transformed into dummy variables for the 50 most frequent investors. The names of these dummy variables still contain some characters that are not machine processible. All these characters were replaced with an underscore (_).
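
A compact alternative to chaining `replace` calls is a single regular expression. The sketch below is only an illustration on a made-up frame; the helper name `sanitize` is hypothetical, and the character set is assumed to be the same one used in the notebook's own function.

```python
import re

import pandas as pd

# Characters to replace: space , : . [ ] { } "
PATTERN = re.compile(r'[ ,:.\[\]{}"]')


def sanitize(name):
    """Replace every non-compliant character in a column name with an underscore."""
    return PATTERN.sub("_", name)


toy = pd.DataFrame(columns=["Sequoia Capital", "First Round", "status", "a.b[c]"])
toy = toy.rename(columns=sanitize)
print(list(toy.columns))
# ['Sequoia_Capital', 'First_Round', 'status', 'a_b_c_']
```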

In [12]:
def check_and_replace_column_names(df):
    replaced_columns = []
    for column_name in df.columns:
        if any(c in column_name for c in [' ', ',', ':', '.', '[', ']', '{', '}', '"']):
            # Replace non-compliant characters with underscores
            new_column_name = column_name.replace(' ', '_').replace(',', '_').replace(':', '_').replace('.', '_').replace('[', '_').replace(']', '_').replace('{', '_').replace('}', '_').replace('"', '_')
            replaced_columns.append((column_name, new_column_name))
    return replaced_columns

# Collect the column names that contain non-compliant characters
replaced_columns = check_and_replace_column_names(startup_df)

# Print the old and new column names
for old_name, new_name in replaced_columns:
    print(f"Replacing '{old_name}' with '{new_name}'")

# Replace non-compliant characters in column names
startup_df.rename(columns=dict(replaced_columns), inplace=True)

# Check the modified column names
print("Modified columns:", startup_df.columns)

Replacing 'Sequoia Capital' with 'Sequoia_Capital'
Replacing 'Kleiner Perkins Caufield & Byers' with 'Kleiner_Perkins_Caufield_&_Byers'
Replacing 'New Enterprise Associates' with 'New_Enterprise_Associates'
Replacing 'Accel Partners' with 'Accel_Partners'
Replacing 'Intel Capital' with 'Intel_Capital'
Replacing 'Draper Fisher Jurvetson (DFJ)' with 'Draper_Fisher_Jurvetson_(DFJ)'
Replacing 'First Round' with 'First_Round'
Replacing 'SV Angel' with 'SV_Angel'
Replacing '500 Startups' with '500_Startups'
Replacing 'Bessemer Venture Partners' with 'Bessemer_Venture_Partners'
Replacing 'Greylock Partners' with 'Greylock_Partners'
Replacing 'Lightspeed Venture Partners' with 'Lightspeed_Venture_Partners'
Replacing 'Andreessen Horowitz' with 'Andreessen_Horowitz'
Replacing 'Khosla Ventures' with 'Khosla_Ventures'
Replacing 'Index Ventures' with 'Index_Ventures'
Replacing 'Canaan Partners' with 'Canaan_Partners'
Replacing 'Redpoint Ventures' with 'Redpoint_Ventures'
Replacing 'Norwest Venture 

### 2.5 Defining, tuning and evaluating model
The below code first splits the data in test and training sets using a 70/30 training/test split. The predictors are defined by all variables besides the target variable (funding_total_usd) and the feature permalink. Then, the code defines the KNN model using a pipeline. This pipeline encompasses several essential steps: binary encoding is applied to categorical variables, missing values are imputed using the median, and numerical features are standardized. The team decided to replace the missing values with the median due to its ability to maintain the data distribution and robustness to outliers.
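
To give an intuition for binary encoding, the sketch below maps each category to an integer and writes that integer in binary, one column per bit. It is a simplified illustration under our own assumptions (codes start at 1, hypothetical helper `binary_encode`) and does not claim to reproduce the exact code assignment of the `category_encoders` `BinaryEncoder`.

```python
import pandas as pd


def binary_encode(series):
    """Encode categories as the bits of a 1-based integer code."""
    codes = pd.factorize(series)[0] + 1       # 1..n_categories, in order of appearance
    n_bits = int(codes.max()).bit_length()    # bits needed for the largest code
    bits = {
        f"{series.name}_{i}": (codes >> (n_bits - 1 - i)) & 1
        for i in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index)


# Hypothetical values for illustration
series = pd.Series(["USA", "GBR", "USA", "CAN", "DEU", "FRA"], name="country_code")
encoded = binary_encode(series)
print(encoded)
```

Five distinct categories fit into three bit columns, whereas one-hot encoding would need five. The saving grows with cardinality, which matters for variables such as region or category_list.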

To optimize the K-NN model's performance, a random grid search is conducted. The objective is to fine-tune the number of neighbours, with the R² score serving as the guiding evaluation metric. Through 20 iterations of random hyperparameter values and 5-fold cross-validation, the search aims to identify the optimal configuration for the K-NN model. The team chose a random grid search with 20 iterations over an exhaustive grid search because computational power and time were restricted. The random grid search still provides a sound basis for hyperparameter tuning.
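
The 5-fold scheme can be made concrete with a small sketch on ten made-up rows. It assumes unshuffled `KFold`, which is what scikit-learn uses when an integer `cv` is passed for a regressor; every row ends up in the validation part exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy observations with two features each
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)
folds = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    print(f"fold {fold}: {len(train_idx)} training rows, validation rows {val_idx.tolist()}")
    folds.append(val_idx.tolist())

# Flatten the validation indices to confirm each row is validated once
seen = sorted(i for fold in folds for i in fold)
print(seen)
```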

Subsequently, the best-performing model resulting from the random grid search is evaluated using the test dataset. This evaluation is based on the R² score, providing insight into the model's predictive accuracy. The number of neighbours of the best model was found to be 3, with an R² of 0.2618, showing relatively low predictive power. Only 26.18% of the total variance in the outcome variable can be explained by the model.

In [13]:
# X: predictors (all columns except 'funding_total_usd')
# y: target variable ('funding_total_usd')
X = startup_df.drop(columns=['funding_total_usd', 'permalink'])
y = startup_df['funding_total_usd']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define categorical columns for binary encoding
categorical_columns = ['category_list', 'market', 'status', 'country_code', 'state_code', 'region',
                      'acquirer_category_list', 'acquirer_market', 'acquirer_country_code',
                      'acquirer_state_code', 'acquirer_region', 'funding_round_type',
                      'investor_country_code', 'investor_state_code', 'investor_market',
                      'investor_region']

In [14]:
# Create pipeline with binary encoding, imputation, scaling, and KNN model
pipeline = Pipeline([
    ('binary_encoder', BinaryEncoder(cols=categorical_columns)),
    ('imputer', SimpleImputer(strategy='median')),  # Impute missing values with median
    ('scaler', StandardScaler()),  # Standardize numerical features
    ('knn', KNeighborsRegressor())
])

# Define hyperparameter distribution for random grid search
param_dist = {
    'knn__n_neighbors': randint(1, 31)
}

# Perform random grid search
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist, n_iter=20, cv=5, scoring='r2', verbose=1)
random_search.fit(X_train, y_train)

# Get best model
best_knn_model = random_search.best_estimator_

# Evaluate on test data
y_pred = best_knn_model.predict(X_test)
r2 = r2_score(y_test, y_pred)

print(f"Best R-squared score: {r2:.4f}")
print(f"Best number of neighbors (k): {random_search.best_params_['knn__n_neighbors']}")

Fitting 5 folds for each of 20 candidates, totalling 100 fits




Best R-squared score: 0.2618
Best number of neighbors (k): 3


# 3. Random Forest
### 3.1 Model introduction
The Random Forest algorithm was selected for its capability to manage complex datasets with a mix of numerical and categorical variables. This ensemble method, which constructs multiple decision trees and aggregates their outputs to produce a single result, is renowned for its accuracy and robustness. A major draw of Random Forest is its natural resistance to overfitting. This is particularly advantageous when dealing with large datasets, as it maintains strong performance without the typical risk of fitting too closely to the training data.

Another appealing feature of Random Forest is its proficiency in processing both continuous and categorical data, which simplifies the preprocessing phase by minimizing the need for transforming all data into a uniform format. This model also offers interpretability through feature importance scores, allowing us to identify which variables most significantly impact predictions. This aspect is crucial for understanding the driving factors behind startup funding amounts, providing actionable insights that can guide further data collection and feature engineering efforts.

However, despite its numerous advantages, Random Forest comes with certain drawbacks. It can be computationally intensive, especially with larger datasets and a high number of trees in the forest, leading to longer training times. This model also tends to perform less effectively when there are extremely noisy classification/regression tasks and unprocessed datasets with many missing values. Moreover, while it offers a good degree of interpretability compared to other more complex models, the inner workings of individual trees and their interactions within the forest can still be challenging to fully decipher.

### 3.2 Loading libraries
To start off, various libraries were imported to facilitate data manipulation, model training, and evaluation. This setup included packages like Pandas for data operations, and Scikit-learn for model building and validation tools.

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

### 3.3 Loading the dataset

In [16]:
# Read the CSV file
startup_df = pd.read_csv('cleaned.csv')
print(startup_df)

       permalink  Sequoia Capital  Kleiner Perkins Caufield & Byers  \
0              0            False                             False   
1              1            False                             False   
2              2            False                             False   
3              3            False                             False   
4              4            False                             False   
...          ...              ...                               ...   
28714      28714            False                             False   
28715      28715            False                             False   
28716      28716            False                             False   
28717      28717            False                             False   
28718      28718            False                             False   

       New Enterprise Associates  Accel Partners  Intel Capital  \
0                          False           False          False   
1            

### 3.4 Data engineering
The dataset was loaded, and specific date-related columns with less relevance to funding amounts were dropped to streamline the model's input features. Categorical variables were then encoded as boolean dummy variables to make them interpretable for the model. Thereafter, the preprocessed dataset underwent further cleaning to remove columns that only contained NaN values, which could otherwise skew the model's performance. Columns were also renamed to ensure consistency and avoid issues with illegal characters that might interfere with coding syntax.

In [17]:
# List of columns to drop
columns_to_drop = ['founded_at', 'first_funding_at', 'last_funding_at', 'acquired_at', 'funded_at']

# Drop the specified columns from the DataFrame
startup_df.drop(columns=columns_to_drop, inplace=True)

# List of categorical variables to be transformed into dummy variables
categorical_vars = [
    'category_list', 'market', 'status', 'country_code', 'state_code', 'region',
    'acquirer_category_list', 'acquirer_market', 'acquirer_country_code',
    'acquirer_state_code', 'acquirer_region', 'funding_round_type',
    'investor_country_code', 'investor_state_code', 'investor_market',
    'investor_region'
]

# Transform categorical variables into dummy variables (bool values)
startup_df_dummies = pd.get_dummies(startup_df, columns=categorical_vars, drop_first=True, dtype=bool)
startup_df_dummies.rename(columns=lambda x: x.replace(' ', '_').replace(',', '_').replace(':', '_').replace('.', '_').replace('[', '_').replace(']', '_').replace('{', '_').replace('}', '_').replace('"', '_'), inplace=True)

# Check for any completely NaN columns and drop them
nan_columns = startup_df_dummies.columns[startup_df_dummies.isna().all()].tolist()
startup_df_dummies.drop(columns=nan_columns, inplace=True)

print("Modified columns:", startup_df_dummies.columns)

Modified columns: Index(['permalink', 'Sequoia_Capital', 'Kleiner_Perkins_Caufield_&_Byers',
       'New_Enterprise_Associates', 'Accel_Partners', 'Intel_Capital',
       'Draper_Fisher_Jurvetson_(DFJ)', 'First_Round', 'SV_Angel',
       '500_Startups',
       ...
       'investor_region_Winnipeg', 'investor_region_Winston-Salem',
       'investor_region_Wollerau', 'investor_region_Worcester',
       'investor_region_Wrexham', 'investor_region_Zagreb',
       'investor_region_Zaragoza', 'investor_region_Zhejiang',
       'investor_region_Zurich', 'investor_region_Çan'],
      dtype='object', length=3207)


### 3.5 Data splitting
The dataset was split into 60% training, 20% validation, and 20% testing sets. This distribution allows ample data for training while providing sufficient data for both validating model tweaks and independently testing the model’s performance, mimicking realistic scenarios where the model predicts new, unseen data.
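
The two-step split can be checked on a toy array of 100 rows. This is a minimal sketch with made-up data and an arbitrary seed: holding out 40% first and then halving the held-out part gives the 60/20/20 proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy rows so that the proportions are easy to read
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: keep 60% for training, hold out 40%
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
# Step 2: split the held-out 40% evenly into validation and test
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)  # (60, 20, 20)
```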

In [18]:
target_col = "funding_total_usd"

# Split the data into train, validation, and test sets (60% train, 20% validation, 20% test)
X = startup_df_dummies.drop(columns=[target_col, "permalink"])
y = startup_df_dummies[target_col]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=1998)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=1998)


### 3.6 Model configuration
Within a pipeline, we integrated a median imputer to handle any remaining missing values and configured the Random Forest regressor. We then set up a randomized search over hyperparameters using 5-fold cross-validation, scored by negative mean squared error, to optimize model parameters and prevent overfitting. Key hyperparameters included:
- n_estimators: The number of trees in the forest, sampled from 100 to 1000, to explore how model complexity affects performance.
- max_depth: The maximum depth of each tree, sampled from 3 to 7, to control overfitting by limiting how deep the trees can grow.
- min_samples_split and min_samples_leaf: These parameters define the minimum number of samples required to split a node and the minimum number of samples a leaf node must have, respectively, further controlling the growth of trees.
- max_features: The fraction of features to consider when looking for the best split, sampled between 10% and 100% of the features, influencing the diversity of the trees in the forest.

The choice of 20 iterations in the randomized search was a balance between computational efficiency and thorough exploration of the parameter space.
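
When reading the search space in the code, it helps to keep SciPy's conventions in mind: `randint(low, high)` excludes `high`, and `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`. The sketch below samples from two of the distributions to show the ranges that are actually explored; the sample size and seed are arbitrary choices.

```python
from scipy.stats import randint, uniform

# Same distribution arguments as used for max_depth and max_features
depths = randint(3, 8).rvs(size=10_000, random_state=0)
fractions = uniform(0.1, 0.9).rvs(size=10_000, random_state=0)

# Integer depths stay within 3..7 because the upper bound is exclusive
print(depths.min(), depths.max())
# Fractions stay within 0.1..1.0 because the second argument is a width, not an end point
print(round(fractions.min(), 3), round(fractions.max(), 3))
```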

In [19]:
# Define the pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('random_forest', RandomForestRegressor(random_state=1998))
])

# Define hyperparameters for random search
param_dist = {
    'random_forest__n_estimators': sp_randint(100, 1001),
    'random_forest__max_depth': sp_randint(3, 8),
    'random_forest__min_samples_split': sp_randint(2, 21),
    'random_forest__min_samples_leaf': sp_randint(1, 21),
    'random_forest__max_features': sp_uniform(0.1, 0.9)
}

# Initialize 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=1998)

# Perform random search with cross-validation
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist, n_iter=20, cv=kf, scoring='neg_mean_squared_error', verbose=1, random_state=1998)
random_search.fit(X_train, y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


### 3.7 Model validation & evaluation
The best model configuration was then applied to both the validation and test sets. The best-performing Random Forest had an R² of 0.30 on the validation set and 0.52 on the test set. The test result is a significant improvement over the K-NN model's 0.2618.

In [20]:
# Get the best model
best_model = random_search.best_estimator_

# Predict on the validation and test sets
y_pred_val = best_model.predict(X_val)
y_pred_test = best_model.predict(X_test)

# Calculate R^2 (coefficient of determination) on the validation and test sets
r_squared_val = r2_score(y_val, y_pred_val)
r_squared_test = r2_score(y_test, y_pred_test)

# Print the best hyperparameters and R^2
print("Best hyperparameters:", random_search.best_params_)
print(f"R^2 on validation set: {r_squared_val:.2f}")
print(f"R^2 on test set: {r_squared_test:.2f}")

Best hyperparameters: {'random_forest__max_depth': 6, 'random_forest__max_features': 0.7136211298249849, 'random_forest__min_samples_leaf': 6, 'random_forest__min_samples_split': 4, 'random_forest__n_estimators': 235}
R^2 on validation set: 0.30
R^2 on test set: 0.52


### 3.8 Feature importance
Finally, we extracted and analyzed the feature importances determined by the model. This analysis helps in understanding the driving factors behind the funding predictions and was used in refining the model.

In [21]:
# Get feature importances from the random forest model within the pipeline
feature_importances = best_model.named_steps['random_forest'].feature_importances_

# Create a DataFrame to store feature importances along with their corresponding names
feature_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances}).sort_values(by='Importance', ascending=False)

# Print the top 10 most important features
print("Top features by importance:")
print(feature_importance_df.head(10))

Top features by importance:
                    Feature  Importance
55        raised_amount_usd    0.523113
50           funding_rounds    0.202972
77           investor_count    0.067977
64        founded_dayofyear    0.019108
57               gdp_capita    0.016048
73       acquired_dayofyear    0.015819
53  funding_round_permalink    0.013130
63            founded_month    0.012811
70   last_founded_dayofyear    0.010971
56       investor_permalink    0.008983


# 4. LightGBM
### 4.1 Model introduction

LightGBM, or Light Gradient Boosting Machine, is a gradient boosting framework that is designed for efficiency, scalability, and high performance. One of LightGBM's notable features is its gradient boosting algorithm, which sequentially trains a series of weak learners, such as decision trees, and combines them to create a robust predictive model. This approach effectively learns from the mistakes of previous models and continuously improves prediction accuracy.
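
The sequential idea can be sketched with plain scikit-learn trees. This is a simplified illustration of gradient boosting for squared error on synthetic data, with an arbitrary learning rate and tree depth; it is not LightGBM's implementation, which adds histogram-based splitting and leaf-wise tree growth.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
errors = []

for _ in range(50):
    # For squared error, the residuals are the negative gradient of the loss
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Each weak learner corrects part of the remaining error
    prediction += learning_rate * tree.predict(X)
    errors.append(float(np.mean((y - prediction) ** 2)))

print(f"Training MSE after 1 round: {errors[0]:.4f}, after 50 rounds: {errors[-1]:.4f}")
```

Each round fits a small tree to what the current ensemble still gets wrong, so the training error shrinks as rounds are added.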

Moreover, LightGBM has a histogram-based algorithm for decision tree construction that minimizes memory usage and accelerates the training process. This makes it very useful for large datasets such as ours. Another crucial factor in our choice of the model is its ability to deal with categorical features directly. Therefore, there is no need for encoding, which streamlines the preprocessing phase. This was especially important for our predictive model due to the great number of categorical variables in the data. The code below follows a workflow similar to that of the models above.

### 4.2 Loading libraries

In [22]:
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, KFold
from sklearn.metrics import r2_score

### 4.3 Loading the dataset

In [23]:
# Read the CSV file
startup_df = pd.read_csv('cleaned.csv')

print(startup_df)

       permalink  Sequoia Capital  Kleiner Perkins Caufield & Byers  \
0              0            False                             False   
1              1            False                             False   
2              2            False                             False   
3              3            False                             False   
4              4            False                             False   
...          ...              ...                               ...   
28714      28714            False                             False   
28715      28715            False                             False   
28716      28716            False                             False   
28717      28717            False                             False   
28718      28718            False                             False   

       New Enterprise Associates  Accel Partners  Intel Capital  \
0                          False           False          False   
1            

### 4.4 Data engineering

The below data engineering steps are identical to those taken for K-NN (see 2.4).

In [24]:
# Convert the two funding date columns to datetime so their difference can be computed
startup_df['funded_at'] = pd.to_datetime(startup_df['funded_at'])
startup_df['last_funding_at'] = pd.to_datetime(startup_df['last_funding_at'])

# Calculate the funding duration in days
startup_df['funding_duration'] = (startup_df['last_funding_at'] - startup_df['funded_at']).dt.days

In [25]:
# List of columns to drop
columns_to_drop = ['founded_at', 'first_funding_at', 'last_funding_at', 'acquired_at', 'funded_at']

# Drop the specified columns from the DataFrame
startup_df.drop(columns=columns_to_drop, inplace=True)

# List of categorical variables to be transformed into dummy variables
#categorical_vars = ['category_list', 'market', 'status', 'country_code', 'state_code', 'region', 
#                    'acquirer_category_list', 'acquirer_market', 'acquirer_country_code', 
#                    'acquirer_state_code', 'acquirer_region', 'funding_round_type', 
#                    'investor_country_code', 'investor_state_code', 'investor_market', 
#                    'investor_region']

# Transform categorical variables into dummy variables (bool values)
#startup_df_dummies = pd.get_dummies(startup_df, columns=categorical_vars, drop_first=True, dtype=bool)
#print(startup_df_dummies)

In [26]:
def check_and_replace_column_names(df):
    replaced_columns = []
    for column_name in df.columns:
        if any(c in column_name for c in [' ', ',', ':', '.', '[', ']', '{', '}', '"']):
            # Replace non-compliant characters with underscores
            new_column_name = column_name.replace(' ', '_').replace(',', '_').replace(':', '_').replace('.', '_').replace('[', '_').replace(']', '_').replace('{', '_').replace('}', '_').replace('"', '_')
            replaced_columns.append((column_name, new_column_name))
    return replaced_columns

# Collect the column names that contain non-compliant characters
replaced_columns = check_and_replace_column_names(startup_df)

# Print the old and new column names
for old_name, new_name in replaced_columns:
    print(f"Replacing '{old_name}' with '{new_name}'")

# Replace non-compliant characters in column names
startup_df.rename(columns=dict(replaced_columns), inplace=True)

# Check the modified column names
print("Modified columns:", startup_df.columns)

Replacing 'Sequoia Capital' with 'Sequoia_Capital'
Replacing 'Kleiner Perkins Caufield & Byers' with 'Kleiner_Perkins_Caufield_&_Byers'
Replacing 'New Enterprise Associates' with 'New_Enterprise_Associates'
Replacing 'Accel Partners' with 'Accel_Partners'
Replacing 'Intel Capital' with 'Intel_Capital'
Replacing 'Draper Fisher Jurvetson (DFJ)' with 'Draper_Fisher_Jurvetson_(DFJ)'
Replacing 'First Round' with 'First_Round'
Replacing 'SV Angel' with 'SV_Angel'
Replacing '500 Startups' with '500_Startups'
Replacing 'Bessemer Venture Partners' with 'Bessemer_Venture_Partners'
Replacing 'Greylock Partners' with 'Greylock_Partners'
Replacing 'Lightspeed Venture Partners' with 'Lightspeed_Venture_Partners'
Replacing 'Andreessen Horowitz' with 'Andreessen_Horowitz'
Replacing 'Khosla Ventures' with 'Khosla_Ventures'
Replacing 'Index Ventures' with 'Index_Ventures'
Replacing 'Canaan Partners' with 'Canaan_Partners'
Replacing 'Redpoint Ventures' with 'Redpoint_Ventures'
Replacing 'Norwest Venture 

### 4.5 Defining, tuning and evaluating model

The code segments below define, tune and evaluate the LightGBM regression model. They begin by selecting predictors and splitting the data into training and testing sets, as done previously. There is one crucial difference: because LightGBM is able to deal with categorical variables, these do not have to be encoded. The only preprocessing step applied to these variables is setting their datatype to category, which makes them processible by the model.
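
What the category datatype does can be seen on a small made-up frame. The sketch below only illustrates the pandas side of this step, with hypothetical values: the column keeps its labels, while pandas stores an integer code per label behind the scenes.

```python
import pandas as pd

# Hypothetical rows for illustration
toy = pd.DataFrame({
    "status": ["operating", "acquired", "closed", "operating"],
    "funding_rounds": [1, 3, 2, 4],
})

toy["status"] = toy["status"].astype("category")

print(toy.dtypes)
print(toy["status"].cat.categories.tolist())  # ['acquired', 'closed', 'operating']
print(toy["status"].cat.codes.tolist())       # [2, 0, 1, 2]
```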

After the data split and preprocessing, the hyperparameter grid was defined. It specifies potential values for key parameters like the number of leaves, alpha, lambda L1, lambda L2 and minimum data in one leaf. Next, a random grid search is conducted with 5-fold cross-validation to explore the hyperparameter space. The best hyperparameters are extracted and used in the fitting of the model and finally the evaluation process. The best-performing model has an R² of 0.5334 on the test set. This shows a slight improvement compared to the Random Forest model.

In [27]:
# Define the target variable and predictors
target_col = 'funding_total_usd'
predictors = [col for col in startup_df.columns if col != target_col and col != "permalink"]
 
# Split the data into train and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(startup_df[predictors], startup_df[target_col], test_size=0.3, random_state=42)
 
# Define the categorical columns
categorical_columns = ['category_list', 'market', 'status', 'country_code', 'state_code', 'region',
                      'acquirer_category_list', 'acquirer_market', 'acquirer_country_code',
                      'acquirer_state_code', 'acquirer_region', 'funding_round_type',
                      'investor_country_code', 'investor_state_code', 'investor_market',
                      'investor_region']
 
# Convert categorical columns to the pandas 'category' dtype (LightGBM needs no further encoding)
for col in categorical_columns:
    X_train[col] = X_train[col].astype('category')
    X_test[col] = X_test[col].astype('category')

print(X_train)

       Sequoia_Capital  Kleiner_Perkins_Caufield_&_Byers  \
20999            False                             False   
16050            False                             False   
22019            False                             False   
4767             False                             False   
25269            False                             False   
...                ...                               ...   
21575            False                             False   
5390             False                             False   
860              False                             False   
15795            False                             False   
23654            False                             False   

       New_Enterprise_Associates  Accel_Partners  Intel_Capital  \
20999                      False           False          False   
16050                      False           False          False   
22019                      False           False          False   
4767       

In [28]:
# Initialize the LightGBM model
lgb_model = lgb.LGBMRegressor(objective='regression', boosting_type='gbdt', n_jobs=-1, random_state=100)

# Define the hyperparameter grid for random grid search
param_dist = {
    'num_leaves': [31, 600],
    'reg_alpha': [0.1, 0.5],
    'min_data_in_leaf': [30, 50, 100, 300, 400],
    'lambda_l1': [0, 1, 1.5],
    'lambda_l2': [0, 1]
}

# Perform random grid search with 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=100)
random_search = RandomizedSearchCV(estimator=lgb_model, param_distributions=param_dist, n_iter=20, cv=kf, scoring='r2', n_jobs=-1)
random_search.fit(X_train, y_train)

# Get the best model
best_lgb_model = random_search.best_estimator_

# Evaluate the model on the test set
y_pred = best_lgb_model.predict(X_test)
r2 = r2_score(y_test, y_pred)

print(f"Best parameters: {random_search.best_params_}")
print(f"R-squared on test set: {r2:.4f}")

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002358 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5626
[LightGBM] [Info] Number of data points in the train set: 20103, number of used features: 95
[LightGBM] [Info] Start training from score 15879290.097349
Best parameters: {'reg_alpha': 0.1, 'num_leaves': 600, 'min_data_in_leaf': 30, 'lambda_l2': 0, 'lambda_l1': 0}
R-squared on test set: 0.5334
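
To put the randomized search above into perspective, the following sketch counts how many combinations the grid contains and draws 20 of them, mirroring `n_iter=20`. It only uses the standard library and does not train any model; the seed is an arbitrary choice made for this illustration.

```python
import itertools
import random

# Same candidate values as in the grid used for the LightGBM search
param_dist = {
    'num_leaves': [31, 600],
    'reg_alpha': [0.1, 0.5],
    'min_data_in_leaf': [30, 50, 100, 300, 400],
    'lambda_l1': [0, 1, 1.5],
    'lambda_l2': [0, 1],
}

# Every possible combination of the listed values
all_combinations = list(itertools.product(*param_dist.values()))
n_total = len(all_combinations)

# Draw 20 distinct combinations (arbitrary seed, for illustration only)
rng = random.Random(0)
sampled = rng.sample(all_combinations, 20)

print(f"{n_total} combinations in the grid, {len(sampled)} evaluated "
      f"({len(sampled) / n_total:.0%} of the grid)")
```

The grid holds 2 × 2 × 5 × 3 × 2 = 120 combinations, so a search with 20 iterations evaluates only a fraction of them.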


In [29]:
# Get feature importances
feature_importances = best_lgb_model.feature_importances_

# Create a DataFrame to store feature importances along with their corresponding names
feature_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances})

# Sort the DataFrame by importance values in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print the top n most important features
n = 50  # Number of top features to display
top_features = feature_importance_df.head(n)
print("Top", n, "most important features:")
print(top_features)

Top 50 most important features:
                             Feature  Importance
67                 raised_amount_usd        8749
64           funding_round_permalink        3625
56                    funding_rounds        3365
84           first_founded_dayofyear        3163
95                  funding_duration        3064
70                investor_permalink        2817
79                      founded_year        2780
87            last_founded_dayofyear        2178
94                    investor_count        2073
76                        gdp_growth        1849
78              female_participation        1723
93                  funded_dayofyear        1649
74                        gdp_capita        1537
75         foreign_direct_investment        1516
82                first_founded_year        1465
81                 founded_dayofyear        1192
77                     interest_rate         955
55                            region         946
51                            market 

# 5. Lasso Regression

### 5.1 Model introduction

Lasso regression is particularly advantageous for datasets with a large number of features: by penalizing the absolute size of the coefficients, it shrinks some of them to exactly zero and thereby performs feature selection, which reduces complexity and improves interpretability. Given that our dataset contains many features, lasso regression gives us a convenient way to reduce the dimensionality of our feature set.

On the other hand, unlike more complex models, Lasso assumes a linear relationship between the features and the target variable, which may not adequately capture non-linear relationships.
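
For reference, the penalty described above can be written out explicitly. The form below is the objective documented for scikit-learn's `Lasso` estimator; the symbols are notation introduced here: $n$ is the number of training rows, $X$ the feature matrix, $y$ the target vector, $w$ the coefficient vector and $\alpha$ the regularization strength.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The first term is the usual squared error of a linear model, and the second term charges a cost proportional to the absolute value of each coefficient. With $\alpha = 0$ the problem reduces to ordinary least squares, while larger values of $\alpha$ drive more coefficients to exactly zero.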

### 5.2 Loading libraries

In [47]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.linear_model import Lasso, LassoCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from category_encoders import BinaryEncoder
from scipy.stats import uniform 
import numpy as np

### 5.3 Loading the dataset

In [48]:
# Set display option
pd.set_option('display.max_rows', None) 

In [49]:
# Load data
df = pd.read_csv('cleaned.csv')
print(df.head(10))  # Checking initial data

   permalink  Sequoia Capital  Kleiner Perkins Caufield & Byers  \
0          0            False                             False   
1          1            False                             False   
2          2            False                             False   
3          3            False                             False   
4          4            False                             False   
5          5            False                             False   
6          6            False                             False   
7          7            False                             False   
8          8            False                             False   
9          9            False                             False   

   New Enterprise Associates  Accel Partners  Intel Capital  \
0                      False           False          False   
1                      False           False          False   
2                      False           False          False   
3         

### 5.4 Data engineering


As we prepared our dataset for lasso regression, we faced the challenge of handling missing values, since lasso inherently cannot process incomplete data. After careful consideration, we opted to impute median values for missing numeric data, primarily timestamps. The median was chosen for its robustness against outliers and its ability to represent the central tendency without significantly altering the distribution of the dataset. 

In [50]:
# Impute median values
median_timestamps = df[['funded_year_new', 'funded_month_new', 'funded_dayofyear', 'acquired_year_new', 'acquired_month_new','acquired_dayofyear']].median()
df.update(df[['funded_year_new', 'funded_month_new', 'funded_dayofyear', 'acquired_year_new', 'acquired_month_new','acquired_dayofyear']].fillna(median_timestamps))
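
As a small illustration of why the median was preferred, the sketch below imputes a toy column with both the median and the mean. The values are hypothetical and chosen so that one outlier is present; it uses only the standard library rather than pandas.

```python
from statistics import fmean, median

# Hypothetical year column with gaps (None marks a missing value) and one outlier
funded_year = [2010, None, 2012, 2013, None, 1999]

# Statistics are computed on the observed values only
observed = [v for v in funded_year if v is not None]
fill_median = median(observed)
fill_mean = fmean(observed)

# Replace each gap with the median
imputed = [fill_median if v is None else v for v in funded_year]

print("median:", fill_median, "mean:", fill_mean)
print(imputed)
```

Here the outlier (1999) pulls the mean down to 2008.5, whereas the median stays at 2011, closer to the bulk of the observed values.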

Additionally, we engineered a new feature, `funding_duration`: the number of days between `funded_at` and `last_funding_at`. A longer funding duration might indicate that more funds were raised over time, making it a potentially valuable predictor for our model.

In [51]:
# Date processing
df['funded_at'] = pd.to_datetime(df['funded_at'])
df['last_funding_at'] = pd.to_datetime(df['last_funding_at'])
df['funding_duration'] = (df['last_funding_at'] - df['funded_at']).dt.days

For the missing categorical data, we took the straightforward approach of filling the gaps with the constant label 'Unknown'. This allowed us to keep every row instead of dropping records with missing categories.

In [52]:
# Fill categorical missing values
categories = ['category_list', 'market', 'status', 'state_code', 'acquirer_category_list', 'acquirer_market', 
              'acquirer_country_code', 'acquirer_state_code', 'acquirer_region', 'investor_category_list', 
              'investor_country_code', 'investor_state_code', 'investor_market', 'investor_region']
df[categories] = df[categories].fillna('Unknown')

Thereafter, we streamlined the dataset by dropping columns that were either already processed during the data cleaning stage or were deemed unnecessary for further analysis. This not only simplified our model but also focused the training process on the most relevant features.

In [53]:
# Drop unused columns
columns_to_drop = ['founded_at', 'first_funding_at', 'last_funding_at', 'acquired_at', 'funded_at']
df.drop(columns=columns_to_drop, inplace=True)



We also cleaned up the column names by replacing spaces and special characters with underscores, which was crucial for the subsequent processing steps.

In [54]:
# Rename columns to ensure clean column names
df.columns = df.columns.str.replace(' ', '_').str.replace(r'[,:.\[\]{}"]', '_', regex=True)
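
The effect of the renaming step above can be checked on individual names with the standard `re` module. The helper `clean_column_name` is a hypothetical name introduced for this sketch, and the raw investor names are presumed originals of cleaned columns that appear in the model output.

```python
import re

def clean_column_name(name):
    """Replace spaces, then the characters , : . [ ] { } " with underscores."""
    name = name.replace(' ', '_')
    return re.sub(r'[,:.\[\]{}"]', '_', name)

# Presumed raw column names (assumption: these are the uncleaned investor names)
examples = ['Sequoia Capital', 'U.S. Venture Partners', 'Draper Fisher Jurvetson (DFJ)']
cleaned = [clean_column_name(raw) for raw in examples]

for raw, new in zip(examples, cleaned):
    print(f"{raw!r} -> {new!r}")
```

Note that parentheses and ampersands are not in the replaced character set, so names such as `Draper_Fisher_Jurvetson_(DFJ)` keep them.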

Given the complexity and size of our dataset, particularly with numerous categorical variables, we faced the challenge of feature explosion while trying to maintain a manageable model size. We considered various encoding techniques, including one-hot encoding and count encoding, each with its benefits and drawbacks. One-hot encoding, while comprehensive, significantly increases the number of features, potentially leading to sparse matrices. Count encoding, on the other hand, can introduce bias if certain categories dominate.

Ultimately, we decided on binary encoding for our categorical variables. This method balances the need to include all categorical features in the model against the increase in feature dimensions. Binary encoding assigns each category an integer code and writes that code in binary, one column per bit, so a variable with n categories needs only about log2(n) columns instead of the n columns required by one-hot encoding, while retaining more distinct category information than count encoding. This choice helped streamline our feature set without sacrificing the informational value of our categorical data.
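
The idea behind binary encoding can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions (codes start at 1 and follow order of first appearance), not the exact implementation of `category_encoders.BinaryEncoder`; the function `binary_encode` and the market labels are hypothetical.

```python
def binary_encode(values):
    """Toy binary encoding: one integer code per category, one column per bit."""
    # Assign each distinct category an integer code, starting at 1
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1

    # Number of bits needed to represent the largest code
    n_bits = max(1, len(codes).bit_length())

    # Write each code as a list of bits, most significant bit first
    rows = [[(codes[v] >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]
            for v in values]
    return rows, n_bits

# Hypothetical category values
markets = ["Software", "Biotech", "Software", "Mobile", "Games", "Biotech"]
encoded, n_bits = binary_encode(markets)

for label, bits in zip(markets, encoded):
    print(f"{label:<10} {bits}")
```

With four distinct categories the sketch produces three bit columns, whereas one-hot encoding would produce four; the gap widens quickly as the number of categories grows.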

In [55]:
# Encoding categorical columns
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
encoder = BinaryEncoder(cols=categorical_columns)
data_encoded = encoder.fit_transform(df[categorical_columns])

In [56]:
# Drop original categorical columns and concatenate encoded ones
df.drop(columns=categorical_columns, inplace=True)
final_data = pd.concat([df, data_encoded], axis=1)

### 5.5 Defining, tuning and evaluating model
We initiated the modeling process for our lasso regression by undertaking several preparatory steps. These included setting up the train/test split and performing necessary transformations to meet the specific requirements of lasso regression.

In [57]:
# Define the target and predictors
target_col = 'funding_total_usd'
predictors = final_data.columns.difference([target_col, 'permalink'])

In [58]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(final_data[predictors], final_data[target_col], test_size=0.3, random_state=42)

Crucial to the process was scaling the data using the StandardScaler. This step standardizes features to have zero mean and unit variance, which is vital for lasso regression because it ensures the regularization penalty is uniformly applied across all coefficients. This uniformity is essential for effective feature selection and optimization.

In [59]:
# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
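
What the scaler computes can be reproduced by hand for a single feature. The sketch below uses hypothetical values and the standard library only; the key point is that the mean and standard deviation are learned from the training rows and then reused unchanged for the test rows.

```python
from statistics import fmean, pstdev

# Hypothetical values of one feature
train = [2.0, 4.0, 6.0, 8.0]
test = [10.0]

# Statistics are learned from the training rows only
mu = fmean(train)
sigma = pstdev(train)  # population standard deviation, the convention StandardScaler uses

# Apply the same shift and scale to both sets
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print("train mean after scaling:", fmean(train_scaled))
print("train std after scaling:", pstdev(train_scaled))
print("test value after scaling:", test_scaled[0])
```

The scaled training values have zero mean and unit variance by construction, while the test values generally do not, which is expected.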

We used `LassoCV` for our model, with a logarithmic grid of 13 alphas (from 1e-6 to 1e6) and 10-fold cross-validation. This approach selects the regularization strength by comparing cross-validated performance across a wide range of values on a multiplicative scale. The choice of alpha is critical because it controls the trade-off between overfitting and underfitting.
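
To give an intuition for how alpha controls the number of selected features, the sketch below uses the idealized case of uncorrelated, standardized features, where the lasso solution is obtained by soft-thresholding the unpenalized coefficients. This is a simplifying assumption that does not hold for our dataset, and the coefficient values are hypothetical.

```python
def soft_threshold(z, alpha):
    """Shrink z towards zero by alpha; values within [-alpha, alpha] become exactly 0."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0

# Hypothetical unpenalized coefficients for five features
ols_coefs = [5.0, -3.0, 0.8, -0.05, 0.002]

# A coarse logarithmic grid of regularization strengths: 0.001, 0.01, 0.1, 1, 10
alphas = [10.0 ** k for k in range(-3, 2)]

kept_counts = []
for alpha in alphas:
    shrunk = [soft_threshold(z, alpha) for z in ols_coefs]
    kept = sum(c != 0.0 for c in shrunk)
    kept_counts.append(kept)
    print(f"alpha={alpha:g}: {kept} non-zero coefficients")
```

Each step up the grid removes further features, which is why the search has to cover several orders of magnitude.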

After selecting alpha, we fit the model to the training set and made predictions on the test set. The resulting R² score of approximately 0.5854 indicates that the model is reasonably predictive. In terms of R², it surpassed the more complex models.

In [60]:
# Lasso regression with cross-validation
lasso_cv = LassoCV(alphas=np.logspace(-6, 6, 13), cv=10)
lasso_cv.fit(X_train_scaled, y_train)

lasso_opt = Lasso(alpha=lasso_cv.alpha_)
lasso_opt.fit(X_train_scaled, y_train)

y_pred = lasso_opt.predict(X_test_scaled)

  model = cd_fast.enet_coordinate_descent_gram(

In [61]:
# Calculate R^2 score
r2 = r2_score(y_test, y_pred)
print(f"R2-Score: {r2:.4f}")

R2-Score: 0.5854
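
For readers who want to see what the score above measures, here is a minimal re-implementation of the R² formula in plain Python. The function name `r_squared` and the sample values are introduced for this sketch only.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical targets and predictions
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.5]

score = r_squared(y_true, y_pred)
print(f"R2-Score: {score:.4f}")
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean of the target.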


### 5.6 Feature importance

Finally, we analyzed the coefficients of the lasso model to identify which features it deemed most significant. Among the most influential were `raised_amount_usd`, `funding_rounds`, `acquirer_state_code` and the involvement of Goldman Sachs. Notably, the model retained 36 features from an initial set of around 200, demonstrating its usefulness for feature reduction and its focus on the most impactful variables.

In [62]:
# Extract the coefficients from the lasso model (used here as feature importances)
coefficients = lasso_opt.coef_

features = pd.DataFrame({
    'Feature': final_data[predictors].columns,
    'Importance': coefficients
}).sort_values(by='Importance', ascending=False)

# Printing the features sorted by importance
print(features)

                              Feature    Importance
168                 raised_amount_usd  5.103829e+07
117                    funding_rounds  1.274584e+07
81              acquirer_state_code_0  4.778666e+06
16                      Goldman_Sachs  3.548463e+06
122                    investor_count  3.529623e+06
24   Kleiner_Perkins_Caufield_&_Byers  2.541039e+06
110                  funding_duration  2.432363e+06
1                      Accel_Partners  1.251475e+06
18                  Greylock_Partners  1.101136e+06
5                           Benchmark  6.925759e+05
30          New_Enterprise_Associates  6.065994e+05
114              funding_round_type_1  5.987934e+05
134                 investor_market_4  5.734611e+05
38                    Sequoia_Capital  5.634156e+05
31      North_Bridge_Venture_Partners  5.380657e+05
46              U_S__Venture_Partners  5.021467e+05
11      Draper_Fisher_Jurvetson_(DFJ)  5.011850e+05
74                  acquirer_region_1  4.040515e+05
77          

In [63]:
print("Number of features used:", np.sum(lasso_opt.coef_ != 0))

Number of features used: 36
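
Because the listing above is sorted by the signed coefficient, features with large negative coefficients end up at the bottom even though they can influence predictions as strongly as large positive ones. One possible alternative, sketched here with hypothetical feature names and values, is to keep only the non-zero coefficients and rank them by absolute magnitude.

```python
# Hypothetical standardized coefficients (names and values are illustrative only)
coefs = {
    'feature_a': 4.2,
    'feature_b': -6.1,
    'feature_c': 0.0,
    'feature_d': 0.7,
}

# Keep only the features the lasso did not zero out
selected = {name: c for name, c in coefs.items() if c != 0}

# Rank by absolute magnitude so strong negative effects are not buried
ranked = sorted(selected.items(), key=lambda item: abs(item[1]), reverse=True)

for name, c in ranked:
    print(f"{name:<10} {c:+.1f}")
print("Number of features used:", len(selected))
```

In this toy example `feature_b` ranks first despite its negative sign, and `feature_c` is dropped because its coefficient is exactly zero.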


# 6. Conclusion

### 6.1 Model evaluation
To effectively compare the performance of the various models we employed, we compiled their results into the following table:

| Model            | R² score |
|------------------|----------|
| K-NN             | 0.2618   |
| Random Forest    | 0.5200   |
| LightGBM         | 0.5334   |
| Lasso regression | 0.5854   |

From the table, it is evident that Lasso regression outperforms the other models, achieving the highest R² score of 0.5854. This indicates that Lasso regression explains the largest share of the variance in funding amounts on the test set, making it our preferred choice for predicting the total funding amount that startups receive.


### 6.2 Feature importance
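Below are the 10 features with the largest positive coefficients in the lasso regression model. Because the features were standardized, each coefficient is the change in predicted funding (in USD) for a one-standard-deviation increase in that feature: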

| Feature                                  | Importance     |
|------------------------------------------|----------------|
| raised_amount_usd                        | 5.103829e+07   |
| funding_rounds                           | 1.274584e+07   |
| acquirer_state_code_0                    | 4.778666e+06   |
| Goldman_Sachs                            | 3.548463e+06   |
| investor_count                           | 3.529623e+06   |
| Kleiner_Perkins_Caufield_&_Byers         | 2.541039e+06   |
| funding_duration                         | 2.432363e+06   |
| Accel_Partners                           | 1.251475e+06   |
| Greylock_Partners                        | 1.101136e+06   |
| Benchmark                                | 6.925759e+05   |

Strikingly, these features not only underscore the importance of financial metrics but also reflect the impact of prominent investors, echoing patterns observed in the other models.

### 6.3 Limitations
In our comparative analysis of predictive models for estimating startup funding, Lasso regression emerged as the most effective, achieving an R² score of 0.5854. The model thus explains roughly 59% of the variance in total funding on the test set. However, it is crucial to consider the limitations encountered during this analysis:
1. Resource and time constraints: Our analysis was constrained by limited computing power and strict time limitations, which confined the exploration to less computationally demanding models and limited the hyperparameter search to only 20 iterations. This not only restricted the depth of our investigation but also potentially prevented us from achieving optimal model performance.
2. Data quality and model assumptions: The data's completeness, accuracy, and representativeness were significant concerns, as issues like missing data and measurement errors could lead to biased predictions. Additionally, each model comes with inherent assumptions which, if unmet, adversely affect its effectiveness.
3. Feature selection and hyperparameter tuning: The selection and engineering of features, along with the methodology for hyperparameter tuning, faced limitations. We were only able to explore a fraction of the hyperparameter space, which leaves room for improvement.
4. External factors and model suitability: Although we incorporated some external variables in the dataset, we may not have adequately accounted for significant trends that impact startup success and funding, such as economic downturns or technological breakthroughs. Moreover, the principle that no single model provides the best solution under all circumstances was evident, underscoring the need for a tailored approach based on specific data characteristics and project requirements.

These limitations highlight that while our results are promising, they are conditional. The constraints on resources, time, and data quality, coupled with limited model tuning, suggest that the current outcomes can be improved. Further research with better resources and more comprehensive model exploration could yield more optimized results, enhancing accuracy.