# Preliminaries

## 3.1 Feature engineering

This step has already been completed in the notebook `"Founder Feature Engineering.ipynb"`, so it is omitted from the main programme.

## 3.2 Calculate similarity between founder profiles

`calculate_aggregate_similarity()` returns the overall top n matches, ranked by a weighted sum of similarity scores across the compared profile attributes.
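As a rough illustration of the weighted-sum ranking, here is a minimal sketch on toy two-dimensional vectors. The function name `top_n_by_weighted_similarity`, the toy vectors, and the weights are assumptions made for this example; the notebook itself works on OpenAI embeddings stored in a CSV file.

```python
import numpy as np

def top_n_by_weighted_similarity(query_vecs, candidate_vecs, weights, n=2):
    """Rank candidates by a weighted sum of per-attribute cosine similarities.

    query_vecs: one 1-D vector per attribute (hypothetical toy input).
    candidate_vecs: one (num_candidates, dim) array per attribute.
    weights: one weight per attribute.
    """
    total = np.zeros(candidate_vecs[0].shape[0])
    for q, c, w in zip(query_vecs, candidate_vecs, weights):
        q = np.asarray(q, dtype=float)
        c = np.asarray(c, dtype=float)
        # Cosine similarity between the query and every candidate row
        cos = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
        total += w * cos
    # Indices of the n highest aggregate scores, best first
    order = np.argsort(total)[::-1][:n]
    return order, total

# Toy data: two attributes, three candidates (illustrative values only)
query = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
candidates = [
    np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]),
]
weights = [2.0, 1.0]
order, total = top_n_by_weighted_similarity(query, candidates, weights, n=2)
print(order, total)
```

Candidate 0 matches the query on both attributes and ranks first. Candidate 2 partially matches both and ranks second.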

In [1]:
import re
import string
import ast
import numpy as np
import pandas as pd
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

import os

# Read the API key from the OPENAI_API_KEY environment variable instead of hardcoding it
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
client = OpenAI(api_key = OPENAI_API_KEY)

# Function to clean and tokenize text
def clean_and_tokenize(text):
    # Convert to lower case
    text = text.lower()
    # Remove punctuation
    text = re.sub(f"[{string.punctuation}]", " ", text)
    # Tokenization and removing stopwords
    tokens = text.split()
    tokens = [token for token in tokens if token not in ENGLISH_STOP_WORDS]
    return tokens

# OpenAI API call for embeddings
def openai_embed_text(tokens):
    # Convert tokens back to text
    cleaned_text = " ".join(tokens)
    
    # Making a call to the OpenAI API
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=cleaned_text,
        encoding_format="float"
    )

    # Extracting embeddings from the response
    embeddings = response.data[0].embedding
    return embeddings

# The complete embed_text function
def embed_text(text):
    # Clean and tokenize the text
    tokens = clean_and_tokenize(text)
    # Get embeddings from the OpenAI API
    embeddings = openai_embed_text(tokens)
    return embeddings


# Function to calculate cosine similarity
def calculate_similarity(new_founder_profile, existing_embeddings):
    """
    Input: new_founder_profile: String
           existing_embeddings: List<String>, each the string representation of an embedding
    """
    # Embed the new founder's profile
    new_founder_embedding = embed_text(new_founder_profile)

    # Embed the existing profiles, converting string to the list representation
    aList = []
    for embedding_str in existing_embeddings:
        if embedding_str != "[]":
            try:
                aList.append(ast.literal_eval(embedding_str))
            except (ValueError, SyntaxError):  # malformed embedding string
                aList.append(embed_text(""))
        else:
            aList.append(embed_text(""))
    existing_embeddings = np.array(aList)

    # Calculate cosine similarity
    similarity_scores = cosine_similarity([new_founder_embedding], existing_embeddings)

    return similarity_scores[0]


# Method 1: The function used to calculate the weighted sum of similarity across 5 dimensions
def calculate_aggregate_similarity(training_file, new_founder_profile, num_matches = 5):
    """
    Input: training_file: file name of embeddings file for training
           new_founder_profile: the row in the Founder Features.csv for the new founder
           num_matches: Int
    """
    columns_to_compare = ["Self-Description", "Education Backgrounds", "Employment Backgrounds", "long_description", "category_list"]
    attribute_weights = [2.66, 0.69, 1.85, 1.71, 3.08] #Weight obtained from GPT analysis
    
    df = pd.read_csv(training_file)
    aggregate_similarity = [0] * len(df)  # one running total per training row

    for i, column in enumerate(columns_to_compare):
        existing_profiles_embeddings = list(df[column].fillna('') )
        new_founder_attribute = str(new_founder_profile[column])

        similarity_scores = calculate_similarity(new_founder_attribute, existing_profiles_embeddings)
        for index in range(len(similarity_scores)):
            aggregate_similarity[index] += attribute_weights[i] * similarity_scores[index]
        print(column)
        print(similarity_scores[:10])
        print(aggregate_similarity[:10])

    # Get the top num_matches matches, most similar first
    top_matches_indices = np.argsort(aggregate_similarity)[-num_matches:][::-1]
    top_matches = df.iloc[top_matches_indices]

    return top_matches
    

## 3.3: Rationale Generation

### Create a Founder object

Wrapping each record in a Founder object makes it easier to handle the top few founders and feed their information to the LLM.

In [4]:
import math

class Founder:
    def __init__(self, id, is_success, name, gender = 'None', age = 0.0, linkedin_url = '',
                 self_description = '', education_backgrounds = '', employment_backgrounds = '',
                 org_name = '', long_description = '', 
                 category_list = '', category_groups_list = '', country_code = '', city = ''):
        self.id = id
        self.is_success = is_success
        self.name = name
        self.gender = gender
        self.age = age
        self.linkedin_url = linkedin_url
        self.self_description = self_description
        self.education_backgrounds = education_backgrounds
        self.employment_backgrounds = employment_backgrounds
        self.org_name = org_name
        self.long_description = long_description
        self.category_list = category_list
        self.category_groups_list = category_groups_list
        self.country_code = country_code
        self.city = city
        
        self.startup_start_date = "[unknown]"

    def get_id(self):
        return self.id

    def get_is_success(self):
        if self.is_success:
            return "Successful"
        return "Failed"

    def get_name(self):
        return self.name

    def get_gender(self):
        return self.gender

    def get_age(self):
        if math.isnan(self.age):
            return "Not Available"
        return str(self.age)

    def get_linkedin_url(self):
        return self.linkedin_url

    def get_self_description(self):
        return self.self_description

    def get_education_backgrounds(self):
        return self.education_backgrounds

    def get_employment_backgrounds(self):
        return self.employment_backgrounds

    def get_org_name(self):
        return self.org_name

    def get_long_description(self):
        return self.long_description
    
    def get_startup_start_date(self):
        return self.startup_start_date

    def get_category_list(self):
        return self.category_list

    def get_category_groups_list(self):
        return self.category_groups_list

    def get_country_code(self):
        return self.country_code

    def get_city(self):
        return self.city


In [6]:
# Build a list of Founder objects from the top n similar founder profiles
import pandas as pd

def extract_top_n_founders(filepath, n):
    founder_profiles = []

    df = pd.read_csv(filepath)
    # Randomly select n rows from the given file (if the file has exactly n rows, this only shuffles their order)
    random_sample = df.sample(n=n)

    for index, row in random_sample.iterrows():

        new_founder = Founder(
            id= row['ID'], 
            is_success = bool(row['isSuccess']), 
            name = row['Name'], 
            gender = row['Gender'], 
            age = row['Age'], 
            linkedin_url = row['linkedin_url'], 
            self_description = row['Self-Description'],
            education_backgrounds = row['Education Backgrounds'], 
            employment_backgrounds = row['Employment Backgrounds'], 
            org_name = row['org_name'], 
            long_description = row['long_description'],
            category_list = row['category_list'], 
            category_groups_list = row['category_groups_list'], 
            country_code = row['country_code'], 
            city = row['city'])

        founder_profiles.append(new_founder)
        
    return founder_profiles



### Asking LLM: GPT-4 for a rationale to perform founder profile analysis

This assumes we already have `founder_profiles`, the list of Founder objects that we want to pass to GPT-4.

In [None]:
from openai import OpenAI

import os

# Set the OPENAI_API_KEY environment variable to your own OpenAI key (the original one has expired)
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

In [7]:
# Prompt Generation
import ast

def education_details(education_list):
    try:
        education_list = ast.literal_eval(education_list)
    except (ValueError, SyntaxError):  # missing or malformed education records
        return "\t - No Education Records"
    
    if education_list == []:
        return "\t - No Education Records"
    
    description = ""
    for education in education_list[:3]: # Keep only up to the most recent 3 educations
        degree = education[1] if education[1] != "N/A" else "[unknown]"
        school = education[0] if education[0] != "N/A" else "[unknown]"
        major = education[2] if education[2] != "N/A" else "[unknown]"
        description += f"\t - {degree} in {major} from {school}" + "\n"
        
    return description


def employment_details(employment_list, org_name):
    try:
        employment_list = ast.literal_eval(employment_list)
    except (ValueError, SyntaxError):  # missing or malformed employment records
        return "\t - No Employment Records"
    
    if employment_list == []:
        return "\t - No Employment Records"
    
    description = ""
    count = 0  #Keep only up to the most recent 3 jobs, excluding the current startup
    for employment in employment_list:
        company = employment[0]
        roles = ", ".join(employment[1])
        duration = str(round(employment[2], 2))
        start_time = employment[3]
        isCurrent = employment[4]
        if not isCurrent and org_name not in company:
            description += f"\t - Worked in {company} for {duration} years, starting from {start_time}. Roles in the company include {roles}.\n"
            count += 1
        if count >= 3: break
    
    return description

def extract_founder_details(profile, index = 0, hide_success = False):
    founders_details = ""
    founders_details += f"Founder {index+1}: \n"
    founders_details += " - Name: " + profile.get_name() + "\n"
    if not hide_success:
        founders_details += " - Startup Status: " + profile.get_is_success() + "\n"
    founders_details += " - Age: " + profile.get_age() + "\n"
    founders_details += " - Self Description: " + profile.get_self_description() + "\n"
    founders_details += " - Education Backgrounds:\n" + education_details(profile.get_education_backgrounds()) + "\n"
    founders_details += " - Employment Backgrounds:\n " + employment_details(profile.get_employment_backgrounds(), profile.get_org_name()) + "\n"
    founders_details += " - Startup Name: " + profile.get_org_name() + "\n"
    founders_details += " - Startup Idea: " + profile.get_long_description() + "\n"
    founders_details += "\n"
    return founders_details

def generate_prompt_for_rationale(founder_profiles):
    n = len(founder_profiles)
    
    introduction = "I am analyzing the profiles of several startup founders to understand " + \
                   "the key factors contributing to their success or failure. " + \
                   f"Below are the profiles of {n} founders:\n\n"
    
    founders_details = ""
    for index, profile in enumerate(founder_profiles):
        founders_details += extract_founder_details(profile, index)
    
    with open("prompts/rationale_request_body.txt", "r") as f:
        request = f.read()
    
    prompt = introduction + founders_details + request
    return prompt
    

In [9]:
# Now passing the prompt to GPT-4

def get_rationale(prompt):
    client = OpenAI(api_key = OPENAI_API_KEY)

    # Use the prompt passed in by the caller rather than rebuilding it from the global founder_profiles
    system_content = "You are an experienced venture capital analyst who is skilled at " +\
                     "synthesizing a rationale for startup success or failure based on founder data."

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": prompt}
        ],
        max_tokens=500  # Adjust as needed
    )

    rationale = response.choices[0].message.content
    return rationale

## 3.4 Generating Score and Pros and Cons List

### First, extract the key factors from the rationale

These factors, for success and for failure, will act as a guideline for the scoring of the founder.
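The sketch below shows one way such factor lines could be separated, assuming the model replies with lines of the form `Success 1) ...` and `Failure 1) ...`, as in the sample output later in this notebook. The function name `split_factors` and the sample text are made up for illustration. Requiring the numbered prefix skips header lines such as `Success Factors:`.

```python
import re

# Matches lines such as "Success 3) ..." or "Failure 1) ..." (assumed reply format)
FACTOR_LINE = re.compile(r"^(Success|Failure)\s+\d+\)")

def split_factors(text):
    """Split a reply into lists of success and failure factor lines."""
    success, failure = [], []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = FACTOR_LINE.match(line)
        if match is None:
            continue  # skip headers, blank lines, and commentary
        if match.group(1) == "Success":
            success.append(line)
        else:
            failure.append(line)
    return success, failure

# Made-up sample reply, for illustration only
sample = (
    "Here are the factors:\n"
    "Success Factors:\n"
    "Success 1) Relevant Educational Background: ...\n"
    "  Failure 1) Limited industry experience: ...\n"
    "Failure is common.\n"
)
success, failure = split_factors(sample)
print(success)
print(failure)
```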

In [10]:
def request_key_factors(rationale):
    client = OpenAI(api_key = OPENAI_API_KEY)

    # Structuring the prompt for GPT-4
    with open("prompts/scoring_factor_prompt.txt", "r") as f:
        introduction = f.readline()
        request = f.read()
        
    prompt = introduction + "\n" + rationale + "\n" + request
    
    system_content = "You are an experienced venture capital analyst who is skilled at extracting " +\
                     "key factors for the success and failure of startup founders."

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": prompt}
        ],
        max_tokens=500  # Adjust as needed
    )

    text_response = response.choices[0].message.content.strip()
    
    return text_response

def extract_key_factors(text_response):

    # Process the response to extract factors
    success_factors = []
    failure_factors = []
    current_list = None

    for line in text_response.split('\n'):
        if line.startswith('Success'):
            success_factors.append(line.strip())
        elif line.startswith('Failure'):
            failure_factors.append(line.strip())

    return success_factors, failure_factors


### Score Evaluation

Ask for an individual relevance score for each factor.
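The following sketch illustrates the two small computations behind this step: pulling a numeric score out of a model reply, and combining the per-factor scores into one overall score as the share of success points, scaled to 10. The helper names `parse_score` and `overall_score` are hypothetical. The parser assumes the reply contains text like `Score: 7`, which may not match the actual prompt's output format.

```python
import re

def parse_score(reply, low=0.0, high=10.0):
    """Extract the number following 'score' in a reply, clamped to [low, high].

    Returns None when no score is found, so the caller can retry or skip.
    """
    match = re.search(r"score\s*[:=]?\s*(-?\d+(?:\.\d+)?)", reply, re.IGNORECASE)
    if match is None:
        return None
    return min(high, max(low, float(match.group(1))))

def overall_score(success_scores, failure_scores):
    """Share of success points among all points, scaled to 0-10."""
    s, f = sum(success_scores), sum(failure_scores)
    return 10 * s / (s + f) if s + f > 0 else 0.0

print(parse_score("Score: 7"))
print(parse_score("I cannot judge this."))
print(overall_score([6, 4], [5, 5]))
```

Returning `None` rather than raising lets the caller decide how to handle a reply that does not follow the expected format.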

In [11]:
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(messages = [], model = 'gpt-4', max_tokens = 100):
    client = OpenAI(api_key = OPENAI_API_KEY)
    return client.chat.completions.create(model = model, 
                                          messages = messages, 
                                          max_tokens = max_tokens)

def score_founder_profile(founder, success_factors, failure_factors):

    #=============Initialization=============
    success_scores = []
    failure_scores = []
    # New founder's details
    founder_detail = extract_founder_details(founder, hide_success = True)
    system_content = "You are an experienced venture capital analyst who is skilled at evaluating " +\
                     "whether a startup founder would be successful or not."
    # Make an initial call to GPT-4
    with open("prompts/scoring_generation_prompt.txt", "r") as f:
        introduction = f.readline()
        request = f.readline()
        output_condition = f.read()
    prompt = introduction + "\n" + founder_detail + request
    print(prompt, "\n")
    initial_messages = [{"role": "system", "content": system_content}, {"role": "user", "content": prompt}]
    response = completion_with_backoff(initial_messages)
    text_response = response.choices[0].message.content.strip()
    initial_messages.append({"role": "assistant", "content": text_response})
    print(text_response)
    print()

    #============= Function to construct the prompt for scoring each factor============
    def create_scoring_prompt(factor, factor_type):
        introduction = f"Below is a factor of {factor_type}. Give a score of alignment between the factor and the founder details previously given. " +\
                        "Score between 0 and 10, where 10 means a perfect match and 0 means no match. You should be very stringent, and would not usually give a score higher than 6 for a success factor, or a score lower than 4 for a failure factor, unless clear evidence suggests otherwise.\n Factor:\n"

        prompt = introduction + factor + "\n" + output_condition
        print(factor)
        print()
        return prompt


    #===========Scoring for success factors===========
    for factor in success_factors:
        prompt = create_scoring_prompt(factor, 'success')
        messages = initial_messages + [{"role": "user", "content": prompt}]
        response = completion_with_backoff(messages)
        text_response = response.choices[0].message.content.strip()
        #messages.append({"role": "assistant", "content": text_response})
        print(text_response)
        print()
        score = float(text_response.split()[1])
        success_scores.append(score)

    #===========Scoring for failure factors===========
    for factor in failure_factors:
        prompt = create_scoring_prompt(factor, 'failure')
        messages = initial_messages + [{"role": "user", "content": prompt}]
        response = completion_with_backoff(messages)
        text_response = response.choices[0].message.content.strip()
        #messages.append({"role": "assistant", "content": text_response})
        print(text_response)
        print()
        score = float(text_response.split()[1])
        failure_scores.append(score)

    # Calculate overall score
    total_succ_score = sum(success_scores)
    total_fail_score = sum(failure_scores)
    if total_succ_score + total_fail_score > 0:
        overall_score = total_succ_score / (total_succ_score + total_fail_score) * 10
    else:
        overall_score = 0

    return success_scores, failure_scores, overall_score


### Generate Pros and Cons List

1. Analyze Scores and Factors:
   - Review Scores: Look at the individual scores assigned to each success and failure factor for the founder's profile.
   - Identify Strengths and Weaknesses: High scores in success factors and low scores in failure factors indicate strengths (Pros), while low scores in success factors and high scores in failure factors indicate weaknesses (Cons).
2. Construct the Pros and Cons List:
   - Pros (Strengths):
     - Include factors where the founder scored high in success factors.
     - Highlight attributes or aspects of the founder's profile that align well with the identified keys to success.
   - Cons (Weaknesses):
     - Include factors where the founder scored high in failure factors.
     - Point out areas where the founder's profile lacks or diverges from the success criteria.
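The selection rule in step 2 can be sketched as a simple threshold filter. This is an illustrative simplification: the function name `split_pros_and_cons`, the factor names, and the scores are invented for the example, and the threshold of 5 is assumed to mark a "high" score.

```python
def split_pros_and_cons(success, failure, threshold=5):
    """Pick pros and cons from scored factors.

    success, failure: lists of (factor_name, score) pairs.
    A factor counts as "high" when its score exceeds the threshold.
    """
    pros = [name for name, score in success if score > threshold]
    cons = [name for name, score in failure if score > threshold]
    return pros, cons

# Invented factors and scores, for illustration only
success = [("Relevant education", 7), ("Prior exits", 3)]
failure = [("No industry experience", 6), ("Solo founder", 2)]
pros, cons = split_pros_and_cons(success, failure)
print("Pros:", pros)
print("Cons:", cons)
```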

In [24]:
def generate_pros_and_cons(founder, success_factors, failure_factors, success_scores, 
                           failure_scores, threshold = 5):
    
    # New founder's details
    founder_detail = extract_founder_details(founder, hide_success = True)
    system_content = "You are an experienced venture capital analyst who is skilled at evaluating " +\
                     "whether a startup founder would be successful or not."
    with open("prompts/pros_and_cons_prompt.txt", "r") as f:
        introduction = f.readline()
        request = f.read()

    def summarize_factors_prompt(factor_list, score_list, factor_type):
        # Filter factors based on the threshold score
        relevant_factors = [factor for factor, score in zip(factor_list, score_list) if score > threshold]
        relevant_scores = [score for score in score_list if score > threshold]
        if len(relevant_factors) < 3:
            relevant_factors = factor_list
            relevant_scores = score_list

        prompt="\n".join(f"- {factor} [Score: {score} / 10]" for factor, score in zip(relevant_factors, relevant_scores))
        return prompt


    prompt = introduction + "\nFounder Details: " + founder_detail
    pros = summarize_factors_prompt(success_factors, success_scores, "success")
    cons = summarize_factors_prompt(failure_factors, failure_scores, "failure")
    prompt += "Success Factors:\n" + pros
    prompt += "\n\nFailure Factors:\n" + cons
    prompt += "\n" + request
    #print(prompt,"\n")

    messages = [{"role": "system", "content": system_content}, {"role": "user", "content": prompt}]
    response = completion_with_backoff(messages = messages, max_tokens = 500)
    return response.choices[0].message.content.strip()


# Main Programme Execution

In [16]:
# Step 1: input new founder profile
# Here, input founder profile is taken from among the testing set
# In practice, this can be modified to take input from an API, or ask for manual user input
new_founder_id = "3665_S"
df = pd.read_csv("Founder Features.csv")
search_result = df[df["ID"] == new_founder_id]
new_founder_profile = search_result.iloc[0]
# Input founder is saved to individual file
search_result.to_csv("Input Founder Features.csv", index = False)

new_founder_profile

ID                                                                   3665_S
isSuccess                                                              True
Name                                                        Charles Vincent
Gender                                                                 Male
Age                                                                    31.0
linkedin_url              https://www.linkedin.com/in/charles-vincent-73...
Self-Description          Charles is a Montreal-born Materials Engineer ...
Education Backgrounds     [('McGill University', "Bachelor's (4 year pro...
Employment Backgrounds    [('Temperpack Technologies', ['Engineering, IT...
org_name                                                         TemperPack
long_description          TemperPack is seeking to solve the world's pac...
category_list             Environmental Consulting,GreenTech,Manufacturi...
category_groups_list      Administrative Services,Manufacturing,Professi...
country_code

In [17]:
# Step 2: Find the top 5 matching results from training set
# Here, we are using method 1 on weighted sum. The calculation is very time-intensive as it
# needs to compare the embeddings of new founder details with all embeddings in other training set data
# This can be modified to method 2, using pinecone vector database.
# See the file "Similarity Comparison.ipynb" for more details
training_file = "Founder Features with Embeddings_Training.csv"
top_matches = calculate_aggregate_similarity(training_file, new_founder_profile, num_matches = 5)
top_matches

# Exporting the similarity results to new file
df = pd.read_csv("Founder Features.csv")
founder_ids = []
for index, row in top_matches.iterrows():
    founder_ids.append(row["ID"])
    
matching_rows = []
for index, row in df.iterrows():
    if row["ID"] in founder_ids:
        matching_rows.append(row)
    
new_df = pd.DataFrame(matching_rows)
new_df.to_csv("Top Similar Founders.csv")

Self-Description
[0.79788585 0.7901382  0.76581986 0.76284827 0.748219   0.7945609
 0.76519649 0.77851118 0.76804495 0.77367048]
[2.1223763578601234, 2.101767622843568, 2.0370808249921253, 2.0291764091788473, 1.9902625365272284, 2.113532004046599, 2.035422674584794, 2.070839728990379, 2.0429995795101403, 2.057963479708599]
Education Backgrounds
[0.82769126 0.81970877 0.80863059 0.82418825 0.64830977 0.77868455
 0.8928601  0.81076988 0.85656532 0.81552328]
[2.693483327935362, 2.6673666750257503, 2.5950359314637823, 2.5978663020580104, 2.437596280406116, 2.650824343356518, 2.6514961464472577, 2.630270948130838, 2.634029649162489, 2.6206745450626387]
Employment Backgrounds
[0.86374454 0.87408769 0.8500886  0.82799857 0.80553748 0.86840053
 0.80873238 0.88684163 0.89221033 0.84936408]
[4.291410736109577, 4.2844288981871745, 4.167699833935217, 4.129663652919863, 3.9278406275840014, 4.257365318744887, 4.147651050931705, 4.2709279655259635, 4.284618751474269, 4.19199808673801]
long_descriptio

In [20]:
# Step 3: Rationale generation

# Firstly, retrieve the top 5 matching results and make them into a list of Founder objects
founder_profiles = extract_top_n_founders('Top Similar Founders.csv', 5)
print([ (x.get_name(), x.get_is_success()) for x in founder_profiles])

# Next, generate prompt to be fed into gpt-4 for rationale generation
prompt = generate_prompt_for_rationale(founder_profiles)

# Feed the prompt to gpt-4 and obtain the rationale
rationale = get_rationale(prompt)
print(rationale)

[('Benjamin Moore', 'Failed'), ('James Mcgoff', 'Successful'), ('Yoke Chung', 'Successful'), ('Brian Powers', 'Successful'), ('Troy Swope', 'Successful')]
Success Rationale:
- Successful founders presented robust educational backgrounds with qualifications in their startup’s core domain or relevant fields. For example, Yoke Chung and Brian Powers possess degrees in fields closely related to their startups' main focus, enhancing their credibility and expertise.
- The successful founders have significant previous work experience, including leadership positions, which helped them acquire skills and competencies vital for their startups' navigation. For instance, James Mcgoff worked at Boeing Defense, and Yoke Chung had a tenure at Intel.
- Successful founders showed an ability to positively leverage their prior experiences, both career, and educational, in their current ventures. For instance, Footprint was established by Troy Swope and Yoke Chung, who applied their skills and experience 

In [21]:
# Step 4: Scoring & producing pros and cons list

# Firstly, extract the lists of success and failure factors from the rationale
text_response = request_key_factors(rationale)
success_factors, failure_factors = extract_key_factors(text_response)
print("Success Factors:", success_factors)
print("Failure Factors:", failure_factors)

Success Factors: ["Success 1) Relevant Educational Background: Successful founders have a robust education in their startup's domain or a closely connected field.", 'Success 2) Valuable Work Experience: Founders with substantial previous work experience, especially in leadership roles, often leverage these skills to navigate their startups successfully.', 'Success 3) Application of Prior Experience: The ability to leverage past experiences from career and education to the current venture can significantly contribute to success.', 'Success 4) Significant Problem-Solving: The startup focuses on solving a significant problem or taps into unmet market needs.', 'Success 5) Prevailing Recognition: Founders who have earned recognitions like Forbes 30 Under 30, EY Entrepreneur of the Year Winner, etc., are usually successful due to their exceptional contributions and abilities.', 'Success 6) Successful Entrepreneurship History: Founders who have successfully run or sold their previous startups

In [22]:
# Next, obtain again the new founder profile (input)
founder = extract_top_n_founders('Input Founder Features.csv', 1)[0]
print(founder.get_name(), founder.get_is_success())

# Perform scoring
# Assuming founder is a Founder object and success_factors, failure_factors are lists
success_scores, failure_scores, overall_score = score_founder_profile(founder, success_factors, failure_factors)
print(success_scores)
print(failure_scores)
print(overall_score)

Charles Vincent Successful
You are given the profile details of a new startup founder, as well as a list of factors that signals successful or unsuccessful founders. Your task is to give this founder a relevance score on each of the factors given. The founder's profile details is given as such:

Founder 1: 
 - Name: Charles Vincent
 - Age: 31.0
 - Self Description: Charles is a Montreal-born Materials Engineer and holds a B.Eng. from McGill University. Prior to cofounding TemperPack, Charles worked in the EV industry, and led the design and development of a novel lithium-ion battery in collaboration with the Canadian Department of National Defense. Charles drives the design, execution, and scaling of revolutionary manufacturing processes across our operation to maintain our nimbleness and authority in the marketplace. When he is not thinking about the next big thing, Charles can be found repairing his ’89 Wrangler, fly fishing in the Blue Ridge Mountains, or hiking with his fiancée Mic

In [25]:
# Finally, produce a list of pros and cons for the founder
# Assuming founder is a Founder object, and we have lists for success/failure factors and their scores
response = generate_pros_and_cons(founder, success_factors, failure_factors, success_scores, failure_scores)
print(response)

Pros:
1. Relevant Experience and Background: Charles has an educational background in Materials Engineering, which aligns with his startup's field of specialization. Moreover, his past work experience in the Electric Vehicle (EV) industry demonstrates his ability to design, develop, and execute revolutionary manufacturing processes, potentially beneficial to his current startup.
2. Problem-solving Prowess: Charles's startup focuses on a significant market need: sustainable packaging. His background and his work in developing a novel lithium-ion battery suggest he is proficient at addressing complex problems, a valuable skill for leading a startup.
3. Positive Personal Traits: Charles self-describes as a proactive individual who enjoys taking on complex challenges. With his proven record of project management and innovation, his self-confidence and perseverance are assets that could drive his startup towards success.

Cons:
1. Undefined employment Background: Charles’ previous roles at 