In [1]:
from pathlib import Path

import pandas as pd
from openai import OpenAI

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

In [2]:
part_path = Path("part-2")
raw_path = part_path / "raw"
processed_path = part_path / "processed"
submission_path = part_path / "submission"

In [3]:
embedding_model = "text-embedding-3-small"
embedding_encoding = "cl100k_base"
max_tokens = 8000  # the maximum for text-embedding-3-small is 8191

def get_embedding(text, model=embedding_model, **kwargs):
    """Return the embedding vector for a single text as a list of floats."""
    # Replace newlines, which can negatively affect performance.
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model, **kwargs)
    return response.data[0].embedding

In [4]:
df_train = pd.read_csv(f"{processed_path}/train-pre-embeddings.csv")
df_test = pd.read_csv(f"{processed_path}/test-pre-embeddings.csv")

In [5]:
projects = pd.concat([
    df_train['project_a'],
    df_train['project_b'],
    df_test['project_a'],
    df_test['project_b']
]).unique().tolist()

In [6]:
# Build a lookup table of unique (project, description) pairs from both sides of train and test
df_desc = pd.concat([
    df_train[['project_a', 'description']].rename(columns={'project_a': 'project'}),
    df_train[['project_b', 'description_b']].rename(columns={'project_b': 'project', 'description_b': 'description'}),
    df_test[['project_a', 'description']].rename(columns={'project_a': 'project'}),
    df_test[['project_b', 'description_b']].rename(columns={'project_b': 'project', 'description_b': 'description'})
]).drop_duplicates()

In [7]:
df_desc.reset_index(drop=True, inplace=True)

In [8]:
df_desc.tail()

Unnamed: 0,project,description
112,grandinetech/grandine,High performance Ethereum consensus client
113,quic-go/quic-go,A QUIC implementation in pure Go
114,ethereumjs/ethereumjs-monorepo,Monorepo for the Ethereum VM TypeScript Implem...
115,alexeyraspopov/picocolors,The tiniest and the fastest library for termin...
116,ethereum/solc-js,Javascript bindings for the Solidity compiler


In [9]:
df_desc['description'] = df_desc['description'].astype(str)
df_desc["embedding"] = df_desc.description.apply(lambda x: get_embedding(x, model=embedding_model))
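
The `apply` call above sends one request per description. The embeddings endpoint also accepts a list of inputs, so the requests could be grouped. The following is a minimal sketch under that assumption, not the method used in this notebook: `embed_in_batches`, `openai_embed_batch`, and the batch size of 100 are hypothetical names and values introduced for illustration.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,  # illustrative value, not a documented limit
) -> List[List[float]]:
    """Embed texts in fixed-size batches, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        # Same newline cleanup as get_embedding.
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        embeddings.extend(embed_batch(batch))
    return embeddings


def openai_embed_batch(batch: List[str]) -> List[List[float]]:
    # Assumes the `client` and `embedding_model` defined earlier in this notebook.
    response = client.embeddings.create(input=batch, model=embedding_model)
    # Sort by index so the results line up with the inputs.
    return [d.embedding for d in sorted(response.data, key=lambda d: d.index)]
```

With these helpers, the column could be filled with `df_desc["embedding"] = embed_in_batches(df_desc["description"].tolist(), openai_embed_batch)`. Passing the embedding function in as a parameter keeps the batching logic testable without network access.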

In [None]:
# The raw text is no longer needed once the embeddings exist.
df_desc.drop(columns=['description'], inplace=True)

In [12]:
df_desc.head()

Unnamed: 0,project,embedding
0,mochajs/mocha,"[-0.06486216932535172, -0.0335981547832489, -0..."
1,chzyer/readline,"[-0.00926456693559885, 0.08315405994653702, -0..."
2,gulpjs/gulp,"[-0.021835271269083023, -0.022416533902287483,..."
3,webpack/webpack,"[-0.013375786133110523, -0.005960001144558191,..."
4,redux-saga/redux-saga,"[-0.06511539220809937, 0.019022632390260696, -..."


In [13]:
desc_eth = """
The Ethereum ecosystem is a vast and complex network of technologies, applications, and participants that revolve around the Ethereum blockchain. As of 2025, it has grown into one of the most significant and influential platforms in the cryptocurrency and blockchain space.
Core Components
Ethereum Blockchain
At the heart of the ecosystem is the Ethereum blockchain itself. It's a decentralized, open-source platform that enables the creation and execution of smart contracts and decentralized applications (dApps)1. The Ethereum network operates on a proof-of-stake consensus mechanism, which was implemented in 2022 as part of the Ethereum 2.0 upgrade7.
Ether (ETH)
Ether is the native cryptocurrency of the Ethereum network. It serves multiple purposes:
Medium of exchange
Store of value
Gas for transaction fees
Staking token for network security4
Smart Contracts
Smart contracts are self-executing programs that run on the Ethereum blockchain. They automatically execute when predetermined conditions are met, enabling trustless and automated agreements without intermediaries3.
Ethereum Virtual Machine (EVM)
The EVM is a runtime environment for executing smart contracts on the Ethereum network. It provides a sandboxed and isolated environment for running decentralized applications2.
Ecosystem Participants
Developers
Developers play a crucial role in the Ethereum ecosystem by creating dApps, smart contracts, and other blockchain-based solutions. They leverage Ethereum's programmable nature to build innovative applications across various sectors1.
Users
Millions of users interact with Ethereum-based applications, from DeFi platforms to NFT marketplaces. As of 2024, there were over 96 million accounts with an Ether balance1.
Miners/Validators
With the transition to proof-of-stake, miners have been replaced by validators who stake their ETH to secure the network and validate transactions7.
Key Sectors
Decentralized Finance (DeFi)
DeFi is one of the most transformative sectors within the Ethereum ecosystem. It leverages smart contracts to create financial products and services without traditional intermediaries. Popular DeFi applications include:
Decentralized exchanges (e.g., Uniswap)
Lending platforms (e.g., Aave)
Stablecoin issuance (e.g., MakerDAO)4
Non-Fungible Tokens (NFTs)
Ethereum has been at the forefront of the NFT revolution, hosting platforms like OpenSea and enabling the creation and trading of unique digital assets4.
Decentralized Autonomous Organizations (DAOs)
Ethereum supports the creation of DAOs, which are entities governed by smart contracts, enabling decentralized decision-making and resource allocation3.
Development Tools and Infrastructure
Development Frameworks
Tools like Truffle and Hardhat facilitate the development, testing, and deployment of smart contracts15.
Layer 2 Solutions
To address scalability issues, Layer 2 solutions like Optimism and Arbitrum have been developed to reduce transaction costs and increase throughput4.
Challenges and Future Developments
While Ethereum has established itself as a leader in the blockchain space, it still faces challenges:
Scalability: Despite improvements, the network continues to work on scaling solutions to handle increased demand.
Competition: Other blockchain platforms like Cardano and Solana present ongoing competition4.
The Ethereum ecosystem continues to evolve, with ongoing developments and upgrades aimed at improving scalability, security, and sustainability. The implementation of Ethereum 2.0 and the transition to proof-of-stake have been significant milestones in this journey7.
In conclusion, the Ethereum ecosystem represents a dynamic and innovative space at the forefront of blockchain technology, continually pushing the boundaries of what's possible in decentralized computing and finance.
"""

In [14]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
desc_eth_emb = get_embedding(desc_eth, model=embedding_model)

In [15]:
np.array(desc_eth_emb).reshape(1, -1).shape

(1, 1536)

In [16]:
df_desc.iloc[0]['embedding']

[-0.06486216932535172,
 -0.0335981547832489,
 -0.02666306309401989,
 -0.04807431250810623,
 0.014229278080165386,
 -0.032924845814704895,
 0.016709301620721817,
 0.02270175889134407,
 0.005436975974589586,
 0.04767032712697983,
 0.051844846457242966,
 -0.013006098568439484,
 -0.010868340730667114,
 0.007759894244372845,
 0.021826455369591713,
 0.019570868462324142,
 -0.003086002776399255,
 -0.004791720770299435,
 0.023520952090620995,
 0.024239148944616318,
 0.01513824611902237,
 -0.01572178117930889,
 0.04165542498230934,
 0.0021559938322752714,
 0.00045413337647914886,
 -0.027269043028354645,
 -0.03162311390042305,
 0.04210430011153221,
 -0.024845127016305923,
 -0.05440342426300049,
 0.060867197811603546,
 -0.02298230491578579,
 -0.0010667750611901283,
 0.012422563508152962,
 0.006458162330091,
 8.867874566931278e-05,
 0.014655707404017448,
 -0.0010022495407611132,
 0.015351461246609688,
 0.040735237300395966,
 -0.04553817957639694,
 -0.028974760323762894,
 0.049869805574417114,
 -0.

In [17]:
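# Cosine similarity of each project embedding to the Ethereum reference embedding.
# cosine_similarity returns a (1, 1) array, so each cell holds [[value]].
desc_eth_vec = np.array(desc_eth_emb).reshape(1, -1)
df_desc['cosine_similarity'] = df_desc['embedding'].apply(lambda emb: cosine_similarity(desc_eth_vec, np.array(emb).reshape(1, -1)))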
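
The row-wise call above stores a 1x1 array in every cell. As an alternative sketch (not what this notebook runs), the same similarities can be computed in one vectorized step that yields plain floats. `cosine_to_reference` is a hypothetical helper name, and the sketch assumes no embedding is the zero vector.

```python
import numpy as np


def cosine_to_reference(embeddings, reference):
    """Cosine similarity of each row of `embeddings` to the `reference` vector."""
    matrix = np.asarray(embeddings, dtype=float)  # shape (n, d)
    ref = np.asarray(reference, dtype=float)      # shape (d,)
    # Product of each row norm with the reference norm; assumes no zero vectors.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(ref)
    return (matrix @ ref) / norms                 # shape (n,)
```

In this notebook it would be called as `cosine_to_reference(df_desc["embedding"].tolist(), desc_eth_emb)`, and the result could be assigned directly to a column without the later flattening step.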

In [18]:
df_desc.tail()

Unnamed: 0,project,embedding,cosine_similarity
112,grandinetech/grandine,"[0.021497009322047234, -0.007843143306672573, ...",[[0.42385671311182566]]
113,quic-go/quic-go,"[-0.011620334349572659, 0.04118042811751366, -...",[[0.09138016355620468]]
114,ethereumjs/ethereumjs-monorepo,"[-0.02139364555478096, -0.0041436501778662205,...",[[0.27810810586931245]]
115,alexeyraspopov/picocolors,"[-0.048054084181785583, -0.02389087714254856, ...",[[0.07799856268206073]]
116,ethereum/solc-js,"[-0.014569791033864021, 0.003501335857436061, ...",[[0.21791059236360344]]


In [19]:
df_desc.to_csv(f"{processed_path}/project-cosine.csv", index=False)
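
Note that `to_csv` writes list-valued cells such as `embedding` as text, so a later `read_csv` returns strings rather than vectors. The sketch below shows one way to restore them with `ast.literal_eval`; the toy frame and its values are made up for illustration and stand in for `df_desc`.

```python
from ast import literal_eval
from io import StringIO

import numpy as np
import pandas as pd

# Toy frame standing in for df_desc (values are made up).
toy = pd.DataFrame({
    "project": ["a/b", "c/d"],
    "embedding": [[0.1, -0.2], [0.3, 0.4]],
})

buffer = StringIO()
toy.to_csv(buffer, index=False)  # lists are written as their string representation
buffer.seek(0)

restored = pd.read_csv(buffer)
# After reading, each embedding is a str such as "[0.1, -0.2]"; parse it back.
restored["embedding"] = restored["embedding"].apply(literal_eval).apply(np.array)
```

`literal_eval` only parses Python literals, which makes it a safer choice than `eval` for this round trip.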

In [21]:
df_train = df_train.merge(
    df_desc, 
    left_on='project_a', 
    right_on='project', 
    how='left',
    suffixes=('', '_a')
)
df_train.drop(columns=['project'], inplace=True)

df_train = df_train.merge(
    df_desc, 
    left_on='project_b', 
    right_on='project', 
    how='left',
    suffixes=('', '_b')
)
df_train.drop(columns=['project'], inplace=True)


In [22]:
df_test = df_test.merge(
    df_desc, 
    left_on='project_a', 
    right_on='project', 
    how='left',
    suffixes=('', '_a')
)
df_test.drop(columns=['project'], inplace=True)

df_test = df_test.merge(
    df_desc, 
    left_on='project_b', 
    right_on='project', 
    how='left',
    suffixes=('', '_b')
)
df_test.drop(columns=['project'], inplace=True)
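
The four merges above repeat the same pattern. A possible refactoring, sketched here under the assumption that `df_desc` should hold exactly one row per project, wraps the pattern in a helper and uses the `validate` argument of `DataFrame.merge` to fail loudly if a project appears twice (which would otherwise silently duplicate rows). `attach_project_features` is a hypothetical name.

```python
import pandas as pd


def attach_project_features(df, features, key, suffix):
    """Left-join per-project `features` onto `df` using the project name in `key`."""
    merged = df.merge(
        features,
        left_on=key,
        right_on="project",
        how="left",
        suffixes=("", suffix),
        validate="many_to_one",  # raises MergeError if `features` repeats a project
    )
    return merged.drop(columns=["project"])
```

Usage would mirror the cells above: `df_train = attach_project_features(df_train, df_desc, "project_a", "_a")` followed by the same call with `"project_b"` and `"_b"`, and likewise for `df_test`.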

In [23]:
len(df_train), len(df_test)

(20958, 4261)

In [25]:
# Each similarity cell holds a (1, 1) array; flatten it to a plain scalar.
df_train["cosine_similarity"] = df_train.cosine_similarity.apply(lambda x: np.array(x).reshape(-1)[0])
df_train["cosine_similarity_b"] = df_train.cosine_similarity_b.apply(lambda x: np.array(x).reshape(-1)[0])
df_test["cosine_similarity"] = df_test.cosine_similarity.apply(lambda x: np.array(x).reshape(-1)[0])
df_test["cosine_similarity_b"] = df_test.cosine_similarity_b.apply(lambda x: np.array(x).reshape(-1)[0])


In [26]:
df_test.tail()

Unnamed: 0,id,project_a,project_b,total_amount_usd,funder,quarter,description,created_at,updated_at,size,...,starred_ratio,num_dependents_ratio,v_index_ratio,stars_intersection_v_index,stars_b_intersection_v_index_b,stars_ratio_intersection_v_index_ratio,embedding,cosine_similarity,embedding_b,cosine_similarity_b
4256,25215,chainsafe/lodestar,bluealloy/revm,398133,optimism,2024-10,🌟 TypeScript Implementation of Ethereum Consensus,2018-06-22T14:41:47Z,2025-02-07T04:09:55Z,399320,...,0.377483,0.202128,0.0,0.001237,3500.00175,4.141279e-07,"[-0.03276010975241661, -0.03295972943305969, -...",0.348497,"[-0.0379643589258194, 0.03921094909310341, -0....",0.349677
4257,25216,chainsafe/lodestar,ethereum/solc-js,293189,optimism,2024-10,🌟 TypeScript Implementation of Ethereum Consensus,2018-06-22T14:41:47Z,2025-02-07T04:09:55Z,399320,...,0.765101,0.010287,0.0,0.001237,0.001471,4.567947e-07,"[-0.03276010975241661, -0.03295972943305969, -...",0.348497,"[-0.014569791033864021, 0.003501335857436061, ...",0.217911
4258,25217,libp2p/go-libp2p,bluealloy/revm,877024,optimism,2024-10,libp2p implementation in Go,2015-09-30T23:24:32Z,2025-02-06T18:26:58Z,59260,...,0.56682,0.724265,0.714286,31075.006215,3500.00175,0.5573498,"[-0.011176025494933128, 0.004412353038787842, ...",0.130008,"[-0.0379643589258194, 0.03921094909310341, -0....",0.349677
4259,25218,libp2p/go-libp2p,ethereum/solc-js,772080,optimism,2024-10,libp2p implementation in Go,2015-09-30T23:24:32Z,2025-02-06T18:26:58Z,59260,...,0.875445,0.097284,1.0,31075.006215,0.001471,0.8086137,"[-0.011176025494933128, 0.004412353038787842, ...",0.130008,"[-0.014569791033864021, 0.003501335857436061, ...",0.217911
4260,25219,bluealloy/revm,ethereum/solc-js,343290,optimism,2024-10,Rust implementation of the Ethereum Virtual Ma...,2021-09-29T09:21:18Z,2025-02-07T07:37:01Z,35674,...,0.843049,0.039411,1.0,3500.00175,0.001471,0.5433098,"[-0.0379643589258194, 0.03921094909310341, -0....",0.349677,"[-0.014569791033864021, 0.003501335857436061, ...",0.217911


In [27]:
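# Share of the pair's combined Ethereum similarity that belongs to project A.
# eps guards against division by zero when both similarities are 0.
eps = 1e-6
df_train["cosine_ratio"] = df_train["cosine_similarity"] / (df_train["cosine_similarity"] + df_train["cosine_similarity_b"] + eps)
df_test["cosine_ratio"] = df_test["cosine_similarity"] / (df_test["cosine_similarity"] + df_test["cosine_similarity_b"] + eps)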
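
As a worked illustration of the ratio above, the sketch below applies the same formula to the two similarity values shown for the `bluealloy/revm` vs `ethereum/solc-js` row in the `df_test` preview. `similarity_share` is a hypothetical helper name. The ratio stays within [0, 1] only when both similarities are non-negative; cosine similarity can be negative in general, so that is an assumption about this data.

```python
def similarity_share(sim_a: float, sim_b: float, eps: float = 1e-6) -> float:
    """Share of the combined similarity attributed to project A."""
    return sim_a / (sim_a + sim_b + eps)


# Values from the last row of the df_test preview.
revm, solc_js = 0.349677, 0.217911
share_a = similarity_share(revm, solc_js)  # roughly 0.616
share_b = similarity_share(solc_js, revm)  # roughly 0.384
```

The two shares sum to slightly less than 1 because of `eps`, so swapping A and B gives the complementary value up to that small offset.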

In [28]:
df_train.to_csv(f"{processed_path}/train-embeddings.csv", index=False)
df_test.to_csv(f"{processed_path}/test-embeddings.csv", index=False)