# Process of fetching the Emerging Technologies

## Overview 

1. Manually collect a list of emerging technologies
2. Enrich the technology names with a one-sentence definition
3. Compute sentence embeddings
4. Cluster the embeddings
5. Name the clusters
6. Manually review the clusters




### Step 1: Manually collect Technologies

We manually compiled a general list of emerging technologies from reports published by established industry organizations and from a list on Wikipedia. The result is stored as manual_selected_technologies.csv in the data/ directory.

### Step 2: Enrich the technologies

The next step is to enrich the technologies with their definitions. We do this because the embedding model used later yields better results when each name comes with a definition. The enrichment script is not included in this notebook because it is only a generic step; it can be found in the scripts/enrich/ folder. We used gpt-4.1-mini because its knowledge cutoff is in 2024.
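
The script in scripts/enrich/ is not reproduced here. As a rough sketch of what such an enrichment step might look like, the example below uses a hypothetical `enrich` helper, a hypothetical prompt wording, and an injected `define` callable that stands in for the LLM request (for example a call to gpt-4.1-mini). None of these names are taken from the actual script.

```python
from typing import Callable

# Hypothetical prompt wording; the real script in scripts/enrich/ may differ.
PROMPT_TEMPLATE = "Define the technology '{name}' in exactly one sentence."


def build_prompt(name: str) -> str:
    """Fill the prompt template with a whitespace-trimmed technology name."""
    return PROMPT_TEMPLATE.format(name=name.strip())


def enrich(names: list[str], define: Callable[[str], str]) -> list[dict[str, str]]:
    """Return one row per technology with its name and a one-sentence definition.

    `define` is an assumed callable that sends a prompt to an LLM and returns
    the answer text; it is injected so this sketch runs without network access.
    """
    rows = []
    for name in names:
        definition = define(build_prompt(name)).strip()
        rows.append({"Technology Name": name.strip(), "Definition": definition})
    return rows


# Stand-in for the LLM call, used only to show the data flow.
def fake_define(prompt: str) -> str:
    return f"(definition for prompt: {prompt})"


rows = enrich(["Spatial Computing ", "GitOps"], fake_define)
print(rows[0]["Technology Name"])
```

The row keys mirror the "Technology Name" and "Definition" columns that Step 3 reads from technologies_with_definitions.csv.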

### Step 3: Compute sentence embeddings

We compute the embeddings from each technology name combined with its one-sentence definition. We chose the all-MiniLM-L6-v2 model.

In [8]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer

df = pd.read_csv('../data/technologies-data/technologies_with_definitions.csv')
df["embed_text"] = df["Technology Name"].str.strip() + ". " + df["Definition"].str.strip()

model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    device="cpu"
)

embeddings = model.encode(
    df["embed_text"].tolist(),
    show_progress_bar=True,
    batch_size=16,
    convert_to_numpy=True,
    normalize_embeddings=True
)

np.save('../data/technologies-data/tech_embeddings.npy', embeddings)
df.to_parquet("../data/technologies-data/tech_terms_enriched.parquet", index=False)

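
Because the cell above passes `normalize_embeddings=True`, every embedding has unit length. The following sketch illustrates what that implies, using random vectors as a stand-in for the real embeddings (an assumption made purely for illustration): the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 - 2 * cosine.

```python
import numpy as np

# Random vectors stand in for real embeddings (an assumption for illustration);
# 384 matches the output dimension of all-MiniLM-L6-v2.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 384))

# Same effect as normalize_embeddings=True: scale every row to unit length.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# For unit vectors the dot product is the cosine similarity.
cosine = unit @ unit.T

# Squared Euclidean distance between all pairs of rows.
diff = unit[:, None, :] - unit[None, :, :]
sq_euclidean = (diff ** 2).sum(axis=-1)

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.allclose(sq_euclidean, 2 - 2 * cosine))
```

This is why a Euclidean metric on normalized embeddings, as used in Step 4, ranks pairs of technologies in the same order as cosine distance would.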

### Step 4: Create Clusters

Next we cluster the technologies. We use HDBSCAN because we do not know in advance how many clusters to expect (there is no pre-set k). Furthermore, HDBSCAN labels low-density points as noise. This is helpful because we expect the majority of technologies to be standalone technologies that cannot be grouped.

In [9]:
from tqdm import tqdm
import hdbscan

# Load embeddings and DataFrame which we created earlier
embeddings = np.load('../data/technologies-data/tech_embeddings.npy')
df = pd.read_parquet("../data/technologies-data/tech_terms_enriched.parquet")

# Run HDBSCAN
clusterer = hdbscan.HDBSCAN(
    metric="euclidean",
    min_cluster_size=2,
    min_samples=1,
    cluster_selection_method="leaf",
    prediction_data=False,
)

labels = clusterer.fit_predict(embeddings)

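# Give each noise point (-1) its own singleton ID; clustered points keep their label.
# The result is kept in a separate list: "cluster_id" below stores the raw HDBSCAN
# labels, because the later cells identify singletons via cluster_id == -1.
next_id = labels.max() + 1
singleton_ids = []

for label in labels:
    if label == -1:  # noise
        singleton_ids.append(next_id)
        next_id += 1
    else:
        singleton_ids.append(label)

df["cluster_id"] = labels


singleton_count = (labels == -1).sum()
# Note: nunique() also counts the noise label -1 as one distinct value
print(f"Number of clusters found: {df['cluster_id'].nunique()}")
print(f"Singleton points: {singleton_count}")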

# Get non-singleton clusters (where cluster_id != -1)
non_singleton_clusters = df[df['cluster_id'] != -1].groupby('cluster_id')['Technology Name'].agg(list)

# Print each cluster
for cluster_id, techs in non_singleton_clusters.items():
    print(f"Cluster {cluster_id}: {', '.join(techs)}")



Number of clusters found: 23
Singleton points: 45
Cluster 0: GitOps, Low-Code And No-Code Platforms
Cluster 1: Metaverse for mental health, The Metaverse
Cluster 2: Immersive Technology for the Built Worls, Immersive-Reality Technologies
Cluster 3: cloud-native, Cloud and Edge Computing
Cluster 4: Carbon-capturing Microbes, Decarbonization Technologies, Sustainable computing, Clean Energy Generation and Storage, Electrification and Renewables, Climate Technologies Beyond Electrification
Cluster 5: Digital Immune System, Privacy-Enhancing Technologies, Data Privacy, Data Security, and Cybersecurity Technologies, Digital Trust and Cybersecurity, Blockchain
Cluster 6: Disease-Diagnosing Breath Sensors, Energy Harvesting from Wireless Signals, Wireless Biomarker Devices
Cluster 7: Semiconductors and Microelectronics, Implantable Microchips
Cluster 8: Flexible batteries, Flexible neural electronics, Solid-State Batteries
Cluster 9: Spatial Computing, Spatial omics
Cluster 10: Homomorphic En
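
HDBSCAN marks noise with the label -1, so counting distinct labels also counts the noise label. The sketch below uses a small made-up label array, not the real output, to show one way of counting the actual clusters and of giving every noise point its own ID. `relabel_noise` is a hypothetical helper name.

```python
import numpy as np


def relabel_noise(labels: np.ndarray) -> np.ndarray:
    """Return a copy of `labels` where every noise point (-1) gets a unique ID.

    New IDs start right after the largest existing cluster label.
    """
    relabeled = labels.copy()
    noise_mask = relabeled == -1
    first_new_id = relabeled.max() + 1  # equals 0 if every point is noise
    relabeled[noise_mask] = first_new_id + np.arange(noise_mask.sum())
    return relabeled


# Made-up labels for illustration: two clusters (0 and 1) and three noise points.
labels = np.array([0, -1, 1, 0, -1, 1, -1])

n_clusters = len(set(labels.tolist()) - {-1})  # clusters, excluding noise
n_noise = int((labels == -1).sum())
unique_ids = relabel_noise(labels)

print(n_clusters, n_noise, unique_ids.tolist())
```

Working on a copy keeps the original -1 labels available for any later filtering that relies on them.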

### Step 4.1: Manually create Terms for Groups

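As the output shows, we obtained 23 cluster labels (a count that includes the noise label -1, because it comes from nunique()) and 45 singletons, i.e. 45 technologies that could not be grouped. We now name the clusters manually and add a few extra groups where a cluster turned out to be too inclusive.
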
0. DevOps
1. Metaverse
2. Immersive-Reality Technologies
3. Cloud Computing
4. Climate Technologies
5. Cybersecurity
6. Digital Immune System
7. Wireless Biomarker Devices
8. Advanced Semiconductors and Microelectronics
9. Advanced Batteries
10. Spatial Computing 
11. Quantum Technologies
12. Machine Learning
13. Advanced Engineering Materials
14. GenUI
15. AI-facilitated healthcare
16. Advanced Computing
17. Advanced Manufacturing
18. Advanced Connectivity
19. Integrated Communication and Networking Technologies
20. Space-Based Connectivity
21. Biotechnologies
22. Future of Mobility
23. Future of Space Technologies
24. AI-Driven Diagnostics
25. AI for Scientific Discovery
26. Autonomous Agents
27. Generative Artificial Intelligence
28. AI Copilots

Now we combine these names with our singletons, and the list is complete.

In [10]:
# Manually chosen names for the clusters (see the list above)
cluster_groups = [
    "DevOps",
    "Metaverse",
    "Immersive-Reality Technologies",
    "Cloud Computing",
    "Climate Technologies",
    "Cybersecurity",
    "Digital Immune System",
    "Wireless Biomarker Devices",
    "Advanced Semiconductors and Microelectronics",
    "Advanced Batteries",
    "Spatial Computing",
    "Quantum Technologies",
    "Machine Learning",
    "Advanced Engineering Materials",
    "GenUI",
    "AI-facilitated healthcare",
    "Advanced Computing",
    "Advanced Manufacturing",
    "Advanced Connectivity",
    "Integrated Communication and Networking Technologies",
    "Space-Based Connectivity",
    "Biotechnologies",
    "Future of Mobility",
    "Future of Space Technologies",
    "AI-Driven Diagnostics",
    "AI for Scientific Discovery",
    "Autonomous Agents",
    "Generative Artificial Intelligence",
    "AI Copilots",
]
# Noise points (cluster_id == -1) are kept as standalone technologies
singletons = df[df['cluster_id'] == -1]['Technology Name'].tolist()

# Combine named clusters and singletons into the final list
all_technologies = cluster_groups + singletons
pd.DataFrame(all_technologies, columns=['Technology Name']).to_csv(
    '../data/technologies-data/finished_technologies.csv', index=False
)
print(f"CSV file has {len(all_technologies)} technologies")


CSV file has 74 technologies
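
Because the final list merges hand-typed names with names taken from the data, a quick sanity check before writing the CSV may be worthwhile. The following is a minimal sketch with made-up example names and a hypothetical helper `merge_names`; it trims whitespace and drops case-insensitive duplicates while preserving order. It is not part of the pipeline above.

```python
def merge_names(cluster_names: list[str], singleton_names: list[str]) -> list[str]:
    """Combine both lists, trimming whitespace and skipping repeated names.

    Names are compared case-insensitively; the first spelling seen is kept.
    """
    merged = []
    seen = set()
    for name in cluster_names + singleton_names:
        cleaned = name.strip()
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return merged


# Made-up inputs for illustration, not the real lists.
example = merge_names(
    ["Spatial Computing ", "Cybersecurity"],
    ["cybersecurity", "Digital Twins"],
)
print(example)
```

If such a check removes entries, the printed row count would differ from the simple sum of both lists, which makes accidental overlaps easy to spot.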
