# ADSP 32018: Final Project
## Topic Modeling and Industry Identification

Peyton Nash

### Project Description
In March 2023, Goldman Sachs published a report indicating that ~25% of work tasks in the US and Europe could be automated using AI. However, not all industries will be affected equally. According to the report, certain fields, such as office work, legal, architecture, and the social sciences, have the potential for 30%+ automation, while occupations in construction, installation, and building maintenance will be largely unaffected.

In July 2025, Microsoft published an in-depth study based on 200,000 anonymized conversations with Microsoft Copilot, aiming to understand how generative AI is actually being used in the workplace and which professions are being most affected.

The researchers separated what users intended to do from what the AI actually delivered. They then mapped both to detailed job functions defined by O*NET. Using this framework, along with indicators of task success and coverage, they developed an “AI applicability score” for every occupation.

The findings are clear. Generative AI excels at tasks like information gathering, writing, and communication. It is already transforming knowledge and service-based roles. However, it has limited usefulness in jobs that rely on physical effort.
One of the most surprising insights? There’s little connection between AI’s impact and factors like income or education level. This challenges long-held assumptions about which roles are most at risk of disruption.

You can also find supporting evidence in the Facebook Research paper, which highlights Moravec’s Paradox. This thesis posits that the hardest problems in AI involve sensorimotor skills rather than abstract thought or reasoning. Notably, these findings coincide with predictions made by Goldman Sachs.

For this final project, I have prepared a collection of ~200K news articles on our favorite topics: data science, machine learning, and artificial intelligence. Your task is to identify which industries are going to be most impacted by AI over the next several years, based on the information and insights you can extract from this text corpus.

Your goal is to provide actionable recommendations on what can be done with AI to automate jobs, improve employee productivity, and generally make AI adoption successful. Please pay attention to the introduction of novel technologies and algorithms, such as AI for image generation and Conversational AI, as they represent a paradigm shift in the adoption of AI technologies and data science in general.

### Setup

In [1]:
%pip install umap-learn

Collecting umap-learn
  Using cached umap_learn-0.5.9.post2-py3-none-any.whl.metadata (25 kB)
Collecting numpy>=1.23 (from umap-learn)
  Using cached numpy-2.2.6-cp310-cp310-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting scipy>=1.3.1 (from umap-learn)
  Using cached scipy-1.15.3-cp310-cp310-macosx_14_0_arm64.whl.metadata (61 kB)
Collecting scikit-learn>=1.6 (from umap-learn)
  Using cached scikit_learn-1.7.1-cp310-cp310-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting numba>=0.51.2 (from umap-learn)
  Using cached numba-0.61.2-cp310-cp310-macosx_11_0_arm64.whl.metadata (2.8 kB)
Collecting pynndescent>=0.5 (from umap-learn)
  Using cached pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Collecting tqdm (from umap-learn)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting llvmlite<0.45,>=0.44.0dev0 (from numba>=0.51.2->umap-learn)
  Using cached llvmlite-0.44.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (4.8 kB)
Collecting joblib>=0.11 (from pynndescent>=0.5-

In [2]:
%pip install dotenv numpy pandas bertopic sentence_transformers scikit-learn hdbscan

Collecting dotenv
  Using cached dotenv-0.9.9-py2.py3-none-any.whl.metadata (279 bytes)
Collecting pandas
  Using cached pandas-2.3.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (91 kB)
Collecting bertopic
  Using cached bertopic-0.17.3-py3-none-any.whl.metadata (24 kB)
Collecting sentence_transformers
  Using cached sentence_transformers-5.1.0-py3-none-any.whl.metadata (16 kB)
Collecting hdbscan
  Using cached hdbscan-0.8.40-cp310-cp310-macosx_14_0_arm64.whl
Collecting python-dotenv (from dotenv)
  Using cached python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting plotly>=4.7.0 (from bertopic)
  Using cached plotly-6.3.0-py3-none-any.whl.metadata (8.5 kB)
Collecting transformers<5.0.0,>=4.41.0 (from sentence_transformers)
  Using cached transformers-4.55.2-py3-none-an

In [1]:
# Import libraries
import re, math, gc, itertools, warnings, os, random
from dotenv import load_dotenv

import numpy as np
import pandas as pd

# Topic modeling
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import CountVectorizer
import umap
import hdbscan



In [2]:
load_dotenv()

False

#### Topic Modeling

In [3]:
# Load data
df = pd.read_parquet('output_data2/df_dedupe2.parquet')

In [4]:
df['text_clean'][10000]



In [15]:
# Define BERTopic model
umap_model = umap.UMAP(
    n_neighbors=40,
    n_components=15,
    min_dist=0.1,
    random_state=42,
    metric="cosine",
    low_memory=True
)

hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=100,
    min_samples=10,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True
)

vectorizer_model = CountVectorizer(
    stop_words="english",
    ngram_range=(1,2),
    min_df=0.01
)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    min_topic_size=100,
    calculate_probabilities=True,
    verbose=True
)

In [53]:
# Define BERTopic model
umap_model = umap.UMAP(
    n_neighbors=40,
    n_components=15,
    min_dist=0.1,
    random_state=42,
    metric="cosine",
    low_memory=True
)

hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=75,
    min_samples=10,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True
)

vectorizer_model = CountVectorizer(
    stop_words="english",
    ngram_range=(1,2),
    min_df=0.01
)

topic_model2 = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
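    # Note: min_topic_size has no effect when a custom hdbscan_model is passed;
    # the minimum topic size is set by min_cluster_size above.
    min_topic_size=100,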
    calculate_probabilities=True,
    verbose=True
)

In [54]:
# Fit BERTopic
topics, probs = topic_model2.fit_transform(df['text_clean'])

2025-08-18 18:43:02,312 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 3689/3689 [08:35<00:00,  7.15it/s]
2025-08-18 18:51:48,172 - BERTopic - Embedding - Completed ✓
2025-08-18 18:51:48,173 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-08-18 18:53:49,183 - BERTopic - Dimensionality - Completed ✓
2025-08-18 18:53:49,187 - BERTopic - Cluster - Start clustering the reduced embeddings
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZER

In [57]:
# Save outputs
topics_dict = topic_model2.get_topics()
#pd.DataFrame(np.column_stack((probs, np.array(topics)))).to_parquet('output_data2/bert_results2.parquet')
#pd.DataFrame([[k, str(v)] for k, v in topics_dict.items()]).to_parquet('output_data2/bert_dict2.parquet')

#### Clean Topics

In [58]:
# Count topics (the count includes the -1 outlier topic)
print(f'Number of topics: {len(topics_dict)}')

Number of topics: 205


In [71]:
# Create topic column
df['topic'] = topics

# Get counts by topic
print(df.groupby('topic')['topic'].count())
# Share of articles labeled as outliers (topic -1)
print(len(df[df.topic==-1])/len(df))

topic
-1      60833
 0       1785
 1       1658
 2       1511
 3       1405
        ...  
 199       78
 200       78
 201       78
 202       77
 203       75
Name: topic, Length: 205, dtype: int64
0.5153548343372218
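
Topic `-1` is the label BERTopic gives to documents that HDBSCAN treats as noise, so the ratio printed above is the share of articles left without a topic. A minimal sketch of the same bookkeeping on toy labels, written in plain Python so it runs without the notebook's dependencies (the `toy_topics` list is hypothetical, not taken from the corpus):

```python
from collections import Counter

# Hypothetical topic labels for eight documents; -1 marks an outlier.
toy_topics = [-1, 0, 0, 1, -1, 2, 0, -1]

# Documents per topic, largest first (mirrors the groupby count above).
topic_sizes = Counter(toy_topics).most_common()

# Share of documents that were not assigned to any topic.
outlier_share = sum(t == -1 for t in toy_topics) / len(toy_topics)

print(topic_sizes)
print(f"{outlier_share:.3f}")
```

A high outlier share means any downstream industry counts describe only the clustered part of the corpus.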


### Assign Articles to Industries

Industries are taken from the [BEA](https://www.bea.gov/sites/default/files/2025-06/gdp1q25-3rd.pdf). The BEA presents industries as both category totals and sub-items; whether a category or one of its sub-items is used here is based on a review of the topic keywords produced by BERTopic. The industry keyword dictionary was generated with ChatGPT.

In [22]:
# Create a dictionary of keywords for each industry
industry_keywords = {
    "Real estate and rental and leasing": [
        "real estate", "property", "housing", "apartment", "condominium", "mortgage",
        "broker", "landlord", "tenant", "leasing", "rental", "commercial property",
        "residential property", "zoning", "realtor", "property management", "vacancy",
        "realty", "title deed", "appraisal", "real estate market", "land parcel",
        "property tax", "homeownership", "escrow", "foreclosure"
    ],
    "Government": [
        "federal", "state", "local government", "municipal", "regulation", "public policy",
        "legislation", "agency", "department", "minister", "congress", "parliament",
        "bureaucracy", "public sector", "administration", "civil service", "ordinance",
        "executive order", "governance", "commission", "cabinet", "policy making",
        "regulatory body", "ombudsman", "court", "constitution"
    ],
    "Manufacturing": [
        "factory", "plant", "assembly", "production", "supply chain", "automation",
        "machinery", "fabrication", "industrial", "engineering", "materials",
        "processing", "lean manufacturing", "manufacture", "equipment", "workers",
        "3D printing", "additive manufacturing", "mass production", "prototype",
        "assembly line", "tooling", "CNC", "robotics", "inventory control"
    ],
    "Professional, scientific, and technical services": [
        "consulting", "advisory", "engineering services", "research", "analytics",
        "scientific", "laboratory", "testing", "architecture", "design", "professional",
        "technical", "legal services", "accounting", "IT services", "data science",
        "specialist", "expertise", "surveying", "audit", "compliance",
        "intellectual property", "forensics", "R&D", "innovation", "biotech"
    ],
    "Health care and social assistance": [
        "hospital", "clinic", "doctor", "nurse", "patient", "healthcare",
        "pharmaceutical", "therapy", "treatment", "emergency", "surgery", "vaccine",
        "insurance claim", "social services", "mental health", "elder care", "wellness",
        "rehabilitation", "primary care", "telemedicine", "nursing home", "public health",
        "diagnostic", "clinical trials", "medical device", "nutrition"
    ],
    "Finance and insurance": [
        "banking", "fintech", "investment", "insurance", "mortgage", "stock market",
        "hedge fund", "credit", "loan", "equity", "derivatives", "risk management",
        "retirement", "asset management", "portfolio", "reinsurance", "capital",
        "underwriting", "mutual fund", "wealth management", "securities", "treasury",
        "bond", "cryptocurrency", "bitcoin", "crypto", "financial regulation", "trading"
    ],
    "Retail trade": [
        "retail", "store", "shop", "e-commerce", "customer", "merchandise",
        "sales", "inventory", "supply chain", "fashion", "groceries", "discount",
        "department store", "mall", "point of sale", "consumer goods", "foot traffic",
        "franchise", "checkout", "brand", "retail chain", "promotions", "online store",
        "returns", "shopping cart", "catalog"
    ],
    "Wholesale trade": [
        "wholesale", "distributor", "bulk", "supply", "inventory", "logistics",
        "B2B", "sourcing", "commodity", "warehousing", "retailer", "procurement",
        "supply chain", "distribution center", "trade partner", "supply agreement",
        "import", "export", "wholesale pricing", "stockist", "wholesale supplier",
        "shipment", "trade fair", "supply broker", "order fulfillment"
    ],
    "Information (media, telecom, publishing)": [
        "media", "telecommunications", "publishing", "broadcasting", "network",
        "internet provider", "streaming", "telecom", "digital media", "wireless",
        "social media", "online platform", "satellite", "ISP", "television",
        "journalism", "radio", "press", "content distribution", "newspaper",
        "broadcaster", "mobile carrier", "subscriber", "news outlet", "print media"
    ],
    "Software and data": [
        "software", "cloud computing", "data", "cybersecurity", "IT", "digital",
        "search engine", "AI", "machine learning", "data analytics", "data science",
        "big data", "database", "API", "blockchain", "natural language processing",
        "SaaS", "platform", "coding", "algorithm", "developer", "open source",
        "data warehouse", "artificial intelligence", "predictive analytics"
    ],
    "Construction": [
        "construction", "infrastructure", "contractor", "builder", "architecture",
        "civil engineering", "project management", "materials", "cement", "steel",
        "renovation", "residential", "commercial construction", "bridge", "road",
        "blueprint", "skyscraper", "foundation", "scaffolding", "crane", "hard hat",
        "general contractor", "permit", "urban development", "construction site"
    ],
    "Transportation and warehousing": [
        "transportation", "vehicle", "traffic", "logistics", "shipping", "freight", "supply chain",
        "distribution", "warehouse", "rail", "aviation", "trucking", "cargo",
        "delivery", "fleet", "scheduling", "port", "supply route", "air cargo", "courier"
    ],
    "Accommodation and food services": [
        "hotel", "restaurant", "hospitality", "lodging", "resort", "catering",
        "travel", "tourism", "fast food", "fine dining", "bed and breakfast",
        "chef", "bar", "café", "booking", "guest", "hospitality industry",
        "banquet", "room service", "hospitality management", "franchise restaurant",
        "menu", "reservation", "culinary", "hotel chain", "housekeeping"
    ],
    "Management of companies and enterprises": [
        "holding company", "corporate", "enterprise", "conglomerate", "management",
        "parent company", "subsidiary", "board of directors", "executive", "leadership",
        "strategy", "organizational", "CEO", "M&A", "corporate governance", "business unit",
        "CFO", "shareholders", "management consulting", "corporate strategy", "joint venture",
        "operating company", "group structure", "chairperson", "divestiture"
    ],
    "Utilities": [
        "electricity", "power", "grid", "water", "sewage", "natural gas",
        "renewable energy", "utility company", "hydropower", "infrastructure",
        "distribution", "pipeline", "waste management", "energy supply",
        "nuclear", "electric utility", "solar power", "wind energy", "substation",
        "ratepayer", "energy efficiency", "load management", "smart grid",
        "generation plant", "transmission line"
    ],
    "Arts, entertainment, and recreation": [
        "arts", "entertainment", "museum", "theater", "cinema", "music",
        "concert", "recreation", "festival", "sports", "amusement", "gaming",
        "culture", "exhibition", "performance", "broadcast", "leisure", "tourism",
        "gallery", "dance", "opera", "theme park", "artist", "show", "spectator",
        "player", "game"
    ],
    "Educational services": [
        "school", "university", "college", "student", "teacher", "curriculum",
        "learning", "classroom", "education", "training", "course", "degree",
        "academic", "instruction", "tutoring", "literacy", "online learning",
        "seminar", "lecture", "exam", "scholarship", "pedagogy", "MOOC",
        "graduate", "faculty"
    ],
    "Agriculture, forestry, fishing, and hunting": [
        "farming", "crops", "livestock", "agriculture", "harvest", "soil",
        "irrigation", "forestry", "logging", "timber", "fishing", "aquaculture",
        "ranching", "dairy", "horticulture", "sustainable farming", "hunting",
        "poultry", "agribusiness", "fisheries", "tractor", "plantation",
        "cattle", "seed", "barn", "organic farming"
    ]
}

In [60]:
# Define embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [61]:
# Industry embeddings
industry_vectors = {}

for industry, keywords in industry_keywords.items():
    embeddings = model.encode(keywords)
    industry_vectors[industry] = np.float64(np.mean(embeddings, axis=0))

In [62]:
# Topic embeddings
topic_vectors = {}

for topic, keywords in topics_dict.items():
    words = [w for w, _ in keywords if w not in ['ai', 'artificial', 'intelligence']][:5]
    weights = np.array([s for w, s in keywords if w not in ['ai', 'artificial', 'intelligence']][:5])
    embeddings = model.encode(words)
    topic_vectors[topic] = np.average(embeddings, axis = 0, weights=weights)
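
The cell above represents each topic by a weighted average of its top keyword embeddings, using the keyword scores returned by `get_topics()` as weights. A minimal sketch of that averaging step, with toy 2-D vectors standing in for real sentence embeddings (the vectors, the scores, and the helper name `weighted_centroid` are made up for illustration):

```python
def weighted_centroid(vectors, weights):
    """Return sum(w_i * v_i) / sum(w_i), the quantity np.average computes when given weights."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total for d in range(dim)]

# Hypothetical keyword embeddings and keyword scores for one topic.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_scores = [0.5, 0.25, 0.25]

centroid = weighted_centroid(toy_embeddings, toy_scores)
print(centroid)
```

Weighting pulls the topic vector toward its highest-scoring keywords, so a single dominant term can steer the industry match.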


In [63]:
# Get cosine similarity of industry and topic embeddings
sims = {}
for topic, tv in topic_vectors.items():
    sims[topic] = {ind: util.cos_sim(tv, iv).item() for ind, iv in industry_vectors.items()}
    sims[topic] = dict(sorted(sims[topic].items(), key=lambda x: x[1], reverse=True))

In [64]:
# Map each topic to its most similar industry; exclude topics where no industry has a cosine similarity of at least 0.6
map = {topic:max(industry, key=industry.get) for topic, industry in sims.items() if max(industry.values()) >= .6}

print(f'Number of topics: {len(topics_dict.keys())}')
print(f'Number of topics assigned: {len(map.values())}')

Number of topics: 205
Number of topics assigned: 111
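
The assignment rule used above (pick the most similar industry, but only if its cosine similarity reaches 0.6) can be sketched on toy vectors. This is an illustrative plain-Python version, not the notebook's implementation; the 3-D vectors and the helper names `cosine` and `assign_industry` are assumptions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def assign_industry(topic_vec, industry_vecs, threshold=0.6):
    """Return the closest industry name, or None if no similarity reaches the threshold."""
    sims = {name: cosine(topic_vec, vec) for name, vec in industry_vecs.items()}
    best = max(sims, key=sims.get)
    return best if sims[best] >= threshold else None

# Hypothetical industry vectors (real ones are averaged keyword embeddings).
toy_industries = {
    "Finance and insurance": [1.0, 0.0, 0.0],
    "Educational services": [0.0, 1.0, 0.0],
}

print(assign_industry([0.9, 0.1, 0.0], toy_industries))  # close to one industry
print(assign_industry([0.3, 0.3, 0.9], toy_industries))  # close to neither
```

Raising the threshold leaves more topics unassigned but reduces spurious matches; the 0.6 cut-off here left 94 of the 205 topics without an industry.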


In [66]:
topics_dict

{-1: [('ai', np.float64(0.0044116741663895185)),
  ('data', np.float64(0.003441200004026581)),
  ('new', np.float64(0.0029392111165203513)),
  ('technology', np.float64(0.0028875006527952205)),
  ('said', np.float64(0.0027658216594994675)),
  ('use', np.float64(0.0026415183362848015)),
  ('company', np.float64(0.0025332455923952456)),
  ('business', np.float64(0.0024673305461324524)),
  ('intelligence', np.float64(0.0024518833216545166)),
  ('like', np.float64(0.0023830857374600727))],
 0: [('chinese', np.float64(0.016009266735163855)),
  ('china', np.float64(0.015388967710604412)),
  ('baidu', np.float64(0.010914545877746448)),
  ('chinas', np.float64(0.009580043877756997)),
  ('ernie', np.float64(0.009498822777047188)),
  ('alibaba', np.float64(0.006376163175803243)),
  ('ernie bot', np.float64(0.0058525995333813426)),
  ('beijing', np.float64(0.005694808430028876)),
  ('model', np.float64(0.005272258250451039)),
  ('models', np.float64(0.005086616298463165))],
 1: [('apple', np.floa

In [65]:
# Inspect topic/industry map
for topic, industry in map.items():
    print(f'Topic: {topic}')
    print(f'Number of articles: {len(df[df.topic==topic])}')
    print(f'Industry: {industry}')
    print([w for w, _ in topics_dict[topic] if w not in ['ai', 'artificial', 'intelligence']][:5])
    print('\n')

Topic: -1
Number of articles: 60833
Industry: Software and data
['data', 'new', 'technology', 'said', 'use']


Topic: 2
Number of articles: 1511
Industry: Educational services
['students', 'teachers', 'education', 'student', 'chatgpt']


Topic: 5
Number of articles: 1101
Industry: Arts, entertainment, and recreation
['art', 'images', 'image', 'dalle', 'artists']


Topic: 6
Number of articles: 1101
Industry: Finance and insurance
['blockchain', 'trading', 'crypto', 'decentralized', 'cryptocurrency']


Topic: 9
Number of articles: 918
Industry: Finance and insurance
['financial', 'banks', 'banking', 'credit', 'fraud']


Topic: 10
Number of articles: 857
Industry: Arts, entertainment, and recreation
['music', 'artists', 'song', 'ai music', 'songs']


Topic: 11
Number of articles: 850
Industry: Arts, entertainment, and recreation
['eu', 'ai act', 'act', 'european', 'rules']


Topic: 13
Number of articles: 780
Industry: Transportation and warehousing
['traffic', 'vehicles', 'vehicle', 'road'

In [69]:
# Identify changes to the map
topics_change = {
    -1:None,
    0:'Government',
    1:'Retail trade',
    11:'Government',
    37:'Government',
    38:'Government',
    40:'Government',
    41:'Retail trade',
    178:'Finance and insurance',
    197:'Arts, entertainment, and recreation'
    }

# Update map
for t, i in topics_change.items():
    map[t] = i
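
The override step replaces automatic assignments with manually reviewed ones, and setting a topic to `None` removes it from any industry. A minimal sketch of how the overrides combine with the automatic map, using hypothetical dictionaries and topic labels rather than the notebook's data:

```python
# Hypothetical automatic assignments and manual corrections.
auto_map = {-1: "Software and data", 0: "Finance and insurance", 2: "Educational services"}
overrides = {-1: None, 0: "Government"}

# Later entries win, so manual corrections take precedence.
final_map = {**auto_map, **overrides}

# Topics missing from the map (here 7) also end up without an industry.
toy_topics = [-1, 0, 2, 7]
industries = [final_map.get(t) for t in toy_topics]
print(industries)
```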

In [72]:
# Assign industry to articles
df['industry'] = df['topic'].map(map)
df = df[df.to_drop == False]

n_assigned = df["industry"].notna().sum()
print(f'Number of articles by industry:\n\n{df.groupby("industry")["industry"].count().sort_values(ascending=False)}')
print(f'\nNumber of articles assigned to an industry: {n_assigned}')
# The denominator excludes outlier articles (topic -1)
print(f'Percentage of non-outlier articles assigned to an industry: {100*n_assigned/len(df[df.topic != -1]):.2f}')

Number of articles by industry:

industry
Government                                          6307
Arts, entertainment, and recreation                 5061
Finance and insurance                               4960
Software and data                                   2982
Professional, scientific, and technical services    2719
Retail trade                                        2602
Health care and social assistance                   2181
Educational services                                2004
Transportation and warehousing                      1445
Information (media, telecom, publishing)            1348
Accommodation and food services                      791
Utilities                                            635
Management of companies and enterprises              352
Agriculture, forestry, fishing, and hunting          236
Wholesale trade                                      213
Real estate and rental and leasing                   212
Manufacturing                                 

In [74]:
df.to_parquet('output_data2/df_industry2.parquet')