# ADSP 32018: Final Project
## Topic Modeling and Industry Identification

Peyton Nash

### Project Description
In March of 2023, Goldman Sachs published a report, indicating that ~25% of the tasks in US and Europe can be automated using AI.  However, not all industries will be affected equally. According to the report, certain jobs, like office tasks, legal, architecture, and social sciences have a potential for 30%+ automation, while positions like construction, installation, and building maintenance are going to be largely unaffected.

In July of 2025, Microsoft published an in-depth studyLinks to an external site. based on 200,000 anonymized conversations with Microsoft Copilot, aiming to understand how generative AI is actually being used in the workplace and which professions are being most affected.

The researchers separated what users intended to do from what the AI actually delivered. They then mapped both to detailed job functions defined by O*NET. Using this framework, along with indicators of task success and coverage, they developed an “AI applicability score” for every occupation.

The findings are clear. Generative AI excels at tasks like information gathering, writing, and communication. It is already transforming knowledge and service-based roles. However, it has limited usefulness in jobs that rely on physical effort.
One of the most surprising insights? There’s little connection between AI’s impact and factors like income or education level. This challenges long-held assumptions about which roles are most at risk of disruption.

You can also find supporting evidence in the Facebook Research paper, which highlights Moravec’s Paradox. This thesis posits that the hardest problems in AI involve sensorimotor skills rather than abstract thought or reasoning. Notably, these findings coincide with predictions made by Goldman Sachs.

For this final project, I have prepared a collection of ~200K news articles on our favorite topics, data science, machine learning, and artificial intelligence. Your task is to identify what industries are going to be most impacted by AI over the next several years, based on the information/insights you can extract from this text corpus.

Your goal is to provide actionable recommendations on what can be done with AI to automate the jobs, improve employee productivity, and generally make AI adoption successful. Please pay attention to the introduction of novel technologies and algorithms, such as AI for image generation and Conversational AI, as they represent the entire paradigm shift in adoption of AI technologies and data science in general.

### Setup

In [31]:
# Import libraries
import re, math, gc, itertools, warnings, os, random
from dotenv import load_dotenv

import numpy as np
import pandas as pd

# Topic modeling
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import CountVectorizer
import umap.umap_ as UMAP
import hdbscan

In [4]:
load_dotenv()

True

#### Topic Modeling

In [5]:
# Load data
df = pd.read_parquet('output_data/df_dedupe.parquet')

In [19]:
df['text_clean'][0]

'Infogain AI Business Solutions Now Available in the Microsoft Azure Marketplace Business News This Week Courtyard by Marriott Mahabaleshwar Announces Three New Leadership Appointmen Dibyendu Bhattacharya to appear in Sonu Soods Fateh and Anubhav Sinhas next Exclusive Interview with Mr Subroto Sen, CEO, GenY Medium NGEL and HMEL tie up to collaborate in Renewable Energy and the generation of Green Hydrogen synthesizing Green Chemicals Medica Group of Hospitals appoints Dr Nandakumar Jairam as the new Chairman Entrepreneurship Home Improvement Business Wire Listing Digital Marketing About Business News This Week HomeBusinessInfogain AI Business Solutions Now Available in the Microsoft Azure Marketplace Infogain AI Business Solutions Now Available in the Microsoft Azure Marketplace Los Gatos, California, May 20th, 2023 Infogain, a Silicon Valley headquartered leader in human centered digital platform and software engineering services, today announced the availability of three AI powered 

In [None]:
umap_model = UMAP.UMAP(
    n_neighbors=30,
    n_components=10,
    min_dist=0.1,
    random_state=42,
    metric="cosine",
    low_memory=True
)

hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=200,
    min_samples=10,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True
)
vectorizer_model = CountVectorizer(
    stop_words="english",
    ngram_range=(1,2),
    min_df=0.01
)


topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    min_topic_size=200,
    calculate_probabilities=True,
    verbose=True
)

In [12]:
# Fit BERTopic
topics, probs = topic_model.fit_transform(df['text_clean'])

2025-08-18 06:34:03,166 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/5067 [00:00<?, ?it/s]

2025-08-18 06:46:25,013 - BERTopic - Embedding - Completed ✓
2025-08-18 06:46:25,014 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-08-18 06:49:33,705 - BERTopic - Dimensionality - Completed ✓
2025-08-18 06:49:33,708 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-08-18 06:50:47,809 - BERTopic - Cluster - Completed ✓
2025-08-18 06:50:47,827 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-08-18 06:52:35,609 - BERTopic - Representation - Completed ✓


In [74]:
# Save outputs
topics_dict = topic_model.get_topics()
pd.DataFrame(np.column_stack((probs, np.array(topics)))).to_parquet('output_data/bert_results.parquet')
pd.DataFrame([[k, str(v)] for k, v in topics_dict.items()]).to_parquet('output_data/bert_dict.parquet')

#### Clean Topics

In [14]:
# Get topics
print(f'Number of topics: {len(topics_dict)}')

Number of topics: 121


In [15]:
# Create topic column
df['topic'] = topics

# Get counts by topic
print(df.groupby('topic')['topic'].count())

topic
-1      81862
 0       8280
 1       5851
 2       3130
 3       2413
        ...  
 115      206
 116      205
 117      204
 118      203
 119      200
Name: topic, Length: 121, dtype: int64


### Assign Articles to Industries

Industries are collected from the [BEA](https://www.bea.gov/sites/default/files/2025-06/gdp1q25-3rd.pdf). The industries are presented as both category total and sub-items. Whether the category or sub-item is included is based on a review of the topic keywords produced by BERTopic. The industry keyword dictionary was produced by ChatGPT.

In [17]:
# Create a dictionary of keywords for each industry
industry_keywords = {
    "Real estate and rental and leasing": [
        "real estate", "property", "housing", "apartment", "condominium", "mortgage",
        "broker", "landlord", "tenant", "leasing", "rental", "commercial property",
        "residential property", "zoning", "realtor", "property management", "vacancy",
        "realty", "title deed", "appraisal", "real estate market", "land parcel",
        "property tax", "homeownership", "escrow", "foreclosure"
    ],
    "Government": [
        "federal", "state", "local government", "municipal", "regulation", "public policy",
        "legislation", "agency", "department", "minister", "congress", "parliament",
        "bureaucracy", "public sector", "administration", "civil service", "ordinance",
        "executive order", "governance", "commission", "cabinet", "policy making",
        "regulatory body", "ombudsman", "court", "constitution"
    ],
    "Manufacturing": [
        "factory", "plant", "assembly", "production", "supply chain", "automation",
        "machinery", "fabrication", "industrial", "engineering", "materials",
        "processing", "lean manufacturing", "manufacture", "equipment", "workers",
        "3D printing", "additive manufacturing", "mass production", "prototype",
        "assembly line", "tooling", "CNC", "robotics", "inventory control"
    ],
    "Professional, scientific, and technical services": [
        "consulting", "advisory", "engineering services", "research", "analytics",
        "scientific", "laboratory", "testing", "architecture", "design", "professional",
        "technical", "legal services", "accounting", "IT services", "data science",
        "specialist", "expertise", "surveying", "audit", "compliance",
        "intellectual property", "forensics", "R&D", "innovation", "biotech"
    ],
    "Health care and social assistance": [
        "hospital", "clinic", "doctor", "nurse", "patient", "healthcare",
        "pharmaceutical", "therapy", "treatment", "emergency", "surgery", "vaccine",
        "insurance claim", "social services", "mental health", "elder care", "wellness",
        "rehabilitation", "primary care", "telemedicine", "nursing home", "public health",
        "diagnostic", "clinical trials", "medical device", "nutrition"
    ],
    "Finance and insurance": [
        "banking", "fintech", "investment", "insurance", "mortgage", "stock market",
        "hedge fund", "credit", "loan", "equity", "derivatives", "risk management",
        "retirement", "asset management", "portfolio", "reinsurance", "capital",
        "underwriting", "mutual fund", "wealth management", "securities", "treasury",
        "bond", "cryptocurrency", 'bitcoin', 'crytpo', "financial regulation", "trading"
    ],
    "Retail trade": [
        "retail", "store", "shop", "e-commerce", "customer", "merchandise",
        "sales", "inventory", "supply chain", "fashion", "groceries", "discount",
        "department store", "mall", "point of sale", "consumer goods", "foot traffic",
        "franchise", "checkout", "brand", "retail chain", "promotions", "online store",
        "returns", "shopping cart", "catalog"
    ],
    "Wholesale trade": [
        "wholesale", "distributor", "bulk", "supply", "inventory", "logistics",
        "B2B", "sourcing", "commodity", "warehousing", "retailer", "procurement",
        "supply chain", "distribution center", "trade partner", "supply agreement",
        "import", "export", "wholesale pricing", "stockist", "wholesale supplier",
        "shipment", "trade fair", "supply broker", "order fulfillment"
    ],
    "Information (media, telecom, publishing)": [
        "media", "telecommunications", "publishing", "broadcasting", "network",
        "internet provider", "streaming", "telecom", "digital media", "wireless",
        "social media", "online platform", "satellite", "ISP", "television",
        "journalism", "radio", "press", "content distribution", "newspaper",
        "broadcaster", "mobile carrier", "subscriber", "news outlet", "print media"
    ],
    "Software and data": [
        "software", "cloud computing", "data", "cybersecurity", "IT", "digital",
        "search engine", "AI", "machine learning", "data analytics", "data science",
        "big data", "database", "API", "blockchain", "natural language processing",
        "SaaS", "platform", "coding", "algorithm", "developer", "open source",
        "data warehouse", "artificial intelligence", "predictive analytics"
    ],    
    "Construction": [
        "construction", "infrastructure", "contractor", "builder", "architecture",
        "civil engineering", "project management", "materials", "cement", "steel",
        "renovation", "residential", "commercial construction", "bridge", "road",
        "blueprint", "skyscraper", "foundation", "scaffolding", "crane", "hard hat",
        "general contractor", "permit", "urban development", "construction site"
    ],
    "Transportation and warehousing": [
        "transportation", "vehicle", "traffic", "logistics", "shipping", "freight", "supply chain",
        "distribution", "warehouse", "rail", "aviation", "trucking", "cargo",
        "delivery", "fleet", "scheduling", "port", "supply route", "air cargo", "courier"
    ],
    "Accommodation and food services": [
        "hotel", "restaurant", "hospitality", "lodging", "resort", "catering",
        "travel", "tourism", "fast food", "fine dining", "bed and breakfast",
        "chef", "bar", "café", "booking", "guest", "hospitality industry",
        "banquet", "room service", "hospitality management", "franchise restaurant",
        "menu", "reservation", "culinary", "hotel chain", "housekeeping"
    ],
    "Management of companies and enterprises": [
        "holding company", "corporate", "enterprise", "conglomerate", "management",
        "parent company", "subsidiary", "board of directors", "executive", "leadership",
        "strategy", "organizational", "CEO", "M&A", "corporate governance", "business unit",
        "CFO", "shareholders", "management consulting", "corporate strategy", "joint venture",
        "operating company", "group structure", "chairperson", "divestiture"
    ],
    "Utilities": [
        "electricity", "power", "grid", "water", "sewage", "natural gas",
        "renewable energy", "utility company", "hydropower", "infrastructure",
        "distribution", "pipeline", "waste management", "energy supply",
        "nuclear", "electric utility", "solar power", "wind energy", "substation",
        "ratepayer", "energy efficiency", "load management", "smart grid",
        "generation plant", "transmission line"
    ],
    "Arts, entertainment, and recreation": [
        "arts", "entertainment", "museum", "theater", "cinema", "music",
        "concert", "recreation", "festival", "sports", "amusement", "gaming",
        "culture", "exhibition", "performance", "broadcast", "leisure", "tourism",
        "gallery", "dance", "opera", "theme park", "artist", "show", "spectator", 
        "player", "game"

    ],
    "Educational services": [
        "school", "university", "college", "student", "teacher", "curriculum",
        "learning", "classroom", "education", "training", "course", "degree",
        "academic", "instruction", "tutoring", "literacy", "online learning",
        "seminar", "lecture", "exam", "scholarship", "pedagogy", "MOOC",
        "graduate", "faculty"
    ],
    "Agriculture, forestry, fishing, and hunting": [
        "farming", "crops", "livestock", "agriculture", "harvest", "soil",
        "irrigation", "forestry", "logging", "timber", "fishing", "aquaculture",
        "ranching", "dairy", "horticulture", "sustainable farming", "hunting",
        "poultry", "agribusiness", "fisheries", "tractor", "plantation",
        "cattle", "seed", "barn", "organic farming"
    ]
}

In [18]:
# Define embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [19]:
# Industry embeddings
industry_vectors = {}

for industry, keywords in industry_keywords.items():
    embeddings = model.encode(keywords)
    industry_vectors[industry] = np.float64(np.mean(embeddings, axis=0))

In [28]:
# Topic embeddings
topic_vectors = {}

for topic, keywords in topics_dict.items():
    words = [w for w, _ in keywords if w not in ['ai', 'artificial', 'intelligence']][:5]
    weights = np.array([s for w, s in keywords if w not in ['ai', 'artificial', 'intelligence']][:5])
    embeddings = model.encode(words)
    topic_vectors[topic] = np.average(embeddings, axis = 0, weights=weights)


In [32]:
# Get cosine similarity of industry and topic embeddings
sims = {}
for topic, tv in topic_vectors.items():
    sims[topic] = {ind: util.cos_sim(tv, iv).item() for ind, iv in industry_vectors.items()}
    sims[topic] = dict(sorted(sims[topic].items(), key=lambda x: x[1], reverse=True))

In [33]:
# Map the topics to industries - exclude topics that where no industry has a cosine similarity greater than .6
map = {topic:max(industry, key=industry.get) for topic, industry in sims.items() if max(industry.values()) >= .6}

print(f'Number of topics: {len(topics_dict.keys())}')
print(f'Number of topics assigned: {len(map.values())}')

Number of topics: 121
Number of topics assigned: 69


In [44]:
# Inspect topic/industry map
for topic, industry in map.items():
    print(f'Topic: {topic}')
    print(f'Number of articles: {len(df[df.topic==topic])}')
    print(f'Industry: {industry}')
    print([w[0] for w in topics_dict[topic] if w not in ['ai', 'artificial', 'intelligence']][:5])
    print('\n')

Topic: -1
Number of articles: 81862
Industry: Software and data
['ai', 'data', 'new', 'technology', 'said']


Topic: 1
Number of articles: 5851
Industry: Health care and social assistance
['health', 'healthcare', 'medical', 'patients', 'clinical']


Topic: 2
Number of articles: 3130
Industry: Finance and insurance
['days', 'ai', 'share', 'news', 'stock']


Topic: 3
Number of articles: 2413
Industry: Arts, entertainment, and recreation
['art', 'images', 'image', 'video', 'ai']


Topic: 4
Number of articles: 1863
Industry: Educational services
['students', 'education', 'teachers', 'chatgpt', 'student']


Topic: 6
Number of articles: 1606
Industry: Arts, entertainment, and recreation
['music', 'song', 'artists', 'beatles', 'songs']


Topic: 7
Number of articles: 1450
Industry: Arts, entertainment, and recreation
['republic', 'sports', 'nfl', 'game', 'players']


Topic: 8
Number of articles: 1211
Industry: Finance and insurance
['trading', 'blockchain', 'crypto', 'decentralized', 'bitcoin'

In [70]:
# Identify topics to drop
topics_drop = [
    0, # Descriptions of AI generated images
    49 # Child sexual abuse content
    ]

df['to_drop'] = np.where(df['topic'].isin(topics_drop), True, False)

In [54]:
# Identify changes to the map
topics_change = {
    -1:None,
    22:'Government', 
    28:'Government', 
    56:'Government',
    74:'Retail Trade',
    83:None,
    86:None
    }

# Update map
for t, i in topics_change.items():
    map[t] = i

In [75]:
# Assign industry to articles
df['industry'] = df['topic'].map(map)
df = df[df.to_drop == False]

print(f'Number of articles by industry:\n\n{df.groupby("industry")["industry"].count().sort_values(ascending=False)}')
print(f'\nNumber of articles assigned to an industry: {len(df) - df["industry"].isnull().sum()}')
print(f'Percentage of articles assigned to an industry: {100*(len(df) - df["industry"].isnull().sum())/len(df):.2f}')

Number of articles by industry:

industry
Finance and insurance                               7945
Arts, entertainment, and recreation                 7220
Software and data                                   6688
Health care and social assistance                   5851
Government                                          4913
Information (media, telecom, publishing)            3031
Professional, scientific, and technical services    2875
Educational services                                1863
Retail trade                                        1861
Transportation and warehousing                      1574
Utilities                                            649
Accommodation and food services                      629
Management of companies and enterprises              357
Retail Trade                                         326
Agriculture, forestry, fishing, and hunting          290
Wholesale trade                                      235
Real estate and rental and leasing            

In [77]:
df.to_parquet('output_data/df_industry.parquet')