# ADSP 32018: Final Project
## Exploratory Data Analysis and Data Cleaning

Peyton Nash

### Project Description
In March of 2023, Goldman Sachs published a report, indicating that ~25% of the tasks in US and Europe can be automated using AI.  However, not all industries will be affected equally. According to the report, certain jobs, like office tasks, legal, architecture, and social sciences have a potential for 30%+ automation, while positions like construction, installation, and building maintenance are going to be largely unaffected.

In July of 2025, Microsoft published an in-depth studyLinks to an external site. based on 200,000 anonymized conversations with Microsoft Copilot, aiming to understand how generative AI is actually being used in the workplace and which professions are being most affected.

The researchers separated what users intended to do from what the AI actually delivered. They then mapped both to detailed job functions defined by O*NET. Using this framework, along with indicators of task success and coverage, they developed an “AI applicability score” for every occupation.

The findings are clear. Generative AI excels at tasks like information gathering, writing, and communication. It is already transforming knowledge and service-based roles. However, it has limited usefulness in jobs that rely on physical effort.
One of the most surprising insights? There’s little connection between AI’s impact and factors like income or education level. This challenges long-held assumptions about which roles are most at risk of disruption.

You can also find supporting evidence in the Facebook Research paper, which highlights Moravec’s Paradox. This thesis posits that the hardest problems in AI involve sensorimotor skills rather than abstract thought or reasoning. Notably, these findings coincide with predictions made by Goldman Sachs.

For this final project, I have prepared a collection of ~200K news articles on our favorite topics, data science, machine learning, and artificial intelligence. Your task is to identify what industries are going to be most impacted by AI over the next several years, based on the information/insights you can extract from this text corpus.

Your goal is to provide actionable recommendations on what can be done with AI to automate the jobs, improve employee productivity, and generally make AI adoption successful. Please pay attention to the introduction of novel technologies and algorithms, such as AI for image generation and Conversational AI, as they represent the entire paradigm shift in adoption of AI technologies and data science in general.

### Setup

In [2]:
%pip install pandas numpy pandarallel dotenv openai tldextract nltk bertopic transformers openai pyarrow ipywidgets

Collecting pandas
  Using cached pandas-2.3.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (91 kB)
Collecting numpy
  Using cached numpy-2.2.6-cp310-cp310-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting pandarallel
  Using cached pandarallel-1.6.5-py3-none-any.whl
Collecting dotenv
  Using cached dotenv-0.9.9-py2.py3-none-any.whl.metadata (279 bytes)
Collecting openai
  Using cached openai-1.99.9-py3-none-any.whl.metadata (29 kB)
Collecting tldextract
  Using cached tldextract-5.3.0-py3-none-any.whl.metadata (11 kB)
Collecting nltk
  Using cached nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting bertopic
  Using cached bertopic-0.17.3-py3-none-any.whl.metadata (24 kB)
Collecting transformers
  Using cached transformers-4.55.2-py3-none-any.whl.metadata (41 kB)
Collecting pyarrow
  Using cached pyarrow-21.0.0-cp310-cp310-macosx_12_0_arm64.whl.metadata (3.3 kB)
Collecting ipywidgets
  Using cached ipywidgets-8.1.7-py3-none-any.whl.metadata (2.4 kB)
Collecting pytz>=2020.1 (from pan

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import math
from dotenv import load_dotenv
import os

from pandarallel import pandarallel
import re
import random
from collections import Counter
import openai
import tldextract
from nltk.tokenize import sent_tokenize
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

In [2]:
# Read environment variables
load_dotenv()

True

In [3]:
# Load data
df = pd.read_parquet('https://storage.googleapis.com/msca-bdp-data-open/news_final_project/news_final_project.parquet', engine='pyarrow')

# Check dimensions
print(f'Number of articles: {df.shape[0]}')

Number of articles: 200760


### Initial Data Checking

In [4]:
# Check head
df.head()

Unnamed: 0,url,date,language,title,text
0,http://businessnewsthisweek.com/business/infog...,2023-05-20,en,Infogain AI Business Solutions Now Available i...,\n\nInfogain AI Business Solutions Now Availab...
1,http://www.huewire.com/how-you-should-validate...,2023-07-21,en,How You Should Validate Machine Learning Model...,\n\nHow You Should Validate Machine Learning M...
2,http://www.huewire.com/vise-intelligence-is-a-...,2023-09-29,en,Vise Intelligence is a new AI to assist — not ...,\n\nVise Intelligence is a new AI to assist — ...
3,https://abcnews.go.com/Technology/google-makes...,2024-06-03,en,Google makes adjustments to AI Overviews after...,\n\nGoogle makes adjustments to AI Overviews a...
4,https://betanews.com/2023/06/08/wordpress-ai-a...,2023-06-08,en,WordPress' AI Assistant can write blog posts...,\n\n\n WordPress' AI Assistant can write blog...


In [5]:
# Check language of articles
print(df.groupby('language')['language'].count())

language
en    200760
Name: language, dtype: int64


In [6]:
# Check scrape data of articles
print(f'Earliest scrape data: {df["date"].min()}')
print(f'Latest scrape data: {df["date"].max()}')

Earliest scrape data: 2022-01-01
Latest scrape data: 2025-07-22


In [7]:
# Check the number of sources
df['domain'] = df['url'].apply(lambda x: tldextract.extract(x).domain)
print(f'Number of unique domains: {df["domain"].unique().shape[0]}\n')

# Check most common sources
source_count = df.groupby('domain')['domain'].count().sort_values(ascending=False)
print(f'Most common sources:\n {source_count[:10]}\n')
print(f'Number of sources with more than 100 articles: {source_count[source_count>100].shape[0]}\n')
print(f'Percentage of articles published by sources with more than 100 articles: {100 * source_count[source_count>100].sum()/source_count.sum():.2f}%\n')

Number of unique domains: 5119

Most common sources:
 domain
rawpixel        8831
citylife        4040
menafn          3847
einpresswire    3637
indiatimes      3594
prnewswire      3393
nasdaq          2473
yahoo           1982
levels          1576
livemint        1479
Name: domain, dtype: int64

Number of sources with more than 100 articles: 394

Percentage of articles published by sources with more than 100 articles: 70.60%



### Initial Data Cleaning

After inspecting articles for the 20 most common domains, some are irrelevant to the task:
- _rawpixel_ contains short descriptions of AI generated images
- _levels_ contains salaries for data scientist roles
- _mexc_ contains stock prices for AI companies

Those are removed before processing.



In [8]:
# Remove large irrelevant domains
domain_drop = ['rawpixel', 'levels', 'mexc']
df = df[~df['domain'].isin(domain_drop)]

#### Remove Web-Scrape Remnants

In [9]:
# Define function to keep lines with multiple complete sentences - heuristic based
def get_body_heur(text: str):
    # Split into paragraphs and remove empty text
    text_split = [line.strip() for line in re.split(r'\n', text) if line.strip() != '']

    # Identify lines with three or more complete sentences
    n_sent = [len(sent_tokenize(line, language='english')) for line in text_split]

    # Remove lines with 3 or fewer sentences
    text_body = [x for x, y in zip(text_split, n_sent) if y >= 2]

    # Remove repeated lines
    text_body = [k for k, v in dict(Counter(text_body)).items() if v == 1]

    return text_body

In [10]:
# Apply the heuristic for identifying body text to the data
pandarallel.initialize(progress_bar=True)

df['text_clean_list'] = df['text'].parallel_apply(get_body_heur)
df['text_clean'] = df['text_clean_list'].parallel_apply(lambda x: ' '.join(x))

INFO: Pandarallel will run on 11 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=17206), Label(value='0 / 17206')))…

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=17206), Label(value='0 / 17206')))…

In [11]:
# Remove observations with no text
df = df[df['text_clean'] != '']

print(f'Length after removing empty text: {len(df)}')

Length after removing empty text: 186636


In [12]:
# Define a function to clean the text
def clean_text(text: str, domain: str):
    # Replace newlines, tabs, carriage returns with a space
    text = re.sub(r'[\n\r\t]+', ' ', text)

    # Remove URLs
    text = re.sub(r'http[s]?://\S+', ' ', text)
    text = re.sub(r'\bwww\.\S+', ' ', text)

    # Remove emails
    text = re.sub(r'\b\S+@\S+\.\S+\b', ' ', text)

    # Remove social media handles and hashtags
    text = re.sub(r'@\w+', ' ', text)
    text = re.sub(r'#\w+', ' ', text)

    # Remove bracketed content like [text]
    text = re.sub(r'\[[^\]]*\]', ' ', text)

    # Remove most punctuation except intra-word hyphens/apostrophes
    text = re.sub(r"[^\w\s'\-$%.,:;/]", ' ', text)

    # Remove domain names
    domain_pattern = r'\b' + domain + r'\b'
    text = re.sub(domain_pattern, ' ', text, flags=re.IGNORECASE)

    # Collapse multiple spaces into one and strip
    text = re.sub(r"\s+", " ", text).strip()

    # Remove 3+ repeated words in a row
    text = re.sub(r"\b(\w+)( \1\b){2,}", r"\1", text)


    # Normalize quotes and dashes
    text = text.replace('“', '\"').replace('”', '\"')
    text = text.replace('‘', ''').replace('’', ''')
    text = text.replace('–', '-').replace('—', '-')

    # Remove non-breaking spaces and other Unicode oddities
    text = re.sub(r'[\u200b\u200e\xa0]', ' ', text)

    return text

In [13]:
# Clean the text
pandarallel.initialize(progress_bar=True)

df['text_clean'] = df.parallel_apply(lambda x: clean_text(x['text_clean'], x['domain']), axis=1)

INFO: Pandarallel will run on 11 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=16967), Label(value='0 / 16967')))…

In [14]:
# Get the length of the cleaned text
pandarallel.initialize(progress_bar=True)
df['text_length'] = df['text_clean'].parallel_apply(lambda x: len(x))

print(f'Length range: {df["text_length"].min()} to {df["text_length"].max()}')

# Remove the top and bottom five percent longest and shortest texts
pctl = np.percentile(df['text_length'], [.10, .95])
df = df[(df['text_length']<=pctl[0]) | (df['text_length']>=pctl[1])]

print(f'Length after removing long and short articles: {len(df)}')
print(f'Length range after removing long and short articles: {np.floor(pctl[0])} to {np.ceil(pctl[1])}')

INFO: Pandarallel will run on 11 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=16967), Label(value='0 / 16967')))…

Length range: 4 to 245003
Length after removing long and short articles: 185069
Length range after removing long and short articles: 38.0 to 158.0


In [15]:
# Write to CSV
df.to_parquet('output_data/clean1.parquet')

In [16]:
df.drop(['language', 'domain', 'text_clean_list', 'text_length'], axis=1, inplace=True)

### BERTopic Modeling

In [17]:
# Create a dictionary of keywords for each industry
industry_keywords = {
    "Real estate and rental and leasing": [
        "real estate", "property", "housing", "apartment", "condominium", "mortgage",
        "broker", "landlord", "tenant", "leasing", "rental", "commercial property",
        "residential property", "zoning", "realtor", "property management", "vacancy",
        "realty", "title deed", "appraisal", "real estate market", "land parcel",
        "property tax", "homeownership", "escrow", "foreclosure"
    ],
    "Government": [
        "federal", "state", "local government", "municipal", "regulation", "public policy",
        "legislation", "agency", "department", "minister", "congress", "parliament",
        "bureaucracy", "public sector", "administration", "civil service", "ordinance",
        "executive order", "governance", "commission", "cabinet", "policy making",
        "regulatory body", "ombudsman", "court", "constitution"
    ],
    "Manufacturing": [
        "factory", "plant", "assembly", "production", "supply chain", "automation",
        "machinery", "fabrication", "industrial", "engineering", "materials",
        "processing", "lean manufacturing", "manufacture", "equipment", "workers",
        "3D printing", "additive manufacturing", "mass production", "prototype",
        "assembly line", "tooling", "CNC", "robotics", "inventory control"
    ],
    "Professional, scientific, and technical services": [
        "consulting", "advisory", "engineering services", "research", "analytics",
        "scientific", "laboratory", "testing", "architecture", "design", "professional",
        "technical", "legal services", "accounting", "IT services", "data science",
        "specialist", "expertise", "surveying", "audit", "compliance",
        "intellectual property", "forensics", "R&D", "innovation", "biotech services",
        "software development", "programming", "cloud services", "cybersecurity",
        "artificial intelligence", "machine learning", "big data", "blockchain",
        "IT consulting", "digital transformation", "systems integration",
        "DevOps", "SaaS", "technology consulting", "automation",
        "data engineering", "information systems", "enterprise software",
        "platform engineering", "technology solutions"
    ],
    "Health care and social assistance": [
        "hospital", "clinic", "doctor", "nurse", "patient", "healthcare",
        "pharmaceutical", "therapy", "treatment", "emergency", "surgery", "vaccine",
        "insurance claim", "social services", "mental health", "elder care", "wellness",
        "rehabilitation", "primary care", "telemedicine", "nursing home", "public health",
        "diagnostic", "clinical trials", "medical device", "nutrition"
    ],
    "Finance and insurance": [
        "banking", "fintech", "investment", "insurance", "mortgage", "stock market",
        "hedge fund", "credit", "loan", "equity", "derivatives", "risk management",
        "retirement", "asset management", "portfolio", "reinsurance", "capital",
        "underwriting", "mutual fund", "wealth management", "securities", "treasury",
        "bond", "cryptocurrency", "financial regulation", "trading"
    ],
    "Retail trade": [
        "retail", "store", "shop", "e-commerce", "customer", "merchandise",
        "sales", "inventory", "supply chain", "fashion", "groceries", "discount",
        "department store", "mall", "point of sale", "consumer goods", "foot traffic",
        "franchise", "checkout", "brand", "retail chain", "promotions", "online store",
        "returns", "shopping cart", "catalog"
    ],
    "Wholesale trade": [
        "wholesale", "distributor", "bulk", "supply", "inventory", "logistics",
        "B2B", "sourcing", "commodity", "warehousing", "retailer", "procurement",
        "supply chain", "distribution center", "trade partner", "supply agreement",
        "import", "export", "wholesale pricing", "stockist", "wholesale supplier",
        "shipment", "trade fair", "supply broker", "order fulfillment"
    ],
    "Information": [
        "media", "telecommunications", "IT", "software", "cloud computing", "data",
        "publishing", "broadcasting", "cybersecurity", "network", "internet",
        "streaming", "telecom", "information technology", "digital", "search engine",
        "AI", "content", "journalism", "data analytics", "digital media", "wireless",
        "social media", "online platform", "satellite", "ISP"
    ],
    "Construction": [
        "construction", "infrastructure", "contractor", "builder", "architecture",
        "civil engineering", "project management", "materials", "cement", "steel",
        "renovation", "residential", "commercial construction", "bridge", "road",
        "blueprint", "skyscraper", "foundation", "scaffolding", "crane", "hard hat",
        "general contractor", "permit", "urban development", "construction site"
    ],
    "Transportation and warehousing": [
        "transportation", "vehicle", "traffic", "logistics", "shipping", "freight", "supply chain",
        "distribution", "warehouse", "rail", "aviation", "trucking", "cargo",
        "delivery", "fleet", "scheduling", "port", "supply route", "air cargo", "courier"
    ],
    "Accommodation and food services": [
        "hotel", "restaurant", "hospitality", "lodging", "resort", "catering",
        "travel", "tourism", "fast food", "fine dining", "bed and breakfast",
        "chef", "bar", "café", "booking", "guest", "hospitality industry",
        "banquet", "room service", "hospitality management", "franchise restaurant",
        "menu", "reservation", "culinary", "hotel chain", "housekeeping"
    ],
    "Management of companies and enterprises": [
        "holding company", "corporate", "enterprise", "conglomerate", "management",
        "parent company", "subsidiary", "board of directors", "executive", "leadership",
        "strategy", "organizational", "CEO", "M&A", "corporate governance", "business unit",
        "CFO", "shareholders", "management consulting", "corporate strategy", "joint venture",
        "operating company", "group structure", "chairperson", "divestiture"
    ],
    "Utilities": [
        "electricity", "power", "grid", "water", "sewage", "natural gas",
        "renewable energy", "utility company", "hydropower", "infrastructure",
        "distribution", "pipeline", "waste management", "energy supply",
        "nuclear", "electric utility", "solar power", "wind energy", "substation",
        "ratepayer", "energy efficiency", "load management", "smart grid",
        "generation plant", "transmission line"
    ],
    "Mining": [
        "mining", "ore", "coal", "gold", "silver", "iron", "extraction",
        "drilling", "quarry", "minerals", "oil", "gas", "petroleum", "refinery",
        "geology", "resource", "exploration", "metals", "natural resources",
        "copper", "nickel", "rare earth", "smelting", "prospecting", "tailings"
    ],
    "Arts, entertainment, and recreation": [
        "arts", "entertainment", "museum", "theater", "cinema", "music",
        "concert", "recreation", "festival", "sports", "amusement", "gaming",
        "culture", "exhibition", "performance", "broadcast", "leisure", "tourism",
        "gallery", "dance", "opera", "theme park", "artist", "show", "spectator", 
        "player", "game"

    ],
    "Educational services": [
        "school", "university", "college", "student", "teacher", "curriculum",
        "learning", "classroom", "education", "training", "course", "degree",
        "academic", "instruction", "tutoring", "literacy", "online learning",
        "seminar", "lecture", "exam", "scholarship", "pedagogy", "MOOC",
        "graduate", "faculty"
    ],
    "Agriculture, forestry, fishing, and hunting": [
        "farming", "crops", "livestock", "agriculture", "harvest", "soil",
        "irrigation", "forestry", "logging", "timber", "fishing", "aquaculture",
        "ranching", "dairy", "horticulture", "sustainable farming", "hunting",
        "poultry", "agribusiness", "fisheries", "tractor", "plantation",
        "cattle", "seed", "barn", "organic farming"
    ]
}

In [39]:
%pip install hdbscan umap

Collecting umap
  Downloading umap-0.1.1.tar.gz (3.2 kB)
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Building wheels for collected packages: umap
  Building wheel for umap (pyproject.toml) ... [?25ldone
[?25h  Created wheel for umap: filename=umap-0.1.1-py3-none-any.whl size=3578 sha256=72c32780b85d3b541583609caf49d0b1ab3d29a4f26d1249dad82d509d37f9ca
  Stored in directory: /Users/peytonnash/Library/Caches/pip/wheels/15/f1/28/53dcf7a309118ed35d810a5f9cb995217800f3f269ab5771cb
Successfully built umap
Installing collected packages: umap
Successfully installed umap-0.1.1
Note: you may need to restart the kernel to use updated packages.


In [57]:
from hdbscan import HDBSCAN
hdbscan_model = HDBSCAN(
    min_cluster_size=7,
    min_samples= 7,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True
)

In [54]:
from umap import UMAP
umap_model = UMAP(
    n_neighbors=15,
    n_components=7,
    min_dist=0.0,
    metric='cosine'
)

In [None]:
# Topic modeling with BERTopic
vectorizer_model = CountVectorizer(min_df=.001, stop_words='english')
seed_topic_list = [value for value in industry_keywords.values()]

topic_model = BERTopic(language="english", 
                       #min_topic_size=200, 
                       calculate_probabilities=True, 
                       verbose=True, 
                       vectorizer_model=vectorizer_model,
                       #umap_model=umap_model
                       hdbscan_model=hdbscan_model
                       #seed_topic_list = seed_topic_list
                       )
topics, probs = topic_model.fit_transform(df["text_clean"][:50000])

2025-08-16 22:09:33,411 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/1563 [00:00<?, ?it/s]

2025-08-16 23:25:22,220 - BERTopic - Embedding - Completed ✓
2025-08-16 23:25:22,221 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-08-16 23:25:57,419 - BERTopic - Dimensionality - Completed ✓
2025-08-16 23:25:57,421 - BERTopic - Cluster - Start clustering the reduced embeddings


In [56]:
Counter(topics)[-1]

11364

- Standard 50,000: 9,747
    - min_topic_size = 0
    - Count Vectorizer:
        - min_df = .001
- Seed 50,000: 10,813
    - min_topic_size = 0
    - seed_topic_list = seed_topic_list
    - Count Vectorizer:
        - min_df = .001
- HDBSCAN 50,000: 10,518
    - min_topic_size = 0
    - HDBSCAN
        - min_cluster_size=10
        - metric='euclidean'
        - cluster_selection_method='eom'
        - prediction_data=True
    - Count Vectorizer:
        - min_df = .001
- HDBSCAN 50,000: 
    - min_topic_size = 0
    - HDBSCAN
        - min_cluster_size=7
        - min_sample_size = 7
        - metric='euclidean'
        - cluster_selection_method='eom'
        - prediction_data=True
    - Count Vectorizer:
        - min_df = .001

In [19]:
# Save outputs
pd.DataFrame(np.column_stack((probs, np.array(topics)))).to_parquet('output_data/bert_results.parquet')

In [20]:
# Get topics
topics_dict = topic_model.get_topics()
print(f'Number of topics: {len(topics_dict)}')

pd.DataFrame([[k, str(v)] for k, v in topics_dict.items()]).to_parquet('output_data/bert_dict.parquet')

Number of topics: 127


In [None]:
bert_results = pd.read_parquet('output_data/bert_results.parquet')
topics = [int(item) for item in list(bert_results.iloc[:, -1])]

In [21]:
print(f'Number of articles: {len(df)}')
print(f'Number of topic assignments: {len(topics)}')

Number of articles: 185069
Number of topic assignments: 185069


In [23]:
# Create topic column
df['topic'] = topics

# Get counts by topic
print(df.groupby('topic')['topic'].count())

topic
-1      88868
 0       7616
 1       6708
 2       6370
 3       4302
        ...  
 121      210
 122      206
 123      205
 124      203
 125      203
Name: topic, Length: 127, dtype: int64


In [24]:
# Check the topics
topics_dict

{-1: [('ai', np.float64(0.0064943594947080106)),
  ('data', np.float64(0.005388077811823435)),
  ('media', np.float64(0.0049381041310344095)),
  ('new', np.float64(0.004668527945325022)),
  ('said', np.float64(0.004531652396939147)),
  ('content', np.float64(0.004470881351400185)),
  ('technology', np.float64(0.004279237324707398)),
  ('news', np.float64(0.0038857214321201547)),
  ('gray', np.float64(0.003827647381824978)),
  ('group', np.float64(0.003733102737705585))],
 0: [('agents', np.float64(0.01204407754217131)),
  ('data', np.float64(0.011389159289084608)),
  ('customer', np.float64(0.011351394969612828)),
  ('ai', np.float64(0.010392532321919857)),
  ('business', np.float64(0.0075823207540304995)),
  ('agent', np.float64(0.006051224520930516)),
  ('customers', np.float64(0.005868634894547928)),
  ('businesses', np.float64(0.005800063215141899)),
  ('generative', np.float64(0.005765631339945376)),
  ('automation', np.float64(0.005411907525420208))],
 1: [('healthcare', np.float

### Filter Irrelevant Topics

In [26]:
# Create list of topics that do not pertain to the task
topic_drop = [-1]

# Remove rows with irrelevant topics
df = df[~df.topic.isin(topic_drop)]

print(f'Length after removing irrelevant topics: {len(df)}')

Length after removing irrelevant topics: 96201


### Assign Articles to Industries

Industries are collected from the [BEA](https://www.bea.gov/sites/default/files/2025-06/gdp1q25-3rd.pdf). The industries are presented as both category total and sub-items. Whether the category or sub-item is included is based on a review of the topic keywords produced by BERTopic. The industry keyword dictionary was produced by ChatGPT.

In [27]:
# Define embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [28]:
# Industry embeddings
industry_vectors = {}

for industry, keywords in industry_keywords.items():
    embeddings = model.encode(keywords)
    industry_vectors[industry] = np.float64(np.mean(embeddings, axis=0))

In [29]:
# Topic embeddings
topic_vectors = {}

for topic, keywords in topics_dict.items():
    words = [w for w, s in keywords[:5]]
#    words = [w for w, _ in keywords[:10]]
    weights = np.array([s for _, s in keywords[:5]])
    embeddings = model.encode(words)
    topic_vectors[topic] = np.average(embeddings, axis = 0, weights=weights)


In [30]:
# Get cosine similarity of industry and topic embeddings
sims = {}
for topic, tv in topic_vectors.items():
    sims[topic] = {ind: util.cos_sim(tv, iv).item() for ind, iv in industry_vectors.items()}
    sims[topic] = dict(sorted(sims[topic].items(), key=lambda x: x[1], reverse=True))

In [31]:
# Map the topics to industries - exclude topics that where no industry has a cosine similarity greater than .6
map = {topic:max(industry, key=industry.get) for topic, industry in sims.items() if max(industry.values()) >= .6}

print(f'Number of topics: {len(topics_dict.keys())}')
print(f'Number of topics assigned: {len(map.values())}')

Number of topics: 127
Number of topics assigned: 72


In [32]:
# Assign industry to articles
df['industry'] = df['topic'].map(map)

print(f'Number of articles by industry:\n\n{df.groupby("industry")["industry"].count().sort_values(ascending=False)}')
print(f'\nNumber of articles assigned to an industry: {len(df) - df["industry"].isnull().sum()}')
print(f'Percentage of articles assigned to an industry: {100*(len(df) - df["industry"].isnull().sum())/len(df):.2f}')

Number of articles by industry:

industry
Finance and insurance                               15619
Information                                         13610
Arts, entertainment, and recreation                 12776
Professional, scientific, and technical services     8576
Health care and social assistance                    6993
Government                                           3288
Transportation and warehousing                       1823
Educational services                                 1578
Utilities                                            1020
Mining                                                656
Agriculture, forestry, fishing, and hunting           643
Manufacturing                                         487
Accommodation and food services                       428
Real estate and rental and leasing                    248
Name: industry, dtype: int64

Number of articles assigned to an industry: 67745
Percentage of articles assigned to an industry: 70.42


#### Create Training Data

In [None]:
# Prepare text to be labeled
random.seed(42)
idx_to_label = random.sample(range(0, len(df)), 1000)
to_label = [{'idx': idx, 'text': text, 'industry':industry} for idx, text, industry in zip(idx_to_label, df[['text_clean2', 'industry']][idx_to_label])]

print(f'Number of documents to train: {len(to_label)}')
print(f'\nNumber of training documents by industry:\n\n{Counter([item['industry'] for item in to_label])}')

In [None]:
# Label data using GPT 3.5
openai.api_key = os.environ['OPENAI_API_KEY']
gpt_model = 'gpt-5-nano'

# Create function to label articles
def label_text(industry, text):
    prompt = (
        f'''You are a sentiment annotator. Your task is to label a news article based on the overall sentiment it expresses toward the implementation of artificial intelligence. 
Focus only on the sentiment expressed about AI implementation.
Do not label based on unrelated topics in the article.

POSITIVE: The article suggests that AI adoption is likely to be successful
NEGATIVE: The article suggests that AI adoption is unlikely to be successful
NEUTRAL: The article has neither a strong positive or negative sentiment.

Text to label:
[{text}]

Return only one of the following labels: POSITIVE, NEGATIVE, or NEUTRAL.'''
        )
   
    response = openai.chat.completions.create(
        model=gpt_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    
    return response.choices[0].message.content

In [None]:
# Create counters
n_pos = 0
n_neg = 0
n_net = 0
n_format = 0
n_label = 0

# Create empty containers
labels = {}
misformat = []

for obs in to_label:
    label = label_text(obs['text']).toupper()

    # Add to label dictionary if the output is formatted correctly
    if label.isin(['POSITIVE', 'NEGATIVE', 'NEUTRAL']):

        labels[obs['idx']] = label
        n_label += 1

        # Add to class counters
        if label == 'POSITIVE':
            n_pos += 1
        if label == 'NEGATIVE':
            n_neg += 1
        if label == 'NEUTRAL':
            n_net += 1

        # End if there are more than 150 observations in each class
        if n_net > 150 & n_neg > 150 & n_net > 150:
            break
    
    # Add id to list of mislabelled
    else:
        n_format = 0
        misformat.append(obs['idx'])

    # End if more than 10% of observations are not returning the correct format
    if n_label > 20 & n_format/n_label > .1:
        break