## LDA topic modeling 

Question: 
 > How did sustainability language shift from Brundtland keywords toward sector-specific keywords after 2015? 

We perform LDA to obtain: <br>
 > Topic distributions per document <br>
 > Keyword representation of topics <br>
 > Topic prevalence across subsets <br>

Our steps: 

 > 1. Discover latent topics; for this we run LDA on all documents
 > 2. Interpret the topics and label them 
 > 3. Compare topic prevalence; for this we compute the average topic proportion by period (pre vs. post 2015) and sector group 

In [None]:
# pip install polars
# pip install pyarrow
# pip install scikit-learn

In [7]:
import polars as pl
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
# load df_en
df_en = pl.read_csv("df_en.csv")

In [37]:
# extract the text from the clean text column as list for vectorization
texts = df_en["text_clean"].to_list()

In [None]:
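# Build the document-term matrix (word counts per document).
# See the documentation for details: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# max_df=0.9 -> ignore words that appear in more than 90% of the documents
# min_df=10  -> ignore words that appear in fewer than 10 documents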
vectorizer = CountVectorizer(
    max_df=0.9,
    min_df=10,
    stop_words="english"
)

dtm = vectorizer.fit_transform(texts)

In [None]:
# Fit the LDA model using scikit-learn's built-in implementation.
# We have to choose n_components (the number of topics) ourselves.

lda = LatentDirichletAllocation(
    n_components=12,
    random_state=42 # reproducibility 
)

doc_topics = lda.fit_transform(dtm)

In [39]:
# Fit again with n_components=6. This overwrites lda and doc_topics,
# so everything below uses the 6-topic model.

lda = LatentDirichletAllocation(
    n_components=6,
    random_state=42 # reproducibility 
)

doc_topics = lda.fit_transform(dtm)
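
One possible way to compare candidate topic counts, beyond reading the keyword lists, is the model's perplexity. The sketch below is a minimal illustration on a made-up corpus. The candidate values `(2, 3, 4)` and the toy documents are assumptions, and this is not necessarily how 12 and 6 were chosen above.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus, for illustration only
toy_docs = [
    "emissions climate energy carbon",
    "energy carbon emissions targets",
    "board financial assets audit",
    "financial audit board shareholders",
    "safety employees training health",
    "health safety training employees",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)

scores = {}
for k in (2, 3, 4):  # hypothetical candidate topic counts
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(toy_dtm)
    scores[k] = model.perplexity(toy_dtm)  # lower is better

for k, score in scores.items():
    print(f"n_components={k}: perplexity={score:.1f}")
```

Perplexity measured on the training documents tends to favour more topics, so scoring a held-out subset and checking whether the topics can be labelled are usually more informative than the raw number.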

We merge the topic proportions per document with the reports in df_en:

In [40]:
topic_cols = [f"topic_{i}" for i in range(doc_topics.shape[1])]

topic_df = pl.DataFrame(doc_topics, schema=topic_cols)

df_en = df_en.with_columns(topic_df)

In [41]:
df_en

filename,company_name,year,Name,Year,file,Organization_type,Size,Sector,Sec_SASB,Country,Region,OECD,english_non_english,file_full_name,file_empty,Group,text,text_clean,clean_len,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5
str,str,i64,str,i64,str,str,str,str,str,str,str,str,str,str,bool,str,str,str,i64,f64,f64,f64,f64,f64,f64
"""Anasoft_2015.txt""","""Anasoft""",2015,"""Anasoft""",2015,"""Anasoft_2015""","""Private company""","""SME""","""Technology Hardware""","""Technology and Communications""","""Slovak Republic""","""Europe""",,"""english""","""Anasoft_2015.txt""",false,"""Consumer""","""sustainable development  …","""sustainable development cop re…",42754,0.32581,0.079365,0.555149,0.000055,0.034082,0.005538
"""AECOM_2016.txt""","""AECOM""",2016,"""AECOM""",2016,"""AECOM_2016""","""Private company""","""Large""","""Other""","""Other""","""United States of America""","""Northern America""","""No""","""english""","""AECOM_2016.txt""",false,"""Other""","""sustainability report 2015  …","""sustainability report deliveri…",42960,0.015873,0.000052,0.934657,0.000052,0.000052,0.049313
"""KimballOffice_2016.txt""","""KimballOffice""",2016,"""Kimball Office""",2016,"""KimballOffice_2016""","""Private company""","""Large""","""Consumer Durables""","""Consumer Goods""","""United States of America""","""Northern America""",,"""english""","""KimballOffice_2016.txt""",false,"""Other""",""" 2017  …","""corporate sustainability repor…",15577,0.000151,0.000151,0.961157,0.00015,0.03824,0.000151
"""CascadeEngineering_2016.txt""","""CascadeEngineering""",2016,"""Cascade Engineering""",2016,"""CascadeEngineering_2016""","""Private company""","""MNE""","""Other""","""Other""","""United States of America""","""Northern America""",,"""english""","""CascadeEngineering_2016.txt""",false,"""Other""",""" cascade engineering is compri…","""cascade engineering is compris…",36560,0.000061,0.000061,0.75818,0.000061,0.000061,0.241577
"""VodafoneQatar_2013.txt""","""VodafoneQatar""",2013,"""Vodafone Qatar""",2013,"""VodafoneQatar_2013""","""Private company""","""Large""","""Telecommunications""","""Technology and Communications""","""Qatar""","""Asia""","""No""","""english""","""VodafoneQatar_2013.txt""",false,"""Consumer""",""" vodafone qatar csr report…","""vodafone qatar csr report abou…",14688,0.308383,0.00015,0.641667,0.00191,0.047739,0.00015
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""ADFIAP_2009.txt""","""ADFIAP""",2009,"""ADFIAP""",2009,"""ADFIAP_2009""",,"""SME""","""Non-Profit / Services""","""Non-Profit / Services""","""Philippines""","""Asia""",,"""english""","""ADFIAP_2009.txt""",false,"""Other""",""" …","""integrated annual and sustaina…",54134,0.000041,0.203641,0.439858,0.316892,0.038542,0.001026
"""SCJohnson_2014.txt""","""SCJohnson""",2014,"""SC Johnson""",2014,"""SCJohnson_2014""","""Private company""","""MNE""","""Household and Personal Product…","""Consumer Goods""","""United States of America""","""Northern America""",,"""english""","""SCJohnson_2014.txt""",false,"""Consumer""","""the choices we make sc johnson…","""the choices we make sc johnson…",76402,0.000028,0.030069,0.903767,0.000028,0.000028,0.066081
"""OrientOverseasInternational_20…","""OrientOverseasInternational""",2016,"""Orient Overseas International""",2016,"""OrientOverseasInternational_20…","""Private company""","""MNE""","""Logistics""","""Transportation""","""Hong Kong""","""Asia""","""No""","""english""","""OrientOverseasInternational_20…",false,"""Financial""","""going green: we take it person…","""going green we take it persona…",167059,0.206936,0.172536,0.459045,0.000013,0.078742,0.082728
"""AhlstromCorporation_2011.txt""","""AhlstromCorporation""",2011,"""Ahlstrom Corporation""",2011,"""AhlstromCorporation_2011""","""Private company""","""Large""","""Forest and Paper Products""","""Renewable Resources & Alternat…","""Finland""","""Europe""","""No""","""english""","""AhlstromCorporation_2011.txt""",false,"""Other""","""sustainability report 2010 …","""sustainability report ahlstrom…",57564,0.033067,0.00968,0.78986,0.00004,0.00004,0.167314


We print the 50 highest-weighted words for each topic: 

In [42]:

def get_top_words_transposed(model, feature_names, n_top_words=50):
    topic_dict = {}

    for topic_idx, topic in enumerate(model.components_):
        top_features = topic.argsort()[:-n_top_words - 1:-1]
        words = [feature_names[i] for i in top_features]

        # each topic becomes a column
        topic_dict[f"topic_{topic_idx}"] = words

    # the equal-length word lists become columns; row i holds the i-th ranked word of each topic
    df = pl.DataFrame(topic_dict)

    # optional: add rank index
    df = df.with_row_index("rank", offset=1)

    # move rank to first column
    df = df.select(["rank"] + [c for c in df.columns if c != "rank"])

    return df


feature_names = vectorizer.get_feature_names_out()
df_topics = get_top_words_transposed(lda, feature_names)

df_topics

rank,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5
u32,str,str,str,str,str,str
1,"""group""","""sustainability""","""sustainability""","""financial""","""group""","""power"""
2,"""gas""","""safety""","""emissions""","""group""","""csr""","""group"""
3,"""safety""","""product""","""safety""","""board""","""japan""","""fiscal"""
4,"""sustainability""","""india""","""program""","""assets""","""fy""","""electric"""
5,"""project""","""organization""","""waste""","""million""","""fiscal""","""sales"""
…,…,…,…,…,…,…
46,"""principles""","""production""","""supply""","""related""","""basic""","""cost"""
47,"""compliance""","""power""","""climate""","""asset""","""internal""","""plants"""
48,"""policy""","""regulations""","""paper""","""march""","""promoting""","""customers"""
49,"""mobile""","""emissions""","""strategy""","""meeting""","""hitachi""","""iino"""


In [None]:
df_topics.write_csv("topics_6.csv")

Above, we see some words that we could consider removing from the text, such as the names of companies and countries.

To get a better idea of the topics and to label them properly, we can look at the reports with the highest proportion of each topic:

In [44]:
import numpy as np

texts = df_en["text_clean"].to_list()
files = df_en["filename"].to_list()

for k in range(doc_topics.shape[1]):
    top_docs = np.argsort(doc_topics[:, k])[-5:]   # top 5 docs

    print(f"\n===== Topic {k} =====")

    for d in reversed(top_docs):   # highest first
        print(f"\nFILE: {files[int(d)]}")
        print(texts[int(d)][:400])


===== Topic 0 =====

FILE: TeollisuudenVoimaOyj(TVO)_2014.txt
corporate social responsibility report table of contents responsible leadership review by the ceo operating environment strategic objectives good corporate governance risk management management system company level policies code of conduct safety safety culture development special events research and development uranium from bedrock to bedrock procurement of uranium nuclear power plant olkiluoto a

FILE: TeollisuudenVoimaOyj(TVO)_2016.txt
well being with nuclear electricity corporate social responsibility report table of contents responsible leadership review by the ceo operating environment strategic goals responsibility program risk management management system business ethics communal influencer tvo an overview creation of jobs and benefits to national economy well being and employment economic impacts significant climate and en

FILE: TeollisuudenVoimaOyj(TVO)_2015.txt
corporate social responsibility report table of con