# Topic Modelling with LDA
[Kaggle](https://www.kaggle.com/rcushen/topic-modelling-with-lsa-and-lda)

## Imports

In [428]:
import numpy as np
import pandas as pd
import spacy
# from IPython.display import display
# from tqdm import tqdm
import re
from collections import Counter
# import ast

import matplotlib.pyplot as plt
# import matplotlib.mlab as mlab
# import seaborn as sb

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# from textblob import TextBlob
# import scipy.stats as stats

from sklearn.decomposition import TruncatedSVD
# from sklearn.decomposition import LatentDirichletAllocation
# from sklearn.manifold import TSNE

# from bokeh.plotting import figure, output_file, show
# from bokeh.models import Label
# from bokeh.io import output_notebook
# output_notebook()

# %matplotlib inline

## Read in the data

In [429]:
reports = pd.read_csv('reports.csv', index_col=0)
# heidelbergcement has the wrong document language; assign through a single indexer
# so pandas does not raise a SettingWithCopyWarning for chained assignment
reports.iloc[9, reports.columns.get_loc('lang')] = 'deu'


In [430]:
# exclude German texts
reports = reports[reports['lang'] == 'eng'].reset_index(drop=True)
reports

Unnamed: 0,filepath,lang,text,number_of_pages
0,./reports/scraped/full/adidas-group.pdf,eng,pv p | py. t sus ds report off when working to...,84
1,./reports/scraped/full/eon.pdf,eng,slavery & human trafficking statement e.on's s...,3
2,./reports/scraped/full/siemens-energy.pdf,eng,siemens claleigen4 sustainability report 2020 ...,80
3,./reports/scraped/full/munichre.pdf,eng,corporate responsibility report 2020 munich re...,106
4,./reports/scraped/full/volkswagenag.pdf,eng,volkswagen aktiengesellschaft sustainability r...,97
5,./reports/scraped/full/new.siemens.pdf,eng,ure tareas tahielaerreherele 2020 a . 4 ie . c...,144
6,./reports/scraped/full/freseniusmedicalcare.pdf,eng,wes fresenius medical care non-financial group...,24
7,./reports/scraped/full/group.pdf,eng,a sustainable future. sustainability report 20...,139
8,./reports/scraped/full/fresenius.pdf,eng,fresenius 2020 annual report media hub group i...,338
9,./reports/scraped/full/allianz.pdf,eng,‘ic call rin for /a susta ne ble future | alli...,102


## Preprocess the text
- remove stopwords
- lemmatize

In [431]:
text = "performance manage- ment. more than 1500 managers have already started using this platform. our different programs for lead- ership development are based on regional requirements but with a focus on principles that apply globally. regarding the development of our compensation systems the management board decided on the implementation of a new global leadership bonus plan in 2020. according to this plan all senior executives are given a comparable mix of global busi- ness-specific and individual objectives. the aim is to improve the consistency alignment and fairness of our senior leader- ship targets and ensure recognition. the plan will be imple- mented in 2021. employee "
text = re.sub(r'([A-Za-z]+)- ', r'\1', text)   # rejoin words that were split at a line break
print(text)

performance management. more than 1500 managers have already started using this platform. our different programs for leadership development are based on regional requirements but with a focus on principles that apply globally. regarding the development of our compensation systems the management board decided on the implementation of a new global leadership bonus plan in 2020. according to this plan all senior executives are given a comparable mix of global business-specific and individual objectives. the aim is to improve the consistency alignment and fairness of our senior leadership targets and ensure recognition. the plan will be implemented in 2021. employee 
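The de-hyphenation step can be tried on its own. The following is a minimal sketch: the sample string and the helper name `dehyphenate` are made up for illustration. It uses the explicit character class `[A-Za-z]`, because the wider range `[A-z]` also matches a few punctuation characters that sit between `Z` and `a` in ASCII.

```python
import re

# Hypothetical sample; any OCR text with words split at line breaks would do.
sample = "manage- ment of lead- ership and busi- ness-specific goals"

def dehyphenate(text: str) -> str:
    """Join fragments separated by 'hyphen + space'; real hyphens stay untouched."""
    return re.sub(r'([A-Za-z]+)- ', r'\1', text)

print(dehyphenate(sample))
# management of leadership and business-specific goals

# [A-z] covers ASCII 65-122, which includes [ \ ] ^ _ ` as well as letters
print(re.findall(r'[A-z]', 'a_b^C'))      # ['a', '_', 'b', '^', 'C']
print(re.findall(r'[A-Za-z]', 'a_b^C'))   # ['a', 'b', 'C']
```

Note that the pattern only joins fragments when the hyphen is followed by a space, which is why `ness-specific` keeps its hyphen.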


### Text-specific corrections

In [432]:
# Tokens that would otherwise show up in the topic clusters; most of them are OCR errors
custom_stopwords = ['kgaa', 'qed', 'ssb', 'eri', 'magn', 'mica', 'ltir', 'cvcc', 'aoa', 'gcgc', 'nwow', 'sie', 'lcas',
                    'epds', 'unep', 'wef', 'gim', 'efpia', 'mbap', 'emea', 'ind', 'report', 're_cr-report', 'social6']

def remove_custom_stopwords(text, custom_stopwords):
    return ' '.join(filter(lambda x: x.lower() not in custom_stopwords, text.split()))

reports.text = reports.text.str.replace(r'([A-Za-z]+)- ', r'\1', regex=True)   # rejoin split words, e.g. "environ- ment"
# reports.text = reports.text.str.replace(r'[^0-9A-Za-z ]+', '', regex=True)   # replace special characters
# reports.text = reports.text.str.replace(r' \d{1,3} ', ' ', regex=True)       # replace 1- to 3-digit numbers
reports.text = reports.text.str.replace(r'\b\w{1,2} ', '', regex=True)         # remove words of length 1 or 2
# https://stackoverflow.com/questions/34305505/python-regex-remove-digits-except-years#34305766
reports.text = reports.text.str.replace(r'\b(?!(\D\S*|20[0-9]{2})\b)\S+\b', ' ', regex=True)   # remove tokens starting with a digit, except years 20xx
reports.text = reports.text.str.replace('siemen ', 'siemens ', regex=False)
reports.text = reports.text.str.replace('adida ', 'adidas ', regex=False)
reports.text = reports.text.str.replace('allianzs ', 'allianz ', regex=False)
reports.text = reports.text.map(lambda text: remove_custom_stopwords(text, custom_stopwords))

reports.loc[1, 'text']

ther assessments based e.on'human rights risk matrix refined 2019 enable even more structured approach assessing human rights risks e.on'supply chain. health safety and environment events will continue conducted throughout 2020 for e.employees and contractor representatives. the aim these events reinforce awareness the importance these topics e.both generally and for individual projects well design specific action plans for joint improvement initiatives related the products and services particular contractor subcontractor provides. the events also serve forum for sharing best practice and communicating e.on'standards and policies. e.on’continued commitment e.will continue review its policies and processes relation the prevention slavery and human trafficking its business and supply chain strengthening these where necessary ensure continued alignment with the act. e.will also continue train all employees and ensure compliance with its code conduct and will identify additional training n
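The year-preserving number filter from the cell above can be checked in isolation. This is a minimal sketch: the sample sentence and the helper name `drop_numbers` are hypothetical. The negative lookahead skips any token that starts with a non-digit or that is exactly a four-digit year of the form 20xx, so only the remaining digit-led tokens are replaced.

```python
import re

# Same pattern as in the cell above: match tokens that start with a digit
# unless the whole token is a year 20xx.
NUMERIC_TOKEN = re.compile(r'\b(?!(\D\S*|20[0-9]{2})\b)\S+\b')

def drop_numbers(text: str) -> str:
    """Replace digit-led tokens (except years 20xx) with a single space."""
    return NUMERIC_TOKEN.sub(' ', text)

# Hypothetical sample sentence
sample = "in 2020 more than 1500 managers joined and 84 pages cover 1999"
cleaned = drop_numbers(sample)

print(cleaned.split())
# ['in', '2020', 'more', 'than', 'managers', 'joined', 'and', 'pages', 'cover']
```

Years outside 2000-2099, such as `1999` here, are removed along with the other numbers.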

### Lemmatize, remove stopwords

In [433]:
nlp_en = spacy.load('en_core_web_sm')
nlp_en.max_length = 1300000
nlp_de = spacy.load('de_core_news_sm')
nlp_de.max_length = 1300000

# New stop words list 
customize_stop_words = [
    'attach'
]

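def lemmatize(nlp, text) -> str:
    """Return the space-joined lemmas of `text`, skipping stop words and punctuation."""
    # Mark the custom words as stop words in this pipeline's vocabulary
    for w in customize_stop_words:
        nlp.vocab[w].is_stop = True
    return " ".join(token.lemma_ for token in nlp(text)
                    if not token.is_stop and not token.is_punct)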

for index, row in reports.iterrows():
    if row.lang == 'eng':
        reports.loc[index, 'lemma'] = lemmatize(nlp_en, row.text)
    if row.lang == 'deu':
        reports.loc[index, 'lemma'] = lemmatize(nlp_de, row.text)

reports.lemma = reports.lemma.str.replace(' adida ', ' adidas ')
reports.lemma = reports.lemma.str.replace(' siemen ', ' siemens ')
reports.lemma = reports.lemma.str.replace('allianzs ', 'allianz ')

reports.lemma[0][:500]

'| py sus work toget reason call creator mployee partner consumer supplier important ney strive space eas creative force need improve mpany’sustainable effort gaede ceee gee lena mvaay kehw yma vine worker supply chain assistance supplier improve environmental performance develop chemical management minimise waste people positive imp world guide core belief tha sport power change team thousand keep mat effort call con people ceo statement empow people start listen empower worker supply chain sust'

## Create Term Frequency-Inverse Document Frequency (TFIDF) matrix

### Bigrams

In [440]:
number_of_topics = 20
# https://www.kaggle.com/munavar/latent-semantic-analysis-topic-modelling/#Topic-Modeling-insincere-questions
tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2))   # unigrams and bigrams; ngram_range must be a tuple
tfidf_matrix = tfidf_vectorizer.fit_transform(reports.lemma)
print(tfidf_matrix.shape)

(12, 206310)
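The two vectorizer settings used here, unigrams plus bigrams and sublinear term-frequency scaling, can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not the library's implementation: the helper names `ngrams` and `sublinear_tf` and the token list are hypothetical, and the inverse-document-frequency weighting and row normalization that `TfidfVectorizer` also applies are left out.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def sublinear_tf(tokens):
    """Map each n-gram to 1 + ln(raw count), the scaling that sublinear_tf=True requests."""
    counts = Counter(ngrams(tokens))
    return {term: 1 + math.log(count) for term, count in counts.items()}

# Hypothetical lemmatized snippet
tokens = "supply chain worker supply chain".split()

print(ngrams(tokens))
weights = sublinear_tf(tokens)
print(weights['supply chain'])   # 1 + ln(2), since the bigram occurs twice
print(weights['worker'])         # 1.0, since ln(1) = 0
```

Adding bigrams multiplies the vocabulary size, which is consistent with the very wide matrix reported above for only a handful of documents.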


In [441]:
truncated_svd = TruncatedSVD(n_components=number_of_topics, n_iter=10, random_state=42)
X = truncated_svd.fit_transform(tfidf_matrix)     # fit the model on the TF-IDF matrix and project each document onto the components

In [442]:
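def get_topics(components, feature_names, n=number_of_topics):
    """Return (label, top-n terms) for each component; n defaults to number_of_topics."""
    clusters = []
    for index, weights in enumerate(components):
        # argsort is ascending, so the reversed tail holds the n largest weights
        top_terms = [feature_names[i] for i in weights.argsort()[:-n - 1:-1]]
        clusters.append((f"Topic {index}", top_terms))
    return clusters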
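The slice `argsort()[:-n - 1:-1]` picks the indices of the `n` largest weights, highest first. A plain-Python equivalent may make that easier to see. This is a sketch only: the helper name `top_terms`, the vocabulary, and the weights are invented for illustration and are not taken from the fitted model. The order of tied weights may differ from NumPy's.

```python
def top_terms(weights, feature_names, n=3):
    """Return the n feature names with the largest weights, highest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [feature_names[i] for i in order[:n]]

# Hypothetical vocabulary and component weights
feature_names = ['allianz', 'dialysis', 'factory', 'reinsurance', 'supplier']
component = [0.61, -0.02, 0.05, 0.48, 0.11]

print(top_terms(component, feature_names))
# ['allianz', 'reinsurance', 'supplier']
```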

In [443]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
topic_clusters = get_topics(truncated_svd.components_, tfidf_vectorizer.get_feature_names_out())

Notes:

- fla: Fair Labor Association
- oifr: Occupational Illness Frequency Rate
- ekpi: Environmental KPI
- epra: European Public Real Estate Association
- pimco: global investment management firm
- meag: Munich Ergo Assetmanagement GmbH
- bcg: Business Conduct Guidelines
- dkv: Deutsche Krankenversicherung
- csb: Chemischer Sauerstoffbedarf (chemical oxygen demand)


In [446]:
for topic in topic_clusters[:number_of_topics // 2]:   # first half; the earlier "- 1" skipped one topic between the two halves
    print(f"{topic[0]}: {topic[1]}\n")

print('Notes:\nfla: Fair Labor Association\noifr: Occupational Illness Frequency Rate\nekpi: Environmental KPI\nepra: European Public Real Estate Association\npimco: global investment management firm\nbom: Board of Management\nmeag: Munich Ergo Assetmanagement Gmbh\nbcg: Business Conduct Guidelines\ndkv: Deutsche Krankenversicherung\ncsb: Chemischer Sauerstoffbedarf')

wear', 'parley ocean', 'fla', 'target evaluation', 'apparel', '2014 2016', 'adidas distribution', 'adidas office', 'ocean plastic', 'supplier factory', 'licensee factory', 'timeline progress', 'factory']

Topic 5: ['esg integration', 'underwriting', 'asset owner', 'datum performance', 'investment esg', 'allianz climate', 'sustainability operation', 'allianz sustainability', 'allianz', 'bom', 'reinsurance', 'introduction sustainability', '2019 introduction', 'investment portfolio', 'natural catastrophe', 'investment management', 'operation allianz', 'insurance solution', 'allianz group', 'disclosure datum']

Topic 6: ['volkswagen', 'volkswagen group', 'merck corporate', 'merck', 'strategy management', 'commercial vehicle', 'versum material', 'versum', 'plant process', 'responsibility 2019', 'darmstadt', 'passenger car', '2019 fact', 'tonne year', 'light commercial', 'life science', 'schistosomiasis', 'car light', 'intermolecular', 'material intermolecular']

Topic 7: ['rwe', 'rwe sustai

In [447]:
for topic in topic_clusters[round(number_of_topics/2):]:
    print(f"{topic[0]}: {topic[1]}\n")

Topic 10: ['group limit', 'management start', 'financial group', 'start page', 't3', 'care 2020', 'section start', 'fresenius medical', 'fresenius', 'dialysis', 'corporate risk', '2020 non', 'privacy program', 'page information', 'limit assurance', 'human labor', 'ethic business', 'patient experience', 'code ethic', 'labor right']

Topic 11: ['preserve nature', 'society indicator', 'nature contribution', 'resource people', 'practice resource', 'business preserve', 'siemens glance', 'annex siemens', 'information 2020', 'people society', 'glance governance', 'sustainability responsible', 'contribution sustainability', 'glance sustainability', 'manage board', 'siemens governance', 'environment preserve', 'environment sustainability', 'socialcontribution', 'annex glance']

