<h1 style="color: #210e9c;">Final Project: <br> Pension Funds and reforms<br>OECD countries 2005-2020</h1>

---
<h2 style="color: #47a6ff;">Unsupervised Learning</h2>

**News clustering** is an application of Natural Language Processing (NLP) that groups text documents based on their underlying themes or topics. This technique is particularly useful in organizing and analyzing large text corpora, such as news articles, by identifying patterns and categorizing them into clusters without predefined labels.

In [1]:
# 📚 Basic Libraries
import pandas as pd
import numpy as np
import warnings

# 📝 Text Processing
import nltk 
from nltk.stem import WordNetLemmatizer # to lemmatize the words
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet # to get the wordnet pos tags
from nltk.corpus import stopwords # to remove the stopwords
from sklearn.feature_extraction.text import CountVectorizer # to create a bag of words

# Machine Learning
from sklearn.cluster import KMeans
from kneed import KneeLocator
import plotly.graph_objects as go
from sklearn.metrics import silhouette_score

In [2]:
# 🔧 Make your functions:
# Save this file as my_functions.py
# Import your functions in your notebook
# from my_functions import *
def snake_columns(data):
    """
    Standarize and returns snake_case columns
    """
    data.columns = [column.lower().replace(' ', '_') for column in data.columns]

def map_pos_tag(word):
    """
    Map POS tag to first character lemmatize() accepts.
    """
    tag = nltk.pos_tag([word])[0][1][0].upper() # get the first character of the POS tag
    tag_dict = { # dictionary to map POS tags
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV
    }
    return tag_dict.get(tag, wordnet.NOUN) # return the value of the key or the default value

# ⚙️ Settings
pd.set_option('display.max_columns', None) # display all columns
warnings.filterwarnings('ignore') # ignore warnings

<h2 style="color: #47a6ff;">Data Extraction</h2>

In [3]:
data = pd.read_csv("../Data/data.csv", sep=";")
df = data.copy()
snake_columns(df)
df.head(3)

Unnamed: 0,country,year,year-country,information_type,oecd_private_pensions_outlook,oecd_pensions_@glance,text,expanding_measures,contracting_measures
0,Switzerland,1946.0,1946-Switzerland,Benefits,,_,Federal Law of 20 December on old-age and surv...,1.0,
1,Luxembourg,1967.0,1967-Luxembourg,Taxes,,_,Article 111bis of the amended Law on Revenue T...,,1.0
2,Germany,1974.0,1974-Germany,Coverage,,_,The Gesetz zur Verbesserung der betrieblichen ...,1.0,


In [4]:
 # "oecd_private_pensions_outlook", "oecd_pensions_@glance" columns describe the source of the data, not the data itself
df = df.drop(columns=["oecd_private_pensions_outlook", "oecd_pensions_@glance"])

# "expanding_measures" and "contracting_measures" took place 1 if there is a measure and 0 if there is no measure
df['expanding_measures'] = df['expanding_measures'].replace(np.nan, 0)
df['contracting_measures'] = df['contracting_measures'].replace(np.nan, 0)

# "year" does is empty when it collects descriptions of the pension system, not reforms
df.dropna(subset=['year', 'information_type'], inplace=True)

### Overview
The dataset contains pension reform descriptions.


In [5]:
df.head(3)

Unnamed: 0,country,year,year-country,information_type,text,expanding_measures,contracting_measures
0,Switzerland,1946.0,1946-Switzerland,Benefits,Federal Law of 20 December on old-age and surv...,1,0.0
1,Luxembourg,1967.0,1967-Luxembourg,Taxes,Article 111bis of the amended Law on Revenue T...,0,1.0
2,Germany,1974.0,1974-Germany,Coverage,The Gesetz zur Verbesserung der betrieblichen ...,1,0.0


In [6]:
df.iloc[0]

country                                                       Switzerland
year                                                               1946.0
year-country                                             1946-Switzerland
information_type                                                 Benefits
text                    Federal Law of 20 December on old-age and surv...
expanding_measures                                                      1
contracting_measures                                                  0.0
Name: 0, dtype: object

In [7]:
df.iloc[0,4]

'Federal Law of 20 December on old-age and survivors insurance.'

In [8]:
df = df[["country", "year", "information_type", "text"]]


In [9]:
df.shape

(446, 4)

<h2 style="color: #47a6ff;">Tokenization and Punctuation Removal</h2>

This section preprocesses the text data in the `all_news` DataFrame by tokenizing and cleaning the text.

- **Tokenization**: Splits each text into smaller units (tokens), typically words, for easier processing.
- **Lowercasing**: Converts all tokens to lowercase to ensure consistency and prevent duplicate representations (e.g., "Apple" and "apple").
- **Punctuation Removal**: Removes non-alphabetic characters to focus only on meaningful words.


In [10]:
# lambda text wi
df['tokenized'] = df['text'].apply(lambda x: [word.lower() for word in word_tokenize(x) if word.isalpha()])
df.head()

Unnamed: 0,country,year,information_type,text,tokenized
0,Switzerland,1946.0,Benefits,Federal Law of 20 December on old-age and surv...,"[federal, law, of, december, on, and, survivor..."
1,Luxembourg,1967.0,Taxes,Article 111bis of the amended Law on Revenue T...,"[article, of, the, amended, law, on, revenue, ..."
2,Germany,1974.0,Coverage,The Gesetz zur Verbesserung der betrieblichen ...,"[the, gesetz, zur, verbesserung, der, betriebl..."
3,Iceland,1974.0,Coverage,The mandatory pension fund system was introduc...,"[the, mandatory, pension, fund, system, was, i..."
4,United States,1974.0,Coverage,The Employee Retirement Income Security Act (E...,"[the, employee, retirement, income, security, ..."


<h2 style="color: #47a6ff;">Lemmatization with Part-of-Speech (POS) Helpers</h2>

This section applies **lemmatization** to the tokenized text in the `all_news` DataFrame using part-of-speech (POS) tags for improved accuracy.

- **Lemmatization**: Reduces words to their base or dictionary form (lemma) while considering the grammatical role of each word.
- **POS Tagging**: Enhances lemmatization by providing contextual information about whether a word is a noun, verb, adjective, etc.

In [11]:
lemm = WordNetLemmatizer()

df['lemmatized'] = df['tokenized'].apply(lambda x: [lemm.lemmatize(word, map_pos_tag(word)) for word in x])
df.head()

Unnamed: 0,country,year,information_type,text,tokenized,lemmatized
0,Switzerland,1946.0,Benefits,Federal Law of 20 December on old-age and surv...,"[federal, law, of, december, on, and, survivor...","[federal, law, of, december, on, and, survivor..."
1,Luxembourg,1967.0,Taxes,Article 111bis of the amended Law on Revenue T...,"[article, of, the, amended, law, on, revenue, ...","[article, of, the, amend, law, on, revenue, ta..."
2,Germany,1974.0,Coverage,The Gesetz zur Verbesserung der betrieblichen ...,"[the, gesetz, zur, verbesserung, der, betriebl...","[the, gesetz, zur, verbesserung, der, betriebl..."
3,Iceland,1974.0,Coverage,The mandatory pension fund system was introduc...,"[the, mandatory, pension, fund, system, was, i...","[the, mandatory, pension, fund, system, be, in..."
4,United States,1974.0,Coverage,The Employee Retirement Income Security Act (E...,"[the, employee, retirement, income, security, ...","[the, employee, retirement, income, security, ..."


<h2 style="color: #47a6ff;">Removing Stopwords</h2>

To further cleans the lemmatized text by removing stopwords from the `all_news` DataFrame.

- **Stopwords** are common words like "the," "is," and "and," which often do not carry significant meaning in the text.
- Removing stopwords helps:
  - Reduce noise in the data.
  - Focus on meaningful and relevant words for analysis.
  - Improve the performance of downstream NLP tasks such as clustering and classification.

In [12]:
# remove stopwords
df['no_stopwords'] = df['lemmatized'].apply(lambda x: list(set(x).difference(stopwords.words('english'))))
df.head()

Unnamed: 0,country,year,information_type,text,tokenized,lemmatized,no_stopwords
0,Switzerland,1946.0,Benefits,Federal Law of 20 December on old-age and surv...,"[federal, law, of, december, on, and, survivor...","[federal, law, of, december, on, and, survivor...","[insurance, december, law, federal, survivor]"
1,Luxembourg,1967.0,Taxes,Article 111bis of the amended Law on Revenue T...,"[article, of, the, amended, law, on, revenue, ...","[article, of, the, amend, law, on, revenue, ta...","[article, pension, law, saving, contract, taxa..."
2,Germany,1974.0,Coverage,The Gesetz zur Verbesserung der betrieblichen ...,"[the, gesetz, zur, verbesserung, der, betriebl...","[the, gesetz, zur, verbesserung, der, betriebl...","[gesetz, pension, law, enhance, retirement, se..."
3,Iceland,1974.0,Coverage,The mandatory pension fund system was introduc...,"[the, mandatory, pension, fund, system, was, i...","[the, mandatory, pension, fund, system, be, in...","[pension, design, retirement, replacement, emp..."
4,United States,1974.0,Coverage,The Employee Retirement Income Security Act (E...,"[the, employee, retirement, income, security, ...","[the, employee, retirement, income, security, ...","[pension, law, security, retirement, complemen..."


<h2 style="color: #47a6ff;">Combining Tokens into Text Blobs</h2>

This steps combines the processed tokens in the `no_stopwords` column into a single string for each row. These "clean blobs" are used as input for feature extraction and clustering.


In [13]:
df['clean_blob'] = df['no_stopwords'].apply(lambda x: " ".join(x))
# df.head(1)

<h2 style="color: #47a6ff;">Bag-of-Words (BoW) Vectorization</h2>

This step uses the Bag-of-Words model to convert the cleaned text blobs into numerical feature vectors, limited to the most common 1000 words.

In [14]:
# let's take only the most common 1000 words
bow_vect = CountVectorizer(max_features=1000)

In [15]:
X = bow_vect.fit_transform(df['clean_blob']).toarray()


In [16]:
df['clean_blob']


0                insurance december law federal survivor
1      article pension law saving contract taxation c...
2      gesetz pension law enhance retirement set occu...
3      pension design retirement replacement employ m...
4      pension law security retirement complementary ...
                             ...                        
442    automatically provider transfer contribution a...
443    option may contribution people program kiwisav...
444    line one jan early flexibly remains retirement...
445    affect pln cover certain occupational conditio...
446                      employer benefit increase contn
Name: clean_blob, Length: 446, dtype: object

In [17]:
as_df = pd.DataFrame(X, columns=bow_vect.get_feature_names_out())
as_df.head()

Unnamed: 0,able,abolish,abolition,access,accord,account,accrual,accrue,accumulate,acquisition,acronym,act,action,active,activity,actual,actually,actuarial,ad,add,addition,additional,additionally,adhesion,adjust,adjustment,administer,administration,administrative,adopt,advice,affect,afores,afp,age,agency,agree,agreement,aim,allocate,allocation,allow,allowance,allows,already,also,alternative,although,amend,amendment,amends,among,amount,announce,annual,annually,annuity,another,apartment,applicable,applies,apply,approve,approximately,apr,april,arduous,area,around,arrangement,article,aspect,asset,association,assurance,atp,attractive,au,aud,august,australia,australian,authorise,authority,automatic,automatically,autonomous,auxiliary,available,average,back,balance,bank,bankruptcy,bargaining,base,basic,basis,become,begin,beneficiary,benefit,best,beyond,bill,bn,board,bond,bonus,book,boost,born,branch,bring,british,budget,buffer,business,cad,calculate,calculation,call,canada,cancel,cap,capital,capitalisation,care,carers,carry,case,cash,category,ce,ceiling,certain,certificate,change,charge,child,choice,choose,chose,citizen,civil,claim,claimed,clause,close,cohort,collect,collective,collectively,columbia,combination,combine,come,commenced,commission,committee,company,compare,complement,complementary,complete,component,compound,comprises,compulsory,compute,concern,concerned,concession,concessional,condition,conditional,consar,conservative,consider,consolidation,constitutional,construction,consumer,continue,continued,contn,contns,contract,contractual,contribute,contribution,contributory,control,conversion,convert,cooperative,cost,coud,could,council,count,couple,court,cover,coverage,cpi,cpp,create,creation,credit,current,currently,cut,cxvii,date,day,db,dc,de,decease,december,decide,decision,decrease,decree,decrement,deduct,deduction,default,defer,define,defines,delegate,demand,denmark,depend,deposit,described,design,detail,determination,determine,di,different,direct,directive,directly,disability,dissolve,distribution,divert,dkk,due,earlier,early,earn,earner,earneres,earnings,economic,economy,education,effect,effective,either,eligibility,eligible,eliminate,employ,employee,employer,employment,enable,enact,encourage,end,enhance,enrol,enrolment,ensure,enter,enterprise,entitlement,entity,entrant,epfs,equal,equity,esg,esodati,establish,establishment,estimate,etc,eu,eur,event,every,exceed,exceeds,except,exceptional,excess,exclude,exclusively,exempt,exemption,exist,exit,exmployers,expand,expect,expectancy,experience,extend,extension,extra,extraordinary,face,fact,factor,fairer,fall,family,far,farmer,february,federal,fee,fek,final,finance,financial,financially,financing,firm,first,fiscal,five,fix,flat,flexibility,flexible,follow,force,foreign,form,former,formula,forth,fortnight,forward,four,fourth,framework,free,freedom,freeze,frozen,fulfil,full,fully,function,fund,funding,future,gain,gap,gbp,gdp,general,generally,generous,germany,get,gi,give,go,good,govern,governance,government,governs,govt,gradual,gradually,grant,great,gross,group,growth,guarantee,guaranteed,half,hazardous,health,held,help,high,hire,hoc,hold,holder,hour,household,however,http,hybrid,impact,impartial,implement,implementation,implies,important,improve,incentive,include,income,increase,independent,index,indexation,indexed,individual,industry,inflation,influence,information,initial,inps,insolvency,insolvent,instead,institution,institutional,instrument,insurance,insured,interest,introduce,introduces,introduction,invalidity,invest,investment,ira,issue,issuer,italy,january,job,join,jpy,jul,july,june,jurisdiction,keep,kela,key,kiwisaver,kiwisavers,know,la,labour,laid,large,largely,last,late,later,law,lay,least,leave,legal,legislation,legislative,less,level,levy,life,lifetime,limit,limited,line,link,little,live,loan,local,locate,loi,long,longer,low,lower,lump,main,mainly,maintain,majority,make,manage,management,manager,mandatory,many,march,marginal,market,match,maternity,max,maximum,may,mean,measure,mechanism,medium,meet,member,membership,men,mild,million,min,minimum,minister,ministerial,ministry,money,month,monthly,move,much,must,mutual,national,nature,ndc,nest,net,new,newly,next,nok,nominal,normal,normally,norway,notional,november,number,nzd,oas,objective,obligation,oblige,occupation,occupational,october,ofes,offer,office,offset,old,one,ontario,onwards,open,operate,operating,operation,opt,option,optional,order,ordinary,organisation,others,outlive,outside,overall,oversee,package,paid,parameter,parent,parliament,part,partial,participant,participate,participation,particular,partly,partner,pass,past,pathway,pay,payment,payroll,penalty,pension,pensionable,pensioner,pensionsfonds,pensionskasse,pensionskassen,people,per,percentage,percos,performance,period,permit,person,personal,personality,phase,physically,pillar,pip,place,plan,pln,plus,point,policy,pool,portfolio,portion,possibility,possible,post,potential,power,ppm,practice,precede,premium,previous,previously,price,principal,principle,prior,priority,private,privately,procedure,product,produits,professional,professionnelle,profile,profit,program,programme,programmed,progressive,prohibit,prohibits,project,projection,promise,promote,property,proposal,propose,protect,protection,provide,provider,provincial,provision,prpp,prpps,prsa,prsas,prudent,prudential,psv,public,publicly,publish,purchase,purpose,push,put,qualify,quarter,quarterly,quebec,québec,raise,range,rapid,rate,rather,ratio,reach,real,rebate,recapitalise,receive,recipient,recommends,record,recovery,reduce,reduction,reform,regard,regardless,regime,register,registration,regular,regulates,regulation,regulatory,related,relative,relax,relaxed,relevant,relief,remain,remains,remove,repeat,replace,replacement,report,reporting,represent,representative,require,requirement,reserve,residential,residual,respective,respectively,response,responsibility,restriction,restrictive,result,retail,retain,retire,retiree,retirement,retraite,retraites,retroactively,return,revenue,review,revise,revision,riester,right,rise,risen,risk,riskier,risky,role,rollover,royal,rpi,rule,ruling,run,rürup,saf,safeguard,salary,save,saving,scale,scheme,seasonal,second,secondary,sector,sectoral,security,see,sek,senior,seniority,sent,separate,september,servant,service,set,seven,several,severance,share,shift,show,siefores,similar,since,single,size,slightly,small,smalle,smooth,social,society,sole,solely,solidarity,solvency,sorveglianza,sound,source,sp,special,specialise,specific,specifies,specify,spmcs,sponsor,spouse,stabilise,staff,standard,start,state,status,statutory,stay,stewardship,still,stimulus,stock,stop,stp,strategic,strategy,stream,strengthen,stricter,structural,structure,student,subject,submit,subsequently,subsidise,subsidy,substitute,successive,sum,super,superannuation,superannuition,superstream,supervision,supervisory,supplement,supplementary,support,surcharge,survivor,survivorship,suspend,suspension,sustainability,sustainable,swedish,switch,swith,system,table,take,taper,target,tariff,task,tax,taxable,taxation,taxed,technical,temporarily,temporary,ten,term,terminate,test,tfr,thereafter,third,though,three,threshold,thus,tighten,tighter,time,together,total,towards,tqpps,trade,traditional,training,transfer,transform,transition,trattamento,travailleurs,treasury,treat,treatment,trl,trust,trustee,twice,two,tyel,type,typically,uk,unattached,unconstitutional,underfunded,undertaking,underthese,unemployed,unemployment,unfairness,unified,uniform,union,unsound,unused,usd,use,valorisation,value,variable,vast,vehicle,version,vest,voluntary,wage,way,website,week,welfare,well,werknemers,western,whether,whichever,whole,whose,widen,widespread,wil,withdrawal,withdrawn,within,without,wok,wokers,woman,work,worker,workforce,worth,would,xcvi,year,yearly,yield,young,youth,yr,yrsr,zealand,zeland,zelfstandigen,zero,zu
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [18]:
as_df.shape

(446, 1000)

In [19]:
feature_vectors = as_df.describe().T
feature_vectors

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
able,446.0,0.017937,0.132872,0.0,0.0,0.0,0.0,1.0
abolish,446.0,0.031390,0.174566,0.0,0.0,0.0,0.0,1.0
abolition,446.0,0.011211,0.105404,0.0,0.0,0.0,0.0,1.0
access,446.0,0.026906,0.161990,0.0,0.0,0.0,0.0,1.0
accord,446.0,0.020179,0.140771,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...
zealand,446.0,0.004484,0.066890,0.0,0.0,0.0,0.0,1.0
zeland,446.0,0.002242,0.047351,0.0,0.0,0.0,0.0,1.0
zelfstandigen,446.0,0.002242,0.047351,0.0,0.0,0.0,0.0,1.0
zero,446.0,0.002242,0.047351,0.0,0.0,0.0,0.0,1.0


<h2 style="color: #47a6ff;">K-Means Clustering</h2>

This step applies the K-Means clustering algorithm to the BoW feature vectors to group text blobs into clusters.


In [20]:
inertias = []
range_of_clusters = range(1, 11)

for k in range_of_clusters:
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    model.fit(X)
    inertias.append(model.inertia_)

In [22]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range_of_clusters), y=inertias, mode='lines+markers', name='Inertia'))
fig.update_layout(title='Elbow Method For Optimal k',
                  xaxis_title='Number of clusters, k',
                  yaxis_title='Inertia',
                  xaxis=dict(tickmode='array', tickvals=list(range_of_clusters)))
fig.show()

In [24]:
# Finding the optimal number of clusters using the KneeLocator
kn = KneeLocator(range_of_clusters, inertias, curve='convex', direction='decreasing')
optimal_clusters = kn.knee

print(f"Knee method optimal clusters: {optimal_clusters}")

Knee method optimal clusters: 8


In [25]:
kmeans = KMeans(n_clusters=8, random_state=42)
kmeans.fit(X)
pred = kmeans.predict(X)

In [27]:
silhouette_avg = silhouette_score(X, kmeans.labels_)
print(f'Silhouette Score: {silhouette_avg:.3f}')

Silhouette Score: 0.079


In [28]:
predict_df = pd.concat([df[["information_type", 'text']], pd.DataFrame(pred, columns=['cluster'])], axis=1)
predict_df.head()

Unnamed: 0,information_type,text,cluster
0,Benefits,Federal Law of 20 December on old-age and surv...,4.0
1,Taxes,Article 111bis of the amended Law on Revenue T...,4.0
2,Coverage,The Gesetz zur Verbesserung der betrieblichen ...,4.0
3,Coverage,The mandatory pension fund system was introduc...,0.0
4,Coverage,The Employee Retirement Income Security Act (E...,4.0


In [29]:
predict_df["information_type"].value_counts()

information_type
Coverage                    111
Benefits                    100
Diversification/security     88
Contributions                77
Taxes                        51
Fee                          15
Fees                          4
Name: count, dtype: int64

In [30]:
contingency_table = pd.crosstab(predict_df ['information_type'], predict_df ['cluster'])
print(contingency_table)

cluster                   0.0  1.0  2.0  3.0  4.0  5.0  6.0  7.0
information_type                                                
Benefits                    5    0    1   32   61    0    1    0
Contributions               4    0    0   39   32    0    1    0
Coverage                   31    0    0    7   69    0    0    4
Diversification/security   19    0    0    3   64    2    0    0
Fee                         1    0    0    0   14    0    0    0
Fees                        2    0    0    1    1    0    0    0
Taxes                       6    1    0   11   33    0    0    0


In [31]:
predict_df[predict_df["cluster"] == 1]

Unnamed: 0,information_type,text,cluster
430,Taxes,January 2018. New Generation Incentive: betwee...,1.0


In [32]:
coverage = predict_df[predict_df["information_type"] == "Coverage"]

In [33]:
coverage.cluster.value_counts()

cluster
4.0    69
0.0    31
3.0     7
7.0     4
Name: count, dtype: int64

In [36]:
benefits = predict_df[predict_df["information_type"] == "Benefits"]

In [37]:
benefits.cluster.value_counts()

cluster
4.0    61
3.0    32
0.0     5
2.0     1
6.0     1
Name: count, dtype: int64