# Results Lexical Semantic Change Detection

By: Iris Luden 
Last edited: 28-03-2023

This notebook examines the results of the LSCD systems applied to the corpora C1 and C2. 

We used Skip-Gram with Negative Sampling (SGNS), aligned the two embedding spaces with Orthogonal Procrustes (OP), and used Cosine Distance (CD) as the change measure. Source: https://github.com/Garrafao/LSCDetection/blob/master/preprocessing/preprocess.py. Additionally, we computed the local neighborhood distance (LND). 
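
To make the two measures concrete, here is a minimal NumPy sketch. It is not the LSCDetection implementation: the function names, the toy matrices, and `k=5` are illustrative assumptions, and LND is written in its common second-order form (compare a word's similarities to the union of its nearest neighbours in each space).

```python
import numpy as np

def procrustes_align(A, B):
    """Rotate A onto B with the orthogonal matrix R that minimises ||A R - B||_F."""
    # The minimiser is R = U V^T, where U S V^T is the SVD of A^T B.
    U, _, Vt = np.linalg.svd(A.T @ B)
    return A @ (U @ Vt)

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def local_neighborhood_distance(A, B, idx, k=5):
    """Cosine distance between word idx's similarity profiles in A and in B."""
    def sims_to_all(M):
        unit = M / np.linalg.norm(M, axis=1, keepdims=True)
        return unit @ unit[idx]          # cosine similarity of idx to every word
    def top_k(sims):
        return [j for j in np.argsort(-sims) if j != idx][:k]
    sims_a, sims_b = sims_to_all(A), sims_to_all(B)
    # Union of the k nearest neighbours found in either space.
    neighbours = sorted(set(top_k(sims_a)) | set(top_k(sims_b)))
    return cosine_distance(sims_a[neighbours], sims_b[neighbours])

# Toy data (hypothetical): B is an exact rotation of A, so no word has changed.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))                   # rows = words in C1
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
B = A @ Q                                      # the same words in C2

aligned = procrustes_align(A, B)
cd_scores = np.array([cosine_distance(aligned[i], B[i]) for i in range(len(A))])
lnd_score = local_neighborhood_distance(A, B, idx=0, k=5)
```

CD needs the alignment step because separately trained SGNS spaces are only comparable up to a rotation. LND compares similarities within each space, so it can be computed without alignment.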


## Description 
In this notebook, we
1. Read change scores for trending target words and stable target words 
2. Select final target words, changing and stable
3. Select final emerging target words based on manual annotations. 

All results are saved in the folder 'Targetwords/'. The final selections of 20 emerging, changing, and stable target words are saved as: 
- 'Targetwords/Changing_targets_20.tsv'
- 'Targetwords/Stable_targets_20.tsv'
- 'Targetwords/Emerging_targets_20.tsv'

In [1]:
import pandas as pd 

# 1. Trending words 

1. Read LSCD scores 
2. Sort and select changing target words 
3. Filter out abbreviations and proper nouns 

In [2]:
# 1.1 Read results candidate trending 
folder = 'LSCD_Scores/'

filename = folder + 'results_cd_trending_targets_30_1.tsv'
df_cd =  pd.read_csv(filename, sep='\t', header=None, names=['Word1', 'CD Score'])

filename = folder + 'results_lnd_trending_targets_30_5.tsv'
df_lnd =  pd.read_csv(filename, sep='\t', header=None, names=['Word2', 'LND Score', 'Neighbors'])

# combine the two data frames 
df_trending = pd.concat([df_cd, df_lnd], axis=1)
df_trending.drop(columns=['Word2'], inplace=True)

df_trending['Word'] = df_trending['Word1'].apply(lambda x: x.split(',')[0])
df_trending.drop(columns=['Word1'], inplace=True)

# display the candidates, most likely semantically changed first
df_trending.sort_values(by='CD Score', inplace=True, ascending=False)
display(df_trending)

# 1.2 select changing words

df_changing = df_trending[(df_trending['CD Score'] > 0.6) & (df_trending['LND Score'] < 0.5)]
print(" There are", len(df_changing), "changing words")

# ### NOTE: uncomment to save  ### 
# df_changing.to_csv('Targetwords/LSCD_results_changing_trending_targets.tsv', sep='\t', index=False)

Unnamed: 0,CD Score,LND Score,Neighbors,Word
7,0.975653,0.002869,7,corona
4,0.959357,0.009155,8,lockdown
6,0.950220,0.004171,7,vanishes
22,0.917989,0.001619,7,manifesting
31,0.908140,0.000214,4,closeness
...,...,...,...,...
280,0.181261,0.000443,4,action
253,0.172959,0.000304,6,debited
108,0.171292,0.000576,4,performed
93,0.163350,0.000016,4,moderators


 There are 73 changing words
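
The positional `pd.concat(..., axis=1)` above is only correct if both score files list the words in the same order. A hedged alternative is to join on the word itself. The sketch below uses made-up words and scores, and assumes the word column has already been normalised to a shared `Word` key.

```python
import pandas as pd

# Hypothetical toy scores; the LND rows are deliberately in a different order.
df_cd = pd.DataFrame({'Word': ['alpha', 'beta', 'gamma'],
                      'CD Score': [0.9, 0.7, 0.2]})
df_lnd = pd.DataFrame({'Word': ['gamma', 'alpha', 'beta'],
                       'LND Score': [0.01, 0.02, 0.6],
                       'Neighbors': [4, 7, 5]})

# Join on the key; validate raises if a word occurs more than once in either table.
merged = df_cd.merge(df_lnd, on='Word', how='inner', validate='one_to_one')

# Same thresholds as in the cell above.
changing = merged[(merged['CD Score'] > 0.6) & (merged['LND Score'] < 0.5)]
changing = changing.sort_values(by='CD Score', ascending=False)
```

With `how='inner'`, words missing from either file are dropped silently, so comparing `len(merged)` against the input lengths is a cheap sanity check.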


#### 1.3 Filter changing target words 

The following words are abbreviations or proper nouns (and nothing else) according to WordNet: 
- Abbreviations: cro, lh, sol, moa, crt, sars, rh, wei, atp
- Proper nouns: erica, hancock, burrow, ethiopian, azerbaijan, herbert, bruno, greenwood, cummings

In [None]:
# 1.3 filter target words
df_changing = pd.read_csv('Targetwords/LSCD_results_changing_trending_targets.tsv', sep='\t')

drop = ['cro', 'lh', 'sol',  'moa', 'caters', 'crt', 'sars',  'rh', 'wei', 'atp',
        'erica', 'hancock', 'burrow', 'ethiopian',  'azerbaijan',  'herbert', 'bruno', 'greenwood', 'cummings']
df_changing = df_changing[~df_changing['Word'].isin(drop)]
df_changing = df_changing[['Word', 'CD Score', 'LND Score', 'Neighbors']]
display(df_changing[:20])

# ### Note: uncomment to save ### 
# df_changing[:20].to_csv('Targetwords/LSCD_results_changing_targets_20.tsv', sep='\t', index=False)
# df_changing[:20]['Word'].to_csv('Targetwords/Changing_targets_20.tsv', sep='\t', index=False)

# 2. Stable target words selection 

1. Read LSCD results 
2. Select those with CD and LND scores below 0.25
3. Randomly sample 20 
4. Save files 

In [None]:
# 2.1 read results candidate stable words 
df_cd =  pd.read_csv(folder + 'results_cd_stable_targets_30.tsv', sep='\t', header=None, 
                     names=['Word1', 'CD Score'])
df_lnd =  pd.read_csv(folder + 'results_lnd_stable_targets_30_5.tsv', sep='\t', header=None, 
                      names=['Word2', 'LND Score', 'Neighbors'])

# combine the two data frames 
df_stable_candidates = pd.concat([df_cd, df_lnd], axis=1).drop(columns=['Word2'])
df_stable_candidates['Word'] = df_stable_candidates['Word1'].apply(lambda x: x.split(',')[0])
df_stable_candidates.drop(columns=['Word1'], inplace=True)

# 2.2 Select stable words: those with CD and LND scores below 0.25
df_stable = df_stable_candidates[(df_stable_candidates['CD Score'] < 0.25) & (df_stable_candidates['LND Score'] < 0.25)].sort_values(by='CD Score')
df_stable

# Save 
# df_stable.to_csv('Targetwords/LSCD_results_stable_targets.tsv', sep='\t', index=False)

# 2.3 randomly select 20 of these (commented out due to randomness)
# df_stable = df_stable.sample(20)

# 2.4 Save 
# ### NOTE: uncomment to save ### 
# df_stable.to_csv('Targetwords/LSCD_results_stable_targets_20.tsv', sep='\t', index=False)
# df_stable['Word'].to_csv('Targetwords/Stable_targets_20.tsv', sep='\t', index=False)
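
The random draw in step 2.3 is commented out because it is not repeatable. One way to make it repeatable is to fix `random_state`. This is a sketch on made-up data; the seed value and the toy table are assumptions, not the settings used for the saved files.

```python
import numpy as np
import pandas as pd

# Hypothetical candidate table: 100 made-up words with evenly spaced scores.
candidates = pd.DataFrame({
    'Word': [f'word{i}' for i in range(100)],
    'CD Score': np.linspace(0.0, 0.5, 100),
    'LND Score': np.linspace(0.0, 0.3, 100),
})

# Same thresholds as step 2.2.
stable = candidates[(candidates['CD Score'] < 0.25) & (candidates['LND Score'] < 0.25)]

SEED = 42  # arbitrary; any fixed integer gives a repeatable draw
sample_a = stable.sample(n=20, random_state=SEED)
sample_b = stable.sample(n=20, random_state=SEED)  # identical to sample_a
```

A fixed seed only reproduces the same sample if the filtered table has the same rows in the same order, so the saved TSV file remains the authoritative record of the selection.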


# 3. Emerging Target Words selection 

Emerging words should occur more than 50 times in C2 (to have sufficient example sentences), and should not also be changing target words. They are manually selected. 

In [None]:
# final changing and stable target words
changing_words = pd.read_csv('Targetwords/Changing_targets_20.tsv', sep='\t')
stable_words = pd.read_csv('Targetwords/Stable_targets_20.tsv', sep='\t')

In [None]:
# read all candidates 
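df_emerging = pd.read_csv('Targetwords/all_neologisms-ANNOTATORS.tsv', sep='\t', encoding='ISO-8859-1')
df_emerging = df_emerging[(df_emerging['Judgement'] == '1') & (df_emerging['C2 freq'] > 50)]
# exclude candidates that are already changing target words
# (the filter result must be assigned, and compared against the 'Word' column values)
df_emerging = df_emerging[~df_emerging['Term'].isin(changing_words['Word'])]
display(df_emerging)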

# (manual) selection of 20 target words
emerging = '''copium covidiots plandemic vaxed gatekeeping grifting gaslight non-binary femboy quarantining covid transphobe simp wokeness sapphic spreader goated k-pop vax anti-vax'''
# emerging += '''metaverse mots fuckable poggers airdrop doxxed tiktok corona parasol groomers minting tokenomics '''
# emerging += '''tannies hololive brainrot yeeted bootlicker performative socials shitcoins nft '''
# emerging += '''rioters engene valorant stonks vtuber pre-pandemic simping insurrectionist immunocompromised'''    
emerging = emerging.split()

# save in same format as changing and stable
emerging_df = pd.DataFrame({'Word': emerging})
emerging_df.to_csv('Targetwords/Emerging_targets_20.tsv', sep='\t', index=False)
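
Because the emerging list is typed by hand, a quick check that it has 20 unique words and shares none with the changing targets can catch slips. In the sketch below, `find_repeats` is a hypothetical helper and `changing_sample` is only the five top-scoring words displayed in section 1; in practice the full lists would be read from the TSV files.

```python
emerging = '''copium covidiots plandemic vaxed gatekeeping grifting gaslight
non-binary femboy quarantining covid transphobe simp wokeness sapphic spreader
goated k-pop vax anti-vax'''.split()

# Illustrative subset only: the top rows shown in the section 1 output.
changing_sample = ['corona', 'lockdown', 'vanishes', 'manifesting', 'closeness']

def find_repeats(*word_lists):
    """Return every word that occurs more than once within or across the lists."""
    seen, repeats = set(), set()
    for words in word_lists:
        for word in words:
            if word in seen:
                repeats.add(word)
            seen.add(word)
    return repeats

repeats = find_repeats(emerging, changing_sample)
```

An empty `repeats` set means the lists are disjoint and contain no internal duplicates.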