# Corpus statistics 

By: Iris Luden
Last edited: March 2023

### Description 

Corpus 1 and Corpus 2 consist of Twitter and Reddit data.

Corpus 1: 
- Start date: 07-2015
- End date: 04-2019 (included)

Corpus 2: 
- Start date: 05-2019
- End date: 02-2023 (included)

In this notebook, we collect:

1. General statistics

2. Neologisms / Emerging new words 

3. Trending candidate target words 

4. Stable candidate target words 

5. (extra) some example sentences for each emerging word


# 1. General statistics 

- 1.1 Total number of documents: 967400
- 1.2 Number of sentences
- 1.3 Sentence lengths
- 1.4 Word frequencies & total number of words

In [None]:
from gensim.models.word2vec import PathLineSentences

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from collections import Counter
from nltk.corpus import wordnet
import random
minCount = 30

In [None]:
# helper functions

# determines whether the word should be included in the vocabulary
def filter_rule(word):
    ''' Returns False if a word looks like a URL, or contains an emoji
    (or any other character that ascii() escapes as \\u or \\U) '''
    if 'https:/' in word or 'http://' in word or 'www.' in word: 
        return False
    elif '\\u' in ascii(word): 
        return False
    elif '\\U' in ascii(word): 
        return False
    else:
        return True
    
def corpus_frequencies(corpuslines):
    
    freqs_dict = Counter()
    
    for sentence in corpuslines:
        for word in sentence: 
            if filter_rule(word):
                freqs_dict[word] += 1
    return freqs_dict
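
To make the effect of `filter_rule` concrete, the sketch below restates its checks in a standalone function. The name `is_plain_token` and the sample tokens are illustrative and not part of the notebook. Because `ascii()` escapes every character above U+00FF as `\u…` or `\U…`, the rule drops emoji, but it also drops, for instance, tokens containing a typographic apostrophe, while Latin-1 accents pass.

```python
def is_plain_token(word):
    """Same checks as filter_rule, restated so this snippet runs on its own."""
    if 'https:/' in word or 'http://' in word or 'www.' in word:
        return False
    escaped = ascii(word)
    return '\\u' not in escaped and '\\U' not in escaped

# illustrative tokens, written with escapes to avoid encoding issues
samples = [
    'hello',                 # plain ASCII
    'https://example.com',   # URL
    'caf\u00e9',             # Latin-1 accent: escaped as \xe9, so it is kept
    '\U0001F600',            # emoji: escaped as \U0001f600, so it is dropped
    'don\u2019t',            # typographic apostrophe: escaped as \u2019, dropped
]
results = {w: is_plain_token(w) for w in samples}
for token, kept in results.items():
    print(ascii(token), kept)
```

If tokens with curly quotes or non-Latin scripts should be kept, the rule would need a narrower emoji test; whether that matters depends on how the corpora were preprocessed.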

#### Corpus statistics 

In [None]:
# read data (splits sentences automatically, and splits into words by space)
texts_C1 = PathLineSentences('Corpora/Corpus1')
texts_C2 = PathLineSentences('Corpora/Corpus2')

# sentence lengths
sentence_lengths_C1 = [len(t) for t in texts_C1]
sentence_lengths_C2 = [len(t) for t in texts_C2]

# number of sentences
number_of_sentences_C1 = len(sentence_lengths_C1)
number_of_sentences_C2 = len(sentence_lengths_C2)

# Total number of terms  
number_of_terms_C1 = sum(sentence_lengths_C1)
number_of_terms_C2 = sum(sentence_lengths_C2)

# Word frequency of each corpus 
C1_freqs = corpus_frequencies(texts_C1)
C2_freqs = corpus_frequencies(texts_C2)
total_freqs = C1_freqs + C2_freqs

# Total number of words in each corpus (after filtering out URLs and emojis)
total_words_C1 = sum(C1_freqs.values())
total_words_C2 = sum(C2_freqs.values())
print(len(total_freqs))

# Reduce by minimum count = 30
# Only include words that occur at least 30 times in the corpus
print("MinCount is ", minCount)
C1_freqs_reduced  = Counter({key:value for key, value in C1_freqs.items() if value >= minCount})
C2_freqs_reduced  = Counter({key:value for key, value in C2_freqs.items() if value >= minCount})
total_freqs_reduced = C1_freqs_reduced + C2_freqs_reduced
print(len(total_freqs_reduced))

# Words that occur in both corpora
intersection = C1_freqs_reduced.keys() & C2_freqs_reduced.keys()
print(len(intersection))
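
The reduction step can be checked on a toy example. The counts below are invented for illustration; only the pattern (threshold each corpus, then add the reduced counters and intersect the key sets) follows the cell above.

```python
from collections import Counter

min_count = 30  # same value as minCount in the notebook

# hypothetical word counts for two tiny corpora
c1 = Counter({'the': 100, 'zoom': 31, 'covid': 2})
c2 = Counter({'the': 120, 'zoom': 29, 'covid': 500})

# keep only words that reach the minimum count within each corpus
c1_reduced = Counter({w: n for w, n in c1.items() if n >= min_count})
c2_reduced = Counter({w: n for w, n in c2.items() if n >= min_count})

# adding Counters sums the counts per key
combined_reduced = c1_reduced + c2_reduced
# words that pass the threshold in both corpora
shared = c1_reduced.keys() & c2_reduced.keys()

print(sorted(combined_reduced.items()))
print(shared)
```

Note that `zoom` ends up with a combined count of 31 rather than 60: its 29 occurrences in the second corpus were dropped before the counters were added. The combined reduced vocabulary is therefore the union of the two reduced vocabularies, not the vocabulary of words whose summed count reaches the threshold.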

In [None]:
# corpus data in dataframe 
corpus_df = pd.DataFrame({'Corpus 1': {'Number of sentences': number_of_sentences_C1,
                          'Average sentence length': np.mean(sentence_lengths_C1),
                          'Number of terms': number_of_terms_C1, 
                          'Number of words': total_words_C1,            
                          'Unique words': len(C1_freqs), 
                          'Unique words (reduced)': len(C1_freqs_reduced)},
             'Corpus 2': {'Number of sentences': number_of_sentences_C2,
                          'Average sentence length': np.mean(sentence_lengths_C2),
                          'Number of terms': number_of_terms_C2, 
                          'Number of words': total_words_C2,   
                          'Unique words': len(C2_freqs),
                          'Unique words (reduced)': len(C2_freqs_reduced)
                         }, 
              'Combined': {'Number of sentences': number_of_sentences_C2 + number_of_sentences_C1,
              'Average sentence length': np.mean(sentence_lengths_C2 + sentence_lengths_C1),
              'Number of terms': number_of_terms_C2 + number_of_terms_C1, 
              'Number of words': sum(total_freqs.values()),   
              'Unique words': len(total_freqs),
              'Unique words (reduced)': len(total_freqs_reduced)
             }})

# show the summary table
display(corpus_df)

#### Identify in which sentences each target word occurs

In [None]:
# Get the line number and the target words 
def collect_corpus_index(sentences):
    '''Maps every word to the set of sentence indices (0-based, in iteration order) in which it occurs'''
    words2lines = {}
    i = 0
    for sentence in sentences: 
        for word in sentence: 
            if filter_rule(word):
                if word not in words2lines: 
                    words2lines[word] = set()
                words2lines[word].add(i)
        i += 1 
    return words2lines

w2l_C1 = collect_corpus_index(texts_C1)
w2l_C2 = collect_corpus_index(texts_C2)

In [None]:
# Get sentence frequency for each word 
sentence_freqs_C1 = Counter({word: len(w2l_C1[word]) for word in w2l_C1})
sentence_freqs_C2 = Counter({word: len(w2l_C2[word]) for word in w2l_C2})

sentence_freqs_C1_reduced = Counter({word: len(w2l_C1[word]) for word in C1_freqs_reduced})
sentence_freqs_C2_reduced = Counter({word: len(w2l_C2[word]) for word in C2_freqs_reduced})
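
The difference between word frequency and sentence frequency is easy to miss, so here is a minimal sketch on three invented sentences. It builds the same kind of word-to-sentence index as `collect_corpus_index` (without the URL/emoji filter) and compares both counts.

```python
from collections import Counter

# three toy tokenized sentences (illustrative only)
sentences = [['so', 'so', 'good'], ['good', 'day'], ['so', 'what']]

# word -> set of sentence indices in which the word occurs
index = {}
for i, sentence in enumerate(sentences):
    for word in sentence:
        index.setdefault(word, set()).add(i)

# sentence frequency: number of distinct sentences containing the word
sentence_freqs = Counter({w: len(ids) for w, ids in index.items()})
# word frequency: every occurrence counts
word_freqs = Counter(w for s in sentences for w in s)

print(word_freqs['so'], sentence_freqs['so'])
```

Here `so` occurs three times but in only two sentences, so its sentence frequency is always less than or equal to its word frequency.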

# 2. Emerging words selection


Lazaridou et al. (2021) define **emerging new words** "as those that occur frequently on the test set (at least 50 times), but either: (i) were previously unseen on the training set, or (ii) occurred much less frequently on the training set than on the test set, as indicated by an at least 5 times lower unigram probability."

Hence emerging (new) words are words that are either unseen in $C_1$ and seen in $C_2$, or words whose frequency has significantly increased.

We select words as neologisms if their frequency in $C_1$ is below 15, OR if their frequency in $C_2$ is at least five times higher than in $C_1$. Additionally, an emerging new word should occur at least 50 times in $C_2$, so that it provides us with sufficient example context sentences.

In this script we also distinguish between weak and strong neologisms:
- Weak neologisms: occur at least 5 times more often in $C_2$ than in $C_1$, but may occur (a few times) in $C_1$ as well
- Strong neologisms (called strict neologisms in the code): occur in $C_2$ and do NOT occur in $C_1$
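
The weak/strong distinction can be sketched as a small decision function. The name `classify` and the example counts are hypothetical; the defaults of 50 occurrences and a ratio of 5 are taken from the definition above.

```python
def classify(c1_count, c2_count, thres=50, ratio=5):
    """Return 'strict', 'weak' or None for one word (hypothetical helper).

    c1_count, c2_count: frequency of the word in C1 and in C2.
    """
    if c2_count < thres:
        return None            # too few examples in C2
    if c1_count == 0:
        return 'strict'        # unseen in C1
    if c1_count * ratio <= c2_count:
        return 'weak'          # at least `ratio` times more frequent in C2
    return None

for counts in [(0, 60), (10, 50), (11, 50), (0, 49)]:
    print(counts, classify(*counts))
```

With these defaults a word seen 10 times in $C_1$ and 50 times in $C_2$ sits exactly on the boundary and counts as weak, whereas 11 versus 50 does not.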


#### Additional info

We disregard words that contain digits, URLs, or emojis. Additionally, we check whether they are in the WordNet database. This info is saved to the files:

    - 'Targetwords/neologisms_{minCount}.tsv'
    - 'Targetwords/weak_neologisms_{minCount}.tsv'
    - 'Targetwords/all_neologisms_{minCount}.tsv'

The files are saved to '../LSCDetection-master/testsets/test/all_neologisms_reduced.tsv'

In [None]:
# helper functions 
def has_numbers(input_string):
    ''' return True if the word contains at least one digit '''
    return any(char.isdigit() for char in input_string)


def in_wordnet(word):
    ''' return True if the word is known in the WordNet database'''
    return bool(wordnet.synsets(word))
    

# Emerging words: new words, and words whose use has increased drastically.
def find_neologisms(freqs1, freqs2, thres=50, ratio=5):
    ''' Find new emerging words from corpus frequencies (both arguments are Counters).
    Both kinds must occur at least `thres` times in C2.
    strict neologisms: do not occur in C1
    weak neologisms: occur in C1, but at least `ratio` times more often in C2'''
    strict_neologisms = []
    weak_neologisms = []

    # go over the terms in the recent vocabulary
    for word in freqs2:
        
        # skip all the non-relevant words
        if filter_rule(word) and not has_numbers(word):
            
            if freqs2[word] >= thres:

                # strict neologism
                if word not in freqs1:
                    strict_neologisms.append((word, freqs1[word], freqs2[word]))

                # weak neologisms
                else: 
                    if freqs1[word]  * ratio <= freqs2[word]:
                        weak_neologisms.append((word, freqs1[word], freqs2[word]))

    return strict_neologisms, weak_neologisms

In [None]:
# find neologisms/emerging (new) words
# NOTE: based on SENTENCE (document) FREQUENCIES instead of WORD FREQUENCIES
strict_neologisms, weak_neologisms = find_neologisms(sentence_freqs_C1, sentence_freqs_C2)
print(len(strict_neologisms), len(weak_neologisms))

df_neologisms = pd.DataFrame(strict_neologisms, columns = ['Term', 'C1 freq', 'C2 freq'])
df_weak_neologisms = pd.DataFrame(weak_neologisms, columns = ['Term', 'C1 freq', 'C2 freq'])

# see whether they are in WordNet or not
df_neologisms['Wordnet'] = df_neologisms['Term'].apply(in_wordnet)
df_weak_neologisms['Wordnet'] = df_weak_neologisms['Term'].apply(in_wordnet)
print("Number of words that occur in WordNet ", sum(df_neologisms['Wordnet']), sum(df_weak_neologisms['Wordnet']))

display(df_neologisms.head(10))
display(df_weak_neologisms.head(10))

print(len(df_neologisms), len(df_weak_neologisms))

In [None]:
# Write dataframes to csv files. 
### NOTE: uncomment to save

# import os
# os.makedirs('Targetwords/', exist_ok=True)


# df_neologisms.to_csv(f'Targetwords/neologisms_{minCount}.tsv', sep='\t', index=False)
# df_weak_neologisms.to_csv(f'Targetwords/weak_neologisms_{minCount}.tsv', sep='\t', index=False)

# merge together & Save
# df_all_neologisms = pd.concat([df_neologisms, df_weak_neologisms], ignore_index=True)
# df_all_neologisms.to_csv(f'Targetwords/all_neologisms_{minCount}.tsv', sep='\t', index=False)

# 3. Candidate target words based on trending scores

**Trending scores** are based on the frequency of a word in each corpus (Chen et al. 2022). In the code below, we use sentence frequencies.

- Can only be defined for words in both corpora
- Can only be accurate when the corpora have roughly the same number of words/documents

$$ score(w) =  \frac{F_{C_2}(w) - F_{C_1}(w)}{F_{C_1}(w) + k}$$ 

The constant $k$ smooths the denominator: it dampens the scores of words with a low frequency in $C_1$, which would otherwise dominate the ranking.
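
A small numeric check shows how $k$ changes the score. The frequencies below are invented, and the function is restated with plain numbers so that the snippet runs on its own.

```python
def trending_score(f1, f2, k):
    """(F_C2 - F_C1) / (F_C1 + k) for a single word."""
    return (f2 - f1) / (f1 + k)

rare = (1, 11)           # hypothetical word: 1 occurrence in C1, 11 in C2
frequent = (1000, 2000)  # hypothetical word: 1000 in C1, 2000 in C2

for k in (0, 30):
    print(k,
          round(trending_score(*rare, k), 2),
          round(trending_score(*frequent, k), 2))
```

Without smoothing ($k = 0$) the rare word scores ten times higher than the frequent one, even though both gained relatively few or many occurrences respectively. With $k = 30$ the rare word's score shrinks to about a third of the frequent word's score, while the frequent word's score barely moves. In this notebook words with fewer than `minCount` occurrences are excluded beforehand, so the rare case is only meant to show the direction of the effect.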


#### Exclusions 

We only consider words that:
- occur at least 30 times in both $C_1$ and $C_2$
- have a WordNet entry
- do not contain digits, emojis, or URLs

In [None]:
def trending_score(word, freqs1, freqs2, k=15):
    '''Calculate the trending score of a word
        based on its frequency at two time periods'''
    return (freqs2[word] - freqs1[word]) / (freqs1[word] + k)

In [None]:
# Calculate trending score for each word in the common vocabulary of C1 and C2
intersection = C1_freqs_reduced.keys() & C2_freqs_reduced.keys()

words_trending_scores = []
for word in intersection:
    
    if in_wordnet(word) and not has_numbers(word):
        
        # only words with sufficient frequency in both corpora
        if sentence_freqs_C1[word] >= minCount and sentence_freqs_C2[word] >= minCount: 
            words_trending_scores.append((word,
                                          trending_score(word, sentence_freqs_C1, sentence_freqs_C2, k=minCount), 
                                         sentence_freqs_C1[word], sentence_freqs_C2[word]))

df_trending_scores = pd.DataFrame(words_trending_scores, 
                                  columns=['word', 'trending score', 'Sentence freq C1', 'Sentence freq C2'])
df_trending_scores.sort_values(by='trending score', ascending=False, inplace=True)
display(df_trending_scores)

# Select candidate trending words based on threshold
threshold = 1
df_candidate_trending = df_trending_scores[df_trending_scores['trending score'] >= threshold]
display(df_candidate_trending)

In [None]:
# # Save Candidate changing target words 

# ### NOTE: uncomment to run ###

# outfile_candidate_trending = f'Targetwords/trending_candidates_statistics_{minCount}_{threshold}.tsv'
# df_candidate_trending.to_csv(outfile_candidate_trending, sep='\t', index=False)

# # save in file format for LSCD 
# df_save = pd.DataFrame([df_candidate_trending['word'], df_candidate_trending['word']]).T
# display(df_save.head(10))

# df_save.to_csv(f'Targetwords/LSCD_trending_candidates_{minCount}_{threshold}.tsv', sep='\t', header=False, index=False)

# 4. Stable Candidate word selection 

Randomly select 1000 words from the common vocabulary of both corpora, given that they:
- are in the WordNet database
- do not contain any digits, emojis, or URLs
- occur in at least `minCount` sentences in each corpus
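
Because the selection is random, the saved list changes on every run. One possible way to make it repeatable is sketched below; this is not what the notebook does, and the helper name `sample_stable` and the seed value are assumptions. The candidates come from a set of strings, whose iteration order may differ between Python processes, so the sketch sorts the pool before drawing from a seeded generator.

```python
import random

def sample_stable(words, k, seed=0):
    """Draw up to k words reproducibly (hypothetical helper).

    Sorting fixes the order of the pool, and a dedicated
    random.Random(seed) instance leaves the global RNG untouched.
    """
    pool = sorted(words)
    return random.Random(seed).sample(pool, min(k, len(pool)))

# the same candidates in a different container and order give the same sample
first = sample_stable({'apple', 'bridge', 'cloud', 'door', 'engine'}, 3, seed=1)
second = sample_stable(['engine', 'door', 'cloud', 'bridge', 'apple'], 3, seed=1)
print(first == second)
```

The `min(k, len(pool))` guard avoids the `ValueError` that `random.sample` raises when fewer than `k` candidates are available.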

In [None]:
# Collect the intersecting words that occur in both corpora
intersection = C1_freqs_reduced.keys() & C2_freqs_reduced.keys()
print(len(intersection))

# words that have been verified in wordnet and don't contain digits
verified_intersection = [word for word in intersection if (not has_numbers(word) and in_wordnet(word)\
                                                           and sentence_freqs_C1[word] >= minCount \
                                                           and sentence_freqs_C2[word] >= minCount)]

candidate_stable = random.sample(verified_intersection, 1000)
print(len(candidate_stable)) # should be 1000 

# save in format for LSCD 
df_save = pd.DataFrame([candidate_stable, candidate_stable]).T
display(df_save.head(10))


In [None]:
# SAVE (commented out due to randomness)
# df_save.to_csv(f'Targetwords/stable_candidates_{minCount}.tsv', sep='\t', header=False, index=False)

##### Summary: 

In this notebook, we collected:
- Corpus statistics
- Neologisms between C1 and C2 
- Candidate target words based on trending scores 
- Candidate stable words based on random selection. 

### Extra: 
Collect all candidate target words, without the "trending rule". 
Saved in 'Targetwords/all_candidates_{minCount}.tsv'

In [None]:
# Collect the intersecting words that occur in both corpora
intersection = C1_freqs_reduced.keys() & C2_freqs_reduced.keys()
print(len(intersection))

# words that have been verified in wordnet and don't contain digits
verified_intersection = [word for word in intersection if (not has_numbers(word) and in_wordnet(word)\
                                                           and sentence_freqs_C1[word] >= minCount \
                                                           and sentence_freqs_C2[word] >= minCount)]
df_all_words = pd.DataFrame([verified_intersection, verified_intersection]).T

# df_all_words.to_csv(f'Targetwords/all_candidates_{minCount}.tsv', sep='\t', header=False, index=False)

# 5. Extra: Collect example sentences for each neologism

The script below collects some example sentences for the neologisms, so that we can manually select 20 emerging new target words.

For this we use the w2l dictionaries created in part 1 of this notebook.


In [None]:
# read corpus original lines 
import os

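def read_lines(foldername):
    '''Read all lines of all files in the folder.
    Files are read in sorted filename order, like PathLineSentences does,
    so that list positions line up with the w2l sentence indices.
    Assumption: the files contain no empty lines and no line longer than
    the max_sentence_length of PathLineSentences (those shift the indices).'''
    data_lines = []
    for file in sorted(os.listdir(foldername)):
        with open(os.path.join(foldername, file), 'r', encoding='utf-8') as f: 
            data_lines += f.readlines()
    return data_lines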

lines_C2 = read_lines('Corpora/Corpus2/')

In [None]:
def retrieve_examples(word, w2l, corpuslines, minCount=30, n=5):
    '''Retrieve up to n distinct example sentences containing the target word. 
    The sentences are returned as raw (untokenized) corpus lines. 
    Returns None if the word has fewer than minCount distinct example sentences '''
    
    example_lines = w2l[word]
    
    example_sentences = [corpuslines[l] for l in example_lines]
    example_sentences = set(example_sentences) # set to filter out duplicates

    # in case the term has too few examples in the corpus
    if len(example_sentences) < minCount:
        return None
    
    return list(example_sentences)[:n]
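
Taking the first `n` items of a `set` returns the examples in arbitrary order, which can differ between runs. If stable output for the annotators is preferred, one option is sketched below; the helper name `first_unique_examples` and the toy data are hypothetical, and the notebook itself uses the set-based version above.

```python
def first_unique_examples(line_ids, corpuslines, n=5):
    """Return up to n distinct lines, in corpus order (hypothetical helper)."""
    ordered = (corpuslines[i].strip() for i in sorted(line_ids))
    # dict.fromkeys keeps the first occurrence of each line and preserves order
    return list(dict.fromkeys(ordered))[:n]

# toy corpus lines and a toy word -> line-number index
lines = ['new word here\n', 'another line\n', 'new word here\n', 'word again\n']
index = {'word': {0, 2, 3}}

examples = first_unique_examples(index['word'], lines, n=2)
print(examples)
```

The duplicate line is returned only once, and the result is the same on every run because the line numbers are sorted before deduplication.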


In [None]:
# For the annotation, write 5 example sentences 
df_all_neologisms = pd.read_csv(f'Targetwords/all_neologisms_{minCount}.tsv', sep='\t')
df_all_neologisms['Examples'] = df_all_neologisms['Term'].map(lambda x: retrieve_examples(x, w2l_C2, lines_C2))
df_all_neologisms.dropna(inplace=True)
display(df_all_neologisms)

In [None]:
# Save to file for annotators

# ### Note: uncomment to save ### 

# df_all_neologisms.to_csv(f'Targetwords/all_neologisms-ANNOTATORS.tsv', sep='\t', encoding='utf-8', index=False)
