# Principles of vocabulary control for a one-year window (using contemporaneous returns in 2023 as an example)
1. Remove English stop words.
2. Remove the 100 most frequent words in each year's corpus.
3. Remove words that appear only once in each year's corpus; these words account for approximately 25% of the vocabulary.
4. **Most news-source names are removed as a side effect of removing the 100 most frequent words.**
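
The rules above can be sketched on a toy corpus. This is a minimal illustration, not the notebook's implementation: the headlines, the four-word stop list, the simplified tokenizer, and `top_n=1` are all assumptions chosen so the result is easy to check by hand.

```python
import collections
import re

# Hypothetical toy corpus; the notebook reads real headlines from a CSV file.
toy_headlines = [
    "Acme Corp shares rise on earnings beat",
    "Acme Corp shares fall after earnings miss",
    "Globex shares rise as the market rallies",
]

# Tiny stand-in for an English stop-word list (assumption for this sketch).
TOY_STOP_WORDS = {"on", "after", "as", "the"}


def tokenize(text):
    """Lowercase the text and keep tokens of at least two characters."""
    return re.findall(r"[a-z0-9.]{2,}", text.lower())


def build_remove_words(headlines, stop_words, top_n):
    """Union of stop words, the top_n most frequent tokens, and tokens seen once."""
    counts = collections.Counter()
    for headline in headlines:
        counts.update(tokenize(headline))
    top_words = {word for word, _ in counts.most_common(top_n)}
    once_words = {word for word, count in counts.items() if count == 1}
    return set(stop_words) | top_words | once_words


remove_words = build_remove_words(toy_headlines, TOY_STOP_WORDS, top_n=1)
cleaned = [
    " ".join(token for token in tokenize(headline) if token not in remove_words)
    for headline in toy_headlines
]
print(cleaned)
```

Here "shares" is dropped as the most frequent token, and words such as "globex" and "rallies" are dropped because they occur only once, which leaves very little of the third headline. On a real yearly corpus the same effect is much milder, but it shows why the once-only rule removes a large share of the vocabulary.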

In [1]:
import pandas as pd
import re
import collections
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

df = pd.read_csv("/scratch/wx2309/Processed_data/one_year_window/contem_2023.csv")
headlines = df.headline.tolist()

In [2]:
# Strip the trailing author byline (everything from " - By ..." onward) first
def remove_individual(headline):
    by_pattern = re.search(r" (-+)? By", headline, flags = re.IGNORECASE)
    if by_pattern:
        return headline[:by_pattern.start()]
    else:
        return headline
df["vocab_con_headline"] = df.headline.apply(remove_individual)
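
To make the behavior of the byline pattern concrete, the sketch below applies the same regular expression to three made-up headlines (the headlines and the name `strip_byline` are hypothetical, not taken from the data). It also shows a possible edge case: the pattern has no word boundary after `By`, so a word that merely starts with "By" after a dash is treated as a byline too.

```python
import re


def strip_byline(headline):
    """Cut the headline at ' <optional dashes> By', matched case-insensitively."""
    match = re.search(r" (-+)? By", headline, flags=re.IGNORECASE)
    return headline[:match.start()] if match else headline


# Hypothetical headlines, for illustration only.
with_byline = strip_byline("Market Talk -- By Jane Doe")
# A single space before "By" does not match: the pattern needs two spaces
# (or a space, dashes, and a space) in front of it.
no_byline = strip_byline("Shares Fall By 5%")
# Edge case: "Bypass" starts with "By", so the text after the dash is cut.
edge_case = strip_byline("Acme - Bypass Surgery Device Approved")

print(with_byline)
print(no_byline)
print(edge_case)
```

If the edge case matters for a given corpus, adding `\b` after `By` in the pattern is one way to avoid it.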

In [3]:
# Tokenize the lowercased headline into tokens of at least two characters made up of
# letters, digits, and periods. Periods are kept because they are meaningful in financial
# news, e.g. in prices (3.8) and in news-source names such as barrons.com and
# researchandmarkets.com
def custom_tokenizer(headline):
    # Note: the range must be a-z, not A-z; A-z also matches [ \ ] ^ _ and `
    tokens = re.findall(r"\b[a-z0-9.][a-z0-9.]+\b", headline.lower())
    return tokens
# Drop every token that is in remove_words and rejoin the rest with spaces
def custom_processor(headline, remove_words):
    tokens = custom_tokenizer(headline)
    new_tokens = [token for token in tokens if token not in remove_words]
    return " ".join(new_tokens)
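
As a quick check of what the tokenizer keeps and drops, the sketch below runs a standalone copy of the two functions (using the character class `[a-z0-9.]`) on the first headline printed later in this notebook. The `example_remove_words` set is hypothetical and chosen only for illustration; the real set is built from the yearly corpus.

```python
import re


def custom_tokenizer(headline):
    """Tokens of two or more letters, digits, or periods, lowercased."""
    return re.findall(r"\b[a-z0-9.][a-z0-9.]+\b", headline.lower())


def custom_processor(headline, remove_words):
    """Drop tokens found in remove_words and rejoin the rest with spaces."""
    tokens = custom_tokenizer(headline)
    return " ".join(token for token in tokens if token not in remove_words)


headline = (
    "Network Monitoring Global Market Report 2022: "
    "Sector to Reach $3.8 Billion by 2030 with a 7% CAGR"
)
tokens = custom_tokenizer(headline)

# Hypothetical removal set, for illustration only.
example_remove_words = {
    "global", "market", "report", "2022", "to", "billion", "by", "with",
}
processed = custom_processor(headline, example_remove_words)

print(tokens)
print(processed)
```

Single-character tokens such as "a" and the "7" in "7%" are dropped by the two-character minimum, while "3.8" survives as one token because the period is part of the character class.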

In [4]:
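# Count token frequencies over the raw headlines
vocab = collections.Counter()
for headline in headlines:
    vocab.update(custom_tokenizer(headline))

# Principle 1: English stop words
remove_words = set(ENGLISH_STOP_WORDS)
# Principle 3: words that appear only once
words_once = [word for word, count in vocab.items() if count == 1]
# Principle 2: the 100 most frequent words; ties with the 100th count are included,
# so slightly more than 100 words may be removed
top_100_count = sorted(vocab.values(), reverse=True)[99]
words_top_100 = [word for word, count in vocab.items() if count >= top_100_count]
remove_words.update(words_once)
remove_words.update(words_top_100)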
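
Two properties of this cell are easy to verify on a small counter: the threshold approach keeps every word tied with the N-th count, so it can return more than N words, and the share of once-only words is simply their number divided by the vocabulary size. The helper names and the toy counts below are hypothetical; this is a sketch for checking the logic, not part of the original pipeline.

```python
import collections


def top_n_with_ties(counts, n):
    """Words whose count is at least the n-th largest count (ties included)."""
    if len(counts) < n:
        return set(counts)
    threshold = sorted(counts.values(), reverse=True)[n - 1]
    return {word for word, count in counts.items() if count >= threshold}


def share_of_singletons(counts):
    """Fraction of distinct words that occur exactly once."""
    if not counts:
        return 0.0
    singles = sum(1 for count in counts.values() if count == 1)
    return singles / len(counts)


# Hypothetical counts, for illustration only.
toy_counts = collections.Counter(
    {"shares": 5, "stock": 3, "market": 3, "rally": 1}
)

tied_top_2 = top_n_with_ties(toy_counts, 2)        # includes both words with count 3
strict_top_2 = [word for word, _ in toy_counts.most_common(2)]  # exactly two words
singleton_share = share_of_singletons(toy_counts)

print(tied_top_2, strict_top_2, singleton_share)
```

Running `share_of_singletons(vocab)` on the real counter is one way to reproduce the "approximately 25%" figure quoted in the principles, and `Counter.most_common(100)` is an alternative if exactly 100 words should be removed.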

In [5]:
df["vocab_con_headline"] = df["vocab_con_headline"].apply(lambda headline: custom_processor(headline, remove_words))

In [6]:
df.head()

| Unnamed: 0 | date | rp_entity_id | comnam | ret | headline | vocab_con_headline |
|---|---|---|---|---|---|---|
| 0 | 2023-01-03 | 3DC887 | ARISTA NETWORKS INC | -0.003626 | Network Monitoring Global Market Report 2022: ... | network monitoring sector reach 3.8 2030 cagr |
| 1 | 2023-01-03 | 3DC887 | ARISTA NETWORKS INC | -0.003626 | Network Monitoring Global Market Report 2022: ... | network monitoring corporations continue onlin... |
| 2 | 2023-01-03 | 3DC887 | ARISTA NETWORKS INC | -0.003626 | iRocket Appoints Kelyn Brannon to Board of Dir... | irocket appoints kelyn brannon board directors |
| 3 | 2023-01-03 | 8EA478 | MODERNA INC | -0.003507 | Official List Official List Notice -2- | official list official list notice |
| 4 | 2023-01-03 | 8EA478 | MODERNA INC | -0.003507 | DelveInsight Evaluates a Robust Cystic Fibrosi... | delveinsight evaluates robust cystic fibrosis ... |


In [8]:
vocab_con_headlines = df.vocab_con_headline.tolist()
for headline, vocab_con_headline in zip(headlines[:99],vocab_con_headlines[:99]):
    print(f"{headline} / {vocab_con_headline}")

Network Monitoring Global Market Report 2022: Sector to Reach $3.8 Billion by 2030 with a 7% CAGR / network monitoring sector reach 3.8 2030 cagr
Network Monitoring Global Market Report 2022: Corporations Continue to Move Online, Driving Demand - ResearchAndMarkets.com / network monitoring corporations continue online driving demand
iRocket Appoints Kelyn Brannon to Board of Directors / irocket appoints kelyn brannon board directors
Official List Official List Notice -2- / official list official list notice
DelveInsight Evaluates a Robust Cystic Fibrosis Pipeline as 75+ Influential Pharma Players to Set Foot in the Domain / delveinsight evaluates robust cystic fibrosis pipeline 75 influential pharma players set foot domain
London Stock Exchange Notice Admission to Trading - 03/01/2023 / london exchange notice admission trading 03 01
DelveInsight Evaluates a Robust Cystic Fibrosis -2- / delveinsight evaluates robust cystic fibrosis
CFA Drugs & Devices:Insider Review For Week Ended 30-De