## Python dictionary workflow

This workflow uses the Loughran-McDonald Master Dictionary with Sentiment Word Lists as its sentiment dictionary:


https://sraf.nd.edu/loughranmcdonald-master-dictionary/

In [1]:
import pandas as pd
import numpy as np

In [2]:
import re
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from nltk.tokenize import sent_tokenize, word_tokenize

In [3]:
from pandarallel import pandarallel
pandarallel.initialize()

INFO: Pandarallel will run on 10 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


In [4]:
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
# Download the NLTK resources (stop words and the punkt tokenizer)
nltk.download('stopwords')
nltk.download('punkt')

# Extend the English stop words with any domain-specific terms; reuse this set in the other workflows
stopword = set(stopwords.words('english') + [])  # add extra stop words to the list

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/peterhu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/peterhu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
def data_cleaner(text, return_tokens = False):
    '''
    Cleans the text of bracketed spans, URLs, HTML tags, punctuation marks,
    digits, and extra whitespace. Removes stopwords (such as "if", "it", "the")
    and reduces each word to its stem using the Porter stemmer.
    Returns a list of tokens if return_tokens is True, otherwise a string.
    '''
    text = str(text).lower() # lowercase the string
    text = re.sub(r'\[.*?\]', ' ', text) # replace text in square brackets with whitespace
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text) # replace URLs with whitespace
    text = re.sub(r'<.*?>+', ' ', text) # remove HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text) # remove punctuation
    text = re.sub(r'\r', ' ', text) # remove carriage returns
    text = re.sub(r'\n', ' ', text) # remove newline characters
    text = re.sub(r'\w*\d\w*', ' ', text) # remove tokens containing digits
    text = re.sub('[–£…»]', ' ', text) # remove remaining special characters
    text = text.split()

    # remove stopwords
    text = [word for word in text if word not in stopword]

    # stem each token
    ps = PorterStemmer()
    text = [ps.stem(word) for word in text]

    if return_tokens:
        # return the list of tokens
        return text

    # join the tokens back into a single string
    text = ' '.join(text)

    return text
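
To see what each substitution in `data_cleaner` does, the sketch below replays only the regex steps on a made-up sentence, using the standard library alone. Stop-word removal and stemming are left out, and both the sample string and the helper name `regex_steps` are illustrative assumptions, not part of the notebook.

```python
import re
import string

def regex_steps(text):
    """Replay only the regex substitutions from data_cleaner (no stop words, no stemming)."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', ' ', text)                # drop text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # drop URLs
    text = re.sub(r'<.*?>+', ' ', text)                 # drop HTML-style tags
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # punctuation -> space
    text = re.sub(r'\w*\d\w*', ' ', text)               # drop any token containing a digit
    return text.split()

sample = "See [Note 3] at https://example.com <b>Revenue</b> rose 12% in Q4!"
tokens = regex_steps(sample)
print(tokens)
```

Both `12` and `q4` disappear, because the digit rule drops every token that contains a digit, not only pure numbers.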

In [7]:
sentiment_words_df = pd.read_csv("sentiment_dictionary/Loughran-McDonald_MasterDictionary_1993-2023.csv")
sentiment_words_df.head()

Unnamed: 0,Word,Seq_num,Word Count,Word Proportion,Average Proportion,Std Dev,Doc Count,Negative,Positive,Uncertainty,Litigious,Strong_Modal,Weak_Modal,Constraining,Complexity,Syllables,Source
0,AARDVARK,1,664,2.69e-08,1.86e-08,4.05e-06,131,0,0,0,0,0,0,0,0,2,12of12inf
1,AARDVARKS,2,3,1.21e-10,8.23e-12,9.02e-09,1,0,0,0,0,0,0,0,0,2,12of12inf
2,ABACI,3,9,3.64e-10,1.11e-10,5.16e-08,7,0,0,0,0,0,0,0,0,3,12of12inf
3,ABACK,4,29,1.17e-09,6.33e-10,1.56e-07,28,0,0,0,0,0,0,0,0,2,12of12inf
4,ABACUS,5,9349,3.79e-07,3.83e-07,3.46e-05,1239,0,0,0,0,0,0,0,0,3,12of12inf


In [8]:
sentiment_words_df.columns

Index(['Word', 'Seq_num', 'Word Count', 'Word Proportion',
       'Average Proportion', 'Std Dev', 'Doc Count', 'Negative', 'Positive',
       'Uncertainty', 'Litigious', 'Strong_Modal', 'Weak_Modal',
       'Constraining', 'Complexity', 'Syllables', 'Source'],
      dtype='object')

In [9]:
# extract positive and negative words for workflow

In [10]:
# sentiment_words_df[sentiment_words_df['Positive'] > 0]['Word'].apply(data_cleaner, return_tokens=False)


In [11]:
# Extract the positive and negative word lists and convert each into a set of stemmed words
positive_words = sentiment_words_df[sentiment_words_df['Positive'] > 0]['Word'].apply(data_cleaner, return_tokens=False)

positive_words = set(positive_words)

negative_words = sentiment_words_df[sentiment_words_df['Negative'] > 0]['Word'].apply(data_cleaner, return_tokens=False)

negative_words = set(negative_words)

In [12]:
positive_words

{'abl',
 'abund',
 'acclaim',
 'accomplish',
 'achiev',
 'adequ',
 'advanc',
 'advantag',
 'allianc',
 'assur',
 'attain',
 'attract',
 'beauti',
 'benefici',
 'benefit',
 'best',
 'better',
 'bolster',
 'boom',
 'boost',
 'breakthrough',
 'brilliant',
 'charit',
 'collabor',
 'compliment',
 'complimentari',
 'conclus',
 'conduc',
 'confid',
 'construct',
 'courteou',
 'creativ',
 'delight',
 'depend',
 'desir',
 'despit',
 'destin',
 'dilig',
 'distinct',
 'dream',
 'easi',
 'easier',
 'easili',
 'effici',
 'empow',
 'enabl',
 'encourag',
 'enhanc',
 'enjoy',
 'enthusiasm',
 'enthusiast',
 'excel',
 'except',
 'excit',
 'exclus',
 'exemplari',
 'fantast',
 'favor',
 'favorit',
 'friendli',
 'gain',
 'good',
 'great',
 'greatest',
 'greatli',
 'happi',
 'happiest',
 'happili',
 'highest',
 'honor',
 'ideal',
 'impress',
 'improv',
 'incred',
 'influenti',
 'inform',
 'ingenu',
 'innov',
 'insight',
 'inspir',
 'integr',
 'invent',
 'inventor',
 'lead',
 'leadership',
 'loyal',
 'lucr',

In [13]:
def preprocess_document(doc):
    doc = doc.lower()
    doc = re.sub(r'[^a-z\s]', '', doc)
    tokens = doc.split()
    return tokens

To improve on this, preprocess the dictionary words and the document tokens in the same way, and use those preprocessed words for the scoring below.

Use the same preprocessing as in the LDA workflow on both sides to get the sentiment scores seen below.
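
The point about identical preprocessing can be illustrated with a small sketch. The dictionary stores its words in upper case, so lower-cased document tokens match nothing until the dictionary is normalised the same way. Here `normalise` is a hypothetical stand-in for `data_cleaner` (it only lower-cases), and the three dictionary entries are a hand-picked sample.

```python
def normalise(word):
    # Hypothetical stand-in for data_cleaner: lower-case and strip whitespace only.
    return word.strip().lower()

raw_dictionary = {"GREAT", "GAIN", "GOOD"}   # upper case, as stored in the CSV
doc_tokens = [normalise(w) for w in "A great quarter with a good gain".split()]

# Mismatched preprocessing: the raw dictionary never matches lower-cased tokens.
hits_raw = sum(1 for t in doc_tokens if t in raw_dictionary)

# Same preprocessing on both sides: every sentiment word is found.
clean_dictionary = {normalise(w) for w in raw_dictionary}
hits_clean = sum(1 for t in doc_tokens if t in clean_dictionary)

print(hits_raw, hits_clean)
```

The same reasoning applies to stemming: if the documents are stemmed, the dictionary words must be stemmed too, which is why the word lists above are passed through `data_cleaner`.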

In [14]:
def calculate_sentiment(doc_tokens, positive_words, negative_words):
    # tokens = preprocess_document(doc)
    positive_count = sum(1 for word in doc_tokens if word in positive_words)
    negative_count = sum(1 for word in doc_tokens if word in negative_words)
    total_words = len(doc_tokens)
    
    sentiment_score = (positive_count - negative_count) / total_words if total_words else 0
    return sentiment_score

In [15]:
documents = [
    "The company had a great quarter with significant growth.",
    "There were many challenges and losses in the last quarter."
]

# Calculate and print sentiment scores
for doc in documents:

    preprocessed_doc = data_cleaner(doc, return_tokens=True)

    score = calculate_sentiment(preprocessed_doc, positive_words, negative_words)
    print(f'Document: {doc}\nSentiment Score: {score}\n')

Document: The company had a great quarter with significant growth.
Sentiment Score: 0.2

Document: There were many challenges and losses in the last quarter.
Sentiment Score: -0.4
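
The two scores printed above can be checked by hand. The sketch assumes that `data_cleaner` reduces each sentence to the five stems listed below (stop words removed, Porter stems applied) and that `great`, `challeng`, and `loss` are the only dictionary hits. Both are assumptions inferred from the printed scores, not output copied from the notebook.

```python
def net_sentiment(tokens, positive, negative):
    """(positive hits - negative hits) / token count, or 0 for an empty list."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    return (pos - neg) / len(tokens) if tokens else 0

positive = {"great"}              # assumed hits for these two sentences
negative = {"challeng", "loss"}

doc1 = ["compani", "great", "quarter", "signific", "growth"]   # assumed cleaner output
doc2 = ["mani", "challeng", "loss", "last", "quarter"]

score1 = net_sentiment(doc1, positive, negative)   # (1 - 0) / 5
score2 = net_sentiment(doc2, positive, negative)   # (0 - 2) / 5
print(score1, score2)
```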



The preprocessing and the scoring code above give sensible results on these two examples.

We can therefore run this word-level sentiment scoring on the full corpus, and we can also try the other code workflows.

In [16]:
# use this to count the number of words

In [17]:
annual_report_df = pd.read_json("raw_data/sec_us_phrama_all_company_filing_meta_with_text_w_7_7A_2011_2023.jsonl", lines = True)

In [18]:
annual_report_df.columns

Index(['ticker', 'formType', 'accessionNo', 'cik', 'companyNameLong',
       'companyName', 'linkToFilingDetails', 'description', 'linkToTxt',
       'filedAt', 'documentFormatFiles', 'periodOfReport', 'entities', 'id',
       'seriesAndClassesContractsInformation', 'linkToHtml', 'linkToXbrl',
       'dataFiles', 'effectivenessDate', 'Text_7', 'Text_7A'],
      dtype='object')

In [19]:
section7_text_cleaned = annual_report_df["Text_7"].parallel_apply(data_cleaner, return_tokens=True)

In [20]:
annual_report_df["Text_7"]

# The raw text below lines up row by row with the cleaned tokens, so keep this check in place.

0        Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...
1        Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...
2        Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...
3        Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...
4                                                        
                              ...                        
8376     Item 7. Management&#8217;s Discussion and Ana...
8377     Item 7. Management&#8217;s Discussion and Ana...
8378     Item 7. Management&#8217;s Discussion and Ana...
8379     Item 7. Management&#8217;s Discussion and Ana...
8380     Item 7. Management&#8217;s Discussion and Ana...
Name: Text_7, Length: 8381, dtype: object

In [21]:
annual_report_df["section7_cleaned"] = section7_text_cleaned

In [22]:
annual_report_df["section7_cleaned"]

0       [item, manag, discuss, analysi, financi, condi...
1       [item, manag, discuss, analysi, financi, condi...
2       [item, manag, discuss, analysi, financi, condi...
3       [item, manag, discuss, analysi, financi, condi...
4                                                      []
                              ...                        
8376    [item, manag, discuss, analysi, financi, condi...
8377    [item, manag, discuss, analysi, financi, condi...
8378    [item, manag, discuss, analysi, financi, condi...
8379    [item, manag, discuss, analysi, financi, condi...
8380    [item, manag, discuss, analysi, financi, condi...
Name: section7_cleaned, Length: 8381, dtype: object

In [23]:
annual_report_df["section7_cleaned_sentiment"] = annual_report_df["section7_cleaned"].parallel_apply(calculate_sentiment, positive_words = positive_words, negative_words = negative_words)

In [24]:
# Count positive values
positive_count = (annual_report_df["section7_cleaned_sentiment"] > 0).sum()

# Count negative values
negative_count = (annual_report_df["section7_cleaned_sentiment"]  < 0).sum()

# Count zero values
zero_count = (annual_report_df["section7_cleaned_sentiment"]  == 0).sum()

In [25]:
positive_count, negative_count, zero_count

(np.int64(956), np.int64(5496), np.int64(1929))

In [26]:
annual_report_df["section7_cleaned_sentiment"].describe()

count    8381.000000
mean       -0.006900
std         0.009122
min        -0.056604
25%        -0.012665
50%        -0.005969
75%         0.000000
max         0.083333
Name: section7_cleaned_sentiment, dtype: float64
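
One caveat on the counts above: a filing with no extracted Item 7 text (row 4 has an empty token list) receives a score of exactly 0, so it is counted as neutral. A possible variant, sketched here as an assumption and not as the notebook's method, returns NaN for empty token lists so that missing sections can be told apart from neutral ones. The name `calculate_sentiment_or_nan` is hypothetical.

```python
import math

def calculate_sentiment_or_nan(doc_tokens, positive_words, negative_words):
    """Net sentiment share, or NaN when the document has no tokens."""
    if not doc_tokens:
        return math.nan  # missing text is not the same as neutral text
    positive_count = sum(1 for w in doc_tokens if w in positive_words)
    negative_count = sum(1 for w in doc_tokens if w in negative_words)
    return (positive_count - negative_count) / len(doc_tokens)

empty_score = calculate_sentiment_or_nan([], {"gain"}, {"loss"})
neutral_score = calculate_sentiment_or_nan(["item", "manag"], {"gain"}, {"loss"})
print(empty_score, neutral_score)
```

With NaN in place of 0, the comparisons `> 0`, `< 0`, and `== 0` are all false for missing sections, and `describe()` leaves them out of its statistics.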

### Deal with uncertainty as required

The uncertainty words come from the same dictionary word list that was used previously.

In [27]:
# define the uncertainty words by the word list, same procedure as before.

uncertainty_words =  sentiment_words_df[sentiment_words_df['Uncertainty'] > 0]['Word'].apply(data_cleaner, return_tokens=False)

uncertainty_words = set(uncertainty_words)

In [28]:
def count_uncertainty_words(words, uncertainty_words):
    count = sum(1 for word in words if word in uncertainty_words)
    return count

In [29]:
def calculate_uncertainty_score(text, uncertainty_words):
    words = data_cleaner(text, return_tokens=True)
    uncertainty_count = count_uncertainty_words(words, uncertainty_words)
    # Normalize by the total number of words to get the score
    score = uncertainty_count / len(words) if words else 0
    return score

In [30]:
def calculate_uncertainty_score_tokens(word_tokens, uncertainty_words):
    # words = data_cleaner(text, return_tokens=True)
    uncertainty_count = count_uncertainty_words(word_tokens, uncertainty_words)
    # Normalize by the total number of words to get the score
    score = uncertainty_count / len(word_tokens) if word_tokens else 0
    return score

In [31]:
text = "I am very uncertain about my future financial prospects"
uncertainty_score = calculate_uncertainty_score(text, uncertainty_words)
print(f'Uncertainty Score: {uncertainty_score}')

Uncertainty Score: 0.25


In [32]:
annual_report_df["section7_cleaned"]

0       [item, manag, discuss, analysi, financi, condi...
1       [item, manag, discuss, analysi, financi, condi...
2       [item, manag, discuss, analysi, financi, condi...
3       [item, manag, discuss, analysi, financi, condi...
4                                                      []
                              ...                        
8376    [item, manag, discuss, analysi, financi, condi...
8377    [item, manag, discuss, analysi, financi, condi...
8378    [item, manag, discuss, analysi, financi, condi...
8379    [item, manag, discuss, analysi, financi, condi...
8380    [item, manag, discuss, analysi, financi, condi...
Name: section7_cleaned, Length: 8381, dtype: object

In [33]:
# This works reasonably well: apply the basic uncertainty word score to the cleaned tokens.

annual_report_df["section7_cleaned_uncertainty_sentiment_score"] = annual_report_df["section7_cleaned"].parallel_apply(calculate_uncertainty_score_tokens, uncertainty_words = uncertainty_words)

In [34]:
annual_report_df["section7_cleaned_uncertainty_sentiment_score"].describe()

count    8381.000000
mean        0.021187
std         0.013074
min         0.000000
25%         0.016946
50%         0.024241
75%         0.029310
max         0.085890
Name: section7_cleaned_uncertainty_sentiment_score, dtype: float64

Repeat the procedure for Item 7A as well, and create the panel data needed for these sentiment scores by date.

Reusing the approach above directly is sufficient here.

In [35]:
annual_report_df

Unnamed: 0,ticker,formType,accessionNo,cik,companyNameLong,companyName,linkToFilingDetails,description,linkToTxt,filedAt,...,seriesAndClassesContractsInformation,linkToHtml,linkToXbrl,dataFiles,effectivenessDate,Text_7,Text_7A,section7_cleaned,section7_cleaned_sentiment,section7_cleaned_uncertainty_sentiment_score
0,AMGN,10-K,0000318154-24-000011,318154,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/318154...,2024-02-14T16:23:32-05:00,...,[],https://www.sec.gov/Archives/edgar/data/318154...,,"[{'sequence': '17', 'size': '118693', 'documen...",,Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...,Item 7A. QUANTITATIVE AND QUALITATIVE DISCLOS...,"[item, manag, discuss, analysi, financi, condi...",-0.012413,0.030885
1,AMGN,10-K,0000318154-23-000017,318154,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/318154...,2023-02-09T16:26:31-05:00,...,[],https://www.sec.gov/Archives/edgar/data/318154...,,"[{'sequence': '11', 'size': '114316', 'documen...",,Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...,Item 7A. QUANTITATIVE AND QUALITATIVE DISCLOS...,"[item, manag, discuss, analysi, financi, condi...",-0.012983,0.033115
2,AMGN,10-K,0000318154-22-000010,318154,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/318154...,2022-02-16T16:39:53-05:00,...,[],https://www.sec.gov/Archives/edgar/data/318154...,,"[{'sequence': '13', 'size': '109735', 'documen...",,Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...,Item 7A. QUANTITATIVE AND QUALITATIVE DISCLOS...,"[item, manag, discuss, analysi, financi, condi...",-0.009131,0.034089
3,AMGN,10-K,0000318154-21-000010,318154,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/318154...,2021-02-08T19:24:56-05:00,...,[],https://www.sec.gov/Archives/edgar/data/318154...,,"[{'sequence': '11', 'size': '111994', 'documen...",2020-12-31,Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...,Item 7A. QUANTITATIVE AND QUALITATIVE DISCLOS...,"[item, manag, discuss, analysi, financi, condi...",-0.007950,0.029462
4,AMGN,10-K/A,0000318154-20-000019,318154,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K/A - Annual report [Section 13 and 15...,https://www.sec.gov/Archives/edgar/data/318154...,2020-02-13T13:51:00-05:00,...,[],https://www.sec.gov/Archives/edgar/data/318154...,,[],,,,[],0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8376,INDP,10-K,0001493152-24-009769,1857044,"Indaptus Therapeutics, Inc. (Filer)","Indaptus Therapeutics, Inc.",https://www.sec.gov/Archives/edgar/data/185704...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/185704...,2024-03-13T08:02:27-04:00,...,[],https://www.sec.gov/Archives/edgar/data/185704...,,"[{'sequence': '16', 'size': '31864', 'document...",,Item 7. Management&#8217;s Discussion and Ana...,Item 7A. Quantitative and Qualitative Disclos...,"[item, manag, discuss, analysi, financi, condi...",-0.011407,0.051806
8377,INDP,10-K,0001493152-23-008010,1857044,"Indaptus Therapeutics, Inc. (Filer)","Indaptus Therapeutics, Inc.",https://www.sec.gov/Archives/edgar/data/185704...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/185704...,2023-03-17T08:05:56-04:00,...,[],https://www.sec.gov/Archives/edgar/data/185704...,,"[{'sequence': '18', 'size': '38754', 'document...",,Item 7. Management&#8217;s Discussion and Ana...,Item 7A. Quantitative and Qualitative Disclos...,"[item, manag, discuss, analysi, financi, condi...",-0.007538,0.041457
8378,INDP,10-K,0001493152-22-007319,1857044,"Indaptus Therapeutics, Inc. (Filer)","Indaptus Therapeutics, Inc.",https://www.sec.gov/Archives/edgar/data/185704...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/185704...,2022-03-21T07:05:43-04:00,...,[],https://www.sec.gov/Archives/edgar/data/185704...,,"[{'sequence': '16', 'size': '46094', 'document...",,Item 7. Management&#8217;s Discussion and Ana...,Item 7A. Quantitative and Qualitative Disclos...,"[item, manag, discuss, analysi, financi, condi...",-0.012223,0.037121
8379,QNRX,10-K,0001410578-24-000198,1671502,"Quoin Pharmaceuticals, Ltd. (Filer)","Quoin Pharmaceuticals, Ltd.",https://www.sec.gov/Archives/edgar/data/167150...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/167150...,2024-03-14T16:42:23-04:00,...,[],https://www.sec.gov/Archives/edgar/data/167150...,,"[{'sequence': '9', 'size': '62248', 'documentU...",,Item 7. Management&#8217;s Discussion and Ana...,Item 7A. Quantitative and Qualitative Disclos...,"[item, manag, discuss, analysi, financi, condi...",-0.016817,0.026322


In [36]:
annual_report_df.columns

Index(['ticker', 'formType', 'accessionNo', 'cik', 'companyNameLong',
       'companyName', 'linkToFilingDetails', 'description', 'linkToTxt',
       'filedAt', 'documentFormatFiles', 'periodOfReport', 'entities', 'id',
       'seriesAndClassesContractsInformation', 'linkToHtml', 'linkToXbrl',
       'dataFiles', 'effectivenessDate', 'Text_7', 'Text_7A',
       'section7_cleaned', 'section7_cleaned_sentiment',
       'section7_cleaned_uncertainty_sentiment_score'],
      dtype='object')

In [37]:
annual_report_df.periodOfReport

0       2023-12-31
1       2022-12-31
2       2021-12-31
3       2020-12-31
4       2019-12-31
           ...    
8376    2023-12-31
8377    2022-12-31
8378    2021-12-31
8379    2023-12-31
8380    2022-12-31
Name: periodOfReport, Length: 8381, dtype: object

In [40]:
annual_report_df['periodOfReport'] = pd.to_datetime(annual_report_df['periodOfReport'])

# Filter out rows where the report year is 2023
filtered_df = annual_report_df[annual_report_df['periodOfReport'].dt.year != 2023]

# Display the filtered data
# print(filtered_df)
filtered_df.shape

(7610, 24)

In [41]:
annual_report_df['periodOfReport'].min()

Timestamp('2011-01-02 00:00:00')

Excluding reports with a 2023 period of report leaves 7,610 annual reports covering 2011 to 2022. Note that the cells below continue to work with `annual_report_df`, not `filtered_df`.

In [91]:
# run the same method above, given the same functions

annual_report_df["Text_7A_cleaned"] = annual_report_df["Text_7A"].parallel_apply(data_cleaner, return_tokens=True)

# sentiment and uncertainty score here

annual_report_df["section7A_cleaned_sentiment"] = annual_report_df["Text_7A_cleaned"].parallel_apply(calculate_sentiment, positive_words = positive_words, negative_words = negative_words)

annual_report_df["section7A_cleaned_uncertainty_sentiment_score"] = annual_report_df["Text_7A_cleaned"].parallel_apply(
    calculate_uncertainty_score_tokens, uncertainty_words = uncertainty_words)



In [92]:
annual_report_df

Unnamed: 0,ticker,formType,accessionNo,cik,companyNameLong,companyName,linkToFilingDetails,description,linkToTxt,filedAt,...,dataFiles,effectivenessDate,Text_7,Text_7A,section7_cleaned,section7_cleaned_sentiment,section7_cleaned_uncertainty_sentiment_score,Text_7A_cleaned,section7A_cleaned_sentiment,section7A_cleaned_uncertainty_sentiment_score
0,AMGN,10-K,0000318154-24-000011,318154,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/318154...,2024-02-14T16:23:32-05:00,...,"[{'sequence': '17', 'size': '118693', 'documen...",,Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...,Item 7A. QUANTITATIVE AND QUALITATIVE DISCLOS...,"[item, manag, discuss, analysi, financi, condi...",-0.012413,0.030885,"[item, quantit, qualit, disclosur, market, ris...",-0.040434,0.030572
1,AMGN,10-K,0000318154-23-000017,318154,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/318154...,2023-02-09T16:26:31-05:00,...,"[{'sequence': '11', 'size': '114316', 'documen...",,Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...,Item 7A. QUANTITATIVE AND QUALITATIVE DISCLOS...,"[item, manag, discuss, analysi, financi, condi...",-0.012983,0.033115,"[item, quantit, qualit, disclosur, market, ris...",-0.042553,0.032377
2,AMGN,10-K,0000318154-22-000010,318154,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/318154...,2022-02-16T16:39:53-05:00,...,"[{'sequence': '13', 'size': '109735', 'documen...",,Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...,Item 7A. QUANTITATIVE AND QUALITATIVE DISCLOS...,"[item, manag, discuss, analysi, financi, condi...",-0.009131,0.034089,"[item, quantit, qualit, disclosur, market, ris...",-0.040594,0.029703
3,AMGN,10-K,0000318154-21-000010,318154,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/318154...,2021-02-08T19:24:56-05:00,...,"[{'sequence': '11', 'size': '111994', 'documen...",2020-12-31,Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...,Item 7A. QUANTITATIVE AND QUALITATIVE DISCLOS...,"[item, manag, discuss, analysi, financi, condi...",-0.007950,0.029462,"[item, quantit, qualit, disclosur, market, ris...",-0.038961,0.026973
4,AMGN,10-K/A,0000318154-20-000019,318154,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K/A - Annual report [Section 13 and 15...,https://www.sec.gov/Archives/edgar/data/318154...,2020-02-13T13:51:00-05:00,...,[],,,,[],0.000000,0.000000,[],0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8376,INDP,10-K,0001493152-24-009769,1857044,"Indaptus Therapeutics, Inc. (Filer)","Indaptus Therapeutics, Inc.",https://www.sec.gov/Archives/edgar/data/185704...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/185704...,2024-03-13T08:02:27-04:00,...,"[{'sequence': '16', 'size': '31864', 'document...",,Item 7. Management&#8217;s Discussion and Ana...,Item 7A. Quantitative and Qualitative Disclos...,"[item, manag, discuss, analysi, financi, condi...",-0.011407,0.051806,"[item, quantit, qualit, disclosur, market, ris...",0.043478,0.043478
8377,INDP,10-K,0001493152-23-008010,1857044,"Indaptus Therapeutics, Inc. (Filer)","Indaptus Therapeutics, Inc.",https://www.sec.gov/Archives/edgar/data/185704...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/185704...,2023-03-17T08:05:56-04:00,...,"[{'sequence': '18', 'size': '38754', 'document...",,Item 7. Management&#8217;s Discussion and Ana...,Item 7A. Quantitative and Qualitative Disclos...,"[item, manag, discuss, analysi, financi, condi...",-0.007538,0.041457,"[item, quantit, qualit, disclosur, market, ris...",0.037037,0.037037
8378,INDP,10-K,0001493152-22-007319,1857044,"Indaptus Therapeutics, Inc. (Filer)","Indaptus Therapeutics, Inc.",https://www.sec.gov/Archives/edgar/data/185704...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/185704...,2022-03-21T07:05:43-04:00,...,"[{'sequence': '16', 'size': '46094', 'document...",,Item 7. Management&#8217;s Discussion and Ana...,Item 7A. Quantitative and Qualitative Disclos...,"[item, manag, discuss, analysi, financi, condi...",-0.012223,0.037121,"[item, quantit, qualit, disclosur, market, ris...",0.000000,0.142857
8379,QNRX,10-K,0001410578-24-000198,1671502,"Quoin Pharmaceuticals, Ltd. (Filer)","Quoin Pharmaceuticals, Ltd.",https://www.sec.gov/Archives/edgar/data/167150...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/167150...,2024-03-14T16:42:23-04:00,...,"[{'sequence': '9', 'size': '62248', 'documentU...",,Item 7. Management&#8217;s Discussion and Ana...,Item 7A. Quantitative and Qualitative Disclos...,"[item, manag, discuss, analysi, financi, condi...",-0.016817,0.026322,"[item, quantit, qualit, disclosur, market, ris...",0.052632,0.052632


Use `sec_cik_gvkey_metadata_us_pharma_company_network_2011_2023.csv` to link everything with the right entities.

In [None]:
gvkey_annual_report_matching_df = pd.read_csv("raw_data/sec_cik_gvkey_metadata_us_pharma_company_network_2011_2023.csv",
                                              index_col= 0)

In [96]:
gvkey_annual_report_matching_df.head()

Unnamed: 0,cik,gvkey,source,link_desc,sec_company_name,link_company_name,sec_start_date,sec_end_date,link_start_date,link_end_date,...,n10q_a,ndef,n8k,nlet,n13d,n13g,n13f,ntot,ntot_nt,ntot_a
0,318154,1602,CRSP/Compustat Merged,Valid CIK-GVKEY Link,AMGEN INC,AMGEN INC,1994-03-28,2024-06-21,2007-04-14,2024-03-08,...,4.0,92.0,308.0,22.0,31.0,89.0,0.0,2942.0,0.0,182.0
1,318154,1602,Capital IQ,Valid CIK-GVKEY Link,AMGEN INC,AMGEN INC.,1994-03-28,2024-06-21,,,...,4.0,92.0,308.0,22.0,31.0,89.0,0.0,2942.0,0.0,182.0
2,318154,1602,Compustat Company,Valid CIK-GVKEY Link,AMGEN INC,AMGEN INC,1994-03-28,2024-06-21,1983-03-31,2023-12-31,...,4.0,92.0,308.0,22.0,31.0,89.0,0.0,2942.0,0.0,182.0
3,318154,1602,Compustat Security,Valid CIK-GVKEY Link,AMGEN INC,AMGEN INC,1994-03-28,2024-06-21,1999-02-16,2020-02-14,...,4.0,92.0,308.0,22.0,31.0,89.0,0.0,2942.0,0.0,182.0
4,722104,2222,CRSP/Compustat Merged,Valid CIK-GVKEY Link,SAVIENT PHARMACEUTICALS INC,SAVIENT PHARMACEUTICALS INC,1995-04-13,2019-03-20,2007-04-14,2024-03-08,...,10.0,24.0,176.0,13.0,6.0,102.0,0.0,980.0,4.0,158.0


In [98]:
gvkey_annual_report_matching_df.columns

Index(['cik', 'gvkey', 'source', 'link_desc', 'sec_company_name',
       'link_company_name', 'sec_start_date', 'sec_end_date',
       'link_start_date', 'link_end_date', 'n10k', 'n10k_nt', 'n10k_a', 'n10q',
       'n10q_nt', 'n10q_a', 'ndef', 'n8k', 'nlet', 'n13d', 'n13g', 'n13f',
       'ntot', 'ntot_nt', 'ntot_a'],
      dtype='object')

In [101]:
gvkey_matching_df = gvkey_annual_report_matching_df[["cik", "gvkey"]].drop_duplicates().reset_index(drop=True)
gvkey_matching_df

Unnamed: 0,cik,gvkey
0,318154,1602
1,722104,2222
2,1347178,2222
3,14272,2403
4,749647,2990
...,...,...
1199,1642116,347007
1200,1664352,349972
1201,1857044,349972
1202,1671502,351038


In [105]:
# repeat this on other sections as needed

annual_report_gvkey_df = pd.merge(gvkey_matching_df, annual_report_df , on = "cik")
annual_report_gvkey_df.head()

Unnamed: 0,cik,gvkey,ticker,formType,accessionNo,companyNameLong,companyName,linkToFilingDetails,description,linkToTxt,...,dataFiles,effectivenessDate,Text_7,Text_7A,section7_cleaned,section7_cleaned_sentiment,section7_cleaned_uncertainty_sentiment_score,Text_7A_cleaned,section7A_cleaned_sentiment,section7A_cleaned_uncertainty_sentiment_score
0,318154,1602,AMGN,10-K,0000318154-24-000011,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/318154...,...,"[{'sequence': '17', 'size': '118693', 'documen...",,Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...,Item 7A. QUANTITATIVE AND QUALITATIVE DISCLOS...,"[item, manag, discuss, analysi, financi, condi...",-0.012413,0.030885,"[item, quantit, qualit, disclosur, market, ris...",-0.040434,0.030572
1,318154,1602,AMGN,10-K,0000318154-23-000017,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/318154...,...,"[{'sequence': '11', 'size': '114316', 'documen...",,Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...,Item 7A. QUANTITATIVE AND QUALITATIVE DISCLOS...,"[item, manag, discuss, analysi, financi, condi...",-0.012983,0.033115,"[item, quantit, qualit, disclosur, market, ris...",-0.042553,0.032377
2,318154,1602,AMGN,10-K,0000318154-22-000010,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/318154...,...,"[{'sequence': '13', 'size': '109735', 'documen...",,Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...,Item 7A. QUANTITATIVE AND QUALITATIVE DISCLOS...,"[item, manag, discuss, analysi, financi, condi...",-0.009131,0.034089,"[item, quantit, qualit, disclosur, market, ris...",-0.040594,0.029703
3,318154,1602,AMGN,10-K,0000318154-21-000010,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K - Annual report [Section 13 and 15(d...,https://www.sec.gov/Archives/edgar/data/318154...,...,"[{'sequence': '11', 'size': '111994', 'documen...",2020-12-31,Item 7. MANAGEMENT&#8217;S DISCUSSION AND ANA...,Item 7A. QUANTITATIVE AND QUALITATIVE DISCLOS...,"[item, manag, discuss, analysi, financi, condi...",-0.00795,0.029462,"[item, quantit, qualit, disclosur, market, ris...",-0.038961,0.026973
4,318154,1602,AMGN,10-K/A,0000318154-20-000019,AMGEN INC (Filer),AMGEN INC,https://www.sec.gov/Archives/edgar/data/318154...,Form 10-K/A - Annual report [Section 13 and 15...,https://www.sec.gov/Archives/edgar/data/318154...,...,[],,,,[],0.0,0.0,[],0.0,0.0


In [107]:
annual_report_gvkey_df.to_csv("features/annual_report_7_7A_sentiment_uncertainty_scores_with_gvkey.csv")

# need to join this with the cik tables to get the relevant metrics

### Problem

The scores are very small in magnitude (the mean Item 7 sentiment is about -0.007). Try to find an alternative score with a more appropriate normalisation, put everything into a basic regression, and see what can be done to address the problem.
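
One candidate normalisation, offered as an assumption and not as a recommendation from the source, is a polarity score that divides by the number of sentiment words instead of the total token count: (positive - negative) / (positive + negative). It is bounded in [-1, 1], so its magnitude does not shrink as documents get longer. The function name `polarity_score` and the toy inputs are hypothetical.

```python
def polarity_score(doc_tokens, positive_words, negative_words):
    """(pos - neg) / (pos + neg); 0 when the document has no sentiment words."""
    pos = sum(1 for w in doc_tokens if w in positive_words)
    neg = sum(1 for w in doc_tokens if w in negative_words)
    hits = pos + neg
    return (pos - neg) / hits if hits else 0

toy_positive = {"gain", "improv"}
toy_negative = {"loss", "declin"}
# 100 tokens, of which one is positive and three are negative.
tokens = ["item"] * 96 + ["gain", "loss", "loss", "declin"]

share = (1 - 3) / len(tokens)                                  # length-normalised score
polarity = polarity_score(tokens, toy_positive, toy_negative)  # hit-normalised score
print(share, polarity)
```

The trade-off is that polarity ignores how much of the document is sentiment-bearing at all, so it may be worth keeping both measures as regressors.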

Remember to deal with duplicates in the merged data above.
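
One way to handle duplicates, sketched in plain Python, assumes that a duplicate means several filings for the same `gvkey` and `periodOfReport` (for example a 10-K and its 10-K/A amendment) and keeps the most recently filed row per key. The records below are made up, and the keep-latest rule is itself an assumption; keeping the original 10-K may be more appropriate.

```python
def keep_latest_filing(rows):
    """Keep one row per (gvkey, periodOfReport): the one with the latest filedAt."""
    latest = {}
    for row in rows:
        key = (row["gvkey"], row["periodOfReport"])
        # ISO 8601 strings with the same format and UTC offset compare chronologically.
        if key not in latest or row["filedAt"] > latest[key]["filedAt"]:
            latest[key] = row
    return list(latest.values())

rows = [  # made-up records for illustration
    {"gvkey": 1, "periodOfReport": "2019-12-31", "formType": "10-K",
     "filedAt": "2020-02-12T16:00:00-05:00"},
    {"gvkey": 1, "periodOfReport": "2019-12-31", "formType": "10-K/A",
     "filedAt": "2020-02-13T13:51:00-05:00"},
    {"gvkey": 2, "periodOfReport": "2019-12-31", "formType": "10-K",
     "filedAt": "2020-03-01T09:00:00-05:00"},
]
deduped = keep_latest_filing(rows)
print(len(deduped), [r["formType"] for r in deduped])
```

With the real data, `filedAt` should be parsed to timezone-aware datetimes before comparing, since the UTC offsets vary between filings. In pandas the same idea can be expressed with `sort_values` followed by `drop_duplicates(subset=..., keep="last")`.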

Try different score measures for this procedure.

Try other drafts of the paper for later submission.

#### To-do list

1. Get the topic modelling code.
1. Get the graph community detection code with Bayesian methods.
1. Get the panel data to the firm level.