In [1]:
%%capture
! pip install datasets transformers seqeval

In [1]:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from tqdm.notebook import tqdm

In [2]:
df=pd.read_pickle('/kaggle/input/motley-fool-scraped-earnings-call-transcripts/motley-fool-data.pkl')
df = df.drop_duplicates(subset = "transcript")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17592 entries, 0 to 18754
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        17592 non-null  object
 1   exchange    17592 non-null  object
 2   q           17592 non-null  object
 3   ticker      17592 non-null  object
 4   transcript  17592 non-null  object
dtypes: object(5)
memory usage: 824.6+ KB


In [4]:
df.shape

(17592, 5)

In [6]:
df.head()

Unnamed: 0,date,exchange,q,ticker,transcript
0,"Aug 27, 2020, 9:00 p.m. ET",NASDAQ: BILI,2020-Q2,BILI,"Prepared Remarks:\nOperator\nGood day, and wel..."
1,"Jul 30, 2020, 4:30 p.m. ET",NYSE: GFF,2020-Q3,GFF,Prepared Remarks:\nOperator\nThank you for sta...
2,"Oct 23, 2019, 5:00 p.m. ET",NASDAQ: LRCX,2020-Q1,LRCX,Prepared Remarks:\nOperator\nGood day and welc...
3,"Nov 6, 2019, 12:00 p.m. ET",NASDAQ: BBSI,2019-Q3,BBSI,"Prepared Remarks:\nOperator\nGood day, everyon..."
4,"Aug 7, 2019, 8:30 a.m. ET",NASDAQ: CSTE,2019-Q2,CSTE,Prepared Remarks:\nOperator\nGreetings and wel...


In [7]:
import re
def replace_abbreviations(text):
    replacements = {
        'Q1': 'first quarter',
        'Q2': 'second quarter',
        'Q3': 'third quarter',
        'Q4': 'fourth quarter',
        'q1': 'first quarter',
        'q2': 'second quarter',
        'q3': 'third quarter',
        'q4': 'fourth quarter',
        'FY': 'fiscal year',
        'YoY': 'year over year',
        'MoM': 'month over month',
        'EBITDA': 'earnings before interest, taxes, depreciation, and amortization',
        'ROI': 'return on investment',
        'EPS': 'earnings per share',
        'P/E': 'price-to-earnings',
        'DCF': 'discounted cash flow',
        'CAGR': 'compound annual growth rate',
        'GDP': 'gross domestic product',
        'CFO': 'chief financial officer',
        'GAAP': 'generally accepted accounting principles',
        'SEC': 'U.S. Securities and Exchange Commission',
        'IPO': 'initial public offering',
        'M&A': 'mergers and acquisitions',
        'EBIT': 'earnings before interest and taxes',
        'IRR': 'internal rate of return',
        'ROA': 'return on assets',
        'ROE': 'return on equity',
        'NAV': 'net asset value',
        'PE ratio': 'price-to-earnings ratio',
        'EPS growth': 'earnings per share growth',
        'Fiscal Year': 'financial year',
        'CAPEX': 'capital expenditure',
        'APR': 'annual percentage rate',
        'P&L': 'profit and loss',
        'NPM': 'net profit margin',
        'EBT': 'earnings before taxes',
        'EBITDAR': 'earnings before interest, taxes, depreciation, amortization, and rent',
        'PAT': 'profit after tax',
        'COGS': 'cost of goods sold',
        'EBTIDA': 'earnings before taxes, interest, depreciation, and amortization',
        'E&Y': 'Ernst & Young',
        'B2B': 'business to business',
        'B2C': 'business to consumer',
        'LIFO': 'last in, first out',
        'FIFO': 'first in, first out',
        'FCF': 'free cash flow',
        'LTM': 'last twelve months',
        'OPEX': 'operating expenses',
        'TSR': 'total shareholder return',
        'PP&E': 'property, plant, and equipment',
        'PBT': 'profit before tax',
        'EBITDAR margin': 'earnings before interest, taxes, depreciation, amortization, and rent margin',
        'ROIC': 'return on invested capital',
        'EPS': 'earnings per share',
        'P/E': 'price-to-earnings',
        'EBITDA': 'earnings before interest, taxes, depreciation, and amortization',
        'YOY': 'year-over-year',
        'MOM': 'month-over-month',
        'CAGR': 'compound annual growth rate',
        'GDP': 'gross domestic product',
        'ROI': 'return on investment',
        'ROE': 'return on equity',
        'EBIT': 'earnings before interest and taxes',
        'DCF': 'discounted cash flow',
        'GAAP': 'Generally Accepted Accounting Principles',
        'LTM': 'last twelve months',
        'EBIT margin': 'earnings before interest and taxes margin',
        'EBT': 'earnings before taxes',
        'EBTA': 'earnings before taxes and amortization',
        'FTE': 'full-time equivalent',
        'EBIDTA': 'earnings before interest, depreciation, taxes, and amortization',
        'EBTIDA': 'earnings before taxes, interest, depreciation, and amortization',
        'EBITDAR': 'earnings before interest, taxes, depreciation, amortization, and rent',
        'COGS': 'cost of goods sold',
        'APR': 'annual percentage rate',
        'PESTEL': 'Political, Economic, Social, Technological, Environmental, and Legal',
        'KPI': 'key performance indicator',
        'SWOT': 'Strengths, Weaknesses, Opportunities, Threats',
        'CAPEX': 'capital expenditures',
        'EBITDARM': 'earnings before interest, taxes, depreciation, amortization, rent, and management fees',
        'EBITDAX': 'earnings before interest, taxes, depreciation, amortization, and exploration expenses',
        'EBITDAS': 'earnings before interest, taxes, depreciation, amortization, and restructuring costs',
        'EBITDAX-C': 'earnings before interest, taxes, depreciation, amortization, exploration expenses, and commodity derivatives',
        'EBITDAX-R': 'earnings before interest, taxes, depreciation, amortization, exploration expenses, and asset retirement obligations',
        'EBITDAX-E': 'earnings before interest, taxes, depreciation, amortization, exploration expenses, and environmental liabilities'
        # Add more abbreviations and replacements as needed
    }
    for abbreviation, full_form in replacements.items():
        text = text.replace(abbreviation, full_form)

    return text

In [9]:
df["transcript"]=df["transcript"].apply(lambda x:replace_abbreviations(x))

In [139]:
newData = df.sample(frac=1).reset_index(drop=True)[:4000]

In [14]:
import nltk
nltk.download('punkt')  # Download the necessary tokenizer data

from nltk.tokenize import sent_tokenize

def split_text_into_sentences(text):
    sentences = sent_tokenize(text)
    return sentences



[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [35]:
import re
def extractMoneyAmount(text):
    # Define a regular expression pattern to match the revenue amount
    pattern = r"\d+(?:\.\d+)? (?:million|billion)"
    # Search for the pattern in the text
    matches = re.findall(pattern, text, re.IGNORECASE)
    if len(matches)>=1:
        return matches[0]
    else:
        return None

In [134]:
import re

def extract_revenue_sentences(text, keywords):
    sentences = split_text_into_sentences(text)  # Split text into sentences
    revenue_sentences = []

    for sentence in sentences:
        for keyword in keywords:
            if re.search(r'\b' + re.escape(keyword) + r'\b', sentence, re.IGNORECASE):
                revenue_sentences.append(sentence)
                break  # Once a keyword is found in a sentence, move to the next sentence

    return revenue_sentences
def getLabels(x):
    # Example earnings call transcript
    earnings_transcript = x

    revenue_keywords = [
        "revenue", "sales", "income", "earnings", "gross profit", "net profit","revenues"
        "top-line growth", "bottom-line performance", "revenue generation", "revenue increase",
        "profitable quarter", "increased turnover", "revenue stream", "revenue recognition",
        "monetization", "customer spend", "market share", "sales figures","quarter revenue","quarter revenues"
    ]
    # List of loss-related keywords
    loss_keywords = [
        "loss", "net loss", "unexpected market fluctuations", "losses incurred", "negative impact",
        "unfavorable outcomes", "downturn", "deficit", "reduced earnings", "loss provision",
        "financial setback", "income decline", "bottom-line reduction"
    ]
    expense_keywords = [
        "expenses", "costs", "expenditures", "operating costs", "production costs",
        "administrative expenses", "overhead costs", "capital expenditures", "cost management",
        "cost control", "spending", "outlays", "budget allocation", "cost efficiency"
    ]
    profit_keywords = [
        "profit", "earnings", "net income", "bottom-line", "gross profit", "operating profit",
        "profitability", "positive financial performance", "strong financial results",
        "healthy profit margins", "profit growth", "profits surged", "improved profitability",
        "increased earnings", "profitable quarter", "robust profits", "strong profits",
        "profitable segments", "profit margin", "profits beat expectations", "profit contribution",
        "profitable ventures", "profits soared", "profit generation", "favorable profits",
        "profitable initiatives", "positive earnings report", "profitable outcome",
        "profitable business", "earnings growth", "profitable operations", "solid profits",
        "profitable business model", "strong profit margins", "increased profitability",
        "revenue and profit growth", "profit-driven strategies", "profit-maximizing approach",
        "consistent profit growth", "profitable product lines", "profitable market segments"
    ]


    # Extract revenue-related sentences from the transcript
    revenue_sentences = extract_revenue_sentences(earnings_transcript, revenue_keywords)
    loss_sentences = extract_revenue_sentences(earnings_transcript, loss_keywords)
    expense_sentences = extract_revenue_sentences(earnings_transcript, expense_keywords)
    profit_sentences = extract_revenue_sentences(earnings_transcript, profit_keywords)
    revenue_money=[]
    loss_money=[]
    expense_money=[]
    profit_money=[]

    for i, sentence in enumerate(revenue_sentences, start=1):
        if extractMoneyAmount(sentence)!=None:
    #         print(sentence)
            revenue_money.append(extractMoneyAmount(sentence))
    #         print("--------------------------")

    for i, sentence in enumerate(loss_sentences, start=1):
        if extractMoneyAmount(sentence)!=None:
    #         print(sentence)
            loss_money.append(extractMoneyAmount(sentence))
    #         print("--------------------------")

    for i, sentence in enumerate(expense_sentences, start=1):
        if extractMoneyAmount(sentence)!=None:
    #         print(sentence)
            expense_money.append(extractMoneyAmount(sentence))
    #         print("--------------------------")

    for i, sentence in enumerate(profit_sentences, start=1):
        if extractMoneyAmount(sentence)!=None:
    #         print(sentence)
            profit_money.append(extractMoneyAmount(sentence))
    #         print("--------------------------")
    text=earnings_transcript.replace("$","")
    labeled_text = ['O'] * len(text.split())
    for rev in revenue_money:
        pattern=rev
        matches = re.finditer(pattern, text, re.IGNORECASE)
        for match in matches:
            label="revenue"
            start, end = match.span()
            start_idx = len(text[:start].split())
            end_idx = len(text[:end].split())
    #         print(text.split()[start_idx])
            labeled_text[start_idx] = f'B-{label}'
            for i in range(start_idx + 1, end_idx):
                labeled_text[i] = f'I-{label}'

    for loss in loss_money:
        pattern=loss
        matches = re.finditer(pattern, text, re.IGNORECASE)
        for match in matches:
            label="loss"
            start, end = match.span()
            start_idx = len(text[:start].split())
            end_idx = len(text[:end].split())
    #         print(text.split()[start_idx])
            labeled_text[start_idx] = f'B-{label}'
            for i in range(start_idx + 1, end_idx):
                labeled_text[i] = f'I-{label}'       
    for prof in profit_money:
        pattern=prof
        matches = re.finditer(pattern, text, re.IGNORECASE)
        for match in matches:
            label="profit"
            start, end = match.span()
            start_idx = len(text[:start].split())
            end_idx = len(text[:end].split())
    #         print(text.split()[start_idx])
            labeled_text[start_idx] = f'B-{label}'
            for i in range(start_idx + 1, end_idx):
                labeled_text[i] = f'I-{label}'
    for exp in expense_money:
        pattern=exp
        matches = re.finditer(pattern, text, re.IGNORECASE)
        for match in matches:
            label="expense"
            start, end = match.span()
            start_idx = len(text[:start].split())
            end_idx = len(text[:end].split())
    #         print(text.split()[start_idx])
            labeled_text[start_idx] = f'B-{label}'
            for i in range(start_idx + 1, end_idx):
                labeled_text[i] = f'I-{label}'
    return text.split(),labeled_text

In [145]:
from datasets import Dataset
import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split


In [140]:
tokens=[]
nertags=[]
for idx in tqdm(range(len(newData["transcript"]))):
    token,tag=getLabels(newData["transcript"].iloc[idx])
    tokens.append(token)
    nertags.append(tag)

  0%|          | 0/4000 [00:00<?, ?it/s]

In [156]:
labels2id={"O":0,"B-revenue":1,"B-loss":2,"B-expense":3,"B-profit":4,"I-revenue":5,"I-loss":6,"I-expense":7,"I-profit":8}
id2label={value:key for key,value in labels2id.items()}

In [161]:
nertags

[]

In [159]:
nertags=[]
for x in tqdm(nertags):
    l=[]
    for elm in x:
        l.append(labels2id[elm])
    nertags.append(l)
    

0it [00:00, ?it/s]

In [141]:
import math
finalToken=[]
finalTags=[]
splitSize=256
for token,tag in zip(tokens,nertags):
    finalToken=finalToken+[token[r*splitSize:(r+1)*splitSize] for r in range(math.ceil(len(token)/splitSize))]
    finalTags=finalTags+[tag[r*splitSize:(r+1)*splitSize] for r in range(math.ceil(len(tag)/splitSize))]

In [142]:
len(finalToken[0])

256

In [143]:
usedData = pd.DataFrame({"tokens": finalToken,"tags":finalTags})
# usedData = pd.DataFrame({"text": newData["transcript"]})


In [163]:
def apply(x):
    return [labels2id[z] for z in x ]

In [175]:
usedData["tags"]=usedData["tags"].map(apply)

In [176]:
train,test = train_test_split(usedData,test_size=0.3)
validation,test = train_test_split(test,test_size=0.2)

In [177]:
testCopy=test

In [178]:
tds = Dataset.from_pandas(train)
test = Dataset.from_pandas(test)
val= Dataset.from_pandas(validation)
ds = DatasetDict()

ds['train'] = tds
ds['validation'] = val
ds["test"]=test

In [179]:
ds
ds = ds.remove_columns("__index_level_0__")
ds

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags', 'tagss'],
        num_rows: 89975
    })
    validation: Dataset({
        features: ['tokens', 'tags', 'tagss'],
        num_rows: 30849
    })
    test: Dataset({
        features: ['tokens', 'tags', 'tagss'],
        num_rows: 7713
    })
})

In [180]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")

In [181]:
label_all_tokens = True
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"],is_split_into_words=True,padding='max_length',max_length=512)

    labels = []
    for i, label in enumerate(examples["tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [183]:
tokenized_datasets = ds.map(tokenize_and_align_labels, batched=True)

  0%|          | 0/90 [00:00<?, ?ba/s]

  0%|          | 0/31 [00:00<?, ?ba/s]

  0%|          | 0/8 [00:00<?, ?ba/s]

In [184]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags', 'tagss', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 89975
    })
    validation: Dataset({
        features: ['tokens', 'tags', 'tagss', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 30849
    })
    test: Dataset({
        features: ['tokens', 'tags', 'tagss', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 7713
    })
})

In [186]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER", num_labels=9,label2id=labels2id,id2label=id2label,ignore_mismatched_sizes=True)

Downloading model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

In [194]:
model_name = "finance-ner-v0.0.9"
args = TrainingArguments(
    f"{model_name}-finetuned-ner",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    push_to_hub=True,fp16=True,
    logging_steps=1,gradient_accumulation_steps = 2
)

In [188]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

In [189]:
!pip install seqeval

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [190]:
from datasets import load_dataset, load_metric
metric = load_metric("seqeval")

Downloading builder script:   0%|          | 0.00/2.47k [00:00<?, ?B/s]

In [191]:
import numpy as np
# label_list=["O","B-money","B-revenues","B-expenses"]
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    print(p)
    # Remove ignored index (special tokens)
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id2label[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [192]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [195]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Cloning https://huggingface.co/AhmedTaha012/finance-ner-v0.0.9-finetuned-ner into local empty directory.


In [196]:
trainer.train()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
0,0.0018,0.004424,0.925146,0.699712,0.796791,0.998938


<transformers.trainer_utils.EvalPrediction object at 0x79acaa3cd960>


TrainOutput(global_step=5623, training_loss=0.0059232120515330606, metrics={'train_runtime': 6596.0293, 'train_samples_per_second': 13.641, 'train_steps_per_second': 0.852, 'total_flos': 2.3509834372694016e+16, 'train_loss': 0.0059232120515330606, 'epoch': 1.0})

In [200]:
trainer.evaluate(tokenized_datasets["test"])

<transformers.trainer_utils.EvalPrediction object at 0x79acaa3cd6c0>


{'eval_loss': 0.004377402830868959,
 'eval_precision': 0.9315188762071993,
 'eval_recall': 0.7026490066225165,
 'eval_f1': 0.8010570026425066,
 'eval_accuracy': 0.9989520972877464,
 'eval_runtime': 243.6623,
 'eval_samples_per_second': 31.654,
 'eval_steps_per_second': 3.96,
 'epoch': 1.0}

In [201]:
trainer.push_to_hub()

To https://huggingface.co/AhmedTaha012/finance-ner-v0.0.9-finetuned-ner
   681a9e8..c24cc53  main -> main



'https://huggingface.co/AhmedTaha012/finance-ner-v0.0.9-finetuned-ner/commit/c24cc5305c3e45fed9cdb2d694bbaa89311f718f'

In [202]:
testCopy.to_pickle("file1.pkl")

In [169]:
tokenized_datasets["test"]

Dataset({
    features: ['tokens', 'tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 1938
})

In [176]:
len(prediction.label_ids)

1938

In [7]:
from transformers import pipeline
from transformers import AutoTokenizer,AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("AhmedTaha012/finance-ner-v0.0.9-finetuned-ner")
model = AutoModelForTokenClassification.from_pretrained("AhmedTaha012/finance-ner-v0.0.9-finetuned-ner")

nlpPipe = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']


Downloading (…)okenizer_config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/669k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.07k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/431M [00:00<?, ?B/s]



In [19]:
df["transcript"].iloc[10]

"Prepared Remarks:\nOperator\nGood morning, and welcome to the Waddell & Reed Financial Second Quarter 2020 Earnings Conference Call. [Operator Instructions] [Operator Instructions]\nI would now like to turn the conference over to Mike Daley, Vice President, Investor Relations. Please go ahead.\nMichael J. Daley -- Vice President-investor Relations and Controller\nThank you. On behalf of our management team, I would like to welcome you to our quarterly earnings conference call. Joining me on our call today are Phil Sanders, our CEO; Brent Bloss, our President; Ben Clouse, CFO; Dan Hanson, our CIO; Shawn Mihal, President of our Wealth Management business, Waddell & Reed, Inc.; and Amy Scupham, President of Ivy Distributors, Inc.\nBefore we begin, I would like to remind you that some of our comments and responses may include forward-looking statements and non-GAAP financial measures. While we believe these forward-looking statements to be reasonable based on information that is currently

In [14]:
import math
tokens=df["transcript"].iloc[10].split()
splitSize=256
chunks=[tokens[r*splitSize:(r+1)*splitSize] for r in range(math.ceil(len(tokens)/splitSize))]

In [17]:
for x in chunks:
    text=" ".join(x)
    entity_list=nlpPipe(text)
    if len(entity_list)>=1:
        print(text)
        print(entity_list)
        print("----------------")

details within both our asset management and wealth management businesses now. Investments net flows improved this quarter, aided by meaningfully lower redemptions against our $2 billion in gross sales. In fact, redemptions improved 24% compared to the first quarter and 19% compared to the same quarter in 2019. Sales continued to be strong in our Mid-Cap suite with both, strategies and net positive flow for the quarter. While short-term performance has improved, we continue to see outflows in our international core strategy. Our distribution teams are working well remotely and continue to make traction across channels since we realigned the structure of our sales teams. We continue to focus on providing our clients with high-quality service that meets their unique needs as well as highlight and provide access to our intellectual capital while keeping the safety and wellness of our employees and clients as our top priority. Turning to investment performance. Second quarter of this year 

In [15]:
import spacy
import random
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")


# Create a Doc object from text (assuming you have a text where entities are extracted)
for idx in range(0, len(chunks)):
    text = " ".join(chunks[idx])
    entity_list=nlpPipe(text)
    if len(entity_list)>=1:
        doc = nlp(text)
        # Create spans for each entity and set their labels
        entities = []
        for entity in entity_list:
            span = doc.char_span(entity['start'], entity['end'], label=entity['entity_group'])
            entities.append(span)
        if len(entities)>=1:
            # Set the entities to the Doc.ents property
            try:
                doc.ents = entities

                # Display entities and their labels
                for ent in doc.ents:
                    print(f"Entity: {ent.text}, Label: {ent.label_}")

                # Display the original text with entity annotations
                from spacy import displacy

                displacy.render(doc, style="ent", jupyter=True)
                print("==============================================")
            except:
                pass

Entity: 2 billion, Label: revenue




In [56]:
tokess=df["transcript"].iloc[11000].split()
splitSize=265
splits=[tokess[r*splitSize:(r+1)*splitSize] for r in range(math.ceil(len(tokess)/splitSize))]

In [60]:
for x in splits:
    text = " ".join(x)
    entity_list=nlpPipe(text)
    if len(entity_list)>=1:
        doc = nlp(text)
        # Create spans for each entity and set their labels
        entities = []
        for entity in entity_list:
            span = doc.char_span(entity['start'], entity['end'], label=entity['entity_group'])
            entities.append(span)
        if len(entities)>=1:
            # Set the entities to the Doc.ents property
            try:
                doc.ents = entities
                # Display entities and their labels
                for ent in doc.ents:
                    print(f"Entity: {ent.text}, Label: {ent.label_}")
                # Display the original text with entity annotations
                from spacy import displacy
                displacy.render(doc, style="ent", jupyter=True)
                print("==============================================")
            except:
                pass
        else:
            print(text)
            print("==============================================")
    else:
        print(text)
        print("==============================================")

Entity: net, Label: PROFIT
Entity: income, Label: PROFIT
Entity: revenues, Label: REVENUE


the West, despite some recent record-breaking rainfall parts of California. Certainly in Oregon and Washington, they've gotten a good amount of water down on the farms, but they also have gotten many feet of snow in the mountains, and when that snow melts, it feeds all the farms in the valley. However, all our properties continued to be in a position where there is currently ample water to complete both the current crop and next year's crop. Where we have farms located in water districts, those districts have stored water or other supplemental sources that cover our farms for the short-term. Almost all of the farms out West have well sites and most of them rely on groundwater as their main source of the irrigation. For these properties, we are seeing a typical seasonal dropping of the water table levels, and we haven't had any, of course, that have gone dry. And all of our farms currently have pumping capacity to cover their crop needs. One thing you should know is that wet and dry wea

Entity: growth, Label: PROFIT
Entity: financial, Label: REVENUE


Entity: financial, Label: REVENUE


complex business. So if you like, what we're doing, please buy some stock and keep eating fresh fruits and vegetables and nuts. Now we will stop and have some questions from those who follow us. Operator, would you please come on and tell these people how they can ask us some questions? Questions and Answers: Operator Thank you. At this time, we will conduct a question-and-answer session. [Operator Instructions]. Our first question comes from Rob Stevenson with Janney. Please proceed. Robert Stevenson -- Janney Montgomery Scott LLC -- Analyst Good morning. David, where is pricing for farmland today versus a couple of years ago pre-pandemic? When you look at similar properties, are we up 5%, 10% flattish? How do you sort of characterize it across your various sort of property types and markets? David J. Gladstone -- Chairman, Chief Executive Officer and President Yeah. If you're looking at the Midwest, which is most often the one that's published, it's gone up pretty substantially this 

selling some of these farms. Right now, we're not interested in selling anything. What we want to do is build an incredible Company with lots of farms and try to catch up with some of the other big farmers in the United States. As you well know, there is a man that is in the -- really not in the business anymore, but he is buying up a lot of land around the country. He has got about 230,000 acres and he is the largest farmer and we need to catch him. It's going to be a while because there is issues in tax-free dollars to buy farms. But I think we are in good shape, Rob, and I think we're just going to continue doing the same thing every day for the next ten years until we get a really big farming operation going. Robert Stevenson -- Janney Montgomery Scott LLC -- Analyst Okay. And then last one from me. The acquisition vehicle, I mean, are the opportunities which you're looking at there going to be too big for taxable REIT subsidiary? Is that the reason why you're going that route rath

Entity: growth, Label: PROFIT
Entity: net, Label: PROFIT
Entity: Financial, Label: REVENUE


Entity: gross, Label: REVENUE
Entity: expenses, Label: EXPENSE


Entity: Financial, Label: REVENUE
Entity: $, Label: MONEY
Entity: 86, Label: MONEY


Entity: financial, Label: REVENUE
Entity: Financial, Label: REVENUE


Entity: cost, Label: EXPENSE


flow. So we like that. But to get to these much larger farms, there aren't that many farmers that can take down that much. So we have to be very careful not to get in a vine whereby we have a large farm, we don't have a tenant. So we like the onesie, twosies. There not a lot of players there. And that's our forte as being able to negotiate those and offer the seller a good price for the farm, but also tax-free if they want to do the right transaction. So we'll keep doing what we're doing and the diversification is really important for me. I don't want to get into a situation where we've got a couple of big farms that are going to hurt us. Eric Borden -- Berenberg Capital -- Analyst No, I appreciate that. And then maybe on the acquisition front, kind of historically fourth quarter seems to be the key time to acquire farms, but given constraints as it relates to COVID, do you think you'll see more farmers come to market in first quarter or will there be some rollover there into the New Y

