# Natural Language Processing - Problems

**Author:** Ties de Kok ([Personal Website](https://www.tiesdekok.com))  <br>
**Last updated:** September 2021  
**Python version:** Python 3.6+     
**Recommended environment: `researchPython`**

In [1]:
import os
recommendedEnvironment = 'researchPython'
if os.environ.get('CONDA_DEFAULT_ENV') != recommendedEnvironment:  # .get() avoids a KeyError when no conda environment is active
    print('Warning: it does not appear you are using the {0} environment, did you run "conda activate {0}" before starting Jupyter?'.format(recommendedEnvironment))

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: left; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>Introduction</span>
</div>

<div style='border-style: solid; padding: 5px; border-color: darkred; border-width:5px;  text-align: center; margin-left: 100px; margin-right:100px;'>
<span style='color:black; font-size: 20px; font-weight:bold;'> Make sure to open up the respective tutorial notebook(s)! <br> That is what you are expected to use as primary reference material. </span>
</div>

### Relevant tutorial notebooks:

1) [`0_python_basics.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb)  


2) [`2_handling_data.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/2_handling_data.ipynb)  


3) [`NLP_Notebook.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/Python_NLP_Tutorial/blob/master/NLP_Notebook.ipynb)  

## Import required packages

In [3]:
import os, re
from pathlib import Path
import pandas as pd
import numpy as np

In [133]:
from tqdm.notebook import tqdm

In [136]:
import spacy
nlp = spacy.load("en_core_web_lg", disable=[
    'Tagger', 'DependencyParser', 
    'EntityRecognizer', 'TextCategorizer']
) ## Note we disable most of this functionality here to speed up our code (as we don't need it for these tasks)

You might have to replace the above with the code below if you installed the language model in an alternative way:
```python
import en_core_web_lg
nlp = en_core_web_lg.load()
```

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: center; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>Part 1 </span>
</div>  

<div style='border-style: solid; padding: 5px; border-color: darkred; border-width:5px;  text-align: center; margin-left: 100px; margin-right:100px;'>
<span style='color:black; font-size: 15px; font-weight:bold;'> Note: feel free to add as many cells as you'd like to answer these problems; you don't have to fit everything in one cell. </span>
</div>

## 1) Perform basic operations on a sample earnings transcript text file

### 1a) Load the following text file: `data > example_transcript.txt` into Python

### 1b) Print the first 400 characters of the text file you just loaded

### 1c) Count the number of times the words `Alex` and `Angie` are mentioned

### 1c) Use the provided Regular Expression to capture all numbers prior to a "%"  
Use this regular expression: `\W([\.\d]{,})%`  
**You can play around with this regular expression here: <a href='https://bit.ly/3heIqoG'>Test on Pythex.org</a>**
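
To get a feel for what the pattern captures, it can help to run it on a short made-up sentence first. The sketch below uses `re.findall`, which returns only the contents of the capture group when the pattern contains exactly one group. The sample sentence is invented for illustration and does not come from the transcript.

```python
import re

# Made-up sentence for illustration (not taken from the transcript).
sample = "Loans grew 4.5% and deposits fell 12%, while the margin was 3.16%."

# \W          -> one non-word character (here: the space before each number)
# ([\.\d]{,}) -> capture group: zero or more digits or dots ({,} means "0 or more")
# %           -> a literal percent sign
pattern = r"\W([\.\d]{,})%"

matches = re.findall(pattern, sample)
print(matches)  # ['4.5', '12', '3.16']
```

Because `\W` needs a character in front of the number, a percentage at the very start of a text would not be matched by this pattern.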

### Extra: try to explain to a neighbor / group member what the regular expression is doing
You can use the cheatsheet on Pythex.org for reference.  

### 1d) Load the text into a spaCy object and split it into a list of sentences

Make sure to evaluate how well it worked by inspecting various elements of the sentence list.

Note: the beginning of the document contains metadata rather than normal sentences, so you might see some weird "sentences" at the start of your list.

#### What is the 140th sentence?

### Why is there a difference between what you see when you run `test_sentence` versus `print(test_sentence)`?

In [25]:
test_sentence = 'Name: Hope Bancorp Inc\nCompany Ticker: HOPE US Equity\nDate: 2020-01-23\nQ4 2019 Earnings Call\nCompany Participants\n'

In [26]:
test_sentence

'Name: Hope Bancorp Inc\nCompany Ticker: HOPE US Equity\nDate: 2020-01-23\nQ4 2019 Earnings Call\nCompany Participants\n'

In [27]:
print(test_sentence)

Name: Hope Bancorp Inc
Company Ticker: HOPE US Equity
Date: 2020-01-23
Q4 2019 Earnings Call
Company Participants
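
As a hint for the question above: evaluating a bare variable in a notebook cell displays the string's `repr()`, whereas `print()` writes out the string itself. A minimal sketch with a shortened version of the string (the variable names are only illustrative):

```python
short_sentence = 'Name: Hope Bancorp Inc\nDate: 2020-01-23\n'

# repr() returns the string the way you would type it in code: quotes included,
# and each newline shown as the two characters backslash + n.
as_repr = repr(short_sentence)

# str() is what print() uses: the text is left as-is, so "\n" is a real line break.
as_text = str(short_sentence)

print(as_repr)
print(as_text)
```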



### 1e) Parse out the following 3 blocks of text:

* The metadata at the top   
* The presentation portion  
* The Q&A portion

**Note:** you could do it based on the exact location (e.g., `text_file[:1234]`); however, that would only work for this file. Try to come up with a solution that would work for all files that follow the same structure. 
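
One structure-based approach is to split on the section headings instead of on character positions. The sketch below is a rough illustration on a made-up mini transcript. The function name `split_transcript` and the marker strings `Presentation` and `Questions And Answers` are placeholders, so check the actual file for the headings it uses.

```python
def split_transcript(text, presentation_marker, qa_marker):
    """Split a transcript into (meta, presentation, qa) using two heading markers."""
    # str.partition splits at the FIRST occurrence of the marker.
    meta, _, rest = text.partition(presentation_marker)
    presentation, _, qa = rest.partition(qa_marker)
    return meta.strip(), presentation.strip(), qa.strip()

# Made-up mini transcript; the heading lines are hypothetical.
toy_text = (
    "Name: Example Corp\nDate: 2020-01-23\n"
    "Presentation\nOperator\nWelcome everyone.\n"
    "Questions And Answers\nOperator\nFirst question please.\n"
)

meta, presentation, qa = split_transcript(
    toy_text, "\nPresentation\n", "\nQuestions And Answers\n"
)
print(meta)
print(presentation)
print(qa)
```

Including the surrounding newlines in each marker reduces the chance of splitting on the same word when it appears in the middle of a sentence.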

### 1f) How many characters, sentences, words (tokens) do the presentation portion and the Q&A portion have?  

Hint: use `spaCy` for the sentence and word counts.

### 1g) Create a list of all the questions during the Q&A and include the person that asked the question   

You should end up with 20 questions.

### 1h) Modify the Q&A list by adding in the answer + answering person

**Note:** this is not an easy question and will probably take you a while. If you are short on time, I recommend leaving it until the end. :)

This is what the first entry should (roughly) look like:
```python
qa_list[0] = {
  'q_person': 'Christopher McGratty ',
  'question': 'Great, thanks, good afternoon. Kevin maybe you could start -- or Alex on the margin, obviously the environment has got a little bit tougher for the banks. But you have this -- the ability to bring down deposit costs, which you talked about in your prepared remarks. I appreciate in the guidance for the first quarter, but if the rate outlook remains steady, how do we think about ultimate stability in the flow and the margin, where and kind of when?',
  'answers': [{
    'name': 'Alex Ko ',
    'answer': 'Sure, sure. As I indicated, we would expect to have continued compression next quarter given the rate cuts that we have experienced especially October rate cut, it will continue next quarter. But as we indicated, our proactive deposit initiative as well as very disciplined pricing on the deposit, even though we have a very competitive -- competition on the loan rate is very still severe. We would expect to stabilize in the second quarter of 2020 in terms of net interest margin and then second half of the year, we would expect to start to increase.'
  }]
}
```

Make sure that it can handle cases where there are multiple answers:
```python
{
    "q_person": "Unidentified Participant",
    "question": "Okay, thank you very much, I appreciate it.",
    "answers": [
        {
            "name": "Kevin S. Kim ", 
            "answer": "Thank you."
        },
        {
            "name": "Operator",
            "answer": "And the next question comes from Tim Coffey with Janney.",
        },
    ],
}
```
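
One possible way to get this nested structure, sketched under the assumption that you have already parsed the Q&A portion into an ordered list of `(role, name, text)` turns. The `turns` list, the `"Q"`/`"A"` role labels, and the function name `build_qa_list` are hypothetical; how you extract the turns depends on the transcript's formatting.

```python
def build_qa_list(turns):
    """Group ordered (role, name, text) turns into question dicts with their answers."""
    qa_list = []
    for role, name, text in turns:
        if role == "Q":
            # Every question starts a new entry with an empty answer list.
            qa_list.append({"q_person": name, "question": text, "answers": []})
        elif role == "A" and qa_list:
            # Every answer is attached to the most recent question.
            qa_list[-1]["answers"].append({"name": name, "answer": text})
    return qa_list

# Toy input; the last turn is a placeholder, not real transcript text.
turns = [
    ("Q", "Unidentified Participant", "Okay, thank you very much, I appreciate it."),
    ("A", "Kevin S. Kim", "Thank you."),
    ("A", "Operator", "And the next question comes from Tim Coffey with Janney."),
    ("Q", "Tim Coffey", "(next question text)"),
]

qa_list = build_qa_list(turns)
print(len(qa_list), len(qa_list[0]["answers"]))
```

Appending to `qa_list[-1]` is what makes multiple answers per question work without any special casing.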

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: left; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>Part 2</span>
</div>

## 2) Create a sentiment score based on Loughran and McDonald (2011)   

Create a sentiment score for MD&As based on the Loughran and McDonald (2011) word lists.    

#### References  

*Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10‐Ks. The Journal of Finance, 66(1), 35-65.*

#### Data to use

I have included a random selection of 20 pre-processed MD&A filings in the `data > MDA_files` folder. The filename is the unique identifier.   

You will also find a file called `MDA_META_DF.xlsx` in the "data" folder. It contains the following metadata for each MD&A: 
* filing date  
* cik   
* company name  
* link to filing

### 2a) Load the data into a dictionary with the filename as key and the content of the text file as value

The files should all be in the following folder:  
```
Path.cwd() / 'data' / 'MDA_files'
```
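
A minimal sketch of one way to build such a dictionary with `pathlib`. The helper name `load_text_files` and the UTF-8 encoding are assumptions, and the demonstration runs on a temporary folder with two made-up files instead of the real data.

```python
import tempfile
from pathlib import Path

def load_text_files(folder):
    """Return {filename: file content} for every .txt file in `folder`."""
    files = sorted(Path(folder).glob("*.txt"))
    # Assumption: the files are UTF-8 encoded; adjust `encoding` if they are not.
    return {path.name: path.read_text(encoding="utf-8") for path in files}

# Demonstration on a temporary folder with two made-up files.
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "1_demo.txt").write_text("first filing", encoding="utf-8")
    (Path(tmp_dir) / "2_demo.txt").write_text("second filing", encoding="utf-8")
    demo_dict = load_text_files(tmp_dir)

print(demo_dict)
```

For the exercise you would point the same function at `Path.cwd() / 'data' / 'MDA_files'`.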

### 2b) Load the Loughran and McDonald master dictionary    
**Note:** The Loughran and McDonald dictionary is included in the "data" folder: `LoughranMcDonald_MasterDictionary_2014.xlsx`

### 2c) Create two lists: one containing all the negative words and the other one containing all the positive words   

**Note:** you can treat any number that is not 0 as a 1:

```python
0       82776
2009     2315    ## <-- 1
2014       26    ## <-- 1
2011       13    ## <-- 1
2012        1    ## <-- 1
```

The dictionary stores a year instead of a 1 for versioning purposes. 

**Tip:** I recommend changing all words to lowercase in this step so that you don't need to worry about it later.
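
A rough sketch of the "not 0 means 1" idea on a tiny stand-in DataFrame. The column names `Word`, `Negative`, and `Positive` are assumptions about the layout of the Excel file, and the four rows are purely illustrative, so verify both against the dictionary you loaded in 2b.

```python
import pandas as pd

# Tiny stand-in for the master dictionary (assumed column names, illustrative rows).
lm_df = pd.DataFrame({
    "Word": ["ABANDON", "ABLE", "ACHIEVE", "AARDVARK"],
    "Negative": [2009, 0, 0, 0],
    "Positive": [0, 2009, 2009, 0],
})

# Any non-zero value (a year) marks the word as belonging to that list.
neg_words = lm_df.loc[lm_df["Negative"] != 0, "Word"].str.lower().tolist()
pos_words = lm_df.loc[lm_df["Positive"] != 0, "Word"].str.lower().tolist()

print(neg_words)
print(pos_words)
```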

### 2d) For each MD&A calculate the *total* number of times negative and positive words are mentioned

**Hint:** save the counts to a list where each entry is a list that contains the following three items: [*filename*, *total pos count*, *total neg count*], like this:
> [   
 ['21344_0000021344-16-000050.txt', 474, 642],   
 ['21510_0000021510-16-000074.txt', 168, 208],  
 ['21665_0001628280-16-011343.txt', 240, 354],  
> ....    
 ['47217_0000047217-16-000093.txt', 138, 360],    
 ['47518_0001214659-16-014806.txt', 202, 278],   
 ['49071_0000049071-16-000117.txt', 440, 570]   
 ]   

You can verify that it worked by checking whether the 3rd element (i.e. `list[2]`) equals:  
> ['21665_0001628280-16-011343.txt', 240, 354]


**Note 1:** make sure you also convert the text to lowercase.   
**Note 2:** if this code runs slowly on your computer, it is fine to run it on the first 5 files only.  
**Note 3:** you might end up with a different number due to substring matches, we only want to count full matches:

```python
### For example, consider the positive word 'win'

test_sen = "The hockey team made a big win during the winter."

test_sen.count('win')
## gives --> 2 

## We only want to count "win" not "winter", how do we solve that?
```
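
Two possible directions for counting full matches only, shown as a sketch rather than the intended solution: a regular expression with word boundaries (`\b`), or splitting the text into tokens once and looking up each word in a `Counter`. The short `positive_words` list is made up for the example.

```python
import re
from collections import Counter

sentence = "The hockey team made a big win during the winter."

# Option 1: \b marks a word boundary, so "winter" is no longer counted.
full_match_count = len(re.findall(r"\bwin\b", sentence.lower()))

# Option 2: tokenize once, then look up every word in a Counter.
# This scales better when there are thousands of dictionary words.
tokens = re.findall(r"[a-z]+", sentence.lower())
token_counts = Counter(tokens)
positive_words = ["win", "big"]  # made-up mini word list
total_pos = sum(token_counts[word] for word in positive_words)

print(full_match_count, total_pos)
```

The simple `[a-z]+` tokenizer is an assumption. Tokens from spaCy would work in the same way with the `Counter`.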



### 2e) Convert the list created in 2d into a Pandas DataFrame  
**Hint:** Use the `columns=[...]` parameter to name the columns

### 2f) Create a new column with a "sentiment score" for each MD&A

Use the following (imaginary) sentiment score:   

$$\frac{(Num\ Positive\ Words - Num\ Negative\ Words)}{Sum\ of\ Pos\ and\ Neg\ Words}$$
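
To make the formula concrete, here is a small sketch that applies it to the first three example rows from the hint in 2d. The column names `filename`, `pos_count`, `neg_count`, and `sentiment` are arbitrary choices, not prescribed names.

```python
import pandas as pd

# First three example rows from the hint in 2d.
counts = [
    ['21344_0000021344-16-000050.txt', 474, 642],
    ['21510_0000021510-16-000074.txt', 168, 208],
    ['21665_0001628280-16-011343.txt', 240, 354],
]

sentiment_df = pd.DataFrame(counts, columns=['filename', 'pos_count', 'neg_count'])

# (positive - negative) / (positive + negative), computed for all rows at once.
sentiment_df['sentiment'] = (
    (sentiment_df['pos_count'] - sentiment_df['neg_count'])
    / (sentiment_df['pos_count'] + sentiment_df['neg_count'])
)

print(sentiment_df)
```

The score lies between -1 (only negative words) and 1 (only positive words). A document with no matches at all would give a missing value because of the division by zero.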


### 2g) Use the `MDA_META_DF` file to add the company name, filing date, and CIK to the sentiment dataframe