# Natural Language Processing Mini Task

**Author:** Ties de Kok ([Personal Website](http://www.tiesdekok.com))  
**Last updated:** 15 May 2018  
**Python version:** Python 3.6  
**License:** MIT License  
**Credit:** parts of these tasks were co-created by Stephan Hollander ([Personal Website](https://www.tilburguniversity.edu/webwijs/show/s.hollander/))

## *Introduction*

In this notebook I will provide you with "tasks" that you can try to solve.  

Most of what you need is discussed in the tutorial notebooks; the rest you will have to Google (which is an important exercise in itself).

## *Relevant notebooks*

1) [`0_python_basics.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb)  


2) [`2_handling_data.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/2_handling_data.ipynb)  


3) [`NLP_Notebook.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/Python_NLP_Tutorial/blob/master/NLP_Notebook.ipynb)  

## NLP Mini Task <br> --------------------

The goal of this mini-task is to get hands-on experience with handling, cleaning, and analyzing textual data using Python.

---------------------------------------------------

There are two primary tasks and one challenging bonus task:

**Task 1)** Following Garcia and Norli (2012), extract state name counts from MD&As to assess geographic dispersion  

**Task 2)** Create a sentiment score for MD&As based on the Loughran and McDonald (2011) word lists  

---------------------------------------------------

**Bonus task 3)** Combine tasks 1 and 2: evaluate the sentiment score of the text surrounding state name references

#### References  

Garcia, D., & Norli, Ø. (2012). Geographic dispersion and stock returns. Journal of Financial Economics, 106(3), 547-565.  
Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10‐Ks. The Journal of Finance, 66(1), 35-65.

### Data

Gathering and extracting the MD&A section of a 10-K is quite tricky.  

I have therefore included a random selection of 20 pre-processed MD&A filings.  

In the "data" folder you will find a folder called "MDA_files". Each file in this folder is one MD&A filing, and the filename is its unique identifier.

You will also find a file called `MDA_META_DF.xlsx` in the "data" folder. It contains the following metadata for each MD&A: filing date, cik, company name, and link to filing.

### Import required packages

### Load data

The files should all be in the following folder:  
```
join('data', 'MDA_files')
```
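One possible way to read every file in that folder is sketched below. The function name `load_mda_files`, the UTF-8 encoding, and the choice to silently drop undecodable bytes are assumptions made for illustration; adjust them to whatever the files actually need.

```python
from os import listdir
from os.path import isfile, join, splitext


def load_mda_files(folder):
    """Return a dict mapping each filename (without extension) to the file's text."""
    texts = {}
    for filename in sorted(listdir(folder)):
        path = join(folder, filename)
        if not isfile(path):
            continue  # skip sub-folders
        # Assumption: files are (mostly) UTF-8; errors='ignore' drops invalid bytes
        with open(path, 'r', encoding='utf-8', errors='ignore') as f:
            texts[splitext(filename)[0]] = f.read()
    return texts


# Usage, assuming the folder layout described above:
# mda_texts = load_mda_files(join('data', 'MDA_files'))
```

Keying the dictionary on the filename keeps the unique identifier attached to each text, which makes it easier to link the results back to the metadata file later.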

### Clean and Pre-process text data

Depending on your approach, you might need to split each MD&A into sentences, split it into words, or remove invalid characters.

Apply whatever pre-processing you see fit.

----
Split into sentences
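
As a rough starting point, sentences can be split with a regular expression. This is a deliberately naive sketch and not the tokenizer approach from the NLP notebook: the helper names `clean_text` and `split_sentences` are made up for this example, and the rule (split after `.`, `!` or `?` when a capital letter follows) will break on abbreviations such as "U.S. Treasury".

```python
import re


def clean_text(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return re.sub(r'\s+', ' ', text).strip()


def split_sentences(text):
    """Naive splitter: break after . ! or ? when whitespace and a capital letter follow."""
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', clean_text(text))


sample = "Revenue increased in Texas.\nCosts fell.  We expect growth!"
print(split_sentences(sample))
# ['Revenue increased in Texas.', 'Costs fell.', 'We expect growth!']
```

A trained sentence tokenizer, such as the ones discussed in the NLP notebook, will generally handle edge cases better than this regular expression.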

### Task 1: Extract state name counts

Follow Garcia and Norli (2012) and count the number of times that each U.S. state name is mentioned in the MD&A.  

Then:

1. Create a DataFrame for each MD&A that shows the number of times each U.S. state name is mentioned.  
2. Create a DataFrame to report the min, max, mean, median, and stdev for the number of times that each state is mentioned in MD&As. 

**Note:** state names are provided in the `state_names.xlsx` file in the "data" folder.
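
The counting step could be sketched as follows. The state list here is a small illustrative subset (the full list would come from `state_names.xlsx`), the function names are hypothetical, and case-sensitive whole-word matching is an assumption rather than the exact procedure of Garcia and Norli (2012).

```python
import re
from collections import Counter

# Illustrative subset only; the full list would be read from state_names.xlsx
STATE_NAMES = ['Virginia', 'West Virginia', 'Texas', 'New York', 'Kansas', 'Arkansas']


def build_state_pattern(state_names):
    """Compile one regex that matches any state name as a whole word."""
    # Longest names first so 'West Virginia' is tried before 'Virginia'
    ordered = sorted(state_names, key=len, reverse=True)
    return re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in ordered) + r')\b')


def count_states(text, state_names):
    """Return a dict with the number of mentions for every state, zeros included."""
    pattern = build_state_pattern(state_names)
    found = Counter(m.group(0) for m in pattern.finditer(text))
    return {state: found.get(state, 0) for state in state_names}


sample = "Our plants in West Virginia and Virginia supply Texas. Texas sales rose in Arkansas."
counts = count_states(sample, STATE_NAMES)
print(counts)
```

The word boundaries keep "Kansas" from matching inside "Arkansas", and the longest-first ordering keeps "Virginia" from being counted inside "West Virginia". A dictionary of such per-filing results can be passed to `pd.DataFrame.from_dict(..., orient='index')`, after which `.agg(['min', 'max', 'mean', 'median', 'std'])` yields the requested descriptives.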

### Task 2: Create sentiment score

Follow Loughran and McDonald (2011) and count the number of times a tone word from their dictionary is mentioned.  

Then:  

1. For each MD&A calculate the total number of negative and total number of positive words mentioned.   
2. Tabulate these total counts in a DataFrame together with the total number of words in the MD&A.  
3. Create a new column which calculates a sentiment score using the following equation:  

$$\frac{(Num\ Positive\ Words - Num\ Negative\ Words)}{Total\ Number\ of\ Words}$$



**Note 1:** You can split a sentence into words using any of the tokenizers mentioned in the NLP notebook.  
**Note 2:** The Loughran and McDonald dictionary is included in the "data" folder: `LoughranMcDonald_MasterDictionary_2014.xlsx`
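
A minimal sketch of the score calculation is shown below. The two word sets are tiny placeholders, not the actual Loughran and McDonald lists, which would have to be built from the dictionary file. The simple regex tokenizer and the function names are also assumptions made for illustration.

```python
import re

# Placeholder word sets; the real lists would be built from the dictionary file
POSITIVE = {'gain', 'improve', 'strong'}
NEGATIVE = {'loss', 'decline', 'adverse'}


def tokenize(text):
    """Lower-case the text and return its alphabetic word tokens."""
    return re.findall(r'[a-z]+', text.lower())


def sentiment_counts(text, positive=POSITIVE, negative=NEGATIVE):
    """Count tone words and compute (positive - negative) / total words."""
    words = tokenize(text)
    num_pos = sum(w in positive for w in words)
    num_neg = sum(w in negative for w in words)
    total = len(words)
    score = (num_pos - num_neg) / total if total else 0.0
    return {'positive': num_pos, 'negative': num_neg,
            'total_words': total, 'sentiment': score}


sample = "Strong demand led to a gain, offset by a loss."
print(sentiment_counts(sample))
# {'positive': 2, 'negative': 1, 'total_words': 10, 'sentiment': 0.1}
```

Lower-casing both the text and the word lists avoids mismatches caused by capitalization. A list of these dictionaries, one per MD&A, converts directly into a DataFrame with `pd.DataFrame(list_of_dicts)`.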

### Bonus Task 3: Sentiment score relating to state name references

Count the number of tone words used within a +/- 250 character range of a U.S. state name.

Then:

1. Create a DataFrame for each MD&A where you report the total number of positive and total number of negative words by U.S. state name.  
2. Create aggregate descriptives for the states with the most positive words and the states with the most negative words.
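
One way to approach the window logic is sketched below, under several assumptions: the window covers 250 characters before the start and 250 characters after the end of each state name, the word sets and state list are small placeholders, and the function name `tone_near_states` is hypothetical. Words cut off at a window edge are counted only if the remaining fragment happens to be a tone word, and overlapping windows count the same tone word more than once.

```python
import re
from collections import defaultdict

# Placeholder inputs; the real ones would come from the files in the "data" folder
POSITIVE = {'gain', 'strong'}
NEGATIVE = {'loss', 'decline'}
STATE_NAMES = ['Texas', 'Ohio']


def tone_near_states(text, state_names, positive, negative, window=250):
    """Sum positive and negative word counts in a character window around each state mention."""
    ordered = sorted(state_names, key=len, reverse=True)
    pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in ordered) + r')\b')
    totals = defaultdict(lambda: {'positive': 0, 'negative': 0})
    for match in pattern.finditer(text):
        # Clip the window to the bounds of the document
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        words = re.findall(r'[a-z]+', text[start:end].lower())
        totals[match.group(0)]['positive'] += sum(w in positive for w in words)
        totals[match.group(0)]['negative'] += sum(w in negative for w in words)
    return dict(totals)


# The padding of 300 spaces keeps the two mentions outside each other's window
sample = "Strong gain in Texas." + " " * 300 + "Ohio reported a loss."
result = tone_near_states(sample, STATE_NAMES, POSITIVE, NEGATIVE)
print(result)
# {'Texas': {'positive': 2, 'negative': 0}, 'Ohio': {'positive': 0, 'negative': 1}}
```

States that never appear in a filing are absent from the result, so fill in zeros for them before building the per-filing DataFrame if every state should be listed.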