# Natural Language Processing Mini Tasks

**Author:** Ties de Kok ([Personal Website](https://www.tiesdekok.com))  
**Last updated:** 15 March 2019  
**Python version:** Python 3.6 or 3.7   
**License:** MIT License  
**Credit:** some of these tasks were co-created by Stephan Hollander ([Personal Website](https://www.tilburguniversity.edu/webwijs/show/s.hollander/))

## *Introduction*

In this notebook I will provide you with "tasks" that you can try to solve.  

Most of what you need is discussed in the tutorial notebooks; the rest you will have to Google (which is an important exercise in itself).

## *Relevant notebooks*

1) [`0_python_basics.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb)  


2) [`2_handling_data.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/2_handling_data.ipynb)  


3) [`NLP_Notebook.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/Python_NLP_Tutorial/blob/master/NLP_Notebook.ipynb)  

## Import required packages

**Note:** make sure you have the `limperg-python` environment activated!

```
import os, re
import pandas as pd
import numpy as np
```

```
import en_core_web_sm
nlp = en_core_web_sm.load()
```

## First steps into NLP

The goal of this mini-task is to get hands-on experience with handling, cleaning, and analyzing textual data using Python.

## 1) Perform basic operations on a sample text file

### 1a) Load the following text file: `data > example_text.txt` into Python

### 1b) Print the first 400 characters of the text file you just loaded

### 1c) Count the number of times the words `Holmes` and `Watson` are mentioned
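
One possible approach, sketched on a made-up string rather than the actual text file (the sample sentence and variable names are my own assumptions). It contrasts a plain substring count with a whole-word count using a regular expression:

```python
import re

# Made-up sample text for illustration; the real task uses the loaded text file.
sample_text = "Holmes looked at Watson. 'Watson,' said Holmes, 'the Holmeses are out.'"

# str.count() counts raw substring matches, so "Holmeses" is counted as well.
substring_count = sample_text.count("Holmes")

# A regular expression with word boundaries (\b) only matches the whole word.
whole_word_count = len(re.findall(r"\bHolmes\b", sample_text))
watson_count = len(re.findall(r"\bWatson\b", sample_text))

print(substring_count, whole_word_count, watson_count)
```

Which of the two counts is the "right" one depends on whether you want variations such as plurals to be included.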

### 1d) Use the provided regular expression to capture all names following "Mr."  
Use this regular expression: `Mr. (.*?)\W`  
**You can play around with this regular expression here: [Test on Pythex.org](https://goo.gl/55eMjU)**

**Hint:** use the `re.findall()` function

### Bonus task: try to explain to your neighbour what the regular expression is doing (you can use the cheatsheet on Pythex.org for reference)

### 1e) Load the text into a spaCy object and split it into a list of sentences

Make sure to evaluate how well it worked by inspecting various elements of the sentence list.

### Bonus task: why is there a difference between what you see when you run `sentences[1]` versus `print(sentences[1])`?

## Main NLP tasks

**Task 1)** Follow Garcia and Norli (2012) and extract state name counts from MD&As

**Task 2)** Create a sentiment score for MD&As based on the Loughran and McDonald (2011) word lists  

#### References  

Garcia, D., & Norli, Ø. (2012). Geographic dispersion and stock returns. Journal of Financial Economics, 106(3), 547-565.  
Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10‐Ks. The Journal of Finance, 66(1), 35-65.

### Data

Gathering and extracting the MD&A section of a 10-K is quite tricky.  

I have therefore included a random selection of 20 pre-processed MD&A filings.  

In the "data" folder you will find a folder called "MDA_files". Each file in this folder is an MD&A filing; the filename is its unique identifier.

You will also find a file called `MDA_META_DF.xlsx` in the "data" folder, which contains the following metadata for each MD&A: filing date, CIK, company name, and link to the filing.

## 2) Extract state name counts from MD&As

### 2a) Load the data into a dictionary with the filename as key and the content of the text file as value

The files should all be in the following folder:  
```
os.path.join('data', 'MDA_files')
```
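
A minimal sketch of the loading pattern, demonstrated on a temporary folder with two made-up files so that it runs anywhere. The helper name `load_text_files` and the UTF-8 encoding are my own assumptions; the actual files may need a different encoding.

```python
import os
import tempfile

def load_text_files(folder):
    """Return {filename: file content} for every .txt file in `folder`."""
    texts = {}
    for filename in sorted(os.listdir(folder)):
        if filename.endswith(".txt"):
            path = os.path.join(folder, filename)
            # Assumption: the files are UTF-8 encoded.
            with open(path, "r", encoding="utf-8") as f:
                texts[filename] = f.read()
    return texts

# Demonstration on a temporary folder with two made-up files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, content in [("a.txt", "First filing."), ("b.txt", "Second filing.")]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write(content)
    demo_dict = load_text_files(tmp_dir)

print(demo_dict)
```

For the task itself you would pass the folder shown above to the helper instead of the temporary folder.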

### 2b) Load state name data into a DataFrame  
**Note:** state names are provided in the `state_names.xlsx` file in the "data" folder.

### 2c) Count the number of times that each U.S. state name is mentioned in each MD&A   
**Hint:** save the counts to a list where each entry is a list that contains the following three items: [*filename*, *state_name*, *count*], like this:
> [  
> ['21344_0000021344-16-000050.txt', 'Alabama', 0],  
> ['21344_0000021344-16-000050.txt', 'Alaska', 0],  
> ['21344_0000021344-16-000050.txt', 'Arizona', 0],  
> ....  
> ['49071_0000049071-16-000117.txt', 'West Virginia', 0],  
> ['49071_0000049071-16-000117.txt', 'Wisconsin', 0],  
> ['49071_0000049071-16-000117.txt', 'Wyoming', 0]  
> ]

You can verify that it worked by checking whether the 19th element (i.e. `list[18]`) equals:  
> ['21344_0000021344-16-000050.txt', 'Maine', 2]
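
The nested loop behind this hint can be sketched as follows. The two short texts and the four-item state list are made up for illustration, and matching on whole words with `\b` is my own assumption about how to count; it is not necessarily the approach that produces the reference value above.

```python
import re

# Made-up inputs; in the task these come from the MD&A dictionary and state_names.xlsx.
mda_dict = {
    "file_1.txt": "We operate in Maine and Texas. Our Maine plant expanded.",
    "file_2.txt": "Operations are located in West Virginia.",
}
state_names = ["Maine", "Texas", "Virginia", "West Virginia"]

state_counts = []
for filename, text in mda_dict.items():
    for state in state_names:
        # re.escape guards against special characters; \b avoids partial-word hits.
        pattern = r"\b" + re.escape(state) + r"\b"
        state_counts.append([filename, state, len(re.findall(pattern, text))])

for row in state_counts:
    print(row)
```

Note that in this sketch "Virginia" is also counted inside "West Virginia". Whether you correct for that is a design choice you should make explicitly.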

### 2d) Convert the list you created in `2c` into a Pandas DataFrame and save it as an Excel sheet

**Hint:** Use the `columns=[...]` parameter to name the columns

## 3) Create sentiment score based on Loughran and McDonald (2011)   

### 3a) Load the Loughran and McDonald master dictionary    
**Note:** The Loughran and McDonald dictionary is included in the "data" folder: `LoughranMcDonald_MasterDictionary_2014.xlsx`

### 3b) Create two lists: one containing all the negative words and the other one containing all the positive words   
**Tip:** I recommend converting all words to lowercase in this step so that you don't need to worry about case later
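
A sketch of the filtering step on a tiny made-up DataFrame. I am assuming here that the dictionary has `Word`, `Negative`, and `Positive` columns in which a non-zero value flags the word; check the column names and coding in the actual file before relying on this.

```python
import pandas as pd

# Tiny made-up stand-in for the master dictionary.
# Assumption: a non-zero value in "Negative" / "Positive" flags the word.
lm_df = pd.DataFrame(
    {
        "Word": ["ABANDON", "ACHIEVE", "TABLE"],
        "Negative": [2009, 0, 0],
        "Positive": [0, 2009, 0],
    }
)

# Select the flagged words, convert them to lowercase, and turn them into lists.
negative_words = lm_df.loc[lm_df["Negative"] != 0, "Word"].str.lower().tolist()
positive_words = lm_df.loc[lm_df["Positive"] != 0, "Word"].str.lower().tolist()

print(negative_words, positive_words)
```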

### 3c) For each MD&A calculate the *total* number of times negative and positive words are mentioned

**Note:** make sure you also convert the text to lowercase!

**Hint:** save the counts to a list where each entry is a list that contains the following three items: [*filename*, *total pos count*, *total neg count*], like this:
> [  
> ['21344_0000021344-16-000050.txt', 1166, 2318],  
> ['21510_0000021510-16-000074.txt', 606, 1078],  
> ['21665_0001628280-16-011343.txt', 516, 1058],  
> ....  
> ['47217_0000047217-16-000093.txt', 544, 928],  
> ['47518_0001214659-16-014806.txt', 482, 974],  
> ['49071_0000049071-16-000117.txt', 954, 1636]  
> ]
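
One way to do the counting, sketched with made-up word lists and a made-up text. The simple tokenizer (lowercase, then keep runs of letters) is my own assumption, so the totals you get on the real data may differ from the example numbers above.

```python
import re
from collections import Counter

# Made-up word lists and text; the real lists come from the master dictionary.
positive_words = ["gain", "improve", "strong"]
negative_words = ["loss", "decline", "weak"]
mda_dict = {
    "file_1.txt": "Strong sales led to a gain. A loss in Q2 was offset by a GAIN in Q3."
}

# Sets make membership checks fast for long word lists.
pos_set, neg_set = set(positive_words), set(negative_words)

sentiment_counts = []
for filename, text in mda_dict.items():
    # Lowercase the text, then split it into alphabetic tokens (a simple tokenizer).
    tokens = re.findall(r"[a-z]+", text.lower())
    token_counts = Counter(tokens)
    pos_total = sum(n for word, n in token_counts.items() if word in pos_set)
    neg_total = sum(n for word, n in token_counts.items() if word in neg_set)
    sentiment_counts.append([filename, pos_total, neg_total])

print(sentiment_counts)
```

Counting the tokens once with `Counter` and then summing over the word lists avoids scanning the full text separately for every dictionary word.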


### 3d) Convert the list created in 3c into a Pandas DataFrame  
**Hint:** Use the `columns=[...]` parameter to name the columns

### 3e) Create a new column with a "sentiment score" for each MD&A

Use the following illustrative sentiment score:  
$$\frac{\text{Num Positive Words} - \text{Num Negative Words}}{\text{Num Positive Words} + \text{Num Negative Words}}$$
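
To make the arithmetic concrete, here is the formula as a small plain-Python function applied to two rows from the example output in 3c. The function name and the choice to return `None` when no words matched are my own assumptions.

```python
def sentiment_score(num_pos, num_neg):
    """(positive - negative) / (positive + negative); None when no words matched."""
    total = num_pos + num_neg
    if total == 0:
        return None  # assumption: treat "no sentiment words" as missing
    return (num_pos - num_neg) / total

# Two rows taken from the example output in 3c.
rows = [
    ["21344_0000021344-16-000050.txt", 1166, 2318],
    ["21510_0000021510-16-000074.txt", 606, 1078],
]
for filename, num_pos, num_neg in rows:
    print(filename, round(sentiment_score(num_pos, num_neg), 4))
```

The score always lies between -1 (only negative words) and 1 (only positive words). In pandas the same arithmetic can be applied directly to the two count columns to create the new column.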


### 3f) Use the `MDA_META_DF` file to add the company name, filing date, and CIK to the sentiment dataframe
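
A sketch of the merge pattern on two tiny made-up DataFrames. All column names and values below are hypothetical, including the assumption that the metadata contains a column that matches the filename; inspect the actual file to find the right key.

```python
import pandas as pd

# Hypothetical sentiment DataFrame (column names and values are made up).
sentiment_df = pd.DataFrame(
    [["file_1.txt", 10, 30]],
    columns=["filename", "pos_count", "neg_count"],
)

# Hypothetical stand-in for MDA_META_DF.xlsx; the real column names may differ.
meta_df = pd.DataFrame(
    {
        "filename": ["file_1.txt"],
        "company_name": ["Example Corp"],
        "filing_date": ["2016-01-01"],
        "cik": [1],
    }
)

# A left merge keeps every row of the sentiment DataFrame, matched or not.
merged_df = sentiment_df.merge(meta_df, on="filename", how="left")

print(merged_df)
```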