# Machine Learning Part 2 - Problems

**Author:** Ties de Kok ([Personal Website](https://www.tiesdekok.com))  <br>
**Last updated:** September 2020  
**Python version:** Python 3.6+     
**Recommended environment: `researchPython`**

In [1]:
import os
recommendedEnvironment = 'researchPython'
if os.environ.get('CONDA_DEFAULT_ENV') != recommendedEnvironment:  # .get avoids a KeyError outside conda
    print('Warning: it does not appear you are using the {0} environment, did you run "conda activate {0}" before starting Jupyter?'.format(recommendedEnvironment))

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: left; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>Introduction</span>
</div>

<div style='border-style: solid; padding: 5px; border-color: darkred; border-width:5px;  text-align: center; margin-left: 100px; margin-right:100px;'>
<span style='color:black; font-size: 20px; font-weight:bold;'> Make sure to open up the respective tutorial notebook(s)! <br> That is what you are expected to use as primary reference material. </span>
</div>

### Relevant tutorial notebooks:

1) [`0_python_basics.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb)  


2) [`2_handling_data.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/2_handling_data.ipynb)  


3) [`NLP_Notebook.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/Python_NLP_Tutorial/blob/master/NLP_Notebook.ipynb)  

## Import required packages

In [2]:
import os, sys
import pandas as pd
import numpy as np

In [3]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [4]:
from tqdm.notebook import tqdm

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [6]:
from sklearn.decomposition import LatentDirichletAllocation

In [7]:
import gensim

The `pyLDAvis` package throws many deprecation warnings. These are safe to ignore for now, so the code below suppresses them:

In [8]:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: left; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>Part 1: LDA with text data</span>
</div>

<div style='border-style: solid; padding: 5px; border-color: darkred; border-width:5px;  text-align: center; margin-left: 100px; margin-right:100px;'>
<span style='color:black; font-size: 15px; font-weight:bold;'> Note: feel free to add as many cells as you'd like to answer these problems, you don't have to fit it all in one cell. </span>
</div>

## 2a) Load MD&A files

I have included a random selection of 20 pre-processed MDA filings in the `data > MDA_files` folder. The filename is the unique identifier.   

You will also find a file called `MDA_META_DF.xlsx` in the "data" folder. It contains the following metadata for each MD&A: 
* filing date  
* cik   
* company name  
* link to filing

### 2a - i) Load the data into a dictionary with the filename as key and the content of the text file as value

The files should all be in the following folder:  
```
os.path.join('data', 'MDA_files')
```

It should look like this:

![image.png](attachment:79db930d-54ef-425e-8b5a-e3db3aa70551.png)
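One possible way to build this dictionary is sketched below. The helper name `load_text_files` and the UTF-8 encoding are assumptions rather than part of the assignment; adjust the encoding if your files fail to decode.

```python
import os

def load_text_files(folder):
    """Return {filename: file content} for every regular file in `folder`."""
    texts = {}
    for filename in sorted(os.listdir(folder)):
        path = os.path.join(folder, filename)
        if not os.path.isfile(path):
            continue  # skip sub-folders
        # Assumption: the MD&A files are UTF-8 encoded plain text.
        with open(path, "r", encoding="utf-8") as f:
            texts[filename] = f.read()
    return texts

# Hypothetical usage with the folder from the problem statement:
# mda_dict = load_text_files(os.path.join('data', 'MDA_files'))
```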

### 2a - ii) Split the data up into sentences using spacy  

You can speed up spaCy a little by disabling the pipeline components you don't need. Components are referred to by their lowercase pipe names, and the parser has to stay enabled because it sets the sentence boundaries: 

```python
nlp(text, disable=['tagger', 'ner'])
```

For more information: https://spacy.io/usage/processing-pipelines

**Note:** Make sure the sentence is stored as a `str` and not a spacy object.

You want to end up with a dataframe that looks something like this:   

![image.png](attachment:2ec78663-20f1-47f0-99b8-2d7497d80699.png)
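A minimal sketch of the sentence-splitting step, assuming the dictionary from 2a - i and a loaded spaCy pipeline `nlp`. The function name `build_sentence_df` and the column names `filename` and `sentence` are placeholders chosen for illustration; use whatever matches the target dataframe.

```python
import pandas as pd

def build_sentence_df(text_dict, nlp):
    """Split each document into sentences; one DataFrame row per sentence."""
    rows = []
    for filename, text in text_dict.items():
        doc = nlp(text)
        for sent in doc.sents:
            sentence = sent.text.strip()  # `.text` gives a plain str, not a spaCy object
            if sentence:  # drop whitespace-only sentences
                rows.append({"filename": filename, "sentence": sentence})
    return pd.DataFrame(rows, columns=["filename", "sentence"])

# Hypothetical usage, with `nlp` and the dictionary from the previous step:
# sentence_df = build_sentence_df(mda_dict, nlp)
```

The function only relies on `nlp(text)` returning an object with a `.sents` iterable, so any pipeline that sets sentence boundaries should work.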

----
## 2b) Latent Dirichlet Allocation model
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

### 2b - i) Convert the textual data to numerical using `CountVectorizer`  

Use the following parameters:

1. Strip unicode accents  
2. Lowercase only  
3. Remove stopwords
4. Set `max_df` to 0.8

**Note:** make sure you run this at the sentence level
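The four settings above map onto `CountVectorizer` arguments roughly as follows. This is a sketch on made-up toy sentences; using scikit-learn's built-in English stopword list is an assumption, since the problem does not say which list to use.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences standing in for the sentence column (made up for illustration).
sentences = [
    "Revenue increased because of higher sales.",
    "Revenue growth slowed during October.",
    "Cash flows and revenue improved.",
    "The café segment reported record revenue.",
    "Debt costs rose while revenue stayed flat.",
]

vectorizer = CountVectorizer(
    strip_accents="unicode",  # 1. strip unicode accents
    lowercase=True,           # 2. lowercase only
    stop_words="english",     # 3. remove stopwords (assumption: built-in English list)
    max_df=0.8,               # 4. drop terms that occur in more than 80% of the sentences
)
dtm = vectorizer.fit_transform(sentences)  # sparse matrix: sentences x terms
vocab = set(vectorizer.vocabulary_)
print(dtm.shape)
```

In this toy corpus "revenue" occurs in every sentence, so `max_df=0.8` removes it from the vocabulary.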

### 2b - ii) Train an LDA model with 10 topics  
**Tip:** you can use the parameter `n_jobs=-1` to use all the available threads in your machine, instead of just one. This might speed up the training.
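A minimal sketch of the training step. The helper name `fit_lda` and the fixed `random_state` are assumptions added for reproducibility; the toy matrix only stands in for the sentence-level matrix from 2b - i.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def fit_lda(dtm, n_topics=10, n_jobs=-1, random_state=0):
    """Fit an LDA model on a document-term matrix and return it."""
    lda = LatentDirichletAllocation(
        n_components=n_topics,      # number of topics
        n_jobs=n_jobs,              # -1 = use all available cores
        random_state=random_state,  # assumption: fixed seed for reproducibility
    )
    lda.fit(dtm)
    return lda

# Toy example; a single job is enough for a matrix this small.
toy_sentences = ["cash and debt", "debt and assets", "assets and cash", "sales grew"]
toy_dtm = CountVectorizer().fit_transform(toy_sentences)
toy_lda = fit_lda(toy_dtm, n_topics=10, n_jobs=1)
print(toy_lda.components_.shape)  # (number of topics, number of terms)
```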

### 2b - iii) Show the top 10 words for each topic
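One way to read the top words off a fitted model is to sort each row of `components_` by weight. The helper `top_words_per_topic` below is a hypothetical name, and the usage lines assume a fitted model `lda` and its `vectorizer`.

```python
import numpy as np

def top_words_per_topic(components, feature_names, n_top=10):
    """Return, for each topic (row), the `n_top` highest-weighted words."""
    feature_names = np.asarray(feature_names)
    top_words = []
    for topic_weights in np.asarray(components):
        # argsort is ascending, so reverse it and keep the first n_top indices
        top_idx = np.argsort(topic_weights)[::-1][:n_top]
        top_words.append(feature_names[top_idx].tolist())
    return top_words

# Hypothetical usage with a fitted model `lda` and its `vectorizer`:
# names = vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
# for i, words in enumerate(top_words_per_topic(lda.components_, names)):
#     print(i, ", ".join(words))
```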

### 2b - iv) Use `pyLDAvis` to visualize the LDA model  

https://github.com/bmabey/pyLDAvis

`pyLDAvis` is not yet fully optimized for this version of Python and Jupyter Lab, so I recommend importing it only when you need it:

```python
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
```

You should see something like this:

![image.png](attachment:c48d25e3-b0bd-43c0-9f49-6aaca4c1e736.png)

In [17]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

### 2b - v) Find a good LDA model to represent MD&A sentences   

Use what you learned above to find a good LDA model. You can tweak:

1. The parameters in `CountVectorizer`  
2. The number of topics  

Make sure to include a pyLDAvis illustration.
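A rough sketch of how such a search could be organized: fit one model per candidate topic count and keep them all for inspection. The helper name `fit_candidates` is hypothetical, and recording the in-sample perplexity is an assumption added here as a crude numeric hint; it does not replace looking at the topics and the pyLDAvis plot.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def fit_candidates(sentences, topic_counts, vectorizer_kwargs):
    """Fit one LDA model per topic count on the same document-term matrix."""
    vectorizer = CountVectorizer(**vectorizer_kwargs)
    dtm = vectorizer.fit_transform(sentences)
    results = {}
    for k in topic_counts:
        lda = LatentDirichletAllocation(n_components=k, random_state=0)
        lda.fit(dtm)
        # Perplexity (lower is better) is only a rough, in-sample hint here.
        results[k] = {"model": lda, "perplexity": lda.perplexity(dtm)}
    return vectorizer, dtm, results

# Hypothetical usage on the sentence column of your dataframe:
# vec, dtm, res = fit_candidates(sentence_df["sentence"], [5, 10, 20],
#                                {"stop_words": "english", "max_df": 0.8})
```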

### 2b - vi) Add the topic probabilities back into the sentence_df


### 2b - vii) Add a column to `sentence_df` which indicates the topic with the largest probability

**Hint:** you can return the column name of the max value by using `.idxmax` instead of `.max`
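Both steps can be sketched together as below. The helper name `add_topic_columns` and the column names `topic_0`, `topic_1`, ... and `top_topic` are assumptions made for illustration; `topic_probs` is expected to be the array returned by `lda.transform(dtm)`.

```python
import numpy as np
import pandas as pd

def add_topic_columns(sentence_df, topic_probs):
    """Append one probability column per topic plus the most likely topic."""
    topic_probs = np.asarray(topic_probs)
    topic_cols = [f"topic_{i}" for i in range(topic_probs.shape[1])]
    # Reuse the index of sentence_df so rows line up when concatenating.
    topic_df = pd.DataFrame(topic_probs, columns=topic_cols, index=sentence_df.index)
    out = pd.concat([sentence_df, topic_df], axis=1)
    # idxmax(axis=1) returns the column name holding the row-wise maximum.
    out["top_topic"] = out[topic_cols].idxmax(axis=1)
    return out

# Hypothetical usage:
# sentence_df = add_topic_columns(sentence_df, lda.transform(dtm))
```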

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: left; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>(NOT REQUIRED FOR CREDIT) Word embeddings using Gensim</span>
</div>

## 2) Word embeddings using Gensim  

Gensim provides a large collection of other word embedding models. It also lets you load any model you find on the internet.  

**Note:** generally speaking you want to have as many documents as possible when training a word embedding model, however, to keep things simple we will train a model based on the 20 provided MD&As.

### 2a) Train a 100 dimension word2vec model based on the 20 MD&As 

#### First step: create the right input   
The `Word2Vec` function takes as input a list of sentences, where each sentence is split up as a list of tokens, for example:

```python
[['This', 'is', 'sentence', '1', '.'] , ['This', 'is', 'sentence', '2', '.']]
```

Your input (based on the 20 MD&As) should be 16883 items long. 

**Tips:**

1. Make sure that you end up with strings and not spaCy tokens in your lists.  
2. I recommend converting each word to lowercase.
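A minimal sketch of building this input, assuming you start from the sentence strings created in Part 1. The helper name `to_token_lists` is hypothetical; `tokenize` can be any callable that returns tokens with a `.text` attribute, for example spaCy's `nlp.tokenizer`.

```python
def to_token_lists(sentences, tokenize):
    """Turn sentence strings into lists of lowercase string tokens."""
    token_lists = []
    for sentence in sentences:
        # token.text is a plain str; lower() normalizes the case
        token_lists.append([token.text.lower() for token in tokenize(sentence)])
    return token_lists

# Hypothetical usage with the spaCy pipeline loaded at the top of the notebook:
# w2v_input = to_token_lists(sentence_df["sentence"], nlp.tokenizer)
```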

### 2b) Train the Word2Vec model   
Use the `Word2Vec` function (imported from `gensim.models`).

`gensim.models.Word2Vec(<input from step 1>, size = 100)`

**Note:** in gensim 4.0 and later the `size` parameter is called `vector_size`.

### 2c) How many words do you have in your corpus?  
**Hint:** use `.corpus_total_words`

### 2d) Find the most similar words for the following words:

- cash
- assets
- october
- debt  

Use the `model.wv.most_similar()` function.