# Follow this to extract the relevant data

**1. Put compressed quotebank files in /Data and name them like this: quotes-YEAR.json.bz2**

The whole process produces about 37 GB of files in total, but intermediate files can be deleted once the step that consumes them has finished.
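Before starting, it can help to check that the files in step 1 follow the expected naming. The snippet below is a minimal sketch, not part of `functions`; the helper name `find_quotebank_files` is hypothetical, and it assumes a four-digit year in the file name.

```python
import re
from pathlib import Path

# Assumed naming convention from step 1: quotes-YEAR.json.bz2, four-digit year.
QUOTEBANK_NAME = re.compile(r"^quotes-(\d{4})\.json\.bz2$")


def find_quotebank_files(data_dir="Data"):
    """Map each year to the compressed Quotebank file found in data_dir."""
    found = {}
    for path in Path(data_dir).glob("quotes-*.json.bz2"):
        match = QUOTEBANK_NAME.match(path.name)
        if match:  # skip files that only loosely resemble the pattern
            found[int(match.group(1))] = path
    return dict(sorted(found.items()))


# Usage: list the years that are ready for chunking.
# for year, path in find_quotebank_files().items():
#     print(year, path)
```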

---

**2. Install the necessary python libraries.**

**Spacy**

To install spacy with pip:

```
pip install spacy
```

We use several spaCy models, which are downloaded from the terminal like this:

```
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg
```

**yfinance**

```
pip install yfinance
```

**googlesearch**

```
pip install googlesearch-python
```
or
```
python3 -m pip install googlesearch-python
```

**pytrends**

```
pip install pytrends
```

**Other (part of the Python standard library or included with Anaconda):**
- datetime
- matplotlib
- pandas
- bz2
- os
- numpy
- time
- glob
- re
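To confirm that step 2 worked, the following sketch reports which third-party packages and spaCy models cannot be found in the current environment. The helper name `missing_packages` is hypothetical. The list uses import names, which differ from the pip names for some packages (for example `googlesearch-python` is imported as `googlesearch`).

```python
from importlib.util import find_spec

# Import names of the third-party packages from step 2, plus the three
# spaCy models, which pip installs as ordinary importable packages.
REQUIRED = [
    "spacy", "yfinance", "googlesearch", "pytrends",
    "matplotlib", "pandas", "numpy",
    "en_core_web_sm", "en_core_web_md", "en_core_web_lg",
]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


print("Missing:", missing_packages() or "nothing")
```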

---

**3. Chunk the quotebank files into smaller .csv.bz2 files for each year.**

This can be parallelized across years or run in a loop over all the years.
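One possible way to run the years in parallel is a process pool, sketched below. `run_per_year` and `chunk_year` are hypothetical names, not part of `functions`. Keep in mind that each worker decompresses a full year at once, so disk and memory use grow with the number of workers.

```python
from concurrent.futures import ProcessPoolExecutor


def run_per_year(task, years, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Call task(year) for every year concurrently and return {year: result}."""
    years = list(years)
    with executor_cls(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input years.
        return dict(zip(years, pool.map(task, years)))


# Usage sketch: with a process pool the task must be a top-level function
# that can be pickled (a lambda will not work), for example:
#
# from functions import chunkify
#
# def chunk_year(year):
#     chunkify('Data/quotes-' + str(year) + '.json.bz2', 200000, 'quotes-' + str(year))
#
# run_per_year(chunk_year, range(2015, 2021), max_workers=3)
```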

*Single year chunking:*

```python
from functions import chunkify

year = 2020
chunk_size = 200000
chunkify('Data/quotes-' + str(year) + '.json.bz2', chunk_size, 'quotes-' + str(year))
```

*Loop:*

```python
from functions import chunkify

chunk_size = 200000
for year in range(2015, 2021):
    chunkify('Data/quotes-' + str(year) + '.json.bz2', chunk_size, 'quotes-' + str(year))
```

This writes the chunk files for the given year to the same /Data folder.

Format: `quotes-YEAR-X.csv.bz2`

Example: `quotes-2015-27.csv.bz2`
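The implementation of `chunkify` is not shown here. As an illustration only, chunking a bz2-compressed JSON-lines file could look roughly like the sketch below, which uses only the standard library. The name `chunkify_sketch`, the `out_dir` parameter, and numbering chunks from 0 are assumptions, and it expects every record to carry the same keys.

```python
import bz2
import csv
import json
import os
from itertools import islice


def chunkify_sketch(json_path, chunk_size, out_prefix, out_dir="Data"):
    """Split a bz2-compressed JSON-lines file into numbered .csv.bz2 chunks."""
    written = []
    with bz2.open(json_path, "rt", encoding="utf-8") as source:
        lines = (line for line in source if line.strip())  # ignore blank lines
        index = 0
        while True:
            # Read at most chunk_size records so only one chunk is in memory.
            rows = [json.loads(line) for line in islice(lines, chunk_size)]
            if not rows:
                break
            out_path = os.path.join(out_dir, f"{out_prefix}-{index}.csv.bz2")
            with bz2.open(out_path, "wt", encoding="utf-8", newline="") as target:
                # Column names come from the first record of the chunk.
                writer = csv.DictWriter(target, fieldnames=list(rows[0]))
                writer.writeheader()
                writer.writerows(rows)
            written.append(out_path)
            index += 1
    return written
```

Streaming line by line keeps memory use bounded by `chunk_size`, which matters because a single year of Quotebank does not fit comfortably in memory when decompressed.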

---

**4. Extract Elon Musk's quotes from all the .csv.bz2 files**

This can also be parallelized across years.

*Single year extracting:*

```python
from functions import get_quotes
from functions import make_csv

year = 2020
speaker = 'Elon Musk'

df = get_quotes(speaker, year)
make_csv(df, speaker, year, compression='bz2')
```

*Loop:*

```python
from functions import get_quotes
from functions import make_csv

speaker = 'Elon Musk'
for year in range(2015, 2021):
    df = get_quotes(speaker, year, timing=True)
    make_csv(df, speaker, year, compression='bz2')
```

This will output a single bz2-compressed csv file per year containing only the speaker's quotes.

Format: `SPEAKER-quotes-YEAR.csv.bz2`

Example: `Elon Musk-quotes-2018.csv.bz2`
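For orientation, the filtering in this step could be approximated as below. This is a sketch under assumptions, not the code behind `get_quotes`: the name `get_quotes_sketch` is hypothetical, it assumes the chunk files have a `speaker` column, and it matches the speaker name exactly.

```python
import bz2
import csv
import glob
import os


def get_quotes_sketch(speaker, year, data_dir="Data"):
    """Collect the rows attributed to speaker from one year's chunk files."""
    pattern = os.path.join(data_dir, f"quotes-{year}-*.csv.bz2")
    matches = []
    for path in sorted(glob.glob(pattern)):
        with bz2.open(path, "rt", encoding="utf-8", newline="") as chunk:
            # Exact string comparison on the assumed 'speaker' column.
            matches.extend(
                row for row in csv.DictReader(chunk) if row.get("speaker") == speaker
            )
    return matches
```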

---

**5. Combine the yearly csv files into one file with all of Elon Musk's quotes from 2015-2020**

```python
from functions import combining_yearly_quotes

speaker = 'Elon Musk'  # defined here so the cell also runs on its own

combining_yearly_quotes(speaker)
```

This will output a single bz2-compressed csv with all of the speaker's quotes from the years available in the `/Data` folder.

`all-SPEAKER-quotes.csv.bz2`

`all-Elon Musk-quotes.csv.bz2`
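Conceptually, combining amounts to concatenating the yearly files while keeping a single header row. The sketch below shows one way to do that with the standard library. `combine_yearly_sketch` is a hypothetical name, and it assumes all yearly files share the same columns in the same order and end with a newline.

```python
import bz2
import glob
import os
import shutil


def combine_yearly_sketch(speaker, data_dir="Data"):
    """Concatenate SPEAKER-quotes-YEAR.csv.bz2 files, keeping one header row."""
    pattern = os.path.join(data_dir, f"{glob.escape(speaker)}-quotes-*.csv.bz2")
    yearly_paths = sorted(glob.glob(pattern))
    out_path = os.path.join(data_dir, f"all-{speaker}-quotes.csv.bz2")
    with bz2.open(out_path, "wt", encoding="utf-8", newline="") as combined:
        for position, path in enumerate(yearly_paths):
            with bz2.open(path, "rt", encoding="utf-8", newline="") as yearly:
                header = yearly.readline()
                if position == 0:
                    combined.write(header)  # keep the first file's header only
                shutil.copyfileobj(yearly, combined)  # stream the remaining rows
    return out_path
```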

---

**6. Cut off low probability quotes**

```python
from functions import high_probability_quotes
import pandas as pd

cutoff = 0.7

df = pd.read_csv('Data/all-Elon Musk-quotes.csv.bz2')

hdf = high_probability_quotes(df, cutoff)

hdf.to_csv('Data/high-prob-Elon Musk-quotes.csv.bz2', compression='bz2', index=False)
```

This will output a single bz2-compressed csv in `/Data` with all of the speaker's high-probability quotes.

`high-prob-Elon Musk-quotes.csv.bz2`
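How `high_probability_quotes` applies the cutoff is not shown here. The sketch below illustrates one plausible approach, assuming the csv has a `probas` column holding `[speaker, probability]` pairs that were written out as the string form of a Python list. The helper name `speaker_probability` is hypothetical.

```python
import ast


def speaker_probability(probas, speaker):
    """Return the probability assigned to speaker in a `probas` cell, or 0.0."""
    # After a round trip through csv the list arrives as its string form.
    pairs = ast.literal_eval(probas) if isinstance(probas, str) else probas
    return max((float(p) for name, p in pairs if name == speaker), default=0.0)


# Usage sketch with pandas (whether the real cutoff is inclusive is an assumption):
# keep = df["probas"].map(lambda p: speaker_probability(p, "Elon Musk")) >= cutoff
# hdf = df[keep]
```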

---

**7. Add organizational data**

```python
import pandas as pd
from functions import create_org_df

# Name of the spaCy model to load; the large model was downloaded in step 2.
spacy_model = 'en_core_web_lg'

df = pd.read_csv('Data/high-prob-Elon Musk-quotes.csv.bz2')

org_df = create_org_df(spacy_model, df)

org_df.to_csv('Data/org-lg-Elon Musk.csv.bz2', compression='bz2', index=False)
```

This will output a single bz2-compressed csv with all of the speaker's quotes that mention a company.

`org-lg-Elon Musk.csv.bz2`
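The company detection presumably relies on spaCy's named entity recognition, where the `ORG` label covers companies, agencies, and institutions. The sketch below shows the core idea. `organizations_in` is a hypothetical helper rather than part of `create_org_df`, and it takes the loaded pipeline as an argument so that the model is loaded only once.

```python
def organizations_in(text, nlp):
    """Return the texts of all ORG entities that the pipeline finds in text."""
    return [ent.text for ent in nlp(text).ents if ent.label_ == "ORG"]


# Usage sketch (requires spaCy and the model from step 2):
# import spacy
# nlp = spacy.load("en_core_web_lg")
# organizations_in("Tesla and SpaceX are hiring engineers.", nlp)
#
# A quote would then be kept when the returned list is not empty.
```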

---