# Advancing your natural language processing with Elasticsearch 
DEVFEST DC 2019  <br>   Summer Rankin, PhD <br> Lead Data Scientist <br> Booz | Allen | Hamilton <br> www.summerrankin.com www.github.com/1fmusic

After cleaning and topic modeling our data, we will now:

> 1. clean up and normalize this data (not the same as cleaning for NLP analysis)
> 2. push to our database using a low level client for elasticsearch:  **elasticsearch-py** https://elasticsearch-py.readthedocs.io/en/master/ 

+ Elasticsearch is an open-source search engine built on top of Apache Lucene.  https://www.elastic.co/
+ Built in Java
+ A NoSQL database
+ Data stored in JSON format. 

Data from https://www.kaggle.com/rounakbanik

In [18]:
import pickle
import pandas as pd

path = '/Volumes/ext200/Dropbox/metis/p4_fletcher/pick/'

# Load the raw (original) text 
+ should include the results (strongest topic for each document) mapped back onto this original dataframe

In [19]:
with open(path + 'ted_w_topic.pkl', 'rb') as picklefile:
    ted_w_topic = pickle.load(picklefile)
    
ted_w_topic.head()

Unnamed: 0,topic,transcript,url,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,views
0,memories,Good morning. How are you?(Laughter)It's been ...,https://www.ted.com/talks/ken_robinson_says_sc...,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,47227110
1,climate,"Thank you so much, Chris. And it's truly a gre...",https://www.ted.com/talks/al_gore_on_averting_...,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,3200520
2,technology,"(Music: ""The Sound of Silence,"" Simon & Garfun...",https://www.ted.com/talks/david_pogue_says_sim...,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,1636292
3,architecture,If you're here today — and I'm very happy that...,https://www.ted.com/talks/majora_carter_s_tale...,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,1697550
4,economics,"About 10 years ago, I took on the task to teac...",https://www.ted.com/talks/hans_rosling_shows_t...,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,12005869


### Dates need to be converted to datetime objects

In [None]:
ted_w_topic['film_date'] = pd.to_datetime(ted_w_topic['film_date'],unit='s')
ted_w_topic['published_date'] = pd.to_datetime(ted_w_topic['published_date'],unit='s')
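As a quick sanity check on `unit='s'`: the raw values are Unix timestamps in seconds. A minimal sketch using only the standard library (the helper name `epoch_to_utc` is made up for illustration) converts the first row's `film_date` by hand:

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# film_date of row 0 in the dataframe shown above
first_film_date = epoch_to_utc(1140825600)
print(first_film_date.date().isoformat())
```

If the printed date looks plausible for the talk, the column is in seconds; a column in milliseconds would need `unit='ms'` instead.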

### Lists need to have strings separated by a comma. 
No extra brackets, quotations

In [6]:
ted_w_topic.tags.head()

0    ['children', 'creativity', 'culture', 'dance',...
1    ['alternative energy', 'cars', 'climate change...
2    ['computers', 'entertainment', 'interface desi...
3    ['MacArthur grant', 'activism', 'business', 'c...
4    ['Africa', 'Asia', 'Google', 'demo', 'economic...
Name: tags, dtype: object

In [None]:
# raw strings avoid invalid-escape warnings for the regex patterns
ted_w_topic['tags'] = ted_w_topic['tags'].replace([r"\[", r"\]", "'"], "", regex=True).str.lower()
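One caveat worth checking: after the regex cleanup each row is still a single comma-separated string, and it is serialized to JSON as one string value. If a true JSON array is wanted in the index, one alternative (not what this notebook does) is to parse the stringified list into a Python list. A minimal sketch, where `parse_tags` is a hypothetical helper name:

```python
import ast
import json

def parse_tags(raw):
    """Turn a stringified list such as "['a', 'B']" into a lowercase list of strings."""
    return [tag.strip().lower() for tag in ast.literal_eval(raw)]

raw_tags = "['children', 'creativity', 'culture', 'dance']"
tags = parse_tags(raw_tags)

# A Python list serializes to a JSON array
print(json.dumps({"tags": tags}))
```

With pandas this could be applied as `ted_w_topic['tags'].apply(parse_tags)` on the original, uncleaned column.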

### This list has multiple types of separators; replace them all with commas and lowercase the text

In [9]:
ted_w_topic.speaker_occupation.head(8)

0                                Author/educator
1                               Climate advocate
2                           Technology columnist
3             Activist for environmental justice
4           Global health expert; data visionary
5    Life coach; expert in leadership psychology
6                    Actor, comedian, playwright
7                                      Architect
Name: speaker_occupation, dtype: object

In [None]:
# '/' and ';' need no escaping in a regex; raw strings keep the patterns warning-free
ted_w_topic['speaker_occupation'] = ted_w_topic['speaker_occupation'].replace([r'/', r';'], ', ', regex=True).str.lower()
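Replacing `;` with `', '` leaves a double space when the original already had a space after the semicolon. A sketch of a variant that also absorbs the surrounding whitespace, written with the standard `re` module so it can be tried without the dataframe (`normalize_occupation` is a name invented here):

```python
import re

# Any run of optional whitespace, a '/' or ';', and optional whitespace
SEPARATORS = re.compile(r"\s*[/;]\s*")

def normalize_occupation(text):
    """Replace '/' and ';' separators with ', ' and lowercase the result."""
    return SEPARATORS.sub(", ", text).lower()

for raw in ["Author/educator", "Global health expert; data visionary"]:
    print(normalize_occupation(raw))
```

The same pattern should work in pandas as `.replace(r'\s*[/;]\s*', ', ', regex=True)`.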

#### NaNs are bad: find them and destroy them

In [13]:
ted_w_topic.isna().sum()

topic                 0
transcript            0
url                   0
comments              0
description           0
duration              0
event                 0
film_date             0
languages             0
main_speaker          0
name                  0
num_speaker           0
published_date        0
ratings               0
related_talks         0
speaker_occupation    6
tags                  0
title                 0
views                 0
dtype: int64

In [None]:
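# assign the result back; inplace=True on a column accessor may not update the dataframe
ted_w_topic['speaker_occupation'] = ted_w_topic['speaker_occupation'].fillna('None')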
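Why NaNs matter for ingestion: `NaN` is not part of the JSON standard, so a strict JSON parser will typically reject a document that contains one. A small standard-library sketch (no Elasticsearch needed) shows that Python's `json` module emits the bare token `NaN` by default and refuses it in strict mode:

```python
import json

doc = {"speaker_occupation": float("nan")}

# Python's default is lenient and writes a bare NaN token
lenient = json.dumps(doc)
print(lenient)

# With allow_nan=False the encoder follows the JSON standard and raises
try:
    json.dumps(doc, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False
print(strict_ok)
```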

# Clean up for ingestion into database 
all the cleaning tasks in one place

+ dates converted to datetime objects
+ fill NaNs with something (string or number, depending on the column dtype)
+ remove brackets at the ends
+ make separators of words in lists the same (i.e. all commas), so Elastic will treat as an array 
+ lowercase these lists

In [20]:
ted_w_topic['film_date'] = pd.to_datetime(ted_w_topic['film_date'],unit='s')
ted_w_topic['published_date'] = pd.to_datetime(ted_w_topic['published_date'],unit='s')

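# fill NaNs before the string operations below
ted_w_topic['speaker_occupation'] = ted_w_topic['speaker_occupation'].fillna('None')

# unify separators ('/' and ';' become ', ') and lowercase
ted_w_topic['speaker_occupation'] = ted_w_topic['speaker_occupation'].replace([r'/', r';'], ', ', regex=True).str.lower()

# strip brackets and quotes from the stringified lists and lowercase
ted_w_topic['tags'] = ted_w_topic['tags'].replace([r"\[", r"\]", "'"], "", regex=True).str.lower()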

# drop cols I don't want
ted_ingest = ted_w_topic.drop(columns=['ratings','related_talks'])
ted_ingest.head()

Unnamed: 0,topic,transcript,url,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,speaker_occupation,tags,title,views,speaker_occupation_raw
0,memories,Good morning. How are you?(Laughter)It's been ...,https://www.ted.com/talks/ken_robinson_says_sc...,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,2006-02-25,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,2006-06-27 00:11:00,"author, educator","children, creativity, culture, dance, educatio...",Do schools kill creativity?,47227110,Author/educator
1,climate,"Thank you so much, Chris. And it's truly a gre...",https://www.ted.com/talks/al_gore_on_averting_...,265,With the same humor and humanity he exuded in ...,977,TED2006,2006-02-25,43,Al Gore,Al Gore: Averting the climate crisis,1,2006-06-27 00:11:00,climate advocate,"alternative energy, cars, climate change, cult...",Averting the climate crisis,3200520,Climate advocate
2,technology,"(Music: ""The Sound of Silence,"" Simon & Garfun...",https://www.ted.com/talks/david_pogue_says_sim...,124,New York Times columnist David Pogue takes aim...,1286,TED2006,2006-02-24,26,David Pogue,David Pogue: Simplicity sells,1,2006-06-27 00:11:00,technology columnist,"computers, entertainment, interface design, me...",Simplicity sells,1636292,Technology columnist
3,architecture,If you're here today — and I'm very happy that...,https://www.ted.com/talks/majora_carter_s_tale...,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,2006-02-26,35,Majora Carter,Majora Carter: Greening the ghetto,1,2006-06-27 00:11:00,activist for environmental justice,"macarthur grant, activism, business, cities, e...",Greening the ghetto,1697550,Activist for environmental justice
4,economics,"About 10 years ago, I took on the task to teac...",https://www.ted.com/talks/hans_rosling_shows_t...,593,You've never seen data presented like this. Wi...,1190,TED2006,2006-02-22,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,2006-06-27 20:38:00,"global health expert, data visionary","africa, asia, google, demo, economics, global ...",The best stats you've ever seen,12005869,Global health expert; data visionary


# Save dataframe as a list of Python dictionaries (one per row)

In [21]:
ted_ingest_d = ted_ingest.to_dict(orient='records')

# Here are a few ways to ingest our data: 

# Option 1: Elasticsearch-py as one object
### save as a python dictionary and push to our elasticsearch instance
+ can run Elastic locally, from a docker container, or in the cloud. We connect like we would any other server
+ Elasticsearch is already running, so we just connect to it with the elasticsearch-py library

### Connect to Elasticsearch locally

In [22]:
from elasticsearch import helpers, Elasticsearch

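# client with default settings
es = Elasticsearch("localhost:9200")
# second client (defaults to localhost:9200) with HTTP compression, used for the bulk uploads below
es_client = Elasticsearch(http_compress=True)
es.info()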

{'name': 'elasticsearch',
 'cluster_name': 'docker-cluster',
 'cluster_uuid': 'pe6uVqX0SruLz4VRKmqaGA',
 'version': {'number': '7.1.1',
  'build_flavor': 'default',
  'build_type': 'docker',
  'build_hash': '7a013de',
  'build_date': '2019-05-23T14:04:00.380842Z',
  'build_snapshot': False,
  'lucene_version': '8.0.0',
  'minimum_wire_compatibility_version': '6.8.0',
  'minimum_index_compatibility_version': '6.0.0-beta1'},
 'tagline': 'You Know, for Search'}

### Example of how to connect to Elasticsearch non-local
+ don't expose your passwords though

In [None]:
import certifi
from elasticsearch import Elasticsearch

try:
    es = Elasticsearch(
        ['https://yourcloudaddress.aws'],
        http_auth=('username', 'password'),  # placeholders, do not hard-code real credentials
        port=8080,
        use_ssl=True,
        ca_certs=certifi.where(),
    )
    print("Connected", es.info())
except Exception as ex:
    print("Error:", ex)
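One common way to keep passwords out of the notebook is to read them from environment variables. A minimal sketch, assuming the variable names `ES_USERNAME` and `ES_PASSWORD` (both invented for this example) have been set in the shell that launched Jupyter:

```python
import os

def load_es_credentials(env=os.environ):
    """Return (username, password) from environment variables, or raise if one is missing."""
    try:
        return env["ES_USERNAME"], env["ES_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Set the {missing.args[0]} environment variable first") from None

# Demo with a plain dict standing in for the real environment
demo_credentials = load_es_credentials({"ES_USERNAME": "demo_user", "ES_PASSWORD": "demo_pass"})
print(demo_credentials[0])
```

The returned tuple could then be passed as `http_auth=load_es_credentials()` in the connection cell.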

In [24]:
helpers.bulk(es_client, ted_ingest_d, index='ted_talks_1')

(2467, [])

# Option 2: Elasticsearch-py one document at a time
### save as dict/dataframe and use a generator to ingest with our elasticsearch client
+ this is useful if you have a large dataset that you don't want to hold in memory (or can't)

In [49]:
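def doc_generator(df):
    # yield one bulk action per dataframe row
    for index, document in df.iterrows():
        yield {
                "_index": 'ted_talks_1',
                "_source": document.to_dict(),  # plain dict instead of a pandas Series
            }
    # no explicit `raise StopIteration`: since Python 3.7 that becomes a RuntimeError inside a generator

helpers.bulk(es_client, doc_generator(ted_ingest), index='ted_talks_1')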
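One thing to watch with either bulk approach: when an action has no `_id`, Elasticsearch generates one, so running the cell twice indexes every talk twice. A sketch of a generator that sets `_id` explicitly, assuming plain dict records (such as the rows in `ted_ingest_d`) and assuming the position in the list is an acceptable identifier; `actions_with_ids` is a name made up here:

```python
def actions_with_ids(records, index_name):
    """Yield bulk actions whose _id is the record's position in the input."""
    for position, record in enumerate(records):
        yield {
            "_index": index_name,
            "_id": position,
            "_source": record,
        }

# Two hand-written records standing in for the real data
sample = [{"title": "Do schools kill creativity?"}, {"title": "Simplicity sells"}]
actions = list(actions_with_ids(sample, "ted_talks_1"))
print(actions[1])
```

It would be used as `helpers.bulk(es_client, actions_with_ids(ted_ingest_d, 'ted_talks_1'))`; re-running then overwrites the existing documents instead of duplicating them.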

### toy example

In [62]:
def gendata():
    mywords = ['foo', 'bar', 'baz']
    for word in mywords:
        yield {
            "_index": "mywords",
            "_type": "document",
            "doc": {"word": word},
        }

helpers.bulk(es, gendata())

(3, [])

# Option 3: Kibana GUI
## save as a delimited file (CSV, TSV, XML, JSON...) and ingest using the Kibana GUI (under "machine learning => data visualizer")
+ There is a limit to the size of the file you can ingest
+ tends to be a little fussier about things being in the right format

In [165]:
ted_ingest.to_csv('~/ted_w_topic.csv',index=False, encoding='utf8')

# Option 4: Requests library
Getting the pandas dataframe into the JSON format that the Elasticsearch bulk API expects (an action line followed by a document line, one pair per document) is kind of convoluted, but this code works just fine.

In [37]:
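import json
import requests

# use the dataframe index as the document _id
ted_ingest['_id'] = ted_ingest.index

# one JSON document per line
ted_as_json = ted_ingest.to_json(orient='records', lines=True)

# the bulk API expects an action line followed by a source line for each document
bulk_lines = []
for json_document in ted_as_json.split('\n'):
    if not json_document.strip():
        continue  # skip the empty string left by a trailing newline
    jdict = json.loads(json_document)
    metadata = json.dumps({'index': {'_id': jdict.pop('_id')}})
    bulk_lines.append(metadata)
    bulk_lines.append(json.dumps(jdict))
final_json_string = '\n'.join(bulk_lines) + '\n'  # the body must end with a newline

headers = {'Content-Type': 'application/x-ndjson'}

r = requests.post('http://localhost:9200/ted_req/_bulk', data=final_json_string, headers=headers, timeout=120)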
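The `_bulk` endpoint can return HTTP 200 even when individual documents fail, so it is worth looking inside the response body. A sketch of a helper (the name `summarize_bulk_response` is invented) that counts successes and collects failures, tried here on a hand-written sample shaped like a bulk response rather than on real output:

```python
def summarize_bulk_response(response):
    """Return (number_ok, failures) for a parsed _bulk response body."""
    ok = 0
    failures = []
    for item in response.get("items", []):
        # Each item has a single key naming the action, e.g. "index"
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append((result.get("_id"), result["error"]))
        else:
            ok += 1
    return ok, failures

# Hand-written sample, not real Elasticsearch output
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "0", "status": 201}},
        {"index": {"_id": "1", "status": 400,
                   "error": {"type": "mapper_parsing_exception", "reason": "failed to parse"}}},
    ],
}
n_ok, failed = summarize_bulk_response(sample_response)
print(n_ok, [doc_id for doc_id, _ in failed])
```

After the request above it could be called as `summarize_bulk_response(r.json())`.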


# Option 5: CURL
## Similar to the above method, we will save the JSON string we just created and ingest it using the command line.

In [44]:

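# write the bulk string as-is; json.dump would wrap it in quotes and escape the newlines
with open("ted_w_topic.json", "w") as tedf:
    tedf.write(final_json_string)


# the bulk endpoint is <index>/_bulk
#curl -X POST "http://localhost:9200/ted_curl/_bulk" -H "Content-Type: application/x-ndjson" --data-binary "@ted_w_topic.json"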
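Before posting a file to `_bulk`, it can help to confirm that it really is newline-delimited JSON: action and source lines in pairs, each line valid JSON, and a final newline. A rough checker under those assumptions (`check_bulk_body` is a hypothetical helper, and it only handles `index` actions):

```python
import json

def check_bulk_body(text):
    """Return the number of documents, or raise ValueError if text is not a valid-looking _bulk body."""
    if not text.endswith("\n"):
        raise ValueError("bulk body must end with a newline")
    lines = text[:-1].split("\n")
    if len(lines) % 2:
        raise ValueError("expected action/source line pairs")
    for action_line, source_line in zip(lines[::2], lines[1::2]):
        action = json.loads(action_line)  # JSONDecodeError is a ValueError
        if "index" not in action:
            raise ValueError(f"not an index action: {action_line}")
        json.loads(source_line)
    return len(lines) // 2

good_body = '{"index": {"_id": 0}}\n{"title": "a"}\n{"index": {"_id": 1}}\n{"title": "b"}\n'
print(check_bulk_body(good_body))
```

Passing the same string through `json.dumps` first produces one quoted line with escaped newlines, which this check rejects.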



# Option 6: Logstash 
https://www.elastic.co/products/logstash

Good tutorial from Catherine Ordun about ingesting a CSV using Logstash https://tm3.ghost.io/2017/10/12/kibana-dashboard-for-amazon-food-reviews/

Logstash is an open-source, server-side data processing pipeline. It is the 'L' in the ELK stack that you may have heard of. It is independent of Python and is made for mutating and converting data as it is ingested into an Elasticsearch index. It is a separate piece of software that you have to install and then spend a little time learning. I find that it is overkill if I'm just ingesting one large document, one time; it is made more for streaming data as it comes in (e.g. server logs).


In [None]:
ted_ingest.to_csv('~/ted_w_topic.csv',index=False, encoding='utf8')