### Visualizing topic prediction of NASA data metadata using LDA built on Wikipedia corpus

- Data source:https://data.nasa.gov/data.json
- Gensim tutorial: https://radimrehurek.com/gensim/wiki.html
- Exploratory visualization: https://public.tableau.com/views/NASAOpenDataMetadatatopic-predictionusingLDAbuiltfromWikipediacorpus/Sheet4

In [None]:
import json as json
import numpy as np
import pandas as pd

In [2]:
import gensim

## Load data files
These files are created by running parse_metadata.py which requires NASA's data to be saved as nasa_metadata2.json

In [3]:
keyworddf = pd.read_json('nasa_keyword.json')

In [4]:
titledf = pd.read_json('nasa_title.json')

In [5]:
descdf = pd.read_json('nasa_description.json')

## Helper functions for getting topic recommendations from documents
Code modified from http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html

In [6]:
import gensim
from gensim.utils import smart_open, simple_preprocess
from gensim.corpora.wikicorpus import _extract_pages, filter_wiki
from gensim.parsing.preprocessing import STOPWORDS

def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

def iter_wiki(dump_file):
    """Yield each article from the Wikipedia dump, as a `(title, tokens)` 2-tuple."""
    ignore_namespaces = 'Wikipedia Category File Portal Template MediaWiki User Help Book Draft'.split()
    for title, text, pageid in _extract_pages(smart_open(dump_file)):
        text = filter_wiki(text)
        tokens = tokenize(text)
        if len(tokens) < 50 or any(title.startswith(ns + ':') for ns in ignore_namespaces):
            continue  # ignore short articles and various meta-articles
        yield title, tokens

def rec_topic(doc, id2word, model):
    bow = id2word.doc2bow(tokenize(doc))
    doc_lda = model[bow]
    try:
        print(model.print_topic(max(doc_lda, key=lambda item: item[1])[0]))
    except:
        print ("Error processing", doc)
        
def get_topic_dict(doc, id2word, model):
    bow = id2word.doc2bow(tokenize(doc))
    doc_lda = model[bow]
    topic_dict = {}
    try:
        id_tup = model.get_topic_terms(max(doc_lda, key=lambda item: item[1])[0])  
        for item in id_tup:
            topic_dict[id2word[item[0]]] = item[1]
    except:
        print ("Error processing", doc)
    return topic_dict

## Load dictionary and LDA model generated from Wikipedia tarballs
See https://radimrehurek.com/gensim/wiki.html for how to generate an LDA model from the Wikipedia corpus.
I did not set up the distributed version, so it took my linux VM ~12 hours to convert the corpus into sparse TF-IDF vectors (step 1). Then generating the LDA Model (Step 2) took about 12 hours as well. I've uploaded the resulting files, so you don't have to regenerate these files if you don't want to.

In [7]:
import bz2
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt.bz2')
model = gensim.models.LdaModel.load('myldamodel.lda')

## Example of how the topic recommender works

In [36]:
# Example description text
example_desc = descdf.description[0]
print (example_desc)

USGS 15 minute stream flow data for Kings Creek on the Konza Prairie


In [37]:
# Example topics recommended for description
rec_topic(example_desc, id2word, model)

0.010*river + 0.008*lake + 0.006*creek + 0.005*park + 0.005*island + 0.005*mountain + 0.004*water + 0.004*reserve + 0.004*forest + 0.003*dam


### Let's compare it with the title

In [38]:
print(titledf.title[0])

15 Minute Stream Flow Data: USGS (FIFE)


### Let's compare it with the keywords

In [39]:
print(keyworddf.keyword[0])

[u'EARTH SCIENCE', u'HYDROSPHERE', u'SURFACE WATER']


## Preparing data structures to visualize topics predicted using the LDA model

For this first exploratory visualization, I am ignoring probabilities associated with each topic from the LDA model output. This is important to note, as we will see from the visualization that some topics are completely incorrect. The question is did LDA predict those topics with lower probabilities than the other ones?

This next cell took 5-8 hours to run. There are faster ways to do it, but I had other things to do, so I let this vm chug away

In [None]:
num_rows = descdf.description.count()

# Dictionary of topic to metadata id
docid_topic = {}

for i in range(0, num_rows): 
    # Print a . every 100 item, so that I know the notebook hasn't crashed
    if i%100 == 0:
        print ('.')
    item_id = descdf.id[i]
    
    # The id from all dataframes should match since they should refer to the same entry from the source data.json
    if (item_id != titledf.id[i] or item_id != keyworddf.id[i]):
        print ('ID mismatch', item_id, titledf.id[i], keyworddf.id[i])
        break
    
    desc_text = descdf.description[i]
    title_text = titledf.title[i]
    keywords = keyworddf.keyword[i]
    
    # Get the recommended topics in the structure of a dictionary, using the LDA model
    desc_topic_dict = get_topic_dict(desc_text, id2word, model)
    
    ''' Each item in the data.json has a unique id. 
        docid_topic uses this unique id as a dictionary key.
        to topic keys with a list.
        Initialize the list if the item id is not in the dictionary'''
    if item_id not in docid_topic.keys():
        docid_topic[item_id] = []
    
    for key in desc_topic_dict:
        if key not in docid_topic[item_id]:
            docid_topic[item_id].append(key)

### Write id:topics into a csv file

In [37]:
import csv
f = open('docids_topic.csv', 'wb')
fieldnames = ['doc_id', 'topics']
w = csv.DictWriter(f, fieldnames=fieldnames)
w.writeheader()

for doc_id in docid_topic:    
    # Encode list in utf-8. There are probably better ways to do this but I'm a python n00b
    topiclist = docid_topic[doc_id]  
    encodedlist = []
    for x in topiclist:
        encodedlist.append(x.encode('utf-8'))        
    # Write the doc_id, encoded in utf-8 and the list of topics to csv
    w.writerow({'doc_id': doc_id.encode('utf-8'), 'topics': encodedlist})

### Write other fields which may be interesting to visualize to csv
This didn't work too well for the Description data frame, possibly due to the fact that some of the descriptions have '\t' in them

In [27]:
titledf.to_csv('nasa_title.csv', sep='\t', encoding='utf-8')


In [28]:
descdf.to_csv('nasa_desc.csv', sep='\t', encoding='utf-8')


In [29]:
keyworddf.to_csv('nasa_keyword.csv', sep='\t', encoding='utf-8')


### This is when I realized the descdf also has its columns in the incorrect order.

In [38]:
descdf.head()

Unnamed: 0,description,id
0,USGS 15 minute stream flow data for Kings Cree...,55942a57c63a7fe59b495a77
1,USGS 15 minute stream flow data for Kings Cree...,55942a57c63a7fe59b495a78
2,ABSTRACT: USGS 15 minute stream flow data for ...,55942a58c63a7fe59b495a79
3,The 2000 Pilot Environmental Sustainability In...,55942a58c63a7fe59b495a7a
4,The 2000 Pilot Environmental Sustainability In...,55942a58c63a7fe59b495a7b


### Here we swap the pandas DataFrame columns

In [42]:
descdf2 = descdf[['id', 'description']]

In [43]:
descdf2.head()

Unnamed: 0,id,description
0,55942a57c63a7fe59b495a77,USGS 15 minute stream flow data for Kings Cree...
1,55942a57c63a7fe59b495a78,USGS 15 minute stream flow data for Kings Cree...
2,55942a58c63a7fe59b495a79,ABSTRACT: USGS 15 minute stream flow data for ...
3,55942a58c63a7fe59b495a7a,The 2000 Pilot Environmental Sustainability In...
4,55942a58c63a7fe59b495a7b,The 2000 Pilot Environmental Sustainability In...


### Write the description DataFrame to csv the old-fashioned way since DataFrame.to_csv didn't work out too well

In [45]:
f = open('nasa_desc3.csv', 'wb')
fieldnames = ['doc_id', 'description']
w = csv.DictWriter(f, fieldnames=fieldnames)

w.writeheader()

for index, row in descdf2.iterrows():
    w.writerow({'doc_id': row['id'].encode('utf-8'), 'description': row['description'].encode('utf-8')})

### After having fun visualizing the data in Tableau, I decided to add even more properties from the source data.json

In [47]:
keyworddf.to_csv('nasa_keywords.csv', sep='\t', encoding='utf-8')

In [48]:
pd.read_json('nasa_spatial.json').to_csv('nasa_spatial.csv', sep='\t', encoding='utf-8')

In [49]:
pd.read_json('nasa_theme.json').to_csv('nasa_theme.csv', sep='\t', encoding='utf-8')

In [50]:
pd.read_json('nasa_temporal.json').to_csv('nasa_temporal.csv', sep='\t', encoding='utf-8')

In [51]:
pd.read_json('nasa_landingPage.json').to_csv('nasa_landingPage.csv', sep='\t', encoding='utf-8')

### If you know how to embed Tableau visualizations in Jupyter notebook, send me a note!

Here's a visualization of how the keywords (human-tagged) match with the topics generated by LDA build from the wikipedia corpus https://public.tableau.com/views/NASAOpenDataMetadatatopic-predictionusingLDAbuiltfromWikipediacorpus/Sheet4
Some of the topic prediction are not very successful. For example, hover over [dashlink, Ames] 

Here's a visualization of how the titles (human-supplied) match with the LDA build from the wikipedia corpus
The size of each circle is scaled by the number of records with the same title
https://public.tableau.com/views/NASAOpenDataMetadatatopic-predictionusingLDAbuiltfromWikipediacorpus/Sheet4

### Tableau didn't like the format of the spatial field, so let's reformat it
The simplest approach is to assume that there is an array of doubles, split them into two and consider the first half of the array the latitude, and the second half is the longitude. 
This seems to work for the first 5 entries in the DataFrame, even though note that some of them are comma-delimited, and some are just space-delimited.
However, we will see that this assumption does not hold true for many entries.
For exploratory visualization I'm just ignoring those entries. 
For real data analysis, those entries should be correctly parsed as well. 


In [13]:
spatialdf = pd.read_json('nasa_spatial.json')

In [14]:

spatialdf.head()

Unnamed: 0,id,spatial
0,55942a57c63a7fe59b495a77,39.1 -96.6
1,55942a57c63a7fe59b495a78,39.1 -96.6
2,55942a58c63a7fe59b495a79,"39.1, -96.6"
3,55942a58c63a7fe59b495a7a,-180.0 -55.0 180.0 90.0
4,55942a58c63a7fe59b495a7b,-180.0 -55.0 180.0 90.0


Again, for exploratory visualization, I'm ignoring entries that are not comma or space-delimited double values. In this cell, we print out these entries so that future code can correctly handle them. Some of these entries are even in xml-format!

In [15]:
import re

def splitNumbers(spatialString):
    strList = re.split(' |,', spatialString)  
    numList = []
    for strval in strList:        
        strval = strval.strip()
        if len(strval) > 0:
            try:
                numList.append(float(strval))
            except:
                # Print location entries that we are not successfully parsing for now
                print(strval)
    return numList
    
# The code here is based on information from http://kb.tableau.com/articles/knowledgebase/convert-latitude-longitude
def convertSpatial(spatialEntry):
    numList = splitNumbers(spatialEntry)
    endPoint = len(numList)
    midPoint = len(numList)/2
    
    latitude = []
    longitude = []
    for i in range (0, midPoint):
        latitude.append(numList[i])
    
    for j in range (midPoint, endPoint):
        longitude.append(numList[i])
    
    if len(latitude) > 2:
        clatitude = latitude[0] + (latitude[1]/60) + (latitude[2]/3600)
    elif len(latitude) > 1:
        clatitude = latitude[0] + (latitude[1]/60)
    elif len(latitude) > 0:
        clatitude = latitude[0]  
    else:
        clatitude = None
    
    if len(longitude) > 2:
        clongitude = -1*(longitude[0] + (longitude[1]/60) + (longitude[2]/3600))
    elif len(longitude) > 1:
        clongitude = -1*(longitude[0] + (longitude[1]/60))
    elif len(longitude) > 0:
        clongitude = -1*(longitude[0])
    else:
        clongitude = None
        
    return clatitude, clongitude

latitudeColumn = []
longitudeColumn = []
for entry in spatialdf.spatial:
    latitude, longitude = convertSpatial(entry)
    latitudeColumn.append(latitude)
    longitudeColumn.append(longitude)
    

<?xml
version="1.0"
encoding="UTF-8"?><gml:Polygon
xmlns:gml="http://www.opengis.net/gml/3.2"
srsName="EPSG:9825"><gml:outerBoundaryIs><gml:LinearRing><gml:posList>-89.0
-180.0</gml:posList></gml:LinearRing></gml:outerBoundaryIs><gml:innerBoundaryIs></gml:innerBoundaryIs></gml:Polygon>
<?xml
version="1.0"
encoding="UTF-8"?><gml:Polygon
xmlns:gml="http://www.opengis.net/gml/3.2"
srsName="EPSG:9825"><gml:outerBoundaryIs><gml:LinearRing><gml:posList>-89.0
-180.0</gml:posList></gml:LinearRing></gml:outerBoundaryIs><gml:innerBoundaryIs></gml:innerBoundaryIs></gml:Polygon>
<?xml
version="1.0"
encoding="UTF-8"?><gml:Polygon
xmlns:gml="http://www.opengis.net/gml/3.2"
srsName="EPSG:9825"><gml:outerBoundaryIs><gml:LinearRing><gml:posList>-89.0
-180.0</gml:posList></gml:LinearRing></gml:outerBoundaryIs><gml:innerBoundaryIs></gml:innerBoundaryIs></gml:Polygon>
<?xml
version="1.0"
encoding="UTF-8"?><gml:Polygon
xmlns:gml="http://www.opengis.net/gml/3.2"
srsName="EPSG:9825"><gml:outerBoundaryIs><gml

In [16]:
spatialdf['latitude'] = latitudeColumn

In [17]:
spatialdf['longitude'] = longitudeColumn

In [18]:
spatialdf.head()

Unnamed: 0,id,spatial,latitude,longitude
0,55942a57c63a7fe59b495a77,39.1 -96.6,39.1,-39.1
1,55942a57c63a7fe59b495a78,39.1 -96.6,39.1,-39.1
2,55942a58c63a7fe59b495a79,"39.1, -96.6",39.1,-39.1
3,55942a58c63a7fe59b495a7a,-180.0 -55.0 180.0 90.0,-180.916667,55.916667
4,55942a58c63a7fe59b495a7b,-180.0 -55.0 180.0 90.0,-180.916667,55.916667


In [23]:
import csv
f = open('nasa_spatial3.csv', 'wb')
fieldnames = ['doc_id', 'latitude', 'longitude']
w = csv.DictWriter(f, fieldnames=fieldnames)
w.writeheader()

for index, row in spatialdf.iterrows():
    w.writerow({'doc_id': row['id'].encode('utf-8'), 'latitude': row['latitude'], 'longitude': row['longitude']})

## Visualization of data location in Tableau
After the data reformatting above, we can finally visualize where each data (source?) is located in a map in Tableau https://public.tableau.com/views/NASAOpenDataMetadatatopic-predictionusingLDAbuiltfromWikipediacorpus/Sheet6

However, the data format seems suspect to me due to the two distinct lines - one at the bottom of the map, and another diagonally across the globe. We will have to contact some NASA experts to verify if the visualization is correct.