#Introduction

In this guide, we will create a structured document database based on the [Institute for the Study of War](http://www.understandingwar.org/) (ISW) production library. ISW creates informational products that help diplomatic and intelligence professionals gain a deeper understanding of conflicts occurring around the world. 

This guide is an exercise in web extraction, natural language processing (NLP), and named entity recognition (NER). For the NLP, we will primarily be using the open-source Python libraries NLTK and spaCy. This guide is intended primarily to be a demonstration of a use case for web extraction and NLP, **not** a comprehensive beginner tutorial on either technique. If you are new to NLP or web extraction, I would urge you to follow a different guide or look through the [spaCy](https://spacy.io/api/doc), [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), and [NLTK](https://www.nltk.org/) documentation pages. However, I will attempt throughout this guide to explain concepts at the simplest level for those who have never been exposed to them before. If, at the end of the guide, you are still confused and would like some further explanation, you will find my LinkedIn information below the article. Feel free to reach out to me with any questions or suggestions! Additionally, you can find all of the original code for this guide in the [Colaboratory Notebook](https://colab.research.google.com/drive/1pTrOXW3k5VQo1lEaahCo79AHpyp5ZdfQ?usp=sharing). 

##Import Relevant Libraries

In [None]:
#Import libraries

import requests
import nltk
import math
import re
import spacy
from collections import Counter
import pandas as pd
import numpy as np
import statistics as stats
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import json

#You will need to download some packages from NLTK. 

from bs4 import BeautifulSoup 
from nltk import *
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords

#In most environments, you will need to install NER-D. 

!pip install ner-d
from nerd import ner

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


##Initializing Variables

First, we will initialize the data fields we want in our final structured data. For each article, I want to extract the **title**, **date of publication**, **names of people**, **names of places**, and various other information. I'll also be enhancing the information that already exists in the article - for example, we'll use the place names we find in the article to get relevant coordinates, which could be useful for visualizing data later on. We'll also initialize some fields that will be filled in by **machine learning** models later, like topic modeling fields and extracted keywords.

In [None]:
#Initialize data fields for final dataset

dates=[]
titles=[]
locations=[]
people=[]
key_countries=[]
content_text=[]
links=[]
coord_list=[]
mentioned_countries=[]
keywords=[]
topic_categories=[]

#initializing clustering variables for later topic modeling

cluster_keywords=[]
cluster_number=[]

#Initialize the NLP object using the SPACY library
nlp = spacy.load("en_core_web_sm")

In [None]:
#Defining a list of all the countries in the world, for later referencing in extracted text

country_list=['Afghanistan','Albania','Algeria','Andorra','Angola','Antigua','Barbuda',
              'Argentina','Armenia','Australia','Austria','Azerbaijan','Bahamas','Bahrain',
              'Bangladesh','Barbados','Belarus','Belgium','Belize','Benin','Bhutan',
              'Bolivia','Bosnia','Herzegovina','Botswana','Brazil','Brunei','Bulgaria',
              'Burkina Faso','Burundi',"Cote d'Ivoire",'Cabo Verde','Cambodia','Cameroon',
              'Canada','Central African Republic','Chad','Chile','China','Colombia','Comoros',
              'Congo','Costa Rica','Croatia','Cuba','Cyprus','Czechia','Czech Republic',
              'Democratic Republic of the Congo','Denmark','Djibouti','Dominica',
              'Dominican Republic','Ecuador','Egypt','El Salvador','Equatorial Guinea',
              'Eritrea','Estonia','Eswatini','Ethiopia','Fiji','Finland','France','Gabon',
              'Gambia','Georgia','Germany','Ghana','Greece','Grenada','Guatemala','Guinea',
              'Guinea-Bissau','Guyana','Haiti','Holy See','Honduras','Hungary','Iceland',
              'India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan',
              'Jordan','Kazakhstan','Kenya','Kiribati','Kuwait','Kyrgyzstan','Laos','Latvia',
              'Lebanon','Lesotho','Liberia','Libya','Liechtenstein','Lithuania','Luxembourg',
              'Madagascar','Malawi','Malaysia','Maldives','Mali','Malta','Marshall Islands',
              'Mauritania','Mauritius','Mexico','Micronesia','Moldova','Monaco','Mongolia',
              'Montenegro','Morocco','Mozambique','Myanmar','Burma','Namibia','Nauru','Nepal',
              'Netherlands','New Zealand','Nicaragua','Niger','Nigeria','North Korea',
              'North Macedonia','Norway','Oman','Pakistan','Palau','Palestine','Palestine State',
              'Panama','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal',
              'Qatar','Romania','Russia','Rwanda','Saint Kitts and Nevis','Saint Lucia',
              'Saint Vincent and the Grenadines','Samoa','San Marino','Sao Tome and Principe',
              'Saudi Arabia','Senegal','Serbia','Seychelles','Sierra Leone','Singapore',
              'Slovakia','Slovenia','Solomon Islands','Somalia','South Africa','South Korea',
              'South Sudan','Spain','Sri Lanka','Sudan','Suriname','Sweden','Switzerland','Syria','Tajikistan',
              'Tanzania','Thailand','Timor-Leste','Togo','Tonga','Trinidad','Tobago','Tunisia',
              'Turkey','Turkmenistan','Tuvalu','Uganda','Ukraine','United Arab Emirates','United Kingdom',
              'United States of America','Uruguay','Uzbekistan','Vanuatu','Venezuela',
              'Vietnam','Yemen','Zambia','Zimbabwe']


##Extracting Hrefs

We will be extracting our articles from the Institute for the Study of War's production library. First, we will scrape the **'browse'** page to get individual href links for each product. Then we store those links in a list for our extraction functions to visit later. 

In [None]:
#Getting product links from ISW browse page

base_url='http://www.understandingwar.org'
urls=[base_url+'/publications?page={}'.format(i) for i in range(179)]
hrefs=[]

def get_hrefs(page_url,class_name):
  #Download one browse page and store its site-relative links as absolute URLs
  page=requests.get(page_url)
  soup=BeautifulSoup(page.text,'html.parser')
  container=soup.find('div',{'class':class_name})
  if container is None:
    return
  for a in container.find_all('a'):
    link=a.get('href')
    #Skip anchors without an href and links that leave the site
    if link and link.startswith('/'):
      hrefs.append(base_url+link)

for url in urls:
  get_hrefs(url,'view-content')
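
As a small illustration of the link-handling step, the sketch below normalizes a handful of href values without touching the network. It uses `urljoin` from the standard library instead of string concatenation, which is a substitution on my part, and it also drops repeated links. The `normalize_links` helper and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.understandingwar.org'

def normalize_links(raw_links, base_url=BASE_URL):
    """Turn site-relative hrefs into absolute URLs, dropping blanks, off-site links, and repeats."""
    seen = {}
    for link in raw_links:
        # Mirror the check in get_hrefs: keep only links that start with '/'
        if link and link.startswith('/'):
            seen.setdefault(urljoin(base_url, link), None)
    return list(seen)

# Hypothetical href values, shaped like what a browse page might contain
sample = ['/backgrounder/example-report', None, 'https://twitter.com/share',
          '/backgrounder/example-report', '/report/another-example']
print(normalize_links(sample))
```

Because a dictionary keeps insertion order, the links come back in the order they were first seen, which keeps the later lists aligned with the browse page.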


#Web Extraction 

The first few functions we'll write are fairly straightforward text extraction. This guide is not intended to be a tutorial on BeautifulSoup - for an introduction to web scraping in Python, check out the BeautifulSoup documentation [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). 

For our first function, we will be extracting the publication date. It scans through the HTML document extracted from the product's webpage and finds a field with the class of **'submitted'**. This contains our publication date. 

Next, we want the product title. Again, this field is conveniently labeled with a class of **'title'**. 

Finally, we will extract the full text of the article. I usually follow an 'extract-first, filter-later' style of web extraction: in my initial text extraction, I perform minimal filtering and processing of the text, and I conduct further processing later in my analysis as it becomes necessary. However, if you are more advanced, you may want to conduct more pre-processing of the extracted text than the function below demonstrates. Again, I recommend the BeautifulSoup [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for reference. 

For my *get_contents* function, I stuck to the bare bones - I listed a few HTML parent tags in a blacklist, for text that I don't want to be extracted. Strings of 10 characters or fewer are skipped as well. Then I extract all the remaining text from the page and append it to a temporary string, which in turn is appended to the *list* **content_text**.



In [None]:
#Function to extract publication date
def get_date(soup):
  try:
    data=soup.find('span',{'class':'submitted'})
    content=data.find('span')
    date=content.get('content')
    dates.append(date)
  except Exception:
    #Append a blank so every list keeps one entry per article
    dates.append('')

#Function to extract product title
def get_title(soup):
  try:
    title=soup.find('h1',{'class':'title'}).contents
    titles.append(title[0])
  except Exception:
    titles.append('')

#Function to extract product's text contents
def get_contents(soup):
  try:
    #Text whose direct parent is one of these tags is skipped
    parents_blacklist=['[document]','html','head',
                       'style','script','body',
                       'div','a','section','tr',
                       'td','label','ul','header',
                       'aside',]
    content=''
    text=soup.find_all(string=True)
  
    for t in text:
      #Also skip very short strings, which are mostly whitespace and menu labels
      if t.parent.name not in parents_blacklist and len(t) > 10:
        content=content+t+' '

    content_text.append(content)
  except Exception:
    content_text.append('')
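
To see the blacklist-and-length filter in isolation, here is a dependency-free sketch. It swaps BeautifulSoup for the standard library's `html.parser`, so it is not the method used in this guide, only the same idea in miniature. The `TextCollector` class, the `extract_text` helper, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'wbr'}

class TextCollector(HTMLParser):
    def __init__(self, blacklist, min_length=10):
        super().__init__()
        self.blacklist = set(blacklist)
        self.min_length = min_length
        self.open_tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching opening tag, tolerating unclosed tags in between
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # The innermost open tag plays the role of t.parent.name in get_contents
        parent = self.open_tags[-1] if self.open_tags else '[document]'
        if parent not in self.blacklist and len(data) > self.min_length:
            self.chunks.append(data.strip())

def extract_text(html, blacklist):
    collector = TextCollector(blacklist)
    collector.feed(html)
    collector.close()
    return ' '.join(collector.chunks)

sample_html = ('<html><body><div>Site navigation links</div>'
               '<p>This paragraph is long enough to keep.</p>'
               '<p>Too short</p>'
               '<script>var tracker = "ignored";</script></body></html>')
print(extract_text(sample_html, ['html', 'body', 'div', 'script']))
```

Only the long paragraph survives: the `div` and `script` text is blacklisted by parent, and the short paragraph fails the length check.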

#Natural Language Processing

Next, we will figure out which countries are referenced in the product. There are many APIs that could be used to check text content for countries, but here we will use a simple method: a humble *list* of all the countries in the world. This list is derived from [Wikipedia](https://en.wikipedia.org/wiki/Lists_of_countries_and_territories), so there may be more comprehensive options. This works fine for demonstration purposes, but I urge you to explore more advanced and elegant options!

After the function identifies all the countries mentioned in the article (stored in **mentioned_countries**), it uses basic statistical analysis to identify which countries are featured most prominently - these countries are most likely to be the point of focus for the article's narrative. To do this, the function counts the number of times each country is mentioned throughout the article and then finds the countries mentioned more times than the average. These countries are then appended to a **key_countries** *list*. 

In [None]:
#Reference a list of all countries against the body of the text. 
#Every whole-name match, including multi-word names such as 'Saudi Arabia', counts as a mention. 

#Longer names are tried first so that 'Papua New Guinea' is not counted as 'Guinea'
country_lookup={country.lower():country for country in country_list}
country_pattern=re.compile(
    r'\b('+'|'.join(re.escape(c) for c in sorted(country_list,key=len,reverse=True))+r')\b',
    re.IGNORECASE)

def get_countries(content_list):
  for i in range(len(content_list)):
    print('Getting countries',i+1,'/',len(content_list))
    matches=[country_lookup[m.lower()] for m in country_pattern.findall(content_list[i])]
    counted_countries=Counter(matches)

    #Unique countries, in order of first mention
    temp_list=list(counted_countries)
    if len(temp_list)==0:
      temp_list.append('Worldwide')
    mentioned_countries.append(temp_list)

    #Counting the number of mentions for each country, then checking each count against the mean. 
    #If a country is mentioned more than the mean number of mentions, it is noted as a key country. 

    key_list=[]
    if counted_countries:
      mean_mentions=np.mean(list(counted_countries.values()))
      key_list=[c for c in counted_countries if counted_countries[c] > mean_mentions]
    if len(key_list) != 0:
      key_countries.append(key_list)
    else:
      key_countries.append(temp_list)
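
The above-the-mean rule is easy to check by hand on a toy input. The sketch below uses only the standard library; the `key_items` helper and the mention counts are made up for illustration.

```python
from collections import Counter
from statistics import mean

def key_items(mentions):
    """Return the items mentioned more often than the mean mention count."""
    counts = Counter(mentions)
    if not counts:
        return []
    threshold = mean(list(counts.values()))
    above = [item for item, n in counts.items() if n > threshold]
    # When every item is mentioned equally often nothing exceeds the mean,
    # so fall back to all items, as get_countries does
    return above or list(counts)

mentions = ['Syria'] * 5 + ['Iraq'] * 3 + ['Turkey']
print(key_items(mentions))          # counts are 5, 3, 1; the mean is 3
print(key_items(['Iran', 'Iraq']))  # a tie, so both are returned
```

Note that the comparison is strict: a country mentioned exactly the mean number of times, like Iraq in the first example, is not treated as a key country.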

##Named Entity Recognition: Places

Next, we want to enrich our data somehow. Ultimately, the goal of structuring data is typically to perform some kind of analysis or visualization - in the case of this international conflict information, it could be valuable to plot the information graphically. To do this, we need coordinates corresponding to the articles. 

First, we will use **natural language processing (NLP)** and **named entity recognition (NER)** to extract place-names from the text. NLP is a form of machine learning, in which computer algorithms use grammar and syntax rules to learn relationships between words in text. Using that learning, NER is able to understand the role that certain words play within a sentence or paragraph. This tutorial is not intended to be a comprehensive introduction to NLP - for such a resource, try [this article on Medium](https://medium.com/@ODSC/an-introduction-to-natural-language-processing-nlp-8e476d9f5f59). 

To then find coordinates for the place names, we will use the **Open Cage API** to query for coordinates; you can make a free account and receive an API key [here](https://opencagedata.com/api). There are many other popular geo-coding APIs to choose from, but through trial and error I found Open Cage to have the best performance given the obscure place names of the Middle East.

First, we iterate through each place name retrieved from the article and query it in Open Cage. 

*Please note that your query allotment is very limited in the free version of Open Cage. If your query errors out, try again another day or use another API.* 

Once this is done, we will cross-reference the Open Cage results with the **mentioned_countries** list created earlier. This will ensure that the query results we retrieve are located in the correct place. 



In [None]:
#Uses NLP to extract place names, then queries against open-cage API to get coordinates for plotting

#Insert your own OpenCage API key here:
geo_api_key='Insert Your API Key Here'

def get_coords(content_list):
  for i in range(len(content_list)):
    print('Getting coordinates',i+1,'/',len(content_list))
    temp_list=[]
    text=content_list[i]

    #Applying the spaCy NER pipeline loaded earlier to find place names 
    #(entities labeled 'GPE'), de-duplicated in order of first mention. 

    doc=nlp(text)
    location=list(dict.fromkeys(X.text for X in doc.ents if X.label_ == 'GPE'))

    #Querying the locations in Open Cage. 
    
    for l in location:
      try: 
        #Passing the query through 'params' URL-encodes place names with spaces or accents
        page=requests.get('https://api.opencagedata.com/geocode/v1/json',
                          params={'q':l,'key':geo_api_key})
        for result in page.json()['results']:

          #This line of code checks that the country in the query result matches one of the countries
          #in the mentioned_countries for the overall article. If not, then the query result is likely a false positive. 

          if result['components'].get('country') in mentioned_countries[i]:
            coordinates={'Location': l,
                         'Lat': result['geometry']['lat'],
                         'Lon': result['geometry']['lng']}
            temp_list.append(coordinates)
            break
      except Exception: 
        continue
    coord_list.append(temp_list)
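
The country cross-check can be tried offline on a hand-written response. In the sketch below, `build_query_url` and `pick_coordinates` are hypothetical helpers, and `sample_results` is a stand-in with approximate coordinates that only imitates the fields read by `get_coords` (`components.country`, `geometry.lat`, `geometry.lng`); it is not real API output.

```python
from urllib.parse import urlencode

def build_query_url(place, api_key):
    """Build an OpenCage request URL with the place name safely URL-encoded."""
    return ('https://api.opencagedata.com/geocode/v1/json?'
            + urlencode({'q': place, 'key': api_key}))

def pick_coordinates(place, results, allowed_countries):
    """Return the first result whose country was mentioned in the article, or None."""
    for result in results:
        country = result.get('components', {}).get('country')
        if country in allowed_countries:
            return {'Location': place,
                    'Lat': result['geometry']['lat'],
                    'Lon': result['geometry']['lng']}
    return None

# Hand-written stand-in for the 'results' list of a response
sample_results = [
    {'components': {'country': 'Lebanon'}, 'geometry': {'lat': 34.44, 'lng': 35.85}},
    {'components': {'country': 'Libya'}, 'geometry': {'lat': 32.89, 'lng': 13.19}},
]
print(build_query_url('Deir ez-Zor', 'YOUR_KEY'))
print(pick_coordinates('Tripoli', sample_results, ['Libya', 'Egypt']))
```

For an article that mentions Libya but not Lebanon, the first candidate is rejected and the Libyan result is kept, which is the disambiguation the cross-check is meant to provide.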


##Named Entity Recognition: People

Next, we will extract the names of people mentioned in the article. To do this, we will again use NER, this time through the spaCy pipeline loaded earlier and keeping the entities labeled **PERSON**. 

In the final structured data, I only want full names. Wouldn't it be confusing to find a data entry with a *'mentioned person'* of "Jack" or "John"? It would be pretty much useless. To avoid this, we will once again employ some rudimentary statistics. The function will track full names when they are mentioned, usually in the beginning of the text. When a partial name is mentioned later, it will reference the list of full names to identify who the partial name is referencing. For example, if a news article read as follows: *"Joe Biden is running for President. Joe is best known as the Vice President for former President Barack Obama."* We know that **Joe** is referencing **Joe Biden**, because his full name was given earlier in the text. This function will operate in that same way. 

In the case of duplicates, the function will use the same statistics used earlier for the country function. It will measure a count of how many times a name was mentioned, and use that as the most likely identifier. Example: *'Joe Biden and his son, Hunter Biden, are popular US politicians. Joe Biden is the former VP. Biden is now making a run for president against incumbent Donald Trump.'* We know that **'Biden'** is referencing **'Joe Biden'** from context. The passage is clearly about Joe Biden, not Hunter Biden, based on the statistical focus of the text. 

Once the function has figured out all the full names mentioned, it will add them to a list. It will then query each name in **Wikipedia**, to verify that it is the name of an influential person worthy of being included in the structured data. 

In [None]:
def get_people(content_list):
  iteration=1

  #Using NER to find people names in the text. 
  
  for i in range(len(content_list)):
    print('Getting people',iteration,'/',len(content_list))
    temp_list=[]
    text=content_list[i]
    doc=nlp(text)
    persons=[X.text for X in doc.ents if X.label_ == 'PERSON']
    persons_dict=dict.fromkeys(persons,0)
    persons=list(persons_dict)

    full_names=[]
    for person in persons: 
      if len(word_tokenize(person)) >= 2:
        string_name=re.sub(r"[^a-zA-Z0-9]+", ' ', person).strip()
        full_names.append(string_name)
  
    final_names=[]
    for person in persons:
      for name in full_names:
        tokens=word_tokenize(name)
        for n in range(len(tokens)):
          if person==tokens[n]:
            final_names.append(name)

    for name in full_names:
      final_names.append(name)

    name_dict=dict.fromkeys(final_names,0)
    final_names=list(name_dict)
    valid_names=[]

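    for name in final_names:
      #A 200 response means Wikipedia has a page under this exact name
      try:
        page=requests.get('https://en.wikipedia.org/wiki/'+name.replace(' ','_'))
      except requests.RequestException:
        continue
      if page.status_code==200:
        valid_names.append(name)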

    people.append(valid_names)
    iteration+=1
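
The frequency-based tie-break described before the function can be sketched on its own. Note that `get_people` as written matches a partial name against every full name that contains it, so the `resolve_names` helper below is a hypothetical refinement, not a copy of the code above. It takes a plain list of PERSON mentions, as if they had already come out of NER.

```python
from collections import Counter

def resolve_names(mentions):
    """Map every PERSON mention to a full name, using mention counts to break ties."""
    full_name_counts = Counter(m for m in mentions if len(m.split()) >= 2)
    resolved = []
    for mention in mentions:
        if mention in full_name_counts:
            resolved.append(mention)
            continue
        # Full names that contain the partial name as one of their words
        candidates = [name for name in full_name_counts if mention in name.split()]
        if candidates:
            # Prefer the candidate that the text mentions most often
            resolved.append(max(candidates, key=lambda name: full_name_counts[name]))
    return resolved

mentions = ['Joe Biden', 'Hunter Biden', 'Joe Biden', 'Biden', 'Donald Trump']
print(resolve_names(mentions))
```

Here the lone 'Biden' resolves to 'Joe Biden' because that full name appears twice and 'Hunter Biden' only once. A partial name with no matching full name is dropped, in keeping with the goal of storing full names only.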

##Keyword Extraction: Term Frequency-Inverse Document Frequency

Our next task is to extract keywords from the text. The most common method of doing this is called **Term Frequency-Inverse Document Frequency (TF-IDF)**. Basically, TF-IDF models measure how often a term or word is used within a single document, then compare that to how widely it is used across the entire corpus of documents. If a term is used frequently in a single document, and infrequently across the entire corpus of documents, then it is likely that term represents a keyword unique to that specific document. This guide is not meant to be a comprehensive overview of TF-IDF models. For more information, check out [this article on Medium](https://medium.com/datadriveninvestor/tf-idf-in-natural-language-processing-8db8ef4a7736). 

First, our function will create what is commonly known as a *'bag-of-words'*: a vocabulary of every word used across all documents. Then, it counts how many times each word is used in each document - the **term frequency**. Next, it takes the logarithm of the ratio between the total number of documents and the number of documents containing the term - the **inverse document frequency**. The product of the two is written to a sparse matrix with one row per document and one column per term; sorting a document's row helps us find the words most likely to represent unique keywords for that document. 

In [None]:
#The first function pre-processes text by lowering the case of characters, 
#replacing HTML-style tags, and collapsing digits and special characters into spaces. 

def pre_process(text):
    text=text.lower()
    text=re.sub(r"</?.*?>"," <> ",text)
    text=re.sub(r"(\d|\W)+"," ",text)
    return text

#This function pairs each term's column index with its TF-IDF score for one 
#document (stored as a sparse COO matrix), then sorts the pairs from highest to lowest score. 

def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

#As with above, this is a helper function that will assist in the sorting and
#selection of keywords once the frequencies have been mapped to matrices. 
#This function specifically helps us to choose the most relevant keywords, 
#based on TF-IDF statistics

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    #Keep the topn highest-scoring terms and map each term to its rounded score
    return {feature_names[idx]: round(score, 3) for idx, score in sorted_items[:topn]}

#The final function, which incorporates the above helper functions, 
#applies a TF-IDF algorithm to the body of our text to find keywords based
#on frequency of usage. 

def get_keywords(content_list):
  processed_text=[pre_process(text) for text in content_list]
  #Recent scikit-learn versions expect the stop words as a list rather than a set
  stop_words=stopwords.words('english')
  cv=CountVectorizer(max_df=0.85,stop_words=stop_words)
  word_count_vector=cv.fit_transform(processed_text)

  tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
  tfidf_transformer.fit(word_count_vector)

  #get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
  feature_names=cv.get_feature_names_out()

  for i in range(len(processed_text)):
    print('Getting Keywords',i+1,'/',len(content_list))
    doc=processed_text[i]
    tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))
    sorted_items=sort_coo(tf_idf_vector.tocoo())
    keys=extract_topn_from_vector(feature_names,sorted_items,10)
    keywords.append(list(keys.keys()))
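
To make the arithmetic behind TF-IDF concrete, here is a hand-rolled sketch on a three-document toy corpus. It uses the smoothed inverse document frequency that scikit-learn documents for `smooth_idf=True`, idf = ln((1 + n) / (1 + df)) + 1, and leaves out the length normalization that `TfidfTransformer` applies afterwards, so the numbers will not match scikit-learn's output exactly. The `tf_idf` helper and the tokenized documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Score each term in each document as term frequency times smoothed IDF."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{term: count * idf[term] for term, count in Counter(doc).items()}
            for doc in tokenized_docs]

docs = [['isis', 'attack', 'mosul', 'mosul'],
        ['isis', 'attack', 'aleppo'],
        ['isis', 'kremlin', 'sanctions']]
scores = tf_idf(docs)
top_term = max(scores[0], key=scores[0].get)
print(top_term, round(scores[0][top_term], 3))
```

In the first document, 'isis' appears in every document and so gets the lowest possible weight, while 'mosul' is both repeated and unique to that document, which makes it the top keyword.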


## Topic Modeling 

One of the most common tasks in NLP is known as **topic modeling**. This is a form of clustering that attempts to automatically sort documents into categories based on their text content. In this specific instance, I would like to know *at-a-glance* whether ISW is writing more products about Operation Inherent Resolve in Iraq, or about the Russian intervention in Belarus. Topic modeling is the solution to this problem. By sorting articles into categories based on text content, I can easily gain an *at-a-glance* understanding of each document's main ideas. 

For this example, I will be using a k-means clustering algorithm to conduct topic modeling. First, I will use a TF-IDF algorithm again to vectorize each document. **Vectorization** is a machine-learning term that refers to the transformation of non-numeric data into numeric spatial data that the computer can use to conduct machine learning tasks. 

Once documents are vectorized, a helper function plots the sum of squared errors (SSE) for a range of cluster counts, so we can see what the optimal number of clusters is. *(The **k** in **k-means**)*. In this case, the optimal number was **50**. Once I found the optimal number, in this example I commented out that line of code and manually set the parameter to 50. That is because the dataset I am analyzing does not change often, so I can expect the number of optimal clusters to stay the same over time. For data that changes more frequently, you should return the optimal number of clusters as a variable - this will help your clustering algorithm to automatically set its optimal parameters. I demonstrate an example of this in [my time-series analysis article](https://towardsdatascience.com/investigate-anomalies-in-temporal-data-with-machine-learning-b3da86793142) on Medium. 

Once clustering is complete, I save the cluster label of each document *(0-49)* to a list, **cluster_number**, and the keywords making up each cluster to a list of **cluster_keywords**. These cluster keywords will be used later to add a title to each topic cluster. 

In [None]:
#This function checks the clustering algorithm against various 'k' parameters 
#to find the optimal value of 'k'. 

def find_optimal_clusters(data, max_k):
    iters = range(2, max_k+1, 2)
    
    sse = []
    for k in iters:
        sse.append(MiniBatchKMeans(n_clusters=k, 
                                   init_size=1024, 
                                   batch_size=2048,
                                   random_state=20).fit(data).inertia_)
        
        print('Fit {} clusters'.format(k))
        
    f, ax = plt.subplots(1, 1)
    ax.plot(iters, sse, marker='o')
    ax.set_xlabel('Cluster Centers')
    ax.set_xticks(iters)
    ax.set_xticklabels(iters)
    ax.set_ylabel('SSE')
    ax.set_title('SSE by Cluster Center Plot')

#Getting keywords from content lists, 
#to help with categorizing topic clusters later

def get_top_keywords(data, clusters, labels, n_terms):
  #Average the TF-IDF vectors within each cluster, then keep the n_terms highest-weighted terms
  df = pd.DataFrame(data.todense()).groupby(clusters).mean()
  for i,r in df.iterrows():
    cluster_keywords.append(','.join([labels[t] for t in np.argsort(r.to_numpy())[-n_terms:]]))
    
#Applying clustering to content lists for topic modeling

def get_topics(content_list):
  processed_text=[pre_process(text) for text in content_list]
  #Recent scikit-learn versions expect the stop words as a list rather than a set
  stop_words=stopwords.words('english')
  cv=CountVectorizer(max_df=0.85,stop_words=stop_words)
  word_count_vector=cv.fit_transform(processed_text)

  tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
  tfidf_transformer.fit(word_count_vector)

  #get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
  feature_names=cv.get_feature_names_out()
  vector=tfidf_transformer.transform(cv.transform(processed_text))

  #find_optimal_clusters(vector,50)

  clusters = MiniBatchKMeans(n_clusters=50, init_size=1024, batch_size=2048, random_state=20).fit_predict(vector)
  for cluster in clusters:
    cluster_number.append(int(cluster))
  
  get_top_keywords(vector, clusters, feature_names, 20)
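
For data that changes often, the elbow can be picked programmatically instead of being read off the plot. The sketch below uses one simple heuristic, the largest second difference of the SSE curve; this is an assumption on my part and only one of several possible rules. The `pick_elbow` helper is hypothetical, the SSE values are made up, and it assumes evenly spaced values of k. Using it here would also require `find_optimal_clusters` to return its `sse` list, which the version above does not do.

```python
def pick_elbow(ks, sse):
    """Return the k where the SSE curve bends most sharply (largest second difference)."""
    if len(ks) != len(sse) or len(ks) < 3:
        raise ValueError('need at least three (k, SSE) pairs of equal length')
    best_k, best_bend = None, float('-inf')
    for i in range(1, len(ks) - 1):
        # How much the improvement slows down after this k
        bend = (sse[i - 1] - sse[i]) - (sse[i] - sse[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Made-up SSE values for illustration only
ks = [2, 4, 6, 8, 10]
sse = [100.0, 70.0, 40.0, 37.0, 35.0]
print(pick_elbow(ks, sse))
```

With these values the SSE drops by 30 at each step up to k=6 and by only 3 afterwards, so k=6 is where adding clusters stops paying off.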

#Putting it Together

Finally, we will extract our data. Using the list of **hrefs** we got earlier, it is time to apply all of our extraction functions to the web content.

In [None]:
#Iterate through the hrefs extracted from 'browse', extract pertinent content 

iteration=1

#The first few functions rely on the original extracted web content as a parameter.
#These are all basic web-scraping techniques. 

for href in hrefs:
  print('Web scraping: iteration',iteration,'/',len(hrefs))
  page=requests.get(href)
  soup=BeautifulSoup(page.text,'html.parser')
  links.append(href)
  get_date(soup)
  get_title(soup)
  get_contents(soup)
  iteration+=1

#These next functions rely on the body of our text as a parameter.
#These are the NLP-based functions. 

#Note: Because of the querying of an external API, 
#a timeout is required to stop from overloading the server. 
#This portion of the code takes a very long time to run.

get_countries(content_text)
get_coords(content_text)
get_people(content_text)
get_keywords(content_text)
get_topics(content_text)

Streaming output truncated to the last 5000 lines.
Getting coordinates 237 / 1745
Getting coordinates 238 / 1745
...
Getting coordinates 266 / 1745
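
Since this stage makes thousands of requests to outside servers, it helps to pause between calls and to retry the occasional failure. The sketch below shows one way to do that with a small wrapper; `with_delay` is a hypothetical helper that is not part of the original notebook, and `flaky_fetch` is a stand-in for a real network call.

```python
import time

def with_delay(fetch, delay_seconds=1.0, retries=2):
    """Wrap a fetch function so each call pauses afterwards and retries on failure."""
    def wrapper(*args, **kwargs):
        for attempt in range(retries + 1):
            try:
                result = fetch(*args, **kwargs)
            except Exception:
                if attempt == retries:
                    raise
                # Wait a little longer after each failed attempt
                time.sleep(delay_seconds * (attempt + 1))
            else:
                time.sleep(delay_seconds)
                return result
    return wrapper

# Stand-in for a network call: fails once, then succeeds
calls = []
def flaky_fetch(url):
    calls.append(url)
    if len(calls) == 1:
        raise ConnectionError('simulated failure')
    return 'content of ' + url

polite_fetch = with_delay(flaky_fetch, delay_seconds=0.01)
print(polite_fetch('http://example.com/page'))
```

In the notebook, the same wrapper could be applied to `requests.get`, for example `polite_get = with_delay(requests.get)`, and then used wherever pages or API results are fetched in a loop.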

##Topic Modeling Enrichment

Our next problem is this: our clusters gave us a list of words associated with each cluster, but the clusters themselves are labeled only with numbers. That is enough to plot a word cloud or other visualization that helps us understand each cluster, but it is not as useful for *at-a-glance* understanding in a structured dataset. Additionally, I believe that some documents may fall within multiple topic categories. K-means assigns each document to exactly one cluster, so I will have to identify these documents manually. First, I'll print the first few rows of keywords to get an idea of the data I'm dealing with. 

In [None]:
cluster_keywords[:5]

['ret,islamic,political,province,military,groups,security,us,state,states,syrian,diyala,american,isis,nov,isw,qaeda,al,syria,iraq',
 'ar,raqqa,eastern,northern,al,campaign,city,province,update,forces,iraq,may,syrian,ez,zour,regime,deir,sdf,syria,isis',
 'haq,ahl,ib,gains,asa,january,activity,deir,control,ez,zour,isis,iraq,november,update,report,situation,syria,dec,december',
 'eu,political,president,bugayova,intsum,sanctions,nataliya,nato,west,elections,review,africa,information,military,ukrainian,russian,putin,ukraine,kremlin,russia',
 'damascus,isis,jabhat,military,homs,russian,province,pro,russia,groups,forces,rebel,al,city,assad,opposition,syria,aleppo,syrian,regime']

##Topic Modeling Enrichment (cont.)

After significant experimentation with a variety of techniques, I decided on a very simple approach. I scanned each list of keywords pertaining to each cluster, and noted significant keywords in each that related to a specific topic. At this stage, **domain knowledge** was key. I know, for example, that **Aleppo** in an ISW document is almost certainly mentioned in reference to the **Syrian Civil War**. For your data, if you lack the appropriate domain knowledge, you may need to do further research, consult someone else on your team, or define a more advanced programmatic method for titling your clusters. 

For this example, however, the simple approach works well. After making note of several significant keywords present in the cluster lists, I made a few lists of my own that contain keywords associated with the final topic categories I wanted in the structured data. The code below simply compares each cluster's list of keywords with the lists I created, then assigns a topic name based on matches in the lists. It then appends those final topics to a list of **topic_categories**.

In [None]:
#Searches lists of keywords corresponding to topic, 
#cross-references with cluster word banks to assign a topic category to each article. 

oir=['OIR Iraq','yezidis','mosul','peshmerga','isis','iraq','sinjar','baghdad','maliki',
     'daquq','anbar','isf','abadi','malaki','ramadi','iraqi','fallujah','dabiq']

terrorism=['Terrorism','jihadi','islamic','salafi','qaeda',
           'caliphate','isis','terrorist','terrorism']

syrian_conflict=['Syrian Conflict','sana','syria','assad',
                 'idlib','afrin','aleppo']

russia=['Russia','russia','belarus','slavic','kremlin','russian',
        'minsk','ukraine','putin']

iran=['Iran','iran','iranian','proxy','militias','militia','marjah']

turkey=['Turkey','erdogan','turkish','turkey']

ors=['ORS','kabul','ghani','pakistan','afghan','afghanistan',
     'taliban','ansf','karzai','helmand']

africa=['Africa','libya','libyan','egypt','egyptian','africa','african']

cat_list=[oir,terrorism,syrian_conflict,russia,iran,turkey,ors,africa]

topic_dict={}

for i in range(len(cluster_keywords)):
  temp_list=[]
  for n in nltk.word_tokenize(cluster_keywords[i]):
    for item in cat_list:
      if n in item:
        temp_list.append(item[0])
  
  temp_dict=dict.fromkeys(temp_list,0)
  temp_list=list(temp_dict)

  topic_dict[i] = temp_list

for num in cluster_number:
  topic_categories.append(topic_dict[num])
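
The matching step can also be written as a small standalone function built on set intersection. In this sketch, `label_cluster` is a hypothetical helper, the word banks are abbreviated versions of the lists above, and the cluster string is the fourth entry of the sample output printed earlier.

```python
def label_cluster(keyword_string, word_banks):
    """Return the name of every topic whose word bank shares a keyword with the cluster."""
    cluster_terms = set(keyword_string.split(','))
    return [topic for topic, bank in word_banks.items() if cluster_terms & bank]

# Abbreviated word banks; the full lists appear in the cell above
word_banks = {
    'Syrian Conflict': {'syria', 'assad', 'idlib', 'afrin', 'aleppo'},
    'Russia': {'russia', 'belarus', 'kremlin', 'russian', 'minsk', 'ukraine', 'putin'},
    'Africa': {'libya', 'libyan', 'egypt', 'egyptian', 'africa', 'african'},
}

cluster = ('eu,political,president,bugayova,intsum,sanctions,nataliya,nato,west,elections,'
           'review,africa,information,military,ukrainian,russian,putin,ukraine,kremlin,russia')
print(label_cluster(cluster, word_banks))
```

This cluster is tagged with both 'Russia' and 'Africa', the latter on the strength of a single keyword. If that is too loose for your data, one option is to require a minimum number of shared keywords before assigning a topic.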

#Database Creation

The very last step in this guide is to bring together all our extracted data. For this data, I prefer the **JSON** format. This is because I wanted to structure certain types of data differently - for example, the **locations** field will include a **list of dictionaries** of place names, latitudes, and longitudes. In my opinion, JSON format is the most effective way to store such formatted data to a local disk. I also backed up a copy of this database in a document database, **MongoDB**, but that is not the focus of this guide. If you are interested in saving your structured data to a document database, try [this article on Medium](https://medium.com/free-code-camp/learn-mongodb-a4ce205e7739). 

In [None]:
#Initialize an empty list for storage

db=[]

for i in range(len(hrefs)):
  countries={
      'focus area': key_countries[i],
      'all mentioned countries': mentioned_countries[i]
  }

  #Add all of the defined lists from our functions
  #To the new storage list

  doc={ 
      '_id': len(hrefs) - i,
       'title': titles[i],
       'date': dates[i],
       'places': coord_list[i],
       'people': people[i],
       'keywords': keywords[i],
       'countries': countries,
       'full text': content_text[i],
       'url': links[i],
       'topic cluster': cluster_number[i],
       'categories': topic_categories[i]
  }

  db.append(doc)

#Save the list as a .JSON data storage file inside google drive
#(for demonstration purposes)

with open('/content/drive/My Drive/Colab Notebooks/isw_products.json', 'w') as fout:
  json.dump(db, fout)
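
As a quick check that the nested layout survives a round trip, the sketch below writes one record to a temporary file, reads it back, and filters on a nested field. The record is entirely made up and only mirrors the field names used above; the temporary path stands in for the Google Drive location.

```python
import json
import os
import tempfile

# A single made-up record with the same fields as the documents built above
record = {
    '_id': 1,
    'title': 'Example Title',
    'date': '2020-01-01T00:00:00-05:00',
    'places': [{'Location': 'Example City', 'Lat': 0.0, 'Lon': 0.0}],
    'people': ['Example Person'],
    'keywords': ['example', 'keywords'],
    'countries': {'focus area': ['Syria'], 'all mentioned countries': ['Syria', 'Iraq']},
    'full text': 'Example text.',
    'url': 'http://www.understandingwar.org/example',
    'topic cluster': 0,
    'categories': ['Syrian Conflict'],
}

path = os.path.join(tempfile.mkdtemp(), 'isw_products_sample.json')
with open(path, 'w', encoding='utf-8') as fout:
    # ensure_ascii=False keeps characters such as curly quotes readable in the file
    json.dump([record], fout, ensure_ascii=False, indent=2)

with open(path, 'r', encoding='utf-8') as fin:
    loaded = json.load(fin)

# Filter on a nested field, the kind of query the nested layout is meant to support
syria_docs = [doc['title'] for doc in loaded if 'Syria' in doc['countries']['focus area']]
print(syria_docs)
```

Lists of dictionaries, such as the `places` field, come back exactly as they were written, which is the main reason for choosing JSON over a flat format here.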
  

#Summary 

Now we're done! We covered a lot in this guide. We extracted links from a web page, then used those links to extract even more content from the website. We then enriched that content using **external APIs**, ML **clustering algorithms**, and **NLP**. NLP today is one of the foremost buzzwords in the business intelligence community, and upon completion of this guide you can confidently execute intermediate-level operations in NLP for document analysis. You can conduct **TF-IDF vectorization**, **keyword extraction**, and **topic modeling**. These are cornerstones of NLP. Please [reach out](https://www.linkedin.com/in/cmbrew/) if you have more questions or need information, and good luck in your future NLP endeavors!

In [None]:
with open(r'/content/drive/My Drive/Colab Notebooks/isw_products.json','r') as read_file:
  data=json.load(read_file)

data[0:5]

[{'_id': 1745,
  'categories': ['Russia'],
  'countries': {'all mentioned countries': ['Russia',
    'Belarus',
    'Azerbaijan',
    'Kyrgyzstan',
    'Serbia'],
   'focus area': ['Russia', 'Belarus']},
  'date': '2020-10-20T15:07:13-04:00',
  'full text': 'Russia’s Unprecedentedly Expansive Military Exercises in Fall 2020 Seek to Recreate Soviet-Style Multinational Army | Institute for the Study of War Search form Russia’s Unprecedentedly Expansive Military Exercises in Fall 2020 Seek to Recreate Soviet-Style Multinational Army Oct 20, 2020 - George Barros, Mason Clark October 20, 2020 By Mason Clark and George Barros Key Takeaway: The Kremlin has conducted military exercises in fall 2020 on an unprecedented scale, much deeper than usual integration of Russian and foreign military units, and a pattern of modifying pre-announced activities significantly but presenting them as normal and unchanged. These exercises mark significant developments in the Kremlin’s campaigns to integrate th