# Predicting Conflict Types - NLP 

## Leveraging Data Science for ACLED (Armed Conflict Location & Event Data)

### Table of Contents

* [1. Introduction](#1.)
    * [1.1 What is ACLED?](#1.1)
    * [1.2 What is in ACLED?](#1.2)
    * [1.3 Hypothesis (What are we trying to predict?)](#1.3)
    * [1.4 Why are we Doing This?](#1.4)
* [2. Project Pipeline (DS Pipeline, Architecture Diagram)](#2)
* [3. Data Wrangling and Text Preprocessing](#section_1_2_1)
* [4. Exploratory Data Analysis (EDA)](#section_1_2_2)
* [5. Computation and Analysis](#section_1_2_3)
* [6. Modeling and Application](#section_1_2_3)
* [7. Reporting and Visualizations](#section_1_2_3)
        

# 1. Introduction <a class="anchor" id="1."></a>

## 1.1 What is ACLED? <a class="anchor" id="1.1"></a>

<img src="https://github.com/georgetown-analytics/ACLED/blob/main/ACLED%20Pictures/ACLED%20Dashboard.PNG?raw=true" alt="acled landing">

The Armed Conflict Location & Event Data Project (ACLED) is a disaggregated data collection, analysis, and crisis mapping project. ACLED collects the dates, actors, locations, fatalities, and types of all reported political violence and protest events across Africa, the Middle East, Latin America & the Caribbean, East Asia, South Asia, Southeast Asia, Central Asia & the Caucasus, Europe, and the United States of America. 

The ACLED team conducts analysis to describe, explore, and test conflict scenarios, and makes both data and analysis open for free use by the public.

ACLED is a registered non-profit organization with 501(c)(3) status in the United States. ACLED receives financial support from the Bureau of Conflict and Stabilization Operations at the United States Department of State, the Dutch Ministry of Foreign Affairs, the German Federal Foreign Office, the Tableau Foundation, the International Organization for Migration, and The University of Texas at Austin.

## 1.2 What's in ACLED? <a class="anchor" id="1.2"></a>

<img src="https://github.com/georgetown-analytics/ACLED/blob/main/ACLED%20Pictures/whats%20in%20acled.PNG?raw=true" alt="what's in acled">

## 1.3 Hypothesis - What Are We Trying to Predict? <a class="anchor" id="1.3"></a>

* Event type - Working within the data science pipeline, our team applied NLP to the `notes` column (the feature) of the ACLED data to predict the event type of a particular event in a region. This is a supervised, multi-class classification problem.
* Sub-event type - Topic modeling with LDA to surface candidate new sub-event types, adding information that could inform interventions.
 <img src="https://github.com/georgetown-analytics/ACLED/blob/main/ACLED%20Pictures/event_type.PNG?raw=true" alt="event_type" width=420 height=380>
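
As a rough illustration of the supervised, multi-class setup described above, the sketch below trains a tiny multinomial Naive Bayes classifier on a handful of made-up, pre-tokenized notes. The training sentences, the function names (`train_nb`, `predict_nb`), and the choice of Naive Bayes are assumptions for illustration only; this is not the model used by the project, where a library such as scikit-learn would be the more practical choice.

```python
import math
from collections import Counter, defaultdict

# Toy, hand-written examples (NOT real ACLED rows): (tokens, event_type)
train = [
    (["militants", "attacked", "military", "base"], "Battles"),
    (["army", "clashed", "with", "militants"], "Battles"),
    (["workers", "protested", "peacefully", "outside", "parliament"], "Protests"),
    (["students", "protested", "over", "fees"], "Protests"),
    (["rioters", "burned", "tires", "and", "looted", "shops"], "Riots"),
    (["mob", "looted", "shops", "and", "burned", "cars"], "Riots"),
]

def train_nb(examples):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(tokens, class_counts, word_counts, vocab):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb(train)
prediction = predict_nb(["militants", "attacked", "army"], *model)
print(prediction)
```

The same idea scales to the real data by replacing the toy list with the cleaned `notes` tokens and the `event_type` labels.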

## 1.4 Why Are We Doing This? <a class="anchor" id="1.4"></a>

* Understanding trends and creating new classifications improves the ability of agencies and governments to respond to crises by developing global strategies, influencing current policies, and investing resources in new problem areas.
    * Improving the accuracy of event types assigned by ACLED researchers
    * Could we create sub-events that focus on victims, e.g. gender-based violence or violence against certain minorities or protected classes?
<img src="https://github.com/georgetown-analytics/ACLED/blob/main/ACLED%20Pictures/why%20are%20we%20doing%20this.PNG?raw=true" alt="why are we doing this" width=420 height=380>

# 2. Project Pipeline - Architecture Diagram <a class="anchor" id="2"></a>
<img src="https://github.com/georgetown-analytics/ACLED/blob/main/ACLED%20Pictures/ACLED%20design%20proposal.png?raw=true" alt="design proposal">

# ACLED Dataset Cleaning and Initial Exploration

In [3]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams["figure.dpi"] = 36

In [4]:
#load in csv from github repo
url = 'https://raw.githubusercontent.com/georgetown-analytics/ACLED/main/CSV_Main/2020-06-01-2021-06-01-Eastern_Africa-Middle_Africa-Northern_Africa-Southern_Africa-Western_Africa.csv'
df = pd.read_csv(url, index_col=0)

In [5]:
df

data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,actor1,assoc_actor_1,...,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
8166147,180,DRC21566,21566,01 June 2021,2021,1,Battles,Armed clash,ADF: Allied Democratic Forces,,...,Kokola,0.7821,29.6001,1,Al Nabaa,New media,"On 1 June 2021, the ADF attacked a military ba...",0,1624310472,COD
8166148,729,SUD15181,15181,01 June 2021,2021,1,Violence against civilians,Attack,Unidentified Armed Group (Sudan),,...,Khartoum,15.5725,32.5364,1,Al Rakoba,National,"On 1 June 2021, three masked gunmen opened fir...",5,1624310472,SDN
8166410,426,LES165,165,01 June 2021,2021,1,Riots,Violent demonstration,Rioters (Lesotho),Labour Group (Lesotho),...,Maseru,-29.3167,27.4833,1,Post (Lesotho),National,"On 1 June 2021, workers pelted stones and loot...",0,1624310473,LSO
8166411,426,LES164,164,01 June 2021,2021,1,Riots,Violent demonstration,Rioters (Lesotho),Labour Group (Lesotho),...,Maputsoe,-28.8866,27.8991,1,Post (Lesotho),National,"On 1 June 2021, workers set tires on fire and ...",0,1624310473,LSO
8059405,800,UGA6836,6836,01 June 2021,2021,1,Violence against civilians,Attack,Unidentified Armed Group (Uganda),,...,Bukoto,0.3531,32.6000,1,Daily Monitor (Uganda); Chimp Reports,National,"On 1 June 2021, an unidentified armed group at...",2,1623100969,UGA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7121647,566,NIG17229,17229,01 June 2020,2020,1,Riots,Mob violence,Rioters (Nigeria),PDP: Peoples Democratic Party,...,Oshogbo,7.7667,4.5667,1,Osun Defender,Subnational,"On 1 June 2020, PDP supporters attacked APC su...",0,1591646811,NGA
7966971,788,TUN6541,6541,01 June 2020,2020,1,Protests,Peaceful protest,Protesters (Tunisia),Health Workers (Tunisia); UGTT: Tunisian Gener...,...,Sfax,34.7406,10.7603,1,Agence Tunis Afrique Presse,National,"On 1 June 2020, aligned health workers protest...",0,1620691911,TUN
7121659,706,SOM31271,31271,01 June 2020,2020,1,Battles,Armed clash,Al Shabaab,,...,Dhobley,0.4063,41.0124,1,Radio Kulmiye,National,"On 1 June 2020, Al shabaab militants launched ...",0,1591646811,SOM
7518716,710,SAF12695,12695,01 June 2020,2020,1,Protests,Peaceful protest,Protesters (South Africa),,...,Cape Town - Bishop Lavis,-33.9473,18.5751,1,GroundUp; News24 (South Africa); Citizen (Sout...,National,"On 1 June 2020, about 30 parents demonstrated ...",0,1611019285,ZAF


In [6]:
# see how many rows and columns are in this dataset
n_rows, n_cols = df.shape
print(f'This dataset contains {n_rows} rows and {n_cols} columns')

This dataset contains 33378 rows and 30 columns


In [7]:
print(df.iloc[0:10]['notes'])

data_id
8166147    On 1 June 2021, the ADF attacked a military ba...
8166148    On 1 June 2021, three masked gunmen opened fir...
8166410    On 1 June 2021, workers pelted stones and loot...
8166411    On 1 June 2021, workers set tires on fire and ...
8059405    On 1 June 2021, an unidentified armed group at...
8059413    On 1 June 2021, doctors gathered at the Univer...
8059414    On 1 June 2021, several Zanu-PF youth marched ...
8059670    On 1 June 2021, a group set tire on fire, bloc...
8059671    On 1 June 2021, a group blocked the North and ...
8059418    On 1 June 2021, members of the National Union ...
Name: notes, dtype: object


In [8]:
# look at first 5 rows of data
df.head()

data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,actor1,assoc_actor_1,...,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
8166147,180,DRC21566,21566,01 June 2021,2021,1,Battles,Armed clash,ADF: Allied Democratic Forces,,...,Kokola,0.7821,29.6001,1,Al Nabaa,New media,"On 1 June 2021, the ADF attacked a military ba...",0,1624310472,COD
8166148,729,SUD15181,15181,01 June 2021,2021,1,Violence against civilians,Attack,Unidentified Armed Group (Sudan),,...,Khartoum,15.5725,32.5364,1,Al Rakoba,National,"On 1 June 2021, three masked gunmen opened fir...",5,1624310472,SDN
8166410,426,LES165,165,01 June 2021,2021,1,Riots,Violent demonstration,Rioters (Lesotho),Labour Group (Lesotho),...,Maseru,-29.3167,27.4833,1,Post (Lesotho),National,"On 1 June 2021, workers pelted stones and loot...",0,1624310473,LSO
8166411,426,LES164,164,01 June 2021,2021,1,Riots,Violent demonstration,Rioters (Lesotho),Labour Group (Lesotho),...,Maputsoe,-28.8866,27.8991,1,Post (Lesotho),National,"On 1 June 2021, workers set tires on fire and ...",0,1624310473,LSO
8059405,800,UGA6836,6836,01 June 2021,2021,1,Violence against civilians,Attack,Unidentified Armed Group (Uganda),,...,Bukoto,0.3531,32.6,1,Daily Monitor (Uganda); Chimp Reports,National,"On 1 June 2021, an unidentified armed group at...",2,1623100969,UGA


In [9]:
# look at last 5 rows of data
df.tail()

data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,actor1,assoc_actor_1,...,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
7121647,566,NIG17229,17229,01 June 2020,2020,1,Riots,Mob violence,Rioters (Nigeria),PDP: Peoples Democratic Party,...,Oshogbo,7.7667,4.5667,1,Osun Defender,Subnational,"On 1 June 2020, PDP supporters attacked APC su...",0,1591646811,NGA
7966971,788,TUN6541,6541,01 June 2020,2020,1,Protests,Peaceful protest,Protesters (Tunisia),Health Workers (Tunisia); UGTT: Tunisian Gener...,...,Sfax,34.7406,10.7603,1,Agence Tunis Afrique Presse,National,"On 1 June 2020, aligned health workers protest...",0,1620691911,TUN
7121659,706,SOM31271,31271,01 June 2020,2020,1,Battles,Armed clash,Al Shabaab,,...,Dhobley,0.4063,41.0124,1,Radio Kulmiye,National,"On 1 June 2020, Al shabaab militants launched ...",0,1591646811,SOM
7518716,710,SAF12695,12695,01 June 2020,2020,1,Protests,Peaceful protest,Protesters (South Africa),,...,Cape Town - Bishop Lavis,-33.9473,18.5751,1,GroundUp; News24 (South Africa); Citizen (Sout...,National,"On 1 June 2020, about 30 parents demonstrated ...",0,1611019285,ZAF
7518717,710,SAF12696,12696,01 June 2020,2020,1,Riots,Violent demonstration,Rioters (South Africa),Women (South Africa),...,Cape Town - Milnerton,-33.8662,18.5297,2,News24 (South Africa); Times (South Africa); C...,National,"On 1 June 2020, about 300 demonstrators, mostl...",0,1611019285,ZAF


In [10]:
# see list of all columns
list(df)

['iso',
 'event_id_cnty',
 'event_id_no_cnty',
 'event_date',
 'year',
 'time_precision',
 'event_type',
 'sub_event_type',
 'actor1',
 'assoc_actor_1',
 'inter1',
 'actor2',
 'assoc_actor_2',
 'inter2',
 'interaction',
 'region',
 'country',
 'admin1',
 'admin2',
 'admin3',
 'location',
 'latitude',
 'longitude',
 'geo_precision',
 'source',
 'source_scale',
 'notes',
 'fatalities',
 'timestamp',
 'iso3']

In [11]:
#selecting columns that are pertinent to the project
df_filter = df[['country', 'actor1', 'assoc_actor_1','event_type','sub_event_type','fatalities','notes']]

In [12]:
df_filter

data_id,country,actor1,assoc_actor_1,event_type,sub_event_type,fatalities,notes
8166147,Democratic Republic of Congo,ADF: Allied Democratic Forces,,Battles,Armed clash,0,"On 1 June 2021, the ADF attacked a military ba..."
8166148,Sudan,Unidentified Armed Group (Sudan),,Violence against civilians,Attack,5,"On 1 June 2021, three masked gunmen opened fir..."
8166410,Lesotho,Rioters (Lesotho),Labour Group (Lesotho),Riots,Violent demonstration,0,"On 1 June 2021, workers pelted stones and loot..."
8166411,Lesotho,Rioters (Lesotho),Labour Group (Lesotho),Riots,Violent demonstration,0,"On 1 June 2021, workers set tires on fire and ..."
8059405,Uganda,Unidentified Armed Group (Uganda),,Violence against civilians,Attack,2,"On 1 June 2021, an unidentified armed group at..."
...,...,...,...,...,...,...,...
7121647,Nigeria,Rioters (Nigeria),PDP: Peoples Democratic Party,Riots,Mob violence,0,"On 1 June 2020, PDP supporters attacked APC su..."
7966971,Tunisia,Protesters (Tunisia),Health Workers (Tunisia); UGTT: Tunisian Gener...,Protests,Peaceful protest,0,"On 1 June 2020, aligned health workers protest..."
7121659,Somalia,Al Shabaab,,Battles,Armed clash,0,"On 1 June 2020, Al shabaab militants launched ..."
7518716,South Africa,Protesters (South Africa),,Protests,Peaceful protest,0,"On 1 June 2020, about 30 parents demonstrated ..."


In [13]:
#counting the number of events per event type (.count() tallies rows; use .sum() to total fatalities)
df_filter.groupby(['event_type'])['fatalities'].count()

event_type
Battles                        7448
Explosions/Remote violence     1543
Protests                      11540
Riots                          3928
Strategic developments         1778
Violence against civilians     7141
Name: fatalities, dtype: int64
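
Because `.count()` tallies rows, the figures above are event counts rather than death tolls. The sketch below shows the difference on the five sample rows visible in the `df.head()` output, using only the standard library; the variable names are illustrative, and on the real DataFrame the corresponding pandas call would be `.sum()` in place of `.count()`.

```python
from collections import defaultdict

# (event_type, fatalities) for the first five rows shown in df.head()
rows = [
    ("Battles", 0),
    ("Violence against civilians", 5),
    ("Riots", 0),
    ("Riots", 0),
    ("Violence against civilians", 2),
]

event_counts = defaultdict(int)     # number of events per type
fatality_totals = defaultdict(int)  # total fatalities per type

for event_type, fatalities in rows:
    event_counts[event_type] += 1
    fatality_totals[event_type] += fatalities

for event_type in sorted(event_counts):
    print(event_type, event_counts[event_type], fatality_totals[event_type])
```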

In [14]:
#number of events for each (event_type, sub_event_type) pair
conflict_count = df_filter.groupby(['event_type','sub_event_type']).size().to_frame('count')

In [15]:
conflict_count

event_type,sub_event_type,count
Battles,Armed clash,7040
Battles,Government regains territory,308
Battles,Non-state actor overtakes territory,100
Explosions/Remote violence,Air/drone strike,404
Explosions/Remote violence,Grenade,129
Explosions/Remote violence,Remote explosive/landmine/IED,769
Explosions/Remote violence,Shelling/artillery/missile attack,215
Explosions/Remote violence,Suicide bomb,26
Protests,Excessive force against protesters,164
Protests,Peaceful protest,10267


In [16]:
df_filter.to_csv('step1_ACLED_Dataset_END.csv')

# Step 2 - Data Wrangling

In [17]:
import nltk
import re
from string import digits
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
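ps = PorterStemmer()
# one-time downloads required by stopwords and WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')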

In [18]:
#making all text in notes column lowercase
df["notes"]=df["notes"].str.lower()
df

data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,actor1,assoc_actor_1,...,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
8166147,180,DRC21566,21566,01 June 2021,2021,1,Battles,Armed clash,ADF: Allied Democratic Forces,,...,Kokola,0.7821,29.6001,1,Al Nabaa,New media,"on 1 june 2021, the adf attacked a military ba...",0,1624310472,COD
8166148,729,SUD15181,15181,01 June 2021,2021,1,Violence against civilians,Attack,Unidentified Armed Group (Sudan),,...,Khartoum,15.5725,32.5364,1,Al Rakoba,National,"on 1 june 2021, three masked gunmen opened fir...",5,1624310472,SDN
8166410,426,LES165,165,01 June 2021,2021,1,Riots,Violent demonstration,Rioters (Lesotho),Labour Group (Lesotho),...,Maseru,-29.3167,27.4833,1,Post (Lesotho),National,"on 1 june 2021, workers pelted stones and loot...",0,1624310473,LSO
8166411,426,LES164,164,01 June 2021,2021,1,Riots,Violent demonstration,Rioters (Lesotho),Labour Group (Lesotho),...,Maputsoe,-28.8866,27.8991,1,Post (Lesotho),National,"on 1 june 2021, workers set tires on fire and ...",0,1624310473,LSO
8059405,800,UGA6836,6836,01 June 2021,2021,1,Violence against civilians,Attack,Unidentified Armed Group (Uganda),,...,Bukoto,0.3531,32.6000,1,Daily Monitor (Uganda); Chimp Reports,National,"on 1 june 2021, an unidentified armed group at...",2,1623100969,UGA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7121647,566,NIG17229,17229,01 June 2020,2020,1,Riots,Mob violence,Rioters (Nigeria),PDP: Peoples Democratic Party,...,Oshogbo,7.7667,4.5667,1,Osun Defender,Subnational,"on 1 june 2020, pdp supporters attacked apc su...",0,1591646811,NGA
7966971,788,TUN6541,6541,01 June 2020,2020,1,Protests,Peaceful protest,Protesters (Tunisia),Health Workers (Tunisia); UGTT: Tunisian Gener...,...,Sfax,34.7406,10.7603,1,Agence Tunis Afrique Presse,National,"on 1 june 2020, aligned health workers protest...",0,1620691911,TUN
7121659,706,SOM31271,31271,01 June 2020,2020,1,Battles,Armed clash,Al Shabaab,,...,Dhobley,0.4063,41.0124,1,Radio Kulmiye,National,"on 1 june 2020, al shabaab militants launched ...",0,1591646811,SOM
7518716,710,SAF12695,12695,01 June 2020,2020,1,Protests,Peaceful protest,Protesters (South Africa),,...,Cape Town - Bishop Lavis,-33.9473,18.5751,1,GroundUp; News24 (South Africa); Citizen (Sout...,National,"on 1 june 2020, about 30 parents demonstrated ...",0,1611019285,ZAF


In [19]:
# remove digits (e.g. the day and year of a date); month names are removed in a later step
def remove_num(texts):
    # the parameter is named `texts` to avoid shadowing the built-in `list`
    pattern = r'[0-9]'
    return [re.sub(pattern, '', text) for text in texts]

In [20]:
df["notes"] = remove_num(df["notes"])
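
Stripping digits alone leaves fragments such as `on june ,` behind, as the next output shows. One possible alternative, sketched here as an assumption rather than what the notebook does, is to remove the whole leading date phrase with a single regular expression. The pattern and the helper name `strip_leading_date` are hypothetical and only cover the `On <day> <Month> <year>,` form seen in the sample notes.

```python
import re

# Matches a leading "on 1 june 2021," style prefix (case-insensitive).
LEADING_DATE = re.compile(r'^\s*on\s+\d{1,2}\s+[a-z]+\s+\d{4},?\s*', re.IGNORECASE)

def strip_leading_date(note):
    """Remove a leading date phrase, then drop any remaining digits."""
    without_date = LEADING_DATE.sub('', note)
    without_digits = re.sub(r'[0-9]', '', without_date)
    # collapse the double spaces left behind by removed numbers
    return re.sub(r'\s+', ' ', without_digits).strip()

print(strip_leading_date("On 1 June 2021, three masked gunmen opened fire"))
print(strip_leading_date("On 1 June 2020, about 30 parents demonstrated"))
```

Notes that do not start with a date are left unchanged apart from the digit removal and whitespace cleanup.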

In [21]:
df

data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,actor1,assoc_actor_1,...,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
8166147,180,DRC21566,21566,01 June 2021,2021,1,Battles,Armed clash,ADF: Allied Democratic Forces,,...,Kokola,0.7821,29.6001,1,Al Nabaa,New media,"on june , the adf attacked a military base in...",0,1624310472,COD
8166148,729,SUD15181,15181,01 June 2021,2021,1,Violence against civilians,Attack,Unidentified Armed Group (Sudan),,...,Khartoum,15.5725,32.5364,1,Al Rakoba,National,"on june , three masked gunmen opened fire aga...",5,1624310472,SDN
8166410,426,LES165,165,01 June 2021,2021,1,Riots,Violent demonstration,Rioters (Lesotho),Labour Group (Lesotho),...,Maseru,-29.3167,27.4833,1,Post (Lesotho),National,"on june , workers pelted stones and looted sh...",0,1624310473,LSO
8166411,426,LES164,164,01 June 2021,2021,1,Riots,Violent demonstration,Rioters (Lesotho),Labour Group (Lesotho),...,Maputsoe,-28.8866,27.8991,1,Post (Lesotho),National,"on june , workers set tires on fire and block...",0,1624310473,LSO
8059405,800,UGA6836,6836,01 June 2021,2021,1,Violence against civilians,Attack,Unidentified Armed Group (Uganda),,...,Bukoto,0.3531,32.6000,1,Daily Monitor (Uganda); Chimp Reports,National,"on june , an unidentified armed group attacke...",2,1623100969,UGA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7121647,566,NIG17229,17229,01 June 2020,2020,1,Riots,Mob violence,Rioters (Nigeria),PDP: Peoples Democratic Party,...,Oshogbo,7.7667,4.5667,1,Osun Defender,Subnational,"on june , pdp supporters attacked apc support...",0,1591646811,NGA
7966971,788,TUN6541,6541,01 June 2020,2020,1,Protests,Peaceful protest,Protesters (Tunisia),Health Workers (Tunisia); UGTT: Tunisian Gener...,...,Sfax,34.7406,10.7603,1,Agence Tunis Afrique Presse,National,"on june , aligned health workers protested in...",0,1620691911,TUN
7121659,706,SOM31271,31271,01 June 2020,2020,1,Battles,Armed clash,Al Shabaab,,...,Dhobley,0.4063,41.0124,1,Radio Kulmiye,National,"on june , al shabaab militants launched an at...",0,1591646811,SOM
7518716,710,SAF12695,12695,01 June 2020,2020,1,Protests,Peaceful protest,Protesters (South Africa),,...,Cape Town - Bishop Lavis,-33.9473,18.5751,1,GroundUp; News24 (South Africa); Citizen (Sout...,National,"on june , about parents demonstrated outside...",0,1611019285,ZAF


In [22]:
tokenizer = RegexpTokenizer(r'\w+')
df["notes"] = df["notes"].apply(lambda x: tokenizer.tokenize(x.lower()))
df["notes"]

data_id
8166147    [on, june, the, adf, attacked, a, military, ba...
8166148    [on, june, three, masked, gunmen, opened, fire...
8166410    [on, june, workers, pelted, stones, and, loote...
8166411    [on, june, workers, set, tires, on, fire, and,...
8059405    [on, june, an, unidentified, armed, group, att...
                                 ...                        
7121647    [on, june, pdp, supporters, attacked, apc, sup...
7966971    [on, june, aligned, health, workers, protested...
7121659    [on, june, al, shabaab, militants, launched, a...
7518716    [on, june, about, parents, demonstrated, outsi...
7518717    [on, june, about, demonstrators, mostly, women...
Name: notes, Length: 33378, dtype: object

In [23]:
lemmatizer = WordNetLemmatizer()

def word_lemmatizer(text):
    lem_text = [lemmatizer.lemmatize(i) for i in text]
    return lem_text

In [24]:
df["notes"] = df["notes"].apply(lambda x: word_lemmatizer(x))
df

data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,actor1,assoc_actor_1,...,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
8166147,180,DRC21566,21566,01 June 2021,2021,1,Battles,Armed clash,ADF: Allied Democratic Forces,,...,Kokola,0.7821,29.6001,1,Al Nabaa,New media,"[on, june, the, adf, attacked, a, military, ba...",0,1624310472,COD
8166148,729,SUD15181,15181,01 June 2021,2021,1,Violence against civilians,Attack,Unidentified Armed Group (Sudan),,...,Khartoum,15.5725,32.5364,1,Al Rakoba,National,"[on, june, three, masked, gunman, opened, fire...",5,1624310472,SDN
8166410,426,LES165,165,01 June 2021,2021,1,Riots,Violent demonstration,Rioters (Lesotho),Labour Group (Lesotho),...,Maseru,-29.3167,27.4833,1,Post (Lesotho),National,"[on, june, worker, pelted, stone, and, looted,...",0,1624310473,LSO
8166411,426,LES164,164,01 June 2021,2021,1,Riots,Violent demonstration,Rioters (Lesotho),Labour Group (Lesotho),...,Maputsoe,-28.8866,27.8991,1,Post (Lesotho),National,"[on, june, worker, set, tire, on, fire, and, b...",0,1624310473,LSO
8059405,800,UGA6836,6836,01 June 2021,2021,1,Violence against civilians,Attack,Unidentified Armed Group (Uganda),,...,Bukoto,0.3531,32.6000,1,Daily Monitor (Uganda); Chimp Reports,National,"[on, june, an, unidentified, armed, group, att...",2,1623100969,UGA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7121647,566,NIG17229,17229,01 June 2020,2020,1,Riots,Mob violence,Rioters (Nigeria),PDP: Peoples Democratic Party,...,Oshogbo,7.7667,4.5667,1,Osun Defender,Subnational,"[on, june, pdp, supporter, attacked, apc, supp...",0,1591646811,NGA
7966971,788,TUN6541,6541,01 June 2020,2020,1,Protests,Peaceful protest,Protesters (Tunisia),Health Workers (Tunisia); UGTT: Tunisian Gener...,...,Sfax,34.7406,10.7603,1,Agence Tunis Afrique Presse,National,"[on, june, aligned, health, worker, protested,...",0,1620691911,TUN
7121659,706,SOM31271,31271,01 June 2020,2020,1,Battles,Armed clash,Al Shabaab,,...,Dhobley,0.4063,41.0124,1,Radio Kulmiye,National,"[on, june, al, shabaab, militant, launched, an...",0,1591646811,SOM
7518716,710,SAF12695,12695,01 June 2020,2020,1,Protests,Peaceful protest,Protesters (South Africa),,...,Cape Town - Bishop Lavis,-33.9473,18.5751,1,GroundUp; News24 (South Africa); Citizen (Sout...,National,"[on, june, about, parent, demonstrated, outsid...",0,1611019285,ZAF


In [25]:
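# build the stop-word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = [w for w in text if w not in stop_words]
    return words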

In [26]:
df["notes"] = df["notes"].apply(lambda x: remove_stopwords(x))

In [27]:
df

data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,actor1,assoc_actor_1,...,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
8166147,180,DRC21566,21566,01 June 2021,2021,1,Battles,Armed clash,ADF: Allied Democratic Forces,,...,Kokola,0.7821,29.6001,1,Al Nabaa,New media,"[june, adf, attacked, military, base, village,...",0,1624310472,COD
8166148,729,SUD15181,15181,01 June 2021,2021,1,Violence against civilians,Attack,Unidentified Armed Group (Sudan),,...,Khartoum,15.5725,32.5364,1,Al Rakoba,National,"[june, three, masked, gunman, opened, fire, ci...",5,1624310472,SDN
8166410,426,LES165,165,01 June 2021,2021,1,Riots,Violent demonstration,Rioters (Lesotho),Labour Group (Lesotho),...,Maseru,-29.3167,27.4833,1,Post (Lesotho),National,"[june, worker, pelted, stone, looted, shop, th...",0,1624310473,LSO
8166411,426,LES164,164,01 June 2021,2021,1,Riots,Violent demonstration,Rioters (Lesotho),Labour Group (Lesotho),...,Maputsoe,-28.8866,27.8991,1,Post (Lesotho),National,"[june, worker, set, tire, fire, blocked, road,...",0,1624310473,LSO
8059405,800,UGA6836,6836,01 June 2021,2021,1,Violence against civilians,Attack,Unidentified Armed Group (Uganda),,...,Bukoto,0.3531,32.6000,1,Daily Monitor (Uganda); Chimp Reports,National,"[june, unidentified, armed, group, attacked, m...",2,1623100969,UGA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7121647,566,NIG17229,17229,01 June 2020,2020,1,Riots,Mob violence,Rioters (Nigeria),PDP: Peoples Democratic Party,...,Oshogbo,7.7667,4.5667,1,Osun Defender,Subnational,"[june, pdp, supporter, attacked, apc, supporte...",0,1591646811,NGA
7966971,788,TUN6541,6541,01 June 2020,2020,1,Protests,Peaceful protest,Protesters (Tunisia),Health Workers (Tunisia); UGTT: Tunisian Gener...,...,Sfax,34.7406,10.7603,1,Agence Tunis Afrique Presse,National,"[june, aligned, health, worker, protested, fro...",0,1620691911,TUN
7121659,706,SOM31271,31271,01 June 2020,2020,1,Battles,Armed clash,Al Shabaab,,...,Dhobley,0.4063,41.0124,1,Radio Kulmiye,National,"[june, al, shabaab, militant, launched, attack...",0,1591646811,SOM
7518716,710,SAF12695,12695,01 June 2020,2020,1,Protests,Peaceful protest,Protesters (South Africa),,...,Cape Town - Bishop Lavis,-33.9473,18.5751,1,GroundUp; News24 (South Africa); Citizen (Sout...,National,"[june, parent, demonstrated, outside, bergvill...",0,1611019285,ZAF


In [28]:
def remove_month(text):
    #note: 'march' and 'may' are also ordinary words, so those tokens are removed as well
    dates = ['january', 'february', 'march','april','may','june','july','august','september','october','november','december']
    words =[w for w in text if w not in dates]
    return words

In [29]:
df["notes"] = df["notes"].apply(lambda x: remove_month(x))

In [30]:
df['notes']

data_id
8166147    [adf, attacked, military, base, village, kokol...
8166148    [three, masked, gunman, opened, fire, civilian...
8166410    [worker, pelted, stone, looted, shop, thetsane...
8166411    [worker, set, tire, fire, blocked, road, maput...
8059405    [unidentified, armed, group, attacked, ministe...
                                 ...                        
7121647    [pdp, supporter, attacked, apc, supporter, oso...
7966971    [aligned, health, worker, protested, front, re...
7121659    [al, shabaab, militant, launched, attack, juba...
7518716    [parent, demonstrated, outside, bergville, pri...
7518717    [demonstrator, mostly, woman, set, truck, alig...
Name: notes, Length: 33378, dtype: object
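
The cleaning steps so far (lowercasing, digit removal, tokenizing, stop-word and month removal) can be collected into a single function, which makes the order of operations explicit. The version below is a standard-library sketch: `SMALL_STOPWORDS` is a tiny stand-in for NLTK's English list, lemmatization is omitted, and `preprocess_note` is a hypothetical name.

```python
import re

# Tiny illustrative stop-word list; the notebook uses nltk's full English list.
SMALL_STOPWORDS = {"on", "the", "a", "an", "and", "in", "of", "at", "to"}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}

def preprocess_note(note):
    """Lowercase, drop digits, tokenize on word characters, filter tokens."""
    text = re.sub(r'[0-9]', '', note.lower())
    tokens = re.findall(r'\w+', text)
    return [t for t in tokens if t not in SMALL_STOPWORDS and t not in MONTHS]

print(preprocess_note("On 1 June 2021, the ADF attacked a military base"))
```

Keeping the steps in one function also makes it easier to apply exactly the same cleaning to new notes at prediction time.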

In [31]:
def common_acled_words(text):
    #words that appear in many ACLED notes and carry little signal
    acled_words = ['report', 'size']
    words = [w for w in text if w not in acled_words]
    return words

In [32]:
def common_protests_words(text):
    #after lemmatization 'protests' becomes 'protest'; both forms are listed for safety
    protest_words = ['protests', 'protest']
    words = [w for w in text if w not in protest_words]
    return words
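
The month, ACLED-term, and protest-term filters all follow the same pattern, so they could share one helper that takes the unwanted words as an argument. `remove_words` below is a hypothetical name for such a helper, shown as a sketch rather than a change to the notebook's functions.

```python
def remove_words(tokens, unwanted):
    """Return tokens with every member of `unwanted` filtered out."""
    unwanted = set(unwanted)  # set membership checks are fast
    return [t for t in tokens if t not in unwanted]

tokens = ["worker", "protest", "report", "size", "road"]
print(remove_words(tokens, ["report", "size"]))
print(remove_words(tokens, ["protest", "protests"]))
```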

In [33]:
df["notes"] = df["notes"].apply(common_acled_words)

In [None]:
#create sub dataframes for each event type (.copy() so each can be modified independently)
df_protests = df.loc[df['event_type'] == "Protests"].copy()
df_riots = df.loc[df['event_type'] == "Riots"].copy()
df_battles = df.loc[df['event_type'] == "Battles"].copy()
df_violence_civilians = df.loc[df['event_type'] == "Violence against civilians"].copy()
df_explosions = df.loc[df['event_type'] == "Explosions/Remote violence"].copy()
df_development = df.loc[df['event_type'] == "Strategic developments"].copy()
#remove protest-specific words now that df_protests exists
df_protests["notes"] = df_protests["notes"].apply(common_protests_words)

In [None]:
df_protests.head(5)

In [None]:
df_riots.head(5)

In [None]:
df_battles.head(5)

In [None]:
df_violence_civilians.head(5)

In [None]:
df_explosions.head(5)

In [None]:
df_development.head(5)

# Step 3 - Topic Modeling

In [None]:
import gensim
from gensim.utils import simple_preprocess
import gensim.corpora as corpora
from pprint import pprint

In [None]:
#convert sub-dataframes to lists for topic modeling
data_words_protests = list(df_protests['notes'])
data_words_riots = list(df_riots['notes']) 
data_words_battles = list(df_battles['notes'])
data_words_violence_civilians = list(df_violence_civilians['notes'])
data_words_explosions = list(df_explosions['notes'])
data_words_development = list(df_development['notes'])

In [None]:
data_words = list(df['notes'])
# Create Dictionary
id2word = corpora.Dictionary(data_words)
# Create Corpus
texts = data_words
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# LDA model training
# number of topics
num_topics = 10
# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)
# Print the Keyword in the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
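
To make the dictionary and corpus steps above more concrete, the sketch below mimics the idea in plain Python: each unique token gets an integer id, and every document becomes a list of `(token_id, count)` pairs. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins written for this example; they are not gensim functions, and the id assignment may differ from gensim's.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [
    ["worker", "set", "tire", "fire", "blocked", "road"],
    ["worker", "pelted", "stone", "looted", "shop"],
]
vocab = build_vocab(docs)
corpus_bow = [to_bow(doc, vocab) for doc in docs]
print(corpus_bow[1])
```

Tokens missing from the vocabulary are silently dropped, which is why a dictionary built from one subset of notes should not be reused to encode a different subset.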

### Load in Topic Modeling visualization

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
# Visualize the topics
pyLDAvis.enable_notebook()

In [None]:
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)

In [None]:
vis

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(data_words_protests)
# Create Corpus
texts = data_words_protests
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# LDA model training
# number of topics
num_topics = 10
# Build LDA model
lda_model_protests = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)
vis = pyLDAvis.gensim_models.prepare(lda_model_protests, corpus, id2word)

In [None]:
vis

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(data_words_violence_civilians)
# Create Corpus
texts = data_words_violence_civilians
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# LDA model training
# number of topics
num_topics = 10
# Build LDA model
lda_model_violence_civilians = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)
vis = pyLDAvis.gensim_models.prepare(lda_model_violence_civilians, corpus, id2word)

In [None]:
vis