# Issue

Civilians are collateral damage at best and deliberate targets at worst in violent conflicts between state and non-state political actors globally. 

The systematic censorship or outright suppression of domestic and international journalism creates accountability and information vacuums that sustain social oblivion beyond regional or national borders. This in turn emboldens the most horrifically inhumane inclinations of both factions. 

The resulting escalation cycle tends to be disrupted, or at least exposed, only when important economic interests are impacted meaningfully enough that the political powers concerned intervene, whether actively or passively. 

# Proposed Solution

TXTIMONY is the concept for a text-powered crisis monitoring model that predicts incident cause based on real-time civilian reports, diminishing reliance on journalistic coverage and the news cycle. It would run on the Ushahidi platform, which was created in 2008 to crowdsource and map data on post-election violence in Kenya. Ushahidi has since become an enterprise data collection, management, and visualization solution for citizen engagement, election monitoring, humanitarian aid, incident management, and international development organizations.  

Force-multiplying the immense agency of courageous civilians who digitally document their accounts of political violence and then circumvent internet and mobile messaging service disruptions to disseminate them is crucial for two reasons: 
- It would enable non-governmental humanitarian assistance organizations to continuously re-assess developments and microtarget operations. 
- It would anonymize and democratize news production such that real-time reporting is accelerated and attribution-based retaliation is impossible.

The use cases I'm focusing on are:
- Cameroonian armed forces' violent conflicts with the Ambazonian Separatist Movement.
- Nigerian armed forces' violent conflicts with the Boko Haram Insurgency. 

# Anticipated Challenges

**The Project** 

- Dynamic Data's Inaccessibility: Ideally, I'd be assembling my text corpus from mobile messages or social media posts. Because I can't access those, my ability to demonstrate the product's utility is constrained by how static my data sources are.  
- Methodology: I've framed incident cause identification as an extraction problem, but it may actually be an abstraction problem. I'd like to try building a neural network classifier for incident cause prediction. Both implementations may be beyond the scope of my current skillset. 


**The Product** 

- Data Veracity: Identifying and filtering out duplicative reports and deliberate disinformation would need to be a priority. Applying Artificial Intelligence to combatting the latter is still a nascent effort - a startup called [Factmata](http://factmata.com/) is currently tackling this challenge.  
- Digital Censorship: The ability to circumvent web and mobile network blocks would pose an accessibility challenge. Chinese college students have devised a novel approach that might also be applicable in this context - embedding messages in [Blockchain](https://slate.com/technology/2018/07/blockchain-is-helping-to-circumvent-censorship-in-china.html) transaction descriptions.
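
The duplicate-report half of the Data Veracity challenge can be illustrated with a minimal sketch, assuming reports arrive as plain strings. It compares word shingles with Jaccard similarity and keeps only the first of any near-identical pair. The function names, the sample reports, and the 0.6 threshold are all illustrative assumptions, and this does nothing to detect deliberate disinformation.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, punctuation-free text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(reports, threshold=0.6):
    """Keep a report only if it is not too similar to any report already kept."""
    kept, kept_shingles = [], []
    for report in reports:
        s = shingles(report)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(report)
            kept_shingles.append(s)
    return kept


# Invented placeholder reports, not real data.
reports = [
    "Gunfire heard near the market this morning, several shops burned.",
    "Gunfire heard near the market this morning; several shops burned!",
    "Road to the next town is blocked by a checkpoint since last night.",
]
unique_reports = drop_near_duplicates(reports)
print(len(unique_reports), "of", len(reports), "reports kept")
```

Pairwise comparison is quadratic in the number of reports, so a real-time feed would likely need an approximate method such as MinHash instead.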

# Data Sources

The Armed Conflict Location & Event Data (ACLED) project is one of the most credible free and open providers of real-time data for analyzing and visualizing political violence in the developing world. It's primarily powered by reports from domestic and international news outlets, supplemented by local sources who tend to cover farther-flung locations.  

Based on API query results, I've identified an arbitrary subset of reputable, non-government-owned domestic and international news outlets. I intend to scrape their sites for articles on the separatist movement in Cameroon and the insurgency in Nigeria, and to build my text corpus from those articles:

**Domestic News Outlets**
- Cameroon Intelligence Report (Cameroon)
- Journal du Cameroun (Cameroon)
- The Guardian (Nigeria)
- Vanguard (Nigeria)

**International News Outlets**
- Agence France-Presse (France)
- British Broadcasting Corporation (United Kingdom)
- Reuters (United Kingdom)
- Voice of America (United States)

# Techniques

## Data Source Selection: ACLED API Query

I've already made arbitrary data source selections.  
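
For reference, here is a minimal sketch of how outlets could be tallied from query results that have already been downloaded. It assumes each record is a dictionary with a `source` field listing one or more outlets separated by semicolons; that field name, the separator, and the sample records are assumptions for illustration, not actual ACLED output.

```python
from collections import Counter


def tally_sources(records, source_field="source", sep=";"):
    """Count how often each outlet is credited across a list of event records."""
    counts = Counter()
    for record in records:
        for name in record.get(source_field, "").split(sep):
            name = name.strip()
            if name:
                counts[name] += 1
    return counts


# Invented placeholder records (assumed shape, not real query results).
sample_records = [
    {"country": "Cameroon", "source": "Journal du Cameroun; AFP"},
    {"country": "Cameroon", "source": "Journal du Cameroun"},
    {"country": "Nigeria", "source": "Vanguard (Nigeria); Reuters"},
]

source_counts = tally_sources(sample_records)
for name, n in source_counts.most_common():
    print(f"{name}: {n}")
```

A tally like this would make the selection less arbitrary by showing which outlets are credited most often for each conflict.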

## Site Scraping & Text Corpus Creation: Beautiful Soup 

I intend to find the URLs of all articles pertaining to either crisis on the outlets' sites and read their raw text into a dump file.
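
A minimal sketch of the two scraping steps with Beautiful Soup, run on inline HTML so it needs no network access. The keyword list, the `example.com` base URL, and the assumption that article bodies live in `<p>` tags are all illustrative; each outlet's real markup would need its own selectors.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Illustrative keywords; the real list would need tuning per crisis.
KEYWORDS = ("ambazonia", "anglophone", "boko haram")


def extract_article_links(html, base_url, keywords=KEYWORDS):
    """Return absolute URLs of links whose anchor text mentions a keyword."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        title = a.get_text(" ", strip=True).lower()
        if any(k in title for k in keywords):
            links.append(urljoin(base_url, a["href"]))
    return links


def extract_article_text(html):
    """Join the text of all non-empty <p> elements, one paragraph per line."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return "\n".join(p for p in paragraphs if p)


# Invented placeholder pages standing in for an outlet's index and article.
index_html = """
<ul>
  <li><a href="/news/1">Boko Haram attack reported in Borno</a></li>
  <li><a href="/sport/2">Cup final preview</a></li>
</ul>
"""
article_html = (
    "<article><h1>Headline</h1>"
    "<p>First paragraph.</p><p>Second paragraph.</p></article>"
)

links = extract_article_links(index_html, "https://example.com")
text = extract_article_text(article_html)
print(links)
print(text)
```

In practice the HTML would come from an HTTP client, and each extracted text could be appended to the dump file alongside its URL.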

## Pipeline Step 1: Preprocessing: scikit-learn & Gensim

Next I intend to parse the raw text, calculate a TF-IDF matrix, generate the word list, define a target number of topics, link words to the topics, and then extract top words with their loadings.  

## Pipeline Step 2: Topic Extraction: pLSA, LDA, & NNMF 

Then I intend to fit these models to the text corpus and compare the results, with a focus on examining topic distinctions, overlap, and sparsity. I anticipate issues with locally optimal solutions and overfitting (with pLSA in particular), but I'd ideally like to see some semblance of consistency. 

## Pipeline Step 3: Predictive Classification: Neural Network & Random Forest 

Finally, I intend to train these models to predict incident cause for each article in the text corpus, then test the models' efficacy in the wild by feeding them news articles they haven't seen. Cross-validation and hyperparameter tuning will be built into this step.  
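
A minimal sketch of the random forest half of this step, with cross-validated hyperparameter tuning via scikit-learn's `GridSearchCV`. The texts, the two cause labels (`armed_clash`, `attack_on_civilians`), and the parameter grid are invented for illustration; real labels would have to come from annotated data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Invented placeholder articles with hypothetical incident-cause labels.
texts = [
    "Soldiers and separatist fighters clashed near the village.",
    "The army said fighters ambushed a military convoy.",
    "Troops exchanged fire with armed fighters in the town.",
    "Fighters attacked an army post and soldiers returned fire.",
    "Insurgents attacked the market and burned several shops.",
    "Residents fled after gunmen raided homes at night.",
    "Gunmen abducted villagers and burned their homes.",
    "Insurgents killed traders at the market.",
]
labels = ["armed_clash"] * 4 + ["attack_on_civilians"] * 4

# Vectorizing inside the pipeline keeps each CV fold's vocabulary separate.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", RandomForestClassifier(random_state=0)),
])

# A deliberately tiny grid; a real search would cover more values.
param_grid = {
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [None, 5],
}
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(texts, labels)

# Articles the model has not seen during training.
unseen = ["Gunmen burned shops after raiding the village market."]
predictions = search.predict(unseen)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
print("Predicted cause:", predictions[0])
```

With only eight toy documents the scores carry no meaning; the point is the structure. Swapping `RandomForestClassifier` for `sklearn.neural_network.MLPClassifier` in the `clf` slot, with its own parameter grid, would give a simple neural network baseline in the same pipeline.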