# PHASE 5 CAPSTONE PROJECT

Magdalene Ondimu

Najma Abdi

Leon Maina

Brian Kariithi

Wilfred Lekisherumogi

### 1. INTRODUCTION
Protests are significant socio-political events that can shape the trajectory of nations and influence global dynamics. Understanding the dynamics of protests, including their causes, demands, and state responses, is critical for policymakers, researchers, and international organizations. Previous research has demonstrated the importance of protests in driving political change and social movements. For instance, Tilly (2004) highlights how protests have historically served as a mechanism for marginalized groups to voice their demands and effect change. Additionally, studies by Beissinger (2002) and Chenoweth and Stephan (2011) have shown the impact of mass mobilizations on political outcomes and the conditions under which nonviolent protests are more likely to succeed.

In the contemporary world, the need to understand protest dynamics remains as crucial as ever. The global landscape is marked by significant political, economic, and social upheavals. Protests continue to play a pivotal role in challenging injustices, advocating for rights, and prompting governmental reforms. The Arab Spring, the Black Lives Matter movement, and recent protests in response to economic policies and climate change are testament to the enduring power and relevance of collective action.

Analyzing modern protest data can provide invaluable insights into current socio-political climates, helping to forecast potential unrest, understand public sentiment, and guide policy responses. By studying this dataset, which covers protests worldwide from 1990 to March 2020, we aim to uncover patterns and trends in protests, identify the key issues being protested, and understand how governments typically respond to such events.


### 2. BUSINESS UNDERSTANDING
Policymakers, researchers, and international organizations need a clear picture of why protests arise, what protesters demand, and how states respond. This project analyzes protest events worldwide from 1990 to March 2020, focusing on underlying factors, geographical distribution, and temporal trends. Using NLP techniques, it uncovers common themes in protester demands and evaluates the effectiveness and impact of state responses. The resulting insights are intended to inform actionable policy recommendations, deepen the understanding of social movements, and improve strategies for managing social unrest. Key success metrics include accurately identifying protest trends, categorizing demands, evaluating state responses, and achieving high accuracy in sentiment analysis and topic relevance.

#### Main Objectives
**To Analyze Protest Events:**

* Identify the underlying factors that lead to mass protests globally.
* Examine the geographical distribution and temporal trends of protest events from 1990 to March 2020.
* Understand the patterns and characteristics of protests, including their scale and intensity.

**To Understand Protester Demands:**

* Analyze the diversity of demands made by protesters across different regions and countries.
* Investigate common themes and variations in protester motivations and grievances.
* Apply Natural Language Processing (NLP) techniques to extract and analyze textual data related to protester demands.

**To Evaluate State Responses:**

* Assess the effectiveness and impact of government and state responses to protests.
* Classify and analyze types of responses from governments, including their strategies and outcomes.
* Provide insights into how state responses influence the outcomes and trajectories of protest movements.

**To Provide Actionable Insights:**

* Offer actionable policy recommendations to policymakers and government officials based on the findings.
* Enhance the understanding of social movements and political unrest among researchers, academics, and civil society.
* Foster informed decision-making and improve strategies for managing social unrest and public grievances.

#### Key Questions

* How has the frequency of protests changed over the years?
* What are the most common demands made by protesters?
* How do state responses vary by region and type of protest?
* What sentiments and topics are prevalent in the narratives around protests?



#### SPECIFIC OBJECTIVES
***1. Data Collection and Preprocessing:***
* Extract and preprocess textual data on protester demands from the dataset.
* Prepare the text for LDA by removing noise and stop words and tokenizing.

***2. Topic Modelling Using LDA:***
* Implement LDA to identify underlying topics and themes within protester demands, optimizing parameters for coherence and interpretability.

***3. Sentiment Analysis and Machine Learning:***
* Utilize NLP techniques for sentiment analysis on protester demands, integrating LDA-derived topics for enhanced classification accuracy.
* Apply machine learning models like Logistic Regression to classify state responses based on textual data, incorporating LDA topics as features.

***4. Geospatial and Temporal Analysis:***
* Map protest hotspots and analyze trends over time using the dataset's location and date information.
* Conduct time series analysis to detect patterns and trends in protest occurrences.
* Leverage LDA topics to understand variations in protester demands across regions and time periods.

***5. Policy Recommendations:***
* Derive actionable policy recommendations to improve state-citizen relations and manage social unrest effectively, informed by comprehensive insights from sentiment analysis, state response classification, and LDA-derived topics.
* Focus on proactive measures to address common demands and grievances.

### 3. DATA UNDERSTANDING
id: Unique identifier for each protest event.

country: The country where the protest occurred.

ccode: Country code.

year: The year the protest occurred.

region: The region where the country is located.

protest: Indicator if there was a protest (1) or not (0).

protestnumber: Sequential number of the protest within its country and year.

startday: The day the protest started.

startmonth: The month the protest started.

startyear: The year the protest started.

protesterdemand1: Primary protester demands.

protesterdemand2: Secondary protester demands.

protesterdemand3: Tertiary protester demands.

protesterdemand4: Additional protester demands.

stateresponse1: Primary state response.

stateresponse2: Secondary state response.

stateresponse3: Tertiary state response.

stateresponse4: Quaternary state response.

stateresponse5: Quinary state response.

stateresponse6: Senary state response.

stateresponse7: Septenary state response.

sources: Sources of information about the protest.

notes: Additional notes about the protest.


### 4. METRICS OF SUCCESS

Trend Identification: 
The ability to accurately identify and visualize trends in protest frequency over time.

Demand Categorization: 
Successfully categorizing and quantifying common protester demands.

Response Evaluation: 
Assessing the effectiveness and variation of state responses.

Sentiment Accuracy: 
Achieving high accuracy in sentiment analysis of protest notes.

Topic Relevance:
Effectively identifying and summarizing key topics from the protest notes.

By conducting these analyses, we can gain a comprehensive understanding of global protest dynamics, which can inform policy decisions and future research directions.


**Data Exploration**

In [1]:
# Importing relevant libraries
#Basic libraries
import pandas as pd 
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline
import seaborn as sns
import re


#NLTK libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
import string
import wordcloud
from wordcloud import WordCloud, STOPWORDS
from nltk.stem.porter import PorterStemmer

# Machine Learning libraries
import sklearn
from sklearn import svm, datasets
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler, label_binarize
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier

import tensorflow
import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


#Metrics libraries
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, auc


#Visualization libraries
from plotly import tools
import plotly.graph_objs as go
from plotly.offline import iplot

#Ignore warnings
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# Loading the Dataset
df = pd.read_csv('mass_mobilization.csv')
df.head()

| | id | country | ccode | year | region | protest | protestnumber | startday | startmonth | startyear | ... | protesterdemand4 | stateresponse1 | stateresponse2 | stateresponse3 | stateresponse4 | stateresponse5 | stateresponse6 | stateresponse7 | sources | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201990001 | Canada | 20 | 1990 | North America | 1 | 1 | 15.0 | 1.0 | 1990.0 | ... | NaN | ignore | NaN | NaN | NaN | NaN | NaN | NaN | 1. great canadian train journeys into history;... | canada s railway passenger system was finally ... |
| 1 | 201990002 | Canada | 20 | 1990 | North America | 1 | 2 | 25.0 | 6.0 | 1990.0 | ... | NaN | ignore | NaN | NaN | NaN | NaN | NaN | NaN | 1. autonomy s cry revived in quebec the new yo... | protestors were only identified as young peopl... |
| 2 | 201990003 | Canada | 20 | 1990 | North America | 1 | 3 | 1.0 | 7.0 | 1990.0 | ... | NaN | ignore | NaN | NaN | NaN | NaN | NaN | NaN | 1. quebec protest after queen calls for unity ... | the queen, after calling on canadians to remai... |
| 3 | 201990004 | Canada | 20 | 1990 | North America | 1 | 4 | 12.0 | 7.0 | 1990.0 | ... | NaN | accomodation | NaN | NaN | NaN | NaN | NaN | NaN | 1. indians gather as siege intensifies; armed ... | canada s federal government has agreed to acqu... |
| 4 | 201990005 | Canada | 20 | 1990 | North America | 1 | 5 | 14.0 | 8.0 | 1990.0 | ... | NaN | crowd dispersal | arrests | accomodation | NaN | NaN | NaN | NaN | 1. dozens hurt in mohawk blockade protest the ... | protests were directed against the state due t... |


In [3]:
df.shape

(17145, 31)

In [4]:
df.columns

Index(['id', 'country', 'ccode', 'year', 'region', 'protest', 'protestnumber',
       'startday', 'startmonth', 'startyear', 'endday', 'endmonth', 'endyear',
       'protesterviolence', 'location', 'participants_category',
       'participants', 'protesteridentity', 'protesterdemand1',
       'protesterdemand2', 'protesterdemand3', 'protesterdemand4',
       'stateresponse1', 'stateresponse2', 'stateresponse3', 'stateresponse4',
       'stateresponse5', 'stateresponse6', 'stateresponse7', 'sources',
       'notes'],
      dtype='object')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17145 entries, 0 to 17144
Data columns (total 31 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     17145 non-null  int64  
 1   country                17145 non-null  object 
 2   ccode                  17145 non-null  int64  
 3   year                   17145 non-null  int64  
 4   region                 17145 non-null  object 
 5   protest                17145 non-null  int64  
 6   protestnumber          17145 non-null  int64  
 7   startday               15239 non-null  float64
 8   startmonth             15239 non-null  float64
 9   startyear              15239 non-null  float64
 10  endday                 15239 non-null  float64
 11  endmonth               15239 non-null  float64
 12  endyear                15239 non-null  float64
 13  protesterviolence      15758 non-null  float64
 14  location               15218 non-null  object 
 15  pa

In [6]:
df.describe()

| | id | ccode | year | protest | protestnumber | startday | startmonth | startyear | endday | endmonth | endyear | protesterviolence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 17145.0 | 17145.0 | 17145.0 | 17145.0 | 17145.0 | 15239.0 | 15239.0 | 15239.0 | 15239.0 | 15239.0 | 15239.0 | 15758.0 |
| mean | 4380888000.0 | 437.888189 | 2006.171654 | 0.888831 | 7.406299 | 15.455935 | 6.227836 | 2006.326465 | 15.580616 | 6.24352 | 2006.329221 | 0.25606 |
| std | 2320550000.0 | 232.054953 | 8.987378 | 0.314351 | 11.854041 | 8.817037 | 3.461912 | 8.958007 | 8.803944 | 3.461745 | 8.959254 | 0.436469 |
| min | 201990000.0 | 20.0 | 1990.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1990.0 | 1.0 | 1.0 | 1990.0 | 0.0 |
| 25% | 2202010000.0 | 220.0 | 1998.0 | 1.0 | 1.0 | 8.0 | 3.0 | 1999.0 | 8.0 | 3.0 | 1999.0 | 0.0 |
| 50% | 4342008000.0 | 434.0 | 2007.0 | 1.0 | 3.0 | 15.0 | 6.0 | 2007.0 | 16.0 | 6.0 | 2007.0 | 0.0 |
| 75% | 6512005000.0 | 651.0 | 2014.0 | 1.0 | 8.0 | 23.0 | 9.0 | 2014.0 | 23.0 | 9.0 | 2014.0 | 1.0 |
| max | 9102020000.0 | 910.0 | 2020.0 | 1.0 | 143.0 | 31.0 | 12.0 | 2020.0 | 31.0 | 12.0 | 2020.0 | 1.0 |


* The dataset has 17,145 rows and 31 columns.
* It contains both categorical and numerical data, and many columns have a large number of missing values.
* The numerical columns are mainly binary indicators and date components.
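
To put a number on "a large number of missing values", the share of nulls per column can be computed directly. The sketch below uses a small made-up frame (`sample` and its values are placeholders, not rows from the dataset); on the real data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the protest data (values are made up).
sample = pd.DataFrame({
    "country": ["Canada", "Kenya", "Chile", "India"],
    "startday": [15.0, np.nan, 1.0, 12.0],
    "stateresponse2": [np.nan, np.nan, "arrests", np.nan],
})

# Share of missing values per column, largest first.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```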

**Data Preprocessing**

In [7]:
# Aggregation of protester demands and state responses.
df['demands'] = df[['protesterdemand1', 'protesterdemand2', 'protesterdemand3', 'protesterdemand4']].apply(lambda x: ', '.join(x.dropna().astype(str)), axis=1)
df['response'] = df[['stateresponse1', 'stateresponse2', 'stateresponse3', 'stateresponse4', 'stateresponse5', 'stateresponse6', 'stateresponse7']].apply(lambda x: ', '.join(x.dropna().astype(str)), axis=1)
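
As a small illustration of what this row-wise join produces, the sketch below applies the same pattern to a made-up two-column frame (`toy` and its values are hypothetical). Note that a row with no demands at all becomes an empty string, not a null, which would explain why the aggregated columns later report no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the demand columns.
toy = pd.DataFrame({
    "protesterdemand1": ["labor wage dispute", "tax policy", np.nan],
    "protesterdemand2": ["price increases", np.nan, np.nan],
})

# Join the non-null entries of each row into one comma-separated string.
toy["demands"] = toy[["protesterdemand1", "protesterdemand2"]].apply(
    lambda row: ", ".join(row.dropna().astype(str)), axis=1
)
print(toy["demands"].tolist())
```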

In [8]:
df.head()

| | id | country | ccode | year | region | protest | protestnumber | startday | startmonth | startyear | ... | stateresponse2 | stateresponse3 | stateresponse4 | stateresponse5 | stateresponse6 | stateresponse7 | sources | notes | demands | response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201990001 | Canada | 20 | 1990 | North America | 1 | 1 | 15.0 | 1.0 | 1990.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1. great canadian train journeys into history;... | canada s railway passenger system was finally ... | political behavior, process, labor wage dispute | ignore |
| 1 | 201990002 | Canada | 20 | 1990 | North America | 1 | 2 | 25.0 | 6.0 | 1990.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1. autonomy s cry revived in quebec the new yo... | protestors were only identified as young peopl... | political behavior, process | ignore |
| 2 | 201990003 | Canada | 20 | 1990 | North America | 1 | 3 | 1.0 | 7.0 | 1990.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1. quebec protest after queen calls for unity ... | the queen, after calling on canadians to remai... | political behavior, process | ignore |
| 3 | 201990004 | Canada | 20 | 1990 | North America | 1 | 4 | 12.0 | 7.0 | 1990.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1. indians gather as siege intensifies; armed ... | canada s federal government has agreed to acqu... | land farm issue | accomodation |
| 4 | 201990005 | Canada | 20 | 1990 | North America | 1 | 5 | 14.0 | 8.0 | 1990.0 | ... | arrests | accomodation | NaN | NaN | NaN | NaN | 1. dozens hurt in mohawk blockade protest the ... | protests were directed against the state due t... | political behavior, process | crowd dispersal, arrests, accomodation |


In [9]:
# Function to create a date string from day, month, and year
def create_date_string(row, col_prefix):
    try:
        return f"{int(row[f'{col_prefix}year']):04d}-{int(row[f'{col_prefix}month']):02d}-{int(row[f'{col_prefix}day']):02d}"
    except ValueError:
        return None

# Apply the function to create date strings
df['start_date'] = df.apply(lambda row: create_date_string(row, 'start'), axis=1)
df['end_date'] = df.apply(lambda row: create_date_string(row, 'end'), axis=1)

# Convert the date strings to datetime
df['start_date'] = pd.to_datetime(df['start_date'], errors='coerce')
df['end_date'] = pd.to_datetime(df['end_date'], errors='coerce')

# Calculate protest_duration in days
df['protest_duration'] = (df['end_date'] - df['start_date']).dt.days
# Display the new columns
df[['start_date', 'end_date', 'protest_duration']].head()


| | start_date | end_date | protest_duration |
|---|---|---|---|
| 0 | 1990-01-15 | 1990-01-15 | 0.0 |
| 1 | 1990-06-25 | 1990-06-25 | 0.0 |
| 2 | 1990-07-01 | 1990-07-01 | 0.0 |
| 3 | 1990-07-12 | 1990-09-06 | 56.0 |
| 4 | 1990-08-14 | 1990-08-15 | 1.0 |
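
The `errors='coerce'` option is what keeps this step from failing on incomplete records. The sketch below, on made-up date strings, shows how a valid date, an impossible date, and a missing value behave: the last two become `NaT`, and the resulting duration is `NaN`.

```python
import pandas as pd

# Made-up date strings: a valid date, an impossible date, and a missing value.
raw = pd.DataFrame({
    "start": ["1990-07-12", "1990-02-30", None],
    "end": ["1990-09-06", "1990-03-01", None],
})

# Unparseable or missing entries are turned into NaT instead of raising.
start = pd.to_datetime(raw["start"], errors="coerce")
end = pd.to_datetime(raw["end"], errors="coerce")

# Subtracting datetimes gives timedeltas; NaT propagates as NaN days.
duration = (end - start).dt.days
print(duration.tolist())
```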


In [10]:
# Converting textual representations of participants to numeric.
# The following code retains over 96% of the non-null values in 'participants'.

def parse_texts(x):
    """
    Parses specific textual representations of participant counts into numeric values.
    Handles predefined text patterns like 'dozens', 'hundreds', etc.
    """
    x = x.lower()
    
    # More specific phrases come first: the loop below returns on the first
    # substring match, so "tens of thousands" must be tested before "thousands".
    text_mapping = {
        "hundreds of thousands": 250000,
        "tens of thousands": 50000,
        "a few thousand": 3000,
        "several 1000s": 5000,
        "few thousand": 3000,
        "a few hundred": 300,
        "a few dozen": 36,
        "few dozen": 24,
        "thousands": 5000,
        "hundreds": 500,
        "dozens": 50,
        "millions": 2000000,
        "million": 1000000,
        "a group": 10,
        "busloads": 50,
        "widespread": 500,
        "scores": 50,
    }
    
    for key, value in text_mapping.items():
        if key in x:
            return value
    
    # Approximate counts such as "about 1,000" or "more than 500":
    # keep thousands separators inside the match, then strip them.
    if "about " in x or "more than " in x:
        match = re.search(r'\d[\d,]*', x)
        if match:
            return int(match.group().replace(',', ''))
    
    if "several" in x:
        if "dozen" in x:
            return 50
        elif "hundred" in x:
            return 500
        elif "thousand" in x:
            return 5000
    
    return x

def strip_chars(x):
    """
    Removes unwanted characters from the string and converts to integer if possible.
    Specifically handles values ending in 's' by multiplying the preceding number by 5.
    """
    banned_chars = "+><,"
    x = "".join([c for c in x if c not in banned_chars])
    
    if x.endswith('s') and x[:-1].isdigit():
        return int(x[:-1]) * 5
    
    try:
        return int(x)
    except ValueError:
        return x

def avg_hyphen(x):
    """
    Calculates the average for values specified as a range (e.g., '100-200').
    """
    accepted_chars = "1234567890-"
    x = "".join([c for c in x if c in accepted_chars])
    
    if "-" in x:
        lower, upper = x.split("-")
        if lower.isdigit() and upper.isdigit():
            return (int(lower) + int(upper)) // 2
    
    return np.nan

def map_participants(x):
    """
    Sequentially applies parsing, stripping, and averaging to convert text representations
    of participant counts into numeric values.
    """
    while isinstance(x, str):
        x = parse_texts(x)
        if isinstance(x, str):
            x = strip_chars(x)
        if isinstance(x, str):
            x = avg_hyphen(x)
        if isinstance(x, str):
            x = np.nan
    return x
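
A first-match substring lookup such as the loop in `parse_texts` depends on the order in which phrases are tested, because "thousands" is also a substring of "tens of thousands". One order-independent option is to test longer phrases first. The sketch below is a minimal illustration; `PHRASES` and `lookup_phrase` are hypothetical names, and only a few of the phrases are reproduced.

```python
# Hypothetical phrase table; values mirror a few entries used in parse_texts.
PHRASES = {
    "thousands": 5000,
    "tens of thousands": 50000,
    "hundreds": 500,
    "hundreds of thousands": 250000,
}


def lookup_phrase(text, phrases=PHRASES):
    """Return the value of the longest phrase contained in text, or None."""
    text = text.lower()
    # Longest phrases first, so a short phrase never shadows a longer one.
    for phrase in sorted(phrases, key=len, reverse=True):
        if phrase in text:
            return phrases[phrase]
    return None


print(lookup_phrase("Tens of thousands"))     # 50000
print(lookup_phrase("hundreds of students"))  # 500
```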

In [11]:
# Converting 'participants' to usable values.
df['participants_numeric'] = df["participants"].map(map_participants)
df[['participants', 'participants_numeric']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17145 entries, 0 to 17144
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   participants          15746 non-null  object 
 1   participants_numeric  15137 non-null  float64
dtypes: float64(1), object(1)
memory usage: 268.0+ KB


In [12]:
df.columns

Index(['id', 'country', 'ccode', 'year', 'region', 'protest', 'protestnumber',
       'startday', 'startmonth', 'startyear', 'endday', 'endmonth', 'endyear',
       'protesterviolence', 'location', 'participants_category',
       'participants', 'protesteridentity', 'protesterdemand1',
       'protesterdemand2', 'protesterdemand3', 'protesterdemand4',
       'stateresponse1', 'stateresponse2', 'stateresponse3', 'stateresponse4',
       'stateresponse5', 'stateresponse6', 'stateresponse7', 'sources',
       'notes', 'demands', 'response', 'start_date', 'end_date',
       'protest_duration', 'participants_numeric'],
      dtype='object')

* Aggregated protester demands and state responses into 'demands' and 'response' columns respectively.
* Created 'start_date', 'end_date' and 'protest_duration' columns using the date data.
* Converted the textual participant counts into numeric values, stored in 'participants_numeric'.

**Data Cleaning**

In [13]:
# Columns to drop
time_drops = [
    'startday', 'startmonth', 'startyear', 'endday', 'endmonth', 'endyear'
]  # Redundant now that 'start_date' and 'end_date' exist.
other_drops = [
    'id',  # Identifier only; not useful for prediction.
    'ccode',  # Not useful for prediction, as 'country' carries the same information.
    'protest',  # Binary flag; rows with 0 record no protest and are otherwise empty.
    'protestnumber',  # Running count of protests per year (increments with each protest).
    'location',  # Of limited use, since geography is already broken down by region.
    'participants_category',  # Too many nulls; the same information is captured in 'participants_numeric'.
    'participants',  # 'participants_numeric' holds the numeric values of this column.
]
demand_drops = [
    'protesterdemand1', 'protesterdemand2', 'protesterdemand3',
    'protesterdemand4'
]  # Mostly null as individual columns; aggregated in the 'demands' column.
response_drops = [
    'stateresponse1', 'stateresponse2', 'stateresponse3', 'stateresponse4',
    'stateresponse5', 'stateresponse6', 'stateresponse7'
]  # Mostly null as individual columns; aggregated in the 'response' column.

columns_to_drop = time_drops + other_drops + demand_drops + response_drops
df1 = df.drop(columns=columns_to_drop)

In [14]:
# Checking if columns were dropped.
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17145 entries, 0 to 17144
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   country               17145 non-null  object        
 1   year                  17145 non-null  int64         
 2   region                17145 non-null  object        
 3   protesterviolence     15758 non-null  float64       
 4   protesteridentity     14684 non-null  object        
 5   sources               15235 non-null  object        
 6   notes                 15193 non-null  object        
 7   demands               17145 non-null  object        
 8   response              17145 non-null  object        
 9   start_date            15239 non-null  datetime64[ns]
 10  end_date              15239 non-null  datetime64[ns]
 11  protest_duration      15239 non-null  float64       
 12  participants_numeric  15137 non-null  float64       
dtypes: datetime64[ns

In [15]:
# Handling null values.
df1.fillna(value={"protesteridentity":"unspecified"}, inplace=True)
col_with_null = [
    'protesterviolence', #Will drop nulls as this is a crucial column in analysis
    'sources', #Crucial column in text analysis
    'notes', #Crucial column in text analysis
    'start_date', #Crucial column in terms of temporal analysis
    'end_date', #Crucial column in terms of temporal analysis
    'protest_duration', #Crucial column in terms of temporal analysis
    'participants_numeric',#Important in understanding protester counts
]

# Dropping rows with a null in any of the listed columns
df1.dropna(subset=col_with_null, inplace=True)

In [16]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15087 entries, 0 to 17141
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   country               15087 non-null  object        
 1   year                  15087 non-null  int64         
 2   region                15087 non-null  object        
 3   protesterviolence     15087 non-null  float64       
 4   protesteridentity     15087 non-null  object        
 5   sources               15087 non-null  object        
 6   notes                 15087 non-null  object        
 7   demands               15087 non-null  object        
 8   response              15087 non-null  object        
 9   start_date            15087 non-null  datetime64[ns]
 10  end_date              15087 non-null  datetime64[ns]
 11  protest_duration      15087 non-null  float64       
 12  participants_numeric  15087 non-null  float64       
dtypes: datetime64[ns

* Dropped columns for the reasons given in the code comments.
* Filled missing 'protesteridentity' values with 'unspecified' and dropped rows with nulls in the remaining columns, where imputing wasn't possible.
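
The impute-then-drop pattern summarized above can be seen on a toy frame. The data below is made up, and only two of the analysis-critical columns are included to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the cleaned frame.
toy = pd.DataFrame({
    "protesteridentity": ["students", np.nan, "farmers"],
    "notes": ["march in capital", "strike", np.nan],
    "protesterviolence": [0.0, 1.0, 0.0],
})

# Impute the one column where a placeholder label is meaningful.
toy = toy.fillna({"protesteridentity": "unspecified"})

# Drop rows that are still missing any analysis-critical field.
toy = toy.dropna(subset=["notes", "protesterviolence"])
print(toy)
```

Filling first matters here: the second row survives because its missing identity is replaced before any rows are dropped.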

***Miscellaneous Cleaning***

In [17]:
# Ensure columns have consistent data types
expected_types = {
    'country': 'object',
    'year': 'int64',
    'region': 'object',
    'protesterviolence': 'int64',
    'protesteridentity': 'object',
    'demands': 'object',
    'response': 'object',
    'start_date': 'datetime64[ns]',
    'end_date': 'datetime64[ns]',
    'protest_duration': 'int64',
    'participants_numeric': 'int64',
    'sources': 'object',
    'notes': 'object'
}

# Cast each column to its expected type
for column, dtype in expected_types.items():
    if dtype == 'datetime64[ns]':
        df1[column] = pd.to_datetime(df1[column], errors='coerce')
    else:
        df1[column] = df1[column].astype(dtype, errors='ignore')

In [18]:
# Check for duplicates and remove them if any
df1 = df1.drop_duplicates()

In [19]:
df1.shape

(15076, 13)

In [20]:
# Rearrange columns
columns_order = [
    'region', 'country', 'year', 'start_date', 'end_date', 'protest_duration',
    'participants_numeric', 'protesterviolence', 'protesteridentity',
    'demands', 'response', 'sources', 'notes'
]
df1 = df1[columns_order]

* Ensured all columns have consistent data types.
* Dropped 11 duplicated rows.
* At the end of the cleaning process, about 88% of the original rows remain, giving a shape of (15076, 13).
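
The figures in this summary follow directly from the row counts reported by the cells above, as this short check shows (the variable names are only for illustration):

```python
rows_original = 17145      # rows reported by df.shape after loading
rows_after_dropna = 15087  # rows reported by df1.info() after dropping nulls
rows_final = 15076         # rows reported by df1.shape after drop_duplicates

duplicates_removed = rows_after_dropna - rows_final
retention = rows_final / rows_original

print(duplicates_removed)   # 11
print(f"{retention:.1%}")   # 87.9%
```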

**Feature Engineering**

In [21]:
# Split the 'demands' and 'response' columns into multiple columns and apply one-hot encoding
demands_split = df1['demands'].str.get_dummies(sep=', ')
response_split = df1['response'].str.get_dummies(sep=', ')

# Add a prefix to avoid column name clashes
demands_split = demands_split.add_prefix('demand_')
response_split = response_split.add_prefix('response_')

# Concatenate the original DataFrame with the new one-hot encoded columns
df1 = pd.concat([df1, demands_split, response_split], axis=1)

# Drop the original 'demands' and 'response' columns
df1 = df1.drop(columns=['demands', 'response'])


In [22]:
df1.head()

| | region | country | year | start_date | end_date | protest_duration | participants_numeric | protesterviolence | protesteridentity | sources | ... | demand_social restrictions | demand_tax policy | response_. | response_accomodation | response_arrests | response_beatings | response_crowd dispersal | response_ignore | response_killings | response_shootings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | North America | Canada | 1990 | 1990-01-15 | 1990-01-15 | 0 | 5000 | 0 | unspecified | 1. great canadian train journeys into history;... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | North America | Canada | 1990 | 1990-06-25 | 1990-06-25 | 0 | 1000 | 0 | unspecified | 1. autonomy s cry revived in quebec the new yo... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | North America | Canada | 1990 | 1990-07-01 | 1990-07-01 | 0 | 500 | 0 | separatist parti quebecois | 1. quebec protest after queen calls for unity ... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | North America | Canada | 1990 | 1990-07-12 | 1990-09-06 | 56 | 500 | 1 | mohawk indians | 1. indians gather as siege intensifies; armed ... | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | North America | Canada | 1990 | 1990-08-14 | 1990-08-15 | 1 | 950 | 1 | local residents | 1. dozens hurt in mohawk blockade protest the ... | ... | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |


In [23]:
df1.columns

Index(['region', 'country', 'year', 'start_date', 'end_date',
       'protest_duration', 'participants_numeric', 'protesterviolence',
       'protesteridentity', 'sources', 'notes', 'demand_.',
       'demand_labor wage dispute', 'demand_land farm issue',
       'demand_police brutality', 'demand_political behavior',
       'demand_price increases', 'demand_process',
       'demand_removal of politician', 'demand_social restrictions',
       'demand_tax policy', 'response_.', 'response_accomodation',
       'response_arrests', 'response_beatings', 'response_crowd dispersal',
       'response_ignore', 'response_killings', 'response_shootings'],
      dtype='object')
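
The column list above contains both `demand_political behavior` and `demand_process`, which suggests that a single label containing a comma ("political behavior, process") is being split in two by the `', '` separator. If that label is indeed one category (an assumption based on how it appears in the `demands` column), one possible workaround is to join and split on a character that never occurs inside a label. The sketch below uses made-up rows and `"|"` as the separator; it is not the encoding used in this notebook.

```python
import pandas as pd

# Two made-up rows; the first carries two demands, one of which contains ", ".
per_row_demands = [
    ["political behavior, process", "labor wage dispute"],
    ["land farm issue"],
]

# Joining with "|" keeps each label intact when it is split again.
joined = pd.Series(["|".join(labels) for labels in per_row_demands])
dummies = joined.str.get_dummies(sep="|").add_prefix("demand_")
print(dummies.columns.tolist())
```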

In [24]:
# Drop placeholder columns
df1 = df1.drop(columns=['demand_.', 'response_.'])  # '.' appeared as a literal entry in the raw data and is not useful.

# Column order
col_order = [
    'region', 'country', 'year', 'start_date', 'end_date', 'protest_duration',
    'participants_numeric', 'protesterviolence', 'protesteridentity',
    'demand_labor wage dispute', 'demand_land farm issue',
    'demand_police brutality', 'demand_political behavior',
    'demand_price increases', 'demand_process', 'demand_removal of politician',
    'demand_social restrictions', 'demand_tax policy', 'response_accomodation',
    'response_arrests', 'response_beatings', 'response_crowd dispersal',
    'response_ignore', 'response_killings', 'response_shootings', 'sources',
    'notes'
]
df_cleaned = df1[col_order]

In [25]:
# Reset the index of the DataFrame
df_cleaned = df_cleaned.reset_index(drop=True)

In [26]:
# Save the cleaned and dummified data to a new CSV in the specified directory
save_path = "C:\\Users\\USER\\Desktop\\capstone_project\\mass_mobilization_cleaned.csv"
df_cleaned.to_csv(save_path, index=False)

print(f"File saved to {save_path}")

File saved to C:\Users\USER\Desktop\capstone_project\mass_mobilization_cleaned.csv
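
Because the save path above is tied to one machine, a path built with `pathlib` may be easier to reuse elsewhere. The sketch below is only an illustration: `df_demo` stands in for `df_cleaned`, and a temporary directory is used as a hypothetical output folder so the example runs anywhere.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Stand-in for df_cleaned; the real frame comes from the steps above.
df_demo = pd.DataFrame({"country": ["Canada"], "year": [1990]})

# Hypothetical output folder inside a temporary directory.
output_dir = Path(tempfile.mkdtemp()) / "capstone_project"
output_dir.mkdir(parents=True, exist_ok=True)

# Build the file path from parts instead of hard-coding separators.
save_path = output_dir / "mass_mobilization_cleaned.csv"
df_demo.to_csv(save_path, index=False)

# Read the file back to confirm it was written as expected.
reloaded = pd.read_csv(save_path)
print(f"File saved to {save_path}")
```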


* Split the 'demands' and 'response' columns into individual one-hot encoded columns.
* Reset the index of the cleaned DataFrame.
* Saved the cleaned data as a CSV file in a local directory.