# **PHASE 5 CAPSTONE PROJECT**

**Students**

Magdalene Ondimu

Najma Abdi

Leon Maina

Brian Kariithi

Wilfred Lekishorumongi

**Technical Mentors:** William Okwomba, Noah Kandie, Bonface Manyara

**Mode of Study:** Part time

In [2]:
from IPython.display import Image

# URL of the image
url = "https://blog.prif.org/wp-content/uploads/2020/12/Wolff-legacy-protest-movements-latin-america-Chile-BLOG.jpg"

# Display the image
Image(url=url)


## 1.1 INTRODUCTION

Protests are significant socio-political events that shape the trajectory of nations and influence global dynamics. Understanding the dynamics of protests—such as their causes, demands, and state responses—is critical for policymakers, researchers, and international organizations. Analyzing modern protest data can provide invaluable insights into current socio-political climates, helping to forecast potential unrest, understand public sentiment, and guide policy responses.

## 1.2 PROBLEM STATEMENT

Despite the importance of protests in driving political change and social movements, a comprehensive analysis of their underlying factors, geographical distribution, and temporal trends on a global scale is still needed. This project aims to fill this gap by analyzing global protest events from 1990 to March 2020.

## 1.3 BUSINESS UNDERSTANDING

This project aims to analyze protest events globally, focusing on identifying underlying factors, geographical distribution, and temporal trends. By leveraging NLP techniques, it will uncover common themes in protester demands and evaluate the effectiveness and impact of various state responses. The insights gained will inform actionable policy recommendations, enhance the understanding of social movements, and improve strategies for managing social unrest.

## 1.4 MAIN OBJECTIVE

**To analyze protest events**

* Identify the underlying factors that lead to mass protests globally.
* Examine the geographical distribution and temporal trends of protest events from 1990 to March 2020.
* Understand the patterns and characteristics of protests, including their scale and intensity.


## 1.5 SPECIFIC OBJECTIVES

The specific objectives are:

1. **Identify protest factors:**

   - Determine the key underlying factors that lead to mass protests globally.

2. **Analyze geographic and temporal trends:**

   - Examine the geographical distribution and temporal trends of protest events from 1990 to March 2020.

3. **Evaluate state responses:**

   - Assess the effectiveness and impact of government and state responses to protests.

4. **Derive actionable insights:**

   - Offer actionable policy recommendations to improve state-citizen relations and manage social unrest effectively.

## 1.6 METHODOLOGY

The methodology outlines the steps to achieve the specific objectives and includes the following phases:

1. Data Cleaning.
2. Exploratory Data Analysis.
3. Data Preprocessing.
4. Modelling.
5. Evaluation.
6. Recommendations and Conclusion.

## 1.7 METRICS OF SUCCESS

**Trend Identification:**
The ability to accurately identify and visualize trends in protest frequency over time.

**Response Evaluation:**
Assessing the effectiveness and variation of state responses.

**Sentiment Accuracy:**
Achieving high accuracy in sentiment analysis of protest notes.

**Topic Relevance:**
Effectively identifying and summarizing key topics from the protest notes.

**Model Performance:** 
Achieving high performance in classification tasks, measured by metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.

By conducting these analyses, we can gain a comprehensive understanding of global protest dynamics, which can inform policy decisions and future research directions.
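As a concrete reading of the classification metrics named under Model Performance, the sketch below computes accuracy, precision, recall and F1-score by hand from a confusion matrix. The labels are invented purely for illustration and are not drawn from the protest data; ROC-AUC is left out because it needs predicted scores rather than hard labels.

```python
# Toy binary labels, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

In practice the same quantities would normally come from a library such as scikit-learn; the manual version is shown only to make the definitions explicit.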

## 1.8 DATA UNDERSTANDING

**id**: Unique identifier for each protest event.

**country**: The country where the protest occurred.

**ccode**: Country code.

**year**: The year the protest occurred.

**region**: The region where the country is located.

**protest**: Indicator if there was a protest (1) or not (0).

**protestnumber**: Sequential number of the protest within a year (it increments per protest per year).

**startday**: The day the protest started.

**startmonth**: The month the protest started.

**startyear**: The year the protest started.

**protesterdemand1**: Primary protester demands.

**protesterdemand2**: Secondary protester demands.

**protesterdemand3**: Tertiary protester demands.

**protesterdemand4**: Additional protester demands.

**stateresponse1**: Primary state response.

**stateresponse2**: Secondary state response.

**stateresponse3**: Tertiary state response.

**stateresponse4**: Quaternary state response.

**stateresponse5**: Quinary state response.

**stateresponse6**: Senary state response.

**stateresponse7**: Septenary state response.

**sources**: Sources of information about the protest.

**notes**: Additional notes about the protest.


## 2.0 DATA CLEANING

**Importing the libraries**

In [3]:
# Importing relevant libraries
#Basic libraries
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline
import seaborn as sns
import re


#NLTK libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
import string
import wordcloud
from wordcloud import WordCloud, STOPWORDS
from nltk.stem.porter import PorterStemmer

# Machine Learning libraries
import sklearn
from sklearn import svm, datasets
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler, label_binarize
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier

import tensorflow
import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


#Metrics libraries
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, auc


#Visualization libraries
from plotly import tools
import plotly.graph_objs as go
from plotly.offline import iplot

#Ignore warnings
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Magda\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Magda\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Magda\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [5]:
# Loading the Dataset
df = pd.read_csv('mmALL_073120_csv.csv')
df.head()

Unnamed: 0,id,country,ccode,year,region,protest,protestnumber,startday,startmonth,startyear,...,protesterdemand4,stateresponse1,stateresponse2,stateresponse3,stateresponse4,stateresponse5,stateresponse6,stateresponse7,sources,notes
0,201990001,Canada,20,1990,North America,1,1,15.0,1.0,1990.0,...,,ignore,,,,,,,1. great canadian train journeys into history;...,canada s railway passenger system was finally ...
1,201990002,Canada,20,1990,North America,1,2,25.0,6.0,1990.0,...,,ignore,,,,,,,1. autonomy s cry revived in quebec the new yo...,protestors were only identified as young peopl...
2,201990003,Canada,20,1990,North America,1,3,1.0,7.0,1990.0,...,,ignore,,,,,,,1. quebec protest after queen calls for unity ...,"the queen, after calling on canadians to remai..."
3,201990004,Canada,20,1990,North America,1,4,12.0,7.0,1990.0,...,,accomodation,,,,,,,1. indians gather as siege intensifies; armed ...,canada s federal government has agreed to acqu...
4,201990005,Canada,20,1990,North America,1,5,14.0,8.0,1990.0,...,,crowd dispersal,arrests,accomodation,,,,,1. dozens hurt in mohawk blockade protest the ...,protests were directed against the state due t...


We load the data and preview the first 5 rows to understand the data that we will be using for our analysis.

In [6]:
df.shape

(17145, 31)

Our original dataset has 17145 rows and 31 features, which still need to be cleaned.

In [7]:
df.columns

Index(['id', 'country', 'ccode', 'year', 'region', 'protest', 'protestnumber',
       'startday', 'startmonth', 'startyear', 'endday', 'endmonth', 'endyear',
       'protesterviolence', 'location', 'participants_category',
       'participants', 'protesteridentity', 'protesterdemand1',
       'protesterdemand2', 'protesterdemand3', 'protesterdemand4',
       'stateresponse1', 'stateresponse2', 'stateresponse3', 'stateresponse4',
       'stateresponse5', 'stateresponse6', 'stateresponse7', 'sources',
       'notes'],
      dtype='object')

The data understanding provides details about what each column contains.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17145 entries, 0 to 17144
Data columns (total 31 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     17145 non-null  int64  
 1   country                17145 non-null  object 
 2   ccode                  17145 non-null  int64  
 3   year                   17145 non-null  int64  
 4   region                 17145 non-null  object 
 5   protest                17145 non-null  int64  
 6   protestnumber          17145 non-null  int64  
 7   startday               15239 non-null  float64
 8   startmonth             15239 non-null  float64
 9   startyear              15239 non-null  float64
 10  endday                 15239 non-null  float64
 11  endmonth               15239 non-null  float64
 12  endyear                15239 non-null  float64
 13  protesterviolence      15758 non-null  float64
 14  location               15218 non-null  object 
 15  pa

From the dataframe information above, only 7 columns have no missing or null values. Consequently, we will aggregate and merge the columns that hold similar textual details and drop the unnecessary ones.

Our dataset comprises both categorical and numerical data, with many columns having a large number of missing values. The numerical columns mainly hold binary indicators and date components.
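A quick way to confirm which columns are complete is to count nulls per column. The sketch below uses a small made-up frame (`toy`, with invented values) rather than the real dataset; the same calls should apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the real data.
toy = pd.DataFrame({
    "country": ["Canada", "Canada", "Kenya"],
    "startday": [15.0, np.nan, 3.0],
    "notes": ["text", None, None],
})

missing_counts = toy.isna().sum()            # nulls per column
missing_share = toy.isna().mean()            # fraction of nulls per column
complete_columns = missing_counts[missing_counts == 0].index.tolist()

print(missing_counts)
print(complete_columns)
```

Sorting `missing_share` in descending order gives a convenient shortlist of columns to aggregate or drop.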

In [9]:
df.describe()

Unnamed: 0,id,ccode,year,protest,protestnumber,startday,startmonth,startyear,endday,endmonth,endyear,protesterviolence
count,17145.0,17145.0,17145.0,17145.0,17145.0,15239.0,15239.0,15239.0,15239.0,15239.0,15239.0,15758.0
mean,4380888000.0,437.888189,2006.171654,0.888831,7.406299,15.455935,6.227836,2006.326465,15.580616,6.24352,2006.329221,0.25606
std,2320550000.0,232.054953,8.987378,0.314351,11.854041,8.817037,3.461912,8.958007,8.803944,3.461745,8.959254,0.436469
min,201990000.0,20.0,1990.0,0.0,0.0,1.0,1.0,1990.0,1.0,1.0,1990.0,0.0
25%,2202010000.0,220.0,1998.0,1.0,1.0,8.0,3.0,1999.0,8.0,3.0,1999.0,0.0
50%,4342008000.0,434.0,2007.0,1.0,3.0,15.0,6.0,2007.0,16.0,6.0,2007.0,0.0
75%,6512005000.0,651.0,2014.0,1.0,8.0,23.0,9.0,2014.0,23.0,9.0,2014.0,1.0
max,9102020000.0,910.0,2020.0,1.0,143.0,31.0,12.0,2020.0,31.0,12.0,2020.0,1.0


**Statistical summaries for numerical data**

**count:** Number of non-missing observations for each column.
For example, protesterviolence has 15,758 observations, while startday, startmonth, startyear, endday, endmonth, and endyear have fewer (15,239), indicating some missing data.

**mean:** The average value for each column.
For example, the average year of protest is around 2006, and the average protester violence indicator is 0.256, suggesting that about 25.6% of protests involved violence.

**std (Standard Deviation):** Measures the spread or variability of the data.
For instance, the standard deviation of the year column is about 8.99, indicating that protests are fairly spread out across the years, with some occurring earlier and others later in the dataset.

**min:** The minimum value for each column.
The earliest year recorded is 1990, with the earliest possible day of protest being the 1st of a month.

**25% (1st Quartile):** The value below which 25% of the data fall.
For example, 25% of records have a year of 1998 or earlier, and 25% of protest start days fall on or before the 8th day of the month.

**50% (Median):** The middle value when the data are ordered.
For instance, the median year for a protest is 2007, meaning half of the protests occurred before 2007 and half after.

**75% (3rd Quartile):** The value below which 75% of the data fall.
For example, 75% of records have a year of 2014 or earlier, and 75% of protest start days fall on or before the 23rd day of the month.

**max:** The maximum value for each column.
The latest recorded year of protest is 2020. The maxima of 31 for the day columns and 12 for the month columns are simply the largest valid calendar values, taken independently per column; they do not describe any single protest or its duration.

**Key Insights:**
Temporal Range: 
Protests are recorded from 1990 to 2020, with a mean year around 2006, indicating a broad temporal range of data.

Protester Violence: 
With a mean of 0.256 and a max of 1, about a quarter of protests involved violence.

Data Gaps: 
The count shows that some columns have missing data, particularly in date-related fields (start day, end day, etc.), which should be addressed in further analysis.
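The data gaps can be quantified directly from the `count` row, and the reading of the `protesterviolence` mean as a share follows from it being a 0/1 indicator. The sketch below uses only the counts reported in the summary table; the four-element `flags` list is an invented toy example.

```python
# Counts taken from the describe() output above.
total_rows = 17145
non_null = {"startday": 15239, "protesterviolence": 15758}

missing = {col: total_rows - n for col, n in non_null.items()}
missing_pct = {col: round(100 * m / total_rows, 1) for col, m in missing.items()}
print(missing)       # rows with no value in each column
print(missing_pct)   # the same, as a percentage of all rows

# For a 0/1 indicator, the mean equals the share of ones (toy values).
flags = [0, 0, 1, 0]
share_of_ones = sum(flags) / len(flags)
print(share_of_ones)
```

Note that the 25.6% violence share is computed over the 15,758 rows where the indicator is present, not over all 17,145 rows.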



**Data Cleaning**

In [10]:
# Aggregation of protester demands and state responses.
df['demands'] = df[['protesterdemand1', 'protesterdemand2', 'protesterdemand3', 'protesterdemand4']].apply(lambda x: ', '.join(x.dropna().astype(str)), axis=1)
df['response'] = df[['stateresponse1', 'stateresponse2', 'stateresponse3', 'stateresponse4', 'stateresponse5', 'stateresponse6', 'stateresponse7']].apply(lambda x: ', '.join(x.dropna().astype(str)), axis=1)

This code goes through each row of the selected columns, removes missing values, converts the remaining values to strings, and joins them with a comma and a space into a single string for that row.

Aggregated protester demands and state responses into 'demands' and 'response' columns respectively.
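The row-wise join can be checked on a tiny frame. The values below are invented, and only two of the response columns are included to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two responses, one response, and no response at all.
toy = pd.DataFrame({
    "stateresponse1": ["crowd dispersal", "ignore", np.nan],
    "stateresponse2": ["arrests", np.nan, np.nan],
})

toy["response"] = toy[["stateresponse1", "stateresponse2"]].apply(
    lambda row: ", ".join(row.dropna().astype(str)), axis=1
)
print(toy["response"].tolist())
```

A row with no values at all yields an empty string rather than a null, which is consistent with `demands` and `response` reporting no missing values in the later `info()` output.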

In [11]:
df.head()

Unnamed: 0,id,country,ccode,year,region,protest,protestnumber,startday,startmonth,startyear,...,stateresponse2,stateresponse3,stateresponse4,stateresponse5,stateresponse6,stateresponse7,sources,notes,demands,response
0,201990001,Canada,20,1990,North America,1,1,15.0,1.0,1990.0,...,,,,,,,1. great canadian train journeys into history;...,canada s railway passenger system was finally ...,"political behavior, process, labor wage dispute",ignore
1,201990002,Canada,20,1990,North America,1,2,25.0,6.0,1990.0,...,,,,,,,1. autonomy s cry revived in quebec the new yo...,protestors were only identified as young peopl...,"political behavior, process",ignore
2,201990003,Canada,20,1990,North America,1,3,1.0,7.0,1990.0,...,,,,,,,1. quebec protest after queen calls for unity ...,"the queen, after calling on canadians to remai...","political behavior, process",ignore
3,201990004,Canada,20,1990,North America,1,4,12.0,7.0,1990.0,...,,,,,,,1. indians gather as siege intensifies; armed ...,canada s federal government has agreed to acqu...,land farm issue,accomodation
4,201990005,Canada,20,1990,North America,1,5,14.0,8.0,1990.0,...,arrests,accomodation,,,,,1. dozens hurt in mohawk blockade protest the ...,protests were directed against the state due t...,"political behavior, process","crowd dispersal, arrests, accomodation"


In [12]:
# Function to create a date string from day, month, and year
def create_date_string(row, col_prefix):
    try:
        return f"{int(row[f'{col_prefix}year']):04d}-{int(row[f'{col_prefix}month']):02d}-{int(row[f'{col_prefix}day']):02d}"
    except ValueError:
        return None

# Apply the function to create date strings
df['start_date'] = df.apply(lambda row: create_date_string(row, 'start'), axis=1)
df['end_date'] = df.apply(lambda row: create_date_string(row, 'end'), axis=1)

# Convert the date strings to datetime
df['start_date'] = pd.to_datetime(df['start_date'], errors='coerce')
df['end_date'] = pd.to_datetime(df['end_date'], errors='coerce')

# Calculate protest_duration in days
df['protest_duration'] = (df['end_date'] - df['start_date']).dt.days

# Display the new columns
df[['start_date', 'end_date', 'protest_duration']].head()


Unnamed: 0,start_date,end_date,protest_duration
0,1990-01-15,1990-01-15,0.0
1,1990-06-25,1990-06-25,0.0
2,1990-07-01,1990-07-01,0.0
3,1990-07-12,1990-09-06,56.0
4,1990-08-14,1990-08-15,1.0


We build a date string from the day, month and year columns and convert it into a single datetime column, which simplifies later analysis and manipulation.

Created 'start_date', 'end_date' and 'protest_duration' columns using the date data.
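As a sanity check on the duration logic, the sketch below rebuilds the dates for the fourth preview row (12 July 1990 to 6 September 1990) using only the standard library. The helper `build_date` is a hypothetical stand-in for the notebook's function, not a copy of it.

```python
from datetime import date

def build_date(year, month, day):
    """Return a date, or None when a part is missing or the combination is invalid."""
    if year is None or month is None or day is None:
        return None
    try:
        return date(int(year), int(month), int(day))
    except ValueError:
        return None

start = build_date(1990.0, 7.0, 12.0)
end = build_date(1990.0, 9.0, 6.0)
duration_days = (end - start).days
print(duration_days)

# Missing or impossible dates come back as None instead of raising.
print(build_date(None, 1.0, 1.0), build_date(2020, 2, 30))
```

The result agrees with the `protest_duration` of 56 shown for that row; a protest that starts and ends on the same day has a duration of 0.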

In [13]:
# Converting textual representations of participants to numeric.
# The following code maintains over 96% of the data in 'participants'.

def parse_texts(x):
    """
    Parses specific textual representations of participant counts into numeric values.
    Handles predefined text patterns like 'dozens', 'hundreds', etc.
    """
    x = x.lower()

    text_mapping = {
        "dozens": 50,
        "hundreds": 500,
        "thousands": 5000,
        "tens of thousands": 50000,
        "hundreds of thousands": 250000,
        "millions": 2000000,
        "million": 1000000,
        "a group": 10,
        "busloads": 50,
        "widespread": 500,
        "scores": 50,
        "a few dozen": 36,
        "a few hundred": 300,
        "a few thousand": 3000,
        "several 1000s": 5000,
        "few thousand": 3000,
        "few dozen": 24,
    }

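    # Check longer phrases first so that, e.g., "tens of thousands" is not
    # matched by the shorter key "thousands".
    for key in sorted(text_mapping, key=len, reverse=True):
        if key in x:
            return text_mapping[key]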

    if "about " in x:
        match = re.search(r'\d+', x)
        if match:
            return int(match.group())
    if "more than " in x:
        match = re.search(r'\d+', x)
        if match:
            return int(match.group())

    if "several" in x:
        if "dozen" in x:
            return 50
        elif "hundred" in x:
            return 500
        elif "thousand" in x:
            return 5000

    return x

def strip_chars(x):
    """
    Removes unwanted characters from the string and converts to integer if possible.
    Specifically handles values ending in 's' by multiplying the preceding number by 5.
    """
    banned_chars = "+><,"
    x = "".join([c for c in x if c not in banned_chars])

    if x.endswith('s') and x[:-1].isdigit():
        return int(x[:-1]) * 5

    try:
        return int(x)
    except ValueError:
        return x

def avg_hyphen(x):
    """
    Calculates the average for values specified as a range (e.g., '100-200').
    """
    accepted_chars = "1234567890-"
    x = "".join([c for c in x if c in accepted_chars])

    if "-" in x:
        lower, upper = x.split("-")
        if lower.isdigit() and upper.isdigit():
            return (int(lower) + int(upper)) // 2

    return np.nan

def map_participants(x):
    """
    Sequentially applies parsing, stripping, and averaging to convert text representations
    of participant counts into numeric values.
    """
    while isinstance(x, str):
        x = parse_texts(x)
        if isinstance(x, str):
            x = strip_chars(x)
        if isinstance(x, str):
            x = avg_hyphen(x)
        if isinstance(x, str):
            x = np.nan
    return x

* Converted the participants textual representations into numerical format.
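One subtlety with substring lookups of the kind used in `parse_texts` is that the order of checks matters: a short key such as "thousands" is also contained in "tens of thousands". The sketch below illustrates this with a hypothetical two-entry mapping; `first_match` and `longest_match` are illustrative names, not functions from the notebook.

```python
# Hypothetical mapping, used only to illustrate lookup order.
size_words = {"thousands": 5000, "tens of thousands": 50000}

def first_match(text, mapping):
    """Return the value of the first key (in insertion order) found in text."""
    for key, value in mapping.items():
        if key in text:
            return value
    return None

def longest_match(text, mapping):
    """Return the value of the longest key found in text."""
    for key in sorted(mapping, key=len, reverse=True):
        if key in text:
            return mapping[key]
    return None

phrase = "tens of thousands"
naive = first_match(phrase, size_words)       # matches the shorter key first
specific = longest_match(phrase, size_words)  # prefers the more specific key
print(naive, specific)
```

Checking the longest keys first is one simple way to make sure the most specific phrase wins; ordering the dictionary by hand would work as well.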

In [14]:
# Converting 'participants' to usable values.
df['participants_numeric'] = df["participants"].map(map_participants)
df[['participants', 'participants_numeric']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17145 entries, 0 to 17144
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   participants          15746 non-null  object 
 1   participants_numeric  15137 non-null  float64
dtypes: float64(1), object(1)
memory usage: 268.0+ KB


In [15]:
df.columns

Index(['id', 'country', 'ccode', 'year', 'region', 'protest', 'protestnumber',
       'startday', 'startmonth', 'startyear', 'endday', 'endmonth', 'endyear',
       'protesterviolence', 'location', 'participants_category',
       'participants', 'protesteridentity', 'protesterdemand1',
       'protesterdemand2', 'protesterdemand3', 'protesterdemand4',
       'stateresponse1', 'stateresponse2', 'stateresponse3', 'stateresponse4',
       'stateresponse5', 'stateresponse6', 'stateresponse7', 'sources',
       'notes', 'demands', 'response', 'start_date', 'end_date',
       'protest_duration', 'participants_numeric'],
      dtype='object')

In [16]:
# Columns to drop
time_drops = [
    'startday', 'startmonth', 'startyear', 'endday', 'endmonth', 'endyear'
]  #Redundant as we have 'start_date'&'end_date'
other_drops = [
    'id',  #Not useful to prediction.
    'ccode',  #Not useful to prediction as we already have country.
    'protest',  #Binary column with '0' values resulting in empty rows
    'protestnumber',  #No. of protests per year.(incrementing per protest per year)
    'location',  #Not extremely useable given how it's already being broken by region.
    'participants_category',  #Too many null values to be of great value. The data is also captured in 'participants_numeric'
    'participants',  #'participants_numeric' has the numeric values of this column.
]
demand_drops = [
    'protesterdemand1', 'protesterdemand2', 'protesterdemand3',
    'protesterdemand4'
]  #Full of null values as individual columns. Aggregated in demands column.
response_drops = [
    'stateresponse1', 'stateresponse2', 'stateresponse3', 'stateresponse4',
    'stateresponse5', 'stateresponse6', 'stateresponse7'
]  #Full of null values as individual columns. Aggregated in response column.

columns_to_drop = time_drops + other_drops + demand_drops + response_drops
df_cleaned = df.drop(columns=columns_to_drop)

We drop columns that are unnecessary or not useful for our analysis, as well as those with a high percentage of missing values.

The individual protester demand and state response columns are also dropped, since they are now redundant after being aggregated into the new `demands` and `response` columns.

In [17]:
# Checking the column drops
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17145 entries, 0 to 17144
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   country               17145 non-null  object        
 1   year                  17145 non-null  int64         
 2   region                17145 non-null  object        
 3   protesterviolence     15758 non-null  float64       
 4   protesteridentity     14684 non-null  object        
 5   sources               15235 non-null  object        
 6   notes                 15193 non-null  object        
 7   demands               17145 non-null  object        
 8   response              17145 non-null  object        
 9   start_date            15239 non-null  datetime64[ns]
 10  end_date              15239 non-null  datetime64[ns]
 11  protest_duration      15239 non-null  float64       
 12  participants_numeric  15137 non-null  float64       
dtypes: datetime64[ns

In [19]:
# Handling null values.
df_cleaned.fillna(value={"protesteridentity":"unspecified"}, inplace=True)
col_with_null = [
    'protesterviolence', 
    'sources', 
    'notes', 
    'start_date', 
    'end_date', 
    'protest_duration',
    'participants_numeric',
]

# Dropping rows with a null value in any of these columns
df_cleaned.dropna(subset=col_with_null, inplace=True)

We handled null values in the retained columns by imputing where possible and dropping rows where imputing wasn't possible.
Missing `protesteridentity` values were filled with "unspecified". Rows with nulls in the remaining columns were dropped, because those columns are crucial for text analysis, temporal analysis and understanding the protester counts.
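The impute-then-drop pattern can be illustrated on a small invented frame. The column names mirror the notebook's, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in different columns.
toy = pd.DataFrame({
    "protesteridentity": ["students", None, "farmers", None],
    "notes": ["a", "b", None, "d"],
    "protest_duration": [0.0, 2.0, 1.0, np.nan],
})

# Impute the identity column, then drop rows that still have gaps elsewhere.
toy = toy.fillna({"protesteridentity": "unspecified"})
toy = toy.dropna(subset=["notes", "protest_duration"])
print(toy)
```

Only the first two rows survive: the second keeps its place thanks to the imputed identity, while the last two are removed because `notes` or `protest_duration` is missing.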

In [20]:
# Checking if the null values have been dropped.
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15087 entries, 0 to 17141
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   country               15087 non-null  object        
 1   year                  15087 non-null  int64         
 2   region                15087 non-null  object        
 3   protesterviolence     15087 non-null  float64       
 4   protesteridentity     15087 non-null  object        
 5   sources               15087 non-null  object        
 6   notes                 15087 non-null  object        
 7   demands               15087 non-null  object        
 8   response              15087 non-null  object        
 9   start_date            15087 non-null  datetime64[ns]
 10  end_date              15087 non-null  datetime64[ns]
 11  protest_duration      15087 non-null  float64       
 12  participants_numeric  15087 non-null  float64       
dtypes: datetime64[ns

* Dropped columns based on varying criteria explained in the code.
* Handled missing values by imputing and dropping where imputing wasn't possible.

***Miscellaneous Cleaning***

In [21]:
# Ensure columns have consistent data types
expected_types = {
    'country': 'object',
    'year': 'int64',
    'region': 'object',
    'protesterviolence': 'int64',
    'protesteridentity': 'object',
    'demands': 'object',
    'response': 'object',
    'start_date': 'datetime64[ns]',
    'end_date': 'datetime64[ns]',
    'protest_duration': 'int64',
    'participants_numeric': 'int64',
    'sources': 'object',
    'notes': 'object'
}

# Cast each column to its expected dtype
for column, dtype in expected_types.items():
    if dtype == 'datetime64[ns]':
        df_cleaned[column] = pd.to_datetime(df_cleaned[column], errors='coerce')
    else:
        df_cleaned[column] = df_cleaned[column].astype(dtype, errors='ignore')

* Ensured all our columns had consistent data types.
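The integer casts succeed here because the null rows were dropped beforehand. The sketch below, on invented values, shows both cases: a clean float column casts to `int64`, while a column that still contains a NaN cannot be cast. With `errors='ignore'`, as used in the cell above, such a failed cast would return the column unchanged instead of raising.

```python
import pandas as pd

# Clean float column (toy values): casting to int64 works.
toy = pd.DataFrame({"protest_duration": [0.0, 56.0, 1.0]})
toy["protest_duration"] = toy["protest_duration"].astype("int64")
print(toy["protest_duration"].dtype)

# A column that still holds a NaN cannot be represented as int64.
with_gap = pd.Series([1.0, float("nan")])
try:
    with_gap.astype("int64")
    cast_ok = True
except ValueError:
    cast_ok = False
print(cast_ok)
```

Checking `df_cleaned.dtypes` after the loop is a cheap way to confirm that no cast was silently skipped.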

In [22]:
# Check for duplicates and remove them if any
df_cleaned = df_cleaned.drop_duplicates()

In [23]:
df_cleaned.shape

(15076, 13)


* Dropped 11 duplicated rows. The clean dataset has 15076 rows and 13 columns




In [24]:
# Rearrange the columns
columns_order = [
    'region', 'country', 'year', 'start_date', 'end_date', 'protest_duration',
    'participants_numeric', 'protesterviolence', 'protesteridentity',
    'demands', 'response', 'sources', 'notes'
]
df_cleaned = df_cleaned[columns_order]


* At the end of the cleaning process we've maintained 88% of the original data resulting in a shape of (15076, 13).
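The retention figure follows directly from the shapes reported above; the short calculation below only restates those numbers.

```python
# Row counts reported by df.shape / df_cleaned.info() in the cells above.
rows_original = 17145
rows_after_dropna = 15087
rows_final = 15076

rows_lost_to_nulls = rows_original - rows_after_dropna
duplicates_removed = rows_after_dropna - rows_final
retained_pct = round(100 * rows_final / rows_original, 1)

print(rows_lost_to_nulls, duplicates_removed, retained_pct)
```

So the roughly 88% quoted above is 87.9% before rounding, with almost all of the loss coming from rows with missing values rather than duplicates.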

**Data Preparation and Processing**

In [25]:
# Split the 'demands' and 'response' columns into multiple columns and apply one-hot encoding
demands_split = df_cleaned['demands'].str.get_dummies(sep=', ')
response_split = df_cleaned['response'].str.get_dummies(sep=', ')

# Add a prefix to avoid column name clashes
demands_split = demands_split.add_prefix('demand_')
response_split = response_split.add_prefix('response_')

# Concatenate the original DataFrame with the new one-hot encoded columns
df_cleaned = pd.concat([df_cleaned, demands_split, response_split], axis=1)

# Drop the original 'demands' and 'response' columns
df_cleaned = df_cleaned.drop(columns=['demands', 'response'])


**One-hot encoding**

We split the 'demands' and 'response' columns into individual indicator columns using one-hot encoding, which converts these categorical variables into a numerical format that algorithms can use.
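A small sketch of what `str.get_dummies` does with a `', '` separator, using invented response strings. The second part illustrates a side effect: if a single category label itself contains the separator, it is split into two indicator columns. It assumes that "political behavior, process", as seen in the `demands` preview, is one such label, which would explain why the resulting columns include both `demand_political behavior` and `demand_process`.

```python
import pandas as pd

# Toy aggregated responses (invented values).
toy = pd.Series(["ignore", "crowd dispersal, arrests", "arrests"])
dummies = toy.str.get_dummies(sep=", ").add_prefix("response_")
print(dummies)

# A label that contains the separator is split into two columns.
demand = pd.Series(["political behavior, process"])
split_cols = demand.str.get_dummies(sep=", ").columns.tolist()
print(split_cols)
```

If that assumption holds, the two resulting columns always carry the same value; joining with a separator that never occurs inside a label (for example `'; '`) would keep such a category in one column.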

In [26]:
df_cleaned.head()

Unnamed: 0,region,country,year,start_date,end_date,protest_duration,participants_numeric,protesterviolence,protesteridentity,sources,...,demand_social restrictions,demand_tax policy,response_.,response_accomodation,response_arrests,response_beatings,response_crowd dispersal,response_ignore,response_killings,response_shootings
0,North America,Canada,1990,1990-01-15,1990-01-15,0,5000,0,unspecified,1. great canadian train journeys into history;...,...,0,0,0,0,0,0,0,1,0,0
1,North America,Canada,1990,1990-06-25,1990-06-25,0,1000,0,unspecified,1. autonomy s cry revived in quebec the new yo...,...,0,0,0,0,0,0,0,1,0,0
2,North America,Canada,1990,1990-07-01,1990-07-01,0,500,0,separatist parti quebecois,1. quebec protest after queen calls for unity ...,...,0,0,0,0,0,0,0,1,0,0
3,North America,Canada,1990,1990-07-12,1990-09-06,56,500,1,mohawk indians,1. indians gather as siege intensifies; armed ...,...,0,0,0,1,0,0,0,0,0,0
4,North America,Canada,1990,1990-08-14,1990-08-15,1,950,1,local residents,1. dozens hurt in mohawk blockade protest the ...,...,0,0,0,1,1,0,1,0,0,0


In [27]:
# Check the columns
df_cleaned.columns

Index(['region', 'country', 'year', 'start_date', 'end_date',
       'protest_duration', 'participants_numeric', 'protesterviolence',
       'protesteridentity', 'sources', 'notes', 'demand_.',
       'demand_labor wage dispute', 'demand_land farm issue',
       'demand_police brutality', 'demand_political behavior',
       'demand_price increases', 'demand_process',
       'demand_removal of politician', 'demand_social restrictions',
       'demand_tax policy', 'response_.', 'response_accomodation',
       'response_arrests', 'response_beatings', 'response_crowd dispersal',
       'response_ignore', 'response_killings', 'response_shootings'],
      dtype='object')

In [28]:
# Drop placeholder columns
df_cleaned = df_cleaned.drop(columns=['demand_.', 'response_.']) # The '.' were actual inputs in the data.(Not useful)

# Column order
col_order = [
    'region', 'country', 'year', 'start_date', 'end_date', 'protest_duration',
    'participants_numeric', 'protesterviolence', 'protesteridentity',
    'demand_labor wage dispute', 'demand_land farm issue',
    'demand_police brutality', 'demand_political behavior',
    'demand_price increases', 'demand_process', 'demand_removal of politician',
    'demand_social restrictions', 'demand_tax policy', 'response_accomodation',
    'response_arrests', 'response_beatings', 'response_crowd dispersal',
    'response_ignore', 'response_killings', 'response_shootings', 'sources',
    'notes'
]
df_cleaned = df_cleaned[col_order]

In [29]:
# Reset the index of the DataFrame
df_cleaned = df_cleaned.reset_index(drop=True)

In [33]:
# Save the cleaned and dummified data to a new CSV in the current working directory
save_path = "mass_mobilization_cleaned.csv"
df_cleaned.to_csv(save_path, index=False)

print(f"File saved to {save_path}")

File saved to mass_mobilization_cleaned.csv



* Reset the index of the cleaned dataframe.
* Saved the cleaned data as a CSV file in the working directory.
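One thing to keep in mind when the saved file is loaded again is that CSV does not store dtypes, so datetime columns come back as plain text unless they are parsed explicitly. The sketch below demonstrates this with an in-memory buffer and two invented rows instead of the real file.

```python
import io
import pandas as pd

# Toy frame with a datetime column (values for illustration only).
toy = pd.DataFrame({
    "start_date": pd.to_datetime(["1990-01-15", "1990-07-12"]),
    "protest_duration": [0, 56],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                                # dates read as text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["start_date"])   # dates restored

print(plain["start_date"].dtype, parsed["start_date"].dtype)
```

For the cleaned file, passing `parse_dates=['start_date', 'end_date']` to `pd.read_csv` should restore both date columns in the same way.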