# Exploratory Data Analysis (EDA) on Israel-Palestine Conflict (2000-2021)

![EDA on Israel-Palestine Conflict](image.jpg)

## Overview

This notebook explores the **human cost of the Israel-Palestine conflict** between 2000 and 2021. Using the dataset sourced from [Kaggle](https://www.kaggle.com/datasets/zusmani/palestine-body-count), we perform an in-depth Exploratory Data Analysis (EDA) to uncover key trends, patterns, and insights related to fatalities and injuries on both sides.

### Dataset Highlights
- Covers data from **2000 to 2021**.
- Includes injuries and fatalities for **Palestine** (111,475 injured, 10,000 killed) and **Israel** (5,160 injured, 1,275 killed).
- Documents the **impact of the conflict**, providing a stark reminder of the human toll over two decades.

### Key Objectives
- Visualize trends in casualties and injuries over time.
- Analyze the statistical value of life and **economic impact** (e.g., lost work hours due to fatalities and injuries).
- Highlight disparities and patterns between the two sides of the conflict.

## Motivation
This project aims to use data science to foster a deeper understanding of the **humanitarian impact of the Israel-Palestine conflict**, promoting dialogue and awareness. The dataset serves as a resource for further analysis and actionable insights.

## Source
- Dataset: [Israel-Palestine Conflict Data (2000-2021)](https://www.kaggle.com/datasets/zusmani/palestine-body-count)

---

Let’s dive into the analysis to explore the **stories behind the data**.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the data
df = pd.read_csv("Palestine Body Count.csv")
df.head()

Unnamed: 0,Year,Month,Palestinians Injuries,Israelis Injuries,Palestinians Killed,Israelis Killed
0,2000.0,DECEMBER,781.0,,51,8
1,2000.0,NOVEMBER,3838.0,,112,22
2,2000.0,OCTOBER,5984.0,,104,10
3,2000.0,SEPTEMBER,,,16,1
4,2001.0,DECEMBER,304.0,,67,36


- check data description

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Year                   249 non-null    float64
 1   Month                  249 non-null    object 
 2   Palestinians Injuries  196 non-null    object 
 3   Israelis Injuries      133 non-null    object 
 4   Palestinians Killed    250 non-null    object 
 5   Israelis Killed        250 non-null    object 
dtypes: float64(1), object(5)
memory usage: 11.9+ KB


#### Let's check each column one by one to see if there's any need of **Data Cleaning**, then we'll proceed towards visualization.

- 1. Column (`Year`)

In [4]:
df['Year']

0      2000.0
1      2000.0
2      2000.0
3      2000.0
4      2001.0
        ...  
246    2021.0
247    2021.0
248    2021.0
249       NaN
250       NaN
Name: Year, Length: 251, dtype: float64

In [5]:
df['Year'].value_counts()

2011.0    12
2020.0    12
2018.0    12
2017.0    12
2016.0    12
2015.0    12
2014.0    12
2013.0    12
2012.0    12
2001.0    12
2010.0    12
2009.0    12
2008.0    12
2007.0    12
2006.0    12
2005.0    12
2004.0    12
2003.0    12
2002.0    12
2019.0    12
2021.0     5
2000.0     4
Name: Year, dtype: int64

- **Insight**: `Year 2000` has only data for 4 months, remaining Year's have data of all months.

- Year column is in _float64_, it is better to be in _int_

In [6]:
# Let's first check missing rows in Year column
df[df['Year'].isnull()]

Unnamed: 0,Year,Month,Palestinians Injuries,Israelis Injuries,Palestinians Killed,Israelis Killed
249,,,,,,
250,,,111475.0,5160.0,10000.0,1275.0


- we can see that rows:249 is completelty empty, so we can remove it.

In [7]:
df.drop(249, axis=0, inplace=True)

In [8]:
df.tail(10)

Unnamed: 0,Year,Month,Palestinians Injuries,Israelis Injuries,Palestinians Killed,Israelis Killed
240,2020.0,APRIL,,,1,0
241,2020.0,MARCH,,,2,0
242,2020.0,FEBRUARY,,,9,0
243,2020.0,JANUARY,,,6,0
244,2021.0,JANUARY,,,4,0
245,2021.0,FEBRUARY,,,1,0
246,2021.0,MARCH,,,4,0
247,2021.0,APRIL,,,1,0
248,2021.0,MAY,,,26,3
250,,,111475.0,5160.0,10000,1275


- we can clealy see that data for row:250 will be:
    - 250: 2021.0: JULY       (according to sequence and excluding the row:249)

In [9]:
df.loc[250, ['Year', 'Month']] = [2021.0, 'JULY']

In [10]:
df.tail(5)

Unnamed: 0,Year,Month,Palestinians Injuries,Israelis Injuries,Palestinians Killed,Israelis Killed
245,2021.0,FEBRUARY,,,1,0
246,2021.0,MARCH,,,4,0
247,2021.0,APRIL,,,1,0
248,2021.0,MAY,,,26,3
250,2021.0,JULY,111475.0,5160.0,10000,1275


In [11]:
df.loc[250, ['Year', 'Month']] = [2021.0, 'JULY']
df.tail(5)

Unnamed: 0,Year,Month,Palestinians Injuries,Israelis Injuries,Palestinians Killed,Israelis Killed
245,2021.0,FEBRUARY,,,1,0
246,2021.0,MARCH,,,4,0
247,2021.0,APRIL,,,1,0
248,2021.0,MAY,,,26,3
250,2021.0,JULY,111475.0,5160.0,10000,1275


In [12]:
# reseting the index
df.reset_index(drop=True, inplace=True)

In [13]:
df.tail(5)

Unnamed: 0,Year,Month,Palestinians Injuries,Israelis Injuries,Palestinians Killed,Israelis Killed
245,2021.0,FEBRUARY,,,1,0
246,2021.0,MARCH,,,4,0
247,2021.0,APRIL,,,1,0
248,2021.0,MAY,,,26,3
249,2021.0,JULY,111475.0,5160.0,10000,1275


In [14]:
# check for missing values
df.isnull().sum()

Year                       0
Month                      0
Palestinians Injuries     54
Israelis Injuries        117
Palestinians Killed        0
Israelis Killed            0
dtype: int64

-----

Now we have to deal with `Palestinians Injuries` and `Israelis Injuries` columns\	

- Let's first check the datatype and anomalies in both columns

In [15]:
df.dtypes

Year                     float64
Month                     object
Palestinians Injuries     object
Israelis Injuries         object
Palestinians Killed       object
Israelis Killed           object
dtype: object

In [16]:
df['Palestinians Injuries'].unique()

array(['781', '3838', '5984', nan, '304', '160', '407', '657', '502',
       '394', '319', '932', '715', '927', '598', '471', '185', '264',
       '388', '353', '186', '374', '299', '181', '523', '870', '429',
       '330', '322', '106', '289', '226', '191', '34', '367', '239',
       '303', '379', '244', '292', '161', '98', '343', '579', '287',
       '251', '377', '545', '346', '417', '437', '168', '166', '99', '81',
       '90', '130', '165', '116', '183', '68', '42', '164', '73', '491',
       '180', '196', '266', '799', '198', '257', '254', '203', '194',
       '88', '13720', '127', '152', '135', '154', '67', '162', '281',
       '115', '153', '256', '104', '195', '30', '80', '36', '26', '89',
       '82', '136', '87', '65', '5557', '105', '147', '148', '46', '85',
       '118', '119', '402', '92', '86', '210', '63', '110', '149', '156',
       '159', '397', '193', '204', '143', '2252', '169', '202', '231',
       '703', '404', '60', '320', '137', '138', '124', '656', '493',
     

In [17]:
# accessing ambiguous rows
df[~df['Palestinians Injuries'].apply(lambda x: str(x).replace('.', '', 1).isdigit()) & df['Palestinians Injuries'].notnull()]

Unnamed: 0,Year,Month,Palestinians Injuries,Israelis Injuries,Palestinians Killed,Israelis Killed
165,2014.0,JULY,(incl. Aug),(incl. Aug),1590,59
203,2017.0,MAY & JUNE,(incl. Jun),(incl. Jun),6,0
249,2021.0,JULY,111475,5160,10000,1275


- There are 3 such rows:
  - Let's first remove `,` from row:249 

In [18]:
df.loc[249] = df.loc[249].replace(',', '', regex=True)

- to deal with rows:165, 203    we should analyze them separately in order to replace and fill correct data.

- dealing row:165

In [19]:
df.loc[163:167]

Unnamed: 0,Year,Month,Palestinians Injuries,Israelis Injuries,Palestinians Killed,Israelis Killed
163,2014.0,SEPTEMBER,209,22,20,0
164,2014.0,AUGUST,13735,2347,614,9
165,2014.0,JULY,(incl. Aug),(incl. Aug),1590,59
166,2014.0,JUNE,326,5,10,3
167,2014.0,MAY,265,28,3,1


In [20]:
df.loc[200:205]

Unnamed: 0,Year,Month,Palestinians Injuries,Israelis Injuries,Palestinians Killed,Israelis Killed
200,2017.0,AUGUST,135,2,2,0
201,2017.0,JULY,1772,23,20,5
202,2017.0,JUNE,790,20,8,1
203,2017.0,MAY & JUNE,(incl. Jun),(incl. Jun),6,0
204,2017.0,APRIL,150,12,5,1
205,2017.0,MARCH,189,19,7,0


- Both Rows:165, 203 have ambiguous data, and we don't see any linear patter in them
- so it's preferable to empty them and later fill them.

In [21]:
df.loc[165, ['Palestinians Injuries', 'Israelis Injuries']] = None
df.loc[203, ['Palestinians Injuries', 'Israelis Injuries']] = None

- Now let's check again for any non-numeric data in both cols

In [22]:
df[~df['Palestinians Injuries'].apply(lambda x: str(x).replace('.', '', 1).isdigit()) & df['Palestinians Injuries'].notnull()]

Unnamed: 0,Year,Month,Palestinians Injuries,Israelis Injuries,Palestinians Killed,Israelis Killed


- WE are good to go now.....
- -----

In [23]:
df.isnull().sum()

Year                       0
Month                      0
Palestinians Injuries     56
Israelis Injuries        119
Palestinians Killed        0
Israelis Killed            0
dtype: int64

Now instead of filling with _traditional method_ of mean, median we'll use _advance_ approach i.e; **IterativeImputer**

## What is IterativeImputer?

**IterativeImputer** is a sophisticated method for imputing missing values in datasets. Unlike simple imputation methods such as mean or median imputation, **IterativeImputer** uses a machine learning model to predict the missing values. It works iteratively by treating each feature with missing values as a target, and other features as predictors. This method creates more accurate imputations by considering the relationships between features.

In our case, we will use the **RandomForestRegressor** as the estimator, which is a powerful machine learning algorithm capable of learning complex relationships between features.

---

In [24]:
# Import Necessary Libraries
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

Imputing Missing Values in Both Columns  `Palestinians Injuries` and `Israelis Injuries`

- we will use the IterativeImputer with a RandomForestRegressor to impute missing values for both 'Palestinians Injuries' and 'Israelis Injuries'. We will treat the imputation as an iterative process, allowing the model to predict the missing values based on the other available features in the dataset.


In [25]:
# Initializing IterativeImputer with RandomForestRegressor
iterative_imputer = IterativeImputer(estimator=RandomForestRegressor(random_state=42), max_iter=10, random_state=42)

In [26]:
# Imputing missing values for 'Palestinians Injuries' and 'Israelis Injuries'
df[['Palestinians Injuries', 'Israelis Injuries']] = iterative_imputer.fit_transform(df[['Palestinians Injuries', 'Israelis Injuries']])

In [27]:
# Checking the missing values again to ensure they've been filled
df[['Palestinians Injuries', 'Israelis Injuries']].isnull().sum()

Palestinians Injuries    0
Israelis Injuries        0
dtype: int64

In [28]:
df.isnull().sum()

Year                     0
Month                    0
Palestinians Injuries    0
Israelis Injuries        0
Palestinians Killed      0
Israelis Killed          0
dtype: int64

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Year                   250 non-null    float64
 1   Month                  250 non-null    object 
 2   Palestinians Injuries  250 non-null    float64
 3   Israelis Injuries      250 non-null    float64
 4   Palestinians Killed    250 non-null    object 
 5   Israelis Killed        250 non-null    object 
dtypes: float64(3), object(3)
memory usage: 11.8+ KB


- Final Step: Conversion of datatypes....
    - Year to int
    - Palestinians Injuries to int 
    - Israelis Injuries  to int
    - Palestinians Killed to int 
    - Israelis Killed  to int

In [30]:
df['Year'] = df['Year'].astype(int)
df['Palestinians Injuries'] = df['Palestinians Injuries'].astype(int)
df['Israelis Injuries'] = df['Israelis Injuries'].astype(int)
df['Palestinians Killed'] = df['Palestinians Killed'].astype(int)
df['Israelis Killed'] = df['Israelis Killed'].astype(int)

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Year                   250 non-null    int32 
 1   Month                  250 non-null    object
 2   Palestinians Injuries  250 non-null    int32 
 3   Israelis Injuries      250 non-null    int32 
 4   Palestinians Killed    250 non-null    int32 
 5   Israelis Killed        250 non-null    int32 
dtypes: int32(5), object(1)
memory usage: 7.0+ KB


## Data Cleaning Complete!

Hooray! 🎉 We have successfully imputed the missing values in the **'Palestinians Injuries'** and **'Israelis Injuries'** columns using the **IterativeImputer** method with a **RandomForestRegressor** estimator. 

This advanced imputation technique ensures that the missing values are filled based on the relationships between other columns, giving us a much more accurate representation of the data compared to simple mean or median imputation.

----

## Next Steps: Visualization 💹

Now that we have successfully cleaned the data, it's time to move towards **visualizing**.

Let's dive into the visualizations! 📊

In [32]:
# importing Plotly interactive library for visualization
import plotly.express as px

In [33]:
df.sample(10)

Unnamed: 0,Year,Month,Palestinians Injuries,Israelis Injuries,Palestinians Killed,Israelis Killed
4,2001,DECEMBER,304,17,67,36
96,2008,APRIL,302,18,72,8
30,2003,OCTOBER,289,26,58,27
160,2014,DECEMBER,350,15,5,0
97,2008,MARCH,302,18,114,12
22,2002,JUNE,299,18,57,57
119,2010,MAY,118,17,2,0
219,2018,JANUARY,302,18,7,1
43,2004,SEPTEMBER,579,14,112,11
242,2020,FEBRUARY,302,18,9,0


## Interactive Time Series Plot: Injuries and Deaths Over the Years

In this plot, we will visualize the trends in **Palestinians Injuries**, **Israelis Injuries**, **Palestinians Killed**, and **Israelis Killed** over the years. This line plot will allow us to examine how the counts change over time, making it easy to spot significant events and trends.

_Hover over the lines for more detailed information about each data point._

In [34]:
# Creating a time series plot with logarithmic scaling
fig = px.line(df, x='Year', y=['Palestinians Injuries', 'Israelis Injuries', 'Palestinians Killed', 'Israelis Killed'],
              title="Injuries and Deaths Over the Years (Log Scale)",
              labels={'value': 'Count', 'variable': 'Category'},
              markers=True)

# Update y-axis to be logarithmic scale
fig.update_layout(
    yaxis_type="log",
    title='Injuries and Deaths Over the Years (Log Scale)',
    xaxis_title='Year',
    yaxis_title='Log Count',
    template='plotly_dark',
    hovermode='x unified',
    legend_title="Category"
)

# Show the plot
fig.show()

## Interactive Monthly Breakdown of Injuries and Deaths

This bar chart represents the **monthly distribution** of **Palestinians Injuries**, **Israelis Injuries**, **Palestinians Killed**, and **Israelis Killed**. The visualization will allow you to see which months were more severe in terms of injuries and deaths.

Hovering over the bars will display the exact counts for each month.

In [35]:
# Creating an interactive bar plot for injuries and deaths by month
fig = px.bar(df, x='Month', y=['Palestinians Injuries', 'Israelis Injuries', 'Palestinians Killed', 'Israelis Killed'],
             title="Monthly Breakdown of Injuries and Deaths",
             labels={'value': 'Count', 'variable': 'Category'},
             barmode='group', color='variable')

# Add some styling
fig.update_traces(marker=dict(line=dict(width=0)), opacity=0.7)
fig.update_layout(
    title="Monthly Breakdown of Injuries and Deaths",
    xaxis_title='Month',
    yaxis_title='Count',
    template='plotly_dark',
    hovermode='x unified',
    legend_title="Category"
)

# Show the plot
fig.show()

## Heatmap of Injuries and Deaths Across Year and Month

This heatmap visualizes the **intensity** of **Palestinians Injuries**, **Israelis Injuries**, **Palestinians Killed**, and **Israelis Killed** over **year** and **month**. The heatmap will help us see the distribution of these events, making it easier to identify when and how often significant events occurred.

Lighter colors represent higher counts, while darker colors represent lower counts.


In [36]:
# Grouping the data by 'Year' and 'Month' for better aggregation
data_month_year = df.groupby(['Year', 'Month'])[['Palestinians Injuries', 'Israelis Injuries', 'Palestinians Killed', 'Israelis Killed']].sum().reset_index()

# Normalize data (log scaling or by dividing by the maximum value)
normalized_data = data_month_year[['Palestinians Injuries', 'Israelis Injuries', 'Palestinians Killed', 'Israelis Killed']].div(
    data_month_year[['Palestinians Injuries', 'Israelis Injuries', 'Palestinians Killed', 'Israelis Killed']].max(), axis=1)

# Heatmap for Palestinians Injuries
fig = px.density_heatmap(data_month_year, x='Month', y='Year', z=normalized_data['Palestinians Injuries'],
                         color_continuous_scale='Viridis',
                         title="Normalized Heatmap of Palestinians Injuries by Year and Month",
                         labels={'Palestinians Injuries': 'Injuries Count'})

# Show the plot
fig.show()

# Similarly for Israelis Injuries
fig = px.density_heatmap(data_month_year, x='Month', y='Year', z=normalized_data['Israelis Injuries'],
                         color_continuous_scale='Viridis',
                         title="Normalized Heatmap of Israelis Injuries by Year and Month",
                         labels={'Israelis Injuries': 'Injuries Count'})

# Show the plot
fig.show()

## Stacked Bar Chart: Injuries and Deaths Comparison by Year

This **stacked bar chart** will provide a comparison of **Palestinians Injuries**, **Israelis Injuries**, **Palestinians Killed**, and **Israelis Killed** for each **year**. It shows how the different categories are distributed in each year, making it easier to compare the overall scale of injuries and deaths year over year.

Hovering over the bars will give you detailed counts for each category.

In [37]:
# Creating a stacked bar chart with logarithmic scaling
fig = px.bar(df, x='Year', y=['Palestinians Injuries', 'Israelis Injuries', 'Palestinians Killed', 'Israelis Killed'],
             title="Stacked Bar Chart of Injuries and Deaths by Year (Log Scale)",
             labels={'value': 'Count', 'variable': 'Category'},
             barmode='stack')

# Update y-axis to be logarithmic scale
fig.update_layout(
    yaxis_type="log",
    title="Stacked Bar Chart of Injuries and Deaths by Year (Log Scale)",
    xaxis_title='Year',
    yaxis_title='Log Count',
    template='plotly_dark',
    hovermode='x unified',
    legend_title="Category"
)

# Show the plot
fig.show()

## Interactive Correlation Heatmap

This **correlation heatmap** provides an overview of how the various variables—**Palestinians Injuries**, **Israelis Injuries**, **Palestinians Killed**, and **Israelis Killed**—are correlated with one another. It allows us to quickly identify strong positive or negative relationships between the variables.

A darker shade indicates a stronger correlation, while lighter shades represent weaker correlations.


In [38]:
# Correlation matrix heatmap
corr_matrix = df[['Palestinians Injuries', 'Israelis Injuries', 'Palestinians Killed', 'Israelis Killed']].corr()

# Creating the heatmap
fig = px.imshow(corr_matrix, text_auto=True, color_continuous_scale='Blues', 
                title="Correlation Heatmap of Injuries and Deaths")

# Show the plot
fig.show()


## 🎉 Conclusion 🎉

In this notebook, we embarked on a comprehensive exploration of the dataset containing vital information on **Palestinian** and **Israeli injuries and deaths** over time. Through a series of robust **data cleaning** and **imputation** techniques, we successfully filled the missing values in the **Palestinians Injuries** and **Israelis Injuries** columns using an advanced **IterativeImputer with RandomForestRegressor**. This approach allowed us to make accurate predictions and ensure the dataset is ready for deeper analysis. 🔍✅

We then created a series of **interactive and visually engaging plots** using **Plotly**, including:
- **Time Series Visualizations** 📈: Showcasing trends in injuries and deaths over the years.
- **Stacked Bar Charts** 📊: Providing a detailed breakdown of injuries and deaths by year.
- **Heatmaps** 🌡️: Giving a month-by-month insight into the severity of injuries and deaths across different years.

To ensure clarity and insightful visualization, we applied **logarithmic scaling** 🔢 and **normalization** where needed, making the patterns and trends **easier to interpret**. 

This exploration not only sheds light on the **human impact** of the ongoing conflict but also provides **clear, accessible, and interactive** visualizations that can serve as a foundation for informed discussions. 🌍💡

In summary, this notebook transforms complex and incomplete data into an elegant and comprehensive story, ready to be shared with the **international community**. 🌐📚

Thank you for taking the time to explore this **important analysis** with us! 🙏✨

In [39]:
df.to_csv("cleaned_data.csv", index=False)