# Loading the dataset and libraries 

This dataset records fatalities in the Israeli-Palestinian conflict from 2000 to 2023. It provides extensive details on individual casualties, including personal information, event specifics, and circumstances of death. Source: [Kaggle dataset page](https://www.kaggle.com/datasets/asaniczka/fatalities-in-the-israeli-palestinian-conflict/code).

In [84]:
import numpy as np
import pandas as pd 

In [85]:
data = pd.read_csv("fatalities_isr_pse_conflict_2000_to_2023.csv")

In [None]:
data.head()
# data.info()
# data.shape
# data.dtypes
# data.describe()


In [87]:
data.isnull().sum()

name                               0
date_of_event                      0
age                              129
citizenship                        0
event_location                     0
event_location_district            0
event_location_region              0
date_of_death                      0
gender                            20
took_part_in_the_hostilities    1430
place_of_residence                68
place_of_residence_district       68
type_of_injury                   291
ammunition                      5253
killed_by                          0
notes                            280
dtype: int64

In [88]:
# Count the rows where both residence columns are missing at the same time
missing_in_both = (data['place_of_residence'].isnull() & data['place_of_residence_district'].isnull()).sum()
print(f"Rows missing both residence columns: {missing_in_both}")

Rows missing both residence columns: 68


### Observations from the data
#### Null values
* total records: 11124
* age: 129
* gender: 20
* took_part_in_the_hostilities: 1430
* place_of_residence and place_of_residence_district: 68 each
* type_of_injury: 291
* ammunition: 5253
* notes: 280
#### Object vs. float
* all columns have dtype `object` except `age` (`float`).
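The counts above are easier to weigh against each other as percentages of the total. A minimal sketch, assuming a helper named `missing_summary` (hypothetical, not part of the notebook) and a toy frame with illustrative values in place of the real data:

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages for columns that have nulls, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "age": [23.0, np.nan, 41.0, 7.0],
    "ammunition": [np.nan, np.nan, "missile", np.nan],
    "killed_by": ["a", "b", "c", "d"],
})
summary = missing_summary(toy)
print(summary)
```

On the real frame, `missing_summary(data)` would show at a glance which columns lose only a few rows and which, like `ammunition`, are missing in a large share of records.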

In [89]:
# Unique values 
unique_values_per_column = { col: data[col].unique() for col in data.columns if data[col].nunique() < 30 }
for col, val in unique_values_per_column.items():
    print(f"*** Unique values for [{col}] is : {val}")

*** Unique values for [citizenship] is : ['Palestinian' 'Israeli' 'Jordanian' 'American']
*** Unique values for [event_location_district] is : ['Tulkarm' 'Jenin' 'Jericho' 'Gaza' 'Hebron' 'Tubas'
 'Ramallah and al-Bira' 'East Jerusalem' 'Nablus' 'Israel' 'al-Quds'
 'Bethlehem' 'Khan Yunis' 'Deir al-Balah' 'North Gaza' 'Rafah' 'Qalqiliya'
 'Salfit' 'Gaza Strip' 'Gush Katif']
*** Unique values for [event_location_region] is : ['West Bank' 'Gaza Strip' 'Israel']
*** Unique values for [gender] is : ['M' 'F' nan]
*** Unique values for [took_part_in_the_hostilities] is : [nan 'No' 'Yes' 'Unknown' 'Israelis' 'Object of targeted killing']
*** Unique values for [place_of_residence_district] is : ['Tulkarm' 'Jenin' 'Jericho' 'Khan Yunis' 'Hebron' 'Tubas'
 'Ramallah and al-Bira' 'East Jerusalem' 'Israel' 'Nablus' 'al-Quds'
 'Bethlehem' 'Gaza' 'Rafah' 'Deir al-Balah' 'North Gaza' 'Salfit'
 'Qalqiliya' nan 'Gush Katif' 'West Bank']
*** Unique values for [type_of_injury] is : ['gunfire' 'stabbing' '

# Missing Values 
* **age**: age is central to this analysis, so keeping the 129 rows without it would distort the results; we drop those rows.
* **took_part_in_the_hostilities**: the column has only a handful of categories, one of which is already `Unknown`, so we relabel NaN as `Unknown`.
* **type_of_injury**: we only derive insights from this dataset through visualization, so we replace NaN with `Unknown`.
* **ammunition**: for the same reason, we replace NaN with `Unknown`.
* **notes**: we replace NaN with `No note`.
* **place_of_residence** & **place_of_residence_district**: both columns are missing in the same rows (68 in the raw data, 44 of which remain once the rows without an age are dropped). We drop these rows as well, to keep the visualizations as accurate as possible.
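The plan above can be sketched as a single function: drop rows missing a required column, then fill the rest with placeholder labels. This is a minimal sketch on a toy frame. The names `FILL_VALUES`, `REQUIRED`, and `clean_missing` are hypothetical, and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

# Placeholder label per column, following the plan above.
FILL_VALUES = {
    "took_part_in_the_hostilities": "Unknown",
    "type_of_injury": "Unknown",
    "ammunition": "Unknown",
    "notes": "No note",
}
# Rows missing any of these columns are dropped rather than filled.
REQUIRED = ["age", "place_of_residence", "place_of_residence_district"]

def clean_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows lacking required fields, then fill the remaining gaps with labels."""
    out = df.dropna(subset=REQUIRED)
    out = out.fillna(value=FILL_VALUES)
    return out.reset_index(drop=True)

# Toy frame (illustrative values): only the first row has all required fields.
toy = pd.DataFrame({
    "age": [30.0, np.nan, 12.0],
    "place_of_residence": ["Jenin", "Gaza", None],
    "place_of_residence_district": ["Jenin", "Gaza", None],
    "took_part_in_the_hostilities": [None, "No", "Yes"],
    "type_of_injury": ["gunfire", None, None],
    "ammunition": [None, "missile", None],
    "notes": [None, "text", None],
})
cleaned = clean_missing(toy)
print(cleaned)
```

Passing a dict to `fillna` keeps all placeholder labels in one place, which makes spelling inconsistencies between columns less likely.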

In [90]:
# drop age nan values 
data.dropna(subset='age',inplace=True)
# data['age'].fillna(-1,inplace=True)

In [91]:
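# Replace NaN with 'Unknown'; plain assignment avoids the chained inplace fillna that newer pandas versions warn about
data['took_part_in_the_hostilities'] = data['took_part_in_the_hostilities'].fillna('Unknown')
data['type_of_injury'] = data['type_of_injury'].fillna('Unknown')
data['ammunition'] = data['ammunition'].fillna('Unknown')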

In [92]:
# drop place_of_residence nan values : 
data.dropna(subset='place_of_residence_district',inplace=True)

In [93]:
# Replace missing notes with 'No note'
data['notes'] = data['notes'].fillna('No note')

In [94]:
print(f"Records remaining after cleaning: {data.shape[0]}")

Records remaining after cleaning: 10951


In [None]:
data.head()

# Location - Longitude and Latitude 

* We are going to use a popular geocoding service, the OpenCage Geocoding API, through the `geopy` library:
  1. install `geopy`, which provides the `OpenCage` geocoder class
  2. import `OpenCage` from `geopy.geocoders`
  3. sign up with OpenCage and get an API key
  4. convert each unique (event_location_region, event_location_district) pair to latitude and longitude.
 
!! Geocoding API plan: the free trial allows up to 2,500 API requests/day for testing.
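Because of the daily request limit, it helps to geocode each unique pair once rather than every row. The sketch below shows that idea with a stand-in geocoder so it runs without a network connection or an API key. `geocode_unique` and `fake_geocode` are hypothetical names, and the coordinates are placeholders, not real locations. In the notebook, a thin wrapper around the real geocoder would be passed in instead.

```python
import time
from typing import Callable, List, Optional, Tuple

import pandas as pd

Coords = Optional[Tuple[float, float]]

def geocode_unique(df: pd.DataFrame, cols: List[str],
                   geocode: Callable[[str], Coords],
                   delay: float = 0.0) -> pd.DataFrame:
    """Geocode each unique combination of `cols` once and return a lookup table."""
    rows = []
    for values in df[cols].drop_duplicates().itertuples(index=False):
        query = ", ".join(str(v) for v in values)
        result = geocode(query)
        if result is None:
            print(f"Could not find coordinates for {query}")
        else:
            rows.append([*values, result[0], result[1]])
        time.sleep(delay)  # optional pause between requests
    return pd.DataFrame(rows, columns=[*cols, "latitude", "longitude"])

# Stand-in geocoder: placeholder coordinates, and it records every query it receives.
calls = []
def fake_geocode(query: str) -> Coords:
    calls.append(query)
    table = {"West Bank, Jenin": (1.0, 2.0), "Gaza Strip, Rafah": (3.0, 4.0)}
    return table.get(query)

toy = pd.DataFrame({
    "event_location_region": ["West Bank", "West Bank", "Gaza Strip", "Israel"],
    "event_location_district": ["Jenin", "Jenin", "Rafah", "Israel"],
})
coords = geocode_unique(toy, ["event_location_region", "event_location_district"], fake_geocode)
print(coords)
```

With roughly twenty distinct districts in the data, this approach stays far below the 2,500 requests/day limit.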

In [None]:
data.columns

In [None]:
# get the unique values of combined ( region and district ) 
unique_combinations = data.drop_duplicates(subset=["event_location_region","event_location_district"])
for index, row in unique_combinations.iterrows():
    print(f"Region : {row['event_location_region']}, District : {row['event_location_district']}")


In [98]:
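# Load the OpenCage geocoder from geopy
import os
from geopy.geocoders import OpenCage
# Set up the geocoder; the key is read from an environment variable
# (OPENCAGE_API_KEY is an assumed name) instead of being hard-coded in the notebook
geocoder = OpenCage(api_key=os.environ['OPENCAGE_API_KEY'])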

In [99]:
coordinate_list = []
for index, row in unique_combinations.iterrows():
    location = geocoder.geocode(f"{row['event_location_region']}, {row['event_location_district']}")
    if location:
        coordinate_list.append([
            row['event_location_region'],
            row['event_location_district'],
            location.latitude,
            location.longitude
        ])
    else:
        print(f"Could not find coordinates for {row['event_location_region']}, {row['event_location_district']}")
        

In [100]:
# Create a DataFrame from this list, then merge it (left join) with the
# original data and drop any unnecessary columns.
coordinate_df = pd.DataFrame(coordinate_list,columns=['event_location_region','event_location_district','latitude','longitude'])

In [101]:
# Merging the 2 dataframes together 
data = pd.merge(data,coordinate_df,on=['event_location_region','event_location_district'],how='left')
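A left join keeps every row of the original data, but rows whose location could not be geocoded end up with NaN coordinates. A minimal sketch of how to check this, using the `validate` and `indicator` options of `pd.merge` on toy frames (names and values are illustrative, and the coordinates are placeholders):

```python
import pandas as pd

keys = ["event_location_region", "event_location_district"]

# Toy stand-ins for the fatalities data and the coordinate lookup table.
events = pd.DataFrame({
    "event_location_region": ["West Bank", "West Bank", "Israel"],
    "event_location_district": ["Jenin", "Jenin", "Israel"],
})
lookup = pd.DataFrame({
    "event_location_region": ["West Bank"],
    "event_location_district": ["Jenin"],
    "latitude": [1.0],   # placeholder coordinates
    "longitude": [2.0],
})

# validate raises if the lookup table has duplicate keys, which would duplicate rows;
# indicator adds a _merge column marking rows with no match in the lookup table.
merged = pd.merge(events, lookup, on=keys, how="left",
                  validate="many_to_one", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", keys].drop_duplicates()
merged = merged.drop(columns="_merge")

print(f"Rows before: {len(events)}, after: {len(merged)}")
print(unmatched)
```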

# Preparing format  

In [102]:
data.columns

Index(['name', 'date_of_event', 'age', 'citizenship', 'event_location',
       'event_location_district', 'event_location_region', 'date_of_death',
       'gender', 'took_part_in_the_hostilities', 'place_of_residence',
       'place_of_residence_district', 'type_of_injury', 'ammunition',
       'killed_by', 'notes', 'latitude', 'longitude'],
      dtype='object')

In [103]:
# date_of_event: object => datetime  
# date_of_death : object => datetime 
data['date_of_event'] = pd.to_datetime(data['date_of_event'],format='%Y-%m-%d')
data['date_of_death'] = pd.to_datetime(data['date_of_death'],format='%Y-%m-%d')
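With an explicit `format`, `pd.to_datetime` raises an error on the first value that does not match. If that happens, passing `errors="coerce"` turns unparseable values into `NaT` so they can be inspected. A small sketch on illustrative values (not taken from the dataset):

```python
import pandas as pd

raw = pd.Series(["2023-09-24", "2000-10-02", "not a date"])  # illustrative values

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Show the original strings that failed to parse.
bad_rows = raw[parsed.isna()]
print(f"Unparseable dates: {len(bad_rows)}")
print(bad_rows)
```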

In [104]:
# Age => int8
data['age'] = data['age'].astype('int8')
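`int8` only holds values from -128 to 127, and casting a float column that still contains NaN to an integer type fails. The cast above therefore relies on the earlier drop of missing ages and on all ages fitting that range. A minimal sketch of a guarded cast, where `cast_age` is a hypothetical helper and the sample ages are illustrative:

```python
import numpy as np
import pandas as pd

def cast_age(series: pd.Series, dtype: str = "int8") -> pd.Series:
    """Cast to a small integer dtype only if every value fits and none is missing."""
    limits = np.iinfo(dtype)
    if series.isnull().any():
        raise ValueError("missing values must be dropped or filled before casting")
    if series.min() < limits.min or series.max() > limits.max:
        raise ValueError(f"values fall outside the {dtype} range [{limits.min}, {limits.max}]")
    return series.astype(dtype)

ages = cast_age(pd.Series([0.0, 35.0, 112.0]))
print(ages.dtype)
```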

In [105]:
data.dtypes

name                                    object
date_of_event                   datetime64[ns]
age                                       int8
citizenship                             object
event_location                          object
event_location_district                 object
event_location_region                   object
date_of_death                   datetime64[ns]
gender                                  object
took_part_in_the_hostilities            object
place_of_residence                      object
place_of_residence_district             object
type_of_injury                          object
ammunition                              object
killed_by                               object
notes                                   object
latitude                               float64
longitude                              float64
dtype: object

# Saving the cleaned Dataset as CSV file 

In [106]:
data.to_csv('fatalities_isr_pse_conflict_2000_to_2023_Clean.csv', index=False)
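A CSV file stores no dtype information, so the datetime and `int8` conversions are lost when the file is read back unless they are requested again. A minimal sketch of the round trip, using an in-memory buffer and a toy frame with illustrative values in place of the real file:

```python
import io

import pandas as pd

# Toy frame with the same dtypes as the cleaned data (values are illustrative).
clean = pd.DataFrame({
    "date_of_event": pd.to_datetime(["2023-09-24", "2001-01-05"]),
    "date_of_death": pd.to_datetime(["2023-09-24", "2001-01-07"]),
    "age": pd.Series([32, 8], dtype="int8"),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the dtypes explicitly on load.
reloaded = pd.read_csv(
    buffer,
    parse_dates=["date_of_event", "date_of_death"],
    dtype={"age": "int8"},
)
print(reloaded.dtypes)
```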