# Missing Migrants Project

Missing Migrants Project tracks deaths of migrants, including refugees and asylum-seekers, who have died or gone missing in the process of migration towards an international destination. 
## Please note that these data represent minimum estimates, as many deaths during migration go unrecorded. 
#### The downloads below are licensed under a Creative Commons Attribution 4.0 International License. 
This means that Missing Migrants Project data are free to share and adapt, as long as the appropriate attribution is given. This includes stating that the source is 'IOM's Missing Migrants Project', and indicating if changes were made to the data. Ideally, a link to this website should also be included.

|Variable Name|Description|
|-------------|-----------|
|Web ID|An automatically generated number used to identify each unique entry in the dataset.|
|Region of incident|The region in which an incident took place.|
|Reported date|Estimated date of death. In cases where the exact date of death is not known, this variable indicates the date in which the body or bodies were found. In cases where data are drawn from surviving migrants, witnesses or other interviews, this variable is entered as the date of the death as reported by the interviewee.  At a minimum, the month and the year of death is recorded. In some cases, official statistics are not disaggregated by the incident, meaning that data is reported as a total number of deaths occurring during a certain time period. In such cases the entry is marked as a “cumulative total,” and the latest date of the range is recorded, with the full dates recorded in the comments.
|Reported year|The year in which the incident occurred.
|Reported month|The month in which the incident occurred.
|Number dead|The total number of people confirmed dead in one incident, i.e. the number of bodies recovered.  If migrants are missing and presumed dead, such as in cases of shipwrecks, leave blank.
|Number missing |The total number of those who are missing and are thus assumed to be dead.  This variable is generally recorded in incidents involving shipwrecks.  The number of missing is calculated by subtracting the number of bodies recovered from a shipwreck and the number of survivors from the total number of migrants reported to have been on the boat.  This number may be reported by surviving migrants or witnesses.  If no missing persons are reported, it is left blank.
|Total dead and missing|The sum of the ‘number dead’ and ‘number missing’ variables.
|Number of survivors|The number of migrants that survived the incident, if known. The age, gender, and country of origin of survivors are recorded in the ‘Comments’ variable if known. If unknown, it is left blank.
|Number of females|Indicates the number of females found dead or missing. If unknown, it is left blank.
|Number of males|Indicates the number of males found dead or missing. If unknown, it is left blank.
|Number of children|Indicates the number of individuals under the age of 18 found dead or missing. If unknown, it is left blank.
|Country of origin|Country of birth of the decedent. If unknown, the entry will be marked “unknown”.
|Region of origin|Region of origin of the decedent(s). In some incidents, region of origin may be marked as “Presumed” or “(P)” if migrants travelling through that location are known to hail from a certain region. If unknown, the entry will be marked “unknown”.
|Cause of death|The determination of conditions resulting in the migrant's death i.e. the circumstances of the event that produced the fatal injury. If unknown, the reason why is included where possible.  For example, “Unknown – skeletal remains only”, is used in cases in which only the skeleton of the decedent was found.
|Location description|Place where the death(s) occurred or where the body or bodies were found. Nearby towns or cities or borders are included where possible. When incidents are reported in an unspecified location, this will be noted.
|Location coordinates|Place where the death(s) occurred or where the body or bodies were found. In many regions, most notably the Mediterranean, geographic coordinates are estimated as precise locations are not often known. The location description should always be checked against the location coordinates.
|Migration route|Name of the migrant route on which incident occurred, if known. If unknown, it is left blank.
|UNSD geographical grouping|Geographical region in which the incident took place, as designated by the United Nations Statistics Division (UNSD) geoscheme.
|Source quality|Incidents are ranked on a scale from 1-5 based on the source(s) of information available. Incidents ranked as level 1 are based on information from only one media source. Incidents ranked as level 2 are based on information from uncorroborated eyewitness accounts or data from survey respondents. Incidents ranked as level 3 are based on information from multiple media reports, while level 4 incidents are based on information from at least one NGO, IGO, or another humanitarian actor with direct knowledge of the incident. Incidents ranked at level 5 are based on information from official sources such as coroners, medical examiners, or government officials OR from multiple humanitarian actors.
|Comments|Brief description narrating additional facts about the death.  If no extra information is available, this is left blank.

### For more info:
#### 1) http://missingmigrants.iom.int/methodology
#### 2) https://missingmigrants.iom.int/regional-classifications
#### 3) https://www.iom.int/
#### 4) https://gmdac.iom.int/

## If you liked the kernel, please give me an upvote.
#### NOTE: This is an alpha version. I will update the kernel with more information. I'm still doing research.

# 1. Importing modules

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
import re
from mpl_toolkits.basemap import Basemap

# 2. Defining Functions

In [None]:
def missingData(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    md = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    md = md[md["Percent"] > 0]
    sns.set(style = 'darkgrid')
    plt.figure(figsize = (8, 4))
    plt.xticks(rotation=90)
    # Recent seaborn releases require x and y to be passed as keywords
    sns.barplot(x=md.index, y=md["Percent"], color="g", alpha=0.8)
    plt.xlabel('Features', fontsize=15)
    plt.ylabel('Percent of missing values', fontsize=15)
    plt.title('Percent missing data by feature', fontsize=15)
    return md

def valueCounts(dataset, features):
    """Display the features value counts """
    for feature in features:
        vc = dataset[feature].value_counts()
        print(vc)
        print('-'*30)

# 3. Open the dataframe

In [None]:
raw_data = pd.read_csv('../input/MissingMigrants-Global-2019-03-29T18-36-07.csv')

In [None]:
raw_data.head(3)

# 4. Checking the Missing Values

In [None]:
raw_data.shape

In [None]:
raw_data.info()

In [None]:
missingData(raw_data)

In [None]:
f = ['Region of Incident', 'Migration Route']

In [None]:
valueCounts(raw_data, f)

# 5. Preprocessing the data

#### First, I create a copy of the dataset

In [None]:
data = raw_data.copy()

#### At this point, I convert the dates to datetime

In [None]:
def convert_date(s):
    new_s = datetime.datetime.strptime(s, '%B %d, %Y')
    return new_s

In [None]:
data['Date'] = data['Reported Date'].apply(convert_date)
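
As an alternative, the same conversion can be done in one vectorized call. This is a minimal sketch on toy values that I assume follow the same 'Month day, Year' layout as 'Reported Date'; the malformed entry is invented to show what `errors='coerce'` does.

```python
import pandas as pd

# Toy values in the assumed 'Month day, Year' layout;
# the last entry is deliberately malformed.
reported = pd.Series(['March 28, 2019', 'April 18, 2015', 'not a date'])

# errors='coerce' turns unparseable strings into NaT instead of raising.
dates = pd.to_datetime(reported, format='%B %d, %Y', errors='coerce')

print(dates)
print('Unparseable rows:', dates.isna().sum())
```

Counting the NaT values afterwards is a quick way to spot rows whose date did not follow the expected layout.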

#### Now I separate the geographic coordinates, so that they can later be used for geographical visualization.

In [None]:
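# Unpacking the .str accessor no longer works in recent pandas; expand=True returns one column per part
data[['Lat', 'Lon']] = data['Location Coordinates'].str.split(', ', n=1, expand=True)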

In [None]:
data.Lat = data.Lat.astype(float)
data.Lon = data.Lon.astype(float)

#### Now I replace the NaN values of the 'Number Dead' and 'Minimum Estimated Number of Missing' features with 0.
## Warning!
### Keep in mind that these figures, however carefully compiled, are minimum estimates.

|Variable Name|Description|
|-------------|-----------|
|Number dead|The total number of people confirmed dead in one incident, i.e. the number of bodies recovered.  If migrants are missing and presumed dead, such as in cases of shipwrecks, leave blank.
|Number missing |The total number of those who are missing and are thus assumed to be dead.  This variable is generally recorded in incidents involving shipwrecks.  The number of missing is calculated by subtracting the number of bodies recovered from a shipwreck and the number of survivors from the total number of migrants reported to have been on the boat.  This number may be reported by surviving migrants or witnesses.  If no missing persons are reported, it is left blank.
|Total dead and missing|The sum of the ‘number dead’ and ‘number missing’ variables.
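
The arithmetic described in the table can be sketched as follows. The numbers are hypothetical and the function name `estimate_missing` is mine, not part of the dataset or of IOM's tooling. Clamping at zero is also my own assumption, for reports that do not add up.

```python
def estimate_missing(on_board, bodies_recovered, survivors):
    """Missing = people reported on board - bodies recovered - survivors."""
    missing = on_board - bodies_recovered - survivors
    # Assumption: inconsistent reports could give a negative value, so clamp at zero.
    return max(missing, 0)

# Hypothetical shipwreck: 120 people reported on board,
# 15 bodies recovered, 80 survivors.
number_dead = 15
number_missing = estimate_missing(on_board=120, bodies_recovered=15, survivors=80)
total_dead_and_missing = number_dead + number_missing

print(number_missing)          # 25
print(total_dead_and_missing)  # 40
```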

In [None]:
data['Number Dead'] = data['Number Dead'].fillna(0)

In [None]:
data['Minimum Estimated Number of Missing'] = data['Minimum Estimated Number of Missing'].fillna(0)

#### Convert the 'Total Dead and Missing' feature to float, so that it can be plotted

In [None]:
data['Total Dead and Missing'] = data['Total Dead and Missing'].astype(float)
# "Mediterranean", "April 18, 2015" I corrected this line in the original file (1,022 ---> 1022)
# so I don't have to use additional functions for converting to float.
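
If editing the original file is not an option, the thousands separator could also be removed in code. This is a minimal sketch on toy values; I assume the only non-numeric character in the column is the comma.

```python
import pandas as pd

# Toy column mixing plain numbers and one value with a thousands separator.
totals = pd.Series(['12', '1,022', '3'])

# Cast to str first so the .str accessor also works on already-numeric input,
# then drop the commas and convert to float.
cleaned = totals.astype(str).str.replace(',', '', regex=False).astype(float)

print(cleaned.tolist())  # [12.0, 1022.0, 3.0]
```

Another possibility is passing `thousands=','` to `pd.read_csv`, which handles the separator at load time.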

#### Now I remove the columns that I do not use in this analysis.
#### N.B. Any advice on how to use these features is welcome.

In [None]:
toDrop = [
    'Reported Date',
    'Web ID',
    'Number of Children',
    'Number of Survivors',
    'Number of Females',
    'Number of Males',
    'Location Coordinates',
    'URL',
    'UNSD Geographical Grouping'
]

In [None]:
data.drop(toDrop, axis=1, inplace=True)

In [None]:
data.shape

In [None]:
data.sample(3)

#### I have to reduce the number of distinct causes of death, because the free-text values are hard to work with. I therefore group similar causes under a single label.

In [None]:
def deathCauseReplacement(data):
    #HEALTH CONDITION
    data.loc[data['Cause of Death'].str.contains('Sickness|sickness'), 'Cause of Death'] = 'Health Condition'
    data.loc[data['Cause of Death'].str.contains('diabetic|heart attack|meningitis|virus|cancer|bleeding|insuline|inhalation'), 'Cause of Death'] = 'Health Condition'
    data.loc[data['Cause of Death'].str.contains('Organ|Coronary|Envenomation|Post-partum|Respiratory|Hypoglycemia'), 'Cause of Death'] = 'Health Condition'
    #HARSH CONDITIONS
    data.loc[data['Cause of Death'].str.contains('harsh weather|Harsh weather'), 'Cause of Death'] = 'Harsh conditions'
    data.loc[data['Cause of Death'].str.contains('Harsh conditions|harsh conditions'), 'Cause of Death'] = 'Harsh conditions'
    data.loc[data['Cause of Death'].str.contains('Exhaustion|Heat stroke'), 'Cause of Death'] = 'Harsh conditions'
    #UNKNOWN
    data.loc[data['Cause of Death'].str.contains('Unknown|unknown'), 'Cause of Death'] = 'Unknown'
    #STARVATION
    data.loc[data['Cause of Death'].str.contains('Starvation|starvation'), 'Cause of Death'] = 'Starvation'
    #DEHYDRATION
    data.loc[data['Cause of Death'].str.contains('dehydration|Dehydration'), 'Cause of Death'] = 'Dehydration'
    #DROWNING
    data.loc[data['Cause of Death'].str.contains('Drowning|drowning|Pulmonary|respiratory|lung|bronchial|pneumonia|Pneumonia'), 'Cause of Death'] = 'Drowning'
    #HYPERTHERMIA
    data.loc[data['Cause of Death'].str.contains('hyperthermia|Hyperthermia'), 'Cause of Death'] = 'Hyperthermia'
    #HYPOTHERMIA
    data.loc[data['Cause of Death'].str.contains('hypothermia|Hypothermia'), 'Cause of Death'] = 'Hypothermia'
    #ASPHYXIATION
    data.loc[data['Cause of Death'].str.contains('asphyxiation|suffocation'), 'Cause of Death'] = 'Asphyxiation'
    #VEHICLE ACCIDENT
    data.loc[data['Cause of Death'].str.contains('train|bus|vehicle|truck|boat|car|road|van|plane'), 'Cause of Death'] = 'Vehicle Accident'
    data.loc[data['Cause of Death'].str.contains('Train|Bus|Vehicle|Truck|Boat|Car|Road|Van|Plane'), 'Cause of Death'] = 'Vehicle Accident'
    #MURDER
    data.loc[data['Cause of Death'].str.contains('murder|stab|shot|violent|blunt force|violence|beat-up|fight|murdered|death'), 'Cause of Death'] = 'Murder'
    data.loc[data['Cause of Death'].str.contains('Murder|Stab|Shot|Violent|Blunt force|Violence|Beat-up|Fight|Murdered|Death'), 'Cause of Death'] = 'Murder'
    data.loc[data['Cause of Death'].str.contains('Hanging|Apache|mortar|landmine|Rape|Gassed'), 'Cause of Death'] = 'Murder'
    #CRUSHED
    data.loc[data['Cause of Death'].str.contains('crushed to death|crush|Crush|Rockslide'), 'Cause of Death'] = 'Crushed'
    #BURNED
    data.loc[data['Cause of Death'].str.contains('burn|burns|burned|fire'), 'Cause of Death'] = 'Burned'
    data.loc[data['Cause of Death'].str.contains('Burn|Burns|Burned|Fire'), 'Cause of Death'] = 'Burned'
    #ELECTROCUTION
    data.loc[data['Cause of Death'].str.contains('electrocution|Electrocution'), 'Cause of Death'] = 'Electrocution'
    #FALLEN
    data.loc[data['Cause of Death'].str.contains('Fall|fall'), 'Cause of Death'] = 'Fallen'
    #KILLED BY ANIMALS
    data.loc[data['Cause of Death'].str.contains('crocodile|hippopotamus|hippoptamus'), 'Cause of Death'] = 'Killed by animals'
    #EXPOSURE
    data.loc[data['Cause of Death'].str.contains('exposure|Exposure'), 'Cause of Death'] = 'Exposure'

In [None]:
deathCauseReplacement(data)

In [None]:
valueCounts(data,['Cause of Death'])

#### This makes the information easier to work with. If I need to investigate specific events, I will go back to raw_data.
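
One caveat with plain substring matching is that a short keyword such as 'car' also matches words like 'Cardiac'. A possible refinement is sketched below. The rule table `RULES` and the function `cluster_causes` are hypothetical names, the keyword lists are deliberately incomplete, and the sample strings are invented. The sketch uses case-insensitive patterns with word boundaries, and the first matching rule wins.

```python
import pandas as pd

# Hypothetical, incomplete rule table: (regex, label). Order matters.
RULES = [
    (r'\bunknown\b', 'Unknown'),
    (r'\bdrown', 'Drowning'),
    (r'\b(?:vehicle|truck|bus|car|train)\b', 'Vehicle Accident'),
    (r'\b(?:shot|stab|murder)', 'Murder'),
]

def cluster_causes(causes, rules=RULES):
    """Return a new Series with matched causes replaced by coarse labels."""
    causes = causes.fillna('Unknown')
    result = causes.copy()
    assigned = pd.Series(False, index=causes.index)
    for pattern, label in rules:
        # Only label rows that no earlier rule has already claimed.
        hit = causes.str.contains(pattern, case=False, regex=True) & ~assigned
        result[hit] = label
        assigned |= hit
    return result

sample = pd.Series(['Presumed drowning', 'Hit by car', 'Cardiac arrest', 'Shot', None])
print(cluster_causes(sample).tolist())
# ['Drowning', 'Vehicle Accident', 'Cardiac arrest', 'Murder', 'Unknown']
```

Because every pattern is tested against the original text rather than against already-replaced labels, a label assigned by one rule cannot be picked up accidentally by a later rule.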

# 6. First general data visualization

In [None]:
fig = plt.figure(figsize=(20, 14)) 
sns.set(style = 'white')
data['Cause of Death'].value_counts().plot(kind='bar', 
                                   color='r',
                                   align='center')
plt.title('Cause of Death', fontsize=20)
plt.show()

In [None]:
fig = plt.figure(figsize=(20, 14)) 
sns.set(style = 'white')
data['Region of Incident'].value_counts().plot(kind='bar', 
                                   color='r',
                                   align='center')
plt.title('Region of Incident', fontsize=20)
plt.show()

In [None]:
fig = plt.figure(figsize=(20, 14)) 
data['Migration Route'].value_counts().plot(kind='bar', 
                                   color='r',
                                   align='center')
plt.title('Migration Route', fontsize=20)
plt.show()

In [None]:
lat = data['Lat'][:]
lon = data['Lon'][:]
lat = lat.dropna()
lon = lon.dropna()
lat = np.array(lat)
lon = np.array(lon)

fig=plt.figure()
ax=fig.add_axes([1.0,1.0,2.8,2.8])
mapp = Basemap(llcrnrlon=-180.,llcrnrlat=-60.,urcrnrlon=180.,urcrnrlat=80.,
            rsphere=(6378137.00,6356752.3142),
            resolution='l',projection='merc',
            lat_0=40.,lon_0=-20.,lat_ts=20.)
mapp.drawcoastlines()
mapp.drawparallels(np.arange(-90,90,30),labels=[1,0,0,0])
mapp.drawmeridians(np.arange(mapp.lonmin,mapp.lonmax+30,60),labels=[0,0,0,1])
x, y = mapp(lon,lat)
mapp.scatter(x,y,3,marker='o',color='r')
ax.set_title('Migrant deaths across the world', fontsize=20)
plt.show()

In [None]:
data.pivot_table('Total Dead and Missing', index='Reported Year', 
                               columns='Migration Route', aggfunc='sum').plot(figsize=(20, 10), kind='bar')
plt.ylabel('Count')
plt.title('Total Dead and Missing by Migration Route and Year', fontsize=20)
plt.show()

#### We already had a rough idea, but these visualizations show how tragic the situation is in certain areas of the globe.
#### Note: the dataset does not cover every migratory route or every death, as this information is not always accessible.

In [None]:
data.pivot_table('Total Dead and Missing', index='Migration Route', columns='Reported Year', aggfunc='sum')

# 7. The Mediterranean migratory routes

<img src='https://upload.wikimedia.org/wikipedia/commons/f/f4/Rotte_di_migranti_nel_mediterraneo.svg'>

#### Wikipedia page about migrant routes in the Mediterranean
#### https://it.wikipedia.org/wiki/Rotte_di_migranti_nel_Mediterraneo

In [None]:
mediterranean = data.loc[data['Region of Incident'] =='Mediterranean']

#### I take the data from 01/01/2014 to 12/31/2018, because 2019 is still incomplete and would distort comparisons between years. Maybe I'll look at the latest data later.

In [None]:
# Build the mask from the same frame that is being filtered
mediterranean14_18 = mediterranean.loc[mediterranean['Reported Year'] < 2019]
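
The same window could also be expressed on the parsed dates rather than on the year. This is a minimal sketch on a toy frame; the column names mirror the dataset, but the rows are invented.

```python
import pandas as pd

# Toy frame with one row just outside each end of the window.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2013-12-31', '2014-01-01', '2018-12-31', '2019-01-01']),
    'Total Dead and Missing': [1.0, 2.0, 3.0, 4.0],
})

# between() is inclusive on both ends by default.
in_range = toy['Date'].between('2014-01-01', '2018-12-31')
subset = toy.loc[in_range]

print(subset['Total Dead and Missing'].sum())  # 5.0
```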

In [None]:
mediterranean14_18.head(3)

In [None]:
valueCounts(mediterranean14_18, ['Migration Route'])

#### About half of the incidents on the Mediterranean routes occurred on the Central Mediterranean route.

In [None]:
lat = mediterranean14_18['Lat'][:]
lon = mediterranean14_18['Lon'][:]
lat = lat.dropna()
lon = lon.dropna()
lat = np.array(lat)
lon = np.array(lon)

fig=plt.figure()
ax=fig.add_axes([1.0,1.0,2.8,2.8])
mapp = Basemap(llcrnrlon=-10.,llcrnrlat=25.,urcrnrlon=40.,urcrnrlat=50.,
            rsphere=(6378137.00,6356752.3142),
            resolution='l',projection='merc',
            lat_0=40.,lon_0=-20.,lat_ts=20.)
mapp.drawcoastlines()
mapp.drawparallels(np.arange(25,55,5),labels=[1,0,0,0])
mapp.drawmeridians(np.arange(-20,55,5),labels=[0,0,0,1])
x, y = mapp(lon,lat)
mapp.scatter(x,y,5,marker='o',color='r')
ax.set_title('Migrant deaths in the Mediterranean Routes', fontsize=20)
plt.show()

In [None]:
plt.figure(figsize=(20,10))
sns.scatterplot(x='Lon', y='Lat', hue='Cause of Death', size='Total Dead and Missing', data=mediterranean14_18, palette='Set1')
plt.title('Latitude vs Longitude', fontsize=20)
plt.show()

#### As a Sardinian, I was surprised to see some points in Sardinia. Therefore I decided to investigate these events further.

### Deaths in Sardinia
#### 1) web ID 40607 (Cagliari) 19-12-2015
http://www.ansa.it/sardegna/notizie/2015/12/19/tenta-fuga-da-ospedale-muore-eritreo_e9f4f8da-4058-4a54-83a8-0fd8c8e29b11.html
#### 2) web ID 42607 (Cagliari) 21-05-2016
https://www.ilmessaggero.it/primopiano/cronaca/migranti_sardegna_guardia_costiera-1747870.html
#### 3) web ID 46079 (Isola del Toro) 15-11-2018
https://tg24.sky.it/cronaca/2018/11/16/migranti-naufragio-sardegna.html

#### How reliable are the sources concerning the Routes in the Mediterranean?

In [None]:
valueCounts(mediterranean14_18,['Source Quality'])

In [None]:
unreliableData = (197*100)/len(mediterranean14_18)
print('Unreliable Data: %f percent' % unreliableData)
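
Rather than hard-coding the count, the share could be computed from the column itself. This is a minimal sketch on toy values; treating levels 1 and 2 as "unreliable" is my reading of the source quality scale, not an official IOM definition.

```python
import pandas as pd

# Toy 'Source Quality' values; the real column holds one score per incident.
quality = pd.Series([1, 2, 3, 4, 5, 5, 3, 1])

# Assumption: levels 1 and 2 count as unreliable.
low_quality = quality.isin([1, 2])

# The mean of a boolean Series is the fraction of True values.
share = low_quality.mean() * 100
print('Unreliable Data: %.1f percent' % share)  # Unreliable Data: 37.5 percent
```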

|Variable Name|Description|
|-------------|-----------|
|Source quality|Incidents are ranked on a scale from 1-5 based on the source(s) of information available. Incidents ranked as level 1 are based on information from only one media source. Incidents ranked as level 2 are based on information from uncorroborated eyewitness accounts or data from survey respondents. Incidents ranked as level 3 are based on information from multiple media reports, while level 4 incidents are based on information from at least one NGO, IGO, or another humanitarian actor with direct knowledge of the incident. Incidents ranked at level 5 are based on information from official sources such as coroners, medical examiners, or government officials OR from multiple humanitarian actors.

Most of the data on the migratory routes in the Mediterranean can be considered fairly reliable (source quality 3, 4 or 5).
#### The chart below gives us further confirmation.

In [None]:
# Plot the same frame that provides the marker sizes, so the two always line up
mediterranean14_18.plot(kind='scatter',x='Lon',y='Lat',alpha=0.4,
                        s=mediterranean14_18['Total Dead and Missing']/2, label='Total Dead & Missing',
                        figsize=(20,10),c='Source Quality', cmap=plt.get_cmap('jet'), colorbar=True)
plt.title('Total Dead & Missing Migrants by Source Quality', fontsize=20)
plt.show()

In [None]:
mediterranean14_18.describe()

In [None]:
years = [2014,2015,2016,2017,2018]
mRoutes = ['Western Mediterranean','Central Mediterranean','Eastern Mediterranean']

for year in years:
    print('Year: %d' % year)
    for mroute in mRoutes:
        m = mediterranean.loc[(mediterranean['Migration Route'] == mroute) & (mediterranean['Reported Year'] == year)]
        print('_'*40)
        print('Migration Route: %s' %mroute)
        print('Total dead & missing: ', m['Total Dead and Missing'].sum())
        print('Total dead: ', m['Number Dead'].sum())
    m1= mediterranean.loc[mediterranean['Reported Year'] == year]
    print('_'*40)
    print('Total dead & missing: ', m1['Total Dead and Missing'].sum())
    print('Total dead: ', m1['Number Dead'].sum())
    print('*'*40)

#### Now I create some pivot tables so that I can study some data.

In [None]:
mediterranean14_18.pivot_table('Total Dead and Missing', index='Migration Route', aggfunc='sum')

In [None]:
mediterranean14_18.pivot_table('Number Dead', index='Migration Route', aggfunc='sum')
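
As a cross-check, the per-route totals from a pivot table can be reproduced with `groupby`. This is a minimal sketch on a toy frame; the figures are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame mirroring the dataset's column names, with invented numbers.
toy = pd.DataFrame({
    'Migration Route': ['Central Mediterranean', 'Central Mediterranean', 'Western Mediterranean'],
    'Reported Year': [2017, 2018, 2018],
    'Total Dead and Missing': [10.0, 4.0, 3.0],
})

# pivot_table returns a DataFrame indexed by route.
by_route = toy.pivot_table(values='Total Dead and Missing', index='Migration Route', aggfunc='sum')

# groupby returns the same totals as a Series.
same_totals = toy.groupby('Migration Route')['Total Dead and Missing'].sum()

print(by_route)
print(same_totals)
```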

#### NOTE: To be clear from the start, the 'Total Dead and Missing' feature is the sum of dead and missing migrants. Sadly, a person who goes missing in the middle of the sea is very likely dead.

More than 15,000 people died or went missing on the Central Mediterranean route alone between 2014 and 2018.

In [None]:
mediterranean14_18.pivot_table('Total Dead and Missing', index='Migration Route', columns='Reported Year', aggfunc='sum')

The number appears to be falling. This is probably due to policy measures adopted in Italy from 2017 onwards (I will add the sources).

#### However, it should be pointed out that the drop in deaths is tied to the sharp reduction in departures from the Libyan coast: it is the absolute number of deaths that has fallen. This applies to the Mediterranean routes. Later it will be necessary, however difficult, to analyze how the African routes have evolved.

In [None]:
mediterranean14_18.pivot_table('Total Dead and Missing', index='Reported Year', 
                               columns='Migration Route', aggfunc='sum').plot(figsize=(20, 10), kind='bar')
plt.ylabel('Count')
plt.title('Total Dead and Missing 2014-2018', fontsize=20)
plt.show()

In [None]:
mediterranean14_18.pivot_table('Number Dead', index='Reported Year', 
                               columns='Migration Route', aggfunc='sum').plot(figsize=(20, 10), kind='bar')
plt.ylabel('Count')
plt.title('Total Dead 2014-2018', fontsize=20)
plt.show()

# 8. Exploring Africa (I have to do some research)

In [None]:
lat = data['Lat'][:]
lon = data['Lon'][:]
lat = lat.dropna()
lon = lon.dropna()
lat = np.array(lat)
lon = np.array(lon)

fig=plt.figure()
ax=fig.add_axes([1.0,1.0,2.8,2.8])
mapp = Basemap(llcrnrlon=-20.,llcrnrlat=5.,urcrnrlon=50.,urcrnrlat=60.,
            rsphere=(6378137.00,6356752.3142),
            resolution='l',projection='merc',
            lat_0=40.,lon_0=-20.,lat_ts=20.)
mapp.drawcoastlines()
mapp.drawparallels(np.arange(5,65,5),labels=[1,0,0,0])
mapp.drawmeridians(np.arange(-20,55,5),labels=[0,0,0,1])
x, y = mapp(lon,lat)
mapp.scatter(x,y,3,marker='o',color='r')
ax.set_title('Dead or Missing Migrants between Europe and Africa', fontsize=20)
plt.show()

# Conclusion

## I hope you enjoyed this kernel. It is meant not only as a demonstration, but also as a tool to learn more about these events. These tragedies happen every day; we should at least know about them, to help prevent them from happening again.
