### Problem Statement

    * In this project, we investigate whether country-level suicide rates are correlated with mental health indicators and socio-economic factors.

## Procedure:
    1. Begin by importing the datasets pertaining to suicide rates and mental disorders.
    2. Conduct data cleaning operations on these datasets, followed by merging them into a unified dataset.
    3. Proceed to analyze the merged dataset, identifying countries that warrant further in-depth analysis.
    4. Introduce the World Development Indicators (WDI) dataset and execute data cleaning procedures.
    5. Selectively retain data for the identified countries from the WDI dataset, discarding irrelevant entries.
    6. Standardize the compiled dataset and undertake modeling activities.
    7. Evaluate the model's performance and extract insights regarding the significance of each feature through feature importance analysis.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Data Cleaning

#### Suicide Rate Dataset

In [4]:
suicide_df = pd.read_csv("./Datasets/crude suicide rates.csv", sep=",")
suicide_df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,"Probability (%) of dying between age 30 and exact age 70 from any of cardiovascular disease, cancer, diabetes, or chronic respiratory disease","Probability (%) of dying between age 30 and exact age 70 from any of cardiovascular disease, cancer, diabetes, or chronic respiratory disease.1","Probability (%) of dying between age 30 and exact age 70 from any of cardiovascular disease, cancer, diabetes, or chronic respiratory disease.2",Crude suicide rates (per 100 000 population),Crude suicide rates (per 100 000 population).1,Crude suicide rates (per 100 000 population).2
0,Country,Year,Both sexes,Male,Female,Both sexes,Male,Female
1,Afghanistan,2016,29.8,31.8,27.7,4.7,7.6,1.5
2,Afghanistan,2015,29.8,31.9,27.8,4.8,7.8,1.5
3,Afghanistan,2010,31.7,34.1,29.4,5.1,8.6,1.4
4,Afghanistan,2005,34.1,36.5,31.6,6.3,10.8,1.5


In [None]:
### Renaming the columns to shorten them and make the dataset more readable.

In [5]:
suicide_df.rename(columns={"Unnamed: 0":"country"}, inplace=True)

In [7]:
suicide_df.rename(columns={"Unnamed: 1": "year",
                           "Probability (%) of dying between age 30 and exact age 70 from any of cardiovascular disease, cancer, diabetes, or chronic respiratory disease":"prob_death_bw_30_70_both",
                           "Probability (%) of dying between age 30 and exact age 70 from any of cardiovascular disease, cancer, diabetes, or chronic respiratory disease.1":"prob_death_bw_30_70_male",
                           "Probability (%) of dying between age 30 and exact age 70 from any of cardiovascular disease, cancer, diabetes, or chronic respiratory disease.2":"prob_death_bw_30_70_female",
                           "Crude suicide rates (per 100 000 population)":"crude_suicide_rate_both_per_100000",
                           "Crude suicide rates (per 100 000 population).1":"crude_suicide_rate_male_per_100000",
                           "Crude suicide rates (per 100 000 population).2":"crude_suicide_rate_female_per_100000"}, inplace=True)

In [8]:
suicide_df.head()

Unnamed: 0,country,year,prob_death_bw_30_70_both,prob_death_bw_30_70_male,prob_death_bw_30_70_female,crude_suicide_rate_both_per_100000,crude_suicide_rate_male_per_100000,crude_suicide_rate_female_per_100000
0,Country,Year,Both sexes,Male,Female,Both sexes,Male,Female
1,Afghanistan,2016,29.8,31.8,27.7,4.7,7.6,1.5
2,Afghanistan,2015,29.8,31.9,27.8,4.8,7.8,1.5
3,Afghanistan,2010,31.7,34.1,29.4,5.1,8.6,1.4
4,Afghanistan,2005,34.1,36.5,31.6,6.3,10.8,1.5


In [None]:
### Removing the first row, as it contains a sub-heading which we combined with the main column heading while renaming.

In [9]:
suicide_df = suicide_df.iloc[1:]
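
The manual renaming above works, but the same result can be derived from the sub-heading row itself. The following is a minimal sketch on a toy frame, not the notebook's actual method; `flatten_header` and the shortened column names are hypothetical and only illustrate how the two header levels could be combined before the sub-heading row is dropped.

```python
import re
import pandas as pd

# Toy frame mimicking the raw layout: generic top header, sub-heading in row 0.
raw = pd.DataFrame(
    [
        ["Country", "Year", "Both sexes", "Male", "Female"],
        ["Afghanistan", "2016", "4.7", "7.6", "1.5"],
    ],
    columns=[
        "Unnamed: 0",
        "Unnamed: 1",
        "Crude suicide rates",
        "Crude suicide rates.1",
        "Crude suicide rates.2",
    ],
)


def flatten_header(df):
    """Combine each column name with the sub-heading stored in the first row."""
    new_cols = []
    for top, sub in zip(df.columns, df.iloc[0]):
        if top.startswith("Unnamed"):
            # No real top-level name: the sub-heading is the column name.
            new_cols.append(sub.strip().lower())
        else:
            # Drop the ".1", ".2" suffix pandas appends to duplicate names.
            base = re.sub(r"\.\d+$", "", top)
            new_cols.append(f"{base} ({sub.strip().lower()})")
    # Remove the sub-heading row and attach the combined names.
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = new_cols
    return out


tidy = flatten_header(raw)
print(tidy.columns.tolist())
```

The combined names are still long, so a final rename to short identifiers such as `crude_suicide_rate_male_per_100000` would remain useful.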

In [10]:
suicide_df.head()

Unnamed: 0,country,year,prob_death_bw_30_70_both,prob_death_bw_30_70_male,prob_death_bw_30_70_female,crude_suicide_rate_both_per_100000,crude_suicide_rate_male_per_100000,crude_suicide_rate_female_per_100000
1,Afghanistan,2016,29.8,31.8,27.7,4.7,7.6,1.5
2,Afghanistan,2015,29.8,31.9,27.8,4.8,7.8,1.5
3,Afghanistan,2010,31.7,34.1,29.4,5.1,8.6,1.4
4,Afghanistan,2005,34.1,36.5,31.6,6.3,10.8,1.5
5,Afghanistan,2000,34.4,36.6,32.1,5.7,10.0,1.0


In [11]:
### Checking for missing values.

In [14]:
suicide_df.isnull().sum()

country                                 0
year                                    0
prob_death_bw_30_70_both                0
prob_death_bw_30_70_male                0
prob_death_bw_30_70_female              0
crude_suicide_rate_both_per_100000      0
crude_suicide_rate_male_per_100000      0
crude_suicide_rate_female_per_100000    0
dtype: int64

Conclusion: Our data has no missing values.

In [17]:
suicide_df.shape

(915, 8)

* Our suicide_df dataframe has 915 rows and 8 columns

In [16]:
suicide_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 915 entries, 1 to 915
Data columns (total 8 columns):
 #   Column                                Non-Null Count  Dtype 
---  ------                                --------------  ----- 
 0   country                               915 non-null    object
 1   year                                  915 non-null    object
 2   prob_death_bw_30_70_both              915 non-null    object
 3   prob_death_bw_30_70_male              915 non-null    object
 4   prob_death_bw_30_70_female            915 non-null    object
 5   crude_suicide_rate_both_per_100000    915 non-null    object
 6   crude_suicide_rate_male_per_100000    915 non-null    object
 7   crude_suicide_rate_female_per_100000  915 non-null    object
dtypes: object(8)
memory usage: 57.3+ KB


* Here, the datatype of all the columns is `object`. But judging by the data, every column except `country` should be numeric: `year` an integer and the remaining columns of type `float`.

* So, we convert `year` to `int` and the other required fields to `float`.

In [22]:
### Convert selected columns to `float`.

columns_to_convert = ["prob_death_bw_30_70_both", "prob_death_bw_30_70_male", "prob_death_bw_30_70_female",
                     "crude_suicide_rate_both_per_100000", "crude_suicide_rate_male_per_100000",
                     "crude_suicide_rate_female_per_100000"]

suicide_df[columns_to_convert] = suicide_df[columns_to_convert].astype(float)

suicide_df["year"] = suicide_df["year"].astype(int)
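
`astype(float)` raises an error on the first value it cannot parse. As an alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"` on a toy frame to show which entries would fail to convert. The toy values, including the `"n/a"` entry, are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy frame with string-typed numbers and one deliberately bad entry.
toy = pd.DataFrame(
    {"rate": ["4.7", "7.6", "n/a"], "year": ["2016", "2015", "2010"]}
)

# Unparseable strings become NaN instead of raising an exception.
converted = toy.apply(pd.to_numeric, errors="coerce")

# Entries that held text originally but could not be parsed as numbers.
bad = converted.isna() & toy.notna()
print(bad.sum())
```

If every count is zero, the stricter `astype` conversion is safe to use.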

In [23]:
suicide_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 915 entries, 1 to 915
Data columns (total 8 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   country                               915 non-null    object 
 1   year                                  915 non-null    int32  
 2   prob_death_bw_30_70_both              915 non-null    float64
 3   prob_death_bw_30_70_male              915 non-null    float64
 4   prob_death_bw_30_70_female            915 non-null    float64
 5   crude_suicide_rate_both_per_100000    915 non-null    float64
 6   crude_suicide_rate_male_per_100000    915 non-null    float64
 7   crude_suicide_rate_female_per_100000  915 non-null    float64
dtypes: float64(6), int32(1), object(1)
memory usage: 53.7+ KB


In [24]:
suicide_df.head()

Unnamed: 0,country,year,prob_death_bw_30_70_both,prob_death_bw_30_70_male,prob_death_bw_30_70_female,crude_suicide_rate_both_per_100000,crude_suicide_rate_male_per_100000,crude_suicide_rate_female_per_100000
1,Afghanistan,2016,29.8,31.8,27.7,4.7,7.6,1.5
2,Afghanistan,2015,29.8,31.9,27.8,4.8,7.8,1.5
3,Afghanistan,2010,31.7,34.1,29.4,5.1,8.6,1.4
4,Afghanistan,2005,34.1,36.5,31.6,6.3,10.8,1.5
5,Afghanistan,2000,34.4,36.6,32.1,5.7,10.0,1.0


In [30]:
### Number of unique countries in the suicide_df dataframe.

suicide_df.country.nunique()

183

* So, in total we have 183 countries

In [31]:
suicide_df.year.unique()

array([2016, 2015, 2010, 2005, 2000])

* We have the suicide rate for the years : 2000, 2005, 2010, 2015, 2016

#### Exploring the statistics of the suicide rate dataset

#### Steps : 
    * As we are given data per country for the years 2000 to 2016, we have to find out whether there is an uptrend or a downtrend in the suicide rates.
    * Find out which statistic is most representative of the suicide rate and the death rate.
    * Add new columns that hold the percentage change from 2000 to 2016 (named `Prob_death_growth_rate` and `crude_suicide_growth_rate` further down), which represent the change in the death probability and the suicide rate.
    * After computing all the required data, we will collect it in a new dataframe with one row per country, holding the corresponding mean and trend data.

In [32]:
unique_country_list = suicide_df.country.unique()
unique_country_list

array(['Afghanistan', 'Albania', 'Algeria', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia (Plurinational State of)', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Brunei Darussalam', 'Bulgaria',
       'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo', 'Costa Rica', "Côte d'Ivoire",
       'Croatia', 'Cuba', 'Cyprus', 'Czechia',
       "Democratic People's Republic of Korea",
       'Democratic Republic of the Congo', 'Denmark', 'Djibouti',
       'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
       'Equatorial Guinea', 'Eritrea', 'Estonia', 'Eswatini', 'Ethiopia',
       'Fiji', 'Finland', 'France', 'Gabon', 'Gambia', 'Georgia',
       'Germany', 'Ghana', 'Gree

In [61]:
suicide_final_df = pd.DataFrame(unique_country_list, columns=["country"])
suicide_final_df.head()

Unnamed: 0,country
0,Afghanistan
1,Albania
2,Algeria
3,Angola
4,Antigua and Barbuda


In [43]:
### Finding the median and mean for each column for each country

for country in unique_country_list:
    print(f" Country: {country} Mean: {suicide_df[suicide_df.country==country].prob_death_bw_30_70_both.mean()}")
    print(f" Country: {country} Median: {suicide_df[suicide_df.country==country].prob_death_bw_30_70_both.median()}")

 Country: Afghanistan Mean: 31.96
 Country: Afghanistan Median: 31.7
 Country: Albania Mean: 18.24
 Country: Albania Median: 18.6
 Country: Algeria Mean: 16.54
 Country: Algeria Median: 15.4
 Country: Angola Mean: 19.36
 Country: Angola Median: 18.1
 Country: Antigua and Barbuda Mean: 22.360000000000003
 Country: Antigua and Barbuda Median: 22.6
 Country: Argentina Mean: 17.839999999999996
 Country: Argentina Median: 17.8
 Country: Armenia Mean: 25.099999999999998
 Country: Armenia Median: 25.9
 Country: Australia Mean: 10.58
 Country: Australia Median: 10.0
 Country: Austria Mean: 12.62
 Country: Austria Median: 12.2
 Country: Azerbaijan Mean: 25.56
 Country: Azerbaijan Median: 24.8
 Country: Bahamas Mean: 17.32
 Country: Bahamas Median: 16.9
 Country: Bahrain Mean: 15.680000000000001
 Country: Bahrain Median: 13.8
 Country: Bangladesh Mean: 21.74
 Country: Bangladesh Median: 21.7
 Country: Barbados Mean: 18.06
 Country: Barbados Median: 18.0
 Country: Belarus Mean: 29.580000000000002

 Country: Sweden Median: 10.4
 Country: Switzerland Mean: 10.1
 Country: Switzerland Median: 9.7
 Country: Syrian Arab Republic Mean: 22.1
 Country: Syrian Arab Republic Median: 22.1
 Country: Tajikistan Mean: 26.28
 Country: Tajikistan Median: 26.1
 Country: Thailand Mean: 16.360000000000003
 Country: Thailand Median: 15.8
 Country: The former Yugoslav republic of Macedonia Mean: 22.86
 Country: The former Yugoslav republic of Macedonia Median: 23.0
 Country: Timor-Leste Mean: 22.82
 Country: Timor-Leste Median: 23.2
 Country: Togo Mean: 24.32
 Country: Togo Median: 24.4
 Country: Tonga Mean: 24.580000000000002
 Country: Tonga Median: 24.7
 Country: Trinidad and Tobago Mean: 24.340000000000003
 Country: Trinidad and Tobago Median: 23.6
 Country: Tunisia Mean: 17.5
 Country: Tunisia Median: 17.5
 Country: Turkey Mean: 18.64
 Country: Turkey Median: 18.2
 Country: Turkmenistan Mean: 32.14
 Country: Turkmenistan Median: 31.0
 Country: Uganda Mean: 23.439999999999998
 Country: Uganda Medi

As there is not much difference between the mean and the median, each country's five observations are unlikely to contain strong outliers, so it is safe to take the mean as the representative statistic for each country's data.
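
Rather than comparing the printed values by eye, the gap between mean and median can be computed per country. This is a minimal sketch on toy data: country `A` uses the five Afghanistan values shown earlier, while country `B` is a made-up example containing one outlier.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "country": ["A"] * 5 + ["B"] * 5,
        # A: Afghanistan's prob_death_bw_30_70_both values; B: hypothetical.
        "value": [34.4, 34.1, 31.7, 29.8, 29.8, 10.0, 10.0, 10.0, 10.0, 20.0],
    }
)

# One row per country with both statistics and their absolute difference.
stats = toy.groupby("country")["value"].agg(["mean", "median"])
stats["abs_gap"] = (stats["mean"] - stats["median"]).abs()

# Countries with the largest gap are the ones where outliers matter most.
print(stats.sort_values("abs_gap", ascending=False))
```

A small maximum gap across all countries would support using the mean.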

In [69]:
### Taking the mean of the data from original table and assigning it across each country in the new table.

country_mean_prob_dict1 = {}
country_mean_prob_dict2 = {}
country_mean_prob_dict3 = {}
country_suicide_dict1 = {}
country_suicide_dict2 = {}
country_suicide_dict3 = {}
for country in unique_country_list:
    country_mean_prob_dict1[country] = round(suicide_df[suicide_df.country==country].prob_death_bw_30_70_both.mean(),2)
    country_mean_prob_dict2[country] = round(suicide_df[suicide_df.country==country].prob_death_bw_30_70_male.mean(),2)
    country_mean_prob_dict3[country] = round(suicide_df[suicide_df.country==country].prob_death_bw_30_70_female.mean(),2)
    country_suicide_dict1[country] = round(suicide_df[suicide_df.country==country].crude_suicide_rate_both_per_100000.mean(),2)
    country_suicide_dict2[country] = round(suicide_df[suicide_df.country==country].crude_suicide_rate_male_per_100000.mean(),2)
    country_suicide_dict3[country] = round(suicide_df[suicide_df.country==country].crude_suicide_rate_female_per_100000.mean(),2)

In [70]:
country_suicide_dict3

{'Afghanistan': 1.38,
 'Albania': 4.88,
 'Algeria': 1.98,
 'Angola': 3.12,
 'Antigua and Barbuda': 0.5,
 'Argentina': 3.56,
 'Armenia': 2.24,
 'Australia': 6.52,
 'Austria': 8.36,
 'Azerbaijan': 1.12,
 'Bahamas': 0.6,
 'Bahrain': 2.42,
 'Bangladesh': 7.76,
 'Barbados': 1.26,
 'Belarus': 11.36,
 'Belgium': 13.38,
 'Belize': 1.82,
 'Benin': 5.98,
 'Bhutan': 8.88,
 'Bolivia (Plurinational State of)': 10.38,
 'Bosnia and Herzegovina': 3.74,
 'Botswana': 4.7,
 'Brazil': 2.72,
 'Brunei Darussalam': 2.08,
 'Bulgaria': 6.62,
 'Burkina Faso': 4.98,
 'Burundi': 5.04,
 'Cabo Verde': 6.12,
 'Cambodia': 3.24,
 'Cameroon': 7.54,
 'Canada': 6.9,
 'Central African Republic': 4.6,
 'Chad': 7.56,
 'Chile': 3.94,
 'China': 12.38,
 'Colombia': 3.06,
 'Comoros': 3.42,
 'Congo': 4.62,
 'Costa Rica': 2.28,
 "Côte d'Ivoire": 6.86,
 'Croatia': 9.44,
 'Cuba': 6.8,
 'Cyprus': 2.12,
 'Czechia': 5.66,
 "Democratic People's Republic of Korea": 8.66,
 'Democratic Republic of the Congo': 3.18,
 'Denmark': 9.22,
 'Dji

In [71]:
# Each assignment maps a per-country dictionary onto the whole column at once, so no loop is needed.
suicide_final_df["m_pd_bw_30_70"] = suicide_final_df["country"].map(country_mean_prob_dict1)
suicide_final_df["m_pd_bw_30_70_male"] = suicide_final_df["country"].map(country_mean_prob_dict2)
suicide_final_df["m_pd_bw_30_70_female"] = suicide_final_df["country"].map(country_mean_prob_dict3)
suicide_final_df["m_cr_suicide_r"] = suicide_final_df["country"].map(country_suicide_dict1)
suicide_final_df["m_cr_suicide_r_male"] = suicide_final_df["country"].map(country_suicide_dict2)
suicide_final_df["m_cr_suicide_r_female"] = suicide_final_df["country"].map(country_suicide_dict3)
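
The dictionaries and `map` calls above can also be replaced by a single `groupby`. The sketch below is an alternative on toy data, not the notebook's own code; it uses the five Afghanistan values of `crude_suicide_rate_both_per_100000` shown earlier and reproduces the mean of 5.32.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "country": ["Afghanistan"] * 5,
        "year": [2016, 2015, 2010, 2005, 2000],
        "crude_suicide_rate_both_per_100000": [4.7, 4.8, 5.1, 6.3, 5.7],
    }
)

# Average every selected column per country, then shorten the column name.
means = (
    toy.groupby("country", sort=False)[["crude_suicide_rate_both_per_100000"]]
    .mean()
    .round(2)
    .rename(columns={"crude_suicide_rate_both_per_100000": "m_cr_suicide_r"})
    .reset_index()
)
print(means)
```

Adding the other five source columns to the selection list would produce all six mean columns in one pass.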

In [72]:
suicide_final_df.head()

Unnamed: 0,country,m_pd_bw_30_70,m_pd_bw_30_70_male,m_pd_bw_30_70_female,m_cr_suicide_r,m_cr_suicide_r_male,m_cr_suicide_r_female
0,Afghanistan,31.96,34.18,29.72,5.32,8.96,1.38
1,Albania,18.24,21.82,14.26,6.46,8.02,4.88
2,Algeria,16.54,17.8,15.24,3.52,5.06,1.98
3,Angola,19.36,19.82,19.02,6.1,9.22,3.12
4,Antigua and Barbuda,22.36,24.7,20.22,0.96,1.42,0.5


In [73]:
suicide_final_df.describe()

Unnamed: 0,m_pd_bw_30_70,m_pd_bw_30_70_male,m_pd_bw_30_70_female,m_cr_suicide_r,m_cr_suicide_r_male,m_cr_suicide_r_female
count,183.0,183.0,183.0,183.0,183.0,183.0
mean,20.584918,23.709727,17.639781,9.935628,14.800328,5.231694
std,5.619535,7.01967,5.563229,7.028739,11.885376,3.605993
min,9.66,12.24,6.44,0.96,1.42,0.5
25%,16.48,18.13,13.78,5.13,7.32,2.48
50%,20.78,22.72,17.72,8.08,11.6,4.16
75%,24.42,28.28,21.37,12.66,18.79,6.87
max,33.48,44.64,34.02,41.46,74.28,22.88


In [74]:
suicide_df.head()

Unnamed: 0,country,year,prob_death_bw_30_70_both,prob_death_bw_30_70_male,prob_death_bw_30_70_female,crude_suicide_rate_both_per_100000,crude_suicide_rate_male_per_100000,crude_suicide_rate_female_per_100000
1,Afghanistan,2016,29.8,31.8,27.7,4.7,7.6,1.5
2,Afghanistan,2015,29.8,31.9,27.8,4.8,7.8,1.5
3,Afghanistan,2010,31.7,34.1,29.4,5.1,8.6,1.4
4,Afghanistan,2005,34.1,36.5,31.6,6.3,10.8,1.5
5,Afghanistan,2000,34.4,36.6,32.1,5.7,10.0,1.0


In [98]:
### Calculation of growth rate for Probability of death and suicide rate on `prob_death_bw_30_70_both` and `crude_suicide_rate_both_per_100000` column of suicide_df dataset.
Prob_death_growth_rate_dict = {}
crude_suicide_growth_rate_dict = {}
for country in unique_country_list:
    # Select this country's 2000 and 2016 rows once instead of re-filtering for every value.
    rows_2000 = suicide_df[(suicide_df['country'] == country) & (suicide_df['year'] == 2000)]
    rows_2016 = suicide_df[(suicide_df['country'] == country) & (suicide_df['year'] == 2016)]

    # .iloc[0] extracts the scalar; calling float() directly on a one-row Series is deprecated in recent pandas.
    Prob_d_data_2000 = round(float(rows_2000.prob_death_bw_30_70_both.iloc[0]),2)
    Prob_d_data_2016 = round(float(rows_2016.prob_death_bw_30_70_both.iloc[0]),2)
    # Computed as (2000 - 2016) / 2000, so a positive result means the value fell over the period.
    Prob_death_growth_rate = round((((Prob_d_data_2000 - Prob_d_data_2016)/Prob_d_data_2000)*100),2)
    Prob_death_growth_rate_dict[country] = Prob_death_growth_rate

    cs_data_2000 = round(float(rows_2000.crude_suicide_rate_both_per_100000.iloc[0]),2)
    cs_data_2016 = round(float(rows_2016.crude_suicide_rate_both_per_100000.iloc[0]),2)
    cs_growth_rate = round((((cs_data_2000 - cs_data_2016)/cs_data_2000)*100),2)
    crude_suicide_growth_rate_dict[country] = cs_growth_rate

In [99]:
Prob_death_growth_rate_dict

{'Afghanistan': 13.37,
 'Albania': 12.37,
 'Algeria': 32.7,
 'Angola': 31.25,
 'Antigua and Barbuda': 3.42,
 'Argentina': 23.3,
 'Armenia': 19.78,
 'Australia': 30.53,
 'Austria': 25.49,
 'Azerbaijan': 24.23,
 'Bahamas': 22.89,
 'Bahrain': 51.91,
 'Bangladesh': -0.93,
 'Barbados': 22.86,
 'Belarus': 30.29,
 'Belgium': 25.0,
 'Belize': 15.65,
 'Benin': 2.0,
 'Bhutan': 24.35,
 'Bolivia (Plurinational State of)': 25.22,
 'Bosnia and Herzegovina': 27.05,
 'Botswana': 12.12,
 'Brazil': 31.97,
 'Brunei Darussalam': 19.02,
 'Bulgaria': 17.19,
 'Burkina Faso': 3.98,
 'Burundi': 2.55,
 'Cabo Verde': 17.7,
 'Cambodia': 17.25,
 'Cameroon': 11.84,
 'Canada': 31.94,
 'Central African Republic': 7.23,
 'Chad': 1.65,
 'Chile': 15.07,
 'China': 20.93,
 'Colombia': 23.3,
 'Comoros': 11.58,
 'Congo': 32.66,
 'Costa Rica': 19.01,
 "Côte d'Ivoire": -11.07,
 'Croatia': 27.07,
 'Cuba': 9.39,
 'Cyprus': 19.86,
 'Czechia': 33.63,
 "Democratic People's Republic of Korea": -8.47,
 'Democratic Republic of the Co

In [100]:
crude_suicide_growth_rate_dict

{'Afghanistan': 17.54,
 'Albania': -14.55,
 'Algeria': 21.95,
 'Angola': 40.51,
 'Antigua and Barbuda': 75.0,
 'Argentina': 3.16,
 'Armenia': -100.0,
 'Australia': 0.0,
 'Austria': 22.0,
 'Azerbaijan': -18.18,
 'Bahamas': 15.0,
 'Bahrain': 11.94,
 'Bangladesh': 11.94,
 'Barbados': 61.9,
 'Belarus': 39.21,
 'Belgium': 8.81,
 'Belize': 14.55,
 'Benin': -4.21,
 'Bhutan': 8.06,
 'Bolivia (Plurinational State of)': 25.61,
 'Bosnia and Herzegovina': 17.76,
 'Botswana': 21.19,
 'Brazil': -35.42,
 'Brunei Darussalam': -64.29,
 'Bulgaria': 37.84,
 'Burkina Faso': 6.1,
 'Burundi': -28.17,
 'Cabo Verde': -20.21,
 'Cambodia': 5.36,
 'Cameroon': 8.27,
 'Canada': 3.85,
 'Central African Republic': 12.5,
 'Chad': -10.0,
 'Chile': 2.75,
 'China': 26.52,
 'Colombia': 5.26,
 'Comoros': -33.33,
 'Congo': 41.0,
 'Costa Rica': -8.22,
 "Côte d'Ivoire": -52.63,
 'Croatia': 22.54,
 'Cuba': 20.11,
 'Cyprus': -140.91,
 'Czechia': 22.94,
 "Democratic People's Republic of Korea": -12.0,
 'Democratic Republic of t

In [101]:
# A single vectorized map per column is enough; looping over the countries only repeats the same assignment.
suicide_final_df["Prob_death_growth_rate"] = suicide_final_df["country"].map(Prob_death_growth_rate_dict)
suicide_final_df["crude_suicide_growth_rate"] = suicide_final_df["country"].map(crude_suicide_growth_rate_dict)

In [102]:
suicide_final_df.head()

Unnamed: 0,country,m_pd_bw_30_70,m_pd_bw_30_70_male,m_pd_bw_30_70_female,m_cr_suicide_r,m_cr_suicide_r_male,m_cr_suicide_r_female,Prob_death_growth_rate,crude_suicide_growth_rate
0,Afghanistan,31.96,34.18,29.72,5.32,8.96,1.38,13.37,17.54
1,Albania,18.24,21.82,14.26,6.46,8.02,4.88,12.37,-14.55
2,Algeria,16.54,17.8,15.24,3.52,5.06,1.98,32.7,21.95
3,Angola,19.36,19.82,19.02,6.1,9.22,3.12,31.25,40.51
4,Antigua and Barbuda,22.36,24.7,20.22,0.96,1.42,0.5,3.42,75.0


##### Our suicide dataset has been cleaned. The data dictionary of the new dataset is as follows:

    * m_pd_bw_30_70: Mean, over the available years, of the probability (%) of dying between age 30 and exact age 70 from any of cardiovascular disease, cancer, diabetes, or chronic respiratory disease (both sexes).
    * m_pd_bw_30_70_male: The same mean probability of death for males.
    * m_pd_bw_30_70_female: The same mean probability of death for females.
    * m_cr_suicide_r: Mean crude suicide rate per 100000 people.
    * m_cr_suicide_r_male: Mean crude suicide rate of males per 100000 males.
    * m_cr_suicide_r_female: Mean crude suicide rate of females per 100000 females.
    * Prob_death_growth_rate: Percentage change in `prob_death_bw_30_70_both` between 2000 and 2016, computed as (value in 2000 - value in 2016) / value in 2000 * 100.
    * crude_suicide_growth_rate: Percentage change in `crude_suicide_rate_both_per_100000` between 2000 and 2016, computed the same way.
    
Note: For `Prob_death_growth_rate` and `crude_suicide_growth_rate`, a positive number implies a downtrend (the value fell between 2000 and 2016) and a negative number implies an uptrend.
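
To make the sign convention concrete, here is a small worked example using Afghanistan's crude suicide rate from the table above (5.7 in 2000, 4.7 in 2016). It is a sketch of the same arithmetic written with `pivot`, not the notebook's own loop.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "country": ["Afghanistan", "Afghanistan"],
        "year": [2016, 2000],
        "crude_suicide_rate_both_per_100000": [4.7, 5.7],
    }
)

# One row per country, one column per year.
wide = toy.pivot(
    index="country",
    columns="year",
    values="crude_suicide_rate_both_per_100000",
)

# (2000 - 2016) / 2000 * 100: positive when the rate fell over the period.
change = ((wide[2000] - wide[2016]) / wide[2000] * 100).round(2)
print(change)
```

The rate fell from 5.7 to 4.7, so the result is positive (17.54), matching the value stored for Afghanistan.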

#### Mental Disorder dataset

In [103]:
mental_df = pd.read_csv("./Datasets/Mental disorders.csv")
mental_df.head()

Unnamed: 0,COUNTRY,DD_total_cases,DD_perc_of_pop,AD_total_cases,AD_perc_of_pop,DD_total_YLD,DD_perc_of_total_YLD,AD_total_YLD,AD_perc_of_total_YLD
0,,,,,,,,,
1,Algeria,1683914.0,"4,5%",1657172.0,"4,5%",302560.0,"8,1%",153227.0,"4,1%"
2,Angola,892128.0,"3,6%",675748.0,"2,8%",162164.0,"6,9%",62325.0,"2,7%"
3,Benin,411695.0,"3,9%",290713.0,"2,7%",74822.0,"8,0%",26960.0,"2,9%"
4,Botswana,102065.0,"4,7%",68954.0,"3,1%",18183.0,"7,2%",6290.0,"2,5%"


In [105]:
mental_df = mental_df.iloc[1:]

## Data Dictionary for mental disorder dataset

    * DD_total_cases: Mean of total cases of people with depressive disorder from 2000 to 2016.
    * DD_perc_of_pop: Percentage of the total population with depressive disorder.
    * AD_total_cases: Mean of total cases of people with anxiety disorder from 2000 to 2016.
    * AD_perc_of_pop: Percentage of the total population with anxiety disorder.

In [112]:
mental_df = mental_df.drop(["DD_total_YLD", "DD_perc_of_total_YLD", "AD_total_YLD", "AD_perc_of_total_YLD"], axis=1)

In [113]:
mental_df.head()

Unnamed: 0,COUNTRY,DD_total_cases,DD_perc_of_pop,AD_total_cases,AD_perc_of_pop
1,Algeria,1683914.0,"4,5%",1657172.0,"4,5%"
2,Angola,892128.0,"3,6%",675748.0,"2,8%"
3,Benin,411695.0,"3,9%",290713.0,"2,7%"
4,Botswana,102065.0,"4,7%",68954.0,"3,1%"
5,Burkina Faso,640502.0,"3,6%",471618.0,"2,7%"


In [114]:
mental_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194 entries, 1 to 194
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   COUNTRY         194 non-null    object 
 1   DD_total_cases  183 non-null    float64
 2   DD_perc_of_pop  183 non-null    object 
 3   AD_total_cases  183 non-null    float64
 4   AD_perc_of_pop  183 non-null    object 
dtypes: float64(2), object(3)
memory usage: 7.7+ KB


In [115]:
mental_df.shape

(194, 5)

So, there are 194 rows and 5 columns.

As we can see, the percentage columns are stored as strings with a decimal comma and a trailing percent sign (e.g. `4,5%`), so they have to be cleaned before they can be converted to `float`.

In [117]:
mental_df['DD_perc_of_pop'] = mental_df['DD_perc_of_pop'].str.replace(',', '.').str.rstrip('%').astype(float)

In [119]:
mental_df['AD_perc_of_pop'] = mental_df['AD_perc_of_pop'].str.replace(',', '.').str.rstrip('%').astype(float)
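
Since the same cleaning is applied to two columns, it can be wrapped in a helper. The sketch below is one possible way to do so; `parse_percent` is a hypothetical name, and the use of `pd.to_numeric` with `errors="coerce"` (instead of `astype(float)`) is an assumption made so that missing or malformed entries become `NaN`.

```python
import pandas as pd


def parse_percent(col):
    """Convert strings such as '4,5%' to floats such as 4.5."""
    cleaned = (
        col.str.strip()
        .str.rstrip("%")
        .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
    )
    # Missing or unparseable entries become NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")


sample = pd.Series(["4,5%", "3,6%", None])
parsed = parse_percent(sample)
print(parsed)
```

Applied to the dataframe, this could be used as `mental_df[col] = parse_percent(mental_df[col])` for each percentage column.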

In [121]:
mental_df.head()

Unnamed: 0,COUNTRY,DD_total_cases,DD_perc_of_pop,AD_total_cases,AD_perc_of_pop
1,Algeria,1683914.0,4.5,1657172.0,4.5
2,Angola,892128.0,3.6,675748.0,2.8
3,Benin,411695.0,3.9,290713.0,2.7
4,Botswana,102065.0,4.7,68954.0,3.1
5,Burkina Faso,640502.0,3.6,471618.0,2.7


In [123]:
mental_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194 entries, 1 to 194
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   COUNTRY         194 non-null    object 
 1   DD_total_cases  183 non-null    float64
 2   DD_perc_of_pop  183 non-null    float64
 3   AD_total_cases  183 non-null    float64
 4   AD_perc_of_pop  183 non-null    float64
dtypes: float64(4), object(1)
memory usage: 7.7+ KB


So, all the data columns have been converted into floating point.

Mental Data frame is cleaned.

### Harmonizing country names before merging.

In [124]:
mental_df["COUNTRY"] = mental_df["COUNTRY"].str.lower()

In [125]:
suicide_final_df["country"] = suicide_final_df["country"].str.lower()

In [126]:
mental_df.head()

Unnamed: 0,COUNTRY,DD_total_cases,DD_perc_of_pop,AD_total_cases,AD_perc_of_pop
1,algeria,1683914.0,4.5,1657172.0,4.5
2,angola,892128.0,3.6,675748.0,2.8
3,benin,411695.0,3.9,290713.0,2.7
4,botswana,102065.0,4.7,68954.0,3.1
5,burkina faso,640502.0,3.6,471618.0,2.7


In [127]:
suicide_final_df.head()

Unnamed: 0,country,m_pd_bw_30_70,m_pd_bw_30_70_male,m_pd_bw_30_70_female,m_cr_suicide_r,m_cr_suicide_r_male,m_cr_suicide_r_female,Prob_death_growth_rate,crude_suicide_growth_rate
0,afghanistan,31.96,34.18,29.72,5.32,8.96,1.38,13.37,17.54
1,albania,18.24,21.82,14.26,6.46,8.02,4.88,12.37,-14.55
2,algeria,16.54,17.8,15.24,3.52,5.06,1.98,32.7,21.95
3,angola,19.36,19.82,19.02,6.1,9.22,3.12,31.25,40.51
4,antigua and barbuda,22.36,24.7,20.22,0.96,1.42,0.5,3.42,75.0


In [128]:
## Rows in suicide dataframe but not in mental disorder dataframe.

suicide_final_df.loc[~suicide_final_df["country"].isin(mental_df["COUNTRY"]), :]

Unnamed: 0,country,m_pd_bw_30_70,m_pd_bw_30_70_male,m_pd_bw_30_70_female,m_cr_suicide_r,m_cr_suicide_r_male,m_cr_suicide_r_female,Prob_death_growth_rate,crude_suicide_growth_rate
19,bolivia (plurinational state of),19.44,19.82,19.08,14.08,17.72,10.38,25.22,25.61
20,bosnia and herzegovina,20.82,26.52,15.56,9.14,14.72,3.74,27.05,17.76
27,cabo verde,18.64,20.18,17.56,10.58,15.16,6.12,17.7,-20.21
43,czechia,18.14,24.3,12.32,15.14,25.02,5.66,33.63,22.94
44,democratic people's republic of korea,25.76,34.56,18.18,11.08,13.66,8.66,-8.47,-12.0
45,democratic republic of the congo,20.58,20.54,20.66,5.9,8.64,3.18,14.54,13.64
55,eswatini,26.88,29.82,24.62,12.38,18.68,6.48,1.84,-14.66
90,lao people's democratic republic,28.02,29.62,26.6,9.1,12.0,6.22,7.53,14.85
95,libya,21.02,24.7,17.28,5.14,7.74,2.4,11.45,1.89
107,micronesia (federated states of),26.72,29.56,23.92,11.26,15.48,6.92,4.74,-2.78


In [129]:
## Rows in mental disorder dataframe but not in suicide dataframe.

mental_df.loc[~mental_df["COUNTRY"].isin(suicide_final_df["country"]), :]

Unnamed: 0,COUNTRY,DD_total_cases,DD_perc_of_pop,AD_total_cases,AD_perc_of_pop
8,cape verde,24240.0,4.9,15175.0,3.1
14,democratic republic,2871309.0,3.8,2113267.0,2.8
15,of the congo,,,,
43,swaziland,53223.0,4.2,37984.0,3.0
46,united republic of,2138939.0,4.1,1551036.0,3.0
47,tanzania,,,,
55,bolivia (plurinational,453716.0,4.4,565857.0,5.4
56,state of),,,,
78,saint vincent and the,5144.0,4.9,6187.0,5.8
79,grenadines,,,,


In [130]:
mental_df = mental_df.dropna()

In [131]:
mental_df.loc[~mental_df["COUNTRY"].isin(suicide_final_df["country"]), :]

Unnamed: 0,COUNTRY,DD_total_cases,DD_perc_of_pop,AD_total_cases,AD_perc_of_pop
8,cape verde,24240.0,4.9,15175.0,3.1
14,democratic republic,2871309.0,3.8,2113267.0,2.8
43,swaziland,53223.0,4.2,37984.0,3.0
46,united republic of,2138939.0,4.1,1551036.0,3.0
55,bolivia (plurinational,453716.0,4.4,565857.0,5.4
78,saint vincent and the,5144.0,4.9,6187.0,5.8
82,united states of,17491047.0,5.9,18711966.0,6.3
85,venezuela (bolivarian,1270099.0,4.2,1322024.0,4.4
96,libyan arab jamahiriya,265833.0,4.5,265210.0,4.5
114,bosnia and,185557.0,5.1,140314.0,3.8


In [132]:
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("democratic republic", "democratic republic of the congo")
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("united republic of", "united republic of tanzania")
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("bolivia (plurinational", "bolivia (plurinational state of)")
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("saint vincent and the", "saint vincent and the grenadines")
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("united states of", "united states of america")
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("venezuela (bolivarian", "venezuela (bolivarian republic of)")
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("the former yugoslav", "the former yugoslav republic of macedonia")
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("bosnia and", "bosnia and herzegovina")
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("lao people's", "lao people's democratic republic")
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("micronesia (federated", "micronesia (federated states of)")

In [135]:
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("democratic people's", "democratic people's republic of korea")
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("libyan arab jamahiriya", "libya")
suicide_final_df["country"] = suicide_final_df["country"].replace("cabo verde", "cape verde")
suicide_final_df["country"] = suicide_final_df["country"].replace("united kingdom of great britain and northern ireland", "united kingdom")

In [138]:
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("swaziland", "eswatini")
mental_df["COUNTRY"] = mental_df["COUNTRY"].replace("czech republic", "czechia")
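
The individual `replace` calls above can be collected into one mapping, which keeps all name corrections in a single place. This is a minimal sketch on a toy Series using three of the corrections from the cells above; `name_fixes` is a hypothetical variable name.

```python
import pandas as pd

# Old name -> harmonized name, taken from the replacements above.
name_fixes = {
    "swaziland": "eswatini",
    "czech republic": "czechia",
    "libyan arab jamahiriya": "libya",
}

toy = pd.Series(["swaziland", "algeria", "czech republic"])

# Values that are not keys of the mapping are left unchanged.
harmonized = toy.replace(name_fixes)
print(harmonized.tolist())
```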

In [139]:
mental_df.loc[~mental_df["COUNTRY"].isin(suicide_final_df["country"]), :]

Unnamed: 0,COUNTRY,DD_total_cases,DD_perc_of_pop,AD_total_cases,AD_perc_of_pop


In [140]:
suicide_final_df.loc[~suicide_final_df["country"].isin(mental_df["COUNTRY"]), :]

Unnamed: 0,country,m_pd_bw_30_70,m_pd_bw_30_70_male,m_pd_bw_30_70_female,m_cr_suicide_r,m_cr_suicide_r_male,m_cr_suicide_r_female,Prob_death_growth_rate,crude_suicide_growth_rate


Now the country names in both datasets match, and the datasets are ready to be merged.

### Merging both dataframes

In [141]:
master = pd.merge(suicide_final_df, mental_df, how="inner", left_on="country", right_on="COUNTRY")
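
An inner merge silently drops any country that is missing from one side. As an optional sanity check, the sketch below runs an outer merge with `indicator=True` and `validate="one_to_one"` on toy frames to list unmatched countries and to guard against duplicate keys. The toy rows reuse values shown earlier; the mismatch between `libya` and `benin` is constructed for illustration.

```python
import pandas as pd

left = pd.DataFrame(
    {"country": ["algeria", "angola", "libya"], "m_cr_suicide_r": [3.52, 6.1, 5.14]}
)
right = pd.DataFrame(
    {"COUNTRY": ["algeria", "angola", "benin"], "DD_perc_of_pop": [4.5, 3.6, 3.9]}
)

# validate raises MergeError if a country appears more than once on either side;
# indicator adds a _merge column naming the side(s) each row came from.
check = pd.merge(
    left,
    right,
    how="outer",
    left_on="country",
    right_on="COUNTRY",
    validate="one_to_one",
    indicator=True,
)

unmatched = check[check["_merge"] != "both"]
print(unmatched[["country", "COUNTRY", "_merge"]])
```

An empty `unmatched` frame would confirm that the inner merge loses no countries.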

In [142]:
master.head()

Unnamed: 0,country,m_pd_bw_30_70,m_pd_bw_30_70_male,m_pd_bw_30_70_female,m_cr_suicide_r,m_cr_suicide_r_male,m_cr_suicide_r_female,Prob_death_growth_rate,crude_suicide_growth_rate,COUNTRY,DD_total_cases,DD_perc_of_pop,AD_total_cases,AD_perc_of_pop
0,afghanistan,31.96,34.18,29.72,5.32,8.96,1.38,13.37,17.54,afghanistan,1038610.0,3.3,1238880.0,4.0
1,albania,18.24,21.82,14.26,6.46,8.02,4.88,12.37,-14.55,albania,131048.0,4.8,104925.0,3.8
2,algeria,16.54,17.8,15.24,3.52,5.06,1.98,32.7,21.95,algeria,1683914.0,4.5,1657172.0,4.5
3,angola,19.36,19.82,19.02,6.1,9.22,3.12,31.25,40.51,angola,892128.0,3.6,675748.0,2.8
4,antigua and barbuda,22.36,24.7,20.22,0.96,1.42,0.5,3.42,75.0,antigua and barbuda,4424.0,5.1,5327.0,6.1


In [143]:
master = master.drop("COUNTRY", axis=1)

In [144]:
master.head()

Unnamed: 0,country,m_pd_bw_30_70,m_pd_bw_30_70_male,m_pd_bw_30_70_female,m_cr_suicide_r,m_cr_suicide_r_male,m_cr_suicide_r_female,Prob_death_growth_rate,crude_suicide_growth_rate,DD_total_cases,DD_perc_of_pop,AD_total_cases,AD_perc_of_pop
0,afghanistan,31.96,34.18,29.72,5.32,8.96,1.38,13.37,17.54,1038610.0,3.3,1238880.0,4.0
1,albania,18.24,21.82,14.26,6.46,8.02,4.88,12.37,-14.55,131048.0,4.8,104925.0,3.8
2,algeria,16.54,17.8,15.24,3.52,5.06,1.98,32.7,21.95,1683914.0,4.5,1657172.0,4.5
3,angola,19.36,19.82,19.02,6.1,9.22,3.12,31.25,40.51,892128.0,3.6,675748.0,2.8
4,antigua and barbuda,22.36,24.7,20.22,0.96,1.42,0.5,3.42,75.0,4424.0,5.1,5327.0,6.1


In [145]:
master.shape

(183, 13)

In [146]:
master.to_csv("./Datasets/master_suicide_MH.csv", index=False)