# **Suicide Rates Data Analysis:**

> **Source:** __https://www.kaggle.com/datasets/russellyates88/suicide-rates-overview-1985-to-2016__

In [95]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

In [None]:
# List all the plot styles available in matplotlib (including the seaborn-derived ones):
plt.style.available

In [None]:
# Set the style
plt.style.use('seaborn-v0_8-bright')
# Set themes
sns.set_theme()

In [96]:
data= pd.read_csv("/home/russ/Desktop/Mission-Project/00_DataSets/24_Suicide_Rates.csv")

In [97]:
data.shape

(27820, 12)

In [98]:
data.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


In [99]:
data.dtypes

country                object
year                    int64
sex                    object
age                    object
suicides_no             int64
population              int64
suicides/100k pop     float64
country-year           object
HDI for year          float64
 gdp_for_year ($)      object
gdp_per_capita ($)      int64
generation             object
dtype: object

In [100]:
data.columns

Index(['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicides/100k pop', 'country-year', 'HDI for year',
       ' gdp_for_year ($) ', 'gdp_per_capita ($)', 'generation'],
      dtype='object')


| **Column Name:**              | **Description:** |
|--------------------------|-------------|
| `country`                | The name of the country where the data point was recorded. |
| `year`                   | The year the data was recorded (ranging from 1985 to 2016). |
| `sex`                    | The gender of the individuals (values: `male` or `female`). |
| `age`                    | The age group of the individuals (e.g., `15-24 years`, `35-54 years`, `75+ years`, etc.). |
| `suicides_no`            | The **total number of suicides** reported in that country, year, gender, and age group. |
| `population`             | The **total population** of that specific demographic group (country, year, sex, and age group). |
| `suicides/100k pop`      | The **suicide rate** per 100,000 people in the demographic group. This is a normalized value to allow for comparisons between populations of different sizes. |
| `country-year`           | A combined field that concatenates the country name and the year (e.g., `Albania1987` in the raw data). It identifies a country-year pair, not a single row. |
| `HDI for year`           | The **Human Development Index (HDI)** for the country in that particular year. It is a composite index measuring average achievement in key dimensions of human development: life expectancy, education, and per capita income.  |
| ` gdp_for_year ($) `     | The **total GDP (Gross Domestic Product)** of the country for that year. |
| `gdp_per_capita ($)`     | The **GDP per capita**, i.e., GDP divided by the total population of the country for that year. |
| `generation`             | The **generation category** of the age group, such as `Generation Z`, `Millennials`, `Generation X`, etc. This adds a sociological dimension to the age data. |

In [101]:
data.duplicated().any()

np.False_

In [102]:
data.isnull().any()

country               False
year                  False
sex                   False
age                   False
suicides_no           False
population            False
suicides/100k pop     False
country-year          False
HDI for year           True
 gdp_for_year ($)     False
gdp_per_capita ($)    False
generation            False
dtype: bool

One column, `HDI for year`, has null values.

In [103]:
data["HDI for year"].isnull().sum()

np.int64(19456)

In [104]:
len(data)

27820

In [105]:
Percentage_of_null_values= data["HDI for year"].isnull().sum() / len(data) *100
print(Percentage_of_null_values)

69.93529834651329


So, the `"HDI for year"` column is almost 70% null. This column is very important for our analysis, but it cannot be used until the missing values are handled.
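
The same percentage can be computed for every column in one step, which makes it easy to spot any other sparse columns. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "HDI for year": [np.nan, 0.7, np.nan, np.nan],
})

# isnull() yields booleans; their column-wise mean is the fraction of nulls.
null_pct = toy.isnull().mean() * 100
print(null_pct.round(2))
```

On the real frame, `data.isnull().mean() * 100` would give the equivalent per-column view.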

In [106]:
# Remove $ signs from the column names (regex=False treats '$' as a literal character):
data.columns = data.columns.str.replace('$', '', regex=False)

# Remove commas from the column names:
data.columns = data.columns.str.replace(',', '', regex=False)

# Remove spaces from the column names:
data.columns = data.columns.str.replace(' ', '', regex=False)

# Remove periods from the column names (as a regex, '.' would match every character):
data.columns = data.columns.str.replace('.', '', regex=False)

In [107]:
data.columns

Index(['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicides/100kpop', 'country-year', 'HDIforyear', 'gdp_for_year()',
       'gdp_per_capita()', 'generation'],
      dtype='object')
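
The cleaned names still contain characters such as `/`, `-`, and `()`, so they cannot be used with attribute access. One possible alternative, shown only as a sketch, is an explicit rename mapping. The target names below are hypothetical and are not used in the rest of this notebook:

```python
import pandas as pd

# Empty frame carrying a few of the original column names listed above.
toy = pd.DataFrame(columns=[
    "suicides/100k pop", "country-year", "HDI for year",
    " gdp_for_year ($) ", "gdp_per_capita ($)",
])

# Hypothetical target names, chosen here for illustration only.
rename_map = {
    "suicides/100k pop": "suicides_per_100k",
    "country-year": "country_year",
    "HDI for year": "hdi_for_year",
    " gdp_for_year ($) ": "gdp_for_year",
    "gdp_per_capita ($)": "gdp_per_capita",
}
toy = toy.rename(columns=rename_map)
print(list(toy.columns))
```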

In [108]:
data.head(1)

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100kpop,country-year,HDIforyear,gdp_for_year(),gdp_per_capita(),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X


In [109]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   country           27820 non-null  object 
 1   year              27820 non-null  int64  
 2   sex               27820 non-null  object 
 3   age               27820 non-null  object 
 4   suicides_no       27820 non-null  int64  
 5   population        27820 non-null  int64  
 6   suicides/100kpop  27820 non-null  float64
 7   country-year      27820 non-null  object 
 8   HDIforyear        8364 non-null   float64
 9   gdp_for_year()    27820 non-null  object 
 10  gdp_per_capita()  27820 non-null  int64  
 11  generation        27820 non-null  object 
dtypes: float64(2), int64(4), object(6)
memory usage: 2.5+ MB


In [110]:
# Convert the year column to datetime:
#data['year'] = pd.to_datetime(data['year'])
# Left commented out: pd.to_datetime treats the integer years as nanosecond
# offsets from 1970-01-01, so every value collapses to almost the same instant.
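
If a datetime column is wanted later, passing an explicit format avoids the problem described in the comment above. This is a minimal sketch on a made-up series (`years` is an assumption, not a column of `data`):

```python
import pandas as pd

years = pd.Series([1987, 1995, 2014])

# Without a format, integers are read as nanoseconds since 1970-01-01.
naive = pd.to_datetime(years)

# With an explicit format, each value is parsed as a calendar year.
parsed = pd.to_datetime(years.astype(str), format="%Y")
print(parsed.dt.year.tolist())
```

For grouping and plotting by year, the plain integer column is usually sufficient, which is the choice this notebook makes.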

In [111]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   country           27820 non-null  object 
 1   year              27820 non-null  int64  
 2   sex               27820 non-null  object 
 3   age               27820 non-null  object 
 4   suicides_no       27820 non-null  int64  
 5   population        27820 non-null  int64  
 6   suicides/100kpop  27820 non-null  float64
 7   country-year      27820 non-null  object 
 8   HDIforyear        8364 non-null   float64
 9   gdp_for_year()    27820 non-null  object 
 10  gdp_per_capita()  27820 non-null  int64  
 11  generation        27820 non-null  object 
dtypes: float64(2), int64(4), object(6)
memory usage: 2.5+ MB


In [112]:
# Remove the repeated word "years" (and the space before it) from the age column:
data["age"] = data["age"].str.replace(" years", "", regex=False)

In [113]:
data.head(2)

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100kpop,country-year,HDIforyear,gdp_for_year(),gdp_per_capita(),generation
0,Albania,1987,male,15-24,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54,16,308000,5.19,Albania1987,,2156624900,796,Silent


In [114]:
# Remove all commas from the gdp_for_year() column:
data["gdp_for_year()"] = data["gdp_for_year()"].str.replace(",", "", regex=False)

In [115]:
data.head(2)

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100kpop,country-year,HDIforyear,gdp_for_year(),gdp_per_capita(),generation
0,Albania,1987,male,15-24,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54,16,308000,5.19,Albania1987,,2156624900,796,Silent


In [116]:
data["generation"].nunique()

6

#### **Handling Missing Values:**

In [117]:
data.columns

Index(['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicides/100kpop', 'country-year', 'HDIforyear', 'gdp_for_year()',
       'gdp_per_capita()', 'generation'],
      dtype='object')

In [118]:
data["country"].nunique()

101

In [119]:
# Define a dictionary mapping countries to their continents
country_to_continent = {
    'Albania': 'Europe', 'Antigua and Barbuda': 'North America', 'Argentina': 'South America',
    'Armenia': 'Asia', 'Aruba': 'North America', 'Australia': 'Oceania',
    'Austria': 'Europe', 'Azerbaijan': 'Asia', 'Bahamas': 'North America',
    'Bahrain': 'Asia', 'Barbados': 'North America', 'Belarus': 'Europe',
    'Belgium': 'Europe', 'Belize': 'North America', 'Bosnia and Herzegovina': 'Europe',
    'Brazil': 'South America', 'Bulgaria': 'Europe', 'Cabo Verde': 'Africa',
    'Canada': 'North America', 'Chile': 'South America', 'Colombia': 'South America',
    'Costa Rica': 'North America', 'Croatia': 'Europe', 'Cuba': 'North America',
    'Cyprus': 'Europe', 'Czech Republic': 'Europe', 'Denmark': 'Europe',
    'Dominica': 'North America', 'Ecuador': 'South America', 'El Salvador': 'North America',
    'Estonia': 'Europe', 'Fiji': 'Oceania', 'Finland': 'Europe',
    'France': 'Europe', 'Georgia': 'Asia', 'Germany': 'Europe',
    'Greece': 'Europe', 'Grenada': 'North America', 'Guatemala': 'North America',
    'Guyana': 'South America', 'Hungary': 'Europe', 'Iceland': 'Europe',
    'Ireland': 'Europe', 'Israel': 'Asia', 'Italy': 'Europe',
    'Jamaica': 'North America', 'Japan': 'Asia', 'Kazakhstan': 'Asia',
    'Kiribati': 'Oceania', 'Kuwait': 'Asia', 'Kyrgyzstan': 'Asia',
    'Latvia': 'Europe', 'Lithuania': 'Europe', 'Luxembourg': 'Europe',
    'Macau': 'Asia', 'Maldives': 'Asia', 'Malta': 'Europe',
    'Mauritius': 'Africa', 'Mexico': 'North America', 'Mongolia': 'Asia',
    'Montenegro': 'Europe', 'Netherlands': 'Europe', 'New Zealand': 'Oceania',
    'Nicaragua': 'North America', 'Norway': 'Europe', 'Oman': 'Asia',
    'Panama': 'North America', 'Paraguay': 'South America', 'Philippines': 'Asia',
    'Poland': 'Europe', 'Portugal': 'Europe', 'Puerto Rico': 'North America',
    'Qatar': 'Asia', 'Republic of Korea': 'Asia', 'Romania': 'Europe',
    'Russian Federation': 'Europe', 'Saint Kitts and Nevis': 'North America',
    'Saint Lucia': 'North America', 'Saint Vincent and Grenadines': 'North America',
    'San Marino': 'Europe', 'Serbia': 'Europe', 'Seychelles': 'Africa',
    'Singapore': 'Asia', 'Slovakia': 'Europe', 'Slovenia': 'Europe',
    'South Africa': 'Africa', 'Spain': 'Europe', 'Sri Lanka': 'Asia',
    'Suriname': 'South America', 'Sweden': 'Europe', 'Switzerland': 'Europe',
    'Thailand': 'Asia', 'Trinidad and Tobago': 'North America', 'Turkey': 'Asia',
    'Turkmenistan': 'Asia', 'Ukraine': 'Europe', 'United Arab Emirates': 'Asia',
    'United Kingdom': 'Europe', 'United States': 'North America', 'Uruguay': 'South America',
    'Uzbekistan': 'Asia'
}

# Map the continent to a new column
data['continent'] = data['country'].map(country_to_continent)

# Verify the result
data.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100kpop,country-year,HDIforyear,gdp_for_year(),gdp_per_capita(),generation,continent
0,Albania,1987,male,15-24,21,312900,6.71,Albania1987,,2156624900,796,Generation X,Europe
1,Albania,1987,male,35-54,16,308000,5.19,Albania1987,,2156624900,796,Silent,Europe
2,Albania,1987,female,15-24,14,289700,4.83,Albania1987,,2156624900,796,Generation X,Europe
3,Albania,1987,male,75+,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation,Europe
4,Albania,1987,male,25-34,9,274300,3.28,Albania1987,,2156624900,796,Boomers,Europe
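
Because `map` silently returns `NaN` for any key missing from the dictionary, it can be worth checking for unmapped countries before relying on the new column. A minimal sketch, using a hypothetical two-entry mapping and a made-up country name:

```python
import pandas as pd

# Hypothetical mapping and data; "Atlantis" is deliberately absent from the mapping.
mapping = {"Albania": "Europe", "Japan": "Asia"}
toy = pd.DataFrame({"country": ["Albania", "Japan", "Atlantis"]})
toy["continent"] = toy["country"].map(mapping)

# Countries whose continent could not be resolved.
unmapped = toy.loc[toy["continent"].isna(), "country"].unique()
print(unmapped)
```

An empty result on the real frame would confirm that every country in `data` received a continent.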


In [120]:
# Group by continent and compute the mean of the HDIforyear column for each continent
continent_avg = data.groupby('continent')['HDIforyear'].mean()
continent_avg

continent
Africa           0.701609
Asia             0.748797
Europe           0.828088
North America    0.730168
Oceania          0.865950
South America    0.704061
Name: HDIforyear, dtype: float64

In [121]:
# Replace missing values in the HDIforyear column with the continent average:

data["HDIforyear"] = data["HDIforyear"].fillna(data.groupby('continent')['HDIforyear'].transform('mean'))

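So, we have successfully filled the missing values with the average HDI of each country's continent. This is a better approach than filling them with the overall average, although every imputed row within a continent now shares the same value (for example, 0.828088 for all European rows that were missing).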
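
The group-wise fill can be illustrated on a small made-up frame. The values in `toy` are assumptions chosen so the result is easy to verify by hand:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "continent": ["Europe", "Europe", "Europe", "Asia", "Asia"],
    "HDIforyear": [0.8, np.nan, 0.9, 0.7, np.nan],
})

# transform("mean") returns a series aligned with the original rows,
# holding each row's group mean (NaN values are skipped when averaging).
group_mean = toy.groupby("continent")["HDIforyear"].transform("mean")

# fillna only touches the missing entries; observed values stay unchanged.
toy["HDIforyear"] = toy["HDIforyear"].fillna(group_mean)
print(toy)
```

Here the missing European row receives the mean of 0.8 and 0.9, and the missing Asian row receives 0.7.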

In [122]:
data.head(10)

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100kpop,country-year,HDIforyear,gdp_for_year(),gdp_per_capita(),generation,continent
0,Albania,1987,male,15-24,21,312900,6.71,Albania1987,0.828088,2156624900,796,Generation X,Europe
1,Albania,1987,male,35-54,16,308000,5.19,Albania1987,0.828088,2156624900,796,Silent,Europe
2,Albania,1987,female,15-24,14,289700,4.83,Albania1987,0.828088,2156624900,796,Generation X,Europe
3,Albania,1987,male,75+,1,21800,4.59,Albania1987,0.828088,2156624900,796,G.I. Generation,Europe
4,Albania,1987,male,25-34,9,274300,3.28,Albania1987,0.828088,2156624900,796,Boomers,Europe
5,Albania,1987,female,75+,1,35600,2.81,Albania1987,0.828088,2156624900,796,G.I. Generation,Europe
6,Albania,1987,female,35-54,6,278800,2.15,Albania1987,0.828088,2156624900,796,Silent,Europe
7,Albania,1987,female,25-34,4,257200,1.56,Albania1987,0.828088,2156624900,796,Boomers,Europe
8,Albania,1987,male,55-74,1,137500,0.73,Albania1987,0.828088,2156624900,796,G.I. Generation,Europe
9,Albania,1987,female,5-14,0,311000,0.0,Albania1987,0.828088,2156624900,796,Generation X,Europe


In [123]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   country           27820 non-null  object 
 1   year              27820 non-null  int64  
 2   sex               27820 non-null  object 
 3   age               27820 non-null  object 
 4   suicides_no       27820 non-null  int64  
 5   population        27820 non-null  int64  
 6   suicides/100kpop  27820 non-null  float64
 7   country-year      27820 non-null  object 
 8   HDIforyear        27820 non-null  float64
 9   gdp_for_year()    27820 non-null  object 
 10  gdp_per_capita()  27820 non-null  int64  
 11  generation        27820 non-null  object 
 12  continent         27820 non-null  object 
dtypes: float64(2), int64(4), object(7)
memory usage: 2.8+ MB


In [124]:
data["sex"].unique()

array(['male', 'female'], dtype=object)

In [132]:
data["gdp_for_year()"] = pd.to_numeric(data["gdp_for_year()"])
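
The comma removal and the numeric conversion can also be done in a single expression. The sketch below runs on a made-up series; the `"n/a"` entry and the use of `errors="coerce"` are assumptions added to show how unparseable values would surface as `NaN` instead of raising an error:

```python
import pandas as pd

# Made-up strings in the same comma-separated style as the raw GDP column.
gdp = pd.Series(["2,156,624,900", "63,067,077,179", "n/a"])

# Strip commas literally, then convert; bad values become NaN.
cleaned = pd.to_numeric(gdp.str.replace(",", "", regex=False), errors="coerce")
print(cleaned)
```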

`suicides/100kpop` should roughly equal `(suicides_no / population) * 100000`.

In [145]:
# `suicides/100kpop` should roughly equal `(suicides_no / population) * 100000`
col_1= data["suicides/100kpop"]
col_2= ((data["suicides_no"]/data["population"])*100000).round(2)
d= {"Given": col_1, "Calculated": col_2}
comparision= pd.DataFrame(d)
comparision

Unnamed: 0,Given,Calculated
0,6.71,6.71
1,5.19,5.19
2,4.83,4.83
3,4.59,4.59
4,3.28,3.28
...,...,...
27815,2.96,2.96
27816,2.58,2.58
27817,2.17,2.17
27818,1.67,1.67


Five rows show a slight difference between the `Given` and `Calculated` values. Each differs by exactly 0.01, which comes from rounding to 2 decimal places, so it can be ignored.

In [150]:
comparision[(comparision["Given"]!= comparision["Calculated"])]

Unnamed: 0,Given,Calculated
15355,40.63,40.62
15384,15.63,15.62
15942,15.63,15.62
18270,1.88,1.87
19746,3.13,3.12
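
An exact `!=` comparison is sensitive to rounding at the last digit. A tolerance-based check is one way to avoid flagging such rows. The sketch below is only an illustration: the third row and the tolerance of 0.006 (slightly more than half a rounding step) are assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# First two rows mirror the Albania rows above; the third is made up
# so that the exact rate (40.625) sits on a rounding boundary.
toy = pd.DataFrame({
    "suicides_no": [21, 16, 13],
    "population": [312900, 308000, 32000],
    "suicides/100kpop": [6.71, 5.19, 40.63],
})

# Unrounded rate per 100,000 people.
calculated = toy["suicides_no"] / toy["population"] * 100000

# Absolute tolerance only; rtol=0 disables the relative component.
within_tolerance = np.isclose(toy["suicides/100kpop"], calculated, atol=0.006, rtol=0)
print(within_tolerance.all())
```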


In [151]:
data.columns

Index(['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicides/100kpop', 'country-year', 'HDIforyear', 'gdp_for_year()',
       'gdp_per_capita()', 'generation', 'continent'],
      dtype='object')

In [152]:
data["country-year"].nunique()

2321

In [199]:
data["country-year"].sample(2)

21335    Saint Lucia-1989
17194    Netherlands-1995
Name: country-year, dtype: object

Separate country and year with `-` for ease of use (the sample above already shows hyphens because cell `In [199]` was executed after this change):

In [197]:
# Add a hyphen between the country name and the year in the 'country-year' column
data['country-year'] = data['country'] + '-' + data['year'].astype(str)
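
One benefit of the hyphen is that the combined field can be split back into its parts. A minimal sketch on made-up rows in the same format (`toy` is an assumption, not the real frame):

```python
import pandas as pd

toy = pd.DataFrame({"country-year": ["Saint Lucia-1989", "Netherlands-1995"]})

# Split once from the right, so a hyphen inside a country name would be kept.
parts = toy["country-year"].str.rsplit("-", n=1, expand=True)
toy["country"] = parts[0]
toy["year"] = parts[1].astype(int)
print(toy)
```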

In [198]:
data

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100kpop,country-year,HDIforyear,gdp_for_year(),gdp_per_capita(),generation,continent
0,Albania,1987,male,15-24,21,312900,6.71,Albania-1987,0.828088,2156624900,796,Generation X,Europe
1,Albania,1987,male,35-54,16,308000,5.19,Albania-1987,0.828088,2156624900,796,Silent,Europe
2,Albania,1987,female,15-24,14,289700,4.83,Albania-1987,0.828088,2156624900,796,Generation X,Europe
3,Albania,1987,male,75+,1,21800,4.59,Albania-1987,0.828088,2156624900,796,G.I. Generation,Europe
4,Albania,1987,male,25-34,9,274300,3.28,Albania-1987,0.828088,2156624900,796,Boomers,Europe
...,...,...,...,...,...,...,...,...,...,...,...,...,...
27815,Uzbekistan,2014,female,35-54,107,3620833,2.96,Uzbekistan-2014,0.675000,63067077179,2309,Generation X,Asia
27816,Uzbekistan,2014,female,75+,9,348465,2.58,Uzbekistan-2014,0.675000,63067077179,2309,Silent,Asia
27817,Uzbekistan,2014,male,5-14,60,2762158,2.17,Uzbekistan-2014,0.675000,63067077179,2309,Generation Z,Asia
27818,Uzbekistan,2014,female,5-14,44,2631600,1.67,Uzbekistan-2014,0.675000,63067077179,2309,Generation Z,Asia
