# Project: Analysis of Causes of Death in the Top 5 Most Populated Countries

## Introduction
This project explores the causes of death in the five most populated countries: China, India, the United States, Indonesia, and Pakistan. Using a dataset that spans 30 years (1990 to 2019), the analysis aims to uncover patterns and trends in death counts and to identify the leading causes of death in these nations.

## Objectives
- **Analyze and visualize** the total number of deaths over three decades in each country.
- **Identify** the top causes of death for each country and observe how these trends change over time.
- **Provide insights** that could inform policymakers and healthcare professionals in addressing critical public health concerns.

## Dataset
The dataset used for this analysis was sourced from [Kaggle](https://www.kaggle.com/datasets/iamsouravbanerjee/cause-of-deaths-around-the-world), a well-known platform for data science and analytics competitions. It records annual death counts for 31 causes of death across 204 countries and territories from 1990 to 2019, one row per country and year.

In [1]:
# Importing Libraries
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

In [2]:
# Loading Dataset
df = pd.read_csv(r"D:\Data_Science\Data_Set\cause_of_deaths.csv")  # raw string so the backslashes are not read as escape sequences
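
The absolute Windows path above only works on the author's machine. A minimal sketch of a more portable loader follows, assuming the CSV has been downloaded locally; the helper name `load_cause_of_deaths` and the `data/` folder in the usage comment are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_cause_of_deaths(csv_path):
    """Read the CSV at csv_path, failing early with a clear message if it is missing."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at {path}")
    return pd.read_csv(path)


# Hypothetical usage; adjust the location to wherever the Kaggle file was saved:
# df = load_cause_of_deaths(Path("data") / "cause_of_deaths.csv")
```

Building the path with `pathlib` avoids backslash escaping problems and keeps the notebook usable on other operating systems.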



## **Exploratory Data Analysis**

In [4]:
# Display the first few rows of the dataset
df.head()

Unnamed: 0,Country/Territory,Code,Year,Meningitis,Alzheimer's Disease and Other Dementias,Parkinson's Disease,Nutritional Deficiencies,Malaria,Drowning,Interpersonal Violence,...,Diabetes Mellitus,Chronic Kidney Disease,Poisonings,Protein-Energy Malnutrition,Road Injuries,Chronic Respiratory Diseases,Cirrhosis and Other Chronic Liver Diseases,Digestive Diseases,"Fire, Heat, and Hot Substances",Acute Hepatitis
0,Afghanistan,AFG,1990,2159,1116,371,2087,93,1370,1538,...,2108,3709,338,2054,4154,5945,2673,5005,323,2985
1,Afghanistan,AFG,1991,2218,1136,374,2153,189,1391,2001,...,2120,3724,351,2119,4472,6050,2728,5120,332,3092
2,Afghanistan,AFG,1992,2475,1162,378,2441,239,1514,2299,...,2153,3776,386,2404,5106,6223,2830,5335,360,3325
3,Afghanistan,AFG,1993,2812,1187,384,2837,108,1687,2589,...,2195,3862,425,2797,5681,6445,2943,5568,396,3601
4,Afghanistan,AFG,1994,3027,1211,391,3081,211,1809,2849,...,2231,3932,451,3038,6001,6664,3027,5739,420,3816


In [5]:
# Data Summary and Initial Exploration
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6120 entries, 0 to 6119
Data columns (total 34 columns):
 #   Column                                      Non-Null Count  Dtype 
---  ------                                      --------------  ----- 
 0   Country/Territory                           6120 non-null   object
 1   Code                                        6120 non-null   object
 2   Year                                        6120 non-null   int64 
 3   Meningitis                                  6120 non-null   int64 
 4   Alzheimer's Disease and Other Dementias     6120 non-null   int64 
 5   Parkinson's Disease                         6120 non-null   int64 
 6   Nutritional Deficiencies                    6120 non-null   int64 
 7   Malaria                                     6120 non-null   int64 
 8   Drowning                                    6120 non-null   int64 
 9   Interpersonal Violence                      6120 non-null   int64 
 10  Maternal Disorders      

In [6]:
# Statistical Description of Data
df.describe()

Unnamed: 0,Year,Meningitis,Alzheimer's Disease and Other Dementias,Parkinson's Disease,Nutritional Deficiencies,Malaria,Drowning,Interpersonal Violence,Maternal Disorders,HIV/AIDS,...,Diabetes Mellitus,Chronic Kidney Disease,Poisonings,Protein-Energy Malnutrition,Road Injuries,Chronic Respiratory Diseases,Cirrhosis and Other Chronic Liver Diseases,Digestive Diseases,"Fire, Heat, and Hot Substances",Acute Hepatitis
count,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,...,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0
mean,2004.5,1719.701307,4864.189379,1173.169118,2253.6,4140.960131,1683.33317,2083.797222,1262.589216,5941.898529,...,5138.704575,4724.13268,425.013399,1965.994281,5930.795588,17092.37,6124.072059,10725.267157,588.711438,618.429902
std,8.656149,6672.00693,18220.659072,4616.156238,10483.633601,18427.753137,8877.018366,6917.006075,6057.973183,21011.962487,...,16773.08104,16470.429969,2022.640521,8255.999063,24097.784291,105157.2,20688.11858,37228.051096,2128.59512,4186.023497
min,1990.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,1997.0,15.0,90.0,27.0,9.0,0.0,34.0,40.0,5.0,11.0,...,236.0,145.75,6.0,5.0,174.75,289.0,154.0,284.0,17.0,2.0
50%,2004.5,109.0,666.5,164.0,119.0,0.0,177.0,265.0,54.0,136.0,...,1087.0,822.0,52.5,92.0,966.5,1689.0,1210.0,2185.0,126.0,15.0
75%,2012.0,847.25,2456.25,609.25,1167.25,393.0,698.0,877.0,734.0,1879.0,...,2954.0,2922.5,254.0,1042.5,3435.25,5249.75,3547.25,6080.0,450.0,160.0
max,2019.0,98358.0,320715.0,76990.0,268223.0,280604.0,153773.0,69640.0,107929.0,305491.0,...,273089.0,222922.0,30883.0,202241.0,329237.0,1366039.0,270037.0,464914.0,25876.0,64305.0


In [7]:
print("\nShape of Data:", df.shape)  # Dimensions of the dataset


Shape of Data: (6120, 34)


### ***Data Cleaning***

In [8]:
# Checking Duplicated Values
df.duplicated().sum()

0

In [9]:
# Checking Missing Values
df.isnull().sum()

Country/Territory                             0
Code                                          0
Year                                          0
Meningitis                                    0
Alzheimer's Disease and Other Dementias       0
Parkinson's Disease                           0
Nutritional Deficiencies                      0
Malaria                                       0
Drowning                                      0
Interpersonal Violence                        0
Maternal Disorders                            0
HIV/AIDS                                      0
Drug Use Disorders                            0
Tuberculosis                                  0
Cardiovascular Diseases                       0
Lower Respiratory Infections                  0
Neonatal Disorders                            0
Alcohol Use Disorders                         0
Self-harm                                     0
Exposure to Forces of Nature                  0
Diarrheal Diseases                      
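
Duplicate and null checks do not catch every structural problem in a country-by-year table. The sketch below shows a few extra checks on toy data; the function name `check_panel_integrity` and the toy values are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def check_panel_integrity(df, id_col, year_col, value_cols):
    """Return simple counts of structural problems in a country-by-year table."""
    expected_years = df[year_col].nunique()
    years_per_id = df.groupby(id_col)[year_col].nunique()
    return {
        # The same country listed twice for one year
        "duplicate_id_year_rows": int(df.duplicated(subset=[id_col, year_col]).sum()),
        # Death counts can never be negative
        "negative_values": int((df[value_cols] < 0).sum().sum()),
        # Countries that do not cover every year present in the table
        "ids_with_missing_years": int((years_per_id < expected_years).sum()),
    }


# Toy data (made up): country B lacks 1991 and one count is negative
toy = pd.DataFrame({
    "Country/Territory": ["A", "A", "B"],
    "Year": [1990, 1991, 1990],
    "Malaria": [5, -1, 2],
})
report = check_panel_integrity(toy, "Country/Territory", "Year", ["Malaria"])
print(report)
```

On the real DataFrame, the same function could be called with `'Country/Territory'`, `'Year'`, and the list of cause columns.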

### ***Data Frame***

In [10]:
# Data types
df.dtypes

Country/Territory                             object
Code                                          object
Year                                           int64
Meningitis                                     int64
Alzheimer's Disease and Other Dementias        int64
Parkinson's Disease                            int64
Nutritional Deficiencies                       int64
Malaria                                        int64
Drowning                                       int64
Interpersonal Violence                         int64
Maternal Disorders                             int64
HIV/AIDS                                       int64
Drug Use Disorders                             int64
Tuberculosis                                   int64
Cardiovascular Diseases                        int64
Lower Respiratory Infections                   int64
Neonatal Disorders                             int64
Alcohol Use Disorders                          int64
Self-harm                                     

In [11]:
# Columns of Data Frame
df.columns

Index(['Country/Territory', 'Code', 'Year', 'Meningitis',
       "Alzheimer's Disease and Other Dementias", "Parkinson's Disease",
       'Nutritional Deficiencies', 'Malaria', 'Drowning',
       'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
       'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
       'Lower Respiratory Infections', 'Neonatal Disorders',
       'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
       'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
       'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
       'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
       'Road Injuries', 'Chronic Respiratory Diseases',
       'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
       'Fire, Heat, and Hot Substances', 'Acute Hepatitis'],
      dtype='object')

In [12]:
# Unique Years and Countries in Dataset
print("\nNumber of Unique Years:", df['Year'].nunique())
print("List of Years:", df['Year'].unique())
print("\nNumber of Unique Countries:", df['Country/Territory'].nunique())
print("Top Countries by Data Availability:")
print(df['Country/Territory'].value_counts().head())


Number of Unique Years: 30
List of Years: [1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
 2018 2019]

Number of Unique Countries: 204
Top Countries by Data Availability:
Country/Territory
Afghanistan         30
Papua New Guinea    30
Niue                30
North Korea         30
North Macedonia     30
Name: count, dtype: int64


In [13]:
# Correlation of Year with diseases that cause deaths
df.corr(numeric_only=True)['Year']

Year                                          1.000000
Meningitis                                   -0.043288
Alzheimer's Disease and Other Dementias       0.083710
Parkinson's Disease                           0.068756
Nutritional Deficiencies                     -0.078266
Malaria                                      -0.015964
Drowning                                     -0.040910
Interpersonal Violence                       -0.001122
Maternal Disorders                           -0.027460
HIV/AIDS                                      0.022964
Drug Use Disorders                            0.023917
Tuberculosis                                 -0.025297
Cardiovascular Diseases                       0.029813
Lower Respiratory Infections                 -0.027531
Neonatal Disorders                           -0.026949
Alcohol Use Disorders                         0.011315
Self-harm                                    -0.004192
Exposure to Forces of Nature                 -0.005178
Diarrheal 
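
The correlations above pool all countries together, so a large country and a small one contribute to the same coefficient, which can hide trends within individual countries. As a hedged alternative, the sketch below computes the correlation between year and one cause separately for each country. The helper name `year_correlation_by_country` and the toy numbers are assumptions made for illustration.

```python
import pandas as pd


def year_correlation_by_country(df, cause, country_col="Country/Territory", year_col="Year"):
    """Pearson correlation between year and one cause, computed per country."""
    return df.groupby(country_col)[[year_col, cause]].apply(
        lambda g: g[year_col].corr(g[cause])
    )


# Toy data (made up): deaths rise steadily in A and fall steadily in B
toy = pd.DataFrame({
    "Country/Territory": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "Malaria": [10, 20, 30, 300, 200, 100],
})
per_country = year_correlation_by_country(toy, "Malaria")
print(per_country)
```

In the toy data each country has a perfectly linear trend, which the per-country coefficients recover even though the two trends point in opposite directions.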

### ***Feature Engineering***

In [3]:
# Feature Engineering: Summing Causes of Death to Create 'Total_Deaths' Column
# List of Causes of Death Columns
cause_of_deaths = [ 'Meningitis',
       'Alzheimer\'s Disease and Other Dementias', 'Parkinson\'s Disease',
       'Nutritional Deficiencies', 'Malaria', 'Drowning',
       'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
       'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
       'Lower Respiratory Infections', 'Neonatal Disorders',
       'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
       'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
       'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
       'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
       'Road Injuries', 'Chronic Respiratory Diseases',
       'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
       'Fire, Heat, and Hot Substances', 'Acute Hepatitis']

In [4]:
# Create a new column 'Total_Deaths' as the sum of deaths across all causes
df['Total_Deaths'] = df[cause_of_deaths].sum(axis=1)
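
Typing out all cause columns by hand is error-prone. One possible alternative, sketched here on toy data, is to derive the list by excluding the identifier columns; the `id_cols` list and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy data (made up) with the same identifier columns as the real dataset
toy = pd.DataFrame({
    "Country/Territory": ["A", "B"],
    "Code": ["AAA", "BBB"],
    "Year": [1990, 1990],
    "Meningitis": [10, 1],
    "Malaria": [5, 2],
})

# Every column that is not an identifier is treated as a cause of death.
# Build this list before adding 'Total_Deaths', or the total would include itself.
id_cols = ["Country/Territory", "Code", "Year"]
cause_cols = [col for col in toy.columns if col not in id_cols]

# Row-wise sum across the cause columns
toy["Total_Deaths"] = toy[cause_cols].sum(axis=1)
print(toy)
```

Here `axis=1` sums across columns within each row, giving one total per country and year.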

## **Visualization**

##### *****Total Number of Deaths per Year in the 5 Most Populated Countries*****

In [7]:

# Filter for the 5 most populated countries and rank country-years by total deaths
top_5_countries = ["China", "India", "United States", "Indonesia", "Pakistan"]
filtered_data = df[df['Country/Territory'].isin(top_5_countries)]
total_no_deaths = filtered_data.sort_values(by='Total_Deaths', ascending=False)[['Country/Territory', 'Year', 'Total_Deaths']]
total_no_deaths

Unnamed: 0,Country/Territory,Year,Total_Deaths
1139,China,2019,10442561
1138,China,2018,10163943
1137,China,2017,9978653
1119,China,2016,9814213
1118,China,2015,9591222
...,...,...,...
4091,Pakistan,1994,1134724
4090,Pakistan,1993,1107986
4089,Pakistan,1992,1083237
4088,Pakistan,1991,1057939


##### *****Total Deaths and Patterns Over 30 Years*****

In [9]:
# Total deaths over 30 years in each of the 5 most populated countries
Death_over_30_Years = filtered_data.groupby('Country/Territory')['Total_Deaths'].sum().reset_index()
Death_over_30_Years

Unnamed: 0,Country/Territory,Total_Deaths
0,China,265408106
1,India,238158165
2,Indonesia,44046941
3,Pakistan,38151878
4,United States,71197802
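
To read these totals as proportions, each country's share of the combined deaths can be computed directly. The sketch below reuses the totals printed above; the column name `Share_Percent` is an assumption introduced for illustration. These are raw counts, not rates per capita, so the shares largely reflect population size.

```python
import pandas as pd

# 30-year totals copied from the output above
totals = pd.DataFrame({
    "Country/Territory": ["China", "India", "Indonesia", "Pakistan", "United States"],
    "Total_Deaths": [265408106, 238158165, 44046941, 38151878, 71197802],
})

# Share of the combined total, as a percentage rounded to two decimals
totals["Share_Percent"] = (
    100 * totals["Total_Deaths"] / totals["Total_Deaths"].sum()
).round(2)

print(totals.sort_values("Share_Percent", ascending=False))
```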


In [14]:
fig = px.pie(Death_over_30_Years, values='Total_Deaths', names='Country/Territory',
             title='Total Deaths Over 30 Years by Country',
             color_discrete_sequence=px.colors.sequential.RdBu) 
fig.show()

In [None]:
# Sort by year so each country's line is drawn in chronological order
fig = px.line(total_no_deaths.sort_values('Year'), x='Year', y='Total_Deaths', color='Country/Territory',
              title='Total Number of Deaths Over 30 Years in Top 5 Populated Countries',
              labels={'Total_Deaths': 'Total Deaths', 'Year': 'Year', 'Country/Territory': 'Country'},
              markers=True)  # Optional: adds markers at data points

fig.show()

##### *****Most Common Causes of Death in the 5 Most Populated Countries*****

In [18]:

# Step 1: Filter the data for the top 5 populated countries
top_5_countries = ["China", "India", "United States", "Indonesia", "Pakistan"]
filtered_data = df[df['Country/Territory'].isin(top_5_countries)]

# Step 2: Identify the top 10 causes of death for each country
top_diseases_dict = {}

# Sum each cause over all years, then keep the 10 largest totals per country
for country in top_5_countries:
    country_data = filtered_data[filtered_data['Country/Territory'] == country]
    top_diseases = country_data[cause_of_deaths].sum().nlargest(10).index.tolist()
    top_diseases_dict[country] = top_diseases

# Step 3: Create bar charts for each country
for country in top_5_countries:
    # Get the top diseases to visualize
    diseases_to_plot = top_diseases_dict[country]

    # Filter data for this country and the selected top diseases
    disease_data = filtered_data[filtered_data['Country/Territory'] == country][diseases_to_plot]

    # Calculate the total deaths by disease
    total_deaths_by_disease = disease_data.sum().reset_index()
    total_deaths_by_disease.columns = ['Disease', 'Total_Deaths']

    # Create the bar chart
    fig = px.bar(total_deaths_by_disease, x='Disease', y='Total_Deaths',
                  title=f'Total Deaths by Top 10 Diseases in {country}',
                  labels={'Total_Deaths': 'Total Deaths', 'Disease': 'Disease'},
                  color='Disease',
                  text='Total_Deaths')  # Optional: Show total deaths on the bars

    # Show the plot
    fig.show()
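
The ranking step used above (sum each cause over all years, then keep the largest totals) can be isolated in a small helper. This is a sketch on toy data; the function name `top_causes` and the toy values are assumptions, not part of the original notebook.

```python
import pandas as pd


def top_causes(df, country, cause_cols, n=10, country_col="Country/Territory"):
    """Sum each cause over all rows for one country and return the n largest totals."""
    country_rows = df[df[country_col] == country]
    return country_rows[cause_cols].sum().nlargest(n)


# Toy data (made up): two years for country A, one year for country B
toy = pd.DataFrame({
    "Country/Territory": ["A", "A", "B"],
    "Year": [1990, 1991, 1990],
    "Malaria": [5, 7, 100],
    "Drowning": [20, 30, 1],
    "Meningitis": [1, 1, 1],
})
ranking = top_causes(toy, "A", ["Malaria", "Drowning", "Meningitis"], n=2)
print(ranking)
```

Because `nlargest` returns the totals along with the cause names, the result can be passed to a bar chart without filtering and summing the data a second time.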


##### *****Death Trends in 5 Most Populated Countries*****

In [19]:
# Step 1: Filter the data for the top 5 populated countries
top_5_countries = ["China", "India", "United States", "Indonesia", "Pakistan"]
filtered_data = df[df['Country/Territory'].isin(top_5_countries)]

# Step 2: Identify the top 10 causes of death for each country
top_diseases_dict = {}

for country in top_5_countries:
    country_data = filtered_data[filtered_data['Country/Territory'] == country]
    top_diseases = country_data[cause_of_deaths].sum().nlargest(10).index.tolist()
    top_diseases_dict[country] = top_diseases

# Step 3: Create line charts for each disease over the years
for country in top_5_countries:
    diseases_to_plot = top_diseases_dict[country]

    # Prepare data for line chart
    line_data = filtered_data[filtered_data['Country/Territory'] == country][['Year'] + diseases_to_plot]

    # Reshape data for Plotly
    line_data = line_data.melt(id_vars='Year', value_vars=diseases_to_plot,
                                var_name='Disease', value_name='Total_Deaths')

    # Create line chart
    fig = px.line(line_data, x='Year', y='Total_Deaths', color='Disease',
                  title=f'Trend of Deaths for Top 10 Diseases in {country} Over Years',
                  labels={'Total_Deaths': 'Total Deaths', 'Year': 'Year'},
                  markers=True)  # Add markers for clarity

    # Show the plot
    fig.show()
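
The reshaping step matters because `px.line` with `color='Disease'` expects one row per year and disease, while the dataset stores each disease in its own column. The toy example below shows what `melt` does to such a table; the values are made up for illustration.

```python
import pandas as pd

# Wide format: one column per cause (toy values)
wide = pd.DataFrame({
    "Year": [1990, 1991],
    "Malaria": [5, 7],
    "Drowning": [20, 30],
})

# Long format: one row per (Year, Disease) pair
long = wide.melt(id_vars="Year", value_vars=["Malaria", "Drowning"],
                 var_name="Disease", value_name="Total_Deaths")
print(long)
```

The two rows and two cause columns of the wide table become four rows in the long table, with the former column names stored in the `Disease` column.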

### ****End of the Data Analysis. Thanks for reading.****