# 6.1 Cleaning and Descriptive Statistics

## Contents List:

- Import libraries and cause_of_deaths_raw_data.csv

- Check for missing values

- Check for duplicates

- Check for mixed-type data

- List data types

- Perform basic descriptive statistics

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import os

In [2]:
# Import cause_of_deaths_raw_data.csv

path = r'C:\Users\susan\OneDrive\Desktop\Data Analytics Program'

df_deaths_raw = pd.read_csv(os.path.join(path, 'cause_of_deaths_raw_data.csv'), index_col = False)

In [23]:
# Remove the display limit on columns so that every column is shown

pd.set_option('display.max_columns', None)

In [24]:
# Check structure

df_deaths_raw.head()

Unnamed: 0,Country/Territory,Code,Year,Meningitis,Alzheimer's Disease and Other Dementias,Parkinson's Disease,Nutritional Deficiencies,Malaria,Drowning,Interpersonal Violence,Maternal Disorders,HIV/AIDS,Drug Use Disorders,Tuberculosis,Cardiovascular Diseases,Lower Respiratory Infections,Neonatal Disorders,Alcohol Use Disorders,Self-harm,Exposure to Forces of Nature,Diarrheal Diseases,Environmental Heat and Cold Exposure,Neoplasms,Conflict and Terrorism,Diabetes Mellitus,Chronic Kidney Disease,Poisonings,Protein-Energy Malnutrition,Road Injuries,Chronic Respiratory Diseases,Cirrhosis and Other Chronic Liver Diseases,Digestive Diseases,"Fire, Heat, and Hot Substances",Acute Hepatitis
0,Afghanistan,AFG,1990,2159,1116,371,2087,93,1370,1538,2655,34,93,4661,44899,23741,15612,72,696,0,4235,175,11580,1490,2108,3709,338,2054,4154,5945,2673,5005,323,2985
1,Afghanistan,AFG,1991,2218,1136,374,2153,189,1391,2001,2885,41,102,4743,45492,24504,17128,75,751,1347,4927,113,11796,3370,2120,3724,351,2119,4472,6050,2728,5120,332,3092
2,Afghanistan,AFG,1992,2475,1162,378,2441,239,1514,2299,3315,48,118,4976,46557,27404,20060,80,855,614,6123,38,12218,4344,2153,3776,386,2404,5106,6223,2830,5335,360,3325
3,Afghanistan,AFG,1993,2812,1187,384,2837,108,1687,2589,3671,56,132,5254,47951,31116,22335,85,943,225,8174,41,12634,4096,2195,3862,425,2797,5681,6445,2943,5568,396,3601
4,Afghanistan,AFG,1994,3027,1211,391,3081,211,1809,2849,3863,63,142,5470,49308,33390,23288,88,993,160,8215,44,12914,8959,2231,3932,451,3038,6001,6664,3027,5739,420,3816


In [7]:
# Check shape

df_deaths_raw.shape

(6120, 34)

## Check for missing values

In [8]:
df_deaths_raw.isnull().sum()

Country/Territory                             0
Code                                          0
Year                                          0
Meningitis                                    0
Alzheimer's Disease and Other Dementias       0
Parkinson's Disease                           0
Nutritional Deficiencies                      0
Malaria                                       0
Drowning                                      0
Interpersonal Violence                        0
Maternal Disorders                            0
HIV/AIDS                                      0
Drug Use Disorders                            0
Tuberculosis                                  0
Cardiovascular Diseases                       0
Lower Respiratory Infections                  0
Neonatal Disorders                            0
Alcohol Use Disorders                         0
Self-harm                                     0
Exposure to Forces of Nature                  0
Diarrheal Diseases                      

### There are 0 missing values in the dataset.
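As a cross-check on the per-column counts, the result can be collapsed into one total and a list of affected columns. This is a minimal sketch on a made-up frame: `df_example` and its values are hypothetical and stand in for `df_deaths_raw`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_deaths_raw (values are invented)
df_example = pd.DataFrame({
    "Country/Territory": ["A", "B", "C"],
    "Year": [1990, 1991, 1992],
    "Meningitis": [10, np.nan, 30],
})

# Per-column counts of missing values, as in the cell above
missing_per_column = df_example.isnull().sum()

# Collapse to a single total; 0 would mean no missing values anywhere
total_missing = int(missing_per_column.sum())

# Keep only the columns that contain at least one missing value
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

print(total_missing)
print(columns_with_missing)
```

Applied to the real data, the same two expressions would be expected to give a total of 0 and an empty list, matching the note above.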

## Check for duplicates

In [9]:
# Look for full duplicates (rows where every column matches an earlier row)

df_dups = df_deaths_raw[df_deaths_raw.duplicated()]

In [25]:
df_dups

Unnamed: 0,Country/Territory,Code,Year,Meningitis,Alzheimer's Disease and Other Dementias,Parkinson's Disease,Nutritional Deficiencies,Malaria,Drowning,Interpersonal Violence,Maternal Disorders,HIV/AIDS,Drug Use Disorders,Tuberculosis,Cardiovascular Diseases,Lower Respiratory Infections,Neonatal Disorders,Alcohol Use Disorders,Self-harm,Exposure to Forces of Nature,Diarrheal Diseases,Environmental Heat and Cold Exposure,Neoplasms,Conflict and Terrorism,Diabetes Mellitus,Chronic Kidney Disease,Poisonings,Protein-Energy Malnutrition,Road Injuries,Chronic Respiratory Diseases,Cirrhosis and Other Chronic Liver Diseases,Digestive Diseases,"Fire, Heat, and Hot Substances",Acute Hepatitis


### There are no duplicates in this dataset.
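`duplicated()` with no arguments only flags rows that match on every column. For country-by-year data it may also be worth checking whether any country and year pair appears more than once. The sketch below shows both checks on a made-up frame; `df_example` is hypothetical, and treating `Country/Territory` plus `Year` as the key is an assumption about how the dataset is organized.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_deaths_raw (values are invented)
df_example = pd.DataFrame({
    "Country/Territory": ["A", "A", "B", "B"],
    "Year": [1990, 1991, 1990, 1990],
    "Malaria": [5, 6, 7, 8],
})

# Full duplicates: every column must match an earlier row
full_dup_count = int(df_example.duplicated().sum())

# Key duplicates: same country and year, whatever the other columns hold
key_dup_count = int(
    df_example.duplicated(subset=["Country/Territory", "Year"]).sum()
)

print(full_dup_count, key_dup_count)
```

Here the two `B`/1990 rows differ in `Malaria`, so they are not full duplicates but do count as one key duplicate.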

## Check for mixed-type data

In [11]:
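# Print the name of any column whose values do not all share the type of its first value
# (on pandas 2.1 and later, applymap still runs but emits a deprecation warning)
for col in df_deaths_raw.columns.tolist():
  weird = (df_deaths_raw[[col]].applymap(type) != df_deaths_raw[[col]].iloc[0].apply(type)).any(axis = 1)
  if len(df_deaths_raw[weird]) > 0:
    print(col)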

### There is no mixed-type data in this dataset.
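The loop above prints nothing when every column holds a single type. An alternative way to express the same check, shown here only as a sketch, is to count the distinct Python types in each column. `df_example` is a hypothetical frame with one deliberately mixed column, not the real dataset.

```python
import pandas as pd

# Hypothetical toy frame; "Code" deliberately mixes str and int values
df_example = pd.DataFrame({
    "Code": ["AFG", "ALB", 3],
    "Year": [1990, 1991, 1992],
})

# A column is mixed-type if its values map to more than one Python type
mixed_columns = [
    col for col in df_example.columns
    if df_example[col].map(type).nunique() > 1
]

print(mixed_columns)
```

This swaps `applymap` for `Series.map`, which is a different technique from the cell above, though it should flag the same columns.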

## List data types

In [18]:
df_deaths_raw.dtypes

Country/Territory                             object
Code                                          object
Year                                           int64
Meningitis                                     int64
Alzheimer's Disease and Other Dementias        int64
Parkinson's Disease                            int64
Nutritional Deficiencies                       int64
Malaria                                        int64
Drowning                                       int64
Interpersonal Violence                         int64
Maternal Disorders                             int64
HIV/AIDS                                       int64
Drug Use Disorders                             int64
Tuberculosis                                   int64
Cardiovascular Diseases                        int64
Lower Respiratory Infections                   int64
Neonatal Disorders                             int64
Alcohol Use Disorders                          int64
Self-harm                                     

### There are 2 columns with object data types and 32 columns with int64 data types.
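Rather than counting the rows of the `dtypes` output by hand, the tally can be computed directly. The following is a minimal sketch on a made-up four-column frame; `df_example` is hypothetical and much narrower than `df_deaths_raw`.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_deaths_raw (values are invented)
df_example = pd.DataFrame({
    "Country/Territory": ["A", "B"],
    "Code": ["AAA", "BBB"],
    "Year": [1990, 1991],
    "Malaria": [5, 6],
})

# Number of columns per data type
dtype_counts = df_example.dtypes.value_counts()

# Split column names into numeric and non-numeric groups
numeric_columns = df_example.select_dtypes(include="number").columns.tolist()
text_columns = df_example.select_dtypes(exclude="number").columns.tolist()

print(dtype_counts)
print(numeric_columns)
print(text_columns)
```

On the real data, the same calls would be expected to report 32 numeric columns and 2 non-numeric ones, in line with the note above.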

## Perform basic descriptive statistics

In [26]:
df_deaths_raw.describe()

Unnamed: 0,Year,Meningitis,Alzheimer's Disease and Other Dementias,Parkinson's Disease,Nutritional Deficiencies,Malaria,Drowning,Interpersonal Violence,Maternal Disorders,HIV/AIDS,Drug Use Disorders,Tuberculosis,Cardiovascular Diseases,Lower Respiratory Infections,Neonatal Disorders,Alcohol Use Disorders,Self-harm,Exposure to Forces of Nature,Diarrheal Diseases,Environmental Heat and Cold Exposure,Neoplasms,Conflict and Terrorism,Diabetes Mellitus,Chronic Kidney Disease,Poisonings,Protein-Energy Malnutrition,Road Injuries,Chronic Respiratory Diseases,Cirrhosis and Other Chronic Liver Diseases,Digestive Diseases,"Fire, Heat, and Hot Substances",Acute Hepatitis
count,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0,6120.0
mean,2004.5,1719.701307,4864.189379,1173.169118,2253.6,4140.960131,1683.33317,2083.797222,1262.589216,5941.898529,434.006699,7491.928595,73160.45,13687.914706,12558.942647,787.421242,3874.825327,243.485621,10822.8,292.295915,37542.24,538.243954,5138.704575,4724.13268,425.013399,1965.994281,5930.795588,17092.37,6124.072059,10725.267157,588.711438,618.429902
std,8.656149,6672.00693,18220.659072,4616.156238,10483.633601,18427.753137,8877.018366,6917.006075,6057.973183,21011.962487,2898.761628,39549.977578,291577.5,48031.720009,56058.366412,3545.823616,18425.616418,4717.104377,65416.17,1704.466356,161558.4,7033.308187,16773.08104,16470.429969,2022.640521,8255.999063,24097.784291,105157.2,20688.11858,37228.051096,2128.59512,4186.023497
min,1990.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,1997.0,15.0,90.0,27.0,9.0,0.0,34.0,40.0,5.0,11.0,3.0,35.0,2028.0,345.0,131.0,9.0,94.0,0.0,20.0,2.0,809.75,0.0,236.0,145.75,6.0,5.0,174.75,289.0,154.0,284.0,17.0,2.0
50%,2004.5,109.0,666.5,164.0,119.0,0.0,177.0,265.0,54.0,136.0,20.0,417.0,11742.0,2126.5,916.0,80.0,533.0,0.0,296.5,21.0,5629.5,0.0,1087.0,822.0,52.5,92.0,966.5,1689.0,1210.0,2185.0,126.0,15.0
75%,2012.0,847.25,2456.25,609.25,1167.25,393.0,698.0,877.0,734.0,1879.0,129.0,2924.25,42546.5,10161.25,7419.75,316.0,1882.25,12.0,3946.75,109.0,20147.75,23.0,2954.0,2922.5,254.0,1042.5,3435.25,5249.75,3547.25,6080.0,450.0,160.0
max,2019.0,98358.0,320715.0,76990.0,268223.0,280604.0,153773.0,69640.0,107929.0,305491.0,65717.0,657515.0,4584273.0,690913.0,852761.0,55200.0,220357.0,222641.0,1119477.0,29048.0,2716551.0,503532.0,273089.0,222922.0,30883.0,202241.0,329237.0,1366039.0,270037.0,464914.0,25876.0,64305.0


### Cardiovascular Diseases (mean 73,160) and Neoplasms (mean 37,542) have the highest mean values by far. For 27 of the 31 causes of death the minimum is 0; the exceptions are Cardiovascular Diseases (4), Neoplasms (1), Diabetes Mellitus (1), and Chronic Respiratory Diseases (1).
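Observations like these can be read off the `describe()` output programmatically instead of by scanning the wide table. The sketch below ranks causes by mean and lists those whose minimum is above 0. `df_example` and its numbers are hypothetical, chosen only to make the result easy to follow.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_deaths_raw (values are invented)
df_example = pd.DataFrame({
    "Year": [1990, 1991, 1992],
    "Malaria": [0, 10, 20],
    "Neoplasms": [1, 50, 99],
    "Cardiovascular Diseases": [4, 100, 196],
})

# Describe only the cause-of-death columns; Year is not a count of deaths
stats = df_example.drop(columns="Year").describe()

# Causes ordered by mean, highest first
mean_ranking = stats.loc["mean"].sort_values(ascending=False)

# Causes whose minimum is greater than 0
min_row = stats.loc["min"]
nonzero_min = min_row[min_row > 0].index.tolist()

print(mean_ranking)
print(nonzero_min)
```

Dropping `Year` before calling `describe()` is a choice made for this sketch; the cell above keeps it, which is harmless but mixes a calendar column in with the death counts.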