# IMPORTANT

In this notebook we explore the data for the whole DataFrame, that is, including every year for which data exist. \
Our analysis itself will focus on the year 2023, as shown in "Notebook_EDA_2", but we keep this full exploration in case we want to run a temporal analysis in the future.

The description of each column is given below:

- **Country**: The name of the country where the health data was recorded.
- **Year**: The year in which the data was collected.
- **Disease Name**: The name of the disease or health condition tracked.
- **Disease Category**: The category of the disease (e.g., Infectious, Non-Communicable).
- **Prevalence Rate (%)**: The percentage of the population affected by the disease.
- **Incidence Rate (%)**: The percentage of new or newly diagnosed cases.
- **Mortality Rate (%)**: The percentage of the affected population that dies from the disease.
- **Age Group**: The age range most affected by the disease.
- **Gender**: The gender(s) affected by the disease (Male, Female, Both).
- **Population Affected**: The total number of individuals affected by the disease.
- **Healthcare Access (%)**: The percentage of the population with access to healthcare.
- **Doctors per 1000**: The number of doctors per 1000 people.
- **Hospital Beds per 1000**: The number of hospital beds available per 1000 people.
- **Treatment Type**: The primary treatment method for the disease (e.g., Medication, Surgery).
- **Average Treatment Cost (USD)**: The average cost of treating the disease in USD.
- **Availability of Vaccines/Treatment**: Whether vaccines or treatments are available.
- **Recovery Rate (%)**: The percentage of people who recover from the disease.
- **DALYs**: Disability-Adjusted Life Years, a measure of disease burden.
- **Improvement in 5 Years (%)**: The improvement in disease outcomes over the last five years.
- **Per Capita Income (USD)**: The average income per person in the country.
- **Education Index**: The average level of education in the country.
- **Urbanization Rate (%)**: The percentage of the population living in urban areas.

In [None]:
# Import the libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import mannwhitneyu
from scipy import stats
from scipy.stats import chi2_contingency

In [6]:
# Load the CSV file into a DataFrame
df = pd.read_csv("../data/Global Health Statistics.csv")

## INITIAL EXPLORATION

In [12]:
# Display the DataFrame
pd.set_option("display.max_columns", None)  # Show every column
df

Unnamed: 0,Country,Year,Disease Name,Disease Category,Prevalence Rate (%),Incidence Rate (%),Mortality Rate (%),Age Group,Gender,Population Affected,Healthcare Access (%),Doctors per 1000,Hospital Beds per 1000,Treatment Type,Average Treatment Cost (USD),Availability of Vaccines/Treatment,Recovery Rate (%),DALYs,Improvement in 5 Years (%),Per Capita Income (USD),Education Index,Urbanization Rate (%)
0,Italy,2013,Malaria,Respiratory,0.95,1.55,8.42,0-18,Male,471007,57.74,3.34,7.58,Medication,21064,No,91.82,4493,2.16,16886,0.79,86.02
1,France,2002,Ebola,Parasitic,12.46,8.63,8.75,61+,Male,634318,89.21,1.33,5.11,Surgery,47851,Yes,76.65,2366,4.82,80639,0.74,45.52
2,Turkey,2015,COVID-19,Genetic,0.91,2.35,6.22,36-60,Male,154878,56.41,4.07,3.49,Vaccination,27834,Yes,98.55,41,5.81,12245,0.41,40.20
3,Indonesia,2011,Parkinson's Disease,Autoimmune,4.68,6.29,3.99,0-18,Other,446224,85.20,3.18,8.44,Surgery,144,Yes,67.35,3201,2.22,49336,0.49,58.47
4,Italy,2013,Tuberculosis,Genetic,0.83,13.59,7.01,61+,Male,472908,67.00,4.61,5.90,Medication,8908,Yes,50.06,2832,6.93,47701,0.50,48.14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,Saudi Arabia,2021,Parkinson's Disease,Infectious,4.56,4.83,9.65,0-18,Female,119332,88.78,1.98,4.23,Vaccination,4528,Yes,92.11,1024,3.88,29335,0.75,27.94
999996,Saudi Arabia,2013,Malaria,Respiratory,0.26,1.76,0.56,0-18,Female,354927,82.24,1.28,6.34,Surgery,20686,No,84.47,202,7.95,30752,0.47,77.66
999997,USA,2016,Zika,Respiratory,13.44,14.13,1.91,19-35,Other,807915,71.46,4.18,8.11,Therapy,18807,No,86.81,3338,7.31,62897,0.72,46.90
999998,Nigeria,2020,Asthma,Chronic,1.96,14.56,4.98,61+,Female,385896,57.10,2.61,6.91,Medication,21033,Yes,62.15,4806,3.82,98189,0.51,34.73


In [37]:
# Look at the unique values of each column to spot anything odd
for columna in df.columns:
    print(f"{columna}: {df[columna].sort_values().unique()}")
    print("--------------")

Country: ['Argentina' 'Australia' 'Brazil' 'Canada' 'China' 'France' 'Germany'
 'India' 'Indonesia' 'Italy' 'Japan' 'Mexico' 'Nigeria' 'Russia'
 'Saudi Arabia' 'South Africa' 'South Korea' 'Turkey' 'UK' 'USA']
--------------
Year: [2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024]
--------------
Disease Name: ["Alzheimer's Disease" 'Asthma' 'COVID-19' 'Cancer' 'Cholera' 'Dengue'
 'Diabetes' 'Ebola' 'HIV/AIDS' 'Hepatitis' 'Hypertension' 'Influenza'
 'Leprosy' 'Malaria' 'Measles' "Parkinson's Disease" 'Polio' 'Rabies'
 'Tuberculosis' 'Zika']
--------------
Disease Category: ['Autoimmune' 'Bacterial' 'Cardiovascular' 'Chronic' 'Genetic'
 'Infectious' 'Metabolic' 'Neurological' 'Parasitic' 'Respiratory' 'Viral']
--------------
Prevalence Rate (%): [ 0.1   0.11  0.12 ... 19.98 19.99 20.  ]
--------------
Incidence Rate (%): [ 0.1   0.11  0.12 ... 14.98 14.99 15.  ]
--------------
Mortality Rate (%): [ 0.1   0.11  0

This confirms that the columns contain no odd values or inconsistent labels (e.g. USA / EEUU / US for the same country).
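Had inconsistent labels turned up, they could be collapsed into a single canonical value. The sketch below is only illustrative: the toy DataFrame and the `canonical` mapping are assumptions, not part of the real dataset, where no such variants were found.

```python
import pandas as pd

# Toy data standing in for the real DataFrame (hypothetical values).
toy = pd.DataFrame({
    "Country": ["USA", "US", "EEUU", "Italy"],
    "Year": [2020, 2021, 2022, 2023],
})

# Hypothetical mapping from spelling variants to one canonical label.
canonical = {"US": "USA", "EEUU": "USA"}

# Summarise only the non-numeric columns: number of distinct labels and the labels.
text_cols = toy.select_dtypes(exclude="number").columns
for col in text_cols:
    print(f"{col}: {toy[col].nunique()} distinct -> {sorted(toy[col].unique())}")

# Collapse the variants; labels missing from the mapping are left unchanged.
toy["Country"] = toy["Country"].replace(canonical)
country_labels = sorted(toy["Country"].unique())
print(country_labels)
```

Restricting the summary to non-numeric columns keeps the output short, since label inconsistencies can only appear in text columns.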

In [14]:
# Summary statistics of the numeric columns
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,1000000.0,2011.996999,7.217287,2000.0,2006.0,2012.0,2018.0,2024.0
Prevalence Rate (%),1000000.0,10.047992,5.740189,0.1,5.09,10.04,15.01,20.0
Incidence Rate (%),1000000.0,7.555005,4.298947,0.1,3.84,7.55,11.28,15.0
Mortality Rate (%),1000000.0,5.049919,2.859427,0.1,2.58,5.05,7.53,10.0
Population Affected,1000000.0,500735.427363,288660.116648,1000.0,250491.25,501041.0,750782.0,1000000.0
Healthcare Access (%),1000000.0,74.987835,14.436345,50.0,62.47,75.0,87.49,100.0
Doctors per 1000,1000000.0,2.747929,1.299067,0.5,1.62,2.75,3.87,5.0
Hospital Beds per 1000,1000000.0,5.245931,2.742865,0.5,2.87,5.24,7.62,10.0
Average Treatment Cost (USD),1000000.0,25010.313665,14402.279227,100.0,12538.0,24980.0,37493.0,50000.0
Recovery Rate (%),1000000.0,74.496934,14.155168,50.0,62.22,74.47,86.78,99.0


At first glance, this suggests that the 15 numeric variables contain no null values: `count` only counts non-null entries, and it equals the number of rows (1,000,000).
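The same reasoning can be made explicit by comparing the `count` row of `describe()` with the number of rows. This is a minimal sketch on a toy DataFrame (hypothetical values, with one null injected on purpose), not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Toy data with one deliberately missing value (hypothetical).
toy = pd.DataFrame({
    "Year": [2000, 2001, 2002],
    "Recovery Rate (%)": [50.0, np.nan, 99.0],
})

# The "count" row of describe() counts non-null entries only.
counts = toy.describe().loc["count"]

# Columns whose count falls short of the number of rows contain nulls.
cols_with_nulls = counts[counts < len(toy)].index.tolist()
print(cols_with_nulls)
```

Note that `describe()` covers only numeric columns by default, so this check says nothing about the text columns.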

In [18]:
# Show more complete information about the variables
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 22 columns):
 #   Column                              Non-Null Count    Dtype  
---  ------                              --------------    -----  
 0   Country                             1000000 non-null  object 
 1   Year                                1000000 non-null  int64  
 2   Disease Name                        1000000 non-null  object 
 3   Disease Category                    1000000 non-null  object 
 4   Prevalence Rate (%)                 1000000 non-null  float64
 5   Incidence Rate (%)                  1000000 non-null  float64
 6   Mortality Rate (%)                  1000000 non-null  float64
 7   Age Group                           1000000 non-null  object 
 8   Gender                              1000000 non-null  object 
 9   Population Affected                 1000000 non-null  int64  
 10  Healthcare Access (%)               1000000 non-null  float64
 11  Doctors per 

Indeed, this confirms that none of the columns contains null values.

The column "Availability of Vaccines/Treatment" has dtype object, but it could really be treated as binary:
- Availability of Vaccines/Treatment: Yes/No
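One possible way to treat that column as binary is to map its two labels to booleans. The sketch below uses a toy DataFrame and assumes the only labels present are exactly `Yes` and `No`; it is an illustration, not a step taken in this notebook.

```python
import pandas as pd

col = "Availability of Vaccines/Treatment"

# Toy data standing in for the real column (hypothetical values).
toy = pd.DataFrame({col: ["Yes", "No", "Yes"]})

# Map the labels to booleans; any other label would turn into NaN.
mapped = toy[col].map({"Yes": True, "No": False})

# Fail early if an unexpected label slipped through.
if mapped.isna().any():
    raise ValueError("Found labels other than 'Yes'/'No'")

toy[col] = mapped.astype(bool)

# With a boolean column, the mean is the share of rows marked as available.
available_share = toy[col].mean()
print(toy[col].dtype, available_share)
```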

In [24]:
# Check whether there are duplicate rows
df.duplicated(keep = False).value_counts()

False    1000000
Name: count, dtype: int64

There are no duplicate rows either.
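The `keep` argument changes what `duplicated()` flags, which matters when duplicates do exist. The toy DataFrame below (hypothetical values) contains one repeated row to show the difference.

```python
import pandas as pd

# Toy data with one repeated row (hypothetical values).
toy = pd.DataFrame({
    "Country": ["Italy", "Italy", "France"],
    "Year": [2013, 2013, 2002],
})

# keep=False flags every row that belongs to a duplicated group.
n_involved = int(toy.duplicated(keep=False).sum())

# The default keep="first" flags only the repeats after the first occurrence.
n_extra = int(toy.duplicated().sum())

print(n_involved, n_extra)
```

With `keep=False`, as used above, a result of all `False` means no row is repeated anywhere in the DataFrame.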

In [26]:
# Check whether there are null values
for columna in df.columns:
    print(df[columna].isna().value_counts(dropna=False))
    print("--------------")

Country
False    1000000
Name: count, dtype: int64
--------------
Year
False    1000000
Name: count, dtype: int64
--------------
Disease Name
False    1000000
Name: count, dtype: int64
--------------
Disease Category
False    1000000
Name: count, dtype: int64
--------------
Prevalence Rate (%)
False    1000000
Name: count, dtype: int64
--------------
Incidence Rate (%)
False    1000000
Name: count, dtype: int64
--------------
Mortality Rate (%)
False    1000000
Name: count, dtype: int64
--------------
Age Group
False    1000000
Name: count, dtype: int64
--------------
Gender
False    1000000
Name: count, dtype: int64
--------------
Population Affected
False    1000000
Name: count, dtype: int64
--------------
Healthcare Access (%)
False    1000000
Name: count, dtype: int64
--------------
Doctors per 1000
False    1000000
Name: count, dtype: int64
--------------
Hospital Beds per 1000
False    1000000
Name: count, dtype: int64
--------------
Treatment Type
False    1000000
Name: count, d

There are no null values either.
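A more compact alternative to the per-column loop is `isna().sum()`, which returns one null count per column. The sketch below runs on a toy DataFrame (hypothetical values, with nulls injected on purpose) to show what a non-zero result looks like.

```python
import numpy as np
import pandas as pd

# Toy data with one null per column (hypothetical values).
toy = pd.DataFrame({
    "Country": ["Italy", None, "France"],
    "DALYs": [4493.0, 2366.0, np.nan],
})

# One null count per column, in a single Series.
nulls_per_column = toy.isna().sum()

# Total number of null cells across the whole DataFrame.
total_nulls = int(nulls_per_column.sum())

print(nulls_per_column)
print(total_nulls)
```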