## Diabetes CRISP-DM Project: Business Understanding, EDA & Data Wrangling Phases
 

The following dataset is the starting point for the investigation and for meeting the business/research objectives.

> **Dataset:** shorturl.at/orJU5

The data are used to pursue the following business/research objectives:

- Objective 1: find the country with the most diabetes
- Objective 2: men have more diabetes than women (true or false)
- Objective 3: which are the 3 countries with the most diabetes?



In [None]:
# Step 0: Dataset

### Part 0: Business Understanding 🤔



### Part 1.1: Data Exploration

**Where do the data come from?**
This dataset comes from Kaggle.

**Which company or institution is it from?**

*https://ncdrisc.org/*

The data come from a worldwide network of scientists whose goal is to provide reliable, up-to-date information on non-communicable diseases and their risk factors.

**What are we going to do with it?**

Extract and manipulate the information provided, guided by the objectives set out above.
The dataset contains 14,000 rows × 7 columns.

**The data dictionary is as follows:**
* Country/Region/World: String. Country the row refers to
* ISO: String. Standardised international code that identifies each country
* Sex: String. Sex of the population group (Men or Women)
* Year: Int. Year of the estimate
* Age-standardised diabetes prevalence: Float. Age-standardised prevalence of diabetes
* Lower 95% uncertainty interval: Float. Lower bound of the 95% uncertainty interval
* Upper 95% uncertainty interval: Float. Upper bound of the 95% uncertainty interval



### Part 1.2: Pandas

In [None]:
# Data loading
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# head()
# info()
# % completeness
# NA ?


In [None]:
diabetes = pd.read_csv('Diabetes.csv', sep = ',')
diabetes.head(200)

Unnamed: 0,Country/Region/World,ISO,Sex,Year,Age-standardised diabetes prevalence,Lower 95% uncertainty interval,Upper 95% uncertainty interval
0,Afghanistan,AFG,Men,1980,0.044712,0.015339,0.094918
1,Afghanistan,AFG,Men,1981,0.046114,0.016883,0.093777
2,Afghanistan,AFG,Men,1982,0.047601,0.018745,0.094018
3,Afghanistan,AFG,Men,1983,0.049173,0.020375,0.093950
4,Afghanistan,AFG,Men,1984,0.050834,0.022269,0.093679
...,...,...,...,...,...,...,...
195,Angola,AGO,Men,2000,0.053250,0.028657,0.087909
196,Angola,AGO,Men,2001,0.055094,0.029983,0.090278
197,Angola,AGO,Men,2002,0.057113,0.031255,0.093183
198,Angola,AGO,Men,2003,0.059203,0.032907,0.096256


In [None]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14000 entries, 0 to 13999
Data columns (total 7 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Country/Region/World                  14000 non-null  object 
 1   ISO                                   14000 non-null  object 
 2   Sex                                   14000 non-null  object 
 3   Year                                  14000 non-null  int64  
 4   Age-standardised diabetes prevalence  14000 non-null  float64
 5   Lower 95% uncertainty interval        14000 non-null  float64
 6   Upper 95% uncertainty interval        14000 non-null  float64
dtypes: float64(3), int64(1), object(3)
memory usage: 765.8+ KB


# % Completeness
The table contains no null values, so it is 100% complete.
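
The completeness percentage can also be computed directly rather than read off `info()`. The following is a minimal sketch on a toy DataFrame with made-up values (`toy` and `completeness` are illustrative names, not part of the notebook); the same expression should apply to `diabetes`.

```python
import pandas as pd

# Toy frame with made-up values; it stands in for the real `diabetes` DataFrame.
toy = pd.DataFrame({
    "ISO": ["AFG", "AGO", None, "ZWE"],
    "Year": [1980, 1981, 1982, None],
    "Sex": ["Men", "Women", "Men", "Women"],
})

# notna() marks the non-null cells; the column-wise mean of those booleans is
# the fraction of filled cells, and multiplying by 100 turns it into a percentage.
completeness = toy.notna().mean() * 100
print(completeness)
```

A column with no nulls, as in the real dataset, would show 100.0.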

In [None]:
# NA ?
diabetes.isnull().sum()


Country/Region/World                    0
ISO                                     0
Sex                                     0
Year                                    0
Age-standardised diabetes prevalence    0
Lower 95% uncertainty interval          0
Upper 95% uncertainty interval          0
dtype: int64

In [None]:
diabetes.notnull().sum()

Country/Region/World                    14000
ISO                                     14000
Sex                                     14000
Year                                    14000
Age-standardised diabetes prevalence    14000
Lower 95% uncertainty interval          14000
Upper 95% uncertainty interval          14000
dtype: int64

### Part 2: Data Wrangling

# Can the NA values be dropped?
There is nothing to drop, because the dataset contains no null values.

In [None]:
# rename columns...
diabetes = diabetes.rename(columns={'Country/Region/World': 'País', 'ISO': 'Cod.País', 'Sex': 'Sexo', 'Year': 'Año', 'Age-standardised diabetes prevalence': 'Diabetes.Por.Edad', 'Lower 95% uncertainty interval': 'Incertidumbre.Menor95%', 'Upper 95% uncertainty interval': 'Incertidumbre.Mayor95%'})
print(diabetes)



              País Cod.País   Sexo   Año  Diabetes.Por.Edad  \
0      Afghanistan      AFG    Men  1980           0.044712   
1      Afghanistan      AFG    Men  1981           0.046114   
2      Afghanistan      AFG    Men  1982           0.047601   
3      Afghanistan      AFG    Men  1983           0.049173   
4      Afghanistan      AFG    Men  1984           0.050834   
...            ...      ...    ...   ...                ...   
13995     Zimbabwe      ZWE  Women  2010           0.072249   
13996     Zimbabwe      ZWE  Women  2011           0.072956   
13997     Zimbabwe      ZWE  Women  2012           0.073752   
13998     Zimbabwe      ZWE  Women  2013           0.074616   
13999     Zimbabwe      ZWE  Women  2014           0.075607   

       Incertidumbre.Menor95%  Incertidumbre.Mayor95%  
0                    0.015339                0.094918  
1                    0.016883                0.093777  
2                    0.018745                0.094018  
3                  

In [None]:
# iloc / filter the data we are interested in
diabetes.iloc[:, [1, 2, 6]]

Unnamed: 0,Cod.País,Sexo,Incertidumbre.Mayor95%
0,AFG,Men,0.094918
1,AFG,Men,0.093777
2,AFG,Men,0.094018
3,AFG,Men,0.093950
4,AFG,Men,0.093679
...,...,...,...
13995,ZWE,Women,0.108806
13996,ZWE,Women,0.112512
13997,ZWE,Women,0.116488
13998,ZWE,Women,0.121880
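
Positional selection with `iloc` depends on the column order. As an alternative, the same three columns can be selected by name. This is a minimal sketch on a toy frame with made-up values that mirrors the renamed column order (`toy`, `by_pos` and `by_name` are illustrative names).

```python
import pandas as pd

# Toy frame with made-up values, using the renamed columns in the same order.
toy = pd.DataFrame({
    "País": ["A", "B"],
    "Cod.País": ["AAA", "BBB"],
    "Sexo": ["Men", "Women"],
    "Año": [1980, 1980],
    "Diabetes.Por.Edad": [0.04, 0.05],
    "Incertidumbre.Menor95%": [0.02, 0.03],
    "Incertidumbre.Mayor95%": [0.09, 0.10],
})

# Selection by position, as in the cell above.
by_pos = toy.iloc[:, [1, 2, 6]]

# Selection by name: unaffected if columns are later added or reordered.
by_name = toy[["Cod.País", "Sexo", "Incertidumbre.Mayor95%"]]

print(by_pos.equals(by_name))
```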


In [None]:
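# Create columns
# Note: the expression below adds the lower bound to itself, so "Media" ends up
# equal to Incertidumbre.Menor95% (as the output shows). The midpoint of the
# interval would need Incertidumbre.Mayor95% as the second term.
diabetes["Media"] = (diabetes["Incertidumbre.Menor95%"] + diabetes["Incertidumbre.Menor95%"]) / 2
diabetes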


Unnamed: 0,País,Cod.País,Sexo,Año,Diabetes.Por.Edad,Incertidumbre.Menor95%,Incertidumbre.Mayor95%,Media
0,Afghanistan,AFG,Men,1980,0.044712,0.015339,0.094918,0.015339
1,Afghanistan,AFG,Men,1981,0.046114,0.016883,0.093777,0.016883
2,Afghanistan,AFG,Men,1982,0.047601,0.018745,0.094018,0.018745
3,Afghanistan,AFG,Men,1983,0.049173,0.020375,0.093950,0.020375
4,Afghanistan,AFG,Men,1984,0.050834,0.022269,0.093679,0.022269
...,...,...,...,...,...,...,...,...
13995,Zimbabwe,ZWE,Women,2010,0.072249,0.043879,0.108806,0.043879
13996,Zimbabwe,ZWE,Women,2011,0.072956,0.042840,0.112512,0.042840
13997,Zimbabwe,ZWE,Women,2012,0.073752,0.041895,0.116488,0.041895
13998,Zimbabwe,ZWE,Women,2013,0.074616,0.040434,0.121880,0.040434
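
In the output above, `Media` coincides with the lower bound in every row. If the intent is the midpoint of the 95% uncertainty interval, it could be computed from both bounds. This is a minimal sketch: the two rows are copied from the first rows of the output above, and `Punto.Medio` is a hypothetical column name chosen for illustration.

```python
import pandas as pd

# First two rows of the bounds shown in the output above.
toy = pd.DataFrame({
    "Incertidumbre.Menor95%": [0.015339, 0.016883],
    "Incertidumbre.Mayor95%": [0.094918, 0.093777],
})

# Midpoint of the interval: (lower bound + upper bound) / 2.
toy["Punto.Medio"] = (
    toy["Incertidumbre.Menor95%"] + toy["Incertidumbre.Mayor95%"]
) / 2
print(toy)
```

With both bounds in the sum, the new column falls strictly between the lower and upper bound of each row.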


### The dataset we need for the objectives

In [None]:
# Operate on them
# - Objective 1: find the country with the most diabetes
# - Objective 2: men have more diabetes than women (true or false)
# - Objective 3: which are the 3 countries with the most diabetes?

# Transform

In [None]:
# - Objective 1: find the country with the most diabetes
# Select 'Media' before aggregating so that non-numeric columns are not averaged
media_diabetes = diabetes.groupby('País')['Media'].mean()
media_diabetes




País
Afghanistan       0.048202
Albania           0.031626
Algeria           0.054531
American Samoa    0.192953
Andorra           0.043173
                    ...   
Venezuela         0.051580
Viet Nam          0.023598
Yemen             0.035808
Zambia            0.027186
Zimbabwe          0.029599
Name: Media, Length: 200, dtype: float64
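
The series above lists one mean per country, but Objective 1 asks for a single country. One way to pick it out is `idxmax()`. This is a minimal sketch on a toy frame with made-up values (`toy`, `media_por_pais` and `pais_max` are illustrative names), assuming the per-country mean of `Media` is the measure of interest.

```python
import pandas as pd

# Toy data with made-up values: three countries, two rows each.
toy = pd.DataFrame({
    "País": ["A", "A", "B", "B", "C", "C"],
    "Media": [0.02, 0.04, 0.10, 0.12, 0.05, 0.07],
})

# Mean of "Media" per country.
media_por_pais = toy.groupby("País")["Media"].mean()

# idxmax() returns the index label (the country) holding the largest mean.
pais_max = media_por_pais.idxmax()
print(pais_max, media_por_pais.max())
```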

In [None]:
# - Objective 2: men have more diabetes than women (true or false)
# Select 'Media' before aggregating so that non-numeric columns are not averaged
genero_diabetes = diabetes.groupby('Sexo')['Media'].mean()
genero_diabetes

# Women tend to have more diabetes than men.



Sexo
Men      0.044094
Women    0.045878
Name: Media, dtype: float64
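
Since Objective 2 is phrased as true or false, the comparison can be reduced to a single boolean. This is a minimal sketch on a toy frame with made-up values (`toy`, `media_por_sexo` and `hombres_mas` are illustrative names).

```python
import pandas as pd

# Toy data with made-up values: two rows per sex.
toy = pd.DataFrame({
    "Sexo": ["Men", "Men", "Women", "Women"],
    "Media": [0.040, 0.050, 0.045, 0.055],
})

# Mean of "Media" per sex.
media_por_sexo = toy.groupby("Sexo")["Media"].mean()

# True only if the mean for men is strictly greater than the mean for women.
hombres_mas = bool(media_por_sexo["Men"] > media_por_sexo["Women"])
print(hombres_mas)
```

Applied to the means shown above (0.044094 for men, 0.045878 for women), the same comparison would evaluate to `False`.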

In [None]:
# - Objective 3: which are the 3 countries with the most diabetes?
media_diabetes = diabetes.groupby('País')['Media'].mean()
# nlargest(3) keeps the three countries with the highest mean
media_diabetes.nlargest(3)

### Objectives

In [None]:
# Go after the objectives

## Data Visualization


In [None]:
# Plots
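
As one possible starting point for the plots, the three highest per-country means could be drawn as a bar chart. This is a minimal sketch with made-up values (`media_por_pais` and `top3` are illustrative names, and the chart type is an assumption rather than something the notebook prescribes).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook to see the figure inline
import matplotlib.pyplot as plt
import pandas as pd

# Made-up per-country means, standing in for the real groupby result.
media_por_pais = pd.Series(
    {"A": 0.03, "B": 0.11, "C": 0.06, "D": 0.09}, name="Media"
)

# Keep the three largest values, sorted from highest to lowest.
top3 = media_por_pais.nlargest(3)

# One bar per country.
fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(top3.index, top3.values)
ax.set_xlabel("País")
ax.set_ylabel("Media")
ax.set_title("Top 3 countries by mean value (toy data)")
fig.tight_layout()
```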

### Conclusions
- Regarding Objective 1...
- Objective 2:
- ...

In summary, blah blah blah