## Data exploration and CSV cleaning

🌇🏃‍♀️ In this project we work with a CSV file that contains information about healthy living in cities around the world.

In [1]:
import numpy as np
import pandas as pd
import re
from geopy.geocoders import Nominatim

pd.options.display.max_columns = None

import sys
sys.path.append('../')
import src.funcioneslimpieza as flimp

In [2]:
calid_city = pd.read_csv('../data/vida_ciudades.csv')
calid_city.head(2)

Unnamed: 0,City,Rank,Sunshine hours(City),Cost of a bottle of water(City),Obesity levels(Country),Life expectancy(years) (Country),Pollution(Index score) (City),Annual avg. hours worked,Happiness levels(Country),Outdoor activities(City),Number of take out places(City),Cost of a monthly gym membership(City)
0,Amsterdam,1,1858,£1.92,20.40%,81.2,30.93,1434,7.44,422,1048,£34.90
1,Sydney,2,2636,£1.48,29.00%,82.1,26.86,1712,7.22,406,1103,£41.66


1️⃣ First, we remove the special characters (£ and %) from the cells so that the values are easier to analyse later on.

In [3]:
calid_city['Cost of a bottle of water(City)'] = calid_city['Cost of a bottle of water(City)'].str.replace(r'£', '', regex = True)

In [4]:
calid_city.head(2)

Unnamed: 0,City,Rank,Sunshine hours(City),Cost of a bottle of water(City),Obesity levels(Country),Life expectancy(years) (Country),Pollution(Index score) (City),Annual avg. hours worked,Happiness levels(Country),Outdoor activities(City),Number of take out places(City),Cost of a monthly gym membership(City)
0,Amsterdam,1,1858,1.92,20.40%,81.2,30.93,1434,7.44,422,1048,£34.90
1,Sydney,2,2636,1.48,29.00%,82.1,26.86,1712,7.22,406,1103,£41.66


In [5]:
calid_city['Obesity levels(Country)'] = calid_city['Obesity levels(Country)'].str.replace(r'%', '', regex = True)

In [6]:
calid_city.head(2)

Unnamed: 0,City,Rank,Sunshine hours(City),Cost of a bottle of water(City),Obesity levels(Country),Life expectancy(years) (Country),Pollution(Index score) (City),Annual avg. hours worked,Happiness levels(Country),Outdoor activities(City),Number of take out places(City),Cost of a monthly gym membership(City)
0,Amsterdam,1,1858,1.92,20.4,81.2,30.93,1434,7.44,422,1048,£34.90
1,Sydney,2,2636,1.48,29.0,82.1,26.86,1712,7.22,406,1103,£41.66


In [7]:
calid_city['Cost of a monthly gym membership(City)'] = calid_city['Cost of a monthly gym membership(City)'].str.replace(r'£', '', regex = True)

In [8]:
calid_city.head(2)

Unnamed: 0,City,Rank,Sunshine hours(City),Cost of a bottle of water(City),Obesity levels(Country),Life expectancy(years) (Country),Pollution(Index score) (City),Annual avg. hours worked,Happiness levels(Country),Outdoor activities(City),Number of take out places(City),Cost of a monthly gym membership(City)
0,Amsterdam,1,1858,1.92,20.4,81.2,30.93,1434,7.44,422,1048,34.9
1,Sydney,2,2636,1.48,29.0,82.1,26.86,1712,7.22,406,1103,41.66
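
The three replacements above follow the same pattern, so they could be folded into a single helper. The following is only a minimal sketch: `strip_symbols` is a hypothetical name (it is not part of the notebook or of `src.funcioneslimpieza`), and the small frame reuses the first two rows shown above in place of the real CSV. It also assumes that converting the cleaned strings to numbers is desirable at this stage.

```python
import pandas as pd


def strip_symbols(df, columns, symbols="£%"):
    """Return a copy of df with the symbols removed from the given columns
    and the remaining text converted to numbers."""
    out = df.copy()
    for col in columns:
        cleaned = out[col].astype(str)
        for sym in symbols:
            # regex=False treats the symbol as a literal character
            cleaned = cleaned.str.replace(sym, "", regex=False)
        # Anything that still is not a number becomes NaN
        out[col] = pd.to_numeric(cleaned, errors="coerce")
    return out


# Illustrative sample built from the first two rows of the dataset
sample = pd.DataFrame({
    "City": ["Amsterdam", "Sydney"],
    "Cost of a bottle of water(City)": ["£1.92", "£1.48"],
    "Obesity levels(Country)": ["20.40%", "29.00%"],
})
cleaned = strip_symbols(
    sample,
    ["Cost of a bottle of water(City)", "Obesity levels(Country)"],
)
print(cleaned.dtypes)
```

Working on a copy keeps the original frame untouched, which makes it easier to rerun a cell without stripping the same column twice.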


2️⃣ Next, we deal with missing values: the cells that contain the placeholder '-' are first replaced by np.nan and then filled with the mean of their column.

In [14]:
def replac_nulos(x):
    '''
    Replace the '-' placeholders in a column of calid_city with the column mean.

    The '-' is first replaced by np.nan, the column is cast to float, and the
    np.nan values are then replaced by the mean of the remaining cells in the
    same column.

    Args:
        x (str): name of the column to clean.

    Returns:
        None. The global DataFrame calid_city is modified in place.
    '''
    # Mark the '-' placeholders as missing
    calid_city[x] = calid_city[x].replace('-', np.nan)
    # Cast to float so that the mean can be computed
    calid_city[x] = calid_city[x].astype('float')
    # Fill the gaps with the column mean (mean() skips NaN by default)
    calid_city[x] = calid_city[x].replace(np.nan, calid_city[x].mean())

In [15]:
replac_nulos('Sunshine hours(City)')
replac_nulos('Annual avg. hours worked')
replac_nulos('Pollution(Index score) (City)')

In [16]:
calid_city['Sunshine hours(City)'] = calid_city['Sunshine hours(City)'].round()
calid_city['Annual avg. hours worked'] = calid_city['Annual avg. hours worked'].round()
calid_city['Pollution(Index score) (City)'] = calid_city['Pollution(Index score) (City)'].round()

In [17]:
calid_city.head(5)

Unnamed: 0,City,Rank,Sunshine hours(City),Cost of a bottle of water(City),Obesity levels(Country),Life expectancy(years) (Country),Pollution(Index score) (City),Annual avg. hours worked,Happiness levels(Country),Outdoor activities(City),Number of take out places(City),Cost of a monthly gym membership(City)
0,Amsterdam,1,1858.0,1.92,20.4,81.2,31.0,1434.0,7.44,422,1048,34.9
1,Sydney,2,2636.0,1.48,29.0,82.1,27.0,1712.0,7.22,406,1103,41.66
2,Vienna,3,1884.0,1.94,20.1,81.0,17.0,1501.0,7.29,132,1008,25.74
3,Stockholm,4,1821.0,1.72,20.6,81.8,20.0,1452.0,7.35,129,598,37.31
4,Copenhagen,5,1630.0,2.19,19.7,79.8,21.0,1380.0,7.64,154,523,32.53
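
The same mean imputation can be written without relying on a global DataFrame. This is a sketch under assumptions, not the notebook's own function: `fill_dash_with_mean` is a hypothetical name, and the three-row sample (with a made-up '-' in the middle) stands in for the real column.

```python
import pandas as pd


def fill_dash_with_mean(df, column):
    """Return a copy of df where '-' in `column` is replaced by the mean
    of the remaining numeric values in that column."""
    out = df.copy()
    # mask() turns the cells equal to '-' into missing values
    without_dash = out[column].mask(out[column] == "-")
    numeric = pd.to_numeric(without_dash, errors="coerce")
    # mean() ignores missing values, so only real observations count
    out[column] = numeric.fillna(numeric.mean())
    return out


# Illustrative sample: two real values and one placeholder
sample = pd.DataFrame({"Sunshine hours(City)": ["1858", "-", "2636"]})
filled = fill_dash_with_mean(sample, "Sunshine hours(City)")
print(filled["Sunshine hours(City)"].tolist())
```

Passing the frame as an argument and returning a new one makes the function reusable on any DataFrame and easier to test in isolation.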


❗ Next, we produce three CSV files so that the data can later be analysed with SQL.

1️⃣ First, we build a CSV with the characteristics of each city: sunshine hours, pollution and number of take-out places.

In [18]:
caract_city = calid_city[['City', 'Rank', 'Sunshine hours(City)', 'Pollution(Index score) (City)', 'Number of take out places(City)']].copy()

In [19]:
caract_city.head(2)

Unnamed: 0,City,Rank,Sunshine hours(City),Pollution(Index score) (City),Number of take out places(City)
0,Amsterdam,1,1858.0,31.0,1048
1,Sydney,2,2636.0,27.0,1103


In [20]:
caract_city.to_csv('../data/caracteristicas_ciudad.csv')
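
The `Unnamed: 0` column in the outputs above is what pandas shows when a CSV that was saved together with its index is read back. As an illustration only (the notebook itself keeps the default), the sketch below writes a tiny frame to an in-memory buffer with and without `index=False` and compares the columns that come back.

```python
import io

import pandas as pd

# Illustrative two-row frame; the real data has more columns
df = pd.DataFrame({"City": ["Amsterdam", "Sydney"], "Rank": [1, 2]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = pd.read_csv(no_index).columns.tolist()

print(cols_with_index)
print(cols_no_index)
```

If the index carries no information, saving with `index=False` avoids accumulating an extra column each time a file is written and reloaded.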

2️⃣ Second, we build a CSV with the costs associated with healthy living in each city and the number of outdoor activities on offer.

In [21]:
activ_city = calid_city[['City', 'Rank', 'Cost of a bottle of water(City)', 'Outdoor activities(City)', 'Cost of a monthly gym membership(City)']].copy()

In [22]:
activ_city.head(2)

Unnamed: 0,City,Rank,Cost of a bottle of water(City),Outdoor activities(City),Cost of a monthly gym membership(City)
0,Amsterdam,1,1.92,422,34.9
1,Sydney,2,1.48,406,41.66


In [23]:
activ_city.to_csv('../data/actividades_ciudad.csv')

3️⃣ Finally, we build a CSV with the population characteristics: obesity levels, life expectancy, annual hours worked and average happiness level.

In [24]:
peop_city = calid_city[['City', 'Rank', 'Obesity levels(Country)', 'Life expectancy(years) (Country)', 'Annual avg. hours worked', 'Happiness levels(Country)']].copy()

In [25]:
peop_city.head(2)

Unnamed: 0,City,Rank,Obesity levels(Country),Life expectancy(years) (Country),Annual avg. hours worked,Happiness levels(Country)
0,Amsterdam,1,20.4,81.2,1434.0,7.44
1,Sydney,2,29.0,82.1,1712.0,7.22


In [26]:
peop_city.to_csv('../data/calidad_personas.csv')

In [27]:
calid_city.to_csv('../data/ciudades.csv')
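
The notebook does not say which SQL engine is used for the later analysis. Purely as an assumption, the sketch below loads two small frames into an in-memory SQLite database and joins them on `Rank`, to show how the exported tables could be queried. The table names mirror the CSV file names, and the two-row frames reuse values shown above instead of reading the real files.

```python
import sqlite3

import pandas as pd

# Illustrative stand-ins for two of the exported tables
caract = pd.DataFrame({
    "City": ["Amsterdam", "Sydney"],
    "Rank": [1, 2],
    "Sunshine hours(City)": [1858.0, 2636.0],
})
peop = pd.DataFrame({
    "City": ["Amsterdam", "Sydney"],
    "Rank": [1, 2],
    "Happiness levels(Country)": [7.44, 7.22],
})

# Assumption: SQLite, in memory, so nothing is written to disk
conn = sqlite3.connect(":memory:")
caract.to_sql("caracteristicas_ciudad", conn, index=False)
peop.to_sql("calidad_personas", conn, index=False)

# Column names with spaces and parentheses must be double-quoted in SQL
query = """
SELECT c.City,
       c."Sunshine hours(City)" AS sunshine,
       p."Happiness levels(Country)" AS happiness
FROM caracteristicas_ciudad AS c
JOIN calidad_personas AS p ON p.Rank = c.Rank
ORDER BY c.Rank
"""
result = pd.read_sql_query(query, conn)
conn.close()
print(result)
```

Keeping `City` and `Rank` in all three files, as the notebook does, is what makes this kind of join possible.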