# Data exploration and CSV cleaning

🎨🖌 **In this case, we will work with a CSV file containing information on 50 of the most important artists in history.**

In [1]:
import pandas as pd
import re

pd.options.display.max_columns = None

In [2]:
art = pd.read_csv('../data/artists.csv', index_col = 0)
art.head(2)

| id | name | years | genre | nationality | bio | wikipedia | paintings |
|---|---|---|---|---|---|---|---|
| 0 | Amedeo Modigliani | 1884 - 1920 | Expressionism | Italian | Amedeo Clemente Modigliani (Italian pronunciat... | http://en.wikipedia.org/wiki/Amedeo_Modigliani | 193 |
| 1 | Vasiliy Kandinskiy | 1866 - 1944 | Expressionism,Abstractionism | Russian | Wassily Wassilyevich Kandinsky (Russian: Васи́... | http://en.wikipedia.org/wiki/Wassily_Kandinsky | 88 |


1️⃣ **First, we inspect the CSV to obtain its basic information before working on it.**

**columns:** returns the column labels of the DataFrame.

In [3]:
art.columns

Index(['name', 'years', 'genre', 'nationality', 'bio', 'wikipedia',
       'paintings'],
      dtype='object')

**shape:** returns the number of rows and columns in the DataFrame.

In [4]:
art.shape

(50, 7)

**dtypes:** returns the data type of each column.

In [5]:
art.dtypes

name           object
years          object
genre          object
nationality    object
bio            object
wikipedia      object
paintings       int64
dtype: object

**info:** gives a general summary of the DataFrame: the column names, the number of non-null values in each column, and each column's data type.

In [6]:
art.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 49
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         50 non-null     object
 1   years        50 non-null     object
 2   genre        50 non-null     object
 3   nationality  50 non-null     object
 4   bio          50 non-null     object
 5   wikipedia    50 non-null     object
 6   paintings    50 non-null     int64 
dtypes: int64(1), object(6)
memory usage: 3.1+ KB


**describe:** summary statistics for the numeric columns of the DataFrame (here, only `paintings`).

In [7]:
art.describe()

| | paintings |
|---|---|
| count | 50.0 |
| mean | 168.92 |
| std | 157.451105 |
| min | 24.0 |
| 25% | 81.0 |
| 50% | 123.0 |
| 75% | 191.75 |
| max | 877.0 |


**isnull().sum():** returns the number of missing values (NaN or None) in each column of the DataFrame.

In [8]:
art.isnull().sum()

name           0
years          0
genre          0
nationality    0
bio            0
wikipedia      0
paintings      0
dtype: int64
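
As a small illustration of what `isnull().sum()` counts, the sketch below builds a hypothetical three-row frame. The name `toy` and its values are made up for this example and are not taken from artists.csv. One thing it shows is that an empty string is not treated as missing, so a column can report zero nulls and still contain blank entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame for illustration only (not the artists data)
toy = pd.DataFrame({
    'name': ['A', None, 'C'],                      # None counts as missing
    'years': ['1884 - 1920', '', '1866 - 1944'],   # empty string does NOT count as missing
    'paintings': [193, np.nan, 88],                # NaN counts as missing
})

# Number of missing values per column
missing = toy.isnull().sum()
print(missing)
```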

**dtypes:** the data types in our DataFrame.

In [10]:
art.dtypes

name           object
years          object
genre          object
nationality    object
bio            object
wikipedia      object
paintings       int64
dtype: object

2️⃣ **Next, we remove the parenthesized text from the bio column, since it only repeats dates or gives pronunciation, neither of which we need.** The cells below show the same biography before and after applying the regex.

In [11]:
art['bio'][2]

'Diego María de la Concepción Juan Nepomuceno Estanislao de la Rivera y Barrientos Acosta y Rodríguez, known as Diego Rivera (Spanish pronunciation: [ˈdjeɣo riˈβeɾa]; December 8, 1886 – November 24, 1957) was a prominent Mexican painter. His large frescoes helped establish the Mexican mural movement in Mexican art. Between 1922 and 1953, Rivera painted murals in, among other places, Mexico City, Chapingo, Cuernavaca, San Francisco, Detroit, and New York City. In 1931, a retrospective exhibition of his works was held at the Museum of Modern Art in New York. Rivera had a volatile marriage with fellow Mexican artist Frida Kahlo.'

In [12]:
art['bio'] = art['bio'].str.replace(r'\(.*?\)', '', regex = True)

In [13]:
art['bio'][2]

'Diego María de la Concepción Juan Nepomuceno Estanislao de la Rivera y Barrientos Acosta y Rodríguez, known as Diego Rivera  was a prominent Mexican painter. His large frescoes helped establish the Mexican mural movement in Mexican art. Between 1922 and 1953, Rivera painted murals in, among other places, Mexico City, Chapingo, Cuernavaca, San Francisco, Detroit, and New York City. In 1931, a retrospective exhibition of his works was held at the Museum of Modern Art in New York. Rivera had a volatile marriage with fellow Mexican artist Frida Kahlo.'
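
The cleaned biography above keeps a double space where the parenthetical used to be. A minimal sketch of one way to tidy that up, assuming it is acceptable to collapse every run of whitespace into a single space; the one-element Series `bio` is a shortened, hypothetical stand-in for the real column. The non-greedy pattern stops at the first closing parenthesis, so nested parentheses would not be removed cleanly.

```python
import pandas as pd

# Shortened stand-in for one entry of the bio column (illustration only)
bio = pd.Series([
    'Diego Rivera (Spanish pronunciation: [ˈdjeɣo riˈβeɾa]; '
    'December 8, 1886 – November 24, 1957) was a prominent Mexican painter.'
])

cleaned = (
    bio.str.replace(r'\(.*?\)', '', regex=True)  # drop each "(...)" group, non-greedy
       .str.replace(r'\s+', ' ', regex=True)     # collapse runs of whitespace
       .str.strip()                              # trim leading/trailing spaces
)
print(cleaned[0])
```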

3️⃣ **We then extract each artist's year of birth and year of death with regex, applied to the years column of the original CSV.**

In [14]:
# Capture the four-digit year at the start of the string
art['year_of_birth'] = art['years'].str.extract(r'^(\d{4})', expand=False)

In [15]:
art.head(2)

| id | name | years | genre | nationality | bio | wikipedia | paintings | year_of_birth |
|---|---|---|---|---|---|---|---|---|
| 0 | Amedeo Modigliani | 1884 - 1920 | Expressionism | Italian | Amedeo Clemente Modigliani was an Italian Jew... | http://en.wikipedia.org/wiki/Amedeo_Modigliani | 193 | 1884 |
| 1 | Vasiliy Kandinskiy | 1866 - 1944 | Expressionism,Abstractionism | Russian | Wassily Wassilyevich Kandinsky was a Russian... | http://en.wikipedia.org/wiki/Wassily_Kandinsky | 88 | 1866 |


In [16]:
# Capture the four-digit year at the end of the string
art['year_of_death'] = art['years'].str.extract(r'(\d{4})$', expand=False)

In [17]:
art.head(2)

| id | name | years | genre | nationality | bio | wikipedia | paintings | year_of_birth | year_of_death |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Amedeo Modigliani | 1884 - 1920 | Expressionism | Italian | Amedeo Clemente Modigliani was an Italian Jew... | http://en.wikipedia.org/wiki/Amedeo_Modigliani | 193 | 1884 | 1920 |
| 1 | Vasiliy Kandinskiy | 1866 - 1944 | Expressionism,Abstractionism | Russian | Wassily Wassilyevich Kandinsky was a Russian... | http://en.wikipedia.org/wiki/Wassily_Kandinsky | 88 | 1866 | 1944 |
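
Both years can also be pulled out in a single pass. The sketch below is one possible variant, not the notebook's own method: it assumes every value follows the `YYYY - YYYY` pattern seen in the rows above, uses named groups so the resulting columns are labelled directly, and converts them to integers. The names `years` and `parts` are hypothetical.

```python
import pandas as pd

# Three 'years' values copied from the rows shown in this notebook
years = pd.Series(['1884 - 1920', '1866 - 1944', '1266 - 1337'], name='years')

# Named groups become the column names of the returned DataFrame
parts = years.str.extract(
    r'^(?P<year_of_birth>\d{4})\s*-\s*(?P<year_of_death>\d{4})$'
)

# Convert the captured strings to integers (fails if any row did not match)
parts = parts.astype(int)
print(parts)
```

If a value did not match the pattern, `str.extract` would return NaN for that row and `astype(int)` would raise, which makes format problems visible early.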


4️⃣ **Next, we sort the data by year of birth.** `sort_values` returns a sorted copy, so `art` itself keeps its original order.

In [18]:
art.sort_values(by='year_of_birth').head(2)

| id | name | years | genre | nationality | bio | wikipedia | paintings | year_of_birth | year_of_death |
|---|---|---|---|---|---|---|---|---|---|
| 23 | Giotto di Bondone | 1266 - 1337 | Proto Renaissance | Italian | Giotto di Bondone , known mononymously as Giot... | http://en.wikipedia.org/wiki/Giotto_di_Bondone | 119 | 1266 | 1337 |
| 7 | Andrei Rublev | 1360 - 1430 | Byzantine Art | Russian | Andrei Rublev is considered to be one of the ... | http://en.wikipedia.org/wiki/Andrei_Rublev | 99 | 1360 | 1430 |
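
The extracted years are strings, so sorting on them is lexicographic; that happens to give the right order here because every year has four digits. A minimal sketch of a numeric sort whose result is kept, using a hypothetical three-row frame `sample` built from rows shown in this notebook:

```python
import pandas as pd

# Hypothetical sample built from three rows shown in this notebook
sample = pd.DataFrame({
    'name': ['Amedeo Modigliani', 'Vasiliy Kandinskiy', 'Giotto di Bondone'],
    'year_of_birth': ['1884', '1866', '1266'],
})

# Convert to integers so the sort is numeric rather than lexicographic
sample['year_of_birth'] = sample['year_of_birth'].astype(int)

# sort_values returns a new DataFrame; assign it to keep the sorted order
sample_sorted = sample.sort_values(by='year_of_birth')
print(sample_sorted['name'].tolist())
```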


🔚 **Finally, we save the CSV with the changes made.**

In [19]:
art.to_csv('../data/art.csv')
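
By default `to_csv` writes the index as the first column, so the file can be read back with `index_col=0`, as at the start of this notebook. The sketch below checks that round trip on a hypothetical two-row frame written to a temporary directory; the file name `art_sample.csv` and the index name `id` are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample with an index named 'id'
sample = pd.DataFrame(
    {'name': ['Amedeo Modigliani', 'Vasiliy Kandinskiy'], 'paintings': [193, 88]},
    index=pd.Index([0, 1], name='id'),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'art_sample.csv')
    sample.to_csv(path)                         # index is written as the first column
    restored = pd.read_csv(path, index_col=0)   # first column becomes the index again

print(restored)
```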