**Introducción a las redes neuronales y su aplicación en Geociencia** by Karen Cruz is licensed under <a href="https://creativecommons.org/licenses/by-nc-nd/4.0?ref=chooser-v1">Attribution-NonCommercial-NoDerivatives 4.0 International</a>

<font size=4 color='indianred'>
    
 # Data analysis using Pandas

<font size=4>
    
  Pandas is a Python library that facilitates data analysis.
 
Pandas works with two main object types: **DataFrame** and **Series**.

A **DataFrame** is like a table, and a **Series** is like a single column.

One way to create a **DataFrame** is from a dictionary, where the keys are the column names and the values are the lists of entries for each column.
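
As a minimal sketch of the dictionary syntax, the following builds a small DataFrame and extracts one column as a Series. The column names mirror the dataset used later, but the three rows are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration
samples = pd.DataFrame({
    "number_of_elements": [4, 5, 2],
    "critical_temp": [29.0, 26.0, 1.98],
})

# A single column of the table is a Series
temps = samples["critical_temp"]

print(type(samples).__name__)  # DataFrame
print(type(temps).__name__)    # Series
print(list(samples.columns))
```

Each dictionary key becomes a column name, and all lists must have the same length.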



<font size=4 color='indianred'>
    
> ### Reading a data file

<font size=4>
    
A CSV file is a table of comma-separated values ("Comma-Separated Values").

With the following function, Pandas can read a CSV file and interpret it as a **DataFrame**:

csv ------------> **pd.read_csv(** file_name **)** --------------> DataFrame
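
The same conversion can be tried without any file on disk. This sketch assumes a tiny hypothetical CSV held in a string and wraps it in `io.StringIO`, since `pd.read_csv` accepts a path or a file-like object.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "number_of_elements,critical_temp\n4,29.0\n5,26.0\n2,1.98\n"

# read_csv accepts a path or any file-like object
small = pd.read_csv(io.StringIO(csv_text))

print(small)
print(small.dtypes)  # column types are inferred from the text
```

The first line of the text is taken as the header, so it supplies the column names.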

___


<font size=4>
    
  Example: the *Superconductivty Data Data Set* was downloaded.

    
[UCI Machine Learning Repository: Superconductivty Data Data Set ](https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data)



There are two files:

    (1) train.csv: contains 81 features extracted from 21263 superconductors, along with the critical temperature in column 82,


    (2) unique_m.csv: contains the chemical formula broken down for all 21263 superconductors in train.csv 



The last two columns of unique_m.csv contain the critical temperature and the chemical formula. The original data come from [here](https://supercon.nims.go.jp/index_en.html), which is public.

**The goal here is to predict the critical temperature from the extracted features.**

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('superconduct/train.csv')

[read_csv( )](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

In [None]:
data

Unnamed: 0,number_of_elements,mean_atomic_mass,wtd_mean_atomic_mass,gmean_atomic_mass,wtd_gmean_atomic_mass,entropy_atomic_mass,wtd_entropy_atomic_mass,range_atomic_mass,wtd_range_atomic_mass,std_atomic_mass,...,wtd_mean_Valence,gmean_Valence,wtd_gmean_Valence,entropy_Valence,wtd_entropy_Valence,range_Valence,wtd_range_Valence,std_Valence,wtd_std_Valence,critical_temp
0,4,88.944468,57.862692,66.361592,36.116612,1.181795,1.062396,122.90607,31.794921,51.968828,...,2.257143,2.213364,2.219783,1.368922,1.066221,1,1.085714,0.433013,0.437059,29.00
1,5,92.729214,58.518416,73.132787,36.396602,1.449309,1.057755,122.90607,36.161939,47.094633,...,2.257143,1.888175,2.210679,1.557113,1.047221,2,1.128571,0.632456,0.468606,26.00
2,4,88.944468,57.885242,66.361592,36.122509,1.181795,0.975980,122.90607,35.741099,51.968828,...,2.271429,2.213364,2.232679,1.368922,1.029175,1,1.114286,0.433013,0.444697,19.00
3,4,88.944468,57.873967,66.361592,36.119560,1.181795,1.022291,122.90607,33.768010,51.968828,...,2.264286,2.213364,2.226222,1.368922,1.048834,1,1.100000,0.433013,0.440952,22.00
4,4,88.944468,57.840143,66.361592,36.110716,1.181795,1.129224,122.90607,27.848743,51.968828,...,2.242857,2.213364,2.206963,1.368922,1.096052,1,1.057143,0.433013,0.428809,23.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21258,4,106.957877,53.095769,82.515384,43.135565,1.177145,1.254119,146.88130,15.504479,65.764081,...,3.555556,3.223710,3.519911,1.377820,0.913658,1,2.168889,0.433013,0.496904,2.44
21259,5,92.266740,49.021367,64.812662,32.867748,1.323287,1.571630,188.38390,7.353333,69.232655,...,2.047619,2.168944,2.038991,1.594167,1.337246,1,0.904762,0.400000,0.212959,122.10
21260,2,99.663190,95.609104,99.433882,95.464320,0.690847,0.530198,13.51362,53.041104,6.756810,...,4.800000,4.472136,4.781762,0.686962,0.450561,1,3.200000,0.500000,0.400000,1.98
21261,2,99.663190,97.095602,99.433882,96.901083,0.690847,0.640883,13.51362,31.115202,6.756810,...,4.690000,4.472136,4.665819,0.686962,0.577601,1,2.210000,0.500000,0.462493,1.84


<font size=4>
    
The **shape** attribute returns the number of rows and the number of columns: (#rows, #columns)

In [None]:
data.shape

(21263, 82)

<font size=4>
    
The **head()** and **tail()** methods show the first 5 and the last 5 rows, respectively.

The number of rows to show can also be given inside the parentheses.

In [None]:
#data.head()
data.tail(15)

Unnamed: 0,number_of_elements,mean_atomic_mass,wtd_mean_atomic_mass,gmean_atomic_mass,wtd_gmean_atomic_mass,entropy_atomic_mass,wtd_entropy_atomic_mass,range_atomic_mass,wtd_range_atomic_mass,std_atomic_mass,...,wtd_mean_Valence,gmean_Valence,wtd_gmean_Valence,entropy_Valence,wtd_entropy_Valence,range_Valence,wtd_range_Valence,std_Valence,wtd_std_Valence,critical_temp
21248,3,135.777533,135.46764,126.142532,124.017412,1.030578,0.975011,120.1624,50.5682,49.068328,...,4.8,3.914868,4.477695,1.012331,0.918428,4,2.0,1.699673,1.469694,2.13
21249,3,135.777533,135.46764,126.142532,124.017412,1.030578,0.975011,120.1624,50.5682,49.068328,...,4.8,3.914868,4.477695,1.012331,0.918428,4,2.0,1.699673,1.469694,3.8
21250,3,135.539823,105.341263,125.319604,95.202197,1.026956,0.910759,122.454,43.097325,50.048251,...,4.411765,4.160168,4.326572,1.057905,0.778998,3,2.647059,1.247219,0.91129,8.29
21251,3,135.013667,105.248412,124.8431,95.138216,1.026758,0.909303,122.454,43.190176,50.018391,...,4.352941,3.634241,4.2246,1.011404,0.743562,4,2.705882,1.632993,1.025623,4.98
21252,3,86.607717,80.8205,80.298396,75.396707,1.023455,1.075786,80.909352,13.727825,34.049031,...,4.375,3.107233,3.580309,0.918428,0.86867,5,2.0,2.160247,1.99609,1.85
21253,4,99.2902,85.84516,62.256121,54.519504,1.037993,1.122529,192.981,38.5962,79.302066,...,4.6,3.935979,4.282255,1.31973,1.187775,4,2.0,1.47902,1.496663,1.6
21254,4,99.2902,85.84516,62.256121,54.519504,1.037993,1.122529,192.981,38.5962,79.302066,...,4.6,3.935979,4.282255,1.31973,1.187775,4,2.0,1.47902,1.496663,3.0
21255,3,89.389833,89.389833,63.694713,63.694713,0.782574,0.782574,164.1315,54.7105,73.156893,...,4.666667,4.578857,4.578857,1.078992,1.078992,2,0.666667,0.942809,0.942809,1.42
21256,3,89.389833,89.389833,63.694713,63.694713,0.782574,0.782574,164.1315,54.7105,73.156893,...,4.666667,4.578857,4.578857,1.078992,1.078992,2,0.666667,0.942809,0.942809,1.85
21257,3,89.389833,89.389833,63.694713,63.694713,0.782574,0.782574,164.1315,54.7105,73.156893,...,4.666667,4.578857,4.578857,1.078992,1.078992,2,0.666667,0.942809,0.942809,3.43
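
To see `shape`, `head()` and `tail()` side by side, here is a small sketch on a hypothetical 8-row table (the column names `x` and `y` are placeholders, not part of the dataset).

```python
import pandas as pd

# Hypothetical 8-row table
df = pd.DataFrame({"x": range(8), "y": [v * 10 for v in range(8)]})

n_rows, n_cols = df.shape   # shape is an attribute, so no parentheses
first_default = df.head()   # first 5 rows by default
last_three = df.tail(3)     # last 3 rows

print(n_rows, n_cols)
print(last_three)
```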


<font size=4 color='indianred'>
    
> ### Selecting a specific column

<font size=4>
    
To select a specific column (a Series) from a DataFrame (df), use the syntax:
    
    
**df.column_name** or **df['column_name']**

For example:

In [None]:
data.critical_temp

#data['critical_temp']

0         29.00
1         26.00
2         19.00
3         22.00
4         23.00
          ...  
21258      2.44
21259    122.10
21260      1.98
21261      1.84
21262     12.80
Name: critical_temp, Length: 21263, dtype: float64
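
The two syntaxes are not always interchangeable. The sketch below uses hypothetical column names to show two cases where only the bracket form reaches the column: a name containing a space, and a name that matches an existing DataFrame method.

```python
import pandas as pd

# Hypothetical columns: one name has a space, one matches a DataFrame method
df = pd.DataFrame({
    "critical temp": [29.0, 26.0],
    "mean": [1.0, 2.0],
    "n_elem": [4, 5],
})

by_attr = df.n_elem               # works: valid identifier, no name clash
with_space = df["critical temp"]  # a name with a space needs brackets
mean_col = df["mean"]             # df.mean is the method, df['mean'] is the column

print(callable(df.mean))  # True
```

For this reason the bracket form is the safer default when column names are not known in advance.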

<font size=4 color='indianred'>
    
> ### Selecting a specific value

<font size=4>
    
To select a specific value from a DataFrame (df), use the syntax:
    
**df['column_name'][index]**

Here the index is looked up by label. With the default index (0, 1, 2, ...) the labels coincide with the positions, so it behaves like zero-based indexing in Python.

For example:

In [None]:
#data.critical_temp

data['critical_temp'][567]

40.0
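
The difference between label and position only shows up when the index is not the default one. This sketch assumes a hypothetical index with labels 10, 20, 30 to make the distinction visible.

```python
import pandas as pd

# Hypothetical Series whose index labels are not 0, 1, 2
temps = pd.Series([29.0, 26.0, 19.0], index=[10, 20, 30], name="critical_temp")
df = temps.to_frame()

by_label = df["critical_temp"][20]         # label 20
same_value = df.loc[20, "critical_temp"]   # explicit label-based form
by_position = df["critical_temp"].iloc[0]  # position 0

print(by_label, same_value, by_position)
```

With this index, `df["critical_temp"][0]` would raise a `KeyError`, because there is no label 0.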

<font size=4 color='indianred'>
    
> ### Access operators

<font size=4>
    
Pandas has its own access operators: **iloc** and **loc**.



<font size=4 color='indianred'>
    
>> ### Index-based (positional) selection: 'iloc'

<font size=4>
    
Selects data based on its numerical position (it ignores the index labels of the dataset).

**df.iloc[i]** ---> returns the row at position i, that is, row i+1 when counting from 1

**df.iloc[row, column]**


In [None]:
index = 0

row = 0

column = 1

#data.iloc[index]   # shows the row at position 'index'

#data.iloc[row, column]   # shows the value at row 'row' and column 'column'

data.iloc[:, column]    # shows all elements of column 'column'


0         88.944468
1         92.729214
2         88.944468
3         88.944468
4         88.944468
            ...    
21258    106.957877
21259     92.266740
21260     99.663190
21261     99.663190
21262     87.468333
Name: mean_atomic_mass, Length: 21263, dtype: float64
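
The three `iloc` forms above can be checked on a small table. This sketch assumes a hypothetical DataFrame with string labels, which `iloc` ignores.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]},
    index=["x", "y", "z"],  # hypothetical string labels, ignored by iloc
)

first_row = df.iloc[0]        # Series holding the first row
one_value = df.iloc[2, 1]     # row position 2, column position 1
whole_column = df.iloc[:, 1]  # every row of the column at position 1

print(first_row.name)  # the label of the selected row
print(one_value)
```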

<font size=4 color='indianred'>
    
>> ### Label-based selection: 'loc'

<font size=4>
    
Selects data based on the index label, not on its position.

**df.loc[i, 'column_name']**


In [None]:
data.loc[0, 'number_of_elements']

data.loc[:, ['number_of_elements', 'critical_temp']]

Unnamed: 0,number_of_elements,critical_temp
0,4,29.00
1,5,26.00
2,4,19.00
3,4,22.00
4,4,23.00
...,...,...
21258,4,2.44
21259,5,122.10
21260,2,1.98
21261,2,1.84
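
Because `loc` works with labels, it also applies when the index is not numeric. A minimal sketch, assuming hypothetical row labels `s1`, `s2`, `s3`:

```python
import pandas as pd

df = pd.DataFrame(
    {"number_of_elements": [4, 5, 2], "critical_temp": [29.0, 26.0, 1.98]},
    index=["s1", "s2", "s3"],  # hypothetical row labels
)

one_value = df.loc["s2", "critical_temp"]         # single cell
subset = df.loc[["s1", "s3"], ["critical_temp"]]  # lists select several rows/columns

print(one_value)
print(subset)
```

Passing lists for both rows and columns returns a DataFrame, even when only one column is requested.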


<font size=4>
    
iloc[0:10] ---> 0, 1, 2, ..., 9 


loc[0:10] ---> 0, 1, 2, ..., 10
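
The two slices above differ in how they treat the end point: `iloc` excludes it, as Python slicing does, while `loc` includes it. A short sketch on a hypothetical 20-row table with the default index:

```python
import pandas as pd

df = pd.DataFrame({"v": range(20)})  # default index 0..19

by_position = df.iloc[0:10]  # stops before position 10
by_label = df.loc[0:10]      # includes label 10

print(len(by_position), len(by_label))  # 10 11
```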


<font size=4 color='indianred'>
    
> ### Conditional selection

<font size=4>
    
A comparison on a column returns a boolean Series, which can be passed to **loc** to keep only the rows where the condition is True.



In [None]:
data.number_of_elements == 4

0         True
1        False
2         True
3         True
4         True
         ...  
21258     True
21259    False
21260    False
21261    False
21262    False
Name: number_of_elements, Length: 21263, dtype: bool

In [None]:
data.loc[data.number_of_elements == 4]

Unnamed: 0,number_of_elements,mean_atomic_mass,wtd_mean_atomic_mass,gmean_atomic_mass,wtd_gmean_atomic_mass,entropy_atomic_mass,wtd_entropy_atomic_mass,range_atomic_mass,wtd_range_atomic_mass,std_atomic_mass,...,wtd_mean_Valence,gmean_Valence,wtd_gmean_Valence,entropy_Valence,wtd_entropy_Valence,range_Valence,wtd_range_Valence,std_Valence,wtd_std_Valence,critical_temp
0,4,88.944468,57.862692,66.361592,36.116612,1.181795,1.062396,122.906070,31.794921,51.968828,...,2.257143,2.213364,2.219783,1.368922,1.066221,1,1.085714,0.433013,0.437059,29.00
2,4,88.944468,57.885242,66.361592,36.122509,1.181795,0.975980,122.906070,35.741099,51.968828,...,2.271429,2.213364,2.232679,1.368922,1.029175,1,1.114286,0.433013,0.444697,19.00
3,4,88.944468,57.873967,66.361592,36.119560,1.181795,1.022291,122.906070,33.768010,51.968828,...,2.264286,2.213364,2.226222,1.368922,1.048834,1,1.100000,0.433013,0.440952,22.00
4,4,88.944468,57.840143,66.361592,36.110716,1.181795,1.129224,122.906070,27.848743,51.968828,...,2.242857,2.213364,2.206963,1.368922,1.096052,1,1.057143,0.433013,0.428809,23.00
5,4,88.944468,57.795044,66.361592,36.098926,1.181795,1.225203,122.906070,20.687458,51.968828,...,2.214286,2.213364,2.181543,1.368922,1.141474,1,1.000000,0.433013,0.410326,23.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21211,4,53.172391,51.271682,48.794569,45.834240,1.314758,1.037980,51.931831,24.600623,18.870271,...,3.006333,2.783158,2.470710,1.265857,0.976571,4,1.641333,1.479020,1.634919,19.30
21212,4,53.172391,51.275799,48.794569,45.837529,1.314758,1.043787,51.931831,24.522046,18.870271,...,3.007667,2.783158,2.471657,1.265857,0.982941,4,1.636000,1.479020,1.635321,15.70
21253,4,99.290200,85.845160,62.256121,54.519504,1.037993,1.122529,192.981000,38.596200,79.302066,...,4.600000,3.935979,4.282255,1.319730,1.187775,4,2.000000,1.479020,1.496663,1.60
21254,4,99.290200,85.845160,62.256121,54.519504,1.037993,1.122529,192.981000,38.596200,79.302066,...,4.600000,3.935979,4.282255,1.319730,1.187775,4,2.000000,1.479020,1.496663,3.00
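
Conditions can also be combined. The sketch below uses a hypothetical four-row table and the element-wise operators `&` (and) and `|` (or); each condition needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({
    "number_of_elements": [4, 5, 4, 2],
    "critical_temp": [29.0, 26.0, 19.0, 1.98],
})

# Parentheses are required: & and | bind more tightly than == and >
both = df.loc[(df.number_of_elements == 4) & (df.critical_temp > 20)]
either = df.loc[(df.number_of_elements == 2) | (df.critical_temp > 20)]

print(both)
print(either)
```

The Python keywords `and` and `or` do not work here, because they expect a single truth value rather than a Series.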


<font size=4>
    
To select rows whose value is in a list of values, use **isin()**:



In [None]:
data.loc[data.number_of_elements.isin([5, 11])]

Unnamed: 0,number_of_elements,mean_atomic_mass,wtd_mean_atomic_mass,gmean_atomic_mass,wtd_gmean_atomic_mass,entropy_atomic_mass,wtd_entropy_atomic_mass,range_atomic_mass,wtd_range_atomic_mass,std_atomic_mass,...,wtd_mean_Valence,gmean_Valence,wtd_gmean_Valence,entropy_Valence,wtd_entropy_Valence,range_Valence,wtd_range_Valence,std_Valence,wtd_std_Valence,critical_temp
1,5,92.729214,58.518416,73.132787,36.396602,1.449309,1.057755,122.90607,36.161939,47.094633,...,2.257143,1.888175,2.210679,1.557113,1.047221,2,1.128571,0.632456,0.468606,26.0
11,5,111.273574,63.713457,82.793319,37.934231,1.409442,1.335472,184.59060,27.848743,64.459004,...,2.242857,2.168944,2.206963,1.594167,1.173869,1,1.057143,0.400000,0.428809,26.0
12,5,92.729214,58.201829,73.132787,36.259297,1.449309,1.026457,122.90607,36.932426,47.094633,...,2.264286,1.888175,2.221652,1.557113,1.040517,2,1.135714,0.632456,0.456864,27.0
13,5,92.729214,58.518416,73.132787,36.396602,1.449309,1.057755,122.90607,36.161939,47.094633,...,2.257143,1.888175,2.210679,1.557113,1.047221,2,1.128571,0.632456,0.468606,27.0
14,5,92.729214,59.468178,73.132787,36.811646,1.449309,1.114758,122.90607,35.741099,47.094633,...,2.235714,1.888175,2.178087,1.557113,1.057441,2,1.114286,0.632456,0.501579,26.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21153,5,84.057041,82.936867,49.386684,53.403902,1.175160,1.189124,192.98100,34.616495,78.480209,...,4.276860,2.825235,3.737110,1.452044,1.155955,5,2.376033,1.854724,1.816420,4.5
21156,5,84.057041,86.850300,49.386684,55.405069,1.175160,1.151736,192.98100,40.656176,78.480209,...,4.340000,2.825235,3.878135,1.452044,1.175387,5,2.340000,1.854724,1.727542,4.7
21157,5,84.057041,86.970260,49.386684,55.787132,1.175160,1.157507,192.98100,40.196140,78.480209,...,4.300000,2.825235,3.772087,1.452044,1.169665,5,2.300000,1.854724,1.791647,4.4
21246,5,70.406250,43.096221,59.725961,31.931306,1.497519,1.466096,79.96060,11.170606,29.321360,...,2.147710,2.701920,2.103026,1.494365,1.315617,4,1.002954,1.549193,0.589456,78.0


<font size=4>
    
To select rows according to whether a value is missing (NaN), use **notnull()** or **isnull()**:



In [None]:
data.loc[data.number_of_elements.notnull()]
#data.loc[data.number_of_elements.isnull()]

Unnamed: 0,number_of_elements,mean_atomic_mass,wtd_mean_atomic_mass,gmean_atomic_mass,wtd_gmean_atomic_mass,entropy_atomic_mass,wtd_entropy_atomic_mass,range_atomic_mass,wtd_range_atomic_mass,std_atomic_mass,...,wtd_mean_Valence,gmean_Valence,wtd_gmean_Valence,entropy_Valence,wtd_entropy_Valence,range_Valence,wtd_range_Valence,std_Valence,wtd_std_Valence,critical_temp
0,4,88.944468,57.862692,66.361592,36.116612,1.181795,1.062396,122.90607,31.794921,51.968828,...,2.257143,2.213364,2.219783,1.368922,1.066221,1,1.085714,0.433013,0.437059,29.00
1,5,92.729214,58.518416,73.132787,36.396602,1.449309,1.057755,122.90607,36.161939,47.094633,...,2.257143,1.888175,2.210679,1.557113,1.047221,2,1.128571,0.632456,0.468606,26.00
2,4,88.944468,57.885242,66.361592,36.122509,1.181795,0.975980,122.90607,35.741099,51.968828,...,2.271429,2.213364,2.232679,1.368922,1.029175,1,1.114286,0.433013,0.444697,19.00
3,4,88.944468,57.873967,66.361592,36.119560,1.181795,1.022291,122.90607,33.768010,51.968828,...,2.264286,2.213364,2.226222,1.368922,1.048834,1,1.100000,0.433013,0.440952,22.00
4,4,88.944468,57.840143,66.361592,36.110716,1.181795,1.129224,122.90607,27.848743,51.968828,...,2.242857,2.213364,2.206963,1.368922,1.096052,1,1.057143,0.433013,0.428809,23.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21258,4,106.957877,53.095769,82.515384,43.135565,1.177145,1.254119,146.88130,15.504479,65.764081,...,3.555556,3.223710,3.519911,1.377820,0.913658,1,2.168889,0.433013,0.496904,2.44
21259,5,92.266740,49.021367,64.812662,32.867748,1.323287,1.571630,188.38390,7.353333,69.232655,...,2.047619,2.168944,2.038991,1.594167,1.337246,1,0.904762,0.400000,0.212959,122.10
21260,2,99.663190,95.609104,99.433882,95.464320,0.690847,0.530198,13.51362,53.041104,6.756810,...,4.800000,4.472136,4.781762,0.686962,0.450561,1,3.200000,0.500000,0.400000,1.98
21261,2,99.663190,97.095602,99.433882,96.901083,0.690847,0.640883,13.51362,31.115202,6.756810,...,4.690000,4.472136,4.665819,0.686962,0.577601,1,2.210000,0.500000,0.462493,1.84
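
The selection above keeps rows whose value is present. To see the opposite case, this sketch assumes a hypothetical column with two missing entries and also counts them.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({"critical_temp": [29.0, nan, 19.0, nan]})  # hypothetical values

missing = df.loc[df.critical_temp.isnull()]    # rows where the value is NaN
present = df.loc[df.critical_temp.notnull()]   # rows where the value exists
n_missing = int(df.critical_temp.isnull().sum())  # True counts as 1

print(n_missing)
print(present)
```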


<font size=4 color='indianred'>
    
> ### Useful functions

<font size=4>
    
**describe()** ---> applied to numeric values, gives a statistical summary.


In [None]:
data.number_of_elements.describe()
data.describe()


Unnamed: 0,number_of_elements,mean_atomic_mass,wtd_mean_atomic_mass,gmean_atomic_mass,wtd_gmean_atomic_mass,entropy_atomic_mass,wtd_entropy_atomic_mass,range_atomic_mass,wtd_range_atomic_mass,std_atomic_mass,...,wtd_mean_Valence,gmean_Valence,wtd_gmean_Valence,entropy_Valence,wtd_entropy_Valence,range_Valence,wtd_range_Valence,std_Valence,wtd_std_Valence,critical_temp
count,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,...,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0
mean,4.115224,87.557631,72.98831,71.290627,58.539916,1.165608,1.063884,115.601251,33.225218,44.391893,...,3.153127,3.056536,3.055885,1.295682,1.052841,2.04101,1.483007,0.839342,0.673987,34.421219
std,1.439295,29.676497,33.490406,31.030272,36.651067,0.36493,0.401423,54.626887,26.967752,20.03543,...,1.191249,1.046257,1.174815,0.393155,0.380291,1.242345,0.978176,0.484676,0.45558,34.254362
min,1.0,6.941,6.423452,5.320573,1.960849,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00021
25%,3.0,72.458076,52.143839,58.041225,35.24899,0.966676,0.775363,78.512902,16.824174,32.890369,...,2.116732,2.279705,2.091251,1.060857,0.775678,1.0,0.921454,0.451754,0.306892,5.365
50%,4.0,84.92275,60.696571,66.361592,39.918385,1.199541,1.146783,122.90607,26.636008,45.1235,...,2.618182,2.615321,2.434057,1.368922,1.166532,2.0,1.063077,0.8,0.5,20.0
75%,5.0,100.40441,86.10354,78.116681,73.113234,1.444537,1.359418,154.11932,38.356908,59.322812,...,4.026201,3.727919,3.914868,1.589027,1.330801,3.0,1.9184,1.2,1.020436,63.0
max,9.0,208.9804,208.9804,208.9804,208.9804,1.983797,1.958203,207.97246,205.58991,101.0197,...,7.0,7.0,7.0,2.141963,1.949739,6.0,6.9922,3.0,3.0,185.0


<font size=4>
    
**mean()** ---> returns the mean of a Series, or the mean of each column of a DataFrame.


In [None]:
data.mean_atomic_mass.mean()
data.mean()

number_of_elements        4.115224
mean_atomic_mass         87.557631
wtd_mean_atomic_mass     72.988310
gmean_atomic_mass        71.290627
wtd_gmean_atomic_mass    58.539916
                           ...    
range_Valence             2.041010
wtd_range_Valence         1.483007
std_Valence               0.839342
wtd_std_Valence           0.673987
critical_temp            34.421219
Length: 82, dtype: float64
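
The results of `mean()` and `describe()` are easier to verify by hand on a small table. A minimal sketch with hypothetical columns `a` and `b`:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

col_mean = df.a.mean()   # mean of a single Series
all_means = df.mean()    # Series with one mean per column
summary = df.describe()  # count, mean, std, min, quartiles, max

print(col_mean)
print(all_means)
print(summary.index.tolist())
```

If a DataFrame also contains non-numeric columns, passing `numeric_only=True` to `mean()` restricts the calculation to the numeric ones.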

<font size=4>
    


**unique()** ---> Returns the distinct values that appear in a specific column.



In [None]:
data.number_of_elements.unique()

# unique() is a Series method, not a DataFrame method

array([4, 5, 6, 3, 7, 2, 8, 9, 1], dtype=int64)

<font size=4>
    
**value_counts()** ---> Returns the number of samples for each value that a given column takes.

In [None]:
data.number_of_elements.value_counts()

5    5792
4    4496
3    3895
2    3280
6    2666
7     774
1     285
8      61
9      14
Name: number_of_elements, dtype: int64
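
Both functions can be checked on a short Series. This sketch uses six hypothetical values under the same column name as above.

```python
import pandas as pd

n = pd.Series([4, 5, 4, 2, 5, 5], name="number_of_elements")  # hypothetical values

distinct = n.unique()      # array of values in order of first appearance
counts = n.value_counts()  # sorted by count, largest first

print(distinct)
print(counts)
```

The index of `counts` holds the distinct values and the entries hold how often each one occurs.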

<font size=4>

Here are some links where you can find different datasets.

[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)

[Kaggle](https://www.kaggle.com/)

[sklearn](https://scikit-learn.org/stable/datasets/index.html)

[Google datasets](https://datasetsearch.research.google.com/)

[CDMX data](https://datos.cdmx.gob.mx/pages/home/)