
<h1 id="data_acquisition">1. Data Acquisition</h1>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    
<p>
Datasets come in several formats, such as .csv, .xlsx, and .json, and can be stored in different places, either locally or online.
The most common format is CSV (Comma-Separated Values), because it takes up less space and can be read with many editors.
The term "spreadsheet" usually implies a program for viewing and working with the data, such as Excel.

We will work <b>ONLY</b> with the <b>DATA</b>: when we read an Excel file, we will read ONLY the data (no formatting or formulas).
<ul>
    <li>data source: <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ESedX20601300-2021-01-01" target="_blank">https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data</a></li>
    <li>data type: csv</li>
</ul>

</div>

<hr>


In [1]:
# Import the pandas library
import pandas as pd

The name "<b>pandas</b>" is derived from <i>panel data</i>, an econometrics term for multidimensional
structured datasets, and is also a play on the phrase <i>Python data analysis</i>.

pandas has two main data structures: <b>DataFrame</b> and <b>Series</b>.

<h2> SERIES </h2>

A Series is a one-dimensional array-like object containing a sequence of <b>values</b> (of
types similar to NumPy types) and an associated array of data labels, called its <b>index</b>.

In [2]:
#Series (dtype: int64)
serie1 = pd.Series([8,9,0,-5,7])
serie1

0    8
1    9
2    0
3   -5
4    7
dtype: int64

In [3]:
#Series (dtype: float64)
serie2 = pd.Series([8.0, 9, 0, -5, 7])
serie2

0    8.0
1    9.0
2    0.0
3   -5.0
4    7.0
dtype: float64

In [4]:
#Series (dtype: object)
serie3 = pd.Series(['poco', 9, 0,-5,7])
serie3

0    poco
1       9
2       0
3      -5
4       7
dtype: object

In [5]:
print(serie1*4)
print(serie2*4)
print(serie3*4)

0    32
1    36
2     0
3   -20
4    28
dtype: int64
0    32.0
1    36.0
2     0.0
3   -20.0
4    28.0
dtype: float64
0    pocopocopocopoco
1                  36
2                   0
3                 -20
4                  28
dtype: object


In [6]:
print(serie3.index)
print(serie3.values)

RangeIndex(start=0, stop=5, step=1)
['poco' 9 0 -5 7]


In [7]:
# Replace the default integer index with string labels
serie3.index=['a','b','c','d','e']
print(serie3)

a    poco
b       9
c       0
d      -5
e       7
dtype: object


In [8]:
indices4 = ['Id1', 'Id2', 'Id3', 'Id4']
valores4 = [1.3, 2.4, -3.8, 1e-4]
serie4 = pd.Series(valores4, index=indices4)
serie4

Id1    1.3000
Id2    2.4000
Id3   -3.8000
Id4    0.0001
dtype: float64

In [10]:
serie3['a']


'poco'

In [13]:
serie3[['a','c']]

a    poco
c       0
dtype: object

In [14]:
serie2.index=['Id1', 'Id2', 'Id3', 'Id4', 'Id5']
print(serie2)
serie2_4 = serie2 + serie4
serie2_4

Id1    8.0
Id2    9.0
Id3    0.0
Id4   -5.0
Id5    7.0
dtype: float64


Id1     9.3000
Id2    11.4000
Id3    -3.8000
Id4    -4.9999
Id5        NaN
dtype: float64
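
The NaN for Id5 in the sum above appears because arithmetic between two Series aligns on index labels, and Id5 exists in only one of them. As a minimal sketch (the names `left` and `right` are illustrative stand-ins, not objects from this notebook), the `add` method with `fill_value` treats a missing label as a chosen value instead:

```python
import pandas as pd

# Illustrative stand-ins for serie2 and serie4 (hypothetical names)
left = pd.Series([8.0, 9.0, 0.0, -5.0, 7.0],
                 index=['Id1', 'Id2', 'Id3', 'Id4', 'Id5'])
right = pd.Series([1.3, 2.4, -3.8, 1e-4],
                  index=['Id1', 'Id2', 'Id3', 'Id4'])

# The + operator aligns on labels; a label present in only one Series gives NaN
plain_sum = left + right

# add() with fill_value substitutes 0 for a label missing from one side
filled_sum = left.add(right, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Whether a missing label should count as zero or stay NaN depends on what the data means, so `fill_value` is a choice rather than a default.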

In [15]:
serie4.name = 'Serie_prueba4'
serie4.index.name = 'Ids'
serie4

Ids
Id1    1.3000
Id2    2.4000
Id3   -3.8000
Id4    0.0001
Name: Serie_prueba4, dtype: float64

<h2> DATAFRAMES </h2>

A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string,
boolean, etc.). The DataFrame has both a row and column index.

While a DataFrame is physically two-dimensional, you can use it to
represent higher dimensional data in a tabular format using hierarchical
indexing.
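
As a rough illustration of hierarchical indexing, the sketch below stores a value for each (year, quarter) pair in an ordinary two-dimensional DataFrame. The data and the names `df_multi`, `sales_2021`, and `table` are made up for this example.

```python
import pandas as pd

# Hypothetical data: one measurement per (year, quarter) pair
index = pd.MultiIndex.from_product([[2020, 2021], ['Q1', 'Q2']],
                                   names=['year', 'quarter'])
df_multi = pd.DataFrame({'sales': [10.0, 12.0, 11.0, 15.0]}, index=index)

# Selecting on the outer level returns the rows for that year
sales_2021 = df_multi.loc[2021]

# unstack() pivots the inner level into columns, giving a year x quarter table
table = df_multi['sales'].unstack('quarter')

print(df_multi)
print(table)
```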

In [16]:
# Creating a DataFrame from a dictionary of lists, arrays or tuples
data = {'col1':[1.0, 2.0, 3.0], 'col2':[11.1, 22.2, 33.3], 'col3':[-0.1, -0.2, -0.3]}
cols = ['col1', 'col3', 'col2']
df1 = pd.DataFrame(data, columns=cols, index=['ind1', 'ind2', 'ind3'])
df1

      col1  col3  col2
ind1   1.0  -0.1  11.1
ind2   2.0  -0.2  22.2
ind3   3.0  -0.3  33.3


In [17]:
#Creating a DataFrame from ndarray
import numpy as np
data = np.array([[1.1, 2.2, 3.3], [-0.11, -0.22, -0.33], [11.0, 22.0, 33.0]])
df2 = pd.DataFrame(data, columns=cols, index=['ind11', 'ind22', 'ind33'])
df2

        col1   col3   col2
ind11   1.10   2.20   3.30
ind22  -0.11  -0.22  -0.33
ind33  11.00  22.00  33.00


In [18]:
#Creating a DataFrame from dictionary of Series
df3 = pd.DataFrame({'col_serie4': serie4, 'col_serie2': serie2}, columns=['col_serie2', 'col_serie4'])
df3

     col_serie2  col_serie4
Id1         8.0      1.3000
Id2         9.0      2.4000
Id3         0.0     -3.8000
Id4        -5.0      0.0001
Id5         7.0         NaN


In [19]:
df3b = pd.DataFrame({'col_serie4': serie4, 'col_serie3': serie3}, columns=['col_serie4', 'col_serie3'])
df3b

     col_serie4  col_serie3
Id1      1.3000         NaN
Id2      2.4000         NaN
Id3     -3.8000         NaN
Id4      0.0001         NaN
a           NaN        poco
b           NaN           9
c           NaN           0
d           NaN          -5
e           NaN           7


In [20]:
df3.index.name = 'ÍNDICES'
df3.columns.name = 'COLUMNAS'
df3

COLUMNAS  col_serie2  col_serie4
ÍNDICES
Id1              8.0      1.3000
Id2              9.0      2.4000
Id3              0.0     -3.8000
Id4             -5.0      0.0001
Id5              7.0         NaN


In [21]:
df3.values

array([[ 8.0e+00,  1.3e+00],
       [ 9.0e+00,  2.4e+00],
       [ 0.0e+00, -3.8e+00],
       [-5.0e+00,  1.0e-04],
       [ 7.0e+00,      nan]])

<h2 align=center> Important: </h2>
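Depending on the pandas version, the column returned from indexing a DataFrame may be a <b>view</b> on the
underlying data rather than a copy. In that case, any in-place modifications to the
Series are reflected in the DataFrame, as in the cells that follow. With Copy-on-Write enabled
(opt-in in pandas 2.x and the announced default for pandas 3.0), the Series instead behaves like a copy
and the DataFrame is left unchanged. The column can always be
explicitly copied with the Series’s <i>copy</i> method.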
In [22]:
col2 = df3['col_serie2']
col2

ÍNDICES
Id1    8.0
Id2    9.0
Id3    0.0
Id4   -5.0
Id5    7.0
Name: col_serie2, dtype: float64

In [23]:
print(type(df3))
print(type(col2))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [24]:
col2['Id1']=8.8
df3

COLUMNAS  col_serie2  col_serie4
ÍNDICES
Id1              8.8      1.3000
Id2              9.0      2.4000
Id3              0.0     -3.8000
Id4             -5.0      0.0001
Id5              7.0         NaN


In [26]:
col2b = df3['col_serie2'].copy()
col2b['Id1']=9.99
print(df3)
print(col2b)


COLUMNAS  col_serie2  col_serie4
ÍNDICES                         
Id1              8.8      1.3000
Id2              9.0      2.4000
Id3              0.0     -3.8000
Id4             -5.0      0.0001
Id5              7.0         NaN
ÍNDICES
Id1    9.99
Id2    9.00
Id3    0.00
Id4   -5.00
Id5    7.00
Name: col_serie2, dtype: float64


Add a new column from a list, ndarray, or tuple.


In [27]:
data5list = [5.1, 5.2, 5.3, 5.4, 5.5]
# data5tuple = (5.1, 5.2, 5.3, 5.4, 5.5)
# data5array = np.array(data5list)

df3['col_new']= data5list
df3

COLUMNAS  col_serie2  col_serie4  col_new
ÍNDICES
Id1              8.8      1.3000      5.1
Id2              9.0      2.4000      5.2
Id3              0.0     -3.8000      5.3
Id4             -5.0      0.0001      5.4
Id5              7.0         NaN      5.5


Delete a column or a row with the <b>drop</b> method (it drops the specified labels from rows or columns).

A column can also be deleted in place with Python's <b>del</b> statement (<i>del df['col']</i>).


In [31]:
# Alternative that modifies df3 in place: del df3['col_new']
df3b = df3.drop('col_new', axis='columns', inplace=False)
print(df3)
df3b

COLUMNAS  col_serie2  col_serie4  col_new
ÍNDICES                                  
Id1              8.8      1.3000      5.1
Id2              9.0      2.4000      5.2
Id3              0.0     -3.8000      5.3
Id4             -5.0      0.0001      5.4
Id5              7.0         NaN      5.5


COLUMNAS  col_serie2  col_serie4
ÍNDICES
Id1              8.8      1.3000
Id2              9.0      2.4000
Id3              0.0     -3.8000
Id4             -5.0      0.0001
Id5              7.0         NaN


In [32]:
df3c = df3.drop('Id4', axis=0)
df3c

COLUMNAS  col_serie2  col_serie4  col_new
ÍNDICES
Id1              8.8      1.3000      5.1
Id2              9.0      2.4000      5.2
Id3              0.0     -3.8000      5.3
Id5              7.0         NaN      5.5


In [33]:
df3d = df3.copy()
df3d

COLUMNAS  col_serie2  col_serie4  col_new
ÍNDICES
Id1              8.8      1.3000      5.1
Id2              9.0      2.4000      5.2
Id3              0.0     -3.8000      5.3
Id4             -5.0      0.0001      5.4
Id5              7.0         NaN      5.5


In [34]:
del df3d['col_new']
df3d

COLUMNAS  col_serie2  col_serie4
ÍNDICES
Id1              8.8      1.3000
Id2              9.0      2.4000
Id3              0.0     -3.8000
Id4             -5.0      0.0001
Id5              7.0         NaN


<h2>Index objects</h2>
pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names).

In [35]:
indices3 = df3.index
indices3

Index(['Id1', 'Id2', 'Id3', 'Id4', 'Id5'], dtype='object', name='ÍNDICES')

In [36]:
data3 = df3.values
df4 = pd.DataFrame(data3, index=indices3)
df4

            0        1    2
ÍNDICES
Id1       8.8   1.3000  5.1
Id2       9.0   2.4000  5.2
Id3       0.0  -3.8000  5.3
Id4      -5.0   0.0001  5.4
Id5       7.0      NaN  5.5


In [44]:
df3.columns

Index(['col_serie2', 'col_serie4', 'col_new'], dtype='object', name='COLUMNAS')

In [45]:
df4.columns

RangeIndex(start=0, stop=3, step=1)

In [46]:
df3.index

Index(['Id1', 'Id2', 'Id3', 'Id4', 'Id5'], dtype='object', name='ÍNDICES')
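
Two properties of Index objects are worth seeing in code: they are immutable, and they support membership tests and set-like operations. The sketch below uses a made-up Index named `labels`, which is not part of the notebook's data.

```python
import pandas as pd

# Hypothetical labels for illustration
labels = pd.Index(['Id1', 'Id2', 'Id3'], name='ids')

# Index objects are immutable: item assignment raises TypeError
try:
    labels[0] = 'IdX'
    modified = True
except TypeError:
    modified = False

# Membership tests and set-like operations are supported
has_id2 = 'Id2' in labels
combined = labels.union(pd.Index(['Id4']))

print(modified, has_id2, list(combined))
```

Immutability is what makes it safe for several Series and DataFrames to share the same Index object.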

<b>reindex</b> can alter either the (row) index, columns, or both. When
passed only a sequence, it reindexes the rows in the result. The columns can be reindexed with the <i>columns</i> keyword.

In [47]:
df3e = df3.reindex(columns=['col_serie4','col_serie2'])
df3f = df3e.reindex(['Id1','Id2','Id3','Id5','Id4','Id6'])
df3f

COLUMNAS  col_serie4  col_serie2
ÍNDICES
Id1           1.3000         8.8
Id2           2.4000         9.0
Id3          -3.8000         0.0
Id5              NaN         7.0
Id4           0.0001        -5.0
Id6              NaN         NaN


<h3>Dropping entries</h3>

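By default, the <b>drop</b> method returns a new object with the indicated value or values deleted from
an axis. With <i>inplace=True</i>, as in the next cell, it modifies the object in place and returns None.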

In [48]:
df3f.drop('Id6',axis=0, inplace=True)
df3f

COLUMNAS  col_serie4  col_serie2
ÍNDICES
Id1           1.3000         8.8
Id2           2.4000         9.0
Id3          -3.8000         0.0
Id5              NaN         7.0
Id4           0.0001        -5.0


In [49]:
df3

COLUMNAS  col_serie2  col_serie4  col_new
ÍNDICES
Id1              8.8      1.3000      5.1
Id2              9.0      2.4000      5.2
Id3              0.0     -3.8000      5.3
Id4             -5.0      0.0001      5.4
Id5              7.0         NaN      5.5


In [50]:
df3g = df3.reindex(columns=['col_new','col_serie4','col_serie2'])
df3g


COLUMNAS  col_new  col_serie4  col_serie2
ÍNDICES
Id1           5.1      1.3000         8.8
Id2           5.2      2.4000         9.0
Id3           5.3     -3.8000         0.0
Id4           5.4      0.0001        -5.0
Id5           5.5         NaN         7.0


In [51]:
df3h = df3g[['col_serie2','col_serie4']]
df3h

COLUMNAS  col_serie2  col_serie4
ÍNDICES
Id1              8.8      1.3000
Id2              9.0      2.4000
Id3              0.0     -3.8000
Id4             -5.0      0.0001
Id5              7.0         NaN


<h3> Indexing </h3>

<b> loc </b> and <b>iloc</b>
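
A minimal sketch of the difference between the two indexers: `loc` selects by label and `iloc` selects by integer position. The frame `df_sel` and its column names are invented for this example.

```python
import pandas as pd

# Hypothetical frame with string row labels
df_sel = pd.DataFrame(
    {'col_a': [8.8, 9.0, 0.0], 'col_b': [1.3, 2.4, -3.8]},
    index=['Id1', 'Id2', 'Id3'],
)

# loc selects by label; label slices include both endpoints
by_label = df_sel.loc['Id1':'Id2', 'col_b']

# iloc selects by integer position; position slices exclude the stop
by_position = df_sel.iloc[0:2, 1]

# A single cell, addressed by label and by position
cell_label = df_sel.loc['Id3', 'col_a']
cell_position = df_sel.iloc[2, 0]

print(by_label)
print(by_position)
```

Both slices return the same two rows here, which shows the endpoint rule: 'Id1':'Id2' includes Id2, while 0:2 stops before position 2.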

<h3> Reading data file </h3>

<b> read_csv </b> and <b>read_excel</b>
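
To try `read_csv` without a network connection, the sketch below reads CSV text from memory. The text and the column names are made up; `header=None` says the first line is data rather than column names, and `names` supplies the labels.

```python
import io
import pandas as pd

# Hypothetical CSV text with no header row, standing in for a real file
csv_text = "alfa,1.5,10\nbeta,2.5,20\ngamma,3.5,30\n"

# header=None: the first line is data; names: labels for the columns
df_csv = pd.read_csv(io.StringIO(csv_text), header=None,
                     names=['name', 'value', 'count'])

print(df_csv)
print(df_csv.dtypes)
```

`pd.read_excel` accepts similar `header` and `names` arguments, but it needs an Excel engine such as openpyxl installed, so it is not exercised here.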

In [None]:

# Read the online file from the URL and assign it to the variable "df"
other_path = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"
df = pd.read_csv(other_path, header=None)

In [None]:
# Check the type of the object we created
type(df)

After reading the dataset, we can use the <code>dataframe.head(n)</code> method to inspect the first n rows of the DataFrame, where n is an integer. Conversely, <code>dataframe.tail(n)</code> shows the last n rows.

In [None]:
df.head(5)

In [None]:
df.tail(5)
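
The same two methods can be tried on a small frame built in memory. The frame `df_rows` is invented for this sketch.

```python
import pandas as pd

# Hypothetical ten-row frame
df_rows = pd.DataFrame({'n': range(10)})

first_three = df_rows.head(3)   # rows at positions 0, 1, 2
last_two = df_rows.tail(2)      # rows at positions 8, 9
default_head = df_rows.head()   # n defaults to 5 when omitted

print(first_three)
print(last_two)
```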

For users of the R language for statistical computing, the DataFrame name will be
familiar, as the object was named after the similar R data.frame object. Unlike
Python, data frames are built into the R programming language and its standard
library. As a result, many features found in pandas are typically either part of the R
core implementation or provided by add-on packages.

<h2> Exercises </h2>


<ol>
    <li> Read an Excel file and rename its columns to 'col1', 'col2', etc. </li>
    <li> Rename the index labels to 'Id1', 'Id2', etc.</li>
</ol>