<a href="https://colab.research.google.com/drive/1Ohbu1EgL06hQwXLKyAWvDEJeOnufZ0qD">Open this notebook in Google Colab</a>

# Introduction to Pandas

[Pandas](https://pandas.pydata.org/about/index.html) is a library that provides high-performance, easy-to-use data structures and data analysis tools.
* Its main data structure is the DataFrame, which can be thought of as an in-memory 2D table (like a spreadsheet, with column names and row labels).
* Many features available in Excel are available programmatically, such as creating pivot tables, computing columns from other columns, plotting charts, etc.
* It offers high performance when manipulating (joining, splitting, modifying…) large datasets.
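
As a rough illustration of the spreadsheet-like features listed above, the sketch below computes one column from two others and then builds a pivot table. The `sales` data and its column names are invented for this example and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical sales data, invented only for this illustration
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "units": [10, 4, 7, 12],
    "unit_price": [2.5, 4.0, 2.5, 4.0],
})

# Column computed from other columns
sales["revenue"] = sales["units"] * sales["unit_price"]

# Pivot table: one row per region, one column per product, summed revenue
pivot = sales.pivot_table(index="region", columns="product", values="revenue", aggfunc="sum")
print(pivot)
```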

## Import

In [2]:
# Install Pandas
!pip install pandas
!pip install pyarrow



In [3]:
import pandas as pd

## Data structures in Pandas

Broadly speaking, the Pandas library provides the following data structures:
* **Series**: A one-dimensional labeled array
* **DataFrame**: A two-dimensional table
* **Panel**: Similar to a dictionary of DataFrames. It was removed in pandas 0.25 and is not available in current versions.
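
In current pandas versions, data that would once have gone into a Panel is commonly held in a DataFrame with a multi-level row index. The following is a minimal sketch under that assumption; the `frames` dictionary and its yearly values are hypothetical.

```python
import pandas as pd

# Hypothetical data: one small DataFrame per year
frames = {
    2023: pd.DataFrame({"peso": [84, 90]}, index=["Santiago", "Pedro"]),
    2024: pd.DataFrame({"peso": [85, 88]}, index=["Santiago", "Pedro"]),
}

# Stack the dictionary into a single DataFrame with a two-level row index;
# the dictionary keys become the outer level
stacked = pd.concat(frames, names=["year", "name"])
print(stacked)

# Select the rows that belong to one key of the original dictionary
print(stacked.loc[2024])
```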

## Creating a Series object

In [4]:
# Create a Series object
s = pd.Series([2, 4, 6, 8, 10, 12])
s

0     2
1     4
2     6
3     8
4    10
5    12
dtype: int64

In [5]:
# Create a Series object initialized from a Python dictionary
height = {"Juan": 1.67, "Roberto": 1.65, "Manuel": 1.78, "Rafael": 1.63, "Jorge": 1.60, "Carlos": 1.93}
s = pd.Series(height)
s

Juan       1.67
Roberto    1.65
Manuel     1.78
Rafael     1.63
Jorge      1.60
Carlos     1.93
dtype: float64

In [6]:
# Create a Series object initialized from only some
# of the elements of a Python dictionary
height = {"Juan": 1.67, "Roberto": 1.65, "Manuel": 1.78, "Rafael": 1.63, "Jorge": 1.60, "Carlos": 1.93}
s = pd.Series(height, index = ["Juan", "Manuel"])
s

Juan      1.67
Manuel    1.78
dtype: float64

In [7]:
# Create a Series object initialized from a scalar (the value is repeated for every index label)
s = pd.Series(30, ["test1", "test2", "test3"])
s

test1    30
test2    30
test3    30
dtype: int64

## Accessing the elements of a Series object

Each element in a Series object has an identifier called the **_index label_**. Labels are usually unique, although pandas does not require it.

In [8]:
# Create a Series object
s = pd.Series([2, 4, 6, 8], index=["num1", "num2", "num3", "num4"])
s

num1    2
num2    4
num3    6
num4    8
dtype: int64

In [9]:
# Access the third element of the object by its label
s["num3"]

np.int64(6)

In [10]:
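# An integer key is also treated as a position when the index has no integer labels,
# but this fallback is deprecated (pandas 2.x emits a FutureWarning); prefer s.iloc[2]
print(s[2])

6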


In [11]:
# loc is the standard way to access an element of a Series object by label
s.loc["num2"]

np.int64(4)

In [12]:
# iloc is the standard way to access an element of a Series object by position
s.iloc[1]

np.int64(4)

In [13]:
# Access the third and fourth elements by position (positions 2 and 3; the stop position 4 is excluded)
s.iloc[2:4]

num3    6
num4    8
dtype: int64
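
Slices behave differently in `iloc` and `loc`, which is easy to get wrong. The short sketch below rebuilds the same Series and selects the second and third elements in both ways; the variable names `by_position` and `by_label` are chosen only for this example.

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8], index=["num1", "num2", "num3", "num4"])

# Positional slice: the stop position is excluded, so this returns positions 1 and 2
by_position = s.iloc[1:3]

# Label slice: both endpoints are included
by_label = s.loc["num2":"num3"]

# Both selections contain the same two elements
print(by_position.equals(by_label))
```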

## Arithmetic operations with Series

In [14]:
# Create a Series object
s = pd.Series([2, 4, 6, 8, 10, 12])
s

0     2
1     4
2     6
3     8
4    10
5    12
dtype: int64

In [15]:
# Series objects are similar to, and compatible with, NumPy arrays
import numpy as np
np.sum(s)

np.int64(42)

In [16]:
# The other NumPy arithmetic operations on arrays also work
# See the Introduction to NumPy for more information
s * 24772828

0     49545656
1     99091312
2    148636968
3    198182624
4    247728280
5    297273936
dtype: int64

In [None]:
np.mean(s)
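
The same reductions are also available as Series methods, and element-wise arithmetic returns a new Series. A small sketch using the Series defined above (the name `doubled` is introduced only for this example):

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 4, 6, 8, 10, 12])

# The method form and the NumPy function form compute the same value
print(s.mean(), np.mean(s))

# Element-wise arithmetic returns a new Series and leaves s unchanged
doubled = s * 2
print(doubled.tolist())
```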

## Plotting a Series object

In [17]:
# Create a Series object named Temperaturas
temperaturas = [4.4, 5.1, 6.1, 6.2, 6.1, 6.1, 5.7, 5.2, 4.7, 4.1, 3.9, 1.3, 1.3, 4.5, 9.9]
s = pd.Series(temperaturas, name="Temperaturas")
s

0     4.4
1     5.1
2     6.1
3     6.2
4     6.1
5     6.1
6     5.7
7     5.2
8     4.7
9     4.1
10    3.9
11    1.3
12    1.3
13    4.5
14    9.9
Name: Temperaturas, dtype: float64

In [None]:
# Plot the Series object
%matplotlib inline
import matplotlib.pyplot as plt

s.plot()
plt.show()

## Creating a DataFrame object

In [18]:
# Create a DataFrame initialized from a dictionary of Series objects
personas = {
    "peso": pd.Series([84, 90, 56, 64], ["Santiago","Pedro", "Ana", "Julia"]),
    "altura": pd.Series({"Santiago": 187, "Pedro": 178, "Julia": 170, "Ana": 165}),
    "hijos": pd.Series([2, 3], ["Pedro", "Julia"])
}

df = pd.DataFrame(personas)
df

Unnamed: 0,peso,altura,hijos
Ana,56,165,
Julia,64,170,3.0
Pedro,90,178,2.0
Santiago,84,187,


The DataFrame can be forced to contain specific columns and rows, in a specific order.

In [19]:
# Create a DataFrame initialized from only some elements of a dictionary
# of Series objects
personas = {
    "peso": pd.Series([84, 90, 56, 64], ["Santiago","Pedro", "Ana", "Julia"]),
    "altura": pd.Series({"Santiago": 187, "Pedro": 178, "Julia": 170, "Ana": 165}),
    "hijos": pd.Series([2, 3], ["Pedro", "Julia"])
}

df = pd.DataFrame(personas, columns = ["peso", "altura"], index=["Santiago", "Pedro"])
df

Unnamed: 0,peso,altura
Santiago,84,187
Pedro,90,178


In [None]:
# Create a DataFrame initialized from a Python list of lists
# Important: column names and index labels must be specified separately
valores = [
    [183, 4, 76],
    [170, 0, 65],
    [190, 1, 89],
    [176, 0, 76]
]

df = pd.DataFrame(valores, columns=["altura", "hijos", "peso"], index=["Raul", "Rafael", "Maria", "Roberto"])
df

In [22]:
# Create a DataFrame initialized from a Python dictionary of dictionaries
personas = {
    "altura": {"Santiago": 187, "Pedro": 178, "Julia": 170, "Ana": 165}, 
    "peso": {"Santiago": 87, "Pedro": 78, "Julia": 70, "Ana": 65}}

df = pd.DataFrame(personas)
df

Unnamed: 0,altura,peso
Santiago,187,87
Pedro,178,78
Julia,170,70
Ana,165,65


## Accessing the elements of a DataFrame

In [23]:
# Create a DataFrame initialized from a dictionary of Series objects

cart = {
    "price" : pd.Series(
        [100, 150, 299, 430, 600, 900, 1240], 
        ["Computer", "Television", "Smartphone", "Watch", "Mouse", "Headphones", "Chip"]),
    "color": pd.Series({"Computer": "black", "Television": "white", "Smartphone": "gray", "Watch": "black", "Mouse": "blue", "Headphones": "red"}),
    "brand": pd.Series(["Panasonic", "Apple", "Sony", "Huawei"],["Television", "Smartphone", "Headphones", "Computer"])
}
df = pd.DataFrame(cart)
df

Unnamed: 0,price,color,brand
Chip,1240,,
Computer,100,black,Huawei
Headphones,900,red,Sony
Mouse,600,blue,
Smartphone,299,gray,Apple
Television,150,white,Panasonic
Watch,430,black,


### Accessing the columns of the DataFrame

In [24]:
df["price"]

Chip          1240
Computer       100
Headphones     900
Mouse          600
Smartphone     299
Television     150
Watch          430
Name: price, dtype: int64

In [None]:
df[["price", "color"]]

In [25]:
# The previous methods can be combined with boolean expressions
df[(df["price"] < 600) & (df["brand"].notna())] 

Unnamed: 0,price,color,brand
Computer,100,black,Huawei
Smartphone,299,gray,Apple
Television,150,white,Panasonic


In [26]:
# Conditions can also compare a column against a specific value
df[(df["price"] < 600) & (df["color"] == "black")]

Unnamed: 0,price,color,brand
Computer,100,black,Huawei
Watch,430,black,
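
Boolean conditions can also be combined with `|` (or) and passed to `.loc` together with a column selection. The sketch below uses a reduced version of the cart data; the names `mask` and `cheap_or_gray` are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "price": [100, 150, 299, 430],
        "color": ["black", "white", "gray", "black"],
    },
    index=["Computer", "Television", "Smartphone", "Watch"],
)

# Build the mask first; each comparison needs its own parentheses
# because & and | bind more tightly than < and ==
mask = (df["price"] < 200) | (df["color"] == "gray")

# .loc accepts the mask for the rows and a list of labels for the columns
cheap_or_gray = df.loc[mask, ["price"]]
print(cheap_or_gray)

# isin tests membership in a list of values
print(df[df["color"].isin(["black", "white"])])
```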


### Accessing the rows of the DataFrame

In [27]:
# Access a row by its label; the result is a Series
df.loc["Computer"]

price       100
color     black
brand    Huawei
Name: Computer, dtype: object

In [28]:
type(df.loc["Computer"])

pandas.core.series.Series

In [29]:
df.iloc[2]

price     900
color     red
brand    Sony
Name: Headphones, dtype: object

In [31]:
df.iloc[0:3]

Unnamed: 0,price,color,brand
Chip,1240,,
Computer,100,black,Huawei
Headphones,900,red,Sony


### Advanced querying of the elements of a DataFrame

In [32]:
# Display the DataFrame
df

Unnamed: 0,price,color,brand
Chip,1240,,
Computer,100,black,Huawei
Headphones,900,red,Sony
Mouse,600,blue,
Smartphone,299,gray,Apple
Television,150,white,Panasonic
Watch,430,black,


In [35]:
df.query("price >= 900 and brand.notna()")

Unnamed: 0,price,color,brand
Headphones,900,red,Sony


## Copying a DataFrame

In [36]:
# Create a DataFrame initialized from a dictionary of Series objects
cart = {
    "price" : pd.Series(
        [100, 150, 299, 430, 600, 900, 1240], 
        ["Computer", "Television", "Smartphone", "Watch", "Mouse", "Headphones", "Chip"]),
    "color": pd.Series({"Computer": "black", "Television": "white", "Smartphone": "gray", "Watch": "black", "Mouse": "blue", "Headphones": "red"}),
    "brand": pd.Series(["Panasonic", "Apple", "Sony", "Huawei"],["Television", "Smartphone", "Headphones", "Computer"])
}
df = pd.DataFrame(cart)
df

Unnamed: 0,price,color,brand
Chip,1240,,
Computer,100,black,Huawei
Headphones,900,red,Sony
Mouse,600,blue,
Smartphone,299,gray,Apple
Television,150,white,Panasonic
Watch,430,black,


In [38]:
# Copy the DataFrame df into df_copy
# Important: modifying an element of df_copy does not modify df
df_copy = df.copy()
df_copy

Unnamed: 0,price,color,brand
Chip,1240,,
Computer,100,black,Huawei
Headphones,900,red,Sony
Mouse,600,blue,
Smartphone,299,gray,Apple
Television,150,white,Panasonic
Watch,430,black,
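
To make the difference concrete, the following sketch changes a value in a copy and checks that the original is unaffected, in contrast with plain assignment, which only creates a second name for the same object. The small DataFrame and the name `df_alias` are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({"price": [100, 150]}, index=["Computer", "Television"])

df_alias = df          # plain assignment: a second name for the same object
df_copy = df.copy()    # independent copy of the data (deep by default)

df_copy.loc["Computer", "price"] = 999

print(df_alias is df)                     # the alias is the original object
print(df.loc["Computer", "price"])        # the original value is unchanged
print(df_copy.loc["Computer", "price"])   # only the copy was modified
```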


## Modifying a DataFrame

In [39]:
# Create a DataFrame initialized from a dictionary of Series objects
cart = {
    "price" : pd.Series(
        [100, 150, 299, 430, 600, 900, 1240], 
        ["Computer", "Television", "Smartphone", "Watch", "Mouse", "Headphones", "Chip"]),
    "color": pd.Series({"Computer": "black", "Television": "white", "Smartphone": "gray", "Watch": "black", "Mouse": "blue", "Headphones": "red"}),
    "brand": pd.Series(["Panasonic", "Apple", "Sony", "Huawei"],["Television", "Smartphone", "Headphones", "Computer"])
}

df = pd.DataFrame(cart)
df

Unnamed: 0,price,color,brand
Chip,1240,,
Computer,100,black,Huawei
Headphones,900,red,Sony
Mouse,600,blue,
Smartphone,299,gray,Apple
Television,150,white,Panasonic
Watch,430,black,


In [42]:
# Add a new column to the DataFrame (one value per row, in row order)
df["discount(%)"] = [10,15,5,5,10,5,25]
df

Unnamed: 0,price,color,brand,discount(%)
Chip,1240,,,10
Computer,100,black,Huawei,15
Headphones,900,red,Sony,5
Mouse,600,blue,,5
Smartphone,299,gray,Apple,10
Television,150,white,Panasonic,5
Watch,430,black,,25


In [48]:
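# Add new computed columns to the DataFrame
# (the outputs below also include a "total" column holding the same values,
# so it is created here to keep the later cells runnable)
df["total"] = df["price"] * (1 - df["discount(%)"] / 100)
df["total_with_discount"] = df["price"] * (1 - df["discount(%)"] / 100)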
df

Unnamed: 0,price,color,brand,discount(%),total,total_with_discount
Chip,1240,,,10,1116.0,1116.0
Computer,100,black,Huawei,15,85.0,85.0
Headphones,900,red,Sony,5,855.0,855.0
Mouse,600,blue,,5,570.0,570.0
Smartphone,299,gray,Apple,10,269.1,269.1
Television,150,white,Panasonic,5,142.5,142.5
Watch,430,black,,25,322.5,322.5


In [50]:
# Add a new column by creating a new DataFrame
df_mod = df.assign(year=[2016, 2021, 2023, 2022, 2024, 2022, 2025])  # Does not modify the original
df_mod

Unnamed: 0,price,color,brand,discount(%),total,total_with_discount,year
Chip,1240,,,10,1116.0,1116.0,2016
Computer,100,black,Huawei,15,85.0,85.0,2021
Headphones,900,red,Sony,5,855.0,855.0,2023
Mouse,600,blue,,5,570.0,570.0,2022
Smartphone,299,gray,Apple,10,269.1,269.1,2024
Television,150,white,Panasonic,5,142.5,142.5,2022
Watch,430,black,,25,322.5,322.5,2025


In [51]:
df

Unnamed: 0,price,color,brand,discount(%),total,total_with_discount
Chip,1240,,,10,1116.0,1116.0
Computer,100,black,Huawei,15,85.0,85.0
Headphones,900,red,Sony,5,855.0,855.0
Mouse,600,blue,,5,570.0,570.0
Smartphone,299,gray,Apple,10,269.1,269.1
Television,150,white,Panasonic,5,142.5,142.5
Watch,430,black,,25,322.5,322.5


In [None]:
# Delete an existing column from the DataFrame (modifies df in place)
del df["total"]

In [54]:
df

Unnamed: 0,price,color,brand,discount(%),total_with_discount
Chip,1240,,,10,1116.0
Computer,100,black,Huawei,15,85.0
Headphones,900,red,Sony,5,855.0
Mouse,600,blue,,5,570.0
Smartphone,299,gray,Apple,10,269.1
Television,150,white,Panasonic,5,142.5
Watch,430,black,,25,322.5


In [56]:
# Delete an existing column, returning a copy of the resulting DataFrame
df_mod = df.drop(["discount(%)"], axis=1)

In [57]:
df_mod

Unnamed: 0,price,color,brand,total_with_discount
Chip,1240,,,1116.0
Computer,100,black,Huawei,85.0
Headphones,900,red,Sony,855.0
Mouse,600,blue,,570.0
Smartphone,299,gray,Apple,269.1
Television,150,white,Panasonic,142.5
Watch,430,black,,322.5


In [58]:
df

Unnamed: 0,price,color,brand,discount(%),total_with_discount
Chip,1240,,,10,1116.0
Computer,100,black,Huawei,15,85.0
Headphones,900,red,Sony,5,855.0
Mouse,600,blue,,5,570.0
Smartphone,299,gray,Apple,10,269.1
Television,150,white,Panasonic,5,142.5
Watch,430,black,,25,322.5


In [59]:
dir(df)

['T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__arrow_c_stream__',
 '__bool__',
 '__class__',
 '__contains__',
 '__copy__',
 '__dataframe__',
 '__dataframe_consortium_standard__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__firstlineno__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pandas_priority__',
 '__pos__',
 ...

## Evaluating expressions on a DataFrame

In [61]:
# Create a DataFrame initialized from a dictionary of Series objects
cart = {
    "price" : pd.Series(
        [100, 150, 299, 430, 600, 900, 1240], 
        ["Computer", "Television", "Smartphone", "Watch", "Mouse", "Headphones", "Chip"]),
    "color": pd.Series({"Computer": "black", "Television": "white", "Smartphone": "gray", "Watch": "black", "Mouse": "blue", "Headphones": "red"}),
    "brand": pd.Series(["Panasonic", "Apple", "Sony", "Huawei"],["Television", "Smartphone", "Headphones", "Computer"])
}

df = pd.DataFrame(cart)
df

Unnamed: 0,price,color,brand
Chip,1240,,
Computer,100,black,Huawei
Headphones,900,red,Sony
Mouse,600,blue,
Smartphone,299,gray,Apple
Television,150,white,Panasonic
Watch,430,black,


In [62]:
# Evaluate an expression on a column of the DataFrame
df.eval("price / 2")

Chip          620.0
Computer       50.0
Headphones    450.0
Mouse         300.0
Smartphone    149.5
Television     75.0
Watch         215.0
Name: price, dtype: float64

In [66]:
# Assign the resulting value to a new column
df.eval("mean_price = price / 2", inplace=True)
df

Unnamed: 0,price,color,brand,mean_price
Chip,1240,,,620.0
Computer,100,black,Huawei,50.0
Headphones,900,red,Sony,450.0
Mouse,600,blue,,300.0
Smartphone,299,gray,Apple,149.5
Television,150,white,Panasonic,75.0
Watch,430,black,,215.0


In [67]:
# Evaluate an expression that uses a local variable (referenced with @)
max_price = 700
df.eval("price > @max_price")

Chip           True
Computer      False
Headphones     True
Mouse         False
Smartphone    False
Television    False
Watch         False
Name: price, dtype: bool

In [69]:
# Apply an external function to a column of the DataFrame
# (returns a new Series; df itself is not modified)
def add100dollars(x):
    return x + 100

df["price"].apply(add100dollars)


Chip          1340
Computer       200
Headphones    1000
Mouse          700
Smartphone     399
Television     250
Watch          530
Name: price, dtype: int64

In [70]:
df

Unnamed: 0,price,color,brand,mean_price
Chip,1240,,,620.0
Computer,100,black,Huawei,50.0
Headphones,900,red,Sony,450.0
Mouse,600,blue,,300.0
Smartphone,299,gray,Apple,149.5
Television,150,white,Panasonic,75.0
Watch,430,black,,215.0
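
For simple arithmetic such as this, a vectorized expression gives the same result as `apply`. The sketch below compares both on a reduced price column and shows how to keep the result by assigning it back; the names `prices`, `via_apply` and `vectorized` are introduced only for this example.

```python
import pandas as pd

prices = pd.Series([100, 150, 299], index=["Computer", "Television", "Smartphone"], name="price")

def add100dollars(x):
    return x + 100

via_apply = prices.apply(add100dollars)   # calls the function once per element
vectorized = prices + 100                 # same result as a single array operation

print(via_apply.equals(vectorized))

# apply returns a new Series; assign it back to keep the result
prices = prices.apply(add100dollars)
print(prices)
```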


## Saving and loading the DataFrame

In [71]:
# Create a DataFrame initialized from a dictionary of Series objects
cart = {
    "price" : pd.Series(
        [100, 150, 299, 430, 600, 900, 1240], 
        ["Computer", "Television", "Smartphone", "Watch", "Mouse", "Headphones", "Chip"]),
    "color": pd.Series({"Computer": "black", "Television": "white", "Smartphone": "gray", "Watch": "black", "Mouse": "blue", "Headphones": "red"}),
    "brand": pd.Series(["Panasonic", "Apple", "Sony", "Huawei"],["Television", "Smartphone", "Headphones", "Computer"])
}
df = pd.DataFrame(cart)
df

Unnamed: 0,price,color,brand
Chip,1240,,
Computer,100,black,Huawei
Headphones,900,red,Sony
Mouse,600,blue,
Smartphone,299,gray,Apple
Television,150,white,Panasonic
Watch,430,black,


In [72]:
# Save the DataFrame as CSV, HTML and JSON
df.to_csv("df_cart.csv")
df.to_html("df_cart.html")
df.to_json("df_cart.json")

In [73]:
# Load the DataFrame from the CSV file
# (without index_col, the row labels are read back as an ordinary column)
df2 = pd.read_csv("df_cart.csv")

In [74]:
df2

Unnamed: 0.1,Unnamed: 0,price,color,brand
0,Chip,1240,,
1,Computer,100,black,Huawei
2,Headphones,900,red,Sony
3,Mouse,600,blue,
4,Smartphone,299,gray,Apple
5,Television,150,white,Panasonic
6,Watch,430,black,


In [75]:
# Load the DataFrame using the first column as the index
df2 = pd.read_csv("df_cart.csv", index_col=0)
df2

Unnamed: 0,price,color,brand
Chip,1240,,
Computer,100,black,Huawei
Headphones,900,red,Sony
Mouse,600,blue,
Smartphone,299,gray,Apple
Television,150,white,Panasonic
Watch,430,black,
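
The round trip can also be inspected without writing a file, which makes it easier to see why `index_col=0` is needed. This is a minimal sketch on a two-row DataFrame; the in-memory buffer and the names `csv_text` and `restored` are assumptions made for this example.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {"price": [1240, 100], "brand": [None, "Huawei"]},
    index=["Chip", "Computer"],
)

# With no path, to_csv returns the CSV text instead of writing a file;
# the first header field is empty because the index has no name
csv_text = df.to_csv()
print(csv_text)

# index_col=0 turns the first CSV column back into the row labels
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```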


In [76]:
df2

Unnamed: 0,price,color,brand
Chip,1240,,
Computer,100,black,Huawei
Headphones,900,red,Sony
Mouse,600,blue,
Smartphone,299,gray,Apple
Television,150,white,Panasonic
Watch,430,black,
