In [None]:
# MIT License

# Copyright (c) GDSC UNI

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

<table align="center">
  <td align="center"><a target="_blank" href="https://gdsc.community.dev/universidad-nacional-de-ingenieria/">
        <img src="https://i.ibb.co/pX2w52P/GDSC.png" style="padding-bottom:5px;" />
      View GDSC UNI</a></td>

  <td align="center"><a target="_blank" href="https://colab.research.google.com/drive/1fMyfLTTgV2kjV22Q7W_bwV3ucAxZUQoV?usp=sharing">
        <img src="https://i.ibb.co/Bf0HK0q/Colaboratory.png"  style="padding-bottom:5px;" />Run in Google Colab </a></td>

  <td align="center"><a target="_blank" href="https://github.com/GDSC-UNI/Pandas-For-Data-Science/blob/main/PFDS7_Operaciones_matem%C3%A1ticas.ipynb">
        <img src="https://i.ibb.co/VHHdRx2/Github.png"  height="110px" style="padding-bottom:5px;"/>View source on GitHub</a></td>
</table>




<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:#000080">PFDS7:</span> Mathematical Operations with DataFrames</h1>
<hr>

In our analyses we will need to perform operations between columns, or apply a function to a particular column or to the whole DataFrame. Pandas offers methods that keep applying mathematical functions to a dataset from becoming a headache. To learn these methods we will use the "London bike sharing" dataset, taken from <a href="https://www.kaggle.com/hmavrodiev/london-bike-sharing-dataset">Kaggle</a>. The dataset contains the following columns:
* "timestamp" = Timestamp used to group the data
* "cnt" = Number of new bike shares
* "t1" = Real temperature in °C
* "t2" = "Feels like" temperature in °C
* "hum" = Humidity in percent
* "wind_speed" = Wind speed in km/h
* "weather_code" = Weather category
* "is_holiday" = Boolean value
* "is_weekend" = Boolean value
* "season" = 0 - spring, 1 - summer, 2 - autumn, 3 - winter.


In [None]:
import pandas as pd

df = pd.read_csv('./Datasets/london_merged.csv')
# df.head(5) 

Since this is the first time we work with this dataset, we will start with an analysis based on what we learned in previous notebooks.

In [None]:
df.dtypes

timestamp        object
cnt               int64
t1              float64
t2              float64
hum             float64
wind_speed      float64
weather_code    float64
is_holiday      float64
is_weekend      float64
season          float64
dtype: object

# Datetime

Here we run into a kind of data we have not seen before, time, which is currently stored as type *object*. Next we convert this column to datetime, which makes this kind of data much easier to work with.

In [None]:
df["timestamp"] = pd.to_datetime(df["timestamp"])
df.dtypes

timestamp       datetime64[ns]
cnt                      int64
t1                     float64
t2                     float64
hum                    float64
wind_speed             float64
weather_code           float64
is_holiday             float64
is_weekend             float64
season                 float64
dtype: object

Once the column is of type datetime, the `.dt` accessor gives us attributes that return the hour, day, month, or year of every value in the column.

In [None]:
df["timestamp"].dt.day.head(5)

0    4
1    4
2    4
3    4
4    4
Name: timestamp, dtype: int64

In [None]:
df["timestamp"].dt.hour.head(5)

0    0
1    1
2    2
3    3
4    4
Name: timestamp, dtype: int64
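As a small complement, the sketch below repeats the conversion on a toy DataFrame and extracts a few more components. The three timestamps are made-up values (they are not rows of the London dataset), and the new column names are only illustrative.

```python
import pandas as pd

# Toy data: hypothetical timestamps stored as plain strings (dtype object)
toy = pd.DataFrame(
    {"timestamp": ["2015-01-04 00:00:00", "2015-01-04 13:00:00", "2016-07-15 08:00:00"]}
)
toy["timestamp"] = pd.to_datetime(toy["timestamp"])

# Each .dt attribute returns a Series aligned with the original index
toy["year"] = toy["timestamp"].dt.year
toy["month"] = toy["timestamp"].dt.month
toy["hour"] = toy["timestamp"].dt.hour
# day_name() is a method, not an attribute, and returns the weekday as text
toy["weekday"] = toy["timestamp"].dt.day_name()

print(toy)
```

Storing the components as columns makes it easy to group or filter by them later.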

# Column operations

Now that our data is workable, that is, with every column converted to its appropriate type, we can start doing mathematical operations. The "t1" column holds the real temperature in degrees Celsius, a scale widely used around the world; to express it in kelvin, the SI unit, we add 273 to each value (the exact offset is 273.15; this notebook uses the rounded 273). With Pandas there is no need to add element by element: the sum can be applied to every element of the column as follows.

In [None]:
df["t1"] + 273

0        276.0
1        276.0
2        275.5
3        275.0
4        275.0
         ...  
17409    278.0
17410    278.0
17411    278.5
17412    278.5
17413    278.0
Name: t1, Length: 17414, dtype: float64
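The expression above only returns a new Series; it does not modify `df`. A minimal sketch of how the result could be kept as a new column, using a toy DataFrame with made-up temperatures and hypothetical column names (`t1_kelvin`, `t1_fahrenheit`), and the exact offset 273.15:

```python
import pandas as pd

# Hypothetical temperatures in °C (not taken from the London dataset)
temps = pd.DataFrame({"t1": [3.0, 2.5, 2.0]})

# The scalar is broadcast to every row; the original column is left untouched
temps["t1_kelvin"] = temps["t1"] + 273.15

# Operators can be chained, for example Celsius to Fahrenheit
temps["t1_fahrenheit"] = temps["t1"] * 9 / 5 + 32

print(temps)
```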

In the same way, we can apply a mathematical function to every element of a column (or of a whole DataFrame) simply by passing it as the argument of the function.

In [None]:
import numpy as np

np.cos(df["hum"])

0        0.317429
1        0.317429
2       -0.629900
3        0.862319
4        0.317429
           ...   
17409    0.776686
17410    0.776686
17411   -0.999207
17412    0.824331
17413    0.824331
Name: hum, Length: 17414, dtype: float64
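The sketch below illustrates that NumPy functions of this kind preserve the pandas container and its index: a Series goes in and a Series comes out, and the same holds for a DataFrame. The values and the index labels are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data with a custom index to show that labels are preserved
toy = pd.DataFrame(
    {"hum": [93.0, 81.0], "wind_speed": [6.0, 19.0]},
    index=["a", "b"],
)

log_hum = np.log(toy["hum"])  # Series in, Series out
roots = np.sqrt(toy)          # DataFrame in, DataFrame out

print(log_hum)
print(roots)
```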

In [None]:
# Dot product of the two temperature columns: sum of t1[i] * t2[i]
df["t1"] @ df["t2"]

3135730.277777778

In [None]:
# With ord=0, norm counts the non-zero entries of the vector
np.linalg.norm(df["t1"], ord=0)

17380.0
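The two cells above compute a dot product and a vector "norm" of order 0. A small sketch with made-up values that are easy to check by hand:

```python
import numpy as np
import pandas as pd

a = pd.Series([3.0, 0.0, 4.0])
b = pd.Series([1.0, 2.0, 0.5])

# The @ operator on two Series is the dot product: 3*1 + 0*2 + 4*0.5
dot = a @ b
same_dot = (a * b).sum()

# ord=0 counts the non-zero entries; the default (ord=None) is the Euclidean norm
nonzero_count = np.linalg.norm(a, ord=0)
euclidean = np.linalg.norm(a)  # sqrt(9 + 0 + 16)

print(dot, same_dot, nonzero_count, euclidean)
```

This suggests how to read the result 17380.0 above: it is the number of rows whose "t1" value is different from zero.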

If we subtract one column from a subset of the rows of another, the rows that are missing on one side produce missing values in the result, as we saw in notebook 5. To avoid null values in the result, we use the sub method and specify the value that will stand in for the missing data.

In [None]:
df['t1'].iloc[::2] - df['t2']

0        1.0
1        NaN
2        0.0
3        NaN
4        2.0
        ... 
17409    NaN
17410    4.0
17411    NaN
17412    4.0
17413    NaN
Length: 17414, dtype: float64

In [None]:
df["t1"].iloc[::2].sub(df["t2"], fill_value=np.mean(df["t1"]))

0         1.000000
1         9.968091
2         0.000000
3        10.468091
4         2.000000
           ...    
17409    11.468091
17410     4.000000
17411    10.968091
17412     4.000000
17413    11.468091
Length: 17414, dtype: float64
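To see the index alignment more clearly, here is a minimal sketch with two tiny Series (made-up values). The label 1 exists only in `right`, so the plain subtraction yields NaN there, while `sub` with `fill_value=0` treats the missing left value as 0.

```python
import pandas as pd

left = pd.Series([10.0, 30.0], index=[0, 2])         # label 1 is missing here
right = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])

# Plain operator: labels present on only one side become NaN
plain = left - right

# sub with fill_value: the missing left value is replaced by 0 before subtracting
filled = left.sub(right, fill_value=0)

print(plain)
print(filled)
```

Which `fill_value` makes sense depends on the analysis; the cell above uses the mean of "t1", and 0 is used here only to keep the arithmetic obvious.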

# Apply

A more general way to perform an operation on the columns of our DataFrame is the *apply* method. The first examples below call it on a single column (a Series), where the function receives one element at a time.

<code>DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)</code>

In [None]:
def celcius2kelvin(x: float) -> float:
    """Convert a temperature from degrees Celsius to kelvin (rounded offset 273)."""
    return x + 273

In [None]:
df["t1"].apply(celcius2kelvin)

0        276.0
1        276.0
2        275.5
3        275.0
4        275.0
         ...  
17409    278.0
17410    278.0
17411    278.5
17412    278.5
17413    278.0
Name: t1, Length: 17414, dtype: float64

In [None]:
df["t1"].apply(lambda x: x+273)

0        276.0
1        276.0
2        275.5
3        275.0
4        275.0
         ...  
17409    278.0
17410    278.0
17411    278.5
17412    278.5
17413    278.0
Name: t1, Length: 17414, dtype: float64
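For simple arithmetic, `apply` and the vectorized operator give the same result, so the operator is usually enough. `apply` becomes more useful when the function has logic that is awkward to write with operators, such as a condition. A minimal sketch with made-up temperatures (the labels "cold" and "mild" and the 2.5 threshold are arbitrary choices for illustration):

```python
import pandas as pd

t1 = pd.Series([3.0, 2.5, 2.0])

# Same result two ways: element-wise apply vs. vectorized addition
via_apply = t1.apply(lambda x: x + 273)
vectorized = t1 + 273

# A function with a branch is a more natural fit for apply
labels = t1.apply(lambda x: "cold" if x < 2.5 else "mild")

print(via_apply.equals(vectorized))
print(labels)
```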

If the function takes additional arguments, for example the constant coefficients of a quadratic function, we can pass them through the args parameter of the apply method.

In [None]:
def function(x: float, a: float = 1, b: float = 0) -> float:
    """Evaluate the quadratic x**2 + a*x + b."""
    return x**2 + a*x + b

In [None]:
df["t1"].apply(function, args=(20,-100))

0       -31.00
1       -31.00
2       -43.75
3       -56.00
4       -56.00
         ...  
17409    25.00
17410    25.00
17411    40.25
17412    40.25
17413    25.00
Name: t1, Length: 17414, dtype: float64
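Besides `args`, extra keyword arguments given to `apply` are forwarded to the function, which can be easier to read. The sketch below uses a hypothetical function name `quadratic` and three made-up temperatures to show that both forms agree.

```python
import pandas as pd


def quadratic(x: float, a: float = 1, b: float = 0) -> float:
    """Evaluate x**2 + a*x + b."""
    return x**2 + a * x + b


t1 = pd.Series([3.0, 2.5, 2.0])

# Positional extras go in args, in the order the function declares them
positional = t1.apply(quadratic, args=(20, -100))

# Keyword extras are passed straight through to the function
keyword = t1.apply(quadratic, a=20, b=-100)

print(positional)
print(positional.equals(keyword))
```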

In [None]:
# With the default axis=0, the function receives each column as a Series
df.apply(lambda x: x.mean())

timestamp       2016-01-03 22:31:00.571953664
cnt                               1143.101642
t1                                  12.468091
t2                                  11.520836
hum                                 72.324954
wind_speed                          15.913063
weather_code                         2.722752
is_holiday                           0.022051
is_weekend                           0.285403
season                               1.492075
dtype: object

In [None]:
# With axis=1, the function receives each row; iloc[:, 1:] skips the timestamp column
df.iloc[:,1:].apply(lambda x: x.std(), axis=1)

0         63.543511
1         51.226078
2         51.069873
3         38.017905
4         32.204856
            ...    
17409    343.649814
17410    177.453828
17411    110.404591
17412     74.094300
17413     48.459032
Length: 17414, dtype: float64
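The difference between the two cells above is the `axis` parameter. A small sketch on a two-row toy DataFrame (made-up values) where the results can be checked by hand:

```python
import pandas as pd

toy = pd.DataFrame({"t1": [3.0, 5.0], "t2": [1.0, 1.0]})

# axis=0 (default): the function sees each column, giving one value per column
col_means = toy.apply(lambda col: col.mean())

# axis=1: the function sees each row, giving one value per row
row_ranges = toy.apply(lambda row: row.max() - row.min(), axis=1)

print(col_means)
print(row_ranges)
```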

# Applymap


Another method for applying a function to every element of our DataFrame is applymap (since pandas 2.1 it is deprecated in favor of the equivalent DataFrame.map).

<code>DataFrame.applymap(func, na_action=None, **kwargs)</code>

Among its parameters is na_action, whose default value None means that no special action is taken and NaN values are passed to the function like any other value. The other accepted value is 'ignore', which propagates NaN values without passing them to the function.

In [None]:
df.iloc[:,1:].applymap(lambda x: x/100)

         cnt     t1     t2    hum  wind_speed  weather_code  is_holiday  is_weekend  season
0       1.82  0.030  0.020  0.930       0.060          0.03         0.0        0.01    0.03
1       1.38  0.030  0.025  0.930       0.050          0.01         0.0        0.01    0.03
2       1.34  0.025  0.025  0.965       0.000          0.01         0.0        0.01    0.03
3       0.72  0.020  0.020  1.000       0.000          0.01         0.0        0.01    0.03
4       0.47  0.020  0.000  0.930       0.065          0.01         0.0        0.01    0.03
...      ...    ...    ...    ...         ...           ...         ...         ...     ...
17409  10.42  0.050  0.010  0.810       0.190          0.03         0.0        0.00    0.03
17410   5.41  0.050  0.010  0.810       0.210          0.04         0.0        0.00    0.03
17411   3.37  0.055  0.015  0.785       0.240          0.04         0.0        0.00    0.03
17412   2.24  0.055  0.015  0.760       0.230          0.04         0.0        0.00    0.03
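The effect of `na_action='ignore'` is easiest to see with a function that would fail on NaN. The sketch below uses a toy DataFrame with one missing value (made-up data) and a hypothetical helper `truncate`. Because `applymap` is deprecated in recent pandas versions, the code picks `DataFrame.map` when it exists and falls back to `applymap` otherwise; this selection is an assumption about the installed version, not part of the original notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"hum": [93.0, np.nan], "t1": [3.0, 2.5]})


def truncate(x):
    # int(float("nan")) raises ValueError, so NaN must never reach this function
    return int(x)


# DataFrame.map replaces applymap from pandas 2.1 on; older versions only have applymap
elementwise = toy.map if hasattr(toy, "map") else toy.applymap

# With na_action="ignore" the NaN is propagated without calling truncate on it
truncated = elementwise(truncate, na_action="ignore")

print(truncated)
```

Without `na_action="ignore"`, the same call would raise an error when `truncate` receives the NaN.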
