
In [1]:
import pandas as pd
URL = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'

In [2]:
df = pd.read_csv(URL)
city_mpg = df.city08
highway_mpg = df.highway08


### Dunder Methods

Pure Python example:

In [3]:
2 + 4

6

Under the hood Python runs this:

In [4]:
(2).__add__(4)

6

A Python integer has an `.__add__` method, and that method is what responds to the `+` operator. Because a `Series` object also has this method, you can use `+` on it. There is also a `.__truediv__` method that supports division (`.__div__` was the Python 2 name). One way to calculate the average of the two series is the following:

In [5]:
(city_mpg + highway_mpg)/2

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64
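
To make the link between operators and dunder methods concrete, here is a minimal sketch that calls the dunder methods on a series directly. The `city` and `highway` series are small hypothetical stand-ins, not values taken from the vehicles dataset.

```python
import pandas as pd

# Hypothetical stand-in values (assumption: not the real vehicles data)
city = pd.Series([19, 9, 23])
highway = pd.Series([25, 14, 33])

# Operator form: what you would normally write
via_operator = (city + highway) / 2

# Explicit dunder form: roughly what Python dispatches to for + and /
via_dunder = city.__add__(highway).__truediv__(2)

print(via_operator.equals(via_dunder))
print(via_operator.tolist())
```

In everyday code you would write the operator form; the dunder calls are shown only to illustrate the dispatch.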

### Index Alignment

Of note, you can apply most math operations on a series with another series, and you can also use a scalar (as we did with the division). When you operate with two series, pandas will *align* the index before performing the operation.
Aligning will take each index entry in the left series and match it up with every entry with the same name in the index of the right series. In the above case, values with the same index name are added together and then divided by 2. These operations return a `Series` object.

Because of index alignment, you will want to make sure that the indexes:


*   Are unique (no duplicates).
*   Are common to both series.

If these conditions do not hold, you will get missing values or a combinatoric explosion of results.


In [6]:
s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([35, 44, 53], index=[2, 2, 4], name='s2')

In [7]:
s1

1    10
2    20
2    30
dtype: int64

In [8]:
s2

2    35
2    44
4    53
Name: s2, dtype: int64

In [9]:
s1 + s2

1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64

Note that index names 1 and 4 have `NaN`, while index name 2 has four results: every 2 from `s1` is matched up with every 2 from `s2`.
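
One way to catch these problems before operating is to inspect the two indexes first. The sketch below is only an illustration: `alignment_report` is a hypothetical helper written for these notes, not a pandas function. It relies on the `Index.is_unique` attribute and a plain set comparison of the labels.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([35, 44, 53], index=[2, 2, 4], name='s2')

def alignment_report(left, right):
    """Hypothetical helper: summarize whether two indexes align cleanly."""
    return {
        'left_unique': left.index.is_unique,      # no duplicate labels on the left
        'right_unique': right.index.is_unique,    # no duplicate labels on the right
        'same_labels': set(left.index) == set(right.index),  # labels common to both
    }

report = alignment_report(s1, s2)
print(report)

# Three rows on each side, yet the aligned result has six rows
result_length = len(s1 + s2)
print(result_length)
```

If any of the three checks fails, expect `NaN` values, extra rows, or both in the result.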

### Broadcasting

When you perform math operations with a scalar, pandas `broadcasts` the operation to all values. In the averaging example above, every value of the summed series was divided by 2.

In [10]:
s1

1    10
2    20
2    30
dtype: int64

In [11]:
s2

2    35
2    44
4    53
Name: s2, dtype: int64

In [12]:
(s1).add(s2, fill_value=0)

1    10.0
2    55.0
2    64.0
2    65.0
2    74.0
4    53.0
dtype: float64

There is another advantage to broadcasting: many math operations are optimized and run very quickly on the CPU. This is called `vectorization`. (A numeric pandas series is a block of memory, and modern CPUs leverage a technology called Single Instruction/Multiple Data (SIMD) to apply a math operation to the block of memory.)

Operations that are available include: `+`, `-`, `*`, `/`, `//` (floor division), `%` (modulus), `@` (matrix multiplication), `**` (power), `<`, `<=`, `==`, `!=`, `>=`, `>`, `&` (binary and), `^` (binary xor), `|` (binary or).
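
As a quick illustration of broadcasting, the sketch below applies a few of the operators listed above between a small series and a scalar. The series `s` and its values are made up for this example.

```python
import pandas as pd

s = pd.Series([10, 20, 30])  # hypothetical values

doubled = s * 2                  # the scalar 2 is broadcast to every value
floored = s // 7                 # floor division
remainder = s % 7                # modulus
squared = s ** 2                 # power
is_large = s > 15                # comparison returns a boolean series
in_range = (s > 15) & (s < 30)   # combine boolean series with &

print(doubled.tolist())    # [20, 40, 60]
print(floored.tolist())    # [1, 2, 4]
print(remainder.tolist())  # [3, 6, 2]
print(squared.tolist())    # [100, 400, 900]
print(is_large.tolist())   # [False, True, True]
print(in_range.tolist())   # [False, True, False]
```

The parentheses around each comparison matter because `&` binds more tightly than `>` and `<` in Python.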


### Iterations
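As a rule, do not use a `for` loop to iterate through a series. A series does have an `.__iter__` method, so looping works, but it runs value by value in Python and gives up the speed of vectorized operations.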
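
The sketch below contrasts the two approaches on small hypothetical series. Both produce the same values; the loop is shown only to make clear what the vectorized expression replaces.

```python
import pandas as pd

# Hypothetical stand-in values (assumption: not the real vehicles data)
city = pd.Series([19, 9, 23])
highway = pd.Series([25, 14, 33])

# Loop version: possible because a series is iterable, but each step runs in Python
looped = pd.Series([(c + h) / 2 for c, h in zip(city, highway)])

# Vectorized version: one expression applied to the whole series at once
vectorized = (city + highway) / 2

print(looped.equals(vectorized))
```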

### Operator Methods

Why does pandas provide methods for the standard operators (like `.add` for `+`)?

**A:** In general, functions and methods have parameters that let you `parameterize`, or change, their behavior. Operators have no such parameters: the dunder methods generally fill in `NaN` (or `<NA>` for Int64) when one of the operands is missing following index alignment. A method such as `.add` accepts a `fill_value` parameter to change that.

In [13]:
s1 + s2

1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64

In [14]:
(s1).add(s2)

1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64

In [15]:
(s1).add(s2, fill_value=0)

1    10.0
2    55.0
2    64.0
2    65.0
2    74.0
4    53.0
dtype: float64

In this last example we used `fill_value=0` to indicate that when an index name in one series has no match in the other, pandas should use 0 for the missing operand.
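
The same behavior is easier to see with unique index names. This is a small sketch with made-up series `a` and `b`; it also shows the nullable `Int64` case, where the gaps appear as `<NA>` rather than `NaN`.

```python
import pandas as pd

# Hypothetical series with unique but only partly overlapping index names
a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([10, 20], index=['y', 'z'])

plain = a + b                     # 'x' and 'z' have no partner, so they become NaN
filled = a.add(b, fill_value=0)   # the missing side is treated as 0

print(plain)
print(filled)

# With the nullable Int64 dtype the missing results are <NA> instead of NaN
nullable = a.astype('Int64') + b.astype('Int64')
print(nullable)
```

Note that `fill_value` works in both directions: `'x'` is missing from `b` and `'z'` is missing from `a`, and both get a result.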

### Chaining

Another stylistic reason to prefer the method to the operator is that it makes `chaining` manipulation easy. Because most pandas methods do not mutate data in place but instead return a new object, we can keep tacking on method calls to the returned object.

Chaining makes the code easy to read and understand. We can chain with operators as well, but it requires that we wrap the operation with parentheses.

In [16]:
# Operators
((city_mpg + highway_mpg)
 / 2
)

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64

In [17]:
# Methods
(city_mpg
 .add(highway_mpg)
 .div(2)
)

# We can read this as "we are taking the city_mpg series, then we are adding the highway series to it. Finally, we are dividing by two."

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64
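
Because each method returns a new series, the chain can keep growing. The sketch below extends the same pattern with two further steps, `.clip` and `.rename`, on small hypothetical series; the cap of 25 and the name `avg_mpg` are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical stand-in values (assumption: not the real vehicles data)
city = pd.Series([19, 9, 23])
highway = pd.Series([25, 14, 33])

avg_mpg = (city
 .add(highway)       # add the highway series
 .div(2)             # divide by two to get the average
 .clip(upper=25)     # cap the average at 25 (arbitrary limit)
 .rename('avg_mpg')  # give the resulting series a name
)

print(avg_mpg.tolist())  # [22.0, 11.5, 25.0]
```

Each line reads as one step, and no intermediate variables are needed.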

Methods on page 41/42.