In [20]:
import pandas as pd
import numpy as np
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df = pd.read_csv(url)



In [21]:
city_mpg = df.city08
highway_mpg = df.highway08

city_mpg, highway_mpg

(0        19
 1         9
 2        23
 3        10
 4        17
          ..
 41139    19
 41140    20
 41141    18
 41142    18
 41143    16
 Name: city08, Length: 41144, dtype: int64,
 0        25
 1        14
 2        33
 3        12
 4        23
          ..
 41139    26
 41140    28
 41141    24
 41142    24
 41143    21
 Name: highway08, Length: 41144, dtype: int64)

The dir function lists the attributes of an object. Taking its length shows how many attributes a series has.

In [22]:
len(dir(city_mpg))

420
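As a small sketch of what that list contains, the snippet below filters the output of dir down to the names wrapped in double underscores. The three-value series `s` is a stand-in made up for illustration, not the dataset column.

```python
import pandas as pd

s = pd.Series([19, 9, 23])  # illustrative stand-in for city_mpg

# Keep only the names that start and end with double underscores.
dunders = [name for name in dir(s)
           if name.startswith('__') and name.endswith('__')]

print('__add__' in dunders, '__iter__' in dunders)
```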

<h1>Operators & Dunder Methods</h1>

Dunder (double-underscore) methods determine how Python reacts to operators. Each operator is implemented by a corresponding dunder method, which a class can overload.

When you run 2 + 4, under the covers Python runs (2).__add__(4).

In [23]:
print(2 + 4)
print((2).__add__(4))

6
6
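The same mechanism is available to your own classes. The sketch below uses a hypothetical `Mpg` class (invented for illustration; it is not part of pandas or the dataset) that defines `__add__`, so the + operator dispatches to it.

```python
class Mpg:
    """Hypothetical wrapper around a miles-per-gallon value (illustration only)."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # Python calls this method when it evaluates `self + other`.
        return Mpg(self.value + other.value)


total = Mpg(19) + Mpg(25)         # operator form
same = Mpg(19).__add__(Mpg(25))   # explicit dunder call
print(total.value, same.value)
```

A pandas series supports + in the same way, by defining `__add__` itself.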


<h2>Index Alignment</h2>

You can apply most math operations to a series with another series or with a scalar.
When you operate on two series, pandas aligns the indexes first, matching each index entry in the series on the left with the entries that have the same index on the right.

Because of index alignment, you want to make sure that the indexes:
<ul>
    <li>Are unique</li>
    <li>Are common to both series</li>
</ul>

In [24]:
(city_mpg + highway_mpg) / 2

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64

If you don't have matching, distinct indexes, you will end up with missing values and combinations from the duplicates.

In [25]:
s1 = pd.Series([10,20,30], index=[1,2,2])
s2 = pd.Series([35,44,53], index=[2,2,4], name='s2')

In [26]:
s1 + s2

1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64
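One way to spot these problems before doing the math is to inspect the indexes directly. This is a minimal sketch that rebuilds the two series from above; the variable names `unique_ok` and `unmatched` are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([35, 44, 53], index=[2, 2, 4], name='s2')

# Both indexes repeat the label 2, so neither is unique.
unique_ok = s1.index.is_unique and s2.index.is_unique

# Labels found in only one of the two series end up as NaN after alignment.
unmatched = sorted(set(s1.index) ^ set(s2.index))

print(unique_ok)
print(unmatched)
```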

<h2>Broadcasting</h2>

When you perform math operations with a scalar, pandas broadcasts the operation to all values.
Broadcasting is fast because a numeric pandas series is stored as a block of memory, so the operation runs as optimized, vectorized code instead of a Python loop.
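A minimal sketch of broadcasting, using a short series built by hand (the values are the first three city08 entries shown earlier):

```python
import pandas as pd

mpg = pd.Series([19, 9, 23], name='city08')

# The scalar 2 is applied to every element; no explicit loop is needed.
doubled = mpg * 2
print(doubled.tolist())
```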

<h2>Iteration</h2>

The .__iter__ method is what allows iteration in a for loop.
You should avoid using a for loop with a series, because you lose the benefits of vectorization.
There are better ways to search and filter than using a for loop.
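To make the contrast concrete, the sketch below filters the same hand-built series twice: once by iterating, once with a boolean mask. The series and the threshold of 15 are chosen only for illustration.

```python
import pandas as pd

mpg = pd.Series([19, 9, 23, 10, 17])

# Loop version: walks the values one at a time in Python.
loop_result = [value for value in mpg if value > 15]

# Vectorized version: build a boolean mask, then index with it.
vector_result = mpg[mpg.gt(15)]

print(loop_result)
print(vector_result.tolist())
```

Both approaches select the same values, but the mask version stays inside pandas and keeps the original index labels.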

<h2>Operators</h2>

Pandas also provides methods for standard operators, like add.
This lets you change the behavior by using different parameters.
For example, the add method has the optional fill_value parameter, which substitutes a value for entries that are missing on one side before the operation is applied.
Using the .add method with default parameters will produce the same result as the + operator.

In [27]:
s1 + s2

1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64

In [28]:
s1.add(s2)

1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64
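The following sketch shows the effect of the fill_value parameter of the add method. It uses two small series with unique index labels (named `a` and `b` here purely for illustration) so the result is easy to follow.

```python
import pandas as pd

a = pd.Series([10, 20], index=[1, 2])
b = pd.Series([35, 53], index=[2, 4])

plain = a + b                     # labels 1 and 4 have no partner, so they become NaN
filled = a.add(b, fill_value=0)   # a missing partner is treated as 0 instead

print(plain)
print(filled)
```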

<h2>Aggregate Methods</h2>

Aggregate methods collapse the values of a series down to a scalar.
These are typically the numbers used for reporting.

In [29]:
city_mpg.mean()

18.369045304297103

<h2>Quantile</h2>

The quantile method returns a quantile, the 50% quantile (the median) by default.
You can also pass in a list of quantiles and get a series result.
The requested quantiles form the index of this series.

In [30]:
city_mpg.quantile()

17.0

In [31]:
city_mpg.quantile(.9)

24.0

In [32]:
city_mpg.quantile([.1, .5, .9])

0.1    13.0
0.5    17.0
0.9    24.0
Name: city08, dtype: float64

<h2>Count and Mean of an Attribute</h2>

To count the values that meet some criterion, you can use the sum method on a mask of the series.

In [33]:
city_mpg.gt(20).sum()

10272

You can use the mean method to get the percentage of values that meet a criterion. The mean of the mask is a fraction, so multiply by 100 to express it as a percentage.

In [34]:
city_mpg.gt(20).mul(100).mean()

24.965973167412017

This works because Python treats boolean values as integers (True is 1 and False is 0), so the sum of the boolean mask is the number of items that meet the criterion.

Calculating the mean returns the fraction of items that are true.
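A short sketch on a hand-built series makes the arithmetic easy to check by eye. The values and the threshold of 20 are illustrative only.

```python
import pandas as pd

mpg = pd.Series([19, 9, 23, 10, 25])
mask = mpg.gt(20)        # False, False, True, False, True

count = mask.sum()       # each True counts as 1
fraction = mask.mean()   # number of True values divided by the length

print(count, fraction)
```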

<h2>.agg and Aggregation Strings</h2>
The .agg method on a series lets you specify aggregate functions by their string names.
You can also supply your own functions.
Pandas will try to map the string names to series methods.

In [35]:
def second_to_last(s):
    return s.iloc[-2]

In [36]:
city_mpg.agg(['mean', np.var, max, second_to_last])

mean               18.369045
var                62.503036
max               150.000000
second_to_last     18.000000
Name: city08, dtype: float64

In [44]:
def not_na(s):
    # count() already skips missing values, so subtract from size instead
    return s.size - s.isna().sum()

In [38]:
city_mpg.size - city_mpg.isna().sum()

41144

In [45]:
def unique_items(s):
    return len(s.unique())

In [41]:
len(city_mpg.unique())

105

In [42]:
city_mpg.mean()

18.369045304297103

In [43]:
city_mpg.max()

150

In [46]:
city_mpg.agg([not_na,'count', unique_items, 'mean', 'max'])

not_na          41144.000000
count           41144.000000
unique_items      105.000000
mean               18.369045
max               150.000000
Name: city08, dtype: float64
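The shape of the result depends on what you pass to .agg. As a minimal sketch on a hand-built series (values chosen for illustration), a single string name gives back a scalar, while a list of names gives back a series indexed by those names.

```python
import pandas as pd

mpg = pd.Series([19, 9, 23, 10, 17])

single = mpg.agg('mean')             # one name returns a scalar
several = mpg.agg(['min', 'max'])    # a list returns a series

print(single)
print(several)
```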