**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
  - [Importing the data](#toc1_2_)    
- [Aggregate methods](#toc2_)    
  - [Multiple aggregation: The _.agg()_ function](#toc2_1_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=5
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

**Read the official documentation on pandas Series objects @ https://pandas.pydata.org/pandas-docs/stable/reference/series.html**

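**`Note:`** Many Python built-ins work directly on pandas Series objects, e.g. **len, type, dir, sum, sorted, max, min** and the **in** operator (which tests the index labels, not the values). `mean` and `product` are not Python built-ins; use the Series methods `.mean()` and `.prod()` instead.

Also, the notion of **chaining functions/methods** in pandas is similar to Python.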
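A minimal sketch of both points, using a small made-up Series (`mpg` is a hypothetical stand-in for a column such as `city08`, not the real data):

```python
import pandas as pd

# Hypothetical toy data, not taken from the vehicles dataset
mpg = pd.Series([19, 9, 23, 10, 17], name="mpg")

# Python built-ins operate on the Series values
n_values = len(mpg)    # 5
total = sum(mpg)       # 78
ordered = sorted(mpg)  # [9, 10, 17, 19, 23]

# `in` tests membership in the index, not in the values
label_found = 4 in mpg           # True: 4 is an index label
value_as_label = 23 in mpg       # False: 23 is a value, not a label
value_found = 23 in mpg.values   # True: search the underlying array instead

# Chaining: each method returns a new object, so calls can be strung together
top_two_mean = mpg.sort_values(ascending=False).head(2).mean()  # (23 + 19) / 2 = 21.0
```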

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
# import statements
import numpy as np
import pandas as pd

### <a id='toc1_2_'></a>[Importing the data](#toc0_)

One of the many datasets we will use for our examples in this notebook is the `Data/vehicles.csv.zip` dataset.

In [2]:
# read the vehicles.csv dataset
df = pd.read_csv("Data/vehicles.csv.zip")

Columns of a DataFrame can be accessed in several ways, one of which is **dot (`.`) notation**.

In [3]:
# The city08 and highway08 columns of the vehicles.csv dataset give the
# miles per gallon achieved while driving in the city and on the highway, respectively.
city_mpg = df.city08
highway_mpg = df.highway08

In [4]:
# The make column in the vehicles dataset holds the manufacturer name (strings) and is stored as an object.
manufac = df.make
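Dot notation is convenient but has limits. The sketch below uses a hypothetical toy frame (the `city mpg` and `count` columns are invented for illustration) to show when bracket notation is needed instead:

```python
import pandas as pd

# Hypothetical toy frame; only "make" mirrors a real column name
toy = pd.DataFrame(
    {
        "make": ["Ford", "Dodge"],
        "city mpg": [19, 9],  # name contains a space
        "count": [1, 2],      # name clashes with the DataFrame.count method
    }
)

by_dot = toy.make            # works: valid identifier, no name clash
by_bracket = toy["make"]     # equivalent bracket access

spaced = toy["city mpg"]     # dot notation cannot express a name with a space
count_col = toy["count"]     # toy.count would return the method, not the column
```

Bracket notation works for every column name, so it is the safer default in reusable code.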

**Note:** The first thing we should do after loading a dataset is check the datatype of each column and cast columns to more suitable datatypes where possible. This saves memory and can speed up code execution.
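As a rough sketch of that workflow, assuming a toy frame whose column names mirror the vehicles dataset (the values and the chosen target dtypes are illustrative, not a recommendation for the real data):

```python
import pandas as pd

# Hypothetical toy frame; values are made up
toy = pd.DataFrame(
    {
        "city08": [19, 9, 23, 10],
        "make": ["Alfa Romeo", "Ferrari", "Dodge", "Dodge"],
    }
)

print(toy.dtypes)  # inspect the inferred dtypes first

# Small integers fit in a narrower integer type; repeated strings suit "category"
compact = toy.astype({"city08": "int16", "make": "category"})

print(compact.dtypes)

# Compare total memory in bytes before and after the cast
print(toy.memory_usage(deep=True).sum(), "->", compact.memory_usage(deep=True).sum())
```

On a frame this small the difference is negligible; the benefit is expected mainly on large columns with many repeated values.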

---------------------------------------------

## <a id='toc2_'></a>[Aggregate methods](#toc0_)

------------------------------------------------

**Aggregate methods collapse the values of a series down to a scalar**, allowing you to take detailed data and summarize it as a single value, e.g. sum, count, mean, median etc.

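> Some of the commonly used aggregate and summary methods are listed below. Not all of them return a scalar: `nsmallest()`, `nlargest()`, `mode()`, `unique()`, `value_counts()`, `describe()` and the cumulative methods (`cummax()`, `cumsum()`, `cumprod()`) return a Series or an array.

    all(), any(), count(), prod(), min(), max(), nsmallest(), nlargest(), cummax(), cumsum(), cumprod(), mean(), median(), mode(), sum(), std(), var(), quantile(), unique(), nunique(), value_counts(), describe() etc.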

- Total number of rows in a Series

In [5]:
city_mpg.size

41144

In [6]:
city_mpg.shape

(41144,)

- Counting non-null values

In [7]:
# Count total number of non-NA/non null values in a series
print("no of non-NA vals:", city_mpg.count())

# Without the parentheses, city_mpg.count is only a reference to the bound method;
# nothing is counted, and the notebook just displays the method's repr.

# city_mpg.count

no of non-NA vals: 41144


- Counting number of null values

In [8]:
city_mpg.isna().sum()

0
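Because `city08` has no missing values, `size` and `count()` agree above. A small sketch with a hypothetical Series that does contain missing entries shows how the three quantities relate:

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing entries
s = pd.Series([19.0, np.nan, 23.0, np.nan, 17.0])

n_rows = s.size             # counts every row, missing or not
n_present = s.count()       # counts only non-NA values
n_missing = s.isna().sum()  # True counts as 1 when summed

print(n_rows, n_present, n_missing)  # 5 3 2
```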

- Largest n elements

In [9]:
# return the largest n values
print("3 largest values: \n", city_mpg.nlargest(3))

3 largest values: 
 31256    150
32599    150
33423    150
Name: city08, dtype: int64
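The output above shows three rows tied at 150. How ties are handled is controlled by the `keep` parameter; the sketch below uses a made-up Series to illustrate, alongside the companion method `nsmallest()`:

```python
import pandas as pd

# Hypothetical data with a three-way tie for the maximum
s = pd.Series([150, 20, 150, 6, 150], name="mpg")

first_two = s.nlargest(2)             # default keep="first": earliest tied rows win
all_ties = s.nlargest(2, keep="all")  # keeps every row tied at the cutoff, so 3 rows
smallest = s.nsmallest(1)             # the single smallest value

print(list(first_two.index))  # [0, 2]
print(len(all_ties))          # 3
print(smallest.index[0])      # 3
```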


- Cumulative functions

In [10]:
# cumulative sum of a series
city_mpg.cumsum().tail()

41139    755704
41140    755724
41141    755742
41142    755760
41143    755776
Name: city08, dtype: int64
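Cumulative methods return a Series of the same length as the input rather than a scalar. A small sketch on made-up values, covering `cumsum()`, `cummax()` and `cumprod()`:

```python
import pandas as pd

# Hypothetical toy values
s = pd.Series([3, 1, 4, 1, 5])

running_sum = s.cumsum()    # 3, 4, 8, 9, 14
running_max = s.cummax()    # 3, 3, 4, 4, 5
running_prod = s.cumprod()  # 3, 3, 12, 12, 60

# The last element of the running sum equals the plain sum
assert running_sum.iloc[-1] == s.sum()
```

This is also why the final value of `city_mpg.cumsum()` above matches the total of the column.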

- Quantile values

A quantile is a cut point that divides a probability distribution into intervals of equal probability. With percentages, the distribution is divided into 100 pieces (percentiles). Looking at the PDF, the 5th percentile (the 0.05 quantile) is the point that cuts off an area of 5% in the lower tail of the distribution.

In [11]:
# quantile() by default returns the 50% quantile (the median).
# We can also pass in a list; in that case, a Series object is returned.

city_mpg.quantile([0.25, 0.5, 0.75])  # 25%, 50% and 75% quantiles

0.25    15.0
0.50    17.0
0.75    20.0
Name: city08, dtype: float64
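When a requested quantile falls between two data points, pandas interpolates linearly by default. The sketch below, on a hypothetical four-value Series, shows the default alongside the `interpolation` parameter:

```python
import pandas as pd

# Hypothetical toy values; the median falls between 2 and 3
s = pd.Series([1, 2, 3, 4])

median = s.quantile()                             # linear interpolation: 2.5
lower = s.quantile(0.5, interpolation="lower")    # nearest data point below: 2
higher = s.quantile(0.5, interpolation="higher")  # nearest data point above: 3

# A list of quantiles returns a Series indexed by the requested quantiles
quartiles = s.quantile([0.25, 0.75])              # 1.75 and 3.25

print(median, lower, higher)
print(quartiles)
```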

- Generate descriptive statistics

In [12]:
city_mpg.describe()

count    41144.000000
mean        18.369045
std          7.905886
min          6.000000
25%         15.000000
50%         17.000000
75%         20.000000
max        150.000000
Name: city08, dtype: float64
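The percentiles reported by `describe()` can be changed through its `percentiles` parameter. A minimal sketch on made-up values (the choice of 10% and 90% is only an illustration):

```python
import pandas as pd

# Hypothetical toy values with one large outlier
s = pd.Series([6, 15, 17, 20, 150], name="mpg")

default_summary = s.describe()                       # count, mean, std, min, quartiles, max
custom_summary = s.describe(percentiles=[0.1, 0.9])  # request the 10% and 90% cut points

print(custom_summary)
```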

- Unique values in a series

In [13]:
# return unique values in a series as ndarray
manufac.unique()

array(['Alfa Romeo', 'Ferrari', 'Dodge', 'Subaru', 'Toyota', 'Volkswagen',
       'Volvo', 'Audi', 'BMW', 'Buick', 'Cadillac', 'Chevrolet',
       'Chrysler', 'CX Automotive', 'Nissan', 'Ford', 'Hyundai',
       'Infiniti', 'Lexus', 'Mercury', 'Mazda', 'Oldsmobile', 'Plymouth',
       'Pontiac', 'Rolls-Royce', 'Eagle', 'Lincoln', 'Mercedes-Benz',
       'GMC', 'Saab', 'Honda', 'Saturn', 'Mitsubishi', 'Isuzu', 'Jeep',
       'AM General', 'Geo', 'Suzuki', 'E. P. Dutton, Inc.', 'Land Rover',
       'PAS, Inc', 'Acura', 'Jaguar', 'Lotus', 'Grumman Olson', 'Porsche',
       'American Motors Corporation', 'Kia', 'Lamborghini',
       'Panoz Auto-Development', 'Maserati', 'Saleen', 'Aston Martin',
       'Dabryan Coach Builders Inc', 'Federal Coach', 'Vector', 'Bentley',
       'Daewoo', 'Qvale', 'Roush Performance', 'Autokraft Limited',
       'Bertone', 'Panther Car Company Limited', 'Texas Coach Company',
       'TVR Engineering Ltd', 'Morgan', 'MINI', 'Yugo', 'BMW Alpina',
       'Renaul

In [14]:
# return a series containing counts of unique values
manufac.value_counts(dropna=False)  # by default, dropna=True

make
Chevrolet                      4003
Ford                           3371
Dodge                          2583
GMC                            2494
Toyota                         2071
                               ... 
Volga Associated Automobile       1
Panos                             1
Mahindra                          1
Excalibur Autos                   1
London Coach Co Inc               1
Name: count, Length: 136, dtype: int64
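The `dropna` and `normalize` parameters change what `value_counts()` reports. A small sketch with a hypothetical Series that includes one missing value:

```python
import pandas as pd

# Hypothetical manufacturer names with one missing entry
makes = pd.Series(["Ford", "Dodge", "Ford", None, "Ford"])

n_unique = makes.nunique()                    # 2: missing values are ignored by default
counts = makes.value_counts()                 # Ford 3, Dodge 1
with_na = makes.value_counts(dropna=False)    # adds a row counting the missing value
shares = makes.value_counts(normalize=True)   # fractions of non-missing rows: 0.75, 0.25

print(with_na)
print(shares)
```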

### <a id='toc2_1_'></a>[Multiple aggregation: The _.agg()_ function](#toc0_)

The `.agg()` method can be used to perform multiple aggregate operations on a series object at the same time. You can also define your own aggregate functions.

In [15]:
city_mpg.agg(
    ["min", "idxmin", "max", "idxmax", "mean", "std", "var", "quantile", "all", "sum"]
)

min                 6
idxmin           7901
max               150
idxmax          31256
mean        18.369045
std          7.905886
var         62.503036
quantile         17.0
all              True
sum            755776
Name: city08, dtype: object
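User-defined aggregates can be mixed with the built-in names. The sketch below assumes a hypothetical helper, `value_range`, that is not part of pandas; it receives the Series and returns a scalar:

```python
import pandas as pd

# Hypothetical toy values
s = pd.Series([19, 9, 23, 10, 17], name="mpg")


def value_range(series):
    """Hypothetical custom aggregate: spread between the largest and smallest value."""
    return series.max() - series.min()


# Strings name built-in aggregates; callables are labelled by their function name
result = s.agg(["min", "max", value_range])

print(result)
```

Using a named function rather than a lambda gives the result a readable index label (`value_range` here).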

To learn about all the available aggregate functions see the documentation @ https://pandas.pydata.org/pandas-docs/stable/reference/series.html#computations-descriptive-stats