# Deep dive into Pandas Series data structure 

**Read the official documentation on pandas Series objects @ https://pandas.pydata.org/pandas-docs/stable/reference/series.html**

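**Note:** Many of Python's built-in functions and operators work on pandas Series objects, e.g. len, type, dir, in, sum, sorted, max, min (note that `in` tests the index labels, not the values). Series also provide their own methods for the same jobs, such as sum, product and mean. Also, methods can be chained in pandas just as in plain Python.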
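As a quick illustration of the note above, here is a minimal sketch on a small made-up Series (the data and variable names are hypothetical, not taken from the vehicles dataset):

```python
import pandas as pd

# A small illustrative Series (hypothetical data)
s = pd.Series([3, 1, 2], index=["a", "b", "c"])

n_items = len(s)        # number of elements
total = sum(s)          # built-in sum iterates over the values
largest = max(s)        # built-in max iterates over the values
ordered = sorted(s)     # returns a plain Python list
has_label = "a" in s    # `in` looks at the index labels
has_value = 3 in s      # 3 is a value, not a label

# Method chaining: each method returns an object we can call the next method on
chained = s.sort_values().head(2).sum()

print(n_items, total, largest, ordered, has_label, has_value, chained)
```

To test membership among the values instead of the labels, `3 in s.values` can be used.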

In [1]:
# import statements
import numpy as np
import pandas as pd

---------------------------
### Importing the data
------------------------

One of the datasets we will use for our examples in this notebook is the `Data/vehicles.csv.zip` dataset.

In [2]:
# read the vehicles.csv dataset
df = pd.read_csv("Data/vehicles.csv.zip")


---------------------------------------------
## Mathematical operations and Index Alignment 
------------------------------------------------

> The mathematical operations that are available include: 

    +, -, *, /, // (floor division), % (modulus), @ (matrix multiplication), ** (power), <, <=, ==, !=, >=, >, & (binary and), ^ (binary xor), | (binary or).

However, pandas will **align the index** before performing any of these operations. Aligning takes each index entry in the left series and matches it with every entry that has the same label in the index of the right series. Because of index alignment, you will want to **make sure that the indexes are unique (no duplicates) and common to both series**. Otherwise, the result may not be what you expect.


- Example of how index alignment can produce unintended results

In [3]:
s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([15, 28, 32], index=[2, 2, 3])

In [4]:
s1 + s2

1     NaN
2    35.0
2    48.0
2    45.0
2    58.0
3     NaN
dtype: float64

Note that index 1 and index 3 have NaN values, because each of these labels exists in only one of the two series. For index 2, every index-2 value from s1 is matched with every index-2 value from s2, which gives 2 × 2 = 4 rows.
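One way to catch this situation early is to check `Index.is_unique` before doing arithmetic. The sketch below reuses s1 and s2 from the example; the series `a` and `b` are hypothetical and only show what the one-to-one case looks like.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([15, 28, 32], index=[2, 2, 3])

# Duplicate labels are what trigger the many-to-many matching
print(s1.index.is_unique)   # False
print(s2.index.is_unique)   # False
print(len(s1 + s2))         # 6 rows from two 3-element series

# With unique labels shared by both series, the addition is one-to-one
a = pd.Series([10, 20, 30], index=[1, 2, 3])
b = pd.Series([15, 28, 32], index=[1, 2, 3])
aligned_sum = a + b
print(aligned_sum.tolist())  # [25, 48, 62]
```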

- **Operator methods**

Pandas also provides **operator methods** for each of the mathematical operations. Their benefit is the **fill_value** parameter, which can change the default behavior. **By default, the operator methods produce the same output as the mathematical operators themselves. If a fill_value is given, the method uses it in place of an operand that is missing on one side.**

> Some of the operator methods include: 
    
    add(), sub(), mul(), div(), floordiv(), mod(), pow(), rfloordiv(), lt(), gt(), eq(), ne(), le(), ge(), dot() etc.

For instance, if we wanted actual values at index 1 and index 3 in the above example instead of NaN, we could use **s1.add(s2, fill_value=0)**. A label that is missing from one series is then treated as 0 on that side.

In [5]:
# using the add() method to add the elements of both series with fill_value=0
s1.add(s2, fill_value=0)

1    10.0
2    35.0
2    48.0
2    45.0
2    58.0
3    32.0
dtype: float64

Learn about all the available operator methods @ https://pandas.pydata.org/pandas-docs/stable/reference/series.html#binary-operator-functions
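The same `fill_value` idea applies to the other arithmetic operator methods. A minimal sketch, assuming two small hypothetical series `x` and `y` with unique but only partially overlapping labels:

```python
import pandas as pd

# Hypothetical series with unique, partially overlapping labels
x = pd.Series([10, 20], index=["a", "b"])
y = pd.Series([1, 2], index=["b", "c"])

plain = x - y                     # NaN wherever a label is missing on one side
filled = x.sub(y, fill_value=0)   # the missing side is treated as 0
scaled = x.mul(y, fill_value=1)   # the missing side is treated as 1

print(plain.tolist())    # [nan, 19.0, nan]
print(filled.tolist())   # [10.0, 19.0, -2.0]
print(scaled.tolist())   # [10.0, 20.0, 2.0]
```

Choosing the neutral element of the operation (0 for addition and subtraction, 1 for multiplication) keeps the value from the side that does exist.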

---------------------------------------------
## Aggregate methods 
------------------------------------------------

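Aggregate methods collapse the values of a series down to a scalar, allowing you to summarize detailed data with a single value, e.g. sum, count, mean or median. Note that some of the methods listed below (for example nlargest(), cumsum(), unique(), value_counts() and describe()) return a Series or an array rather than a single scalar.

> Some of the commonly used aggregate and summary methods are: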

    all(), any(), count(), prod(), min(), max(), nsmallest(), nlargest(), cummax(), cumsum(), cumprod(), mean(), median(), mode(), sum(), std(), var(), quantile(), unique(), nunique(), value_counts(), describe() etc.

In [6]:
# the city08 and highway08 columns from the vehicles.csv dataset
# provide information on miles per gallon usage while driving around in the city and highway respectively.
city_mpg = df.city08
highway_mpg = df.highway08

- Counting non-null values

In [7]:
# Count total number of non-NA/null values in a series
print("no of non-NA vals:", city_mpg.count())

# without the parentheses, city_mpg.count is only a reference to the bound method;
# nothing is counted (the notebook just displays the method's repr)

# city_mpg.count

no of non-NA vals: 41144


- Largest n elements

In [8]:
# return the largest n values
print("3 largest values: \n", city_mpg.nlargest(3))

3 largest values: 
 31256    150
32599    150
33423    150
Name: city08, dtype: int64


- Cumulative functions

In [9]:
# cumulative sum of a series
city_mpg.cumsum().tail()

41139    755704
41140    755724
41141    755742
41142    755760
41143    755776
Name: city08, dtype: int64
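Because the output above only shows the tail, a tiny hypothetical series makes it easier to see what the cumulative methods compute at each position:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])   # hypothetical values

running_total = s.cumsum()   # sum of all values up to and including each position
running_max = s.cummax()     # largest value seen so far at each position

print(running_total.tolist())   # [3, 4, 8, 9, 14]
print(running_max.tolist())     # [3, 3, 4, 4, 5]
```

The result has the same length as the input, and the last element of `cumsum()` equals `sum()`.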

- Quantile values

Quantiles are cut points that divide a probability distribution into areas of equal probability. If we think in percentages, the distribution is divided into 100 pieces (percentiles). On the probability density function (PDF), the 5th percentile (the 0.05 quantile) is the point that cuts off an area of 5% in the lower tail of the distribution.

In [10]:
# quantile() returns the 50% quantile (the median) by default. We can also pass in a list, in which case a Series is returned
city_mpg.quantile([0.25, 0.5, 0.75])  # 25%, 50% and 75% quantiles

0.25    15.0
0.50    17.0
0.75    20.0
Name: city08, dtype: float64
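To make the numbers easier to verify by hand, here is a minimal sketch on five hypothetical values, where the quartiles fall exactly on data points:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])   # hypothetical values

median = values.quantile()                 # default q=0.5
quartiles = values.quantile([0.25, 0.75])  # a list returns a Series indexed by q

print(median)              # 3.0
print(quartiles.tolist())  # [2.0, 4.0]
```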

- Generate descriptive statistics

In [11]:
city_mpg.describe()

count    41144.000000
mean        18.369045
std          7.905886
min          6.000000
25%         15.000000
50%         17.000000
75%         20.000000
max        150.000000
Name: city08, dtype: float64

- Unique values in a series

In [12]:
# return unique values in a series as ndarray
city_mpg.unique()

array([ 19,   9,  23,  10,  17,  21,  22,  18,  12,  20,  14,  11,  15,
        13,  16,  25,  24,  26,  31,  27,  30,  38,  28,  43,  35,  33,
        29,  39,  37,   8,   7,  34,  32,  36,  49,  81,  45,  48,  42,
         6,  44,  74,  84,  40,  87,  41,  51,  62,  59,  79,  50,  52,
       102, 106,  94, 126,  53, 107,  77, 110,  88, 132, 122, 138,  78,
        60,  47, 129,  93, 128,  61, 137,  85, 120,  86,  89,  95, 101,
        90, 124, 121,  54,  58,  91,  97,  73,  98,  92, 150,  55,  57,
        46, 118, 112, 131, 136,  83, 125,  80, 123, 127, 114, 140, 115,
       104])

In [13]:
# return a series containing counts of unique values
city_mpg.value_counts(dropna=False)  # by default, dropna=True

15     4503
18     4053
17     4035
16     3975
19     3012
       ... 
127       1
114       1
140       1
115       1
104       1
Name: city08, Length: 105, dtype: int64

### The _.agg()_ method 

The `.agg()` method can be used to perform multiple aggregate operations on a series object at the same time (you can also pass in your own aggregate functions).

In [14]:
city_mpg.agg(
    ["min", "idxmin", "max", "idxmax", "mean", "std", "var", "quantile", "all", "sum"]
)

min                 6
idxmin           7901
max               150
idxmax          31256
mean        18.369045
std          7.905886
var         62.503036
quantile         17.0
all              True
sum            755776
Name: city08, dtype: object

To learn about all the available aggregate functions see the documentation @ https://pandas.pydata.org/pandas-docs/stable/reference/series.html#computations-descriptive-stats
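A user-defined aggregate can be mixed with the built-in names in the list passed to `.agg()`. The sketch below is only an illustration: `value_range` is a hypothetical helper and `mpg` is made-up data standing in for city_mpg.

```python
import pandas as pd


def value_range(s):
    """Hypothetical custom aggregate: spread between the largest and smallest value."""
    return s.max() - s.min()


mpg = pd.Series([15, 18, 17, 22, 30])   # hypothetical values

# Built-in aggregates are given by name, the custom one as a function object
summary = mpg.agg(["min", "max", value_range])
print(summary)
```

The custom function receives the whole Series, and its name is used as the label in the result.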

----------------------------
## Data type casting 
-----------------------------

We often need to convert between data types, usually for better performance (more manipulation options or lower memory use). For this, pandas provides the `astype(dtype)` method, which converts the data type of a Series or DataFrame object.

> Some of the major datatypes available in pandas include: 

    object, int, float, bool, datetime, category etc.

Refer to this article @ https://pbpython.com/pandas_dtypes.html for a basic idea on the pandas data types.

- Inspecting numerical limits of different integer and float types 

The **default numeric type is 8 bytes wide (64 bits, i.e. int64 or float64)**. If you can use a narrower type, you can cut back on memory usage, leaving more memory to process more data. You can use NumPy to inspect the limits of integer and float types.

In [15]:
# integer
print(np.iinfo(np.int16))  # or, np.iinfo("int16")
print(np.iinfo(np.uint8))  # or, np.iinfo("uint8")

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------



In [16]:
# float
print(np.finfo("float16"))

Machine parameters for float16
---------------------------------------------------------------
precision =   3   resolution = 1.00040e-03
machep =    -10   eps =        9.76562e-04
negep =     -11   epsneg =     4.88281e-04
minexp =    -14   tiny =       6.10352e-05
maxexp =     16   max =        6.55040e+04
nexp =        5   min =        -max
smallest_normal = 6.10352e-05   smallest_subnormal = 5.96046e-08
---------------------------------------------------------------



- Checking memory usage

To check how much memory the values of a Series or DataFrame are consuming, we can use the `nbytes` attribute.

In [17]:
# by default, the data in the city_mpg Series is stored as int64

# the max value in our series object is 150
# so we can't use int8 (max 127), but we can cast to int16

# to see how much space is saved (in bytes)
city_mpg.nbytes - city_mpg.astype("int16").nbytes

246864
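Whether a narrower integer type is safe can also be checked programmatically. The following is a minimal sketch under stated assumptions: `fits` is a hypothetical helper, and `s` holds made-up values in the same range as city08.

```python
import numpy as np
import pandas as pd

s = pd.Series([6, 17, 150])   # hypothetical values in the same range as city08


def fits(series, dtype):
    """Return True if every value of `series` lies within the limits of integer `dtype`."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)


fits_int8 = fits(s, "int8")     # int8 tops out at 127
fits_int16 = fits(s, "int16")

# pd.to_numeric can pick the smallest signed integer type that holds the data
downcast = pd.to_numeric(s, downcast="integer")

print(fits_int8, fits_int16, downcast.dtype)
```

Checking the limits first matters because casting values that do not fit into the target type does not give meaningful results.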

For object types, `.nbytes` only counts the 8-byte references that pandas stores, not the Python objects they point to. The **make** column in the vehicles dataset provides the manufacturer name (strings) and is stored as an object. To get the amount of memory that includes the strings, we need to use the `.memory_usage` method with `deep=True`.

In [18]:
# the make column as a series
manufac = df.make

In [19]:
manufac.head(3)

0    Alfa Romeo
1       Ferrari
2         Dodge
Name: make, dtype: object

In [20]:
# examining memory usage with the nbytes attribute
manufac.nbytes

329152

In [21]:
# examining memory usage with the memory_usage method
manufac.memory_usage(deep=True)

2606395

The value of _.nbytes_ is just the memory that the data is using and not the ancillary parts of the Series. _.memory_usage()_ also includes the index memory and, with `deep=True`, the contribution from the objects that an object-type column refers to.
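The difference between these numbers can be seen on a three-element series. This is a minimal sketch with hypothetical data; the exact byte counts depend on the platform and pandas version, so only the relationships between them are shown.

```python
import pandas as pd

# Hypothetical object-dtype series of manufacturer names
names = pd.Series(["Alfa Romeo", "Ferrari", "Dodge"], dtype=object)

pointer_bytes = names.nbytes                      # references only
without_index = names.memory_usage(index=False)   # values only, same as nbytes here
shallow = names.memory_usage()                    # values plus the index
deep = names.memory_usage(deep=True)              # also counts the Python string objects

print(pointer_bytes, without_index, shallow, deep)
```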

- **String and Category type**

A `categorical` series is useful for string data and can result in large memory savings. Instead of storing a Python `string` for every value, pandas stores each distinct value once, so **repeating values are not duplicated**. You **still have all of the functionality available through the .str attribute.**

For example, if we convert the make column from the vehicles dataframe (i.e. the manufac series) to a category object, its memory footprint is much smaller.

In [22]:
# the make column as categorical object
manufac_cat = df.make.astype("category")

In [23]:
manufac_cat.head(3)

0    Alfa Romeo
1       Ferrari
2         Dodge
Name: make, dtype: category
Categories (136, object): ['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo', ..., 'Volvo', 'Wallace Environmental', 'Yugo', 'smart']

In [24]:
# examining the memory footprint
# (shallow comparison: without deep=True the strings of the object column are not counted)
manufac.memory_usage() - manufac_cat.memory_usage()

241608

- **Custom & ordered categories**

To define custom categories we can use the `pd.Categorical(values, categories, ordered=False)` constructor. To make the categories ordered, we need to set _ordered = True_.

**Note:**
1. a Categorical **might have an order**, but numerical operations (additions, divisions, ...) are not possible.    
2. Passing values **outside of categories** will result in replacing the value with **NaN** in the series object.
3. Order is defined by the order of the categories, **not lexical order** of the values.

In [25]:
# values
vals = manufac
# categories
cat = manufac.unique()

# the categorical below is unordered (the default);
# to get an ordered one we would just need to add ordered=True
ord_manufac = pd.Categorical(values=vals, categories=cat)

In [26]:
ord_manufac

['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru', ..., 'Subaru', 'Subaru', 'Subaru', 'Subaru', 'Subaru']
Length: 41144
Categories (136, object): ['Alfa Romeo', 'Ferrari', 'Dodge', 'Subaru', ..., 'Consulier Industries Inc', 'Goldacre', 'Isis Imports Ltd', 'PAS Inc - GMC']
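For an ordered categorical, a minimal sketch with hypothetical size labels (instead of the manufacturer names) shows that the order comes from the `categories` argument and not from the alphabet:

```python
import pandas as pd

labels = ["small", "large", "medium"]   # hypothetical values

sizes = pd.Categorical(
    values=labels,
    categories=["small", "medium", "large"],   # this order defines the ranking
    ordered=True,
)
s = pd.Series(sizes)

smallest, largest = s.min(), s.max()      # follow the category order
above_small = (s > "small").tolist()      # comparisons work on ordered categories
by_category = s.sort_values().tolist()    # sorted by category order
by_alphabet = sorted(labels)              # plain lexical order, for contrast

print(smallest, largest)
print(above_small)
print(by_category)
print(by_alphabet)
```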

## Manipulation methods 

When working on a dataset, the methods we use most are usually manipulation methods of some kind. They are especially useful when we are cleaning up a dataset or exploring it to understand it better.

- Applying a custom function to every element of a Series (also works on DataFrames)

The `apply(func)` method calls the function `func` on every element of a Series.

Use it sparingly: because `func` is called once per element in Python, it is usually much slower than the built-in vectorized pandas methods, which cover almost every manipulation we might need. If no suitable method exists, we can define our own function and call it with apply(). **Note that** we pass in only the name of the function; we do not call it.

In [59]:
# let's say, we only want to keep the top 5 manufacturers and replace other values with "Other" in the manufac Series

In [34]:
# defining the custom manipulation function
top_5_manufac = manufac.value_counts().index[0:5]


def custom_manipulation(val):
    if val in top_5_manufac:
        return val
    else:
        return "Other"

In [53]:
%%timeit
# cell magics need to go at the top of the cell

# applying the custom_manipulation function on manufac Series
custom_manufac = manufac.apply(custom_manipulation)

28 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [57]:
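# variables assigned inside a %%timeit cell are not kept, so the assignment is repeated here
custom_manufac = manufac.apply(custom_manipulation)
custom_manufac.value_counts().values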

array([26622,  4003,  3371,  2583,  2494,  2071])

- The `where(cond, other)` method

This method **replaces values where the condition is False with corresponding value from 'other'**.

In [60]:
# if we were to do the same thing as the above example with where() method

In [63]:
%%timeit
# Series.isin(values) checks Whether elements in Series are contained in `values`
manufac.where(cond=manufac.isin(top_5_manufac), other="Other")

2 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


As we can see, this is about 14 times faster than the apply() approach.

- The `mask(cond, other)` method **replaces values where the condition is True with corresponding value from 'other'**. It is equivalent to: **where(~cond, other)**
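The equivalence between the two methods can be checked on a small example. This is a minimal sketch with hypothetical data; `keep` stands in for the top manufacturers used earlier.

```python
import pandas as pd

makes = pd.Series(["Ford", "Yugo", "Dodge", "Goldacre"])   # hypothetical sample
keep = ["Ford", "Dodge"]                                   # hypothetical "top" manufacturers
cond = makes.isin(keep)

via_where = makes.where(cond, other="Other")    # replaces where cond is False
via_mask = makes.mask(~cond, other="Other")     # replaces where ~cond is True

print(via_where.tolist())          # ['Ford', 'Other', 'Dodge', 'Other']
print(via_where.equals(via_mask))  # True
```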

- The `clip(lower, upper)` method for handling **outliers**

This will **replace all values above the upper threshold with the upper value and all values lower than the lower threshold with the lower value.**

Clipping is handy if you have outliers in your data. In the city_mpg series the values range from 6 to 150. However, out of the total 41144 vehicles there are only 16 with city_mpg > 130 and 39 with city_mpg < 8. If we want to clip those entries, or clip the values to the range between the 5% and 95% quantiles, the clip method does either in a single call.

In [141]:
# in the first case
print("Min: ", city_mpg.clip(lower=8, upper=120).min())
print("Max: ", city_mpg.clip(lower=8, upper=120).max())

# see that the values were replaced not dropped
len(city_mpg) == len(city_mpg.clip(lower=8, upper=120))

Min:  8
Max:  120


True

In [144]:
# in the second case
quantile_5 = city_mpg.quantile(0.05)
quantile_95 = city_mpg.quantile(0.95)

print(
    "Min: ",
    city_mpg.clip(lower=quantile_5, upper=quantile_95).min(),
    "Max: ",
    city_mpg.clip(lower=quantile_5, upper=quantile_95).max(),
)
print("5% Quantile: ", quantile_5, "95% Quantile: ", quantile_95)

Min:  11 Max:  27
5% Quantile:  11.0 95% Quantile:  27.0


## Missing values and How to handle them

In [71]:
# The cylinders column has missing values
cylinders = df.cylinders

**| Counting the total number of missing values**

The `series.isna()` method detects missing values and returns a boolean Series. An interesting property of the **sum()** method is that it treats True as 1 and False as 0. This property can be used to count the number of missing values.

In [79]:
# the series.isna() function detects missing values
cylinders.isna().sum()

206

**| Let's see for which manufacturers the cylinders value is missing**

In [81]:
# Boolean mask for missing values
mask_missing = cylinders.isna()

In [84]:
manufac[mask_missing].value_counts().head(5)

Tesla     74
smart     16
Ford      15
Nissan    14
BMW       10
Name: make, dtype: int64

Here we can see that most of the cars with missing cylinders values are Teslas. Since Teslas are electric cars, this makes sense. **Note:** An alternative would be to pass the boolean mask to the `.loc` indexer instead of indexing with it directly.
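Both ways of selecting with a boolean mask give the same result. A minimal sketch on a hypothetical miniature of the two columns (`cyl` and `make` are made-up stand-ins for cylinders and manufac):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the two columns used above
cyl = pd.Series([4.0, np.nan, 8.0, np.nan])
make = pd.Series(["Ford", "Tesla", "Dodge", "Tesla"])

missing = cyl.isna()

by_mask = make[missing]      # plain boolean indexing
by_loc = make.loc[missing]   # the same selection through .loc

print(by_loc.value_counts())
print(by_mask.equals(by_loc))   # True
```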

Now let's discuss how to handle these missing values.

### _Handling missing values_ 

- The `.fillna()` method allows you to specify a replacement value for any missing data

In [88]:
cylinders.fillna(0).value_counts()  # Doesn't change in place

4.0     15938
6.0     14284
8.0      8801
5.0       771
12.0      626
3.0       279
0.0       206
10.0      170
2.0        59
16.0       10
Name: cylinders, dtype: int64

**Note that** in this case it was reasonable to replace the "nan" values with "0", but that is not always so. In other scenarios, filling with the result of **.mean(), .median(), .mode()** etc. may come in handy.
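Filling with a statistic of the series itself could look like the sketch below, which uses a few hypothetical values. Note that `.mode()` returns a Series (there can be ties), so one value has to be picked from it.

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, 6.0, np.nan, 8.0, 4.0])   # hypothetical values

with_median = s.fillna(s.median())        # median of 4, 4, 6, 8 is 5.0
with_mode = s.fillna(s.mode().iloc[0])    # most frequent value is 4.0

print(with_median.tolist())   # [4.0, 6.0, 5.0, 8.0, 4.0]
print(with_mode.tolist())     # [4.0, 6.0, 4.0, 8.0, 4.0]
```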

- The `.dropna()` method will drop the indexes (i.e, rows) with missing values

In [89]:
# the number of dropped rows equals the 'nan' count of 206
len(cylinders) - len(cylinders.dropna())

206

- The `.interpolate()` method will replace 'nan' with interpolation of the values around the missing value.

This is a very useful method for filling in missing values. For example, it comes in handy if the data is ordered (as time series data often is) and there are holes in the data. However, you have to make sure that the data you are manipulating has a trend that can be captured by interpolation. Otherwise, this may lead to disastrous results.

In [95]:
# say we have a series that captures the [somewhat] upward trend of temp. as the season goes into summer from winter
temp = pd.Series([19, 20, 20, None, 22, 24, 23, 24])
temp.interpolate()

0    19.0
1    20.0
2    20.0
3    21.0
4    22.0
5    24.0
6    23.0
7    24.0
dtype: float64
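When several values in a row are missing, the default linear interpolation spaces the filled values evenly between the known neighbours. A minimal sketch with hypothetical temperatures, also showing the `limit` parameter, which caps how many consecutive missing values are filled:

```python
import pandas as pd

temp = pd.Series([20.0, None, None, 26.0])   # hypothetical temperatures with a two-value gap

filled = temp.interpolate()           # default method="linear"
limited = temp.interpolate(limit=1)   # fill at most one consecutive missing value

print(filled.tolist())           # [20.0, 22.0, 24.0, 26.0]
print(limited.isna().tolist())   # [False, False, True, False]
```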