**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
  - [Importing the data](#toc1_2_)    
- [Data type casting](#toc2_)    
    - [*Inspecting numerical limits of different integer and float types*](#toc2_1_1_)    
    - [*Checking memory usage*](#toc2_1_2_)    
    - [*String and Categorical type*](#toc2_1_3_)    


**Read the official documentation on pandas Series objects @ https://pandas.pydata.org/pandas-docs/stable/reference/series.html**

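**`Note:`** Many Python built-in functions work directly on pandas Series objects, e.g. **len, type, dir, sum, sorted, max, min**. Operations such as product and mean are not Python built-ins, but they are available as Series methods (`.prod()`, `.mean()`). The `in` operator also works, but it tests membership in the index, not in the values.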

Also, **chaining functions/methods** in pandas works the same way as in Python: each call returns an object on which the next method is called.
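
As a small illustration of both points, the sketch below applies a few built-ins to a toy Series and then chains three Series methods. The values and the name `s` are made up for this example and are not taken from the vehicles dataset.

```python
import pandas as pd

# toy Series, hypothetical values for illustration only
s = pd.Series([18, 25, 13, 25], name="mpg")

n = len(s)               # number of elements
total = sum(s)           # built-in sum iterates over the values
largest = max(s)         # built-in max
ordered = sorted(s)      # returns a plain Python list

# `in` checks the index labels, not the values
has_label_3 = 3 in s           # 3 is an index label (0..3)
has_value_25 = 25 in s         # 25 is a value, not a label
value_found = 25 in s.values   # test the values explicitly

# method chaining: each method returns a new object that the next one acts on
top_two_mean = s.sort_values(ascending=False).head(2).mean()
```

Note that `sorted` returns a list rather than a Series; `s.sort_values()` is the pandas counterpart when the index should be kept.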

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
# import statements
import numpy as np
import pandas as pd

### <a id='toc1_2_'></a>[Importing the data](#toc0_)

One of the many datasets we will use for our examples in this notebook is the `Data/vehicles.csv.zip` dataset.

In [2]:
# read the vehicles.csv dataset
df = pd.read_csv("Data/vehicles.csv.zip")


Columns of a DataFrame can be accessed in several ways. One of them is **dot (`.`) notation**.

In [3]:
# the city08 and highway08 columns from the vehicles.csv dataset provide information on
# miles per gallon while driving in the city and on the highway, respectively.
city_mpg = df.city08
highway_mpg = df.highway08
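
For comparison, here is a minimal sketch of dot access next to bracket access on a toy DataFrame. The frame `toy`, its values, and the `fuel type` and `count` columns are assumptions made up for this example; only the `city08` and `highway08` names come from the vehicles dataset.

```python
import pandas as pd

# toy frame, hypothetical values for illustration only
toy = pd.DataFrame(
    {
        "city08": [19, 9, 23],
        "highway08": [25, 14, 33],
        "fuel type": ["Regular", "Premium", "Regular"],
    }
)

by_dot = toy.city08          # attribute (dot) access
by_bracket = toy["city08"]   # bracket access returns the same Series

# bracket access is required when the name is not a valid Python identifier
fuel = toy["fuel type"]

# ... or when the name collides with an existing DataFrame attribute or method
toy["count"] = [1, 2, 3]
is_method = callable(toy.count)   # toy.count is still the DataFrame.count method
count_col = toy["count"]          # the actual column
```

Dot notation is convenient for reading columns with simple names; bracket notation works for every column name.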

In [4]:
# The make column in the vehicles dataset provides the manufacturer name (strings) and is stored as an object.
manufac = df.make

**Note:** The first thing we should do when we load in a dataset is to check the datatypes of each column and cast each of them to more suitable datatypes. This is to save space and speed up our code execution.
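
A minimal sketch of that first step, assuming a toy DataFrame in place of the real dataset (the frame `toy` and its values are made up; the target types are one possible choice, not a recommendation for every dataset):

```python
import pandas as pd

# toy frame standing in for a freshly loaded dataset (values are made up)
toy = pd.DataFrame(
    {
        "city08": [19, 9, 23, 10],
        "make": ["Dodge", "Ferrari", "Dodge", "Subaru"],
    }
)

before = toy.dtypes  # inspect the type pandas inferred for each column

# cast several columns at once by passing a {column: dtype} mapping
slim = toy.astype({"city08": "int16", "make": "category"})
after = slim.dtypes
```

`astype` returns a new object and leaves `toy` unchanged, so the result has to be assigned to a name.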

----------------------------

## <a id='toc2_'></a>[Data type casting](#toc0_)

-----------------------------

We often need to convert between data types, usually for better performance (more manipulation
options or lower memory use). For this, pandas provides the `astype(dtype)` method, which converts the data type of a Series or DataFrame object.

> Some of the major datatypes available in pandas include: 

    object, int, float, bool, datetime, category etc.

Refer to this article @ https://pbpython.com/pandas_dtypes.html for a basic idea on the pandas data types.

#### <a id='toc2_1_1_'></a>[*Inspecting numerical limits of different integer and float types*](#toc0_)

The **default numeric type is 8 bytes wide (64 bits, i.e., int64 or float64)**. If you can use a narrower type, you can cut back on memory usage, leaving more memory to process more data. You can use NumPy to inspect the limits of integer and float types.

In [5]:
# integer
print(np.iinfo(np.int16))  # or, np.iinfo("int16")
print(np.iinfo(np.uint8))  # or, np.iinfo("uint8")

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------



In [6]:
# float
print(np.finfo("float16"))

Machine parameters for float16
---------------------------------------------------------------
precision =   3   resolution = 1.00040e-03
machep =    -10   eps =        9.76562e-04
negep =     -11   epsneg =     4.88281e-04
minexp =    -14   tiny =       6.10352e-05
maxexp =     16   max =        6.55040e+04
nexp =        5   min =        -max
smallest_normal = 6.10352e-05   smallest_subnormal = 5.96046e-08
---------------------------------------------------------------
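
These limits can be used to decide whether a narrower type is safe. The sketch below is one possible way to do it: the helper `fits` and the toy Series `mpg` are hypothetical names introduced for this example, while `np.iinfo` and `pd.to_numeric(..., downcast="integer")` are existing NumPy and pandas calls.

```python
import numpy as np
import pandas as pd


def fits(series: pd.Series, dtype: str) -> bool:
    """Return True if every value in `series` lies within the range of integer `dtype`."""
    info = np.iinfo(dtype)
    return bool(info.min <= series.min() and series.max() <= info.max)


# toy values, made up for illustration
mpg = pd.Series([9, 23, 150])

fits_int8 = fits(mpg, "int8")    # int8 tops out at 127
fits_int16 = fits(mpg, "int16")  # int16 tops out at 32767

# pandas can also pick the smallest signed integer type that holds the data
auto = pd.to_numeric(mpg, downcast="integer")
```

Checking the range first matters because casting values that do not fit into a narrower integer type can silently wrap around instead of raising.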



#### <a id='toc2_1_2_'></a>[*Checking memory usage*](#toc0_)

To check how much memory the values of a Series are consuming, we can use the `nbytes` attribute (for a DataFrame, the `memory_usage()` method reports the usage per column).

In [7]:
# by default, the data in city_mpg Series was stored as int64 type

# the max value in our series object is 150
# so, we can't use int8 but we can cast to int16

# to see how much space is saved
city_mpg.nbytes - city_mpg.astype("int16").nbytes

246864

Using `.nbytes` on an object-dtype Series only counts the references that pandas stores (8 bytes per element), not the Python strings they point to. The **make** column in the vehicles dataset provides the manufacturer name (strings) and is stored as an object. To get the amount of memory that includes the strings, we need to use the `.memory_usage` method with `deep=True`.

In [8]:
manufac.head(3)

0    Alfa Romeo
1       Ferrari
2         Dodge
Name: make, dtype: object

In [9]:
# examining memory usage with the nbytes attribute
manufac.nbytes

329152

In [10]:
# examining memory usage with the memory_usage method
manufac.memory_usage(deep=True)

2606399

The value of `.nbytes` is just the memory that the data is using and not the ancillary parts of the Series. The `.memory_usage` method includes the index memory and, with `deep=True`, also the contribution from object types.
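
The difference between the measurements can be seen on a small example. This is a sketch on a made-up Series named `names`; the exact byte counts depend on the platform and pandas version, so only the relationships between the numbers are meaningful.

```python
import pandas as pd

# toy object-dtype Series of strings (made-up values)
names = pd.Series(["Dodge", "Ferrari", "Subaru"], dtype=object)

data_only = names.nbytes                             # one reference per element
shallow_no_index = names.memory_usage(index=False)   # same figure as nbytes
shallow = names.memory_usage()                       # references + index
deep = names.memory_usage(deep=True)                 # also counts the string objects
```

Here `data_only` and `shallow_no_index` agree, `shallow` is larger because it adds the index, and `deep` is larger again because it adds the strings themselves.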

#### <a id='toc2_1_3_'></a>[*String and Categorical type*](#toc0_)

A `categorical` series is useful for string data and can result in large memory savings. Instead of storing a Python string for every row, pandas stores each distinct value once and represents the rows as small integer codes, so **repeating values are not duplicated**. You **still have all of the functionality of the `.str` attribute.**

For example, if we convert the make column of the vehicles dataframe (i.e., the manufac series) to the category type, its memory footprint improves considerably.

In [11]:
# the make column as categorical object
manufac_cat = df.make.astype("category")

In [12]:
manufac_cat.head(3)

0    Alfa Romeo
1       Ferrari
2         Dodge
Name: make, dtype: category
Categories (136, object): ['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo', ..., 'Volvo', 'Wallace Environmental', 'Yugo', 'smart']

In [13]:
# examining the memory footprint (shallow: without deep=True the string contents are not counted)
manufac.memory_usage() - manufac_cat.memory_usage()

241608
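
To see where the savings come from, the sketch below repeats the comparison with `deep=True` on a toy Series and inspects the categories and codes. The Series `makes` and its values are made up for this example; the resulting numbers are not those of the vehicles dataset.

```python
import pandas as pd

# toy Series with many repeats of a few strings (made-up values)
makes = pd.Series(["Dodge", "Ferrari", "Subaru", "Dodge"] * 1000, dtype=object)
makes_cat = makes.astype("category")

# deep=True counts the string contents, which is where categoricals save memory
saved = makes.memory_usage(deep=True) - makes_cat.memory_usage(deep=True)

# each distinct string is stored once; the rows hold small integer codes
categories = list(makes_cat.cat.categories)
codes_dtype = makes_cat.cat.codes.dtype

# string methods are still available through .str
upper = makes_cat.str.upper()
```

With only three distinct values the codes fit in a one-byte integer type, so each row costs one byte plus a single shared copy of each string.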

- **Custom & ordered categories**

> To define custom categories we can use the `pd.Categorical(values, categories, ordered=False)` constructor. To make the categories ordered, set _ordered = True_.

**Note:**
1. A Categorical **might have an order**, but numerical operations (additions, divisions, ...) are not possible.    
2. Values **outside of categories** are replaced with **NaN** when the Categorical is created; assigning such a value to an existing Categorical raises an error.
3. Order is defined by the **order of the categories, not lexical order of the values**.
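
The three notes can be checked on a small ordered Categorical. This is a sketch with made-up size labels (`sizes`, `"huge"`, and `"tiny"` are hypothetical and unrelated to the vehicles dataset); the exception type caught at the end is an assumption that covers the types raised by different pandas versions.

```python
import pandas as pd

# hypothetical size labels; "huge" is deliberately not among the categories
sizes = pd.Categorical(
    values=["small", "large", "medium", "huge"],
    categories=["small", "medium", "large"],
    ordered=True,
)

# note 2: the value outside the categories became NaN at construction time
n_missing = int(pd.isna(sizes).sum())

# note 3: min/max follow the category order, not alphabetical order
smallest, largest = sizes.min(), sizes.max()

# comparisons use the category order as well
above_small = (pd.Series(sizes) > "small").tolist()

# assigning a value outside the categories to an existing Categorical raises
try:
    sizes[0] = "tiny"
    raised = False
except (TypeError, ValueError):
    raised = True
```

Alphabetically, `"small"` would be the largest label; with the ordered categories above it is the smallest.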

In [14]:
# values
vals = manufac
# categories
cat = manufac.unique()

# to get an ordered categorical we would also pass ordered=True; it is omitted here, so the result is unordered
ord_manufac = pd.Categorical(values=vals, categories=cat)

In [15]:
ord_manufac

['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru', ..., 'Subaru', 'Subaru', 'Subaru', 'Subaru', 'Subaru']
Length: 41144
Categories (136, object): ['Alfa Romeo', 'Ferrari', 'Dodge', 'Subaru', ..., 'Consulier Industries Inc', 'Goldacre', 'Isis Imports Ltd', 'PAS Inc - GMC']