**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
  - [Importing the data](#toc1_2_)    
- [Data type casting](#toc2_)    
    - [*Inspecting numerical limits of different integer and float types*](#toc2_1_1_)    
    - [*Checking memory usage*](#toc2_1_2_)    
    - [*String and Categorical type*](#toc2_1_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=5
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

**Read the official documentation on pandas Series objects @ https://pandas.pydata.org/pandas-docs/stable/reference/series.html**

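**`Note:`** Many Python built-ins work directly on pandas Series objects, e.g. **len, type, dir, sum, sorted, max, min**. The `in` operator also works, but it tests membership against the index labels rather than the values. Reductions such as product and mean are not built-ins; use the Series methods `.prod()` and `.mean()` instead.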

Also, the notion of **chaining functions/methods** in pandas is similar to python.
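As a minimal sketch of the two points above, the following uses a small made-up Series (the values are illustrative and not taken from the vehicles dataset) to show a few built-ins and a short method chain.

```python
import pandas as pd

# Hypothetical miles-per-gallon values, used only for illustration.
mpg = pd.Series([19, 9, 23, 10, 17], name="city08")

n_rows = len(mpg)          # built-in len() counts the rows
largest = max(mpg)         # built-in max() iterates over the values
total = sum(mpg)           # built-in sum() does the same
ordered = sorted(mpg)      # returns a plain Python list

# `in` looks at the index labels (0..4 here), not at the values.
three_is_label = 3 in mpg
nineteen_is_label = 19 in mpg
value_present = 23 in mpg.values

# Method chaining: each call returns a new object that the next call acts on.
top_two_mean = mpg.sort_values(ascending=False).head(2).mean()
```

Each step of the chain returns a new Series (or a scalar at the end), so no intermediate variables are needed.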

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
# import statements
import numpy as np
import pandas as pd

### <a id='toc1_2_'></a>[Importing the data](#toc0_)

One of the datasets used for the examples in this notebook is `Data/vehicles.csv.zip`.

In [2]:
# read the vehicles.csv dataset
df = pd.read_csv("Data/vehicles.csv.zip")


Columns of a DataFrame can be accessed in several ways. One of them is **dot notation** (`df.column_name`).

In [3]:
# the city08 and highway08 columns of the vehicles.csv dataset give the fuel economy in
# miles per gallon for city and highway driving, respectively.
city_mpg = df.city08
highway_mpg = df.highway08

In [4]:
# The make column of the vehicles dataset holds the manufacturer name (strings) and is stored as object dtype.
manufac = df.make

**Note:** The first thing we should do when we load in a dataset is to check the datatypes of each column and cast each of them to more suitable datatypes. This is to save space and speed up our code execution.
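A minimal sketch of that workflow, assuming a tiny hand-made DataFrame as a stand-in for a freshly loaded dataset (the column names mirror the vehicles data, the values are invented):

```python
import pandas as pd

# Hypothetical stand-in for a freshly loaded dataset.
sample = pd.DataFrame(
    {
        "city08": [19, 9, 23],
        "highway08": [25, 14, 33],
        "make": ["Alfa Romeo", "Ferrari", "Dodge"],
    }
)

# Step 1: see which dtype pandas inferred for each column.
inferred = sample.dtypes

# Step 2: cast the columns whose value range or content allows a smaller type.
compact = sample.astype(
    {"city08": "int16", "highway08": "int16", "make": "category"}
)
compact_dtypes = compact.dtypes
```

Passing a dict to `DataFrame.astype` casts only the listed columns and leaves the rest untouched.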

----------------------------

## <a id='toc2_'></a>[Data type casting](#toc0_)

-----------------------------

We often need to convert between data types, usually for better performance (more manipulation options or lower memory use). pandas provides the `astype(dtype)` method for converting the data type of a Series or DataFrame object.

> Some of the major datatypes available in pandas include: 

    object, int, float, bool, datetime, category etc.

Refer to this article @ https://pbpython.com/pandas_dtypes.html for a basic idea on the pandas data types.
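The following sketch shows `astype` on a single Series. The column name and values are hypothetical. The point is that `astype` returns a new object and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical column that was read in as text.
raw = pd.Series(["1", "2", "3"], name="cylinders_text")

as_int = raw.astype("int64")          # digit strings -> 64-bit integers
as_float = as_int.astype("float32")   # integers -> narrower float
as_bool = as_int.astype("bool")       # non-zero integers -> True
as_cat = raw.astype("category")       # strings -> categorical

# astype does not modify `raw`; it still holds the original strings.
still_text = raw.tolist()
```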

#### <a id='toc2_1_1_'></a>[*Inspecting numerical limits of different integer and float types*](#toc0_)

The **default numeric types are 8 bytes (64 bits) wide, i.e. int64 or float64**. If you can use a narrower type, you can cut back on memory usage, giving you memory to process more data. You can use NumPy to inspect limits on integer and float types.

In [5]:
# integer
print(np.iinfo(np.int16))  # or, np.iinfo("int16")
print(np.iinfo(np.uint8))  # or, np.iinfo("uint8")

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------



In [6]:
# float
print(np.finfo("float16"))

Machine parameters for float16
---------------------------------------------------------------
precision =   3   resolution = 1.00040e-03
machep =    -10   eps =        9.76562e-04
negep =     -11   epsneg =     4.88281e-04
minexp =    -14   tiny =       6.10352e-05
maxexp =     16   max =        6.55040e+04
nexp =        5   min =        -max
smallest_normal = 6.10352e-05   smallest_subnormal = 5.96046e-08
---------------------------------------------------------------
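The limits reported by `np.iinfo` can be used to pick a narrower type programmatically. The helper below, `smallest_int_dtype`, is a hypothetical function written for this sketch and is not part of NumPy or pandas. It is shown next to `pd.to_numeric(..., downcast="integer")`, which performs a similar downcast.

```python
import numpy as np
import pandas as pd


def smallest_int_dtype(series):
    """Return the narrowest signed integer dtype that holds every value."""
    lo, hi = series.min(), series.max()
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        limits = np.iinfo(candidate)
        if limits.min <= lo and hi <= limits.max:
            return np.dtype(candidate)
    raise ValueError("values do not fit in a 64-bit signed integer")


# Illustrative values; 150 is too large for int8 (max 127).
values = pd.Series([9, 23, 150])

chosen = smallest_int_dtype(values)
downcast = pd.to_numeric(values, downcast="integer")
```

Writing the loop out by hand makes the role of the `min` and `max` limits explicit. In practice `pd.to_numeric` is usually the shorter route.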



#### <a id='toc2_1_2_'></a>[*Checking memory usage*](#toc0_)

To check how much memory the values of a Series consume, we can use the `.nbytes` attribute.

In [7]:
# by default, the data in city_mpg Series was stored as int64 type

# the max value in our series object is 150
# so, we can't use int8 (max 127) but we can cast to int16

# to see how much space is saved
city_mpg.nbytes - city_mpg.astype("int16").nbytes

246864

`.nbytes` on an object-dtype Series only counts the array of references (8 bytes per element), not the Python strings those references point to. The **make** column in the vehicles dataset holds the manufacturer names (strings) and is stored as object dtype. To include the memory used by the strings themselves, call the `.memory_usage()` method with `deep=True`.

In [8]:
manufac.head(3)

0    Alfa Romeo
1       Ferrari
2         Dodge
Name: make, dtype: object

In [9]:
# examining memory usage with the nbytes attribute
manufac.nbytes

329152

In [10]:
# examining memory usage with the memory_usage method
manufac.memory_usage(deep=True)

2606399

The value of `.nbytes` is only the memory used by the data buffer, not the ancillary parts of the Series. `.memory_usage()` also includes the index, and with `deep=True` it adds the memory held by the Python objects (here, the strings) that an object column points to.
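The relationship between these numbers can be checked on a small example. The Series below is hand-made for illustration. The object dtype is requested explicitly so that the strings are stored as Python objects.

```python
import pandas as pd

names = pd.Series(["Alfa Romeo", "Ferrari", "Dodge", "Dodge"], dtype="object")

pointer_bytes = names.nbytes                         # references only
shallow_no_index = names.memory_usage(index=False)   # same figure as nbytes
shallow_with_index = names.memory_usage()            # adds the index
deep_with_index = names.memory_usage(deep=True)      # also counts the strings
```

The three figures grow in that order: references only, references plus index, and finally references plus index plus the strings themselves.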

#### <a id='toc2_1_3_'></a>[*String and Categorical type*](#toc0_)

A `categorical` Series is useful for string data with many repeated values and can result in large memory savings. Instead of storing a separate Python `str` for every row, pandas stores each distinct value once and keeps a small integer code per row, so **repeating values are not duplicated**. A categorical Series of strings behaves much like a string Series: **the `.str` accessor is still available**, and `.loc` / `.iloc` work the same way (for filtering, slicing, partial slicing, etc.).

For example, converting the make column of the vehicles DataFrame (the `manufac` Series) to the category dtype gives a much smaller memory footprint.

In [11]:
# the make column as categorical object
manufac_cat = df.make.astype("category")

In [12]:
manufac_cat.head(3)

0    Alfa Romeo
1       Ferrari
2         Dodge
Name: make, dtype: category
Categories (136, object): ['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo', ..., 'Volvo', 'Wallace Environmental', 'Yugo', 'smart']

In [13]:
# difference in (shallow) memory footprint; pass deep=True to both calls to also count the strings
manufac.memory_usage() - manufac_cat.memory_usage()

241608
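To also account for the strings themselves, the comparison can be made with `deep=True`. The sketch below uses synthetic data (three names repeated many times) rather than the vehicles file, so the exact byte counts will differ from the notebook output.

```python
import pandas as pd

# Hypothetical data: three manufacturer names repeated many times.
makes = pd.Series(["Alfa Romeo", "Ferrari", "Dodge"] * 10_000, dtype="object")
makes_cat = makes.astype("category")

object_bytes = makes.memory_usage(deep=True)
category_bytes = makes_cat.memory_usage(deep=True)
saved = object_bytes - category_bytes

# Small integer codes stand in for the repeated strings.
code_dtype = makes_cat.cat.codes.dtype
```

With only three distinct values, each row costs a single small integer code, which is where most of the saving comes from.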

- **Custom & ordered categories**

> To define custom categories, use the `pd.Categorical(values, categories=None, ordered=None)` constructor. The result is unordered unless `ordered=True` is passed.

**Note:**
1. A categorical **might have an order**, but numerical operations (additions, divisions, ...) are not possible.    
2. Values that are **not among the specified categories** become **NaN** when the categorical is created. Assigning such a value to an existing categorical raises an error instead.
3. Order is defined by the **order of the categories, not lexical order of the values**.

In [14]:
# values
vals = manufac
# categories
cat = manufac.unique()

# pass ordered=True to make the categories ordered; it is left at the default (unordered) here
ord_manufac = pd.Categorical(values=vals, categories=cat)

In [15]:
ord_manufac

['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru', ..., 'Subaru', 'Subaru', 'Subaru', 'Subaru', 'Subaru']
Length: 41144
Categories (136, object): ['Alfa Romeo', 'Ferrari', 'Dodge', 'Subaru', ..., 'Consulier Industries Inc', 'Goldacre', 'Isis Imports Ltd', 'PAS Inc - GMC']
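The notes on ordered categoricals can be illustrated with a short sketch. The size labels below are invented for this example. The value `"huge"` is deliberately left out of the categories to show how it is handled.

```python
import pandas as pd

sizes = pd.Categorical(
    values=["medium", "small", "large", "huge"],  # "huge" is not a declared category
    categories=["small", "medium", "large"],      # order as listed, not alphabetical
    ordered=True,
)
size_ser = pd.Series(sizes)

missing_mask = size_ser.isna()            # the undeclared value became NaN
smallest = size_ser.min()                 # follows the category order
bigger_than_small = size_ser > "small"    # comparisons work on ordered categoricals

# Arithmetic is not defined for categoricals, ordered or not.
try:
    size_ser + size_ser
    arithmetic_ok = True
except TypeError:
    arithmetic_ok = False
```

Alphabetically, `"large"` would sort first, but `min()` follows the declared category order and returns `"small"`.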

- **Some useful methods for categorical series**

In [16]:
demo_cat_ser = pd.Series(["long_coat", "short_coat", "medium_coat", "wire_haired", "long_coat", "short_coat", "short_coat", "short_coat", "medium_coat", "short_coat", "medium_coat"]).astype("category")

In [17]:
demo_cat_ser.value_counts(dropna=False)

short_coat     5
medium_coat    3
long_coat      2
wire_haired    1
Name: count, dtype: int64

Several useful methods are available on categorical objects. If a Series has the category dtype (for example after `.astype("category")`), these methods are reached through the `.cat` accessor. If the object was created directly with `pd.Categorical()`, the same methods are available as ordinary attributes through dot notation. Some of these methods are briefly discussed below:

> The `categories` attribute returns the categories of the categorical series.

In [18]:
demo_cat_ser.cat.categories

Index(['long_coat', 'medium_coat', 'short_coat', 'wire_haired'], dtype='object')

> The `set_categories(new_categories, ordered=None, rename=False)` method

**Note:**
1. Can be used to specify new categories for the categorical series as well as ordering the categories.
2. Values not included in the new categories will be set to NaN.

In [19]:
demo_cat_ser.cat.set_categories(["short_coat", "medium_coat", "long_coat"], ordered=True).value_counts(dropna=False)

short_coat     5
medium_coat    3
long_coat      2
NaN            1
Name: count, dtype: int64

> The `rename_categories()` method can be used to rename an existing category

**Note:**
1. `lambda` functions can be used to rename categories. 
2. You can't rename a category to an existing category (as a result this can't be used to collapse different categories into one single category).
3. To collapse several categories into one, first use the `.replace()` method to give all the categories being merged a single name. This operates at the string level rather than the categorical level and converts the series to *object* type, so convert it back with `astype("category")` afterwards.

In [20]:
demo_cat_ser.cat.rename_categories({"wire_haired": "wirehaired"}).value_counts(dropna=False)

short_coat     5
medium_coat    3
long_coat      2
wirehaired     1
Name: count, dtype: int64

In [21]:
demo_cat_ser.cat.rename_categories(lambda c: c.title()).value_counts()

Short_Coat     5
Medium_Coat    3
Long_Coat      2
Wire_Haired    1
Name: count, dtype: int64
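Collapsing categories, as described in the third note above, might look like the sketch below. The grouping into an `"other"` label is a hypothetical choice made for illustration. The Series is cast to object explicitly before `.replace()` so that the replacement works on plain strings regardless of pandas version.

```python
import pandas as pd

coats = pd.Series(
    ["long_coat", "short_coat", "medium_coat", "wire_haired", "short_coat"]
).astype("category")

# Hypothetical grouping: fold two categories into a single "other" label.
collapse = {"medium_coat": "other", "wire_haired": "other"}

collapsed = (
    coats.astype("object")   # work on plain strings
    .replace(collapse)       # string-level replacement
    .astype("category")      # back to categorical
)
collapsed_categories = list(collapsed.cat.categories)
```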

> The `add_categories(new_categories)` method can be used to add new categories to the existing categories

**Note:**
1. The newly created categories will not be assigned any existing values but they'll be available for values added in the future.

In [22]:
demo_cat_ser.cat.add_categories(["silky_coat"]).value_counts()

short_coat     5
medium_coat    3
long_coat      2
wire_haired    1
silky_coat     0
Name: count, dtype: int64
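A new category only becomes usable for assignment once it has been added. The sketch below, using a two-row Series made up for this example, first tries to assign an undeclared value and then repeats the assignment after `add_categories`.

```python
import pandas as pd

coats = pd.Series(["long_coat", "short_coat"]).astype("category")

# Assigning a value that is not a declared category is rejected.
try:
    coats.iloc[0] = "silky_coat"
    assigned_without_category = True
except (TypeError, ValueError):
    assigned_without_category = False

# After adding the category, the same assignment is accepted.
extended = coats.cat.add_categories(["silky_coat"])
extended.iloc[0] = "silky_coat"
extended_categories = list(extended.cat.categories)
```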

> The `remove_categories(removals)` method

**Note:** 
1. `removals` must be included in the old categories. Values which were in the removed categories will be set to NaN.

In [23]:
demo_cat_ser.cat.remove_categories("wire_haired").value_counts(dropna=False)

short_coat     5
medium_coat    3
long_coat      2
NaN            1
Name: count, dtype: int64

- **Label encoding**

Label encoding is a technique that codes categorical values as integers. In pandas, the codes *start at `0` and end at `n - 1`*, where n is the number of categories, and a *`-1` code marks a missing value*.

Label encoding is used to save memory and to simplify responses when using survey data (e.g. 1: yes, 2: no).

Although the codes created through label encoding can be used in machine learning models, this is not the best encoding method for machine learning.

> To create the label codes we can use the `.cat.codes` attribute. Each code is the position of the value in `.cat.categories`. When the categories are inferred (as with `.astype("category")`), they are sorted, so string categories are coded in alphabetical order.

In [None]:
manufac_codes = manufac_cat.cat.codes

In [28]:
manufac_codes.head()

0      3
1     38
2     31
3     31
4    119
dtype: int16

In [29]:
manufac_cat.head()

0    Alfa Romeo
1       Ferrari
2         Dodge
3         Dodge
4        Subaru
Name: make, dtype: category
Categories (136, object): ['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo', ..., 'Volvo', 'Wallace Environmental', 'Yugo', 'smart']
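The coding rules are easier to see on a tiny example. The survey-style answers below are invented, and one of them is missing on purpose.

```python
import pandas as pd

answers = pd.Series(["yes", "no", None, "yes"]).astype("category")

# Inferred categories are sorted, so "no" gets code 0 and "yes" gets code 1.
categories = list(answers.cat.categories)

# The missing answer is coded as -1.
codes = answers.cat.codes.tolist()
```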

> Label encoding is often used in surveys. The responses and their corresponding codes are usually kept in a code book or data dictionary, e.g. `{1: "Yes", 2: "No"}`. If we create a label encoding and save the new dataset, we should also keep a map from the new codes back to the original values. This can be built with `zip`.

In [31]:
code_book = dict(zip(manufac_codes, manufac_cat))

In [32]:
code_book

{3: 'Alfa Romeo',
 38: 'Ferrari',
 31: 'Dodge',
 119: 'Subaru',
 126: 'Toyota',
 131: 'Volkswagen',
 132: 'Volvo',
 6: 'Audi',
 11: 'BMW',
 19: 'Buick',
 23: 'Cadillac',
 24: 'Chevrolet',
 25: 'Chrysler',
 22: 'CX Automotive',
 87: 'Nissan',
 41: 'Ford',
 51: 'Hyundai',
 54: 'Infiniti',
 69: 'Lexus',
 82: 'Mercury',
 78: 'Mazda',
 88: 'Oldsmobile',
 97: 'Plymouth',
 98: 'Pontiac',
 106: 'Rolls-Royce',
 33: 'Eagle',
 70: 'Lincoln',
 81: 'Mercedes-Benz',
 42: 'GMC',
 111: 'Saab',
 49: 'Honda',
 114: 'Saturn',
 84: 'Mitsubishi',
 56: 'Isuzu',
 60: 'Jeep',
 0: 'AM General',
 45: 'Geo',
 121: 'Suzuki',
 32: 'E. P. Dutton, Inc.',
 68: 'Land Rover',
 90: 'PAS, Inc',
 2: 'Acura',
 59: 'Jaguar',
 73: 'Lotus',
 48: 'Grumman Olson',
 99: 'Porsche',
 4: 'American Motors Corporation',
 63: 'Kia',
 67: 'Lamborghini',
 93: 'Panoz Auto-Development',
 76: 'Maserati',
 112: 'Saleen',
 5: 'Aston Martin',
 27: 'Dabryan Coach Builders Inc',
 37: 'Federal Coach',
 128: 'Vector',
 14: 'Bentley',
 29: 'Daewoo

> We can map the codes to the original values with the `map` method. This is useful for converting the codes back to the original values.

In [33]:
manufac_codes.map(code_book)

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Length: 41144, dtype: object
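As an alternative sketch, the code book can also be built from the categories alone, since each code is just a position in `.cat.categories`. This avoids iterating over every row. The example uses a small made-up Series, and `pd.Categorical.from_codes` is shown as one way to rebuild the categorical from the codes.

```python
import pandas as pd

makes = pd.Series(["Dodge", "Ferrari", "Dodge", "Alfa Romeo"]).astype("category")
codes = makes.cat.codes

# One entry per category: position -> label.
code_book = dict(enumerate(makes.cat.categories))

# Decode to plain values, or rebuild a categorical from codes and categories.
decoded = codes.map(code_book)
restored = pd.Categorical.from_codes(
    codes.to_numpy(), categories=makes.cat.categories
)
```

Unlike the `zip` approach, this code book also contains categories that happen not to occur in the data.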

- **Some problems of using categorical type**

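1. If the series has many unique values (i.e. many categories), the memory savings will be small or nonexistent.
2. The result of a `.str` method or of `.apply()` is generally no longer categorical, so it has to be converted back with `astype("category")` if the categorical type is needed.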
3. Most `numpy` operations don't work with categorical type.
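The three points above can be checked with a short sketch. All data here is synthetic. The identifiers in `unique_ids` are deliberately all distinct to represent the worst case for a categorical.

```python
import numpy as np
import pandas as pd

coats = pd.Series(["long_coat", "short_coat", "short_coat"]).astype("category")

# 1. With all-unique values the categorical stores every string plus the codes.
unique_ids = pd.Series([f"vehicle_{i}" for i in range(1_000)], dtype="object")
object_bytes = unique_ids.memory_usage(index=False, deep=True)
category_bytes = unique_ids.astype("category").memory_usage(index=False, deep=True)

# 2. The result of a .str method is no longer categorical.
upper = coats.str.upper()
lost_category = not isinstance(upper.dtype, pd.CategoricalDtype)

# 3. A NumPy reduction such as mean is rejected for categorical data.
try:
    np.mean(coats)
    numpy_mean_ok = True
except TypeError:
    numpy_mean_ok = False
```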