
### Automatic Conversion.

In [1]:
import pandas as pd
URL = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'

In [2]:
df = pd.read_csv(URL)
city_mpg = df.city08
highway_mpg = df.highway08


The `.convert_dtypes` method, introduced in pandas 1.0, tries to convert a Series to a type that supports `pd.NA`. For our `city_mpg` series, it changes the type from `int64` to `Int64`:

In [3]:
# Before converting
city_mpg

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

In [4]:
(city_mpg
 .isnull()
 .sum()
)

0

In [5]:
# After converting
print(city_mpg.convert_dtypes())

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: Int64
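
The benefit of a `pd.NA`-aware type shows up once values are missing. Here is a minimal sketch on a small made-up series (the name `ratings` and its values are illustrative, not from the vehicles data). A NumPy integer column cannot hold a missing value, so pandas falls back to `float64`, while `.convert_dtypes` recovers a nullable integer type:

```python
import pandas as pd

# Hypothetical toy data, not taken from the vehicles dataset
ratings = pd.Series([1, 2, None])

# The missing value forces a float dtype in NumPy-backed storage
before = ratings.dtype

# convert_dtypes switches to the nullable Int64 type and keeps the gap as missing
converted = ratings.convert_dtypes()
after = converted.dtype

print(before, after)
```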


`.convert_dtypes` can feel somewhat magical, because pandas picks the type for you. It is usually better to have explicit control over what happens to our data.

To specify the type of a Series, use the `.astype` method. Our city mileage fits in a 16-bit integer, but an 8-bit integer will not work: the maximum value of a signed 8-bit integer is 127, and we have some cars with a value of 150.

In [6]:
city_mpg.astype('Int16')

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: Int16

In [7]:
city_mpg.astype('Int8')

TypeError: cannot safely cast non-equivalent int64 to int8

Using the correct type can save a significant amount of memory. The default numeric type is 8 bytes wide (64 bits, i.e. `int64` or `float64`). If you can use a narrower type, you cut memory usage, which lets you process more data.

You can use NumPy to inspect limits on integer and float types.

In [8]:
import numpy as np
np.iinfo('int64')

iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)

In [9]:
np.iinfo('uint8')

iinfo(min=0, max=255, dtype=uint8)

In [10]:
np.finfo('float16')

finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)

In [11]:
np.finfo('float64')

finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
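
These limits can be checked against the data before casting. The sketch below uses a hypothetical helper, `fits_in`, and a few made-up mileage values (including the 150 mentioned above) to test whether a Series fits in a given integer type:

```python
import numpy as np
import pandas as pd

def fits_in(series, dtype):
    """Return True if every value lies within the integer dtype's range."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)

# Hypothetical sample; 150 matches the largest city mileage mentioned in the text
mpg = pd.Series([19, 9, 23, 10, 150])

fits_int8 = fits_in(mpg, 'int8')     # signed 8-bit tops out at 127
fits_uint8 = fits_in(mpg, 'uint8')   # unsigned 8-bit goes up to 255
fits_int16 = fits_in(mpg, 'int16')

print(fits_int8, fits_uint8, fits_int16)
```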

### Memory Usage.



To calculate the memory usage of the Series, you can use the `.nbytes` property or the `.memory_usage` method. `.memory_usage` is useful when dealing with `object` types as you can pass `deep=True` to include the amount of memory used by the Python objects in the Series.


Here we compare the memory usage of the default integer type to `Int16`:

In [12]:
city_mpg.dtype

dtype('int64')

In [13]:
city_mpg.nbytes

329152

In [14]:
city_mpg.astype('Int16').nbytes

123432
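
Note that the `Int16` figure works out to three bytes per value rather than two (123432 = 41144 × 3). This is consistent with the nullable integer types keeping a one-byte missing-value mask next to each value. A small sketch on a made-up series of 1,000 integers (the name `numbers` is illustrative) compares the three options:

```python
import pandas as pd

# Hypothetical data: 1,000 small integers
numbers = pd.Series(range(1000), dtype='int64')

default_bytes = numbers.nbytes                    # 8 bytes per value
numpy_bytes = numbers.astype('int16').nbytes      # 2 bytes per value
nullable_bytes = numbers.astype('Int16').nbytes   # 2 data bytes + 1 mask byte per value

print(default_bytes, numpy_bytes, nullable_bytes)
```

If the column has no missing values, the NumPy `int16` type is the smaller of the two 16-bit choices.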

Using `.nbytes` with `object` types only shows how much memory the pandas object itself is taking. The *make* of the autos contains strings and is stored as an `object`. ***To get the amount of memory that includes the strings, we need to use the `.memory_usage` method:***

In [17]:
make = df.make
make.nbytes

329152

In [18]:
make.memory_usage()

329280

In [19]:
make = df.make
make.memory_usage(deep=True)

2606395

The value of `.nbytes` is just the memory that the data is using and not the ancillary parts of the Series. The `.memory_usage` method includes the index memory and, with `deep=True`, the contribution from `object` types.
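
The relationship between these numbers can be sketched on a tiny made-up series (the values below are illustrative; `dtype=object` is set explicitly to mirror how the make column is stored here):

```python
import pandas as pd

# Hypothetical toy data stored as Python objects
makes = pd.Series(['Ford', 'Toyota', 'Ford', 'Dodge'], dtype=object)

data_only = makes.nbytes                       # the array of object pointers
no_index = makes.memory_usage(index=False)     # same figure as .nbytes
with_index = makes.memory_usage()              # adds the index
deep = makes.memory_usage(deep=True)           # also counts the Python strings

print(data_only, no_index, with_index, deep)
```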

In the next section, we discuss converting to a categorical. We can see that doing so saves a lot of memory for the `make` data:

In [20]:
(make
 .astype('category')
 .memory_usage(deep=True)
)

95888

### String and category types.

The `.astype` method can also convert a numeric series to strings; just pass `'str'` into it.

In [26]:
city_mpg.astype('str')

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: object

To convert to a categorical type, you can pass in `'category'` as a type:

In [27]:
city_mpg.astype('category')

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6, 7, 8, 9, ..., 137, 138, 140, 150]

**A categorical Series is useful for string data and can result in large memory savings. This is because pandas stores a Python string for each value when you have string data. When you convert it to categorical data, pandas no longer keeps a Python string for each value but stores each distinct value once, so repeating values are not duplicated. You still have all of the functionality found on the `.str` attribute, but with potentially large memory savings (if you have many duplicate values) and performance boosts, as fewer string operations are needed.**
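
A minimal sketch of both claims, using a made-up series of three repeated makes (the data and variable names are illustrative, and `dtype=object` is an assumption chosen to match how the strings are stored in this notebook):

```python
import pandas as pd

# Hypothetical data: three distinct strings repeated 1,000 times each
makes = pd.Series(['Ford', 'Toyota', 'Dodge'] * 1000, dtype=object)
as_category = makes.astype('category')

object_bytes = makes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

# String methods remain available through .str on the categorical Series
upper = as_category.str.upper()

print(object_bytes, category_bytes)
```

The saving depends on how often values repeat; a column of mostly unique strings would gain little.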

### Ordered Categories.

To create ordered categories, you need to define your own `CategoricalDtype`.

In [29]:
values = pd.Series(sorted(set(city_mpg)))
cate_val = pd.CategoricalDtype(categories=values, ordered=True)
city_mpg.astype(cate_val)

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6 < 7 < 8 < 9 ... 137 < 138 < 140 < 150]
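
What the ordering buys you is easier to see with labels whose natural order differs from alphabetical order. The sketch below uses made-up labels (`low`, `medium`, `high`) rather than the mileage values:

```python
import pandas as pd

# Hypothetical labels; the order of `categories` defines the ranking
size_type = pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
sizes = pd.Series(['low', 'high', 'medium', 'low']).astype(size_type)

above_low = sizes > 'low'        # comparisons follow the category order
largest = sizes.max()            # the highest-ranked category, not the alphabetical maximum
in_order = sizes.sort_values()   # sorted by rank: low, low, medium, high

print(largest)
```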

### Converting to other types.

- `.to_numpy` (or `.values`) will give us a NumPy array of values.
- `.to_list` will return a Python list of values.

It is better to stay away from these unless necessary.

A Series object is a column from a DataFrame. However, you might need to turn a Series back into a DataFrame. We can use the `.to_frame` method.
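
As a quick sketch of the three conversions side by side, here is a tiny made-up series (three values borrowed from the top of `city_mpg`; the variable names are illustrative):

```python
import pandas as pd

# Hypothetical three-value stand-in for the city mileage column
mpg = pd.Series([19, 9, 23], name='city08')

as_array = mpg.to_numpy()    # NumPy array of the values
as_list = mpg.to_list()      # plain Python list
as_frame = mpg.to_frame()    # one-column DataFrame named after the Series

# Often no conversion is needed: arithmetic on a Series is already vectorized
doubled = mpg * 2

print(as_list, list(as_frame.columns))
```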

In [30]:
city_mpg.to_frame()

       city08
0          19
1           9
2          23
3          10
4          17
...       ...
41139      19
41140      20
41141      18
41142      18
