In [1]:
import pandas as pd

# Conversion Methods
This section covers methods for changing the type of the data in a Series.

### Automatic Conversions
`.convert_dtypes` converts a Series to a type that supports `pd.NA`.
For the `city_mpg` series, it changes the type from `int64` to `Int64`:

In [2]:
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df = pd.read_csv(url)
city_mpg = df.city08
highway_mpg = df.highway08


In [4]:
city_mpg

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

In [5]:
city_mpg.convert_dtypes()

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: Int64

In [6]:
city_mpg.astype('Int64')

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: Int64
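
The practical difference between `int64` and `Int64` shows up once a value is missing. The following is a minimal sketch on hypothetical toy data (the real `city_mpg` series is not used here, so the values and the missing entry are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical stand-in for city_mpg with one missing reading.
raw = pd.Series([19, 9, None, 10], name="city08")

# NumPy-backed integers cannot hold a missing value,
# so pandas falls back to float64 and stores NaN.
print(raw.dtype)

# convert_dtypes picks a nullable type; whole-number floats become Int64.
nullable = raw.convert_dtypes()
print(nullable.dtype)

# The missing entry is now pd.NA and the other values stay integers.
print(nullable.isna().sum())
```

With the nullable type, the column keeps integer semantics even when some entries are missing.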

In [7]:
city_mpg.astype('Int8')

TypeError: cannot safely cast non-equivalent int64 to int8

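The cast to `Int8` fails because some values in `city_mpg` fall outside the range that 8 bits can hold. Using the correct type can save a significant amount of memory. The default numeric types, `int64` and `float64`, are 8 bytes (64 bits) wide.

You can use NumPy to inspect the limits of integer and float types: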


In [8]:
import numpy as np

In [9]:
np.iinfo('int64')

iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)

In [10]:
np.iinfo('uint8')

iinfo(min=0, max=255, dtype=uint8)

In [11]:
np.finfo('float64')

finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
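
The limits reported by `np.iinfo` can be used to choose a narrower type before casting. The sketch below is one possible approach, not a pandas feature: `smallest_int_dtype` is a hypothetical helper, and the toy values are assumptions chosen so that one of them exceeds the `int8` maximum of 127.

```python
import numpy as np
import pandas as pd


def smallest_int_dtype(series):
    """Return the name of the narrowest signed NumPy integer type
    that can hold every value in the series (hypothetical helper)."""
    lo, hi = series.min(), series.max()
    for name in ("int8", "int16", "int32", "int64"):
        info = np.iinfo(name)
        if info.min <= lo and hi <= info.max:
            return name
    raise ValueError("values do not fit in a 64-bit signed integer")


# Toy values; 150 is larger than the int8 maximum of 127.
mpg = pd.Series([6, 19, 23, 150])
choice = smallest_int_dtype(mpg)
print(choice)

# The cast is safe because the range was checked first.
print(mpg.astype(choice).dtype)
```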

### Memory Usage
To check how much memory a Series uses, use the `.nbytes` property or the `.memory_usage` method. The latter is useful with `object` types because you can pass `deep=True` to include the memory used by the Python objects in the Series.

In [13]:
city_mpg.nbytes

329152

In [14]:
city_mpg.astype('Int16').nbytes 

123432
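
The numbers above are consistent with 8 bytes per value for `int64` (41144 × 8 = 329152) and 3 bytes per value for `Int16` (41144 × 3 = 123432): the nullable type stores a 2-byte integer plus a 1-byte mask flag for each entry. A small sketch on synthetic data (an assumption, used so the example runs without downloading the dataset) shows the same ratios:

```python
import numpy as np
import pandas as pd

n = 1_000
# Synthetic int64 data standing in for city_mpg.
s = pd.Series(np.arange(n, dtype="int64") % 100)

print(s.nbytes)                  # 8 bytes per value
print(s.astype("int16").nbytes)  # 2 bytes per value (NumPy type, no missing values)
print(s.astype("Int16").nbytes)  # 2 bytes per value + 1 byte per mask entry
```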

The `make` column of the autos data holds strings and is stored as `object`. To get the amount of memory that includes the strings, we need to use the `.memory_usage` method with `deep=True`.


In [16]:
make = df.make
make.nbytes

329152

In [20]:
make.memory_usage()

329280

In [21]:
make.memory_usage(deep=True)

2606395

In [23]:
make.astype('category').memory_usage(deep=True)

95888
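
The same comparison can be reproduced on a small synthetic column. This is only a sketch: the make names and the repetition count are made up, and `dtype=object` is passed explicitly so the example matches the `object` storage described above regardless of the pandas version's default string type. Exact byte counts vary by platform, so none are shown.

```python
import pandas as pd

# Hypothetical column with a few distinct strings repeated many times.
makes = pd.Series(["Ford", "Toyota", "Honda", "Ford"] * 2_500, dtype=object)

shallow = makes.memory_usage()        # index + 8-byte pointers only
deep = makes.memory_usage(deep=True)  # also counts the Python string objects
as_category = makes.astype("category").memory_usage(deep=True)

print(shallow, deep, as_category)
```

The categorical version stores each distinct string once plus a small integer code per row, which is why it is much smaller when values repeat.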

The `.astype` method can convert a numeric series to strings if you pass `str` into it, or to a categorical if you pass `'category'`.

In [25]:
city_mpg.astype(str)

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: object

In [26]:
city_mpg.astype('category')

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6, 7, 8, 9, ..., 137, 138, 140, 150]

A categorical series is useful for string data and can result in large memory savings, especially when values are repeated often.

### Ordered Categories
To create ordered categories, you need to define your own `CategoricalDtype`:

In [28]:
values = pd.Series(sorted(set(city_mpg)))

In [29]:
city_type = pd.CategoricalDtype(categories=values, ordered=True)

In [30]:
city_mpg.astype(city_type)

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6 < 7 < 8 < 9 ... 137 < 138 < 140 < 150]
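
Ordering matters because it enables comparisons, sorting, and `min`/`max` based on the category order rather than on the raw values. A minimal sketch with hypothetical size labels (not part of the vehicles dataset) illustrates this:

```python
import pandas as pd

# Hypothetical labels whose logical order differs from alphabetical order.
sizes = pd.Series(["small", "large", "medium", "small"])
size_type = pd.CategoricalDtype(
    categories=["small", "medium", "large"], ordered=True
)
ordered = sizes.astype(size_type)

print(ordered.max())                   # largest by category order
print((ordered > "small").tolist())    # element-wise comparison
print(ordered.sort_values().tolist())  # sorted by category order
```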

If you want a DataFrame with just a single column, use the `.to_frame` method:

In [31]:
city_mpg.to_frame()

| | city08 |
|---|---|
| 0 | 19 |
| 1 | 9 |
| 2 | 23 |
| 3 | 10 |
| 4 | 17 |
| ... | ... |
| 41139 | 19 |
| 41140 | 20 |
| 41141 | 18 |
| 41142 | 18 |
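
By default the resulting column takes the name of the Series. As a small sketch on toy data (the values and the alternative column name `city_mpg` are assumptions for illustration), `.to_frame` also accepts a `name` argument to override it:

```python
import pandas as pd

s = pd.Series([19, 9, 23], name="city08")

frame = s.to_frame()
print(type(frame).__name__)    # DataFrame
print(frame.columns.tolist())  # column named after the Series

# Pass name= to choose a different column label.
renamed = s.to_frame(name="city_mpg")
print(renamed.columns.tolist())
```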
