### Polars data types

The examples below use `.head()` to limit the output to a few rows and save space.
To see the full output, remove `.head()` from the code.

This notebook is divided into sections. If your code editor supports it, you can use the **Outline** feature to jump to the section you are interested in.

For more details, check out the Polars API reference: https://pola-rs.github.io/polars/py-polars/html/reference/index.html

In [2]:
import polars as pl

In [61]:
# Load a CSV file into a Polars DataFrame
video_games_reviews = pl.read_csv('../datasets/reviews_videogames_350_simplified.csv', infer_schema_length=10000, try_parse_dates=True)

In [62]:
# Configure the number of characters to show for each string column
pl.Config.set_fmt_str_lengths(15)

polars.config.Config

#### View the data types

In [63]:
# Printing a dataframe includes the data types:
# the title column is a string, the rating column is a float, and the review_date column is a date, etc.
video_games_reviews.head(2)

title,rating,review_date,review_text
str,f64,date,str
"""Killzone: Shad…",5.0,2016-12-02,"""First time hav…"
"""Resident Evil …",5.0,2014-07-29,"""good"""


In [64]:
# Show the data types of each column
video_games_reviews.schema

{'title': Utf8, 'rating': Float64, 'review_date': Date, 'review_text': Utf8}

#### Cast the data types

In [65]:
# Cast the rating column to the Int64 data type
video_games_reviews.with_columns(pl.col('rating').cast(pl.Int64)).schema

{'title': Utf8, 'rating': Int64, 'review_date': Date, 'review_text': Utf8}

#### Choosing data types when reading a file

In [66]:
# Read the rating column as Float32 instead of the inferred Float64
pl.read_csv('../datasets/reviews_videogames_350_simplified.csv', 
    infer_schema_length=10000, 
    try_parse_dates=True,
    dtypes={'rating': pl.Float32}).schema

{'title': Utf8, 'rating': Float32, 'review_date': Date, 'review_text': Utf8}

#### Choosing data types when creating a DataFrame

In [76]:
# Create an example dataframe
df_example = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.0, 2.0, 3.0, 4.0, 5.0],
    'c': ['a', 'b', 'c', 'd', 'e'],
    'd': [True, False, True, False, True]}
)

df_example

a,b,c,d
i64,f64,str,bool
1,1.0,"""a""",True
2,2.0,"""b""",False
3,3.0,"""c""",True
4,4.0,"""d""",False
5,5.0,"""e""",True


In [77]:
# Set the data type of every column with the schema argument, e.g. column 'b' becomes Float32
df_example_force_data_types = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.0, 2.0, 3.0, 4.0, 5.0],
    'c': ['a', 'b', 'c', 'd', 'e'],
    'd': [True, False, True, False, True]},
    schema={'a': pl.Int32, 'b': pl.Float32, 'c': pl.Utf8, 'd': pl.Int16}
)

df_example_force_data_types

a,b,c,d
i32,f32,str,i16
1,1.0,"""a""",1
2,2.0,"""b""",0
3,3.0,"""c""",1
4,4.0,"""d""",0
5,5.0,"""e""",1


#### Shrink the data type

In [79]:
# Shrink each column to the smallest data type that can hold its values

# Initial data types
print('Before :', df_example.schema)

# Shrunk data types
print('After :', df_example.with_columns(pl.all().shrink_dtype()).schema)

Before : {'a': Int64, 'b': Float64, 'c': Utf8, 'd': Boolean}
After : {'a': Int8, 'b': Float32, 'c': Utf8, 'd': Boolean}
