# [Expressions: Strings](https://docs.pola.rs/user-guide/expressions/strings/)

## The string namespace

Small note: `.str.len_bytes()` is faster than `.str.len_chars()`. If the strings contain only ASCII symbols, both return the same count, so prefer `.str.len_bytes()` in that case.

In [2]:
import polars as pl

df = pl.DataFrame(
    {
        "language": ["English", "Dutch", "Portuguese", "Finnish"],
        "fruit": ["pear", "peer", "pêra", "päärynä"],
    }
)

df.with_columns(
    pl.col("fruit").str.len_bytes().alias("byte_count"),
    pl.col("fruit").str.len_chars().alias("letter_count"),
)

language,fruit,byte_count,letter_count
str,str,u32,u32
"""English""","""pear""",4,4
"""Dutch""","""peer""",4,4
"""Portuguese""","""pêra""",5,4
"""Finnish""","""päärynä""",10,7

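To see where the two counts come from, here is a plain-Python sketch (no Polars involved) that reproduces the numbers in the table above. It assumes that `len_bytes` counts UTF-8 bytes and `len_chars` counts Unicode code points, and it normalizes to NFC so that each accented letter is a single code point. The `counts` name is only for illustration.

```python
import unicodedata

fruits = ["pear", "peer", "pêra", "päärynä"]

counts = []
for fruit in fruits:
    # Normalize so that "ê" and "ä" are single precomposed code points.
    fruit = unicodedata.normalize("NFC", fruit)
    byte_count = len(fruit.encode("utf-8"))  # assumed equivalent of len_bytes
    char_count = len(fruit)                  # assumed equivalent of len_chars
    counts.append((fruit, byte_count, char_count))

for row in counts:
    print(row)
```

Each accented letter takes two bytes in UTF-8, which is why only the ASCII-only rows have matching counts.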

## Parsing strings

### Check for the existence of a pattern

In [6]:
df.select(
    pl.col("fruit"),
    pl.col("fruit").str.starts_with("p").alias("starts_with_p"),
    pl.col("fruit").str.contains("p..r").alias("p..r"),
    pl.col("fruit").str.contains("e+").alias("e+"),
    pl.col("fruit").str.ends_with("r").alias("ends_with_r"), # make note of this
)

fruit,starts_with_p,p..r,e+,ends_with_r
str,bool,bool,bool,bool
"""pear""",True,True,True,True
"""peer""",True,True,True,True
"""pêra""",True,False,False,False
"""päärynä""",True,True,False,False


### Regex specification

Polars uses the Rust regex syntax ([click here for the details](https://docs.rs/regex/latest/regex/#syntax)). It differs from Python's `re` module.
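One concrete difference, as far as I can tell from the Rust `regex` documentation: that crate supports neither backreferences nor look-around, while Python's `re` supports both. The sketch below uses only `re` to show a backreference pattern that works in Python; the same pattern is expected to be rejected when passed to a Polars `str` method. The word list is made up for illustration.

```python
import re

# A backreference (\1) repeats whatever the first group captured,
# so this pattern finds a doubled character such as "ee" or "pp".
doubled_letter = re.compile(r"(\w)\1")

words = ["pear", "peer", "apple"]
matches = [bool(doubled_letter.search(word)) for word in words]
print(matches)
```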

### Extract a pattern

In [7]:
df = pl.DataFrame(
    {
        "urls": [
            "http://vote.com/ballon_dor?candidate=messi&ref=polars",
            "http://vote.com/ballon_dor?candidat=jorginho&ref=polars",
            "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars",
        ]
    }
)

df.select(
    pl.col("urls").str.extract(r"candidate=(\w+)", group_index=1), # grouping in regex is awesome, I needed to know that years ago.
)

urls
str
"""messi"""
""
"""ronaldo"""


In [10]:
df = pl.DataFrame({"text": ["123 bla 45 asd", "xyz 678 910t"]})
df.select(
    pl.col("text").str.extract_all(r"(\d+)").alias("extracted_nrs"),
)

extracted_nrs
list[str]
"[""123"", ""45""]"
"[""678"", ""910""]"


### Replace a pattern

In [None]:
df = pl.DataFrame({"text": ["123abc", "abc456"]})
df.with_columns(
    pl.col("text").str.replace(r"\d", "-"),
    pl.col("text").str.replace_all(r"\d", "-").alias("text_replace_all"),
)

text,text_replace_all
str,str
"""-23abc""","""---abc"""
"""abc-56""","""abc---"""
