# Chapter 11: String Manipulation

## 11.1 Strings and Objects

- Pandas 1.0 introduced the ``'string'`` type. In addition to being more explicit than ``'object'``, it represents missing values with ``pd.NA`` rather than ``NaN``
- String methods called on a ``'string'`` series return nullable types
- If the result of a string method contains missing values, pandas uses its native nullable types (such as ``'Int64'`` or ``'boolean'``) instead of falling back to ``float64`` or ``object``
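As a minimal sketch of the difference, the snippet below compares ``.str.len()`` on an object series and on a ``'string'`` series. The three-value series is made up for illustration and is not part of the vehicles dataset.

```python
import pandas as pd

# Hypothetical sample data with one missing value
names = pd.Series(['Dodge', None, 'Subaru'], dtype=object)
as_string = names.astype('string')

# On object dtype, the missing value forces a float64 result with NaN
obj_len = names.str.len()

# On 'string' dtype, the result is the nullable Int64 type with <NA>
str_len = as_string.str.len()

print(obj_len.dtype, str_len.dtype)
```

The missing entry in ``as_string`` is ``pd.NA``, and the integer lengths stay integers instead of being upcast to floats.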

In [1]:
import pandas as pd
import numpy as np

url = "http://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip"
df = pd.read_csv(url)
city_mpg = df.city08
highway_mpg = df.highway08
make = df.make


In [2]:
# convert to string type
make.astype('string')

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: string

## 11.2 Categorical Strings

- For low-cardinality string columns, consider using the categorical type
- The advantages are memory savings and performance improvements, because operations only need to run on the individual categories and not on each value in the series
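To get a rough feel for the memory savings, one option is to compare ``.memory_usage(deep=True)`` before and after the conversion. The series below is a small made-up stand-in for a low-cardinality column; actual savings depend on the data.

```python
import pandas as pd

# Hypothetical low-cardinality column: 3 distinct values, 3000 rows
makes = pd.Series(['Dodge', 'Subaru', 'Ferrari'] * 1000, dtype=object)

# deep=True counts the Python string objects, not just the pointers
object_bytes = makes.memory_usage(deep=True)
category_bytes = makes.astype('category').memory_usage(deep=True)

print(object_bytes, category_bytes)
```

The categorical version stores each distinct string once plus a small integer code per row, which is why it is smaller here.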

In [3]:
make.astype('category')

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: category
Categories (136, object): ['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo', ..., 'Volvo', 'Wallace Environmental', 'Yugo', 'smart']

## 11.3 The .str Accessor

The ``'object'``, ``'string'``, and ``'category'`` types have a ``.str`` accessor that provides string manipulation methods.
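As a quick illustration that the accessor is also available on categorical data, here is a sketch on a tiny made-up series (not the vehicles data). It assumes the categories themselves are strings.

```python
import pandas as pd

# Hypothetical categorical series whose categories are strings
makes = pd.Series(['Dodge', 'Subaru', 'Dodge'], dtype='category')

# .str methods work on the categorical series directly
upper = makes.str.upper()

print(upper.tolist())
```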

In [4]:
make.str.lower()

0        alfa romeo
1           ferrari
2             dodge
3             dodge
4            subaru
            ...    
41139        subaru
41140        subaru
41141        subaru
41142        subaru
41143        subaru
Name: make, Length: 41144, dtype: object

In [5]:
make.str.find("e")

0        8
1        1
2        4
3        4
4       -1
        ..
41139   -1
41140   -1
41141   -1
41142   -1
41143   -1
Name: make, Length: 41144, dtype: int64

In [6]:
# the match is case-sensitive, so 'Ferrari' (row 1) is False
make.str.startswith("f")

0        False
1        False
2        False
3        False
4        False
         ...  
41139    False
41140    False
41141    False
41142    False
41143    False
Name: make, Length: 41144, dtype: bool

In [7]:
make.str.extract(r'([a-e])', expand=False)

0        a
1        e
2        d
3        d
4        b
        ..
41139    b
41140    b
41141    b
41142    b
41143    b
Name: make, Length: 41144, dtype: object

## 11.4 Searching

- Find the first non-alphabetic character (disregarding spaces) in each value and count how often each one occurs

In [9]:
(make
.str.extract(r'([^a-z A-Z])', expand=False)
.value_counts())

-    1727
.      46
,       9
Name: make, dtype: int64
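Note that ``.str.extract`` returns only the first match per value. If every match is needed, ``.str.findall`` is one option. The sketch below uses a few made-up values rather than the full ``make`` column.

```python
import pandas as pd

# Hypothetical sample values
makes = pd.Series(['Mercedes-Benz', 'S.C.C.', 'Dodge'], dtype=object)

# First non-alphabetic, non-space character per value (NaN if none)
first = makes.str.extract(r'([^a-z A-Z])', expand=False)

# Every non-alphabetic, non-space character per value, as a list
every = makes.str.findall(r'[^a-z A-Z]')

print(first.tolist())
print(every.tolist())
```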

## 11.5 Splitting

- Pull out the value before the dash and convert it to a number using the ``.str.split`` method

In [10]:
age = pd.Series(['0-10', '11-15', '11-15', '61-65', '46-50'])
age

0     0-10
1    11-15
2    11-15
3    61-65
4    46-50
dtype: object

In [12]:
age.str.split('-')

0     [0, 10]
1    [11, 15]
2    [11, 15]
3    [61, 65]
4    [46, 50]
dtype: object

- Provide the ``expand=True`` parameter to retrieve a dataframe

In [14]:
(age
.str.split('-', expand=True)
.iloc[:,0]
.astype(int))

0     0
1    11
2    11
3    61
4    46
Name: 0, dtype: int32

- Take the average of the lower and upper bound of each bin
- ``axis='columns'`` computes the mean across the columns, giving one value per row

In [18]:
(age
.str.split('-', expand=True)
.astype(int)
.mean(axis='columns'))

0     5.0
1    13.0
2    13.0
3    63.0
4    48.0
dtype: float64
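An alternative sketch of the same calculation uses ``.str.extract`` with named groups, so the resulting columns get descriptive labels instead of ``0`` and ``1``. The group names ``low`` and ``high`` are chosen here for illustration only.

```python
import pandas as pd

age = pd.Series(['0-10', '11-15', '11-15', '61-65', '46-50'])

# Named groups become the column names of the resulting dataframe
bounds = (age
    .str.extract(r'(?P<low>\d+)-(?P<high>\d+)')
    .astype('int64'))

# Mean across the columns gives the midpoint of each bin
midpoint = bounds.mean(axis='columns')

print(midpoint.tolist())
```

Casting to ``'int64'`` explicitly keeps the integer width the same across platforms.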

In [16]:
age

0     0-10
1    11-15
2    11-15
3    61-65
4    46-50
dtype: object

## 11.7 Replacing Text

- To replace a substring (such as a single character) within each value, use ``.str.replace``
- If you have complete replacements for many of the values, use ``.replace``. It tries to match and replace the whole string
- We can use a dictionary to specify complete replacements
- With the ``regex=True`` parameter, ``.replace`` can also use a regular expression to replace just a portion of each string
- In short, use ``.str.replace`` to replace substrings and ``.replace`` to map complete strings

In [21]:
make.str.replace('A', 'Ẳ')

0        Ẳlfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object

In [22]:
# .replace matches the whole string, so nothing changes here
make.replace('A', 'Ẳ')

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object

In [24]:
# use dictionary
make.replace({'Audi': 'Ẳudi',
              'Acura': 'Ẳcura',
              'Aston Martin': 'Ẳston Martin',
              'Alfa Romeo': 'Ẳlfa Romeo'})

0        Ẳlfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object

In [25]:
# use regular expression 
make.replace('A', 'Ẳ', regex=True)

0        Ẳlfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object
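The three replacement styles from this section can be compared side by side on a small made-up series. One detail worth stating explicitly: ``.str.replace`` takes its own ``regex`` parameter, and passing it explicitly avoids surprises with characters such as ``.`` that have a special meaning in regular expressions. The sample values and replacement strings below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample values
makes = pd.Series(['S.C.C.', 'Alfa Romeo'], dtype=object)

# Literal substring replacement: '.' is treated as a plain character
literal = makes.str.replace('.', '', regex=False)

# Regular-expression substring replacement: lowercase vowels become '*'
pattern = makes.str.replace(r'[aeiou]', '*', regex=True)

# Whole-value mapping with a dictionary; unmatched values are kept as-is
whole = makes.replace({'Alfa Romeo': 'Alfa'})

print(literal.tolist())
print(pattern.tolist())
print(whole.tolist())
```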