<div align="center"> <h1>Processing Strings in Python</h1>
    <h2><a href="...">Richard Leibrandt</a></h2>
</div>

In this section we will explore string processing in Python using the row names of the 'mtcars' DataFrame:

In [1]:
from plotnine.data import mtcars

names = list(mtcars.name)
mtcars.index = names
names

['Mazda RX4',
 'Mazda RX4 Wag',
 'Datsun 710',
 'Hornet 4 Drive',
 'Hornet Sportabout',
 'Valiant',
 'Duster 360',
 'Merc 240D',
 'Merc 230',
 'Merc 280',
 'Merc 280C',
 'Merc 450SE',
 'Merc 450SL',
 'Merc 450SLC',
 'Cadillac Fleetwood',
 'Lincoln Continental',
 'Chrysler Imperial',
 'Fiat 128',
 'Honda Civic',
 'Toyota Corolla',
 'Toyota Corona',
 'Dodge Challenger',
 'AMC Javelin',
 'Camaro Z28',
 'Pontiac Firebird',
 'Fiat X1-9',
 'Porsche 914-2',
 'Lotus Europa',
 'Ford Pantera L',
 'Ferrari Dino',
 'Maserati Bora',
 'Volvo 142E']

## String conversion: Uppercase, Lowercase, Capitalize

As an example, let's take the second element of the list and transform it:

In [2]:
example_string = names[1]
example_string

'Mazda RX4 Wag'

In [3]:
example_string.lower()   # lowercase

'mazda rx4 wag'

In [4]:
example_string.upper()   # uppercase

'MAZDA RX4 WAG'

In [5]:
lower = example_string.lower()   
lower.capitalize()  # capitalize first character

'Mazda rx4 wag'

In [6]:
example_string.swapcase()  # toggle case of each character

'mAZDA rx4 wAG'

If we need to apply one of these string methods to the whole list, the easiest way is using a list comprehension (showing only the first 5 elements):

In [7]:
[x.upper() for x in names][:5]

['MAZDA RX4',
 'MAZDA RX4 WAG',
 'DATSUN 710',
 'HORNET 4 DRIVE',
 'HORNET SPORTABOUT']
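Case conversion is also a common way to normalize strings before comparing them. The following is a minimal sketch of a case-insensitive search; `sample_names` is a hand-copied subset of the list above and `query` is a hypothetical search term, both introduced only for illustration:

```python
# Hand-copied subset of the car names, so the example runs on its own.
sample_names = ['Mazda RX4', 'Datsun 710', 'Merc 240D', 'Merc 230']

query = 'MERC'  # hypothetical search term with arbitrary casing

# Lowercase both sides so the comparison ignores case.
matches = [x for x in sample_names if query.lower() in x.lower()]
print(matches)  # ['Merc 240D', 'Merc 230']
```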

## Splitting a string

Strings can be split on any separator string. If no separator is supplied, the string is split on runs of whitespace. The result is a list of substrings:

In [8]:
example_string.split()

['Mazda', 'RX4', 'Wag']

In [9]:
example_string.split("a")

['M', 'zd', ' RX4 W', 'g']
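The difference between calling `split()` without an argument and with an explicit `" "` only shows up when a string contains repeated spaces. A small sketch, assuming a made-up string `spaced` that is not part of the dataset, which also shows the optional `maxsplit` argument:

```python
spaced = 'Mazda  RX4   Wag'  # hypothetical string with repeated spaces

# No argument: split on runs of whitespace, so no empty strings appear.
on_whitespace = spaced.split()
print(on_whitespace)   # ['Mazda', 'RX4', 'Wag']

# Explicit ' ': every single space is a separator, so empty strings appear.
on_space = spaced.split(' ')
print(on_space)        # ['Mazda', '', 'RX4', '', '', 'Wag']

# maxsplit limits the number of splits, counted from the left.
first_and_rest = 'Mazda RX4 Wag'.split(' ', 1)
print(first_and_rest)  # ['Mazda', 'RX4 Wag']
```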

The first element of the resulting list is:

In [10]:
example_string.split()[0]

'Mazda'

The first word of each element of our 'names' list (again showing only the first 5):

In [11]:
[x.split()[0] for x in names][:5]

['Mazda', 'Mazda', 'Datsun', 'Hornet', 'Hornet']
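Once the first word is extracted, it can be used for simple summaries. As one possible illustration, the sketch below counts how often each first word occurs using `collections.Counter` from the standard library; `sample_names` is a hand-copied subset of the list above, so the counts do not reflect the full dataset:

```python
from collections import Counter

# Hand-copied subset of the car names, so the example runs on its own.
sample_names = ['Mazda RX4', 'Mazda RX4 Wag', 'Datsun 710',
                'Merc 240D', 'Merc 230', 'Merc 280']

# Count the first word of every name.
brand_counts = Counter(x.split()[0] for x in sample_names)
print(brand_counts.most_common(2))  # [('Merc', 3), ('Mazda', 2)]
```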

## Identify strings in strings

In [12]:
'Mazda' in example_string

True

To get all elements with 'Merc' we can do:

In [13]:
list_merc = [x for x in names if "Merc" in x]

In [14]:
print(list_merc)

['Merc 240D', 'Merc 230', 'Merc 280', 'Merc 280C', 'Merc 450SE', 'Merc 450SL', 'Merc 450SLC']
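Note that `in` matches a substring anywhere in the string. When the position matters, `str.startswith` and `str.endswith` may be a better fit. A short sketch, again on a hand-copied subset of the names chosen for illustration:

```python
# Hand-copied subset of the car names, so the example runs on its own.
sample_names = ['Cadillac Fleetwood', 'Honda Civic', 'Toyota Corolla',
                'Camaro Z28', 'Merc 280C']

# 'in' finds the character anywhere in the string.
contains_c = [x for x in sample_names if 'C' in x]
print(len(contains_c))  # 5

# startswith / endswith only look at the beginning or the end.
starts_with_c = [x for x in sample_names if x.startswith('C')]
ends_with_c = [x for x in sample_names if x.endswith('C')]
print(starts_with_c)    # ['Cadillac Fleetwood', 'Camaro Z28']
print(ends_with_c)      # ['Merc 280C']
```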


One possible use of this technique is to filter the DataFrame for rows that have "Merc" in their index label:

In [15]:
mtcars.loc[list_merc, :]

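| | name | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Merc 240D | Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20.0 | 1 | 0 | 4 | 2 |
| Merc 230 | Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.15 | 22.9 | 1 | 0 | 4 | 2 |
| Merc 280 | Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.3 | 1 | 0 | 4 | 4 |
| Merc 280C | Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.9 | 1 | 0 | 4 | 4 |
| Merc 450SE | Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.07 | 17.4 | 0 | 0 | 3 | 3 |
| Merc 450SL | Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.73 | 17.6 | 0 | 0 | 3 | 3 |
| Merc 450SLC | Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.78 | 18.0 | 0 | 0 | 3 | 3 |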


## Concatenation

In [16]:
a = ['a','b','c']
b = ['A','B','C']
"".join(a)

'abc'

In [17]:
"|".join(b)

'A|B|C'
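`str.join` expects every element to be a string and raises a `TypeError` otherwise. A minimal sketch of one way to handle mixed types, using a hypothetical list `parts` that is not part of the dataset; it also shows that `join` can reverse a `split`:

```python
parts = ['Merc', 450, 'SLC']  # hypothetical list mixing str and int

# Convert every item to str first, because join only accepts strings.
joined = ' '.join(str(p) for p in parts)
print(joined)       # Merc 450 SLC

# Splitting on whitespace and joining with a single space restores the text.
round_trip = ' '.join('Mazda RX4 Wag'.split())
print(round_trip)   # Mazda RX4 Wag
```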

## Replace and remove

In [18]:
example_string.replace('RX4', 'replacement text')

'Mazda replacement text Wag'

To remove a substring, replace it with the empty string "".

In [19]:
example_string.replace('RX4 ', "")

'Mazda Wag'

If a count argument is given, only the first count occurrences are replaced.

In [20]:
another_string = "Hello Python, goodbye Python! Nice to meet you Python."
another_string.replace("Python", "Guido", 2)

'Hello Guido, goodbye Guido! Nice to meet you Python.'
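Because Python strings are immutable, `replace` always returns a new string and leaves the original untouched, which also means calls can be chained. A brief sketch using one name from the list above; the choice of replacements is only an illustration:

```python
original = 'Fiat X1-9'

# Each replace returns a new string, so the calls can be chained.
cleaned = original.replace('-', '').replace(' ', '_')

print(original)  # Fiat X1-9  (unchanged)
print(cleaned)   # Fiat_X19
```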

## Regular expressions

Regular expressions allow us to search for patterns instead of static strings. To use regular expressions in Python one must import the re module.

In [21]:
import re

Below we search for the substring "RX" followed by a single digit between 0 and 9.

In [22]:
re_result = re.search('RX[0-9]', example_string)

re_result.group()

'RX4'

In [23]:
example_string2 = "Mazda RX7 Wag"

re_result = re.search('RX[0-9]', example_string2) 

re_result.group()

'RX7'
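`re.search` returns `None` when the pattern is not found, so calling `.group()` directly on the result raises an `AttributeError` for non-matching strings. The sketch below shows one way to guard against that when applying a pattern to several names; `sample_names` is a hand-copied subset of the list above:

```python
import re

# Hand-copied subset of the car names, so the example runs on its own.
sample_names = ['Mazda RX4', 'Mazda RX4 Wag', 'Datsun 710', 'Merc 240D']

# Compile once when the same pattern is reused for many strings.
pattern = re.compile(r'RX[0-9]')

found = {}
for name in sample_names:
    match = pattern.search(name)
    if match is not None:          # skip names without a match
        found[name] = match.group()

print(found)  # {'Mazda RX4': 'RX4', 'Mazda RX4 Wag': 'RX4'}
```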