### Pandas string functions

For a pandas Series, the string functions are accessed through the str attribute. They have the general form Series.str.<function/property>, and most function names match the corresponding string methods in Python.

Note: The str attribute is not defined for the pandas DataFrame, only for Series. We must apply any string function column-wise.
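
As a minimal sketch of what column-wise application can look like, the example below uses a small hypothetical DataFrame `df` (its column names and values are invented for illustration). Selecting a single column gives a Series, which has the str attribute; `DataFrame.apply` can be used to run the same string function on each column in turn.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame(
    {
        "first": ["John", "Colin"],
        "last": ["Wood", "Welsh"],
    }
)

# A single column is a Series, so the str accessor is available
first_lower = df["first"].str.lower()

# To treat several columns, apply the string function to each column
df_upper = df.apply(lambda col: col.str.upper())

print(first_lower.tolist())
print(df_upper)
```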

Let’s create an example series and look at a few string functions:

In [1]:
import pandas as pd
import numpy as np

s = pd.Series(
    [
        "0",
        "John Wood",
        "Colin Welsh",
        "my list",
        "02456",
        np.nan,
        "HELLO WORLD",
        "water%",
    ]
)

Here, we defined a Series of different string objects, plus one missing value (np.nan). If you recall from the first course, the Python method str.lower() takes a string and converts it to lowercase. Similarly, the pandas function Series.str.lower() converts every string in a Series to lowercase. Let’s give it a try:

In [2]:
s.str.lower()

0              0
1      john wood
2    colin welsh
3        my list
4          02456
5            NaN
6    hello world
7         water%
dtype: object

In [3]:
# The function str.upper() is the opposite of this:
s.str.upper()


0              0
1      JOHN WOOD
2    COLIN WELSH
3        MY LIST
4          02456
5            NaN
6    HELLO WORLD
7         WATER%
dtype: object

In [4]:
# We can get the length of each string in the Series:
s.str.len()

0     1.0
1     9.0
2    11.0
3     7.0
4     5.0
5     NaN
6    11.0
7     6.0
dtype: float64

For data cleaning and manipulations, we will be especially interested in splitting, stripping and replacing strings. Let’s give these a try:



In [5]:
s.str.split(" ")

0               [0]
1      [John, Wood]
2    [Colin, Welsh]
3        [my, list]
4           [02456]
5               NaN
6    [HELLO, WORLD]
7          [water%]
dtype: object

A nice feature of the str.split() function is that we can choose to have the results returned to us in a DataFrame instead of a Series of lists. We do this by including the expand=True parameter as follows:



In [6]:
substrings = s.str.split(" ", expand=True)
substrings

        0      1
0       0   None
1    John   Wood
2   Colin  Welsh
3      my   list
4   02456   None
5     NaN    NaN
6   HELLO  WORLD
7  water%   None


Note that the number of columns is determined by the maximum size of the lists. In our case, we had lists of size one and two, so the DataFrame has two columns. For strings that were not split, pandas fills the second column with None. We can now easily access a substring by indexing the DataFrame. For example:

In [7]:
substrings[1]

0     None
1     Wood
2    Welsh
3     list
4     None
5      NaN
6    WORLD
7     None
Name: 1, dtype: object
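
If only one piece of each split is needed, the DataFrame can be skipped. The sketch below, on a small hypothetical Series `names`, chains a second `.str` call to pick an element out of each list, and uses the `n` parameter of `str.split()` to limit the number of splits.

```python
import pandas as pd

# Hypothetical example values
names = pd.Series(["John Wood", "Colin Welsh", "02456"])

# Split into lists, then pick element 1 of each list with a second .str call.
# Lists that are too short yield a missing value instead of raising an error.
second = names.str.split(" ").str.get(1)

# n=1 allows at most one split, so at most two pieces are produced
two_parts = pd.Series(["a b c"]).str.split(" ", n=1)

print(second)
print(two_parts[0])
```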

Let’s now look at replacing a substring. The general syntax of the function is the following

In [8]:
s.str.replace('strA','strB')

0              0
1      John Wood
2    Colin Welsh
3        my list
4          02456
5            NaN
6    HELLO WORLD
7         water%
dtype: object

where 'strA' is the substring we want to replace and 'strB' is what we want to replace it by. Let’s give this a try

In [10]:
s.str.replace("%", " percent ")

0                 0
1         John Wood
2       Colin Welsh
3           my list
4             02456
5               NaN
6       HELLO WORLD
7    water percent 
dtype: object

If instead we just want to remove a specific substring or character we can use the function str.replace() and choose to replace it with the empty string. For example



In [11]:
s.str.replace("%", "")

0              0
1      John Wood
2    Colin Welsh
3        my list
4          02456
5            NaN
6    HELLO WORLD
7          water
dtype: object
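
Stripping, the third operation mentioned earlier, works in the same way. Here is a small sketch on a hypothetical Series `padded`: `str.strip()` removes leading and trailing whitespace by default, or any of the characters passed as an argument, and `str.lstrip()` / `str.rstrip()` act on one side only.

```python
import pandas as pd

# Hypothetical example values, including a missing entry
padded = pd.Series(["  water percent ", "__id__", None])

trimmed = padded.str.strip()            # whitespace removed on both ends
no_underscores = padded.str.strip("_")  # listed characters removed on both ends
left_only = padded.str.lstrip()         # only the left side is stripped

print(trimmed.tolist())
print(no_underscores.tolist())
print(left_only.tolist())
```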

Another useful function for us will be to index a particular slice of each string. For example suppose we want to get the first two characters of every string. We can do this by using the index directly



In [12]:
s.str[0:2]

0      0
1     Jo
2     Co
3     my
4     02
5    NaN
6     HE
7     wa
dtype: object

In [13]:
# Or we use the str.slice() function

s.str.slice(0, 2)

0      0
1     Jo
2     Co
3     my
4     02
5    NaN
6     HE
7     wa
dtype: object

We can even combine the action of slicing and replacing using the str.slice_replace() function. Here we must mention first the slice of the string that we want to be replaced and then what we want it replaced by. The general syntax looks like this

str.slice_replace(i,j,'str')

This command takes the substring at positions i to j-1 and replaces it with the string 'str'. Let’s give it a try

In [15]:
s.str.slice_replace(0, 2, "___")

0             ___
1      ___hn Wood
2    ___lin Welsh
3        ___ list
4          ___456
5             NaN
6    ___LLO WORLD
7         ___ter%
dtype: object

A common operation when working with text data is to test whether strings contain a certain substring or pattern of characters. For instance, if we were only interested in posts about Andrew Wiggins, we’d need to match all posts that mention him and avoid matching posts that don’t. The str.contains() function returns a Series of True/False values that indicate whether each string contains the given keyword. We can then use this Boolean Series to index our original Series and obtain the entries that correspond to the True values.

Here is an example

In [16]:
flag = s.str.contains("0")
flag

0     True
1    False
2    False
3    False
4     True
5      NaN
6    False
7    False
dtype: object

Note that the NaN entry returned NaN. If we wanted to make sure that we get back a Series of only True and False values we could use the parameter na=False which replaces NaN with a False:

In [17]:
flag = s.str.contains("0", na=False)
flag

0     True
1    False
2    False
3    False
4     True
5    False
6    False
7    False
dtype: bool

Let’s now get back the entries which contain the character '0':

In [18]:
s[flag]

0        0
4    02456
dtype: object
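
One detail worth keeping in mind is that str.contains() interprets its pattern as a regular expression by default. The following sketch, on a hypothetical Series `versions`, shows how this changes the result for a character such as "." and how `regex=False` and `case=False` can be used to control the matching.

```python
import pandas as pd

# Hypothetical example values
versions = pd.Series(["v1.0", "v10", None])

# Default behaviour: "." is a regex that matches any character
as_regex = versions.str.contains(".", na=False)

# regex=False: "." is matched as a literal dot
as_literal = versions.str.contains(".", regex=False, na=False)

# case=False: ignore upper/lower case when matching
any_case = versions.str.contains("V1", case=False, na=False)

print(as_regex.tolist())
print(as_literal.tolist())
print(versions[as_literal].tolist())
```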

### Example: Cleaning up the movies dataset

Now that we have an overview of the string functions available in pandas, it is time to put them to use on a real dataset. We will use a dataset that you have seen before: the Kaggle TMDB 5000 Movie Dataset.

In [19]:
# Import libraries
import pandas as pd
import numpy as np

# Load the data
movies = pd.read_csv("c2_tmdb_5000_movies.csv")

# To apply pandas string functions, we need text columns:
# show the first 5 rows of the first 3 text-based columns

movies.select_dtypes("object").iloc[:5,:3]

   genres                                             homepage                                      keywords
0  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.avatarmovie.com/                   [{"id": 1463, "name": "culture clash"}, {"id":...
1  [{"id": 12, "name": "Adventure"}, {"id": 14, "...  http://disney.go.com/disneypictures/pirates/  [{"id": 270, "name": "ocean"}, {"id": 726, "na...
2  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.sonypictures.com/movies/spectre/   [{"id": 470, "name": "spy"}, {"id": 818, "name...
3  [{"id": 28, "name": "Action"}, {"id": 80, "nam...  http://www.thedarkknightrises.com/            [{"id": 849, "name": "dc comics"}, {"id": 853,...
4  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://movies.disney.com/john-carter          [{"id": 818, "name": "based on novel"}, {"id":...


In [20]:
#Column Genres
genres = movies["genres"]

# Let’s look at a full entry.
genres[0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

We would like to replace this entry with just the names of the genres, separated by commas:

'Action, Adventure, Fantasy, Science Fiction'

How can we go about this? Since each entry is a JSON string, we could use the json module.

In [21]:
import json

json_obj = json.loads(genres[0])  # Load json string
names = [x["name"] for x in json_obj]  # Keep only the genre names
names

['Action', 'Adventure', 'Fantasy', 'Science Fiction']

We can join lists of words into a string using the join function.

In [22]:
", ".join(names)

'Action, Adventure, Fantasy, Science Fiction'

We can easily apply this to the entire Series genres by wrapping everything into a lambda function:

In [23]:
genres.map(lambda s: ', '.join(x['name'] for x in json.loads(s)))

0       Action, Adventure, Fantasy, Science Fiction
1                        Adventure, Fantasy, Action
2                          Action, Adventure, Crime
3                    Action, Crime, Drama, Thriller
4                Action, Adventure, Science Fiction
                           ...                     
4798                        Action, Crime, Thriller
4799                                Comedy, Romance
4800               Comedy, Drama, Romance, TV Movie
4801                                               
4802                                    Documentary
Name: genres, Length: 4803, dtype: object
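
The lambda above assumes that every entry is a valid JSON string. As a hedged sketch, the same logic can be written as a named helper that passes missing values through unchanged. The function name `genre_names` and the three-row Series `sample` are hypothetical and only stand in for the real column.

```python
import json

import pandas as pd


def genre_names(raw):
    """Return the genre names in a JSON string as one comma-separated string."""
    if pd.isna(raw):
        return raw  # leave missing values untouched
    return ", ".join(item["name"] for item in json.loads(raw))


# Hypothetical stand-in for the genres column
sample = pd.Series(
    [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        "[]",
        None,
    ]
)

cleaned = sample.map(genre_names)
print(cleaned.tolist())
```

An empty JSON list produces an empty string, which is presumably why some rows in the output above appear blank.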

However, let’s see how we can use the text commands from the last unit to extract the genres manually. Let’s start by stripping the square brackets from the strings.

In [24]:
def transform(s):
    s = s.str.strip("[]")
    return s

We put this inside a function because we will keep adding some other functions inside and then we can make a single call to execute them all. For now, let’s see where this version of the function gets us to.

In [25]:
genres = transform(genres)
genres[0]

'{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}'

So this successfully removed the square brackets. Now we would like to get rid of the other additional characters. We can do this by calling several string replacement functions, one for each sequence of characters that we would like to remove. Let’s give this a try.

In [26]:
def transform(s):
    s = s.str.strip("[]")
    # regex=False: each pattern is treated as literal text, not as a regular expression
    s = s.str.replace("{", "", regex=False)
    s = s.str.replace("}", "", regex=False)
    s = s.str.replace(",", "", regex=False)
    s = s.str.replace('"id":', "", regex=False)
    s = s.str.replace('"name":', "", regex=False)
    s = s.str.replace('"', "", regex=False)
    return s

In [27]:
genres = transform(genres)
genres[0]

' 28  Action  12  Adventure  14  Fantasy  878  Science Fiction'

This is now definitely closer to what we wanted. The last challenge is to get rid of the numbers inside the string. But how can we do this? One option would be to use the replace method to remove each digit separately. This is quite tedious but it gets the job done. Let’s give it a try by adding the following lines inside the definition of our function transform().

In [28]:
def transform(s):
    s = s.str.strip("[]")
    # regex=False: each pattern is treated as literal text, not as a regular expression
    s = s.str.replace("{", "", regex=False)
    s = s.str.replace("}", "", regex=False)
    s = s.str.replace(",", "", regex=False)
    s = s.str.replace('"id":', "", regex=False)
    s = s.str.replace('"name":', "", regex=False)
    s = s.str.replace('"', "", regex=False)
    # Remove every digit, one at a time
    s = s.str.replace("0", "", regex=False)
    s = s.str.replace("1", "", regex=False)
    s = s.str.replace("2", "", regex=False)
    s = s.str.replace("3", "", regex=False)
    s = s.str.replace("4", "", regex=False)
    s = s.str.replace("5", "", regex=False)
    s = s.str.replace("6", "", regex=False)
    s = s.str.replace("7", "", regex=False)
    s = s.str.replace("8", "", regex=False)
    s = s.str.replace("9", "", regex=False)
    return s

In [29]:
genres = transform(genres)
genres[0]

'   Action    Adventure    Fantasy    Science Fiction'

In the following unit, we will learn about regular expressions which enable string matching with less code. For example, instead of removing every digit separately, we can use regular expressions to match all numbers with a single regular expression pattern. Below, we show two alternative approaches to remove digits using regular expressions.

1st alternative
s = s.str.replace('[0-9]+', '', regex=True)

2nd alternative
s = s.str.replace(r'\d+', '', regex=True)

Don’t worry if you don’t understand how regular expressions work, as this is the topic of the next unit!
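
For readers who want to try the two patterns right away, here is a minimal sketch on a hypothetical two-row Series `sample` that mimics the intermediate strings above. For plain ASCII digits such as these, both patterns give the same result.

```python
import pandas as pd

# Hypothetical strings in the same shape as the intermediate result above
sample = pd.Series([" 28  Action  12  Adventure", " 878  Science Fiction"])

# Character class: one or more characters between 0 and 9
by_class = sample.str.replace("[0-9]+", "", regex=True)

# Shorthand: \d stands for a digit; the raw string keeps the backslash intact
by_shorthand = sample.str.replace(r"\d+", "", regex=True)

print(by_class.tolist())
print(by_class.equals(by_shorthand))
```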

Almost there! We would like to remove some of the additional white spaces and also make sure to include a comma to separate the entries. There are 3 white spaces in front of the first entry and 4 white spaces separating each of the remaining entries. So what we could do is first replace all blocks of 4 white spaces with a ', ' and then remove the remaining three white spaces at the front. Let’s add the following two cleaning steps in the transform() function.

In [30]:
def transform(s):
    s = s.str.strip("[]")
    # regex=False: each pattern is treated as literal text, not as a regular expression
    s = s.str.replace("{", "", regex=False)
    s = s.str.replace("}", "", regex=False)
    s = s.str.replace(",", "", regex=False)
    s = s.str.replace('"id":', "", regex=False)
    s = s.str.replace('"name":', "", regex=False)
    s = s.str.replace('"', "", regex=False)
    # Remove every digit, one at a time
    s = s.str.replace("0", "", regex=False)
    s = s.str.replace("1", "", regex=False)
    s = s.str.replace("2", "", regex=False)
    s = s.str.replace("3", "", regex=False)
    s = s.str.replace("4", "", regex=False)
    s = s.str.replace("5", "", regex=False)
    s = s.str.replace("6", "", regex=False)
    s = s.str.replace("7", "", regex=False)
    s = s.str.replace("8", "", regex=False)
    s = s.str.replace("9", "", regex=False)
    # Blocks of 4 spaces become separators, then the 3 leading spaces are removed
    s = s.str.replace("    ", ", ", regex=False)
    s = s.str.replace("   ", "", regex=False)
    return s

**Note that the order in which we are adding the functions inside our routine is important since the transformations on the strings are applied in sequential order.**
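
To see why the order matters, consider this small sketch on a hypothetical one-row Series `spaced`. It applies the two whitespace steps once in the intended order and once in reverse.

```python
import pandas as pd

# Hypothetical string: 3 leading spaces, 4 spaces between entries
spaced = pd.Series(["   Action    Adventure    Fantasy"])

# Intended order: 4-space blocks become separators, then the 3 leading spaces go
good = spaced.str.replace("    ", ", ", regex=False).str.replace("   ", "", regex=False)

# Reversed order: the 3-space pattern also eats into the 4-space blocks,
# so no block of 4 spaces is left to turn into a separator
bad = spaced.str.replace("   ", "", regex=False).str.replace("    ", ", ", regex=False)

print(good[0])
print(bad[0])
```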

Here is now our final result.

In [31]:
genres = transform(genres)
genres[0]

'Action, Adventure, Fantasy, Science Fiction'

Exactly what we wanted! To have these changes reflected in the original DataFrame we can use:

In [32]:
movies["genres"] = genres
#Taking a look into the new genres column
movies.loc[:, ["title", "genres"]].head(10)

   title                                     genres
0  Avatar                                    Action, Adventure, Fantasy, Science Fiction
1  Pirates of the Caribbean: At World's End  Adventure, Fantasy, Action
2  Spectre                                   Action, Adventure, Crime
3  The Dark Knight Rises                     Action, Crime, Drama, Thriller
4  John Carter                               Action, Adventure, Science Fiction
5  Spider-Man 3                              Fantasy, Action, Adventure
6  Tangled                                   Animation, Family
7  Avengers: Age of Ultron                   Action, Adventure, Science Fiction
8  Harry Potter and the Half-Blood Prince    Adventure, Fantasy, Family
9  Batman v Superman: Dawn of Justice        Action, Adventure, Fantasy
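
The cleaning steps can also be written more compactly. The following is only a sketch under stated assumptions, not a reference solution: the name `transform_compact` and the one-row Series `raw` are hypothetical, and the digit removal uses a single `\d+` regular expression in place of ten separate calls.

```python
import pandas as pd


def transform_compact(s):
    """Apply the same cleaning steps with a loop and one regular expression."""
    s = s.str.strip("[]")
    # Literal fragments to delete, in the same order as in the long version
    for fragment in ["{", "}", ",", '"id":', '"name":', '"']:
        s = s.str.replace(fragment, "", regex=False)
    s = s.str.replace(r"\d+", "", regex=True)     # all runs of digits at once
    s = s.str.replace("    ", ", ", regex=False)  # 4 spaces become a separator
    s = s.str.replace("   ", "", regex=False)     # drop the 3 leading spaces
    return s


# Hypothetical single entry in the same format as the genres column
raw = pd.Series(
    ['[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]']
)

result = transform_compact(raw)
print(result[0])
```

Like the step-by-step version, this relies on the exact spacing left behind by the earlier replacements and would damage a genre name that itself contained a digit or a comma, so the json-based approach shown earlier remains the more robust choice.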
