# Working with Text Data

## Objectives

- Understand string operations within Pandas Series for data preprocessing.
- Learn case conversion, trimming, splitting, and pattern-searching methods in text data.
- Explore string manipulation techniques like concatenation, repetition, and replacement.
- Master boolean string methods for data querying and analysis.

## Background

This notebook shows how to manipulate, query, and preprocess text data in a Pandas Series using the string methods exposed through the `.str` accessor.

## Datasets Used

The notebook uses synthetic datasets.

## Text Operations in Pandas Series

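The examples below start from a small Series that mixes strings, a missing value (`np.nan`), and an integer, so the effect of each method on non-string entries is visible.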

In [1]:
import pandas as pd
import numpy as np

In [2]:
s = pd.Series(['Tommy   ', 'William Scott', 'John\n', 
               'ALBERT@', np.nan, '5678','PeterSmith', 34])
s

0         Tommy   
1    William Scott
2           John\n
3          ALBERT@
4              NaN
5             5678
6       PeterSmith
7               34
dtype: object

`lower()`: Converts strings in the Series/Index to lower case.

In [3]:
s.str.lower()

0         tommy   
1    william scott
2           john\n
3          albert@
4              NaN
5             5678
6       petersmith
7              NaN
dtype: object
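In the output above, the integer `34` becomes `NaN` because the `.str` methods only operate on string values. If such entries should be treated as text, one option is to convert them first. The following is a minimal sketch, assuming a small hypothetical Series named `mixed`; it is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mixing a string, a missing value, and an integer.
mixed = pd.Series(['Tommy   ', np.nan, 34])

# Convert every non-missing value to str while leaving NaN untouched:
# where() keeps the original value where the condition is True (missing)
# and takes the string version everywhere else.
as_text = mixed.where(mixed.isna(), mixed.astype(str))

lowered = as_text.str.lower()
print(lowered)
```

With this conversion the integer survives as the string `'34'`, while the missing value stays missing.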

`upper()`: Converts strings in the Series/Index to upper case.

In [4]:
s.str.upper()

0         TOMMY   
1    WILLIAM SCOTT
2           JOHN\n
3          ALBERT@
4              NaN
5             5678
6       PETERSMITH
7              NaN
dtype: object

`swapcase()`: Swaps the case of each character (lower case becomes upper case and vice versa).

In [5]:
s.str.swapcase()

0         tOMMY   
1    wILLIAM sCOTT
2           jOHN\n
3          albert@
4              NaN
5             5678
6       pETERsMITH
7              NaN
dtype: object

`islower()`: Checks whether all cased characters in each string in the Series/Index are lower case. Returns Boolean.

In [6]:
s.str.islower()

0    False
1    False
2    False
3    False
4      NaN
5    False
6    False
7      NaN
dtype: object

In [7]:
s.str.lower().str.islower()

0     True
1     True
2     True
3     True
4      NaN
5    False
6     True
7      NaN
dtype: object

In [8]:
s

0         Tommy   
1    William Scott
2           John\n
3          ALBERT@
4              NaN
5             5678
6       PeterSmith
7               34
dtype: object

`isupper()`: Checks whether all cased characters in each string in the Series/Index are upper case. Returns Boolean.

In [9]:
s.str.isupper()

0    False
1    False
2    False
3     True
4      NaN
5    False
6    False
7      NaN
dtype: object

`isnumeric()`: checks whether all characters in each string in the Series/Index are numeric. Returns Boolean.

In [10]:
s.str.isnumeric()

0    False
1    False
2    False
3    False
4      NaN
5     True
6    False
7      NaN
dtype: object
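The Boolean methods above return `NaN` for missing or non-string entries, so their result cannot always be used directly as a filter. A minimal sketch of one way to turn such a result into a clean Boolean mask is shown below; the sample Series is a shortened, hypothetical version of the one used in this notebook.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tommy', 'ALBERT@', np.nan, '5678', 34])

# True/False for strings, NaN for the missing value and the integer.
flags = s.str.isnumeric()

# Comparing with True maps NaN to False, which yields a pure bool mask.
mask = flags.eq(True)

# Keep only the elements that are numeric strings.
numeric_strings = s[mask]
print(numeric_strings)
```

The same pattern should work with `islower()` and `isupper()`.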

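`len()`: Computes the length of each string. The result is `float64` here because the entries without a length are represented as `NaN`.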

In [11]:
s.str.len()

0     8.0
1    13.0
2     5.0
3     7.0
4     NaN
5     4.0
6    10.0
7     NaN
dtype: float64

`strip()`: Strips whitespace (including newlines) from both ends of each string in the Series/Index.

Notice that 'John\n' becomes 'John'.

In [12]:
s.str.strip()

0            Tommy
1    William Scott
2             John
3          ALBERT@
4              NaN
5             5678
6       PeterSmith
7              NaN
dtype: object
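To make the effect of stripping easier to see, the sketch below compares string lengths before and after `strip()`, and also shows the one-sided variants `lstrip()` and `rstrip()`. The sample values are illustrative assumptions.

```python
import pandas as pd

s = pd.Series(['Tommy   ', 'John\n', '  ALBERT@'])

# Lengths including the surrounding whitespace.
before = s.str.len()

# Lengths after removing whitespace from both ends.
after = s.str.strip().str.len()

# One-sided variants: remove whitespace only on the left or only on the right.
left = s.str.lstrip()
right = s.str.rstrip()

print(pd.DataFrame({'before': before, 'after': after}))
```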

`split()`: Splits each string on the given pattern. The result is a list for each row.

In [13]:
s.str.split(' ')

0       [Tommy, , , ]
1    [William, Scott]
2            [John\n]
3           [ALBERT@]
4                 NaN
5              [5678]
6        [PeterSmith]
7                 NaN
dtype: object
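Splitting on a single space leaves empty strings for repeated spaces and keeps the trailing newline, as the output above shows. As a hedged alternative, calling `split()` with no pattern splits on any run of whitespace, and `expand=True` spreads the pieces across DataFrame columns. The sketch below uses a shortened sample Series.

```python
import pandas as pd

s = pd.Series(['Tommy   ', 'William Scott', 'John\n'])

# No pattern: split on runs of whitespace, dropping empty pieces.
tokens = s.str.split()

# expand=True returns a DataFrame with one column per piece;
# rows with fewer pieces are padded with missing values.
table = s.str.split(expand=True)

print(tokens)
print(table)
```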

`cat(sep='')`: Concatenates the Series/Index elements into a single string, using the given separator.

In [14]:
s = pd.Series(['Tom ',' John','Will Smith','123'])
s.str.cat(sep='_')

'Tom _ John_Will Smith_123'
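`cat()` can also join two Series element by element when a second Series is passed as the first argument. The following is a small sketch; the names in `first` and `last` are made-up sample data.

```python
import pandas as pd

# Hypothetical first and last names, aligned by index.
first = pd.Series(['Tom', 'John', 'Will'])
last = pd.Series(['Hanks', 'Doe', 'Smith'])

# Element-wise concatenation: one output string per row.
full = first.str.cat(last, sep=' ')
print(full)
```

In this form the result is a Series of the same length as the inputs rather than a single string.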

`contains(pattern)`: Returns `True` for each element that contains the pattern, and `False` otherwise.

In [15]:
s.str.contains(' ')

0     True
1     True
2     True
3    False
dtype: bool
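By default `contains()` treats the pattern as a regular expression, which matters for characters such as `.`. The sketch below, using an illustrative Series, contrasts regex and literal matching and uses the `na` and `case` parameters.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a.b', 'acb', np.nan])

# As a regex, '.' matches any character; na=False fills missing results.
as_regex = s.str.contains('.', na=False)

# As a literal, '.' only matches an actual dot.
as_literal = s.str.contains('.', regex=False, na=False)

# case=False makes the match case-insensitive.
ignore_case = s.str.contains('A', case=False, na=False)

print(as_regex.tolist(), as_literal.tolist(), ignore_case.tolist())
```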

`replace(a,b)`: Replaces each occurrence of `a` with `b`.

In [16]:
s.str.replace(' ','_')

0          Tom_
1         _John
2    Will_Smith
3           123
dtype: object
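`replace()` accepts a `regex` argument, and its default has differed between pandas versions, so stating it explicitly is a reasonable precaution. A minimal sketch with an illustrative Series:

```python
import pandas as pd

s = pd.Series(['Tom ', ' John', 'Will   Smith'])

# Literal replacement: every single space becomes an underscore.
literal = s.str.replace(' ', '_', regex=False)

# Regex replacement: each run of whitespace collapses to one underscore.
collapsed = s.str.replace(r'\s+', '_', regex=True)

print(literal.tolist())
print(collapsed.tolist())
```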

`repeat(value)`: Repeats each element the specified number of times.

In [17]:
s.str.repeat(2)

0                Tom Tom 
1               John John
2    Will SmithWill Smith
3                  123123
dtype: object

In [18]:
s.str.repeat(5)

0                                 Tom Tom Tom Tom Tom 
1                             John John John John John
2    Will SmithWill SmithWill SmithWill SmithWill S...
3                                      123123123123123
dtype: object

Observe that the length of the Series is unchanged.

In [19]:
print(len(s))
print(len(s.str.repeat(5)))

4
4


What changes is the length of each element.

In [20]:
print(len(s[0]))
print(len(s.str.repeat(5)[0]))

4
20
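`repeat()` also accepts a sequence with one repeat count per element. The sketch below uses a small illustrative Series to show that, again, only the element lengths change.

```python
import pandas as pd

s = pd.Series(['ab', 'cd', 'ef'])

# One repeat count per element, matched by position.
varied = s.str.repeat([1, 2, 3])

print(varied.tolist())
print(len(s), len(varied))          # the Series length is unchanged
print(varied.str.len().tolist())    # the element lengths differ
```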


`count(pattern)`: Returns the number of times the pattern appears in each element.

In [21]:
s.str.count('o')

0    1
1    1
2    0
3    0
dtype: int64
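The pattern passed to `count()` is interpreted as a regular expression, so character classes can be counted and special characters need escaping. A short sketch, reusing the Series from above plus one extra illustrative string:

```python
import pandas as pd

s = pd.Series(['Tom ', ' John', 'Will Smith', '123'])

# Count lower-case vowels in each element.
vowels = s.str.count(r'[aeiou]')

# Count digits in each element.
digits = s.str.count(r'\d')

# '$' is a regex metacharacter, so it is escaped to count it literally.
dollars = pd.Series(['$5 and $10']).str.count(r'\$')

print(vowels.tolist(), digits.tolist(), dollars.tolist())
```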

`startswith(pattern)`: Returns `True` if the element in the Series/Index starts with the pattern. The match is case-sensitive.

In [22]:
s.str.startswith(' ')

0    False
1     True
2    False
3    False
dtype: bool

In [23]:
s.str.startswith('w')

0    False
1    False
2    False
3    False
dtype: bool

In [24]:
s.str.startswith('W')

0    False
1    False
2     True
3    False
dtype: bool

In [25]:
s.str.lower().str.startswith('w')

0    False
1    False
2     True
3    False
dtype: bool

`endswith(pattern)`: Returns `True` if the element in the Series/Index ends with the pattern.

In [26]:
s.str.endswith(' ')

0     True
1    False
2    False
3    False
dtype: bool

`find(pattern)`: Returns the position of the first occurrence of the pattern in each element, or -1 if the pattern is not found.

In [27]:
s.str.find('2')

0   -1
1   -1
2   -1
3    1
dtype: int64

In [28]:
s.str.find('ll')

0   -1
1   -1
2    2
3   -1
dtype: int64

`findall(pattern)`: Returns a list of all occurrences of the pattern in each element.

In [29]:
s.str.findall('ll')

0      []
1      []
2    [ll]
3      []
dtype: object

In [30]:
s = pd.Series(['red','orange','yellow','green','blue'])
s

0       red
1    orange
2    yellow
3     green
4      blue
dtype: object

In [31]:
# return the index of the first occurrence of the pattern
s.str.find('e')

0    1
1    5
2    1
3    2
4    3
dtype: int64

In [32]:
# return every occurrence of 'e' in each element
s.str.findall('e')

0       [e]
1       [e]
2       [e]
3    [e, e]
4       [e]
dtype: object
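As a complement to the two cells above, the sketch below uses `rfind()` to get the last position of a substring and passes a regular expression to `findall()`. It reuses the colour Series from this section.

```python
import pandas as pd

s = pd.Series(['red', 'orange', 'yellow', 'green', 'blue'])

# Position of the first and of the last 'e' in each element.
first_e = s.str.find('e')
last_e = s.str.rfind('e')

# findall() takes a regex: collect every lower-case vowel, in order.
vowels = s.str.findall(r'[aeiou]')

print(first_e.tolist(), last_e.tolist())
print(vowels.tolist())
```

The two positions differ only for 'green', the one element with more than one 'e'.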

In [33]:
s.str.endswith('e')

0    False
1     True
2    False
3    False
4     True
dtype: bool

`get_dummies()`: Returns a DataFrame of one-hot encoded values, with one 0/1 indicator column per distinct value.

In [34]:
country = pd.Series(['USA','Colombia','Ecuador',
                     'Rep. Dominicana','Puerto Rico'])
country.str.get_dummies()

   Colombia  Ecuador  Puerto Rico  Rep. Dominicana  USA
0         0        0            0                0    1
1         1        0            0                0    0
2         0        1            0                0    0
3         0        0            0                1    0
4         0        0            1                0    0


In [35]:
sex = pd.Series(['Male','Female'])
sex.str.get_dummies()

   Female  Male
0       0     1
1       1     0
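`str.get_dummies()` splits each string on a separator (`'|'` by default) before encoding, so a single element can set several indicator columns. A minimal sketch, assuming a hypothetical Series of pipe-separated colour tags:

```python
import pandas as pd

# Hypothetical multi-label data: each element lists one or more tags.
tags = pd.Series(['red|blue', 'blue', 'red|green'])

# One indicator column per distinct tag, sorted alphabetically.
dummies = tags.str.get_dummies(sep='|')
print(dummies)
```

Each row contains a 1 in every column whose tag appears in that element.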


## Conclusions

Key Takeaways:
- Pandas provides extensive string methods for case conversion, trimming, and pattern matching in Series, enhancing data cleaning and preparation.
- Boolean string methods enable efficient querying of text data, facilitating data analysis tasks.
- Understanding and applying string manipulation techniques in Pandas is crucial for preprocessing text data, especially in tasks requiring pattern detection or data transformation.

## References

- VanderPlas, J. (2017). Python Data Science Handbook: Essential Tools for Working with Data. USA: O’Reilly Media, Inc. Chapter 3.