# Working with Text data

In [1]:
import pandas as pd
import numpy as np

## Data Type

### There are two ways to store text data in pandas:

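*   **object** dtype: a NumPy array of Python objects. It can hold any Python object, not only strings, and it is what the unqualified `pd.Series([...])` call in the next cell produces.
*   **StringDtype** extension type: a dtype dedicated to storing text data.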

In [4]:
x = pd.Series(["a", "b", "c"])
y = pd.Series(["a", "b", "c"], dtype="string")
z = pd.Series(["a", "b", "c"], dtype=pd.StringDtype())

### Type discussion:
*   "string" is a shorthand that pandas automatically interprets as `pd.StringDtype()`.
*   `pd.StringDtype()` is a more explicit, formal way of specifying the dtype.



In [10]:
print(f"The dtype of x (the default) is {x.dtype}")
print(f"The dtype of y is {y.dtype}")
print(f"The dtype of z is {z.dtype}")

The dtype of x (the default) is object
The dtype of y is string
The dtype of z is string
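
An existing Series can also be converted after it has been created. The following is a minimal sketch, assuming `astype("string")` suits your data; the variable names `raw` and `converted` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative data: a Series created without an explicit dtype
raw = pd.Series(["a", "b", None])

# Convert to the dedicated string extension dtype
converted = raw.astype("string")

print(converted.dtype)
print(isinstance(converted.dtype, pd.StringDtype))
print(converted.isna().tolist())
```

The missing entry stays missing after the conversion; it is not turned into the text `"None"`.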


### Difference between the `object` dtype and `string` dtype
For `StringDtype`, string accessor methods that return numeric output always return a nullable integer dtype (`Int64`). With `object` dtype, the result is either `int64` or `float64`, depending on whether NA values are present. Methods that return boolean output return a nullable `boolean` dtype.

In [32]:
s = pd.Series(["a", None, "b"], dtype="string")
s2 = pd.Series(["a", None, "b"], dtype="object")
print(f"Series s:\n{s}")
print(f"\ndtype of s: {s.dtype}")
print(f"\nSeries s2:\n{s2}")
print(f"\ndtype of s2: {s2.dtype}")
print(f"\ns.str.count('a'):\n{s.str.count('a')}")
print(f"\ns.str.count('a').dtype: {s.str.count('a').dtype}")
print(f"\ns2.str.count('a'):\n{s2.str.count('a')}")
print(f"\ns2.str.count('a').dtype: {s2.str.count('a').dtype}")
print(f"\ns.dropna().str.count('a'):\n{s.dropna().str.count('a')}")
print(f"\ns2.dropna().str.count('a'):\n{s2.dropna().str.count('a')}")

Series s:
0       a
1    <NA>
2       b
dtype: string

dtype of s: string

Series s2:
0       a
1    None
2       b
dtype: object

dtype of s2: object

s.str.count('a'):
0       1
1    <NA>
2       0
dtype: Int64

s.str.count('a').dtype: Int64

s2.str.count('a'):
0    1.0
1    NaN
2    0.0
dtype: float64

s2.str.count('a').dtype: float64

s.dropna().str.count('a'):
0    1
2    0
dtype: Int64

s2.dropna().str.count('a'):
0    1
2    0
dtype: int64
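
The cell above covers numeric output only. A small sketch of the boolean case, using `str.isdigit()` as an arbitrary example of a method that returns booleans (the data here is illustrative):

```python
import pandas as pd

s = pd.Series(["a", None, "1"], dtype="string")

# A boolean-returning accessor method on a string-dtype Series
is_digit = s.str.isdigit()

print(is_digit)
print(is_digit.dtype)
```

The result uses the nullable `boolean` dtype, and the missing entry stays `<NA>` rather than being coerced to `False`.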


## String & Index methods


* `Series` and `Index` are equipped with a set of string processing methods that make it easy to operate on each element of the array.
* Perhaps most importantly, these methods skip missing/NA values automatically: a missing input stays missing in the output.
* These are accessed via the `str` attribute and generally have names matching the equivalent (scalar) built-in string methods.



### String Manipulation and Length Operations on a pandas Series
This code demonstrates the string methods `lower()`, `upper()`, and `len()` applied to a `pandas.Series` containing a mix of uppercase strings, lowercase strings, and a missing value. Each method processes the elements in the series as follows:

* `lower()`: Converts all strings to lowercase.
* `upper()`: Converts all strings to uppercase.
* `len()`: Computes the length of each string in the series (returns `<NA>` for missing values when the dtype is `string`).

In [36]:
s = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
)
print(f"The series is s:\n{s}")
print(f"\nThe series is s.str.lower():\n{s.str.lower()}")
print(f"\nThe series is s.str.upper():\n{s.str.upper()}")
print(f"\nThe series is s.str.len():\n{s.str.len()}")

The series is s:
0       A
1       B
2       C
3    Aaba
4    Baca
5    <NA>
6    CABA
7     dog
8     cat
dtype: string

The series is s.str.lower():
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

The series is s.str.upper():
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

The series is s.str.len():
0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64


### Using String Manipulation Functions with a pandas Index to Remove Whitespace

In this code, a `pandas.Index` object is created with names that contain leading and trailing spaces. We then apply the string methods `strip()`, `lstrip()`, and `rstrip()` to remove whitespace in different ways:

* `strip()`: Removes both leading and trailing spaces from each string.
* `lstrip()`: Removes only the leading (left-side) spaces.
* `rstrip()`: Removes only the trailing (right-side) spaces.

In [44]:
idx = pd.Index(["    jack", "jill      ", "   jesse ", "frank   "])
print(f"The index is idx:\n{idx}")
print(f"The index is idx.str.strip():\n{idx.str.strip()}")
print(f"The index is idx.str.lstrip():\n{idx.str.lstrip()}")
print(f"The index is idx.str.rstrip():\n{idx.str.rstrip()}")

The index is idx:
Index(['    jack', 'jill      ', '   jesse ', 'frank   '], dtype='object')
The index is idx.str.strip():
Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
The index is idx.str.lstrip():
Index(['jack', 'jill      ', 'jesse ', 'frank   '], dtype='object')
The index is idx.str.rstrip():
Index(['    jack', 'jill', '   jesse', 'frank'], dtype='object')


## Cleaning and Standardizing DataFrame Column Names in pandas

In [48]:
df = pd.DataFrame(
    np.random.randn(10, 2), columns=[" Column A ", " Column B "], index=range(10)
)
df.head(5)

Unnamed: 0,Column A,Column B
0,-0.968111,0.276658
1,0.706178,0.09697
2,-0.799079,-1.90748
3,-2.752959,1.480384
4,0.137859,0.130684


In [58]:
print(f"The actual values of the column names: {df.columns}")
print(f"This is the stripped value of the column names: {df.columns.str.strip()}")
print(f"This is the lower case value of the column names: {df.columns.str.lower()}")
print(f"This is the lower and stripped value of the column names: {df.columns.str.lower().str.strip()}")
print(f"This is the best way to correct the column names: {df.columns.str.lower().str.strip().str.replace(' ','_')}")

The actual values of the column names: Index([' Column A ', ' Column B '], dtype='object')
This is the stripped value of the column names: Index(['Column A', 'Column B'], dtype='object')
This is the lower case value of the column names: Index([' column a ', ' column b '], dtype='object')
This is the lower and stripped value of the column names: Index(['column a', 'column b'], dtype='object')
This is the best way to correct the column names: Index(['column_a', 'column_b'], dtype='object')
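
The cell above only prints the cleaned names; `df` itself is unchanged. A minimal sketch of assigning the result back, assuming the same column layout as above (a smaller frame is used here for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2), columns=[" Column A ", " Column B "])

# Strip whitespace, lowercase, then replace inner spaces with underscores.
# regex=False makes the replacement a plain literal substitution.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)

print(df.columns.tolist())
```

After the assignment, the columns can be accessed as `df["column_a"]` and `df["column_b"]`.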


## Splitting and replacing strings

### `split()`

In [71]:
s2 = pd.Series(["a_b_c_d_e_f", "c_d_e_f_g_h", np.nan, "f_g_h_i_j_k"], dtype="string")
print(f"Splitting each string on '_':\n{s2.str.split('_')}")
print(f"\nElement at position 2 of each list, via .str.get(2):\n{s2.str.split('_').str.get(2)}")
print(f"\nElement at position 1 of each list, via .str[1]:\n{s2.str.split('_').str[1]}")

Splitting each string on '_':
0    [a, b, c, d, e, f]
1    [c, d, e, f, g, h]
2                  <NA>
3    [f, g, h, i, j, k]
dtype: object

Element at position 2 of each list, via .str.get(2):
0       c
1       e
2    <NA>
3       h
dtype: object

Element at position 1 of each list, via .str[1]:
0       b
1       d
2    <NA>
3       g
dtype: object


### Using `split()` with the `expand=True` Parameter  
**Observation:** With `expand=True`, each piece of the split string becomes its own column. A string containing five underscores is split into six pieces, so the result below has six columns.

In [79]:
s2.str.split("_", expand=True)

Unnamed: 0,0,1,2,3,4,5
0,a,b,c,d,e,f
1,c,d,e,f,g,h
2,,,,,,
3,f,g,h,i,j,k


### Using `split("_", expand=True, n=1)`
The `n` parameter limits the number of splits, not the number of columns: at most `n` splits are made, starting from the left, which gives at most `n + 1` columns. If the string contains more separators than `n`, the last column holds the unsplit remainder of the string.

In [73]:
s2.str.split("_", expand=True, n=1)

Unnamed: 0,0,1
0,a,b_c_d_e_f
1,c,d_e_f_g_h
2,,
3,f,g_h_i_j_k
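
The relationship between `n` and the number of columns can be checked directly. This is a small sketch on a single illustrative string; `column_counts` is a name introduced here.

```python
import pandas as pd

s = pd.Series(["a_b_c_d_e_f"], dtype="string")

# n splits produce n + 1 columns, as long as the string has enough separators
column_counts = {
    n: s.str.split("_", n=n, expand=True).shape[1] for n in (1, 2, 3)
}

print(column_counts)  # {1: 2, 2: 3, 3: 4}
```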


### Using `rsplit("_",expand=True,n=4)`
Observation: `rsplit` works like `split` but starts from the right end of the string. At most `n` splits are made, which gives at most `n + 1` columns; with `n=4` that is five columns. If the string contains more separators than `n`, the first column holds the unsplit remainder of the string, while the other columns hold the segments split off at each underscore.

In [78]:
s2.str.rsplit("_", expand=True, n=4)

Unnamed: 0,0,1,2,3,4
0,a_b,c,d,e,f
1,c_d,e,f,g,h
2,,,,,
3,f_g,h,i,j,k
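
To see the difference in direction side by side, here is a short sketch that applies `split` and `rsplit` with the same `n` to one illustrative string:

```python
import pandas as pd

s = pd.Series(["a_b_c_d_e_f"], dtype="string")

# Same n, opposite directions: the unsplit remainder ends up on opposite sides
left = s.str.split("_", n=2, expand=True).iloc[0].tolist()
right = s.str.rsplit("_", n=2, expand=True).iloc[0].tolist()

print(left)   # ['a', 'b', 'c_d_e_f']
print(right)  # ['a_b_c_d', 'e', 'f']
```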


### Using `replace()`

In [92]:
s3 = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat","doghouse"],
    dtype="string",
)
print(s3)

0            A
1            B
2            C
3         Aaba
4         Baca
5             
6         <NA>
7         CABA
8          dog
9          cat
10    doghouse
dtype: string


### Using `replace(regex_expression,'replacement_string',case=bool,regex=bool)`

* **Pattern**: `^.a|dog` matches either any single character followed by `a` at the start of the string, or `dog` anywhere in the string.
* **Replacement**: Each match is replaced with `"XX-XX "`.
* **Case insensitive**: `case=False` makes the match case-insensitive, so the `CA` in `CABA` also matches.

In [93]:
print(s3.str.replace("^.a|dog", "XX-XX ", case=False, regex=True))

0               A
1               B
2               C
3        XX-XX ba
4        XX-XX ca
5                
6            <NA>
7        XX-XX BA
8          XX-XX 
9         XX-XX t
10    XX-XX house
dtype: string
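
The pattern can be explored on plain Python strings with the standard `re` module. This sketch assumes `re.IGNORECASE` as the counterpart of `case=False`; the sample strings, including `"hotdog"`, are illustrative.

```python
import re

# The alternation splits the pattern into two branches: "^.a" and "dog".
# Only the first branch is anchored to the start of the string.
pattern = re.compile(r"^.a|dog", flags=re.IGNORECASE)

for text in ["A", "Aaba", "CABA", "cat", "doghouse", "hotdog"]:
    print(repr(text), "->", repr(pattern.sub("XX-XX ", text)))
```

`"A"` is left alone because the first branch needs two characters, while `"hotdog"` is changed because the `dog` branch is not anchored.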


### Literal Replacement
If you want literal replacement of a string (equivalent to Python's built-in `str.replace()`), you can set the optional `regex` parameter to `False` rather than escaping each special character. In this case both `pattern` and `replacement` must be strings.

In [112]:
dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")
print(dollars)
print("\nAfter setting regex to True:")
print(dollars.str.replace(r'\$','',regex=True))
print("\nAfter setting regex to False:")
print(dollars.str.replace("-$", "-", regex=False))


0         12
1       -$10
2    $10,000
dtype: string

After setting regex to True:
0        12
1       -10
2    10,000
dtype: string

After setting regex to False:
0         12
1        -10
2    $10,000
dtype: string
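
A sketch of why the distinction matters for a pattern such as `"-$"`, where `$` is a regex metacharacter. The extra value `"10-"` and the use of `re.escape` are additions for illustration only.

```python
import re
import pandas as pd

dollars = pd.Series(["12", "-$10", "$10,000", "10-"], dtype="string")

# Literal replacement: "-$" is treated as two ordinary characters
literal = dollars.str.replace("-$", "-", regex=False)

# The same effect with regex=True, after escaping the special characters
escaped = dollars.str.replace(re.escape("-$"), "-", regex=True)

# Unescaped with regex=True: "$" means end of string, so this removes
# a trailing hyphen and leaves "-$10" untouched
unescaped = dollars.str.replace("-$", "", regex=True)

print(literal.tolist())    # ['12', '-10', '$10,000', '10-']
print(escaped.tolist())    # ['12', '-10', '$10,000', '10-']
print(unescaped.tolist())  # ['12', '-$10', '$10,000', '10']
```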


### Replacement with a callable - 1

In [154]:
def repl(m):
    # m is a re.Match object; reversing the match is O(n) in its length
    return m.group(0)[::-1]  # Reverse the matched substring

In [155]:
pat = r"[a-z]+"
pd.Series(["foo 123", "bar baz", np.nan], dtype="string").str.replace(
    pat, repl, regex=True
)

Unnamed: 0,0
0,oof 123
1,rab zab
2,


### Replacement with a callable - 2

In [156]:
def repl_fixed(m):
    # m is the re.Match object created when the pattern matches the string, e.g.
    # <re.Match object; span=(0, 19), match='apple banana cherry'>
    # span holds the (start, end) positions of the match within the string.
    # Keep the first two words, join them with ' replaced ', and drop the third.
    return m.group("one") + " replaced " + m.group("two")

In [157]:
pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
result = pd.Series(["cat dog mouse","hello world pattern is nice","cat-dog mouse", "apple banana cherry"], dtype="string").str.replace(
    pat, repl_fixed, regex=True
)
print(result)

0                cat replaced dog
1    hello replaced world is nice
2                   cat-dog mouse
3           apple replaced banana
dtype: string
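
The match object passed to the callable can be inspected on its own with the standard `re` module. This sketch reuses the pattern from the cell above on two of the sample strings:

```python
import re

pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"

m = re.search(pat, "apple banana cherry")
print(m.span())       # (0, 19): start and end positions of the match
print(m.group(0))     # apple banana cherry
print(m.groupdict())  # {'one': 'apple', 'two': 'banana', 'three': 'cherry'}

# No three space-separated words here, so there is no match and
# str.replace leaves the string unchanged
no_match = re.search(pat, "cat-dog mouse")
print(no_match)       # None
```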


## Concatenation

### Concatenating a single Series into a string
1. The code uses `pd.Series` to create a string Series `s` and concatenates its elements with different separators (comma, empty string, hash) using `s.str.cat(sep='separator')`.
2. Missing values (`NaN`) are skipped by default during concatenation.
3. A second Series `s2` contains `NaN` values; concatenating it with commas omits them from the result.
4. The `na_rep='NOTHING'` parameter substitutes a placeholder for missing values in the concatenated output.

In [182]:
s = pd.Series(["a", "b", "c", "d"], dtype="string")
print(f"Concatenated with a comma: {s.str.cat(sep=',')}")
print(f"Concatenated with no separator: {s.str.cat(sep='')}")
print(f"Concatenated with a hash: {s.str.cat(sep='#')}")
# By default, missing values are skipped
s2 = pd.Series(["a","b",np.nan,"c","d",np.nan,"e","f"])
print(f"Concatenated with a comma, missing values skipped: {s2.str.cat(sep=',')}")
# Use na_rep to substitute a placeholder for missing values
print(f"Concatenated with a comma, missing values replaced: {s2.str.cat(sep=',',na_rep='NOTHING')}")

Concatenated with a comma: a,b,c,d
Concatenated with no separator: abcd
Concatenated with a hash: a#b#c#d
Concatenated with a comma, missing values skipped: a,b,c,d,e,f
Concatenated with a comma, missing values replaced: a,b,NOTHING,c,d,NOTHING,e,f
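
For a single Series, `str.cat` behaves much like Python's `str.join` applied to the non-missing values. A small sketch of that correspondence on illustrative data:

```python
import numpy as np
import pandas as pd

s2 = pd.Series(["a", "b", np.nan, "c"])

# Skipping missing values corresponds to joining after dropna()
joined_skip = ",".join(s2.dropna())

# na_rep corresponds to filling missing values before joining
joined_fill = ",".join(s2.fillna("NOTHING"))

print(joined_skip)  # a,b,c
print(joined_fill)  # a,b,NOTHING,c
```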


### Concatenating a Series and something list-like into a Series

In [185]:
print(f"Concatenating with the list ['A', 'B', 'C', 'D']:\n{s.str.cat(['A', 'B', 'C', 'D'])}")
# Missing values on either side will result in missing values in the result as well, unless na_rep is specified
t = ['a','b',np.nan,'c']
print(f"\nConcatenating with t, which contains a missing value:\n{s.str.cat(t)}")
print(f"\nConcatenating with t and na_rep='-':\n{s.str.cat(t,na_rep='-')}")

Concatenating with the list ['A', 'B', 'C', 'D']:
0    aA
1    bB
2    cC
3    dD
dtype: string

Concatenating with t, which contains a missing value:
0      aa
1      bb
2    <NA>
3      dc
dtype: string

Concatenating with t and na_rep='-':
0    aa
1    bb
2    c-
3    dc
dtype: string


### Concatenating a Series and something array-like into a Series

In [194]:
t = pd.Series(t)
d = pd.concat([t, s], axis=1)
print(s)
print('\n')
print(d)
print('\n')
print(s.str.cat(d, na_rep="-"))

0    a
1    b
2    c
3    d
dtype: string


     0  1
0    a  a
1    b  b
2  NaN  c
3    c  d


0    aaa
1    bbb
2    c-c
3    dcd
dtype: string


### Concatenating a Series and an indexed object into a Series, with alignment

In [199]:
u = pd.Series(["-b", "-d", "-a", "-c"], index=[1, 3, 0, 2], dtype="string")
print(u)

1    -b
3    -d
0    -a
2    -c
dtype: string


In [200]:
print(s)

0    a
1    b
2    c
3    d
dtype: string


In [201]:
print(s.str.cat(u))

0    a-a
1    b-b
2    c-c
3    d-d
dtype: string


In [202]:
print(s.str.cat(u, join="left"))

0    a-a
1    b-b
2    c-c
3    d-d
dtype: string
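
In the cells above, `u` has the same labels as `s`, so the `join` argument makes no visible difference. The sketch below uses a hypothetical Series `v` whose index only partly overlaps with `s`, to show how `join` changes the result.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], dtype="string")
# Hypothetical second Series: labels -1 and 4 are not in s, label 2 is missing
v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")

left = s.str.cat(v, join="left", na_rep="-")    # keep the labels of s
outer = s.str.cat(v, join="outer", na_rep="-")  # union of both label sets
inner = s.str.cat(v, join="inner", na_rep="-")  # only labels present in both

print(left)
print(outer)
print(inner)
```

Values are paired by index label, not by position, and `na_rep` fills the gaps that alignment leaves on either side.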


## Indexing with `.str`

In [208]:
s = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
)
print(s.str[0]) # The first character of every string in the series
print(f"\n{s.str[1]}") # The second character of every string; <NA> where the string is too short

0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string

0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string
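
The `.str` accessor also accepts slices and negative positions. A brief sketch on illustrative data:

```python
import pandas as pd

s = pd.Series(["Aaba", "dog", None], dtype="string")

first_two = s.str[:2]  # slice: the first two characters of each string
last_char = s.str[-1]  # negative index: the last character of each string

print(first_two)
print(last_char)
```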


## Extracting substrings

### Extract first match in each subject (extract)

* The `extract` method accepts a regular expression with at least one capture group.
* Extracting a regular expression with more than one group returns a DataFrame with one column per group.

In [231]:
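lis = ["a1", "b2", "c3", "d0", "e3"]
lis1 = ["a1", "b2", "3", "4", "5"]
regex1 = r"([abcd])(\d)"                   # two unnamed groups
regex2 = r"(?P<letter>[ab])(?P<digit>\d)"  # two named groups
regex3 = r"([ab])?(\d)"                    # the first group is optional

In [237]:
def seriesreturn(lis, pattern, flag):
    # Build a string Series from lis and extract the first match of pattern;
    # flag is passed through as the expand argument of str.extract
    return pd.Series(lis, dtype="string").str.extract(pattern, expand=flag)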
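
The helper above is defined but not called in this section. Below is a self-contained sketch of what such calls return; `extract_from` is a hypothetical stand-in for the helper, and the one-group pattern `[ab](\d)` is added here for illustration.

```python
import pandas as pd

lis = ["a1", "b2", "c3", "d0", "e3"]
lis1 = ["a1", "b2", "3", "4", "5"]


def extract_from(values, pattern, expand):
    # Hypothetical stand-in for the helper defined above
    return pd.Series(values, dtype="string").str.extract(pattern, expand=expand)


# Two unnamed groups: DataFrame with columns 0 and 1; "e3" does not match
unnamed = extract_from(lis, r"([abcd])(\d)", True)

# Two named groups: the group names become the column names
named = extract_from(lis, r"(?P<letter>[ab])(?P<digit>\d)", True)

# Optional first group: the digit is still extracted when the letter is absent
optional = extract_from(lis1, r"([ab])?(\d)", True)

# One group: expand decides between a Series and a one-column DataFrame
one_series = extract_from(lis, r"[ab](\d)", False)
one_frame = extract_from(lis, r"[ab](\d)", True)

print(unnamed, named, optional, one_series, one_frame, sep="\n\n")
```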

### Extract all matches in each subject (extractall)

* The `extractall` method returns every match.
* The result of `extractall` is always a DataFrame with a MultiIndex on its rows.
* The last level of the MultiIndex is named `match` and indicates the order of each match within the subject.

In [243]:
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
print(s)
two_groups = "(?P<letter>[a-z])(?P<digit>[0-9])"
print("\n")
s.str.extract(two_groups, expand=True)

A    a1a2
B      b1
C      c1
dtype: string




Unnamed: 0,letter,digit
A,a,1
B,b,1
C,c,1


In [244]:
s.str.extractall(two_groups)

Unnamed: 0_level_0,Unnamed: 1_level_0,letter,digit
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1
A,0,a,1
A,1,a,2
B,0,b,1
C,0,c,1
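
The `match` level makes it possible to select or count matches per subject. A short sketch, assuming the same data as above; `first` and `counts` are names introduced here.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

all_matches = s.str.extractall(two_groups)

# Selecting match 0 gives the first match per subject, like extract does
first = all_matches.xs(0, level="match")

# Counting rows per subject gives the number of matches in each string
counts = all_matches.groupby(level=0).size()

print(first)
print(counts)
```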


## Testing for strings that match or contain a pattern

In [245]:
pattern = r"[0-9][a-z]"

pd.Series(
    ["1", "2", "3a", "3b", "03c", "4dx"],
    dtype="string",
).str.contains(pattern)

Unnamed: 0,0
0,False
1,False
2,True
3,True
4,True
5,True


### `match`
`match` tests whether the regular expression matches starting at the first character of the string, whereas `contains` tests whether it matches at any position within the string.

In [246]:
pd.Series(
    ["1", "2", "3a", "3b", "03c", "4dx"],
    dtype="string",
).str.match(pattern)

Unnamed: 0,0
0,False
1,False
2,True
3,True
4,False
5,True


### `fullmatch`
`fullmatch` tests whether the entire string matches the regular expression.

In [247]:
pd.Series(
    ["1", "2", "3a", "3b", "03c", "4dx"],
    dtype="string",
).str.fullmatch(pattern)


Unnamed: 0,0
0,False
1,False
2,True
3,True
4,False
5,False
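
Putting the three methods next to each other makes the differences easier to read. A small sketch that collects the results in one DataFrame; the name `comparison` is introduced here.

```python
import pandas as pd

pattern = r"[0-9][a-z]"
s = pd.Series(["1", "2", "3a", "3b", "03c", "4dx"], dtype="string")

comparison = pd.DataFrame(
    {
        "value": s,
        "contains": s.str.contains(pattern),    # match anywhere
        "match": s.str.match(pattern),          # match at the start
        "fullmatch": s.str.fullmatch(pattern),  # match the whole string
    }
)

print(comparison)
```

`"03c"` only satisfies `contains` because the digit-letter pair is not at the start, and `"4dx"` fails `fullmatch` because of the trailing `x`.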


Methods like `match`, `fullmatch`, `contains`, `startswith`, and `endswith` take an extra `na` argument so missing values can be considered True or False:

In [248]:
s4 = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
)

s4.str.contains("A", na=False)

Unnamed: 0,0
0,True
1,False
2,False
3,True
4,False
5,False
6,True
7,False
8,False


## Creating indicator variables

You can extract dummy variables from string columns. For example, if the values are separated by a `|`:

In [250]:
s = pd.Series(["a", "a|b", np.nan, "a|c"], dtype="string")

s.str.get_dummies(sep="|")

Unnamed: 0,a,b,c
0,1,0,0
1,1,1,0
2,0,0,0
3,1,0,1


A string `Index` also supports `get_dummies`, which returns a `MultiIndex`.

In [251]:
idx = pd.Index(["a", "a|b", np.nan, "a|c"])
idx.str.get_dummies(sep="|")

MultiIndex([(1, 0, 0),
            (1, 1, 0),
            (0, 0, 0),
            (1, 0, 1)],
           names=['a', 'b', 'c'])