
# **Vectorized String Operations**

**Introducing Pandas String Operations**

In [2]:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2  # Multiply each element in the array by 2

array([ 4,  6, 10, 14, 22, 26])

In [3]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]  # Capitalize each string: first letter uppercase, the rest lowercase

['Peter', 'Paul', 'Mary', 'Guido']

In [4]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s if s is None else s.capitalize() for s in data]  # Capitalize each string, passing None values through unchanged

['Peter', 'Paul', None, 'Mary', 'Guido']

In [5]:
import pandas as pd
names = pd.Series(data) # Create a Pandas Series from the list
names # Display the Series

Unnamed: 0,0
0,peter
1,Paul
2,
3,MARY
4,gUIDO


In [6]:
names.str.capitalize() # Capitalize each string in the Pandas Series

Unnamed: 0,0
0,Peter
1,Paul
2,
3,Mary
4,Guido
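
The contrast between the two approaches above can be made explicit. The sketch below reuses the same `data` list and shows that a plain list comprehension raises on the missing entry, while the `.str` accessor leaves the missing value in place; the variable name `capitalized` is introduced here only for illustration.

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
names = pd.Series(data)

# Without an explicit None check, the list comprehension fails,
# because None has no .capitalize() method.
try:
    [s.capitalize() for s in data]
except AttributeError as err:
    print('list comprehension failed:', err)

# The .str accessor skips the missing entry and keeps it missing.
capitalized = names.str.capitalize()
print(capitalized.isna().tolist())
```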


# **Tables of Pandas String Methods**

In [7]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte # Display the Series

Unnamed: 0,0
0,Graham Chapman
1,John Cleese
2,Terry Gilliam
3,Eric Idle
4,Terry Jones
5,Michael Palin


# **Methods Similar to Python String Methods**

|             |             |              |             |
|-------------|-------------|--------------|-------------|
| `len()`     | `lower()`   | `translate()`| `islower()` |
| `ljust()`   | `upper()`   | `startswith()`| `isupper()` |
| `rjust()`   | `find()`    | `endswith()` | `isnumeric()`|
| `center()`  | `rfind()`   | `isalnum()`  | `isdecimal()`|
| `zfill()`   | `index()`   | `isalpha()`  | `split()`   |
| `strip()`   | `rindex()`  | `isdigit()`  | `rsplit()`  |
| `rstrip()`  | `capitalize()`| `isspace()`| `partition()`|
| `lstrip()`  | `swapcase()`| `istitle()`  | `rpartition()`|

In [8]:
monte.str.lower() # Convert all strings in the Series to lowercase

Unnamed: 0,0
0,graham chapman
1,john cleese
2,terry gilliam
3,eric idle
4,terry jones
5,michael palin


In [9]:
monte.str.len() # Get the length of each string in the Series

Unnamed: 0,0
0,14
1,11
2,13
3,9
4,11
5,13


In [10]:
monte.str.startswith('T') # Check if each string in the Series starts with 'T'

Unnamed: 0,0
0,False
1,False
2,True
3,False
4,True
5,False


In [11]:
monte.str.split() # Split each string in the Series into a list of words

Unnamed: 0,0
0,"[Graham, Chapman]"
1,"[John, Cleese]"
2,"[Terry, Gilliam]"
3,"[Eric, Idle]"
4,"[Terry, Jones]"
5,"[Michael, Palin]"
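
A few more methods from the table above, as a minimal sketch on the same `monte` Series. The variable names (`upper`, `space_pos`, `ends_n`, `parts`) are illustrative only; `expand=True` is an optional argument of `str.split` that returns a DataFrame with one column per piece instead of a Series of lists.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

upper = monte.str.upper()              # uppercase every element
space_pos = monte.str.find(' ')        # index of the first space, -1 if absent
ends_n = monte.str.endswith('n')       # Boolean mask, usable for filtering
parts = monte.str.split(expand=True)   # one column per word instead of lists

print(monte[ends_n].tolist())          # names ending in 'n'
print(parts.shape)                     # (rows, columns) of the expanded split
```

Boolean results such as `ends_n` can be used directly as a mask to select rows of the original Series.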


# **Methods Using Regular Expressions**

| Method   | Description                                                  |
|----------|--------------------------------------------------------------|
| match    | Calls re.match on each element, returning a Boolean          |
| extract  | Calls re.search on each element, returning the matched groups as strings |
| findall  | Calls re.findall on each element                             |
| replace  | Replaces occurrences of pattern with some other string       |
| contains | Calls re.search on each element, returning a Boolean         |
| count    | Counts occurrences of pattern                                |
| split    | Equivalent to str.split, but accepts regexps                 |
| rsplit   | Equivalent to str.rsplit (splits from the right on a literal delimiter) |

In [12]:
monte.str.extract('([A-Za-z]+)', expand=False)  # Extract the first contiguous run of letters (the first name) from each string

Unnamed: 0,0
0,Graham
1,John
2,Terry
3,Eric
4,Terry
5,Michael


In [13]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')  # Find names that both start and end with a consonant; non-matching names give an empty list

Unnamed: 0,0
0,[Graham Chapman]
1,[]
2,[Terry Gilliam]
3,[]
4,[Terry Jones]
5,[Michael Palin]
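
The remaining regular-expression methods in the table follow the same pattern. The sketch below is illustrative rather than exhaustive: the variable names and the small two-element Series passed to `extract` are made up for this example. It also shows that `extract` picks up the first match anywhere in the string, not only at its start.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

starts_t = monte.str.contains(r'^T')                  # Boolean: does the pattern occur?
vowels = monte.str.count(r'[aeiou]')                  # number of lowercase vowels
snake = monte.str.replace(r'\s+', '_', regex=True)    # regex=True treats the pattern as a regex

# Hypothetical data: the digits are not at the start of the string,
# yet extract still finds them; rows with no match become missing.
codes = pd.Series(['id: 42', 'no digits']).str.extract(r'(\d+)', expand=False)

print(starts_t.tolist())
print(vowels.tolist())
print(snake.tolist())
print(codes.tolist())
```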


# **Miscellaneous Methods**

| Method        | Description                                                |
|---------------|------------------------------------------------------------|
| `get`         | Indexes each element                                       |
| `slice`       | Slices each element                                        |
| `slice_replace`| Replaces slice in each element with the passed value       |
| `cat`         | Concatenates strings                                       |
| `repeat`      | Repeats values                                             |
| `normalize`   | Returns Unicode form of strings                            |
| `pad`         | Adds whitespace to left, right, or both sides of strings   |
| `wrap`        | Splits long strings into lines with length less than a given width |
| `join`        | Joins strings in each element of the Series with the passed separator |
| `get_dummies` | Extracts dummy variables as a DataFrame                      |
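
As a rough illustration of three entries from this table (`pad`, `cat`, and `join`), the sketch below applies them to the same `monte` Series. The width of 16, the `'.'` fill character, and the variable names are arbitrary choices made for this example.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# pad: left-pad every name to 16 characters with dots
padded = monte.str.pad(16, side='left', fillchar='.')

# cat: with no second Series given, concatenate all elements into one string
all_names = monte.str.cat(sep=', ')

# join: each element must be a list of strings, so split first
hyphenated = monte.str.split().str.join('-')

print(padded.tolist())
print(all_names)
print(hyphenated.tolist())
```

Note that `cat` collapses the whole Series into a single Python string here, whereas `pad` and `join` return a Series of the same length as the input.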

**Vectorized item access and slicing**

In [14]:
monte.str[0:3]  # Slice the first three characters of each string

Unnamed: 0,0
0,Gra
1,Joh
2,Ter
3,Eri
4,Ter
5,Mic


In [15]:
monte.str.split().str[-1]  # Split each name on whitespace and take the last piece (the last name)

Unnamed: 0,0
0,Chapman
1,Cleese
2,Gilliam
3,Idle
4,Jones
5,Palin
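
The indexing syntax on `.str` is shorthand for the `slice` and `get` methods listed in the miscellaneous table. A small sketch checking that both spellings give the same result (variable names are illustrative):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Slice syntax vs. the explicit slice() method
first3_a = monte.str[0:3]
first3_b = monte.str.slice(0, 3)

# Index syntax vs. the explicit get() method, applied to the split lists
last_a = monte.str.split().str[-1]
last_b = monte.str.split().str.get(-1)

print(first3_a.equals(first3_b), last_a.equals(last_b))
```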


In [16]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

Unnamed: 0,name,info
0,Graham Chapman,B|C|D
1,John Cleese,B|D
2,Terry Gilliam,A|C
3,Eric Idle,B|D
4,Terry Jones,B|C
5,Michael Palin,B|C|D


In [17]:
full_monte['info'].str.get_dummies('|')  # Split each 'info' string on '|' and expand the codes into 0/1 indicator columns

Unnamed: 0,A,B,C,D
0,0,1,1,1
1,0,1,0,1
2,1,0,1,0
3,0,1,0,1
4,0,1,1,0
5,0,1,1,1
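
One possible follow-up, sketched here as an assumption about how the indicator columns might be used: join them back onto the original frame and filter on a code. The names `dummies`, `combined`, and `has_c` are introduced only for this example.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})

# Expand the '|'-separated codes into indicator columns
dummies = full_monte['info'].str.get_dummies('|')

# Attach the indicators to the original rows (aligned on the index)
combined = full_monte.join(dummies)

# Select the names whose 'info' string contains code C
has_c = combined.loc[combined['C'] == 1, 'name']
print(has_c.tolist())
```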
