## Pandas - Part 3

In [1]:
import pandas as pd
import numpy as np

Part 3 covers working with strings in DataFrames, followed by regular expressions for matching and transforming text.

## Pandas string methods 
- ``series.str.<method>()``
    
|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

Notice that these have various return values. Some, like ``lower()``, return a series of strings:

In [25]:
string_series = pd.Series(
    ['Picard','Sisko','Riker',
     'Dax','Janeway','LaForge'] )

str_series = pd.Series(
    ['Captain Picard','Captain Sisko',
     'Captain Janeway','Lt.Cmdr Dax',
     'Chief Engineer LaForge','Doctor Crusher'])


In [26]:
# .lower()
print(string_series.str.lower() )

0     picard
1      sisko
2      riker
3        dax
4    janeway
5    laforge
dtype: object


In [27]:
# find out the length of each string
print(string_series.str.len())

0    6
1    5
2    5
3    3
4    7
5    7
dtype: int64


In [28]:
print(str_series)

# splitting a string into words is a simple form of 'tokenization'
# Tokenization is very important for NLP 
# (Natural Language Processing)

str_series.str.split() 

0            Captain Picard
1             Captain Sisko
2           Captain Janeway
3               Lt.Cmdr Dax
4    Chief Engineer LaForge
5            Doctor Crusher
dtype: object


0             [Captain, Picard]
1              [Captain, Sisko]
2            [Captain, Janeway]
3                [Lt.Cmdr, Dax]
4    [Chief, Engineer, LaForge]
5             [Doctor, Crusher]
dtype: object
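
Other methods in the string-method table return booleans rather than strings or integers, which makes them handy as filter masks. The following is a minimal sketch; the variable names `names`, `starts_with_s` and `long_names` are chosen here for illustration and redefine the officer names so the snippet runs on its own.

```python
import pandas as pd

# Same officer names as string_series, redefined so the sketch is self-contained
names = pd.Series(['Picard', 'Sisko', 'Riker', 'Dax', 'Janeway', 'LaForge'])

# startswith() returns a boolean Series, one value per element
starts_with_s = names.str.startswith('S')

# A boolean Series can be used directly as a mask to filter the original
long_names = names[names.str.len() > 5]

print(starts_with_s.tolist())
print(long_names.tolist())
```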

### Regular Expressions (regex) 
Regular expressions are very important when working with text data, *especially if you get posted at **Ops** and have to transmit communications between planets, Starbases and Federation allies.* Federation News reports that a basic understanding of regex is essential for Officers on outposts in all space quadrants.


Regular expressions are widely used in NLP and machine learning pipelines. The subject is vast, and regex turns up everywhere from web searches to chatbots. 

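## **Regular Expressions** ``series.str.<method>()``
- These pandas string methods accept regular expressions; they are built on Python's ``re`` module (``import re``)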

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |
| ``extract()`` | Call ``re.search()`` on each element, returning the groups of the first match as strings.|
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of pattern with some other string|
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern|
| ``split()``   | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |
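
As a rough illustration of the table above, the sketch below applies three of the regex-aware methods to the officer titles. The variable names are assumptions made for this example, and `regex=True` is passed to `replace()` explicitly because its default has differed between pandas versions.

```python
import pandas as pd

titles = pd.Series(['Captain Picard', 'Captain Sisko', 'Captain Janeway',
                    'Lt.Cmdr Dax', 'Chief Engineer LaForge', 'Doctor Crusher'])

# contains(): True where the pattern is found in the element
is_captain = titles.str.contains(r'^Captain')

# count(): number of non-overlapping matches per element (lowercase vowels here)
vowel_counts = titles.str.count(r'[aeiou]')

# replace(): strip a leading "Captain " using a regex
no_title = titles.str.replace(r'^Captain\s+', '', regex=True)

print(is_captain.tolist())
print(vowel_counts.tolist())
print(no_title.tolist())
```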


- https://www.w3schools.com/python/python_regex.asp




| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element|
| ``slice_replace()`` | Replace slice in each element with passed value|
| ``cat()``      | Concatenate strings|
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return Unicode form of string |
| ``pad()`` | Add whitespace to left, right, or both sides of strings|
| ``wrap()`` | Split long strings into lines with length less than a given width|
| ``join()`` | Join strings in each element of the Series with passed separator|
| ``get_dummies()`` | extract dummy variables as a dataframe |
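
A few of the miscellaneous methods in this table are easier to grasp from a small example. This is only a sketch with made-up variable names; it shows `cat()`, `pad()` and `join()` on short inputs.

```python
import pandas as pd

names = pd.Series(['Picard', 'Sisko', 'Dax'])

# cat(): with only a separator, concatenates the whole Series into one string
joined = names.str.cat(sep=', ')

# pad(): pad each element to a fixed width, here on the left with dots
padded = names.str.pad(8, side='left', fillchar='.')

# join(): on a Series of lists, joins each list with the given separator
rejoined = pd.Series(['A-B-D', 'A-B']).str.split('-').str.join('+')

print(joined)
print(padded.tolist())
print(rejoined.tolist())
```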

In [29]:
# EXTRACT method

# extract the first run of letters from each element
# (here that is the whole name, since each element is one word)
string_series.str.extract(r'([A-Za-z]+)', expand=False)

0     Picard
1      Sisko
2      Riker
3        Dax
4    Janeway
5    LaForge
dtype: object
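
With more than one capture group, `extract()` returns a DataFrame with one column per group. The sketch below assumes a simple "rank, then a one-word name" layout; the group names `rank` and `name` are invented for this example.

```python
import pandas as pd

titles = pd.Series(['Captain Picard', 'Lt.Cmdr Dax', 'Chief Engineer LaForge'])

# Two named groups: everything before the last space, and the final word.
# Named groups become the column names of the resulting DataFrame.
parts = titles.str.extract(r'^(?P<rank>.+)\s(?P<name>\S+)$')

print(parts)
```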

In [16]:
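# FIND ALL method

# pattern: starts with anything but an uppercase vowel
# and ends with anything but a lowercase vowel
print(str_series.str.findall(r'^[^AEIOU].*[^aeiou]$'),'\n' )

print(string_series.str.findall(r'^[^AEIOU].*[^aeiou]$') )

0     [Captain Picard]
1                   []
2    [Captain Janeway]
3        [Lt.Cmdr Dax]
4                   []
5     [Doctor Crusher]
dtype: object 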

0     [Picard]
1           []
2      [Riker]
3        [Dax]
4    [Janeway]
5           []
dtype: object


In [30]:
# SLICE a string 
# get first 3 characters of string

# NOTE:
# series.str.slice(0, 3) and series.str[0:3] are equivalent

print(string_series.str[0:3],'\n')
print(str_series.str[0:3])

0    Pic
1    Sis
2    Rik
3    Dax
4    Jan
5    LaF
dtype: object 

0    Cap
1    Cap
2    Cap
3    Lt.
4    Chi
5    Doc
dtype: object


In [31]:
# get the last word of each element

# NOTE:
# series.str.get(i) and series.str[i] are equivalent ways to index


print(string_series.str.split().str.get(-1),'\n' )
print(str_series.str.split().str.get(-1))

0     Picard
1      Sisko
2      Riker
3        Dax
4    Janeway
5    LaForge
dtype: object 

0     Picard
1      Sisko
2    Janeway
3        Dax
4    LaForge
5    Crusher
dtype: object
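
One detail worth knowing about `str.get()`: when an element is too short for the requested position, pandas returns a missing value instead of raising an error. A minimal sketch, with a made-up two-element Series:

```python
import pandas as pd

mixed = pd.Series(['Captain Picard', 'Dax'])

# The second element has only one word, so position 1 does not exist there
second_word = mixed.str.split().str.get(1)

print(second_word)
print(second_word.isna().tolist())
```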


### Indicator variables
- A = alpha
- B = bravo
- G = gamma
- D = delta

In [32]:
coded_df = pd.DataFrame(
    {'Officer': string_series,
     'Code': ['A-B-D','A-B-G','A-B','G-D-B-A','D-A-B','A-B-D']
    })

coded_df

|   | Officer | Code |
|---|---------|------|
| 0 | Picard | A-B-D |
| 1 | Sisko | A-B-G |
| 2 | Riker | A-B |
| 3 | Dax | G-D-B-A |
| 4 | Janeway | D-A-B |
| 5 | LaForge | A-B-D |


In [37]:
coded_df = pd.DataFrame(
    {'Officer': str_series,
     'Code': ['A-B-D','A-B-G','A-B','G-D-B-A','D-A-B','A-B-D']
    })

coded_df

|   | Officer | Code |
|---|---------|------|
| 0 | Captain Picard | A-B-D |
| 1 | Captain Sisko | A-B-G |
| 2 | Captain Janeway | A-B |
| 3 | Lt.Cmdr Dax | G-D-B-A |
| 4 | Chief Engineer LaForge | D-A-B |
| 5 | Doctor Crusher | A-B-D |


In [41]:
# get_dummies() turns a categorical value into indicator columns:
# each distinct code becomes a column holding 1 if the row
# contains that code and 0 otherwise
# the columns can then be aggregated with functions like sum()

dum = coded_df['Code'].str.get_dummies('-')
dum

|   | A | B | D | G |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 1 |
| 2 | 1 | 1 | 0 | 0 |
| 3 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 0 |
| 5 | 1 | 1 | 1 | 0 |
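
Since the indicator columns hold 0 and 1, summing them gives counts directly. The sketch below rebuilds the same codes so it runs on its own; `per_code` and `per_officer` are names chosen for this example.

```python
import pandas as pd

codes = pd.Series(['A-B-D', 'A-B-G', 'A-B', 'G-D-B-A', 'D-A-B', 'A-B-D'])
dummies = codes.str.get_dummies('-')

# Column sums: how many officers carry each code
per_code = dummies.sum()

# Row sums: how many codes each officer carries
per_officer = dummies.sum(axis=1)

print(per_code)
print(per_officer.tolist())
```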


### More on RegEx functions

In [68]:
import re # import the RegEx module

# the text of the message
txt = "The rain on Bajor is warm"

# search function for string that starts with 'The'
# and for the word 'Bajor'
x = re.search("^The.*Bajor", txt)


if x:
    print('{} and {} found'.format('{The}', '{Bajor}'))

{The} and {Bajor} found


In [69]:
# Check if the string starts with "The" 
# and ENDS with "Bajor":

txt = "The rain on Bajor is warm"
x = re.search("^The.*Bajor$", txt)

if x:
    print("YES! We have a match!")
else:
    print("No match")

No match


In [72]:
# Check if the string starts with "The" 
# and ENDS with "Bajor":

txt = "The rain on Bajor"  # changed text
x = re.search("^The.*Bajor$", txt)

if x:
    print("Message ends with {}".format('{Bajor}'))
else:
    print("No match found")

Message ends with {Bajor}
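
The examples above only test whether `re.search()` found something. When it does, it returns a match object that also reports what was matched and where. A short sketch (the variable name `m` is arbitrary):

```python
import re

txt = "The rain on Bajor is warm"
m = re.search(r"Bajor", txt)

if m:
    print(m.group())   # the text that matched
    print(m.span())    # (start, end) positions of the match in txt
else:
    print("No match")
```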


## RegEx **metacharacters**

In [73]:
txt = "The rain on Bajor"

# Find all lower case characters alphabetically 
# between "a" and "m": 

x = re.findall("[a-m]", txt)
print(x)

['h', 'e', 'a', 'i', 'a', 'j']


In [74]:
txt = "that will be 3 strips of Latinum"

# Find all digit characters:
x = re.findall(r"\d", txt)
print(x)

['3']


In [80]:
txt = "Klingon Empire"

# Search for a sequence that starts with "Kl", 
# followed by four (any) characters, and an "n":

# NOTE: 
# each dot matches exactly one character, so the number
# of dots must equal the number of characters to skip
# "Kl [i] [n] [g] [o] n"
# "Kl  .   .   .   .  n"

x = re.findall("Kl....n", txt)
print(x)

['Klingon']


In [84]:
txt = 'Bajoran Temple'

# notice the 2 dots in the string
b = re.findall("Ba..r", txt)
print(b)

# notice the 4 dots in the string
x = re.findall("Ba....n", txt)
print(x)

['Bajor']
['Bajoran']


In [142]:
# Check if the string starts with 'wormhole':
txt = "wormhole aliens"

x = re.findall("^wormhole", txt)
if x:
    print("message starts with {}".format(x))
else:
    print("No match found")

message starts with ['wormhole']


In [141]:
#Check if the string ends with 'prisons':
txt = "the shut down of Cardassian prisons"

x = re.findall("prisons$", txt)
if x:
    print("message ends with word {}".format(x))
else:
    print("No match found")

message ends with word ['prisons']


In [97]:
# Check if the string contains "la" 
# followed by 0 or more "x" characters:

txt = "hard to relax on Talax, \
    everything with flax to the max"

x = re.findall("lax*", txt)

#print(x)

if x:
    print("found {} matches: {}".format(len(x),x))
else:
    print("No match found")

found 3 matches: ['lax', 'lax', 'lax']


In [100]:
# Check if the string contains "la" 
# followed by 1 or more "x" characters:

x = re.findall("lax+", txt)

if x:
    print("found {} matches: {}".format(len(x),x))
else:
    print("No match found")

found 3 matches: ['lax', 'lax', 'lax']
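
Both patterns give the same result on that text because every "la" in it happens to be followed by an "x". The difference between `*` and `+` shows up on a string where that is not the case. A small sketch with a made-up sample string:

```python
import re

sample = "la lax laxx"

# * allows zero "x" characters, so the bare "la" matches too
star = re.findall(r"lax*", sample)

# + requires at least one "x", so the bare "la" is skipped
plus = re.findall(r"lax+", sample)

print(star)
print(plus)
```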


In [135]:
# Check if the string contains "a" 
# followed by exactly two "l" characters:
txt = "all Bajorans want all Cardassian of Bajor"

x = re.findall("al{2}", txt)

if x:
    print("message has {} results for {}".format(len(x), x))
else:
    print("No match found")

message has 2 results for ['all', 'all']


In [139]:
# Check if the string contains 
# either "ketracel" or "white":

txt = "Alpha quadrant Jem'Hadar need more ketracel white"

x = re.findall("ketracel|white", txt)

# spaces around the '|' are part of the pattern:
# this one matches 'ketracel ' or ' white'
x1 = re.findall("ketracel | white", txt)
print(x1)

if x:
    print("message has {} results for {}".format(len(x), x))
else:
    print("No match found")

['ketracel ']
message has 2 results for ['ketracel', 'white']


### *Special*  **Sequences**

In [143]:
# Check if the string starts with "The":

txt = "The Cardassians enjoy the heat"

x = re.findall(r"\AThe", txt)

if x:
    print("message has {} results for {}".format(len(x), x))
else:
    print("No match found")

message has 1 results for ['The']


The ``r`` prefix in front of a pattern marks it as a "raw string", so backslash sequences such as ``\b`` reach the regex engine unchanged instead of being interpreted by Python first.
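
To see why the raw-string prefix matters, compare what Python does with `"\b"` with and without it. This is a small sketch; the variable names `raw` and `not_raw` are chosen for illustration.

```python
import re

# Without the r prefix, Python turns \b into a single backspace character
print(len("\b"))    # 1
print(len(r"\b"))   # 2: a backslash followed by the letter b

txt = "The Cardassians enjoy the heat"

# Raw string: \b reaches the regex engine and means "word boundary"
raw = re.findall(r"\bthe\b", txt)

# Normal string: the pattern looks for literal backspace characters
not_raw = re.findall("\bthe\b", txt)

print(raw)
print(not_raw)
```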

In [160]:
# Check if "The" is present at the beginning of a word
# r"\b" + str

txt = "The Cardassians enjoy the heat"

# NOTE: \b (word boundary) and \B (NOT a word boundary) are opposites! 
x = re.findall(r"\bThe", txt)
x1 = re.findall(r"\BThe", txt)
print(x1)

if x:
    print("message has {} results for {}".format(len(x), x))
else:
    print("No match found")

[]
message has 1 results for ['The']


In [161]:
# Check if "girl" is present at the end of a word 
# r"str\b" anchors the end of a word
# r"\bstr" anchors the beginning of a word

txt = "watch the dabo table, not the girl"

x = re.findall(r"girl\b", txt)
x1 = re.findall(r"girl\B", txt)
print(x1)

if x:
    print("message has {} results for {}".format(len(x), x))
else:
    print("No match found")

[]
message has 1 results for ['girl']


In [179]:
# Check if "abo" is present, 
# but NOT at the beginning of a word:

txt = "watch the dabo table, not dabo the girl"

x = re.findall(r"\Babo", txt)
print(x)

# Check if "abo" is present, 
# but NOT at the end of a word:
x2 = re.findall(r"abo\B", txt)
print(x2)

['abo', 'abo']
[]


In [182]:
# Check if the string contains any digits (numbers from 0-9):
txt = "watch the dabo table, not dabo the girl"
x = re.findall(r"\d", txt)
print(x)

# Return a match at every non-digit character:
# the result is a list of the individual characters
x2 = re.findall(r"\D", txt)
print(x2)

[]
['w', 'a', 't', 'c', 'h', ' ', 't', 'h', 'e', ' ', 'd', 'a', 'b', 'o', ' ', 't', 'a', 'b', 'l', 'e', ',', ' ', 'n', 'o', 't', ' ', 'd', 'a', 'b', 'o', ' ', 't', 'h', 'e', ' ', 'g', 'i', 'r', 'l']


In [186]:
# Return a match at every white-space character:
x = re.findall(r"\s", txt)
print('white space: ',x)

# Return a match at every NON white-space character:
x2 = re.findall(r"\S", txt)
print('non white space ',x2)

white space:  [' ', ' ', ' ', ' ', ' ', ' ', ' ']
non white space  ['w', 'a', 't', 'c', 'h', 't', 'h', 'e', 'd', 'a', 'b', 'o', 't', 'a', 'b', 'l', 'e', ',', 'n', 'o', 't', 'd', 'a', 'b', 'o', 't', 'h', 'e', 'g', 'i', 'r', 'l']


In [191]:
# Return a match at every word character 
# (letters a-z and A-Z, digits 0-9, 
# and the underscore _ character):

txt = "There are 45 Trills on this DS9 station! Did you know?"
x = re.findall(r"\w", txt)
print(x) 

['T', 'h', 'e', 'r', 'e', 'a', 'r', 'e', '4', '5', 'T', 'r', 'i', 'l', 'l', 's', 'o', 'n', 't', 'h', 'i', 's', 'D', 'S', '9', 's', 't', 'a', 't', 'i', 'o', 'n', 'D', 'i', 'd', 'y', 'o', 'u', 'k', 'n', 'o', 'w']


In [192]:
# Return a match at every NON word character 
# (anything that is not a letter, digit or underscore, 
# such as "!", "?" and white-space):

x = re.findall(r"\W", txt)
print(x)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', '!', ' ', ' ', ' ', '?']


In [195]:
# Check if the string ends with "rains":
txt = "on the planet Troyius it hardly rains"
x = re.findall(r"rains\Z", txt)
x

['rains']
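
`\Z` and `$` usually behave the same, but they differ when the text ends with a newline: `$` also matches just before a trailing newline, while `\Z` matches only at the very end of the string. A minimal sketch with a sample string invented for this comparison:

```python
import re

txt = "it hardly rains\n"   # note the trailing newline

# $ matches at the end of the string or just before a final newline
dollar = re.findall(r"rains$", txt)

# \Z matches only at the very end, which here comes after the newline
strict = re.findall(r"rains\Z", txt)

print(dollar)
print(strict)
```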

### Regular Expression **sets**

In [208]:
txt = "This encrypted message from Cardassia is 34 KB big.\
only person on the DS9 station who can crack it is Garak"

# Check if the string has any characters from a to z 
# lower case, and A to Z upper case:
x0 = re.findall("[a-zA-Z]", txt)
print("[a-zA-Z] :{}".format(x0))

# Check if the string has any a, r, or n characters:
x = re.findall("[arn]", txt)
print("\n [arn] :{}".format(x))

# Check if the string has any characters between a and n:
x2 = re.findall("[a-n]", txt)
print("\n [a-n] :{}".format(x2))

# Check if the string has other characters than a, r, or n:
x3 = re.findall("[^arn]", txt)
print("\n [^arn] :{}".format(x3))

# Check if the string has any digits from 0 to 4:
x4 = re.findall("[0-4]", txt)
print("\n [0-4] :{}".format(x4))

# Check if the string has any two-digit numbers, from 00 to 59:
x5 = re.findall("[0-5][0-9]", txt)
print("\n [0-5][0-9] :{}".format(x5))

[a-zA-Z] :['T', 'h', 'i', 's', 'e', 'n', 'c', 'r', 'y', 'p', 't', 'e', 'd', 'm', 'e', 's', 's', 'a', 'g', 'e', 'f', 'r', 'o', 'm', 'C', 'a', 'r', 'd', 'a', 's', 's', 'i', 'a', 'i', 's', 'K', 'B', 'b', 'i', 'g', 'o', 'n', 'l', 'y', 'p', 'e', 'r', 's', 'o', 'n', 'o', 'n', 't', 'h', 'e', 'D', 'S', 's', 't', 'a', 't', 'i', 'o', 'n', 'w', 'h', 'o', 'c', 'a', 'n', 'c', 'r', 'a', 'c', 'k', 'i', 't', 'i', 's', 'G', 'a', 'r', 'a', 'k']

 [arn] :['n', 'r', 'a', 'r', 'a', 'r', 'a', 'a', 'n', 'r', 'n', 'n', 'a', 'n', 'a', 'n', 'r', 'a', 'a', 'r', 'a']

 [a-n] :['h', 'i', 'e', 'n', 'c', 'e', 'd', 'm', 'e', 'a', 'g', 'e', 'f', 'm', 'a', 'd', 'a', 'i', 'a', 'i', 'b', 'i', 'g', 'n', 'l', 'e', 'n', 'n', 'h', 'e', 'a', 'i', 'n', 'h', 'c', 'a', 'n', 'c', 'a', 'c', 'k', 'i', 'i', 'a', 'a', 'k']

 [^arn] :['T', 'h', 'i', 's', ' ', 'e', 'c', 'y', 'p', 't', 'e', 'd', ' ', 'm', 'e', 's', 's', 'g', 'e', ' ', 'f', 'o', 'm', ' ', 'C', 'd', 's', 's', 'i', ' ', 'i', 's', ' ', '3', '4', ' ', 'K', 'B', ' ', 'b', 'i'
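
On their own, sets match one character at a time, which is why the results above are long lists of single letters. Adding a `+` quantifier after a set groups consecutive matches into whole words or numbers. A sketch on a shortened version of the message (the names `words` and `numbers` are chosen here):

```python
import re

txt = "This encrypted message from Cardassia is 34 KB big."

# One or more letters in a row: whole words instead of single characters
words = re.findall(r"[a-zA-Z]+", txt)

# One or more digits in a row: whole numbers
numbers = re.findall(r"[0-9]+", txt)

print(words)
print(numbers)
```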

### the **sub()** function

In [212]:
# Replace every white-space character with "44",
# then with the marker "<&>":

txt = "The Promenade   is busy today "
x = re.sub(r"\s", "44", txt)

x2 = re.sub(r"\s", "<&>", txt)
print(x2)

The<&>Promenade<&><&><&>is<&>busy<&>today<&>
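
A common practical use of `re.sub()` is cleaning up irregular spacing. The sketch below collapses runs of whitespace into a single space, and also shows the optional `count` argument, which limits how many replacements are made. Variable names are illustrative.

```python
import re

txt = "The Promenade   is busy today "

# \s+ matches a whole run of whitespace, so each run becomes one space;
# strip() then removes the leftover space at the end
collapsed = re.sub(r"\s+", " ", txt).strip()

# count=1 replaces only the first match
first_only = re.sub(r"\s", "_", txt, count=1)

print(collapsed)
print(first_only)
```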


You have learned how to work with strings in DataFrames and how regular expressions work on text data. This concludes the notebook.