* Textual data is everywhere, and there is a lot of it.
* It usually requires cleaning and preprocessing.
* It often has to be converted from text to numbers before analysis.

In [1]:
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["jane", "james", "matt", "emily", "john"],
        "Address": ["Houston,TX", "Dallas,TX", "Chicago,IL", "Phoenix,AZ", "San Diego,CA"],
        "Salary": ["64K-72K", "62K-70K", "69K-76K", "62K-72K", "71K-78K"],
        "Group": ["A-1-B-F", "B-2-B-F", "A-2-B-F", "B-1-C-D", "A-1-1-D"],
        "Category": ["  1x", " 1y", "2x  ", "1x", "1y  "],
    }
)

df

Unnamed: 0,Name,Address,Salary,Group,Category
0,jane,"Houston,TX",64K-72K,A-1-B-F,1x
1,james,"Dallas,TX",62K-70K,B-2-B-F,1y
2,matt,"Chicago,IL",69K-76K,A-2-B-F,2x
3,emily,"Phoenix,AZ",62K-72K,B-1-C-D,1x
4,john,"San Diego,CA",71K-78K,A-1-1-D,1y


## 1. Data type - 1

* Before pandas 1.0, “object” was the only data type used to store strings.
* This has drawbacks, because the “object” data type can also hold non-string data.
* Pandas 1.0 introduced StringDtype, a data type dedicated to string data.
* For now, strings can be stored as either object or StringDtype, but future versions may require StringDtype.
* Object is still the default data type for strings.

In [2]:
df.dtypes

Name        object
Address     object
Salary      object
Group       object
Category    object
dtype: object
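
The drawback of the object data type can be seen in a small sketch. The Series names (`mixed`, `upper`, `converted`) and the sample values are made up for illustration, and the exact missing-value marker may differ between pandas versions.

```python
import pandas as pd

# An object column accepts any Python object, not only strings.
mixed = pd.Series(["jane", 42, "matt"])
print(mixed.dtype)            # object

# String methods skip the non-string entry and return a missing value for it.
upper = mixed.str.upper()
print(upper.isna().tolist())  # [False, True, False]

# Converting to the string dtype turns every non-missing entry into text.
converted = mixed.astype("string")
print(converted.tolist())     # ['jane', '42', 'matt']
```

With the string data type, the column is guaranteed to hold only text (or missing values), so the silent gap in `upper` cannot occur.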

## 2. Data type - 2

* Because object is the default, StringDtype has to be requested explicitly.
* Pass either “string” or pd.StringDtype() as the dtype argument to select the string data type.

In [3]:
df["Address"] = df["Address"].astype(pd.StringDtype())

df.dtypes

Name                object
Address     string[python]
Salary              object
Group               object
Category            object
dtype: object

In [4]:
df["Salary"] = df["Salary"].astype("string")

df.dtypes

Name                object
Address     string[python]
Salary      string[python]
Group               object
Category            object
dtype: object

## 3. Data type - 3

In [5]:
df = pd.DataFrame(
    {
        "Name": ["jane", "james", "matt", "emily", "john"],
        "Address": ["Houston,TX", "Dallas,TX", "Chicago,IL", "Phoenix,AZ", "San Diego,CA"],
        "Salary": ["64K-72K", "62K-70K", "69K-76K", "62K-72K", "71K-78K"],
        "Group": ["A-1-B-F", "B-2-B-F", "A-2-B-F", "B-1-C-D", "A-1-1-D"],
        "Category": ["  1x", " 1y", "2x  ", "1x", "1y  "],
    },
    dtype="string",
)

df

Unnamed: 0,Name,Address,Salary,Group,Category
0,jane,"Houston,TX",64K-72K,A-1-B-F,1x
1,james,"Dallas,TX",62K-70K,B-2-B-F,1y
2,matt,"Chicago,IL",69K-76K,A-2-B-F,2x
3,emily,"Phoenix,AZ",62K-72K,B-1-C-D,1x
4,john,"San Diego,CA",71K-78K,A-1-1-D,1y


In [6]:
df.dtypes

Name        string[python]
Address     string[python]
Salary      string[python]
Group       string[python]
Category    string[python]
dtype: object

## 4. Uppercase and lowercase

* String methods are available via the str accessor

In [7]:
df["Name"] = df["Name"].str.upper()

df

Unnamed: 0,Name,Address,Salary,Group,Category
0,JANE,"Houston,TX",64K-72K,A-1-B-F,1x
1,JAMES,"Dallas,TX",62K-70K,B-2-B-F,1y
2,MATT,"Chicago,IL",69K-76K,A-2-B-F,2x
3,EMILY,"Phoenix,AZ",62K-72K,B-1-C-D,1x
4,JOHN,"San Diego,CA",71K-78K,A-1-1-D,1y


In [8]:
df["Name"] = df["Name"].str.lower()

df

Unnamed: 0,Name,Address,Salary,Group,Category
0,jane,"Houston,TX",64K-72K,A-1-B-F,1x
1,james,"Dallas,TX",62K-70K,B-2-B-F,1y
2,matt,"Chicago,IL",69K-76K,A-2-B-F,2x
3,emily,"Phoenix,AZ",62K-72K,B-1-C-D,1x
4,john,"San Diego,CA",71K-78K,A-1-1-D,1y
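
A common reason to change case is to make differently spelled values comparable. The sketch below uses a hypothetical `cities` Series (not part of the DataFrame above) to show the effect:

```python
import pandas as pd

# Hypothetical column with inconsistent casing.
cities = pd.Series(["Houston", "HOUSTON", "houston", "Dallas"], dtype="string")

# Without normalization every spelling counts as a separate value.
print(cities.nunique())              # 4

# After lowercasing, only the two real cities remain.
print(cities.str.lower().nunique())  # 2
```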


## 5. Capitalize

In [9]:
df["Name"] = df["Name"].str.capitalize()

df

Unnamed: 0,Name,Address,Salary,Group,Category
0,Jane,"Houston,TX",64K-72K,A-1-B-F,1x
1,James,"Dallas,TX",62K-70K,B-2-B-F,1y
2,Matt,"Chicago,IL",69K-76K,A-2-B-F,2x
3,Emily,"Phoenix,AZ",62K-72K,B-1-C-D,1x
4,John,"San Diego,CA",71K-78K,A-1-1-D,1y
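
Note that capitalize only uppercases the first character of the whole string and lowercases the rest. For multi-word values, `str.title` (not used elsewhere in this notebook) may be closer to what is wanted. A small sketch with made-up names:

```python
import pandas as pd

# Hypothetical full names; the DataFrame above only holds first names.
names = pd.Series(["jane doe", "JAMES SMITH"], dtype="string")

# capitalize: first character upper, everything else lower.
print(names.str.capitalize().tolist())  # ['Jane doe', 'James smith']

# title: first character of every word upper.
print(names.str.title().tolist())       # ['Jane Doe', 'James Smith']
```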


## 6. Split

* A string may carry more than one piece of information and we may need to use them separately.

In [10]:
df["Address"].str.split(",")

0      [Houston, TX]
1       [Dallas, TX]
2      [Chicago, IL]
3      [Phoenix, AZ]
4    [San Diego, CA]
Name: Address, dtype: object

## 7. Extract after split

In [11]:
df["Address"].str.split(",").str[0]

0      Houston
1       Dallas
2      Chicago
3      Phoenix
4    San Diego
Name: Address, dtype: object

* The subscript (e.g. [0]) must be applied through the str accessor. Without it, [0] selects the first row of the Series, which is the whole list:

In [12]:
df["Address"].str.split(",")[0]

['Houston', 'TX']

## 8. Create column after split

In [13]:
df["State"] = df["Address"].str.split(",").str[1]

df

Unnamed: 0,Name,Address,Salary,Group,Category,State
0,Jane,"Houston,TX",64K-72K,A-1-B-F,1x,TX
1,James,"Dallas,TX",62K-70K,B-2-B-F,1y,TX
2,Matt,"Chicago,IL",69K-76K,A-2-B-F,2x,IL
3,Emily,"Phoenix,AZ",62K-72K,B-1-C-D,1x,AZ
4,John,"San Diego,CA",71K-78K,A-1-1-D,1y,CA


## 9. Split - expand parameter

* With expand=True, split returns a DataFrame with one column per piece instead of a Series of lists.

In [14]:
df["Address"].str.split(",", expand=True)

Unnamed: 0,0,1
0,Houston,TX
1,Dallas,TX
2,Chicago,IL
3,Phoenix,AZ
4,San Diego,CA


In [15]:
df[["City", "State"]] = df["Address"].str.split(",", expand=True)

df

Unnamed: 0,Name,Address,Salary,Group,Category,State,City
0,Jane,"Houston,TX",64K-72K,A-1-B-F,1x,TX,Houston
1,James,"Dallas,TX",62K-70K,B-2-B-F,1y,TX,Dallas
2,Matt,"Chicago,IL",69K-76K,A-2-B-F,2x,IL,Chicago
3,Emily,"Phoenix,AZ",62K-72K,B-1-C-D,1x,AZ,Phoenix
4,John,"San Diego,CA",71K-78K,A-1-1-D,1y,CA,San Diego
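
When rows contain different numbers of pieces, expand=True pads the shorter rows with missing values. A minimal sketch; the `addresses` Series is hypothetical data, chosen so that one row has no state:

```python
import pandas as pd

# Second row has no comma, so it yields only one piece.
addresses = pd.Series(["Houston,TX", "Chicago"], dtype="string")

parts = addresses.str.split(",", expand=True)

# Two columns are still created; the missing piece is filled with a missing value.
print(parts.shape)               # (2, 2)
print(parts[1].isna().tolist())  # [False, True]
```

Checking for missing values after such a split is a cheap way to spot malformed rows.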


## 10. Split - n parameter

* The n parameter limits the number of splits. Whatever remains after the n-th split stays together in the last piece.

In [16]:
df["Group"].str.split("-", expand=True)

Unnamed: 0,0,1,2,3
0,A,1,B,F
1,B,2,B,F
2,A,2,B,F
3,B,1,C,D
4,A,1,1,D


In [17]:
df["Group"].str.split("-", expand=True, n=2)

Unnamed: 0,0,1,2
0,A,1,B-F
1,B,2,B-F
2,A,2,B-F
3,B,1,C-D
4,A,1,1-D


In [18]:
df["Group"].str.split("-", expand=True, n=1)

Unnamed: 0,0,1
0,A,1-B-F
1,B,2-B-F
2,A,2-B-F
3,B,1-C-D
4,A,1-1-D
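
The n parameter behaves like the maxsplit argument of Python's built-in `str.split`. The sketch below compares the two; the `groups` Series is illustrative and includes one shorter code to show that n is an upper limit, not a required count:

```python
import pandas as pd

code = "A-1-B-F"

# Plain Python: at most 2 splits, so at most 3 pieces.
print(code.split("-", 2))        # ['A', '1', 'B-F']

# pandas: the same limit applied to every row.
groups = pd.Series([code, "B-2"], dtype="string")
pieces = groups.str.split("-", n=2, expand=True)

print(pieces.iloc[0].tolist())   # ['A', '1', 'B-F']
print(pieces.shape)              # (2, 3)
```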


## 11. Split from right

* By default, splitting is done from the left. To do splitting from the right, use rsplit.

In [19]:
df["Group"].str.split("-", expand=True, n=2)

Unnamed: 0,0,1,2
0,A,1,B-F
1,B,2,B-F
2,A,2,B-F
3,B,1,C-D
4,A,1,1-D


In [20]:
df["Group"].str.rsplit("-", expand=True, n=2)

Unnamed: 0,0,1,2
0,A-1,B,F
1,B-2,B,F
2,A-2,B,F
3,B-1,C,D
4,A-1,1,D
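
Splitting from the right is useful when only the last piece is of interest and the separator may also appear earlier in the string. A sketch with hypothetical file names (the column names `stem` and `extension` are assumptions for illustration):

```python
import pandas as pd

# File names that contain the separator more than once.
files = pd.Series(["report.final.csv", "data.backup.json"], dtype="string")

# One split from the right isolates the extension.
parts = files.str.rsplit(".", n=1, expand=True)
parts.columns = ["stem", "extension"]

print(parts["stem"].tolist())       # ['report.final', 'data.backup']
print(parts["extension"].tolist())  # ['csv', 'json']
```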


## 12. Combining - 1

In [21]:
df

Unnamed: 0,Name,Address,Salary,Group,Category,State,City
0,Jane,"Houston,TX",64K-72K,A-1-B-F,1x,TX,Houston
1,James,"Dallas,TX",62K-70K,B-2-B-F,1y,TX,Dallas
2,Matt,"Chicago,IL",69K-76K,A-2-B-F,2x,IL,Chicago
3,Emily,"Phoenix,AZ",62K-72K,B-1-C-D,1x,AZ,Phoenix
4,John,"San Diego,CA",71K-78K,A-1-1-D,1y,CA,San Diego


In [22]:
df["City"].str.cat(df["State"], sep=",")

0      Houston,TX
1       Dallas,TX
2      Chicago,IL
3      Phoenix,AZ
4    San Diego,CA
Name: City, dtype: string

## 13. Combining - 2

In [23]:
df["City"] + "," + df["State"]

0      Houston,TX
1       Dallas,TX
2      Chicago,IL
3      Phoenix,AZ
4    San Diego,CA
dtype: string
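
The two approaches give the same result here, but they differ in how missing values can be handled. With the + operator a missing piece makes the whole result missing; str.cat behaves the same by default but accepts an na_rep placeholder. A sketch with made-up data:

```python
import pandas as pd

city = pd.Series(["Houston", "Dallas"], dtype="string")
state = pd.Series(["TX", pd.NA], dtype="string")

# The + operator propagates the missing value.
plus = city + "," + state
print(plus.isna().tolist())   # [False, True]

# str.cat can substitute a placeholder for missing pieces.
filled = city.str.cat(state, sep=",", na_rep="??")
print(filled.tolist())        # ['Houston,TX', 'Dallas,??']
```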

## 14. Indexing

In [24]:
df

Unnamed: 0,Name,Address,Salary,Group,Category,State,City
0,Jane,"Houston,TX",64K-72K,A-1-B-F,1x,TX,Houston
1,James,"Dallas,TX",62K-70K,B-2-B-F,1y,TX,Dallas
2,Matt,"Chicago,IL",69K-76K,A-2-B-F,2x,IL,Chicago
3,Emily,"Phoenix,AZ",62K-72K,B-1-C-D,1x,AZ,Phoenix
4,John,"San Diego,CA",71K-78K,A-1-1-D,1y,CA,San Diego


In [25]:
df["Salary"].str[:3]

0    64K
1    62K
2    69K
3    62K
4    71K
Name: Salary, dtype: string

## 15. Indexing structure

<img src="Assets/string_1.png" class="juno_ui_theme_light" style="width:500px">

* Default start value is the index of the first character (0).
* Default end value is the end of the string, so the last character is included.
* Default step size is 1.

In [26]:
df["Name"]

0     Jane
1    James
2     Matt
3    Emily
4     John
Name: Name, dtype: string

In [27]:
df["Name"].str[0:3] # index starts from 0 and upper bound is exclusive

0    Jan
1    Jam
2    Mat
3    Emi
4    Joh
Name: Name, dtype: string

In [28]:
df["Name"].str[1:3]

0    an
1    am
2    at
3    mi
4    oh
Name: Name, dtype: string
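
The bracket syntax follows the same rules as slicing a plain Python string, and `str.slice` is an equivalent spelled-out form. A short sketch on a two-row Series built for illustration:

```python
import pandas as pd

names = pd.Series(["Jane", "James"], dtype="string")

# Bracket form and str.slice give the same result.
print(names.str[1:3].tolist())         # ['an', 'am']
print(names.str.slice(1, 3).tolist())  # ['an', 'am']

# Negative indices count from the end of each string.
print(names.str[-2:].tolist())         # ['ne', 'es']

# A start beyond the end of the string yields an empty string, not an error.
print(names.str[10:].tolist())         # ['', '']
```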

## 16. Step size - 1

* Default step size is 1. A step size of 2 takes every second character.

In [29]:
df["Name"]

0     Jane
1    James
2     Matt
3    Emily
4     John
Name: Name, dtype: string

In [30]:
df["Name"].str[::2]

0     Jn
1    Jms
2     Mt
3    Eiy
4     Jh
Name: Name, dtype: string

## 17. Step size - 2

* A step size of -1 walks through the string backwards, which reverses it.

In [31]:
df["Name"].str[::-1]

0     enaJ
1    semaJ
2     ttaM
3    ylimE
4     nhoJ
Name: Name, dtype: string
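
Reversing with a step of -1 can be combined with a comparison, for example to flag palindromes. This is only an illustrative sketch; the `words` Series is made up:

```python
import pandas as pd

words = pd.Series(["level", "pandas", "noon"], dtype="string")

# A word is a palindrome if it equals its own reverse.
is_palindrome = words == words.str[::-1]

print(is_palindrome.tolist())  # [True, False, True]
```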

## 18. Stripping 

* strip removes whitespace, or any other given characters, from the beginning and end of each string.

In [32]:
df["Category"]

0      1x
1      1y
2    2x  
3      1x
4    1y  
Name: Category, dtype: string

In [33]:
df["Category"].str.strip()

0    1x
1    1y
2    2x
3    1x
4    1y
Name: Category, dtype: string

## 19. Left strip

In [34]:
df["Category"].str.lstrip()

0      1x
1      1y
2    2x  
3      1x
4    1y  
Name: Category, dtype: string

## 20. Right strip

In [35]:
df["Salary"].str.rstrip("K")

0    64K-72
1    62K-70
2    69K-76
3    62K-72
4    71K-78
Name: Salary, dtype: string
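
The argument passed to strip, lstrip, or rstrip is treated as a set of characters to remove, not as a fixed prefix or suffix. A sketch with hypothetical price strings:

```python
import pandas as pd

prices = pd.Series(["$100 ", " $250", "$$75"], dtype="string")

# Any mix of "$" and " " is removed from both ends.
print(prices.str.strip("$ ").tolist())   # ['100', '250', '75']

# rstrip only touches the right end, so the inner "K" survives.
salary = pd.Series(["64K-72K"], dtype="string")
print(salary.str.rstrip("K").tolist())   # ['64K-72']
```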

## 21. Replacing - 1

* Replace a character or a sequence of characters in strings

In [36]:
df["Salary"].str.replace("K", "")

0    64-72
1    62-70
2    69-76
3    62-72
4    71-78
Name: Salary, dtype: string
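
Removing the "K" is typically a first step toward turning the text into numbers. The sketch below continues one possible way; the column names `low`, `high`, and `mid` are assumptions for illustration, and "K" is assumed to mean thousands:

```python
import pandas as pd

salary = pd.Series(["64K-72K", "62K-70K"])

# Drop the unit, then split the range into its two bounds.
pieces = salary.str.replace("K", "", regex=False).str.split("-", expand=True)

# Convert the text pieces to integers and scale to full amounts.
bounds = pieces.astype(int) * 1000
bounds.columns = ["low", "high"]

# Midpoint of each range.
bounds["mid"] = (bounds["low"] + bounds["high"]) / 2

print(bounds["low"].tolist())  # [64000, 62000]
print(bounds["mid"].tolist())  # [68000.0, 66000.0]
```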

## 22. Replacing - 2

In [37]:
df["Profession"] = ["dr", "doctor", "Dr", "engineer", "nurse"]

df

Unnamed: 0,Name,Address,Salary,Group,Category,State,City,Profession
0,Jane,"Houston,TX",64K-72K,A-1-B-F,1x,TX,Houston,dr
1,James,"Dallas,TX",62K-70K,B-2-B-F,1y,TX,Dallas,doctor
2,Matt,"Chicago,IL",69K-76K,A-2-B-F,2x,IL,Chicago,Dr
3,Emily,"Phoenix,AZ",62K-72K,B-1-C-D,1x,AZ,Phoenix,engineer
4,John,"San Diego,CA",71K-78K,A-1-1-D,1y,CA,San Diego,nurse


In [38]:
df["Profession"].str.replace(pat="dr", repl="doctor")

0      doctor
1      doctor
2          Dr
3    engineer
4       nurse
Name: Profession, dtype: object

## 23. Replacing - 3

In [39]:
df["Profession"].str.replace(pat="dr", repl="doctor", case=False)

0      doctor
1      doctor
2      doctor
3    engineer
4       nurse
Name: Profession, dtype: object

In [40]:
df["Profession"] = df["Profession"].str.replace(pat="dr", repl="doctor", case=False)

df

Unnamed: 0,Name,Address,Salary,Group,Category,State,City,Profession
0,Jane,"Houston,TX",64K-72K,A-1-B-F,1x,TX,Houston,doctor
1,James,"Dallas,TX",62K-70K,B-2-B-F,1y,TX,Dallas,doctor
2,Matt,"Chicago,IL",69K-76K,A-2-B-F,2x,IL,Chicago,doctor
3,Emily,"Phoenix,AZ",62K-72K,B-1-C-D,1x,AZ,Phoenix,engineer
4,John,"San Diego,CA",71K-78K,A-1-1-D,1y,CA,San Diego,nurse
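
Because str.replace works on substrings, a short pattern such as "dr" can also match inside longer words. One way to guard against that is a regular expression anchored to the whole value. A sketch with a made-up `jobs` Series:

```python
import pandas as pd

jobs = pd.Series(["dr", "Dr", "drummer"])

# Substring match: "drummer" is changed as well.
loose = jobs.str.replace("dr", "doctor", case=False, regex=True)
print(loose.tolist())   # ['doctor', 'doctor', 'doctorummer']

# Anchored pattern: only values that are exactly "dr" (any case) are replaced.
strict = jobs.str.replace(r"^dr$", "doctor", case=False, regex=True)
print(strict.tolist())  # ['doctor', 'doctor', 'drummer']
```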


## 24. Pandas replace - 1

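* Series.replace is the general-purpose replace function of pandas. Unlike str.replace, it matches whole values by default, not substrings.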

In [41]:
df["Profession"].replace(to_replace="doctor", value="doc")

0         doc
1         doc
2         doc
3    engineer
4       nurse
Name: Profession, dtype: object

## 25. Pandas replace - 2

In [42]:
df["Profession"].replace(
    to_replace={"doctor": "doc", "engineer": "eng"}
)

0      doc
1      doc
2      doc
3      eng
4    nurse
Name: Profession, dtype: object
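
The difference between the two replace functions shows up as soon as the search text is only part of a value. A minimal sketch on illustrative data:

```python
import pandas as pd

jobs = pd.Series(["doctor", "doctoral student"])

# Series.replace: only values equal to "doctor" are replaced.
whole = jobs.replace(to_replace="doctor", value="doc")
print(whole.tolist())  # ['doc', 'doctoral student']

# str.replace: every occurrence of the substring is replaced.
sub = jobs.str.replace("doctor", "doc", regex=False)
print(sub.tolist())    # ['doc', 'docal student']
```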

## 26. Filtering - startswith - 1

In [43]:
df

Unnamed: 0,Name,Address,Salary,Group,Category,State,City,Profession
0,Jane,"Houston,TX",64K-72K,A-1-B-F,1x,TX,Houston,doctor
1,James,"Dallas,TX",62K-70K,B-2-B-F,1y,TX,Dallas,doctor
2,Matt,"Chicago,IL",69K-76K,A-2-B-F,2x,IL,Chicago,doctor
3,Emily,"Phoenix,AZ",62K-72K,B-1-C-D,1x,AZ,Phoenix,engineer
4,John,"San Diego,CA",71K-78K,A-1-1-D,1y,CA,San Diego,nurse


In [44]:
df["Name"].str.startswith("J")

0     True
1     True
2    False
3    False
4     True
Name: Name, dtype: boolean

In [45]:
df[df["Name"].str.startswith("J")]

Unnamed: 0,Name,Address,Salary,Group,Category,State,City,Profession
0,Jane,"Houston,TX",64K-72K,A-1-B-F,1x,TX,Houston,doctor
1,James,"Dallas,TX",62K-70K,B-2-B-F,1y,TX,Dallas,doctor
4,John,"San Diego,CA",71K-78K,A-1-1-D,1y,CA,San Diego,nurse


## 27. Filtering - startswith - 2

* startswith is case sensitive, so “Ja” and “ja” give different results.

In [46]:
df[df["Name"].str.startswith("Ja")]

Unnamed: 0,Name,Address,Salary,Group,Category,State,City,Profession
0,Jane,"Houston,TX",64K-72K,A-1-B-F,1x,TX,Houston,doctor
1,James,"Dallas,TX",62K-70K,B-2-B-F,1y,TX,Dallas,doctor


In [47]:
df[df["Name"].str.startswith("ja")]

Unnamed: 0,Name,Address,Salary,Group,Category,State,City,Profession


## 28. Filtering - endswith

In [48]:
df[df["Name"].str.endswith("e")]

Unnamed: 0,Name,Address,Salary,Group,Category,State,City,Profession
0,Jane,"Houston,TX",64K-72K,A-1-B-F,1x,TX,Houston,doctor
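
When a column contains missing values, startswith and endswith return a missing value for those rows, which cannot be used directly as a filter. The na parameter sets the result for such rows. A sketch with a hypothetical `names` Series:

```python
import pandas as pd

names = pd.Series(["Jane", None, "Matt"])

# na=False treats missing names as "no match", giving a clean True/False mask.
mask = names.str.startswith("J", na=False)

print(mask.tolist())         # [True, False, False]
print(names[mask].tolist())  # ['Jane']
```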


## 29. Chaining - 1

In [49]:
df

Unnamed: 0,Name,Address,Salary,Group,Category,State,City,Profession
0,Jane,"Houston,TX",64K-72K,A-1-B-F,1x,TX,Houston,doctor
1,James,"Dallas,TX",62K-70K,B-2-B-F,1y,TX,Dallas,doctor
2,Matt,"Chicago,IL",69K-76K,A-2-B-F,2x,IL,Chicago,doctor
3,Emily,"Phoenix,AZ",62K-72K,B-1-C-D,1x,AZ,Phoenix,engineer
4,John,"San Diego,CA",71K-78K,A-1-1-D,1y,CA,San Diego,nurse


In [50]:
df["Name"].str.lower().str.startswith("j")

0     True
1     True
2    False
3    False
4     True
Name: Name, dtype: boolean

## 30. Chaining - 2

In [51]:
df["Address"].str.split(",", expand=True)[1].str.lower()

0    tx
1    tx
2    il
3    az
4    ca
Name: 1, dtype: string
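
Chaining is convenient for small cleaning pipelines, where each step returns a new Series and the next str call continues from it. A sketch on made-up values:

```python
import pandas as pd

raw = pd.Series(["  San Diego ", "HOUSTON", " new york"], dtype="string")

# Strip whitespace, normalize case, then replace inner spaces.
clean = (
    raw.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(clean.tolist())  # ['san_diego', 'houston', 'new_york']
```

The order matters: stripping first ensures that leading and trailing spaces are not turned into underscores.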