# Working with Text Data

In [1]:
import pandas as pd

## This Module's Dataset
- This module's dataset (`chicago.csv`) is a collection of public sector employees in the city of Chicago.
- Each row includes the employee's name, position, department, and salary.

In [8]:
chi = pd.read_csv("chicago.csv").dropna(how="all")
chi.head()
chi.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Name                    32062 non-null  object
 1   Position Title          32062 non-null  object
 2   Department              32062 non-null  object
 3   Employee Annual Salary  32062 non-null  object
dtypes: object(4)
memory usage: 1.2+ MB


In [9]:
chi.nunique()

Name                      31776
Position Title             1093
Department                   35
Employee Annual Salary     1156
dtype: int64

In [11]:
chi["Department"] = chi["Department"].astype("category")

In [13]:
chi = pd.read_csv("chicago.csv").dropna(how="all")
chi["Department"] = chi["Department"].astype("category")
chi.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


## Common String Methods
- A **Series** has a special `str` attribute that exposes an object with string methods.
- Access the `str` attribute, then invoke the string method on the nested object.
- Most method names match their Python string method equivalents (`upper`, `lower`, `title`, etc.).
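As a minimal sketch of the points above, the following uses a small hand-made **Series** rather than `chicago.csv`; the values and the names `titles`, `cleaned`, `salaries`, and `as_float` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical values, loosely modeled on the "Position Title" column
titles = pd.Series(["  WATER RATE TAKER", "police officer ", "Civil Engineer IV"])

# Each .str call returns a new Series, so the calls can be chained
cleaned = titles.str.strip().str.title()
print(cleaned.tolist())

# str.replace swaps one substring for another;
# regex=False makes "$" a literal character instead of a regex anchor
salaries = pd.Series(["$90744.00", "$84450.00"])
as_float = salaries.str.replace("$", "", regex=False).astype(float)
print(as_float.tolist())
```

Note that `title` lowercases everything after the first letter of each word, which is why a Roman numeral such as `IV` comes back as `Iv`.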

In [31]:
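# Normalize capitalization and trim stray whitespace in the text columns
chi["Position Title"] = chi["Position Title"].str.title().str.strip()
chi["Name"] = chi["Name"].str.title().str.strip()
chi["Department"] = chi["Department"].str.title().str.strip().str.replace("Mgmnt", "Management")
# Remove the literal dollar sign (regex=False keeps "$" from being read as a regex anchor)
chi["Employee Annual Salary"] = chi["Employee Annual Salary"].str.replace("$", "", regex=False).str.strip()
# astype returns a new Series, so assign the result back to keep the conversion
chi["Employee Annual Salary"] = chi["Employee Annual Salary"].astype("float")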
chi.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"Aaron, Elvia J",Water Rate Taker,Water Management,90744.0
1,"Aaron, Jeffery M",Police Officer,Police,84450.0
2,"Aaron, Karina",Police Officer,Police,84450.0
3,"Aaron, Kimberlei R",Chief Contract Expediter,General Services,89880.0
4,"Abad Jr, Vicente M",Civil Engineer Iv,Water Management,106836.0


In [35]:
water = chi["Position Title"].str.lower().str.contains("water")
chi[water]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"Aaron, Elvia J",Water Rate Taker,Water Management,90744.00
554,"Aluise, Vincent G",Foreman Of Water Pipe Construction,Water Management,102440.00
671,"Ander, Perry A",Water Chemist Ii,Water Management,82044.00
685,"Anderson, Andrew J",District Superintendent Of Water Distribution,Water Management,109272.00
702,"Anderson, Donald",Foreman Of Water Pipe Construction,Water Management,102440.00
...,...,...,...,...
29669,"Verma, Anupam",Managing Engineer - Water Management,Water Management,111192.00
30239,"Washington, Joseph",Water Chemist Iii,Water Management,89676.00
30544,"West, Thomas R",Gen Supt Of Water Management,Water Management,115704.00
30991,"Williams, Matthew",Foreman Of Water Pipe Construction,Water Management,102440.00


In [39]:
management = chi["Position Title"].str.lower().str.endswith("management")
chi.loc[management]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
619,"Amado, Bonita S",Dir Of Facilities Management,Police,109008.0
1645,"Batie, Ferris C",Dir Of Facilities Management,General Services,112308.0
4484,"Chao, Jose A",Dir Of Facilities Management,Cultural Affairs,95820.0
6421,"Davis Iii, Wallace",Gen Supt Of Water Management,Water Management,119208.0
9751,"Gaspar, Ricardo",Dir Of Facilities Management,Aviation,106848.0
11466,"Hardy, Marie E",Asst Coord Of Collection Management,Public Library,83340.0
12412,"Hnatko, Wayne S",Asst Dir Of Buildings Management,General Services,110088.0
14897,"King, John T",Gen Supt Of Water Management,Water Management,114204.0
18237,"Mcfarland, Andrew S",Managing Engineer - Water Management,Water Management,111192.0
23792,"Reynolds, David J",Commissioner Of Fleet & Facility Management,General Services,157092.0


## Filtering with String Methods
- The `str.contains` method checks whether a substring exists anywhere in the string.
- The `str.startswith` method checks whether a substring exists at the start of the string.
- The `str.endswith` method checks whether a substring exists at the end of the string.
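The three checks can be compared side by side on a small hand-made **Series**. This is only a sketch: the titles and the variable names are illustrative assumptions. Each method returns a Boolean **Series** that can be used as a filter.

```python
import pandas as pd

# Hypothetical position titles for illustration
titles = pd.Series([
    "Water Rate Taker",
    "Foreman Of Water Pipe Construction",
    "Dir Of Facilities Management",
    "Police Officer",
])

# Lowercase first so the match does not depend on capitalization
lowered = titles.str.lower()

has_water = lowered.str.contains("water", regex=False)   # anywhere in the string
starts_water = lowered.str.startswith("water")           # only at the start
ends_management = lowered.str.endswith("management")     # only at the end

# A Boolean Series selects the matching rows
print(titles[has_water].tolist())
print(titles[starts_water].tolist())
print(titles[ends_management].tolist())
```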

In [None]:
#see above

## String Methods on Index and Columns
- Use the `index` and `columns` attributes to access the **DataFrame** index/column labels.
- These objects support string methods via their own `str` attribute.
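A minimal sketch of the same idea on a tiny hand-made **DataFrame** (the labels and values here are assumptions chosen for illustration): both `df.index` and `df.columns` are index objects, and both accept chained `str` methods.

```python
import pandas as pd

# Hypothetical frame with untidy row and column labels
df = pd.DataFrame(
    {
        " position title ": ["Water Rate Taker", "Police Officer"],
        "DEPARTMENT": ["Water Mgmnt", "Police"],
    },
    index=pd.Index(["AARON, ELVIA J ", " AARON, KARINA"], name="Name"),
)

# The string methods return a new Index, so assign it back to the attribute
df.index = df.index.str.strip().str.title()
df.columns = df.columns.str.strip().str.title()

print(df.index.tolist())
print(df.columns.tolist())
```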

In [42]:
chi = pd.read_csv("chicago.csv", index_col="Name").dropna(how="all").sort_index()
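chi["Position Title"] = chi["Position Title"].str.title().str.strip()
# "Name" is now the index rather than a column, so it is cleaned through chi.index instead
chi["Department"] = chi["Department"].str.title().str.strip().str.replace("Mgmnt", "Management")
# Remove the literal dollar sign (regex=False keeps "$" from being read as a regex anchor)
chi["Employee Annual Salary"] = chi["Employee Annual Salary"].str.replace("$", "", regex=False).str.strip()
# astype returns a new Series, so assign the result back to keep the conversion
chi["Employee Annual Salary"] = chi["Employee Annual Salary"].astype("float")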
chi.head()

Name,Position Title,Department,Employee Annual Salary
"AARON, ELVIA J",Water Rate Taker,Water Management,90744.0
"AARON, JEFFERY M",Police Officer,Police,84450.0
"AARON, KARINA",Police Officer,Police,84450.0
"AARON, KIMBERLEI R",Chief Contract Expediter,General Services,89880.0
"ABAD JR, VICENTE M",Civil Engineer Iv,Water Management,106836.0


In [45]:
chi.index = chi.index.str.strip().str.title()

In [46]:
chi.head()

Name,Position Title,Department,Employee Annual Salary
"Aaron, Elvia J",Water Rate Taker,Water Management,90744.0
"Aaron, Jeffery M",Police Officer,Police,84450.0
"Aaron, Karina",Police Officer,Police,84450.0
"Aaron, Kimberlei R",Chief Contract Expediter,General Services,89880.0
"Abad Jr, Vicente M",Civil Engineer Iv,Water Management,106836.0


In [57]:
chi = pd.read_csv("chicago.csv", index_col="Name").dropna(how="all").sort_index()
chi.head()

Name,Position Title,Department,Employee Annual Salary
"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


## The split Method
- The `str.split` method splits a string by the occurrence of a delimiter. Pandas returns a **Series** of lists.
- Use the `str.get` method to access a nested list element by its index position.

In [58]:
chi = pd.read_csv("chicago.csv").dropna(how="all")
chi["Department"] = chi["Department"].astype("category")
chi.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [63]:
# the most common first word in the position titles
chi["Position Title"].str.split(" ").str.get(0).value_counts()

Position Title
POLICE             10856
FIREFIGHTER-EMT     1509
SERGEANT            1186
POOL                 918
FIREFIGHTER          810
                   ...  
DENTIST                1
ASSOC                  1
TELEPHONE              1
MAYOR                  1
PREPRESS               1
Name: count, Length: 320, dtype: int64

## More Practice with Splits

In [64]:
chi = pd.read_csv("chicago.csv").dropna(how="all")
chi["Department"] = chi["Department"].astype("category")
chi.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [76]:
# find the most common first name
chi["Name"].str.title().str.split(", ").str.get(1).str.strip().str.split(" ").str.get(0).value_counts()

Name
Michael     1153
John         899
James        676
Robert       622
Joseph       537
            ... 
Deena          1
Cherrise       1
Eartha         1
Ernika         1
Mac            1
Name: count, Length: 5091, dtype: int64

In [None]:
chi = pd.read_csv("chicago.csv").dropna(how="all")
chi["Department"] = chi["Department"].astype("category")
chi.head()

In [82]:
# expand=True returns a DataFrame of the pieces; the "first" values keep the space after the comma
chi[["last", "first"]] = chi["Name"].str.title().str.split(",", expand=True)

In [79]:
chi.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,last,first
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,Aaron,Elvia J
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,Aaron,Jeffery M
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,Aaron,Karina
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,Aaron,Kimberlei R
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,Abad Jr,Vicente M


In [88]:
chi[["Primary Title", "Secondary Title"]] = chi["Position Title"].str.split(" ", expand=True, n=1)

In [89]:
chi.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,last,first,Primary Title,Secondary Title
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,Aaron,Elvia J,WATER,RATE TAKER
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,Aaron,Jeffery M,POLICE,OFFICER
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,Aaron,Karina,POLICE,OFFICER
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,Aaron,Kimberlei R,CHIEF,CONTRACT EXPEDITER
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,Abad Jr,Vicente M,CIVIL,ENGINEER IV


## The expand and n Parameters of the split Method
- The `expand` parameter returns a **DataFrame** instead of a **Series** of lists.
- The `n` parameter limits the number of splits.
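The effect of the two parameters is easiest to see on a few hand-picked titles. This is a sketch with illustrative variable names, not output from the full dataset.

```python
import pandas as pd

titles = pd.Series(["WATER RATE TAKER", "POLICE OFFICER", "CHIEF CONTRACT EXPEDITER"])

# Without n, every space is a split point; shorter rows are padded with missing values
no_limit = titles.str.split(" ", expand=True)
print(no_limit.shape)

# n=1 stops after the first split, so the rest of the title stays in one piece
one_split = titles.str.split(" ", expand=True, n=1)
one_split.columns = ["Primary Title", "Secondary Title"]
print(one_split)
```

Limiting the splits with `n` guarantees a fixed number of columns, which is what makes the result safe to assign to a fixed list of new column names.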

In [None]:
# see above