# Working with Text Data

## This Module's Dataset
- This module's dataset (`chicago.csv`) is a collection of public sector employees in the city of Chicago.
- Each row includes the employee's name, position title, department, and annual salary.

pip install pandas

In [2]:
import pandas as pd
import numpy as np

In [4]:
chicago = pd.read_csv("data/chicago.csv")

In [9]:
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [6]:
chicago.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32063 entries, 0 to 32062
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Name                    32062 non-null  object
 1   Position Title          32062 non-null  object
 2   Department              32062 non-null  object
 3   Employee Annual Salary  32062 non-null  object
dtypes: object(4)
memory usage: 1002.1+ KB


In [10]:
chicago.nunique()

Name                      31776
Position Title             1093
Department                   35
Employee Annual Salary     1156
dtype: int64

In [11]:
chicago["Department"] = chicago["Department"].astype("category")
chicago.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32063 entries, 0 to 32062
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32062 non-null  object  
 1   Position Title          32062 non-null  object  
 2   Department              32062 non-null  category
 3   Employee Annual Salary  32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 784.2+ KB
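
The drop in reported memory comes from the `category` dtype, which stores each distinct label once and represents the rows as compact integer codes. A minimal sketch of the effect on a made-up Series (the values and the repeat count are illustrative assumptions, not read from `chicago.csv`):

```python
import pandas as pd

# Hypothetical low-cardinality column: 3 distinct labels repeated many times.
departments = pd.Series(["POLICE", "WATER MGMNT", "GENERAL SERVICES"] * 10_000)

# deep=True counts the memory held by the string values themselves.
as_object = departments.memory_usage(deep=True)
as_category = departments.astype("category").memory_usage(deep=True)

print(as_object, as_category)
print(as_category < as_object)
```

The saving is largest when a column has few unique values relative to its length, as `Department` does here (35 unique values across roughly 32,000 rows).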


## Common String Methods
- A **Series** has a special `str` attribute that exposes an object with string methods.
- Access the `str` attribute, then invoke the string method on the nested object.
- Most method names match their Python string method equivalents (`upper`, `lower`, `title`, etc.).
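
As a small illustration of the points above, the sketch below applies two `str` methods to a made-up Series (the values are assumptions chosen for the example). Missing values are passed through as missing rather than raising an error.

```python
import pandas as pd

# Illustrative values only; None stands in for a missing title.
titles = pd.Series(["WATER RATE TAKER", "police officer", None])

# Each method runs element by element and returns a new Series.
title_cased = titles.str.title()
lengths = titles.str.len()

print(title_cased)
print(lengths)
```

Because the missing entry stays missing, a length column built this way typically comes back as floats (for example `11.0`) rather than integers.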

In [19]:
type(chicago["Position Title"].str.lower())

pandas.core.series.Series

In [16]:
type(chicago[["Position Title"]])

pandas.core.frame.DataFrame

In [17]:
type(chicago["Position Title"])

pandas.core.series.Series

In [20]:
chicago["Position Title"].str.lower()

0                water rate taker
1                  police officer
2                  police officer
3        chief contract expediter
4               civil engineer iv
                   ...           
32058              police officer
32059              police officer
32060              police officer
32061     chief data base analyst
32062                         NaN
Name: Position Title, Length: 32063, dtype: object

In [22]:
chicago["Department"].str.replace("MGMNT", "MANAGEMENT")

0        WATER MANAGEMENT
1                  POLICE
2                  POLICE
3        GENERAL SERVICES
4        WATER MANAGEMENT
               ...       
32058              POLICE
32059              POLICE
32060              POLICE
32061                DoIT
32062                 NaN
Name: Department, Length: 32063, dtype: object

In [24]:
chicago["Department char len"] = chicago["Department"].str.len()

In [25]:
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Department char len
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,11.0
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,6.0
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,6.0
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,16.0
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,11.0


## Filtering with String Methods
- The `str.contains` method checks whether a substring exists anywhere in the string.
- The `str.startswith` method checks whether a substring exists at the start of the string.
- The `str.endswith` method checks whether a substring exists at the end of the string.
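
Each of these methods returns a Boolean **Series** that can be used as a row filter. A minimal sketch on made-up titles (the sample values are assumptions for illustration):

```python
import pandas as pd

titles = pd.Series(["WATER RATE TAKER", "Civil Engineer IV", "POLICE OFFICER", None])

# case=False makes the match case-insensitive.
# na=False treats missing values as non-matches instead of leaving them missing.
has_water = titles.str.contains("water", case=False, na=False)

# startswith/endswith have no case parameter, so normalize the case first.
starts_civil = titles.str.lower().str.startswith("civil", na=False)
ends_officer = titles.str.lower().str.endswith("officer", na=False)

# A Boolean Series selects the rows where it is True.
print(titles[has_water].tolist())
```

Note that `str.contains` interprets its pattern as a regular expression by default; pass `regex=False` to match the text literally.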

In [63]:
water_positions = chicago['Position Title'].str.lower().str.contains("water", na=False)
water_positions.head()

0     True
1    False
2    False
3    False
4    False
Name: Position Title, dtype: bool

In [64]:
water_positions.unique()

array([ True, False])

In [53]:
chicago[water_positions].head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Department char len
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,11.0
554,"ALUISE, VINCENT G",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00,11.0
671,"ANDER, PERRY A",WATER CHEMIST II,WATER MGMNT,$82044.00,11.0
685,"ANDERSON, ANDREW J",DISTRICT SUPERINTENDENT OF WATER DISTRIBUTION,WATER MGMNT,$109272.00,11.0
702,"ANDERSON, DONALD",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00,11.0


In [66]:
len(chicago[water_positions])

111

In [56]:
starts_with_civil = chicago['Position Title'].str.lower().str.startswith('civil', na=False)
starts_with_civil.head()

0    False
1    False
2    False
3    False
4     True
Name: Position Title, dtype: bool

In [67]:
type(starts_with_civil)

pandas.core.series.Series

In [59]:
chicago[starts_with_civil].head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Department char len
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,11.0
25,"ABDULSATTAR, MUDHAR",CIVIL ENGINEER II,WATER MGMNT,$58536.00,11.0
34,"ABRAHAM, GIRLEY T",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,11.0
55,"ABUTALEB, AHMAD H",CIVIL ENGINEER II,WATER MGMNT,$89676.00,11.0
147,"ADAMS, TANERA C",CIVIL ENGINEER IV,TRANSPORTN,$106836.00,10.0


## String Methods on Index and Columns
- Use the `index` and `columns` attributes to access the **DataFrame** index/column labels.
- These objects support string methods via their own `str` attribute.
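
A short sketch of the idea, using a tiny made-up **DataFrame** (the row and the string index are assumptions for illustration; the real dataset uses a numeric index):

```python
import pandas as pd

df = pd.DataFrame(
    {"Position Title": ["POLICE OFFICER"], "Employee Annual Salary": ["$84450.00"]},
    index=["aaron, karina"],
)

# Index objects expose the same .str accessor as a Series.
# The result is a new Index, so assign it back to apply the change.
df.columns = df.columns.str.lower().str.replace(" ", "_")
df.index = df.index.str.title()

print(df.columns.tolist())
print(df.index.tolist())
```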

In [68]:
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Department char len
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,11.0
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,6.0
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,6.0
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,16.0
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,11.0


In [70]:
chicago.columns

Index(['Name', 'Position Title', 'Department', 'Employee Annual Salary',
       'Department char len'],
      dtype='object')

In [71]:
chicago.index

RangeIndex(start=0, stop=32063, step=1)

In [80]:
# chicago = chicago.drop(columns = ['index', 'level_0'], axis=1)

In [94]:
# Without drop=True, reset_index keeps the old index as a new "index" column
chicago = chicago.reset_index()
chicago.head()

Unnamed: 0,index,Name,Position Title,Department,Employee Annual Salary,Department char len
0,0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,11.0
1,1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,6.0
2,2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,6.0
3,3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,16.0
4,4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,11.0


In [97]:
chicago.columns

Index(['index', 'Name', 'Position Title', 'Department',
       'Employee Annual Salary', 'Department char len'],
      dtype='object')

## The split Method
- The `str.split` method splits a string by the occurrence of a delimiter. Pandas returns a **Series** of lists.
- Use the `str.get` method to access a nested list element by its index position.
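
The two methods are often chained: split first, then pull one element out of each list. A minimal sketch on made-up titles (sample values are assumptions for illustration):

```python
import pandas as pd

titles = pd.Series(["WATER RATE TAKER", "POLICE OFFICER", None])

# Each non-missing element becomes a list of words.
words = titles.str.split(" ")

# str.get works on the lists; negative positions count from the end.
first_word = words.str.get(0)
last_word = words.str.get(-1)

print(first_word)
print(last_word)
```

Missing titles stay missing at every step, so no special handling is needed before calling `str.get`.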

In [103]:
type(chicago)

pandas.core.frame.DataFrame

In [102]:
chicago.head()

Unnamed: 0,index,Name,Position Title,Department,Employee Annual Salary,Department char len
0,0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,11.0
1,1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,6.0
2,2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,6.0
3,3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,16.0
4,4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,11.0


In [106]:
chicago["List Of Values"] = chicago['Position Title'].str.split(" ")

In [107]:
chicago.head()

Unnamed: 0,index,Name,Position Title,Department,Employee Annual Salary,Department char len,List Of Values
0,0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,11.0,"[WATER, RATE, TAKER]"
1,1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,6.0,"[POLICE, OFFICER]"
2,2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,6.0,"[POLICE, OFFICER]"
3,3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,16.0,"[CHIEF, CONTRACT, EXPEDITER]"
4,4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,11.0,"[CIVIL, ENGINEER, IV]"


In [116]:
chicago["Name Splitted"] = chicago["Name"].str.split(", ")
# Names are stored as "LAST, FIRST", so the first piece is the last name
chicago[["Last Name", "First Name"]] = chicago["Name"].str.split(", ", expand=True)

In [117]:
chicago.head()

Unnamed: 0,index,Name,Position Title,Department,Employee Annual Salary,Department char len,List Of Values,Name Splitted,Last Name,First Name
0,0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,11.0,"[WATER, RATE, TAKER]","[AARON, ELVIA J]",AARON,ELVIA J
1,1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,6.0,"[POLICE, OFFICER]","[AARON, JEFFERY M]",AARON,JEFFERY M
2,2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,6.0,"[POLICE, OFFICER]","[AARON, KARINA]",AARON,KARINA
3,3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,16.0,"[CHIEF, CONTRACT, EXPEDITER]","[AARON, KIMBERLEI R]",AARON,KIMBERLEI R
4,4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,11.0,"[CIVIL, ENGINEER, IV]","[ABAD JR, VICENTE M]",ABAD JR,VICENTE M


In [118]:
chicago["Last Name"].value_counts()

Last Name
WILLIAMS     293
JOHNSON      244
SMITH        241
BROWN        185
JONES        183
            ... 
ZUMARAS        1
ZUMBROCK       1
ZUMMO          1
ZUNICH         1
ZUNIGA JR      1
Name: count, Length: 13830, dtype: int64

In [104]:
chicago['Position Title'].value_counts()

Position Title
POLICE OFFICER                            9184
FIREFIGHTER-EMT                           1208
SERGEANT                                  1185
POOL MOTOR TRUCK DRIVER                    918
POLICE OFFICER (ASSIGNED AS DETECTIVE)     896
                                          ... 
OPERATIONS MANAGER - ANIMAL CONTROL          1
PUBLIC HEALTH NURSE III - EXCLUDED           1
EXECUTIVE DIR - POLICE BOARD                 1
FIRST DEPUTY SUPERINTENDENT                  1
MANAGING DEPUTY BUDGET DIRECTOR              1
Name: count, Length: 1093, dtype: int64

## More Practice with Splits

## The expand and n Parameters of the split Method
- Setting `expand=True` returns a **DataFrame** with one column per piece instead of a **Series** of lists.
- The `n` parameter limits the number of splits.
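
A minimal sketch of both parameters together, on made-up titles (sample values are assumptions for illustration). With `n=1`, only the first space is used as a split point, so everything after it stays in one piece.

```python
import pandas as pd

titles = pd.Series(["CHIEF CONTRACT EXPEDITER", "POLICE OFFICER"])

# Default: split at every space, one list per row.
all_words = titles.str.split(" ")

# n=1 limits the split to the first space; expand=True returns a DataFrame.
first_and_rest = titles.str.split(" ", n=1, expand=True)
first_and_rest.columns = ["First Word", "Remainder"]

print(all_words.tolist())
print(first_and_rest)
```

Limiting `n` is useful when assigning the result to a fixed number of new columns, because it guarantees that no row produces more pieces than expected.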