# Working with Text Data

In [8]:
import pandas as pd

## This Module's Dataset
- This module's dataset (`chicago.csv`) is a collection of public sector employees in the city of Chicago.
- Each row includes the employee's name, position title, department, and annual salary.

In [9]:
# Drop rows in which every value is missing.
chicago = pd.read_csv("chicago.csv").dropna(how="all")
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [10]:
chicago.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Name                    32062 non-null  object
 1   Position Title          32062 non-null  object
 2   Department              32062 non-null  object
 3   Employee Annual Salary  32062 non-null  object
dtypes: object(4)
memory usage: 1.2+ MB


In [11]:
chicago.nunique()

Name                      31776
Position Title             1093
Department                   35
Employee Annual Salary     1156
dtype: int64

In [12]:
# Department has only 35 unique values, so a categorical dtype is a good fit.
chicago["Department"] = chicago["Department"].astype("category")

## Common String Methods
- A **Series** has a special `str` attribute that exposes an object with string methods.
- Access the `str` attribute, then invoke the string method on the nested object.
- Most method names match their Python string method equivalents (`upper`, `lower`, `title`, etc.).
- Each method returns a new **Series**; the original **Series** is left unchanged.
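
As a minimal sketch of the points above, the following uses a small hand-made **Series** (the sample values are assumptions, not rows from `chicago.csv`) to show that chained `str` calls return new objects and leave the original untouched.

```python
import pandas as pd

# Hypothetical sample values, not taken from chicago.csv.
titles = pd.Series(["  WATER RATE TAKER ", "POLICE OFFICER"])

# Chain string methods by accessing `str` again on each result.
cleaned = titles.str.strip().str.title()

# Lengths before and after stripping the surrounding whitespace.
raw_lengths = titles.str.len()
clean_lengths = cleaned.str.len()

print(cleaned)
print(titles)  # still contains the original, unstripped upper-case values
```

To keep a result, assign it back, for example `titles = titles.str.strip()`.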

In [13]:
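# Each call returns a new Series; only the last expression in a cell is displayed.
chicago["Position Title"].str.lower()
chicago["Position Title"].str.upper()
chicago["Position Title"].str.title()
chicago["Position Title"].str.len()
# Strip whitespace from both ends, the left end only, or the right end only.
chicago["Position Title"].str.strip()
chicago["Position Title"].str.lstrip()
chicago["Position Title"].str.rstrip()
# Chain string methods by accessing `str` again on each result.
chicago["Position Title"].str.title().str.len()
chicago["Department"].str.replace("MGMNT", "MANAGEMENT").str.title()
# None of the results above was assigned back, so the DataFrame is unchanged.
chicago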

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
...,...,...,...,...
32057,"ZYGADLO, MICHAEL J",FRM OF MACHINISTS - AUTOMOTIVE,GENERAL SERVICES,$99528.00
32058,"ZYGOWICZ, PETER J",POLICE OFFICER,POLICE,$87384.00
32059,"ZYMANTAS, MARK E",POLICE OFFICER,POLICE,$84450.00
32060,"ZYRKOWSKI, CARLO E",POLICE OFFICER,POLICE,$87384.00


## Filtering with String Methods
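- The `str.contains` method checks whether a substring exists anywhere in the string.
- The `str.startswith` method checks whether the string starts with a given substring.
- The `str.endswith` method checks whether the string ends with a given substring.
- All three comparisons are case-sensitive by default, so normalize the casing first (for example with `str.lower`).
- Each method returns a Boolean **Series** that can be used to filter the **DataFrame**.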
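
The three methods can be sketched on a small hand-made **DataFrame** as follows. The sample rows and variable names are assumptions for illustration. Passing `na=False` treats missing titles as non-matches, and `regex=False` makes `str.contains` treat the pattern as a literal substring.

```python
import pandas as pd

# Hypothetical sample rows, not taken from chicago.csv.
staff = pd.DataFrame({
    "Position Title": [
        "WATER CHEMIST II",
        "CIVIL ENGINEER IV",
        "Foreman of Water Pipe Construction",
        None,
    ]
})

# Normalize the casing once so the comparisons are case-insensitive.
titles = staff["Position Title"].str.lower()

has_water = titles.str.contains("water", regex=False, na=False)
starts_civil = titles.str.startswith("civil", na=False)
ends_iv = titles.str.endswith("iv", na=False)

# A Boolean Series filters the rows of the DataFrame.
water_rows = staff[has_water]
print(water_rows)
```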

In [14]:
chicago[chicago["Position Title"].str.lower().str.contains("water")]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
554,"ALUISE, VINCENT G",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00
671,"ANDER, PERRY A",WATER CHEMIST II,WATER MGMNT,$82044.00
685,"ANDERSON, ANDREW J",DISTRICT SUPERINTENDENT OF WATER DISTRIBUTION,WATER MGMNT,$109272.00
702,"ANDERSON, DONALD",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00
...,...,...,...,...
29669,"VERMA, ANUPAM",MANAGING ENGINEER - WATER MANAGEMENT,WATER MGMNT,$111192.00
30239,"WASHINGTON, JOSEPH",WATER CHEMIST III,WATER MGMNT,$89676.00
30544,"WEST, THOMAS R",GEN SUPT OF WATER MANAGEMENT,WATER MGMNT,$115704.00
30991,"WILLIAMS, MATTHEW",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00


In [39]:
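# Boolean Series: True where the title starts with "civil".
# It is not displayed because it is not the last expression in the cell.
chicago["Position Title"].str.lower().str.startswith("civil")
# Move the names into the index for the next section.
chicago.set_index("Name", inplace=True)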

## String Methods on Index and Columns
- Use the `index` and `columns` attributes to access the **DataFrame** index/column labels.
- These objects support string methods via their own `str` attribute.
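
A minimal sketch of the same idea on a tiny hand-made **DataFrame** (the labels below are assumed sample values): both `index` and `columns` are `Index` objects, and their `str` accessor returns a new `Index` that has to be assigned back.

```python
import pandas as pd

# Hypothetical sample frame with untidy index labels.
df = pd.DataFrame(
    {"Position Title": ["POLICE OFFICER", "CIVIL ENGINEER IV"]},
    index=[" AARON, KARINA ", "ABAD JR, VICENTE M"],
)

# String methods on the row labels.
df.index = df.index.str.strip().str.title()

# String methods on the column labels.
df.columns = df.columns.str.upper()

print(df)
```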

In [16]:
chicago.index=chicago.index.str.strip().str.title()
chicago

Unnamed: 0_level_0,Position Title,Department,Employee Annual Salary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Aaron, Elvia J",WATER RATE TAKER,WATER MGMNT,$90744.00
"Aaron, Jeffery M",POLICE OFFICER,POLICE,$84450.00
"Aaron, Karina",POLICE OFFICER,POLICE,$84450.00
"Aaron, Kimberlei R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
"Abad Jr, Vicente M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
...,...,...,...
"Zygadlo, Michael J",FRM OF MACHINISTS - AUTOMOTIVE,GENERAL SERVICES,$99528.00
"Zygowicz, Peter J",POLICE OFFICER,POLICE,$87384.00
"Zymantas, Mark E",POLICE OFFICER,POLICE,$84450.00
"Zyrkowski, Carlo E",POLICE OFFICER,POLICE,$87384.00


In [18]:
chicago.columns=chicago.columns.str.upper()

## The split Method
- The `str.split` method splits a string by the occurrence of a delimiter. Pandas returns a **Series** of lists.
- Use the `str.get` method to access a nested list element by its index position.
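
The sketch below illustrates both methods on a few assumed sample titles. It also shows that `str.get` accepts negative positions and yields a missing value when a list is too short for the requested position.

```python
import pandas as pd

# Hypothetical sample values.
titles = pd.Series(["WATER RATE TAKER", "POLICE OFFICER", "POLICE OFFICER", "MAYOR"])

# One list of words per row.
parts = titles.str.split(" ")

first_words = parts.str.get(0)    # first element of each list
last_words = parts.str.get(-1)    # last element of each list
third_words = parts.str.get(2)    # missing where a list has fewer than 3 elements

# Frequency of each first word.
counts = first_words.value_counts()
print(counts)
```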

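In [27]:
# Python's built-in str.split, for comparison.
"water rate taker".split()
# The column labels were upper-cased in the previous section.
chicago.columns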

Index(['POSITION TITLE', 'DEPARTMENT', 'EMPLOYEE ANNUAL SALARY'], dtype='object')

In [25]:
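# Split each title on spaces, producing one list of words per row.
chicago["POSITION TITLE"].str.split(" ")
# Take the first word of each title and count how often each one occurs.
chicago["POSITION TITLE"].str.split(" ").str.get(0).value_counts()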

POSITION TITLE
POLICE             10856
FIREFIGHTER-EMT     1509
SERGEANT            1186
POOL                 918
FIREFIGHTER          810
                   ...  
DENTIST                1
ASSOC                  1
TELEPHONE              1
MAYOR                  1
PREPRESS               1
Name: count, Length: 320, dtype: int64

## More Practice with Splits

In [38]:
chicago

Unnamed: 0,level_0,index,Name,POSITION TITLE,DEPARTMENT,EMPLOYEE ANNUAL SALARY
0,0,0,"Aaron, Elvia J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,1,1,"Aaron, Jeffery M",POLICE OFFICER,POLICE,$84450.00
2,2,2,"Aaron, Karina",POLICE OFFICER,POLICE,$84450.00
3,3,3,"Aaron, Kimberlei R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,4,4,"Abad Jr, Vicente M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
...,...,...,...,...,...,...
32057,32057,32057,"Zygadlo, Michael J",FRM OF MACHINISTS - AUTOMOTIVE,GENERAL SERVICES,$99528.00
32058,32058,32058,"Zygowicz, Peter J",POLICE OFFICER,POLICE,$87384.00
32059,32059,32059,"Zymantas, Mark E",POLICE OFFICER,POLICE,$84450.00
32060,32060,32060,"Zyrkowski, Carlo E",POLICE OFFICER,POLICE,$87384.00


In [50]:
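# Reload the raw file. Without dropna, the trailing all-NaN row remains (last row of the output).
chicago = pd.read_csv("chicago.csv")
# The delimiter here is a comma followed by two spaces; position 1 holds the first name.
chicago["Name"].str.title().str.split(",  ").str.get(1)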

0            Elvia J
1          Jeffery M
2             Karina
3        Kimberlei R
4          Vicente M
            ...     
32058        Peter J
32059         Mark E
32060        Carlo E
32061        Dariusz
32062            NaN
Name: Name, Length: 32063, dtype: object

## The expand and n Parameters of the split Method
- Pass `expand=True` to get a **DataFrame** with one column per piece instead of a **Series** of lists.
- The `n` parameter caps the number of splits, so each string yields at most `n + 1` pieces.
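
As a rough sketch of both parameters, the example below uses assumed sample values (with a single space after the comma in the names) and hypothetical variable names. Rows that produce fewer pieces than the widest row are padded with missing values.

```python
import pandas as pd

# Hypothetical sample values in "LAST, FIRST" order.
names = pd.Series(["AARON, ELVIA J", "ABAD JR, VICENTE M"])

# expand=True returns a DataFrame with one column per piece.
name_parts = names.str.split(", ", expand=True)
name_parts.columns = ["Last Name", "First Name"]

titles = pd.Series(["WATER RATE TAKER", "CHIEF CONTRACT EXPEDITER", "MAYOR"])

# Without n, every space causes a split: three columns here.
unlimited = titles.str.split(" ", expand=True)

# n=1 allows a single split, so at most two columns.
limited = titles.str.split(" ", n=1, expand=True)

print(name_parts)
print(limited)
```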

In [53]:
# Names are stored as "LAST, FIRST", so the first piece is the last name.
chicago[["Last Name", "First Name"]] = chicago["Name"].str.split(",", expand=True)

In [55]:
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Last Name,First Name
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,AARON,ELVIA J
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,AARON,JEFFERY M
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,AARON,KARINA
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,AARON,KIMBERLEI R
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,ABAD JR,VICENTE M


In [61]:
chicago["Position Title"].str.split(" ", expand=True, n=1)

Unnamed: 0,0,1
0,WATER,RATE TAKER
1,POLICE,OFFICER
2,POLICE,OFFICER
3,CHIEF,CONTRACT EXPEDITER
4,CIVIL,ENGINEER IV
...,...,...
32058,POLICE,OFFICER
32059,POLICE,OFFICER
32060,POLICE,OFFICER
32061,CHIEF,DATA BASE ANALYST
