# Working with Text Data

In [None]:
import pandas as pd

## This Module's Dataset
- This module's dataset (`chicago.csv`) is a collection of public sector employees in the city of Chicago.
- Each row includes the employee's name, position, department, and salary.

In [None]:
chicago = pd.read_csv('chicago.csv')
chicago = chicago.dropna(how='all')
chicago['Department'] = chicago['Department'].astype('category')

In [None]:
chicago

In [None]:
chicago.info()

## Common String Methods
- A **Series** has a special `str` attribute that exposes an object with string methods.
- Access the `str` attribute, then invoke the string method on the nested object.
- Most method names match their Python string method equivalents (`upper`, `lower`, `title`, etc.).

In [None]:
chicago

In [None]:
chicago['Position Title'] = chicago['Position Title'].str.title()
chicago

In [None]:
chicago['Department'] = chicago['Department'].str.replace('MGMNT', 'MANAGEMENT').str.title()
chicago

## Filtering with String Methods
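- The `str.contains` method checks whether a substring exists anywhere in the string.
- The `str.startswith` method checks whether a substring exists at the start of the string.
- The `str.endswith` method checks whether a substring exists at the end of the string.
- All three comparisons are case-sensitive, so normalize the casing first (for example with `str.lower`) when the match should ignore case.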

In [22]:
chicago = pd.read_csv('chicago.csv')
chicago = chicago.dropna(how='all')
chicago['Department'] = chicago['Department'].astype('category')
chicago

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
...,...,...,...,...
32057,"ZYGADLO, MICHAEL J",FRM OF MACHINISTS - AUTOMOTIVE,GENERAL SERVICES,$99528.00
32058,"ZYGOWICZ, PETER J",POLICE OFFICER,POLICE,$87384.00
32059,"ZYMANTAS, MARK E",POLICE OFFICER,POLICE,$84450.00
32060,"ZYRKOWSKI, CARLO E",POLICE OFFICER,POLICE,$87384.00


In [23]:
engineer = chicago['Position Title'].str.lower().str.contains('engineer')


In [24]:
chicago.loc[engineer]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
10,"ABBATEMARCO, JAMES J",FIRE ENGINEER-EMT,FIRE,$100320.00
20,"ABDUL-KARIM, MUHAMMAD A",ENGINEERING TECHNICIAN VI,WATER MGMNT,$108228.00
25,"ABDULSATTAR, MUDHAR",CIVIL ENGINEER II,WATER MGMNT,$58536.00
34,"ABRAHAM, GIRLEY T",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
...,...,...,...,...
31975,"ZINCHUK, BRIAN C",OPERATING ENGINEER-GROUP C,WATER MGMNT,$93745.60
31994,"ZOCHOWSKI, DAVID J",OPERATING ENGINEER-GROUP C,AVIATION,$93745.60
32008,"ZOTTA, SANDINO",MECHANICAL ENGINEER IV,WATER MGMNT,$106836.00
32009,"ZOUBI, HAMZEH A",MECHANICAL ENGINEER III,GENERAL SERVICES,$64644.00


In [28]:
starts_with_water = chicago['Department'].str.lower().str.startswith('water')
chicago[starts_with_water]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
20,"ABDUL-KARIM, MUHAMMAD A",ENGINEERING TECHNICIAN VI,WATER MGMNT,$108228.00
25,"ABDULSATTAR, MUDHAR",CIVIL ENGINEER II,WATER MGMNT,$58536.00
34,"ABRAHAM, GIRLEY T",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
...,...,...,...,...
31983,"ZIVAT, MICHAEL",CONSTRUCTION LABORER,WATER MGMNT,$81536.00
31984,"ZIZUMBO, DANIEL",POOL MOTOR TRUCK DRIVER,WATER MGMNT,$72862.40
32008,"ZOTTA, SANDINO",MECHANICAL ENGINEER IV,WATER MGMNT,$106836.00
32038,"ZUNO, ERIK",LABORER - APPRENTICE,WATER MGMNT,$73382.40


In [30]:
ends_with_v = chicago['Position Title'].str.lower().str.endswith(' v')
chicago[ends_with_v]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
35,"ABRAHAM, GODWIN K",ENGINEERING TECHNICIAN V,BUSINESS AFFAIRS,$98616.00
821,"AN, NICK H",TRAFFIC ENGINEER V,TRANSPORTN,$103644.00
1189,"BABAPOUR, YADI M",FILTRATION ENGINEER V,WATER MGMNT,$116784.00
1384,"BANEA, PHILIP",CITY PLANNER V,TRANSPORTN,$66768.00
1774,"BECQ GIRAUDON, EMILIE F",CIVIL ENGINEER V,TRANSPORTN,$116784.00
4357,"CAYANAN, NARCISO T",CIVIL ENGINEER V,TRANSPORTN,$116784.00
4366,"CECCHIN, JOHN R",CIVIL ENGINEER V,TRANSPORTN,$116784.00
7227,"DONALDSON, ANGELA M",FILTRATION ENGINEER V,WATER MGMNT,$116784.00
7875,"EKWUEME, AMOBI",CIVIL ENGINEER V,TRANSPORTN,$116784.00
8099,"ESPINOZA, FERNANDO",CITY PLANNER V,COMMUNITY DEVELOPMENT,$84996.00


## String Methods on Index and Columns
- Use the `index` and `columns` attributes to access the **DataFrame** index/column labels.
- These objects support string methods via their own `str` attribute.
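A minimal sketch of the idea above. It does not read `chicago.csv`; the `sample` **DataFrame** is a small hand-built stand-in that copies a few rows from the output shown earlier, and moving `Name` into the index is an assumption made only so there are string index labels to work with.

```python
import pandas as pd

# Hand-built stand-in for a few rows of the dataset (illustrative only).
sample = pd.DataFrame(
    {
        "Name": ["AARON, ELVIA J", "AARON, JEFFERY M", "ABAD JR, VICENTE M"],
        "Position Title": ["WATER RATE TAKER", "POLICE OFFICER", "CIVIL ENGINEER IV"],
        "Department": ["WATER MGMNT", "POLICE", "WATER MGMNT"],
    }
).set_index("Name")

# The index labels are strings, so the Index object has its own `str` attribute.
sample.index = sample.index.str.strip().str.title()

# The column labels live in an Index object as well.
sample.columns = sample.columns.str.upper()

print(sample)
```

Assigning the result back to `sample.index` or `sample.columns` is what makes the change stick; the string methods return a new object rather than modifying the labels in place.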

## The split Method
- The `str.split` method splits each string at every occurrence of a delimiter. Pandas returns a **Series** of lists.
- Use the `str.get` method to access a nested list element by its index position.
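As a rough illustration of `str.split` followed by `str.get`, the sketch below pulls last and first names out of "LAST, FIRST MIDDLE" strings. The `names` **Series** is hand-built from values in the output shown earlier rather than loaded from `chicago.csv`, and the variable names are hypothetical.

```python
import pandas as pd

# Hand-built sample of values from the Name column (illustrative only).
names = pd.Series(["AARON, ELVIA J", "ABAD JR, VICENTE M", "ZOTTA, SANDINO"])

# Splitting on the comma gives a Series where each value is a list.
parts = names.str.split(",")

# str.get(0) picks the first element of every list: the last name.
last_names = parts.str.get(0)

# The second element still has a leading space, so strip it,
# then split on spaces and keep only the first word: the first name.
first_names = parts.str.get(1).str.strip().str.split(" ").str.get(0)

print(last_names)
print(first_names)
```

Note that the text after the comma begins with a space; stripping it before the second split avoids getting an empty string as the first list element.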

## More Practice with Splits

## The expand and n Parameters of the split Method
- The `expand` parameter returns a **DataFrame** instead of a **Series** of lists.
- The `n` parameter limits the number of splits.
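The two parameters can be tried together on a small scale. This is a sketch under the same assumption as before: the `names` and `titles` **Series** are hand-built from values in the earlier output, not loaded from the CSV, and the result variable names are hypothetical.

```python
import pandas as pd

# Hand-built samples of the Name and Position Title columns (illustrative only).
names = pd.Series(["AARON, ELVIA J", "ABAD JR, VICENTE M"])
titles = pd.Series(["CIVIL ENGINEER IV", "CHIEF CONTRACT EXPEDITER"])

# expand=True spreads the pieces across columns labeled 0, 1, ...
name_columns = names.str.split(",", expand=True)

# n=1 stops after the first split, so everything after the
# first space stays together in the second column.
first_word_and_rest = titles.str.split(" ", n=1, expand=True)

print(name_columns)
print(first_word_and_rest)
```

Without `n=1`, the second title would split into three columns and the first would need a missing value to fill the gap, which is why limiting the splits tends to give a cleaner **DataFrame** when the strings have different word counts.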