# Working with Text Data

In [1]:
import pandas as pd

## This Module's Dataset
- This module's dataset (`chicago.csv`) is a collection of public sector employees in the city of Chicago.
- Each row includes the employee's name, position, department, and salary.

In [None]:
chicago = pd.read_csv('chicago.csv')
chicago = chicago.dropna(how='all')
chicago['Department'] = chicago['Department'].astype('category')

In [None]:
chicago

In [None]:
chicago.info()

## Common String Methods
- A **Series** has a special `str` attribute that exposes an object with string methods.
- Access the `str` attribute, then invoke the string method on the nested object.
- Most method names match their Python string method equivalents (`upper`, `lower`, `title`, etc.).
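
A minimal, self-contained sketch of the `str` accessor. The `titles` Series is made-up sample data, not rows read from `chicago.csv`; it includes a missing value to show that string methods pass missing entries through rather than raising an error.

```python
import pandas as pd

# Hypothetical sample titles (not taken from chicago.csv).
titles = pd.Series(['POLICE OFFICER', 'civil engineer iv', None])

# Each call goes through .str and returns a new Series of the same length.
upper = titles.str.upper()    # every letter uppercased
title = titles.str.title()    # first letter of each word capitalized
lengths = titles.str.len()    # character count per value

# The missing value stays missing in all three results.
print(upper)
print(title)
print(lengths)
```

None of these calls modify `titles`; assign the result back to the column to keep it, as the cells in this section do.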

In [None]:
chicago

In [None]:
chicago['Position Title'] = chicago['Position Title'].str.title()
chicago

In [None]:
chicago['Department'] = chicago['Department'].str.replace('MGMNT', 'MANAGEMENT').str.title()
chicago

## Filtering with String Methods
- The `str.contains` method checks whether a substring exists anywhere in the string.
- The `str.startswith` method checks whether a substring exists at the start of the string.
- The `str.endswith` method checks whether a substring exists at the end of the string.
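
A small sketch of all three checks on inline data. The `staff` DataFrame and its rows are assumptions made for illustration. Passing `na=False` is one way to treat missing values as non-matches so the result is a clean Boolean mask; `regex=False` asks `str.contains` for a literal substring match.

```python
import pandas as pd

# Hypothetical rows that mimic the shape of chicago.csv.
staff = pd.DataFrame({
    'Position Title': ['CIVIL ENGINEER IV', 'POLICE OFFICER',
                       'Engineering Technician V', None],
    'Department': ['WATER MGMNT', 'POLICE', 'WATER MGMNT', 'FIRE'],
})

# Lowercase first so the comparisons are case-insensitive.
titles = staff['Position Title'].str.lower()

has_engineer = titles.str.contains('engineer', regex=False, na=False)
ends_with_v = titles.str.endswith(' v', na=False)
starts_with_water = staff['Department'].str.lower().str.startswith('water')

# A Boolean Series can be used directly to filter rows.
print(staff[has_engineer])
print(staff[ends_with_v])
print(staff[starts_with_water])
```

Note that `' v'` (with a leading space) matches "Technician V" but not "Engineer IV", because the latter ends in `iv` with no space before the `v`.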

In [None]:
chicago = pd.read_csv('chicago.csv')
chicago = chicago.dropna(how='all')
chicago['Department'] = chicago['Department'].astype('category')
chicago

In [None]:
engineer = chicago['Position Title'].str.lower().str.contains('engineer')


In [None]:
chicago.loc[engineer]

In [None]:
starts_with_water = chicago['Department'].str.lower().str.startswith('water')
chicago[starts_with_water]

In [None]:
ends_with_v = chicago['Position Title'].str.lower().str.endswith(' v')
chicago[ends_with_v]

## String Methods on Index and Columns
- Use the `index` and `columns` attributes to access the **DataFrame** index/column labels.
- These objects support string methods via their own `str` attribute.
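
A short sketch with inline data, assuming a tiny hypothetical `staff` DataFrame indexed by name. It shows the same `str` accessor working on both the index and the column labels, including chained calls.

```python
import pandas as pd

# Hypothetical two-row frame indexed by employee name.
staff = pd.DataFrame(
    {'Position Title': ['POLICE OFFICER', 'CIVIL ENGINEER IV'],
     'Department': ['POLICE', 'WATER MGMNT']},
    index=pd.Index(['AARON, KARINA', 'ABAD JR, VICENTE M'], name='Name'),
)

# Index labels: title-case each name.
staff.index = staff.index.str.title()

# Column labels: lowercase, then swap spaces for underscores.
staff.columns = staff.columns.str.lower().str.replace(' ', '_')

print(staff)
```

Both attributes are overwritten by assignment, since the string methods return new label objects instead of changing the existing ones.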

In [None]:
chicago = pd.read_csv('chicago.csv')
chicago

In [None]:
chicago = chicago.set_index('Name')
chicago

In [None]:
chicago.index = chicago.index.str.title()
chicago

In [None]:
chicago.columns = chicago.columns.str.upper()
chicago

## The split Method
- The `str.split` method splits each string at every occurrence of a delimiter. Pandas returns a **Series** of lists.
- Use the `str.get` method to access a nested list element by its index position.
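
A step-by-step sketch of `str.split` followed by `str.get`, using a few made-up names in the "LAST, FIRST MIDDLE" layout. Splitting on the bare comma and then calling `str.strip` is an assumption made here so the sketch does not depend on how many spaces follow the comma.

```python
import pandas as pd

# Hypothetical names in "LAST, FIRST MIDDLE" form.
names = pd.Series(['AARON, ELVIA J', 'ABAD JR, VICENTE M', 'ZYSKOWSKI, DARIUSZ'])

# Step 1: split on the comma -> each value becomes a list of two strings.
parts = names.str.split(',')

# Step 2: pick list elements by position.
last = parts.str.get(0)

# Step 3: strip the leading space, split on spaces, keep the first word.
first = (
    parts.str.get(1)
    .str.strip()
    .str.split(' ')
    .str.get(0)
    .str.capitalize()
)

print(parts)
print(last)
print(first)
```

Calling `value_counts()` on `first` would then count how often each first name occurs.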

In [3]:
chicago = pd.read_csv('chicago.csv')
chicago

| | Name | Position Title | Department | Employee Annual Salary |
|---|---|---|---|---|
| 0 | AARON, ELVIA J | WATER RATE TAKER | WATER MGMNT | $90744.00 |
| 1 | AARON, JEFFERY M | POLICE OFFICER | POLICE | $84450.00 |
| 2 | AARON, KARINA | POLICE OFFICER | POLICE | $84450.00 |
| 3 | AARON, KIMBERLEI R | CHIEF CONTRACT EXPEDITER | GENERAL SERVICES | $89880.00 |
| 4 | ABAD JR, VICENTE M | CIVIL ENGINEER IV | WATER MGMNT | $106836.00 |
| ... | ... | ... | ... | ... |
| 32058 | ZYGOWICZ, PETER J | POLICE OFFICER | POLICE | $87384.00 |
| 32059 | ZYMANTAS, MARK E | POLICE OFFICER | POLICE | $84450.00 |
| 32060 | ZYRKOWSKI, CARLO E | POLICE OFFICER | POLICE | $87384.00 |
| 32061 | ZYSKOWSKI, DARIUSZ | CHIEF DATA BASE ANALYST | DoIT | $113664.00 |


In [10]:
chicago['Name'].str.split(',  ').str.get(1).str.split(' ').str.get(0).str.capitalize().value_counts()

Name
Michael     1153
John         899
James        676
Robert       622
Joseph       537
            ... 
Russ           1
Fabiola        1
Jurdon         1
Nateesha       1
Lilya          1
Name: count, Length: 5091, dtype: int64

## The expand and n Parameters of the split Method
- Passing `expand=True` returns a **DataFrame** with one column per piece instead of a **Series** of lists.
- The `n` parameter limits the number of splits (for example, `n=1` splits only at the first delimiter).
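
A minimal sketch of both parameters on inline data. The sample values and the column names `Last Name` and `First Name` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample values in the style of chicago.csv.
names = pd.Series(['AARON, ELVIA J', 'ABAD JR, VICENTE M'])
titles = pd.Series(['CIVIL ENGINEER IV', 'POLICE OFFICER'])

# expand=True: each piece lands in its own column of a DataFrame.
name_parts = names.str.split(', ', expand=True)
name_parts.columns = ['Last Name', 'First Name']

# n=1: split only at the first space, so the rest of the title stays together.
title_parts = titles.str.split(' ', n=1, expand=True)

# Without n, the longest title decides the column count; shorter rows are padded.
all_pieces = titles.str.split(' ', expand=True)

print(name_parts)
print(title_parts)
print(all_pieces)
```

Limiting `n` keeps the column count predictable when values contain different numbers of delimiters, which makes it easier to assign the result to a fixed set of new columns.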