# Working with Text Data

In [1]:
import pandas as pd

## This Module's Dataset
- This module's dataset (`chicago.csv`) is a collection of records for public sector employees in the city of Chicago.
- Each row includes the employee's name, position, department, and salary.

In [3]:
chicago = pd.read_csv('chicago.csv').dropna(how='all')
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [4]:
chicago.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Name                    32062 non-null  object
 1   Position Title          32062 non-null  object
 2   Department              32062 non-null  object
 3   Employee Annual Salary  32062 non-null  object
dtypes: object(4)
memory usage: 1.2+ MB


In [6]:
chicago.nunique()

Name                      31776
Position Title             1093
Department                   35
Employee Annual Salary     1156
dtype: int64

In [8]:
# Department has only 35 unique values across 32,062 rows, so converting it to the category dtype reduces the DataFrame's memory usage
chicago['Department'] = chicago['Department'].astype('category')

In [10]:
chicago.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32062 non-null  object  
 1   Position Title          32062 non-null  object  
 2   Department              32062 non-null  category
 3   Employee Annual Salary  32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 1.0+ MB
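
The saving from the category conversion can be checked directly with `memory_usage(deep=True)`. The snippet below is a minimal sketch that does not read `chicago.csv`; the `departments` Series is a hypothetical stand-in built from a few department labels repeated many times.

```python
import pandas as pd

# Hypothetical stand-in for the Department column: few unique labels, many rows.
departments = pd.Series(
    ["POLICE", "WATER MGMNT", "GENERAL SERVICES", "POLICE"] * 5000,
    dtype="object",
)

# deep=True counts the bytes of the Python strings, not just the pointers.
as_object = departments.memory_usage(deep=True)
as_category = departments.astype("category").memory_usage(deep=True)

print(f"object:   {as_object} bytes")
print(f"category: {as_category} bytes")
```

A category column stores each distinct label once plus a small integer code per row, which is why it helps most when the number of unique values is small relative to the number of rows.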


In [11]:
# Final setup: load the data, drop rows that are entirely empty, and store Department as a category

chicago = pd.read_csv('chicago.csv').dropna(how='all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


## Common String Methods
- A **Series** has a special `str` attribute that exposes an object with string methods.
- Access the `str` attribute, then invoke the string method on the nested object.
- Most method names match their Python string method equivalents (`upper`, `lower`, `title`, etc.).

In [12]:
chicago= pd.read_csv('chicago.csv').dropna(how= 'all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [16]:
'boris'.upper()
'BORIS'.lower()
'boris the animal'.title()
'men in black'.capitalize()

'Men in black'

In [39]:
chicago['Position Title'].str.lower()
chicago['Position Title'].str.upper()
chicago['Position Title'].str.title()
chicago['Position Title'].str.len()

# each of these calls returns a brand-new Series, so to chain another string method we need to access the str attribute again
chicago['Position Title'].str.title().str.len()

chicago['Position Title'].str.strip()
chicago['Position Title'].str.rstrip()
chicago['Position Title'].str.lstrip()

chicago['Department'].str.replace('MGMNT', 'MANAGEMENT')

0        WATER MANAGEMENT
1                  POLICE
2                  POLICE
3        GENERAL SERVICES
4        WATER MANAGEMENT
               ...       
32057    GENERAL SERVICES
32058              POLICE
32059              POLICE
32060              POLICE
32061                DoIT
Name: Department, Length: 32062, dtype: object

## Filtering with String Methods
- The `str.contains` method checks whether a substring exists anywhere in the string.
- The `str.startswith` method checks whether a substring exists at the start of the string.
- The `str.endswith` method checks whether a substring exists at the end of the string.

In [None]:
chicago= pd.read_csv('chicago.csv').dropna(how= 'all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()
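
The three methods above each return a Boolean **Series**, which can be passed in square brackets to filter rows. The following is a minimal sketch on a small hand-built `sample` DataFrame (an assumption for illustration, using rows shown earlier) rather than the full `chicago` dataset. Lowercasing first makes the match case-insensitive, and `regex=False` is passed because `str.contains` treats its pattern as a regular expression by default.

```python
import pandas as pd

# Hypothetical sample built from a few rows of the dataset shown earlier.
sample = pd.DataFrame({
    "Name": ["AARON, ELVIA J", "AARON, JEFFERY M", "AARON, KIMBERLEI R", "ABAD JR, VICENTE M"],
    "Position Title": ["WATER RATE TAKER", "POLICE OFFICER", "CHIEF CONTRACT EXPEDITER", "CIVIL ENGINEER IV"],
})

# Normalize casing once so the comparisons below ignore case.
titles = sample["Position Title"].str.lower()

# Each call produces a Boolean Series aligned with the DataFrame's rows.
has_water = titles.str.contains("water", regex=False)
starts_with_civil = titles.str.startswith("civil")
ends_with_iv = titles.str.endswith("iv")

# Use the Boolean Series to keep only the matching rows.
water_rows = sample[has_water]
civil_rows = sample[starts_with_civil]
iv_rows = sample[ends_with_iv]
```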

## String Methods on Index and Columns
- Use the `index` and `columns` attributes to access the **DataFrame** index/column labels.
- These objects support string methods via their own `str` attribute.

In [None]:
chicago= pd.read_csv('chicago.csv').dropna(how= 'all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()
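
As a rough illustration of the two points above, the sketch below cleans up index labels and column names with the same `str` methods used on a **Series**. The `sample` DataFrame, its padded index labels, and the choice of `Name` as the index are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample: names as the index, with stray whitespace to clean up.
sample = pd.DataFrame(
    {
        "Position Title": ["WATER RATE TAKER", "POLICE OFFICER"],
        "Department": ["WATER MGMNT", "POLICE"],
    },
    index=pd.Index(["  AARON, ELVIA J ", "AARON, JEFFERY M  "], name="Name"),
)

# Index objects expose str too; the result is a new Index, so assign it back.
sample.index = sample.index.str.strip().str.title()

# The same applies to the column labels.
sample.columns = sample.columns.str.upper()

print(sample)
```

Because the string methods return new objects rather than modifying labels in place, the result has to be assigned back to `index` or `columns` for the change to stick.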

## The split Method
- The `str.split` method splits each string at every occurrence of a delimiter. Pandas returns a **Series** of lists.
- Use the `str.get` method to access a nested list element by its index position.

In [None]:
chicago= pd.read_csv('chicago.csv').dropna(how= 'all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()
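
A minimal sketch of `str.split` followed by `str.get`, using a short hand-built `titles` Series (an assumption for illustration) in place of the `Position Title` column:

```python
import pandas as pd

# Hypothetical stand-in for chicago['Position Title'].
titles = pd.Series(["WATER RATE TAKER", "POLICE OFFICER", "CIVIL ENGINEER IV"])

# Split on a single space: each element becomes a list of words.
words = titles.str.split(" ")

# str.get pulls one element out of each list; negative positions count from the end.
first_word = words.str.get(0)
last_word = words.str.get(-1)

print(first_word)
print(last_word)
```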

## More Practice with Splits

In [None]:
chicago= pd.read_csv('chicago.csv').dropna(how= 'all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()
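
One possible practice exercise is pulling the last and first names out of the `Name` column, whose values follow a `LAST, FIRST MIDDLE` pattern in the rows shown earlier. The sketch below assumes that pattern holds and works on a small hand-built `names` Series rather than the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for chicago['Name'].
names = pd.Series(["AARON, ELVIA J", "AARON, JEFFERY M", "ABAD JR, VICENTE M"])

# Everything before the comma is the last name (including suffixes such as JR).
last_names = names.str.split(",").str.get(0)

# Everything after the comma starts with a space, so strip it before
# splitting on spaces and taking the first word as the first name.
first_names = (
    names.str.split(",").str.get(1)
    .str.strip()
    .str.split(" ").str.get(0)
)

print(last_names)
print(first_names)
```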

## The expand and n Parameters of the split Method
- Setting `expand=True` returns a **DataFrame** with one column per piece instead of a **Series** of lists.
- The `n` parameter limits the number of splits.

In [None]:
chicago= pd.read_csv('chicago.csv').dropna(how= 'all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()
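
The sketch below shows both parameters on small hand-built Series (assumed stand-ins for the `Name` and `Position Title` columns). With `expand=True` the pieces land in integer-labeled columns, and `n=1` stops after the first split so the remainder of the string stays together.

```python
import pandas as pd

# Hypothetical stand-ins for chicago['Name'] and chicago['Position Title'].
names = pd.Series(["AARON, ELVIA J", "ABAD JR, VICENTE M"])
titles = pd.Series(["WATER RATE TAKER", "CHIEF CONTRACT EXPEDITER"])

# expand=True returns a DataFrame with one column per piece (labeled 0, 1, ...).
parts = names.str.split(",", expand=True)

# n=1 performs at most one split: the first word, then everything else.
limited = titles.str.split(" ", n=1, expand=True)

print(parts)
print(limited)
```

The expanded columns can then be assigned to new column names on the original **DataFrame** if the pieces should be kept.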