# Working with Text Data

In [3]:
import pandas as pd

## This Module's Dataset
____
- This module's dataset (`chicago.csv`) is a collection of public sector employees in the city of Chicago.
- Each row includes the employee's name, position, department, and salary.

In [None]:
# Load the data, and remove rows where all the values are missing
chicago = pd.read_csv("chicago.csv").dropna(how="all")
chicago.head()
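As a rough illustration of what `how="all"` does compared with the default `how="any"`, here is a small sketch on a made-up DataFrame. The `toy` frame and its values are assumptions for demonstration only and are not taken from `chicago.csv`.

```python
import pandas as pd

# Hypothetical toy data: row 1 has one missing value, row 2 is entirely missing
toy = pd.DataFrame(
    {
        "Name": ["AARON, ELVIA J", "AARON, KARINA", None],
        "Department": ["WATER MGMNT", None, None],
    }
)

only_empty_rows_removed = toy.dropna(how="all")  # drops row 2 only
any_missing_removed = toy.dropna(how="any")      # drops rows 1 and 2

print(len(only_empty_rows_removed), len(any_missing_removed))
```

With `how="all"`, a row that is only partially filled in is kept, which is usually what we want when a file ends with a blank line.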

In [None]:
# check the summary of the dataframe
chicago.info()

In [None]:
# count the number of unique values in each column
chicago.nunique()
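The difference between counting unique values per column and listing the unique values of a single column can be sketched as follows. The `toy` DataFrame is a hypothetical stand-in for the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for chicago.csv
toy = pd.DataFrame(
    {
        "Position Title": ["POLICE OFFICER", "POLICE OFFICER", "WATER CHEMIST II"],
        "Department": ["POLICE", "POLICE", "WATER MGMNT"],
    }
)

counts_per_column = toy.nunique()               # Series: one count per column
department_count = toy["Department"].nunique()  # a single integer
department_values = toy["Department"].unique()  # the distinct values themselves

print(counts_per_column)
print(department_count, list(department_values))
```

A column whose unique count is small relative to the number of rows is a good candidate for the category data type.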

Since the `Department` column is categorical (a small set of values repeated across many rows), we can convert it to the `category` data type with the `astype` method. This stores the data more efficiently, which we can confirm by comparing the memory usage reported by `info` before and after the conversion.

In [16]:
# convert the department column to category

chicago["Department"] = chicago["Department"].astype("category")

In [None]:
# check the summary of the dataframe after converting the department column to category
# confirm that the reported memory usage is lower than before the conversion
chicago.info()
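The memory saving can also be measured directly rather than read off the `info` summary. The sketch below uses a made-up column of repeated department names (an assumption, not the real data) and compares the byte counts returned by `memory_usage(deep=True)`.

```python
import pandas as pd

# Hypothetical column with few distinct values repeated many times
departments = pd.Series(["POLICE", "WATER MGMNT", "FINANCE"] * 1000)

bytes_as_text = departments.memory_usage(deep=True)
bytes_as_category = departments.astype("category").memory_usage(deep=True)

print(bytes_as_text, bytes_as_category)
```

The exact numbers depend on the pandas version and the string storage in use, so no specific figures are quoted here; the category version should simply be the smaller of the two.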

Combining all the steps together, we can write the following code:

*  load the dataset and drop rows where all the values are missing
*  convert the `Department` column to a category data type
*  display the first few rows of the dataset

In [None]:
chicago = pd.read_csv("chicago.csv").dropna(how="all")
chicago["Department"] = chicago["Department"].astype("category")
chicago.head()

## Common String Methods
___
- A **Series** has a special `str` attribute that exposes an object with string methods.
- Access the `str` attribute, then invoke the string method on the nested object.
- Most method names match their Python string method equivalents (`upper`, `lower`, `title`, etc.).

In [None]:
# Converts all characters in the "Position Title" column to lowercase
chicago["Position Title"].str.lower()

# Converts all characters in the "Position Title" column to uppercase
chicago["Position Title"].str.upper()

# Converts the first character of each word in the "Position Title" column to uppercase (title case)
chicago["Position Title"].str.title()

# Calculates the length (number of characters) of each string in the "Position Title" column
chicago["Position Title"].str.len()

# Converts the "Position Title" column to title case, then calculates the length of each string
chicago["Position Title"].str.title().str.len()

# Removes leading and trailing whitespace from each string in the "Position Title" column
chicago["Position Title"].str.strip()

# Removes leading whitespace from each string in the "Position Title" column
chicago["Position Title"].str.lstrip()

# Removes trailing whitespace from each string in the "Position Title" column
chicago["Position Title"].str.rstrip()

# Replaces occurrences of "MGMNT" with "MANAGEMENT" in the "Department" column, 
# then converts the modified strings to title case
chicago["Department"].str.replace("MGMNT", "MANAGEMENT").str.title()
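One detail worth seeing in isolation is how chained `str` methods behave when a value is missing. The following is a minimal sketch on a hypothetical Series; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical values: stray whitespace, an abbreviation, and a missing entry
departments = pd.Series(["  water mgmnt ", "POLICE", None])

cleaned = (
    departments.str.strip()
    .str.upper()
    .str.replace("MGMNT", "MANAGEMENT", regex=False)  # literal text, not a regex
    .str.title()
)

print(cleaned.tolist())
```

The missing entry stays missing at every step instead of raising an error, so a chain like this can usually be applied to a column without filtering out missing values first.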


## Filtering with String Methods
___
- The `str.contains` method checks whether a substring exists anywhere in the string.
- The `str.startswith` method checks whether a substring exists at the start of the string.
- The `str.endswith` method checks whether a substring exists at the end of the string.

In [31]:
chicago = pd.read_csv("chicago.csv").dropna(how="all")
chicago["Department"] = chicago["Department"].astype("category")
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


If we want to filter the dataset to only the employees whose position title contains the word "water", we can use the `str.contains` method to create a boolean mask, then use that mask to filter the dataset. Lowercasing the titles first makes the match case-insensitive.

In [32]:
water_workers = chicago["Position Title"].str.lower().str.contains("water")
chicago[water_workers]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
554,"ALUISE, VINCENT G",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00
671,"ANDER, PERRY A",WATER CHEMIST II,WATER MGMNT,$82044.00
685,"ANDERSON, ANDREW J",DISTRICT SUPERINTENDENT OF WATER DISTRIBUTION,WATER MGMNT,$109272.00
702,"ANDERSON, DONALD",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00
...,...,...,...,...
29669,"VERMA, ANUPAM",MANAGING ENGINEER - WATER MANAGEMENT,WATER MGMNT,$111192.00
30239,"WASHINGTON, JOSEPH",WATER CHEMIST III,WATER MGMNT,$89676.00
30544,"WEST, THOMAS R",GEN SUPT OF WATER MANAGEMENT,WATER MGMNT,$115704.00
30991,"WILLIAMS, MATTHEW",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00
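As an alternative to lowercasing first, `str.contains` accepts `case` and `na` parameters. The sketch below assumes a small hypothetical Series of titles with one missing value.

```python
import pandas as pd

# Hypothetical titles, including a missing value
titles = pd.Series(["WATER RATE TAKER", "Police Officer", "Water Chemist II", None])

# case=False ignores letter case; na=False treats missing titles as non-matches
mask = titles.str.contains("water", case=False, na=False)

print(titles[mask].tolist())
```

Passing `na=False` keeps the mask purely boolean, which avoids errors when a column with missing values is used for filtering.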


Filtering the dataset based on a prefix or suffix can be done using the `str.startswith` and `str.endswith` methods.

In [33]:
starts_with_civil = chicago["Position Title"].str.lower().str.startswith("civil")
chicago.loc[starts_with_civil]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
25,"ABDULSATTAR, MUDHAR",CIVIL ENGINEER II,WATER MGMNT,$58536.00
34,"ABRAHAM, GIRLEY T",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
55,"ABUTALEB, AHMAD H",CIVIL ENGINEER II,WATER MGMNT,$89676.00
147,"ADAMS, TANERA C",CIVIL ENGINEER IV,TRANSPORTN,$106836.00
...,...,...,...,...
31623,"YANG, LUYANG",CIVIL ENGINEER V,TRANSPORTN,$116784.00
31656,"YEPEZ, JESUS",CIVIL ENGINEER IV,TRANSPORTN,$106836.00
31662,"YESUFU, STEPHANIE A",CIVIL ENGINEER III,TRANSPORTN,$92784.00
31797,"ZAKE, JOSHUA S",CIVIL ENGINEER IV,TRANSPORTN,$106836.00


In [34]:
ends_with_iv = chicago["Position Title"].str.lower().str.endswith("iv")
chicago[ends_with_iv]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
34,"ABRAHAM, GIRLEY T",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
145,"ADAMS, SHERYLL A",LIBRARIAN IV,PUBLIC LIBRARY,$97812.00
147,"ADAMS, TANERA C",CIVIL ENGINEER IV,TRANSPORTN,$106836.00
166,"ADENI, MOHAMED K",ACCOUNTANT IV,FINANCE,$97812.00
...,...,...,...,...
31777,"ZAFIRIS, CHRISTOPHER",ARCHITECT IV,DISABILITIES,$106836.00
31797,"ZAKE, JOSHUA S",CIVIL ENGINEER IV,TRANSPORTN,$106836.00
31870,"ZAVALA, FERNANDO",ACCOUNTANT IV,FINANCE,$97812.00
31884,"ZAWADSKI, JAMES",CLERK IV,LAW,$68028.00
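Boolean masks built from `str.startswith` and `str.endswith` can be combined with `&` (and) and `|` (or). Here is a minimal sketch on hypothetical titles; matching on `" iv"` with a leading space is an illustrative choice so that only a separate trailing word "IV" counts.

```python
import pandas as pd

# Hypothetical position titles
titles = pd.Series(
    ["CIVIL ENGINEER IV", "CIVIL ENGINEER II", "LIBRARIAN IV", "POLICE OFFICER"]
)

lower = titles.str.lower()
starts_with_civil = lower.str.startswith("civil")
ends_with_iv = lower.str.endswith(" iv")

civil_and_iv = titles[starts_with_civil & ends_with_iv]  # both conditions hold
civil_or_iv = titles[starts_with_civil | ends_with_iv]   # at least one holds

print(civil_and_iv.tolist())
print(civil_or_iv.tolist())
```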


## String Methods on Index and Columns
___
- Use the `index` and `columns` attributes to access the **DataFrame** index/column labels.
- These objects support string methods via their own `str` attribute.

In [10]:
# loading the data 
# setting the index to the Name column
# removing rows where all the values are missing
# sorting the index
# converting the Department column to category


chicago = pd.read_csv("chicago.csv", index_col="Name").dropna(how="all").sort_index()
chicago["Department"] = chicago["Department"].astype("category")
chicago.head()

Name,Position Title,Department,Employee Annual Salary
"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


The `Name` column in the dataset is now the index, so we can use the `str.strip` and `str.title` methods on the index to remove surrounding whitespace and convert all the names to title case.

In [14]:
# remove all the leading and trailing whitespaces from the index
# convert the index to title case
chicago.index = chicago.index.str.strip().str.title()
chicago.head()

Name,Position Title,Department,Employee Annual Salary
"Aaron, Elvia J",WATER RATE TAKER,WATER MGMNT,$90744.00
"Aaron, Jeffery M",POLICE OFFICER,POLICE,$84450.00
"Aaron, Karina",POLICE OFFICER,POLICE,$84450.00
"Aaron, Kimberlei R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
"Abad Jr, Vicente M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [15]:
# convert all the column names to uppercase
chicago.columns = chicago.columns.str.upper()
chicago.head()

Name,POSITION TITLE,DEPARTMENT,EMPLOYEE ANNUAL SALARY
"Aaron, Elvia J",WATER RATE TAKER,WATER MGMNT,$90744.00
"Aaron, Jeffery M",POLICE OFFICER,POLICE,$84450.00
"Aaron, Karina",POLICE OFFICER,POLICE,$84450.00
"Aaron, Kimberlei R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
"Abad Jr, Vicente M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
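A common use of string methods on `columns` is normalizing the labels so they are easier to type. The sketch below assumes a hypothetical DataFrame whose column labels carry stray whitespace.

```python
import pandas as pd

# Hypothetical frame with untidy column labels
toy = pd.DataFrame(
    {" Position Title": ["POLICE OFFICER"], "Employee Annual Salary ": ["$84450.00"]}
)

# Strip whitespace, lowercase, and replace inner spaces with underscores
toy.columns = toy.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)

print(list(toy.columns))
```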


## The split Method
___
- The `str.split` method splits a string by the occurrence of a delimiter. Pandas returns a **Series** of lists.
- Use the `str.get` method to access a nested list element by its index position.

In [4]:
chicago = pd.read_csv("chicago.csv").dropna(how="all")
chicago["Department"] = chicago["Department"].astype("category")
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [13]:
"Water Rate Taker".split()

['Water', 'Rate', 'Taker']

In [21]:
# Split each title into a list of words
chicago["Position Title"].str.split(" ")

0                        [WATER, RATE, TAKER]
1                           [POLICE, OFFICER]
2                           [POLICE, OFFICER]
3                [CHIEF, CONTRACT, EXPEDITER]
4                       [CIVIL, ENGINEER, IV]
                         ...                 
32057    [FRM, OF, MACHINISTS, -, AUTOMOTIVE]
32058                       [POLICE, OFFICER]
32059                       [POLICE, OFFICER]
32060                       [POLICE, OFFICER]
32061            [CHIEF, DATA, BASE, ANALYST]
Name: Position Title, Length: 32062, dtype: object

In [22]:
# Get the first word of each title
chicago["Position Title"].str.split(" ").str.get(0)

0         WATER
1        POLICE
2        POLICE
3         CHIEF
4         CIVIL
          ...  
32057       FRM
32058    POLICE
32059    POLICE
32060    POLICE
32061     CHIEF
Name: Position Title, Length: 32062, dtype: object

In [23]:
# Get the second word of each title
chicago["Position Title"].str.split(" ").str.get(1)

0            RATE
1         OFFICER
2         OFFICER
3        CONTRACT
4        ENGINEER
           ...   
32057          OF
32058     OFFICER
32059     OFFICER
32060     OFFICER
32061        DATA
Name: Position Title, Length: 32062, dtype: object
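Not every title has the same number of words, so it helps to know what `str.get` does when a position does not exist. This is a small sketch on hypothetical titles.

```python
import pandas as pd

# Hypothetical titles with three, two, and one word
titles = pd.Series(["WATER RATE TAKER", "POLICE OFFICER", "MAYOR"])
words = titles.str.split(" ")

last_word = words.str.get(-1)  # negative positions count from the end
third_word = words.str.get(2)  # missing where a title has fewer than three words

print(last_word.tolist())
print(third_word.tolist())
```

An out-of-range position produces a missing value rather than an error, which is why the chained calls above run cleanly over the whole column.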

In [24]:
# The most common first word in our job positions/titles
chicago["Position Title"].str.split(" ").str.get(0).value_counts()

Position Title
POLICE             10856
FIREFIGHTER-EMT     1509
SERGEANT            1186
POOL                 918
FIREFIGHTER          810
                   ...  
DENTIST                1
ASSOC                  1
TELEPHONE              1
MAYOR                  1
PREPRESS               1
Name: count, Length: 320, dtype: int64

## More Practice with Splits
___

In [27]:
chicago = pd.read_csv("chicago.csv").dropna(how="all")
chicago["Department"] = chicago["Department"].astype("category")
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [28]:
# Finding the most common first name among the employees

chicago["Name"].str.title().str.split(", ").str.get(1).str.strip().str.split(" ").str.get(0).value_counts()

Name
Michael     1153
John         899
James        676
Robert       622
Joseph       537
            ... 
Deena          1
Cherrise       1
Eartha         1
Ernika         1
Mac            1
Name: count, Length: 5091, dtype: int64
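The long chain above is easier to follow when broken into intermediate steps. Here is a sketch on three names in the same "LAST, FIRST MIDDLE" format; the variable names are illustrative.

```python
import pandas as pd

# A few names in the same "LAST, FIRST MIDDLE" format as the dataset
names = pd.Series(["AARON, ELVIA J", "ABAD JR, VICENTE M", "AARON, KARINA"])

# Step 1: title-case, split on the comma, keep the part after it
after_comma = names.str.title().str.split(", ").str.get(1)

# Step 2: trim whitespace, split on spaces, keep the first word
first_names = after_comma.str.strip().str.split(" ").str.get(0)

print(first_names.value_counts())
```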

## The expand and n Parameters of the split Method
____
- The `expand` parameter returns a **DataFrame** instead of a **Series** of lists.
- The `n` parameter limits the number of splits.

In [78]:
chicago = pd.read_csv("chicago.csv").dropna(how="all")
chicago["Department"] = chicago["Department"].astype("category")
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [77]:
chicago["Name"].str.split(",", expand=True)

Unnamed: 0,0,1
0,AARON,ELVIA J
1,AARON,JEFFERY M
2,AARON,KARINA
3,AARON,KIMBERLEI R
4,ABAD JR,VICENTE M
...,...,...
32057,ZYGADLO,MICHAEL J
32058,ZYGOWICZ,PETER J
32059,ZYMANTAS,MARK E
32060,ZYRKOWSKI,CARLO E


In [82]:
# the content of the column is split into two columns
# the first column contains the last name
# the second column contains the first name and any middle initial
# (with a leading space, because the split is on "," alone)

chicago[['Last Name', 'First Name']] = chicago["Name"].str.split(",", expand=True)
chicago.head()


Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Last Name,First Name
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,AARON,ELVIA J
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,AARON,JEFFERY M
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,AARON,KARINA
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,AARON,KIMBERLEI R
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,ABAD JR,VICENTE M
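Splitting on `","` alone leaves a leading space on the second part, which is easy to miss in the rendered table. A sketch of the difference, using two names from the output above:

```python
import pandas as pd

names = pd.Series(["AARON, ELVIA J", "ABAD JR, VICENTE M"])

raw = names.str.split(",", expand=True)     # second column keeps a leading space
clean = names.str.split(", ", expand=True)  # separator includes the space

print(repr(raw.loc[0, 1]), repr(clean.loc[0, 1]))
```

Including the space in the separator, or calling `str.strip` on the new column afterwards, are both reasonable ways to avoid the stray whitespace.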


In [None]:
# limit the number of splits to 1
chicago[["Primary Title", "Secondary Title"]] = chicago["Position Title"].str.split(" ", expand=True, n=1)
chicago


Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Last Name,First Name,Primary Title,Secondary Title
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,AARON,ELVIA J,WATER,RATE TAKER
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,AARON,JEFFERY M,POLICE,OFFICER
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,AARON,KARINA,POLICE,OFFICER
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,AARON,KIMBERLEI R,CHIEF,CONTRACT EXPEDITER
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,ABAD JR,VICENTE M,CIVIL,ENGINEER IV
...,...,...,...,...,...,...,...,...
32057,"ZYGADLO, MICHAEL J",FRM OF MACHINISTS - AUTOMOTIVE,GENERAL SERVICES,$99528.00,ZYGADLO,MICHAEL J,FRM,OF MACHINISTS - AUTOMOTIVE
32058,"ZYGOWICZ, PETER J",POLICE OFFICER,POLICE,$87384.00,ZYGOWICZ,PETER J,POLICE,OFFICER
32059,"ZYMANTAS, MARK E",POLICE OFFICER,POLICE,$84450.00,ZYMANTAS,MARK E,POLICE,OFFICER
32060,"ZYRKOWSKI, CARLO E",POLICE OFFICER,POLICE,$87384.00,ZYRKOWSKI,CARLO E,POLICE,OFFICER
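With `n=1` the result always has at most two columns, but a title made of a single word has nothing to put in the second one. The sketch below uses hypothetical titles to show what happens in that case.

```python
import pandas as pd

# Hypothetical titles with three, two, and one word
titles = pd.Series(["WATER RATE TAKER", "POLICE OFFICER", "MAYOR"])

# Split once, at the first space only
parts = titles.str.split(" ", n=1, expand=True)
parts.columns = ["Primary Title", "Secondary Title"]

print(parts)
```

The one-word title gets a missing value in "Secondary Title", so downstream code should be prepared for missing entries in that column.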


In [93]:
list_a = [1, 2, 3, 4]
list_b = ["a", "b", "c"]

# append adds list_b as a single nested element, so the length grows by 1
list_a.append(list_b)

print(len(list_a))

5
