# Working With Text Data

* Practitioner: Cleiber Garcia
* Date: February 1, 2023
* Objective: Practice working with text data

This notebook is based on the one written by Boris Paskhaver for the Udemy course Data Analysis with Pandas and Python (Section 7: Working with Text Data). Although the two notebooks are almost identical, I built this one step by step.

Link: https://www.udemy.com/course/data-analysis-with-pandas/

For more information, feel free to contact me at cleiber.garcia@gmail.com

## Importing pandas

In [1]:
import pandas as pd

## Loading the Data

In [3]:
chicago = pd.read_csv("chicago.csv")
chicago.head(5)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [4]:
chicago.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32063 entries, 0 to 32062
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Name                    32062 non-null  object
 1   Position Title          32062 non-null  object
 2   Department              32062 non-null  object
 3   Employee Annual Salary  32062 non-null  object
dtypes: object(4)
memory usage: 1002.1+ KB


In [5]:
# Count the unique values in the "Department" column
chicago["Department"].nunique()

35

In [6]:
# Count the unique values in each column of the chicago DataFrame
chicago.nunique()

Name                      31776
Position Title             1093
Department                   35
Employee Annual Salary     1156
dtype: int64

In [7]:
# Optimizing the column Department (changing from string to category)
chicago["Department"] = chicago["Department"].astype("category")

In [8]:
chicago.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32063 entries, 0 to 32062
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32062 non-null  object  
 1   Position Title          32062 non-null  object  
 2   Department              32062 non-null  category
 3   Employee Annual Salary  32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 784.2+ KB
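
The drop in reported memory comes from storing each distinct department name once and replacing the repeated strings with small integer codes. The sketch below illustrates the effect on a made-up Series (the values and row count are assumptions, not the real file) and uses `memory_usage(deep=True)`, which also counts the bytes of the strings themselves.

```python
import pandas as pd

# Hypothetical stand-in for the "Department" column: few distinct values, many rows
departments = pd.Series(["POLICE", "WATER MGMNT", "GENERAL SERVICES", "POLICE"] * 2500)

as_object = departments.astype("object")
as_category = departments.astype("category")

# deep=True includes the size of the Python string objects, not only the pointers
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
print("distinct categories:", len(as_category.cat.categories))
```

The saving is largest when a column has few unique values relative to its length, which is the case for "Department" (35 unique values in about 32,000 rows).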


## Common String Methods: .lower(), .upper(), .title() and .len()

In [9]:
# Load the data set and optimize memory for "Department" column
chicago = pd.read_csv("chicago.csv")
chicago["Department"] = chicago["Department"].astype("category")
chicago.head(5)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [14]:
# Capitalize the first letter of each word in "Name"
chicago["Name"] = chicago["Name"].str.title()

In [15]:
chicago

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"Aaron, Elvia J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"Aaron, Jeffery M",POLICE OFFICER,POLICE,$84450.00
2,"Aaron, Karina",POLICE OFFICER,POLICE,$84450.00
3,"Aaron, Kimberlei R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"Abad Jr, Vicente M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
...,...,...,...,...
32058,"Zygowicz, Peter J",POLICE OFFICER,POLICE,$87384.00
32059,"Zymantas, Mark E",POLICE OFFICER,POLICE,$84450.00
32060,"Zyrkowski, Carlo E",POLICE OFFICER,POLICE,$87384.00
32061,"Zyskowski, Dariusz",CHIEF DATA BASE ANALYST,DoIT,$113664.00


In [18]:
# Capitalize the first letter of each word in "Position Title"
chicago["Position Title"] = chicago["Position Title"].str.title()

In [19]:
chicago

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"Aaron, Elvia J",Water Rate Taker,WATER MGMNT,$90744.00
1,"Aaron, Jeffery M",Police Officer,POLICE,$84450.00
2,"Aaron, Karina",Police Officer,POLICE,$84450.00
3,"Aaron, Kimberlei R",Chief Contract Expediter,GENERAL SERVICES,$89880.00
4,"Abad Jr, Vicente M",Civil Engineer Iv,WATER MGMNT,$106836.00
...,...,...,...,...
32058,"Zygowicz, Peter J",Police Officer,POLICE,$87384.00
32059,"Zymantas, Mark E",Police Officer,POLICE,$84450.00
32060,"Zyrkowski, Carlo E",Police Officer,POLICE,$87384.00
32061,"Zyskowski, Dariusz",Chief Data Base Analyst,DoIT,$113664.00


In [20]:
# Determining the number of characters in Department values
chicago["Department"].str.len()

0        11.0
1         6.0
2         6.0
3        16.0
4        11.0
         ... 
32058     6.0
32059     6.0
32060     6.0
32061     4.0
32062     NaN
Name: Department, Length: 32063, dtype: float64
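
Two details of the outputs above are worth a small illustration. The lengths print as floats because the last row of the file is empty, and a missing value forces the result to a floating-point dtype. In addition, `.str.title()` lowercases everything after the first letter of each word, so a Roman numeral such as "IV" becomes "Iv". The Series below is a made-up sample, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value, stored as plain Python objects
titles = pd.Series(["CIVIL ENGINEER IV", "POLICE OFFICER", np.nan], dtype="object")

lower = titles.str.lower()        # every character lowercased
title_case = titles.str.title()   # first letter of each word capitalized
lengths = titles.str.len()        # character counts; NaN where the value is missing

print(title_case.tolist())
print(lengths.tolist())
```

String methods skip missing values and return NaN for them instead of raising an error.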

## The .str.replace() Method

In [21]:
# Load the data set and optimize memory for "Department" column
chicago = pd.read_csv("chicago.csv")
chicago["Department"] = chicago["Department"].astype("category")
chicago.head(5)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [23]:
# Extract the three bottom rows
chicago.tail(3)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
32060,"ZYRKOWSKI, CARLO E",POLICE OFFICER,POLICE,$87384.00
32061,"ZYSKOWSKI, DARIUSZ",CHIEF DATA BASE ANALYST,DoIT,$113664.00
32062,,,,


In [24]:
# Delete rows in which every value is NaN
chicago = chicago.dropna(how = "all")

In [25]:
chicago.tail(3)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
32059,"ZYMANTAS, MARK E",POLICE OFFICER,POLICE,$84450.00
32060,"ZYRKOWSKI, CARLO E",POLICE OFFICER,POLICE,$87384.00
32061,"ZYSKOWSKI, DARIUSZ",CHIEF DATA BASE ANALYST,DoIT,$113664.00


In [26]:
# Extract the column "Department"
chicago["Department"]

0             WATER MGMNT
1                  POLICE
2                  POLICE
3        GENERAL SERVICES
4             WATER MGMNT
               ...       
32057    GENERAL SERVICES
32058              POLICE
32059              POLICE
32060              POLICE
32061                DoIT
Name: Department, Length: 32062, dtype: category
Categories (35, object): ['ADMIN HEARNG', 'ANIMAL CONTRL', 'AVIATION', 'BOARD OF ELECTION', ..., 'STREETS & SAN', 'TRANSPORTN', 'TREASURER', 'WATER MGMNT']

In [27]:
# Replace the abbreviation "MGMNT" in the "Department" column with the word "MANAGEMENT"
chicago["Department"] = chicago["Department"].str.replace("MGMNT", "MANAGEMENT")

In [28]:
chicago["Department"]

0        WATER MANAGEMENT
1                  POLICE
2                  POLICE
3        GENERAL SERVICES
4        WATER MANAGEMENT
               ...       
32057    GENERAL SERVICES
32058              POLICE
32059              POLICE
32060              POLICE
32061                DoIT
Name: Department, Length: 32062, dtype: object
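
Note that the dtype in the output above is no longer `category`: `.str.replace()` returns plain strings even when it is called on a categorical column. A minimal sketch of two ways to keep the categorical dtype follows, using a made-up three-row Series. Neither is presented as the course's approach.

```python
import pandas as pd

# Hypothetical categorical column
dept = pd.Series(["WATER MGMNT", "POLICE", "WATER MGMNT"], dtype="category")

# .str.replace works on a categorical column but hands back plain strings
replaced = dept.str.replace("MGMNT", "MANAGEMENT", regex=False)

# Option 1: cast the result back to category
recast = replaced.astype("category")

# Option 2: rename the categories themselves, which never leaves the category dtype
renamed = dept.cat.rename_categories(
    lambda name: name.replace("MGMNT", "MANAGEMENT")
)

print(replaced.dtype, recast.dtype, renamed.dtype)
```

Renaming the categories touches only the distinct labels, not every row, so it tends to be the cheaper option on long columns.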

In [29]:
# Extract the column "Employee Annual Salary"
chicago["Employee Annual Salary"]

0         $90744.00
1         $84450.00
2         $84450.00
3         $89880.00
4        $106836.00
            ...    
32057     $99528.00
32058     $87384.00
32059     $84450.00
32060     $87384.00
32061    $113664.00
Name: Employee Annual Salary, Length: 32062, dtype: object

In [32]:
# Remove the $ sign from the "Employee Annual Salary" column and convert the values to float
chicago["Employee Annual Salary"] = chicago["Employee Annual Salary"].str.replace("$", "", regex = False).astype(float)
chicago.head(5)


Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MANAGEMENT,90744.0
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,84450.0
2,"AARON, KARINA",POLICE OFFICER,POLICE,84450.0
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,89880.0
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MANAGEMENT,106836.0
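
The dollar sign is a regular-expression metacharacter (it anchors the end of a string), and the default value of the `regex` parameter of `.str.replace()` has differed between pandas versions. Stating the intent explicitly avoids surprises. The sketch below uses a made-up Series of salary strings.

```python
import pandas as pd

# Hypothetical salary strings in the same format as the dataset
salaries = pd.Series(["$90744.00", "$84450.00", "$106836.00"])

# regex=False treats "$" as a literal character
cleaned = salaries.str.replace("$", "", regex=False).astype(float)

# The same result with an explicit, escaped regular expression
cleaned_regex = salaries.str.replace(r"\$", "", regex=True).astype(float)

print(cleaned.tolist())
```

Once the values are floats, numeric methods such as `.mean()`, `.nlargest()` and `.nsmallest()` become available.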


In [34]:
# Find the average "Employee Annual Salary"
chicago["Employee Annual Salary"].mean()

80204.178633899

In [35]:
# Find the ten highest Annual Salaries
chicago["Employee Annual Salary"].nlargest(10)

8184     300000.0
7954     216210.0
25532    202728.0
8924     197736.0
8042     197724.0
19208    195000.0
3706     187680.0
18556    187680.0
29466    187680.0
13754    185364.0
Name: Employee Annual Salary, dtype: float64

In [36]:
# Find the ten smallest Annual Salaries
chicago["Employee Annual Salary"].nsmallest(10)

15102       0.96
12       2756.00
27       2756.00
47       2756.00
295      2756.00
380      2756.00
686      2756.00
751      2756.00
959      2756.00
1093     2756.00
Name: Employee Annual Salary, dtype: float64

In [40]:
# Find the employee whose "Employee Annual Salary" equals $0.96
chicago[chicago["Employee Annual Salary"] == 0.96]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
15102,"KOCH, STEVEN",ADMINISTRATIVE SECRETARY,MAYOR'S OFFICE,0.96


## Filtering With String Methods

In [42]:
chicago = pd.read_csv("chicago.csv") # Load the dataset "chicago.csv"
chicago["Department"] = chicago["Department"].astype("category") # Optimize memory usage for "Department" column
chicago.dropna(how = "all", inplace = True) # Drop rows in which every value is NaN
chicago.tail(5) # Show the last 5 rows

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
32057,"ZYGADLO, MICHAEL J",FRM OF MACHINISTS - AUTOMOTIVE,GENERAL SERVICES,$99528.00
32058,"ZYGOWICZ, PETER J",POLICE OFFICER,POLICE,$87384.00
32059,"ZYMANTAS, MARK E",POLICE OFFICER,POLICE,$84450.00
32060,"ZYRKOWSKI, CARLO E",POLICE OFFICER,POLICE,$87384.00
32061,"ZYSKOWSKI, DARIUSZ",CHIEF DATA BASE ANALYST,DoIT,$113664.00


In [47]:
# Extract the rows whose "Position Title" contains "water" anywhere in the title (case-insensitive)
mask = chicago["Position Title"].str.lower().str.contains("water")
chicago[mask]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
554,"ALUISE, VINCENT G",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00
671,"ANDER, PERRY A",WATER CHEMIST II,WATER MGMNT,$82044.00
685,"ANDERSON, ANDREW J",DISTRICT SUPERINTENDENT OF WATER DISTRIBUTION,WATER MGMNT,$109272.00
702,"ANDERSON, DONALD",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00
...,...,...,...,...
29669,"VERMA, ANUPAM",MANAGING ENGINEER - WATER MANAGEMENT,WATER MGMNT,$111192.00
30239,"WASHINGTON, JOSEPH",WATER CHEMIST III,WATER MGMNT,$89676.00
30544,"WEST, THOMAS R",GEN SUPT OF WATER MANAGEMENT,WATER MGMNT,$115704.00
30991,"WILLIAMS, MATTHEW",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00
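
As an alternative to lowercasing first, `.str.contains()` accepts a `case` argument, and its `na` argument controls what the mask holds for missing values. This matters when the empty rows have not been dropped, because a mask containing NaN cannot be used for boolean indexing. A minimal sketch on made-up titles:

```python
import numpy as np
import pandas as pd

# Hypothetical titles, including mixed case and a missing value
titles = pd.Series(
    ["WATER RATE TAKER", "POLICE OFFICER", "Foreman of Water Pipe Construction", np.nan],
    dtype="object",
)

# case=False ignores letter case; na=False marks missing values as non-matches
mask = titles.str.contains("water", case=False, na=False)
matches = titles[mask]

print(matches.tolist())
```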


In [48]:
# Extract the rows for which "Position Title" value starts with "water"
mask = chicago["Position Title"].str.lower().str.startswith("water")
chicago[mask]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
671,"ANDER, PERRY A",WATER CHEMIST II,WATER MGMNT,$82044.00
1054,"ASHLEY, KARMA T",WATER CHEMIST II,WATER MGMNT,$82044.00
1079,"ATKINS, JOANNA M",WATER CHEMIST II,WATER MGMNT,$82044.00
1181,"AZEEM, MOHAMMED A",WATER CHEMIST II,WATER MGMNT,$53172.00
...,...,...,...,...
28574,"THREATT, DENISE R",WATER QUALITY INSPECTOR,WATER MGMNT,$62004.00
28602,"TIGNOR, DARRYL B",WATER RATE TAKER,WATER MGMNT,$78948.00
28955,"TRAVIS COOK, LESLIE R",WATER RATE TAKER,WATER MGMNT,$78948.00
29584,"VELAZQUEZ, JOHN",WATER RATE TAKER,WATER MGMNT,$78948.00


In [51]:
# Extract the rows for which "Position Title" value ends with "officer"
mask = chicago["Position Title"].str.lower().str.endswith("officer")
chicago[mask]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
11,"ABBATE, TERRY M",POLICE OFFICER,POLICE,$90618.00
15,"ABDALLAH, ZAID",POLICE OFFICER,POLICE,$74028.00
16,"ABDELHADI, ABDALMAHD",POLICE OFFICER,POLICE,$81588.00
...,...,...,...,...
32054,"ZYCH, MATEUSZ",POLICE OFFICER,POLICE,$46668.00
32055,"ZYDEK, BRYAN",POLICE OFFICER,POLICE,$81588.00
32058,"ZYGOWICZ, PETER J",POLICE OFFICER,POLICE,$87384.00
32059,"ZYMANTAS, MARK E",POLICE OFFICER,POLICE,$84450.00
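
Boolean masks built with string methods can be combined with `&` (and) and `|` (or) like any other pandas mask. The sketch below, on made-up titles, is only meant to illustrate the pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical titles with one missing value
titles = pd.Series(
    ["WATER RATE TAKER", "POLICE OFFICER", "WATER QUALITY INSPECTOR", np.nan],
    dtype="object",
)
upper = titles.str.upper()

starts_with_water = upper.str.startswith("WATER", na=False)
ends_with_inspector = upper.str.endswith("INSPECTOR", na=False)
ends_with_officer = upper.str.endswith("OFFICER", na=False)

# Rows that satisfy both conditions
both = titles[starts_with_water & ends_with_inspector]

# Rows that satisfy at least one condition
either = titles[starts_with_water | ends_with_officer]

print(both.tolist())
print(either.tolist())
```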


## Working With String Methods .strip(), .lstrip(), .rstrip()

In [61]:
chicago = pd.read_csv("chicago.csv") # Load the dataset "chicago.csv"
chicago["Department"] = chicago["Department"].astype("category") # Optimize memory usage for "Department" column
chicago.dropna(how = "all", inplace = True) # Drop rows in which every value is NaN
chicago.head(5) # Show the first 5 rows

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [62]:
# Example working with .lstrip() # left strip spaces
"   Hello World   ".lstrip()

'Hello World   '

In [63]:
# Example working with .rstrip() # right strip spaces
"   Hello World   ".rstrip()

'   Hello World'

In [64]:
# Example working with .strip() # both left and right strip spaces
"   Hello World   ".strip()

'Hello World'

In [65]:
# Removing left and right spaces from column "Name" values
chicago["Name"] = chicago["Name"].str.lstrip().str.rstrip()

In [66]:
# Removing left and right spaces from column "Position Title" values
chicago["Position Title"] = chicago["Position Title"].str.strip()

In [67]:
# Removing left and right spaces from column "Department" values
chicago["Department"] = chicago["Department"].str.strip()

In [68]:
chicago

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
...,...,...,...,...
32057,"ZYGADLO, MICHAEL J",FRM OF MACHINISTS - AUTOMOTIVE,GENERAL SERVICES,$99528.00
32058,"ZYGOWICZ, PETER J",POLICE OFFICER,POLICE,$87384.00
32059,"ZYMANTAS, MARK E",POLICE OFFICER,POLICE,$84450.00
32060,"ZYRKOWSKI, CARLO E",POLICE OFFICER,POLICE,$87384.00
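
Because surrounding whitespace is invisible in a printed table, the DataFrame above looks the same before and after stripping. One way to check whether stripping changed anything is to compare the column with its stripped version. This is a sketch on made-up names with deliberate padding.

```python
import pandas as pd

# Hypothetical names with leading and trailing spaces
names = pd.Series(["  AARON, ELVIA J ", "ABAD JR, VICENTE M  ", "AARON, KARINA"])

stripped = names.str.strip()     # both sides
left_only = names.str.lstrip()   # leading whitespace only
right_only = names.str.rstrip()  # trailing whitespace only

# Number of values that actually contained surrounding whitespace
changed = int((names != stripped).sum())

print(stripped.tolist())
print("values changed by strip:", changed)
```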


## String Methods on Index and Columns

In [70]:
chicago = pd.read_csv("chicago.csv", index_col = "Name") # Load the dataset "chicago.csv" and index it by "Name"
chicago["Department"] = chicago["Department"].astype("category") # Optimize memory usage for "Department" column
chicago.dropna(how = "all", inplace = True) # Drop rows in which every value is NaN
chicago.head(5) # Show the first 5 rows

Name,Position Title,Department,Employee Annual Salary
"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [74]:
# Drop spaces from the start and from the end of the index value
# Capitalize initial words of index
chicago.index = chicago.index.str.strip().str.title()
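
The `.str` accessor is available on an `Index` as well as on a `Series`, so the same methods apply to row labels and to column names. A self-contained sketch on a tiny made-up DataFrame (not the real file):

```python
import pandas as pd

# Hypothetical two-row frame indexed by padded, uppercase names
df = pd.DataFrame(
    {"Position Title": ["WATER RATE TAKER", "POLICE OFFICER"]},
    index=pd.Index(["  AARON, ELVIA J", "AARON, KARINA  "], name="Name"),
)

# Clean the row labels: remove surrounding spaces, then title-case
df.index = df.index.str.strip().str.title()

# Clean the column names: convert to uppercase
df.columns = df.columns.str.upper()

print(df)
```

The methods return a new `Index`, so the result has to be assigned back to `df.index` or `df.columns` for the change to take effect.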

In [75]:
chicago.head(5)

Name,Position Title,Department,Employee Annual Salary
"Aaron, Elvia J",WATER RATE TAKER,WATER MGMNT,$90744.00
"Aaron, Jeffery M",POLICE OFFICER,POLICE,$84450.00
"Aaron, Karina",POLICE OFFICER,POLICE,$84450.00
"Aaron, Kimberlei R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
"Abad Jr, Vicente M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [78]:
# Convert column names to uppercase
chicago.columns = chicago.columns.str.upper()
chicago.head(5)

Name,POSITION TITLE,DEPARTMENT,EMPLOYEE ANNUAL SALARY
"Aaron, Elvia J",WATER RATE TAKER,WATER MGMNT,$90744.00
"Aaron, Jeffery M",POLICE OFFICER,POLICE,$84450.00
"Aaron, Karina",POLICE OFFICER,POLICE,$84450.00
"Aaron, Kimberlei R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
"Abad Jr, Vicente M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


## Splitting Characters with .str.split() Method

In [80]:
chicago = pd.read_csv("chicago.csv") # Load the dataset "chicago.csv"
chicago["Department"] = chicago["Department"].astype("category") # Optimize memory usage for "Department" column
chicago.dropna(how = "all", inplace = True) # Drop rows in which every value is NaN
chicago.head(5) # Show the first 5 rows

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [81]:
# Split words from the string below
"Today it is a beautiful day".split(" ")

['Today', 'it', 'is', 'a', 'beautiful', 'day']

In [88]:
# Find the most common last names in the chicago DataFrame
chicago["Name"].str.split(",").str.get(0).str.title().value_counts()

Williams     293
Johnson      244
Smith        241
Brown        185
Jones        183
            ... 
Horkavy        1
Horn           1
Horne Jr       1
Horner         1
Zyskowski      1
Name: Name, Length: 13829, dtype: int64

In [96]:
# Count the first word of "Position Title" values
chicago["Position Title"].str.split(" ").str.get(0).value_counts()

POLICE             10856
FIREFIGHTER-EMT     1509
SERGEANT            1186
POOL                 918
FIREFIGHTER          810
                   ...  
DENTIST                1
ASSOC                  1
TELEPHONE              1
MAYOR                  1
PREPRESS               1
Name: Position Title, Length: 320, dtype: int64

In [97]:
sum(chicago["Position Title"].str.split(" ").str.get(0).value_counts())

32062

In [118]:
# Find the 5 most popular first names of employees in the chicago DataFrame
chicago["Name"].str.split(",").str.get(1).str.strip().str.split(" ").str.get(0).value_counts().head(5)

MICHAEL    1153
JOHN        899
JAMES       676
ROBERT      622
JOSEPH      537
Name: Name, dtype: int64

In [112]:
# Find the most popular first names of employees in the chicago DataFrame, step by step
chicago["Name"].str.split(",")

0            [AARON,   ELVIA J]
1          [AARON,   JEFFERY M]
2             [AARON,   KARINA]
3        [AARON,   KIMBERLEI R]
4        [ABAD JR,   VICENTE M]
                  ...          
32057    [ZYGADLO,   MICHAEL J]
32058     [ZYGOWICZ,   PETER J]
32059      [ZYMANTAS,   MARK E]
32060    [ZYRKOWSKI,   CARLO E]
32061    [ZYSKOWSKI,   DARIUSZ]
Name: Name, Length: 32062, dtype: object

In [113]:
chicago["Name"].str.split(",").str.get(1)

0              ELVIA J
1            JEFFERY M
2               KARINA
3          KIMBERLEI R
4            VICENTE M
             ...      
32057        MICHAEL J
32058          PETER J
32059           MARK E
32060          CARLO E
32061          DARIUSZ
Name: Name, Length: 32062, dtype: object

In [114]:
chicago["Name"].str.split(",").str.get(1).str.strip()

0            ELVIA J
1          JEFFERY M
2             KARINA
3        KIMBERLEI R
4          VICENTE M
            ...     
32057      MICHAEL J
32058        PETER J
32059         MARK E
32060        CARLO E
32061        DARIUSZ
Name: Name, Length: 32062, dtype: object

In [115]:
chicago["Name"].str.split(",").str.get(1).str.strip().str.split(" ")

0            [ELVIA, J]
1          [JEFFERY, M]
2              [KARINA]
3        [KIMBERLEI, R]
4          [VICENTE, M]
              ...      
32057      [MICHAEL, J]
32058        [PETER, J]
32059         [MARK, E]
32060        [CARLO, E]
32061         [DARIUSZ]
Name: Name, Length: 32062, dtype: object

In [116]:
chicago["Name"].str.split(",").str.get(1).str.strip().str.split(" ").str.get(0)

0            ELVIA
1          JEFFERY
2           KARINA
3        KIMBERLEI
4          VICENTE
           ...    
32057      MICHAEL
32058        PETER
32059         MARK
32060        CARLO
32061      DARIUSZ
Name: Name, Length: 32062, dtype: object

In [119]:
chicago["Name"].str.split(",").str.get(1).str.strip().str.split(" ").str.get(0).value_counts()

MICHAEL     1153
JOHN         899
JAMES        676
ROBERT       622
JOSEPH       537
            ... 
DEENA          1
CHERRISE       1
EARTHA         1
ERNIKA         1
MAC            1
Name: Name, Length: 5091, dtype: int64

In [120]:
chicago["Name"].str.split(",").str.get(1).str.strip().str.split(" ").str.get(0).value_counts().head(5)

MICHAEL    1153
JOHN        899
JAMES       676
ROBERT      622
JOSEPH      537
Name: Name, dtype: int64
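
The step-by-step chain above can be reproduced on a handful of made-up names to see what each stage contributes. The names below are assumptions that follow the "LAST, FIRST MIDDLE" layout of the dataset.

```python
import pandas as pd

# Hypothetical names in the dataset's "LAST,  FIRST M" layout
names = pd.Series(
    ["AARON,  ELVIA J", "AARON,  KARINA", "ABAD JR,  VICENTE M", "SMITH,  KARINA"]
)

first_names = (
    names.str.split(",")   # ["AARON", "  ELVIA J"]
         .str.get(1)       # "  ELVIA J"  (text after the comma)
         .str.strip()      # "ELVIA J"
         .str.split(" ")   # ["ELVIA", "J"]
         .str.get(0)       # "ELVIA"
)

counts = first_names.value_counts()
print(counts.head(5))
```

Wrapping the chain in parentheses allows one method per line, which makes long chains easier to read and to comment.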

## The expand and n Parameters of the .str.split() Method

In [121]:
chicago = pd.read_csv("chicago.csv") # Load the dataset "chicago.csv"
chicago["Department"] = chicago["Department"].astype("category") # Optimize memory usage for "Department" column
chicago.dropna(how = "all", inplace = True) # Drop rows in which every value is NaN
chicago.head(5) # Show the first 5 rows

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [123]:
chicago["Name"].str.split(",", expand = True)

Unnamed: 0,0,1
0,AARON,ELVIA J
1,AARON,JEFFERY M
2,AARON,KARINA
3,AARON,KIMBERLEI R
4,ABAD JR,VICENTE M
...,...,...
32057,ZYGADLO,MICHAEL J
32058,ZYGOWICZ,PETER J
32059,ZYMANTAS,MARK E
32060,ZYRKOWSKI,CARLO E


In [126]:
# Build two new columns: "Last Name" and "First Name"
chicago[["Last Name", "First Name"]] = chicago["Name"].str.split(",", expand = True)
chicago.head(5)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Last Name,First Name
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,AARON,ELVIA J
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,AARON,JEFFERY M
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,AARON,KARINA
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,AARON,KIMBERLEI R
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,ABAD JR,VICENTE M


In [129]:
chicago["Position Title"].str.split(" ", expand = True)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,WATER,RATE,TAKER,,,,,,
1,POLICE,OFFICER,,,,,,,
2,POLICE,OFFICER,,,,,,,
3,CHIEF,CONTRACT,EXPEDITER,,,,,,
4,CIVIL,ENGINEER,IV,,,,,,
...,...,...,...,...,...,...,...,...,...
32057,FRM,OF,MACHINISTS,-,AUTOMOTIVE,,,,
32058,POLICE,OFFICER,,,,,,,
32059,POLICE,OFFICER,,,,,,,
32060,POLICE,OFFICER,,,,,,,


In [130]:
chicago["Position Title"].str.split(" ", expand = True, n = 1) # split one time

Unnamed: 0,0,1
0,WATER,RATE TAKER
1,POLICE,OFFICER
2,POLICE,OFFICER
3,CHIEF,CONTRACT EXPEDITER
4,CIVIL,ENGINEER IV
...,...,...
32057,FRM,OF MACHINISTS - AUTOMOTIVE
32058,POLICE,OFFICER
32059,POLICE,OFFICER
32060,POLICE,OFFICER


In [132]:
chicago[["First Word PosTit", "Remaining Words PosTit"]] = chicago["Position Title"].str.split(" ", expand = True, n = 1)
chicago.head(5)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Last Name,First Name,First Word PosTit,Remaining Words PosTit
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,AARON,ELVIA J,WATER,RATE TAKER
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,AARON,JEFFERY M,POLICE,OFFICER
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,AARON,KARINA,POLICE,OFFICER
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,AARON,KIMBERLEI R,CHIEF,CONTRACT EXPEDITER
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,ABAD JR,VICENTE M,CIVIL,ENGINEER IV
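
To summarize the two parameters: `expand = True` returns a DataFrame with one column per piece instead of a Series of lists, and `n` caps the number of splits, which fixes the number of resulting columns. The sketch below uses made-up titles, and the column names are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical titles with one, two and three words
titles = pd.Series(["WATER RATE TAKER", "POLICE OFFICER", "MAYOR"])

# Without n: as many columns as the longest title needs
full = titles.str.split(" ", expand=True)

# With n=1: at most one split, so at most two columns
parts = titles.str.split(" ", n=1, expand=True)
parts.columns = ["First Word", "Remaining Words"]

print(full.shape)
print(parts)
```

Rows with fewer pieces than columns are padded with missing values, as the one-word title shows.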
