# Working with Text Data

In [1]:
import pandas as pd

## This Module's Dataset
- This module's dataset (`chicago.csv`) is a collection of public sector employees in the city of Chicago.
- Each row includes the employee's name, position title, department, and salary.

In [3]:
chicago= pd.read_csv('chicago.csv').dropna(how='all')
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [4]:
chicago.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Name                    32062 non-null  object
 1   Position Title          32062 non-null  object
 2   Department              32062 non-null  object
 3   Employee Annual Salary  32062 non-null  object
dtypes: object(4)
memory usage: 1.2+ MB


In [6]:
chicago.nunique()

Name                      31776
Position Title             1093
Department                   35
Employee Annual Salary     1156
dtype: int64

In [8]:
# Department has only 35 unique values across 32,062 rows, so converting it to the category dtype reduces the DataFrame's memory usage
chicago['Department']= chicago['Department'].astype('category')

In [10]:
chicago.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32062 non-null  object  
 1   Position Title          32062 non-null  object  
 2   Department              32062 non-null  category
 3   Employee Annual Salary  32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 1.0+ MB
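
The memory saving from the category dtype can be checked on a small stand-alone example. This is only a sketch: the `departments` Series below is made-up data that mimics a column with few unique values, not the real `chicago.csv`.

```python
import pandas as pd

# Hypothetical data: three department names repeated many times,
# mimicking a column with very few unique values.
departments = pd.Series(["POLICE", "FINANCE", "WATER MGMNT"] * 10_000)

as_object = departments.astype("object")
as_category = departments.astype("category")

# deep=True also counts the string objects themselves, not only the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```

The category version stores each distinct string once plus a small integer code per row, which is why it pays off when the number of unique values is low relative to the number of rows.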


In [11]:
# Final loading recipe: drop fully empty rows and store Department as a category

chicago= pd.read_csv('chicago.csv').dropna(how= 'all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


## Common String Methods
- A **Series** has a special `str` attribute that exposes an object with string methods.
- Access the `str` attribute, then invoke the string method on the nested object.
- Most method names match their plain Python string equivalents (`upper`, `lower`, `title`, etc.).

In [12]:
chicago= pd.read_csv('chicago.csv').dropna(how= 'all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [16]:
'boris'.upper()
'BORIS'.lower()
'boris the animal'.title()
'men in black'.capitalize()

'Men in black'

In [39]:
chicago['Position Title'].str.lower()
chicago['Position Title'].str.upper()
chicago['Position Title'].str.title()
chicago['Position Title'].str.len()

# Each call returns a new Series, so chaining another string method requires accessing the str attribute again
chicago['Position Title'].str.title().str.len()

chicago['Position Title'].str.strip()
chicago['Position Title'].str.rstrip()
chicago['Position Title'].str.lstrip()

chicago['Department'].str.replace('MGMNT', 'MANAGEMENT')

0        WATER MANAGEMENT
1                  POLICE
2                  POLICE
3        GENERAL SERVICES
4        WATER MANAGEMENT
               ...       
32057    GENERAL SERVICES
32058              POLICE
32059              POLICE
32060              POLICE
32061                DoIT
Name: Department, Length: 32062, dtype: object
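
The chaining rule is easier to see on a tiny Series whose results can be checked by eye. The sketch below uses a hand-made `titles` Series (an assumption for illustration, not rows from the dataset) and passes `regex=False` so the replacement is treated as plain text.

```python
import pandas as pd

# Hand-made sample with inconsistent case and stray whitespace.
titles = pd.Series(["  police officer ", "CIVIL ENGINEER IV", "water mgmnt clerk"])

# Each .str call returns a new Series, so .str must be accessed again to chain.
cleaned = titles.str.strip().str.title()

# Length of each cleaned string.
lengths = cleaned.str.len()

# Literal (non-regex) substring replacement.
expanded = cleaned.str.replace("Mgmnt", "Management", regex=False)

print(cleaned.tolist())
print(lengths.tolist())
print(expanded.tolist())
```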

## Filtering with String Methods
- The `str.contains` method checks whether a substring exists anywhere in the string.
- The `str.startswith` method checks whether a substring exists at the start of the string.
- The `str.endswith` method checks whether a substring exists at the end of the string.

In [40]:
chicago= pd.read_csv('chicago.csv').dropna(how= 'all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


- Filtering works exactly as in earlier sections of the course: build a Boolean **Series** and use it to select the rows where the value is `True`. The string methods covered in this module, both the earlier ones and the three above, are simply a convenient way to build that Boolean **Series**.

In [42]:
water_workers= chicago['Position Title'].str.lower().str.contains('water')
chicago[water_workers].head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
554,"ALUISE, VINCENT G",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00
671,"ANDER, PERRY A",WATER CHEMIST II,WATER MGMNT,$82044.00
685,"ANDERSON, ANDREW J",DISTRICT SUPERINTENDENT OF WATER DISTRIBUTION,WATER MGMNT,$109272.00
702,"ANDERSON, DONALD",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00


In [44]:
starts_with_civil= chicago['Position Title'].str.lower().str.startswith('civil')
chicago[starts_with_civil].head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
25,"ABDULSATTAR, MUDHAR",CIVIL ENGINEER II,WATER MGMNT,$58536.00
34,"ABRAHAM, GIRLEY T",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
55,"ABUTALEB, AHMAD H",CIVIL ENGINEER II,WATER MGMNT,$89676.00
147,"ADAMS, TANERA C",CIVIL ENGINEER IV,TRANSPORTN,$106836.00


In [46]:
ends_with_iv= chicago['Position Title'].str.lower().str.endswith('iv')
chicago[ends_with_iv].tail()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
31777,"ZAFIRIS, CHRISTOPHER",ARCHITECT IV,DISABILITIES,$106836.00
31797,"ZAKE, JOSHUA S",CIVIL ENGINEER IV,TRANSPORTN,$106836.00
31870,"ZAVALA, FERNANDO",ACCOUNTANT IV,FINANCE,$97812.00
31884,"ZAWADSKI, JAMES",CLERK IV,LAW,$68028.00
32008,"ZOTTA, SANDINO",MECHANICAL ENGINEER IV,WATER MGMNT,$106836.00
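
One practical detail when filtering with string methods is missing values: a missing title produces a missing value in the mask, which cannot be used directly for row selection. A minimal sketch, assuming a small made-up `staff` DataFrame with one missing title, uses the `na=False` argument so that missing entries count as "no match".

```python
import pandas as pd

# Made-up sample; the third row has no position title.
staff = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Position Title": [
        "Water Chemist II",
        "CIVIL ENGINEER IV",
        None,
        "Foreman of water pipe construction",
    ],
})

lowered = staff["Position Title"].str.lower()

# na=False makes missing titles evaluate to False instead of a missing value.
has_water = lowered.str.contains("water", na=False)
starts_civil = lowered.str.startswith("civil", na=False)

print(staff[has_water])
print(staff[starts_civil])
```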


## String Methods on Index and Columns
- Use the `index` and `columns` attributes to access the **DataFrame** index/column labels.
- These objects support string methods via their own `str` attribute.
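
As a quick illustration of the two bullets above, the sketch below cleans both the row labels and the column labels of a small hand-made DataFrame. The snake_case column naming is just one possible convention chosen for the example.

```python
import pandas as pd

# Hand-made sample with untidy row labels.
df = pd.DataFrame(
    {
        "Position Title": ["POLICE OFFICER", "CLERK IV"],
        "Employee Annual Salary": ["$84450.00", "$68028.00"],
    },
    index=pd.Index(["  AARON, KARINA", "ZAWADSKI, JAMES "], name="Name"),
)

# The index exposes the same .str accessor as a Series.
df.index = df.index.str.strip().str.title()

# So does the columns object; here spaces become underscores.
df.columns = df.columns.str.lower().str.replace(" ", "_")

print(df)
```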

In [64]:
chicago= pd.read_csv('chicago.csv', index_col= 'Name').dropna(how= 'all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()

Name,Position Title,Department,Employee Annual Salary
"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [65]:
chicago.index= chicago.index.str.strip().str.title()

In [66]:
chicago.head()

Name,Position Title,Department,Employee Annual Salary
"Aaron, Elvia J",WATER RATE TAKER,WATER MGMNT,$90744.00
"Aaron, Jeffery M",POLICE OFFICER,POLICE,$84450.00
"Aaron, Karina",POLICE OFFICER,POLICE,$84450.00
"Aaron, Kimberlei R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
"Abad Jr, Vicente M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [70]:
chicago.columns= chicago.columns.str.replace(' ', '')
chicago.head()

Name,PositionTitle,Department,EmployeeAnnualSalary
"Aaron, Elvia J",WATER RATE TAKER,WATER MGMNT,$90744.00
"Aaron, Jeffery M",POLICE OFFICER,POLICE,$84450.00
"Aaron, Karina",POLICE OFFICER,POLICE,$84450.00
"Aaron, Kimberlei R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
"Abad Jr, Vicente M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


## The split Method
- The `str.split` method splits a string by the occurrence of a delimiter. Pandas returns a **Series** of lists.
- Use the `str.get` method to access a nested list element by its index position.

In [71]:
chicago= pd.read_csv('chicago.csv').dropna(how= 'all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [88]:
# The most common first word in job titles column
positions= chicago['Position Title'].str.split(' ').str.get(0)
positions.value_counts()

Position Title
POLICE             10856
FIREFIGHTER-EMT     1509
SERGEANT            1186
POOL                 918
FIREFIGHTER          810
                   ...  
DENTIST                1
ASSOC                  1
TELEPHONE              1
MAYOR                  1
PREPRESS               1
Name: count, Length: 320, dtype: int64
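
The behaviour of `str.get` on lists of different lengths is worth seeing on a small example. In this sketch (hand-made `titles`, not dataset rows), a position that does not exist in a given list yields a missing value rather than an error, and negative positions count from the end.

```python
import pandas as pd

titles = pd.Series(["POLICE OFFICER", "CIVIL ENGINEER IV", "MAYOR"])

# Each element becomes a list of words.
words = titles.str.split(" ")

first_word = words.str.get(0)
third_word = words.str.get(2)   # missing where the list has fewer than 3 items
last_word = words.str.get(-1)   # negative positions count from the end

print(first_word.tolist())
print(third_word.tolist())
print(last_word.tolist())
```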

## More Practice with Splits

In [89]:
chicago= pd.read_csv('chicago.csv').dropna(how= 'all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [107]:
# Finding the most common first name among the employees ('Name' column)

chicago['Name'].str.split(',').str.get(1).str.strip().str.split(' ').str.get(0).str.title().value_counts()

Name
Michael     1153
John         899
James        676
Robert       622
Joseph       537
            ... 
Deena          1
Cherrise       1
Eartha         1
Ernika         1
Mac            1
Name: count, Length: 5091, dtype: int64
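
The one-line chain above can be hard to follow, so here is the same idea broken into named steps. This is a sketch on invented names (the `names` Series is hypothetical), assuming the "LAST, FIRST MIDDLE" layout seen in the dataset.

```python
import pandas as pd

# Invented names in the same "LAST, FIRST MIDDLE" layout.
names = pd.Series(["DOE, JOHN A", "ROE JR, JOHN", "SMITH, MARY"])

# Step 1: keep everything after the comma and trim the leading space.
after_comma = names.str.split(",").str.get(1).str.strip()

# Step 2: keep the first word of that remainder and normalise its case.
first_names = after_comma.str.split(" ").str.get(0).str.title()

# Step 3: count how often each first name occurs.
counts = first_names.value_counts()

print(counts)
```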

## The expand and n Parameters of the split Method
- The `expand` parameter returns a **DataFrame** instead of a **Series** of lists.
- The `n` parameter limits the number of splits.

In [108]:
chicago= pd.read_csv('chicago.csv').dropna(how= 'all')
chicago['Department'] = chicago['Department'].astype('category')
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [110]:
chicago['Name'].str.split(', ', expand= True)

Unnamed: 0,0,1
0,AARON,ELVIA J
1,AARON,JEFFERY M
2,AARON,KARINA
3,AARON,KIMBERLEI R
4,ABAD JR,VICENTE M
...,...,...
32057,ZYGADLO,MICHAEL J
32058,ZYGOWICZ,PETER J
32059,ZYMANTAS,MARK E
32060,ZYRKOWSKI,CARLO E


In [111]:
chicago[['Last Name', 'First Name']] = chicago['Name'].str.split(', ', expand= True)
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Last Name,First Name
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,AARON,ELVIA J
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,AARON,JEFFERY M
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,AARON,KARINA
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,AARON,KIMBERLEI R
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,ABAD JR,VICENTE M


In [112]:
chicago['Position Title'].str.split(' ', expand= True)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,WATER,RATE,TAKER,,,,,,
1,POLICE,OFFICER,,,,,,,
2,POLICE,OFFICER,,,,,,,
3,CHIEF,CONTRACT,EXPEDITER,,,,,,
4,CIVIL,ENGINEER,IV,,,,,,
...,...,...,...,...,...,...,...,...,...
32057,FRM,OF,MACHINISTS,-,AUTOMOTIVE,,,,
32058,POLICE,OFFICER,,,,,,,
32059,POLICE,OFFICER,,,,,,,
32060,POLICE,OFFICER,,,,,,,


In [114]:
chicago['Position Title'].str.split(' ', expand= True, n= 1)

Unnamed: 0,0,1
0,WATER,RATE TAKER
1,POLICE,OFFICER
2,POLICE,OFFICER
3,CHIEF,CONTRACT EXPEDITER
4,CIVIL,ENGINEER IV
...,...,...
32057,FRM,OF MACHINISTS - AUTOMOTIVE
32058,POLICE,OFFICER
32059,POLICE,OFFICER
32060,POLICE,OFFICER


In [115]:
chicago[['Position First Name', 'Position Last Name']] = chicago['Position Title'].str.split(' ', expand= True, n= 1)
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Last Name,First Name,Position First Name,Position Last Name
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,AARON,ELVIA J,WATER,RATE TAKER
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,AARON,JEFFERY M,POLICE,OFFICER
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,AARON,KARINA,POLICE,OFFICER
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,AARON,KIMBERLEI R,CHIEF,CONTRACT EXPEDITER
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,ABAD JR,VICENTE M,CIVIL,ENGINEER IV
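
The effect of `n` on the shape of the expanded result can be verified on a small sample. In this sketch (hand-made `titles`), rows with fewer pieces than the widest row are padded with missing values, which is also why assigning to exactly two new columns relies on `n=1`.

```python
import pandas as pd

titles = pd.Series(["WATER RATE TAKER", "POLICE OFFICER", "MAYOR"])

# Without n, there is one column per word of the longest title.
full = titles.str.split(" ", expand=True)

# With n=1, at most one split happens, so there are at most two columns.
limited = titles.str.split(" ", n=1, expand=True)

print(full.shape, limited.shape)
print(limited)
```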


## ChatGPT Exercises

In [174]:
import pandas as pd

chicago= pd.read_csv('chicago.csv').dropna(how= 'all')
#chicago['Department'] = chicago['Department'].astype('category')
chicago.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [155]:
# Your Employee Annual Salary column is currently in string format, and contains commas (e.g., "50,000"). Clean it and convert it into a numeric format for further analysis.
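# Commas are thousands separators, so they are removed rather than replaced with a decimal point; regex=False treats '$' as a literal character.
chicago['Employee Annual Salary']= chicago['Employee Annual Salary'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype('float64')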
chicago.head()

# Task: Remove commas and convert the column into a float.

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,90744.0
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,84450.0
2,"AARON, KARINA",POLICE OFFICER,POLICE,84450.0
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,89880.0
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,106836.0
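
Because the salaries in this dataset contain no commas, the comma handling is never exercised by the output above. A small sketch on made-up salary strings (the `raw` Series is an assumption for illustration) shows the conversion when thousands separators are present.

```python
import pandas as pd

# Made-up salary strings, some with thousands separators.
raw = pd.Series(["$90,744.00", "$84450.00", "$1,106,836.50"])

salary = (
    raw.str.replace("$", "", regex=False)   # drop the currency symbol
       .str.replace(",", "", regex=False)   # drop thousands separators
       .astype("float64")
)

print(salary.tolist())
```

Replacing the comma with a period instead would turn `"90,744.00"` into `"90.744.00"`, which cannot be parsed as a float.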


In [156]:
# Create a new column, Position Title Length, which will store the length of each position title. Do not use any loops, just string methods.
chicago['Position Title Length']= chicago['Position Title'].str.len()
chicago.head()

# Task: Use string length methods to calculate and store the length of each position title.

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Position Title Length
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,90744.0,16
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,84450.0,14
2,"AARON, KARINA",POLICE OFFICER,POLICE,84450.0,14
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,89880.0,24
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,106836.0,17


In [157]:
# Filter the DataFrame to include only employees who work in the "Finance" department (case insensitive). Return only the Name and Department columns.
chicago.loc[chicago['Department'].str.capitalize() == 'Finance', ['Name', 'Department']]
chicago[chicago['Department'].str.capitalize() == 'Finance'][['Name', 'Department']]

# Task: Use the appropriate string filtering method.

# Note: both approaches return the same rows and columns, and in both cases the result is a new DataFrame (boolean-mask selection copies the data), not a view of the original.

# The second line uses chained indexing: it first builds an intermediate DataFrame from the filtered rows and only then selects the columns.
# The first line filters rows and selects columns in a single .loc call, which is also the form to use when assigning values back to the original DataFrame.

Unnamed: 0,Name,Department
105,"ADAMCZYK JR, JAN",FINANCE
151,"ADAPON, NENITA P",FINANCE
166,"ADENI, MOHAMED K",FINANCE
254,"AHMED, MOHAMMAD A",FINANCE
297,"ALAM, SYED S",FINANCE
...,...,...
31756,"YUNG, TIMOTHY",FINANCE
31789,"ZAIDI, SYED K",FINANCE
31870,"ZAVALA, FERNANDO",FINANCE
31939,"ZHANG, KEFENG",FINANCE
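
The difference between the two selection styles matters mostly for assignment. The sketch below uses an invented `staff` DataFrame to show that both styles select the same data, that the selection is independent of the original, and that a single `.loc` call is the reliable way to write values back.

```python
import pandas as pd

# Invented rows with inconsistent department casing.
staff = pd.DataFrame({
    "Name": ["DOE, JANE", "ROE, RICHARD", "SMITH, MARY"],
    "Department": ["FINANCE", "POLICE", "finance"],
})

is_finance = staff["Department"].str.lower() == "finance"

one_step = staff.loc[is_finance, ["Name", "Department"]]
two_steps = staff[is_finance][["Name", "Department"]]
same_result = one_step.equals(two_steps)

# Writing back through a single .loc call updates the original DataFrame.
staff.loc[is_finance, "Department"] = "FINANCE"

print(same_result)
print(staff)
print(one_step)  # selected earlier, so it still holds the old values
```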


In [158]:
# The Position Title column contains inconsistent capitalization. Your task is to make sure each position title starts with an uppercase letter and the rest are lowercase.
chicago['Position Title']= chicago['Position Title'].str.capitalize()
chicago.head()

# Task: Apply string methods to fix the capitalization of each position title.

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Position Title Length
0,"AARON, ELVIA J",Water rate taker,WATER MGMNT,90744.0,16
1,"AARON, JEFFERY M",Police officer,POLICE,84450.0,14
2,"AARON, KARINA",Police officer,POLICE,84450.0,14
3,"AARON, KIMBERLEI R",Chief contract expediter,GENERAL SERVICES,89880.0,24
4,"ABAD JR, VICENTE M",Civil engineer iv,WATER MGMNT,106836.0,17


In [159]:
# There are some leading or trailing spaces in the Name and Position Title columns. Clean these columns by removing any extra spaces at the beginning and end.
columns_to_clean= ['Name', 'Position Title']

for col in columns_to_clean:
    chicago[col]= chicago[col].str.title().str.strip()

chicago.head()
# Task: Use string methods to clean the data.

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Position Title Length
0,"Aaron, Elvia J",Water Rate Taker,WATER MGMNT,90744.0,16
1,"Aaron, Jeffery M",Police Officer,POLICE,84450.0,14
2,"Aaron, Karina",Police Officer,POLICE,84450.0,14
3,"Aaron, Kimberlei R",Chief Contract Expediter,GENERAL SERVICES,89880.0,24
4,"Abad Jr, Vicente M",Civil Engineer Iv,WATER MGMNT,106836.0,17


In [160]:
# Find all employees whose position title contains the word "Manager" (case insensitive). Return only the Name and Position Title columns.
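# Note: == tests for an exact match, so this mask is True only for titles that are exactly 'Manager'. No title qualifies, hence the empty result.
# A substring test would use str.contains instead, e.g. chicago['Position Title'].str.contains('manager', case=False)
employee_is_manager= chicago['Position Title'].str.title() == 'Manager'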
chicago.loc[ employee_is_manager, ['Name', 'Position Title'] ]

# Task: Use string methods to filter titles containing "Manager".

Unnamed: 0,Name,Position Title
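
An exact comparison and a substring test answer different questions, which a small example makes concrete. This sketch uses a hand-made `titles` Series and the `case=False` argument of `str.contains` for case-insensitive matching.

```python
import pandas as pd

titles = pd.Series([
    "Manager",
    "Manager Of Customer Services",
    "PROJECT MANAGER",
    "Police Officer",
])

# Exact match: True only when the whole title equals 'Manager'.
exact = titles.str.title() == "Manager"

# Substring match, ignoring case.
partial = titles.str.contains("manager", case=False)

print(exact.tolist())
print(partial.tolist())
```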


In [161]:
# In the Position Title column, replace the title "CEO" with "Chief Executive Officer".
(chicago['Position Title'].str.title() == 'Ceo').sum()

# Task: Use a string replace method to make this substitution.

# Answer: no row has "CEO" as its Position Title. If one did, the replacement would use the following syntax:
chicago['Position Title'].str.title().str.replace('Ceo', 'Chief Executive Officer')

0                      Water Rate Taker
1                        Police Officer
2                        Police Officer
3              Chief Contract Expediter
4                     Civil Engineer Iv
                      ...              
32057    Frm Of Machinists - Automotive
32058                    Police Officer
32059                    Police Officer
32060                    Police Officer
32061           Chief Data Base Analyst
Name: Position Title, Length: 32062, dtype: object

In [162]:
# Create a new column called Department Abbreviation, where you extract the first letter of each word in the Department column and combine them. For example, "Human Resources" should become "HR".

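# Take the first letter of each of the first three words; fillna('') covers departments with fewer than three words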
chicago['Department'].str.split(' ').str.get(0).str.get(0).fillna('') +\
chicago['Department'].str.split(' ').str.get(1).str.get(0).fillna('') +\
chicago['Department'].str.split(' ').str.get(2).str.get(0).fillna('')

0        WM
1         P
2         P
3        GS
4        WM
         ..
32057    GS
32058     P
32059     P
32060     P
32061     D
Name: Department, Length: 32062, dtype: object
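
The approach above is limited to three words. One possible alternative, sketched here on hand-made department names (the last one is invented), collects the first character of every word with `str.findall` and glues the pieces together with `str.join`, so the number of words does not matter.

```python
import pandas as pd

# Sample department names; "BOARD OF ELECTION" is invented for the example.
departments = pd.Series(["WATER MGMNT", "POLICE", "GENERAL SERVICES", "BOARD OF ELECTION"])

# \b\w matches the first word character after each word boundary.
first_letters = departments.str.findall(r"\b\w")

# Join each list of letters into a single string.
abbreviation = first_letters.str.join("")

print(abbreviation.tolist())
```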

In [163]:
round(3.4553,2)

3.46

In [164]:
# Your Employee Annual Salary column has salaries in string format (e.g., "50,000"). Remove the commas and standardize the format (e.g., add a dollar sign to the beginning and round to the nearest whole number).

chicago.head()
chicago['Employee Annual Salary'].apply(lambda x: f'US$ {round(float(x))}')

# Task: Clean and format the salary string to match a standard format (e.g., "$50,000").

0         US$ 90744
1         US$ 84450
2         US$ 84450
3         US$ 89880
4        US$ 106836
            ...    
32057     US$ 99528
32058     US$ 87384
32059     US$ 84450
32060     US$ 87384
32061    US$ 113664
Name: Employee Annual Salary, Length: 32062, dtype: object
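
If the target format is the `"$50,000"` style named in the task, Python's format specification can add the thousands separator and do the rounding in one step. A minimal sketch on made-up numeric salaries:

```python
import pandas as pd

# Made-up numeric salaries.
salaries = pd.Series([90744.0, 84450.49, 106836.75])

# ',' inserts thousands separators; '.0f' rounds to a whole number.
formatted = salaries.map("${:,.0f}".format)

print(formatted.tolist())
```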

In [165]:
# Create a new column, Summary, which combines the Name, Position Title, and Department into one string, formatted as:
# "Name is a Position Title in Department department."


chicago['Summary']= chicago['Name'].str.title().str.split(',').str.get(1).str.strip()+\
' is a '+chicago['Position Title'].str.title()+\
' in Department '+\
chicago['Department'].str.title()

chicago.head()
# For example, "John Doe is a Software Engineer in IT department."

# Task: Use string concatenation to create the new summary column.

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Position Title Length,Summary
0,"Aaron, Elvia J",Water Rate Taker,WATER MGMNT,90744.0,16,Elvia J is a Water Rate Taker in Department Wa...
1,"Aaron, Jeffery M",Police Officer,POLICE,84450.0,14,Jeffery M is a Police Officer in Department Po...
2,"Aaron, Karina",Police Officer,POLICE,84450.0,14,Karina is a Police Officer in Department Police
3,"Aaron, Kimberlei R",Chief Contract Expediter,GENERAL SERVICES,89880.0,24,Kimberlei R is a Chief Contract Expediter in D...
4,"Abad Jr, Vicente M",Civil Engineer Iv,WATER MGMNT,106836.0,17,Vicente M is a Civil Engineer Iv in Department...


In [166]:
# The Position Title column contains titles like "Senior Software Engineer" or "Junior Data Analyst". Create a new column Job Level that captures only the first word of the position title (e.g., "Senior", "Junior").

chicago['Position Title'].str.strip().str.split(' ').str.get(0)

# Task: Use the split() method to extract the first word.

0         Water
1        Police
2        Police
3         Chief
4         Civil
          ...  
32057       Frm
32058    Police
32059    Police
32060    Police
32061     Chief
Name: Position Title, Length: 32062, dtype: object

In [167]:
# In the Position Title column, some titles contain extra spaces between words (e.g., "Software Engineer"). Fix this issue by replacing multiple spaces with a single space.

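# \s+ matches any run of whitespace, so runs of three or more spaces also collapse to a single space
chicago['Position Title']= chicago['Position Title'].str.replace(r'\s+', ' ', regex=True)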
chicago.loc[16946, 'Position Title']
# Task: Use the replace() method or split() with join() to normalize the spacing.

'Manager Of Customer Services'
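
The task comment mentions two routes for normalizing spacing: `replace` and `split` with `join`. The sketch below compares them on hand-made titles with irregular spacing; both should give the same result once leading and trailing spaces are stripped.

```python
import pandas as pd

# Hand-made titles with irregular spacing.
titles = pd.Series([
    "Software   Engineer",
    "Manager  Of Customer    Services",
    " Clerk Iv ",
])

# Route 1: collapse every run of whitespace with a regular expression.
with_regex = titles.str.replace(r"\s+", " ", regex=True).str.strip()

# Route 2: split on whitespace (no argument) and re-join with single spaces.
with_split = titles.str.split().str.join(" ")

print(with_regex.tolist())
print(with_split.tolist())
```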

In [168]:
chicago['Name']

0            Aaron,  Elvia J
1          Aaron,  Jeffery M
2             Aaron,  Karina
3        Aaron,  Kimberlei R
4        Abad Jr,  Vicente M
                ...         
32057    Zygadlo,  Michael J
32058     Zygowicz,  Peter J
32059      Zymantas,  Mark E
32060    Zyrkowski,  Carlo E
32061    Zyskowski,  Dariusz
Name: Name, Length: 32062, dtype: object

In [169]:
# Create a new column Initials that contains the initials of the Name column (e.g., "John Doe" → "J.D.").
max_splits= chicago['Name'].str.split(',').str.len().max()
counter= 0
final_series= ''

while counter < max_splits:
    final_series += chicago['Name'].str.split(', ').str.get(counter).str.strip().str.get(0).fillna('')
    counter += 1

final_series
# Task: Use the split() method and string concatenation to extract initials.

0        AE
1        AJ
2        AK
3        AK
4        AV
         ..
32057    ZM
32058    ZP
32059    ZM
32060    ZC
32061    ZD
Name: Name, Length: 32062, dtype: object
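
For the dotted format in the task ("J.D."), a loop is not strictly needed when every name has exactly one comma. A possible sketch, assuming the "Last, First" layout and using a few sample names, splits once with `expand=True` and concatenates the first characters, first name first.

```python
import pandas as pd

names = pd.Series(["Aaron, Elvia J", "Abad Jr, Vicente M", "Zotta, Sandino"])

# Column 0 holds the last name, column 1 the first (and middle) name.
parts = names.str.split(", ", n=1, expand=True)

# str.get(0) on a string returns its first character.
initials = parts[1].str.get(0) + "." + parts[0].str.get(0) + "."

print(initials.tolist())
```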

In [170]:
# Filter the DataFrame to include only employees whose position title starts with "Junior" (case insensitive). Return only the Name and Position Title columns.
chicago['Position Title'].str.lower().str.startswith('junior').sum()

# Task: Use the str.startswith() method to filter out "Junior" positions.

0

In [176]:
# The Employee Annual Salary column contains salaries like "$50,000". Extract the numeric part of the salary (e.g., "50,000") and store it in a new column Salary Numeric.
chicago['Numeric Salary']= chicago['Employee Annual Salary'].str.replace('$', '', regex=False).astype(float).round().astype('int')
chicago.head()

# Task: Use the split() or replace() method to remove the "$" symbol and keep only the numeric value.

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Numeric Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,90744
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,84450
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,84450
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00,89880
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00,106836


In [181]:
# Find all employees whose Position Title contains more than two words. Return only the Name and Position Title columns.

chicago.loc[ chicago['Position Title'].str.split(' ').str.len() > 2, ['Name', 'Position Title']]

# Task: Use the split() method to count the number of words in the Position Title column.


Unnamed: 0,Name,Position Title
0,"AARON, ELVIA J",WATER RATE TAKER
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV
5,"ABARCA, ANABEL",ASST TO THE ALDERMAN
6,"ABARCA, EMMANUEL",GENERAL LABORER - DSS
...,...,...
32046,"ZUREK, MARY H",SENIOR PUBLIC INFORMATION OFFICER
32050,"ZWARYCZ, THOMAS J",POOL MOTOR TRUCK DRIVER
32051,"ZWIESLER, MATTHEW",AIRPORT OPERATIONS SUPVSR I
32057,"ZYGADLO, MICHAEL J",FRM OF MACHINISTS - AUTOMOTIVE
