## Processing and manipulating text data in Pandas 

### Import the pandas library 

In [1]:
import pandas as pd

### Load the CSV file from this [link](https://raw.githubusercontent.com/Prajwalk09/Data-Analysis-with-Pandas-and-Python/refs/heads/main/Working%20with%20Text%20Data/chicago.csv) into a DataFrame and assign it to a variable named `data`

In [2]:
url = "https://raw.githubusercontent.com/Prajwalk09/Data-Analysis-with-Pandas-and-Python/refs/heads/main/Working%20with%20Text%20Data/chicago.csv"
data = pd.read_csv(url)

### Display the first 5 rows of the DataFrame 

In [3]:
data.head()

,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


### Display a concise summary of the `data` DataFrame, including the number of entries, column data types, and memory usage

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32063 entries, 0 to 32062
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Name                    32062 non-null  object
 1   Position Title          32062 non-null  object
 2   Department              32062 non-null  object
 3   Employee Annual Salary  32062 non-null  object
dtypes: object(4)
memory usage: 1002.1+ KB


### Important Note:  
String methods are not available on a column (a Series) directly. Reach them through the `.str` accessor, which goes between the column and the method.  
For example:  
```python
data['Column_Name'].str.method()
```
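
A minimal sketch of the accessor on hypothetical toy data (the `names` Series below is invented for illustration and is not part of the dataset). It shows that `.str` applies the method element by element, leaves missing values missing, and that calling the method without `.str` fails.

```python
import pandas as pd

# Hypothetical toy data standing in for a text column; None marks a missing value.
names = pd.Series(["AARON, ELVIA J", "ABAD JR, VICENTE M", None])

# .str exposes vectorized string methods; the missing entry stays missing.
lowered = names.str.lower()
print(lowered)

# A Series has no lower() method of its own, so this raises AttributeError.
try:
    names.lower()
except AttributeError as err:
    print("AttributeError:", err)
```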

### Convert all the values in the `Name` column of the `data` DataFrame to lowercase using an appropriate string method
<span style="color:blue; font-size:14px">Do <span style="color:red; font-weight:bold">not</span> make the changes in place</span>

In [5]:
data['Name'].str.lower()

0            aaron,  elvia j
1          aaron,  jeffery m
2             aaron,  karina
3        aaron,  kimberlei r
4        abad jr,  vicente m
                ...         
32058     zygowicz,  peter j
32059      zymantas,  mark e
32060    zyrkowski,  carlo e
32061    zyskowski,  dariusz
32062                    NaN
Name: Name, Length: 32063, dtype: object

### Format all the values in the `Name` column of the `data` DataFrame to title case using an appropriate string method
<span style="color:blue; font-size:14px">Do <span style="color:red; font-weight:bold">not</span> make the changes in place</span>

In [6]:
data['Name'].str.title()

0            Aaron,  Elvia J
1          Aaron,  Jeffery M
2             Aaron,  Karina
3        Aaron,  Kimberlei R
4        Abad Jr,  Vicente M
                ...         
32058     Zygowicz,  Peter J
32059      Zymantas,  Mark E
32060    Zyrkowski,  Carlo E
32061    Zyskowski,  Dariusz
32062                    NaN
Name: Name, Length: 32063, dtype: object
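
"Not in place" here means the string methods return a new Series and leave the column untouched unless the result is assigned back. The sketch below illustrates this on hypothetical toy data, and also contrasts `str.title()` with `str.capitalize()`, which only upper-cases the first character of the whole string.

```python
import pandas as pd

# Hypothetical toy data, not taken from the CSV file.
names = pd.Series(["AARON, ELVIA J", "ABAD JR, VICENTE M"])

titled = names.str.title()            # first letter of every word upper-cased
capitalized = names.str.capitalize()  # only the first letter of the string

print(titled.tolist())
print(capitalized.tolist())

# The original Series is unchanged because nothing was assigned back to it.
print(names.tolist())
```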

### Compute the length of each string in the `Position Title` column of the `data` DataFrame

In [7]:
data['Position Title'].str.len()

0        16.0
1        14.0
2        14.0
3        24.0
4        17.0
         ... 
32058    14.0
32059    14.0
32060    14.0
32061    23.0
32062     NaN
Name: Position Title, Length: 32063, dtype: float64
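
The lengths above are shown as floats (`16.0`) because the column contains a missing value, and `NaN` cannot be stored in a plain integer column. As a sketch on hypothetical toy data, one option is to convert the result to the nullable `Int64` dtype, which keeps whole numbers and marks the gap as `<NA>`.

```python
import pandas as pd

# Hypothetical toy data with one missing title.
titles = pd.Series(["POLICE OFFICER", "WATER RATE TAKER", None])

lengths = titles.str.len()          # the missing entry yields NaN
nullable = lengths.astype("Int64")  # nullable integers; NaN becomes <NA>

print(lengths)
print(nullable)
```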

### Replace occurrences of `'MGMNT'` with `'MANAGEMENT'` in the `Department` column of the `data` DataFrame

In [8]:
data['Department'] = data['Department'].str.replace("MGMNT", "MANAGEMENT")

### Display the first 5 rows of the DataFrame once the previous change has been made 

In [9]:
data.head()

,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MANAGEMENT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MANAGEMENT,$106836.00


### Transform the `'Employee Annual Salary'` column in the `data` DataFrame:
1. Remove the dollar sign (`$`) from its values.  
2. Convert the resulting values to the `float` data type.

In [10]:
data['Employee Annual Salary'] = data['Employee Annual Salary'].str.replace('$', '', regex=False)

In [11]:
data['Employee Annual Salary'] = data['Employee Annual Salary'].astype(float)
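
`regex=False` matters in the first step because `$` is the end-of-string anchor in a regular expression, not a literal dollar sign. The sketch below repeats both steps on hypothetical toy data and shows two alternatives that are assumptions rather than part of the exercise: escaping the character as `\$` in a regex, and `pd.to_numeric(..., errors="coerce")`, which turns unparseable values into `NaN` instead of raising.

```python
import pandas as pd

# Hypothetical toy data shaped like the salary column, with one missing value.
salaries = pd.Series(["$90744.00", "$84450.00", float("nan")])

# Literal replacement: '$' is treated as a plain character.
stripped = salaries.str.replace("$", "", regex=False)

# Equivalent regex form: the dollar sign has to be escaped.
escaped = salaries.str.replace(r"\$", "", regex=True)

# Convert the cleaned strings to floats; the missing value stays NaN.
as_float = stripped.astype(float)

# Alternative conversion that maps anything unparseable to NaN.
coerced = pd.to_numeric(stripped, errors="coerce")

print(as_float)
```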

### Remove rows from the `data` DataFrame where all the values are missing (NaN).
<span style="color:blue; font-weight:bold; font-size:14px">Make sure that the changes are made in place</span>

In [12]:
data.dropna(how='all', inplace=True)
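
To make the effect of `how` concrete, here is a small sketch on a hypothetical DataFrame (`frame` is invented for illustration). `how='all'` drops a row only when every value in it is missing, while `how='any'` drops a row as soon as one value is missing. The sketch returns new DataFrames instead of using `inplace=True`, so the original can be compared with the results.

```python
import pandas as pd

# Hypothetical toy frame: row 1 is partly missing, row 2 is entirely missing.
frame = pd.DataFrame({
    "Name": ["AARON, ELVIA J", None, None],
    "Department": ["WATER MGMNT", "POLICE", None],
})

all_missing_dropped = frame.dropna(how="all")  # removes row 2 only
any_missing_dropped = frame.dropna(how="any")  # removes rows 1 and 2

print(all_missing_dropped)
print(any_missing_dropped)
```

With `inplace=True`, `dropna` modifies the DataFrame it is called on and returns `None`, so its result should not be assigned to a variable.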

### Convert the text in the `'Position Title'` column to lowercase and extract rows where it contains the word `'water'`

In [13]:
data[data['Position Title'].str.lower().str.contains('water')]

,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MANAGEMENT,90744.0
554,"ALUISE, VINCENT G",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MANAGEMENT,102440.0
671,"ANDER, PERRY A",WATER CHEMIST II,WATER MANAGEMENT,82044.0
685,"ANDERSON, ANDREW J",DISTRICT SUPERINTENDENT OF WATER DISTRIBUTION,WATER MANAGEMENT,109272.0
702,"ANDERSON, DONALD",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MANAGEMENT,102440.0
...,...,...,...,...
29669,"VERMA, ANUPAM",MANAGING ENGINEER - WATER MANAGEMENT,WATER MANAGEMENT,111192.0
30239,"WASHINGTON, JOSEPH",WATER CHEMIST III,WATER MANAGEMENT,89676.0
30544,"WEST, THOMAS R",GEN SUPT OF WATER MANAGEMENT,WATER MANAGEMENT,115704.0
30991,"WILLIAMS, MATTHEW",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MANAGEMENT,102440.0
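
The filter above works because the all-`NaN` row was dropped first; a boolean mask that still contains `NaN` cannot be used for indexing. As a hedged alternative on hypothetical toy data, `str.contains` accepts `case=False` for a case-insensitive match and `na=False` to treat missing values as non-matches, which avoids both the `str.lower()` call and the dependency on the earlier `dropna`.

```python
import pandas as pd

# Hypothetical toy data with mixed case and one missing title.
titles = pd.Series([
    "WATER RATE TAKER",
    "POLICE OFFICER",
    "Foreman of Water Pipe Construction",
    None,
])

# case=False ignores letter case; na=False maps missing values to False.
mask = titles.str.contains("water", case=False, na=False)
matches = titles[mask]

print(matches)
```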


### Filter the rows where the `'Position Title'` starts with the word `'WATER'`

In [14]:
data[data['Position Title'].str.startswith('WATER')]

,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MANAGEMENT,90744.0
671,"ANDER, PERRY A",WATER CHEMIST II,WATER MANAGEMENT,82044.0
1054,"ASHLEY, KARMA T",WATER CHEMIST II,WATER MANAGEMENT,82044.0
1079,"ATKINS, JOANNA M",WATER CHEMIST II,WATER MANAGEMENT,82044.0
1181,"AZEEM, MOHAMMED A",WATER CHEMIST II,WATER MANAGEMENT,53172.0
...,...,...,...,...
28574,"THREATT, DENISE R",WATER QUALITY INSPECTOR,WATER MANAGEMENT,62004.0
28602,"TIGNOR, DARRYL B",WATER RATE TAKER,WATER MANAGEMENT,78948.0
28955,"TRAVIS COOK, LESLIE R",WATER RATE TAKER,WATER MANAGEMENT,78948.0
29584,"VELAZQUEZ, JOHN",WATER RATE TAKER,WATER MANAGEMENT,78948.0


### Filter the rows where the `'Position Title'` ends with the word `'MANAGEMENT'`

In [15]:
data[data['Position Title'].str.endswith('MANAGEMENT')]

,Name,Position Title,Department,Employee Annual Salary
619,"AMADO, BONITA S",DIR OF FACILITIES MANAGEMENT,POLICE,109008.0
1645,"BATIE, FERRIS C",DIR OF FACILITIES MANAGEMENT,GENERAL SERVICES,112308.0
4484,"CHAO, JOSE A",DIR OF FACILITIES MANAGEMENT,CULTURAL AFFAIRS,95820.0
6421,"DAVIS III, WALLACE",GEN SUPT OF WATER MANAGEMENT,WATER MANAGEMENT,119208.0
9751,"GASPAR, RICARDO",DIR OF FACILITIES MANAGEMENT,AVIATION,106848.0
11466,"HARDY, MARIE E",ASST COORD OF COLLECTION MANAGEMENT,PUBLIC LIBRARY,83340.0
12412,"HNATKO, WAYNE S",ASST DIR OF BUILDINGS MANAGEMENT,GENERAL SERVICES,110088.0
14897,"KING, JOHN T",GEN SUPT OF WATER MANAGEMENT,WATER MANAGEMENT,114204.0
18237,"MCFARLAND, ANDREW S",MANAGING ENGINEER - WATER MANAGEMENT,WATER MANAGEMENT,111192.0
23792,"REYNOLDS, DAVID J",COMMISSIONER OF FLEET & FACILITY MANAGEMENT,GENERAL SERVICES,157092.0
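
`str.startswith` and `str.endswith` compare literal, case-sensitive text at the two ends of each string. The sketch below uses hypothetical toy data and passes `na=False`, an optional argument not used in the cells above, so that a missing title counts as a non-match instead of producing `NaN` in the mask.

```python
import pandas as pd

# Hypothetical toy data with one missing title.
titles = pd.Series([
    "WATER CHEMIST II",
    "GEN SUPT OF WATER MANAGEMENT",
    "POLICE OFFICER",
    None,
])

starts = titles.str.startswith("WATER", na=False)
ends = titles.str.endswith("MANAGEMENT", na=False)

print(titles[starts].tolist())
print(titles[ends].tolist())
```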


### Split the `'Name'` column on commas, take the first part of each name (the surname), convert it to title case, and display the count of each unique value

In [16]:
data['Name'].str.split(',').str.get(0).str.title().value_counts()

Williams     293
Johnson      244
Smith        241
Brown        185
Jones        183
            ... 
Horkavy        1
Horn           1
Horne Jr       1
Horner         1
Zyskowski      1
Name: Name, Length: 13829, dtype: int64
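
The chain above can be read one step at a time. This sketch uses hypothetical toy data: `str.split` turns each string into a list, `str.get(0)` picks the first element of each list, and `value_counts` tallies the results, skipping missing values by default.

```python
import pandas as pd

# Hypothetical toy data in the same "SURNAME, GIVEN NAMES" shape.
names = pd.Series(["AARON,  ELVIA J", "AARON,  KARINA", "ABAD JR,  VICENTE M", None])

parts = names.str.split(",")             # each value becomes a list of pieces
surnames = parts.str.get(0).str.title()  # first piece, in title case
counts = surnames.value_counts()         # NaN is excluded by default

print(parts.iloc[0])
print(counts)
```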

### Split the `'Name'` column into two new columns, `'Last Name'` and `'First Name'`, using a comma as the delimiter

Each name is stored as `SURNAME, GIVEN NAMES`, so the part before the comma is the last name and the part after it is the first name.

In [17]:
data[['Last Name', 'First Name']] = data['Name'].str.split(',', expand=True)

### Display the first 5 rows of the DataFrame after making the previous change 

In [18]:
data.head()

,Name,Position Title,Department,Employee Annual Salary,Last Name,First Name
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MANAGEMENT,90744.0,AARON,ELVIA J
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,84450.0,AARON,JEFFERY M
2,"AARON, KARINA",POLICE OFFICER,POLICE,84450.0,AARON,KARINA
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,89880.0,AARON,KIMBERLEI R
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MANAGEMENT,106836.0,ABAD JR,VICENTE M
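
A slightly more defensive variant of the split, sketched on hypothetical toy data. It assumes the values follow the `SURNAME, GIVEN NAMES` pattern, so the piece before the comma is treated as the last name. `n=1` limits the split to the first comma, which guarantees exactly two columns even if a name contains another comma, and `str.strip()` removes the whitespace left after the comma.

```python
import pandas as pd

# Hypothetical toy frame; the second name contains an extra comma on purpose.
frame = pd.DataFrame({"Name": ["AARON,  ELVIA J", "ABAD,  VICENTE M, JR"]})

# Split on the first comma only, so the result always has two columns.
split = frame["Name"].str.split(",", n=1, expand=True)

frame["Last Name"] = split[0].str.strip()
frame["First Name"] = split[1].str.strip()

print(frame)
```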
