## Working with text data

How to clean up a text-heavy DataFrame and optimize its memory usage

### Exploring the data

In [50]:
import pandas as pd

In [51]:
chicago = pd.read_csv('db/chicago.csv')
chicago

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
...,...,...,...,...
32058,"ZYGOWICZ, PETER J",POLICE OFFICER,POLICE,$87384.00
32059,"ZYMANTAS, MARK E",POLICE OFFICER,POLICE,$84450.00
32060,"ZYRKOWSKI, CARLO E",POLICE OFFICER,POLICE,$87384.00
32061,"ZYSKOWSKI, DARIUSZ",CHIEF DATA BASE ANALYST,DoIT,$113664.00


In [52]:
chicago.describe()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
count,32062,32062,32062,32062
unique,31776,1093,35,1156
top,"HERNANDEZ, JUAN C",POLICE OFFICER,POLICE,$87384.00
freq,4,9184,12618,2394


In [53]:
chicago.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32063 entries, 0 to 32062
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Name                    32062 non-null  object
 1   Position Title          32062 non-null  object
 2   Department              32062 non-null  object
 3   Employee Annual Salary  32062 non-null  object
dtypes: object(4)
memory usage: 1002.1+ KB


In [54]:
chicago.isnull().values.any()

True
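
`isnull().values.any()` only says that something is missing, not where. A minimal sketch of how the missing values could be located, using a small made-up frame (`sample` is hypothetical and stands in for `chicago`):

```python
import pandas as pd

# Hypothetical miniature of the salary table; the last row is entirely empty.
sample = pd.DataFrame({
    "Name": ["AARON, ELVIA J", "AARON, KARINA", None],
    "Department": ["WATER MGMNT", "POLICE", None],
})

# Number of missing values in each column.
null_counts = sample.isnull().sum()

# Rows in which every column is missing.
empty_rows = sample[sample.isnull().all(axis=1)]

print(null_counts)
print(empty_rows)
```

If the per-column counts are all equal and match the number of fully empty rows, dropping those rows is enough to remove every null.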

In [55]:
chicago.nunique()

Name                      31776
Position Title             1093
Department                   35
Employee Annual Salary     1156
dtype: int64

### Optimizing it

    Dropping null rows

In [56]:
chicago.dropna(how='all', inplace=True) # without 'inplace=True', dropna returns a new DataFrame and leaves chicago unchanged

chicago.isnull().values.any()

False
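
An alternative to `inplace=True` is to assign the result of `dropna` back to a variable. The sketch below uses a made-up frame (`sample` is hypothetical) to show that the original is left untouched and that `how='all'` only removes rows where every column is missing:

```python
import pandas as pd

# Hypothetical data: row 1 is fully empty, row 2 is only partly empty.
sample = pd.DataFrame({
    "Name": ["A", None, "C"],
    "Department": ["X", None, None],
})

# dropna returns a new DataFrame; 'sample' itself is not modified.
cleaned = sample.dropna(how="all")

# The surviving rows keep their old labels (0 and 2), so renumber them.
cleaned = cleaned.reset_index(drop=True)

print(len(sample), len(cleaned))
```

Resetting the index is optional; it is shown here because dropping rows leaves gaps in the row labels.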

    Renaming the columns

In [57]:
chicago.columns = ['Name', 'Position_Title', 'Department', 'Annual_Salary']
chicago.columns

Index(['Name', 'Position_Title', 'Department', 'Annual_Salary'], dtype='object')

    Changing data types and formatting

In [58]:
chicago['Department']=chicago['Department'].astype('category')

chicago.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   Name            32062 non-null  object  
 1   Position_Title  32062 non-null  object  
 2   Department      32062 non-null  category
 3   Annual_Salary   32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 1.0+ MB
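
The `+` in the `memory usage` line means the figure does not include the strings themselves, so the saving from `category` is not visible there. A rough way to compare the two representations, sketched on made-up data (`departments` is hypothetical), is `memory_usage(deep=True)`:

```python
import pandas as pd

# Hypothetical column with few distinct values repeated many times,
# similar in shape to the Department column.
departments = pd.Series(["POLICE", "WATER MGMNT", "GENERAL SERVICES"] * 10_000)

# deep=True also counts the memory held by the string values.
as_text = departments.memory_usage(deep=True)
as_category = departments.astype("category").memory_usage(deep=True)

print(as_text, as_category)
```

The conversion pays off when a column has few unique values relative to its length, as `Department` does with 35 unique values in 32062 rows.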


To apply a string method to a Series, we go through its 'str' accessor first. A DataFrame has no 'str' accessor, so string methods are applied one column at a time.

In [59]:
chicago['Position_Title']=chicago['Position_Title'].str.lower()

In [67]:
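chicago['Department']=chicago['Department'].str.lower().astype('category') # .str methods return a non-categorical column, so convert back to keep the category dtype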

In [65]:
chicago['Name']=chicago['Name'].str.title() # Capitalizes the first letter of each word
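
A small sketch of the 'str' accessor on made-up Series (all variable names here are hypothetical). It also shows one thing to watch for: string methods on a categorical column return a non-categorical result, so the dtype has to be restored if it should be kept.

```python
import pandas as pd

# Element-wise string methods through the .str accessor.
titles = pd.Series(["POLICE OFFICER", "CIVIL ENGINEER IV"])
lowered = titles.str.lower()

names = pd.Series(["AARON, ELVIA J"])
titled = names.str.title()

# On a categorical column the result is no longer categorical.
dept = pd.Series(["POLICE", "DoIT", "POLICE"], dtype="category")
dept_lower = dept.str.lower()
dept_restored = dept_lower.astype("category")

print(lowered.tolist(), titled.tolist())
print(dept_lower.dtype, dept_restored.dtype)
```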

    Removing the '$' symbol and converting to float

In [73]:
chicago["Annual_Salary"]=chicago['Annual_Salary'].str.replace('$','',regex=False).astype(float) # regex=False treats '$' as a literal character, not as a regex anchor
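
In a regular expression `$` means "end of string", and the default value of the `regex` parameter of `str.replace` has changed between pandas versions, so it is safer to state the intent explicitly. A minimal sketch on made-up values (`salaries` is hypothetical), with `str.lstrip` as an alternative that involves no pattern at all:

```python
import pandas as pd

salaries = pd.Series(["$90744.00", "$84450.00"])

# Literal replacement: '$' is treated as a plain character.
as_float = salaries.str.replace("$", "", regex=False).astype(float)

# Alternative: strip the symbol from the left-hand side only.
alt = salaries.str.lstrip("$").astype(float)

print(as_float.tolist())
print(alt.tolist())
```

Once the column is numeric, methods such as `mean()` or `sort_values()` work on the salaries as numbers rather than as text.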
