## Methods for string objects in Pandas

The dataset contains the ``name``, ``department``, and ``salary`` of every employee in the public sector of the city of Chicago (updated as of 2024).

More context is available on [Kaggle](https://www.kaggle.com/datasets/chicago/chicago-citywide-payroll-data) or at the following [link](https://data.cityofchicago.org/Administration-Finance/Current-Employee-Names-Salaries-and-Position-Title/xzkq-xp2w/about_data).

In [1]:
import pandas as pd
import os

In [2]:
# load the data
data_path = '../Datasets/'
filename = 'Chicago_employee_salaries.csv' 
fullpath = os.path.join(data_path, filename)

df = pd.read_csv(filepath_or_buffer=fullpath)
df.head(5)

Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
0,"GARCIA, CHRISTOPHER A",CLERK - BD OF ELECTIONS,BOARD OF ELECTION COMMISSIONERS,F,SALARY,,34176.0,
1,"SILVA, JAVIER",PROJECT COORD,CHICAGO DEPARTMENT OF PUBLIC HEALTH,F,SALARY,,71004.0,
2,"HARRISON, LETECHIA",SR PROCUREMENT SPECIALIST,DEPARTMENT OF PROCUREMENT SERVICES,F,SALARY,,90660.0,
3,"WASHINGTON, MAURICE",PROJECT COORD,DEPARTMENT OF FAMILY AND SUPPORT SERVICES,F,SALARY,,71004.0,
4,"ASHFORD, MARQUISHA D",INQUIRY AIDE III,OFFICE OF PUBLIC SAFETY ADMINISTRATION,F,SALARY,,46056.0,


Not every column is of interest; we can leave out columns such as ``Typical Hours`` and ``Hourly Rate`` by loading only the ones we need.
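
As a minimal sketch of how ``usecols`` restricts the columns that ``read_csv`` loads, the example below reads a tiny in-memory CSV instead of the real file. The second row (``DOE, JANE``) is made up for illustration; only the column names follow the dataset.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV (the second row is invented).
csv_text = """Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
"SILVA, JAVIER",PROJECT COORD,CHICAGO DEPARTMENT OF PUBLIC HEALTH,F,SALARY,,71004.0,
"DOE, JANE",LABORER,DEPARTMENT OF WATER MANAGEMENT,F,HOURLY,40,,30.5
"""

wanted = ["Name", "Job Titles", "Department", "Annual Salary"]

# Only the listed columns are parsed; the rest are never loaded.
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(subset.columns.tolist())
```

Hourly employees have no annual salary, which is why ``Annual Salary`` contains missing values after loading.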

In [3]:
filename = 'Chicago_employee_salaries.csv' 
fullpath = os.path.join(data_path, filename)

df = pd.read_csv(filepath_or_buffer=fullpath, usecols=["Name","Job Titles","Department","Annual Salary"])
df.head(5)

Unnamed: 0,Name,Job Titles,Department,Annual Salary
0,"GARCIA, CHRISTOPHER A",CLERK - BD OF ELECTIONS,BOARD OF ELECTION COMMISSIONERS,34176.0
1,"SILVA, JAVIER",PROJECT COORD,CHICAGO DEPARTMENT OF PUBLIC HEALTH,71004.0
2,"HARRISON, LETECHIA",SR PROCUREMENT SPECIALIST,DEPARTMENT OF PROCUREMENT SERVICES,90660.0
3,"WASHINGTON, MAURICE",PROJECT COORD,DEPARTMENT OF FAMILY AND SUPPORT SERVICES,71004.0
4,"ASHFORD, MARQUISHA D",INQUIRY AIDE III,OFFICE OF PUBLIC SAFETY ADMINISTRATION,46056.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31932 entries, 0 to 31931
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Name           31932 non-null  object 
 1   Job Titles     31932 non-null  object 
 2   Department     31932 non-null  object 
 3   Annual Salary  24795 non-null  float64
dtypes: float64(1), object(3)
memory usage: 998.0+ KB


In [5]:
df.shape

(31932, 4)

In [6]:
# We can build a DataFrame with the number of null values in each column of this dataset

df.isna().sum().to_frame().head(10)

Unnamed: 0,0
Name,0
Job Titles,0
Department,0
Annual Salary,7137


**Recall the dropna methods for removing null values (DataFramesI).**
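
As a quick refresher, this sketch contrasts the ``how`` and ``subset`` parameters of ``dropna`` on a made-up three-row frame. The values are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame: every row has at least one NaN, but no row is entirely NaN.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Annual Salary": [50000.0, np.nan, 72000.0],
    "Hourly Rate": [np.nan, 35.5, np.nan],
})

any_dropped = toy.dropna(how="any")               # drops rows with at least one NaN
all_dropped = toy.dropna(how="all")               # drops rows where every value is NaN
by_salary = toy.dropna(subset=["Annual Salary"])  # only looks at the given column(s)

print(len(any_dropped), len(all_dropped), len(by_salary))
```

When only one column matters for the analysis, ``subset`` is often the safer choice, since ``how="any"`` can discard rows because of columns you never use.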

In [7]:
df_filtered = df.dropna(how='any') # drop every row that contains at least one NaN
df_filtered

Unnamed: 0,Name,Job Titles,Department,Annual Salary
0,"GARCIA, CHRISTOPHER A",CLERK - BD OF ELECTIONS,BOARD OF ELECTION COMMISSIONERS,34176.00
1,"SILVA, JAVIER",PROJECT COORD,CHICAGO DEPARTMENT OF PUBLIC HEALTH,71004.00
2,"HARRISON, LETECHIA",SR PROCUREMENT SPECIALIST,DEPARTMENT OF PROCUREMENT SERVICES,90660.00
3,"WASHINGTON, MAURICE",PROJECT COORD,DEPARTMENT OF FAMILY AND SUPPORT SERVICES,71004.00
4,"ASHFORD, MARQUISHA D",INQUIRY AIDE III,OFFICE OF PUBLIC SAFETY ADMINISTRATION,46056.00
...,...,...,...,...
31924,"SYAS, VICTORIA M",PARAMEDIC I/C,CHICAGO FIRE DEPARTMENT,101052.00
31925,"COOGAN, PATRICK M",LIEUTENANT-EMT,CHICAGO FIRE DEPARTMENT,135144.00
31927,"DE LA ROSA, JOSEPH",FIRE ENGINEER-EMT,CHICAGO FIRE DEPARTMENT,118830.00
31929,"GRADILLA, IVON",SUPERVISING TRAFFIC CONTROL AIDE,OFFICE OF EMERGENCY MANAGEMENT AND COMMUNICATIONS,74844.00


In [8]:
# we can also chain this method directly onto the data load

filename = 'Chicago_employee_salaries.csv' 
fullpath = os.path.join(data_path, filename)

df = pd.read_csv(filepath_or_buffer=fullpath, usecols=["Name","Job Titles","Department","Annual Salary"]).dropna(how='any')
df.head(5)

Unnamed: 0,Name,Job Titles,Department,Annual Salary
0,"GARCIA, CHRISTOPHER A",CLERK - BD OF ELECTIONS,BOARD OF ELECTION COMMISSIONERS,34176.0
1,"SILVA, JAVIER",PROJECT COORD,CHICAGO DEPARTMENT OF PUBLIC HEALTH,71004.0
2,"HARRISON, LETECHIA",SR PROCUREMENT SPECIALIST,DEPARTMENT OF PROCUREMENT SERVICES,90660.0
3,"WASHINGTON, MAURICE",PROJECT COORD,DEPARTMENT OF FAMILY AND SUPPORT SERVICES,71004.0
4,"ASHFORD, MARQUISHA D",INQUIRY AIDE III,OFFICE OF PUBLIC SAFETY ADMINISTRATION,46056.0


## Common string methods

* Pandas Series have a ``str`` attribute that exposes string methods applied to every value of the Series, some of which we have already seen.
* Most of these methods mirror Python's built-in string methods (for example, ``upper``, ``lower``, ``title``).
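
A minimal sketch of the two points above, using a small made-up Series rather than the Chicago data. It also shows that missing values pass through the ``str`` methods as missing instead of raising an error.

```python
import pandas as pd

# Toy Series with mixed case and one missing value.
titles = pd.Series(["PROJECT COORD", "inquiry aide iii", None])

lower = titles.str.lower()    # same idea as Python's str.lower, applied per element
upper = titles.str.upper()
title = titles.str.title()
lengths = titles.str.len()    # number of characters in each value

print(pd.DataFrame({"lower": lower, "upper": upper, "title": title, "len": lengths}))
```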



In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24795 entries, 0 to 31931
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Name           24795 non-null  object 
 1   Job Titles     24795 non-null  object 
 2   Department     24795 non-null  object 
 3   Annual Salary  24795 non-null  float64
dtypes: float64(1), object(3)
memory usage: 968.6+ KB


In [10]:
# we can change the dtype of the Department and Job Titles columns to category

df['Department'] = df['Department'].astype("category")
df['Job Titles'] = df['Job Titles'].astype("category")
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24795 entries, 0 to 31931
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Name           24795 non-null  object  
 1   Job Titles     24795 non-null  category
 2   Department     24795 non-null  category
 3   Annual Salary  24795 non-null  float64 
dtypes: category(2), float64(1), object(1)
memory usage: 695.4+ KB
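
The lower memory usage reported after the conversion comes from storing each distinct label once and keeping only small integer codes per row. The sketch below illustrates this on a made-up Series with three repeated labels; the exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# 3,000 rows but only 3 distinct labels: a good fit for the category dtype.
departments = pd.Series(["POLICE", "FIRE", "HEALTH"] * 1000)

as_object = departments.astype("object")
as_category = departments.astype("category")

# deep=True counts the memory of the Python string objects as well.
bytes_object = as_object.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
print(bytes_object, bytes_category)

# The str accessor still works on a categorical column of strings.
lowered = as_category.str.lower()
```

The saving shrinks as the number of distinct values approaches the number of rows, so a column such as ``Name`` is usually left as it is.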


In [11]:
df['Job Titles'].str.lower()

0                 clerk - bd of elections
1                           project coord
2               sr procurement specialist
3                           project coord
4                        inquiry aide iii
                       ...               
31924                       paramedic i/c
31925                      lieutenant-emt
31927                   fire engineer-emt
31929    supervising traffic control aide
31931            chief operating engineer
Name: Job Titles, Length: 24795, dtype: object

In [12]:
df['Job Titles'].str.upper()

0                 CLERK - BD OF ELECTIONS
1                           PROJECT COORD
2               SR PROCUREMENT SPECIALIST
3                           PROJECT COORD
4                        INQUIRY AIDE III
                       ...               
31924                       PARAMEDIC I/C
31925                      LIEUTENANT-EMT
31927                   FIRE ENGINEER-EMT
31929    SUPERVISING TRAFFIC CONTROL AIDE
31931            CHIEF OPERATING ENGINEER
Name: Job Titles, Length: 24795, dtype: object

In [13]:
df['Job Titles'].str.title()

0                 Clerk - Bd Of Elections
1                           Project Coord
2               Sr Procurement Specialist
3                           Project Coord
4                        Inquiry Aide Iii
                       ...               
31924                       Paramedic I/C
31925                      Lieutenant-Emt
31927                   Fire Engineer-Emt
31929    Supervising Traffic Control Aide
31931            Chief Operating Engineer
Name: Job Titles, Length: 24795, dtype: object

In [14]:
df['Job Titles'].str.len()

0        23
1        13
2        25
3        13
4        16
         ..
31924    13
31925    14
31927    17
31929    32
31931    24
Name: Job Titles, Length: 24795, dtype: int64

In [15]:
df['Job Titles'].str.title().str.len()

0        23
1        13
2        25
3        13
4        16
         ..
31924    13
31925    14
31927    17
31929    32
31931    24
Name: Job Titles, Length: 24795, dtype: int64

In [16]:
df['Job Titles'].str.split('-')

0                [CLERK ,  BD OF ELECTIONS]
1                           [PROJECT COORD]
2               [SR PROCUREMENT SPECIALIST]
3                           [PROJECT COORD]
4                        [INQUIRY AIDE III]
                        ...                
31924                       [PARAMEDIC I/C]
31925                     [LIEUTENANT, EMT]
31927                  [FIRE ENGINEER, EMT]
31929    [SUPERVISING TRAFFIC CONTROL AIDE]
31931            [CHIEF OPERATING ENGINEER]
Name: Job Titles, Length: 24795, dtype: object

## Filtering with string methods

In [37]:
filename = 'Chicago_employee_salaries.csv' 
fullpath = os.path.join(data_path, filename)

df = pd.read_csv(filepath_or_buffer=fullpath,
                 usecols=["Name","Job Titles","Department","Annual Salary"]).dropna(how='any')
df['Department'] = df['Department'].astype("category")
df['Job Titles'] = df['Job Titles'].astype("category")

df.head(5)

Unnamed: 0,Name,Job Titles,Department,Annual Salary
0,"GARCIA, CHRISTOPHER A",CLERK - BD OF ELECTIONS,BOARD OF ELECTION COMMISSIONERS,34176.0
1,"SILVA, JAVIER",PROJECT COORD,CHICAGO DEPARTMENT OF PUBLIC HEALTH,71004.0
2,"HARRISON, LETECHIA",SR PROCUREMENT SPECIALIST,DEPARTMENT OF PROCUREMENT SERVICES,90660.0
3,"WASHINGTON, MAURICE",PROJECT COORD,DEPARTMENT OF FAMILY AND SUPPORT SERVICES,71004.0
4,"ASHFORD, MARQUISHA D",INQUIRY AIDE III,OFFICE OF PUBLIC SAFETY ADMINISTRATION,46056.0


In [38]:
is_water = df["Job Titles"].str.lower().str.contains("water")
is_water

0        False
1        False
2        False
3        False
4        False
         ...  
31924    False
31925    False
31927    False
31929    False
31931    False
Name: Job Titles, Length: 24795, dtype: bool

In [39]:
# the contains method checks whether a substring appears anywhere in each value of the column
# It returns a Series of True/False values, so we use it as a filter condition

is_water = df["Job Titles"].str.lower().str.contains("water")
df[is_water]

Unnamed: 0,Name,Job Titles,Department,Annual Salary
519,"MACIEL, SANDRA M",WATER QUALITY INSPECTOR,DEPARTMENT OF WATER MANAGEMENT,77424.0
2210,"BOYCE, ADNER L",WATER CHEMIST III,DEPARTMENT OF WATER MANAGEMENT,112020.0
2304,"O NEAL, ROZELLA",SUPVSR OF WATER RATE TAKERS,DEPARTMENT OF WATER MANAGEMENT,135120.0
2734,"GALLAGHER, MICHAEL",SAFETY SPECIALIST - WATER MGMT,DEPARTMENT OF WATER MANAGEMENT,112224.0
2890,"BOLTON, BRIAN E",WATER METER ASSESSOR,DEPARTMENT OF WATER MANAGEMENT,112212.0
...,...,...,...,...
21458,"HUYNH, ALEXANDER T",MANAGING ENGINEER - WATER MANAGEMENT,DEPARTMENT OF WATER MANAGEMENT,136404.0
21591,"THOME, JAMES A",ASST SUPT OF WATER METERS,DEPARTMENT OF WATER MANAGEMENT,115488.0
21627,"MCFARLAND, ANDREW S",MANAGING ENGINEER - WATER MANAGEMENT,DEPARTMENT OF WATER MANAGEMENT,136404.0
21793,"KRUEGER, ANGELA",ASST ENGINEER OF WATER PUMPING,DEPARTMENT OF WATER MANAGEMENT,140544.0


In [40]:
starts_civil = df["Job Titles"].str.lower().str.startswith("civil")
df[starts_civil]

Unnamed: 0,Name,Job Titles,Department,Annual Salary
445,"GONZALEZ, ALONDRA",CIVIL ENGINEER III,CHICAGO DEPARTMENT OF TRANSPORTATION,84972.0
799,"LUKE, STEFAN",CIVIL ENGINEER IV,CHICAGO DEPARTMENT OF TRANSPORTATION,93708.0
1818,"BALTSAS, JOHN M",CIVIL ENGINEER III,DEPARTMENT OF WATER MANAGEMENT,122196.0
1834,"PATEL, MAYUR J",CIVIL ENGINEER IV,CHICAGO DEPARTMENT OF TRANSPORTATION,115872.0
2073,"HAMEIRI, AVIKAM",CIVIL ENGINEER V,DEPARTMENT OF BUILDINGS,145872.0
...,...,...,...,...
19622,"ALSUFY, ASEM S",CIVIL ENGINEER II,DEPARTMENT OF WATER MANAGEMENT,79368.0
19636,"TRUONG, MINH Q",CIVIL ENGINEER IV,DEPARTMENT OF WATER MANAGEMENT,101448.0
20611,"CROCKER, MATTHEW L",CIVIL ENGINEER V,CHICAGO DEPARTMENT OF TRANSPORTATION,106080.0
21716,"ARUNA, REMIGIJUS",CIVIL ENGINEER V,DEPARTMENT OF WATER MANAGEMENT,117792.0


In [41]:
ends_iv = df["Job Titles"].str.lower().str.endswith("iv")
df[ends_iv]

Unnamed: 0,Name,Job Titles,Department,Annual Salary
799,"LUKE, STEFAN",CIVIL ENGINEER IV,CHICAGO DEPARTMENT OF TRANSPORTATION,93708.0
1207,"JONES, YASMIN Y",LIBRARIAN IV,CHICAGO PUBLIC LIBRARY,91884.0
1811,"GLADNEY, ANGELA M",CLERK IV,CHICAGO POLICE DEPARTMENT,84972.0
1832,"BANNA, FEDAA N",ELECTRICAL ENGINEER IV,CHICAGO DEPARTMENT OF AVIATION,133428.0
1834,"PATEL, MAYUR J",CIVIL ENGINEER IV,CHICAGO DEPARTMENT OF TRANSPORTATION,115872.0
...,...,...,...,...
20828,"KEITH, YOLANDA",CLERK IV,CHICAGO DEPARTMENT OF PUBLIC HEALTH,48960.0
24531,"SUAREZ, MIGUEL A",CLERK IV,CHICAGO DEPARTMENT OF AVIATION,48960.0
29015,"ESQUIVEL, ANNA B",PUBLIC HEALTH NURSE IV,CHICAGO DEPARTMENT OF PUBLIC HEALTH,107328.0
29468,"AIKENS, CLORA M",PUBLIC HEALTH NURSE IV,CHICAGO DEPARTMENT OF PUBLIC HEALTH,118428.0
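
The three filters above can be sketched on a small made-up frame. This version assumes you may also have missing titles, so it passes ``na=False`` to make those rows evaluate to ``False``; ``case=False`` is an alternative to calling ``str.lower()`` before ``contains``.

```python
import pandas as pd

# Toy data with mixed case and one missing title.
jobs = pd.DataFrame({
    "Job Titles": ["WATER CHEMIST III", "Civil Engineer IV", "CLERK IV", None, "civil engineer v"],
})
titles = jobs["Job Titles"]

# contains accepts case=False; startswith/endswith do not, so lower-case first.
has_water = titles.str.contains("water", case=False, na=False)
starts_civil = titles.str.lower().str.startswith("civil", na=False)
ends_iv = titles.str.lower().str.endswith("iv", na=False)

# Boolean masks can be combined with & (and), | (or) and ~ (not).
civil_iv = jobs[starts_civil & ends_iv]
print(civil_iv)
```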


## str methods for indexes and columns

In [42]:
# load the data and use the employee name as the index

filename = 'Chicago_employee_salaries.csv' 
fullpath = os.path.join(data_path, filename)

df = pd.read_csv(filepath_or_buffer=fullpath, index_col="Name",
                 usecols=["Name","Job Titles","Department","Annual Salary"]).dropna(how='any')
df['Department'] = df['Department'].astype("category")
df['Job Titles'] = df['Job Titles'].astype("category")

df.head(5)

Unnamed: 0_level_0,Job Titles,Department,Annual Salary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"GARCIA, CHRISTOPHER A",CLERK - BD OF ELECTIONS,BOARD OF ELECTION COMMISSIONERS,34176.0
"SILVA, JAVIER",PROJECT COORD,CHICAGO DEPARTMENT OF PUBLIC HEALTH,71004.0
"HARRISON, LETECHIA",SR PROCUREMENT SPECIALIST,DEPARTMENT OF PROCUREMENT SERVICES,90660.0
"WASHINGTON, MAURICE",PROJECT COORD,DEPARTMENT OF FAMILY AND SUPPORT SERVICES,71004.0
"ASHFORD, MARQUISHA D",INQUIRY AIDE III,OFFICE OF PUBLIC SAFETY ADMINISTRATION,46056.0


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24795 entries, GARCIA, CHRISTOPHER A to POWELL, TIMOTHY M
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Job Titles     24795 non-null  category
 1   Department     24795 non-null  category
 2   Annual Salary  24795 non-null  float64 
dtypes: category(2), float64(1)
memory usage: 501.6+ KB


In [45]:
df.index.str.strip().str.title()

Index(['Garcia, Christopher A', 'Silva, Javier', 'Harrison, Letechia',
       'Washington, Maurice', 'Ashford, Marquisha D', 'Flynn, Shanika',
       'Mathis, Desiree', 'Mack, Qiana S', 'Guzman, Melissa M',
       'Perez, William B',
       ...
       'Serb, Steven J', 'Pellegrini, Nicholas R', 'Oliva, Christina',
       'Brown, Thomas M', 'Lovell, Amy M', 'Syas, Victoria M',
       'Coogan, Patrick M', 'De La Rosa, Joseph', 'Gradilla, Ivon',
       'Powell, Timothy M'],
      dtype='object', name='Name', length=24795)

In [46]:
df.index.str.strip().str.upper()

Index(['GARCIA, CHRISTOPHER A', 'SILVA, JAVIER', 'HARRISON, LETECHIA',
       'WASHINGTON, MAURICE', 'ASHFORD, MARQUISHA D', 'FLYNN, SHANIKA',
       'MATHIS, DESIREE', 'MACK, QIANA S', 'GUZMAN, MELISSA M',
       'PEREZ, WILLIAM B',
       ...
       'SERB, STEVEN J', 'PELLEGRINI, NICHOLAS R', 'OLIVA, CHRISTINA',
       'BROWN, THOMAS M', 'LOVELL, AMY M', 'SYAS, VICTORIA M',
       'COOGAN, PATRICK M', 'DE LA ROSA, JOSEPH', 'GRADILLA, IVON',
       'POWELL, TIMOTHY M'],
      dtype='object', name='Name', length=24795)
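
The calls above return a new ``Index`` and leave ``df`` unchanged. A minimal sketch of how the result could be assigned back, and of the same accessor applied to the column labels, on a made-up two-row frame (the ``snake_case`` renaming is only an illustration, not something the notebook does):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Job Titles": ["PROJECT COORD", "CLERK IV"], "Annual Salary": [71004.0, 84972.0]},
    index=pd.Index(["  SILVA, JAVIER ", "GLADNEY, ANGELA M"], name="Name"),
)

# Clean the row labels and store the result back on the frame.
toy.index = toy.index.str.strip().str.title()

# df.columns is also an Index, so the same str methods are available.
toy.columns = toy.columns.str.lower().str.replace(" ", "_", regex=False)

print(toy)
```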

## The split method

* The ``str.get`` method accesses an element of each list by its position.
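
A minimal sketch of ``str.get`` on a toy Series of job titles. Negative positions count from the end, and a position that does not exist in a given list yields a missing value rather than an error.

```python
import pandas as pd

titles = pd.Series(["CLERK - BD OF ELECTIONS", "PROJECT COORD", "LIEUTENANT-EMT"])

words = titles.str.split(" ")   # each element becomes a list of words
first = words.str.get(0)        # first word of every title
last = words.str.get(-1)        # last word of every title
third = words.str.get(2)        # missing where the list has fewer than 3 items

# Indexing the accessor directly is an equivalent shorthand for get(0).
same_as_first = words.str[0]

print(pd.DataFrame({"first": first, "last": last, "third": third}))
```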

In [47]:
filename = 'Chicago_employee_salaries.csv' 
fullpath = os.path.join(data_path, filename)

df = pd.read_csv(filepath_or_buffer=fullpath, index_col="Name",
                 usecols=["Name","Job Titles","Department","Annual Salary"]).dropna(how='any')
df['Department'] = df['Department'].astype("category")
df['Job Titles'] = df['Job Titles'].astype("category")

df.head(5)

Unnamed: 0_level_0,Job Titles,Department,Annual Salary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"GARCIA, CHRISTOPHER A",CLERK - BD OF ELECTIONS,BOARD OF ELECTION COMMISSIONERS,34176.0
"SILVA, JAVIER",PROJECT COORD,CHICAGO DEPARTMENT OF PUBLIC HEALTH,71004.0
"HARRISON, LETECHIA",SR PROCUREMENT SPECIALIST,DEPARTMENT OF PROCUREMENT SERVICES,90660.0
"WASHINGTON, MAURICE",PROJECT COORD,DEPARTMENT OF FAMILY AND SUPPORT SERVICES,71004.0
"ASHFORD, MARQUISHA D",INQUIRY AIDE III,OFFICE OF PUBLIC SAFETY ADMINISTRATION,46056.0


In [49]:
df['Job Titles'].str.split(" ") # creates a list of words for each row

Name
GARCIA, CHRISTOPHER A            [CLERK, -, BD, OF, ELECTIONS]
SILVA, JAVIER                                 [PROJECT, COORD]
HARRISON, LETECHIA               [SR, PROCUREMENT, SPECIALIST]
WASHINGTON, MAURICE                           [PROJECT, COORD]
ASHFORD, MARQUISHA D                      [INQUIRY, AIDE, III]
                                         ...                  
SYAS, VICTORIA M                              [PARAMEDIC, I/C]
COOGAN, PATRICK M                             [LIEUTENANT-EMT]
DE LA ROSA, JOSEPH                        [FIRE, ENGINEER-EMT]
GRADILLA, IVON           [SUPERVISING, TRAFFIC, CONTROL, AIDE]
POWELL, TIMOTHY M                 [CHIEF, OPERATING, ENGINEER]
Name: Job Titles, Length: 24795, dtype: object

In [50]:
# keep only the first element of each list
df['Job Titles'].str.split(" ").str.get(0)

Name
GARCIA, CHRISTOPHER A             CLERK
SILVA, JAVIER                   PROJECT
HARRISON, LETECHIA                   SR
WASHINGTON, MAURICE             PROJECT
ASHFORD, MARQUISHA D            INQUIRY
                              ...      
SYAS, VICTORIA M              PARAMEDIC
COOGAN, PATRICK M        LIEUTENANT-EMT
DE LA ROSA, JOSEPH                 FIRE
GRADILLA, IVON              SUPERVISING
POWELL, TIMOTHY M                 CHIEF
Name: Job Titles, Length: 24795, dtype: object

In [51]:
# we can compute statistics on these values
df['Job Titles'].str.split(" ").str.get(0).value_counts()

Job Titles
POLICE                 10458
FIREFIGHTER-EMT         2147
SERGEANT                1262
PARAMEDIC                704
FIRE                     563
                       ...  
INFRASTRUCTURE             1
FUEL                       1
INTAKE                     1
COMMANDER-LOGISTICS        1
COMMANDER-PARAMEDIC        1
Name: count, Length: 293, dtype: int64

In [52]:
# for example, we could find the most common first name

filename = 'Chicago_employee_salaries.csv' 
fullpath = os.path.join(data_path, filename)

df = pd.read_csv(filepath_or_buffer=fullpath,
                 usecols=["Name","Job Titles","Department","Annual Salary"]).dropna(how='any')
df['Department'] = df['Department'].astype("category")
df['Job Titles'] = df['Job Titles'].astype("category")
df.head()

Unnamed: 0,Name,Job Titles,Department,Annual Salary
0,"GARCIA, CHRISTOPHER A",CLERK - BD OF ELECTIONS,BOARD OF ELECTION COMMISSIONERS,34176.0
1,"SILVA, JAVIER",PROJECT COORD,CHICAGO DEPARTMENT OF PUBLIC HEALTH,71004.0
2,"HARRISON, LETECHIA",SR PROCUREMENT SPECIALIST,DEPARTMENT OF PROCUREMENT SERVICES,90660.0
3,"WASHINGTON, MAURICE",PROJECT COORD,DEPARTMENT OF FAMILY AND SUPPORT SERVICES,71004.0
4,"ASHFORD, MARQUISHA D",INQUIRY AIDE III,OFFICE OF PUBLIC SAFETY ADMINISTRATION,46056.0


In [53]:
df["Name"].str.split(", ").str.slice(start=0,stop=1,step=1)

0            [GARCIA]
1             [SILVA]
2          [HARRISON]
3        [WASHINGTON]
4           [ASHFORD]
             ...     
31924          [SYAS]
31925        [COOGAN]
31927    [DE LA ROSA]
31929      [GRADILLA]
31931        [POWELL]
Name: Name, Length: 24795, dtype: object

In [54]:
lista = [1,2,3,4,5]
lista[0:2]

[1, 2]

In [55]:
df["Name"].str.title().str.split(", ").str.get(1).str.strip().str.split(" ").str.get(0).value_counts()

Name
Michael    734
John       456
Daniel     397
Joseph     343
David      339
          ... 
Padraic      1
Saadeh       1
Camron       1
Skyeler      1
Derec        1
Name: count, Length: 5138, dtype: int64
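
The long chain above is easier to follow when split into named steps. This sketch does that on four names; the first three appear in the outputs above and ``PEREZ, JOSEPH B`` is made up so that one first name repeats.

```python
import pandas as pd

names = pd.Series([
    "GARCIA, CHRISTOPHER A",
    "SILVA, JAVIER",
    "DE LA ROSA, JOSEPH",
    "PEREZ, JOSEPH B",   # invented row, only here to create a repeated first name
])

# Step 1: "LAST, FIRST MIDDLE" -> take the part after the comma.
given = names.str.title().str.split(", ").str.get(1)

# Step 2: drop stray whitespace and keep only the first word.
first_names = given.str.strip().str.split(" ").str.get(0)

# Step 3: count how often each first name appears.
counts = first_names.value_counts()
print(counts)
```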

### The expand parameter of the split method
* The ``expand`` parameter makes ``split`` return a DataFrame, with one column per piece, instead of a Series of lists.
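
A minimal sketch of ``expand=True`` on three names from the dataset. Two details here are assumptions added for robustness: ``n=1`` limits the split to the first comma, so the result always has exactly two columns, and ``str.strip()`` removes the space that remains after splitting on ``","`` alone.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["GARCIA, CHRISTOPHER A", "SILVA, JAVIER", "DE LA ROSA, JOSEPH"]})

# expand=True returns a DataFrame with integer column labels 0 and 1.
parts = people["Name"].str.split(",", n=1, expand=True)

# Strip the leading space left in the second piece before storing it.
people["Apellido"] = parts[0].str.strip()
people["Nombre"] = parts[1].str.strip()

print(people)
```

Without ``n=1``, a name containing a second comma would produce a third column, and assigning the result to two new columns would then fail.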

In [56]:
filename = 'Chicago_employee_salaries.csv' 
fullpath = os.path.join(data_path, filename)

df = pd.read_csv(filepath_or_buffer=fullpath,
                 usecols=["Name","Job Titles","Department","Annual Salary"]).dropna(how='any')
df['Department'] = df['Department'].astype("category")
df['Job Titles'] = df['Job Titles'].astype("category")
df.head()

Unnamed: 0,Name,Job Titles,Department,Annual Salary
0,"GARCIA, CHRISTOPHER A",CLERK - BD OF ELECTIONS,BOARD OF ELECTION COMMISSIONERS,34176.0
1,"SILVA, JAVIER",PROJECT COORD,CHICAGO DEPARTMENT OF PUBLIC HEALTH,71004.0
2,"HARRISON, LETECHIA",SR PROCUREMENT SPECIALIST,DEPARTMENT OF PROCUREMENT SERVICES,90660.0
3,"WASHINGTON, MAURICE",PROJECT COORD,DEPARTMENT OF FAMILY AND SUPPORT SERVICES,71004.0
4,"ASHFORD, MARQUISHA D",INQUIRY AIDE III,OFFICE OF PUBLIC SAFETY ADMINISTRATION,46056.0


In [57]:
# we can split the name into last name and first name and create new columns

df[["Apellido","Nombre"]] = df["Name"].str.split(",", expand=True)

In [58]:
df.head(5)

Unnamed: 0,Name,Job Titles,Department,Annual Salary,Apellido,Nombre
0,"GARCIA, CHRISTOPHER A",CLERK - BD OF ELECTIONS,BOARD OF ELECTION COMMISSIONERS,34176.0,GARCIA,CHRISTOPHER A
1,"SILVA, JAVIER",PROJECT COORD,CHICAGO DEPARTMENT OF PUBLIC HEALTH,71004.0,SILVA,JAVIER
2,"HARRISON, LETECHIA",SR PROCUREMENT SPECIALIST,DEPARTMENT OF PROCUREMENT SERVICES,90660.0,HARRISON,LETECHIA
3,"WASHINGTON, MAURICE",PROJECT COORD,DEPARTMENT OF FAMILY AND SUPPORT SERVICES,71004.0,WASHINGTON,MAURICE
4,"ASHFORD, MARQUISHA D",INQUIRY AIDE III,OFFICE OF PUBLIC SAFETY ADMINISTRATION,46056.0,ASHFORD,MARQUISHA D


## Examples

In [59]:
filename='netflix_titles.csv'
fullpath = os.path.join(data_path,filename)

df_netflix = pd.read_csv(fullpath)
df_netflix.head(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [60]:
df_netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [61]:
df_netflix["cast"].dropna().str.split(', ').values.tolist()

[['Ama Qamata',
  'Khosi Ngema',
  'Gail Mabalane',
  'Thabang Molaba',
  'Dillon Windvogel',
  'Natasha Thahane',
  'Arno Greeff',
  'Xolile Tshabalala',
  'Getmore Sithole',
  'Cindy Mahlangu',
  'Ryle De Morny',
  'Greteli Fincham',
  'Sello Maake Ka-Ncube',
  'Odwa Gwanya',
  'Mekaila Mathys',
  'Sandi Schultz',
  'Duane Williams',
  'Shamilla Miller',
  'Patrick Mofokeng'],
 ['Sami Bouajila',
  'Tracy Gotoas',
  'Samuel Jouy',
  'Nabiha Akkari',
  'Sofia Lesaffre',
  'Salim Kechiouche',
  'Noureddine Farihi',
  'Geert Van Rampelberg',
  'Bakary Diombera'],
 ['Mayur More',
  'Jitendra Kumar',
  'Ranjan Raj',
  'Alam Khan',
  'Ahsaas Channa',
  'Revathi Pillai',
  'Urvi Singh',
  'Arun Kumar'],
 ['Kate Siegel',
  'Zach Gilford',
  'Hamish Linklater',
  'Henry Thomas',
  'Kristin Lehman',
  'Samantha Sloyan',
  'Igby Rigney',
  'Rahul Kohli',
  'Annarah Cymone',
  'Annabeth Gish',
  'Alex Essoe',
  'Rahul Abburi',
  'Matt Biedel',
  'Michael Trucco',
  'Crystal Balint',
  'Louis Oliv

In [62]:
df_netflix["cast"].dropna().str.split(', ')

1       [Ama Qamata, Khosi Ngema, Gail Mabalane, Thaba...
2       [Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nab...
4       [Mayur More, Jitendra Kumar, Ranjan Raj, Alam ...
5       [Kate Siegel, Zach Gilford, Hamish Linklater, ...
6       [Vanessa Hudgens, Kimiko Glenn, James Marsden,...
                              ...                        
8801    [Ali Suliman, Saleh Bakri, Yasa, Ali Al-Jabri,...
8802    [Mark Ruffalo, Jake Gyllenhaal, Robert Downey ...
8804    [Jesse Eisenberg, Woody Harrelson, Emma Stone,...
8805    [Tim Allen, Courteney Cox, Chevy Chase, Kate M...
8806    [Vicky Kaushal, Sarah-Jane Dias, Raaghav Chana...
Name: cast, Length: 7982, dtype: object

In [63]:
lista_n = df_netflix["cast"].dropna().str.split(', ').values.tolist() # convert the values to a list of lists

lista_n = [element for sublist in lista_n for element in sublist] # flatten the list of lists

serie_n = pd.Series(lista_n)

print(serie_n.value_counts().to_frame())

                   count
Anupam Kher           43
Shah Rukh Khan        35
Julie Tejwani         33
Takahiro Sakurai      32
Naseeruddin Shah      32
...                  ...
Tejashree Pradhan      1
Neha Joshi             1
Ayesha Omer            1
Samina Peerzada        1
Waseem Abbas           1

[36439 rows x 1 columns]
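
The same count can be obtained without leaving pandas. As an alternative to the list comprehension, ``Series.explode`` turns each element of a list into its own row. The sketch below uses a made-up ``cast`` Series with invented names, not the Netflix data.

```python
import pandas as pd

# Invented cast strings; None stands in for titles with no cast listed.
cast = pd.Series([
    "Ana Ruiz, Luis Paz",
    None,
    "Luis Paz, Eva Mora, Ana Ruiz",
    "Luis Paz",
])

# split -> one list per title; explode -> one row per actor.
actors = cast.dropna().str.split(", ").explode()

counts = actors.value_counts()
print(counts.to_frame())
```

``explode`` keeps the original row label for every actor, which is useful if you later want to join the result back to the titles.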
