# How to use indexes on strings

In this chapter, we’ll use a different dataset called <font color='red'>staff.csv</font>. Let’s begin by creating a <font color='red'>DataFrame</font> by reading the file.

In [1]:
import pandas as pd

staff = pd.read_csv("staff.csv")

print(f"\nStaff data frame has the following columns: \n{list(staff.columns)}\n")

print(staff)


Staff data frame has the following columns: 
['name', 'city', 'date_of_birth', 'start_date', 'salary', 'department']

               name             city date_of_birth  start_date   salary  \
0          John Doe      Houston, TX    1998-11-04  2018-08-11  $65,000   
1          Jane Doe     San Jose, CA    1995-08-05  2017-08-24  $70,000   
2        Matt smith       Dallas, TX    1996-11-25  2020-04-16  $58,500   
3     Ashley Harris        Miami, FL    1995-01-08  2021-02-11  $49,500   
4  Jonathan targett  Santa Clara, CA    1998-08-14  2020-09-01  $62,000   
5         Hale Cole      Atlanta, GA    2000-10-24  2021-10-20  $54,500   

        department  
0       Accounting  
1    Field Quality  
2  human resources  
3       accounting  
4    field quality  
5      engineering  


In [2]:
staff.columns

Index(['name', 'city', 'date_of_birth', 'start_date', 'salary', 'department'], dtype='object')

Textual data is an important component of data science. Some areas, such as <u>natural language processing (NLP)</u>, work with textual data extensively. The Pandas library provides several functions and methods for working with textual data, which are accessed via the **str accessor**. The first operation we’ll explore uses the indexes of strings.

A string is a sequence of characters, so each character has an <u>associated index</u>. The indexes of characters can be used to select an individual character or a slice from a string. For instance, we can get the first letter of the strings in the name column as below:

In [3]:
import pandas as pd

staff = pd.read_csv("staff.csv")

print(staff["name"].str[0])

0    J
1    J
2    M
3    A
4    J
5    H
Name: name, dtype: object


In [4]:
staff["name"]

0            John Doe
1            Jane Doe
2          Matt smith
3       Ashley Harris
4    Jonathan targett
5           Hale Cole
Name: name, dtype: object
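
As a small sketch of the same idea, the bracket lookup can also be written with the `str.get` method. The Series below is typed in by hand (an assumption made so the snippet runs without staff.csv); it mirrors the first three rows of the name column.

```python
import pandas as pd

# Hand-made sample mirroring the first rows of the name column
# (assumption: built inline so the snippet runs without staff.csv).
names = pd.Series(["John Doe", "Jane Doe", "Matt smith"], name="name")

first_by_index = names.str[0]    # bracket indexing through the str accessor
first_by_get = names.str.get(0)  # the same lookup written as a method call

print(first_by_index.tolist())
print(first_by_index.equals(first_by_get))
```

Both forms return a new Series with one character per row, so either can be assigned to a new column.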

The strings have integer indexes starting from zero. If we want to take a slice from a string, we simply need to specify the start and end indexes. For example, we can select the first three letters of each string in the name column as below:

In [5]:
import pandas as pd

staff = pd.read_csv("staff.csv")

print(staff["name"].str[0:3])

0    Joh
1    Jan
2    Mat
3    Ash
4    Jon
5    Hal
Name: name, dtype: object


If the desired slice starts from the **first index (zero)**, we needn’t write the initial index. Thus, the following line of code does the same thing as above.

In [6]:
staff["name"].str[:3]


0    Joh
1    Jan
2    Mat
3    Ash
4    Jon
5    Hal
Name: name, dtype: object
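
For comparison, here is a minimal sketch that puts the bracket slice next to the `str.slice` method and to ordinary Python slicing on a single string. The sample Series is hand-made (an assumption, so the snippet does not depend on staff.csv).

```python
import pandas as pd

# Hand-made sample (assumption: stands in for staff["name"]).
names = pd.Series(["John Doe", "Jane Doe", "Matt smith"], name="name")

bracket = names.str[:3]           # slice notation on the str accessor
method = names.str.slice(stop=3)  # equivalent method form
plain = names.iloc[0][:3]         # plain Python slicing on one string

print(bracket.tolist())
print(bracket.equals(method))
print(plain)
```

The accessor applies to every row the same slice that plain Python applies to one string.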

It’s important to note that the **<u>upper bound is exclusive</u>**. Thus, [:3] indicates the **indexes 0, 1, and 2**.

It’s also possible to use indexes that count from the end of a string. In this case, the indexes start from -1 (the last character) and continue as -2, -3, and so on. The following line of code returns the last two characters of each string in the name column.

In [7]:
import pandas as pd

staff = pd.read_csv("staff.csv")

print(staff["name"].str[-2:])

0    oe
1    oe
2    th
3    is
4    tt
5    le
Name: name, dtype: object
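
Negative indexes are handy when the useful part sits at the end of a string. As a sketch, the snippet below pulls the two-letter state code from city values in the "City, ST" layout shown earlier. The Series is hand-made (an assumption, so the snippet does not depend on staff.csv), and the approach only works when every value ends with a two-letter code.

```python
import pandas as pd

# Hand-made sample mirroring the first rows of the city column
# (assumption: every value ends with a two-letter state code).
cities = pd.Series(["Houston, TX", "San Jose, CA", "Dallas, TX"], name="city")

states = cities.str[-2:]     # last two characters of each string
last_char = cities.str[-1]   # single character at index -1

print(states.tolist())
print(last_char.tolist())
```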


To make the slicing and indexing operations even more flexible, Pandas allows for customizing the step size as well. For instance, we can create a slice that takes every other character, starting from the second character (index 1).

In [8]:
import pandas as pd

staff = pd.read_csv("staff.csv")

print(staff["name"].str[1::2])

0        onDe
1        aeDe
2       atsih
3      slyHri
4    oahntret
5        aeCl
Name: name, dtype: object


In [9]:
staff["name"]

0            John Doe
1            Jane Doe
2          Matt smith
3       Ashley Harris
4    Jonathan targett
5           Hale Cole
Name: name, dtype: object

The structure is as follows:

**str[start : end : step size]**

If the end is left blank, then the slice goes up to the end of the string.
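
To see start, end, and step size working together, here is a short sketch on a hand-made Series (an assumption, so the snippet runs without staff.csv). It also shows a negative step, which walks through the string backwards.

```python
import pandas as pd

# Hand-made sample (assumption: two values taken from the name column).
names = pd.Series(["John Doe", "Hale Cole"], name="name")

middle = names.str[2:6]        # start=2, end=6 (exclusive), default step of 1
every_other = names.str[1::2]  # start=1, end left blank, step=2
reversed_names = names.str[::-1]  # start and end left blank, step=-1

print(middle.tolist())
print(every_other.tolist())
print(reversed_names.tolist())
```

Any of the three parts can be omitted: a missing start or end defaults to the corresponding end of the string, and a missing step defaults to 1.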