**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
  - [Importing the data](#toc1_2_)    
- [String Manipulation](#toc2_)    
    - [*Searching through strings: the `.str.extract()` method*](#toc2_1_1_)    
    - [*Replacing text: `.str.replace()` and `<Series>.replace()`*](#toc2_1_2_)    
    - [*Splitting text: the `.str.split()` method*](#toc2_1_3_)    


**Read the official documentation on pandas Series objects @ https://pandas.pydata.org/pandas-docs/stable/reference/series.html**

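**`Note:`** Many Python built-ins work directly on pandas Series objects, e.g. **len, type, dir, sum, sorted, max, min**. The `in` operator also works, but it tests the index labels rather than the values. Aggregations such as mean and product are not Python built-ins; they are available as Series methods (`.mean()`, `.prod()`).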
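
As a minimal sketch of how built-ins behave on a Series, the following uses a made-up Series (the values and index labels are illustrative assumptions, not part of the vehicles dataset). It also shows that `in` looks at the index labels, which is easy to get wrong.

```python
import pandas as pd

# Hypothetical toy data; not taken from the vehicles dataset
mpg = pd.Series([19, 23, 10, 17], index=["a", "b", "c", "d"])

n_items = len(mpg)      # number of elements
total = sum(mpg)        # built-in sum iterates over the values
largest = max(mpg)      # built-in max also iterates over the values
ordered = sorted(mpg)   # returns a plain Python list, not a Series

# `in` checks the index labels, not the values
label_found = "a" in mpg
value_found_in_index = 19 in mpg
value_found = 19 in mpg.values
```

To test membership among the values, use `mpg.values` as above or the `.isin()` method.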

Also, **chaining functions/methods** in pandas works the same way as in plain Python.

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
# import statements
import numpy as np
import pandas as pd

### <a id='toc1_2_'></a>[Importing the data](#toc0_)

One of the datasets used for the examples in this notebook is `Data/vehicles.csv.zip`.

In [2]:
# read the vehicles.csv dataset
df = pd.read_csv("Data/vehicles.csv.zip")


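Columns of a DataFrame can be accessed in several ways. One of them is **dot notation**, i.e. `df.column_name`, which works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.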

In [3]:
# the city08 and highway08 columns from the vehicles.csv dataset provide
# miles-per-gallon figures for city and highway driving respectively.
city_mpg = df.city08
highway_mpg = df.highway08

In [4]:
# The make column in the vehicles dataset holds the manufacturer name (strings) and is stored as an object.
manufac = df.make
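
The sketch below contrasts dot access with bracket access on a small made-up frame. The frame and its column names are assumptions chosen to show where dot access does not apply; they are not columns of the vehicles dataset.

```python
import pandas as pd

# Hypothetical toy frame; column names chosen to illustrate the limits of dot access
cars = pd.DataFrame({"make": ["Audi", "BMW"], "city mpg": [19, 21], "count": [3, 5]})

make_dot = cars.make          # works: "make" is a valid identifier
make_bracket = cars["make"]   # equivalent bracket access

city = cars["city mpg"]       # a space in the name rules out dot access
count_col = cars["count"]     # cars.count is the DataFrame.count method, not the column
```

Bracket access works for every column name, so it is the safer default when column names are not known in advance.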

**Note:** The first thing to do after loading a dataset is to check the datatype of each column and cast columns to more suitable datatypes where possible. This saves memory and speeds up code execution.
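
A rough sketch of that workflow, using a made-up frame in place of the real dataset: the column names mirror the vehicles data, but the values and the chosen target dtypes (`int8`, `category`) are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame standing in for a freshly loaded dataset
raw = pd.DataFrame(
    {
        "city08": [19, 23, 10, 17] * 250,
        "make": ["Audi", "BMW", "Audi", "BMW"] * 250,
    }
)

dtypes_before = raw.dtypes  # inspect what pandas inferred for each column

# Cast to narrower types; int8 only fits values between -128 and 127
converted = raw.astype({"city08": "int8", "make": "category"})

# deep=True also counts the memory held by the Python string objects
bytes_before = raw.memory_usage(deep=True).sum()
bytes_after = converted.memory_usage(deep=True).sum()
```

Check the value range of a numeric column (for example with `.min()` and `.max()`) before downcasting it, since values outside the target range cannot be represented.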

---------------------

## <a id='toc2_'></a>[String Manipulation](#toc0_)

------------------

By default, pandas has traditionally stored string data with the generic `object` dtype. An object column can hold anything, including Python lists, dictionaries, or instances of custom classes, so pandas cannot assume that its contents are text. To get a dedicated text type, convert the column to the `string` dtype with the `astype()` method. If the strings have low cardinality (few unique values), the `category` dtype is another option; it typically reduces memory use and can speed up processing further.

In [5]:
manufac_str = manufac.astype("string")
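
To make the difference between these dtypes concrete, here is a small sketch on made-up manufacturer names (the data is an assumption, not taken from the vehicles dataset). It shows how a missing value is represented after conversion and how a categorical stores its values.

```python
import pandas as pd

# Hypothetical toy data with repeated manufacturer names and one missing value
makes = pd.Series(["Audi", "BMW", "Audi", None, "BMW"], dtype="object")

as_string = makes.astype("string")      # dedicated string dtype; the missing value becomes <NA>
as_category = makes.astype("category")  # stores each distinct name once, plus integer codes

n_missing = as_string.isna().sum()
categories = list(as_category.cat.categories)
codes = list(as_category.cat.codes)     # -1 marks the missing value
```

The `.str` accessor is available on all three variants as long as the underlying values are strings.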

**The .str accessor provides many string manipulation methods, most of which work similarly to the Python string methods. The string methods available through the .str accessor are:**

In [6]:
[name for name in dir(manufac_str.str) if not name.startswith("_")]

['capitalize',
 'casefold',
 'cat',
 'center',
 'contains',
 'count',
 'decode',
 'encode',
 'endswith',
 'extract',
 'extractall',
 'find',
 'findall',
 'fullmatch',
 'get',
 'get_dummies',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'islower',
 'isnumeric',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'len',
 'ljust',
 'lower',
 'lstrip',
 'match',
 'normalize',
 'pad',
 'partition',
 'removeprefix',
 'removesuffix',
 'repeat',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'slice',
 'slice_replace',
 'split',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'wrap',
 'zfill']

#### <a id='toc2_1_1_'></a>[*Searching through strings: the `.str.extract()` method*](#toc0_)

Returns a **DataFrame** with the first match of each regular expression capture group (the parts of the pattern enclosed in parentheses) in its own column; named groups supply the column names. With a single capture group and **expand=False**, it returns a **Series** instead.

In [7]:
# the regex (?P<letter>^[A-C]) defines a capture group named 'letter'
# it matches a single character in the range A-C (i.e. A, B or C) at the start of each string element
manufac_str.str.extract(r"(?P<letter>^[A-C])", expand=False).value_counts()

letter
C    5336
B    2796
A    1610
Name: count, dtype: Int64
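
As a further sketch, the example below uses two named groups on made-up data (the strings and group names are assumptions for illustration) to show the DataFrame result, alongside the single-group `expand=False` case that yields a Series.

```python
import pandas as pd

# Hypothetical toy data; not taken from the vehicles dataset
models = pd.Series(["Audi A4", "BMW 320", "Chevrolet Volt", "Tesla"], dtype="string")

# Two named groups: the first word and the word after a single space
parts = models.str.extract(r"^(?P<make>\w+) (?P<model>\w+)$")

# One group with expand=False returns a Series instead of a DataFrame
first_letter = models.str.extract(r"^(?P<letter>[A-C])", expand=False)
```

Rows that do not match the pattern, such as `"Tesla"` here, come back as missing values in every extracted column.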

#### <a id='toc2_1_2_'></a>[*Replacing text: `.str.replace()` and `<Series>.replace()`*](#toc0_)

Both methods can handle either task, but as a rule use **.str.replace() to replace substrings and \<Series>.replace() to replace complete strings.** To replace a substring with \<Series>.replace(), set `regex=True`.

In [8]:
# replacing partial string with .str.replace()
manufac_str.str.replace("A", "Å").sort_values().tail()

20820           Åutokraft Limited
19489    Åvanti Motor Corporation
18246    Åvanti Motor Corporation
24051              Åzure Dynamics
24050              Åzure Dynamics
Name: make, dtype: string

In [9]:
# replacing partial string with .replace() method
manufac_str.replace("A", "Å", regex=True).sort_values().tail()

20820           Åutokraft Limited
19489    Åvanti Motor Corporation
18246    Åvanti Motor Corporation
24051              Åzure Dynamics
24050              Åzure Dynamics
Name: make, dtype: string
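
The two examples above give the same result because both are doing substring replacement. The sketch below, on made-up data (an assumption, not from the vehicles dataset), shows where the two methods differ: without `regex=True`, `Series.replace()` only matches whole values.

```python
import pandas as pd

# Hypothetical toy data; not taken from the vehicles dataset
makes = pd.Series(["Audi", "Audi Sport", "BMW"], dtype="string")

# Series.replace matches whole values by default, so "Audi Sport" is left alone
whole = makes.replace("Audi", "AUDI")

# .str.replace works on substrings, so both Audi entries change
partial = makes.str.replace("Audi", "AUDI", regex=False)

# Series.replace with regex=True also matches substrings
partial_regex = makes.replace("Audi", "AUDI", regex=True)
```

Passing `regex=False` to `.str.replace()` makes the literal-match intent explicit, which matters when the search text contains regex metacharacters such as `.` or `(`.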

#### <a id='toc2_1_3_'></a>[*Splitting text: the `.str.split()` method*](#toc0_)

This is useful for data such as survey responses with binned numeric values. By default, the split method returns a Series of lists of the split values, which is awkward to manipulate. Setting **expand=True** returns a DataFrame instead, with one column per piece. Individual columns of that DataFrame can then be accessed as Series objects.

In [10]:
# example of splitting binned data
age = pd.Series(["1-10", "11-20", "21-30", "31-40", "41-50"])

In [11]:
age_df = (
    age.astype(dtype="string")
    .str.split("-", expand=True)
    .rename(columns={0: "low", 1: "high"})
)

In [12]:
age_df

|   | low | high |
|---|-----|------|
| 0 | 1   | 10   |
| 1 | 11  | 20   |
| 2 | 21  | 30   |
| 3 | 31  | 40   |
| 4 | 41  | 50   |


In [13]:
age_low = age_df.low
age_low

0     1
1    11
2    21
3    31
4    41
Name: low, dtype: string

**`.str.partition(sep, expand=True)` works similarly**, but splits only at the first occurrence of `sep`. It returns a DataFrame with 3 columns: the part before the separator, the separator itself, and the part after it.
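
A short sketch comparing the two methods on made-up values (an assumption for illustration; the last value contains two separators to make the difference visible):

```python
import pandas as pd

# Hypothetical binned-style data, plus one value containing two separators
values = pd.Series(["1-10", "11-20", "2020-01-15"], dtype="string")

# partition splits at the first separator only and keeps the separator as a column
parts = values.str.partition("-")  # expand=True is the default

# split breaks at every separator, so the last row forces a third column
pieces = values.str.split("-", expand=True)
```

With `split`, rows that have fewer pieces than the widest row are padded with missing values, whereas `partition` always yields exactly three columns.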