# Deep dive into Pandas Series data structure

**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_)    
- [Importing the data](#toc2_)    
- [Data type casting](#toc3_)    
- [Mathematical operations and Index Alignment](#toc4_)    
- [Aggregate methods](#toc5_)    
  - [Multiple aggregation: The _.agg()_ function](#toc5_1_)    
- [Manipulation methods](#toc6_)    
- [String Manipulation](#toc7_)    
  - [Searching through strings: the `.str.extract()` method](#toc7_1_)    
  - [Replacing text: `.str.replace()` and `<Series>.replace()`](#toc7_2_)    
  - [Splitting text: the `.str.split()` method](#toc7_3_)    
- [Missing values and How to handle them](#toc8_)    
  - [_Handling missing values_](#toc8_1_)    
- [Indexing Operations](#toc9_)    
  - [*Renaming Indexes*](#toc9_1_)    
  - [*Resetting Index Labels*](#toc9_2_)    
  - [*The `.loc[]` method*](#toc9_3_)    
  - [*The `.iloc[]` method*](#toc9_4_)    
  - [*Filtering Index Labels with `.filter(items, like, regex)`*](#toc9_5_)    
  - [*Reindexing with `.reindex(index)`*](#toc9_6_)    


**Read the official documentation on pandas Series objects @ https://pandas.pydata.org/pandas-docs/stable/reference/series.html**

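**`Note:`** Many Python built-ins work directly on pandas Series objects, e.g. **len, type, dir, sum, sorted, max, min**. Operations such as **mean** or **prod** are available as Series methods instead. Be aware that the **in** operator tests membership among the index labels, not the values.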

Also, the notion of **chaining functions/methods** in pandas is similar to python.
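
As a small sketch of the note above, the snippet below applies a few built-ins to a made-up Series (the values and labels are assumptions chosen for illustration) and shows how `in` behaves:

```python
import pandas as pd

# hypothetical toy data, not taken from the vehicles dataset
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

n_items = len(s)                   # number of elements
total = sum(s)                     # built-in sum iterates over the values
largest = max(s)                   # built-in max also works on the values
label_found = "a" in s             # "a" is an index label
value_found = 10 in s              # 10 is a value, not a label
value_in_values = 10 in s.values   # test against the underlying array instead

print(n_items, total, largest, label_found, value_found, value_in_values)
```

To test for a value rather than a label, checking against `.values` (or using `.isin()`) is one option.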

## <a id='toc1_'></a>[Import Statements](#toc0_)

--------------------------

In [1]:
# import statements
import numpy as np
import pandas as pd

---------------------------

## <a id='toc2_'></a>[Importing the data](#toc0_)

------------------------

One of the many datasets we will use for our examples in this notebook is the `/Data/vehicles.csv.zip` dataset.

In [2]:
# read the vehicles.csv dataset
df = pd.read_csv("Data/vehicles.csv.zip")



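Columns of a dataframe can be accessed in various ways, one of which is **dot (`.`) notation**. Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.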

In [3]:
# the city08 and highway08 columns from the vehicles.csv dataset provide information on
# miles per gallon usage while driving in the city and on the highway respectively.
city_mpg = df.city08
highway_mpg = df.highway08

In [4]:
# The make column in the vehicles dataset provides the manufacturer name (strings) and is stored as an object.
manufac = df.make

**Note:** The first thing we should do after loading a dataset is check the datatype of each column and cast it to a more suitable datatype where possible. This saves memory and speeds up processing. We will learn more about it in the next section, i.e., Data type casting.

----------------------------

## <a id='toc3_'></a>[Data type casting](#toc0_)

-----------------------------

We often need to convert between data types, usually for better performance (more manipulation options or lower memory usage). Pandas provides the `astype(dtype)` method for converting the data type of a Series or DataFrame object.

> Some of the major datatypes available in pandas include: 

    object, int, float, bool, datetime, category etc.

Refer to this article @ https://pbpython.com/pandas_dtypes.html for a basic idea on the pandas data types.

- Inspecting numerical limits of different integer and float types 

The **default numeric types are 8 bytes wide (64 bits, i.e., int64 or float64)**. If you can use a narrower type, you cut back on memory usage, leaving more memory to process more data. You can use NumPy to inspect the limits of integer and float types.

In [5]:
# integer
print(np.iinfo(np.int16))  # or, np.iinfo("int16")
print(np.iinfo(np.uint8))  # or, np.iinfo("uint8")

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------



In [6]:
# float
print(np.finfo("float16"))

Machine parameters for float16
---------------------------------------------------------------
precision =   3   resolution = 1.00040e-03
machep =    -10   eps =        9.76562e-04
negep =     -11   epsneg =     4.88281e-04
minexp =    -14   tiny =       6.10352e-05
maxexp =     16   max =        6.55040e+04
nexp =        5   min =        -max
smallest_normal = 6.10352e-05   smallest_subnormal = 5.96046e-08
---------------------------------------------------------------



- Checking memory usage

To check how much memory the values of a Series or DataFrame are consuming we can use the `nbytes` attribute.

In [7]:
# by default, the data in city_mpg Series was stored as int64 type

# the max value in our series object is 150
# so, we can't use int8 (max 127) but we can cast to int16

# to see how much space is saved
city_mpg.nbytes - city_mpg.astype("int16").nbytes

246864
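
The check described in the comments above can be sketched in code. This is only an illustration on a made-up three-element Series (a stand-in for `city_mpg`); it compares the value range against the NumPy limits before casting, and also shows `pd.to_numeric(..., downcast="integer")`, which picks the smallest signed integer type that fits:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for city_mpg
mpg = pd.Series([6, 18, 150], dtype="int64")

# int8 tops out at 127, so 150 does not fit; int16 does
fits_int8 = mpg.max() <= np.iinfo("int8").max
fits_int16 = (mpg.min() >= np.iinfo("int16").min) and (mpg.max() <= np.iinfo("int16").max)

small = mpg.astype("int16")                      # explicit cast
auto = pd.to_numeric(mpg, downcast="integer")    # let pandas choose the type
saved = mpg.nbytes - small.nbytes                # 3 values * (8 - 2) bytes

print(fits_int8, fits_int16, auto.dtype, saved)
```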

For object types, `.nbytes` only counts the array of references (8 bytes per element on a 64-bit build) that the pandas object holds, not the Python strings they point to. The **make** column in the vehicles dataset provides the manufacturer name (strings) and is stored as an object. To get the amount of memory that includes the strings, we need to use the `.memory_usage(deep=True)` method.

In [8]:
manufac.head(3)

0    Alfa Romeo
1       Ferrari
2         Dodge
Name: make, dtype: object

In [9]:
# examining memory usage with the nbytes attribute
manufac.nbytes

329152

In [10]:
# examining memory usage with the memory_usage method
manufac.memory_usage(deep=True)

2606395

The value of _.nbytes_ is just the memory that the data is using and not the ancillary parts of the Series. The _.memory_usage()_ method also includes the index memory and, with _deep=True_, the contribution from the objects stored in an object-typed Series.
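
The difference between these measurements can be reproduced on a small made-up Series. The sketch below assumes an object-typed Series with a few repeated strings (the data is hypothetical) and also includes the categorical version for comparison:

```python
import pandas as pd

# hypothetical stand-in for the make column: few distinct values, many repeats
makes = pd.Series(["Chevrolet", "Ford", "Dodge"] * 1000, dtype="object")

pointer_bytes = makes.nbytes                   # only the array of references
shallow_bytes = makes.memory_usage()           # references plus the index
deep_bytes = makes.memory_usage(deep=True)     # also counts the Python strings

makes_cat = makes.astype("category")
cat_deep_bytes = makes_cat.memory_usage(deep=True)

print(pointer_bytes, shallow_bytes, deep_bytes, cat_deep_bytes)
```

The exact byte counts depend on the platform and pandas version, so only the ordering of the numbers should be relied on.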

- **String and Category type**

A `categorical` series is useful for string data with many repeated values and can result in large memory savings. Instead of storing a Python `string` for every row, pandas stores each distinct value once and keeps a small integer code per row, so **repeating values are not duplicated**. You **still have all of the functionality of the .str accessor.**

For example, if we convert the make column from the vehicles dataframe (i.e., the manufac series) to a category type, it will have a much smaller memory footprint.

In [11]:
# the make column as categorical object
manufac_cat = df.make.astype("category")

In [12]:
manufac_cat.head(3)

0    Alfa Romeo
1       Ferrari
2         Dodge
Name: make, dtype: category
Categories (136, object): ['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo', ..., 'Volvo', 'Wallace Environmental', 'Yugo', 'smart']

In [13]:
# examining the memory footprint
manufac.memory_usage() - manufac_cat.memory_usage()

241608

- **Custom & ordered categories**

To define custom categories we can use the `pd.Categorical(values, categories, ordered=False)` constructor. To make the categories ordered, set _ordered=True_.

**Note:**
1. a Categorical **might have an order**, but numerical operations (additions, divisions, ...) are not possible.    
2. Assigning values **outside of categories** will result in replacing the value with **NaN** in the series object.
3. Order is defined by the **order of the categories, not lexical order of the values**.

In [14]:
# values
vals = manufac
# categories
cat = manufac.unique()

# ordered defaults to False; to get an ordered categorical, also pass ordered=True
ord_manufac = pd.Categorical(values=vals, categories=cat)

In [15]:
ord_manufac

['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru', ..., 'Subaru', 'Subaru', 'Subaru', 'Subaru', 'Subaru']
Length: 41144
Categories (136, object): ['Alfa Romeo', 'Ferrari', 'Dodge', 'Subaru', ..., 'Consulier Industries Inc', 'Goldacre', 'Isis Imports Ltd', 'PAS Inc - GMC']
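
The three notes on ordered categories can be illustrated with a small sketch. The size labels below are made up for the example and are not part of the vehicles data:

```python
import pandas as pd

# hypothetical values; "huge" is deliberately not one of the categories
sizes = pd.Categorical(
    ["small", "large", "huge", "medium"],
    categories=["small", "medium", "large"],
    ordered=True,
)
s = pd.Series(sizes)

n_missing = s.isna().sum()                    # "huge" becomes NaN
sorted_vals = s.sort_values().tolist()        # follows category order, NaN last
bigger_than_small = (s > "small").tolist()    # comparisons work on ordered categoricals

print(n_missing, sorted_vals, bigger_than_small)
```

Sorting follows the declared order (small, medium, large) even though the lexical order of the strings would be different.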

---------------------------------------------

## <a id='toc4_'></a>[Mathematical operations and Index Alignment](#toc0_)

------------------------------------------------

- **Mathematical operators**

> The available mathematical operations include: 

    +, -, *, /, // (floor division), % (modulus), @ (matrix multiplication), ** (power), <, <=, ==, !=, >=, >, & (bitwise and), ^ (bitwise xor), | (bitwise or).

However, pandas will **align the index** before performing any of these operations. Alignment takes each index entry in the left series and matches it with every entry that has the same label in the index of the right series. Because of index alignment, you will want to **make sure that the indexes are unique (no duplicates) and common to both series**. Otherwise, the result may not be what you expect.

Example of how index alignment can produce unintended results:

In [16]:
s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([15, 28, 32], index=[2, 2, 3])

In [17]:
s1 + s2

1     NaN
2    35.0
2    48.0
2    45.0
2    58.0
3     NaN
dtype: float64

Note that index 1 and index 3 have NaN values because each label exists in only one of the two series. For index 2, every index-2 value from s1 is matched with every index-2 value from s2, giving 2 x 2 = 4 rows.
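
For contrast, here is a minimal sketch of the well-behaved case, assuming two made-up series whose indexes are unique and share the same labels in a different order. Values are paired by label, not by position:

```python
import pandas as pd

# hypothetical toy data
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])  # same labels, different order

# a quick check that can be run before combining two series
safe_to_combine = a.index.is_unique and b.index.is_unique

total = a + b  # "x" pairs 1 with 30, "y" pairs 2 with 20, "z" pairs 3 with 10
print(safe_to_combine)
print(total)
```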

- **Operator methods**

Pandas also provides **operator methods** for each of the mathematical operations. The benefit is, the operator methods have a **fill_value** parameter. **By default, the operator methods will produce the same output as the mathematical operators themselves. But, if a fill_value is defined, then when one of the operands is missing, the method will use the fill_value instead.**

> Some of the operator methods include: 
    
    add(), sub(), mul(), div(), floordiv(), mod(), pow(), rfloordiv(), lt(), gt(), eq(), ne(), le(), ge(), dot() etc.

For instance, in the above example if we wanted to have values at index-1 and index-3 instead of NaN values we can use the **s1.add(s2, fill_value=0)** method. Probably, you can already guess how this will work out.

In [18]:
# using the add() method to add the elements of both series with fill_value=0
s1.add(s2, fill_value=0)

1    10.0
2    35.0
2    48.0
2    45.0
2    58.0
3    32.0
dtype: float64

Learn about all the available operator methods @ https://pandas.pydata.org/pandas-docs/stable/reference/series.html#binary-operator-functions

---------------------------------------------

## <a id='toc5_'></a>[Aggregate methods](#toc0_)

------------------------------------------------

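Aggregate methods collapse the values of a series down to a scalar. This lets you take detailed data and summarize it as a single value, e.g., sum, count, mean, median etc.

> Some of the commonly used aggregate and summary methods are listed below. A few of them, such as nlargest(), cumsum(), unique(), value_counts() and describe(), return an array or Series rather than a scalar.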

    all(), any(), count(), prod(), min(), max(), nsmallest(), nlargest(), cummax(), cumsum(), cumprod(), mean(), median(), mode(), sum(), std(), var(), quantile(), unique(), nunique(), value_counts(), describe() etc.

- Total number of rows in a Series

In [19]:
city_mpg.size

41144

- Counting non-null values

In [20]:
# Count total number of non-NA/non null values in a series
print("no of non-NA vals:", city_mpg.count())

# without the parentheses, city_mpg.count does not count anything; it just returns the bound method object

# city_mpg.count

no of non-NA vals: 41144


- Counting number of null values

In [21]:
city_mpg.isna().sum()

0
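
Because `city_mpg` has no missing values, `size` and `count()` agree above. A short sketch on a made-up series with gaps shows how the three measures differ:

```python
import pandas as pd

# hypothetical data with two missing entries
s = pd.Series([1.0, None, 3.0, None])

n_rows = s.size              # all rows, missing or not
n_present = s.count()        # non-NA values only
n_missing = s.isna().sum()   # True counts as 1, False as 0

print(n_rows, n_present, n_missing)
```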

- Largest n elements

In [22]:
# return the largest n values
print("3 largest values: \n", city_mpg.nlargest(3))

3 largest values: 
 31256    150
32599    150
33423    150
Name: city08, dtype: int64


- Cumulative functions

In [23]:
# cumulative sum of a series
city_mpg.cumsum().tail()

41139    755704
41140    755724
41141    755742
41142    755760
41143    755776
Name: city08, dtype: int64

- Quantile values

A quantile is a cut point that divides a probability distribution into areas of equal probability. If we think in percentages, the distribution is divided into 100 pieces (percentiles). On the PDF, the 5th percentile (the 0.05 quantile) is the point that cuts off an area of 5% in the lower tail of the distribution.

In [24]:
# Quantile by default returns the 50% quantile.
# We can also pass in a list. In such case, this will return a series object

city_mpg.quantile([0.25, 0.5, 0.75])  # 25%, 50% and 75% quantiles

0.25    15.0
0.50    17.0
0.75    20.0
Name: city08, dtype: float64

- Generate descriptive statistics

In [25]:
city_mpg.describe()

count    41144.000000
mean        18.369045
std          7.905886
min          6.000000
25%         15.000000
50%         17.000000
75%         20.000000
max        150.000000
Name: city08, dtype: float64

- Unique values in a series

In [26]:
# return unique values in a series as ndarray
manufac.unique()

array(['Alfa Romeo', 'Ferrari', 'Dodge', 'Subaru', 'Toyota', 'Volkswagen',
       'Volvo', 'Audi', 'BMW', 'Buick', 'Cadillac', 'Chevrolet',
       'Chrysler', 'CX Automotive', 'Nissan', 'Ford', 'Hyundai',
       'Infiniti', 'Lexus', 'Mercury', 'Mazda', 'Oldsmobile', 'Plymouth',
       'Pontiac', 'Rolls-Royce', 'Eagle', 'Lincoln', 'Mercedes-Benz',
       'GMC', 'Saab', 'Honda', 'Saturn', 'Mitsubishi', 'Isuzu', 'Jeep',
       'AM General', 'Geo', 'Suzuki', 'E. P. Dutton, Inc.', 'Land Rover',
       'PAS, Inc', 'Acura', 'Jaguar', 'Lotus', 'Grumman Olson', 'Porsche',
       'American Motors Corporation', 'Kia', 'Lamborghini',
       'Panoz Auto-Development', 'Maserati', 'Saleen', 'Aston Martin',
       'Dabryan Coach Builders Inc', 'Federal Coach', 'Vector', 'Bentley',
       'Daewoo', 'Qvale', 'Roush Performance', 'Autokraft Limited',
       'Bertone', 'Panther Car Company Limited', 'Texas Coach Company',
       'TVR Engineering Ltd', 'Morgan', 'MINI', 'Yugo', 'BMW Alpina',
       'Renaul

In [27]:
# return a series containing counts of unique values
manufac.value_counts(dropna=False)  # by default, dropna=True

Chevrolet                      4003
Ford                           3371
Dodge                          2583
GMC                            2494
Toyota                         2071
                               ... 
Volga Associated Automobile       1
Panos                             1
Mahindra                          1
Excalibur Autos                   1
London Coach Co Inc               1
Name: make, Length: 136, dtype: int64

### <a id='toc5_1_'></a>[Multiple aggregation: The _.agg()_ function](#toc0_)

The `.agg()` method can be used to perform multiple aggregate operations on a series object at the same time. You can also pass in your own aggregate functions.

In [28]:
city_mpg.agg(
    ["min", "idxmin", "max", "idxmax", "mean", "std", "var", "quantile", "all", "sum"]
)

min                 6
idxmin           7901
max               150
idxmax          31256
mean        18.369045
std          7.905886
var         62.503036
quantile         17.0
all              True
sum            755776
Name: city08, dtype: object

To learn about all the available aggregate functions see the documentation @ https://pandas.pydata.org/pandas-docs/stable/reference/series.html#computations-descriptive-stats
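
As a sketch of a user-defined aggregate, the example below mixes built-in names with a custom function. The function `iqr` and the data are hypothetical and exist only for illustration; the function is written so that it needs the whole Series (it calls `.quantile()`), which is one way to make sure it is treated as an aggregation:

```python
import pandas as pd


def iqr(values):
    """Interquartile range; a hypothetical custom aggregate for illustration."""
    return values.quantile(0.75) - values.quantile(0.25)


# made-up data
s = pd.Series([1, 2, 3, 4, 5])

# the custom function shows up in the result under its own name
summary = s.agg(["min", "max", iqr])
print(summary)
```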

--------------------------

## <a id='toc6_'></a>[Manipulation methods](#toc0_)

----------------------------------

When working on a dataset, the **most used** methods are usually manipulation methods of some kind. They are especially useful when we are cleaning up a dataset or exploring it to understand it better.

- Applying a custom function to every element of a Series (also works on DataFrames)

The `apply(func)` method will call the `func` function on every element of a Series.

This is **usually not wise to use** because the function is called once per element in Python instead of running as a vectorized operation, which can dramatically increase the computation time. Pandas already has a wide range of built-in methods and functions for almost all the manipulation operations we might want. If we can't find a suitable one, we can define our own function and call it through apply(). **Note that** we pass in the function object itself (its name without parentheses) rather than calling it.

In [29]:
# let's say, we only want to keep the top 5 manufacturers and replace other values with "Other" in the manufac Series

In [30]:
# defining the custom manipulation function
top_5_manufac = manufac.value_counts().index[0:5]


def custom_manipulation(val):
    if val in top_5_manufac:
        return val
    else:
        return "Other"

In [31]:
%%timeit # magic functions needs to go at the top of the cell

# applying the custom_manipulation function on manufac Series
manufac.apply(custom_manipulation)

45 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [32]:
cust_manufac = manufac.apply(custom_manipulation)

In [33]:
cust_manufac.value_counts()

Other        26622
Chevrolet     4003
Ford          3371
Dodge         2583
GMC           2494
Toyota        2071
Name: make, dtype: int64

- The `where(cond, other)` method

This method **replaces values where the condition is False with corresponding value from 'other'**.

In [34]:
# if we were to do the same thing as the above example with where() method

In [35]:
%%timeit
# Series.isin(values) checks Whether elements in Series are contained in `values`
manufac.where(cond=manufac.isin(top_5_manufac), other="Other")

2.9 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


As we can see, at about 2.9 ms versus 45 ms, this is roughly 15 times faster than the apply() approach.

- The `mask(cond, other)` method **replaces values where the condition is True with corresponding value from 'other'**. It is equivalent to: **where(~cond, other)**
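
A minimal sketch of how `where()` and `mask()` mirror each other, using a made-up three-element series:

```python
import pandas as pd

# hypothetical toy data
s = pd.Series([1, 5, 10])
cond = s > 3

kept = s.where(cond, other=0)      # values failing the condition become 0
masked = s.mask(cond, other=0)     # values passing the condition become 0
same_as_where = masked.equals(s.where(~cond, other=0))

print(kept.tolist(), masked.tolist(), same_as_where)
```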

- The `clip(lower, upper)` method for handling **outliers**

This will **replace all values above the upper threshold with the upper value and all values lower than the lower threshold with the lower value.**

Clipping is handy if you have outliers in your data. In the city_mpg series the values range from 6 to 150. But there are only 16 vehicles with city_mpg > 130 and 39 vehicles with city_mpg < 8 out of the total 41144 vehicles. If we wanted to clip those entries, we could do that very easily with the clip method. Likewise, we could clip the values to the range between the 5% quantile and the 95% quantile. In the right situations this is a very handy tool.

In [36]:
# in the first case
city_mpg_clip = city_mpg.clip(lower=8, upper=120)

# see that the values were replaced not dropped
print(len(city_mpg) == len(city_mpg_clip))

# Min and Max values of the city_mpg_clip
city_mpg_clip.agg(["min", "max"])

True


min      8
max    120
Name: city08, dtype: int64

In [37]:
# in the second case
quantile_5 = city_mpg.quantile(0.05)
quantile_95 = city_mpg.quantile(0.95)

city_mpg_clip_quantile = city_mpg.clip(lower=quantile_5, upper=quantile_95)

print("5% Quantile: ", quantile_5, "\n95% Quantile: ", quantile_95)

city_mpg_clip_quantile.agg(["min", "max"])

5% Quantile:  11.0 
95% Quantile:  27.0


min    11
max    27
Name: city08, dtype: int64

- The `sort_values()` method

This method will **sort the values (by default in ascending order) and also rearrange the index accordingly.**

**Note that** each value keeps its original index label when it is moved. Because pandas aligns on index labels rather than positions, we can still do math operations (and many other operations) between a sorted series and an unsorted one.

In [38]:
highway_mpg.sort_values().head()

23231    9
1979     9
26858    9
1990     9
23176    9
Name: highway08, dtype: int64
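
To illustrate that sorting does not break label-based arithmetic, here is a small sketch on made-up data: adding a series to its sorted copy doubles every value, because the values are matched by label.

```python
import pandas as pd

# hypothetical toy data
s = pd.Series([3, 1, 2], index=["a", "b", "c"])
sorted_s = s.sort_values()          # labels travel with their values

doubled = s + sorted_s              # matched by label, not by position

print(sorted_s.index.tolist())
print(doubled)
```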

- The `sort_index()` method

By default this method **will sort the indexes in ascending order**.

In [39]:
manufac.sort_values().sort_index(ascending=False).head()

41143    Subaru
41142    Subaru
41141    Subaru
41140    Subaru
41139    Subaru
Name: make, dtype: object

- The `drop_duplicates()` method 

This will drop the rows with duplicate values. This method has an argument called **keep**. By default its value is `"first"`, which keeps the first occurrence of each value; `"last"` keeps the last occurrence instead. If set to `False`, it removes every value that is duplicated, including the initial occurrence.

In [40]:
# manufacturers who have only one car listed on the data series
manufac.drop_duplicates(keep=False)

602                      E. P. Dutton, Inc.
1790                 Panoz Auto-Development
6266                                  Qvale
11033                Lambda Control Systems
13553                   London Coach Co Inc
16606                                Shelby
19027         Import Foreign Auto Sales Inc
19352    S and S Coach Company  E.p. Dutton
19353      Superior Coaches Div E.p. Dutton
19670                   Vixen Motor Company
20657           Volga Associated Automobile
20881                                 Panos
21147                           London Taxi
21443                       Excalibur Autos
23115                              Mahindra
24738                                Fisker
25795                      ASC Incorporated
32341                                 Karma
32788                            Koenigsegg
34303                       Aurora Cars Ltd
34386                        RUF Automobile
35193                   JBA Motorcars, Inc.
35753             Grumman Allied
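
The three settings of `keep` can be compared side by side on a tiny made-up series (the values are assumptions for illustration):

```python
import pandas as pd

# hypothetical toy data: "Ford" appears twice
s = pd.Series(["Ford", "Tesla", "Ford", "Fisker"])

keep_first = s.drop_duplicates()               # keep="first" is the default
keep_last = s.drop_duplicates(keep="last")
keep_none = s.drop_duplicates(keep=False)      # only values that occur exactly once

print(keep_first.tolist(), keep_last.tolist(), keep_none.tolist())
```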

- Ranking the Values, the `rank(axis=0, method='average', ascending=True)` method

In [41]:
# The `method` parameter defines how to rank the records that have the same value (ties).
# Available ranking methods are 'average', 'min', 'max', 'first', 'dense'. default = 'average'.

In [42]:
# Ranking based on city_mpg
city_mpg.rank(method="min").sort_values().tail(6)

25615    41139.0
34563    41139.0
34564    41141.0
32599    41142.0
31256    41142.0
33423    41142.0
Name: city08, dtype: float64
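
How the tie-breaking methods differ is easier to see on a tiny series with one tie. The data below is made up for the sketch:

```python
import pandas as pd

# hypothetical toy data: the two 20s are tied
s = pd.Series([10, 20, 20, 30])

rank_average = s.rank(method="average").tolist()  # ties share the mean of their ranks
rank_min = s.rank(method="min").tolist()          # ties share the lowest rank
rank_dense = s.rank(method="dense").tolist()      # like min, but without gaps afterwards
rank_first = s.rank(method="first").tolist()      # ties ranked in order of appearance

print(rank_average, rank_min, rank_dense, rank_first)
```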

- The `replace(to_replace, value, regex=False)` method

Replace values given in `to_replace` with `value`. This differs from updating with `.loc` or `.iloc`, which require
you to specify a location to update with some value.

This is a very **versatile and useful** method. To see its full functionality and usage examples refer to the docs.

In [43]:
manufac.replace("Subaru", "Subaru!!")

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4          Subaru!!
            ...    
41139      Subaru!!
41140      Subaru!!
41141      Subaru!!
41142      Subaru!!
41143      Subaru!!
Name: make, Length: 41144, dtype: object

- #### **Binning Data:** the `pd.cut(array_like, bins, labels, include_lowest=False)` and the `pd.qcut()` methods

As the name suggests, we can **categorize** our data values **into specific bins** (by default, half-open intervals with the lower limit excluded and the upper limit included).

By default, the **cut()** function labels each category with its half-open interval, but we can change this behaviour and define specific bin names with the labels argument. The **qcut()** function generates bins from quantile values so that all the bins hold roughly the same amount of data.

We can either pass the number of bins we want or define the bin edges as a list.

In [44]:
manufac_val_count = manufac.value_counts()

In [45]:
# defining bins by edges
pd.cut(
    manufac_val_count,
    [0, 500, 1000, 2000, 3000, 5000],
    labels=[
        "Manufacturer for less than 500 cars",
        "Manufacturer for more than 500 but less than 1000 cars",
        "Manufacturer for more than 1000 but less than 2000 cars",
        "Manufacturer for more than 2000 but less than 3000 cars",
        "Manufacturer for more than 3000 cars",
    ],
)

Chevrolet                                   Manufacturer for more than 3000 cars
Ford                                        Manufacturer for more than 3000 cars
Dodge                          Manufacturer for more than 2000 but less than ...
GMC                            Manufacturer for more than 2000 but less than ...
Toyota                         Manufacturer for more than 2000 but less than ...
                                                     ...                        
Volga Associated Automobile                  Manufacturer for less than 500 cars
Panos                                        Manufacturer for less than 500 cars
Mahindra                                     Manufacturer for less than 500 cars
Excalibur Autos                              Manufacturer for less than 500 cars
London Coach Co Inc                          Manufacturer for less than 500 cars
Name: make, Length: 136, dtype: category
Categories (5, object): ['Manufacturer for less than 500 cars' < 'Ma

In [46]:
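# binning the data into (up to) 10 quantile-based bins that each hold roughly the same number of entries
# duplicates="drop" discards repeated bin edges, which is why only 9 bins remain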
pd.qcut(manufac_val_count, 10, duplicates="drop")

Chevrolet                      (965.0, 4003.0]
Ford                           (965.0, 4003.0]
Dodge                          (965.0, 4003.0]
GMC                            (965.0, 4003.0]
Toyota                         (965.0, 4003.0]
                                    ...       
Volga Associated Automobile       (0.999, 2.0]
Panos                             (0.999, 2.0]
Mahindra                          (0.999, 2.0]
Excalibur Autos                   (0.999, 2.0]
London Coach Co Inc               (0.999, 2.0]
Name: make, Length: 136, dtype: category
Categories (9, interval[float64, right]): [(0.999, 2.0] < (2.0, 3.0] < (3.0, 5.0] < (5.0, 13.0] ... (56.0, 151.5] < (151.5, 469.0] < (469.0, 965.0] < (965.0, 4003.0]]
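
The half-open interval behaviour and the equal-count behaviour can both be checked on small made-up inputs. In this sketch the bin edges and labels are assumptions chosen so that one value sits exactly on the lowest edge:

```python
import pandas as pd

# hypothetical toy data: 0 sits exactly on the lowest bin edge
counts = pd.Series([0, 5, 10])

# default: intervals are (0, 5] and (5, 10], so 0 falls outside and becomes NaN
default_bins = pd.cut(counts, bins=[0, 5, 10], labels=["low", "high"])

# include_lowest=True makes the first interval include its left edge
with_lowest = pd.cut(counts, bins=[0, 5, 10], labels=["low", "high"], include_lowest=True)

# qcut: four quantile-based bins over the values 1..8, two values per bin
quartiles = pd.qcut(pd.Series(range(1, 9)), q=4)

print(default_bins.tolist())
print(with_lowest.tolist())
print(quartiles.value_counts().tolist())
```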

---------------------

## <a id='toc7_'></a>[String Manipulation](#toc0_)

------------------

By default, pandas usually stores string data with the object dtype. But object can mean various things, such as Python lists, dictionaries or custom classes. To be explicit about how we treat and use strings, we can convert the object type to the string type with the astype() method. If the strings have low cardinality we can also use the categorical type, which can reduce memory usage and processing time further.

In [47]:
manufac_str = manufac.astype("string")

**The .str accessor provides many string manipulation methods, most of which work similarly to the Python string methods. Some of the string methods are,**

    .str.capitalize()
    .str.lower()
    .str.upper()
    .str.normalize()
    .str.strip()
    
    .str.center()
    .str.ljust()
    
    .str.contains()
    .str.extract()
    .str.match()
    
    .str.count()
    .str.find()
    .str.index()
    
    .str.join()
    .str.split()
    .str.slice()
    .str.partition()

### <a id='toc7_1_'></a>[Searching through strings: the `.str.extract()` method](#toc0_)

Returns a **dataframe** with the first match of each regular expression capture group (the parts of the pattern enclosed in parentheses) in its own column, using the names of named groups as column names. With **expand=False** and a single capture group, it returns a **series**.

In [48]:
# the regex (?P<letter>^[A-C]) defines a capture group named 'letter'
# it matches when the string element starts with one of the characters A, B or C
manufac_str.str.extract(r"(?P<letter>^[A-C])", expand=False).value_counts()

C    5336
B    2796
A    1610
Name: letter, dtype: Int64
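
With more than one capture group, `.str.extract()` returns one column per group. A minimal sketch on made-up codes (the pattern and group names are assumptions for illustration):

```python
import pandas as pd

# hypothetical toy data; "C" has no digits, so the two-group pattern does not match it
codes = pd.Series(["A1", "B22", "C"])

# two named groups -> DataFrame with columns "letter" and "number"
parts = codes.str.extract(r"(?P<letter>[A-Z])(?P<number>\d+)")

# one group with expand=False -> Series named after the group
letters = codes.str.extract(r"(?P<letter>[A-Z])", expand=False)

print(parts)
print(letters)
```

Rows where the whole pattern fails to match are filled with missing values in every column.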

### <a id='toc7_2_'></a>[Replacing text: `.str.replace()` and `<Series>.replace()`](#toc0_)

Although both of these methods can perform both of the operations, one should use **.str.replace() to replace substrings and \<Series>.replace() to replace complete strings.** To replace a substring with .replace() method set, regex=True.

In [49]:
# replacing partial string with .str.replace()
manufac_str.str.replace("A", "Å").sort_values().tail()

20820           Åutokraft Limited
19489    Åvanti Motor Corporation
18246    Åvanti Motor Corporation
24051              Åzure Dynamics
24050              Åzure Dynamics
Name: make, dtype: string

In [50]:
# replacing partial string with .replace() method
manufac_str.replace("A", "Å", regex=True).sort_values().tail()

20820           Åutokraft Limited
19489    Åvanti Motor Corporation
18246    Åvanti Motor Corporation
24051              Åzure Dynamics
24050              Åzure Dynamics
Name: make, dtype: string
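
The difference between whole-value and substring replacement is easiest to see when the search text is both a full value and part of another value. A small sketch on made-up strings:

```python
import pandas as pd

# hypothetical toy data
s = pd.Series(["cat", "concat"])

whole = s.replace("cat", "dog")                     # only exact matches of the full value
sub = s.str.replace("cat", "dog", regex=False)      # every occurrence of the substring
sub_via_replace = s.replace("cat", "dog", regex=True)

print(whole.tolist(), sub.tolist(), sub_via_replace.tolist())
```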

### <a id='toc7_3_'></a>[Splitting text: the `.str.split()` method](#toc0_)

This may be useful when dealing with survey data that has binned numeric values. By default, the split method returns a series of lists of the split values, which is difficult to manipulate. If we set **expand=True**, it returns a DataFrame instead, and we can then access an individual column of the dataframe whenever we want a series object.

In [51]:
# example of splitting binned data
age = pd.Series(["1-10", "11-20", "21-30", "31-40", "41-50"])

In [52]:
age_df = (
    age.astype(dtype="string")
    .str.split("-", expand=True)
    .rename(columns={0: "low", 1: "high"})
)

In [53]:
age_df

  low high
0   1   10
1  11   20
2  21   30
3  31   40
4  41   50


In [54]:
age_low = age_df.low
age_low

0     1
1    11
2    21
3    31
4    41
Name: low, dtype: string

**The `.str.partition()` method works similarly.** .str.partition(sep, expand=True) splits each element at the first occurrence of sep and returns a dataframe with 3 columns: the part before the sep, the sep itself, and the part after.
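
A short sketch of `.str.partition()` on the same kind of binned data (a shortened, made-up version of the age series), including a cast of one part to integers:

```python
import pandas as pd

# hypothetical toy data
age = pd.Series(["1-10", "11-20"])

parts = age.str.partition("-")      # columns 0, 1, 2: before, separator, after
low = parts[0].astype(int)          # the pieces are strings until cast

print(parts)
print(low.tolist())
```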

------------------------------------

## <a id='toc8_'></a>[Missing values and How to handle them](#toc0_)

------------------------------

In [55]:
# The cylinders column has missing values
cylinders = df.cylinders

> **Counting the total number of missing values**

The `series.isna()` method detects missing values and returns a boolean series. A useful characteristic of the **sum()** method is that it treats **True as 1 and False as 0**. This property can be used to count the number of missing values.

In [56]:
# the series.isna() function detects missing values
cylinders.isna().sum()

206

> **Let's see for which manufacturers the cylinders value is missing**

In [57]:
# Boolean mask for missing values
mask_missing = cylinders.isna()

In [58]:
# value_counts() is used to count unique values
manufac[mask_missing].value_counts().head(5)

Tesla     74
smart     16
Ford      15
Nissan    14
BMW       10
Name: make, dtype: int64

Here we can see that Tesla accounts for the largest share of the cars with missing cylinders values. Since they are electric cars this makes sense. **Note:** An alternative would be to use the `.loc[]` indexer instead of this boolean masking.
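
The `.loc[]` alternative can be sketched as follows. The two small series are made-up stand-ins for `manufac` and `cylinders`:

```python
import pandas as pd

# hypothetical stand-ins for the make and cylinders columns
makes = pd.Series(["Tesla", "Ford", "Tesla", "BMW"])
cyl = pd.Series([None, 4.0, None, 6.0])

# select the makes at the rows where the cylinder count is missing
missing_makes = makes.loc[cyl.isna()]
missing_counts = missing_makes.value_counts()

print(missing_counts)
```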

Now let's discuss how to handle these missing values.

### <a id='toc8_1_'></a>[_Handling missing values_](#toc0_)

- The `.fillna()` method allows you to specify a replacement value for any missing data

In [59]:
cylinders.fillna(0).value_counts()  # Doesn't change in place

4.0     15938
6.0     14284
8.0      8801
5.0       771
12.0      626
3.0       279
0.0       206
10.0      170
2.0        59
16.0       10
Name: cylinders, dtype: int64

**Note that** in this case it was reasonable to replace the NaN values with 0, but that is not always the case. In other scenarios the **.mean(), .median(), .mode()** etc. may come in handy.
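
As a sketch of filling with a summary statistic instead of a constant, the example below uses made-up values. Note that `.mode()` returns a Series (there can be several modes), so the first entry is picked here as an assumption about which one to use:

```python
import pandas as pd

# hypothetical toy data with one missing value
s = pd.Series([4.0, None, 4.0, 8.0, 6.0])

filled_median = s.fillna(s.median())     # median of 4, 4, 6, 8
filled_mode = s.fillna(s.mode()[0])      # most frequent value

print(filled_median.tolist(), filled_mode.tolist())
```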

- The `.dropna()` method will drop the indexes (i.e, rows) with missing values

In [60]:
# the difference, 206, is the number of rows with NaN that were dropped
len(cylinders) - len(cylinders.dropna())

206

- The `.interpolate()` method will replace 'nan' with interpolation of the values around the missing value.

This is a very useful method to fill in missing values. For example, this comes in handy if the data is ordered (as time series data often is) and there are holes in the data. But you have to make sure that the data you are manipulating has a trend that can be captured by interpolation. Otherwise, this may lead to disastrous results.

In [61]:
# say we have a series that captures the [somewhat] upward trend of temp. as the season goes into summer from winter
temp = pd.Series([19, 20, 20, None, 22, 24, 23, 24])
temp.interpolate()  # doesn't change in place

0    19.0
1    20.0
2    20.0
3    21.0
4    22.0
5    24.0
6    23.0
7    24.0
dtype: float64

----------------------

## <a id='toc9_'></a>[Indexing Operations](#toc0_)

---------------------

We will see later when we discuss DataFrames that most of what we learn here (indexing of Series objects) applies to DataFrame objects as well.

To view the index of a Series we can use the `.index` attribute.

In [62]:
city_mpg.index

RangeIndex(start=0, stop=41144, step=1)

### <a id='toc9_1_'></a>[*Renaming Indexes*](#toc0_)

Many of the operations we will discuss here work on the index position while others work on the index label. If both are integers it can be a little confusing, but the difference becomes clearer when the index has string labels. So first, we will relabel the index with string values.

The `.rename(index)` method returns a new series with the original values but new index labels. If you pass in a scalar value, it instead changes the `.name` attribute of the series it returns and leaves the index intact.

We can pass in a dictionary that maps each previous index label to its new label. It also accepts a series, a scalar, or a function that takes an old label and returns a new label or a sequence. When we pass in a series whose index values match the current labels, the values of that series are used as the new index labels.

In [63]:
# rename the index labels of the city_mpg series with the manufacturers' names
# to_dict() builds a dict of the form {index label: series value}
city_rnm = city_mpg.rename(index=manufac.to_dict())

In [64]:
city_rnm.index

Index(['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru', 'Subaru', 'Subaru',
       'Toyota', 'Toyota', 'Toyota',
       ...
       'Saab', 'Saturn', 'Saturn', 'Saturn', 'Saturn', 'Subaru', 'Subaru',
       'Subaru', 'Subaru', 'Subaru'],
      dtype='object', length=41144)
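
To make the three input types easier to compare, here is a small sketch on a toy series. The `mpg` values and the `makes` mapping are hypothetical stand-ins for `city_mpg` and `manufac.to_dict()`.

```python
import pandas as pd

mpg = pd.Series([19, 9, 23], name="city08")          # hypothetical values
makes = {0: "Alfa Romeo", 1: "Ferrari", 2: "Dodge"}  # old label -> new label

by_dict = mpg.rename(makes)                          # dict: relabels the index
by_func = mpg.rename(lambda label: f"car_{label}")   # function: applied to each label
by_scalar = mpg.rename("city_mpg")                   # scalar: changes .name only

print(by_dict.index.tolist(), by_func.index.tolist(), by_scalar.name)
```

In all three cases the original `mpg` series is left unchanged, because `.rename()` returns a new series.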

### <a id='toc9_2_'></a>[*Resetting Index Labels*](#toc0_)

Sometimes we need a unique index to perform an operation. To replace the index with monotonically increasing (and therefore unique) integers starting at zero, we can use the `.reset_index()` method. By default, this method returns a DataFrame and moves the current index into a new column. To drop the current index and return a Series instead, we can set **drop=True**.

In [65]:
city_rnm.reset_index(drop=True).head()

0    19
1     9
2    23
3    10
4    17
Name: city08, dtype: int64
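
The difference between the default behaviour and **drop=True** can be seen on a two-row toy series (hypothetical values, assuming the index itself has no name):

```python
import pandas as pd

s = pd.Series([19, 9], index=["Alfa Romeo", "Ferrari"], name="city08")

as_frame = s.reset_index()            # DataFrame: old index moved to a column named "index"
as_series = s.reset_index(drop=True)  # Series: old index discarded, new labels 0..n-1

print(type(as_frame).__name__, list(as_frame.columns))
print(type(as_series).__name__, as_series.index.tolist())
```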

### <a id='toc9_3_'></a>[*The `.loc[]` method*](#toc0_) [&#8593;](#toc0_)

The **.loc** attribute is **primarily label based**, but may also be used with a boolean array.

Allowable inputs may be:
- **Scalar:** a single index label. It returns a series if the label is duplicated and a scalar if the label is unique. To get a series in every case, pass the scalar inside a list.
- **Array like:** a list or array of labels. Will return a series object.
- **Slice object:** note that to slice a series with duplicate index labels we first need to sort the index with **.sort_index()**. Slicing with .loc includes both the start and end labels.
- **A boolean array:** of the same length as the series.
- An alignable pandas **Index object**.
- **A callable function:** that returns one of the above.

In [66]:
# scalar as input to .loc
city_rnm.loc["Ferrari"].sample(3)

Ferrari    11
Ferrari     9
Ferrari    16
Name: city08, dtype: int64
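
The return type for a scalar label depends on whether that label is unique. A minimal sketch on a toy series (hypothetical values):

```python
import pandas as pd

s = pd.Series([11, 9, 30], index=["Ferrari", "Ferrari", "Honda"])

unique_hit = s.loc["Honda"]       # unique label -> scalar
dup_hit = s.loc["Ferrari"]        # duplicated label -> Series
always_series = s.loc[["Honda"]]  # list input -> Series, even for one label

print(unique_hit, type(dup_hit).__name__, type(always_series).__name__)
```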

In [67]:
# array/list as input to .loc
city_rnm.loc[["Ferrari", "Honda", "Toyota"]].sample(4)

Toyota    17
Toyota    17
Toyota    17
Honda     15
Name: city08, dtype: int64

In [68]:
# slice object as input to .loc
city_rnm.sort_index().loc["Federal":"Ferrari"]

Federal Coach    15
Federal Coach    13
Federal Coach    13
Federal Coach    14
Federal Coach    13
                 ..
Ferrari          13
Ferrari           8
Ferrari           9
Ferrari          13
Ferrari          10
Name: city08, Length: 243, dtype: int64
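
The reason for calling **.sort_index()** first can be illustrated on a toy series whose duplicated label is not contiguous. The data below is hypothetical; the point is only that the unsorted slice raises a `KeyError` while the sorted one works and includes the end label.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["b", "a", "b", "c"])

try:
    s.loc["a":"b"]
    slice_failed = False
except KeyError:
    slice_failed = True  # "b" is duplicated and its occurrences are not adjacent

sorted_slice = s.sort_index().loc["a":"b"]  # labels a, b, b (end label included)

print(slice_failed, sorted_slice.index.tolist())
```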

In [69]:
# slicing with partial strings
city_rnm.sort_index().loc["F":"H"]

Federal Coach                15
Federal Coach                13
Federal Coach                13
Federal Coach                14
Federal Coach                13
                             ..
Grumman Allied Industries    16
Grumman Olson                30
Grumman Olson                31
Grumman Olson                26
Grumman Olson                27
Name: city08, Length: 6377, dtype: int64

In [70]:
# Boolean array as input to .loc
boolean_mask = city_rnm > 120
city_rnm.loc[boolean_mask].sample(3)

Volkswagen    126
Volkswagen    126
BMW           137
Name: city08, dtype: int64

In [71]:
# Function as input to .loc

# say we estimate that in the coming year, due to regulations, all the vehicles will lose
# 10% of their current mileage, and we want to calculate that mileage from our current data
# and see which cars will still have mpg > 120
city_rnm.mul(0.9).loc[lambda x: x > 120].sample(3)

Tesla    126.0
Scion    124.2
Tesla    122.4
Name: city08, dtype: float64

### <a id='toc9_4_'></a>[*The `.iloc[]` method*](#toc0_) [&#8593;](#toc0_)

The **.iloc** attribute operates on **integer positions and not index labels**. It can also be used with a boolean array.

Allowable inputs may be:
- **Scalar:** an integer position. Returns the value, a scalar, at that position.
- **Array like:** a list or array of integer positions. Will return a series object.
- **Slice object:** the end of the slice is exclusive, i.e., it works the same way list slicing does.
- **A numpy array of booleans (or a python list):** of the same length as the series. Note that it must be a numpy array or a list, not a pandas series of booleans (a boolean series).
- **A callable function:** that returns one of the above.
- **A tuple:** applicable for DataFrame objects. A tuple of row and column positions.

In [72]:
# a boolean array, i.e., a pandas series object of boolean values, must be converted
# to a numpy array or a python list before it can be used with .iloc
mask = city_rnm > 120
city_rnm.iloc[mask.to_numpy()].sample()

smart    122
Name: city08, dtype: int64
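
Since `.loc` and `.iloc` are easy to mix up when the labels are integers, the sketch below contrasts them on toy data. Both series are hypothetical; the integer labels in `s` are deliberately out of step with the positions.

```python
import pandas as pd

# Integer labels that do not match the positions
s = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])

by_label = s.loc[0]      # label 0 is the last row -> 40
by_position = s.iloc[0]  # position 0 is the first row -> 10

t = pd.Series([10, 20, 30, 40], index=list("abcd"))
label_slice = t.loc["a":"c"]   # end label included -> 3 values
position_slice = t.iloc[0:2]   # end position excluded -> 2 values

print(by_label, by_position, len(label_slice), len(position_slice))
```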

### <a id='toc9_5_'></a>[*Filtering Index Labels with `.filter(items, like, regex)`*](#toc0_)

- **items** (passed as a list) is used for exact matches. Note that an exact match with `items` fails when the series has duplicate index labels, but a label that does not exist in the index does not raise an error.
- **like** is used for substring matches.
- **regex** lets you specify a regular expression to match against the index labels.

In [73]:
# items
try:
    city_rnm.filter(items=["Panos"])
except ValueError as err:
    print(err)

cannot reindex on an axis with duplicate labels


In [74]:
# like
city_rnm.filter(like="B")

BMW              14
BMW              14
BMW              11
Buick            21
Buick            17
                 ..
BMW              15
BMW              14
Buick            19
Buick            18
Mercedes-Benz    16
Name: city08, Length: 4344, dtype: int64

In [75]:
# regex for filtering labels that start with A, B, or C
city_rnm.filter(regex="^[A-C].")

Alfa Romeo    19
Audi          17
Audi          17
BMW           14
BMW           14
              ..
Chevrolet     12
Chevrolet     11
Chevrolet     15
Chevrolet     16
Chevrolet     10
Name: city08, Length: 9742, dtype: int64
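
Because `city_rnm` has duplicate labels, the `items` example above only shows the failure case. The sketch below runs all three parameters on a toy series with unique labels (hypothetical values), where `items` works and silently skips a label that is not present.

```python
import pandas as pd

s = pd.Series([14, 21, 17], index=["BMW", "Buick", "Audi"])  # unique labels

exact = s.filter(items=["BMW", "Panos"])  # "Panos" is absent and is skipped
substr = s.filter(like="B")               # labels containing "B"
pattern = s.filter(regex="^A")            # labels starting with "A"

print(exact.index.tolist(), substr.index.tolist(), pattern.index.tolist())
```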

### <a id='toc9_6_'></a>[*Reindexing with `.reindex(index)`*](#toc0_)

`index` is an array-like that defines the new labels the series should conform to. Note that reindexing a series whose own index has duplicate labels will not work. If we pass in labels that are not in the index, it will not throw an error; instead, it inserts missing values for them.

In [76]:
city_mpg.head()

0    19
1     9
2    23
3    10
4    17
Name: city08, dtype: int64

In [77]:
city_mpg.reindex(index=[0, 1, 2, 2, "nan"])

0      19.0
1       9.0
2      23.0
2      23.0
nan     NaN
Name: city08, dtype: float64
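
Two details of `.reindex()` can be sketched on toy data (hypothetical values): the `fill_value` parameter replaces the default `NaN` for new labels, and a series whose own index contains duplicates raises a `ValueError`.

```python
import pandas as pd

s = pd.Series([19, 9, 23], index=["a", "b", "c"])

with_nan = s.reindex(["a", "z"])                 # "z" is new -> NaN
with_zero = s.reindex(["a", "z"], fill_value=0)  # "z" is new -> 0

dupes = pd.Series([1, 2], index=["a", "a"])      # duplicate labels in the source
try:
    dupes.reindex(["a"])
    reindex_failed = False
except ValueError:
    reindex_failed = True

print(with_nan.isna().tolist(), with_zero.tolist(), reindex_failed)
```

Using `fill_value` also avoids the upcast to float that the inserted `NaN` causes in the output above.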

---------------------------------