**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
  - [Importing the data](#toc1_2_)    
- [Manipulation methods](#toc2_)    
    - [*The .`apply(func)` method: Applying a custom function to every element of a Series (also works on DataFrames)*](#toc2_1_1_)    
    - [*The ``.where(cond, other)`` method: Replace values where the condition is false (with the value of 'other')*](#toc2_1_2_)    
    - [*The `.clip(lower, upper)` method for handling **outliers***](#toc2_1_3_)    
    - [*Binning Data: the `pd.cut(array_like, bins, labels, include_lowest=False)` and the `pd.qcut()` methods*](#toc2_1_4_)    
    - [Some other useful methods](#toc2_1_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=5
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

**Read the official documentation on pandas Series objects @ https://pandas.pydata.org/pandas-docs/stable/reference/series.html**

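**`Note:`** Many Python built-ins work on pandas Series objects, e.g., **len, type, dir, sum, sorted, max, min**. The `in` operator also works, but it tests membership in the index labels rather than in the values. Reductions such as the mean or the product are available as Series methods (`.mean()`, `.prod()`).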

Also, **chaining methods** in pandas works the same way as in plain Python: most methods return a new object, so calls can be strung together.
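
As a small illustration of the two notes above, here is a sketch on a made-up three-element Series (the data and variable names are assumptions for the example, not part of the vehicles dataset):

```python
import pandas as pd

# Hypothetical toy Series, not taken from the vehicles dataset
s = pd.Series([3, 1, 2], index=["a", "b", "c"])

n_items = len(s)      # number of elements
total = sum(s)        # built-in sum iterates over the values
largest = max(s)      # built-in max also iterates over the values
ordered = sorted(s)   # returns a plain Python list

# `in` tests the index labels, not the values
label_found = "a" in s
value_found = 3 in s               # False: 3 is a value, not an index label
value_found_fixed = 3 in s.values  # True: check the underlying values instead

# Method chaining: each call returns a new object that the next call works on
chained = s.sort_values().head(2).sum()
```

The chained expression sorts the values, keeps the two smallest, and adds them up in one readable line.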

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
# import statements
import numpy as np
import pandas as pd

### <a id='toc1_2_'></a>[Importing the data](#toc0_)

One of the datasets we will use for the examples in this notebook is `Data/vehicles.csv.zip`.

In [2]:
# read the vehicles.csv dataset
df = pd.read_csv("Data/vehicles.csv.zip")


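Columns of a DataFrame can be accessed in several ways. One of them is the **dot (`.`) notation**, which works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.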

In [3]:
# the city08 and highway08 columns of the vehicles.csv dataset give the miles per gallon
# for city and highway driving, respectively.
city_mpg = df.city08
highway_mpg = df.highway08

In [4]:
# The make column in the vehicles dataset holds the manufacturer name (strings) and is stored as an object.
manufac = df.make

**Note:** The first thing we should do after loading a dataset is check the datatype of each column and, where possible, cast it to a more suitable one. This saves memory and speeds up code execution.
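
A minimal sketch of that check-and-cast step, using a made-up four-row frame (the `toy` frame, its column names, and the chosen target dtypes are assumptions for illustration; suitable dtypes for the real vehicles data depend on its actual value ranges):

```python
import pandas as pd

# Hypothetical toy frame standing in for a freshly loaded dataset
toy = pd.DataFrame(
    {
        "mpg": [19, 9, 23, 10],                            # small integers
        "make": ["Dodge", "Ferrari", "Dodge", "Subaru"],   # few distinct strings
    }
)

print(toy.dtypes)  # inspect what pandas inferred on load

# Cast to a narrower integer type and to a categorical type
toy_small = toy.astype({"mpg": "int8", "make": "category"})

# Compare the memory footprint in bytes before and after the cast
before = toy.memory_usage(deep=True).sum()
after = toy_small.memory_usage(deep=True).sum()
print(before, after)
```

The savings from `category` tend to grow with the number of rows, since each distinct string is stored only once.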

--------------------------

## <a id='toc2_'></a>[Manipulation methods](#toc0_)

----------------------------------

When working on a dataset, the **most used** methods are usually manipulation methods of some kind. These are especially useful when we are cleaning up or exploring the dataset.

#### <a id='toc2_1_1_'></a>[*The .`apply(func)` method: Applying a custom function to every element of a Series (also works on DataFrames)*](#toc0_)

The `apply(func)` method calls the `func` function on every element of a Series.

**It is usually not wise to use** because it makes a Python-level function call for each element, which can increase computation time dramatically. pandas already provides predefined methods and functions for almost every manipulation we might want. If no suitable method exists, we can define our own function and pass it to `apply()`. **Note that** we pass only the name of the function; we do not call it.
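
To make the "pass the name, do not call it" point concrete, here is a sketch with a made-up Series and a hypothetical helper named `shout` (both are assumptions for the example):

```python
import pandas as pd

# Hypothetical toy Series of manufacturer names
makes = pd.Series(["Dodge", "Ferrari", "Dodge", "Subaru"])


def shout(name):
    """Return the name in upper case (illustrative helper)."""
    return name.upper()


# Pass the function object itself: `shout`, not `shout()`
applied = makes.apply(shout)

# The same result with a built-in vectorized string method
vectorized = makes.str.upper()
```

When a built-in equivalent such as `.str.upper()` exists, it is generally the better choice over `apply()`.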

In [5]:
# let's say, we only want to keep the top 5 manufacturers and replace other values with "Other" in the manufac Series

In [6]:
# defining the custom manipulation function
top_5_manufac = manufac.value_counts().index[0:5]


def custom_manipulation(val):
    if val in top_5_manufac:
        return val
    else:
        return "Other"

In [7]:
%%timeit # cell magics need to go on the first line of the cell

# applying the custom_manipulation function on manufac Series
manufac.apply(custom_manipulation)

49.8 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [8]:
cust_manufac = manufac.apply(custom_manipulation)

In [9]:
cust_manufac.value_counts()

make
Other        26622
Chevrolet     4003
Ford          3371
Dodge         2583
GMC           2494
Toyota        2071
Name: count, dtype: int64

#### <a id='toc2_1_2_'></a>[*The ``.where(cond, other)`` method: Replace values where the condition is false (with the value of 'other')*](#toc0_)

This method **keeps values where the condition is True and replaces values where the condition is False with the corresponding value from `other`**.

In [10]:
# if we were to do the same thing as the above example with where() method

In [11]:
%%timeit
# Series.isin(values) checks whether each element of the Series is contained in `values`
manufac.where(cond=manufac.isin(top_5_manufac), other="Other")

2.39 ms ± 388 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


As we can see, this is roughly 20 times faster than the `apply()` approach (2.39 ms vs. 49.8 ms per loop).

- The `mask(cond, other)` method **replaces values where the condition is True with corresponding value from 'other'**. It is equivalent to: **where(~cond, other)**
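
The relationship between `where()` and `mask()` can be sketched on a made-up Series (the values and the threshold of 10 are assumptions for the example):

```python
import pandas as pd

# Hypothetical toy Series
s = pd.Series([5, 12, 7, 20])
cond = s > 10

kept_large = s.where(cond, other=0)      # keep values > 10, replace the rest with 0
hid_large = s.mask(cond, other=0)        # replace values > 10 with 0, keep the rest
same_as_mask = s.where(~cond, other=0)   # inverting the condition reproduces mask()

print(kept_large.tolist())  # [0, 12, 0, 20]
print(hid_large.tolist())   # [5, 0, 7, 0]
```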

#### <a id='toc2_1_3_'></a>[*The `.clip(lower, upper)` method for handling **outliers***](#toc0_)

This will **replace all values above the upper threshold with the upper value and all values below the lower threshold with the lower value.**

Clipping is handy when the data contains outliers. In the city_mpg Series the values range from 6 to 150, but only 16 vehicles have city_mpg > 130 and only 39 have city_mpg < 8, out of 41144 vehicles in total. If we want to cap those extreme entries, the clip method does it in a single call. The same goes for clipping the values to the range between the 5% and 95% quantiles.
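
Before applying it to the real data, the behaviour can be seen on a made-up four-element Series (the values and bounds are assumptions for the example). Either bound can also be given on its own:

```python
import pandas as pd

# Hypothetical toy Series with one low and one high extreme
mpg = pd.Series([6, 15, 22, 150])

both = mpg.clip(lower=8, upper=120)   # cap both ends
floor_only = mpg.clip(lower=8)        # only raise values below 8

print(both.tolist())        # [8, 15, 22, 120]
print(floor_only.tolist())  # [8, 15, 22, 150]
```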

In [12]:
# in the first case
city_mpg_clip = city_mpg.clip(lower=8, upper=120)

# see that the values were replaced not dropped
print(len(city_mpg) == len(city_mpg_clip))

# Min and Max values of the city_mpg_clip
city_mpg_clip.agg(["min", "max"])

True


min      8
max    120
Name: city08, dtype: int64

In [13]:
# in the second case
quantile_5 = city_mpg.quantile(0.05)
quantile_95 = city_mpg.quantile(0.95)

city_mpg_clip_quantile = city_mpg.clip(lower=quantile_5, upper=quantile_95)

print("5% Quantile: ", quantile_5, "\n95% Quantile: ", quantile_95)

city_mpg_clip_quantile.agg(["min", "max"])

5% Quantile:  11.0 
95% Quantile:  27.0


min    11
max    27
Name: city08, dtype: int64

#### <a id='toc2_1_4_'></a>[*Binning Data: the `pd.cut(array_like, bins, labels, include_lowest=False)` and the `pd.qcut()` methods*](#toc0_)

As the name suggests, we can **categorize** our data values **into bins**. By default the bins are half-open intervals, with the lower limit excluded and the upper limit included.

By default, **cut()** names each category after its interval, but we can supply our own bin names with the `labels` argument. **qcut()** builds the bins from quantile values so that all bins hold roughly the same number of observations.

For `cut()`, we can pass either the number of bins we want (an integer, which gives equal-width bins) or the bin edges as a list.
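
The interval rules are easiest to see on a tiny made-up Series whose values sit exactly on the bin edges (the data, edges, and labels below are assumptions for the example):

```python
import pandas as pd

# Hypothetical toy Series with values on the bin edges
values = pd.Series([0, 5, 10, 15, 20])

# Default: right-closed intervals (0, 10] and (10, 20]; 0 falls outside and becomes NaN
default_bins = pd.cut(values, bins=[0, 10, 20])

# include_lowest=True makes the first interval include its left edge
with_lowest = pd.cut(values, bins=[0, 10, 20], include_lowest=True)

# right=False flips to left-closed intervals [0, 10) and [10, 20); now 20 falls outside
left_closed = pd.cut(values, bins=[0, 10, 20], right=False)

# qcut derives the edges from quantiles; here the median (10) splits the data in two
halves = pd.qcut(values, q=2, labels=["low", "high"])

print(default_bins.isna().tolist())  # [True, False, False, False, False]
print(halves.tolist())               # ['low', 'low', 'low', 'high', 'high']
```

Values that fall outside every bin are not dropped; they come back as missing, so it is worth checking for NaN after binning.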

In [14]:
manufac_val_count = manufac.value_counts()

In [15]:
# defining bins by edges
pd.cut(
    manufac_val_count,
    [0, 500, 1000, 2000, 3000, 5000],
    labels=[
        "Manufacturer for less than 500 cars",
        "Manufacturer for more than 500 but less than 1000 cars",
        "Manufacturer for more than 1000 but less than 2000 cars",
        "Manufacturer for more than 2000 but less than 3000 cars",
        "Manufacturer for more than 3000 cars",
    ],
)

make
Chevrolet                                   Manufacturer for more than 3000 cars
Ford                                        Manufacturer for more than 3000 cars
Dodge                          Manufacturer for more than 2000 but less than ...
GMC                            Manufacturer for more than 2000 but less than ...
Toyota                         Manufacturer for more than 2000 but less than ...
                                                     ...                        
Volga Associated Automobile                  Manufacturer for less than 500 cars
Panos                                        Manufacturer for less than 500 cars
Mahindra                                     Manufacturer for less than 500 cars
Excalibur Autos                              Manufacturer for less than 500 cars
London Coach Co Inc                          Manufacturer for less than 500 cars
Name: count, Length: 136, dtype: category
Categories (5, object): ['Manufacturer for less than 500 cars'

In [16]:
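# binning the data into up to 10 quantile-based groups with roughly equal counts;
# duplicates="drop" drops repeated bin edges, so fewer than 10 bins may remain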
pd.qcut(manufac_val_count, 10, duplicates="drop")

make
Chevrolet                      (965.0, 4003.0]
Ford                           (965.0, 4003.0]
Dodge                          (965.0, 4003.0]
GMC                            (965.0, 4003.0]
Toyota                         (965.0, 4003.0]
                                    ...       
Volga Associated Automobile       (0.999, 2.0]
Panos                             (0.999, 2.0]
Mahindra                          (0.999, 2.0]
Excalibur Autos                   (0.999, 2.0]
London Coach Co Inc               (0.999, 2.0]
Name: count, Length: 136, dtype: category
Categories (9, interval[float64, right]): [(0.999, 2.0] < (2.0, 3.0] < (3.0, 5.0] < (5.0, 13.0] ... (56.0, 151.5] < (151.5, 469.0] < (469.0, 965.0] < (965.0, 4003.0]]

#### <a id='toc2_1_5_'></a>[Some other useful methods](#toc0_)

- The `sort_values()` method

This method will **sort the values (in ascending order by default); each value keeps its original index label.**

**Note that** because every value keeps its index label, operations between Series still align on those labels, so we can do math (and many other operations) on a sorted Series.
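
A short sketch of that alignment, using two made-up Series that share the same labels (names and values are assumptions for the example):

```python
import pandas as pd

# Hypothetical toy Series sharing the same index labels
a = pd.Series([30, 10, 20], index=["x", "y", "z"])
b = pd.Series([1, 2, 3], index=["x", "y", "z"])

a_sorted = a.sort_values()  # order becomes y, z, x; labels travel with the values

# Addition matches on labels, not on positions
total = a_sorted + b

print(total["x"], total["y"], total["z"])  # 31 12 23
```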

In [17]:
highway_mpg.sort_values().head()

23231    9
1979     9
26858    9
1990     9
23176    9
Name: highway08, dtype: int64

- The `sort_index()` method

By default this method **will sort the indexes in ascending order**.

In [18]:
manufac.sort_values().sort_index(ascending=False).head()

41143    Subaru
41142    Subaru
41141    Subaru
41140    Subaru
41139    Subaru
Name: make, dtype: object

- The `drop_duplicates()` method 

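This will drop the rows with duplicate values. This method has an argument called **keep**. By default it is set to `"first"`, which keeps the first occurrence of each value; `"last"` keeps the last occurrence instead. If set to `False` (the boolean, not a string), it removes every duplicated value, including the initial occurrence.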
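
The three `keep` options can be compared side by side on a made-up Series (the data is an assumption for the example):

```python
import pandas as pd

# Hypothetical toy Series with one repeated manufacturer
makes = pd.Series(["Dodge", "Ferrari", "Dodge", "Subaru", "Dodge"])

first_kept = makes.drop_duplicates()             # keep="first" is the default
last_kept = makes.drop_duplicates(keep="last")   # keep the last occurrence instead
singletons = makes.drop_duplicates(keep=False)   # drop every value that repeats

print(first_kept.index.tolist())  # [0, 1, 3]
print(last_kept.index.tolist())   # [1, 3, 4]
print(singletons.tolist())        # ['Ferrari', 'Subaru']
```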

In [19]:
# manufacturers who have only one car listed on the data series
manufac.drop_duplicates(keep=False)

602                      E. P. Dutton, Inc.
1790                 Panoz Auto-Development
6266                                  Qvale
11033                Lambda Control Systems
13553                   London Coach Co Inc
16606                                Shelby
19027         Import Foreign Auto Sales Inc
19352    S and S Coach Company  E.p. Dutton
19353      Superior Coaches Div E.p. Dutton
19670                   Vixen Motor Company
20657           Volga Associated Automobile
20881                                 Panos
21147                           London Taxi
21443                       Excalibur Autos
23115                              Mahindra
24738                                Fisker
25795                      ASC Incorporated
32341                                 Karma
32788                            Koenigsegg
34303                       Aurora Cars Ltd
34386                        RUF Automobile
35193                   JBA Motorcars, Inc.
35753             Grumman Allied

- Ranking the Values, the `rank(axis=0, method='average', ascending=True)` method

In [20]:
# The `method` argument defines how to rank records that have the same value (ties).
# Available ranking methods are 'average', 'min', 'max', 'first', 'dense'. default = 'average'.

In [21]:
# Ranking based on city_mpg
city_mpg.rank(method="min").sort_values().tail(6)

25615    41139.0
34563    41139.0
34564    41141.0
32599    41142.0
31256    41142.0
33423    41142.0
Name: city08, dtype: float64
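
To see how the tie-breaking methods differ, here is a sketch on a made-up Series with a single tie (the data is an assumption for the example):

```python
import pandas as pd

# Hypothetical toy Series: the two 20s are tied
scores = pd.Series([10, 20, 20, 30])

# One column of ranks per tie-breaking method
ranks = pd.DataFrame(
    {
        method: scores.rank(method=method)
        for method in ["average", "min", "max", "first", "dense"]
    }
)

print(ranks)
# average -> 1.0, 2.5, 2.5, 4.0   (ties share the mean of their positions)
# min     -> 1.0, 2.0, 2.0, 4.0   (ties take the lowest position)
# max     -> 1.0, 3.0, 3.0, 4.0   (ties take the highest position)
# first   -> 1.0, 2.0, 3.0, 4.0   (ties ranked in order of appearance)
# dense   -> 1.0, 2.0, 2.0, 3.0   (like min, but no gaps after a tie)
```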

- The `replace(to_replace, value, regex=False)` method

Replace values given in `to_replace` with `value`. This differs from updating with `.loc` or `.iloc`, which require you to specify a location to update with some value.

This is a very **versatile and useful** method. To see its full functionality and usage examples, refer to the docs.
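
A few of the forms `to_replace` can take, sketched on a made-up Series (the data and the replacement strings are assumptions for the example):

```python
import pandas as pd

# Hypothetical toy Series of manufacturer names
makes = pd.Series(["Dodge", "Ferrari", "Subaru", "GMC"])

# A single value replaced by another value
single = makes.replace("Subaru", "Subaru!!")

# A dict maps each old value to its own new value
mapped = makes.replace({"Dodge": "Dodge (US)", "GMC": "GMC (US)"})

# With regex=True, to_replace is treated as a regular expression
pattern = makes.replace(r"^F.*", "Italian make", regex=True)

print(mapped.tolist())   # ['Dodge (US)', 'Ferrari', 'Subaru', 'GMC (US)']
print(pattern.tolist())  # ['Dodge', 'Italian make', 'Subaru', 'GMC']
```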

In [22]:
manufac.replace("Subaru", "Subaru!!")

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4          Subaru!!
            ...    
41139      Subaru!!
41140      Subaru!!
41141      Subaru!!
41142      Subaru!!
41143      Subaru!!
Name: make, Length: 41144, dtype: object