# Effective Pandas - Notes and Exercises

## Set-up

### Packages

In [1]:
import pandas as pd
import numpy as np

### Data

In [2]:
url = "https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip"
df = pd.read_csv(url, low_memory=False)
city_mpg = df["city08"]
highway_mpg = df["highway08"]

## Chapter 4: Series Introduction

### Exercises

#### Using Jupyter, create a series with the temperature values for the last seven days. Filter out the values below the mean.

In [4]:
temp_values = pd.Series([-3.0, -1.5, 3, 0, 2, 5, -1])
print(temp_values.mean())

0.6428571428571429


In [5]:
temp_values[temp_values<temp_values.mean()]

0   -3.0
1   -1.5
3    0.0
6   -1.0
dtype: float64

#### Using Jupyter, create a series with your favorite colors. Use a categorical type.

In [8]:
fav_colors = pd.Series(["blue", "yellow", "turquoise"], dtype="category")
fav_colors

0         blue
1       yellow
2    turquoise
dtype: category
Categories (3, object): ['blue', 'turquoise', 'yellow']

## Chapter 5: Series Deep Dive

### Notes

We can check out the attributes of a specific object by using ``dir()``.

In [3]:
city_mpg

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

In [4]:
len(dir(city_mpg))

417
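
The list returned by `dir()` also contains private and dunder names. A small sketch, using a made-up three-element series instead of `city_mpg`, that keeps only the public names and reads the documentation of one of them:

```python
import pandas as pd

s = pd.Series([19, 9, 23])  # made-up values, only for illustration

# Keep only the public attributes (names without a leading underscore)
public_attrs = [name for name in dir(s) if not name.startswith("_")]
print(len(public_attrs))

# The docstring is the text Jupyter shows for `s.add?` or `help(s.add)`
first_doc_line = pd.Series.add.__doc__.strip().splitlines()[0]
print(first_doc_line)
```

The exact counts depend on the installed pandas version, so they will probably differ from the 417 shown above.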

### Exercises

#### Explore the documentation for five attributes of a series from Jupyter.

In [11]:
city_mpg.add(-100)

0       -81
1       -91
2       -77
3       -90
4       -83
         ..
41139   -81
41140   -80
41141   -82
41142   -82
41143   -84
Name: city08, Length: 41144, dtype: int64

In [12]:
# Compute correlation between series and itself shifted by i

city_mpg.autocorr(2)

0.29762739792527243

In [13]:
# Compute correlation with another series

city_mpg.corr(highway_mpg)

0.9393654130640566

In [16]:
# apply a function on the series

city_mpg.transform("sqrt")

0        4.358899
1        3.000000
2        4.795832
3        3.162278
4        4.123106
           ...   
41139    4.358899
41140    4.472136
41141    4.242641
41142    4.242641
41143    4.000000
Name: city08, Length: 41144, dtype: float64

In [18]:
# index according to position

city_mpg.take([1, 66, 3987])

1        9
66      16
3987    15
Name: city08, dtype: int64

In [7]:
# Check if Series is monotonically increasing

city_mpg.is_monotonic_increasing  # `.is_monotonic` was removed in pandas 2.0

False

In [77]:
# method that counts number of unique values
print(f"{city_mpg.nunique()}, same as {len(city_mpg.unique())}")

105, same as 105


#### How many attributes are found on the ``.str`` attribute? Look at the documentation for three of them.

In [9]:
len(dir(fav_colors.str))

95

In [10]:
fav_colors.str.startswith("t")

0    False
1    False
2     True
dtype: bool

In [28]:
# Add linebreaks after certain number of characters

for col in fav_colors.str.wrap(7):
    print(col)

blue
yellow
turquoi
se


In [30]:
# Split strings at last occurrence of separator

fav_colors.str.rpartition("u")

      0  1       2
0    bl  u       e
1           yellow
2  turq  u    oise


In [33]:
fav_colors.str.join("-")

0              b-l-u-e
1          y-e-l-l-o-w
2    t-u-r-q-u-o-i-s-e
dtype: object

#### How many attributes are found on the ``.dt`` attribute? Look at the documentation for three of them.

In [45]:
dates = pd.Series(["01.01.2021", "14.04.1987", "26.12.1977"], dtype = "datetime64[ns]")

In [46]:
len(dir(dates.dt))

79

In [49]:
# month property
dates.dt.month

0     1
1     4
2    12
dtype: int64

In [65]:
# method that returns names of weekdays
dates.dt.day_name()

0     Friday
1    Tuesday
2     Monday
dtype: object

In [66]:
# month start property
dates.dt.is_month_start

0     True
1    False
2    False
dtype: bool

In [68]:
# method that returns the corresponding time period
dates.dt.to_period("M")

0    2021-01
1    1987-04
2    1977-12
dtype: period[M]

## Chapter 6: Operators (& Dunder Methods)

### Notes

Python has special methods called **dunder methods**, whose names start and end with a double underscore (e.g. `.__add__`) and which implement operators such as `+`. Pandas Series implement many of these methods and therefore support operations such as addition and division.

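When an operator is applied to two series (e.g. `Series_1 + Series_2`), pandas first aligns them on their index labels. Labels that exist in only one of the series produce `NaN`, and duplicated labels can produce more rows than expected, so the operation works most predictably when the indexes are **unique** and **common** to both series. The operation is then applied to the entire series at once (vectorization), which is very efficient.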

There is also a `.__iter__` method that can be used to loop through a series. However, this is usually not recommended, since there are more efficient ways to do most things.

Pandas also provides **operator methods** (`s1.add(s2)`) alongside the plain Python operators (`s1 + s2`). The methods accept additional parameters, e.g. `fill_value=0`, which is used in place of a value that is missing from one of the two series instead of producing `NaN`. Operator methods can also be chained together (`s1.add(s2, fill_value=0).div(2)`), which is very useful and readable. Basically all common arithmetic and comparison operations are available as pandas methods.
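
A minimal sketch of index alignment and of the `fill_value` parameter. The two small series `s1` and `s2` are made up for illustration and share only the labels `"b"` and `"c"`:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Operator: labels "a" and "d" have no partner, so the result is NaN there
plus = s1 + s2
print(plus)

# Operator method: a missing partner is treated as 0
filled = s1.add(s2, fill_value=0)
print(filled)

# Operator methods can be chained
chained = s1.add(s2, fill_value=0).div(2)
print(chained)
```

With the operator, only the shared labels get a number; with `fill_value=0`, every label that appears in at least one series does.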

### Exercises

With a dataset of your choice:

In [11]:
url = "https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip"
df = pd.read_csv(url, low_memory=False)
city_mpg = df["city08"]
highway_mpg = df["highway08"]

#### Add a numeric series to itself.

In [12]:
city_mpg.head()

0    19
1     9
2    23
3    10
4    17
Name: city08, dtype: int64

In [8]:
city_mpg + city_mpg

0        38
1        18
2        46
3        20
4        34
         ..
41139    38
41140    40
41141    36
41142    36
41143    32
Name: city08, Length: 41144, dtype: int64

#### Add 10 to a numeric series.

In [9]:
city_mpg + 10

0        29
1        19
2        33
3        20
4        27
         ..
41139    29
41140    30
41141    28
41142    28
41143    26
Name: city08, Length: 41144, dtype: int64

#### Add a numeric series to itself using the `.add` method.

In [10]:
city_mpg.add(city_mpg).head()

0    38
1    18
2    46
3    20
4    34
Name: city08, dtype: int64

#### Read the documentation for the `.add` method.

## Chapter 7: Aggregate Methods

### Notes

There are aggregate methods (e.g. `.mean()`) which return a scalar and aggregate properties (e.g. `.is_unique`) which return a boolean.

A couple of useful tricks:
- We can filter with boolean methods (see previous chapter) and then count with `.sum()` or get the percentage with `.mul(100).mean()`.

In [25]:
# How many cars have city mpg greater than 20?
print(city_mpg.gt(20).sum())
print(city_mpg.gt(20).mul(100).mean(), "%")

10272
24.965973167412017 %


- `.quantile()` returns the median (50% quantile) by default, but it also accepts a list of levels and returns a series.

In [35]:
print(city_mpg.quantile())
print("Levels:\n", city_mpg.quantile([0.1, 0.5, 0.9]))

17.0
Levels:
 0.1    13.0
0.5    17.0
0.9    24.0
Name: city08, dtype: float64


- `.agg()` accepts a list of aggregation functions (pandas method names as strings, built-in Python functions, NumPy functions or your own functions) and returns a series.

In [38]:
import numpy as np

def return_second_last(series):
    return series.iloc[-2]

print(city_mpg.agg(["mean", max, np.var, return_second_last]))

mean                   18.369045
max                   150.000000
var                    62.503036
return_second_last     18.000000
Name: city08, dtype: float64


### Exercises

With a dataset of your choice:

In [64]:
url = "https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip"
df = pd.read_csv(url, low_memory=False)
city_mpg = df["city08"]
highway_mpg = df["highway08"]

#### Find the count of non-missing values of a series.

In [65]:
# Count ignores missing values
highway_mpg.count()

41144

In [66]:
# There are none anyway
highway_mpg.isna().sum()

0

In [77]:
highway_mpg.hasnans

False

#### Find the number of entries of a series.

In [67]:
highway_mpg.size

41144

In [68]:
# Add a missing value
highway_mpg_2 = pd.concat([highway_mpg, pd.Series([np.nan])])

In [69]:
# Check count and size again
print(highway_mpg_2.count())
print(highway_mpg_2.size)

41144
41145


#### Find the number of unique entries of a series.

In [70]:
highway_mpg.nunique()

92

#### Find the mean value of a series.

In [71]:
highway_mpg.mean()

24.504666537040638

#### Find the maximum value of a series.

In [72]:
highway_mpg.max()

124

#### Use the `.agg` method to find all of the above.

In [74]:
highway_mpg.agg(["count", "size", "nunique", "mean", max])

count      41144.000000
size       41144.000000
nunique       92.000000
mean          24.504667
max          124.000000
Name: highway08, dtype: float64

## Chapter 8: Conversion Methods

### Notes

It is useful to have control over the type of data in a series. Using the correct type can save a lot of memory. These are the most relevant methods:

- `.convert_dtypes()`: tries to convert data to a type that allows `pd.NA`
- `.astype()`: convert to a specific type
- `.nbytes` (property): gives the amount of memory that the data is using
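- `.memory_usage()`: also counts the memory of the supporting structure of the object, such as the index (called the *make* in the exercises below); with `deep=True` it also counts the contents of `object` values such as strings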
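
A short sketch of how these methods fit together, on made-up data instead of the vehicles dataset (the names `numbers` and `drive` are only for illustration):

```python
import numpy as np
import pandas as pd

numbers = pd.Series(np.arange(1000), dtype="int64")
small = numbers.astype("int16")  # the values 0..999 fit into 16 bits

# .nbytes only counts the data: 8 vs. 2 bytes per value
print(numbers.nbytes, small.nbytes)

# .memory_usage() additionally counts the index
print(numbers.memory_usage() - numbers.nbytes)

# Repetitive strings: compare deep memory usage before and after conversion
drive = pd.Series(["Front-Wheel Drive", "Rear-Wheel Drive", "4-Wheel Drive"] * 1000)
drive_cat = drive.astype("category")
print(drive.memory_usage(deep=True), drive_cat.memory_usage(deep=True))
```

Before converting to a smaller integer type, it is worth checking that all values fit into its range; otherwise they are silently wrapped around or an error is raised, depending on the pandas version.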

Good to know:

- Strings stored as plain strings use much more memory than the same values stored as categories, as long as the values repeat often.
- `.to_numpy()` or `.values` gives a NumPy array, `.to_list()` returns a Python list. NumPy can sometimes speed up the code, while Python lists tend to slow it down. `.to_frame()` turns a Series into a DataFrame.
- For dates, converting with `.astype()` is not recommended. `pd.to_datetime()` is far better.
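
One reason to prefer `pd.to_datetime()` is that it lets you state the date format explicitly, so a string such as `01.02.2021` cannot be read as month-first by accident. A minimal sketch, assuming day-first strings like the ones used in the Chapter 5 exercise:

```python
import pandas as pd

raw = pd.Series(["01.01.2021", "14.04.1987", "26.12.1977"])

# Spell out the format: day.month.year
dates = pd.to_datetime(raw, format="%d.%m.%Y")
print(dates)
print(dates.dt.day_name())
```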

### Exercises

With a dataset of your choice:

In [7]:
url = "https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip"
df = pd.read_csv(url, low_memory=False)
city_mpg = df["city08"]
highway_mpg = df["highway08"]

#### Convert a numeric column to a smaller type.

In [83]:
city_mpg_32 = city_mpg.astype("int32")

#### Calculate the memory savings by converting to smaller numeric types.

In [90]:
print("Memory saved by conversion:", city_mpg.nbytes - city_mpg_32.nbytes)
print("Memory saved by conversion (including make):", 
      city_mpg.memory_usage(deep=True) - city_mpg_32.memory_usage(deep=True))


Memory saved by conversion: 164576
Memory saved by conversion (including make): 164576


#### Convert a string column into a categorical type.

In [96]:
strings_col = df["drive"]
cat_col = strings_col.astype("category")

#### Calculate the memory savings by converting to a categorical type.

In [98]:
print("Memory saved by conversion:", strings_col.nbytes - cat_col.nbytes)
print("Memory saved by conversion (including make):", 
      strings_col.memory_usage(deep=True) - cat_col.memory_usage(deep=True))

Memory saved by conversion: 287952
Memory saved by conversion (including make): 2986383


## Chapter 9: Manipulation Methods

### Notes

- `.apply()`: allows you to apply a function element-wise to every value. Depending on what function it is used with, it can either be very useful or super slow. An example:

In [13]:
%%timeit

def gt20(val):
    return val > 20

city_mpg.apply(gt20)

4.6 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [14]:
%%timeit

city_mpg.gt(20)

77.8 µs ± 8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


**Note:** Using the built-in pandas method directly is **much** faster than calling a plain Python function through `.apply()`!



- `.where()`: keeps the values of the series it is called on where the boolean array is True and replaces all other values with the second parameter (`other`). It can sometimes replace `.apply()`.

In [16]:
top5 = df["make"].value_counts().index[:5]

def generalize_top5(val):
    if val in top5:
        return val
    return "Other"

In [17]:
%%timeit

df["make"].apply(generalize_top5)

13.3 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [18]:
%%timeit

df["make"].where(df["make"].isin(top5), other="Other")

1.32 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


- `.mask()`: this is the complement of the `.where()` method. Where the condition is False, it keeps the original values; if it is True, it replaces the value with the other parameter.

In [20]:
df["make"].mask(df["make"].isin(top5), other="Top5")   # Replaces all the top 5 values

0        Alfa Romeo
1           Ferrari
2              Top5
3              Top5
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object

In [21]:
df["make"].mask(~df["make"].isin(top5), other="Other")   # negate the boolean and we get the same result as with .where

0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Name: make, Length: 41144, dtype: object

Since `.mask()` is the opposite of `.where()`, we can ignore it and just use `.where()` with the required boolean array.

- **if else** logic can be expressed in pandas with `.where()`, but if there are more than two options, multiple `.where()` calls need to be chained together, which can easily get cumbersome. As an alternative, it is possible to use the `select` function from NumPy, which accepts a list of conditions and a list of values in one call. An example:

In [24]:
# we want to keep values in top 5, mark those in top 10 with 'top10' and use 'Other' for the rest

vc = df["make"].value_counts()
top5 = vc.index[:5]
top10 = vc.index[:10]


In [25]:
# with apply - worst option

def generalize(val):
    if val in top5:
        return val
    elif val in top10:
        return "Top10"
    else:
        return "Other"
    
df["make"].apply(generalize)

0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Name: make, Length: 41144, dtype: object

In [26]:
# with pandas method .where()

df["make"].where(df["make"].isin(top5), "Top10").where(df["make"].isin(top10), "Other")

0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Name: make, Length: 41144, dtype: object

In [30]:
# with numpy function select

import numpy as np
pd.Series(np.select([df["make"].isin(top5), df["make"].isin(top10)], 
                    [df["make"], "Top10"], "Other"),
         index=df["make"].index)

0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Length: 41144, dtype: object

- missing values: it is important to check for missing values and try to find out why they are missing.

In [32]:
# cylinders has missing values
df["cylinders"].isna().sum()

206

In [33]:
# check the make in those rows
missing = df["cylinders"].isna()
df["make"].loc[missing]

7138     Nissan
7139     Toyota
8143     Toyota
8144       Ford
8146       Ford
          ...  
34563     Tesla
34564     Tesla
34565     Tesla
34566     Tesla
34567     Tesla
Name: make, Length: 206, dtype: object

**Note**: If we index `.loc[]` with a boolean array, it returns the rows where the boolean array is true.

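- filling in missing data: in the example above it looks like the cylinder information is missing for electric cars. They don't have any cylinders, so we can fill the missing values with 0. Note that the cell below fills with the string `"0"`, which is why the result has dtype `object`; passing the number `0` to `.fillna()` keeps the column numeric.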

In [36]:
df["cylinders"].fillna("0").loc[7136:7141]

7136    6
7137    6
7138    0
7139    0
7140    6
7141    6
Name: cylinders, dtype: object
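
A minimal sketch of the numeric variant, on a made-up series instead of the `cylinders` column:

```python
import numpy as np
import pandas as pd

cyl = pd.Series([6.0, np.nan, 4.0, np.nan])  # made-up values

# Filling with the number 0 keeps the float dtype
filled = cyl.fillna(0)
print(filled.dtype)

# Once no values are missing, the series can be converted to integers
as_int = filled.astype("int64")
print(as_int)
```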

- `.interpolate` can also be used for filling in missing data, if the data is ordered (e.g. time series).

In [5]:
temp = pd.Series([32, 40, None, 42, 39, 32])
temp.interpolate()

0    32.0
1    40.0
2    41.0
3    42.0
4    39.0
5    32.0
dtype: float64

- `.clip` can be used to trim outliers from a Series to be within a specified range.

In [9]:
city_mpg.loc[:446]

0      19
1       9
2      23
3      10
4      17
       ..
442    15
443    15
444    15
445    15
446    31
Name: city08, Length: 447, dtype: int64

In [10]:
city_mpg.loc[:446].clip(
    lower=city_mpg.quantile(0.05), 
    upper=city_mpg.quantile(0.95))

0      19
1      11
2      23
3      11
4      17
       ..
442    15
443    15
444    15
445    15
446    27
Name: city08, Length: 447, dtype: int64

Note that the values outside the range do not get dropped, but rather adjusted to be within the bounds.

- `.sort_values()` sorts values in ascending order and rearranges the index accordingly. Math operations that involve index alignment can still be performed on a sorted series. Sorting makes no difference to the result, because the indices are matched before the calculation is performed. You can use `na_position` to specify where you want NA values to appear.

In [13]:
(city_mpg+highway_mpg) / 2

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64

In [14]:
(city_mpg.sort_values() + highway_mpg) / 2

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64
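
The `na_position` parameter is not shown in the cells above. A small sketch on a made-up series with one missing value:

```python
import pandas as pd

s = pd.Series([3.0, None, 1.0], index=["a", "b", "c"])

na_last = s.sort_values()                      # default: missing values at the end
na_first = s.sort_values(na_position="first")  # missing values at the start
print(na_last)
print(na_first)

# Alignment by label: sorting one operand does not change the sums
total = s.sort_values() + s
print(total)
```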

- `.sort_index()` does the reverse: it sorts the values by their index.

In [15]:
city_mpg.sort_values().sort_index()

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

- `.drop_duplicates()` removes values that appear more than once. With the `keep` parameter you can choose to keep the first occurrence (the default) or the last one, or to drop all duplicated values (`keep=False`).

In [6]:
city_mpg.drop_duplicates()

0         19
1          9
2         23
3         10
4         17
        ... 
34364    127
34409    114
34564    140
34565    115
34566    104
Name: city08, Length: 105, dtype: int64
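
A small sketch of the three `keep` options, on a made-up series in which 1 and 2 appear twice:

```python
import pandas as pd

s = pd.Series([1, 2, 1, 3, 2])

first = s.drop_duplicates()              # keep="first" is the default
last = s.drop_duplicates(keep="last")    # keep the last occurrence instead
unique_only = s.drop_duplicates(keep=False)  # drop every duplicated value

print(first.index.tolist())
print(last.index.tolist())
print(unique_only.tolist())
```

The original index labels are preserved, which makes it easy to see which occurrence survived.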

- You can rank the values in a series with the `.rank()` method. It keeps the original index, but assigns a rank to each value. The ranking method can be controlled with the `method` parameter.

In [9]:
city_mpg.rank()

0        27060.5
1          235.5
2        35830.0
3          607.5
4        19484.0
          ...   
41139    27060.5
41140    29719.5
41141    23528.0
41142    23528.0
41143    15479.0
Name: city08, Length: 41144, dtype: float64
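
The half ranks in the output above come from the default `method="average"`, which gives tied values the mean of their ranks. A short sketch of three of the available methods on a made-up series with one tie:

```python
import pandas as pd

s = pd.Series([10, 20, 10, 30])

average = s.rank()               # ties share the mean of their ranks
lowest = s.rank(method="min")    # ties share the lowest rank
dense = s.rank(method="dense")   # like "min", but without gaps between ranks

print(average.tolist())
print(lowest.tolist())
print(dense.tolist())
```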

- Map values to new values with the `.replace()` method. You can use a dictionary, two lists or just two strings (or regex). 

In [12]:
df["make"].replace(to_replace={"Alfa Romeo":"AR", "Ferrari":"F", "Dodge":"D"})

0            AR
1             F
2             D
3             D
4        Subaru
          ...  
41139    Subaru
41140    Subaru
41141    Subaru
41142    Subaru
41143    Subaru
Name: make, Length: 41144, dtype: object

In [14]:
df["make"].replace(to_replace=r"Alfa (Romeo)", value=r"\1 & Juliet", regex=True)

0        Romeo & Juliet
1               Ferrari
2                 Dodge
3                 Dodge
4                Subaru
              ...      
41139            Subaru
41140            Subaru
41141            Subaru
41142            Subaru
41143            Subaru
Name: make, Length: 41144, dtype: object
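
The two-list and two-string forms mentioned above can be sketched like this, on a made-up series of makes:

```python
import pandas as pd

makes = pd.Series(["Dodge", "Ferrari", "Subaru", "Dodge"])

# Two lists of equal length: the i-th old value maps to the i-th new value
short = makes.replace(["Dodge", "Ferrari"], ["D", "F"])
print(short.tolist())

# Two strings: replace one exact value with another
single = makes.replace("Subaru", "S")
print(single.tolist())
```

Without `regex=True`, only values that match the whole string are replaced, not substrings.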

- The function `pd.cut()` is useful for binning data. Given a number of bins, it returns a new categorical series with that many bins of roughly equal width (the lowest edge is extended slightly so that the minimum value is included). It is also possible to provide the edges of the bins, or to use `pd.qcut()` to bin the values into quantiles (more or less the same number of values in each bin).

In [24]:
city_mpg

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

In [18]:
pd.cut(city_mpg, 10)

0        (5.856, 20.4]
1        (5.856, 20.4]
2         (20.4, 34.8]
3        (5.856, 20.4]
4        (5.856, 20.4]
             ...      
41139    (5.856, 20.4]
41140    (5.856, 20.4]
41141    (5.856, 20.4]
41142    (5.856, 20.4]
41143    (5.856, 20.4]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64]): [(5.856, 20.4] < (20.4, 34.8] < (34.8, 49.2] < (49.2, 63.6] ... (92.4, 106.8] < (106.8, 121.2] < (121.2, 135.6] < (135.6, 150.0]]

In [21]:
pd.qcut(city_mpg, 10)

0         (18.0, 20.0]
1        (5.999, 13.0]
2         (21.0, 24.0]
3        (5.999, 13.0]
4         (16.0, 17.0]
             ...      
41139     (18.0, 20.0]
41140     (18.0, 20.0]
41141     (17.0, 18.0]
41142     (17.0, 18.0]
41143     (15.0, 16.0]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64]): [(5.999, 13.0] < (13.0, 14.0] < (14.0, 15.0] < (15.0, 16.0] ... (18.0, 20.0] < (20.0, 21.0] < (21.0, 24.0] < (24.0, 150.0]]

You can also provide the labels for the bins.

In [25]:
pd.qcut(city_mpg, 4, labels=["bad", "ok", "good", "very good"])

0             good
1              bad
2        very good
3              bad
4               ok
           ...    
41139         good
41140         good
41141         good
41142         good
41143           ok
Name: city08, Length: 41144, dtype: category
Categories (4, object): ['bad' < 'ok' < 'good' < 'very good']
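
Providing the bin edges yourself, as mentioned in the notes, is not shown in the cells above. A minimal sketch with made-up mpg values and arbitrarily chosen edges:

```python
import pandas as pd

mpg = pd.Series([9, 15, 20, 31, 150])

# Edges 0, 15, 30, 150 define the bins (0, 15], (15, 30] and (30, 150]
binned = pd.cut(mpg, bins=[0, 15, 30, 150], labels=["low", "mid", "high"])
print(binned)
```

By default the right edge belongs to the bin, so the value 15 falls into the first bin.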

### Exercises

With a dataset of your choice:

1) Create a series from a numeric column that has the value of 'high' if it is equal to or above the mean and 'low' if it is below the mean using `.apply`.

In [7]:
%time
city_mpg.apply(lambda x: "high" if x >= city_mpg.mean() else "low")

CPU times: user 1e+03 ns, sys: 1e+03 ns, total: 2 µs
Wall time: 5.25 µs


0        high
1         low
2        high
3         low
4         low
         ... 
41139    high
41140    high
41141     low
41142     low
41143     low
Name: city08, Length: 41144, dtype: object

2) Create a series from a numeric column that has the value of 'high' if it is equal to or above the mean and 'low' if it is below the mean using `np.select`.

In [6]:
%time
pd.Series(np.select([city_mpg >= city_mpg.mean()], 
                    ["high"], 
                    default="low"))

CPU times: user 1 µs, sys: 0 ns, total: 1 µs
Wall time: 2.86 µs


0        high
1         low
2        high
3         low
4         low
         ... 
41139    high
41140    high
41141     low
41142     low
41143     low
Length: 41144, dtype: object

3) Time the differences between the previous two solutions to see which is faster.

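The timings reported above (5.25 µs with `.apply()` vs. 2.86 µs with `np.select()`) are not meaningful: `%time` on a line of its own times an empty statement, not the code on the following lines. To time a whole cell, use the `%%time` or `%%timeit` cell magic as the first line of the cell, as in the notes above. The `np.select()` version should be considerably faster, because `.apply()` calls a Python function for every element (and here also recomputes `city_mpg.mean()` each time).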
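
One way to measure both solutions without Jupyter magics is the standard-library `timeit` module. A sketch on synthetic data; the series `values` and the helper names `with_apply` and `with_select` are made up for illustration, and the timings depend on the machine:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.integers(6, 60, size=10_000))
mean = values.mean()  # computed once, not once per element


def with_apply():
    # Calls the Python lambda once per element
    return values.apply(lambda x: "high" if x >= mean else "low")


def with_select():
    # Evaluates the condition for the whole series at once
    return pd.Series(np.select([values >= mean], ["high"], default="low"),
                     index=values.index)


# Total time in seconds for 5 runs of each function
t_apply = timeit.timeit(with_apply, number=5)
t_select = timeit.timeit(with_select, number=5)
print(f"apply:  {t_apply:.4f} s")
print(f"select: {t_select:.4f} s")
```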

4) Replace the missing values of a numeric series with the median value.

In [42]:
random_series = pd.Series([np.nan] + np.random.rand(20).tolist() +  [np.nan])
random_series.fillna(random_series.median())

0     0.590157
1     0.983597
2     0.647546
3     0.767996
4     0.232433
5     0.953680
6     0.454912
7     0.461545
8     0.238809
9     0.400473
10    0.582743
11    0.551858
12    0.881041
13    0.009166
14    0.597571
15    0.683640
16    0.961846
17    0.023470
18    0.790773
19    0.450391
20    0.765048
21    0.590157
dtype: float64

5) Clip the values of a numeric series to between 10th and 90th percentiles.

In [57]:
city_mpg.clip(lower=city_mpg.quantile(0.1), upper=city_mpg.quantile(0.9))

0        19
1        13
2        23
3        13
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

6) Using a categorical column, replace any value that is not in the top 5 most frequent values with 'Other'.

In [69]:
top5_fuel_types = df["fuelType"].value_counts().index[:5]
df["fuelType"].where(df["fuelType"].isin(top5_fuel_types), "Other")

0        Regular
1        Regular
2        Regular
3        Regular
4        Premium
          ...   
41139    Regular
41140    Regular
41141    Regular
41142    Regular
41143    Premium
Name: fuelType, Length: 41144, dtype: object

7) Using a categorical column, replace any value that is not in the top 10 most frequent values with 'Other'.

In [76]:
top10_models = df["model"].value_counts().index[:10]
df["model"].where(df["model"].isin(top10_models), "Other")

0        Other
1        Other
2        Other
3        Other
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Name: model, Length: 41144, dtype: object

8) Make a function that takes a categorical series and a number (n) and returns a series in which any value that is not in the top n most frequent values is replaced with 'Other'.

In [80]:
def replace_small_categories(cat_series, n):
    
    top_n_cats = cat_series.value_counts().index[:n]
    new_cat_series = cat_series.where(cat_series.isin(top_n_cats), "Other")
    
    return new_cat_series

replace_small_categories(df["fuelType"], 3).value_counts()

Regular            26447
Premium            11542
Other               1838
Gasoline or E85     1317
Name: fuelType, dtype: int64

9) Using a numeric column, bin it into 10 groups that have the same width.

In [83]:
pd.cut(city_mpg, 10)

0        (5.856, 20.4]
1        (5.856, 20.4]
2         (20.4, 34.8]
3        (5.856, 20.4]
4        (5.856, 20.4]
             ...      
41139    (5.856, 20.4]
41140    (5.856, 20.4]
41141    (5.856, 20.4]
41142    (5.856, 20.4]
41143    (5.856, 20.4]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64]): [(5.856, 20.4] < (20.4, 34.8] < (34.8, 49.2] < (49.2, 63.6] ... (92.4, 106.8] < (106.8, 121.2] < (121.2, 135.6] < (135.6, 150.0]]

10) Using a numeric column, bin it into 10 groups that have equal sized bins. 

In [84]:
pd.qcut(city_mpg, 10)

0         (18.0, 20.0]
1        (5.999, 13.0]
2         (21.0, 24.0]
3        (5.999, 13.0]
4         (16.0, 17.0]
             ...      
41139     (18.0, 20.0]
41140     (18.0, 20.0]
41141     (17.0, 18.0]
41142     (17.0, 18.0]
41143     (15.0, 16.0]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64]): [(5.999, 13.0] < (13.0, 14.0] < (14.0, 15.0] < (15.0, 16.0] ... (18.0, 20.0] < (20.0, 21.0] < (21.0, 24.0] < (24.0, 150.0]]

## Chapter 10: Indexing Operations

### Notes

Main topics:
- changing the index
- accessing parts of a series using `[]`, `.loc[]` and `.iloc[]`
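
A first sketch of these topics, on a made-up series with string labels (the details are an assumption about what the chapter covers, not a summary of it):

```python
import pandas as pd

s = pd.Series([19, 9, 23], index=["a", "b", "c"])

by_label = s.loc["a":"b"]    # label slice: the end label is included
by_position = s.iloc[0:1]    # position slice: the end position is excluded
single = s["c"]              # plain [] with a label returns a scalar

# Changing the index: rename individual labels with a dictionary
renamed = s.rename({"a": "x"})

print(by_label.tolist(), by_position.tolist(), single)
print(renamed.index.tolist())
```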
    


### Exercises