In [1]:
import pandas as pd
import numpy as np

In [2]:
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df = pd.read_csv(url)
city_mpg = df.city08
highway_mpg = df.highway08


# Manipulation Methods
- "workhorses of pandas"
- Useful for understanding, cleaning up, and modelling data: these methods operate on a series and return a new series, which you can stick back into the dataframe.
- These methods manipulate the series values but preserve the index.
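
To make the index-preservation point concrete, here is a minimal sketch on a hypothetical toy dataframe (the frame, its labels, and the `mpg_clipped` column name are made up for illustration). It uses `.clip` as a stand-in for any manipulation method:

```python
import pandas as pd

# Hypothetical toy frame; names and values are illustrative only.
toy = pd.DataFrame({'mpg': [19, 9, 23]}, index=['a', 'b', 'c'])

# .clip returns a new Series that carries the same index labels as its input ...
clipped = toy['mpg'].clip(lower=10, upper=20)

# ... so it lines up row for row when stored back as a new column.
toy = toy.assign(mpg_clipped=clipped)
print(toy)
```

Because the labels match, no manual alignment is needed before assigning the result.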

## `.apply` and `.where`:
- `.apply` is a curious method that should be avoided in most cases.
    - It allows you to apply a function element-wise to every value.
    - If you pass in a NumPy function that works on an array, it will broadcast the operation to the series.
    - `.apply` is slow because it operates on each individual value in a series: the function is called once for every value, so a series with many values means many calls.
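
The once-per-value behaviour can be observed directly. The sketch below uses a hypothetical three-element series and a logging variant of `gt20` (both made up for illustration) to count how often the Python function runs, and compares the result with the vectorized `.gt`:

```python
import pandas as pd

s = pd.Series([15, 20, 25])

calls = []  # records every value the Python function is called with

def gt20_logged(val):
    # Hypothetical variant of gt20 that logs each call.
    calls.append(val)
    return val > 20

by_apply = s.apply(gt20_logged)   # one Python-level call per value
vectorized = s.gt(20)             # a single vectorized comparison

print(len(calls))                 # how many times gt20_logged ran
print(by_apply.equals(vectorized))
```

Both approaches give the same booleans; only the number of Python function calls differs.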
    
  

In [3]:
def gt20(val):
    return val > 20

In [4]:
%%timeit
city_mpg.apply(gt20)

6.2 ms ± 74.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [5]:
%%timeit
city_mpg.gt(20)

92.5 µs ± 97.1 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


We want to limit the make column to the top five makes and label everything else as Other. To do that, we can first use the `.value_counts` method to get the frequencies and then replace every make outside the top five:

In [6]:
make = df.make

In [7]:
make

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object

In [8]:
make.value_counts()

make
Chevrolet                      4003
Ford                           3371
Dodge                          2583
GMC                            2494
Toyota                         2071
                               ... 
Volga Associated Automobile       1
Panos                             1
Mahindra                          1
Excalibur Autos                   1
London Coach Co Inc               1
Name: count, Length: 136, dtype: int64

Using `.apply` to replace everything that's not in the top 5:

In [9]:
top5 = make.value_counts().index[:5]
def generalize_top5(val):
    if val in top5:
        return val
    return 'Other'

In [10]:
make.apply(generalize_top5)

0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Name: make, Length: 41144, dtype: object

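The `.where` method takes a *boolean array* that marks where a condition is true. It keeps the values where the condition is `True` and replaces the rest with the `other` parameter.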

In [11]:
make.where(make.isin(top5), other='Other')

0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Name: make, Length: 41144, dtype: object

In [12]:
%%timeit
make.where(make.isin(top5), other='Other')

1.96 ms ± 93.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [13]:
%%timeit
make.apply(generalize_top5)

39.7 ms ± 643 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


The complement of the `.where` method is the `.mask` method. Wherever the condition is `False` it keeps the original values; where it is `True` it replaces the value with the `other` parameter:

In [15]:
make.mask(make.isin(top5), other='Other')

0        Alfa Romeo
1           Ferrari
2             Other
3             Other
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object

In [16]:
make.mask(~make.isin(top5), other='Other')

0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Name: make, Length: 41144, dtype: object

The tilde (`~`) inverts the boolean array, switching every `True` to `False` and vice versa.
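
As a small self-contained check of that relationship, the sketch below uses a hypothetical four-element series (the values are made up) to show that `.where(cond)` and `.mask(~cond)` produce the same result:

```python
import pandas as pd

# Hypothetical toy data, not taken from the vehicles dataset.
toy_makes = pd.Series(['Ford', 'Saab', 'Dodge', 'Yugo'])
keep = toy_makes.isin(['Ford', 'Dodge'])   # boolean Series

via_where = toy_makes.where(keep, other='Other')   # keep where True
via_mask = toy_makes.mask(~keep, other='Other')    # replace where True

print(via_where.tolist())
print(via_where.equals(via_mask))
```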

### Missing Data:
- Filling in missing data is important because many machine learning algorithms do not work if there is missing data.

The *cylinders* column has missing values:


In [18]:
cyl = df.cylinders
cyl.isna().sum()  # .isna returns booleans: True for missing cells, False otherwise; .sum counts the True values

206

In [19]:
missing = cyl.isna()
make.loc[missing]

7138     Nissan
7139     Toyota
8143     Toyota
8144       Ford
8146       Ford
          ...  
34563     Tesla
34564     Tesla
34565     Tesla
34566     Tesla
34567     Tesla
Name: make, Length: 206, dtype: object

In [20]:
cyl[cyl.isna()] 

7138    NaN
7139    NaN
8143    NaN
8144    NaN
8146    NaN
         ..
34563   NaN
34564   NaN
34565   NaN
34566   NaN
34567   NaN
Name: cylinders, Length: 206, dtype: float64

In [21]:
cyl.fillna(0).loc[7136:7141]

7136    6.0
7137    6.0
7138    0.0
7139    0.0
7140    6.0
7141    6.0
Name: cylinders, dtype: float64
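
The same steps can be traced on a hypothetical four-element series (the values are made up), which makes it easier to see what `.isna`, `.sum`, and `.fillna` each return:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cylinders column.
cyl_toy = pd.Series([6.0, np.nan, 4.0, np.nan])

missing = cyl_toy.isna()      # True where the value is missing
n_missing = missing.sum()     # each True counts as 1
filled = cyl_toy.fillna(0)    # new Series; cyl_toy itself is unchanged

print(missing.tolist())
print(n_missing)
print(filled.tolist())
```

Whether 0 is a sensible fill value depends on the column; it is used here only because the cell above uses it.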

Another option is `.interpolate`, which fills each missing value based on its neighbouring values (linearly by default):

In [22]:
temp = pd.Series([32, 40, None, 42, 39, 32])
temp

0    32.0
1    40.0
2     NaN
3    42.0
4    39.0
5    32.0
dtype: float64

In [23]:
temp.interpolate()

0    32.0
1    40.0
2    41.0
3    42.0
4    39.0
5    32.0
dtype: float64
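
In the output above, the gap at position 2 became 41.0, halfway between 40 and 42. A short sketch with a hypothetical series that has two consecutive gaps shows that the default linear method spaces the filled values evenly between the known neighbours:

```python
import pandas as pd

# Hypothetical readings with two consecutive missing values.
temp2 = pd.Series([32, None, None, 38])

# Default method is linear: the gap is split into equal steps.
filled = temp2.interpolate()
print(filled.tolist())
```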

To clip outliers in the data, use the `.clip` method.

In [27]:
city_mpg.loc[:446]

0      19
1       9
2      23
3      10
4      17
       ..
442    15
443    15
444    15
445    15
446    31
Name: city08, Length: 447, dtype: int64

The first 447 values (index labels 0 through 446) range from 9 to 31 in city mileage. We can trim the values to lie between the 5th and 95th percentiles of the full series with:

In [28]:
(city_mpg
    .loc[:446]
    .clip(lower=city_mpg.quantile(.05),
          upper=city_mpg.quantile(.95))
          )

0      19
1      11
2      23
3      11
4      17
       ..
442    15
443    15
444    15
445    15
446    27
Name: city08, Length: 447, dtype: int64

- This changed entry 446 from 31 to 27 (clipped it to the upper bound); entries 1 and 3 were raised from 9 and 10 to the lower bound of 11.
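
A minimal sketch of the same behaviour on a hypothetical five-element series follows. The fixed bounds 11 and 27 are an assumption, chosen to match what the output above suggests the two quantiles evaluate to:

```python
import pandas as pd

# Hypothetical mileage values; the bounds are assumed, not computed.
mpg_toy = pd.Series([9, 10, 17, 23, 31])

# Values below 11 are raised to 11, values above 27 are lowered to 27,
# and everything in between is left alone.
clipped = mpg_toy.clip(lower=11, upper=27)
print(clipped.tolist())
```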

In [33]:
city_mpg.sort_values()

7901       6
34557      6
37161      6
21060      6
35887      6
        ... 
34563    138
34564    140
32599    150
31256    150
33423    150
Name: city08, Length: 41144, dtype: int64

In [34]:
city_mpg.sort_values().sort_index()

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

In [35]:
city_mpg.drop_duplicates()

0         19
1          9
2         23
3         10
4         17
        ... 
34364    127
34409    114
34564    140
34565    115
34566    104
Name: city08, Length: 105, dtype: int64

In [42]:
city_mpg.drop_duplicates(keep=False)

8147      84
23028     59
23029     79
24471    107
25699     60
25953     93
32740    131
32842    125
34173    123
34364    127
34409    114
34564    140
34565    115
34566    104
Name: city08, dtype: int64
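
The difference between the two calls above comes down to the `keep` parameter. As a rough illustration on a hypothetical six-element series (values made up), the default keeps the first occurrence of each value, `keep='last'` keeps the last, and `keep=False` drops every value that appears more than once:

```python
import pandas as pd

# Hypothetical values: 15 and 19 are duplicated, 84 and 59 are unique.
s = pd.Series([15, 15, 84, 19, 19, 59])

first = s.drop_duplicates()               # keep first occurrence (default)
last = s.drop_duplicates(keep='last')     # keep last occurrence
unique_only = s.drop_duplicates(keep=False)  # drop all duplicated values

print(first.index.tolist(), first.tolist())
print(last.index.tolist(), last.tolist())
print(unique_only.index.tolist(), unique_only.tolist())
```

In every case the surviving rows keep their original index labels.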

In [43]:
city_mpg.rank()  # tied values get the average of the positions they occupy

0        27060.5
1          235.5
2        35830.0
3          607.5
4        19484.0
          ...   
41139    27060.5
41140    29719.5
41141    23528.0
41142    23528.0
41143    15479.0
Name: city08, Length: 41144, dtype: float64

In [45]:
city_mpg.rank(method='min')

0        25555.0
1          136.0
2        35119.0
3          336.0
4        17467.0
          ...   
41139    25555.0
41140    28567.0
41141    21502.0
41142    21502.0
41143    13492.0
Name: city08, Length: 41144, dtype: float64

In [46]:
city_mpg.rank(method='dense')  # ties share a rank and the next rank is one higher, so no positions are skipped

0        14.0
1         4.0
2        18.0
3         5.0
4        12.0
         ... 
41139    14.0
41140    15.0
41141    13.0
41142    13.0
41143    11.0
Name: city08, Length: 41144, dtype: float64
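
The three ranking methods are easier to compare side by side on a tiny example. This sketch uses a hypothetical four-element series with one tie (the values are made up):

```python
import pandas as pd

# Hypothetical values; the two 20s are tied for positions 2 and 3.
s = pd.Series([10, 20, 20, 30])

avg = s.rank()                    # ties get the mean of positions 2 and 3
lowest = s.rank(method='min')     # ties get the lowest position, 2; next is 4
dense = s.rank(method='dense')    # ties get 2; next rank is 3, nothing skipped

print(pd.DataFrame({'value': s, 'average': avg, 'min': lowest, 'dense': dense}))
```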

In [47]:
make.replace('Subaru', 'スバル')

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4               スバル
            ...    
41139           スバル
41140           スバル
41141           スバル
41142           スバル
41143           スバル
Name: make, Length: 41144, dtype: object

The `to_replace` parameter's value can contain a regular expression if you pass `regex=True`.
In this example we use regular expression capture groups (specified in the expression by
parentheses). In the `value` parameter we refer to these groups (`\1` refers to the contents
of the first parentheses and `\2` to the contents of the second parentheses) when replacing
the original value:

In [49]:
make.replace(r'(Fer)ra(r.*)',
             value=r'\2-other-\1', regex=True)

0          Alfa Romeo
1        ri-other-Fer
2               Dodge
3               Dodge
4              Subaru
             ...     
41139          Subaru
41140          Subaru
41141          Subaru
41142          Subaru
41143          Subaru
Name: make, Length: 41144, dtype: object
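
To see what each capture group holds, the sketch below applies the same pattern to a single string with the standard library `re` module, then repeats the replacement on a hypothetical two-element series (made up for illustration):

```python
import re
import pandas as pd

pattern = r'(Fer)ra(r.*)'

# Inspect the two capture groups on one string.
m = re.match(pattern, 'Ferrari')
print(m.group(1), m.group(2))

# Same replacement as above: \2 and \1 swap the groups around '-other-'.
toy_makes = pd.Series(['Ferrari', 'Dodge'])
swapped = toy_makes.replace(pattern, value=r'\2-other-\1', regex=True)
print(swapped.tolist())
```

Values that do not match the pattern, such as `Dodge`, pass through unchanged.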

To bin data, the `cut` function can create bins of equal width:

In [50]:
pd.cut(city_mpg, 10)

0        (5.856, 20.4]
1        (5.856, 20.4]
2         (20.4, 34.8]
3        (5.856, 20.4]
4        (5.856, 20.4]
             ...      
41139    (5.856, 20.4]
41140    (5.856, 20.4]
41141    (5.856, 20.4]
41142    (5.856, 20.4]
41143    (5.856, 20.4]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.856, 20.4] < (20.4, 34.8] < (34.8, 49.2] < (49.2, 63.6] ... (92.4, 106.8] < (106.8, 121.2] < (121.2, 135.6] < (135.6, 150.0]]

In [51]:
pd.cut(city_mpg, [0, 10, 20, 40, 70, 150])

0        (10, 20]
1         (0, 10]
2        (20, 40]
3         (0, 10]
4        (10, 20]
           ...   
41139    (10, 20]
41140    (10, 20]
41141    (10, 20]
41142    (10, 20]
41143    (10, 20]
Name: city08, Length: 41144, dtype: category
Categories (5, interval[int64, right]): [(0, 10] < (10, 20] < (20, 40] < (40, 70] < (70, 150]]

In [52]:
pd.qcut(city_mpg, 10)

0         (18.0, 20.0]
1        (5.999, 13.0]
2         (21.0, 24.0]
3        (5.999, 13.0]
4         (16.0, 17.0]
             ...      
41139     (18.0, 20.0]
41140     (18.0, 20.0]
41141     (17.0, 18.0]
41142     (17.0, 18.0]
41143     (15.0, 16.0]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.999, 13.0] < (13.0, 14.0] < (14.0, 15.0] < (15.0, 16.0] ... (18.0, 20.0] < (20.0, 21.0] < (21.0, 24.0] < (24.0, 150.0]]

In [53]:
pd.qcut(city_mpg, 10, labels=list(range(1, 11)))

0        7
1        1
2        9
3        1
4        5
        ..
41139    7
41140    7
41141    6
41142    6
41143    4
Name: city08, Length: 41144, dtype: category
Categories (10, int64): [1 < 2 < 3 < 4 ... 7 < 8 < 9 < 10]
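
The contrast between `pd.cut` (equal-width bins) and `pd.qcut` (bins holding roughly equal numbers of values) shows up clearly on skewed data. A minimal sketch with a hypothetical five-element series containing one large outlier:

```python
import pandas as pd

# Hypothetical skewed values: four small numbers and one outlier.
vals = pd.Series([1, 2, 3, 4, 100])

equal_width = pd.cut(vals, 2)    # two bins of equal width across 1..100
equal_count = pd.qcut(vals, 2)   # two bins split at the median

# .cat.codes gives the bin number assigned to each value.
print(equal_width.cat.codes.tolist())
print(equal_count.cat.codes.tolist())
```

With `cut`, the outlier ends up alone in the upper bin; with `qcut`, the split point moves so the two bins are closer in size.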
