## Effective Pandas

Practice notebook for the exercises in Effective Pandas: Patterns for Data Manipulation
by Matt Harrison

### Series

A Series models one-dimensional data. Because it is one-dimensional, it has a single axis: the index.



In [1]:
import pandas as pd

In [2]:
songs2=pd.Series([145,142,38,13],name='counts')

In [4]:
#In this example 0,1,2,3 are the axis labels (the index) and 145,142,38,13 are the values
#A pandas DataFrame, by contrast, has two axes: one for the rows and one for the columns
songs2

0    145
1    142
2     38
3     13
Name: counts, dtype: int64

In [5]:
#the index is an attribute of the series object, we can inspect it
songs2.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
#We can set a string index for the Series object
songs3=pd.Series([145,142,38,13],name='counts',index=['Paul','John','George','Ringo'])

In [9]:
songs3.index

Index(['Paul', 'John', 'George', 'Ringo'], dtype='object')
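
Once a Series has a string index, there are two ways to look up a value: by label and by position. The sketch below is my own illustration, not from the book, and the variable names are hypothetical. It uses `.loc` for labels and `.iloc` for positions.

```python
import pandas as pd

# Same data as songs3 above, rebuilt here so the example is self-contained
songs = pd.Series([145, 142, 38, 13], name='counts',
                  index=['Paul', 'John', 'George', 'Ringo'])

by_label = songs.loc['John']              # look up by index label
by_position = songs.iloc[1]               # look up by integer position
label_slice = songs.loc['John':'George']  # label slices include both endpoints
position_slice = songs.iloc[1:2]          # positional slices exclude the stop

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Note that the label slice returns two rows while the positional slice returns only one.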

##### NaN = Not a Number
NaN is skipped by default in aggregations such as count(), sum() and mean(); in element-wise arithmetic it propagates to the result

In [11]:
import numpy as np
nan_series=pd.Series([2,np.nan],index=['Ono','Clapton'])

In [13]:
#notice that because pandas found a NaN in the data, it coerced the Series type to float64, which supports
#NaN values, whereas int64 does not
nan_series

Ono        2.0
Clapton    NaN
dtype: float64

In [15]:
#example of how pandas ignores a NaN value, using count()
nan_series.count()

1

In [16]:
#to get the number of entries including NaN values, use the .size property
nan_series.size

2
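
To make the difference between aggregations and arithmetic concrete, here is a small sketch of my own (variable names are hypothetical). It assumes the default `skipna=True` behaviour of `sum()`.

```python
import numpy as np
import pandas as pd

values = pd.Series([2, np.nan], index=['Ono', 'Clapton'])

total = values.sum()                     # NaN is skipped by default
strict_total = values.sum(skipna=False)  # NaN is kept, so the result is NaN
plus_one = values + 1                    # element-wise arithmetic propagates NaN
missing = values.isna().sum()            # number of missing entries

print(total, strict_total, missing)
print(plus_one)
```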

In [17]:
#Although int64 does not support missing values, pandas has a nullable integer type that does:
#Int64 (notice the capital I instead of i)

nan_series2=pd.Series([2,np.nan],['Ono','Clapton'],dtype="Int64")

In [19]:
nan_series2

Ono           2
Clapton    <NA>
dtype: Int64

In [20]:
#you can use .astype('Int64') to convert an existing Series to the nullable integer type
nan_series.astype('Int64')

Ono           2
Clapton    <NA>
dtype: Int64
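
A short sketch of working with the nullable `Int64` type (my own example, with hypothetical variable names). It shows that the missing entry is `pd.NA`, that arithmetic keeps the `Int64` dtype, and that filling the gap allows a conversion back to plain `int64`.

```python
import numpy as np
import pandas as pd

nullable = pd.Series([2, np.nan], index=['Ono', 'Clapton'], dtype='Int64')

is_missing = nullable.isna()                 # boolean mask of missing entries
doubled = nullable * 2                       # stays Int64, <NA> propagates
filled = nullable.fillna(0).astype('int64')  # no missing values left, so int64 works

print(is_missing)
print(doubled)
print(filled)
```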

#### A pandas Series behaves similarly to a NumPy array

In [21]:
numpy_ser=np.array([145,142,38,13])

In [24]:
songs3.iloc[1] #both support positional indexing; .iloc makes the positional lookup explicit on a labeled Series

142

In [25]:
numpy_ser[1]

142

In [27]:
#Numpy and pandas have methods in common
print(songs3.mean())
print(numpy_ser.mean())


84.5
84.5


In [28]:
#Masking with a boolean array
mask=songs3>songs3.median()

In [30]:
mask

Paul       True
John       True
George    False
Ringo     False
Name: counts, dtype: bool

In [31]:
#once we have a mask we can use it as a filter
songs3[mask]

Paul    145
John    142
Name: counts, dtype: int64
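
Masks can also be combined. The following is a sketch of my own (names are hypothetical) using `&` for "and" and `~` for "not". Each comparison needs its own variable or parentheses, because `&` binds more tightly than `>` in Python.

```python
import pandas as pd

songs = pd.Series([145, 142, 38, 13], name='counts',
                  index=['Paul', 'John', 'George', 'Ringo'])

above_median = songs > songs.median()  # the median of these four values is 90.0
under_145 = songs < 145

combined = songs[above_median & under_145]  # both conditions must hold
not_above = songs[~above_median]            # invert the mask

print(combined)
print(not_above)
```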

### Categorical Data in pd Series

In [34]:
#Creating a category
s=pd.Series(['m','l','xs','s','xl'],dtype='category')

In [35]:
#We can check if the categories have an order
s.cat.ordered

False

In [37]:
#If we want to set a particular order of the categories we can use:
s2=pd.Series(['m','l','xs','s','xl'])
size_type=pd.api.types.CategoricalDtype(
    categories=['s','m','l'],ordered=True)

s3=s2.astype(size_type)

In [39]:
#notice that because 'xs' and 'xl' are not among the declared categories, astype converted those values to NaN

s3

0      m
1      l
2    NaN
3      s
4    NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']

In [42]:
#We could perform comparisons on the categories
#What values are greater than s
s3>'s'

0     True
1     True
2    False
3    False
4    False
dtype: bool

In [44]:
#We could also add an order to an existing categorical Series

s.cat.reorder_categories(['xs','s','m','l','xl'],ordered=True)

0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
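
Once the categories are ordered, sorting and min/max follow the category order rather than alphabetical order. This is a minimal sketch of my own (hypothetical names), assuming all five sizes are declared as categories.

```python
import pandas as pd

size_type = pd.api.types.CategoricalDtype(
    categories=['xs', 's', 'm', 'l', 'xl'], ordered=True)
sizes = pd.Series(['m', 'l', 'xs', 's', 'xl']).astype(size_type)

sorted_sizes = sizes.sort_values()  # sorted by category order, not alphabetically
largest = sizes.max()               # max is defined because the dtype is ordered
at_least_m = sizes[sizes >= 'm']    # comparison used as a filter

print(sorted_sizes.tolist())
print(largest)
print(at_least_m.tolist())
```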

In [50]:
#We can perform string operations on categorical data; note that the result below is no longer categorical
s3=s3.str.upper()

In [51]:
s3

0      M
1      L
2    NaN
3      S
4    NaN
dtype: object

Exercises on Chapter 4

1) Create a series with the temperature values of the last seven days and filter out the values below the mean

In [52]:
#integer literals are used here; there is no need to pass strings and rely on the Int64 conversion
temperatures=pd.Series([22,30,21,25,28,24,21],dtype="Int64")

In [53]:
temperatures_below=temperatures<temperatures.mean()

In [54]:
temperatures[temperatures_below]

0    22
2    21
5    24
6    21
dtype: Int64

2) Create a Series with your favorite colors

In [57]:
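colors=pd.Series(['Pink','Red','Black'],dtype='category')
#reorder_categories returns a new Series; colors itself is not modified unless the result is assigned back
colors.cat.reorder_categories(['Pink','Red', 'Black'], ordered=True)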

0     Pink
1      Red
2    Black
dtype: category
Categories (3, object): ['Pink' < 'Red' < 'Black']

In [58]:
colors

0     Pink
1      Red
2    Black
dtype: category
Categories (3, object): ['Black', 'Pink', 'Red']
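
The output above shows that `colors` is still unordered. A sketch of my own (the name `ordered_colors` is hypothetical) of how to keep the ordering by storing the returned Series:

```python
import pandas as pd

colors = pd.Series(['Pink', 'Red', 'Black'], dtype='category')

# reorder_categories does not modify colors; keep the returned Series instead
ordered_colors = colors.cat.reorder_categories(
    ['Pink', 'Red', 'Black'], ordered=True)

print(colors.cat.ordered)          # the original is still unordered
print(ordered_colors.cat.ordered)  # the new Series carries the order
print(ordered_colors.cat.categories.tolist())
```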

## Series Deep Dive

In [68]:
df=pd.read_csv('vehicles.csv')

In [69]:
city_mpg=df.city08

In [70]:
highway_mpg=df.highway08

In [71]:
city_mpg

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

In [72]:
highway_mpg

0        25
1        14
2        33
3        12
4        23
         ..
41139    26
41140    28
41141    24
41142    24
41143    21
Name: highway08, Length: 41144, dtype: int64

In [74]:
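#How many attributes does a pandas Series have?
#dir() lists every attribute name, not only methods, and includes private and dunder names;
#this Series has 420 (the count varies by pandas version)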
len(dir(city_mpg))

420
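
Since `dir()` also returns private names and non-callable attributes such as `index`, a narrower count may be more informative. The sketch below is my own (the helper `is_callable_attribute` and the sample data are hypothetical). It counts public names and, among those, the callable ones. The exact numbers depend on the pandas version, so none are shown.

```python
import warnings
import pandas as pd

sample = pd.Series([19, 9, 23], name='city08')

def is_callable_attribute(obj, name):
    """Return True if the attribute can be fetched and is callable."""
    try:
        return callable(getattr(obj, name))
    except Exception:
        # Some accessors raise when they do not apply to the data
        return False

with warnings.catch_warnings():
    # Fetching deprecated attributes may emit warnings; silence them here
    warnings.simplefilter('ignore')
    all_names = dir(sample)
    public_names = [n for n in all_names if not n.startswith('_')]
    public_callables = [n for n in public_names
                        if is_callable_attribute(sample, n)]

print(len(all_names), len(public_names), len(public_callables))
```

The callable count is still only an approximation of "methods", because callable accessors such as `loc` are included too.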

In [76]:
#we can explore the available methods by typing a dot after the name of a series and then pressing TAB
#city_mpg.

Exercises:

1) How many methods does the str attribute have?

In [77]:
len(dir(colors.str))

99