# Pandas Basics I

We will start by practicing basic operations in Pandas. It is important to become familiar with them, because we will use them repeatedly throughout this course.

This introduction to Pandas covers:

- Attributes of Pandas objects
- Counting values in Series
- Altering labels
- .dt and .str accessors
- Sorting

One of the great things about widely used Python packages is that their documentation is very good. Most things we want to do in Pandas can be found with a quick web search. We will also work intensively with the official documentation throughout this course.

We can continue in the notebook from the previous activity. If you decide to create a new one, don't forget to import the packages.

## Attributes of Pandas objects

Pandas objects have a number of attributes that give us access to metadata:

- shape: gives the axis dimensions of the object, consistent with ndarray
- Axis labels:
    - Series: index (only axis)
    - DataFrame: index (rows) and columns
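
As a quick illustration of these attributes, the sketch below builds a small DataFrame and inspects its shape and axis labels. The names frame and col are illustrative only and do not appear in the notebook cells that follow.

```python
import numpy as np
import pandas as pd

# Illustrative objects: an 8x3 DataFrame and one of its columns as a Series
frame = pd.DataFrame(np.zeros((8, 3)), columns=["A", "B", "C"])
col = frame["A"]

print(frame.shape)           # (8, 3): rows, columns
print(col.shape)             # (8,): a Series has a single axis
print(list(frame.columns))   # ['A', 'B', 'C']
print(frame.index)           # row labels of the DataFrame
print(col.index.equals(frame.index))  # the column shares the row index
```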


In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'])

In [3]:
df.columns = [x.lower() for x in df.columns]

In [4]:
df

|   | a | b | c |
|---|---|---|---|
| 0 | -1.195252 | -0.848525 | -0.688063 |
| 1 | 0.564029 | 0.287802 | -1.211854 |
| 2 | -0.864288 | -0.62209 | 0.648326 |
| 3 | -0.430905 | -0.218526 | -0.109536 |
| 4 | -1.043536 | 0.350322 | 0.849434 |
| 5 | 0.030655 | -1.702364 | -0.549075 |
| 6 | 0.107179 | 3.245787 | 0.2397 |
| 7 | 0.593894 | 0.172199 | 0.635085 |


We can think of the Pandas objects (Index, Series, DataFrame) as containers for arrays, which hold the actual data and do the actual computation. To get the actual data inside an Index or Series, use the attribute .array.

In [5]:
df['a'].array

<NumpyExtensionArray>
[ -1.1952521662648472,   0.5640285325004911,  -0.8642877692615747,
 -0.43090549165127373,  -1.0435358498845664,  0.03065533980990658,
  0.10717913694955286,   0.5938937734643264]
Length: 8, dtype: float64
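
The .array attribute returns a Pandas wrapper object rather than a plain NumPy array. If a NumPy array is what we need, Series.to_numpy() is one way to get it. The sketch below contrasts the two; the Series ser is a made-up example.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
ser = pd.Series([1.5, 2.5, 3.5])

wrapped = ser.array      # Pandas extension array wrapping the data
raw = ser.to_numpy()     # plain numpy.ndarray with the same values

print(type(wrapped).__name__)
print(type(raw).__name__)
print(raw.tolist())
```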

The full list of attributes available for each data type can be found in the official documentation for Series and DataFrame.

## Counting values in Series

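The Series method value_counts() computes a histogram of a 1D array of values, that is, how often each distinct value occurs. A top-level function pd.value_counts() also exists, but it is deprecated in recent Pandas versions, so the Series method is preferred.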

In [6]:
data = np.random.randint(0, 7, size=50)

In [7]:
data

array([3, 6, 2, 2, 6, 4, 0, 5, 4, 3, 2, 6, 1, 6, 3, 6, 5, 5, 4, 1, 1, 1,
       4, 1, 3, 2, 0, 0, 5, 6, 5, 3, 0, 4, 1, 1, 6, 5, 4, 3, 4, 0, 1, 6,
       5, 6, 0, 0, 1, 4])

In [8]:
s = pd.Series(data)

In [9]:
s.value_counts()

6    9
1    9
4    8
0    7
5    7
3    6
2    4
Name: count, dtype: int64
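
By default the result is ordered by frequency. The sketch below, on a small made-up Series named grades, shows two common variations: relative frequencies via normalize=True, and ordering by label via sort_index().

```python
import pandas as pd

# Hypothetical example data
grades = pd.Series([3, 1, 3, 2, 3, 1])

counts = grades.value_counts()                # most frequent value first
shares = grades.value_counts(normalize=True)  # fractions that sum to 1
by_label = counts.sort_index()                # ordered by value, not frequency

print(counts)
print(shares)
print(by_label)
```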

Similarly, we can get the most frequently occurring value(s) (mode()) of the values in a Series or DataFrame.

In [10]:
s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [11]:
s5.mode()


0    3
1    7
dtype: int64

In [12]:
df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
                    "B": np.random.randint(-10, 15, size=50)})

In [13]:
df5.mode()


|   | A | B |
|---|---|---|
| 0 | 0 | 3.0 |
| 1 | 4 | NaN |


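Warning

mode() works the same way on both Series and DataFrame, column by column. value_counts() does not: on a Series it counts individual values, whereas DataFrame.value_counts() (available since Pandas 1.1) counts unique rows, that is, combinations of values across columns. It therefore does not give a per-column histogram.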
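
To get a separate histogram for each column of a DataFrame, one option is to call value_counts() on every column in turn. The sketch below does this with a dictionary comprehension; frame and per_column are illustrative names.

```python
import pandas as pd

# Hypothetical example data
frame = pd.DataFrame({"A": [0, 0, 1], "B": ["x", "y", "x"]})

# One value_counts() result per column, keyed by column name
per_column = {name: frame[name].value_counts() for name in frame.columns}

print(per_column["A"])
print(per_column["B"])
```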


## .dt and .str accessors
### .dt accessor

Series has an accessor, .dt, that succinctly returns datetime-like properties of its values, provided it is a datetime/period-like Series. The result is a Series indexed like the original one.

In [14]:
s = pd.Series(pd.date_range('20130101 09:10:12', periods=4))

In [15]:
s

0   2013-01-01 09:10:12
1   2013-01-02 09:10:12
2   2013-01-03 09:10:12
3   2013-01-04 09:10:12
dtype: datetime64[ns]

In [16]:
s.dt.hour

0    9
1    9
2    9
3    9
dtype: int32

In [17]:
s.dt.second

0    12
1    12
2    12
3    12
dtype: int32

In [18]:
s.dt.day

0    1
1    2
2    3
3    4
dtype: int32

In [19]:
s.dt.dayofweek

0    1
1    2
2    3
3    4
dtype: int32

We can easily produce timezone-aware transformations:

In [20]:
stz = s.dt.tz_localize('US/Eastern')

In [21]:
stz

0   2013-01-01 09:10:12-05:00
1   2013-01-02 09:10:12-05:00
2   2013-01-03 09:10:12-05:00
3   2013-01-04 09:10:12-05:00
dtype: datetime64[ns, US/Eastern]

In [22]:
stz.dt.tz

<DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>

We can also chain these types of operations:

In [23]:
s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

0   2013-01-01 04:10:12-05:00
1   2013-01-02 04:10:12-05:00
2   2013-01-03 04:10:12-05:00
3   2013-01-04 04:10:12-05:00
dtype: datetime64[ns, US/Eastern]
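
Note that dayofweek counts from Monday as 0, which is why 1 January 2013 (a Tuesday) is reported as 1 above. The sketch below shows a few more things the .dt accessor offers on the same dates: readable day names, formatted strings, and a weekend flag. The variable names are illustrative.

```python
import pandas as pd

stamps = pd.Series(pd.date_range("2013-01-01 09:10:12", periods=4))

names = stamps.dt.day_name()              # weekday names (English by default)
labels = stamps.dt.strftime("%Y/%m/%d")   # dates formatted as strings
weekend = stamps.dt.dayofweek >= 5        # Saturday is 5, Sunday is 6

print(names.tolist())
print(labels.tolist())
print(weekend.tolist())
```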

### .str accessor

Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. These are accessed via the Series’s str attribute and generally have names matching the equivalent (scalar) built-in string methods.

In [24]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
              dtype="string")

In [25]:
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

Using the .str accessor, we can apply most of the standard Python string methods to every element of a Series.
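
A few more .str methods are sketched below on a shortened, made-up version of the Series above. Missing values stay missing in the result, and matching with str.contains() is case-sensitive by default.

```python
import numpy as np
import pandas as pd

# Hypothetical shortened example with one missing value
words = pd.Series(["A", "Aaba", np.nan, "dog"], dtype="string")

lengths = words.str.len()        # number of characters per element
has_a = words.str.contains("a")  # lowercase 'a' only, so "A" does not match
upper = words.str.upper()

print(lengths)
print(has_a)
print(upper)
```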

## Sorting

There are three types of sorting in Pandas:

1. Sorting by index labels
2. Sorting by column values
3. Sorting by a combination of both

### By index

The Series.sort_index() and DataFrame.sort_index() methods are used to sort a Pandas object by its index levels.

In [26]:
df = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

In [27]:
unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
                          columns=['three', 'two', 'one'])

In [28]:
unsorted_df

|   | three | two | one |
|---|---|---|---|
| a | NaN | 1.069751 | -0.999548 |
| d | -0.418281 | -0.896384 | NaN |
| c | -0.934628 | -0.824306 | 0.226103 |
| b | 0.278278 | -0.552513 | -0.649304 |


In [29]:
unsorted_df.sort_index()  # sort rows by index labels (ascending)

|   | three | two | one |
|---|---|---|---|
| a | NaN | 1.069751 | -0.999548 |
| b | 0.278278 | -0.552513 | -0.649304 |
| c | -0.934628 | -0.824306 | 0.226103 |
| d | -0.418281 | -0.896384 | NaN |


In [30]:
unsorted_df.sort_index(ascending=False)  # sort rows by index labels (descending)

|   | three | two | one |
|---|---|---|---|
| d | -0.418281 | -0.896384 | NaN |
| c | -0.934628 | -0.824306 | 0.226103 |
| b | 0.278278 | -0.552513 | -0.649304 |
| a | NaN | 1.069751 | -0.999548 |


In [31]:
unsorted_df.sort_index(axis=1)  # sort columns by their labels

|   | one | three | two |
|---|---|---|---|
| a | -0.999548 | NaN | 1.069751 |
| d | NaN | -0.418281 | -0.896384 |
| c | 0.226103 | -0.934628 | -0.824306 |
| b | -0.649304 | 0.278278 | -0.552513 |


In [32]:
unsorted_df['three'].sort_index()  # sort a single column (Series) by index labels

a         NaN
b    0.278278
c   -0.934628
d   -0.418281
Name: three, dtype: float64

### By values

The Series.sort_values() method is used to sort a Series by its values. The DataFrame.sort_values() method is used to sort a DataFrame by its column or row values.

In [33]:
df1 = pd.DataFrame({'one': [2, 1, 1, 1],
                        'two': [1, 3, 2, 4],
                        'three': [5, 4, 3, 2]})

In [34]:
df1.sort_values(by='two')  # sort by column 'two'

|   | one | two | three |
|---|---|---|---|
| 0 | 2 | 1 | 5 |
| 2 | 1 | 2 | 3 |
| 1 | 1 | 3 | 4 |
| 3 | 1 | 4 | 2 |


In [35]:
df1[['one', 'two', 'three']].sort_values(by=['one', 'two'])  # sort by 'one', then by 'two'

|   | one | two | three |
|---|---|---|---|
| 2 | 1 | 2 | 3 |
| 1 | 1 | 3 | 4 |
| 3 | 1 | 4 | 2 |
| 0 | 2 | 1 | 5 |
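
When sorting by several columns, the ascending argument also accepts a list with one flag per column. The sketch below rebuilds the same small table under the illustrative name table and sorts by 'one' ascending, then by 'two' descending.

```python
import pandas as pd

table = pd.DataFrame({"one": [2, 1, 1, 1],
                      "two": [1, 3, 2, 4],
                      "three": [5, 4, 3, 2]})

# 'one' ascending; ties in 'one' are ordered by 'two' descending
mixed = table.sort_values(by=["one", "two"], ascending=[True, False])

print(mixed)
```

The original row labels travel with the rows, so the index of the result shows where each row came from.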


These methods treat NA values specially via the na_position argument, which places them last by default. The example below reuses the string Series s from the .str section:

In [36]:
s[2] = np.nan

In [37]:
s.sort_values()

0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
2    <NA>
5    <NA>
dtype: string

In [38]:
s.sort_values(na_position='first')

2    <NA>
5    <NA>
0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
dtype: string

The by parameter of DataFrame.sort_values() can refer to either column names or index level names.

This means we can use the name of an index level to sort by both an index level and a column.

In [39]:
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
                                   ('b', 2), ('b', 1), ('b', 1)])

In [40]:
idx.names = ['first','second']

In [41]:
df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
                            index=idx)

In [42]:
df_multi

| first | second | A |
|---|---|---|
| a | 1 | 6 |
| a | 2 | 5 |
| a | 2 | 4 |
| b | 2 | 3 |
| b | 1 | 2 |
| b | 1 | 1 |


In [43]:
df_multi.sort_values(by=['second', 'A'])  # sort by 'second' (index level), then by 'A' (column)

| first | second | A |
|---|---|---|
| b | 1 | 1 |
| b | 1 | 2 |
| a | 1 | 6 |
| b | 2 | 3 |
| a | 2 | 4 |
| a | 2 | 5 |
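
If we only need to order by index levels, without involving a column, sort_index() with its level argument is an alternative. The sketch below rebuilds the same MultiIndex under the illustrative name frame and sorts by the 'second' level; the remaining level 'first' is used to break ties.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("a", 2), ("b", 2), ("b", 1), ("b", 1)],
    names=["first", "second"],
)
frame = pd.DataFrame({"A": np.arange(6, 0, -1)}, index=idx)

# Sort by the 'second' index level first, then by the remaining level
by_level = frame.sort_index(level="second")

print(by_level)
```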
