# Pandas Basics 1

* Attributes of Pandas objects
* Counting values in Series
* Altering labels
* **.dt** and **.str** accessors
* Sorting

One of the great things about the widely used Python packages is that they are well documented. We can usually find out how to do something in Pandas with a quick search. We will also be working intensively with the official documentation throughout this module and the course.

## Attributes of Pandas objects

**Pandas** objects have a number of attributes enabling us to access metadata:

* Shape: gives the axis dimensions of the object, consistent with ndarray
* Axis labels:
  * Series: index (only axis)
  * DataFrame: index (rows) and columns 
  
  
**These attributes can also be safely assigned!**

In [70]:
import numpy as np
import pandas as pd
import datetime

In [23]:
# 8 rows of random values labelled 0..7 (df does not exist yet, so its index cannot be reused here)
df = pd.DataFrame(np.random.randn(8, 3), index=range(8),
                  columns=['A', 'B', 'C'])

In [24]:
df.columns = [x.lower() for x in df.columns]

In [25]:
df

|   | a | b | c |
|---|---|---|---|
| 0 | -0.580088 | -1.128502 | 0.256331 |
| 1 | -0.195684 | -0.087318 | 0.368795 |
| 2 | 1.684591 | -0.340272 | -0.576588 |
| 3 | -1.588251 | 0.390996 | -0.775116 |
| 4 | -0.079312 | -1.67124 | 0.534221 |
| 5 | -1.221278 | 0.838284 | -0.109499 |
| 6 | 1.155161 | -0.111214 | -0.38721 |
| 7 | -0.162975 | 2.066352 | 1.143388 |
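
As a small sketch of the attributes listed above (the frame `demo` and its labels are made up for illustration), the following reads **shape**, **index** and **columns**, then assigns new labels to both axes:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 frame used only for this illustration
demo = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])

print(demo.shape)          # axis dimensions as a tuple: (rows, columns)
print(list(demo.index))    # row labels
print(list(demo.columns))  # column labels

# Axis labels can be replaced by assignment;
# the new list must have the same length as the axis
demo.index = ['x', 'y', 'z']
demo.columns = ['first', 'second']
print(demo)
```

Assigning a list of the wrong length raises an error, so the assignment cannot silently misalign the data.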


We can think of the **Pandas** objects (**Index, Series, DataFrame**) as containers for arrays, which hold the actual data and do the actual computation. To get the actual data inside an **Index** or **Series**, use the attribute **.array**.

In [26]:
df.a.array

<PandasArray>
[ -0.5800877974199303, -0.19568371532700307,    1.684590535656961,
  -1.5882507181798031, -0.07931191419102261,  -1.2212783360377637,
   1.1551614126263876, -0.16297469509362408]
Length: 8, dtype: float64
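
The difference between **.array** and a plain NumPy array can be sketched as follows. This is only an illustration: the Series `ser` is made up, and **to_numpy()** is used here as the usual way to obtain an ndarray.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.5, 2.5, 3.5])  # illustrative data

backing = ser.array        # pandas extension array wrapping the values
as_numpy = ser.to_numpy()  # plain NumPy ndarray with the same values

print(type(backing).__name__)
print(type(as_numpy).__name__)
```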

The full list of the attributes that are available for each data type can be found in the documentation for **Series** and **DataFrame**.

## Counting values in Series

The **value_counts()** Series method computes a histogram of a 1D array of values: it returns each distinct value together with the number of times it occurs, most frequent first.

In [27]:
data = np.random.randint(0, 7, size=50)

In [28]:
data

array([5, 4, 5, 1, 3, 6, 6, 5, 4, 3, 2, 0, 4, 2, 6, 4, 4, 0, 4, 3, 5, 4,
       1, 6, 4, 5, 4, 2, 0, 2, 6, 5, 3, 0, 5, 2, 1, 0, 1, 2, 6, 4, 5, 5,
       3, 2, 1, 6, 5, 6])

In [29]:
s = pd.Series(data)

In [30]:
s.value_counts()

5    10
4    10
6     8
2     7
3     5
1     5
0     5
dtype: int64
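
A few commonly used variations, sketched on made-up data (the Series `votes` is hypothetical): relative frequencies with **normalize=True**, and ordering by label instead of by count with **sort_index()**.

```python
import pandas as pd

votes = pd.Series(['yes', 'no', 'yes', 'yes', 'no', 'maybe'])

counts = votes.value_counts()                 # absolute counts, most frequent first
shares = votes.value_counts(normalize=True)   # relative frequencies that sum to 1
by_label = votes.value_counts().sort_index()  # same counts, ordered by label

print(counts)
print(shares)
print(by_label)
```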

Similarly, we can get the most frequently occurring value(s) (**mode()**) of the values in a Series or DataFrame.

In [31]:
s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])
s5.mode()

0    3
1    7
dtype: int64

In [44]:
df5 = pd.DataFrame({'A': np.random.randint(0, 7, size=50),
                    'B': np.random.randint(-10, 15, size=50)})

In [45]:
df5.mode()

|   | A | B |
|---|---|---|
| 0 | 3.0 | -10 |
| 1 | NaN | 6 |


*Both **mode()** and **value_counts()** can be called on a Series. **mode()** also works column by column on a DataFrame, whereas **DataFrame.value_counts()** (available since pandas 1.1) counts unique rows rather than the values of each column.*
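
One simple way to get counts for every column of a DataFrame is to call **value_counts()** on each column in turn. A minimal sketch, with a made-up frame `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'y', 'y']})  # illustrative data

# Each column is a Series, so Series.value_counts() applies to it directly
per_column = {col: frame[col].value_counts() for col in frame.columns}

for col, counts in per_column.items():
    print(col)
    print(counts)
```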

## Altering labels
### Reindexing

**reindex()** is the fundamental data alignment method in **Pandas**. It is used to implement nearly all other features relying on label-alignment functionality. To reindex means to conform the data to match a given set of labels along a particular axis. This accomplishes several things:
* Reorders the existing data to match a new set of labels
* Inserts missing value (NA) markers in label locations where no data for that label existed

Here is a simple example:

In [49]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    0.562198
b   -0.091051
c    0.492291
d   -0.539372
e   -0.115275
dtype: float64

In [50]:
s.reindex(['e', 'b', 'f', 'd'])

e   -0.115275
b   -0.091051
f         NaN
d   -0.539372
dtype: float64

We can see that we have **NaN** for the index **f**. This happens because we didn't have a label **f** in the original Series.
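
If NaN is not wanted for new labels, **reindex()** accepts a **fill_value** argument. A minimal sketch on made-up data (the Series `ser` is hypothetical):

```python
import pandas as pd

ser = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# 'z' did not exist, so it becomes NaN (and the integers are upcast to float)
with_nan = ser.reindex(['c', 'a', 'z'])

# Same reordering, but the missing label is filled with 0 instead
with_zero = ser.reindex(['c', 'a', 'z'], fill_value=0)

print(with_nan)
print(with_zero)
```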

With a DataFrame, we can simultaneously reindex the index and columns:

In [51]:
df = pd.DataFrame({
    'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
    'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

In [52]:
df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])

|   | three | two | one |
|---|---|---|---|
| c | 1.932542 | -1.445821 | -1.687763 |
| f | NaN | NaN | NaN |
| b | -2.079307 | -0.272225 | 0.830594 |


We may also use **reindex** with an **axis** keyword:

In [55]:
df.reindex(['j', 'e', 's'], axis='index')

|   | one | two | three |
|---|---|---|---|
| j | NaN | NaN | NaN |
| e | NaN | NaN | NaN |
| s | NaN | NaN | NaN |


**Note:** Index objects containing the actual axis labels can be shared between objects. So if we have a Series **s** and a DataFrame **df**, we can conform the Series to the DataFrame's index with **rs = s.reindex(df.index)**.
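
A minimal sketch of reindexing a Series to the index of a DataFrame (both objects are made up for illustration):

```python
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
frame = pd.DataFrame({'x': [10, 20]}, index=['c', 'a'])

# Conform the Series to the DataFrame's row labels
aligned = ser.reindex(frame.index)

print(aligned)
print(aligned.index.equals(frame.index))  # both now carry the same labels
```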

### Dropping labels from an axis

A method closely related to **reindex** is the **drop()** function. It removes a set of labels from an axis:

In [56]:
df

|   | one | two | three |
|---|---|---|---|
| a | 1.410044 | 0.169138 | NaN |
| b | 0.830594 | -0.272225 | -2.079307 |
| c | -1.687763 | -1.445821 | 1.932542 |
| d | NaN | 1.538125 | -2.126752 |


In [57]:
df.drop(['a', 'd'], axis=0)

|   | one | two | three |
|---|---|---|---|
| b | 0.830594 | -0.272225 | -2.079307 |
| c | -1.687763 | -1.445821 | 1.932542 |


In [59]:
df.drop(['one'], axis=1)

|   | two | three |
|---|---|---|
| a | 0.169138 | NaN |
| b | -0.272225 | -2.079307 |
| c | -1.445821 | 1.932542 |
| d | 1.538125 | -2.126752 |
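
Instead of **axis**, **drop()** also accepts the **index** and **columns** keywords, and **errors='ignore'** skips labels that do not exist. A short sketch on a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]},
                     index=['a', 'b', 'c'])  # illustrative data

rows_dropped = frame.drop(index=['a'])       # same as frame.drop(['a'], axis=0)
cols_dropped = frame.drop(columns=['one'])   # same as frame.drop(['one'], axis=1)

# 'zzz' is not a row label; with errors='ignore' it is skipped instead of raising
tolerant = frame.drop(index=['a', 'zzz'], errors='ignore')

print(rows_dropped)
print(cols_dropped)
print(tolerant)
```

In all three calls a new DataFrame is returned and `frame` itself keeps its original rows and columns.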


### Renaming

The **rename()** method allows us to relabel an axis based on some mapping (a dict or Series) or an arbitrary function. 

In [60]:
s

a    0.562198
b   -0.091051
c    0.492291
d   -0.539372
e   -0.115275
dtype: float64

In [61]:
s.rename(str.upper)

A    0.562198
B   -0.091051
C    0.492291
D   -0.539372
E   -0.115275
dtype: float64

A **dict** or **Series** can also be used:

In [62]:
df.rename(columns={'one': 'foo', 'two': 'bar'},
         index={'a': 'apple', 'b': 'banana', 'd': 'durian'})

|   | foo | bar | three |
|---|---|---|---|
| apple | 1.410044 | 0.169138 | NaN |
| banana | 0.830594 | -0.272225 | -2.079307 |
| c | -1.687763 | -1.445821 | 1.932542 |
| durian | NaN | 1.538125 | -2.126752 |


**DataFrame.rename()** also supports an "axis-style" calling convention, where we specify a single mapper and an axis to apply that mapping to. 

In [64]:
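# Relabel the columns (not displayed: only the last expression of a cell is shown)
df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')
# Relabel the index; rename() returns a new object, so df itself is unchanged
df.rename({'a': 'apple', 'b': 'banana', 'd': 'durian'}, axis='index')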

|   | one | two | three |
|---|---|---|---|
| apple | 1.410044 | 0.169138 | NaN |
| banana | 0.830594 | -0.272225 | -2.079307 |
| c | -1.687763 | -1.445821 | 1.932542 |
| durian | NaN | 1.538125 | -2.126752 |
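
A function and a dict can be combined in one call, and the original object is left untouched unless the result is assigned back. A minimal sketch on a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'one': [1, 2], 'two': [3, 4]}, index=['a', 'b'])

# Function mapper for the columns, dict mapper for the index;
# labels missing from the dict ('b') are kept as they are
renamed = frame.rename(columns=str.upper, index={'a': 'apple'})

print(renamed)
print(frame)  # still has the original labels
```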


## **.dt** and **.str** accessors  
### **.dt** accessor
**Series** has an accessor, **.dt**, that succinctly returns datetime-like properties of its values, provided it is a datetime/period-like Series. The result is a Series indexed like the original one.

In [75]:
#datetime

s = pd.Series(pd.date_range('20130101 09:10:12', periods=4))
s

0   2013-01-01 09:10:12
1   2013-01-02 09:10:12
2   2013-01-03 09:10:12
3   2013-01-04 09:10:12
dtype: datetime64[ns]

In [76]:
s.dt.hour

0    9
1    9
2    9
3    9
dtype: int64

##### *Extract the following values from Series s: 1. Seconds 2. Day 3. Day of the week*

In [79]:
s.dt.second

0    12
1    12
2    12
3    12
dtype: int64

In [80]:
s.dt.day

0    1
1    2
2    3
3    4
dtype: int64

In [81]:
s.dt.dayofweek

0    1
1    2
2    3
3    4
dtype: int64
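
Besides numeric properties, the **.dt** accessor also offers methods that return strings. A short sketch using **day_name()** and **strftime()** on the same date range (the variable name `stamps` is made up):

```python
import pandas as pd

stamps = pd.Series(pd.date_range('20130101 09:10:12', periods=4))

names = stamps.dt.day_name()            # weekday names instead of numbers 0..6
dates = stamps.dt.strftime('%Y-%m-%d')  # formatted strings

print(names)
print(dates)
```

Here **dayofweek** counts from Monday = 0, so the value 1 for the first row corresponds to a Tuesday.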

  
#### We can easily produce timezone-aware transformations:  


In [84]:
stz = s.dt.tz_localize('US/Eastern')
stz

0   2013-01-01 09:10:12-05:00
1   2013-01-02 09:10:12-05:00
2   2013-01-03 09:10:12-05:00
3   2013-01-04 09:10:12-05:00
dtype: datetime64[ns, US/Eastern]

In [86]:
stz.dt.tz

<DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>

  
#### We can also chain these types of operations:  


In [88]:
s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

0   2013-01-01 04:10:12-05:00
1   2013-01-02 04:10:12-05:00
2   2013-01-03 04:10:12-05:00
3   2013-01-04 04:10:12-05:00
dtype: datetime64[ns, US/Eastern]
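
The two methods do different things: **tz_localize()** attaches a timezone to naive timestamps without changing the wall-clock time, while **tz_convert()** expresses the same instants in another timezone. A minimal sketch (variable names are made up, and it assumes the timezone database used by pandas knows 'US/Eastern'):

```python
import pandas as pd

naive = pd.Series(pd.date_range('20130101 09:10:12', periods=2))

as_utc = naive.dt.tz_localize('UTC')             # 09:10:12 is now interpreted as UTC
as_eastern = as_utc.dt.tz_convert('US/Eastern')  # same instants, shown as Eastern time

# The wall-clock hour differs, but the underlying instants are identical
print(as_utc.dt.hour.tolist(), as_eastern.dt.hour.tolist())
same_instant = bool((as_utc == as_eastern).all())
print(same_instant)
```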

### **.str** accessor
Series is equipped with **a set of string processing methods** that make it easy to operate on each element of the array. These are accessed via the Series's **str** attribute and generally have names matching the equivalent (scalar) built-in string methods.

In [90]:
s = pd.Series(['A', 'B', 'C', 'Adaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
             dtype="string")
s.str.lower()

0        a
1        b
2        c
3    adaba
4     baca
5     <NA>
6     caba
7      dog
8      cat
dtype: string
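
Other **.str** methods follow the same pattern and propagate missing values. A brief sketch with **len()** and **contains()** on made-up data (the **na** argument decides what a missing value yields in the boolean result):

```python
import numpy as np
import pandas as pd

words = pd.Series(['A', 'Adaba', np.nan, 'dog'], dtype='string')

lengths = words.str.len()  # length of each string, <NA> stays <NA>

# Case-insensitive search for the letter 'a'; missing values count as False
has_a = words.str.contains('a', case=False, na=False)

print(lengths)
print(has_a)
```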

## Sorting

There are three types of sorting in Pandas:  
1. Sorting by index labels. 
2. Sorting by column values. 
3. Sorting by a combination of both.  

### By index

The **Series.sort_index()** and **DataFrame.sort_index()** methods are used to sort a Pandas object by its index levels.

In [93]:
df = pd.DataFrame({
    'one': pd.Series(np.random.randn(3), index = ['a', 'b', 'c']),
    'two': pd.Series(np.random.randn(4), index = ['a', 'b', 'c', 'd']),
    'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
                        columns=['three', 'two', 'one'])

unsorted_df

|   | three | two | one |
|---|---|---|---|
| a | NaN | -0.831441 | 0.727979 |
| d | -1.311436 | -0.891394 | NaN |
| c | -0.605537 | 0.338304 | 0.368009 |
| b | 0.412593 | 0.566571 | -1.455312 |


In [94]:
# Sort DataFrame by index
unsorted_df.sort_index()

|   | three | two | one |
|---|---|---|---|
| a | NaN | -0.831441 | 0.727979 |
| b | 0.412593 | 0.566571 | -1.455312 |
| c | -0.605537 | 0.338304 | 0.368009 |
| d | -1.311436 | -0.891394 | NaN |


In [97]:
unsorted_df.sort_index(ascending = False)

|   | three | two | one |
|---|---|---|---|
| d | -1.311436 | -0.891394 | NaN |
| c | -0.605537 | 0.338304 | 0.368009 |
| b | 0.412593 | 0.566571 | -1.455312 |
| a | NaN | -0.831441 | 0.727979 |


In [100]:
# Sort DataFrame by column names
unsorted_df.sort_index(axis=1)

|   | one | three | two |
|---|---|---|---|
| a | 0.727979 | NaN | -0.831441 |
| d | NaN | -1.311436 | -0.891394 |
| c | 0.368009 | -0.605537 | 0.338304 |
| b | -1.455312 | 0.412593 | 0.566571 |


In [101]:
# Sort Series by index
unsorted_df['three'].sort_index()

a         NaN
b    0.412593
c   -0.605537
d   -1.311436
Name: three, dtype: float64

### By values

The **Series.sort_values()** method is used to sort a Series by its values.  
The **DataFrame.sort_values()** method is used to sort a DataFrame by its column or row values.

In [103]:
df1 = pd.DataFrame({'one': [2, 1, 1, 1],
                   'two': [1, 3, 2, 4],
                   'three': [5, 4, 3, 2]})

# Sort DataFrame by column "two"
df1.sort_values(by='two')

|   | one | two | three |
|---|---|---|---|
| 0 | 2 | 1 | 5 |
| 2 | 1 | 2 | 3 |
| 1 | 1 | 3 | 4 |
| 3 | 1 | 4 | 2 |


In [105]:
# Sort DataFrame by columns "one" and "two"
df1[['one', 'two', 'three']].sort_values(by=['one', 'two'])

|   | one | two | three |
|---|---|---|---|
| 2 | 1 | 2 | 3 |
| 1 | 1 | 3 | 4 |
| 3 | 1 | 4 | 2 |
| 0 | 2 | 1 | 5 |
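
When sorting by several columns, **ascending** can also be a list with one flag per column. A small sketch that rebuilds the same data under the made-up name `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'one': [2, 1, 1, 1],
                      'two': [1, 3, 2, 4],
                      'three': [5, 4, 3, 2]})

# 'one' ascending, ties broken by 'two' descending
mixed = frame.sort_values(by=['one', 'two'], ascending=[True, False])
print(mixed)
```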


These methods treat NA values specially: the **na_position** argument controls whether they are placed last (the default) or first:

In [106]:
s[2] = np.nan
s.sort_values()

0        A
3    Adaba
1        B
4     Baca
6     CABA
8      cat
7      dog
2     <NA>
5     <NA>
dtype: string

In [107]:
s.sort_values(na_position='first')

2     <NA>
5     <NA>
0        A
3    Adaba
1        B
4     Baca
6     CABA
8      cat
7      dog
dtype: string
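
Note that strings are sorted by character code, so uppercase letters come before lowercase ones. One possible way to sort without regard to case is the **key** argument of **sort_values()** (available since pandas 1.1), sketched here on made-up data:

```python
import pandas as pd

names = pd.Series(['apple', 'Banana', 'cherry'])  # illustrative data

default_order = names.sort_values()  # 'Banana' first: 'B' sorts before 'a'

# key receives the whole Series and must return a Series of the same length
folded_order = names.sort_values(key=lambda col: col.str.lower())

print(default_order.tolist())
print(folded_order.tolist())
```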

#### The *by* parameter of the *sort_values()* method can refer to either column names or index level names.

We can use the name of the index to sort by both an index and a column.

In [112]:
# Build MultiIndex

idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
                                  ('b', 2), ('b', 1), ('b', 1)])

idx.names = ['first', 'second']

# Build DataFrame

df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
                       index=idx)

df_multi

| first | second | A |
|---|---|---|
| a | 1 | 6 |
| a | 2 | 5 |
| a | 2 | 4 |
| b | 2 | 3 |
| b | 1 | 2 |
| b | 1 | 1 |


In [113]:
# Sort DataFrame by 'second' (index) and 'A' (column)

df_multi.sort_values(by=['second', 'A'])

| first | second | A |
|---|---|---|
| b | 1 | 1 |
| b | 1 | 2 |
| a | 1 | 6 |
| b | 2 | 3 |
| a | 2 | 4 |
| a | 2 | 5 |
