# Pandas Basics

In [1]:
import numpy as np
import pandas as pd

Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge whose behavior is vectorized:

In [2]:
df = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

In [3]:
df2 = df.copy()

In [4]:
df.gt(df2)

|   | one | two | three |
|---|---|---|---|
| a | False | False | False |
| b | False | False | False |
| c | False | False | False |
| d | False | False | False |


In [5]:
df2.ne(df)

|   | one | two | three |
|---|---|---|---|
| a | False | False | True |
| b | False | False | False |
| c | False | False | False |
| d | True | False | False |
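The random frames above make the results hard to check by hand. The following is a minimal sketch with small hand-written frames (the names `left` and `right` and their values are illustrative assumptions, not part of the notebook). It shows that the comparison methods mirror the operators and that NaN never compares equal to anything, which is why **ne()** reports True where both frames hold NaN.

```python
import numpy as np
import pandas as pd

# Small hand-written frames (illustrative data, not from the notebook above)
left = pd.DataFrame({'x': [1.0, 2.0, np.nan], 'y': [4.0, 5.0, 6.0]})
right = pd.DataFrame({'x': [1.0, 3.0, np.nan], 'y': [0.0, 5.0, 9.0]})

gt_result = left.gt(right)   # same as left > right
ne_result = left.ne(right)   # same as left != right

# NaN is never equal to anything, itself included,
# so ne() is True in the last row of column 'x'
print(gt_result)
print(ne_result)
```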


We can apply the reductions **empty**, **any()**, **all()** and **bool()** to summarize a boolean result:

In [6]:
(df > 0).all()

one      False
two      False
three    False
dtype: bool

In [7]:
(df > 0).any()

one      True
two      True
three    True
dtype: bool

In [8]:
(df > 0).any().any()

True
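As a deterministic illustration of these reductions, here is a short sketch on a hand-written frame (the `scores` data is an assumption made for the example). It also shows **empty**, which is a property rather than a method.

```python
import pandas as pd

scores = pd.DataFrame({'a': [1, 2, 3], 'b': [-1, 2, 3], 'c': [-1, -2, -3]})
positive = scores > 0

all_per_column = positive.all()       # True only if every value in the column is True
any_per_column = positive.any()       # True if at least one value in the column is True
any_anywhere = positive.any().any()   # reduce twice to get a single truth value

# .empty is a property: True when the object has no rows (or no columns)
no_rows = scores[scores['a'] > 10].empty
```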

To evaluate a single-element pandas object as a boolean, use the method **bool()**:

In [9]:
pd.Series([True]).bool()

True

In [10]:
pd.Series([False]).bool()

False

In [11]:
pd.DataFrame([[True]]).bool()

True

In [12]:
pd.DataFrame([[False]]).bool()

False
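Newer pandas releases deprecate **bool()** (since version 2.1, as far as we know), so the cells above may emit a warning or fail depending on the installed version. A hedged alternative, assuming only the long-standing `Series.item()` method and positional indexing, could look like this:

```python
import pandas as pd

single = pd.Series([True])
flag_from_series = bool(single.item())        # item() returns the only element

one_cell = pd.DataFrame([[False]])
flag_from_frame = bool(one_cell.iloc[0, 0])   # positional access to the single cell

# item() refuses anything that does not hold exactly one element
try:
    pd.Series([True, False]).item()
    raised = False
except ValueError:
    raised = True
```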

## Object comparison
You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:

In [13]:
pd.Series(['foo', 'bar', 'baz']) == 'foo'

0     True
1    False
2    False
dtype: bool

Pandas also handles element-wise comparisons between different array-like objects of the same length:

In [14]:
pd.Series(['foo','bar','baz']) == pd.Index(['foo', 'bar', 'qux'])

0     True
1     True
2    False
dtype: bool
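The "same length" requirement matters in practice. The sketch below (with an illustrative `words` Series) compares against a NumPy array of matching length, then against a shorter one, which raises a ValueError instead of silently broadcasting.

```python
import numpy as np
import pandas as pd

words = pd.Series(['foo', 'bar', 'baz'])

# Same length: element-wise comparison
same_length = words == np.array(['foo', 'bar', 'qux'])

# Different length: pandas refuses to compare
try:
    words == np.array(['foo', 'bar'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```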

Often you may find that there is more than one way to compute the same result. For example, consider **df + df** and **df * 2**. To test that these two computations produce the same result, given the tools shown above, you might imagine using **(df + df == df * 2).all().all()**:

In [15]:
(df + df == df * 2).all().all()

False

The result is False. Let's dig a bit deeper:

In [16]:
(df + df == df * 2).all()

one      False
two       True
three    False
dtype: bool

In [17]:
df + df == df * 2

|   | one | two | three |
|---|---|---|---|
| a | True | True | False |
| b | True | True | True |
| c | True | True | True |
| d | False | True | True |


This happens because NaNs do not compare as equal: np.nan == np.nan returns False. For this reason, pandas objects (such as **Series** and **DataFrame**) have an **equals()** method for testing equality, with NaNs in corresponding locations treated as equal:

In [18]:
(df+df).equals(df * 2)

True
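The same contrast can be reproduced on a tiny frame with one known NaN. This is a minimal sketch (the `frame` data is illustrative). It also shows that **equals()** compares dtypes as well as values, so an integer Series and a float Series holding the same numbers are not considered equal.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'one': [1.0, np.nan], 'two': [3.0, 4.0]})

# Element-wise: the NaN cell compares as False, so the overall check fails
elementwise = (frame + frame == frame * 2).all().all()

# equals(): NaNs in matching positions count as equal
nan_aware = (frame + frame).equals(frame * 2)

# equals() also requires matching dtypes (int64 vs float64 here)
dtype_sensitive = pd.Series([1, 2]).equals(pd.Series([1.0, 2.0]))
```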

## Descriptive Statistics
There exists a large number of methods for computing descriptive statistics and other related operations on Series and DataFrame. All of them are vectorized. Most of them are aggregations and produce a lower-dimensional result. These generally take an axis argument, which can be specified with an integer (0 for one result per column, 1 for one result per row):

In [19]:
#Aggregation for each column
df.mean(0)

one     -0.194999
two     -0.810660
three    0.151186
dtype: float64

In [20]:
#Aggregation for each row
df.mean(1)

a   -0.376202
b   -1.002115
c    0.455200
d   -0.490465
dtype: float64
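Because the frame above is random, here is a small sketch with values that can be checked by hand (the `grid` data is an assumption for illustration). It also shows the `skipna` parameter: by default NaN values are skipped, and `skipna=False` lets them propagate.

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [10.0, np.nan, 30.0]},
                    index=['a', 'b', 'c'])

column_means = grid.mean(axis=0)                # one value per column (axis='index')
row_means = grid.mean(axis=1)                   # one value per row (axis='columns')
strict_means = grid.mean(axis=0, skipna=False)  # NaN propagates instead of being skipped
```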

By applying vectorized operations, we can describe various statistical procedures, like **standardization** (rendering data zero mean and standard deviation 1), very concisely:

In [21]:
ts_stand = (df - df.mean())/df.std()

In [22]:
ts_stand.std()

one      1.0
two      1.0
three    1.0
dtype: float64
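A quick way to convince yourself that the standardization works is to check both the mean and the standard deviation. The sketch below uses made-up `raw` data, and adds a row-wise variant that needs **sub()** and **div()** with an explicit axis so the per-row statistics align on the index. Note that **std()** uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

raw = pd.DataFrame({'height': [150.0, 160.0, 170.0, 180.0],
                    'weight': [50.0, 65.0, 70.0, 95.0]})

# Column-wise: subtract each column's mean, divide by its sample standard deviation
standardized = (raw - raw.mean()) / raw.std()

# Row-wise: align the per-row statistics on the index with axis=0
row_standardized = raw.sub(raw.mean(axis=1), axis=0).div(raw.std(axis=1), axis=0)
```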

## Describe
The **describe()** method computes a variety of summary statistics about a **Series** or the columns of a **DataFrame** (NaN values are excluded):

In [23]:
series = pd.Series(np.random.randn(1000))

In [24]:
series[::2] = np.nan

In [25]:
series.describe()

count    500.000000
mean      -0.056410
std        0.985669
min       -3.241225
25%       -0.739566
50%       -0.076814
75%        0.609163
max        2.782368
dtype: float64

In [26]:
frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a','b','c','d','e'])

In [27]:
frame.iloc[::2] = np.nan

In [28]:
frame.describe()

|   | a | b | c | d | e |
|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 0.04999 | 0.07101 | -0.007234 | 0.027557 | -0.039284 |
| std | 1.001009 | 0.956795 | 0.959434 | 0.949941 | 1.013563 |
| min | -2.747828 | -2.496801 | -2.659755 | -2.662026 | -2.974584 |
| 25% | -0.673492 | -0.568158 | -0.65967 | -0.66062 | -0.708518 |
| 50% | -0.006255 | 0.063941 | -0.019203 | 0.028811 | -0.013005 |
| 75% | 0.719684 | 0.703167 | 0.664296 | 0.705656 | 0.671771 |
| max | 3.054749 | 3.417706 | 2.908934 | 2.860135 | 2.813484 |


For non-numerical Series objects, **describe()** will give a simple summary of the number of unique values and the most frequently occurring values:

In [29]:
s = pd.Series(['a','a','b','b','a','a', np.nan,'c','d','a'])

In [30]:
s.describe()

count     9
unique    4
top       a
freq      5
dtype: object
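**describe()** can be tuned with a few parameters. The following sketch, on illustrative data, requests custom percentiles and uses `include='all'` to summarize numeric and non-numeric columns together. By default, a mixed DataFrame is summarized using its numeric columns only.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])

# Ask for the 5th and 95th percentiles; count ignores the NaN
summary = values.describe(percentiles=[0.05, 0.95])

mixed = pd.DataFrame({'n': [1, 2, 3], 's': ['a', 'b', 'a']})
default_summary = mixed.describe()           # numeric columns only
full_summary = mixed.describe(include='all')  # numeric and non-numeric statistics
```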

## Index of Min/Max Values
The idxmin() and idxmax() methods on Series and DataFrame return the index labels of the minimum and maximum values:

In [31]:
s1 = pd.Series(np.random.randn(5))

In [32]:
s1

0   -0.830181
1    0.837229
2    2.238835
3    1.605637
4    0.667196
dtype: float64

In [33]:
s1.idxmin(), s1.idxmax()

(0, 2)

In [34]:
df1 = pd.DataFrame(np.random.randn(5, 3), columns=['A','B','C'])

In [35]:
df1

|   | A | B | C |
|---|---|---|---|
| 0 | -0.47187 | -0.197041 | -2.158075 |
| 1 | -0.963045 | -0.683484 | 1.798157 |
| 2 | -1.290032 | -1.446335 | 0.885046 |
| 3 | 1.261889 | -1.198736 | 0.236263 |
| 4 | 0.527494 | 0.915877 | -1.4146 |


In [36]:
df1.idxmin(axis=0)

A    2
B    2
C    0
dtype: int64

In [37]:
df1.idxmax(axis=1)

0    B
1    C
2    C
3    A
4    B
dtype: object
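When several values tie for the minimum or maximum, idxmin() and idxmax() return the label of the first occurrence. A small sketch with a labelled index (the `prices` data is made up for the example):

```python
import pandas as pd

prices = pd.Series([3.0, 1.0, 1.0, 5.0, 5.0],
                   index=['mon', 'tue', 'wed', 'thu', 'fri'])

cheapest_day = prices.idxmin()   # 'tue': first of the two tied minima
priciest_day = prices.idxmax()   # 'thu': first of the two tied maxima
```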

## Iterations
The behaviour of basic iteration over pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. DataFrames follow the dict-like convention of iterating over the keys of the object.

Basic iteration (for i in object) produces:

* Series: values
* DataFrame: column labels

In [38]:
df = pd.DataFrame({'col1': np.random.randn(3),
                      'col2': np.random.rand(3)}, index=['a','b','c'])

In [39]:
for col in df:
    print(col)

col1
col2
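Beyond basic iteration, pandas offers methods for walking over columns or rows. The sketch below (with an illustrative `table` frame) contrasts plain iteration with **items()**, which yields (label, column) pairs, and **itertuples()**, which yields one namedtuple per row with the index first.

```python
import pandas as pd

table = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 1.5]}, index=['a', 'b'])

series_values = [v for v in table['col1']]    # Series: values
column_labels = [c for c in table]            # DataFrame: column labels

# items(): (column label, column as Series) pairs
column_pairs = [(label, col.tolist()) for label, col in table.items()]

# itertuples(): one namedtuple per row, index value first
row_tuples = [tuple(row) for row in table.itertuples()]
```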


## Pandas Viewing

In [40]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

## Instruction
Create a DataFrame by passing a NumPy array, with a datetime index and labelled columns. The index is built first:

In [41]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
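With the index in place, the DataFrame itself can be built from a NumPy array. The following is a sketch of that step, mirroring the construction used in the "Pandas Accessing" section; the seeded generator is an assumption added here so the numbers are reproducible.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)

# Seeded generator so the example is reproducible (the seed is an arbitrary choice)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list('ABCD'))
print(df)
```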

## Instruction
Create a DataFrame from a dict of objects

In [42]:
df2 = pd.DataFrame({'A': 1.0,
                       'B': pd.Timestamp('20130102'),
                       'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                       'D': np.array([3] * 4, dtype='int32'),
                       'E': pd.Categorical(["test", "train", "test", "train"]),
                       'F': 'foo'})
df2

|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |


The columns of the resulting DataFrame have different dtypes:

In [43]:
df2.dtypes


A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Tab completion for column names and public attributes is enabled in Jupyter. Just type "df." and hit tab to see the list.

## Pandas Accessing

In [45]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [46]:
df2 = pd.DataFrame({'A': 1.,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})

## Getting
Below is a small cheat-sheet on how to get values out of a DataFrame:

| Operation | Syntax | Result |
|---|---|---|
| Select column | df[col] | Series |
| Select row by label | df.loc[label] | Series |
| Select row by integer location | df.iloc[loc] | Series |
| Slice rows | df[5:10] | DataFrame |
| Select rows by boolean vector | df[bool_vec] | DataFrame |
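Each entry of the cheat-sheet can be tried on a tiny frame. This is a minimal sketch with hand-written data (the frame and its labels are illustrative), with the result type of each operation noted in the comments.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]},
                  index=['w', 'x', 'y', 'z'])

column = df['A']              # select column             -> Series
row_by_label = df.loc['x']    # select row by label       -> Series
row_by_pos = df.iloc[1]       # select row by position    -> Series
row_slice = df[1:3]           # slice rows                -> DataFrame
masked = df[df['A'] > 2]      # select by boolean vector  -> DataFrame
```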

In [47]:
#Select a Column
df['A']

2013-01-01   -0.299155
2013-01-02   -1.647502
2013-01-03   -1.703519
2013-01-04   -1.177224
2013-01-05   -0.336356
2013-01-06    1.283956
Freq: D, Name: A, dtype: float64

In [48]:
#This is equivalent to 
df.A

2013-01-01   -0.299155
2013-01-02   -1.647502
2013-01-03   -1.703519
2013-01-04   -1.177224
2013-01-05   -0.336356
2013-01-06    1.283956
Freq: D, Name: A, dtype: float64

The first of the two options above is recommended, as it avoids conflicts with DataFrame methods and attributes.
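The conflict is easy to reproduce. In this sketch, a column is deliberately named `count` (an illustrative choice), which collides with the existing `DataFrame.count` method: bracket access returns the column, while attribute access returns the method.

```python
import pandas as pd

inventory = pd.DataFrame({'count': [3, 5], 'item': ['nut', 'bolt']})

by_brackets = inventory['count']   # the column, as a Series
by_attribute = inventory.count     # the DataFrame.count method, not the column
```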

In [49]:
#selecting via [] which slices the rows
df[0:3]

|   | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | -0.299155 | -0.138388 | -0.380035 | 0.530224 |
| 2013-01-02 | -1.647502 | 3.206346 | -0.222915 | 1.019781 |
| 2013-01-03 | -1.703519 | 0.518502 | -1.142597 | 0.033134 |


In [50]:
df['20130102':'20130104']

|   | A | B | C | D |
|---|---|---|---|---|
| 2013-01-02 | -1.647502 | 3.206346 | -0.222915 | 1.019781 |
| 2013-01-03 | -1.703519 | 0.518502 | -1.142597 | 0.033134 |
| 2013-01-04 | -1.177224 | -1.16292 | -0.745135 | 0.559658 |


## Selection by Label

In [51]:
#select first row based on its index value
df.loc['2013-01-01']

A   -0.299155
B   -0.138388
C   -0.380035
D    0.530224
Name: 2013-01-01 00:00:00, dtype: float64

In [52]:
#select more than one column by their column names
df.loc[:, ['A', 'B']]

|   | A | B |
|---|---|---|
| 2013-01-01 | -0.299155 | -0.138388 |
| 2013-01-02 | -1.647502 | 3.206346 |
| 2013-01-03 | -1.703519 | 0.518502 |
| 2013-01-04 | -1.177224 | -1.16292 |
| 2013-01-05 | -0.336356 | 0.397195 |
| 2013-01-06 | 1.283956 | 1.426828 |


In [53]:
#we can use label slicing and include both endpoints
df.loc['20130102' : '20130104', ['A','B']]

|   | A | B |
|---|---|---|
| 2013-01-02 | -1.647502 | 3.206346 |
| 2013-01-03 | -1.703519 | 0.518502 |
| 2013-01-04 | -1.177224 | -1.16292 |


The command above returns a DataFrame; the one below, which passes a single row label and a single column label, returns a scalar value:

In [54]:
df.loc[dates[0], 'A']

-0.29915456049484884
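What **loc** returns depends on what is passed for rows and columns. The sketch below uses a small frame with known values (an illustrative assumption) to show the three cases: a DataFrame, a Series, and a scalar.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=4)
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0], 'B': [5.0, 6.0, 7.0, 8.0]},
                  index=dates)

block = df.loc['20130102':'20130103', ['A', 'B']]  # row slice + column list  -> DataFrame
row = df.loc[dates[0]]                             # single row label         -> Series
cell = df.loc[dates[0], 'A']                       # row label + column label -> scalar
```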

## Select by Position
We can also select by integer position with **iloc**:

In [55]:
df.iloc[3]

A   -1.177224
B   -1.162920
C   -0.745135
D    0.559658
Name: 2013-01-04 00:00:00, dtype: float64

In [56]:
#We can use slicing as well. This is similar to NumPy/Python
df.iloc[3:5, 0:2]

|   | A | B |
|---|---|---|
| 2013-01-04 | -1.177224 | -1.16292 |
| 2013-01-05 | -0.336356 | 0.397195 |


In [57]:
#slice rows explicitly; ':' selects all columns
df.iloc[1:3, :]

|   | A | B | C | D |
|---|---|---|---|---|
| 2013-01-02 | -1.647502 | 3.206346 | -0.222915 | 1.019781 |
| 2013-01-03 | -1.703519 | 0.518502 | -1.142597 | 0.033134 |
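One difference between the two indexers is worth spelling out: **iloc** slices exclude the end position, as in NumPy/Python, while **loc** slices include the end label. A short sketch on an illustrative frame, which also shows selection with explicit lists of positions:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [5, 6, 7, 8, 9]},
                  index=list('vwxyz'))

by_position = df.iloc[1:3]      # positions 1 and 2; position 3 is excluded
by_label = df.loc['w':'y']      # labels 'w', 'x' and 'y'; the end label is included
picked = df.iloc[[0, 4], [1]]   # explicit lists of row and column positions
```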


## Select by dtype
The **select_dtypes()** method implements subsetting of columns based on their dtype; that is, it returns only the columns whose dtype matches the selection.

In [58]:
df = pd.DataFrame({'string': list('abc'),
                       'int64': list(range(1, 4)),
                       'uint8': np.arange(3, 6).astype('u1'),
                       'float64': np.arange(4.0, 7.0),
                       'bool1': [True, False, True],
                       'bool2': [False, True, False],
                       'dates': pd.date_range('now', periods=3),
                       'category': pd.Series(list("ABC")).astype('category')})

In [59]:
#select only bool columns from df above
df.select_dtypes(include=[bool])

|   | bool1 | bool2 |
|---|---|---|
| 0 | True | False |
| 1 | False | True |
| 2 | True | False |
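**select_dtypes()** also accepts an `exclude` argument, and the two can be combined. The following sketch uses a small `mixed` frame (illustrative data) and the generic type names 'number', 'bool' and 'unsignedinteger'.

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({'name': list('abc'),
                      'qty': [1, 2, 3],
                      'code': np.array([7, 8, 9], dtype='uint8'),
                      'price': [4.0, 5.0, 6.0],
                      'in_stock': [True, False, True]})

numeric = mixed.select_dtypes(include=['number'])       # integers and floats
non_numeric = mixed.select_dtypes(exclude=['number'])   # everything else

# Numeric and boolean columns, minus the unsigned integers
combined = mixed.select_dtypes(include=['number', 'bool'],
                               exclude=['unsignedinteger'])
```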


## Boolean Indexing
In this section, we will use column values to filter data.

In [60]:
#take rows where column 'float64' is at least 5
df[df['float64'] >= 5]

|   | string | int64 | uint8 | float64 | bool1 | bool2 | dates | category |
|---|---|---|---|---|---|---|---|---|
| 1 | b | 2 | 4 | 5.0 | False | True | 2022-09-28 21:27:54.630809 | B |
| 2 | c | 3 | 5 | 6.0 | True | False | 2022-09-29 21:27:54.630809 | C |
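Several conditions can be combined with `&` (and), `|` (or) and `~` (not). A minimal sketch on an illustrative `orders` frame follows. Each condition needs its own parentheses because these operators bind more tightly than the comparisons.

```python
import pandas as pd

orders = pd.DataFrame({'qty': [1, 5, 8, 3], 'price': [9.5, 2.0, 4.0, 7.5]})

# Rows satisfying both conditions
both = orders[(orders['qty'] >= 3) & (orders['price'] < 5)]

# Rows satisfying at least one condition
either = orders[(orders['qty'] >= 8) | (orders['price'] > 9)]

# Rows satisfying neither condition
neither = orders[~((orders['qty'] >= 8) | (orders['price'] > 9))]
```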


We can also use the **isin()** method for filtering:

In [61]:
#create a copy of df and store in df2
df2 = df.copy()
df2['E'] = ['one', 'two', 'three']

In [62]:
#now we can use isin() to keep only rows where
#E is one or two
df2[df2['E'].isin(['one', 'two'])]

|   | string | int64 | uint8 | float64 | bool1 | bool2 | dates | category | E |
|---|---|---|---|---|---|---|---|---|---|
| 0 | a | 1 | 3 | 4.0 | True | False | 2022-09-27 21:27:54.630809 | A | one |
| 1 | b | 2 | 4 | 5.0 | False | True | 2022-09-28 21:27:54.630809 | B | two |
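The mask produced by **isin()** can be negated with `~` to keep the rows that are not in the list. A short sketch, assuming a reduced frame with just the column of interest:

```python
import pandas as pd

labels = pd.DataFrame({'E': ['one', 'two', 'three'], 'n': [1, 2, 3]})

kept = labels[labels['E'].isin(['one', 'two'])]       # rows whose E is in the list
dropped = labels[~labels['E'].isin(['one', 'two'])]   # rows whose E is not in the list
```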


In [63]:
#We can also set values in the DataFrame.
#Setting Values by position
df.iat[0,1] = -1

In [65]:
#or
df.iloc[0,1] = 2

In [66]:
#Setting values by label
df.at[0, 'float64'] = -10

In [67]:
#or
df.loc[0, 'float64'] = -20

In [68]:
#Setting by assigning with a NumPy array:
df.loc[:, 'uint8'] = np.array([50] * len(df))
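The setters above can be checked on a frame with known values. The following sketch (illustrative data) applies one assignment of each kind and adds a common extra pattern: using a boolean mask with **loc** to overwrite only the matching rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [10, 20, 30]},
                  index=['x', 'y', 'z'])

df.iat[0, 1] = 11               # by position: row 0, column 1 ('B')
df.at['y', 'A'] = 2.5           # by label: row 'y', column 'A'
df.loc[df['A'] > 2, 'B'] = 0    # boolean mask selects the rows to overwrite
df['C'] = np.array([7, 8, 9])   # NumPy array of matching length becomes a new column
print(df)
```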