Pandas is a package for working with tabular data, time series, and the statistical data sets commonly used in AI. It offers tools for handling missing data and for manipulating data in general. It also provides utilities for reading and writing data in a variety of formats, such as CSV and JSON files.
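
As a minimal sketch of the reading and writing utilities, the snippet below round-trips a small table through CSV and JSON text. The sample data and column names are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical sample data, invented for this illustration.
csv_text = "day,num_apples\n2019-12-01,10\n2019-12-02,17\n"

# read_csv accepts a path or a file-like object; StringIO avoids touching disk.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["day"])
print(df.dtypes)

# Write the same table back out as CSV text and as JSON text.
csv_out = df.to_csv(index=False)
json_out = df.to_json(orient="records", date_format="iso")
print(csv_out)
print(json_out)
```

Passing a file name instead of the buffer reads from, or writes to, a file in the same way.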

Pandas has two main data structures:

<i>Series</i> <br>
A Series is a one-dimensional labelled array that can hold data of any type. The axis labels used to align the data are collectively called the index.
<pre>s = pd.Series(data, index=index)</pre>

Consider an example in which we record the number of apples sold on each of several days. Such a Series can be created as follows:


Pandas is frequently used for processing tabular data with several columns. The best way to explore the package is to work through examples.

In [1]:
import pandas as pd
import numpy as np

from datetime import date
fromdate = date.fromisoformat('2019-12-01')
# five consecutive days starting at fromdate
datelist = pd.date_range(fromdate, periods=5).tolist()
# random sales figures between 10 and 19, indexed by date
apples_sold = pd.Series(np.random.randint(10, 20, 5), name='num_apples', index=datelist)
print(apples_sold)
print(apples_sold['2019-12-01'])  # look up a value by its date label
print(datelist[apples_sold.argmax()])  # date with the highest sales

2019-12-01    10
2019-12-02    17
2019-12-03    19
2019-12-04    15
2019-12-05    11
Name: num_apples, dtype: int64
10
2019-12-03 00:00:00


If no index is passed, Pandas creates a default integer index with the values [0, …, len(data)-1]. For instance:


In [2]:
import pandas as pd
import numpy as np
print(pd.Series(np.random.randn(5), name='something'))



0   -0.870717
1   -0.703888
2    0.765159
3    1.393113
4    0.902752
Name: something, dtype: float64


<p>
    <i>DataFrame</i> <br>
A DataFrame is a two-dimensional labelled data structure with columns of potentially different types. It is the most commonly used Pandas object. <br>
A very simple example of creating a DataFrame is as follows:


In [3]:
import pandas as pd
import numpy as np
dates = pd.date_range('20200101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)
print(df.max())
print(df.nlargest(2,columns=list('ABCD'),keep="first"))
print(df['A'].nlargest(3))

                   A         B         C         D
2020-01-01  0.437154  1.518961 -0.236314 -1.268175
2020-01-02  2.549036 -0.296776 -0.980191  0.250784
2020-01-03  0.010786  0.318170 -2.426359  0.261615
2020-01-04  0.185179  1.288606  1.047357  1.184396
2020-01-05  1.242413  0.036349  1.002235 -0.014581
2020-01-06  2.196676  3.320251 -0.577551  0.895845
A    2.549036
B    3.320251
C    1.047357
D    1.184396
dtype: float64
                   A         B         C         D
2020-01-02  2.549036 -0.296776 -0.980191  0.250784
2020-01-06  2.196676  3.320251 -0.577551  0.895845
2020-01-02    2.549036
2020-01-06    2.196676
2020-01-05    1.242413
Name: A, dtype: float64
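
The example above uses only floating-point columns. As a small sketch of columns with different types, a DataFrame can also be built from a dictionary that maps column names to lists. The column names and values here are invented for illustration.

```python
import pandas as pd

# Hypothetical fruit-stand data: one text, one integer and one float column.
df = pd.DataFrame({
    "fruit": ["apples", "pears", "plums"],
    "sold": [10, 17, 19],
    "price": [0.5, 0.75, 1.2],
})

print(df)
print(df.dtypes)  # each column keeps its own dtype
```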


More about Series and DataFrame can be found [here](https://pandas.pydata.org/docs/user_guide/dsintro.html).

In Pandas we can use many of the indexing and selection operations that we saw earlier in the case of NumPy.
More details about indexing and selection are available [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
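
As a brief sketch of the most common selection styles, the snippet below uses a small DataFrame with made-up values to show position-based access with `iloc`, label-based access with `loc`, and boolean masks.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=4)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=dates, columns=["A", "B"])

first_row = df.iloc[0]                      # by integer position
by_label = df.loc[dates[1], "B"]            # by row label and column label
sliced = df.loc["2020-01-02":"2020-01-03"]  # label slices include both end points
big_a = df[df["A"] > 2]                     # boolean mask, as in NumPy

print(first_row)
print(by_label)
print(sliced)
print(big_a)
```

Unlike NumPy slices, a label slice with `loc` includes its end point, which is why `sliced` contains two rows.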

Another useful aspect of Pandas is automatic label alignment when combining Series or DataFrames, which is discussed at the following [link](https://pandas.pydata.org/docs/user_guide/dsintro.html).
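
A minimal sketch of label alignment, using two invented Series whose indexes only partly overlap: arithmetic matches values by label rather than by position, and labels present on only one side produce NaN.

```python
import pandas as pd

# Hypothetical sales on two days, with only "pears" present on both.
mon = pd.Series([10, 12], index=["apples", "pears"])
tue = pd.Series([3, 4], index=["pears", "plums"])

total = mon + tue                    # result covers the union of the labels
filled = mon.add(tue, fill_value=0)  # treat a missing label as 0 instead

print(total)
print(filled)
```

Here only `pears` has a value on both days, so it is the only non-NaN entry in `total`, while `filled` keeps all three labels.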

### Missing Data
One very useful feature of Pandas is its ability to handle missing data.
It is common for some values to be missing from a particular source.
Missing values can be represented either with a separate mask or with a special sentinel value.
Pandas takes the latter approach, using <code>NaN</code> to mark missing entries.
It also provides specific functions for handling such missing data.

In [4]:

import pandas as pd
import numpy as np
dates = pd.date_range('20200101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df.loc[dates[1], 'A'] = np.nan  # use .loc rather than chained indexing
df.loc[dates[2], 'B'] = np.nan
df.loc[dates[-1], 'D'] = np.nan
print(df)
print(df.sum())  # NaN values are skipped by default in reductions such as sum
df['E'] = df['A'] + df['B']  # adding NaN to a value yields NaN
df['F'] = df.sum(axis=1)  # row sums, again skipping NaN
print(df)
idx = df.isna()  # boolean mask marking the NaN entries
df[idx] = 0  # set the NaN entries to 0
print(df)


                   A         B         C         D
2020-01-01 -0.245505  0.884681 -0.313470 -0.144505
2020-01-02       NaN -0.764650  0.326988 -0.290811
2020-01-03  0.783698       NaN -0.098859  0.808964
2020-01-04  0.084491  0.322840  0.201969  2.294212
2020-01-05 -1.288393 -0.210254  0.527050 -0.526116
2020-01-06 -1.183135 -0.156867  0.801218       NaN
A   -1.848844
B    0.075750
C    1.444896
D    2.141744
dtype: float64
                   A         B         C         D         E         F
2020-01-01 -0.245505  0.884681 -0.313470 -0.144505  0.639176  0.820378
2020-01-02       NaN -0.764650  0.326988 -0.290811       NaN -0.728473
2020-01-03  0.783698       NaN -0.098859  0.808964       NaN  1.493803
2020-01-04  0.084491  0.322840  0.201969  2.294212  0.407331  3.310843
2020-01-05 -1.288393 -0.210254  0.527050 -0.526116 -1.498647 -2.996361
2020-01-06 -1.183135 -0.156867  0.801218       NaN -1.340002 -1.878785
                   A         B         C         D         E         F
2020

These are only a few examples of working with missing data. More examples are available [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).
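
As a short sketch of the dedicated missing-data functions, the snippet below counts, drops and fills NaN values in a tiny DataFrame with made-up values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

missing_per_column = df.isna().sum()  # number of NaN entries in each column
complete_rows = df.dropna()           # keep only rows without any NaN
zero_filled = df.fillna(0)            # replace NaN with a constant
mean_filled = df.fillna(df.mean())    # replace NaN with each column's mean

print(missing_per_column)
print(complete_rows)
print(zero_filled)
print(mean_filled)
```

All of these calls return new objects and leave `df` unchanged. Which strategy is appropriate depends on the data and on the analysis being performed.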

### Analysing the data

Pandas provides functions such as describe() that produce summary statistics for a DataFrame: the count, mean, standard deviation, minimum, maximum and selected percentiles of each numeric column.
The 50% row is the median of the data, 25% is the first quartile and 75% is the third quartile.

Similarly, the head() and tail() functions show the first and last few rows of the DataFrame, respectively.
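
As a quick check of how these rows are defined, the sketch below uses a tiny hand-made column so that the statistics are easy to verify by eye. The column name `x` is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})

summary = df.describe()
print(summary)

# The 50% row matches the median of the column.
print(summary.loc["50%", "x"], df["x"].median())

# head(n) and tail(n) take the number of rows to show (5 by default).
print(df.head(2))
print(df.tail(2))
```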

In [5]:
import pandas as pd
import numpy as np
dates = pd.date_range('20200101', periods=31)
df = pd.DataFrame(np.random.randn(31, 4), index=dates, columns=list('ABCD'))

print(df.describe())

print(df.head())

print(df.tail())

               A          B          C          D
count  31.000000  31.000000  31.000000  31.000000
mean    0.117389  -0.297599   0.176704  -0.006584
std     0.992239   1.092160   1.054826   1.224911
min    -1.734029  -2.564078  -3.068576  -2.989822
25%    -0.616580  -1.064195  -0.137303  -0.848671
50%     0.173176  -0.223259   0.267710   0.226967
75%     0.939607   0.365130   0.835097   0.807021
max     2.227107   2.363722   1.731194   2.584281
                   A         B         C         D
2020-01-01  0.648576  0.440383  1.276468 -0.416838
2020-01-02 -0.565172 -0.647034  0.558732  0.824649
2020-01-03  0.235010 -2.165494 -0.796766  2.584281
2020-01-04 -0.667671 -1.045061  1.012008  0.752111
2020-01-05 -0.962975 -0.537011  0.273614  0.809464
                   A         B         C         D
2020-01-27 -0.295788  0.679190  0.016278  0.226967
2020-01-28  0.312559  0.181296  0.254826  0.636783
2020-01-29  2.227107  0.264490 -0.211485 -0.767031
2020-01-30  0.277679 -0.576570  1.190712

### Further references

We have seen a few examples of working with Pandas. More examples can be found in the following [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html).

