# 🐼 Pandas Basics

Pandas is a popular Python library for data analysis. It is built on top of NumPy, which supplies its fast array operations, and it integrates with Matplotlib for plotting. Its central object is the DataFrame, a labeled two-dimensional table that resembles a spreadsheet and supports many operations familiar from SQL tables, such as filtering, sorting, and aggregation.

Let's get started with some basic operations in pandas.

## 1. Installing and Importing Pandas

If you haven't installed pandas yet, you can do so using `pip` or `poetry`:

```bash
pip install pandas   # with pip
poetry add pandas    # or, in a Poetry-managed project
```

Once installed, you can import pandas as:

In [1]:
import pandas as pd

## 2. DataFrame and Series

A DataFrame is a two-dimensional table (like an Excel spreadsheet) with labeled rows and columns. A Series is a one-dimensional labeled array; each column of a DataFrame is a Series.

In [2]:
import numpy as np

# Create a Series
s = pd.Series([1, 2, 3, np.nan, 5, 6])
s

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64
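
As a minimal sketch of how labels work on a Series, the example below builds one with a custom index and a name. The labels, values, and the variable name `prices` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"], name="price")

# Label-based and position-based access reach the same element
by_label = prices["banana"]
by_position = prices.iloc[1]

# to_frame() turns the Series into a one-column DataFrame named after the Series
frame = prices.to_frame()
print(frame)
```

The Series name becomes the column label, which is one way to see that a DataFrame column and a Series are the same kind of object.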

In [3]:
# set the random seed for reproducibility
np.random.seed(42)

# Create a DataFrame by passing a numpy array, with a datetime index and labeled columns
dates = pd.date_range('20230101', periods=6)
data = np.random.randn(6, 4)
df = pd.DataFrame(data, index=dates, columns=list('ABCD'))
print(df)

                   A         B         C         D
2023-01-01  0.496714 -0.138264  0.647689  1.523030
2023-01-02 -0.234153 -0.234137  1.579213  0.767435
2023-01-03 -0.469474  0.542560 -0.463418 -0.465730
2023-01-04  0.241962 -1.913280 -1.724918 -0.562288
2023-01-05 -1.012831  0.314247 -0.908024 -1.412304
2023-01-06  1.465649 -0.225776  0.067528 -1.424748
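
A DataFrame can also be built from a dictionary of columns, which is often more convenient than a NumPy array when the columns hold different types. The sketch below assumes a hypothetical `inventory` table; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical records; each dict key becomes a column label
inventory = pd.DataFrame(
    {
        "item": ["apple", "banana", "cherry"],
        "quantity": [10, 25, 7],
        "price": [1.5, 0.25, 4.0],
    }
)

# Each column keeps its own dtype
print(inventory.dtypes)
```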


## 3. Viewing Data

You can view the first and last rows of a DataFrame with the `head()` and `tail()` methods, which return five rows by default:

In [4]:
# View top rows
df.head()

                   A         B         C         D
2023-01-01  0.496714 -0.138264  0.647689  1.523030
2023-01-02 -0.234153 -0.234137  1.579213  0.767435
2023-01-03 -0.469474  0.542560 -0.463418 -0.465730
2023-01-04  0.241962 -1.913280 -1.724918 -0.562288
2023-01-05 -1.012831  0.314247 -0.908024 -1.412304


In [5]:
# View bottom rows
df.tail()

                   A         B         C         D
2023-01-02 -0.234153 -0.234137  1.579213  0.767435
2023-01-03 -0.469474  0.542560 -0.463418 -0.465730
2023-01-04  0.241962 -1.913280 -1.724918 -0.562288
2023-01-05 -1.012831  0.314247 -0.908024 -1.412304
2023-01-06  1.465649 -0.225776  0.067528 -1.424748
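
Both methods accept an optional row count. The sketch below rebuilds the same frame (same seed, dates, and column labels as the earlier cell) so that it runs on its own, then asks for a different number of rows.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the earlier cell so this example is self-contained
np.random.seed(42)
df = pd.DataFrame(
    np.random.randn(6, 4),
    index=pd.date_range("20230101", periods=6),
    columns=list("ABCD"),
)

first_two = df.head(2)   # first two rows
last_three = df.tail(3)  # last three rows
print(first_two)
print(last_three)
```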


You can also display the index, the columns, and the underlying NumPy data:

In [6]:
# Display index, columns, and the underlying numpy data
print(df.index, "\n")
print(df.columns, "\n")
print(df.values, "\n")

DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06'],
              dtype='datetime64[ns]', freq='D') 

Index(['A', 'B', 'C', 'D'], dtype='object') 

[[ 0.49671415 -0.1382643   0.64768854  1.52302986]
 [-0.23415337 -0.23413696  1.57921282  0.76743473]
 [-0.46947439  0.54256004 -0.46341769 -0.46572975]
 [ 0.24196227 -1.91328024 -1.72491783 -0.56228753]
 [-1.01283112  0.31424733 -0.90802408 -1.4123037 ]
 [ 1.46564877 -0.2257763   0.0675282  -1.42474819]] 
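
The pandas documentation recommends `DataFrame.to_numpy()` over the `.values` attribute for getting the underlying array. A minimal sketch, rebuilding the same frame so the block runs on its own:

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the earlier cell so this example is self-contained
np.random.seed(42)
df = pd.DataFrame(
    np.random.randn(6, 4),
    index=pd.date_range("20230101", periods=6),
    columns=list("ABCD"),
)

# to_numpy() returns the data as a plain NumPy array, without index or column labels
arr = df.to_numpy()
print(type(arr), arr.shape, arr.dtype)
```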



## 4. Statistics

A quick statistical summary of your data can be shown using `describe()`:

In [7]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.081311 -0.275775 -0.133655 -0.262434
std    0.861950  0.862828  1.168366  1.187681
min   -1.012831 -1.913280 -1.724918 -1.424748
25%   -0.410644 -0.232047 -0.796872 -1.199800
50%    0.003904 -0.182020 -0.197945 -0.514009
75%    0.433026  0.201119  0.502648  0.459144
max    1.465649  0.542560  1.579213  1.523030
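
When only some of these statistics are needed, they can be computed directly. The sketch below, which rebuilds the same frame so it runs on its own, uses `mean()` for a single statistic and `agg()` for a chosen subset.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the earlier cell so this example is self-contained
np.random.seed(42)
df = pd.DataFrame(
    np.random.randn(6, 4),
    index=pd.date_range("20230101", periods=6),
    columns=list("ABCD"),
)

col_means = df.mean()              # one mean per column, returned as a Series
summary = df.agg(["mean", "std"])  # only the requested statistics, one row each
print(col_means)
print(summary)
```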


## 5. Sorting

You can sort your data by the values in a particular column:

In [8]:
df.sort_values(by='B')

                   A         B         C         D
2023-01-04  0.241962 -1.913280 -1.724918 -0.562288
2023-01-02 -0.234153 -0.234137  1.579213  0.767435
2023-01-06  1.465649 -0.225776  0.067528 -1.424748
2023-01-01  0.496714 -0.138264  0.647689  1.523030
2023-01-05 -1.012831  0.314247 -0.908024 -1.412304
2023-01-03 -0.469474  0.542560 -0.463418 -0.465730
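
`sort_values()` also supports descending order and sorting by several columns at once. The sketch below uses a small hypothetical `scores` table (names and values invented for illustration) so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical data, small enough to verify the ordering by hand
scores = pd.DataFrame({"team": ["b", "a", "b", "a"], "points": [3, 7, 9, 1]})

# Highest points first
by_points_desc = scores.sort_values(by="points", ascending=False)

# Team in ascending order, then points in descending order within each team
by_team_then_points = scores.sort_values(by=["team", "points"], ascending=[True, False])

print(by_points_desc)
print(by_team_then_points)
```

Both calls return a new DataFrame and leave `scores` in its original order.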


You can also sort by the index or column names:

In [9]:
# Sorting by index
df.sort_index(axis=0, ascending=False)

                   A         B         C         D
2023-01-06  1.465649 -0.225776  0.067528 -1.424748
2023-01-05 -1.012831  0.314247 -0.908024 -1.412304
2023-01-04  0.241962 -1.913280 -1.724918 -0.562288
2023-01-03 -0.469474  0.542560 -0.463418 -0.465730
2023-01-02 -0.234153 -0.234137  1.579213  0.767435
2023-01-01  0.496714 -0.138264  0.647689  1.523030


In [10]:
# Sorting by column names
df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2023-01-01  1.523030  0.647689 -0.138264  0.496714
2023-01-02  0.767435  1.579213 -0.234137 -0.234153
2023-01-03 -0.465730 -0.463418  0.542560 -0.469474
2023-01-04 -0.562288 -1.724918 -1.913280  0.241962
2023-01-05 -1.412304 -0.908024  0.314247 -1.012831
2023-01-06 -1.424748  0.067528 -0.225776  1.465649


## 6. Selection

You can select a single column by its label, which returns a Series:

In [11]:
df['A']

2023-01-01    0.496714
2023-01-02   -0.234153
2023-01-03   -0.469474
2023-01-04    0.241962
2023-01-05   -1.012831
2023-01-06    1.465649
Freq: D, Name: A, dtype: float64

Passing a slice to `[]` selects rows by position instead:

In [12]:
df[0:3]

                   A         B         C         D
2023-01-01  0.496714 -0.138264  0.647689  1.523030
2023-01-02 -0.234153 -0.234137  1.579213  0.767435
2023-01-03 -0.469474  0.542560 -0.463418 -0.465730


For selection by label, you can use `loc`:

In [13]:
df.loc[dates[0]]

A    0.496714
B   -0.138264
C    0.647689
D    1.523030
Name: 2023-01-01 00:00:00, dtype: float64

For selection by position, you can use `iloc`:

In [14]:
df.iloc[3]

A    0.241962
B   -1.913280
C   -1.724918
D   -0.562288
Name: 2023-01-04 00:00:00, dtype: float64
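
`loc` and `iloc` also accept slices and lists, so rows and columns can be selected together. The sketch below rebuilds the same frame so it runs on its own. It shows the main difference between the two (label slices include both endpoints, position slices exclude the stop) and adds a boolean mask as a third way to select rows.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the earlier cell so this example is self-contained
np.random.seed(42)
dates = pd.date_range("20230101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# loc slices by label and includes both endpoints
label_block = df.loc["2023-01-02":"2023-01-04", ["A", "B"]]

# iloc slices by position and excludes the stop position
position_block = df.iloc[1:4, 0:2]

# A boolean mask keeps only the rows where the condition holds
positive_a = df[df["A"] > 0]

print(label_block)
print(positive_a)
```

Here `label_block` and `position_block` contain the same three rows and two columns, reached by label in one case and by position in the other.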

## 7. Data Cleaning

To check for missing data, you can use `isna()` or `notna()`. The example DataFrame has no missing values, so `isna()` returns `False` everywhere:

In [15]:
df.isna()

                A      B      C      D
2023-01-01  False  False  False  False
2023-01-02  False  False  False  False
2023-01-03  False  False  False  False
2023-01-04  False  False  False  False
2023-01-05  False  False  False  False
2023-01-06  False  False  False  False


In [16]:
df.notna()

               A     B     C     D
2023-01-01  True  True  True  True
2023-01-02  True  True  True  True
2023-01-03  True  True  True  True
2023-01-04  True  True  True  True
2023-01-05  True  True  True  True
2023-01-06  True  True  True  True


To drop any rows that have missing data, you can use `dropna()`:

In [17]:
df.dropna(how='any')  # how='any' drops a row if any value is NaN; how='all' drops it only if every value is NaN

                   A         B         C         D
2023-01-01  0.496714 -0.138264  0.647689  1.523030
2023-01-02 -0.234153 -0.234137  1.579213  0.767435
2023-01-03 -0.469474  0.542560 -0.463418 -0.465730
2023-01-04  0.241962 -1.913280 -1.724918 -0.562288
2023-01-05 -1.012831  0.314247 -0.908024 -1.412304
2023-01-06  1.465649 -0.225776  0.067528 -1.424748


To fill missing data with a specific value, you can use `fillna()`:

In [18]:
df.fillna(value=5)

                   A         B         C         D
2023-01-01  0.496714 -0.138264  0.647689  1.523030
2023-01-02 -0.234153 -0.234137  1.579213  0.767435
2023-01-03 -0.469474  0.542560 -0.463418 -0.465730
2023-01-04  0.241962 -1.913280 -1.724918 -0.562288
2023-01-05 -1.012831  0.314247 -0.908024 -1.412304
2023-01-06  1.465649 -0.225776  0.067528 -1.424748
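
Because the example DataFrame has no missing values, the outputs above are unchanged copies of it. The sketch below uses a small hypothetical frame named `gaps` (values invented for illustration) that does contain NaNs, so the effect of each method is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps: row 1 is entirely NaN, row 0 is partly NaN
gaps = pd.DataFrame(
    {
        "A": [1.0, np.nan, 3.0],
        "B": [np.nan, np.nan, 6.0],
    }
)

missing_per_column = gaps.isna().sum()  # count of NaNs in each column
any_dropped = gaps.dropna(how="any")    # keeps only rows with no NaN at all
all_dropped = gaps.dropna(how="all")    # drops only rows that are entirely NaN
filled = gaps.fillna(value=5)           # replaces every NaN with 5

print(missing_per_column)
print(any_dropped)
print(all_dropped)
print(filled)
```

None of these calls modifies `gaps`; each returns a new object.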


## 8. Applying Functions

You can apply a function to each column of the data with `apply()`:

In [19]:
df.apply(np.cumsum)  # running total down each column: each row holds the sum of that row and all rows above it

                   A         B         C         D
2023-01-01  0.496714 -0.138264  0.647689  1.523030
2023-01-02  0.262561 -0.372401  2.226901  2.290465
2023-01-03 -0.206914  0.170159  1.763484  1.824735
2023-01-04  0.035049 -1.743121  0.038566  1.262447
2023-01-05 -0.977782 -1.428874 -0.869458 -0.149856
2023-01-06  0.487866 -1.654650 -0.801930 -1.574605
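
For common operations like this one, pandas also provides a built-in method. The sketch below uses a small hypothetical `totals` frame (values invented so the sums are easy to check) to show that `cumsum()` gives the same result as `apply(np.cumsum)`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with easy-to-check running totals
totals = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

via_apply = totals.apply(np.cumsum)  # NumPy function applied column by column
via_method = totals.cumsum()         # built-in equivalent

print(via_method)
```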


Or apply a lambda function:

In [20]:
df.apply(lambda x: x.max() - x.min())  # apply passes each column to the lambda as a Series; this returns max minus min per column

A    2.478480
B    2.455840
C    3.304131
D    2.947778
dtype: float64
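
By default `apply()` works column by column. Passing `axis=1` makes it work row by row instead. A minimal sketch on a hypothetical `grid` frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical data, small enough to verify by hand
grid = pd.DataFrame({"x": [1, 5, 2], "y": [4, 3, 8]})

# Default (axis=0): the lambda receives each column, giving one value per column
col_range = grid.apply(lambda col: col.max() - col.min())

# axis=1: the lambda receives each row, giving one value per row
row_range = grid.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)
print(row_range)
```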

This concludes our brief tutorial on the basics of pandas. Pandas offers many more features that you can explore as your data manipulation and analysis needs grow!