# DataFrames
* Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
* A Series is like a column; a DataFrame is the whole table.

<p style="font-weight: bold;">A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table with rows and columns.</p>

In [68]:
import pandas as pd
import numpy as np

In [69]:
from numpy.random import randn

In [70]:
# Seed the random number generator so the same random numbers are produced on every run.
np.random.seed(101)
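
The effect of seeding can be checked directly. The sketch below is only an illustration; the variable names (`first`, `second`, `third`) and the second seed value `102` are made up for this example.

```python
import numpy as np
from numpy.random import randn

np.random.seed(101)
first = randn(3)

np.random.seed(101)   # reseed with the same value
second = randn(3)

np.random.seed(102)   # a different seed gives a different sequence
third = randn(3)

print(np.array_equal(first, second))  # same seed, same numbers
print(np.array_equal(first, third))   # different seed, different numbers
```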

In [71]:
df = pd.DataFrame(data=randn(5,4), index=['A', 'B', 'C', 'D', 'E'], columns=['W', 'X', 'Y', 'Z'])

In [72]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


### Breakdown of the `pd.DataFrame(...)` call above:
1. The `data` parameter takes a 2D array.
2. The `index` parameter (optional) supplies the row labels.
3. The `columns` parameter (optional) supplies the column names.
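
As a small illustration of these three parameters, the sketch below builds a table from a hand-written 2D list instead of random numbers. The names `scores` and `table` and the values are made up for this example.

```python
import pandas as pd

# A 2D list: 3 rows, 2 columns (the values are arbitrary).
scores = [[1, 2], [3, 4], [5, 6]]

# Explicit row labels (index) and column names (columns).
table = pd.DataFrame(data=scores, index=['A', 'B', 'C'], columns=['W', 'X'])

print(table.shape)          # (number of rows, number of columns)
print(list(table.index))    # row labels
print(list(table.columns))  # column names
```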

#### If we do not pass any value to the `index` and `columns` parameters, integers starting from `0` are used as the default labels.

In [73]:
pd.DataFrame(randn(5, 4))

          0         1         2         3
0  0.302665  1.693723 -1.706086 -1.159119
1 -0.134841  0.390528  0.166905  0.184502
2  0.807706  0.072960  0.638787  0.329646
3 -0.497104 -0.754070 -0.943406  0.484752
4 -0.116773  1.901755  0.238127  1.996652
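
The default labels are ordinary integers, so they can be used for selection like any other label. A minimal sketch, assuming a small hand-written table (`unlabeled` is a made-up name):

```python
import pandas as pd

# No index or columns passed: both default to 0, 1, 2, ...
unlabeled = pd.DataFrame([[10, 20], [30, 40]])

print(list(unlabeled.index))    # default row labels
print(list(unlabeled.columns))  # default column labels

# The integer 1 is the label of the second column here.
print(list(unlabeled[1]))
```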


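* The `randn(d0, d1, ...)` function takes the size of each dimension, so `randn(rows, columns)` returns a 2D array.
* With a single argument, e.g. `randn(5)`, it returns a 1D array of 5 values, not a 2D array with one column.
* `pd.DataFrame` turns such a 1D array into a table with a single column.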

pd.DataFrame(randn(5))
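
The shapes involved can be inspected to see what `randn` returns and what `pd.DataFrame` makes of it. This is a sketch; the variable names are made up for illustration.

```python
import pandas as pd
from numpy.random import randn

one_d = randn(5)      # a single argument gives a 1D array
two_d = randn(5, 4)   # two arguments give a 2D array

print(one_d.shape)    # (5,)
print(two_d.shape)    # (5, 4)

# pd.DataFrame stores the 1D array as one column labelled 0.
single_column = pd.DataFrame(one_d)
print(single_column.shape)           # (5, 1)
print(list(single_column.columns))   # [0]
```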

<p style="color: green;font-size: 20px; font-weight: bold;">Note</p>
    By default, index and column numbering starts from zero.

## Selection and Indexing


### Accessing Columns through their name

In [74]:
# Accessing a Single column.
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [75]:
# Using dot(.) notation
df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

<p style="font-size: 18px; color: green;">Each Column in a DataFrame is a Series.</p>

In [76]:
type(df['W'])

pandas.core.series.Series

In [77]:
# Accessing a list of columns.
df[['W', 'X']]

          W         X
A  2.706850  0.628133
B  0.651118 -0.319318
C -2.018168  0.740122
D  0.188695 -0.758872
E  0.190794  1.978757


<p style="color: green;font-size: 20px; font-weight: bold;">Note</p>
    Avoid dot(.) notation for accessing columns. It breaks when a column name contains spaces or matches a DataFrame attribute or method name, and it cannot create a new column. Prefer `bracket([])` notation.
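
The pitfalls of dot notation are easiest to see on a table whose column names are awkward on purpose. The sketch below uses a made-up `sales` table; the column names `count` and `unit price` are chosen only to trigger the problems.

```python
import warnings
import pandas as pd

sales = pd.DataFrame({'count': [3, 5], 'unit price': [2.5, 4.0]})

# 'count' is also the name of a DataFrame method, so dot notation
# returns that method instead of the column.
print(callable(sales.count))   # True: this is the method
print(list(sales['count']))    # [3, 5]: this is the column

# A name containing a space can only be reached with brackets.
print(list(sales['unit price']))

# Assigning through dot notation sets a plain attribute (pandas may
# emit a warning about it); no column is created.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    sales.total = [7.5, 20.0]
print('total' in sales.columns)  # False
```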

## Creating a new Column

In [78]:
df['new'] = df['W'] + df['Y']

In [79]:
df

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762
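
Bracket assignment also accepts a single value, and `assign()` offers a variant that leaves the original table untouched. A minimal sketch on a made-up `prices` table (the `flag` and `ratio` column names are assumptions for this example):

```python
import pandas as pd

prices = pd.DataFrame({'W': [1.0, 2.0], 'Y': [10.0, 20.0]}, index=['A', 'B'])

# Column arithmetic works row by row, matching on the index labels.
prices['new'] = prices['W'] + prices['Y']

# A single value is repeated for every row.
prices['flag'] = 0

# assign() returns a new DataFrame and does not modify `prices`.
with_ratio = prices.assign(ratio=prices['Y'] / prices['W'])

print(list(prices.columns))
print(list(with_ratio.columns))
```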


## Removing a Column

We can use the `drop()` method to remove columns.

<p style="font-size: 18px;">Parameters:</p>

1. `labels`: the name of the column to delete, as a string (or a list of names to delete several).
2. `axis`: the default value is `0`, which means rows. To delete a column, pass `axis=1`.
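
`drop()` also accepts a `columns` keyword, which avoids having to remember which axis number means columns. The sketch below compares the two spellings on a made-up `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [5, 6]}, index=['A', 'B'])

by_axis = frame.drop(labels='new', axis=1)   # labels + axis form
by_keyword = frame.drop(columns='new')       # equivalent keyword form
several = frame.drop(columns=['X', 'new'])   # a list drops several columns

print(by_axis.equals(by_keyword))
print(list(several.columns))
```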

In [80]:
df.drop(labels="new", axis=1)

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [81]:
df

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762


As you can see, `drop()` does not change the original table; it returns a new DataFrame with the column removed. Pandas does this to avoid accidental deletion.<br>To change the original table, we have to pass one more parameter: `inplace=True`.

In [82]:
df.drop("new", axis=1, inplace=True)

In [83]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
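
One detail worth knowing about `inplace=True` is that the call then returns `None`, so its result should not be assigned back. A common alternative is to reassign the returned table. Both are sketched below on made-up tables (`frame`, `other`):

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2], 'new': [5, 6]})

# With inplace=True the table is modified and the call returns None.
result = frame.drop('new', axis=1, inplace=True)
print(result)                # None
print(list(frame.columns))   # ['W']

# Alternative: keep the default and reassign the returned DataFrame.
other = pd.DataFrame({'W': [1, 2], 'new': [5, 6]})
other = other.drop('new', axis=1)
print(list(other.columns))   # ['W']
```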


In [84]:
df.shape

(5, 4)

![Untitled-2024-02-28-2007.png](attachment:dbe10586-f7fd-4683-b978-0fdc4d119b29.png)

## Removing a Row

In [85]:
# "axis" has a default value of 0, so this drops a row.
df.drop('E', inplace=False)

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057


In [86]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
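
Row removal has a few variations: a list of labels, the `index` keyword, and the `errors` parameter for labels that may not exist. The sketch below uses a made-up `frame`, and `'Z'` is deliberately a label that is not in it.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])

fewer = frame.drop(['A', 'C'])      # a list drops several rows
without_b = frame.drop(index='B')   # keyword form, no axis needed

# Dropping a label that does not exist raises a KeyError by default.
try:
    frame.drop('Z')
    missing_raised = False
except KeyError:
    missing_raised = True

# errors='ignore' skips missing labels instead of raising.
unchanged = frame.drop('Z', errors='ignore')

print(list(fewer.index), list(without_b.index), missing_raised)
```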


## Selecting Rows

### 1. Using `loc`
* Selects rows by their `labels`. Note that `loc` is used with square brackets, not parentheses.

In [87]:
df.loc['C']

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

In [88]:
type(df.loc['C'])

pandas.core.series.Series

<p style="font-size: 18px; color: green;">This means every <code>row</code> is also a <code>Series</code>.</p>

### 2. Using `iloc`
* Selects rows by their integer `positions` (starting from zero). Like `loc`, it is used with square brackets.

In [89]:
df.iloc[2]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64
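
A difference between the two that is easy to miss shows up when slicing: a label slice includes its end label, while a position slice excludes its end position, as in regular Python. A minimal sketch on a made-up `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2, 3, 4]}, index=['A', 'B', 'C', 'D'])

by_label = frame.loc['A':'C']    # label slice: 'C' is included
by_position = frame.iloc[0:2]    # position slice: position 2 is excluded

print(list(by_label.index))      # ['A', 'B', 'C']
print(list(by_position.index))   # ['A', 'B']

# The same row reached by label and by position.
print(frame.loc['C'].equals(frame.iloc[2]))  # True
```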

## Accessing a Subset of the Table

In [91]:
# Accessing a single element from the table.
df.loc['B', 'Y']

-0.8480769834036315

In [92]:
# Accessing a sub-table from the table
df.loc[['B', 'D'], ['W', 'Y']]

          W         Y
B  0.651118 -0.848077
D  0.188695 -0.933237


In [94]:
# It is not necessary to write the row or column names in their original order.
df.loc[['D', 'B'], ['Y', 'W']]

          Y         W
D -0.933237  0.188695
B -0.848077  0.651118


In [95]:
# The first list always contains row names and the second list always contains column names. We can't mix them.
# Otherwise you will get a KeyError.
df.loc[['B', 'Y'], ['D', 'W']]

KeyError: "['Y'] not in index"
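
When the labels come from user input or another data source, the error can be caught rather than left to stop the program. This is a sketch on a made-up two-row `frame`, not the table used above:

```python
import pandas as pd

frame = pd.DataFrame([[1, 2], [3, 4]], index=['B', 'D'], columns=['W', 'Y'])

# Rows and columns mixed up: 'Y' is not a row label, 'D' is not a column.
try:
    frame.loc[['B', 'Y'], ['D', 'W']]
    raised = False
except KeyError as error:
    raised = True
    print('KeyError:', error)

# Rows first, columns second.
subset = frame.loc[['B', 'D'], ['Y', 'W']]
print(subset)
```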