# Pandas 

Time to learn about one of the most widely used `data analysis/data processing` tools in the python world.

- [LinkedIn](https://www.linkedin.com/in/pro-programmer/)
- [YouTube](http://www.youtube.com/@itvaya)
- [GitHub](https://github.com/RishatTalukder/Machine-Learning-Zero-to-Hero)
- [Gmail](mailto:talukderrishat2@gmail.com)
- [discord](https://discord.gg/ZB495XggcF)

# Installation 

If you have `conda` installed, just run the following command in your terminal:

```bash
conda install pandas
```

Or if you don't have `conda` installed, just run `pip install pandas` in your terminal.

`Pandas` is a `python` library for `data analysis`. Some say it's python's version of `excel`. Well, they are not wrong, but I would say it's more powerful than `excel` and has a lot of features that are way easier to use.

To get a grasp of `pandas` you need to know `two` crucial concepts:

1. `Series`
2. `DataFrame`

Also, `pandas` is built on top of `numpy`. So, you need a general understanding of `numpy` to understand `pandas` better.

> In my [github repo](https://github.com/RishatTalukder/Machine-Learning-Zero-to-Hero) I have a section dedicated to `numpy` if you want to learn more about it.

So, now let's talk about `pandas` series.

# Series

A `Series` is a `one-dimensional` array-like object. It is similar to a `numpy` array.

I'll make 2 variables:

In [1]:
import numpy as np

data = np.array([5,2,4,2])
labels = ['a','b','c','d']

I made a simple `numpy` array called `data` and another list variable called `labels`.

Now, I use the `data` variable to create a `pandas series`.

In [2]:
import pandas as pd

series = pd.Series(data)
series

0    5
1    2
2    4
3    2
dtype: int64

And would you look at that!

I used `pandas.Series` to create a `Series` from the `data` array, and it looks like a table with `indexes`. Each index corresponds to a value of the array.

This is what a `Series` is.

Now, what we can do is use these `indexes` to get a value from the `Series`.

In [3]:
series[0]

np.int64(5)

See? So, a series is a pandas representation of a `numpy` array. Right?

Looks like that.

But even though it looks like that, it's more than that.

In a normal `numpy` array, the indexes are fixed integer positions that always start from `0`. But in a series we can set custom indexes.

That's why I took another python list called `labels`, and now I can set the indices of the series to the values of `labels`.

In [4]:
series2 = pd.Series(data, index=labels)
series2

a    5
b    2
c    4
d    2
dtype: int64

And now we have a pandas series with a custom index. Hmm.... why does this sound familiar? Did we do something like this in python?

Yes, python has a data structure where we can do custom indexing, and that is the `dictionary` (a `hashmap`). So, we can say that instead of being the pandas version of a `numpy` array, a series is more like a `pandas` version of a `python` `dictionary`.

> A `dictionary` is a `hashmap` in `python`. A hashmap is a data structure that maps `keys` to `values`. Keys are like custom indexes and values are the data stored under those keys.
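
To make the dictionary comparison concrete, here is a minimal sketch. The variable name `labelled` is just something I picked for illustration; the values and labels are the same ones used above.

```python
import pandas as pd

# Same values and labels as the series built earlier
labelled = pd.Series([5, 2, 4, 2], index=['a', 'b', 'c', 'd'])

# Dictionary-style lookup by label
value_b = labelled['b']

# The `in` operator checks the labels (like dictionary keys), not the values
has_c = 'c' in labelled
has_z = 'z' in labelled

# Unlike a dictionary, a series also supports numpy-style arithmetic
doubled = labelled * 2

print(value_b, has_c, has_z)
print(doubled.to_dict())
```

So a series gives us the label lookup of a dictionary and the element-wise maths of a `numpy` array in one object.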

We can also set the index of a `pandas` series that doesn't have a custom index. Like the `series` variable in the previous code block.

In [5]:
series.index = labels #setting the index
series

a    5
b    2
c    4
d    2
dtype: int64

> A pandas series gives us attributes like `index`, and we can assign values to them. In the code block above, I assigned the `labels` variable as the index of the `series`.

There is another way we can make a pandas `series`.

In [6]:
data = {'a': 1, 'b': 2, 'c': 3}

series3 = pd.Series(data)

series3

a    1
b    2
c    3
dtype: int64

That wasn't a shocker, was it? We can make a `pandas` series by directly passing a python `dictionary` to the `pandas.Series` constructor. The `keys` become the `index` and the `values` become the `values`.
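
As a small side sketch (not something we need right now), a dictionary and an explicit `index` can be combined. The name `partial` and the label `'z'` are made up for this example. Labels that have no matching key end up as missing values.

```python
import pandas as pd

data = {'a': 1, 'b': 2, 'c': 3}

# Only the labels listed in `index` are kept; 'z' has no matching key,
# so pandas fills that slot with NaN (and the dtype becomes float)
partial = pd.Series(data, index=['a', 'c', 'z'])
print(partial)
```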

In the long run we will not be using `series` that much on their own for data manipulation. But a series is the building block for the most important thing in `pandas`, which is the `dataframe`.

So, it is important that we understand `series` to have a better grasp of `pandas`.

# Dataframes

## Basics

Simply put, `DataFrames` are `tables` of data.

I think showing you would be more effective.

In [7]:
np.random.seed(1)

data = np.random.randn(6,5)
data

array([[ 1.62434536, -0.61175641, -0.52817175, -1.07296862,  0.86540763],
       [-2.3015387 ,  1.74481176, -0.7612069 ,  0.3190391 , -0.24937038],
       [ 1.46210794, -2.06014071, -0.3224172 , -0.38405435,  1.13376944],
       [-1.09989127, -0.17242821, -0.87785842,  0.04221375,  0.58281521],
       [-1.10061918,  1.14472371,  0.90159072,  0.50249434,  0.90085595],
       [-0.68372786, -0.12289023, -0.93576943, -0.26788808,  0.53035547]])

> `seed()` is a function that fixes the random number generator to a particular state so that the same random numbers are generated every time the code runs.
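
If you want to convince yourself of this, here is a tiny sketch. The names `first` and `second` are only for illustration.

```python
import numpy as np

np.random.seed(1)
first = np.random.randn(3)

np.random.seed(1)              # put the generator back in the same state
second = np.random.randn(3)

# Both draws come out identical because the state was the same each time
print(np.array_equal(first, second))
```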

I made a `2D` array called `data` with 6 rows and 5 columns.

Now, we can make a `pandas` dataframe from the `data` array.

`DataFrame()` works almost exactly like `Series()`. The only change is that it takes an extra argument. A table has rows and columns, so we pass the row labels as `index` and the column labels as `columns`.

In [8]:
df = pd.DataFrame(
    data=data,
    index=['row1', 'row2', 'row3', 'row4', 'row5', 'row6'],
    columns=['col1', 'col2', 'col3', 'col4', 'col5']
)
df

Unnamed: 0,col1,col2,col3,col4,col5
row1,1.624345,-0.611756,-0.528172,-1.072969,0.865408
row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937
row3,1.462108,-2.060141,-0.322417,-0.384054,1.133769
row4,-1.099891,-0.172428,-0.877858,0.042214,0.582815
row5,-1.100619,1.144724,0.901591,0.502494,0.900856
row6,-0.683728,-0.12289,-0.935769,-0.267888,0.530355


In one command we have a `pandas` dataframe with the specified `index` and `columns`.

And we can select any column if we want.

In [9]:
df['col1']

row1    1.624345
row2   -2.301539
row3    1.462108
row4   -1.099891
row5   -1.100619
row6   -0.683728
Name: col1, dtype: float64

In [10]:
df['col2']

row1   -0.611756
row2    1.744812
row3   -2.060141
row4   -0.172428
row5    1.144724
row6   -0.122890
Name: col2, dtype: float64

Have we seen this kind of output before?

If we check out the type of a single column, we might get a surprise.

In [11]:
type(df['col1'])

pandas.core.series.Series

Owh!

That's surprising.

It's a `pandas` series.

Does that mean every single column in a `pandas` dataframe is a `pandas` series?

`YES!`

Pandas `dataframe` is a collection of multiple `pandas` series.
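
One way to see this is to go in the other direction and build a dataframe out of series. This is only a sketch; `col_a`, `col_b` and `small_df` are hypothetical names I made up for it.

```python
import pandas as pd

rows = ['row1', 'row2', 'row3']

# Two series that share the same row labels
col_a = pd.Series([1.0, 2.0, 3.0], index=rows)
col_b = pd.Series([10.0, 20.0, 30.0], index=rows)

# A dictionary of series becomes a dataframe: keys are the column names
small_df = pd.DataFrame({'col_a': col_a, 'col_b': col_b})

# Pulling a column back out gives a series with the dataframe's index
is_series = isinstance(small_df['col_a'], pd.Series)
same_index = small_df['col_a'].index.equals(small_df.index)

print(small_df)
print(is_series, same_index)
```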

Let's look at the type of the whole `dataframe`.

In [12]:
type(df)

pandas.core.frame.DataFrame

We can clearly see that it is a `pandas` dataframe. But individual columns are `pandas` series.

And that means the row labels form a single `index` that is shared by all the columns.

It's an important detail that will come in handy later on.

Now, what about selecting multiple columns?

We can do that by passing a list inside the `square brackets` to the `pandas` dataframe.


In [13]:
df[['col2', 'col3', 'col3']]

Unnamed: 0,col2,col3,col3
row1,-0.611756,-0.528172,-0.528172
row2,1.744812,-0.761207,-0.761207
row3,-2.060141,-0.322417,-0.322417
row4,-0.172428,-0.877858,-0.877858
row5,1.144724,0.901591,0.901591
row6,-0.12289,-0.935769,-0.935769


It'll give us the data with the columns we need.

And we can also make new columns if we want.

In [14]:
new_column_data = np.random.randn(3)
new_column_data

array([-0.69166075, -0.39675353, -0.6871727 ])

I generated a `numpy` array called `new_column_data` with 3 elements. 

Now we can add these as a new column by:

In [15]:
df['new'] = new_column_data
df

ValueError: Length of values (3) does not match length of index (6)

ANNNND we have a big big error.

This error is saying that the length of the new column (3) does not match the number of rows in the dataframe (6).

> If you want to make a new column from an array or a list, you have to make sure it has the same number of rows as the dataframe.

As I made a column with only 3 random values, where the dataframe has `6` rows, we get this error.

So, we make a `column` with the same number of rows as the `dataframe` and then we add it to the `dataframe`.

In [16]:
new_column_data = np.random.randn(6)
df['new'] = new_column_data
df

Unnamed: 0,col1,col2,col3,col4,col5,new
row1,1.624345,-0.611756,-0.528172,-1.072969,0.865408,-0.845206
row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937,-0.671246
row3,1.462108,-2.060141,-0.322417,-0.384054,1.133769,-0.012665
row4,-1.099891,-0.172428,-0.877858,0.042214,0.582815,-1.11731
row5,-1.100619,1.144724,0.901591,0.502494,0.900856,0.234416
row6,-0.683728,-0.12289,-0.935769,-0.267888,0.530355,1.659802


And it is done.

We have a new column called `new` in the `dataframe`.
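
The length rule applies to plain arrays and lists. As a side sketch, if the new values are a `Series`, pandas matches them to the rows by label instead, and rows without a match get `NaN`. The names `frame` and `partial` below are hypothetical and the numbers are made up.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=['row1', 'row2', 'row3'],
    columns=['col1', 'col2'],
)

# Only two values, but each one carries a row label
partial = pd.Series([100, 300], index=['row1', 'row3'])

# Assignment aligns on the labels; 'row2' has no value, so it becomes NaN
frame['new'] = partial
print(frame)
```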

Now, to `remove` a column we can simply do:

In [17]:
df.drop(columns='new')

Unnamed: 0,col1,col2,col3,col4,col5
row1,1.624345,-0.611756,-0.528172,-1.072969,0.865408
row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937
row3,1.462108,-2.060141,-0.322417,-0.384054,1.133769
row4,-1.099891,-0.172428,-0.877858,0.042214,0.582815
row5,-1.100619,1.144724,0.901591,0.502494,0.900856
row6,-0.683728,-0.12289,-0.935769,-0.267888,0.530355


And done!

`The new column is gone! or is it?!`

In [18]:
df

Unnamed: 0,col1,col2,col3,col4,col5,new
row1,1.624345,-0.611756,-0.528172,-1.072969,0.865408,-0.845206
row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937,-0.671246
row3,1.462108,-2.060141,-0.322417,-0.384054,1.133769,-0.012665
row4,-1.099891,-0.172428,-0.877858,0.042214,0.582815,-1.11731
row5,-1.100619,1.144724,0.901591,0.502494,0.900856,0.234416
row6,-0.683728,-0.12289,-0.935769,-0.267888,0.530355,1.659802


Huh!

Why is the column still there even though we `removed` it in the previous command?

> The `drop()` method returns a new `pandas` dataframe that does not have the column we removed. It does not modify the original dataframe.

And that's why, to permanently remove a column, we can pass an extra argument, `inplace=True`, to the `drop()` method.

This tells the `drop()` method to modify the dataframe itself instead of returning a new one.

In [19]:
df.drop(
    columns='new',
    inplace=True
)
df

Unnamed: 0,col1,col2,col3,col4,col5
row1,1.624345,-0.611756,-0.528172,-1.072969,0.865408
row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937
row3,1.462108,-2.060141,-0.322417,-0.384054,1.133769
row4,-1.099891,-0.172428,-0.877858,0.042214,0.582815
row5,-1.100619,1.144724,0.901591,0.502494,0.900856
row6,-0.683728,-0.12289,-0.935769,-0.267888,0.530355


And the new column is `permanently` removed.
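
Another common way to keep the result, sketched here on a small made-up dataframe called `frame`, is to assign the returned dataframe back to the same name instead of using `inplace=True`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.zeros((2, 3)), columns=['col1', 'col2', 'new'])

# drop() returns a new dataframe; rebinding the name keeps that result
frame = frame.drop(columns='new')
print(list(frame.columns))
```

Both styles end with the column gone; which one you use is mostly a matter of taste.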

Now, you might ask can we do the same for a row?

Yes, we can.

It's exactly the same way.

But you need to know about the `axis` argument.

In pandas, the rows and columns are each given an `axis` for indexing and selection.

> `axis 0` is for the rows and `axis 1` is for the columns.

In the `drop()` method the axis is set to `0` by default, so a bare label is looked up among the rows. That is why we used the `columns=` keyword earlier to drop a column.

So we can pass just the name of the `row` and it'll remove the row.

In [20]:
df.drop('row1')

Unnamed: 0,col1,col2,col3,col4,col5
row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937
row3,1.462108,-2.060141,-0.322417,-0.384054,1.133769
row4,-1.099891,-0.172428,-0.877858,0.042214,0.582815
row5,-1.100619,1.144724,0.901591,0.502494,0.900856
row6,-0.683728,-0.12289,-0.935769,-0.267888,0.530355


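It works without `axis` because `drop()` looks in the row labels by default. It does not detect whether a label is a row or a column: `df.drop('col1')` would raise a `KeyError`.

But it is best practice to state the `axis` argument explicitly.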

In [21]:
df.drop(
    'row1',
    axis=0,
    inplace=True
)

df

Unnamed: 0,col1,col2,col3,col4,col5
row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937
row3,1.462108,-2.060141,-0.322417,-0.384054,1.133769
row4,-1.099891,-0.172428,-0.877858,0.042214,0.582815
row5,-1.100619,1.144724,0.901591,0.502494,0.900856
row6,-0.683728,-0.12289,-0.935769,-0.267888,0.530355
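
Here is a short sketch of how `axis` changes where `drop()` looks for the label. The dataframe `frame` and the result names are made up for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.ones((2, 2)),
    index=['row1', 'row2'],
    columns=['col1', 'col2'],
)

# axis=0 is the default, so the label is searched among the rows
without_row = frame.drop('row1')

# axis=1 searches the column labels (same as frame.drop(columns='col1'))
without_col = frame.drop('col1', axis=1)

# A column label with the default axis is not found among the rows
try:
    frame.drop('col1')
    raised = False
except KeyError:
    raised = True

print(list(without_row.index), list(without_col.columns), raised)
```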


Now, I told you before that `pandas` is built upon `numpy`: `numpy` does the heavy numerical work, and pandas adds labels and convenience methods on top of the arrays.

So, you can think of the `pandas` data structures as `numpy` arrays with index markers and some methods to make our life easier.


In [22]:
df.shape

(5, 5)

Now, as you can see, our `dataframe` has a `2d` shape of (5, 5), just like a `numpy` array, because we removed a row.

So, since it behaves like a numpy 2d array, we should be able to do indexing and selecting just like we do in numpy.

And we can.
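
If you want to look at the `numpy` side directly, `to_numpy()` hands back the values as a plain array. A minimal sketch on a small made-up dataframe called `frame`:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
frame = pd.DataFrame(np.random.randn(3, 2), columns=['col1', 'col2'])

# The values without the row and column labels
values = frame.to_numpy()

print(type(values), values.shape)

# Position [0, 1] in the array is the same number as position [0, 1] in the dataframe
print(values[0, 1] == frame.iloc[0, 1])
```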

Let's say I want all the values of `row 3`.

In [23]:
df.loc['row3']

col1    1.462108
col2   -2.060141
col3   -0.322417
col4   -0.384054
col5    1.133769
Name: row3, dtype: float64

> We have to use the `loc` (short for location) indexer to select a row by its label. Note the square brackets: `loc` is indexed, not called.

And with it we can do `indexing` much like we do in `numpy` arrays.

In the above code block I selected `row3` and it looks awfully like a `pandas` series. Is it a pandas series?

In [24]:
type(df.loc['row3'])

pandas.core.series.Series

`YES!` it is.

Every individual row and column in a `pandas` dataframe is a `pandas series`.

Now, what about selecting a `specific` value? Say I want to see the value at `row3` and `col4`.

Just like we did with numpy arrays, we can get a single value from a `pandas` dataframe.

In [25]:
df.loc['row3', 'col4']

np.float64(-0.38405435466841564)

And the fun part is we can select multiple rows and columns at the same time.

Just like we can select multiple columns by passing a `list` inside the `square brackets` of the `pandas` dataframe, we can pass lists to `loc`.

In [26]:
df.loc[['row3', 'row5'], ['col4', 'col5']]

Unnamed: 0,col4,col5
row3,-0.384054,1.133769
row5,0.502494,0.900856


It'll give us a `dataframe` with 2 rows and 2 columns.

One more thing: pandas also gives us a way to select rows by integer position, even though we set custom row labels. There is a special indexer for that.

> It is called `iloc`, short for `integer location`.

In [27]:
df.iloc[3]

col1   -1.100619
col2    1.144724
col3    0.901591
col4    0.502494
col5    0.900856
Name: row5, dtype: float64

Returns the row with index 3, which is `row5`.

In [28]:
df.iloc[3, 4]

np.float64(0.9008559492644118)

This returns the value of the row with index 3 and column with index 4. So, index 3 is `row5` and index 4 is `col5`.

In [29]:
df.loc['row5', 'col5']

np.float64(0.9008559492644118)

As you can see exactly the same thing.
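
Both indexers also accept slices, with one difference worth knowing: a label slice in `loc` includes both ends, while an integer slice in `iloc` stops before the end, like normal python slicing. A small sketch with a made-up dataframe called `frame`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    index=['row2', 'row3', 'row4', 'row5', 'row6'],
    columns=['col1', 'col2', 'col3', 'col4'],
)

# Label slices include both 'row3' and 'row5', both 'col2' and 'col3'
by_label = frame.loc['row3':'row5', 'col2':'col3']

# Integer slices exclude the stop position: rows 1, 2, 3 and columns 1, 2
by_position = frame.iloc[1:4, 1:3]

print(by_label.equals(by_position))
```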

There are a lot of different ways to select/index in `pandas`. But in the end you can picture a dataframe as a `2d numpy` array with custom index `markers`.

And to be honest we will rarely use pandas like this. Most of the time we will be using pandas for data `manipulation`, which is why you need to learn about `conditional selection` in `pandas`.

## Conditional Selection

In the `numpy` article I talked about `conditional selection` in `numpy` arrays. If you haven't read that article then please read it first.

`Conditional` selection in `pandas` and in `data analysis` is a very important concept.

For example, let's say I want to select all the elements that are greater than `0`.

In [30]:
df[df > 0]

Unnamed: 0,col1,col2,col3,col4,col5
row2,,1.744812,,0.319039,
row3,1.462108,,,,1.133769
row4,,,,0.042214,0.582815
row5,,1.144724,0.901591,0.502494,0.900856
row6,,,,,0.530355


Let me break down what just happened.

First,

In [31]:
df > 0

Unnamed: 0,col1,col2,col3,col4,col5
row2,False,True,False,True,False
row3,True,False,False,False,True
row4,False,False,False,True,True
row5,False,True,True,True,True
row6,False,False,False,False,True


I'm simply checking `df > 0`, and it returns a `dataframe` of `True` and `False` values.

Each boolean value tells us whether the value at the same position in the original dataframe meets the condition.

We can see that the value at `row2` and `col2` is `True`, which means the value in the original dataframe should be greater than 0.

In [32]:
df.loc['row2', 'col2']

np.float64(1.74481176421648)

And yes it is greater than 0.

Now, we can pass this `boolean` dataframe as an indexing argument to the `pandas` dataframe and get a filtered dataframe where only the values marked `True` are shown and the rest are set to `NaN`.

In [33]:
boolDF = df > 0
boolDF

Unnamed: 0,col1,col2,col3,col4,col5
row2,False,True,False,True,False
row3,True,False,False,False,True
row4,False,False,False,True,True
row5,False,True,True,True,True
row6,False,False,False,False,True


In [34]:
df[boolDF]

Unnamed: 0,col1,col2,col3,col4,col5
row2,,1.744812,,0.319039,
row3,1.462108,,,,1.133769
row4,,,,0.042214,0.582815
row5,,1.144724,0.901591,0.502494,0.900856
row6,,,,,0.530355


> It is good practice to break down the conditional selection into multiple steps for better understanding and readability. 
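
Indexing with a boolean dataframe does the same job as the `where()` method, which some people find more explicit. A minimal sketch, using a small made-up dataframe called `frame`:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
frame = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])

mask = frame > 0

# Positions where the mask is False become NaN
masked = frame[mask]

# where() keeps values where the mask is True and fills the rest with NaN
same = masked.equals(frame.where(mask))

# The number of True cells matches the number of values that survived
kept = int(mask.sum().sum())
non_missing = int(masked.notna().sum().sum())

print(same, kept, non_missing)
```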

In a dataframe we can do this conditional selection on a single column.

Let's see what happens if we apply the same condition in `col4`.

In [35]:
df['col4'] > 0

row2     True
row3    False
row4     True
row5     True
row6    False
Name: col4, dtype: bool

And now it returns a series with `True` and `False` values for every row.

I'll just store this series in a variable and apply this condition to the whole dataframe.

In [36]:
boolSR = df['col4'] > 0
boolSR

row2     True
row3    False
row4     True
row5     True
row6    False
Name: col4, dtype: bool

In [37]:
df[boolSR]

Unnamed: 0,col1,col2,col3,col4,col5
row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937
row4,-1.099891,-0.172428,-0.877858,0.042214,0.582815
row5,-1.100619,1.144724,0.901591,0.502494,0.900856


And we have a `filtered` dataframe.

Conditional selection on a single column gives us a filtered dataframe: the rows where `col4` met the condition are kept in full, including the `values` of the other columns even if those don't meet the condition.

It's like saying give me the dataframe where `col4` is greater than 0.

We can do the same thing, like saying give me the dataframe where `col5` is less than 0.

In [38]:
boolSR = df['col5'] < 0
df[boolSR]

Unnamed: 0,col1,col2,col3,col4,col5
row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937


And we can see, there's only one row where `col5` is less than 0.

Now, as this is a `dataframe`, we can store it in a variable and do conditional selection, indexing and everything else we can do with a `dataframe`.

Like, let's say I only want the `col2` and `col4` columns where `col4` is greater than 0.

In [39]:
boolSR = df['col4'] > 0 # Boolean Series
boolSR

row2     True
row3    False
row4     True
row5     True
row6    False
Name: col4, dtype: bool

In [40]:
res_df = df[boolSR]
res_df

Unnamed: 0,col1,col2,col3,col4,col5
row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937
row4,-1.099891,-0.172428,-0.877858,0.042214,0.582815
row5,-1.100619,1.144724,0.901591,0.502494,0.900856


In [41]:
res_df[['col2', 'col4']]

Unnamed: 0,col2,col4
row2,1.744812,0.319039
row4,-0.172428,0.042214
row5,1.144724,0.502494


As you can see, we can easily do conditional selection if we break it down into multiple steps.

And if you practice enough you'll eventually be able to do the same operations in 1 line.

In [42]:
df[df['col4']>0][['col2', 'col4']]

Unnamed: 0,col2,col4
row2,1.744812,0.319039
row4,-0.172428,0.042214
row5,1.144724,0.502494


> It's exactly the same thing as breaking the conditional selection into multiple steps.
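
The same selection can also be written with `loc`, passing the boolean series as the row selector and the list of columns as the column selector. This sketch rebuilds the dataframe from the start of the section under the name `frame` so that it runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
frame = pd.DataFrame(
    np.random.randn(6, 5),
    index=['row1', 'row2', 'row3', 'row4', 'row5', 'row6'],
    columns=['col1', 'col2', 'col3', 'col4', 'col5'],
)

# Two steps chained together: filter the rows, then pick the columns
chained = frame[frame['col4'] > 0][['col2', 'col4']]

# One step: rows and columns selected in a single loc call
with_loc = frame.loc[frame['col4'] > 0, ['col2', 'col4']]

print(chained.equals(with_loc))
```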

One more very important thing I want to talk about is `multiple conditional selection` in `pandas` dataframes.

I gave some examples of how we can filter out data using a single condition on a `column` but what if we want to filter out data using multiple conditions on a single or multiple columns?

It's common python coding practice to use `and` and `or` operators to combine multiple conditions. They are called `logical operators` in python.

So, let's find the rows where `col3` is greater than 0 `or` `col4` is greater than 0.

In [43]:
boolSR = (df['col3'] > 0) or (df['col4'] > 0)
boolSR

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

And it's throwing an error. The error tells us that `The truth value of a Series is ambiguous`: the `or` operator needs a single `True` or `False` on each side, because the `logical` operators work on 2 boolean values, not on a whole series.

So, to fix this issue we can use the `|` operator (`bitwise or`) instead of the `logical or` operator. Pandas applies it element by element.

The same goes for the `and` operator.

So, to combine multiple conditional selections we can use the `|` and `&` operators for `or` and `and` respectively.

In [44]:
boolSR = (df['col3'] > 0) | (df['col4'] > 0)
boolSR

row2     True
row3    False
row4     True
row5     True
row6    False
dtype: bool

> AND ALWAYS REMEMBER TO USE `()` around the conditions.
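
The parentheses matter because `|` and `&` bind more tightly than `>`, so without them python would try to evaluate something like `0 | df['col4']` first. Here is a small sketch with made-up numbers showing `|`, `&` and also `~` (which flips a boolean series). The name `frame` and the values are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({
    'col3': [-1.0, 2.0, 3.0, -4.0],
    'col4': [1.0, -2.0, 3.0, -4.0],
})

either = (frame['col3'] > 0) | (frame['col4'] > 0)   # element-wise "or"
both = (frame['col3'] > 0) & (frame['col4'] > 0)     # element-wise "and"
neither = ~either                                    # element-wise "not"

print(either.tolist())
print(both.tolist())
print(neither.tolist())
```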

Now, we have a `combined` boolean series. And we can use this boolean series to filter out the dataframe.

In [45]:
res = df[boolSR]
res

Unnamed: 0,col1,col2,col3,col4,col5
row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937
row4,-1.099891,-0.172428,-0.877858,0.042214,0.582815
row5,-1.100619,1.144724,0.901591,0.502494,0.900856


And that's how you combine multiple conditional selections in `pandas` dataframes.

And before we go any further I want to show you one more thing.

## Resetting the index

In this article we have been using the same `dataframe` `df`, and I set its index and columns manually.

Now, when the number of rows gets very large it becomes hard to set the index manually.

In that case you can reset the index to the default integers by doing the following.

In [46]:
df.reset_index()

Unnamed: 0,index,col1,col2,col3,col4,col5
0,row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937
1,row3,1.462108,-2.060141,-0.322417,-0.384054,1.133769
2,row4,-1.099891,-0.172428,-0.877858,0.042214,0.582815
3,row5,-1.100619,1.144724,0.901591,0.502494,0.900856
4,row6,-0.683728,-0.12289,-0.935769,-0.267888,0.530355


And if you take a good look at the `dataframe` we get, it has a new `index` starting from `0`, and the old index is now a column of the dataframe.

This is not done inplace, so the original `dataframe` is not changed.

In [47]:
df

Unnamed: 0,col1,col2,col3,col4,col5
row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937
row3,1.462108,-2.060141,-0.322417,-0.384054,1.133769
row4,-1.099891,-0.172428,-0.877858,0.042214,0.582815
row5,-1.100619,1.144724,0.901591,0.502494,0.900856
row6,-0.683728,-0.12289,-0.935769,-0.267888,0.530355


We can set the `inplace` argument to `True` to make the change in place (permanently). Also, you might not want the old index to become a column, in which case you can pass `drop=True` to the `reset_index()` method.

In [48]:
df.reset_index(
    drop=True,
    # inplace=True
)

Unnamed: 0,col1,col2,col3,col4,col5
0,-2.301539,1.744812,-0.761207,0.319039,-0.24937
1,1.462108,-2.060141,-0.322417,-0.384054,1.133769
2,-1.099891,-0.172428,-0.877858,0.042214,0.582815
3,-1.100619,1.144724,0.901591,0.502494,0.900856
4,-0.683728,-0.12289,-0.935769,-0.267888,0.530355


And the old index is now removed from the `dataframe`.

> I didn't do it inplace because I want to use this `df` for more examples later.

Another thing we can do is set an `existing` column as the index.

For that, let me make a custom `column` again.

In [49]:
new = ['custom1', 'custom2', 'custom3', 'custom4', 'custom5']

Now we can add this as a new column to our dataframe.

In [50]:
df['new'] = new
df

Unnamed: 0,col1,col2,col3,col4,col5,new
row2,-2.301539,1.744812,-0.761207,0.319039,-0.24937,custom1
row3,1.462108,-2.060141,-0.322417,-0.384054,1.133769,custom2
row4,-1.099891,-0.172428,-0.877858,0.042214,0.582815,custom3
row5,-1.100619,1.144724,0.901591,0.502494,0.900856,custom4
row6,-0.683728,-0.12289,-0.935769,-0.267888,0.530355,custom5


Now we can make this new column the index using the `set_index` method.

In [52]:
df.set_index(
    'new',
)

new,col1,col2,col3,col4,col5
custom1,-2.301539,1.744812,-0.761207,0.319039,-0.24937
custom2,1.462108,-2.060141,-0.322417,-0.384054,1.133769
custom3,-1.099891,-0.172428,-0.877858,0.042214,0.582815
custom4,-1.100619,1.144724,0.901591,0.502494,0.900856
custom5,-0.683728,-0.12289,-0.935769,-0.267888,0.530355


And the new column is set as the index, and the old index is removed.

The `set_index()` method has both `drop` and `inplace` arguments, just like the `reset_index()` method.

> For `set_index()`, `drop` is `True` by default (the column is removed from the columns once it becomes the index) and `inplace` is `False` by default.
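
To tie the two methods together, here is a small round-trip sketch on a made-up two-row dataframe. The names `frame`, `indexed`, `restored` and `kept` are only for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {'col1': [1.0, 2.0], 'new': ['custom1', 'custom2']},
    index=['row2', 'row3'],
)

# 'new' becomes the index and leaves the columns (drop=True is the default)
indexed = frame.set_index('new')

# reset_index() turns the index back into a column and numbers the rows from 0
restored = indexed.reset_index()

# With drop=False the column is kept as a column as well as being the index
kept = frame.set_index('new', drop=False)

print(list(indexed.columns))
print(list(restored.columns), list(restored.index))
print(list(kept.columns))
```

None of these calls changed `frame` itself, because `inplace` was left at its default of `False`.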

And this is how we can `reset` and `set` the index of a `pandas` dataframe.

## Multilevel Indexing

Well,