In [1]:
import numpy as np
import pandas as pd

# Data Structure in Pandas

Pandas mainly provides two core data structures, built on top of NumPy arrays:

1. Series :
    - It is a one-dimensional labeled array
    - It can hold data of any type
    - Each element in a Series is paired with an index label

In [2]:
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

In [3]:
s

a    10
b    20
c    30
dtype: int64

2. DataFrame
    - It is a two-dimensional labeled data structure (rows and columns)
    - Each column in a DataFrame is a Series object
    - It is heterogeneous, i.e. each column can hold a different type of data

In [4]:
dict1 = {
    "name" : ['divyansh', 'raj', 'himanshu', 'unnati', 'vivan', 'honey'],
    "marks" : [21, 22, 23, 24, 25, 26],
    "city" : ['azamgarh', 'kanpur', 'ambedkarnagar', 'azamgarh', 'ambedkarnagar', 'ambedkarnagar']
}

In [5]:
df = pd.DataFrame(dict1)

In [6]:
df

Unnamed: 0,name,marks,city
0,divyansh,21,azamgarh
1,raj,22,kanpur
2,himanshu,23,ambedkarnagar
3,unnati,24,azamgarh
4,vivan,25,ambedkarnagar
5,honey,26,ambedkarnagar


**pd.DataFrame(dict1)** calls the pandas constructor to convert that dictionary into a tabular format.

In [7]:
df.to_csv('cousins.csv')

this line of code writes the DataFrame to a csv file

by default the index is written to the csv file too; to leave it out, pass index=False

In [8]:
df.to_csv('cousins_no_index.csv', index=False)

this gives us a file without the index column
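
To see the difference without opening the files, here is a small sketch. It relies on `to_csv()` returning the CSV text as a string when no path is given; the two-row frame and the variable names are made up for illustration.

```python
import pandas as pd

# Small illustrative frame (not the full cousins data).
df = pd.DataFrame({"name": ["divyansh", "raj"], "marks": [21, 22]})

# With no path argument, to_csv() returns the CSV text instead of writing a file.
csv_with_index = df.to_csv()
csv_without_index = df.to_csv(index=False)

# The first line of each CSV is the header row.
header_with_index = csv_with_index.splitlines()[0]
header_without_index = csv_without_index.splitlines()[0]

print(header_with_index)     # ,name,marks  (the leading empty field is the index)
print(header_without_index)  # name,marks
```

The empty first field in the header is what later shows up as an `Unnamed: 0` column when the file is read back.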

In [9]:
df.head(2)

Unnamed: 0,name,marks,city
0,divyansh,21,azamgarh
1,raj,22,kanpur


this line of code shows us the first two rows of the dataframe

In [10]:
df.tail(2)

Unnamed: 0,name,marks,city
4,vivan,25,ambedkarnagar
5,honey,26,ambedkarnagar


this code shows us the last two rows of the dataframe

In [11]:
df.iloc[1:3]

Unnamed: 0,name,marks,city
1,raj,22,kanpur
2,himanshu,23,ambedkarnagar


iloc selects rows by integer position, from one position to another; the end position is exclusive

In [12]:
df.loc[1:3]

Unnamed: 0,name,marks,city
1,raj,22,kanpur
2,himanshu,23,ambedkarnagar
3,unnati,24,azamgarh


loc selects rows by label instead, and here the end label is inclusive
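
With the default 0, 1, 2, ... index, positions and labels look the same, which hides the difference. A minimal sketch with a made-up index of 10, 20, 30, ... makes it visible:

```python
import pandas as pd

# Index labels deliberately differ from the row positions.
df = pd.DataFrame({"marks": [21, 22, 23, 24, 25]}, index=[10, 20, 30, 40, 50])

by_position = df.iloc[1:3]  # positions 1 and 2; stop position 3 is excluded
by_label = df.loc[20:40]    # labels 20 through 40; stop label 40 is included

print(by_position.index.tolist())  # [20, 30]
print(by_label.index.tolist())     # [20, 30, 40]
```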

In [13]:
df.describe()

Unnamed: 0,marks
count,6.0
mean,23.5
std,1.870829
min,21.0
25%,22.25
50%,23.5
75%,24.75
max,26.0


this line of code generates summary statistics for your dataframe (only numerical columns are included by default)

here is an explanation of what each row means

- **count** → number of non-null entries
- **mean** → average value
- **std** → standard deviation (spread of values)
- **min** → smallest value
- **25%** → 1st quartile (25% of values below this)
- **50%** → median (middle value)
- **75%** → 3rd quartile (75% of values below this)
- **max** → largest value
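
As a rough check of the list above, each row of `describe()` can be compared with the matching Series method. This sketch rebuilds the marks and city columns from earlier; `include="all"` is shown as one way to also summarise text columns.

```python
import pandas as pd

df = pd.DataFrame({
    "marks": [21, 22, 23, 24, 25, 26],
    "city": ["azamgarh", "kanpur", "ambedkarnagar",
             "azamgarh", "ambedkarnagar", "ambedkarnagar"],
})

marks = df["marks"]
summary = marks.describe()

# Each statistic matches a dedicated method on the Series.
print(summary["count"], marks.count())        # number of non-null entries
print(summary["mean"], marks.mean())          # average
print(summary["25%"], marks.quantile(0.25))   # 1st quartile
print(summary["50%"], marks.median())         # median

# include="all" adds rows such as unique/top/freq for non-numeric columns.
full = df.describe(include="all")
print(full.index.tolist())
```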

In [14]:
tr = pd.read_csv('trains.csv')

In [15]:
tr

Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Source,Destination,Train_Name,Speed_kmph,Distance_km
0,0,0,0,0,0,New Delhi,Mumbai Central,Rajdhani Express,100,1384
1,1,1,1,1,1,Howrah,Chennai,Coromandel Express,80,1659
2,2,2,2,2,2,Bangalore City,Hyderabad,Kacheguda Express,70,707
3,3,3,3,3,3,Ahmedabad,Jaipur,Aravalli Express,65,632
4,4,4,4,4,4,Kolkata,Patna,Shatabdi Express,85,532


read_csv reads a csv file from the given path (here, the current folder) and returns a DataFrame that we can assign to a variable

In [16]:
tr['Source']

0         New Delhi
1            Howrah
2    Bangalore City
3         Ahmedabad
4           Kolkata
Name: Source, dtype: object

we can select a column by using its name

In [17]:
tr['Distance_km'][1]

np.int64(1659)

we can also get a single element by using the column name and the row index

the same syntax can also be used to change an element

In [18]:
tr['Speed_kmph'][0] = 100

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  tr['Speed_kmph'][0] = 100
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tr['Speed_kmph'][0] = 100


this code gives us a warning about chained assignment (setting a value on a copy of a slice)

In [19]:
tr

Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Source,Destination,Train_Name,Speed_kmph,Distance_km
0,0,0,0,0,0,New Delhi,Mumbai Central,Rajdhani Express,100,1384
1,1,1,1,1,1,Howrah,Chennai,Coromandel Express,80,1659
2,2,2,2,2,2,Bangalore City,Hyderabad,Kacheguda Express,70,707
3,3,3,3,3,3,Ahmedabad,Jaipur,Aravalli Express,65,632
4,4,4,4,4,4,Kolkata,Patna,Shatabdi Express,85,532


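but in this pandas version the line still changes the element (the value was already 100 here, so the table looks the same); under Copy-on-Write it would not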

In [20]:
tr.to_csv('trains.csv')

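this updates the original csv file; since index=False is not passed, the index is written again as one more unnamed column, which is how the Unnamed: 0 ... Unnamed: 0.5 columns seen above pile up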
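
The Unnamed columns in the trains table come from saving the index and reading it back as data. Here is a small sketch of that round trip, using an in-memory buffer instead of a real file; the two-row frame is made up for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({"Source": ["New Delhi", "Howrah"], "Speed_kmph": [100, 80]})

# Saving with the default index=True and reading back adds one extra column.
round_trip = pd.read_csv(io.StringIO(df.to_csv()))
print(round_trip.columns.tolist())  # ['Unnamed: 0', 'Source', 'Speed_kmph']

# Option 1: do not write the index at all.
clean = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(clean.columns.tolist())       # ['Source', 'Speed_kmph']

# Option 2: keep the index in the file, but tell read_csv which column it is.
restored = pd.read_csv(io.StringIO(df.to_csv()), index_col=0)
print(restored.columns.tolist())    # ['Source', 'Speed_kmph']
```

Either option keeps the file from growing a new column on every save.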

In [21]:
tr.index = ['I', 'II', 'III', 'IV', 'v']

In [22]:
tr

Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Source,Destination,Train_Name,Speed_kmph,Distance_km
I,0,0,0,0,0,New Delhi,Mumbai Central,Rajdhani Express,100,1384
II,1,1,1,1,1,Howrah,Chennai,Coromandel Express,80,1659
III,2,2,2,2,2,Bangalore City,Hyderabad,Kacheguda Express,70,707
IV,3,3,3,3,3,Ahmedabad,Jaipur,Aravalli Express,65,632
v,4,4,4,4,4,Kolkata,Patna,Shatabdi Express,85,532


this shows that the index doesn't have to be numerical
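
Once the index holds strings, rows are looked up by those strings with .loc, while .iloc still works by position. A short sketch on a cut-down version of the trains data (only two columns, three rows):

```python
import pandas as pd

tr = pd.DataFrame(
    {"Source": ["New Delhi", "Howrah", "Bangalore City"],
     "Speed_kmph": [100, 80, 70]},
    index=["I", "II", "III"],
)

by_label = tr.loc["II", "Source"]   # row labelled "II"
by_position = tr.iloc[1, 0]         # second row, first column
label_slice = tr.loc["I":"II"]      # label slices include both ends

print(by_label, by_position)        # Howrah Howrah
print(label_slice.index.tolist())   # ['I', 'II']
```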

# Understanding the difference between Series and DataFrame

In [23]:
ser = pd.Series(np.random.rand(10))

In [24]:
ser

0    0.730166
1    0.997478
2    0.552216
3    0.504104
4    0.977858
5    0.628582
6    0.277517
7    0.652739
8    0.995805
9    0.163055
dtype: float64

we have generated a series of 10 random floating point numbers

In [25]:
type(ser)

pandas.core.series.Series

this shows us that the given data structure is a series

In [26]:
newdf = pd.DataFrame(np.random.rand(350, 5), index = np.arange(350))

In [27]:
newdf

Unnamed: 0,0,1,2,3,4
0,0.105347,0.857177,0.993586,0.605933,0.158149
1,0.697144,0.771117,0.509576,0.630326,0.047421
2,0.557687,0.146671,0.461229,0.207753,0.068252
3,0.885559,0.678606,0.424006,0.673018,0.010157
4,0.705288,0.804789,0.262785,0.153849,0.291168
...,...,...,...,...,...
345,0.166176,0.420344,0.531076,0.721601,0.032254
346,0.711007,0.539854,0.672553,0.171006,0.700453
347,0.672911,0.412094,0.179943,0.893669,0.499065
348,0.955880,0.811527,0.584494,0.703901,0.924520


this generates a DataFrame of the given dimensions (350 rows, 5 columns); because it is so large, pandas truncates the display and shows only the top 5 and bottom 5 rows

In [28]:
type(newdf)

pandas.core.frame.DataFrame

this shows us that the given data structure is a DataFrame

In [29]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.105347,0.857177,0.993586,0.605933,0.158149
1,0.697144,0.771117,0.509576,0.630326,0.047421
2,0.557687,0.146671,0.461229,0.207753,0.068252
3,0.885559,0.678606,0.424006,0.673018,0.010157
4,0.705288,0.804789,0.262785,0.153849,0.291168


this gives us the top 5 rows of the dataframe

In [30]:
newdf.tail()

Unnamed: 0,0,1,2,3,4
345,0.166176,0.420344,0.531076,0.721601,0.032254
346,0.711007,0.539854,0.672553,0.171006,0.700453
347,0.672911,0.412094,0.179943,0.893669,0.499065
348,0.95588,0.811527,0.584494,0.703901,0.92452
349,0.188588,0.55496,0.474922,0.473569,0.373803


this shows us the bottom 5 rows of the dataframe

In [31]:
newdf.dtypes

0    float64
1    float64
2    float64
3    float64
4    float64
dtype: object

this shows us the data type of each column (a DataFrame can hold a different data type in each column)

In [32]:
newdf[0][0] = "Divyansh"

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf[0][0] = "Divyansh"
  newdf[0][0] = "Divyansh"


In [33]:
newdf.dtypes

0     object
1    float64
2    float64
3    float64
4    float64
dtype: object

now, by explicitly setting one element to a string, we've changed the whole column's data type to object

again, chained assignment gives us a warning

In [34]:
newdf.index

Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
       ...
       340, 341, 342, 343, 344, 345, 346, 347, 348, 349],
      dtype='int64', length=350)

this gives us the index (row labels) of the dataframe

In [35]:
newdf.columns

RangeIndex(start=0, stop=5, step=1)

this gives us a RangeIndex because the column labels follow a regular pattern: starting at 0, stopping before 5, in steps of 1

In [36]:
newdf.to_numpy()

array([['Divyansh', 0.857177296726323, 0.9935856485932322,
        0.6059333324881087, 0.15814887599881255],
       [0.6971435700336193, 0.7711165602766791, 0.5095763066291314,
        0.6303263985361132, 0.04742141792044208],
       [0.5576871423826151, 0.14667145456815844, 0.46122881090995316,
        0.20775344911088556, 0.06825182519338202],
       ...,
       [0.6729108526276064, 0.41209449881751614, 0.17994262331251665,
        0.8936693409496337, 0.49906548500614967],
       [0.9558796237391695, 0.8115266951445343, 0.5844938697349574,
        0.703901082695936, 0.9245204187016358],
       [0.1885883088597602, 0.5549604908246007, 0.47492249454429747,
        0.47356894830463436, 0.3738028936111395]],
      shape=(350, 5), dtype=object)

the array's data type is object, because one column now holds a string and a NumPy array needs a single dtype that fits every element

let's try changing that value back to a numerical value

In [37]:
newdf[0][0] = 0.675384234

In [38]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.675384,0.857177,0.993586,0.605933,0.158149
1,0.697144,0.771117,0.509576,0.630326,0.047421
2,0.557687,0.146671,0.461229,0.207753,0.068252
3,0.885559,0.678606,0.424006,0.673018,0.010157
4,0.705288,0.804789,0.262785,0.153849,0.291168


In [39]:
newdf.to_numpy()

array([[0.675384234, 0.857177296726323, 0.9935856485932322,
        0.6059333324881087, 0.15814887599881255],
       [0.6971435700336193, 0.7711165602766791, 0.5095763066291314,
        0.6303263985361132, 0.04742141792044208],
       [0.5576871423826151, 0.14667145456815844, 0.46122881090995316,
        0.20775344911088556, 0.06825182519338202],
       ...,
       [0.6729108526276064, 0.41209449881751614, 0.17994262331251665,
        0.8936693409496337, 0.49906548500614967],
       [0.9558796237391695, 0.8115266951445343, 0.5844938697349574,
        0.703901082695936, 0.9245204187016358],
       [0.1885883088597602, 0.5549604908246007, 0.47492249454429747,
        0.47356894830463436, 0.3738028936111395]],
      shape=(350, 5), dtype=object)

we did change the value back to a number, but that doesn't change the data type of the column, which stays object
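
One possible way to get a numeric column back is an explicit conversion with `astype(float)`. This is a sketch on a tiny made-up column `x` that starts out as object, not a step from the notebook itself:

```python
import pandas as pd

# A column that already has dtype object because it holds a string.
df = pd.DataFrame({"x": pd.Series(["Divyansh", 0.697144, 0.557687], dtype=object)})

# Replacing the string with a number does not change the column dtype.
df.loc[0, "x"] = 0.675384234
dtype_before = df["x"].dtype
print(dtype_before)   # object

# An explicit conversion is needed to go back to float64.
df["x"] = df["x"].astype(float)
dtype_after = df["x"].dtype
print(dtype_after)    # float64
```

`astype(float)` raises an error if any value cannot be converted, so it only works once every string is gone.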

In [40]:
newdf.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,340,341,342,343,344,345,346,347,348,349
0,0.675384,0.697144,0.557687,0.885559,0.705288,0.828846,0.269337,0.160237,0.166553,0.940971,...,0.785743,0.593184,0.071459,0.419398,0.476434,0.166176,0.711007,0.672911,0.95588,0.188588
1,0.857177,0.771117,0.146671,0.678606,0.804789,0.223245,0.026637,0.128669,0.892126,0.922085,...,0.922137,0.203001,0.048282,0.451307,0.982203,0.420344,0.539854,0.412094,0.811527,0.55496
2,0.993586,0.509576,0.461229,0.424006,0.262785,0.256808,0.555557,0.011721,0.334964,0.84501,...,0.118532,0.339524,0.120649,0.769656,0.846906,0.531076,0.672553,0.179943,0.584494,0.474922
3,0.605933,0.630326,0.207753,0.673018,0.153849,0.108497,0.075458,0.323754,0.703964,0.277373,...,0.126169,0.704626,0.118169,0.262175,0.007552,0.721601,0.171006,0.893669,0.703901,0.473569
4,0.158149,0.047421,0.068252,0.010157,0.291168,0.865209,0.979842,0.128662,0.459311,0.355318,...,0.294208,0.049632,0.945101,0.583893,0.994476,0.032254,0.700453,0.499065,0.92452,0.373803


In [41]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.675384,0.857177,0.993586,0.605933,0.158149
1,0.697144,0.771117,0.509576,0.630326,0.047421
2,0.557687,0.146671,0.461229,0.207753,0.068252
3,0.885559,0.678606,0.424006,0.673018,0.010157
4,0.705288,0.804789,0.262785,0.153849,0.291168


In [42]:
newdf.sort_index(axis = 0, ascending = False)

Unnamed: 0,0,1,2,3,4
349,0.188588,0.554960,0.474922,0.473569,0.373803
348,0.95588,0.811527,0.584494,0.703901,0.924520
347,0.672911,0.412094,0.179943,0.893669,0.499065
346,0.711007,0.539854,0.672553,0.171006,0.700453
345,0.166176,0.420344,0.531076,0.721601,0.032254
...,...,...,...,...,...
4,0.705288,0.804789,0.262785,0.153849,0.291168
3,0.885559,0.678606,0.424006,0.673018,0.010157
2,0.557687,0.146671,0.461229,0.207753,0.068252
1,0.697144,0.771117,0.509576,0.630326,0.047421


sorts the rows in descending order by their index

In [43]:
newdf.sort_index(axis = 1, ascending = False)

Unnamed: 0,4,3,2,1,0
0,0.158149,0.605933,0.993586,0.857177,0.675384
1,0.047421,0.630326,0.509576,0.771117,0.697144
2,0.068252,0.207753,0.461229,0.146671,0.557687
3,0.010157,0.673018,0.424006,0.678606,0.885559
4,0.291168,0.153849,0.262785,0.804789,0.705288
...,...,...,...,...,...
345,0.032254,0.721601,0.531076,0.420344,0.166176
346,0.700453,0.171006,0.672553,0.539854,0.711007
347,0.499065,0.893669,0.179943,0.412094,0.672911
348,0.924520,0.703901,0.584494,0.811527,0.95588


sorts the columns in descending order by their labels

In [44]:
type(newdf[0])

pandas.core.series.Series

this code confirms that every column in a dataframe is a Series

In [45]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.675384,0.857177,0.993586,0.605933,0.158149
1,0.697144,0.771117,0.509576,0.630326,0.047421
2,0.557687,0.146671,0.461229,0.207753,0.068252
3,0.885559,0.678606,0.424006,0.673018,0.010157
4,0.705288,0.804789,0.262785,0.153849,0.291168


In [46]:
newdf2 = newdf

In [47]:
newdf2[0][0] = 54321

In [48]:
newdf

Unnamed: 0,0,1,2,3,4
0,54321,0.857177,0.993586,0.605933,0.158149
1,0.697144,0.771117,0.509576,0.630326,0.047421
2,0.557687,0.146671,0.461229,0.207753,0.068252
3,0.885559,0.678606,0.424006,0.673018,0.010157
4,0.705288,0.804789,0.262785,0.153849,0.291168
...,...,...,...,...,...
345,0.166176,0.420344,0.531076,0.721601,0.032254
346,0.711007,0.539854,0.672553,0.171006,0.700453
347,0.672911,0.412094,0.179943,0.893669,0.499065
348,0.95588,0.811527,0.584494,0.703901,0.924520


this is to clarify that "newdf2 = newdf" doesn't copy the dataframe; newdf2 is just another name for the same object, so any change made through newdf2 also shows up in newdf

In [49]:
newdf3 = newdf.copy()

In [50]:
newdf3[0][0] = 9

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf3[0][0] = 9
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  newdf3[0][0] = 9


In [51]:
newdf3.head()

Unnamed: 0,0,1,2,3,4
0,9.0,0.857177,0.993586,0.605933,0.158149
1,0.697144,0.771117,0.509576,0.630326,0.047421
2,0.557687,0.146671,0.461229,0.207753,0.068252
3,0.885559,0.678606,0.424006,0.673018,0.010157
4,0.705288,0.804789,0.262785,0.153849,0.291168


In [52]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,54321.0,0.857177,0.993586,0.605933,0.158149
1,0.697144,0.771117,0.509576,0.630326,0.047421
2,0.557687,0.146671,0.461229,0.207753,0.068252
3,0.885559,0.678606,0.424006,0.673018,0.010157
4,0.705288,0.804789,0.262785,0.153849,0.291168


in this case we did not alter any element in the original dataframe, because .copy() creates an independent copy instead of another reference to the same dataframe
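
The two cases can be put side by side in a small sketch. The names `original`, `alias` and `independent` are made up for illustration, and `.loc` is used for the assignments to avoid the chained assignment warning.

```python
import pandas as pd

original = pd.DataFrame({"A": [1.0, 2.0, 3.0]})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

alias.loc[0, "A"] = 54321.0    # also visible through `original`
independent.loc[1, "A"] = 9.0  # only changes the copy

print(alias is original)            # True
print(independent is original)      # False
print(original["A"].tolist())       # [54321.0, 2.0, 3.0]
print(independent["A"].tolist())    # [1.0, 9.0, 3.0]
```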

In [53]:
newdf.loc[0,0] = 654

In [54]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,654.0,0.857177,0.993586,0.605933,0.158149
1,0.697144,0.771117,0.509576,0.630326,0.047421
2,0.557687,0.146671,0.461229,0.207753,0.068252
3,0.885559,0.678606,0.424006,0.673018,0.010157
4,0.705288,0.804789,0.262785,0.153849,0.291168


**Difference: df[col][row] vs df.loc[row, col]**

    - df[col][row] → first picks a column, then a row → ❌ can cause chained assignment warning, not always safe.

    - df.loc[row, col] → directly picks row & column in one step → ✅ safe and recommended.
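
A minimal sketch of the recommended form, on a made-up three-row frame. `.at` is shown as well since it is meant for reading or writing a single cell.

```python
import pandas as pd

tr = pd.DataFrame({"Speed_kmph": [90, 80, 70]})

# One indexing step: row label and column label together, no intermediate object.
tr.loc[0, "Speed_kmph"] = 100

# .at does the same for exactly one cell.
tr.at[1, "Speed_kmph"] = 85

print(tr["Speed_kmph"].tolist())  # [100, 85, 70]
```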

In [55]:
newdf.columns = list("ABCDE")

In [56]:
newdf.head()

Unnamed: 0,A,B,C,D,E
0,654.0,0.857177,0.993586,0.605933,0.158149
1,0.697144,0.771117,0.509576,0.630326,0.047421
2,0.557687,0.146671,0.461229,0.207753,0.068252
3,0.885559,0.678606,0.424006,0.673018,0.010157
4,0.705288,0.804789,0.262785,0.153849,0.291168


- df.columns = [...] → directly set new column names.
- here in our case it renames all columns to A, B, C, D, E

In [57]:
newdf.loc[0,0] = 1234

In [58]:
newdf

Unnamed: 0,A,B,C,D,E,0
0,654,0.857177,0.993586,0.605933,0.158149,1234.0
1,0.697144,0.771117,0.509576,0.630326,0.047421,
2,0.557687,0.146671,0.461229,0.207753,0.068252,
3,0.885559,0.678606,0.424006,0.673018,0.010157,
4,0.705288,0.804789,0.262785,0.153849,0.291168,
...,...,...,...,...,...,...
345,0.166176,0.420344,0.531076,0.721601,0.032254,
346,0.711007,0.539854,0.672553,0.171006,0.700453,
347,0.672911,0.412094,0.179943,0.893669,0.499065,
348,0.95588,0.811527,0.584494,0.703901,0.924520,


here we can see that .loc needs labels that actually exist: after renaming, the columns are A to E, so the column label 0 is gone. If we give .loc a row or column label that does not exist in the dataframe, pandas adds a new row/column with that label and fills its other cells with NaN
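
The same behaviour on a tiny made-up frame, with a simple membership check as one possible guard against creating columns by accident (the column name "Z" is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"A": [0.1, 0.2], "B": [0.3, 0.4]})

# "Z" is not an existing column, so .loc creates it and fills the rest with NaN.
df.loc[0, "Z"] = 1234
print(df.columns.tolist())      # ['A', 'B', 'Z']
print(df["Z"].isna().tolist())  # [False, True]

# Checking the label first avoids adding a column because of a typo.
target = "A"
if target in df.columns:
    df.loc[0, target] = 1234
print(df.loc[0, "A"])           # 1234.0
```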

In [59]:
newdf.loc[0, 'A'] = 1234

In [60]:
newdf.head()

Unnamed: 0,A,B,C,D,E,0
0,1234.0,0.857177,0.993586,0.605933,0.158149,1234.0
1,0.697144,0.771117,0.509576,0.630326,0.047421,
2,0.557687,0.146671,0.461229,0.207753,0.068252,
3,0.885559,0.678606,0.424006,0.673018,0.010157,
4,0.705288,0.804789,0.262785,0.153849,0.291168,


here we can see that the value in the desired field changed once we provided the correct row and column labels

In [61]:
temp = newdf.drop(0)

In [62]:
temp

Unnamed: 0,A,B,C,D,E,0
1,0.697144,0.771117,0.509576,0.630326,0.047421,
2,0.557687,0.146671,0.461229,0.207753,0.068252,
3,0.885559,0.678606,0.424006,0.673018,0.010157,
4,0.705288,0.804789,0.262785,0.153849,0.291168,
5,0.828846,0.223245,0.256808,0.108497,0.865209,
...,...,...,...,...,...,...
345,0.166176,0.420344,0.531076,0.721601,0.032254,
346,0.711007,0.539854,0.672553,0.171006,0.700453,
347,0.672911,0.412094,0.179943,0.893669,0.499065,
348,0.95588,0.811527,0.584494,0.703901,0.924520,


drop removes the row with the given label (axis=0 is the default) and returns a new dataframe; to remove a column we need to pass axis=1

In [63]:
temp = newdf.drop(0, axis = 1)

In [64]:
temp

Unnamed: 0,A,B,C,D,E
0,1234,0.857177,0.993586,0.605933,0.158149
1,0.697144,0.771117,0.509576,0.630326,0.047421
2,0.557687,0.146671,0.461229,0.207753,0.068252
3,0.885559,0.678606,0.424006,0.673018,0.010157
4,0.705288,0.804789,0.262785,0.153849,0.291168
...,...,...,...,...,...
345,0.166176,0.420344,0.531076,0.721601,0.032254
346,0.711007,0.539854,0.672553,0.171006,0.700453
347,0.672911,0.412094,0.179943,0.893669,0.499065
348,0.95588,0.811527,0.584494,0.703901,0.924520


here the unnecessary column is gone from temp; newdf itself is unchanged, because drop returns a new dataframe
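
A short sketch of the same idea on a made-up frame that has a stray column labelled 0. Assigning the result back is one way to make the removal stick; `columns=[...]` is an alternative spelling of `axis=1`.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], 0: [9.0, None, None]})

without_row = df.drop(0)          # axis=0 by default: drops the row labelled 0
without_col = df.drop(0, axis=1)  # drops the column labelled 0
same_result = df.drop(columns=[0])

print(df.shape)                      # (3, 2) -> the original is untouched
print(without_row.index.tolist())    # [1, 2]
print(without_col.columns.tolist())  # ['A']

# Reassign to keep the change.
df = df.drop(columns=[0])
print(df.columns.tolist())           # ['A']
```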

In [65]:
newdf.drop(0, axis = 1)

Unnamed: 0,A,B,C,D,E
0,1234,0.857177,0.993586,0.605933,0.158149
1,0.697144,0.771117,0.509576,0.630326,0.047421
2,0.557687,0.146671,0.461229,0.207753,0.068252
3,0.885559,0.678606,0.424006,0.673018,0.010157
4,0.705288,0.804789,0.262785,0.153849,0.291168
...,...,...,...,...,...
345,0.166176,0.420344,0.531076,0.721601,0.032254
346,0.711007,0.539854,0.672553,0.171006,0.700453
347,0.672911,0.412094,0.179943,0.893669,0.499065
348,0.95588,0.811527,0.584494,0.703901,0.924520


In [66]:
newdf.loc[[1,2], ['B', 'C']]

Unnamed: 0,B,C
1,0.771117,0.509576
2,0.146671,0.461229


this gives us exactly the rows and columns that we list

it only returns the selection, it doesn't change the actual dataframe

In [68]:
newdf.loc[:,['A','B']]

Unnamed: 0,A,B
0,1234,0.857177
1,0.697144,0.771117
2,0.557687,0.146671
3,0.885559,0.678606
4,0.705288,0.804789
...,...,...
345,0.166176,0.420344
346,0.711007,0.539854
347,0.672911,0.412094
348,0.95588,0.811527


Here 
- : → means all rows.

- ['A','B'] → means only columns A and B.

So it selects a sub-DataFrame with all rows, but only the two columns A and B.

In [69]:
newdf.loc[[1,2], :]

Unnamed: 0,A,B,C,D,E,0
1,0.697144,0.771117,0.509576,0.630326,0.047421,
2,0.557687,0.146671,0.461229,0.207753,0.068252,


Here
- [1,2] → selects rows with labels 1 and 2.

- : → means all columns.

So it returns rows 1 and 2, with every column.

In [71]:
newdf.loc[(newdf['A']<0.3)]

Unnamed: 0,A,B,C,D,E,0
6,0.269337,0.026637,0.555557,0.075458,0.979842,
7,0.160237,0.128669,0.011721,0.323754,0.128662,
8,0.166553,0.892126,0.334964,0.703964,0.459311,
14,0.286467,0.805939,0.391460,0.437832,0.531848,
16,0.076986,0.240685,0.364705,0.763328,0.419404,
...,...,...,...,...,...,...
336,0.187481,0.781820,0.004233,0.115345,0.656987,
339,0.262992,0.068717,0.304326,0.499407,0.614494,
342,0.071459,0.048282,0.120649,0.118169,0.945101,
345,0.166176,0.420344,0.531076,0.721601,0.032254,


Here

- newdf['A'] < 0.3 → creates a Boolean mask (True/False for each row).
- .loc[...] → keeps only the rows where the condition is True.

 Returns all rows where column A has a value less than 0.3.
 
 By default, you get all columns back.

 Works for any condition (==, >, <=, etc.).

In [72]:
newdf.loc[(newdf['A']<0.3) & (newdf['C']>0.2)]

Unnamed: 0,A,B,C,D,E,0
6,0.269337,0.026637,0.555557,0.075458,0.979842,
8,0.166553,0.892126,0.334964,0.703964,0.459311,
14,0.286467,0.805939,0.391460,0.437832,0.531848,
16,0.076986,0.240685,0.364705,0.763328,0.419404,
29,0.038935,0.295192,0.468927,0.422501,0.410074,
...,...,...,...,...,...,...
333,0.037528,0.255128,0.552713,0.485830,0.293999,
335,0.238005,0.898140,0.780545,0.663065,0.224508,
339,0.262992,0.068717,0.304326,0.499407,0.614494,
345,0.166176,0.420344,0.531076,0.721601,0.032254,
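
The cell above combines two conditions with `&`. Here is a small sketch of the combining operators on a made-up four-row frame; each condition needs its own parentheses, because `&`, `|` and `~` bind more tightly than the comparisons.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.1, 0.5, 0.2, 0.9], "C": [0.3, 0.1, 0.1, 0.8]})

both = df.loc[(df["A"] < 0.3) & (df["C"] > 0.2)]    # AND: both must be True
either = df.loc[(df["A"] < 0.3) | (df["C"] > 0.2)]  # OR: at least one is True
negated = df.loc[~(df["A"] < 0.3)]                  # NOT: condition is False

print(both.index.tolist())     # [0]
print(either.index.tolist())   # [0, 2, 3]
print(negated.index.tolist())  # [1, 3]
```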


In [73]:
newdf.iloc[0,3]

np.float64(0.6059333324881087)

this works as

**iloc[row_index, col_index]**

Both row & column use 0-based integer positions.

- .iloc → integer-location based indexing.
- 0 → row at position 0 (first row).
- 3 → column at position 3 (fourth column, since it’s 0-based).

Returns the value at that position.

📌 .loc[] vs .iloc[] in Pandas

- .iloc[] → uses integer positions (row position, column position).

`df.iloc[0, 2]   # 1st row, 3rd column`


- .loc[] → uses labels (row names, column names).

`df.loc[0, "Age"]   # row label 0, column "Age"`


- Use .iloc for position-based access.
- Use .loc for label-based access.
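
A quick sketch where the two give different answers: the index here is deliberately out of order, and the "Age" column is just the example name used above.

```python
import pandas as pd

# Row labels 2, 0, 1 do not match the row positions 0, 1, 2.
df = pd.DataFrame({"Age": [30, 40, 50]}, index=[2, 0, 1])

first_by_position = df.iloc[0, 0]   # first row, first column
row_labelled_zero = df.loc[0, "Age"]  # row whose label is 0

print(first_by_position)   # 30
print(row_labelled_zero)   # 40
```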