In [1]:
import numpy as np
import pandas as pd

# Data Structures in Pandas

Pandas mainly provides two core data structures, built on top of NumPy arrays:

1. Series
    - It is a one-dimensional labeled array
    - It can hold data of any type
    - Each element is paired with an index label

In [2]:
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

In [3]:
s

a    10
b    20
c    30
dtype: int64

2. DataFrame
    - It is a two-dimensional labeled data structure (rows and columns)
    - Each column in a DataFrame is a Series object
    - It is heterogeneous, i.e. each column can hold a different type of data

In [4]:
dict1 = {
    "name" : ['divyansh', 'raj', 'himanshu', 'unnati', 'vivan', 'honey'],
    "marks" : [21, 22, 23, 24, 25, 26],
    "city" : ['azamgarh', 'kanpur', 'ambedkarnagar', 'azamgarh', 'ambedkarnagar', 'ambedkarnagar']
}

In [5]:
df = pd.DataFrame(dict1)

In [6]:
df

Unnamed: 0,name,marks,city
0,divyansh,21,azamgarh
1,raj,22,kanpur
2,himanshu,23,ambedkarnagar
3,unnati,24,azamgarh
4,vivan,25,ambedkarnagar
5,honey,26,ambedkarnagar


**pd.DataFrame(dict1)** calls the pandas constructor to convert that dictionary into a tabular format.
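
As a minimal sketch of the same idea, the snippet below builds a DataFrame from a smaller dictionary. The `students` dictionary and its values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical sample data, smaller than dict1 above
students = {
    "name": ["asha", "ravi", "meera"],
    "marks": [21, 22, 23],
    "city": ["kanpur", "azamgarh", "kanpur"],
}

small_df = pd.DataFrame(students)

# Dictionary keys become column names; rows get a default 0..n-1 index
print(small_df.columns.tolist())
print(small_df.shape)

# Each column keeps its own dtype (text vs. integers)
print(small_df.dtypes)

# Selecting a single column returns a Series
print(type(small_df["marks"]))
```

Each list in the dictionary must have the same length, since every list becomes one column of the table.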

In [7]:
df.to_csv('cousins.csv')

This line writes the DataFrame to a CSV file.

By default, the index is written to the file as an extra first column. To leave it out, pass index=False.

In [8]:
df.to_csv('cousins_no_index.csv', index=False)

This gives us a file without the index column.
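
To see the difference without opening the files, here is a small sketch that relies on `to_csv` returning the CSV text as a string when no path is given. The `demo` frame is a made-up two-row example.

```python
import pandas as pd

# Hypothetical two-row frame, just for comparing the two outputs
demo = pd.DataFrame({"name": ["raj", "honey"], "marks": [22, 26]})

with_index = demo.to_csv()                # no path: returns the CSV text
without_index = demo.to_csv(index=False)

header_with = with_index.splitlines()[0]
header_without = without_index.splitlines()[0]

print(header_with)     # ,name,marks  (leading empty name = index column)
print(header_without)  # name,marks
```

The empty column name in the first header is what `read_csv` later reports as `Unnamed: 0`.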

In [9]:
df.head(2)

Unnamed: 0,name,marks,city
0,divyansh,21,azamgarh
1,raj,22,kanpur


This shows the first two rows of the DataFrame.

In [10]:
df.tail(2)

Unnamed: 0,name,marks,city
4,vivan,25,ambedkarnagar
5,honey,26,ambedkarnagar


This shows the last two rows of the DataFrame.

In [11]:
df.iloc[1:3]

Unnamed: 0,name,marks,city
1,raj,22,kanpur
2,himanshu,23,ambedkarnagar


iloc selects rows by integer position, from a start position up to a stop position; the stop position is exclusive.

In [12]:
df.loc[1:3]

Unnamed: 0,name,marks,city
1,raj,22,kanpur
2,himanshu,23,ambedkarnagar
3,unnati,24,azamgarh


loc selects rows by index label instead, and here the stop label is inclusive.
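
With the default 0, 1, 2, ... index, positions and labels look the same, which can hide the difference. The sketch below uses a made-up frame with string labels so the two are easier to tell apart.

```python
import pandas as pd

# Hypothetical frame with string labels instead of the default integers
demo = pd.DataFrame(
    {"marks": [21, 22, 23, 24]},
    index=["a", "b", "c", "d"],
)

by_position = demo.iloc[1:3]    # positions 1 and 2; position 3 is excluded
by_label = demo.loc["b":"d"]    # labels 'b' through 'd'; 'd' is included

print(by_position.index.tolist())  # ['b', 'c']
print(by_label.index.tolist())     # ['b', 'c', 'd']
```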

In [13]:
df.describe()

Unnamed: 0,marks
count,6.0
mean,23.5
std,1.870829
min,21.0
25%,22.25
50%,23.5
75%,24.75
max,26.0


This generates summary statistics for the DataFrame (by default, only for numerical columns).

Here is what each row means:

- **count** → number of non-null entries
- **mean** → average value
- **std** → standard deviation (spread of values)
- **min** → smallest value
- **25%** → 1st quartile (25% of values below this)
- **50%** → median (middle value)
- **75%** → 3rd quartile (75% of values below this)
- **max** → largest value
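
The rows above can be cross-checked against the individual Series methods. This is a sketch using the same six marks as the DataFrame above; the `marks` and `summary` names are introduced here for illustration.

```python
import pandas as pd

# Same six values as the marks column above
marks = pd.Series([21, 22, 23, 24, 25, 26], name="marks")
summary = marks.describe()

# Each describe() row matches a dedicated method
print(summary["count"], marks.count())
print(summary["mean"], marks.mean())
print(summary["std"], marks.std())           # sample standard deviation
print(summary["25%"], marks.quantile(0.25))
print(summary["50%"], marks.median())
print(summary["75%"], marks.quantile(0.75))
```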

In [14]:
tr = pd.read_csv('trains.csv')

In [15]:
tr

Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Source,Destination,Train_Name,Speed_kmph,Distance_km
0,0,0,0,New Delhi,Mumbai Central,Rajdhani Express,100,1384
1,1,1,1,Howrah,Chennai,Coromandel Express,80,1659
2,2,2,2,Bangalore City,Hyderabad,Kacheguda Express,70,707
3,3,3,3,Ahmedabad,Jaipur,Aravalli Express,65,632
4,4,4,4,Kolkata,Patna,Shatabdi Express,85,532


read_csv reads a CSV file (here, one in the current folder) and returns a DataFrame, which we can assign to a variable.
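
The `Unnamed: 0` columns in the output above appear when a file was saved with its index and then read back. One way to avoid them is `index_col=0`, sketched below on in-memory CSV text; the `csv_text` string is a made-up stand-in for a file on disk.

```python
import io
import pandas as pd

# Made-up CSV text shaped like a file saved with to_csv() and its index
csv_text = ",Source,Speed_kmph\n0,New Delhi,100\n1,Howrah,80\n"

# Default: the unnamed first column is read as ordinary data
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the index instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())    # ['Unnamed: 0', 'Source', 'Speed_kmph']
print(indexed.columns.tolist())  # ['Source', 'Speed_kmph']
```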

In [16]:
tr['Source']

0         New Delhi
1            Howrah
2    Bangalore City
3         Ahmedabad
4           Kolkata
Name: Source, dtype: object

We can select a column by its name; the result is a Series.

In [17]:
tr['Distance_km'][1]

np.int64(1659)

We can also get a single element by using the column name and then the index label.

The same pattern can also be used to change an element:

In [18]:
tr['Speed_kmph'][0] = 100

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  tr['Speed_kmph'][0] = 100
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tr['Speed_kmph'][0] = 100


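This code triggers warnings because it is a chained assignment: tr['Speed_kmph'] is evaluated first, and the value is then set on that intermediate object rather than on tr directly.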

In [19]:
tr

Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Source,Destination,Train_Name,Speed_kmph,Distance_km
0,0,0,0,New Delhi,Mumbai Central,Rajdhani Express,100,1384
1,1,1,1,Howrah,Chennai,Coromandel Express,80,1659
2,2,2,2,Bangalore City,Hyderabad,Kacheguda Express,70,707
3,3,3,3,Ahmedabad,Jaipur,Aravalli Express,65,632
4,4,4,4,Kolkata,Patna,Shatabdi Express,85,532


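In this pandas version the assignment still goes through (the value was already 100 here, so the output looks unchanged), but the warning says it will stop working once Copy-on-Write becomes the default.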
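
The warning itself recommends a single-step assignment with `.loc`. Here is a minimal sketch on a made-up two-row frame; the `trains` name and the new speed values are illustrative only.

```python
import pandas as pd

# Hypothetical small frame standing in for the trains data
trains = pd.DataFrame(
    {
        "Train_Name": ["Rajdhani Express", "Coromandel Express"],
        "Speed_kmph": [100, 80],
    }
)

# One indexing step: row label first, then column name
trains.loc[0, "Speed_kmph"] = 110

# .at does the same for a single cell
trains.at[1, "Speed_kmph"] = 90

print(trains)
```

Because the row and the column are given in one indexing call, pandas sets the value on the DataFrame itself and no warning is raised.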

In [20]:
tr.to_csv('trains.csv')

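This writes the DataFrame back to the original CSV file. Since index=False is not passed, the index is saved as one more column each time; repeated round trips like this are where the Unnamed: 0, Unnamed: 0.1, ... columns above come from.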

In [21]:
tr.index = ['I', 'II', 'III', 'IV', 'v']

In [22]:
tr

Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Source,Destination,Train_Name,Speed_kmph,Distance_km
I,0,0,0,New Delhi,Mumbai Central,Rajdhani Express,100,1384
II,1,1,1,Howrah,Chennai,Coromandel Express,80,1659
III,2,2,2,Bangalore City,Hyderabad,Kacheguda Express,70,707
IV,3,3,3,Ahmedabad,Jaipur,Aravalli Express,65,632
v,4,4,4,Kolkata,Patna,Shatabdi Express,85,532


This shows that index labels don't have to be numerical.

# Understanding the difference between Series and DataFrame

In [23]:
ser = pd.Series(np.random.rand(10))

In [24]:
ser

0    0.574287
1    0.169995
2    0.012057
3    0.195279
4    0.844033
5    0.086473
6    0.843749
7    0.562117
8    0.096528
9    0.324634
dtype: float64

We have generated a Series of 10 random floating-point numbers.

In [25]:
type(ser)

pandas.core.series.Series

This shows us that the given data structure is a Series.

In [26]:
newdf = pd.DataFrame(np.random.rand(350, 5), index = np.arange(350))

In [27]:
newdf

Unnamed: 0,0,1,2,3,4
0,0.177624,0.778557,0.200340,0.195448,0.363447
1,0.526906,0.893492,0.512165,0.714076,0.949962
2,0.790889,0.501086,0.507695,0.594873,0.074365
3,0.130050,0.883883,0.189677,0.210901,0.184302
4,0.334504,0.991274,0.895157,0.564080,0.238093
...,...,...,...,...,...
345,0.298504,0.245950,0.912893,0.197826,0.749297
346,0.570009,0.622063,0.282763,0.143820,0.086370
347,0.169717,0.754974,0.580899,0.289374,0.944199
348,0.269581,0.483395,0.581462,0.092624,0.687910


This generates a DataFrame with 350 rows and 5 columns. Because it is too large to display in full, pandas shows only the first 5 and last 5 rows.

In [28]:
type(newdf)

pandas.core.frame.DataFrame

This shows us that the given data structure is a DataFrame.

In [29]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.177624,0.778557,0.20034,0.195448,0.363447
1,0.526906,0.893492,0.512165,0.714076,0.949962
2,0.790889,0.501086,0.507695,0.594873,0.074365
3,0.13005,0.883883,0.189677,0.210901,0.184302
4,0.334504,0.991274,0.895157,0.56408,0.238093


With no argument, head() gives us the first 5 rows of the DataFrame.

In [30]:
newdf.tail()

Unnamed: 0,0,1,2,3,4
345,0.298504,0.24595,0.912893,0.197826,0.749297
346,0.570009,0.622063,0.282763,0.14382,0.08637
347,0.169717,0.754974,0.580899,0.289374,0.944199
348,0.269581,0.483395,0.581462,0.092624,0.68791
349,0.775336,0.217291,0.225235,0.426016,0.061213


With no argument, tail() gives us the last 5 rows of the DataFrame.

In [31]:
newdf.dtypes

0    float64
1    float64
2    float64
3    float64
4    float64
dtype: object

This shows the data type of each column (a DataFrame can hold a different data type in each column).

In [32]:
newdf[0][0] = "Divyansh"

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf[0][0] = "Divyansh"


In [33]:
newdf.dtypes

0     object
1    float64
2    float64
3    float64
4    float64
dtype: object

By explicitly assigning a string to one element, we've changed the data type of the whole column to object.

Again, the chained assignment gives us a warning.

In [34]:
newdf.index

Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
       ...
       340, 341, 342, 343, 344, 345, 346, 347, 348, 349],
      dtype='int64', length=350)

This gives us the index labels of the DataFrame.

In [35]:
newdf.columns

RangeIndex(start=0, stop=5, step=1)

The columns are shown as a RangeIndex because the labels follow a regular pattern: starting at 0, stopping before 5, in steps of 1.

In [36]:
newdf.to_numpy()

array([['Divyansh', 0.7785571767870485, 0.20033972908504405,
        0.19544763732274417, 0.36344650396483236],
       [0.5269062788958148, 0.8934922637863206, 0.512165233410343,
        0.7140762188484645, 0.9499620892046281],
       [0.7908894604084982, 0.5010861548154303, 0.5076949810864374,
        0.5948730795539721, 0.07436525587632659],
       ...,
       [0.16971667589374007, 0.7549743806023697, 0.5808987233345423,
        0.2893737027774217, 0.944198755511609],
       [0.2695810183186621, 0.4833953727923507, 0.581461517693331,
        0.09262426902825827, 0.6879103416881936],
       [0.77533565101292, 0.2172907120105969, 0.22523480049513878,
        0.4260155015203084, 0.06121344764817793]],
      shape=(350, 5), dtype=object)

The array's data type is object, because column 0 now holds a string and to_numpy() needs one common type for all columns.

Let's try changing that value back to a numerical value.

In [37]:
newdf[0][0] = 0.675384234

In [38]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.675384,0.778557,0.20034,0.195448,0.363447
1,0.526906,0.893492,0.512165,0.714076,0.949962
2,0.790889,0.501086,0.507695,0.594873,0.074365
3,0.13005,0.883883,0.189677,0.210901,0.184302
4,0.334504,0.991274,0.895157,0.56408,0.238093


In [39]:
newdf.to_numpy()

array([[0.675384234, 0.7785571767870485, 0.20033972908504405,
        0.19544763732274417, 0.36344650396483236],
       [0.5269062788958148, 0.8934922637863206, 0.512165233410343,
        0.7140762188484645, 0.9499620892046281],
       [0.7908894604084982, 0.5010861548154303, 0.5076949810864374,
        0.5948730795539721, 0.07436525587632659],
       ...,
       [0.16971667589374007, 0.7549743806023697, 0.5808987233345423,
        0.2893737027774217, 0.944198755511609],
       [0.2695810183186621, 0.4833953727923507, 0.581461517693331,
        0.09262426902825827, 0.6879103416881936],
       [0.77533565101292, 0.2172907120105969, 0.22523480049513878,
        0.4260155015203084, 0.06121344764817793]],
      shape=(350, 5), dtype=object)

We changed the value back to a number, but that doesn't change the data type of the column: it stays object.
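
If the column should become numeric again, the dtype has to be converted explicitly. The sketch below shows one possible way using `pd.to_numeric` on a made-up single-column frame; the `demo` name and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the column is cast to object to mimic the situation above
demo = pd.DataFrame({"x": [0.1, 0.2, 0.3]})
demo["x"] = demo["x"].astype(object)

demo.loc[0, "x"] = "Divyansh"   # a string in an object column
demo.loc[0, "x"] = 0.5          # back to a number, but dtype is unchanged
dtype_before = demo["x"].dtype
print(dtype_before)             # object

# Explicit conversion restores a numeric dtype
demo["x"] = pd.to_numeric(demo["x"])
dtype_after = demo["x"].dtype
print(dtype_after)              # float64
```

`demo["x"].astype(float)` would work here as well; both fail with an error if a non-numeric value is still present in the column.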

In [40]:
newdf.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,340,341,342,343,344,345,346,347,348,349
0,0.675384,0.526906,0.790889,0.13005,0.334504,0.779684,0.187345,0.513313,0.045101,0.850313,...,0.233594,0.545747,0.823616,0.208984,0.496077,0.298504,0.570009,0.169717,0.269581,0.775336
1,0.778557,0.893492,0.501086,0.883883,0.991274,0.759,0.202162,0.380429,0.87433,0.621139,...,0.406595,0.415836,0.210776,0.367731,0.134423,0.24595,0.622063,0.754974,0.483395,0.217291
2,0.20034,0.512165,0.507695,0.189677,0.895157,0.647079,0.638941,0.813191,0.290176,0.869268,...,0.182784,0.181924,0.823729,0.62982,0.364604,0.912893,0.282763,0.580899,0.581462,0.225235
3,0.195448,0.714076,0.594873,0.210901,0.56408,0.273708,0.878056,0.18797,0.782136,0.99163,...,0.45199,0.04565,0.348131,0.738736,0.904877,0.197826,0.14382,0.289374,0.092624,0.426016
4,0.363447,0.949962,0.074365,0.184302,0.238093,0.015351,0.330483,0.691766,0.059789,0.998106,...,0.343222,0.218238,0.004299,0.986244,0.146734,0.749297,0.08637,0.944199,0.68791,0.061213


In [41]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.675384,0.778557,0.20034,0.195448,0.363447
1,0.526906,0.893492,0.512165,0.714076,0.949962
2,0.790889,0.501086,0.507695,0.594873,0.074365
3,0.13005,0.883883,0.189677,0.210901,0.184302
4,0.334504,0.991274,0.895157,0.56408,0.238093


In [43]:
newdf.sort_index(axis = 0, ascending = False)

Unnamed: 0,0,1,2,3,4
349,0.775336,0.217291,0.225235,0.426016,0.061213
348,0.269581,0.483395,0.581462,0.092624,0.687910
347,0.169717,0.754974,0.580899,0.289374,0.944199
346,0.570009,0.622063,0.282763,0.143820,0.086370
345,0.298504,0.245950,0.912893,0.197826,0.749297
...,...,...,...,...,...
4,0.334504,0.991274,0.895157,0.564080,0.238093
3,0.13005,0.883883,0.189677,0.210901,0.184302
2,0.790889,0.501086,0.507695,0.594873,0.074365
1,0.526906,0.893492,0.512165,0.714076,0.949962


This sorts the rows in descending order of their index labels.

In [44]:
newdf.sort_index(axis = 1, ascending = False)

Unnamed: 0,4,3,2,1,0
0,0.363447,0.195448,0.200340,0.778557,0.675384
1,0.949962,0.714076,0.512165,0.893492,0.526906
2,0.074365,0.594873,0.507695,0.501086,0.790889
3,0.184302,0.210901,0.189677,0.883883,0.13005
4,0.238093,0.564080,0.895157,0.991274,0.334504
...,...,...,...,...,...
345,0.749297,0.197826,0.912893,0.245950,0.298504
346,0.086370,0.143820,0.282763,0.622063,0.570009
347,0.944199,0.289374,0.580899,0.754974,0.169717
348,0.687910,0.092624,0.581462,0.483395,0.269581


This sorts the columns in descending order of their labels.
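
On a 350-row frame the effect of `axis` is hard to see at a glance. The sketch below repeats both calls on a tiny made-up frame and also checks that `sort_index` returns a new object rather than modifying the original.

```python
import pandas as pd

# Hypothetical frame with shuffled row labels and unsorted column names
demo = pd.DataFrame({"b": [1, 2, 3], "a": [4, 5, 6]}, index=[2, 0, 1])

rows_desc = demo.sort_index(axis=0, ascending=False)  # reorders rows
cols_desc = demo.sort_index(axis=1, ascending=False)  # reorders columns

print(rows_desc.index.tolist())    # [2, 1, 0]
print(cols_desc.columns.tolist())  # ['b', 'a']

# The original frame keeps its order
print(demo.index.tolist())         # [2, 0, 1]
```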

In [45]:
type(newdf[0])

pandas.core.series.Series

This confirms that each column of a DataFrame is a Series.

In [46]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.675384,0.778557,0.20034,0.195448,0.363447
1,0.526906,0.893492,0.512165,0.714076,0.949962
2,0.790889,0.501086,0.507695,0.594873,0.074365
3,0.13005,0.883883,0.189677,0.210901,0.184302
4,0.334504,0.991274,0.895157,0.56408,0.238093


In [47]:
newdf2 = newdf

In [48]:
newdf2[0][0] = 54321

In [49]:
newdf

Unnamed: 0,0,1,2,3,4
0,54321,0.778557,0.200340,0.195448,0.363447
1,0.526906,0.893492,0.512165,0.714076,0.949962
2,0.790889,0.501086,0.507695,0.594873,0.074365
3,0.13005,0.883883,0.189677,0.210901,0.184302
4,0.334504,0.991274,0.895157,0.564080,0.238093
...,...,...,...,...,...
345,0.298504,0.245950,0.912893,0.197826,0.749297
346,0.570009,0.622063,0.282763,0.143820,0.086370
347,0.169717,0.754974,0.580899,0.289374,0.944199
348,0.269581,0.483395,0.581462,0.092624,0.687910


This shows that newdf2 = newdf doesn't copy the DataFrame. Both names refer to the same object, so any change made through newdf2 also shows up in newdf.

In [50]:
newdf3 = newdf.copy()

In [51]:
newdf3[0][0] = 9

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf3[0][0] = 9
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  newdf3[0][0] = 9


In [52]:
newdf3.head()

Unnamed: 0,0,1,2,3,4
0,9.0,0.778557,0.20034,0.195448,0.363447
1,0.526906,0.893492,0.512165,0.714076,0.949962
2,0.790889,0.501086,0.507695,0.594873,0.074365
3,0.13005,0.883883,0.189677,0.210901,0.184302
4,0.334504,0.991274,0.895157,0.56408,0.238093


In [53]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,54321.0,0.778557,0.20034,0.195448,0.363447
1,0.526906,0.893492,0.512165,0.714076,0.949962
2,0.790889,0.501086,0.507695,0.594873,0.074365
3,0.13005,0.883883,0.189677,0.210901,0.184302
4,0.334504,0.991274,0.895157,0.56408,0.238093


In this case the original DataFrame is not altered, because .copy() creates an independent copy instead of a second name for the same DataFrame.
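
The two cases can be checked side by side with Python's `is` operator. This is a small sketch on a made-up frame, using `.loc` for the assignments so that no chained-assignment warning is involved; the names `original`, `alias` and `independent` are illustrative.

```python
import pandas as pd

# Hypothetical one-column frame
original = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

print(alias is original)        # True
print(independent is original)  # False

alias.loc[0, "x"] = 54321.0     # visible through `original`
independent.loc[1, "x"] = 9.0   # not visible through `original`

print(original["x"].tolist())   # [54321.0, 2.0, 3.0]
```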