In [1]:
import numpy as np
import pandas as pd

# Data Structures in Pandas

Pandas mainly provides two core data structures, built on top of NumPy arrays:

1. Series :
    - It is a one-dimensional labeled array
    - It can hold data of any type
    - Each element is paired with an index label

In [2]:
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

In [3]:
s

a    10
b    20
c    30
dtype: int64
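
As a small sketch of how the index labels are used (the variable name `scores` is illustrative, not part of the notebook), a value can be looked up either by its label or by its position:

```python
import pandas as pd

# Illustrative Series; the labels "a", "b", "c" form the index
scores = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = scores["b"]         # label-based lookup
by_position = scores.iloc[1]   # position-based lookup
labels = list(scores.index)    # the index labels themselves

print(by_label, by_position, labels)
```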

2. DataFrame
    - It is a two-dimensional labeled data structure (with rows and columns)
    - Each column in a DataFrame is a Series object
    - It is heterogeneous, i.e. each column can hold a different type of data

In [4]:
dict1 = {
    "name": ['divyansh', 'raj', 'himanshu', 'unnati', 'vivan', 'honey'],
    "marks": [21, 22, 23, 24, 25, 26],
    "city": ['azamgarh', 'kanpur', 'ambedkarnagar', 'azamgarh', 'ambedkarnagar', 'ambedkarnagar']
}

In [5]:
df = pd.DataFrame(dict1)

In [6]:
df

Unnamed: 0,name,marks,city
0,divyansh,21,azamgarh
1,raj,22,kanpur
2,himanshu,23,ambedkarnagar
3,unnati,24,azamgarh
4,vivan,25,ambedkarnagar
5,honey,26,ambedkarnagar


**pd.DataFrame(dict1)** calls the pandas constructor to convert that dictionary into a tabular format.
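
The two properties listed for a DataFrame (each column is a Series, and columns may differ in type) can be checked directly. This is a minimal sketch; the frame `people` and its values are made up for illustration:

```python
import pandas as pd

# Small illustrative frame: one text column and one integer column
people = pd.DataFrame({"name": ["asha", "ravi"], "marks": [21, 22]})

name_column = people["name"]   # selecting a single column returns a Series
print(type(name_column))
print(people.dtypes)           # one dtype per column
```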

In [7]:
df.to_csv('cousins.csv')

This line writes the DataFrame to a CSV file named `cousins.csv`.

By default the index is written to the file as an extra first column. To leave it out, pass `index=False`.

In [8]:
df.to_csv('cousins_no_index.csv', index=False)

This gives us a file without the index column.
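
To see the difference without creating files, here is a sketch that relies on `to_csv()` returning the CSV text when no path is given. The frame and the variable names are illustrative:

```python
import io
import pandas as pd

frame = pd.DataFrame({"name": ["asha", "ravi"], "marks": [21, 22]})

# With no path argument, to_csv() returns the CSV text instead of writing a file
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)

header_with = with_index.splitlines()[0]        # begins with an empty field for the index
header_without = without_index.splitlines()[0]
print(repr(header_with), repr(header_without))

# Reading the first version back turns the saved index into an ordinary column
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```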

In [9]:
df.head(2)

Unnamed: 0,name,marks,city
0,divyansh,21,azamgarh
1,raj,22,kanpur


This shows the first two rows of the DataFrame.

In [10]:
df.tail(2)

Unnamed: 0,name,marks,city
4,vivan,25,ambedkarnagar
5,honey,26,ambedkarnagar


This shows the last two rows of the DataFrame.
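
Both methods take an optional row count `n`. The sketch below, on a made-up ten-row frame, shows the default of 5 and the less obvious behaviour of a negative `n`:

```python
import pandas as pd

numbers = pd.DataFrame({"value": range(10)})

first_default = numbers.head()        # n defaults to 5
last_three = numbers.tail(3)          # last three rows
all_but_last_two = numbers.head(-2)   # negative n: every row except the last two

print(len(first_default), len(last_three), len(all_but_last_two))
```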

In [11]:
df.iloc[1:3]

Unnamed: 0,name,marks,city
1,raj,22,kanpur
2,himanshu,23,ambedkarnagar


`iloc` selects rows by integer position, from one position up to another; the stop position is excluded.

In [12]:
df.loc[1:3]

Unnamed: 0,name,marks,city
1,raj,22,kanpur
2,himanshu,23,ambedkarnagar
3,unnati,24,azamgarh


`loc` selects rows by index label instead, and here the stop label is included.
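
With the default 0, 1, 2, ... index, positions and labels coincide, which can hide the difference. A sketch with string labels (the frame and its labels are illustrative) makes it easier to see:

```python
import pandas as pd

frame = pd.DataFrame({"marks": [21, 22, 23, 24]}, index=["w", "x", "y", "z"])

by_position = frame.iloc[1:3]   # positions 1 and 2; the stop position is excluded
by_label = frame.loc["x":"z"]   # labels "x" through "z"; the stop label is included

print(list(by_position.index))
print(list(by_label.index))
```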

In [13]:
df.describe()

Unnamed: 0,marks
count,6.0
mean,23.5
std,1.870829
min,21.0
25%,22.25
50%,23.5
75%,24.75
max,26.0


This generates summary statistics for the DataFrame (only numeric columns are included by default).

Here is what each row means:

- **count** → number of non-null entries
- **mean** → average value
- **std** → sample standard deviation (spread of values)
- **min** → smallest value
- **25%** → 1st quartile (25% of values below this)
- **50%** → median (middle value)
- **75%** → 3rd quartile (75% of values below this)
- **max** → largest value
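
As a rough check of these definitions, the percentile rows can be reproduced with `quantile()`, which interpolates linearly by default. The sketch reuses the `marks` values from the example; the `mixed` frame is illustrative and shows `include="all"`, which also summarises non-numeric columns:

```python
import pandas as pd

marks = pd.Series([21, 22, 23, 24, 25, 26])   # same values as the marks column

summary = marks.describe()
q1 = marks.quantile(0.25)
q3 = marks.quantile(0.75)
print(summary["mean"], summary["50%"], q1, q3)

# include="all" adds rows such as unique, top and freq for text columns
mixed = pd.DataFrame({"city": ["azamgarh", "kanpur", "azamgarh"], "marks": [21, 22, 24]})
full_summary = mixed.describe(include="all")
print(full_summary)
```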

In [14]:
tr = pd.read_csv('trains.csv')

In [15]:
tr

Unnamed: 0.1,Unnamed: 0,Source,Destination,Train_Name,Speed_kmph,Distance_km
0,0,New Delhi,Mumbai Central,Rajdhani Express,100,1384
1,1,Howrah,Chennai,Coromandel Express,80,1659
2,2,Bangalore City,Hyderabad,Kacheguda Express,70,707
3,3,Ahmedabad,Jaipur,Aravalli Express,65,632
4,4,Kolkata,Patna,Shatabdi Express,85,532


`read_csv` reads a CSV file (here one in the current folder) into a DataFrame, which we can then assign to a variable.
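
When a CSV file was saved together with its index, `read_csv` loads that index as a column named `Unnamed: 0`. One way to avoid this is the `index_col` parameter. The sketch below uses an in-memory string in place of a file; `csv_text` is a shortened, illustrative stand-in for the real `trains.csv`:

```python
import io
import pandas as pd

# CSV text saved with its index: note the empty first field in the header
csv_text = ",Source,Speed_kmph\n0,New Delhi,100\n1,Howrah,80\n"

plain = pd.read_csv(io.StringIO(csv_text))                 # saved index becomes a data column
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column is used as the index

print(list(plain.columns))
print(list(indexed.columns))
```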

In [16]:
tr['Source']

0         New Delhi
1            Howrah
2    Bangalore City
3         Ahmedabad
4           Kolkata
Name: Source, dtype: object

We can select a single column by its name; the result is a Series.

In [17]:
tr['Distance_km'][1]

np.int64(1659)

We can also get a single element by giving the column name and then the row label.

The same indexing can be used to change an element:

In [18]:
tr['Speed_kmph'][0] = 100

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  tr['Speed_kmph'][0] = 100
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tr['Speed_kmph'][0] = 100


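This code triggers warnings about chained assignment: the value is set on an intermediate object that may be a copy of the data. As the warning says, `tr.loc[0, 'Speed_kmph'] = 100` performs the assignment in a single step.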

In [19]:
tr

Unnamed: 0.1,Unnamed: 0,Source,Destination,Train_Name,Speed_kmph,Distance_km
0,0,New Delhi,Mumbai Central,Rajdhani Express,100,1384
1,1,Howrah,Chennai,Coromandel Express,80,1659
2,2,Bangalore City,Hyderabad,Kacheguda Express,70,707
3,3,Ahmedabad,Jaipur,Aravalli Express,65,632
4,4,Kolkata,Patna,Shatabdi Express,85,532


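In this pandas version the assignment still takes effect (the table looks unchanged only because the value was already 100), but with Copy-on-Write enabled it would no longer update `tr`.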
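
A minimal sketch of the single-step assignment that the warning recommends. The frame `trains` is a cut-down stand-in for `tr`, and the new speeds (90 and 110) are made-up values for illustration:

```python
import pandas as pd

trains = pd.DataFrame(
    {"Train_Name": ["Rajdhani Express", "Coromandel Express"], "Speed_kmph": [100, 80]}
)

# One indexing step: row label 1, column "Speed_kmph"
trains.loc[1, "Speed_kmph"] = 90

# .at is a scalar-only accessor for a single cell
trains.at[0, "Speed_kmph"] = 110

print(trains)
```

Because the row and column are given in one indexer, pandas writes straight into `trains` and no intermediate object is involved.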

In [20]:
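tr.to_csv('trains.csv', index=False)

This updates the original CSV file. Passing `index=False` keeps the row index from being written as yet another column; the `Unnamed: 0` column in the table above most likely comes from an earlier save that included the index.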

In [21]:
tr.index = ['I', 'II', 'III', 'IV', 'v']

In [22]:
tr

Unnamed: 0.1,Unnamed: 0,Source,Destination,Train_Name,Speed_kmph,Distance_km
I,0,New Delhi,Mumbai Central,Rajdhani Express,100,1384
II,1,Howrah,Chennai,Coromandel Express,80,1659
III,2,Bangalore City,Hyderabad,Kacheguda Express,70,707
IV,3,Ahmedabad,Jaipur,Aravalli Express,65,632
v,4,Kolkata,Patna,Shatabdi Express,85,532


This shows that index labels don't have to be numbers.
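
Once the index holds custom labels, `loc` uses those labels while `iloc` keeps working by position. A short sketch on a reduced, illustrative version of the table:

```python
import pandas as pd

trains = pd.DataFrame(
    {"Source": ["New Delhi", "Howrah", "Bangalore City"]},
    index=["I", "II", "III"],
)

by_label = trains.loc["II", "Source"]    # lookup by index label
by_position = trains.iloc[1, 0]          # second row, first column
restored = trains.reset_index(drop=True) # back to the default 0, 1, 2 labels

print(by_label, by_position, list(restored.index))
```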

# Understanding the difference between Series and DataFrame

In [23]:
ser = pd.Series(np.random.rand(10))

In [24]:
ser

0    0.300538
1    0.793748
2    0.694569
3    0.590925
4    0.344477
5    0.648137
6    0.936831
7    0.745776
8    0.295927
9    0.788378
dtype: float64

We have generated a Series of 10 random floating-point numbers, drawn uniformly from [0, 1).

In [25]:
type(ser)

pandas.core.series.Series

This confirms that the object is a Series.

In [26]:
newdf = pd.DataFrame(np.random.rand(350, 5), index=np.arange(350))

In [27]:
newdf

Unnamed: 0,0,1,2,3,4
0,0.175510,0.841241,0.810361,0.014484,0.776411
1,0.811915,0.761082,0.355677,0.345850,0.787269
2,0.770055,0.833072,0.477042,0.231271,0.182973
3,0.241356,0.893818,0.339776,0.110377,0.662872
4,0.651163,0.217490,0.332754,0.160868,0.941651
...,...,...,...,...,...
345,0.594371,0.291743,0.259317,0.007656,0.661942
346,0.105105,0.939278,0.326849,0.727353,0.632294
347,0.728748,0.667759,0.297731,0.079681,0.896211
348,0.478756,0.997045,0.822389,0.882927,0.296244


This generates a DataFrame with the given dimensions (350 rows, 5 columns). Because it is so large, displaying every row would be pointless, so pandas shows only the first 5 and last 5 rows.

In [28]:
type(newdf)

pandas.core.frame.DataFrame

This confirms that the object is a DataFrame.

In [29]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.17551,0.841241,0.810361,0.014484,0.776411
1,0.811915,0.761082,0.355677,0.34585,0.787269
2,0.770055,0.833072,0.477042,0.231271,0.182973
3,0.241356,0.893818,0.339776,0.110377,0.662872
4,0.651163,0.21749,0.332754,0.160868,0.941651


This gives us the first 5 rows of the DataFrame.

In [30]:
newdf.tail()

Unnamed: 0,0,1,2,3,4
345,0.594371,0.291743,0.259317,0.007656,0.661942
346,0.105105,0.939278,0.326849,0.727353,0.632294
347,0.728748,0.667759,0.297731,0.079681,0.896211
348,0.478756,0.997045,0.822389,0.882927,0.296244
349,0.080205,0.827715,0.237541,0.572669,0.28862


This gives us the last 5 rows of the DataFrame.

In [31]:
newdf.dtypes

0    float64
1    float64
2    float64
3    float64
4    float64
dtype: object

This shows the data type of each column (a DataFrame can hold a different data type in each column).

In [32]:
newdf[0][0] = "Divyansh"

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf[0][0] = "Divyansh"
  newdf[0][0] = "Divyansh"


In [33]:
newdf.dtypes

0     object
1    float64
2    float64
3    float64
4    float64
dtype: object

By explicitly setting one element to a string, we have changed the data type of the whole column to object.

Again, this chained assignment gives us a warning.
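
One way to make the type change deliberate is to cast the column first and then assign with a single `loc` call. This is a sketch under the assumption that an object column is really what is wanted; the small `frame` stands in for `newdf`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.zeros((3, 2)))   # two float64 columns labelled 0 and 1

# Cast the column first so that storing a string is an explicit decision
frame[0] = frame[0].astype(object)
frame.loc[0, 0] = "Divyansh"             # row label 0, column label 0

print(frame.dtypes)
```

Only column 0 changes type; column 1 stays `float64`.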

In [34]:
newdf.index

Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
       ...
       340, 341, 342, 343, 344, 345, 346, 347, 348, 349],
      dtype='int64', length=350)

This gives us the index (the row labels) of the DataFrame.

In [35]:
newdf.columns

RangeIndex(start=0, stop=5, step=1)

The column labels are displayed as a `RangeIndex` because they follow a regular pattern: starting at 0, stopping before 5, and moving one step at a time.
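
The row index above is printed as a plain `Index` rather than a `RangeIndex` because it was built from an explicit `np.arange` array. A small sketch of the contrast, using illustrative 3x2 frames:

```python
import numpy as np
import pandas as pd

default_axes = pd.DataFrame(np.zeros((3, 2)))                        # no labels supplied
explicit_index = pd.DataFrame(np.zeros((3, 2)), index=np.arange(3))  # labels from an array

print(type(default_axes.index).__name__)     # default row labels
print(type(explicit_index.index).__name__)   # row labels built from the array
print(list(default_axes.columns))            # the range expands to its labels
```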