# Pandas Series

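1. It is a one-dimensional array.
2. It works like a single column in an Excel sheet.
3. It can have a labeled index.
4. It supports a multi-level index (MultiIndex).
5. It can be built from a dictionary; the keys become the index, in insertion order.
6. It supports many kinds of data, such as object, string, int, boolean, float (float64 is the double-precision type), and 8-bit integers (bytes).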

Why work with a pandas Series:

1. It represents the records of a single column.
2. It enables fast, vectorized operations on the data.
3. It supports a labeled index, so slicing is easy to perform.
4. It serves as the building block for a DataFrame.
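
The points above can be illustrated with a minimal sketch. The substation labels and load values are made up for the example, and the names `load`, `load_kw`, `subset`, and `frame` are assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical load readings, labeled by substation name
load = pd.Series([50, 65, 55], index=["S1", "S2", "S3"], name="load_mw")

# Vectorized arithmetic: the whole column is converted without an explicit loop
load_kw = load * 1000

# Label-based slicing includes both endpoints
subset = load.loc["S1":"S2"]

# A Series can be turned into a one-column DataFrame
frame = load.to_frame()

print(load_kw)
print(subset)
print(frame.shape)
```

Note that a label slice such as `"S1":"S2"` keeps the end label, unlike a positional slice.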

# Create Pandas Series

In [4]:
import pandas as pd

In [5]:
#Create series from list
l = [10, 20, 30]
s = pd.Series(l)

In [6]:
s

0    10
1    20
2    30
dtype: int64

In [7]:
data_tuple = (101, 102, 103, 104)
s = pd.Series(data_tuple)

In [8]:
s

0    101
1    102
2    103
3    104
dtype: int64

In [9]:
data_dict = {'A': 100,'B': 200,'C':300}
s_dict = pd.Series(data_dict)
print(s_dict)

A    100
B    200
C    300
dtype: int64


In [10]:
import numpy as np
data_arr = np.array([1, 2, 3, 4, 5, 6, 7])
s = pd.Series(data_arr)
s

0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int32

In [11]:
value = [11, 22, 33, 44, 55]
indexv = ['a', 'b', 'c', 'd', 'e']
s = pd.Series(value, index = indexv)
s

a    11
b    22
c    33
d    44
e    55
dtype: int64

In [12]:
print(s.iloc[0])   # by position
print(s['a'])      # by label
print(s.iloc[-5])  # by position, counting from the end

11
11
11


In [13]:
s.loc['e']

55

In [14]:
s.iloc[2]

33
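
The difference between `loc` and `iloc` matters most when the index labels are themselves integers. The following is a small sketch under that assumption; the Series `readings` and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical Series whose labels are integers that do not match the positions
readings = pd.Series([11, 22, 33], index=[10, 20, 30])

by_label = readings.loc[10]      # label 10 -> first value
by_position = readings.iloc[0]   # position 0 -> first value
last_value = readings.iloc[-1]   # negative positions count from the end

# Label 0 does not exist, so a label-based lookup raises KeyError
try:
    readings.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, last_value, label_zero_found)
```

Using `loc` for labels and `iloc` for positions avoids any ambiguity about which of the two a bare `s[...]` lookup means.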

# Pandas DataFrame

- A pandas DataFrame is a two-dimensional, heterogeneous, tabular data structure, similar to a spreadsheet, an in-memory SQL table, or a dictionary of Series objects.
- It represents each individual record as a row.
- Each record is identified by its index value.
- The index is labeled.
- It represents each column as a variable.
- Because it is heterogeneous, different columns can hold different data types, such as object, string, and int.

In [15]:
data = {
    'Substation': ['S1', 'S2', 'S3'],
    'Voltage (kV)': [220, 230, 225],
    'Load (MW)': [50, 65,55]
}
df = pd.DataFrame(data)
df

Unnamed: 0,Substation,Voltage (kV),Load (MW)
0,S1,220,50
1,S2,230,65
2,S3,225,55


In [16]:
data = [['S1', 220, 50], ['S2', 230, 65]]
df = pd.DataFrame(data, columns = ['Substation', 'Voltage (kV)', 'Load (MW)'])
df

Unnamed: 0,Substation,Voltage (kV),Load (MW)
0,S1,220,50
1,S2,230,65


In [17]:
data = [{'Substation': 'S1', 'Voltage': 220}, {'Substation': 'S2', 'Voltage': 230}]
df = pd.DataFrame(data)
df

Unnamed: 0,Substation,Voltage
0,S1,220
1,S2,230


In [18]:
import numpy as np

arr = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(arr, columns=['A', 'B'])
df

Unnamed: 0,A,B
0,1,2
1,3,4


In [19]:
s1 = pd.Series([100, 200], index = ['S1', 'S2'])
s2 = pd.Series([5, 10], index = ['S1', 'S2'])
df = pd.DataFrame({'Voltage': s1, 'Current': s2})
df

Unnamed: 0,Voltage,Current
S1,100,5
S2,200,10
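
When the Series passed to `pd.DataFrame` do not share exactly the same labels, pandas aligns them on the union of the labels and fills the gaps with `NaN`. A minimal sketch, with hypothetical values and the assumed names `voltage`, `current`, and `df_aligned`:

```python
import pandas as pd

# Two Series whose indexes only partly overlap
voltage = pd.Series([100, 200], index=["S1", "S2"])
current = pd.Series([5, 10], index=["S2", "S3"])

# Rows are matched by label, not by position
df_aligned = pd.DataFrame({"Voltage": voltage, "Current": current})
print(df_aligned)
```

Here `S1` has no current and `S3` has no voltage, so those cells are missing and both columns become float.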


In [20]:
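# read_clipboard parses whatever text is currently on the system clipboard, so the result depends on what was copied last
data = pd.read_clipboard()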

In [21]:
data

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Exercise,5:,Reshape,with,Unknown,Dimension
Explanation:,NumPy,infers,second,dimension,(35,columns),Use,case:,Useful,when,only,one,dimension,is,known


In [22]:
data = pd.read_csv('cust.csv')

In [23]:
data

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card
1,1002,Priya Desai,Smartphone,Electronics,2,18000,36000,2025-06-02,Pune,UPI
2,1003,Rahul Mehta,Office Chair,Furniture,1,7500,7500,2025-06-03,Delhi,Net Banking
3,1004,Sneha Iyer,Notebook,Stationery,5,50,250,2025-06-03,Bangalore,Cash
4,1005,Karan Patel,Desk,Furniture,1,12000,12000,2025-06-04,Ahmedabad,Credit Card
5,1006,Neha Verma,Pen,Stationery,10,20,200,2025-06-05,Mumbai,UPI
6,1007,Arjun Rao,Tablet,Electronics,1,25000,25000,2025-06-06,Hyderabad,Debit Card
7,1008,Divya Shah,Bookshelf,Furniture,1,8500,8500,2025-06-07,Pune,Cash
8,1009,Manish Gupta,Monitor,Electronics,2,12000,24000,2025-06-08,Delhi,Net Banking
9,1010,Anjali Nair,Printer,Electronics,1,15000,15000,2025-06-09,Bangalore,Credit Card


In [24]:
datajson = pd.read_json('data.json')
datajson

Unnamed: 0,substation_id,timestamp,voltage_kv,load_mw,status
0,S1,2025-07-01 00:00:00+00:00,220,45.2,Normal
1,S1,2025-07-01 01:00:00+00:00,219,47.5,Normal
2,S2,2025-07-01 00:00:00+00:00,230,51.0,Overload
3,S2,2025-07-01 01:00:00+00:00,231,49.8,Normal
4,S3,2025-07-01 00:00:00+00:00,224,40.6,Normal


In [25]:
# Fetch records from MySQL into a DataFrame.
# General pattern:
# conn = mysql.connector.connect(host="", user="", password="", database="")
# df = pd.read_sql_query("Select * from employees", conn)

import mysql.connector
conn = mysql.connector.connect(
        host="localhost",
        user="root",
        password="root",
        database="godigitaldb"
    )

df = pd.read_sql_query("Select * from employee", conn)
df


Unnamed: 0,id,name,age,salary
0,2,Sunny,22,15000.0
1,3,Yukta,22,15000.0
2,5,Garima,21,19000.0
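
The same `read_sql_query` pattern can be tried without a MySQL server. The sketch below swaps in Python's built-in `sqlite3` with an in-memory database (an assumption made only so the example is self-contained) and reuses the three employee rows shown above. The `params` argument passes the filter value separately from the SQL text.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the MySQL server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, age INTEGER, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(2, "Sunny", 22, 15000.0), (3, "Yukta", 22, 15000.0), (5, "Garima", 21, 19000.0)],
)
conn.commit()

# Parameterized query: the ? placeholder is filled from params
older = pd.read_sql_query(
    "SELECT * FROM employee WHERE age > ? ORDER BY id", conn, params=(21,)
)
conn.close()

print(older)
```

Placeholder syntax differs between drivers (`?` for sqlite3, `%s` for mysql-connector), so the query string needs adjusting when switching databases.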


In [26]:
dataexcel = pd.read_excel('powerpulse_etl.xlsx', sheet_name='Sheet1')
dataexcel

Unnamed: 0,period,stateId,stateDescription,producertypeid,producerTypeDescription,energysourceid,energySourceDescription
0,2022,TX,Texas,TOT,All sectors,PETST,Petroleum - ST
1,2022,TX,Texas,TOT,All sectors,SOL,Solar
2,2022,TX,Texas,TOT,All sectors,SOLPV,Solar - PV
3,2022,TX,Texas,TOT,All sectors,WND,Wind
4,2022,TX,Texas,TOT,All sectors,WOO,Wood
...,...,...,...,...,...,...,...
4995,2021,LA,Louisiana,TOT,All sectors,PET,Petroleum
4996,2021,LA,Louisiana,TOT,All sectors,PETGT,Petroleum - GT
4997,2021,LA,Louisiana,TOT,All sectors,PETIC,Petroleum - IC
4998,2021,LA,Louisiana,TOT,All sectors,PETST,Petroleum - ST


In [27]:
data

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card
1,1002,Priya Desai,Smartphone,Electronics,2,18000,36000,2025-06-02,Pune,UPI
2,1003,Rahul Mehta,Office Chair,Furniture,1,7500,7500,2025-06-03,Delhi,Net Banking
3,1004,Sneha Iyer,Notebook,Stationery,5,50,250,2025-06-03,Bangalore,Cash
4,1005,Karan Patel,Desk,Furniture,1,12000,12000,2025-06-04,Ahmedabad,Credit Card
5,1006,Neha Verma,Pen,Stationery,10,20,200,2025-06-05,Mumbai,UPI
6,1007,Arjun Rao,Tablet,Electronics,1,25000,25000,2025-06-06,Hyderabad,Debit Card
7,1008,Divya Shah,Bookshelf,Furniture,1,8500,8500,2025-06-07,Pune,Cash
8,1009,Manish Gupta,Monitor,Electronics,2,12000,24000,2025-06-08,Delhi,Net Banking
9,1010,Anjali Nair,Printer,Electronics,1,15000,15000,2025-06-09,Bangalore,Credit Card


In [28]:
# to check shape of dataframe
print(data.shape)

(10, 10)


In [29]:
data.columns

Index(['Order_ID', 'Customer_Name', 'Product', 'Category', 'Quantity',
       'Unit_Price', 'Total_Amount', 'Order_Date', 'City', 'Payment_Method'],
      dtype='object')

In [30]:
data.index

RangeIndex(start=0, stop=10, step=1)

In [31]:
data.head()

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card
1,1002,Priya Desai,Smartphone,Electronics,2,18000,36000,2025-06-02,Pune,UPI
2,1003,Rahul Mehta,Office Chair,Furniture,1,7500,7500,2025-06-03,Delhi,Net Banking
3,1004,Sneha Iyer,Notebook,Stationery,5,50,250,2025-06-03,Bangalore,Cash
4,1005,Karan Patel,Desk,Furniture,1,12000,12000,2025-06-04,Ahmedabad,Credit Card


In [32]:
#last 5 records
data.tail()

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
5,1006,Neha Verma,Pen,Stationery,10,20,200,2025-06-05,Mumbai,UPI
6,1007,Arjun Rao,Tablet,Electronics,1,25000,25000,2025-06-06,Hyderabad,Debit Card
7,1008,Divya Shah,Bookshelf,Furniture,1,8500,8500,2025-06-07,Pune,Cash
8,1009,Manish Gupta,Monitor,Electronics,2,12000,24000,2025-06-08,Delhi,Net Banking
9,1010,Anjali Nair,Printer,Electronics,1,15000,15000,2025-06-09,Bangalore,Credit Card


In [33]:
data.head(3)

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card
1,1002,Priya Desai,Smartphone,Electronics,2,18000,36000,2025-06-02,Pune,UPI
2,1003,Rahul Mehta,Office Chair,Furniture,1,7500,7500,2025-06-03,Delhi,Net Banking


In [34]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Order_ID        10 non-null     int64 
 1   Customer_Name   10 non-null     object
 2   Product         10 non-null     object
 3   Category        10 non-null     object
 4   Quantity        10 non-null     int64 
 5   Unit_Price      10 non-null     int64 
 6   Total_Amount    10 non-null     int64 
 7   Order_Date      10 non-null     object
 8   City            10 non-null     object
 9   Payment_Method  10 non-null     object
dtypes: int64(4), object(6)
memory usage: 932.0+ bytes


In [35]:
data.describe()

Unnamed: 0,Order_ID,Quantity,Unit_Price,Total_Amount
count,10.0,10.0,10.0,10.0
mean,1005.5,2.5,15307.0,18345.0
std,3.02765,2.915476,15890.110167,17173.015816
min,1001.0,1.0,20.0,200.0
25%,1003.25,1.0,7750.0,7750.0
50%,1005.5,1.0,12000.0,13500.0
75%,1007.75,2.0,17250.0,24750.0
max,1010.0,10.0,55000.0,55000.0


In [36]:
data.dtypes

Order_ID           int64
Customer_Name     object
Product           object
Category          object
Quantity           int64
Unit_Price         int64
Total_Amount       int64
Order_Date        object
City              object
Payment_Method    object
dtype: object

# Pandas Dataframe Inspections

In [37]:
# shape returns a (number of rows, number of columns) tuple
data.shape

(10, 10)

In [38]:
# columns returns the column labels as an Index object (dtype object)
data.columns

Index(['Order_ID', 'Customer_Name', 'Product', 'Category', 'Quantity',
       'Unit_Price', 'Total_Amount', 'Order_Date', 'City', 'Payment_Method'],
      dtype='object')

In [39]:
data.index  # row index; a RangeIndex shows start, stop, and step

RangeIndex(start=0, stop=10, step=1)

In [40]:
#check dimension
data.ndim

2

In [41]:
# size is the total number of elements (rows * columns)
data.size

100

# Previewing data in pandas

In [42]:
data

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card
1,1002,Priya Desai,Smartphone,Electronics,2,18000,36000,2025-06-02,Pune,UPI
2,1003,Rahul Mehta,Office Chair,Furniture,1,7500,7500,2025-06-03,Delhi,Net Banking
3,1004,Sneha Iyer,Notebook,Stationery,5,50,250,2025-06-03,Bangalore,Cash
4,1005,Karan Patel,Desk,Furniture,1,12000,12000,2025-06-04,Ahmedabad,Credit Card
5,1006,Neha Verma,Pen,Stationery,10,20,200,2025-06-05,Mumbai,UPI
6,1007,Arjun Rao,Tablet,Electronics,1,25000,25000,2025-06-06,Hyderabad,Debit Card
7,1008,Divya Shah,Bookshelf,Furniture,1,8500,8500,2025-06-07,Pune,Cash
8,1009,Manish Gupta,Monitor,Electronics,2,12000,24000,2025-06-08,Delhi,Net Banking
9,1010,Anjali Nair,Printer,Electronics,1,15000,15000,2025-06-09,Bangalore,Credit Card


In [43]:
data.head()  # first 5 rows by default

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card
1,1002,Priya Desai,Smartphone,Electronics,2,18000,36000,2025-06-02,Pune,UPI
2,1003,Rahul Mehta,Office Chair,Furniture,1,7500,7500,2025-06-03,Delhi,Net Banking
3,1004,Sneha Iyer,Notebook,Stationery,5,50,250,2025-06-03,Bangalore,Cash
4,1005,Karan Patel,Desk,Furniture,1,12000,12000,2025-06-04,Ahmedabad,Credit Card


In [44]:
data.tail()

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
5,1006,Neha Verma,Pen,Stationery,10,20,200,2025-06-05,Mumbai,UPI
6,1007,Arjun Rao,Tablet,Electronics,1,25000,25000,2025-06-06,Hyderabad,Debit Card
7,1008,Divya Shah,Bookshelf,Furniture,1,8500,8500,2025-06-07,Pune,Cash
8,1009,Manish Gupta,Monitor,Electronics,2,12000,24000,2025-06-08,Delhi,Net Banking
9,1010,Anjali Nair,Printer,Electronics,1,15000,15000,2025-06-09,Bangalore,Credit Card


In [46]:
data.sample(4) #takes random samples

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card
3,1004,Sneha Iyer,Notebook,Stationery,5,50,250,2025-06-03,Bangalore,Cash
9,1010,Anjali Nair,Printer,Electronics,1,15000,15000,2025-06-09,Bangalore,Credit Card
7,1008,Divya Shah,Bookshelf,Furniture,1,8500,8500,2025-06-07,Pune,Cash


In [47]:
x = data['Customer_Name'].head(5)

In [48]:
x

0    Amit Sharma
1    Priya Desai
2    Rahul Mehta
3     Sneha Iyer
4    Karan Patel
Name: Customer_Name, dtype: object

# Summary or Metadata

In [49]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Order_ID        10 non-null     int64 
 1   Customer_Name   10 non-null     object
 2   Product         10 non-null     object
 3   Category        10 non-null     object
 4   Quantity        10 non-null     int64 
 5   Unit_Price      10 non-null     int64 
 6   Total_Amount    10 non-null     int64 
 7   Order_Date      10 non-null     object
 8   City            10 non-null     object
 9   Payment_Method  10 non-null     object
dtypes: int64(4), object(6)
memory usage: 932.0+ bytes


In [50]:
# descriptive statistics for the numeric columns
data.describe()

Unnamed: 0,Order_ID,Quantity,Unit_Price,Total_Amount
count,10.0,10.0,10.0,10.0
mean,1005.5,2.5,15307.0,18345.0
std,3.02765,2.915476,15890.110167,17173.015816
min,1001.0,1.0,20.0,200.0
25%,1003.25,1.0,7750.0,7750.0
50%,1005.5,1.0,12000.0,13500.0
75%,1007.75,2.0,17250.0,24750.0
max,1010.0,10.0,55000.0,55000.0


In [52]:
data.describe(include='all')

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
count,10.0,10,10,10,10.0,10.0,10.0,10,10,10
unique,,10,10,3,,,,9,6,5
top,,Amit Sharma,Laptop,Electronics,,,,2025-06-03,Mumbai,Credit Card
freq,,1,1,5,,,,2,2,3
mean,1005.5,,,,2.5,15307.0,18345.0,,,
std,3.02765,,,,2.915476,15890.110167,17173.015816,,,
min,1001.0,,,,1.0,20.0,200.0,,,
25%,1003.25,,,,1.0,7750.0,7750.0,,,
50%,1005.5,,,,1.0,12000.0,13500.0,,,
75%,1007.75,,,,2.0,17250.0,24750.0,,,
max,1010.0,,,,10.0,55000.0,55000.0,,,


# Data types and null value checks

In [53]:
data.dtypes

Order_ID           int64
Customer_Name     object
Product           object
Category          object
Quantity           int64
Unit_Price         int64
Total_Amount       int64
Order_Date        object
City              object
Payment_Method    object
dtype: object

In [54]:
data.isnull()

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False


In [55]:
data.isnull().sum()

Order_ID          0
Customer_Name     0
Product           0
Category          0
Quantity          0
Unit_Price        0
Total_Amount      0
Order_Date        0
City              0
Payment_Method    0
dtype: int64

In [56]:
data['Category'].value_counts()

Category
Electronics    5
Furniture      3
Stationery     2
Name: count, dtype: int64

In [57]:
data['City'].unique()

array(['Mumbai', 'Pune', 'Delhi', 'Bangalore', 'Ahmedabad', 'Hyderabad'],
      dtype=object)

In [58]:
data['City'].value_counts()

City
Mumbai       2
Pune         2
Delhi        2
Bangalore    2
Ahmedabad    1
Hyderabad    1
Name: count, dtype: int64
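
Beyond raw counts, `value_counts` can also report shares, and `nunique` gives the number of distinct values directly. A short sketch on a hypothetical `city` Series (not the notebook's full column):

```python
import pandas as pd

# Small hypothetical sample of city names
city = pd.Series(["Mumbai", "Pune", "Delhi", "Mumbai"])

counts = city.value_counts()                # occurrences per city
shares = city.value_counts(normalize=True)  # fraction of rows per city
distinct = city.nunique()                   # number of distinct cities

print(counts)
print(shares)
print(distinct)
```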

In [59]:
data.columns

Index(['Order_ID', 'Customer_Name', 'Product', 'Category', 'Quantity',
       'Unit_Price', 'Total_Amount', 'Order_Date', 'City', 'Payment_Method'],
      dtype='object')

In [61]:
# filter based on selection
data[data['Total_Amount'] > 35000]

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card
1,1002,Priya Desai,Smartphone,Electronics,2,18000,36000,2025-06-02,Pune,UPI


In [62]:
data['Customer_Name']

0     Amit Sharma
1     Priya Desai
2     Rahul Mehta
3      Sneha Iyer
4     Karan Patel
5      Neha Verma
6       Arjun Rao
7      Divya Shah
8    Manish Gupta
9     Anjali Nair
Name: Customer_Name, dtype: object

In [63]:
data.columns

Index(['Order_ID', 'Customer_Name', 'Product', 'Category', 'Quantity',
       'Unit_Price', 'Total_Amount', 'Order_Date', 'City', 'Payment_Method'],
      dtype='object')

In [64]:
data[['Customer_Name','Product','Quantity','City']]

Unnamed: 0,Customer_Name,Product,Quantity,City
0,Amit Sharma,Laptop,1,Mumbai
1,Priya Desai,Smartphone,2,Pune
2,Rahul Mehta,Office Chair,1,Delhi
3,Sneha Iyer,Notebook,5,Bangalore
4,Karan Patel,Desk,1,Ahmedabad
5,Neha Verma,Pen,10,Mumbai
6,Arjun Rao,Tablet,1,Hyderabad
7,Divya Shah,Bookshelf,1,Pune
8,Manish Gupta,Monitor,2,Delhi
9,Anjali Nair,Printer,1,Bangalore


In [65]:
# select a row by its index label
data.loc[0]  # loc selects rows and columns by label; iloc selects rows and columns by integer position

Order_ID                 1001
Customer_Name     Amit Sharma
Product                Laptop
Category          Electronics
Quantity                    1
Unit_Price              55000
Total_Amount            55000
Order_Date         2025-06-01
City                   Mumbai
Payment_Method    Credit Card
Name: 0, dtype: object

In [66]:
data.loc[data['Total_Amount'] > 35000]

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card
1,1002,Priya Desai,Smartphone,Electronics,2,18000,36000,2025-06-02,Pune,UPI


In [67]:
# iloc selects by integer position; a single integer returns that row
data.iloc[0]

Order_ID                 1001
Customer_Name     Amit Sharma
Product                Laptop
Category          Electronics
Quantity                    1
Unit_Price              55000
Total_Amount            55000
Order_Date         2025-06-01
City                   Mumbai
Payment_Method    Credit Card
Name: 0, dtype: object

In [68]:
data.iloc[2:5]

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
2,1003,Rahul Mehta,Office Chair,Furniture,1,7500,7500,2025-06-03,Delhi,Net Banking
3,1004,Sneha Iyer,Notebook,Stationery,5,50,250,2025-06-03,Bangalore,Cash
4,1005,Karan Patel,Desk,Furniture,1,12000,12000,2025-06-04,Ahmedabad,Credit Card


In [69]:
data.loc[0, 'Customer_Name']

'Amit Sharma'

In [70]:
data.iloc[0,1]

'Amit Sharma'
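
Both `loc` and `iloc` accept a row selector and a column selector. The sketch below uses a three-row subset of the order data (the frame `orders` is an assumption for the example) to show that a label slice includes its end while a positional slice does not.

```python
import pandas as pd

orders = pd.DataFrame({
    "Customer_Name": ["Amit Sharma", "Priya Desai", "Rahul Mehta"],
    "Product": ["Laptop", "Smartphone", "Office Chair"],
    "Total_Amount": [55000, 36000, 7500],
})

# loc: labels for rows and columns; the slice 0:1 keeps rows 0 AND 1
by_label = orders.loc[0:1, ["Customer_Name", "Total_Amount"]]

# iloc: integer positions; the slice 0:1 keeps row 0 only
by_position = orders.iloc[0:1, [0, 2]]

# iloc can also select a whole column by position
second_column = orders.iloc[:, 1]

print(by_label)
print(by_position)
print(second_column)
```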

In [74]:
# drop a single column; drop() is a DataFrame method and returns a new DataFrame unless inplace=True
x = data.drop('Quantity', axis=1)

In [75]:
x.columns

Index(['Order_ID', 'Customer_Name', 'Product', 'Category', 'Unit_Price',
       'Total_Amount', 'Order_Date', 'City', 'Payment_Method'],
      dtype='object')

In [76]:
data.columns

Index(['Order_ID', 'Customer_Name', 'Product', 'Category', 'Quantity',
       'Unit_Price', 'Total_Amount', 'Order_Date', 'City', 'Payment_Method'],
      dtype='object')

In [77]:
x.drop('Product',axis=1,inplace = True)

In [78]:
x

Unnamed: 0,Order_ID,Customer_Name,Category,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Electronics,55000,55000,2025-06-01,Mumbai,Credit Card
1,1002,Priya Desai,Electronics,18000,36000,2025-06-02,Pune,UPI
2,1003,Rahul Mehta,Furniture,7500,7500,2025-06-03,Delhi,Net Banking
3,1004,Sneha Iyer,Stationery,50,250,2025-06-03,Bangalore,Cash
4,1005,Karan Patel,Furniture,12000,12000,2025-06-04,Ahmedabad,Credit Card
5,1006,Neha Verma,Stationery,20,200,2025-06-05,Mumbai,UPI
6,1007,Arjun Rao,Electronics,25000,25000,2025-06-06,Hyderabad,Debit Card
7,1008,Divya Shah,Furniture,8500,8500,2025-06-07,Pune,Cash
8,1009,Manish Gupta,Electronics,12000,24000,2025-06-08,Delhi,Net Banking
9,1010,Anjali Nair,Electronics,15000,15000,2025-06-09,Bangalore,Credit Card


In [79]:
y = data.drop_duplicates()

In [80]:
y

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card
1,1002,Priya Desai,Smartphone,Electronics,2,18000,36000,2025-06-02,Pune,UPI
2,1003,Rahul Mehta,Office Chair,Furniture,1,7500,7500,2025-06-03,Delhi,Net Banking
3,1004,Sneha Iyer,Notebook,Stationery,5,50,250,2025-06-03,Bangalore,Cash
4,1005,Karan Patel,Desk,Furniture,1,12000,12000,2025-06-04,Ahmedabad,Credit Card
5,1006,Neha Verma,Pen,Stationery,10,20,200,2025-06-05,Mumbai,UPI
6,1007,Arjun Rao,Tablet,Electronics,1,25000,25000,2025-06-06,Hyderabad,Debit Card
7,1008,Divya Shah,Bookshelf,Furniture,1,8500,8500,2025-06-07,Pune,Cash
8,1009,Manish Gupta,Monitor,Electronics,2,12000,24000,2025-06-08,Delhi,Net Banking
9,1010,Anjali Nair,Printer,Electronics,1,15000,15000,2025-06-09,Bangalore,Credit Card


In [81]:
z = data.dropna()

In [82]:
z

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card
1,1002,Priya Desai,Smartphone,Electronics,2,18000,36000,2025-06-02,Pune,UPI
2,1003,Rahul Mehta,Office Chair,Furniture,1,7500,7500,2025-06-03,Delhi,Net Banking
3,1004,Sneha Iyer,Notebook,Stationery,5,50,250,2025-06-03,Bangalore,Cash
4,1005,Karan Patel,Desk,Furniture,1,12000,12000,2025-06-04,Ahmedabad,Credit Card
5,1006,Neha Verma,Pen,Stationery,10,20,200,2025-06-05,Mumbai,UPI
6,1007,Arjun Rao,Tablet,Electronics,1,25000,25000,2025-06-06,Hyderabad,Debit Card
7,1008,Divya Shah,Bookshelf,Furniture,1,8500,8500,2025-06-07,Pune,Cash
8,1009,Manish Gupta,Monitor,Electronics,2,12000,24000,2025-06-08,Delhi,Net Banking
9,1010,Anjali Nair,Printer,Electronics,1,15000,15000,2025-06-09,Bangalore,Credit Card


In [83]:
del data['Category']
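
The removal options above differ in whether they change the original object: `drop` and `drop_duplicates` return new DataFrames by default, while `del` and `inplace=True` modify the existing one. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

orders = pd.DataFrame({
    "Product": ["Laptop", "Pen"],
    "Quantity": [1, 10],
    "City": ["Mumbai", "Mumbai"],
})

# drop returns a new frame; `orders` keeps all three columns
trimmed = orders.drop(columns=["Quantity", "City"])

# Append a copy of the first row so there is a duplicate to remove
with_duplicate = pd.concat([orders, orders.iloc[[0]]], ignore_index=True)
deduplicated = with_duplicate.drop_duplicates()

print(list(orders.columns))
print(list(trimmed.columns))
print(len(with_duplicate), len(deduplicated))
```

`drop(columns=[...])` is equivalent to `drop([...], axis=1)` and can read more clearly when several columns are removed at once.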

# Boolean Indexing

In [85]:
df = pd.read_csv('cust.csv')

In [86]:
df.columns

Index(['Order_ID', 'Customer_Name', 'Product', 'Category', 'Quantity',
       'Unit_Price', 'Total_Amount', 'Order_Date', 'City', 'Payment_Method'],
      dtype='object')

In [87]:
# Select rows with a boolean mask.
# Combine conditions on two columns with & (each condition in its own parentheses).
# Here: rows where Category is Electronics and Total_Amount is greater than 36000.
df[(df['Category'] == 'Electronics') & (df['Total_Amount'] > 36000)]

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card


In [88]:
df.query("Category == 'Electronics' and Total_Amount > 36000")

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card


In [90]:
c='Electronics'
a=36000
df.query("Category == @c and Total_Amount > @a")

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card


In [91]:
# membership check
df[df['City'].isin(['Mumbai', 'Delhi'])]

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card
2,1003,Rahul Mehta,Office Chair,Furniture,1,7500,7500,2025-06-03,Delhi,Net Banking
5,1006,Neha Verma,Pen,Stationery,10,20,200,2025-06-05,Mumbai,UPI
8,1009,Manish Gupta,Monitor,Electronics,2,12000,24000,2025-06-08,Delhi,Net Banking


In [92]:
# String based record selection
df[df['Customer_Name'].str.startswith('M')]

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
8,1009,Manish Gupta,Monitor,Electronics,2,12000,24000,2025-06-08,Delhi,Net Banking


In [93]:
df[df['City'].str.startswith('Mum')]

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
0,1001,Amit Sharma,Laptop,Electronics,1,55000,55000,2025-06-01,Mumbai,Credit Card
5,1006,Neha Verma,Pen,Stationery,10,20,200,2025-06-05,Mumbai,UPI
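
Boolean masks can also be combined with `|` (or) and inverted with `~` (not). The sketch below uses a small hypothetical `orders` frame; each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `==` and `>`.

```python
import pandas as pd

orders = pd.DataFrame({
    "Category": ["Electronics", "Furniture", "Stationery", "Electronics"],
    "Total_Amount": [55000, 7500, 250, 24000],
    "City": ["Mumbai", "Delhi", "Bangalore", "Delhi"],
})

# OR: furniture orders, or any order above 50000
either = orders[(orders["Category"] == "Furniture") | (orders["Total_Amount"] > 50000)]

# NOT: every order outside Delhi
not_delhi = orders[~orders["City"].isin(["Delhi"])]

# Substring match; na=False treats missing names as non-matching
contains_gal = orders[orders["City"].str.contains("gal", na=False)]

print(either)
print(not_delhi)
print(contains_gal)
```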


# Filtering with filter method of dataframe

In [95]:
# Syntax: df.filter(items=None, like=None, regex=None, axis=None)
# filter selects by label (column names by default for a DataFrame), not by cell values
df.filter(items=['City'])

Unnamed: 0,City
0,Mumbai
1,Pune
2,Delhi
3,Bangalore
4,Ahmedabad
5,Mumbai
6,Hyderabad
7,Pune
8,Delhi
9,Bangalore


In [98]:
df.filter(like='Unit_Price',axis=1)

Unnamed: 0,Unit_Price
0,55000
1,18000
2,7500
3,50
4,12000
5,20
6,25000
7,8500
8,12000
9,15000


In [99]:
# Range based filter (between includes both endpoints by default)
df[df['Total_Amount'].between(15000,35000)]

Unnamed: 0,Order_ID,Customer_Name,Product,Category,Quantity,Unit_Price,Total_Amount,Order_Date,City,Payment_Method
6,1007,Arjun Rao,Tablet,Electronics,1,25000,25000,2025-06-06,Hyderabad,Debit Card
8,1009,Manish Gupta,Monitor,Electronics,2,12000,24000,2025-06-08,Delhi,Net Banking
9,1010,Anjali Nair,Printer,Electronics,1,15000,15000,2025-06-09,Bangalore,Credit Card
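
The `regex` option of `filter` and the `inclusive` option of `between` were not shown above. A minimal sketch on a hypothetical four-row `orders` frame; `inclusive="neither"` assumes pandas 1.3 or later.

```python
import pandas as pd

orders = pd.DataFrame({
    "Order_ID": [1001, 1002, 1003, 1009],
    "Unit_Price": [55000, 18000, 7500, 12000],
    "Total_Amount": [55000, 36000, 7500, 24000],
    "City": ["Mumbai", "Pune", "Delhi", "Delhi"],
})

# Columns whose name ends in _Price or _Amount
amount_columns = orders.filter(regex="_(Price|Amount)$", axis=1)

# Columns whose name contains the substring "Order"
order_columns = orders.filter(like="Order", axis=1)

# Default: both endpoints are part of the range
in_range = orders[orders["Total_Amount"].between(7500, 36000)]

# Exclude both endpoints
strictly_inside = orders[orders["Total_Amount"].between(7500, 36000, inclusive="neither")]

print(list(amount_columns.columns))
print(list(order_columns.columns))
print(len(in_range), len(strictly_inside))
```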


# Missing Value Treatment

In [157]:
import numpy as np
import pandas as pd

In [158]:
data = {
    'customer_id': [101, 102, 103, 104, 105, 106],
    'age': [25, np.nan, 35, 40, np.nan, 29],                      # float
    'purchases': [5, 3, np.nan, 8, 2, np.nan],                    # float
    'gender': ['Male', 'Female', np.nan, 'Female', '', 'Male'],   # object
    'is_active': [True, np.nan, False, True, np.nan, False],      # boolean values; stored as object because of NaN
    'signup_date': ['2023-01-01', '2023-01-05', np.nan, '2023-01-10', '', '2023-01-15']      # date strings; stored as object
}

In [159]:
df = pd.DataFrame(data)

In [160]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   customer_id  6 non-null      int64  
 1   age          4 non-null      float64
 2   purchases    4 non-null      float64
 3   gender       5 non-null      object 
 4   is_active    4 non-null      object 
 5   signup_date  5 non-null      object 
dtypes: float64(2), int64(1), object(3)
memory usage: 420.0+ bytes


In [163]:
df

Unnamed: 0,customer_id,age,purchases,gender,is_active,signup_date
0,101,25.0,5.0,Male,True,2023-01-01
1,102,,3.0,Female,,2023-01-05
2,103,35.0,,,False,
3,104,40.0,8.0,Female,True,2023-01-10
4,105,,2.0,,,
5,106,29.0,,Male,False,2023-01-15


In [164]:
# Assign the result back: fillna(..., inplace=True) on df['age'] acts on a temporary object and will stop updating df in pandas 3.0
df['age'] = df['age'].fillna(df['age'].mean())

In [165]:
df

Unnamed: 0,customer_id,age,purchases,gender,is_active,signup_date
0,101,25.0,5.0,Male,True,2023-01-01
1,102,32.25,3.0,Female,,2023-01-05
2,103,35.0,,,False,
3,104,40.0,8.0,Female,True,2023-01-10
4,105,32.25,2.0,,,
5,106,29.0,,Male,False,2023-01-15


In [166]:
# Fill missing purchases with the median, assigning the result back to the column
df['purchases'] = df['purchases'].fillna(df['purchases'].median())

In [167]:
df

Unnamed: 0,customer_id,age,purchases,gender,is_active,signup_date
0,101,25.0,5.0,Male,True,2023-01-01
1,102,32.25,3.0,Female,,2023-01-05
2,103,35.0,4.0,,False,
3,104,40.0,8.0,Female,True,2023-01-10
4,105,32.25,2.0,,,
5,106,29.0,4.0,Male,False,2023-01-15


In [168]:
# df['gender'] = df['gender'].astype('string')
# Note: mode() returns a Series, so fillna aligns it by index label and the missing value is not filled here
df['gender'] = df['gender'].fillna(df['gender'].mode())

In [169]:
df

Unnamed: 0,customer_id,age,purchases,gender,is_active,signup_date
0,101,25.0,5.0,Male,True,2023-01-01
1,102,32.25,3.0,Female,,2023-01-05
2,103,35.0,4.0,,False,
3,104,40.0,8.0,Female,True,2023-01-10
4,105,32.25,2.0,,,
5,106,29.0,4.0,Male,False,2023-01-15


In [174]:
# Treat empty strings as missing, then fill with the most frequent value (mode()[0] is a scalar)
df['gender'] = df['gender'].replace('', np.nan)
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])

In [175]:
df

Unnamed: 0,customer_id,age,purchases,gender,is_active,signup_date
0,101,25.0,5.0,Male,True,2023-01-01
1,102,32.25,3.0,Female,,2023-01-05
2,103,35.0,4.0,Female,False,
3,104,40.0,8.0,Female,True,2023-01-10
4,105,32.25,2.0,Female,,
5,106,29.0,4.0,Male,False,2023-01-15


In [176]:
df['gender']=df['gender'].str.strip().str.title()

In [177]:
df

Unnamed: 0,customer_id,age,purchases,gender,is_active,signup_date
0,101,25.0,5.0,Male,True,2023-01-01
1,102,32.25,3.0,Female,,2023-01-05
2,103,35.0,4.0,Female,False,
3,104,40.0,8.0,Female,True,2023-01-10
4,105,32.25,2.0,Female,,
5,106,29.0,4.0,Male,False,2023-01-15


In [178]:
# Treat a missing flag as inactive and store the column as plain bool
df['is_active'] = df['is_active'].astype('boolean').fillna(False).astype(bool)

In [179]:
df

Unnamed: 0,customer_id,age,purchases,gender,is_active,signup_date
0,101,25.0,5.0,Male,True,2023-01-01
1,102,32.25,3.0,Female,False,2023-01-05
2,103,35.0,4.0,Female,False,
3,104,40.0,8.0,Female,True,2023-01-10
4,105,32.25,2.0,Female,False,
5,106,29.0,4.0,Male,False,2023-01-15


In [182]:
# Treat empty strings as missing, then forward-fill from the previous row (fillna(method='ffill') is deprecated in favor of ffill())
df['signup_date'] = df['signup_date'].replace('', np.nan).ffill()

In [183]:
df

Unnamed: 0,customer_id,age,purchases,gender,is_active,signup_date
0,101,25.0,5.0,Male,True,2023-01-01
1,102,32.25,3.0,Female,False,2023-01-05
2,103,35.0,4.0,Female,False,2023-01-05
3,104,40.0,8.0,Female,True,2023-01-10
4,105,32.25,2.0,Female,False,2023-01-10
5,106,29.0,4.0,Male,False,2023-01-15


In [184]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   customer_id  6 non-null      int64  
 1   age          6 non-null      float64
 2   purchases    6 non-null      float64
 3   gender       6 non-null      object 
 4   is_active    6 non-null      bool   
 5   signup_date  6 non-null      object 
dtypes: bool(1), float64(2), int64(1), object(2)
memory usage: 378.0+ bytes


In [188]:
df['gender'] = df['gender'].astype('string')
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['customer_id'] = df['customer_id'].astype('string')

In [189]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   customer_id  6 non-null      string        
 1   age          6 non-null      float64       
 2   purchases    6 non-null      float64       
 3   gender       6 non-null      string        
 4   is_active    6 non-null      bool          
 5   signup_date  6 non-null      datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(2), string(2)
memory usage: 378.0 bytes
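
`pd.to_datetime` raises an error when it meets a value it cannot parse. If the column might still contain bad strings, `errors="coerce"` turns them into `NaT` so they can be handled like any other missing value. A small sketch on a hypothetical `raw_dates` Series:

```python
import pandas as pd

# Hypothetical column with one valid date, one invalid string, and one empty string
raw_dates = pd.Series(["2023-01-01", "not a date", ""])

# Unparseable entries become NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors="coerce")

print(parsed)
print(parsed.isna().sum())
```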


In [190]:
df.isnull().sum()

customer_id    0
age            0
purchases      0
gender         0
is_active      0
signup_date    0
dtype: int64
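
The cleaning steps in this section can be collected into one function that works on a copy and assigns every result back to its column, which avoids `inplace=True` entirely. This is a sketch, not the notebook's own code: the function name `clean_customers` and the variable `raw` are assumptions, and the fill rules simply mirror the cells above.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105, 106],
    "age": [25, np.nan, 35, 40, np.nan, 29],
    "purchases": [5, 3, np.nan, 8, 2, np.nan],
    "gender": ["Male", "Female", np.nan, "Female", "", "Male"],
    "is_active": [True, np.nan, False, True, np.nan, False],
    "signup_date": ["2023-01-01", "2023-01-05", np.nan, "2023-01-10", "", "2023-01-15"],
})


def clean_customers(df):
    """Return a cleaned copy; the input frame is left unchanged."""
    out = df.copy()
    out["age"] = out["age"].fillna(out["age"].mean())
    out["purchases"] = out["purchases"].fillna(out["purchases"].median())
    # Empty strings count as missing before the mode is computed
    gender = out["gender"].replace("", np.nan)
    out["gender"] = gender.fillna(gender.mode()[0])
    # Missing activity flags are treated as inactive
    out["is_active"] = out["is_active"].astype("boolean").fillna(False).astype(bool)
    # Forward-fill missing dates, then convert to datetime
    dates = out["signup_date"].replace("", np.nan).ffill()
    out["signup_date"] = pd.to_datetime(dates)
    return out


cleaned = clean_customers(raw)
print(cleaned)
print(cleaned.isnull().sum())
```

Because the function returns a new frame, the raw data stays available for comparison, and the same steps can be rerun on a fresh extract.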