# Pandas: DataFrame and Series

Pandas is a powerful data manipulation library for Python, widely used in data analytics for cleaning, transforming, structuring, and organizing data. It is built around two main data structures: Series and DataFrame. A Series is a one-dimensional, array-like data structure, whereas a DataFrame is a two-dimensional, size-mutable, heterogeneous tabular structure organized into rows and columns.

In [2]:
import pandas as pd

In [4]:
#Series: A Pandas Series is a one-dimensional array-like object that can hold any data type. It is similar to a single column in a table
data = [1,2,4,5]
series = pd.Series(data)

print(series)

0    1
1    2
2    4
3    5
dtype: int64
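
A Series pairs each value with an index label. The short sketch below, using the same values as the cell above, shows one way to inspect both parts; the variable names are only illustrative.

```python
import pandas as pd

series = pd.Series([1, 2, 4, 5])

# Without an explicit index, pandas assigns the labels 0, 1, 2, ...
index_labels = list(series.index)

# The values as a plain Python list
values = series.to_numpy().tolist()

# Look up one element by its index label
third = series[2]

print(index_labels, values, third)
```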


In [7]:
#Creating Series from Dictionary
dict1 = {"name": "subrat", "age": 24, "course": "mca", "address": "bahadurgarh"} #here the keys become the index labels and the values become the data of the Series

series = pd.Series(dict1)
print(series, type(series))

name            subrat
age                 24
course             mca
address    bahadurgarh
dtype: object <class 'pandas.Series'>
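
Because the dictionary keys become index labels, values can be looked up by key, much like in the original dictionary. A minimal sketch, assuming a smaller dictionary than the one above:

```python
import pandas as pd

person = pd.Series({"name": "subrat", "age": 24, "course": "mca"})

# Look up a value by its index label
age = person["age"]

# The `in` operator tests the index labels, not the values
has_course = "course" in person
has_subrat = "subrat" in person

print(age, has_course, has_subrat)
```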


In [11]:
#We can also supply our own custom index labels
data = [1,2,34,54]
index = ['a', 'v', 'g', 'f']

series = pd.Series(data, index=index)
print(series)

a     1
v     2
g    34
f    54
dtype: int64


In [16]:
#DataFrame: a two-dimensional, size-mutable, heterogeneous tabular structure organized into rows and columns
#Create a DataFrame from a dictionary of lists
data = {
    'name':['subrat', 'raj', 'john'], 
    'age':[23,45,34],
    'city':['banglore', 'delhi','mumbai']
}

df = pd.DataFrame(data)
print(df)
print(type(df))

     name  age      city
0  subrat   23  banglore
1     raj   45     delhi
2    john   34    mumbai
<class 'pandas.DataFrame'>
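
The "two-dimensional" and "size-mutable" properties can be checked directly. The sketch below rebuilds the same DataFrame and inspects its shape and labels; the `country` column is a hypothetical addition used only to show the shape changing.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["subrat", "raj", "john"],
    "age": [23, 45, 34],
    "city": ["banglore", "delhi", "mumbai"],
})

shape_before = df.shape          # (number of rows, number of columns)
column_labels = list(df.columns)
row_labels = list(df.index)

# Size-mutable: assigning a new column grows the DataFrame
df["country"] = "india"
shape_after = df.shape

print(shape_before, column_labels, row_labels, shape_after)
```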


In [18]:
#We can convert the DataFrame into a NumPy array (matrix)
import numpy as np

info = np.array(df)
print(info)
print(info.shape)

[['subrat' 23 'banglore']
 ['raj' 45 'delhi']
 ['john' 34 'mumbai']]
(3, 3)
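
An alternative to `np.array(df)` is the `to_numpy()` method. The sketch below, on a two-column version of the same data, also shows that a mix of text and numbers ends up in a generic `object` array, while a single numeric column keeps its integer type.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["subrat", "raj", "john"],
    "age": [23, 45, 34],
})

# Mixed column types fall back to a generic object array
mixed = df.to_numpy()

# A single numeric column keeps its numeric dtype
ages = df["age"].to_numpy()

print(mixed.shape, mixed.dtype)
print(ages, ages.dtype)
```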


In [20]:
#Creating a DataFrame from a list of dictionaries
data = [
    {'name': 'subrat', 'age': 23, 'city': 'banglore'},
    {'name': 'raj', 'age': 45, 'city': 'delhi'},
    {'name': 'john', 'age': 34, 'city': 'mumbai'},
]

df = pd.DataFrame(data)
print(df)

     name  age      city
0  subrat   23  banglore
1     raj   45     delhi
2    john   34    mumbai


In [None]:
df['city'] #Selecting a single column returns a Series

0    banglore
1       delhi
2      mumbai
Name: city, dtype: str
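
Selecting with a single label gives a Series, while selecting with a list of labels gives a DataFrame. A small sketch of the difference, reusing the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["subrat", "raj", "john"],
    "age": [23, 45, 34],
    "city": ["banglore", "delhi", "mumbai"],
})

one_column = df["city"]               # single label -> Series
two_columns = df[["name", "city"]]    # list of labels -> DataFrame

print(type(one_column).__name__, type(two_columns).__name__)
```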

In [29]:
#Reading and Accessing the csv files
df = pd.read_csv("sales_data.csv")

In [28]:
#View the first five rows of the DataFrame
df.head()

   Transaction ID        Date  Product Category             Product Name  Units Sold  Unit Price  Total Revenue         Region  Payment Method
0           10001  2024-01-01       Electronics            iPhone 14 Pro           2      999.99        1999.98  North America     Credit Card
1           10002  2024-01-02   Home Appliances         Dyson V11 Vacuum           1      499.99         499.99         Europe          PayPal
2           10003  2024-01-03          Clothing         Levi's 501 Jeans           3       69.99         209.97           Asia      Debit Card
3           10004  2024-01-04             Books        The Da Vinci Code           4       15.99          63.96  North America     Credit Card
4           10005  2024-01-05   Beauty Products  Neutrogena Skincare Set           1       89.99          89.99         Europe          PayPal


In [31]:
#View the last five rows of the DataFrame
df.tail(5)

     Transaction ID        Date  Product Category                                     Product Name  Units Sold  Unit Price  Total Revenue         Region  Payment Method
235           10236  2024-08-23   Home Appliances  Nespresso Vertuo Next Coffee and Espresso Maker           1      159.99         159.99         Europe          PayPal
236           10237  2024-08-24          Clothing                        Nike Air Force 1 Sneakers           3        90.0          270.0           Asia      Debit Card
237           10238  2024-08-25             Books           The Handmaid's Tale by Margaret Atwood           3       10.99          32.97  North America     Credit Card
238           10239  2024-08-26   Beauty Products             Sunday Riley Luna Sleeping Night Oil           1        55.0           55.0         Europe          PayPal
239           10240  2024-08-27            Sports                       Yeti Rambler 20 oz Tumbler           2       29.99          59.98           Asia     Credit Card
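
`head()` and `tail()` both accept the number of rows to return. Since `sales_data.csv` may not be at hand, the sketch below reads a small in-memory stand-in instead; the three rows and the reduced column set are copied from the output above purely for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for sales_data.csv (a few rows, fewer columns)
csv_text = """Transaction ID,Date,Product Category,Units Sold,Unit Price
10001,2024-01-01,Electronics,2,999.99
10002,2024-01-02,Home Appliances,1,499.99
10003,2024-01-03,Clothing,3,69.99
"""

# read_csv accepts a file path or any file-like object
sales = pd.read_csv(io.StringIO(csv_text))

first_two = sales.head(2)   # first 2 rows
last_one = sales.tail(1)    # last row

print(first_two)
print(last_one)
```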


In [35]:
#Accessing the data from the dataframe

data = {
    'name':['subrat', 'raj', 'john'], 
    'age':[23,45,34],
    'city':['banglore', 'delhi','mumbai']
}

df = pd.DataFrame(data)

df

     name  age      city
0  subrat   23  banglore
1     raj   45     delhi
2    john   34    mumbai


In [38]:
#Accessing a single column by its name
df['name']

0    subrat
1       raj
2      john
Name: name, dtype: str

In [None]:
print(type(df['name'])) #a single column selected by name is returned as a Series

<class 'pandas.Series'>


In [4]:
data = {
    "Name": ["Rahul", "Ananya", "Rohit"],
    "Age": [25, 28, 30],
    "City": ["Delhi", "Mumbai", "Pune"]
}
index = [101, 102,103]

df = pd.DataFrame(data, index=index)

df

       Name  Age    City
101   Rahul   25   Delhi
102  Ananya   28  Mumbai
103   Rohit   30    Pune


In [None]:
#Use .loc when we know the labels, i.e. the row index labels and the column names. Let's see an example
df.loc[101, ['Name', 'City']]

Name    Rahul
City    Delhi
Name: 101, dtype: str


In [8]:
df.loc[df["Age"] > 26] #the condition is evaluated for every row and produces a boolean Series; df.loc[] keeps only the rows where it is True

       Name  Age    City
102  Ananya   28  Mumbai
103   Rohit   30    Pune
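
To make the filtering step visible, the sketch below prints the boolean mask on its own and then combines two conditions. The second condition on `City` is an illustrative addition, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Rahul", "Ananya", "Rohit"],
        "Age": [25, 28, 30],
        "City": ["Delhi", "Mumbai", "Pune"],
    },
    index=[101, 102, 103],
)

# The comparison yields one True/False value per row
mask = df["Age"] > 26
print(mask.tolist())

# Combine conditions with & (and) or | (or); each one needs its own parentheses
older_in_pune = df.loc[(df["Age"] > 26) & (df["City"] == "Pune"), ["Name", "Age"]]
print(older_in_pune)
```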


In [None]:
#Use .iloc to select data by integer position rather than by label (for example, when the labels are not known)
df.iloc[0,[0,2]] #rows and columns are addressed by position, like a matrix: row 0, columns 0 and 2

Name    Rahul
City    Delhi
Name: 101, dtype: str

In [64]:
df.iloc[1:, [0,2]]

       Name    City
102  Ananya  Mumbai
103   Rohit    Pune


In [None]:
#The same selection can be made with .loc, using labels instead of positions
df.loc[102:, ['Name', 'City']] #the slice starts at the label 102 rather than at position 1

       Name    City
102  Ananya  Mumbai
103   Rohit    Pune
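
One difference between the two indexers is easy to miss: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its stop position. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Rahul", "Ananya", "Rohit"],
        "Age": [25, 28, 30],
        "City": ["Delhi", "Mumbai", "Pune"],
    },
    index=[101, 102, 103],
)

# Label slice: both 101 and 102 are included
by_label = df.loc[101:102, ["Name"]]

# Position slice: only position 0 is returned, the stop (1) is excluded
by_position = df.iloc[0:1, [0]]

print(by_label)
print(by_position)
```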


In [9]:
data = {
    'name':['subrat', 'raj', 'john'], 
    'age':[23,45,34],
    'city':['banglore', 'delhi','mumbai']
}

df = pd.DataFrame(data)

df

     name  age      city
0  subrat   23  banglore
1     raj   45     delhi
2    john   34    mumbai


In [None]:
#Accessing a specific element

#Use .at and .iat to access a single element of the DataFrame; .loc and .iloc are meant for slicing and more complex selections
#using .at: when we know the row label and the column name
print(df.at[0,'city'])
print(df.at[1, 'name'])

#using .iat: access by integer row and column positions
print(df.iat[0,1])
print(df.iloc[0,1]) #.iloc returns the same value, but .iat is intended for accessing a single element

banglore
raj
23
23
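
`.at` and `.iat` can also be used to overwrite a single cell. The sketch below is an illustration only; the replacement values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["subrat", "raj", "john"],
    "age": [23, 45, 34],
    "city": ["banglore", "delhi", "mumbai"],
})

# Set one cell by row label and column name
df.at[0, "city"] = "bengaluru"

# Set one cell by row position and column position (row 1, column 1 is raj's age)
df.iat[1, 1] = 46

print(df)
```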


In [124]:
#Data Manipulation with Dataframe
df

     name  age      city
0  subrat   23  banglore
1     raj   45     delhi
2    john   34    mumbai


In [125]:
#We can add a new column
df['salary'] = [5000, 40000, 50050]
df

     name  age      city  salary
0  subrat   23  banglore    5000
1     raj   45     delhi   40000
2    john   34    mumbai   50050
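
A list assigned as a new column must have one value per row. The sketch below shows what happens when it does not, along with two other common ways to fill a column; the `country` and `age_next_year` columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["subrat", "raj", "john"],
    "age": [23, 45, 34],
    "city": ["banglore", "delhi", "mumbai"],
})

# A single value is repeated for every row
df["country"] = "india"

# A column can be computed from an existing one
df["age_next_year"] = df["age"] + 1

# A list of the wrong length is rejected with a ValueError
length_error = None
try:
    df["salary"] = [5000, 40000]   # only 2 values for 3 rows
except ValueError as exc:
    length_error = str(exc)

print(df)
print(length_error)
```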


In [None]:
#remove a column

#To drop a column we pass axis=1: axis=0 refers to rows (records) and axis=1 refers to columns, so with axis=1 pandas looks for 'salary' among the column labels and drops it
#By default .drop() returns a new DataFrame and does not modify df itself
df.drop('salary', axis=1)

     name  age      city
0  subrat   23  banglore
1     raj   45     delhi
2    john   34    mumbai


In [127]:
df

     name  age      city  salary
0  subrat   23  banglore    5000
1     raj   45     delhi   40000
2    john   34    mumbai   50050


In [128]:
#To remove the column from df itself, pass inplace=True
df.drop('salary', axis=1, inplace=True)

In [129]:
df

     name  age      city
0  subrat   23  banglore
1     raj   45     delhi
2    john   34    mumbai
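
An alternative to `inplace=True` is to keep the DataFrame that `drop()` returns. The sketch below also uses the `columns=` keyword, which does the same job as `axis=1`; the name `trimmed` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["subrat", "raj", "john"],
    "age": [23, 45, 34],
    "salary": [5000, 40000, 50050],
})

# drop() returns a new DataFrame; columns=[...] is equivalent to axis=1
trimmed = df.drop(columns=["salary"])

print(list(df.columns))        # the original still has 'salary'
print(list(trimmed.columns))   # the returned copy does not
```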


In [131]:
#Incrementing every value in the age column by 1
df['age'] = df['age'] + 1

df

     name  age      city
0  subrat   24  banglore
1     raj   46     delhi
2    john   35    mumbai


In [None]:
#Removing a specific row by its index label
df.drop(0, inplace=True)

In [135]:
df

   name  age    city
1   raj   46   delhi
2  john   35  mumbai
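
After a row is dropped, the remaining rows keep their original labels (1 and 2 above). If a fresh 0-based index is wanted, `reset_index(drop=True)` is one way to get it, as in this sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["subrat", "raj", "john"],
    "age": [24, 46, 35],
    "city": ["banglore", "delhi", "mumbai"],
})

# Drop the row labelled 0; the other labels are unchanged
df = df.drop(index=0)
labels_after_drop = list(df.index)

# Renumber the rows from 0; drop=True discards the old labels
df = df.reset_index(drop=True)
labels_after_reset = list(df.index)

print(labels_after_drop, labels_after_reset)
```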


In [136]:
#Let's load the dataset again
df = pd.read_csv("sales_data.csv")

In [None]:
df.head()

   Transaction ID        Date  Product Category             Product Name  Units Sold  Unit Price  Total Revenue         Region  Payment Method
0           10001  2024-01-01       Electronics            iPhone 14 Pro           2      999.99        1999.98  North America     Credit Card
1           10002  2024-01-02   Home Appliances         Dyson V11 Vacuum           1      499.99         499.99         Europe          PayPal
2           10003  2024-01-03          Clothing         Levi's 501 Jeans           3       69.99         209.97           Asia      Debit Card
3           10004  2024-01-04             Books        The Da Vinci Code           4       15.99          63.96  North America     Credit Card
4           10005  2024-01-05   Beauty Products  Neutrogena Skincare Set           1       89.99          89.99         Europe          PayPal


In [142]:
#Some computation on this dataset

#Show the data type of each column
print("Datatypes : \n", df.dtypes)

Datatypes : 
 Transaction ID        int64
Date                    str
Product Category        str
Product Name            str
Units Sold            int64
Unit Price          float64
Total Revenue       float64
Region                  str
Payment Method          str
dtype: object


In [141]:
#Summary statistics for the numeric columns of the DataFrame
print("Statistical Summary: \n", df.describe())

Statistical Summary: 
        Transaction ID  Units Sold   Unit Price  Total Revenue
count       240.00000  240.000000   240.000000     240.000000
mean      10120.50000    2.158333   236.395583     335.699375
std          69.42622    1.322454   429.446695     485.804469
min       10001.00000    1.000000     6.500000       6.500000
25%       10060.75000    1.000000    29.500000      62.965000
50%       10120.50000    2.000000    89.990000     179.970000
75%       10180.25000    3.000000   249.990000     399.225000
max       10240.00000   10.000000  3899.990000    3899.990000
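
The result of `describe()` is itself a DataFrame, so individual statistics can be read out with `.loc`. The sketch below uses only the five `head()` rows shown earlier, so its numbers differ from the full-dataset summary above.

```python
import pandas as pd

# Two numeric columns taken from the first five rows of the sales data
sales = pd.DataFrame({
    "Units Sold": [2, 1, 3, 4, 1],
    "Unit Price": [999.99, 499.99, 69.99, 15.99, 89.99],
})

summary = sales.describe()

# Rows of the summary are the statistics, columns are the original columns
mean_units = summary.loc["mean", "Units Sold"]
median_units = summary.loc["50%", "Units Sold"]
max_price = summary.loc["max", "Unit Price"]

print(list(summary.index))
print(mean_units, median_units, max_price)
```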
