#### Pandas-DataFrame And Series

Pandas is a powerful data manipulation library in Python, widely used for data analysis and data cleaning. It provides two primary data structures: Series and DataFrame. A Series is a one-dimensional array-like object, while a DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

In [1]:
!pip install pandas






In [5]:
## Series
## A Pandas Series is a one-dimensional array-like object that can hold any data type. It is similar to a column in a table.
import pandas as pd
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print("Series: \n", series)
print(type(series))

Series: 
 0    1
1    2
2    3
3    4
4    5
dtype: int64
<class 'pandas.core.series.Series'>


In [6]:
## Create a Series from a Dictionary
data = {'a':1, 'b':2, 'c':3}
series_dict = pd.Series(data)
print(series_dict)
print(type(series_dict))

a    1
b    2
c    3
dtype: int64
<class 'pandas.core.series.Series'>


In [7]:
data = [10, 20, 30]
index = ['a', 'b', 'c'] # custom labels, one per value
series_1 = pd.Series(data, index=index)
print(series_1)

a    10
b    20
c    30
dtype: int64
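
Once a Series has custom labels, values can be looked up either by label or by position. The following is a minimal sketch that rebuilds the same `series_1` so it runs on its own; the variable names `by_label`, `by_position`, and `subset` are only illustrative.

```python
import pandas as pd

# Same data as the cell above: three values with string labels.
series_1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = series_1['b']         # label-based lookup
by_position = series_1.iloc[1]   # position-based lookup (second element)
subset = series_1[['a', 'c']]    # a list of labels returns a new Series

print(by_label, by_position)
print(subset)
```

Using `.iloc` for positions keeps the intent explicit and avoids ambiguity when the labels themselves are integers.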


In [10]:
## DataFrame - a table with multiple rows and columns
# Create a DataFrame from a dictionary of lists (each key becomes a column)

data = {
    'Name':['Prasanna', 'John', 'Prasanth'],
    'Age':[25, 30, 45],
    'City':['Bangalore', 'New York', 'Florida']
}

# 3 columns and 3 rows of data

df = pd.DataFrame(data)
print(df) # The result is tabular, like a spreadsheet: labeled rows and columns
print(type(df))

       Name  Age       City
0  Prasanna   25  Bangalore
1      John   30   New York
2  Prasanth   45    Florida
<class 'pandas.core.frame.DataFrame'>


In [11]:
## Create a DataFrame from a List of Dictionaries

data = [
    {'Name':'Prasanna', 'Age':25, 'City':'Bangalore'},
    {'Name':'John', 'Age':30, 'City':'New York'},
    {'Name':'Prasanth', 'Age':45, 'City':'Florida'}    
]
df = pd.DataFrame(data)
print(df) 
print(type(df))

       Name  Age       City
0  Prasanna   25  Bangalore
1      John   30   New York
2  Prasanth   45    Florida
<class 'pandas.core.frame.DataFrame'>


In [None]:
## In practice, data usually comes from Excel or CSV files
# pd.read_csv loads a CSV file into a DataFrame

In [None]:
df = pd.read_csv('sales_data.csv')
df.head(5) # First 5 Records

,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
0,10001,2024-01-01,Electronics,iPhone 14 Pro,2,999.99,1999.98,North America,Credit Card
1,10002,2024-01-02,Home Appliances,Dyson V11 Vacuum,1,499.99,499.99,Europe,PayPal
2,10003,2024-01-03,Clothing,Levi's 501 Jeans,3,69.99,209.97,Asia,Debit Card
3,10004,2024-01-04,Books,The Da Vinci Code,4,15.99,63.96,North America,Credit Card
4,10005,2024-01-05,Beauty Products,Neutrogena Skincare Set,1,89.99,89.99,Europe,PayPal


In [14]:
df.tail(5) # Last 5 Records

,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
235,10236,2024-08-23,Home Appliances,Nespresso Vertuo Next Coffee and Espresso Maker,1,159.99,159.99,Europe,PayPal
236,10237,2024-08-24,Clothing,Nike Air Force 1 Sneakers,3,90.0,270.0,Asia,Debit Card
237,10238,2024-08-25,Books,The Handmaid's Tale by Margaret Atwood,3,10.99,32.97,North America,Credit Card
238,10239,2024-08-26,Beauty Products,Sunday Riley Luna Sleeping Night Oil,1,55.0,55.0,Europe,PayPal
239,10240,2024-08-27,Sports,Yeti Rambler 20 oz Tumbler,2,29.99,59.98,Asia,Credit Card
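
The cells above read `sales_data.csv` from disk, so they only run where that file exists. As a self-contained sketch of the same calls, the example below reads a few sample rows from an in-memory string instead. The `csv_text` content and the `sales` variable are stand-ins for illustration, and `parse_dates` is an optional extra that is not used in the notebook.

```python
import io
import pandas as pd

# Stand-in for a file on disk: pd.read_csv accepts a path or a file-like object.
csv_text = """Transaction ID,Date,Units Sold,Unit Price
10001,2024-01-01,2,999.99
10002,2024-01-02,1,499.99
10003,2024-01-03,3,69.99
"""

# parse_dates converts the Date column to datetime instead of plain text.
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(sales.head(2))   # first 2 rows
print(sales.tail(1))   # last row
print(sales.shape)     # (number of rows, number of columns)
```

Excel files can be loaded in a similar way with `pd.read_excel`, which typically needs an additional engine package such as `openpyxl` installed.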


In [17]:
# Accessing Data
data = {
    'Name':['Prasanna', 'John', 'Prasanth'],
    'Age':[25, 30, 45],
    'City':['Bangalore', 'New York', 'Florida']
}

df = pd.DataFrame(data)

df

       Name  Age       City
0  Prasanna   25  Bangalore
1      John   30   New York
2  Prasanth   45    Florida


In [21]:
df['Name']

0    Prasanna
1        John
2    Prasanth
Name: Name, dtype: object

In [24]:
type(df['Name']) # Selecting a single column returns a Series

pandas.core.series.Series

In [25]:
df.loc[0] # Select a row by its index label

Name     Prasanna
Age            25
City    Bangalore
Name: 0, dtype: object

In [28]:
df.iloc[0] # Select a row by its integer position

Name     Prasanna
Age            25
City    Bangalore
Name: 0, dtype: object
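
With the default index, `loc[0]` and `iloc[0]` return the same row, which hides the difference between them. The sketch below uses a small frame with non-default row labels (10, 20, 30, chosen only for illustration) to show where label-based and position-based access diverge.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Name': ['Prasanna', 'John', 'Prasanth'], 'Age': [25, 30, 45]},
    index=[10, 20, 30],   # hypothetical non-default row labels
)

row_by_label = df_demo.loc[10]      # row whose index label is 10
row_by_position = df_demo.iloc[0]   # first row, whatever its label is

print(row_by_label.equals(row_by_position))   # both refer to the same row here
print(df_demo.loc[10:20])    # label slice: the end label is included
print(df_demo.iloc[0:1])     # position slice: the end position is excluded
```

Note that `df_demo.loc[0]` would raise a `KeyError` here, because no row carries the label 0.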

In [29]:
df.loc[0].iloc[0] # First value in row 0; positional access on a Series uses .iloc

'Prasanna'

In [33]:
df

       Name  Age       City
0  Prasanna   25  Bangalore
1      John   30   New York
2  Prasanth   45    Florida


In [36]:
## Accessing a specified element
df.at[1, 'Age']

np.int64(30)

In [37]:
df.at[2, 'City']

'Florida'

In [38]:
## Accessing specified element using iat
df.iat[2, 2]

'Florida'
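
`at` and `iat` can also be used to assign a single value, not only to read one. This is a minimal sketch on a copy of the same data; the new values (31 and 'Miami') are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Prasanna', 'John', 'Prasanth'],
    'Age': [25, 30, 45],
    'City': ['Bangalore', 'New York', 'Florida'],
})

df_demo.at[1, 'Age'] = 31      # by row label and column name
df_demo.iat[2, 2] = 'Miami'    # by row position and column position

print(df_demo.at[1, 'Age'], df_demo.iat[2, 2])
```

Both accessors target exactly one cell, which makes them a good fit for scalar reads and writes; `loc` and `iloc` are the more general tools for rows, columns, and slices.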

In [39]:
## Data Manipulation with DataFrame

# Adding a new Column - Salary
df['Salary'] = [50000, 60000, 70000]
df

       Name  Age       City  Salary
0  Prasanna   25  Bangalore   50000
1      John   30   New York   60000
2  Prasanth   45    Florida   70000


In [40]:
# Removing a Column
df.drop('Salary', axis=1) # axis=1 targets columns; the default axis=0 looks for a row label
# drop() returns a new DataFrame; df itself is unchanged unless the result is assigned

       Name  Age       City
0  Prasanna   25  Bangalore
1      John   30   New York
2  Prasanth   45    Florida


In [41]:
df

       Name  Age       City  Salary
0  Prasanna   25  Bangalore   50000
1      John   30   New York   60000
2  Prasanth   45    Florida   70000


In [42]:
# To make the drop permanent, use inplace=True
df.drop('Salary', axis=1, inplace=True) # axis=1 targets columns; inplace=True modifies df and returns None


In [43]:
df

       Name  Age       City
0  Prasanna   25  Bangalore
1      John   30   New York
2  Prasanth   45    Florida
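
An alternative to `inplace=True` is to keep the DataFrame that `drop` returns. The sketch below also uses the `columns=` keyword in place of `axis=1`; the names `df_demo` and `trimmed` are only illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Prasanna', 'John', 'Prasanth'],
    'Age': [25, 30, 45],
    'Salary': [50000, 60000, 70000],
})

# columns= names the axis explicitly, so axis=1 is not needed.
trimmed = df_demo.drop(columns=['Salary'])

print(list(df_demo.columns))   # the original still has Salary
print(list(trimmed.columns))   # the returned DataFrame does not
```

Keeping the original intact makes it easy to compare before and after, at the cost of holding two objects.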


In [44]:
## Increment every value in the Age column
df['Age'] = df['Age']+1 # The operation is applied element-wise to the whole column

In [45]:
df

       Name  Age       City
0  Prasanna   26  Bangalore
1      John   31   New York
2  Prasanth   46    Florida
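
The same element-wise arithmetic works between two columns. As a small sketch, the example below derives a revenue column from units and price, using three sample rows; `sales_demo` is a hypothetical name and the rounding step is an added assumption to keep two decimal places.

```python
import pandas as pd

sales_demo = pd.DataFrame({
    'Units Sold': [2, 1, 3],
    'Unit Price': [999.99, 499.99, 69.99],
})

# Multiply the two columns row by row and store the result as a new column.
sales_demo['Total Revenue'] = (
    sales_demo['Units Sold'] * sales_demo['Unit Price']
).round(2)

print(sales_demo)
```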


In [46]:
# Drop a row by its index label (returns a new DataFrame; df is unchanged)
df.drop(0)

       Name  Age      City
1      John   31  New York
2  Prasanth   46   Florida
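
After `df.drop(0)`, the remaining rows keep their original labels (1 and 2). The sketch below shows dropping several rows at once with the `index=` keyword and, as an optional extra step not used in the notebook, renumbering the result with `reset_index`. The variable names are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Prasanna', 'John', 'Prasanth'],
    'Age': [26, 31, 46],
})

remaining = df_demo.drop(index=[0, 2])                      # drop rows labelled 0 and 2
renumbered = df_demo.drop(index=0).reset_index(drop=True)   # relabel rows from 0

print(remaining)
print(renumbered)
print(len(df_demo))   # the original is untouched
```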


In [47]:
df = pd.read_csv('sales_data.csv')

In [49]:
df.head(5)

,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
0,10001,2024-01-01,Electronics,iPhone 14 Pro,2,999.99,1999.98,North America,Credit Card
1,10002,2024-01-02,Home Appliances,Dyson V11 Vacuum,1,499.99,499.99,Europe,PayPal
2,10003,2024-01-03,Clothing,Levi's 501 Jeans,3,69.99,209.97,Asia,Debit Card
3,10004,2024-01-04,Books,The Da Vinci Code,4,15.99,63.96,North America,Credit Card
4,10005,2024-01-05,Beauty Products,Neutrogena Skincare Set,1,89.99,89.99,Europe,PayPal


In [50]:
# Display the data type of each column
print("Data types:\n", df.dtypes)

# Describe the DataFrame
print("Statistical Summary:\n", df.describe())


Data types:
 Transaction ID        int64
Date                 object
Product Category     object
Product Name         object
Units Sold            int64
Unit Price          float64
Total Revenue       float64
Region               object
Payment Method       object
dtype: object
Statistical Summary:
        Transaction ID  Units Sold   Unit Price  Total Revenue
count       240.00000  240.000000   240.000000     240.000000
mean      10120.50000    2.158333   236.395583     335.699375
std          69.42622    1.322454   429.446695     485.804469
min       10001.00000    1.000000     6.500000       6.500000
25%       10060.75000    1.000000    29.500000      62.965000
50%       10120.50000    2.000000    89.990000     179.970000
75%       10180.25000    3.000000   249.990000     399.225000
max       10240.00000   10.000000  3899.990000    3899.990000
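
By default, `describe()` summarises only the numeric columns, which is why text columns such as Region are missing from the output above. The following sketch uses a small made-up frame (`sales_demo`, not the real sales file) to show how `include='all'` adds count, unique, top, and freq for text columns.

```python
import pandas as pd

sales_demo = pd.DataFrame({
    'Region': ['North America', 'Europe', 'Asia', 'Europe'],
    'Units Sold': [2, 1, 3, 1],
    'Unit Price': [999.99, 499.99, 69.99, 89.99],
})

numeric_summary = sales_demo.describe()            # numeric columns only (the default)
full_summary = sales_demo.describe(include='all')  # adds unique/top/freq for text columns

print(numeric_summary)
print(full_summary)
```

In the combined summary, statistics that do not apply to a column (for example the mean of Region) appear as `NaN`.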
