## Pandas - DataFrame and Series
Pandas is a powerful data manipulation library for Python, widely used for data analysis and data cleaning. It provides two core data structures: Series and DataFrame. A Series is a one-dimensional, array-like object with a labeled index, while a DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
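The properties named above (one vs. two dimensions, heterogeneous columns, size-mutability) can be checked directly. The following is a minimal sketch; the sample names and values are made up for illustration and do not come from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the definitions above.
scores = pd.Series([10, 20, 30], name="score")
students = pd.DataFrame({
    "name": ["A", "B", "C"],          # text column
    "score": [10, 20, 30],            # integer column
    "passed": [False, True, True],    # boolean column
})

print(scores.ndim)      # number of dimensions of a Series
print(students.ndim)    # number of dimensions of a DataFrame
print(students.dtypes)  # heterogeneous: each column keeps its own dtype

# Size-mutable: a column can be added after the DataFrame is created.
students["grade"] = ["C", "B", "A"]
print(students.shape)
```

Each column of a DataFrame is itself a Series, which is why `students["score"]` supports the same operations as `scores`.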

In [36]:
import pandas as pd

In [37]:
# Series :
import pandas as pd
data = [1,2,3,4,5]
series = pd.Series(data)
print('Series:  \n', series)

Series:  
 0    1
1    2
2    3
3    4
4    5
dtype: int64


In [38]:
# Create a series from dictionary
data = {
    'a':1,
    'b':2,
    'c':3
}

series = pd.Series(data)
print(series)

a    1
b    2
c    3
dtype: int64


In [39]:
# Series with custom indexing
data = [1,2,3,4,5]
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data,index=index)
print(series)

a    1
b    2
c    3
d    4
e    5
dtype: int64
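
Once a Series has custom labels, its elements can be read either by label or by integer position. A short sketch, reusing the same values and labels as the cell above (the variable names are illustrative):

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = series['c']        # label lookup
by_loc = series.loc['c']      # explicit label-based lookup
by_pos = series.iloc[2]       # explicit position-based lookup

# Label slices include both endpoints; position slices exclude the end.
label_slice = series.loc['b':'d']
pos_slice = series.iloc[1:3]

print(by_label, by_loc, by_pos)
print(label_slice)
print(pos_slice)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a key is meant as a label or as a position.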


In [40]:
# DataFrame :
# Create a DataFrame from a dictionary
data = {
    'name':['Prince', 'Prashant', 'Pranav', 'Pavan'],
    'age':[32,13,54,32],
    'gender':['male', 'male', 'male', 'male']
}

df = pd.DataFrame(data)
print(df.head())
df.shape

       name  age gender
0    Prince   32   male
1  Prashant   13   male
2    Pranav   54   male
3     Pavan   32   male


(4, 3)

In [41]:
# DataFrame: Read data from a csv file
df = pd.read_csv('sales_data.csv')
df.head()

|   | Transaction ID | Date | Product Category | Product Name | Units Sold | Unit Price | Total Revenue | Region | Payment Method |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10001 | 2024-01-01 | Electronics | iPhone 14 Pro | 2 | 999.99 | 1999.98 | North America | Credit Card |
| 1 | 10002 | 2024-01-02 | Home Appliances | Dyson V11 Vacuum | 1 | 499.99 | 499.99 | Europe | PayPal |
| 2 | 10003 | 2024-01-03 | Clothing | Levi's 501 Jeans | 3 | 69.99 | 209.97 | Asia | Debit Card |
| 3 | 10004 | 2024-01-04 | Books | The Da Vinci Code | 4 | 15.99 | 63.96 | North America | Credit Card |
| 4 | 10005 | 2024-01-05 | Beauty Products | Neutrogena Skincare Set | 1 | 89.99 | 89.99 | Europe | PayPal |


In [42]:
df.tail()

|   | Transaction ID | Date | Product Category | Product Name | Units Sold | Unit Price | Total Revenue | Region | Payment Method |
|---|---|---|---|---|---|---|---|---|---|
| 235 | 10236 | 2024-08-23 | Home Appliances | Nespresso Vertuo Next Coffee and Espresso Maker | 1 | 159.99 | 159.99 | Europe | PayPal |
| 236 | 10237 | 2024-08-24 | Clothing | Nike Air Force 1 Sneakers | 3 | 90.0 | 270.0 | Asia | Debit Card |
| 237 | 10238 | 2024-08-25 | Books | The Handmaid's Tale by Margaret Atwood | 3 | 10.99 | 32.97 | North America | Credit Card |
| 238 | 10239 | 2024-08-26 | Beauty Products | Sunday Riley Luna Sleeping Night Oil | 1 | 55.0 | 55.0 | Europe | PayPal |
| 239 | 10240 | 2024-08-27 | Sports | Yeti Rambler 20 oz Tumbler | 2 | 29.99 | 59.98 | Asia | Credit Card |


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    240 non-null    int64  
 1   Date              240 non-null    object 
 2   Product Category  240 non-null    object 
 3   Product Name      240 non-null    object 
 4   Units Sold        240 non-null    int64  
 5   Unit Price        240 non-null    float64
 6   Total Revenue     240 non-null    float64
 7   Region            240 non-null    object 
 8   Payment Method    240 non-null    object 
dtypes: float64(2), int64(2), object(5)
memory usage: 17.0+ KB
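
The `df.info()` output above lists `Date` with dtype `object`, which means the dates were read as plain strings. The following sketch shows one way to convert such a column with `pd.to_datetime`. It uses a small made-up frame rather than `sales_data.csv`, so the frame name and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical two-row sample that mimics the 'Date' column of the sales data.
sample = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02"],
    "Units Sold": [2, 1],
})

print(sample["Date"].dtype)  # string-like dtype before conversion

# Parse the strings into datetime values.
sample["Date"] = pd.to_datetime(sample["Date"])

print(sample["Date"].dtype)          # a datetime64 dtype after conversion
print(sample["Date"].dt.day_name())  # date accessors are now available
```

For a CSV file, passing `parse_dates=['Date']` to `pd.read_csv` is an alternative that performs the conversion while loading.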


In [44]:
df.describe()

|   | Transaction ID | Units Sold | Unit Price | Total Revenue |
|---|---|---|---|---|
| count | 240.0 | 240.0 | 240.0 | 240.0 |
| mean | 10120.5 | 2.158333 | 236.395583 | 335.699375 |
| std | 69.42622 | 1.322454 | 429.446695 | 485.804469 |
| min | 10001.0 | 1.0 | 6.5 | 6.5 |
| 25% | 10060.75 | 1.0 | 29.5 | 62.965 |
| 50% | 10120.5 | 2.0 | 89.99 | 179.97 |
| 75% | 10180.25 | 3.0 | 249.99 | 399.225 |
| max | 10240.0 | 10.0 | 3899.99 | 3899.99 |


In [45]:
## Accessing data from DataFrame
df.head(3)

# Access single column
df['Total Revenue']
# Access Multiple columns
df[['Date', 'Transaction ID']]

# Accessing rows: integer-position based (iloc)
df.iloc[0]
# Access an individual element in that row: Product Name (column position 3)
df.iloc[0, 3]

# Accessing rows: label based (loc)
df.loc[0, ['Product Name', 'Transaction ID']]

# df.at[]: access a single value for a row/column label pair.
df.at[0, 'Date']   # df.at[] is label based, similar to df.loc[]
df.iat[0, 0]        # df.iat[] is integer-position based, similar to df.iloc[]

np.int64(10001)
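
With the default `RangeIndex`, row labels and row positions are the same numbers, so `loc` and `iloc` appear interchangeable in the cell above. The sketch below uses a small made-up frame with index labels 10, 20, 30 (an assumption for illustration, unrelated to the sales data) to show where they differ.

```python
import pandas as pd

# Hypothetical frame whose index labels do not match row positions.
df_demo = pd.DataFrame(
    {"product": ["pen", "book", "lamp"], "price": [1.5, 12.0, 30.0]},
    index=[10, 20, 30],
)

first_by_position = df_demo.iloc[0]   # first row, whatever its label is
row_by_label = df_demo.loc[20]        # row whose index label is 20

value_by_label = df_demo.at[30, "price"]   # label-based scalar access
value_by_position = df_demo.iat[2, 1]      # position-based scalar access

# There is no row labeled 0, so label-based lookup fails here.
try:
    df_demo.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position["product"], row_by_label["product"])
print(value_by_label, value_by_position, label_zero_found)
```

`at` and `iat` return a single scalar and are intended for single-value access, while `loc` and `iloc` also accept lists and slices.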

In [48]:
# Data Manipulation with DataFrames
data = {
    'name':['Prashant', 'Pranav', 'Prince', 'Pavan'],
    'age':[43, 23, 54, 23],
    'City':['Jalgaon', 'Mumbai', 'Delhi', 'Kolkata']
}

df = pd.DataFrame(data)
df

# Adding a new column : email
df['email'] = ['prashant@gmail.com', 'pranav@gmail.com', 'prince@gmail.com', 'pavan@gmail.com']
df

# Remove the newly added column: axis=0 drops rows, axis=1 drops columns; inplace=True modifies the current DataFrame instead of returning a new one
df.drop('email', axis=1, inplace=True)
df

# Add 1 year to everyone's age
df['age'] = df['age'] + 1
df

# Drop the row with index label 3
df.drop(3, inplace=True)
df

|   | name | age | City |
|---|---|---|---|
| 0 | Prashant | 44 | Jalgaon |
| 1 | Pranav | 24 | Mumbai |
| 2 | Prince | 55 | Delhi |
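
The cell above modifies `df` in place. As an alternative sketch, the same steps can be written so that each call returns a new DataFrame and the original stays untouched. The data is the same as in the cell above; the names `original` and `updated` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({
    'name': ['Prashant', 'Pranav', 'Prince', 'Pavan'],
    'age': [43, 23, 54, 23],
    'City': ['Jalgaon', 'Mumbai', 'Delhi', 'Kolkata'],
})

updated = (
    original
    .assign(email=['prashant@gmail.com', 'pranav@gmail.com',
                   'prince@gmail.com', 'pavan@gmail.com'])  # add a column
    .drop(columns='email')                  # remove it again (same as axis=1)
    .assign(age=lambda d: d['age'] + 1)     # add 1 year to every age
    .drop(index=3)                          # drop the row labeled 3
)

print(updated)
print(original.shape)  # the original DataFrame is unchanged
```

The `columns=` and `index=` keywords of `drop` state the intent directly, which can be easier to read than remembering what `axis=0` and `axis=1` mean.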
