### Pandas - DataFrame and Series

Pandas is a Python library for data manipulation, widely used for data analysis and data cleaning.  
It provides two primary data structures: Series and DataFrame.  
A Series is a one-dimensional, labelled, array-like object (a 1D NumPy array can be converted into a Series).  
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labelled axes (rows and columns).  
A 2D NumPy array can likewise be converted into a DataFrame.
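
The NumPy conversions mentioned above can be sketched as follows. The array values and the column names `A`, `B`, `C` are made up for illustration.

```python
import numpy as np
import pandas as pd

# 1D NumPy array -> Series (a default 0..n-1 index is assigned)
arr_1d = np.array([10, 20, 30])
series_from_array = pd.Series(arr_1d)

# 2D NumPy array -> DataFrame (column names are illustrative)
arr_2d = np.arange(6).reshape(2, 3)
df_from_array = pd.DataFrame(arr_2d, columns=["A", "B", "C"])

print(series_from_array)
print(df_from_array)
```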

In [None]:
import pandas as pd #pd is an alias for pandas

In [7]:
## Series
## 1D array like object that can store any data type.
## Similar to a column in a table.

data = [1,2,3,4,5]
series = pd.Series(data)
print("Series")
print(series)
print(type(series))
#0,1,2,3,4 are the default indexes
#Pandas assigns them automatically when no index is given

Series
0    1
1    2
2    3
3    4
4    5
dtype: int64
<class 'pandas.core.series.Series'>


In [None]:
## Creating Series from Dictionary
data = {'a':1, 'b':2, 'c':3}
series_dict = pd.Series(data)
print(series_dict)
# The keys of the dictionary become the indexes
# The values of the dictionary become the values of the Series

a    1
b    2
c    3
dtype: int64


In [10]:
## Custom Indexes
data = [10,20,30]
index = ['a','b','c']
# The number of index labels must match the number of elements

series = pd.Series(data, index=index)
print(series)

a    10
b    20
c    30
dtype: int64
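
As a small sketch of what custom indexes allow, the Series above can be read by label or by position. The last part assumes, as a quick check of the length rule stated in the comment above, that a mismatched index raises a `ValueError`.

```python
import pandas as pd

series = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = series["b"]         # label-based lookup
by_position = series.iloc[1]   # position-based lookup
subset = series[["a", "c"]]    # several labels at once, returns a Series

# Two labels for three elements: pandas rejects the mismatch
try:
    pd.Series([10, 20, 30], index=["a", "b"])
    mismatch_error = False
except ValueError:
    mismatch_error = True

print(by_label, by_position)
print(subset)
print(mismatch_error)
```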


In [54]:
## DataFrame

## Creating a DataFrame from a dictionary of lists

data = {
    'Name':["Krish","Anant","Titas"],
    'Age':[19,19,21],
    'City':["Kolkata","Howrah","Kharagpur"],
}
df = pd.DataFrame(data)
print("DataFrame")
print(df)
print(type(df))

DataFrame
    Name  Age       City
0  Krish   19    Kolkata
1  Anant   19     Howrah
2  Titas   21  Kharagpur
<class 'pandas.core.frame.DataFrame'>


In [None]:
## Converting a Pandas DataFrame into a NumPy array
## The row indexes and the column labels are dropped; only the values are kept
import numpy as np
array = np.array(df)
print(array)

[['Krish' 19 'Kolkata']
 ['Anant' 19 'Howrah']
 ['Titas' 21 'Kharagpur']]
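
An alternative to `np.array(df)` is the DataFrame's own `to_numpy()` method. The sketch below rebuilds the same small DataFrame so it runs on its own. Because the columns mix text and integers, the resulting array is expected to have the generic `object` dtype, while a single numeric column keeps an integer dtype.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Krish", "Anant", "Titas"],
    "Age": [19, 19, 21],
    "City": ["Kolkata", "Howrah", "Kharagpur"],
})

array = df.to_numpy()        # whole DataFrame, labels dropped
ages = df["Age"].to_numpy()  # one numeric column

print(array.shape)
print(array.dtype)
print(ages)
```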


In [23]:
## Creating a DataFrame from a list of dictionaries
data = [
    {'Name':'Krish', 'Age':19, 'City': 'Kolkata'},
    {'Name':'Anant', 'Age':19, 'City': 'Howrah'},
    {'Name':'Titas', 'Age':21, 'City': 'Kharagpur'}
]
# Keys become column names
# Values associated with the keys become the values under that particular column
df = pd.DataFrame(data)
print(df)

    Name  Age       City
0  Krish   19    Kolkata
1  Anant   19     Howrah
2  Titas   21  Kharagpur


In [None]:
df['Name'] #Gets all the elements under the column name 'Name'

0    Krish
1    Anant
2    Titas
Name: Name, dtype: object

In [None]:
type(df['Name']) #When dealing with a single column, it is of Series type

pandas.core.series.Series

In [None]:
df.loc[0] #Gets the row whose index label is 0
## loc selects by label; with the default index, the labels coincide with the row positions

Name      Krish
Age          19
City    Kolkata
Name: 0, dtype: object

In [28]:
df.iloc[0, 2] #Gets the element at row position 0 and column position 2


'Kolkata'

In [29]:
data = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]}, index=['x', 'y', 'z'])

# Using loc (label-based)
print(data.loc['x'])  # Retrieves row with label 'x'

# Using iloc (position-based)
print(data.iloc[0])   # Retrieves the first row (position 0)


A    10
B    40
Name: x, dtype: int64
A    10
B    40
Name: x, dtype: int64
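
One practical difference between the two accessors shows up when slicing: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, like ordinary Python slicing. A minimal sketch using the same `data` frame:

```python
import pandas as pd

data = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]}, index=['x', 'y', 'z'])

label_slice = data.loc['x':'y']     # rows 'x' AND 'y' (end label included)
position_slice = data.iloc[0:1]     # row at position 0 only (end excluded)

cell_by_label = data.loc['y', 'B']  # row label, column label
cell_by_position = data.iloc[1, 1]  # row position, column position

print(label_slice)
print(position_slice)
print(cell_by_label, cell_by_position)
```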


In [None]:
### Accessing a specified element
df['Name'][0]

'Krish'

In [34]:
df.at[0,"Name"]

'Krish'

In [None]:
### Accessing a specified element using at
df.at[2,"Age"]

np.int64(21)

In [None]:
## Accessing a specified element using iat
df.iat[0,0] #Gets the element at the first row and first column

'Krish'
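
`at` and `iat` can also be used to assign a single value, not only to read one. The sketch below rebuilds the small DataFrame so it runs on its own; the replacement values (`20` and `"Durgapur"`) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Krish", "Anant", "Titas"],
    "Age": [19, 19, 21],
    "City": ["Kolkata", "Howrah", "Kharagpur"],
})

# Reading: at uses labels, iat uses integer positions
same_cell = df.at[0, "Name"] == df.iat[0, 0]

# Writing a single cell (illustrative values)
df.at[0, "Age"] = 20
df.iat[2, 2] = "Durgapur"

print(same_cell)
print(df)
```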

In [36]:
### Data Manipulation with DataFrames
df

    Name  Age       City
0  Krish   19    Kolkata
1  Anant   19     Howrah
2  Titas   21  Kharagpur


In [46]:
## Adding a new column
df["Salary"] = [99999999999,99999999,9999999]
df

    Name  Age       City       Salary
0  Krish   19    Kolkata  99999999999
1  Anant   19     Howrah     99999999
2  Titas   21  Kharagpur      9999999


In [None]:
df.drop('Salary') #Raises a KeyError because drop defaults to axis=0 and looks for "Salary" among the row labels

KeyError: "['Salary'] not found in axis"

In [None]:
## Removing a column
df.drop('Salary',axis=1) #Returns a new DataFrame without the column; df itself is unchanged
## When axis = 0, the label is looked up among the row labels
## When axis = 1, the label is looked up among the column labels

    Name  Age       City
0  Krish   19    Kolkata
1  Anant   19     Howrah
2  Titas   21  Kharagpur


In [None]:
df

    Name  Age       City       Salary
0  Krish   19    Kolkata  99999999999
1  Anant   19     Howrah     99999999
2  Titas   21  Kharagpur      9999999


In [None]:
## Removing a column
## Saving the state of the dataframe
df.drop("Salary", axis=1, inplace = True)
## When inplace = True, the DataFrame is modified in place and the method returns None

In [50]:
df

    Name  Age       City
0  Krish   19    Kolkata
1  Anant   19     Howrah
2  Titas   21  Kharagpur
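
Instead of `inplace=True`, the result of `drop` can be assigned to a name, which keeps the original DataFrame available. The sketch below also uses the `columns=` keyword as an alternative to `axis=1`. The salary figures are placeholders, not the values used above.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Krish", "Anant", "Titas"],
    "Age": [19, 19, 21],
    "City": ["Kolkata", "Howrah", "Kharagpur"],
    "Salary": [1000, 2000, 3000],  # placeholder values
})

# drop returns a new DataFrame; columns="Salary" is equivalent to ('Salary', axis=1)
trimmed = df.drop(columns="Salary")

print(list(trimmed.columns))
print(list(df.columns))  # the original still has the Salary column
```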


In [55]:
## Adding 1 to every value in the Age column
df["Age"] = df["Age"] + 1 #All ages are increased by 1
df

    Name  Age       City
0  Krish   20    Kolkata
1  Anant   20     Howrah
2  Titas   22  Kharagpur


In [56]:
## Removing a row
df.drop(2) #Drops the row labelled 2 (axis=0 by default); df itself is unchanged

    Name  Age     City
0  Krish   20  Kolkata
1  Anant   20   Howrah
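
Dropping a row from the middle leaves a gap in the index labels. One way to renumber the rows, shown here as a sketch on a rebuilt copy of the DataFrame, is `reset_index(drop=True)`.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Krish", "Anant", "Titas"],
    "Age": [20, 20, 22],
    "City": ["Kolkata", "Howrah", "Kharagpur"],
})

without_anant = df.drop(1)                         # index labels are now 0 and 2
renumbered = without_anant.reset_index(drop=True)  # index labels are 0 and 1 again

print(without_anant)
print(renumbered)
print(len(df))  # the original DataFrame still has all its rows
```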


In [57]:
df = pd.read_csv("sales.csv")
df.head(5) #Gives the first 5 rows

,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
0,10001,2024-01-01,Electronics,iPhone 14 Pro,2,999.99,1999.98,North America,Credit Card
1,10002,2024-01-02,Home Appliances,Dyson V11 Vacuum,1,499.99,499.99,Europe,PayPal
2,10003,2024-01-03,Clothing,Levi's 501 Jeans,3,69.99,209.97,Asia,Debit Card
3,10004,2024-01-04,Books,The Da Vinci Code,4,15.99,63.96,North America,Credit Card
4,10005,2024-01-05,Beauty Products,Neutrogena Skincare Set,1,89.99,89.99,Europe,PayPal


In [18]:
df.tail(5) #Gives the last 5 records

,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
235,10236,2024-08-23,Home Appliances,Nespresso Vertuo Next Coffee and Espresso Maker,1,159.99,159.99,Europe,PayPal
236,10237,2024-08-24,Clothing,Nike Air Force 1 Sneakers,3,90.0,270.0,Asia,Debit Card
237,10238,2024-08-25,Books,The Handmaid's Tale by Margaret Atwood,3,10.99,32.97,North America,Credit Card
238,10239,2024-08-26,Beauty Products,Sunday Riley Luna Sleeping Night Oil,1,55.0,55.0,Europe,PayPal
239,10240,2024-08-27,Sports,Yeti Rambler 20 oz Tumbler,2,29.99,59.98,Asia,Credit Card
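
To try `read_csv`, `head` and `tail` without the `sales.csv` file, the sketch below reads the first three rows shown above from an in-memory string. Using `io.StringIO` as a stand-in for the file is an assumption made only so that the example is self-contained.

```python
import io
import pandas as pd

# First three records of the sales data, copied from the output above
csv_text = """Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
10001,2024-01-01,Electronics,iPhone 14 Pro,2,999.99,1999.98,North America,Credit Card
10002,2024-01-02,Home Appliances,Dyson V11 Vacuum,1,499.99,499.99,Europe,PayPal
10003,2024-01-03,Clothing,Levi's 501 Jeans,3,69.99,209.97,Asia,Debit Card
"""

sample = pd.read_csv(io.StringIO(csv_text))

first_two = sample.head(2)  # head and tail accept the number of rows to return
last_one = sample.tail(1)

print(first_two)
print(last_one)
```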


In [20]:
### Accessing Data from a DataFrame
df

,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
0,10001,2024-01-01,Electronics,iPhone 14 Pro,2,999.99,1999.98,North America,Credit Card
1,10002,2024-01-02,Home Appliances,Dyson V11 Vacuum,1,499.99,499.99,Europe,PayPal
2,10003,2024-01-03,Clothing,Levi's 501 Jeans,3,69.99,209.97,Asia,Debit Card
3,10004,2024-01-04,Books,The Da Vinci Code,4,15.99,63.96,North America,Credit Card
4,10005,2024-01-05,Beauty Products,Neutrogena Skincare Set,1,89.99,89.99,Europe,PayPal
...,...,...,...,...,...,...,...,...,...
235,10236,2024-08-23,Home Appliances,Nespresso Vertuo Next Coffee and Espresso Maker,1,159.99,159.99,Europe,PayPal
236,10237,2024-08-24,Clothing,Nike Air Force 1 Sneakers,3,90.00,270.00,Asia,Debit Card
237,10238,2024-08-25,Books,The Handmaid's Tale by Margaret Atwood,3,10.99,32.97,North America,Credit Card
238,10239,2024-08-26,Beauty Products,Sunday Riley Luna Sleeping Night Oil,1,55.00,55.00,Europe,PayPal


In [60]:
## Display the datatype of each column
print("Datatypes:")
df.dtypes

Datatypes:


Transaction ID        int64
Date                 object
Product Category     object
Product Name         object
Units Sold            int64
Unit Price          float64
Total Revenue       float64
Region               object
Payment Method       object
dtype: object
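
The `Date` column is listed as `object` because the dates were read as plain text. A possible next step, sketched here on a two-row stand-in for the sales data, is to convert it with `pd.to_datetime` so that date parts become accessible through the `.dt` accessor.

```python
import pandas as pd

# Two-row stand-in using dates and unit counts from the sales data
sample = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02"],
    "Units Sold": [2, 1],
})

sample["Date"] = pd.to_datetime(sample["Date"])  # text -> datetime64
months = sample["Date"].dt.month                 # date parts via .dt

print(sample.dtypes)
print(months.tolist())
```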

In [61]:
## Describe the dataframe
print("Statistical Summary")
df.describe()

Statistical Summary


       Transaction ID  Units Sold  Unit Price  Total Revenue
count           240.0       240.0       240.0          240.0
mean          10120.5    2.158333  236.395583     335.699375
std          69.42622    1.322454  429.446695     485.804469
min           10001.0         1.0         6.5            6.5
25%          10060.75         1.0        29.5         62.965
50%           10120.5         2.0       89.99         179.97
75%          10180.25         3.0      249.99        399.225
max           10240.0        10.0     3899.99        3899.99
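
Since `describe()` returns a DataFrame, individual statistics can be picked out with `loc`. The sketch below uses only the first five records of the sales data, so its numbers differ from the full summary above.

```python
import pandas as pd

# First five records of the sales data (numeric columns only)
sample = pd.DataFrame({
    "Units Sold": [2, 1, 3, 4, 1],
    "Unit Price": [999.99, 499.99, 69.99, 15.99, 89.99],
})

summary = sample.describe()

mean_units = summary.loc["mean", "Units Sold"]  # row label, column label
max_price = summary.loc["max", "Unit Price"]

print(summary)
print(mean_units, max_price)
```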
