# **Pandas Basics**

### **Install pandas package**

In [1]:
%pip install pandas

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Note: you may need to restart the kernel to use updated packages.





### **Import pandas**

In [3]:
import pandas as pd

## **DataFrames**
A DataFrame is a two-dimensional labeled data structure whose columns can hold 
different data types, similar to a spreadsheet or a SQL table. 
It is the main pandas object for manipulating and analyzing structured data in Python: 
every column has a name, every row has an index label, and selection, filtering, and aggregation all work on those labels.

In [4]:
# Create an empty DataFrame
df = pd.DataFrame()

In [11]:
# Create a DataFrame from a list of lists (one inner list per row)
row_data = [["john", 30], ["jane", 28], ["smith", 26]]
df = pd.DataFrame(row_data, columns=["Name", "Age"])
df

|   | Name  | Age |
|---|-------|-----|
| 0 | john  | 30  |
| 1 | jane  | 28  |
| 2 | smith | 26  |


In [10]:
# Create a DataFrame using a dictionary
data={
    "Name":["John","Jane","Smith"],
    "Age":[30,28,32]
}

df=pd.DataFrame(data)
df

|   | Name  | Age |
|---|-------|-----|
| 0 | John  | 30  |
| 1 | Jane  | 28  |
| 2 | Smith | 32  |


In [13]:
# Create a DataFrame using a list of dictionaries (one dictionary per row)
data=[
    {"Name":"John","Age":30},
    {"Name":"Jane","Age":28},
    {"Name":"Smith","Age":32}
]
df=pd.DataFrame(data)
df

|   | Name  | Age |
|---|-------|-----|
| 0 | John  | 30  |
| 1 | Jane  | 28  |
| 2 | Smith | 32  |
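The three construction styles above differ only in how the input is laid out. As a minimal sketch (the variable names `from_rows`, `from_columns`, and `from_records` are illustrative, not from the notebook), the following checks that they produce the same DataFrame:

```python
import pandas as pd

# The same three records, expressed in three different input shapes.
from_rows = pd.DataFrame(
    [["John", 30], ["Jane", 28], ["Smith", 32]],
    columns=["Name", "Age"],
)
from_columns = pd.DataFrame(
    {"Name": ["John", "Jane", "Smith"], "Age": [30, 28, 32]}
)
from_records = pd.DataFrame(
    [
        {"Name": "John", "Age": 30},
        {"Name": "Jane", "Age": 28},
        {"Name": "Smith", "Age": 32},
    ]
)

# DataFrame.equals checks that shape, values, and column dtypes all match.
print(from_rows.equals(from_columns))     # True
print(from_columns.equals(from_records))  # True
print(from_rows.shape)                    # (3, 2)
```

Which form to use mostly depends on how the data arrives: row-oriented sources fit the list forms, column-oriented sources fit the dictionary of lists.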


## **Series**

A pandas Series is a one-dimensional labeled array capable of 
holding data of any type (integer, string, float, etc.). 
It's similar to a one-column table or an array with associated labels, 
providing powerful indexing and manipulation capabilities in Python.

In [14]:
series=pd.Series([1,2,3,4,5])
series

0    1
1    2
2    3
3    4
4    5
dtype: int64
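The labels of a Series do not have to be the default 0, 1, 2, ... positions. A small sketch, assuming a hypothetical `ages` Series indexed by name, shows label-based and position-based access side by side:

```python
import pandas as pd

# A Series with custom string labels instead of the default integer index.
ages = pd.Series([30, 28, 32], index=["John", "Jane", "Smith"], name="Age")

print(ages["Jane"])          # label-based lookup -> 28
print(ages.iloc[0])          # position-based lookup -> 30
print(ages[ages > 29])       # boolean mask keeps John and Smith
print(ages.index.tolist())   # ['John', 'Jane', 'Smith']
```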

### **Pandas Data Types**

Common data types:
- Integer (int64): Represents whole numbers (e.g., 10, -5). 
    This 64-bit integer is typically the default integer type in pandas.
- Float (float64): Represents numbers with decimals (e.g., 3.14, -12.5).
- Boolean (bool): Represents logical True or False values.
- Object: A versatile but less efficient type that can hold any Python object, 
such as strings, lists, or custom objects. 
    Pandas falls back to this type when it cannot infer a more specific one, for example when a Series mixes numbers and booleans.

In [None]:
# Integer
series=pd.Series([1,2,3,4,5])
series

In [15]:
float_series=pd.Series([3.14,-3.14])
float_series

0    3.14
1   -3.14
dtype: float64

In [16]:
boolean_series=pd.Series([True,False,True])
boolean_series

0     True
1    False
2     True
dtype: bool

In [17]:
object_series=pd.Series([30,3.14,False])
object_series

0       30
1     3.14
2    False
dtype: object

Specialized Data Types:
- Datetime (datetime64[ns]): Represents dates and times with nanosecond precision. 
    Useful for time-series data analysis.
- Timedelta (timedelta64[ns]): Represents durations, such as the difference between two timestamps.
- Categorical: Represents data that takes a limited set of predefined categories. 
    Efficient when the same few values repeat many times.
- Sparse: Represents data in which one value (the fill value, typically NaN or 0) dominates. 
    Stores data efficiently by keeping only the entries that differ from the fill value.

In [18]:
# DateTime(Timestamp)

dt=pd.to_datetime("2024-08-27")
dt

Timestamp('2024-08-27 00:00:00')

In [20]:
time_series=pd.Series([
    pd.Timedelta(days=8,hours=3,minutes=15),
    pd.Timedelta(days=4,hours=3,minutes=15),
    pd.Timedelta(days=1,hours=3,minutes=15)
])
time_series

0   8 days 03:15:00
1   4 days 03:15:00
2   1 days 03:15:00
dtype: timedelta64[ns]

In [21]:
categorical_series=pd.Series(pd.Categorical(["Sales","Marketing","Operations"]))
categorical_series

0         Sales
1     Marketing
2    Operations
dtype: category
Categories (3, object): ['Marketing', 'Operations', 'Sales']
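The efficiency of the categorical type comes from storing each distinct value once and representing every row by a small integer code. The sketch below illustrates this on made-up repeated data; the variable names are hypothetical and the exact byte counts depend on the pandas version, so only the comparison is printed:

```python
import pandas as pd

# 3,000 rows that repeat only three distinct department names.
departments = ["Sales", "Marketing", "Operations"] * 1000
as_object = pd.Series(departments, dtype="object")
as_category = as_object.astype("category")

# deep=True also counts the memory held by the Python string objects.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(category_bytes < object_bytes)              # True

# Categories are stored once (sorted); each row holds only an integer code.
print(as_category.cat.categories.tolist())        # ['Marketing', 'Operations', 'Sales']
print(as_category.cat.codes.head(3).tolist())     # [2, 0, 1]
```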

In [22]:
# Sparse Series: the missing entries act as the fill value and are not stored
sparse_series=pd.Series(
pd.arrays.SparseArray([1,2,pd.NA,4,pd.NA])
)

sparse_series

0      1
1      2
2    NaN
3      4
4    NaN
dtype: Sparse[object, nan]
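Sparse storage is not limited to missing values. As a hedged sketch with made-up data, the example below uses 0 as the fill value and inspects how much of the Series is actually stored through the `.sparse` accessor:

```python
import pandas as pd

# Mostly-zero data: with fill_value=0, only the entries 1 and 2 are stored.
sparse_counts = pd.Series(pd.arrays.SparseArray([0, 0, 1, 0, 2], fill_value=0))

print(sparse_counts.dtype)                        # a Sparse[...] dtype with fill value 0
print(sparse_counts.sparse.npoints)               # 2 stored values
print(sparse_counts.sparse.density)               # 0.4 (2 stored out of 5)
print(sparse_counts.sparse.to_dense().tolist())   # [0, 0, 1, 0, 2]
```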

### **Changing Data Types**

In [24]:
# Convert an integer Series to float
series = pd.Series([1, 2, 3, 4, 5])
print(series.dtype)   # dtype before the conversion
float_series = series.astype('float64')
float_series

int64


0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64
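Type conversion is most often needed when numbers arrive as text. The following sketch uses made-up input to contrast `astype`, which requires every value to be convertible, with `pd.to_numeric(..., errors="coerce")`, which replaces unparseable entries with NaN:

```python
import pandas as pd

raw = pd.Series(["1", "2", "3"])
as_int = raw.astype("int64")     # works because every string is a valid integer
print(as_int.sum())              # 6

messy = pd.Series(["1.5", "2", "n/a"])
# messy.astype("float64") would raise ValueError because of "n/a".
# errors="coerce" converts what it can and marks the rest as NaN.
cleaned = pd.to_numeric(messy, errors="coerce")
print(cleaned.tolist())          # [1.5, 2.0, nan]
print(cleaned.isna().sum())      # 1
```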

**Example: Sales Data Analysis**

You have a dataset of sales transactions that includes the product name, quantity sold, and sale price. 
You want to analyze the data to find the total revenue per product.

In [25]:
# Build the DataFrame from a dictionary of lists (one list per column)
data = {
    'Product Name':['Iced Tea','Hot Chocolate','Lemonade','Coffee','Milkshake','Tea', 'Smoothie', 'Soda', 'Protein Shake', 'Matcha Latte'],
    'Type': ['Cold', 'Hot', 'Cold', 'Hot', 'Cold', 'Hot', 'Cold', 'Hot', 'Cold', 'Hot'],
    'Stock': [15, 15, 15, 15, 15, 15, 15, 15, 15, 15],
    'Quantity Sold':[6, 9, 13, 11, 8, 6, 14, 10, 8, 10],
    'Cost of Goods Sold':[7, 10, 6, 8, 9, 7, 10, 11, 8, 9],
    'Sale Price':[13, 20, 11, 15, 19, 14, 17, 18, 20, 12],
    'Rating': [1, 3, 5, 4, 3, 2, 5, 3, 3, 3]
}

In [29]:
sales_df=pd.DataFrame(data)
sales_df

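|   | Product Name | Type | Stock | Quantity Sold | Cost of Goods Sold | Sale Price | Rating |
|---|---|---|---|---|---|---|---|
| 0 | Iced Tea | Cold | 15 | 6 | 7 | 13 | 1 |
| 1 | Hot Chocolate | Hot | 15 | 9 | 10 | 20 | 3 |
| 2 | Lemonade | Cold | 15 | 13 | 6 | 11 | 5 |
| 3 | Coffee | Hot | 15 | 11 | 8 | 15 | 4 |
| 4 | Milkshake | Cold | 15 | 8 | 9 | 19 | 3 |
| 5 | Tea | Hot | 15 | 6 | 7 | 14 | 2 |
| 6 | Smoothie | Cold | 15 | 14 | 10 | 17 | 5 |
| 7 | Soda | Hot | 15 | 10 | 11 | 18 | 3 |
| 8 | Protein Shake | Cold | 15 | 8 | 8 | 20 | 3 |
| 9 | Matcha Latte | Hot | 15 | 10 | 9 | 12 | 3 |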


In [None]:
sales_df["Product Name"]

In [34]:
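# Revenue per product: units sold times unit price
sales_df["Total Revenue"] = sales_df["Quantity Sold"] * sales_df["Sale Price"]
# Gross profit per unit: sale price minus cost of goods sold
sales_df["Gross Profit"] = sales_df["Sale Price"] - sales_df["Cost of Goods Sold"]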
sales_df

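|   | Product Name | Type | Stock | Quantity Sold | Cost of Goods Sold | Sale Price | Rating | Total Revenue | Gross Profit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Iced Tea | Cold | 15 | 6 | 7 | 13 | 1 | 78 | 6 |
| 1 | Hot Chocolate | Hot | 15 | 9 | 10 | 20 | 3 | 180 | 10 |
| 2 | Lemonade | Cold | 15 | 13 | 6 | 11 | 5 | 143 | 5 |
| 3 | Coffee | Hot | 15 | 11 | 8 | 15 | 4 | 165 | 7 |
| 4 | Milkshake | Cold | 15 | 8 | 9 | 19 | 3 | 152 | 10 |
| 5 | Tea | Hot | 15 | 6 | 7 | 14 | 2 | 84 | 7 |
| 6 | Smoothie | Cold | 15 | 14 | 10 | 17 | 5 | 238 | 7 |
| 7 | Soda | Hot | 15 | 10 | 11 | 18 | 3 | 180 | 7 |
| 8 | Protein Shake | Cold | 15 | 8 | 8 | 20 | 3 | 160 | 12 |
| 9 | Matcha Latte | Hot | 15 | 10 | 9 | 12 | 3 | 120 | 3 |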


### **Data Selection**

Pandas provides numerous methods for selecting and indexing data in Series and DataFrames, 
including label-based indexing with .loc, integer-position based indexing with .iloc, and conditional selection.

In [43]:
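# Slice syntax: [start:stop:step]; the stop position is excluded
sales_df["Product Name"][0:2]   # rows at positions 0 and 1

sales_df["Product Name"][3:7]   # rows at positions 3 through 6

sales_df["Product Name"][::2]   # every second row (only this last result is displayed)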

0         Iced Tea
2         Lemonade
4        Milkshake
6         Smoothie
8    Protein Shake
Name: Product Name, dtype: object

### **Data Selection in Series**

In [49]:
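sales_df.iloc[:3]        # first three rows, returned as a DataFrame
type(sales_df.iloc[0])   # a single row is returned as a Series
sales_df.iloc[3]         # row at position 3 (only this last result is displayed)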

Product Name          Coffee
Type                     Hot
Stock                     15
Quantity Sold             11
Cost of Goods Sold         8
Sale Price                15
Rating                     4
Total Revenue            165
Gross Profit               7
Name: 3, dtype: object

### **Data Selection in DataFrame**

#### **Index Location (.iloc)**
- Selects rows and columns by integer position.
- A slice or a list of positions returns a DataFrame; a single integer position returns a Series.
> Syntax: [start_position:end_position (excluded):step]

In [50]:
sales_df.iloc[:3,[0,3,5,7]]



|   | Product Name | Quantity Sold | Sale Price | Total Revenue |
|---|---|---|---|---|
| 0 | Iced Tea | 6 | 13 | 78 |
| 1 | Hot Chocolate | 9 | 20 | 180 |
| 2 | Lemonade | 13 | 11 | 143 |


In [None]:
sales_df.iloc[:3,0:6]

#### **Location (.loc)**
- Accesses a group of rows and columns by label(s) or a boolean array.
> Syntax: [start_label:end_label (included):step]

In [53]:
sales_df.loc[5:9]
sales_df.loc[:3]


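|   | Product Name | Type | Stock | Quantity Sold | Cost of Goods Sold | Sale Price | Rating | Total Revenue | Gross Profit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Iced Tea | Cold | 15 | 6 | 7 | 13 | 1 | 78 | 6 |
| 1 | Hot Chocolate | Hot | 15 | 9 | 10 | 20 | 3 | 180 | 10 |
| 2 | Lemonade | Cold | 15 | 13 | 6 | 11 | 5 | 143 | 5 |
| 3 | Coffee | Hot | 15 | 11 | 8 | 15 | 4 | 165 | 7 |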


In [52]:
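# A list of labels selects exactly those columns
sales_df.loc[5:9, ["Product Name", "Sale Price"]]
# A label slice selects every column from "Product Name" through "Sale Price" (displayed below)
sales_df.loc[5:9, "Product Name":"Sale Price"]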

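|   | Product Name | Type | Stock | Quantity Sold | Cost of Goods Sold | Sale Price |
|---|---|---|---|---|---|---|
| 5 | Tea | Hot | 15 | 6 | 7 | 14 |
| 6 | Smoothie | Cold | 15 | 14 | 10 | 17 |
| 7 | Soda | Hot | 15 | 10 | 11 | 18 |
| 8 | Protein Shake | Cold | 15 | 8 | 8 | 20 |
| 9 | Matcha Latte | Hot | 15 | 10 | 9 | 12 |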
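With the default 0, 1, 2, ... index, labels and positions coincide, which can hide the difference between `.loc` and `.iloc`. The sketch below uses a small hypothetical `drinks` DataFrame whose index labels differ from the row positions to make the inclusive and exclusive end points visible:

```python
import pandas as pd

drinks = pd.DataFrame(
    {
        "Product Name": ["Tea", "Coffee", "Soda", "Lemonade"],
        "Sale Price": [14, 15, 18, 11],
    },
    index=[10, 20, 30, 40],  # labels that differ from the row positions
)

by_label = drinks.loc[10:30]     # labels 10, 20, 30: the end label is included
by_position = drinks.iloc[0:3]   # positions 0, 1, 2: the end position is excluded
print(by_label.equals(by_position))       # True

print(drinks.iloc[0:2].index.tolist())    # [10, 20]
print(drinks.loc[20, "Sale Price"])       # 15
```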


## **Conditional Filtering** 

In [57]:
sales_df[(sales_df["Total Revenue"]>=150) & (sales_df["Type"]=="Cold")]

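|   | Product Name | Type | Stock | Quantity Sold | Cost of Goods Sold | Sale Price | Rating | Total Revenue | Gross Profit |
|---|---|---|---|---|---|---|---|---|---|
| 4 | Milkshake | Cold | 15 | 8 | 9 | 19 | 3 | 152 | 10 |
| 6 | Smoothie | Cold | 15 | 14 | 10 | 17 | 5 | 238 | 7 |
| 8 | Protein Shake | Cold | 15 | 8 | 8 | 20 | 3 | 160 | 12 |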
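Besides `&` (and), boolean masks can be combined with `|` (or) and negated with `~`; each condition needs its own parentheses. The sketch below works on a reduced, illustrative copy of the sales data and also shows `isin` and `query` as alternative ways to express a filter:

```python
import pandas as pd

df = pd.DataFrame({
    "Product Name": ["Iced Tea", "Coffee", "Milkshake", "Smoothie"],
    "Type": ["Cold", "Hot", "Cold", "Cold"],
    "Total Revenue": [78, 165, 152, 238],
})

# OR: rows that satisfy at least one condition
high_or_hot = df[(df["Total Revenue"] >= 150) | (df["Type"] == "Hot")]
# NOT: rows where the condition is False
not_cold = df[~(df["Type"] == "Cold")]
# Membership test against a list of values
picked = df[df["Product Name"].isin(["Coffee", "Smoothie"])]
# query(): same AND filter as a string; backticks quote names containing spaces
cold_high = df.query("`Total Revenue` >= 150 and Type == 'Cold'")

print(high_or_hot["Product Name"].tolist())  # ['Coffee', 'Milkshake', 'Smoothie']
print(not_cold["Product Name"].tolist())     # ['Coffee']
print(picked["Product Name"].tolist())       # ['Coffee', 'Smoothie']
print(cold_high["Product Name"].tolist())    # ['Milkshake', 'Smoothie']
```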


## **Apply**

The apply method runs a custom function over pandas data. 
On a Series, it calls the function once per value; on a DataFrame, it calls the function once per column (axis=0, the default) or once per row (axis=1). 
The results are collected into a new Series or DataFrame.

In [62]:
def discount(original_price):
    discount_rate=0.10
    discount_amount=original_price*discount_rate
    discount_price=original_price-discount_amount
    return discount_price
sales_df["10% Discounted Price"]=sales_df["Sale Price"].apply(discount)
sales_df

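|   | Product Name | Type | Stock | Quantity Sold | Cost of Goods Sold | Sale Price | Rating | Total Revenue | Gross Profit | 10% Discounted Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Iced Tea | Cold | 15 | 6 | 7 | 13 | 1 | 78 | 6 | 11.7 |
| 1 | Hot Chocolate | Hot | 15 | 9 | 10 | 20 | 3 | 180 | 10 | 18.0 |
| 2 | Lemonade | Cold | 15 | 13 | 6 | 11 | 5 | 143 | 5 | 9.9 |
| 3 | Coffee | Hot | 15 | 11 | 8 | 15 | 4 | 165 | 7 | 13.5 |
| 4 | Milkshake | Cold | 15 | 8 | 9 | 19 | 3 | 152 | 10 | 17.1 |
| 5 | Tea | Hot | 15 | 6 | 7 | 14 | 2 | 84 | 7 | 12.6 |
| 6 | Smoothie | Cold | 15 | 14 | 10 | 17 | 5 | 238 | 7 | 15.3 |
| 7 | Soda | Hot | 15 | 10 | 11 | 18 | 3 | 180 | 7 | 16.2 |
| 8 | Protein Shake | Cold | 15 | 8 | 8 | 20 | 3 | 160 | 12 | 18.0 |
| 9 | Matcha Latte | Hot | 15 | 10 | 9 | 12 | 3 | 120 | 3 | 10.8 |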
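The discount example applies a function to the values of one column. When the function needs several columns at once, `DataFrame.apply` with `axis=1` passes each row to it. The sketch below uses a reduced copy of the data and a hypothetical `revenue` helper, and compares the result with the equivalent column arithmetic, which is usually the faster choice for simple formulas:

```python
import pandas as pd

df = pd.DataFrame({
    "Product Name": ["Iced Tea", "Hot Chocolate", "Lemonade"],
    "Quantity Sold": [6, 9, 13],
    "Sale Price": [13, 20, 11],
})

def revenue(row):
    # `row` is a Series holding one row; its index is the column names.
    return row["Quantity Sold"] * row["Sale Price"]

row_wise = df.apply(revenue, axis=1)                   # one call per row
vectorized = df["Quantity Sold"] * df["Sale Price"]    # whole-column arithmetic

print(row_wise.tolist())                               # [78, 180, 143]
print(row_wise.tolist() == vectorized.tolist())        # True
```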


## **Pandas Operators**

Data Analysis:

- sum(): Calculates the sum of a Series or DataFrame
- mean(): Calculates the mean of a Series or DataFrame
- median(): Calculates the median of a Series or DataFrame
- std(): Calculates the sample standard deviation (divides by n - 1 by default) of a Series or DataFrame
- var(): Calculates the sample variance (divides by n - 1 by default) of a Series or DataFrame

Data Loading and Exploration:

- head(): Shows the first few rows of a DataFrame
- tail(): Shows the last few rows of a DataFrame
- describe(): Generates summary statistics for each column (mean, standard deviation, etc.)
- info(): Displays information about the DataFrame, including data types and memory usage

In [64]:
sales_df["Total Revenue"].sum()

1500

In [66]:
sales_df["Sale Price"].mean()

15.9

In [None]:
# 1. Sort the values
# 2. Get the middle value
# sales_df["Rating"].sort_values()
sales_df["Rating"].median()

In [67]:
sales_df["Rating"].std()

1.2292725943057183

In [68]:
sales_df["Rating"].var() #variance

1.5111111111111113
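The variance above is the sample variance: pandas divides by n - 1 by default (`ddof=1`), whereas NumPy's `np.var` divides by n. A short sketch reproducing the Rating figures by hand, using the same ten ratings as the sales data:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([1, 3, 5, 4, 3, 2, 5, 3, 3, 3])

sample_var = ratings.var()             # ddof=1: divides by n - 1
population_var = ratings.var(ddof=0)   # divides by n
# Manual sample variance: sum of squared deviations over n - 1
manual = ((ratings - ratings.mean()) ** 2).sum() / (len(ratings) - 1)

print(round(sample_var, 4))                                     # 1.5111
print(round(population_var, 4))                                 # 1.36
print(np.isclose(sample_var, manual))                           # True
print(np.isclose(np.var(ratings.to_numpy()), population_var))   # True

# Median: sort the values and take the middle (here the mean of the two middle values)
print(ratings.median())                                         # 3.0
```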

In [None]:
sales_df.head() # default first 5 rows
sales_df.head(3)

In [69]:
sales_df.tail() # default: last 5 rows
sales_df.tail(3)

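|   | Product Name | Type | Stock | Quantity Sold | Cost of Goods Sold | Sale Price | Rating | Total Revenue | Gross Profit | 10% Discounted Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 7 | Soda | Hot | 15 | 10 | 11 | 18 | 3 | 180 | 7 | 16.2 |
| 8 | Protein Shake | Cold | 15 | 8 | 8 | 20 | 3 | 160 | 12 | 18.0 |
| 9 | Matcha Latte | Hot | 15 | 10 | 9 | 12 | 3 | 120 | 3 | 10.8 |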


In [70]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Product Name          10 non-null     object 
 1   Type                  10 non-null     object 
 2   Stock                 10 non-null     int64  
 3   Quantity Sold         10 non-null     int64  
 4   Cost of Goods Sold    10 non-null     int64  
 5   Sale Price            10 non-null     int64  
 6   Rating                10 non-null     int64  
 7   Total Revenue         10 non-null     int64  
 8   Gross Profit          10 non-null     int64  
 9   10% Discounted Price  10 non-null     float64
dtypes: float64(1), int64(7), object(2)
memory usage: 932.0+ bytes


In [71]:
sales_df.describe()

|   | Stock | Quantity Sold | Cost of Goods Sold | Sale Price | Rating | Total Revenue | Gross Profit | 10% Discounted Price |
|---|---|---|---|---|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
| mean | 15.0 | 9.5 | 8.5 | 15.9 | 3.2 | 150.0 | 7.4 | 14.31 |
| std | 0.0 | 2.677063 | 1.581139 | 3.3483 | 1.229273 | 47.56516 | 2.633122 | 3.01347 |
| min | 15.0 | 6.0 | 6.0 | 11.0 | 1.0 | 78.0 | 3.0 | 9.9 |
| 25% | 15.0 | 8.0 | 7.25 | 13.25 | 3.0 | 125.75 | 6.25 | 11.925 |
| 50% | 15.0 | 9.5 | 8.5 | 16.0 | 3.0 | 156.0 | 7.0 | 14.4 |
| 75% | 15.0 | 10.75 | 9.75 | 18.75 | 3.75 | 176.25 | 9.25 | 16.875 |
| max | 15.0 | 14.0 | 11.0 | 20.0 | 5.0 | 238.0 | 12.0 | 18.0 |


### **Aggregating Data** (.groupby)

Aggregating data involves summarizing data points into meaningful statistics, 
such as averages, sums, or counts, which can be achieved using GroupBy operations or pivot tables. 
This helps in understanding the dataset at a higher level.

In [72]:
sales_df["Type"].unique()


array(['Cold', 'Hot'], dtype=object)

In [73]:
sales_df["Type"] = sales_df["Type"].astype("category")
print("Data Type of Type Column:", sales_df["Type"].dtype)

Data Type of Type Column: category


In [None]:
total_revenue_based_on_type = sales_df.groupby('Type')['Total Revenue'].sum()
total_revenue_based_on_type

In [None]:
total_revenue_based_on_type_df = pd.DataFrame()
total_revenue_based_on_type_df["Total Revenue"] = sales_df.groupby('Type')['Total Revenue'].sum()
total_revenue_based_on_type_df


In [None]:
total_revenue_based_on_type_df["Total Quantity Sold"] = sales_df.groupby('Type')['Quantity Sold'].sum()
total_revenue_based_on_type_df
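Several aggregates can also be computed in one step instead of adding columns to an empty DataFrame one at a time. The following is a minimal sketch on a reduced, illustrative copy of the data; the output column names (`total_revenue`, `total_quantity`, `products`) are hypothetical. It uses named aggregation with `groupby(...).agg(...)` and shows `pivot_table` as an equivalent route for a single sum:

```python
import pandas as pd

df = pd.DataFrame({
    "Type": ["Cold", "Hot", "Cold", "Hot"],
    "Quantity Sold": [6, 9, 13, 11],
    "Total Revenue": [78, 180, 143, 165],
})

# Named aggregation: new_column=(source_column, aggregation_function)
summary = df.groupby("Type").agg(
    total_revenue=("Total Revenue", "sum"),
    total_quantity=("Quantity Sold", "sum"),
    products=("Total Revenue", "count"),
)
print(summary)

# pivot_table produces the same per-type sum of revenue
pivot = df.pivot_table(index="Type", values="Total Revenue", aggfunc="sum")
print(pivot.loc["Cold", "Total Revenue"])   # 221
print(pivot.loc["Hot", "Total Revenue"])    # 345
```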