# **Pandas Basics**

### **Install pandas package**

In [54]:
# commented out after the first run to save time
# %pip install pandas

### **Import pandas**

In [55]:
import pandas as pd

## **DataFrames**
A DataFrame is a two-dimensional labeled data structure whose columns can hold 
different data types, similar to a spreadsheet or SQL table. 
It is the main pandas object for manipulating and analyzing structured data in Python.

In [56]:
# create an empty data frame
# df = pd.DataFrame()
# df

In [57]:
# Create a DataFrame using a list of lists

row_data = [['Ron', 20], ['Shaun', 21], ['Aris', 22]]
df = pd.DataFrame(row_data, columns=['Name', 'Age'])
df

    Name  Age
0    Ron   20
1  Shaun   21
2   Aris   22


In [58]:
# Create a DataFrame from a dictionary of lists

data = {
    'Name': ['Ron', 'Shaun', 'Aris'],
    'Age': [30, 31, 32]
}

df = pd.DataFrame(data)
df

    Name  Age
0    Ron   30
1  Shaun   31
2   Aris   32


In [59]:
# Create a DataFrame from a list of dictionaries

data = [
    {'Name': 'Ron', 'Age':40},
    {'Name': 'Shaun', 'Age':41},
    {'Name': 'Aris', 'Age':42},
]

df = pd.DataFrame(data)
df

    Name  Age
0    Ron   40
1  Shaun   41
2   Aris   42
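
The three constructions above differ only in how the input is laid out. As a quick check, the sketch below builds one small table all three ways and compares the results with `DataFrame.equals`. The ages are made identical here purely for illustration, and the variable names are not from the notebook.

```python
import pandas as pd

# Same illustrative data expressed in three layouts
rows = [['Ron', 20], ['Shaun', 21], ['Aris', 22]]
columns_dict = {'Name': ['Ron', 'Shaun', 'Aris'], 'Age': [20, 21, 22]}
records = [
    {'Name': 'Ron', 'Age': 20},
    {'Name': 'Shaun', 'Age': 21},
    {'Name': 'Aris', 'Age': 22},
]

from_rows = pd.DataFrame(rows, columns=['Name', 'Age'])
from_dict = pd.DataFrame(columns_dict)
from_records = pd.DataFrame(records)

# equals() checks that the frames have the same shape and elements
same = from_rows.equals(from_dict) and from_rows.equals(from_records)
print(same)
```

A list of lists needs the column names passed separately, while both dictionary-based layouts carry the names with the data.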


## **Series**

A pandas Series is a one-dimensional labeled array capable of 
holding data of any type (integer, string, float, etc.). 
It's similar to a one-column table or an array with associated labels, 
providing powerful indexing and manipulation capabilities in Python.

In [60]:
# Integer series
s = pd.Series([1,2,3,4,5])
s

0    1
1    2
2    3
3    4
4    5
dtype: int64
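
The "labeled" part of the definition is easier to see with a non-default index. A minimal sketch, assuming hypothetical labels `'a'`, `'b'`, `'c'` in place of the default 0, 1, 2:

```python
import pandas as pd

# Hypothetical string labels replace the default integer index
labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(labeled['b'])         # access by label
print(labeled.iloc[0])      # access by integer position
print(list(labeled.index))  # the labels themselves
```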

### **Pandas Data Types**

Basic Data Types:
- Integer (int64): Represents whole numbers (e.g., 10, -5). 
    This is the default 64-bit integer type in pandas.
- Float (float64): Represents numbers with decimals (e.g., 3.14, -12.5).
- Boolean (bool): Represents logical True or False values.
- Object: A versatile but less efficient type that can store arbitrary Python objects 
such as strings, lists, or custom objects. 
    Pandas falls back to this type when it cannot infer a more specific data type.

In [61]:
# Float series

float_series = pd.Series([3.14, -3.14, 0.77, -0.7777])
float_series

0    3.1400
1   -3.1400
2    0.7700
3   -0.7777
dtype: float64

In [62]:
# Boolean series: True(1) or False(0)
boolean_series = pd.Series([True, False, False, True])
boolean_series

0     True
1    False
2    False
3     True
dtype: bool

In [63]:
# Mixed data types of series (Type = object)
obj_series = pd.Series([30, 3.14, True, -30, -3.14])
obj_series

0      30
1    3.14
2    True
3     -30
4   -3.14
dtype: object
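
It may help to see where the fallback to `object` actually happens. In the sketch below (illustrative values only), mixing integers and floats still yields a numeric dtype, and it is the boolean that prevents pandas from inferring one:

```python
import pandas as pd

ints_and_floats = pd.Series([30, 3.14, -30])
with_bool = pd.Series([30, 3.14, True])

print(ints_and_floats.dtype)  # integers are upcast to float
print(with_bool.dtype)        # no single numeric type is inferred
```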

Specialized Data Types:
- Datetime (datetime64[ns]): Represents dates and times with nanosecond precision. 
    Useful for time-series data analysis.
- Timedelta (timedelta64[ns]): Represents durations between timestamps.
- Categorical: Represents categorical data with predefined categories. 
    Efficient for storing limited sets of categories.
- Sparse: Represents data in which most values are the same (e.g., 0 or NaN). 
    Saves memory by storing only the values that differ from that fill value.

In [64]:
# DateTime Series (convert string to datetime first)
dt_series = pd.Series([
    pd.to_datetime("2024-08-12 12:00:00"),
    pd.to_datetime("2024-08-13 13:00:00"),
    pd.to_datetime("2024-08-14 02:00:00")
])
dt_series

0   2024-08-12 12:00:00
1   2024-08-13 13:00:00
2   2024-08-14 02:00:00
dtype: datetime64[ns]

In [65]:
# Timedelta Series
timedelta_series = pd.Series([
    pd.Timedelta(days= 8,hours=7, minutes=30),
    pd.Timedelta(days= 3,hours=8, minutes=20),
    pd.Timedelta(days= 1,hours=5, minutes=5)
])
timedelta_series


0   8 days 07:30:00
1   3 days 08:20:00
2   1 days 05:05:00
dtype: timedelta64[ns]

In [66]:
# Categorical Series (The result "Categories" shows the distinct values)
categorical_series = pd.Series(pd.Categorical(['Male', 'Female', 'Others', 'Male', 'Female']))
categorical_series

0      Male
1    Female
2    Others
3      Male
4    Female
dtype: category
Categories (3, object): ['Female', 'Male', 'Others']
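
To illustrate why a categorical dtype is considered efficient for a small set of repeated values, the following sketch compares memory usage before and after conversion. The repeated data is made up for the comparison, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# Illustrative data: three distinct labels repeated many times
labels = pd.Series(['Male', 'Female', 'Others'] * 1000)
as_category = labels.astype('category')

# deep=True also counts the memory held by the string values
print(labels.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
print(list(as_category.cat.categories))
```

The categorical version stores each distinct label once and keeps small integer codes for the rows.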

In [67]:
# Sparse Series
sparse_array = pd.Series(pd.arrays.SparseArray([1,2, pd.NA, 4, pd.NA]))
sparse_array

0      1
1      2
2    NaN
3      4
4    NaN
dtype: Sparse[object, nan]
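
A sparse array can be inspected to see what it actually stores. This is a small sketch with made-up, mostly-zero data and `0` chosen as the fill value:

```python
import pandas as pd

# Mostly-zero data; 0 is declared as the fill value
sparse_values = pd.arrays.SparseArray([0, 0, 1, 0, 0], fill_value=0)

print(sparse_values.sp_values)   # only the values that differ from the fill value
print(sparse_values.fill_value)  # the value that is not stored explicitly
print(sparse_values.density)     # fraction of entries that are stored
```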

### **Changing Data Types**

In [68]:
# Step 1: Check the current data type using ".dtype"
s.dtype

dtype('int64')

In [69]:
# Step 2: Convert to another data type using ".astype('dtype')".
# e.g., int to float as below
f_series = s.astype('float64')
f_series

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

In [70]:
# float64 to string
string_series = f_series.astype('string')
string_series

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: string

In [71]:
# String of float values to int: a string such as "1.0" cannot be parsed directly as an int,
# so convert to float first, then to int
f_series_2 = string_series.astype('float64')
int_series = f_series_2.astype('int64')
int_series

0    1
1    2
2    3
3    4
4    5
dtype: int64
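
When some strings cannot be converted at all, `astype` raises an error. One possible alternative is `pd.to_numeric` with `errors='coerce'`, sketched here on made-up input where `'abc'` is deliberately invalid:

```python
import pandas as pd

# 'abc' is a deliberately invalid entry for illustration
raw = pd.Series(['1.0', '2.5', 'abc'])

# errors='coerce' turns unparseable entries into NaN instead of raising
numbers = pd.to_numeric(raw, errors='coerce')
print(numbers)
```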

**Example: Sales Data Analysis**

You have a dataset of sales transactions that includes the product name, quantity sold, and sale price. 
You want to analyze the data to find the total revenue per product.

In [72]:
# Sales data as a dictionary of lists
data = {
    'Product Name':['A','B','C','A','B','A'],
    'Quantity Sold':[3,2,5,4,1,2],
    'Sale Price':[10,20,10,15,20,15]
}

# Step 1: Create a DataFrame
sales_df = pd.DataFrame(data)
sales_df

   Product Name  Quantity Sold  Sale Price
0             A              3          10
1             B              2          20
2             C              5          10
3             A              4          15
4             B              1          20
5             A              2          15


In [73]:
# Step 2: Get Quantity Sold and the Sale Price columns
sales_df["Quantity Sold"]

0    3
1    2
2    5
3    4
4    1
5    2
Name: Quantity Sold, dtype: int64

In [74]:
sales_df["Sale Price"]

0    10
1    20
2    10
3    15
4    20
5    15
Name: Sale Price, dtype: int64

In [75]:
# Step 3: calculate the total revenue and add it to the DataFrame

sales_df["Total Revenue"] = sales_df["Quantity Sold"] * sales_df["Sale Price"]
sales_df


   Product Name  Quantity Sold  Sale Price  Total Revenue
0             A              3          10             30
1             B              2          20             40
2             C              5          10             50
3             A              4          15             60
4             B              1          20             20
5             A              2          15             30


In [76]:
# Grand total across all products (the Series method is the idiomatic form)
grand_total = sales_df["Total Revenue"].sum()
grand_total

230

In [77]:
# Step 4: Group the rows by Product Name and sum the revenue within each group
total_revenue = sales_df.groupby("Product Name")["Total Revenue"].sum()
total_revenue

Product Name
A    120
B     60
C     50
Name: Total Revenue, dtype: int64

In [78]:
# Plus: Display the total revenue and the average sale price per product
result_per_product = pd.DataFrame()
result_per_product["Total Revenue"] = sales_df.groupby("Product Name")["Total Revenue"].sum()
result_per_product["Average Price"] = sales_df.groupby("Product Name")["Sale Price"].mean()
result_per_product


              Total Revenue  Average Price
Product Name
A                       120      13.333333
B                        60           20.0
C                        50           10.0
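
The per-product summary can also be produced with a single `groupby` call using named aggregation. The sketch below rebuilds the sales data so it runs on its own; the output column names `total_revenue` and `average_price` are chosen here for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    'Product Name': ['A', 'B', 'C', 'A', 'B', 'A'],
    'Quantity Sold': [3, 2, 5, 4, 1, 2],
    'Sale Price': [10, 20, 10, 15, 20, 15],
})
sales['Total Revenue'] = sales['Quantity Sold'] * sales['Sale Price']

# Named aggregation: output_column=(input_column, aggregation)
summary = sales.groupby('Product Name').agg(
    total_revenue=('Total Revenue', 'sum'),
    average_price=('Sale Price', 'mean'),
)
print(summary)
```

Grouping once avoids repeating the `groupby` for every column and keeps all aggregations in one place.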


### **Data Selection**

Pandas provides numerous methods for selecting and indexing data in Series and DataFrames, 
including label-based indexing with .loc, integer-position based indexing with .iloc, and conditional selection.

In [79]:
# Slice syntax: start : stop (exclusive) : step

# Getting the first three elements
sales_df['Product Name'][0:3]

0    A
1    B
2    C
Name: Product Name, dtype: object

In [80]:
# Getting every second element (step of 2)
sales_df["Sale Price"][::2]

0    10
2    10
4    20
Name: Sale Price, dtype: int64

### **Data Selection in Series**
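
As a minimal sketch of the selection styles mentioned above applied to a Series, the example below uses a small made-up Series with string labels (not taken from the sales data):

```python
import pandas as pd

# Illustrative Series with string labels
prices = pd.Series([10, 20, 10, 15], index=['w', 'x', 'y', 'z'])

print(prices.iloc[0:2])      # positions 0 and 1 (stop is excluded)
print(prices.loc['w':'y'])   # labels 'w' through 'y' (stop is included)
print(prices[prices > 10])   # conditional selection with a boolean mask
```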

### **Data Selection in DataFrame**

#### **Index Location (.iloc)**
- Selects rows (and optionally columns) by integer position.
- A slice returns a DataFrame; a single integer returns that row as a Series.
> Syntax: [start:stop(excluded):step]

In [81]:
sales_df

   Product Name  Quantity Sold  Sale Price  Total Revenue
0             A              3          10             30
1             B              2          20             40
2             C              5          10             50
3             A              4          15             60
4             B              1          20             20
5             A              2          15             30


In [82]:
# Getting the first 5 rows
sales_df.iloc[0:5]

   Product Name  Quantity Sold  Sale Price  Total Revenue
0             A              3          10             30
1             B              2          20             40
2             C              5          10             50
3             A              4          15             60
4             B              1          20             20


In [83]:
# Getting the first 5 rows and only for the Sale Price and Total Revenue
sales_df.iloc[0:5, 2:] # or can be [0:5, 2:4]

   Sale Price  Total Revenue
0          10             30
1          20             40
2          10             50
3          15             60
4          20             20


In [84]:
sales_df.iloc[0:, 2:4]

   Sale Price  Total Revenue
0          10             30
1          20             40
2          10             50
3          15             60
4          20             20
5          15             30


#### **Location (.loc)**
- Accesses a group of rows and columns by label(s) or a boolean array.
> Syntax: [start_label:end_label(**included**):step]

In [85]:
# Getting the rows labeled 2 to 4 (inclusive), without the Total Revenue column
sales_df.loc[2:4, "Product Name":"Sale Price"]

   Product Name  Quantity Sold  Sale Price
2             C              5          10
3             A              4          15
4             B              1          20
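
The difference between `.iloc` and `.loc` is easiest to see when the index labels are not 0, 1, 2, and so on. A small sketch, assuming a made-up frame whose labels are 10, 20, 30, 40:

```python
import pandas as pd

# Labels (10..40) deliberately differ from positions (0..3)
frame = pd.DataFrame({'value': ['a', 'b', 'c', 'd']}, index=[10, 20, 30, 40])

by_position = frame.iloc[1:3]  # positions 1 and 2; stop position 3 is excluded
by_label = frame.loc[10:30]    # labels 10, 20 and 30; stop label 30 is included

print(list(by_position.index))
print(list(by_label.index))
```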


## **Conditional Filtering** 

In [86]:
sales_df

   Product Name  Quantity Sold  Sale Price  Total Revenue
0             A              3          10             30
1             B              2          20             40
2             C              5          10             50
3             A              4          15             60
4             B              1          20             20
5             A              2          15             30


In [87]:
# Based on the Total Revenue value
sales_df[sales_df["Total Revenue"] >= 40]

   Product Name  Quantity Sold  Sale Price  Total Revenue
1             B              2          20             40
2             C              5          10             50
3             A              4          15             60


In [88]:
# Based on the Product Name
sales_df[sales_df["Product Name"] == "A"]

   Product Name  Quantity Sold  Sale Price  Total Revenue
0             A              3          10             30
3             A              4          15             60
5             A              2          15             30


In [90]:
# On two conditions (wrap column names that contain spaces in backticks)
sales_df.query('`Total Revenue` >= 40 and `Sale Price` > 10')

   Product Name  Quantity Sold  Sale Price  Total Revenue
1             B              2          20             40
3             A              4          15             60
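
The same two-condition filter can be written with boolean masks instead of `query`. The sketch below rebuilds the sales data so it runs on its own; note that each condition needs its own parentheses and that `&` is used rather than `and`.

```python
import pandas as pd

sales = pd.DataFrame({
    'Product Name': ['A', 'B', 'C', 'A', 'B', 'A'],
    'Quantity Sold': [3, 2, 5, 4, 1, 2],
    'Sale Price': [10, 20, 10, 15, 20, 15],
})
sales['Total Revenue'] = sales['Quantity Sold'] * sales['Sale Price']

# Combine element-wise conditions with & (and), | (or), ~ (not)
mask = (sales['Total Revenue'] >= 40) & (sales['Sale Price'] > 10)
filtered = sales[mask]

# Equivalent query-string form
via_query = sales.query('`Total Revenue` >= 40 and `Sale Price` > 10')
print(filtered.equals(via_query))
```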


## **Apply**

The apply method in pandas runs a custom function over your data. 
On a Series it calls the function once per element; on a DataFrame it calls the function once per column (or once per row with axis=1) 
and returns a new Series or DataFrame built from the results.

In [None]:
sales_df

   Product Name  Quantity Sold  Sale Price  Total Revenue
0             A              3          10             30
1             B              2          20             40
2             C              5          10             50
3             A              4          15             60
4             B              1          20             20
5             A              2          15             30


In [None]:
def discount(original_price):
    discount_rate = 0.10  # 10% discount
    discount_amount = original_price * discount_rate
    discounted_price = original_price - discount_amount
    return discounted_price

# disc_price = sales_df["Sale Price"].apply(discount)

sales_df["With 10 % Discount"] = sales_df["Sale Price"].apply(discount)
sales_df

   Product Name  Quantity Sold  Sale Price  Total Revenue  With 10 % Discount
0             A              3          10             30                 9.0
1             B              2          20             40                18.0
2             C              5          10             50                 9.0
3             A              4          15             60                13.5
4             B              1          20             20                18.0
5             A              2          15             30                13.5
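
`apply` can also be called on a whole DataFrame. As a sketch on a shortened, made-up copy of the sales data, `axis=1` passes one row at a time to the function; the result is compared with plain column arithmetic, which gives the same values and is usually the faster choice for simple calculations.

```python
import pandas as pd

sales = pd.DataFrame({
    'Product Name': ['A', 'B', 'C'],
    'Quantity Sold': [3, 2, 5],
    'Sale Price': [10, 20, 10],
})

# axis=1 passes each row to the function as a Series
row_wise = sales.apply(
    lambda row: row['Quantity Sold'] * row['Sale Price'], axis=1
)

# The same result with plain column arithmetic
vectorized = sales['Quantity Sold'] * sales['Sale Price']

print(row_wise.tolist())
print(vectorized.tolist())
```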


## **Common Pandas Methods**

Data Exploration:

- head(): Shows the first few rows of a DataFrame
- tail(): Shows the last few rows of a DataFrame
- describe(): Generates summary statistics (count, mean, standard deviation, quartiles, etc.), by default for numeric columns only
- info(): Displays information about the DataFrame, including data types and memory usage

Data Analysis:

- sum(): Calculates the sum of a Series or DataFrame
- mean(): Calculates the mean of a Series or DataFrame
- median(): Calculates the median of a Series or DataFrame
- std(): Calculates the sample standard deviation of a Series or DataFrame (ddof=1 by default)
- var(): Calculates the sample variance of a Series or DataFrame (ddof=1 by default)

In [None]:
reviews_data = {
    'ProductID': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10'],
    'Rating': [5, 3, 2, 3, 4, 5, 2, 4, 3, 1]
}
reviews_data

{'ProductID': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10'],
 'Rating': [5, 3, 2, 3, 4, 5, 2, 4, 3, 1]}

In [None]:
reviews_df = pd.DataFrame(reviews_data)
reviews_df

   ProductID  Rating
0         P1       5
1         P2       3
2         P3       2
3         P4       3
4         P5       4
5         P6       5
6         P7       2
7         P8       4
8         P9       3
9        P10       1


In [None]:
# first 5 rows
reviews_df.head()

   ProductID  Rating
0         P1       5
1         P2       3
2         P3       2
3         P4       3
4         P5       4


In [None]:
# first 7 rows
reviews_df.head(7)

   ProductID  Rating
0         P1       5
1         P2       3
2         P3       2
3         P4       3
4         P5       4
5         P6       5
6         P7       2


In [None]:
# last 5 rows
reviews_df.tail()

   ProductID  Rating
5         P6       5
6         P7       2
7         P8       4
8         P9       3
9        P10       1


In [None]:
# last 3 rows
reviews_df.tail(3)

   ProductID  Rating
7         P8       4
8         P9       3
9        P10       1


In [None]:
# .describe()
reviews_df.describe()

         Rating
count      10.0
mean        3.2
std    1.316561
min         1.0
25%        2.25
50%         3.0
75%         4.0
max         5.0


In [None]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ProductID  10 non-null     object
 1   Rating     10 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 292.0+ bytes


In [None]:
print("Sum: ", reviews_df["Rating"].sum())
print("Mean/Average: ", reviews_df["Rating"].mean())
print("Median: ", reviews_df["Rating"].median())
print("Standard Deviation: ", reviews_df["Rating"].std())
print("Variance: ", reviews_df["Rating"].var())

Sum:  32
Mean/Average:  3.2
Median:  3.0
Standard Deviation:  1.3165611772087666
Variance:  1.7333333333333334
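
The variance printed above is the sample variance, because pandas divides by n - 1 by default (`ddof=1`). A short sketch on the same ratings shows how `ddof=0` gives the population variance instead:

```python
import pandas as pd

ratings = pd.Series([5, 3, 2, 3, 4, 5, 2, 4, 3, 1])

sample_var = ratings.var()            # default ddof=1 divides by n - 1
population_var = ratings.var(ddof=0)  # ddof=0 divides by n

print(sample_var, population_var)
```

The same `ddof` parameter is accepted by `std()`.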


### **Importing and Exporting Data**

Pandas supports reading from and writing to a variety of file formats and data sources, 
including CSV files, Excel workbooks, and SQL databases, making it easy to integrate with data analysis workflows.

In [None]:
data_df = pd.read_csv("example.csv")
data_df

          A    B     C
0       1.0  5.0  10.0
1       2.0  6.5  11.0
2  2.333333  6.5  12.0
3       4.0  8.0  11.0


In [None]:
%pip install openpyxl

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Export the DataFrame to an Excel file
# .to_excel(file_name, sheet_name=..., index=whether to write the index column)
data_df.to_excel('exported_data.xlsx', sheet_name='Example Sheet', index=False)

In [None]:
# Export the DataFrame to a CSV file
# .to_csv(file_name, index=whether to write the index column)
data_df.to_csv('exported_data.csv', index=False)
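
To see what `index=False` does without depending on files on disk, the sketch below writes a small made-up frame to an in-memory buffer (`io.StringIO` stands in for a CSV file) and reads it back:

```python
import io
import pandas as pd

# Illustrative frame; the buffer stands in for a file on disk
original = pd.DataFrame({'A': [1.0, 2.0], 'B': [5.0, 6.5]})

buffer = io.StringIO()
original.to_csv(buffer, index=False)  # index=False omits the index column
buffer.seek(0)                        # rewind before reading

restored = pd.read_csv(buffer)
print(list(restored.columns))
print(restored.equals(original))
```

If the index were written (the default), reading the file back would produce an extra unnamed column holding the old index values.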

The sqlite3 module is part of Python's standard library, so it does not need to be installed with pip.

In [None]:
import sqlite3

In [None]:
# Open a connection to the SQLite database file first, then run the query
conn = sqlite3.connect("pandas-sql.db")
employees_df = pd.read_sql_query("SELECT * FROM employees", conn)
employees_df

   employee_id  first_name  last_name  department  salary   hire_date  performance_rating
0            1        John        Doe       Sales   50000  2024-04-23                   3
1            2        Jane      Smith   Marketing   55000  2024-04-25                   4
2            3     Michael    Johnson       Sales   60000  2024-04-26                   5
3            4       Emily      Davis  Operations   62000  2024-04-27                   2
4            5       David     Wilson  Operations   58000  2024-04-28                   1
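
The query above relies on an existing `pandas-sql.db` file. As a self-contained sketch, the example below uses an in-memory SQLite database with a few made-up rows, writes them with `to_sql`, reads a filtered subset back with a parameterized query, and closes the connection afterwards.

```python
import sqlite3
import pandas as pd

# Illustrative rows; ":memory:" creates a temporary in-memory database
employees = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Michael'],
    'department': ['Sales', 'Marketing', 'Sales'],
    'salary': [50000, 55000, 60000],
})

conn = sqlite3.connect(':memory:')
try:
    # Write the DataFrame to a table, then read part of it back
    employees.to_sql('employees', conn, index=False)
    sales_team = pd.read_sql_query(
        'SELECT * FROM employees WHERE department = ?',
        conn,
        params=('Sales',),
    )
finally:
    conn.close()  # release the connection when done

print(sales_team)
```

Passing values through `params` rather than formatting them into the SQL string lets the database driver handle quoting.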
