# **Pandas Basics**

### **Install pandas package**

In [1]:
# pandas is used for data analytics
%pip install pandas

Collecting pandas
  Downloading pandas-2.2.3-cp313-cp313-win_amd64.whl.metadata (19 kB)
Collecting numpy>=1.26.0 (from pandas)
  Downloading numpy-2.2.3-cp313-cp313-win_amd64.whl.metadata (60 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2025.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.3-cp313-cp313-win_amd64.whl (11.5 MB)


### **Import pandas**

In [3]:
import pandas as pd
# pd is the conventional alias for pandas

## **DataFrames**
A DataFrame is a two-dimensional labeled data structure whose columns can hold 
different data types, similar to a spreadsheet or SQL table. 
Because every row and column carries a label, structured data can be selected, 
filtered, and aggregated by name in Python.

In [5]:
# df = DataFrame
df = pd.DataFrame()
df

In [8]:
# Create a DataFrame by passing the data as a list of lists
# Each inner list is one row
row_data = [["John", 30], ["Jane", 28], ["Smith", 26]]
headers = ["Name", "Age"]
df = pd.DataFrame(row_data, columns=headers)
df

    Name  Age
0   John   30
1   Jane   28
2  Smith   26


In [9]:
# Create a DataFrame from a dictionary of lists
# {column_name: [column values]}
data = {
    "Name": ["John", "Jane", "Smith"],
    "Age": [30,28,26]
}
df = pd.DataFrame(data)
df

    Name  Age
0   John   30
1   Jane   28
2  Smith   26


In [10]:
type(df)

pandas.core.frame.DataFrame

## **Series**

A pandas Series is a one-dimensional labeled array capable of 
holding data of any type (integer, string, float, etc.). 
It's similar to a one-column table or an array with associated labels, 
providing powerful indexing and manipulation capabilities in Python.

In [11]:
series = pd.Series([1,2,3,4,5])
series

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [12]:
type(series)

pandas.core.series.Series

### **Pandas Data Types**

Numeric:
- Integer (int64): Represents whole numbers (e.g., 10, -5). 
    This is the default integer type in pandas (64-bit integer).
- Float (float64): Represents numbers with decimals (e.g., 3.14, -12.5).

Non-numeric:
- Boolean (bool): Represents logical True or False values.
- Object: A versatile but less efficient type that can store any Python object, 
    such as strings, lists, or custom objects. 
    Pandas uses this type when it cannot infer a more specific data type.

In [13]:
# Integer (int64)
int_series = pd.Series([1,2,3,4,5])
int_series

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [14]:
# Float (float64)
float_series = pd.Series([3.14, -3.14, 0.0001, -0.0001])
float_series

0    3.1400
1   -3.1400
2    0.0001
3   -0.0001
dtype: float64

In [15]:
# Boolean (bool)
boolean_series = pd.Series([True, False, False, True])
boolean_series

0     True
1    False
2    False
3     True
dtype: bool

In [16]:
# Object (mixed data types)
object_series = pd.Series([30, 3.14, True, "John"])
object_series

0      30
1    3.14
2    True
3    John
dtype: object

Specialized Data Types:
- Categorical: Represents data that takes a limited, fixed set of values (categories). 
    Efficient for storing columns with many repeated values.
- Sparse: Represents data in which most entries share one fill value (such as NaN or 0). 
    Stores data efficiently by keeping only the values that differ from the fill value.

The data type of a Series determines how its values can be manipulated later.

In [18]:
# [('Marketing',), ('Sales',), ('Operations',), ('IT',), ('Finance',), ('HR',)]
# Create a Categorical from a list of department names
categorical_list = pd.Categorical(["Marketing", "Sales", "Operations", "IT", "Finance", "HR"])
categorical_series = pd.Series(categorical_list)
categorical_series

0     Marketing
1         Sales
2    Operations
3            IT
4       Finance
5            HR
dtype: category
Categories (6, object): ['Finance', 'HR', 'IT', 'Marketing', 'Operations', 'Sales']

In [19]:
sparse_series = pd.Series(pd.arrays.SparseArray([30,31,32,pd.NA, 29, 42, pd.NA]))
sparse_series

0     30
1     31
2     32
3    NaN
4     29
5     42
6    NaN
dtype: Sparse[object, nan]

### **Changing Data Types**

In [20]:
# astype()
# Convert integer to float
float_series = int_series.astype('float64')
float_series

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

In [25]:
# astype()
# Convert integer to string
string_series = int_series.astype('string')
string_series

0    1
1    2
2    3
3    4
4    5
dtype: string

**Example: Sales Data Analysis**

You have a dataset of sales transactions that includes each product's name, type, stock, quantity sold, manufacturing cost, market price, and rating. 
You want to analyze the data to find the total revenue per product.

In [None]:
# Create a DataFrame using a dictionary of lists
data = {
    'Product Name':['Iced Tea', 'Hot Chocolate' , 'Lemonade', 'Coffee', 'Milkshake', 'Tea', 'Smoothie', 'Soda', 'Protein Shake', 'Matcha Latte'],
    'Type': ['Cold', 'Hot', 'Cold', 'Hot', 'Cold', 'Hot', 'Cold', 'Cold', 'Cold', 'Hot'],
    'Stock': [15, 15, 15, 15, 15, 15, 15, 15, 15, 15],
    'Quantity Sold':[6, 9, 13, 11, 8, 6, 14, 10, 8, 10],
    'Manufacturing Cost':[7, 10, 6, 8, 9, 7, 10, 11, 8, 9],
    'Market Price':[13, 20, 11, 15, 19, 14, 17, 18, 20, 12],
    'Rating': [1, 3, 5, 4, 3, 2, 5, 3, 2, 3]
}
# Build the DataFrame from the dictionary and display it
df = pd.DataFrame(data)
df
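
One possible way to get the total revenue per product is sketched below. The notebook does not define revenue, so the formula `Quantity Sold * Market Price` is an assumption, and the `sales` and `Revenue` names are illustrative. Only the first three products are used to keep the example short.

```python
import pandas as pd

# A three-product subset of the sales data, kept small for readability
sales = pd.DataFrame({
    "Product Name": ["Iced Tea", "Hot Chocolate", "Lemonade"],
    "Quantity Sold": [6, 9, 13],
    "Market Price": [13, 20, 11],
})

# Assumption: revenue = quantity sold x market price
sales["Revenue"] = sales["Quantity Sold"] * sales["Market Price"]

# Index by product name so each revenue figure can be looked up by product
revenue_per_product = sales.set_index("Product Name")["Revenue"]
print(revenue_per_product)
```

Multiplying two columns works element by element, so no explicit loop is needed.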

### **Data Selection**

Pandas provides numerous methods for selecting and indexing data in Series and DataFrames, 
including label-based indexing with .loc, integer-position based indexing with .iloc, and conditional selection.

### **Data Selection in Series**
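
A minimal sketch of selecting data from a Series, assuming a small Series with string labels. The `ages` name and its labels are illustrative and do not come from the notebook.

```python
import pandas as pd

# A Series with names as labels instead of the default 0, 1, 2
ages = pd.Series([30, 28, 26], index=["John", "Jane", "Smith"])

by_label = ages.loc["Jane"]             # label-based selection
by_position = ages.iloc[0]              # integer-position selection
label_slice = ages.loc["John":"Jane"]   # label slices include the end label
position_slice = ages.iloc[0:2]         # position slices exclude the end position
older_than_27 = ages[ages > 27]         # conditional selection with a boolean mask

print(by_label, by_position)
print(older_than_27)
```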

### **Data Selection in DataFrame**

#### **Index Location (.iloc)**
- Selects rows (and optionally columns) by integer position.
- A slice or a list of positions returns a DataFrame; a single integer returns a Series.
> Syntax: [starting_index:ending_index(excluded):step]
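
The sketch below illustrates position-based selection with `.iloc` on a small DataFrame like the one built earlier. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Jane", "Smith"], "Age": [30, 28, 26]})

first_row = df.iloc[0]       # single integer: returns a Series
first_two = df.iloc[0:2]     # slice: returns a DataFrame, position 2 is excluded
every_other = df.iloc[::2]   # step of 2: rows at positions 0 and 2
cell = df.iloc[1, 1]         # row position 1, column position 1

print(type(first_row).__name__, type(first_two).__name__)
print(every_other)
```

Wrapping a single position in a list, as in `df.iloc[[0]]`, returns a one-row DataFrame instead of a Series.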

#### **Location (.loc)**
- Accesses a group of rows and columns by label(s) or a boolean array.
> Syntax: [starting_label:ending_label(included):step]
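
A short sketch of label-based selection with `.loc`. The row labels `"a"`, `"b"`, and `"c"` are hypothetical and were chosen so that labels are clearly distinct from integer positions.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John", "Jane", "Smith"], "Age": [30, 28, 26]},
    index=["a", "b", "c"],  # hypothetical row labels
)

row_b = df.loc["b"]                            # single label: returns a Series
rows_a_to_b = df.loc["a":"b"]                  # both end labels are included
ages_over_27 = df.loc[df["Age"] > 27, "Age"]   # boolean array plus a column label

print(rows_a_to_b)
print(ages_over_27)
```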

## **Conditional Filtering** 
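
A minimal sketch of conditional filtering, using a four-product subset of the sales data. The `drinks` name and the chosen conditions are illustrative.

```python
import pandas as pd

drinks = pd.DataFrame({
    "Product Name": ["Iced Tea", "Hot Chocolate", "Lemonade", "Coffee"],
    "Type": ["Cold", "Hot", "Cold", "Hot"],
    "Rating": [1, 3, 5, 4],
})

# A comparison on a column produces a boolean Series (a mask)
is_hot = drinks["Type"] == "Hot"
hot_drinks = drinks[is_hot]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
well_rated_cold = drinks[(drinks["Type"] == "Cold") & (drinks["Rating"] >= 3)]

print(hot_drinks)
print(well_rated_cold)
```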

## **Apply**

The apply function in pandas is a powerful tool for working with DataFrames. 
It applies a custom function along an axis of the DataFrame, to each column (axis=0, the default) or to each row (axis=1), 
and returns a new DataFrame or Series based on the results.
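
The sketch below shows `apply` on columns, on rows, and on a single Series. The data is a subset of the sales example; the profit formula and the price threshold of 15 are assumptions made for illustration.

```python
import pandas as pd

drinks = pd.DataFrame(
    {"Manufacturing Cost": [7, 10, 6], "Market Price": [13, 20, 11]},
    index=["Iced Tea", "Hot Chocolate", "Lemonade"],
)

# axis=0 (default): the function receives each column as a Series
column_max = drinks.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series
profit = drinks.apply(
    lambda row: row["Market Price"] - row["Manufacturing Cost"], axis=1
)

# Series.apply calls the function on each element
price_label = drinks["Market Price"].apply(
    lambda price: "premium" if price >= 15 else "regular"
)

print(column_max)
print(profit)
print(price_label)
```

For simple arithmetic such as the profit column, subtracting the columns directly is usually faster than `apply`; `apply` is most useful when the logic cannot be expressed with built-in column operations.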

## **Pandas Operators**
Data Analysis:

- sum(): Calculates the sum of a Series or DataFrame
- mean(): Calculates the mean of a Series or DataFrame
- median(): Calculates the median of a Series or DataFrame
- std(): Calculates the standard deviation of a Series or DataFrame
- var(): Calculates the variance of a Series or DataFrame
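
As a brief illustration, the methods above can be called on the quantities sold from the sales example. The variable names are illustrative.

```python
import pandas as pd

quantity_sold = pd.Series([6, 9, 13, 11, 8, 6, 14, 10, 8, 10])

total = quantity_sold.sum()
average = quantity_sold.mean()
middle = quantity_sold.median()
spread = quantity_sold.std()     # sample standard deviation (ddof=1 by default)
variance = quantity_sold.var()   # sample variance (ddof=1 by default)

print(total, average, middle)
print(spread, variance)
```

On a DataFrame, the same methods compute one result per column.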

Data Exploration:

- head(): Shows the first rows of a DataFrame (5 by default)
- tail(): Shows the last rows of a DataFrame (5 by default)
- describe(): Generates summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column by default
- info(): Displays information about the DataFrame, including data types and memory usage
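
A short sketch of these exploration methods on a six-product subset of the sales data. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Product Name": ["Iced Tea", "Hot Chocolate", "Lemonade", "Coffee", "Milkshake", "Tea"],
    "Quantity Sold": [6, 9, 13, 11, 8, 6],
})

first_rows = df.head()    # first 5 rows by default
last_two = df.tail(2)     # pass a number to change how many rows are shown
summary = df.describe()   # numeric columns only by default

print(first_rows)
print(last_two)
print(summary)
df.info()                 # prints its report directly and returns None
```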

### **Aggregating Data** (.groupby)

Aggregating data involves summarizing data points into meaningful statistics, 
such as averages, sums, or counts, which can be achieved using GroupBy operations or pivot tables. 
This helps in understanding the dataset at a higher level.
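
The sketch below shows one way to aggregate with `groupby` and with a pivot table, using a five-product subset of the sales data. The output column names `total_sold` and `average_rating` are illustrative.

```python
import pandas as pd

drinks = pd.DataFrame({
    "Type": ["Cold", "Hot", "Cold", "Hot", "Cold"],
    "Quantity Sold": [6, 9, 13, 11, 8],
    "Rating": [1, 3, 5, 4, 3],
})

# One statistic for one column, per group
sold_by_type = drinks.groupby("Type")["Quantity Sold"].sum()

# Several statistics at once, using named aggregation
summary = drinks.groupby("Type").agg(
    total_sold=("Quantity Sold", "sum"),
    average_rating=("Rating", "mean"),
)

# A pivot table expressing the same average rating per type
rating_pivot = drinks.pivot_table(index="Type", values="Rating", aggfunc="mean")

print(sold_by_type)
print(summary)
print(rating_pivot)
```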