# Pandas Basics

In [1]:
# Install pandas
%pip install pandas

Note: you may need to restart the kernel to use updated packages.





In [2]:
# importing pandas
import pandas as pd

## DataFrames
A DataFrame is a two-dimensional labeled data structure whose columns can hold different data types, similar to a spreadsheet or SQL table.
It provides a flexible way to filter, transform, aggregate, and analyze structured data in Python.

In [3]:
# Creating an Empty Dataframe

df = pd.DataFrame()
df

In [4]:
# Creating a DataFrame using a list of lists

data = [['Alice', 23], ['Mike', 34], ['Jetson', 28]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df

Unnamed: 0,Name,Age
0,Alice,23
1,Mike,34
2,Jetson,28


In [5]:
# Creating a DataFrame using a dictionary of lists

data = {
    'Name': ['Alice', 'Mike', 'Jetson'],
    'Age': [23, 34, 28]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age
0,Alice,23
1,Mike,34
2,Jetson,28


In [6]:
# Creating a DataFrame using a list of dictionaries

data = [
    {'name': 'Alice', 'age': 23}, 
    {'name': 'Mike', 'age': 34}, 
    {'name': 'Jetson', 'age': 28}
]

df = pd.DataFrame(data)
df

Unnamed: 0,name,age
0,Alice,23
1,Mike,34
2,Jetson,28
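
When the dictionaries in a list do not all share the same keys, pandas fills the gaps with NaN. The sketch below illustrates this with made-up records (the missing `age` key is an assumption added for the example, not part of the data above):

```python
import pandas as pd

# Hypothetical records: the second one has no 'age' key
records = [
    {'name': 'Alice', 'age': 23},
    {'name': 'Mike'},
    {'name': 'Jetson', 'age': 28},
]

people_df = pd.DataFrame(records)

# The missing value shows up as NaN, which makes the column float64
print(people_df)
print(people_df['age'].isna().sum())
print(people_df['age'].dtype)
```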


## Series

A pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, etc.). It's similar to a one-column table or an array with associated labels, providing powerful indexing and manipulation capabilities in Python.
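
To make the "associated labels" concrete, here is a small sketch of a Series with a custom index. The names used as labels are illustrative only:

```python
import pandas as pd

# Hypothetical labels: ages keyed by name instead of 0, 1, 2
ages = pd.Series([23, 34, 28], index=['Alice', 'Mike', 'Jetson'])

print(ages['Mike'])      # look up by label
print(ages.iloc[0])      # look up by position
print(ages[ages > 25])   # boolean filtering keeps the labels
```

Without an explicit `index`, pandas falls back to the default integer labels 0, 1, 2, and so on.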

In [7]:
s = pd.Series([1,3,5,6,7,8])
s

0    1
1    3
2    5
3    6
4    7
5    8
dtype: int64

In [8]:
s = pd.Series([1,3,"Test",6,7,8])
print(s.dtype)
s

object


0       1
1       3
2    Test
3       6
4       7
5       8
dtype: object

### **Pandas Data Types**

Basic Data Types:
- Integer (int64): Represents whole numbers (e.g., 10, -5). This is the default integer type in pandas.
- Float (float64): Represents numbers with decimals (e.g., 3.14, -12.5).
- Boolean (bool): Represents logical True or False values.
- Object: This is a versatile but less efficient type that can store various data types like strings, lists, or custom objects. Pandas uses this type when it cannot infer a more specific data type.

In [9]:
# Integer (int64)
int_series = pd.Series([1, 3, 5, 7])
int_series

# Float (float64)
float_series = pd.Series([1.02, 2.3, 3.1415, 20])
float_series

# Boolean (bool)
bool_series = pd.Series([True, False])
bool_series

# Object (object / mixed data types)
obj_series = pd.Series([1,3,"Test",6,7,8])
obj_series

0       1
1       3
2    Test
3       6
4       7
5       8
dtype: object

Specialized Data Types:
- Datetime (datetime64[ns]): Represents dates and times with nanosecond precision. Useful for time-series data analysis.
- Timedelta (timedelta64[ns]): Represents durations between timestamps.
- Categorical: Represents categorical data with predefined categories. Efficient for storing limited sets of categories.
- Sparse: Represents data in which most entries share a single fill value (NaN by default for floats, 0 for integers). Stores data efficiently by keeping only the values that differ from the fill value.
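
As a rough illustration of what the sparse and categorical types store internally, the sketch below inspects a small array of each kind. The sample values are made up for the example:

```python
import pandas as pd

# Mostly zeros: only the two values that differ from the fill value are stored
sparse_arr = pd.arrays.SparseArray([0, 0, 1, 0, 2])
print(sparse_arr.fill_value)   # the value that is not stored explicitly
print(sparse_arr.sp_values)    # the values that are stored
print(sparse_arr.density)      # fraction of entries that are stored

# Categorical: each distinct label is stored once, rows hold integer codes
dept = pd.Series(["Sales", "Marketing", "Sales", "Sales"], dtype="category")
print(list(dept.cat.categories))
print(list(dept.cat.codes))
```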

In [10]:
# DateTime (Timestamp / datetime64)
datetime_series = pd.Series([pd.to_datetime('2024-04-05'), pd.to_datetime('2024-05-05'), pd.to_datetime('2024-06-05')])
datetime_series

# Timedelta (timedelta64)
timedelta_series = pd.Series([pd.Timedelta(days=8, hours=3, minutes=30), pd.Timedelta(days=4, hours=3, minutes=30), pd.Timedelta(days=2, hours=1, minutes=5)])
timedelta_series

# Categorical (category)
category_series = pd.Series(pd.Categorical(["Sales", "Marketing", "Operations"]))
category_series

# Sparse (Sparse[data types])
sparse_series = pd.Series(pd.arrays.SparseArray([2,4,6,8]))
sparse_series

0    2
1    4
2    6
3    8
dtype: Sparse[int64, 0]

In [11]:
# Check datatypes
int_series.dtype

# Change datatypes of columns
int_series = int_series.astype('float64')
int_series

# Changing to string
int_series = int_series.astype('string')
int_series

0    1.0
1    3.0
2    5.0
3    7.0
dtype: string
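
`astype` raises an error when a value cannot be converted. One common alternative, sketched here with made-up input, is `pd.to_numeric` with `errors='coerce'`, which replaces unparseable entries with NaN:

```python
import pandas as pd

# Hypothetical raw input: numbers stored as text, with one bad entry
raw = pd.Series(['1', '3', 'Test', '7'])

# astype('int64') would raise on 'Test'; errors='coerce' turns it into NaN
numbers = pd.to_numeric(raw, errors='coerce')
print(numbers)
print(numbers.dtype)
```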

**Example: Sales Data Analysis**

You have a dataset of sales transactions that includes the product name, quantity sold, and sale price. 
You want to analyze the data to find the total revenue per product.

In [12]:
data = {
    'Product Name':['A','B','C','A','B','A'],
    'Quantity Sold':[3,2,5,4,1,2],
    'Sale Price':[10,20,10,15,20,15]
}

sales_df = pd.DataFrame(data)
sales_df

Unnamed: 0,Product Name,Quantity Sold,Sale Price
0,A,3,10
1,B,2,20
2,C,5,10
3,A,4,15
4,B,1,20
5,A,2,15


In [13]:
# Getting the Product Name Column
sales_df['Product Name']

# Operations in Pandas
sales_df['Total Revenue'] = sales_df['Quantity Sold'] * sales_df['Sale Price']

# Get overall revenue (the Series .sum() method is the idiomatic choice)
print(sales_df['Total Revenue'].sum())

sales_df

230


Unnamed: 0,Product Name,Quantity Sold,Sale Price,Total Revenue
0,A,3,10,30
1,B,2,20,40
2,C,5,10,50
3,A,4,15,60
4,B,1,20,20
5,A,2,15,30


In [14]:
# Grouping column values (groupby): total revenue per product
results_df = sales_df.groupby('Product Name')[['Total Revenue']].sum()

# Grouping column values (groupby): total quantity per product
results_df['Total Quantity'] = sales_df.groupby('Product Name')['Quantity Sold'].sum()

results_df

Product Name,Total Revenue,Total Quantity
A,120,9
B,60,3
C,50,5
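
The same per-product summary can also be built in a single `groupby` call using named aggregation. This is one possible sketch; the output column names `total_revenue` and `total_quantity` are chosen here for illustration:

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product Name': ['A', 'B', 'C', 'A', 'B', 'A'],
    'Quantity Sold': [3, 2, 5, 4, 1, 2],
    'Sale Price': [10, 20, 10, 15, 20, 15],
})
sales_df['Total Revenue'] = sales_df['Quantity Sold'] * sales_df['Sale Price']

# Named aggregation: new_column=(source_column, aggregation)
summary = sales_df.groupby('Product Name').agg(
    total_revenue=('Total Revenue', 'sum'),
    total_quantity=('Quantity Sold', 'sum'),
)
print(summary)
```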


### **Data Selection**

Pandas provides numerous methods for selecting and indexing data in Series and DataFrames, including label-based indexing with .loc, integer-position based indexing with .iloc, and conditional selection.

In [15]:
# Recall our Sales DataFrame
sales_df

Unnamed: 0,Product Name,Quantity Sold,Sale Price,Total Revenue
0,A,3,10,30
1,B,2,20,40
2,C,5,10,50
3,A,4,15,60
4,B,1,20,20
5,A,2,15,30


In [16]:
# [start:stop:step]
# start is included, stop is excluded, step is the increment (default 1)

# Check the First Two Rows of Product Names
sales_df['Product Name'][0:2]

# Check rows 2 through 4 of Quantity Sold
sales_df['Quantity Sold'][2:5]

# Check every second row of Sale Price
sales_df['Sale Price'][::2]

0    10
2    10
4    20
Name: Sale Price, dtype: int64

In [17]:
# Sum the Quantity Sold for the first 2 rows
sales_df['Quantity Sold'][0:2].sum()

5

In [18]:
"""
Index Location (.iloc)
- Selects rows (and columns) by integer position.
- A slice such as 0:3 returns a DataFrame; a single integer such as 0 returns that row as a Series.
"""

# Select the first 3 rows as a new DataFrame
sales_df.iloc[0:3]

Unnamed: 0,Product Name,Quantity Sold,Sale Price,Total Revenue
0,A,3,10,30
1,B,2,20,40
2,C,5,10,50


In [19]:
"""
Location (.loc)
- Access a group of rows and columns by label(s) or a boolean array.
- Unlike .iloc, a label slice includes its end label.
"""

# Get only specific columns for the rows labeled 0 through 3 (inclusive)
sales_df.loc[0:3, ['Product Name', 'Sale Price']]

Unnamed: 0,Product Name,Sale Price
0,A,10
1,B,20
2,C,10
3,A,15
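
With the default integer index, labels and positions look identical, which can hide the difference between `.loc` and `.iloc`. The following sketch uses a hypothetical frame with string labels to make the contrast visible:

```python
import pandas as pd

# Hypothetical frame with string labels so labels and positions differ
prices = pd.DataFrame(
    {'Sale Price': [10, 20, 10, 15]},
    index=['w', 'x', 'y', 'z'],
)

by_label = prices.loc['w':'y']    # label slice: 'y' is included
by_position = prices.iloc[0:2]    # position slice: position 2 is excluded

print(len(by_label))
print(len(by_position))
```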


In [20]:
# Conditional Filtering

# Check for Total Revenues greater than or equal to 40
sales_df[sales_df['Total Revenue'] >= 40]

# Check for Product Names equal to A
sales_df[sales_df['Product Name'] == "A"]

Unnamed: 0,Product Name,Quantity Sold,Sale Price,Total Revenue
0,A,3,10,30
3,A,4,15,60
5,A,2,15,30
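
Conditions can be combined with `&` (and) and `|` (or). A minimal sketch, rebuilding only the two columns it needs from the sales data; note that each condition must be wrapped in parentheses:

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product Name': ['A', 'B', 'C', 'A', 'B', 'A'],
    'Total Revenue': [30, 40, 50, 60, 20, 30],
})

# AND: product A with revenue of at least 40
a_high = sales_df[(sales_df['Product Name'] == 'A') & (sales_df['Total Revenue'] >= 40)]

# OR: product C, or any row with revenue below 30
c_or_low = sales_df[(sales_df['Product Name'] == 'C') | (sales_df['Total Revenue'] < 30)]

print(a_high)
print(c_or_low)
```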


**Example: Filtering Customer Reviews**

A DataFrame contains customer reviews for different products, including a numeric rating. You need to filter the reviews to find all those with a rating of 4 or higher.

In [21]:
reviews_data = {
    'ProductID': ['P1','P2','P3','P4','P5','P6','P7','P8','P9','P10'],
    'Rating': [5,3,2,1,4,3,2,4,6,1]
}

reviews_df = pd.DataFrame(reviews_data)

# Ratings of Products that are 4 or higher
reviews_df[reviews_df['Rating'] >= 4]

Unnamed: 0,ProductID,Rating
0,P1,5
4,P5,4
7,P8,4
8,P9,6
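
The filter can be narrowed to particular products with `isin`, and `between` can flag values outside an expected range. This sketch makes two assumptions that are not stated above: the `shortlist` of product IDs is hypothetical, and ratings are assumed to be on a 1-5 scale.

```python
import pandas as pd

reviews_df = pd.DataFrame({
    'ProductID': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10'],
    'Rating': [5, 3, 2, 1, 4, 3, 2, 4, 6, 1],
})

# Restrict to a hypothetical shortlist of products, then apply the rating filter
shortlist = ['P1', 'P5', 'P9']
picked = reviews_df[reviews_df['ProductID'].isin(shortlist) & (reviews_df['Rating'] >= 4)]

# Assumption: ratings are on a 1-5 scale, so anything outside it is suspect
out_of_range = reviews_df[~reviews_df['Rating'].between(1, 5)]

print(picked)
print(out_of_range)
```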


## Pandas Operators

Data Loading and Exploration:

- head(): Shows the first few rows of a DataFrame
- tail(): Shows the last few rows of a DataFrame
- describe(): Generates summary statistics for each column (mean, standard deviation, etc.)
- info(): Displays information about the DataFrame, including data types and memory usage

Data Analysis:

- sum(): Calculates the sum of a Series or DataFrame
- mean(): Calculates the mean of a Series or DataFrame
- median(): Calculates the median of a Series or DataFrame
- std(): Calculates the standard deviation of a Series or DataFrame
- var(): Calculates the variance of a Series or DataFrame

In [22]:
# head(n)
reviews_df.head() # First 5 rows by default
reviews_df.head(3) # First 3 rows

# tail(n)
reviews_df.tail() # Last 5 rows by default
reviews_df.tail(3) # Last 3 rows

# describe() - count, mean, std, min, 25%, 50%, 75%, max
reviews_df.describe()

# info()
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ProductID  10 non-null     object
 1   Rating     10 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 292.0+ bytes


In [23]:
# Sum
reviews_df['Rating'].sum()

# Mean
reviews_df['Rating'].mean()

# Median
reviews_df['Rating'].median()

# Standard Deviation (typical distance of the ratings from their mean)
reviews_df['Rating'].std()

# Variance (average squared distance from the mean; the square of the standard deviation)
reviews_df['Rating'].var()

2.766666666666667
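
By default, pandas computes the sample variance and standard deviation (dividing by n - 1, i.e. `ddof=1`). The sketch below, using the same ratings, compares this with the population variance and checks that the standard deviation is the square root of the variance:

```python
import math
import pandas as pd

ratings = pd.Series([5, 3, 2, 1, 4, 3, 2, 4, 6, 1])

sample_var = ratings.var()             # ddof=1 by default (divides by n - 1)
population_var = ratings.var(ddof=0)   # divides by n

print(sample_var)
print(population_var)
print(math.isclose(ratings.std(), math.sqrt(sample_var)))
```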

### **Importing and Exporting Data**

Pandas supports reading from and writing to a variety of file formats, including CSV, Excel, and SQL databases, making it easy to integrate with data analysis workflows.

In [24]:
# Turns CSV into a DataFrame
data = pd.read_csv('example.csv')
data

Unnamed: 0,A,B,C
0,1.0,5.0,10.0
1,2.0,6.5,11.0
2,2.333333,6.5,12.0
3,4.0,8.0,11.0


In [25]:
# openpyxl is required to export to Excel (.xlsx)
%pip install openpyxl

Note: you may need to restart the kernel to use updated packages.





In [26]:
# Export DataFrame to an Excel file
data.to_excel('exported_data.xlsx', sheet_name="Example Sheet", index=False)

In [27]:
# Export DataFrame to a CSV file
data.to_csv('exported_data.csv', index=False)
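
To see what `index=False` changes, the following sketch writes a small made-up DataFrame to CSV text in memory (using `io.StringIO` instead of a file on disk) and reads it back both ways:

```python
import io
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [5.0, 6.5]})

# With the default index=True, the index is written as an unnamed first column
with_index = df.to_csv()
# index=False writes only the data columns
without_index = df.to_csv(index=False)

# Reading back: the unnamed index column reappears as 'Unnamed: 0'
cols_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```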

**Example: Importing Sales Data and Exporting Analysis**

Assume you have a CSV file, sales.csv, containing sales data. You need to import this data, calculate the total sales, and export the summary to a new file.

In [28]:
# Importing Sales into the notebook
sales_df = pd.read_csv('sales.csv')
sales_df

Unnamed: 0,Product Name,Quantity Sold,Sale Price
0,Product A,3,10
1,Product B,2,20
2,Product A,5,10
3,Product C,4,15
4,Product B,1,20
5,Product C,2,15


In [30]:
# Get the total revenue of the store

# Multiply the sale price by the quantity sold
sales_df['Total Revenue'] = sales_df['Quantity Sold'] * sales_df['Sale Price']
sales_df

# Adding all totals together
total_revenue = sales_df['Total Revenue'].sum()
total_revenue

230

In [32]:
# Create a DataFrame for Total Sales
summary_df = pd.DataFrame([{'Total Sales': total_revenue}])
summary_df

# Export to CSV file
summary_df.to_csv('sales_summary.csv', index=False)

## Apply

The apply function in pandas runs a custom function over your data. On a Series, apply calls the function on each element; on a DataFrame, it calls the function on each column (or on each row with axis=1). The result is a new Series or DataFrame.
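
The difference between column-wise and row-wise application on a DataFrame can be sketched as follows. The `scores` data is hypothetical:

```python
import pandas as pd

scores = pd.DataFrame({'math': [80, 90], 'art': [70, 60]})  # hypothetical data

# axis=0 (default): the function receives each column as a Series
col_max = scores.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series
row_total = scores.apply(lambda row: row['math'] + row['art'], axis=1)

print(col_max)
print(row_total)
```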

In [33]:
data = {
    "Age": [25, 30, 22],
    "Name": ["Jeff", "John", "Joey"]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Age,Name
0,25,Jeff
1,30,John
2,22,Joey


In [34]:
def square(x):
    return x * x

df["AgeSquared"] = df["Age"].apply(square)
df

Unnamed: 0,Age,Name,AgeSquared
0,25,Jeff,625
1,30,John,900
2,22,Joey,484


In [35]:
def add_prefix(name):
    return f"Mr. {name}"

df["Full Name"] = df["Name"].apply(add_prefix)
df

Unnamed: 0,Age,Name,AgeSquared,Full Name
0,25,Jeff,625,Mr. Jeff
1,30,John,900,Mr. John
2,22,Joey,484,Mr. Joey


In [36]:
def keep_young(age):
    return age < 30

# Conditional Filtering
df_filtered = df[df["Age"].apply(keep_young)]
df_filtered

Unnamed: 0,Age,Name,AgeSquared,Full Name
0,25,Jeff,625,Mr. Jeff
2,22,Joey,484,Mr. Joey
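
For simple arithmetic and comparisons like these, the same results can usually be obtained without `apply` by operating on the whole column at once, which tends to be faster on large data. A minimal sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 22], "Name": ["Jeff", "John", "Joey"]})

# Same results as apply(square) and apply(keep_young), computed on whole columns
df["AgeSquared"] = df["Age"] ** 2
df_filtered = df[df["Age"] < 30]

print(df_filtered)
```

`apply` remains useful when the logic cannot be expressed with built-in column operations.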
