# 01 - Pandas Introduction

## Introduction

Pandas is one of the most widely used Python libraries for data manipulation and analysis. It provides tools for working with structured (tabular) data, much as Excel or SQL do, but with the flexibility of a general-purpose programming language.

## What You'll Learn

- What is pandas?
- Understanding Series (1-dimensional data)
- Understanding DataFrames (2-dimensional data)
- Creating DataFrames
- Basic DataFrame operations
- Viewing and inspecting data


## What is Pandas?

**Pandas** takes its name from the term "panel data" and is built on top of NumPy. It provides:

- **Series**: 1-dimensional labeled array (like a column in Excel)
- **DataFrame**: 2-dimensional labeled data structure (like a table in SQL or Excel)
- Powerful data manipulation tools
- Easy integration with various data sources

**Why Pandas for Data Engineering?**
- Works with structured data (tables)
- Similar to SQL operations (SELECT, WHERE, GROUP BY, JOIN)
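- Represents missing data explicitly (as `NaN`) and provides tools to detect, drop, or fill it
- Runs column-wise operations efficiently, because they are vectorized on top of NumPy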
- Easy to read and write various file formats
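
To make the SQL comparison above concrete, here is a minimal sketch of how SELECT/WHERE, GROUP BY, and JOIN might map onto pandas calls. The `employees` and `departments` tables and their column names are hypothetical, invented only for this illustration.

```python
import pandas as pd

# Hypothetical sample tables (not part of the tutorial's data)
employees = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "dept_id": [1, 2, 1, 2],
    "salary": [50000, 60000, 70000, 55000],
})
departments = pd.DataFrame({
    "dept_id": [1, 2],
    "dept_name": ["Engineering", "Sales"],
})

# SELECT name, salary FROM employees WHERE salary > 52000
high_paid = employees.loc[employees["salary"] > 52000, ["name", "salary"]]

# SELECT dept_id, AVG(salary) FROM employees GROUP BY dept_id
avg_salary = employees.groupby("dept_id")["salary"].mean()

# SELECT * FROM employees JOIN departments USING (dept_id)
joined = employees.merge(departments, on="dept_id", how="inner")

print(high_paid)
print(avg_salary)
print(joined)
```

Each result is itself a DataFrame or Series, so these steps can be chained in the same way SQL clauses are combined in one query.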


## Importing Pandas

The standard convention is to import pandas as `pd`.


In [2]:
# Optional environment setup (run these in a terminal, not in the notebook):
# source .venv/bin/activate
# python -m pip install ipykernel
# python -m ipykernel install --user --name dev_de_tr --display-name "Python (dev_de_tr)"

In [2]:
import pandas as pd

# Check pandas version
print(f"Pandas version: {pd.__version__}")
print("Pandas imported successfully!")


Pandas version: 2.3.3
Pandas imported successfully!


## Understanding Series

A **Series** is a one-dimensional labeled array. Think of it as a single column of data with an index.

**Key characteristics:**
- Has an index (row labels)
- Has data (values)
- Behaves like a Python list (ordered values) and like a dictionary (lookup by label)


In [3]:
# Creating a Series from a list
ages = pd.Series([25, 30, 35, 28, 32])
print("Series from list:")
print(ages)
print(f"\nType: {type(ages)}")


Series from list:
0    25
1    30
2    35
3    28
4    32
dtype: int64

Type: <class 'pandas.core.series.Series'>


In [4]:
# Creating a Series with custom index
ages = pd.Series([25, 30, 35, 28, 32], index=['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'])
print("Series with custom index:")
print(ages)
print(f"\nAlice's age: {ages['Alice']}")


Series with custom index:
Alice      25
Bob        30
Charlie    35
Diana      28
Eve        32
dtype: int64

Alice's age: 25
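
With a custom index, a value can be looked up either by its label or by its position. The sketch below uses the `.loc` (label-based) and `.iloc` (position-based) accessors on the same Series as the cell above. The variable names are chosen only for illustration.

```python
import pandas as pd

ages = pd.Series(
    [25, 30, 35, 28, 32],
    index=["Alice", "Bob", "Charlie", "Diana", "Eve"],
)

by_label = ages.loc["Charlie"]   # label-based lookup -> 35
by_position = ages.iloc[2]       # position-based lookup (third element) -> 35

# Label slices include both endpoints; position slices exclude the stop
label_slice = ages.loc["Bob":"Diana"]   # Bob, Charlie, Diana
position_slice = ages.iloc[1:3]         # Bob, Charlie

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a key is meant as a label or as a position.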


In [6]:
# Creating a Series from a dictionary
scores = pd.Series({'Math': 85, 'Science': 90, 'English': 88})
print("Series from dictionary:")
print(scores)
print(f"\nMarks in English: {scores['English']}")


Series from dictionary:
Math       85
Science    90
English    88
dtype: int64

Marks in English: 88
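
A Series differs from a plain list or dictionary in two ways: arithmetic applies element-wise, and two Series are aligned on their labels before an operation. The following is a minimal sketch. The `bonus` Series is hypothetical and exists only to show alignment.

```python
import pandas as pd

scores = pd.Series({"Math": 85, "Science": 90, "English": 88})

# The two parts of a Series: its labels and its values
labels = scores.index.tolist()   # ['Math', 'Science', 'English']
values = scores.to_numpy()       # NumPy array of the values

# Arithmetic is applied to every element at once
curved = scores + 5

# Hypothetical second Series that lacks a 'Science' entry
bonus = pd.Series({"Math": 2, "English": 3})

# Values are matched by label; a label missing on either side yields NaN
total = scores + bonus

print(curved)
print(total)
```

Because one value in `total` is missing, pandas stores the result as floating point rather than as integers.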


## Understanding DataFrames

A **DataFrame** is a two-dimensional labeled data structure with columns of potentially different types. Think of it as:
- A table in SQL
- A spreadsheet in Excel
- A collection of Series (columns)

**Key characteristics:**
- Has rows (index)
- Has columns (column names)
- Each column can have different data types


In [7]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Tokyo', 'Paris', 'Sydney'],
    'Salary': [50000, 60000, 70000, 55000, 65000]
}

df = pd.DataFrame(data)
print("DataFrame from dictionary:")
print(df)
print(f"\nType: {type(df)}")


DataFrame from dictionary:
      Name  Age      City  Salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Tokyo   70000
3    Diana   28     Paris   55000
4      Eve   32    Sydney   65000

Type: <class 'pandas.core.frame.DataFrame'>
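
The idea that a DataFrame is a collection of Series can be sketched by building one directly from Series objects and then pulling a column back out. The names `name_col`, `age_col`, and `people` are illustrative only.

```python
import pandas as pd

# Two Series that share the default index 0, 1, 2
name_col = pd.Series(["Alice", "Bob", "Charlie"])
age_col = pd.Series([25, 30, 35])

# Each dictionary key becomes a column name; each Series becomes a column
people = pd.DataFrame({"Name": name_col, "Age": age_col})

# Selecting one column returns a Series that shares the DataFrame's index
extracted = people["Age"]

print(people)
print(type(extracted))
print(extracted.index.equals(people.index))
```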


In [None]:
# Creating a DataFrame from a list of lists
data_list = [
    ['Alice', 25, 'New York', 50000],
    ['Bob', 30, 'London', 60000],
    ['Charlie', 35, 'Tokyo', 70000],
    ['Diana', 28, 'Paris', 55000],
    ['Eve', 32, 'Sydney', 65000]
]

df2 = pd.DataFrame(data_list, columns=['Name', 'Age', 'City', 'Salary'])
print("DataFrame from list of lists:")
print(df2)


DataFrame from list of lists:
      Name  Age      City  Salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Tokyo   70000
3    Diana   28     Paris   55000
4      Eve   32    Sydney   65000
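
Another common way to create a DataFrame is from a list of dictionaries, one per row. This is a sketch with hypothetical records. It also shows that a key missing from one record becomes a missing value rather than raising an error.

```python
import pandas as pd

# Hypothetical records; the second one has no "City" key
records = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30},
]

df_records = pd.DataFrame(records)

print(df_records)
print(df_records.isna().sum())  # count of missing values per column
```

Column names are taken from the dictionary keys, so no `columns` argument is needed here.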


## Basic DataFrame Operations

Let's explore some essential operations for inspecting and understanding DataFrames.


In [9]:
# Create a sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Tokyo', 'Paris', 'Sydney'],
    'Salary': [50000, 60000, 70000, 55000, 65000]
})

# View first few rows (default: 5 rows)
print("First 5 rows:")
print(df.head())


First 5 rows:
      Name  Age      City  Salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Tokyo   70000
3    Diana   28     Paris   55000
4      Eve   32    Sydney   65000


In [10]:
# View last few rows
print("Last 3 rows:")
print(df.tail(3))


Last 3 rows:
      Name  Age    City  Salary
2  Charlie   35   Tokyo   70000
3    Diana   28   Paris   55000
4      Eve   32  Sydney   65000


In [11]:
# Get DataFrame shape (rows, columns)
print(f"Shape: {df.shape}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")


Shape: (5, 4)
Number of rows: 5
Number of columns: 4


In [12]:
# Get column names
print("Column names:")
print(df.columns)
print(f"\nColumn names as list: {df.columns.tolist()}")


Column names:
Index(['Name', 'Age', 'City', 'Salary'], dtype='object')

Column names as list: ['Name', 'Age', 'City', 'Salary']


In [13]:
# Get data types of each column
print("Data types:")
print(df.dtypes)


Data types:
Name      object
Age        int64
City      object
Salary     int64
dtype: object


In [14]:
# Get information about the DataFrame
print("DataFrame info:")
df.info()


DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Age     5 non-null      int64 
 2   City    5 non-null      object
 3   Salary  5 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes


In [15]:
# Get a statistical summary (by default, describe() covers only numeric columns)
print("Statistical summary:")
print(df.describe())


Statistical summary:
             Age       Salary
count   5.000000      5.00000
mean   30.000000  60000.00000
std     3.807887   7905.69415
min    25.000000  50000.00000
25%    28.000000  55000.00000
50%    30.000000  60000.00000
75%    32.000000  65000.00000
max    35.000000  70000.00000
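
The `std` row in the summary is the sample standard deviation (dividing by n - 1), not the population value. The sketch below recomputes it by hand for the `Age` column. The variable names are illustrative.

```python
import math

import pandas as pd

age_values = pd.Series([25, 30, 35, 28, 32])

mean = age_values.sum() / len(age_values)           # 30.0
squared_dev = ((age_values - mean) ** 2).sum()      # 25 + 0 + 25 + 4 + 4 = 58.0

# Sample standard deviation divides by n - 1 (the pandas default, ddof=1)
sample_std = math.sqrt(squared_dev / (len(age_values) - 1))

# Population standard deviation divides by n (ddof=0)
population_std = math.sqrt(squared_dev / len(age_values))

print(f"By hand (n - 1): {sample_std:.6f}")
print(f"pandas std():    {age_values.std():.6f}")
print(f"Population (n):  {population_std:.6f}")
```

Passing `ddof=0` to `std()` gives the population value when that is what the analysis calls for.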


## Accessing Columns

You can access a DataFrame column with bracket notation, like a dictionary key, or with dot notation, as an attribute.


In [16]:
# Access a single column (returns a Series)
print("Accessing 'Name' column:")
print(df['Name'])
print(f"\nType: {type(df['Name'])}")


Accessing 'Name' column:
0      Alice
1        Bob
2    Charlie
3      Diana
4        Eve
Name: Name, dtype: object

Type: <class 'pandas.core.series.Series'>


In [17]:
# Access multiple columns (returns a DataFrame)
print("Accessing multiple columns:")
print(df[['Name', 'Age']])
print(f"\nType: {type(df[['Name', 'Age']])}")


Accessing multiple columns:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    Diana   28
4      Eve   32

Type: <class 'pandas.core.frame.DataFrame'>


In [18]:
# Access column as attribute (only works if the column name is a valid Python
# identifier and does not clash with an existing DataFrame attribute or method)
print("Accessing column as attribute:")
print(df.Age)


Accessing column as attribute:
0    25
1    30
2    35
3    28
4    32
Name: Age, dtype: int64
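
Attribute access has limits that bracket access does not. The sketch below uses a hypothetical DataFrame whose column names are chosen to trigger them: one name clashes with a DataFrame method and another contains a space.

```python
import pandas as pd

# Hypothetical columns chosen to show where attribute access falls short
df_cols = pd.DataFrame({
    "Age": [25, 30],
    "count": [1, 2],                   # same name as the DataFrame.count method
    "Home City": ["Paris", "Tokyo"],   # not a valid Python identifier
})

print(df_cols.Age.tolist())        # attribute access works for 'Age'
print(callable(df_cols.count))     # True: this is the method, not the column
print(df_cols["count"].tolist())   # bracket access returns the column
print(df_cols["Home City"].tolist())
```

Bracket notation works for every column name, which makes it the safer default in reusable code.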
