# Introduction

Pandas is a highly versatile Python library for data manipulation and analysis, built on top of the NumPy package. Its core data structures, __Series__ and __DataFrame__, are specifically designed to handle structured data.

A DataFrame organizes data into rows (of observations) and columns (of variables), making it ideal for storing and processing tabular data. Pandas’s efficiency and flexibility have made it an essential tool for researchers and physicists like yourself 😁

## What is a DataFrame?

A DataFrame is a two-dimensional, labeled data structure in Pandas, with columns that can hold data of different types. It’s similar to a spreadsheet. As one of the most commonly used Pandas objects, a DataFrame organizes data into:
- __Rows__: Representing individual records or observations.
- __Columns__: Representing variables or attributes.

### Creating a DataFrame

A DataFrame can be created from various data types, including lists, dictionaries, NumPy arrays, or external sources such as CSV and Excel files.

In [2]:
# Creating a DataFrame from a Dictionary

import pandas as pd

# Sample data
data = {
    "Time (s)": [0, 1, 2, 3],
    "Position (m)": [0, 5, 20, 45],
    "Velocity (m/s)": [0, 5, 10, 15]
}

# Create a DataFrame
df = pd.DataFrame(data)

print(df)

   Time (s)  Position (m)  Velocity (m/s)
0         0             0               0
1         1             5               5
2         2            20              10
3         3            45              15
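
The same table can also be built from a list of rows or from a NumPy array, as mentioned above. The following is a minimal sketch that reuses the illustrative values from the dictionary example; the names `rows`, `column_names`, `df_from_list`, and `df_from_array` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Each inner list is one row: (time, position, velocity)
rows = [
    [0, 0, 0],
    [1, 5, 5],
    [2, 20, 10],
    [3, 45, 15],
]
column_names = ["Time (s)", "Position (m)", "Velocity (m/s)"]

# From a list of rows: column labels must be supplied separately
df_from_list = pd.DataFrame(rows, columns=column_names)

# From a 2-D NumPy array: same idea, the array provides the values
df_from_array = pd.DataFrame(np.array(rows), columns=column_names)

print(df_from_list)
print(df_from_array)
```

Unlike a dictionary, a list of rows or an array carries no column labels, so they are passed through the `columns` argument.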


In [14]:
# Loading a DataFrame from a CSV File

import pandas as pd

# Load data from a CSV file
df = pd.read_csv("data.csv")
# print(df) # Display the DataFrame
# print(df.head()) # Display the first 5 rows
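
If you do not have `data.csv` at hand, the loading step can be tried on CSV text held in memory. This is a sketch under the assumption that the file has a header row with `Time` and `Position` columns; the contents of `csv_text` are made up for illustration and are not the real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Time,Position
0,0
1,5
2,20
"""

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.head())    # first rows (at most 5)
print(df_csv.dtypes)    # column types inferred while parsing
```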

## Inspecting DataFrames

Pandas provides several methods and attributes to inspect the contents of a DataFrame, including:
- `head()`: Returns the first `n` rows of a DataFrame (5 by default), e.g., `df.head()` or `df.head(3)`.
- `tail()`: Returns the last `n` rows of a DataFrame (5 by default).
- `info()`: Prints a concise summary of the DataFrame, including each column's data type and non-null count (which reveals missing values).
- `describe()`: Returns summary statistics (count, mean, standard deviation, minimum, quartiles, maximum), by default for the numeric columns.
- `shape`: An attribute holding the dimensions of the DataFrame as a tuple `(rows, columns)`.
- `columns`: An attribute holding the column labels of the DataFrame.
- `index`: An attribute holding the row labels of the DataFrame.

🐍 Try it out for yourself using the sample data.csv.
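
As a starting point, the inspection tools listed above can be sketched on a small in-memory DataFrame. The name `sample` and its values are a hypothetical stand-in for `data.csv`, so your own output may differ.

```python
import pandas as pd

# Hypothetical stand-in for the contents of data.csv
sample = pd.DataFrame({
    "Time": [0, 1, 2, 3, 4, 5],
    "Position": [0, 5, 20, 45, 80, 100],
})

print(sample.head(3))        # first 3 rows
print(sample.tail(2))        # last 2 rows
print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column labels
print(sample.index)          # row labels

sample.info()                # prints dtypes and non-null counts directly
print(sample.describe())     # summary statistics for numeric columns
```

Note that `shape`, `columns`, and `index` are written without parentheses because they are attributes, while `head()`, `tail()`, `info()`, and `describe()` are method calls.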

## Basic DataFrame Operations

Pandas provides a wide range of operations to manipulate and analyze DataFrames. These operations are invaluable for data cleaning, transformation, and analysis.

An example scenario you might find yourself in as a junior researcher is analyzing experimental results. For instance, imagine you have a dataset containing time-series data from a physics experiment, such as measurements of position, velocity, and acceleration over time. You might need to clean the data by handling missing values, filter rows based on specific conditions (e.g., velocity exceeding a threshold), calculate metrics like energy or force, and visualize trends to interpret your results.

Pandas simplifies these tasks, enabling you to focus on extracting insights from your data rather than writing complex code to manipulate it.
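
To make the scenario concrete, here is a minimal sketch of the clean, calculate, and filter steps. Everything in it is assumed for illustration: the measurement values, the `MASS_KG` constant, the 10 m/s threshold, and the column names are not taken from a real experiment.

```python
import numpy as np
import pandas as pd

MASS_KG = 2.0  # hypothetical mass of the moving object

# Made-up measurements with one missing velocity reading
measurements = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0, 3.0, 4.0],
    "Velocity": [0.0, 5.0, np.nan, 15.0, 20.0],
})

# 1. Clean: drop rows where the velocity is missing
clean = measurements.dropna(subset=["Velocity"]).copy()

# 2. Calculate: kinetic energy E = 1/2 * m * v^2, computed column-wise
clean["Kinetic energy"] = 0.5 * MASS_KG * clean["Velocity"] ** 2

# 3. Filter: keep rows where the velocity exceeds the threshold
fast = clean[clean["Velocity"] > 10.0]

print(fast)
```

Dropping rows is only one way to handle missing values; depending on the experiment, filling or interpolating them may be more appropriate.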

### Slicing and Indexing

You've already encountered slicing and indexing in their general form in Python. Pandas extends these concepts to DataFrames, allowing you to select specific rows and columns based on labels or positions.

__Accessing Rows and Columns__:

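- __Columns__: You can access a column by key, e.g., `df['column_name']`, or as an attribute, e.g., `df.column_name`. Attribute access only works when the label is a valid Python identifier (so not for a label like `Time (s)`) and does not clash with an existing DataFrame attribute or method, so key access is the safer default.
- __Rows__: You can access rows using the `loc[]` and `iloc[]` indexers. `loc[]` is label-based (you specify the row label from the index), while `iloc[]` is integer position-based (you specify the zero-based row position). With the default integer index, labels and positions coincide, so both return the same row.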

In [17]:
# Access a column

print(df["Position"])

# Access a specific row
print(df.loc[2]) # Access row with label 2
print(df.iloc[2]) # Access the third row by position

0      0
1      5
2     20
3     45
4     80
5    100
Name: Position, dtype: int64
Time         2
Position    20
Name: 2, dtype: int64
Time         2
Position    20
Name: 2, dtype: int64
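
The two calls above return the same row only because the default index labels happen to equal the positions. The difference becomes visible with a custom index, as in this sketch; the DataFrame `runs` and its labels 10, 20, 30 are invented for illustration.

```python
import pandas as pd

# Hypothetical data whose row labels differ from the row positions
runs = pd.DataFrame(
    {"Time": [0, 1, 2], "Position": [0, 5, 20]},
    index=[10, 20, 30],
)

by_label = runs.loc[20]      # row whose label is 20
by_position = runs.iloc[1]   # second row (position 1), the same row here

print(by_label)
print(by_position)

# Label 2 does not exist in this index, so loc raises a KeyError
try:
    runs.loc[2]
    label_found = True
except KeyError:
    label_found = False

print(label_found)
print(runs.iloc[2])          # position 2 exists: the row labeled 30
```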


__Slicing Rows and Columns:__

You can slice rows and columns in a DataFrame using the `loc[]` and `iloc[]` indexers:
- _Label-based slicing_: Use `df.loc[row_labels, column_labels]` to select specific rows and columns based on their labels.
_Example_: `df.loc[1:3, 'column_name']` selects the rows labeled 1 through 3 for the specified column. Label slices include both endpoints.

- _Position-based slicing_: Use `df.iloc[row_positions, column_positions]` to slice rows and columns based on their integer positions. As with regular Python slicing, the end position is excluded, so `df.iloc[0:2, 0:2]` selects the first two rows and the first two columns.

This approach provides flexibility for accessing subsets of data efficiently.

In [19]:
# Slice rows and columns
print(df.loc[1:3, ["Time", "Position"]])
print(df.iloc[0:2, 0:2])

   Time  Position
1     1         5
2     2        20
3     3        45
   Time  Position
0     0         0
1     1         5
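
The output above shows three rows for `loc[1:3]` but two rows for `iloc[0:2]`. A short sketch of the endpoint rule, using a hypothetical DataFrame `df_demo` with the same column names:

```python
import pandas as pd

# Hypothetical stand-in data with a default integer index
df_demo = pd.DataFrame({
    "Time": [0, 1, 2, 3, 4, 5],
    "Position": [0, 5, 20, 45, 80, 100],
})

# loc slices by label and INCLUDES the end label: labels 1, 2, 3
label_slice = df_demo.loc[1:3, "Position"]

# iloc slices by position and EXCLUDES the end position: positions 1, 2
position_slice = df_demo.iloc[1:3, 1]

print(label_slice.tolist())
print(position_slice.tolist())
```

The same `1:3` therefore yields three rows with `loc[]` and two rows with `iloc[]`.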


### Filtering Data

Filtering data is a common task in data analysis. You can filter rows based on specific conditions using boolean indexing. A comparison such as `df["Position"] > 20` produces a boolean Series with one True or False value per row, indicating whether that row meets the condition. Passing this Series to `df[...]` keeps the rows marked True and excludes the others.

In [26]:
# Filter rows where Position > 20

# The boolean condition returns a Series of True and False values
condition = df["Position"] > 20
filtered_data = df[condition]

# Alternatively, you can combine the two steps into one line of code
# filtered_data = df[df["Position"] > 20]

print(filtered_data)

   Time  Position
3     3        45
4     4        80
5     5       100
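
Boolean masks can also be combined. The sketch below assumes a hypothetical DataFrame `df_demo` with the same columns as above and uses the element-wise operators `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical stand-in data
df_demo = pd.DataFrame({
    "Time": [0, 1, 2, 3, 4, 5],
    "Position": [0, 5, 20, 45, 80, 100],
})

# Rows where BOTH conditions hold
mask = (df_demo["Position"] > 20) & (df_demo["Time"] < 5)
in_window = df_demo[mask]

# Rows where the combined condition does NOT hold
outside = df_demo[~mask]

# Rows where EITHER condition holds
either = df_demo[(df_demo["Time"] == 0) | (df_demo["Position"] >= 80)]

print(in_window)
print(outside)
print(either)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series and raise a `ValueError` here, which is why the symbol operators are used.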
