# Python Pandas: From Beginner to Advanced

[Pandas](https://pandas.pydata.org/) is a powerful, open-source library for data manipulation and analysis in Python. Built on top of [NumPy](https://numpy.org/), it provides intuitive data structures and functions to handle structured data efficiently. Whether you're cleaning data, performing complex analyses, or preparing inputs for machine learning models, Pandas is an essential tool for any data scientist or analyst.

This guide takes you from the basics to advanced features of Pandas, with practical examples and real-world applications. By the end, you'll be equipped to tackle data challenges confidently.

---

## **Table of Contents**
1. [What is Pandas?](#what-is-pandas)
2. [Getting Started](#getting-started)
   - [Installation](#installation)
   - [Basic Data Structures](#basic-data-structures)
3. [Beginner Level](#beginner-level)
   - [Reading and Writing Data](#reading-and-writing-data)
   - [Basic Operations](#basic-operations)
4. [Intermediate Level](#intermediate-level)
   - [Indexing and Selection](#indexing-and-selection)
   - [Handling Missing Data](#handling-missing-data)
   - [GroupBy and Aggregation](#groupby-and-aggregation)
   - [Merging and Joining](#merging-and-joining)
5. [Advanced Level](#advanced-level)
   - [Pivot Tables](#pivot-tables)
   - [Applying Functions](#applying-functions)
   - [Time Series](#time-series)
   - [MultiIndex](#multiindex)
   - [Performance Optimization](#performance-optimization)
6. [Real-World Example: Sales Data Analysis](#real-world-example-sales-data-analysis)
7. [Connections to Other Topics](#connections-to-other-topics)
8. [Additional Resources](#additional-resources)

---

## **What is Pandas?**
Pandas is designed for working with structured data (e.g., tables, time series). Its key features include:
- **DataFrames**: 2D tables with labeled rows and columns.
- **Series**: 1D labeled arrays.
- Vectorized, column-oriented operations that efficiently handle datasets that fit in memory.
- Seamless integration with other libraries like NumPy, Matplotlib, and Scikit-learn.

It’s particularly useful for:
- Reading/writing data from various formats (CSV, Excel, JSON, SQL).
- Cleaning and transforming data.
- Analyzing and summarizing data.

---

## **Getting Started**

### **Installation**
Install Pandas using pip:
```bash
pip install pandas
```
Import it in your Python script:
```python
import pandas as pd
import numpy as np  # Often used alongside Pandas
```

### **Basic Data Structures**
#### **Series**
A Series is a one-dimensional array with labels (an index).
```python
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)
# Output:
# a    10
# b    20
# c    30
# d    40
# dtype: int64
```
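As a brief sketch of what the labels make possible (the variable names below are illustrative), a Series can be read by label or by position, and arithmetic and boolean filters keep the index attached:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s['b']        # look up a value by its index label
by_position = s.iloc[0]  # look up a value by its integer position
doubled = s * 2          # arithmetic is applied element-wise; labels are kept
large = s[s > 15]        # boolean mask keeps only the matching labels

print(by_label, by_position)
print(doubled)
print(large)
```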

#### **DataFrame**
A DataFrame is a two-dimensional table with rows and columns.
```python
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)
# Output:
#       Name  Age    City
# 0    Alice   25  New York
# 1      Bob   30   London
# 2  Charlie   35    Paris
```
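A quick way to check what was built, sketched here with the same sample data: `shape`, `columns`, and `dtypes` describe the table without printing every row.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

n_rows, n_cols = df.shape        # (number of rows, number of columns)
column_names = list(df.columns)  # column labels, in order

print(n_rows, n_cols)
print(column_names)
print(df.dtypes)                 # one dtype per column
```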

---

## **Beginner Level**

### **Reading and Writing Data**
Pandas supports multiple file formats:
```python
# Read CSV
df = pd.read_csv('data.csv')
# Write to CSV
df.to_csv('output.csv', index=False)
# Other formats
df_excel = pd.read_excel('data.xlsx')
df_json = pd.read_json('data.json')
```
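The file names above are placeholders. To try the read/write round trip without any files on disk, the sketch below substitutes in-memory `io.StringIO` buffers for the CSV files (an assumption made only so the example runs on its own) and shows what `index=False` changes:

```python
import io
import pandas as pd

# In-memory text stands in for 'data.csv'
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,London\n"
df = pd.read_csv(io.StringIO(csv_text))

# Write without the row index, then read the result back
without_index = io.StringIO()
df.to_csv(without_index, index=False)
reloaded = pd.read_csv(io.StringIO(without_index.getvalue()))

# With the default index=True, the row labels become an extra unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)

print(reloaded.equals(df))
print(with_index.getvalue().splitlines()[0])
```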

### **Basic Operations**
- **View Data**:
  ```python
  print(df.head(2))  # First 2 rows
  print(df.tail(1))  # Last row
  df.info()          # Data types and non-null counts (prints directly and returns None)
  print(df.describe())  # Summary statistics
  ```
- **Select Columns**:
  ```python
  print(df['Name'])  # Series
  print(df[['Name', 'Age']])  # DataFrame
  ```
- **Filter Rows**:
  ```python
  print(df[df['Age'] > 30])  # Rows where Age > 30
  ```
- **Add/Drop Columns**:
  ```python
  df['Salary'] = [50000, 60000, 70000]
  df = df.drop('Salary', axis=1)
  ```
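Row filters can also be combined. In this minimal sketch on the same sample data, each condition sits in its own parentheses because `&` and `|` bind more tightly than comparison operators; `isin` is a convenient shortcut for "matches any of these values":

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris'],
})

# AND: older than 25 and not based in London
older_not_london = df[(df['Age'] > 25) & (df['City'] != 'London')]

# Membership test: City is one of the listed values
in_europe = df[df['City'].isin(['London', 'Paris'])]

print(older_not_london)
print(in_europe)
```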

---

## **Intermediate Level**

### **Indexing and Selection**
- **loc** (label-based):
  ```python
  print(df.loc[0, 'Name'])  # 'Alice'
  print(df.loc[0:1, ['Name', 'City']])  # Rows labeled 0 through 1 (the end label is included)
  ```
- **iloc** (integer-based):
  ```python
  print(df.iloc[0, 0])  # 'Alice'
  print(df.iloc[0:2, 0:2])  # First 2 rows and 2 columns (the end position is excluded)
  ```
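The difference between the two is easiest to see when the index is not the default 0, 1, 2. In this sketch the string index is a hypothetical choice made for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']},
    index=['alice', 'bob', 'charlie'],
)

by_label = df.loc['alice':'bob']  # label slice: both endpoints are included
by_position = df.iloc[0:1]        # position slice: the end position is excluded
single_value = df.loc['charlie', 'City']

print(by_label)
print(by_position)
print(single_value)
```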

### **Handling Missing Data**
```python
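df['Age'] = df['Age'].astype('float64')  # Float columns can hold NaN
df.loc[1, 'Age'] = np.nan                # Introduce a missing value
print(df.isna())                         # True where a value is missing
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill gaps with the column mean
df = df.dropna()                         # Drop rows that still have missing values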
```
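A self-contained sketch of the same steps, assuming a small table with one missing age and one missing city (made-up values for illustration). It counts the gaps per column before deciding whether to fill or drop:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],
    'City': ['New York', None, 'Paris'],
})

missing_per_column = df.isna().sum()                        # number of gaps in each column
filled = df.assign(Age=df['Age'].fillna(df['Age'].mean()))  # mean() skips NaN by default
complete_city = df.dropna(subset=['City'])                  # drop rows only where City is missing

print(missing_per_column)
print(filled)
print(complete_city)
```

Passing `subset` to `dropna` limits the check to the listed columns, which avoids discarding rows over gaps in columns that do not matter for the analysis.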

### **GroupBy and Aggregation**
```python
df['Department'] = ['HR', 'IT', 'HR']
grouped = df.groupby('Department')
print(grouped['Age'].mean())
```
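When several statistics are needed at once, named aggregation keeps the output columns readable. This is a sketch; the `Salary` values and the output column names are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'HR'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000],
})

# Each keyword is an output column: (source column, aggregation)
summary = df.groupby('Department').agg(
    mean_age=('Age', 'mean'),
    total_salary=('Salary', 'sum'),
    headcount=('Age', 'count'),
)
print(summary)
```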

### **Merging and Joining**
```python
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Bonus': [5000, 3000]})
merged = pd.merge(df, df2, on='Name', how='left')
print(merged)
```
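To see how the `how` argument changes the result, the sketch below repeats the merge as a left join and as an inner join. The `indicator=True` option adds a `_merge` column that records where each row came from:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
bonuses = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Bonus': [5000, 3000]})

# Left join: keep every row of `people`; unmatched rows get NaN for Bonus
left_join = pd.merge(people, bonuses, on='Name', how='left', indicator=True)

# Inner join: keep only names present in both tables
inner_join = pd.merge(people, bonuses, on='Name', how='inner')

print(left_join)
print(inner_join)
```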

---

## **Advanced Level**

### **Pivot Tables**
```python
df['Sales'] = [100, 200, 150]
pivot = df.pivot_table(values='Sales', index='Department', columns='City', aggfunc='sum')
print(pivot)
```
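Department and city combinations with no rows show up as `NaN` in a pivot table. A minimal sketch, using the same sample values, of filling those cells with `fill_value`:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'HR'],
    'City': ['New York', 'London', 'Paris'],
    'Sales': [100, 200, 150],
})

pivot = df.pivot_table(
    values='Sales',
    index='Department',
    columns='City',
    aggfunc='sum',
    fill_value=0,  # show 0 instead of NaN for empty combinations
)
print(pivot)
```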

### **Applying Functions**
```python
def age_category(age):
    return 'Young' if age < 30 else 'Senior'
df['Age_Category'] = df['Age'].apply(age_category)
```
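`apply` calls the Python function once per value. For a simple threshold like this one, a vectorized alternative is often faster on large columns. The sketch below shows two options, `np.where` and `pd.cut`; the `Age_Band` column name and the bin edges are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35]})

# Two-way split in a single vectorized call
df['Age_Category'] = np.where(df['Age'] < 30, 'Young', 'Senior')

# Binning into labeled intervals: [0, 30) and [30, 60) because right=False
df['Age_Band'] = pd.cut(df['Age'], bins=[0, 30, 60], right=False, labels=['Young', 'Senior'])

print(df)
```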

### **Time Series**
```python
dates = pd.date_range('2023-01-01', periods=3, freq='D')
df['Date'] = dates
df.set_index('Date', inplace=True)
monthly = df.resample('M').sum(numeric_only=True)  # Month-end bins; pandas 2.2+ prefers the alias 'ME'
```
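Because the month-end alias differs between pandas versions, the sketch below aggregates by calendar month through `to_period('M')` instead, and adds a rolling mean. The dates and values are made up so that the data spans two months:

```python
import pandas as pd

# Four consecutive days crossing a month boundary: Jan 30, Jan 31, Feb 1, Feb 2
dates = pd.date_range('2023-01-30', periods=4, freq='D')
sales = pd.Series([100, 200, 150, 50], index=dates)

# Group the daily values by calendar month
monthly = sales.groupby(sales.index.to_period('M')).sum()

# Mean over a 2-row window; the first entry is NaN because the window is incomplete
rolling_mean = sales.rolling(window=2).mean()

print(monthly)
print(rolling_mean)
```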

### **MultiIndex**
```python
df_multi = df.set_index(['Department', 'City'])
print(df_multi)
```
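Once a MultiIndex is in place, rows can be selected by one level or by a full key. A brief sketch with the same sample values, sorting the index first since sorted indexes make label lookups more predictable:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'HR'],
    'City': ['New York', 'London', 'Paris'],
    'Sales': [100, 200, 150],
})
df_multi = df.set_index(['Department', 'City']).sort_index()

hr_rows = df_multi.loc['HR']                        # all rows for the outer level 'HR'
hr_paris = df_multi.loc[('HR', 'Paris'), 'Sales']   # a single cell by full key
paris_rows = df_multi.xs('Paris', level='City')     # select on the inner level
flat = df_multi.reset_index()                       # back to ordinary columns

print(hr_rows)
print(hr_paris)
print(paris_rows)
```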

### **Performance Optimization**
- Use the `category` dtype for columns with a small set of repeated strings:
  ```python
  df['Department'] = df['Department'].astype('category')
  ```
- Avoid Python-level loops over rows (e.g., `iterrows`); prefer vectorized column operations such as `df['Price'] * df['Quantity']`.
- For large files, use chunking:
  ```python
  chunks = pd.read_csv('large_data.csv', chunksize=1000)  # Iterator of DataFrames with up to 1000 rows each
  ```
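The sketch below puts two of these tips into practice. An in-memory `io.StringIO` buffer stands in for a large file (an assumption made so the example runs on its own), and the data is synthetic:

```python
import io
import pandas as pd

# Chunking: process the "file" a few rows at a time and keep only a running total
csv_text = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"
total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['value'].sum()
    n_chunks += 1
print(total, n_chunks)

# Category dtype: compare memory use for a column of repeated strings
departments = pd.Series(['HR', 'IT', 'HR'] * 1000)
as_category = departments.astype('category')
plain_bytes = departments.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(plain_bytes, category_bytes)
```

The exact byte counts depend on the pandas version and string storage, so the comparison is worth running on your own data before converting columns.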

---

## **Real-World Example: Sales Data Analysis**
Assume `sales_data.csv`:
```
OrderID,Date,Product,Category,Price,Quantity,Region
1,2023-01-01,Laptop,Electronics,1000,2,North
2,2023-01-02,Phone,Electronics,500,3,South
3,2023-01-03,Desk,Furniture,200,1,North
4,2023-01-04,Chair,Furniture,100,4,South
```

### **Code**
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 1. Load Data
df = pd.read_csv('sales_data.csv')
print("Original Data:\n", df.head())

# 2. Clean Data
df['Date'] = pd.to_datetime(df['Date'])
print("\nMissing Values:\n", df.isna().sum())
df['Price'] = df['Price'].fillna(df['Price'].mean())

# 3. Add Calculated Column
df['Total_Sales'] = df['Price'] * df['Quantity']
print("\nData with Total Sales:\n", df)

# 4. Group and Aggregate
category_sales = df.groupby('Category')['Total_Sales'].sum()
print("\nSales by Category:\n", category_sales)

# 5. Pivot Table
pivot = df.pivot_table(values='Total_Sales', index='Region', columns='Category', aggfunc='sum')
print("\nPivot Table:\n", pivot)

# 6. Time Series Analysis
df.set_index('Date', inplace=True)
monthly_sales = df.resample('M')['Total_Sales'].sum()  # Month-end bins; pandas 2.2+ prefers the alias 'ME'
print("\nMonthly Sales:\n", monthly_sales)

# 7. Filter and Sort
high_sales = df[df['Total_Sales'] > 1000].sort_values('Total_Sales', ascending=False)
print("\nHigh Sales Orders:\n", high_sales)

# 8. Visualization
category_sales.plot(kind='bar', title='Sales by Category')
plt.ylabel('Total Sales')
plt.show()

# 9. Save Results
df.to_csv('processed_sales.csv')
```

### **Explanation**
- **Load**: Reads CSV.
- **Clean**: Converts dates, handles missing values.
- **Calculate**: Adds `Total_Sales`.
- **Group**: Summarizes sales by category.
- **Pivot**: Sales by region and category.
- **Time Series**: Monthly sales aggregation.
- **Filter/Sort**: Finds high-value orders.
- **Visualize**: Bar plot of category sales.
- **Save**: Writes processed data to CSV.
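
To check the numbers without creating a file, this sketch loads the four sample rows from an in-memory string (`io.StringIO` is a stand-in for `sales_data.csv`) and repeats the calculation, grouping, pivot, and filter steps:

```python
import io
import pandas as pd

csv_text = """OrderID,Date,Product,Category,Price,Quantity,Region
1,2023-01-01,Laptop,Electronics,1000,2,North
2,2023-01-02,Phone,Electronics,500,3,South
3,2023-01-03,Desk,Furniture,200,1,North
4,2023-01-04,Chair,Furniture,100,4,South
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

# Revenue per order
df['Total_Sales'] = df['Price'] * df['Quantity']

# Totals per category, and per region/category combination
category_sales = df.groupby('Category')['Total_Sales'].sum()
pivot = df.pivot_table(values='Total_Sales', index='Region', columns='Category', aggfunc='sum')

# Orders above 1000, largest first
high_sales = df[df['Total_Sales'] > 1000].sort_values('Total_Sales', ascending=False)

print(category_sales)
print(pivot)
print(high_sales[['Product', 'Total_Sales']])
```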

---

## **Connections to Other Topics**
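- **NumPy**: Pandas builds on NumPy; numeric columns are typically backed by NumPy arrays, and `to_numpy()` exposes them. Explore NumPy arrays for deeper insights.
- **Data Processing**: DataFrames convert to and from Python dictionaries and JSON (e.g., `pd.DataFrame(data)`, `df.to_dict()`, `pd.read_json()`), so they fit into common data workflows.
- **Visualization**: Pandas’ built-in plotting (e.g., `df.plot()`) uses Matplotlib, so quick exploratory charts need very little extra code.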
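
As a small sketch of the NumPy and dictionary/JSON connections, the example below flattens nested records with `pd.json_normalize`, pulls a column out as a NumPy array, and converts the table back to a list of dictionaries. The records are made up for illustration:

```python
import pandas as pd

# Nested, JSON-like records (hypothetical data)
records = [
    {'name': 'Alice', 'age': 25, 'address': {'city': 'New York'}},
    {'name': 'Bob', 'age': 30, 'address': {'city': 'London'}},
]

flat = pd.json_normalize(records)          # nested keys become dotted column names
ages = flat['age'].to_numpy()              # column as a NumPy array
as_dicts = flat.to_dict(orient='records')  # back to a list of plain dictionaries

print(list(flat.columns))
print(ages.sum())
print(as_dicts[0])
```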

---

## **Additional Resources**
- [Official Pandas Documentation](https://pandas.pydata.org/docs/)
- [10 Minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
- [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

---

**Try It Out!**  
Clone this repository, run the examples, and experiment with your own datasets. Pandas is a versatile tool, and working through these examples on real data is the quickest way to get comfortable with it.

---

*License: MIT*  
Feel free to use, modify, and share this guide.