# Pandas Basics - Comprehensive Tutorial

This notebook covers core pandas concepts for data manipulation, from creating Series and DataFrames through cleaning, grouping, and basic time series handling.

## Table of Contents
1. [Introduction](#intro)
2. [Series and DataFrames](#structures)
3. [Data Loading](#loading)
4. [Data Selection](#selection)
5. [Data Cleaning](#cleaning)
6. [Data Transformation](#transform)
7. [GroupBy Operations](#groupby)
8. [Time Series](#timeseries)
9. [Practical Examples](#examples)

## 1. Introduction <a id='intro'></a>

In [None]:
import pandas as pd
import numpy as np
print(f"Pandas version: {pd.__version__}")

## 2. Series and DataFrames <a id='structures'></a>

In [None]:
# Creating a Series
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print("Series:")
print(s)

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Salary': [50000, 60000, 75000, 55000]
}
df = pd.DataFrame(data)
print("\nDataFrame:")
print(df)

## 3. Data Loading <a id='loading'></a>

In [None]:
# Create sample data
sample_data = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['x', 'y', 'z', 'x', 'y']
})

# Save to CSV
sample_data.to_csv('/tmp/sample.csv', index=False)

# Read from CSV
df = pd.read_csv('/tmp/sample.csv')
print("Data from CSV:")
print(df)
print("\nData info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nData description:")
print(df.describe())

## 4. Data Selection <a id='selection'></a>

In [None]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['NYC', 'LA', 'Chicago', 'Houston', 'NYC'],
    'Salary': [50000, 60000, 75000, 55000, 65000]
})

print("Original DataFrame:")
print(df)

# Column selection
print("\nSingle column:")
print(df['Name'])

print("\nMultiple columns:")
print(df[['Name', 'Salary']])

# Row selection by position with iloc
print("\nFirst row (iloc):")
print(df.iloc[0])

# iloc slices exclude the end position, so 1:3 returns positions 1 and 2
print("\nRows at positions 1-2 (iloc):")
print(df.iloc[1:3])

# Boolean indexing
print("\nPeople older than 28:")
print(df[df['Age'] > 28])

print("\nPeople from NYC:")
print(df[df['City'] == 'NYC'])

# Multiple conditions
print("\nAge > 28 AND City == NYC:")
print(df[(df['Age'] > 28) & (df['City'] == 'NYC')])
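
The cell above selects rows by position with `iloc`. As a complement, here is a minimal sketch of label-based selection with `loc`, reusing the same sample data. The variable names (`rows_loc`, `rows_iloc`, `nyc_salaries`) are illustrative only. The main point is that a `loc` slice includes its end label, while an `iloc` slice excludes its end position.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['NYC', 'LA', 'Chicago', 'Houston', 'NYC'],
    'Salary': [50000, 60000, 75000, 55000, 65000]
})

# Label-based slice: rows labelled 1 through 3, end label included
rows_loc = df.loc[1:3]

# Position-based slice: positions 1 and 2, end position excluded
rows_iloc = df.iloc[1:3]

# loc also accepts a boolean mask together with a list of column labels
nyc_salaries = df.loc[df['City'] == 'NYC', ['Name', 'Salary']]

print("loc[1:3]:")
print(rows_loc)
print("\niloc[1:3]:")
print(rows_iloc)
print("\nNYC names and salaries:")
print(nyc_salaries)
```

With the default integer index the labels and positions happen to coincide, which is why the two slices look similar here; they diverge as soon as the index is reordered or replaced.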

## 5. Data Cleaning <a id='cleaning'></a>

In [None]:
# Create data with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [10, np.nan, 30, 40, 50],
    'C': ['x', 'y', 'z', 'x', 'y']
})

print("Data with missing values:")
print(df)

# Check for missing values
print("\nMissing values:")
print(df.isnull())
print("\nMissing value counts:")
print(df.isnull().sum())

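# Fill missing values
# Column 'C' holds strings, so restrict the mean to numeric columns;
# without numeric_only=True, recent pandas versions raise a TypeError here
print("\nFill with mean:")
print(df.fillna(df.mean(numeric_only=True)))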

print("\nFill with specific value:")
print(df.fillna(0))

# Drop missing values
print("\nDrop rows with missing values:")
print(df.dropna())

# Remove duplicates
df_dup = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': [4, 5, 5, 6]
})
print("\nData with duplicates:")
print(df_dup)
print("\nAfter removing duplicates:")
print(df_dup.drop_duplicates())

## 6. Data Transformation <a id='transform'></a>

In [None]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 90, 78]
})

print("Original DataFrame:")
print(df)

# Add new column
df['Grade'] = df['Score'].apply(lambda x: 'A' if x >= 90 else ('B' if x >= 80 else 'C'))
print("\nWith Grade column:")
print(df)

# Map function
df['Score_Doubled'] = df['Score'].map(lambda x: x * 2)
print("\nWith Score_Doubled:")
print(df)

# String operations
df['Name_Upper'] = df['Name'].str.upper()
df['Name_Length'] = df['Name'].str.len()
print("\nWith string operations:")
print(df)
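
The grade and doubling steps above run a Python function once per element. For larger frames, a vectorized form is often preferable. The following is a sketch of one possible alternative, not the only way to do it: it uses `np.select` for the grade thresholds and plain arithmetic for the doubling, with the same sample scores as above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 90, 78]
})

# Conditions are checked in order; the first match wins
conditions = [df['Score'] >= 90, df['Score'] >= 80]
choices = ['A', 'B']

# Rows matching no condition fall back to the default grade 'C'
df['Grade'] = np.select(conditions, choices, default='C')

# Arithmetic on a column is applied to every element at once
df['Score_Doubled'] = df['Score'] * 2

print(df)
```

The thresholds mirror the nested lambda in the cell above, so both versions should assign the same grades.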

## 7. GroupBy Operations <a id='groupby'></a>

In [None]:
df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'LA', 'Chicago'],
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 180, 120],
    'Profit': [20, 30, 40, 35, 25]
})

print("Original data:")
print(df)

# Group by single column
print("\nAverage sales by city:")
print(df.groupby('City')['Sales'].mean())

# Group by multiple columns
print("\nTotal sales by city and category:")
print(df.groupby(['City', 'Category'])['Sales'].sum())

# Multiple aggregations
print("\nMultiple aggregations:")
print(df.groupby('City').agg({
    'Sales': ['mean', 'sum'],
    'Profit': ['mean', 'max']
}))
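
The dictionary form of `agg` above produces a two-level column index. If flat, descriptive column names are easier to work with, named aggregation is one option. A minimal sketch on the same data follows; the output column names (`avg_sales`, `total_sales`, `max_profit`) are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'LA', 'Chicago'],
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 180, 120],
    'Profit': [20, 30, 40, 35, 25]
})

# Each keyword is an output column: name=(source column, aggregation)
summary = df.groupby('City').agg(
    avg_sales=('Sales', 'mean'),
    total_sales=('Sales', 'sum'),
    max_profit=('Profit', 'max')
)

print(summary)
```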

## 8. Time Series <a id='timeseries'></a>

In [None]:
# Create time series data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
df = pd.DataFrame({
    'Date': dates,
    'Value': np.random.default_rng(0).standard_normal(10).cumsum()  # seeded random walk so reruns give the same output
})

print("Time series data:")
print(df)

# Set date as index
df = df.set_index('Date')

# Date operations
print("\nData for first 5 days:")
print(df.head())

# Resampling
print("\nResample to 2-day periods:")
print(df.resample('2D').mean())
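
Once the dates are the index, rows can also be selected by date label. The sketch below assumes deterministic values (`0.0` to `9.0`) instead of the random walk above so the results are easy to check by eye; `window` and `two_day` are illustrative names.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2024-01-01', periods=10, freq='D')
df = pd.DataFrame({'Value': np.arange(10, dtype=float)}, index=dates)

# Slicing a DatetimeIndex by date strings includes both endpoints
window = df.loc['2024-01-03':'2024-01-05']

# Resample into 2-day bins and average each bin
two_day = df.resample('2D').mean()

print("Rows from Jan 3 to Jan 5:")
print(window)
print("\n2-day means:")
print(two_day)
```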

## 9. Practical Examples <a id='examples'></a>

In [None]:
# Example: Sales data analysis
sales_data = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=20, freq='D'),
    'Product': ['A', 'B'] * 10,
    'Sales': np.random.randint(50, 200, 20),
    'Region': ['North', 'South'] * 10
})

print("Sales Data:")
print(sales_data.head())

# Total sales by product
print("\nTotal sales by product:")
print(sales_data.groupby('Product')['Sales'].sum())

# Average sales by region
print("\nAverage sales by region:")
print(sales_data.groupby('Region')['Sales'].mean())

# Pivot table
print("\nPivot table:")
pivot = sales_data.pivot_table(
    values='Sales',
    index='Product',
    columns='Region',
    aggfunc='mean'
)
print(pivot)

## Summary

In this notebook, we covered:
- Pandas data structures (Series and DataFrames)
- Loading and saving data
- Selecting and filtering data
- Cleaning missing values and duplicates
- Transforming data with apply/map
- GroupBy operations and aggregations
- Time series analysis
- Practical data analysis examples

## Next Steps

Move on to **Matplotlib** for data visualization!