# 📊 Pandas Tutorial 1: Reading and Inspecting Data

Welcome to the first notebook in our comprehensive Pandas learning series! This notebook covers the fundamentals of reading data from various sources and performing initial data exploration.

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
- Load data from CSV files using pandas
- Understand different data types in pandas
- Explore data structure and dimensions
- Generate basic statistical summaries
- Identify missing values and data quality issues

## 📁 Dataset Overview

We'll work with three sample datasets:
1. **Sales Data** - Product sales information with dates, categories, and revenue
2. **Employee Data** - HR data with salaries, departments, and performance scores
3. **Weather Data** - Temperature and weather conditions across different cities

Let's get started! 🚀

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("📚 Libraries imported successfully!")
print(f"🐼 Pandas version: {pd.__version__}")

## 📖 Section 1: Loading Data from CSV Files

CSV files are one of the most common sources of data for pandas, and `pd.read_csv()` reads one into a DataFrame in a single call. Let's load our three sample datasets and explore them.
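
Before loading the real files, here is a minimal sketch of how `pd.read_csv()` behaves. It reads from an in-memory string instead of a file so it runs anywhere; the column names, the values, and the `"missing"` placeholder are made up for illustration and are not taken from the sample datasets.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Product,Units,Revenue
Widget,3,120.50
Gadget,1,missing
Widget,2,80.00
"""

# read_csv accepts a path or any file-like object;
# na_values lists extra strings that should be treated as missing
demo_df = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])

print(demo_df.shape)   # (rows, columns)
print(demo_df.dtypes)  # dtype inferred for each column
```

Because `"missing"` is declared as a missing-value marker, `Revenue` is parsed as a float column with one `NaN` rather than as text.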

In [None]:
# Load the three sample datasets
# Note: Adjust the paths if you are running this from a different directory

# 1. Sales Data
sales_df = pd.read_csv('../data/sales_data.csv')
print("✅ Sales data loaded successfully!")

# 2. Employee Data  
employees_df = pd.read_csv('../data/employees.csv')
print("✅ Employee data loaded successfully!")

# 3. Weather Data
weather_df = pd.read_csv('../data/weather_data.csv')
print("✅ Weather data loaded successfully!")

print(f"\n📊 Loaded {len(sales_df)} sales records")
print(f"👥 Loaded {len(employees_df)} employee records") 
print(f"🌤️ Loaded {len(weather_df)} weather records")

## 🔍 Section 2: First Look at the Data

Let's examine the structure and first few rows of each dataset using various pandas methods.
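
As a quick reference, the inspection methods used in this section can be sketched on a tiny hand-made DataFrame. The data below is hypothetical and only there to make the row counts easy to check by eye.

```python
import pandas as pd

# Hypothetical toy data with six rows
demo_df = pd.DataFrame({
    "City": ["Oslo", "Lima", "Cairo", "Quito", "Perth", "Hanoi"],
    "Temperature": [2.5, 19.0, 28.1, 14.3, 22.7, 30.2],
})

first_rows = demo_df.head(2)    # first 2 rows (the default is 5)
last_rows = demo_df.tail(3)     # last 3 rows
# random_state makes the random sample reproducible between runs
random_rows = demo_df.sample(n=2, random_state=0)

print(demo_df.shape)            # (number of rows, number of columns)
print(list(demo_df.columns))    # column names as a plain list
```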

In [None]:
# Examine the Sales Data
print("🛒 SALES DATA OVERVIEW")
print("=" * 50)
print(f"Shape: {sales_df.shape}")
print(f"Columns: {list(sales_df.columns)}")
print("\nFirst 5 rows:")
display(sales_df.head())

print("\nLast 3 rows:")
display(sales_df.tail(3))

In [None]:
# Examine the Employee Data
print("👥 EMPLOYEE DATA OVERVIEW")
print("=" * 50)
print(f"Shape: {employees_df.shape}")
print(f"Columns: {list(employees_df.columns)}")
print("\nFirst 5 rows:")
display(employees_df.head())

print("\nColumn data types:")
print(employees_df.dtypes)

In [None]:
# Examine the Weather Data
print("🌤️ WEATHER DATA OVERVIEW")
print("=" * 50)
print(f"Shape: {weather_df.shape}")
print(f"Columns: {list(weather_df.columns)}")
print("\nFirst 5 rows:")
display(weather_df.head())

print("\nSample of data:")
display(weather_df.sample(5))  # Random sample of 5 rows

## 📊 Section 3: Data Information and Summary Statistics

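The `.info()` method prints each column's name, non-null count, and dtype along with the DataFrame's memory usage, while `.describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column by default.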
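
The sketch below shows both methods on a small hypothetical DataFrame (the column names and values are invented). It also illustrates two optional parameters: `buf=` to capture the `.info()` text, and `include="all"` to have `.describe()` cover non-numeric columns too.

```python
import io

import pandas as pd

# Hypothetical toy data with one missing salary
demo_df = pd.DataFrame({
    "Department": ["Sales", "IT", "IT", "HR"],
    "Salary": [50000, 65000, 70000, None],
})

# info() prints to stdout by default; buf= redirects the text so it can be stored
buffer = io.StringIO()
demo_df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# Default describe(): numeric columns only
numeric_summary = demo_df.describe()

# include="all" adds unique/top/freq rows for non-numeric columns
full_summary = demo_df.describe(include="all")
print(full_summary)
```

Note that the `count` row reflects non-null values only, so a count lower than the number of rows is an early hint that a column has missing data.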

In [None]:
# Get detailed information about each dataset
print("📋 SALES DATA INFO")
print("=" * 40)
sales_df.info()

print("\n📋 EMPLOYEE DATA INFO")
print("=" * 40)
employees_df.info()

print("\n📋 WEATHER DATA INFO")
print("=" * 40)
weather_df.info()

In [None]:
# Generate statistical summaries for numerical columns
print("📈 SALES DATA - NUMERICAL SUMMARY")
print("=" * 50)
display(sales_df.describe())

print("\n📈 EMPLOYEE DATA - NUMERICAL SUMMARY")
print("=" * 50)
display(employees_df.describe())

print("\n📈 WEATHER DATA - NUMERICAL SUMMARY")
print("=" * 50)
display(weather_df.describe())

## 🎯 Section 4: Understanding Data Types

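A column's data type determines which operations pandas can perform on it. Dates read from a CSV file typically arrive as plain strings, so they cannot be sorted chronologically or filtered by month until they are converted to datetime. Let's look at the types in our datasets and convert the date columns.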
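
Here is a minimal sketch of type conversion on hypothetical data (the column names and values are invented). It assumes dates in `YYYY-MM-DD` form and uses `errors="coerce"` so that an unparseable value becomes `NaT` instead of raising an exception.

```python
import pandas as pd

# Hypothetical toy data: everything starts out as text
demo_df = pd.DataFrame({
    "Date": ["2024-01-05", "2024-02-10", "not a date"],
    "Units": ["3", "7", "12"],
})

# Strings -> datetime; values that do not match the format become NaT
demo_df["Date"] = pd.to_datetime(
    demo_df["Date"], format="%Y-%m-%d", errors="coerce"
)

# Strings -> numbers
demo_df["Units"] = pd.to_numeric(demo_df["Units"])

# The .dt accessor is only available on datetime columns
demo_df["Month"] = demo_df["Date"].dt.month

print(demo_df.dtypes)
print(demo_df)
```

Whether to coerce bad values or let the conversion fail is a design choice: coercing keeps the pipeline running but hides the problem unless you check for the resulting missing values afterwards.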

In [None]:
# Let's convert the Date column to datetime for better analysis
print("📅 CONVERTING DATE COLUMNS")
print("=" * 40)

# Sales data date conversion
print("Before conversion:")
print(f"Sales Date column type: {sales_df['Date'].dtype}")
print(f"Sample values: {sales_df['Date'].head(3).tolist()}")

# Convert to datetime
sales_df['Date'] = pd.to_datetime(sales_df['Date'])
print("\nAfter conversion:")
print(f"Sales Date column type: {sales_df['Date'].dtype}")
print(f"Sample values: {sales_df['Date'].head(3).tolist()}")

# Employee data date conversion
employees_df['Join_Date'] = pd.to_datetime(employees_df['Join_Date'])
print(f"\nEmployee Join_Date column type: {employees_df['Join_Date'].dtype}")

# Weather data date conversion  
weather_df['Date'] = pd.to_datetime(weather_df['Date'])
print(f"Weather Date column type: {weather_df['Date'].dtype}")

print("\n✅ All date columns converted successfully!")

In [None]:
# Explore unique values in categorical columns
print("🏷️ EXPLORING CATEGORICAL DATA")
print("=" * 40)

# Sales data categories
print("Sales - Record count per product:")
print(sales_df['Product'].value_counts())
print(f"\nSales - Unique categories: {sales_df['Category'].unique()}")
print(f"Sales - Unique regions: {sales_df['Region'].unique()}")

print("\n" + "="*40)
print("Employee - Department distribution:")
print(employees_df['Department'].value_counts())

print("\n" + "="*40)
print("Weather - Cities in dataset:")
print(weather_df['City'].value_counts())

## 🚨 Section 5: Data Quality Assessment

Let's check each dataset for missing values and duplicate rows, two of the most common data quality issues.
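
The two checks can be sketched on a small hypothetical DataFrame (invented values) where the expected answers are easy to verify by eye: one missing temperature and one fully repeated row.

```python
import pandas as pd

# Hypothetical toy data: row 1 repeats row 0, and one temperature is missing
demo_df = pd.DataFrame({
    "City": ["Oslo", "Oslo", "Lima", "Lima"],
    "Temperature": [2.5, 2.5, None, 19.0],
})

# isna() marks missing cells; sum() counts them per column
missing_per_column = demo_df.isna().sum()

# duplicated() flags rows that repeat an earlier row in every column
duplicate_mask = demo_df.duplicated()

# drop_duplicates() returns a new DataFrame and keeps the first occurrence
deduplicated = demo_df.drop_duplicates()

print(missing_per_column)
print(duplicate_mask.sum(), "duplicate row(s)")
print(len(deduplicated), "rows after removing duplicates")
```

`isna()` and `isnull()` are aliases, so either name gives the same result.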

In [None]:
# Check for missing values in all datasets
print("🔍 MISSING VALUES ANALYSIS")
print("=" * 40)

datasets = {
    'Sales': sales_df,
    'Employees': employees_df, 
    'Weather': weather_df
}

for name, df in datasets.items():
    print(f"\n{name} Dataset:")
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    
    if missing.sum() == 0:
        print("  ✅ No missing values found!")
    else:
        print("  Missing values:")
        for col, count in missing[missing > 0].items():
            print(f"    {col}: {count} ({missing_pct[col]:.1f}%)")

# Check for duplicate rows
print("\n🔄 DUPLICATE ROWS CHECK")
print("=" * 40)
for name, df in datasets.items():
    duplicates = df.duplicated().sum()
    print(f"{name} Dataset: {duplicates} duplicate rows")
    if duplicates > 0:
        print(f"  → {duplicates/len(df)*100:.1f}% of total rows")

## 📈 Section 6: Quick Visual Exploration

Let's create some simple visualizations to better understand our data.

In [None]:
# Create some basic visualizations
plt.style.use('default')
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Sales by Product
sales_df['Product'].value_counts().plot(kind='bar', ax=axes[0,0], color='skyblue')
# Plain-text title: matplotlib's default font has no emoji glyphs
axes[0,0].set_title('Sales Count by Product', fontsize=12, fontweight='bold')
axes[0,0].set_ylabel('Count')
axes[0,0].tick_params(axis='x', rotation=45)

# Revenue by Region
region_revenue = sales_df.groupby('Region')['Revenue'].sum()
region_revenue.plot(kind='pie', ax=axes[0,1], autopct='%1.1f%%', startangle=90)
axes[0,1].set_title('Revenue Distribution by Region', fontsize=12, fontweight='bold')
axes[0,1].set_ylabel('')

# Employee Age Distribution
employees_df['Age'].hist(bins=10, ax=axes[1,0], color='lightgreen', alpha=0.7)
axes[1,0].set_title('Employee Age Distribution', fontsize=12, fontweight='bold')
axes[1,0].set_xlabel('Age')
axes[1,0].set_ylabel('Frequency')

# Temperature by City
for city in weather_df['City'].unique():
    city_data = weather_df[weather_df['City'] == city]
    axes[1,1].plot(city_data['Date'], city_data['Temperature'], marker='o', label=city)
axes[1,1].set_title('Temperature Trends by City', fontsize=12, fontweight='bold')
axes[1,1].set_xlabel('Date')
axes[1,1].set_ylabel('Temperature (°C)')
axes[1,1].legend()
axes[1,1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 🎓 Section 7: Key Takeaways and Next Steps

### What We've Learned:

1. **📁 Data Loading**: How to read CSV files using `pd.read_csv()`
2. **🔍 Data Exploration**: Using `.head()`, `.tail()`, `.info()`, and `.describe()` 
3. **📊 Data Structure**: Understanding DataFrame shape, columns, and data types
4. **🎯 Data Types**: Converting strings to datetime objects for better analysis
5. **🚨 Data Quality**: Checking for missing values and duplicates
6. **📈 Quick Visualization**: Creating basic plots to understand data patterns

### Next Steps:

In the next notebook, we'll dive deeper into:
- 🔎 **Data Filtering and Selection** - Finding specific subsets of data
- 🔄 **Data Grouping and Aggregation** - Summarizing data by categories  
- 🧮 **Statistical Analysis** - Calculating custom metrics and insights
- 🔗 **Data Merging** - Combining multiple datasets

### 💡 Pro Tips:

- Always inspect your data before analysis
- Convert date columns to datetime for time-based operations
- Check for missing values and outliers early
- Use descriptive variable names and comments in your code