# Python Big Data Fundamentals
*From Krish Naik's Big Data Course*

## Overview
This notebook provides an interactive introduction to Python libraries essential for big data processing.

### Topics Covered:
1. NumPy for numerical computing
2. Performance comparison between NumPy and pure Python
3. Pandas for data manipulation and visualization
4. Memory optimization techniques

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import sys

# Display versions
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 1. NumPy Fundamentals

In [None]:
# Create sample data
data_size = 1000000

# NumPy array creation
np_array = np.random.random(data_size)
print(f"NumPy array shape: {np_array.shape}")
print(f"Memory usage: {np_array.nbytes / 1024**2:.2f} MB")

# Basic statistics
print(f"Mean: {np.mean(np_array):.4f}")
print(f"Standard deviation: {np.std(np_array):.4f}")
print(f"Min: {np.min(np_array):.4f}")
print(f"Max: {np.max(np_array):.4f}")
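
The memory figure printed above follows directly from the element count and the dtype. As a small sketch (the `arr64`/`arr32` names are illustrative and not part of the original notebook), the same values stored as `float32` take half the space, at the cost of precision:

```python
import numpy as np

# Assumption: same element count as the cell above
n = 1_000_000

arr64 = np.random.random(n)        # float64 by default: 8 bytes per element
arr32 = arr64.astype(np.float32)   # 4 bytes per element, roughly 7 significant digits

print(f"float64: {arr64.nbytes / 1024**2:.2f} MB")
print(f"float32: {arr32.nbytes / 1024**2:.2f} MB")

# Largest rounding error introduced by the narrower type
print(f"Max abs difference: {np.max(np.abs(arr64 - arr32)):.2e}")
```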

## 2. Performance Comparison: NumPy vs Pure Python

In [None]:
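# Performance comparison: sum of squares over the same range
size = 100000

# Pure Python
python_list = list(range(size))
start_time = time.perf_counter()  # high-resolution timer intended for benchmarking
python_result = sum([x**2 for x in python_list])
python_time = time.perf_counter() - start_time

# NumPy (explicit int64 so the squares cannot overflow a 32-bit default integer)
numpy_array = np.arange(size, dtype=np.int64)
start_time = time.perf_counter()
numpy_result = np.sum(numpy_array**2)
numpy_time = time.perf_counter() - start_time

# Both approaches should compute the same value
assert python_result == numpy_result

print(f"Pure Python time: {python_time:.4f} seconds")
print(f"NumPy time: {numpy_time:.4f} seconds")
print(f"NumPy is {python_time/numpy_time:.1f}x faster")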

# Visualization
methods = ['Python', 'NumPy']
times = [python_time, numpy_time]

plt.figure(figsize=(8, 5))
plt.bar(methods, times, color=['red', 'blue'])
plt.title('Performance Comparison: Python vs NumPy')
plt.ylabel('Time (seconds)')
plt.show()
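
A single timed run like the one above can be noisy. One possible refinement, sketched here with the standard-library `timeit` module, repeats each measurement and keeps the best run. The helper names `python_sum_squares` and `numpy_sum_squares` are illustrative, not from the original notebook:

```python
import timeit
import numpy as np

size = 100_000
python_list = list(range(size))
numpy_array = np.arange(size, dtype=np.int64)

def python_sum_squares():
    # Pure-Python loop over the list
    return sum(x**2 for x in python_list)

def numpy_sum_squares():
    # Vectorized equivalent; int() converts the NumPy scalar to a Python int
    return int(np.sum(numpy_array**2))

runs = 10  # calls per measurement
# repeat=5 gives five measurements; the minimum is the least disturbed by other processes
python_best = min(timeit.repeat(python_sum_squares, number=runs, repeat=5)) / runs
numpy_best = min(timeit.repeat(numpy_sum_squares, number=runs, repeat=5)) / runs

print(f"Pure Python: {python_best * 1e3:.3f} ms per call")
print(f"NumPy:       {numpy_best * 1e3:.3f} ms per call")
```

The exact ratio depends on the machine and the NumPy build, so treat the numbers as indicative only.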

## 3. Pandas DataFrame Operations

In [None]:
# Create sample dataset
np.random.seed(42)
n_rows = 10000

data = {
    'employee_id': range(1, n_rows + 1),
    'name': [f'Employee_{i}' for i in range(1, n_rows + 1)],
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing', 'Operations'], n_rows),
    'salary': np.random.normal(75000, 15000, n_rows),
    'age': np.random.randint(22, 65, n_rows),
    'years_experience': np.random.randint(0, 40, n_rows)
}

df = pd.DataFrame(data)
df['salary'] = df['salary'].round(2)

print(f"Dataset shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
df.head()

In [None]:
# Data analysis
print("Basic statistics:")
print(df.describe())

print("\nDepartment distribution:")
print(df['department'].value_counts())

In [None]:
# Groupby operations
dept_analysis = df.groupby('department').agg({
    'salary': ['mean', 'median', 'std'],
    'age': 'mean',
    'years_experience': 'mean',
    'employee_id': 'count'
}).round(2)

dept_analysis.columns = ['avg_salary', 'median_salary', 'salary_std', 'avg_age', 'avg_experience', 'count']
print("Department Analysis:")
print(dept_analysis)
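
Renaming the columns by position, as above, silently breaks if the aggregation dictionary changes order. A possible alternative is pandas named aggregation, sketched below on a small synthetic frame (`demo` and `dept_summary` are illustrative names, and the data is randomly generated for this example only):

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the notebook's df
rng = np.random.default_rng(42)
n_rows = 1_000
demo = pd.DataFrame({
    'employee_id': np.arange(1, n_rows + 1),
    'department': rng.choice(['IT', 'HR', 'Finance'], n_rows),
    'salary': rng.normal(75000, 15000, n_rows).round(2),
    'age': rng.integers(22, 65, n_rows),
})

# Each keyword is the output column name; the tuple is (source column, aggregation)
dept_summary = demo.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    median_salary=('salary', 'median'),
    avg_age=('age', 'mean'),
    count=('employee_id', 'count'),
).round(2)

print(dept_summary)
```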

## 4. Data Visualization

In [None]:
# Four summary plots in a 2x2 grid
plt.figure(figsize=(12, 8))

# Subplot 1: Salary distribution
plt.subplot(2, 2, 1)
df['salary'].hist(bins=50, alpha=0.7)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Frequency')

# Subplot 2: Department counts
plt.subplot(2, 2, 2)
df['department'].value_counts().plot(kind='bar')
plt.title('Employees by Department')
plt.xlabel('Department')
plt.ylabel('Count')
plt.xticks(rotation=45)

# Subplot 3: Age vs Salary scatter plot
plt.subplot(2, 2, 3)
plt.scatter(df['age'], df['salary'], alpha=0.5)
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')

# Subplot 4: Average salary by department
plt.subplot(2, 2, 4)
dept_salary = df.groupby('department')['salary'].mean().sort_values(ascending=False)
dept_salary.plot(kind='bar')
plt.title('Average Salary by Department')
plt.xlabel('Department')
plt.ylabel('Average Salary')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## 5. Memory Optimization Techniques

In [None]:
# Memory optimization
print("Original DataFrame memory usage:")
print(df.memory_usage(deep=True))
print(f"Total: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Optimize data types
df_optimized = df.copy()

# Convert department to category
df_optimized['department'] = df_optimized['department'].astype('category')

# Downcast integers
df_optimized['employee_id'] = pd.to_numeric(df_optimized['employee_id'], downcast='integer')
df_optimized['age'] = pd.to_numeric(df_optimized['age'], downcast='integer')
df_optimized['years_experience'] = pd.to_numeric(df_optimized['years_experience'], downcast='integer')

# Downcast floats (float32 keeps only about 7 significant digits; check the resulting dtype and precision)
df_optimized['salary'] = pd.to_numeric(df_optimized['salary'], downcast='float')

print("\nOptimized DataFrame memory usage:")
print(df_optimized.memory_usage(deep=True))
print(f"Total: {df_optimized.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

memory_saved = df.memory_usage(deep=True).sum() - df_optimized.memory_usage(deep=True).sum()
print(f"\nMemory saved: {memory_saved / 1024**2:.2f} MB")
print(f"Reduction: {memory_saved / df.memory_usage(deep=True).sum() * 100:.1f}%")
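
The column-by-column steps above could be wrapped in a reusable helper. The following is a minimal sketch, not a definitive implementation: `downcast_frame` and its `max_category_ratio` threshold are hypothetical names introduced here, and the demo values are chosen so they fit the narrower types exactly.

```python
import numpy as np
import pandas as pd

def downcast_frame(frame, max_category_ratio=0.5):
    """Return a copy of `frame` with narrower dtypes where they apply.

    Hypothetical helper: string columns become 'category' only when the
    share of unique values is at most `max_category_ratio`.
    """
    out = frame.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            out[col] = pd.to_numeric(series, downcast='integer')
        elif pd.api.types.is_float_dtype(series):
            out[col] = pd.to_numeric(series, downcast='float')
        elif pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series):
            # Few distinct values relative to the row count: category saves memory
            if series.nunique() / max(len(series), 1) <= max_category_ratio:
                out[col] = series.astype('category')
    return out

# Demo data: ids fit in int16 and the scores are exactly representable in float32
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'id': np.arange(1, 1001),
    'dept': rng.choice(['IT', 'HR'], 1000),
    'score': np.arange(1000) * 0.5,
})

small = downcast_frame(demo)
before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"Before: {before / 1024:.1f} KB, after: {after / 1024:.1f} KB")
```

High-cardinality string columns such as `name` are left alone by the ratio check, since converting them to `category` would save little or nothing.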

## 6. Practice Exercises

### Exercise 1: Data Filtering
Find all employees who:
- Are older than 40
- Have more than 10 years of experience
- Earn more than the median salary

In [None]:
# Sample solution
median_salary = df['salary'].median()

experienced_high_earners = df[
    (df['age'] > 40) & 
    (df['years_experience'] > 10) & 
    (df['salary'] > median_salary)
]

print(f"Found {len(experienced_high_earners)} experienced high earners")
print(f"Average salary: ${experienced_high_earners['salary'].mean():.2f}")
experienced_high_earners.head()
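
The same filter can also be written with `DataFrame.query`, which some readers find easier to scan. This is a sketch on a small synthetic frame (`demo` is an illustrative name); the `@` prefix refers to a Python variable in the surrounding scope:

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the notebook's df
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'age': rng.integers(22, 65, 500),
    'years_experience': rng.integers(0, 40, 500),
    'salary': rng.normal(75000, 15000, 500).round(2),
})
median_salary = demo['salary'].median()

# Boolean-mask version, as in the solution above
mask_result = demo[
    (demo['age'] > 40)
    & (demo['years_experience'] > 10)
    & (demo['salary'] > median_salary)
]

# Equivalent query-string version
query_result = demo.query(
    'age > 40 and years_experience > 10 and salary > @median_salary'
)

print(f"Mask rows: {len(mask_result)}, query rows: {len(query_result)}")
```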

### Exercise 2: Data Aggregation
Calculate the correlation between years of experience and salary for each department.

In [None]:
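# Sample solution
# Select the two numeric columns explicitly so apply() does not operate on the grouping column
correlations = df.groupby('department')[['years_experience', 'salary']].apply(
    lambda x: x['years_experience'].corr(x['salary'])
)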

print("Correlation between experience and salary by department:")
for dept, corr in correlations.items():
    print(f"{dept}: {corr:.3f}")

## Summary

In this notebook, we've covered:
1. **NumPy fundamentals** for efficient numerical computing
2. **Performance benefits** of vectorized operations
3. **Pandas DataFrame** operations for data manipulation
4. **Data visualization** techniques
5. **Memory optimization** strategies
6. **Practical exercises** for hands-on learning

These skills form the foundation for working with big data in Python!