# 01: NumPy and Pandas Basics

## Overview
This notebook covers the fundamental data structures and operations in NumPy and Pandas, which are essential for data manipulation and analysis in machine learning.

## Topics Covered:
1. NumPy Arrays and Operations
2. Array Indexing and Slicing
3. Broadcasting and Vectorization
4. Pandas DataFrames and Series
5. Data Loading and Basic Operations
6. Data Cleaning and Manipulation
7. Grouping and Aggregation

## Interview Focus:
- Understanding of vectorization vs loops
- Memory efficiency
- Data transformation techniques
- Handling missing data

## 1. NumPy Basics

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

### 1.1 Creating NumPy Arrays

In [None]:
# Different ways to create arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.zeros((3, 4))  # 3x4 array of zeros
arr3 = np.ones((2, 3))   # 2x3 array of ones
arr4 = np.arange(0, 10, 2)  # Values from 0 up to (but not including) 10, step 2
arr5 = np.linspace(0, 1, 5)  # 5 evenly spaced values from 0 to 1
arr6 = np.random.randn(3, 3)  # 3x3 array of random values from standard normal

print("Array 1:", arr1)
print("\nArray 2 (zeros):\n", arr2)
print("\nArray 4 (arange):", arr4)
print("\nArray 5 (linspace):", arr5)
print("\nArray 6 (random):\n", arr6)

### 1.2 Array Properties and Attributes

In [None]:
# Array properties
arr = np.array([[1, 2, 3], [4, 5, 6]])

print("Array:\n", arr)
print("\nShape:", arr.shape)  # (rows, columns)
print("Dimensions:", arr.ndim)
print("Size (total elements):", arr.size)
print("Data type:", arr.dtype)
print("Item size (bytes):", arr.itemsize)

### 1.3 Array Indexing and Slicing

In [None]:
# Indexing and slicing
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

print("Original array:\n", arr)
print("\nElement at [1, 2]:", arr[1, 2])
print("\nFirst row:", arr[0, :])
print("\nSecond column:", arr[:, 1])
print("\nSubarray:\n", arr[0:2, 1:3])

# Boolean indexing
print("\nElements > 5:", arr[arr > 5])

# Fancy indexing (selecting with a list of row indices)
print("\nRows 0 and 2:\n", arr[[0, 2], :])

### 1.4 Array Operations and Broadcasting

In [None]:
# Element-wise operations
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print("Array a:", a)
print("Array b:", b)
print("\nAddition:", a + b)
print("Multiplication:", a * b)
print("Power:", a ** 2)

# Broadcasting example
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
vector = np.array([1, 0, 1])

print("\nMatrix:\n", matrix)
print("\nVector:", vector)
print("\nMatrix + Vector (broadcasting):\n", matrix + vector)

### 1.5 Statistical Operations

In [None]:
# Statistical operations
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

print("Array:\n", arr)
print("\nSum:", arr.sum())
print("Mean:", arr.mean())
print("Standard deviation:", arr.std())
print("Min:", arr.min())
print("Max:", arr.max())
print("\nSum along axis 0 (one total per column):", arr.sum(axis=0))
print("Sum along axis 1 (one total per row):", arr.sum(axis=1))

### 1.6 Reshaping and Transposing

In [None]:
# Reshaping
arr = np.arange(12)
print("Original:", arr)
print("\nReshaped (3x4):\n", arr.reshape(3, 4))
print("\nReshaped (2x6):\n", arr.reshape(2, 6))

# Transpose
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
print("\nOriginal matrix:\n", matrix)
print("\nTransposed:\n", matrix.T)

# Flatten (always returns a copy; ravel() returns a view when possible)
print("\nFlattened:", matrix.flatten())

## 2. Pandas Basics

### 2.1 Creating Series and DataFrames

In [None]:
# Creating a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print("Series:")
print(s)

# Creating a DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Berlin'],
    'Salary': [50000, 60000, 55000, 58000, 62000]
}

df = pd.DataFrame(data)
print("\nDataFrame:")
print(df)

### 2.2 DataFrame Operations

In [None]:
# Basic information
print("DataFrame Info:")
df.info()  # info() prints its summary itself and returns None, so don't wrap it in print()

print("\nFirst 3 rows:")
print(df.head(3))

print("\nLast 2 rows:")
print(df.tail(2))

print("\nDescriptive statistics:")
print(df.describe())

print("\nColumn names:", df.columns.tolist())
print("Shape:", df.shape)

### 2.3 Indexing and Selection

In [None]:
# Selecting columns
print("Age column:")
print(df['Age'])

print("\nMultiple columns:")
print(df[['Name', 'Salary']])

# Selecting rows by position (iloc)
print("\nFirst row (iloc):")
print(df.iloc[0])

print("\nFirst 3 rows, first 2 columns (iloc):")
print(df.iloc[0:3, 0:2])

# Selecting rows by label (loc); unlike iloc, the end of a loc slice is inclusive
print("\nRows 0-2 (loc):")
print(df.loc[0:2, ['Name', 'Age']])

# Boolean indexing
print("\nPeople older than 30:")
print(df[df['Age'] > 30])

### 2.4 Data Manipulation

In [None]:
# Adding new columns
df['Salary_Bonus'] = df['Salary'] * 1.1
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')

print("DataFrame with new columns:")
print(df)

# Dropping columns
df_copy = df.copy()
df_copy = df_copy.drop('Salary_Bonus', axis=1)
print("\nAfter dropping Salary_Bonus:")
print(df_copy.columns.tolist())

# Sorting
print("\nSorted by Age (descending):")
print(df.sort_values('Age', ascending=False))

### 2.5 Handling Missing Data

In [None]:
# Creating DataFrame with missing values
data_with_nan = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
}
df_nan = pd.DataFrame(data_with_nan)

print("DataFrame with NaN:")
print(df_nan)

print("\nCheck for NaN:")
print(df_nan.isnull())

print("\nCount of NaN per column:")
print(df_nan.isnull().sum())

# Filling missing values
print("\nFill NaN with 0:")
print(df_nan.fillna(0))

print("\nFill NaN with column mean:")
print(df_nan.fillna(df_nan.mean()))

# Dropping missing values
print("\nDrop rows with any NaN:")
print(df_nan.dropna())

print("\nDrop columns with any NaN:")
print(df_nan.dropna(axis=1))

### 2.6 GroupBy and Aggregation

In [None]:
# Creating sample data
data = {
    'Department': ['IT', 'HR', 'IT', 'HR', 'IT', 'Finance', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [70000, 50000, 75000, 52000, 80000, 60000, 65000],
    'Years': [3, 5, 4, 6, 7, 2, 3]
}
df_emp = pd.DataFrame(data)

print("Employee Data:")
print(df_emp)

# GroupBy operations
print("\nAverage salary by department:")
print(df_emp.groupby('Department')['Salary'].mean())

print("\nMultiple aggregations:")
print(df_emp.groupby('Department').agg({
    'Salary': ['mean', 'min', 'max'],
    'Years': 'mean'
}))

print("\nCount by department:")
print(df_emp.groupby('Department').size())

### 2.7 Merging and Joining DataFrames

In [None]:
# Creating sample DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Score': [85, 90, 78, 88]
})

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Inner join (default)
print("\nInner join:")
print(pd.merge(df1, df2, on='ID', how='inner'))

# Left join
print("\nLeft join:")
print(pd.merge(df1, df2, on='ID', how='left'))

# Outer join
print("\nOuter join:")
print(pd.merge(df1, df2, on='ID', how='outer'))

# Concatenation
df3 = pd.DataFrame({'ID': [7], 'Name': ['Eve']})
print("\nConcatenate:")
print(pd.concat([df1, df3], ignore_index=True))

### 2.8 Reading and Writing Data

In [None]:
# Writing to CSV
df.to_csv('/tmp/sample_data.csv', index=False)
print("Data written to CSV")

# Reading from CSV
df_read = pd.read_csv('/tmp/sample_data.csv')
print("\nData read from CSV:")
print(df_read.head())

# Note: Other formats follow the same pattern, e.g. pd.read_excel() / df.to_excel()
# and pd.read_json() / df.to_json().

## Interview Questions

### Q1: What is the difference between `loc` and `iloc`?
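**Answer:**
- `loc` is label-based: rows and columns are selected by index and column labels, and slice endpoints are inclusive (`df.loc[0:2]` returns labels 0, 1, and 2)
- `iloc` is position-based: rows and columns are selected by 0-indexed integer position, and slice ends are exclusive (`df.iloc[0:2]` returns positions 0 and 1)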
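
The difference is easiest to see when the index is not the default 0..n-1. The following is a minimal sketch; `df_demo` and its string index are hypothetical and not part of the notebook's data.

```python
import pandas as pd

# Hypothetical frame with a string index, so labels and positions differ
df_demo = pd.DataFrame(
    {"Age": [25, 30, 35, 28]},
    index=["a", "b", "c", "d"],
)

# loc selects by label; the end label of a slice is included
by_label = df_demo.loc["a":"c", "Age"]

# iloc selects by integer position; the end position of a slice is excluded
by_position = df_demo.iloc[0:2, 0]

print(by_label.tolist())     # three rows: labels a, b, c
print(by_position.tolist())  # two rows: positions 0 and 1
```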

### Q2: What is broadcasting in NumPy?
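**Answer:** Broadcasting lets NumPy apply element-wise operations to arrays of different shapes. Shapes are compared from the trailing dimension backward, and two dimensions are compatible when they are equal or one of them is 1. Dimensions of size 1 (or missing dimensions) are virtually stretched to match the other array, without copying the smaller array's data.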
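
As an illustration of the shape rules, here is a small sketch. The array names are made up for this example, and `np.broadcast_to` is used only to show that the expanded array is a zero-stride view rather than a copy.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,), treated as (1, 3)
col = np.array([[100], [200]])        # shape (2, 1)

row_result = matrix + row   # row is added to every row of matrix
col_result = matrix + col   # col is added to every column of matrix

# The "stretched" array is a read-only view: stride 0 along the expanded axis
expanded = np.broadcast_to(row, (2, 3))
print(expanded.strides[0], np.shares_memory(expanded, row))

# Shapes (2, 3) and (2,) are incompatible: trailing dimensions are 3 vs 2
try:
    matrix + np.array([1, 2])
    shapes_compatible = True
except ValueError:
    shapes_compatible = False
```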

### Q3: How do you handle missing data in Pandas?
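**Answer:**
- `isna()` (alias `isnull()`): detect missing values
- `dropna()`: remove rows or columns that contain missing values
- `fillna()`: fill missing values with a constant or a computed statistic such as the column mean or median
- `ffill()` / `bfill()`: propagate the previous or next valid value (recent pandas releases deprecate the older `fillna(method=...)` form)
- `interpolate()`: fill missing values by interpolating between neighbouring values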
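
To compare these options side by side, the sketch below applies each one to the same small Series. The Series `s` is a made-up example; which strategy is appropriate depends on the data and is not decided here.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

n_missing = s.isna().sum()         # count missing entries
dropped = s.dropna()               # keep only the non-missing entries
mean_filled = s.fillna(s.mean())   # mean() skips NaN, so this fills with 2.5
forward_filled = s.ffill()         # carry the last valid value forward
interpolated = s.interpolate()     # linear interpolation by default

print(pd.DataFrame({
    "original": s,
    "mean_filled": mean_filled,
    "forward_filled": forward_filled,
    "interpolated": interpolated,
}))
```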

### Q4: What's the difference between a copy and a view in NumPy?
**Answer:**
- A view is a new array object that shares the original array's data buffer; changes made through the view affect the original
- A copy is a new array with its own data; changes to it do not affect the original
- Basic slicing (`arr[1:4]`), `reshape` (when possible), and `.T` return views, while `copy()`, `flatten()`, and fancy or boolean indexing return copies
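
A short sketch of this behaviour follows. The variable names are illustrative, and `np.shares_memory` is used here as one way to check whether two arrays share a buffer.

```python
import numpy as np

original = np.arange(5)            # [0 1 2 3 4]

view = original[1:4]               # basic slice: shares memory with original
view[0] = 99                       # writes through to original[1]

copied = original[1:4].copy()      # explicit copy: independent data
copied[1] = -1                     # original is not affected

fancy = original[[0, 1]]           # fancy indexing: also a copy
fancy[0] = 42                      # original is not affected

print(original)
print(np.shares_memory(original, view))    # view shares the buffer
print(np.shares_memory(original, copied))  # copy does not
print(np.shares_memory(original, fancy))   # fancy-indexed result does not
```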

### Q5: Why is vectorization important?
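**Answer:** Vectorization replaces explicit Python loops with whole-array operations, which gives:
- Faster execution: the loop runs in compiled C code instead of the Python interpreter
- More concise, readable code
- Compact storage: elements live in one contiguous, typed buffer rather than as separate Python objects (although large intermediate arrays can raise peak memory use)
- Access to SIMD instructions and, for linear algebra routines backed by a multithreaded BLAS library, parallel execution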
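
The speed difference can be measured directly. The sketch below times a Python loop against an equivalent NumPy call; the function names are hypothetical, and the absolute timings depend on the machine and NumPy build, so only the relative gap is meaningful.

```python
import timeit
import numpy as np

values = np.arange(100_000, dtype=np.float64)

def loop_sum_of_squares(arr):
    """Sum of squares with an explicit Python loop."""
    total = 0.0
    for x in arr:
        total += x * x
    return total

def vectorized_sum_of_squares(arr):
    """Same computation as a single NumPy call."""
    return float(np.dot(arr, arr))

loop_result = loop_sum_of_squares(values)
vec_result = vectorized_sum_of_squares(values)

loop_time = timeit.timeit(lambda: loop_sum_of_squares(values), number=5)
vec_time = timeit.timeit(lambda: vectorized_sum_of_squares(values), number=5)

print(f"Loop:       {loop_time:.4f} s")
print(f"Vectorized: {vec_time:.4f} s")
```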

## Practice Exercises

1. Create a 5x5 matrix with random values and normalize it (subtract mean, divide by std)
2. Create a DataFrame from a dictionary and filter rows based on multiple conditions
3. Implement a function that handles missing values in different ways based on the percentage of missing data
4. Use groupby to compute multiple statistics for different groups
5. Merge two DataFrames with different join types and observe the differences