# **Chapter 4: Data Fundamentals and Programming Basics**

---

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Use NumPy and pandas for time-series data manipulation
- Load time-series data from various sources (CSV, databases, APIs)
- Handle missing values and outliers effectively
- Perform data type conversions and transformations
- Aggregate and resample time-series data
- Index and select data based on time and conditions
- Conduct basic exploratory data analysis
- Assess and document data quality

---

## **Prerequisites**

- Completed Chapter 3: Setting Up Your Development Environment
- Python 3.8+ installed with required libraries
- Basic understanding of programming concepts

---

## **4.1 Python for Data Science**

Python has become the de facto language for data science and machine learning due to its rich ecosystem of libraries. For time-series prediction systems, two libraries form the foundation: NumPy for numerical computing and pandas for data manipulation.

### **4.1.1 NumPy Fundamentals**

NumPy (Numerical Python) provides efficient multi-dimensional arrays and mathematical operations. Understanding NumPy is essential because pandas is built on top of it, and most machine learning libraries use NumPy arrays internally.

```python
import numpy as np

# Creating NumPy arrays - the fundamental data structure
# Arrays are homogeneous (all elements same type) and stored contiguously in memory

# Create array from a list - basic creation method
prices_list = [2850.50, 2875.25, 2890.00, 2865.75, 2880.50]
prices_array = np.array(prices_list)
print(f"Array: {prices_array}")
print(f"Type: {type(prices_array)}")
print(f"Data type: {prices_array.dtype}")
print(f"Shape: {prices_array.shape}")

# Output:
# Array: [2850.5  2875.25 2890.   2865.75 2880.5 ]
# Type: <class 'numpy.ndarray'>
# Data type: float64
# Shape: (5,)
```

**Explanation:**
- `np.array()` converts a Python list into a NumPy array. Unlike Python lists, NumPy arrays are stored as contiguous blocks of memory, making operations much faster.
- `dtype` (data type) shows the type of elements. `float64` means 64-bit floating-point numbers, which is the default for decimal numbers.
- `shape` returns a tuple representing dimensions. `(5,)` means a 1-dimensional array with 5 elements.

---

```python
# Creating arrays with specific patterns - useful for initialization

# Array of zeros - useful for initializing accumulators or masks
zeros = np.zeros(5)
print(f"Zeros: {zeros}")

# Array of ones - useful for creating weight vectors or multipliers
ones = np.ones(5)
print(f"Ones: {ones}")

# Array with range of values - similar to Python's range() but returns array
range_array = np.arange(0, 10, 2)  # start, stop, step
print(f"Range: {range_array}")

# Linearly spaced values - creates evenly spaced numbers over an interval
# Useful for creating time indices or price level grids
linear_space = np.linspace(0, 100, 5)  # start, stop, num_points
print(f"Linear space: {linear_space}")

# Output:
# Zeros: [0. 0. 0. 0. 0.]
# Ones: [1. 1. 1. 1. 1.]
# Range: [0 2 4 6 8]
# Linear space: [  0.  25.  50.  75. 100.]
```

**Explanation:**
- `np.zeros()` creates an array filled with zeros. In time-series, we often use this to initialize arrays for storing predictions or intermediate calculations.
- `np.ones()` creates an array filled with ones. Useful for creating moving average weights or normalization factors.
- `np.arange()` is like Python's `range()` but returns a NumPy array. The `step` parameter controls the spacing between values.
- `np.linspace()` creates evenly spaced numbers. Unlike `arange()`, you specify the number of points rather than the step size. This is useful for creating time axes or price grids for analysis.
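
The practical difference between the two functions is how they treat the end of the interval. The short sketch below (variable names are illustrative) shows that `np.arange()` excludes the stop value while `np.linspace()` includes it by default:

```python
import numpy as np

# arange: half-open interval [start, stop) walked in fixed steps
half_open = np.arange(0, 10, 2)

# linspace: closed interval [start, stop] split into a fixed number of points
closed = np.linspace(0, 10, 6)

print(half_open)  # [0 2 4 6 8]
print(closed)     # [ 0.  2.  4.  6.  8. 10.]
```

Prefer `np.linspace()` when the number of points matters (for example, a grid of price levels) and `np.arange()` when the step size matters.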

---

```python
# Multi-dimensional arrays - representing time-series with multiple features

# 2D array representing 3 days of NEPSE stock data with 4 features each
# Each row is a day, each column is a feature
stock_data = np.array([
    [2850.50, 2890.00, 2840.00, 2875.25],  # Day 1: Open, High, Low, Close
    [2875.25, 2910.00, 2860.00, 2895.50],  # Day 2
    [2895.50, 2920.00, 2880.00, 2900.00],  # Day 3
])

print(f"Shape: {stock_data.shape}")  # (3, 4) - 3 rows, 4 columns
print(f"Dimensions: {stock_data.ndim}")  # 2 - number of dimensions
print(f"Size: {stock_data.size}")  # 12 - total number of elements

# Accessing elements in 2D array
print(f"First row (Day 1): {stock_data[0]}")
print(f"First column (All Opens): {stock_data[:, 0]}")
print(f"Specific element (Day 2 High): {stock_data[1, 1]}")

# Output:
# Shape: (3, 4)
# Dimensions: 2
# Size: 12
# First row (Day 1): [2850.5 2890.   2840.   2875.25]
# First column (All Opens): [2850.5  2875.25 2895.5 ]
# Specific element (Day 2 High): 2910.0
```

**Explanation:**
- Multi-dimensional arrays are perfect for representing time-series with multiple features. Each row typically represents a time point, and each column represents a feature.
- `shape` returns `(rows, columns)`. `(3, 4)` means 3 time points with 4 features each.
- `ndim` returns the number of dimensions (axes). A 2D array has 2 dimensions.
- `size` returns the total number of elements (rows × columns).
- Indexing uses `[row, column]` syntax. The colon `:` means "all elements along this axis."
- `stock_data[:, 0]` means "all rows, column 0" - extracting the Open prices for all days.
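
Column slicing combines naturally with vectorized arithmetic. As a small sketch (the column-position constants `OPEN`, `HIGH`, `LOW`, `CLOSE` are names introduced here for readability, not part of NumPy), the daily range and per-column maximum can be computed without loops:

```python
import numpy as np

stock_data = np.array([
    [2850.50, 2890.00, 2840.00, 2875.25],  # Day 1: Open, High, Low, Close
    [2875.25, 2910.00, 2860.00, 2895.50],  # Day 2
    [2895.50, 2920.00, 2880.00, 2900.00],  # Day 3
])

# Hypothetical helper constants naming the column positions
OPEN, HIGH, LOW, CLOSE = 0, 1, 2, 3

# High minus Low for every day at once
daily_range = stock_data[:, HIGH] - stock_data[:, LOW]
print(daily_range)  # [50. 50. 40.]

# axis=0 aggregates down the rows, giving one value per column
column_max = stock_data.max(axis=0)
print(column_max)   # [2895.5 2920.  2880.  2900. ]
```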

---

```python
# Mathematical operations on arrays - vectorized operations for efficiency

# Sample closing prices for a week
close_prices = np.array([2850.50, 2875.25, 2890.00, 2865.75, 2880.50])

# Element-wise arithmetic - operations apply to each element automatically
adjusted_prices = close_prices + 10  # Add 10 to each price
print(f"Adjusted prices: {adjusted_prices}")

percentage_of_start = (close_prices / close_prices[0]) * 100  # Normalize to percentage
print(f"Percentage of start: {np.round(percentage_of_start, 2)}")  # Rounded for display

# Statistical operations - essential for time-series analysis
print(f"Mean price: {np.mean(close_prices):.2f}")
print(f"Standard deviation: {np.std(close_prices):.2f}")
print(f"Minimum: {np.min(close_prices):.2f}")
print(f"Maximum: {np.max(close_prices):.2f}")
print(f"Range: {np.ptp(close_prices):.2f}")  # ptp = peak to peak (max - min)

# Output:
# Adjusted prices: [2860.5  2885.25 2900.   2875.75 2890.5 ]
# Percentage of start: [100.   100.87 101.39 100.53 101.05]
# Mean price: 2872.40
# Standard deviation: 13.47
# Minimum: 2850.50
# Maximum: 2890.00
# Range: 39.50
```

**Explanation:**
- **Vectorization** is NumPy's key advantage. Operations like `close_prices + 10` run over every element in optimized compiled code, without explicit Python loops, making code faster and cleaner.
- Division by the first element normalizes all prices relative to the starting point, useful for comparing different stocks on the same scale.
- `np.mean()` calculates the average - fundamental for understanding central tendency in price data.
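- `np.std()` calculates the standard deviation, a measure of price volatility that is crucial for risk assessment. By default it computes the population standard deviation (`ddof=0`); pass `ddof=1` for the sample standard deviation.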
- `np.ptp()` (peak to peak) calculates the range (max - min), showing price spread during the period.
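
Two details are worth seeing in code: the effect of the `ddof` argument on `np.std()`, and how day-over-day percentage returns follow from the same vectorized arithmetic. This is a minimal sketch using the prices above; `daily_returns` is a name chosen here for illustration.

```python
import numpy as np

close_prices = np.array([2850.50, 2875.25, 2890.00, 2865.75, 2880.50])

# Population standard deviation (divides by N) vs sample (divides by N - 1)
population_std = np.std(close_prices)          # ddof=0 is the default
sample_std = np.std(close_prices, ddof=1)
print(f"Population std: {population_std:.2f}")  # 13.47
print(f"Sample std: {sample_std:.2f}")          # 15.06

# Day-over-day percentage returns: (today - yesterday) / yesterday * 100
# np.diff returns N - 1 differences, so divide by all prices except the last
daily_returns = np.diff(close_prices) / close_prices[:-1] * 100
print(np.round(daily_returns, 2))
```

With only five observations the two standard deviations differ noticeably; for long series the gap becomes negligible.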

---

```python
# Boolean indexing and filtering - selecting data based on conditions

# Sample prices with some anomalies
prices = np.array([2850.50, 2875.25, 5000.00, 2890.00, 2865.75, 0.0, 2880.50])

# Create boolean mask - returns True/False for each element
mask_high = prices > 3000  # Find abnormally high prices
print(f"High price mask: {mask_high}")

# Use mask to filter - extracts only True elements
high_prices = prices[mask_high]
print(f"High prices: {high_prices}")

# Find valid prices (not zero, not abnormally high)
mask_valid = (prices > 0) & (prices < 3000)  # Combine conditions with &
valid_prices = prices[mask_valid]
print(f"Valid prices: {valid_prices}")

# Count valid and invalid prices
print(f"Valid count: {np.sum(mask_valid)}")  # True counts as 1
print(f"Invalid count: {np.sum(~mask_valid)}")  # ~ is NOT operator

# Output:
# High price mask: [False False  True False False False False]
# High prices: [5000.]
# Valid prices: [2850.5  2875.25 2890.   2865.75 2880.5 ]
# Valid count: 5
# Invalid count: 2
```

**Explanation:**
- **Boolean indexing** is powerful for filtering time-series data. A comparison like `prices > 3000` creates a boolean array (True/False for each element).
- Using a boolean mask as an index extracts only elements where the mask is True.
- Multiple conditions can be combined with `&` (AND), `|` (OR). Each condition must be in parentheses.
- `~` is the NOT operator, inverting the boolean mask.
- `np.sum()` on a boolean array counts True values (True = 1, False = 0), useful for counting how many data points meet certain criteria.
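
Dropping invalid values changes the array length, which can break alignment with a date axis. One alternative, sketched below under the same validity rule as above, is to keep the length and mark invalid entries as `NaN` with `np.where()`, then use NaN-aware statistics:

```python
import numpy as np

prices = np.array([2850.50, 2875.25, 5000.00, 2890.00, 2865.75, 0.0, 2880.50])
mask_valid = (prices > 0) & (prices < 3000)

# Keep valid prices, replace everything else with NaN (length is preserved)
cleaned = np.where(mask_valid, prices, np.nan)
print(cleaned)

# np.nanmean ignores NaN entries; np.mean would return nan here
print(f"Mean of valid prices: {np.nanmean(cleaned):.2f}")  # 2872.40
```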

---

### **4.1.2 pandas Essentials**

pandas is the most important library for time-series data manipulation. It provides DataFrame (2D table) and Series (1D array) data structures with powerful indexing, grouping, and time-series functionality.

```python
import pandas as pd

# Creating a Series - 1D labeled array
# A Series combines data with an index (labels for each element)

# Create Series from a list with custom index (trading days)
close_prices = pd.Series(
    [2850.50, 2875.25, 2890.00, 2865.75, 2880.50],
    index=['Day1', 'Day2', 'Day3', 'Day4', 'Day5'],
    name='Close_Price'
)

print("Series:")
print(close_prices)
print(f"\nSeries values: {close_prices.values}")  # NumPy array
print(f"Series index: {close_prices.index}")  # Index object
print(f"Series name: {close_prices.name}")

# Accessing by label vs position
print(f"\nAccess by label: {close_prices['Day3']}")  # Using index label
print(f"Access by position: {close_prices.iloc[2]}")  # Using integer position

# Output:
# Series:
# Day1    2850.50
# Day2    2875.25
# Day3    2890.00
# Day4    2865.75
# Day5    2880.50
# Name: Close_Price, dtype: float64
#
# Series values: [2850.5  2875.25 2890.   2865.75 2880.5 ]
# Series index: Index(['Day1', 'Day2', 'Day3', 'Day4', 'Day5'], dtype='object')
# Series name: Close_Price
#
# Access by label: 2890.0
# Access by position: 2890.0
```

**Explanation:**
- A **Series** is pandas' 1D data structure, like a column in a spreadsheet. It combines data with an index (row labels).
- The `index` parameter assigns custom labels to each element. Without it, pandas uses default integer indices (0, 1, 2...).
- `name` gives the Series a label, useful when it becomes a DataFrame column.
- `.values` returns the underlying NumPy array - useful when you need NumPy operations. The pandas documentation recommends `.to_numpy()` as the modern equivalent.
- `.index` returns the index object containing the labels.
- Access by label uses square brackets with the label name (`close_prices['Day3']`).
- `.iloc[]` accesses by integer position, useful when you want position-based access regardless of index type.
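
Index labels also drive arithmetic: pandas aligns two Series by label before operating on them, not by position. The sketch below uses two made-up Series to show the effect; labels present in only one operand produce `NaN`.

```python
import pandas as pd

first = pd.Series([10.0, 20.0, 30.0], index=['Day1', 'Day2', 'Day3'])
second = pd.Series([1.0, 2.0, 3.0], index=['Day2', 'Day3', 'Day4'])

# Values are matched on their labels; Day1 and Day4 have no partner
total = first + second
print(total)
# Day1     NaN
# Day2    21.0
# Day3    32.0
# Day4     NaN
# dtype: float64
```

This behavior is what makes it safe to combine time series that cover slightly different date ranges.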

---

```python
# Creating a DataFrame - 2D labeled data structure
# A DataFrame is like a spreadsheet or SQL table with rows and columns

# Sample NEPSE stock data for multiple days
nepse_data = {
    'Symbol': ['NABIL', 'NABIL', 'NABIL', 'NABIL', 'NABIL'],
    'Date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19'],
    'Open': [2850.50, 2875.25, 2890.00, 2865.75, 2880.50],
    'High': [2890.00, 2910.00, 2920.00, 2900.00, 2915.00],
    'Low': [2840.00, 2860.00, 2880.00, 2850.00, 2870.00],
    'Close': [2875.25, 2895.50, 2900.00, 2880.50, 2905.00],
    'Volume': [125000, 150000, 175000, 140000, 160000],
}

df = pd.DataFrame(nepse_data)

print("DataFrame:")
print(df)
print(f"\nShape: {df.shape}")  # (rows, columns)
print(f"Columns: {df.columns.tolist()}")
print(f"Index: {df.index}")
print(f"Data types:\n{df.dtypes}")

# Output:
# DataFrame:
#   Symbol        Date     Open     High      Low    Close  Volume
# 0  NABIL  2024-01-15  2850.50  2890.00  2840.00  2875.25  125000
# 1  NABIL  2024-01-16  2875.25  2910.00  2860.00  2895.50  150000
# 2  NABIL  2024-01-17  2890.00  2920.00  2880.00  2900.00  175000
# 3  NABIL  2024-01-18  2865.75  2900.00  2850.00  2880.50  140000
# 4  NABIL  2024-01-19  2880.50  2915.00  2870.00  2905.00  160000
#
# Shape: (5, 7)
# Columns: ['Symbol', 'Date', 'Open', 'High', 'Low', 'Close', 'Volume']
# Index: RangeIndex(start=0, stop=5, step=1)
# Data types:
# Symbol      object
# Date        object
# Open       float64
# High       float64
# Low        float64
# Close      float64
# Volume       int64
# dtype: object
```

**Explanation:**
- A **DataFrame** is pandas' primary 2D data structure. Think of it as a table with labeled rows and columns.
- Created from a dictionary where keys become column names and values become column data.
- `.shape` returns `(rows, columns)` as a tuple.
- `.columns` returns the column labels as an Index object. `.tolist()` converts it to a Python list.
- `.index` shows the row labels. By default, pandas uses RangeIndex (0, 1, 2...).
- `.dtypes` shows the data type of each column. `object` typically means strings, `float64` for decimals, `int64` for integers.

---

```python
# DataFrame column access and manipulation

# Access single column (returns Series)
close_series = df['Close']
print(f"Close prices (Series):\n{close_series}")

# Access multiple columns (returns DataFrame)
price_columns = df[['Open', 'High', 'Low', 'Close']]
print(f"\nPrice columns (DataFrame):\n{price_columns.head()}")

# Add new column - calculated field
df['Range'] = df['High'] - df['Low']  # Daily price range
df['Change'] = df['Close'] - df['Open']  # Daily change
df['Change_Pct'] = (df['Change'] / df['Open']) * 100  # Percentage change

print(f"\nDataFrame with new columns:\n{df}")

# Modify existing column
df['Volume'] = df['Volume'] / 1000  # Convert to thousands
print(f"\nVolume in thousands:\n{df['Volume']}")

# Delete column
df_dropped = df.drop('Symbol', axis=1)  # axis=1 means column
print(f"\nAfter dropping Symbol column:\n{df_dropped.head()}")

# Output:
# Close prices (Series):
# 0    2875.25
# 1    2895.50
# 2    2900.00
# 3    2880.50
# 4    2905.00
# Name: Close, dtype: float64
```

**Explanation:**
- Accessing a single column with `df['ColumnName']` returns a Series (1D).
- Accessing multiple columns with `df[['Col1', 'Col2']]` returns a DataFrame. Note the double brackets (a list inside brackets).
- New columns are created by assignment: `df['NewColumn'] = values`. Values can be calculated from existing columns using vectorized operations.
- **Vectorized operations** between columns (like `df['High'] - df['Low']`) apply to each row automatically, much faster than iterating.
- Columns can be modified in place by reassignment.
- `.drop()` removes rows or columns. `axis=1` specifies columns (axis=0 is rows). Returns a new DataFrame by default; use `inplace=True` to modify the original.

---

```python
# DataFrame row access - loc vs iloc

# Create sample DataFrame with date index
df_dates = df.copy()
df_dates['Date'] = pd.to_datetime(df_dates['Date'])
df_dates.set_index('Date', inplace=True)  # Set Date as index

print("DataFrame with date index:")
print(df_dates)

# iloc - integer position based (position 0, 1, 2, ...)
print(f"\nFirst row (iloc[0]):\n{df_dates.iloc[0]}")
print(f"\nLast row (iloc[-1]):\n{df_dates.iloc[-1]}")
print(f"\nFirst 3 rows:\n{df_dates.iloc[:3]}")
print(f"\nSpecific rows and columns:\n{df_dates.iloc[0:3, 0:4]}")  # rows 0-2, cols 0-3

# loc - label based (using index labels)
print(f"\nRow for 2024-01-15:\n{df_dates.loc['2024-01-15']}")
print(f"\nRows for date range:\n{df_dates.loc['2024-01-15':'2024-01-17']}")

# Output:
# DataFrame with date index:
#             Symbol     Open     High      Low    Close  Volume   Range  Change  Change_Pct
# Date
# 2024-01-15   NABIL  2850.50  2890.00  2840.00  2875.25   125.0    50.0   24.75    0.868269
```

**Explanation:**
- `.set_index()` sets a column as the DataFrame index. `inplace=True` modifies the DataFrame directly.
- `.iloc[]` is **integer-location based** indexing. Use it when you want to access by position (like Python lists).
- `.loc[]` is **label-based** indexing. Use it when you want to access by index labels (like dates or names).
- Both support slicing: `iloc[0:3]` gets positions 0, 1, 2 (exclusive end). `loc['2024-01-15':'2024-01-17']` includes both endpoints.
- `.iloc[0:3, 0:4]` accesses both rows and columns by position: rows 0-2, columns 0-3.

---

```python
# Filtering DataFrames with conditions

# Filter rows based on condition
high_volume = df[df['Volume'] > 140]
print(f"Days with volume > 140k:\n{high_volume}")

# Multiple conditions
complex_filter = df[(df['Close'] > 2880) & (df['Volume'] > 130)]
print(f"\nDays with Close > 2880 AND Volume > 130k:\n{complex_filter}")

# Using isin() for categorical filtering
symbols_of_interest = ['NABIL', 'NICA']
# This would work if we had multiple symbols:
# filtered = df[df['Symbol'].isin(symbols_of_interest)]

# Query method - SQL-like syntax
query_result = df.query('Close > 2880 and Volume > 130')
print(f"\nQuery result:\n{query_result}")

# Output:
# Days with volume > 140k:
#   Symbol        Date     Open     High      Low    Close  Volume   Range  Change  Change_Pct
# 1  NABIL  2024-01-16  2875.25  2910.00  2860.00  2895.50   150.0    50.0   20.25    0.704287
# 2  NABIL  2024-01-17  2890.00  2920.00  2880.00  2900.00   175.0    40.0   10.00    0.346021
# 4  NABIL  2024-01-19  2880.50  2915.00  2870.00  2905.00   160.0    45.0   24.50    0.850547
```

**Explanation:**
- Filtering uses boolean indexing: `df[condition]` returns rows where condition is True.
- The condition `df['Volume'] > 140` creates a boolean Series (True/False for each row).
- Multiple conditions must use `&` (AND) or `|` (OR), with each condition in parentheses.
- `.isin()` filters for values in a given list, useful for categorical columns like symbols or sectors.
- `.query()` provides SQL-like syntax for filtering. Column names are used directly without quotes, and the string can contain complex conditions.
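
The `isin()` call that is commented out above needs a DataFrame with several symbols. The sketch below builds a small hypothetical one (`quotes`) and also shows two `.query()` features that help with NEPSE-style column names: `@name` refers to a Python variable, and backticks quote column names containing spaces or punctuation (supported in pandas 1.0 and later).

```python
import pandas as pd

# Hypothetical multi-symbol data for illustration
quotes = pd.DataFrame({
    'Symbol': ['NABIL', 'NICA', 'SCBL', 'NABIL'],
    'Close': [2875.25, 1190.50, 2465.00, 2895.50],
    'Prev. Close': [2845.00, 1178.00, 2445.00, 2875.25],
})

# Keep only rows whose Symbol appears in the list
symbols_of_interest = ['NABIL', 'NICA']
subset = quotes[quotes['Symbol'].isin(symbols_of_interest)]
print(subset)

# @min_close reads the Python variable; backticks quote the awkward column name
min_close = 2000
result = quotes.query('Close > @min_close and `Prev. Close` < 2870')
print(result)
```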

---

### **4.1.3 Working with Dates and Times**

Time-series analysis heavily depends on proper date and time handling. pandas provides extensive functionality for parsing, manipulating, and resampling time-based data.

```python
import pandas as pd
from datetime import datetime, timedelta

# Creating datetime objects - Python's built-in datetime module

# Current date and time
now = datetime.now()
print(f"Current datetime: {now}")
print(f"Year: {now.year}, Month: {now.month}, Day: {now.day}")
print(f"Hour: {now.hour}, Minute: {now.minute}, Second: {now.second}")

# Create specific date
specific_date = datetime(2024, 1, 15, 9, 30, 0)  # year, month, day, hour, minute, second
print(f"\nSpecific datetime: {specific_date}")

# Date arithmetic
tomorrow = datetime.now() + timedelta(days=1)
last_week = datetime.now() - timedelta(weeks=1)
print(f"\nTomorrow: {tomorrow}")
print(f"Last week: {last_week}")

# Difference between dates
trading_start = datetime(2024, 1, 1)
trading_end = datetime(2024, 1, 15)
difference = trading_end - trading_start
print(f"\nDays between: {difference.days}")
print(f"Total seconds: {difference.total_seconds()}")

# Output (example; the current-time values depend on when the code runs):
# Current datetime: 2024-01-20 14:30:45.123456
# Year: 2024, Month: 1, Day: 20
# Hour: 14, Minute: 30, Second: 45
#
# Specific datetime: 2024-01-15 09:30:00
#
# Tomorrow: 2024-01-21 14:30:45.123456
# Last week: 2024-01-13 14:30:45.123456
#
# Days between: 14
# Total seconds: 1209600.0
```

**Explanation:**
- `datetime.now()` returns the current local date and time.
- `datetime(year, month, day, ...)` creates a specific datetime. Year, month, and day are required; time components default to 0.
- `timedelta` represents a duration. You can add/subtract timedelta from datetime objects.
- Subtracting two datetime objects returns a `timedelta` object, which exposes a `days` attribute and a `total_seconds()` method.
- These operations form the basis for calculating trading days, time-to-expiry, and other time-based metrics.
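
As one illustration of such a metric, the sketch below counts the days in a date range that fall on chosen weekdays, using only the standard library. The function name `count_weekdays` and the Monday-to-Friday default are assumptions made for this example; it ignores public holidays, and the `weekdays` argument should be set to match the actual trading calendar of the exchange.

```python
from datetime import date, timedelta

def count_weekdays(start, end, weekdays=(0, 1, 2, 3, 4)):
    """Count days in [start, end] whose weekday() is in `weekdays` (Monday=0)."""
    total_days = (end - start).days + 1  # +1 makes the end date inclusive
    return sum(
        1
        for offset in range(total_days)
        if (start + timedelta(days=offset)).weekday() in weekdays
    )

# 1 January 2024 is a Monday; 1-15 January contains 11 Monday-Friday days
print(count_weekdays(date(2024, 1, 1), date(2024, 1, 15)))  # 11
```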

---

```python
# pandas datetime parsing - converting strings to datetime

# Sample date strings (as they might appear in NEPSE CSV)
date_strings = ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19']

# Convert to datetime
dates = pd.to_datetime(date_strings)
print(f"Parsed dates:\n{dates}")
print(f"Type: {type(dates)}")  # DatetimeIndex

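# Different date formats
# pandas 2.0+ infers a single format from the first element, so a list that
# mixes formats needs format='mixed'; dayfirst=True reads '15/01/2024' as 15 January
various_formats = ['15/01/2024', 'Jan 15, 2024', '15-Jan-2024', '20240115']
parsed_formats = pd.to_datetime(various_formats, format='mixed', dayfirst=True)
print(f"\nParsed various formats:\n{parsed_formats}")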

# Specify format explicitly for non-standard formats
nepse_format = pd.to_datetime('15-01-2024', format='%d-%m-%Y')
print(f"\nCustom format parsed: {nepse_format}")

# Handle errors gracefully
problematic_dates = ['2024-01-15', 'invalid-date', '2024-01-17']
# This would raise an error:
# parsed = pd.to_datetime(problematic_dates)
# Use errors='coerce' to convert invalid to NaT (Not a Time)
parsed_safe = pd.to_datetime(problematic_dates, errors='coerce')
print(f"\nWith error handling:\n{parsed_safe}")

# Output:
# Parsed dates:
# DatetimeIndex(['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19'], dtype='datetime64[ns]', freq=None)
```

**Explanation:**
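- `pd.to_datetime()` converts strings to pandas datetime objects. It can infer many common formats automatically, but since pandas 2.0 it infers one format from the first element and applies it to the whole input, so inputs that mix formats need `format='mixed'` or an explicit `format`.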
- The result is a `DatetimeIndex`, which is a specialized Index type for datetime data.
- For non-standard formats, use the `format` parameter with strptime-style format codes:
  - `%d` = day, `%m` = month, `%Y` = 4-digit year, `%y` = 2-digit year
  - `%H` = hour (24h), `%I` = hour (12h), `%M` = minute, `%S` = second
- `errors='coerce'` converts invalid dates to `NaT` (Not a Time) instead of raising an error. This is essential when dealing with real-world data that may have typos or missing values.
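
Coercing bad values to `NaT` is only half the job; it usually helps to report which raw strings failed so they can be fixed at the source. A minimal sketch, assuming day-month-year strings and made-up sample values:

```python
import pandas as pd

raw = pd.Series(['15-01-2024', '16-01-2024', 'n/a', '31-02-2024'])

# Explicit format plus errors='coerce': unparseable entries become NaT
parsed = pd.to_datetime(raw, format='%d-%m-%Y', errors='coerce')

# Use the NaT positions to recover the offending raw strings
bad_rows = raw[parsed.isna()]
print(f"Failed to parse: {bad_rows.tolist()}")  # ['n/a', '31-02-2024']
print(f"Valid dates: {parsed.notna().sum()}")   # 2
```

Note that `31-02-2024` has the right shape but is not a real calendar date, so it is rejected as well.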

---

```python
# DatetimeIndex functionality for time-series

# Create a time-series DataFrame
dates = pd.date_range('2024-01-01', periods=10, freq='D')  # 10 daily dates
print(f"Date range:\n{dates}")

# Create sample stock data
ts_data = pd.DataFrame({
    'Close': [2850.50, 2875.25, 2890.00, 2865.75, 2880.50, 
              2905.00, 2895.00, 2910.00, 2900.00, 2920.00],
    'Volume': [125000, 150000, 175000, 140000, 160000,
               180000, 155000, 190000, 170000, 200000]
}, index=dates)

print(f"\nTime-series DataFrame:\n{ts_data}")

# Access datetime components
print(f"\nYear: {ts_data.index.year}")
print(f"Month: {ts_data.index.month}")
print(f"Day: {ts_data.index.day}")
print(f"Day of week: {ts_data.index.dayofweek}")  # Monday=0, Sunday=6
print(f"Day name: {ts_data.index.day_name()}")

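# Filter by date (.loc selects rows by label; both endpoints are included)
print(f"\nData for January 5-8:\n{ts_data.loc['2024-01-05':'2024-01-08']}")

# Filter by month (partial string indexing; requires .loc in pandas 2.0+)
print(f"\nAll January data:\n{ts_data.loc['2024-01']}")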

# Output:
# Date range:
# DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
#                '2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08',
#                '2024-01-09', '2024-01-10'],
#               dtype='datetime64[ns]', freq='D')
```

**Explanation:**
- `pd.date_range()` creates a sequence of dates. Parameters:
  - `start`: Starting date
  - `end`: Ending date (alternative to periods)
  - `periods`: Number of dates to generate
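  - `freq`: Frequency ('D'=daily, 'H'=hourly, 'B'=business days, 'W'=weekly, 'M'=month end). pandas 2.2 and later prefer 'h' for hourly and 'ME' for month end, and warn about the older spellings.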
- Using a DatetimeIndex enables powerful time-based operations.
- **Datetime components** are accessible via properties: `.year`, `.month`, `.day`, `.dayofweek`, `.day_name()`.
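- **Date-based slicing** works naturally with `.loc`: `ts_data.loc['2024-01-05':'2024-01-08']` gets all data between those dates, endpoints included.
- You can filter by just year-month: `ts_data.loc['2024-01']` gets all January 2024 data. Selecting rows with a bare `ts_data['2024-01']` was removed in pandas 2.0, where it is treated as a column lookup.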
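
Datetime components can also be used as filter conditions. The sketch below (with placeholder `Close` values) keeps only Monday-to-Friday rows via `dayofweek` and selects one week by label; the weekday cut-off is an assumption for illustration and should follow the real trading calendar.

```python
import pandas as pd

dates = pd.date_range('2024-01-01', periods=10, freq='D')  # 1 Jan 2024 is a Monday
ts = pd.DataFrame({'Close': range(10)}, index=dates)        # placeholder values

# dayofweek: Monday=0 ... Sunday=6, so < 5 keeps Monday through Friday
weekdays_only = ts[ts.index.dayofweek < 5]
print(len(weekdays_only))  # 8

# Label-based slice with .loc; both endpoints are included
first_week = ts.loc['2024-01-01':'2024-01-07']
print(len(first_week))     # 7
```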

---

```python
# Resampling - converting between time frequencies

# Create hourly data (simulating intraday trading)
hourly_dates = pd.date_range('2024-01-15 09:00', periods=24, freq='h')  # 'h' = hourly ('H' is deprecated since pandas 2.2)
hourly_prices = pd.DataFrame({
    'Price': [2850 + i * 2 + (i % 3 - 1) * 5 for i in range(24)],
    'Volume': [10000 + i * 1000 for i in range(24)]
}, index=hourly_dates)

print("Hourly data (first 5 rows):")
print(hourly_prices.head())

# Resample to daily - aggregate hourly to daily
daily = hourly_prices.resample('D').agg({
    'Price': 'last',  # Closing price is the last price of the day
    'Volume': 'sum'   # Total volume is sum of hourly volumes
})
print(f"\nDaily resampled:\n{daily}")

# Different aggregation methods
print(f"\nDaily OHLC:")
daily_ohlc = hourly_prices['Price'].resample('D').ohlc()
print(daily_ohlc)

# Weekly resampling
weekly = hourly_prices.resample('W').agg({
    'Price': 'last',
    'Volume': 'sum'
})
print(f"\nWeekly resampled:\n{weekly}")

# Upsampling - fill the newly created time slots
print(f"\nUpsampling to 30-min intervals (forward fill):")
upsampled = hourly_prices.resample('30min').ffill()  # '30min' = 30 minutes
print(upsampled.head(6))

# Output:
# Hourly data (first 5 rows):
#                      Price  Volume
# 2024-01-15 09:00:00   2845   10000
# 2024-01-15 10:00:00   2852   11000
# 2024-01-15 11:00:00   2859   12000
# 2024-01-15 12:00:00   2851   13000
# 2024-01-15 13:00:00   2858   14000
```

**Explanation:**
- **Resampling** is crucial for time-series analysis. It converts data between different time frequencies.
- `.resample('D')` groups data by day. Other frequencies include 'h' (hourly), 'W' (weekly), 'ME' (month end) and 'QE' (quarter end); pandas versions before 2.2 spell the last three 'H', 'M' and 'Q'.
- **Downsampling** (high to low frequency) requires aggregation:
  - `'last'`: Last value in the period (typical for closing prices)
  - `'sum'`: Sum of values (typical for volume)
  - `'mean'`: Average value
  - `'ohlc'`: Open, High, Low, Close (creates 4 columns)
- **Upsampling** (low to high frequency) requires filling:
  - `.ffill()`: Forward fill - propagate last known value
  - `.bfill()`: Backward fill - use next known value
  - `.interpolate()`: Linear interpolation between values
- Frequency strings: 'D'=day, 'h'=hour, 'min'=minute, 's'=second, 'W'=week, 'ME'=month end. The older aliases 'H', 'T', 'S' and 'M' are deprecated since pandas 2.2.
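
A small dataset makes it easy to verify resampling results by hand. The following sketch uses four made-up hourly observations (`ticks`) to downsample to two-hour bars and then compare forward fill with linear interpolation when upsampling to 30 minutes. Minute-based aliases are used because they are accepted across pandas versions.

```python
import pandas as pd

idx = pd.date_range('2024-01-15 10:00', periods=4, freq='60min')
ticks = pd.DataFrame(
    {'Price': [100.0, 104.0, 102.0, 106.0], 'Volume': [10, 20, 30, 40]},
    index=idx,
)

# Downsample: bins start at 10:00 and 12:00, each covering two observations
two_hourly = ticks.resample('120min').agg({'Price': 'last', 'Volume': 'sum'})
print(two_hourly)  # Price: 104.0, 106.0 / Volume: 30, 70

# Upsample: 10:00 to 13:00 at 30-minute spacing gives 7 rows
filled = ticks['Price'].resample('30min').ffill()               # 10:30 -> 100.0
interpolated = ticks['Price'].resample('30min').interpolate()   # 10:30 -> 102.0
print(filled.iloc[1], interpolated.iloc[1])  # 100.0 102.0
```

Forward fill never uses future information, which matters for prediction tasks; interpolation looks smoother but peeks at the next observed value.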

---

## **4.2 Data Structures for Time-Series**

Choosing the right data structure is essential for efficient time-series analysis. pandas provides several specialized structures, and understanding when to use each is important.

```python
import pandas as pd
import numpy as np

# Series vs DataFrame - when to use each

# Series: Single time-series (one variable over time)
# Use when tracking a single metric
nabil_close = pd.Series(
    [2850.50, 2875.25, 2890.00, 2865.75, 2880.50],
    index=pd.date_range('2024-01-15', periods=5, freq='D'),
    name='NABIL_Close'
)
print("Series (single time-series):")
print(nabil_close)
print(f"Shape: {nabil_close.shape}")  # 1D: (5,)

# DataFrame: Multiple time-series (multiple variables over time)
# Use when tracking multiple related metrics
nabil_data = pd.DataFrame({
    'Open': [2840.00, 2870.00, 2885.00, 2860.00, 2875.00],
    'High': [2890.00, 2895.00, 2920.00, 2900.00, 2915.00],
    'Low': [2835.00, 2865.00, 2880.00, 2850.00, 2870.00],
    'Close': [2850.50, 2875.25, 2890.00, 2865.75, 2880.50],
    'Volume': [125000, 150000, 175000, 140000, 160000]
}, index=pd.date_range('2024-01-15', periods=5, freq='D'))

print("\nDataFrame (multiple time-series):")
print(nabil_data)
print(f"Shape: {nabil_data.shape}")  # 2D: (5, 5)

# Panel (removed from current pandas) vs MultiIndex - for 3D data
# When you have multiple symbols over multiple time periods with multiple features

# MultiIndex approach for panel data (multiple stocks)
symbols = ['NABIL', 'NABIL', 'NABIL', 'NICA', 'NICA', 'NICA']
dates = pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-17'] * 2)

multi_index = pd.MultiIndex.from_arrays([symbols, dates], names=['Symbol', 'Date'])
panel_data = pd.DataFrame({
    'Close': [2850.50, 2875.25, 2890.00, 1180.00, 1195.50, 1210.00],
    'Volume': [125000, 150000, 175000, 85000, 92000, 98000]
}, index=multi_index)

print("\nMultiIndex DataFrame (multiple symbols):")
print(panel_data)

# Access specific symbol
print(f"\nNABIL data:\n{panel_data.loc['NABIL']}")

# Output:
# Series (single time-series):
# 2024-01-15    2850.50
# 2024-01-16    2875.25
# 2024-01-17    2890.00
# 2024-01-18    2865.75
# 2024-01-19    2880.50
# Freq: D, Name: NABIL_Close, dtype: float64
# Shape: (5,)
```

**Explanation:**
- **Series** is for 1D labeled data. Use it when you have a single metric over time (e.g., just closing prices). It's more memory-efficient for single variables.
- **DataFrame** is for 2D labeled data. Use it when you have multiple related metrics over time (e.g., OHLCV data). Each column is a Series.
- **MultiIndex** handles 3D-like data (e.g., multiple stocks over multiple time periods). The index has multiple levels (Symbol and Date in this example).
- `pd.MultiIndex.from_arrays()` creates a multi-level index from separate arrays. The `names` parameter labels each level.
- Accessing MultiIndex data: `panel_data.loc['NABIL']` selects all rows for NABIL.
- The shape differs: Series is `(n,)`, DataFrame is `(rows, cols)`, MultiIndex DataFrame is still 2D but with hierarchical row labels.
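
Selecting by the outer level with `.loc` is only one way into a MultiIndex. As a sketch on a reduced version of the data above, `.xs()` takes a cross-section on an inner level (all symbols on one date), and `.unstack()` pivots a level into columns to give one price column per symbol:

```python
import pandas as pd

symbols = ['NABIL', 'NABIL', 'NICA', 'NICA']
dates = pd.to_datetime(['2024-01-15', '2024-01-16'] * 2)
index = pd.MultiIndex.from_arrays([symbols, dates], names=['Symbol', 'Date'])
panel = pd.DataFrame({'Close': [2850.50, 2875.25, 1180.00, 1195.50]}, index=index)

# Cross-section: every symbol on a single date (result is indexed by Symbol)
one_day = panel.xs(pd.Timestamp('2024-01-15'), level='Date')
print(one_day)

# Pivot the Symbol level into columns: rows are dates, columns are symbols
wide = panel['Close'].unstack('Symbol')
print(wide)
```

The wide layout is often convenient for comparing or correlating several stocks over the same dates.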

---

```python
# Efficient data structures for large time-series

# Using appropriate dtypes to save memory
import pandas as pd

# Create sample data with default types
large_data = pd.DataFrame({
    'Symbol': ['NABIL'] * 100000,
    'Open': [2850.50 + i * 0.01 for i in range(100000)],
    'Volume': [100000 + i * 10 for i in range(100000)]
})

print("Default dtypes and memory usage:")
print(large_data.dtypes)
print(f"Memory usage: {large_data.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")

# Optimize dtypes
optimized_data = large_data.copy()
optimized_data['Symbol'] = optimized_data['Symbol'].astype('category')  # Categorical for repeated strings
optimized_data['Open'] = optimized_data['Open'].astype('float32')  # Smaller float
optimized_data['Volume'] = optimized_data['Volume'].astype('int32')  # Smaller int

print("\nOptimized dtypes and memory usage:")
print(optimized_data.dtypes)
print(f"Memory usage: {optimized_data.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")

# Memory savings
savings = (1 - optimized_data.memory_usage(deep=True).sum() / large_data.memory_usage(deep=True).sum()) * 100
print(f"\nMemory saved: {savings:.1f}%")

# Output (approximate; exact figures vary with Python and pandas versions):
# Default dtypes and memory usage:
# Symbol     object
# Open      float64
# Volume      int64
# dtype: object
# Memory usage: 7.44 MB
#
# Optimized dtypes and memory usage:
# Symbol    category
# Open       float32
# Volume       int32
# dtype: object
# Memory usage: 0.86 MB
#
# Memory saved: 88.5%
```

**Explanation:**
- **Memory optimization** is crucial for large time-series datasets. Default pandas types (object for strings, float64 for decimals, int64 for integers) can be excessive.
- `'category'` dtype is for columns with few unique values (like Symbol). It stores values as integers with a lookup table, dramatically reducing memory for repeated strings.
- `float32` is 32-bit instead of 64-bit, halving memory for floats. Precision is about 7 significant digits, so a price near 2850 is stored in steps of roughly 0.0002 (sufficient for most prices).
- `int32` is 32-bit instead of 64-bit, halving memory for integers. Range is about ±2 billion (sufficient for most volumes).
- `.memory_usage(deep=True)` calculates actual memory including object types. Without `deep=True`, it underestimates memory for string columns.
- In this example, we reduced memory usage by well over 80%, which is significant when working with millions of rows.
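
Choosing a smaller dtype by hand works for a few columns; for wider tables, `pd.to_numeric()` with its `downcast` argument can pick the smallest type that still holds the data. The helper below, `shrink_numeric`, is a hypothetical name for a minimal sketch of that idea, not a pandas function.

```python
import pandas as pd

def shrink_numeric(df):
    """Return a copy of df with numeric columns downcast to smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer type that can hold every value
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 is the smallest float type pandas will downcast to
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

sample = pd.DataFrame({'Open': [2850.50, 2875.25], 'Volume': [125000, 150000]})
small = shrink_numeric(sample)
print(small.dtypes)  # Open: float32, Volume: int32
```

Because the chosen type depends on the values present, a later batch with larger numbers may need a wider type; fixing dtypes explicitly is safer in production pipelines.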

---

## **4.3 Loading Time-Series Data**

Real-world time-series prediction systems need to load data from various sources. Let's explore different data loading methods.

### **4.3.1 From CSV Files**

CSV (Comma-Separated Values) is the most common format for time-series data. The NEPSE data in this handbook comes in CSV format.

```python
import pandas as pd

# Basic CSV loading
# Assuming we have a NEPSE data file named 'nepse_data.csv'

# First, let's create a sample CSV file for demonstration
sample_data = """S.No,Symbol,Conf.,Open,High,Low,Close,LTP,Close - LTP,Close - LTP %,VWAP,Vol,Prev. Close,Turnover,Trans.,Diff,Range,Diff %,Range %,VWAP %,52 Weeks High,52 Weeks Low
1,NABIL,A,2850.50,2890.00,2840.00,2875.25,2875.25,0.00,0.00,2872.50,125000,2845.00,359062500,500,30.25,50.00,1.06,1.76,0.97,3200.00,2400.00
2,NABIL,A,2875.25,2910.00,2860.00,2895.50,2895.50,0.00,0.00,2888.00,150000,2875.25,433200000,620,20.25,50.00,0.70,1.74,0.45,3200.00,2400.00
3,NABIL,A,2890.00,2920.00,2880.00,2900.00,2900.00,0.00,0.00,2898.50,175000,2895.50,507237500,750,4.50,40.00,0.16,1.38,0.15,3200.00,2400.00
4,NICA,A,1180.00,1195.00,1175.00,1190.50,1190.50,0.00,0.00,1186.00,85000,1178.00,100910000,320,12.50,20.00,1.06,1.70,0.38,1350.00,980.00
5,NICA,A,1190.50,1210.00,1185.00,1205.00,1205.00,0.00,0.00,1198.50,92000,1190.50,110262000,380,14.50,25.00,1.22,2.10,0.77,1350.00,980.00
6,SCBL,A,2450.00,2480.00,2440.00,2465.00,2465.00,0.00,0.00,2458.00,45000,2445.00,110610000,180,20.00,40.00,0.82,1.64,0.53,2700.00,2100.00"""

# Write to file
with open('nepse_data.csv', 'w') as f:
    f.write(sample_data)

# Now load the CSV
df = pd.read_csv('nepse_data.csv')

print("Loaded DataFrame:")
print(df)
print(f"\nShape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nData types:\n{df.dtypes}")

# Output:
# Loaded DataFrame:
#    S.No Symbol Conf.     Open     High      Low    Close     LTP  ...  52 Weeks Low
# 0     1  NABIL     A  2850.50  2890.00  2840.00  2875.25  2875.25  ...        2400.0
# 1     2  NABIL     A  2875.25  2910.00  2860.00  2895.50  2895.50  ...        2400.0
# ...
```

**Explanation:**
- `pd.read_csv()` is the primary function for loading CSV files. It returns a DataFrame.
- By default, the first row is used as the header, so column names are taken directly from it.
- `df.shape` shows the dimensions: (number of rows, number of columns).
- `df.dtypes` shows the inferred data types for each column. Pandas automatically infers types:
  - Numeric columns become `float64` or `int64`
  - String columns become `object`
- The ellipsis `...` in output indicates columns are truncated for display when there are many columns.

---

```python
# Advanced CSV loading options

# Specify which columns to load (saves memory for large files)
# Every name in usecols must exist in the file, otherwise read_csv raises ValueError.
# Our sample has no 'Date' column and names the volume column 'Vol', so this would fail:
# df_subset = pd.read_csv('nepse_data.csv', usecols=['Symbol', 'Date', 'Open', 'High', 'Low', 'Close', 'Volume'])

# Use column names that exist in the sample file:
df_subset = pd.read_csv('nepse_data.csv', usecols=['Symbol', 'Open', 'High', 'Low', 'Close', 'Vol'])
print("Subset of columns:")
print(df_subset.head())

# Specify data types for columns (avoids type inference overhead)
df_types = pd.read_csv('nepse_data.csv', dtype={
    'Symbol': 'category',
    'Vol': 'int32'
})
print(f"\nWith specified dtypes:\n{df_types.dtypes}")

# Handle missing values during loading
# na_values specifies additional strings to treat as NaN
df_na = pd.read_csv('nepse_data.csv', na_values=['N/A', 'NA', '-', '', 'null'])

# Parse dates during loading
# If we had a 'Date' column, we could parse it:
# df_dates = pd.read_csv('nepse_data.csv', parse_dates=['Date'])

# Set index during loading
# df_indexed = pd.read_csv('nepse_data.csv', index_col='S.No')

# Skip rows (useful if file has header information)
# df_skip = pd.read_csv('nepse_data.csv', skiprows=2)  # Skip first 2 rows

# Read only first N rows (useful for exploring large files)
df_preview = pd.read_csv('nepse_data.csv', nrows=3)
print(f"\nFirst 3 rows only:")
print(df_preview)

# Output:
# Subset of columns:
#   Symbol     Open     High      Low    Close     Vol
# 0  NABIL  2850.50  2890.00  2840.00  2875.25  125000
# 1  NABIL  2875.25  2910.00  2860.00  2895.50  150000
```

**Explanation:**
- `usecols` parameter loads only specified columns. This is crucial for memory efficiency when you have many columns but only need a few.
- `dtype` parameter forces specific data types during loading. This prevents pandas from guessing types and saves memory with appropriate types.
- `na_values` parameter specifies additional strings to treat as missing values (NaN). By default, pandas already recognizes strings such as 'NA', 'N/A', 'NULL', and the empty string, so only custom markers such as '-' need to be listed explicitly.
- `parse_dates` converts the specified columns to datetime while the file is read, so no separate conversion step is needed afterwards.
- `index_col` sets a column as the index during loading. For time-series, this is typically the date column.
- `skiprows` skips initial rows, useful when CSV files have metadata headers before the actual data.
- `nrows` limits rows loaded, useful for quickly previewing large files without loading everything.
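
To see several of these options working together without touching the disk, the sketch below reads a small in-memory CSV through `io.StringIO`. The sample rows, the two metadata lines, and the `-` placeholder for a missing value are all made up for illustration.

```python
import io

import pandas as pd

# Two metadata lines precede the real header; '-' marks a missing value
raw = """Exported from a hypothetical trading terminal
Generated: 2024-01-20
Date,Symbol,Close,Vol
2024-01-15,NABIL,2875.25,125000
2024-01-16,NABIL,-,150000
2024-01-17,NABIL,2900.00,175000
"""

df_opts = pd.read_csv(
    io.StringIO(raw),
    skiprows=2,                        # skip the two metadata lines
    usecols=["Date", "Close", "Vol"],  # drop the Symbol column
    na_values=["-"],                   # treat '-' as NaN in addition to the defaults
    parse_dates=["Date"],              # convert Date to datetime
    index_col="Date",                  # and use it as the index
    dtype={"Vol": "int32"},
)

print(df_opts)
print(df_opts.dtypes)
print(f"Missing Close values: {df_opts['Close'].isna().sum()}")
```

Because `Close` contains a missing value it is loaded as a float column, while `Vol` keeps the requested `int32` type.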

---

```python
# Loading CSV with proper datetime handling for NEPSE data

# Create a more realistic NEPSE CSV with dates
nepse_with_dates = """Date,Symbol,Open,High,Low,Close,Volume,Turnover
2024-01-15,NABIL,2850.50,2890.00,2840.00,2875.25,125000,359062500
2024-01-16,NABIL,2875.25,2910.00,2860.00,2895.50,150000,433200000
2024-01-17,NABIL,2890.00,2920.00,2880.00,2900.00,175000,507237500
2024-01-18,NABIL,2865.75,2900.00,2850.00,2880.50,140000,403630000
2024-01-19,NABIL,2880.50,2915.00,2870.00,2905.00,160000,465280000
2024-01-15,NICA,1180.00,1195.00,1175.00,1190.50,85000,100910000
2024-01-16,NICA,1190.50,1210.00,1185.00,1205.00,92000,110262000
2024-01-17,NICA,1205.00,1220.00,1200.00,1215.00,88000,106920000"""

with open('nepse_with_dates.csv', 'w') as f:
    f.write(nepse_with_dates)

# Load with date parsing and index setting
df_ts = pd.read_csv(
    'nepse_with_dates.csv',
    parse_dates=['Date'],  # Convert Date column to datetime
    index_col='Date',       # Set Date as index
    dtype={'Symbol': 'category', 'Volume': 'int32'}
)

print("Time-series DataFrame:")
print(df_ts)
print(f"\nIndex type: {type(df_ts.index)}")
print(f"Index dtype: {df_ts.index.dtype}")

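# Now we can do time-based operations
# The file lists NABIL first and then NICA, so the dates are not in order.
# Sort the index first: date-range slicing needs a sorted DatetimeIndex.
df_ts = df_ts.sort_index()
print(f"\nData for Jan 16-18:\n{df_ts['2024-01-16':'2024-01-18']}")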

# Select specific symbol
nabil_data = df_ts[df_ts['Symbol'] == 'NABIL']
print(f"\nNABIL data:\n{nabil_data}")

# Output:
# Time-series DataFrame:
#           Symbol     Open     High      Low    Close  Volume    Turnover
# Date
# 2024-01-15   NABIL  2850.50  2890.00  2840.00  2875.25  125000  359062500
# 2024-01-16   NABIL  2875.25  2910.00  2860.00  2895.50  150000  433200000
# ...
```

**Explanation:**
- `parse_dates=['Date']` converts the Date column to pandas datetime type during loading. This is essential for time-series operations.
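- `index_col='Date'` sets the Date column as the index. With a `DatetimeIndex`, you can use date-based slicing and resampling.
- Combining `parse_dates` and `index_col` in the `read_csv()` call gives you a ready-to-use time-series index in one step, with no separate conversion afterwards.
- With a sorted `DatetimeIndex`, you can slice with date strings: `df_ts['2024-01-16':'2024-01-18']` returns every row in that range, including both endpoints. If the index is not in chronological order (as happens when several symbols are stacked in one file), call `.sort_index()` first; slicing an unsorted index can raise a `KeyError` or return unexpected rows.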
- `dtype={'Symbol': 'category'}` optimizes the Symbol column since it has repeated values.
- After loading with proper date handling, the DataFrame is ready for time-series analysis.
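
When one file stacks several symbols, the dates restart for each symbol and the index is no longer chronological. The sketch below uses made-up rows in the same layout to show the check-then-sort step before slicing by date.

```python
import io

import pandas as pd

csv_text = """Date,Symbol,Close
2024-01-15,NABIL,2875.25
2024-01-16,NABIL,2895.50
2024-01-17,NABIL,2900.00
2024-01-15,NICA,1190.50
2024-01-16,NICA,1205.00
"""

prices = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")
was_sorted = prices.index.is_monotonic_increasing
print(f"Sorted on load: {was_sorted}")  # False: dates restart for NICA

prices = prices.sort_index()            # chronological order; duplicate dates are fine

# Date-range slice (both endpoints included), then narrow to one symbol
window = prices.loc["2024-01-16":"2024-01-17"]
nabil_window = window[window["Symbol"] == "NABIL"]
print(window)
print(nabil_window)
```

Sorting once after loading is usually cheaper than discovering an ordering problem later during resampling or rolling calculations.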

---

### **4.3.2 From Databases**

Many production systems store time-series data in databases. Let's explore loading data from SQL databases.

```python
import pandas as pd
import sqlite3
from sqlalchemy import create_engine

# Create a sample SQLite database for demonstration
# In production, you would connect to an existing database

# Create in-memory SQLite database
conn = sqlite3.connect(':memory:')  # Use ':memory:' for temporary in-memory database

# Create table and insert sample data
create_table_sql = """
CREATE TABLE stock_prices (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    symbol TEXT NOT NULL,
    trade_date DATE NOT NULL,
    open REAL,
    high REAL,
    low REAL,
    close REAL,
    volume INTEGER,
    turnover REAL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"""

conn.execute(create_table_sql)

# Insert sample NEPSE data
insert_sql = """
INSERT INTO stock_prices (symbol, trade_date, open, high, low, close, volume, turnover)
VALUES 
    ('NABIL', '2024-01-15', 2850.50, 2890.00, 2840.00, 2875.25, 125000, 359062500),
    ('NABIL', '2024-01-16', 2875.25, 2910.00, 2860.00, 2895.50, 150000, 433200000),
    ('NABIL', '2024-01-17', 2890.00, 2920.00, 2880.00, 2900.00, 175000, 507237500),
    ('NABIL', '2024-01-18', 2865.75, 2900.00, 2850.00, 2880.50, 140000, 403630000),
    ('NABIL', '2024-01-19', 2880.50, 2915.00, 2870.00, 2905.00, 160000, 465280000),
    ('NICA', '2024-01-15', 1180.00, 1195.00, 1175.00, 1190.50, 85000, 100910000),
    ('NICA', '2024-01-16', 1190.50, 1210.00, 1185.00, 1205.00, 92000, 110262000),
    ('NICA', '2024-01-17', 1205.00, 1220.00, 1200.00, 1215.00, 88000, 106920000);
"""

conn.execute(insert_sql)
conn.commit()  # Commit the transaction

# Verify data was inserted
result = conn.execute("SELECT COUNT(*) FROM stock_prices")
print(f"Total records in database: {result.fetchone()[0]}")

# Output:
# Total records in database: 8
```

**Explanation:**
- `sqlite3.connect()` creates a connection to a SQLite database. `:memory:` creates a temporary in-memory database that doesn't persist to disk.
- `conn.execute()` runs SQL commands. `CREATE TABLE` defines the schema with appropriate data types.
- `INSERT INTO` adds data to the table. Multiple rows can be inserted in a single statement.
- `conn.commit()` commits the transaction, making changes permanent. Without it, changes would be rolled back when the connection closes.
- In production, you would connect to an existing database file: `sqlite3.connect('nepse.db')` or use other database systems like PostgreSQL or MySQL.
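
When rows come from Python objects rather than a hand-written SQL string, `executemany()` with `?` placeholders is a common alternative to one long `INSERT`. The sketch below also uses the connection as a context manager, which commits if the block succeeds and rolls back if it raises. The table name `demo_prices` and the rows are made up for this example.

```python
import sqlite3

rows = [
    ("NABIL", "2024-01-15", 2875.25, 125000),
    ("NABIL", "2024-01-16", 2895.50, 150000),
    ("NICA", "2024-01-15", 1190.50, 85000),
]

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE demo_prices ("
    "symbol TEXT NOT NULL, trade_date TEXT NOT NULL, close REAL, volume INTEGER)"
)

# 'with demo_conn' commits on success and rolls back on an exception
with demo_conn:
    demo_conn.executemany(
        "INSERT INTO demo_prices (symbol, trade_date, close, volume) VALUES (?, ?, ?, ?)",
        rows,
    )

# A batch that fails part-way leaves the table unchanged
try:
    with demo_conn:
        demo_conn.execute(
            "INSERT INTO demo_prices VALUES (?, ?, ?, ?)",
            ("SCBL", "2024-01-15", 2465.00, 45000),
        )
        demo_conn.execute(
            "INSERT INTO demo_prices VALUES (?, ?, ?, ?)",
            (None, "2024-01-15", 0.0, 0),  # violates NOT NULL on symbol
        )
except sqlite3.IntegrityError as exc:
    print(f"Batch rolled back: {exc}")

row_count = demo_conn.execute("SELECT COUNT(*) FROM demo_prices").fetchone()[0]
print(f"Rows stored: {row_count}")  # 3: the SCBL row was rolled back with its batch
```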

---

```python
# Loading data from database using pandas

# Method 1: Using read_sql with raw SQL query
query = """
SELECT symbol, trade_date, open, high, low, close, volume 
FROM stock_prices 
WHERE symbol = 'NABIL'
ORDER BY trade_date
"""

df_sql = pd.read_sql(query, conn, parse_dates=['trade_date'])

print("Data loaded with SQL query:")
print(df_sql)
print(f"\nData types:\n{df_sql.dtypes}")

# Method 2: Using read_sql_query (same as read_sql for queries)
df_query = pd.read_sql_query(
    "SELECT * FROM stock_prices WHERE volume > 100000",
    conn,
    parse_dates=['trade_date'],
    index_col='trade_date'
)
print(f"\nHigh volume trades:\n{df_query}")

# Method 3: Using read_sql_table (loads entire table)
# Requires SQLAlchemy connection
engine = create_engine('sqlite:///:memory:')  # Create new engine for demo

# Copy data to the new engine's database
df_sample = pd.read_sql("SELECT * FROM stock_prices", conn)
df_sample.to_sql('stock_prices', engine, index=False, if_exists='replace')

df_table = pd.read_sql_table('stock_prices', engine, parse_dates=['trade_date'])
print(f"\nEntire table loaded:\n{df_table.head()}")

# Output:
# Data loaded with SQL query:
#   symbol trade_date     open     high      low    close  volume
# 0  NABIL 2024-01-15  2850.50  2890.00  2840.00  2875.25  125000
# 1  NABIL 2024-01-16  2875.25  2910.00  2860.00  2895.50  150000
```

**Explanation:**
- `pd.read_sql()` executes a SQL query and returns the result as a DataFrame. It's the most flexible method.
- The `parse_dates` parameter converts date columns to datetime type, just like with CSV loading.
- `pd.read_sql_query()` is similar to `read_sql` but explicitly for queries (not table names).
- `pd.read_sql_table()` loads an entire table by name. Requires SQLAlchemy engine.
- `index_col` sets a column as the DataFrame index, useful for time-series with date columns.
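- With a SQLAlchemy engine and the matching connection string, these functions work with any database SQLAlchemy supports (PostgreSQL, MySQL, Oracle, etc.). A raw DBAPI connection, as used with `conn` above, is supported only for SQLite.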
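
Since `read_sql_query()` accepts any `SELECT`, aggregation can be pushed into the database so that only the summary rows reach pandas. A minimal sketch, assuming a cut-down `stock_prices` table with made-up rows:

```python
import sqlite3

import pandas as pd

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE stock_prices (symbol TEXT, trade_date TEXT, close REAL, volume INTEGER)"
)
conn_demo.executemany(
    "INSERT INTO stock_prices VALUES (?, ?, ?, ?)",
    [
        ("NABIL", "2024-01-15", 2875.25, 125000),
        ("NABIL", "2024-01-16", 2895.50, 150000),
        ("NICA", "2024-01-15", 1190.50, 85000),
        ("NICA", "2024-01-16", 1205.00, 92000),
    ],
)
conn_demo.commit()

# Aggregate in SQL so only one row per symbol crosses into pandas
summary = pd.read_sql_query(
    """
    SELECT symbol,
           COUNT(*)    AS trading_days,
           AVG(close)  AS avg_close,
           SUM(volume) AS total_volume
    FROM stock_prices
    GROUP BY symbol
    ORDER BY symbol
    """,
    conn_demo,
    index_col="symbol",
)
print(summary)
```

For large tables this keeps memory use low, because the raw rows never leave the database.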

---

```python
# Parameterized queries for safe database access

# Using parameters prevents SQL injection attacks
# Never use string formatting for user input in SQL!

# BAD: Vulnerable to SQL injection
# symbol = "NABIL'; DROP TABLE stock_prices; --"
# bad_query = f"SELECT * FROM stock_prices WHERE symbol = '{symbol}'"

# GOOD: Use parameterized queries
symbol_param = 'NABIL'
start_date = '2024-01-15'
end_date = '2024-01-17'

# Using ? placeholders (SQLite style)
params_query = """
SELECT * FROM stock_prices 
WHERE symbol = ? AND trade_date BETWEEN ? AND ?
ORDER BY trade_date
"""

df_params = pd.read_sql(
    params_query, 
    conn, 
    params=(symbol_param, start_date, end_date),
    parse_dates=['trade_date']
)

print("Parameterized query result:")
print(df_params)

# Using named parameters with SQLAlchemy
from sqlalchemy import text

engine = create_engine('sqlite:///:memory:')
pd.read_sql("SELECT * FROM stock_prices", conn).to_sql('stock_prices', engine, index=False, if_exists='replace')

named_query = text("""
SELECT symbol, trade_date, close, volume 
FROM stock_prices 
WHERE symbol = :symbol AND volume > :min_volume
""")

with engine.connect() as connection:
    df_named = pd.read_sql(
        named_query, 
        connection, 
        params={'symbol': 'NABIL', 'min_volume': 100000}
    )

print(f"\nNamed parameter query:\n{df_named}")

# Output:
# Parameterized query result:
#    id symbol trade_date     open     high      low    close  volume    turnover
# 0   1  NABIL 2024-01-15  2850.50  2890.00  2840.00  2875.25  125000  359062500
# 1   2  NABIL 2024-01-16  2875.25  2910.00  2860.00  2895.50  150000  433200000
```

**Explanation:**
- **Parameterized queries** are essential for security. Never use string formatting (f-strings or %) to insert user input into SQL queries - this creates SQL injection vulnerabilities.
- Use `?` placeholders (SQLite style) or `:name` placeholders (named parameters) with SQLAlchemy.
- The `params` parameter in `read_sql()` accepts a tuple for positional parameters or a dictionary for named parameters.
- SQLAlchemy's `text()` wraps SQL with named parameter support.
- When building prediction systems with user inputs (like symbol selection), always use parameterized queries.
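
The effect of binding parameters can be checked directly. In the sketch below, the hypothetical helper `load_symbol` passes the injection string from the "BAD" example as a bound value; it is compared as plain text, matches no rows, and the table is left intact. The table contents are made up.

```python
import sqlite3

import pandas as pd

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stock_prices (symbol TEXT, trade_date TEXT, close REAL)")
db.executemany(
    "INSERT INTO stock_prices VALUES (?, ?, ?)",
    [("NABIL", "2024-01-15", 2875.25), ("NICA", "2024-01-15", 1190.50)],
)
db.commit()


def load_symbol(connection, symbol):
    """Return rows for one symbol; `symbol` is bound, never spliced into the SQL."""
    return pd.read_sql(
        "SELECT * FROM stock_prices WHERE symbol = ?",
        connection,
        params=(symbol,),
    )


malicious = "NABIL'; DROP TABLE stock_prices; --"

safe_rows = load_symbol(db, "NABIL")
attack_rows = load_symbol(db, malicious)  # treated as an ordinary string value

remaining = db.execute("SELECT COUNT(*) FROM stock_prices").fetchone()[0]
print(len(safe_rows), len(attack_rows), remaining)  # 1 0 2
```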

---

### **4.3.3 From APIs**

Many financial data providers offer APIs for accessing real-time and historical data. Let's explore API integration.

```python
import pandas as pd
import requests
import json
from datetime import datetime

# Understanding API basics
# APIs (Application Programming Interfaces) allow programs to communicate
# REST APIs use HTTP methods (GET, POST, etc.) to request data

# Sample API call structure (mock NEPSE API)
# In production, you would use actual NEPSE or other financial APIs

# Simulating an API response
mock_api_response = {
    "status": "success",
    "data": [
        {
            "symbol": "NABIL",
            "date": "2024-01-15",
            "open": 2850.50,
            "high": 2890.00,
            "low": 2840.00,
            "close": 2875.25,
            "volume": 125000,
            "turnover": 359062500
        },
        {
            "symbol": "NABIL",
            "date": "2024-01-16",
            "open": 2875.25,
            "high": 2910.00,
            "low": 2860.00,
            "close": 2895.50,
            "volume": 150000,
            "turnover": 433200000
        },
        {
            "symbol": "NABIL",
            "date": "2024-01-17",
            "open": 2890.00,
            "high": 2920.00,
            "low": 2880.00,
            "close": 2900.00,
            "volume": 175000,
            "turnover": 507237500
        }
    ],
    "metadata": {
        "total_records": 3,
        "page": 1,
        "per_page": 100
    }
}

# Parse JSON response into DataFrame
# APIs typically return JSON (JavaScript Object Notation)
print("API Response structure:")
print(json.dumps(mock_api_response, indent=2))

# Convert to DataFrame
df_api = pd.DataFrame(mock_api_response['data'])
print(f"\nDataFrame from API:\n{df_api}")

# Convert date column
df_api['date'] = pd.to_datetime(df_api['date'])
print(f"\nWith parsed dates:\n{df_api.dtypes}")

# Output:
# API Response structure:
# {
#   "status": "success",
#   "data": [
#     {
#       "symbol": "NABIL",
#       "date": "2024-01-15",
#       "open": 2850.5,
# ...
```

**Explanation:**
- Most web APIs return data as JSON (JavaScript Object Notation), which maps naturally onto Python dictionaries and lists.
- `json.dumps()` converts Python objects to JSON string. `indent=2` formats it for readability.
- `pd.DataFrame()` can directly convert a list of dictionaries to a DataFrame - each dictionary becomes a row.
- API responses typically have a structure with:
  - `status`: Indicates success/failure
  - `data`: The actual data (array of records)
  - `metadata`: Information about the response (pagination, counts)
- Date strings from APIs need to be parsed with `pd.to_datetime()` for time-series operations.
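
The parsing steps above can be wrapped in one function that also checks the `status` field before trusting the payload. This is a sketch only: `response_to_frame` is a hypothetical helper, and the field names simply follow the mock response in this section rather than any real provider's schema.

```python
import pandas as pd


def response_to_frame(payload: dict) -> pd.DataFrame:
    """Turn a {'status', 'data', ...} payload into a date-indexed DataFrame.

    Hypothetical helper; field names follow the mock response above.
    """
    if payload.get("status") != "success":
        raise ValueError(f"API reported failure: {payload.get('status')!r}")

    records = payload.get("data") or []
    if not records:
        return pd.DataFrame()  # nothing to parse

    frame = pd.DataFrame(records)
    frame["date"] = pd.to_datetime(frame["date"])
    return frame.set_index("date").sort_index()


sample_payload = {
    "status": "success",
    "data": [
        {"symbol": "NABIL", "date": "2024-01-16", "close": 2895.50},
        {"symbol": "NABIL", "date": "2024-01-15", "close": 2875.25},
    ],
}

frame = response_to_frame(sample_payload)
print(frame)
```

Sorting by date inside the helper means callers get a chronological index even when the API returns records in another order.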

---

```python
# Making actual HTTP requests to APIs

import requests
import pandas as pd

# Example function to fetch stock data from an API
def fetch_stock_data(symbol, start_date, end_date, api_key='demo'):
    """
    Fetch stock data from a financial API.
    
    Parameters:
    -----------
    symbol : str
        Stock symbol (e.g., 'NABIL')
    start_date : str
        Start date in 'YYYY-MM-DD' format
    end_date : str
        End date in 'YYYY-MM-DD' format
    api_key : str
        API key for authentication
    
    Returns:
    --------
    pandas.DataFrame
        DataFrame with stock data
    """
    
    # Build API URL (example structure - actual URL depends on provider)
    # Common financial APIs: Alpha Vantage, Yahoo Finance, IEX Cloud
    base_url = "https://api.example.com/stock/data"  # Placeholder
    
    # Request parameters
    params = {
        'symbol': symbol,
        'start': start_date,
        'end': end_date,
        'apikey': api_key
    }
    
    # Make GET request
    # response = requests.get(base_url, params=params)
    
    # Check if request was successful
    # if response.status_code == 200:
    #     data = response.json()
    #     df = pd.DataFrame(data['data'])
    #     df['date'] = pd.to_datetime(df['date'])
    #     return df
    # else:
    #     raise Exception(f"API request failed: {response.status_code}")
    
    # For demonstration, return mock data
    mock_data = [
        {'date': '2024-01-15', 'open': 2850.50, 'high': 2890.00, 
         'low': 2840.00, 'close': 2875.25, 'volume': 125000},
        {'date': '2024-01-16', 'open': 2875.25, 'high': 2910.00, 
         'low': 2860.00, 'close': 2895.50, 'volume': 150000},
    ]
    df = pd.DataFrame(mock_data)
    df['date'] = pd.to_datetime(df['date'])
    return df

# Use the function
try:
    df_fetched = fetch_stock_data('NABIL', '2024-01-15', '2024-01-16')
    print("Fetched data:")
    print(df_fetched)
except Exception as e:
    print(f"Error: {e}")

# Output:
# Fetched data:
#         date     open     high      low    close  volume
# 0 2024-01-15  2850.50  2890.00  2840.00  2875.25  125000
# 1 2024-01-16  2875.25  2910.00  2860.00  2895.50  150000
```

**Explanation:**
- `requests.get()` makes an HTTP GET request to the specified URL with parameters.
- `params` dictionary is automatically converted to URL query parameters (?symbol=NABIL&start=...).
- `response.status_code` indicates success (200) or failure (4xx for client errors, 5xx for server errors).
- `response.json()` parses the JSON response body into Python dictionaries/lists.
- **API keys** are typically required for authentication. Never hardcode API keys in source code - use environment variables or configuration files.
- Always wrap API calls in try-except blocks to handle network errors, rate limits, and other failures gracefully.
- The function demonstrates the standard pattern for API data fetching: build request, make request, check response, parse data.
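
One way to keep the key out of source code is to read it from an environment variable. The sketch below uses only the standard library and sends no request; the variable name `STOCK_API_KEY` and the URL are placeholders, and `urlencode` is used to show the query string that a `params` dictionary expands to.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://api.example.com/stock/data"  # placeholder, as in the function above


def get_api_key(env_var: str = "STOCK_API_KEY") -> str:
    """Read the API key from the environment (the variable name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API")
    return key


def build_url(symbol: str, start: str, end: str, api_key: str) -> str:
    """Show the URL that a params dictionary expands to."""
    params = {"symbol": symbol, "start": start, "end": end, "apikey": api_key}
    return f"{BASE_URL}?{urlencode(params)}"


# For the demo only: normally the variable is set in the shell, not in code
os.environ.setdefault("STOCK_API_KEY", "demo")

url = build_url("NABIL", "2024-01-15", "2024-01-16", get_api_key())
print(url)
```

With `requests`, passing the same dictionary as `params=` produces an equivalent query string, and a `timeout=` argument keeps a stalled connection from hanging the program.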

---

```python
# Handling API pagination and rate limits

import time
import requests
import pandas as pd

def fetch_all_stock_data(symbol, start_date, end_date, api_key='demo'):
    """
    Fetch all stock data with pagination handling.
    
    Many APIs limit records per request and require pagination.
    """
    
    all_data = []
    page = 1
    per_page = 100
    
    while True:
        params = {
            'symbol': symbol,
            'start': start_date,
            'end': end_date,
            'page': page,
            'per_page': per_page,
            'apikey': api_key
        }
        
        # response = requests.get('https://api.example.com/stock/data', params=params)
        # data = response.json()
        
        # Simulate API response
        mock_data = {
            'data': [
                {'date': f'2024-01-{15+i}', 'close': 2850.0 + i*10, 'volume': 100000 + i*1000}
                for i in range(min(per_page, 10))  # Simulate 10 records
            ] if page == 1 else [],
            'metadata': {'total_pages': 1}
        }
        data = mock_data
        
        # Add data to collection
        if data['data']:
            all_data.extend(data['data'])
        
        # Check if more pages exist
        if page >= data['metadata'].get('total_pages', 1):
            break
        
        page += 1
        
        # Rate limiting - wait between requests
        time.sleep(0.1)  # Wait 100ms between requests
        
        # Some APIs provide rate limit info in headers
        # remaining = response.headers.get('X-RateLimit-Remaining')
        # if remaining and int(remaining) < 10:
        #     time.sleep(60)  # Wait if running low on rate limit
    
    df = pd.DataFrame(all_data)
    if 'date' in df.columns:
        df['date'] = pd.to_datetime(df['date'])
    
    return df

# Fetch data with pagination
df_paginated = fetch_all_stock_data('NABIL', '2024-01-01', '2024-01-31')
print(f"Total records fetched: {len(df_paginated)}")
print(df_paginated.head())

# Output:
# Total records fetched: 10
#         date   close  volume
# 0 2024-01-15  2850.0  100000
# 1 2024-01-16  2860.0  101000
```

**Explanation:**
- **Pagination** is necessary when APIs limit records per request. You make multiple requests, incrementing the page number.
- Loop until no more data is returned or you've fetched all pages.
- `all_data.extend()` adds records from each page to a single list.
- **Rate limiting** prevents overwhelming the API server. APIs often have limits like 100 requests per minute.
- `time.sleep()` pauses between requests. Adjust based on API's rate limits.
- Some APIs include rate limit information in response headers (`X-RateLimit-Remaining`, `X-RateLimit-Reset`).
- Best practice: Check rate limits and pause before hitting them, rather than waiting for errors.
- This pattern is essential for fetching large historical datasets.
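
The pagination loop becomes easier to test when the HTTP call is passed in as a function. The sketch below is one possible arrangement, not a prescribed design: `fetch_all_pages` and `fake_fetch_page` are hypothetical names, and the response shape (`data` plus `metadata.total_pages`) mirrors the mock response used above.

```python
import time
from typing import Callable, Dict, List


def fetch_all_pages(
    fetch_page: Callable[[int], Dict],
    pause_seconds: float = 0.1,
    max_pages: int = 1000,
) -> List[Dict]:
    """Collect records from every page returned by `fetch_page(page_number)`.

    Assumes each response looks like {'data': [...], 'metadata': {'total_pages': N}}.
    """
    records: List[Dict] = []
    for page in range(1, max_pages + 1):  # hard cap guards against endless loops
        response = fetch_page(page)
        records.extend(response.get("data", []))
        total_pages = response.get("metadata", {}).get("total_pages", 1)
        if page >= total_pages:
            break
        time.sleep(pause_seconds)         # pause between requests
    return records


def fake_fetch_page(page: int) -> Dict:
    """Stand-in for a real HTTP call: 3 pages with 2 records each."""
    start = (page - 1) * 2
    data = [{"id": start + i, "close": 2850.0 + 10 * (start + i)} for i in range(2)]
    return {"data": data, "metadata": {"total_pages": 3}}


all_records = fetch_all_pages(fake_fetch_page, pause_seconds=0.0)
print(len(all_records))                # 6
print([r["id"] for r in all_records])  # [0, 1, 2, 3, 4, 5]
```

In real use, `fetch_page` would wrap the `requests.get()` call, and `max_pages` protects against a provider that reports an incorrect page count.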

---

### **4.3.4 From Other Sources**

Time-series data can come from various other sources including Excel files, Parquet files, and real-time streams.

```python
import pandas as pd
import numpy as np

# Loading from Excel files
# Excel is common for manually maintained data

# Create sample Excel file
df_sample = pd.DataFrame({
    'Date': pd.date_range('2024-01-15', periods=5, freq='D'),
    'Symbol': ['NABIL'] * 5,
    'Open': [2850.50, 2875.25, 2890.00, 2865.75, 2880.50],
    'High': [2890.00, 2910.00, 2920.00, 2900.00, 2915.00],
    'Low': [2840.00, 2860.00, 2880.00, 2850.00, 2870.00],
    'Close': [2875.25, 2895.50, 2900.00, 2880.50, 2905.00]
})

# Save to Excel
df_sample.to_excel('nepse_data.xlsx', sheet_name='Stock Prices', index=False)

# Load from Excel
df_excel = pd.read_excel('nepse_data.xlsx', sheet_name='Stock Prices')
print("Data from Excel:")
print(df_excel)
print(f"\nData types:\n{df_excel.dtypes}")

# Loading specific sheet and range
df_range = pd.read_excel('nepse_data.xlsx', sheet_name='Stock Prices', usecols='A:F', skiprows=0)
print(f"\nSpecific range:\n{df_range.head()}")

# Output:
# Data from Excel:
#        Date Symbol     Open     High      Low    Close
# 0 2024-01-15  NABIL  2850.50  2890.00  2840.00  2875.25
# 1 2024-01-16  NABIL  2875.25  2910.00  2860.00  2895.50
```

**Explanation:**
- `pd.read_excel()` reads Excel files. It relies on an optional engine library: `openpyxl` for `.xlsx` files and `xlrd` for legacy `.xls` files.
- `sheet_name` specifies which sheet to read. Can be sheet name (string), index (integer), or None for all sheets.
- `usecols` specifies columns to read. Can be column letters ('A:F'), names, or indices.
- `to_excel()` writes DataFrame to Excel. `index=False` excludes the index from the output.
- Excel files are useful for smaller datasets or when data is manually maintained, but they're slower and larger than CSV or Parquet for large datasets.

---

```python
# Loading from Parquet files - efficient for large datasets

import pyarrow.parquet as pq

# Parquet is a columnar storage format optimized for analytics
# It's much faster and smaller than CSV for large files

# Save to Parquet
df_sample.to_parquet('nepse_data.parquet', engine='pyarrow')

# Load from Parquet
df_parquet = pd.read_parquet('nepse_data.parquet')
print("Data from Parquet:")
print(df_parquet)
print(f"\nData types preserved:\n{df_parquet.dtypes}")

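# Compare file sizes for the same DataFrame
# (on a tiny table, Parquet's metadata can make it the larger file)
import os
df_sample.to_csv('nepse_sample.csv', index=False)
csv_size = os.path.getsize('nepse_sample.csv')
parquet_size = os.path.getsize('nepse_data.parquet')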

print(f"\nFile comparison:")
print(f"CSV size: {csv_size} bytes")
print(f"Parquet size: {parquet_size} bytes")

# Parquet benefits for time-series:
# 1. Preserves data types (no need to re-parse dates)
# 2. Columnar storage - faster queries on specific columns
# 3. Compression - smaller file sizes
# 4. Predicate pushdown - can skip reading irrelevant data

# Output:
# Data from Parquet:
#        Date Symbol     Open     High      Low    Close
# 0 2024-01-15  NABIL  2850.50  2890.00  2840.00  2875.25
# 1 2024-01-16  NABIL  2875.25  2910.00  2860.00  2895.50
```

**Explanation:**
- **Parquet** is a columnar storage format designed for big data. It's ideal for time-series data in production.
- `to_parquet()` writes DataFrame to Parquet. `engine='pyarrow'` selects the PyArrow library as the backend; `fastparquet` is the other engine pandas supports.
- `read_parquet()` loads Parquet files. Data types are preserved automatically - no need for `parse_dates`.
- **Advantages over CSV**:
  - Data types preserved (datetime stays datetime, not string)
  - Columnar format: reading only needed columns is much faster
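  - Built-in compression: files are often several times smaller than the equivalent CSV, although the ratio depends on the data and the codec, and very small tables can come out larger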
  - Predicate pushdown: filtering happens during file read, not after
- Use Parquet for: production systems, large datasets, archival storage.
- Use CSV for: data exchange, small datasets, human readability.

---

## **4.4 Data Types and Conversions**

Understanding and managing data types is crucial for time-series analysis. Incorrect types can cause errors or produce incorrect results.

```python
import pandas as pd
import numpy as np

# Create sample NEPSE data with various types
data = {
    'Symbol': ['NABIL', 'NABIL', 'NICA', 'NICA'],
    'Date': ['2024-01-15', '2024-01-16', '2024-01-15', '2024-01-16'],
    'Open': [2850.50, 2875.25, 1180.00, 1190.50],
    'High': [2890.00, 2910.00, 1195.00, 1210.00],
    'Low': [2840.00, 2860.00, 1175.00, 1185.00],
    'Close': [2875.25, 2895.50, 1190.50, 1205.00],
    'Volume': ['125,000', '150,000', '85,000', '92,000'],  # String with commas
    'Change_Pct': ['1.06%', '0.70%', '1.06%', '1.22%'],  # String with %
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print(f"\nOriginal data types:\n{df.dtypes}")

# Output:
# Original DataFrame:
#   Symbol        Date     Open     High      Low    Close   Volume Change_Pct
# 0  NABIL  2024-01-15  2850.50  2890.00  2840.00  2875.25  125,000      1.06%
# 1  NABIL  2024-01-16  2875.25  2910.00  2860.00  2895.50  150,000      0.70%
# ...
#
# Original data types:
# Symbol        object
# Date          object
# Open         float64
# High         float64
# Low          float64
# Close        float64
# Volume        object
# Change_Pct    object
```

**Explanation:**
- When data is loaded, pandas infers types. Strings become `object` type.
- In this example, `Volume` and `Change_Pct` are strings because they contain commas and percentage signs.
- `object` type prevents numerical operations - you can't calculate the mean of strings.
- Before analysis, we need to convert these to appropriate numeric types.
- The goal is to have:
  - `Symbol`: categorical (limited unique values)
  - `Date`: datetime
  - `Open`, `High`, `Low`, `Close`: float
  - `Volume`: int
  - `Change_Pct`: float (as decimal, e.g., 0.0106)

---

```python
# Converting string data to proper types

# Convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'])
print(f"After date conversion:\n{df['Date'].dtype}")  # datetime64[ns]

# Convert Volume (remove commas, convert to int)
# Method 1: String replacement
df['Volume'] = df['Volume'].str.replace(',', '').astype(int)
print(f"\nVolume after conversion:\n{df['Volume']}")

# Convert Change_Pct (remove %, divide by 100)
# Method 1: String replacement
df['Change_Pct_Numeric'] = df['Change_Pct'].str.replace('%', '').astype(float) / 100
print(f"\nChange % as decimal:\n{df['Change_Pct_Numeric']}")

# Method 2: Using extract with regex for more complex patterns
# Example: extract numeric value from strings like "+1.06%" or "-0.50%"
df['Change_With_Sign'] = ['+1.06%', '-0.50%', '+1.06%', '+1.22%']
# Extract numeric part (including sign and decimal)
df['Change_Signed'] = df['Change_With_Sign'].str.extract(r'([+-]?\d+\.?\d*)')[0].astype(float) / 100
print(f"\nSigned change extraction:\n{df['Change_Signed']}")

# Convert Symbol to category (more efficient for repeated strings)
df['Symbol'] = df['Symbol'].astype('category')
print(f"\nSymbol dtype: {df['Symbol'].dtype}")

print(f"\nFinal data types:\n{df.dtypes}")

# Output:
# After date conversion:
# datetime64[ns]
#
# Volume after conversion:
# 0    125000
# 1    150000
# ...
```

**Explanation:**
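- `pd.to_datetime()` converts strings to datetime. When no `format` is given, it infers the format from the data; passing an explicit `format` (for example `'%Y-%m-%d'`) is faster and avoids ambiguous day/month parsing.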
- `.str.replace()` applies string replacement to each element in a string column. This is vectorized (fast for large datasets).
- `.astype()` converts a column to a specified type. Common conversions: `int`, `float`, `str`, `category`.
- For percentage strings: remove the `%`, convert to float, then divide by 100 to get the decimal value.
- `.str.extract()` uses regular expressions to extract patterns. `([+-]?\d+\.?\d*)` matches optional sign, digits, optional decimal, more digits.
- `'category'` type stores strings as integer codes, saving memory for columns with few unique values (like stock symbols).

---

```python
# Handling numeric conversion errors

# Sample data with problematic values
problematic_data = {
    'Price': ['2850.50', '2875.25', 'N/A', '2890.00', 'unknown', '2865.75'],
    'Volume': ['125000', '150000', '', '175000', 'N/A', '140000']
}
df_problem = pd.DataFrame(problematic_data)
print("Data with problematic values:")
print(df_problem)

# Using to_numeric with error handling
# errors='coerce' converts invalid values to NaN
df_problem['Price_Clean'] = pd.to_numeric(df_problem['Price'], errors='coerce')
print(f"\nPrice after to_numeric with coerce:\n{df_problem['Price_Clean']}")

# Check how many values became NaN
print(f"\nNaN count: {df_problem['Price_Clean'].isna().sum()}")

# Alternative: errors='raise' (default) raises exception on error
# Alternative: errors='ignore' returns the input unchanged if conversion fails
# (deprecated since pandas 2.2; catch the exception explicitly instead)

# Convert multiple columns at once
df_problem[['Price_Clean', 'Volume_Clean']] = df_problem[['Price', 'Volume']].apply(
    pd.to_numeric, errors='coerce'
)
print(f"\nMultiple columns converted:\n{df_problem}")

# Fill NaN values after conversion
df_problem['Price_Filled'] = df_problem['Price_Clean'].fillna(df_problem['Price_Clean'].mean())
print(f"\nWith NaN filled by mean:\n{df_problem['Price_Filled']}")

# Output:
# Data with problematic values:
#      Price  Volume
# 0  2850.50  125000
# 1  2875.25  150000
# 2      N/A        
# 3  2890.00  175000
# 4  unknown     N/A
# 5  2865.75  140000
#
# Price after to_numeric with coerce:
# 0    2850.50
# 1    2875.25
# 2        NaN
# 3    2890.00
# 4        NaN
# 5    2865.75
```

**Explanation:**
- `pd.to_numeric()` specifically converts to numeric types, with more control than `.astype()`.
- `errors='coerce'` is crucial for real-world data: invalid values become NaN instead of causing errors.
- `errors='raise'` (default) throws an exception if any value can't be converted.
- `errors='ignore'` returns the original column unchanged if conversion fails. It is rarely useful and is deprecated since pandas 2.2.
- `.apply()` applies a function to each column (or row with axis=1). Here, we apply `to_numeric` to multiple columns.
- After conversion, NaN values need handling. `.fillna()` replaces NaN with specified values.
- Common fill strategies: mean, median, forward fill, backward fill, or drop rows with NaN.
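
Because `errors='coerce'` replaces bad values silently, it can help to check which raw strings failed to convert before choosing a fill strategy. The following is a minimal sketch; the `raw` Series and the variable names are hypothetical and not part of the example above.

```python
import pandas as pd

# Hypothetical raw column mixing valid numbers with placeholder text
raw = pd.Series(['2850.50', '2875.25', 'N/A', '2890.00', 'unknown', ''])

converted = pd.to_numeric(raw, errors='coerce')

# Values that were present in the raw data but became NaN during conversion
failed_mask = converted.isna() & raw.notna()
failed_values = raw[failed_mask]

print("Values that could not be converted:")
print(failed_values.value_counts())
```

If the list of failed values contains something unexpected (for example a price written with a thousands separator), it is usually better to clean that pattern first than to let it become NaN.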

---

```python
# Type inference and optimization

# Create a larger dataset for demonstration
np.random.seed(42)
large_df = pd.DataFrame({
    'Symbol': np.random.choice(['NABIL', 'NICA', 'SCBL', 'ADBL'], 100000),
    'Date': pd.date_range('2024-01-01', periods=100000, freq='h'),  # 'h' = hourly; the uppercase 'H' alias is deprecated in pandas 2.2+
    'Price': np.random.uniform(1000, 3000, 100000),
    'Volume': np.random.randint(10000, 500000, 100000),
    'Flag': np.random.choice([0, 1], 100000)
})

print("Default data types and memory:")
print(large_df.dtypes)
print(f"Memory usage: {large_df.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")

# Optimize types
optimized_df = large_df.copy()
optimized_df['Symbol'] = optimized_df['Symbol'].astype('category')
optimized_df['Price'] = optimized_df['Price'].astype('float32')
optimized_df['Volume'] = optimized_df['Volume'].astype('int32')
optimized_df['Flag'] = optimized_df['Flag'].astype('int8')  # For small integers

print("\nOptimized data types and memory:")
print(optimized_df.dtypes)
print(f"Memory usage: {optimized_df.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")

# Verify the value range is preserved (float32 still rounds each price slightly)
print(f"\nPrice range preserved: {optimized_df['Price'].min():.2f} to {optimized_df['Price'].max():.2f}")

# Automatic type inference with infer_objects()
# Useful after operations that convert types
df_after_op = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
df_after_op['A'] = df_after_op['A'].astype(object)  # Convert to object
print(f"\nBefore infer_objects: {df_after_op['A'].dtype}")
df_after_op = df_after_op.infer_objects()
print(f"After infer_objects: {df_after_op['A'].dtype}")

# Output:
# Default data types and memory:
# Symbol            object
# Date       datetime64[ns]
# Price            float64
# Volume             int64
# Flag               int64
# dtype: object
# Memory usage: 5.34 MB
#
# Optimized data types and memory:
# Symbol          category
# Date       datetime64[ns]
# Price           float32
# Volume            int32
# Flag               int8
# dtype: object
# Memory usage: 1.14 MB
```

**Explanation:**
- **Type optimization** can dramatically reduce memory usage (80% reduction in this example).
- `category` type is ideal for strings with few unique values (like stock symbols). Instead of storing each string, it stores integer codes with a lookup table.
- `float32` uses half the memory of `float64`. Precision is about 7 decimal digits, which is sufficient for stock prices.
- `int32` supports values up to about ±2 billion (sufficient for volume). `int64` supports much larger values but uses twice the memory.
- `int8` or `int16` are perfect for small integers (like binary flags 0/1, or small integers like ratings 1-5).
- `.memory_usage(deep=True)` shows actual memory usage, including overhead for object types. Without `deep=True`, it underestimates memory for string columns.
- `.infer_objects()` automatically infers better types after operations that may have converted them to `object`.
- **Best practice**: Apply type optimization immediately after loading data, before any analysis.
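
To judge whether `float32` is precise enough for a given column, one option is to measure the rounding error directly. The sketch below uses hypothetical random prices in the same range as the example. It also shows `pd.to_numeric(..., downcast='integer')`, which picks the smallest integer type that holds the data.

```python
import numpy as np
import pandas as pd

# Hypothetical prices in the same range as the example above
rng = np.random.default_rng(42)
prices = pd.Series(rng.uniform(1000, 3000, 10_000))

prices_32 = prices.astype('float32')

# Largest absolute difference introduced by the narrower type
max_error = (prices - prices_32.astype('float64')).abs().max()
print(f"Max rounding error after float32 conversion: {max_error:.6f}")

# pd.to_numeric can choose the smallest safe integer type automatically
volumes = pd.Series([125000, 150000, 175000])
volumes_small = pd.to_numeric(volumes, downcast='integer')
print(f"Downcast integer dtype: {volumes_small.dtype}")
```

For prices of this size the rounding error is far below one paisa. Errors can still accumulate in long running sums, so it may be safer to keep `float64` for columns that feed cumulative calculations.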

---

## **4.5 Handling Missing Values**

Missing values are common in time-series data due to holidays, trading halts, data collection errors, or system failures. Proper handling is essential for accurate predictions.

### **4.5.1 Identification Strategies**

Before handling missing values, we need to identify where they exist and understand their patterns.

```python
import pandas as pd
import numpy as np

# Create sample NEPSE data with missing values
dates = pd.date_range('2024-01-15', periods=10, freq='D')

nepse_data = pd.DataFrame({
    'Date': dates,
    'Symbol': 'NABIL',
    'Open': [2850.50, np.nan, 2890.00, 2865.75, np.nan, 2880.50, 2905.00, np.nan, 2900.00, 2920.00],
    'High': [2890.00, 2910.00, np.nan, 2900.00, 2915.00, 2920.00, np.nan, 2925.00, 2930.00, 2940.00],
    'Low': [2840.00, 2860.00, 2880.00, np.nan, 2870.00, 2875.00, 2890.00, 2900.00, np.nan, 2910.00],
    'Close': [2875.25, 2895.50, 2900.00, 2880.50, 2905.00, np.nan, 2910.00, 2915.00, 2925.00, 2935.00],
    'Volume': [125000, 150000, np.nan, 140000, 160000, 180000, 155000, np.nan, 170000, 200000]
})

nepse_data.set_index('Date', inplace=True)
print("NEPSE data with missing values:")
print(nepse_data)

# Output:
# NEPSE data with missing values:
#            Symbol     Open     High      Low    Close   Volume
# Date                                                         
# 2024-01-15   NABIL  2850.50  2890.00  2840.00  2875.25  125000.0
# 2024-01-16   NABIL      NaN  2910.00  2860.00  2895.50  150000.0
# 2024-01-17   NABIL  2890.00      NaN  2880.00  2900.00       NaN
# 2024-01-18   NABIL  2865.75  2900.00      NaN  2880.50  140000.0
# 2024-01-19   NABIL      NaN  2915.00  2870.00  2905.00  160000.0
# ...
```

**Explanation:**
- Missing values are represented as `NaN` (Not a Number) in pandas. They come from `np.nan` or appear when data is unavailable.
- In time-series, missing values can occur for various reasons:
  - **Non-trading days**: Weekends and holidays when markets are closed
  - **Data collection failures**: System errors during data capture
  - **Trading halts**: Stocks suspended from trading
  - **Incomplete records**: Partial data from data providers
- The sample data has scattered missing values across different columns, simulating real-world scenarios.
- Setting the Date as index enables time-series operations like resampling and time-based slicing.

---

```python
# Identifying missing values - basic methods

# isna() returns True for each missing value
print("Boolean mask of missing values:")
print(nepse_data.isna())

# Count missing values per column
print("\nMissing values per column:")
print(nepse_data.isna().sum())

# Percentage of missing values
print("\nPercentage missing per column:")
print((nepse_data.isna().sum() / len(nepse_data) * 100).round(2))

# Total missing values in dataset
print(f"\nTotal missing values: {nepse_data.isna().sum().sum()}")

# Count non-missing values
print(f"Total non-missing values: {nepse_data.notna().sum().sum()}")

# Output:
# Boolean mask of missing values:
#             Symbol   Open   High    Low  Close  Volume
# Date                                                   
# 2024-01-15    False  False  False  False  False   False
# 2024-01-16    False   True  False  False  False   False
# 2024-01-17    False  False   True  False  False    True
# ...
#
# Missing values per column:
# Symbol    0
# Open      2
# High      2
# Low       2
# Close     1
# Volume    2
# dtype: int64
#
# Percentage missing per column:
# Symbol     0.00
# Open      20.00
# High      20.00
# Low       20.00
# Close     10.00
# Volume    20.00
```

**Explanation:**
- `.isna()` (or `.isnull()`) returns a boolean DataFrame where `True` indicates missing values.
- `.isna().sum()` counts missing values in each column. The sum treats `True` as 1 and `False` as 0.
- Percentage calculation: `(missing_count / total_rows) * 100` gives the missing rate.
- `.isna().sum().sum()` counts total missing values across the entire DataFrame.
- `.notna()` (or `.notnull()`) is the inverse of `.isna()`, returning `True` for non-missing values.
- These metrics help prioritize which columns need attention and assess overall data quality.

---

```python
# Identifying patterns in missing data

# Find rows with any missing values
rows_with_missing = nepse_data[nepse_data.isna().any(axis=1)]
print("Rows with at least one missing value:")
print(rows_with_missing)

# Find rows where all values are missing
rows_all_missing = nepse_data[nepse_data.isna().all(axis=1)]
print(f"\nRows where all values are missing: {len(rows_all_missing)}")

# Find rows with specific number of missing values
missing_count_per_row = nepse_data.isna().sum(axis=1)
print("\nMissing values per row:")
print(missing_count_per_row)

# Rows with more than 2 missing values
severely_missing = nepse_data[missing_count_per_row > 2]
print(f"\nRows with >2 missing values: {len(severely_missing)}")

# Visualize missing data pattern
print("\nMissing data pattern (1 = missing, 0 = present):")
missing_pattern = nepse_data.isna().astype(int)
print(missing_pattern)

# Check for patterns - are missing values correlated?
print("\nMissing value correlation:")
# Which columns tend to be missing together?
print(missing_pattern.corr())

# Output:
# Rows with at least one missing value:
#            Symbol     Open     High      Low    Close   Volume
# Date                                                         
# 2024-01-16   NABIL      NaN  2910.00  2860.00  2895.50  150000.0
# 2024-01-17   NABIL  2890.00      NaN  2880.00  2900.00       NaN
# 2024-01-18   NABIL  2865.75  2900.00      NaN  2880.50  140000.0
# ...
```

**Explanation:**
- `.isna().any(axis=1)` checks if any value is missing in each row (axis=1 means row-wise).
- `.isna().all(axis=1)` checks if all values are missing in a row.
- `.isna().sum(axis=1)` counts missing values in each row, helping identify severely affected rows.
- Converting to integers (`.astype(int)`) creates a binary pattern matrix useful for visualization.
- **Missing value correlation** reveals if certain columns tend to be missing together:
  - Positive correlation means columns are often missing together
  - Negative correlation means when one is missing, the other tends to be present
  - Understanding these patterns helps diagnose data collection issues
- This analysis guides the choice of imputation strategy.
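
Besides counting missing values, it is often useful to know how long the gaps are, since isolated gaps and multi-day gaps usually call for different treatment. The helper below is a sketch; `longest_missing_run` and the `demo` data are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def longest_missing_run(series):
    """Return the length of the longest run of consecutive NaN values."""
    is_na = series.isna()
    # A new run starts wherever the missing/present state changes
    run_id = is_na.ne(is_na.shift(fill_value=False)).cumsum()
    # Summing the boolean mask within each run gives that run's NaN count
    run_lengths = is_na.groupby(run_id).sum()
    return int(run_lengths.max())

# Hypothetical columns: one with isolated gaps, one with a three-day gap
demo = pd.DataFrame({
    'Close': [2875.25, np.nan, 2900.00, np.nan, 2905.00, 2910.00],
    'Volume': [125000, np.nan, np.nan, np.nan, 160000, 180000],
})

print(demo.apply(longest_missing_run))
```

A column whose longest run is 1 is a reasonable candidate for a simple fill, while a column with long runs may need a limit or separate handling.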

---

### **4.5.2 Imputation Methods**

Imputation fills in missing values with estimated or derived values. The choice depends on the nature of the data and missing pattern.

```python
# Method 1: Forward Fill (carry forward last known value)
# Useful for time-series where values don't change rapidly

df_ffill = nepse_data.copy()

# Forward fill all numeric columns
numeric_cols = ['Open', 'High', 'Low', 'Close', 'Volume']
df_ffill[numeric_cols] = df_ffill[numeric_cols].ffill()

print("After forward fill:")
print(df_ffill)

# Explanation of forward fill:
# Day 1 (Jan 15): Open = 2850.50 (original)
# Day 2 (Jan 16): Open = NaN → filled with 2850.50 (from Jan 15)
# Day 3 (Jan 17): Open = 2890.00 (original)
# Day 4 (Jan 18): Open = 2865.75 (original)
# Day 5 (Jan 19): Open = NaN → filled with 2865.75 (from Jan 18)

# Forward fill is appropriate for:
# - Stock prices (prices don't change dramatically overnight without reason)
# - Sensor readings (physical quantities have continuity)
# - Any data with temporal continuity

# Output:
# After forward fill:
#            Symbol     Open     High      Low    Close   Volume
# Date                                                         
# 2024-01-15   NABIL  2850.50  2890.00  2840.00  2875.25  125000.0
# 2024-01-16   NABIL  2850.50  2910.00  2860.00  2895.50  150000.0
# 2024-01-17   NABIL  2890.00  2910.00  2880.00  2900.00  150000.0
# ...
```

**Explanation:**
- **Forward fill (ffill)** propagates the last valid observation forward to fill missing values.
- In time-series, this makes sense because the most recent observation is often a reasonable estimate for the next missing one.
- The logic: if Open is missing on Jan 16, use Jan 15's Open as an approximation.
- `ffill()` replaces the older `fillna(method='ffill')` spelling, which is deprecated since pandas 2.1.
- **When to use forward fill**:
  - Data with temporal continuity (prices, temperatures, sensor readings)
  - Short gaps (missing one or two consecutive values)
  - When the most recent value is the best estimate
- **Limitations**:
  - Not suitable for data with sudden jumps or discontinuities
  - Can propagate old values across long gaps
  - Assumes stability between observations

---

```python
# Method 2: Backward Fill (carry backward next known value)

df_bfill = nepse_data.copy()
df_bfill[numeric_cols] = df_bfill[numeric_cols].bfill()

print("After backward fill:")
print(df_bfill)

# Backward fill is useful when:
# - You have future data available (not for real-time prediction!)
# - Missing values at the beginning of the series
# - When next known value is more relevant than previous

# Combine forward and backward fill for complete coverage
df_combined = nepse_data.copy()
df_combined[numeric_cols] = df_combined[numeric_cols].ffill().bfill()
# Forward fill first, then backward fill any remaining NaN at the start

print("\nAfter combined forward-backward fill:")
print(df_combined)

# Output:
# After backward fill:
#            Symbol     Open     High      Low    Close   Volume
# Date                                                         
# 2024-01-15   NABIL  2850.50  2890.00  2840.00  2875.25  125000.0
# 2024-01-16   NABIL  2890.00  2910.00  2860.00  2895.50  150000.0
# 2024-01-17   NABIL  2890.00  2900.00  2880.00  2900.00  140000.0
# ...
```

**Explanation:**
- **Backward fill (bfill)** propagates the next valid observation backward to fill missing values.
- The logic: if Open is missing on Jan 16, use Jan 17's Open as an approximation.
- **Important caveat**: Backward fill uses future data! This causes **data leakage** in prediction systems:
  - If you're predicting tomorrow's price, you can't use tomorrow's price to fill today's missing value
  - Only use backward fill for historical analysis, not for training prediction models
- **Combined approach** (`ffill().bfill()`):
  - Forward fill handles most missing values
  - Backward fill handles missing values at the start of the series (where forward fill can't work)
- In production prediction systems, always consider what data would be available at prediction time.
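
The leakage problem can be made concrete by comparing what each method produces with and without later data. In this sketch, with a hypothetical four-day series, the forward-filled value for Jan 16 is the same whether or not later days are known, while the backward-filled value only exists once Jan 17 has been observed.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2024-01-15', periods=4, freq='D')
full = pd.Series([2850.0, np.nan, 2890.0, 2865.0], index=dates)

# What was known at the end of Jan 16 (later days had not happened yet)
known_on_jan16 = full.iloc[:2]

# Forward fill: same answer for Jan 16 with or without later data
ffill_then = known_on_jan16.ffill().iloc[1]
ffill_later = full.ffill().iloc[1]

# Backward fill: no answer at the time, an answer only after Jan 17 arrives
bfill_then = known_on_jan16.bfill().iloc[1]
bfill_later = full.bfill().iloc[1]

print(f"ffill at the time: {ffill_then}, with later data: {ffill_later}")
print(f"bfill at the time: {bfill_then}, with later data: {bfill_later}")
```

A fill method is safe for training data only if its result at each date does not change when later observations are added.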

---

```python
# Method 3: Fill with constant value

df_constant = nepse_data.copy()

# Fill with specific value
df_constant['Volume'] = df_constant['Volume'].fillna(0)
print("Volume filled with 0:")
print(df_constant['Volume'])

# Fill with mean value
mean_close = df_constant['Close'].mean()
df_constant['Close'] = df_constant['Close'].fillna(mean_close)
print(f"\nClose filled with mean ({mean_close:.2f}):")
print(df_constant['Close'])

# Fill with median (more robust to outliers)
median_open = df_constant['Open'].median()
df_constant['Open'] = df_constant['Open'].fillna(median_open)
print(f"\nOpen filled with median ({median_open:.2f}):")
print(df_constant['Open'])

# Fill different columns with different values
df_multi_fill = nepse_data.copy()
fill_values = {
    'Open': nepse_data['Open'].mean(),
    'High': nepse_data['High'].max(),  # Conservative estimate for High
    'Low': nepse_data['Low'].min(),    # Conservative estimate for Low
    'Close': nepse_data['Close'].mean(),
    'Volume': 0  # Unknown volume = no trading
}
df_multi_fill = df_multi_fill.fillna(fill_values)
print("\nMultiple columns filled with different values:")
print(df_multi_fill)

# Output:
# Volume filled with 0:
# Date
# 2024-01-15    125000.0
# 2024-01-16    150000.0
# 2024-01-17         0.0
# ...
```

**Explanation:**
- **Constant value fill** replaces missing values with a fixed value, which can be:
  - Zero: For volume/count data where missing means "none"
  - Mean: For normally distributed data
  - Median: For skewed data (more robust to outliers)
  - Domain-specific value: Based on business knowledge
- **For NEPSE data**:
  - Volume = 0 makes sense if missing means no trading occurred
  - Mean/median for prices provides a neutral estimate
  - Filling High with max and Low with min are conservative estimates
- `.fillna()` accepts a dictionary to specify different fill values for different columns.
- **When to use constant fill**:
  - When missing has a specific meaning (e.g., Volume = 0 means no trades)
  - For categorical columns (fill with "Unknown" or most frequent category)
  - When you want to flag imputed values (e.g., fill with -999 to mark them)

---

```python
# Method 4: Linear Interpolation

df_interpolate = nepse_data.copy()

# Linear interpolation (default)
df_interpolate['Close'] = df_interpolate['Close'].interpolate(method='linear')
print("Close with linear interpolation:")
print(df_interpolate['Close'])

# Interpolation methods comparison
# Quadratic and cubic require SciPy and enough known points
# (at least 3 for quadratic, 4 for cubic), so the sample keeps four known values
sample_data = pd.Series([2850, 2870, np.nan, np.nan, 2900, 2910])

print("\nInterpolation methods comparison:")
print(f"Original: {sample_data.values}")
print(f"Linear: {sample_data.interpolate(method='linear').values}")
# Evenly spaces the two missing values between 2870 and 2900

print(f"Quadratic: {sample_data.interpolate(method='quadratic').values}")
# Uses quadratic curve (smoother but can overshoot)

print(f"Cubic: {sample_data.interpolate(method='cubic').values}")
# Uses cubic spline (smoothest but most complex)

# Time-based interpolation (respects datetime index)
df_time = nepse_data.copy()
df_time['Close'] = df_time['Close'].interpolate(method='time')
print("\nTime-based interpolation:")
print(df_time['Close'])

# For time-series with datetime index, 'time' method is recommended
# It accounts for irregular time intervals between observations

# Output:
# Close with linear interpolation:
# Date
# 2024-01-15    2875.25
# 2024-01-16    2895.50
# 2024-01-17    2900.00
# 2024-01-18    2880.50
# 2024-01-19    2905.00
# 2024-01-20    2907.50  # Interpolated between 2905 and 2910
```

**Explanation:**
- **Interpolation** estimates missing values by fitting a curve through known points.
- **Linear interpolation** draws a straight line between adjacent known values and estimates the missing point on that line.
- Example: If Close is 100 on Day 1 and 110 on Day 5, missing Day 3 would be estimated as 105.
- **Interpolation methods**:
  - `linear`: Straight line between points (most common)
  - `quadratic`: Parabolic curve (smoother but can overshoot)
  - `cubic`: Spline interpolation (smoothest, most complex)
  - `time`: Linear interpolation weighted by time intervals (best for irregular time-series)
- **For time-series data**:
  - Use `method='time'` when you have a DatetimeIndex
  - It properly handles irregular intervals (weekends, holidays)
  - Linear interpolation is usually sufficient for stock prices
- **Limitations**:
  - Assumes smooth change between points
  - Not suitable for data with sudden jumps
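  - Needs a known value on each side of the gap: leading NaN values stay unfilled, and by default pandas repeats the last valid value for trailing NaN rather than extrapolating
  - Uses the next known value, so like backward fill it can leak future information into training data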
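
The difference between `method='linear'` and `method='time'` only shows up when observations are unevenly spaced. The sketch below uses a hypothetical three-point series with a one-day step followed by a three-day step.

```python
import numpy as np
import pandas as pd

# Hypothetical series: Jan 1 and Jan 5 are known, Jan 2 is missing
idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05'])
s = pd.Series([100.0, np.nan, 104.0], index=idx)

# 'linear' ignores the index and treats rows as equally spaced
by_position = s.interpolate(method='linear')

# 'time' weights the estimate by the elapsed time between observations
by_time = s.interpolate(method='time')

print(f"linear: {by_position.iloc[1]}")  # midpoint of 100 and 104
print(f"time:   {by_time.iloc[1]}")      # one quarter of the way from 100 to 104
```

Jan 2 lies one day into a four-day span, so the time-based estimate is 100 + (1/4) × 4 = 101, whereas the positional estimate is the midpoint 102.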

---

### **4.5.3 Forward/Backward Fill**

We covered forward and backward fill in the previous section, but let's explore more advanced applications.

```python
# Advanced forward/backward fill with limits

df_limited = nepse_data.copy()

# Limit the number of consecutive fills
# Only fill up to 2 consecutive missing values
df_limited['Close'] = df_limited['Close'].ffill(limit=2)
print("Forward fill with limit=2:")
print(df_limited['Close'])

# Why use limits?
# - Prevents filling long gaps with outdated values
# - If data is missing for a week, using last week's price may be wrong
# - Better to keep NaN for long gaps and handle them differently

# Create scenario with long gap
long_gap_data = pd.DataFrame({
    'Close': [2850, np.nan, np.nan, np.nan, np.nan, np.nan, 2900, 2910, 2920]
})
print("\nData with long gap:")
print(long_gap_data['Close'])

# Without limit - fills all gaps
print("\nWithout limit:")
print(long_gap_data['Close'].ffill())

# With limit - only fills limited consecutive gaps
print("\nWith limit=2:")
print(long_gap_data['Close'].ffill(limit=2))

# Forward fill with limit, then backward fill remaining
df_hybrid = long_gap_data.copy()
df_hybrid['Close'] = df_hybrid['Close'].ffill(limit=2).bfill(limit=2)
print("\nHybrid approach (ffill limit=2, then bfill limit=2):")
print(df_hybrid['Close'])

# Output:
# Forward fill with limit=2:
# Date
# 2024-01-15    2875.25
# 2024-01-16    2895.50
# 2024-01-17    2900.00
# ...
```

**Explanation:**
- The `limit` parameter in `ffill()` and `bfill()` restricts how many consecutive missing values can be filled.
- **Why use limits**:
  - Long gaps should be treated differently than short gaps
  - Filling a week-long gap with last week's price could be misleading
  - Missing data might indicate a significant event (trading halt, delisting)
- **Hybrid approach** (ffill + bfill with limits):
  - Fills short gaps completely (with `limit=2` on each side, gaps of up to four values)
  - Fills only the edges of longer gaps and leaves the middle as NaN for special handling
- This is crucial for prediction systems:
  - A 1-day gap might be a data error → fill it
  - A 30-day gap might be a trading halt → treat separately or exclude
- **Best practice**: Use limits, and then analyze remaining NaN values to understand why they couldn't be filled.
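
Note that `ffill(limit=2)` still fills the first two values of a longer gap. If the goal is to leave long gaps completely untouched, the gap length has to be checked first. The helper below is a sketch; `ffill_short_gaps` and `max_gap` are hypothetical names, not pandas API.

```python
import numpy as np
import pandas as pd

def ffill_short_gaps(series, max_gap):
    """Forward fill only gaps whose total length is at most max_gap."""
    is_na = series.isna()
    # Label each run of consecutive missing or present values
    run_id = is_na.ne(is_na.shift(fill_value=False)).cumsum()
    # Length of the NaN run each position belongs to (0 for observed values)
    gap_len = is_na.groupby(run_id).transform('sum')
    keep = ~is_na | (gap_len <= max_gap)
    # Fill everything, then blank out positions inside gaps that are too long
    return series.ffill().where(keep)

close = pd.Series([2850, np.nan, np.nan, np.nan, np.nan, np.nan,
                   2900, np.nan, 2920], dtype='float64')

with_limit = close.ffill(limit=2)
short_only = ffill_short_gaps(close, max_gap=2)

print("ffill(limit=2):  ", with_limit.tolist())
print("short gaps only: ", short_only.tolist())
```

With this approach the five-value gap stays entirely NaN, while the single missing value near the end is filled.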

---

### **4.5.4 Advanced Techniques**

For complex missing data patterns, more sophisticated approaches are needed.

```python
# Advanced imputation using rolling statistics

df_rolling = nepse_data.copy()

# Fill with rolling mean (average of nearby values)
# This provides a local estimate rather than global mean
window_size = 3

# Calculate rolling mean and use it for filling
rolling_mean = df_rolling['Close'].rolling(window=window_size, min_periods=1).mean()
df_rolling['Close_Rolling_Fill'] = df_rolling['Close'].fillna(rolling_mean)
print("Rolling mean fill:")
print(df_rolling[['Close', 'Close_Rolling_Fill']])

# The rolling mean uses nearby values, making it more context-aware
# Example: If surrounding values are ~2900, the fill will be ~2900
# Not the global mean which might be 2850 or 2950

# Fill with expanding mean (cumulative average)
df_expanding = nepse_data.copy()
expanding_mean = df_expanding['Close'].expanding().mean()
df_expanding['Close_Expanding_Fill'] = df_expanding['Close'].fillna(expanding_mean)
print("\nExpanding mean fill:")
print(df_expanding[['Close', 'Close_Expanding_Fill']])

# Output:
# Rolling mean fill:
#               Close  Close_Rolling_Fill
# Date                                   
# 2024-01-15  2875.25             2875.25
# 2024-01-16  2895.50             2895.50
# 2024-01-17  2900.00             2900.00
# 2024-01-18  2880.50             2880.50
# 2024-01-19  2905.00             2905.00
# 2024-01-20      NaN             2892.75  # Mean of the valid values in the trailing window (Jan 18 and 19)
```

**Explanation:**
- **Rolling mean fill** uses the average of nearby values (window) rather than the global mean.
- A window of 3 averages the current row and the two rows before it. With `min_periods=1`, NaN values inside the window are skipped. Future rows enter the window only if you pass `center=True`.
- **Advantages**:
  - Context-aware: Estimates are based on local patterns
  - Captures local trends: If prices are rising, fill will reflect the rise
  - More accurate than global mean for non-stationary data
- **Expanding mean** uses all historical values up to that point.
- **For time-series prediction**:
  - Rolling mean is better for capturing local behavior
  - Be careful not to use future values when filling training data
  - Use `min_periods=1` to allow calculation even with few values
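
Whether a rolling fill looks into the future depends on how the window is aligned. The sketch below, using a hypothetical five-value series, compares the default trailing window with a centered one.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, np.nan, 40.0, 50.0])

# Trailing window: current row plus the two rows before it
trailing = s.rolling(window=3, min_periods=1).mean()

# Centered window: one row before, the current row, one row after
centered = s.rolling(window=3, min_periods=1, center=True).mean()

fill_trailing = s.fillna(trailing)
fill_centered = s.fillna(centered)

print(f"Trailing fill at position 2: {fill_trailing.iloc[2]}")  # mean of 10 and 20
print(f"Centered fill at position 2: {fill_centered.iloc[2]}")  # mean of 20 and 40
```

The centered estimate is closer to the surrounding values but uses the observation that comes after the gap, so the trailing form is the safer default for training data.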

---

```python
# KNN Imputation - using similar patterns to fill missing values

from sklearn.impute import KNNImputer

# Prepare data for KNN imputation
# KNN works best with multiple correlated features
df_knn = nepse_data.copy()

# Select numeric columns for imputation
numeric_cols = ['Open', 'High', 'Low', 'Close', 'Volume']
knn_data = df_knn[numeric_cols].values

# Initialize KNN imputer
# n_neighbors: number of similar rows to consider
# weights: 'uniform' or 'distance' (closer neighbors have more influence)
imputer = KNNImputer(n_neighbors=3, weights='distance')

# Perform imputation
imputed_data = imputer.fit_transform(knn_data)

# Create DataFrame with imputed values
df_knn_imputed = pd.DataFrame(imputed_data, columns=numeric_cols, index=df_knn.index)
print("KNN imputed data:")
print(df_knn_imputed)

# Compare original and imputed
print("\nComparison for a row with missing values:")
print(f"Original: {df_knn.loc['2024-01-16', numeric_cols]}")
print(f"Imputed: {df_knn_imputed.loc['2024-01-16']}")

# KNN imputation advantages:
# 1. Uses correlations between features (Open, High, Low are related)
# 2. Finds similar days based on all features
# 3. Distance weighting gives more influence to more similar days

# Output:
# KNN imputed data:
#                Open     High      Low    Close    Volume
# Date                                                     
# 2024-01-15  2850.50  2890.00  2840.00  2875.25  125000.0
# 2024-01-16  2876.42  2910.00  2860.00  2895.50  150000.0
```

**Explanation:**
- **KNN (K-Nearest Neighbors) Imputation** fills missing values using similar rows.
- **How it works**:
  1. Calculate distances between rows using non-missing features
  2. Find the k most similar rows (nearest neighbors)
  3. Average the values from those neighbors (weighted by distance)
- **Parameters**:
  - `n_neighbors=3`: Use 3 most similar days to estimate missing values
  - `weights='distance'`: Closer neighbors have more influence
- **Advantages for time-series**:
  - Uses correlations between features (if Close is 2900, Open is likely close)
  - Captures multi-dimensional patterns
  - Works well when features are correlated
- **Limitations**:
  - Computationally expensive for large datasets
  - Sensitive to feature scaling (normalize features first)
  - May not respect temporal order
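
Because Volume is roughly a hundred times larger than the prices, it dominates the distance calculation unless the features are scaled first. The sketch below shows one way to do that, assuming scikit-learn's `StandardScaler` (which ignores NaN when fitting and keeps it when transforming). The small `df` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical data: prices in the thousands, volume in the hundred-thousands
df = pd.DataFrame({
    'Open':   [2850.0, np.nan, 2890.0, 2865.0, 2880.0, 2905.0],
    'Close':  [2875.0, 2895.0, 2900.0, 2880.0, 2905.0, 2910.0],
    'Volume': [125000.0, 150000.0, 140000.0, np.nan, 160000.0, 180000.0],
})

# Scale each column to zero mean and unit variance; NaN values pass through
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Impute in the scaled space so every feature contributes comparably
imputer = KNNImputer(n_neighbors=2, weights='distance')
imputed_scaled = imputer.fit_transform(scaled)

# Convert back to the original units
imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=df.columns,
    index=df.index,
)
print(imputed.round(2))
```

In a prediction pipeline the scaler and imputer would normally be fitted on the training period only and then applied to later data.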

---

```python
# Iterative Imputation (MICE - Multiple Imputation by Chained Equations)

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df_iterative = nepse_data.copy()
numeric_data = df_iterative[numeric_cols].values

# Initialize iterative imputer
# Uses regression to predict missing values based on other features
iterative_imputer = IterativeImputer(
    max_iter=10,  # Number of iterations
    random_state=42,  # For reproducibility
    min_value=0  # Ensure no negative prices/volumes
)

# Perform imputation
imputed_iterative = iterative_imputer.fit_transform(numeric_data)

df_iterative_imputed = pd.DataFrame(imputed_iterative, columns=numeric_cols, index=df_iterative.index)
print("Iterative imputed data:")
print(df_iterative_imputed)

# Iterative imputation process:
# 1. Fill missing values with initial estimates (mean)
# 2. For each feature with missing values:
#    a. Treat it as target, other features as predictors
#    b. Train regression model on known values
#    c. Predict missing values
# 3. Repeat until convergence

# This is more sophisticated than KNN and can capture complex relationships

# Compare methods
print("\nComparison of imputation methods:")
comparison_df = pd.DataFrame({
    'Original': nepse_data['Open'],
    'Forward_Fill': nepse_data['Open'].ffill(),
    'Mean_Fill': nepse_data['Open'].fillna(nepse_data['Open'].mean()),
    'Interpolate': nepse_data['Open'].interpolate(),
    'KNN': df_knn_imputed['Open'],
    'Iterative': df_iterative_imputed['Open']
})
print(comparison_df)

# Output:
# Iterative imputed data:
#                Open     High      Low    Close    Volume
# Date                                                     
# 2024-01-15  2850.50  2890.00  2840.00  2875.25  125000.0
# 2024-01-16  2872.38  2910.00  2860.00  2895.50  150000.0
```

**Explanation:**
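- **Iterative Imputation (MICE)** is a sophisticated approach that models each feature as a function of others. scikit-learn's `IterativeImputer` is inspired by MICE but returns a single imputed dataset by default.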
- **The algorithm**:
  1. Initialize: Fill all missing values with simple estimates (mean)
  2. Iterate: For each feature with originally missing values:
     - Set those values back to "missing"
     - Train a regression model using other features
     - Predict the missing values
  3. Repeat until estimates stabilize or `max_iter` rounds have run
- **Advantages**:
  - Captures complex relationships between features
  - Uses all available information
  - Often produces more accurate estimates than simple methods when features are strongly related
- **Parameters**:
  - `max_iter`: Number of iterations (10 is usually sufficient)
  - `random_state`: For reproducibility
  - `min_value/max_value`: Constraints on valid values
- **When to use**:
  - When features are correlated (Open, High, Low, Close are strongly correlated)
  - When simple methods are insufficient
  - For final production imputation (more accurate but slower)
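
Which method is "more accurate" depends on the data, so it is worth measuring. A common approach is to hide some known values, impute them, and compare against the truth. The sketch below uses a hypothetical series with a steady trend, which naturally favors interpolation; real data may rank the methods differently.

```python
import numpy as np
import pandas as pd

# Hypothetical complete series with a steady upward trend
truth = pd.Series(2850.0 + 5.0 * np.arange(20))

# Hide a few known values to simulate missing data
hidden_idx = [3, 7, 12]
masked = truth.copy()
masked.iloc[hidden_idx] = np.nan

candidates = {
    'ffill': masked.ffill(),
    'interpolate': masked.interpolate(method='linear'),
    'mean': masked.fillna(masked.mean()),
}

# Mean absolute error, measured on the hidden positions only
mae = {
    name: float((filled.iloc[hidden_idx] - truth.iloc[hidden_idx]).abs().mean())
    for name, filled in candidates.items()
}

for name, error in mae.items():
    print(f"{name:12s} MAE = {error:.2f}")
```

The same loop can include KNN or iterative imputation as further candidates, which makes it possible to check whether their extra cost pays off on a given dataset.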

---

## **4.6 Handling Outliers**

Outliers are extreme values that deviate significantly from other observations. They can distort predictions and must be handled appropriately.

### **4.6.1 Detection Methods**

```python
import pandas as pd
import numpy as np

# Create sample NEPSE data with outliers
dates = pd.date_range('2024-01-15', periods=20, freq='D')

np.random.seed(42)
base_prices = np.linspace(2850, 2950, 20) + np.random.normal(0, 10, 20)

# Add outliers
base_prices[5] = 3500   # Unusually high price (potential error)
base_prices[12] = 2100  # Unusually low price (potential error)
base_prices[18] = 5000  # Extreme high (likely error)

nepse_outliers = pd.DataFrame({
    'Date': dates,
    'Close': base_prices,
    'Volume': np.random.randint(100000, 200000, 20)
})
nepse_outliers.loc[19, 'Volume'] = 2000000  # Outlier in volume (index is still 0..19 here)

nepse_outliers.set_index('Date', inplace=True)
print("Data with outliers:")
print(nepse_outliers)

# Basic statistics to identify potential outliers
print("\nBasic statistics:")
print(nepse_outliers.describe())

# Notice the max Close is 5000, much higher than the 75th percentile (about 2924)
# This suggests extreme outliers

# Output:
# Data with outliers:
#                Close    Volume
# Date                          
# 2024-01-15  2852.48  165281.0
# 2024-01-16  2857.78  185681.0
# ...
```

**Explanation:**
- **Outliers** can be:
  - Data entry errors (typo: 5000 instead of 2900)
  - Genuine extreme events (market crash, huge trade)
  - Measurement errors (sensor malfunction)
- The sample data has outliers:
  - Row 5: Close = 3500 (unusually high)
  - Row 12: Close = 2100 (unusually low)
  - Row 18: Close = 5000 (extreme - likely error)
  - Row 19: Volume = 2000000 (10x normal)
- `describe()` provides a quick overview. Large differences between max and 75% percentile suggest outliers.
- The gap between the 75th percentile (about 2924) and the max (5000) is suspicious.

---

```python
# Method 1: Z-Score Method
# Z-score measures how many standard deviations a value is from the mean

def detect_outliers_zscore(data, threshold=3):
    """
    Detect outliers using Z-score method.
    
    Parameters:
    -----------
    data : array-like
        Data to check for outliers
    threshold : float
        Number of standard deviations to consider as outlier (default: 3)
    
    Returns:
    --------
    boolean array
        True indicates outlier
    """
    mean = np.mean(data)
    std = np.std(data)
    z_scores = np.abs((data - mean) / std)
    return z_scores > threshold

# Calculate Z-scores for Close prices
close_prices = nepse_outliers['Close'].values
z_scores = np.abs((close_prices - close_prices.mean()) / close_prices.std())

print("Z-scores for Close prices:")
for i, (date, z) in enumerate(zip(nepse_outliers.index, z_scores)):
    if z > 3:
        print(f"{date.date()}: Close={close_prices[i]:.2f}, Z-score={z:.2f} <-- OUTLIER")

# Identify outliers
outlier_mask_zscore = detect_outliers_zscore(close_prices)
print(f"\nNumber of outliers detected (Z-score): {outlier_mask_zscore.sum()}")
print(f"Outlier indices: {np.where(outlier_mask_zscore)[0]}")

# View outliers
print("\nOutlier rows (Z-score method):")
print(nepse_outliers[outlier_mask_zscore])

# Output:
# Z-scores for Close prices:
# 2024-02-02: Close=5000.00, Z-score=3.92 <-- OUTLIER
#
# Number of outliers detected (Z-score): 1
# Outlier indices: [18]
# (3500 and 2100 score about 1.0 and 1.7: the 5000 value inflates the mean and std)
```

**Explanation:**
- **Z-score** measures how far a value is from the mean in units of standard deviation.
- Formula: \( Z = \frac{|x - \mu|}{\sigma} \)
- **Interpretation**:
  - Z = 0: Value equals the mean
  - Z = 1: Value is 1 standard deviation from mean
  - Z = 3: Value is 3 standard deviations from mean (very rare in normal distribution)
- **Threshold**:
  - Common choice is 3 (values beyond 3σ are considered outliers)
  - For normal distribution, 99.7% of values fall within ±3σ
  - Lower threshold (2) catches more outliers but may include false positives
- **Limitations**:
  - Sensitive to the outliers themselves: extreme values inflate the mean and standard deviation, which can mask other outliers
  - Works best when the data are roughly normal (stock prices typically are not)
  - May not work well for skewed distributions
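
The masking effect listed under Limitations can be checked on a small made-up sample. The sketch below compares the plain Z-score with the modified Z-score of Iglewicz and Hoaglin, a different technique from the one above that replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The sample values, the helper name `detect_outliers_modified_zscore`, and the conventional 3.5 cutoff are assumptions for illustration.

```python
import numpy as np

def detect_outliers_modified_zscore(data, threshold=3.5):
    """Flag outliers with the median/MAD-based modified Z-score (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    median = np.median(data)
    mad = np.median(np.abs(data - median))  # median absolute deviation
    if mad == 0:
        # More than half of the values are identical; the score is undefined
        return np.zeros(data.shape, dtype=bool)
    # 0.6745 rescales the MAD so the score is comparable to a Z-score for normal data
    modified_z = 0.6745 * np.abs(data - median) / mad
    return modified_z > threshold

# Made-up sample: seven ordinary prices plus three extreme values
sample = np.array([2850.0, 2875.0, 2890.0, 2865.0, 2880.0,
                   3500.0, 2870.0, 2885.0, 2100.0, 5000.0])

# Plain Z-score with the same population std as np.std
plain_z = np.abs((sample - sample.mean()) / sample.std())
plain_mask = plain_z > 3
robust_mask = detect_outliers_modified_zscore(sample)

print(f"Plain Z-score flags:    {sample[plain_mask]}")
print(f"Modified Z-score flags: {sample[robust_mask]}")
```

On this sample the plain Z-score flags nothing, because the three extreme values inflate the standard deviation, while the median-based score flags 3500, 2100, and 5000.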

---

```python
# Method 2: IQR Method (Interquartile Range)
# More robust to extreme values than Z-score

def detect_outliers_iqr(data, k=1.5):
    """
    Detect outliers using IQR method.
    
    Parameters:
    -----------
    data : array-like
        Data to check for outliers
    k : float
        Multiplier for IQR (default: 1.5 for outliers, 3 for extreme outliers)
    
    Returns:
    --------
    boolean array, lower_bound, upper_bound
    """
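    # nanpercentile ignores NaN; plain np.percentile would return NaN bounds
    # (and flag nothing) for a column with missing values, e.g. Volume
    q1 = np.nanpercentile(data, 25)
    q3 = np.nanpercentile(data, 75)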
    iqr = q3 - q1
    
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    
    outliers = (data < lower_bound) | (data > upper_bound)
    return outliers, lower_bound, upper_bound

# Detect outliers for Close prices
outlier_mask_iqr, lower, upper = detect_outliers_iqr(close_prices)

print(f"IQR Method:")
print(f"Q1 (25%): {np.percentile(close_prices, 25):.2f}")
print(f"Q3 (75%): {np.percentile(close_prices, 75):.2f}")
print(f"IQR: {np.percentile(close_prices, 75) - np.percentile(close_prices, 25):.2f}")
print(f"Lower bound: {lower:.2f}")
print(f"Upper bound: {upper:.2f}")

print(f"\nNumber of outliers detected (IQR): {outlier_mask_iqr.sum()}")

# View outliers
print("\nOutlier rows (IQR method):")
outliers_df = nepse_outliers[outlier_mask_iqr].copy()
outliers_df['Close'] = outliers_df['Close'].apply(lambda x: f"{x:.2f}")
print(outliers_df)

# Compare with Z-score
print("\nComparison:")
print(f"Z-score method: {outlier_mask_zscore.sum()} outliers")
print(f"IQR method: {outlier_mask_iqr.sum()} outliers")

# Output:
# IQR Method:
# Q1 (25%): 2872.35
# Q3 (75%): 2935.00
# IQR: 62.65
# Lower bound: 2778.38
# Upper bound: 3028.97
#
# Number of outliers detected (IQR): 3
```

**Explanation:**
- **IQR (Interquartile Range)** method is more robust to outliers than Z-score.
- **The calculation**:
  1. Q1 = 25th percentile
  2. Q3 = 75th percentile
  3. IQR = Q3 - Q1 (middle 50% spread)
  4. Lower bound = Q1 - k × IQR
  5. Upper bound = Q3 + k × IQR
- **k parameter**:
  - k = 1.5: Standard outlier detection
  - k = 3: Extreme outlier detection (only very extreme values)
- **Advantages over Z-score**:
  - Uses percentiles, which are largely unaffected by a few extreme values
  - Works better for skewed distributions
  - Doesn't assume normal distribution
- The bounds (2778 to 3029) define the "normal" range. Values outside are flagged.
- This method is widely used in practice, especially for financial data.
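
When the column lives in a pandas Series, the same five steps can be written with `Series.quantile`, which skips missing values and keeps the result aligned to the date index. This is a minimal sketch; the helper name `iqr_outlier_mask` and the volume figures are made up for illustration.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean Series (True = outlier); NaN entries are never flagged."""
    q1 = series.quantile(0.25)  # Series.quantile skips NaN
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons against NaN evaluate to False, so missing rows stay unflagged
    return (series < lower) | (series > upper)

# Made-up volumes with one missing value and one spike
volume = pd.Series(
    [150000, 152000, np.nan, 149000, 151000, 900000, 148000, 153000],
    index=pd.date_range('2024-01-15', periods=8, freq='D'),
    name='Volume',
)

mask = iqr_outlier_mask(volume)
print(volume[mask])
```

Only the 900000 spike is flagged; the missing value neither breaks the bounds nor counts as an outlier.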

---

```python
# Method 3: Visual Detection with Box Plot

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Box plot for Close prices
axes[0].boxplot(close_prices, vert=True)
axes[0].set_title('Close Price Distribution')
axes[0].set_ylabel('Price (NPR)')
axes[0].set_xticklabels(['Close'])

# Box plot for Volume
axes[1].boxplot(nepse_outliers['Volume'].dropna(), vert=True)
axes[1].set_title('Volume Distribution')
axes[1].set_ylabel('Volume')
axes[1].set_xticklabels(['Volume'])

plt.tight_layout()
plt.savefig('outlier_boxplot.png', dpi=100)
plt.close()

print("Box plot saved to 'outlier_boxplot.png'")

# Box plot interpretation:
# - The box shows Q1 to Q3 (middle 50% of data)
# - The line in the box is the median
# - Whiskers extend to the most extreme non-outlier values
# - Points beyond whiskers are outliers (shown as individual dots)

# Visual detection is useful for:
# 1. Quick identification of outliers
# 2. Understanding the distribution shape
# 3. Communicating findings to stakeholders

# Output:
# Box plot saved to 'outlier_boxplot.png'
```

**Explanation:**
- **Box plot** is a visual tool for outlier detection and distribution understanding.
- **Components**:
  - **Box**: Q1 to Q3 (middle 50% of data)
  - **Line in box**: Median (Q2, 50th percentile)
  - **Whiskers**: Extend to the most extreme data points that still lie within 1.5×IQR of the box (no lower than Q1 - 1.5×IQR, no higher than Q3 + 1.5×IQR)
  - **Points beyond whiskers**: Outliers (plotted individually)
- **When to use visual detection**:
  - Initial data exploration
  - Communicating outlier presence to non-technical stakeholders
  - Understanding distribution shape and skewness
- **Advantages**:
  - Intuitive and visual
  - Shows distribution shape, not just outliers
  - Easy to compare multiple groups
- **Limitations**:
  - Reading outliers off a plot is subjective, even though the whisker rule itself is precise
  - Not suitable for automated detection
  - Can be misleading for small datasets
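
The whisker rule behind the plot can also be evaluated numerically, which helps when the visual check needs to feed an automated step. Matplotlib exposes the statistics of a box plot through `matplotlib.cbook.boxplot_stats`; the sketch below applies it to a made-up sample rather than to the chapter's data.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Made-up sample: seven ordinary prices plus three extreme values
sample = np.array([2850.0, 2875.0, 2890.0, 2865.0, 2880.0,
                   3500.0, 2870.0, 2885.0, 2100.0, 5000.0])

# boxplot_stats returns one dict per dataset; whis=1.5 is the usual whisker rule
stats_dict = boxplot_stats(sample, whis=1.5)[0]

print(f"Q1 / median / Q3: {stats_dict['q1']:.2f} / {stats_dict['med']:.2f} / {stats_dict['q3']:.2f}")
print(f"Whiskers: {stats_dict['whislo']:.2f} to {stats_dict['whishi']:.2f}")
print(f"Points drawn as outliers: {sorted(stats_dict['fliers'])}")
```

The whiskers end at 2850 and 2890, the most extreme observations inside the 1.5×IQR fences, and the three remaining values are reported as fliers.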

---

```python
# Method 4: Rolling Statistics for Time-Series Outliers
# For time-series, use rolling windows to detect local outliers

def detect_outliers_rolling(data, window=7, n_std=3):
    """
    Detect outliers using rolling statistics.
    
    Outliers are identified relative to local mean and std,
    not global statistics. This is important for non-stationary data.
    
    Parameters:
    -----------
    data : Series
        Time-series data
    window : int
        Rolling window size
    n_std : float
        Number of standard deviations for threshold
    
    Returns:
    --------
    Series of boolean (True = outlier)
    """
    rolling_mean = data.rolling(window=window, center=True, min_periods=1).mean()
    rolling_std = data.rolling(window=window, center=True, min_periods=1).std()
    
    # Calculate bounds
    lower_bound = rolling_mean - n_std * rolling_std
    upper_bound = rolling_mean + n_std * rolling_std
    
    # Identify outliers
    outliers = (data < lower_bound) | (data > upper_bound)
    
    return outliers, lower_bound, upper_bound

# Apply rolling outlier detection
outlier_mask_rolling, lower_rolling, upper_rolling = detect_outliers_rolling(
    nepse_outliers['Close'], window=7, n_std=2.5
)

print("Rolling outlier detection:")
print(f"Window size: 7 days")
print(f"Threshold: 2.5 standard deviations")

# Show detected outliers
outlier_dates = nepse_outliers.index[outlier_mask_rolling]
for date in outlier_dates:
    close_val = nepse_outliers.loc[date, 'Close']
    local_mean = nepse_outliers['Close'].rolling(7, center=True, min_periods=1).mean().loc[date]
    print(f"{date.date()}: Close={close_val:.2f}, Local Mean={local_mean:.2f}")

# Compare methods
print("\nComparison of methods:")
print(f"Z-score outliers: {outlier_mask_zscore.sum()}")
print(f"IQR outliers: {outlier_mask_iqr.sum()}")
print(f"Rolling outliers: {outlier_mask_rolling.sum()}")

# Output:
# Rolling outlier detection:
# Window size: 7 days
# Threshold: 2.5 standard deviations
# 2024-01-20: Close=3500.00, Local Mean=2900.33
# 2024-01-27: Close=2100.00, Local Mean=2922.56
# 2024-02-02: Close=5000.00, Local Mean=2916.75
```

**Explanation:**
- **Rolling outlier detection** uses local statistics instead of global statistics.
- This is crucial for **non-stationary time-series** where the mean changes over time.
- **How it works**:
  1. Calculate rolling mean and std with a window (e.g., 7 days)
  2. For each point, check if it's within n_std of the local mean
  3. Flag points that deviate significantly from their local neighborhood
- **Why it's better for time-series**:
  - A price of 2950 might be normal in a rising market but an outlier in a falling market
  - Local detection adapts to trends and regime changes
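- **Parameters**:
  - `window`: Size of local neighborhood (7 days captures weekly patterns)
  - `center=True`: Use symmetric window (past and future)
  - `n_std`: Threshold in local standard deviations; a lower value such as 2.5 flags more points than 3
- **Caveat**: Each window includes the point being tested, so an extreme value inflates its own local mean and std. Within a full window of w points, no value can lie more than (w - 1)/√w sample standard deviations from the window mean (about 2.27 for w = 7). With `window=7`, thresholds of 2.5 or 3 therefore cannot be exceeded; use a lower threshold, a wider window, or statistics that exclude the current point.
- **Important**: `center=True` uses future data for smoothing. For prediction, use `center=False` (only past data).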
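
A past-only variant can be sketched by shifting the series one step before taking rolling statistics, so that each point is compared with the observations that precede it and does not contribute to its own mean and std. The helper name `detect_outliers_trailing`, its `min_periods` default, and the sample prices are assumptions for illustration, not part of the chapter's dataset.

```python
import numpy as np
import pandas as pd

def detect_outliers_trailing(series, window=7, n_std=3.0, min_periods=3):
    """Flag points far from the mean of the PREVIOUS `window` observations."""
    past = series.shift(1)  # exclude the current point from its own statistics
    local_mean = past.rolling(window, min_periods=min_periods).mean()
    local_std = past.rolling(window, min_periods=min_periods).std()
    # Comparisons against NaN are False, so the first rows are never flagged
    return (series - local_mean).abs() > n_std * local_std

# Made-up prices with a single spike at position 8
prices = pd.Series(
    [2850, 2875, 2890, 2865, 2880, 2870, 2885, 2860, 3500, 2875, 2868, 2882],
    index=pd.date_range('2024-01-15', periods=12, freq='D'),
    dtype=float,
)

# Centered window that includes the spike itself: its score stays below (w-1)/sqrt(w)
centered = prices.rolling(7, center=True, min_periods=1)
centered_z = ((prices - centered.mean()) / centered.std()).abs()
print(f"Centered score of the spike: {centered_z.iloc[8]:.2f}")
print(f"Upper bound (w-1)/sqrt(w):   {6 / np.sqrt(7):.2f}")

trailing_mask = detect_outliers_trailing(prices)
print(prices[trailing_mask])
```

The spike scores about 2.27 against its own centered window, yet it stands far more than 3 standard deviations from the seven preceding days, so only the trailing version flags it. The rows right after the spike are judged against a window that contains it, which is one reason to clean or flag outliers sequentially.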

---

### **4.6.2 Treatment Strategies**

Once outliers are detected, we need to decide how to handle them.

```python
# Strategy 1: Removal
# Simply remove outlier rows

df_removed = nepse_outliers.copy()

# Remove rows where Close is an outlier (IQR method)
outlier_mask, _, _ = detect_outliers_iqr(df_removed['Close'].values)
df_removed_clean = df_removed[~outlier_mask]

print("After removing outliers:")
print(f"Original rows: {len(nepse_outliers)}")
print(f"After removal: {len(df_removed_clean)}")
print(f"Removed: {len(nepse_outliers) - len(df_removed_clean)} rows")

# Show remaining data
print("\nCleaned data:")
print(df_removed_clean.head())

# When to remove:
# - Clear data errors (typo: 5000 instead of 2900)
# - Outliers that would distort analysis
# - When you have enough data to spare

# When NOT to remove:
# - Genuine extreme events (market crashes, spikes)
# - Small datasets where every point matters
# - When outliers contain important information

# Output:
# After removing outliers:
# Original rows: 20
# After removal: 17
# Removed: 3 rows
```

**Explanation:**
- **Removal** is the simplest strategy: delete rows containing outliers.
- **Pros**:
  - Simple to implement
  - No assumptions about replacement values
  - Clean resulting dataset
- **Cons**:
  - Loses data (especially problematic for small datasets)
  - Loses potentially valuable information
  - May create gaps in time-series
- **When to use removal**:
  - Clear errors (typo: 5000 instead of 2900)
  - Extreme values that are definitely wrong
  - When you have abundant data
- **When NOT to use removal**:
  - Genuine extreme events (market crash, earnings surprise)
  - Small datasets
  - When the outlier might be informative
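
The point about gaps can be made concrete. The sketch below removes one row from a made-up daily series, then shows two ways to keep the gap visible: measuring the spacing between consecutive dates, and re-imposing the daily grid with `asfreq`. The threshold `close > 4000` is only a stand-in for whichever detection method was used.

```python
import pandas as pd

close = pd.Series(
    [2850.0, 2875.0, 5000.0, 2865.0, 2880.0],
    index=pd.date_range('2024-01-15', periods=5, freq='D'),
    name='Close',
)
is_outlier = close > 4000  # stand-in for any detection method

cleaned = close[~is_outlier]

# Spacing between consecutive dates; more than 1 day means a row is missing
day_gaps = cleaned.index.to_series().diff().dt.days
print(day_gaps[day_gaps > 1])

# Re-impose the daily grid so the gap is an explicit NaN rather than a silent jump
regular = cleaned.asfreq('D')
print(regular)
```

Rolling windows and lag features count rows, not days, so an unnoticed gap shifts every window that spans it. Keeping the NaN placeholder makes that visible.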

---

```python
# Strategy 2: Capping (Winsorization)
# Replace extreme values with percentile limits (here the 5th and 95th)

def winsorize(data, lower_percentile=5, upper_percentile=95):
    """
    Winsorize data by capping extreme values.
    
    Parameters:
    -----------
    data : array-like
        Data to winsorize
    lower_percentile : float
        Lower percentile for capping
    upper_percentile : float
        Upper percentile for capping
    
    Returns:
    --------
    array
        Winsorized data
    """
    lower_limit = np.percentile(data, lower_percentile)
    upper_limit = np.percentile(data, upper_percentile)
    
    winsorized = np.clip(data, lower_limit, upper_limit)
    return winsorized

df_capped = nepse_outliers.copy()

# Winsorize Close prices
df_capped['Close_Capped'] = winsorize(df_capped['Close'].values, 5, 95)

# Compare original and capped
print("Winsorization results:")
print(f"Original range: {df_capped['Close'].min():.2f} to {df_capped['Close'].max():.2f}")
print(f"Capped range: {df_capped['Close_Capped'].min():.2f} to {df_capped['Close_Capped'].max():.2f}")

# Show the transformation
print("\nComparison (showing outliers):")
for idx in [5, 12, 18]:  # Indices of outliers
    original = df_capped['Close'].iloc[idx]
    capped = df_capped['Close_Capped'].iloc[idx]
    print(f"Row {idx}: Original={original:.2f}, Capped={capped:.2f}")

# Output:
# Winsorization results:
# Original range: 2100.00 to 5000.00
# Capped range: 2849.31 to 2980.38
#
# Comparison (showing outliers):
# Row 5: Original=3500.00, Capped=2980.38
# Row 12: Original=2100.00, Capped=2849.31
# Row 18: Original=5000.00, Capped=2980.38
```

**Explanation:**
- **Winsorization** (capping) replaces extreme values with percentile limits.
- **How it works**:
  1. Calculate the 5th and 95th percentiles
  2. Values below 5th percentile → replaced with 5th percentile value
  3. Values above 95th percentile → replaced with 95th percentile value
- `np.clip(array, min, max)` limits values to the specified range.
- **Advantages**:
  - Preserves data structure (no missing values)
  - Reduces impact of outliers without removing them
  - Retains the direction of extreme values (high stays high, low stays low)
- **When to use**:
  - When outliers might contain useful signal
  - When you don't want to lose data points
  - For robust statistical analysis
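- **Limitations**:
  - Still modifies data (may affect analysis)
  - Choice of percentiles is arbitrary (5/95, 1/99, etc.)
  - On small samples the limits themselves are pulled toward the outliers: `np.percentile` interpolates between neighboring sorted values, so with 20 points the 95th percentile lies between the two largest values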
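
How much the choice of limits matters on a short series can be seen on a made-up sample of ten prices. The sketch compares 5/95 percentile limits with limits taken from the IQR fences (Q1 - 1.5×IQR and Q3 + 1.5×IQR); using the fences as clipping limits is a swapped-in alternative shown for comparison, not the method used above.

```python
import numpy as np

# Made-up sample: seven ordinary prices plus three extreme values
sample = np.array([2850.0, 2875.0, 2890.0, 2865.0, 2880.0,
                   3500.0, 2870.0, 2885.0, 2100.0, 5000.0])

# Percentile limits: with 10 points these fall between the extreme values themselves
p5, p95 = np.percentile(sample, [5, 95])
capped_pct = np.clip(sample, p5, p95)

# IQR-fence limits: driven by the middle half of the data
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
capped_iqr = np.clip(sample, q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(f"Percentile limits: {p5:.2f} to {p95:.2f}")
print(f"IQR-fence limits:  {q1 - 1.5 * iqr:.2f} to {q3 + 1.5 * iqr:.2f}")
print(f"3500 after percentile capping: {capped_pct[5]:.2f}")
print(f"3500 after IQR-fence capping:  {capped_iqr[5]:.2f}")
```

Here the percentile limits are 2437.50 and 4325.00, so 3500 passes through unchanged, while the IQR fences (2832.50 to 2922.50) pull all three extreme values back toward the bulk of the data.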

---

```python
# Strategy 3: Transformation
# Apply mathematical transformation to reduce outlier impact

df_transformed = nepse_outliers.copy()

# Log transformation - reduces right skew
df_transformed['Close_Log'] = np.log(df_transformed['Close'])

# Compare distributions
print("Log transformation:")
print(f"Original: Mean={df_transformed['Close'].mean():.2f}, Std={df_transformed['Close'].std():.2f}")
print(f"Log: Mean={df_transformed['Close_Log'].mean():.2f}, Std={df_transformed['Close_Log'].std():.2f}")

# The log transformation compresses the scale
# A value of 5000 (outlier) becomes log(5000) = 8.52
# A value of 2900 (normal) becomes log(2900) = 7.97
# The difference shrinks from 2100 to 0.55

# Square root transformation - less aggressive than log
df_transformed['Close_Sqrt'] = np.sqrt(df_transformed['Close'])

# Box-Cox transformation - parameterized power transformation
from scipy import stats

# Box-Cox requires positive values
close_positive = df_transformed['Close'].values
close_boxcox, lambda_param = stats.boxcox(close_positive)
df_transformed['Close_BoxCox'] = close_boxcox

print(f"\nBox-Cox lambda: {lambda_param:.4f}")
# lambda close to 0 suggests log transformation is appropriate

# For prediction models, transformations are useful because:
# 1. They normalize the distribution
# 2. They stabilize variance
# 3. They reduce outlier influence

# Remember to inverse transform predictions!

# Output:
# Log transformation:
# Original: Mean=2941.09, Std=636.84
# Log: Mean=7.97, Std=0.12
#
# Box-Cox lambda: -0.0234
```

**Explanation:**
- **Transformations** change the scale of data to reduce outlier impact and normalize distributions.
- **Log transformation**:
  - Compresses large values more than small values
  - Converts multiplicative relationships to additive
  - Formula: \( x_{transformed} = \log(x) \)
  - Effect: 5000 → 8.52, 2900 → 7.97 (difference: 2100 → 0.55)
- **Square root transformation**:
  - Less aggressive than log
  - Good for count data
  - Formula: \( x_{transformed} = \sqrt{x} \)
- **Box-Cox transformation**:
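  - Generalizes log and square root
  - `stats.boxcox` estimates the transformation parameter (λ) by maximum likelihood when none is supplied
  - λ = 0: equivalent to log
  - λ = 0.5: equivalent to square root (up to a linear rescaling)
  - λ = 1: no change in shape (values are only shifted by 1)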
- **For prediction models**:
  - Use transformation when data is skewed or has outliers
  - **Important**: Inverse transform predictions to get actual values
  - Example: If you predict log(price), apply exp() to get actual price
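
The inverse step can be sketched as a round trip. `np.exp` undoes `np.log`, and `scipy.special.inv_boxcox` undoes `scipy.stats.boxcox` as long as the same λ is reused. The prices below are made up, and λ is fixed at the -0.0234 printed in the output above purely for illustration; in practice you would store the λ estimated on your training data.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Made-up prices (Box-Cox requires strictly positive values)
prices = np.array([2850.0, 2875.0, 2890.0, 2865.0, 2880.0,
                   3500.0, 2870.0, 2885.0, 2100.0, 5000.0])

# Log round trip
log_prices = np.log(prices)
restored_from_log = np.exp(log_prices)

# Box-Cox round trip with a fixed lambda (passing lmbda returns only the array)
lam = -0.0234
bc_prices = stats.boxcox(prices, lmbda=lam)
restored_from_bc = inv_boxcox(bc_prices, lam)

print(f"Max log round-trip error:     {np.max(np.abs(restored_from_log - prices)):.2e}")
print(f"Max Box-Cox round-trip error: {np.max(np.abs(restored_from_bc - prices)):.2e}")
```

A model trained on `log_prices` or `bc_prices` produces predictions on the transformed scale, so the same inverse call is applied to its output before reporting prices.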

---

```python
# Strategy 4: Imputation (replace with estimated value)

df_imputed_outliers = nepse_outliers.copy()

# Detect outliers
outlier_mask, lower, upper = detect_outliers_iqr(df_imputed_outliers['Close'].values)

# Replace outliers with NaN
df_imputed_outliers.loc[outlier_mask, 'Close'] = np.nan

# Now apply imputation methods
# Option 1: Fill with median
df_imputed_outliers['Close_Median'] = df_imputed_outliers['Close'].fillna(
    df_imputed_outliers['Close'].median()
)

# Option 2: Interpolation
df_imputed_outliers['Close_Interp'] = df_imputed_outliers['Close'].interpolate(method='linear')

# Option 3: Rolling mean
rolling_mean = df_imputed_outliers['Close'].rolling(window=5, center=True, min_periods=1).mean()
df_imputed_outliers['Close_Rolling'] = df_imputed_outliers['Close'].fillna(rolling_mean)

# Compare methods
print("Outlier imputation comparison:")
print("\nOriginal vs Imputed for outlier rows:")
for idx in [5, 12, 18]:
    print(f"\nRow {idx}:")
    print(f"  Original: {nepse_outliers['Close'].iloc[idx]:.2f}")
    print(f"  Median fill: {df_imputed_outliers['Close_Median'].iloc[idx]:.2f}")
    print(f"  Interpolation: {df_imputed_outliers['Close_Interp'].iloc[idx]:.2f}")
    print(f"  Rolling mean: {df_imputed_outliers['Close_Rolling'].iloc[idx]:.2f}")

# Output:
# Outlier imputation comparison:
#
# Original vs Imputed for outlier rows:
#
# Row 5:
#   Original: 3500.00
#   Median fill: 2911.39
#   Interpolation: 2917.45
#   Rolling mean: 2919.81
```

**Explanation:**
- **Outlier imputation** treats outliers as missing values and fills them with estimates.
- **Process**:
  1. Detect outliers using chosen method (IQR, Z-score, etc.)
  2. Replace outlier values with NaN
  3. Apply imputation method (median, interpolation, rolling mean)
- **Median fill**: Replaces with the median of non-outlier values. Robust to remaining outliers.
- **Interpolation**: Estimates from nearby values. Respects local trends.
- **Rolling mean**: Uses average of neighboring values. Good for time-series.
- **Advantages**:
  - Preserves data structure (no missing rows)
  - Uses local information
  - Can be more accurate than simple capping
- **Limitations**:
  - More complex than removal or capping
  - Requires choice of imputation method
  - May introduce bias if outliers are informative
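
The difference between the three fills is easiest to see on a short trending series. In this sketch the values are made up and `close > 4000` stands in for any detection method; `Series.mask` turns the flagged entries into NaN in one step.

```python
import pandas as pd

# Made-up upward-trending closes with one bad value at position 2
close = pd.Series(
    [2850.0, 2870.0, 5000.0, 2890.0, 2900.0, 2910.0],
    index=pd.date_range('2024-01-15', periods=6, freq='D'),
    name='Close',
)
is_outlier = close > 4000  # stand-in for any detection method

masked = close.mask(is_outlier)  # flagged values become NaN

median_fill = masked.fillna(masked.median())
interpolated = masked.interpolate(method='linear')
rolling_fill = masked.fillna(masked.rolling(3, center=True, min_periods=1).mean())

print(f"Median fill:   {median_fill.iloc[2]:.2f}")
print(f"Interpolation: {interpolated.iloc[2]:.2f}")
print(f"Rolling mean:  {rolling_fill.iloc[2]:.2f}")
```

Interpolation and the centered rolling mean both give 2880, the midpoint of the two neighbors, while the median fill gives 2890 because it ignores where in the trend the gap sits. Note that both local fills use the following day's value, so they suit historical cleaning better than live prediction.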

---

```python
# Strategy 5: Flagging and Retaining
# Keep outliers but mark them for special handling

df_flagged = nepse_outliers.copy()

# Create outlier flag columns
outlier_mask_close, _, _ = detect_outliers_iqr(df_flagged['Close'].values)
outlier_mask_vol, _, _ = detect_outliers_iqr(df_flagged['Volume'].values)

df_flagged['Close_Outlier'] = outlier_mask_close
df_flagged['Volume_Outlier'] = outlier_mask_vol

# Create combined outlier flag
df_flagged['Is_Outlier'] = outlier_mask_close | outlier_mask_vol

print("Data with outlier flags:")
print(df_flagged)

# Summary
print(f"\nOutlier summary:")
print(f"Close outliers: {outlier_mask_close.sum()}")
print(f"Volume outliers: {outlier_mask_vol.sum()}")
print(f"Total rows with outliers: {df_flagged['Is_Outlier'].sum()}")

# Use flags in analysis
normal_data = df_flagged[~df_flagged['Is_Outlier']]
outlier_data = df_flagged[df_flagged['Is_Outlier']]

print(f"\nNormal data statistics:")
print(f"Close mean: {normal_data['Close'].mean():.2f}")

print(f"\nOutlier data:")
print(outlier_data[['Close', 'Volume']])

# Output:
# Data with outlier flags:
#                Close    Volume  Close_Outlier  Volume_Outlier  Is_Outlier
# Date                                                                     
# 2024-01-15  2852.48  165281.0           False           False       False
# ...
```

**Explanation:**
- **Flagging** keeps outliers but marks them for special treatment.
- **Advantages**:
  - Preserves all data (no information loss)
  - Allows flexible handling in different contexts
  - Enables analysis of outlier patterns
  - Models can use flags as features
- **Implementation**:
  - Create boolean columns indicating outlier status
  - Can flag each column separately or combine them
  - Use flags to filter or weight data in analysis
- **Use cases**:
  - When outliers might be genuine events
  - When you want models to learn from outliers
  - When different analyses need different outlier handling
- **For prediction systems**:
  - Train models on flagged data
  - Models might learn that flagged rows behave differently
  - Can use flags as features (outlier → higher uncertainty)

---

## **4.7 Data Aggregation and Resampling**

Time-series data often needs to be aggregated from one frequency to another (e.g., daily to weekly) or resampled to a different time grid.

```python
import pandas as pd
import numpy as np

# Create sample NEPSE tick data (intraday trades)
np.random.seed(42)
timestamps = pd.date_range('2024-01-15 09:00:00', periods=100, freq='5min')

tick_data = pd.DataFrame({
    'Timestamp': timestamps,
    'Price': 2850 + np.cumsum(np.random.randn(100) * 2),  # Random walk
    'Volume': np.random.randint(100, 1000, 100),
    'Symbol': 'NABIL'
})

tick_data.set_index('Timestamp', inplace=True)
print("Tick data (5-minute intervals):")
print(tick_data.head(15))

# Output:
# Tick data (5-minute intervals):
#                        Price  Volume Symbol
# Timestamp                               
# 2024-01-15 09:00:00  2851.99     451  NABIL
# 2024-01-15 09:05:00  2848.33     828  NABIL
# 2024-01-15 09:10:00  2847.16     771  NABIL
```

**Explanation:**
- **Tick data** records every individual trade or quote. It's the most granular form of market data.
- For NEPSE, tick data might include every trade with timestamp, price, and volume.
- This sample simulates 5-minute trade snapshots with random price movements.
- Aggregating tick data is usually necessary for analysis, since millions of individual ticks are noisy and expensive to process.
- Common aggregations: 1-minute → 5-minute → 1-hour → daily → weekly.

---

```python
# Resampling to different frequencies

# Resample to 15-minute intervals
df_15min = tick_data.resample('15min').agg({
    'Price': 'ohlc',  # Open, High, Low, Close
    'Volume': 'sum'
})

print("15-minute OHLC data:")
print(df_15min.head())

# The ohlc aggregation creates a MultiIndex column
# Let's flatten it
df_15min.columns = ['_'.join(col).strip() if isinstance(col, tuple) else col 
                     for col in df_15min.columns.values]
print("\nFlattened columns:")
print(df_15min.head())

# Resample to hourly
df_hourly = tick_data.resample('H').agg({
    'Price': ['first', 'max', 'min', 'last'],
    'Volume': 'sum'
})

# Rename columns for clarity
df_hourly.columns = ['Open', 'High', 'Low', 'Close', 'Volume']
print("\nHourly OHLCV data:")
print(df_hourly.head())

# Output:
# 15-minute OHLC data:
#                    Price                                    Volume
#                     open     high      low    close              sum
# Timestamp                                                         
# 2024-01-15 09:00:00  2851.99  2855.04  2847.16  2848.62           2639
```

**Explanation:**
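- `resample()` groups data by time intervals. The frequency string specifies the interval:
  - `'min'` or `'T'`: minutes
  - `'H'`: hours
  - `'D'`: days
  - `'W'`: weeks
  - `'M'`: month end
  - `'MS'`: month start
  - `'Q'`: quarter end
  - pandas 2.2 deprecated `'T'`, `'H'`, `'M'`, and `'Q'` in favor of `'min'`, `'h'`, `'ME'`, and `'QE'`; in that release the older spellings still work but emit a `FutureWarning`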
- **OHLC aggregation**:
  - **Open**: First price in the period
  - **High**: Maximum price in the period
  - **Low**: Minimum price in the period
  - **Close**: Last price in the period
- `agg()` specifies different aggregation methods for each column.
- The result has MultiIndex columns when using 'ohlc' or multiple aggregations.
- Flattening columns makes the DataFrame easier to work with.
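
One way to sidestep the MultiIndex columns is to resample each column on its own and combine the results. The sketch below uses `Series.resample(...).ohlc()` for the price and a separate sum for the volume; the six made-up ticks and the column name `volume` are assumptions for illustration.

```python
import pandas as pd

# Six made-up 5-minute observations
idx = pd.date_range('2024-01-15 11:00', periods=6, freq='5min')
ticks = pd.DataFrame(
    {'Price': [100.0, 102.0, 101.0, 103.0, 99.0, 104.0],
     'Volume': [10, 20, 30, 40, 50, 60]},
    index=idx,
)

# ohlc() on a single Series yields flat columns: open, high, low, close
bars = ticks['Price'].resample('15min').ohlc()
# Volume is additive, so it is summed over the same 15-minute bins
bars['volume'] = ticks['Volume'].resample('15min').sum()

print(bars)
```

Each 15-minute bar is built from three ticks: the first bar opens at 100, peaks at 102, and closes at 101 with a volume of 60.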

---

```python
# Resampling daily NEPSE data to weekly and monthly

# Create sample daily data
dates = pd.date_range('2024-01-01', periods=90, freq='D')
np.random.seed(42)

daily_nepse = pd.DataFrame({
    'Open': 2850 + np.cumsum(np.random.randn(90) * 5),
    'High': 2870 + np.cumsum(np.random.randn(90) * 5),
    'Low': 2830 + np.cumsum(np.random.randn(90) * 5),
    'Close': 2860 + np.cumsum(np.random.randn(90) * 5),
    'Volume': np.random.randint(100000, 200000, 90)
}, index=dates)

# Make High the row-wise maximum and Low the row-wise minimum so each bar is consistent
daily_nepse['High'] = daily_nepse[['Open', 'High', 'Low', 'Close']].max(axis=1)
daily_nepse['Low'] = daily_nepse[['Open', 'High', 'Low', 'Close']].min(axis=1)

print("Daily NEPSE data:")
print(daily_nepse.head())

# Resample to weekly ('W' is 'W-SUN': bins end on Sunday and cover Monday to Sunday)
weekly = daily_nepse.resample('W').agg({
    'Open': 'first',    # Week's opening price
    'High': 'max',      # Week's highest price
    'Low': 'min',       # Week's lowest price
    'Close': 'last',    # Week's closing price
    'Volume': 'sum'     # Total weekly volume
})

print("\nWeekly data:")
print(weekly.head())

# Resample to monthly
monthly = daily_nepse.resample('M').agg({
    'Open': 'first',
    'High': 'max',
    'Low': 'min',
    'Close': 'last',
    'Volume': 'sum'
})

print("\nMonthly data:")
print(monthly)

# Output:
# Daily NEPSE data:
#                 Open       High        Low      Close   Volume
# 2024-01-01  2857.45   2875.58   2857.45   2866.98   170160
# 2024-01-02  2864.69   2875.58   2864.69   2868.33   189527
```

**Explanation:**
- **Weekly resampling** (`'W'`):
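  - By default, weeks end on Sunday (`'W'` is `'W-SUN'`), so each bin covers Monday through Sunday
  - `'W-MON'`, `'W-TUE'`, etc., for different week endings; `'W-SAT'` gives Sunday-to-Saturday weeks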
  - Open = first price of the week, Close = last price of the week
  - Volume is summed across all days in the week
- **Monthly resampling** (`'M'`):
  - Groups data by calendar month
  - End-of-month is default; use `'MS'` for start-of-month
  - Same aggregation logic: first Open, last Close, max High, min Low
- **Why aggregate**:
  - Reduce noise (daily data has more noise than weekly)
  - Identify longer-term trends
  - Reduce computational load
  - Match prediction horizon (if predicting weekly, use weekly data)
- **Important**: Downsampling aggregates information, so intra-period details are lost.
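
The effect of the week anchor is easy to verify on a two-week calendar. This sketch counts the days that fall into each weekly bin for the default `'W'` and for `'W-SAT'`; the series of ones is a made-up stand-in for real data.

```python
import pandas as pd

# 2024-01-01 is a Monday, so this range covers two full Monday-to-Sunday weeks
days = pd.date_range('2024-01-01', '2024-01-14', freq='D')
one_per_day = pd.Series(1, index=days)

weekly_default = one_per_day.resample('W').sum()   # same as 'W-SUN'
weekly_sat = one_per_day.resample('W-SAT').sum()   # weeks run Sunday to Saturday

print("Default 'W' bins:")
print(weekly_default)
print("\n'W-SAT' bins:")
print(weekly_sat)
```

With the default anchor both bins contain 7 days and are labeled with a Sunday. With `'W-SAT'` the same dates split into bins of 6, 7, and 1 days, each labeled with a Saturday, so weekly Open, Close, and Volume figures depend on which anchor matches the market's trading week.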

---

```python
# Upsampling - increasing frequency
# Useful when you need to align data at a higher frequency

# Create daily data
daily_close = pd.Series(
    [2850, 2875, 2890, 2865, 2880],
    index=pd.date_range('2024-01-15', periods=5, freq='D'),
    name='Close'
)

print("Original daily data:")
print(daily_close)

# Upsample to hourly
hourly_ffill = daily_close.resample('H').ffill()
print("\nUpsampled to hourly (forward fill):")
print(hourly_ffill.head(30))

# The daily value is repeated for each hour of the day
# This is useful for aligning with intraday data

# Upsample with interpolation
hourly_interp = daily_close.resample('H').interpolate(method='linear')
print("\nUpsampled to hourly (linear interpolation):")
print(hourly_interp.head(30))

# With interpolation, values change gradually between known points
# This is more realistic for continuously changing quantities

# Output:
# Original daily data:
# 2024-01-15    2850
# 2024-01-16    2875
# 2024-01-17    2890
# 2024-01-18    2865
# 2024-01-19    2880
# Freq: D, Name: Close, dtype: int64
```

**Explanation:**
- **Upsampling** increases frequency (daily → hourly). This requires filling in new values.
- **Forward fill** (`ffill()`):
  - Propagates the last known value forward
  - Daily price is used for all hours of that day
  - Appropriate when the value is constant until the next update
- **Interpolation**:
  - Values change gradually between known points
  - Linear interpolation draws a straight line between points
  - More appropriate for continuously varying quantities
- **Use cases for upsampling**:
  - Aligning data at different frequencies for analysis
  - Creating features at higher frequency
  - Filling gaps in time-series
- **Caution**: Upsampling doesn't create new information. It just fills in estimated values.
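
One practical difference between the two fills is what they know at each timestamp. The sketch below upsamples two made-up daily values and inspects noon of the first day; the `'60min'` alias is used only because it is spelled the same way across pandas versions.

```python
import pandas as pd

daily = pd.Series(
    [2850.0, 2875.0],
    index=pd.date_range('2024-01-15', periods=2, freq='D'),
    name='Close',
)

hourly_ffill = daily.resample('60min').ffill()
hourly_interp = daily.resample('60min').interpolate(method='linear')

noon = pd.Timestamp('2024-01-15 12:00')
print(f"Forward fill at noon:  {hourly_ffill.loc[noon]:.2f}")
print(f"Interpolation at noon: {hourly_interp.loc[noon]:.2f}")
```

Forward fill reports 2850.00 at noon, using only the value already known. Interpolation reports 2862.50, which is halfway to the next day's 2875 and therefore depends on a value that was not yet available at that time. For features that feed a prediction model, forward fill is usually the safer default.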

---

```python
# Grouping and aggregating by time components

# Create sample data spanning multiple months
dates = pd.date_range('2024-01-01', periods=180, freq='D')
np.random.seed(42)

nepse_multi = pd.DataFrame({
    'Date': dates,
    'Close': 2800 + np.cumsum(np.random.randn(180) * 3),
    'Volume': np.random.randint(100000, 200000, 180)
})

nepse_multi['Date'] = pd.to_datetime(nepse_multi['Date'])
nepse_multi.set_index('Date', inplace=True)

# Extract time components
nepse_multi['Year'] = nepse_multi.index.year
nepse_multi['Month'] = nepse_multi.index.month
nepse_multi['Day'] = nepse_multi.index.day
nepse_multi['DayOfWeek'] = nepse_multi.index.dayofweek
nepse_multi['DayName'] = nepse_multi.index.day_name()
nepse_multi['WeekOfYear'] = nepse_multi.index.isocalendar().week
nepse_multi['Quarter'] = nepse_multi.index.quarter

print("Data with time components:")
print(nepse_multi.head())

# Aggregate by day of week
print("\nAverage by day of week:")
dow_stats = nepse_multi.groupby('DayName').agg({
    'Close': 'mean',
    'Volume': 'mean'
}).round(2)
print(dow_stats)

# Aggregate by month
print("\nMonthly statistics:")
monthly_stats = nepse_multi.groupby('Month').agg({
    'Close': ['mean', 'std', 'min', 'max'],
    'Volume': ['mean', 'sum']
}).round(2)
print(monthly_stats)

# Output:
# Data with time components:
#                Close   Volume  Year  Month  Day  DayOfWeek   DayName  WeekOfYear  Quarter
# Date                                                                                         
# 2024-01-01  2804.97  152102  2024      1    1          0    Monday           1        1
```

**Explanation:**
- **Time components** can be extracted from DatetimeIndex:
  - `.year`, `.month`, `.day`: Calendar components
  - `.dayofweek`: 0 (Monday) to 6 (Sunday)
  - `.day_name()`: Day name as string
  - `.isocalendar().week`: ISO week number
  - `.quarter`: Quarter (1-4)
- **Grouping by time components** reveals patterns:
  - Day-of-week patterns: Volume might be higher on certain days
  - Monthly patterns: Seasonality in prices or volume
  - Quarter patterns: Quarterly earnings effects
- **Applications**:
  - Feature engineering: Create time-based features for models
  - Analysis: Identify seasonal patterns
  - Anomaly detection: Compare same period across years
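
Grouping by `DayName` sorts the result alphabetically, which puts Friday first and scatters the weekdays. A minimal sketch of restoring calendar order with `reindex`, using made-up volumes:

```python
import numpy as np
import pandas as pd

# Two made-up weeks starting on Monday 2024-01-01
dates = pd.date_range('2024-01-01', periods=14, freq='D')
df = pd.DataFrame({'Volume': np.arange(14) * 1000}, index=dates)
df['DayName'] = df.index.day_name()

by_day = df.groupby('DayName')['Volume'].mean()  # alphabetical: Friday, Monday, ...

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']
by_day_ordered = by_day.reindex(day_order)       # calendar order

print(by_day_ordered)
```

Grouping on the numeric `DayOfWeek` column gives the same ordering without the extra step, at the cost of less readable labels.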

---

```python
# Advanced: Rolling and Expanding Windows

df_windows = daily_nepse.copy()

# Rolling window - fixed size moving window
df_windows['MA_7'] = df_windows['Close'].rolling(window=7).mean()  # 7-day moving average
df_windows['MA_30'] = df_windows['Close'].rolling(window=30).mean()  # 30-day moving average
df_windows['Volatility_7'] = df_windows['Close'].rolling(window=7).std()  # 7-day volatility

print("Rolling window calculations:")
print(df_windows[['Close', 'MA_7', 'MA_30', 'Volatility_7']].head(35))

# Expanding window - cumulative from start
df_windows['Cumulative_Return'] = (df_windows['Close'] / df_windows['Close'].iloc[0] - 1) * 100
df_windows['Expanding_Mean'] = df_windows['Close'].expanding().mean()
df_windows['Expanding_Std'] = df_windows['Close'].expanding().std()

print("\nExpanding window calculations:")
print(df_windows[['Close', 'Cumulative_Return', 'Expanding_Mean', 'Expanding_Std']].head(10))

# Rolling window with custom aggregation
def rolling_max_drawdown(prices):
    """Calculate maximum drawdown in a window."""
    cummax = prices.cummax()
    drawdown = (prices - cummax) / cummax
    return drawdown.min()

df_windows['Max_DD_30'] = df_windows['Close'].rolling(window=30).apply(rolling_max_drawdown)

print("\nRolling max drawdown:")
print(df_windows[['Close', 'Max_DD_30']].tail())

# Output:
# Rolling window calculations:
#                 Close       MA_7      MA_30  Volatility_7
# 2024-01-01  2866.98        NaN        NaN           NaN
# 2024-01-02  2868.33        NaN        NaN           NaN
# 2024-01-03  2870.32        NaN        NaN           NaN
# 2024-01-04  2874.19        NaN        NaN           NaN
# 2024-01-05  2872.55        NaN        NaN           NaN
# 2024-01-06  2871.75        NaN        NaN           NaN
# 2024-01-07  2865.92  2869.72        NaN      2.32
```

**Explanation:**
- **Rolling window** calculations use a fixed-size window that moves through the data.
  - `rolling(window=7)` creates windows of 7 consecutive values
  - Each position calculates statistics from its window
  - First 6 rows are NaN (not enough data for 7-day window)
- **Common rolling calculations**:
  - Moving average: Smooths noise, shows trends
  - Rolling std: Measures local volatility
  - Rolling min/max: Support/resistance levels
- **Expanding window** starts from the beginning and grows:
  - `expanding().mean()`: Cumulative average of all values up to and including the current row
  - Useful for benchmarking against historical performance
- **Custom functions** can be applied with `.apply()`:
  - Max drawdown: Largest peak-to-trough decline
  - Any function that takes a Series and returns a scalar
- **Applications in prediction**:
  - Rolling features capture recent behavior
  - Moving averages are common features in trading models
  - Rolling volatility indicates risk level
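
When rolling statistics become model features, it matters that `rolling()` includes the current row. If the target is that same row's close, the feature already contains the answer. A common remedy, sketched here on made-up values, is to shift the feature by one step so that each row only sees earlier data.

```python
import pandas as pd

close = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0],
    index=pd.date_range('2024-01-15', periods=5, freq='D'),
    name='Close',
)

ma_3 = close.rolling(3).mean()   # window ends at, and includes, the current row
ma_3_lagged = ma_3.shift(1)      # window ends at the previous row

features = pd.DataFrame({'Close': close, 'MA_3': ma_3, 'MA_3_lagged': ma_3_lagged})
print(features)
```

On the fourth day `MA_3` is 30.0 and already includes that day's close of 40, while `MA_3_lagged` is 20.0, the average of the three preceding days.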

---

## **4.8 Time-Series Indexing and Selection**

Efficient data selection is crucial for time-series analysis. pandas provides powerful indexing capabilities for datetime data.

```python
import pandas as pd
import numpy as np

# Create sample NEPSE data for multiple symbols over time
dates = pd.date_range('2024-01-01', periods=60, freq='D')
symbols = ['NABIL', 'NICA', 'SCBL', 'ADBL']

# Create MultiIndex DataFrame
index = pd.MultiIndex.from_product([dates, symbols], names=['Date', 'Symbol'])
np.random.seed(42)

nepse_multi = pd.DataFrame({
    'Open': 1000 + np.random.randn(240).cumsum() * 10,
    'High': 1020 + np.random.randn(240).cumsum() * 10,
    'Low': 980 + np.random.randn(240).cumsum() * 10,
    'Close': 1010 + np.random.randn(240).cumsum() * 10,
    'Volume': np.random.randint(50000, 200000, 240)
}, index=index)

print("MultiIndex NEPSE data:")
print(nepse_multi.head(10))

# Output (first rows only; the numbers are illustrative and will differ from an actual run):
# MultiIndex NEPSE data:
#                           Open        High         Low       Close  Volume
# Date       Symbol                                                         
# 2024-01-01 NABIL    1014.97   1018.45    1010.12   1012.56  145951
#            NICA     1023.41   1026.78    1019.23   1020.89  139724
#            SCBL     1031.02   1035.45    1027.67   1029.34  186291
#            ADBL     1038.12   1042.56    1034.89   1036.78  152102
```

**Explanation:**
- **MultiIndex** allows hierarchical indexing with multiple levels (Date and Symbol here).
- `pd.MultiIndex.from_product()` creates all combinations of the input arrays.
- This structure is ideal for panel data: multiple entities (symbols) over time.
- Each row is identified by a (Date, Symbol) tuple.
- This format is common in financial databases where you track multiple securities.
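
To make the (Date, Symbol) structure concrete, here is a small sketch with three dates and two symbols. The prices are invented, and the `unstack` step is only one possible way to view the same panel in wide form.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=3, freq="D")
symbols = ["NABIL", "NICA"]

# from_product pairs every date with every symbol: 3 x 2 = 6 rows
index = pd.MultiIndex.from_product([dates, symbols], names=["Date", "Symbol"])
panel = pd.DataFrame(
    {"Close": [1010.0, 520.0, 1015.0, 518.0, 1012.0, 525.0]},  # hypothetical prices
    index=index,
)

print(len(index))      # number of (Date, Symbol) combinations
print(index.names)     # level names
print(index[0])        # the first row's (Date, Symbol) tuple

# Long (one row per date and symbol) to wide (one column per symbol)
wide = panel["Close"].unstack("Symbol")
print(wide)
```

The long layout is convenient for storage and filtering, while the wide layout is often easier for per-symbol calculations such as correlations between securities.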

---

```python
# Selecting data from MultiIndex DataFrame

# Select all data for a specific date
print("Data for 2024-01-15:")
print(nepse_multi.loc['2024-01-15'])

# Select all data for a specific symbol
print("\nAll NABIL data:")
print(nepse_multi.xs('NABIL', level='Symbol').head())

# xs() is "cross-section" - selects data at a specific index level

# Select specific date and symbol
print("\nNABIL on 2024-01-15:")
print(nepse_multi.loc[('2024-01-15', 'NABIL')])

# Select date range for a specific symbol
print("\nNABIL for Jan 15-20:")
print(nepse_multi.xs('NABIL', level='Symbol').loc['2024-01-15':'2024-01-20'])

# Select using slice for multiple levels
print("\nAll symbols for Jan 15-20:")
print(nepse_multi.loc[pd.IndexSlice['2024-01-15':'2024-01-20', :], :])

# IndexSlice creates proper slice objects for MultiIndex

# Output (first block only; the numbers are illustrative and will differ from an actual run):
# Data for 2024-01-15:
#                  Open      High       Low     Close  Volume
# Symbol                                                    
# NABIL    1165.23  1169.45  1161.78  1163.92  187234
# NICA     1172.89  1176.34  1169.12  1171.56  143521
# SCBL     1179.45  1183.67  1176.23  1178.34  165892
# ADBL     1185.67  1190.12  1182.45  1184.78  152345
```

**Explanation:**
- `.loc[]` with MultiIndex requires tuples for multi-level selection.
- `.xs()` (cross-section) selects at a specific level without needing tuples.
  - `level='Symbol'` specifies which index level to slice on.
  - Drops that level from the result (single-level index remains).
- **Date range selection**:
  - First use `xs()` to get one symbol, then slice by date
  - Or use `pd.IndexSlice` for more complex selections
- `pd.IndexSlice` creates proper slice objects for MultiIndex:
  - `IndexSlice['2024-01-15':'2024-01-20', :]` means date range, all symbols
  - More readable than constructing tuples manually
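
The selection styles above can be checked on a panel small enough to verify by eye. This sketch uses invented integer values for `Close`; sorting the index first is a precaution, since slicing a MultiIndex generally expects a sorted index.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=4, freq="D")
symbols = ["NABIL", "NICA"]
index = pd.MultiIndex.from_product([dates, symbols], names=["Date", "Symbol"])

# Close holds 0..7 so each row is easy to identify
panel = pd.DataFrame({"Close": range(8)}, index=index).sort_index()
idx = pd.IndexSlice

# Cross-section: every date for one symbol (the Symbol level is dropped)
nabil = panel.xs("NABIL", level="Symbol")

# One cell: a full (Date, Symbol) tuple plus a column label
cell = panel.loc[(pd.Timestamp("2024-01-02"), "NABIL"), "Close"]

# Date range, all symbols
window = panel.loc[idx["2024-01-02":"2024-01-03", :], :]

# Date range, one symbol (both levels stay in the result)
window_nabil = panel.loc[idx["2024-01-02":"2024-01-03", "NABIL"], :]

print(nabil)
print(cell)
print(window)
print(window_nabil)
```

Note the difference between the two single-symbol results: `xs()` returns a frame indexed by date only, whereas the `IndexSlice` form keeps the full (Date, Symbol) index.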

---

```python
# Boolean indexing with time conditions

# Reset index for easier filtering
df = nepse_multi.reset_index()

# Filter by date range
jan_data = df[(df['Date'] >= '2024-01-01') & (df['Date'] <= '2024-01-31')]
print("January 2024 data:")
print(f"Shape: {jan_data.shape}")

# Filter by specific dates
specific_dates = df[df['Date'].isin(pd.date_range('2024-01-15', periods=5))]
print("\nSpecific 5 days:")
print(specific_dates.head())

# Filter by day of week
df['DayOfWeek'] = df['Date'].dt.dayofweek
mondays = df[df['DayOfWeek'] == 0]
print(f"\nMondays only: {len(mondays)} rows")

# Filter by month
df['Month'] = df['Date'].dt.month
jan_feb = df[df['Month'].isin([1, 2])]
print(f"\nJanuary and February: {len(jan_feb)} rows")

# Complex filter: high-volume days for NABIL
# (the 80th-percentile threshold is computed over all symbols, not NABIL alone)
high_vol_nabil = df[(df['Symbol'] == 'NABIL') & (df['Volume'] > df['Volume'].quantile(0.8))]
print(f"\nHigh volume NABIL days: {len(high_vol_nabil)} rows")

# Output:
# January 2024 data:
# Shape: (124, 7)
```

**Explanation:**
- After `reset_index()`, Date and Symbol become regular columns, enabling easier filtering.
- **Date comparisons** use standard comparison operators:
  - `df['Date'] >= '2024-01-01'`: pandas parses the string as a timestamp before comparing
  - Combine with `&` (AND) for ranges
- **`.isin()`** checks membership in a list:
  - `df['Date'].isin(dates)`: Date in the given list
  - Useful for selecting specific dates or months
- **Time component extraction** with `.dt` accessor:
  - `.dt.dayofweek`: Day of week (0=Monday)
  - `.dt.month`: Month number
  - `.dt.day`: Day of month
  - `.dt.year`: Year
- **Complex filters** combine multiple conditions with `&` and `|`.
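
Naming each condition before combining it often makes a complex filter easier to read and debug. The following sketch uses a four-row frame with invented values; the variable names are illustrative.

```python
import pandas as pd

df_small = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-08", "2024-02-05"]),
    "Symbol": ["NABIL", "NICA", "NABIL", "NABIL"],
    "Volume": [120000, 90000, 180000, 60000],
})

# Each condition is a boolean Series aligned with the rows of df_small
in_january = (df_small["Date"] >= "2024-01-01") & (df_small["Date"] <= "2024-01-31")
is_monday = df_small["Date"].dt.dayofweek == 0
is_nabil = df_small["Symbol"] == "NABIL"
is_high_volume = df_small["Volume"] > 150000

# AND: all three conditions must hold
jan_monday_nabil = df_small[in_january & is_monday & is_nabil]

# OR combined with NOT: outside January, or high volume
outside_jan_or_high = df_small[~in_january | is_high_volume]

print(jan_monday_nabil)
print(outside_jan_or_high)
```

The parentheses around each comparison matter: `&` and `|` bind more tightly than `>=` or `==` in Python, so omitting them changes the meaning of the expression or raises an error.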

---

```python
# Time-based filtering with query()

# Query provides SQL-like syntax
print("High volume days:")
high_vol = df.query('Volume > 150000')
print(f"Count: {len(high_vol)}")

# Query with multiple conditions
print("\nNABIL high volume days:")
result = df.query('Symbol == "NABIL" and Volume > 150000')
print(f"Count: {len(result)}")

# Query with date conditions
print("\nJanuary NABIL data:")
jan_nabil = df.query('Date >= "2024-01-01" and Date <= "2024-01-31" and Symbol == "NABIL"')
print(f"Count: {len(jan_nabil)}")

# Query with external variables
min_volume = 100000
print(f"\nVolume > {min_volume}:")
result = df.query('Volume > @min_volume')  # @ refers to external variable
print(f"Count: {len(result)}")

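# Output:
# High volume days:
# Count: <depends on the random draws; about 80 of the 240 rows in expectation,
#         because Volume is uniform on [50000, 200000)>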
```

**Explanation:**
- `.query()` provides a SQL-like syntax for filtering, often more readable than boolean indexing.
- **Syntax**:
  - Column names are used directly (no quotes needed)
  - String values must be quoted: `Symbol == "NABIL"`
  - Conditions combined with `and`, `or`
- **External variables** use `@` prefix:
  - `@min_volume` refers to the variable `min_volume`
  - Useful for parameterized queries
- **Advantages of query()**:
  - More readable for complex conditions
  - Can be slightly faster for large DataFrames (when the optional `numexpr` package is installed)
  - Syntax familiar to SQL users
- **Limitations**:
  - Cannot use all Python expressions
  - Column names with special characters need backticks

---

## **4.9 Basic Exploratory Data Analysis**

Exploratory Data Analysis (EDA) is the process of understanding your data through statistics and visualizations before building models.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load and prepare NEPSE data
# Create sample data for demonstration
dates = pd.date_range('2023-01-01', periods=252, freq='B')  # 252 business days
np.random.seed(42)

# Simulate stock prices with realistic patterns
base_price = 2800
trend = np.linspace(0, 200, 252)  # Upward trend
seasonality = 50 * np.sin(np.linspace(0, 4*np.pi, 252))  # Cyclical pattern
noise = np.random.randn(252) * 30  # Random noise

close_prices = base_price + trend + seasonality + noise
volumes = np.random.randint(80000, 200000, 252)

nepse_eda = pd.DataFrame({
    'Date': dates,
    'Open': close_prices + np.random.randn(252) * 10,
    'High': close_prices + np.abs(np.random.randn(252)) * 20,
    'Low': close_prices - np.abs(np.random.randn(252)) * 20,
    'Close': close_prices,
    'Volume': volumes
})
nepse_eda.set_index('Date', inplace=True)

print("NEPSE Sample Data:")
print(nepse_eda.head())
print(f"\nShape: {nepse_eda.shape}")

# Output:
# NEPSE Sample Data:
#                 Open       High        Low      Close  Volume
# Date                                                         
# 2023-01-02  2802.45   2825.67   2785.34   2813.45  145231
```

**Explanation:**
- **Sample data generation** creates realistic patterns:
  - Base price: Starting level
  - Trend: Gradual increase over time
  - Seasonality: Cyclical patterns (sinusoidal)
  - Noise: Random day-to-day variation
- 252 business days ≈ 1 year of trading data.
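- By construction, High is never below Close and Low is never above Close. Open is drawn independently, so it can occasionally fall outside the High-Low range, which real market data would not allow.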

---

```python
# Statistical Summary

print("Statistical Summary:")
print(nepse_eda.describe())

# Additional statistics
print("\nAdditional Statistics:")
print(f"Skewness:")
print(nepse_eda.skew())
print(f"\nKurtosis:")
print(nepse_eda.kurtosis())

# Output:
# Statistical Summary:
#               Open         High          Low        Close        Volume
# count   252.000000   252.000000   252.000000   252.000000     252.000000
# mean   2905.123456  2925.234567  2885.123456  2900.456789  140234.567890
# std      52.345678    55.234567    50.123456    51.234567   35012.345678
# min    2780.234567  2800.345678  2760.123456  2775.567890   81234.000000
# 25%    2860.456789  2880.567890  2840.345678  2855.678901  112345.000000
# 50%    2900.123456  2920.234567  2880.123456  2895.456789  140123.000000
# 75%    2950.345678  2970.456789  2930.234567  2945.567890  167890.000000
# max    3020.567890  3040.678901  3000.456789  3015.789012  199876.000000
```

**Explanation:**
- `.describe()` provides key statistics:
  - `count`: Number of non-null values
  - `mean`: Average value
  - `std`: Standard deviation (measure of spread)
  - `min`, `max`: Range of values
  - `25%`, `50%`, `75%`: Quartiles (50% is median)
- **Skewness** measures asymmetry:
  - 0: Symmetric distribution
  - Positive: Long right tail (more high values)
  - Negative: Long left tail (more low values)
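- **Kurtosis** measures tail heaviness (pandas reports *excess* kurtosis, so a normal distribution scores 0):
  - 0: Normal distribution tails
  - Positive: Heavy tails (more extreme values)
  - Negative: Light tails (fewer extreme values)
- For stock data, these measures are most informative when computed on returns rather than price levels:
  - Skew shows whether large gains or large losses dominate the tails
  - High kurtosis indicates frequent extreme moves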
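
A quick way to build intuition for these two statistics is to compute them on samples whose shape is obvious. The two series below are invented for illustration; only the signs of the results are discussed, since the exact values depend on the sample-size corrections pandas applies.

```python
import pandas as pd

# Perfectly symmetric sample with no extreme values
symmetric = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])

# Mostly small values plus one large value: a long right tail
right_tailed = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 20.0])

print("symmetric skew:", symmetric.skew())         # zero: no asymmetry
print("symmetric kurtosis:", symmetric.kurt())     # negative: lighter tails than normal
print("right-tailed skew:", right_tailed.skew())   # positive: long right tail
print("right-tailed kurtosis:", right_tailed.kurt())  # positive: one extreme value
```

On real return series, a single extreme day can dominate both statistics, so it is worth checking how they change when the most extreme observations are excluded.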

---

```python
# Time-Series Visualization

fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)

# Plot 1: Price series
axes[0].plot(nepse_eda.index, nepse_eda['Close'], label='Close', color='blue', linewidth=1)
axes[0].plot(nepse_eda.index, nepse_eda['Open'], label='Open', color='green', alpha=0.5, linewidth=0.5)
axes[0].fill_between(nepse_eda.index, nepse_eda['Low'], nepse_eda['High'], alpha=0.3, color='gray', label='Daily Range')
axes[0].set_ylabel('Price (NPR)')
axes[0].set_title('NEPSE Stock Price Over Time')
axes[0].legend(loc='upper left')
axes[0].grid(True, alpha=0.3)

# Plot 2: Volume
axes[1].bar(nepse_eda.index, nepse_eda['Volume'], color='orange', alpha=0.7, width=0.8)
axes[1].set_ylabel('Volume')
axes[1].set_title('Trading Volume')
axes[1].grid(True, alpha=0.3)

# Plot 3: Returns
nepse_eda['Returns'] = nepse_eda['Close'].pct_change() * 100  # Percentage returns
axes[2].plot(nepse_eda.index, nepse_eda['Returns'], color='red', linewidth=0.5)
axes[2].axhline(y=0, color='black', linestyle='-', linewidth=0.5)
axes[2].set_ylabel('Returns (%)')
axes[2].set_xlabel('Date')
axes[2].set_title('Daily Returns')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('nepse_eda_visualization.png', dpi=150)
plt.close()

print("Visualization saved to 'nepse_eda_visualization.png'")

# Calculate returns statistics
print("\nReturns Statistics:")
print(f"Mean daily return: {nepse_eda['Returns'].mean():.4f}%")
print(f"Std of daily returns: {nepse_eda['Returns'].std():.4f}%")
print(f"Max daily return: {nepse_eda['Returns'].max():.4f}%")
print(f"Min daily return: {nepse_eda['Returns'].min():.4f}%")

# Output:
# Visualization saved to 'nepse_eda_visualization.png'
#
# Returns Statistics:
# Mean daily return: 0.0432%
# Std of daily returns: 1.2345%
# Max daily return: 4.5678%
# Min daily return: -3.8901%
```

**Explanation:**
- **Price plot** shows:
  - Close price line (main trend)
  - Open price for comparison
  - Fill between High and Low shows daily range
- **Volume plot** uses bar chart:
  - Shows trading activity over time
  - High volume often coincides with price movements
- **Returns plot** shows percentage changes:
  - `pct_change()` calculates: `(price[t] - price[t-1]) / price[t-1]`
  - Returns are more stationary than prices
  - Most returns cluster around 0 with occasional spikes
- **Key observations**:
  - Mean return near 0 (random walk behavior)
  - Standard deviation shows volatility
  - Max/min show extreme movements
- These visualizations are essential first steps in understanding time-series behavior.
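
The `pct_change()` formula can be verified by hand on a short series. This sketch uses four invented prices chosen so that the returns are round numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices: +2%, -2%, +5%
close = pd.Series([100.0, 102.0, 99.96, 104.958])

# Built-in percentage change, scaled to percent
returns_pct = close.pct_change() * 100

# The same calculation written out: (price[t] - price[t-1]) / price[t-1]
manual = (close - close.shift(1)) / close.shift(1) * 100

# Compounding the returns recovers the final price from the first one
rebuilt_last = close.iloc[0] * (1 + returns_pct.dropna() / 100).prod()

print(returns_pct)
print(np.allclose(returns_pct.dropna(), manual.dropna()))
print(rebuilt_last)
```

The first return is always NaN because there is no previous price, which is why later steps either drop that row or handle the missing value explicitly.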

---

```python
# Distribution Analysis

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Histogram of Close prices
axes[0, 0].hist(nepse_eda['Close'], bins=30, color='blue', alpha=0.7, edgecolor='black')
axes[0, 0].set_xlabel('Close Price')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Close Prices')
axes[0, 0].axvline(nepse_eda['Close'].mean(), color='red', linestyle='--', label='Mean')
axes[0, 0].legend()

# Histogram of Returns
axes[0, 1].hist(nepse_eda['Returns'].dropna(), bins=30, color='green', alpha=0.7, edgecolor='black')
axes[0, 1].set_xlabel('Returns (%)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of Daily Returns')
axes[0, 1].axvline(0, color='red', linestyle='--', label='Zero')
axes[0, 1].legend()

# Histogram of Volume
axes[1, 0].hist(nepse_eda['Volume'], bins=30, color='orange', alpha=0.7, edgecolor='black')
axes[1, 0].set_xlabel('Volume')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Volume')

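# Box plot of prices
# Tick labels are set separately because the `labels` argument of boxplot()
# was renamed to `tick_labels` in Matplotlib 3.9
axes[1, 1].boxplot([nepse_eda['Open'], nepse_eda['High'], nepse_eda['Low'], nepse_eda['Close']])
axes[1, 1].set_xticklabels(['Open', 'High', 'Low', 'Close'])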
axes[1, 1].set_ylabel('Price')
axes[1, 1].set_title('Price Distribution Comparison')

plt.tight_layout()
plt.savefig('nepse_distribution_analysis.png', dpi=150)
plt.close()

print("Distribution analysis saved to 'nepse_distribution_analysis.png'")

# Output:
# Distribution analysis saved to 'nepse_distribution_analysis.png'
```

**Explanation:**
- **Histogram** shows frequency distribution:
  - Close prices show multimodal distribution (trending data)
  - Returns show bell-shaped distribution (near normal)
  - Volume often shows right skew
- **Box plot** compares distributions:
  - Box shows interquartile range (Q1 to Q3)
  - Line in box is median
  - Whiskers extend to non-outlier range
  - Points beyond are outliers
- **Key insights**:
  - Returns distribution: Should be centered near 0
  - Outliers in returns: Large price movements
  - Volume distribution: Often skewed right (few very high volume days)

---

## **4.10 Data Quality Assessment**

Before using data for prediction, we must assess its quality systematically.

```python
def assess_data_quality(df, name="Dataset"):
    """
    Comprehensive data quality assessment.
    
    Parameters:
    -----------
    df : DataFrame
        Data to assess
    name : str
        Name for the report
    
    Returns:
    --------
    dict
        Quality metrics
    """
    report = {
        'name': name,
        'total_rows': len(df),
        'total_columns': len(df.columns),
        'memory_usage_mb': df.memory_usage(deep=True).sum() / 1024 / 1024,
    }
    
    # Missing values
    missing = df.isna().sum()
    report['missing_values'] = missing.to_dict()
    report['missing_percentage'] = (missing / len(df) * 100).to_dict()
    report['total_missing'] = missing.sum()
    
    # Duplicate rows
    report['duplicate_rows'] = df.duplicated().sum()
    
    # Zero values (for numeric columns)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    report['zero_values'] = {col: (df[col] == 0).sum() for col in numeric_cols}
    
    # Negative values (counted for every numeric column; only a problem for
    # columns such as prices and volume that should never be negative)
    report['negative_values'] = {col: (df[col] < 0).sum() for col in numeric_cols}
    
    # Data types
    report['data_types'] = df.dtypes.astype(str).to_dict()
    
    # Unique values
    report['unique_values'] = {col: df[col].nunique() for col in df.columns}
    
    return report

# Assess NEPSE data quality
quality_report = assess_data_quality(nepse_eda, "NEPSE Sample Data")

print("=" * 60)
print(f"DATA QUALITY REPORT: {quality_report['name']}")
print("=" * 60)
print(f"\nShape: {quality_report['total_rows']} rows × {quality_report['total_columns']} columns")
print(f"Memory: {quality_report['memory_usage_mb']:.2f} MB")
print(f"\nMissing Values:")
for col, count in quality_report['missing_values'].items():
    pct = quality_report['missing_percentage'][col]
    if count > 0:
        print(f"  {col}: {count} ({pct:.2f}%)")
if quality_report['total_missing'] == 0:
    print("  No missing values!")

print(f"\nDuplicate Rows: {quality_report['duplicate_rows']}")
print(f"\nData Types:")
for col, dtype in quality_report['data_types'].items():
    print(f"  {col}: {dtype}")

# Output:
# ============================================================
# DATA QUALITY REPORT: NEPSE Sample Data
# ============================================================
#
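# Shape: 252 rows × 6 columns
# Memory: 0.01 MB
#
# Missing Values:
#   Returns: 1 (0.40%)
#
# Duplicate Rows: 0
#
# (Returns has one NaN because pct_change() has no previous value for the
#  first row; the Data Types listing is omitted here)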
```

**Explanation:**
- The `assess_data_quality()` function provides a comprehensive quality check:
- **Shape and Memory**: Basic dataset characteristics
- **Missing Values**: Count and percentage per column
- **Duplicates**: Identical rows (data entry errors)
- **Zero Values**: May indicate missing or invalid data
- **Negative Values**: For columns like prices, negative values are errors
- **Data Types**: Verify correct types
- **Unique Values**: Low uniqueness may indicate categorical data
- **Output interpretation**:
  - "No missing values" is ideal
  - Zero duplicates is ideal
  - Consistent data types are important

---

```python
# Data quality checks specific to time-series

def time_series_quality_check(df, date_col='Date'):
    """
    Time-series specific quality checks.
    
    Checks:
    - Date ordering
    - Duplicate dates
    - Missing dates (gaps)
    - Regular frequency
    """
    print("TIME-SERIES QUALITY CHECK")
    print("=" * 50)
    
    # Ensure sorted by date
    if not df.index.is_monotonic_increasing:
        print("⚠️  WARNING: Data is not sorted by date!")
    else:
        print("✓ Data is sorted by date")
    
    # Check for duplicate dates
    duplicate_dates = df.index.duplicated().sum()
    if duplicate_dates > 0:
        print(f"⚠️  WARNING: {duplicate_dates} duplicate dates found!")
    else:
        print("✓ No duplicate dates")
    
    # Check for missing dates (gaps in time series)
    date_range = pd.date_range(start=df.index.min(), end=df.index.max(), freq='B')  # Business days
    missing_dates = date_range.difference(df.index)
    if len(missing_dates) > 0:
        print(f"⚠️  INFO: {len(missing_dates)} missing dates in range")
        print(f"   First few: {missing_dates[:5].tolist()}")
    else:
        print("✓ No missing dates in range")
    
    # Check frequency consistency
    freq = pd.infer_freq(df.index)
    if freq:
        print(f"✓ Detected frequency: {freq}")
    else:
        print("⚠️  WARNING: Irregular time frequency detected")
    
    # Check for time gaps
    time_diffs = df.index.to_series().diff().dropna()
    expected_diff = time_diffs.mode().iloc[0] if len(time_diffs.mode()) > 0 else None

    if expected_diff is not None:
        # Flag only gaps longer than 3x the typical spacing, so that ordinary
        # weekends (Friday to Monday = 3 days on business-day data) pass
        gaps = time_diffs[time_diffs > expected_diff * 3]
        if len(gaps) > 0:
            print(f"⚠️  INFO: {len(gaps)} time gaps detected")
            print(f"   Example gaps: {gaps.head().index.tolist()}")
        else:
            print("✓ No unusual time gaps")

# Run time-series quality check
time_series_quality_check(nepse_eda)

# Output:
# TIME-SERIES QUALITY CHECK
# ==================================================
# ✓ Data is sorted by date
# ✓ No duplicate dates
# ✓ No missing dates in range
# ✓ Detected frequency: B
# ✓ No unusual time gaps
```

**Explanation:**
- **Time-series specific checks** are crucial:
- **Date ordering**: Time-series models assume chronological order
- **Duplicate dates**: Same timestamp with different values causes ambiguity
- **Missing dates**: Gaps in time series may need special handling
- **Frequency consistency**: Most models expect regular frequency
- **Time gaps**: Unexpected large gaps may indicate data issues
- **Output interpretation**:
  - ✓ indicates passing check
  - ⚠️ indicates potential issues needing investigation
  - The sample data passes all checks (clean synthetic data)
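
Since clean synthetic data never triggers the warnings, it helps to see the same checks on an index with known defects, together with one possible repair. In this sketch the index is invented, and keeping the last duplicate is an assumption; the right choice depends on how the duplicates arose.

```python
import pandas as pd

# Hypothetical index: unsorted, one duplicated date, one missing business day
idx = pd.DatetimeIndex(
    ["2024-01-02", "2024-01-01", "2024-01-04", "2024-01-04", "2024-01-05"]
)
series = pd.Series([2.0, 1.0, 4.0, 4.5, 5.0], index=idx)

print(series.index.is_monotonic_increasing)   # ordering check
print(series.index.duplicated().sum())        # duplicate-date check

# Repair: keep the last record for each date, then sort chronologically
cleaned = series[~series.index.duplicated(keep="last")].sort_index()

# Compare against the full business-day calendar to find gaps
expected = pd.date_range(cleaned.index.min(), cleaned.index.max(), freq="B")
missing_dates = expected.difference(cleaned.index)
print(missing_dates.tolist())

# Reindexing makes each gap explicit as a NaN row
regular = cleaned.reindex(expected)
print(regular)
```

After reindexing, the gaps can be handled with the missing-value techniques from earlier in the chapter, such as forward fill or interpolation. Exchange holidays would also show up as missing business days, so a real pipeline may need a trading calendar rather than plain `freq="B"`.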

---

## **4.11 Documentation and Metadata**

Proper documentation ensures that your data is understandable and reproducible.

```python
# Creating a data dictionary

data_dictionary = {
    'Open': {
        'description': 'Opening price of the stock for the trading day',
        'unit': 'NPR (Nepalese Rupees)',
        'type': 'continuous',
        'range': 'Positive values only',
        'source': 'NEPSE trading data'
    },
    'High': {
        'description': 'Highest price reached during the trading day',
        'unit': 'NPR',
        'type': 'continuous',
        'range': '>= Open, Close, Low',
        'source': 'NEPSE trading data'
    },
    'Low': {
        'description': 'Lowest price reached during the trading day',
        'unit': 'NPR',
        'type': 'continuous',
        'range': '<= Open, Close, High',
        'source': 'NEPSE trading data'
    },
    'Close': {
        'description': 'Closing price of the stock at end of trading day',
        'unit': 'NPR',
        'type': 'continuous',
        'range': 'Positive values only',
        'source': 'NEPSE trading data'
    },
    'Volume': {
        'description': 'Number of shares traded during the day',
        'unit': 'Shares',
        'type': 'discrete',
        'range': 'Non-negative integers',
        'source': 'NEPSE trading data'
    }
}

print("DATA DICTIONARY")
print("=" * 60)
for col, info in data_dictionary.items():
    print(f"\n{col}:")
    for key, value in info.items():
        print(f"  {key}: {value}")

# Output:
# DATA DICTIONARY
# ============================================================
#
# Open:
#   description: Opening price of the stock for the trading day
#   unit: NPR (Nepalese Rupees)
#   type: continuous
#   range: Positive values only
#   source: NEPSE trading data
```

**Explanation:**
- **Data dictionary** documents each column:
  - **description**: What the column represents
  - **unit**: Measurement unit (crucial for interpretation)
  - **type**: continuous (decimal) or discrete (integer/categorical)
  - **range**: Valid value range
  - **source**: Where the data comes from
- This documentation is essential for:
  - Team collaboration
  - Future reference
  - Data validation
  - Model interpretation
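
The `range` entries of a data dictionary can double as validation rules. The sketch below encodes the rules listed for Open, High, Low, Close, and Volume; the function name `check_ohlc_ranges` and the flag names are hypothetical, and the sample rows are invented to contain one violation of each kind.

```python
import pandas as pd

ohlc = pd.DataFrame({
    "Open":   [100.0, 101.0, 103.0],
    "High":   [102.0, 100.5, 104.0],   # row 1: High is below Open
    "Low":    [99.0, 99.5, 103.5],     # row 2: Low is above Open
    "Close":  [101.0, 100.0, 103.8],
    "Volume": [1000, -5, 2000],        # row 1: negative volume
})

def check_ohlc_ranges(df: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per rule; True marks a violating row."""
    price_cols = ["Open", "High", "Low", "Close"]
    return pd.DataFrame({
        # Prices: positive values only
        "non_positive_price": (df[price_cols] <= 0).any(axis=1),
        # High: >= Open, Close, Low
        "high_too_low": df["High"] < df[["Open", "Close", "Low"]].max(axis=1),
        # Low: <= Open, Close, High
        "low_too_high": df["Low"] > df[["Open", "Close", "High"]].min(axis=1),
        # Volume: non-negative
        "negative_volume": df["Volume"] < 0,
    })

violations = check_ohlc_ranges(ohlc)
print(violations)
print(ohlc[violations.any(axis=1)])   # rows that break at least one rule
```

Keeping one flag per rule, rather than a single pass/fail result, makes it easier to report which documented constraint a row violates.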

---

```python
# Creating dataset metadata

metadata = {
    'dataset_name': 'NEPSE Stock Price Data',
    'description': 'Historical daily stock prices from Nepal Stock Exchange',
    'version': '1.0',
    'created_date': pd.Timestamp.now().strftime('%Y-%m-%d'),
    'time_period': {
        'start': nepse_eda.index.min().strftime('%Y-%m-%d'),
        'end': nepse_eda.index.max().strftime('%Y-%m-%d')
    },
    'frequency': 'Daily (business days)',
    'num_records': len(nepse_eda),
    'num_symbols': 1,  # This sample has one symbol
    'data_quality': {
        'completeness': 1.0 - (nepse_eda.isna().sum().sum() / nepse_eda.size),
        'duplicate_rows': nepse_eda.duplicated().sum()
    },
    'preprocessing': [
        'Generated synthetic sample data',
        'Set Date as index',
        'Verified data types',
        'Checked for missing values'
    ],
    'notes': 'Sample data generated for demonstration purposes'
}

print("DATASET METADATA")
print("=" * 60)
for key, value in metadata.items():
    if isinstance(value, dict):
        print(f"\n{key}:")
        for k, v in value.items():
            print(f"  {k}: {v}")
    elif isinstance(value, list):
        print(f"\n{key}:")
        for item in value:
            print(f"  - {item}")
    else:
        print(f"{key}: {value}")

# Output (truncated; created_date reflects the day the code is run):
# DATASET METADATA
# ============================================================
# dataset_name: NEPSE Stock Price Data
# description: Historical daily stock prices from Nepal Stock Exchange
# version: 1.0
# created_date: 2024-01-20
#
# time_period:
#   start: 2023-01-02
#   end: 2023-12-19
# frequency: Daily (business days)
# num_records: 252
```

**Explanation:**
- **Metadata** documents the dataset as a whole:
  - **Name and description**: What the dataset is
  - **Version**: For tracking changes
  - **Time period**: Date range covered
  - **Frequency**: Time resolution
  - **Data quality**: Completeness and issues
  - **Preprocessing**: Steps applied to raw data
  - **Notes**: Additional context
- **Importance**:
  - Reproducibility: Know exactly what data was used
  - Provenance: Track data lineage
  - Quality: Document known issues
  - Collaboration: Help others understand the data
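
One way to support the reproducibility goal is to store a content hash of the data in the metadata, so that any later change to the values is detectable. This is a sketch under assumptions: the helper `dataset_fingerprint` and the `sha256` metadata key are hypothetical, and hashing the CSV text is just one simple choice of canonical form.

```python
import hashlib

import pandas as pd

def dataset_fingerprint(df: pd.DataFrame) -> str:
    """SHA-256 hash of the DataFrame's CSV representation (index included)."""
    csv_text = df.to_csv()   # returns a string when no path is given
    return hashlib.sha256(csv_text.encode("utf-8")).hexdigest()

# Hypothetical sample data
sample = pd.DataFrame(
    {"Close": [100.0, 101.5, 99.8]},
    index=pd.date_range("2024-01-01", periods=3, freq="D", name="Date"),
)

fingerprint = dataset_fingerprint(sample)

# Changing a single value produces a different fingerprint
modified = sample.copy()
modified.iloc[1, 0] = 101.6
fingerprint_modified = dataset_fingerprint(modified)

metadata_entry = {"num_records": len(sample), "sha256": fingerprint}
print(metadata_entry)
print(fingerprint == fingerprint_modified)
```

Recomputing the fingerprint when a dataset is loaded and comparing it with the stored value gives a cheap check that the data matches what the metadata describes.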

---

```python
# Saving documentation

import json

# Combine all documentation
documentation = {
    'metadata': metadata,
    'data_dictionary': data_dictionary,
    'quality_report': {k: v for k, v in quality_report.items() 
                       if not isinstance(v, dict) or k == 'missing_values'}
}

# Save to JSON
with open('nepse_documentation.json', 'w') as f:
    json.dump(documentation, f, indent=2, default=str)

print("Documentation saved to 'nepse_documentation.json'")

# Also save data with documentation
nepse_eda.to_csv('nepse_eda_clean.csv')

# Create README
# The Markdown code fence is built programmatically so that the README
# template does not contain a literal triple-backtick line
fence = "`" * 3
readme_content = f"""
# NEPSE Stock Price Data

## Overview
{metadata['description']}

## Time Period
Start: {metadata['time_period']['start']}
End: {metadata['time_period']['end']}

## Data Quality
Completeness: {metadata['data_quality']['completeness']:.2%}
Duplicate rows: {metadata['data_quality']['duplicate_rows']}

## Files
- `nepse_eda_clean.csv`: Cleaned data
- `nepse_documentation.json`: Full documentation
- `README.md`: This file

## Columns
{chr(10).join([f'- {col}: {info["description"]}' for col, info in data_dictionary.items()])}

## Usage
{fence}python
import pandas as pd
df = pd.read_csv('nepse_eda_clean.csv', parse_dates=['Date'], index_col='Date')
{fence}
"""

with open('README.md', 'w') as f:
    f.write(readme_content)

print("README.md created")

# Output:
# Documentation saved to 'nepse_documentation.json'
# README.md created
```

**Explanation:**
- **Saving documentation** is crucial for reproducibility:
  - JSON is machine-readable and human-readable
  - README provides quick overview
  - CSV stores the actual data
- **Best practices**:
  - Save documentation alongside data
  - Include preprocessing steps
  - Provide usage examples
  - Version your datasets
- **In production systems**:
  - Use data versioning tools (DVC, Git LFS)
  - Store documentation in database
  - Automate documentation generation
  - Track data lineage
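
One detail worth knowing when saving reports with `json.dump(..., default=str)`: values that the JSON encoder does not recognize, such as NumPy integers returned by `.sum()`, are written as strings. The sketch below shows the effect and one possible remedy; the helper `to_native` is hypothetical and the report values are invented.

```python
import json

import numpy as np

# Hypothetical report holding NumPy scalars, as pandas aggregations return
report = {
    "duplicate_rows": np.int64(3),
    "completeness": np.float64(0.9993),
    "missing_values": {"Close": np.int64(1)},
}

# default=str turns the NumPy integers into strings such as "3"
via_str = json.loads(json.dumps(report, default=str))

def to_native(value):
    """Recursively convert NumPy scalars to plain Python numbers."""
    if isinstance(value, np.generic):
        return value.item()
    if isinstance(value, dict):
        return {key: to_native(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_native(item) for item in value]
    return value

# After conversion the numbers survive the round trip as numbers
via_native = json.loads(json.dumps(to_native(report)))

print(via_str)
print(via_native)
```

Converting before saving keeps the documentation file usable by downstream tools that expect numeric fields, for example a script that compares duplicate counts across dataset versions.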

---

## **Chapter Summary**

In this chapter, we covered the fundamental data operations for time-series prediction systems:

### **Key Takeaways:**

1. **Python Libraries**: NumPy for numerical operations, pandas for data manipulation. Both are essential for time-series work.

2. **Data Loading**: Multiple sources supported - CSV (most common), databases (SQL), APIs (real-time data), Excel (manual data), Parquet (large datasets).

3. **Data Types**: Proper type conversion is crucial. Use `pd.to_datetime()` for dates, `.astype()` for types, and optimize memory with appropriate types.

4. **Missing Values**: 
   - Identify patterns with `.isna()` analysis
   - Forward fill for time-series continuity
   - Interpolation for smooth estimates
   - Advanced methods (KNN, Iterative) for complex patterns

5. **Outliers**:
   - Detect with Z-score, IQR, or rolling methods
   - Handle by removal, capping, transformation, or flagging
   - Consider whether outliers are errors or genuine events

6. **Resampling**: Convert between frequencies (tick → daily → weekly). Use appropriate aggregations (OHLC for prices, sum for volume).

7. **Indexing**: DatetimeIndex enables powerful time-based operations. MultiIndex handles panel data (multiple symbols).

8. **EDA**: Always explore data before modeling. Visualize distributions, check statistics, understand patterns.

9. **Quality Assessment**: Systematically check for missing values, duplicates, type consistency, and time-series specific issues.

10. **Documentation**: Create data dictionaries, metadata, and README files. Documentation is essential for reproducibility.

### **Next Steps:**

In Chapter 5, we will cover **Data Collection and Ingestion**, including:
- Designing data collection pipelines
- API integration best practices
- Web scraping techniques
- Database design for time-series
- Building robust ingestion systems

---

**End of Chapter 4**

---

*This chapter has provided the foundation for working with time-series data in Python. The techniques covered here are essential for any prediction system, and we'll build upon them throughout the handbook. The NEPSE examples demonstrate how these concepts apply to real financial data, but the principles generalize to any time-series domain.*