### Exercises

#### Question 1

Alongside this notebook is a data file named `daily_quotes.csv`, which contains end-of-day (EOD) OHLC and volume data for a small number of equities over a six-month period.

The first step is to load this data into a dataframe, ensuring that all data types are correct (datetime objects for dates, floats for OHLC data, and integers for volume).

Write a function that receives the file name as an argument and returns a dataframe that:
- has the correct data type for each column (`str`, `datetime`, `float`, `int`)
- has a row index based on the `symbol` column

In addition, we would like our dataframe to contain columns named and ordered in a specific way:
- symbol (`str`)
- date (`datetime`)
- open (`float`)
- high (`float`)
- low (`float`)
- close (`float`)
- volume (`int`)

(with `symbol` being used as the row index)

Hint: 

You will want to read up on the Pandas docs for `read_csv` to see how you can handle datetime data directly while loading the data (in particular, look at the `parse_dates` option):

[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

Alternatively, you could convert these objects into proper datetime types after loading by using the Pandas function `to_datetime`, documented here:

[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)

and then use concatenation to build up a dataframe that replaces the "old" `date` column with the "new" (properly typed) column.
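
The two approaches in the hint can be compared on a tiny in-memory sample. This is only a sketch: the sample text and its lowercase headers are made up for illustration, and the real `daily_quotes.csv` may use different headers or date formats.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the real file's headers and values may differ.
sample_csv = "symbol,date,close\nAAA,2020-01-02,10.5\nAAA,2020-01-03,11.0\n"

# Option 1: parse the date column while reading.
parsed = pd.read_csv(io.StringIO(sample_csv), parse_dates=["date"])

# Option 2: read the dates as plain strings, convert them afterwards, then
# rebuild the frame by concatenating the untouched columns with the new one.
raw = pd.read_csv(io.StringIO(sample_csv))
converted = pd.to_datetime(raw["date"])
rebuilt = pd.concat([raw.drop(columns=["date"]), converted], axis=1)
rebuilt = rebuilt[["symbol", "date", "close"]]  # restore the column order

print(parsed.dtypes)
print(rebuilt.dtypes)
```

Both frames should end up with a datetime `date` column; the first option simply does the conversion in one step.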

In [1]:
from __future__ import annotations

from pathlib import Path
from typing import Union

import pandas as pd


def load_daily_quotes(csv_path: Union[str, Path]) -> pd.DataFrame:
    csv_path = Path(csv_path)
    if not csv_path.exists():
        raise FileNotFoundError(f"CSV file not found: {csv_path}")

    # Load without parse_dates first
    df = pd.read_csv(csv_path)

    # Normalize column names so that header casing and stray whitespace
    # do not break the lowercase lookups below
    df.columns = (
        df.columns
        .str.strip()
        .str.lower()
    )

    required_cols = ["symbol", "date", "open", "high", "low", "close", "volume"]
    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required column(s): {missing}")

    # Type conversions
    df["symbol"] = df["symbol"].astype("string")
    df["date"] = pd.to_datetime(df["date"], errors="raise")

    for col in ["open", "high", "low", "close"]:
        df[col] = pd.to_numeric(df[col], errors="raise").astype("float64")

    vol = pd.to_numeric(df["volume"], errors="raise")
    df["volume"] = vol.astype("Int64") if vol.isna().any() else vol.astype("int64")

    # Enforce column order
    df = df[required_cols]

    # Set index but keep symbol column
    df = df.set_index("symbol", drop=False)

    return df

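# Note: this second definition replaces the one above when the cell runs,
# so the header normalization from the first version is no longer applied.
def load_daily_quotes(csv_path: Union[str, Path]) -> pd.DataFrame: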
    """Load daily EOD quotes from a CSV into a well-typed DataFrame.

    Requirements satisfied:
    - date parsed as datetime
    - OHLC as float
    - volume as int
    - symbol kept as a column but also used as the row index
    - columns ordered exactly as: symbol, date, open, high, low, close, volume

    Parameters
    ----------
    csv_path:
        Path to the CSV file.

    Returns
    -------
    pd.DataFrame
        DataFrame indexed by symbol.
    """
    csv_path = Path(csv_path)
    if not csv_path.exists():
        raise FileNotFoundError(f"CSV file not found: {csv_path}")

    required_cols = ["symbol", "date", "open", "high", "low", "close", "volume"]

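    # Read with strong dtype hints and parse the date column at read time.
    # Caveat: parse_dates=["date"] needs a header spelled exactly `date`; otherwise
    # read_csv raises ValueError before the missing-column check below can run.
    # Note: volume may contain missing values in some datasets; we guard below.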
    df = pd.read_csv(
        csv_path,
        parse_dates=["date"],
        dtype={
            "symbol": "string",
            "open": "float64",
            "high": "float64",
            "low": "float64",
            "close": "float64",
        },
        keep_default_na=True,
    )

    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required column(s): {missing}")

    # Ensure volume is integer (nullable safe conversion).
    # If volume has no missing values, we downcast to a regular int64.
    vol = pd.to_numeric(df["volume"], errors="raise")
    if vol.isna().any():
        df["volume"] = vol.astype("Int64")
    else:
        df["volume"] = vol.astype("int64")

    # Ensure date is datetime even if parse_dates didn't trigger due to upstream formatting.
    df["date"] = pd.to_datetime(df["date"], errors="raise")

    # Enforce column order.
    df = df[required_cols]

    # Set index to symbol but keep symbol column.
    df = df.set_index("symbol", drop=False)

    # Defensive: consistent string dtype for symbol column.
    df["symbol"] = df["symbol"].astype("string")

    return df


# Example usage (uncomment if you want to run it):
# quotes = load_daily_quotes("daily_quotes.csv")
# quotes.head()


#### Question 2

Write a function that, given a dataframe structured like the one we created in Question 1 and a symbol name as a string (e.g. `AAPL`, `MSFT`, etc.):
- returns a similarly structured dataframe consisting of the row (or rows) containing the records with the highest volume for the given symbol
- raises a `ValueError` if the symbol is not in the dataframe

In [2]:
import pandas as pd


def rows_with_max_volume(df: pd.DataFrame, symbol: str) -> pd.DataFrame:
    """Return the row(s) with the highest volume for a given symbol.

    The returned DataFrame keeps the same structure as the input.
    Raises ValueError if the symbol does not exist.
    """
    if "symbol" not in df.columns:
        raise ValueError("Input DataFrame must contain a 'symbol' column.")

    symbol = str(symbol)
    # Use the symbol column for filtering (robust even if index changes).
    sub = df.loc[df["symbol"] == symbol]
    if sub.empty:
        raise ValueError(f"Symbol not found in dataframe: {symbol}")

    max_vol = sub["volume"].max()
    return sub.loc[sub["volume"] == max_vol]


# Example (uncomment):
# rows_with_max_volume(quotes, "AAPL")
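
Because Question 1 uses `symbol` as the row index, the lookup could also go through the index rather than the column. The following is a sketch of that variant on toy data, not the reference solution; the function name `max_volume_rows_by_index` and the data are made up for illustration.

```python
import pandas as pd


def max_volume_rows_by_index(df: pd.DataFrame, symbol: str) -> pd.DataFrame:
    """Hypothetical variant: select by the symbol index, then keep max-volume rows."""
    if symbol not in df.index:
        raise ValueError(f"Symbol not found in dataframe: {symbol}")
    # The list form [symbol] always yields a DataFrame, even for a single match.
    sub = df.loc[[symbol]]
    return sub[sub["volume"] == sub["volume"].max()]


# Toy data (made up) with a tie for AAA's highest volume.
toy = pd.DataFrame(
    {
        "symbol": ["AAA", "AAA", "AAA", "BBB"],
        "volume": [100, 250, 250, 900],
    }
).set_index("symbol", drop=False)

result = max_volume_rows_by_index(toy, "AAA")
print(result)
```

This version depends on the index being `symbol`, so it is less forgiving than filtering on the column if a caller has reset or changed the index.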


#### Question 3

Using the same dataframe as in the preceding questions, our goal now is to write a function that will return, for a specific symbol, the row (or rows) with the largest high-low spread.

Write a function to do that. It should return a dataframe containing only the row (or rows) with the largest high-low spread for that symbol.

In [3]:
import pandas as pd


def rows_with_max_spread_for_symbol(df: pd.DataFrame, symbol: str) -> pd.DataFrame:
    """Return the row(s) for `symbol` with the largest (high - low) spread.

    Returns a DataFrame (possibly multiple rows if ties).
    Raises ValueError if symbol not present.
    """
    required = {"symbol", "high", "low"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Input DataFrame missing required column(s): {sorted(missing)}")

    symbol = str(symbol)
    sub = df.loc[df["symbol"] == symbol]
    if sub.empty:
        raise ValueError(f"Symbol not found in dataframe: {symbol}")

    spread = sub["high"] - sub["low"]
    max_spread = spread.max()
    return sub.loc[spread == max_spread]


# Example (uncomment):
# rows_with_max_spread_for_symbol(quotes, "MSFT")
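
One reason to prefer a boolean mask over `idxmax` here: the dataframe from Question 1 repeats the same index label for every row of a symbol, so a label returned by `idxmax` does not identify a single row. A small sketch on made-up toy data:

```python
import pandas as pd

# Toy data (made up): three AAA rows sharing the index label "AAA".
toy = pd.DataFrame(
    {
        "symbol": ["AAA", "AAA", "AAA"],
        "high": [10.0, 12.0, 11.0],
        "low": [9.0, 8.0, 10.5],
    }
).set_index("symbol", drop=False)

spread = toy["high"] - toy["low"]  # 1.0, 4.0, 0.5

# Label-based lookup: idxmax() returns the label "AAA", which matches every row.
by_label = toy.loc[[spread.idxmax()]]

# Boolean mask: keeps only the row(s) whose spread equals the maximum.
by_mask = toy.loc[spread == spread.max()]

print(len(by_label), len(by_mask))  # prints: 3 1
```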


#### Question 4

Using the same dataframe as the preceding questions, write a function that returns a single dataframe containing the record(s) with maximum high-low spread for each symbol in the dataframe. (Do not hardcode symbol names in this function - instead you should recover the possible symbol names from the data itself).

The returned dataframe should have the same structure as the original dataframe, but just contain the rows of maximum high-low spread for each symbol.

In [4]:
import pandas as pd


def rows_with_max_spread_per_symbol(df: pd.DataFrame) -> pd.DataFrame:
    """Return a DataFrame containing the row(s) of maximum (high - low) spread for each symbol.

    - Does not hardcode symbol names.
    - Preserves the input DataFrame structure and column set.
    - If multiple rows tie for max spread within a symbol, all tied rows are included.
    """
    required = {"symbol", "high", "low"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Input DataFrame missing required column(s): {sorted(missing)}")

    # Compute spread once for grouping operations.
    spread = df["high"] - df["low"]

    # For each symbol, compute its maximum spread.
    max_spread_by_symbol = spread.groupby(df["symbol"]).transform("max")

    # Keep rows whose spread equals the per-symbol maximum.
    out = df.loc[spread == max_spread_by_symbol].copy()

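    # Optional: stable ordering (by symbol then date if date exists).
    # If "symbol" is both the index name and a column (as in Q1), sort_values
    # treats the name as ambiguous and raises, so clear the index name while sorting.
    sort_cols = ["symbol"] + (["date"] if "date" in out.columns else [])
    index_name = out.index.name
    out = out.rename_axis(None).sort_values(sort_cols).rename_axis(index_name)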

    # Preserve the same kind of index semantics as Q1 if present:
    # if the input is indexed by symbol and also has a symbol column, keep that convention.
    if df.index.name == "symbol" and "symbol" in out.columns:
        out = out.set_index("symbol", drop=False)

    return out


# Example end-to-end (uncomment):
# quotes = load_daily_quotes("daily_quotes.csv")
# rows_with_max_spread_per_symbol(quotes)
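
A point to watch when sorting a dataframe shaped like the one from Question 1: `symbol` is both the index name and a column, and recent pandas versions are expected to reject `sort_values` on such a name as ambiguous. The sketch below uses made-up toy data to show the failure and one possible workaround; treat the exact error text as version-dependent.

```python
import pandas as pd

# Toy data (made up), indexed by symbol while keeping the symbol column.
toy = pd.DataFrame(
    {
        "symbol": ["BBB", "AAA"],
        "date": pd.to_datetime(["2020-01-03", "2020-01-02"]),
    }
).set_index("symbol", drop=False)

try:
    toy.sort_values(["symbol", "date"])
    ambiguous = False
except ValueError as exc:
    # pandas cannot tell whether "symbol" means the index level or the column.
    ambiguous = True
    print("sort_values failed:", exc)

# Clearing the index name removes the ambiguity; restore it after sorting.
ordered = toy.rename_axis(None).sort_values(["symbol", "date"]).rename_axis("symbol")
print(ordered)
```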


In [5]:
# Exploratory cell: common problems when invoking the solution functions
# This cell is meant for *diagnostics*, not correctness tests.
# Run sections one by one and read the printed output / raised errors.

import pandas as pd
from pathlib import Path

print("--- Exploration: function invocation issues ---")

# -------------------------------------------------
# 1. File path and working-directory issues (Q1)
# -------------------------------------------------
print("\n[Q1] Checking CSV path and working directory")
print("Current working directory:", Path.cwd())

csv_path = Path("daily_quotes.csv")
print("Does daily_quotes.csv exist here?", csv_path.exists())

if not csv_path.exists():
    print(
        "❌ Problem: daily_quotes.csv not found.\n"
        "   - Ensure the file is in the same directory as this notebook, OR\n"
        "   - Pass an absolute / correct relative path to load_daily_quotes()."
    )
else:
    print("✅ CSV file found")


# -------------------------------------------------
# 2. Inspecting the loaded dataframe (Q1)
# -------------------------------------------------
print("\n[Q1] Inspecting dataframe structure")

try:
    quotes = load_daily_quotes(csv_path)
    print("DataFrame loaded successfully")
    print("Shape:", quotes.shape)
    print("Columns:", list(quotes.columns))
    print("Index name:", quotes.index.name)
    print("Dtypes:\n", quotes.dtypes)
except Exception as e:
    print("❌ Error while loading dataframe:")
    print(type(e).__name__ + ":", e)
    quotes = None


# -------------------------------------------------
# 3. Symbol handling pitfalls (Q2, Q3)
# -------------------------------------------------
print("\n[Q2/Q3] Exploring symbol-related issues")

if quotes is not None:
    unique_symbols = quotes["symbol"].astype(str).unique().tolist()
    print("Available symbols:", unique_symbols)

    # Common mistake: wrong casing or whitespace
    test_symbol = unique_symbols[0]
    print("Using test symbol:", repr(test_symbol))

    print("Trying correct symbol → should succeed")
    try:
        rows_with_max_volume(quotes, test_symbol)
        rows_with_max_spread_for_symbol(quotes, test_symbol)
        print("✅ Correct symbol invocation works")
    except Exception as e:
        print("❌ Unexpected error with valid symbol:", e)

    print("\nTrying incorrect symbol → should fail")
    try:
        rows_with_max_volume(quotes, "not_a_real_symbol")
    except Exception as e:
        print("Expected failure:", type(e).__name__, e)


# -------------------------------------------------
# 4. Column dependency issues (Q2–Q4)
# -------------------------------------------------
print("\n[Q2–Q4] Exploring missing-column problems")

if quotes is not None:
    # Simulate a user accidentally dropping a required column
    broken_df = quotes.drop(columns=["low"])
    print("Dropped column 'low'. Columns now:", list(broken_df.columns))

    try:
        rows_with_max_spread_for_symbol(broken_df, test_symbol)
    except Exception as e:
        print("Expected failure due to missing column:", type(e).__name__, e)


# -------------------------------------------------
# 5. Index vs column confusion
# -------------------------------------------------
print("\n[Index vs Column] Exploring index-related confusion")

if quotes is not None:
    print("Index name:", quotes.index.name)
    print("Is 'symbol' also a column?", "symbol" in quotes.columns)

    # Common pitfall: user assumes symbol is only in the index
    df_no_symbol_col = quotes.drop(columns=["symbol"])
    print("Removed 'symbol' column, relying only on index")

    try:
        rows_with_max_volume(df_no_symbol_col, test_symbol)
    except Exception as e:
        print("Expected failure when 'symbol' column is missing:", type(e).__name__, e)


# -------------------------------------------------
# 6. Sanity check for Q4 grouping logic
# -------------------------------------------------
print("\n[Q4] Inspecting per-symbol spread calculation")

if quotes is not None:
    try:
        q4_out = rows_with_max_spread_per_symbol(quotes)
        print("Q4 output shape:", q4_out.shape)
        print("Symbols in output:", q4_out["symbol"].unique().tolist())
        print("Sample output rows:\n", q4_out.head())
    except Exception as e:
        print("❌ Error invoking Q4:", type(e).__name__, e)


print("\n--- End of invocation exploration ---")


--- Exploration: function invocation issues ---

[Q1] Checking CSV path and working directory
Current working directory: D:\_Udemy_course_PRACTICE\Python_3_Fundamentals_Udemy_by_Fred_Baptiste\30_Pandas\11_Exercises
Does daily_quotes.csv exist here? True
✅ CSV file found

[Q1] Inspecting dataframe structure
❌ Error while loading dataframe:
ValueError: Missing column provided to 'parse_dates': 'date'

[Q2/Q3] Exploring symbol-related issues

[Q2–Q4] Exploring missing-column problems

[Index vs Column] Exploring index-related confusion

[Q4] Inspecting per-symbol spread calculation

--- End of invocation exploration ---
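
The `ValueError` in the output above comes from `read_csv` itself: `parse_dates=["date"]` was given a column name that does not appear in the file's header, so loading stops before any of the later checks run. The actual headers of `daily_quotes.csv` are not shown here, so the sample below is an assumption (capitalized headers with stray whitespace). It sketches one way to peek at the header row and supply normalized names before parsing.

```python
import io

import pandas as pd

# Hypothetical file content with capitalized, padded headers; the real
# daily_quotes.csv may be laid out differently.
sample_csv = (
    " Symbol,Date ,Open,High,Low,Close,Volume\n"
    "AAA,2020-01-02,1.0,2.0,0.5,1.5,100\n"
)

# Reproduce the failure: no column is spelled exactly "date".
try:
    pd.read_csv(io.StringIO(sample_csv), parse_dates=["date"])
except ValueError as exc:
    print("read_csv failed:", exc)

# Workaround: read only the header row, then supply normalized names.
header = pd.read_csv(io.StringIO(sample_csv), nrows=0).columns
clean_names = [name.strip().lower() for name in header]
df = pd.read_csv(
    io.StringIO(sample_csv),
    header=0,           # skip the original header row...
    names=clean_names,  # ...and use the cleaned names instead
    parse_dates=["date"],
)
print(df.dtypes)
```

Printing `header` first is also a quick way to see what the file really calls its columns before deciding how to map them.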
