# Notebook 1: Data Analysis & Feature Engineering

This notebook explores the raw Mark Six dataset and demonstrates the functionality of the `FeatureEngineer` class. Understanding the data is a crucial first step before moving on to modeling.

### 1. Setup and Imports

First, we import the necessary libraries and load our project's configuration and `FeatureEngineer` class. The custom modules live in the `src` package, so we add the project root (the parent of this notebook's directory) to the Python path to make `src` importable.

In [None]:
import pandas as pd
import numpy as np
import sys
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Add the project root (parent directory) to the Python path so `src` is importable
sys.path.append(os.path.abspath(os.path.join('..')))

from src.config import CONFIG
from src.feature_engineering import FeatureEngineer

print("Setup complete. Modules loaded.")

### 2. Loading the Data

We load the historical data from `data/raw/Mark_Six.csv`. As discovered during development, the CSV has a malformed header, so we must define the column names manually and skip the initial rows.

In [None]:
col_names = [
    'Draw', 'Date', 'Winning_Num_1', 'Winning_Num_2', 'Winning_Num_3',
    'Winning_Num_4', 'Winning_Num_5', 'Winning_Num_6', 'Extra_Num',
    'From_Last', 'Low', 'High', 'Odd', 'Even', '1-10', '11-20', '21-30',
    '31-40', '41-50', 'Div_1_Winners', 'Div_1_Prize', 'Div_2_Winners',
    'Div_2_Prize', 'Div_3_Winners', 'Div_3_Prize', 'Div_4_Winners',
    'Div_4_Prize', 'Div_5_Winners', 'Div_5_Prize', 'Div_6_Winners',
    'Div_6_Prize', 'Div_7_Winners', 'Div_7_Prize', 'Turnover'
]

# Construct the relative path to the data file
data_path = os.path.join('..', CONFIG["data_path"])

# Load the dataset, skipping the header and assigning names
df = pd.read_csv(data_path, header=None, skiprows=33, names=col_names)

# Convert Date column to datetime objects and sort
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(by='Date').reset_index(drop=True)

print(f"Dataset loaded successfully with {len(df)} records.")
df.head()
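
The loading pattern above (no header row, skipped leading rows, manually assigned names) can be tried on a tiny in-memory file. This is a minimal sketch: the file contents, the `toy_` names, and the four-column layout are made up for illustration and are not the real `Mark_Six.csv`. The range check assumes the numbers are drawn from 1 to 49.

```python
import io

import pandas as pd

# Hypothetical miniature file: two unusable header rows, then two data rows.
toy_csv = (
    "Mark Six Results,,,\n"
    "Draw,Date,Numbers,\n"
    "24/002,2024-01-04,5,17\n"
    "24/001,2024-01-02,3,42\n"
)
toy_columns = ["Draw", "Date", "Num_1", "Num_2"]

# Same pattern as the real load: ignore the file's header, skip the bad rows,
# and supply the column names ourselves.
toy_df = pd.read_csv(io.StringIO(toy_csv), header=None, skiprows=2, names=toy_columns)

# Parse dates and put the draws in chronological order.
toy_df["Date"] = pd.to_datetime(toy_df["Date"])
toy_df = toy_df.sort_values(by="Date").reset_index(drop=True)

# Sanity checks worth running after any load of this kind.
numbers_in_range = toy_df[["Num_1", "Num_2"]].stack().between(1, 49).all()
dates_sorted = toy_df["Date"].is_monotonic_increasing

print(toy_df)
print(f"Numbers in range: {numbers_in_range}, dates sorted: {dates_sorted}")
```

Running the same two checks on the real `df` is a quick way to confirm that the `skiprows` value and the column names line up with the file.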

### 3. Feature Engineering Demonstration

Now, we'll instantiate and fit the `FeatureEngineer` on our historical data. The `.fit()` method calculates all the necessary statistics (like number and pair frequencies) from the entire dataset.

In [None]:
feature_engineer = FeatureEngineer()
feature_engineer.fit(df)

print(f"FeatureEngineer fitted on {feature_engineer.total_draws} draws.")
print(f"Total unique historical combinations found: {len(feature_engineer.historical_sets)}")
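
To make the idea of "number and pair frequencies" concrete, here is a minimal sketch of how such statistics could be computed with the standard library. It is not the actual `FeatureEngineer.fit()` implementation, which is not shown in this notebook; the toy draws and the `toy_` names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Three made-up draws of six numbers each (not real results).
toy_draws = [
    [1, 5, 12, 23, 34, 45],
    [1, 5, 13, 24, 35, 46],
    [2, 5, 12, 25, 36, 47],
]

toy_number_counts = Counter()
toy_pair_counts = Counter()

for draw in toy_draws:
    ordered = sorted(draw)
    # How often each individual number has been drawn.
    toy_number_counts.update(ordered)
    # How often each unordered pair has appeared in the same draw.
    # Sorting first means every pair is stored as (smaller, larger).
    toy_pair_counts.update(combinations(ordered, 2))

# Distinct historical combinations, independent of the order drawn.
toy_historical_sets = {frozenset(draw) for draw in toy_draws}

print(toy_number_counts.most_common(3))
print(toy_pair_counts[(1, 5)])
print(len(toy_historical_sets))
```

Each six-number draw contributes 15 pairs, so the pair counter grows much faster than the number counter as more draws are added.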

#### Transforming a Sample Set

Let's see the `FeatureEngineer` in action. The `.transform()` method takes a 6-number list and converts it into the feature vector that our model uses for scoring.

In [None]:
sample_set = [1, 2, 3, 4, 5, 6]
current_draw_index = len(df) # The index for a new, hypothetical draw

feature_vector = feature_engineer.transform(sample_set, current_draw_index)

print(f"Sample Set: {sample_set}")
print(f"Generated Feature Vector (shape: {feature_vector.shape}):\n{feature_vector}")
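
For intuition about what a feature vector for a number set might contain, the sketch below builds a small one by hand. The function `sketch_features` and its four features are hypothetical; the real `.transform()` may compute different features, and it returns an object with a `.shape` attribute rather than a plain list.

```python
from collections import Counter


def sketch_features(number_set, number_counts, total_draws):
    """Return a few illustrative features for one candidate set (hypothetical)."""
    nums = sorted(number_set)
    total = sum(nums)                          # sum of the numbers
    odd_count = sum(n % 2 for n in nums)       # how many numbers are odd
    spread = nums[-1] - nums[0]                # largest minus smallest
    # Average share of past draws in which each chosen number appeared.
    mean_rel_freq = sum(number_counts[n] for n in nums) / (len(nums) * total_draws)
    return [total, odd_count, spread, mean_rel_freq]


# Made-up historical counts over three draws (not real data).
toy_counts = Counter({1: 2, 2: 1, 5: 3})
toy_vector = sketch_features([1, 2, 3, 4, 5, 6], toy_counts, total_draws=3)
print(toy_vector)
```

Numbers missing from the counter contribute zero to the frequency feature, because a `Counter` returns 0 for unseen keys.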

### 4. Basic Data Visualization

Visualizing the data can reveal interesting patterns. Let's plot the frequency of each number across all historical draws.

In [None]:
# The number counts are already calculated in our fitted FeatureEngineer
number_frequencies = feature_engineer.number_counts

# Prepare data for plotting
numbers = sorted(number_frequencies.keys())
counts = [number_frequencies[n] for n in numbers]

# Create the plot
plt.figure(figsize=(18, 8))
# Note: seaborn 0.13+ emits a FutureWarning when `palette` is passed without `hue`
sns.barplot(x=numbers, y=counts, palette='viridis')
plt.title('Historical Frequency of Each Mark Six Number', fontsize=16)
plt.xlabel('Lottery Number', fontsize=12)
plt.ylabel('Frequency Count', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

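This plot shows that most numbers appear a similar number of times, with slight variations between them. Our model can draw on these variations through its frequency-based features, although some spread of this kind is expected from random draws alone.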
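
As a rough way to judge how large the variations in the plot are, each number's count can be compared with what uniform random draws would produce. The sketch below is an illustration only: `frequency_z_scores` is a hypothetical helper, and it assumes six main numbers drawn from a pool of 49 with every draw independent. Under those assumptions each number appears in a given draw with probability 6/49, so its count over many draws is approximately binomial.

```python
import math


def frequency_z_scores(counts, total_draws, pool_size=49, picks_per_draw=6):
    """Standardized deviation of each number's count from its expected value.

    Assumes uniform, independent draws (an assumption, not a property verified
    from the dataset).
    """
    p = picks_per_draw / pool_size
    expected = total_draws * p
    std = math.sqrt(total_draws * p * (1 - p))
    return {
        n: (counts.get(n, 0) - expected) / std
        for n in range(1, pool_size + 1)
    }


# Made-up counts: every number drawn 6 times in 49 draws, except number 7.
toy_freqs = {n: 6 for n in range(1, 50)}
toy_freqs[7] = 12
toy_z = frequency_z_scores(toy_freqs, total_draws=49)

print(f"z-score for 1: {toy_z[1]:.2f}")
print(f"z-score for 7: {toy_z[7]:.2f}")
```

With the fitted object from this notebook, the call would look like `frequency_z_scores(dict(feature_engineer.number_counts), feature_engineer.total_draws)`, provided `number_counts` is keyed by integer numbers and counts only the six main numbers. Absolute z-scores mostly below about 2 would suggest the variation is in line with chance.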