# Task 1: Data Analysis Workflow and Event Research
## Brent Oil Price Analysis - Geopolitical Event Impact Study

### Objectives:
1. Define complete analysis workflow
2. Research and compile historical events
3. Document assumptions and limitations
4. Plan communication strategy

## 1. Analysis Workflow Documentation

### Step-by-Step Analysis Plan

#### Phase 1: Data Preparation
1. **Data Loading**: Load Brent oil price CSV, convert dates, handle missing values
2. **Data Cleaning**: Remove duplicates, handle outliers, ensure consistent formatting
3. **Feature Engineering**: Calculate returns, rolling statistics, volatility measures
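
The feature-engineering step can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the column names `Date` and `Price`, the helper `add_price_features`, and the window length are all assumptions, and the demo values are synthetic.

```python
import numpy as np
import pandas as pd

def add_price_features(prices: pd.DataFrame, window: int = 30) -> pd.DataFrame:
    """Return a sorted, de-duplicated copy with log returns and rolling statistics."""
    out = prices.sort_values("Date").drop_duplicates("Date").reset_index(drop=True)
    out["log_return"] = np.log(out["Price"]).diff()                 # daily log return
    out["rolling_mean"] = out["Price"].rolling(window).mean()       # trend proxy
    out["rolling_vol"] = out["log_return"].rolling(window).std()    # volatility proxy
    return out

# Tiny synthetic example (not real Brent data)
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06"]),
    "Price": [60.0, 61.2, 59.8, 60.5],
})
features = add_price_features(demo, window=2)
print(features)
```

Log returns are used here because they are additive over time, which tends to make later modeling of shifts in mean and volatility simpler.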

#### Phase 2: Exploratory Analysis
1. **Descriptive Statistics**: Summary stats, distribution analysis
2. **Time Series Visualization**: Price trends, seasonal patterns, structural breaks
3. **Statistical Tests**: Stationarity, autocorrelation, normality tests
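
As a quick, informal complement to the statistical tests listed above, the lag-1 autocorrelation of price levels can be compared with that of returns. The sketch below uses a synthetic random walk (an assumption, not Brent data) and pandas only; a formal stationarity test such as the augmented Dickey-Fuller test (for example `statsmodels.tsa.stattools.adfuller`) would likely be used in the real analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic random-walk log prices (illustrative only)
log_price = pd.Series(np.cumsum(rng.normal(0.0, 0.02, size=1000)))
returns = log_price.diff().dropna()

# Levels of a random walk are highly persistent; their differences are not
acf_price = log_price.autocorr(lag=1)
acf_returns = returns.autocorr(lag=1)

print(f"Lag-1 autocorrelation of log prices: {acf_price:.3f}")
print(f"Lag-1 autocorrelation of returns:    {acf_returns:.3f}")
```

A lag-1 autocorrelation close to 1 in the levels, alongside a value near 0 in the returns, is the typical signature of a non-stationary price series with approximately stationary returns.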

#### Phase 3: Event Data Integration
1. **Event Research**: Identify 15+ key geopolitical/economic events
2. **Event Database Creation**: Structured CSV with dates, descriptions, impact types
3. **Timeline Alignment**: Map events to price data timeline
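
Timeline alignment has to cope with events that fall on non-trading days. One possible approach, sketched here with `pandas.merge_asof`, maps each event to the first trading day on or after it. The price values and the `Date`/`Price` column names are made-up placeholders.

```python
import pandas as pd

# Placeholder prices around a weekend (values are illustrative, not real quotes)
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2019-09-12", "2019-09-13", "2019-09-16", "2019-09-17"]),
    "Price": [60.0, 60.2, 69.0, 64.5],
})

# 2019-09-14 was a Saturday, so there is no price for the event date itself
events = pd.DataFrame({
    "event_date": pd.to_datetime(["2019-09-14"]),
    "event_name": ["Saudi oil facilities attack"],
})

# direction="forward" picks the first trading day on or after the event
aligned = pd.merge_asof(
    events.sort_values("event_date"),
    prices.sort_values("Date"),
    left_on="event_date",
    right_on="Date",
    direction="forward",
)
print(aligned[["event_name", "event_date", "Date", "Price"]])
```

Both frames must be sorted on their key columns before calling `merge_asof`, which is why the sort is done explicitly.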

#### Phase 4: Bayesian Modeling
1. **Model Specification**: Define priors, likelihood, change point parameters
2. **MCMC Sampling**: Implement PyMC model with NUTS sampler
3. **Convergence Checking**: Trace plots, R-hat statistics, effective sample size
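
To make the change point idea concrete without a sampler, the sketch below computes the exact posterior over a single change point in the mean on a grid. It is a deliberately simplified stand-in for the PyMC/NUTS model planned above, not that model: it assumes one change point, a known noise level `sigma`, flat priors on both segment means, and a uniform prior on the change point. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def changepoint_posterior(y: np.ndarray, sigma: float) -> np.ndarray:
    """Posterior over tau, the index of the first observation in the second regime.

    With flat priors on the two segment means, they integrate out analytically:
    log p(y | tau) = -SSE(tau) / (2 sigma^2) - 0.5 * log(n1 * n2) + const.
    """
    n = len(y)
    taus = np.arange(1, n)  # keep both segments non-empty
    log_post = np.empty(len(taus))
    for i, tau in enumerate(taus):
        left, right = y[:tau], y[tau:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        log_post[i] = -sse / (2 * sigma**2) - 0.5 * np.log(len(left) * len(right))
    log_post -= log_post.max()  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Synthetic series with a mean shift at index 100 (not Brent data)
rng = np.random.default_rng(42)
y = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])

posterior = changepoint_posterior(y, sigma=1.0)
tau_map = int(np.argmax(posterior)) + 1  # +1 because the grid starts at tau = 1
print(f"Most probable change point index: {tau_map}")
```

An MCMC approach becomes necessary once the noise level, multiple change points, or informative priors enter the model, which is where the NUTS sampler and the convergence diagnostics listed above come in.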

#### Phase 5: Model Validation
1. **Posterior Predictive Checks**: Compare model predictions with actual data
2. **Model Comparison**: Test different numbers of change points
3. **Sensitivity Analysis**: Test different prior specifications

#### Phase 6: Impact Analysis
1. **Change Point Detection**: Identify significant structural breaks
2. **Event Correlation**: Statistical alignment with historical events
3. **Impact Quantification**: Measure price changes around events
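
One simple way to quantify impact is to compare average prices in a short window before and after an event. The following is a sketch under assumptions: the helper `event_window_change`, the 5-day window, and the synthetic price series are all illustrative choices rather than the method fixed by this plan.

```python
import pandas as pd

def event_window_change(prices: pd.Series, event_date: str, window_days: int = 5) -> float:
    """Percent change in mean price: window from the event onward vs. window before it."""
    event = pd.Timestamp(event_date)
    delta = pd.Timedelta(days=window_days)
    # Label-based slicing on a sorted DatetimeIndex includes both endpoints
    before = prices.loc[event - delta : event - pd.Timedelta(days=1)]
    after = prices.loc[event : event + delta]
    return (after.mean() / before.mean() - 1.0) * 100.0

# Synthetic daily prices: flat at 50, then flat at 55 from the event date onward
dates = pd.date_range("2020-01-01", periods=10, freq="D")
prices = pd.Series([50.0] * 5 + [55.0] * 5, index=dates)

change = event_window_change(prices, "2020-01-06", window_days=5)
print(f"Mean price change around event: {change:.1f}%")
```

A change measured this way is descriptive only; it does not separate the event's effect from anything else that happened in the same window.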

#### Phase 7: Insight Generation
1. **Dashboard Development**: Interactive visualization of results
2. **Report Generation**: Technical and executive summaries
3. **Recommendations**: Data-driven insights for stakeholders

### Workflow Visualization

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(12, 8))
ax.set_xlim(0, 10)
ax.set_ylim(0, 8)
ax.axis('off')

# Workflow boxes
boxes = [
    (1, 6, "Data\nPreparation", "#4C72B0"),
    (3, 6, "Exploratory\nAnalysis", "#55A868"),
    (5, 6, "Event Data\nIntegration", "#C44E52"),
    (7, 6, "Bayesian\nModeling", "#8172B2"),
    (9, 6, "Model\nValidation", "#CCB974"),
    (5, 4, "Impact\nAnalysis", "#64B5CD"),
    (5, 2, "Insight\nGeneration", "#DA8BC3")
]

for x, y, text, color in boxes:
    rect = patches.Rectangle((x-0.8, y-0.6), 1.6, 1.2, 
                            linewidth=2, edgecolor='black', 
                            facecolor=color, alpha=0.7)
    ax.add_patch(rect)
    ax.text(x, y, text, ha='center', va='center', 
            fontsize=10, fontweight='bold')

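# Arrows between consecutive phases: (x_start, y_start, x_end, y_end)
arrows = [
    (1.8, 6, 2.2, 6),
    (3.8, 6, 4.2, 6),
    (5.8, 6, 6.2, 6),
    (7.8, 6, 8.2, 6),
    (9, 4, 5.8, 4),    # ends at the right edge of "Impact Analysis"
    (5, 3.4, 5, 2.6)
]

# Vertical connector from "Model Validation" down to the "Impact Analysis" row,
# so the elbow into "Impact Analysis" has no gap
ax.plot([9, 9], [5.4, 4], color='black', lw=2)

for x1, y1, x2, y2 in arrows:
    ax.annotate('', xy=(x2, y2), xytext=(x1, y1),
                arrowprops=dict(arrowstyle='->', lw=2))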

plt.title('Brent Oil Analysis Workflow', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

## 2. Event Dataset Creation

In [None]:
import pandas as pd
import numpy as np

# Create event dataset
events_data = {
    'event_date': [
        '1990-08-02', '1997-07-02', '2001-09-11', '2003-03-20',
        '2005-08-29', '2008-09-15', '2011-02-15', '2014-11-27',
        '2016-01-16', '2020-03-06', '2020-03-11', '2022-02-24',
        '2015-12-12', '2018-05-08', '2019-09-14'
    ],
    'event_name': [
        'Iraq invades Kuwait',
        'Asian Financial Crisis',
        '9/11 Terrorist Attacks',
        'Iraq War begins',
        'Hurricane Katrina',
        'Lehman Brothers collapse',
        'Arab Spring (Libya conflict)',
        'OPEC maintains production',
        'Iran sanctions lifted',
        'OPEC+ price war begins',
        'COVID-19 declared pandemic',
        'Russia invades Ukraine',
        'Paris Climate Agreement',
        'US withdraws from Iran deal',
        'Saudi oil facilities attack'
    ],
    'event_type': [
        'Geopolitical Conflict', 'Economic Crisis', 'Geopolitical',
        'Geopolitical Conflict', 'Natural Disaster', 'Financial Crisis',
        'Geopolitical Conflict', 'Policy Decision', 'Sanctions Change',
        'Policy Decision', 'Global Health Crisis', 'Geopolitical Conflict',
        'Policy/Regulatory', 'Sanctions Change', 'Geopolitical Conflict'
    ],
    'region_org': [
        'Middle East', 'Asia', 'Global', 'Middle East',
        'Gulf of Mexico', 'Global', 'Middle East/N Africa', 'OPEC',
        'Middle East', 'OPEC/Russia', 'Global', 'Europe',
        'Global', 'Middle East', 'Middle East'
    ],
    'expected_impact': [
        'Severe price spike', 'Price decline', 'Initial spike, then uncertainty',
        'Price increase', 'Supply disruption spike', 'Price crash',
        'Price surge', 'Price collapse', 'Price pressure',
        'Price crash', 'Historic price drop', 'Energy crisis spike',
        'Long-term demand shift', 'Supply uncertainty', 'Temporary spike'
    ],
    'impact_direction': [
        'positive', 'negative', 'mixed', 'positive',
        'positive', 'negative', 'positive', 'negative',
        'negative', 'negative', 'negative', 'positive',
        'negative', 'positive', 'positive'
    ],
    'source': [
        'Historical records', 'Economic history', 'News archives',
        'Military records', 'NOAA records', 'Financial records',
        'News archives', 'OPEC announcements', 'UN resolutions',
        'OPEC+ meeting minutes', 'WHO declarations', 'News reports',
        'UNFCCC records', 'US State Dept', 'News reports'
    ]
}

events_df = pd.DataFrame(events_data)
events_df['event_date'] = pd.to_datetime(events_df['event_date'])

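# Save to CSV (create the target directory first so the write does not fail if it is missing)
import os
os.makedirs('../data', exist_ok=True)
events_df.to_csv('../data/historical_events.csv', index=False)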

print("Event Dataset Created:")
print(f"Number of events: {len(events_df)}")
print("\nFirst 5 events:")
print(events_df.head())
print("\nEvent types distribution:")
print(events_df['event_type'].value_counts())

## 3. Assumptions and Limitations Documentation

### Key Assumptions

1. **Market Efficiency Assumption**: We assume Brent oil prices reflect all available information at each time point, following the Efficient Market Hypothesis.

2. **Data Quality Assumption**: The provided daily price data is accurate, complete, and free from systematic measurement errors.

3. **Event Timing Assumption**: Geopolitical and economic events immediately affect oil prices (within days of occurrence).

4. **Independence Assumption**: Each event's impact can be analyzed separately, though we acknowledge potential interactions.

5. **Structural Break Assumption**: Significant price changes detected by our models represent meaningful market regime shifts.

### Critical Limitations

#### 1. Correlation vs. Causation
**This is the most important limitation to understand:**
- **Temporal Correlation**: A detected change point near an event date shows that the two coincide in time. This suggests a relationship but does not prove one.
- **Causal Impact**: Establishing causation requires three conditions:
  1. **Temporal Precedence**: The cause must precede the effect
  2. **Covariation**: Changes in the cause are associated with changes in the effect
  3. **Elimination of Alternatives**: No other plausible explanation remains

**Our analysis can only establish #1 and #2. Establishing #3 requires an experimental design that observational data cannot provide.**

#### 2. Confounding Variables
- Multiple events often occur simultaneously
- Unmeasured factors (technology changes, alternative energy adoption, etc.)
- Global economic conditions not captured in our event database
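
The first confounding risk can be checked directly against the event list: events that fall close together cannot be separated by a change point near them. The sketch below flags consecutive events within a chosen gap. The 30-day threshold is a hypothetical choice, and only a subset of the events from Section 2 is used.

```python
import pandas as pd

# Subset of the event dates defined in Section 2
event_dates = pd.Series(
    pd.to_datetime(["2014-11-27", "2015-12-12", "2016-01-16", "2020-03-06", "2020-03-11"]),
    index=[
        "OPEC maintains production",
        "Paris Climate Agreement",
        "Iran sanctions lifted",
        "OPEC+ price war begins",
        "COVID-19 declared pandemic",
    ],
).sort_values()

max_gap = 30  # hypothetical threshold in days
gap_days = event_dates.diff().dt.days  # days since the previous event (NaN for the first)

# Collect consecutive event pairs that fall within the threshold
overlapping = [
    (event_dates.index[i - 1], event_dates.index[i], int(gap_days.iloc[i]))
    for i in range(1, len(event_dates))
    if gap_days.iloc[i] <= max_gap
]
for first, second, gap in overlapping:
    print(f"{first} / {second}: {gap} days apart")
```

Here the OPEC+ price war and the COVID-19 pandemic declaration are flagged, so any change point detected in March 2020 cannot be attributed to one of them alone.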

#### 3. Model Limitations
- Bayesian models provide probabilities, not certainties
- Results sensitive to prior specifications
- Computational constraints limit model complexity

#### 4. Data Limitations
- Daily frequency may miss intraday reactions
- Limited to Brent crude (other oil benchmarks may differ)
- Historical data quality may vary over the 35-year period


### Statistical vs. Causal Inference
```python
# Pseudo-code illustrating the difference (not runnable; the functions are placeholders)

# Statistical correlation (what we can measure):
correlation = detect_correlation(price_change, event_occurrence)

# Causal impact (what we want but cannot guarantee):
causal_effect = run_experiment(
    treatment=event_occurrence,
    control=parallel_universe_without_event,  # the unobservable counterfactual
)
```

## 4. Communication Channels Plan

### Target Audiences and Their Needs

| Audience | Primary Needs | Preferred Format | Frequency |
|----------|---------------|------------------|-----------|
| **Investors** | Risk assessment, Return optimization | Interactive dashboard, Brief reports | Weekly/Monthly |
| **Policymakers** | Economic stability, Energy security | Executive summaries, Policy briefs | Quarterly |
| **Energy Companies** | Operational planning, Cost control | Technical reports, API access | Monthly |
| **Analysts** | Detailed methodology, Model validation | Technical documentation, Code access | As needed |

### Communication Channels Matrix

#### Primary Channels:
1. **Interactive Dashboard (Streamlit)**
   - Real-time exploration of change points
   - Event timeline visualization
   - Price scenario simulation
   - Target: All stakeholders

2. **Executive Report (PDF/PPT)**
   - Key findings (1-2 pages)
   - Strategic recommendations
   - Visual data storytelling
   - Target: Executives, Policymakers

3. **Technical Documentation**
   - Full methodology (10-15 pages)
   - Model validation results
   - Code implementation details
   - Target: Analysts, Data Scientists

4. **API Service**
   - RESTful API for automated access
   - Real-time price predictions
   - Event impact alerts
   - Target: Energy companies, Automated systems

5. **Quarterly Briefings**
   - Live presentations with Q&A
   - Strategy workshops
   - Stakeholder feedback sessions
   - Target: All key stakeholders

### Implementation Timeline

**Phase 1 (Weeks 1-2)**: Technical documentation and prototype dashboard.

**Phase 2 (Weeks 3-4)**: Executive report and API development.

**Phase 3 (Ongoing)**: Quarterly updates and stakeholder engagement.

In [None]:
# Communication plan visualization
import matplotlib.pyplot as plt
import numpy as np  # needed for np.arange below; makes this cell runnable on its own

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Audience needs chart (the scores below are hard-coded, not derived from data)
audiences = ['Investors', 'Policymakers', 'Energy Companies', 'Analysts']

ax1 = axes[0]
x_pos = np.arange(len(audiences))
ax1.bar(x_pos - 0.2, [8, 9, 7, 6], 0.4, label='Urgency', color='#4C72B0')
ax1.bar(x_pos + 0.2, [7, 8, 9, 8], 0.4, label='Complexity', color='#C44E52')
ax1.set_xticks(x_pos)
ax1.set_xticklabels(audiences)
ax1.set_ylabel('Score (1-10)')
ax1.set_title('Stakeholder Analysis')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Channel effectiveness
channels = ['Dashboard', 'Reports', 'API', 'Briefings']
effectiveness = [9, 7, 6, 8]
reach = [8, 9, 5, 7]

ax2 = axes[1]
x_pos = np.arange(len(channels))
ax2.bar(x_pos - 0.2, effectiveness, 0.4, label='Effectiveness', color='#55A868')
ax2.bar(x_pos + 0.2, reach, 0.4, label='Reach', color='#CCB974')
ax2.set_xticks(x_pos)
ax2.set_xticklabels(channels)
ax2.set_ylabel('Score (1-10)')
ax2.set_title('Communication Channel Analysis')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Summary

This notebook establishes the foundation for our Brent oil price analysis:
1. **Complete workflow** from data to insights
2. **Structured event database** with 15 key events
3. **Clear documentation** of assumptions and limitations
4. **Comprehensive communication plan** for all stakeholders

Next steps: Implement the time series analysis and Bayesian modeling.