# News & Stock - Week 1: Interim EDA

This notebook performs the Task 1 interim analysis: it loads the downloaded files in `data/raw/`, reads the generated summaries in `outputs/summaries/`, attempts to load any CSV datasets, and produces basic exploratory analysis using the lightweight `src` helpers (DataLoader and EDA).

Steps:
- Inspect downloaded files
- Load summaries
- Attempt to load CSVs and run basic EDA (head, describe, correlation, moving average)
- Save a short analysis summary to `outputs/analysis/`

In [3]:
# Cell 2: imports and paths
from pathlib import Path
import json
import pandas as pd
import numpy as np
from src.data_loader import DataLoader
from src.eda import EDA

RAW = Path('data/raw')
SUM = Path('outputs/summaries')
ANALYSIS_OUT = Path('outputs/analysis')
ANALYSIS_OUT.mkdir(parents=True, exist_ok=True)

print('RAW exists:', RAW.exists())
print('Summaries exists:', SUM.exists())
print('Analysis out:', ANALYSIS_OUT)


ModuleNotFoundError: No module named 'src'
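
The `ModuleNotFoundError` above usually means the notebook's working directory is not the folder that contains `src/`. One possible workaround, sketched under the assumption that `src/` sits in the project root somewhere at or above the notebook's directory, is to walk up from the current directory and put that folder on `sys.path` before the imports. `find_project_root` and `add_to_sys_path` are hypothetical helpers, not part of the `src` package.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Assumption: the project root is the folder that holds the `src/` directory.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no '{marker}' directory found at or above {start}")


def add_to_sys_path(root: Path) -> None:
    """Prepend `root` to sys.path once so that `import src...` can resolve."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
```

Calling `add_to_sys_path(find_project_root(Path.cwd()))` at the top of Cell 2 would then make `from src.data_loader import DataLoader` resolvable, provided the layout matches the assumption. Installing the project in editable mode is an alternative that avoids touching `sys.path`.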

In [4]:
# Cell 3: list files found
for p in sorted(RAW.glob('*')):
    print('-', p.name, 'size=', p.stat().st_size)

print('\nSummaries:')
for s in sorted(SUM.glob('*')):
    print('-', s.name)

In [None]:
# Cell 4: load and inspect summaries
summaries = {}
for s in SUM.glob('*_summary.json'):
    try:
        with open(s, 'r', encoding='utf8') as f:
            summaries[s.name] = json.load(f)
    except Exception as e:
        summaries[s.name] = {'error': str(e)}

len(summaries)

In [2]:
# Cell 5: show brief summary table
from IPython.display import display
rows = []
for name, info in summaries.items():
    rows.append({
        'file': name,
        'type': info.get('type', 'unknown'),
        'shape': info.get('shape'),
        'error': info.get('error')
    })
df = pd.DataFrame(rows)
display(df)


NameError: name 'summaries' is not defined

In [None]:
# Cell 6: Attempt to load CSVs and run lightweight EDA using src helpers
analysis = {}
for name, info in summaries.items():
    entry = {'file': name}
    p = RAW / name.replace('_summary.json', '')
    if info.get('type') == 'csv' and p.exists():
        try:
            dl = DataLoader(p)
            df = dl.load_csv(parse_dates=True, low_memory=False)
            ed = EDA(df)
            entry['shape'] = df.shape
            entry['head'] = df.head(3).to_dict(orient='records')
            entry['describe'] = df.select_dtypes(include=['number']).describe().to_dict()
            # correlation
            entry['corr'] = ed.correlation_matrix().to_dict()
            # moving average for first numeric column, if present
            numcols = df.select_dtypes(include='number').columns.tolist()
            if numcols:
                ma = ed.moving_average(numcols[0], window=7).bfill().head(5).tolist()
                entry['moving_average_sample'] = ma
            else:
                entry['moving_average_sample'] = None
        except Exception as e:
            entry['error'] = str(e)
    else:
        entry['note'] = 'not csv or missing file'
    analysis[name] = entry

# save analysis summary
outp = ANALYSIS_OUT / 'task1_analysis_summary.json'
with open(outp, 'w', encoding='utf8') as f:
    json.dump(analysis, f, indent=2, default=str)

print('Wrote', outp)
len(analysis)

# Initial Findings & Next Steps

- The notebook attempts to load and analyze every CSV in `data/raw/` that has a matching `*_summary.json` file whose `type` is `csv`.
- The JSON file `outputs/analysis/task1_analysis_summary.json` stores the per-file EDA results, or the error message when a file fails to load.
- Next steps: run additional sentiment extraction (TextBlob/NLTK), compute correlations between sentiment and stock indicators, and visualise results in the notebook.
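
As a starting point for the correlation step, the sketch below aligns a daily sentiment score with daily returns and computes their Pearson correlation. It assumes sentiment has already been scored per headline (for example with TextBlob polarity). The function name, the column names `date`, `sentiment`, and `close`, and all values are hypothetical placeholders rather than anything read from the project data.

```python
import pandas as pd


def sentiment_return_corr(news: pd.DataFrame, prices: pd.DataFrame) -> float:
    """Pearson correlation between mean daily sentiment and daily return."""
    # Average the headline scores so there is one sentiment value per day.
    daily_sentiment = news.groupby("date")["sentiment"].mean()
    # Daily return is the percentage change of the closing price.
    daily_return = (
        prices.sort_values("date").set_index("date")["close"].pct_change()
    )
    # Inner join keeps only dates present in both series; dropna removes
    # the first day, which has no previous close to compare against.
    joined = pd.concat([daily_sentiment, daily_return], axis=1, join="inner").dropna()
    return joined["sentiment"].corr(joined["close"])


# Toy inputs for illustration only.
news = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-04"]
    ),
    "sentiment": [0.1, 0.4, 0.2, -0.2, 0.5],
})
prices = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
    "close": [100.0, 102.0, 101.0, 104.0],
})
r = sentiment_return_corr(news, prices)
print(round(r, 3))
```

Same-day correlation is only one possible alignment; comparing sentiment with the next day's return would need a shift of one of the two series before the join.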
