# Data Collection - CS2 Professional Match Parsing

**Author**: Lucas Lachaume  
**Project**: CS2 Economic Analysis  
**Date**: December 2024

---

## Objective

Parse professional CS2 demos from HLTV to extract round-level economic and tactical features at freeze-time.

**Input**: `.dem` files in `data/raw/`  
**Output**: Consolidated DataFrame in `data/processed/all_matches.csv`

---

## Features Extracted (~48 variables)

Most features are computed per side and stored with a `ct_` or `t_` prefix (for example `ct_cash` and `t_cash`).

- **Economy**: money_total, cash, cash_avg, armor_count, helmet_count, defuser_count (CT only)
- **Armament**: awp_count, rifle_count, smg_count, heavy_count, ssg_count, ak_count (CT only)
- **Utility**: smoke_count, molo_count, flash_count, he_count, utility_value
- **Equipment**: equipment_value, equipment_value_avg
- **Context**: round_number, ct_score, t_score, rounds_won_streak, rounds_lost_streak, map_name, is_overtime
- **Equipment Saved**: survivors_previous, equipment_saved_value
- **Target**: round_winner (0=T, 1=CT)

---

In [1]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np

# Add src to path
sys.path.append(str(Path.cwd().parent / 'src'))

# Import custom parser
from data.parser import parse_demo

# Configure pandas display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

print("Imports successful")

Imports successful


## 1. Setup Paths

In [2]:
# Define paths relative to notebook
DATA_RAW = Path("../data/raw")
DATA_INTERIM = Path("../data/interim")
DATA_PROCESSED = Path("../data/processed")

# Create directories if they don't exist
DATA_RAW.mkdir(parents=True, exist_ok=True)
DATA_INTERIM.mkdir(parents=True, exist_ok=True)
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)

print("Data directories configured:")
print(f"  - Raw: {DATA_RAW.resolve()}")
print(f"  - Interim: {DATA_INTERIM.resolve()}")
print(f"  - Processed: {DATA_PROCESSED.resolve()}")

Data directories configured:
  - Raw: E:\Projects\cs2-economic-analysis\bad\data\raw
  - Interim: E:\Projects\cs2-economic-analysis\bad\data\interim
  - Processed: E:\Projects\cs2-economic-analysis\bad\data\processed


## 2. List Available Demos

Scan `data/raw/` for `.dem` files to process.

In [3]:
# Find all .dem files
demo_files = list(DATA_RAW.rglob("*.dem"))

print(f"Found {len(demo_files)} demo(s) in {DATA_RAW}:")
for i, demo in enumerate(demo_files, 1):
    size_mb = demo.stat().st_size / (1024 * 1024)
    print(f"  {i}. {demo.name} ({size_mb:.1f} MB)")

if not demo_files:
    print("\nNo demos found. Please add .dem files to data/raw/")

Found 1 demo(s) in ..\data\raw:
  1. vitality_vs_mongolz_inferno_2025-01-26.dem (485.5 MB)


## 3. Parse All Demos

Process each demo file and extract features for all rounds.

**Processing time**: ~30-60s per demo depending on match length.

In [9]:
# Store individual DataFrames
all_dfs = []
match_metadata = []

for demo_path in demo_files:
    try:
        print(f"\n{'=' * 60}")
        
        # Parse demo
        df = parse_demo(demo_path)
        
        # Add match identifier
        df['match_file'] = demo_path.stem
        
        # Save interim CSV (one per match)
        interim_file = DATA_INTERIM / f"{demo_path.stem}.csv"
        df.to_csv(interim_file, index=False)
        print(f"Saved interim file: {interim_file.name}")
        
        # Store for consolidation
        all_dfs.append(df)
        
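        # Track metadata. ct_score / t_score hold the score before the round
        # is played, so the last round's winner is credited to get the final score.
        last_round = df.iloc[-1]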

        if last_round['round_winner'] == 1:  # CT won
            final_ct_score = last_round['ct_score'] + 1
            final_t_score = last_round['t_score']
        else:  # T won
            final_ct_score = last_round['ct_score']
            final_t_score = last_round['t_score'] + 1

        match_metadata.append({
            'match_file': demo_path.stem,
            'map_name': df['map_name'].iloc[0],
            'total_rounds': len(df),
            'final_score': f"{max(final_ct_score, final_t_score)}-{min(final_ct_score, final_t_score)}",
            'overtime': bool(df['is_overtime'].any())
        })
        
    except Exception as e:
        print(f"Error parsing {demo_path.name}: {e}")
        continue

print(f"\n{'=' * 60}")
print(f"Parsing complete: {len(all_dfs)}/{len(demo_files)} demos processed successfully")


Parsing: vitality_vs_mongolz_inferno_2025-01-26.dem
✓ Extracted 19 rounds from vitality_vs_mongolz_inferno_2025-01-26.dem
Saved interim file: vitality_vs_mongolz_inferno_2025-01-26.csv

Parsing complete: 1/1 demos processed successfully
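
The final-score logic inside the loop can be isolated into a small helper, which makes it easier to test on its own. The sketch below assumes, as the loop does, that `ct_score` and `t_score` hold the score before each round is played. `final_score` is a hypothetical name and the toy DataFrame is made up for illustration; neither comes from the project's parser.

```python
import pandas as pd


def final_score(df: pd.DataFrame) -> tuple[int, int]:
    """Return (ct_score, t_score) after the last parsed round.

    Assumes ct_score / t_score are pre-round scores, so the winner of the
    last round still has to be credited.
    """
    last = df.iloc[-1]
    ct, t = int(last["ct_score"]), int(last["t_score"])
    if last["round_winner"] == 1:  # 1 = CT win
        ct += 1
    else:  # 0 = T win
        t += 1
    return ct, t


# Toy two-round match (made-up values, not taken from a real demo)
toy = pd.DataFrame(
    {"ct_score": [0, 1], "t_score": [0, 0], "round_winner": [1, 0]}
)
ct, t = final_score(toy)
print(f"{max(ct, t)}-{min(ct, t)}")  # prints "1-1"
```

Returning the two scores separately keeps the side information available; formatting as `max-min` can then happen at the point of display.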


## 4. Consolidate All Matches

Merge all individual match DataFrames into a single consolidated dataset.

In [10]:
if len(all_dfs) > 0:
    # Concatenate all DataFrames
    df_final = pd.concat(all_dfs, ignore_index=True)
    
    # Save consolidated dataset
    output_file = DATA_PROCESSED / "all_matches.csv"
    df_final.to_csv(output_file, index=False)
    
    print(f"✓ Consolidated dataset saved: {output_file.name}")
    print(f"  - Total matches: {df_final['match_file'].nunique()}")
    print(f"  - Total rounds: {len(df_final)}")
    print(f"  - Shape: {df_final.shape}")
    print(f"  - Memory: {df_final.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
else:
    print("No data to consolidate. Please ensure demos are available in data/raw/")

✓ Consolidated dataset saved: all_matches.csv
  - Total matches: 1
  - Total rounds: 19
  - Shape: (19, 51)
  - Memory: 0.01 MB
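
Because the consolidated file feeds every later step, a few cheap invariants are worth checking before relying on it. The sketch below is not part of the project's code: `check_rounds` is a hypothetical helper, and the invariants (five players per side, each `*_avg` column equal to the side total divided by five) are assumptions inferred from the column names and the data preview in this notebook.

```python
import pandas as pd


def check_rounds(df: pd.DataFrame, team_size: int = 5) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not df["round_winner"].isin([0, 1]).all():
        problems.append("round_winner has values other than 0/1")
    for side in ("ct", "t"):
        # Per-player averages should equal the side total divided by team size
        for base in ("cash", "equipment_value"):
            total, avg = df[f"{side}_{base}"], df[f"{side}_{base}_avg"]
            if not ((total / team_size - avg).abs() < 1e-6).all():
                problems.append(f"{side}_{base}_avg != {side}_{base} / {team_size}")
        # An item count cannot exceed the number of players on a side
        if not df[f"{side}_armor_count"].between(0, team_size).all():
            problems.append(f"{side}_armor_count outside 0..{team_size}")
    return problems


# First round from this notebook's data preview, restricted to the checked columns
sample = pd.DataFrame({
    "round_winner": [0],
    "ct_cash": [550], "ct_cash_avg": [110.0],
    "t_cash": [600], "t_cash_avg": [120.0],
    "ct_equipment_value": [4450], "ct_equipment_value_avg": [890.0],
    "t_equipment_value": [4200], "t_equipment_value_avg": [840.0],
    "ct_armor_count": [3], "t_armor_count": [3],
})
print(check_rounds(sample))  # prints []
```

In the notebook this would be called as `check_rounds(df_final)` right after the concatenation; collecting messages rather than raising lets all problems surface in a single pass.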


## 5. Match Summary Statistics

Overview of parsed matches.

In [11]:
if len(match_metadata) > 0:
    df_metadata = pd.DataFrame(match_metadata)
    
    print("\n" + "=" * 60)
    print("MATCH SUMMARY")
    print("=" * 60)
    print(df_metadata.to_string(index=False))
    
    print("\n" + "=" * 60)
    print("AGGREGATED STATISTICS")
    print("=" * 60)
    print(f"Total matches: {len(df_metadata)}")
    print(f"Total rounds: {df_metadata['total_rounds'].sum()}")
    print(f"Average rounds per match: {df_metadata['total_rounds'].mean():.1f}")
    print(f"Matches with overtime: {df_metadata['overtime'].sum()}")
    print("\nMaps played:")
    print(df_metadata['map_name'].value_counts().to_string())
else:
    print("No metadata available.")


MATCH SUMMARY
                            match_file   map_name  total_rounds final_score  overtime
vitality_vs_mongolz_inferno_2025-01-26 de_inferno            19        13-6     False

AGGREGATED STATISTICS
Total matches: 1
Total rounds: 19
Average rounds per match: 19.0
Matches with overtime: 0

Maps played:
map_name
de_inferno    1


## 6. Data Preview

Preview the final consolidated dataset.

In [12]:
if len(all_dfs) > 0:
    print("\n" + "=" * 60)
    print("DATA PREVIEW")
    print("=" * 60)
    
    print("\nColumns:")
    print(list(df_final.columns))
    
    print("\nFirst 3 rounds:")
    display(df_final.head(3))
    
    print("\nData types:")
    print(df_final.dtypes)
    
    print("\nBasic statistics (numeric features):")
    display(df_final.describe())
else:
    print("No data available for preview.")


DATA PREVIEW

Columns:
['ct_money_total', 't_money_total', 'ct_cash', 't_cash', 'ct_cash_avg', 't_cash_avg', 'ct_armor_count', 't_armor_count', 'ct_helmet_count', 't_helmet_count', 'ct_defuser_count', 'ct_awp_count', 't_awp_count', 'ct_ssg_count', 't_ssg_count', 'ct_rifle_count', 't_rifle_count', 'ct_smg_count', 't_smg_count', 'ct_heavy_count', 't_heavy_count', 'ct_ak_count', 'ct_smoke_count', 't_smoke_count', 'ct_molo_count', 't_molo_count', 'ct_flash_count', 't_flash_count', 'ct_he_count', 't_he_count', 'ct_utility_value', 't_utility_value', 'ct_equipment_value', 't_equipment_value', 'ct_equipment_value_avg', 't_equipment_value_avg', 'round_number', 'ct_score', 't_score', 'ct_rounds_won_streak', 'ct_rounds_lost_streak', 't_rounds_won_streak', 't_rounds_lost_streak', 'map_name', 'is_overtime', 'ct_survivors_previous', 't_survivors_previous', 'ct_equipment_saved_value', 't_equipment_saved_value', 'round_winner', 'match_file']

First 3 rounds:


Unnamed: 0,ct_money_total,t_money_total,ct_cash,t_cash,ct_cash_avg,t_cash_avg,ct_armor_count,t_armor_count,ct_helmet_count,t_helmet_count,ct_defuser_count,ct_awp_count,t_awp_count,ct_ssg_count,t_ssg_count,ct_rifle_count,t_rifle_count,ct_smg_count,t_smg_count,ct_heavy_count,t_heavy_count,ct_ak_count,ct_smoke_count,t_smoke_count,ct_molo_count,t_molo_count,ct_flash_count,t_flash_count,ct_he_count,t_he_count,ct_utility_value,t_utility_value,ct_equipment_value,t_equipment_value,ct_equipment_value_avg,t_equipment_value_avg,round_number,ct_score,t_score,ct_rounds_won_streak,ct_rounds_lost_streak,t_rounds_won_streak,t_rounds_lost_streak,map_name,is_overtime,ct_survivors_previous,t_survivors_previous,ct_equipment_saved_value,t_equipment_saved_value,round_winner,match_file
0,5000.0,5000.0,550,600,110.0,120.0,3,3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,2,1,1,0,1,1,0,1100,1250,4450,4200,890.0,840.0,1,0,0,0,0,0,0,de_inferno,0,0,0,0,0,0,vitality_vs_mongolz_inferno_2025-01-26
1,12250.0,21500.0,350,700,70.0,140.0,5,5,3,5,0,0,0,0,0,0,4,2,1,0,0,0,4,5,0,4,3,4,2,3,2400,4800,11100,19950,2220.0,3990.0,2,0,1,0,1,1,0,de_inferno,0,0,1,0,850,1,vitality_vs_mongolz_inferno_2025-01-26
2,23700.0,9900.0,1200,200,240.0,40.0,5,4,5,4,2,0,0,0,0,3,0,2,0,0,0,1,4,3,3,0,4,3,4,1,4700,1800,21850,8500,4370.0,1700.0,3,1,1,1,0,0,1,de_inferno,0,1,0,4500,0,1,vitality_vs_mongolz_inferno_2025-01-26



Data types:
ct_money_total              float64
t_money_total               float64
ct_cash                       int64
t_cash                        int64
ct_cash_avg                 float64
t_cash_avg                  float64
ct_armor_count                int64
t_armor_count                 int64
ct_helmet_count               int64
t_helmet_count                int64
ct_defuser_count              int64
ct_awp_count                  int64
t_awp_count                   int64
ct_ssg_count                  int64
t_ssg_count                   int64
ct_rifle_count                int64
t_rifle_count                 int64
ct_smg_count                  int64
t_smg_count                   int64
ct_heavy_count                int64
t_heavy_count                 int64
ct_ak_count                   int64
ct_smoke_count                int64
t_smoke_count                 int64
ct_molo_count                 int64
t_molo_count                  int64
ct_flash_count                int64
t_flash_count  

Unnamed: 0,ct_money_total,t_money_total,ct_cash,t_cash,ct_cash_avg,t_cash_avg,ct_armor_count,t_armor_count,ct_helmet_count,t_helmet_count,ct_defuser_count,ct_awp_count,t_awp_count,ct_ssg_count,t_ssg_count,ct_rifle_count,t_rifle_count,ct_smg_count,t_smg_count,ct_heavy_count,t_heavy_count,ct_ak_count,ct_smoke_count,t_smoke_count,ct_molo_count,t_molo_count,ct_flash_count,t_flash_count,ct_he_count,t_he_count,ct_utility_value,t_utility_value,ct_equipment_value,t_equipment_value,ct_equipment_value_avg,t_equipment_value_avg,round_number,ct_score,t_score,ct_rounds_won_streak,ct_rounds_lost_streak,t_rounds_won_streak,t_rounds_lost_streak,is_overtime,ct_survivors_previous,t_survivors_previous,ct_equipment_saved_value,t_equipment_saved_value,round_winner
count,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0,19.0
mean,34523.684211,21042.105263,11071.052632,3865.789474,2214.210526,773.157895,4.842105,4.210526,3.631579,3.842105,2.157895,0.368421,0.157895,0.105263,0.0,3.052632,2.684211,0.684211,0.315789,0.0,0.0,0.842105,4.105263,3.789474,3.368421,2.631579,3.842105,3.105263,3.789474,2.105263,4815.789474,3444.736842,22815.789474,16586.842105,4563.157895,3317.368421,10.0,3.947368,5.052632,1.315789,0.473684,0.473684,1.315789,0.0,1.947368,0.736842,9315.789474,2589.473684,0.631579
std,19437.652239,9972.660728,12742.63748,5371.272677,2548.527496,1074.254535,0.50146,1.474937,1.73879,2.061907,1.500487,0.495595,0.374634,0.315302,0.0,1.928548,2.212405,1.204281,0.671038,0.0,0.0,0.898342,1.448936,1.685854,2.005839,1.977949,1.641922,1.696229,1.618605,1.629408,2146.123937,1902.111016,8888.148361,8672.39942,1777.629672,1734.479884,5.627314,2.296705,4.88164,1.529438,0.841191,0.841191,1.529438,0.0,1.778691,1.367971,8565.175821,5502.741209,0.495595
min,5000.0,5000.0,150.0,150.0,30.0,30.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,500.0,0.0,4200.0,3450.0,840.0,690.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,19075.0,14300.0,900.0,525.0,180.0,105.0,5.0,4.0,3.0,4.0,1.0,0.0,0.0,0.0,0.0,1.5,0.0,0.0,0.0,0.0,0.0,0.0,4.0,3.0,1.5,1.0,3.5,1.5,2.5,1.0,3450.0,1900.0,18250.0,8850.0,3650.0,1770.0,5.5,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,35100.0,21050.0,5400.0,700.0,1080.0,140.0,5.0,5.0,4.0,5.0,3.0,0.0,0.0,0.0,0.0,4.0,3.0,0.0,0.0,0.0,0.0,1.0,5.0,5.0,4.0,3.0,5.0,4.0,5.0,2.0,5800.0,4100.0,27400.0,19950.0,5480.0,3990.0,10.0,4.0,2.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,8000.0,0.0,1.0
75%,53275.0,27825.0,23650.0,5800.0,4730.0,1160.0,5.0,5.0,5.0,5.0,3.0,1.0,0.0,0.0,0.0,4.0,5.0,1.0,0.0,0.0,0.0,1.0,5.0,5.0,5.0,4.5,5.0,4.5,5.0,3.0,6500.0,4950.0,28900.0,24450.0,5780.0,4890.0,14.5,5.5,10.5,2.0,1.0,1.0,2.0,0.0,3.0,1.0,15925.0,1275.0,1.0
max,70100.0,43800.0,40750.0,17900.0,8150.0,3580.0,5.0,5.0,5.0,5.0,4.0,1.0,1.0,1.0,0.0,5.0,5.0,4.0,2.0,0.0,0.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,6500.0,6000.0,30650.0,26850.0,6130.0,5370.0,19.0,8.0,12.0,5.0,3.0,3.0,5.0,0.0,5.0,5.0,25300.0,19400.0,1.0


## Notes

- **Granularity**: one observation per round, taken at the end of freeze time
- **Target variable**: `round_winner` (0 = T win, 1 = CT win)
- **Side switches**: Streaks and saved-equipment features reset at round 13 (the halftime side switch) and every 3 rounds in overtime
- **Snapshot timing**: Features are captured at `freeze_end + 2s` so that late buys are included
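
The reset rule in the side-switch note can be expressed as a small predicate. This is a minimal sketch under stated assumptions, not the parser's actual logic: it assumes 24 regulation rounds (MR12), 3-round overtime halves, and that the first overtime round (round 25) also counts as a reset. `is_reset_round` and its parameters are hypothetical names.

```python
def is_reset_round(round_number: int, regulation_rounds: int = 24,
                   overtime_half: int = 3) -> bool:
    """True if streak and saved-equipment features should reset before this round.

    Assumptions (not confirmed by the parser): regulation lasts
    `regulation_rounds` rounds with one side switch at halftime, and overtime
    resets happen every `overtime_half` rounds starting with the first
    overtime round.
    """
    if round_number <= regulation_rounds:
        # Halftime switch: first round of the second half (13 for MR12)
        return round_number == regulation_rounds // 2 + 1
    # 0 for the first overtime round, 1 for the second, and so on
    ot_index = round_number - regulation_rounds - 1
    return ot_index % overtime_half == 0


print([r for r in range(1, 37) if is_reset_round(r)])
# prints [13, 25, 28, 31, 34]
```

If the parser does not treat round 25 as a reset, the overtime branch would need to exclude `ot_index == 0`.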

---