# AVIOS — Part 1: Manual Task Classification

In this notebook, we perform **rule-based manual classification** of Linux tasks along four dimensions:
- **Resource usage**: CPU-bound, IO-bound, Mixed
- **Interactivity**: Interactive, Batch, Background, Real-time, Other
- **Execution length**: Short, Medium, Long
- **Priority**: High, Medium, Low

These rule-based labels form the **baseline labeling process**; Part 2 uses them to train machine learning models.
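
The label sets listed above can be written down as data, which makes it easy to confirm that a labeled frame contains only the expected values. The following is a minimal sketch, not part of the original pipeline: `LABELS` and `unexpected_labels` are hypothetical names, and the column names are assumed to match the label columns created later in this notebook.

```python
import pandas as pd

# Label sets copied from the four dimensions described in this notebook.
# The dictionary keys are assumed to be the label column names used later on.
LABELS = {
    "Resource_Type": {"CPU-bound", "IO-bound", "Mixed"},
    "Interactivity": {"Interactive", "Batch", "Background", "Real-time", "Other"},
    "Execution_Time_Class": {"Short", "Medium", "Long"},
    "Priority_Class": {"High", "Medium", "Low"},
}

def unexpected_labels(frame: pd.DataFrame) -> dict:
    """Return, per label column, any values that fall outside the expected set."""
    problems = {}
    for column, allowed in LABELS.items():
        if column not in frame.columns:
            continue  # skip dimensions that have not been labeled yet
        extra = set(frame[column].dropna().unique()) - allowed
        if extra:
            problems[column] = extra
    return problems

# Toy frame with one deliberately invalid label ("cpu").
demo = pd.DataFrame({
    "Resource_Type": ["Mixed", "IO-bound", "cpu"],
    "Priority_Class": ["High", "Low", "Medium"],
})
print(unexpected_labels(demo))
```

An empty dictionary means every label column present in the frame uses only the expected values.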


In [None]:
import pandas as pd

df = pd.read_csv("/content/AVIOS-dataset.csv")

# ✅ Display basic info
print("Columns in dataset:\n", df.columns.tolist())
print("\nDataset shape:", df.shape)

# ✅ Display first few rows to inspect structure
df.head()


Columns in dataset:
 ['Timestamp', 'PID', 'Name', 'Cmdline', 'PPid', 'State', 'Threads', 'Priority', 'Nice', 'Scheduling_Policy', 'CPU_Usage_%', 'Total_Time_Ticks', 'Elapsed_Time_sec', 'VmRSS', 'VmSize', 'Voluntary_ctxt_switches', 'Nonvoluntary_ctxt_switches', 'IO_Read_Bytes', 'IO_Write_Bytes', 'IO_Read_Count', 'IO_Write_Count', 'se.exec_start', 'se.vruntime', 'se.sum_exec_runtime', 'nr_switches', 'nr_voluntary_switches', 'nr_involuntary_switches', 'se.load.weight']

Dataset shape: (3069226, 28)


Unnamed: 0,Timestamp,PID,Name,Cmdline,PPid,State,Threads,Priority,Nice,Scheduling_Policy,...,IO_Write_Bytes,IO_Read_Count,IO_Write_Count,se.exec_start,se.vruntime,se.sum_exec_runtime,nr_switches,nr_voluntary_switches,nr_involuntary_switches,se.load.weight
0,2025-09-08T19:15:55.317697,1,systemd,/sbin/init auto noprompt splash,-1,sleeping,1,20,0,SCHED_OTHER,...,14000128,86264,29798,545104.748975,151.473604,10680.166978,5750.0,4654.0,1096.0,1048576.0
1,2025-09-08T19:15:55.317697,2,kthreadd,,-1,sleeping,1,20,0,SCHED_OTHER,...,0,0,0,430132.297611,61606.365244,297.62188,501.0,469.0,32.0,1048576.0
2,2025-09-08T19:15:55.317697,3,pool_workqueue_release,,2,sleeping,1,20,0,SCHED_OTHER,...,0,0,0,17655.291718,562.952559,1.459164,6.0,6.0,0.0,1048576.0
3,2025-09-08T19:15:55.317697,4,kworker/R-rcu_g,,2,idle,1,0,-20,SCHED_OTHER,...,0,0,0,6190.041895,6.641979,0.2511,2.0,2.0,0.0,90891264.0
4,2025-09-08T19:15:55.317697,5,kworker/R-rcu_p,,2,idle,1,0,-20,SCHED_OTHER,...,0,0,0,6190.989595,6.937434,0.215199,2.0,2.0,0.0,90891264.0
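
The next cell converts several columns to numeric types one line at a time. As a hedged alternative, the same `pd.to_numeric(..., errors='coerce').fillna(0)` pattern could be wrapped in a small helper. `coerce_numeric` is a hypothetical name, and the toy frame below is illustrative data, not rows from the dataset.

```python
import pandas as pd

def coerce_numeric(frame: pd.DataFrame, columns, fill=0) -> pd.DataFrame:
    """Return a copy where the given columns are numeric; unparseable values become `fill`."""
    out = frame.copy()
    for col in columns:
        # errors="coerce" turns anything that cannot be parsed into NaN,
        # which fillna then replaces with the chosen default.
        out[col] = pd.to_numeric(out[col], errors="coerce").fillna(fill)
    return out

# Illustrative input: strings, a missing value, and an unparseable entry.
toy = pd.DataFrame({
    "Nice": ["0", "-20", "n/a"],
    "CPU_Usage_%": ["1.5", None, "80"],
})
clean = coerce_numeric(toy, ["Nice", "CPU_Usage_%"])
print(clean)
```

Working on a copy keeps the raw frame available for comparison, at the cost of extra memory on a dataset of this size.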


In [None]:
import numpy as np
import pandas as pd

# 🔧 Clean and convert columns as needed
df['CPU_Usage_%'] = pd.to_numeric(df['CPU_Usage_%'], errors='coerce').fillna(0)
df['Nice'] = pd.to_numeric(df['Nice'], errors='coerce').fillna(0)
df['Priority'] = pd.to_numeric(df['Priority'], errors='coerce').fillna(0)
df['Total_Time_Ticks'] = pd.to_numeric(df['Total_Time_Ticks'], errors='coerce').fillna(0)
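# Missing and zero elapsed times both become 1e-5 so the division below never yields NaN or inf
df['Elapsed_Time_sec'] = pd.to_numeric(df['Elapsed_Time_sec'], errors='coerce').fillna(0).replace(0, 1e-5)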
df['Voluntary_ctxt_switches'] = pd.to_numeric(df['Voluntary_ctxt_switches'], errors='coerce').fillna(0)
df['Nonvoluntary_ctxt_switches'] = pd.to_numeric(df['Nonvoluntary_ctxt_switches'], errors='coerce').fillna(0)
df['IO_Read_Bytes'] = pd.to_numeric(df['IO_Read_Bytes'], errors='coerce').fillna(0)
df['IO_Write_Bytes'] = pd.to_numeric(df['IO_Write_Bytes'], errors='coerce').fillna(0)

# ✅ Derived helper columns
df['avg_cpu_time'] = df['Total_Time_Ticks'] / df['Elapsed_Time_sec']   # CPU ticks per elapsed second (not used by the rules below)
df['total_io_bytes'] = df['IO_Read_Bytes'] + df['IO_Write_Bytes']      # total bytes read + written

# ----------------------------------------
# 🔷 1. Resource Usage Classification
# ----------------------------------------
def classify_resource(row):
    """Label a task by its dominant resource; CPU usage is checked before I/O volume."""
    cpu = row['CPU_Usage_%']
    io_bytes = row['total_io_bytes']

    if cpu > 50:
        return 'CPU-bound'
    elif io_bytes > 1e6:   # more than 1 MB read + written in total → clearly I/O heavy
        return 'IO-bound'
    else:
        return 'Mixed'     # neither threshold met (this also covers mostly idle tasks)

df['Resource_Type'] = df.apply(classify_resource, axis=1)

# ----------------------------------------
# 🔷 2. Interactivity Classification
# ----------------------------------------
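def classify_interactivity(row):
    """Rules are evaluated in order; the first one that matches decides the label."""
    policy = str(row['Scheduling_Policy']).upper()
    nice = row['Nice']
    cpu, ticks = row['CPU_Usage_%'], row['Total_Time_Ticks']
    v, nv = row['Voluntary_ctxt_switches'], row['Nonvoluntary_ctxt_switches']

    # Real-time scheduling policies take precedence over every other rule
    if policy in ['SCHED_FIFO', 'SCHED_RR']:
        return 'Real-time'
    # Idle policy or a strongly positive nice value → deprioritized work
    if policy == 'SCHED_IDLE' or nice > 10:
        return 'Background'
    # Little CPU time, mostly voluntary context switches, and some I/O
    if ticks < 500 and cpu < 30 and v > nv and row['total_io_bytes'] > 0:
        return 'Interactive'
    # Sustained heavy CPU use
    if cpu > 50 and ticks > 2000:
        return 'Batch'
    return 'Other'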

df['Interactivity'] = df.apply(classify_interactivity, axis=1)

# ----------------------------------------
# 🔷 3. Priority Classification
# ----------------------------------------
def classify_priority(row):
    """Map the nice value to a priority class (a lower nice value means higher priority)."""
    nice = row['Nice']
    if nice < 0:
        return 'High'
    elif nice == 0:
        return 'Medium'
    else:
        return 'Low'

df['Priority_Class'] = df.apply(classify_priority, axis=1)

# ----------------------------------------
# 🔷 4. Execution Time Classification
# ----------------------------------------
def classify_execution_time(row):
    """Bucket tasks by cumulative CPU time (Total_Time_Ticks), not by elapsed wall-clock time."""
    ticks = row['Total_Time_Ticks']
    if ticks < 500:
        return 'Short'
    elif ticks < 1200:
        return 'Medium'
    else:
        return 'Long'

df['Execution_Time_Class'] = df.apply(classify_execution_time, axis=1)

# ----------------------------------------
# ✅ Print label distributions as a sanity check
print("\n🔷 Resource Usage Classification Distribution:")
print(df['Resource_Type'].value_counts())

print("\n🔷 Interactivity Classification Distribution:")
print(df['Interactivity'].value_counts())

print("\n🔷 Priority Classification Distribution:")
print(df['Priority_Class'].value_counts())

print("\n🔷 Execution Time Classification Distribution:")
print(df['Execution_Time_Class'].value_counts())

# ✅ Save for model training
import csv
df.to_csv("classified_dataset.csv", index=False, quoting=csv.QUOTE_ALL, escapechar='\\')



🔷 Resource Usage Classification Distribution:
Resource_Type
Mixed        2648722
IO-bound      411590
CPU-bound       8914
Name: count, dtype: int64

🔷 Interactivity Classification Distribution:
Interactivity
Other          2014596
Interactive     594761
Real-time       424726
Background       27468
Batch             7675
Name: count, dtype: int64

🔷 Priority Classification Distribution:
Priority_Class
Medium    2204760
High       824756
Low         39710
Name: count, dtype: int64

🔷 Execution Time Classification Distribution:
Execution_Time_Class
Short     2959336
Long        61882
Medium      48008
Name: count, dtype: int64
