# Severity Analysis (Step 1 → Clean severity)

## Anchor Project: Decision Analytics for Supplier Quality Risk

What this notebook/script does:

1) Loads the raw recall dataset (permissive encoding, malformed rows skipped)
2) Checks the values in the `classification` column
3) Keeps only true severity classes: Class I / II / III
4) Maps them to a numeric severity score (5 / 3 / 1)
5) Saves a clean processed CSV for downstream steps
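
Steps 3 and 4 can be sketched as one reusable function. This is a minimal illustration, not the notebook's own code: `clean_severity` and `SEVERITY_MAP` are hypothetical names, and the demo frame is synthetic.

```python
import pandas as pd

# Label-to-score mapping taken from step 4 (5 / 3 / 1).
SEVERITY_MAP = {"Class I": 5, "Class II": 3, "Class III": 1}


def clean_severity(df: pd.DataFrame) -> pd.DataFrame:
    """Keep Class I/II/III rows and attach a numeric severity_score (hypothetical helper)."""
    if "classification" not in df.columns:
        raise ValueError("Column 'classification' not found.")
    # Rows with any other label, or a missing label, are dropped here.
    out = df[df["classification"].isin(list(SEVERITY_MAP))].copy()
    out["severity_score"] = out["classification"].map(SEVERITY_MAP)
    return out


# Tiny synthetic frame, not the real recall data.
demo = pd.DataFrame({"classification": ["Class I", "FDA Mandated", "Class III", None]})
cleaned = clean_severity(demo)
print(cleaned)
```

Wrapping the two steps together keeps the filter list and the score mapping in a single place, so they cannot drift apart.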

In [1]:
import pandas as pd

### Load Raw Data Safely

In [2]:
RAW_PATH = "data/raw/fda_drug_recalls.csv"

In [3]:
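# latin1 can decode any byte, so loading never fails on encoding (some characters may be mis-decoded).
# on_bad_lines="skip" silently drops rows that have too many fields.
df = pd.read_csv(
    RAW_PATH,
    encoding="latin1",
    engine="python",
    on_bad_lines="skip",
)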
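
Because `on_bad_lines="skip"` drops malformed rows without reporting them, it can be useful to know how many rows were lost. A minimal sketch, assuming pandas 1.4 or later, where `on_bad_lines` also accepts a callable when `engine="python"`. The `record_bad_line` helper and the sample text are hypothetical and stand in for the real file.

```python
import io

import pandas as pd

skipped_rows = []  # collects the fields of every malformed line


def record_bad_line(bad_line):
    """Remember the malformed line, then return None so pandas drops it."""
    skipped_rows.append(bad_line)
    return None


# Synthetic CSV text; the second data row has one field too many.
sample = io.StringIO("a,b\n1,2\n3,4,5\n6,7\n")
demo = pd.read_csv(sample, engine="python", on_bad_lines=record_bad_line)

print("Rows kept:", len(demo))
print("Rows skipped:", len(skipped_rows))
```

The same callable could be passed in the real `pd.read_csv` call in place of `"skip"` to log what was discarded.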

In [5]:
print("Raw shape:", df.shape)

Raw shape: (641, 24)


In [6]:
print("Columns:", list(df.columns))

Columns: ['country', 'city', 'address_1', 'reason_for_recall', 'address_2', 'product_quantity', 'code_info', 'center_classification_date', 'distribution_pattern', 'state', 'product_description', 'report_date', 'classification', 'openfda', 'recalling_firm', 'recall_number', 'initial_firm_notification', 'product_type', 'event_id', 'more_code_info', 'recall_initiation_date', 'postal_code', 'voluntary_mandated', 'status']


### Quick Preview

In [8]:
display(df.head())

Unnamed: 0,country,city,address_1,reason_for_recall,address_2,product_quantity,code_info,center_classification_date,distribution_pattern,state,...,recalling_firm,recall_number,initial_firm_notification,product_type,event_id,more_code_info,recall_initiation_date,postal_code,voluntary_mandated,status
0,United States,Davie,4131 SW 47th Ave Ste 1403,Recall initiated as a precautionary measure du...,,"1,990 bottles","UPC No. 632687615989; Lot No. 30661601, Exp. D...",20161025,"FL, MI, MS, and OH.",FL,...,Pharmatech LLC,F-0276-2017,Letter,Food,75272,,20160808,33314-4036,Voluntary: Firm initiated,Ongoing
1,United States,Miami,13439 NW 19 LANE,Virginia State (VDACS) found Listeria monocyto...,,144 pieces,UPC 635349 000390 Best By dates: 07/01/14 thr...,20141202,"FL, GA. NC, and TN",FL,...,"Oasis Brands, Inc",F-0609-2015,"Two or more of the following: Email, Fax, Lett...",Food,69516,20170328.0,20141010,33182,Voluntary: Firm initiated,Terminated
2,United States,Seattle,3429 Airport Way S,Coffee Toffee is recalled because pecan is lis...,,24 packages,no codes,20180614,distributed in WA,WA,...,Yukon Jackson,F-1578-2018,Visit,Food,80233,20180625.0,20180525,98134-2139,Voluntary: Firm initiated,Terminated
3,United States,Brooklyn,47 Bridgewater St # 57,"Product contains dried peaches, but front labe...",,unknown,UPC CODE: 6868978724496 BEST BEFORE: 11/15/2021,20200424,Unknown,NY,...,Rong Shing Trading NY Inc,F-0921-2020,"Two or more of the following: Email, Fax, Lett...",Food,85364,20210318.0,20200401,11222-3820,Voluntary: Firm initiated,Terminated
4,United States,Tipp City,320 N 2nd St,The firm stated that the product contains unde...,,480/20 ib cases,"Product #29973B Code Dates: 10/20/2016, 11/8/...",20170605,Product was sent to one manufacturer in MI,OH,...,Trophy Nut Co Inc,F-2326-2017,Letter,Food,77213,20180213.0,20170505,45371,Voluntary: Firm initiated,Terminated


### Inspect the 'classification' Column Values

In [10]:
if "classification" not in df.columns:
    raise ValueError("❌ Column 'classification' not found. Check your dataset columns.")

print("\n--- Unique values in classification ---")
print(df["classification"].dropna().unique())

print("\n--- Top classification value counts (raw) ---")
print(df["classification"].value_counts(dropna=False).head(20))


--- Unique values in classification ---
['Class II' 'Class I' 'Class III' 'Voluntary: Firm initiated'
 'FDA Mandated']

--- Top classification value counts (raw) ---
classification
Class II                     268
Class I                      208
Voluntary: Firm initiated    137
Class III                     27
FDA Mandated                   1
Name: count, dtype: int64
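
The values 'Voluntary: Firm initiated' and 'FDA Mandated' look like entries that belong in `voluntary_mandated` (the preview shows 'Voluntary: Firm initiated' in that column), which may mean some rows were parsed with shifted columns. That reading is an assumption, not something the counts prove. A minimal sketch for pulling such rows out for inspection, using synthetic rows and made-up recall numbers:

```python
import pandas as pd

VALID_CLASSES = ["Class I", "Class II", "Class III"]

# Synthetic rows, not the real dataset; the second row mimics a shifted record.
demo = pd.DataFrame(
    {
        "recall_number": ["F-0001", "F-0002", "F-0003"],
        "classification": ["Class II", "Voluntary: Firm initiated", "Class I"],
        "voluntary_mandated": ["Voluntary: Firm initiated", "Ongoing", "FDA Mandated"],
    }
)

# Rows whose classification is present but is not a recognized severity class.
is_unexpected = demo["classification"].notna() & ~demo["classification"].isin(VALID_CLASSES)
unexpected = demo[is_unexpected]

print(unexpected[["recall_number", "classification", "voluntary_mandated"]])
```

Looking at neighbouring columns for the flagged rows is one way to judge whether they are misaligned or simply carry an unusual label.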


### Filter to True Severity Classes Only

In [11]:
valid_classes = ["Class I", "Class II", "Class III"]

df_severity = df[df["classification"].isin(valid_classes)].copy()

print("\n Filtered to severity classes only")
print("Severity-only shape:", df_severity.shape)

print("\n--- Severity distribution (Class I/II/III) ---")
print(df_severity["classification"].value_counts())


 Filtered to severity classes only
Severity-only shape: (503, 24)

--- Severity distribution (Class I/II/III) ---
classification
Class II     268
Class I      208
Class III     27
Name: count, dtype: int64
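
As a sanity check, the kept and dropped rows should add up to the raw row count. The sketch below redoes that arithmetic from the counts printed in the raw `value_counts` output; the variable names are illustrative only.

```python
# Counts copied from the raw classification value counts.
raw_counts = {
    "Class II": 268,
    "Class I": 208,
    "Voluntary: Firm initiated": 137,
    "Class III": 27,
    "FDA Mandated": 1,
}
valid_classes = ["Class I", "Class II", "Class III"]

kept = sum(n for label, n in raw_counts.items() if label in valid_classes)
dropped = sum(n for label, n in raw_counts.items() if label not in valid_classes)

print("Kept:", kept)                # 503, matching the severity-only shape
print("Dropped:", dropped)          # 138
print("Total:", kept + dropped)     # 641, matching the raw shape
```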


### Map Class I/II/III to a numeric Severity Score

In [12]:
severity_map = {
    "Class I": 5,
    "Class II": 3,
    "Class III": 1
}

df_severity["severity_score"] = df_severity["classification"].map(severity_map)

In [13]:
missing_scores = df_severity["severity_score"].isna().sum()
if missing_scores > 0:
    raise ValueError(f"Found {missing_scores} rows with missing severity_score. Check mapping/data.")
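
The missing-score check matters because `Series.map` with a dict returns `NaN` for any label that is not a key, rather than raising. A small synthetic illustration, where the label "Class IV" is made up purely to trigger that case:

```python
import pandas as pd

severity_map = {"Class I": 5, "Class II": 3, "Class III": 1}

# Synthetic labels; the last one is deliberately absent from severity_map.
labels = pd.Series(["Class I", "Class III", "Class IV"])
scores = labels.map(severity_map)

print(scores)
print("Unmapped rows:", int(scores.isna().sum()))
```

Any `NaN` also turns the column into floats, so a passing check doubles as confirmation that the scores stayed integers.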

In [14]:
print("\n Severity score created")
print(df_severity[["classification", "severity_score"]].head())


 Severity score created
  classification  severity_score
0       Class II               3
1        Class I               5
2      Class III               1
3      Class III               1
4       Class II               3


### Save clean processed dataset for next steps

In [19]:
OUTPUT_PATH = "data/processed/recalls_severity_clean.csv"
df_severity.to_csv(OUTPUT_PATH, index=False)
print(f"Saved processed severity dataset to: {OUTPUT_PATH}")

Saved processed severity dataset to: data/processed/recalls_severity_clean.csv
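
`to_csv` does not create missing folders, so the save fails if `data/processed/` does not exist yet. A minimal sketch of a more defensive save that creates the folder and reloads the file to confirm the shape. `save_and_verify` is a hypothetical helper, and the demo writes a synthetic frame to a temporary folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_and_verify(df: pd.DataFrame, output_path: str) -> pd.DataFrame:
    """Write df to CSV, creating parent folders, then reload it to confirm the shape."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    if reloaded.shape != df.shape:
        raise ValueError(f"Shape changed on reload: {df.shape} -> {reloaded.shape}")
    return reloaded


# Synthetic frame written to a temporary folder so the example leaves no files behind.
demo = pd.DataFrame({"classification": ["Class I", "Class III"], "severity_score": [5, 1]})
with tempfile.TemporaryDirectory() as tmp:
    reloaded = save_and_verify(demo, str(Path(tmp) / "processed" / "demo.csv"))

print(reloaded)
```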


### Basic summary stats

In [None]:
severity_summary = (
    df_severity.groupby("classification")["severity_score"]
    .agg(["count", "mean"])
    .sort_values("mean", ascending=False)
)

print("\n--- Severity summary (count + mean score) ---")
print(severity_summary)
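
Within each class every row carries the same score, so the per-class mean is simply the mapped value (5, 3 or 1). An overall mean across classes can be a more informative single number, with the caveat that averaging an ordinal scale treats the 5 / 3 / 1 spacing as meaningful, which is an assumption. A sketch of the arithmetic using the class counts printed in the severity distribution:

```python
# Class counts from the severity distribution; scores from severity_map.
class_counts = {"Class I": 208, "Class II": 268, "Class III": 27}
severity_map = {"Class I": 5, "Class II": 3, "Class III": 1}

total_rows = sum(class_counts.values())                                         # 503
weighted_sum = sum(class_counts[c] * severity_map[c] for c in class_counts)     # 1871
overall_mean = weighted_sum / total_rows

print(f"Overall mean severity: {overall_mean:.2f}")  # 3.72
```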