<div align="center">
    <h1><b><u>TRAFFIC OPTIMIZATION</u></b></h1>
</div>


**GROUP NUMBER: 7**
     
**GROUP MEMBERS**
   
- **Adebola**
- **Rahim**
- **Sayeed**
- **Yinka**
- **Minto**
   

<div align="center">
    <h3><b><u>3. Encoding</u></b></h3>
</div>


In [2]:
# Import Libraries 
import pandas as pd

In [9]:
# Importing the Data
data=pd.read_csv('data/Cleaned_Data.csv')
data.head()

Unnamed: 0,ROAD_CLASS,DISTRICT,ACCLOC,TRAFFCTL,VISIBILITY,LIGHT,RDSFCOND,ACCLASS,IMPACTYPE,INVTYPE,INVAGE,INJURY,INITDIR,VEHTYPE,MANOEUVER,DRIVACT,DRIVCOND,DIVISION
0,Major Arterial,Toronto and East York,Intersection Related,No Control,Clear,Dark,Wet,Non-Fatal Injury,Approaching,Passenger,50 to 54,Major,East,"Automobile, Station Wagon",Going Ahead,Driving Properly,Normal,D55
1,Major Arterial,Toronto and East York,Intersection Related,No Control,Clear,Dark,Wet,Non-Fatal Injury,Approaching,Passenger,15 to 19,Minor,East,"Automobile, Station Wagon",Going Ahead,Driving Properly,Normal,D55
2,Major Arterial,Toronto and East York,Intersection Related,No Control,Clear,Dark,Wet,Non-Fatal Injury,Approaching,Driver,55 to 59,Minor,North,"Automobile, Station Wagon",Going Ahead,Driving Properly,Normal,D55
3,Major Arterial,Toronto and East York,Intersection Related,No Control,Clear,Dark,Wet,Non-Fatal Injury,Approaching,Passenger,20 to 24,Minor,East,"Automobile, Station Wagon",Going Ahead,Driving Properly,Normal,D55
4,Major Arterial,Toronto and East York,Intersection Related,No Control,Clear,Dark,Wet,Non-Fatal Injury,Approaching,Passenger,15 to 19,Minor,East,"Automobile, Station Wagon",Going Ahead,Driving Properly,Normal,D55


In [10]:
data.columns

Index(['ROAD_CLASS', 'DISTRICT', 'ACCLOC', 'TRAFFCTL', 'VISIBILITY', 'LIGHT',
       'RDSFCOND', 'ACCLASS', 'IMPACTYPE', 'INVTYPE', 'INVAGE', 'INJURY',
       'INITDIR', 'VEHTYPE', 'MANOEUVER', 'DRIVACT', 'DRIVCOND', 'DIVISION'],
      dtype='object')

In [11]:
# Collect the names of all categorical (object-dtype) columns in one list
categorical_columns = data.select_dtypes(include='object').columns.tolist()

### **Final Encoding Plan**
#### **1. Ordinal Encoding (for ordered categories)**
- **INJURY (Target Variable)**
  - Minimal → 0
  - Minor → 1
  - Major → 2
  - Fatal → 3  
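- **INVAGE** (ordered age groups, converted to the numeric midpoint of each range, e.g. `50 to 54` → 52)
- **LIGHT** (has a natural brightness order: Daylight < Dawn/Dusk < Dark; note that the code cell below frequency-encodes this column rather than applying this order)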
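
A minimal sketch of ordinal encoding with an explicit category order. The toy rows and the `light_rank` dictionary are assumptions for illustration; the actual `LIGHT` labels in `Cleaned_Data.csv` may differ.

```python
import pandas as pd

# Toy data; the LIGHT labels are assumed, not taken from the real file
toy = pd.DataFrame({
    "INJURY": ["Minor", "Major", "Fatal", "Minimal"],
    "LIGHT": ["Dark", "Daylight", "Dusk", "Dawn"],
})

# Explicit order for the target, lowest severity first
injury_order = ["Minimal", "Minor", "Major", "Fatal"]
toy["INJURY_CODE"] = pd.Categorical(
    toy["INJURY"], categories=injury_order, ordered=True
).codes

# Dawn and Dusk share one brightness rank, so a dict is used instead of a list
light_rank = {"Daylight": 0, "Dawn": 1, "Dusk": 1, "Dark": 2}
toy["LIGHT_CODE"] = toy["LIGHT"].map(light_rank)

print(toy)
```

With `pd.Categorical`, a label missing from the order list is coded as -1, while `Series.map` with a dictionary yields `NaN` for it, so either approach makes unexpected labels easy to spot.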

#### **2. One-Hot Encoding (for categories with ≤ 4 unique values)**
- **DISTRICT** (4 unique values)
  - `Toronto and East York`, `North York`, `Scarborough`, `Etobicoke York`  
- **ACCLASS** (3 unique values)
  - `Non-Fatal Injury`, `Fatal`, `Property Damage O`  
- **INITDIR** has 5 unique values, which exceeds the threshold, so it is frequency-encoded instead (listed in section 3)
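
As a small illustration of one-hot encoding with `drop_first=True`, the sketch below uses a toy frame with the four `DISTRICT` values listed above. The `INJURY` values are made up, and `dtype=int` is chosen here only to make the output easier to read.

```python
import pandas as pd

toy = pd.DataFrame({
    "DISTRICT": [
        "Toronto and East York", "North York",
        "Scarborough", "Etobicoke York",
    ],
    "INJURY": [2, 1, 1, 0],  # made-up target values
})

# drop_first=True drops the alphabetically first category (Etobicoke York),
# which then acts as the baseline encoded as all zeros
encoded = pd.get_dummies(toy, columns=["DISTRICT"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

Four categories therefore produce three indicator columns, and a row from the dropped category has zeros in all three.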

#### **3. Frequency Encoding (for categories with > 4 unique values)**
- **ROAD_CLASS (11 values)**
- **ACCLOC (10 values)**
- **TRAFFCTL (10 values)**
- **VISIBILITY (8 values)**
- **RDSFCOND (9 values)**
- **IMPACTYPE (10 values)**
- **INVTYPE (18 values)**
- **INITDIR (5 values)**
- **VEHTYPE (33 values)**
- **MANOEUVER (15 values)**
- **DRIVACT (13 values)**
- **DRIVCOND (10 values)**
- **DIVISION (17 values)**

---

### **Why This Works Well**
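1. **Keeps Dimensionality Low:** Frequency encoding replaces each high-cardinality column with a single numeric column, whereas one-hot encoding VEHTYPE alone (33 values) would add up to 32 indicator columns.
2. **Limits Overfitting Risk:** Fewer, denser features can leave a model less room to memorize rare categories than many sparse indicator columns.
3. **Stays Interpretable:** Each encoded value is the share of rows that fall in that category, so it shows how common a category is. It does not measure importance, and two categories with the same frequency receive the same value.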
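
The sketch below shows frequency encoding on a toy column. It learns the frequencies from a training split and reuses them on new rows, which is one way to keep test data from influencing the encoding; the notebook itself computes frequencies on the full dataset. The helper names `fit_frequency_map` and `apply_frequency_map` and the toy labels are hypothetical.

```python
import pandas as pd

def fit_frequency_map(series: pd.Series) -> pd.Series:
    """Return each category's share of the rows in `series`."""
    return series.value_counts(normalize=True)

def apply_frequency_map(series: pd.Series, freq_map: pd.Series) -> pd.Series:
    """Replace categories with their learned share; unseen categories get 0.0."""
    return series.map(freq_map).fillna(0.0)

# Toy splits with assumed road-surface labels
train = pd.Series(["Wet", "Dry", "Dry", "Dry"], name="RDSFCOND")
test = pd.Series(["Dry", "Ice"], name="RDSFCOND")

freq_map = fit_frequency_map(train)
train_encoded = apply_frequency_map(train, freq_map)
test_encoded = apply_frequency_map(test, freq_map)

print(train_encoded.tolist())
print(test_encoded.tolist())
```

Mapping unseen categories to 0.0 is a design choice made for this sketch; leaving them as `NaN` and imputing later would be an equally reasonable alternative.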

In [12]:
# Encode the target variable (INJURY) with an explicit ordinal mapping
injury_mapping = {"Minimal": 0, "Minor": 1, "Major": 2, "Fatal": 3}
data["INJURY"] = data["INJURY"].map(injury_mapping)

# A function to convert age ranges to midpoints
def age_to_midpoint(age_range):
    if age_range == "unknown":
        return None  # We'll fill missing values later
    elif age_range == "Over 95":
        return 97  
    else:
        start, end = map(int, age_range.split(" to "))
        return (start + end) / 2

# Apply the function to the INVAGE column
data["INVAGE"] = data["INVAGE"].apply(age_to_midpoint)

# Fill missing values with the median age
data["INVAGE"]= data["INVAGE"].fillna(data["INVAGE"].median())


# Columns to one-hot encode (4 or fewer unique values)
one_hot_cols = ["DISTRICT", "ACCLASS"]

# Apply one-hot encoding
data = pd.get_dummies(data, columns=one_hot_cols, drop_first=True)  # Drop first to avoid multicollinearity

# Columns to use frequency encoding (more than 4 unique values)
freq_cols = [
    "ROAD_CLASS", "ACCLOC", "TRAFFCTL", "VISIBILITY", "LIGHT", "RDSFCOND", 
    "IMPACTYPE", "INVTYPE", "INITDIR", "VEHTYPE", "MANOEUVER",
    "DRIVACT", "DRIVCOND", "DIVISION"
]

# Apply frequency encoding
for col in freq_cols:
    freq_map = data[col].value_counts(normalize=True)  # Get frequency of each category
    data[col] = data[col].map(freq_map)

# Display the first few rows
data.head()


Unnamed: 0,ROAD_CLASS,ACCLOC,TRAFFCTL,VISIBILITY,LIGHT,RDSFCOND,IMPACTYPE,INVTYPE,INVAGE,INJURY,...,VEHTYPE,MANOEUVER,DRIVACT,DRIVCOND,DIVISION,DISTRICT_North York,DISTRICT_Scarborough,DISTRICT_Toronto and East York,ACCLASS_Non-Fatal Injury,ACCLASS_Property Damage O
0,0.731234,0.084613,0.479823,0.864958,0.197605,0.165638,0.050588,0.152398,52.0,2,...,0.595664,0.764625,0.723427,0.81495,0.080709,False,False,True,True,False
1,0.731234,0.084613,0.479823,0.864958,0.197605,0.165638,0.050588,0.152398,17.0,1,...,0.595664,0.764625,0.723427,0.81495,0.080709,False,False,True,True,False
2,0.731234,0.084613,0.479823,0.864958,0.197605,0.165638,0.050588,0.457193,57.0,1,...,0.595664,0.764625,0.723427,0.81495,0.080709,False,False,True,True,False
3,0.731234,0.084613,0.479823,0.864958,0.197605,0.165638,0.050588,0.152398,22.0,1,...,0.595664,0.764625,0.723427,0.81495,0.080709,False,False,True,True,False
4,0.731234,0.084613,0.479823,0.864958,0.197605,0.165638,0.050588,0.152398,17.0,1,...,0.595664,0.764625,0.723427,0.81495,0.080709,False,False,True,True,False


In [13]:
# Save the encoded data
data.to_csv('data/encoded_data.csv', index=False)
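
Before saving, it can help to confirm that no text columns remain after encoding. The helper `non_numeric_columns` below is a hypothetical sketch, shown on a toy frame rather than the real data.

```python
import pandas as pd

def non_numeric_columns(df: pd.DataFrame) -> list:
    """List columns that are neither numeric nor boolean."""
    numeric = df.select_dtypes(include=["number", "bool"]).columns
    return [col for col in df.columns if col not in numeric]

# Toy frame: two encoded columns and one column that was missed
toy = pd.DataFrame({
    "ROAD_CLASS": [0.73, 0.27],
    "DISTRICT_North York": [False, True],
    "VEHTYPE": ["Truck", "Bicycle"],
})

leftover = non_numeric_columns(toy)
print(leftover)
```

On the real frame, `non_numeric_columns(data)` would be expected to return an empty list, and `data.isna().sum()` can reveal labels that a mapping such as `injury_mapping` did not cover, since `Series.map` turns unmapped labels into `NaN`.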