<a href="https://colab.research.google.com/github/Kirtikaa25/redLight/blob/main/Apriori3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install mlxtend pandas




In [3]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder


In [25]:
# 🟩 STEP 3: Load and Clean the Dataset

# Load CSV file
df = pd.read_csv('/content/red light violation.csv')

# Clean up messy column names: remove leading/trailing spaces and line breaks
df.columns = df.columns.str.strip().str.replace('\n', ' ', regex=True)

# 🔁 Remove exact duplicate columns (if any)
df = df.loc[:, ~df.columns.duplicated()]

# 🔍 Remove columns with suffixes like .1, .2 (often duplicates from Excel or form exports)
df = df.loc[:, ~df.columns.str.contains(r'\.\d+$')]

# Drop unwanted columns (only if they still exist)
columns_to_drop = [
    'Name', 'Email', 'Timestamp',
    'n which areas do you think red light violations are more of a problem ?',
    'Do you believe that adding countdown timers to traffic lights would reduce red light violations?',
    'What strategies or policies might make you less likely to run a red light ?',
    'In which areas do you think red light violations are more of a problem ?',
    'What strategies or policies might make you less likely to run a red light , and which measures do you think would most effectively reduce red light running?',
    'What type of public awareness campaign would be most likely to influence your driving behaviour regarding red light running ?',
    ''
]

# Only drop columns if they exist
df = df.drop(columns=[col for col in columns_to_drop if col in df.columns])

# Optional: Print remaining column names for inspection
print("Remaining columns:", df.columns.tolist())

# Preview cleaned dataframe
df.head()




Remaining columns: ['Age Group', 'Gender', 'Occupation', 'Approximate monthly income of the person who violates traffic signals', 'Education level', 'Driving Experience', 'How often do your friends describe you as argumentative?', 'Do you openly express disagreement with your friends?', 'Do you find yourself getting into arguments when people disagree with you?', 'Do you get into physical fights more frequently than the average person?', 'Have you ever been so angry that you broke something?', 'Have you ever crossed the street during a red light?', 'How do you react when you see someone running a red light?', 'How many hours a day do you usually drive?', 'What are the main reasons that make you more likely to cross during a red light?', 'How often do you cross during a red light ?', 'Do you believe that adding countdown timers effect on red light violation?', 'When do you believe red light violations are most common?', 'Which type of vehicle is more likely to violate red light?', 'Whic

Unnamed: 0,Age Group,Gender,Occupation,Approximate monthly income of the person who violates traffic signals,Education level,Driving Experience,How often do your friends describe you as argumentative?,Do you openly express disagreement with your friends?,Do you find yourself getting into arguments when people disagree with you?,Do you get into physical fights more frequently than the average person?,Have you ever been so angry that you broke something?,Have you ever crossed the street during a red light?,How do you react when you see someone running a red light?,How many hours a day do you usually drive?,What are the main reasons that make you more likely to cross during a red light?,How often do you cross during a red light ?,Do you believe that adding countdown timers effect on red light violation?,When do you believe red light violations are most common?,Which type of vehicle is more likely to violate red light?,Which type of intersections do you believe is most prone to red light violations?
0,19 - 29,Male,Employed,20k -50k,Post -Graduation and above,More than 5 years,Moderate,Somewhat,No,No,Yes,No,Feel frustrated,1 -3 hours,To save time or avoid long waits,,,,,
1,19 - 29,Male,Student,Less than 20 k,Graduation,0 -1 year,Moderate,Somewhat,Yes,No,No,Yes,,,,Only in case of emergency,"Yes ,but only slightly",Weekdays during peak traffic hours,Auto rickshaw,All of the above
2,19 - 29,Male,Student,Less than 20 k,Graduation,,Moderate,Somewhat,Yes,No,No,Yes,,,,Only in case of emergency,"Yes ,but only slightly",Weekdays during peak traffic hours,Auto rickshaw,All of the above
3,19 - 29,Male,Student,20k -50k,Graduation,More than 5 years,Moderate,Somewhat,Sometimes,No,No,No,Feel frustrated,Less then 1 hour,To save time or avoid long waits,,,Weekdays during non -peak or late night hours,Motorcycles,Circular intersections
4,19 - 29,Male,Student,50k - 1 lakh,Graduation,1- 5 years,Moderate,Somewhat,Sometimes,No,Yes,Yes,,,,Only in case of emergency,"Yes,it would help a lot",Weekdays during peak traffic hours,Motorcycles,All of the above
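
The column clean-up in Step 3 can be tried without the survey file. The sketch below is a plain-Python illustration on invented header names (they are assumptions, not taken from the dataset). It applies the same three rules: trim whitespace and line breaks, keep the first of any exact duplicates, and drop names that end in a numeric suffix such as `.1`.

```python
import re

# Hypothetical raw headers, invented for illustration only
raw_columns = [" Age Group ", "Gender\n", "Gender",
               "Driving\nExperience", "Occupation", "Occupation.1"]

# 1. Strip surrounding whitespace, then replace inner line breaks with spaces
cleaned = [name.strip().replace("\n", " ") for name in raw_columns]

# 2. Keep only the first occurrence of each exact duplicate
seen = set()
deduplicated = []
for name in cleaned:
    if name not in seen:
        seen.add(name)
        deduplicated.append(name)

# 3. Drop names that end in ".<digits>"
suffix = re.compile(r"\.\d+$")
kept_columns = [name for name in deduplicated if not suffix.search(name)]

print(kept_columns)
```

The order matters: stripping first turns `"Gender\n"` into `"Gender"`, which is what lets the duplicate check catch it.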


In [26]:
# 🟩 STEP 4: Encode Data for Apriori

# Optional but recommended: Drop rows with missing values to avoid issues during encoding
df.dropna(inplace=True)

# Convert each row into a list of strings in the format "column=value"
transactions = df.apply(lambda row: [f"{col}={row[col]}" for col in df.columns], axis=1).tolist()

# Use TransactionEncoder to convert the transactions into a one-hot encoded format
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)

# Create a DataFrame from the encoded array
df_encoded = pd.DataFrame(te_array, columns=te.columns_)

# Preview the encoded DataFrame
df_encoded.head()



Unnamed: 0,Age Group=19 - 29,Age Group=30 - 49,Age Group=Above 50,Approximate monthly income of the person who violates traffic signals=20k -50k,Approximate monthly income of the person who violates traffic signals=50k - 1 lakh,Approximate monthly income of the person who violates traffic signals=Above 1 lakh,Approximate monthly income of the person who violates traffic signals=Less than 20 k,Approximate monthly income of the person who violates traffic signals=Not like to disclose,Do you believe that adding countdown timers effect on red light violation?=Negative effect,"Do you believe that adding countdown timers effect on red light violation?=No ,it wouldn't make a difference",...,When do you believe red light violations are most common?=Weekends during non - peak or late night hours,When do you believe red light violations are most common?=Weekends during peak traffic hours,Which type of intersections do you believe is most prone to red light violations?=All of the above,Which type of intersections do you believe is most prone to red light violations?=Circular intersections,Which type of intersections do you believe is most prone to red light violations?=Four signalised intersections,Which type of intersections do you believe is most prone to red light violations?=Three signalised intersectiions,Which type of vehicle is more likely to violate red light?=Auto rickshaw,Which type of vehicle is more likely to violate red light?=Cars,Which type of vehicle is more likely to violate red light?=Heavy vehicles,Which type of vehicle is more likely to violate red light?=Motorcycles
0,False,False,True,False,False,True,False,False,False,False,...,False,False,True,False,False,False,False,False,False,True
1,False,False,True,False,True,False,False,False,False,False,...,True,False,True,False,False,False,False,False,False,True
2,False,True,False,False,False,True,False,False,False,False,...,False,True,True,False,False,False,True,False,False,False
3,False,True,False,False,False,True,False,False,False,False,...,False,False,False,False,True,False,False,False,False,True
4,False,False,True,False,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
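
To make the encoding step concrete, here is a minimal plain-Python sketch of the same idea on two invented survey rows. It builds `column=value` items and then a boolean matrix with one column per distinct item, which is roughly the shape `TransactionEncoder` produces. The rows and the `one_hot` helper are assumptions for illustration, not part of mlxtend.

```python
# Two hypothetical respondents (values invented for illustration)
rows = [
    {"Gender": "Male", "Occupation": "Student"},
    {"Gender": "Male", "Occupation": "Employed"},
]

# Each row becomes a transaction of "column=value" items
transactions = [[f"{col}={val}" for col, val in row.items()] for row in rows]


def one_hot(transactions):
    """Return (sorted item names, boolean matrix) for a list of transactions."""
    items = sorted({item for t in transactions for item in t})
    matrix = [[item in set(t) for item in items] for t in transactions]
    return items, matrix


items, matrix = one_hot(transactions)
print(items)
for encoded_row in matrix:
    print(encoded_row)
```

Prefixing each value with its column name keeps answers such as `Yes` or `No` from different questions apart, so they do not collapse into a single item.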


In [27]:
from mlxtend.frequent_patterns import apriori, association_rules

# Optional: Drop very rare items (present in 1% of transactions or fewer)
min_col_support = 0.01
min_count = min_col_support * len(df_encoded)
df_filtered = df_encoded.loc[:, df_encoded.sum() > min_count]

# Apply Apriori algorithm
frequent_itemsets = apriori(
    df_filtered,
    min_support=0.01,
    use_colnames=True,
    max_len=3  # You can change this depending on how deep you want the rules to go
)

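# Add a 'fatal_occurrences' column: the number of rows (respondents) that contain each itemset.
# Round before casting so floating-point error cannot truncate a count (e.g. 0.58 * 50 -> 28.99...)
frequent_itemsets['fatal_occurrences'] = (frequent_itemsets['support'] * len(df_encoded)).round().astype(int)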

# Sort itemsets by occurrence count
frequent_itemsets = frequent_itemsets.sort_values(by='fatal_occurrences', ascending=False)

# Show top itemsets
frequent_itemsets.head()




Unnamed: 0,support,itemsets,fatal_occurrences
28,0.96,(Gender=Male),48
49,0.84,(Occupation=Employed),42
1314,0.82,"(Gender=Male, Occupation=Employed)",41
15,0.76,(Do you get into physical fights more frequent...,38
802,0.74,(Do you get into physical fights more frequent...,37
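
One detail of the count column is worth checking by hand: turning `support * n` into an integer by truncation can lose one, because the floating-point product may land just below the true count. The sketch below is plain Python and assumes 50 rows remain after `dropna` (consistent with support 0.96 for 48 rows above); the support of 0.58 is used as an example value for an itemset found in 29 of 50 rows.

```python
n_rows = 50       # assumed number of transactions (48 / 0.96 in the output above)
support = 0.58    # example support: 29 of 50 rows

product = support * n_rows
truncated = int(product)   # drops the fractional part
rounded = round(product)   # nearest integer

print(product, truncated, rounded)
```

With IEEE 754 doubles the product evaluates to slightly less than 29, so truncation gives 28 while rounding gives 29. Rounding before the integer cast, for example `.round().astype(int)` on a pandas Series, avoids the off-by-one.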


In [28]:
from mlxtend.frequent_patterns import association_rules

# Generate rules using confidence as the main metric
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

# Filter rules:
# 1. Lift > 1 → rule is better than random
# 2. Confidence < 1 → avoid rules that are always true (too obvious)
rules = rules[(rules['lift'] > 1) & (rules['confidence'] < 1)]

# Sort by confidence or lift (optional)
rules = rules.sort_values(by='confidence', ascending=False)

# Display top rules
rules.head()



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
1,(Occupation=Employed),(Gender=Male),0.84,0.96,0.82,0.97619,1.016865,1.0,0.0136,1.68,0.103659,0.836735,0.404762,0.915179
2,(Do you get into physical fights more frequent...,(Gender=Male),0.76,0.96,0.74,0.973684,1.014254,1.0,0.0104,1.52,0.058559,0.755102,0.342105,0.872259
13,(Age Group=30 - 49),(Gender=Male),0.68,0.96,0.66,0.970588,1.011029,1.0,0.0072,1.36,0.034091,0.673469,0.264706,0.829044
29,(How often do your friends describe you as arg...,(Gender=Male),0.6,0.96,0.58,0.966667,1.006944,1.0,0.004,1.2,0.017241,0.591837,0.166667,0.785417
40,(Do you get into physical fights more frequent...,(Occupation=Employed),0.58,0.84,0.56,0.965517,1.149425,1.0,0.0728,4.64,0.309524,0.651163,0.784483,0.816092
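
The metric columns can be reproduced from the three support values alone. As a quick check, the sketch below recomputes confidence, lift, leverage and conviction for the first rule, (Occupation=Employed) -> (Gender=Male), using the standard definitions. The variable names are chosen for illustration.

```python
# Values read from the first rule above: (Occupation=Employed) -> (Gender=Male)
antecedent_support = 0.84
consequent_support = 0.96
rule_support = 0.82

# confidence = P(consequent | antecedent)
confidence = rule_support / antecedent_support
# lift = confidence relative to the consequent's baseline frequency
lift = confidence / consequent_support
# leverage = observed joint support minus the support expected under independence
leverage = rule_support - antecedent_support * consequent_support
# conviction = (1 - consequent support) / (1 - confidence)
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 5), round(lift, 6), round(leverage, 4), round(conviction, 2))
```

The results agree with the table row (0.97619, 1.016865, 0.0136, 1.68). A lift this close to 1 means the rule adds little beyond the fact that 96% of the retained respondents are male.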


In [23]:
rules['confidence'].value_counts().sort_index()


confidence,count
0.500000,1506
0.501377,1
0.501484,1
0.501511,1
0.501805,1
...,...
0.998165,4
0.998291,3
0.998294,1
0.998302,11


In [29]:
# STEP 7: Format and sort output properly

# Select relevant rule columns
final_rules = rules[['antecedents', 'consequents', 'confidence', 'lift']].copy()

# Combine antecedents and consequents into a single list of attribute names
final_rules['Attributes'] = final_rules.apply(
    lambda x: list(x['antecedents']) + list(x['consequents']),
    axis=1
)

# Define function to count how many rows contain every item in the rule
def count_fatal_occurrences(attrs, data):
    try:
        return data[list(attrs)].all(axis=1).sum()
    except KeyError:
        # At least one item is not a column of the encoded data
        return 0

# Apply the function
final_rules['Fatal occurrences'] = final_rules['Attributes'].apply(
    lambda attrs: count_fatal_occurrences(attrs, df_encoded)
)

# Format 'Attributes' for readability
final_rules['Attributes'] = final_rules['Attributes'].apply(lambda x: ', '.join(x))

# Final selection of columns (copy so the assignments below do not trigger SettingWithCopyWarning)
final_rules = final_rules[['Attributes', 'Fatal occurrences', 'confidence', 'lift']].copy()

# Round numerical values
final_rules['confidence'] = final_rules['confidence'].round(2)
final_rules['lift'] = final_rules['lift'].round(2)

# Sort the rules
final_rules = final_rules.sort_values(by=['confidence', 'lift'], ascending=False).reset_index(drop=True)

# Add Rule Number
final_rules.index += 1
final_rules.insert(0, 'Rule No.', final_rules.index)

# Show top rules
final_rules.head(15)


Unnamed: 0,Rule No.,Attributes,Fatal occurrences,confidence,lift
1,1,"Occupation=Employed, Gender=Male",41,0.98,1.02
2,2,Do you get into physical fights more frequentl...,28,0.97,1.15
3,3,Do you get into physical fights more frequentl...,37,0.97,1.01
4,4,"Age Group=30 - 49, Gender=Male",33,0.97,1.01
5,5,How often do your friends describe you as argu...,29,0.97,1.01
6,6,"Education level=Graduation, Do you get into ph...",22,0.96,1.29
7,7,"Education level=Graduation, Do you get into ph...",22,0.96,1.26
8,8,"Education level=Graduation, Gender=Male, Do yo...",22,0.96,1.26
9,9,Do you get into physical fights more frequentl...,22,0.96,1.14
10,10,Do you believe that adding countdown timers ef...,27,0.96,1.0
