# Automotive NER Assignment
-------------------------------------------

The first task is to analyze the data and identify which entities can be extracted 
from it. We are interested in entities related to the automotive domain, for example 
component, failure issue, vehicle model, and corrective action. You may choose which 
entities, and how many, to extract.

## Reading and Creating the Data
----------------------------------

In [1]:
## importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import csv ## to create a csv file out of the data

In [10]:
pd.set_option("display.max_rows", 100)

In [8]:
## reading the data

cols = [1, 7, 8, 11, 20, 21, 22, 23]  ## 1-based column numbers to extract

with open("FLAT_RCL/FLAT_RCL.txt", "r") as file:
    ## creating the csv reader
    reader = csv.reader(file, delimiter = '\t')
    ## no header row is skipped: the first line of the source file is already
    ## a data record (RECORD_ID 1 in the output further down)

    ## writing the read data to a csv file
    with open("FLAT_RCL/FLAT_RCL.csv", "w", newline='') as c_file:
        ## creating the csv writer
        writer = csv.writer(c_file)
        ## info about the header
        writer.writerow(["RECORD_ID", "COMPNAME" , "MFGNAME", "RCLTYPECD", 
                         "DESC_DEFECT", "CONEQUENCE_DEFECT", "CORRECTIVE_ACTION", "NOTES"])

        ## iterate through rows and write in the file
        for row in reader:
            info = [row[i-1] for i in cols]
            ## writing into the file
            writer.writerow(info)
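
The conversion above can also be wrapped in a function that tolerates two things the cell does not handle: stray double quotes inside fields and rows with too few fields. This is only a sketch. Whether `FLAT_RCL.txt` actually contains either case is an assumption, and `convert_rows` is a hypothetical helper name, not part of the original notebook.

```python
import csv
import io

# 1-based field positions and output header, copied from the cell above
COLS = [1, 7, 8, 11, 20, 21, 22, 23]
HEADER = ["RECORD_ID", "COMPNAME", "MFGNAME", "RCLTYPECD",
          "DESC_DEFECT", "CONEQUENCE_DEFECT", "CORRECTIVE_ACTION", "NOTES"]


def convert_rows(src, dst, cols=COLS, header=HEADER):
    """Copy the selected tab-separated fields from src to dst as CSV.

    Returns the number of rows skipped for having too few fields.
    """
    # QUOTE_NONE treats double quotes as ordinary characters (assumption:
    # the source file does not use quoting to protect embedded tabs)
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.writer(dst)
    writer.writerow(header)
    skipped = 0
    for row in reader:
        if len(row) < max(cols):
            skipped += 1
            continue
        writer.writerow([row[i - 1] for i in cols])
    return skipped


# In-memory demonstration with one complete row and one short row
demo_fields = ["f%d" % i for i in range(1, 24)]
demo_src = io.StringIO("\t".join(demo_fields) + "\n" + "short\trow\n")
demo_dst = io.StringIO()
demo_skipped = convert_rows(demo_src, demo_dst)
print(demo_dst.getvalue())
print("rows skipped:", demo_skipped)
```

File handles opened as in the cell above (with `newline=''` on the output file) can be passed in place of the `io.StringIO` objects.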

In [2]:
## reading the csv file and performing remaining tasks

data = pd.read_csv("FLAT_RCL/FLAT_RCL.csv", index_col = "RECORD_ID")
data.head()

RECORD_ID,COMPNAME,MFGNAME,RCLTYPECD,DESC_DEFECT,CONEQUENCE_DEFECT,CORRECTIVE_ACTION,NOTES
1,ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES,FORD MOTOR COMPANY,V,CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC...,"THIS, IN TURN, COULD CAUSE THE BATTERY CABLES ...",DEALERS WILL INSPECT THE BATTERY CABLES FOR TH...,ALSO CONTACT THE NATIONAL HIGHWAY TRAFFIC SAFE...
2,ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES,FORD MOTOR COMPANY,V,CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC...,"THIS, IN TURN, COULD CAUSE THE BATTERY CABLES ...",DEALERS WILL INSPECT THE BATTERY CABLES FOR TH...,ALSO CONTACT THE NATIONAL HIGHWAY TRAFFIC SAFE...
3,EQUIPMENT:OTHER:LABELS,"JAYCO, INC.",V,"ON CERTAIN FOLDING TENT CAMPERS, THE FEDERAL C...","IF THE TIRES WERE INFLATED TO 80 PSI, THEY COU...",OWNERS WILL BE MAILED CORRECT LABELS FOR INSTA...,"ALSO, CUSTOMERS CAN CONTACT THE NATIONAL HIGHW..."
4,STRUCTURE,MONACO COACH CORP.,V,"ON CERTAIN CLASS A MOTOR HOMES, THE FLOOR TRUS...",CONDITIONS CAN RESULT IN THE BOTTOMING OUT THE...,DEALERS WILL INSPECT THE FLOOR TRUSS NETWORK S...,CUSTOMERS CAN ALSO CONTACT THE NATIONAL HIGHWA...
5,STRUCTURE,MONACO COACH CORP.,V,"ON CERTAIN CLASS A MOTOR HOMES, THE FLOOR TRUS...",CONDITIONS CAN RESULT IN THE BOTTOMING OUT THE...,DEALERS WILL INSPECT THE FLOOR TRUSS NETWORK S...,CUSTOMERS CAN ALSO CONTACT THE NATIONAL HIGHWA...


In [2]:
## about the data

## value_counts() already returns counts in descending order, so no extra sort is needed
data['COMPNAME'].value_counts().reset_index()
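
The `COMPNAME` values in the table above look like colon-separated hierarchies, for example `ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES`. Splitting them into levels gives a ready-made list of candidate component entities. The sketch below assumes the colon is always the level separator. `split_compname` is a hypothetical helper name.

```python
def split_compname(compname):
    """Split a COMPNAME string into its colon-separated levels."""
    return [part.strip() for part in compname.split(":") if part.strip()]


# Example values taken from the rows shown above
levels = split_compname("ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES")
print(levels)
print(split_compname("STRUCTURE"))
```

Applied to the whole column (for instance with `data['COMPNAME'].map(split_compname)`), the last level of each list is the most specific component name.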

In [3]:
data.loc[:1]

RECORD_ID,COMPNAME,MFGNAME,RCLTYPECD,DESC_DEFECT,CONEQUENCE_DEFECT,CORRECTIVE_ACTION,NOTES
1,ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES,FORD MOTOR COMPANY,V,CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC...,"THIS, IN TURN, COULD CAUSE THE BATTERY CABLES ...",DEALERS WILL INSPECT THE BATTERY CABLES FOR TH...,ALSO CONTACT THE NATIONAL HIGHWAY TRAFFIC SAFE...


CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC ENGINES, LOOSE OR BROKEN ATTACHMENTS AND MISROUTED BATTERY CABLES COULD LEAD TO CABLE INSULATION DAMAGE.

In [4]:
data.shape

(255657, 7)

In [16]:
from transformers import pipeline

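# Load NER pipeline
# aggregation_strategy="simple" merges word pieces into whole entities and gives
# each result the "entity_group" key that the labeling loop below reads
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        aggregation_strategy="simple")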

# Input text
text = "conditions can result in the bottoming out the suspension and amplification of the stress placed on the floor truss network. the additional stress can result in the fracture of welds securing the floor truss network system to the chassis frame rail and/or fracture of the floor truss network support system. the possibility exists that there could be damage to electrical wiring and/or fuel lines which could potentially lead to a fire."

# Perform NER
entities = ner_pipeline(text)

# Post-processing and labeling
labeled_entities = []
for entity in entities:
    labeled_entity = {
        "Entity": entity["word"],
        "Label": "Failure Issue" if "failure" in entity["entity_group"] else "Component"
    }
    labeled_entities.append(labeled_entity)

# Print labeled entities
print(labeled_entities)


All PyTorch model weights were used when initializing TFBertForTokenClassification.

All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


[]
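
The model used above was fine-tuned on CoNLL-03, whose entity types are person, organization, location, and miscellaneous, so it has no label for components or failure issues. As a simple baseline for those two entity types, the sketch below matches text against small term lists (a gazetteer). The term lists are hypothetical and hand-picked from the example sentence, not derived from the recall data. `tag_entities` is likewise a hypothetical name.

```python
import re

# Hypothetical, hand-picked term lists for illustration only
GAZETTEER = {
    "Component": ["floor truss network", "chassis frame rail",
                  "electrical wiring", "fuel lines", "suspension", "welds"],
    "Failure Issue": ["fracture", "fire", "damage", "bottoming out"],
}


def tag_entities(text, gazetteer=GAZETTEER):
    """Return sorted (start, end, matched_text, label) tuples.

    Matching is case-insensitive and restricted to whole words.
    """
    found = []
    for label, terms in gazetteer.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), match.end(), match.group(0), label))
    return sorted(found)


sample = ("the additional stress can result in the fracture of welds securing "
          "the floor truss network system to the chassis frame rail")
for start, end, term, label in tag_entities(sample):
    print(term, "->", label)
```

A matcher like this only finds terms it already knows. Its output could also serve as seed annotations for fine-tuning a model on domain-specific labels.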
