<a href="https://colab.research.google.com/github/Abogundipe/Analysis-of-NHTSA-Car-complaints/blob/main/Associative_Mining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install fpgrowth_py
!pip install wordcloud

# Importance of Topic Modeling


Not all entries may accurately depict the components involved in complaints, and some complaints might encompass multiple components. Employing topic modeling can address such instances by clustering complaints into topics based on their textual content. This method offers a means to validate and enhance the classification of complaints.

Topic modeling also aids in identifying emerging trends in vehicle complaints that may not yet be adequately covered in existing component descriptions. This capability is particularly valuable for early detection of novel issues.

Enhancing Customer Service and Product Development: Analyzing the range and frequency of topics discussed in complaints can guide customer service strategies and product development initiatives. It assists in prioritizing issues of utmost importance to customers and pinpointing areas for enhancing vehicle design and features.

In [2]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import colors
import pandas as pd
import numpy as np
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import ListedColormap
import pandas as pd
from tqdm.auto import tqdm
tqdm.pandas()  # Prepare tqdm to work with pandas apply()
import spacy
from collections import Counter
import fpgrowth_py.fpgrowth as fpgrowth
from wordcloud import WordCloud

In [16]:
df = pd.read_csv('/content/drive/MyDrive/NHTSA Project/COMPLAINTS_RECEIVED_2020-2024.csv')

In [17]:
df.head()

Unnamed: 0,Complain ID,MANUFACTURER'S NAME,VEHICLE MAKE,VEHICLE MODEL,MODEL YEAR,WAS VEHICLE INVOLVED IN A CRASH,WAS VEHICLE INVOLVED IN A FIRE,NUMBER OF PERSONS INJURED,NUMBER OF FATALITIES,SPECIFIC COMPONENT'S DESCRIPTION,CONSUMER'S CITY,CONSUMER'S STATE CODE,VEHICLE MILEAGE AT FAILURE,DESCRIPTION OF THE COMPLAINT,WAS INCIDENT REPORTED TO POLICE,ANTI-LOCK BRAKES,CRUISE CONTROL,VEHICLE SPEED,WAS VEHICLE TOWED,Year
0,514478,"Nissan North America, Inc.",NISSAN,ALTIMA,2000.0,N,N,0,0,UNKNOWN OR OTHER,SUNNYVALE,CA,,"HEAT DISPAY, SPEEDOMETER/ODOMETER/RPM METER FA...",N,N,Y,65.0,N,2005
1,514479,"Nissan North America, Inc.",NISSAN,ALTIMA,2000.0,N,N,0,0,ELECTRICAL SYSTEM,SUNNYVALE,CA,,"HEAT DISPAY, SPEEDOMETER/ODOMETER/RPM METER FA...",N,N,Y,65.0,N,2005
2,514486,"Porsche Cars North America, Inc.",PORSCHE,911,2002.0,N,N,0,0,ENGINE AND ENGINE COOLING:ENGINE,CHANDLER,AZ,,WHILE TRAVELING AT 45 MPH THERE WAS A LOUD RUM...,N,N,N,45.0,N,2005
3,514487,UNKNOWN MANUFACTURER,UNKNOWN,UNKNOWN,9999.0,N,N,0,0,TIRES,GRANITE CITY,IL,,WHILE DRIVING THE FRONT INSIDE DUEL TIRE SEPA...,N,N,N,,N,2005
4,514490,"Chrysler (FCA US, LLC)",PLYMOUTH,VOYAGER,1999.0,N,N,0,0,AIR BAGS:FRONTAL:DRIVER SIDE:INFLATOR MODULE,SAINT PETERSBURG,FL,,CONSUMER STATES WHILE TRAVELING AIR BAG INDIC...,N,Y,N,,N,2005


In [18]:
df.shape

(209480, 20)

In [19]:
df.columns

Index(['Complain ID', 'MANUFACTURER'S NAME', 'VEHICLE MAKE', 'VEHICLE MODEL',
       'MODEL YEAR', 'WAS VEHICLE INVOLVED IN A CRASH',
       'WAS VEHICLE INVOLVED IN A FIRE', 'NUMBER OF PERSONS INJURED',
       'NUMBER OF FATALITIES', 'SPECIFIC COMPONENT'S DESCRIPTION',
       'CONSUMER'S CITY', 'CONSUMER'S STATE CODE',
       'VEHICLE MILEAGE AT FAILURE', 'DESCRIPTION OF THE COMPLAINT',
       'WAS INCIDENT REPORTED TO POLICE', 'ANTI-LOCK BRAKES', 'CRUISE CONTROL',
       'VEHICLE SPEED', 'WAS VEHICLE TOWED', 'Year'],
      dtype='object')

In [66]:
select_df = df[['Complain ID', 'VEHICLE MAKE', 'VEHICLE MODEL', 'DESCRIPTION OF THE COMPLAINT','SPECIFIC COMPONENT\'S DESCRIPTION'
       ]]

In [67]:
select_df.head()

Unnamed: 0,Complain ID,VEHICLE MAKE,VEHICLE MODEL,DESCRIPTION OF THE COMPLAINT,SPECIFIC COMPONENT'S DESCRIPTION
0,514478,NISSAN,ALTIMA,"HEAT DISPAY, SPEEDOMETER/ODOMETER/RPM METER FA...",UNKNOWN OR OTHER
1,514479,NISSAN,ALTIMA,"HEAT DISPAY, SPEEDOMETER/ODOMETER/RPM METER FA...",ELECTRICAL SYSTEM
2,514486,PORSCHE,911,WHILE TRAVELING AT 45 MPH THERE WAS A LOUD RUM...,ENGINE AND ENGINE COOLING:ENGINE
3,514487,UNKNOWN,UNKNOWN,WHILE DRIVING THE FRONT INSIDE DUEL TIRE SEPA...,TIRES
4,514490,PLYMOUTH,VOYAGER,CONSUMER STATES WHILE TRAVELING AIR BAG INDIC...,AIR BAGS:FRONTAL:DRIVER SIDE:INFLATOR MODULE


In [68]:
selected_df = select_df[select_df['VEHICLE MAKE'] == 'UNKNOWN']
selected_df.head()

Unnamed: 0,Complain ID,VEHICLE MAKE,VEHICLE MODEL,DESCRIPTION OF THE COMPLAINT,SPECIFIC COMPONENT'S DESCRIPTION
3,514487,UNKNOWN,UNKNOWN,WHILE DRIVING THE FRONT INSIDE DUEL TIRE SEPA...,TIRES
128,514676,UNKNOWN,UNKNOWN,THE TRAILER WAS PARKED AND WITHOUT WARNING THE...,STRUCTURE
217,514802,UNKNOWN,UNKNOWN,THE CHILD BITES THE RUBBER SURROUNDING THE HAR...,CHILD SEAT:TETHER: STRAP/WEBBING
231,514819,UNKNOWN,UNKNOWN,AFTER RECEIVING NHTSA RECALL 03C244000 AFFEC...,EQUIPMENT:RECREATIONAL VEHICLE/TRAILER
232,514821,UNKNOWN,UNKNOWN,AFTER RECEIVING NHTSA RECALL 03C244000 AFFEC...,EQUIPMENT ADAPTIVE/MOBILITY


In [69]:
selected_df.shape

(2753, 5)

In [70]:
select_df2 = select_df.copy()

In [71]:
# Filtering out rows where 'Column_Name' contains 'unknown'
select_df3 = select_df2[select_df2['VEHICLE MAKE'] != 'UNKNOWN']

print(select_df3)

        Complain ID VEHICLE MAKE VEHICLE MODEL  \
0            514478       NISSAN        ALTIMA   
1            514479       NISSAN        ALTIMA   
2            514486      PORSCHE           911   
4            514490     PLYMOUTH       VOYAGER   
5            514494        EAGLE        VISION   
...             ...          ...           ...   
209475       750635        DODGE       CHARGER   
209476       750636        DODGE       DURANGO   
209477       750637    CHEVROLET        COBALT   
209478       750638   VOLKSWAGEN          GOLF   
209479       750641       TOYOTA         CAMRY   

                             DESCRIPTION OF THE COMPLAINT  \
0       HEAT DISPAY, SPEEDOMETER/ODOMETER/RPM METER FA...   
1       HEAT DISPAY, SPEEDOMETER/ODOMETER/RPM METER FA...   
2       WHILE TRAVELING AT 45 MPH THERE WAS A LOUD RUM...   
4       CONSUMER STATES WHILE TRAVELING  AIR BAG INDIC...   
5       THE DRIVER'S SIDE FRONT SEAT BACK WOULDN'T STA...   
...                              

In [72]:
select_df3.head()

Unnamed: 0,Complain ID,VEHICLE MAKE,VEHICLE MODEL,DESCRIPTION OF THE COMPLAINT,SPECIFIC COMPONENT'S DESCRIPTION
0,514478,NISSAN,ALTIMA,"HEAT DISPAY, SPEEDOMETER/ODOMETER/RPM METER FA...",UNKNOWN OR OTHER
1,514479,NISSAN,ALTIMA,"HEAT DISPAY, SPEEDOMETER/ODOMETER/RPM METER FA...",ELECTRICAL SYSTEM
2,514486,PORSCHE,911,WHILE TRAVELING AT 45 MPH THERE WAS A LOUD RUM...,ENGINE AND ENGINE COOLING:ENGINE
4,514490,PLYMOUTH,VOYAGER,CONSUMER STATES WHILE TRAVELING AIR BAG INDIC...,AIR BAGS:FRONTAL:DRIVER SIDE:INFLATOR MODULE
5,514494,EAGLE,VISION,THE DRIVER'S SIDE FRONT SEAT BACK WOULDN'T STA...,SEATS:FRONT ASSEMBLY:RECLINER


# Natural Langiage Processing

# Functions to Extract Noun and Noun Phrases from "DESCRIPTION OF THE COMPLAINT"

In [73]:
select_df3.shape

(206727, 5)

In [74]:
from tqdm.auto import tqdm
import spacy
from collections import Counter

# Initialize tqdm progress bar for pandas apply()
tqdm.pandas()



# Load the English model from spaCy
nlp_1 = spacy.load("en_core_web_sm")

# Define your specific stopwords including numbers as strings
specific_stopwords = set([
    '1', '2', '3', '4', '5', '6', '7', '8', '9','0',  # Numbers as strings
    'failure', 'vehicle', 'vehicles', 'problem', 'car', 'ford', 'manufacturer',
    'dealer', 'consumer', 'driver', 'time', 'times', 'issue',
    'model', 'make', 'recall', 'system', 'complaint', 'safety', 'service',
    'reported', 'called', 'told', 'said', 'repair', 'replace', 'fixed',
    'mileage', 'year', 'model year', 'incident', 'occurred', 'happened',
    'experience', 'action', 'process', 'technical', '*ak',
    'reflective', 'specific', 'general', 'term', 'nhtsa', 'ga', '*jb', '.', '*'
])


# Update spaCy's default stopwords with the specific stopwords
for word in specific_stopwords:
    nlp_1.vocab[word].is_stop = True

# Ensure all stopwords are recognized as such by spaCy
for word in specific_stopwords:
    lexeme = nlp_1.vocab[word]
    lexeme.is_stop = True

#Function to extract lemmatized nouns not in specific stopwords
def extract_lemmatized_nouns(text):
    # Convert text to lowercase before processing
    text = text.lower()
    doc = nlp_1(text)
    lemmatized_nouns = [token.lemma_ for token in doc if token.pos_ == "NOUN" and token.lemma_ not in specific_stopwords]
    return lemmatized_nouns


def extract_noun_phrases(text):
    text = text.lower()
    doc = nlp_1(text)
    # Use list comprehension to filter noun chunks without stopwords
    noun_phrases = [chunk.text for chunk in doc.noun_chunks if not any(token.is_stop for token in chunk)]
    return noun_phrases



# Function to create a DataFrame with Complaint ID, Vehicle Model, and Keywords


def createNounPhraseDF(modelDF):
    # Create a copy of the DataFrame to avoid modifying the original DataFrame
    resultDF = modelDF.copy()


    resultDF.loc[:, 'Keywords'] = resultDF['DESCRIPTION OF THE COMPLAINT'].progress_apply(extract_noun_phrases)

    # Select and rename the columns as needed
    resultDF = resultDF[['Complain ID', 'VEHICLE MODEL', 'Keywords']]

    return resultDF

def createNounDF(modelDF):
    # Create a copy of the DataFrame to avoid modifying the original DataFrame
    resultDF = modelDF.copy()


    resultDF.loc[:, 'Keywords'] = resultDF['DESCRIPTION OF THE COMPLAINT'].progress_apply(extract_lemmatized_nouns)

    # Select and rename the columns as needed
    resultDF = resultDF[['Complain ID', 'VEHICLE MODEL', 'Keywords']]

    return resultDF

In [75]:

print(nlp_1.vocab['*jb'].is_stop)  # Should print True
print(nlp_1.vocab['*ak'].is_stop)  # Should also print True

True
True


In [76]:
# Step 1: Filter the DataFrame to select rows where 'VEHICLE MAKE' is 'FORD'
filtered_df = select_df3[select_df3['VEHICLE MAKE'] == 'FORD']

# Step 2: Ensure filtered_df contains valid data
if filtered_df.empty:
    print("No data found for 'FORD' vehicles.")
else:
    # Step 3: Print a few rows of filtered_df to inspect the data
    print(filtered_df.head())

    # Step 4: Call createNounDF() function to process filtered_df
    noun_df = createNounDF(filtered_df)

    # Display the resulting DataFrame
    print(noun_df.head())

    Complain ID VEHICLE MAKE VEHICLE MODEL  \
7        514513         FORD       CONTOUR   
9        514518         FORD        TAURUS   
16       514525         FORD       CONTOUR   
22       514532         FORD         FOCUS   
29       514539         FORD      WINDSTAR   

                         DESCRIPTION OF THE COMPLAINT  \
7   DASHBOARD SEPARATED FROM  STRUCTURE AND ROSE U...   
9   WHILE BACKING OUT OF THE DRIVEWAY CONSUMER HEA...   
16  CONSUMER COMPLAINED ABOUT  DASH BOARD PROBLEM....   
22  WHILE PLACING THE KEY IN THE IGNITION SWITCH V...   
29  WHILE DRIVING 65 MPH, THE OVER DRIVE LIGHT APP...   

                     SPECIFIC COMPONENT'S DESCRIPTION  
7                                           STRUCTURE  
9               SUSPENSION:FRONT:SPRINGS:COIL SPRINGS  
16                                          STRUCTURE  
22                  ELECTRICAL SYSTEM:IGNITION:SWITCH  
29  POWER TRAIN:AUTOMATIC TRANSMISSION:TORQUE CONV...  


  0%|          | 0/34364 [00:00<?, ?it/s]

    Complain ID VEHICLE MODEL  \
7        514513       CONTOUR   
9        514518        TAURUS   
16       514525       CONTOUR   
22       514532         FOCUS   
29       514539      WINDSTAR   

                                             Keywords  
7   [dashboard, structure, inch, front, windshield...  
9   [driveway, noise, coil, spring, tire, spring, ...  
16  [dash, board, dashboard, obstruction, dashboar...  
22  [key, ignition, switch, inspection, mechanic, ...  
29  [mph, light, dashboard, smoke, exhaust, pipe, ...  


In [77]:
noun_df.rename(columns={
    'Keywords': 'Noun'
}, inplace=True)

In [78]:
noun_df.head()

Unnamed: 0,Complain ID,VEHICLE MODEL,Noun
7,514513,CONTOUR,"[dashboard, structure, inch, front, windshield..."
9,514518,TAURUS,"[driveway, noise, coil, spring, tire, spring, ..."
16,514525,CONTOUR,"[dash, board, dashboard, obstruction, dashboar..."
22,514532,FOCUS,"[key, ignition, switch, inspection, mechanic, ..."
29,514539,WINDSTAR,"[mph, light, dashboard, smoke, exhaust, pipe, ..."


In [83]:
# Step 1: Filter the DataFrame to select rows where 'VEHICLE MAKE' is 'FORD'
filtered_df_1 = selected_df[selected_df['VEHICLE MAKE'] == 'FORD']

# Step 2: Ensure filtered_df_1 contains valid data
if filtered_df_1.empty:
    print("No data found for 'FORD' vehicles.")
else:
    # Step 3: Print a few rows of filtered_df_1 to inspect the data
    print(filtered_df_1.head())

    # Step 4: Call createNounPhraseDF() function to process filtered_df_1
    noun_phrases_df = createNounPhraseDF(filtered_df_1)

    # Display the resulting DataFrame
    print(noun_phrases_df.head())

No data found for 'FORD' vehicles.


In [81]:
noun_phrases_df.rename(columns={
    'Keywords': 'Noun Phrases'
}, inplace=True)

In [82]:
noun_phrases_df.head()

Unnamed: 0,Complain ID,VEHICLE MODEL,Noun Phrases
