# Jupyter notebook TF-IDF model één team.

## What is TF-IDF?

The scores generated by the TF-IDF vectorizer are numerical values that indicate the importance of each word for each complaint compared to other complaints in the dataset.

TF (Term Frequency) represents the frequency of a word in a specific complaint. A higher frequency means the word is more important for that particular complaint.
IDF (Inverse Document Frequency) calculates the extent to which a word appears in the entire corpus of complaints. A higher IDF value indicates that the word is less important across all complaints in the dataset.

The TF-IDF score is the product of the TF and IDF values. Words that appear frequently in a specific complaint but rarely in other complaints have a higher TF-IDF score and are considered more important for that particular complaint. Words that appear frequently in all complaints have a lower TF-IDF score and are considered less important.


## In the code below, all needed imports for the following code are made up here.
    - Pandas (Used for making dataframes)
    - Seaborn (Used for visualizing the data in a graph)
    - Matplotlib.pyplot (Used for generating an graph)
    - Numpy (Used for generating arrays)
    - Sklearn.model_selection (Used for splitting data in train and test data)
    - Sklearn.feature_extraction.text (Used for giving an score to every word in a sentence)
    - Nltk.corpus (Used for checking if word in english language)
    - Sklearn.svm (Used for using the LinearSVC model)
    - Nltk (Used for checking if word in english language)

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import wordnet
from sklearn.svm import LinearSVC
import nltk
nltk.download('words')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\SKIKK\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In the code below, the option to display graphs inline will be set to true.

In [2]:
%config IPCompleter.greedy = True
%matplotlib inline

Defining the new column types for the dataframe.

In [3]:
# Define the data types for each column
dtypes = {
    'Date received': str,
    'Product': "category",
    'Sub-product': "category",
    'Issue': "category",
    'Sub-issue': "category",
    'Consumer complaint narrative': str,
    'Company public response': str,
    'Company': "category",
    'State': "category",
    'ZIP code': str,
    'Tags': "category",
    'Consumer consent provided?': str,
    'Submitted via': "category",
    'Date sent to company': str,
    'Company response to consumer': str,
    'Timely response?': str,
    'Consumer disputed?': str,
    'Complaint ID': int
}

Reading the data and assign it to the correct variable.

In [4]:
# Define the columns to parse as dates
parse_dates = ['Product', 'Date received', 'Date sent to company']

# Read the CSV file with specified data types and parse dates
DS1_data = pd.read_csv("../Data/complaints-2023-04-25_05_07.csv", low_memory=False, dtype=dtypes, parse_dates=parse_dates, nrows=2000)

# Convert 'Timely response?' and 'Consumer disputed?' columns to boolean values
DS1_data[['Timely response?', 'Consumer disputed?']] = DS1_data[['Timely response?', 'Consumer disputed?']].replace({'Yes': True, 'No': False}).astype(bool)

# Convert 'Consumer consent provided?' column to boolean values
DS1_data['Consumer consent provided?'] = DS1_data['Consumer consent provided?'].replace({'Consent provided': True, '': False}).astype(bool)

# Drop rows with missing complaint narratives
DS1_data.dropna(subset=['Consumer complaint narrative'], inplace=True)

# Drop the 'Sub-issue' column as it is not needed
DS1_data.drop(columns=['Sub-issue'], inplace=True)

# Replace alle X occurences with emty strings, to avoid it from being the most important word.
DS1_data['Consumer complaint narrative'] = DS1_data['Consumer complaint narrative'].str.replace('X', '')

# Calculate the normalized count of issue categories
IssueCountNormalized = DS1_data['Issue'].value_counts(normalize=True)

  DS1_data = pd.read_csv("../Data/complaints-2023-04-25_05_07.csv", low_memory=False, dtype=dtypes, parse_dates=parse_dates, nrows=2000)
  DS1_data = pd.read_csv("../Data/complaints-2023-04-25_05_07.csv", low_memory=False, dtype=dtypes, parse_dates=parse_dates, nrows=2000)


To ensure accurate and meaningful tf-idf scores, it is recommended to enable smoothing (smooth_idf=True)
and normalization (norm='l2') while using TfidfVectorizer.
These settings help account for variations in text length and improve the quality of tf-idf scores.
Since these settings are recommended by Python, we have enabled them here.

Then we fit the already existing complaints and save their scores in the 'TF-IDF scores' row

In [6]:
# Create a TfidfVectorizer with optimized settings
vectorizer = TfidfVectorizer(stop_words='english',              # Exclude common English words
                             token_pattern=r'\b[a-zA-Z]+\b',    # Consider only alphabetic tokens
                             analyzer='word',                   # Analyze at the word level
                             use_idf=True,                      # Apply inverse document frequency weighting
                             smooth_idf=True,                   # Apply smoothing to idf weights
                             strip_accents='ascii',
                             min_df=2,
                             norm='l2')

# Fit and transform the vectorizer to get an score per word in an array returned
Vectorized_Data = vectorizer.fit_transform(DS1_data['Consumer complaint narrative'])

# Add the scores array back into the dataframe.
DS1_data['TF-IDF scores'] = list(Vectorized_Data.toarray())

In [15]:
# Defines the dictionary of english words
english_vocab = set(w.lower() for w in nltk.corpus.words.words())

# Gets the list of words from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get unique values from the "Issue" column
unique_issues = DS1_data['Issue'].unique()

# Concatenate the unique issues into a single string
all_issues = ' '.join(unique_issues)

# Split the concatenated string into individual words
all_words = all_issues.split()

# Get unique words
Mortgage_Terms = set(all_words)

# Set an empty top 3 words list.
top_words = []


for i in range(Vectorized_Data.shape[0]):
    # Get the array with scores from this row
    row_scores = Vectorized_Data[i].toarray()[0]

    # Generate a dictionary with each word and then the Scores
    scores_dict = {name: score for name, score in zip(feature_names,row_scores)}

    # Loop over all the words and adjust the score
    for name in feature_names:
        # If term is in categoryname, add to score
        if name in Mortgage_Terms:
            scores_dict[name] *= (1 + 0.02*len(name) + 0.2)
        else:
            scores_dict[name] *= (1 + 0.02*len(name))

    # Create a new array of adjusted scores in the same order as the feature names
    adjusted_scores = np.zeros(len(feature_names))
    for i, term in enumerate(feature_names):
        adjusted_scores[i] = scores_dict[term]

    #Set the top 3 to a empty list.
    Top3Words = []

    # Get the index of the highest scoring word
    max_index = adjusted_scores.argmax()

    # Iterate until 3 top words found or list of words empty
    while max_index != 0 and len(Top3Words) <= 2:
        #Check if there are any vowels in the topword, if not select new word
        while True:
            vowels = {"a", "e", "i", "o", "u", "A", "E", "I", "O", "U"}
            # Get the corresponding word
            top_word = feature_names[max_index]
            if any(char in vowels for char in feature_names[max_index]) and top_word in english_vocab:
                Top3Words.append(top_word)
            break

        #Sets the current score of this word to 0 to select the second most popular word
        adjusted_scores[max_index] = 0
        max_index = adjusted_scores.argmax()
    # Voeg het bijbehorende woord toe aan de lijst van top_words
    top_words.append(Top3Words)

# Add the list of topwords to the dataframe
DS1_data['top_word'] = top_words

Traindata,Testdata = train_test_split(DS1_data,test_size=0.25,random_state = 0)

# Save the generated table to a new csv Dataset
DS1_data.head()
#DS1_data.to_csv('TrainData.csv')

Unnamed: 0,Date received,Product,Sub-product,Issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID,TF-IDF scores,top_word
0,2015-07-23,Mortgage,FHA mortgage,"Loan modification,collection,foreclosure",""" The California Homeowner Bill of Rights beca...",Company chooses not to provide a public response,"CITIBANK, N.A.",CA,95965,,True,Web,2015-07-30,Closed with explanation,True,True,1484039,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[fair, underwriter, transparent]"
1,2021-12-09,Mortgage,FHA mortgage,Closing on a mortgage,"Hello, I am trying to refinance my property an...",Company has responded to the consumer and the ...,"LoanCare, LLC",VA,20136,,True,Web,2021-12-09,Closed with explanation,True,True,4991921,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[year, complicated, killing]"
2,2016-09-29,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",Bank of America has caused to be recorded NOD ...,Company has responded to the consumer and the ...,"BANK OF AMERICA, NATIONAL ASSOCIATION",CA,92114,,True,Web,2016-09-29,Closed with explanation,True,False,2138230,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[america, nod, regular]"
4,2021-12-09,Mortgage,Conventional home mortgage,Struggling to pay mortgage,We have 2 loans w/ Loan Car . We were in a For...,Company has responded to the consumer and the ...,"LoanCare, LLC",IL,60471,Servicemember,True,Web,2021-12-09,Closed with explanation,True,True,4993695,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[payoff, properly, forced]"
5,2016-01-05,Mortgage,Home equity loan or line of credit,"Loan servicing, payments, escrow account",My husband has died and I am filing bankruptcy...,,"CITIBANK, N.A.",VT,5664,Older American,True,Web,2016-01-05,Closed with explanation,True,False,1729048,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[account, lawyer, finance]"


In the following cells, we will perform tf-idf conversion on the string data.


In [8]:
# Function creates an array of every word in text and gives it an score
def tfidf_custom_scoring(input_text):
    # Fit the data to the vectorizer
    vectorizer.fit(DS1_data['Consumer complaint narrative'])

    # Adds the input from the user to the fit using an transform
    transformed_data = vectorizer.transform([input_text])

    # Get all the feature name(words instead of numbers) corresponding to the array
    feature_names = vectorizer.get_feature_names_out()

    # calculate initial scores and store them in a dictionary
    scores_dict = {name: score for name, score in zip(feature_names, transformed_data.toarray()[0])}

    # adjust scores based on term length and update the dictionary
    for name in feature_names:
        if name in Mortgage_Terms:
            scores_dict[name] *= (1 + 0.01*len(name) + 1.5)
        else:
            scores_dict[name] *= (1 + 0.01*len(name))

    # create a new array of adjusted scores in the same order as the feature names
    adjusted_scores = np.zeros(len(feature_names))
    for INDEX, term in enumerate(feature_names):
        adjusted_scores[INDEX] = scores_dict[term]

    # return a tuple of the feature names and adjusted scores
    return feature_names, adjusted_scores

In [38]:
#user_input = input("What question do you want to categorize? ")
def Classifiy_string(user_input):
    # Get an array of words with corresponding scores
    data = tfidf_custom_scoring(user_input)

    # Create a dictionary with feature names as keys and scores as values
    scores_dict = {name: score for name, score in zip(data[0], data[1])}

    # Create a new array of adjusted scores in the same order as the feature names
    adjusted_scores = np.zeros(len(feature_names))
    for i, term in enumerate(feature_names):
        adjusted_scores[i] = scores_dict[term]

    # Set the top 3 words to a list
    Top3Words = []

    def get_top3_words():
        # Get the index of the highest score in the scores array
        index_max = adjusted_scores.argmax()

        while index_max != 0 and len(Top3Words) <= 2:
            # Check if there are any vowels in the top word, if not, select a new word
            while True:
                vowels = {"a", "e", "i", "o", "u", "A", "E", "I", "O", "U"}
                # Get the corresponding word
                top_word = feature_names[index_max]
                if any(char in vowels for char in top_word) and top_word in english_vocab:
                    Top3Words.append(top_word)
                break

            # Get the corresponding word
            top_word = str(data[0][index_max])

            # If the word is in the English vocabulary, add it to the top 3 list
            if top_word in english_vocab:
                Top3Words.append(top_word)

            # Set the current score of this word to 0 to select the second most popular word
            adjusted_scores[index_max] = 0
            index_max = adjusted_scores.argmax()

    get_top3_words()

    # Check all past result classifications
    def check_corresponding_word(relevant_word):
        return DS1_data[(DS1_data["top_word"].str[0] == relevant_word[0]) & DS1_data["top_word"].str[1].isin(relevant_word)]

    filtered_df = check_corresponding_word(Top3Words)

    # Count the occurrences of each issue and get the most common one, as long as there are words left
    value_counts = filtered_df["Issue"].value_counts()
    NormalizedTable = pd.concat([IssueCountNormalized, value_counts], axis=1, keys=('perc', 'valuecount'))
    Endscores = NormalizedTable.valuecount / NormalizedTable.perc
    NormalizedTable["Endscores"] = Endscores
    NormalizedTable["IssueName"] = IssueCountNormalized.keys().tolist()

    Toprow = NormalizedTable.loc[NormalizedTable['Endscores'].idxmax()]
    return Toprow.IssueName

To test the algorithm we made above, we test the classification on 100 rows and give back an percentage of correct classifications.

In [37]:
import math
# Varaible storing if categorising was correct
CorrectPrognosed = 0
WrongPrognosed = 0

for index, row in Testdata.head(100).iterrows():
    QuestionCateogryPrediction = Classifiy_string(row["Consumer complaint narrative"])
    if(QuestionCateogryPrediction == DS1_data.at[index,"Issue"]):
        CorrectPrognosed += 1
    else:
        WrongPrognosed += 1

print("Correct voorspeld: " + str(CorrectPrognosed) + "%")
print("Fout voorspeld: " + str(WrongPrognosed) + "%")

Correct voorspeld: 69%
Fout voorspeld: 31%


In [76]:
print("Your question will be in the following category: "+Classifiy_string(input("What question do you want to categorize? ")))

Your question will be in the following category: Trouble during payment process


In conclusion, the training model that has been developed to extract the top three most important words from a given question and predict the corresponding category has shown promising results. With an accuracy rate of 69%, the model has demonstrated its ability to correctly identify the appropriate category in a considerable number of cases.

This model's performance is notable considering the complexity of the task it undertakes. By focusing on the most significant words, it efficiently captures the essence of the question and utilizes this information to make accurate predictions. While the 69% accuracy rate indicates room for improvement, it is a commendable achievement and serves as a strong foundation for future enhancements.

The successful implementation of this training model opens up various possibilities for practical applications. It can be utilized in various domains where categorization of questions is crucial, such as customer support systems, information retrieval systems, and automated chatbots. The ability to swiftly categorize questions based on a few essential words can enhance user experiences and streamline processes, saving time and resources.

Moving forward, further improvements can be made to enhance the model's accuracy. Exploring advanced natural language processing techniques, such as contextual embeddings or attention mechanisms, could help capture more nuanced relationships and context between words. Additionally, expanding the training dataset and fine-tuning the model parameters can contribute to better generalization and performance.

In conclusion, the developed training model that extracts the top three most important words from a question and predicts the corresponding category has demonstrated promising accuracy, achieving a success rate of 69%. While there is scope for improvement, this model lays a solid foundation for future advancements in question categorization, facilitating more efficient and effective systems across various industries and domains.