# SEAMLESS COMPLAINTS RESOLUTION

In [None]:
from IPython.display import Image
Image(filename='blob.jpeg', width=1000, height=1000)

In the fast-paced and interconnected world of modern business, customer feedback plays a crucial role in shaping the success of any organization. Companies across various industries receive a multitude of complaints from their customers daily. Effectively handling these complaints and routing them to the appropriate departments for timely resolution is essential for maintaining customer satisfaction and streamlining internal operations.

To address this challenge, I am building a Natural Language Processing (NLP) model. Its primary objective is to read the natural-language text of a customer complaint and automatically route it to the right department or team for efficient and tailored resolution.

# IMPORTING LIBRARIES

In [None]:
#Importing libraries for data cleaning, processing and visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
import plotly.express  as px
import time
from wordcloud import WordCloud
from collections import Counter

#Libraries for Natural Language processing
import string
import urllib
import nltk
from nltk import TreebankWordTokenizer, SnowballStemmer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

#Libraries for data engineering and modeling
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample
from sklearn.svm import SVC
from sklearn.feature_selection import VarianceThreshold
Random_state=42

# READING THE DATA

In [None]:
df=pd.read_csv('complaints_processed.csv')
df.head()

**Observation**

- The dataframe has three columns, namely Unnamed: 0, product and narrative. It is also important to note that the Unnamed: 0 column holds data that will not be needed for the project.
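
As a minimal sketch of how the unneeded column could be removed, the example below drops it from a small made-up frame. The toy values are illustrative only, and it assumes the column is a leftover row index written out when the CSV was saved.

```python
import pandas as pd

# Toy frame mimicking the three columns described above (values are made up).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "product": ["credit_card", "retail_banking", "debt_collection"],
    "narrative": ["late fee charged twice", "check deposit missing", "collector keeps calling"],
})

# Drop the leftover column; errors="ignore" keeps the call safe if it is already gone.
cleaned = toy.drop(columns=["Unnamed: 0"], errors="ignore")
print(cleaned.columns.tolist())
```

An alternative, under the same assumption, is to pass `index_col=0` to `pd.read_csv` so the column is read as the index in the first place.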

# EXPLORATORY DATA ANALYSIS

**More information on the data**

In [None]:
print('Checking number of observations and datatype')
print()
print(df.info())
print()
print('Checking for nulls')
print()
print(df.isnull().sum())

**Observation:**
    
- The data consists of 162421 entries, and the narrative column has 10 missing observations.
- The product and narrative columns have the object datatype, and the third column (Unnamed: 0) holds integers.
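
One possible way to deal with the missing narratives is to drop those rows before any text processing. The sketch below uses a small made-up frame; whether dropping is preferable to another strategy is a judgment call for the project.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing narrative.
toy = pd.DataFrame({
    "product": ["credit_card", "retail_banking", "debt_collection"],
    "narrative": ["late fee charged twice", np.nan, "collector keeps calling"],
})

# Count the missing narratives, then keep only rows that have text.
missing_count = int(toy["narrative"].isnull().sum())
non_missing = toy.dropna(subset=["narrative"]).reset_index(drop=True)
print(missing_count, len(non_missing))
```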

**Pie chart showing the distribution of consumer complaints**

In [None]:
# Get the value counts for the 'product' column
product_counts = df['product'].value_counts()

# Set custom shades of blue
colors = ['#1f77b4', '#3581b8', '#4d8fbf', '#63a8c6', '#79b0cd']

# Plotting the pie chart
plt.figure(figsize=(8, 8))  # Set the figure size (optional)
wedges, texts, autotexts = plt.pie(product_counts.values, labels=product_counts.index, autopct='%1.1f%%', startangle=140, colors=colors)

# Styling the percentage text drawn inside each slice
for autotext in autotexts:
    autotext.set_color('white')  # Setting text color to white
    autotext.set_fontsize(12)  # Setting text font size
    autotext.set_fontweight('bold')  # Setting text font weight

# Adding the count as legend on the left side
legend_labels = [f'{product}: {count}' for product, count in zip(product_counts.index, product_counts.values)]
plt.legend(wedges, legend_labels, title='Product', loc='center left', bbox_to_anchor=(-0.4, 0.5))

# Title
plt.title('Consumer Complaints Distribution')

# Displaying the plot
plt.show()

**Observation:**

- Consumer complaints are directed to five departments, namely credit reporting, debt collection, mortgages & loans, credit cards and retail banking.
- Most of the consumer complaints are directed to the credit reporting department, which accounts for more than 50% of them. This class imbalance will be handled during data engineering to ensure the robustness of the model.
- Credit reporting is the department that gathers and maintains information about individuals' credit activities and creditworthiness.
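
The imbalance noted above can also be expressed as numbers rather than read off the chart. The following sketch uses made-up labels, and `imbalance_ratio` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Made-up labels, used only to illustrate the calculation.
labels = pd.Series(
    ["credit_reporting"] * 6
    + ["debt_collection"] * 2
    + ["credit_card"] * 1
    + ["retail_banking"] * 1,
    name="product",
)

counts = labels.value_counts()
shares = labels.value_counts(normalize=True)  # fraction of complaints per class

# Ratio between the largest and smallest class; 1.0 would mean perfectly balanced.
imbalance_ratio = counts.max() / counts.min()
print(shares.round(2).to_dict())
print(imbalance_ratio)
```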

**Histograms showing the distribution of word counts for each department's consumer complaints.**

In [None]:
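#Dropping rows with a missing narrative; otherwise astype(str) can turn NaN into the literal text 'nan'
df = df.dropna(subset=['narrative']).reset_index(drop=True)

#Changing the narrative column to string
df['narrative']=df['narrative'].astype(str)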

#Creating a number of Words feature
df['num_words']=df['narrative'].apply(lambda x:len(nltk.word_tokenize(x)))

# Creating a 3x2 grid of subplots
fig, axes = plt.subplots(3, 2, figsize=(10, 12))

# Plotting histogram for 'credit_reporting'
sns.histplot(df[df['product'] == 'credit_reporting']['num_words'], ax=axes[0, 0], color='blue')
axes[0, 0].set_title('Department: Credit Reporting')

# Plotting histogram for 'debt_collection'
sns.histplot(df[df['product'] == 'debt_collection']['num_words'], ax=axes[0, 1], color='red')
axes[0, 1].set_title('Department: Debt Collection')

# Plotting histogram for  'mortgages_and_loans'
sns.histplot(df[df['product'] == 'mortgages_and_loans']['num_words'], ax=axes[1, 0], color='green')
axes[1, 0].set_title('Department: Mortgages and Loans')

# Plotting histogram for 'credit_card'
sns.histplot(df[df['product'] == 'credit_card']['num_words'], ax=axes[1, 1], color='purple')
axes[1, 1].set_title('Department: Credit Card')

# Plotting histogram for 'retail_banking'
sns.histplot(df[df['product'] == 'retail_banking']['num_words'], ax=axes[2, 0], color='purple')
axes[2, 0].set_title('Department: Retail Banking')

# An empty subplot for spacing
axes[2, 1].axis('off')

# Adjusting the spacing between subplots
plt.tight_layout()

# Showing the subplots
plt.show()


**Observation:**

- The histograms above show that the word counts of the consumer complaints are right-skewed for every department. That is, most of the consumer complaints have low word counts.
- It is important to note that there are outliers; that is, some consumers go into great depth when writing a complaint.
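
A rough way to back these two observations with numbers is to compute the skewness of the word counts and flag unusually long complaints with Tukey's 1.5 × IQR rule. The word counts below are hypothetical, and the IQR rule is only one common convention for outliers, not something the notebook prescribes.

```python
import pandas as pd

# Hypothetical word counts: mostly short complaints plus one very long one.
num_words = pd.Series([20, 25, 30, 35, 40, 45, 50, 60, 80, 900])

# Positive skewness indicates a long right tail.
skewness = num_words.skew()

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as outliers.
q1, q3 = num_words.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = num_words[num_words > upper_fence]
print(round(skewness, 2), upper_fence, outliers.tolist())
```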

**Removing stopwords**

In [None]:
#Set of English stopwords, built once so it is not recreated for every row
STOP_WORDS = set(stopwords.words('english'))

#A function for removing stopwords
def remove_stopwords(text):
    """
    Remove stopwords from the given text.

    Args:
        text (str): The input text from which stopwords are to be removed.

    Returns:
        str: The text with stopwords removed.
    """
    tokens = text.split()                         # Splitting the text into individual words
    filtered_tokens = [word for word in tokens if word.lower() not in STOP_WORDS]  # Filtering out stopwords
    return ' '.join(filtered_tokens)              # Joining the filtered tokens back into a text

#Removing stop words from the narratives
df['updated_message'] = df['narrative'].apply(remove_stopwords)
#Viewing the changes on the dataframe
df.head()

**Removing punctuations**


In [None]:
#A function for removing punctuation marks
def remove_punctuation(post):
    """
    Remove punctuation marks from the given post.

    Args:
        post (str): The input post from which punctuation marks are to be removed.

    Returns:
        str: The post with punctuation marks removed.
    """
    return ''.join([l for l in post if l not in string.punctuation])

# Removing punctuation from the messages and converting them to lower case
df['updated_message'] = df['updated_message'].apply(remove_punctuation).str.lower()
#Viewing the changes made on the dataframe
df.head()
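
The two cleaning steps above run stopword removal first and punctuation removal second, so a token such as "the," may keep its comma during the stopword check and slip through. The sketch below shows one alternative ordering (lowercase, strip punctuation, then filter). `clean_text`, `TOY_STOPWORDS` and `PUNCT_TABLE` are illustrative names, and the tiny stopword set stands in for NLTK's English list.

```python
import string

# Tiny stand-in stopword list; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {"the", "a", "my", "is", "on", "and", "was"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text):
    """Lowercase, strip punctuation, then drop stopwords."""
    text = text.lower().translate(PUNCT_TABLE)
    return " ".join(word for word in text.split() if word not in TOY_STOPWORDS)

print(clean_text("The fee, on my card, was WRONG!"))
```

Using `str.translate` removes all punctuation in a single pass, which tends to be faster than a character-by-character list comprehension on long texts.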

**Masking each department**

In [None]:
#Credit Reporting
credit = df.loc[df['product'] == 'credit_reporting', 'updated_message']
#Debt Collection
debt = df.loc[df['product'] == 'debt_collection', 'updated_message']
#Mortgages and loans
mortgages = df.loc[df['product'] == 'mortgages_and_loans', 'updated_message']
#Credit cards
cards = df.loc[df['product'] == 'credit_card', 'updated_message']
#Retail banking
retail = df.loc[df['product'] == 'retail_banking', 'updated_message']

**Word Clouds for the five departments**

In [None]:
# Defining the categories and their respective texts
categories = ['credit reporting', 'debt collection', 'Mortgages and loans', 'credit cards', 'retail banking']
texts = [credit, debt, mortgages, cards, retail]

# Creating subplots
fig, axs = plt.subplots(3, 2, figsize=(12, 9))
fig.subplots_adjust(hspace=0)  # Adjusting the hspace parameter to remove vertical spacing

# Generating word clouds for each category and plot them
for i, ax in enumerate(axs.flat):
    if i < len(categories):  # Ensure we only use the first five categories
        # Calculating word distribution
        text = ' '.join(texts[i])
        words = text.split()
        print(f"Category: {categories[i]}, Number of Words: {len(words)}")  # Reporting the number of words per category

        # Generating word cloud
        wordcloud = WordCloud(max_words=20)
        wordcloud.generate_from_frequencies(Counter(words))

        # Plotting the word cloud
        ax.imshow(wordcloud, interpolation='bilinear')
        ax.axis('off')
        ax.set_title(categories[i])
    else:
        ax.axis('off')

# Showing the subplots
plt.show()

**Observation:**
    
- Credit Reporting Complaints:
The most common words in complaints related to credit reporting are "credit," "account," "reporting," "report," and "information." This suggests that consumers are often expressing concerns or frustrations related to their credit reports, accounts, and the accuracy or handling of credit-related information.

- Debt Collection Complaints:
The most common words in complaints related to debt collection are "debt," "credit," "account," "collection," and "report." This indicates that consumers are frequently expressing issues with the collection of debts, interactions with debt collectors, and how these activities might impact their credit report.

- Mortgage and Loan Complaints:
The most common words in complaints related to mortgages and loans are "loan," "payment," "mortgage," "credit," and "account." This suggests that consumers often have complaints concerning loan payments, mortgage-related issues, and credit aspects associated with loans.

- Credit Card Complaints:
The most common words in complaints related to credit cards are "credit," "card," "account," "payment," and "charge." This indicates that consumers commonly express concerns regarding credit card transactions, account management, and charges they may have incurred.

- Retail Banking Complaints:
The most common words in complaints related to retail banking are "account," "bank," "money," "fund," and "check." This suggests that consumers frequently express issues with their bank accounts, money transactions, funds management, and perhaps problems with check-related services.
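
The word rankings read off the clouds can also be listed explicitly with `Counter.most_common`. This is a small sketch on made-up messages, not the notebook's data.

```python
from collections import Counter

# Made-up cleaned complaints for one department.
toy_messages = [
    "account bank money",
    "bank check deposit",
    "account bank fund",
]

# Count every word across the messages and keep the two most frequent.
word_counts = Counter(" ".join(toy_messages).split())
top_words = word_counts.most_common(2)
print(top_words)
```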

# DATA ENGINEERING

**Data Resampling**

In [None]:
#Credit Reporting
credit = df[df['product'] == 'credit_reporting']
#Debt Collection
debt = df[df['product'] == 'debt_collection']
#Mortgages and loans
mortgages = df[df['product'] == 'mortgages_and_loans']
#Credit cards
cards = df[df['product'] == 'credit_card']
#Retail banking
retail = df[df['product'] == 'retail_banking']



# Downsampling every department to 13000 complaints, without replacement
balanced_parts = [
    resample(part, replace=False, n_samples=13000, random_state=42)
    for part in (credit, debt, mortgages, cards, retail)
]

df = pd.concat(balanced_parts)
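
The same kind of downsampling can be sketched with pandas alone by sampling an equal number of rows from each group. The toy frame and the choice of the smallest class size as the target are assumptions for illustration; the notebook itself uses a fixed 13000 per class.

```python
import pandas as pd

# Made-up, imbalanced data.
toy = pd.DataFrame({
    "product": ["credit_reporting"] * 6 + ["debt_collection"] * 3 + ["credit_card"] * 2,
    "narrative": [f"complaint {i}" for i in range(11)],
})

# Draw the same number of rows from every class, without replacement.
n_per_class = int(toy["product"].value_counts().min())
balanced = toy.groupby("product").sample(n=n_per_class, random_state=42)
print(balanced["product"].value_counts().to_dict())
```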

**Lemmatization**

In [None]:
def lemmatize_text(text):
    # Initializing the WordNet Lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Tokenizing the text into individual words
    words = nltk.word_tokenize(text)
    
    # Lemmatize each word and join them back into a sentence
    lemmatized_text = ' '.join(lemmatizer.lemmatize(word) for word in words)
    
    return lemmatized_text

#applying the function on the updated text
df['lemmatized_message'] = df['updated_message'].apply(lemmatize_text)

**Vectorization**

In [None]:
tf = TfidfVectorizer(max_features=1000)
X = tf.fit_transform(df['lemmatized_message']).toarray()

In [None]:
tf.get_feature_names_out()[:20]
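
The vectorizer above is fitted on the full dataset before it is split. A common alternative, sketched below on toy documents, is to fit on the training texts only and reuse the learned vocabulary for the test texts, so that no information from the test set reaches the features. The document lists and variable names here are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training and test documents.
train_docs = ["late fee on credit card", "mortgage payment missing", "credit report error"]
test_docs = ["credit card payment dispute"]

vectorizer = TfidfVectorizer(max_features=1000)
X_train_toy = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_test_toy = vectorizer.transform(test_docs)        # reuses them; unseen words are ignored

print(X_train_toy.shape, X_test_toy.shape)
print("dispute" in vectorizer.vocabulary_)
```

Both results are sparse matrices. Many scikit-learn estimators accept sparse input directly, which can save memory compared with calling `.toarray()`.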

**Defining the response variable**

In [None]:
# Encoding the department names in the 'product' column as integer class labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['product'])  # 1-D array, the shape scikit-learn classifiers expect
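
Because the models will predict integer codes, routing a complaint needs the mapping back to department names. A minimal sketch with a made-up label list shows how `LabelEncoder` assigns codes and how `inverse_transform` reverses them.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up department labels.
departments = ["retail_banking", "credit_card", "debt_collection", "credit_card"]

encoder = LabelEncoder()
codes = encoder.fit_transform(departments)

# Classes are sorted alphabetically, so the codes follow that order.
print(encoder.classes_.tolist())
print(codes.tolist())

# Map integer predictions back to department names for routing.
decoded = encoder.inverse_transform([2, 0]).tolist()
print(decoded)
```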

# DATA MODELING

**Splitting the data**

In [None]:
#Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
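
If the class proportions should be identical in the training and test sets, `train_test_split` accepts a `stratify` argument. The sketch below uses small synthetic arrays; whether stratification is needed for the resampled data is left as an open choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and two equally sized classes.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# stratify=y_toy keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(np.bincount(y_tr), np.bincount(y_te))
```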

**Building the models**

In [None]:
#Logistic Regression
LR=LogisticRegression(max_iter=1000)
#Random Forest Classifier (random_state fixed for reproducible results)
RFC=RandomForestClassifier(random_state=42)
#Gradient Boosting Classifier
GBC=GradientBoostingClassifier(random_state=42)
#Decision Tree classifier
DTC=DecisionTreeClassifier(random_state=42)

**Training the models**

In [None]:
#Logistic Regression
LR.fit(X_train, y_train)
#Random Forest Classifier
RFC.fit(X_train, y_train)
#Gradient Boosting classifier
GBC.fit(X_train, y_train)
#Decision Tree Classifier
DTC.fit(X_train, y_train)

**Predicting on the test and training data**

In [None]:
#Logistic Regression
LR_train=LR.predict(X_train)
LR_test=LR.predict(X_test)
#Random Forest Classifier
RFC_train=RFC.predict(X_train)
RFC_test=RFC.predict(X_test)
#Gradient Boosting classifier
GBC_train=GBC.predict(X_train)
GBC_test=GBC.predict(X_test)
#Decision Tree Classifier
DTC_train=DTC.predict(X_train)
DTC_test=DTC.predict(X_test)
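
Train and test predictions like the ones above can be scored with the metrics already imported. The following is only a sketch on synthetic data from `make_classification`, with two of the four model types to keep it quick. The `scores` dictionary and the choice of accuracy and macro F1 are illustrative assumptions, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the TF-IDF features and encoded labels.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Fit each model and record train accuracy, test accuracy and test macro F1.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    test_pred = model.predict(X_te)
    scores[name] = {
        "train_accuracy": accuracy_score(y_tr, model.predict(X_tr)),
        "test_accuracy": accuracy_score(y_te, test_pred),
        "test_macro_f1": f1_score(y_te, test_pred, average="macro"),
    }

for name, row in scores.items():
    print(name, {metric: round(value, 3) for metric, value in row.items()})
```

A large gap between train and test accuracy for a model would suggest overfitting, which is one reason to report both.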

# **Comparing Model performance**

# **Explaining the Model**

**WORK IN PROGRESS....**