<p align="center" width="100%">
    <img width="40%" src="customer_support_icon.JPG"> 
</p>

A retail company wants to improve its customer service using speech recognition and natural language processing (NLP). As the machine learning engineer on this initiative, you are tasked with building functionality that converts customer support audio calls into text and extracts insights from the transcribed text.

This project uses four open-source packages: `SpeechRecognition`, `Pydub`, `NLTK`, and `spaCy`. Your objectives are:
  - Transcribe a sample customer audio call, stored at `sample_customer_call.wav`, to demonstrate open-source speech recognition.
  - Analyze sentiment, identify the most common named entity, and find the customer call most similar to a given query, using a subset of the company's pre-transcribed call data stored at `customer_call_transcriptions.csv`.

Together, these steps show how machine learning can be applied to customer support data.

In [10]:
!pip install SpeechRecognition
!pip install pydub
!pip install spacy
!python3 -m spacy download en_core_web_sm

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 12.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [11]:
# Import required libraries
import pandas as pd

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import speech_recognition as sr
from pydub import AudioSegment

import spacy

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/repl/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
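Before sending audio to a recognizer, it can be useful to confirm the file's header fields. The sketch below reads them with Python's standard-library `wave` module as an alternative to Pydub, assuming the file is an uncompressed PCM WAV. The helper names `wav_properties` and `make_silent_wav` are hypothetical, and the silent clip is generated in memory only so the example runs without `sample_customer_call.wav`.

```python
import io
import wave


def wav_properties(source):
    """Return basic header fields of a PCM WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "frame_rate": rate,
            "channels": wav_file.getnchannels(),
            "sample_width_bytes": wav_file.getsampwidth(),
            "duration_seconds": frames / rate,
        }


def make_silent_wav(frame_rate=44100, channels=1, seconds=1):
    """Build a silent 16-bit WAV in memory so the sketch needs no file on disk."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(frame_rate)
        wav_file.writeframes(b"\x00\x00" * frame_rate * channels * seconds)
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


# Demo on the generated clip (an assumption, not the real customer call)
demo_properties = wav_properties(make_silent_wav())
print(demo_properties)
```

Passing a path such as `"sample_customer_call.wav"` to `wav_properties` should report the same frame rate and channel count that `AudioSegment` exposes, provided the file is PCM-encoded.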


In [12]:
# -----------------------------
# 1. IMPORT LIBRARIES
# -----------------------------
import pandas as pd
import speech_recognition as sr
from pydub import AudioSegment

import nltk
nltk.download("vader_lexicon")
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import spacy
nlp = spacy.load("en_core_web_sm")

# -----------------------------
# 2. SPEECH RECOGNITION
# -----------------------------
recognizer = sr.Recognizer()

with sr.AudioFile("sample_customer_call.wav") as source:
    audio_data = recognizer.record(source)

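# Transcribe audio with the Google Web Speech API (requires internet access)
try:
    transcribed_text = recognizer.recognize_google(audio_data)
except sr.UnknownValueError:
    # Raised when the speech could not be understood
    transcribed_text = ""
except sr.RequestError as error:
    # Raised when the API is unreachable or rejects the request
    raise RuntimeError(f"Speech recognition request failed: {error}") from error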

# Extract audio stats
audio_seg = AudioSegment.from_file("sample_customer_call.wav")
frame_rate = audio_seg.frame_rate
number_channels = audio_seg.channels

# -----------------------------
# 3. SENTIMENT ANALYSIS
# -----------------------------
# Load the pre-transcribed calls (columns used below: "text" and "sentiment_label")
df = pd.read_csv("customer_call_transcriptions.csv")

sia = SentimentIntensityAnalyzer()

# Sentiment classifier
def classify_sentiment(text):
    scores = sia.polarity_scores(str(text))
    comp = scores["compound"]
    if comp >= 0.05:
        return "positive"
    elif comp <= -0.05:
        return "negative"
    else:
        return "neutral"

# Apply the classifier to the "text" column
df["sentiment_predicted"] = df["text"].apply(classify_sentiment)

# Keep the predictions in their own variable (required for submission)
predicted = df["sentiment_predicted"]

# Count true positives: calls labeled positive that were also predicted positive
true_positive = df.loc[
    (df["sentiment_label"] == "positive") &
    (df["sentiment_predicted"] == "positive")
].shape[0]

# -----------------------------
# 4. NAMED ENTITY RECOGNITION
# -----------------------------
# Collect the text of every entity found; nlp.pipe processes the texts as a stream
all_entities = [
    ent.text
    for doc in nlp.pipe(df["text"].astype(str))
    for ent in doc.ents
]

most_freq_ent = (
    pd.Series(all_entities).value_counts().idxmax()
    if all_entities else ""
)

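# -----------------------------
# 5. MOST SIMILAR TEXT SEARCH
# -----------------------------
# Note: en_core_web_sm ships without static word vectors, so Doc.similarity
# scores are only rough here and spaCy may warn about their reliability.
# A pipeline with vectors (e.g. en_core_web_md) is generally better suited.
query = "wrong package delivery"
query_doc = nlp(query)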

best_score = -1
most_similar_text = ""

for text in df["text"]:
    doc = nlp(str(text))
    score = query_doc.similarity(doc)
    if score > best_score:
        best_score = score
        most_similar_text = text

# -----------------------------
# PRINT RESULTS
# -----------------------------
print("Transcribed Text:", transcribed_text)
print("Frame Rate:", frame_rate)
print("Channels:", number_channels)

print("\nTrue Positive:", true_positive)
print("Most Frequent Entity:", most_freq_ent)
print("Most Similar Text:", most_similar_text)

print("\nPredicted column preview:")
print(predicted.head())


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/repl/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Transcribed Text: hello I'm experiencing an issue with your product I'd like to speak to someone about a replacement
Frame Rate: 44100
Channels: 1

True Positive: 2
Most Frequent Entity: yesterday
Most Similar Text: wrong package delivered

Predicted column preview:
0    negative
1    positive
2    negative
3     neutral
4     neutral
Name: sentiment_predicted, dtype: object
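
The similarity search above relies on `Doc.similarity`. Small spaCy pipelines such as `en_core_web_sm` do not include static word vectors, so those scores can be unreliable. As a point of comparison, here is a minimal sketch of a different, purely lexical technique: cosine similarity over bag-of-words counts. It is not what the notebook uses, and the function names and example transcripts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(counts_a, counts_b):
    """Cosine between two token-count vectors (0.0 if either is empty)."""
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, texts, top_k=3):
    """Return the top_k (score, text) pairs, highest score first."""
    query_counts = bag_of_words(query)
    scored = [(cosine_similarity(query_counts, bag_of_words(t)), t) for t in texts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Hypothetical transcripts, made up for illustration only
example_texts = [
    "wrong package delivered",
    "my order arrived late",
    "the package was damaged on delivery",
]
ranked = rank_by_similarity("wrong package delivery", example_texts, top_k=2)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

Because this approach only matches exact tokens, "delivery" and "delivered" count as different words. Lemmatizing with spaCy before counting, or switching to a pipeline that ships word vectors, would be natural next steps.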

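The true-positive count reported above covers a single cell of the confusion matrix. The following is a minimal sketch of how every (true, predicted) pair, plus precision and recall for one class, could be tallied with the standard library. The labels are made up for illustration and are not drawn from `customer_call_transcriptions.csv`; the helper names are hypothetical.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def precision_recall(counts, label):
    """Precision and recall for one class, given (true, predicted) pair counts."""
    true_positive = counts[(label, label)]
    predicted_total = sum(n for (_, pred), n in counts.items() if pred == label)
    actual_total = sum(n for (true, _), n in counts.items() if true == label)
    precision = true_positive / predicted_total if predicted_total else 0.0
    recall = true_positive / actual_total if actual_total else 0.0
    return precision, recall


# Hypothetical labels, not taken from the real dataset
example_true = ["positive", "negative", "neutral", "positive", "negative"]
example_pred = ["positive", "negative", "neutral", "neutral", "positive"]

counts = confusion_counts(example_true, example_pred)
precision, recall = precision_recall(counts, "positive")
print("True positives:", counts[("positive", "positive")])
print("Precision:", precision, "Recall:", recall)
```

With the notebook's data, `df["sentiment_label"]` and `df["sentiment_predicted"]` could be passed in place of the example lists.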