# Loading

In [1]:
import autocompleter 
autocompl = autocompleter.Autocompleter()

In [2]:
df = autocompl.import_json("sample_conversations.json")
df.shape, df.columns

load json file...
(22264, 3)


((22264, 3), Index(['IsFromCustomer', 'Text', 'index'], dtype='object'))

The file contains about 22K messages exchanged between customers and a representative (the `index` column identifies the conversation each message belongs to).
For the purpose of this project, we are only interested in completing the representative's messages.

In [3]:
df.head()

| | IsFromCustomer | Text | index |
|---|---|---|---|
| 0 | True | Hi! I placed an order on your website and I ca... | 0 |
| 1 | True | I think I used my email address to log in. | 0 |
| 2 | True | My battery exploded! | 1 |
| 3 | True | It's on fire, it's melting the carpet! | 1 |
| 4 | True | What should I do! | 1 |


# Data Selection and Cleaning

Processing first selects the representative's messages and splits them into sentences on punctuation (the punctuation is kept). The resulting text is cleaned up with some light regex, and only sentences longer than one word are kept.

Because the representative tends to ask the same questions over and over again, suggesting a complete sentence is where autocomplete is most useful. We therefore count the number of occurrences of each sentence, so the count can be used as a feature later on, and then delete the duplicates.
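
The steps above can be sketched with the standard library alone. This is only an illustration under assumptions: the internals of `process_data` are not shown in this notebook, and the helper names (`split_keep_punctuation`, `clean`, `prepare_sentences`) and the sample messages are hypothetical.

```python
import re
from collections import Counter


def split_keep_punctuation(text):
    """Split text into sentences after '.', '!' or '?', keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def clean(sentence):
    """Light cleanup (assumed): collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", sentence).strip()


def prepare_sentences(messages):
    """Return {sentence: count} for representative sentences longer than one word."""
    counts = Counter()
    for is_from_customer, text in messages:
        if is_from_customer:
            continue  # keep the representative's messages only
        for sentence in split_keep_punctuation(text):
            sentence = clean(sentence)
            if len(sentence.split()) > 1:  # drop one-word sentences
                counts[sentence] += 1
    return dict(counts)


# Hypothetical sample messages as (IsFromCustomer, Text) pairs.
messages = [
    (True, "My battery exploded!"),
    (False, "Hello! What is your zip code?"),
    (False, "What is your   zip code? Thanks."),
]
sentence_counts = prepare_sentences(messages)
print(sentence_counts)
```

Storing the sentences as dictionary keys removes duplicates and keeps the occurrence count in one pass, which plays the role of the `Counts` column here.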

In [4]:
new_df = autocompl.process_data(df)
print(new_df.shape)
print(new_df.columns)

select representative threads...
split sentenses on punctuation...
Text Cleaning using simple regex...
calculate nb words of sentenses...
count occurence of sentenses...
remove duplicates (keep last)...
(8560, 5)
(8560, 5)
Index(['IsFromCustomer', 'Text', 'index', 'nb_words', 'Counts'], dtype='object')


# Model and TFIDF matrix

A TF-IDF matrix is computed from the frequency of all the words in the data using scikit-learn's `TfidfVectorizer`. It is used later to measure the similarity between sentences.
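
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF vectors and cosine similarity. It is not the code behind `calc_matrice`, whose settings are not shown here. The smoothed IDF and L2 normalisation follow scikit-learn's documented defaults, while the tokenisation, the function names (`tfidf_vectors`, `cosine`) and the sample sentences are simplifying assumptions.

```python
import math
import re
from collections import Counter


def tfidf_vectors(sentences):
    """Return (vocabulary, one L2-normalised TF-IDF vector per sentence)."""
    # Simplified tokenisation (assumption): lowercase runs of word characters.
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of sentences containing each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed IDF: rare words get a larger weight than common ones.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        row = [tf[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        vectors.append([x / norm for x in row])
    return vocabulary, vectors


def cosine(u, v):
    """Cosine similarity; the vectors are unit length, so a dot product suffices."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical sample sentences.
sentences = ["What is your zip code?", "What is your username?", "Let me investigate"]
vocabulary, vectors = tfidf_vectors(sentences)
print(len(vectors), len(vocabulary))
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))
```

Sentences sharing words ("What is your ...") score higher against each other than against a sentence with no words in common, which scores zero.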

In [5]:
try:
    model_tf, tfidf_matrice = autocompl.calc_matrice(new_df)
    print("TF-IDF matrix and model successfully created.")
except AttributeError as e:
    print(f"Error in calc_matrice: {e}")
    raise

# No need to call calc_matrice again here
print("TF-IDF matrix and vectorizer successfully created.")
print(new_df.columns)
# Ensure that the TF-IDF matrix is in the correct format
print(type(tfidf_matrice))

tfidf_matrice  (8560, 252)
TF-IDF matrix and model successfully created.
TF-IDF matrix and vectorizer successfully created.
Index(['IsFromCustomer', 'Text', 'index', 'nb_words', 'Counts'], dtype='object')
<class 'scipy.sparse._csr.csr_matrix'>


In [13]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = tfidf_matrice
y = new_df['IsFromCustomer']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest classifier
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
rf_y_pred = rf_model.predict(X_test)

# Output the classification report
print("Random Forest Classifier:")
print(classification_report(y_test, rf_y_pred))

# Output the accuracy
rf_accuracy = accuracy_score(y_test, rf_y_pred)
print(f'Accuracy: {rf_accuracy}')


cross_val_scores = cross_val_score(rf_model, X, y, cv=5)
print("Cross-Validation Scores:", cross_val_scores)
print("Mean CV Score:", cross_val_scores.mean())
print(new_df['IsFromCustomer'].value_counts())


Random Forest Classifier:
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      1712

    accuracy                           1.00      1712
   macro avg       1.00      1.00      1.00      1712
weighted avg       1.00      1.00      1.00      1712

Accuracy: 1.0
Cross-Validation Scores: [1. 1. 1. 1. 1.]
Mean CV Score: 1.0
IsFromCustomer
False    8560
Name: count, dtype: int64

Note that `new_df` only contains representative sentences, so `IsFromCustomer` has a single class (`False`, 8560 rows). The perfect scores above are therefore trivial: the classifier has nothing to distinguish, and the scores say nothing about the quality of the TF-IDF features.


# Ranking Function

Finally, the autocompleter calculates the similarity between the sentences in the data and the prefix typed by the representative. As a weighting feature, we chose to reorder the most similar sentences by how frequently they occur in the data.
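
One way to read this two-step ranking is sketched below. How `generate_completions` actually combines similarity and frequency is not shown in this notebook, so the function name `rank_completions`, its parameters (`top_k`, `pool_size`) and the similarity and count values are all hypothetical.

```python
def rank_completions(candidates, top_k=3, pool_size=10):
    """Rank (sentence, similarity, count) tuples: shortlist by similarity, reorder by count."""
    # 1. Keep candidates with non-zero similarity, most similar first.
    pool = sorted(
        (c for c in candidates if c[1] > 0),
        key=lambda c: c[1],
        reverse=True,
    )[:pool_size]
    # 2. Reorder the shortlist by occurrence count. The sort is stable,
    #    so ties in count keep their similarity order.
    pool.sort(key=lambda c: c[2], reverse=True)
    return [sentence for sentence, _, _ in pool[:top_k]]


# Hypothetical similarities and counts for the prefix "What is your".
candidates = [
    ("What is your address?", 0.80, 12),
    ("What is your zip code?", 0.78, 40),
    ("What is your username?", 0.75, 25),
    ("Let me investigate", 0.0, 90),
]
completions = rank_completions(candidates)
print(completions)
```

Shortlisting by similarity first keeps a very frequent but unrelated sentence (here "Let me investigate") from outranking relevant completions.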

Examples of autocompletions:

In [7]:
prefix = 'What is your'

print(prefix,"    \n ")

autocompl.generate_completions(prefix, new_df, model_tf,tfidf_matrice)

What is your     
 


['What is your zip code?', 'What is your username?', 'What is your address?']

In [8]:
prefix = 'How can'
print(prefix,"     ")
autocompl.generate_completions(prefix, new_df, model_tf,tfidf_matrice)

How can      


['Yes I can', 'I can imagine', 'I can include a dust cleaner']

In [9]:
prefix = 'Let me'
print(prefix,"     ")
autocompl.generate_completions(prefix, new_df, model_tf,tfidf_matrice)

Let me      


['Let me research this',
 'Let me investigate',
 'Let me investigate this quickly']

In [10]:
prefix = 'when was'
print(prefix,"     ")
autocompl.generate_completions(prefix, new_df, model_tf,tfidf_matrice)

when was      


['When was the last time you changed your password?',
 'Please use code 20% when ordering',
 'Use the code 20% when ordering']

Now the same query without any uppercase letters and with only the important words:

In [11]:
prefix = 'when time password'
print(prefix,"     ")
autocompl.generate_completions(prefix, new_df, model_tf,tfidf_matrice)

when time password      


['When was the last time you changed your password?',
 'When was the last time you changed your wi-fi password?',
 'When you select you password?']

# Online Sources for this project

In [12]:
# https://gist.github.com/jlln/338b4b0b55bd6984f883 modified to keep punctuation
# kaggle google store competition for json read
# https://www.kaggle.com/hamishdickson/weighted-word-autocomplete-using-star-wars-dataset