# Given A path to a csv file with a single "text" column, this demo will:
- Identify Questions in the set
- Flag unanswered Questions
- Print the number of questions identified
- Print the number of questions answered
- Print the index and content of unanswered questions as "needs follow up"

# Setup

In [57]:
from nltk import word_tokenize
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()


# Preprocessing import for tokenizer
def tokenize_and_stem(text):
    tokens = word_tokenize(text)
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

In [58]:
# Load Pretrained Models, Tokenizers, and PDF
import joblib
import pickle
import q_or_not_q_classification.q_or_not_q_api as q_or_not_q_api

q_or_a_model = joblib.load('../models/q_or_no.pkl')
q_or_a_vect = joblib.load('../models/q_or_a_vectorizer.pkl')
pair_model = joblib.load('../models/pair_model.pkl')
pair_vectorizer = joblib.load('../models/pair_vectorizer.pkl')
with open('../models/pdf.pickle', 'rb') as f:
    pdf = pickle.load(f)

In [59]:
# Load test data, and drop all columns except text for testing, add r_index column
import pandas as pd
df = pd.read_csv('../question_flagging/test_conversation.csv')
test_df = pd.DataFrame(df['text'], columns= ['text'])
test_df.reset_index(inplace=True)
test_df.columns= ['r_index', 'text']
df.fillna(0, inplace= True)

# Identify Questions

In [60]:
# Vectorize text column using q_or_a
q_id_in_vec = df['text'].astype(str)
vectorized_input = q_or_a_vect.transform(q_id_in_vec)

In [61]:
# Make Question Predictions, and add to testDF
preds = q_or_a_model.predict(vectorized_input)
test_df['question'] = preds

# Now run Answered Model

In [62]:
# Now run Answered Model
tf_idf_matrix = pair_vectorizer.transform(df['text'].astype(str))

In [63]:
# Create cosine similarity matrix
from sklearn.metrics.pairwise import cosine_similarity
sim_matrix = cosine_similarity(tf_idf_matrix, tf_idf_matrix)

In [64]:
# Calculates cosign similarity of a given row and its offset
def calculate_cosign_similarity(index_a, offset):
    if index_a + offset >= len(test_df):
        return 0
    return sim_matrix[index_a, index_a + offset]

In [65]:
# Make prediction on row based on pdf and 4 probability values
def pdf_prediction(num1, num2, num3, num4):
    n = sum(pdf.values())
    pdf_1 = num1 * (pdf[1]/n)
    pdf_2 = num2 * (pdf[2]/n)
    pdf_3 = num3 * (pdf[3]/n)
    pdf_4 = num4 * (pdf[4]/n)
    ev = pdf_1 + pdf_2 + pdf_3 + pdf_4
    if ev >= 0.5:
        return 1
    return 0

In [66]:
# Create Similarity Columns: 
test_df['pair_1_sim'] = test_df['r_index'].apply(lambda x: calculate_cosign_similarity(int(x), 1))
test_df['pair_2_sim'] = test_df['r_index'].apply(lambda x: calculate_cosign_similarity(int(x), 2))
test_df['pair_3_sim'] = test_df['r_index'].apply(lambda x: calculate_cosign_similarity(int(x), 3))
test_df['pair_4_sim'] = test_df['r_index'].apply(lambda x: calculate_cosign_similarity(int(x), 4))

In [67]:
# Filter df to only predicted questions
questions_only = test_df[test_df['question'] == 1]

In [68]:
# Isolate and rename similarity columns for model compatibility
offset_1 = questions_only[['question', 'pair_1_sim']]
offset_1.columns = ['question', 'similarity']
offset_2 = questions_only[['question', 'pair_2_sim']]
offset_2.columns = ['question', 'similarity']
offset_3 = questions_only[['question', 'pair_3_sim']]
offset_3.columns = ['question', 'similarity']
offset_4 = questions_only[['question', 'pair_4_sim']]
offset_4.columns = ['question', 'similarity']

In [69]:
# Make probability predictions using q_a_pair model
offset_1_preds = pair_model.predict_proba(offset_1[['similarity']])
offset_2_preds = pair_model.predict_proba(offset_2[['similarity']])
offset_3_preds = pair_model.predict_proba(offset_3[['similarity']])
offset_4_preds = pair_model.predict_proba(offset_4[['similarity']])

In [70]:
# Add predicions back to offset dataframes
offset_1['predicted'] = offset_1_preds[:, 1]
offset_2['predicted'] = offset_2_preds[:, 1]
offset_3['predicted'] = offset_3_preds[:, 1]
offset_4['predicted'] = offset_4_preds[:, 1]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  offset_1['predicted'] = offset_1_preds[:, 1]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  offset_2['predicted'] = offset_2_preds[:, 1]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  offset_3['predicted'] = offset_3_preds[:, 1]
A value is trying to be set on a copy of a slice from a DataFrame.
Try

In [71]:
# Add predictions to questions only df
questions_only['pred_1'] = offset_1['predicted']
questions_only['pred_2'] = offset_2['predicted']
questions_only['pred_3'] = offset_3['predicted']
questions_only['pred_4'] = offset_4['predicted']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  questions_only['pred_1'] = offset_1['predicted']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  questions_only['pred_2'] = offset_2['predicted']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  questions_only['pred_3'] = offset_3['predicted']
A value is trying to be set on a copy of a slice from a Da

In [72]:
questions_only.columns

Index(['r_index', 'text', 'question', 'pair_1_sim', 'pair_2_sim', 'pair_3_sim',
       'pair_4_sim', 'pred_1', 'pred_2', 'pred_3', 'pred_4'],
      dtype='object')

In [73]:
# Use pdf to generate final predictions
questions_only['answered_question_pred'] = questions_only.apply(lambda row: pdf_prediction(row['pred_1'], row['pred_2'], row['pred_3'], row['pred_4']), axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  questions_only['answered_question_pred'] = questions_only.apply(lambda row: pdf_prediction(row['pred_1'], row['pred_2'], row['pred_3'], row['pred_4']), axis=1)


In [74]:
# Print questions for follow up for user
print("Follow up items:")
for index, row in questions_only.iterrows():
    str_out = "Unanswered question in message " + str(row['r_index']) + ":  "
    str_out += row['text']
    print(str_out)

Follow up items:
Unanswered question in message 0:  seems exactly what I want
Unanswered question in message 8:  idk its not that terribly complicated although i dont know what you have used in the past
Unanswered question in message 24:  What was meant by that
Unanswered question in message 26:  i think what you are reaching at is the notion of namespace
Unanswered question in message 33:  a class is more like a dictionary except a dictionary where the values can be functions which handily in Python is trivial
Unanswered question in message 38:  or hierarchy of namespaces
Unanswered question in message 57:  What library
Unanswered question in message 59:  Bruh Py2 is impossible
Unanswered question in message 85:  yeah ik what the differenences are but i just had a oubt where can u use elif muliple times
Unanswered question in message 110:  Let's say I have a condition in the program to check if string doesn't include anything other than letters and want to successfully exit out if it'

# Final End to end analytics of Discord test data

In [75]:
# Clean questions for mapping back to test set
questions_only.columns= ['r_index', 'text', 'question_pred', 'pair_1_sim', 'pair_2_sim', 'pair_3_sim',
       'pair_4_sim', 'pred_1', 'pred_2', 'pred_3', 'pred_4',
       'answered_question_pred']
questions_only.drop(columns=['pair_1_sim', 'pair_2_sim', 'pair_3_sim',
       'pair_4_sim', 'pred_1', 'pred_2', 'pred_3', 'pred_4'], inplace=True)
questions_only


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  questions_only.drop(columns=['pair_1_sim', 'pair_2_sim', 'pair_3_sim',


Unnamed: 0,r_index,text,question_pred,answered_question_pred
0,0,seems exactly what I want,1,0
8,8,idk its not that terribly complicated although...,1,0
24,24,What was meant by that,1,0
26,26,i think what you are reaching at is the notion...,1,0
33,33,a class is more like a dictionary except a dic...,1,0
38,38,or hierarchy of namespaces,1,0
57,57,What library,1,0
59,59,Bruh Py2 is impossible,1,0
85,85,yeah ik what the differenences are but i just ...,1,1
110,110,Let's say I have a condition in the program to...,1,1


In [76]:
# Calculate difference of question and answered to get final prediction
questions_only['final_pred'] = questions_only['question_pred'] - questions_only['answered_question_pred']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  questions_only['final_pred'] = questions_only['question_pred'] - questions_only['answered_question_pred']


In [83]:
# Drop User column
df.drop(columns=['user'], inplace=True)

In [84]:
df

Unnamed: 0,text,Question,Answer
0,seems exactly what I want,0.0,0.0
1,for larger things I use a scratch file or pych...,0.0,0.0
2,was about to say pycharm has an ipython console,0.0,0.0
3,I use this normally because I don't want to la...,0.0,0.0
4,for stuff like showing a get request in a help...,0.0,0.0
...,...,...,...
195,I am looking for an text/code editor where cop...,0.0,0.0
196,with C/C++ Emacs was fine with python I am get...,0.0,0.0
197,yeah i have no clue about emacs but if your lo...,0.0,0.0
198,I also believe you can set up emacs to be smoo...,0.0,0.0


In [87]:
# Copy question predictions to main dataframe
df['question_predicted'] = test_df['question']

In [91]:
# Reset index in place to generate r_index column
df.reset_index(inplace=True)

In [96]:
# Initialize answered predictions to 0
df['answered_question_pred'] = 0

In [100]:
# Impute answer predicitons from questions only
for index_1, row in questions_only.iterrows():
    df.at[int(row['r_index']), 'answered_question_pred'] = row['answered_question_pred']

In [106]:
# Initialize final preds to 0
df['final_pred'] = 0

In [107]:
# Impute final predictions into main df from questions only
for index_1, row in questions_only.iterrows():
    df.at[int(row['r_index']), 'final_pred'] = row['final_pred']

In [108]:
df['final_pred'].describe()

count    200.000000
mean       0.085000
std        0.279582
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        1.000000
Name: final_pred, dtype: float64

In [111]:
# calculate actual final results
df['final_actual'] = df.apply(lambda row: max(row['Question'] - row['Answer'], 0), axis=1)

In [113]:
# Question Identification on test corpus classification report
from sklearn.metrics import accuracy_score, classification_report

# Evaluation
print("Identification of Questions in test corpus")
print("Accuracy:", accuracy_score(df['Question'], df['question_predicted']))
print("Classification Report:\n", classification_report(df['Question'], df['question_predicted']))

Identification of Questions in test corpus
Accuracy: 0.84
Classification Report:
               precision    recall  f1-score   support

         0.0       0.90      0.92      0.91       174
         1.0       0.36      0.31      0.33        26

    accuracy                           0.84       200
   macro avg       0.63      0.61      0.62       200
weighted avg       0.83      0.84      0.83       200


In [127]:
# Prediction of an Answer Given the prediction of an answer vs actual answers
print("Identification of Questions in test corpus")
print("Accuracy:", accuracy_score(df[df['question_predicted'] == 1]['Answer'], df[df['question_predicted'] == 1]['answered_question_pred']))
print("Classification Report:\n", classification_report(df[df['question_predicted'] == 1]['Answer'], df[df['question_predicted'] == 1]['answered_question_pred']))

Identification of Questions in test corpus
Accuracy: 0.7272727272727273
Classification Report:
               precision    recall  f1-score   support

         0.0       0.82      0.82      0.82        17
         1.0       0.40      0.40      0.40         5

    accuracy                           0.73        22
   macro avg       0.61      0.61      0.61        22
weighted avg       0.73      0.73      0.73        22


In [115]:
df.columns

Index(['index', 'text', 'Question', 'Answer', 'question_predicted',
       'answered_question_pred', 'final_pred', 'final_actual'],
      dtype='object')

In [116]:
# Prediction of an Answer Given the prediction of an answer vs actual answers
print("Final Report of predicted unanswered questions")
print("Accuracy:", accuracy_score(df['final_actual'], df['final_pred']))
print("Classification Report:\n", classification_report(df['final_actual'], df['final_pred']))

Final Report of predicted unanswered questions
Accuracy: 0.89
Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      0.93      0.94       189
         1.0       0.18      0.27      0.21        11

    accuracy                           0.89       200
   macro avg       0.57      0.60      0.58       200
weighted avg       0.91      0.89      0.90       200


In [118]:
# Total Positive predictions
df['final_pred'].sum()

17

In [119]:
# Total positive cases
df['final_actual'].sum()

11.0

In [125]:
# Filter and display correct positive predictions
correct_positive_preds= df[df['final_pred'] == 1]
correct_positive_preds= correct_positive_preds[correct_positive_preds['final_actual'] == 1]
correct_positive_preds

Unnamed: 0,index,text,Question,Answer,question_predicted,answered_question_pred,final_pred,final_actual
24,24,What was meant by that,1.0,0.0,1,0,1,1.0
57,57,What library,1.0,0.0,1,0,1,1.0
120,120,or I should throw an exception telling the tha...,1.0,0.0,1,0,1,1.0
