## Data Insights

In this section, we will explore the given dataset and try to understand its characteristics.

First, let's load the data and check its size:

In [1]:
import pandas as pd

eval_df = pd.read_excel("/home/adesoji/Data_dir/evaluation.xlsx")
train_df = pd.read_excel("/home/adesoji/Data_dir/train.xlsx")

print("Train size:", train_df.shape)
print("Evaluation size:", eval_df.shape)


Train size: (2061, 3)
Evaluation size: (9000, 3)


We have 2,061 samples in the training set and 9,000 samples in the evaluation set, each with three columns. Next, let's look at the most common reasons in each set and then at the distribution of the labels:

In [2]:
# Count how often each unique reason appears in the evaluation set
reason_counts = eval_df['reason'].value_counts()

# Create a dataframe with the most common reasons
most_common_reasons = pd.DataFrame({'reason': reason_counts.index, 'count': reason_counts.values})

# Sort the dataframe by count in descending order
most_common_reasons = most_common_reasons.sort_values('count', ascending=False)

# Print the top 10 most common reasons
print(most_common_reasons.head(10))


                                        reason  count
0                            unable to use app    482
1       good app for conducting online meeting    343
2         good for watching movies and serials    236
3                              app is not good    188
4                        unable to play videos    174
5          good app for connecting with people    138
6                  want to cancel subscription    130
7                              unable to login    130
8          app is good to watch disney content    128
9  getting ads despite paying for subscription    124


In [3]:
# Count how often each unique reason appears in the training set
reason_counts = train_df['reason'].value_counts()

# Create a dataframe with the most common reasons
most_common_reasons = pd.DataFrame({'reason': reason_counts.index, 'count': reason_counts.values})

# Sort the dataframe by count in descending order
most_common_reasons = most_common_reasons.sort_values('count', ascending=False)

# Print the top 10 most common reasons
print(most_common_reasons.head(10))

                                      reason  count
0     good app for conducting online classes      1
1417               unable to connect meeting      1
1383             good for video conferencing      1
1382             unable to download zoom app      1
1381                want to download the app      1
1380                      app is not working      1
1379  good app for conducting online meeting      1
1378     unable to switch virtual background      1
1377                   video quality is poor      1
1376                           want to login      1


In [4]:
print("Train distribution:\n", train_df['label'].value_counts(normalize=True))
print("Evaluation distribution:\n", eval_df['label'].value_counts(normalize=True))


Train distribution:
 1    1.0
Name: label, dtype: float64
Evaluation distribution:
 0    0.666556
1    0.333444
Name: label, dtype: float64


We can see that the training set contains only positive samples (label 1), while the evaluation set is roughly two-thirds negative and one-third positive. A classifier trained on the positive pairs alone would never see a mismatched text-reason pair, so we need to generate negative samples artificially during training.
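One common way to generate such negatives is to pair each text with a reason drawn from a different row and label the new pair 0. The sketch below is only an illustration under that assumption: `add_negative_samples` is a hypothetical helper, and the toy rows stand in for the real training file. A randomly drawn reason can still happen to fit the text, so negatives built this way may be noisy.

```python
import random

import pandas as pd


def add_negative_samples(df, seed=0):
    """Pair each text with a reason from a different row and label it 0."""
    rng = random.Random(seed)
    reasons = df["reason"].tolist()
    if len(set(reasons)) < 2:
        raise ValueError("need at least two distinct reasons to build negatives")
    negatives = []
    for text, true_reason in zip(df["text"], reasons):
        # Resample until the drawn reason differs from the true one.
        wrong_reason = rng.choice(reasons)
        while wrong_reason == true_reason:
            wrong_reason = rng.choice(reasons)
        negatives.append({"text": text, "reason": wrong_reason, "label": 0})
    # Keep the original positives and append one negative per row.
    return pd.concat([df, pd.DataFrame(negatives)], ignore_index=True)


# Toy rows with the same columns as the training data (text, reason, label).
toy_df = pd.DataFrame({
    "text": ["very practical and easy to use", "i can not download this zoom app"],
    "reason": ["app is user-friendly", "unable to download zoom app"],
    "label": [1, 1],
})
augmented_df = add_negative_samples(toy_df)
print(augmented_df["label"].value_counts())
```

Drawing one negative per positive gives a balanced set. Drawing two per positive would bring the ratio closer to the evaluation distribution.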

Let's now look at some examples from the dataset:

In [5]:
for i in range(5):
    print(f"Text: {train_df['text'][i]}")
    print(f"Reason: {train_df['reason'][i]}")
    print(f"Label: {train_df['label'][i]}\n")


Text: this is an amazing app for online classes!but
Reason: good app for conducting online classes
Label: 1

Text: very practical and easy to use
Reason: app is user-friendly
Label: 1

Text: this app is very good for video conferencing.
Reason: good for video conferencing
Label: 1

Text: i can not download this zoom app
Reason: unable to download zoom app
Label: 1

Text: i am not able to download this app
Reason: want to download the app
Label: 1



We can see that each text is paired with a reason, and the label indicates whether the reason matches the text (1) or not (0).

## Baseline Approach

Before we start building our machine learning model, let's define a baseline approach that we can use to compare the performance of our model.

A simple baseline approach is to check if the text and reason share any common words. If they do, we predict the label as 1, otherwise, we predict 0. Let's implement this approach:

In [6]:
def baseline_predict(text, reason):
    text_words = set(text.lower().split())
    reason_words = set(reason.lower().split())
    
    if len(text_words.intersection(reason_words)) > 0:
        return 1
    else:
        return 0
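
Two failure modes of this rule are easy to see by hand. The sketch below restates the same function so that it runs on its own. The first pair is a matching example from the training set that shares no word. The second is a hypothetical mismatched pair, built by combining the text and reason of two different training rows, that still shares the word "app".

```python
def baseline_predict(text, reason):
    # Same rule as above: predict 1 if the text and reason share any word.
    text_words = set(text.lower().split())
    reason_words = set(reason.lower().split())
    return int(bool(text_words & reason_words))


# Matching pair from the training examples with no word in common.
missed_match = baseline_predict("very practical and easy to use", "app is user-friendly")

# Hypothetical mismatched pair (text and reason taken from different rows).
false_alarm = baseline_predict(
    "this app is very good for video conferencing.",
    "unable to download zoom app",
)

print(missed_match, false_alarm)  # prints: 0 1
```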


Now let's evaluate this baseline approach on the evaluation set:

In [7]:
from sklearn.metrics import classification_report

y_true = eval_df['label'].values
y_pred = [baseline_predict(text, reason) for text, reason in zip(eval_df['text'], eval_df['reason'])]

print(classification_report(y_true, y_pred))


              precision    recall  f1-score   support

           0       0.82      0.34      0.48      5999
           1       0.39      0.85      0.54      3001

    accuracy                           0.51      9000
   macro avg       0.61      0.60      0.51      9000
weighted avg       0.68      0.51      0.50      9000



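The baseline approach achieves an F1 score of 0.48 on the negative class and 0.54 on the positive class, with an overall accuracy of 0.51. It recovers most matching pairs (recall 0.85) but with low precision (0.39), which means many mismatched pairs still share at least one word. We can use this as a starting point and try to improve the performance using machine learning models.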
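For context, it helps to compare the accuracy against a trivial classifier that always predicts the majority class. The sketch below rebuilds the label vector from the class supports in the report above (5999 negatives, 3001 positives) instead of reading the evaluation file.

```python
# Class supports taken from the classification report above.
n_negative, n_positive = 5999, 3001
y_true = [0] * n_negative + [1] * n_positive

# Trivial classifier: always predict the majority class (label 0).
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Majority-class accuracy: {accuracy:.3f}")  # prints 0.667
```

The trivial classifier beats the word-overlap baseline on accuracy, but it never predicts a positive pair, so its recall on class 1 is zero. This is why the per-class F1 scores are the more useful numbers to track here.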

In [2]:
# Reuse train_df and eval_df, which were loaded from the Excel files in the first cell

# Dataset size
train_size = train_df.shape[0]
eval_size = eval_df.shape[0]

print(f"Training dataset size: {train_size}")
print(f"Evaluation dataset size: {eval_size}")

# Label distribution in the training dataset
train_label_counts = train_df['label'].value_counts(normalize=True)
print("\nLabel distribution in the training dataset:")
print(train_label_counts)

# Label distribution in the evaluation dataset
eval_label_counts = eval_df['label'].value_counts(normalize=True)
print("\nLabel distribution in the evaluation dataset:")
print(eval_label_counts)


Training dataset size: 2061
Evaluation dataset size: 9000

Label distribution in the training dataset:
1    1.0
Name: label, dtype: float64

Label distribution in the evaluation dataset:
0    0.666556
1    0.333444
Name: label, dtype: float64


Generative models are a class of models that can generate new data similar to the data they were trained on. Semantic similarity is a measure of the degree of equivalence in the underlying semantics of paired snippets of text. Two model families come up in this context. One is **BERT-based semantic text similarity models**¹, which are encoder models that score a pair of texts rather than generate new text. Another is **Generative Adversarial Networks (GANs)**⁶, which have shown promise in a variety of applications, including image and speech synthesis, natural language processing, and drug discovery⁶.
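Whatever model produces the text representation, a similarity score for a pair is often computed as the cosine between two vectors. The sketch below uses plain word counts as a stand-in for learned embeddings, so it captures only lexical overlap, not semantics. `cosine_similarity` is an illustrative helper and not part of any of the libraries mentioned here.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has no direction to compare.
    return dot / (norm_a * norm_b)


score = cosine_similarity("i can not download this zoom app", "unable to download zoom app")
print(round(score, 3))
```

Unlike the shared-word rule, this yields a graded score between 0 and 1 that can be thresholded. Replacing the count vectors with sentence embeddings from a pretrained model would keep the same cosine step.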

You can find more information about generative models and semantic similarity on **Google Scholar**³ and **GitHub**¹².

Source: Conversation with Bing, 3/19/2023.

(1) GitHub - AndriyMulyar/semantic-text-similarity: an easy-to-use .... https://github.com/AndriyMulyar/semantic-text-similarity Accessed 3/19/2023.
(2) Beyond Statistical Similarity: Rethinking Metrics for Deep Generative .... https://arxiv.org/abs/2302.02913 Accessed 3/19/2023.
(3) Google Scholar. https://scholar.google.com/ Accessed 3/19/2023.
(4) semantic-similarity · GitHub Topics · GitHub. https://github.com/topics/semantic-similarity Accessed 3/19/2023.
(5) [PDF] Generative models for similarity-based classification | Semantic .... https://www.semanticscholar.org/paper/Generative-models-for-similarity-based-Cazzanti-Gupta/732cc3f31df533833ac0f6620fa0d3036cf38d6e Accessed 3/19/2023.
(6) Joshua B. Tenenbaum - Google Scholar. https://scholar.google.com/citations?user=rRJ9wTJMUB8C Accessed 3/19/2023.
(7) Bilingual Generative Transformer for Semantic Sentence Embedding. https://aclanthology.org/2020.emnlp-main.122.pdf Accessed 3/19/2023.
(8) nv-tlabs/semanticGAN_code - GitHub. https://github.com/nv-tlabs/semanticGAN_code Accessed 3/19/2023.

https://pypi.org/project/semantic-text-similarity/
