### This notebook is part 2 of [100% accuracy using One-Hot Encoding](https://www.kaggle.com/code/mghobashy/100-accuracy-model-using-one-hot-encoding)
### In this continuation, we talk about ***Train-Test Contamination*** while exploring different approaches to improve our model.
# Import the necessary libraries:

In [None]:
import random
import re  # the standard-library module is enough for the pattern used below
import numpy as np
import pandas as pd
import plotly.express as px

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import TextVectorization

import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
main_data = pd.read_csv("/kaggle/input/disease-symptom-description-dataset/dataset.csv")

In [None]:
main_data.sample(5)

# Clean and preprocess the data:

## In Part 1, we converted the text data into integers to feed through our neural network. Here, we will explore more advanced strategies like LSTMs and K-means clustering that can work with text.

### First, we concatenate all the symptom columns into a single column.

In [None]:
symptom_cols = main_data.columns.difference(['Disease'])  # Select all columns except 'Disease'

# Combine all symptom columns into a single column
conc_df = main_data.copy()  # Copy the original data in case we need it later
conc_df['Symptoms'] = conc_df[symptom_cols].apply(lambda row: ','.join(row.dropna()), # Dropping NaN
                                                  axis=1)

# Drop duplicate symptoms within each cell
conc_df['Symptoms'] = conc_df['Symptoms'].apply(lambda x: ','.join(sorted(set(
                                                x.split(','))) if x else ''))

# Keep only the 'Disease' and 'Symptoms' columns
stay_cols = ['Disease', 'Symptoms']
conc_df = conc_df[stay_cols]

conc_df.head()

In [None]:
# Let's view the Symptoms
conc_df['Symptoms'][0]

### Then we clean the 'Symptoms' column.

In [None]:
def strip_to_basic_tokens(text):
    # Replace runs of underscores and whitespace with a single space
    text = re.sub(r'[_\s]+', ' ', text)
    # Split by commas and lowercase the tokens
    tokens = [token.strip().lower() for token in text.split(',')]
    return tokens

data = conc_df.copy() # Making a copy

# Apply the function to 'Symptoms' column
data['Basic Tokens'] = data['Symptoms'].apply(strip_to_basic_tokens)
data['Basic Tokens'] = data['Basic Tokens'].apply(lambda x: ', '.join(x))
data = data.drop(['Symptoms'], axis = 1)
data.head()

In [None]:
# Now the Symptoms column is ready
data['Basic Tokens'][0]
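
#### To make the cleaning step concrete, here is a small standalone sketch of the same normalization applied to a made-up symptom string. The sample input is an assumption for illustration, not a row copied from the dataset.

```python
import re

def strip_to_basic_tokens(text):
    # Collapse runs of underscores and whitespace into a single space
    text = re.sub(r'[_\s]+', ' ', text)
    # Split on commas, then trim and lowercase each token
    return [token.strip().lower() for token in text.split(',')]

sample = " skin_rash,Itching, nodal_skin_eruptions"  # hypothetical input
tokens = strip_to_basic_tokens(sample)
joined = ', '.join(tokens)
print(tokens)
print(joined)
```

#### Multi-word symptoms such as `skin rash` stay together as one comma-separated item here, which matters later when we decide how to split the text into tokens.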

### We will also label encode the 'Disease' column.

In [None]:
# Cast the 'Disease' column to strings before encoding
encoded_data = data.copy()
flattened_series = encoded_data['Disease'].astype(str)
# Apply label encoding on the Disease column
encoder = LabelEncoder()
encoded_values = encoder.fit_transform(flattened_series)
encoded_data['Disease'] = encoded_values
# Saving the encoding to reverse it later in the predictions
label_mapping = {index: label for index, label in enumerate(encoder.classes_)}

encoded_data.head()
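
#### As a rough illustration of what the encoding and `label_mapping` do, the sketch below assigns integers to labels in sorted order and builds the reverse lookup. It is a plain-Python stand-in, not the scikit-learn implementation; `fit_label_mapping` and the sample diseases are hypothetical.

```python
def fit_label_mapping(labels):
    """Assign each distinct label an integer by sorted order, plus the reverse map."""
    classes = sorted(set(labels))
    to_index = {label: index for index, label in enumerate(classes)}
    to_label = {index: label for label, index in to_index.items()}
    return to_index, to_label

diseases = ["Malaria", "Acne", "GERD", "Acne"]  # hypothetical sample
to_index, to_label = fit_label_mapping(diseases)

encoded = [to_index[d] for d in diseases]
decoded = [to_label[i] for i in encoded]
print(encoded)
print(decoded)
```

#### The reverse map is what lets us turn the model's predicted class index back into a disease name.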

## Now that the data is ready, let's review how we are going to train and evaluate our models.

# Bidirectional LSTM:

### First, we create the train, validation, and test sets.

In [None]:
train_sentences, val_sentences, train_labels, val_labels = train_test_split(
    encoded_data["Basic Tokens"].to_numpy(),
    encoded_data["Disease"].to_numpy(),
    stratify=data["Disease"], # Keep the class distribution consistent between the training set and the held-out set
    test_size=0.25,
    random_state=42)
val_sentences, test_sentences, val_labels, test_labels = train_test_split(
    val_sentences,
    val_labels,
    test_size=0.5,
    random_state=42)

### Next, we will set up our `text_vectorizer` layer and adapt it.

In [None]:
max_length = max(len(sentence.split()) for sentence in train_sentences)
text_vectorizer = TextVectorization(split="whitespace", # how to split tokens
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length = max_length,
                                    # standardize is left at its default ("lower_and_strip_punctuation"), which also strips the commas
                                   )

In [None]:
text_vectorizer.adapt(train_sentences)

### Let's see how the text vectorizer handles the text.

In [None]:
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nVectorized version:")
text_vectorizer([random_sentence])
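
#### For intuition, here is a simplified plain-Python mimic of integer vectorization: build a vocabulary from the training text, reserve id 0 for padding and id 1 for unknown words, then pad each sentence to a fixed length. This is a sketch under my own assumptions (for example, ties are broken alphabetically), so the ids will not necessarily match what `TextVectorization` produces.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary words

def build_vocab(sentences):
    # Count words after dropping commas and splitting on whitespace
    counts = Counter(word for s in sentences for word in s.replace(',', ' ').split())
    # More frequent words get smaller ids; ties are broken alphabetically here
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered, start=2)}

def vectorize(sentence, vocab, length):
    ids = [vocab.get(word, UNK) for word in sentence.replace(',', ' ').split()]
    ids = ids[:length]                        # truncate long sentences
    return ids + [PAD] * (length - len(ids))  # pad short ones

train = ["itching, skin rash", "skin rash, cough"]  # hypothetical sentences
vocab = build_vocab(train)
vector = vectorize("cough, fatigue", vocab, length=4)
print(vocab)
print(vector)
```

#### Note that `fatigue` never appeared in the toy training text, so it maps to the unknown id.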

In [None]:
# Getting the max vocab length
words_in_vocab = text_vectorizer.get_vocabulary()
max_vocab_length = len(words_in_vocab)
max_vocab_length

In [None]:
# Getting the number of targets to put in the output layer
targets = data['Disease'].nunique()
targets

## Now, let's create, compile, and train the model.

In [None]:
tf.random.set_seed(42)
# We set the embedding layer
embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer="uniform",
                                     input_length=max_length,
                                     name="embedding")

inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(targets, activation="softmax")(x)
model_1 = tf.keras.Model(inputs, outputs, name="Bidirectional_LSTM")

In [None]:
model_1.compile(loss="sparse_categorical_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [None]:
model_1.summary()

In [None]:
early_stopping = EarlyStopping(monitor='val_accuracy', patience=3, mode='max', 
                               restore_best_weights=True)


model_1_history = model_1.fit(
    train_sentences,
    train_labels,
    epochs=500,
    validation_data=(val_sentences, val_labels),
    callbacks=[early_stopping]
)

In [None]:
model_1.evaluate(test_sentences, test_labels)

## Great! 100% accuracy on both train and test. Looks like the model is flawless, but ***is it though?***

### Let's view the nearest neighbors (symptoms) for each class label (disease). It should make sense *medically* since the model achieved 100% accuracy.

In [None]:
# Generate a random number between 0 and 40, since there are 41 targets
random_number = random.randint(0, 40)

def find_nearest_neighbors(class_label):
    # Get the embedding layer of the model
    embedding_layer = model_1.get_layer("embedding")

    # Get the weights of the embedding layer
    weights = embedding_layer.get_weights()[0]

    # Prepare the nearest neighbors model
    neighbors_model = NearestNeighbors(n_neighbors=5, algorithm='auto')
    
    # Fit the nearest neighbors model with the weights of the embedding layer
    neighbors_model.fit(weights)

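    # Look up row `class_label` of the embedding matrix. Note that the rows are
    # indexed by vocabulary token id, not by disease label, so this is the vector
    # of the token whose id happens to equal class_label.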
    class_embedding = weights[class_label]

    # Find the nearest neighbors to the class embedding
    distances, indices = neighbors_model.kneighbors([class_embedding])
    print("Nearest neighbors to class label '{}':".format(label_mapping[class_label]))
    for i, idx in enumerate(indices[0]):
        print("{}: {}".format(i+1, text_vectorizer.get_vocabulary()[idx]))

find_nearest_neighbors(class_label=12) # replace '12' with 'random_number' to view the rest

## Well, this is completely wrong. But why is the model behaving poorly? Didn't we achieve 100% accuracy? What happened?
## We encountered this issue in part 1 when we used LabelEncoder to encode the entire dataset. Despite achieving around 100% accuracy on both training and testing, the model's results were nonsensical when deployed.
## This happened because of **train-test contamination**.

# Train-Test Contamination:
[Read more about it here](https://towardsdatascience.com/the-dreaded-antagonist-data-leakage-in-machine-learning-5f08679852cc)
## It's part of a larger concept known as *Data Leakage*.
### Train-test contamination occurs when information from the test set finds its way into the training set.
### The result? Unrealistically good metrics during training and evaluation, but poor performance when the model is actually put to use.
### In simpler terms, the model has already seen (and memorized) examples it is later tested on, so the test score no longer measures how well it generalizes.

### But how and when did that happen?
### Well, if we revisit the data again after knowing about train-test contamination, we will notice a crucial oversight.

In [None]:
data.head(10)

### The "Basic Tokens" rows for each disease are too similar. This similarity means that when we split the data into train and test sets, both sets ended up containing nearly identical or highly overlapping instances of data.
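
#### One quick way to check for this kind of overlap is to measure how many test rows appear verbatim in the training rows. The sketch below is a minimal standalone version on made-up data; `overlap_fraction` and the toy lists are hypothetical names introduced for illustration.

```python
def overlap_fraction(train_rows, test_rows):
    """Fraction of test rows whose exact text also appears in the training rows."""
    if not test_rows:
        return 0.0
    train_set = set(train_rows)
    shared = sum(1 for row in test_rows if row in train_set)
    return shared / len(test_rows)

# Hypothetical toy splits
train = ["itching, skin rash", "cough, high fever", "itching, skin rash"]
test = ["itching, skin rash", "chest pain, cough"]

fraction = overlap_fraction(train, test)
print(f"{fraction:.0%} of test rows also appear in train")
```

#### In this notebook the same check could be run as `overlap_fraction(list(train_sentences), list(test_sentences))`. It only catches exact duplicates, so near-identical rows would need a fuzzier comparison.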

#### How can we solve this problem with this particular dataset? While we can't directly fix it, we can approach the data differently. 
#### One method is to concatenate the rows for each disease while emphasizing unique symptoms to give them more weight during model training. 
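
#### A minimal sketch of the concatenation idea is shown here, assuming rows arrive as `(disease, "symptom, symptom, ...")` pairs. It covers only the merging step; weighting unique symptoms is left out. The function name and toy rows are hypothetical.

```python
from collections import defaultdict

def merge_symptoms_by_disease(rows):
    """Collapse (disease, 'a, b, c') rows into one sorted symptom list per disease."""
    merged = defaultdict(set)
    for disease, token_string in rows:
        for token in token_string.split(','):
            token = token.strip()
            if token:  # skip empty pieces from stray commas
                merged[disease].add(token)
    return {disease: sorted(tokens) for disease, tokens in merged.items()}

# Hypothetical toy rows
rows = [
    ("Fungal infection", "itching, skin rash"),
    ("Fungal infection", "itching, nodal skin eruptions"),
    ("Allergy", "continuous sneezing, chills"),
]
merged = merge_symptoms_by_disease(rows)
print(merged)
```

#### With one row per disease there is nothing left to split into train and test, which is why this approach changes how the model has to be evaluated.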
#### Alternatively, we can employ unsupervised techniques that focus solely on the features. And one of these techniques is...

# K-means clustering:

#### Clustering is unsupervised: it groups rows using only the symptom features and never sees the disease labels, so there is no labelled train/test split for near-duplicate rows to leak across. We can then check how well the clusters line up with the true diseases.
#### This makes clustering a reasonable fit for this dataset, whether as a preprocessing step or as the primary model.

In [None]:
data.head()

### We can either use the Keras text vectorizer with an embedding layer to vectorize the text, or we can use TF-IDF.

## Keras Text Vectorizer and Embedding

In [None]:
text_vectorizer

In [None]:
X = text_vectorizer(data['Basic Tokens'])
X_embed = embedding(X)
X_embed.shape

#### If we passed `X_embed` directly to K-means, we would get this error: `ValueError: Found array with dim 3. KMeans expected <= 2`. That's why we need `GlobalAveragePooling1D`, which averages over the token axis and leaves one vector per row.
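
#### The shape change can be seen in a few lines of plain Python. This sketch averages token vectors over the sequence axis, turning `(samples, tokens, dims)` into `(samples, dims)`. It is an illustration of the idea with toy numbers, not the Keras layer itself.

```python
def mean_pool(batch):
    """Average token vectors over the sequence axis: (samples, tokens, dims) -> (samples, dims)."""
    pooled = []
    for sequence in batch:
        dims = len(sequence[0])
        pooled.append([sum(vec[d] for vec in sequence) / len(sequence) for d in range(dims)])
    return pooled

# Two samples, three tokens each, two dimensions per token (toy values)
batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],
]
pooled = mean_pool(batch)
print(pooled)  # [[3.0, 4.0], [2.0, 2.0]]
```

#### As far as I can tell, when no mask is supplied the padded positions are averaged in as well, so short symptom lists are pulled toward the padding token's vector.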

In [None]:
# Apply GlobalAveragePooling1D
global_avg_pooling = layers.GlobalAveragePooling1D(data_format='channels_last')
X_avg = global_avg_pooling(X_embed).numpy()
X_avg.shape

### Label-encode the disease column.

In [None]:
le = LabelEncoder()
y = le.fit_transform(data['Disease'])

### Now we create our K-Means model.

In [None]:
n_clusters = 41  # The number of diseases in the dataset
kmeans_1 = KMeans(n_clusters=n_clusters, random_state=42)
data['cluster_embed'] = kmeans_1.fit_predict(X_avg)

### Evaluating Clustering Performance.
#### ARI and NMI values close to 1 indicate strong agreement between the predicted clusters and the true disease labels. Values near 0 indicate agreement no better than chance (ARI can also be slightly negative).
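
#### To show what ARI actually measures, here is a small pair-counting sketch in plain Python. It follows the standard formula (observed pair agreement, corrected by the agreement expected by chance); the function name and toy labels are my own, and the notebook itself keeps using scikit-learn's `adjusted_rand_score`.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs that fall in the same class AND the same cluster
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs in the same class, and pairs in the same cluster
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. everything in a single group
    return (sum_cells - expected) / (max_index - expected)

same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_grouping, crossed_grouping)
```

#### The first call scores a perfect match even though the cluster ids are swapped, which is why ARI suits K-means: cluster numbers are arbitrary. The second call comes out negative because the grouping agrees less than chance would.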

In [None]:
# Compute Adjusted Rand Index
ari = adjusted_rand_score(y, data['cluster_embed'])
print(f'Adjusted Rand Index (ARI): {ari:.4f}')

# Compute Normalized Mutual Information
nmi = normalized_mutual_info_score(y, data['cluster_embed'])
print(f'Normalized Mutual Information (NMI): {nmi:.4f}')

### Let's visualize the clustering.

In [None]:
# Reduce dimensionality for visualization
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_avg)

tsne_df = pd.DataFrame(X_tsne, columns=['TSNE Component 1', 'TSNE Component 2'])
tsne_df['Cluster'] = data['cluster_embed']
tsne_df['Disease'] = data['Disease']

# Creating the plot using Plotly Express
fig = px.scatter(
    tsne_df, 
    x='TSNE Component 1', 
    y='TSNE Component 2', 
    color='Cluster', 
    hover_data=['Disease'],
    title='t-SNE Visualization of Symptom Clusters',
    color_continuous_scale=px.colors.qualitative.Vivid
)

fig.show()

### Looks great, but we can do better.

## TF-IDF
##### Term Frequency (TF): The frequency of a word in a document.
##### Inverse Document Frequency (IDF): The rarity of the word across all documents.
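
##### A hand-rolled sketch of the weighting is shown here. It uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is my understanding of scikit-learn's default settings. Tokenization is simplified to whitespace splitting, so the numbers may not match `TfidfVectorizer` exactly. The toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(documents):
    tokenized = [doc.replace(',', ' ').split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # term frequency within this document
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

docs = ["itching, skin rash", "cough, high fever", "skin rash, high fever"]
vectors = tfidf(docs)
for term, weight in sorted(vectors[0].items()):
    print(f"{term}: {weight:.3f}")
```

##### In the first document, `itching` appears in only one document while `skin` and `rash` appear in two, so `itching` receives the larger weight. Rare symptoms count for more.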

In [None]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['Basic Tokens'])

### Label-encode the disease column.

In [None]:
le = LabelEncoder()
y = le.fit_transform(data['Disease'])

### And now we train the model

In [None]:
n_clusters = 41  # The number of diseases in the dataset
kmeans_2 = KMeans(n_clusters=n_clusters, random_state=42)
data['cluster'] = kmeans_2.fit_predict(X)

In [None]:
data.head()

### Evaluating Clustering Performance.

In [None]:
# Compute Adjusted Rand Index
ari = adjusted_rand_score(y, data['cluster'])
print(f'Adjusted Rand Index (ARI): {ari:.4f}')

# Compute Normalized Mutual Information
nmi = normalized_mutual_info_score(y, data['cluster'])
print(f'Normalized Mutual Information (NMI): {nmi:.4f}')

##### A slight improvement in performance.

### Visualizing the Clustering:

In [None]:
# Reduce dimensionality for visualization
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X.toarray())

# Create a DataFrame with the t-SNE results and the cluster labels
tsne_df = pd.DataFrame(X_tsne, columns=['TSNE Component 1', 'TSNE Component 2'])
tsne_df['Cluster'] = data['cluster']
tsne_df['Disease'] = data['Disease']

# Interactive plot using Plotly Express
fig = px.scatter(
    tsne_df, 
    x='TSNE Component 1', 
    y='TSNE Component 2', 
    color='Cluster', 
    hover_data=['Disease'],
    title='t-SNE Visualization of Symptom Clusters',
    color_continuous_scale=px.colors.qualitative.Vivid
)

fig.show()

### Looking much better. Now let's test all the models.

In [None]:
def predict_disease(user_input):
    
    '''
    Predict the disease based on the provided symptoms using three different models:
    LSTM, Embeddings, and TF-IDF.

    Args:
        user_input (str): A string of symptoms separated by commas, representing the symptoms provided by the user for disease prediction.

    Returns:
        LSTM_prediction (str): The disease predicted by the LSTM model.
        predicted_disease_tfidf (str): The disease predicted by the TF-IDF and k-means clustering model.
        predicted_disease_embed (str): The disease predicted by the Keras vectorizer and embedding model.
    '''
    
    
    #LSTM
    user_input_array = np.array([user_input], dtype=object)
    user_prediction = model_1.predict(user_input_array)
    LSTM_prediction = label_mapping[user_prediction.argmax()]
    
    # Keras vectorizer and embedding
    user_input_vector = text_vectorizer([user_input])
    user_input_vector_embed = embedding(user_input_vector)
    user_input_vector_avg = global_avg_pooling(user_input_vector_embed).numpy()
    predicted_cluster_embed = kmeans_1.predict(user_input_vector_avg)
    cluster_to_disease_embed = data.groupby('cluster_embed')['Disease'].apply(lambda x: x.mode()[0]).to_dict()
    predicted_disease_embed = cluster_to_disease_embed[predicted_cluster_embed[0]]
    
    # TF-IDF
    user_input_tfidf = vectorizer.transform([user_input])
    predicted_cluster_tfidf = kmeans_2.predict(user_input_tfidf)
    cluster_to_disease = data.groupby('cluster')['Disease'].apply(lambda x: x.mode()[0]).to_dict()
    predicted_disease_tfidf = cluster_to_disease[predicted_cluster_tfidf[0]]
    
    return LSTM_prediction, predicted_disease_tfidf, predicted_disease_embed


user_input = "breathlessness,cough" # These symptoms are taken from Bronchial Asthma symptoms
LSTM_prediction, predicted_disease_tfidf, predicted_disease_embed = predict_disease(user_input)
print(f"Using LSTM, the predicted disease for the symptoms '{user_input}' is: {LSTM_prediction}\n")
print(f"Using Embeddings, the predicted disease for the symptoms '{user_input}' is: {predicted_disease_embed}\n")
print(f"Using TF-IDF, the predicted disease for the symptoms '{user_input}' is: {predicted_disease_tfidf}")
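
#### The two clustering predictions rely on mapping each cluster to its most common disease. A standalone sketch of that mapping is shown below, using `collections.Counter` instead of pandas. The helper name and toy data are hypothetical, and ties may resolve differently from `mode()[0]`.

```python
from collections import Counter, defaultdict

def majority_label_per_cluster(cluster_ids, labels):
    """Map each cluster id to the label that occurs most often inside it."""
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        by_cluster[cluster_id][label] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in by_cluster.items()}

# Hypothetical toy assignment
cluster_ids = [0, 0, 0, 1, 1]
labels = ["Acne", "Acne", "Allergy", "GERD", "GERD"]

cluster_to_label = majority_label_per_cluster(cluster_ids, labels)
print(cluster_to_label)
```

#### Computing this mapping once and reusing it would avoid repeating the `groupby` on every prediction. A cluster that mixes several diseases will still report only its majority disease.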

### Looks like the TF-IDF model is the most accurate one, but let's test the models even further.

In [None]:
def shuffle_tokens(df, num_tokens=None):
    '''
    Randomly shuffle and select a subset of tokens (symptoms) from a randomly selected row in the DataFrame.

    Args:
        df (pd.DataFrame): A DataFrame containing two columns: 'Disease' and 'Basic Tokens'.
        num_tokens (int, optional): The number of tokens to select from the shuffled list. If None, a random number of tokens will be selected.

    Returns:
        disease (str): The disease corresponding to the randomly selected row.
        shuffled_tokens (str): A string of randomly shuffled and selected tokens (symptoms) from the chosen row.

    '''
    # Select a random row index
    idx = np.random.randint(0, len(df))

    # Retrieve disease value
    disease = df.iloc[idx]['Disease']

    # Retrieve tokens and shuffle
    tokens_str = df.iloc[idx]['Basic Tokens']
    tokens_list = [token.strip() for token in tokens_str.split(',')]
    np.random.shuffle(tokens_list)
    
    # Select a random number of tokens if num_tokens is not specified
    if num_tokens is None:
        num_tokens = np.random.randint(1, len(tokens_list) + 1)
    
    # Randomly select a subset of tokens
    selected_tokens = np.random.choice(tokens_list, num_tokens, replace=False)
    shuffled_tokens = ', '.join(selected_tokens)

    return disease, shuffled_tokens

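# Same as predict_disease above, redefined with verbose=0 to silence the Keras progress bar inside the loop
def predict_disease(user_input):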
    # LSTM prediction
    user_input_array = np.array([user_input], dtype=object)
    user_prediction = model_1.predict(user_input_array, verbose = 0)
    LSTM_prediction = label_mapping[user_prediction.argmax()]
    
    # Keras vectorizer and embedding prediction
    user_input_vector = text_vectorizer([user_input])
    user_input_vector_embed = embedding(user_input_vector)
    user_input_vector_avg = global_avg_pooling(user_input_vector_embed).numpy()
    predicted_cluster_embed = kmeans_1.predict(user_input_vector_avg)
    cluster_to_disease_embed = data.groupby('cluster_embed')['Disease'].apply(lambda x: x.mode()[0]).to_dict()
    predicted_disease_embed = cluster_to_disease_embed[predicted_cluster_embed[0]]
    
    # TF-IDF prediction
    user_input_tfidf = vectorizer.transform([user_input])
    predicted_cluster_tfidf = kmeans_2.predict(user_input_tfidf)
    cluster_to_disease = data.groupby('cluster')['Disease'].apply(lambda x: x.mode()[0]).to_dict()
    predicted_disease_tfidf = cluster_to_disease[predicted_cluster_tfidf[0]]
    
    return LSTM_prediction, predicted_disease_tfidf, predicted_disease_embed

# Create the DataFrame
results = []

for _ in range(20):
    disease, shuffled_tokens = shuffle_tokens(data)
    LSTM_prediction, predicted_disease_tfidf, predicted_disease_embed = predict_disease(shuffled_tokens)
    
    results.append({
        'Shuffled Tokens': shuffled_tokens,
        'Actual Disease': disease,
        'LSTM Prediction': LSTM_prediction,
        'Embedding Prediction': predicted_disease_embed,
        'TF-IDF Prediction': predicted_disease_tfidf
    })

results_df = pd.DataFrame(results)

def highlight_matches(s):
    is_match = s == results_df['Actual Disease']
    return ['background-color: darkgreen' if v else '' for v in is_match]

# Apply the highlighting to the prediction columns
styled_df = results_df.style.apply(highlight_matches, subset=['LSTM Prediction', 'Embedding Prediction', 'TF-IDF Prediction'])

styled_df

### As shown in the table above, the highlighted cells indicate where the predictions match the actual disease, with the TF-IDF model demonstrating the best performance.
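
#### Highlighting is easy to read, but a number per model is easier to compare. The sketch below computes the share of matching predictions from a list of result dictionaries shaped like `results`. The rows here are made up for illustration and are not outputs of the models.

```python
def accuracy_by_model(rows, truth_key, prediction_keys):
    """Share of rows where each prediction column equals the ground-truth column."""
    total = len(rows)
    if total == 0:
        return {key: 0.0 for key in prediction_keys}
    return {
        key: sum(1 for row in rows if row[key] == row[truth_key]) / total
        for key in prediction_keys
    }

# Hypothetical rows, NOT real model outputs
toy_results = [
    {"Actual Disease": "Acne", "LSTM Prediction": "GERD", "TF-IDF Prediction": "Acne"},
    {"Actual Disease": "GERD", "LSTM Prediction": "GERD", "TF-IDF Prediction": "GERD"},
]
scores = accuracy_by_model(toy_results, "Actual Disease",
                           ["LSTM Prediction", "TF-IDF Prediction"])
print(scores)
```

#### In the notebook this could be called on `results` with all three prediction columns. Keep in mind that `shuffle_tokens` samples rows from `data`, which every model has already seen, so these scores are a sanity check rather than a held-out estimate.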

### Did we solve the train-test contamination? ***NO***
### We simply worked around the problem, without ending up with an overfitted or poorly performing model.

#### Can this problem be fixed in this particular dataset? As far as I know, it can't be fixed because the issue is inherent to the entire dataset and the way it was made.

# Conclusion
#### In this notebook, we tackled train-test contamination, a persistent issue in machine learning. The dataset makes it hard to avoid: rows for the same disease are nearly identical, so any random split leaks test information into training. We explored ways to work around it, moving from a supervised LSTM to K-means clustering on embeddings and on TF-IDF features. The issue itself remains unsolved because it is built into the dataset, but in our spot checks on partial, shuffled symptom lists the TF-IDF model gave the most sensible predictions. Further work on preprocessing and evaluation will be needed to address the contamination properly.