<a href="https://colab.research.google.com/github/An210/ML/blob/main/text_classification_with_hub.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Copyright 2019 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In [None]:
#@title MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

# Text classification with TensorFlow Hub: Movie reviews

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/tutorials/keras/text_classification_with_hub"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/keras/text_classification_with_hub.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/text_classification_with_hub.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/keras/text_classification_with_hub.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
  <td>
    <a href="https://tfhub.dev/s?module-type=text-embedding"><img src="https://www.tensorflow.org/images/hub_logo_32px.png" />See TF Hub models</a>
  </td>
</table>

This notebook classifies movie reviews as *positive* or *negative* using the text of the review. This is an example of *binary*—or two-class—classification, an important and widely applicable kind of machine learning problem.

The tutorial demonstrates the basic application of transfer learning with [TensorFlow Hub](https://tfhub.dev) and Keras.

It uses the [IMDB dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) that contains the text of 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/). These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are *balanced*, meaning they contain an equal number of positive and negative reviews.

This notebook uses [`tf.keras`](https://www.tensorflow.org/guide/keras), a high-level API to build and train models in TensorFlow, and [`tensorflow_hub`](https://www.tensorflow.org/hub), a library for loading trained models from [TFHub](https://tfhub.dev) in a single line of code. For a more advanced text classification tutorial using `tf.keras`, see the [MLCC Text Classification Guide](https://developers.google.com/machine-learning/guides/text-classification/).

In [None]:
!pip install tensorflow-hub
!pip install tensorflow-datasets
!pip install tf-keras



In [None]:
import os
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
import tf_keras as keras

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.17.0
Eager mode:  True
Hub version:  0.16.1
GPU is NOT AVAILABLE


## Download the IMDB dataset

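The IMDB dataset is available in [TensorFlow datasets](https://www.tensorflow.org/datasets) as [imdb reviews](https://www.tensorflow.org/datasets/catalog/imdb_reviews). The following code does not download it. Instead, it uploads a CSV file with `Data` and `Label` columns to the Colab runtime, converts it to a `tf.data.Dataset`, and splits it into training, validation, and test sets: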

In [None]:
import pandas as pd
import numpy as np
import io
import os
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
import tf_keras as keras

from google.colab import files

# Upload the file
uploaded = files.upload()
filename = list(uploaded.keys())[0]  # Assuming only one file is uploaded

# Read the CSV file using the correct filename
df = pd.read_csv(io.StringIO(uploaded[filename].decode('utf-8')))
df

# Convert the Pandas DataFrame to a tf.data.Dataset
# Assuming 'Data' column contains the data and 'Label' column contains the labels
dataset = tf.data.Dataset.from_tensor_slices((df['Data'].values, df['Label'].values))

# Now you can split the dataset as needed
# For example, to split into 60% train, 20% validation, and 20% test:
train_size = int(0.6 * len(df))
val_size = int(0.2 * len(df))
test_size = len(df) - train_size - val_size

train_data = dataset.take(train_size)
validation_data = dataset.skip(train_size).take(val_size)
test_data = dataset.skip(train_size + val_size).take(test_size)

train_examples_batch, train_labels_batch = next(iter(train_data.batch(1)))
train_examples_batch

Saving Train.csv to Train (1).csv


<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'BI '], dtype=object)>
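The split sizes in the cell above are computed with `int(...)`, which truncates, so any rounding remainder ends up in the test set. The following is a minimal sketch of that arithmetic, assuming the eight-row `Train.csv` whose contents are printed later in this notebook; `split_sizes` is an illustrative helper, not part of the notebook.

```python
def split_sizes(n_rows, train_frac=0.6, val_frac=0.2):
    """Mirror the notebook's split arithmetic: int() truncates, test gets the rest."""
    train_size = int(train_frac * n_rows)
    val_size = int(val_frac * n_rows)
    test_size = n_rows - train_size - val_size
    return train_size, val_size, test_size


# Assumption: 8 rows, as in the Train.csv output shown in this notebook.
train_size, val_size, test_size = split_sizes(8)
print(train_size, val_size, test_size)  # 4 1 3

# take/skip on an unshuffled dataset behaves like slicing a list in order.
rows = list(range(8))
train_rows = rows[:train_size]
val_rows = rows[train_size:train_size + val_size]
test_rows = rows[train_size + val_size:]
print(train_rows, val_rows, test_rows)  # [0, 1, 2, 3] [4] [5, 6, 7]
```

Because the rows are taken in file order, shuffling before splitting may matter when the file happens to be sorted by label.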

In [None]:
for data, label in zip(train_examples_batch.numpy(), train_labels_batch.numpy()):
    print(f"Data: {data}, Label: {label}")

Data: b'BI ', Label: b'D&A'


## Explore the data

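Let's take a moment to understand the format of the data. In the IMDB dataset, each example is a sentence representing the movie review and a corresponding label. The sentence is not preprocessed in any way. The label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review. In the uploaded `Train.csv`, by contrast, each example is a short text from the `Data` column, and the label is a string from the `Label` column (`D&A`, `IT`, or `SAP`).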

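Let's print the first example and its label. The dataset is batched with a batch size of 1, and a notebook cell displays only its last expression, so the output below shows the label: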

In [None]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(1)))
train_examples_batch
train_labels_batch

<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'D&A'], dtype=object)>

Let's also print the label of the first example.

In [None]:
train_labels_batch

<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'D&A'], dtype=object)>
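The label printed above is a string, whereas most loss functions expect numeric targets. A minimal sketch of mapping string labels to integer ids follows; the label list is copied from the `Train.csv` rows displayed later in this notebook, and the names `label_to_id` and `label_ids` are illustrative assumptions.

```python
# Labels copied from the Train.csv output shown in this notebook.
labels = ["D&A", "D&A", "SAP", "IT", "IT", "IT", "SAP", "D&A"]

# Sort the distinct class names so the mapping is deterministic.
classes = sorted(set(labels))
label_to_id = {name: index for index, name in enumerate(classes)}

# Replace every string label with its integer id.
label_ids = [label_to_id[name] for name in labels]

print(label_to_id)  # {'D&A': 0, 'IT': 1, 'SAP': 2}
print(label_ids)    # [0, 0, 2, 1, 1, 1, 2, 0]
```

With three distinct classes, this data is a multi-class problem rather than the binary one the tutorial text describes.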

## Build the model

The neural network is created by stacking layers—this requires three main architectural decisions:

* How to represent the text?
* How many layers to use in the model?
* How many *hidden units* to use for each layer?

In this example, the input data consists of sentences. The labels to predict are either 0 or 1.

One way to represent the text is to convert sentences into embedding vectors. Use a pre-trained text embedding as the first layer, which has three advantages:

*   You don't have to worry about text preprocessing.
*   You benefit from transfer learning.
*   The embedding has a fixed size, so it's simpler to process.

For this example you use a **pre-trained text embedding model** from [TensorFlow Hub](https://tfhub.dev) called [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2).

There are many other pre-trained text embeddings from TFHub that can be used in this tutorial:

* [google/nnlm-en-dim128/2](https://tfhub.dev/google/nnlm-en-dim128/2) - trained with the same NNLM architecture on the same data as [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2), but with a larger embedding dimension. Larger-dimensional embeddings can improve performance on your task, but the model may take longer to train.
* [google/nnlm-en-dim128-with-normalization/2](https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2) - the same as [google/nnlm-en-dim128/2](https://tfhub.dev/google/nnlm-en-dim128/2), but with additional text normalization such as removing punctuation. This can help if the text in your task contains additional characters or punctuation.
* [google/universal-sentence-encoder/4](https://tfhub.dev/google/universal-sentence-encoder/4) - a much larger model yielding 512 dimensional embeddings trained with a deep averaging network (DAN) encoder.

And many more! Find more [text embedding models](https://tfhub.dev/s?module-type=text-embedding) on TFHub.

Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples. Note that no matter the length of the input text, the output shape of the embeddings is: `(num_examples, embedding_dimension)`.

In [None]:
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

<tf.Tensor: shape=(1, 50), dtype=float32, numpy=
array([[ 0.03549673,  0.09000611,  0.16476136,  0.09238922, -0.01018765,
        -0.15804552,  0.00639375,  0.01932171, -0.23423967, -0.16821812,
        -0.1677435 ,  0.18670902,  0.00087197,  0.11487269,  0.21351618,
         0.27116123,  0.00554988,  0.15449746,  0.1601547 ,  0.28777978,
        -0.04522782, -0.02295106,  0.15829036, -0.04858424,  0.06919032,
        -0.20833255, -0.13717455,  0.05805244, -0.12117814, -0.2054949 ,
         0.18067047,  0.01401768, -0.09218553,  0.01343871, -0.06666373,
        -0.2762104 ,  0.24723285, -0.12726785, -0.1715946 ,  0.18940216,
        -0.11338463,  0.00767711,  0.10877595,  0.03579775, -0.08470007,
        -0.11258791,  0.02044052,  0.03852302, -0.20510758,  0.08121522]],
      dtype=float32)>
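To illustrate why the output shape does not depend on the length of the input text, here is a toy sketch that averages per-token vectors into one fixed-size vector per sentence. It is not the NNLM model's actual computation: the vocabulary, the random embedding table, and the `embed` helper are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny vocabulary; the extra last row of the table is for unknown tokens.
vocab = {"sap": 0, "bi": 1, "report": 2, "server": 3}
unknown_id = len(vocab)
embedding_dimension = 50
table = rng.normal(size=(len(vocab) + 1, embedding_dimension))


def embed(sentences):
    """Map each sentence to the mean of its token vectors."""
    rows = []
    for sentence in sentences:
        token_ids = [vocab.get(token, unknown_id) for token in sentence.lower().split()]
        if not token_ids:  # empty sentence: fall back to the unknown-token vector
            token_ids = [unknown_id]
        rows.append(table[token_ids].mean(axis=0))
    return np.stack(rows)


out = embed(["SAP BI", "report", "a much longer sentence about the server"])
print(out.shape)  # (3, 50)
```

Averaging collapses the token axis, which is why a two-word input and a seven-word input both produce a vector of length 50.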

Let's now build the full model:

In [None]:
model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1))

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer_1 (KerasLayer)  (None, 50)                48190600  
                                                                 
 dense_2 (Dense)             (None, 16)                816       
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 48191433 (183.84 MB)
Trainable params: 48191433 (183.84 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


The layers are stacked sequentially to build the classifier:

1. The first layer is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map a sentence into its embedding vector. The pre-trained text embedding model that you are using ([google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2)) splits the sentence into tokens, embeds each token, and then combines the embeddings. The resulting dimensions are: `(num_examples, embedding_dimension)`. For this NNLM model, the `embedding_dimension` is 50.
2. This fixed-length output vector is piped through a fully-connected (`Dense`) layer with 16 hidden units.
3. The last layer is densely connected with a single output node.
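The parameter counts in the model summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. The sketch below reproduces the numbers from the summary; `dense_params` is an illustrative helper, and the hub layer's count is simply copied from the summary output.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


hub_params = 48_190_600            # copied from the model summary above
hidden_params = dense_params(50, 16)   # 50-dimensional embedding into 16 units
output_params = dense_params(16, 1)    # 16 hidden units into 1 output node

total_params = hub_params + hidden_params + output_params
print(hidden_params, output_params, total_params)  # 816 17 48191433
```

Almost all of the parameters sit in the embedding layer, which is trainable here because the layer was created with `trainable=True`.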

Let's compile the model.

### Loss function and optimizer

A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs logits (a single-unit layer with a linear activation), you'll use the `binary_crossentropy` loss function.

This isn't the only choice for a loss function; you could, for instance, choose `mean_squared_error`. Generally, though, `binary_crossentropy` is better for dealing with probabilities: it measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and the predictions.

Later, when you are exploring regression problems (say, to predict the price of a house), you'll see how to use another loss function called mean squared error.
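As a rough illustration of what binary cross-entropy on logits computes, the sketch below implements it in NumPy using the numerically stable form `max(z, 0) - z * y + log(1 + exp(-|z|))`, which is the formulation given in the `tf.nn.sigmoid_cross_entropy_with_logits` documentation. The function name `bce_from_logits` and the sample values are assumptions for illustration, not the Keras implementation.

```python
import numpy as np


def bce_from_logits(labels, logits):
    """Mean binary cross-entropy computed directly from logits."""
    labels = np.asarray(labels, dtype=float)
    logits = np.asarray(logits, dtype=float)
    # Stable form: avoids computing sigmoid(z) and then log() of a value near 0.
    per_example = (
        np.maximum(logits, 0.0)
        - logits * labels
        + np.log1p(np.exp(-np.abs(logits)))
    )
    return per_example.mean()


# A logit of 0 means a predicted probability of 0.5, so the loss is ln 2.
print(bce_from_logits([1.0], [0.0]))  # about 0.693

# Confident and correct predictions give a small loss; confident and wrong, a large one.
print(bce_from_logits([1.0, 0.0], [4.0, -4.0]))
print(bce_from_logits([1.0, 0.0], [-4.0, 4.0]))
```

Working on logits rather than probabilities is the reason the final `Dense(1)` layer has no activation and the loss is created with `from_logits=True`.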

Now, configure the model to use an optimizer and a loss function:

In [None]:
model.compile(optimizer='adam',
              loss=keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

## Train the model

Train the model for 10 epochs in mini-batches of 512 samples. This is 10 iterations over all samples in `train_data`. While training, monitor the model's loss and accuracy on the samples in `validation_data`:

In [None]:
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/10


ValueError: in user code:

    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/engine/training.py", line 1398, in train_function  *
        return step_function(self, iterator)
    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/engine/training.py", line 1381, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/engine/training.py", line 1370, in run_step  **
        outputs = model.train_step(data)
    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/engine/training.py", line 1147, in train_step
        y_pred = self(x, training=True)
    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/utils/traceback_utils.py", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/usr/local/lib/python3.10/dist-packages/tf_keras/src/engine/input_spec.py", line 197, in assert_input_compatibility
        raise ValueError(

    ValueError: Missing data for input "keras_layer_1_input". You passed a data dictionary with keys ["0_b'BI '", "0_b'Report'", "0_b'transaction code'", "1_b'IT'", "1_b'SAP'"]. Expected the following keys: ['keras_layer_1_input']


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Sample dataset of movie reviews
reviews = [
    "This movie was amazing! I loved every minute of it.",
    "The acting was terrible and the story was boring.",
    "I really enjoyed the special effects in this film.",
    "The plot was confusing and the characters were unlikable.",
    "I would definitely recommend this movie to anyone.",
    "I regret watching this movie. It was a waste of time."
]
labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

# Preprocess the text data
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(reviews)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Train the k-NN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Evaluate the k-NN model
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Classify a new review
new_review = "The special effects were great, but the story was confusing."
new_review_vector = vectorizer.transform([new_review])
prediction = knn.predict(new_review_vector)
print("Prediction:", prediction)


Accuracy: 0.5
Prediction: ['positive']
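The cell above fits the `TfidfVectorizer` on all reviews before splitting, so vocabulary and document-frequency statistics from the test reviews influence the features. A minimal sketch of one alternative is shown below, assuming scikit-learn's `make_pipeline`: the raw texts are split first, and the vectorizer is fitted only on the training portion. The sample reviews are the ones from the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

reviews = [
    "This movie was amazing! I loved every minute of it.",
    "The acting was terrible and the story was boring.",
    "I really enjoyed the special effects in this film.",
    "The plot was confusing and the characters were unlikable.",
    "I would definitely recommend this movie to anyone.",
    "I regret watching this movie. It was a waste of time.",
]
labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

# Split the raw texts first, so the test reviews stay unseen by the vectorizer.
train_texts, test_texts, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.2, random_state=42
)

# The pipeline fits TF-IDF on the training texts only, then trains k-NN on them.
classifier = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    KNeighborsClassifier(n_neighbors=3),
)
classifier.fit(train_texts, y_train)

predictions = classifier.predict(test_texts)
print(list(zip(test_texts, predictions)))
```

With only two test reviews, accuracy can take just the values 0, 0.5, or 1, so any single number from a dataset this small should be read with caution.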


## Evaluate the model

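Let's see how the model performs. For the Keras model compiled above, evaluation returns two values: loss (a number that represents the error; lower values are better) and accuracy. The cells below instead train scikit-learn classifiers and report accuracy only.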

In [None]:
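import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from gensim.models import Word2Vec

def preprocess(text):
    """Assumed minimal tokenizer: lowercase the text and split on whitespace."""
    return text.lower().split()

# Preprocess the text data into lists of tokens
text_data = [preprocess(text) for text in raw_text_data]

# Load the word2vec model (named w2v_model so it does not shadow the Keras `model`)
w2v_model = Word2Vec.load("path/to/pretrained/model")

def average_vector(tokens):
    """Average the vectors of the in-vocabulary tokens of one text."""
    vectors = [w2v_model.wv[word] for word in tokens if word in w2v_model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v_model.wv.vector_size)

# Convert each text into one fixed-size vector
X = [average_vector(tokens) for tokens in text_data]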

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Train the k-NN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate the k-NN model
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


NameError: name 'raw_text_data' is not defined

In [None]:
import pandas as pd

# Read the CSV file into a pandas DataFrame
df = pd.read_csv("/content/Train.csv")

# Extract the text data and labels from the DataFrame
raw_text_data = df["Data"].tolist()
labels = df["Label"].tolist()

df


Unnamed: 0,Data,Label
0,BI,D&A
1,Report,D&A
2,transaction code,SAP
3,AD,IT
4,account,IT
5,server,IT
6,module,SAP
7,SAP BI,D&A


In [None]:
import tensorflow_hub as hub
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the TensorFlow Hub word2vec model
url = "https://tfhub.dev/google/Wiki-words-300/2"
embed = hub.load(url)


# Read the CSV file into a pandas DataFrame
df = pd.read_csv("/content/Train.csv")

# Extract the text data and labels from the DataFrame
text_data = df["Data"].tolist()
labels = df["Label"].tolist()


# Convert the text data into word2vec vectors
vectors = embed(text_data)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(vectors, labels, test_size=0.2, random_state=42)

# Train the k-NN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate the k-NN model
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


OSError: https://tfhub.dev/google/Wiki-words-300/2 does not appear to be a valid module.

In [None]:
import tensorflow_hub as hub
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load a TensorFlow Hub text embedding model (NNLM rather than word2vec).
# The original URL might be outdated or incorrect.
# Trying a different, similar model.
url = "https://tfhub.dev/google/nnlm-en-dim128/2"
embed = hub.load(url)

# Read the CSV file into a pandas DataFrame
df = pd.read_csv("/content/Train.csv")

# Extract the text data and labels from the DataFrame
text_data = df["Data"].tolist()
labels = df["Label"].tolist()

# Convert the text data into embedding vectors.
# embed() returns a tf.Tensor; convert it to a NumPy array so that
# train_test_split can index it with an array of row indices.
vectors = embed(text_data).numpy()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(vectors, labels, test_size=0.2, random_state=42)

# Train the k-NN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate the k-NN model
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


def augment_text(text, num_aug=1):
    """Augments text using synonym replacement."""
    augmented_texts = [text]
    for _ in range(num_aug):
        tokens = nltk.word_tokenize(text)
        pos_tags = nltk.pos_tag(tokens)

        new_tokens = []
        for token, pos_tag in pos_tags:
            synonyms = []
            for syn in wordnet.synsets(token):
                for lemma in syn.lemmas():
                    synonyms.append(lemma.name())

            if synonyms and pos_tag.startswith(('N', 'V', 'J', 'R')):
                new_token = np.random.choice(synonyms, 1)[0]
            else:
                new_token = token

            new_tokens.append(new_token)

        augmented_texts.append(" ".join(new_tokens))

    return augmented_texts


# Read the CSV file into a pandas DataFrame
df = pd.read_csv("/content/Train.csv")

# Extract the text data and labels from the DataFrame
text_data = df["Data"].tolist()
labels = df["Label"].tolist()

# Data Augmentation
augmented_text_data = []
augmented_labels = []
for text, label in zip(text_data, labels):
    augmented_texts = augment_text(text, num_aug=2)
    augmented_text_data.extend(augmented_texts)
    augmented_labels.extend([label] * len(augmented_texts))

# 1. TF-IDF Feature Extraction
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(augmented_text_data)
X = X.toarray()

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, augmented_labels, test_size=0.2, random_state=42
)

# 3. Train the k-NN model
knn_classifier = KNeighborsClassifier(n_neighbors=5)  # Create k-NN classifier
knn_classifier.fit(X_train, y_train)  # Train the model

# 4. Evaluate the k-NN model
y_pred = knn_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
for i in range(len(X_test)):
    print("Test data point:", X_test[i])
    print("True label:", y_test[i])
    print("Predicted label:", y_pred[i])
    print()

Accuracy: 0.4
Test data point: [0.         0.         0.         0.         0.         0.
 0.         0.77957299 0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.62631139]
True label: SAP
Predicted label: IT

Test data point: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
True label: IT
Predicted label: IT

Test data point: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
True label: D&A
Predicted label: D&A

Test data point: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
True label: SAP
Predicted label: D&A

Test data point: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
True label: IT
Predicted label: D&A



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the CSV file into a pandas DataFrame
df = pd.read_csv("/content/Train.csv")

# Extract the text data and labels from the DataFrame
text_data = df["Data"].tolist()
labels = df["Label"].tolist()

# 1. TF-IDF Feature Extraction
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_data)
X = X.toarray()  # Convert to NumPy array if needed by your SVM implementation

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# 3. Train the SVM model
svm_classifier = SVC(kernel='linear', C=1.0)  # Create an SVM classifier object
svm_classifier.fit(X_train, y_train)  # Train the model

# 4. Evaluate the SVM model
y_pred = svm_classifier.predict(X_test)  # Make predictions on the test set
accuracy = accuracy_score(y_test, y_pred)  # Calculate accuracy
print("Accuracy:", accuracy)

for i in range(len(X_test)):
    print("Test data point:", X_test[i])
    print("True label:", y_test[i])
    print("Predicted label:", y_pred[i])
    print()

Accuracy: 0.0
Test data point: [0. 0. 0. 0. 0. 1. 0. 0. 0.]
True label: D&A
Predicted label: SAP

Test data point: [0. 0. 0. 0. 0. 0. 0. 1. 0.]
True label: IT
Predicted label: SAP
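One plausible reason for the 0.0 accuracy above is that the two test texts share no words with any other row, so their TF-IDF vectors are orthogonal to every training vector. The sketch below checks this, assuming the eight texts copied from the `Train.csv` output shown earlier. Because `TfidfVectorizer` L2-normalizes each row by default, the product `X @ X.T` gives cosine similarities.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Texts copied from the Train.csv output shown in this notebook.
texts = ["BI", "Report", "transaction code", "AD",
         "account", "server", "module", "SAP BI"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
print(X.shape)  # (8, 9): eight texts, nine distinct words
print(sorted(vectorizer.vocabulary_))

# Rows are L2-normalized, so this matrix holds pairwise cosine similarities.
similarity = (X @ X.T).toarray()

report = texts.index("Report")
others = [similarity[report][j] for j in range(len(texts)) if j != report]
print(max(others))  # 0.0: "Report" shares no word with any other row

# "BI" and "SAP BI" share a word, so their similarity is positive.
print(similarity[texts.index("BI")][texts.index("SAP BI")] > 0)  # True
```

When a test vector has zero similarity to every training example, a classifier has no text-based evidence for any class, which is a limitation of the data size rather than of the SVM itself.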



In [None]:
!pip install nltk

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def augment_text(text, num_aug=1):
    """Augments text using synonym replacement."""
    augmented_texts = [text]  # Start with original text
    for _ in range(num_aug):
        tokens = nltk.word_tokenize(text)
        pos_tags = nltk.pos_tag(tokens)

        new_tokens = []
        for token, pos_tag in pos_tags:
            synonyms = []
            for syn in wordnet.synsets(token):
                for lemma in syn.lemmas():
                    synonyms.append(lemma.name())

            if synonyms and pos_tag.startswith(('N', 'V', 'J', 'R')):  # Augment nouns, verbs, adjectives, adverbs
                new_token = np.random.choice(synonyms, 1)[0]
            else:
                new_token = token

            new_tokens.append(new_token)

        augmented_texts.append(" ".join(new_tokens))

    return augmented_texts

# Read the CSV file into a pandas DataFrame
df = pd.read_csv("/content/Train.csv")

# Extract the text data and labels from the DataFrame
text_data = df["Data"].tolist()
labels = df["Label"].tolist()

# Data Augmentation
augmented_text_data = []
augmented_labels = []
for text, label in zip(text_data, labels):
    augmented_texts = augment_text(text, num_aug=2)  # Generate 2 augmented texts per original text
    augmented_text_data.extend(augmented_texts)
    augmented_labels.extend([label] * len(augmented_texts))

# 1. TF-IDF Feature Extraction
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(augmented_text_data)
X = X.toarray()

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, augmented_labels, test_size=0.2, random_state=42
)

# 3. Train the KNN model with Cross-Validation
knn_classifier = KNeighborsClassifier(n_neighbors=5)

cv_scores = cross_val_score(knn_classifier, X, augmented_labels, cv=5)

# 4. Print Cross-Validation Scores
print("Cross-Validation Scores:", cv_scores)
print("Average Cross-Validation Score:", np.mean(cv_scores))

# Fit the model after cross-validation
knn_classifier.fit(X_train, y_train)

# 5. Evaluate the KNN model
y_pred = knn_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

for i in range(len(X_test)):
    print("Test data point:", X_test[i])
    print("True label:", y_test[i])
    print("Predicted label:", y_pred[i])
    print()

Cross-Validation Scores: [0.4 0.4 0.4 0.6 0.5]
Average Cross-Validation Score: 0.4600000000000001
Accuracy: 0.4
Test data point: [0.         0.         0.         0.         0.         0.
 0.         0.70710678 0.         0.70710678 0.         0.
 0.         0.         0.         0.        ]
True label: SAP
Predicted label: D&A

Test data point: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
True label: IT
Predicted label: IT

Test data point: [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
True label: D&A
Predicted label: D&A

Test data point: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
True label: SAP
Predicted label: IT

Test data point: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
True label: IT
Predicted label: D&A





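On the IMDB dataset, this fairly naive approach achieves an accuracy of about 87%, and more advanced approaches should get closer to 95%. On the small `Train.csv` used in this notebook, the results are much lower: the k-NN classifier above reaches a test accuracy of 0.4 and an average cross-validation score of 0.46.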
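The accuracy figures in this notebook come from splits made after augmentation, so augmented copies of one original text can land in both the training and the test set. A minimal sketch of the alternative order is shown below: split the original rows first, then augment only the training portion. The `toy_augment` function is a hypothetical stand-in for `augment_text` so that the example runs without NLTK data, and the texts and labels are copied from the `Train.csv` output.

```python
from sklearn.model_selection import train_test_split

# Texts and labels copied from the Train.csv output shown in this notebook.
texts = ["BI", "Report", "transaction code", "AD",
         "account", "server", "module", "SAP BI"]
labels = ["D&A", "D&A", "SAP", "IT", "IT", "IT", "SAP", "D&A"]


def toy_augment(text):
    """Hypothetical stand-in for augment_text: the original plus two trivial variants."""
    return [text, text.lower(), text.upper()]


# 1. Split the ORIGINAL rows, so every test text is unseen during training.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# 2. Augment the training rows only; the test rows stay untouched.
aug_train_texts, aug_train_labels = [], []
for text, label in zip(train_texts, train_labels):
    variants = toy_augment(text)
    aug_train_texts.extend(variants)
    aug_train_labels.extend([label] * len(variants))

print(len(aug_train_texts), len(test_texts))  # 18 2
```

A score measured this way may be lower than one measured on a split of the augmented data, but it reflects texts the model has not seen in any form.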

## Further reading

* For a more general way to work with string inputs and for a more detailed analysis of the progress of accuracy and loss during training, see the [Text classification with preprocessed text](./text_classification.ipynb) tutorial.
* Try out more [text-related tutorials](https://www.tensorflow.org/hub/tutorials#text-related-tutorials) using trained models from TFHub.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score # Import cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# ... (augment_text function remains the same) ...

# Read the CSV file into a pandas DataFrame
df = pd.read_csv("/content/Train.csv")

# ... (data augmentation loop remains the same) ...

# 1. TF-IDF Feature Extraction
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(augmented_text_data)
X = X.toarray()

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, augmented_labels, test_size=0.2, random_state=42
)

# 3. Train the SVM model with Regularization
svm_classifier = SVC(kernel='rbf', C=1.2, gamma='scale')  # RBF kernel; a smaller C regularizes more strongly
# Try other kernels such as 'linear' or 'poly' if needed
svm_classifier.fit(X_train, y_train)

# 4. Evaluate the SVM model
y_pred = svm_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

cv_scores = cross_val_score(svm_classifier, X, augmented_labels, cv=5)  # 5-fold cross-validation

# 5. Print Cross-Validation Scores
print("Cross-Validation Scores:", cv_scores)
print("Average Cross-Validation Score:", np.mean(cv_scores))

for i in range(len(X_test)):
    print("Test data point:", X_test[i])
    print("True label:", y_test[i])
    print("Predicted label:", y_pred[i])
    print()

Accuracy: 1.0
Cross-Validation Scores: [0.4 0.2 0.4 1.  1. ]
Average Cross-Validation Score: 0.6
Test data point: [0.         0.         0.         0.         0.         0.
 0.         0.70710678 0.         0.70710678 0.         0.
 0.         0.         0.         0.        ]
True label: SAP
Predicted label: SAP

Test data point: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
True label: IT
Predicted label: IT

Test data point: [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
True label: D&A
Predicted label: D&A

Test data point: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
True label: SAP
Predicted label: SAP

Test data point: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
True label: IT
Predicted label: IT
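The test accuracy of 1.0 and the cross-validation average of 0.6 above disagree considerably, which is easier to interpret once the spread of the fold scores is visible. The sketch below summarizes the scores copied from the output above, using only the standard library; the variable names are illustrative.

```python
import statistics

# Fold scores copied from the SVM output above.
cv_scores = [0.4, 0.2, 0.4, 1.0, 1.0]

mean_score = statistics.mean(cv_scores)
spread = statistics.pstdev(cv_scores)  # population standard deviation across folds
print(round(mean_score, 2), round(spread, 2))  # 0.6 0.33

# The held-out test split above has 5 points, so one prediction moves accuracy by 0.2.
n_test_points = 5
accuracy_step = 1 / n_test_points
print(accuracy_step)  # 0.2
```

A standard deviation of roughly 0.33 around a mean of 0.6 suggests that the single 1.0 test accuracy is probably not representative of how the classifier would do on new data.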



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
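
The hold-out accuracy above (1.0, measured on only five test points) sits well apart from the mean cross-validation score (0.6), so the hold-out figure is a noisy estimate at best. One possible contributor, offered as an assumption rather than a diagnosis, is that the augmentation loop emits several variants of each original row, and a row-level random split can place variants of the same original on both sides of the split. A minimal sketch of a group-aware split in plain Python follows; `group_train_test_split` and the toy data are hypothetical and not part of the notebook.

```python
import random

def group_train_test_split(samples, labels, groups, test_size=0.2, seed=42):
    """Split so that all samples sharing a group id land on the same side.

    Hypothetical helper for illustration only. groups[i] is the index of
    the original row that sample i was derived from.
    """
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # Hold out whole groups, never individual augmented variants
    n_test = max(1, round(len(unique_groups) * test_size))
    test_groups = set(unique_groups[:n_test])

    train, test = [], []
    for sample, label, group in zip(samples, labels, groups):
        (test if group in test_groups else train).append((sample, label, group))
    return train, test

# Toy data (made up): 5 original rows, each expanded to 3 variants
originals = ["sap basis admin", "network support", "data pipelines",
             "sap abap dev", "help desk"]
orig_labels = ["SAP", "IT", "D&A", "SAP", "IT"]

samples, labels, groups = [], [], []
for row_id, (text, label) in enumerate(zip(originals, orig_labels)):
    for variant in range(3):
        samples.append(f"{text} (variant {variant})")
        labels.append(label)
        groups.append(row_id)

train, test = group_train_test_split(samples, labels, groups)
train_groups = {g for _, _, g in train}
test_groups = {g for _, _, g in test}
print("Groups shared between train and test:", train_groups & test_groups)
```

scikit-learn provides `GroupShuffleSplit` and `GroupKFold` for the same purpose. Fitting the TF-IDF vectorizer on the training portion only, for instance inside a `Pipeline`, addresses a related source of optimistic scores.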


In [None]:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords, wordnet  # wordnet is used by augment_text

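def augment_text(text, num_aug=1):
    """Return [text] plus num_aug variants with random WordNet synonym swaps."""
    augmented_texts = [text]
    for _ in range(num_aug):
        tokens = nltk.word_tokenize(text)
        pos_tags = nltk.pos_tag(tokens)
        new_tokens = []
        for token, pos_tag in pos_tags:
            # Collect every lemma name from every synset of the token
            # (all parts of speech; multi-word lemmas contain underscores)
            synonyms = []
            for syn in wordnet.synsets(token):
                for lemma in syn.lemmas():
                    synonyms.append(lemma.name())
            # Only swap tokens tagged N*, V*, J*, R* (roughly nouns, verbs,
            # adjectives, adverbs); everything else is kept unchanged
            if synonyms and pos_tag.startswith(('N', 'V', 'J', 'R')):
                new_token = np.random.choice(synonyms, 1)[0]
            else:
                new_token = token
            new_tokens.append(new_token)
        augmented_texts.append(" ".join(new_tokens))
    return augmented_texts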

# Load data
df = pd.read_csv("/content/Train.csv")
text_data = df["Data"].tolist()
labels = df["Label"].tolist()

# Data augmentation
augmented_text_data = []
augmented_labels = []
for text, label in zip(text_data, labels):
    augmented_texts = augment_text(text, num_aug=2)
    augmented_text_data.extend(augmented_texts)
    augmented_labels.extend([label] * len(augmented_texts))

# TF-IDF with N-grams and Stop Word Removal
stop_words = stopwords.words('english') # Get stop words as a list
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words=stop_words)
X = vectorizer.fit_transform(augmented_text_data)
X = X.toarray()

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, augmented_labels, test_size=0.2, random_state=42)

# Hyperparameter Tuning with Grid Search
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.1, 1, 10]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, refit=True, verbose=2)
grid.fit(X_train, y_train)

# Print best parameters and estimator
print("Best Parameters:", grid.best_params_)
print("Best Estimator:", grid.best_estimator_)

# Evaluate the model
y_pred = grid.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print predictions
for i in range(len(X_test)):
    print("Test data point:", X_test[i])
    print("True label:", y_test[i])
    print("Predicted label:", y_pred[i])
    print()

# 5-fold cross-validation of the best estimator on the full augmented dataset
cv_scores = cross_val_score(grid.best_estimator_, X, augmented_labels, cv=5)

# Print Cross-Validation Scores
print("Cross-Validation Scores:", cv_scores)
print("Average Cross-Validation Score:", np.mean(cv_scores))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END ...................................C=0.1, gamma=0.1; total time=   0.0s
[CV] END ...................................C=0.1, gamma=0.1; total time=   0.0s
[CV] END ...................................C=0.1, gamma=0.1; total time=   0.0s
[CV] END ...................................C=0.1, gamma=0.1; total time=   0.0s
[CV] END ...................................C=0.1, gamma=0.1; total time=   0.0s
[CV] END .....................................C=0.1, gamma=1; total time=   0.0s
[CV] END .....................................C=0.1, gamma=1; total time=   0.0s
[CV] END .....................................C=0.1, gamma=1; total time=   0.0s
[CV] END .....................................C=0.1, gamma=1; total time=   0.0s
[CV] END .....................................C=0.1, gamma=1; total time=   0.0s
[CV] END ....................................C=0.1, gamma=10; total time=   0.0s
[CV] END ....................................C=0