To predict the **label** based on the **text_** column, you can follow a standard **text classification pipeline** using machine learning. Here’s the process:

---

### **1. Data Preprocessing**
- Remove unnecessary punctuation and special characters.
- Convert text to lowercase.
- Tokenization (split text into words).
- Remove stopwords (common words like "the," "and," etc.).
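- Lemmatization (convert words to their base forms, e.g., "years" → "year"; with NLTK's WordNetLemmatizer, verb forms such as "loved" → "love" require passing a verb part-of-speech tag).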

---

### **2. Feature Engineering**
- Convert text into numerical format using **TF-IDF** (Term Frequency-Inverse Document Frequency) or **Word Embeddings** (like Word2Vec, BERT, etc.).

---

### **3. Model Selection**
- Use a **Machine Learning model** like:
  - **Logistic Regression** (Good baseline)
  - **Naive Bayes** (Performs well on text classification)
  - **Random Forest**
  - **SVM (Support Vector Machine)**
  - **Deep Learning (LSTM, BERT, etc.)** for more complex cases.
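
One way to choose among the classical candidates above is to cross-validate each on the same features. The sketch below is a minimal illustration on a hypothetical toy corpus (the texts and the `positive`/`negative` labels are made up, not taken from the dataset). It uses `LinearSVC` as the SVM variant and macro F1 as the score; both are assumptions made for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 8 examples per class.
texts = [
    "love this pillow very comfortable", "great quality and sturdy build",
    "works perfectly highly recommend", "excellent product arrived quickly",
    "very happy with this purchase", "nice set good value",
    "awesome upgrade from the original", "best purchase this year",
    "broke after one week", "poor quality waste of money",
    "does not work as advertised", "arrived damaged and late",
    "very disappointed with this purchase", "cheap material fell apart",
    "stopped working after two days", "worst purchase this year",
]
labels = ["positive"] * 8 + ["negative"] * 8

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "linear_svm": LinearSVC(),
}

scores = {}
for name, clf in candidates.items():
    # TF-IDF sits inside the pipeline, so it is refit on each training fold.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores[name] = cross_val_score(pipe, texts, labels, cv=4, scoring="f1_macro").mean()

for name, score in scores.items():
    print(f"{name}: mean macro F1 = {score:.3f}")
```

Scores on a corpus this small are not meaningful in themselves. The point is only that every candidate is evaluated through the same pipeline and folds.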

---

### **4. Training the Model**
- Split data into training (80%) and testing (20%).
- Train the model on the training data.
- Evaluate using metrics like **accuracy, precision, recall, and F1-score**.
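
The three steps above can be sketched as follows. This is a minimal example on a hypothetical toy corpus (made-up texts and labels); with the real data, `texts` and `labels` would presumably be `df["text_"]` and `df["label"]`. Logistic regression on TF-IDF features is used only because it is the baseline named earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 10 examples per class.
positive = [
    "love this pillow very comfortable", "great quality and sturdy build",
    "works perfectly highly recommend", "excellent product arrived quickly",
    "very happy with this purchase", "nice set good value",
    "awesome upgrade from the original", "beautiful color and soft fabric",
    "easy to assemble and sturdy", "best purchase this year",
]
negative = [
    "broke after one week", "poor quality waste of money",
    "does not work as advertised", "arrived damaged and late",
    "very disappointed with this purchase", "cheap material fell apart",
    "stopped working after two days", "too small and uncomfortable",
    "missing parts and no instructions", "worst purchase this year",
]
texts = positive + negative
labels = ["positive"] * 10 + ["negative"] * 10

# 80/20 split; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit the vectorizer on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

y_pred = model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")
print(report)  # per-class precision, recall, and F1-score
```

Fitting the vectorizer only on the training split keeps test-set vocabulary and document frequencies out of the features, so the reported metrics are not optimistic.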

---

### **5. Predicting Labels**
- Once the model is trained, pass new reviews through the same preprocessing steps.
- Use the trained model to predict the label.
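
One way to guarantee that new reviews go through the same preprocessing is to bundle the cleaning function, the vectorizer, and the classifier into a single scikit-learn pipeline. The sketch below assumes a simplified, hypothetical `clean_text` helper and made-up training data; it is not the notebook's own preprocessing function.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def clean_text(text):
    """Hypothetical cleaner: keep letters and whitespace only, then lowercase."""
    return re.sub(r"[^a-zA-Z\s]", "", text).lower().strip()


# Hypothetical toy training data.
train_texts = [
    "Love this pillow, very comfortable!",
    "Great quality and a sturdy build.",
    "Works perfectly, highly recommend.",
    "Broke after one week.",
    "Poor quality, a waste of money.",
    "Does not work as advertised.",
]
train_labels = ["positive"] * 3 + ["negative"] * 3

# The vectorizer applies clean_text to every document, at fit time and at predict time.
pipeline = make_pipeline(TfidfVectorizer(preprocessor=clean_text), MultinomialNB())
pipeline.fit(train_texts, train_labels)

new_reviews = ["Love it, great quality!", "Broke after a week. Waste of money."]
predictions = pipeline.predict(new_reviews)

for review, label in zip(new_reviews, predictions):
    print(f"{label}: {review}")
```

Because the cleaning step lives inside the pipeline, calling `pipeline.predict` on raw strings cannot skip it by accident.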

---

### **6. Deployment**
- Deploy using Flask, FastAPI, or Django if needed for real-world applications.

In [1]:
import numpy as np
import pandas as pd
import DataPipeline as dp

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mradu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mradu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mradu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
df = pd.read_csv("data.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40432 entries, 0 to 40431
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   category  40432 non-null  object 
 1   rating    40432 non-null  float64
 2   label     40432 non-null  object 
 3   text_     40432 non-null  object 
dtypes: float64(1), object(3)
memory usage: 1.2+ MB


In [4]:
df.isnull().sum()

category    0
rating      0
label       0
text_       0
dtype: int64

In [5]:
df.describe()

| | rating |
|---|---|
| count | 40432.0 |
| mean | 4.256579 |
| std | 1.144354 |
| min | 1.0 |
| 25% | 4.0 |
| 50% | 5.0 |
| 75% | 5.0 |
| max | 5.0 |


In [8]:
import re
from gensim.models import Word2Vec
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [9]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mradu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mradu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mradu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [10]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [11]:
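def preprocess_text(text, return_tokens=True):
    """
    Cleans and optionally tokenizes text data.

    Args:
        text (str): Input text.
        return_tokens (bool): If True (default), return lemmatized tokens;
            if False, return the cleaned string without tokenizing.

    Returns:
        list[str] | str: Lemmatized tokens with stopwords removed, or the
            cleaned string when return_tokens is False.
    """
    if not isinstance(text, str):
        # NaN and other non-string values yield an empty result of the requested type
        return [] if return_tokens else ""

    # Delete every character that is not a letter or whitespace.
    # Note: characters are deleted, not replaced by a space, so words separated
    # only by punctuation are fused (e.g. "bored.Great" -> "boredgreat").
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower().strip()  # Lowercase and trim leading/trailing whitespace

    if not return_tokens:
        return text  # Return cleaned text if not tokenizing

    tokens = word_tokenize(text)  # Tokenize

    # Drop stopwords (set membership is a fast lookup), then lemmatize the rest.
    # WordNetLemmatizer defaults to noun POS, so verb forms such as "loved" are kept.
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens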

In [12]:
# Tokenize text data
tokenized_texts = df['text_'].apply(preprocess_text).tolist()

In [13]:
tokenized_texts[1]

['love', 'great', 'upgrade', 'original', 'ive', 'mine', 'couple', 'year']

In [14]:
word2vec_model = Word2Vec(vector_size=100, window=5, min_count=1, workers=4)

In [15]:
word2vec_model.build_vocab(tokenized_texts)
word2vec_model.train(tokenized_texts, total_examples=len(tokenized_texts), epochs=10)

(11607667, 12778950)

In [16]:
# Get word vectors
word_vectors = word2vec_model.wv

# Example: Get vector for the word "love"
print("Vector for 'love':")
print(word_vectors["love"])

Vector for 'love':
[ 1.5967331  -0.56937474  1.0252733  -1.2883037   2.8542306   0.44401702
  1.8470085  -1.2330059  -0.64124006  0.29404876 -1.0456376  -0.16074888
  2.569506    0.28775528  0.46492323  2.3119192   0.8722445   0.8755863
 -2.4262805  -0.27493778  0.18199678  0.18548332  1.5814512  -0.81522316
  0.5897589  -0.3603973   1.3619702   0.60154814 -2.0165274   0.37490317
  0.8194316  -1.2665157  -1.6206524   0.41881898 -1.3718954   0.7925512
 -0.12078405  2.2625754   2.7491891  -0.17060392 -0.83384675 -0.13375951
 -1.4296724   2.574428    0.7458072   1.1308368  -0.56630296 -0.73024046
 -0.5677177  -0.0308758   0.90575296 -0.54529804 -0.30168638 -1.2305642
 -2.1559942   0.30665514  0.98258424 -0.83023864 -0.3905523  -0.28582767
 -1.1266986  -0.00940782  1.8499604   2.2290757  -0.02287936  3.4112456
  1.2255063   1.3963727   1.1385323  -0.8806984  -0.91929567 -0.57344663
  0.86584085 -0.2299536  -1.88281     3.0921242  -0.10281567 -0.01431896
 -2.812101   -1.2451503  -0.50343573

In [17]:
# Example: Find similar words to "love"
print("\nWords similar to 'love':")
print(word_vectors.most_similar("love"))


Words similar to 'love':
[('loved', 0.6791303157806396), ('cute', 0.5656462907791138), ('great', 0.5520269274711609), ('boredgreat', 0.5236473083496094), ('adorable', 0.5178564786911011), ('awesome', 0.5136257410049438), ('like', 0.5134742856025696), ('everyonethis', 0.49875909090042114), ('liked', 0.4940889775753021), ('happy', 0.48140522837638855)]
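
The neighbours above include fused tokens such as `boredgreat` and `everyonethis`. A plausible cause (an assumption, not verified against the raw reviews) is that the cleaning regex deletes punctuation instead of replacing it with a space, so words separated only by a punctuation mark are joined. The sketch below reproduces the effect on a made-up sentence and shows the alternative.

```python
import re

# Hypothetical sample with missing spaces after punctuation.
sample = "I was bored.Great gift for everyone!This works"

# Deleting non-letters joins the neighbouring words.
deleted = re.sub(r"[^a-zA-Z\s]", "", sample).lower().split()

# Replacing non-letters with a space keeps the words apart.
spaced = re.sub(r"[^a-zA-Z\s]", " ", sample).lower().split()

print(deleted)  # ['i', 'was', 'boredgreat', 'gift', 'for', 'everyonethis', 'works']
print(spaced)   # ['i', 'was', 'bored', 'great', 'gift', 'for', 'everyone', 'this', 'works']
```

Replacing with a space has its own trade-off: contractions such as "I've" split into "i" and "ve" instead of becoming "ive", so the choice depends on how contractions should be handled downstream.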


In [None]:
import numpy as np

def sentence_to_vector(tokens, word2vec_model, vector_size=100):
    """
    Converts a tokenized sentence into a numerical vector using Word2Vec.

    Args:
        tokens (list): List of preprocessed words (tokens).
        word2vec_model (Word2Vec): A trained Word2Vec model.
        vector_size (int): Size of the word vectors (default: 100).

    Returns:
        np.array: The sentence vector.
    """
    word_vectors = [word2vec_model.wv[word] for word in tokens if word in word2vec_model.wv]
    # print(word_vectors)  # Debugging: Check the word vectors
    
    if not word_vectors:  # If no valid words in the sentence
        return np.zeros(vector_size)  
    
    return np.mean(word_vectors, axis=0)  # Compute the mean vector


In [23]:
df['tokens'] = df['text_'].apply(preprocess_text)  # First preprocess text
df['text_vector'] = df['tokens'].apply(lambda x: sentence_to_vector(x, word2vec_model))  # Then vectorize

In [24]:
df.head()

| | category | rating | label | text_ | tokens | text_vector |
|---|---|---|---|---|---|---|
| 0 | Home_and_Kitchen_5 | 5.0 | CG | Love this! Well made, sturdy, and very comfor... | [love, well, made, sturdy, comfortable, love, ... | [1.5406663, -0.99293655, 0.98273134, -1.073451... |
| 1 | Home_and_Kitchen_5 | 5.0 | CG | love it, a great upgrade from the original. I... | [love, great, upgrade, original, ive, mine, co... | [0.56903756, -0.5683802, 0.123951286, -0.30268... |
| 2 | Home_and_Kitchen_5 | 5.0 | CG | This pillow saved my back. I love the look and... | [pillow, saved, back, love, look, feel, pillow] | [0.57553756, 0.44632658, 0.48697212, -0.731448... |
| 3 | Home_and_Kitchen_5 | 1.0 | CG | Missing information on how to use it, but it i... | [missing, information, use, great, product, pr... | [0.764389, -0.9699766, -0.633388, -0.5582797, ... |
| 4 | Home_and_Kitchen_5 | 5.0 | CG | Very nice set. Good quality. We have had the s... | [nice, set, good, quality, set, two, month] | [1.0560967, -0.7239972, 0.06422721, -0.9062034... |
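
A column that holds one NumPy array per row cannot be passed to scikit-learn directly; it first has to be stacked into a 2-D feature matrix. The sketch below shows that step on a hypothetical toy frame. Random 100-dimensional vectors stand in for `text_vector`, and the labels `a`/`b` are placeholders, so the fitted model itself is meaningless.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for the notebook's DataFrame: one 100-d vector per row.
toy = pd.DataFrame({
    "label": ["a", "b"] * 10,
    "text_vector": [rng.normal(size=100) for _ in range(20)],
})

# Stack the per-row arrays into a single (n_samples, n_features) matrix.
X = np.vstack(toy["text_vector"].tolist())
y = toy["label"].to_numpy()

clf = LogisticRegression(max_iter=1000).fit(X, y)

print(X.shape)             # (20, 100)
print(clf.predict(X[:2]))  # predictions for the first two rows
```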


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert tokenized words back to full sentences
processed_texts = [" ".join(words) for words in tokenized_texts]

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000)

# Fit and transform the processed text
tfidf_matrix = tfidf_vectorizer.fit_transform(processed_texts)

# Convert to array (if needed)
tfidf_array = tfidf_matrix.toarray()

print("TF-IDF Matrix Shape:", tfidf_matrix.shape)

# Display feature names and vectors
print("Feature Names:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Vectors:\n", tfidf_array)


TF-IDF Matrix Shape: (40432, 1000)
Feature Names: ['able' 'absolutely' 'accurate' 'acting' 'action' 'actor' 'actually'
 'adapter' 'add' 'added' 'addition' 'adjust' 'adjustable' 'admit' 'adult'
 'adventure' 'advertised' 'age' 'ago' 'air' 'almost' 'alone' 'along'
 'already' 'also' 'although' 'always' 'amazing' 'amazon' 'american'
 'amount' 'animal' 'annoying' 'another' 'anyone' 'anything' 'apart' 'arc'
 'area' 'arent' 'arm' 'around' 'arrived' 'art' 'assemble' 'attention'
 'author' 'available' 'away' 'awesome' 'baby' 'back' 'bad' 'bag' 'ball'
 'band' 'bar' 'base' 'based' 'basic' 'bathroom' 'battery' 'beautiful'
 'become' 'bed' 'beginning' 'behind' 'believable' 'believe' 'belt' 'best'
 'better' 'big' 'bigger' 'bike' 'birthday' 'bit' 'black' 'blade' 'blue'
 'board' 'body' 'book' 'booki' 'boot' 'boring' 'bosch' 'bottle' 'bottom'
 'bought' 'bowl' 'box' 'boy' 'bra' 'brand' 'break' 'bright' 'broke'
 'broken' 'brother' 'build' 'built' 'bulb' 'bulky' 'business' 'button'
 'buy' 'buying' 'cable' 'c

In [33]:
from gensim.models import FastText
import numpy as np

# Sample corpus
sentences = [["this", "is", "a", "test"], ["word", "embeddings", "are", "powerful"]]

# Train FastText model
fasttext_model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)

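# Named differently from the token-based sentence_to_vector defined earlier,
# so this cell does not overwrite it (this version takes a raw string).
def text_to_vector(sentence, model, vector_size=100):
    words = sentence.split()  # Simple whitespace tokenization
    # FastText can build vectors for unseen words from character n-grams,
    # so this membership check typically filters nothing out.
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    return np.mean(word_vectors, axis=0) if word_vectors else np.zeros(vector_size)

# Example sentences
original_text = "word embeddings are powerful"
ai_generated_text = "this is a generated text"

original_vector = text_to_vector(original_text, fasttext_model)
ai_vector = text_to_vector(ai_generated_text, fasttext_model)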

print("Original Vector:", original_vector.shape)
print("AI-Generated Vector:", ai_vector.shape)

Original Vector: (100,)
AI-Generated Vector: (100,)


In [36]:
from sentence_transformers import SentenceTransformer

# Load SBERT model
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

# Get sentence embeddings
original_vector = sbert_model.encode(original_text)
ai_vector = sbert_model.encode(ai_generated_text)

print("SBERT Original Vector:", original_vector.shape)  # Output: (384,)
print("SBERT AI-Generated Vector:", ai_vector.shape)  # Output: (384,)


SBERT Original Vector: (384,)
SBERT AI-Generated Vector: (384,)


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Reshape the vectors for sklearn's cosine_similarity function
original_vector = original_vector.reshape(1, -1)
ai_vector = ai_vector.reshape(1, -1)

# Compute cosine similarity
similarity_score = cosine_similarity(original_vector, ai_vector)[0][0]

print(f"Cosine Similarity: {similarity_score:.4f}")  # Closer to 1 means more similar


Cosine Similarity: 0.2150
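
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it directly with NumPy on small made-up vectors; the helper name `cosine_sim` is hypothetical, and returning 0.0 for a zero vector is a convention chosen for this example.

```python
import numpy as np


def cosine_sim(u, v):
    """Cosine similarity of two vectors: dot(u, v) / (||u|| * ||v||)."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0  # Convention for this sketch: zero vectors have no direction
    return float(np.dot(u, v) / denom)


a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 2.0])

print(round(cosine_sim(a, b), 4))  # 0.7071 (45 degrees apart)
print(round(cosine_sim(a, c), 4))  # 0.0 (orthogonal)
print(round(cosine_sim(b, b), 4))  # 1.0 (same direction)
```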
