Women's Clothing Review System

In [None]:

pip install pandas numpy scikit-learn nltk matplotlib



In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
import nltk
import re

# Download the NLTK stopword list and the Punkt tokenizer models
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

# Define a function for text preprocessing
def preprocess_text(text):
    """Lowercase, strip punctuation, tokenize, and drop English stopwords."""
    text = text.lower()  # Convert text to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

# Load the dataset
url = "https://raw.githubusercontent.com/YBIFoundation/MachineLearning/main/Dataset/Women%20Clothing%20E-Commerce%20Review.csv"
df = pd.read_csv(url)

# Drop rows where the review text or the rating is missing
df = df.dropna(subset=['Review', 'Rating'])

# Preprocess the review text
df['Processed_Review'] = df['Review'].apply(preprocess_text)

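# Convert text data into numerical features using TF-IDF
# Note: the vectorizer is fit on the full dataset before the split below,
# so vocabulary and IDF statistics from the test reviews are visible at training time
vectorizer = TfidfVectorizer()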
X = vectorizer.fit_transform(df['Processed_Review'])

# Define the target variable
y = df['Rating']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the model
model = MultinomialNB()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Accuracy: 0.5586633298984248
Classification Report:
               precision    recall  f1-score   support

           1       0.00      0.00      0.00       227
           2       0.00      0.00      0.00       462
           3       0.25      0.00      0.00       874
           4       0.27      0.00      0.00      1438
           5       0.56      1.00      0.72      3792

    accuracy                           0.56      6793
   macro avg       0.22      0.20      0.14      6793
weighted avg       0.40      0.56      0.40      6793
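
The accuracy is easier to interpret next to a majority-class baseline. The sketch below recomputes that baseline from the support column of the report above; the variable names (`support`, `baseline_accuracy`) are illustrative and not part of the original notebook.

```python
# Support (number of test reviews) per rating, copied from the report above.
support = {1: 227, 2: 462, 3: 874, 4: 1438, 5: 3792}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a model that always predicts the most common rating.
baseline_accuracy = support[majority_class] / total

print(f"Test reviews: {total}")
print(f"Majority class: {majority_class}")
print(f"Always-predict-majority accuracy: {baseline_accuracy:.4f}")
```

The baseline comes out at roughly 0.558, within a fraction of a percentage point of the reported accuracy, which suggests the classifier has learned little beyond the class distribution.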





Explanation:
Libraries: Imports libraries for data manipulation (pandas, numpy), machine learning (scikit-learn), natural language processing (nltk), and regular expressions (re).

Text Preprocessing:

Converts text to lowercase.
Removes punctuation.
Tokenizes text and removes stopwords.
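
To see what these three steps do to a single review without downloading any NLTK data, here is a minimal sketch. It assumes a tiny hand-written stopword set (`DEMO_STOPWORDS`) and whitespace splitting in place of NLTK's stopword list and `word_tokenize`, so it only approximates the notebook's `preprocess_text`.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook uses NLTK's full English list instead.
DEMO_STOPWORDS = {"the", "is", "and", "it", "this", "a", "i"}

def preprocess_demo(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    tokens = text.split()  # simple stand-in for nltk's word_tokenize
    return " ".join(t for t in tokens if t not in DEMO_STOPWORDS)

print(preprocess_demo("I LOVE this dress, and it fits!"))  # love dress fits
```

One side effect worth knowing: because punctuation is removed before stopword filtering, a contraction such as "don't" becomes "dont", which may no longer match stopword entries that contain an apostrophe.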

Data Loading: Loads the dataset from the GitHub URL and drops rows where the Review or Rating column is missing.

Feature Extraction: Converts text reviews into numerical features using TF-IDF.
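
As a small illustration of what TF-IDF produces, the sketch below vectorizes three toy reviews and then transforms an unseen one. The example sentences and the names `train_docs`, `new_docs`, `X_train_demo`, and `X_new_demo` are invented for this sketch and do not come from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews, invented for illustration (not taken from the dataset).
train_docs = ["love this dress", "dress runs small", "love the soft fabric"]
new_docs = ["soft dress but itchy"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from these documents.
X_train_demo = vectorizer.fit_transform(train_docs)
# transform reuses them; words never seen during fitting are ignored.
X_new_demo = vectorizer.transform(new_docs)

print(X_train_demo.shape)  # (number of documents, vocabulary size)
print(X_new_demo.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row is a sparse vector with one column per vocabulary word. In the unseen review only "soft" and "dress" get non-zero weights, because "but" and "itchy" were not in the fitted vocabulary.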

Model Training and Evaluation:

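Splits the data into training (70%) and test (30%) sets.
Trains a Multinomial Naive Bayes model on the TF-IDF features.
Evaluates the model using accuracy and a per-class classification report. The report shows recall of 1.00 for rating 5 and 0.00 for every other rating, meaning the model assigns nearly all test reviews to the most common class; the macro-averaged F1 score is only 0.14.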
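
One possible variation on this setup, sketched here on toy data rather than the real dataset, is to stratify the split, fit TF-IDF inside a scikit-learn `Pipeline` so it only sees training text, and try `ComplementNB`, a Naive Bayes variant intended for imbalanced classes. This is not the notebook's method, the toy reviews and ratings are invented, and no improvement on the actual data is claimed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Toy, imbalanced data invented for illustration; swap in
# df['Processed_Review'] and df['Rating'] to try it on the real dataset.
texts = (["love this dress great fit"] * 12
         + ["nice fabric but runs small"] * 6
         + ["poor quality returned it"] * 6)
ratings = [5] * 12 + [3] * 6 + [1] * 6

# stratify keeps the rating proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, ratings, test_size=0.25, random_state=42, stratify=ratings
)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", ComplementNB()),
])
pipeline.fit(X_tr, y_tr)
y_hat = pipeline.predict(X_te)

# zero_division=0 reports 0.0 for classes with no predictions instead of warning.
print(classification_report(y_te, y_hat, zero_division=0))
```

Passing raw text to the pipeline means the same object can later be called as `pipeline.predict(["some new review"])` without a separate vectorization step.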