<a href="https://colab.research.google.com/github/Moditha06/Customer-Feedback-Analysis-System/blob/main/MLE2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Customer Feedback Analysis System**

**Summary:**

This project builds a machine learning pipeline for analyzing customer feedback data. The sections below outline an approach to building and deploying the system:

**1. Data Ingestion and Preprocessing**

**Tasks:**

Data Collection: Create a system to ingest data from multiple sources (CSV files, APIs). Utilize the Kaggle dataset: Amazon Product Reviews.

Cleaning: Remove duplicates, handle missing values, normalize text (e.g., lowercase, remove stopwords, punctuation).

Handling Imbalanced Data: Apply oversampling (e.g., SMOTE) or class weighting in model training.
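The data collection task above mentions ingesting from both CSV files and APIs. A minimal sketch of merging the two into one frame, assuming the API returns a JSON list of records with the same `reviewText` and `overall` fields; the in-memory `csv_source` and `api_payload` are hypothetical stand-ins for a real file and a real HTTP response:

```python
import io
import json

import pandas as pd

# Hypothetical stand-ins for a CSV file on disk and a JSON API response body.
csv_source = io.StringIO("reviewText,overall\nGreat phone,5\nStopped working,1\n")
api_payload = json.loads('[{"reviewText": "Okay for the price", "overall": 3}]')

def ingest(csv_file, api_records):
    """Combine a CSV source and a list of API records into one DataFrame."""
    frames = [pd.read_csv(csv_file), pd.DataFrame.from_records(api_records)]
    combined = pd.concat(frames, ignore_index=True)
    # Apply the same basic cleaning to every source: no duplicates, no missing fields.
    return combined.drop_duplicates().dropna(subset=["reviewText", "overall"])

reviews = ingest(csv_source, api_payload)
print(reviews.shape)
```

Normalizing every source to the same column names before concatenation keeps the downstream cleaning code independent of where a review came from.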


**Implementation Plan:**

1.   Use Python with Pandas for data ingestion and cleaning.
2.   For text normalization, use libraries like NLTK or SpaCy.
3.   Use scikit-learn for handling imbalanced datasets.
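The cleaning task lists stopword removal, which the `clean_text` function in this notebook does not perform. A minimal sketch, assuming scikit-learn's built-in English stopword list as a lightweight stand-in for NLTK or SpaCy (the helper name `normalize_text` is hypothetical):

```python
import re
import string

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def normalize_text(text):
    """Lowercase, strip punctuation, and drop English stopwords."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    # split() also collapses repeated whitespace
    tokens = text.split()
    return " ".join(t for t in tokens if t not in ENGLISH_STOP_WORDS)

print(normalize_text("The Product arrived, and the product works!"))
```

One caveat for sentiment work: this stopword list includes negations such as "not", so removing them can flip the meaning of a review. It may be worth keeping negations or skipping stopword removal entirely and comparing validation scores.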

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
import re
import string
from imblearn.over_sampling import SMOTE
from flask import Flask, request, jsonify
import joblib
import os
from google.colab import drive

In [3]:
# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Step 1: Data Ingestion and Preprocessing
def clean_text(text):
    """Lowercase the text, strip punctuation, and collapse whitespace."""
    text = text.lower()
    # re.escape keeps characters such as ']' and '\' literal inside the character class
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def load_and_preprocess_data(filepath):
    # e.g. filepath = '/content/drive/MyDrive/Reviews[1].csv' once Drive is mounted
    data = pd.read_csv(filepath)
    data = data.drop_duplicates()
    data = data.dropna(subset=['reviewText', 'overall'])
    data['cleaned_text'] = data['reviewText'].astype(str).apply(clean_text)
    # Ratings above 3 are positive, below 3 negative, exactly 3 neutral
    data['label'] = data['overall'].apply(lambda x: 'positive' if x > 3 else 'negative' if x < 3 else 'neutral')
    return data[['cleaned_text', 'label']]

**2. Model Development**

**Tasks:**

**Model Selection:**
Traditional ML (e.g., Logistic Regression, Random Forest) for simplicity and interpretability.

Deep Learning (e.g., BERT) for better performance on text classification.

**Training and Evaluation:**
Split data into training, validation, and test sets.

Use metrics like precision, recall, and F1 score to evaluate performance.
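The three-way split described above can be sketched with two calls to `train_test_split`. This is an illustration on hypothetical toy data, assuming a 60/20/20 split; the proportions are not specified in this notebook:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data with an imbalanced label distribution.
texts = [f"review {i}" for i in range(100)]
labels = ["positive"] * 60 + ["negative"] * 25 + ["neutral"] * 15

# Hold out 20% as the test set, stratified so each split keeps the class ratios.
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
# 25% of the remaining 80% gives a validation set that is 20% of the whole.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The validation set is used for model selection and tuning, while the test set stays untouched until the final evaluation.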

**Implementation Plan:**

1. Explore TF-IDF and embeddings (e.g., Word2Vec, BERT).
2. Train models using libraries like scikit-learn, TensorFlow, or PyTorch.
3. Use GridSearchCV for hyperparameter tuning.
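As one way to carry out the tuning step in the plan above, the sketch below runs `GridSearchCV` over a TF-IDF plus Logistic Regression pipeline. The toy data, the parameter grid, and the choice of `class_weight="balanced"` (the class-weighting alternative to SMOTE) are assumptions for illustration, not settings taken from this notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; a real run would use the cleaned review text and labels.
texts = ["great product", "love it", "works well", "excellent quality",
         "very happy", "highly recommend", "terrible product", "hate it",
         "broke quickly", "poor quality", "very disappointed", "do not buy"]
labels = ["positive"] * 6 + ["negative"] * 6

# Keeping the vectorizer inside the pipeline means TF-IDF is fit on the
# training folds only, so no vocabulary leaks from the held-out fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=3)
search.fit(texts, labels)
print(search.best_params_)
```

Macro-averaged F1 is used as the selection metric here because it weights each class equally, which suits imbalanced labels better than accuracy.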

In [None]:
# Step 2: Feature Engineering and Splitting
def feature_engineering(data):
    vectorizer = TfidfVectorizer(max_features=5000)
    X = vectorizer.fit_transform(data['cleaned_text'])
    y = data['label']
    return X, y, vectorizer

**3. Feature Engineering**

**Tasks:**

Perform EDA to understand dataset distributions, identify class imbalances, and determine feature importance.

**Extract features using:**
1. TF-IDF (Term Frequency-Inverse Document Frequency)
2. Pre-trained embeddings like BERT for semantic understanding.

**Implementation Plan:**

1. Visualize data with Matplotlib or Seaborn.
2. Use scikit-learn or transformers library for feature extraction.
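Before choosing between oversampling and class weighting, the EDA step can quantify how imbalanced the labels are. A minimal sketch on a hypothetical handful of ratings, using the same rating-to-label rule as the preprocessing code:

```python
import pandas as pd

# Hypothetical sample of star ratings standing in for the `overall` column.
ratings = pd.Series([5, 5, 4, 5, 3, 1, 5, 4, 2, 5], name="overall")

labels = ratings.apply(
    lambda x: "positive" if x > 3 else "negative" if x < 3 else "neutral"
)

counts = labels.value_counts()
# Ratio between the largest and smallest class; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(counts.to_dict())
print(imbalance_ratio)  # 7.0
```

With Matplotlib installed, `counts.plot(kind="bar")` turns the same counts into a quick bar chart.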

In [None]:
# Step 3: Model Training
def train_model(X, y):
    # Split first so the test set contains only real reviews, then oversample the training split only
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_resampled, y_resampled)
    predictions = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, predictions))
    print("Classification Report:\n", classification_report(y_test, predictions))
    # NOTE: `vectorizer` is not a parameter; this relies on the notebook-level variable set in the main flow
    return model, vectorizer

**4. Model Deployment**

**Tasks:**

1. Build an API for real-time predictions using FastAPI or Flask.

2. Deploy on a cloud platform like AWS, Google Cloud, or Heroku.

3. Ensure low latency and high concurrency using containerization with Docker.

**Implementation Plan:**

1. Create endpoints for classification predictions.
2. Use Gunicorn with Docker to handle concurrent requests.
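A prediction endpoint can be exercised without starting a server by using Flask's built-in test client. The sketch below is self-contained and uses a hypothetical `StubModel` in place of the trained classifier, so it only illustrates the request and response shape:

```python
from flask import Flask, jsonify, request

class StubModel:
    """Hypothetical stand-in for the trained classifier."""
    def predict(self, texts):
        return ["positive" if "good" in t else "negative" for t in texts]

app = Flask(__name__)
model = StubModel()

@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a missing or invalid JSON body
    payload = request.get_json(silent=True) or {}
    feedback = payload.get("feedback", "")
    if not feedback:
        return jsonify({"error": "No feedback provided"}), 400
    return jsonify({"prediction": model.predict([feedback.lower()])[0]})

# The test client sends requests directly to the app, no network involved.
client = app.test_client()
ok = client.post("/predict", json={"feedback": "Really good value"})
bad = client.post("/predict", json={})

print(ok.status_code, ok.get_json())
print(bad.status_code, bad.get_json())
```

For concurrent serving, and assuming the app object lived in a module named `api.py` (a hypothetical name), a command along the lines of `gunicorn -w 4 -b 0.0.0.0:5000 api:app` would run it with four worker processes inside the container.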


In [None]:
# Step 4: API Deployment
def create_api(model, vectorizer):
    app = Flask(__name__)

    @app.route('/predict', methods=['POST'])
    def predict():
        # silent=True yields None instead of raising on a missing or malformed JSON body
        payload = request.get_json(silent=True) or {}
        input_data = payload.get('feedback', '')
        if not isinstance(input_data, str) or not input_data.strip():
            return jsonify({'error': 'No feedback provided'}), 400
        cleaned_input = clean_text(input_data)
        features = vectorizer.transform([cleaned_input])
        prediction = model.predict(features)[0]
        # Cast the NumPy string to a plain str for JSON serialization
        return jsonify({'prediction': str(prediction)})

    return app

**5. Monitoring and Feedback Loop**

**Tasks:**

1. Track model performance metrics (accuracy, precision, recall) in production.

2. Monitor data drift and implement alerts for significant deviations.

3. Periodically retrain the model with new data.

**Implementation Plan:**

1. Use Prometheus and Grafana for monitoring.

2. Automate retraining using a CI/CD pipeline with tools like GitHub Actions or Jenkins.
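One simple signal for the drift monitoring task is a shift in the distribution of predicted labels between a reference window and the current window. The sketch below computes the Population Stability Index (PSI) over class proportions. The function name `label_psi`, the sample windows, and the 0.2 alert threshold (a common rule of thumb) are assumptions, and this checks prediction drift only, not drift in the input text itself:

```python
import math
from collections import Counter

def label_psi(reference, current, eps=1e-6):
    """Population Stability Index between two lists of class labels."""
    classes = sorted(set(reference) | set(current))
    ref_counts, cur_counts = Counter(reference), Counter(current)
    psi = 0.0
    for c in classes:
        # Clamp proportions so a class missing from one window does not hit log(0).
        p = max(ref_counts[c] / len(reference), eps)
        q = max(cur_counts[c] / len(current), eps)
        psi += (q - p) * math.log(q / p)
    return psi

# Hypothetical prediction windows: training-time mix vs. a recent production mix.
reference = ["positive"] * 70 + ["negative"] * 20 + ["neutral"] * 10
current = ["positive"] * 40 + ["negative"] * 50 + ["neutral"] * 10

psi = label_psi(reference, current)
ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb cutoff, tune for the use case
print(round(psi, 3), psi > ALERT_THRESHOLD)
```

A value like this could be exported as a gauge metric and alerted on from the monitoring stack.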

In [None]:
# Step 5: Monitoring and Feedback Loop
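def retrain_model_with_new_data(new_data_filepath, model, vectorizer):
    new_data = load_and_preprocess_data(new_data_filepath)
    # Reuse the fitted vectorizer so the feature space matches the one the API serves
    X_new = vectorizer.transform(new_data['cleaned_text'])
    y_new = new_data['label']
    X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_new, y_new)
    # Note: fit() retrains from scratch on the new data only
    model.fit(X_resampled, y_resampled)
    return model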

In [None]:
# Main execution flow
if __name__ == '__main__':
    # Load and preprocess data
    data_filepath = 'amazon_reviews.csv'  # Replace with actual dataset path
    data = load_and_preprocess_data(data_filepath)

    # Feature engineering and model training
    X, y, vectorizer = feature_engineering(data)
    model, vectorizer = train_model(X, y)

    # Save the model and vectorizer
    joblib.dump(model, 'feedback_model.pkl')
    joblib.dump(vectorizer, 'vectorizer.pkl')

    # Create and run the API
    app = create_api(model, vectorizer)
    # Keep debug mode off when listening on all interfaces; the debugger allows remote code execution
    app.run(debug=False, host='0.0.0.0', port=5000)

**Evaluation Plan**

**Metrics:**

1. Precision, Recall, F1 Score for model evaluation.

2. Latency and throughput for API performance.
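Latency and throughput can be estimated with a small timing harness before setting up full load testing. A minimal sketch, assuming sequential in-process calls; `measure_latency` and the stand-in `fake_predict` function are hypothetical, and real API numbers would also include network and serialization overhead:

```python
import statistics
import time

def measure_latency(fn, payloads):
    """Call fn on each payload and report median/p95 latency and throughput."""
    latencies = []
    start = time.perf_counter()
    for payload in payloads:
        t0 = time.perf_counter()
        fn(payload)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    latencies.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the last element.
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return {
        "p50_ms": statistics.median(latencies) * 1000,
        "p95_ms": p95 * 1000,
        "throughput_rps": len(payloads) / elapsed,
    }

def fake_predict(text):
    # Stand-in for clean_text + vectorizer.transform + model.predict
    return "positive" if "good" in text.lower() else "negative"

stats = measure_latency(fake_predict, ["Good value for money"] * 200)
print(stats)
```

Tail latency (p95 or p99) is usually more informative than the mean, since a few slow requests dominate the user experience under load.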

**Scalability:**

Deploy a scalable architecture using container orchestration tools like Kubernetes.

**Next Steps**

1. Set up project repository: Structure code and documents.

2. Collect and preprocess data: Start with the Kaggle dataset and preprocess it.

3. Train initial models: Use simple models (e.g., Logistic Regression) and progressively test advanced models (e.g., BERT).

4. Build API: Create and test the prediction endpoint.

5. Deploy and monitor: Host API and implement monitoring systems.

The code in this notebook implements the customer feedback analysis system described above, covering data ingestion, preprocessing, model training, and API deployment.

**Key features:**

1. Data Preprocessing: Cleans and normalizes text reviews.
2. Feature Engineering: Uses TF-IDF for feature extraction.
3. Model Training: Employs a Random Forest model with SMOTE for handling class imbalance.
4. API Deployment: A Flask-based API for real-time predictions.