# End-to-end machine learning process for text classification

In [1]:
# Imports
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline  # used by the Logistic Regression cell
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import joblib
import warnings
warnings.filterwarnings('ignore')

# Data Exploration and Preparation

In [2]:
# Load and display the dataset
file_path = 'sample_data_for_task1.csv'
data = pd.read_csv(file_path)
data.head()

Unnamed: 0,text,label
0,zucker fabrik,ft
1,Lebensmittel kommssionierung,ft
2,geländer biegen,mr
3,gebäudeausrüstung technische,ct
4,kürbiskernöl softgels,ft


In [4]:
# Data Preparation:

# Removing rows with missing labels (copy so later assignments do not trigger SettingWithCopyWarning)
data_clean = data.dropna(subset=['label']).copy()

# Lowercasing the text
data_clean['text'] = data_clean['text'].str.lower()

# Analyze label distribution
label_distribution = data_clean['label'].value_counts()

# Display the cleaned data and label distribution
data_clean.head(), label_distribution.head()

(                           text label
 0                 zucker fabrik    ft
 1  lebensmittel kommssionierung    ft
 2               geländer biegen    mr
 3  gebäudeausrüstung technische    ct
 4         kürbiskernöl softgels    ft,
 label
 ft     11226
 pkg     9617
 ct      5061
 mr      5016
 ch      3688
 Name: count, dtype: int64)

# Feature Extraction

In [6]:
# Feature Extraction with TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data_clean['text'])
y = data_clean['label']

# Splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

- I used a TF-IDF vectorizer, a standard and effective method for converting text into numerical features. TF-IDF weights each term by how often it occurs in a document, discounted by how many documents contain it, so terms that appear across the whole corpus count for less than distinctive ones.

# Model Development

For this task I train and compare three ML models:
- Logistic regression
- SVC
- Random forest

Later on I will choose the best model among them.
For each model, hyperparameter tuning and cross-validation based on accuracy were performed.
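In the cells of this notebook the vectorizer is fit on the full dataset before the train/test split, so its IDF statistics also see the test rows. One common alternative, sketched below on made-up data, is to split the raw text first and put the vectorizer inside the pipeline, so that each cross-validation fold refits it on its own training part. The toy phrases, the step names `tfidf` and `clf`, and the `toy_` variables are assumptions for illustration; this is not what was run to produce the results reported here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy data (hypothetical); the notebook would use data_clean['text'] and data_clean['label']
toy_texts = [
    "zucker fabrik", "zucker lieferant", "mehl fabrik",
    "brot backen", "käse herstellung", "milch abfüllung",
    "karton verpackung", "folie verpackung", "karton schachtel",
    "palette folie", "etikett drucken", "beutel folie",
]
toy_labels = ["ft"] * 6 + ["pkg"] * 6

# Split the raw text first so the vectorizer never sees the test rows
toy_text_train, toy_text_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The vectorizer is a pipeline step, so it is refit inside every CV fold
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

toy_search = GridSearchCV(
    toy_pipeline,
    {"clf__C": [0.1, 1.0, 10.0]},
    cv=3,
    scoring="accuracy",
)
toy_search.fit(toy_text_train, toy_y_train)

print("Best parameters:", toy_search.best_params_)
print("Held-out accuracy:", toy_search.score(toy_text_test, toy_y_test))
```

A side benefit of this layout is that the fitted pipeline accepts raw strings, so a single object can be saved and served instead of a separate model and vectorizer.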

### Logistic regression

In [3]:
# Define the pipeline with LogisticRegression
pipeline = Pipeline([
    ('logistic', LogisticRegression(max_iter=1000))
])

# Define the hyperparameters grid for grid search
param_grid = {
    'logistic__C': [0.1, 1.0, 10.0],
}

# GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the grid search to find the best parameters
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Use the best model to make predictions
best_model_LR = grid_search.best_estimator_
y_pred = best_model_LR.predict(X_test)

# Evaluate the best model
evaluation_report = classification_report(y_test, y_pred)
print("Best Parameters:", best_params)
print("Best Score:", best_score)
print("Evaluation Report:")
print(evaluation_report)


Best Parameters: {'logistic__C': 10.0}
Best Score: 0.8819734146063822
Evaluation Report:
              precision    recall  f1-score   support

          ch       0.88      0.89      0.88       706
         cnc       0.88      0.73      0.80       513
          ct       0.95      0.88      0.91      1022
          ft       0.87      0.93      0.90      2281
          mr       0.89      0.82      0.86      1009
         pkg       0.87      0.90      0.88      1908

    accuracy                           0.88      7439
   macro avg       0.89      0.86      0.87      7439
weighted avg       0.88      0.88      0.88      7439



#### Insights:
- Logistic Regression is a solid choice for a baseline model in text classification tasks due to its simplicity, interpretability, and efficiency in training.
- I have implemented hyperparameter tuning for the Logistic Regression model using GridSearchCV. This is crucial as it systematically searches through a range of parameters to find the best performing model configuration. 
- Five-fold cross-validation (cv=5) during hyperparameter tuning improves the reliability of the chosen configuration by checking that performance is consistent across different subsets of the training data.
- I evaluated the Logistic Regression model with a classification report, which gives a detailed view of performance for each class. Precision, recall, and F1-score are crucial for understanding the model's strengths and weaknesses in each category.
- The best 'C' parameter for Logistic Regression is 10.0. Because C is the inverse of the regularization strength, this means that weaker regularization helped the model perform better. Since 10.0 is the largest value in the grid, even larger values may be worth testing.
- The cross-validation score of approximately 0.882 is quite strong, suggesting that the model generalizes well across different subsets of the data.
- The model has high precision across all categories, particularly for 'ct' at 0.95, indicating a low rate of false positives.
- Recall values are also strong, though 'cnc' at 0.73 indicates some room for improvement in capturing all relevant instances for this category.
- The F1-scores, which balance precision and recall, are consistently high, demonstrating the model's robustness. The 'ct' category shows particularly strong performance with an F1-score of 0.91.
- The overall accuracy of 0.88 is very good.

The model is performing well, especially in distinguishing between most categories. The high precision and recall for most classes indicate that the model is effective in classifying the text phrases accurately.
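As a quick check on how the F1-score combines the other two metrics, the sketch below recomputes it for the 'cnc' class from the rounded precision and recall in the report above. The helper name `f1_from` is my own; because the inputs are already rounded, the result can differ from the report in the last digit for other classes.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded 'cnc' values from the Logistic Regression report
cnc_f1 = f1_from(precision=0.88, recall=0.73)
print(f"{cnc_f1:.2f}")  # prints 0.80
```

The harmonic mean is pulled toward the lower of the two values, which is why the weaker recall for 'cnc' drags its F1-score below that of the other classes.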

### SVC

In [7]:
# Define the hyperparameters grid for grid search
param_grid = {
    'C': [0.1, 1.0, 10.0],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the grid search to find the best parameters
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Use the best model to make predictions
best_model_SVC = grid_search.best_estimator_
y_pred = best_model_SVC.predict(X_test)

# Evaluate the best model
evaluation_report = classification_report(y_test, y_pred)
print("Best Parameters:", best_params)
print("Best Score:", best_score)
print("Evaluation Report:")
print(evaluation_report)


Best Parameters: {'C': 10.0, 'kernel': 'rbf'}
Best Score: 0.8783103058941932
Evaluation Report:
              precision    recall  f1-score   support

          ch       0.88      0.88      0.88       706
         cnc       0.90      0.72      0.80       513
          ct       0.96      0.86      0.90      1022
          ft       0.85      0.94      0.89      2281
          mr       0.90      0.81      0.86      1009
         pkg       0.88      0.91      0.89      1908

    accuracy                           0.88      7439
   macro avg       0.90      0.85      0.87      7439
weighted avg       0.89      0.88      0.88      7439



#### Insights:
- The best parameters found through hyperparameter tuning are a 'C' value of 10.0 and the 'rbf' kernel.
- The cross-validation score of approximately 0.878 is slightly lower than that of the Logistic Regression model but still indicates strong generalization capability.
- The precision is high across all classes, notably for 'ct' and 'mr' with 0.96 and 0.90, respectively. This indicates a low rate of false positives for these classes.
- The recall shows some variability, with 'cnc' at 0.72, suggesting some difficulty in capturing all relevant instances for this class. However, 'ft' and 'pkg' show strong recall, indicating effective identification of positive instances.
- The F1-scores are consistently high, ranging from 0.80 for 'cnc' to 0.90 for 'ct', which indicates balanced performance between precision and recall for most classes.
- With an overall accuracy of 0.88, the SVM model performs comparably to the Logistic Regression model, indicating effective classification capability.

The SVM model demonstrates strong classification performance, especially in terms of precision.
The comparison between the SVM and Logistic Regression models shows that they have similar overall performance metrics, but differences in precision and recall for specific classes may guide the choice of one model over the other.
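To compare models class by class rather than by reading two reports side by side, one option is to collect a single metric per class into a table. The helper below is a sketch; the name `per_class_metric` and the toy labels are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import classification_report


def per_class_metric(y_true, predictions_by_model, metric="f1-score"):
    """Return a DataFrame with one row per class and one column per model."""
    columns = {}
    for model_name, y_pred in predictions_by_model.items():
        report = classification_report(
            y_true, y_pred, output_dict=True, zero_division=0
        )
        # Keep only per-class rows; skip 'accuracy' and the averaged rows
        columns[model_name] = {
            label: scores[metric]
            for label, scores in report.items()
            if isinstance(scores, dict) and not label.endswith("avg")
        }
    return pd.DataFrame(columns)


# Toy labels and predictions (hypothetical)
toy_true = ["ft", "ft", "pkg", "pkg", "cnc", "cnc"]
toy_predictions = {
    "model_a": ["ft", "ft", "pkg", "ft", "cnc", "pkg"],
    "model_b": ["ft", "pkg", "pkg", "pkg", "cnc", "cnc"],
}
recall_table = per_class_metric(toy_true, toy_predictions, metric="recall")
print(recall_table)
```

In this notebook the same helper could be called with `y_test` and the outputs of `best_model_LR.predict(X_test)` and `best_model_SVC.predict(X_test)`.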

### Random Forest

In [5]:
# Define the hyperparameters grid for grid search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the grid search to find the best parameters
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Use the best model to make predictions
best_model_RF = grid_search.best_estimator_
y_pred = best_model_RF.predict(X_test)

# Evaluate the best model
evaluation_report = classification_report(y_test, y_pred)
print("Best Parameters:", best_params)
print("Best Score:", best_score)
print("Evaluation Report:")
print(evaluation_report)

Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best Score: 0.8767979380893787
Evaluation Report:
              precision    recall  f1-score   support

          ch       0.82      0.89      0.85       706
         cnc       0.90      0.78      0.84       513
          ct       0.95      0.86      0.90      1022
          ft       0.88      0.93      0.91      2281
          mr       0.89      0.81      0.85      1009
         pkg       0.89      0.91      0.90      1908

    accuracy                           0.89      7439
   macro avg       0.89      0.87      0.87      7439
weighted avg       0.89      0.89      0.89      7439



#### Insights:
- The optimal parameters found through tuning include 'max_depth' as None, 'min_samples_leaf' as 1, 'min_samples_split' as 2, and 'n_estimators' as 200.
- The cross-validation score of approximately 0.877 is competitive, indicating that the Random Forest model has good generalization capabilities, though it's slightly lower than the Logistic Regression and SVM models.
- The model shows strong precision across most classes, particularly for 'cnc', 'ct', and 'mr', indicating a low rate of false positives.
- The recall values are generally high, with 'ch' showing an impressive recall of 0.89. The model appears to be effective in identifying relevant instances across classes.
- The F1-scores are consistent, reflecting a balanced performance between precision and recall. The 'ft' category stands out with an F1-score of 0.91.
- An overall accuracy of 0.89 is notable, suggesting that the Random Forest model is very effective in classifying the text data.

The Random Forest model demonstrates robust performance, with the highest macro-averaged recall (0.87) of the three models, suggesting it is reliable at identifying positive instances across classes.
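The grid above has 3 × 3 × 3 × 3 = 81 combinations, which means 405 fits with five-fold cross-validation. A cheaper option, not the one used for the results in this notebook, is to sample a fixed number of configurations with `RandomizedSearchCV`. The sketch below runs on synthetic data so that it is self-contained; the ranges mirror the grid, while `n_iter=5` and `cv=3` are arbitrary choices to keep it fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the TF-IDF features (assumption, for illustration)
X_toy, y_toy = make_classification(
    n_samples=120, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Distributions cover the same ranges as the grid; randint's upper bound is exclusive
param_distributions = {
    "n_estimators": randint(100, 301),
    "max_depth": [None, 10, 20],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,          # number of sampled configurations
    cv=3,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_toy, y_toy)

print("Best parameters:", random_search.best_params_)
print("Best CV accuracy:", random_search.best_score_)
```

The trade-off is coverage: a random search may miss the best grid point, but its cost is set by `n_iter` rather than by the size of the grid.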

### Best model

- While Logistic Regression offers straightforward interpretation due to its linear decision boundaries, it cannot capture non-linear relationships in the data, which makes it prone to underfitting when such relationships exist. Models like SVM with an RBF kernel and Random Forest are better suited to that case.
- Random Forest is inherently adept at handling complex and non-linear relationships in data due to its ensemble approach, combining multiple decision trees to produce a more robust and generalized model.
- With Random Forest, I can add more trees without increasing the risk of overfitting, because predictions are averaged across trees; tree depth and leaf size still need to be validated, as was done in the grid search.
- In my evaluations, Random Forest slightly outperformed the other models in test-set accuracy (0.89 versus 0.88) and maintained high precision and recall across classes, although its cross-validation score (0.877) was marginally below those of Logistic Regression (0.882) and SVM (0.878).
- Random Forest offers insights into feature importance, which can be beneficial for understanding the driving factors behind the model's decisions.

In conclusion, Random Forest is the recommended model to proceed with for deployment, offering a strong balance of performance, complexity handling and robustness.

# REST API Implementation

In [7]:
# Save the best trained Random Forest model
joblib.dump(best_model_RF, 'random_forest_model.joblib')

# Save the TF-IDF vectorizer to transform new text data for predictions
joblib.dump(vectorizer, 'tfidf_vectorizer.joblib')

['tfidf_vectorizer.joblib']

### Setting up a REST API

- The script starts by loading the trained Random Forest model and the TF-IDF vectorizer using joblib. 
- I have defined a Pydantic model (TextData) to specify the expected input data format. 
- The /predict/ endpoint is designed to receive POST requests with text data and return the model's prediction. The endpoint uses the trained vectorizer to transform the incoming text into the same format the model was trained on.

# Future Work Suggestions

- Model Exploration and Ensemble Techniques: Experiment with more advanced models like deep learning architectures (e.g., LSTM, GRU) that might capture the nuances of language better for text classification tasks.

- Investigate additional feature engineering techniques beyond TF-IDF, such as word embeddings (Word2Vec, GloVe) or doc2vec, to capture semantic meanings of words.

- Implement data augmentation strategies for text data, such as synonym replacement, back-translation, or text generation, to increase the diversity of the training dataset, which can be particularly beneficial for underrepresented classes.

- Create a feedback loop where new data or corrected labels can be used to retrain and continuously improve the model.

- Create visualizations to better understand the data and the performance of the various models.
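As a possible starting point for the visualization item, a labeled confusion matrix shows which classes get mistaken for which. The sketch below builds it as a table on made-up labels; the helper name `confusion_table` is an assumption, and the resulting DataFrame could be handed to a plotting library as a heatmap.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix


def confusion_table(y_true, y_pred):
    """Confusion matrix as a DataFrame: rows are true labels, columns are predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    matrix = confusion_matrix(y_true, y_pred, labels=labels)
    return pd.DataFrame(
        matrix,
        index=pd.Index(labels, name="true"),
        columns=pd.Index(labels, name="predicted"),
    )


# Toy labels and predictions (hypothetical)
toy_true = ["ft", "ft", "pkg", "cnc", "cnc", "cnc"]
toy_pred = ["ft", "pkg", "pkg", "cnc", "ft", "ft"]

table = confusion_table(toy_true, toy_pred)
print(table)
```

Applied to `y_test` and a model's predictions, a table like this would show, for example, where the missed 'cnc' instances behind its lower recall end up.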
