# Text Classification Using Logistic Regression 

I developed a text classification model using logistic regression. Logistic regression works well with text data, especially when the text is represented with approaches such as TF-IDF. It also makes a good starting point: if it performs well, it can serve as a benchmark against which more advanced models are compared. Because logistic regression requires numerical input, I used TF-IDF vectorization to convert the text data into numerical values. Accuracy provides an overall measure of correct predictions, while the F1 score balances precision and recall, which is especially useful when the class distribution is imbalanced.
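To make the difference between accuracy and F1 concrete, here is a minimal sketch on made-up labels (the toy data is an assumption for illustration and is not taken from the dataset). A classifier that always predicts the majority class scores well on accuracy but poorly on macro-averaged F1.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy labels: 9 samples of "ft" and 1 of "cnc" (illustrative only).
y_true = ["ft"] * 9 + ["cnc"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["ft"] * 10

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging gives every class equal weight, so the missed "cnc" class
# pulls the score down. zero_division=0 scores the never-predicted class as 0.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"Accuracy: {accuracy:.2f}")
print(f"Macro F1: {macro_f1:.2f}")
```

On this toy input the accuracy is 0.90 while the macro F1 is roughly 0.47, which is why both metrics are worth reporting on imbalanced data.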

The F1 scores of both models are similar, so I used the simple logistic regression model for this task. As required, FastAPI is used to create the REST API endpoints. In addition, a simple HTML page was developed that provides a complete workflow using the model.



In [6]:
# Importing required libraries and modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from typing import Union
from joblib import dump
import re

In [7]:
# Load the data into a pandas DataFrame
Sentences = pd.read_csv("sample_data.csv")
print(Sentences.head())

                           text label
0                 zucker fabrik    ft
1  Lebensmittel kommssionierung    ft
2               geländer biegen    mr
3  gebäudeausrüstung technische    ct
4         kürbiskernöl softgels    ft
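As a small illustration of how TF-IDF turns such short texts into numbers, the five preview rows above can be vectorized on their own. This is a sketch with default `TfidfVectorizer` settings. The variable names are illustrative, and the weights differ from those obtained when fitting on the full dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The five rows shown in the preview above.
sample_texts = [
    "zucker fabrik",
    "Lebensmittel kommssionierung",
    "geländer biegen",
    "gebäudeausrüstung technische",
    "kürbiskernöl softgels",
]

# TfidfVectorizer lowercases and tokenizes the text by default.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sample_texts)

# One row per text, one column per distinct token in the vocabulary.
print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_))
# TF-IDF weights of the first text ("zucker fabrik").
print(tfidf_matrix[0].toarray().round(3))
```

Each of the five texts consists of two tokens that appear nowhere else, so the matrix has shape (5, 10) and every non-zero weight is about 0.707 after the default L2 normalization.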


In [8]:
# Calculate the distribution of unique values in the 'label' column
class_distribution = Sentences['label'].value_counts()
print(class_distribution)
# Remove rows where the 'label' column has missing values (NaN)
Sentences = Sentences.dropna(subset=['label'])
print("Number of missing values in the dataset:")
print(Sentences.isnull().sum())

label
ft     11226
pkg     9617
ct      5061
mr      5016
ch      3688
cnc     2587
Name: count, dtype: int64
Number of missing values in the dataset:
text     0
label    0
dtype: int64
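The counts printed above can be turned into class shares to quantify the imbalance. The sketch below copies the counts by hand and also computes the accuracy of a baseline that always predicts the most frequent class, which gives a reference point for later accuracy figures. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"ft": 11226, "pkg": 9617, "ct": 5061,
                "mr": 5016, "ch": 3688, "cnc": 2587}

total = sum(class_counts.values())
# Share of each class in the dataset.
class_shares = {label: count / total for label, count in class_counts.items()}
# Accuracy of a baseline that always predicts the most frequent class.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = class_shares[majority_label]
# Ratio between the largest and the smallest class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in class_shares.items():
    print(f"{label:>4}: {share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.1f}")
```

The largest class (`ft`) makes up about 30% of the labelled rows and is roughly 4.3 times the size of the smallest class (`cnc`).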


In [9]:
# Define a function that removes Chinese characters (CJK Unified Ideographs, U+4E00 to U+9FFF)
def remove_chinese(text):
    if pd.isnull(text):
        return text
    
    chinese_pattern = re.compile('[\u4e00-\u9fff]+')
    cleaned_text = re.sub(chinese_pattern, '', text)
    return cleaned_text

Sentences['text'] = Sentences['text'].apply(remove_chinese)
Sentences['text'] = Sentences['text'].str.encode('utf-8').str.decode('utf-8', 'ignore')
Sentences['text'] = Sentences['text'].str.lower()
print(Sentences.text)

0                        zucker fabrik
1         lebensmittel kommssionierung
2                      geländer biegen
3         gebäudeausrüstung technische
4                kürbiskernöl softgels
                     ...              
37290        spirituosen dienstleister
37291           mini hydraulikzylinder
37292    blockbodenbeutel verpackungen
37293              drehteile verpacken
37294                     bagger tanks
Name: text, Length: 37195, dtype: object
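The effect of the character range used above can be checked on a single string. The sketch below uses the same range. The helper name, the sample strings, and the extra whitespace tidying are assumptions added for illustration and are not part of the notebook's preprocessing.

```python
import re

# Same character range as in the cell above: CJK Unified Ideographs.
CHINESE_PATTERN = re.compile(r"[\u4e00-\u9fff]+")


def remove_chinese_and_tidy(text: str) -> str:  # hypothetical helper name
    without_chinese = CHINESE_PATTERN.sub("", text)
    # Extra step (not in the notebook): collapse the whitespace left behind.
    return " ".join(without_chinese.split()).lower()


sample = "Schrauben 螺丝 M8"  # made-up example string
cleaned = remove_chinese_and_tidy(sample)
print(cleaned)

# Characters outside the range, such as Japanese hiragana, are left untouched.
print(remove_chinese_and_tidy("ねじ"))
```

Without the tidying step, removing characters from the middle of a string leaves a double space behind. With the default `TfidfVectorizer` tokenizer this makes no difference, since tokens are extracted by pattern rather than by splitting on single spaces.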


In [10]:
# Define a function that trains and evaluates the logistic regression model
def logisticreg():
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(Sentences['text'], Sentences['label'], test_size=0.2, random_state=42)

    # text vectorization using TF-IDF and training a logistic regression model
    vectorizer = TfidfVectorizer()
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_tfidf, y_train)
    # Predictions on the test set
    predictions = model.predict(X_test_tfidf)

    # save both model and vectorizer
    dump(model, "logReg.pkl")
    dump(vectorizer, "tfidf_vectorizer.pkl")

    # Evaluate the model
    accuracy = accuracy_score(y_test, predictions)
    report = classification_report(y_test, predictions)
    # Print the results
    print(f"Accuracy: {accuracy:.2f}")
    print("Classification Report:")
    print(report)
    

logisticreg()

Accuracy: 0.87
Classification Report:
              precision    recall  f1-score   support

          ch       0.95      0.84      0.89       706
         cnc       0.79      0.75      0.77       513
          ct       0.95      0.85      0.90      1022
          ft       0.82      0.94      0.88      2281
          mr       0.89      0.80      0.84      1009
         pkg       0.89      0.89      0.89      1908

    accuracy                           0.87      7439
   macro avg       0.88      0.84      0.86      7439
weighted avg       0.88      0.87      0.87      7439
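Since the function saves both the model and the vectorizer with `joblib`, they can be reloaded later to classify new text. The sketch below shows that round trip. To stay self-contained it trains a tiny stand-in model and writes to a temporary directory, both of which are assumptions. With the real artifacts, only the `load` and `predict` steps would be needed.

```python
import tempfile
from pathlib import Path

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in training set (rows taken from the data preview).
texts = ["zucker fabrik", "kürbiskernöl softgels", "geländer biegen",
         "gebäudeausrüstung technische"]
labels = ["ft", "ft", "mr", "ct"]
vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same file names as in the notebook, written to a temporary folder.
    model_path = Path(tmp_dir) / "logReg.pkl"
    vectorizer_path = Path(tmp_dir) / "tfidf_vectorizer.pkl"
    dump(model, model_path)
    dump(vectorizer, vectorizer_path)

    # Reload both artifacts, as a separate prediction script would.
    loaded_model = load(model_path)
    loaded_vectorizer = load(vectorizer_path)

# New text must go through the same fitted vectorizer before prediction.
new_texts = ["zucker fabrik", "geländer biegen"]
features = loaded_vectorizer.transform(new_texts)
predicted = loaded_model.predict(features)
print(list(zip(new_texts, predicted)))
```

Saving the vectorizer alongside the model matters because the model's coefficients are tied to the vocabulary and IDF weights learned during training.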

