# Text classification using Support Vector Machine

This notebook builds a text classification model using a support vector machine (SVM). An SVM looks for the hyperplane that separates the classes with the largest margin in the feature space, so it performs well when the data is linearly separable or close to it. In many cases this holds for text: even though the feature space is high-dimensional, the classes can often be split successfully by a hyperplane. SVMs also work well with sparse data, and TF-IDF produces a sparse representation because each document contains only a small fraction of the vocabulary; within that representation, common words receive low weights and distinctive words receive high weights. The model is evaluated with accuracy, which gives an overall measure of correct predictions, and with the F1 score, which balances precision and recall and is especially useful when the class distribution is imbalanced.
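
To make the sparsity and weighting claims concrete, here is a minimal sketch on a hypothetical three-sentence corpus (the sentences are invented for illustration and are not taken from `sample_data.csv`). It checks that `TfidfVectorizer` returns a SciPy sparse matrix and compares the inverse document frequency of a word that appears in every document with one that appears in a single document.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the notebook's data
docs = [
    "the package arrived late",
    "the machine needs repair",
    "the package was damaged",
]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(docs)

# fit_transform returns a SciPy sparse matrix: only non-zero weights are stored
print("sparse:", issparse(X_toy))
print("shape:", X_toy.shape, "stored non-zeros:", X_toy.nnz)

# Look up the inverse document frequency (IDF) learned for two terms
vocab = toy_vectorizer.vocabulary_
idf_the = toy_vectorizer.idf_[vocab["the"]]        # appears in every document
idf_repair = toy_vectorizer.idf_[vocab["repair"]]  # appears in one document
print(f"idf('the') = {idf_the:.2f}, idf('repair') = {idf_repair:.2f}")
```

Each row stores only the handful of terms its sentence contains, and the word shared by all documents gets the lowest IDF. On a real corpus with thousands of vocabulary terms, the fraction of stored entries is typically far smaller.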

In [None]:
# Import libraries and modules
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report



In [None]:
# Loading data into pandas data frame 
data = pd.read_csv('sample_data.csv')

# Check the first few rows
print(data.head())

In [None]:
# Count how many samples each class in the 'label' column has
class_distribution = data['label'].value_counts()
print(class_distribution)

# Drop rows that have no label
data = data.dropna(subset=['label'])

# Report the missing values that remain in each column
print("Number of missing values in the dataset:")
print(data.isnull().sum())

# Round-trip the text through UTF-8 (undecodable bytes are ignored), then lowercase it
data['text'] = data['text'].str.encode('utf-8').str.decode('utf-8', 'ignore')
data['text'] = data['text'].str.lower()
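
The cleaning cell above drops only rows with a missing label. If any row had a missing `text` value, it would still reach `TfidfVectorizer`, which expects string documents. The following is a minimal sketch of one way to guard against that, run on a hypothetical toy frame (the rows and the name `toy` are made up for illustration); whether the real dataset contains such rows is not known from this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for sample_data.csv
toy = pd.DataFrame({
    "text": ["Late Delivery", None, "Broken Part", "Wrong Item"],
    "label": ["pkg", "pkg", None, "mr"],
})

# Keep only rows that have both a text and a label
clean = toy.dropna(subset=["text", "label"]).copy()

# Lowercase the remaining text
clean["text"] = clean["text"].str.lower()

print(clean)
print("Missing values left:", int(clean.isnull().sum().sum()))
```

The explicit `.copy()` makes `clean` an independent frame, so the column assignment that follows does not modify or depend on `toy`.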

In [36]:


# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data
X = vectorizer.fit_transform(data['text'])
y = data['label']
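
Fitting the vectorizer on the full dataset, as in the cell above, means the vocabulary and IDF weights are computed from rows that later land in the test set. A common alternative is to split first and fit the vectorizer on the training rows only. The sketch below shows that variant with a scikit-learn pipeline; the toy texts are hypothetical, and the `stratify` argument is an addition that the notebook does not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data; the label names are borrowed from the notebook's classes
texts = [
    "box arrived crushed", "parcel seal was torn",
    "carton label missing", "box tape came loose",
    "motor makes loud noise", "motor stopped running",
    "spindle bearing failed", "drive belt snapped",
]
labels = ["pkg", "pkg", "pkg", "pkg", "mr", "mr", "mr", "mr"]

# Split the raw text first; stratify keeps class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the SVM
model = make_pipeline(TfidfVectorizer(), SVC())
model.fit(X_train, y_train)

# At prediction time the test texts are transformed with the training vocabulary
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
```

With only eight samples the scores themselves are meaningless; the point is the order of operations. Words that occur only in the test texts are not part of the learned vocabulary, so they are ignored at prediction time instead of influencing the features.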


In [40]:

def svm():
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initialize and train the SVM classifier
    # (SVC() with no arguments uses scikit-learn's default RBF kernel)
    svm_classifier = SVC()
    svm_classifier.fit(X_train, y_train)

    # Predict on the test set
    y_pred = svm_classifier.predict(X_test)

    # Evaluate accuracy and per-class precision, recall, and F1
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    print(f'Accuracy: {accuracy}')
    print("Classification Report:")
    print(report)

svm()

Accuracy: 0.8706815432181745
Classification Report:
              precision    recall  f1-score   support

          ch       0.87      0.84      0.86       706
         cnc       0.92      0.70      0.80       513
          ct       0.96      0.84      0.90      1022
          ft       0.83      0.94      0.88      2281
          mr       0.90      0.79      0.84      1009
         pkg       0.87      0.90      0.88      1908

    accuracy                           0.87      7439
   macro avg       0.89      0.84      0.86      7439
weighted avg       0.88      0.87      0.87      7439
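
The two average rows in the report differ in how they treat class size: the macro average weights every class equally, while the weighted average weights each class by its support. As a small check, the sketch below recomputes both F1 averages from the rounded per-class values printed above, so the results are only accurate to two decimals.

```python
# Per-class F1 scores and supports copied from the classification report above
f1 = {"ch": 0.86, "cnc": 0.80, "ct": 0.90, "ft": 0.88, "mr": 0.84, "pkg": 0.88}
support = {"ch": 706, "cnc": 513, "ct": 1022, "ft": 2281, "mr": 1009, "pkg": 1908}

total = sum(support.values())

# Macro average: plain mean over classes, regardless of class size
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class contributes in proportion to its support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"total support: {total}")
print(f"macro F1:      {macro_f1:.2f}")
print(f"weighted F1:   {weighted_f1:.2f}")
```

This reproduces the 0.86 macro and 0.87 weighted F1 values in the report. The weighted figure is slightly higher because the two largest classes, `ft` and `pkg`, both score 0.88, while the smallest class, `cnc`, has the lowest F1 (0.80), driven by its recall of 0.70.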

