# Text Classification

Text classification is the process of assigning text to predefined categories. A typical workflow has five steps:

1. Data Collection: Gather text from a source such as a website or a database.
2. Preprocessing: Remove anything that is not needed to understand the context and meaning.
3. Feature Extraction: Identify the key features that help determine what the text means and how it can be classified into categories.
4. Model Training: Pick a classification model and train it on labeled data so that it learns which category each text belongs to.
5. Prediction: Which class does our text belong to?
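
The five steps above can be sketched end to end in a few lines. This is a minimal illustration, assuming scikit-learn is installed; the four-sentence corpus, the `sports`/`cooking` labels, and the choice of logistic regression are hypothetical stand-ins, not part of the workflow itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1. Data collection: a tiny hand-written corpus (hypothetical examples)
texts = [
    "The team won the football match",
    "The striker scored two goals",
    "Bake the cake for thirty minutes",
    "Add sugar and butter to the dough",
]
labels = ["sports", "sports", "cooking", "cooking"]

# 2-3. Preprocessing and feature extraction: the vectorizer lowercases,
#      tokenizes, drops English stop words and computes TF-IDF weights
# 4. Model training: logistic regression on the TF-IDF features
clf = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(),
)
clf.fit(texts, labels)

# 5. Prediction: assign a class to unseen text
prediction = clf.predict(["He scored a late goal in the match"])
print(prediction[0])
```

Wrapping the vectorizer and the model in one pipeline means new text passes through exactly the same preprocessing as the training data.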

## Components of Text Classification Systems

Data Source: Documents, Online Articles, Collection using Web Scraping and APIs

Preprocessing tools and libraries: Cleaning -> Tokenize -> Normalize -> Stop Words Removal -> Stemming and Lemmatizing

Feature extraction: Vectorization (Transforming into Numerical Values), Embeddings (Capturing Semantic Meaning)

Classification Algorithms: Naive Bayes, Logistic Regression, Support Vector Machine, Decision Trees, Random Forest, Neural Networks

Evaluation and Optimization (Improving Accuracy): Metrics, Hyperparameter Tuning (Adjusting Model Settings), Cross-Validation (Testing on Held-Out Subsets of the Data)
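
As a small illustration of the metrics mentioned above, the sketch below computes accuracy, precision, recall and F1 with scikit-learn. The `y_true` and `y_pred` lists are made-up values for a binary spam task, chosen so the numbers are easy to check by hand (2 true positives, 1 false positive, 1 false negative, 4 true negatives).

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions for a binary task (1 = spam)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # correct predictions / all predictions
precision = precision_score(y_true, y_pred)  # true positives / predicted positives
recall = recall_score(y_true, y_pred)        # true positives / actual positives
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy alone can look good on unbalanced data, which is why precision and recall are usually reported alongside it.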

## Binary vs. Multi-class Classification

Binary Classification: Categorizing data into two distinct groups

Examples: Email Filtering, Sentiment Analysis

Characteristics: Clear-cut decision boundary, Simpler as it involves only two classes, Commonly used for yes-no type decisions

Multi-Class Classification: More than two groups

Examples: News Categorization, Product Categorization

Characteristics: Multiple decision boundaries, More complex due to the presence of several classes, Used when each item belongs to one of several distinct categories.
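
To make the difference concrete, the sketch below trains the same kind of model on a two-class and a three-class problem and inspects the shape of the predicted probabilities. The texts, the labels, and the helper `fit_and_inspect` are hypothetical and exist only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data for a binary task
binary_texts = ["win a free prize", "free money now", "meeting at noon", "lunch with the team"]
binary_labels = ["spam", "spam", "ham", "ham"]

# Hypothetical toy data for a multi-class task
multi_texts = [
    "stock markets fall", "bank raises rates",
    "team wins the final", "striker scores again",
    "new phone released", "chip maker unveils processor",
]
multi_labels = ["finance", "finance", "sports", "sports", "tech", "tech"]

def fit_and_inspect(texts, labels):
    """Fit a bag-of-words + logistic regression model and report its classes."""
    X = CountVectorizer().fit_transform(texts)
    model = LogisticRegression().fit(X, labels)
    # predict_proba returns one probability column per class
    return list(model.classes_), model.predict_proba(X).shape

binary_classes, binary_shape = fit_and_inspect(binary_texts, binary_labels)
multi_classes, multi_shape = fit_and_inspect(multi_texts, multi_labels)

print(binary_classes, binary_shape)
print(multi_classes, multi_shape)
```

The code path is identical in both cases; only the number of classes, and therefore the number of probability columns, changes.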

### Feature Selection Example

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2 # select some of the best features in our text

texts = ["Sport news", "Cooking blog"]

labels = [0, 1] # 0 for sports, 1 for cooking

X = TfidfVectorizer().fit_transform(texts) # Converting text data into numerical values

s = SelectKBest(chi2, k=2).fit(X, labels) # Select the top features which are relevant


## Text Preprocessing and Vectorization Techniques

Vectorization methods: Bag-of-Words, TF-IDF, Word Embeddings

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["Machine Learning is fascinating"]

# Initialize and apply TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

print(tfidf_matrix.toarray())

[[0.5 0.5 0.5 0.5]]


## Preprocessing the Profiles Dataset

In [11]:
# Setup packages
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder, normalize

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

import re

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/deepshah/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/deepshah/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [12]:
# Load Dataset
df = pd.read_csv('Demo Profiles.csv')
df.head()

Unnamed: 0,first_name,last_name,company,position,industry,location
0,John,Doe,ABC Corp,Marketing Manager,Technology,San Francisco
1,Jane,Smith,XYZ Inc,Social Media Specialist,Advertising & Marketing,New York
2,Michael,Johnson,123 Company,Digital Marketing Analyst,Consulting,Chicago
3,Sarah,Williams,ABC Corp,Content Writer,Media & Publishing,London
4,David,Brown,XYZ Inc,Brand Manager,Consumer Goods,Miami


In [15]:
# Text Preprocessing Techniques
def preprocess_text(text):
    text = text.lower()
    
    # Remove punctuation and digits (raw strings keep the regex escapes intact)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    
    # Tokenization
    tokens = nltk.word_tokenize(text)
    
    # Remove Stop Words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    
    # Stemming
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(word) for word in tokens]
    return ' '.join(stemmed)

In [18]:
# Apply the technique
df['processed_position'] = df['position'].apply(preprocess_text)
df.head()

Unnamed: 0,first_name,last_name,company,position,industry,location,processed_position
0,John,Doe,ABC Corp,Marketing Manager,Technology,San Francisco,market manag
1,Jane,Smith,XYZ Inc,Social Media Specialist,Advertising & Marketing,New York,social media specialist
2,Michael,Johnson,123 Company,Digital Marketing Analyst,Consulting,Chicago,digit market analyst
3,Sarah,Williams,ABC Corp,Content Writer,Media & Publishing,London,content writer
4,David,Brown,XYZ Inc,Brand Manager,Consumer Goods,Miami,brand manag


In [20]:
# Process of Text Vectorization

# A: Bag-of-Words
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(df['processed_position'])
print(bow_matrix.toarray())

# B: TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(df['processed_position'])
print(X.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 1 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 1 0 0]]
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.60225663 0.         0.        ]
 [0.         0.         0.62026425 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.86288949 0.        ]
 [0.         0.         0.         ... 0.60225663 0.         0.        ]]


In [22]:
# Normalize the Vectorized data
normalized_matrix = normalize(X, norm='l2', axis=1)
print(normalized_matrix.toarray())

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.60225663 0.         0.        ]
 [0.         0.         0.62026425 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.86288949 0.        ]
 [0.         0.         0.         ... 0.60225663 0.         0.        ]]
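
The normalized matrix is identical to the TF-IDF matrix because `TfidfVectorizer` already applies L2 normalization to each row by default (`norm='l2'`). The check below illustrates this on three of the stemmed titles shown earlier, and shows that `normalize` does change raw bag-of-words counts; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

docs = ["market manag", "social media specialist", "digit market analyst"]

# TF-IDF rows already have unit length, so normalizing again changes nothing
tfidf = TfidfVectorizer().fit_transform(docs)
tfidf_norms = np.linalg.norm(tfidf.toarray(), axis=1)
unchanged = np.allclose(normalize(tfidf, norm="l2", axis=1).toarray(), tfidf.toarray())

# Raw counts are not unit length, so here normalization has an effect
counts = CountVectorizer().fit_transform(docs)
count_norms_before = np.linalg.norm(counts.toarray(), axis=1)
count_norms_after = np.linalg.norm(normalize(counts, norm="l2", axis=1).toarray(), axis=1)

print(tfidf_norms, unchanged)
print(count_norms_before, count_norms_after)
```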


In [23]:
# Encode the Target Variable
unique_values = df['industry'].unique()
for i, value in enumerate(unique_values, 0):
    print(f"{i}. {value}")

0. Technology
1. Advertising & Marketing
2. Consulting
3. Media & Publishing
4. Consumer Goods
5. E-commerce
6. Fashion & Apparel
7. Beauty & Cosmetics
8. Market Research
9.  Marketing Coordinator


In [24]:
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['industry'])
print(y)

[9 1 3 8 4 5 6 9 1 3 8 4 2 7 9 1 8 5 6 9 3 8 4 1 5 6 9 1 3 8 4 9 5 6 8 4 2
 7 9 1 8 5 6 9 3 8 4 1 5 6 9 1 3 8 4 9 5 6 8 4 2 7 9 1 8 5 6 9 3 8 4 1 5 6
 9 1 3 8 4 9 5 6 8 4 2 7 9 1 0 5 6 9 3 8 4 1 5 6 9 1]
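
Note that the integer codes do not follow the numbering printed by `enumerate` above: `LabelEncoder` sorts the class names, which is why the first row (Technology) is encoded as 9. The sketch below uses a hypothetical four-value subset of the column to show how to read the mapping from `classes_`, and how stripping whitespace keeps a value with a stray leading space from sorting ahead of everything else.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical subset of the industry values, in order of first appearance
industries = ["Technology", "Advertising & Marketing", "Consulting", " Marketing Coordinator"]

encoder = LabelEncoder()
codes = encoder.fit_transform(industries)

# classes_ is sorted, so position i in classes_ is the label encoded as i
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# Stripping whitespace first gives the value its expected sort position
cleaned = [value.strip() for value in industries]
cleaned_classes = [str(c) for c in LabelEncoder().fit(cleaned).classes_]
print(cleaned_classes)
```

`encoder.inverse_transform` performs the reverse lookup from codes back to names.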


# Modeling, Evaluation and Advanced Classification Techniques

## Exploration of Classification Algorithms

1. Naive Bayes: Estimates the probability that a data point belongs to each class, assuming the features are independent given the class. Best for spam detection, sentiment analysis, and document classification. It is simple, fast, and effective with large feature sets.

2. Logistic Regression: Models the probability that an input belongs to a particular category. Best for binary classification, email filtering, and fraud detection. It outputs probabilities, is easy to interpret, and is efficient to train.

3. Support Vector Machine: Finds the boundary (a line in two dimensions, a hyperplane in general) that separates two groups of points as clearly as possible. Best for categorizing emails, articles, and web pages. It is effective in high dimensions, adaptable to various data types, and tends to be very accurate.

4. Decision Trees: A tree-like structure in which branches represent decision paths and leaves represent outcomes or decisions. Best for strategic decisions and classification tasks. The decision-making process can be visualized, and the input does not require normalization.

5. Random Forest: A set of decision trees that each vote for a class; the class with the most votes is chosen as the prediction. Best for scenarios with multiple classes and for understanding feature importance. It offers high accuracy and robustness, and can perform well on datasets with unbalanced classes.

6. Neural Networks: Made up of layers capable of detecting patterns in data, ranging from simple to complex. Best for language translation, sentiment analysis, and text generation. They learn features directly from data, adapt to a wide range of tasks, and can model complex relationships.

Hyperparameter tuning: The process of finding the combination of hyperparameters that yields the best model performance.

Methods:
1. Grid Search
2. Random Search
3. Automated Machine Learning Tools
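
As an illustration of the second method, the sketch below runs a random search with scikit-learn's `RandomizedSearchCV`. It assumes SciPy is available for `loguniform`; the synthetic data from `make_classification` stands in for vectorized text, and the search range for `C` is an arbitrary choice.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic numeric data stands in for vectorized text
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-2, 1e2)},  # sample C on a log scale
    n_iter=8,        # number of parameter settings to sample
    cv=4,            # 4-fold cross-validation for each setting
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Unlike grid search, the cost is fixed by `n_iter` rather than by the size of the grid, which helps when several hyperparameters are tuned at once.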

Cross-validation: Split the data into subsets (folds), train on some folds and test on the rest, then rotate so that every fold is used for testing once. Averaging the results gives a more reliable estimate of how well the model has learned.

Text Dataset -> Split data into folds -> Train on some folds -> Test the performance on the remaining folds -> Switch the folds used for training and testing -> Calculate the average performance
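
The flow above can be written out as an explicit loop. This is a sketch on synthetic data, assuming four stratified folds and a logistic regression model; `cross_val_score` wraps the same logic in a single call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic numeric data stands in for vectorized text
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# Split the data into 4 folds that preserve the class proportions
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in folds.split(X, y):
    # Train on three folds, test on the remaining one
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# Average the per-fold accuracy
avg_score = np.mean(scores)
print(scores, avg_score)
```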

In [25]:
# Sentiment Analysis Example
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

sa = SentimentIntensityAnalyzer()

text = "I love this phone. The camera is amazing!"

sentiment = "Positive" if sa.polarity_scores(text)['compound'] >= 0 else "Negative"

print(sentiment)

Positive


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/deepshah/nltk_data...


In [26]:
# Spam Detection Example
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Win a free smartphone!", "Lunch at noon"]
labels = [1,0] # 1 for spam and 0 for not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

new_text = vectorizer.transform(["Free Money!!!"])
predicted = model.predict(new_text)

print("Spam" if predicted[0]==1 else "Not Spam")

Spam


In [27]:
# Content Recommendation Example
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

contents = ["Action movie", "Romantic movie", "Documentary about nature"]

user_favorite = "Action and adventure"

tfidf = TfidfVectorizer().fit_transform(contents+[user_favorite]) # Transforming text into TF-IDF vectors

sm = linear_kernel(tfidf[-1], tfidf[:-1]).flatten() # Calculating the similarity of the user fav content with the list

print("Recommended: ", contents[sm.argmax()]) # Printing most similar content

Recommended:  Action movie
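
Because TF-IDF rows are L2-normalized by default, the dot product computed by `linear_kernel` equals cosine similarity here. The sketch below checks that on the same data and prints the full ranking rather than only the top match; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

contents = ["Action movie", "Romantic movie", "Documentary about nature"]
user_favorite = "Action and adventure"

tfidf = TfidfVectorizer().fit_transform(contents + [user_favorite])

# Compare the user's favorite (last row) against every content item
dot_scores = linear_kernel(tfidf[-1], tfidf[:-1]).flatten()
cos_scores = cosine_similarity(tfidf[-1], tfidf[:-1]).flatten()
same = np.allclose(dot_scores, cos_scores)

# Rank all items from most to least similar
ranking = np.argsort(dot_scores)[::-1]
for idx in ranking:
    print(f"{contents[idx]}: {dot_scores[idx]:.3f}")
```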


In [34]:
# Information Retrieval Example
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

documents = ["financial markets", "AI trends", "Paris travel"]

query = "AI technology"

# Fit on the documents plus the query so they share one vocabulary
vec = TfidfVectorizer().fit_transform(documents + [query])

# Compare the query (last row) against every document
sm = linear_kernel(vec[-1], vec[:-1]).flatten()

print("Retrieved: ", documents[sm.argmax()])

Retrieved:  AI trends


## Classifying Professional Profiles into Multiple Categories

In [35]:
# Setup packages
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

import re
import warnings
warnings.filterwarnings('ignore', category=UserWarning)

[nltk_data] Downloading package punkt to /Users/deepshah/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/deepshah/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [36]:
df = pd.read_csv("Demo Profiles.csv")

In [41]:
# Preprocess data
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]','', text)
    text = re.sub(r'\d+','', text)
    tokens = nltk.word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(word) for word in tokens]
    return ' '.join(stemmed)

df['processed_text'] = df['position'].apply(preprocess_text)

tfidf_vec = TfidfVectorizer()
X = tfidf_vec.fit_transform(df['processed_text'])

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['industry'])

In [42]:
# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=44)
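
In the cells above, the TF-IDF vocabulary is fitted on every row before the split, so the test rows influence the features. One common alternative, sketched here under assumptions, is to split the raw text first and put the vectorizer inside a `Pipeline` so that it is refitted on the training portion of each fold. The twelve job titles and two industries below are hypothetical stand-ins for the profiles data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical job titles and industries standing in for the profiles data
titles = [
    "marketing manager", "brand manager", "marketing analyst",
    "brand strategist", "content marketing lead", "social media manager",
    "software engineer", "data engineer", "backend developer",
    "software developer", "cloud engineer", "data scientist",
]
industries = ["Marketing"] * 6 + ["Technology"] * 6

# Split the raw text before any vectorizer is fitted
train_texts, test_texts, y_train, y_test = train_test_split(
    titles, industries, test_size=0.25, random_state=0, stratify=industries
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # refitted on the training part of each fold
    ("clf", SVC()),
])

# Pipeline parameters are addressed as <step name>__<parameter>
grid = GridSearchCV(
    pipeline,
    {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]},
    cv=3,
)
grid.fit(train_texts, y_train)

test_accuracy = grid.score(test_texts, y_test)
print(grid.best_params_, test_accuracy)
```

With data this small the scores themselves mean little; the point is the structure, in which the fitted pipeline can also be applied directly to new raw strings.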

In [43]:
# Define Models
models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier()
}

In [44]:
# Define Hyperparameter Tuning
param_grid = {
    "Naive Bayes": {'alpha': [0.1, 1, 10]},
    "Logistic Regression": {'C': [0.1, 1, 10]},
    "SVM": {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    "Random Forest": {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}
}

In [47]:
# Train and Evaluate Models
best_models = {}
model_scores = {}

for name, model in models.items():
    # Search this model's parameter grid with 4-fold cross-validation
    grid = GridSearchCV(model, param_grid[name], cv=4)
    grid.fit(X_train, y_train)
    # Keep the estimator refitted on all of X_train with the best parameters
    best_model = grid.best_estimator_
    best_models[name] = best_model
    # Score the tuned model with 4-fold cross-validation on the training data
    scores = cross_val_score(best_model, X_train, y_train, cv=4)
    avg_score = np.mean(scores)
    model_scores[name] = avg_score
    print(f"{name}: Best Params: {grid.best_params_}, Cross-Val Score: {avg_score}")

Naive Bayes: Best Params: {'alpha': 0.1}, Cross-Val Score: 0.7875
Logistic Regression: Best Params: {'C': 10}, Cross-Val Score: 0.85
SVM: Best Params: {'C': 10, 'kernel': 'linear'}, Cross-Val Score: 0.8750000000000001
Random Forest: Best Params: {'max_depth': None, 'n_estimators': 100}, Cross-Val Score: 0.8250000000000001


In [48]:
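# Evaluate the model with the highest cross-validation score on the held-out test set
best_name = max(model_scores, key=model_scores.get)
y_pred = best_models[best_name].predict(X_test)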
print(f"Test Set Report:\n{classification_report(y_test, y_pred)}\n")

Test Set Report:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         3
           2       1.00      1.00      1.00         1
           3       0.33      1.00      0.50         1
           4       1.00      1.00      1.00         3
           5       1.00      1.00      1.00         3
           6       1.00      0.33      0.50         3
           7       1.00      1.00      1.00         1
           8       1.00      1.00      1.00         1
           9       1.00      1.00      1.00         4

    accuracy                           0.90        20
   macro avg       0.93      0.93      0.89        20
weighted avg       0.97      0.90      0.90        20
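
The report above lists the encoded class numbers, and class 0 is absent because no sample of it landed in the test set. A possible way to show the industry names instead is to pass `labels` and `target_names` to `classification_report`. The sketch below uses three hypothetical classes and made-up predictions rather than the notebook's data.

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the encoded industry column
encoder = LabelEncoder().fit(["Consulting", "E-commerce", "Technology"])
y_test = encoder.transform(["Technology", "E-commerce", "Technology", "E-commerce"])
y_pred = encoder.transform(["Technology", "E-commerce", "E-commerce", "E-commerce"])

# Report only the classes that actually occur, since a small test set may miss some
present = sorted(set(y_test) | set(y_pred))
report = classification_report(
    y_test,
    y_pred,
    labels=present,
    target_names=list(encoder.inverse_transform(present)),
    zero_division=0,
)
print(report)
```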




In [50]:
# Predict
digital_ex = "Digital Marketing Specialist"

processed_ex = preprocess_text(digital_ex)

vec_ex = tfidf_vec.transform([processed_ex])

# Use the SVM, which had the highest cross-validation score above
predicted_cat_idx = best_models["SVM"].predict(vec_ex)
pred_cat = label_encoder.inverse_transform(predicted_cat_idx)

print(f"The predicted industry for {digital_ex} is: {pred_cat[0]}")

The predicted industry for Digital Marketing Specialist is: E-commerce
