# Text Classification

Text classification is the process of assigning pieces of text to predefined categories. A typical workflow has five steps:

1. Data Collection: Gather text from a source such as a website or a database.
2. Preprocessing: Remove anything that is not needed to understand the context and meaning of the text.
3. Feature Extraction: Derive the key features that capture what the text means and help separate it into categories.
4. Model Training: Pick a classification model and train it on labeled data so that it learns which category each text belongs to.
5. Prediction: Use the trained model to decide which class a new text belongs to.
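
The five steps above can be sketched end to end with scikit-learn. This is a minimal illustration, not a recommended setup: the four example sentences, their labels, and the choice of `LogisticRegression` are assumptions made for the sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1. Data collection: a tiny hand-written corpus (illustrative only)
texts = [
    "The team won the football match",
    "The striker scored two goals",
    "Bake the cake for thirty minutes",
    "Stir the sauce and add salt",
]
labels = ["sports", "sports", "cooking", "cooking"]

# 2-4. Preprocessing (lowercasing, stop-word removal), feature extraction
# (TF-IDF), and model training, chained in a single pipeline
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(),
)
model.fit(texts, labels)

# 5. Prediction on unseen text
prediction = model.predict(["He scored a goal in the match"])
print(prediction)
```

Wrapping the vectorizer and the classifier in one pipeline means new text passes through exactly the same preprocessing as the training data.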

## Components of Text Classification Systems

Data Source: Documents, Online Articles, Collection using Web Scraping and APIs

Preprocessing tools and libraries: Cleaning -> Tokenization -> Normalization -> Stop-Word Removal -> Stemming or Lemmatization

Feature extraction: Vectorization (Transforming into Numerical Values), Embeddings (Capturing Semantic Meaning)

Classification Algorithms: Naive Bayes, Logistic Regression, Support Vector Machine, Decision Trees, Random Forest, Neural Networks
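
Most of the algorithms listed above share the same `fit`/`predict` interface in scikit-learn, so they can be swapped freely. The sketch below assumes a made-up four-sentence corpus and default hyperparameters. Neural networks are left out to keep it short.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative corpus (an assumption for this sketch)
texts = [
    "football match tonight",
    "the team scored a goal",
    "bake the cake slowly",
    "add salt to the sauce",
]
labels = ["sports", "sports", "cooking", "cooking"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts)
X_new = vectorizer.transform(["the team won the match"])

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=0),
}

# Every classifier is trained and queried through the same two calls
predictions = {
    name: clf.fit(X_train, labels).predict(X_new)[0]
    for name, clf in classifiers.items()
}
for name, predicted in predictions.items():
    print(f"{name}: {predicted}")
```

On a corpus this small the predictions say little about which algorithm is better. The point is only that the training code does not change when the model does.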

Evaluation and Optimization: Metrics (such as accuracy), Hyperparameter Tuning (adjusting model settings such as regularization strength), Cross-Validation (training and testing on different subsets of the data)
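
Cross-validation and hyperparameter tuning can be sketched with `cross_val_score` and `GridSearchCV`. The eight sentences, the two folds, and the grid over `C` are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Illustrative corpus: 0 = sports, 1 = cooking
texts = [
    "football match tonight", "the team scored a goal",
    "the striker won the match", "goal scored in the final",
    "bake the cake slowly", "add salt to the sauce",
    "stir the sauce and bake", "cake recipe with salt",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Cross-validation: train on one half, score on the other, then swap
scores = cross_val_score(pipeline, texts, labels, cv=2, scoring="accuracy")
print("Fold accuracies:", scores)

# Hyperparameter tuning: try several regularization strengths for the classifier
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print("Best parameters:", search.best_params_)
```

The `clf__C` key uses scikit-learn's `step__parameter` naming to reach the `C` parameter of the pipeline step called `clf`.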

## Binary vs. Multi-class Classification

Binary Classification: Categorizing data into two distinct groups

Examples: Spam Filtering (spam vs. not spam), Sentiment Analysis (positive vs. negative)

Characteristics: A single decision boundary, Simpler as it involves only two classes, Commonly used for yes-no type decisions

Multi-Class Classification: Categorizing data into more than two groups

Examples: News Categorization, Product Categorization

Characteristics: Multiple decision boundaries, More complex due to the presence of several classes, Used when each item belongs to exactly one of several distinct categories
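
The difference shows up directly in a fitted model. In the sketch below (made-up texts and labels, `LogisticRegression` as an assumed classifier), the same documents are labeled once with two classes and once with three. The binary model learns a single weight vector, while the multi-class model learns one per class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative documents (an assumption for this sketch)
texts = [
    "win a free prize now",
    "meeting agenda for monday",
    "cheap pills free shipping",
    "project status report attached",
    "football final tonight",
    "team wins the league",
]
X = CountVectorizer().fit_transform(texts)

# Same documents, two labelings
binary_labels = ["spam", "ham", "spam", "ham", "ham", "ham"]
multi_labels = ["spam", "work", "spam", "work", "sports", "sports"]

binary_clf = LogisticRegression().fit(X, binary_labels)
multi_clf = LogisticRegression().fit(X, multi_labels)

# One row of coefficients for binary, one row per class for multi-class
print(binary_clf.coef_.shape[0])  # 1
print(multi_clf.coef_.shape[0])   # 3
```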

### Feature Selection Example

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2 # select some of the best features in our text

texts = ["Sport news", "Cooking blog"]

labels = [0, 1] # 0 for sports, 1 for cooking

X = TfidfVectorizer().fit_transform(texts) # Converting text data into numerical values

s = SelectKBest(chi2, k=2).fit(X, labels) # Select the top features which are relevant


## Text Preprocessing and Vectorization Techniques

Vectorization methods: Bag-of-Words, TF-IDF, Word Embeddings

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["Machine Learning is fascinating"]

# Initialize and apply TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

print(tfidf_matrix.toarray())

[[0.5 0.5 0.5 0.5]]
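
The four equal values can be worked out by hand, assuming scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`). With a single document, $n = 1$ and every term has document frequency $\mathrm{df}(t) = 1$, so

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 = \ln\frac{2}{2} + 1 = 1 .
$$

Each of the four words occurs once, so the unnormalized vector is $(1, 1, 1, 1)$. Its L2 norm is $\sqrt{1^2 + 1^2 + 1^2 + 1^2} = 2$, and dividing by it gives

$$
\frac{1}{2}\,(1, 1, 1, 1) = (0.5,\ 0.5,\ 0.5,\ 0.5) .
$$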


## Preprocessing the Profiles Dataset

In [11]:
# Setup packages
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder, normalize

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

import re

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/deepshah/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/deepshah/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [12]:
# Load Dataset
df = pd.read_csv('Demo Profiles.csv')
df.head()

Unnamed: 0,first_name,last_name,company,position,industry,location
0,John,Doe,ABC Corp,Marketing Manager,Technology,San Francisco
1,Jane,Smith,XYZ Inc,Social Media Specialist,Advertising & Marketing,New York
2,Michael,Johnson,123 Company,Digital Marketing Analyst,Consulting,Chicago
3,Sarah,Williams,ABC Corp,Content Writer,Media & Publishing,London
4,David,Brown,XYZ Inc,Brand Manager,Consumer Goods,Miami


In [15]:
# Text Preprocessing Techniques
def preprocess_text(text):
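    # Lowercase so that "Manager" and "manager" map to the same token
    text = text.lower()

    # Strip punctuation, then digits (raw strings avoid invalid-escape warnings)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)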
    
    # Tokenization
    tokens = nltk.word_tokenize(text)
    
    # Remove Stop Words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    
    # Stemming
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(word) for word in tokens]
    return ' '.join(stemmed)
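
The cleaning steps at the top of the function can be tried in isolation with the standard library only. The helper name `clean` and the input string are hypothetical, chosen to show two side effects: removing punctuation fuses hyphenated words, and removing digits leaves fragments such as "nd" behind.

```python
import re

def clean(text):
    """Lowercase, then strip punctuation and digits (no tokenization or stemming)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop everything except word chars and whitespace
    text = re.sub(r"\d+", "", text)      # drop runs of digits
    return text

cleaned = clean("Sr. Marketing-Manager (2nd shift)!")
print(cleaned)  # sr marketingmanager nd shift
```

If hyphenated titles matter for the task, replacing punctuation with a space instead of an empty string is one possible variation.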

In [18]:
# Apply the technique
df['processed_position'] = df['position'].apply(preprocess_text)
df.head()

Unnamed: 0,first_name,last_name,company,position,industry,location,processed_position
0,John,Doe,ABC Corp,Marketing Manager,Technology,San Francisco,market manag
1,Jane,Smith,XYZ Inc,Social Media Specialist,Advertising & Marketing,New York,social media specialist
2,Michael,Johnson,123 Company,Digital Marketing Analyst,Consulting,Chicago,digit market analyst
3,Sarah,Williams,ABC Corp,Content Writer,Media & Publishing,London,content writer
4,David,Brown,XYZ Inc,Brand Manager,Consumer Goods,Miami,brand manag


In [20]:
# Process of Text Vectorization

# A: Bag-of-Words
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(df['processed_position'])
print(bow_matrix.toarray())

# B: TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(df['processed_position'])
print(X.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 1 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 1 0 0]]
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.60225663 0.         0.        ]
 [0.         0.         0.62026425 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.86288949 0.        ]
 [0.         0.         0.         ... 0.60225663 0.         0.        ]]


In [22]:
# Normalize the vectorized data
# (TfidfVectorizer already applies L2 normalization by default, so the values do not change)
normalized_matrix = normalize(X, norm='l2', axis=1)
print(normalized_matrix.toarray())

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.60225663 0.         0.        ]
 [0.         0.         0.62026425 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.86288949 0.        ]
 [0.         0.         0.         ... 0.60225663 0.         0.        ]]
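
The normalized matrix printed above is identical to the TF-IDF matrix. A short check on a few of the processed positions illustrates why, assuming `TfidfVectorizer` is used with its default `norm='l2'`: the rows already have unit length, so normalizing again changes nothing.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# A few processed positions taken from the table above
docs = ["market manag", "social media specialist", "digit market analyst"]

X_demo = TfidfVectorizer().fit_transform(docs)

# Row lengths before any explicit normalization
row_norms = np.linalg.norm(X_demo.toarray(), axis=1)
print(row_norms)

# Normalizing again returns the same values
renormalized = normalize(X_demo, norm="l2", axis=1)
unchanged = np.allclose(X_demo.toarray(), renormalized.toarray())
print(unchanged)
```

An explicit `normalize` call is still useful after `CountVectorizer`, or when the vectorizer is created with `norm=None`.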


In [23]:
# List the unique values of the target variable
unique_values = df['industry'].unique()
for i, value in enumerate(unique_values, 0):
    print(f"{i}. {value}")

0. Technology
1. Advertising & Marketing
2. Consulting
3. Media & Publishing
4. Consumer Goods
5. E-commerce
6. Fashion & Apparel
7. Beauty & Cosmetics
8. Market Research
9.  Marketing Coordinator


In [24]:
# Encode the target variable (classes are numbered in sorted order, not in the order listed above)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['industry'])
print(y)

[9 1 3 8 4 5 6 9 1 3 8 4 2 7 9 1 8 5 6 9 3 8 4 1 5 6 9 1 3 8 4 9 5 6 8 4 2
 7 9 1 8 5 6 9 3 8 4 1 5 6 9 1 3 8 4 9 5 6 8 4 2 7 9 1 8 5 6 9 3 8 4 1 5 6
 9 1 3 8 4 9 5 6 8 4 2 7 9 1 0 5 6 9 3 8 4 1 5 6 9 1]
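
The integer codes do not follow the numbered list printed earlier: the first row's industry is Technology, listed as 0, but it is encoded as 9. `LabelEncoder` sorts the class names, and a value with a leading space sorts before every letter. The sketch below uses a small made-up list (an assumption) to show how to recover the mapping with `classes_` and `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values; note the leading space in the third entry
industries = ["Technology", "Consulting", " Marketing Coordinator", "Technology"]

encoder = LabelEncoder()
codes = encoder.fit_transform(industries)
print(codes)  # [2 1 0 2]

# classes_ holds the labels in encoded order: index i is the label for code i
code_to_label = dict(enumerate(encoder.classes_))
print(code_to_label)

# inverse_transform turns codes back into labels
decoded = encoder.inverse_transform([2])[0]
print(decoded)  # Technology
```

Stripping whitespace from the column before encoding (for example with `df['industry'].str.strip()`) would stop a stray space from creating its own class. Whether " Marketing Coordinator" belongs in the industry column at all is a separate data-quality question.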
