# Text Classification

Text classification is the process of assigning text to predefined categories. A typical workflow has five steps:

1. Data Collection: Gather text from a source such as a website or a database.
2. Preprocessing: Remove anything that is not needed to understand the context and meaning of the text.
3. Feature Extraction: Identify the key features that help determine what the text means and how it can be sorted into categories.
4. Model Training: Pick a classification model and train it on labeled data so that it learns which category each text belongs to.
5. Prediction: Use the trained model to decide which class a new text belongs to.
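The five steps above can be sketched end to end with scikit-learn. This is a minimal illustration, not a recommended setup: the toy corpus, the label names, and the choice of `MultinomialNB` are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1. Data collection: a tiny hand-written corpus (hypothetical data)
train_texts = [
    "the team won the football match",
    "a great goal in the final game",
    "bake the cake with butter and sugar",
    "simmer the soup and add fresh basil",
]
train_labels = ["sports", "sports", "cooking", "cooking"]

# 2-4. Preprocessing, feature extraction, and model training in one pipeline.
# TfidfVectorizer lowercases and tokenizes the text; stop_words="english"
# drops common English words before the TF-IDF weights are computed.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

# 5. Prediction: classify texts the model has not seen before
predictions = model.predict(
    ["the match ended with a late goal", "add sugar to the cake"]
)
print(predictions)
```

Wrapping the vectorizer and the classifier in one pipeline means new text passes through exactly the same transformation that was fitted on the training data.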

## Components of Text Classification Systems

Data Sources: Documents, online articles, and data collected through web scraping and APIs

Preprocessing steps: Cleaning -> Tokenization -> Normalization -> Stop-Word Removal -> Stemming and Lemmatization
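The preprocessing chain can be sketched with the standard library alone. Everything named here is an assumption for illustration: the short `STOP_WORDS` set stands in for a real stop-word list, and `crude_stem` is a deliberately simple suffix stripper, not the Porter stemmer or a lemmatizer.

```python
import re

# Small illustrative stop-word list (assumption; real lists are much longer)
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "are"}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                 # clean: drop HTML-like tags
    tokens = re.findall(r"[A-Za-z]+", text)              # tokenize: alphabetic runs
    tokens = [t.lower() for t in tokens]                 # normalize: lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [crude_stem(t) for t in tokens]               # stem


print(preprocess("<p>The players are training in the rain!</p>"))
# ['player', 'train', 'rain']
```

In practice a library such as NLTK or spaCy would usually replace the hand-written stop-word list and stemmer.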

Feature Extraction: Vectorization (transforming text into numerical values), Embeddings (capturing semantic meaning)

Classification Algorithms: Naive Bayes, Logistic Regression, Support Vector Machine, Decision Trees, Random Forest, Neural Networks

Evaluation and Optimization (to improve accuracy): Metrics, Hyperparameter Tuning (adjusting model settings that are chosen before training), Cross-Validation (training and testing on different subsets of the data)
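Cross-validation and hyperparameter tuning can be combined as in the sketch below. The toy dataset, the two-fold split, and the candidate values for `C` are assumptions chosen only to keep the example small; scores on eight sentences say nothing about real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 0 = sports, 1 = cooking
texts = [
    "the team won the match", "a late goal won the game",
    "the coach praised the team", "fans cheered the winning goal",
    "bake the cake with sugar", "add butter to the cake batter",
    "stir the soup and add salt", "season the soup with fresh basil",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Cross-validation: train on one half, test on the other, then swap
scores = cross_val_score(pipeline, texts, labels, cv=2, scoring="accuracy")
print("mean accuracy:", scores.mean())

# Hyperparameter tuning: try several regularization strengths C and keep
# the one with the best cross-validated accuracy
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print("best parameters:", search.best_params_)
```

The parameter name `logisticregression__C` follows the step name that `make_pipeline` generates from the class name.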

## Binary vs. Multi-class Classification

Binary Classification: Categorizing data into two distinct groups

Examples: Email filtering (spam vs. not spam), sentiment analysis (positive vs. negative)

Characteristics: A single decision boundary, simpler because it involves only two classes, commonly used for yes/no decisions

Multi-Class Classification: Categorizing data into more than two groups

Examples: News Categorization, Product Categorization

Characteristics: Multiple decision boundaries, more complex because several classes are present, used when each item belongs to exactly one of several distinct categories
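The difference shows up directly in a model's output. In this sketch (toy sentences and label names are assumptions), the same pipeline is trained once on two classes and once on three, and `predict_proba` returns one probability per class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train(texts, labels):
    """Fit a bag-of-words + logistic regression pipeline (illustrative helper)."""
    model = make_pipeline(CountVectorizer(), LogisticRegression())
    return model.fit(texts, labels)


# Binary: two classes
binary = train(
    ["win a free prize now", "meeting moved to monday"],
    ["spam", "ham"],
)

# Multi-class: three classes, each text belongs to exactly one
multi = train(
    ["the striker scored twice", "parliament passed the bill", "new phone chip released"],
    ["sports", "politics", "tech"],
)

print(binary.classes_, binary.predict_proba(["free prize"]).shape)  # 2 columns
print(multi.classes_, multi.predict_proba(["the bill"]).shape)      # 3 columns
```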

### Feature Selection Example

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2  # chi-squared feature scoring

texts = ["Sport news", "Cooking blog"]
labels = [0, 1]  # 0 for sports, 1 for cooking

# Convert the text into TF-IDF feature vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Keep the k=2 features with the highest chi-squared scores against the labels
selector = SelectKBest(chi2, k=2).fit(X, labels)
X_selected = selector.transform(X)
print(vectorizer.get_feature_names_out()[selector.get_support()])


## Text Preprocessing and Vectorization Techniques

Vectorization methods: Bag-of-Words, TF-IDF, Word Embeddings

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["Machine Learning is fascinating"]

# Initialize and apply TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

print(tfidf_matrix.toarray())

[[0.5 0.5 0.5 0.5]]
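Every weight in the output above is 0.5, and the short calculation below reproduces that by hand. It assumes scikit-learn's default `TfidfVectorizer` settings: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row.

```python
import math

tokens = ["machine", "learning", "is", "fascinating"]
n_docs = 1    # the corpus contains a single document
doc_freq = 1  # every token appears in that one document

# Smoothed inverse document frequency: ln(2 / 2) + 1 = 1.0
idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Each token occurs once, so its raw TF-IDF weight is 1 * idf = 1.0
raw = [1 * idf for _ in tokens]

# L2 normalization: divide by sqrt(1 + 1 + 1 + 1) = 2.0
norm = math.sqrt(sum(w * w for w in raw))
weights = [w / norm for w in raw]
print(weights)  # [0.5, 0.5, 0.5, 0.5]
```

With only one document, IDF cannot distinguish rare words from common ones, so all four tokens get the same weight. The weights differ once the corpus has several documents with different vocabularies.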


## Preprocessing the Profiles Dataset