## Practical Exercises with Text Processing (Exercise Solutions)
In this final section, we work through practical exercises: preprocessing text data, vectorizing it, extracting features, and building a multiclass classification model. The exercises reinforce the concepts from this chapter and show how to apply text processing and classification techniques in our own machine learning projects.

### Exercise 1: Preprocessing and Vectorizing Text
In this exercise, we’ll preprocess raw text data by lowercasing it and then transform it into numerical features with TF-IDF vectorization.


In [43]:
# Load libraries
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the Dataset
texts = [
    "Data preprocessing is essential.",
    "Vectorization transforms text.",
    "Clean data improves model performance.",
    "Machine learning algorithms use numerical data."
]

# Preprocess the Data (basic cleaning: lowercase the text)
# Note: TfidfVectorizer also lowercases by default (lowercase=True)
texts_cleaned = [text.lower() for text in texts]

# Vectorize the Text
vectorizer = TfidfVectorizer()
X_train_vect = vectorizer.fit_transform(texts_cleaned)


### Exercise 2: Feature Extraction with N-grams
In this exercise, we'll extract n-gram features (unigrams and bigrams) from text so that the representation captures some local word order, which can help in classification tasks.


In [44]:
# Load libraries
from sklearn.feature_extraction.text import CountVectorizer

# Load the Dataset
texts = [
    "Feature extraction is crucial.",
    "N-grams capture context.",
    "Models benefit from good features.",
    "Contextual information enhances classification."
]

# Extract N-gram Features (ngram_range=(1, 2) keeps unigrams and bigrams)
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_ngram = ngram_vectorizer.fit_transform(texts)


### Exercise 3: Building and Evaluating a Multiclass Classifier
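Here, we will build, train, and evaluate a multiclass classifier using logistic regression. Because the toy dataset has only four documents, one per class, the 50% split leaves the two test classes unseen during training, so the resulting report illustrates the workflow rather than real model quality.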


In [None]:
# Load libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load the Dataset
texts = [
    "This is an economic news article.",
    "Sports events are exciting.",
    "Technology advances rapidly.",
    "Politics affect global relations."
]
labels = ["Economy", "Sports", "Technology", "Politics"]

# Vectorize Text
vectorizer = TfidfVectorizer()
X_vect = vectorizer.fit_transform(texts)

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X_vect, labels, test_size=0.5, random_state=2024)

# Train the Classifier
clf = LogisticRegression(random_state=2024)
clf.fit(X_train, y_train)

# Make Predictions and Evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
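
With one document per class, no split can place every class in both the training and test sets. The sketch below shows one way the same workflow might look with a few more examples per class. The corpus is hypothetical (invented sentences, four per class), and the use of `make_pipeline`, `stratify`, `max_iter=1000`, and `zero_division=0` are choices made for this illustration rather than part of the original exercise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: four invented sentences per class
texts = [
    "The central bank raised interest rates again.",
    "Inflation slowed as consumer prices stabilized.",
    "The stock market closed higher on strong earnings.",
    "Unemployment fell as the economy added jobs.",
    "The team won the championship final.",
    "The striker scored twice in the second half.",
    "The tennis tournament begins next week.",
    "The coach praised the players after the match.",
    "The new smartphone features a faster processor.",
    "Researchers released an open source software library.",
    "The startup builds chips for cloud computing.",
    "The software update fixes several security bugs.",
    "The parliament passed the new election law.",
    "The president met foreign leaders at the summit.",
    "Voters head to the polls for the election.",
    "The senate debated the proposed government budget.",
]
labels = (
    ["Economy"] * 4 + ["Sports"] * 4 + ["Technology"] * 4 + ["Politics"] * 4
)

# Split the raw texts first; stratify keeps every class in both splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2024
)

# The pipeline fits the vectorizer on the training texts only
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000, random_state=2024),
)
model.fit(X_train, y_train)

# zero_division=0 reports 0 instead of warning when a class is never predicted
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
```

Splitting before vectorizing means the vocabulary and IDF weights are learned from the training texts alone. A corpus this small is still far too limited for the scores to mean much, so the output should be read as a demonstration of the steps only.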
