# ***Numeric classification***

## **Numeric classification using ml**

*Gradient boosting is well supported by Python libraries and is a common choice for classifying numeric (tabular) data. Here are the main four:*

*   XGBoost: eXtreme Gradient Boosting
*   LightGBM: Light Gradient Boosting Machine
*   CatBoost: Categorical Boosting
*   Scikit-learn: Has gradient boosting estimators for both regression and classification

Of these, LightGBM is designed for fast training, so we use it to make predictions here.






We will predict the cut quality of diamonds based on their price and other physical measurements. This dataset is built into the Seaborn library.

***STEPS***

**Import Necessary Libraries**: Import libraries for data handling (pandas), visualization (seaborn), preprocessing, model building, and evaluation from sklearn. Import LGBMClassifier from lightgbm.

**Load Dataset**: Load the diamonds dataset using seaborn and store it in a DataFrame.

**Prepare Features and Target**: Separate the features (X) by dropping the target column (cut) and define the target (y) as the cut column.

**Split the Data**: Use train_test_split to split the data into training (80%) and testing (20%) sets to train and evaluate the model.

**Identify Feature Types**: Identify categorical and numerical features in the dataset using select_dtypes to handle them differently in preprocessing.

**Set Up Preprocessing Steps**: Use ColumnTransformer to create preprocessing steps:

*   Apply OneHotEncoder to categorical features.
*   Apply StandardScaler to numerical features.

**Create the Pipeline**: Set up a Pipeline that first applies the preprocessing steps and then fits the LGBMClassifier for classification.

**Perform Cross-Validation**: Use cross_val_score to perform 5-fold cross-validation on the training data to estimate model performance.

**Train the Model**: Fit the pipeline to the training data to train the LGBMClassifier with the preprocessed features.

**Evaluate and Print Results**: Predict on the test set, generate a classification report using classification_report, and print the mean cross-validation accuracy and detailed classification metrics.
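To make the cross-validation step concrete, here is a minimal sketch of how row indices can be partitioned into five folds, with each fold serving once as the validation set. The helper name `kfold_indices` is hypothetical, and this plain split ignores class balance; for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, which this sketch does not reproduce.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs for plain, unshuffled k-fold."""
    # Distribute the remainder so fold sizes differ by at most one
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(n_samples=12, n_splits=5))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: validate on {val_idx}, train on {len(train_idx)} rows")
```

Every row is used for validation exactly once, and the reported cross-validation score is the mean of the per-fold scores.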

In [None]:
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from lightgbm import LGBMClassifier



In [None]:
diamonds = sns.load_dataset("diamonds")
X = diamonds.drop("cut", axis=1)
y = diamonds["cut"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size=0.2, random_state=42
)

In [None]:
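# In seaborn's copy of the dataset, "color" and "clarity" use the pandas
# "category" dtype, so include it alongside "object".
categorical_features = X.select_dtypes(
   include=["object", "category"]
).columns.tolist()

numerical_features = X.select_dtypes(
   include=["float64", "int64"]
).columns.tolist()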
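A small sketch of how `select_dtypes` treats pandas `category` columns differently from `object` columns. The toy DataFrame and its values are made up for illustration; the point is that a `category` column is only selected when `"category"` is listed explicitly. Checking `X.dtypes` on the real data shows which case applies.

```python
import pandas as pd

# Toy frame with hypothetical values, one column per dtype of interest
toy = pd.DataFrame(
    {
        "carat": [0.23, 0.31, 1.01],
        "price": [326, 335, 4500],
        "color": pd.Categorical(["E", "I", "G"]),
        "note": pd.Series(["a", "b", "c"], dtype=object),
    }
)

only_object = toy.select_dtypes(include=["object"]).columns.tolist()
object_or_category = toy.select_dtypes(include=["object", "category"]).columns.tolist()
numeric = toy.select_dtypes(include="number").columns.tolist()

print(only_object)
print(object_or_category)
print(numeric)
```

Any column that lands in neither list is silently dropped by a `ColumnTransformer` with the default `remainder="drop"`, so it is worth confirming that the two lists together cover every feature.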

In [None]:
preprocessor = ColumnTransformer(
   transformers=[
       ("cat", OneHotEncoder(), categorical_features),
       ("num", StandardScaler(), numerical_features),
   ]
)
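As a rough illustration of what the two transformers compute, the sketch below standardizes a numeric column and one-hot encodes a categorical one in plain Python. The helper names and sample values are hypothetical; the scikit-learn classes additionally remember the fitted statistics and categories so the same transformation can be applied to the test set.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Map each value to a 0/1 vector with one slot per sorted category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


scaled = standardize([2.0, 4.0, 6.0])
encoded, categories = one_hot(["E", "G", "E"])
print(scaled)
print(categories, encoded)
```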

In [None]:
pipeline = Pipeline(
   [
       ("preprocessor", preprocessor),
       ("classifier", LGBMClassifier(random_state=42)),
   ]
)

In [None]:
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
report = classification_report(y_test, y_pred)


In [None]:
print(f"Mean Cross-Validation Accuracy: {cv_scores.mean():.4f}")
print("\nClassification Report:")
print(report)

Mean Cross-Validation Accuracy: 0.7945

Classification Report:
              precision    recall  f1-score   support

        Fair       0.93      0.90      0.91       335
        Good       0.80      0.69      0.74      1004
       Ideal       0.82      0.92      0.87      4292
     Premium       0.82      0.82      0.82      2775
   Very Good       0.69      0.59      0.64      2382

    accuracy                           0.80     10788
   macro avg       0.81      0.78      0.80     10788
weighted avg       0.80      0.80      0.79     10788
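The difference between the `macro avg` and `weighted avg` rows can be reproduced by hand. This sketch recomputes both averages for recall from the rounded per-class values in the report above, so the results can differ from the report in the last digit.

```python
# Per-class recall and support copied from the report above
recall = {"Fair": 0.90, "Good": 0.69, "Ideal": 0.92, "Premium": 0.82, "Very Good": 0.59}
support = {"Fair": 335, "Good": 1004, "Ideal": 4292, "Premium": 2775, "Very Good": 2382}

total = sum(support.values())

# Macro average: every class counts equally
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The weighted figure is higher here because the largest class, Ideal, also has the highest recall, while the macro figure is pulled down by the weaker Good and Very Good classes.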



## **Numeric classification using dl**

***Steps***

**Import Libraries**: Load pandas, seaborn, scikit-learn, and TensorFlow libraries.

**Load Dataset**: Load and separate features (X) and target (y) from the "diamonds" dataset.

**One-Hot Encode Target**: Convert the categorical target y to one-hot encoded format.

**Split Data**: Divide the data into training and testing sets.

**Identify Feature Types:** Determine categorical and numerical features in X.

**Set Up Preprocessing:** Use ColumnTransformer for preprocessing categorical and numerical features.

**Determine Input Shape:** Calculate the input shape dynamically after preprocessing.

**Define Neural Network**: Create a Keras model with Input layer and Dense layers.

**Wrap with KerasClassifier**: Use KerasClassifier to integrate the model into a scikit-learn pipeline.

**Create Pipeline and Train**: Build a pipeline, fit it to the training data, and evaluate using classification_report.
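The input shape matters because it fixes the size of the first weight matrix. As a worked example, the sketch below counts trainable parameters for a stack of Dense layers; the defaults (32 and 16 hidden units, 5 output classes) mirror the network defined in the code that follows, while `input_width=10` is only a placeholder and the helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


def count_params(input_width, hidden=(32, 16), n_classes=5):
    """Total trainable parameters for input -> hidden layers -> softmax output."""
    total = 0
    width = input_width
    for units in list(hidden) + [n_classes]:
        total += dense_params(width, units)
        width = units
    return total


# input_width=10 is a placeholder; the real value is the number of columns
# produced by the fitted preprocessor
print(count_params(input_width=10))
```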

In [None]:

!pip install scikeras



In [None]:
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from scikeras.wrappers import KerasClassifier

In [None]:
diamonds = sns.load_dataset("diamonds")

In [None]:
X = diamonds.drop("cut", axis=1)
y = diamonds["cut"]

In [None]:
y = pd.get_dummies(y).values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Include "category" so columns stored with the pandas category dtype are selected too
categorical_features = X.select_dtypes(include=["object", "category"]).columns.tolist()
numerical_features = X.select_dtypes(include=["float64", "int64"]).columns.tolist()

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(), categorical_features),
        ("num", StandardScaler(), numerical_features),
    ]
)

In [None]:
def create_nn_model(input_shape):
    model = Sequential()
    model.add(Input(shape=(input_shape,)))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(y_train.shape[1], activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
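To show what the final `softmax` layer and the `categorical_crossentropy` loss compute, here is a minimal plain-Python sketch. The scores and the one-hot target are made up, and Keras applies the same idea to whole batches with its own numerical safeguards.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probs, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p + eps) for t, p in zip(one_hot_target, probs))


probs = softmax([2.0, 1.0, 0.1, -1.0, 0.5])  # hypothetical scores for 5 classes
loss = categorical_crossentropy([1, 0, 0, 0, 0], probs)
print([round(p, 3) for p in probs], round(loss, 3))
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1, which is what training pushes the network toward.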

In [None]:
X_train_transformed = preprocessor.fit_transform(X_train)
input_shape = X_train_transformed.shape[1]

In [None]:
nn_classifier = KerasClassifier(
    model=create_nn_model,
    epochs=50,
    batch_size=32,
    verbose=0,
    model__input_shape=input_shape
)

In [None]:
pipeline = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", nn_classifier),
    ]
)

In [None]:
pipeline.fit(X_train, y_train)

In [None]:
y_pred = pipeline.predict(X_test)

In [None]:
y_test_labels = y_test.argmax(axis=1)
y_pred_labels = y_pred.argmax(axis=1)
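Because `argmax` returns column positions, the report that follows labels classes as 0 to 4 rather than by name. The sketch below maps indices back to names in plain Python. The `class_names` order is an assumption (it should match the column order of the one-hot target, which can be checked with `pd.get_dummies(diamonds["cut"]).columns`), and the helper names are hypothetical.

```python
# Assumed column order of the one-hot target; verify against your data
class_names = ["Ideal", "Premium", "Very Good", "Good", "Fair"]


def argmax(row):
    """Index of the largest entry in a row."""
    return max(range(len(row)), key=lambda i: row[i])


def decode(rows, names):
    """Convert one-hot (or probability) rows back to class names."""
    return [names[argmax(row)] for row in rows]


rows = [
    [1, 0, 0, 0, 0],                # one-hot row
    [0.1, 0.2, 0.6, 0.05, 0.05],    # probability row
]
print(decode(rows, class_names))
```

Once the order is confirmed, the same list can be passed to `classification_report` through its `target_names` parameter to get a named report.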

In [None]:
report = classification_report(y_test_labels, y_pred_labels)
print("\nClassification Report:")
print(report)


Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.92      0.87      4292
           1       0.86      0.75      0.80      2775
           2       0.63      0.60      0.61      2382
           3       0.70      0.69      0.69      1004
           4       0.93      0.81      0.86       335

    accuracy                           0.78     10788
   macro avg       0.79      0.75      0.77     10788
weighted avg       0.78      0.78      0.78     10788



# ***Text Classification***

> Text classification is the process of assigning predefined labels or categories to text data based on its content. It involves training a model to recognize patterns in text and classify it accordingly. This technique is commonly used in applications like spam detection, sentiment analysis, and topic categorization.











## Performing text classification using ml
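We use TextBlob's sentiment polarity score to classify reviews from the movie review dataset available in the NLTK library. TextBlob's default analyzer is lexicon-based rather than a trained Naive Bayes model, so nothing is fitted on the training split here.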

Here are the steps involved in using `TextBlob` for sentiment analysis on the movie reviews dataset, outlined in 10 points:

1. **Install Libraries**:
   - Ensure you have `TextBlob` installed using `!pip install textblob`.

2. **Download NLTK Data**:
   - Download the `movie_reviews` dataset from NLTK using `nltk.download('movie_reviews')`.

3. **Load Dataset**:
   - Load movie reviews data from the NLTK `movie_reviews` corpus. This corpus contains text data and associated labels (categories).

4. **Prepare Data**:
   - Combine words from each review into a single string and label each review with its corresponding sentiment (`pos` or `neg`).

5. **Create DataFrame**:
   - Organize the text and labels into a Pandas DataFrame for easier manipulation and splitting.

6. **Split Data**:
   - Divide the dataset into training and test sets using `train_test_split` from `sklearn`. This helps in evaluating the model's performance on unseen data.

7. **Define Classifier**:
   - Create a classification function using `TextBlob`. This function analyzes sentiment polarity and assigns a label based on whether the polarity is positive or negative.

8. **Predict**:
   - Use the defined `TextBlob` classifier to predict sentiments on the test set.

9. **Evaluate**:
   - Calculate the accuracy of the predictions and generate a classification report using `accuracy_score` and `classification_report` from `sklearn`.

10. **Review Results**:
    - Print the accuracy and classification report to evaluate the performance of the sentiment classification.

This process leverages `TextBlob`'s sentiment analysis for a straightforward binary classification task, focusing on positive and negative sentiments in movie reviews.
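The classification rule in step 7 reduces to a threshold on the polarity score. The sketch below isolates that rule so its edge cases are visible; the scores are invented, and the `threshold` parameter is a hypothetical addition (the notebook code uses a fixed cut-off of 0).

```python
def label_from_polarity(polarity, threshold=0.0):
    """Return 'pos' when polarity is strictly above the threshold, else 'neg'."""
    return "pos" if polarity > threshold else "neg"


# Hypothetical polarity scores in TextBlob's [-1.0, 1.0] range
scores = [0.35, 0.02, 0.0, -0.10]

default_labels = [label_from_polarity(s) for s in scores]
stricter_labels = [label_from_polarity(s, threshold=0.05) for s in scores]
print(default_labels)
print(stricter_labels)
```

With a cut-off of 0, any review whose polarity is even slightly positive is labeled `pos`, so the choice of threshold directly trades recall on one class against the other.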

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
!pip install textblob





In [None]:
import nltk
from nltk.corpus import movie_reviews
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd

nltk.download('movie_reviews')

docs = [(list(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)]

texts = [' '.join(doc) for doc, _ in docs]
labels = [label for _, label in docs]

df = pd.DataFrame({'text': texts, 'label': labels})

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

def textblob_classify(text):
    blob = TextBlob(text)

    return 'pos' if blob.sentiment.polarity > 0 else 'neg'

y_pred = [textblob_classify(text) for text in X_test]

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


Accuracy: 0.61

Classification Report:
              precision    recall  f1-score   support

         neg       0.91      0.24      0.38       199
         pos       0.56      0.98      0.72       201

    accuracy                           0.61       400
   macro avg       0.74      0.61      0.55       400
weighted avg       0.73      0.61      0.55       400
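The F1 column is the harmonic mean of precision and recall, which explains why the `neg` class scores so low despite its high precision. A short worked check, using the rounded values from the report above (so the last digit may differ for other rows):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values for the 'neg' class, copied from the report above
neg_f1 = f1_from(0.91, 0.24)
print(f"neg F1: {neg_f1:.2f}")
```

The harmonic mean stays close to the smaller of its two inputs, so a recall of 0.24 caps the F1 score even when precision is 0.91.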



## Text classification using dl

To perform text classification using the transformers library by Hugging Face, you can use pre-trained models such as BERT, DistilBERT, or other transformer-based models. Here's how you can set up and run text classification on the IMDb movie review dataset using transformers.

**Steps:**

**Install Required Libraries:**

Ensure you have transformers and datasets installed.
Use !pip install transformers datasets if not already installed.

**Load and Prepare the Dataset:**

Use the datasets library to load the IMDb movie review dataset.
Convert it into a format suitable for transformers.

**Tokenize Text:**

Use a tokenizer from a pre-trained model to convert text data into tokens.

**Create DataLoaders:**

Create training and validation datasets using the tokenized text.

**Define Model:**

Load a pre-trained transformer model suitable for text classification.

**Train and Evaluate:**

Fine-tune the model on your dataset and evaluate its performance.
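To illustrate what padding and truncation do during the tokenization step, here is a toy sketch in plain Python. It is not BERT's WordPiece tokenizer: the function name, the whitespace splitting, and the `max_length` value are all simplifications for illustration. A real tokenizer also adds special tokens and maps tokens to integer ids.

```python
def toy_encode(text, max_length=8, pad_token="[PAD]"):
    """Whitespace-tokenize, then truncate or pad to exactly max_length tokens."""
    tokens = text.lower().split()
    tokens = tokens[:max_length]                      # truncation
    attention_mask = [1] * len(tokens)
    padding = max_length - len(tokens)
    tokens = tokens + [pad_token] * padding           # padding to max_length
    attention_mask = attention_mask + [0] * padding   # 0 marks padding positions
    return tokens, attention_mask


tokens, mask = toy_encode("I love this movie", max_length=6)
print(tokens)
print(mask)
```

Fixed-length sequences let examples be stacked into rectangular batches, and the attention mask tells the model which positions are padding and should be ignored.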

In [None]:
!pip install scikit-learn




In [None]:
!pip install transformers datasets


Explanation:
Install Libraries: Ensure transformers and datasets are installed.

Download Dataset: Use the datasets library to download and load the IMDb movie review dataset.

Prepare Data: Convert text and labels into a format suitable for the datasets library.

Tokenize: Use a pre-trained BERT tokenizer to tokenize the text data.

Create DataLoader: Tokenize the dataset and prepare it for training.

Define Model: Load a pre-trained BERT model for sequence classification.

Train Model: Fine-tune the BERT model on your dataset using the Trainer API.

Evaluate Model: Evaluate the model's performance on the validation set.

Predict: Use the model to predict labels for new texts.

Output Predictions: Print the predictions for example texts.

This setup leverages the power of transformer models for text classification and is suitable for various NLP tasks.




In [None]:
import pandas as pd
from datasets import Dataset, load_dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
import torch

# Load the IMDb movie review dataset
dataset = load_dataset("imdb")

# Split the labeled train split into training and held-out sets
# (list() gives train_test_split plain Python sequences to index)
train_texts, test_texts, train_labels, test_labels = train_test_split(
    list(dataset['train']['text']),
    list(dataset['train']['label']),
    test_size=0.2,
    random_state=42
)

# Create pandas DataFrames
train_df = pd.DataFrame({'text': train_texts, 'label': train_labels})
test_df = pd.DataFrame({'text': test_texts, 'label': test_labels})

# Convert to Hugging Face dataset
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)



In [None]:
# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    # Return plain lists here; the Trainer's data collator builds the tensors per batch
    return tokenizer(examples['text'], padding="max_length", truncation=True)

# Tokenize the datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)



In [None]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
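training_args = TrainingArguments(
    output_dir='./results',
    # Evaluate at the end of every epoch. Newer transformers releases name
    # this argument eval_strategy; use that spelling if this one is rejected.
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)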




In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
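As configured, the Trainer is given no metric function, so evaluation reports the loss but not accuracy. One possible addition is a `compute_metrics` function like the sketch below. The function body is an assumption about what you want to measure, and the quick check uses made-up logits and labels.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy from the (logits, labels) pair that Trainer passes in."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Quick check with made-up logits for three examples and two classes
fake_logits = np.array([[0.2, 1.5], [2.0, -1.0], [0.1, 0.3]])
fake_labels = np.array([1, 0, 0])
result = compute_metrics((fake_logits, fake_labels))
print(result)
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; the returned values then appear in the evaluation results with an `eval_` prefix.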

In [None]:
trainer.train()

In [None]:
eval_results = trainer.evaluate()


In [None]:
print(f"Evaluation Results: {eval_results}")

In [None]:
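def predict(texts):
    # Tokenize and move the inputs to the same device as the model
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(model.device)
    model.eval()  # disable dropout for inference
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    # Copy to CPU before converting to NumPy (needed when the model is on a GPU)
    predictions = torch.argmax(logits, dim=-1).cpu().numpy()
    return predictions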


In [None]:
test_texts = ['I love this movie', 'I hate this movie']
predictions = predict(test_texts)
print(predictions)