# Saving a Trained Model to a File

This notebook trains a text classification model, saves it to a file with `pickle` and `joblib`, and loads it back to make predictions on new data.

## How to Solve a Problem Using Machine Learning

1. **Define the problem**: What do you want to predict or classify?
2. **Collect and explore data**: Gather relevant data and understand its structure.
3. **Preprocess data**: Clean, transform, and prepare data for modeling.
4. **Select a model**: Choose an appropriate machine learning algorithm.
5. **Train the model**: Fit the model to your training data.
6. **Evaluate the model**: Test the model's performance on unseen data.
7. **Save the model**: Store the trained model for future use.
8. **Deploy and use the model**: Load the model and make predictions on new data.
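
The modeling steps above (select, train, save, load, predict) can be sketched in miniature. This is only an illustration under stated assumptions: the toy sentences, the labels, and the file name `toy_pipeline.pkl` are made up, evaluation is skipped because six samples are too few to measure anything, and the vectorizer and classifier are bundled with scikit-learn's `make_pipeline`, which is a different arrangement from the separate objects used later in this notebook.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical): 1 = sports, 2 = business.
texts = [
    "team wins the championship game",
    "striker scores twice in the final match",
    "coach praises players after the game",
    "stocks rally as markets close higher",
    "company reports strong quarterly profit",
    "shares fall after weak earnings report",
]
labels = [1, 1, 1, 2, 2, 2]

# Steps 4-5: select and train a model (vectorizer and classifier in one object).
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts, labels)

# Step 7: save the trained pipeline to a file in a temporary directory.
model_path = Path(tempfile.mkdtemp()) / "toy_pipeline.pkl"
with open(model_path, "wb") as f:
    pickle.dump(pipeline, f)

# Step 8: load it back and predict on new text.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

new_texts = ["the team lost the match", "profit and shares rise"]
print(restored.predict(new_texts))
```

Bundling both steps in one pipeline means a single file holds everything needed for prediction, so the loaded object accepts raw text directly.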

## About the Dataset

**AG News Classification Dataset**  
- News articles classified into four categories: World, Sports, Business, and Science/Technology.
- Each sample contains a class index (1-4), a title, and a description.
- The dataset is widely used for text classification tasks.
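
Numeric class indices are harder to read in reports than category names. A minimal sketch of a lookup, assuming the indices 1-4 follow the order in which the categories are listed above (`CLASS_NAMES` and `index_to_name` are hypothetical names, not part of the dataset):

```python
# Assumption: class indices 1-4 follow the category order listed above.
CLASS_NAMES = {1: "World", 2: "Sports", 3: "Business", 4: "Science/Technology"}


def index_to_name(class_index):
    """Return the category name for a class index, or raise on unknown values."""
    try:
        return CLASS_NAMES[int(class_index)]
    except KeyError:
        raise ValueError(f"Unknown class index: {class_index!r}") from None


labels = [index_to_name(i) for i in [3, 1, 4, 2]]
print(labels)
```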

In [4]:
# Train a simple ML model on the AG News data

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load the dataset
df = pd.read_csv('AG News Classification Dataset/train.csv')

# Combine title and description for text features
df['text'] = df['Title'] + ' ' + df['Description']
X = df['text']
y = df['Class Index']

# Split into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize text data
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# Train a logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

# Evaluate the model
y_pred = model.predict(X_val_vec)
print(classification_report(y_val, y_pred))

              precision    recall  f1-score   support

           1       0.92      0.90      0.91      5956
           2       0.95      0.98      0.96      6058
           3       0.87      0.87      0.87      5911
           4       0.89      0.88      0.89      6075

    accuracy                           0.91     24000
   macro avg       0.91      0.91      0.91     24000
weighted avg       0.91      0.91      0.91     24000



## Save Model Using `pickle` Module

In [5]:
import pickle

# Save the model and vectorizer using pickle
with open('news_model_pickle.pkl', 'wb') as f:
    pickle.dump((model, vectorizer), f)

# To load the model later:
# with open('news_model_pickle.pkl', 'rb') as f:
#     loaded_model, loaded_vectorizer = pickle.load(f)
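
Saving a tuple works, but the loader has to remember the order of its elements. One possible alternative is to pickle a dictionary with named keys and record the scikit-learn version used for training. The sketch below is an assumption-laden illustration, not this notebook's method: the toy data, the payload keys, and the file name `news_model_bundle.pkl` are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the notebook's fitted `vectorizer` and `model`.
texts = ["goal scored in the match", "markets fall on weak earnings"]
labels = [2, 3]
vectorizer = TfidfVectorizer().fit(texts)
model = LogisticRegression(max_iter=1000).fit(vectorizer.transform(texts), labels)

# Hypothetical payload layout: named keys plus the library version used for training.
payload = {
    "model": model,
    "vectorizer": vectorizer,
    "sklearn_version": sklearn.__version__,
}

path = Path(tempfile.mkdtemp()) / "news_model_bundle.pkl"
with open(path, "wb") as f:
    pickle.dump(payload, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    bundle = pickle.load(f)

# Warn if the file was written by a different scikit-learn version.
if bundle["sklearn_version"] != sklearn.__version__:
    print("Warning: model was saved with scikit-learn", bundle["sklearn_version"])

features = bundle["vectorizer"].transform(["late goal wins the match"])
print(bundle["model"].predict(features))
```

Recording the version is useful because a pickled estimator is not guaranteed to load correctly under a different scikit-learn release.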

## Save Model Using `joblib` Module

In [6]:
import joblib

# Save the model and vectorizer using joblib
joblib.dump((model, vectorizer), 'news_model_joblib.pkl')

# To load the model later:
# loaded_model, loaded_vectorizer = joblib.load('news_model_joblib.pkl')

['news_model_joblib.pkl']

## Difference Between `pickle` and `joblib`

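- **pickle**: Python's built-in, general-purpose serialization module. It works for most Python objects and needs no extra dependency, but it is not specialized for objects that hold large NumPy arrays.
- **joblib**: Built on top of pickle and optimized for objects that carry large NumPy arrays, as many scikit-learn models do. `joblib.dump` also accepts a `compress` argument to reduce file size.

Files written by either module can execute arbitrary code when loaded, so only load model files from a source you trust.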

**Recommendation:**  
Use `joblib` for scikit-learn models and large data; use `pickle` for general Python objects.
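
How much the choice matters depends on the object being saved, so it is worth measuring on your own data. The sketch below writes the same object three ways and prints the file sizes. The zero-filled array and the file names are hypothetical stand-ins for a large model attribute; real model weights will usually compress far less than zeros do.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np

# A highly repetitive array stands in for a large model attribute (hypothetical data).
data = {"weights": np.zeros((1000, 1000)), "name": "demo"}

tmp_dir = tempfile.mkdtemp()
paths = {
    "pickle": os.path.join(tmp_dir, "data_pickle.pkl"),
    "joblib": os.path.join(tmp_dir, "data_joblib.pkl"),
    "joblib_compressed": os.path.join(tmp_dir, "data_joblib_c3.pkl"),
}

# Write the same object with pickle, joblib, and joblib with compression level 3.
with open(paths["pickle"], "wb") as f:
    pickle.dump(data, f)
joblib.dump(data, paths["joblib"])
joblib.dump(data, paths["joblib_compressed"], compress=3)

for name, path in paths.items():
    print(f"{name}: {os.path.getsize(path)} bytes")

# The compressed file restores the same array as the original.
restored = joblib.load(paths["joblib_compressed"])
print(np.array_equal(restored["weights"], data["weights"]))
```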

## Test the Model Using Saved Files

Below, we demonstrate how to load the saved model and vectorizer using both `pickle` and `joblib`, and evaluate them on the test dataset.

In [7]:
# Test the model loaded from pickle file
import pandas as pd
import pickle
from sklearn.metrics import classification_report

# Load test dataset
test_df = pd.read_csv('AG News Classification Dataset/test.csv')
test_df['text'] = test_df['Title'] + ' ' + test_df['Description']
X_test = test_df['text']
y_test = test_df['Class Index']

# Load model and vectorizer from pickle
with open('news_model_pickle.pkl', 'rb') as f:
    loaded_model, loaded_vectorizer = pickle.load(f)

# Transform test data and predict
y_pred = loaded_model.predict(loaded_vectorizer.transform(X_test))
print('Results using model loaded from pickle:')
print(classification_report(y_test, y_pred))

Results using model loaded from pickle:
              precision    recall  f1-score   support

           1       0.92      0.90      0.91      1900
           2       0.95      0.97      0.96      1900
           3       0.87      0.87      0.87      1900
           4       0.88      0.88      0.88      1900

    accuracy                           0.90      7600
   macro avg       0.90      0.90      0.90      7600
weighted avg       0.90      0.90      0.90      7600



In [8]:
# Test the model loaded from joblib file
import joblib

# Load model and vectorizer from joblib
loaded_model_j, loaded_vectorizer_j = joblib.load('news_model_joblib.pkl')

y_pred_j = loaded_model_j.predict(loaded_vectorizer_j.transform(X_test))
print('Results using model loaded from joblib:')
print(classification_report(y_test, y_pred_j))

Results using model loaded from joblib:
              precision    recall  f1-score   support

           1       0.92      0.90      0.91      1900
           2       0.95      0.97      0.96      1900
           3       0.87      0.87      0.87      1900
           4       0.88      0.88      0.88      1900

    accuracy                           0.90      7600
   macro avg       0.90      0.90      0.90      7600
weighted avg       0.90      0.90      0.90      7600
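
The two reports above match, which is expected because both files store the same fitted objects. Rather than comparing reports by eye, the equivalence can be checked programmatically. A self-contained sketch, using a toy model and hypothetical file names in place of the notebook's AG News model:

```python
import os
import pickle
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the notebook's fitted `vectorizer` and `model`.
texts = [
    "team wins the final match",
    "players celebrate the championship",
    "stocks fall after earnings report",
    "company profit beats forecasts",
]
labels = [2, 2, 3, 3]
vectorizer = TfidfVectorizer().fit(texts)
model = LogisticRegression(max_iter=1000).fit(vectorizer.transform(texts), labels)

# Save the same (model, vectorizer) tuple with both modules.
tmp_dir = tempfile.mkdtemp()
pickle_path = os.path.join(tmp_dir, "toy_pickle.pkl")
joblib_path = os.path.join(tmp_dir, "toy_joblib.pkl")
with open(pickle_path, "wb") as f:
    pickle.dump((model, vectorizer), f)
joblib.dump((model, vectorizer), joblib_path)

# Load both copies back.
with open(pickle_path, "rb") as f:
    model_p, vectorizer_p = pickle.load(f)
model_j, vectorizer_j = joblib.load(joblib_path)

# Compare predictions and learned coefficients across the two copies.
new_texts = ["the match ended in a draw", "profit report lifts stocks"]
pred_p = model_p.predict(vectorizer_p.transform(new_texts))
pred_j = model_j.predict(vectorizer_j.transform(new_texts))

same_predictions = np.array_equal(pred_p, pred_j)
same_coefficients = np.array_equal(model_p.coef_, model_j.coef_)
print(same_predictions, same_coefficients)
```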

