# Model Prediction Improvement

# 1. Introduction

There are several methods that can help improve the prediction performance of models. Here are some commonly used techniques:
   
1. **Data Augmentation**: This refers to techniques that increase the amount of data by adding slightly modified copies of already existing data. For example, in image processing, these techniques could include rotation, scaling, flipping, etc. In text data, it can include methods like back translation or synonym replacement.


2. **Data Cleaning**: This involves taking care of missing values (by either filling them in based on existing data, or removing the data points entirely), and handling outliers (which might distort the training of the model).


3. **Feature Engineering**: This is the process of creating new features from existing data that can help improve model performance. This can involve transformations of existing features, creating interaction features, or any other kind of data manipulation that creates new, useful input for the model.


4. **Model Selection**: This involves choosing the right machine learning algorithm for your specific problem. This could be a linear regression model, a decision tree, a neural network, etc. The choice depends on the nature of your data and the problem you're trying to solve.


5. **Hyperparameter Tuning**: Hyperparameters are parameters that are not learned from the data but are set before training begins. Examples include the learning rate, the number of layers in a neural network, and the number of clusters in K-means clustering. Tuning these can often significantly improve performance. Techniques for hyperparameter tuning include grid search, random search, and more advanced methods like Bayesian optimization.


6. **Cross-Validation**: This is a resampling procedure used to evaluate the performance of a model on a limited data sample. The dataset is partitioned into 'k' equally sized folds; the model is trained on 'k-1' folds and tested on the remaining fold. This process is repeated 'k' times, with each fold serving as the test fold once, so that we obtain one performance score per fold. It helps in assessing how well the model will generalize to an independent data set.


7. **Regularization**: This is a technique used to prevent overfitting, which is when a model performs well on the training data but poorly on unseen data. Regularization works by adding a penalty term to the loss function that increases as the complexity of the model increases.


8. **Ensembling**: This refers to combining different models to improve overall performance. Techniques include Bagging (e.g., Random Forest), Boosting (e.g., Gradient Boosting, XGBoost), and Stacking.


Since data cleaning and feature engineering were covered in the previous sections, this section focuses on the remaining topics: data augmentation, model selection, hyperparameter tuning, cross-validation, regularization, and ensembling.
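
As a concrete illustration of the penalty-term idea in item 7, the following minimal sketch fits ridge regression (an L2-penalized linear model) at three penalty strengths and prints the size of the learned coefficient vector. The synthetic data and the names `X_demo`, `y_demo`, and `coef_norms` are assumptions made for this illustration and are not part of the digits pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data: 50 samples, 20 features, only 3 of them informative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=50)

# A larger alpha means a stronger L2 penalty on the coefficients
alphas = [0.01, 1.0, 100.0]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms.append(np.linalg.norm(ridge.coef_))
    print(f"alpha={alpha:>6}: ||w|| = {coef_norms[-1]:.3f}")
```

The coefficient norm shrinks as `alpha` grows, which is the mechanism by which the penalty limits model complexity.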

# 2. Dataset Exploration

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

In [None]:
from sklearn.datasets import load_digits

# Load the digits dataset
digits = load_digits()

# Create a dataframe
# "digits.data" contains the features and "digits.target" contains the target
df = pd.DataFrame(data=np.c_[digits['data'], digits['target']],
                  columns=digits['feature_names'] + ['target'])

# Separate the features (X) and the target (y)
X = df[digits['feature_names']]
y = df['target']

# Display the dataframe
df.head()

In [None]:
df.info()

As the non-null counts show, the dataset does **not** contain any NaN values.

In [None]:
df.describe()

# 3. Data Augmentation

The `augment_data` function performs data augmentation. It takes the original images and labels as input and, for each image, emits three entries that share the same label: the original image, its horizontal flip (`np.fliplr`), and a 90-degree counterclockwise rotation (`np.rot90` with `k=1`). The results are returned as the arrays `augmented_images` and `augmented_labels`, each three times the length of the input.

In [None]:
# Load the digit dataset
digits = load_digits()
images = digits.images
labels = digits.target

# Data augmentation (optional)
def augment_data(images, labels):
    augmented_images = []
    augmented_labels = []
    for image, label in zip(images, labels):
        augmented_images.append(image)
        augmented_labels.append(label)

        augmented_images.append(np.fliplr(image))
        augmented_labels.append(label)

        augmented_images.append(np.rot90(image, k=1))
        augmented_labels.append(label)

    return np.array(augmented_images), np.array(augmented_labels)

augmented_images, augmented_labels = augment_data(images, labels)
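
Flips and 90-degree rotations can change how a digit reads, so a gentler transformation such as a small translation is sometimes used instead. The sketch below is not part of the original pipeline: `shift_image` is a hypothetical helper, written here with plain NumPy, that moves an image by a few pixels and fills the vacated area with zeros.

```python
import numpy as np

def shift_image(image, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling vacated pixels with zeros."""
    shifted = np.zeros_like(image)
    h, w = image.shape
    # Rows/columns to copy from the source and where they land in the result
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted

# Small stand-in image so the effect is easy to read
demo_image = np.arange(16).reshape(4, 4)
shifted_down = shift_image(demo_image, dy=1, dx=0)
shifted_right = shift_image(demo_image, dy=0, dx=1)
print(shifted_down)
```

Applied to the 8x8 digit images, a one-pixel shift in each direction would give four extra copies per image while keeping the digit upright.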


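Because `augment_data` already returns each original image alongside its flipped and rotated copies, the augmented arrays are the complete dataset. Concatenating them with `images` again would include every original twice, so `all_images` and `all_labels` simply refer to the augmented arrays.

In [None]:
# augment_data already returns the originals plus their augmented copies,
# so no further concatenation is needed (it would duplicate every original)
all_images = augmented_images
all_labels = augmented_labels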

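The `plot_images` function visualizes images together with their labels. It uses Matplotlib to create a grid with `rows` rows and `cols` columns. The cell below plots the first `num_samples` original images in one row, then the first `num_samples` entries of the augmented set in a second row. Because `augment_data` emits each original followed by its flip and its rotation, the second row shows consecutive triples (original, flip, rotation) rather than one augmented version of each image in the first row.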

In [None]:
# Visualize the original images and their augmented counterparts
def plot_images(images, labels, rows, cols):
    fig, axes = plt.subplots(rows, cols, figsize=(10, 10))
    for i, ax in enumerate(axes.flat):
        ax.imshow(images[i], cmap='gray')
        ax.set_title(f"Label: {labels[i]}")
        ax.axis('off')
    plt.show()

num_samples = 10  # Number of images to show in each row
original_images_sample = images[:num_samples]
augmented_images_sample = augmented_images[:num_samples]

plot_images(original_images_sample, labels[:num_samples], 1, num_samples)
# Use augmented_labels so each augmented image is titled with its own label
plot_images(augmented_images_sample, augmented_labels[:num_samples], 1, num_samples)

# 4. Data Pre-processing

The digits dataset from sklearn is a clean dataset: it has no missing values, no categorical features that need to be encoded, and no obvious outliers. Therefore, pre-processing steps such as handling missing values, encoding categorical variables, or outlier treatment are not applicable in this case.

# 5. Feature Engineering

The digits dataset is a set of 8x8 pixel images, and each pixel in the image is a feature. There are a total of 64 features for each image. These features are already in a form that's suitable for machine learning models, so it's typically not necessary to do additional feature engineering.
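
The relationship between the 8x8 images and the 64 features can be checked directly. The short sketch below flattens `digits.images` and compares the result with `digits.data`; the variable names `digits_demo` and `flattened` are chosen for this illustration only.

```python
import numpy as np
from sklearn.datasets import load_digits

digits_demo = load_digits()

# Flatten each 8x8 image into a row of 64 pixel intensities
flattened = digits_demo.images.reshape(len(digits_demo.images), -1)

print(flattened.shape)
# The flattened images are the same values that digits.data provides
print(np.array_equal(flattened, digits_demo.data))
```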

# 6. Model Selection

The cell below splits the digits dataset into training and test sets, then iterates over four models: K-Nearest Neighbors, Support Vector Machine, Random Forest, and Multi-Layer Perceptron. Each model is trained on the training data and used to predict the test data, and its test accuracy is recorded in a results table.

Accuracy is a single summary number. For per-class precision, recall, and F1-score, which show how well a model handles individual digits, `classification_report` (imported in the cell below) can be applied to the same predictions.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2, random_state=42)

# Define a list of model names and corresponding classifier objects
models = [
    ('K-Nearest Neighbors', KNeighborsClassifier(n_neighbors=3)),
    ('Support Vector Machine', SVC()),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('Multi-Layer Perceptron', MLPClassifier(random_state=42))
]

# Collect one result row per model
results = []

# Loop over the models
for model_name, model in models:
    # Train the model on the training data
    model.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = model.predict(X_test)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Record the result (DataFrame.append was removed in pandas 2.0)
    results.append({'Model': model_name, 'Accuracy': accuracy})

# Build the results DataFrame from the collected rows
results_df = pd.DataFrame(results, columns=['Model', 'Accuracy'])

The `sort_values` method sorts the DataFrame in descending order of the 'Accuracy' column. The `reset_index(drop=True)` call then resets the index after sorting, so it starts from 0 without any gaps.

Now, the `results_df` DataFrame will be ranked based on the accuracy of each model, with the highest accuracy at the top.

In [None]:
results_df = results_df.sort_values(by='Accuracy', ascending=False).reset_index(drop=True)
results_df.head()
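
For a per-digit view beyond overall accuracy, `classification_report` can be applied to a model's test predictions. The sketch below does this for a default `SVC` on the same 80/20 split; the names `svc_demo` and `report` are assumptions for this illustration, and any of the four models could be substituted.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits_demo = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits_demo.data, digits_demo.target, test_size=0.2, random_state=42
)

# Fit one model and summarize precision, recall, and F1-score per digit
svc_demo = SVC().fit(X_tr, y_tr)
report = classification_report(y_te, svc_demo.predict(X_te))
print(report)
```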

# 7. Hyperparameter Tuning using GridSearchCV with Visualization

## 7.1 Split the data into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(all_images, all_labels, test_size=0.2, random_state=42)

## 7.2 Run the Grid Search

In [None]:
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001]
}

svm_model = SVC()
grid_search = GridSearchCV(svm_model, param_grid, cv=5, return_train_score=True, n_jobs=-1)
grid_search.fit(X_train.reshape(len(X_train), -1), y_train)
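
Random search, mentioned in the introduction as an alternative to grid search, samples a fixed number of parameter combinations instead of trying all of them. The following is a minimal sketch using `RandomizedSearchCV` on the unaugmented digits data; the log-uniform ranges mirror the grid above, and `n_iter=8` and `cv=3` are arbitrary choices made to keep the example quick.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

digits_demo = load_digits()

# Sample C and gamma on a log scale over the same ranges as the grid
param_distributions = {
    'C': loguniform(1e-1, 1e2),
    'gamma': loguniform(1e-3, 1e0),
}

random_search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=8, cv=3, random_state=42
)
random_search.fit(digits_demo.data, digits_demo.target)

print("Best parameters:", random_search.best_params_)
print("Best CV accuracy:", random_search.best_score_)
```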

## 7.3 Visualize Hyperparameter Tuning Results

In [None]:
import seaborn as sns

# cv_results_ lists C in the outer loop and gamma in the inner loop,
# so the scores reshape into a (len(C), len(gamma)) grid
scores = grid_search.cv_results_['mean_test_score'].reshape(len(param_grid['C']), len(param_grid['gamma']))
sns.heatmap(scores, annot=True, fmt='.3f', xticklabels=param_grid['gamma'], yticklabels=param_grid['C'])
plt.xlabel('Gamma')
plt.ylabel('C')
plt.title('Hyperparameter Tuning Results')
plt.show()

## 7.4 Best Model from Hyperparameter Tuning

In [None]:
# Best Model from Hyperparameter Tuning
best_model = grid_search.best_estimator_

# 8. Cross-Validation

In [None]:
cross_val_scores = cross_val_score(best_model, X_train.reshape(len(X_train), -1), y_train, cv=5)
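
To make the k-fold procedure concrete, the sketch below writes out the loop by hand with `StratifiedKFold`, which is the splitter `cross_val_score` uses for classifiers when `cv` is an integer. It runs on the unaugmented digits data with a default `SVC`; the names `manual_scores` and `reference_scores` are assumptions for this illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digits_demo = load_digits()
X_demo, y_demo = digits_demo.data, digits_demo.target
model_demo = SVC()

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    # Train a fresh copy of the model on k-1 folds
    fold_model = clone(model_demo)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Score it on the held-out fold
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

# The same evaluation through the convenience function
reference_scores = cross_val_score(model_demo, X_demo, y_demo, cv=5)

print("Manual fold scores:", np.round(manual_scores, 3))
print("Mean accuracy:", np.mean(manual_scores))
```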

# 9. Ensemble Techniques (Bagging and Voting)

In [None]:
# The base model is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
bagging_model = BaggingClassifier(best_model, n_estimators=10, random_state=42)
voting_model = VotingClassifier([('svm', best_model), ('bagging', bagging_model)])

# Train the Bagging and Voting models on the training split
bagging_model.fit(X_train.reshape(len(X_train), -1), y_train)
voting_model.fit(X_train.reshape(len(X_train), -1), y_train)
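
Voting tends to help most when the combined models make different kinds of errors. As a variation on the ensemble above, the sketch below votes across three different model families on the unaugmented digits data. The choice of models and the name `diverse_voting` are assumptions for this illustration, not the configuration used in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

digits_demo = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits_demo.data, digits_demo.target, test_size=0.2, random_state=42
)

# Hard voting: each model casts one vote and the majority class wins
diverse_voting = VotingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier(n_neighbors=3)),
        ('svm', SVC()),
        ('rf', RandomForestClassifier(random_state=42)),
    ],
    voting='hard',
)
diverse_voting.fit(X_tr, y_tr)

voting_accuracy = diverse_voting.score(X_te, y_te)
print("Test accuracy (diverse voting):", voting_accuracy)
```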


# 10. Evaluate the Bagging and Voting models on the test set

In [None]:
y_pred_bagging = bagging_model.predict(X_test.reshape(len(X_test), -1))
y_pred_voting = voting_model.predict(X_test.reshape(len(X_test), -1))

test_accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
test_accuracy_voting = accuracy_score(y_test, y_pred_voting)

# 11. Print results

In [None]:
# Print results
print("Cross-Validation Scores:", cross_val_scores)
print("Best Model Parameters:", grid_search.best_params_)
print("Test Accuracy (Best Model):", best_model.score(X_test.reshape(len(X_test), -1), y_test))
print("Test Accuracy (Bagging Model):", test_accuracy_bagging)
print("Test Accuracy (Voting Model):", test_accuracy_voting)