# Implement Baselines

[Source](https://machinelearningmastery.com/implement-baseline-machine-learning-algorithms-scratch-python/)

[Medium article](https://medium.com/@preethi_prakash/understanding-baseline-models-in-machine-learning-3ed94f03d645)

Random classifier: assigns class labels at random, with probabilities that match the class distribution in the training data.

A baseline model, such as a dummy classifier, gives us a comparison point: any more advanced model should beat it to be worth using. This matters most with imbalanced classes, where a trivial strategy can already reach a high accuracy.

Imbalanced classes often lead to the majority class dominating predictions, resulting in high accuracy but poor identification of the minority class. A baseline model helps establish the expected performance level using a random or simplistic approach.
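To make the accuracy trap concrete, here is a minimal sketch in plain Python. The 95/5 label split and the helper name `majority_baseline` are made up for illustration; they are not taken from the dataset used below.

```python
from collections import Counter


def majority_baseline(labels):
    """Predict the most common label for every sample (hypothetical helper)."""
    majority_label, _ = Counter(labels).most_common(1)[0]
    return [majority_label] * len(labels)


# Hypothetical, heavily imbalanced labels: 95 negatives and 5 positives
y_true = [0] * 95 + [1] * 5
y_pred = majority_baseline(y_true)

# Overall accuracy: fraction of samples where prediction and label agree
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of true positives that were found
minority_hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
minority_recall = minority_hits / y_true.count(1)

print("Accuracy:", accuracy)
print("Minority recall:", minority_recall)
```

The accuracy looks strong even though no minority sample is ever identified, which is why accuracy alone is a weak yardstick on imbalanced data.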

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report

In [2]:
# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

In [3]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
# Create a baseline random classifier
dummy_clf = DummyClassifier(strategy='stratified', random_state=42)

In [5]:
# Fit the baseline classifier on the training data
dummy_clf.fit(X_train, y_train)

In [6]:
# Make predictions on the test data
y_pred = dummy_clf.predict(X_test)

In [7]:
# Calculate accuracy and other metrics
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

In [8]:
# Print the results
print("Baseline Classifier Accuracy:", accuracy)
print("Classification Report:")
print(report)

Baseline Classifier Accuracy: 0.5701754385964912
Classification Report:
              precision    recall  f1-score   support

           0       0.42      0.37      0.40        43
           1       0.64      0.69      0.67        71

    accuracy                           0.57       114
   macro avg       0.53      0.53      0.53       114
weighted avg       0.56      0.57      0.56       114
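The 0.57 above comes from a single random draw, so it helps to know what the stratified strategy is expected to score on average. A rough sketch, assuming predictions are drawn independently of the true labels: the expected accuracy is the sum over classes of (training share × test share). The test counts (43 and 71) are read from the support column of the report; the training counts (169 and 286) are an assumption derived from the dataset's documented totals of 212 and 357 samples per class.

```python
# Test counts come from the report's support column.
# Training counts are assumed: dataset totals (212 and 357) minus the test counts.
train_counts = {0: 169, 1: 286}
test_counts = {0: 43, 1: 71}

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())

# Stratified guessing predicts class c with probability equal to its training
# share, and is correct when the true label (test share) happens to match.
expected_stratified = sum(
    (train_counts[c] / n_train) * (test_counts[c] / n_test) for c in train_counts
)

# Always predicting the training majority class is correct exactly as often
# as that class occurs in the test set.
majority_class = max(train_counts, key=train_counts.get)
expected_most_frequent = test_counts[majority_class] / n_test

print(f"Expected stratified accuracy: {expected_stratified:.3f}")
print(f"Majority-class accuracy: {expected_most_frequent:.3f}")
```

Under these assumptions the stratified baseline averages about 0.53, so the observed 0.57 is a somewhat lucky draw rather than a typical one.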



In [9]:
# Create a baseline classifier that always predicts the most frequent class
dummy_clf_most_frequent = DummyClassifier(strategy='most_frequent', random_state=42)

# Fit the baseline classifier on the training data
dummy_clf_most_frequent.fit(X_train, y_train)

# Make predictions on the test data
y_pred_most_frequent = dummy_clf_most_frequent.predict(X_test)

# Calculate accuracy
accuracy_most_frequent = accuracy_score(y_test, y_pred_most_frequent)
# We don't calculate an F1 score here: every prediction is the majority class,
# so the minority class never appears in y_pred and its precision and F1 are undefined.

In [10]:
accuracy_most_frequent

0.6228070175438597
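To see what the per-class metrics look like for this strategy, they can be worked out by hand. This is a minimal sketch in plain Python that rebuilds the labels from the support counts in the earlier report (43 samples of class 0, 71 of class 1) and assumes the classifier predicts class 1 for every test sample, which is consistent with the accuracy of 71/114 above. The helper `precision_recall_f1` is hypothetical, not a scikit-learn function.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label; None where undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)

    # Precision is undefined when the label is never predicted
    precision = tp / predicted if predicted else None
    recall = tp / actual if actual else None

    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Labels rebuilt from the test-set support counts; order does not matter here
y_true = [0] * 43 + [1] * 71
# Assumed output of the most_frequent strategy: class 1 for every sample
y_pred = [1] * 114

results = {label: precision_recall_f1(y_true, y_pred, label) for label in (0, 1)}
for label, (precision, recall, f1) in results.items():
    print(label, precision, recall, f1)
```

Class 1 gets perfect recall and a respectable F1, while class 0 has zero recall and no defined precision. scikit-learn's `classification_report` reports such undefined values as 0 and emits a warning by default; its `zero_division` parameter controls that behavior.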

In [15]:
# Create a baseline classifier that predicts each class uniformly at random
dummy_clf_uniform = DummyClassifier(strategy='uniform', random_state=42)

# Fit the baseline classifier on the training data
dummy_clf_uniform.fit(X_train, y_train)

# Make predictions on the test data
y_pred_uniform = dummy_clf_uniform.predict(X_test)

# Calculate accuracy
accuracy_uniform = accuracy_score(y_test, y_pred_uniform)

print(accuracy_uniform)

0.5964912280701754
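A single run of the uniform strategy can land well away from 50%. The following simulation is a sketch that uses Python's `random` module rather than `DummyClassifier`, so its individual draws will not match scikit-learn's. The label counts are taken from the test-set support above, and the number of trials (2000) is an arbitrary choice.

```python
import random
import statistics

# Test-set class counts from the report: 43 samples of class 0, 71 of class 1
y_true = [0] * 43 + [1] * 71


def uniform_guess_accuracy(y_true, rng):
    """Accuracy of one round of coin-flip predictions (hypothetical helper)."""
    hits = sum(1 for t in y_true if rng.choice((0, 1)) == t)
    return hits / len(y_true)


rng = random.Random(0)  # fixed seed so the simulation is repeatable
accuracies = [uniform_guess_accuracy(y_true, rng) for _ in range(2000)]

mean_accuracy = statistics.mean(accuracies)
spread = statistics.stdev(accuracies)

print(f"Mean accuracy: {mean_accuracy:.3f}")
print(f"Standard deviation: {spread:.3f}")
```

With 114 test samples, the theoretical standard deviation of a coin-flip accuracy is sqrt(0.25 / 114), roughly 0.047. The 0.596 obtained above is therefore about two standard deviations above the expected 0.5: an uncommon draw, but not evidence that uniform guessing has any skill.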
