__1. This digit recognition project trains several machine learning models to recognize handwritten digits:<br>1. Dummy Classifier <br>2. K-Nearest Neighbours <br>3. Decision Tree Classifier <br>4. Gaussian Naive Bayes Classifier <br>5. Random Forest Classifier <br>6. Support Vector Classifier <br>7. Multi-layer Perceptron Classifier<br> The goal is to build a model that accurately classifies handwritten digits (0-9) from images of those digits.__

__2. This code segment imports various Python libraries and modules essential for a machine learning project. <br>Here's a breakdown:__

__pandas, numpy: Libraries for data manipulation and numerical computations.<br>
matplotlib.pyplot, matplotlib.image: Libraries for plotting and handling images.<br>
LabelBinarizer, StandardScaler: Tools for data preprocessing from scikit-learn.<br>
DummyClassifier, RandomForestClassifier, GradientBoostingClassifier, SVC, KNeighborsClassifier, DecisionTreeClassifier, GaussianNB, MLPClassifier: Different machine learning algorithms/models available in scikit-learn.<br>
accuracy_score, confusion_matrix, classification_report: Functions for evaluating model performance.<br>
train_test_split, cross_validate, cross_val_score, GridSearchCV: Tools for model evaluation, validation, and hyperparameter tuning.<br>
fetch_openml: Function for fetching datasets from OpenML.<br>
warnings: Module used for handling warnings; in this case, it's suppressing warnings using warnings.filterwarnings('ignore').__

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
import matplotlib.image as mpimg
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier
import warnings
warnings.filterwarnings('ignore')


__3. This code segment uses the pandas read_csv() function to read train.csv and test.csv into DataFrames named train_pd and test_pd, respectively. The rest of the notebook works with these DataFrames.__

In [2]:
train_pd = pd.read_csv("C:/Users/GOPI/Downloads/train.csv") # Reading the train.csv file into train_pd DataFrame
test_pd = pd.read_csv("C:/Users/GOPI/Downloads/test.csv") # Reading the test.csv file into test_pd DataFrame


__4. This code segment involves preparing the data for training a machine learning model. <br>It does the following:<br> y_train: Extracts the labels for the training data from the "label" column in the train_pd DataFrame. <br>These labels will be the target/output values for training the model.<br> X_train: Extracts the features for the training data by dropping the "label" column from train_pd. <br>The remaining columns will be used as input features for training the model.<br> X_test: Extracts the features for the test data directly from the test_pd DataFrame and stores them in a NumPy array. The shape of X_test is printed to verify its dimensions.__

In [3]:
y_train = train_pd["label"]
X_train = train_pd.drop(labels = ["label"],axis = 1) 
X_test = test_pd.values
print(X_test.shape)


(28000, 784)
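
__The 784 columns of each row can be viewed as a small image. The sketch below is only an illustration: it uses a random stand-in row rather than real data and assumes the images are square (28 x 28), which the source does not state explicitly.__

```python
import numpy as np

# Hypothetical stand-in for one row of X_test: 784 pixel values in 0..255.
rng = np.random.default_rng(0)
row = rng.integers(0, 256, size=784, dtype=np.uint8)

# Assumption: each image is square, so 784 columns correspond to 28 x 28 pixels.
side = int(np.sqrt(row.size))
assert side * side == row.size
image = row.reshape(side, side)

print(image.shape)
```

__With a real row, plt.imshow(image, cmap="gray") would display the digit.__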


__5. This code segment performs data normalization on the training and test datasets. It divides all pixel values in the image data by 255.0 to scale them between 0 and 1. <br>Normalizing pixel values in image datasets is a common practice in machine learning tasks involving images. <br>It helps in improving the convergence speed of neural networks and other machine learning algorithms and aids in achieving better performance during training.__

In [4]:
X_train = X_train / 255.0
X_test = X_test / 255.0
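
__A minimal sketch of what this scaling does, using a small made-up array of pixel intensities (the values are hypothetical, not taken from train.csv):__

```python
import numpy as np

# Hypothetical 2 x 4 batch of raw pixel intensities (uint8, 0..255).
pixels = np.array([[0, 64, 128, 255],
                   [255, 0, 32, 16]], dtype=np.uint8)

# Dividing by a float promotes the array to float64 and maps 0..255 onto 0..1.
scaled = pixels / 255.0

print(scaled.dtype, scaled.min(), scaled.max())
```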

__6. This code snippet initializes the scikit-learn classifiers imported earlier, each with specific parameters:__

__DummyClassifier: A classifier that makes predictions using simple rules (in this case, the "most frequent" strategy).<br>
DecisionTreeClassifier: A classifier based on decision tree algorithms.<br>
KNeighborsClassifier: A classifier implementing the k-nearest neighbors algorithm.<br>
GaussianNB: A classifier based on Gaussian Naive Bayes algorithm.<br>
SVC: A classifier based on Support Vector Machines.<br>
RandomForestClassifier: A classifier based on the Random Forest ensemble method.<br>
MLPClassifier: A classifier implementing a multi-layer perceptron neural network.<br>__

In [5]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dec_clf = DecisionTreeClassifier()
knn_clf = KNeighborsClassifier(n_neighbors=3)
nb_classifier = GaussianNB()
svc_clf = SVC(kernel='rbf', gamma='scale')
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
mlp_clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=100, alpha=0.0001,
                               solver='adam', verbose=10, random_state=42, learning_rate_init=0.001)
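
__To make the "most frequent" strategy concrete, here is a small sketch on made-up labels (X_toy and y_toy are hypothetical). It shows that the dummy baseline ignores the features and always predicts the majority class, so its accuracy equals the share of that class.__

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: class 1 appears most often (3 of 6 samples).
X_toy = np.zeros((6, 2))          # features are ignored by DummyClassifier
y_toy = np.array([1, 1, 1, 0, 2, 2])

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

preds = baseline.predict(X_toy)
print(preds)                      # every prediction is the majority class
print(baseline.score(X_toy, y_toy))
```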

__7. This code snippet demonstrates the training phase of various machine learning classifiers using the training data (X_train and y_train). Each classifier is trained using the fit() method with the training features (X_train) and their corresponding labels (y_train).__


In [6]:
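knn_clf.fit(X_train, y_train)  # K-Nearest Neighbours model
dec_clf.fit(X_train, y_train)  # Decision Tree model
dummy_clf.fit(X_train, y_train)  # Dummy (most frequent) baseline
nb_classifier.fit(X_train, y_train)  # Gaussian Naive Bayes model
svc_clf.fit(X_train, y_train)  # Support Vector classifier (RBF kernel)
rf_clf.fit(X_train, y_train)  # Random Forest model
mlp_clf.fit(X_train, y_train)  # Multi-layer Perceptron; verbose=10 prints the loss per iteration
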
Iteration 1, loss = 0.59564936
Iteration 2, loss = 0.28121472
Iteration 3, loss = 0.23254874
Iteration 4, loss = 0.20010007
Iteration 5, loss = 0.17552981
Iteration 6, loss = 0.15737165
Iteration 7, loss = 0.14062114
Iteration 8, loss = 0.12786752
Iteration 9, loss = 0.11696858
Iteration 10, loss = 0.10575857
Iteration 11, loss = 0.09662070
Iteration 12, loss = 0.08978305
Iteration 13, loss = 0.08312472
Iteration 14, loss = 0.07682152
Iteration 15, loss = 0.07175184
Iteration 16, loss = 0.06628327
Iteration 17, loss = 0.06209376
Iteration 18, loss = 0.05786509
Iteration 19, loss = 0.05397109
Iteration 20, loss = 0.05130834
Iteration 21, loss = 0.04723843
Iteration 22, loss = 0.04449115
Iteration 23, loss = 0.04163443
Iteration 24, loss = 0.03922938
Iteration 25, loss = 0.03614468
Iteration 26, loss = 0.03467583
Iteration 27, loss = 0.03257960
Iteration 28, loss = 0.03007737
Iteration 29, loss = 0.02888881
Iteration 30, loss = 0.02687594
Iteration 31, loss = 0.02518782
Iteration 32, l

MLPClassifier(hidden_layer_sizes=(50,), max_iter=100, random_state=42,
              verbose=10)

__8. This code snippet utilizes the train_test_split() function from scikit-learn to split the original training data (X_train and y_train) into two subsets:<br> a training set (X_train1 and y_train1) and a validation set (X_test1 and y_test1).__

__X_train and y_train: Original training features and labels.<br>
test_size=0.2: Specifies that 20% of the data will be allocated for the validation set (X_test1 and y_test1), while 80% will be used for the training set (X_train1 and y_train1).<br>
random_state=42: Sets a specific random seed for reproducibility, ensuring that the data split remains consistent across multiple runs.__


In [7]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
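
__A validation set only measures generalization if the model is fitted on the training part of the split alone. The sketch below illustrates that order of operations under stated assumptions: scikit-learn's small built-in digits dataset stands in for train.csv, and the stratify argument is an addition that the original split does not use.__

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's small built-in digits set stands in for train.csv.
X, y = load_digits(return_X_y=True)

# stratify=y keeps the class proportions similar in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit on the training part only, so the validation rows stay unseen.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
val_acc = tree.score(X_val, y_val)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```

__Comparing the two numbers gives a rough indication of how much the model has memorized the training rows.__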


__9. This code segment employs the trained classifiers (dec_clf, knn_clf, dummy_clf, nb_classifier, svc_clf, rf_clf, mlp_clf) to make predictions on the test dataset (X_test). Each classifier's predict() method is used to generate predictions based on the trained models.__

__dec_pred: Predictions made by the Decision Tree classifier (dec_clf) on the test dataset.<br>
knn_pred: Predictions made by the K-Nearest Neighbors classifier (knn_clf).<br>
dummy_pred: Predictions made by the Dummy Classifier (dummy_clf).<br>
nb_pred: Predictions made by the Gaussian Naive Bayes classifier (nb_classifier).<br>
svc_pred: Predictions made by the Support Vector Machine classifier (svc_clf).<br>
rf_pred: Predictions made by the Random Forest classifier (rf_clf).<br>
mlp_pred: Predictions made by the Multi-Layer Perceptron Neural Network classifier (mlp_clf).<br>__

In [8]:
dec_pred = dec_clf.predict(X_test)
knn_pred = knn_clf.predict(X_test)
dummy_pred = dummy_clf.predict(X_test)
nb_pred = nb_classifier.predict(X_test)
svc_pred = svc_clf.predict(X_test)
rf_pred = rf_clf.predict(X_test)
mlp_pred = mlp_clf.predict(X_test)


__10. This code segment builds a dictionary all_preds that maps each classifier's name to the trained classifier.<br> The loop iterates through the classifiers, makes predictions on the validation features (X_test1), and compares them with the true validation labels (y_test1).__

__accuracy_score() calculates the fraction of predictions that match the true labels (y_test1).<br>
classification_report() generates a report containing precision, recall, F1-score, and support for each class in the dataset.<br>
Note that the classifiers were fitted in step 7 on the full X_train, which includes the rows later split off as X_test1, so these validation scores are optimistic; the 100.00% reported for the decision tree reflects memorized training rows rather than generalization.__

In [9]:
# Map a display name to each trained classifier
all_preds = {'Dummy Classifier': dummy_clf, 'decisiontree Classifier': dec_clf,
             'knn Classifier': knn_clf, "gaussian classifier": nb_classifier,
             "Random Forest classifier": rf_clf,
             "MLP Classifier": mlp_clf, "SVM Classifier": svc_clf}
for name, clf in all_preds.items():
    y_pred = clf.predict(X_test1)  # predictions on the validation features
    accuracy = accuracy_score(y_test1, y_pred)
    print(f"Accuracy of {name}: {accuracy * 100:.2f}%")
    classification_rep = classification_report(y_test1, y_pred)
    print(f"Classification report of {name}:\n{classification_rep}")
    

Accuracy of Dummy Classifier: 10.82%
Classification report of Dummy Classifier:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       816
           1       0.11      1.00      0.20       909
           2       0.00      0.00      0.00       846
           3       0.00      0.00      0.00       937
           4       0.00      0.00      0.00       839
           5       0.00      0.00      0.00       702
           6       0.00      0.00      0.00       785
           7       0.00      0.00      0.00       893
           8       0.00      0.00      0.00       835
           9       0.00      0.00      0.00       838

    accuracy                           0.11      8400
   macro avg       0.01      0.10      0.02      8400
weighted avg       0.01      0.11      0.02      8400

Accuracy of decisiontree Classifier: 100.00%
Classification report of decisiontree Classifier:
              precision    recall  f1-score   support
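
__The confusion_matrix function imported at the top complements these reports by showing which classes are mistaken for which. A small sketch on made-up labels (y_true and y_hat are hypothetical, not taken from the validation set):__

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for a three-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)

print(acc)
print(cm)  # rows are true classes, columns are predicted classes
```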

           

__11. This code segment takes the predictions made by the classifiers (dummy_pred, dec_pred, knn_pred, nb_pred, svc_pred, rf_pred, mlp_pred) and converts them to integers with round() followed by astype(int). Because each classifier's predict() already returns the integer class labels it was trained on, this step does not change the values; it only guarantees an integer dtype before the predictions are written out.__

In [10]:
dummy_preds = dummy_pred.round().astype(int)
dec_preds = dec_pred.round().astype(int)
knn_preds = knn_pred.round().astype(int)
nb_preds = nb_pred.round().astype(int)
svm_preds = svc_pred.round().astype(int)
rf_preds = rf_pred.round().astype(int)
mlp_preds = mlp_pred.round().astype(int)
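
__A quick way to confirm that the rounding step leaves classifier output unchanged is sketched below on made-up data (X_toy and y_toy are hypothetical):__

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with integer class labels, as in the "label" column.
X_toy = np.array([[0.0], [0.1], [0.9], [1.0]])
y_toy = np.array([3, 3, 7, 7])

clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
raw = clf.predict(X_toy)

# The predictions already carry an integer dtype, so rounding changes nothing.
rounded = raw.round().astype(int)
print(raw.dtype, np.array_equal(raw, rounded))
```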


__12. This code segment creates a dictionary preds that maps a name for each classifier's predictions to the corresponding array of predicted labels. The loop iterates through each key-value pair in preds.__

__Inside the loop:__

__It creates a DataFrame output_df containing image IDs and their predicted labels.<br>
The image IDs are generated using range(1, len(X_test) + 1).<br>
The DataFrame is then saved to a CSV file named after the respective classifier ({key}.csv) using to_csv() method from Pandas, excluding the index column.__

In [11]:
preds = {'dummy prediction': dummy_preds, 'decisiontree predictions': dec_preds,
         'knn predictions': knn_preds, "gaussian predictions": nb_preds,
         "Random Forest Predictions": rf_preds,
         "MLP Predictions": mlp_preds, "SVC predictions": svm_preds}
for key, values in preds.items():
    print(key)
    image_ids = range(1, len(X_test) + 1)  # image IDs start at 1
    output_df = pd.DataFrame({
        'ImageID': image_ids,
        'Label': values
    })
    output_df.to_csv(f'{key}.csv', index=False)  # one CSV file per classifier

dummy prediction
decisiontree predictions
knn predictions
gaussian predictions
Random Forest Predictions
MLP Predictions
SVC predictions
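
__The layout of the written files can be checked without touching the disk. The sketch below writes a made-up set of five predictions (the labels array is hypothetical) to an in-memory buffer and reads it back:__

```python
import io

import numpy as np
import pandas as pd

# Hypothetical predictions for five test images.
labels = np.array([2, 0, 9, 9, 3])

submission = pd.DataFrame({
    "ImageID": range(1, len(labels) + 1),  # IDs start at 1, as in the loop above
    "Label": labels,
})

# Write to an in-memory buffer instead of disk, then read it back to check the layout.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

print(check.columns.tolist(), len(check))
```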


__Dummy Classifier__

![WhatsApp%20Image%202023-12-09%20at%2009.21.29_11b39ec5.jpg](attachment:WhatsApp%20Image%202023-12-09%20at%2009.21.29_11b39ec5.jpg)

__KNN Classifier__

![WhatsApp%20Image%202023-12-09%20at%2009.17.23_aec95fbb.jpg](attachment:WhatsApp%20Image%202023-12-09%20at%2009.17.23_aec95fbb.jpg)

__Decision Tree Classifier__

![WhatsApp%20Image%202023-12-09%20at%2009.18.11_557becb6.jpg](attachment:WhatsApp%20Image%202023-12-09%20at%2009.18.11_557becb6.jpg)

__Gaussian Naive Bayes Classifier__

![WhatsApp%20Image%202023-12-09%20at%2009.25.38_5e7bcfd9.jpg](attachment:WhatsApp%20Image%202023-12-09%20at%2009.25.38_5e7bcfd9.jpg)

__Multi-Layer Perceptron Classifier__

![WhatsApp%20Image%202023-12-09%20at%2009.20.42_bd9b8bab.jpg](attachment:WhatsApp%20Image%202023-12-09%20at%2009.20.42_bd9b8bab.jpg)

__Random Forest Classifier__

![WhatsApp%20Image%202023-12-09%20at%2009.54.28_1885b662.jpg](attachment:WhatsApp%20Image%202023-12-09%20at%2009.54.28_1885b662.jpg)

__Apart from the Dummy Classifier baseline (10.82% on the validation split), the classifiers reached high accuracy with their initial settings, which suggests that the chosen configurations were adequate for this dataset without hyperparameter optimization. High accuracy should still be read with caution: the validation scores were computed on rows the models had already seen during training, and a model that scores very well on a given dataset may overfit and struggle with new, unseen data. It is therefore still worth exploring hyperparameter tuning, evaluated on data held out from training, to check model performance and robustness across different datasets or real-world scenarios.__
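
__One possible way to explore tuning is the GridSearchCV utility imported at the top of the notebook. The sketch below is an illustration under stated assumptions, not the project's method: scikit-learn's small built-in digits dataset stands in for the project's data, and the search space (only n_neighbors for a KNN classifier) is hypothetical.__

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the small built-in digits set stands in for the project's data.
X, y = load_digits(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical search space: only the number of neighbours is varied.
param_grid = {"n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"held-out accuracy: {search.score(X_val, y_val):.3f}")
```

__The search uses cross-validation on the training part only, and the held-out part is scored once at the end.__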