
## **TASK-1: DATA EXPLORATION AND PREPROCESSING**

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [2]:
# Load the dataset
df = pd.read_csv("Alphabets_data.csv")
print(df.head())


  letter  xbox  ybox  width  height  onpix  xbar  ybar  x2bar  y2bar  xybar  \
0      T     2     8      3       5      1     8    13      0      6      6   
1      I     5    12      3       7      2    10     5      5      4     13   
2      D     4    11      6       8      6    10     6      2      6     10   
3      N     7    11      6       6      3     5     9      4      6      4   
4      G     2     1      3       1      1     8     6      6      6      6   

   x2ybar  xy2bar  xedge  xedgey  yedge  yedgex  
0      10       8      0       8      0       8  
1       3       9      2       8      4      10  
2       3       7      3       7      3       9  
3       4      10      6      10      2       8  
4       5       9      1       7      5      10  


In [3]:
# Check for null values
print(df.isnull().sum())

letter    0
xbox      0
ybox      0
width     0
height    0
onpix     0
xbar      0
ybar      0
x2bar     0
y2bar     0
xybar     0
x2ybar    0
xy2bar    0
xedge     0
xedgey    0
yedge     0
yedgex    0
dtype: int64


In [4]:
# Encode target variable (letter)
le = LabelEncoder()
df['letter'] = le.fit_transform(df['letter'])
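
`LabelEncoder` assigns integer codes in sorted label order, so for the 26 capital letters A maps to 0 and Z to 25. The sketch below illustrates this with the five letters shown by `df.head()`; the name `le_demo` and the explicit alphabet list are assumptions made for the illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Letters from the first five rows of the dataset preview.
letters = np.array(["T", "I", "D", "N", "G"])

# Illustrative encoder fitted on the full alphabet (assumption: 26 classes A-Z).
alphabet = [chr(c) for c in range(ord("A"), ord("Z") + 1)]
le_demo = LabelEncoder().fit(alphabet)

codes = le_demo.transform(letters)           # letters -> integer class ids
decoded = le_demo.inverse_transform(codes)   # integer class ids -> letters

print(codes)    # T, I, D, N, G map to 19, 8, 3, 13, 6
print(decoded)
```

Keeping the fitted encoder around makes it possible to translate predicted class ids back to letters with `inverse_transform` when reading a classification report.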


In [5]:
print("\nDataset Info:")
print(df.info())


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   letter  20000 non-null  int64
 1   xbox    20000 non-null  int64
 2   ybox    20000 non-null  int64
 3   width   20000 non-null  int64
 4   height  20000 non-null  int64
 5   onpix   20000 non-null  int64
 6   xbar    20000 non-null  int64
 7   ybar    20000 non-null  int64
 8   x2bar   20000 non-null  int64
 9   y2bar   20000 non-null  int64
 10  xybar   20000 non-null  int64
 11  x2ybar  20000 non-null  int64
 12  xy2bar  20000 non-null  int64
 13  xedge   20000 non-null  int64
 14  xedgey  20000 non-null  int64
 15  yedge   20000 non-null  int64
 16  yedgex  20000 non-null  int64
dtypes: int64(17)
memory usage: 2.6 MB
None


In [6]:
print("\nSummary Statistics:")
print(df.describe())


Summary Statistics:
             letter          xbox          ybox         width       height  \
count  20000.000000  20000.000000  20000.000000  20000.000000  20000.00000   
mean      12.516750      4.023550      7.035500      5.121850      5.37245   
std        7.502175      1.913212      3.304555      2.014573      2.26139   
min        0.000000      0.000000      0.000000      0.000000      0.00000   
25%        6.000000      3.000000      5.000000      4.000000      4.00000   
50%       13.000000      4.000000      7.000000      5.000000      6.00000   
75%       19.000000      5.000000      9.000000      6.000000      7.00000   
max       25.000000     15.000000     15.000000     15.000000     15.00000   

              onpix          xbar          ybar         x2bar         y2bar  \
count  20000.000000  20000.000000  20000.000000  20000.000000  20000.000000   
mean       3.505850      6.897600      7.500450      4.628600      5.178650   
std        2.190458      2.026035      

In [7]:
print("\nShape of dataset (samples, columns):", df.shape)



Shape of dataset (samples, columns): (20000, 17)


In [8]:
# Number of unique classes
print("\nNumber of unique classes (letters):", df['letter'].nunique())
print("Class labels:", df['letter'].unique())



Number of unique classes (letters): 26
Class labels: [19  8  3 13  6 18  1  0  9 12 23 14 17  5  2  7 22 11 15  4 21 24 16 20
 10 25]


In [9]:
# Separating features and target

X = df.drop("letter", axis=1).values   # Features
y = df["letter"].values                # Target

In [10]:
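# Target was already label-encoded in In [4]; reuse it instead of re-fitting
# the encoder, which would overwrite the letter-to-integer mapping stored in `le`.

y_encoded = y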

In [11]:
#Normalize features

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
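
Fitting the scaler on the full feature matrix lets statistics from the future test rows influence the training features. A common alternative is to split first and fit the scaler on the training portion only. The sketch below shows that order on synthetic data; the arrays `X_demo` and `y_demo` and the name `scaler_demo` are hypothetical stand-ins, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 16 integer features in 0..15 and 26 balanced classes.
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 16, size=(520, 16)).astype(float)
y_demo = np.repeat(np.arange(26), 20)

# Split first so the test rows never contribute to the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn mean/std on train only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the train statistics

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With 20,000 samples the difference is usually small, but this order keeps the test set strictly unseen.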


In [12]:
# Final summary

print("\nData preprocessing completed!")
print("Feature matrix shape:", X_scaled.shape)
print("Target vector shape:", y_encoded.shape)


Data preprocessing completed!
Feature matrix shape: (20000, 16)
Target vector shape: (20000,)


##**TASK-2:MODEL IMPLEMENTATION**

In [13]:
!pip install scikeras




In [14]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from scikeras.wrappers import KerasClassifier

In [15]:
# Train-test split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)


In [16]:
# Define model function

def create_model(hidden_units1=64, hidden_units2=32, learning_rate=0.001):
    model = Sequential()
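    # Declare the input shape with an explicit Input layer, as recent Keras
    # versions recommend, instead of passing input_dim to the first Dense layer.
    model.add(Input(shape=(X_train.shape[1],)))
    model.add(Dense(hidden_units1, activation="relu"))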
    model.add(Dense(hidden_units2, activation="relu"))
    model.add(Dense(len(np.unique(y)), activation="softmax"))

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )
    return model


In [17]:
# Wrap model with scikeras

clf = KerasClassifier(
    model=create_model,
    hidden_units1=64,
    hidden_units2=32,
    learning_rate=0.001,
    epochs=30,
    batch_size=16,
    verbose=0
)

In [19]:
pip install --upgrade scikit-learn scikeras



In [20]:
#Train model

clf.fit(X_train, y_train)

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Parameter              Value
model                  <function cre...x7ebec247d8a0>
build_fn
warm_start             False
random_state
optimizer              'rmsprop'
loss
metrics
batch_size             16
validation_batch_size
verbose                0


In [21]:

y_pred = clf.predict(X_test)

print("\nAccuracy on test set:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy on test set: 0.93625

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.98      0.97       158
           1       0.90      0.90      0.90       153
           2       0.96      0.90      0.93       147
           3       0.90      0.96      0.93       161
           4       0.88      0.94      0.91       154
           5       0.89      0.91      0.90       155
           6       0.88      0.95      0.91       155
           7       0.90      0.89      0.90       147
           8       0.93      0.91      0.92       151
           9       0.92      0.95      0.93       149
          10       0.89      0.94      0.91       148
          11       0.97      0.95      0.96       152
          12       1.00      0.92      0.96       158
          13       0.94      0.95      0.95       157
          14       0.95      0.93      0.94       150
          15       0.97      0.95      0.96       161
          16       0.97  

## **TASK-4: HYPERPARAMETER TUNING**

In [1]:
#from sklearn.model_selection import RandomizedSearchCV

In [None]:
"""
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    "model__hidden_layers": [1, 2],
    "model__neurons": [32, 64, 128],
    "model__activation": ['relu', 'tanh'],
    "model__learning_rate": [0.001, 0.01],
    "epochs": [20, 30],
    "batch_size": [32, 64]
}
"""

In [None]:
'''
def create_model(hidden_layers=1, neurons=64, activation='relu', learning_rate=0.001):
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import Adam

    model = Sequential()
    model.add(Dense(neurons, input_dim=X_train.shape[1], activation=activation))
    for _ in range(hidden_layers - 1):
        model.add(Dense(neurons, activation=activation))
    model.add(Dense(len(set(y_train)), activation='softmax'))

    optimizer = Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
'''

In [None]:
'''
from scikeras.wrappers import KerasClassifier

clf = KerasClassifier(model=create_model, verbose=0)
'''

In [None]:
'''
# Grid search
import warnings
warnings.filterwarnings("ignore")

grid = GridSearchCV(estimator=clf, param_grid=param_grid, cv=3, scoring='accuracy')
grid_result = grid.fit(X_train, y_train)
'''


In [22]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
import numpy as np

In [24]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from sklearn.model_selection import train_test_split
import numpy as np

# Note: this split uses the raw feature matrix X, not X_scaled from Task 1,
# so the network below is trained on unstandardized features.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Build a simple ANN model (explicit Input layer instead of input_dim)
model = Sequential()
model.add(Input(shape=(X_train.shape[1],)))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(len(np.unique(y)), activation='softmax'))

# Compile
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

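# Train (the test split doubles as validation data here, so val_accuracy
# is not an independent estimate of generalization)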
model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=1, validation_data=(X_test, y_test))

Epoch 1/20

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)

500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.2373 - loss: 2.9773 - val_accuracy: 0.6593 - val_loss: 1.2743
Epoch 2/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.6712 - loss: 1.1854 - val_accuracy: 0.7415 - val_loss: 0.9597
Epoch 3/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7378 - loss: 0.9517 - val_accuracy: 0.7505 - val_loss: 0.8704
Epoch 4/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7575 - loss: 0.8557 - val_accuracy: 0.7665 - val_loss: 0.8432
Epoch 5/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7805 - loss: 0.7718 - val_accuracy: 0.7840 - val_loss: 0.7408
Epoch 6/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7857 - loss: 0.7439 - val_accuracy: 0.7782 - val_loss: 0.7251
Epoch 7/20
500/500 ━━━━━━━

<keras.src.callbacks.history.History at 0x7ebf2f939400>

In [25]:
# --- Evaluate the default model (Task 2 architecture, trained in In [24]) ---
# Predictions (convert probabilities to class labels)
y_pred_prob = model.predict(X_test)
y_pred = np.argmax(y_pred_prob, axis=1)

125/125 ━━━━━━━━━━━━━━━━━━━━ 0s 727us/step



In [28]:
# Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Default Model Performance:")
print(f"Accuracy  : {accuracy:.4f}")
print(f"Precision : {precision:.4f}")
print(f"Recall    : {recall:.4f}")
print(f"F1-score  : {f1:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Default Model Performance:
Accuracy  : 0.8772
Precision : 0.8833
Recall    : 0.8772
F1-score  : 0.8782

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.93      0.92       158
           1       0.72      0.88      0.79       153
           2       0.98      0.80      0.88       147
           3       0.82      0.91      0.86       161
           4       0.79      0.85      0.82       154
           5       0.81      0.79      0.80       155
           6       0.72      0.86      0.78       155
           7       0.86      0.76      0.80       147
           8       0.94      0.83      0.88       151
           9       0.92      0.88      0.90       149
          10       0.91      0.85      0.88       148
          11       0.95      0.89      0.92       152
          12       0.97      0.94      0.95       158
          13       0.93      0.89      0.91       157
          14       0.90      0.87      0.89       150
       

## Interpretation of Neural Network Classification Results

### Overall Model Performance

| Metric        | Value   | Interpretation |
|---------------|---------|----------------|
| Accuracy      | 0.8772  | The model correctly classified ~88% of the alphabet samples. |
| Precision     | 0.8833  | When the model predicts a class, it's correct ~88% of the time. |
| Recall        | 0.8772  | The model identifies ~88% of actual class instances. |
| F1-score      | 0.8782  | Balanced performance across precision and recall. |

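Training and validation accuracy stay close in the epochs shown (for example 0.7805 vs 0.7840 at epoch 5), so there is no sign of overfitting, although the validation data here is the test split itself. The scikeras model from Task 2 reached 0.9363 on the same split; that run used standardized features, 30 epochs, and a batch size of 16, whereas this one used raw features, 20 epochs, and a batch size of 32, so this model likely has room to improve.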

---

### Class-Level Insights

#### Strong Classes
- **Classes 12 (M), 22 (W), 25 (Z)**  
  - Precision and recall at or near 0.95 (class 12: precision 0.97, recall 0.94)  
  - Consistent predictions with few false positives or misses  
  - Likely well-separated feature patterns

#### Weak Classes
- **Classes 1 (B), 6 (G), 18 (S)**  
  - Precision ~0.72–0.75 (0.72 for both class 1 and class 6)  
  - Lower F1-scores (~0.78–0.81)  
  - Possible confusion with visually or structurally similar letters  
  - May benefit from deeper layers or feature engineering
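
To check which letters the weak classes are actually confused with, one option is to rank the off-diagonal cells of the confusion matrix. The following is a minimal sketch under stated assumptions: `top_confusions` is a hypothetical helper, and the label arrays are made-up examples, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def top_confusions(y_true, y_pred, k=3):
    """Return up to k (true, predicted, count) pairs with the most errors."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    np.fill_diagonal(cm, 0)  # zero out correct predictions
    order = np.argsort(cm, axis=None)[::-1][:k]  # largest cells first
    rows, cols = np.unravel_index(order, cm.shape)
    return [
        (int(labels[r]), int(labels[c]), int(cm[r, c]))
        for r, c in zip(rows, cols)
        if cm[r, c] > 0
    ]

# Hypothetical labels: class 1 is often predicted as class 3.
y_true_demo = np.array([1, 1, 1, 1, 6, 6, 6, 18, 18, 18])
y_pred_demo = np.array([1, 3, 3, 1, 6, 2, 6, 18, 18, 6])

print(top_confusions(y_true_demo, y_pred_demo))
```

Applied to the real `y_test` and `y_pred`, the returned class ids can be mapped back to letters with the fitted label encoder.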

---

### Diagnostic Observations

- **Macro vs Weighted Averages**  
  - Both are ~0.88. With near-uniform class supports the two averages are expected to agree, so per-class variability has to be read from the individual rows.

- **Support Distribution**  
  - Fairly uniform across classes (147–161 test samples each in the rows shown), so performance gaps are likely due to feature complexity, not data imbalance.
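
The difference between macro and weighted averaging only shows up when class supports differ. The small example below uses made-up, deliberately imbalanced labels (`y_true_imb` and `y_pred_imb` are hypothetical) to illustrate this. It also shows why accuracy and weighted recall are both 0.8772 in the report: weighted recall is always equal to accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced case: 8 samples of class 0, 2 samples of class 1.
y_true_imb = np.array([0] * 8 + [1] * 2)
y_pred_imb = np.array([0] * 8 + [0, 1])  # one class-1 sample is missed

# Per-class recall is 1.0 for class 0 and 0.5 for class 1.
macro_recall = recall_score(y_true_imb, y_pred_imb, average="macro")
weighted_recall = recall_score(y_true_imb, y_pred_imb, average="weighted")
accuracy = accuracy_score(y_true_imb, y_pred_imb)

print(f"macro recall    : {macro_recall:.2f}")     # (1.0 + 0.5) / 2 = 0.75
print(f"weighted recall : {weighted_recall:.2f}")  # 0.8 * 1.0 + 0.2 * 0.5 = 0.90
print(f"accuracy        : {accuracy:.2f}")         # 9 correct out of 10 = 0.90
```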

---

### Model Behavior Summary

- The ANN captures nonlinear patterns effectively across most classes.
- It struggles with a few classes that may have overlapping features or subtle distinctions.
- Calibration of the softmax probabilities was not measured here; the per-class errors suggest that some decision boundaries sit too close to ambiguous classes.

---

### Practical Recommendations

| Area | Action |
|------|--------|
| **Feature Scaling** | Train on the standardized `X_scaled` from Task 1 instead of raw `X`. |
| **Model Depth** | Try 2–3 hidden layers to capture complex patterns. |
| **Neuron Count** | Increase to 128–256 for richer representations. |
| **Activation Function** | Compare `'relu'` vs `'tanh'` for smoother gradients. |
| **Learning Rate** | Tune between `0.001` and `0.01` for convergence control. |
| **Early Stopping** | Stop when validation loss no longer improves, to limit overfitting. |
| **Class-Specific Analysis** | Use confusion matrix to pinpoint misclassifications. |
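
The early-stopping recommendation relies on a simple patience rule: stop once the validation loss has not improved for a set number of consecutive epochs. In Keras this is provided by `tf.keras.callbacks.EarlyStopping` (with arguments such as `monitor`, `patience`, and `restore_best_weights`). The plain-Python sketch below only illustrates the rule itself; `early_stopping_epoch` is a hypothetical helper, and the second loss sequence is invented for the example.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a patience rule.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs. If that never
    happens, stop_epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss values from the six complete epochs in the training log:
# still decreasing, so the rule does not trigger.
logged = [1.2743, 0.9597, 0.8704, 0.8432, 0.7408, 0.7251]
print(early_stopping_epoch(logged))

# Hypothetical sequence that plateaus after epoch 2.
plateau = [0.9, 0.7, 0.72, 0.71, 0.69]
print(early_stopping_epoch(plateau, patience=2))
```

Early stopping should monitor a validation split carved out of the training data, so that the test set stays untouched for the final evaluation.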