# Glass Classification using Random Forest Classifier


### Objectives:
1. Explore the Glass dataset.
2. Train and evaluate a Random Forest model to classify glass types.
3. Apply Bagging and Boosting techniques.
4. Handle potential data imbalance issues.
    

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.preprocessing import StandardScaler

# Load dataset
data = pd.read_csv('/mnt/data/glass.csv')

# Display first few rows of the dataset
data.head()
    

### Data Overview and Basic Statistics

In [None]:

# Display column types, non-null counts, and summary statistics
data.info()
print(data.describe())  # printed explicitly: a cell only auto-displays its last expression

# Check for missing values
print(data.isnull().sum())
    


### Exploratory Data Analysis (EDA)
We will visualize the distributions of numeric features, check for outliers, and analyze correlations between features.
    

In [None]:

# 1. Plot histograms for numerical columns
data.hist(bins=15, figsize=(12, 10))
plt.suptitle('Distribution of Numeric Features')
plt.show()

# 2. Boxplots for 'RI' and 'Na' (outlier detection); repeat for 'Mg', 'Al', and the other features as needed
plt.figure(figsize=(12, 6))
sns.boxplot(x=data['RI'])
plt.title('Boxplot of Refractive Index (RI)')
plt.show()

plt.figure(figsize=(12, 6))
sns.boxplot(x=data['Na'])
plt.title('Boxplot of Sodium (Na)')
plt.show()

# 3. Correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
    


### Data Preprocessing and Feature Engineering:
1. Handle any missing values in the dataset.
2. Apply feature scaling using standardization.
    

In [None]:

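# Feature scaling (standardization) of every column except the target
features = data.drop('Type', axis=1)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Rebuild a DataFrame with the original feature names and index,
# so the result does not depend on 'Type' being the last column
scaled_data = pd.DataFrame(scaled_features, columns=features.columns, index=data.index)
scaled_data['Type'] = data['Type']
scaled_data.head()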
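
Fitting the scaler on the full dataset before the train/test split lets test-set statistics influence the preprocessing. The following is a minimal sketch of one way to avoid that, by wrapping the scaler and the classifier in a `Pipeline` so the scaler is fitted on the training rows only. The synthetic data from `make_classification` is an assumed stand-in for the glass features, and the names `X_demo`, `y_demo`, and `scaled_rf` are hypothetical. Tree-based models such as Random Forest are largely insensitive to feature scaling, so the pattern matters most when the same preprocessing feeds a scale-sensitive model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the glass features (assumption: 9 numeric columns, 3 classes)
X_demo, y_demo = make_classification(
    n_samples=214, n_features=9, n_informative=5, n_redundant=0,
    n_classes=3, random_state=42,
)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only and reuses
# those training statistics when transforming the test rows
scaled_rf = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=42)),
])
scaled_rf.fit(X_tr, y_tr)

demo_accuracy = scaled_rf.score(X_te, y_te)
print(f"Held-out accuracy: {demo_accuracy:.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by the unscaled feature columns and the `Type` column.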
    


### Random Forest Classifier:
We will split the data into training and testing sets, train a Random Forest classifier, and evaluate its performance.
    

In [None]:

# Splitting data into features (X) and target (y)
X = scaled_data.drop('Type', axis=1)
y = scaled_data['Type']

# Splitting into training and testing sets (80-20 split), stratified so that
# each glass type keeps roughly the same proportion in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)

# Predictions and evaluation
y_pred = rf_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
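# zero_division=0 scores a class with no predicted (or no true) samples as 0 instead of raising a warning
precision = precision_score(y_test, y_pred, average='macro', zero_division=0)
recall = recall_score(y_test, y_pred, average='macro', zero_division=0)
f1 = f1_score(y_test, y_pred, average='macro', zero_division=0)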

# Output the performance metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")
print(classification_report(y_test, y_pred))
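
A single 80-20 split can give a noisy estimate when the dataset is small. As a hedged sketch, stratified k-fold cross-validation averages the score over several splits. The synthetic data below is an assumed stand-in for the glass features, and `X_demo`, `y_demo`, and `cv_scores` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the glass features (assumption: 9 numeric columns, 3 classes)
X_demo, y_demo = make_classification(
    n_samples=214, n_features=9, n_informative=5, n_redundant=0,
    n_classes=3, random_state=42,
)

# Five stratified folds: each fold keeps roughly the same class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Macro-averaged F1 weights every class equally, regardless of its size
cv_scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X_demo, y_demo, cv=cv, scoring="f1_macro",
)
print(f"Macro F1 per fold: {cv_scores.round(3)}")
print(f"Mean macro F1: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```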
    


### Bagging and Boosting Methods:
A Random Forest is itself a bagging ensemble of decision trees. Here we train a generic Bagging ensemble and a Boosting ensemble (AdaBoost) and compare their accuracy with the Random Forest classifier.
    

In [None]:

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

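# Bagging Classifier (the base model is passed positionally because the keyword
# was renamed from `base_estimator` to `estimator` in scikit-learn 1.2)
bagging_clf = BaggingClassifier(rf_clf, n_estimators=50, random_state=42)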
bagging_clf.fit(X_train, y_train)
y_bag_pred = bagging_clf.predict(X_test)
bagging_accuracy = accuracy_score(y_test, y_bag_pred)
print(f"Bagging Accuracy: {bagging_accuracy}")

# Boosting Classifier (AdaBoost)
boosting_clf = AdaBoostClassifier(n_estimators=50, random_state=42)
boosting_clf.fit(X_train, y_train)
y_boost_pred = boosting_clf.predict(X_test)
boosting_accuracy = accuracy_score(y_test, y_boost_pred)
print(f"Boosting Accuracy: {boosting_accuracy}")
    


### Handling Imbalanced Data:
In case the dataset is imbalanced (i.e., the target classes are not equally represented), methods like oversampling, undersampling, or using class weights in the model can be applied to balance the data.
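
As a minimal sketch of the class-weight option, `RandomForestClassifier` accepts `class_weight="balanced"`, which weights each class inversely to its frequency in the training data. The imbalanced synthetic data below is an assumed stand-in for the glass dataset, and the names `X_demo`, `y_demo`, and `weighted_rf` are hypothetical.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Imbalanced synthetic stand-in (assumption: roughly 70% / 20% / 10% class shares)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=9, n_informative=5, n_redundant=0,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=42,
)
print("Class counts:", Counter(y_demo.tolist()))

# Stratify so the rare class appears in both the training and the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# class_weight="balanced" weights classes inversely to their training frequency
weighted_rf = RandomForestClassifier(class_weight="balanced", random_state=42)
weighted_rf.fit(X_tr, y_tr)

y_pred_demo = weighted_rf.predict(X_te)
print(classification_report(y_te, y_pred_demo, zero_division=0))
```

On the real data, `data['Type'].value_counts()` shows how uneven the classes are, and the per-class recall in the report shows whether the weighting helps the rare glass types.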
    


## Interview Questions:
1. **What is the difference between Bagging and Boosting?**
   - **Bagging**: Trains multiple models (usually of the same type) independently, each on a bootstrap sample of the data (drawn with replacement), and combines their outputs by averaging or majority vote. It mainly reduces variance.
   - **Boosting**: Trains models sequentially, where each subsequent model focuses on the errors of the previous models. It mainly reduces bias and can also reduce variance.

2. **How to handle imbalanced data?**
   - **Oversampling/Undersampling**: Oversampling adds copies of minority-class samples; undersampling removes majority-class samples.
   - **Class Weights**: Assigning higher weights to the minority class to penalize misclassifications more heavily.
   - **Synthetic Data Generation (SMOTE)**: Creating synthetic samples for the minority class.
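
The oversampling idea can be sketched with `sklearn.utils.resample`, which draws rows with replacement. This is an illustrative example on a toy frame with hypothetical values, not the glass data. In practice the resampling would be applied to the training split only, so that duplicated rows never leak into the test set.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training frame (hypothetical values) with a 6:2 class imbalance
train_df = pd.DataFrame({
    "RI": [1.51, 1.52, 1.53, 1.51, 1.52, 1.50, 1.54, 1.55],
    "Type": [1, 1, 1, 1, 1, 1, 7, 7],
})

# Size of the largest class: every class is resampled up to this count
majority_size = int(train_df["Type"].value_counts().max())

# Draw rows with replacement from each class until all classes match the majority
balanced_parts = [
    resample(group, replace=True, n_samples=majority_size, random_state=42)
    for _, group in train_df.groupby("Type")
]
balanced_df = pd.concat(balanced_parts).reset_index(drop=True)

print(balanced_df["Type"].value_counts())
```

Random oversampling only duplicates existing rows, whereas SMOTE interpolates new synthetic points between minority-class neighbours.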
    