##### Context
This is a Glass Identification Data Set from UCI. It contains 10 attributes including id. The response is glass type(discrete 7 values)

##### Content
**Attribute Information:**

- **Id number**: 1 to 214 (removed from CSV file)
- **RI**: refractive index
- **Na**: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
- **Mg**: Magnesium
- **Al**: Aluminum
- **Si**: Silicon
- **K**: Potassium
- **Ca**: Calcium
- **Ba**: Barium
- **Fe**: Iron

**Target information :**
Type of glass
- `1` building_windows_float_processed
- `2` building_windows_non_float_processed
- `3` vehicle_windows_float_processed
- `4` vehicle_windows_non_float_processed (none in this database)
- `5` containers
- `6` tableware
- `7` headlamps

source : [Kaggle](https://www.kaggle.com/datasets/uciml/glass)

In [2]:
import pandas as pd 
df=pd.read_csv("data.csv")
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')
df.head()

There are 214 rows and 10 columns


Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,1
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   RI      214 non-null    float64
 1   Na      214 non-null    float64
 2   Mg      214 non-null    float64
 3   Al      214 non-null    float64
 4   Si      214 non-null    float64
 5   K       214 non-null    float64
 6   Ca      214 non-null    float64
 7   Ba      214 non-null    float64
 8   Fe      214 non-null    float64
 9   Type    214 non-null    int64  
dtypes: float64(9), int64(1)
memory usage: 16.8 KB


In [3]:
# Calculate the total number of missing values in each column
missing_values = df.isnull().sum()

# Calculate the total number of cells (non-missing values) in each column
total_cells = df.shape[0]  # Total number of rows

# Calculate the percentage of missing data for each column
percentage_missing = (missing_values / total_cells) * 100

# Create a DataFrame to store the results
missing_data_info = pd.DataFrame({
    'Column Name': missing_values.index,
    'Missing Values': missing_values,
    'Percentage Missing': percentage_missing
})

# Sort the DataFrame by the percentage of missing data in descending order
missing_data_info = missing_data_info.sort_values(by='Percentage Missing', ascending=False)

# Format the 'Percentage Missing' column to display 1 digit after the decimal point
missing_data_info['Percentage Missing'] = missing_data_info['Percentage Missing'].round(1)

# Display the result
print(missing_data_info)


     Column Name  Missing Values  Percentage Missing
RI            RI               0                 0.0
Na            Na               0                 0.0
Mg            Mg               0                 0.0
Al            Al               0                 0.0
Si            Si               0                 0.0
K              K               0                 0.0
Ca            Ca               0                 0.0
Ba            Ba               0                 0.0
Fe            Fe               0                 0.0
Type        Type               0                 0.0


In [6]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Assuming your data is in a DataFrame called 'data'
# Assuming 'data' is your DataFrame
X = df.drop('Type', axis=1)  # Features
y = df['Type']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define preprocessing steps
numeric_features = ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

preprocessor = numeric_transformer

# Define the model
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', RandomForestClassifier(random_state=42))])

# Train the model
model.fit(X_train, y_train)

# Predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")


Accuracy: 0.84


### Test all possible methods
Logistic Regression:

Simple and interpretable; suitable for binary and multiclass classification.
Decision Trees:

Simple to understand, can capture non-linear relationships, and are interpretable.
Random Forest:

An ensemble of decision trees that provides higher accuracy and handles overfitting.
Support Vector Machines (SVM):

Effective in high-dimensional spaces; good for binary and multiclass classification.
K-Nearest Neighbors (KNN):

Simple and effective; classifies data points based on the majority class among their k-nearest neighbors.
Naive Bayes:

Probabilistic algorithm based on Bayes' theorem; works well with small datasets.
Gradient Boosting Algorithms (e.g., XGBoost, LightGBM):

Builds an ensemble of weak learners sequentially, achieving high accuracy.
Neural Networks:

Deep learning models with the ability to capture complex patterns in the data.
Linear Discriminant Analysis (LDA):

Finds linear combinations of features that best separate different classes.
Quadratic Discriminant Analysis (QDA):

Similar to LDA, but does not assume equal covariance matrices for different classes.
AdaBoost:

Boosting algorithm that combines weak learners to form a strong learner.
Gaussian Process Classification:

Based on Gaussian processes; useful for small to medium-sized datasets.
Ridge Classifier:

Linear classifier that uses ridge regularization.
Perceptron:

Simple neural network architecture for binary classification.
LASSO (Least Absolute Shrinkage and Selection Operator) Regression:

Linear regression method that includes L1 regularization; can be adapted for classification.
Elastic Net:

Combination of L1 and L2 regularization; can be used for both regression and classification.
Multinomial Logistic Regression:

Extension of logistic regression for multiclass classification.
CART (Classification and Regression Trees):

Decision tree algorithm for classification tasks.
Gini Index and Entropy-based Decision Trees:

Decision tree algorithms that use Gini impurity or entropy to split nodes.
CatBoost:

Gradient boosting library that supports categorical features.

In [7]:
pip install catboost
pip install lightgbm

Note: you may need to restart the kernel to use updated packages.


ERROR: Invalid requirement: 'catboost,'


In [8]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, RidgeClassifier, Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Assuming 'df' is your DataFrame
X = df.drop('Type', axis=1)  # Features
y = df['Type']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define preprocessing steps (standardization and imputation)
preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Define models
models = [
    ('Logistic Regression', LogisticRegression(random_state=42)),
    ('Decision Tree', DecisionTreeClassifier(random_state=42)),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('Support Vector Machine', SVC(random_state=42)),
    ('K-Nearest Neighbors', KNeighborsClassifier()),
    ('Naive Bayes', GaussianNB()),
    ('Gradient Boosting', GradientBoostingClassifier(random_state=42)),
    ('AdaBoost', AdaBoostClassifier(random_state=42)),
    ('Neural Network (Perceptron)', Perceptron(random_state=42)),
    ('Linear Discriminant Analysis', LinearDiscriminantAnalysis()),
    ('Quadratic Discriminant Analysis', QuadraticDiscriminantAnalysis()),
    ('Gaussian Process', GaussianProcessClassifier(random_state=42)),
    ('Logistic Regression CV', LogisticRegressionCV(cv=5, random_state=42)),
    ('Lasso', Lasso(random_state=42)),
    ('Elastic Net', ElasticNet(random_state=42)),
    ('Ridge Classifier', RidgeClassifier(random_state=42)),
    ('CatBoost', CatBoostClassifier(random_state=42, verbose=0)),
    ('XGBoost', XGBClassifier(random_state=42, verbosity=0)),
    ('LightGBM', LGBMClassifier(random_state=42))
]

# Train and evaluate each model
for name, model in models:
    # Create a pipeline with preprocessing and the current model
    model_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    
    # Train the model
    model_pipeline.fit(X_train, y_train)

    # Predictions on the test set
    y_pred = model_pipeline.predict(X_test)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} - Accuracy: {accuracy:.2f}")


Logistic Regression - Accuracy: 0.72
Decision Tree - Accuracy: 0.72
Random Forest - Accuracy: 0.84
Support Vector Machine - Accuracy: 0.72
K-Nearest Neighbors - Accuracy: 0.70
Naive Bayes - Accuracy: 0.56
Gradient Boosting - Accuracy: 0.86
AdaBoost - Accuracy: 0.49
Neural Network (Perceptron) - Accuracy: 0.56
Linear Discriminant Analysis - Accuracy: 0.70
Quadratic Discriminant Analysis - Accuracy: 0.53




Gaussian Process - Accuracy: 0.79


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

Logistic Regression CV - Accuracy: 0.70


ValueError: Classification metrics can't handle a mix of multiclass and continuous targets