### Title: Glass Identification project by Francis Afuwah

### Introduction:

Glass is used across a very wide range of domains. Accurate classification of glass types by their composition matters for quality control, materials science research, and forensic investigation. This report presents a predictive model that classifies glass types from their chemical composition.

### Objective: 
The objective of the project was to develop a machine learning model that correctly distinguishes between different types of glass based on their chemical features.
The model aims to help researchers, manufacturers, and forensic experts identify the type of glass from its composition with improved efficiency and accuracy across different applications.

### Data Description:

The dataset consists of chemical composition measurements for several glass types: float-processed glass, vehicle windows, containers, tableware, and headlamps. It includes the refractive index and the relative concentration of the following 8 elements: Na, Mg, Al, Si, K, Ca, Ba, Fe.

### Import necessary libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### Data Pre-processing: 
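The dataset was pre-processed by checking for missing values and duplicate rows. All columns are already numeric, so no categorical encoding is needed. Note that the cells below do not normalize the numerical features before training, although doing so would put them on a uniform scale and would generally be expected to help the model.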

In [2]:
# Load the dataset; header=None because the file has no header row
data = pd.read_csv("Glass Identification.csv", header=None)


In [3]:
# Assign placeholder column names (col1 to col10) plus the target column
data.columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9', 'col10', 'target']


In [4]:
# Display the first few rows to verify the changes
print(data.head())

   col1     col2   col3  col4  col5   col6  col7  col8  col9  col10  target
0     1  1.52101  13.64  4.49  1.10  71.78  0.06  8.75   0.0    0.0       1
1     2  1.51761  13.89  3.60  1.36  72.73  0.48  7.83   0.0    0.0       1
2     3  1.51618  13.53  3.55  1.54  72.99  0.39  7.78   0.0    0.0       1
3     4  1.51766  13.21  3.69  1.29  72.61  0.57  8.22   0.0    0.0       1
4     5  1.51742  13.27  3.62  1.24  73.08  0.55  8.07   0.0    0.0       1
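
The placeholder names hide what each column holds. As a minimal sketch, assuming the columns follow the order given in the Data Description (an ID, the refractive index, the eight elements, then the class label), more descriptive names could be assigned. The names `Id`, `RI`, and `Type` are illustrative choices, not taken from the original notebook, and the two rows are copied from the output above.

```python
import pandas as pd

# Assumed column order: ID, refractive index (RI), eight elements, class label.
COLUMN_NAMES = ["Id", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type"]

# Two rows copied from the head() output above, standing in for the full file.
sample = pd.DataFrame(
    [
        [1, 1.52101, 13.64, 4.49, 1.10, 71.78, 0.06, 8.75, 0.0, 0.0, 1],
        [2, 1.51761, 13.89, 3.60, 1.36, 72.73, 0.48, 7.83, 0.0, 0.0, 1],
    ]
)
sample.columns = COLUMN_NAMES

# Sanity check: if the concentrations are percentages (an assumption),
# the element columns of each row should sum to roughly 100.
element_cols = ["Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]
row_totals = sample[element_cols].sum(axis=1)
print(row_totals)
```

A row total far from 100 would suggest that the assumed column order is wrong.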


In [5]:
# Display the last few rows to verify the changes
print(data.tail())

     col1     col2   col3  col4  col5   col6  col7  col8  col9  col10  target
209   210  1.51623  14.14   0.0  2.88  72.61  0.08  9.18  1.06    0.0       7
210   211  1.51685  14.92   0.0  1.99  73.06  0.00  8.40  1.59    0.0       7
211   212  1.52065  14.36   0.0  2.02  73.42  0.00  8.44  1.64    0.0       7
212   213  1.51651  14.38   0.0  1.94  73.61  0.00  8.48  1.57    0.0       7
213   214  1.51711  14.23   0.0  2.08  73.36  0.00  8.62  1.67    0.0       7


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    214 non-null    int64  
 1   col2    214 non-null    float64
 2   col3    214 non-null    float64
 3   col4    214 non-null    float64
 4   col5    214 non-null    float64
 5   col6    214 non-null    float64
 6   col7    214 non-null    float64
 7   col8    214 non-null    float64
 8   col9    214 non-null    float64
 9   col10   214 non-null    float64
 10  target  214 non-null    int64  
dtypes: float64(9), int64(2)
memory usage: 18.5 KB


In [7]:
# checking null values
data.isnull().sum()

col1      0
col2      0
col3      0
col4      0
col5      0
col6      0
col7      0
col8      0
col9      0
col10     0
target    0
dtype: int64

In [8]:
# Check for duplicate rows in the dataset
data[data.duplicated()]

Unnamed: 0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,target


In [9]:
# Drop duplicate rows, keeping the first occurrence
data = data.drop_duplicates(keep="first")
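
One caveat about the duplicate check: `col1` looks like a running row number (1 to 214 in the outputs above), so a comparison across all columns cannot flag two samples with identical measurements. A minimal sketch of checking only the measurement and label columns follows. The toy values are made up for illustration and are not from the glass file.

```python
import pandas as pd

# Toy frame (made-up values): rows 0 and 1 share measurements but differ in ID.
toy = pd.DataFrame(
    {
        "col1": [1, 2, 3],
        "col2": [1.52, 1.52, 1.51],
        "col3": [13.6, 13.6, 14.1],
        "target": [1, 1, 7],
    }
)

# With the unique ID included, no row is flagged as a duplicate.
n_dupes_all = toy.duplicated().sum()
print(n_dupes_all)

# Comparing every column except the ID flags the repeated measurement.
measure_cols = [c for c in toy.columns if c != "col1"]
n_dupes_measure = toy.duplicated(subset=measure_cols).sum()
print(n_dupes_measure)

# Keep the first occurrence of each repeated measurement.
deduped = toy.drop_duplicates(subset=measure_cols, keep="first")
```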

### Model Selection: 
Logistic regression was applied as the classifier. It is a simple, widely used model that gives a reference point for how well the glass types can be separated from the available features.

In [10]:
# Split the dataset into features (X) and the target variable (y)
X = data.drop('target', axis=1)
y = data['target']
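
Because `X` keeps every column except `target`, it also contains `col1`, the row number. The sketch below uses only the IDs and labels visible in the `head()` and `tail()` outputs above to check whether the ID by itself separates the classes. The variable names are illustrative.

```python
import pandas as pd

# IDs and labels copied from the head() and tail() outputs above.
rows = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5, 210, 211, 212, 213, 214],
        "target": [1, 1, 1, 1, 1, 7, 7, 7, 7, 7],
    }
)

# If the ID ranges per class do not overlap, the ID alone can separate them,
# which would let a model score well without using any chemistry.
id_ranges = rows.groupby("target")["col1"].agg(["min", "max"])
print(id_ranges)

# In the notebook, leaving the ID out of the features would correspond to:
#   X = data.drop(["col1", "target"], axis=1)
```

Running the same `groupby` on the full `data` frame would show whether this pattern holds for all classes. Only ten rows are visible here, so the sketch is suggestive rather than conclusive.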

In [11]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=40)
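
With only 214 rows split in half, a purely random split can leave a rare glass type under-represented in one half. A hedged sketch of a stratified split follows. The `stratify=y` argument is an addition not used in the original cell, and the labels are synthetic stand-ins rather than the real class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for glass types (not the real data).
y = np.array([1] * 40 + [2] * 40 + [7] * 20)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps each class's share the same in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=40, stratify=y
)

train_counts = np.unique(y_train, return_counts=True)[1].tolist()
test_counts = np.unique(y_test, return_counts=True)[1].tolist()
print(train_counts, test_counts)
```

Stratification requires at least two samples per class, which is worth checking on the real labels first.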

In [12]:
# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
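
The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch combining both in a scikit-learn pipeline is shown here. It runs on synthetic data from `make_classification`, used purely as a stand-in for the nine measurement columns, so its accuracy says nothing about the glass dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for nine numeric features and three classes (not glass data).
X, y = make_classification(
    n_samples=214, n_features=9, n_informative=6, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=40, stratify=y
)

# The scaler is fitted on the training half only, then applied to the test half.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print("Accuracy:", test_accuracy)
```

Wrapping the scaler in the pipeline keeps test-set statistics out of the fitted scaling parameters.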


In [13]:
# Make predictions on the testing data
y_pred = model.predict(X_test)

In [14]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9813084112149533
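
A single split gives one accuracy figure, which can vary with the choice of `random_state`. One way to get a steadier estimate is stratified k-fold cross-validation, sketched below on synthetic stand-in data. The five folds and the pipeline are illustrative choices, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the glass measurements).
X, y = make_classification(
    n_samples=214, n_features=9, n_informative=6, n_redundant=0,
    n_classes=3, random_state=0,
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five stratified folds: each sample is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=40)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean:", scores.mean(), "Std:", scores.std())
```

A large spread between folds would indicate that any single-split accuracy should be treated cautiously.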


### Conclusion: 
The model reached an accuracy of 98.1% on the test set, meaning it correctly classified nearly all of the held-out samples. This result is promising but should be read with caution. The feature matrix includes col1, which is a row identifier rather than a chemical measurement. The head and tail outputs show identifiers 1 to 5 in class 1 and 210 to 214 in class 7, so the rows may be ordered by glass type and the identifier may reveal the label. The solver also stopped at its iteration limit, and the evaluation rests on a single 50/50 split. Further analysis suggested possible problems with overfitting and generalization to unseen data, so the model's performance should be validated on different datasets.

The model therefore needs further validation and refinement before it can be considered robust and reliable for real-world use. Future work could include refining the model with domain knowledge and trying ensemble methods to improve performance and robustness. With those improvements, it could serve as a useful tool for researchers, manufacturers, and forensic experts in analyzing and classifying glass materials.