INTRODUCTION

This dataset contains **214 glass samples** with **10 features** representing their chemical composition. Key features include elements like **Na**, **Mg**, **Al**, and **Si**, along with the **refractive index**. The target variable, **Type**, identifies the category of glass, such as windows or containers. This dataset is often used in material classification and forensic analysis.

IMPORT LIBARIES AND LOAD DATASETS

In [40]:
#import libaries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn import metrics


#import dataset
df=pd.read_csv(r'C:\Data Science\data_set\glass.csv')
print(df)




          RI     Na    Mg    Al     Si     K    Ca    Ba   Fe  Type
0    1.52101  13.64  4.49  1.10  71.78  0.06  8.75  0.00  0.0     1
1    1.51761  13.89  3.60  1.36  72.73  0.48  7.83  0.00  0.0     1
2    1.51618  13.53  3.55  1.54  72.99  0.39  7.78  0.00  0.0     1
3    1.51766  13.21  3.69  1.29  72.61  0.57  8.22  0.00  0.0     1
4    1.51742  13.27  3.62  1.24  73.08  0.55  8.07  0.00  0.0     1
..       ...    ...   ...   ...    ...   ...   ...   ...  ...   ...
209  1.51623  14.14  0.00  2.88  72.61  0.08  9.18  1.06  0.0     7
210  1.51685  14.92  0.00  1.99  73.06  0.00  8.40  1.59  0.0     7
211  1.52065  14.36  0.00  2.02  73.42  0.00  8.44  1.64  0.0     7
212  1.51651  14.38  0.00  1.94  73.61  0.00  8.48  1.57  0.0     7
213  1.51711  14.23  0.00  2.08  73.36  0.00  8.62  1.67  0.0     7

[214 rows x 10 columns]


DATA CLEANING 

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   RI      214 non-null    float64
 1   Na      214 non-null    float64
 2   Mg      214 non-null    float64
 3   Al      214 non-null    float64
 4   Si      214 non-null    float64
 5   K       214 non-null    float64
 6   Ca      214 non-null    float64
 7   Ba      214 non-null    float64
 8   Fe      214 non-null    float64
 9   Type    214 non-null    int64  
dtypes: float64(9), int64(1)
memory usage: 16.8 KB


In [42]:
df.isna().sum()

RI      0
Na      0
Mg      0
Al      0
Si      0
K       0
Ca      0
Ba      0
Fe      0
Type    0
dtype: int64

In [43]:
df.fillna(0,inplace=True)

In [44]:
df.duplicated().sum()

1

In [45]:
df.drop_duplicates(inplace=True)

EXTRACTING INDEPENDENT AND DEPENDENT VARIABLES

In [46]:
x=df[['RI','Na','Mg','Al','Si','K','Ca','Ba','Fe']].values
x=pd.DataFrame(x)
y=df["Type"].values
x=pd.DataFrame(y)

SPLITING DATA INTO TRAIN AND TEST 

In [47]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=62)

FEATURE SCALING 

In [48]:
st=StandardScaler()
x_test=st.fit_transform(x_test)
x_train=st.fit_transform(x_train)

MODEL BUILDING AND EVALUTE PREDICTIONS 

MODEL OF SVM

In [49]:
model=SVC(kernel='linear',random_state=0)
model.fit(x_train,y_train)

In [50]:
y_predt=model.predict(x_test)
print(y_predt)
print(y_test)

[6 2 6 1 1 2 1 2 5 6 3 1 5 1 1 1 5 1 2 2 6 2 6 2 2 5 1 1 6 5 1 1 6 2 1 1 1
 2 2 6 5 2 1]
[7 2 7 1 1 2 1 2 5 7 3 1 5 1 1 1 6 1 2 2 7 2 7 2 2 6 1 1 7 5 1 1 7 2 1 1 1
 2 2 7 6 2 1]


In [51]:
print(metrics.mean_squared_error(y_predt,y_test))
print(metrics.accuracy_score(y_predt,y_test)*100)

0.2558139534883721
74.4186046511628


MODEL OF LOGISTIC REGRESSION 

In [52]:
from sklearn.linear_model import LogisticRegression
model4=LogisticRegression(multi_class='multinomial',random_state=42)
model4.fit(x_train,y_train)

In [53]:
y_predt=model4.predict(x_test)
print(y_test)
print(y_predt)

[7 2 7 1 1 2 1 2 5 7 3 1 5 1 1 1 6 1 2 2 7 2 7 2 2 6 1 1 7 5 1 1 7 2 1 1 1
 2 2 7 6 2 1]
[7 2 7 1 1 2 1 2 3 7 2 1 3 1 1 1 5 1 2 2 7 2 7 2 2 5 1 1 7 3 1 1 7 2 1 1 1
 2 2 7 5 2 1]


In [54]:
print("mse value of logistic regression:",metrics.mean_squared_error(y_predt,y_test))
print("accuracy in logistic regression:",metrics.accuracy_score(y_predt,y_test)*100)

mse value of logistic regression: 0.37209302325581395
accuracy in logistic regression: 83.72093023255815


MODEL OF KNN

In [55]:
from sklearn.neighbors import KNeighborsClassifier
classifier=KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2)
classifier.fit(x_train,y_train)

In [56]:
y_predt=classifier.predict(x_test)
print(y_test)
print(y_predt)

[7 2 7 1 1 2 1 2 5 7 3 1 5 1 1 1 6 1 2 2 7 2 7 2 2 6 1 1 7 5 1 1 7 2 1 1 1
 2 2 7 6 2 1]
[6 2 6 1 1 2 1 2 5 6 3 1 5 1 1 1 5 1 2 2 6 2 6 2 2 5 1 1 6 5 1 1 6 2 1 1 1
 2 2 6 5 2 1]


In [57]:
print("mse value of knn regression:",metrics.mean_squared_error(y_predt,y_test))
print("accuracy in knn regression:",metrics.accuracy_score(y_predt,y_test)*100)

mse value of knn regression: 0.2558139534883721
accuracy in knn regression: 74.4186046511628


MODEL OF RANDOM FOREST

In [58]:
from sklearn.ensemble import RandomForestClassifier
model2=RandomForestClassifier(n_estimators=10,criterion='entropy',random_state=0)
model2.fit(x_train,y_train)


In [59]:
y_predt=model2.predict(x_test)
print(y_test)
print(y_predt)

[7 2 7 1 1 2 1 2 5 7 3 1 5 1 1 1 6 1 2 2 7 2 7 2 2 6 1 1 7 5 1 1 7 2 1 1 1
 2 2 7 6 2 1]
[6 2 6 1 1 2 1 2 5 6 3 1 5 1 1 1 5 1 2 2 6 2 6 2 2 5 1 1 6 5 1 1 6 2 1 1 1
 2 2 6 5 2 1]


In [60]:
print("mse value of random forest:",metrics.mean_squared_error(y_predt,y_test))
print("accuracy in random forest:",metrics.accuracy_score(y_predt,y_test)*100)

mse value of random forest: 0.2558139534883721
accuracy in random forest: 74.4186046511628


MODEL OF DECISION FOREST

In [61]:
from sklearn.tree import DecisionTreeClassifier
model3=DecisionTreeClassifier(criterion='entropy',random_state=80)
model3.fit(x_train,y_train)




In [62]:
y_predt=model3.predict(x_test)
print(y_test)
print(y_predt)

[7 2 7 1 1 2 1 2 5 7 3 1 5 1 1 1 6 1 2 2 7 2 7 2 2 6 1 1 7 5 1 1 7 2 1 1 1
 2 2 7 6 2 1]
[6 2 6 1 1 2 1 2 5 6 3 1 5 1 1 1 5 1 2 2 6 2 6 2 2 5 1 1 6 5 1 1 6 2 1 1 1
 2 2 6 5 2 1]


In [63]:
print("mse value of decision tree:",metrics.mean_squared_error(y_predt,y_test))
print("accuracy in decision tree:",metrics.accuracy_score(y_predt,y_test)*100)

mse value of decision tree: 0.2558139534883721
accuracy in decision tree: 74.4186046511628


SUMMARY

This project aimed to predict identifies the category of glass, such as windows or containers by using  their chemical composition..After processing ,we tested  multiple models ,including logistic regression,Decision tree,Random forest,SVM,KNN..The Logistic regression model achieved the highest accuracy of 83.72%,making it the best performing model