## ML iris practice

## General Workflow of Machine Learning Practices

1. **Load Dataset**: This is the initial step where you import your dataset from a file, database, or an external source.

2. **Numeric Encoding of Categories**: If your dataset contains categorical data, these categories need to be converted into a numeric format. This is usually done through one-hot encoding, ordinal encoding, or similar techniques.

3. **Data Splitting**: In this step, the dataset is divided into two or more sets, usually a training set and a test set. Sometimes a validation set is also created. Holding data out lets you evaluate the model on samples it did not see during training.

4. **Standardization of Input Data (Optional)**: If the features in your dataset are not on the same scale, it may be necessary to standardize them. This can mean centering each feature at zero (mean removal), scaling it to unit variance, or normalizing it to a specific range.

5. **Model Estimation or Instance-Based Learning**: Here, you select the machine learning algorithm you want to use and fit it to the data. In instance-based learning, the model stores the training instances and predicts by comparing new samples to them with a similarity (distance) measure.

6. **Result Analysis**: Finally, you evaluate the performance of your model. This might involve calculating error rates, accuracy, precision, recall, or any other performance metric relevant to your task. It can also involve visualizing the results.
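
The six steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the notebook's exact code: it loads iris from scikit-learn's bundled `load_iris` (so it runs without a download), rebuilds string labels to show the encoding step, and chains the optional scaler with the classifier via `make_pipeline`.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# 1. Load dataset (assumption: scikit-learn's bundled copy of iris)
data = load_iris()
X = data.data
species = data.target_names[data.target]  # string labels such as 'setosa'

# 2. Numeric encoding of categories
y = LabelEncoder().fit_transform(species)

# 3. Data splitting (stratified, 30% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# 4 + 5. Standardization and model fitting in one pipeline
model = make_pipeline(StandardScaler(),
                      KNeighborsClassifier(n_neighbors=5, p=2))
model.fit(X_train, y_train)

# 6. Result analysis
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged when predicting on the test data.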


## Load dataset

In [36]:
import seaborn as sns
iris = sns.load_dataset('iris')

print(iris.head())
print(iris.shape)

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
(150, 5)


In [37]:
print(iris.shape)

x = iris.drop('species',axis=1)
y = iris['species']
print(x.shape)

(150, 5)
(150, 4)


## Numeric Encoding of Categories

The `species` column contains strings, so we convert it to integer class labels. This is known as label encoding.

In [38]:
from sklearn.preprocessing import LabelEncoder
import numpy as np
classle = LabelEncoder()
y = classle.fit_transform(iris['species'].values)  # convert species names to integer labels
print('species labels:',np.unique(y))

species labels: [0 1 2]


In [39]:
yo = classle.inverse_transform(y)
print('species:',np.unique(yo))

species: ['setosa' 'versicolor' 'virginica']
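
Conceptually, `LabelEncoder` sorts the distinct labels and maps each one to its position in that sorted order. The plain-Python sketch below mimics that behavior. The short `species` list is a hypothetical sample made up for illustration, not data from the notebook.

```python
# Hypothetical sample of species strings, for illustration only
species = ['versicolor', 'setosa', 'virginica', 'setosa']

# Sorted distinct labels play the role of LabelEncoder.classes_
classes = sorted(set(species))

# Forward mapping: label -> integer (like fit_transform)
to_int = {name: i for i, name in enumerate(classes)}
encoded = [to_int[name] for name in species]

# Backward mapping: integer -> label (like inverse_transform)
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
print(decoded == species)
```

Because the classes are sorted alphabetically, `setosa`, `versicolor`, and `virginica` map to 0, 1, and 2, which matches the labels printed above.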


## Data splitting

In [40]:
from sklearn.model_selection import train_test_split
# Split into training and test sets (test_size defaults to 0.25 when omitted).
# stratify=y keeps the class proportions the same in both splits.
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=1,stratify=y)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(105, 4)
(45, 4)
(105,)
(45,)


## Standardization


In [48]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(x_train)
x_train_std = sc.transform(x_train)
x_test_std= sc.transform(x_test)


# Check the standardized data
print(x_train.head())
x_train_std[1:5,]

     sepal_length  sepal_width  petal_length  petal_width
33            5.5          4.2           1.4          0.2
20            5.4          3.4           1.7          0.2
115           6.4          3.2           5.3          2.3
124           6.7          3.3           5.7          2.1
35            5.0          3.2           1.2          0.2


array([[-0.55053619,  0.76918392, -1.16537974, -1.30728421],
       [ 0.65376173,  0.30368356,  0.84243039,  1.44587881],
       [ 1.0150511 ,  0.53643374,  1.0655204 ,  1.18367281],
       [-1.03225536,  0.30368356, -1.44424226, -1.30728421]])
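
`StandardScaler` subtracts each column's mean and divides by its standard deviation, both estimated in `fit`. The sketch below checks that against a manual NumPy computation. It uses only five rows copied from the `x_train` printout above, so the resulting numbers will differ from the array shown, which was scaled with statistics from all 105 training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Five rows copied from the x_train printout (indices 33, 20, 115, 124, 35)
X = np.array([[5.5, 4.2, 1.4, 0.2],
              [5.4, 3.4, 1.7, 0.2],
              [6.4, 3.2, 5.3, 2.3],
              [6.7, 3.3, 5.7, 2.1],
              [5.0, 3.2, 1.2, 0.2]])

sc = StandardScaler().fit(X)
scaled = sc.transform(X)

# Manual z-score: (x - mean) / std, using the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(np.allclose(scaled.mean(axis=0), 0.0))
print(np.allclose(scaled.std(axis=0), 1.0))
```

The same fitted scaler is then applied to the test set, so the test data never influences the estimated mean and standard deviation.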

## KNN

In [41]:
from sklearn.neighbors import KNeighborsClassifier
# p=2 selects the Euclidean distance (Minkowski distance with p=2): sqrt(sum((a-b)^2))
knn = KNeighborsClassifier(n_neighbors=5, p=2)
knn.fit(x_train,y_train)  # fit on the raw (unstandardized) features
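
The idea behind k-nearest neighbors can be sketched in a few lines of NumPy: compute the Euclidean distance from the query to every training point, take the `k` closest, and let them vote. This is an illustrative sketch only. `knn_predict` and the toy arrays are hypothetical names and values introduced here, and scikit-learn's implementation differs in details such as tie-breaking and neighbor search.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=5):
    """Predict one label by majority vote among the k nearest points."""
    # Euclidean distance (Minkowski distance with p=2) to every training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features, two well-separated classes
X_toy = np.array([[1.4, 0.2], [1.3, 0.2], [1.5, 0.2],
                  [4.7, 1.4], [4.5, 1.5], [4.9, 1.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([1.6, 0.3]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([4.6, 1.4]), k=3)
print(pred_a, pred_b)
```

Because the prediction depends directly on distances, features with larger numeric ranges dominate the vote unless the inputs are standardized first.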

In [43]:
y_train_pred = knn.predict(x_train)
# knn was fit on the raw features, so predict on x_test rather than x_test_std
y_test_pred = knn.predict(x_test)

print('Misclassified training samples : %d' % (y_train != y_train_pred).sum())
print('Misclassified test samples : %d' % (y_test != y_test_pred).sum())

Misclassified training samples : 2
Misclassified test samples : 1


In [44]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_test_pred))

0.9777777777777777


In [46]:
from sklearn.metrics import confusion_matrix
conf = confusion_matrix(y_true=y_test, y_pred=y_test_pred)
print(conf)
# Rows are the true classes (0, 1, 2); columns are the predicted classes (0, 1, 2).
# Entry [i, j] counts the samples of true class i that were predicted as class j.

[[15  0  0]
 [ 0 15  0]
 [ 0  1 14]]
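
The matrix shows one class-2 (virginica) sample predicted as class 1 (versicolor). Accuracy and per-class precision and recall can be worked out from the matrix directly. The sketch below copies the matrix printed above into NumPy rather than recomputing it.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
conf = np.array([[15, 0, 0],
                 [0, 15, 0],
                 [0, 1, 14]])

# Accuracy: correct predictions (diagonal) over all samples, here 44 / 45
accuracy = np.trace(conf) / conf.sum()

# Recall per class: correct / number of true samples of that class (row sums)
recall = np.diag(conf) / conf.sum(axis=1)

# Precision per class: correct / number of predictions of that class (column sums)
precision = np.diag(conf) / conf.sum(axis=0)

print('accuracy :', accuracy)
print('recall   :', recall)
print('precision:', precision)
```

The accuracy of 44/45 agrees with the `accuracy_score` value of about 0.978. The single error lowers recall for class 2 to 14/15 and precision for class 1 to 15/16.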
