- The Iris data set has three classes: Setosa, Versicolor, and Virginica.
- Each sample has four attributes: sepal width, sepal length, petal width, and petal length.
- We will build two classification models using K-Nearest Neighbours (KNN) and Random Forests.
- KNN and Random Forest are supervised machine learning techniques: the model learns from labelled training data and then predicts the iris class of new, unseen data (the test data).
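
As a quick way to see the classes and attributes listed above, the following sketch inspects the copy of Iris that ships with scikit-learn. This is an assumption made for illustration: the notebook itself reads a CSV file, and the bundled copy uses slightly different column names.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the CSV used in this notebook.
iris = load_iris()

class_names = list(iris.target_names)      # the three iris classes
feature_names = list(iris.feature_names)   # the four measured attributes
n_samples, n_features = iris.data.shape    # rows and attribute columns

print("Classes:", class_names)
print("Attributes:", feature_names)
print("Samples:", n_samples, "Attributes per sample:", n_features)
```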


- First, we load the dataset, check for missing values, and then use the pandas info() method to get a summary of the iris dataset.
- The summary shows that the iris dataset has five columns and 150 entries.
- We convert the class variable, variety, to numeric values so that the classifiers work with integer labels: 0 = Setosa, 1 = Versicolor, and 2 = Virginica.
- After that, we separate the X attributes from the y class, then split the data into training and test sets with random_state = 1 so that the same split is reproduced on every run.
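
The encoding and splitting steps above can be sketched on a small toy example. The labels and feature values below are hypothetical stand-ins for the variety column and the measurements. The sketch shows that LabelEncoder assigns codes in sorted order of the class names, and that repeating the split with the same random_state returns the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the iris measurements and the "variety" column.
labels = np.array(["Virginica", "Setosa", "Versicolor", "Setosa", "Virginica",
                   "Versicolor", "Setosa", "Versicolor", "Virginica", "Setosa"])
features = np.arange(20, dtype=float).reshape(10, 2)

encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# classes_ holds the sorted class names; the position of a name is its numeric code.
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)

# Splitting twice with the same random_state yields identical train and test sets.
split_a = train_test_split(features, y, test_size=0.20, random_state=1)
split_b = train_test_split(features, y, test_size=0.20, random_state=1)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same_split)
```

If the original names are needed again later, `encoder.inverse_transform` maps the numeric codes back to strings.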


- We first build the KNN classifier with n_neighbors = 3, which is the number of nearest training points that vote on the class of a new data point. It is a hyperparameter, and its value can be tuned if needed to improve the classifier's prediction accuracy.
- After fitting the classifier on the training data, we predict the iris class for the test data.
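
To make the role of n_neighbors concrete, here is a minimal from-scratch sketch of the idea, assuming Euclidean distance and an unweighted majority vote (the scikit-learn defaults). The helper `knn_predict` and the toy points are hypothetical. This is an illustration only, not the scikit-learn implementation.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the class labels of those neighbours and return the most common one
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labelled 0 and 1.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3)
print(pred_a, pred_b)
```

With three classes, an odd k does not rule out ties. This sketch resolves a tie in favour of the label seen first among the nearest neighbours.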


- We build the Random Forest classifier with n_estimators = 300, which is the number of decision trees that each predict the iris class; the forest then decides by majority vote across the trees.
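
The voting idea can be inspected directly by asking each fitted tree for its own prediction. The sketch below assumes scikit-learn's bundled Iris data instead of the CSV, and fixes random_state = 0 on the forest (not set in this notebook) so that the run is repeatable. Note that RandomForestClassifier.predict averages the trees' class probabilities rather than counting hard votes, so the two results usually agree but are not guaranteed to.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for the CSV used in this notebook.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# random_state=0 on the forest is an assumption added for repeatability.
forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X_train, y_train)

sample = X_test[:1]  # a single test flower, kept 2-D for predict()

# Each fitted tree lives in estimators_ and predicts an index into forest.classes_.
tree_votes = np.array([int(tree.predict(sample)[0]) for tree in forest.estimators_])
vote_counts = np.bincount(tree_votes, minlength=len(forest.classes_))
majority_class = forest.classes_[np.argmax(vote_counts)]

print("Votes per class:", vote_counts)
print("Majority vote:", majority_class)
print("forest.predict:", forest.predict(sample)[0])
```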


- The output shows that there is no significant difference between KNN and RF model performance. With KNN, the accuracy score is higher when using a small K. The RF accuracy score, on the other hand, changes with the random_state value used for the split.
- The KNN classifier correctly predicts all values with K = 3 and random_state = 1.
- The RF classifier correctly predicts 29 of 30 values with n_estimators = 300 and random_state = 1.
- When we change the value of random_state to 3, the RF classifier correctly predicts all values.
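
Because a single 30-sample test set can shift by a prediction or two depending on the split, it can help to repeat the comparison over several random_state values. The sketch below is one way to do that, under assumptions not taken from this notebook: it uses scikit-learn's bundled Iris data, five arbitrary seeds, and a forest with 100 trees and a fixed random_state = 0 to keep the run short and repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled Iris data stands in for the CSV used in this notebook.
X, y = load_iris(return_X_y=True)

seeds = [0, 1, 2, 3, 4]  # arbitrary split seeds chosen for illustration
knn_scores, rf_scores = [], []

for seed in seeds:
    # Only the split changes between iterations.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed
    )
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    knn_scores.append(knn.score(X_test, y_test))
    rf_scores.append(rf.score(X_test, y_test))

print("KNN accuracy per split:", np.round(knn_scores, 3))
print("RF accuracy per split: ", np.round(rf_scores, 3))
print("KNN mean:", round(float(np.mean(knn_scores)), 3))
print("RF mean: ", round(float(np.mean(rf_scores)), 3))
```

Comparing the means over several splits gives a steadier picture than any single split. `sklearn.model_selection.cross_val_score` offers a similar check with less code.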


- Some advantages of KNN over RF are that it is simple to apply and has no separate training phase, so fitting it is faster.
- One disadvantage of KNN is that it does not cope well with large, high-dimensional data, since its time and memory are consumed at prediction time. It stores the whole training set and calculates the distance from each new point to the stored points to find the nearest ones, so its computation cost is paid at runtime.


- Random Forest is a robust and accurate classifier, but training can be time-consuming since it combines many decision trees. It also gets slower as the forest grows.
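
The trade-offs above can be observed with a rough timing sketch. Everything here is an assumption made for illustration: the data is synthetic random noise (2000 samples, 20 features), KNN is forced to algorithm="brute" so that fit only stores the data, and the tree counts are arbitrary. Actual timings depend on the machine, so none are shown.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data for timing only; the labels carry no real signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = rng.integers(0, 3, size=2000)


def timed(fn):
    """Return the wall-clock seconds taken by calling fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


# KNN: fitting only stores the data; the distance work happens in predict().
knn = KNeighborsClassifier(n_neighbors=3, algorithm="brute")
knn_fit_time = timed(lambda: knn.fit(X, y))
knn_predict_time = timed(lambda: knn.predict(X))
print(f"KNN fit: {knn_fit_time:.4f}s, predict: {knn_predict_time:.4f}s")

# Random Forest: fit time for an increasing number of trees.
rf_fit_times = {}
for n_trees in (10, 50, 200):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf_fit_times[n_trees] = timed(lambda: rf.fit(X, y))
    print(f"RF fit with {n_trees} trees: {rf_fit_times[n_trees]:.4f}s")
```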


In [12]:

# Import the required dependencies

import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

def main():
    
    # load the dataset
    df_iris = pd.read_csv("https://raw.githubusercontent.com/Amal211/DS_level2/main/iris.csv")
    
    # Check for null values
    print("Missing values:\n{}\n" .format(df_iris.isnull().sum()))
    
    # get information about iris dataset
    print("\n\nInformation about iris dataset:\n") 
    print(df_iris.info())
    
    # Encode class variable (variety) to numeric values using LabelEncoder function
    encode = preprocessing.LabelEncoder()
    df_iris["variety"] = encode.fit_transform(df_iris.variety)
    
    print("\n\niris dataset:\n{}" .format(df_iris.head()))
    
    print(df_iris["variety"].value_counts())

    # Identify attributes X
    X = df_iris.iloc[:, :-1].values

    # Identify the class y
    y = df_iris.iloc[:, 4].values

    # Split iris to train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 1)
    
    
    # get the train sets shape
    shape = X_train.shape, y_train.shape
    print("\n\nThe shape of X and y training sets:\n{}" .format(shape))
    
    # Build the knn classifier
    knn_classifier = KNeighborsClassifier(n_neighbors = 3)
    
    # fit the knn classifier
    knn_classifier.fit(X_train, y_train)
    
    # predict the class 
    predict_y = knn_classifier.predict(X_test)
    
    print("\nThe actual data:\n{}" .format(y_test))
    
    print("\nThe predicted data:\n{}" .format(predict_y))
    
    print("\nKNN Classifier Accuracy Score:\n{}" .format(f'{knn_classifier.score(X_test, y_test):.2%}'))
    
    print("\nThe Classification Confusion Matrix:\n{}" .format(confusion_matrix(y_test, predict_y)))
    
  
if __name__ == "__main__":
  main()




Missing values:
sepal.length    0
sepal.width     0
petal.length    0
petal.width     0
variety         0
dtype: int64



Information about iris dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal.length  150 non-null    float64
 1   sepal.width   150 non-null    float64
 2   petal.length  150 non-null    float64
 3   petal.width   150 non-null    float64
 4   variety       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None


iris dataset:
   sepal.length  sepal.width  petal.length  petal.width  variety
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1

In [31]:
# Import the required dependencies

import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn import metrics


def main():
    
    # load the dataset
    df_iris = pd.read_csv("https://raw.githubusercontent.com/Amal211/DS_level2/main/iris.csv")
    
    # Check for null values
    print("Missing values:\n{}\n" .format(df_iris.isnull().sum()))
    
    # get information about iris dataset
    print("\n\nInformation about iris dataset:\n") 
    print(df_iris.info())
    
    # Encode class variable (variety) to numeric values using LabelEncoder function
    encode = preprocessing.LabelEncoder()
    df_iris["variety"] = encode.fit_transform(df_iris.variety)
    
    print("\n\niris dataset:\n{}" .format(df_iris.head()))
    
    print(df_iris["variety"].value_counts())

    # Identify attributes X
    X = df_iris.iloc[:, :-1].values

    # Identify the class y
    y = df_iris.iloc[:, 4].values

    # Split iris to train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 1)
    

    # get the train sets shape
    shape = X_train.shape, y_train.shape
    print("\n\nThe shape of X and y training sets:\n{}" .format(shape))

    # Build the Random Forest classifier
    RF = RandomForestClassifier(n_estimators = 300)

    # fit the Random Forest classifier
    RF.fit(X_train, y_train)

    # predict the class 
    predict_y2 = RF.predict(X_test)
    
    print("\nThe actual data:\n{}" .format(y_test))
    
    print("\nThe predicted result:\n{}" .format(predict_y2))
    
    # calculate the accuracy score of the Random Forest classifier
    
    print("\nRandom forest Classifier accuracy Score:\n{}" .format(metrics.accuracy_score(y_test, predict_y2)))

    print("\nRandom forest Classification Confusion Matrix:\n{}" .format(confusion_matrix(y_test, predict_y2)))
       
  
if __name__ == "__main__":
  main()
  



Missing values:
sepal.length    0
sepal.width     0
petal.length    0
petal.width     0
variety         0
dtype: int64



Information about iris dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal.length  150 non-null    float64
 1   sepal.width   150 non-null    float64
 2   petal.length  150 non-null    float64
 3   petal.width   150 non-null    float64
 4   variety       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None


iris dataset:
   sepal.length  sepal.width  petal.length  petal.width  variety
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4