Converting the dataset and handling NaN and empty cells

In [1]:
import pandas as pd
iris_data=pd.read_csv("Iris_data_sample.csv")

In [2]:
# Replacing non-numeric values with NaN and converting columns to numeric
for column in ['SepalLengthCm', 'PetalLengthCm']:
    iris_data[column] = pd.to_numeric(iris_data[column], errors='coerce')

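# Now, replacing NaN values in the numeric columns with the mean of the respective column
# (numeric_only=True skips the string-valued 'Species' column)
iris_data.fillna(iris_data.mean(numeric_only=True), inplace=True)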

# Check the first few rows and data types again after cleaning
cleaned_data_head = iris_data.head()
cleaned_data_types_summary = iris_data.dtypes


In [3]:
from sklearn.impute import SimpleImputer

# Impute the missing values in the 'Species' column
imputer = SimpleImputer(strategy='most_frequent')
# fit_transform returns a 2-D array of shape (n, 1); ravel() flattens it to one column
iris_data['Species'] = imputer.fit_transform(iris_data[['Species']]).ravel()

# Verifying that no missing values remain
missing_values_after_imputation = iris_data.isnull().sum()
missing_values_after_imputation

Unnamed: 0       0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
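
The cleaning steps above can be checked on a tiny example. The following is a minimal sketch on a hypothetical four-row frame (the values are made up for illustration and are not taken from `Iris_data_sample.csv`): non-numeric entries and empty strings become NaN, NaN is replaced by the column mean, and the missing label is replaced by the most frequent one.

```python
import pandas as pd

# Hypothetical toy data: one bad entry ("??"), one empty cell, one missing label
toy = pd.DataFrame({
    "SepalLengthCm": ["5.1", "??", "4.9", ""],
    "Species": ["Iris-setosa", None, "Iris-setosa", "Iris-virginica"],
})

# Anything that cannot be parsed as a number is coerced to NaN
toy["SepalLengthCm"] = pd.to_numeric(toy["SepalLengthCm"], errors="coerce")

# Fill NaN with the mean of the values that could be parsed
toy["SepalLengthCm"] = toy["SepalLengthCm"].fillna(toy["SepalLengthCm"].mean())

# Fill the missing label with the most frequent label (the mode)
toy["Species"] = toy["Species"].fillna(toy["Species"].mode()[0])

print(toy)
```

Here the pandas `mode()` call plays the role that `SimpleImputer(strategy='most_frequent')` plays in the notebook; both pick the most common value.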

**kNN-Classifier**

In [4]:
# Import the splitting, modelling, and evaluation utilities
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report


X = iris_data.iloc[:, 1:-1]  # features (excluding the first column 'Unnamed: 0')
y = iris_data.iloc[:, -1]   # target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train and evaluate kNN classifier for each k value
k_values = [1, 3, 5, 7]
results = {}

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("For k = ", k,", Accuracy = ", accuracy, sep="")
    report = classification_report(y_test, y_pred)
    results[k] = {'accuracy': accuracy, 'report': report}

For k = 1, Accuracy = 0.9555555555555556
For k = 3, Accuracy = 0.9777777777777777
For k = 5, Accuracy = 0.9777777777777777
For k = 7, Accuracy = 0.9777777777777777


*OBSERVATIONS*

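Accuracy rose slightly from k=1 (about 0.956) to k=3 (about 0.978) and then stayed flat for k=5 and k=7.
Best Performance: Achieved with `k=3`, `k=5`, and `k=7`, which all reach the same accuracy.

Insight: Values of k above 1 gave more stable and generalized predictions, reducing the effect of noise on classification. The test set holds 45 samples, so the gap between k=1 and the other values corresponds to a single misclassified sample.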
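
One way to compare k values less dependently on a single train/test split is k-fold cross-validation. The following is a minimal sketch, not the method used in this notebook: it assumes scikit-learn's built-in `load_iris` data as a stand-in for `Iris_data_sample.csv`, and the names `X_demo`, `y_demo`, and `cv_scores` are introduced here for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: built-in iris data stands in for the notebook's CSV file
X_demo, y_demo = load_iris(return_X_y=True)

# Five stratified folds, shuffled with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for k in [1, 3, 5, 7]:
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    # Store mean and spread across folds for each k
    cv_scores[k] = (scores.mean(), scores.std())
    print(f"k = {k}: mean accuracy = {scores.mean():.3f} (std = {scores.std():.3f})")
```

The standard deviation across folds gives a rough sense of whether a difference between two k values is larger than the split-to-split variation.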

**kNN-Regressor**

In [5]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

# For regression, we need to choose a numeric target. Let's use 'SepalLengthCm' as the target.
# The rest of the features will be used as predictors.
X_reg = iris_data.drop(['SepalLengthCm', 'Unnamed: 0', 'Species'], axis=1)
y_reg = iris_data['SepalLengthCm']

# Split the dataset into training and testing sets for regression
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

# Train and evaluate kNN regressor for each k value
k_values = [1, 3, 5, 7]
regression_results = {}

for k in k_values:
    knn_reg = KNeighborsRegressor(n_neighbors=k)
    knn_reg.fit(X_train_reg, y_train_reg)
    y_pred_reg = knn_reg.predict(X_test_reg)
    mse = mean_squared_error(y_test_reg, y_pred_reg)
    r2 = r2_score(y_test_reg, y_pred_reg)
    regression_results[k] = {'MSE': mse, 'R2 Score': r2}

regression_results

{1: {'MSE': 0.23983084846027955, 'R2 Score': 0.6396665171894449},
 3: {'MSE': 0.1381738960707476, 'R2 Score': 0.7924008461616977},
 5: {'MSE': 0.11759446191112517, 'R2 Score': 0.8233203847974265},
 7: {'MSE': 0.11652735423553982, 'R2 Score': 0.8249236590540376}}

As with the kNN classifier, the performance of the kNN regressor improved with increasing values of k, although the gain from k=5 to k=7 is marginal.
Best Performance: Achieved with `k = 7`

k=7 shows the lowest Mean Squared Error (MSE, about 0.117) and the highest R² Score (about 0.825), indicating a good fit to the data.
Insight: Larger k values led to better performance, likely because averaging over more neighboring points smooths out noise in individual samples.
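
To make the two metrics concrete, the sketch below computes MSE and R² by hand with NumPy and checks them against scikit-learn. The four target values and predictions are hypothetical numbers chosen for illustration; they are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions (illustration only)
y_true_demo = np.array([5.1, 4.9, 6.3, 5.8])
y_hat_demo = np.array([5.0, 5.0, 6.0, 6.0])

# MSE: average of the squared residuals
residuals = y_true_demo - y_hat_demo
mse_manual = np.mean(residuals ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The hand-computed values should agree with scikit-learn's
print(mse_manual, mean_squared_error(y_true_demo, y_hat_demo))
print(r2_manual, r2_score(y_true_demo, y_hat_demo))
```

A lower MSE means smaller squared errors on average, while R² closer to 1 means the predictions explain more of the variance in the target than simply predicting its mean.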

**Naive Bayes**

In [6]:
from sklearn.naive_bayes import GaussianNB

# Naive Bayes classifier
nb_classifier = GaussianNB()

# Using the same training and testing sets as for kNN classifier
# Here, we are considering 'Species' as the target variable again.
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)

# Evaluate the Naive Bayes classifier
accuracy_nb = accuracy_score(y_test, y_pred_nb)
report_nb = classification_report(y_test, y_pred_nb)

print("Accuracy =", accuracy_nb)

Accuracy = 0.9777777777777777
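
A single accuracy number does not show which species are confused with each other. As a minimal sketch, assuming scikit-learn's built-in `load_iris` data as a stand-in for `Iris_data_sample.csv` (so the numbers may differ from the cell above), a confusion matrix for Gaussian Naive Bayes can be produced as follows. The names `nb_demo` and `cm` are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: built-in iris data stands in for the notebook's CSV file
X_iris, y_iris = load_iris(return_X_y=True)

# Same split settings as used in the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

nb_demo = GaussianNB().fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes (label order 0, 1, 2)
cm = confusion_matrix(y_te, nb_demo.predict(X_te), labels=[0, 1, 2])
print(cm)
```

Off-diagonal entries count the test samples of one species that were predicted as another. In the notebook itself, `print(report_nb)` would show the per-class precision and recall that were already computed.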
