**1. IMPORT LIBRARIES AND LOAD DATASETS**

Purpose: Load the required Python libraries and the Wine dataset from scikit-learn.

The libraries cover data manipulation (pandas, NumPy), visualization (matplotlib, seaborn), and machine learning (scikit-learn).
The Wine dataset contains 178 samples with 13 numeric features and a target with 3 wine classes.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load Wine dataset
wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target, name='target')

print(X.head())
print(y.value_counts())
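
As a quick sanity check on the dataset description, the following sketch reads the sample count, feature count, and class names straight from the object returned by `load_wine`. The names `n_samples`, `n_features`, `class_names`, and `class_counts` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# Shape of the feature matrix: (n_samples, n_features)
n_samples, n_features = wine.data.shape
print(f"{n_samples} samples, {n_features} features")

# wine.target holds integer labels; wine.target_names maps them to names
class_names = list(wine.target_names)
print("Classes:", class_names)

# Number of samples per class
class_counts = {
    name: int((wine.target == label).sum())
    for label, name in enumerate(class_names)
}
print(class_counts)
```

If the classes turn out to be noticeably unequal in size, that is worth keeping in mind when reading accuracy later on.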


**2. EXPLORATORY DATA ANALYSIS (EDA)**

Purpose: Understand the structure of the data and look for patterns, distributions, and relationships between variables.

Check feature ranges with summary statistics, class balance with a count plot, and pairwise feature correlations with a heatmap.
The summary statistics also give a first hint of skew or outliers, for example a maximum far above the 75th percentile.

In [None]:
# Basic statistics
print(X.describe())

# Class distribution
sns.countplot(x=y)
plt.title("Class Distribution")
plt.show()

# Correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(X.corr(), cmap='coolwarm', annot=False)
plt.title("Feature Correlation Heatmap")
plt.show()
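
Skew and outliers can also be quantified rather than judged by eye. The following is a minimal sketch, assuming the common 1.5 × IQR rule as the outlier criterion (that threshold is an assumption, not something this notebook specifies). The names `skewness`, `is_outlier`, and `outlier_counts` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)

# Sample skewness per feature; values far from 0 indicate asymmetry
skewness = X.skew().sort_values(ascending=False)
print(skewness.head())

# Flag values outside the 1.5 * IQR fences, feature by feature
q1 = X.quantile(0.25)
q3 = X.quantile(0.75)
iqr = q3 - q1
is_outlier = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)

# Number of flagged values per feature
outlier_counts = is_outlier.sum().sort_values(ascending=False)
print(outlier_counts.head())
```

A flagged value is only a candidate for inspection; whether to keep, cap, or drop it depends on the data.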


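**3. DATA PREPARATION**

Purpose: Prepare the data for model training by scaling features and splitting into training and test sets.

Standardize each feature to zero mean and unit variance so that features measured on large scales (such as proline) do not dominate distance-based and gradient-based models like K-Nearest Neighbors, SVM, and logistic regression.
Hold out 30% of the samples as a test set (a 70/30 split) to estimate how well each model generalizes.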

In [None]:
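# Train-test split first, so the test set stays unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling: fit on the training set only to avoid leaking
# test-set statistics, then apply the same transform to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)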
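
What `StandardScaler` computes can be reproduced by hand, which makes it clearer where the mean and standard deviation come from. The sketch below takes both statistics from the training rows only, one common way to keep test data out of preprocessing. The names `X_tr`, `X_te`, `mu`, `sigma`, and `scaler_check` are illustrative, chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=42
)

# Per-feature statistics, computed from the training rows only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# z = (x - mean) / std, with the same mu and sigma for both sets
X_tr_manual = (X_tr - mu) / sigma
X_te_manual = (X_te - mu) / sigma

# Compare against StandardScaler fitted on the same training rows
scaler_check = StandardScaler().fit(X_tr)
train_matches = np.allclose(X_tr_manual, scaler_check.transform(X_tr))
test_matches = np.allclose(X_te_manual, scaler_check.transform(X_te))
print(train_matches, test_matches)
```

After this transform the training columns have mean 0 and standard deviation 1; the test columns are only approximately so, because they reuse the training statistics.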


**4. Helper: Plot Confusion Matrix**

Purpose: Create a reusable function to visualize confusion matrices for all models.

The function draws the confusion matrix as a seaborn heatmap, with actual labels on the rows and predicted labels on the columns.
Off-diagonal cells show which classes a model confuses with each other.

In [None]:
def plot_conf_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(5, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=wine.target_names, yticklabels=wine.target_names)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(f'Confusion Matrix: {title}')
    plt.show()
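
The per-class precision and recall in a classification report can be read directly off a confusion matrix. The sketch below shows the arithmetic on a small set of toy labels invented purely for illustration (`toy_true` and `toy_pred` are not results from any model in this notebook).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, made up for this example only
toy_true = [0, 0, 0, 1, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 1, 1, 2, 2, 2]

# Rows are actual classes, columns are predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred)
print(toy_cm)

# Diagonal entries are the correctly classified samples per class
correct = np.diag(toy_cm)

# Recall: correct / all samples that actually belong to the class (row sums)
recall = correct / toy_cm.sum(axis=1)

# Precision: correct / all samples predicted as the class (column sums)
precision = correct / toy_cm.sum(axis=0)

print("recall:   ", recall.round(3))
print("precision:", precision.round(3))
```

Reading along a row answers "where did the samples of this class end up?", while reading down a column answers "what did the model label as this class?".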


**5. Model Training & Evaluation Template**

Purpose: Train six classification models and evaluate them using accuracy and detailed classification reports.

Fit each model on the training set and predict on the test set.
Evaluate with accuracy, precision, recall, and F1-score, and plot a confusion matrix for each model.


In [None]:
# Initialize results table
results = pd.DataFrame(columns=['Model', 'Accuracy'])

# Define models (fixed random_state on the tree-based models so reruns give the same scores)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Support Vector Machine": SVC()
}

# Train, predict, evaluate
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Accuracy
    acc = accuracy_score(y_test, y_pred)
    results.loc[len(results)] = [name, acc]

    # Print classification report
    print(f"\n🔍 {name}")
    print(classification_report(y_test, y_pred))

    # Plot confusion matrix
    plot_conf_matrix(y_test, y_pred, name)
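
A single 70/30 split of 178 samples leaves roughly 54 test rows, so one misclassified sample moves accuracy by almost two percentage points. Cross-validation is one way to get a steadier estimate. The following is a minimal sketch for logistic regression only, assuming 5 stratified folds and a fixed seed (both arbitrary choices, not taken from this notebook). Wrapping the scaler and classifier in a pipeline means the scaler is refit on the training portion of each fold.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_wine(return_X_y=True)

# Scaler and classifier are fitted together inside every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class proportions similar in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipe, X_all, y_all, cv=cv, scoring="accuracy")

print("Fold accuracies:", cv_scores.round(3))
print(f"Mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

The same pattern could be applied to each of the other classifiers by swapping the final pipeline step.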


**6. Compare Model Performance**

Purpose: Create a comparison table and bar plot to see which model performed best.

Collect the accuracy scores of all models in a DataFrame and sort them from highest to lowest.
Plot the sorted scores to compare the classifiers on the Wine test set.

In [None]:
# Sort and display results
results_sorted = results.sort_values(by='Accuracy', ascending=False)
print("\n✅ Model Comparison:")
print(results_sorted)

# Plot accuracy comparison
plt.figure(figsize=(10, 6))
sns.barplot(data=results_sorted, x='Accuracy', y='Model', palette='viridis')
plt.title("Model Accuracy Comparison")
plt.xlabel("Accuracy")
plt.ylabel("Model")
plt.xlim(0.8, 1.0)  # zoomed-in axis; lower the left limit if any model scores below 0.8
plt.grid(True)
plt.show()


**7. Final Notes**

✅ All six models are trained and tested on the same 70/30 split.

📊 The evaluation is based on accuracy, the classification report (precision, recall, F1-score), and the confusion matrix.


