# Lab #3

Data exploration is a crucial step in the AI/ML pipeline. Before we dive into complex algorithms, we need to understand the basics of our dataset. This process involves:

- Summarizing the main characteristics of the dataset
- Detecting any anomalies or outliers that may affect model performance
- Revealing the structure and patterns in the data through visualization

We will use the following Python libraries: pandas for data exploration, matplotlib and seaborn for visualization, NumPy for numerical operations, and scikit-learn for preprocessing and classification.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

## Data Loading

- Here, we load the Iris dataset using pandas. We use the 'head', 'info', and 'describe' methods to get an initial understanding of the dataset's structure, its data types, and some basic summary statistics.

In [None]:
# Load the dataset
df = pd.read_csv('data/iris_dataset.csv')

# Initial exploration
df.info()

In [None]:
# Preview the first five rows
df.head()

In [None]:
df.describe()

## Data Cleaning and Preprocessing

- Missing values can significantly affect the performance of ML models. Here we impute them with the mean of each column. This is a simple strategy; more sophisticated methods, such as median or model-based imputation, are also available.

In [None]:
# Handling missing values - simple mean imputation (numeric columns only)
df = df.fillna(df.mean(numeric_only=True))

# Check if there are any missing values left
print(df.isna().sum())
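
As one alternative to mean imputation, the sketch below uses the median, which is less sensitive to extreme values. It runs on a small made-up DataFrame rather than the Iris data; the column names `feature_a` and `feature_b` are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: each column has one missing value,
# and feature_a also contains an extreme value (100.0)
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, np.nan, 4.0, 100.0],
    "feature_b": [0.5, np.nan, 0.7, 0.6, 0.4],
})

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
```

With the mean, the missing entry in `feature_a` would be pulled toward the extreme value; with the median it is filled from the middle of the observed values.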

- Normalization is important when features have different scales. Here we use `StandardScaler`, which standardizes each feature to zero mean and unit variance.

In [None]:
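# Normalization (if necessary)
scaler = StandardScaler()
# Make sure the class labels are integers (mean imputation above can produce fractional values)
df['target'] = np.ceil(df['target']).astype(int)
# Scale only the feature columns, keeping their names and the original index
features = df.drop('target', axis=1)
df_scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns, index=df.index)
df_scaled['target'] = df['target']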

# df = df_scaled # uncomment if you want to use the scaled values
df_scaled
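
To make the scaling step concrete, here is a minimal sketch on a made-up array (not the Iris data). It checks that `StandardScaler` matches the formula z = (x - mean) / std applied column by column, where the standard deviation is the population one (`ddof=0`, the NumPy default).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: two features on very different scales
values = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Scaling done by scikit-learn
scaled = StandardScaler().fit_transform(values)

# The same computation by hand: subtract the column mean,
# divide by the column (population) standard deviation
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, each column has mean 0 and standard deviation 1, so neither feature dominates distance-based models such as SVMs.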

In [None]:
df.info()

## Descriptive Statistical Analysis

- Descriptive statistics provide important insights into the data. Mean and median offer information about the central tendency, while standard deviation tells us about the spread of the data. These metrics are foundational for understanding any dataset.

In [None]:
# Basic statistics
print("Mean:\n", df.mean())
print("Median:\n", df.median())
print("Standard Deviation:\n", df.std())
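
The difference between mean and median is easiest to see with an extreme value. The sketch below uses a made-up Series (not the Iris data) in which one value is far from the rest.

```python
import pandas as pd

# Hypothetical toy values with a single extreme observation
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

mean_value = values.mean()      # pulled upward by the extreme value
median_value = values.median()  # middle value, unaffected by the extreme value
std_value = values.std()        # pandas uses the sample standard deviation (ddof=1)

print("Mean:", mean_value)
print("Median:", median_value)
print("Standard deviation:", std_value)
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, whereas NumPy and `StandardScaler` use the population version (`ddof=0`), so the two can differ slightly on small datasets.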

## Data Visualization

- Visualization is key to understanding the distributions of variables and the relationships between them. Histograms show the distribution of each feature, box plots highlight outliers, and scatter plots (using pairplot) reveal the relationships between pairs of features, colored by the target variable.

In [None]:
# Histograms for feature distributions
df.hist(figsize=(10, 8))
plt.show()

In [None]:
# Box plots to identify outliers
df.boxplot(figsize=(10, 8))
plt.show()
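
Box plots drawn with matplotlib's default settings mark points as outliers when they lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that same rule numerically to a made-up Series; the values are hypothetical and not taken from the Iris data.

```python
import pandas as pd

# Hypothetical toy measurements with one suspicious value
values = pd.Series([4.9, 5.0, 5.1, 5.2, 5.3, 9.0])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the limits are flagged as outliers
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

The same filter can be applied to a single DataFrame column when the flagged rows need to be inspected rather than just seen on a plot.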

In [None]:
# Pairwise scatter plots of all features, colored by target class
sns.pairplot(df, hue='target', corner=True)
plt.show()

## Example SVC (Support Vector Classifier)

- Let's create a classifier that can predict the target class using only two attributes.
- We should choose two attributes for which the classes are easy to separate (check the pairplot from the previous step).

- There are various types of [SVC](https://scikit-learn.org/stable/modules/svm.html) and different parameters that you can use according to your needs.
- Let's see an example with four different models that use 'petal width (cm)' and 'petal length (cm)' as attributes.

In [None]:
# Classification using Support Vector Classifier (SVC)
X = df[['petal width (cm)', 'petal length (cm)']].values
y = df['target'].values

# Create a meshgrid to plot decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))


C = 1.0  # SVM regularization parameter
models = (
    SVC(kernel="linear", C=C),
    SVC(kernel="rbf", gamma=0.7, C=C),
    SVC(kernel="poly", degree=3, gamma="auto", C=C),
    LinearSVC(C=C, max_iter=10000, dual=True),
)
# Fit every model up front; a list (rather than a generator) keeps the fitted models reusable
models = [clf.fit(X, y) for clf in models]

# title for the plots
titles = (
    "SVC with linear kernel",
    "SVC with RBF kernel",
    "SVC with polynomial (degree 3) kernel",
    "LinearSVC (linear kernel)",
)

fig, sub = plt.subplots(2, 2, figsize=(12,10))

for clf, title, ax in zip(models, titles, sub.flatten()):
    # Plot decision boundary
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.2, cmap='coolwarm')

    # Scatter plot with color-coded target variable
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, s=20, edgecolors="k")
    ax.set_title(title)
    ax.set_xlabel('Petal Width (cm)')
    ax.set_ylabel('Petal Length (cm)')

plt.show()
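
The decision-boundary plots compare the models visually. As a rough numerical complement, the sketch below estimates accuracy with 5-fold cross-validation. It makes two assumptions that are not part of the lab: it loads Iris from scikit-learn's built-in `load_iris` instead of the CSV file, and it uses `cross_val_score`, which the lab itself does not use.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled Iris data stands in for the lab's CSV file
iris = load_iris()
# Columns 3 and 2 are petal width and petal length in load_iris
X = iris.data[:, [3, 2]]
y = iris.target

# Same kernels and parameters as the SVC models plotted above
candidates = {
    "linear": SVC(kernel="linear", C=1.0),
    "rbf": SVC(kernel="rbf", gamma=0.7, C=1.0),
    "poly": SVC(kernel="poly", degree=3, gamma="auto", C=1.0),
}

# Mean accuracy over 5 cross-validation folds for each model
scores = {
    name: cross_val_score(clf, X, y, cv=5).mean()
    for name, clf in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Cross-validation scores each model on data it was not fitted on, which gives a fairer comparison than judging the boundaries drawn over the training points.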
