# Lab #3

Data exploration is a crucial step in the AI/ML pipeline. Before we dive into complex algorithms, we need to understand the basics of our dataset. This process involves:

- Summarizing the main characteristics of the dataset
- Detecting any anomalies or outliers that may affect model performance
- Revealing structure and patterns in the data through visualization

We will use these Python libraries: pandas for data exploration, matplotlib and seaborn for visualization, and scikit-learn for feature scaling.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

## Data Loading

- Here, we load the Iris dataset with pandas. The 'head', 'info', and 'describe' methods give an initial view of the dataset's structure, its data types, and basic summary statistics for each column.

In [None]:
# Load the dataset
df = pd.read_csv('data/iris_dataset.csv')

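# Initial exploration
df.info()             # prints column types and non-null counts itself; it returns None, so no print() is needed
print(df.describe())  # count, mean, std, min, quartiles, and max for each numeric column
df.head()             # first five rows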
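Beyond 'info' and 'describe', a few one-line checks help summarize a dataset before cleaning it. The following is a minimal sketch on a toy DataFrame; the column names and values are invented for illustration and are not taken from the lab's CSV file.

```python
import pandas as pd

# Toy stand-in for the lab's DataFrame; column names and values are made up.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, None, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 6.0, 5.1],
    "target": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
})

n_rows, n_cols = toy.shape                    # dataset size
missing_per_column = toy.isna().sum()         # missing values in each column
class_counts = toy["target"].value_counts()   # class balance
n_duplicates = int(toy.duplicated().sum())    # fully duplicated rows

print(f"{n_rows} rows x {n_cols} columns")
print(missing_per_column)
print(class_counts)
print("duplicate rows:", n_duplicates)
```

The same calls should work on `df`, provided the label column is named 'target' as in the cells of this lab.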

## Data Cleaning and Preprocessing

- Missing values can significantly affect the performance of ML models. We handle them by replacing each missing entry with the mean of its column. This is a simple strategy; alternatives such as median or model-based imputation are often more robust. Feature scaling matters when features are measured on different scales: StandardScaler standardizes each feature to zero mean and unit variance, so that no feature dominates merely because of its units.

In [None]:
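# Handling missing values - Simple Imputation
# numeric_only=True skips non-numeric columns (e.g. string class labels), which would otherwise raise an error
df.fillna(df.mean(numeric_only=True), inplace=True)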

# Check if there are any missing values left
print(df.isna().sum())

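# Standardization (if necessary): rescale each feature to zero mean and unit variance
features = df.drop(columns='target')  # select features by name instead of assuming 'target' is the last column
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns, index=features.index)
df_scaled['target'] = df['target']
df_scaled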
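To make the scaling step concrete, the sketch below standardizes a toy DataFrame by hand with z = (x - mean) / std. The data and column names are invented for illustration. It uses the population standard deviation (`ddof=0`), which is the convention StandardScaler follows, whereas pandas' `std()` defaults to the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Toy features on different scales; values are made up for illustration.
toy = pd.DataFrame({
    "sepal_length": [5.0, 6.0, 7.0],
    "petal_width": [0.2, 1.3, 2.4],
})

# z = (x - mean) / std, with the population standard deviation (ddof=0)
z = (toy - toy.mean()) / toy.std(ddof=0)

# After standardization every column has mean 0 and population std 1
print(z.mean().round(6))
print(z.std(ddof=0).round(6))
```

Checking the mean and standard deviation of the scaled columns in this way is a cheap sanity test that the scaling was applied to the intended columns.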

## Descriptive Statistical Analysis

- Descriptive statistics provide important insights into the data. Mean and median offer information about the central tendency, while standard deviation tells us about the spread of the data. These metrics are foundational for understanding any dataset.

In [None]:
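# Basic statistics (numeric_only=True skips non-numeric columns such as string class labels)
print("Mean:\n", df.mean(numeric_only=True))
print("Median:\n", df.median(numeric_only=True))
print("Standard Deviation:\n", df.std(numeric_only=True))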
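Dataset-wide statistics can hide differences between classes, so with labeled data it is often useful to compute the same metrics per class. The sketch below does this with `groupby` on a toy DataFrame; the values are invented, and the 'target' column name is assumed to match the one used in this lab.

```python
import pandas as pd

# Toy labeled data; values are made up for illustration.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.6, 4.5, 4.7, 6.0, 5.0],
    "target": ["setosa", "setosa", "versicolor",
               "versicolor", "virginica", "virginica"],
})

# Central tendency and spread of one feature within each class.
# Note: pandas' "std" is the sample standard deviation (ddof=1).
per_class = toy.groupby("target")["petal_length"].agg(["mean", "median", "std"])
print(per_class)
```

Large gaps between the per-class means relative to the per-class standard deviations suggest that the feature separates the classes well.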

## Data Visualization

- Visualization is key to understanding the distribution of each variable and the relationships between variables. Histograms show the distribution of each feature, box plots highlight outliers, and a pair plot (seaborn's pairplot) shows the relationship between every pair of features, colored by the target variable.

In [None]:
# Histograms for feature distributions
df.hist(figsize=(10, 8))
plt.show()

# Box plots to identify outliers
df.boxplot(figsize=(10, 8))
plt.show()

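# Pair plot (advanced): one panel per feature pair, colored by target
# kind="kde" draws density contours; use kind="scatter" for plain scatter plots
sns.pairplot(df, hue='target', kind="kde")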
plt.show()
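The points drawn beyond the whiskers of a box plot can also be listed numerically. By default, matplotlib box plots place the whiskers at the most extreme data points within 1.5 times the interquartile range (IQR) of the quartiles. The sketch below applies that same rule to a toy series; the values and the series name are invented for illustration.

```python
import pandas as pd

# Toy feature with one obviously extreme value; values are made up.
values = pd.Series([2.0, 2.2, 2.4, 2.5, 2.6, 2.8, 3.0, 9.0], name="sepal_width")

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] is flagged as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Whether a flagged point is an error or a legitimate extreme value is a judgment call that depends on the data, so the rule is best treated as a prompt for inspection rather than an automatic filter.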
