
# 📌 Task 1: Exploring and Visualizing the Iris Dataset

## 📝 Introduction and Problem Statement
The Iris dataset is one of the best-known datasets in machine learning and statistics.
It contains 150 observations of iris flowers, 50 from each of three species (setosa, versicolor, virginica).
Each observation includes four features measured in centimeters: sepal length, sepal width, petal length, and petal width.
The objective of this task is to explore the dataset, visualize relationships between features, check whether any cleaning is needed,
and build a classification model that predicts the species of a flower from its measurements.
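
The dataset description above can be checked directly. The sketch below is a minimal example that assumes scikit-learn's bundled copy of Iris (`load_iris`) rather than the seaborn loader used in this notebook, so it runs without a download. Its column names differ from seaborn's (for example `sepal length (cm)` instead of `sepal_length`), and the variable names are illustrative only.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy is used so that no download is needed.
iris = load_iris(as_frame=True)
frame = iris.frame.copy()  # four feature columns plus an integer 'target' column

# Map the integer target codes (0, 1, 2) to species names for readability.
code_to_name = {code: str(name) for code, name in enumerate(iris.target_names)}
frame["species"] = frame["target"].map(code_to_name)

n_rows = frame.shape[0]
feature_names = list(iris.feature_names)
species_counts = frame["species"].value_counts().to_dict()

print("Observations:", n_rows)
print("Features:", feature_names)
print("Observations per species:", species_counts)
```

A balanced class count matters later: with equal classes, plain accuracy is a reasonable first metric for the classifier.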


In [None]:

# 📚 Import Required Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 📥 Load the Iris Dataset
df = sns.load_dataset("iris")

# 📄 Display Basic Dataset Information
print("✅ Shape of the dataset:", df.shape)
print("✅ Column names:", df.columns.tolist())
print("\n🔹 First 5 rows:\n", df.head())

# 🔍 Check for Missing Values
print("\n🔍 Missing values in each column:\n", df.isnull().sum())

# 📊 Summary Statistics of the Dataset
print("\n📊 Statistical Summary:\n", df.describe())

# 📈 Exploratory Data Analysis (EDA)

# 🔹 Pairplot to visualize relationships between features
sns.pairplot(df, hue='species')
plt.suptitle('Pairplot: Feature Relationships in Iris Dataset', y=1.02)  # no emoji: Matplotlib's default font lacks these glyphs
plt.show()

# 🔹 Histograms to examine feature distributions
df.hist(figsize=(10, 8), bins=15, edgecolor='black')
plt.suptitle('Histograms of Iris Features')  # no emoji: Matplotlib's default font lacks these glyphs
plt.tight_layout()
plt.show()

# 🔹 Box plots to detect outliers and compare spread
plt.figure(figsize=(12, 8))
for i, column in enumerate(df.columns[:-1], 1):  # Exclude 'species' column
    plt.subplot(2, 2, i)
    sns.boxplot(x='species', y=column, data=df)
    plt.title(f'Box Plot of {column} by Species')  # no emoji: avoids missing-glyph warnings
plt.tight_layout()
plt.show()

# 🤖 Model Training and Evaluation

# 🔹 Define Features and Target
X = df.drop('species', axis=1)
y = df['species']

# 🔹 Split Dataset into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 🔹 Initialize and Train Logistic Regression Model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# 🔹 Make Predictions on Test Set
y_pred = model.predict(X_test)

# 📏 Evaluate Model Performance
print("\n✅ Model Accuracy:", accuracy_score(y_test, y_pred))
print("\n🧾 Classification Report:\n", classification_report(y_test, y_pred))
print("\n📉 Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
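
The accuracy printed above comes from a single 80/20 split, so it depends on which 30 flowers land in the test set. As a hedged cross-check, the sketch below runs 5-fold stratified cross-validation with the same model settings. It is not part of the original workflow, and it assumes scikit-learn's bundled `load_iris` data so that it is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: the bundled scikit-learn copy of Iris stands in for the seaborn DataFrame.
X, y = load_iris(return_X_y=True)

# Stratified folds keep the 50/50/50 species balance in every train/test split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=200)

# One accuracy value per fold; the spread shows how much a single split can vary.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The same idea applies to the single split: passing `stratify=y` to `train_test_split` keeps the species proportions equal in the training and test sets.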



## 📌 Conclusion
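The pair plot and box plots show that petal measurements separate the species most clearly:
setosa is well separated from the other two species, while versicolor and virginica overlap slightly.
The Logistic Regression model achieved high accuracy on the held-out 20% test split (30 flowers); the classification report and confusion matrix above give the per-species breakdown.
Possible next steps include comparing against other models such as SVM or Random Forest, although the dataset is small and easy enough that large gains are unlikely.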
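
The two claims in this conclusion, that petal features carry most of the signal and that other models may be worth trying, can be examined with a short experiment. The sketch below is an illustration under stated assumptions, not the notebook's original method: it uses scikit-learn's bundled `load_iris`, 5-fold stratified cross-validation, default hyperparameters for `SVC` and `RandomForestClassifier`, and a hypothetical helper named `mean_cv_accuracy`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Column order: sepal length, sepal width, petal length, petal width.
X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


def mean_cv_accuracy(estimator, features):
    """Hypothetical helper: mean accuracy over the shared stratified folds."""
    return cross_val_score(estimator, features, y, cv=cv, scoring="accuracy").mean()


# 1) Feature subsets with the same logistic regression settings as the notebook.
sepal_only = mean_cv_accuracy(LogisticRegression(max_iter=200), X[:, :2])
petal_only = mean_cv_accuracy(LogisticRegression(max_iter=200), X[:, 2:])
print(f"Sepal features only: {sepal_only:.3f}")
print(f"Petal features only: {petal_only:.3f}")

# 2) Alternative models on all four features (default hyperparameters, untuned).
model_scores = {
    "LogisticRegression": mean_cv_accuracy(LogisticRegression(max_iter=200), X),
    "SVC (RBF kernel)": mean_cv_accuracy(SVC(), X),
    "RandomForest": mean_cv_accuracy(
        RandomForestClassifier(n_estimators=100, random_state=42), X
    ),
}
for name, score in model_scores.items():
    print(f"{name}: {score:.3f}")
```

Because all models share the same folds, differences of a percentage point or two correspond to only a few flowers and should not be over-interpreted.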
