# EDA + Feature Engineering on Iris Dataset (Pandas + Matplotlib)
## Part A – Data Exploration
- Load the Dataset
    - Load iris.csv into a Pandas DataFrame.
    - Show shape, columns, dtypes, and first 5 rows.

- Summary Statistics
    - Use .describe() for numeric columns.
    - Find which species has the highest mean petal length.

- Data Quality
    - Check for missing values (.isna().sum()).
    - Check for duplicates and remove them if any.

- Univariate Analysis
    - Plot histograms of each feature (SepalLength, SepalWidth, PetalLength, PetalWidth).
    - Plot boxplots (Matplotlib’s plt.boxplot) for each feature grouped by species.
    - Identify which feature has the largest spread (variance).

- Bivariate Analysis
    - Create scatter plots:
        - SepalLength vs PetalLength (color by species).
        - SepalWidth vs PetalWidth.
    - Use different markers (o, s, ^) for species.
    - Which two features best separate the species visually?

- Correlation Analysis
    - Compute correlation matrix using .corr().
    - Plot as a heatmap-style table using Matplotlib (imshow + colorbar).
    - Which two features are most strongly correlated?

## Part B – Feature Engineering (Knowledge Engineering)
- Feature Transformation
    - Normalize features using (x - min) / (max - min).
    - Standardize features using (x - mean) / std.
    - Compare histograms of raw vs scaled values.

- New Features
    - Create:
        - SepalArea = SepalLength * SepalWidth
        - PetalArea = PetalLength * PetalWidth
    - Compare average values of new features across species with bar plots.

- Feature Binning
    - Bin Sepal Length into 3 categories: Short, Medium, Long.
    - Show species count within each bin using a bar chart.

- Encoding
    - Encode Species into numbers (0,1,2) using Pandas .factorize() or .astype('category').cat.codes.
    - Print mapping.

## Part C – Mini Assignment Questions
- Which feature shows the highest variance across the dataset?
- Which feature is the best predictor of species separation?
- Between SepalArea and PetalArea, which one separates species better?
- In your scatter plots, which species tends to overlap the most with others?
- Suggest two more new features you could engineer from existing columns.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

## Part A - Data Exploration

### 1. Load the dataset

In [None]:
iris_df=sns.load_dataset('iris')

In [None]:
# First five rows
iris_df.head()

In [None]:
#Shape
print(f"Shape of dataframe is {iris_df.shape}")

In [None]:
#Columns
print("Columns of Iris dataset are \n", iris_df.columns)    


In [None]:
# Dtypes and non-null counts
print("Column dtypes and non-null counts:\n")
iris_df.info()

### 2. Summary Statistics

In [None]:
print("describe numerical columns")
iris_df.describe()

In [None]:
print("species with highest mean petal length")
iris_df.groupby('species')['petal_length'].mean().sort_values(ascending=False).head(1)
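
An alternative to sorting and taking the first row is `idxmax()`, which returns the group label directly. The sketch below uses a small toy frame (`toy_df`, illustrative values only) with the same column names as the notebook.

```python
import pandas as pd

# Toy rows with the notebook's column names (illustrative values, not the full dataset).
toy_df = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "petal_length": [1.4, 1.5, 4.5, 4.7, 5.8, 6.0],
})

# Mean petal length per species, then the label of the largest mean.
mean_petal_length = toy_df.groupby("species")["petal_length"].mean()
top_species = mean_petal_length.idxmax()
print(top_species, round(mean_petal_length.max(), 2))
```

The same two lines should work on `iris_df` unchanged, since only the `species` and `petal_length` columns are used.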

### 3. Data Quality

In [None]:
# Check for missing values (.isna().sum())
print("Check for missing values")
iris_df.isna().sum()

In [None]:
# Check for duplicates and remove them if any
print("Duplicate rows found:", iris_df.duplicated().sum())
iris_df.drop_duplicates(inplace=True)
print("Shape after removing duplicates:", iris_df.shape)
iris_df

### 4. Univariate Analysis

In [None]:
# Plot histograms of each feature (SepalLength, SepalWidth, PetalLength, PetalWidth).
iris_df['sepal_length'].plot(kind='hist',bins=10,edgecolor='black',label='Sepal Length',color='pink')
plt.legend()
plt.title('Sepal Length Distribution')
plt.xlabel('Sepal length (cm)')
plt.ylabel('Frequency')
plt.show()

In [None]:
iris_df['sepal_width'].plot(kind='hist',bins=10,edgecolor='black',label='Sepal Width',color='yellow')
plt.legend()
plt.title('Sepal Width Distribution')
plt.xlabel('Sepal width (cm)')
plt.ylabel('Frequency')
plt.show()

In [None]:
iris_df['petal_length'].plot(kind='hist',bins=20,edgecolor='black',label='Petal Length',color='pink')
plt.legend()
plt.title('Petal Length Distribution')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Frequency')
plt.show()

In [None]:
iris_df['petal_width'].plot(kind='hist',bins=20,edgecolor='black',label='Petal Width',color='yellow')
plt.legend()
plt.title('Petal Width Distribution')
plt.xlabel('Petal Width (cm)')
plt.ylabel('Frequency')
plt.show()
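
The four histogram cells above repeat the same pattern, so they could also be drawn in one figure with a loop. This is a minimal sketch on a toy frame (`toy_df`, illustrative values); the bin count and figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows with the notebook's column names (illustrative values, not the full dataset).
toy_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width":  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# One subplot per feature, arranged in a 2x2 grid.
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for ax, column in zip(axes.ravel(), toy_df.columns):
    ax.hist(toy_df[column], bins=5, edgecolor="black")
    ax.set_title(f"{column} distribution")
    ax.set_xlabel("cm")
    ax.set_ylabel("Frequency")
fig.tight_layout()
plt.show()
```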

In [None]:
plt.boxplot(
    [iris_df['sepal_length'],iris_df['sepal_width'],iris_df['petal_length'],iris_df['petal_width']],
    tick_labels=['sepal len','sepal wid','petal len','petal wid'],patch_artist=True,vert=True)
plt.title('Iris dataset - Boxplots')
plt.ylabel('cm')
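
The task list asks for boxplots of each feature grouped by species, while the cell above pools all species into one box per feature. A minimal sketch of the grouped version follows, assuming a toy frame (`toy_df`, illustrative values) and only two features to keep it short.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows with the notebook's column names (illustrative values, not the full dataset).
toy_df = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
})
features = ["sepal_length", "petal_length"]
species_names = sorted(toy_df["species"].unique())

# One subplot per feature, one box per species inside each subplot.
fig, axes = plt.subplots(1, len(features), figsize=(8, 4))
for ax, feature in zip(axes, features):
    groups = [
        toy_df.loc[toy_df["species"] == name, feature].to_numpy()
        for name in species_names
    ]
    ax.boxplot(groups)
    # Boxes sit at positions 1..N, so label those ticks with the species names.
    ax.set_xticks(range(1, len(species_names) + 1))
    ax.set_xticklabels(species_names)
    ax.set_title(feature)
    ax.set_ylabel("cm")
fig.tight_layout()
plt.show()
```

Setting the tick labels explicitly avoids relying on the `tick_labels` keyword, which is only available in recent Matplotlib releases.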

In [None]:
variance=iris_df[['sepal_width','sepal_length','petal_width','petal_length']].var()
print(variance)
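
To pick the largest variance programmatically rather than by reading the printed values, `idxmax()` can be applied to the result of `.var()`. A small sketch on toy values (`toy_df` is illustrative); note that pandas computes the sample variance (`ddof=1`) by default.

```python
import pandas as pd

# Toy rows with the notebook's column names (illustrative values, not the full dataset).
toy_df = pd.DataFrame({
    "sepal_width":  [3.0, 3.2, 2.8, 3.1],
    "petal_length": [1.4, 4.5, 5.8, 1.5],
})

# Sample variance per column, then the column with the largest value.
variances = toy_df.var()
widest_feature = variances.idxmax()
print(variances.round(2))
print("Largest spread:", widest_feature)
```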

### Petal Length has the largest spread (variance).

### 5. Bivariate Analysis

In [None]:
# SepalWidth vs PetalWidth (color by species).
colors={'setosa':'red','virginica':'blue','versicolor':'green'}
markers={'setosa':'o','virginica':'s','versicolor':'^'}
species=iris_df['species'].unique()
for s in species:
    subset=iris_df[iris_df['species']==s]
    plt.scatter(subset['sepal_width'],subset['petal_width'],c=colors[s],label=s,marker=markers[s],alpha=0.7)
plt.grid(color='gray',linestyle="--")
plt.title("SepalWidth vs PetalWidth")
plt.xlabel("Sepal Width")
plt.ylabel("Petal Width")
plt.legend()
plt.show()

In [None]:
# SepalLength vs PetalLength (color by species)
colors={'setosa':'red','virginica':'blue','versicolor':'green'}
markers={'setosa':'o','virginica':'s','versicolor':'^'}
species=iris_df['species'].unique()
for s in species:
    subset=iris_df[iris_df['species']==s]
    plt.scatter(subset['sepal_length'],subset['petal_length'],c=colors[s],label=s,marker=markers[s],alpha=0.7)
plt.grid(color='gray',linestyle="--")
plt.title("SepalLength vs PetalLength")
plt.xlabel("Sepal Length")
plt.ylabel("Petal Length")
plt.legend()
plt.show()

### The Petal dimensions (Petal Length & Petal Width) provide the best visual separation of the three species

### 6. Correlation Matrix

In [None]:
corr_matrix=iris_df.corr(numeric_only=True)*100
print(corr_matrix)
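
The task list also asks for a heatmap-style view with `imshow` and a colorbar. The sketch below shows one way to do that and to pick the most strongly correlated pair programmatically. It uses a toy frame (`toy_df`, illustrative values) and the unscaled correlation (range -1 to 1); the colormap is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy rows with the notebook's column names (illustrative values, not the full dataset).
toy_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width":  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})
corr = toy_df.corr()

# Heatmap: fix the color range to [-1, 1] so colors are comparable across datasets.
fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
fig.colorbar(image, ax=ax, label="Pearson r")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.tight_layout()
plt.show()

# Strongest pair: zero out the diagonal (always 1.0) and take the largest |r|.
masked = corr.abs().to_numpy().copy()
np.fill_diagonal(masked, 0.0)
row, col = np.unravel_index(masked.argmax(), masked.shape)
strongest_pair = (corr.index[row], corr.columns[col])
print("Most strongly correlated:", strongest_pair)
```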

### Petal Length and Petal Width are the most strongly correlated pair (96.28 on the ×100 scale used above, i.e. r ≈ 0.96)

## Part B - Feature Engineering (Knowledge Engineering)

### 1. Feature Transformation

In [None]:
print("Normalization on Sepal Length = ( x - min ) / ( max - min )")
normalized=(iris_df['sepal_length']-iris_df['sepal_length'].min())/(iris_df['sepal_length'].max()-iris_df['sepal_length'].min())
print(round(normalized.sample(10),2))

In [None]:
print("Standardize features using (x - mean) / std")
std=iris_df['sepal_length'].std()
mean=iris_df['sepal_length'].mean()
print(f"Std = {std} Mean = {mean}")
standard_val=round((iris_df['sepal_length']-mean)/std,2)
print(standard_val.sample(10))
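
The two cells above scale `sepal_length` only. The same formulas can be applied to every numeric column at once, because pandas broadcasts column-wise statistics. A minimal sketch with two hypothetical helpers (`min_max_scale`, `z_score`) on toy values:

```python
import pandas as pd

def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to [0, 1] using (x - min) / (max - min)."""
    return (frame - frame.min()) / (frame.max() - frame.min())

def z_score(frame: pd.DataFrame) -> pd.DataFrame:
    """Center each column at 0 with unit sample std using (x - mean) / std."""
    return (frame - frame.mean()) / frame.std()

# Toy rows with the notebook's column names (illustrative values, not the full dataset).
toy_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
})

normalized = min_max_scale(toy_df)
standardized = z_score(toy_df)
print(normalized.round(2))
print(standardized.round(2))
```

Both helpers assume every column is numeric and not constant; a constant column would produce a division by zero.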

In [None]:
fig,ax=plt.subplots(1,3,figsize=(12,5))
fig.supxlabel("Sepal Length (raw or scaled value)")
fig.supylabel("Frequency")

ax[0].hist(iris_df['sepal_length'],bins=8,edgecolor="black",color='purple')
ax[0].set_title('Sepal Length Raw data')

ax[1].hist(normalized,bins=8,edgecolor="black",color='red')
ax[1].set_title('Sepal Length Normalized data')


ax[2].hist(standard_val,bins=8,edgecolor="black",color='orange')
ax[2].set_title('Sepal Length Standardized data')

plt.suptitle("histograms of raw vs scaled values")
plt.tight_layout()
plt.show()

### 2. New Features

In [None]:
# SepalArea = SepalLength * SepalWidth
iris_df['sepalArea']=iris_df['sepal_length']*iris_df['sepal_width']
iris_df.head()

In [None]:
# PetalArea = PetalLength * PetalWidth
iris_df['petalArea']=iris_df['petal_length']*iris_df['petal_width']
iris_df.head()

In [None]:
fig,ax=plt.subplots(1,2,figsize=(10,5))
fig.supxlabel("species")
fig.suptitle("Average values of new features across species")
colors={'setosa':'#cc4bf2','versicolor':'#8e1cb0','virginica':"#5E0776"}
species=iris_df['species'].unique()
for s in species:
    subset=iris_df[iris_df['species']==s]
    ax[0].bar(s,subset['sepalArea'].mean(),color=colors[s])
    ax[0].set_ylabel("Average SepalArea (cm^2)")

    ax[1].bar(s,subset['petalArea'].mean(),color=colors[s])
    ax[1].set_ylabel("Average petalArea (cm^2)")
plt.tight_layout()

### 3. Feature Binning

In [None]:
# Bin Sepal Length into 3 categories: Short, Medium, Long.
iris_df['sepal_length_bin']=pd.cut(iris_df['sepal_length'],bins=3,labels=['short','medium','long'])
iris_df.sample(10)
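
With `bins=3`, `pd.cut` splits the observed range into three equal-width intervals, so the cut points depend on the data. Passing `retbins=True` returns those edges for inspection. A small sketch on toy values (the series below is illustrative):

```python
import pandas as pd

# Toy sepal lengths spanning roughly the range seen in the dataset (illustrative values).
values = pd.Series([4.3, 5.0, 5.4, 6.1, 6.6, 7.9], name="sepal_length")

# Equal-width bins; retbins=True also returns the computed bin edges.
binned, edges = pd.cut(
    values, bins=3, labels=["short", "medium", "long"], retbins=True
)
print(pd.DataFrame({"sepal_length": values, "bin": binned}))
print("Bin edges:", edges.round(2))
```

If bins with roughly equal counts are preferred over equal widths, `pd.qcut` is the quantile-based counterpart.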

In [None]:
print("Count species in each bin")
bin_counts=iris_df.groupby(['species','sepal_length_bin'],observed=True).size().unstack(fill_value=0)
print(bin_counts)

In [None]:
# Transpose so the x-axis shows the bins and the legend shows the species
bin_counts.T.plot(kind="bar",figsize=(8,5),edgecolor='black')
plt.title("Species Count by Sepal Length Bin")
plt.xlabel("Sepal Length Category")
plt.ylabel("Count")
plt.legend(title="Species")
plt.tight_layout()
plt.show()

### 4. Encoding

In [None]:
print("Adding species code")
iris_df['species_code'],uniques=pd.factorize(iris_df['species'])
# Print the code -> label mapping
for code,label in enumerate(uniques):
    print(code, "->", label)
iris_df.sample(10)
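
The task list mentions `.astype('category').cat.codes` as an alternative to `factorize`. The two can assign different codes: `factorize` numbers labels in order of first appearance, while category codes follow the sorted category order. A short sketch on a deliberately unsorted toy series:

```python
import pandas as pd

# Toy labels in a non-alphabetical order to expose the difference (illustrative values).
species = pd.Series(["virginica", "setosa", "versicolor", "setosa"])

# factorize: codes follow order of first appearance.
codes_factorize, uniques = pd.factorize(species)
mapping_factorize = dict(enumerate(uniques))

# category codes: codes follow the sorted category order.
as_category = species.astype("category")
codes_category = as_category.cat.codes
mapping_category = dict(enumerate(as_category.cat.categories))

print("factorize:", mapping_factorize)
print("category :", mapping_category)
```

On the iris frame the rows are already grouped in alphabetical species order, so both approaches should give the same mapping there.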

## Part C - Mini Assignment Answers
Ans1 = Petal Length

Ans2 = Petal Length and Petal Width together

Ans3 = Petal Area

Ans4 = Versicolor and Virginica


In [None]:
# Solution 1
round(iris_df.var(numeric_only=True),2)

In [None]:
# Solution 2
iris_df.corr(numeric_only=True)['species_code']

In [None]:
# Solution 3
plt.scatter(iris_df['sepalArea'], iris_df['petalArea'], c=iris_df['species_code'])
plt.title("Sepal Area vs Petal Area")
plt.xlabel("Sepal Area")
plt.ylabel("Petal Area")
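
The scatter plot gives a visual answer to the SepalArea vs PetalArea question. One way to put a number on it, which is an added technique and not part of the original notebook, is the ratio of between-species to within-species sum of squares for each feature: a larger ratio means the species means are far apart relative to the spread inside each species. The helper `separation_ratio` and the toy frame below are hypothetical.

```python
import pandas as pd

def separation_ratio(frame: pd.DataFrame, feature: str, label: str = "species") -> float:
    """Between-group sum of squares divided by within-group sum of squares."""
    grouped = frame.groupby(label)[feature]
    overall_mean = frame[feature].mean()
    # Spread of the group means around the overall mean, weighted by group size.
    between = (grouped.size() * (grouped.mean() - overall_mean) ** 2).sum()
    # Spread of each row around its own group mean.
    within = ((frame[feature] - grouped.transform("mean")) ** 2).sum()
    return between / within

# Toy rows with the notebook's column names (illustrative values, not the full dataset).
toy_df = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width":  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})
toy_df["sepalArea"] = toy_df["sepal_length"] * toy_df["sepal_width"]
toy_df["petalArea"] = toy_df["petal_length"] * toy_df["petal_width"]

sepal_ratio = separation_ratio(toy_df, "sepalArea")
petal_ratio = separation_ratio(toy_df, "petalArea")
print(f"sepalArea ratio: {sepal_ratio:.2f}")
print(f"petalArea ratio: {petal_ratio:.2f}")
```

Unlike the correlation with `species_code` used in Solution 2, this ratio does not depend on which integer each species happens to receive.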

In [None]:
# Save updated data
# iris_df.to_excel("Iris.xlsx")

### Completed basic EDA on the Iris dataset, covering distributions, correlations, feature transformations, and species separability 