**Univariate Analysis (Single Feature Analysis)**<br>

Objective: Understand the distribution and characteristics of a single feature using histograms, boxplots, and violin plots.<br>

Dataset: Use the Iris dataset or any dataset with multiple numerical features.


Title: Histogram<br>

Task 1: Plot a histogram of the petal length feature.

In [None]:
# Write your code from here
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris(as_frame=True)
iris_df = iris.frame

# Choose the feature
feature = 'petal length (cm)'
data = iris_df[feature]

# Plot the histogram
plt.figure(figsize=(8, 6))
sns.histplot(data, kde=True)
plt.title(f'Distribution of {feature}')
plt.xlabel(feature)
plt.ylabel('Frequency')
plt.show()

Title: Boxplots<br>

Task 2: Plot a boxplot of the petal length feature.

In [None]:
# Write your code from here
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris(as_frame=True)
iris_df = iris.frame

# Choose the feature
feature = 'petal length (cm)'
data = iris_df[feature]

# Plot the boxplot
plt.figure(figsize=(8, 6))
sns.boxplot(x=data)
plt.title(f'Boxplot of {feature}')
plt.xlabel(feature)
plt.show()

Title: Violin Plots<br>

Task 3: Plot a violin plot of the petal length feature.

In [None]:
# Write your code from here
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris(as_frame=True)
iris_df = iris.frame

# Choose the feature
feature = 'petal length (cm)'
data = iris_df[feature]

# Plot the violin plot
plt.figure(figsize=(8, 6))
sns.violinplot(x=data)
plt.title(f'Violin Plot of {feature}')
plt.xlabel(feature)
plt.show()
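
The boxplot and violin plot above both summarise the feature through its quartiles. As a rough numeric companion, the sketch below computes those quartiles and the whisker limits for petal length, assuming seaborn's default whisker rule of 1.5 times the interquartile range. The names `lower_fence`, `upper_fence`, and `outliers` are illustrative and do not come from the exercise.

```python
from sklearn.datasets import load_iris

# Load the Iris dataset and pick the same feature as in the plots
iris_df = load_iris(as_frame=True).frame
petal_length = iris_df['petal length (cm)']

# Quartiles that define the box (Q1, median, Q3)
q1, median, q3 = petal_length.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Assumed whisker rule: points beyond 1.5 * IQR from the box count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = petal_length[(petal_length < lower_fence) | (petal_length > upper_fence)]

print(f"Q1: {q1:.2f}, Median: {median:.2f}, Q3: {q3:.2f}, IQR: {iqr:.2f}")
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"Number of points outside the fences: {len(outliers)}")
```

Comparing these numbers with the boxplot is a quick way to check that the plot is being read correctly.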

**Bivariate Analysis (Relationships Between Features)**<br>

Objective: Explore relationships between two features using scatter plots and correlation heatmaps.

Title: Scatter Plots<br>

Task 1: Create a scatter plot between sepal length and sepal width.<br>
Task 2: Create a scatter plot between petal length and petal width.<br>
Task 3: Create a scatter plot between sepal length and petal length.<br>

In [None]:
# Write your code from here
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris(as_frame=True)
iris_df = iris.frame

# Task 1: Scatter plot between sepal length and sepal width
plt.figure(figsize=(8, 6))
sns.scatterplot(x='sepal length (cm)', y='sepal width (cm)', data=iris_df)
plt.title('Sepal Length vs. Sepal Width')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.show()

# Task 2: Scatter plot between petal length and petal width
plt.figure(figsize=(8, 6))
sns.scatterplot(x='petal length (cm)', y='petal width (cm)', data=iris_df)
plt.title('Petal Length vs. Petal Width')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.show()

# Task 3: Scatter plot between sepal length and petal length
plt.figure(figsize=(8, 6))
sns.scatterplot(x='sepal length (cm)', y='petal length (cm)', data=iris_df)
plt.title('Sepal Length vs. Petal Length')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Petal Length (cm)')
plt.show()

Title: Correlation Heatmaps<br>

Task 1: Generate a correlation heatmap of the dataset.<br>
Task 2: Highlight correlation between sepal length and petal length.<br>
Task 3: Highlight correlation between petal width and petal length.<br>

In [None]:
# Write your code from here
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris(as_frame=True)
iris_df = iris.frame

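# Calculate the correlation matrix of the four measured features.
# The integer class label ('target') is dropped because it is not a feature.
correlation_matrix = iris_df.drop(columns='target').corr()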

# Task 1: Generate the correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Iris Features')
plt.show()

# Task 2 & 3: Highlight specific correlations (can be done by looking at the heatmap)
# The heatmap visually shows the correlation between all pairs.
# We can print the specific correlation values as well if needed.
sepal_petal_corr = correlation_matrix.loc['sepal length (cm)', 'petal length (cm)']
petal_width_length_corr = correlation_matrix.loc['petal width (cm)', 'petal length (cm)']

print(f"Correlation between Sepal Length and Petal Length: {sepal_petal_corr:.2f}")
print(f"Correlation between Petal Width and Petal Length: {petal_width_length_corr:.2f}")

**Multivariate Analysis (Higher-Dimensional Data Relationships)**<br>
Objective: Analyze relationships in higher-dimensional data using pair plots and PCA.

Title: Pair Plots<br>

Task 1: Create a pair plot for the Iris dataset.<br>
Task 2: Focus on a subset of features (e.g., only petal dimensions).<br>
Task 3: Exclude one class to observe differences.

In [None]:
# Write your code from here
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris(as_frame=True)
iris_df = iris.frame

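# Task 1: Create a pair plot for the entire Iris dataset
# sns.pairplot creates its own figure, so no plt.figure() call is needed
# (calling it first would only produce an extra empty figure).
sns.pairplot(iris_df, hue='target', diag_kind='kde')
plt.suptitle('Pair Plot of All Iris Features', y=1.02)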
plt.show()

# Task 2: Focus on a subset of features (petal length and petal width)
petal_df = iris_df[['petal length (cm)', 'petal width (cm)', 'target']]
sns.pairplot(petal_df, hue='target', diag_kind='kde')
plt.suptitle('Pair Plot of Petal Dimensions', y=1.02)
plt.show()

# Task 3: Exclude one class (e.g., setosa) to observe differences
not_setosa_df = iris_df[iris_df['target'] != 0]
sns.pairplot(not_setosa_df, hue='target', diag_kind='kde')
plt.suptitle('Pair Plot Excluding Setosa', y=1.02)
plt.show()

Title: Principal Component Analysis (PCA)<br>

Task 1: Perform PCA and plot the first two principal components.<br>
Task 2: Visualize explained variance by each component.<br>
Task 3: Retain more components and visualize in 3D.

In [None]:
# Write your code from here
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D  # Import for 3D plotting

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Task 1: Perform PCA and plot the first two principal components
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
for i, target_name in enumerate(target_names):
    plt.scatter(X_pca_2d[y == i, 0], X_pca_2d[y == i, 1], label=target_name)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('First Two Principal Components of Iris Dataset')
plt.legend()
plt.grid(True)
plt.show()

# Task 2: Visualize explained variance by each component
pca_full = PCA()
pca_full.fit(X_scaled)

explained_variance_ratio = pca_full.explained_variance_ratio_

plt.figure(figsize=(8, 6))
plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Each Principal Component')
plt.xticks(range(1, len(explained_variance_ratio) + 1))
plt.grid(axis='y')
plt.show()

cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio, marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance')
plt.xticks(range(1, len(cumulative_variance_ratio) + 1))
plt.yticks(np.arange(0, 1.1, 0.1))
plt.grid(True)
plt.show()

# Task 3: Retain more components and visualize in 3D
pca_3d = PCA(n_components=3)
X_pca_3d = pca_3d.fit_transform(X_scaled)

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

for i, target_name in enumerate(target_names):
    ax.scatter(X_pca_3d[y == i, 0], X_pca_3d[y == i, 1], X_pca_3d[y == i, 2], label=target_name)

ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.set_title('First Three Principal Components of Iris Dataset')
ax.legend()
plt.show()
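
The explained variance ratios plotted in Task 2 can be cross-checked by hand. The sketch below assumes the standard formulation of PCA, in which each ratio is an eigenvalue of the covariance matrix of the standardized data divided by the sum of all eigenvalues. The names `manual_ratio` and `sklearn_ratio` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the features, as in the PCA cell above
X_scaled = StandardScaler().fit_transform(load_iris().data)

# Eigenvalues of the covariance matrix (eigvalsh returns them in ascending order)
cov = np.cov(X_scaled, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Share of total variance carried by each component
manual_ratio = eigenvalues / eigenvalues.sum()

# Ratios reported by scikit-learn for comparison
sklearn_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print("Manual:      ", np.round(manual_ratio, 4))
print("scikit-learn:", np.round(sklearn_ratio, 4))
print("Match:", np.allclose(manual_ratio, sklearn_ratio))
```

If the two rows agree, the bar chart in Task 2 is showing the normalized eigenvalue spectrum of the data.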

**Statistical Analysis in EDA**<br>

Objective: Calculate basic statistical metrics and explore the relationship between features using correlation and covariance.<br>

Title: Descriptive Statistics<br>

Task 1: Calculate mean, median, and standard deviation of petal length.<br>
Task 2: Calculate skewness and kurtosis of sepal width.<br>
Task 3: Calculate mean, median, and standard deviation of sepal length.

In [None]:
# Write your code from here
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris(as_frame=True)
iris_df = iris.frame

# Task 1: Calculate mean, median, and standard deviation of petal length
petal_length = iris_df['petal length (cm)']
mean_petal_length = petal_length.mean()
median_petal_length = petal_length.median()
std_petal_length = petal_length.std()

print("Task 1 - Petal Length:")
print(f"  Mean: {mean_petal_length:.2f}")
print(f"  Median: {median_petal_length:.2f}")
print(f"  Standard Deviation: {std_petal_length:.2f}\n")

# Task 2: Calculate skewness and kurtosis of sepal width
sepal_width = iris_df['sepal width (cm)']
skewness_sepal_width = sepal_width.skew()
# Note: pandas .kurt() returns excess kurtosis (0 for a normal distribution)
kurtosis_sepal_width = sepal_width.kurt()

print("Task 2 - Sepal Width:")
print(f"  Skewness: {skewness_sepal_width:.2f}")
print(f"  Kurtosis: {kurtosis_sepal_width:.2f}\n")

# Task 3: Calculate mean, median, and standard deviation of sepal length
sepal_length = iris_df['sepal length (cm)']
mean_sepal_length = sepal_length.mean()
median_sepal_length = sepal_length.median()
std_sepal_length = sepal_length.std()

print("Task 3 - Sepal Length:")
print(f"  Mean: {mean_sepal_length:.2f}")
print(f"  Median: {median_sepal_length:.2f}")
print(f"  Standard Deviation: {std_sepal_length:.2f}")
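
Different libraries use different conventions for skewness and kurtosis, so the same column can give slightly different numbers. The sketch below compares the pandas values from Task 2 with SciPy's, assuming pandas reports bias-corrected sample estimates with kurtosis in its excess (Fisher) form. The SciPy arguments `bias=False` and `fisher=True` are chosen to match that assumption.

```python
from scipy import stats
from sklearn.datasets import load_iris

# Same feature as in Task 2
iris_df = load_iris(as_frame=True).frame
sepal_width = iris_df['sepal width (cm)']

# pandas estimates
pandas_skew = sepal_width.skew()
pandas_kurt = sepal_width.kurt()

# SciPy estimates with bias correction switched on and excess kurtosis requested
scipy_skew = stats.skew(sepal_width, bias=False)
scipy_kurt = stats.kurtosis(sepal_width, fisher=True, bias=False)

# SciPy's default (biased) estimates, shown to illustrate the difference
scipy_skew_biased = stats.skew(sepal_width)
scipy_kurt_biased = stats.kurtosis(sepal_width)

print(f"Skewness  pandas: {pandas_skew:.4f}  scipy (bias=False): {scipy_skew:.4f}  scipy (default): {scipy_skew_biased:.4f}")
print(f"Kurtosis  pandas: {pandas_kurt:.4f}  scipy (bias=False): {scipy_kurt:.4f}  scipy (default): {scipy_kurt_biased:.4f}")
```

When reporting these statistics, it helps to state which convention was used.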

Title: Correlation & Covariance<br>

Task 1: Compute correlation between sepal length and petal length.<br>
Task 2: Compute covariance between petal width and sepal width.<br>
Task 3: Determine the most correlated pair of features.<br>

In [None]:
# Write your code from here
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris(as_frame=True)
iris_df = iris.frame

# Task 1: Compute correlation between sepal length and petal length
correlation_sepal_petal_length = iris_df['sepal length (cm)'].corr(iris_df['petal length (cm)'])

print(f"Task 1 - Correlation between Sepal Length and Petal Length: {correlation_sepal_petal_length:.2f}\n")

# Task 2: Compute covariance between petal width and sepal width
covariance_petal_width_sepal_width = iris_df['petal width (cm)'].cov(iris_df['sepal width (cm)'])

print(f"Task 2 - Covariance between Petal Width and Sepal Width: {covariance_petal_width_sepal_width:.2f}\n")

# Task 3: Determine the most correlated pair of features
# Drop the integer class label ('target') so only measured features are compared
correlation_matrix = iris_df.drop(columns='target').corr()

# Exclude the diagonal by taking the upper triangle
upper_triangle = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))

# Find the pair with the highest absolute correlation
most_correlated_pair = upper_triangle.abs().stack().idxmax()
highest_correlation = upper_triangle.loc[most_correlated_pair]

print("Task 3 - Most Correlated Pair of Features:")
print(f"  Pair: {most_correlated_pair[0]} and {most_correlated_pair[1]}")
print(f"  Correlation: {highest_correlation:.2f}")
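
Correlation and covariance are two scalings of the same quantity: Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this for the pair used in Task 2. The names `corr_from_cov` and `corr_direct` are illustrative.

```python
from sklearn.datasets import load_iris

# Same pair of features as in Task 2
iris_df = load_iris(as_frame=True).frame
x = iris_df['petal width (cm)']
y = iris_df['sepal width (cm)']

# Covariance depends on the units of both features
cov_xy = x.cov(y)

# Dividing by both standard deviations removes the units
corr_from_cov = cov_xy / (x.std() * y.std())

# Pearson correlation computed directly by pandas
corr_direct = x.corr(y)

print(f"Covariance: {cov_xy:.4f}")
print(f"Correlation from covariance: {corr_from_cov:.4f}")
print(f"Correlation from pandas:     {corr_direct:.4f}")
```

Because correlation is unit-free and bounded between -1 and 1, it is usually the easier of the two to compare across feature pairs.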
