## Mid-course Evaluation
### After you finish this, make your notebook public and send the link to Raz in Slack

**Iris Dataset Explanation:**
The Iris dataset is a well-known dataset in machine learning and data analysis. It contains measurements of four attributes (sepal length, sepal width, petal length, and petal width) for three species of Iris flowers: Setosa, Versicolor, and Virginica. The dataset consists of 150 samples, with 50 samples for each species.


**Data import**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = sns.load_dataset('iris')

# 1. Data Preparation (25 points)
# a. Examine the first and last 5 rows of the dataset.

In [None]:
first_five = iris.head(5)
last_five = iris.tail(5)
# Print both slices so they can actually be examined
print(first_five)
print(last_five)

# b. Check the data types of each column.


In [None]:
dtypes = iris.dtypes
print(dtypes)  # show one dtype per column
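
The dtype check matters later because the aggregation steps only apply to numeric columns. As a minimal sketch, the frame `toy` below is a made-up stand-in (its values are assumptions, not real Iris measurements) showing how `select_dtypes` can list the numeric columns once the dtypes are known.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Iris data (values are made up)
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "species": ["setosa", "versicolor", "virginica"],
})

# dtypes reports one dtype per column
print(toy.dtypes)

# select_dtypes keeps only the columns whose dtype matches the filter
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

The same call on the real `iris` frame should leave out the species column, which holds text labels.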


# c. Rename the columns to more descriptive names.


In [None]:
iris = iris.rename(columns={
    'sepal_length': 'Sepal_Length_cm',
    'sepal_width': 'Sepal_Width_cm',
    'petal_length': 'Petal_Length_cm',
    'petal_width': 'Petal_Width_cm',
    'species': 'Species'
})


# d. Create a new column "sepal_area" which is the product of sepal length and sepal width.

In [None]:
iris['Sepal_Area_cm2'] = iris['Sepal_Length_cm'] * iris['Sepal_Width_cm']


# e. Create a new column "petal_area" which is the product of petal length and petal width.

In [None]:
iris['Petal_Area_cm2'] = iris['Petal_Length_cm'] * iris['Petal_Width_cm']


# 2. Data Manipulation (25 points)
# a. Sort the dataset by sepal length in descending order.

In [None]:
iris_sorted = iris.sort_values(by='Sepal_Length_cm', ascending=False)


# b. Filter the dataset to include only the rows where sepal width is greater than 3.0.

In [None]:
iris_filtered = iris[iris['Sepal_Width_cm'] > 3.0]


# c. Group the dataset by species and calculate the mean values for each numeric column.

In [None]:
iris_grouped_mean = iris.groupby('Species').mean(numeric_only=True)


# d. Pivot the dataset to create a new DataFrame with species as columns and sepal length and sepal width as rows.

In [None]:
iris_pivot = iris.pivot_table(index=['Sepal_Length_cm', 'Sepal_Width_cm'], columns='Species', values='Petal_Length_cm', aggfunc='mean')


# e. Melt the pivoted DataFrame to create a long format DataFrame.

In [None]:
iris_melted = iris_pivot.reset_index().melt(id_vars=['Sepal_Length_cm', 'Sepal_Width_cm'], var_name='Species', value_name='Mean_Petal_Length')
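
One thing to keep in mind with the pivot and melt steps: the pivot leaves `NaN` wherever a combination of index and species was never observed, and melting carries those gaps into the long format. The sketch below shows this on a made-up three-row frame (the names `toy`, `length`, and `petal` and all values are assumptions for illustration) and drops the empty rows afterwards.

```python
import pandas as pd

# Hypothetical toy data (made-up values, not the real Iris measurements)
toy = pd.DataFrame({
    "length": [5.0, 5.0, 6.0],
    "species": ["setosa", "virginica", "virginica"],
    "petal": [1.4, 5.1, 5.5],
})

# Wide format: one column per species, NaN where a combination is absent
wide = toy.pivot_table(index="length", columns="species",
                       values="petal", aggfunc="mean")

# Long format: one row per (length, species) pair, including the NaN gaps
long = wide.reset_index().melt(id_vars="length", var_name="species",
                               value_name="mean_petal")

# Dropping NaN rows keeps only the combinations that were actually observed
long_clean = long.dropna(subset=["mean_petal"]).reset_index(drop=True)
print(long_clean)
```

Whether to drop the `NaN` rows depends on the analysis; keeping them makes the missing combinations explicit.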


# 3. Data Visualization (25 points)
# a. Create a histogram of sepal length.

In [None]:
plt.figure(figsize=(6,4))
plt.hist(iris['Sepal_Length_cm'], bins=10, color='skyblue', edgecolor='black')
plt.title('Histogram of Sepal Length')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig("histogram_sepal_length.png")
plt.close()

# b. Create a scatterplot of sepal length vs. sepal width, colored by species.

In [None]:
plt.figure(figsize=(6,4))
sns.scatterplot(data=iris, x='Sepal_Length_cm', y='Sepal_Width_cm', hue='Species', palette='viridis')
plt.title('Sepal Length vs Sepal Width by Species')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.tight_layout()
plt.savefig("scatter_sepal_length_width_by_species.png")
plt.close()

# c. Create a boxplot of petal length by species.

In [None]:
plt.figure(figsize=(6,4))
sns.boxplot(data=iris, x='Species', y='Petal_Length_cm', palette='pastel')
plt.title('Petal Length Distribution by Species')
plt.xlabel('Species')
plt.ylabel('Petal Length (cm)')
plt.tight_layout()
plt.savefig("boxplot_petal_length_by_species.png")
plt.close()

# d. Create a pairplot of the dataset, colored by species.

In [None]:
pairplot_fig = sns.pairplot(iris, hue='Species', diag_kind='hist', palette='Set2')
pairplot_fig.savefig("pairplot_iris.png")
plt.close()

# e. Create a heatmap of the correlation matrix.

In [None]:
corr = iris[['Sepal_Length_cm','Sepal_Width_cm','Petal_Length_cm','Petal_Width_cm','Sepal_Area_cm2','Petal_Area_cm2']].corr()
plt.figure(figsize=(6,4))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Iris Measurements')
plt.tight_layout()
plt.savefig("heatmap_correlation_matrix.png")
plt.close()

# 4. Data Aggregation (25 points)
# a. Calculate the total sepal area and petal area for each species.

In [None]:
species_area_sum = iris.groupby('Species')[['Sepal_Area_cm2','Petal_Area_cm2']].sum()


# b. Find the maximum and minimum values for each numeric column by species.

In [None]:
species_max = iris.groupby('Species').max(numeric_only=True)
species_min = iris.groupby('Species').min(numeric_only=True)
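
As an alternative sketch, both statistics can be computed in a single call with `agg`. The frame `toy` and its values below are made up for illustration and are not taken from the Iris data.

```python
import pandas as pd

# Hypothetical toy data (made-up values)
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 5.1, 5.9],
})

# One call returns both statistics; columns become (column, statistic) pairs
min_max = toy.groupby("species").agg(["min", "max"])
print(min_max)

# A single value is addressed by row label and a (column, statistic) tuple
setosa_max = min_max.loc["setosa", ("petal_length", "max")]
virginica_min = min_max.loc["virginica", ("petal_length", "min")]
```

Keeping two separate frames, as in the cell above, is simpler to read; the combined form is convenient when the minimum and maximum need to be compared side by side.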


# c. Calculate the mean and median values for each numeric column by species.

In [None]:
species_mean = iris.groupby('Species').mean(numeric_only=True)
species_median = iris.groupby('Species').median(numeric_only=True)


# d. Count the number of observations for each species.

In [None]:
species_count = iris['Species'].value_counts()


# e. Calculate the 25th, 50th, and 75th percentiles for each numeric column by species.


In [None]:
percentiles = iris.groupby('Species').quantile([0.25,0.50,0.75], numeric_only=True)
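
The result of a grouped `quantile` call with several quantiles is indexed by a two-level MultiIndex (group, quantile), which can be awkward to read. The sketch below, on a made-up frame with groups `a` and `b` (all names and values are assumptions), shows one way to look up a single value and to move the quantile level into columns with `unstack`.

```python
import pandas as pd

# Hypothetical toy data (made-up values)
toy = pd.DataFrame({
    "species": ["a"] * 5 + ["b"] * 5,
    "petal_length": [1.0, 2.0, 3.0, 4.0, 5.0,
                     10.0, 20.0, 30.0, 40.0, 50.0],
})

# Rows are indexed by (species, quantile)
q = toy.groupby("species").quantile([0.25, 0.50, 0.75])

# Look up one value with a (group, quantile) tuple
median_a = q.loc[("a", 0.50), "petal_length"]

# unstack moves the quantile level into columns: one row per species
q_wide = q["petal_length"].unstack()
print(q_wide)
```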
