# Let's group IRIS data!
In this notebook we analyze the Iris dataset again, this time using the grouping functions of pandas.

## Exercise 1 - Load the dataset
Download the dataset and read it directly into a DataFrame with the `pd.read_csv` function, using this URL:
"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"


In [4]:
import pandas as pd

# Download the dataset and read it directly with pandas into a dataframe
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "label"]

df = pd.read_csv(url, header=None) # no header in the file
df.columns = column_names

labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
df['label'] = df['label'].map(lambda x: labels.index(x)) # it is not mandatory but it is useful for later

print("Dataframe: ")
print(df.head(10)) # print only the first 10 rows


Dataframe: 
   sepal_length  sepal_width  petal_length  petal_width  label
0           5.1          3.5           1.4          0.2      0
1           4.9          3.0           1.4          0.2      0
2           4.7          3.2           1.3          0.2      0
3           4.6          3.1           1.5          0.2      0
4           5.0          3.6           1.4          0.2      0
5           5.4          3.9           1.7          0.4      0
6           4.6          3.4           1.4          0.3      0
7           5.0          3.4           1.5          0.2      0
8           4.4          2.9           1.4          0.2      0
9           4.9          3.1           1.5          0.1      0


In [3]:
df

     sepal_length  sepal_width  petal_length  petal_width  label
0             5.1          3.5           1.4          0.2      0
1             4.9          3.0           1.4          0.2      0
2             4.7          3.2           1.3          0.2      0
3             4.6          3.1           1.5          0.2      0
4             5.0          3.6           1.4          0.2      0
..            ...          ...           ...          ...    ...
145           6.7          3.0           5.2          2.3      2
146           6.3          2.5           5.0          1.9      2
147           6.5          3.0           5.2          2.0      2
148           6.2          3.4           5.4          2.3      2
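
The loading step can also be tried without network access. The following is a minimal sketch, not the notebook's own solution: the six rows in `raw` are made-up values in the same headerless layout as `iris.data`, and `toy` and `label_to_id` are names introduced here for illustration. It passes the column names through the `names=` argument of `pd.read_csv` and replaces the lambda with a dictionary lookup.

```python
import io
import pandas as pd

# Made-up rows in the same headerless layout as iris.data
# (illustrative values, not copied from the real file).
raw = """5.0,3.4,1.5,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
6.0,2.8,4.5,1.4,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.5,3.0,5.5,2.0,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
"""

column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "label"]

# names= assigns the column names while parsing, so the first line
# of the file is treated as data rather than as a header.
toy = pd.read_csv(io.StringIO(raw), names=column_names)

# A dictionary lookup maps each class name to an integer id.
label_to_id = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
toy["label"] = toy["label"].map(label_to_id)

print(toy)
```

With a dictionary, a class name that is missing from the mapping becomes `NaN` instead of raising an error, which may or may not be what you want.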


## Exercise 2 - Compute the mean and std
Compute the mean and standard deviation of each feature for each class by grouping the samples on the class label

In [6]:
# Compute the mean and the standard deviation of each feature, grouped by class
means = df.groupby('label').mean()
std = df.groupby('label').std()

print("Means: ")
print(means)

print("Standard deviations: ")
print(std)

# Both statistics in one call: the result has one (feature, statistic) column pair per feature
means_and_std = df.groupby("label").agg(["mean", "std"])
print(means_and_std)


Means: 
       sepal_length  sepal_width  petal_length  petal_width
label                                                      
0             5.006        3.418         1.464        0.244
1             5.936        2.770         4.260        1.326
2             6.588        2.974         5.552        2.026
Standard deviations: 
       sepal_length  sepal_width  petal_length  petal_width
label                                                      
0          0.352490     0.381024      0.173511     0.107210
1          0.516171     0.313798      0.469911     0.197753
2          0.635880     0.322497      0.551895     0.274650
      sepal_length           sepal_width           petal_length            \
              mean       std        mean       std         mean       std   
label                                                                       
0            5.006  0.352490       3.418  0.381024        1.464  0.173511   
1            5.936  0.516171       2.770  0.313798        4.26
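
As a complement to the `agg(["mean", "std"])` call, here is a small sketch of named aggregation, which produces flat column names instead of a two-level header. The `toy` frame and the output column names are invented for illustration. The sketch also shows that the pandas `std` is the sample standard deviation (`ddof=1`) by default, while `ddof=0` gives the population version.

```python
import pandas as pd

# Tiny made-up dataset: one feature, two classes
toy = pd.DataFrame({
    "petal_length": [1.0, 2.0, 3.0, 5.0],
    "label": [0, 0, 1, 1],
})

# Named aggregation: output_column=(input_column, function)
stats = toy.groupby("label").agg(
    petal_length_mean=("petal_length", "mean"),
    petal_length_std=("petal_length", "std"),  # sample std, ddof=1 (pandas default)
)
print(stats)

# Population standard deviation (ddof=0) for comparison
pop_std = toy.groupby("label")["petal_length"].std(ddof=0)
print(pop_std)
```

The two standard deviations differ noticeably on such small groups; with 50 samples per class, as in Iris, the gap is much smaller.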

## $(\star)$ Exercise 3 - Predict the class of a sample by computing the distance from the mean of each class
- Create a function that computes the Euclidean distance of each sample from the mean of each class
- Then create a new column in the dataset called "predicted_label" and assign to each sample the class whose mean is closest
- Then compute the accuracy of the prediction, i.e. the fraction of samples whose predicted label equals the true label

**Note:** you must not iterate over the rows of the dataframe
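
For reference, one way to write the rule down, assuming the Euclidean distance (which is what `np.linalg.norm` computes in the solution). The symbols are notation introduced here, not part of the original exercise: $x_i$ is the feature vector of sample $i$, $y_i$ its true label, $\mu_c$ the mean feature vector of class $c$, and $N$ the number of samples.

$$
\hat{y}_i = \arg\min_{c \in \{0, 1, 2\}} \lVert x_i - \mu_c \rVert_2,
\qquad
\lVert x_i - \mu_c \rVert_2 = \sqrt{\sum_{j=1}^{4} \left(x_{i,j} - \mu_{c,j}\right)^2},
\qquad
\text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right]
$$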


In [4]:
import numpy as np

# Compute the distance of each sample from the mean of the three classes
# and assign the class with the minimum distance
def predict_class(samples, means):
    distances = np.zeros((len(samples), len(means))) # one column of distances per class
    for i, label in enumerate(means.index): # loop over the classes, not over the samples
        mean = means.loc[label] # select the mean of the corresponding label
        distances[:, i] = np.linalg.norm(samples - mean, axis=1) # for all samples compute the Euclidean distance from the mean
    return means.index.to_numpy()[np.argmin(distances, axis=1)] # for all samples return the label of the class having the lowest distance


# Predict the class of each sample
df_features = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']] # select the features only
predicted_labels = predict_class(df_features, means)
df['predicted_label'] = predicted_labels

print("Predicted labels: ")
print(df['predicted_label'])


# Compute the accuracy of the prediction
accuracy = np.sum(df['label'] == df['predicted_label']) / len(df)
print("Accuracy: ", accuracy)


Predicted labels: 
0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: predicted_label, Length: 150, dtype: int64
Accuracy:  0.9266666666666666
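
The loop over the three classes can also be removed with NumPy broadcasting. The sketch below is one possible variant, not the notebook's solution. It runs on a small made-up dataset with hypothetical feature names `f1` and `f2`, and it adds a per-class accuracy computed with a groupby.

```python
import numpy as np
import pandas as pd

# Made-up data: two features, two classes
toy = pd.DataFrame({
    "f1": [0.0, 0.2, 1.0, 1.1, 0.5],
    "f2": [0.0, 0.1, 1.0, 0.9, 0.0],
    "label": [0, 0, 1, 1, 1],
})
features = ["f1", "f2"]
class_means = toy.groupby("label")[features].mean()

X = toy[features].to_numpy()   # shape (n_samples, n_features)
M = class_means.to_numpy()     # shape (n_classes, n_features)

# Broadcasting: (n_samples, 1, n_features) - (1, n_classes, n_features)
# gives every sample-to-mean difference in one array.
distances = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)

# Map the column index of the closest mean back to its class label
toy["predicted_label"] = class_means.index.to_numpy()[distances.argmin(axis=1)]

# Overall accuracy, then accuracy within each true class
correct = toy["label"] == toy["predicted_label"]
accuracy = correct.mean()
per_class = correct.groupby(toy["label"]).mean()

print("Accuracy:", accuracy)
print(per_class)
```

On this toy data the last sample lies closer to the mean of class 0 than to the mean of its own class, so it is misclassified and the per-class breakdown shows where the error comes from.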
