# Self Practice 4 - Data Annotation
___

## Material

- Data Understanding (from [Self Practice 3](self_practice-3.ipynb))
- Data Cleansing (from [Self Practice 3](self_practice-3.ipynb))
- Data Annotation
___

## Import Library

In [1]:
import pandas as pd
from sklearn.datasets import load_iris

## Load Dataset

In [2]:
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

## Data Understanding

In [3]:
# Identify the number of records
print("Number of records:", df.shape[0])

Number of records: 150


In [4]:
# Identify data types
print("\nData types:")
print(df.dtypes)


Data types:
sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
target                 int32
dtype: object


In [5]:
# Identify missing values
print("\nNumber of missing values:")
print(df.isnull().sum())


Number of missing values:
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64
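
The Iris data has no gaps, so every count above is zero. As a minimal sketch of what the same check reports when values are missing, the hypothetical frame `demo` below (toy data, not part of Iris) contains a few `NaN` entries:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with deliberately missing entries (not Iris data)
demo = pd.DataFrame({
    "length": [5.1, np.nan, 4.7, 4.6],
    "width": [3.5, 3.0, np.nan, np.nan],
})

# Per-column count of missing values, the same call used on the Iris frame
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = demo[demo.isnull().any(axis=1)]
print(rows_with_missing)
```

Listing the affected rows as well as the counts helps decide whether to impute or drop them.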


In [6]:
# Identify outliers using IQR method
for column in df.columns[:-1]: # Exclude 'target'
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    print(f"\nOutliers in column {column}:")
    print(outliers)


Outliers in column sepal length (cm):
Empty DataFrame
Columns: [sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), target]
Index: []

Outliers in column sepal width (cm):
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
15                5.7               4.4                1.5               0.4   
32                5.2               4.1                1.5               0.1   
33                5.5               4.2                1.4               0.2   
60                5.0               2.0                3.5               1.0   

    target  
15       0  
32       0  
33       0  
60       1  

Outliers in column petal length (cm):
Empty DataFrame
Columns: [sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), target]
Index: []

Outliers in column petal width (cm):
Empty DataFrame
Columns: [sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), target]
Index: []
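
The IQR rule flags a value as an outlier when it falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. A minimal sketch of the same rule as a reusable helper follows; the name `iqr_bounds` and the toy series are assumptions made for illustration, not part of the original notebook:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Hypothetical toy data: Q1 = 2.0, Q3 = 4.0, so IQR = 2.0
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_bounds(values)
outliers = values[(values < lower) | (values > upper)]

print(lower, upper)       # -1.0 7.0
print(outliers.tolist())  # [100.0]
```

Keeping the fence calculation in one function avoids repeating it in both the detection and the clipping steps.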


In [7]:
# Description of initial statistics
print("\nDescription of statistics before cleaning:")
print(df.describe())


Description of statistics before cleaning:
       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         150.000000        150.000000         150.000000   
mean            5.843333          3.057333           3.758000   
std             0.828066          0.435866           1.765298   
min             4.300000          2.000000           1.000000   
25%             5.100000          2.800000           1.600000   
50%             5.800000          3.000000           4.350000   
75%             6.400000          3.300000           5.100000   
max             7.900000          4.400000           6.900000   

       petal width (cm)      target  
count        150.000000  150.000000  
mean           1.199333    1.000000  
std            0.762238    0.819232  
min            0.100000    0.000000  
25%            0.300000    0.000000  
50%            1.300000    1.000000  
75%            1.800000    2.000000  
max            2.500000    2.000000  


## Data Cleansing

In [8]:
## Data cleaning steps
# Fill in missing values (if any).
# The iris dataset has no missing values, so these lines leave the data unchanged.
# The result is assigned back to the column: calling fillna(..., inplace=True) on a
# column selection acts on a copy, raises a FutureWarning, and stops working in pandas 3.0.
df['sepal length (cm)'] = df['sepal length (cm)'].fillna(df['sepal length (cm)'].mean())
df['sepal width (cm)'] = df['sepal width (cm)'].fillna(df['sepal width (cm)'].median())
df['petal length (cm)'] = df['petal length (cm)'].fillna(df['petal length (cm)'].mode()[0])

# Delete rows with incorrect data (if any).
# The iris dataset has no obviously incorrect data; as an example, drop negative lengths.
df.drop(df[df['sepal length (cm)'] < 0].index, inplace=True)

# Correct outliers by clipping each feature to its IQR bounds
for column in df.columns[:-1]:  # Exclude 'target'
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[column] = df[column].clip(lower_bound, upper_bound)

# After data cleaning
print("\nStatistical description after cleaning:")
print(df.describe())


Statistical description after cleaning:
       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         150.000000         150.00000         150.000000   
mean            5.843333           3.05400           3.758000   
std             0.828066           0.42539           1.765298   
min             4.300000           2.05000           1.000000   
25%             5.100000           2.80000           1.600000   
50%             5.800000           3.00000           4.350000   
75%             6.400000           3.30000           5.100000   
max             7.900000           4.05000           6.900000   

       petal width (cm)      target  
count        150.000000  150.000000  
mean           1.199333    1.000000  
std            0.762238    0.819232  
min            0.100000    0.000000  
25%            0.300000    0.000000  
50%            1.300000    1.000000  
75%            1.800000    2.000000  
max            2.500000    2.000000  
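
Because Iris has no missing values, the imputation step has nothing to fill. The sketch below uses hypothetical toy data to show mean, median, and mode imputation filling real gaps, written with plain assignment (`df[col] = df[col].fillna(...)`), followed by clipping a series to fixed bounds. The frame `toy` and its column names are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column (not Iris data)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 2.0, 2.0],      # mean of observed values: 2.0
    "b": [10.0, 20.0, np.nan, 30.0, 40.0],  # median of observed values: 25.0
    "c": [7.0, 7.0, 8.0, np.nan, 9.0],      # mode of observed values: 7.0
})

# Assign the filled column back instead of using inplace=True on a selection
toy["a"] = toy["a"].fillna(toy["a"].mean())
toy["b"] = toy["b"].fillna(toy["b"].median())
toy["c"] = toy["c"].fillna(toy["c"].mode()[0])
print(toy)

# Clipping replaces values outside the bounds with the nearest bound
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
clipped = values.clip(-1.0, 7.0)
print(clipped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

Clipping keeps every row but caps extreme values, which is why the record count stays at 150 after cleaning while the minimum and maximum of sepal width change.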

## Data Annotation

In [9]:
## Identify the target data and perform labelling
# In the iris dataset, the 'target' column already holds the class label (0, 1, 2).
# If the SOP requires more specific labels,
# we can map the numeric codes to the required names.

## For example, if the SOP specifies that:
# - target 0: Setosa
# - target 1: Versicolor
# - target 2: Virginica
# We can create a label mapping like this:

label_mapping = {
    0: 'Setosa',
    1: 'Versicolor',
    2: 'Virginica'
}

# Create a new column 'target_label' containing the text label
df['target_label'] = df['target'].map(label_mapping)

# Show data with new label
print("\nData with label based on SOP:")
print(df.head())


Data with label based on SOP:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target target_label  
0       0       Setosa  
1       0       Setosa  
2       0       Setosa  
3       0       Setosa  
4       0       Setosa  
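
`Series.map` returns `NaN` for any value that has no entry in the mapping, so it is worth checking the annotated column before relying on it. The sketch below uses a hypothetical `target` series containing an unexpected code `3` to illustrate that check; the data is invented for the example and does not come from Iris:

```python
import pandas as pd

label_mapping = {0: "Setosa", 1: "Versicolor", 2: "Virginica"}

# Hypothetical codes; 3 has no entry in label_mapping
target = pd.Series([0, 1, 2, 3, 1])
labels = target.map(label_mapping)

# Codes that could not be labelled (map() produced NaN for them)
unmapped = target[labels.isna()].unique().tolist()
print("Unmapped codes:", unmapped)

# Number of rows per label; value_counts() skips NaN by default
counts = labels.value_counts()
print(counts)
```

On the real Iris frame the list of unmapped codes should be empty, since the mapping covers 0, 1, and 2.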
