**DAAPP Architecture for the Iris Dataset**
Here, we developed a data acquisition and preprocessing pipeline for the Iris dataset. The pipeline has four main stages: Data Acquisition, Feature Extraction, Data Transformation, and Data Loading.

1. **Data Acquisition:**
The Iris dataset was loaded using the sklearn.datasets module. It contains measurements for three species of iris: setosa, versicolor, and virginica. Each sample has four features: sepal length, sepal width, petal length, and petal width. The data were stored in a pandas DataFrame for further analysis.

2. **Feature Extraction:**
From the dataset, we selected three columns: sepal length, sepal width, and the species label. This step simplifies the dataset for analysis by keeping only the required features together with the target.

3. **Data Transformation:**
To put the numerical features on a common scale, we rescaled the sepal length and sepal width features to the range [0, 1] using Min-Max scaling. This keeps the two features comparable, so that neither dominates subsequent analysis or machine learning algorithms merely because of its original range.

4. **Data Loading:**
The transformed data were written to the iris table of an SQLite database, iris_data.db.

**Testing:**
To validate the pipeline, a unit test checks that the feature extraction function returns only the selected columns. In addition, an integration test runs the full pipeline, from loading the data to storing it in the database.

**Summary of Results:**
The pipeline successfully processes the Iris dataset by:

1. Loading the data into a DataFrame.
2. Extracting and transforming the features needed for further analysis.
3. Writing the normalized data to an SQLite database.

This process produced a well-structured iris_data.db database that can be used for further analysis or modeling.

In [2]:
!pip install pandas scikit-learn sqlalchemy



**Acquiring Data**

In [13]:
from sklearn.datasets import load_iris
import pandas as pd

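def load_iris_data():
    """Load Iris as a DataFrame with a human-readable 'species' column."""
    iris = load_iris(as_frame=True)
    # Copy so that adding a column does not modify the object returned by scikit-learn
    data = iris.data.copy()
    # Map the integer targets (0, 1, 2) to their species names
    data['species'] = iris.target.map(dict(enumerate(iris.target_names)))
    return data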

# sample use
iris_data = load_iris_data()
print(iris_data.head())


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

  species  
0  setosa  
1  setosa  
2  setosa  
3  setosa  
4  setosa  


**Feature Extraction**

In [14]:
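def extract_features(df):
    """Return only the two sepal measurements and the species label."""
    selected_columns = ['sepal length (cm)', 'sepal width (cm)', 'species']
    # Copy so that later edits to the result never touch the original DataFrame
    return df[selected_columns].copy()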

# Sample use
extracted_data = extract_features(iris_data)
print(extracted_data.head())

   sepal length (cm)  sepal width (cm) species
0                5.1               3.5  setosa
1                4.9               3.0  setosa
2                4.7               3.2  setosa
3                4.6               3.1  setosa
4                5.0               3.6  setosa


**Data Transformation**

In [15]:
from sklearn.preprocessing import MinMaxScaler

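def transform_data(df):
    """Rescale every float column to [0, 1] and keep 'species' unchanged."""
    features = df.select_dtypes(include=['float64'])
    scaler = MinMaxScaler()
    scaled_features = scaler.fit_transform(features)
    # Reuse the input index so that 'species' aligns row by row, even when
    # df does not have the default 0..n-1 index
    scaled_df = pd.DataFrame(scaled_features, columns=features.columns, index=df.index)
    scaled_df['species'] = df['species']
    return scaled_df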

# sample use
transformed_data = transform_data(extracted_data)
print(transformed_data.head())

   sepal length (cm)  sepal width (cm) species
0           0.222222          0.625000  setosa
1           0.166667          0.416667  setosa
2           0.111111          0.500000  setosa
3           0.083333          0.458333  setosa
4           0.194444          0.666667  setosa
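The scaled values above can be checked by hand. Min-Max scaling maps each value x to (x - min) / (max - min), where min and max are taken over the whole column. The following is a minimal pure-Python sketch of that formula, not the scikit-learn implementation; the helper name `min_max_scale` is hypothetical. It relies on the standard Iris ranges: sepal length spans 4.3 to 7.9 cm and sepal width spans 2.0 to 4.4 cm.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid dividing by zero (a choice made for this sketch)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Column minimum, the first row of the dataset, and the column maximum
sepal_lengths = [4.3, 5.1, 7.9]
sepal_widths = [2.0, 3.5, 4.4]

scaled_lengths = min_max_scale(sepal_lengths)
scaled_widths = min_max_scale(sepal_widths)

# First row: (5.1 - 4.3) / (7.9 - 4.3) and (3.5 - 2.0) / (4.4 - 2.0)
print(round(scaled_lengths[1], 6), round(scaled_widths[1], 6))
```

The two printed numbers correspond to row 0 of the transformed output, 0.222222 and 0.625.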


**Loading Data to Database**

In [16]:
from sqlalchemy import create_engine

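def load_to_database(df, db_name="iris_data.db", table_name="iris"):
    """Write df to an SQLite table, replacing the table if it already exists."""
    engine = create_engine(f"sqlite:///{db_name}")
    # index=False keeps the DataFrame index out of the table
    df.to_sql(table_name, engine, if_exists="replace", index=False)
    engine.dispose()  # release the connection pool once the write is done
    print(f"Data loaded into table '{table_name}' in database '{db_name}'.")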

# Sample use
load_to_database(transformed_data)


Data loaded into table 'iris' in database 'iris_data.db'.
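One way to confirm what ended up in the database is to read the table's row count and column names back. The sketch below uses Python's built-in sqlite3 module instead of SQLAlchemy, and the helper `summarize_table` is a hypothetical name introduced for illustration. To stay self-contained, it runs against a throwaway in-memory database holding two rows copied from the transformed output above.

```python
import sqlite3

def summarize_table(conn, table_name="iris"):
    """Return (row_count, column_names) for a table in an SQLite database."""
    # Table names cannot be bound as SQL parameters, so confirm the table
    # exists before interpolating its name into a query
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    if exists is None:
        raise ValueError(f"No table named {table_name!r}")
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table_name}"').fetchone()[0]
    # PRAGMA table_info yields one row per column; index 1 holds the column name
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table_name}")')]
    return row_count, columns

# Demonstration on an in-memory database with the same column layout
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE iris ("sepal length (cm)" REAL, "sepal width (cm)" REAL, species TEXT)'
)
conn.executemany(
    "INSERT INTO iris VALUES (?, ?, ?)",
    [(0.222222, 0.625, "setosa"), (0.166667, 0.416667, "setosa")],
)
demo_summary = summarize_table(conn)
conn.close()
print(demo_summary)
```

To inspect the file written by the notebook, the same helper could be called with `sqlite3.connect("iris_data.db")`.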


**Unit Test for Feature Extraction**

In [17]:
def test_extract_features():
    # Two-row sample containing all five columns of the full dataset
    sample_data = pd.DataFrame({
        'sepal length (cm)': [5.1, 4.9],
        'sepal width (cm)': [3.5, 3.0],
        'petal length (cm)': [1.4, 1.4],
        'petal width (cm)': [0.2, 0.2],
        'species': ['setosa', 'setosa']
    })
    output = extract_features(sample_data)
    # Only the selected columns should remain, in order, with no rows dropped
    assert list(output.columns) == ['sepal length (cm)', 'sepal width (cm)', 'species']
    assert len(output) == len(sample_data)
    print("Feature extraction test passed!")

test_extract_features()


Feature extraction test passed!


**Integration Test for the Pipeline**

In [18]:
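def test_pipeline():
    # Run all four stages end to end
    data = load_iris_data()
    features = extract_features(data)
    transformed_data = transform_data(features)
    load_to_database(transformed_data)
    # Read the table back and check that every row and column was stored
    stored = pd.read_sql_table("iris", "sqlite:///iris_data.db")
    assert stored.shape == transformed_data.shape
    print("Pipeline test passed!")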

test_pipeline()


Data loaded into table 'iris' in database 'iris_data.db'.
Pipeline test passed!
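The integration test writes to iris_data.db in the working directory, so each run replaces the real table. A possible alternative is to point the test at a temporary file that is deleted afterwards. The sketch below illustrates this with the standard-library sqlite3 and tempfile modules instead of SQLAlchemy. The helper `round_trip_rows`, the file name iris_test.db, and the simplified column names are assumptions made for this example.

```python
import os
import sqlite3
import tempfile

def round_trip_rows(rows):
    """Write rows to a temporary SQLite file, read them back, and return them."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        db_path = os.path.join(tmp_dir, "iris_test.db")
        conn = sqlite3.connect(db_path)
        try:
            conn.execute(
                "CREATE TABLE iris (sepal_length REAL, sepal_width REAL, species TEXT)"
            )
            conn.executemany("INSERT INTO iris VALUES (?, ?, ?)", rows)
            conn.commit()
            stored = conn.execute(
                "SELECT sepal_length, sepal_width, species FROM iris ORDER BY rowid"
            ).fetchall()
        finally:
            # Close before the directory is removed so the file is not locked
            conn.close()
    return stored

sample_rows = [(0.25, 0.625, "setosa"), (0.5, 0.5, "versicolor")]
stored_rows = round_trip_rows(sample_rows)
print(stored_rows == sample_rows)
```

Because the database lives in a temporary directory, repeated test runs leave no files behind and never touch the data produced by the pipeline itself.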
