# <h3 align="center">__Module 3 Activity__</h3>
# <h3 align="center">__Assigned at the start of Module 3__</h3>
# <h3 align="center">__Due at the end of Module 3__</h3><br>



# Weekly Discussion Forum Participation

Each week, you are required to participate in the module’s discussion forum. The discussion forum is built around the week's Module Activity, which is released at the beginning of the module. You must complete (or at least attempt) the activity before posting about it or about anything related to the topic.

## Grading of the Discussion

### 1. Initial Post:
Create your thread by **Day 5 (Saturday night at midnight, PST).**

### 2. Responses:
Respond to at least two other posts by **Day 7 (Monday night at midnight, PST).**

---

## Grading Criteria:

Your participation will be graded as follows:

### Full Credit (100 points):
- Submit your initial post by **Day 5.**
- Respond to at least two other posts by **Day 7.**

### Half Credit (50 points):
- If your initial post is late but you respond to two other posts.
- If your initial post is on time but you fail to respond to at least two other posts.

### No Credit (0 points):
- If both your initial post and responses are late.
- If you fail to submit an initial post and do not respond to any others.

---

## Additional Notes:

- **Late Initial Posts:** Late posts will automatically receive half credit if two responses are completed on time.
- **Substance Matters:** Responses must be thoughtful and constructive. Comments like “Great post!” or “I agree!” without further explanation will not earn credit.
- **Balance Participation:** Aim to engage with threads that have fewer or no responses to ensure a balanced discussion.

---

## Avoid:
- A number of posts within a very short time-frame, especially immediately prior to the posting deadline.
- Posts that compliment another post and then merely summarize it.


# Module Activity: Building a Preprocessing Pipeline

## Objective
Learn how to build a preprocessing pipeline in scikit-learn and apply it to the famous Iris dataset. Gain hands-on experience in handling missing values, scaling features, and understanding the importance of preprocessing pipelines.

---

## Sample Code for Pipeline Syntax
Here’s an example to help you understand how to create a pipeline. This pipeline imputes missing values using the mean:


In [22]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Example dataset with missing values
data = pd.DataFrame({
    'Feature1': [1.0, np.nan, 3.0],
    'Feature2': [np.nan, 2.0, 3.0]
})

# Define a pipeline with an imputer
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# Fit and transform the data
processed_data = pipeline.fit_transform(data)

print("mean", data.mean())
print("Original Data:")
print(data)
print("\nProcessed Data:")
print(processed_data)

mean Feature1    2.0
Feature2    2.5
dtype: float64
Original Data:
   Feature1  Feature2
0       1.0       NaN
1       NaN       2.0
2       3.0       3.0

Processed Data:
[[1.  2.5]
 [2.  2. ]
 [3.  3. ]]
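As a small follow-up sketch, the example below shows that a fitted pipeline remembers what it learned. It rebuilds the same toy data, inspects the means stored by the imputer, and then fills gaps in some new rows with those stored means. The `new_rows` frame is made up for illustration and is not part of the activity.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Same toy data as in the sample cell
train = pd.DataFrame({
    'Feature1': [1.0, np.nan, 3.0],
    'Feature2': [np.nan, 2.0, 3.0],
})

pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean'))])
pipeline.fit(train)

# The fitted imputer keeps one learned mean per column (2.0 and 2.5 here)
learned_means = pipeline.named_steps['imputer'].statistics_
print("Learned means:", learned_means)

# Hypothetical new rows: missing cells are filled with the means learned
# from `train`, not with means recomputed from the new rows
new_rows = pd.DataFrame({'Feature1': [np.nan, 10.0], 'Feature2': [7.0, np.nan]})
filled = pipeline.transform(new_rows)
print(filled)
```

Calling `transform` rather than `fit_transform` on the new rows is what keeps the learned statistics fixed.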


# Activity Instructions

## Dataset Preparation
We will use the Iris dataset, randomly remove values to simulate missing data, and keep it in a Pandas DataFrame for you to preprocess.

---

## Your Task
Build a preprocessing pipeline that:
- Imputes missing values using the median.
- Scales features to a `[0, 1]` range using `MinMaxScaler`.
- Includes at least one additional preprocessing step of your choice.

### Reflection
At the end of the activity, answer the following questions:
1. What challenges did you face while handling missing data?
2. Why is it important to use a pipeline for preprocessing?

---

## Dataset Setup
Run the following code to import the Iris dataset and simulate missing data. You will use this dataset for the activity.




In [28]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
import sklearn.preprocessing as ppr

# Load the Iris dataset
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)

# Randomly introduce missing values in random cells
np.random.seed(42)
total_cells = data.size
num_missing = int(0.1 * total_cells)  # 10% of total cells
missing_indices = [(row, col) for row in range(data.shape[0]) for col in range(data.shape[1])]
random_missing_indices = np.random.choice(len(missing_indices), size=num_missing, replace=False)

for index in random_missing_indices:
    row, col = missing_indices[index]
    data.iat[row, col] = np.nan

print("Dataset with Missing Values:")
print(data.head(10))

Dataset with Missing Values:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                NaN               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                NaN               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
5                5.4               3.9                1.7               0.4
6                NaN               3.4                1.4               0.3
7                5.0               NaN                NaN               0.2
8                4.4               2.9                1.4               0.2
9                4.9               3.1                1.5               0.1
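Before building the pipeline, it can help to confirm how much data is actually missing. The sketch below is one possible way to do that. It repeats the masking step in a vectorized form (same seed and same random call as the setup cell, so it should blank out the same cells, though that is an assumption worth checking) and then counts the missing values per column. The names `mask` and `data_missing` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pick 10% of all cells by flat index, then convert to (row, col) positions
np.random.seed(42)
num_missing = int(0.1 * data.size)
flat_idx = np.random.choice(data.size, size=num_missing, replace=False)
rows, cols = np.unravel_index(flat_idx, data.shape)

# Boolean mask marking the cells to blank out
mask = np.zeros(data.shape, dtype=bool)
mask[rows, cols] = True
data_missing = data.mask(mask)

# Count missing values per column and overall
missing_per_column = data_missing.isna().sum()
print(missing_per_column)
print("Total missing:", int(missing_per_column.sum()))
```

With 150 rows and 4 columns there are 600 cells, so 60 of them should be missing in total.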


## Next Steps

1. **Build your pipeline** to preprocess the dataset.
2. **Test your pipeline** by fitting it to the Iris dataset and transforming it.
3. **Review the processed data** and reflect on how the pipeline simplifies your workflow.


In [58]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Create the imputer that fills missing values with each column's median
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')

# Run the data through the imputer on its own to populate the missing values
processed_data = median_imputer.fit_transform(data)
scaler = ppr.MinMaxScaler((0,1))
scaled_data = scaler.fit_transform(processed_data)

scaled_data_frame = pd.DataFrame(scaled_data, columns=iris.feature_names)
# print(scaled_data_frame.head(10))

"""
Now that we have seen how the data should behave in a non-pipeline environment,
let's build a pipeline to standardize the process and make it reusable.
"""


"""
I asked Claude to help me figure out what another good preprocessing step would be,
and it suggested capping the outliers. This function caps the outliers using the IQR method,
which uses the 25th and 75th percentiles to calculate the interquartile range (IQR)
and then caps values at 1.5 * IQR below the 25th percentile and 1.5 * IQR above the 75th percentile.
"""
def cap_outliers_iqr(X):
    """Cap outliers using IQR method"""
    X_copy = X.copy()
    Q1 = np.percentile(X_copy, 25, axis=0)
    Q3 = np.percentile(X_copy, 75, axis=0)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Cap outliers
    X_copy = np.clip(X_copy, lower_bound, upper_bound)
    return X_copy

outlier_capper = FunctionTransformer(cap_outliers_iqr)

# Chain the steps in order: impute, cap outliers, then scale to [0, 1]
pipeline = Pipeline([
    ('imputer', median_imputer),
    ('outlier_capper', outlier_capper),
    ('scaler', ppr.MinMaxScaler((0,1)))
])

processed_data = pipeline.fit_transform(data)
processed_data_frame = pd.DataFrame(processed_data, columns=iris.feature_names)
print("Pipeline processed data")
print(processed_data_frame.head(15))

Pipeline processed data
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0            0.222222          0.743421           0.576271          0.041667
1            0.166667          0.480263           0.067797          0.041667
2            0.111111          0.585526           0.576271          0.041667
3            0.083333          0.532895           0.084746          0.041667
4            0.194444          0.796053           0.067797          0.041667
5            0.305556          0.953947           0.118644          0.125000
6            0.416667          0.690789           0.067797          0.083333
7            0.194444          0.480263           0.576271          0.041667
8            0.027778          0.427632           0.067797          0.041667
9            0.166667          0.532895           0.084746          0.000000
10           0.305556          0.848684           0.084746          0.041667
11           0.138889          0.690789           0.1
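One design point about the outlier step: `FunctionTransformer` is stateless, so `cap_outliers_iqr` recomputes the percentiles from whatever data it is given, including new data passed to `transform`. The sketch below shows one possible alternative, a small custom transformer that learns the bounds in `fit` and reuses them afterwards. The class name `IQRCapper` and its `factor` parameter are hypothetical and not part of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class IQRCapper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: learn IQR bounds in fit, clip in transform."""

    def __init__(self, factor=1.5):
        self.factor = factor

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        q1 = np.percentile(X, 25, axis=0)
        q3 = np.percentile(X, 75, axis=0)
        iqr = q3 - q1
        # Per-column bounds learned from the data seen during fit
        self.lower_ = q1 - self.factor * iqr
        self.upper_ = q3 + self.factor * iqr
        return self

    def transform(self, X):
        # Clip to the stored bounds instead of recomputing them
        return np.clip(np.asarray(X, dtype=float), self.lower_, self.upper_)


# Toy training column: Q1 = 2, Q3 = 4, so the bounds are -1 and 7
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
capper = IQRCapper().fit(train)

# New values are clipped to the bounds learned from `train`
capped = capper.transform(np.array([[100.0], [3.0], [-50.0]]))
print(capper.lower_, capper.upper_)
print(capped)
```

An instance of such a class could take the place of the `FunctionTransformer` step in the pipeline, since it exposes the same `fit` and `transform` methods.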

# Reflection on the benefits of pipelines

One of the main benefits of using a pipeline is that it encapsulates every preprocessing step in a single, reusable workflow, so the data always receives the same sequence of transformations and I avoid the subtle mistakes that can arise from re-implementing the steps by hand. At the start of my script I had manually chained the operations (imputing missing values, then scaling them), which was error-prone and hard to reproduce. By switching to scikit-learn’s `Pipeline` class, I can bundle all of those steps (e.g. `SimpleImputer`, `OneHotEncoder`, `StandardScaler`) into one object. This simplifies my code, since I call `pipeline.fit_transform()` once instead of repeating each function, and it keeps my preprocessing logic in one place. It also means that training, validation, and test sets are handled identically, provided the pipeline is fit on the training data and the other sets only pass through `pipeline.transform()`.
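To illustrate the point about handling training and test sets identically, here is a minimal sketch under a few assumptions: the split sizes, the seeds, and the roughly 10% missing rate are chosen only for this example. The pipeline is fit on the training rows and then applied unchanged to the test rows.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Load Iris as a DataFrame and blank out roughly 10% of the cells
X = load_iris(as_frame=True).data
rng = np.random.default_rng(0)
X = X.mask(rng.random(X.shape) < 0.1)

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler()),
])

# Medians and min/max are learned from the training rows only
X_train_prepared = pipeline.fit_transform(X_train)
# The test rows reuse those learned values
X_test_prepared = pipeline.transform(X_test)

print("Train range:", X_train_prepared.min(axis=0), X_train_prepared.max(axis=0))
print("Test range: ", X_test_prepared.min(axis=0), X_test_prepared.max(axis=0))
```

The training columns span exactly `[0, 1]`, while the test columns may fall slightly outside that range because the scaler uses the training minimum and maximum.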