### Task 1: Introduction to Isolation Forest
**Description**: Install the necessary library and load a sample dataset.

**Steps**:
1. Install scikit-learn
2. Load a sample dataset using Python
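
Before running the cell below, it can help to confirm that the required packages are importable. This is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical. Note that scikit-learn is installed as `scikit-learn` but imported as `sklearn`.

```python
import importlib.util


def missing_packages(import_names):
    """Return the names from import_names that cannot be found for import."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


# Import names (not pip names) of the packages this notebook relies on.
required = ["sklearn", "pandas", "numpy", "matplotlib"]
missing = missing_packages(required)

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, install it with pip from a terminal before continuing.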

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt

print("### Task 1: Introduction to Isolation Forest ###")
print("\n" + "="*50 + "\n")

# Step 1: Install scikit-learn (done via pip in your environment, not from this cell)
# You would typically run: pip install scikit-learn pandas matplotlib
print("Step 1: To install scikit-learn, run 'pip install scikit-learn' in your terminal.")
print("Ensure pandas and matplotlib are also installed: 'pip install pandas matplotlib'")

# Step 2: Load a sample dataset using Python
# load_boston was removed in scikit-learn 1.2, so the Boston data is read from the
# original source instead (this requires network access).
# Note: the scikit-learn maintainers discourage this dataset because of ethical concerns.
# For new projects, consider other datasets like California Housing or synthetic data.
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two physical lines: 11 values, then 2 more features plus the target.
X = pd.DataFrame(np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]]),
                 columns=feature_names)
y = raw_df.values[1::2, 2]  # Target (house prices); not used for anomaly detection here

print("\nStep 2: Boston Housing Dataset Loaded.")
print(f"Dataset shape: {X.shape}")
print("First 5 rows of the dataset:")
print(X.head())
print("\n" + "="*50 + "\n")


print("### Task 2: Building an Isolation Forest ###")
print("\n" + "="*50 + "\n")

# Step 1: Initialize Isolation Forest
# n_estimators: The number of base estimators (trees) in the ensemble.
# contamination: The expected proportion of outliers in the dataset. It is used when
#   fitting the model to set the threshold on the decision function, so it directly
#   controls how many samples are flagged.
# random_state: Fixes the random seed for reproducibility.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
print("Isolation Forest model initialized with n_estimators=100 and contamination=0.01.")

# Step 2: Fit model
# The fit method builds the ensemble of random trees on the data. Samples that can be
# isolated in few random splits are treated as more anomalous.
# No labels are required: anomaly detection here is unsupervised.
model.fit(X)
print("Isolation Forest model fitted to the dataset.")
print("\n" + "="*50 + "\n")


print("### Task 3: Detecting Anomalies ###")
print("\n" + "="*50 + "\n")

# Step 1: Predict anomalies
# The predict method returns -1 for outliers and 1 for inliers.
# The decision_function returns a shifted anomaly score for each sample.
# Negative scores correspond to outliers, and lower scores indicate more anomalous samples.
predictions = model.predict(X)
anomaly_scores = model.decision_function(X)

# Add predictions and anomaly scores to the DataFrame for easier analysis
X['anomaly'] = predictions
X['anomaly_score'] = anomaly_scores

print("Anomaly predictions and scores generated.")

# Step 2: Display anomaly counts
anomaly_count = X[X['anomaly'] == -1].shape[0]
inlier_count = X[X['anomaly'] == 1].shape[0]

print(f"Total samples: {X.shape[0]}")
print(f"Number of anomalies detected (-1): {anomaly_count}")
print(f"Number of inliers detected (1): {inlier_count}")
print(f"Proportion of anomalies: {anomaly_count / X.shape[0]:.4f}")

print("\nFirst 10 rows with anomaly predictions and scores:")
print(X[['anomaly', 'anomaly_score']].head(10))
print("\n" + "="*50 + "\n")


print("### Task 4: Visualizing Anomalies ###")
print("\n" + "="*50 + "\n")

# Step 1: Plot a scatter plot
# For visualization, we'll pick two features. 'LSTAT' (percentage of lower status population)
# and 'RM' (average number of rooms per dwelling) are often good for showing patterns.
# We'll color-code points based on their anomaly status.

plt.figure(figsize=(10, 7))
# Plot inliers (anomaly == 1) in blue
plt.scatter(X[X['anomaly'] == 1]['LSTAT'], X[X['anomaly'] == 1]['RM'],
            c='blue', label='Inliers (Normal)', alpha=0.6, s=50)
# Plot anomalies (anomaly == -1) in red
plt.scatter(X[X['anomaly'] == -1]['LSTAT'], X[X['anomaly'] == -1]['RM'],
            c='red', label='Anomalies (Outliers)', alpha=0.8, s=100, marker='X')

plt.xlabel('LSTAT (% lower status of the population)')
plt.ylabel('RM (average number of rooms per dwelling)')
plt.title('Anomaly Detection using Isolation Forest (LSTAT vs RM)')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

print("Scatter plot generated, showing inliers (blue circles) and anomalies (red X marks).")
print("Anomalies are points that are easy to isolate in the full feature space, so in this two-feature view they may not all look isolated.")
print("\n" + "="*50 + "\n")


print("### Task 5: Interpret Contamination Parameter ###")
print("\n" + "="*50 + "\n")

print("The 'contamination' parameter in Isolation Forest is the expected proportion of outliers in the dataset.")
print("It directly influences the threshold used to classify samples as anomalies.")
print("A higher contamination value means the model expects more outliers and will classify more points as anomalies.")
print("A lower contamination value means the model expects fewer outliers and will classify fewer points as anomalies.")
print("Let's experiment with different contamination levels:")

contamination_levels = [0.005, 0.01, 0.05, 0.1] # Try different percentages of expected anomalies

for contam in contamination_levels:
    print(f"\n--- Experimenting with contamination = {contam} ---")
    model_exp = IsolationForest(n_estimators=100, contamination=contam, random_state=42)
    features = X.drop(columns=['anomaly', 'anomaly_score'], errors='ignore') # Original features only
    model_exp.fit(features) # Refit on original data
    predictions_exp = model_exp.predict(features)

    anomaly_count_exp = np.sum(predictions_exp == -1)
    print(f"  Number of anomalies detected: {anomaly_count_exp}")
    print(f"  Proportion of anomalies: {anomaly_count_exp / X.shape[0]:.4f}")

    # Visualize for a couple of contamination levels to see the difference
    if contam in [0.005, 0.05]:
        X_exp = X.copy()
        X_exp['anomaly'] = predictions_exp

        plt.figure(figsize=(10, 7))
        plt.scatter(X_exp[X_exp['anomaly'] == 1]['LSTAT'], X_exp[X_exp['anomaly'] == 1]['RM'],
                    c='blue', label='Inliers (Normal)', alpha=0.6, s=50)
        plt.scatter(X_exp[X_exp['anomaly'] == -1]['LSTAT'], X_exp[X_exp['anomaly'] == -1]['RM'],
                    c='red', label=f'Anomalies (Contamination={contam})', alpha=0.8, s=100, marker='X')
        plt.xlabel('LSTAT (% lower status of the population)')
        plt.ylabel('RM (average number of rooms per dwelling)')
        plt.title(f'Anomaly Detection with Contamination = {contam}')
        plt.legend()
        plt.grid(True, linestyle='--', alpha=0.7)
        plt.show()
        print(f"  Plot generated for contamination = {contam}")

print("\nObservation:")
print("As the 'contamination' parameter increases, the number of samples classified as anomalies also increases.")
print("This parameter sets the score threshold, effectively controlling how many of the 'most anomalous' points are flagged as outliers.")
print("Choosing the right 'contamination' value often requires domain knowledge or iterative experimentation and validation.")
# write your code from here


### Task 2: Building an Isolation Forest
**Description**: Initialize an Isolation Forest model and fit it to the Boston dataset.

**Steps**:
1. Initialize Isolation Forest
2. Fit model
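
A minimal sketch of these two steps on synthetic data, so that it runs without downloading anything. The data, the variable names, and the parameter values are illustrative assumptions, not the Boston setup used in this notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))  # synthetic stand-in: 300 samples, 4 features

# Step 1: initialize. The contamination value is a guess at the outlier fraction.
demo_model = IsolationForest(n_estimators=50, contamination=0.02, random_state=0)

# Step 2: fit. No labels are passed because the model is unsupervised.
demo_model.fit(X_demo)

print("Trees built:", len(demo_model.estimators_))
print("Samples drawn per tree:", demo_model.max_samples_)
```

With the default `max_samples="auto"`, each tree is built on at most 256 samples, which is why fitting stays cheap on larger datasets.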

In [None]:
# write your code from here

### Task 3: Detecting Anomalies
**Description**: Use the fitted Isolation Forest model to predict anomalies.

**Steps**:
1. Predict anomalies
2. Display anomaly counts
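
The sketch below illustrates both steps on synthetic data with a few hand-placed distant points. The data and names are assumptions made for illustration. It also checks how `predict` relates to `decision_function`: a sample is labelled -1 when its decision score is negative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
normal_points = rng.normal(size=(200, 2))
# Hand-placed points far from the bulk of the data (illustrative only).
far_points = np.array([[8.0, 8.0], [-8.0, 7.0], [9.0, -8.0],
                       [-9.0, -9.0], [0.0, 10.0]])
X_demo = np.vstack([normal_points, far_points])

demo_model = IsolationForest(n_estimators=100, contamination=0.05,
                             random_state=1).fit(X_demo)

# Step 1: predict anomalies.
labels = demo_model.predict(X_demo)            # -1 = outlier, 1 = inlier
scores = demo_model.decision_function(X_demo)  # negative = outlier

# Step 2: display anomaly counts.
n_outliers = int(np.sum(labels == -1))
n_inliers = int(np.sum(labels == 1))
print(f"Outliers: {n_outliers}, inliers: {n_inliers}")

# The labels should agree with the sign of the decision score.
print("Labels agree with score sign:",
      np.array_equal(labels == -1, scores < 0))
```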

In [None]:
# write your code from here

### Task 4: Visualizing Anomalies
**Description**: Visualize the results to see which samples are considered anomalies.

**Steps**:
1. Plot a scatter plot
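
One way to sketch the plot is a small helper that draws inliers and anomalies with different markers. The helper name `plot_anomalies` is hypothetical, and the labels below are a stand-in rule (distance from the origin), not the output of a fitted model.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_anomalies(x, y, labels, xlabel="feature 1", ylabel="feature 2"):
    """Scatter x against y, drawing inliers (1) and anomalies (-1) differently."""
    x, y, labels = np.asarray(x), np.asarray(y), np.asarray(labels)
    is_anomaly = labels == -1
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(x[~is_anomaly], y[~is_anomaly],
               c="blue", alpha=0.6, s=40, label="Inliers")
    ax.scatter(x[is_anomaly], y[is_anomaly],
               c="red", marker="X", s=90, label="Anomalies")
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.legend()
    return fig, ax


# Stand-in data and labels (NOT model output): distant points are marked -1.
rng = np.random.default_rng(2)
points = rng.normal(size=(150, 2))
fake_labels = np.where(np.linalg.norm(points, axis=1) > 2.5, -1, 1)

fig, ax = plot_anomalies(points[:, 0], points[:, 1], fake_labels)
```

In the notebook, the same helper could be called with two feature columns and the labels returned by `predict`.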

In [None]:
# write your code from here

### Task 5: Interpret Contamination Parameter
**Description**: Experiment with different contamination levels.
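
The sketch below refits the model at several contamination levels on synthetic data and prints the fitted `offset_` next to the matching percentile of the training scores. It assumes that scikit-learn derives `offset_` from that percentile when `contamination` is a float; the data and variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(400, 3))  # synthetic stand-in data

counts = []
for contamination in [0.01, 0.05, 0.1]:
    m = IsolationForest(n_estimators=100, contamination=contamination,
                        random_state=3).fit(X_demo)
    raw_scores = m.score_samples(X_demo)  # unshifted scores; lower = more anomalous
    # Assumption: offset_ matches this percentile of the training scores.
    expected_offset = np.percentile(raw_scores, 100 * contamination)
    flagged = int(np.sum(m.predict(X_demo) == -1))
    counts.append(flagged)
    print(f"contamination={contamination}: offset_={m.offset_:.4f}, "
          f"percentile={expected_offset:.4f}, flagged={flagged}")
```

Because the random seed and data are the same in every iteration, the trees and raw scores do not change; only the threshold moves, so the number of flagged samples grows with contamination.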

In [None]:
# write your code from here