# Assignment 3

You only need to write one line of code for each question. When answering questions that ask you to identify or interpret something, the length of your response doesn’t matter. For example, if the answer is just ‘yes,’ ‘no,’ or a number, you can just give that answer without adding anything else.

We will go through comparable code and concepts in the live learning session. If you run into trouble, start by using Python's `help()` function to get information about the datasets and functions in question. The internet is also a great resource when coding (though note that **no outside searches are required by the assignment!**). If you do incorporate code from the internet, please cite the source within your code (providing a URL is sufficient).

Please bring questions that you cannot work out on your own to office hours or work periods, or share them with your peers on Slack. We will work through the issue with you.

### Clustering and Resampling

Let's set up our workspace and use the **Iris dataset** from `scikit-learn`. It is a classic dataset in machine learning and statistics, widely used for clustering tasks, and consists of many samples of iris flowers. Each sample is described by the following features:

##### Features:
1. **Sepal Length**: The length of the sepal in centimeters.
2. **Sepal Width**: The width of the sepal in centimeters.
3. **Petal Length**: The length of the petal in centimeters.
4. **Petal Width**: The width of the petal in centimeters.

In [1]:
# Import standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans


#### **Question 1:** 
#### Data inspection

#### Load the Iris dataset:

Use scikit-learn to load the Iris dataset and convert it into a Pandas DataFrame.
Display the first few rows of the dataset. How many observations (rows) and features (columns) does the dataset contain?

In [None]:
# Import standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans


from sklearn.datasets import load_iris
# Load the Iris dataset
iris_data = load_iris()

# Convert to DataFrame
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Add the species labels (the Iris target) to the DataFrame
iris_df['species'] = iris_data.target


# Display the DataFrame
print(iris_df)

#Your code here ... 

print(iris_df.head())
print(iris_df.shape)
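
Beyond `head()` and `shape`, a slightly fuller inspection can also cover data types, missing values, and class balance. The following is a minimal sketch, not part of the required answer; the variable names (`df`, `missing`, `species_counts`) are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so this sketch is self-contained
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target

n_rows, n_cols = df.shape            # observations, columns (features + label)
dtypes = df.dtypes                   # data type of each column
missing = int(df.isna().sum().sum()) # total number of missing values

# Map the integer labels to species names and count samples per species
name_lookup = dict(enumerate(iris.target_names))
species_counts = df["species"].map(name_lookup).value_counts()

print(f"{n_rows} rows x {n_cols} columns")
print(dtypes)
print(f"Missing values: {missing}")
print(species_counts)
```

Checking `dtypes` and missing values up front helps confirm that every feature is numeric and complete before scaling or clustering.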

#### **Question 2:** 
#### Data-visualization

Let's create plots to visualize the relationships between the features (sepal length, sepal width, petal length, petal width).


In [None]:
def plot_feature_pairs(data, feature_names, color_labels=None, title_prefix=''):
    """
    Helper function to create scatter plots for all possible pairs of features.
    
    Parameters:
    - data: DataFrame containing the features to be plotted.
    - feature_names: List of feature names to be used in plotting.
    - color_labels: Optional. Cluster or class labels to color the scatter plots.
    - title_prefix: Optional. Prefix for plot titles to distinguish between different sets of plots.
    """
    # Create a figure for the scatter plots
    plt.figure(figsize=(12, 10))
    
    # Counter for subplot index
    plot_number = 1
    
    # Loop through each pair of features
    for i in range(len(feature_names)):
        for j in range(i + 1, len(feature_names)):
            plt.subplot(len(feature_names)-1, len(feature_names)-1, plot_number)
            
            # Scatter plot colored by labels if provided
            if color_labels is not None:
                plt.scatter(data[feature_names[i]], data[feature_names[j]], 
                            c=color_labels, cmap='viridis', alpha=0.7)
            else:
                plt.scatter(data[feature_names[i]], data[feature_names[j]], alpha=0.7)
            
            plt.xlabel(feature_names[i])
            plt.ylabel(feature_names[j])
            plt.title(f'{title_prefix}{feature_names[i]} vs {feature_names[j]}')
            
            # Increment the plot number
            plot_number += 1

    # Adjust layout to prevent overlap
    plt.tight_layout()

    # Show the plot
    plt.show()

# Get the four measurement feature names (excluding the 'species' label column)
feature_names = iris_data.feature_names

# Use the helper function to plot scatter plots without coloring by cluster labels
plot_feature_pairs(iris_df, feature_names, title_prefix='Original Data: ')

**Question:**
- Do you notice any patterns or relationships between the different features? How might these patterns help in distinguishing between different species?

> Your answer...

The scatter plots reveal noticeable patterns between certain features. For instance, petal length and petal width show a clear positive relationship, where data points form distinct clusters. This suggests these features could be useful in differentiating between groups. In contrast, sepal length and sepal width display weaker relationships, meaning they might not be as effective in distinguishing categories.

These patterns indicate that some features, such as the petal measurements, are more valuable than others when it comes to identifying groups or clusters. Understanding these relationships can help in feature selection and improve the effectiveness of clustering or classification algorithms.
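
The visual impression from the scatter plots can be cross-checked numerically. The sketch below computes pairwise Pearson correlations between the four measurements; it is an illustrative supplement, and the names `features`, `petal_corr`, and `sepal_corr` are assumptions introduced here.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pairwise Pearson correlation matrix (the pandas default method)
corr = features.corr()

# Pull out the two pairs discussed in the answer above
petal_corr = corr.loc["petal length (cm)", "petal width (cm)"]
sepal_corr = corr.loc["sepal length (cm)", "sepal width (cm)"]

print(corr.round(2))
print(f"petal length vs petal width: {petal_corr:.2f}")
print(f"sepal length vs sepal width: {sepal_corr:.2f}")
```

Note that correlation only measures linear association across the whole dataset; it complements the plots but does not by itself show how well the species separate.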

#### **Question 3:** 
#### Data cleaning

In [None]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Scale only the four measurement features (exclude the 'species' label column)
scaled_features = scaler.fit_transform(iris_df[iris_data.feature_names])

# Create a new DataFrame with scaled features
scaled_iris_df = pd.DataFrame(scaled_features, columns=iris_data.feature_names)

# Display the first few rows of the scaled DataFrame
print(scaled_iris_df.head())

Why is it important to standardize the features of a dataset before applying clustering algorithms like K-Means? Discuss the implications of using unstandardized data in your analysis.

> Your answer here ... 

Since clustering algorithms rely on distance calculations, features with larger scales can dominate the process if they are not standardized. This can result in biased clustering, where certain features disproportionately influence the outcome. Standardizing ensures that all features are on a similar scale, typically with a mean of 0 and a standard deviation of 1, so they contribute equally to the analysis. Without standardization, the algorithm might give undue weight to features with larger numerical ranges, skewing the clustering results.
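
To make the effect of scale concrete, the sketch below looks at how much each feature contributes to the total variance, which is what drives the squared Euclidean distances that K-Means minimizes. It rests on a hypothetical scenario that is not part of the assignment: sepal length is converted from centimetres to millimetres. The helper `variance_share` is likewise an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Hypothetical scenario: record sepal length (column 0) in mm instead of cm
X_mixed = X.copy()
X_mixed[:, 0] *= 10

def variance_share(data):
    """Fraction of the total variance contributed by each feature."""
    var = data.var(axis=0)
    return var / var.sum()

# Before scaling, the rescaled feature accounts for most of the variance
share_raw = variance_share(X_mixed)

# After standardization, every feature has unit variance and an equal share
X_scaled = StandardScaler().fit_transform(X_mixed)
share_scaled = variance_share(X_scaled)

print("Variance share before scaling:", share_raw.round(3))
print("Variance share after scaling: ", share_scaled.round(3))
```

A change of units alone is enough to let one feature dominate the distance calculation, which is why standardizing first is the usual default for K-Means.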

#### **Question 4:** 
#### K-means clustering 

Apply the K-Means clustering algorithm to the Iris dataset. Choose the value 3 for the number of clusters (`k=3`) and fit the model. Assign cluster labels to the original data and add them as a new column in the DataFrame.

In [None]:
# Import standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris



# Load the Iris dataset
iris_data = load_iris()

# Convert to DataFrame
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Initialize the StandardScaler
scaler = StandardScaler()

# Scale the features (excluding the species column)
scaled_features = scaler.fit_transform(iris_df)

# Create a new DataFrame with scaled features
scaled_iris_df = pd.DataFrame(scaled_features, columns=iris_data.feature_names)




# Your code here ...


# Apply K-Means with k=3 to the standardized features
kmeans = KMeans(n_clusters=3, random_state=42)
iris_df['Cluster'] = kmeans.fit_predict(scaled_iris_df)

# Create a copy for clustered data
clustered_iris_data = iris_df.copy()







# Use the helper function to plot scatter plots, colored by cluster labels
plot_feature_pairs(clustered_iris_data, iris_data.feature_names, color_labels=clustered_iris_data['Cluster'], title_prefix='Clustered Data: ')



We chose `k=3` for the number of clusters arbitrarily. However, in a real-world scenario, it is important to determine the optimal number of clusters using appropriate methods.
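
One commonly used way to choose k is the elbow method: fit K-Means for a range of k values, record the within-cluster sum of squares (exposed by scikit-learn as `inertia_`), and look for the k beyond which further increases bring only small improvements. The sketch below is a minimal illustration; the range 1 to 8, `n_init=10`, and the variable names are assumptions rather than assignment requirements.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Standardize the four features, as in Question 3
X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit K-Means for several values of k and record the inertia
# (sum of squared distances from each point to its cluster centre)
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against k and looking for the bend in the curve is the usual next step. The elbow is a heuristic, so it is often read alongside other criteria such as the silhouette score.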

**Question**: What is one method commonly used to determine the optimal number of clusters in K-means clustering, and why is this method helpful?

> Your answer here...

The results of the K-Means clustering on the Iris dataset can be evaluated by comparing the assigned clusters to the true species labels. Typically, one cluster contains all 50 samples of Iris setosa, indicating a perfect match (the numeric cluster labels themselves are arbitrary). The other two clusters often mix Iris versicolor and Iris virginica, revealing some ambiguity. Purity, the share of a cluster's samples that belong to its most common species, quantifies this: as a purely illustrative example, a cluster holding 30 Iris versicolor and 18 Iris virginica has a purity of 30/48 = 62.5%, while one holding 40 Iris virginica and 12 Iris versicolor has a purity of 40/52 ≈ 76.9%.

Visualization through scatter plots usually shows a clear separation for Iris setosa, but Iris versicolor and Iris virginica may overlap, indicating that the feature distributions do not fully distinguish between these two species. Overall, while K-Means clustering effectively identifies Iris setosa, it faces challenges in differentiating Iris versicolor from Iris virginica. Metrics like purity provide quantitative insights into clustering performance, highlighting the need for improved feature selection or the exploration of alternative clustering algorithms.
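
The purity comparison described above can be computed directly with a cross-tabulation of cluster labels against species. This is a hedged sketch: it assumes K-Means is fitted on the standardized features with `n_init=10`, and the names `table`, `cluster_purity`, and `overall_purity` are introduced here for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# Cluster the standardized features into three groups
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# Rows: cluster labels, columns: true species names
species = pd.Series(iris.target_names[iris.target], name="species")
table = pd.crosstab(pd.Series(labels, name="cluster"), species)

# Purity per cluster: share of its samples belonging to the most common species
cluster_purity = table.max(axis=1) / table.sum(axis=1)

# Overall purity: total count of majority-species samples over all samples
overall_purity = table.max(axis=1).sum() / len(labels)

print(table)
print(cluster_purity.round(3))
print(f"Overall purity: {overall_purity:.3f}")
```

Because purity relies on the true labels, it is an external evaluation measure; it is available here only because the Iris species are known.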



#### **Question 5:** 
#### Bootstrapping 

Implement bootstrapping on the mean of Petal Width. Generate 10000 bootstrap samples, calculate the mean for each sample, and compute a 90% confidence interval.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset
iris_data = load_iris()
iris_df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)

# Set the random seed for reproducibility
np.random.seed(123)

# Define the number of bootstrap samples
n_bootstrap_samples = 10000

# Create an array to store the means of each bootstrap sample
bootstrap_means = np.empty(n_bootstrap_samples)

# Perform bootstrapping on the mean of Petal Width
for i in range(n_bootstrap_samples):
    # Generate a bootstrap sample by randomly sampling with replacement
    bootstrap_sample = iris_df['petal width (cm)'].sample(frac=1, replace=True)
    
    # Calculate and store the mean of the bootstrap sample
    bootstrap_means[i] = bootstrap_sample.mean()

# Average the bootstrap means to estimate the mean petal width
mean_petal_width = np.mean(bootstrap_means)

# Calculate the 90% confidence interval (5th and 95th percentiles)
lower_bound = np.percentile(bootstrap_means, 5)
upper_bound = np.percentile(bootstrap_means, 95)


# Display the result
print(f"Mean of Petal Width: {mean_petal_width}")
print(f"90% Confidence Interval of Mean Petal Width: ({lower_bound}, {upper_bound})")

**Question:**
- Why do we use bootstrapping in this context? What does it help us understand about the mean?

>  Your answer...

Bootstrapping is used to estimate the sampling distribution of the mean by resampling the data with replacement. Since we don't know the true population distribution, bootstrapping helps approximate it by generating many new samples from the existing data. This gives a distribution of sample means, allowing us to assess the variability of the mean. The 90% confidence interval helps us understand the range in which the true mean is likely to fall, providing a better sense of the reliability of the estimate.
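
As a sanity check on the idea that the bootstrap approximates the sampling distribution of the mean, the spread of the bootstrap means can be compared with the textbook standard error, s / sqrt(n). The sketch below is a vectorized variant of the loop above, not the assignment's own code; it assumes NumPy's `default_rng` generator, and the seed and variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

petal_width = load_iris().data[:, 3]  # column 3 is petal width (cm)
n = petal_width.size

rng = np.random.default_rng(123)
n_boot = 10_000

# Draw all resamples at once: each row holds the indices of one
# bootstrap sample of size n, drawn with replacement
idx = rng.integers(0, n, size=(n_boot, n))
boot_means = petal_width[idx].mean(axis=1)

# Spread of the bootstrap means vs. the analytic standard error of the mean
boot_se = boot_means.std(ddof=1)
analytic_se = petal_width.std(ddof=1) / np.sqrt(n)

print(f"Bootstrap SE: {boot_se:.4f}")
print(f"Analytic SE:  {analytic_se:.4f}")
```

For the mean, the two estimates should agree closely. The bootstrap becomes more valuable for statistics such as the median, where no simple formula for the standard error is available.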

**Question:**
- What is the purpose of calculating the confidence interval from the bootstrap samples? How does it help us interpret the variability of the estimate?

> Your answer...

Calculating the confidence interval from bootstrap samples helps us understand the variability and reliability of our estimate—in this case, the mean of Petal Width. By resampling the data multiple times, we simulate different possible outcomes, allowing us to see how much the mean might fluctuate. The confidence interval provides a range where we expect the true population mean to fall with a certain level of confidence. It helps quantify the uncertainty around the estimate based on the data we have.
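
To see how the interval expresses uncertainty, it can help to compute percentile intervals at several confidence levels from the same bootstrap means. The helper `bootstrap_ci` below is a hypothetical function written for this illustration, and the levels 80%, 90%, and 99% are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris

def bootstrap_ci(values, confidence=0.90, n_boot=10_000, seed=123):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Each row of idx indexes one resample drawn with replacement
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    means = values[idx].mean(axis=1)
    alpha = 1.0 - confidence
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

petal_width = load_iris().data[:, 3]  # petal width (cm)

# Same seed for every level, so the intervals come from identical resamples
intervals = {level: bootstrap_ci(petal_width, confidence=level)
             for level in (0.80, 0.90, 0.99)}

for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} CI: ({lower:.3f}, {upper:.3f}), width = {upper - lower:.3f}")
```

Higher confidence levels yield wider intervals: asking for more certainty that the interval captures the true mean requires admitting a larger range of plausible values.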

**Question:**

- Reflect on the variability observed in the bootstrapped means and discuss whether the mean of the Petal Width appears to be a stable and reliable estimate based on the confidence interval and the spread of the bootstrapped means.

> Your answer here...

The variability in the bootstrapped means shows how much the estimate of Petal Width fluctuates across different samples. If the spread of the bootstrapped means is narrow and the confidence interval is small, the mean appears to be stable and reliable. A wider spread or larger confidence interval would suggest more uncertainty and less reliability in the estimate. In this case, the confidence interval helps confirm whether the mean of Petal Width is a consistent measure across various resampled datasets.

# Criteria


| **Criteria**                                           | **Complete**                                      | **Incomplete**                                    |
|--------------------------------------------------------|---------------------------------------------------|--------------------------------------------------|
| **Data Inspection**                                    | Data is thoroughly inspected for the number of variables, observations, and data types, and relevant insights are noted. | Data inspection is missing or lacks detail.         |
| **Data Visualization**                                 | Visualizations (e.g., scatter plots) are well-constructed and correctly interpreted to explore relationships between features and species. | Visualizations are poorly constructed or not correctly interpreted. |
| **Clustering Implementation**                           | K-Means clustering is correctly implemented, and cluster labels are appropriately assigned to the dataset.            | K-Means clustering is missing or incorrectly implemented. |
| **Bootstrapping Process**                              | Bootstrapping is correctly performed, and results are used to assess variable mean stability. | Bootstrapping is missing or incorrectly performed. |

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Note:

If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate whom you have worked with in your pull request by tagging their GitHub username. Separate submissions are required.

### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-3`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_3.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/applying_statistical_concepts/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Created a branch with the correct naming convention.
- [ ] Ensured that the repository is public.
- [ ] Reviewed the PR description guidelines and adhered to them.
- [ ] Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-4-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
