# Session 4: SciPy & Introduction to Scikit-learn

**Objective:** To explore the SciPy library for scientific computing and introduce Scikit-learn for foundational machine learning tasks like clustering.

## Part 1: Concepts (90 mins)

### 1. What is SciPy?
SciPy is a library that builds on NumPy and provides a large number of higher-level algorithms for scientific and technical computing. While NumPy's focus is on the `ndarray` object, SciPy provides modules for optimization, linear algebra, integration, signal processing, and, importantly for us, statistics.
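
To make the relationship concrete, here is a minimal sketch of two SciPy routines operating on NumPy inputs. The integrand, matrix, and vector are illustrative values chosen for this example, not data used elsewhere in the session.

```python
import numpy as np
from scipy import integrate, linalg

# Numerically integrate exp(-x^2) over the whole real line.
# The exact answer is sqrt(pi), which gives us something to compare against.
area, abs_error = integrate.quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print(f"Integral: {area:.6f} (exact: {np.sqrt(np.pi):.6f})")

# Solve the linear system A @ x = b; inputs and output are NumPy arrays.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print(f"Solution of A @ x = b: {x}")
```

NumPy supplies the array container and basic math, while SciPy supplies the higher-level algorithm in both cases.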

### 3. `scipy.stats`: Statistical Functions
This submodule contains a wide range of statistical functions and probability distributions. A common use case is hypothesis testing. For example, an independent two-sample t-test checks whether the means of two groups differ by more than random sampling variation would explain.

In [None]:
from scipy import stats
import numpy as np

# Sample data for two groups (e.g., test scores)
group_a_scores = np.random.normal(loc=85, scale=5, size=30)
group_b_scores = np.random.normal(loc=88, scale=5, size=30)

# Perform an independent T-test
t_statistic, p_value = stats.ttest_ind(group_a_scores, group_b_scores)

print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05: # A common significance level
    print("The difference between the groups is statistically significant.")
else:
    print("The difference between the groups is not statistically significant.")
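
To see what `stats.ttest_ind` computes with its default settings (equal variances assumed), the sketch below reproduces the statistic by hand using the pooled-variance formula and compares it with SciPy's result. The seeded generator and the names `sample_a` and `sample_b` are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
sample_a = rng.normal(loc=85, scale=5, size=30)
sample_b = rng.normal(loc=88, scale=5, size=30)

n_a, n_b = len(sample_a), len(sample_b)
dof = n_a + n_b - 2  # degrees of freedom for the pooled test

# Pooled variance: weighted average of the two sample variances (ddof=1)
pooled_var = (
    (n_a - 1) * sample_a.var(ddof=1) + (n_b - 1) * sample_b.var(ddof=1)
) / dof

# t = difference of means divided by its standard error
t_manual = (sample_a.mean() - sample_b.mean()) / np.sqrt(
    pooled_var * (1 / n_a + 1 / n_b)
)

# Two-sided p-value from the t distribution's survival function
p_manual = 2 * stats.t.sf(abs(t_manual), df=dof)

t_scipy, p_scipy = stats.ttest_ind(sample_a, sample_b)
print(f"Manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"SciPy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```

The two lines should agree, which confirms that the default test is the classic pooled-variance (Student's) t-test.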

### 4. `scipy.optimize`: Curve Fitting
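The `optimize` module provides functions for minimizing functions and finding their roots. A very common application is curve fitting, where we find the parameters of a model function that best fit a set of data points. `curve_fit` does this by non-linear least squares: it minimizes the sum of squared differences between the data and the model.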

In [None]:
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

# Define the function we want to fit (a quadratic function)
def func(x, a, b, c):
    return a * x**2 + b * x + c

# Generate some noisy data
x_data = np.linspace(0, 4, 50)
y_data = func(x_data, 2.5, 1.3, 0.5) + np.random.normal(0, 2.0, size=len(x_data))

# Fit the curve
params, covariance = curve_fit(func, x_data, y_data)
print(f"Fitted parameters (a, b, c): {params}")

# Plot the results
plt.scatter(x_data, y_data, label='Data')
plt.plot(x_data, func(x_data, *params), color='red', label='Fitted function')
plt.legend()
plt.show()
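
The second value returned by `curve_fit` is the estimated covariance matrix of the parameters. As a rough sketch, the square roots of its diagonal give one-standard-deviation uncertainties for each fitted parameter. The function name `quadratic` and the seeded generator are choices made for this example only.

```python
import numpy as np
from scipy.optimize import curve_fit

def quadratic(x, a, b, c):
    """Model to fit: a*x^2 + b*x + c."""
    return a * x**2 + b * x + c

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
x_data = np.linspace(0, 4, 50)
y_data = quadratic(x_data, 2.5, 1.3, 0.5) + rng.normal(0, 2.0, size=x_data.size)

params, covariance = curve_fit(quadratic, x_data, y_data)

# Diagonal of the covariance matrix holds the variance of each parameter
std_errors = np.sqrt(np.diag(covariance))

for name, value, err in zip("abc", params, std_errors):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```

With noise this large relative to the signal, the fitted values can sit noticeably away from the true `2.5, 1.3, 0.5`; the reported uncertainties indicate how much to trust each one.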

### 5. Introduction to Scikit-learn (SKLearn)
Scikit-learn is one of the most widely used libraries for classical machine learning in Python. It provides simple and efficient tools for data mining and data analysis. It's built on NumPy, SciPy, and Matplotlib.

**Key features of SKLearn:**
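- **Consistent API:** Models are Python classes that share a small set of methods. Every estimator has `fit()`; predictive models add `predict()`, and preprocessing steps (transformers) add `transform()`.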
- **Wide Range of Algorithms:** It covers classification, regression, clustering, dimensionality reduction, and more.
- **Excellent Documentation:** Its documentation is considered a gold standard.
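
The consistent API can be sketched with one transformer and one predictor. The tiny dataset below (where `y = 2 * x`) is made up for illustration, and `LinearRegression` is used only as a convenient example of a model with `predict()`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Illustrative data: one feature, target is exactly twice the feature
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Transformer: fit() learns the mean/std, transform() applies them
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Predictor: fit() learns the coefficients, predict() applies them
model = LinearRegression()
model.fit(X_scaled, y)

# New data must go through the SAME fitted scaler before predict()
X_new = np.array([[5.0]])
prediction = model.predict(scaler.transform(X_new))
print(prediction)
```

Because every class follows the same pattern, swapping in a different scaler or model usually means changing only the line that constructs it.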

### 6. Clustering with `sklearn.cluster`
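While SciPy has clustering algorithms (in `scipy.cluster`), SKLearn's implementations are more feature-rich and follow the consistent API. K-Means is a popular algorithm that partitions samples into *k* clusters by minimizing the sum of squared distances between each sample and the center of its assigned cluster. It works best when the clusters are roughly spherical and similar in spread.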

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import seaborn as sns

# Create some synthetic data with two distinct groups
group1 = np.random.randn(50, 2) + np.array([5, 5])
group2 = np.random.randn(50, 2) + np.array([0, 0])
data = np.vstack([group1, group2])

# Step 1: Standardize each feature to zero mean and unit variance
# (plays the same role as 'whitening' in SciPy's clustering module)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Step 2: Create and fit the K-Means model
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)  # explicit n_init avoids a FutureWarning in some scikit-learn versions
kmeans.fit(scaled_data)

# Step 3: Get the cluster assignments (labels)
cluster_ids = kmeans.labels_

# Visualize the results
sns.scatterplot(x=data[:, 0], y=data[:, 1], hue=cluster_ids, palette='viridis')
plt.title('K-Means Clustering Result with SKLearn')
plt.show()
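
Because the model was fitted on scaled data, its cluster centers are also in scaled units, and any new point must be scaled before it is assigned to a cluster. The following sketch shows both steps; the seeded generator and the two `new_points` are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
group1 = rng.normal(size=(50, 2)) + np.array([5, 5])
group2 = rng.normal(size=(50, 2)) + np.array([0, 0])
data = np.vstack([group1, group2])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(scaled_data)

# Centers are stored in scaled space; map them back to original units
centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
print("Cluster centers (original units):")
print(centers_original)

# Assign new observations: scale with the fitted scaler, then predict
new_points = np.array([[4.5, 5.2], [0.3, -0.1]])
new_labels = kmeans.predict(scaler.transform(new_points))
print("Labels for new points:", new_labels)
```

The recovered centers should land close to the offsets used to generate the two groups. The numeric label given to each cluster is arbitrary, so only the grouping itself is meaningful.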

### 7. Dataset Management with Scikit-learn
Often you need to practice with well-known datasets. The `scikit-learn` library provides easy access to many of them. These datasets are loaded as a special `Bunch` object, which contains the data, target, and descriptions all in one place.

In [None]:
from sklearn.datasets import load_iris
import pandas as pd

# Load the dataset
iris = load_iris()

# Explore the Bunch object
print(iris.keys())
print(iris.DESCR[:500] + "...") # Print the first 500 characters of the description

# A convenient next step is to convert it to a pandas DataFrame for analysis
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

print("\nIris dataset converted to a DataFrame:")
print(iris_df.head())
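
As an alternative to building the DataFrame by hand, the loader functions accept `as_frame=True`, in which case the returned `Bunch` includes a ready-made DataFrame in its `frame` attribute. The extra `species` column below is an illustrative addition, not part of the dataset.

```python
from sklearn.datasets import load_iris

# Ask the loader to return pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

# .frame holds the feature columns plus a 'target' column
iris_df = iris.frame
print(iris_df.shape)

# Map the integer target codes to readable species names
code_to_name = dict(enumerate(iris.target_names))
iris_df["species"] = iris_df["target"].map(code_to_name)
print(iris_df["species"].value_counts())
```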

## Part 2: Exercises & Debugging (90 mins)

### Lab 4.1: T-Test on Employee Salaries
* **Task:** Load the company dataset. Perform a T-test to see if there is a significant difference in salary between the 'Engineering' and 'Sales' departments.

In [None]:
import pandas as pd
from scipy import stats

# Load data
df_emp = pd.read_csv('employees.csv')
df_dept = pd.read_csv('departments.csv')
company_df = pd.merge(df_emp, df_dept, on='department_id')

# Isolate the salaries for the two departments
eng_salaries = company_df[company_df['department_name'] == 'Engineering']['salary']
sales_salaries = company_df[company_df['department_name'] == 'Sales']['salary']

# Perform T-test
t_stat, p_val = stats.ttest_ind(eng_salaries, sales_salaries)

print("T-test results for Engineering vs. Sales salaries:")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_val:.4f}")
if p_val < 0.05:
    print("Conclusion: There is a significant difference in salaries.")
else:
    print("Conclusion: There is no significant difference in salaries.")
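
The default `ttest_ind` assumes both groups have the same variance, which may not hold for salaries in different departments. Passing `equal_var=False` runs Welch's t-test, which drops that assumption. Since `employees.csv` is not available here, the sketch below uses made-up salary samples; the means, spreads, and group sizes are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable

# Hypothetical salary samples with different spreads and group sizes
eng_demo = rng.normal(loc=95_000, scale=20_000, size=40)
sales_demo = rng.normal(loc=80_000, scale=8_000, size=25)

# Student's t-test (pooled variance) vs. Welch's t-test (separate variances)
student_t, student_p = stats.ttest_ind(eng_demo, sales_demo)
welch_t, welch_p = stats.ttest_ind(eng_demo, sales_demo, equal_var=False)

print(f"Student: t={student_t:.4f}, p={student_p:.4f}")
print(f"Welch:   t={welch_t:.4f}, p={welch_p:.4f}")
```

When the two groups differ in both size and variance, the two tests can give noticeably different p-values, so it is worth checking which assumption fits the data.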

### Lab 4.2: Clustering Employees with SKLearn
* **Task:** Use `sklearn.cluster.KMeans` to segment employees into groups based on their `salary` and `tenure_years`. Try to find 3 distinct clusters and visualize the result. Remember to scale the data first!

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Prepare the data: select the two features
employee_features = company_df[['salary', 'tenure_years']].values

# Scale the data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(employee_features)

# Create and fit the K-Means model for 3 clusters
kmeans_model = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_ids = kmeans_model.fit_predict(scaled_features)

# Add the cluster ID back to the DataFrame for plotting
company_df['cluster'] = cluster_ids

# Visualize the clusters
plt.figure(figsize=(10, 7))
sns.scatterplot(data=company_df, x='tenure_years', y='salary', hue='cluster', palette='deep')
plt.title('Employee Segments by Salary and Tenure (k=3)')
plt.xlabel('Tenure (Years)')
plt.ylabel('Salary')
plt.show()

### Lab 4.3: Load and Explore the Wine Dataset
* **Task:** Use `sklearn.datasets.load_wine()` to load the wine dataset. Convert it into a pandas DataFrame and print the first 5 rows (`.head()`) and the dimensions of the DataFrame (`.shape`).

In [None]:
from sklearn.datasets import load_wine
import pandas as pd

# Load the dataset
wine = load_wine()

# Convert to DataFrame
wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
wine_df['target'] = wine.target

# Print head and shape
print("Wine Dataset Head:")
print(wine_df.head())
print("\nWine Dataset Shape:", wine_df.shape)

### Debugging Focus: Choosing 'k' in K-Means
- **The Problem:** The biggest challenge in K-Means is choosing the right number of clusters, `k`. A bad `k` can lead to meaningless clusters. In our lab, we chose `k=3` arbitrarily, but how could we do better?
- **The "Elbow Method":** A common heuristic is to run K-Means for a range of `k` values (e.g., 1 to 10) and plot the *inertia* for each `k`. Inertia is the sum of squared distances from each sample to its closest cluster center. SKLearn's `KMeans` model stores this value in the `inertia_` attribute after fitting. Inertia tends to keep falling as `k` grows, so its minimum is not informative on its own. Instead, the plot often looks like an arm, and the "elbow" (the point after which adding more clusters yields only small further decreases) is a good candidate for `k`.
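
A minimal sketch of the elbow method follows. It uses synthetic data with three well-separated groups as a stand-in for the employee features, so the group centers and sizes are assumptions for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

# Hypothetical stand-in data: three blobs of 50 points each
blob_centers = np.array([[0, 0], [5, 5], [0, 5]])
features = np.vstack(
    [rng.normal(loc=center, scale=1.0, size=(50, 2)) for center in blob_centers]
)
scaled = StandardScaler().fit_transform(features)

# Fit K-Means for k = 1..10 and record the inertia each time
k_values = list(range(1, 11))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(scaled)
    inertias.append(model.inertia_)

# Plot inertia against k and look for the bend in the curve
plt.plot(k_values, inertias, marker="o")
plt.title("Elbow Method: Inertia vs. k")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.xticks(k_values)
plt.show()
```

For data like this, the curve typically drops steeply up to `k=3` and flattens afterwards. On real data the elbow is often less clear, so treat it as a guide rather than a definitive answer.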