# Introduction to Machine Learning with Python


## 1. Introduction to Machine Learning
Machine learning is about building systems that can learn from data. Common types of
learning include:
- **Supervised learning**: learning from labeled examples (regression and classification).
- **Unsupervised learning**: finding patterns in unlabeled data.
- **Reinforcement learning**: learning by interacting with an environment.

In this notebook we'll focus on supervised learning (both regression and classification)
and briefly look at unsupervised learning.

## 2. Setting Up the Environment
Below we import the libraries we'll use. If you don't have them installed, you can
install them using `pip`. NumPy provides efficient array operations, pandas offers
data manipulation tools, matplotlib is for plotting, and scikit-learn has many ML
algorithms ready to use.

In [None]:
# Install packages if needed (uncomment the following lines)
# !pip install numpy pandas matplotlib scikit-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, model_selection, metrics, linear_model, neighbors, tree, cluster

%matplotlib inline

A quick word on Jupyter Notebooks: each cell can contain code or Markdown. You can
execute a code cell with `Shift+Enter`. It's good practice to keep related code in
small cells and to restart the kernel occasionally to make sure your notebook runs
from a clean state.

## 3. Gentle Introduction to NumPy
NumPy is a foundational package for numerical computing in Python. The core object is the **array**, which behaves similarly to a list but stores elements of a single data type, typically in a contiguous block of memory. This lets operations on entire arrays run quickly in compiled code. We'll start by creating some arrays and exploring indexing, basic arithmetic, and the concept of *broadcasting*, which lets NumPy apply operations to arrays of different shapes.

In [None]:
# Creating arrays
a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6]])
print("a:", a)
print("b:\n", b)

# Indexing and slicing
print("a[0] =", a[0])
print("b[0, 1] =", b[0, 1])
print("b[:, 1] =", b[:, 1])

# Vectorized operations
print("a * 2 =", a * 2)
print("a + 3 =", a + 3)
print("a + a =", a + a)

# Broadcasting example
c = np.array([1, 2, 3])
d = np.array([[10], [20], [30]])
print("c + d =\n", c + d)

# Dot product
print("Dot product a·c =", np.dot(a, c))

The output above illustrates several important ideas:
- `a` is a one-dimensional array while `b` is two-dimensional. The shape of an array is accessed via the `shape` attribute.
- Indexing works with square brackets, e.g. `b[0, 1]` fetches the element in the first row and second column.
- Operations like `a * 2` or `a + 3` automatically apply element-wise to every entry.
- When arrays have compatible shapes, NumPy *broadcasts* them so the arithmetic still works—notice how `c + d` adds a 1-D array to a column vector.
- `np.dot` computes the dot product without an explicit Python loop.

Take a moment to execute the cell and observe how each print statement matches these explanations.
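
The `shape` attribute and the broadcasting rule mentioned above can be inspected directly. The following is a small sketch that reuses the arrays from the cell above; the variable names `total` and `compatible` are introduced here only for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6]])
c = np.array([1, 2, 3])
d = np.array([[10], [20], [30]])

# shape is a tuple with one entry per dimension
print(a.shape)  # (3,)
print(b.shape)  # (2, 3)
print(d.shape)  # (3, 1)

# Shapes (3,) and (3, 1) are compatible: both are stretched to (3, 3)
total = c + d
print(total.shape)               # (3, 3)
print(np.broadcast(c, d).shape)  # (3, 3), computed without doing the addition

# Shapes (3,) and (2,) are not compatible, so NumPy raises a ValueError
try:
    a + np.array([1, 2])
    compatible = True
except ValueError:
    compatible = False
print("compatible:", compatible)
```

Checking `shape` before combining arrays is often the quickest way to understand an unexpected broadcasting result.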

### Hands-on Challenge
Implement the dot product of two vectors using a for loop. Fill in the code below.

In [None]:
def manual_dot(x, y):
    """Compute dot product of two 1-D arrays x and y using a loop."""
    # TODO: replace with your implementation
    result = 0.0
    for i in range(len(x)):
        pass  # replace this line
    return result

manual_dot(np.array([1, 2, 3]), np.array([4, 5, 6]))

## 4. Linear Regression (Supervised Learning: Regression)
Linear regression fits a linear function to the data (a straight line when there is a single feature).
For an input vector $x$ we predict a scalar $y$ using $\hat{y} = w^Tx + b$.
The parameters $w$ and $b$ are chosen to minimize the mean squared error (MSE) between the predictions and the true targets.
We will generate a simple synthetic dataset and then optimize the parameters with **gradient descent**—an iterative algorithm that nudges the parameters in the direction that most reduces the error.

In [None]:
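np.random.seed(0)
# Synthetic data: y = 4 + 3x plus Gaussian noise
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100)
# Prepend a column of ones so theta[0] acts as the intercept b
X_b = np.c_[np.ones((X.shape[0], 1)), X]
eta = 0.1  # learning rate
n_iterations = 1000
theta = np.random.randn(2)  # random initial [intercept, slope]
mse_history = []
for iteration in range(n_iterations):
    # Gradient of the MSE with respect to theta
    gradients = 2/len(X_b) * X_b.T.dot(X_b.dot(theta) - y)
    theta -= eta * gradients
    mse_history.append(np.mean((X_b.dot(theta) - y)**2))
print("Parameters:", theta)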


In the loop above:
1. We compute the predictions `X_b.dot(theta)` using the current parameters.
2. The gradient of the MSE tells us how to change `theta` to reduce the error.
3. We subtract a small fraction (`eta`) of this gradient from `theta` each iteration.

Over many iterations the parameters converge and `mse_history` tracks the error at each step.
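
One way to build confidence in the gradient expression used in the loop is to compare it with a finite-difference approximation. The sketch below regenerates similar synthetic data with its own random generator; the helper names `mse`, `mse_gradient`, and `numerical_gradient` are hypothetical and not part of the notebook.

```python
import numpy as np

# Synthetic data in the same spirit as the cell above (separate generator)
rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)
X_b = np.c_[np.ones((X.shape[0], 1)), X]

def mse(theta):
    """Mean squared error of the linear model for parameters theta."""
    return np.mean((X_b.dot(theta) - y) ** 2)

def mse_gradient(theta):
    """Analytic gradient, the same expression as in the training loop."""
    return 2 / len(X_b) * X_b.T.dot(X_b.dot(theta) - y)

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

theta = np.array([0.5, -1.0])  # arbitrary test point
analytic = mse_gradient(theta)
numeric = numerical_gradient(mse, theta)
print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two vectors agree to several decimal places, the analytic gradient is very likely implemented correctly.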

In [None]:
plt.figure(figsize=(6,4))
plt.plot(mse_history)
plt.xlabel('Iteration')
plt.ylabel('MSE')
plt.title('Gradient Descent Convergence')
plt.show()

In [None]:
plt.figure(figsize=(6,4))
plt.scatter(X, y, label="Data")
plt.plot(X, X_b.dot(theta), color="red", label="Model")
plt.xlabel("x")
plt.ylabel("y")
plt.legend();

In [None]:
lin_reg = linear_model.LinearRegression()
lin_reg.fit(X, y)
print("scikit-learn parameters:", [lin_reg.intercept_, lin_reg.coef_[0]])

### Hands-on Challenge
Modify the learning rate `eta` above and re-run the gradient descent loop. Observe
how it affects convergence.

## 5. Logistic Regression (Supervised Learning: Classification)
While linear regression predicts a continuous value, logistic regression predicts the probability that an example belongs to class 1.
It uses the **sigmoid** function to squeeze the output of a linear model between 0 and 1:
$$\hat{y} = \sigma(w^Tx + b) = \frac{1}{1 + e^{-(w^Tx + b)}}.$$
We'll create a small synthetic dataset with two features and then train a logistic regression model with gradient descent, similar to how we optimized the linear regression model.

In [None]:
from sklearn.datasets import make_classification
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, random_state=0)
plt.figure(figsize=(6,4))
plt.scatter(X[:,0], X[:,1], c=y, cmap='bwr', edgecolor='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic classification dataset')
plt.show()
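# Prepend a column of ones so theta[0] acts as the intercept
X_b = np.c_[np.ones((X.shape[0], 1)), X]
def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1 / (1 + np.exp(-z))
eta = 0.1  # learning rate
n_iterations = 1000
theta = np.random.randn(X_b.shape[1])
for _ in range(n_iterations):
    scores = X_b.dot(theta)
    predictions = sigmoid(scores)
    # Gradient of the mean cross-entropy loss with respect to theta
    gradient = X_b.T.dot(predictions - y) / len(y)
    theta -= eta * gradient
print('Parameters:', theta)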


Here we repeatedly:
1. Compute the linear scores `X_b.dot(theta)` for each sample.
2. Apply the sigmoid function to obtain probabilities.
3. Update `theta` using the gradient of the cross-entropy loss.

After training we can classify new points by checking if the predicted probability is at least 0.5.
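
The training loop never evaluates the cross-entropy loss itself, only its gradient. As a rough sketch, the loss can be tracked alongside the updates to confirm that it decreases. The toy one-feature dataset and the `cross_entropy` helper below are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy(theta, X_b, y, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(sigmoid(X_b.dot(theta)), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data: one noisy feature, class 1 when it is mostly positive
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + 0.5 * rng.normal(size=200) > 0).astype(int)
X_b = np.c_[np.ones(len(x)), x]

theta = np.zeros(2)
eta = 0.1
losses = []
for _ in range(200):
    losses.append(cross_entropy(theta, X_b, y))
    gradient = X_b.T.dot(sigmoid(X_b.dot(theta)) - y) / len(y)
    theta -= eta * gradient

print("first loss:", losses[0])
print("last loss: ", losses[-1])
```

With `theta` initialized to zeros every predicted probability is 0.5, so the first loss equals $\ln 2$; later values should be smaller.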

In [None]:
probs = sigmoid(X_b.dot(theta))
y_pred = (probs >= 0.5).astype(int)
print("Accuracy:", (y_pred == y).mean())

In [None]:
x0_min, x0_max = X[:,0].min() - .5, X[:,0].max() + .5
x1_min, x1_max = X[:,1].min() - .5, X[:,1].max() + .5
xx0, xx1 = np.meshgrid(np.linspace(x0_min, x0_max, 200), np.linspace(x1_min, x1_max, 200))
X_grid = np.c_[np.ones((xx0.size, 1)), xx0.ravel(), xx1.ravel()]
probs = sigmoid(X_grid.dot(theta)).reshape(xx0.shape)
plt.figure(figsize=(6,4))
plt.contourf(xx0, xx1, probs, levels=[0,0.5,1], alpha=0.3)
plt.scatter(X[:,0], X[:,1], c=y, edgecolor="k")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Logistic Regression Decision Boundary");

In [None]:
log_reg = linear_model.LogisticRegression()
log_reg.fit(X, y)
print("scikit-learn accuracy:", log_reg.score(X, y))

### Digits Dataset Example
The `sklearn` digits dataset contains 8×8 images of handwritten digits. Each image is flattened into a 64-element vector.
We'll train a logistic regression classifier on this dataset and visualize a few of the digits.
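
To make the flattening step concrete, here is a minimal sketch with a stand-in 8×8 array (not a real digit). Row-major flattening places pixel `(r, c)` at index `r * 8 + c` of the 64-element vector.

```python
import numpy as np

# Stand-in for one 8x8 image: pixel values 0..63 in row-major order
image = np.arange(64).reshape(8, 8)

# Flatten to a 64-element feature vector, as done for each digit
flat = image.reshape(-1)
print(flat.shape)

# Pixel (row 2, column 5) ends up at index 2 * 8 + 5
r, c = 2, 5
print(image[r, c], flat[r * 8 + c])

# Reshaping back recovers the original image
restored = flat.reshape(8, 8)
```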

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
X_digits = digits.data
y_digits = digits.target
fig, axes = plt.subplots(1, 10, figsize=(10,2))
for i, ax in enumerate(axes):
    ax.imshow(digits.images[i], cmap='gray')
    ax.set_axis_off()
    ax.set_title(str(digits.target[i]))
plt.suptitle('Sample digits')
plt.show()

In [None]:
log_reg_digits = linear_model.LogisticRegression(max_iter=1000)
X_train_d, X_test_d, y_train_d, y_test_d = model_selection.train_test_split(X_digits, y_digits, test_size=0.2, random_state=0)
log_reg_digits.fit(X_train_d, y_train_d)
y_pred_d = log_reg_digits.predict(X_test_d)
print('Test accuracy:', metrics.accuracy_score(y_test_d, y_pred_d))
metrics.ConfusionMatrixDisplay.from_predictions(y_test_d, y_pred_d, cmap='Blues')
plt.show()

### Hands-on Challenge
Implement a function `predict` that returns class labels (0 or 1) using the learned
`theta` and test it on the data.

In [None]:
def predict(X_new):
    """Return 0 or 1 predictions using global theta."""
    # TODO: implement prediction using sigmoid
    pass

predict(X[:5])

## 6. k-Nearest Neighbors (Supervised Learning)
The k-NN algorithm stores the training data and classifies a new point by taking a
majority vote among the labels of its $k$ closest training examples. We'll implement a
simple version using Euclidean distance.

In [None]:
from collections import Counter
class KNNClassifier:
    def __init__(self, k=3):
        self.k = k
    def fit(self, X, y):
        self.X_train = X
        self.y_train = y
    def _distance(self, a, b):
        return np.linalg.norm(a - b)
    def predict(self, X):
        preds = []
        for x in X:
            distances = [self._distance(x, x_train) for x_train in self.X_train]
            k_idx = np.argsort(distances)[:self.k]
            k_votes = self.y_train[k_idx]
            preds.append(Counter(k_votes).most_common(1)[0][0])
        return np.array(preds)

iris = datasets.load_iris()
X, y = iris.data, iris.target
knn = KNNClassifier(k=5)
knn.fit(X, y)
y_pred = knn.predict(X)
print("Accuracy:", (y_pred == y).mean())

In [None]:
sk_knn = neighbors.KNeighborsClassifier(n_neighbors=5)
sk_knn.fit(X, y)
print("scikit-learn accuracy:", sk_knn.score(X, y))
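
The per-point loop in `KNNClassifier.predict` can also be written with broadcasting, which computes all Euclidean distances at once. The sketch below uses a tiny made-up dataset so the vote is easy to follow; the arrays and variable names are illustrative assumptions.

```python
import numpy as np
from collections import Counter

# Hypothetical toy data: two points near the origin (class 0), three near (1, 1) (class 1)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array([0, 0, 1, 1, 1])
X_new = np.array([[0.05, 0.1], [1.0, 0.9]])
k = 3

# diffs has shape (n_new, n_train, n_features) thanks to broadcasting
diffs = X_new[:, None, :] - X_train[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=2))  # shape (n_new, n_train)

# Indices of the k closest training points for each new point
nearest = np.argsort(distances, axis=1)[:, :k]

# Majority vote among the k neighbor labels
preds = np.array([Counter(y_train[idx]).most_common(1)[0][0] for idx in nearest])
print(preds)
```

The first query point has two class-0 neighbors and one class-1 neighbor, so it is labeled 0; the second has three class-1 neighbors.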

### Hands-on Challenge
Modify the `_distance` method to use Manhattan distance instead of Euclidean.

## 7. Decision Trees (Supervised Learning)
Decision trees split the data recursively based on feature values. We'll implement a
simple decision stump (a tree with one split) using Gini impurity.
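
For reference, the Gini impurity of a set $S$ with class proportions $p_c$, and the size-weighted score of a split of $m$ samples into a left part $L$ (size $m_L$) and a right part $R$ (size $m_R$), can be written as:

$$G(S) = 1 - \sum_{c} p_c^2, \qquad G_{\text{split}} = \frac{m_L}{m}\,G(L) + \frac{m_R}{m}\,G(R).$$

As a small worked example, the labels $[0, 0, 1, 1]$ give $G = 1 - (0.5^2 + 0.5^2) = 0.5$. Splitting them into $[0, 0]$ and $[1, 1]$ gives $G_{\text{split}} = \tfrac{2}{4}\cdot 0 + \tfrac{2}{4}\cdot 0 = 0$, a perfectly pure split.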

In [None]:
def gini_impurity(y):
    m = len(y)
    return 1 - sum((np.sum(y==c)/m)**2 for c in np.unique(y))

def best_split(X, y):
    m, n = X.shape
    best_feature, best_thresh, best_gini = None, None, 1
    for feature in range(n):
        thresholds = np.unique(X[:, feature])
        for t in thresholds:
            left = y[X[:, feature] <= t]
            right = y[X[:, feature] > t]
            gini = (len(left)*gini_impurity(left) + len(right)*gini_impurity(right)) / m
            if gini < best_gini:
                best_gini = gini
                best_feature = feature
                best_thresh = t
    return best_feature, best_thresh

feature, thresh = best_split(X, y)
print("Best feature:", feature, "Threshold:", thresh)

In [None]:
plt.figure(figsize=(6,4))
# Plot the split feature against a different feature so the two axes are not identical
other = 3 if feature == 2 else 2
for target in np.unique(y):
    subset = X[y==target]
    plt.scatter(subset[:, feature], subset[:, other], label=iris.target_names[target])
plt.axvline(thresh, color="red", linestyle="--", label="split")
plt.xlabel(iris.feature_names[feature])
plt.ylabel(iris.feature_names[other])
plt.legend();

In [None]:
sk_tree = tree.DecisionTreeClassifier(max_depth=1)
sk_tree.fit(X, y)
print("scikit-learn depth:", sk_tree.get_depth())

### Hands-on Challenge
Using the best feature and threshold found above, write code to classify new points
with this decision stump.

In [None]:
def stump_predict(X_new):
    """Return class 0/1/2 using the stump parameters."""
    # TODO: implement using global `feature` and `thresh`
    pass

stump_predict(X[:5])

## 8. Brief Introduction to Unsupervised Learning
Unsupervised learning deals with data without explicit labels. Two common tasks are
clustering (grouping similar points) and dimensionality reduction. We'll look at the
k-Means clustering algorithm.

In [None]:
np.random.seed(42)
X_blobs, y_blobs = datasets.make_blobs(n_samples=200, centers=3, random_state=42)
plt.scatter(X_blobs[:,0], X_blobs[:,1], c=y_blobs, cmap="viridis");

In [None]:
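def kmeans(X, k, n_iters=100):
    rng = np.random.default_rng(seed=0)
    # Initialize the centroids as k distinct random data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # Pairwise distances, shape (n_samples, k), via broadcasting
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Move each centroid to the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels==i].mean(axis=0) for i in range(k)])
        # Stop once the centroids no longer move
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, labels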

centroids, labels = kmeans(X_blobs, 3)

In [None]:
plt.scatter(X_blobs[:,0], X_blobs[:,1], c=labels, cmap="viridis", alpha=0.6)
plt.scatter(centroids[:,0], centroids[:,1], c="red", marker="x", s=100)
plt.title("k-Means Clustering");

In [None]:
sk_kmeans = cluster.KMeans(n_clusters=3, n_init=10)
sk_kmeans.fit(X_blobs)
print("scikit-learn inertia:", sk_kmeans.inertia_)
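
scikit-learn's `inertia_` is the sum of squared distances from each sample to its closest cluster center. The same quantity can be computed for any set of centroids, which makes it possible to compare a from-scratch result with scikit-learn's. The helper name `inertia` and the toy arrays below are assumptions for illustration.

```python
import numpy as np

def inertia(X, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
    return np.sum(distances.min(axis=1) ** 2)

# Toy check: every point lies exactly 1 unit from its nearest centroid
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids_toy = np.array([[0.0, 1.0], [10.0, 1.0]])
toy_inertia = inertia(X_toy, centroids_toy)
print(toy_inertia)  # 4.0
```

In the notebook, calling `inertia(X_blobs, centroids)` on the output of the `kmeans` function should give a value comparable to `sk_kmeans.inertia_` when both runs find the same clusters.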

### Hands-on Challenge
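The `kmeans` function above already stops once the centroids no longer change
(`np.allclose`). Extend it with an explicit tolerance parameter and stop when the
largest centroid shift between iterations falls below that tolerance.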

## 9. Model Evaluation and Cross-Validation
When building models, we typically split data into training and test sets (and
sometimes a validation set). scikit-learn provides `train_test_split` and utilities
for cross-validation.

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print("Cross-validation scores:", scores)
print("Mean CV score:", scores.mean())
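
To see what cross-validation does under the hood, the following sketch builds k-fold train/test index splits by hand. It is a simplified illustration: the generator `k_fold_indices` is a hypothetical helper, and it uses plain shuffled folds, whereas `cross_val_score` with an integer `cv` and a classifier uses stratified folds.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is in exactly one test fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

# 150 samples, the size of the iris dataset
splits = list(k_fold_indices(150, n_folds=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each model is trained on four folds and scored on the remaining one, so every sample contributes to exactly one test score.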

## 10. Wrap-Up and Next Steps
In this notebook you implemented linear and logistic regression, k-NN, a decision
stump, and k-Means from scratch, and you compared them with scikit-learn's
implementations. We also looked at model evaluation techniques.

To go further, consider exploring:
- The perceptron algorithm and how it leads to neural networks.
- Regularization techniques.
- More complex datasets and competitions such as Kaggle.

Machine learning is a vast field—keep experimenting and practicing!