# Basic Machine Learning Tutorials

This notebook walks through simple scikit-learn examples step by step.

Run the cell below if you need to install the required libraries. In Google Colab they come pre-installed.


In [None]:
!pip install scikit-learn matplotlib

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression, load_iris, load_digits
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


## 0. Linear regression on synthetic data

Linear regression assumes a linear relationship between the feature $x$ and target $y$: 

$$y = wx + b + \epsilon.$$ 

We will generate a noisy dataset and recover the parameters using least squares.
The parameters are estimated by minimizing the sum of squared errors, which has the same minimizer as the mean squared error:

$$\min_{w,b}\sum_i (y_i - w x_i - b)^2.$$
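
For a single feature, the least-squares problem above has a well-known closed-form solution. The following is a minimal sketch on hypothetical toy data (the arrays `x_toy` and `y_toy` are made up for illustration and are not part of the notebook's dataset); it is not how `LinearRegression` is implemented internally.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 2x + 1,
# so the fit should recover w = 2 and b = 1.
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 2.0 * x_toy + 1.0

# Closed-form least-squares estimates for one feature:
#   w = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   b = mean(y) - w * mean(x)
x_mean, y_mean = x_toy.mean(), y_toy.mean()
w = np.sum((x_toy - x_mean) * (y_toy - y_mean)) / np.sum((x_toy - x_mean) ** 2)
b = y_mean - w * x_mean

print('w:', w)
print('b:', b)
```

With noisy data the estimates would only approximate the true parameters, which is what the cells below illustrate.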

In [None]:
# coef=True also returns the true coefficient used to generate the data
X, y, coef = make_regression(n_samples=100, n_features=1, noise=10.0, coef=True, random_state=42)
print('X shape:', X.shape)
print('y shape:', y.shape)


In [None]:
plt.scatter(X, y, color='blue', label='Data')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
model = LinearRegression()
model.fit(X, y)
print('True coefficient:', coef)
print('Learned coefficient:', model.coef_[0])
print('Intercept:', model.intercept_)


In [None]:
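# Evaluate the fitted line on an evenly spaced grid; scikit-learn expects a 2-D array of shape (n_samples, 1)
x_grid = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)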
y_pred = model.predict(x_grid)
print('Prediction shape:', y_pred.shape)
plt.scatter(X, y, color='blue', label='Data')
plt.plot(x_grid, y_pred, color='red', label='Fit')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.tight_layout()
plt.show()


## 1. Logistic regression on Iris

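Logistic regression models class probabilities. In the binary case it uses the sigmoid function: $$P(y=1|x)=\sigma(w^Tx+b).$$ Iris has three classes, so recent versions of scikit-learn fit the multiclass (softmax) generalization of this model by default. We will train a classifier on the Iris dataset and check its accuracy.
Training minimizes the cross-entropy between predicted probabilities and true labels (plus, by default, an L2 regularization penalty).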
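
To make the binary formula and the cross-entropy loss concrete, here is a small sketch in plain Python. The weight `w`, bias `b`, inputs, and labels are hypothetical values chosen for illustration; this is not scikit-learn's implementation.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean cross-entropy between 0/1 labels and predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical one-feature model and toy data
w, b = 1.5, -0.5
xs = [-2.0, 0.0, 2.0]
labels = [0, 0, 1]

probs = [sigmoid(w * x + b) for x in xs]
loss = binary_cross_entropy(labels, probs)
print('Probabilities:', probs)
print('Cross-entropy:', loss)
```

A lower cross-entropy means the predicted probabilities agree better with the labels; a model that always predicts 0.5 would score $\ln 2 \approx 0.693$.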

In [None]:
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print('Train shape:', X_train.shape, y_train.shape)
print('Test shape:', X_test.shape, y_test.shape)


In [None]:
# Raise max_iter above the default of 100 to give the solver more room to converge
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, preds))


In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, preds)
plt.title('Logistic Regression Confusion Matrix')
plt.tight_layout()
plt.show()


## 2. k-NN classification on digits

k-NN classifies a point based on the majority label among its $k$ nearest neighbors: $$\hat y=\mathrm{mode}(\{y_i : x_i \in N_k(x)\}).$$
k-NN has no explicit training phase: `fit` only stores (and optionally indexes) the labeled training points, and prediction relies purely on distances to them.
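
The majority-vote rule can be sketched from scratch in a few lines. This version assumes Euclidean distance and uses a hypothetical helper `knn_predict` with made-up toy points; it is meant only to illustrate the idea, not to mirror `KNeighborsClassifier` internals.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))  # near the first group
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5])))  # near the second group
```

An odd `k` avoids tied votes when there are two classes; with more classes, ties are still possible and need a tie-breaking rule.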

In [None]:
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print('Train shape:', X_train.shape)
print('Test shape:', X_test.shape)


In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
preds = knn.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, preds))


In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, preds)
plt.title('k-NN Confusion Matrix')
plt.tight_layout()
plt.show()


## 3. Decision tree classifier

Decision trees recursively partition the feature space to minimize an impurity measure such as the Gini index.
Each split chooses the feature and threshold that maximize the impurity reduction.
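
As a rough illustration of the quantity a split tries to maximize, the sketch below computes the Gini index and the impurity reduction of a candidate split. The helper names `gini` and `impurity_reduction` and the label lists are hypothetical, introduced only for this example.

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum_k p_k^2, where p_k is the fraction of class k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_reduction(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = [0, 0, 1, 1]

# A split that separates the classes perfectly
print(impurity_reduction(parent, [0, 0], [1, 1]))

# A split that leaves both children as mixed as the parent
print(impurity_reduction(parent, [0, 1], [0, 1]))
```

The first split removes all impurity, while the second removes none, so a tree learner would prefer the first.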

In [None]:
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print('Train shape:', X_train.shape)


In [None]:
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
preds = tree.predict(X_test)
print(classification_report(y_test, preds))


In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, preds)
plt.title('Decision Tree Confusion Matrix')
plt.tight_layout()
plt.show()


## 4. k-means clustering

k-means alternates between assigning points to the nearest center and recomputing centers to minimize $$J=\sum_i \|x_i-\mu_{c_i}\|^2.$$
Iterations continue until assignments stabilize or a preset iteration limit (`max_iter` in scikit-learn) is reached.
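
One round of the alternation described above can be sketched with NumPy as follows. The helpers `kmeans_step` and `inertia`, the toy points, and the initial centers are all hypothetical; scikit-learn's `KMeans` adds smarter initialization and convergence checks on top of this basic idea.

```python
import numpy as np

def kmeans_step(X, centers):
    """Assign each point to its nearest center, then recompute the centers."""
    # Pairwise distances between points and centers, shape (n_points, n_centers)
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # New center = mean of assigned points (keep the old center if a cluster is empty)
    new_centers = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(len(centers))
    ])
    return labels, new_centers

def inertia(X, centers, labels):
    """Objective J: sum of squared distances from points to their assigned centers."""
    return float(np.sum((X - centers[labels]) ** 2))

# Hypothetical toy data and initial centers
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
init_centers = np.array([[0.0, 0.0], [10.0, 10.0]])

labels, new_centers = kmeans_step(X_toy, init_centers)
print('Labels:', labels)
print('New centers:', new_centers)
print('J before:', inertia(X_toy, init_centers, labels))
print('J after:', inertia(X_toy, new_centers, labels))
```

Recomputing the centers never increases $J$ for fixed assignments, which is why repeating the step eventually converges to a local minimum.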

In [None]:
X, y = load_iris(return_X_y=True)
# n_init=10 runs k-means from 10 different initializations and keeps the best result
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X)
print('Cluster counts:', np.bincount(clusters))
print('Cluster centers:', kmeans.cluster_centers_)


In [None]:
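# Clustering used all four features; plot only the first two (sepal length and width)
plt.scatter(X[:,0], X[:,1], c=clusters, cmap='viridis', s=30)
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], c='red', marker='x', s=100, linewidths=2, label='Centers')
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('k-means Clustering')
plt.legend()
plt.tight_layout()
plt.show()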


## 5. Principal component analysis

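PCA projects data onto the directions of maximal variance, which are the leading eigenvectors of the data's covariance matrix. Here we keep the first two components of the 64-dimensional digits data so it can be plotted.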
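
The eigendecomposition view of PCA can be sketched directly with NumPy. The random toy data and variable names below are assumptions made for illustration, and the sketch does not claim to match the solver scikit-learn's `PCA` uses internally.

```python
import numpy as np

# Hypothetical toy data: 200 points in 3-D with very different spread per axis
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])

# 1. Center the data
X_centered = X_toy - X_toy.mean(axis=0)

# 2. Covariance matrix of the features (columns)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigendecomposition; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top two directions
components = eigvecs[:, :2]
reduced_toy = X_centered @ components
explained_ratio = eigvals[:2] / eigvals.sum()

print('Reduced shape:', reduced_toy.shape)
print('Explained variance ratio:', explained_ratio)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues can be read as "variance explained" by each component.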


In [None]:
X, y = load_digits(return_X_y=True)
pca = PCA(n_components=2)
reduced = pca.fit_transform(X)
print('Explained variance ratio:', pca.explained_variance_ratio_)
print('Transformed shape:', reduced.shape)


In [None]:
plt.scatter(reduced[:,0], reduced[:,1], c=y, cmap='tab10', s=15)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of Digits')
plt.tight_layout()
plt.show()


This concludes the brief tour of basic machine learning examples using scikit-learn. Feel free to modify the code cells and experiment with other datasets or algorithms for practice.