# Machine Learning Basics in Python

This notebook introduces the foundational concepts of machine learning (ML) using Python. You'll learn about supervised and unsupervised learning, model training, evaluation, and practical applications using scikit-learn.

## Topics Covered:
1. What is Machine Learning?
2. Types of Machine Learning
3. The Machine Learning Workflow
4. Data Preparation
5. Supervised Learning: Classification
6. Supervised Learning: Regression
7. Unsupervised Learning: Clustering
8. Model Evaluation and Metrics
9. Real-Life Use Cases

## 1. What is Machine Learning?

Machine learning is a field of artificial intelligence that enables computers to learn from data and make predictions or decisions without being explicitly programmed for each task.

**Real-life use case:** Email spam filters use machine learning to classify emails as spam or not spam based on patterns in the data.

## 2. Types of Machine Learning

- **Supervised Learning:** The model learns from labeled data (e.g., classification, regression).
- **Unsupervised Learning:** The model finds patterns in unlabeled data (e.g., clustering, dimensionality reduction).
- **Reinforcement Learning:** The model learns by interacting with an environment and receiving feedback (rewards or penalties).

**Real-life use case:** Customer segmentation (unsupervised) and credit scoring (supervised).
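
As a rough illustration of the first two categories, the sketch below fits a supervised and an unsupervised estimator to the same synthetic data. The dataset, the cluster centers, and the variable names are assumptions made for this example. The point is only that the supervised model receives labels in `fit`, while the clustering model sees features alone.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative only): 150 points around three hand-picked centers
X_demo, y_demo = make_blobs(
    n_samples=150, centers=[[0, 0], [5, 5], [-5, 5]], cluster_std=0.5, random_state=0
)

# Supervised: the labels y_demo are passed to fit()
supervised = LogisticRegression(max_iter=500)
supervised.fit(X_demo, y_demo)
train_accuracy = supervised.score(X_demo, y_demo)
print('Supervised training accuracy:', train_accuracy)

# Unsupervised: only the features are passed; group membership is inferred
unsupervised = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = unsupervised.fit_predict(X_demo)
print('Distinct clusters found:', len(set(cluster_ids)))
```

The cluster ids returned by KMeans are arbitrary integers and need not match the numbering of the original labels.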

## 3. The Machine Learning Workflow

1. Define the problem
2. Collect and prepare data
3. Choose a model
4. Train the model
5. Evaluate the model
6. Tune and improve
7. Deploy and monitor

**Real-life use case:** Predicting house prices using historical sales data.
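
The first five steps can be sketched end to end on a toy version of the house-price problem. Everything about the data here is hypothetical: the two features (floor area and age), the coefficients, and the noise level are invented so the example runs without an external dataset. Tuning and deployment (steps 6 and 7) are left out.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: define the problem (predict price) and collect data.
# Hypothetical data: price depends on floor area and age, plus random noise.
rng = np.random.default_rng(42)
area = rng.uniform(50, 250, size=300)   # floor area in square metres
age = rng.uniform(0, 60, size=300)      # age in years
price = 2000 * area - 500 * age + rng.normal(0, 10000, size=300)
X_house = np.column_stack([area, age])

# Hold out a test set before fitting anything
X_house_train, X_house_test, y_house_train, y_house_test = train_test_split(
    X_house, price, test_size=0.2, random_state=42
)

# Steps 3-4: choose a model and train it
house_model = make_pipeline(StandardScaler(), LinearRegression())
house_model.fit(X_house_train, y_house_train)

# Step 5: evaluate on the held-out data
house_mae = mean_absolute_error(y_house_test, house_model.predict(X_house_test))
house_r2 = house_model.score(X_house_test, y_house_test)
print('MAE:', round(house_mae), 'R^2:', round(house_r2, 3))
```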

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris, make_blobs, make_regression  # load_boston was removed in scikit-learn 1.2
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

## 4. Data Preparation

Data cleaning and preprocessing are crucial for building effective ML models. This includes handling missing values, encoding categorical variables, and feature scaling.

**Real-life use case:** Preparing customer data for churn prediction by filling missing values and normalizing features.
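
Handling missing values and encoding categorical variables could look like the following with pandas. This is a minimal sketch: the customer table, its column names, and its values are invented for illustration, and median/mode filling is just one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with gaps (values invented for illustration)
customers = pd.DataFrame({
    'age': [34, np.nan, 52, 41],
    'monthly_spend': [120.0, 80.0, np.nan, 100.0],
    'plan': ['basic', 'premium', 'basic', None],
})

# Fill numeric gaps with the column median
for col in ['age', 'monthly_spend']:
    customers[col] = customers[col].fillna(customers[col].median())

# Fill the categorical gap with the most frequent value, then one-hot encode
customers['plan'] = customers['plan'].fillna(customers['plan'].mode()[0])
encoded = pd.get_dummies(customers, columns=['plan'])
print(encoded)
```

In a real project the fill values would normally be computed from the training split only, so that no information from the test rows leaks into preprocessing.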

In [None]:
# Example: Load and prepare the Iris dataset
iris = load_iris(as_frame=True)
df = iris.frame
print(df.head())

# Check for missing values
print('Missing values:', df.isnull().sum().sum())

# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[iris.feature_names])
print('Scaled features shape:', X_scaled.shape)

## 5. Supervised Learning: Classification

Classification is about predicting a categorical label. Example: Predicting if an email is spam or not.

**Real-life use case:** Diagnosing diseases (e.g., predicting if a tumor is malignant or benign).

In [None]:
# Split data into train and test sets
X = X_scaled
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression classifier
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Evaluate the classifier
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
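
The cells above scale the full dataset before splitting it, so the scaler sees the test rows. On Iris the effect is negligible, but a common way to avoid it is to split first and wrap the scaler and classifier in a pipeline. The sketch below is one possible variant, not a replacement for the cell above. The variable names and the `stratify` argument are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Split the raw (unscaled) features first
X_raw, y_raw = load_iris(return_X_y=True)
X_raw_train, X_raw_test, y_raw_train, y_raw_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# The pipeline fits the scaler on the training split only,
# then applies the same transformation when predicting on test data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe.fit(X_raw_train, y_raw_train)
pipe_accuracy = pipe.score(X_raw_test, y_raw_test)
print('Pipeline accuracy:', pipe_accuracy)
```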

## 6. Supervised Learning: Regression

Regression predicts a continuous value. Example: Predicting house prices.

**Real-life use case:** Forecasting sales revenue for the next quarter.

In [None]:
# Example: linear regression on synthetic data
# (load_boston was removed from scikit-learn in version 1.2, so we simulate data instead)
from sklearn.datasets import make_regression
X_reg, y_reg = make_regression(n_samples=200, n_features=3, noise=10, random_state=42)
# Use regression-specific names so the Iris split from section 5 is not overwritten
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

reg = LinearRegression()
reg.fit(X_train_reg, y_train_reg)
y_pred_reg = reg.predict(X_test_reg)

# Evaluate regression
mse = mean_squared_error(y_test_reg, y_pred_reg)
print('Mean Squared Error:', mse)
plt.scatter(y_test_reg, y_pred_reg, alpha=0.7)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Regression: Actual vs Predicted')
# Dashed diagonal marks perfect predictions
plt.plot([y_test_reg.min(), y_test_reg.max()], [y_test_reg.min(), y_test_reg.max()], 'r--')
plt.show()
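
Mean squared error is expressed in squared target units, which makes it hard to read on its own. Two companions that are often reported alongside it are the root mean squared error (same units as the target) and the R² score. The sketch below regenerates the same kind of synthetic data so that it runs on its own. The variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data, generated with the same settings as the cell above
X_syn, y_syn = make_regression(n_samples=200, n_features=3, noise=10, random_state=42)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

lin = LinearRegression().fit(X_syn_train, y_syn_train)
y_syn_pred = lin.predict(X_syn_test)

# RMSE is the square root of MSE, so it is in the same units as the target
syn_mse = mean_squared_error(y_syn_test, y_syn_pred)
syn_rmse = np.sqrt(syn_mse)
# R^2 compares the model against always predicting the mean (1.0 is a perfect fit)
syn_r2 = r2_score(y_syn_test, y_syn_pred)
print('RMSE:', syn_rmse)
print('R^2:', syn_r2)
```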

## 7. Unsupervised Learning: Clustering

Clustering groups similar data points together without labels. Example: Customer segmentation.

**Real-life use case:** Grouping customers by purchasing behavior for targeted marketing.

In [None]:
# Example: KMeans clustering
X_blobs, y_blobs = make_blobs(n_samples=200, centers=3, n_features=2, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # explicit n_init avoids a FutureWarning in some scikit-learn versions
clusters = kmeans.fit_predict(X_blobs)

plt.scatter(X_blobs[:, 0], X_blobs[:, 1], c=clusters, cmap='viridis', alpha=0.7)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x', s=100)
plt.title('KMeans Clustering Example')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
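
KMeans needs the number of clusters up front, and real data rarely comes with that number attached. One common heuristic is to fit several values of k and compare their silhouette scores, where higher means tighter, better-separated clusters. The sketch below assumes synthetic blobs with hand-picked centers. The range of k values is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (centers chosen for illustration)
X_sil, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [6, 6], [-6, 6]], cluster_std=1.0, random_state=42
)

# Fit KMeans for several candidate values of k and record the silhouette score
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_sil)
    silhouette_by_k[k] = silhouette_score(X_sil, labels)
    print(f'k={k}: silhouette={silhouette_by_k[k]:.3f}')

# Pick the k with the highest score
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print('Best k by silhouette score:', best_k)
```

The silhouette score is a heuristic, not a guarantee. On overlapping or irregularly shaped clusters it can favor a k that does not match the structure you care about.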

## 8. Model Evaluation and Metrics

Evaluating model performance on held-out data is crucial. Common classification metrics include accuracy, precision, recall, and F1-score; for regression, mean squared error is a standard choice.

**Real-life use case:** Comparing different models to select the best one for predicting loan defaults.
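
To make the classification metrics concrete, the sketch below computes precision, recall, and F1 by hand from the four cells of a binary confusion matrix and compares them with scikit-learn's functions. The ten labels and predictions are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up binary labels and predictions (1 = positive class)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels 0/1, ravel() yields the cells in the order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print('By hand:     ', precision, recall, f1)
print('scikit-learn:', precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```

Accuracy here is 0.7, yet recall is only 0.5: half of the positive cases were missed. This is why accuracy alone can be misleading when classes are imbalanced.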

In [None]:
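# Example: Compare Decision Tree and Logistic Regression on Iris
# Rebuild the Iris split from X and y (same random_state as in section 5),
# in case X_train/X_test were reassigned by a later cell
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
y_pred_lr = clf.predict(X_test)  # clf is the logistic regression fitted in section 5
print('Decision Tree Accuracy:', accuracy_score(y_test, y_pred_dt))
print('Logistic Regression Accuracy:', accuracy_score(y_test, y_pred_lr))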
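
A single 30-sample test split gives a noisy comparison, since one or two misclassified flowers can flip the ranking. A more stable alternative is k-fold cross-validation, sketched below with 5 folds. The fold count, the dictionary of candidate models, and the variable names are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Candidate models; the pipeline re-fits the scaler inside each training fold
candidates = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=200)),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

# Mean accuracy over 5 folds for each candidate
cv_means = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_iris, y_iris, cv=5)
    cv_means[name] = fold_scores.mean()
    print(f'{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}')
```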

## 9. Real-Life Use Cases

- **Healthcare:** Predicting disease risk from patient data
- **Finance:** Detecting fraudulent transactions
- **Retail:** Recommending products to customers
- **Transportation:** Predicting traffic congestion
- **Agriculture:** Forecasting crop yields using weather and soil data

## Practice Exercises

1. Try a different classifier (e.g., SVC or KNeighborsClassifier) on the Iris dataset and compare its results with the models above.
2. Use a real dataset (e.g., from UCI or Kaggle) and build a regression model.
3. Perform clustering on a dataset of your choice and visualize the clusters.
4. Experiment with feature scaling and observe its effect on model performance.
5. Explore model evaluation metrics beyond accuracy (e.g., ROC-AUC, confusion matrix).