# What is Machine Learning?
Machine Learning (ML) is a field of Artificial Intelligence (AI) that enables computers to learn from data and make decisions or predictions without being explicitly programmed.

So, instead of hand-writing the rules to solve a problem, we let the computer learn patterns from data and then use those patterns to make predictions on new inputs.

# Real-Life Applications of Machine Learning
| Domain           | Application                  | Example                                   |
| ---------------- | ---------------------------- | ----------------------------------------- |
| 📧 Email         | Spam detection               | Gmail filtering spam mails                |
| 🎥 Entertainment | Recommendation systems       | Netflix, YouTube suggesting movies        |
| 💰 Finance       | Fraud detection              | Banks identifying fraudulent transactions |
| 🚗 Automotive    | Self-driving cars            | Tesla using ML for object detection       |
| 🏥 Healthcare    | Disease prediction           | Predicting diabetes, cancer risk          |
| 🛍️ Retail       | Customer behavior prediction | Amazon predicting what you'll buy next    |


# Types of Machine Learning

Machine Learning is mainly divided into three types:

# Supervised Learning

The model is trained on labeled data (input + output known).

## Example:

- Input: Hours studied
- Output: Marks obtained

The algorithm learns the relationship between them.

## Applications:

- Predicting house prices
- Spam/Not Spam classification
- Sentiment analysis

## Algorithms:

- Linear Regression
- Logistic Regression
- Decision Trees
- Random Forest
- Support Vector Machines (SVM)

# Unsupervised Learning

The model is trained on unlabeled data (only input, no output).

Goal: Discover hidden patterns or groupings.

## Examples:

- Customer segmentation
- Topic modeling in text
- Market basket analysis

## Algorithms:

- K-Means Clustering
- Hierarchical Clustering
- PCA (Principal Component Analysis)
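
As a rough illustration of discovering groupings without labels, the sketch below runs K-Means on six made-up 2D points that form two obvious groups. The data, the choice of `n_clusters=2`, and the variable names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy, made-up data: two visibly separate groups of points (no labels are given)
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 1.5],   # group in the lower-left
    [8.0, 8.0], [8.5, 9.0], [9.0, 8.5],   # group in the upper-right
])

# Ask K-Means for 2 clusters; random_state makes the run repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("Cluster labels:", labels)
print("Cluster centers:\n", kmeans.cluster_centers_)
```

The cluster numbers themselves (0 or 1) are arbitrary. What matters is that the first three points share one label and the last three share the other.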

# Reinforcement Learning

The model learns by interacting with an environment and receiving rewards or penalties for actions.

## Examples:

- A robot learning to walk
- AlphaGo (AI playing Go)
- Self-driving cars adjusting their driving policy
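
To make the reward idea concrete, here is a minimal sketch of tabular Q-learning, one common reinforcement learning method. The environment is hypothetical: a corridor of 5 cells where the agent starts in cell 0 and earns a reward of +1 for reaching cell 4. The `step` function, the learning rate, and the discount factor are all assumptions for this toy setup.

```python
import numpy as np

# Hypothetical toy environment: a corridor of 5 cells (states 0..4).
# The agent starts in cell 0 and receives a reward of +1 on reaching cell 4.
N_STATES, GOAL = 5, 4
ACTIONS = ["left", "right"]

def step(state, action):
    """Move one cell left (0) or right (1), with walls at both ends."""
    next_state = max(state - 1, 0) if action == 0 else min(state + 1, GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

rng = np.random.default_rng(0)
q_table = np.zeros((N_STATES, len(ACTIONS)))  # estimated value of each action in each state
alpha, gamma = 0.5, 0.9                        # learning rate and discount factor

for episode in range(200):
    state = 0
    while state != GOAL:
        action = int(rng.integers(len(ACTIONS)))  # explore by acting randomly
        next_state, reward = step(state, action)
        # Q-learning update: move the estimate toward reward + discounted best future value
        target = reward + gamma * q_table[next_state].max()
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state

# Greedy policy: the highest-valued action in each non-goal cell
policy = [ACTIONS[int(np.argmax(q_table[s]))] for s in range(GOAL)]
print(policy)
```

After training, the greedy policy chooses "right" in every cell, because moving toward the goal is the only way to collect the reward.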

# Machine Learning Problem Types

When we use Machine Learning, our goal is usually to predict something.

Depending on the type of output we want to predict, supervised ML problems are generally divided into two main categories:

- Regression → Predict numerical (continuous) values
- Classification → Predict categorical (discrete) values

# What is Regression?

Regression is used when the output (target variable) is numerical or continuous — meaning it can take any real number value.

| Problem                  | Input Features (X)       | Output (Y)       | Type       |
| ------------------------ | ------------------------ | ---------------- | ---------- |
| Predicting house prices  | Area, Bedrooms, Location | Price (in $)     | Continuous |
| Predicting student marks | Hours studied            | Marks (0–100)    | Continuous |
| Predicting temperature   | Time of day, humidity    | Temperature (°C) | Continuous |
| Predicting sales revenue | Marketing budget, region | Sales amount     | Continuous |


# What is Classification?

Classification is used when the output variable is categorical (discrete) —
meaning the output belongs to specific groups or classes.

**Examples of Classification Problems**
| Problem                       | Input Features (X)         | Output (Y)                    | Type        |
| ----------------------------- | -------------------------- | ----------------------------- | ----------- |
| Email Spam Detection          | Email text                 | Spam / Not Spam               | Binary      |
| Tumor Diagnosis               | Tumor size, age, cell type | Malignant / Benign            | Binary      |
| Weather Prediction            | Temperature, humidity      | Sunny / Rainy / Cloudy        | Multi-class |
| Handwritten Digit Recognition | Image pixels               | 0–9 digits                    | Multi-class |
| Sentiment Analysis            | Movie review text          | Positive / Negative / Neutral | Multi-class |
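
A binary classification problem can be sketched in a few lines with scikit-learn's `LogisticRegression`. The hours and pass/fail labels below are made up for illustration, and the variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: hours studied -> 0 = "Fail", 1 = "Pass"
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
passed = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(hours, passed)

new_hours = np.array([[2.0], [9.0]])
predicted = clf.predict(new_hours)             # class labels (0 or 1)
probabilities = clf.predict_proba(new_hours)   # probability of each class

for h, label, proba in zip(new_hours.ravel(), predicted, probabilities):
    outcome = "Pass" if label == 1 else "Fail"
    print(f"{h} hours -> {outcome} (estimated P(pass) = {proba[1]:.2f})")
```

Note that the output is a class label, not a number on a continuous scale. The probability is only an intermediate score used to pick the class.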


# Regression vs Classification
| Feature                | Regression                              | Classification                        |
| ---------------------- | --------------------------------------- | ------------------------------------- |
| **Output type**        | Continuous / Numerical                  | Categorical / Discrete                |
| **Goal**               | Predict a number                        | Predict a class or label              |
| **Examples**           | Predict house price, temperature, sales | Predict spam/not spam, disease type   |
| **Algorithms**         | Linear Regression, SVR                  | Logistic Regression, Decision Tree    |
| **Evaluation Metrics** | MSE, RMSE, R² Score                     | Accuracy, Precision, Recall, F1-Score |
| **Output Example**     | 93.6 marks                              | “Pass” / “Fail”                       |
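
The evaluation metrics in the table can be computed with `sklearn.metrics`. The sketch below uses small made-up arrays of true and predicted values, so the numbers only illustrate how each metric is called.

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error, r2_score,
    accuracy_score, precision_score, recall_score, f1_score,
)

# Regression: compare predicted numbers with the true numbers (made-up values)
y_true_reg = [3, 5, 7, 9]
y_pred_reg = [2, 5, 8, 9]
mse = mean_squared_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mse)                      # RMSE is the square root of MSE
r2 = r2_score(y_true_reg, y_pred_reg)
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")

# Classification: compare predicted labels with the true labels (made-up values)
y_true_clf = [1, 0, 1, 1, 0, 1]
y_pred_clf = [1, 0, 0, 1, 1, 1]
accuracy = accuracy_score(y_true_clf, y_pred_clf)
precision = precision_score(y_true_clf, y_pred_clf)
recall = recall_score(y_true_clf, y_pred_clf)
f1 = f1_score(y_true_clf, y_pred_clf)
print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  Recall={recall:.3f}  F1={f1:.3f}")
```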


# Real-World Example Comparison
| Scenario                                          | Type           | Why?                                |
| ------------------------------------------------- | -------------- | ----------------------------------- |
| Predicting the **price** of a car                 | Regression     | Output is numeric (price value)     |
| Predicting whether a **customer will buy** or not | Classification | Output is “Yes” or “No”             |
| Predicting **exam marks**                         | Regression     | Continuous numeric value            |
| Predicting **disease presence**                   | Classification | Categories: “Positive” / “Negative” |
| Predicting **movie rating (1–5 stars)**           | Classification | Discrete classes (ordered, so regression is also possible) |
| Predicting **electricity usage**                  | Regression     | Continuous numeric output           |


# Supervised Learning Example — Linear Regression

Now let’s focus on one key Machine Learning model: Linear Regression.
It is one of the simplest ML models and is often the first algorithm taught.

# What is Linear Regression?

Linear Regression is a supervised learning algorithm that models the relationship between a dependent variable (Y) and one or more independent variables (X) using a straight line (or, with several independent variables, a plane or hyperplane).
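
For a single input variable, the best-fit line can be computed directly with the least-squares formulas. The sketch below is only an illustration, assuming the same hours and marks values used in this notebook's example. It fits all ten points, so its numbers differ slightly from a model trained on a subset of the data.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
marks = np.array([20, 25, 35, 40, 50, 60, 65, 70, 85, 90], dtype=float)

# Least-squares estimates for the line: marks ≈ intercept + slope * hours
x_mean, y_mean = hours.mean(), marks.mean()
slope = np.sum((hours - x_mean) * (marks - y_mean)) / np.sum((hours - x_mean) ** 2)
intercept = y_mean - slope * x_mean
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")

# Cross-check against NumPy's degree-1 polynomial fit (returns slope first)
check_slope, check_intercept = np.polyfit(hours, marks, deg=1)
print(f"polyfit slope = {check_slope:.4f}, intercept = {check_intercept:.4f}")
```

The slope is the expected change in marks for one extra hour of study, and the intercept is the predicted marks at zero hours.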


In [14]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Dataset
data = {
    'Hours': [1,2,3,4,5,6,7,8,9,10],
    'Marks': [20,25,35,40,50,60,65,70,85,90]
}
df = pd.DataFrame(data)

# Features and target
X = df[['Hours']]
y = df['Marks']

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Model Parameters
print("Slope (Coefficient):", model.coef_[0])
print("Intercept:", model.intercept_)

# Predictions
y_pred = model.predict(X_test)

# Evaluate model performance
# Note: the test set holds only 2 samples here, so these scores are rough estimates.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
r2_percent = r2 * 100  # R² as a percentage of variance explained (this is not classification accuracy)

print("Mean Squared Error:", round(mse, 2))
print("R-squared Score:", round(r2, 3))
print("Variance explained (R-squared x 100): {:.2f}%".format(r2_percent))

# Live prediction (a DataFrame keeps the 'Hours' feature name used during training)
hours = 9.25
predicted_marks = model.predict(pd.DataFrame({'Hours': [hours]}))
print(f"📘 Predicted Marks for {hours} hours of study: {predicted_marks[0]:.2f}")

# Visualization (uncomment to plot)
# plt.figure(figsize=(8,6))
# plt.scatter(X, y, color='blue', label='Actual Data')
# plt.plot(X, model.predict(X), color='red', linewidth=2, label='Best Fit Line')

# Highlight the prediction point
# plt.scatter([hours], predicted_marks, color='green', s=100, marker='o', label='Predicted Point (9.25 hrs)')
# plt.xlabel('Hours Studied')
# plt.ylabel('Marks Obtained')
# plt.title('Linear Regression - Study Hours vs Marks')
# plt.legend()
# plt.grid(True)
# plt.show()


Slope (Coefficient): 7.672413793103448
Intercept: 11.551724137931032
Mean Squared Error: 11.46
R-squared Score: 0.987
Variance explained (R-squared x 100): 98.73%
📘 Predicted Marks for 9.25 hours of study: 82.52




# What is Train-Test Split?

When we build a machine learning model, we want it to learn patterns from data —
but also to perform well on unseen (new) data.

| Set              | Purpose                    | Description                                                      |
| ---------------- | -------------------------- | ---------------------------------------------------------------- |
| **Training Set** | Used to train the model    | Model learns relationships (patterns) between inputs and outputs |
| **Testing Set**  | Used to evaluate the model | Model is tested on unseen data to check performance              |

Example Analogy:

Imagine you’re a teacher preparing students for an exam:

Training data = Practice questions you give them in class.

Testing data = Final exam questions they’ve never seen before.

If students only memorize the practice questions, they may fail the real test (in ML this is called overfitting).
So, we test with new questions to check true understanding. That’s exactly why we use test data.

## What Happens When We Change the Ratio?

Let’s understand what different splits mean:
| Split Ratio | Train (%) | Test (%) | Description                               |
| ----------- | --------- | -------- | ----------------------------------------- |
| **80/20**   | 80        | 20       | 🔹 Most common and balanced split         |
| **70/30**   | 70        | 30       | Good for smaller datasets                 |
| **60/40**   | 60        | 40       | Risk of underfitting (less data to train) |
| **90/10**   | 90        | 10       | Good for very large datasets              |
| **50/50**   | 50        | 50       | Not ideal — training data too small       |


## Effect of Different Ratios
| Split     | Pros                                      | Cons                                        |
| --------- | ----------------------------------------- | ------------------------------------------- |
| **80/20** | Good balance between training and testing; works well for most datasets | Test set can be too small for a reliable estimate on tiny datasets |
| **70/30** | More test data for validation             | Slightly less training data                 |
| **90/10** | More data to learn patterns               | Smaller test set (less reliable evaluation) |
| **60/40** | More testing data to evaluate performance | Less data for training → weaker model       |


## Example Visualization (Conceptual)

Let’s say we have 10 data points (shown in order for simplicity; in practice the data is usually shuffled before splitting):

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

80/20 split:

- Train = [1,2,3,4,5,6,7,8]
- Test = [9,10]

60/40 split:

- Train = [1,2,3,4,5,6]
- Test = [7,8,9,10]
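
The conceptual splits above can be reproduced with `train_test_split` by turning shuffling off. This is a small sketch; passing `shuffle=False` is an assumption made only so the result matches the ordered picture, since the function shuffles by default.

```python
from sklearn.model_selection import train_test_split

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# shuffle=False keeps the original order, so the last items become the test set
train_80, test_20 = train_test_split(data, test_size=0.2, shuffle=False)
train_60, test_40 = train_test_split(data, test_size=0.4, shuffle=False)

print("80/20 -> train:", train_80, "test:", test_20)
print("60/40 -> train:", train_60, "test:", test_40)
```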

## Why Do We Use 80/20 Most Commonly?

Because it provides a good balance:

- The model has enough data to learn (80%)
- And enough unseen data to test its generalization (20%)

For small datasets (like <1000 samples), people often use 70/30 to have more testing data.

For very large datasets (like millions of samples), even 90/10 works fine, since 10% of a million is still 100,000 test samples.
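
The arithmetic behind these rules of thumb is easy to check. The helper below, `split_sizes`, is a hypothetical function written for this illustration. It uses plain integer division, and library functions may round slightly differently.

```python
def split_sizes(n_samples, test_percent):
    """Return (n_train, n_test) for a given test percentage (hypothetical helper)."""
    n_test = n_samples * test_percent // 100
    return n_samples - n_test, n_test

for n_samples, test_percent in [(1_000, 20), (1_000, 30), (1_000_000, 10)]:
    n_train, n_test = split_sizes(n_samples, test_percent)
    print(f"{n_samples:>9,} samples, {test_percent}% test -> train={n_train:,}, test={n_test:,}")
```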

## How Train-Test Split Works Internally

1. It randomly shuffles the dataset (this is the default; setting random_state makes the shuffle reproducible)
2. Then divides it according to the test_size
3. Assigns:
   - Training data → X_train, y_train
   - Testing data → X_test, y_test
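
The steps above can be sketched by hand with NumPy. This is an illustrative re-implementation, not scikit-learn's actual code: the function name `simple_train_test_split` and its `seed` parameter are assumptions, and it will not produce the same rows as `train_test_split` for a given seed.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=42):
    """Shuffle the row indices, then slice them into test and train parts."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))            # step 1: random shuffle
    n_test = int(np.ceil(test_size * len(X)))    # step 2: size of the test portion
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # step 3: assign rows to the four outputs
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(1, 11).reshape(-1, 1)   # 10 samples, 1 feature
y = X.ravel() * 10                    # made-up targets
X_train, X_test, y_train, y_test = simple_train_test_split(X, y, test_size=0.2)

print("Train size:", len(X_train), "Test size:", len(X_test))
```

Because the same shuffled indices are used for both X and y, each input stays paired with its own target after the split.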