### Introduction to machine learning

#### **What is machine learning?**

- Machine learning is a method of teaching computers to learn patterns from data instead of being explicitly programmed with hand-written rules.

##### **Types of machine learning**
- **Supervised Learning**: Learning from labeled data (e.g., predicting house prices)

- **Unsupervised Learning**: Finding patterns in unlabeled data (e.g., customer segmentation)

- **Reinforcement Learning**: Learning by trial and error from reward feedback (e.g., game-playing AI)

### Supervised Learning

Supervised learning is a type of machine learning where the model is trained on a **labeled dataset**: each training example includes both the input data and the correct output (or target). The goal is for the model to **learn the mapping** from inputs to outputs so it can predict the output for new, unseen inputs.

#### Key Characteristics:
- **Labeled Data**: The dataset includes both features (inputs) and labels (outputs).
- **Goal**: Learn a function that maps inputs to correct outputs.
- **Feedback**: During training, the model's predictions are compared with the true labels, and its parameters are adjusted to reduce the error.
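
The feedback loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production training routine: it assumes a one-parameter linear model fitted by gradient descent on a made-up dataset, and the learning rate and epoch count are illustrative choices.

```python
# Toy labeled dataset: each input x is paired with its correct output y (here y = 2 * x).
inputs = [1.0, 2.0, 3.0, 4.0]
labels = [2.0, 4.0, 6.0, 8.0]

weight = 0.0          # the model's single parameter, starting from a poor guess
learning_rate = 0.01  # hypothetical step size chosen for this toy example

for epoch in range(200):
    # Feedback: gradient of the mean squared error with respect to the weight
    gradient = 0.0
    for x, y in zip(inputs, labels):
        error = weight * x - y      # prediction minus true label
        gradient += 2 * error * x
    gradient /= len(inputs)

    # Adjustment: move the weight in the direction that reduces the error
    weight -= learning_rate * gradient

# Use the learned mapping on a new, unseen input
prediction = weight * 5.0
print(f"learned weight: {weight:.4f}, prediction for x=5: {prediction:.4f}")
```

The learned weight approaches 2, the value used to build the labels, so the prediction for the unseen input 5 approaches 10.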

#### Common Applications:
1. **Classification** – Predicting a category or class label.
   - Example: Email spam detection (spam or not spam).
   - Example: Diagnosing diseases (positive or negative).

2. **Regression** – Predicting a continuous value.
   - Example: Predicting house prices.
   - Example: Estimating student test scores.


In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
# Set styling
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("Set2")
plt.rcParams.update({'font.size': 12})

#### **House price prediction**

In [24]:
# Create a sample dataset for housing price prediction
np.random.seed(42)
n_samples = 100

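# Features (np.random.randint excludes the upper bound)
square_footage = np.random.randint(800, 3000, n_samples)  # 800-2999 sq ft
num_bedrooms = np.random.randint(1, 5, n_samples)         # 1-4 bedrooms
age_of_house = np.random.randint(1, 50, n_samples)        # 1-49 years

# Target: 100 per sq ft + 15,000 per bedroom - 1,000 per year of age, plus Gaussian noise (std 20,000)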
prices = (
    square_footage * 100 + 
    num_bedrooms * 15000 - 
    age_of_house * 1000 + 
    np.random.normal(0, 20000, n_samples)
)

# Create DataFrame
house_df = pd.DataFrame({
    'Square_Footage': square_footage,
    'Num_Bedrooms': num_bedrooms,
    'Age_of_House': age_of_house,
    'Price': prices
})


house_df.head(10)


| | Square_Footage | Num_Bedrooms | Age_of_House | Price |
|---|---|---|---|---|
| 0 | 1660 | 2 | 28 | 167902.861107 |
| 1 | 2094 | 2 | 23 | 208733.533875 |
| 2 | 1930 | 4 | 37 | 206749.462796 |
| 3 | 1895 | 1 | 32 | 150250.052913 |
| 4 | 2438 | 3 | 33 | 273144.58998 |
| 5 | 2969 | 3 | 1 | 360068.276724 |
| 6 | 1266 | 4 | 19 | 175962.370439 |
| 7 | 2038 | 3 | 2 | 209069.439084 |
| 8 | 1130 | 1 | 44 | 129466.90316 |
| 9 | 2282 | 4 | 26 | 265139.493009 |
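
Because the price is a continuous value, this dataset suits a regression model. The sketch below rebuilds the same synthetic data and fits scikit-learn's `LinearRegression`; the 80/20 split and the `random_state` values are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Rebuild the synthetic housing data with the same seed and formula as above
np.random.seed(42)
n_samples = 100
square_footage = np.random.randint(800, 3000, n_samples)
num_bedrooms = np.random.randint(1, 5, n_samples)
age_of_house = np.random.randint(1, 50, n_samples)
prices = (
    square_footage * 100
    + num_bedrooms * 15000
    - age_of_house * 1000
    + np.random.normal(0, 20000, n_samples)
)
house_df = pd.DataFrame({
    'Square_Footage': square_footage,
    'Num_Bedrooms': num_bedrooms,
    'Age_of_House': age_of_house,
    'Price': prices,
})

X = house_df[['Square_Footage', 'Num_Bedrooms', 'Age_of_House']]
y = house_df['Price']

# Hold out 20% of the rows to check predictions on houses the model has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
test_r2 = r2_score(y_test, model.predict(X_test))

# The learned coefficients should land near the generating values
# (100, 15000, -1000), up to the noise added to the prices
coefficients = dict(zip(X.columns, model.coef_))
print(coefficients)
print(f"R^2 on held-out data: {test_r2:.3f}")
```

Comparing the fitted coefficients with the values used to generate the prices is a quick sanity check that is only possible because the data is synthetic.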


#### **Email classification**

In [23]:
np.random.seed(42)
n_samples = 100

# Features for regular emails
exclamation_count = np.random.poisson(0.5, n_samples//2)  # Fewer exclamations in regular emails
caps_percent = np.random.beta(2, 5, n_samples//2) * 100   # Lower caps % in regular emails  
spam_words = np.random.poisson(1, n_samples//2)           # Fewer spam words in regular emails

# Features for spam emails
spam_exclamation = np.random.poisson(3, n_samples//2)     # More exclamations in spam
spam_caps = np.random.beta(5, 2, n_samples//2) * 100      # Higher caps % in spam
spam_spam_words = np.random.poisson(5, n_samples//2)      # More spam words in spam

# Create two classes with different feature distributions
regular_emails = pd.DataFrame({
    'Exclamation_Count': exclamation_count,
    'Caps_Percentage': caps_percent,
    'Spam_Word_Count': spam_words,
    'Is_Spam': 0  # Not spam
})

spam_emails = pd.DataFrame({
    'Exclamation_Count': spam_exclamation,
    'Caps_Percentage': spam_caps,
    'Spam_Word_Count': spam_spam_words,
    'Is_Spam': 1  # Spam
})

# Combine into one dataset
email_df = pd.concat([regular_emails, spam_emails]).sample(frac=1, random_state=42).reset_index(drop=True)


email_df.head(10)


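| | Exclamation_Count | Caps_Percentage | Spam_Word_Count | Is_Spam |
|---|---|---|---|---|
| 0 | 3 | 62.437815 | 8 | 1 |
| 1 | 2 | 62.843798 | 3 | 1 |
| 2 | 3 | 87.232135 | 8 | 1 |
| 3 | 0 | 28.184254 | 1 | 0 |
| 4 | 0 | 47.848639 | 3 | 0 |
| 5 | 0 | 16.770028 | 1 | 0 |
| 6 | 1 | 42.220867 | 0 | 0 |
| 7 | 6 | 63.219972 | 6 | 1 |
| 8 | 0 | 45.013096 | 2 | 0 |
| 9 | 0 | 32.367422 | 1 | 0 |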
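
Here the target `Is_Spam` is a class label, so a classifier is the natural fit. The following sketch regenerates the same synthetic emails and trains a logistic regression model. The feature scaling step, the 70/30 split, and the example email at the end are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the synthetic emails with the same seed and distributions as above
np.random.seed(42)
half = 50
regular_emails = pd.DataFrame({
    'Exclamation_Count': np.random.poisson(0.5, half),
    'Caps_Percentage': np.random.beta(2, 5, half) * 100,
    'Spam_Word_Count': np.random.poisson(1, half),
    'Is_Spam': 0,
})
spam_emails = pd.DataFrame({
    'Exclamation_Count': np.random.poisson(3, half),
    'Caps_Percentage': np.random.beta(5, 2, half) * 100,
    'Spam_Word_Count': np.random.poisson(5, half),
    'Is_Spam': 1,
})
email_df = (
    pd.concat([regular_emails, spam_emails])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

X = email_df.drop(columns='Is_Spam')
y = email_df['Is_Spam']

# Stratify so both splits keep the 50/50 balance of spam and regular emails
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Scale the features first because they live on very different ranges
classifier = make_pipeline(StandardScaler(), LogisticRegression())
classifier.fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)

# A hypothetical new email: 4 exclamation marks, 80% capitals, 6 spam words
new_email = pd.DataFrame([[4, 80.0, 6]], columns=X.columns)
spam_probability = classifier.predict_proba(new_email)[0, 1]

print(f"test accuracy: {accuracy:.2f}")
print(f"estimated spam probability for the new email: {spam_probability:.2f}")
```

Unlike the regression case, the model outputs a class (or a class probability via `predict_proba`) rather than a quantity.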


#### **Student score prediction**

In [22]:
np.random.seed(42)
n_samples = 100

# Features
hours_studied = np.random.uniform(1, 10, n_samples)
attendance = np.random.uniform(50, 100, n_samples)
previous_score = np.random.uniform(50, 100, n_samples)

# Generate final scores with some noise
final_scores = (
    hours_studied * 5 +
    attendance * 0.2 +
    previous_score * 0.3 +
    np.random.normal(0, 5, n_samples)
)

# Ensure scores are in a reasonable range (0-100)
final_scores = np.clip(final_scores, 0, 100)

# Create letter grades for visualization
def score_to_letter(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    elif score >= 60:
        return 'D'
    else:
        return 'F'

# Create DataFrame
student_df = pd.DataFrame({
    'Hours_Studied': hours_studied,
    'Attendance_Percent': attendance,
    'Previous_Score': previous_score,
    'Final_Score': final_scores
})

# Add letter grades
student_df['Letter_Grade'] = student_df['Final_Score'].apply(score_to_letter)

student_df.head(10)

| | Hours_Studied | Attendance_Percent | Previous_Score | Final_Score | Letter_Grade |
|---|---|---|---|---|---|
| 0 | 4.370861 | 51.571459 | 82.101582 | 57.026931 | F |
| 1 | 9.556429 | 81.820521 | 54.206998 | 77.150346 | C |
| 2 | 7.587945 | 65.717799 | 58.081436 | 79.227438 | C |
| 3 | 6.387926 | 75.428535 | 94.927709 | 78.673247 | C |
| 4 | 2.404168 | 95.378324 | 80.321453 | 45.067227 | F |
| 5 | 2.403951 | 62.464611 | 50.459853 | 40.582903 | F |
| 6 | 1.522753 | 70.519146 | 55.073577 | 34.930733 | F |
| 7 | 8.795585 | 87.777557 | 83.175088 | 90.748131 | A |
| 8 | 6.410035 | 61.439908 | 50.253079 | 55.451477 | F |
| 9 | 7.372653 | 53.848995 | 58.040403 | 64.471504 | D |
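
This dataset carries both kinds of target: `Final_Score` is continuous (regression), while `Letter_Grade` is a category derived from it (classification). The `score_to_letter` function applies its thresholds one row at a time. A vectorized alternative, assuming the same cutoffs, is `pd.cut`. The scores below are a handful of made-up values chosen to exercise the bin edges.

```python
import numpy as np
import pandas as pd

# Made-up scores, including values that sit exactly on a grade boundary
final_scores = pd.Series([57.03, 77.15, 90.0, 60.0, 89.99, 100.0])

# right=False makes each bin include its lower edge, matching `score >= 90` etc.
# The last edge is infinity so that a score of exactly 100 still falls in the 'A' bin.
letter_grades = pd.cut(
    final_scores,
    bins=[0, 60, 70, 80, 90, np.inf],
    labels=['F', 'D', 'C', 'B', 'A'],
    right=False,
)

print(pd.DataFrame({'Final_Score': final_scores, 'Letter_Grade': letter_grades}))
```

The result is a categorical column, which also records the ordering of the grades from `F` up to `A`.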


#### **Iris flower classification**

In [21]:
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Create DataFrame for easier handling
iris_df = pd.DataFrame(data=X, columns=feature_names)
iris_df['species'] = target_names[y]  # map integer labels (0, 1, 2) to species names

iris_df.head(10)


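| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |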


#### **Non-numeric data classification**

In [28]:
np.random.seed(42)

def create_categorical_dataset(n_samples=100):
    # Define possible values for each feature
    colors = ['Red', 'Blue', 'Green', 'Yellow', 'Purple']
    sizes = ['Small', 'Medium', 'Large']
    shapes = ['Round', 'Square', 'Triangle', 'Oval']
    materials = ['Wood', 'Metal', 'Plastic', 'Glass', 'Fabric']
    origins = ['North', 'South', 'East', 'West', 'Central']
    categories = ['A', 'B', 'C', 'D']
    
    # Draw every column independently at random, so 'category' has no real relationship to the features
    data = {
        'color': np.random.choice(colors, size=n_samples),
        'size': np.random.choice(sizes, size=n_samples),
        'shape': np.random.choice(shapes, size=n_samples),
        'material': np.random.choice(materials, size=n_samples),
        'origin': np.random.choice(origins, size=n_samples),
        'category': np.random.choice(categories, size=n_samples)
    }
    
    # Create DataFrame
    df = pd.DataFrame(data)
    
    return df

# Create the dataset
df = create_categorical_dataset(100)

df.head(10)

| | color | size | shape | material | origin | category |
|---|---|---|---|---|---|---|
| 0 | Yellow | Large | Round | Plastic | South | D |
| 1 | Purple | Small | Square | Glass | East | C |
| 2 | Green | Medium | Square | Wood | North | A |
| 3 | Purple | Small | Oval | Glass | South | A |
| 4 | Purple | Large | Square | Plastic | Central | D |
| 5 | Blue | Large | Triangle | Wood | South | B |
| 6 | Green | Medium | Round | Glass | South | C |
| 7 | Green | Small | Oval | Glass | North | D |
| 8 | Green | Large | Round | Plastic | West | A |
| 9 | Purple | Medium | Round | Wood | South | A |
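
Most scikit-learn estimators expect numeric inputs, so text-valued columns like these usually need encoding first. One common approach, sketched here on a few hand-written rows in the same shape as the dataset, is one-hot encoding the features with `pd.get_dummies` and integer-coding the target with `pd.factorize`. Other encoders exist, and which one fits best depends on the model.

```python
import pandas as pd

# A few hand-written rows with the same kind of columns as the dataset above
sample = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Red'],
    'size': ['Small', 'Large', 'Medium', 'Large'],
    'category': ['A', 'B', 'A', 'C'],
})

# One-hot encode the features: one indicator column per distinct value
features = pd.get_dummies(sample[['color', 'size']])

# Integer-code the target: codes follow order of first appearance
target_codes, target_labels = pd.factorize(sample['category'])

print(features)
print(list(target_codes), list(target_labels))
```

Note that in the generated dataset the `category` column is drawn independently of the features, so a classifier trained on it should not be expected to do better than chance (about 25% with four classes). It is useful for practicing encoding, not for judging model quality.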


### Types of Supervised Learning Models

Supervised learning models can be broadly categorized into two main types based on the nature of the target variable: **Classification** and **Regression**.

#### 1. Classification

- **Goal**: Predict **discrete** labels or categories.
- **Output**: A class label (e.g., Yes/No, Spam/Not Spam, Species A/B/C).
- **Examples**: Email spam detection, disease diagnosis, handwriting recognition.

**Common Classification Algorithms:**
- **Logistic Regression**
- **Decision Trees**
- **Random Forest**
- **Support Vector Machines (SVM)**
- **K-Nearest Neighbors (KNN)**
- **Naive Bayes**
- **Neural Networks**
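
In scikit-learn, these algorithms share the same `fit`/`predict` interface, which makes it easy to try several on one dataset. The sketch below compares six of them on the Iris data with 5-fold cross-validation. The hyperparameters are mostly library defaults and the choice of 5 folds is an assumption, so treat the scores as a rough comparison rather than a benchmark.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
}

# Mean accuracy across 5 cross-validation folds for each model
mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```

On a small, clean dataset like Iris the models tend to score similarly; differences become more meaningful on larger or noisier data.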


#### 2. Regression

- **Goal**: Predict **continuous** numeric values.
- **Output**: A real number (e.g., price, temperature, age).
- **Examples**: Predicting house prices, forecasting sales, estimating income.

**Common Regression Algorithms:**
- **Linear Regression**
- **Ridge/Lasso Regression**
- **Decision Tree Regressor**
- **Random Forest Regressor**
- **Support Vector Regressor (SVR)**
- **Neural Networks for regression**
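
Ridge and Lasso are both linear regression with a penalty on the coefficients, but they behave differently: Ridge shrinks coefficients toward zero, while Lasso can set some of them exactly to zero. The sketch below illustrates this on made-up data where only the first of three features affects the target. The `alpha` values are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Made-up data: three features, but only the first one drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

models = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=10.0),   # hypothetical penalty strength
    'lasso': Lasso(alpha=0.5),    # hypothetical penalty strength
}

# Fit each model on the same data and collect its coefficients
coefficients = {name: model.fit(X, y).coef_ for name, model in models.items()}

for name, coef in coefficients.items():
    print(name, np.round(coef, 3))
```

With these settings, Lasso drops the two irrelevant features entirely, which is why it is sometimes used as a simple form of feature selection.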


#### Bonus Tip: Choosing Between Classification and Regression
- If your **target variable is categorical** (a fixed set of classes, even when they are coded as numbers such as 0/1), use **classification**.
- If your **target variable is a continuous numeric quantity**, use **regression**.
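
scikit-learn ships a helper, `type_of_target`, that applies a similar rule of thumb to a target array. The sketch below runs it on small hand-written targets modeled on the datasets in this notebook. It is a heuristic based on the values alone, so the final decision still depends on what the target means.

```python
from sklearn.utils.multiclass import type_of_target

# Small hand-written targets modeled on the examples in this notebook
targets = {
    'spam labels': ['spam', 'not spam', 'spam'],
    'species': ['setosa', 'versicolor', 'virginica'],
    'house prices': [167902.86, 208733.53, 206749.46],
    'integer-coded classes': [0, 1, 1, 0],
}

target_types = {name: type_of_target(values) for name, values in targets.items()}

for name, kind in target_types.items():
    print(f"{name}: {kind}")
```

A result of `binary` or `multiclass` points to classification, and `continuous` points to regression. Integer-valued targets are reported as classes, so a count such as number of bedrooms would need a deliberate choice.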


