# Module 4.1: Introduction to Scikit-Learn

Welcome to the Machine Learning module! This is where we transition from analyzing the past to predicting the future. 🤖

**Scikit-Learn (`sklearn`)** is the most widely used library for traditional (non-deep-learning) machine learning in Python. It provides a large collection of well-documented algorithms for tasks such as classification, regression, and clustering, all behind a simple, consistent API (Application Programming Interface).

**Goal of this Notebook:**
Before we dive into specific algorithms, we must understand the fundamental Scikit-Learn workflow. We will learn:

1.  The difference between **Features** (input) and **Target** (output).
2.  The crucial concept of the **Train-Test Split**.
3.  The standard Scikit-Learn API pattern: `fit()`, `predict()`, and `score()`.

## 1. Features and Target

In supervised machine learning, we work with two main types of data:

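* **Features (X):** These are the input variables: the data we use to make a prediction. Scikit-Learn expects a 2D array or DataFrame of shape `(n_samples, n_features)`, with one row per sample and one column per feature.
* **Target (y):** This is the output variable: the value we are trying to predict. This is typically a 1D array or Series of shape `(n_samples,)`, with one value per row of `X`.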

**Example:** If you want to predict house prices, the *features* (X) would be columns like `[Area, Bedrooms, Age]`, and the *target* (y) would be the `Price`.

In [None]:
import numpy as np
import pandas as pd

# Let's create a simple, sample dataset of 'Years of Experience' vs. 'Salary'
data = {
    'YearsExperience': [1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7],
    'Salary': [39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189]
}
df = pd.DataFrame(data)

# Define our features (X) and target (y)
X = df[['YearsExperience']] # Features must be a 2D structure, hence the double brackets
y = df['Salary']

print("--- Features (X) ---")
print(X.head())
print("\n--- Target (y) ---")
print(y.head())
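
The double-bracket comment above is easiest to see by comparing shapes. The following is a minimal sketch that assumes the same sample data as the cell above; the names `X_2d`, `x_1d`, and `X_from_1d` are introduced here only for illustration.

```python
import pandas as pd

# Same sample data as in the cell above
df = pd.DataFrame({
    'YearsExperience': [1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7],
    'Salary': [39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189],
})

X_2d = df[['YearsExperience']]  # double brackets -> DataFrame (2D)
x_1d = df['YearsExperience']    # single brackets -> Series (1D)

print(type(X_2d).__name__, X_2d.shape)  # DataFrame (10, 1)
print(type(x_1d).__name__, x_1d.shape)  # Series (10,)

# If you only have a 1D array, reshape it into a single-column 2D array
X_from_1d = x_1d.to_numpy().reshape(-1, 1)
print(X_from_1d.shape)  # (10, 1)
```

Passing the 1D version as `X` to an estimator's `fit()` raises an error, which is why the feature column is selected with double brackets.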

## 2. The Train-Test Split

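This is the **most important concept** in this notebook. To evaluate how well our model performs, we need to test it on data it has **never seen before**. A model scored on the same rows it was trained on can look better than it really is, because it may simply have memorized those rows (overfitting) rather than learned a pattern that generalizes.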

We do this by splitting our dataset into two parts:
* **Training Set:** The majority of the data, used to train the model.
* **Testing Set:** A smaller portion of the data, held back to test the trained model's performance.

**Analogy:** You wouldn't use the exact same questions to study for an exam and to take the final exam. The training set is your study material; the testing set is the final exam.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data: 80% for training, 20% for testing
# test_size=0.2 means 20% of the data will be for the test set
# By default the rows are shuffled before splitting; random_state fixes that
# shuffle so we get the same split every time we run the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")
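
To check what the split actually guarantees, the sketch below repeats it twice with the same `random_state` and inspects the row indices. It assumes the same sample data as above; the `_a` and `_b` suffixes are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same sample data as in the earlier cell
df = pd.DataFrame({
    'YearsExperience': [1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7],
    'Salary': [39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189],
})
X = df[['YearsExperience']]
y = df['Salary']

# Two splits with the same random_state
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 10 rows with test_size=0.2 -> 8 training rows and 2 test rows
print(len(X_train_a), len(X_test_a))  # 8 2

# Same random_state -> the same rows land in the test set
same_split = X_test_a.index.equals(X_test_b.index)
print(same_split)  # True

# No row appears in both the training set and the test set
overlap = set(X_train_a.index) & set(X_test_a.index)
print(overlap)  # set()

# Features and target stay aligned row by row after the split
aligned = X_train_a.index.equals(y_train_a.index)
print(aligned)  # True
```

With only two rows in the test set, any evaluation on this toy dataset is very noisy; real projects use far more data.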

## 3. The Scikit-Learn API Pattern

Scikit-Learn's beauty is its consistency. The workflow for using almost any model is the same three steps:

1.  **Instantiate:** Create an instance of the model object.
2.  **Fit:** Train the model on the training data. The model learns the relationship between `X_train` and `y_train`. This is done with the `.fit()` method.
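3.  **Predict:** Use the trained model to make predictions on new, unseen data (`X_test`). This is done with the `.predict()` method.

Most estimators also provide a `.score()` method, which evaluates the fitted model on data you pass in (for example `X_test` and `y_test`). For regressors such as `LinearRegression`, it returns the R² coefficient of determination.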

In [None]:
# We will use a Linear Regression model as our first example
from sklearn.linear_model import LinearRegression

# Step 1: Instantiate the model
model = LinearRegression()

# Step 2: Fit the model on the training data
model.fit(X_train, y_train)

print("Model training is complete!")
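
What "learning the relationship" means for a linear model can be inspected directly: after `fit()`, the learned slope and intercept are stored on the model as `coef_` and `intercept_`. The sketch below assumes the same data and split as above; `slope`, `intercept`, and `manual` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same sample data and split as in the earlier cells
df = pd.DataFrame({
    'YearsExperience': [1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7],
    'Salary': [39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189],
})
X = df[['YearsExperience']]
y = df['Salary']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Attributes ending in an underscore are set by fit()
slope = model.coef_[0]        # one coefficient per feature; we have one feature
intercept = model.intercept_

print(f"Salary ~= {intercept:.0f} + {slope:.0f} * YearsExperience")

# predict() simply evaluates that line for each row of X_test
manual = intercept + slope * X_test['YearsExperience'].to_numpy()
print(np.allclose(manual, model.predict(X_test)))  # True
```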

In [None]:
# Step 3: Make predictions on the test data
predictions = model.predict(X_test)

# Let's see the predictions vs. the actual values
print("--- Predictions ---")
print(predictions)

print("\n--- Actual Values ---")
print(y_test.values)
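
The introduction lists `score()` alongside `fit()` and `predict()`. As a minimal sketch, assuming the same data and split as above, the snippet below calls `score()` on the test set and checks that it matches `r2_score` from `sklearn.metrics` applied to the predictions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Same sample data and split as in the earlier cells
df = pd.DataFrame({
    'YearsExperience': [1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7],
    'Salary': [39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189],
})
X = df[['YearsExperience']]
y = df['Salary']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# For regressors, score() returns R^2 computed on the data you pass in
test_r2 = model.score(X_test, y_test)
train_r2 = model.score(X_train, y_train)

print(f"R^2 on training data: {train_r2:.3f}")
print(f"R^2 on test data:     {test_r2:.3f}")

# score() is a shortcut for predicting and then computing r2_score
print(np.isclose(test_r2, r2_score(y_test, predictions)))  # True
```

An R² of 1.0 means perfect predictions, and the value can be negative when the model does worse than always predicting the mean. Because the test set here holds only two rows, the test score should be read as a demonstration of the API rather than a reliable estimate.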

## ✅ What's Next?

Congratulations! You've just trained your first machine learning model. You now understand the fundamental workflow that applies to nearly all supervised learning tasks in Scikit-Learn.

In the next notebook, we will dive deeper into the algorithm we just used—**Linear Regression**—to understand how it works and how to evaluate its performance.