# Preprocessing Data for Machine Learning

## 1. Importance of Preprocessing
- Real-world data is rarely ready for machine learning.
- Most scikit-learn estimators require input features to be numeric and free of missing values.
- Preprocessing transforms raw data into a usable format.

## 2. Handling Categorical Features
- Categorical features (e.g., "genre", "color") must be converted to numeric form.
- This is typically done using **dummy variables** (also known as one-hot encoding).

## 3. Dummy Variables Explained
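- A categorical feature with `n` unique values can be represented by `n` binary columns.
- Each column represents a category: 1 if the observation belongs to it, otherwise 0.
- Keeping all `n` columns is redundant: any one column can be inferred from the others, which causes collinearity in linear models. This is known as the **dummy variable trap**.
  - To avoid it, drop one category (e.g., the first one, as `drop_first=True` does) and keep `n - 1` columns.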
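
As a small illustration, the sketch below encodes a made-up `genre` column and compares the column counts with and without dropping a category. The toy DataFrame and its three genre names are assumptions for demonstration, not the music dataset.

```python
import pandas as pd

# Hypothetical toy data with three unique genres
toy = pd.DataFrame({"genre": ["Rock", "Jazz", "Pop", "Rock", "Jazz"]})

# One binary column per category (n = 3 columns)
full = pd.get_dummies(toy["genre"])

# Drop the first category in sorted order ("Jazz"), leaving n - 1 = 2 columns
reduced = pd.get_dummies(toy["genre"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In the reduced frame, a row with zeros in every column stands for the dropped category, so no information is lost.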

## 4. Creating Dummy Variables in Python
- Use either:
  - `pandas.get_dummies()` (simpler and integrated into pandas)
  - `sklearn.preprocessing.OneHotEncoder()` (more customizable)
- Example with pandas:

  ```python
  import pandas as pd

  df = pd.read_csv('music.csv')
  dummies = pd.get_dummies(df['genre'], drop_first=True)
  df = pd.concat([df, dummies], axis=1)
  df = df.drop('genre', axis=1)
  ```


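* Alternatively, pass the whole DataFrame: every categorical (object or category dtype) column is encoded, and each new column is prefixed with the source column name (e.g., `genre_Rock`). This is convenient when `genre` is the only categorical column:

  ```python
  df = pd.get_dummies(df, drop_first=True)
  ```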


## 5. Use Case: Music Dataset

* Dataset includes a `genre` column with 10 categories.
* Convert `genre` to 9 binary features using `get_dummies(drop_first=True)`.

## 6. Linear Regression with Dummies

* After encoding, modeling steps remain the same.
* Example: Linear regression to predict song popularity.

  * Split data into training/test sets.
  * Use `KFold` for cross-validation.
  * Use `cross_val_score()` with `scoring='neg_mean_squared_error'`.


  ```python
  from sklearn.model_selection import cross_val_score, KFold
  from sklearn.linear_model import LinearRegression
  import numpy as np

  kf = KFold(n_splits=5, shuffle=True, random_state=1)
  model = LinearRegression()
  scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
  rmse_scores = np.sqrt(-scores)
  ```




In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
import numpy as np

# Load dataset
# load_boston was removed in scikit-learn 1.2 (importing it raises an ImportError);
# the California housing dataset is one of the alternatives that error suggests.
X, y = fetch_california_housing(return_X_y=True)

# Ridge Regression with cross-validation
ridge = Ridge(alpha=0.2)
scores = cross_val_score(ridge, X, y, cv=5,
                         scoring='neg_mean_squared_error')
rmse = np.sqrt(-scores)
print(f"RMSE scores: {rmse}")
print(f"Mean RMSE: {np.mean(rmse):.2f}")


In [None]:
import pandas as pd

music_dummies = pd.get_dummies(music_df, drop_first=True)

print("Shape of music_dummies: {}".format(music_dummies.shape))

In [None]:
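# Imports repeated here so the cell runs on its own
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Create X and y
X = music_dummies.drop("popularity", axis=1)
y = music_dummies["popularity"]

# Instantiate a ridge model
ridge = Ridge(alpha=0.2)

# Define the folds (assumed settings, mirroring the KFold example in the notes above)
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Perform cross-validation
scores = cross_val_score(ridge, X, y, cv=kf, scoring="neg_mean_squared_error")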

# Calculate RMSE
rmse = np.sqrt(-scores)
print("Average RMSE: {}".format(np.mean(rmse)))
print("Standard Deviation of the target array: {}".format(np.std(y)))

# Handling Missing Data in Machine Learning

## 1. What is Missing Data?
- Missing data occurs when a value for a feature is absent in a row.
- Causes include:
  - No recorded observation
  - Corrupt data
- Essential to handle missing data before modeling.

## 2. Identifying Missing Data
- Use Pandas to inspect missing values:

  ```python
  df.isna().sum().sort_values(ascending=False)
  ```


* In the music dataset, each feature had between 8 and 200 missing values.

## 3. Dropping Missing Data

* Common strategy: if a column is missing less than 5% of its values, drop the rows where that column is missing.
* Use `dropna()` with `subset`:

  ```python
  cols_to_drop = ['col1', 'col2']  # where missing values < 5%
  df_cleaned = df.dropna(subset=cols_to_drop)
  ```
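
The column list passed to `subset` can also be derived from the data rather than typed by hand. The sketch below is one way to do that, assuming a toy DataFrame (the column names and missing-value counts are made up) and the 5% threshold described above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 100 rows, "tempo" has 2 missing values, "loudness" has 30
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tempo": rng.normal(120, 10, 100),
    "loudness": rng.normal(-8, 2, 100),
})
df.loc[[3, 57], "tempo"] = np.nan
df.loc[10:39, "loudness"] = np.nan

# Fraction of missing values per column
missing_frac = df.isna().mean()

# Columns with some, but less than 5%, missing values
sparse_cols = missing_frac[(missing_frac > 0) & (missing_frac < 0.05)].index.tolist()

# Drop only the rows that are missing a value in those columns
df_cleaned = df.dropna(subset=sparse_cols)

print(sparse_cols)
print(df_cleaned.shape)
```

The heavily incomplete `loudness` column keeps its missing values here; those are candidates for imputation rather than row removal.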

## 4. Imputing Missing Values

* Imputation: filling in missing values with estimated data.
* Numeric columns:

  * Mean (default)
  * Median
* Categorical columns:

  * Most frequent value
* **Important:** Always split train/test sets **before** imputing to prevent data leakage.

## 5. Imputation with scikit-learn

* Use `SimpleImputer` from `sklearn.impute`:

  ```python
  from sklearn.impute import SimpleImputer

  # Categorical
  imp_cat = SimpleImputer(strategy='most_frequent')
  X_cat_train = imp_cat.fit_transform(X_cat_train)
  X_cat_test = imp_cat.transform(X_cat_test)

  # Numeric (mean is default)
  imp_num = SimpleImputer()
  X_num_train = imp_num.fit_transform(X_num_train)
  X_num_test = imp_num.transform(X_num_test)
  ```

* Combine processed features using `np.append`:

  ```python
  X_train_final = np.append(X_num_train, X_cat_train, axis=1)
  X_test_final = np.append(X_num_test, X_cat_test, axis=1)
  ```

## 6. Imputers as Transformers

* `SimpleImputer` is a **transformer**: it fits and transforms data.
* Transformers can be used in pipelines.
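
To make the fit/transform split concrete, here is a minimal sketch on made-up single-feature arrays: `fit` learns the statistic from the training data, and `transform` applies it to any data, including the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature arrays with missing entries
X_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer()  # strategy="mean" by default

# fit_transform() learns the training mean and fills the training NaNs with it
X_train_filled = imputer.fit_transform(X_train)

# transform() reuses the training mean; nothing is learned from the test set
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)    # mean learned from the training data only
print(X_test_filled.ravel())
```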

## 7. Imputation Within a Pipeline

* Use `Pipeline` from `sklearn.pipeline` for clean workflows:

  ```python
  from sklearn.impute import SimpleImputer
  from sklearn.pipeline import Pipeline
  from sklearn.linear_model import LogisticRegression

  pipeline = Pipeline([
      ('imputer', SimpleImputer(strategy='most_frequent')),
      ('model', LogisticRegression())
  ])
  ```


* Example workflow:

  1. Drop rows with missing values in columns where less than 5% of values are missing.
  2. Convert `genre` to binary: 1 if "Rock", else 0.

     import numpy as np
     y = np.where(df['genre'] == 'Rock', 1, 0)

  3. Split data and fit the pipeline:

     pipeline.fit(X_train, y_train)
     accuracy = pipeline.score(X_test, y_test)


* Each step in a pipeline except the last must be a transformer.
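
The workflow above can be sketched end to end on synthetic data. Everything about the data here (two numeric features, a target derived from their sum, roughly 10% of entries removed) is an assumption chosen so the example runs on its own; it is not the music dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical synthetic data: two numeric features, binary target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Remove about 10% of the entries to simulate missing values
mask = rng.random(X.shape) < 0.1
X[mask] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("imputer", SimpleImputer()),      # transformer: fills NaNs with column means
    ("model", LogisticRegression()),   # final step: the estimator
])

# The imputer is fitted on the training data only, inside fit()
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(accuracy)
```

Because the imputer lives inside the pipeline, the split-before-impute rule is enforced automatically: `score` only calls `transform` on the test features.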


In [None]:
# Print missing values for each column
print(music_df.isna().sum().sort_values())

# Drop rows with missing values in columns where less than 5% are missing
music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])

# Convert genre to a binary feature
music_df["genre"] = np.where(music_df["genre"] == "Rock", 1, 0)

print(music_df.isna().sum().sort_values())
print("Shape of the `music_df`: {}".format(music_df.shape))

In [None]:
# Import modules
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

# Instantiate an imputer
imputer = SimpleImputer()

# Instantiate a knn model
knn = KNeighborsClassifier(n_neighbors=3)

# Build steps for the pipeline
steps = [("imputer", imputer),
         ("knn", knn)]

# Create the pipeline from the steps
pipeline = Pipeline(steps)

# Centering and Scaling Data for Machine Learning

## 1. Centering and Scaling
- Data imputation is an essential preprocessing step.
- Centering and scaling are also critical for preparing data for machine learning models.

## 2. Why Scale Our Data?
- Features in datasets can have very different ranges:
  - `duration_ms`: 0 to 1.62 million
  - `speechiness`: decimal values
  - `loudness`: negative values
- Models using distance (e.g., KNN) can be biased by features with larger scales.
- Scaling ensures all features contribute equally to the model.

## 3. Scaling Techniques
- **Standardization**:
  - Subtract the mean, divide by the standard deviation.
  - Resulting data has a mean of 0 and a variance of 1.
- **Normalization**:
  - Subtract the minimum, divide by the range.
  - Transforms data to range between 0 and 1.
- **Rescaling to [-1, 1]** is another option.
- Standardization is the technique demonstrated in these notes.
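
The three techniques above reduce to one-line formulas. The sketch below applies them by hand with NumPy to a made-up array of durations; the values are assumptions for illustration only.

```python
import numpy as np

# Hypothetical feature values, e.g. song durations in milliseconds
x = np.array([120_000.0, 180_000.0, 240_000.0, 300_000.0])

# Standardization: subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()

# Normalization: subtract the minimum, divide by the range
normalized = (x - x.min()) / (x.max() - x.min())

# Rescaling to [-1, 1]: stretch the normalized values and shift them
centered = 2 * normalized - 1

print(standardized.mean(), standardized.std())
print(normalized)
print(centered)
```

`x.std()` uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.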

## 4. Scaling with scikit-learn
- Use `StandardScaler` from `sklearn.preprocessing`.
- Steps:
  1. Split data before scaling to prevent data leakage.
  2. Instantiate `StandardScaler`.
  3. Use `fit_transform()` on training features.
  4. Use `transform()` on test features.
  5. Verify transformation by checking means and std deviations.
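
The steps above can be sketched as follows. The data is synthetic (two uniformly distributed features whose ranges loosely imitate `duration_ms` and `speechiness`), so the names and ranges are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 1_620_000, 500),  # duration-like feature
    rng.uniform(0, 1, 500),          # speechiness-like feature
])
y = rng.integers(0, 2, 500)

# 1. Split first so the test set does not influence the scaler
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2-4. Fit on the training features only, then reuse the fitted scaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 5. Training columns have mean ~0 and std ~1; test columns are only close to that
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```

The test-set statistics are not exactly 0 and 1 because the scaler was fitted on the training data; that small mismatch is expected and is the price of avoiding leakage.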

## 5. Scaling in a Pipeline
- Scalers can be included in pipelines with models.
- Example:
  - Pipeline with `StandardScaler` + `KNeighborsClassifier(n_neighbors=6)`
  - Train-test split → Fit pipeline → Predict → Accuracy = 0.81

## 6. Comparing with Unscaled Data
- Model trained on unscaled data had accuracy = 0.53.
- Scaling improved accuracy from 0.53 to 0.81: a gain of 28 percentage points, or over 50% in relative terms.
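
The effect can be reproduced on synthetic data. This sketch does not use the music dataset and will not reproduce the accuracies quoted above; it assumes one informative small-scale feature and one pure-noise feature with a very large scale, which is a deliberately extreme case.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one informative feature, one noisy large-scale feature
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 600)
informative = y * 4 + rng.normal(size=600)       # separates the classes
noise = rng.normal(scale=100_000, size=600)      # no signal, huge range
X = np.column_stack([informative, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# Unscaled: distances are dominated by the noisy feature
knn_unscaled = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)
acc_unscaled = knn_unscaled.score(X_test, y_test)

# Scaled: both features contribute on a comparable scale
knn_scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=6)),
]).fit(X_train, y_train)
acc_scaled = knn_scaled.score(X_test, y_test)

print(acc_unscaled, acc_scaled)
```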

## 7. Cross-Validation with Scaling
- Build pipeline and define hyperparameter grid.
- Use dictionary with format: `step_name__param_name`.
  - Example: `'knn__n_neighbors': [3, 6, 9, 12]`
- Use `GridSearchCV` with pipeline and parameter grid.
- Fit on training data and predict on test set.

## 8. Checking Model Parameters
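- `best_score_` from `GridSearchCV` is the mean cross-validated score of the best parameter combination; here it shows a slight improvement over the earlier pipeline result.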
- `best_params_` shows optimal model uses 12 neighbors.
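
A minimal sketch of the grid search described in sections 7 and 8, assuming synthetic data from `make_classification` in place of the music features. The step names `scaler` and `knn` are choices made here; the grid keys must match whatever step names the pipeline uses.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic classification data
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Keys follow the pattern <step name>__<parameter name>
param_grid = {"knn__n_neighbors": [3, 6, 9, 12]}

grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)            # best value found in the grid
print(grid.best_score_)             # mean cross-validated accuracy for that value
print(grid.score(X_test, y_test))   # accuracy of the refitted best pipeline on the test set
```

Because the scaler is inside the pipeline, it is refitted on the training portion of every fold, so the cross-validation scores are free of leakage from the validation folds.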


In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Import Pipeline and Lasso
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso

# Create pipeline steps
steps = [("scaler", StandardScaler()),
         ("lasso", Lasso(alpha=0.5))]

# Instantiate the pipeline
pipeline = Pipeline(steps)
pipeline.fit(X_train, y_train)

# Calculate and print R-squared
print(pipeline.score(X_test, y_test))

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import numpy as np

# Build the steps
steps = [("scaler", StandardScaler()),
         ("logreg", LogisticRegression())]
pipeline = Pipeline(steps)

# Create the parameter space
parameters = {"logreg__C": np.linspace(0.001, 1.0, 20)}

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=21)

# Instantiate the grid search object
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training data
cv.fit(X_train, y_train)

# Print best score and parameters
print(cv.best_score_, "\n", cv.best_params_)

# Evaluating Multiple Models in Supervised Learning

## 1. Introduction
- We've now covered the full supervised learning workflow.
- Next step: deciding which model to use.

## 2. Different Models for Different Problems
- Choice depends on:
  - **Dataset size**: Smaller datasets often favor simpler models.
  - **Feature count**: Fewer features = simpler models, faster training.
  - **Data-hungry models**: e.g., Artificial Neural Networks need large datasets.
  - **Interpretability**: Linear models (e.g., Linear Regression) are easier to explain.
  - **Flexibility/Accuracy**: More flexible models (e.g., KNN) can yield better predictions by making fewer assumptions.

## 3. Evaluation Metrics
- scikit-learn provides a consistent API across models.
- **Regression models**:
  - Root Mean Squared Error (RMSE)
  - R-squared (R²)
- **Classification models**:
  - Accuracy
  - Confusion Matrix & associated metrics
  - ROC AUC
- Evaluate several models using the same metric before hyperparameter tuning.
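
For reference, each metric listed above is available in `sklearn.metrics`. The sketch below computes them on tiny hand-made arrays; the values are invented so the results can be checked by hand.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)

# Regression: hypothetical true values and predictions
y_true_reg = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_reg = np.array([2.0, 5.0, 8.0, 9.0])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
r2 = r2_score(y_true_reg, y_pred_reg)

# Classification: hypothetical labels, hard predictions, and predicted probabilities
y_true_clf = np.array([0, 0, 1, 1])
y_pred_clf = np.array([0, 1, 1, 1])
y_prob_clf = np.array([0.1, 0.6, 0.7, 0.9])
accuracy = accuracy_score(y_true_clf, y_pred_clf)
cm = confusion_matrix(y_true_clf, y_pred_clf)  # rows: true class, columns: predicted class
auc = roc_auc_score(y_true_clf, y_prob_clf)    # uses probabilities, not hard predictions

print(rmse, r2)
print(accuracy, auc)
print(cm)
```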

## 4. Scaling Considerations
- Some models are sensitive to feature scales:
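  - KNN, Linear Regression with regularization (Ridge, Lasso), Logistic Regression (regularized by default in scikit-learn)
- Scale features before comparing such models, fitting the scaler on the training data only.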

## 5. Model Evaluation Example
- Task: Binary classification of song genre.
- Models used:
  - K-Nearest Neighbors (KNN)
  - Logistic Regression
  - Decision Tree Classifier
- Steps:
  1. Import models
  2. Create feature and target arrays
  3. Split data into training and test sets
  4. Scale features using `fit_transform()` and `transform()`

## 6. Cross-Validation
- Create a dictionary of model instances.
- Initialize an empty list to store cross-validation results.
- Loop through each model:
  1. Instantiate `KFold`
  2. Perform cross-validation using `cross_val_score` (scoring defaults to accuracy)
  3. Append results to list
- Visualize results using a boxplot with model names as labels.

## 7. Visualizing Results
- Boxplot shows accuracy ranges from cross-validation.
- Median accuracy shown with an orange line.
- Logistic Regression has the best median score.

## 8. Evaluating on the Test Set
- Loop through model dictionary using `.items()`.
- Fit each model to the training set, evaluate on test set.
- Logistic Regression achieves the highest test accuracy.
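
The test-set loop described above might look like the following sketch. It assumes synthetic data from `make_classification` in place of the song features, so the ranking of models it prints says nothing about the results reported in these notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the song features
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale using statistics from the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
}

# Fit each model on the training set and score it on the held-out test set
test_scores = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    test_scores[name] = model.score(X_test_scaled, y_test)
    print("{} Test Set Accuracy: {}".format(name, test_scores[name]))
```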

## 9. Practice
- Time to choose and optimize models for your own supervised learning problems!


In [2]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import KFold, cross_val_score
import matplotlib.pyplot as plt

# X_train and y_train are assumed to come from an earlier train/test split
models = {"Linear Regression": LinearRegression(), "Ridge": Ridge(alpha=0.1), "Lasso": Lasso(alpha=0.1)}
results = []
# Loop through the models' values
for model in models.values():
    kf = KFold(n_splits=6, random_state=42, shuffle=True)

    # Perform cross-validation (scoring defaults to R-squared for regressors)
    cv_scores = cross_val_score(model, X_train, y_train, cv=kf)

    # Append the results
    results.append(cv_scores)
# Create a box plot of the results
plt.boxplot(results, labels=models.keys())
plt.show()

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score
import matplotlib.pyplot as plt

# Create models dictionary
models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree Classifier": DecisionTreeClassifier()
}
results = []

# Loop through the models' values
for model in models.values():
  
    # Instantiate a KFold object
    kf = KFold(n_splits=6, random_state=12, shuffle=True)
  
    # Perform cross-validation using scaled training features (X_train_scaled) and target (y_train)
    cv_results = cross_val_score(model, X_train_scaled, y_train, cv=kf)
  
    # Append results for plotting
    results.append(cv_results)

# Create box plot
plt.boxplot(results, labels=models.keys())
plt.title("Model Comparison with Cross-Validation")
plt.ylabel("Accuracy Score")
plt.xlabel("Models")
plt.show()

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

# Create steps
steps = [("imp_mean", SimpleImputer()), 
         ("scaler", StandardScaler()), 
         ("logreg", LogisticRegression())]

# Set up pipeline
pipeline = Pipeline(steps)

# Define hyperparameters to search
params = {
    "logreg__solver": ["newton-cg", "saga", "lbfgs"],
    "logreg__C": np.linspace(0.001, 1.0, 10)
}

# Create the GridSearchCV object
tuning = GridSearchCV(pipeline, param_grid=params, cv=5)
tuning.fit(X_train, y_train)

# Predict on test set
y_pred = tuning.predict(X_test)

# Compute and print performance
from sklearn.metrics import accuracy_score
test_accuracy = accuracy_score(y_test, y_pred)

print("Tuned Logistic Regression Parameters: {}, Accuracy: {:.4f}".format(
    tuning.best_params_, test_accuracy))