# scikit-learn 

**A note on this document**
This document is known as a Jupyter notebook; it allows text and executable code to coexist in a very easy-to-read format. Blocks can contain text or executable code. For blocks containing code, press `Shift + Enter`, `Ctrl+Enter`, or click the arrow on the block to run the code. Earlier blocks of code need to be run for the later blocks of code to work.

scikit-learn, often abbreviated as sklearn, is an open-source machine learning library for Python. It is a widely used and highly popular library for various machine learning tasks, including classification, regression, clustering, dimensionality reduction, model selection, and more. Scikit-learn is built on top of other popular Python libraries such as NumPy and SciPy and provides a consistent and user-friendly API for various machine learning algorithms and tools.

Scikit-learn is a valuable tool for both beginners and experienced data scientists and machine learning practitioners, as it simplifies the implementation of various machine learning models and algorithms and encourages good practices in machine learning.

## Pandas

`Pandas` is a popular Python library for data manipulation and analysis. It provides easy-to-use data structures and functions for working with structured data, making it an essential tool for data scientists and analysts. Here's an introductory tutorial on how to get started with Pandas.



### Data Frames

`pandas.DataFrame` is the primary data structure in Pandas. You can create a DataFrame from various data sources like dictionaries, lists, CSV files, Excel files, and more.

In [None]:
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "Mark", "Lisa", "Luke"],
    "Age": [25, 30, 35, 18, 45, 29],
}
df = pd.DataFrame(data)

We can quickly inspect the data in our DataFrame using various methods.


In [None]:
# df.info() provides a concise summary of the DataFrame.
df.info()

In [None]:
# df.shape returns the number of rows and columns.
df.shape

In [None]:
# df.head() displays the first 5 rows.
df.head()

In [None]:
# df.tail() shows the last 5 rows.
df.tail()

You can access specific data in your DataFrame using various indexing methods.

In [None]:
# Accessing a column: df['ColumnName'] or df.ColumnName
df.Name  # or df["Name"]

In [None]:
# Accessing rows by index: df.loc[row_index]
df.loc[0]
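
The indexing methods above can be combined with position-based access and boolean filters. The following is a minimal sketch; the name `people_demo` is a hypothetical copy of the DataFrame built earlier, used so that the example does not overwrite `df`.

```python
import pandas as pd

# Hypothetical copy of the earlier DataFrame, kept under a separate name
people_demo = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "Mark", "Lisa", "Luke"],
        "Age": [25, 30, 35, 18, 45, 29],
    }
)

# Label-based selection: row with index label 0, column "Age"
first_age = people_demo.loc[0, "Age"]

# Position-based selection: last row, first column
last_name = people_demo.iloc[-1, 0]

# Boolean mask: keep only the rows where Age is at least 30
thirty_plus = people_demo[people_demo["Age"] >= 30]

print(first_age)
print(last_name)
print(thirty_plus)
```

`loc` looks rows up by index label, while `iloc` looks them up by integer position; the two coincide here only because the default index is 0, 1, 2, and so on.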

### Series

`pandas.Series` is a class and data structure that is part of the Pandas library in Python. It represents a one-dimensional labeled array that can hold data of various types, similar to a column in a spreadsheet or a single-dimensional array. 

In practice, a Pandas Series can be thought of as a combination of a **NumPy array** (providing efficient numerical operations) and a **Python dictionary** (providing label-based access). You can use the labels to access and manipulate the data, and you can perform various data analysis tasks like filtering, aggregation, and visualization with ease.

Here's a quick tutorial on how to work with Pandas Series:

In [None]:
import pandas as pd

# We can create a Pandas Series from various data sources, such as lists, NumPy arrays, dictionaries, or even scalar values.
data = [10, 20, 30, 40, 50]
label = ["A", "B", "C", "D", "E"]
series = pd.Series(data, index=label)
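
As a small illustration of the other data sources mentioned in the comment above, the sketch below builds a Series from a dictionary and from a scalar. The names `from_dict`, `from_scalar`, and `total_demo` are hypothetical and chosen only for this example.

```python
import pandas as pd

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"A": 10, "B": 20, "C": 30})

# From a scalar: the value is repeated for every index label
from_scalar = pd.Series(5, index=["A", "B", "C"])

# Arithmetic between two Series aligns on labels, not on position
total_demo = from_dict + from_scalar
print(total_demo)
```

The label alignment in the last step is the dictionary-like side of a Series, while the element-wise addition is the NumPy-like side.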

In [None]:
# We can access elements in a Series using indexing, just like with lists.
print(series.iloc[1])  # Access data by position

In [None]:
# We can access elements in a Series using custom indexing
print(series["B"])  # Access data by label

In [None]:
# We can perform mathematical operations on Series, and the operations are applied element-wise.
print(series * 2)  # Perform arithmetic operations

In [None]:
# We can filter and slice a Series based on conditions or positions.
print(series[series > 40])  # Filter elements greater than 40
print(series.iloc[1:4])  # Slice by position: the second to the fourth element

In [None]:
# Pandas Series support various operations like addition, subtraction, and more. You can also perform aggregation operations like sum, mean, and max.
print(series.sum())
print(series.mean())
print(series.max())

## Linear Regression with scikit-learn

Let's revisit the linear regression problem we explored earlier this semester, but this time we will use the `scikit-learn` library. We will use the following linear equation for our regression task:
$$y = 1 + 2x + \zeta$$
In this equation, we have a weight vector $\mathbf{w} = [1 \quad 2]^\top$, and $\zeta$ represents Gaussian noise.


In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt

# Generate some sample data
np.random.seed(0)
X = np.random.rand(100, 1)  # Independent variable
y = 2 * X + 1 + 0.3 * np.random.randn(100, 1)  # Dependent variable with Gaussian noise

The `train_test_split` function in scikit-learn splits a dataset into a **training set** and a **testing (or validation) set**. Holding out a test set is a crucial step in machine learning: it lets us evaluate the model on data it never saw during training, which gives a more honest estimate of performance and helps avoid data leakage.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"The sample data size is {len(X)}")
print(f"The training data size is {len(X_train)}")
print(f"The testing data size is {len(X_test)}")

# Plot the training and testing data
plt.figure(figsize=(5, 3))
plt.scatter(X_train, y_train, label="Training Data", color="blue")
plt.scatter(X_test, y_test, label="Test Data", color="green")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.grid(True)
plt.show()

`sklearn.linear_model` is a module within scikit-learn. It provides classes and functions for linear modeling, including linear regression, logistic regression, and other linear-based models.

The `fit` method in machine learning is used to train a machine learning model on a given dataset. It is one of the fundamental steps in the machine learning workflow where the model learns patterns and relationships in the data to make predictions or classifications. 

In [None]:
from sklearn.linear_model import LinearRegression

# Create a linear regression model
linear_reg = LinearRegression()

# Fit the model to the training data
linear_reg.fit(X_train, y_train)

# Print the model's coefficients (w1, w2, ...) and intercept (w0)
print("Coefficients:", linear_reg.coef_)
print("Intercept:", linear_reg.intercept_)

It is important to note that the `LinearRegression` model in scikit-learn does not have the same hyperparameters related to _iterations_ or _learning rates_ as iterative optimization-based models like gradient descent. Scikit-learn's `LinearRegression` uses a _closed-form solution_, so there's no concept of adjusting the number of iterations or learning rates.
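
To make the closed-form idea concrete, here is a minimal sketch that solves the same least-squares problem directly with NumPy and compares the result with `LinearRegression`. The synthetic data and the names `x_demo`, `y_demo`, `A_demo`, `w_demo`, and `demo_model` are assumptions made for this illustration; they are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data following y = 1 + 2x + noise
rng_demo = np.random.default_rng(0)
x_demo = rng_demo.random((100, 1))
y_demo = 1 + 2 * x_demo + 0.1 * rng_demo.standard_normal((100, 1))

# Design matrix with a leading column of ones for the intercept w0
A_demo = np.hstack([np.ones_like(x_demo), x_demo])

# Least-squares solution of A w = y, computed in one step without iterations
w_demo, *_ = np.linalg.lstsq(A_demo, y_demo, rcond=None)

demo_model = LinearRegression().fit(x_demo, y_demo)
print("lstsq  :", w_demo.ravel())
print("sklearn:", demo_model.intercept_, demo_model.coef_.ravel())
```

Both approaches should agree up to floating-point error, since neither depends on a learning rate or an iteration count.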

In [None]:
# Plot the training data and the regression line
plt.figure(figsize=(5, 3))
plt.scatter(X_train, y_train, color="blue", label="Training Data")
plt.plot(X, linear_reg.predict(X), color="red", linewidth=1.5, label="Regression Line")
plt.xlabel("X")
plt.ylabel("y")
plt.grid(True)
plt.legend()
plt.show()

The `model.predict` method is used to make predictions for the **test data** or infer target values based on input features. It takes an array-like input of feature values and returns predictions for the corresponding target values.

In [None]:
# Make predictions on the test data
y_pred = linear_reg.predict(X_test)

# Plot the testing data and the regression line
plt.figure(figsize=(5, 3))
plt.scatter(X_test, y_test, label="Test Data", color="green")
plt.plot(X, linear_reg.predict(X), color="red", linewidth=1.5, label="Regression Line")
plt.xlabel("X")
plt.ylabel("y")
plt.grid(True)
plt.legend()
plt.show()

The R-squared value, also known as the coefficient of determination, is a statistical measure used to evaluate the goodness of fit of a regression model to a dataset. It quantifies the proportion of the variance in the dependent variable (target) that is explained by the independent variables (features) in the model. _**In other words, it assesses how well the regression model captures the variation in the data.**_

The R-squared value, denoted as $R^2$, is calculated using the following equation:

$$ R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST}$$

where $SSR$ is the sum of squares for regression, $SSE$ is the sum of squared errors, and $SST$ is the total sum of squares. The terms $SSR$, $SSE$ and $SST$ are defined as follows:

$$ SSR = \sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2 $$

$$ SSE = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 $$

$$ SST = \sum_{i=1}^{N}(y_i - \bar{y})^2 $$

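The R-squared can be interpreted as the proportion of the total variation in the dependent variable that is explained by the regression model. _**For a least-squares fit with an intercept, evaluated on the training data, it ranges from 0 (no explanatory power) to 1 (perfect explanation).**_ A higher R-squared value indicates a better fit of the model to the data, as it implies that a larger proportion of the variance is explained by the model. Note that scikit-learn's `score` method computes $R^2 = 1 - SSE/SST$, which can be negative on held-out data when the model predicts worse than the mean of the targets.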
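
The definitions above can be checked by hand on a few numbers. The sketch below uses the form $(SST - SSE)/SST = 1 - SSE/SST$ and compares it with `sklearn.metrics.r2_score`; the arrays `y_true_demo` and `y_hat_demo` are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_hat_demo = np.array([2.5, 5.5, 7.0, 9.5])

# SSE: squared differences between targets and predictions
sse_demo = np.sum((y_true_demo - y_hat_demo) ** 2)

# SST: squared differences between targets and their mean
sst_demo = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_manual = 1 - sse_demo / sst_demo
print(r2_manual)
print(r2_score(y_true_demo, y_hat_demo))
```

Here $SSE = 0.75$ and $SST = 20$, so both lines should print a value of $1 - 0.75/20 = 0.9625$ up to floating-point rounding.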



In [None]:
# Calculate the model's R-squared score
r_squared = linear_reg.score(X_test, y_test)
print("R-squared:", r_squared)

Cross-validation is a critical technique for evaluating the performance of machine learning models. `Scikit-learn` provides a convenient way to perform cross-validation using its `cross_val_score` function. This function helps you compute scores for a specific metric (e.g., R-squared) on different subsets of your dataset. Here's an example of how to perform cross-validation using scikit-learn:

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# Define the cross-validation method. In this case, we use 5-fold cross-validation.
kf = KFold(n_splits=5, shuffle=True, random_state=25)

# Perform cross-validation and calculate the scores
scores = cross_val_score(linear_reg, X, y, cv=kf, scoring="r2")

# Print the cross-validation scores and their mean
print("Cross-validation R^2 scores:", scores)
print("Mean R^2:", np.mean(scores))
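
To see what `cross_val_score` does internally, the following sketch writes the fold loop out by hand and compares the scores. It is an illustration under stated assumptions: the data (`X_cv`, `y_cv`) is synthetic, and names such as `kf_demo`, `manual_scores`, and `auto_scores` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data following y = 1 + 2x + noise
rng_cv = np.random.default_rng(1)
X_cv = rng_cv.random((50, 1))
y_cv = 1 + 2 * X_cv.ravel() + 0.1 * rng_cv.standard_normal(50)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop: fit on four folds, score R^2 on the held-out fold
manual_scores = []
for train_idx, test_idx in kf_demo.split(X_cv):
    fold_model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))

# The helper function performs the same procedure in one call
auto_scores = cross_val_score(
    LinearRegression(), X_cv, y_cv, cv=kf_demo, scoring="r2"
)
print(np.allclose(manual_scores, auto_scores))
```

Because `random_state` is a fixed integer, both the manual loop and `cross_val_score` see the same five splits, so the scores should match.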

## Deliverable 1

Use the `scikit-learn` library for the `Falling Body` example.

In [None]:
import numpy as np
import pandas as pd

y0 = 60
v0 = 20
g = -9.8067

data = pd.read_csv("./data/falling_body.csv")
t = data["time"]
y = data["y"]

plt.figure(figsize=(5, 4))
plt.plot(t, y, "o")
plt.xlabel("time, sec")
plt.ylabel("height (y), m", rotation=90)
plt.grid(True)

In [None]:
X = np.column_stack([np.ones_like(t), t, t**2])
print(X.shape)
print(X)

`TODO`: Split the dataset into two parts: a training set and a testing (or validation) set.

In [None]:
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets

# print the sample data, training data, and testing data sizes.

# Plot the training and testing data



`TODO`: Train the model on the training data and print the optimal coefficients.

In [None]:
from sklearn.linear_model import LinearRegression
# Fit the model to the training data and find the optimal coefficients.
# Create a linear regression model

# Fit the model to the training data

# Print the model's coefficients and intercept


`TODO`: Plot the testing data and the regression line

In [None]:
# Plot the testing data and the regression line



`TODO`: Calculate the model's R-squared score

In [None]:
# Calculate the model's R-squared score



`TODO`: Use 5-fold cross-validation to find $R^2$ scores and their average.

In [None]:
# Use 5-fold cross-validation to find $R^2$ scores and their average.
# Define the cross-validation method. In this case, we use 5-fold cross-validation.
# Perform cross-validation and calculate the scores
# Print the cross-validation scores and their mean


## Logistic Regression with scikit-learn

Let's revisit the logistic regression binary classification problem where we want to predict whether a student will pass (1) or fail (0) an exam based on the number of hours they studied and the number of hours they slept.


In [None]:
import pandas as pd


def read_exam_data():
    # Specify the file name
    file_name = "./data/exam_data.csv"

    # Read the CSV file into a DataFrame
    return pd.read_csv(file_name)


exam_df = read_exam_data()
print(exam_df)

# Separate the data into Passed (Y=1) and Failed (Y=0)
passed = exam_df[exam_df["Passed Exam (Y)"] == 1]
failed = exam_df[exam_df["Passed Exam (Y)"] == 0]

# Create a scatter plot
plt.figure(figsize=(7, 4))
plt.scatter(
    passed["Hours Studied (X1)"],
    passed["Hours Slept (X2)"],
    label="Passed",
    color="r",
    marker="o",
)

plt.scatter(
    failed["Hours Studied (X1)"],
    failed["Hours Slept (X2)"],
    label="Failed",
    color="b",
    marker="o",
)
plt.xlabel("Hours Studied (X1)")
plt.ylabel("Hours Slept (X2)")
plt.title("Exam Data Scatter Plot")
plt.legend()
plt.grid(True)
plt.show()

As we did before, we will use the `train_test_split` function in scikit-learn for splitting a dataset into two or more parts.

In [None]:
# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split

x1 = exam_df["Hours Studied (X1)"]
x2 = exam_df["Hours Slept (X2)"]
y = exam_df["Passed Exam (Y)"]

X = np.column_stack([x1, x2])

print(X.shape)
print(y.shape)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

We use the `fit` method for logistic regression to train a machine learning model on a given dataset. 

We can control the number of solver iterations and the regularization strength for logistic regression in scikit-learn. Scikit-learn's `LogisticRegression` uses the `'lbfgs'` solver by default (since version 0.22). The `max_iter` parameter sets the maximum number of solver iterations, and `C` is the inverse of the regularization strength (smaller values mean stronger regularization). There is no learning-rate parameter.

We can set these parameters explicitly like this.

```
model = LogisticRegression(solver='lbfgs', max_iter=100, C=1.0)  # default values
```

We will use the default settings without specifying any function arguments.
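
In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` shrinks the coefficients more. The sketch below illustrates this on synthetic data; the data-generating rule and the names `X_lr`, `y_lr`, `weak_reg`, and `strong_reg` are assumptions made for this example and are unrelated to the exam dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: class 1 when a noisy linear score is positive
rng_lr = np.random.default_rng(0)
X_lr = rng_lr.standard_normal((200, 2))
noise_lr = 0.5 * rng_lr.standard_normal(200)
y_lr = (X_lr[:, 0] + X_lr[:, 1] + noise_lr > 0).astype(int)

# Large C: weak regularization. Small C: strong regularization.
weak_reg = LogisticRegression(C=100.0, max_iter=1000).fit(X_lr, y_lr)
strong_reg = LogisticRegression(C=0.01, max_iter=1000).fit(X_lr, y_lr)

print("C=100 :", np.linalg.norm(weak_reg.coef_))
print("C=0.01:", np.linalg.norm(strong_reg.coef_))
```

The coefficient norm for `C=0.01` should come out smaller than for `C=100`, which is the effect of the stronger L2 penalty.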

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, auc

# Create a logistic regression model
logistic_reg = LogisticRegression()

# Train the model on the training data
logistic_reg.fit(X_train, y_train)

# Print the model's coefficients and intercept
print("Coefficients:", logistic_reg.coef_)
print("Intercept:", logistic_reg.intercept_)

The `predict_proba` method estimates the probability of each class label for a given input, which lets you make more informed decisions in binary and multiclass classification tasks. It returns a probability distribution over the classes as a 2D array with one row per sample and one column per class. In binary classification the array has two columns: column 0 holds the probability of class 0 and column 1 the probability of class 1. The probabilities in each row sum to 1.0. In multiclass classification, the method returns one column for each class.
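
The shape and row-sum properties described above can be verified on a tiny example. This is a minimal sketch on made-up one-dimensional data; `X_pp`, `y_pp`, `clf_pp`, and `proba_pp` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: small values are class 0, large values are class 1
X_pp = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y_pp = np.array([0, 0, 0, 1, 1, 1])

clf_pp = LogisticRegression().fit(X_pp, y_pp)
proba_pp = clf_pp.predict_proba(X_pp)

print(proba_pp.shape)        # (n_samples, n_classes)
print(proba_pp.sum(axis=1))  # each row sums to 1

# predict() returns the class with the highest probability in each row
labels_from_proba = clf_pp.classes_[proba_pp.argmax(axis=1)]
print(np.array_equal(clf_pp.predict(X_pp), labels_from_proba))
```

The columns of `predict_proba` follow the order of `clf_pp.classes_`, which is why column 1 corresponds to class 1 here.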

In [None]:
# Get predicted probabilities for every class; column 1 is the positive class (class 1)
probs = logistic_reg.predict_proba(X_test)

for x, prob in zip(X_test, probs[:, 1]):
    print(f"study hours: {x[0]}, sleep hours: {x[1]}, probability to pass: {prob}")

In [None]:
import matplotlib.pyplot as plt

# Separate the data into Passed (Y=1) and Failed (Y=0)
threshold = 0.5
testdata_passed = X_test[probs[:, 1] >= threshold]
testdata_failed = X_test[probs[:, 1] < threshold]

traindata_passed = X_train[logistic_reg.predict_proba(X_train)[:, 1] >= threshold]
traindata_failed = X_train[logistic_reg.predict_proba(X_train)[:, 1] < threshold]

# Create a scatter plot
plt.figure(figsize=(7, 4))
plt.scatter(
    passed["Hours Studied (X1)"],
    passed["Hours Slept (X2)"],
    label="Passed",
    color="r",
    marker="o",
)

plt.scatter(
    failed["Hours Studied (X1)"],
    failed["Hours Slept (X2)"],
    label="Failed",
    color="b",
    marker="o",
)


plt.xlabel("Hours Studied (X1)")
plt.ylabel("Hours Slept (X2)")
plt.title("Exam Data")
plt.legend()
plt.grid(True)
plt.show()


plt.figure(figsize=(7, 4))

plt.scatter(
    traindata_passed[:, 0],
    traindata_passed[:, 1],
    label="Train Data Passed",
    color="r",
    marker="o",
)

plt.scatter(
    traindata_failed[:, 0],
    traindata_failed[:, 1],
    label="Train Data Failed",
    color="b",
    marker="o",
)

plt.scatter(
    testdata_passed[:, 0],
    testdata_passed[:, 1],
    label="Test Data Passed",
    color="r",
    marker="x",
)

plt.scatter(
    testdata_failed[:, 0],
    testdata_failed[:, 1],
    label="Test Data Failed",
    color="b",
    marker="x",
)


plt.xlabel("Hours Studied (X1)")
plt.ylabel("Hours Slept (X2)")
plt.title("Train & Test Result")
plt.legend()
plt.grid(True)
plt.show()

Let's use the trained model to make predictions for the testing data.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = logistic_reg.predict(X_test)

# confusion matrix
confusion = confusion_matrix(y_test, y_pred)

# classification report
classification_rep = classification_report(y_test, y_pred)

print("Confusion Matrix:\n", confusion)
print("Classification Report:\n", classification_rep)
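
As a reading aid for the confusion matrix, the sketch below unpacks its four cells on a small made-up example and derives precision, recall, and accuracy from them. The label lists `y_true_cm` and `y_pred_cm` are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true_cm = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_cm = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_cm, y_pred_cm).ravel()

precision_cm = tp / (tp + fp)  # of the predicted positives, how many are correct
recall_cm = tp / (tp + fn)     # of the actual positives, how many are found
accuracy_cm = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(precision_cm, recall_cm, accuracy_cm)
```

In this example there are 3 true negatives, 1 false positive, 1 false negative, and 3 true positives, so precision, recall, and accuracy all equal 0.75.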

Cross-validation is a resampling procedure used in machine learning and model evaluation to assess how well a predictive model will perform on an independent dataset. It's a critical step to ensure that a machine learning model's performance is not overly optimistic or pessimistic and to detect issues like overfitting or underfitting.

In [None]:
# Define the cross-validation method. In this case, we use 3-fold cross-validation.
kf = KFold(n_splits=3)

# Perform cross-validation and calculate the scores
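# Accuracy is used here because R^2 is a regression metric and is not meaningful for class labels.
scores = cross_val_score(logistic_reg, X, y, cv=kf, scoring="accuracy")

# Print the cross-validation scores and their mean
print("Cross-validation accuracy scores:", scores)
print("Mean accuracy:", np.mean(scores))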

The Receiver Operating Characteristic (ROC) curve is a graphical representation commonly used to evaluate the performance of binary classification models, such as logistic regression, support vector machines, or decision trees. The ROC curve is a tool for visualizing the trade-off between a model's true positive rate (sensitivity) and its false positive rate (1-specificity) across different classification thresholds.
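
To see how the curve is built from thresholds, here is a minimal sketch on four made-up scores. The arrays `y_true_roc` and `scores_roc` are hypothetical, and the other names are chosen so that they do not clash with the variables used in the notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted scores for the positive class
y_true_roc = np.array([0, 0, 1, 1])
scores_roc = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold yields one (FPR, TPR) point on the curve
fpr_demo, tpr_demo, thr_demo = roc_curve(y_true_roc, scores_roc)
for fpr_i, tpr_i, thr_i in zip(fpr_demo, tpr_demo, thr_demo):
    print(f"threshold={thr_i:.2f}  FPR={fpr_i:.2f}  TPR={tpr_i:.2f}")

auc_demo = roc_auc_score(y_true_roc, scores_roc)
print("AUC:", auc_demo)
```

The AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. Here three of the four pairs are ranked correctly, so the AUC is 0.75.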

In [None]:
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])

# Calculate the AUC (Area Under the ROC Curve)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure(figsize=(5, 3))
plt.plot(fpr, tpr, color="darkorange", lw=2, label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend(loc="lower right")
plt.show()

## Deliverable 2

The scikit-learn breast cancer dataset is a commonly used dataset in machine learning for binary classification tasks, particularly for practicing and demonstrating various machine learning algorithms and techniques. It is included as part of the scikit-learn library in Python, which is a popular machine learning library.

This dataset is often referred to as the "Breast Cancer Wisconsin (Diagnostic) dataset" or simply the "breast cancer dataset." It contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The task associated with this dataset is to predict whether a breast tumor is malignant (cancerous) or benign (non-cancerous) based on these features.

You can load this dataset in scikit-learn using the following code:

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn import datasets

# Load the Breast Cancer Wisconsin (Diagnostic) dataset
cancer = datasets.load_breast_cancer()

# Extract features (X) and target (y) from the dataset
X = cancer.data
y = cancer.target

cancer_df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)

# Add the target variable as a column
cancer_df["target"] = cancer.target

cancer_df.head()

`TODO`: Split the dataset into two parts: a training set and a testing (or validation) set.

In [None]:
# Split the dataset into two parts: a training set and a testing (or validation) set.
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Split the data into training and testing sets




`TODO`: Fit the model to the training data

In [None]:
# Fit the model to the training data

# Create a logistic regression model

# Train the model on the training data

# With unscaled features, the solver is expected to hit max_iter and emit a ConvergenceWarning.
# Follow the link in the warning message to learn more about preprocessing.



In scikit-learn, the `StandardScaler` is a preprocessing technique used to standardize the features of a dataset, meaning it scales the features to have a mean of 0 and a standard deviation of 1. This is often referred to as `z-score normalization` or `standardization`. The process of standardization helps ensure that all features have the same scale and can prevent certain features from dominating the model's learning process.

The `.fit()` method of the StandardScaler computes the mean and standard deviation for each feature in the dataset based on the provided data. The mean and standard deviation are used to standardize the data when the `.transform()` method is called.
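
The following sketch shows `fit` and `transform` on a tiny made-up array and checks the result against the z-score formula computed by hand. The array `X_raw_demo` and the other `_demo` names are hypothetical and separate from the breast cancer data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
X_raw_demo = np.array(
    [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
)

# fit() learns the per-column mean and standard deviation
scaler_demo = StandardScaler().fit(X_raw_demo)

# transform() applies z = (x - mean) / std to each column
X_std_demo = scaler_demo.transform(X_raw_demo)

# Same result by hand: subtract the column mean, divide by the column std
X_manual_demo = (X_raw_demo - X_raw_demo.mean(axis=0)) / X_raw_demo.std(axis=0)

print(scaler_demo.mean_)
print(X_std_demo.mean(axis=0), X_std_demo.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, even though the raw features differ by a factor of 100.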

`TODO`: Split the `scaled` dataset 

In [None]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X)

X_scaled = scaler.transform(X)

# Split the scaled dataset
# Split the data into training and testing sets



`TODO`: Train the model on the training data and print the optimal coefficients.

In [None]:
# Fit the model to the training data and find the optimal coefficients.
# Create a logistic regression model

# Train the model on the training data

# Print the model's coefficients and intercept




`TODO`: Make predictions for the testing data and print the confusion matrix and classification report.

In [None]:
# Make predictions for the testing data and print the confusion matrix and classification report.
# Make predictions on the test data


# Calculate accuracy and display results



`TODO`: Use 5-fold cross-validation to find accuracy scores and their average.

In [None]:
# Use 5-fold cross-validation to find accuracy scores and their average.

# Perform cross-validation and calculate the scores


# Print the cross-validation scores and their mean




`TODO`: Plot the ROC curve

In [None]:
# Plot the ROC curve


# Get predicted probabilities for the positive class (class 1)


# Compute ROC curve

# Calculate the AUC (Area Under the ROC Curve)


# Plot the ROC curve


