# Scikit-learn Basics and Usage

- Overview:
    - Scikit-learn is one of the most popular and powerful libraries for machine learning in Python. It provides a wide range of tools for data mining, data analysis, and machine learning tasks. Scikit-learn is designed to be simple and efficient, making it accessible for both beginners and experienced practitioners.

- Importance:
    - Understanding the basics of Scikit-learn is crucial for anyone interested in applying machine learning to real-world problems. The library covers a variety of tasks, including classification, regression, clustering, dimensionality reduction, and model evaluation.

## Installing Scikit-learn
To install Scikit-learn, use the following command:
```
pip install scikit-learn
```
You can also install it via conda if you're using Anaconda.

##### NOTE:

- Inside a Jupyter notebook cell (including notebooks launched from Anaconda), put a "!" in front of `pip` so that the line is run as a shell command rather than as Python code. In a terminal or Anaconda Prompt, leave the "!" out.
&nbsp;

    - In a notebook cell
      ```
          !pip install scikit-learn
                 or
          !pip3 install scikit-learn
      ```
      &nbsp;
    - In a terminal or Anaconda Prompt
      ```
          pip install scikit-learn
                 or
          pip3 install scikit-learn
      ```

In [None]:
!pip install scikit-learn

## Core Concepts

- Overview:

    - Scikit-learn revolves around several core concepts that you need to understand to use the library effectively. These include `data preparation`, `estimators`, `transformers`, and `pipelines`. Each of these concepts plays a vital role in building and deploying machine learning models.

    - `Data Preparation:` Before feeding your data into a machine learning model, it typically needs to be preprocessed. This can involve splitting the data into training and testing sets, scaling the features, and handling missing values.

    - `Estimators:` Estimators in Scikit-learn are objects that can be trained on data. They include models like classifiers (for classification tasks) and regressors (for regression tasks). Once trained, an estimator can make predictions based on new data.

    - `Transformers:` Transformers are used to preprocess data before it is fed into an estimator. Common transformations include scaling, normalization, and feature extraction. A transformer learns its parameters with `fit` and applies them with `transform`; `fit_transform` does both in one call.
    
    - `Pipelines:` Pipelines allow you to chain together multiple processing steps, including transformers and estimators, into a single sequence. This ensures that all steps are executed in the correct order and makes your workflow more reproducible.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

### Data Preparation

- Overview:

    - Data preparation is a critical step in any machine learning project. It involves splitting the data into training and testing sets, scaling the features, and sometimes performing additional preprocessing steps like encoding categorical variables or handling missing data.

    - Data Splitting: The data is usually split into training and testing sets to evaluate the model's performance on unseen data. The `train_test_split` function in Scikit-learn is commonly used for this purpose. A typical split might allocate 80% of the data to training and 20% to testing.

    - Scaling: Feature scaling puts all features on a comparable scale so that features with large numeric ranges do not dominate the model. Distance-based and gradient-based algorithms in particular tend to perform better on scaled inputs. The `StandardScaler` is a common tool that standardizes each feature by subtracting its mean and dividing by its standard deviation.

- Examples:
    - In this example, we split a small dataset into training and testing sets and then scale the features using `StandardScaler`. This ensures that the model will train on data where all features are on a similar scale, leading to better performance.

In [None]:
# Sample data
X = np.array([[2, 3], [4, 6], [5, 7], [8, 8], [10, 12], [12, 14], [14, 16], [16, 18], [18, 20], [20, 22]])
y = np.array([1, 2, 1, 3, 2, 3, 1, 2, 3, 2])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training data: \n{X_train}")
print(f"\nTesting data: \n{X_test}")

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"\nScaled training data: \n{X_train_scaled}")
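
The scaling step above can be checked by hand. The sketch below uses a small made-up array (not the notebook's data) to compare `StandardScaler` with the formula `(x - mean) / std` computed column by column in NumPy. It also shows that the test set is transformed with the statistics learned from the training set. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test_demo = np.array([[2.5, 250.0]])

demo_scaler = StandardScaler()
X_train_demo_scaled = demo_scaler.fit_transform(X_train_demo)

# Manual standardization with the population standard deviation (ddof=0),
# which is what StandardScaler uses
manual = (X_train_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)
print(np.allclose(X_train_demo_scaled, manual))

# The test point is scaled with the *training* mean and standard deviation
X_test_demo_scaled = demo_scaler.transform(X_test_demo)
print(X_test_demo_scaled)
```

Because the test point here sits exactly at the training mean of each column, it maps to zero in both features.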

### Estimators and Transformers

*Estimators are objects that can be trained on data (e.g., classifiers, regressors), while transformers are used to preprocess data.*

- Overview:
    - Estimators and transformers are fundamental components in Scikit-learn. Estimators are used to fit models to data, while transformers are used to preprocess data before feeding it into an estimator.

    - Estimators: These include classifiers and regressors, which are used for tasks like predicting a class label or a continuous value. The `fit()` method is used to train the estimator on the training data, and the `predict()` method is used to make predictions on new data.

    - Transformers: These are used to transform data in some way, such as scaling, normalizing, or encoding categorical variables. The `fit_transform()` method is commonly used, where `fit()` learns the parameters from the training data, and `transform()` applies the transformation.

- Examples:
    - We demonstrate how to use a LogisticRegression model as an estimator, which is trained on scaled data. The coefficients and intercept of the model are output, showing how the model has learned from the training data. The transformer (`StandardScaler`) is used to scale new data before making predictions.

In [None]:
# Logistic Regression
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
print(f"Model coefficients: \n{model.coef_}")
print(f"\nModel intercept: \n{model.intercept_}")

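# Scale new data with the scaler fitted on the training set, then predict
new_data = np.array([[5, 6], [9, 11]])
scaled_new_data = scaler.transform(new_data)
print(f"\nScaled new data: \n{scaled_new_data}")
print(f"\nPredictions for new data: \n{model.predict(scaled_new_data)}")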
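
Beyond `predict()`, classifiers such as `LogisticRegression` also expose `predict_proba()`. As a rough sketch on the bundled Iris data (an assumption made here so the example is self-contained, rather than the toy arrays above), the predicted label is the class with the highest estimated probability:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_demo, y_demo)

# Hard labels and per-class probabilities for the first five samples
labels = clf.predict(X_demo[:5])
proba = clf.predict_proba(X_demo[:5])
print(labels)
print(np.round(proba, 3))

# Picking the most probable class reproduces predict()
labels_from_proba = clf.classes_[proba.argmax(axis=1)]
print(np.array_equal(labels, labels_from_proba))
```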

### Pipelines

*Pipelines allow you to chain multiple processing steps together for ease of use and reproducibility.*

- Overview:
    - Pipelines are a powerful feature in Scikit-learn that allow you to chain together multiple processing steps into a single object. This ensures that the entire sequence of steps is applied consistently to the data. Pipelines are particularly useful when you want to standardize your machine learning workflow and make it more reproducible.

    - Components of a Pipeline: A typical pipeline includes transformers (for data preprocessing) and a final estimator (for modeling). Calling `fit()` fits each transformer on the training data and then trains the estimator; calling `predict()` applies the already-fitted transformers to the new data before predicting.

- Examples:
    - In this example, a pipeline is created to scale the data using StandardScaler and then apply a Support Vector Machine (SVM) for classification. The pipeline is trained on the training data, and predictions are made on the test data, showing how a pipeline streamlines the process.

In [None]:
# SVM with Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(f"SVM Predictions: \n{predictions}")
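
To see what the pipeline does under the hood, the following sketch fits the same two steps by hand and checks that the predictions match. It assumes the bundled Iris data for a self-contained example, and the step names `'scaler'` and `'svm'` mirror the cell above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Pipeline version: scaling and classification in one object
pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Manual version: fit the scaler on training data, reuse it on test data
manual_scaler = StandardScaler()
manual_svm = SVC()
manual_svm.fit(manual_scaler.fit_transform(X_tr), y_tr)
manual_pred = manual_svm.predict(manual_scaler.transform(X_te))

print(np.array_equal(pipe_pred, manual_pred))

# Fitted steps remain accessible by name
print(pipe.named_steps['scaler'].mean_)
```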

## Machine Learning Models

Scikit-learn provides a wide range of machine learning models for both supervised and unsupervised learning tasks. Supervised learning involves predicting a target variable based on input features, while unsupervised learning involves finding patterns in data without predefined labels.

### Supervised Learning

- Overview:
    - Supervised learning tasks in Scikit-learn include classification and regression. Classification involves predicting a discrete label (e.g., spam or not spam), while regression involves predicting a continuous value (e.g., house price).

    - k-Nearest Neighbors (k-NN): The k-NN algorithm is a simple, yet effective, method for both classification and regression. It works by finding the k nearest data points to a new observation and making predictions based on the majority label (in classification) or the average value (in regression).

- Examples:
    - We demonstrate how to use the k-NN classifier and regressor on scaled data. The classifier predicts class labels based on the nearest neighbors, while the regressor predicts continuous values. This example shows the versatility of the k-NN algorithm in different types of machine learning tasks.

In [None]:
# k-Nearest Neighbors (k-NN) Classifier

# Train the k-NN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
# Predict
knn_predictions = knn.predict(X_test_scaled)
print(f"k-NN Predictions: \n{knn_predictions}")

# k-NN Regressor (the class labels in y_train are treated as numeric targets, for illustration only)

# Train
knn_regressor = KNeighborsRegressor(n_neighbors=3)
knn_regressor.fit(X_train_scaled, y_train)
# Predict
knn_reg_predictions = knn_regressor.predict(X_test_scaled)
print(f"k-NN Regression Predictions: \n{np.round(knn_reg_predictions,2)}")
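
The majority-vote and averaging rules described above can be reproduced by hand. This sketch uses a hypothetical six-point dataset and the `kneighbors()` method to look up the nearest training points, then compares a manual vote and a manual mean with what the fitted models return.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical toy data: two well-separated groups of points
X_demo = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]], dtype=float)
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 1.5, 2.0, 6.0, 6.5, 7.0])
queries = np.array([[2.0, 2.0], [6.0, 5.0]])

knn_clf = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_class)
knn_reg = KNeighborsRegressor(n_neighbors=3).fit(X_demo, y_value)

# Indices of the 3 nearest training points for each query
_, neighbor_idx = knn_clf.kneighbors(queries)

# Classification: majority vote over the neighbors' labels
manual_labels = np.array([np.bincount(y_class[idx]).argmax() for idx in neighbor_idx])
# Regression: mean of the neighbors' target values
manual_values = y_value[neighbor_idx].mean(axis=1)

print(manual_labels, knn_clf.predict(queries))
print(manual_values, knn_reg.predict(queries))
```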

### Unsupervised Learning

- Overview:
    - Unsupervised learning includes tasks like clustering and dimensionality reduction, where the goal is to find structure in data without predefined labels. Scikit-learn provides various algorithms for unsupervised learning, including k-Means for clustering and PCA for dimensionality reduction.

    - k-Means Clustering: k-Means is a popular clustering algorithm that groups data into k clusters based on feature similarity. The algorithm iteratively assigns data points to clusters and adjusts the cluster centers until convergence.

- Examples:
    - We use k-Means to cluster the training data into two groups. The cluster centers are output, showing the average feature values for each cluster, and the labels indicate which cluster each data point belongs to.

In [None]:
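# Fix n_init and random_state so the clustering is reproducible across runs
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)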
kmeans.fit(X_train_scaled)
print(f"Cluster centers: \n{kmeans.cluster_centers_}")
print(f"\nLabels for training data: \n{kmeans.labels_}")
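
The two properties of a converged k-Means solution mentioned above (each point belongs to its nearest center, and each center is the mean of its points) can be verified directly. The sketch below assumes two synthetic, well-separated blobs generated with NumPy; the data and variable names are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(20, 2))
X_demo = np.vstack([blob_a, blob_b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# Distance from every point to every center: shape (n_samples, n_clusters)
dists = np.linalg.norm(X_demo[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
nearest = dists.argmin(axis=1)
print(np.array_equal(nearest, km.labels_))

# Each center is the mean of the points assigned to it
for k in range(2):
    print(np.allclose(km.cluster_centers_[k], X_demo[km.labels_ == k].mean(axis=0)))
```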

## Model Evaluation

Evaluating the performance of a machine learning model is crucial to understanding how well it generalizes to new data. Scikit-learn provides various tools for model evaluation, including cross-validation and performance metrics.

### Cross-Validation

- Overview:
    - Cross-validation is a technique for assessing how well a model will generalize to unseen data. It involves splitting the data into multiple folds, training the model on some folds, and testing it on the remaining folds. This process is repeated, and the results are averaged to provide a robust estimate of the model’s performance.

- Examples:
    - We demonstrate how to perform cross-validation using the `cross_val_score` function. This example uses logistic regression, but the concept can be applied to any estimator. The cross-validation scores and their average provide insight into the model’s stability and generalizability.

In [None]:
# Perform cross-validation
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=3)
print(f"Cross-validation scores: \n{cv_scores}")
print(f"Mean cross-validation score: \n{np.mean(cv_scores)}")
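
To make the fold-by-fold procedure concrete, here is a minimal sketch that runs the loop manually and compares it with `cross_val_score`. It assumes the bundled Iris data and wraps the scaler and model in a `Pipeline` so that scaling is re-fitted inside each fold. For classifiers, an integer `cv` uses stratified folds, which is why `StratifiedKFold` is used for the manual version.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
model = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000)),
])

# Built-in helper
auto_scores = cross_val_score(model, X_demo, y_demo, cv=5)

# Manual loop: train on four folds, score on the held-out fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

print(np.round(auto_scores, 3))
print(np.round(manual_scores, 3))
```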

### Performance Metrics

- Overview:
    - Performance metrics quantify the quality of a model's predictions. Common metrics in classification tasks include accuracy, precision, recall, and F1 score. In regression tasks, mean squared error (MSE) is a commonly used metric.

    - Accuracy: The proportion of correctly predicted instances out of the total instances. It’s a simple and widely used metric but can be misleading in imbalanced datasets.

    - Precision: The proportion of true positive predictions out of all positive predictions. It is especially important in scenarios where the cost of false positives is high.

    - Recall: The proportion of true positive predictions out of all actual positives. High recall is crucial when missing positive instances is costly.

    - F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics.

    - Mean Squared Error (MSE): A metric used in regression that measures the average squared difference between the observed and predicted values.

- Examples:
    - We evaluate a k-NN classifier and regressor using these metrics. The classification metrics provide a detailed understanding of the model's performance across different aspects, while MSE gives a measure of the regression model's prediction error.

In [None]:
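# Evaluate the classifier (macro average = unweighted mean over classes).
# Note: zero_division=1 scores an undefined precision/recall as 1, which can look optimistic on a tiny test set.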
accuracy = accuracy_score(y_test, knn_predictions)
precision = precision_score(y_test, knn_predictions, average='macro', zero_division=1)
recall = recall_score(y_test, knn_predictions, average='macro', zero_division=1)
f1 = f1_score(y_test, knn_predictions, average='macro', zero_division=1)

print(f"Accuracy: \t{accuracy}")
print(f"Precision: \t{precision}")
print(f"Recall: \t{recall}")
print(f"F1 Score: \t{f1}")

# Evaluate Regressor
mse = mean_squared_error(y_test, knn_reg_predictions)
print(f"Mean Squared Error: {mse}")
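
The metric definitions above can be worked through by hand on a small example. The sketch below uses hypothetical binary labels, counts true/false positives and negatives with NumPy, and compares the results with the corresponding Scikit-learn functions. A short MSE check on made-up regression values follows the same pattern.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, recall_score)

# Hypothetical binary labels (1 = positive, 0 = negative)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75

# Same values from Scikit-learn
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))

# MSE: mean of the squared differences (made-up regression values)
y_obs = np.array([3.0, 5.0, 2.0])
y_hat = np.array([2.5, 5.0, 3.0])
manual_mse = np.mean((y_obs - y_hat) ** 2)
print(manual_mse, mean_squared_error(y_obs, y_hat))
```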

## Practical Example: Classification with k-NN

- Overview:
    - In this practical example, we apply the k-Nearest Neighbors algorithm to classify the famous Iris dataset. The Iris dataset is a classic dataset in machine learning, consisting of three classes of iris flowers, each described by four features.

- Steps Involved:
    - Data Preparation: The dataset is split into training and testing sets, and the features are scaled so that no single feature dominates the distance calculations k-NN relies on.

    - Model Training: The k-NN model is trained on the scaled training data.

    - Prediction: The model makes predictions on the test data, and these predictions are compared to the true labels to evaluate the model's accuracy.

- Conclusion:
    - This example demonstrates the effectiveness of the k-NN algorithm in a classification task. By following these steps, you can apply k-NN to other datasets and understand the workflow involved in building and evaluating a classification model.

In [None]:
# Load the Iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# Split the data
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train_iris_scaled = scaler.fit_transform(X_train_iris)
X_test_iris_scaled = scaler.transform(X_test_iris)

# Train the k-NN model
knn_iris = KNeighborsClassifier(n_neighbors=3)
knn_iris.fit(X_train_iris_scaled, y_train_iris)

# Predict
iris_predictions = knn_iris.predict(X_test_iris_scaled)

# Evaluate
iris_accuracy = accuracy_score(y_test_iris, iris_predictions)
print(f"Accuracy of k-NN on Iris dataset: {iris_accuracy}")
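
The example above fixes `n_neighbors=3`. When applying k-NN to another dataset, one possible way to pick `k` is to compare a few candidates with cross-validation. The sketch below is only an illustration under that assumption: the candidate list `(1, 3, 5, 7, 9)` and the 5-fold setting are arbitrary choices, not recommendations from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

mean_scores = {}
for k in (1, 3, 5, 7, 9):  # arbitrary candidate values
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k)),
    ])
    # Mean accuracy over 5 cross-validation folds
    mean_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(f"k={k}: mean accuracy = {mean_scores[k]:.3f}")

best_k = max(mean_scores, key=mean_scores.get)
print(f"Best k in this sketch: {best_k}")
```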

## Practical Example: Titanic Dataset

Now let's apply the same workflow to the Titanic dataset. This practical example walks through preparing the data, training models, and evaluating their performance.

## Loading and Preparing the Titanic Dataset

- Overview:
    - The Titanic dataset is a classic dataset often used in machine learning to predict whether a passenger survived the sinking of the Titanic based on various features such as passenger class, age, fare, and the number of relatives aboard. This section covers how to load, prepare, and preprocess this dataset for machine learning tasks.

- Loading the Data:
    - The dataset is loaded directly from a URL using pandas' `read_csv()` function. This function reads a CSV file into a DataFrame, which is a table-like structure in pandas that is ideal for data manipulation and analysis.

- Selecting Relevant Features:
    - After loading the dataset, we select a subset of columns (features) that are relevant for predicting survival. In this case, we choose features like `pclass` (passenger class), `age`, `fare`, `sibsp` (siblings/spouses aboard), and `parch` (parents/children aboard). These features are stored in a DataFrame called `features`.
    - The target variable, which we want to predict, is the `survived` column, indicating whether the passenger survived (1) or not (0). This is stored in a separate variable called `target`.

- Handling Missing Values:
    - Missing values can cause model training to fail, so they need to be handled before fitting. In the Titanic dataset, the `age` column has missing values. We fill them with the mean age of the passengers using the `fillna()` function, a common strategy for missing numerical data.

- Splitting the Data:
    - The data is split into training and testing sets using the `train_test_split()` function. The training set is used to train the model, and the testing set is used to evaluate its performance. A common split is 80% training and 20% testing, as used here.

- Scaling the Data:
    - Feature scaling is performed to ensure that all features contribute equally to the model. This is especially important for models that rely on distance calculations, like k-NN and SVM. We use the `StandardScaler` to standardize the features by removing the mean and scaling to unit variance.

- Conclusion:
    - This section prepares the Titanic dataset for model training, ensuring that the data is clean, relevant, and scaled. Proper data preparation is a crucial step in any machine learning workflow and can significantly impact the model’s performance.

In [None]:
# Load the Titanic dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
titanic_data = pd.read_csv(url)

# Select relevant features and target
# (.copy() gives an independent DataFrame, so the fillna below does not trigger a SettingWithCopyWarning)
features = titanic_data[['pclass', 'age', 'fare', 'sibsp', 'parch']].copy()
target = titanic_data['survived']

# Handle missing values
features['age'] = features['age'].fillna(features['age'].mean())

# Split the data into training and testing sets
X_train_titanic, X_test_titanic, y_train_titanic, y_test_titanic = train_test_split(features, target, test_size=0.2, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train_titanic_scaled = scaler.fit_transform(X_train_titanic)
X_test_titanic_scaled = scaler.transform(X_test_titanic)
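
The cell above fills missing ages with the mean over all rows before splitting. A stricter variant, sketched here as an alternative and not as the notebook's method, learns the mean from the training rows only and reuses it for the test rows, using `SimpleImputer`. The small array below is a hypothetical stand-in for two numeric columns (age and fare), so the example runs without downloading the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for (age, fare) with two missing ages
X_demo = np.array([
    [22.0, 7.25], [np.nan, 71.28], [30.0, 8.05], [40.0, 53.10],
    [np.nan, 8.46], [50.0, 51.86], [28.0, 21.08], [36.0, 11.13],
])
y_demo = np.array([0, 1, 1, 1, 0, 0, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy='mean')
X_tr_filled = imputer.fit_transform(X_tr)  # learns column means from the training rows
X_te_filled = imputer.transform(X_te)      # reuses those means for the test rows

print(imputer.statistics_)  # per-column means learned from the training rows
print(np.isnan(X_tr_filled).any(), np.isnan(X_te_filled).any())
```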

## Training a Model on the Titanic Dataset

- Overview:
    - Once the data is prepared and preprocessed, the next step is to train a machine learning model. In this section, we train a Logistic Regression model to predict whether a passenger survived the Titanic disaster based on the features selected.

- Logistic Regression:
    - Logistic Regression is a commonly used algorithm for binary classification tasks. It models the probability that a given input belongs to a particular class. In this case, the model predicts the probability of survival based on the input features.
    - The model is trained using the `fit()` method on the scaled training data. The maximum number of iterations is raised to 1000 (the default is 100) to give the solver enough room to converge.

- Making Predictions:
    - After the model is trained, we use it to make predictions on the test data. The `predict()` method returns the predicted class labels (0 for not survived, 1 for survived) for the test data.

- Evaluating the Model:
    - The performance of the Logistic Regression model is evaluated using the `accuracy_score()` function, which calculates the proportion of correctly predicted instances. Accuracy is a straightforward and commonly used metric in binary classification problems.

- Conclusion:
    - This section demonstrates how to train a Logistic Regression model on the Titanic dataset and evaluate its performance. Logistic Regression is a powerful and interpretable model for binary classification, making it a good choice for this type of problem.

In [None]:
# Logistic Regression
titanic_model = LogisticRegression(max_iter=1000)
titanic_model.fit(X_train_titanic_scaled, y_train_titanic)

# Make predictions
titanic_predictions = titanic_model.predict(X_test_titanic_scaled)

# Evaluate the model
titanic_accuracy = accuracy_score(y_test_titanic, titanic_predictions)
print(f"Accuracy of Logistic Regression on Titanic dataset: {titanic_accuracy:.2f}")
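
To illustrate how Logistic Regression turns features into a probability, the sketch below recomputes the model's output by hand: a linear score passed through the sigmoid function. It uses synthetic binary data from `make_classification` as a stand-in for the Titanic features, so the numbers are not comparable to the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for the Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score for each sample: w . x + b
scores = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
# The sigmoid maps the score to a probability for the positive class
manual_proba = 1.0 / (1.0 + np.exp(-scores))

print(np.allclose(manual_proba, clf.predict_proba(X_demo)[:, 1]))
# Default decision rule: predict class 1 when the probability exceeds 0.5
print(np.array_equal((manual_proba > 0.5).astype(int), clf.predict(X_demo)))
```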

## Further Evaluation with k-NN and Random Forest on Titanic Dataset

While Logistic Regression is a solid starting point, it’s important to explore and compare other models to find the one that performs best on the dataset. In this section, we evaluate two additional models: k-Nearest Neighbors (k-NN) and Random Forest, both of which are popular choices for classification tasks.


### k-Nearest Neighbors (k-NN):

- The k-NN algorithm is a simple, non-parametric method used for classification. It works by finding the k closest data points (neighbors) in the feature space and predicting the class based on the majority class among these neighbors. It’s particularly useful for datasets where the decision boundary between classes is complex.

- Training the Model:
    - We train the k-NN classifier on the scaled training data. The `n_neighbors` parameter is set to 3, meaning the algorithm will consider the three nearest neighbors for making predictions.

- Making Predictions:
    - After training, we use the `predict()` method to classify the test data. The predictions are compared against the actual labels to evaluate the model’s performance.

- Evaluating the Model:
    - The accuracy of the k-NN classifier is calculated using the `accuracy_score()` function. This allows us to compare the k-NN model’s performance with that of the Logistic Regression model.

- Conclusion:
    - This section illustrates how the k-NN algorithm can be applied to the Titanic dataset and how it compares to Logistic Regression. k-NN is often used as a benchmark for more complex models due to its simplicity and effectiveness.


### Random Forest:

- Random Forest is an ensemble learning method that combines multiple decision trees to improve accuracy and reduce overfitting. Each tree is trained on a bootstrap sample of the training data and considers a random subset of features at each split; the final prediction is made by averaging the predictions of all trees (in regression) or by majority vote (in classification).

- Training the Model:
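    - The Random Forest classifier is trained on the scaled training data using 100 decision trees (`n_estimators=100`). Tree-based models do not require feature scaling, but reusing the scaled data keeps the comparison with the other models simple. The random state is set to 42 to ensure reproducibility of the results.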

- Making Predictions:
    - The model is used to predict the survival of passengers in the test set. These predictions are then evaluated to determine the model’s accuracy.

- Evaluating the Model:
    - The accuracy of the Random Forest classifier is calculated using the `accuracy_score()` function. Random Forest is known for its robustness and ability to handle a large number of features, making it a strong candidate for this classification task.

- Conclusion:
    - This section shows how the Random Forest algorithm can be applied to the Titanic dataset. By comparing the performance of Logistic Regression, k-NN, and Random Forest, we can select the best model for predicting passenger survival on the Titanic.

In [None]:
# k-NN Classifier on Titanic dataset

knn_titanic = KNeighborsClassifier(n_neighbors=3)
knn_titanic.fit(X_train_titanic_scaled, y_train_titanic)
knn_titanic_predictions = knn_titanic.predict(X_test_titanic_scaled)
knn_titanic_accuracy = accuracy_score(y_test_titanic, knn_titanic_predictions)
print(f"Accuracy of k-NN on Titanic dataset: {knn_titanic_accuracy:.2f}")

# Random Forest Classifier on Titanic dataset

rf_titanic = RandomForestClassifier(n_estimators=100, random_state=42)
rf_titanic.fit(X_train_titanic_scaled, y_train_titanic)
rf_titanic_predictions = rf_titanic.predict(X_test_titanic_scaled)
rf_titanic_accuracy = accuracy_score(y_test_titanic, rf_titanic_predictions)
print(f"Accuracy of Random Forest on Titanic dataset: {rf_titanic_accuracy:.2f}")
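
A single train/test split can make a model comparison sensitive to how the rows happened to be divided. One possible refinement, sketched here on synthetic data from `make_classification` (a stand-in, not the Titanic data), is to score each candidate with the same cross-validation folds. The dictionary of candidates and the 5-fold setting are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'k-NN': KNeighborsClassifier(n_neighbors=3),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline so it is re-fitted on each training fold
    model = Pipeline([('scaler', StandardScaler()), ('model', estimator)])
    results[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```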

## Conclusion
This notebook has covered the basics of Scikit-learn, including core concepts, machine learning models, and model evaluation. With these tools, you can start using Scikit-learn for your machine learning tasks.

## Extra Resources

Hyperlinks are attached to each of the extra resources

- [Scikit-learn Documentation](https://scikit-learn.org/stable/)
    - Official Site 
      
&nbsp;

- [Scikit-learn Tutorial - DataCamp](https://www.datacamp.com/tutorial/machine-learning-python)
    - Complete beginner tutorial for Scikit-learn
          
&nbsp; 
     
- [SimpliLearn](https://www.simplilearn.com/tutorials/python-tutorial/scikit-learn)
    - An Introduction to Scikit-Learn: Machine Learning in Python