# Get Familiar with Scikit-learn

## Introduction to Scikit-learn


- Scikit-learn is a Python library for machine learning built on NumPy, SciPy, and Matplotlib.
  It leverages the power of these foundational libraries to deliver a comprehensive suite of tools for machine learning and statistical modeling.
- Wide range of algorithms supported.
  Scikit-learn supports a vast array of algorithms, including linear and logistic regression, decision trees, support vector machines, naive Bayes, k-means, and more.
- Scikit-learn emphasizes ease of use and productivity.
  Its consistent estimator API (`fit`, `predict`, `transform`) and well-documented functionality make it accessible to both beginners and experienced practitioners. Many of its algorithms are implemented efficiently enough for sizeable datasets that fit in memory.
- Integration with other libraries.
  Scikit-learn seamlessly integrates with other Python libraries such as pandas for data manipulation, matplotlib for plotting, and seaborn for statistical visualization.
- Community and documentation.
  Scikit-learn has a large and active community, contributing to its continuous improvement and expansion. Comprehensive documentation, user guides, and tutorials are readily available, making it easier to learn and apply machine learning techniques.
- Open source and widely adopted.
  As an open-source project, Scikit-learn is free to use and distribute. It is widely adopted in both academic research and industry applications, demonstrating its reliability and effectiveness in various domains.
        

## Installation and Setup

In [1]:
!pip install scikit-learn



## Basic Usage

Let's look at a very simple example: a "Fruit Classifier".

The input data X represents fruits with their weight in grams and color score on a scale from 1 to 10.
The target data y uses 0 for Apple and 1 for Orange.

In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

### Prepare data
## Train data
# input: weight (grams) and color score (1-10 scale)
X_train = [
    [150, 7],  # Apple
    [170, 8],  # Apple
    [140, 6],  # Apple
    [300, 4],  # Orange
    [320, 5],  # Orange
    [310, 4]   # Orange
]

# target: 0 for Apple, 1 for Orange
y_train = [0, 0, 0, 1, 1, 1]

## Test data
X_test = [
    [205, 4],  # Should be classified as Apple
    [315, 5]   # Should be classified as Orange
]
y_test = [0, 1]  # Actual labels for the test data

### Train the model
## instantiate the estimator object
clf = KNeighborsClassifier(n_neighbors=1)

# fit the estimator on training data (X, y)
clf.fit(X_train, y_train)

### Test
# predict the classes
y_pred = clf.predict(X_test)


### Evaluate
print(f'Predicted outputs: {y_pred}')
print(f'Actual outputs: {y_test}')

# calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy:", accuracy)

Predicted outputs: [0 1]
Actual outputs: [0, 1]
Model accuracy: 1.0


# Key Features

## Built-in Datasets


- Scikit-learn comes with several built-in datasets for practicing machine learning techniques. These datasets are small and are used primarily for educational purposes and quick testing.
- Popular Built-in Datasets:
  - Iris: A classic dataset for classification tasks, containing measurements of iris flowers from three different species.
  - Digits: A dataset for classification tasks, containing images of handwritten digits.
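  - Boston Housing: A dataset used for regression tasks, containing information about housing prices in Boston suburbs. Note that `load_boston` was removed in scikit-learn 1.2; `fetch_california_housing` is a common replacement for regression examples.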
  - Wine: A dataset for classification tasks, containing chemical analysis of wines grown in the same region in Italy.
  - Breast Cancer: A dataset for classification tasks, containing data on breast cancer tumor features.
        

### Loading Built-in Datasets

In [3]:

from sklearn import datasets

# Load Iris dataset
iris = datasets.load_iris()
print(iris.data[:5])

# Load Digits dataset
digits = datasets.load_digits()
print(digits.data[:5])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[[ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.  0.  0.  3.
  15.  2.  0. 11.  8.  0.  0.  4. 12.  0.  0.  8.  8.  0.  0.  5.  8.  0.
   0.  9.  8.  0.  0.  4. 11.  0.  1. 12.  7.  0.  0.  2. 14.  5. 10. 12.
   0.  0.  0.  0.  6. 13. 10.  0.  0.  0.]
 [ 0.  0.  0. 12. 13.  5.  0.  0.  0.  0.  0. 11. 16.  9.  0.  0.  0.  0.
   3. 15. 16.  6.  0.  0.  0.  7. 15. 16. 16.  2.  0.  0.  0.  0.  1. 16.
  16.  3.  0.  0.  0.  0.  1. 16. 16.  6.  0.  0.  0.  0.  1. 16. 16.  6.
   0.  0.  0.  0.  0. 11. 16. 10.  0.  0.]
 [ 0.  0.  0.  4. 15. 12.  0.  0.  0.  0.  3. 16. 15. 14.  0.  0.  0.  0.
   8. 13.  8. 16.  0.  0.  0.  0.  1.  6. 15. 11.  0.  0.  0.  1.  8. 13.
  15.  1.  0.  0.  0.  9. 16. 16.  5.  0.  0.  0.  0.  3. 13. 16. 16. 11.
   5.  0.  0.  0.  0.  3. 11. 16.  9.  0.]
 [ 0.  0.  7. 15. 13.  1.  0.  0.  0.  8. 13.  6. 15.  4.  0.  0.  0.  2.
   1. 13. 13.  0.  0.  0.  0.  0.  

## Data Preprocessing

### Overview

Data preprocessing is a crucial step in the machine learning pipeline. It involves transforming raw data into a format suitable for modeling, which can improve the performance and accuracy of machine learning algorithms.

Common preprocessing techniques include:
  - Scaling: Rescaling each feature to a fixed range such as [0, 1] (for example with `MinMaxScaler`), so that no feature dominates simply because of its units.
  - Standardization: Transforming each feature to have a mean of 0 and a standard deviation of 1 (for example with `StandardScaler`). This is often loosely called normalization.
  - Encoding: Converting categorical variables into numerical format.
  - Imputation: Filling in missing values with appropriate substitutes.

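As a rough illustration of the difference between rescaling to a fixed range and rescaling to zero mean and unit variance, the sketch below applies `MinMaxScaler` and `StandardScaler` to a small made-up array. The values in `X_demo` are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy feature matrix: weight (grams) and color score (1-10)
X_demo = np.array([[150.0, 7.0], [170.0, 8.0], [300.0, 4.0], [320.0, 5.0]])

# Range scaling: each column is mapped linearly onto [0, 1]
X_minmax = MinMaxScaler().fit_transform(X_demo)

# Standardization: each column ends up with mean 0 and standard deviation 1
X_standard = StandardScaler().fit_transform(X_demo)

print("Min-max range:", X_minmax.min(axis=0), X_minmax.max(axis=0))
print("Standardized mean:", X_standard.mean(axis=0).round(6))
print("Standardized std:", X_standard.std(axis=0).round(6))
```

Which of the two is preferable depends on the model; distance-based methods such as k-nearest neighbors are usually sensitive to feature ranges either way.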
### Scaling and Normalization with Scikit-learn

In [4]:

from sklearn.preprocessing import StandardScaler

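# Standardize features to zero mean and unit variance.
# Fit the scaler on the training data only, then reuse it for the test data.
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)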


### Encoding Categorical Variables with Scikit-learn

In [5]:

from sklearn.preprocessing import OneHotEncoder

# Example of one-hot encoding
encoder = OneHotEncoder()
# Assuming `X` is a dataset containing categorical variables
# X_encoded = encoder.fit_transform(X)

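The commented-out call above assumes a dataset `X` that is not defined in this notebook. A minimal runnable sketch, using a hypothetical single-column array of color names (`X_colors` is made up for illustration), might look like this:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column: one color name per sample
X_colors = np.array([["red"], ["green"], ["red"], ["orange"]])

encoder = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix by default; toarray() makes it dense
X_encoded = encoder.fit_transform(X_colors).toarray()

print(encoder.categories_)  # categories learned for each column, in sorted order
print(X_encoded)            # one row per sample, one column per category
```

Each row contains a single 1 in the column of its category, so the encoded matrix has as many columns as there are distinct values.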

### Handling Missing Values

In [6]:

from sklearn.impute import SimpleImputer

# Example of imputing missing values
# Assuming `X` is a dataset containing missing values
# imputer = SimpleImputer(strategy='mean')
# X_imputed = imputer.fit_transform(X)

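Since `X` is not defined here, the following sketch shows the same idea on a small hypothetical array (`X_missing`, with missing entries marked as `np.nan`). It is meant as an illustration of mean imputation, not as a recommendation for any particular dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with missing entries marked as np.nan
X_missing = np.array([
    [150.0, 7.0],
    [np.nan, 8.0],
    [300.0, np.nan],
    [320.0, 5.0],
])

# Replace each missing value with the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X_missing)

print(imputer.statistics_)  # per-column means used as fill values
print(X_imputed)
```

Other strategies supported by `SimpleImputer` include `"median"`, `"most_frequent"`, and `"constant"`.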

## Model Selection

### Overview of Model Selection


- Model selection is the process of choosing the most appropriate machine learning model and its hyperparameters for a given task.
- This step is critical for ensuring the best performance and generalization of the model.
        

### Train/Test Split

In [7]:

from sklearn.model_selection import train_test_split

# Assuming `X` and `y` are the features and target variable of the dataset
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

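To try the split on real data, one option is the built-in Iris dataset. The sketch below uses separate variable names (`X_tr`, `X_te`, and so on, chosen here only to avoid overwriting the fruit data from earlier cells) and assumes a stratified 80/20 split is wanted.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# Hold out 20% of the samples; stratify keeps class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible across runs.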

### Cross-Validation

In [8]:

from sklearn.model_selection import cross_val_score

# Assuming `model`, `X`, and `y` are predefined
# scores = cross_val_score(model, X, y, cv=5)
# print(scores)

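The commented-out lines above assume `model`, `X`, and `y` already exist. A minimal runnable sketch, assuming a k-nearest neighbors classifier on the Iris dataset (both chosen here purely for illustration), could be:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

# With a classifier and an integer cv, scikit-learn uses stratified k-fold splitting
scores = cross_val_score(model, X_iris, y_iris, cv=5)

print(scores)                         # one accuracy value per fold
print("Mean accuracy:", scores.mean())
```

Reporting the mean together with the spread of the fold scores gives a more stable estimate than a single train/test split.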

### Grid Search

In [9]:

from sklearn.model_selection import GridSearchCV

# The toy training set has only 3 samples per class, so stratified
# cross-validation can use at most 3 folds (cv=5 raises a ValueError).
# Each training fold then holds 4 samples, so n_neighbors must not exceed 4.
param_grid = {'n_neighbors': [1, 3]}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

### Random Search

In [None]:

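from sklearn.model_selection import RandomizedSearchCV

# The toy training set has 3 samples per class, so use at most 3 folds;
# each training fold then has 4 samples, so n_neighbors must not exceed 4.
# n_iter is kept below the number of candidates so that only a subset is sampled.
param_dist = {'n_neighbors': [1, 2, 3]}
random_search = RandomizedSearchCV(
    KNeighborsClassifier(), param_dist, n_iter=2, cv=3, random_state=42
)
random_search.fit(X_train, y_train)
print(random_search.best_params_)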


## Model Evaluation

In [None]:

from sklearn.metrics import classification_report

# clf is the KNeighborsClassifier fitted in the Basic Usage example
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
