# scikit-learn 
- **scikit-learn**, often abbreviated as <mark>sklearn</mark>, is an `open-source Python library` that implements a range of
 - machine learning
 - pre-processing
 - cross-validation
 - visualization algorithms<br>using a unified interface.


- It provides simple and efficient tools for <mark>data mining</mark> and <mark>data analysis</mark>.


- It includes implementations of various **supervised and unsupervised learning algorithms** such as:
 - support vector machines (SVM)
 - random forests
 - k-nearest neighbors (KNN)
 - k-means clustering


- It provides a wide range of tools for various <mark>**ML tasks**</mark> such as:
  - Classification
  - Regression
  - Clustering
  - Dimensionality reduction
  - Model selection


- It is built on top of NumPy, SciPy, and Matplotlib.
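
The "unified interface" mentioned above can be sketched as follows. This is a minimal illustration, not a recommended workflow: the choice of `SVC` and `KNeighborsClassifier`, the `n_neighbors=5` setting, and scoring on the training data are all assumptions made only to show that different algorithms share the same `fit`, `predict`, and `score` calls.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two different algorithms, driven through exactly the same method calls.
estimators = [SVC(), KNeighborsClassifier(n_neighbors=5)]
train_scores = {}
for estimator in estimators:
    estimator.fit(X, y)                  # learn from features and labels
    predictions = estimator.predict(X)   # one predicted label per sample
    # score() returns mean accuracy for classifiers (training data here, for brevity)
    train_scores[type(estimator).__name__] = estimator.score(X, y)

print(train_scores)
```

Because the method names are shared, swapping one algorithm for another usually means changing only the line that constructs the estimator.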

## How to implement sklearn on a dataset

### Step 1 : Import the relevant modules: 
Start by importing the necessary modules from scikit-learn as well as other Python libraries for data manipulation and visualization.

In [2]:
from sklearn.datasets import load_iris

- imports the **`load_iris function`** from the <mark>datasets **module**</mark> in **scikit-learn**. 
<br>
- This function is used to load the **Iris dataset**, which is a **classic dataset** in machine learning and statistics. It is often used for **learning and testing purposes**.
<br>
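- The Iris dataset consists of 150 samples of iris flowers: 50 from each of three species, with four measurements per sample (sepal length, sepal width, petal length, petal width).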
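
To get a feel for what `load_iris` returns, a quick inspection along these lines can help. It is only a sketch; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)            # (150, 4): 150 samples, 4 features
print(iris.feature_names)         # names of the four measurements
print(iris.target_names)          # ['setosa' 'versicolor' 'virginica']
print(np.bincount(iris.target))   # [50 50 50]: samples per species
```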

In [8]:
from sklearn.model_selection import train_test_split

- imports the **`train_test_split function`** from the <mark>model_selection **module**</mark> in **scikit-learn**. 
<br>
- This function is commonly used for <mark>**splitting datasets** into training and testing subsets.</mark>
<br>
- By default, it splits the data into **`75% training and 25% testing sets`**, but you can adjust this ratio with the **test_size parameter**.
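
The default ratio and the effect of `test_size` can be checked on a small synthetic array. In this sketch the 100-row `X_demo` and `y_demo` arrays are made-up data used only for counting rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical samples with 2 features each
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# No test_size given: 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(len(X_tr), len(X_te))      # 75 25

# Explicit test_size: 20% of the rows go to the test set
X_tr20, X_te20, y_tr20, y_te20 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(len(X_tr20), len(X_te20))  # 80 20
```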

In [9]:
from sklearn.preprocessing import StandardScaler

- imports the **`StandardScaler class`** from the <mark>preprocessing **module**</mark> in scikit-learn. 
<br>
- The StandardScaler class is used for <mark>**standardizing features**</mark> by removing the mean and scaling to unit variance.
<br>
- Standardization is a common preprocessing step in machine learning where the features are transformed in such a way that they have **`mean 0 and variance 1.`**
<br>
- This preprocessing step is important for algorithms that are sensitive to the scale of the features, such as
 - support vector machines (SVM)
 - k-nearest neighbors (KNN)
 - logistic regression.
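
The effect of standardization can be verified on a tiny array. This is a minimal sketch; `X_raw` is made-up data with two features on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second feature is 100 times larger than the first
X_raw = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)  # learn mean/std per column, then transform

print(scaler.mean_)            # per-feature means learned from X_raw: 2.5 and 250.0
print(X_scaled.mean(axis=0))   # approximately 0 for each feature
print(X_scaled.std(axis=0))    # approximately 1 for each feature
```

After scaling, both columns contain the same values, so neither feature dominates a distance or gradient computation merely because of its units.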


In [10]:
from sklearn.linear_model import LogisticRegression

- imports the **`LogisticRegression class`** from the <mark>linear_model **module**</mark> in scikit-learn. 
<br>
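- The LogisticRegression class is used for <mark>**logistic regression**</mark>, a statistical method for classification. It is most often introduced for binary tasks, but scikit-learn's implementation also handles multiclass problems such as the three-species Iris dataset.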
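
Note that scikit-learn's `LogisticRegression` is not limited to two classes. The sketch below fits it on the three-class Iris data; `max_iter=1000` is an assumption chosen here only to give the solver room to converge on unscaled features.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)  # max_iter raised as a precaution
clf.fit(X, y)

print(clf.classes_)             # [0 1 2]: three classes were learned
proba = clf.predict_proba(X[:3])
print(proba.shape)              # (3, 3): one probability per class per sample
print(proba.sum(axis=1))        # each row sums to approximately 1
```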

In [11]:
from sklearn.metrics import accuracy_score

- imports the **`accuracy_score function`** from the <mark>metrics **module**</mark> in scikit-learn. 
<br>
- The accuracy_score function is used to <mark>**evaluate the accuracy**</mark> of a classification model's predictions compared to the true labels.
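
A tiny hand-checkable example shows what `accuracy_score` returns. The label lists below are made up for illustration.

```python
from sklearn.metrics import accuracy_score

y_true_demo = [0, 1, 2, 2, 1]   # hypothetical true labels
y_pred_demo = [0, 1, 1, 2, 1]   # hypothetical predictions (one mistake)

fraction_correct = accuracy_score(y_true_demo, y_pred_demo)
count_correct = accuracy_score(y_true_demo, y_pred_demo, normalize=False)

print(fraction_correct)  # 0.8, i.e. 4 of 5 predictions match
print(count_correct)     # number of correct predictions (4)
```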

### Step 2 : Prepare your data: 
Load your dataset and preprocess it as necessary. This might involve tasks such as 
- handling missing values
- encoding categorical variables
- splitting the data into training and testing sets.
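
The Iris dataset has no missing values or categorical features, so the first two tasks are not needed in this walkthrough. For other datasets, they might look roughly like the sketch below. The toy arrays are made up, and `SimpleImputer` with a mean strategy and `OneHotEncoder` are just one possible choice of tools.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical numeric column with one missing value
numeric = np.array([[1.0], [np.nan], [3.0]])
imputer = SimpleImputer(strategy="mean")
numeric_filled = imputer.fit_transform(numeric)
print(numeric_filled.ravel())   # the NaN is replaced by the column mean, 2.0

# Hypothetical categorical column
colors = np.array([["red"], ["green"], ["red"]])
encoder = OneHotEncoder()
colors_encoded = encoder.fit_transform(colors).toarray()  # dense copy of the sparse result
print(encoder.categories_[0])   # categories in sorted order: green, red
print(colors_encoded)           # one column per category, 1.0 where it applies
```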

In [12]:
# Load the Iris dataset
iris = load_iris()

X = iris.data # Features (sepal length, sepal width, petal length, petal width)
y = iris.target # Target (species labels)

In [13]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

- **X_train**: Features for training
- **X_test**: Features for testing
- **y_train**: Target labels for training
- **y_test**: Target labels for testing
<br>


- **test_size=0.2**: Specifies that 20% of the data will be reserved for testing, and the remaining 80% will be used for training.
- **random_state=42**: Sets the random seed to 42, ensuring reproducibility. This means that each time you run this code, the data will be split in the same way, which is useful for debugging and comparing different models.
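
The reproducibility claim can be checked directly by splitting twice with the same seed and comparing the results. This is a small sketch; `split_a` and `split_b` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same data, same test_size, same seed: the two splits should be identical
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

identical = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(identical)                            # True
print(split_a[0].shape, split_a[1].shape)   # (120, 4) (30, 4)
```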

### Step 3 : Choose a model: 
Select an appropriate machine learning algorithm for your task based on the type of problem you're trying to solve (classification, regression, clustering, etc.) and the characteristics of your data.



In [5]:
# Step 3: Choose a model
model = LogisticRegression()

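- `LogisticRegression()` creates the model with its default hyperparameters. The model is not trained yet; the next steps fit it to the training data and use it to make predictions.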

### Step 4 : Train the model:
Fit the chosen model to your training data by calling its `fit()` method with the training features and the training labels.

In [None]:
# Step 4: Train the model
model.fit(X_train, y_train)
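
`StandardScaler` was imported in Step 1 but is not applied in the cells above. One possible way to include it, sketched here as an assumption rather than as part of the original workflow, is to chain it with the model using `make_pipeline`, so the scaler is fitted on the training data only and then reused on the test data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline scales the features first, then fits the classifier
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_train, y_train)

# score() applies the already-fitted scaler to X_test before predicting
test_accuracy = scaled_model.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Keeping the scaler inside the pipeline avoids accidentally computing the mean and variance from the test set.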

### Step 5 : Evaluate the model: 
Assess the performance of your model using appropriate evaluation metrics. For supervised learning tasks, this often involves making predictions on a test set and comparing them to the true labels.

In [7]:
# Step 5: Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 1.0


- **y_pred = model.predict(X_test)**: This line uses the trained LogisticRegression model (model) to predict the labels for the test data (X_test). The predict method takes the features of the test data (X_test) as input and returns the predicted labels (y_pred). These predicted labels will be used to evaluate the performance of the model.

- **accuracy = accuracy_score(y_test, y_pred)**: This line calculates the accuracy of the predictions made by the model. The accuracy_score function from scikit-learn compares the true labels of the test data (y_test) with the predicted labels (y_pred) and computes the accuracy as the fraction of correctly classified samples. The result is stored in the variable accuracy.
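
"Fraction of correctly classified samples" can be made concrete by computing it by hand. In this sketch the two label arrays are made up; `confusion_matrix` is included as an optional extra view that shows which classes get confused.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true_demo = np.array([0, 0, 1, 1, 2, 2])   # hypothetical true labels
y_pred_demo = np.array([0, 0, 1, 2, 2, 2])   # hypothetical predictions (one mistake)

# Accuracy by hand: proportion of positions where prediction equals truth
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
library_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(manual_accuracy, library_accuracy)     # both equal 5/6

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total number of samples gives the same accuracy.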

### Step 6 : Tune hyperparameters:
Fine-tune the hyperparameters of your model to optimize its performance. This can be done using techniques like grid search or randomized search.
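
A grid search for this workflow might look like the sketch below. The candidate values for `C`, the 5-fold cross-validation, and `max_iter=1000` are all assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid: C is the inverse regularization strength
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_train, y_train)   # cross-validates every candidate on the training set

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", search.best_score_)
print("Test accuracy of best model:", search.score(X_test, y_test))
```

The test set is kept out of the search, so it still gives an independent estimate after tuning. `RandomizedSearchCV` from the same module follows the same pattern but samples candidates instead of trying every combination.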

### Step 7 : Make predictions: 
Once you're satisfied with your model's performance, you can use it to make predictions on new, unseen data.
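
Predicting on new data uses the same `predict` method as in Step 5. In the sketch below, the two rows in `new_flowers` are hypothetical measurements in the same feature order as the Iris data, and the model is refitted on the full dataset with `max_iter=1000` as an assumption to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
model = LogisticRegression(max_iter=1000)
model.fit(iris.data, iris.target)

# Hypothetical new samples: sepal length, sepal width, petal length, petal width (cm)
new_flowers = np.array([[5.1, 3.5, 1.4, 0.2],
                        [6.7, 3.0, 5.2, 2.3]])

predicted_labels = model.predict(new_flowers)          # integer class labels
predicted_species = iris.target_names[predicted_labels]  # map labels to species names
print(predicted_species)
```

New data must go through the same preprocessing as the training data before it is passed to `predict`.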

# Algorithms available in scikit-learn

Linear Models:
- Linear Regression
- Logistic Regression
- Ridge Regression
- Lasso Regression
- ElasticNet

Support Vector Machines (SVM):

- Support Vector Classifier (SVC)
- Support Vector Regression (SVR)

Tree-based Methods:

- Decision Trees
- Random Forests
- Gradient Boosting Machines (GBM)
- AdaBoost

Nearest Neighbors:
- k-Nearest Neighbors (KNN)

Clustering:
- K-Means
- Agglomerative Hierarchical Clustering
- DBSCAN

Dimensionality Reduction:
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- t-distributed Stochastic Neighbor Embedding (t-SNE)

Naive Bayes:
- Gaussian Naive Bayes
- Multinomial Naive Bayes
- Bernoulli Naive Bayes

Ensemble Methods:
- Voting Classifier
- Voting Regressor
- Bagging
- Stacking

Neural Networks (built in, via the `sklearn.neural_network` module):
- Multi-layer Perceptron (MLP)

Supporting tools:
- Stochastic gradient descent training (for example `SGDClassifier` and `SGDRegressor`)
- Loss functions
- Various metrics for model evaluation
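
The unsupervised entries listed above follow the same estimator pattern as the classifier used earlier. Here is a minimal sketch with K-Means and PCA on the Iris features; `n_clusters=3`, `n_init=10`, `random_state=0`, and `n_components=2` are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # labels are ignored: these methods are unsupervised

# Clustering: assign each sample to one of 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Dimensionality reduction: project the 4 features onto 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(cluster_labels.shape, X_2d.shape)   # (150,) (150, 2)
print(pca.explained_variance_ratio_)      # share of variance kept by each component
```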