# Introduction to Common Machine Learning Libraries

## Scikit-learn

Scikit-learn is widely used for traditional machine learning algorithms, including classification, regression, clustering, and dimensionality reduction. \
![Scikit Learn](imgs/sklearn.png) \
https://scikit-learn.org/stable/



### **Installation**

```bash
pip install scikit-learn
```


### **Classification with Scikit-learn**
Classification is a supervised learning approach where the goal is to predict the categorical class labels of new observations, based on past observations.

Example: Iris Dataset Classification
- The Iris dataset is a classic dataset for classification. It consists of 150 observations of iris flowers from three different species. Each observation has four features: sepal length, sepal width, petal length, and petal width.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and test sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Create a decision tree classifier
clf = DecisionTreeClassifier()

# Train the classifier on the training set
clf.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Accuracy: the fraction of test samples classified correctly
print("Accuracy:", accuracy_score(y_test, y_pred))
```

```text
Accuracy: 0.9555555555555556
```
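
Because `train_test_split` shuffles the data randomly, the accuracy can differ from run to run. The following is a minimal sketch of one way to make the result repeatable, assuming a fixed seed is acceptable. The helper `run_once` and the seed value `42` are illustrative choices, and `stratify=y` is an optional extra that keeps the three species balanced across the split.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def run_once(seed):
    """Train and score a decision tree with a fixed seed (hypothetical helper)."""
    X, y = load_iris(return_X_y=True)
    # random_state fixes the shuffle; stratify keeps class proportions equal
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    # The tree itself also has a random component, so seed it as well
    clf = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))


first = run_once(42)
second = run_once(42)
print("Accuracy:", first)
print("Same seed gives the same accuracy:", first == second)
```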


### **Regression with Scikit-learn**
Regression is a supervised learning approach that models the relationship between a dependent (target) variable and one or more independent variables in order to predict continuous values.

Example: Diabetes Dataset Regression
- The diabetes dataset consists of 10 baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) measured on 442 patients, together with a quantitative measure of disease progression one year after baseline.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

# Split the data into training and test sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Create a linear regression model
regr = LinearRegression()

# Train the model on the training set
regr.fit(X_train, y_train)

# Make predictions on the test set
diabetes_y_pred = regr.predict(X_test)

# Mean squared error between the true and predicted targets
print(f"Mean squared error: {mean_squared_error(y_test, diabetes_y_pred):.2f}")
```

```text
Mean squared error: 3266.08
```
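
The mean squared error is the average of the squared differences between true and predicted targets, so it is expressed in squared target units. Taking the square root (RMSE) brings it back to the units of the target; for the value above that is roughly 57. The sketch below works through the arithmetic with NumPy on a small set of hypothetical values (`y_true` and `y_pred` are made up for illustration and are not taken from the diabetes dataset).

```python
import numpy as np

# Hypothetical true and predicted values, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residuals are the per-sample prediction errors
residuals = y_true - y_pred

# MSE: mean of the squared residuals
mse = np.mean(residuals ** 2)

# RMSE: square root of the MSE, in the same units as the target
rmse = np.sqrt(mse)

print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}")  # MSE: 0.375, RMSE: 0.612
```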


### **Clustering with Scikit-learn**
Clustering is a type of unsupervised learning that groups together data points that are similar to each other.

Example: K-Means Clustering
- K-Means is a popular clustering algorithm that partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

```python
from sklearn.cluster import KMeans
import numpy as np

# Generate artificial data
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Create the KMeans object and fit it to the data
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# Predict the clusters of two new points
predicted_clusters = kmeans.predict([[0, 0], [12, 3]])

# Centers of the clusters
centers = kmeans.cluster_centers_

print("Predicted clusters:", predicted_clusters)
print("Cluster Centers:", centers)
```

```text
Predicted clusters: [1 0]
Cluster Centers: [[10.  2.]
 [ 1.  2.]]
```
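
The "nearest mean" rule can be checked by hand. The sketch below is not scikit-learn's implementation; it only reproduces the assignment step with NumPy, using the cluster centers printed above and the same two query points.

```python
import numpy as np

# Cluster centers copied from the printed output above
centers = np.array([[10.0, 2.0], [1.0, 2.0]])

# The two points that were passed to kmeans.predict
points = np.array([[0.0, 0.0], [12.0, 3.0]])

# distances[i, j] is the Euclidean distance from points[i] to centers[j]
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Each point is assigned to the index of its nearest center
labels = distances.argmin(axis=1)

print(labels)  # [1 0]
```

The point `[0, 0]` is closer to the center `[1, 2]` (index 1), and `[12, 3]` is closer to `[10, 2]` (index 0), which matches the predicted clusters.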


## PyTorch

PyTorch is an open-source machine learning library for Python, originally developed by Facebook's AI Research lab (now Meta AI) and widely used for applications such as computer vision and natural language processing. It is known for its flexibility, speed, and ease of use. PyTorch works with data in the form of tensors, which are similar to NumPy arrays but can also run on GPUs and track gradients. \
![Pytorch](imgs/pytorch.png) \
https://pytorch.org/


### **Installation**

```bash
pip install torch torchvision
```



### **Basic Tensor Operations**
Tensors are the fundamental unit of data in PyTorch and represent a multi-dimensional array.

Example: Creating Tensors and Basic Operations

```python
import torch

# Create two 1-D tensors
x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5, 6])

# Basic operations
sum_xy = x + y  # Element-wise addition
mul_xy = x * y  # Element-wise multiplication

print("Sum of x and y:", sum_xy)
print("Element-wise multiplication of x and y:", mul_xy)
```

```text
Sum of x and y: tensor([5, 7, 9])
Element-wise multiplication of x and y: tensor([ 4, 10, 18])
```
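
Beyond element-wise arithmetic, a few other operations come up often. The sketch below reuses the same `x` and `y` and shows shape and dtype inspection, a dot product, reshaping, a matrix-vector product, and a dtype conversion. The matrix `m` is an illustrative addition that is not part of the example above.

```python
import torch

x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5, 6])

# Every tensor has a shape and a dtype; integer literals give int64
print(x.shape, x.dtype)

# Dot product: 1*4 + 2*5 + 3*6 = 32
dot = torch.dot(x, y)

# Build a 2x3 matrix [[0, 1, 2], [3, 4, 5]] and multiply it by x
m = torch.arange(6).reshape(2, 3)
mv = m @ x  # rows give 0+2+6 = 8 and 3+8+15 = 26

# mean() needs a floating-point tensor, so convert first
mean_x = x.to(torch.float32).mean()

print("Dot product:", dot.item())
print("Matrix-vector product:", mv.tolist())
print("Mean of x:", mean_x.item())
```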


### **Simple Neural Network**
This example defines tensors, builds a simple linear model (a single layer with no activation), computes a mean squared error loss, and uses autograd to compute gradients. These are the basic building blocks of a neural network in PyTorch.

```python
import torch

# Random data for inputs (features) and outputs (targets)
features = torch.randn((20, 3))  # 20 samples, 3 features
targets = torch.randn((20, 1))  # 20 samples, 1 target

# Weights and bias; requires_grad=True tells autograd to track them
weights = torch.randn((3, 1), requires_grad=True)  # 3 features to 1 output
bias = torch.randn(1, requires_grad=True)

# Simple linear model
def model(x):
    return x @ weights + bias  # @ represents matrix multiplication

# Loss function (mean squared error)
def mse(predictions, targets):
    differences = predictions - targets
    return torch.sum(differences * differences) / differences.numel()

# Forward pass: compute predictions by passing the features to the model
predictions = model(features)

# Compute and print the loss
loss = mse(predictions, targets)
print("Loss:", loss.item())

# Backpropagation: compute gradients of the loss w.r.t. weights and bias
loss.backward()
print("Gradient w.r.t weights:", weights.grad)
print("Gradient w.r.t bias:", bias.grad)
```

```text
Loss: 3.811100721359253
Gradient w.r.t weights: tensor([[-0.0261],
        [ 1.8886],
        [ 1.9473]])
Gradient w.r.t bias: tensor([-1.8974])
```
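
Computing gradients is only half of training; the parameters also have to be updated. The following is a minimal sketch of plain gradient descent on the same kind of linear model. The seed, the learning rate of `0.05`, and the 200 steps are assumptions chosen for illustration, not values from the example above.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed so the run is repeatable

features = torch.randn((20, 3))
targets = torch.randn((20, 1))
weights = torch.randn((3, 1), requires_grad=True)
bias = torch.randn(1, requires_grad=True)


def model(x):
    return x @ weights + bias


def mse(predictions, targets):
    return ((predictions - targets) ** 2).mean()


learning_rate = 0.05  # illustrative value
initial_loss = mse(model(features), targets).item()

for step in range(200):
    loss = mse(model(features), targets)
    loss.backward()  # populate weights.grad and bias.grad

    # Update parameters without recording the update in the autograd graph
    with torch.no_grad():
        weights -= learning_rate * weights.grad
        bias -= learning_rate * bias.grad

    # Gradients accumulate by default, so reset them after each step
    weights.grad.zero_()
    bias.grad.zero_()

final_loss = mse(model(features), targets).item()
print(f"Loss before training: {initial_loss:.4f}")
print(f"Loss after training:  {final_loss:.4f}")
```

In practice this manual update is usually replaced by an optimizer from `torch.optim`, but writing it out shows what each step does.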


## TensorFlow/Keras
TensorFlow is an extensive ecosystem for machine learning and deep learning, while Keras, bundled with it as `tf.keras`, is a high-level API for building and training deep learning models. \
![Tensorflow](imgs/tensorflow.png) \
https://www.tensorflow.org/



### **Installation**

```bash
pip install tensorflow
```



```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Sequential

# Load MNIST data
mnist = tf.keras.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0  # Scale pixel values to [0, 1]

# Build the Sequential model
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=5)

# Evaluate the model on the test set
model.evaluate(X_test, y_test)
```

```text
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

[0.07107871770858765, 0.9775999784469604]
```
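
`model.evaluate` returns the loss followed by each compiled metric, so the result above is a test loss of about 0.071 and a test accuracy of about 97.8%. To make the loss value less opaque, the sketch below reproduces in NumPy what a softmax output layer and a sparse categorical cross-entropy loss compute. It is an illustration under simplifying assumptions and not Keras's implementation: the `logits` and `labels` are hypothetical and use 3 classes instead of 10 for brevity.

```python
import numpy as np


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 per row."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true integer labels."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked).mean()


# Hypothetical scores for 2 samples and 3 classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
labels = np.array([0, 1])  # integer class labels, as in MNIST's y_train

probs = softmax(logits)
loss = sparse_categorical_crossentropy(probs, labels)

print("Probabilities:\n", probs)
print("Predicted classes:", probs.argmax(axis=1))
print("Loss:", loss)
```

The "sparse" variant is used because the MNIST labels are plain integers (0 to 9) rather than one-hot vectors.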

## Hugging Face Transformers

Hugging Face Transformers is a library that provides thousands of pre-trained models for text tasks such as classification, information extraction, question answering, summarization, translation, and more. It supports PyTorch, TensorFlow, and JAX as backends and offers a high-level API for working with these models. \
![Hugging Face](imgs/hugging_face4.png) \
https://huggingface.co/



### **Installation**

```bash
pip install transformers
```

### **Text Classification**
Text classification involves assigning categories to text data. Here's how to perform sentiment analysis using a pre-trained model.

```python
from transformers import pipeline

# Load the sentiment analysis pipeline
classifier = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

# Perform sentiment analysis
result = classifier("I love using Transformers for NLP tasks!")

print(result)
```

```text
[{'label': 'POSITIVE', 'score': 0.9982925057411194}]
```
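
The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. The sketch below shows one possible way to post-process such results by treating low-score predictions as uncertain. The helper `confident_labels`, the threshold of `0.9`, and the second sample entry are hypothetical; only the first entry is copied from the output above.

```python
def confident_labels(results, threshold=0.9):
    """Return each label, or 'UNCERTAIN' when the score is below the threshold."""
    return [
        r["label"] if r["score"] >= threshold else "UNCERTAIN"
        for r in results
    ]


sample_results = [
    {"label": "POSITIVE", "score": 0.9982925057411194},  # from the output above
    {"label": "NEGATIVE", "score": 0.62},  # hypothetical low-confidence result
]

labels = confident_labels(sample_results)
print(labels)  # ['POSITIVE', 'UNCERTAIN']
```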


### **Named Entity Recognition (NER)**
Named Entity Recognition is the task of classifying named entities found in a text into pre-defined categories such as persons, organizations, and locations.

```python
from transformers import pipeline

# Load the named entity recognition pipeline.
# aggregation_strategy="simple" merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True).
ner = pipeline("ner", aggregation_strategy="simple")

text = "Hugging Face is a company based in New York City. Its technology is based on Transformers."

results = ner(text)

for entity in results:
    print("Entity:", entity['word'], "| Type:", entity['entity_group'], "| Score:", entity['score'])
```

```text
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Entity: Hugging Face | Type: ORG | Score: 0.9593236
Entity: New York City | Type: LOC | Score: 0.9991713
Entity: Transformers | Type: ORG | Score: 0.99184954
```
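
Each grouped entity is a dictionary with keys such as `word`, `entity_group`, and `score`, so the results are easy to reorganize with plain Python. The sketch below groups entity names by type; the `results` list is hand-copied from the printed output above rather than produced by a live model call.

```python
from collections import defaultdict

# Entities copied from the printed output above
results = [
    {"word": "Hugging Face", "entity_group": "ORG", "score": 0.9593236},
    {"word": "New York City", "entity_group": "LOC", "score": 0.9991713},
    {"word": "Transformers", "entity_group": "ORG", "score": 0.99184954},
]

# Collect entity names under their type
by_type = defaultdict(list)
for entity in results:
    by_type[entity["entity_group"]].append(entity["word"])

print(dict(by_type))
# {'ORG': ['Hugging Face', 'Transformers'], 'LOC': ['New York City']}
```

Note that "Transformers" is tagged as an organization here, a reminder that model output should be checked rather than taken at face value.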


### **Text Generation**
Text generation involves automatically generating text based on a given prompt.

```python
from transformers import pipeline

# Load the text generation pipeline
text_generator = pipeline("text-generation")

# Generate text
result = text_generator("In the future, AI will", max_length=50)
print(result[0]['generated_text'])
```

```text
No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In the future, AI will create machines that do things from top down to the point where they can do whatever they want. The goal is to create a better understanding of the individual human being, who they actually represent. We can do this by training
```


### **Question Answering**
Question Answering systems can read through texts and provide answers to questions based on the content.

```python
from transformers import pipeline

# Load the question answering pipeline
question_answerer = pipeline("question-answering")

context = "Transformers have been adopted for a wide range of NLP tasks."

result = question_answerer(question="What can transformers be used for?", context=context)
print(result)
```

```text
No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.

{'score': 0.2838223874568939, 'start': 51, 'end': 60, 'answer': 'NLP tasks'}
```
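
This kind of model is extractive: it does not write an answer but selects a span of the context. The `start` and `end` values are character offsets into `context`, which the short sketch below verifies using the result dictionary copied from the output above (no model call is made).

```python
context = "Transformers have been adopted for a wide range of NLP tasks."

# Result copied from the printed output above
result = {"score": 0.2838223874568939, "start": 51, "end": 60, "answer": "NLP tasks"}

# Slicing the context with the offsets recovers the answer text
span = context[result["start"]:result["end"]]

print(span)                      # NLP tasks
print(span == result["answer"])  # True
```

The offsets are handy for highlighting the answer in the original text.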


### **Translation**
Translation models can translate text from one language to another.

```python
from transformers import pipeline

# Load the translation pipeline
translator = pipeline("translation_en_to_fr")

# Translate text
result = translator("Hugging Face provides easy-to-use NLP models.", max_length=40)
print(result[0]['translation_text'])
```

```text
No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on google-t5/t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.

Hugging Face propose des modèles NLP faciles à utiliser.
```
