# Introduction to Machine Learning Algorithms

In this hands-on session, we will explore different types of machine learning algorithms using scikit-learn and pandas.

## Table of Contents
1. Introduction to Machine Learning
2. Supervised Learning
   1. Linear Regression
   2. Logistic Regression
3. Unsupervised Learning
   1. K-means Clustering
4. Conclusion


# 1. Introduction to Machine Learning
Machine learning is a field of study that enables computers to learn from data and improve their performance over time without being explicitly programmed. There are various types of machine learning algorithms, broadly categorized into supervised and unsupervised learning.

# 2. Supervised Learning
In supervised learning, the algorithm learns from labeled data, where each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).
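
As a concrete picture of labeled data, the sketch below pairs each input vector with its desired output. The values are hypothetical toy numbers chosen for illustration and are not taken from any dataset used in this session.

```python
import numpy as np

# Hypothetical toy data: each row of X_toy is one input vector,
# and y_toy holds the desired output (label) for that row
X_toy = np.array([[5.1, 3.5], [6.2, 2.9], [4.9, 3.0]])
y_toy = np.array([0, 1, 0])

# A supervised learner sees the data as (input, output) pairs
pairs = list(zip(X_toy, y_toy))
for features, label in pairs:
    print(f"input={features}, desired output={label}")
```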

## 2.1 Linear Regression
Linear regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables. Let's create a simple linear regression model using scikit-learn:

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris, load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, mean_squared_error, silhouette_score
from sklearn.decomposition import PCA
import plotly.express as px
from sklearn.svm import SVC
from sklearn.manifold import TSNE

In [None]:
# Generate synthetic data for demonstration
np.random.seed(1)  # For reproducibility

# Generate feature (independent variable)
X = np.random.rand(20, 1) * 10 # 20 records, 1 feature

# Generate target (dependent variable)
# We'll use a linear relationship: y = 2x + 1 + noise
y = 2 * X.squeeze() + 1 + np.random.normal(0, 1, 20)
# Create a pandas DataFrame to organize the data
data = pd.DataFrame({'Feature': X.squeeze(), 'Target': y})

# Create and fit the linear regression model
model = LinearRegression()
model.fit(X, y);

In [None]:
# Plot the data points and linear regression line using Plotly Express
fig = px.scatter(data, x='Feature', y='Target', title='Linear Regression Model')
fig.add_scatter(x=X.squeeze(), y=model.predict(X), mode='lines', line=dict(color='red', width=2), name='Linear Regression')
fig.show()
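
Since the data was generated from y = 2x + 1 plus noise, one way to sanity-check the fit is to inspect the learned slope and intercept. The sketch below regenerates the same synthetic data and compares scikit-learn's result with an ordinary least squares solve in NumPy. The names `slope`, `intercept`, `ols_slope`, `ols_intercept`, and `mse` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Regenerate the synthetic data (same seed and same order of random calls)
np.random.seed(1)
X = np.random.rand(20, 1) * 10
y = 2 * X.squeeze() + 1 + np.random.normal(0, 1, 20)

model = LinearRegression().fit(X, y)
slope, intercept = model.coef_[0], model.intercept_

# Ordinary least squares solved directly: a column of ones models the intercept
A = np.hstack([X, np.ones_like(X)])
ols_slope, ols_intercept = np.linalg.lstsq(A, y, rcond=None)[0]

# Mean squared error of the fitted line on the training data
mse = mean_squared_error(y, model.predict(X))

print(f"slope={slope:.3f}, intercept={intercept:.3f}, training MSE={mse:.3f}")
```

The two solutions should agree up to floating-point error, and the slope should land near 2. With only 20 noisy points, the estimates will not match the generating values exactly.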

## 2.2 Logistic Regression
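Logistic regression is a supervised learning algorithm for classification. It is usually introduced for binary tasks, but scikit-learn's `LogisticRegression` also handles multiclass problems such as the three-species Iris dataset used here. Let's build a logistic regression model using scikit-learn: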

In [None]:
# Load dataset
iris = load_iris()

# Build a DataFrame with readable column names
df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
df["species_id"] = iris.target

# Map the numeric class ids to species names
mapping = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
df['species'] = df['species_id'].map(mapping)

# Scatter plot of the sepal measurements; marker size encodes petal length
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", size='petal_length', hover_data=["petal_width"], color_discrete_sequence=px.colors.qualitative.D3)
fig.show()

In [None]:
# Split the data into features and target
X_iris, y_iris = iris.data, iris.target
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

# Train the logistic regression model
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_iris, y_train_iris)

# Evaluate the model
iris_predictions = logreg.predict(X_test_iris)
iris_accuracy = accuracy_score(y_test_iris, iris_predictions)
print(f"Iris Logistic Regression Accuracy: {iris_accuracy}")

Iris Logistic Regression Accuracy: 1.0
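
An accuracy of 1.0 here is measured on only 30 test samples, so it helps to look at more than one number. The sketch below retrains the same model on the same split and shows the per-class probabilities and the confusion matrix. It is a minimal illustration; the variable names `proba` and `cm` are introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One probability per class for each test sample; each row sums to 1
proba = logreg.predict_proba(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, logreg.predict(X_test))

print(proba[:3].round(3))
print(cm)
```

Off-diagonal entries in the confusion matrix would show which species get confused with each other, which a single accuracy value hides.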


In [None]:
# New data to be added to original
new_data = [
    [5.3, 3.8, 2.3, 0.2],
    [3.3, 1.2, 1.3, 0.4],
    [7.4, 2.6, 5.0, 2.3],
    [5.4, 1.6, 2.0, 3],
    [8, 3.0, 7, 0.1],
    [7.0,3.2,5,1.2]
]
new_data = pd.DataFrame(new_data, columns=df.columns[:4])

# Add placeholder columns so the new rows have the same columns as the original for plotting
new_data["species"] = "to predict"
new_data["species_id"] = 3  # placeholder id, not a real class

# Append the new rows to the original data (concat returns a new DataFrame)
augmented_data = pd.concat([df, new_data], ignore_index=True)

# Plot the original data together with the new points
fig = px.scatter(augmented_data, x="sepal_width", y="sepal_length", color="species", size='petal_length', hover_data=["petal_width"], color_discrete_sequence=px.colors.qualitative.D3)
fig.show()

In [None]:
# Keep only the four feature columns used for training
new_data = new_data.iloc[:, :4].copy()

# Predict the category of the new data. The model was fitted on a NumPy array,
# so pass an array here as well to avoid a feature-name mismatch warning
predictions = logreg.predict(new_data.to_numpy())
new_data["species_id"] = predictions

# Map the predicted class ids (the model returns numbers) to species names
new_data['species'] = new_data['species_id'].replace(mapping)

# Plot new data with predicted category
fig = px.scatter(new_data, x="sepal_width", y="sepal_length", color="species", size='petal_length', hover_data=["petal_width"], color_discrete_sequence=px.colors.qualitative.D3)
fig.show()



# 3. Unsupervised Learning
In unsupervised learning, the algorithm learns patterns from unlabeled data. With no target values to predict, it looks for structure in the inputs themselves, such as groups of similar examples.

## 3.1 K-means Clustering
K-means clustering is a popular unsupervised learning algorithm that partitions the data into k clusters, assigning each point to the cluster with the nearest centroid. Let's apply k-means clustering to a sample dataset:
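
Before turning to the library, the core loop can be sketched in plain NumPy: alternate between assigning points to their nearest centroid and moving each centroid to the mean of its points. The function `kmeans_sketch` and the toy data are hypothetical and exist only for illustration. This is not scikit-learn's implementation.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; a hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs as toy data
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans_sketch(X_toy, k=2)
print(centroids.round(2))
```

scikit-learn's `KMeans`, used in the cells that follow, builds on the same idea and adds refinements such as k-means++ initialization.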

In [None]:
# Load the digits dataset
digits = load_digits()
X_digits, y_digits = digits.data, digits.target

# Prepare the data for display
images = digits.images
targets = digits.target

# Select a few examples to display
num_examples = 10
selected_images = images[:num_examples]
selected_targets = targets[:num_examples]

# Create a subplot for each digit
fig = px.imshow(selected_images, facet_col=0, facet_col_wrap=5, binary_string=True, labels={'facet_col': 'Digit'})
fig.update_layout(title="Examples of Digits")
fig.show()

In [None]:
# Train the K-Means clustering model
kmeans = KMeans(n_clusters=10, random_state=42)
kmeans.fit(X_digits)

# Evaluate the model
digits_silhouette_score = silhouette_score(X_digits, kmeans.labels_, random_state=42)
print(f"Digits K-Means Silhouette Score: {digits_silhouette_score}")

# Visualize the clusters for Digits dataset using t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_digits)

# Create a DataFrame for the data
data = {'t-SNE Component 1': X_tsne[:, 0], 't-SNE Component 2': X_tsne[:, 1], 'Cluster': kmeans.labels_}
df = pd.DataFrame(data)
# Treat cluster ids as categories so each cluster gets its own discrete color
df["Cluster"] = df["Cluster"].astype("str")

# Plot the clusters (the legend is titled 'Cluster' automatically)
fig = px.scatter(df, x='t-SNE Component 1', y='t-SNE Component 2', color='Cluster',
                 title='K-Means Clustering on Digits Dataset (t-SNE-reduced)')
fig.show()

Digits K-Means Silhouette Score: 0.18244258012780126
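
The silhouette score uses only the features. Because the digits dataset also ships with true labels (`y_digits` above), the clusters can additionally be compared against them after the fact. The sketch below uses the adjusted Rand index from `sklearn.metrics`. Setting `n_init=10` explicitly is an assumption made here so the result does not depend on the installed scikit-learn version's default, so the clustering may differ slightly from the cell above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

digits = load_digits()
X_digits, y_digits = digits.data, digits.target

# n_init=10 is an explicit choice for this sketch (see the note above)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42).fit(X_digits)

# ARI is 1.0 for identical partitions and close to 0 for random labelings;
# it does not depend on how the cluster ids are numbered
ari = adjusted_rand_score(y_digits, kmeans.labels_)
print(f"Adjusted Rand index vs. true digits: {ari:.3f}")
```

The labels are used only for evaluation here. The clustering itself never sees them, so the model remains unsupervised.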
