# Machine Learning

__Machine learning__ is a method of _data analysis_ that automates
_analytical model building_.
It is a branch of _artificial intelligence_ based on the idea that systems
can _learn from data_, _identify patterns_ and _make decisions_ with
minimal human intervention.

## Scikit-learn (sklearn, SciPy, NumPy)

__Scikit-learn__ is a _Python package_ that provides a wide range of _machine learning algorithms_ and tools. 
It is built on top of _NumPy_ and _SciPy_, works alongside _Matplotlib_ for plotting, and is designed to be simple and efficient for data analysis and modeling.

__Scikit-learn__ offers various modules for tasks such as _classification_, _regression_, _clustering_, _dimensionality reduction_, and _model selection_.
It also provides utilities for _preprocessing data_, _evaluating models_, and _handling datasets_.

With its extensive documentation and user-friendly interface, __Scikit-learn__ is widely used in the field of machine learning and data science.
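
As a rough illustration of that interface, the sketch below runs preprocessing, model fitting, and evaluation through the same `fit` / `transform` / `predict` pattern. The bundled iris dataset and the `LogisticRegression` classifier are assumptions chosen only for this example; they are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data only: the iris dataset bundled with scikit-learn
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Preprocessing: transformers expose fit / transform
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Modeling: estimators expose fit / predict
demo_clf = LogisticRegression(max_iter=1000)
demo_clf.fit(X_tr_scaled, y_tr)

# Evaluation: metrics compare true labels with predictions
demo_accuracy = accuracy_score(y_te, demo_clf.predict(X_te_scaled))
print(f"Accuracy: {demo_accuracy:.3f}")
```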

In [1]:
!pip install scikit-learn numpy pandas
import numpy as np
import pandas as pd
import sklearn


## Datasets

### Load Titanic Dataset - Kaggle Competition

URL: https://www.kaggle.com/competitions/titanic

This is a beginner-friendly dataset often used to practice machine learning techniques. It contains information about the passengers of the Titanic, such as their age, gender, class, and whether they survived.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

titanic_df = pd.read_csv("Titanic.csv")

# Keep only the target and the selected features
columns = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch"]
titanic_df = titanic_df[columns].copy()

# Encode sex
titanic_df["Sex"] = titanic_df["Sex"].map({"male": 0, "female": 1})

# Display the first few rows of the dataset
print(titanic_df.head())

# Split datasets
X = titanic_df.drop("Survived", axis=1)
y = titanic_df["Survived"]

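# Scale features to zero mean and unit variance.
# Note: Age has missing values, and StandardScaler keeps them as NaN.
scaler_titanic = StandardScaler()
X = scaler_titanic.fit_transform(X)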

print(X)

X_train_titanic, X_test_titanic, y_train_titanic, y_test_titanic = train_test_split(
    X, y, test_size=0.2, random_state=42
)

   Survived  Pclass  Sex   Age  SibSp  Parch
0         0       3    0  22.0      1      0
1         1       1    1  38.0      1      0
2         1       3    1  26.0      0      0
3         1       1    1  35.0      1      0
4         0       3    0  35.0      0      0
[[ 0.82737724 -0.73769513 -0.53037664  0.43279337 -0.47367361]
 [-1.56610693  1.35557354  0.57183099  0.43279337 -0.47367361]
 [ 0.82737724  1.35557354 -0.25482473 -0.4745452  -0.47367361]
 ...
 [ 0.82737724  1.35557354         nan  0.43279337  2.00893337]
 [-1.56610693 -0.73769513 -0.25482473 -0.4745452  -0.47367361]
 [ 0.82737724 -0.73769513  0.15850313 -0.4745452  -0.47367361]]
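
The cell above fits `StandardScaler` on the full feature matrix before splitting, so the test rows contribute to the scaling statistics. A hedged sketch of a common alternative is to split first and fit the scaler on the training rows only. The arrays `X_toy` and `y_toy` are made-up stand-ins generated for this example, not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, for illustration only)
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=30.0, scale=10.0, size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

# 1. Split first
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training rows only
scaler_toy = StandardScaler()
X_train_scaled = scaler_toy.fit_transform(X_train_toy)

# 3. Reuse the training statistics to transform the test rows
X_test_scaled = scaler_toy.transform(X_test_toy)

print(X_train_scaled.mean(axis=0))  # approximately zero per column
print(X_test_scaled.mean(axis=0))   # near zero, but not exactly
```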


### Load Possum Dataset (Regression) - Kaggle Dataset

URL: https://www.kaggle.com/datasets/abrambeyer/openintro-possum

This is a beginner-friendly dataset often used to practice regression techniques. It contains measurements of possums, such as their age, head length, skull width, and other physical attributes.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

possum_df = pd.read_csv("possum.csv")

# Keep only the selected columns
columns = ["site", "Pop", "sex", "hdlngth", "skullw", "totlngth", "taill", "footlgth", "earconch", "eye", "chest", "belly", "age"]
possum_df = possum_df[columns].copy()

# encode Pop
possum_df["Pop"] = possum_df["Pop"].astype("category").cat.codes

# encode sex
possum_df["sex"] = possum_df["sex"].map({"m": 0, "f": 1})

# remove NaN
possum_df = possum_df.dropna()

# display the first rows of the dataset
print(possum_df.head())

# Split datasets
X = possum_df.drop("age", axis=1)
y = possum_df["age"]

# Scale features
scaler_possum = StandardScaler()
X = scaler_possum.fit_transform(X)

X_train_possum, X_test_possum, y_train_possum, y_test_possum = train_test_split(
    X, y, test_size=0.2, random_state=42
)


## Supervised Machine Learning

### Linear Regression with Least Squares

__Linear regression__ is a type of _regression analysis_ used for predicting the value of a _continuous dependent variable_. It works by finding the _linear function of the input features_ (a straight line when there is a single feature) that _best fits the data_.

_Least squares_ is a method for finding the _best-fitting_ line by __minimizing__ the _sum of the squared differences_ between the predicted and actual values.
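
A minimal sketch of that idea on synthetic data: the coefficients that minimize the sum of squared residuals can be computed directly with `np.linalg.lstsq` and compared with what `LinearRegression` returns. The data (`x_demo`, `y_demo_ls`) and the true slope and intercept of 2 and 1 are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2x + 1 plus Gaussian noise (illustrative values)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=50)
y_demo_ls = 2.0 * x_demo + 1.0 + rng.normal(scale=0.5, size=50)

# Design matrix: one column for x, one column of ones for the intercept
A = np.column_stack([x_demo, np.ones_like(x_demo)])

# Solve min_w ||A w - y||^2, the least-squares problem
coef, residuals, rank, _ = np.linalg.lstsq(A, y_demo_ls, rcond=None)
slope, intercept = coef

# The same fit through scikit-learn
reg = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo_ls)

print(f"lstsq:   slope={slope:.3f}, intercept={intercept:.3f}")
print(f"sklearn: slope={reg.coef_[0]:.3f}, intercept={reg.intercept_:.3f}")
```

Both approaches solve the same minimization problem, so the two printed lines should agree up to floating-point error.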

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Creating the Linear Regression model
linear_regressor = LinearRegression()

# Fitting the model with the training data
linear_regressor.fit(X_train_possum, y_train_possum)

# Making predictions on the test set
y_pred = linear_regressor.predict(X_test_possum)

# evaluate model
mse = mean_squared_error(y_test_possum, y_pred)
rmse = np.sqrt(mse)

print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

In [None]:
# Test a single input. Columns: site, Pop, sex, hdlngth, skullw, totlngth, taill, footlgth, earconch, eye, chest, belly
new_possum = pd.DataFrame({
    'site': [1],       # Site location
    'Pop': [0],        # Population 
    'sex': [0],        # Male (0) / Female (1)
    'hdlngth': [94],   # Head length
    'skullw': [60],    # Skull width
    'totlngth': [89],  # Total length
    'taill': [36],     # Tail length
    'footlgth': [74],  # Foot length
    'earconch': [54],  # Ear conch length
    'eye': [15],       # Eye measurement
    'chest': [29],     # Chest measurement
    'belly': [35]      # Belly measurement
})

# Scale the new data using the same scaler
new_data = scaler_possum.transform(new_possum)

# Predict age
predicted_age = linear_regressor.predict(new_data)
print(f"\nPredicted possum age: {predicted_age[0]} years")

### Random Forest

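__Random forest__ is an _ensemble learning_ method that combines
_multiple decision trees_ to create a strong predictive model.

It trains each tree on a _bootstrap sample_ of the training data, considering a random subset of features at each split, and then combines the trees' predictions (averaging for regression, voting or averaged class probabilities for classification) to _reduce overfitting_.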
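
The averaging step can be checked directly. The sketch below, which assumes a synthetic dataset from `make_classification` rather than the Titanic data, fits a small forest and compares its class probabilities with the mean of the probabilities predicted by its individual trees (available through the fitted `estimators_` attribute).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (illustrative only)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=42)

forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X_syn, y_syn)

# Class probabilities predicted by each individual tree:
# shape (n_trees, n_samples, n_classes)
tree_probas = np.stack(
    [tree.predict_proba(X_syn) for tree in forest.estimators_]
)

# Average over the trees
mean_proba = tree_probas.mean(axis=0)

# Compare with the forest's own probabilities
print(np.allclose(mean_proba, forest.predict_proba(X_syn)))
```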

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Train the model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_titanic, y_train_titanic)

# Make predictions
y_pred = clf.predict(X_test_titanic)

# Calculate metrics
accuracy = accuracy_score(y_test_titanic, y_pred)
f1 = f1_score(y_test_titanic, y_pred, average="weighted")

print(f"Accuracy: {accuracy}")
print(f"F1 Score: {f1}")

In [None]:
# Test a single value. Columns: Pclass, Sex, Age, SibSp, Parch
new_passenger = pd.DataFrame({
    'Pclass': [1],
    'Sex': [0], 
    'Age': [45],
    'SibSp': [1],
    'Parch': [0]
})

# Scale the input with the scaler fitted on the Titanic features
new_data = scaler_titanic.transform(new_passenger)

output = clf.predict(new_data)
print(f"The predicted target for the new input is: {output[0]}")

### Neural Networks

__Neural networks__ are a type of _machine learning_ model inspired by
the _human brain_.

They consist of _layers of interconnected nodes_ that process input data
and produce output data.
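
To make "layers of interconnected nodes" concrete, here is a hedged NumPy sketch of a single forward pass through a network with one hidden layer of 8 ReLU units and one sigmoid output. The weights are random and untrained, and the input batch is synthetic; all names (`W1`, `b1`, `W2`, `b2`, `X_batch`) are hypothetical and exist only in this example.

```python
import numpy as np


def relu(z):
    """Rectified linear unit: max(0, z) element-wise."""
    return np.maximum(0.0, z)


def sigmoid(z):
    """Logistic function, squashing values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
n_features, n_hidden = 5, 8

# Randomly initialized (untrained) weights and zero biases
W1 = rng.normal(scale=0.5, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
b2 = np.zeros(1)

# A synthetic batch of 4 samples with 5 features each
X_batch = rng.normal(size=(4, n_features))

# Each layer is a weighted sum followed by an activation
hidden = relu(X_batch @ W1 + b1)
output_nn = sigmoid(hidden @ W2 + b2)

print(output_nn.shape)  # one value in (0, 1) per sample
```

Training adjusts `W1`, `b1`, `W2`, and `b2` so that these outputs move toward the true labels; the sketch only shows how data flows through the layers.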

In [None]:
#!pip install tensorflow
#!pip install keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# Build the neural network: one hidden layer and a sigmoid output for binary classification
model = Sequential(
    [
        Input(shape=(X_train_titanic.shape[1],)),
        Dense(8, activation="relu"),
        Dense(1, activation="sigmoid"),
    ]
)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

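# Age contains missing values (NaN after scaling), which would make the loss NaN.
# The features are standardized, so replacing NaN with 0 amounts to mean imputation.
X_train_nn = np.nan_to_num(X_train_titanic, nan=0.0)
X_test_nn = np.nan_to_num(X_test_titanic, nan=0.0)

# Train the model
history = model.fit(
    X_train_nn,
    y_train_titanic,
    epochs=10,
    batch_size=32,
    validation_split=0.2,
    verbose=1,
)

# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test_nn, y_test_titanic, verbose=0)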
print(f"Neural Network Test Accuracy: {test_accuracy:.4f}")

In [None]:
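# Test a single value. Columns: Pclass, Sex, Age, SibSp, Parch
new_passenger = pd.DataFrame(
    {"Pclass": [1], "Sex": [0], "Age": [3], "SibSp": [1], "Parch": [1]}
)

# Scale the input with the same scaler used for the training data
new_data = scaler_titanic.transform(new_passenger)

output = model.predict(new_data)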
predicted_class = (output > 0.5).astype(int)[0][0]
print(f"The predicted target for the new input is: {predicted_class}")