# Machine Learning

__Machine learning__ is a method of _data analysis_ that automates
_analytical model building_.
It is a branch of _artificial intelligence_ based on the idea that systems
can _learn from data_, _identify patterns_ and _make decisions_ with
minimal human intervention.

### Scikit-learn (sklearn, SciPy, NumPy)

__Scikit-learn__ is a _Python package_ that provides a wide range of _machine learning algorithms_ and tools. 
It is built on top of _NumPy_, _SciPy_, and _Matplotlib_, and is designed to be simple and efficient for data analysis and modeling.

__Scikit-learn__ offers various modules for tasks such as _classification_, _regression_, _clustering_, _dimensionality reduction_, and _model selection_.
It also provides utilities for _preprocessing data_, _evaluating models_, and _handling datasets_.

With its extensive documentation and user-friendly interface, __Scikit-learn__ is widely used in the field of machine learning and data science.
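
As a rough illustration of that interface, the sketch below uses scikit-learn's built-in iris dataset (an assumption made here for self-containment; it is not one of the datasets used in this notebook) to show the `fit` / `transform` / `predict` / `score` pattern that most estimators share.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, used here only to illustrate the API
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Transformers expose fit / transform
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Estimators expose fit / predict / score
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)
print(f"Test accuracy: {model.score(X_test_scaled, y_test):.3f}")
```

The same calls apply to the classifiers and regressors used later in this notebook; only the estimator class changes.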

In [5]:
#!pip install scikit-learn numpy pandas
import numpy as np
import pandas as pd
import sklearn

## Datasets

### Load Titanic Dataset - Kaggle Competition

URL: https://www.kaggle.com/competitions/titanic

This is a beginner-friendly dataset often used to practice machine learning techniques. It contains information about the passengers of the Titanic, such as their age, gender, class, and whether they survived.

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

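titanic_df = pd.read_csv("Titanic.csv")
# Note: dropna() runs before the column selection, so rows with a missing
# value in ANY column (including unused ones such as Cabin) are dropped
titanic_df = titanic_df.dropna()

# Keep only the target and the selected features
columns = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch"]
titanic_df = titanic_df[columns].copy()

# Encode sex as 0 (male) / 1 (female)
titanic_df["Sex"] = titanic_df["Sex"].map({"male": 0, "female": 1})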

# Display the first few rows of the dataset
print(titanic_df.head())

# Split datasets
X = titanic_df.drop("Survived", axis=1)
y = titanic_df["Survived"]

# Scale values (note: the scaler is fitted on all rows here, before the split)
scaler_titanic = StandardScaler()
X = scaler_titanic.fit_transform(X)

print(X)

X_train_titanic, X_test_titanic, y_train_titanic, y_test_titanic = train_test_split(
    X, y, test_size=0.2, random_state=42
)

    Survived  Pclass  Sex   Age  SibSp  Parch
1          1       1    1  38.0      1      0
3          1       1    1  35.0      1      0
6          0       1    0  54.0      0      0
10         1       3    1   4.0      1      1
11         1       1    1  58.0      0      0
[[-0.37225618  1.03901177  0.14906507  0.83362754 -0.63172982]
 [-0.37225618  1.03901177 -0.0432295   0.83362754 -0.63172982]
 [-0.37225618 -0.96245301  1.17463611 -0.7230443  -0.63172982]
 [ 3.52047984  1.03901177 -2.03027338  0.83362754  0.69708118]
 [-0.37225618  1.03901177  1.43102886 -0.7230443  -0.63172982]
 [ 1.57411183 -0.96245301 -0.10732769 -0.7230443  -0.63172982]
 [-0.37225618 -0.96245301 -0.49191683 -0.7230443  -0.63172982]
 [-0.37225618 -0.96245301 -1.06880054  3.94697121  2.02589219]
 [-0.37225618  1.03901177  0.85414516  0.83362754 -0.63172982]
 [-0.37225618 -0.96245301  1.87971619 -0.7230443   0.69708118]
 [-0.37225618 -0.96245301  0.5977524   0.83362754 -0.63172982]
 [ 1.57411183  1.03901177 -0.42
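
The cell above fits the scaler on all rows before splitting, so the test rows influence the scaling statistics. A common alternative is to split first and let a pipeline fit the scaler on the training split only. The following is a minimal sketch of that idea on synthetic data (the `make_classification` data and the choice of `n_neighbors=5` are assumptions for illustration, not the notebook's setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with five features (assumption: not the real Titanic data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# Split first, so the test rows never influence the scaler statistics
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training split only,
# then applies the same scaling when predicting or scoring
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipeline.fit(X_train, y_train)
print(f"Hold-out accuracy: {pipeline.score(X_test, y_test):.3f}")
```

With a pipeline, new inputs can be passed unscaled to `pipeline.predict`, which removes the need to call the scaler by hand.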

### Load Possum Regression Dataset - Kaggle

URL: https://www.kaggle.com/datasets/abrambeyer/openintro-possum

This is a beginner-friendly dataset often used to practice regression techniques. It contains measurements of possums, such as their age, head length, skull width, and other physical attributes.

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

possum_df = pd.read_csv("possum.csv")

# Keep the selected feature columns and the target (age)
columns = ["site", "Pop", "sex", "hdlngth", "skullw", "totlngth", "taill", "footlgth", "earconch", "eye", "chest", "belly", "age"]
possum_df = possum_df[columns].copy()

# encode Pop
possum_df["Pop"] = possum_df["Pop"].astype("category").cat.codes

# encode sex
possum_df["sex"] = possum_df["sex"].map({"m": 0, "f": 1})

# remove NaN
possum_df = possum_df.dropna()
print(possum_df.describe())


# display the first rows of the dataset
print(possum_df.head())

# Split datasets
X = possum_df.drop("age", axis=1)
y = possum_df["age"]

# Scale features
scaler_possum = StandardScaler()
X = scaler_possum.fit_transform(X)

X_train_possum, X_test_possum, y_train_possum, y_test_possum = train_test_split(
    X, y, test_size=0.2, random_state=42
)


             site         Pop         sex     hdlngth      skullw    totlngth  \
count  101.000000  101.000000  101.000000  101.000000  101.000000  101.000000   
mean     3.673267    0.574257    0.415842   92.730693   56.960396   87.269307   
std      2.366892    0.496921    0.495325    3.518714    3.102679    4.196802   
min      1.000000    0.000000    0.000000   82.500000   50.000000   75.000000   
25%      1.000000    0.000000    0.000000   90.700000   55.000000   84.500000   
50%      4.000000    1.000000    0.000000   92.900000   56.400000   88.000000   
75%      6.000000    1.000000    1.000000   94.800000   58.100000   90.000000   
max      7.000000    1.000000    1.000000  103.100000   68.600000   96.500000   

            taill    footlgth    earconch         eye       chest       belly  \
count  101.000000  101.000000  101.000000  101.000000  101.000000  101.000000   
mean    37.049505   68.398020   48.133663   15.050495   27.064356   32.638614   
std      1.971681    4.4135

## Supervised Machine Learning

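### K-Nearest Neighbors

__K-nearest neighbors__ (KNN) is a _classification_ method that assigns each sample the _majority class_ among its _k closest_ training samples.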


In [17]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn_titanic = KNeighborsClassifier(n_neighbors=21)
knn_titanic.fit(X_train_titanic, y_train_titanic)

y_pred_titanic = knn_titanic.predict(X_test_titanic)
accuracy_titanic = accuracy_score(y_test_titanic, y_pred_titanic)
print(f"Titanic KNN Accuracy: {accuracy_titanic}")



Titanic KNN Accuracy: 0.8378378378378378
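
To illustrate roughly what a k-nearest-neighbors classifier does when it predicts, here is a small from-scratch sketch in NumPy. The function name `knn_predict` and the toy data are hypothetical; scikit-learn's `KNeighborsClassifier` uses more efficient neighbor search and its own tie-breaking, so this is not its actual implementation.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Label each query row by majority vote of its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)
    predictions = []
    for row in X_query:
        # Euclidean distance from this query row to every training row
        distances = np.linalg.norm(X_train - row, axis=1)
        # Indices of the k closest training rows
        nearest = np.argsort(distances)[:k]
        # Majority vote among the neighbours' labels
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


# Toy data: two well-separated clusters (hypothetical values)
X_toy = np.array(
    [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
)
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_predictions = knn_predict(X_toy, y_toy, [[0.05, 0.05], [5.0, 5.1]], k=3)
print(toy_predictions)  # [0 1]
```

Because the vote depends on distances, features on larger scales dominate unless the inputs are standardized first, which is why the features are scaled before fitting.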


### Linear Regression with Least Squares

__Linear regression__ is a type of _regression analysis_ used for predicting the value of a _continuous dependent variable_. It works by finding the _line that best fits the data_.

_Least squares_ is a method for finding the _best-fitting_ line by __minimizing__ the _sum of the squared differences_ between the predicted and actual values.
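
As a minimal sketch of the least-squares idea, the example below fits a line to five hypothetical points with `np.linalg.lstsq` and compares its sum of squared errors with that of another candidate line. The data values and the helper `sum_squared_errors` are assumptions made for illustration.

```python
import numpy as np

# Toy data lying close to the line y = 2x + 1 (hypothetical values)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix: one column for x, one column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# Solve min ||A @ coef - y||^2 for coef = (slope, intercept)
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coef


def sum_squared_errors(m, b):
    """Sum of squared differences between the line m*x + b and y."""
    return float(np.sum((m * x + b - y) ** 2))


print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"SSE of fitted line:    {sum_squared_errors(slope, intercept):.4f}")
print(f"SSE of line y = 2x + 1: {sum_squared_errors(2.0, 1.0):.4f}")
```

No other line can have a smaller sum of squared errors on these points than the fitted one; that is the property least squares optimizes.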

In [18]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Creating the Linear Regression model
linear_regressor = LinearRegression()

# Fitting the model with the training data
linear_regressor.fit(X_train_possum, y_train_possum)

# Making predictions on the test set
y_pred = linear_regressor.predict(X_test_possum)

# evaluate model
mse = mean_squared_error(y_test_possum, y_pred)
rmse = np.sqrt(mse)

print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

Mean Squared Error (MSE): 4.713181569510544
Root Mean Squared Error (RMSE): 2.1709863126032243
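
For reference, the two metrics can be computed by hand. The sketch below uses four hypothetical ages and predictions (not values from the possum dataset) and also computes the MSE of a naive baseline that always predicts the mean, which gives a point of comparison for judging a model's error.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])  # hypothetical ages in years
y_hat = np.array([2.0, 5.0, 4.0, 6.0])   # hypothetical predictions

# MSE: mean of the squared differences -> (1 + 0 + 4 + 1) / 4 = 1.5
errors = y_hat - y_true
mse_manual = np.mean(errors ** 2)

# RMSE: square root of the MSE, expressed in the target's units (years)
rmse_manual = np.sqrt(mse_manual)

# Baseline: always predict the mean of the true values
baseline_mse = np.mean((y_true - y_true.mean()) ** 2)

print(f"MSE: {mse_manual}, RMSE: {rmse_manual:.4f}, baseline MSE: {baseline_mse}")
```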


In [20]:
# Test a single input. Columns: site, Pop, sex, hdlngth, skullw, totlngth, taill, footlgth, earconch, eye, chest, belly
new_possum = pd.DataFrame({
    'site': [1],       # Site location
    'Pop': [0],        # Population 
    'sex': [0],        # Male (0) / Female (1)
    'hdlngth': [94],   # Head length
    'skullw': [60],    # Skull width
    'totlngth': [89],  # Total length
    'taill': [36],     # Tail length
    'footlgth': [74],  # Foot length
    'earconch': [54],  # Ear conch length
    'eye': [15],       # Eye measurement
    'chest': [29],     # Chest measurement
    'belly': [35]      # Belly measurement
})

# Scale the new data using the same scaler
new_data = scaler_possum.transform(new_possum)

# Predict age
predicted_age = linear_regressor.predict(new_data)
print(f"\nPredicted possum age: {predicted_age[0]} years")


Predicted possum age: 4.025402384673814 years


### Random Forest

__Random forest__ is an _ensemble learning_ method that combines
_multiple decision trees_ to create a strong predictive model.

It works by training each tree on a _bootstrap sample_ of the data with
_random feature subsets_, then averaging the trees' predictions to
_reduce overfitting_.
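
The idea can be sketched by hand with individual decision trees. The example below is a simplified illustration on synthetic data (the dataset, the 25 trees, and the variable names are assumptions); `RandomForestClassifier` handles all of this internally and with more options.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stand-in for a real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

rng = np.random.default_rng(0)
trees = []
for seed in range(25):
    # Bootstrap: draw training rows with replacement
    rows = rng.integers(0, len(X_fit), size=len(X_fit))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    tree.fit(X_fit[rows], y_fit[rows])
    trees.append(tree)

# Average the per-tree class probabilities, then take the most likely class
# (the classes are 0 and 1, so the argmax index equals the class label)
avg_proba = np.mean([tree.predict_proba(X_eval) for tree in trees], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
print(f"Ensemble accuracy: {np.mean(ensemble_pred == y_eval):.3f}")
```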

In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Train the model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_titanic, y_train_titanic)

# Make predictions
y_pred = clf.predict(X_test_titanic)

# Calculate metrics
accuracy = accuracy_score(y_test_titanic, y_pred)
f1 = f1_score(y_test_titanic, y_pred)

print(f"Accuracy: {accuracy}")
print(f"F1 Score: {f1}")

Accuracy: 0.7837837837837838
F1 Score: 0.8260869565217391
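
Accuracy counts all correct predictions, while the F1 score is the harmonic mean of precision and recall for the positive class. A minimal sketch of that calculation, using a hypothetical helper `precision_recall_f1` and toy labels:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels (hypothetical): 3 true positives, 1 false positive, 1 false negative
y_true_toy = [1, 1, 1, 0, 0, 1]
y_pred_toy = [1, 0, 1, 1, 0, 1]

precision, recall, f1_toy = precision_recall_f1(y_true_toy, y_pred_toy)
print(precision, recall, f1_toy)  # 0.75 0.75 0.75
```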


In [28]:
# test single value. Columns: Pclass  Sex   Age  SibSp  Parch
new_passenger = pd.DataFrame({
    'Pclass': [3],
    'Sex': [0], 
    'Age': [55],
    'SibSp': [1],
    'Parch': [0]
})

# Scale the input with the scaler fitted on the Titanic features
new_data = scaler_titanic.transform(new_passenger)

output = clf.predict(new_data)
print(f"The predicted target for the new input is: {output[0]}")

The predicted target for the new input is: 0


### Neural Networks

__Neural networks__ are a type of _machine learning_ model inspired by
the _human brain_.

They consist of _layers of interconnected nodes_ that process input data
and produce output data.
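
To make the layer idea concrete, the sketch below runs a forward pass through a tiny network with 5 inputs, 8 hidden ReLU units, and 1 sigmoid output, written in plain NumPy. The weights are random and untrained (an assumption for illustration), so the outputs carry no meaning; the point is only how data flows through the layers.

```python
import numpy as np


def relu(z):
    """Rectified linear unit: keeps positive values, zeroes out negatives."""
    return np.maximum(0.0, z)


def sigmoid(z):
    """Squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)

# Randomly initialised, untrained parameters (hypothetical values)
W1 = rng.normal(scale=0.5, size=(5, 8))  # input -> hidden weights
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))  # hidden -> output weights
b2 = np.zeros(1)


def forward(X):
    """One forward pass: linear layer + ReLU, then linear layer + sigmoid."""
    hidden = relu(X @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)


X_batch = rng.normal(size=(3, 5))  # three samples with five features each
probabilities = forward(X_batch)
predicted_classes = (probabilities > 0.5).astype(int)
print(probabilities.ravel(), predicted_classes.ravel())
```

Training adjusts `W1`, `b1`, `W2`, and `b2` so that the output probabilities match the labels; libraries such as Keras automate that step.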

In [None]:
#!pip install tensorflow
#!pip install keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# Build the neural network: one hidden layer (8 units) and a sigmoid output
model = Sequential(
    [
        Input(shape=(X_train_titanic.shape[1],)),
        Dense(8, activation="relu"),
        Dense(1, activation="sigmoid"),
    ]
)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train the model
history = model.fit(
    X_train_titanic,
    y_train_titanic,
    epochs=10,
    batch_size=32,
    validation_split=0.2,
    verbose=1,
)

# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test_titanic, y_test_titanic, verbose=0)
print(f"Neural Network Test Accuracy: {test_accuracy:.4f}")

In [None]:
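# Test a single value. Columns: Pclass, Sex, Age, SibSp, Parch
new_passenger = pd.DataFrame(
    {"Pclass": [1], "Sex": [0], "Age": [3], "SibSp": [1], "Parch": [1]}
)

# Scale the input with the same scaler used for the training features;
# the network was trained on scaled values
new_data = scaler_titanic.transform(new_passenger)

output = model.predict(new_data)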
predicted_class = (output > 0.5).astype(int)[0][0]
print(f"The predicted target for the new input is: {predicted_class}")