# Welcome to the DeepLearning NoteBooks series

Hi, I’m Elio — a CS student at AUB (Class of 2027), passionate about AI and Deep Learning.  
This notebook is the **starting point of my Deep Learning series**, where I refine and share what I’ve learned to:

- Build a stronger foundation in ML/DL  
- Recap & solidify concepts from the Deep Learning Specialization  
- Showcase practical skills through structured notebooks
- Provide a clear, step-by-step resource for others to **understand the mathematical concepts and theory** behind already-implemented methods

This is more than a recap of what I learned from the Deep Learning Specialization by DeepLearning.AI: it provides what the reader needs to go beyond **libraries** and understand how things really work **under the hood**.

# What to expect throughout this series of NoteBooks
I will be going over foundational DL concepts, explaining the mathematics behind each one, its implementation (mostly with NumPy only), and how it works.

The series will consist of the following notebooks:

00. Welcome-and-preprocessing 
01. Logistic-Regression

# 1 - Data Preprocessing
Data preprocessing is at the core of **machine learning**. A key step in training an accurate model is having good data to train it on.

When it comes to data, this notebook covers **three different types**:
1. Images
2. Texts
3. Numeric/Tabular

We are going to go through **the preprocessing of each of these three types of data**.

# 1. Images Preprocessing

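To preprocess images, we have to do the following:
- Find the dimensions and shapes we are dealing with **(m_train, m_test, num_px, num_channels)**, where each image has shape **(num_px, num_px, 3)**
- Reshape the data so that each example becomes a vector of size **(num_px * num_px * 3)**, stored as one column of the data matrix
- Standardize the data by dividing every pixel value by 255, which scales the values to the range [0, 1]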

In [None]:
# Images Preprocessing

def data_dimensions(train_x, test_x):
    m_train = train_x.shape[0]
    m_test = test_x.shape[0]
    num_px = train_x[0].shape[0]
    num_channels = train_x[0].shape[2]  # 3 channels for RGB images

    return {'m_train': m_train, 'm_test': m_test, 'num_px': num_px, 'num_channels': num_channels}

def reshape_data(train_x, test_x):
    train_x_flatten = train_x.reshape(train_x.shape[0], -1).T   # -1 infers the product of the remaining dimensions
    test_x_flatten = test_x.reshape(test_x.shape[0], -1).T

    return (train_x_flatten, test_x_flatten)

def standardize_data(train_x_flatten, test_x_flatten):
    train_x_std = train_x_flatten / 255.0
    test_x_std = test_x_flatten / 255.0

    return (train_x_std, test_x_std)
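
The three steps above can be tried end to end on synthetic data. This is a minimal sketch that repeats the same operations inline; the toy shapes (5 training and 2 test images of 4x4 pixels) and the random pixel values are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 5 training and 2 test RGB images of 4x4 pixels.
train_x = rng.integers(0, 256, size=(5, 4, 4, 3), dtype=np.uint8)
test_x = rng.integers(0, 256, size=(2, 4, 4, 3), dtype=np.uint8)

# Step 1: read the dimensions from the array shapes.
m_train, num_px, _, num_channels = train_x.shape
m_test = test_x.shape[0]

# Step 2: flatten each image into one column of size num_px * num_px * 3.
train_flat = train_x.reshape(m_train, -1).T
test_flat = test_x.reshape(m_test, -1).T

# Step 3: scale the pixel values from [0, 255] to [0, 1].
train_scaled = train_flat / 255.0
test_scaled = test_flat / 255.0

print(train_flat.shape)  # (48, 5) because 4 * 4 * 3 = 48
print(test_flat.shape)   # (48, 2)
```

Each column of `train_flat` holds one flattened image, so the number of columns equals the number of examples.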

# 2. Text Preprocessing

To preprocess text, we have to do the following steps:
- Tokenize the text **(sentence -> sequence of integers)**
- Pad the sequences to the same length

The intuition behind each of these two steps:
- Computers do not understand characters; they process numbers. That is why we tokenize the text and turn each sentence into a sequence of integers. **Tokenization maps each unique word to a unique integer ID.**
- Neural networks expect all inputs in a batch to have the same shape, but sentences differ in length. That is why we **pad the sentences to a fixed length**.
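
To see what these two steps do under the hood, here is a simplified sketch in plain Python. The helper names (`build_vocab`, `encode`, `pad`) are hypothetical, and the conventions used (ID 0 reserved for padding, ID 1 for out-of-vocabulary words, IDs assigned in order of first appearance, whitespace splitting) are assumptions of this sketch. A library tokenizer may assign IDs differently.

```python
def build_vocab(sentences, oov_token="<OOV>"):
    """Map each unique lowercase word to an integer ID (0 is kept for padding)."""
    vocab = {oov_token: 1}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(sentences, vocab, oov_token="<OOV>"):
    """Replace every word by its ID; unknown words get the OOV ID."""
    oov_id = vocab[oov_token]
    return [[vocab.get(word, oov_id) for word in s.lower().split()]
            for s in sentences]


def pad(sequences, max_len=None, pad_value=0):
    """Pad at the end (and truncate at the end) so all sequences share one length."""
    if max_len is None:
        max_len = max(len(s) for s in sequences)
    return [s[:max_len] + [pad_value] * (max_len - len(s)) for s in sequences]


train_sentences = ["I love deep learning", "I love math"]
vocab = build_vocab(train_sentences)

# "pizza" was not seen when building the vocabulary, so it maps to the OOV ID.
sequences = encode(train_sentences + ["I love pizza"], vocab)
padded = pad(sequences)

print(vocab)
print(padded)  # [[2, 3, 4, 5], [2, 3, 6, 0], [2, 3, 1, 0]]
```

The vocabulary is built from the training sentences only, which is why the unseen word ends up as the OOV ID instead of receiving a new ID.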

In [None]:
# Text Preprocessing

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def Tokenize(texts, fitted_tokenizer=None, oov_token='<OOV>'):
    """
    Convert a list of sentences into integer sequences.

    Args:
        texts (list[str]): Sentences to tokenize.
        fitted_tokenizer (Tokenizer, optional): Existing fitted tokenizer to reuse.
            If None, a new tokenizer is created and fitted on `texts`.
        oov_token (str): Token for out-of-vocabulary words (used if fitted_tokenizer=None).

    Returns:
        dict: {'tokenizer': the fitted or reused Tokenizer,
               'sequences': list[list[int]] of integer sequences}
    """
    if fitted_tokenizer is not None:
        # Reuse the existing vocabulary as-is (e.g. for test data); do not refit
        tokenizer = fitted_tokenizer
    else:
        tokenizer = Tokenizer(num_words=None, oov_token=oov_token)
        tokenizer.fit_on_texts(texts)

    sequences = tokenizer.texts_to_sequences(texts)

    return {'tokenizer': tokenizer, 'sequences': sequences}

def Pad(sequences, max_len=None, padding='post'):
    """
    Pad sequences so that they all have the same length.

    Args:
        sequences (list[list[int]]): Tokenized sentences.
        max_len (int, optional): Target length. Shorter sequences are padded to it and longer ones are truncated to it
                                 (Keras truncates from the start of the sequence by default).
                                 If None, every sequence is padded to the length of the longest one.
        padding (str): 'post' pads at the end of each sequence, 'pre' pads at the start.

    Returns:
        padded (ndarray): Padded sequences (truncated/padded if max_len is not None)
    """

    # Forward the arguments instead of hard-coding them
    padded = pad_sequences(sequences, maxlen=max_len, padding=padding)

    return padded

# 3. Tabular/Numeric Preprocessing

To preprocess numeric/tabular data (`df`: a pandas DataFrame), we have to do the following steps:
- Drop rows with **NaN** values
- Apply one-hot encoding to categorical feature columns. **Each category gets its own column: 1 in the column of the row's category and 0 in the rest.**
(OR)
- Apply label encoding. **The labels stay in a single column, and each class name is replaced by a unique integer (each label = unique integer).**
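
The difference between the two encodings can be shown with NumPy alone. This is a minimal sketch on a hypothetical label array; it illustrates the idea and is not how scikit-learn implements its encoders.

```python
import numpy as np

# Hypothetical categorical column.
labels = np.array(["cat", "dog", "bird", "dog"])

# Label encoding: each class name becomes one integer.
# np.unique returns the sorted classes and, for every entry, the index of its class.
classes, label_encoded = np.unique(labels, return_inverse=True)

# One-hot encoding: row i of the identity matrix is the one-hot vector of class i.
one_hot = np.eye(len(classes), dtype=int)[label_encoded]

print(classes)        # ['bird' 'cat' 'dog']
print(label_encoded)  # [1 2 0 2]
print(one_hot)
```

Label encoding keeps a single column of integers, while one-hot encoding produces one column per class, here three.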

In [None]:
# Tabular/Numeric Preprocessing

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

def dropNan(df):
    """
    Drop from the dataframe rows that have any NaN values

    Args:
        df(pd.DataFrame): DataFrame (table)

    Returns:
        pd.DataFrame: DataFrame without NaN values
    """

    return df.dropna()

def one_hot_encoding(df, categorical_cols):
    """
    Apply one-hot encoding to categorical feature columns

    Args:
        df(pd.DataFrame): DataFrame without NaN values
        categorical_cols(list[str]): Names of the categorical columns to encode

    Returns:
        encoded (ndarray): One-hot encoded values of the selected columns
        encoder (OneHotEncoder): Fitted encoder (for reuse later)
    """

    encoder = OneHotEncoder(sparse_output=False)  # sparse_output requires scikit-learn >= 1.2
    encoded = encoder.fit_transform(df[categorical_cols])

    return encoded, encoder

def label_encoding(labels):
    """
    Encode string labels (train_y / test_y) into integers.
    
    Args:
        labels (pd.Series or list): Target labels.
    
    Returns:
        encoded (ndarray): Integer-encoded labels.
        encoder (LabelEncoder): Fitted label encoder (for reuse later).
    """
    
    encoder = LabelEncoder()
    encoded = encoder.fit_transform(labels)
    
    return encoded, encoder
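
The encoders are returned so they can be reused later, for example on test data. The sketch below shows that workflow, assuming pandas and scikit-learn are installed. The toy DataFrame, its column names, and the `handle_unknown="ignore"` setting are illustrative choices that are not part of the functions above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training table with two incomplete rows.
train_df = pd.DataFrame({
    "color": ["red", "blue", None, "red", "blue"],
    "size": [1.0, 2.0, 3.0, np.nan, 5.0],
})
clean_df = train_df.dropna()  # keeps only the 3 complete rows

# Fit the encoder on the training data only.
# handle_unknown="ignore" encodes categories unseen during fit as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(clean_df[["color"]]).toarray()

# Reuse the fitted encoder on new data; "green" never appeared in training.
test_df = pd.DataFrame({"color": ["red", "green"]})
test_encoded = encoder.transform(test_df[["color"]]).toarray()

print(encoder.categories_[0])  # column order of the encoded output
print(train_encoded)
print(test_encoded)
```

Fitting on the training data and only transforming the test data keeps the column layout identical between the two sets.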