
| **Model**              | **Strengths**                                                                 | **Weaknesses**                                                         | **When to Use**                                      | **Later Improvements**                                                                                             |
|-------------------------|-------------------------------------------------------------------------------|------------------------------------------------------------------------|-----------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
| **Perceptron (1958)**   | Simple, efficient for linearly separable data.                               | Cannot solve non-linear problems (e.g., XOR).                          | Basic linearly separable problems.                 | Multi-Layer Perceptron (MLP) introduced hidden layers for non-linear problems.                                    |
| **MLP (1986)**          | Solves non-linear problems using hidden layers and backpropagation.          | Overfitting, requires manual feature engineering.                      | Classification, regression with structured data.   | CNNs for handling image data.                                                                                    |
| **CNN (1998)**          | Excels in image processing, automatic spatial hierarchy detection.           | Struggles with sequential data, high computational needs.              | Image classification, object detection.            | Recurrent models (RNNs) for sequential data.                                                                     |
| **RNN (1986)**          | Processes sequential data, maintains memory of previous inputs.              | Vanishing/exploding gradients, struggles with long dependencies.       | Time-series forecasting, NLP tasks.                | LSTM and GRU to handle long-term dependencies.                                                                   |
| **LSTM (1997)**         | Handles long-term dependencies, reduces vanishing gradients.                 | Computationally expensive, requires careful tuning.                    | Machine translation, speech recognition.           | GRUs introduced simpler architectures with similar performance.                                                  |
| **GRU (2014)**          | Simplified version of LSTM, fewer parameters, faster training.               | May underperform LSTM on some complex tasks.                           | Sequential tasks where training cost or latency matters. | Attention-based models like Transformers.                                                                        |
| **Transformers (2017)** | Handles long-range dependencies efficiently, parallel processing.            | High data and computational requirements.                              | NLP tasks, advanced image tasks (e.g., ViT).       | Specialized models like BERT, GPT.                                                                               |
| **BERT (2018)**         | Bidirectional context understanding, effective for multiple NLP tasks.       | Resource-intensive fine-tuning, often too slow for latency-sensitive applications without compression. | Text classification, extractive summarization, question answering. | GPT specialized in text generation and few-shot learning.                                                        |
| **GPT (2018)**          | Excels in text generation, few-shot and zero-shot learning.                  | Requires extensive compute, prone to generating plausible but incorrect content. | Content creation, chatbots, story generation.       | Scaling models (GPT-2, GPT-3, GPT-4).                                                                            |
| **Diffusion Models (2015; DDPM 2020)** | High-quality image generation, stable and controllable outputs. | Slow inference, high computational resources.                          | Image generation, artistic designs, medical imaging. | Optimization for faster inference.                                                                               |
| **ViT (2020)**          | Adapts Transformers for images, better on large-scale datasets.              | Requires more data than CNNs.                                          | Advanced image recognition tasks.                   | Hybrid CNN-ViT models, fine-tuning for smaller datasets.                                                         |


### **1. General Evolution Themes**
- **Neural Networks**:
  - Began as biologically inspired but quickly diverged to purely mathematical models.
  - Each new architecture focuses on solving a limitation of its predecessor (e.g., from linear separability to handling sequential data).

- **Model Scaling**:
  - Recent models like GPT-3 and GPT-4 focus on scaling parameters, datasets, and computational power.
  - While scaling improves performance, it also raises concerns about accessibility and environmental impact.
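
The linear-separability limit mentioned above can be seen in a few lines. The following is a minimal sketch in plain Python (the names `train_perceptron` and `accuracy` are illustrative, not taken from any library): a single perceptron trained with the classic error-correction rule fits AND, which is linearly separable, but cannot classify all four XOR points correctly, no matter how long it trains.

```python
def predict(w, b, x1, x2):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Train a two-input perceptron with the error-correction update rule."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            err = target - predict(w, b, x1, x2)
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b


def accuracy(samples, w, b):
    """Fraction of samples the trained perceptron classifies correctly."""
    hits = sum(predict(w, b, x1, x2) == target for (x1, x2), target in samples)
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

acc_and = accuracy(AND, *train_perceptron(AND))  # separable: reaches 1.0
acc_xor = accuracy(XOR, *train_perceptron(XOR))  # not separable: stays below 1.0
print(acc_and, acc_xor)
```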

---

### **2. Emerging Trends**
- **Foundation Models**:
  - Models like GPT, BERT, and DALL-E are pre-trained on massive datasets and then fine-tuned for specific tasks.
  - They enable transfer learning, saving time and resources for downstream applications.

- **Unified Architectures**:
  - Research is moving toward models that handle both vision and language tasks (e.g., Flamingo, CLIP).
  - The goal is to unify tasks that were traditionally solved by separate architectures.

- **Efficient Models**:
  - Lightweight architectures (e.g., MobileNet, DistilBERT) are designed for deployment on devices with limited resources like mobile phones.

---

### **3. Key Drawbacks to Note**
- **Overfitting**:
  - Most deep models are prone to overfitting on small datasets.
  - Solutions include regularization techniques, data augmentation, and dropout layers.

- **Data Requirements**:
  - Deep learning models typically require large amounts of labeled data, which can be challenging to obtain in specialized fields like medical imaging.

- **Interpretability**:
  - Neural networks are often criticized as "black boxes."
  - Methods like SHAP and LIME help interpret models but are still an active area of research.
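
Of the remedies listed under Overfitting, dropout is the easiest to show in isolation. Below is a minimal sketch of inverted dropout in plain Python, assuming a flat list of activations; the helper name `dropout` and its parameters are illustrative, and deep learning frameworks apply the same idea to whole tensors.

```python
import random


def dropout(activations, rate, training=True, rng=None):
    """Inverted dropout: zero each unit with probability `rate` during training
    and scale the survivors by 1 / (1 - rate) so the expected value is unchanged."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # inference: pass activations through untouched
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


acts = [0.5, 1.0, 1.5, 2.0]
train_out = dropout(acts, rate=0.5, rng=random.Random(0))  # some units zeroed, rest doubled
eval_out = dropout(acts, rate=0.5, training=False)         # identical to the input
print(train_out, eval_out)
```

Scaling at training time, rather than at inference time, means the network can be used for prediction without any extra adjustment.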

---

### **4. Decision-Making Tips**
- **Task-Specific Models**:
  - Use CNNs for image-related tasks and Transformers for text.
  - For sequence-to-sequence tasks, start with RNNs or LSTMs but consider Transformers for state-of-the-art results.

- **Data Availability**:
  - If you have limited data, pre-trained models (like BERT or GPT) are ideal since they leverage transfer learning.

- **Computational Resources**:
  - Consider lightweight models for mobile applications or edge devices.
  - Training large models like GPT requires clusters of GPUs or TPUs, which most teams access through cloud providers.

---

### **5. Best Practices**
- **Hyperparameter Tuning**:
  - Always optimize learning rates, batch sizes, and architecture-specific parameters.
- **Evaluation Metrics**:
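  - For text tasks, consider BLEU (translation), ROUGE (summarization), or perplexity (language modeling).
  - For image tasks, use accuracy or F1-score for classification and IoU for detection and segmentation.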
- **Deployment**:
  - Use frameworks like TensorFlow Lite or ONNX for efficient deployment of trained models.
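
Two of the metrics named above are short enough to write out by hand. The sketch below, in plain Python, assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples and that perplexity is computed from the probabilities a model assigned to the correct tokens; the function names are illustrative.

```python
import math


def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))    # overlap area 1, union area 7
ppl = perplexity([0.25, 0.25, 0.25, 0.25])   # uniform guess over 4 options
print(iou, ppl)
```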

---

### **Interesting Applications Across Models**
- **CNNs in Medicine**: Early cancer detection using CNN-based models in histopathology images.
- **Transformers in Finance**: Predicting stock trends using BERT for text analysis of market news.
- **Diffusion Models in Art**: AI-generated art for creative industries using diffusion-based architectures.


### **1. CNNs in Medicine**
- **Example Use Case**: **Histopathology Image Analysis**
  - CNNs have been employed to detect cancerous cells in histopathology images, which can reduce the screening workload for pathologists.
  - A model might classify an image as "cancerous" or "non-cancerous" based on patterns learned from previously labeled slides.
- **Techniques**:
  - Transfer learning with pre-trained networks like ResNet or EfficientNet.
  - Data augmentation to handle small datasets common in medical applications.
- **Challenges**:
  - Class imbalance: Cancerous images might be rarer than normal images.
  - Interpretability: CNNs need to provide visual evidence (e.g., heatmaps using Grad-CAM).
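
One common way to counter the class imbalance noted above is to weight the loss by inverse class frequency. The sketch below uses the "balanced" heuristic `n_samples / (n_classes * class_count)`; the function name and the 90/10 label split are hypothetical, chosen only for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so rare classes contribute more to the loss."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical dataset: 90 normal slides, 10 cancerous slides.
labels = ["normal"] * 90 + ["cancerous"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The resulting dictionary can typically be passed to a training loop as per-class loss weights, so that each misclassified rare example counts for more.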

---

### **2. Transformers in Finance**
- **Example Use Case**: **Sentiment Analysis for Stock Prediction**
  - BERT models analyze financial news headlines to determine market sentiment.
  - Positive sentiment might correlate with a stock price increase, and negative sentiment with a decrease.
- **Techniques**:
  - Fine-tuning BERT on financial datasets like news articles and earnings reports.
  - Combining sentiment predictions with time-series models (e.g., ARIMA or LSTMs) for better predictions.
- **Challenges**:
  - Ambiguity in language: Financial texts can be subtle and context-specific.
  - Noise in data: Not all news headlines impact stock prices.

---

### **3. Diffusion Models in Art**
- **Example Use Case**: **AI-Generated Artwork**
  - Diffusion models, like Stable Diffusion, are trained by progressively adding noise to images and learning to reverse that process; at generation time they start from pure noise and denoise it step by step.
  - Artists use these models to create abstract designs or enhance creativity.
- **Techniques**:
  - Conditional diffusion to guide the output (e.g., generating a painting in the style of Van Gogh).
  - Using text prompts for multimodal applications (e.g., "A futuristic city at sunset").
- **Challenges**:
  - Computational cost: Inference is slow and requires significant GPU power.
  - Ethical concerns: Plagiarism risks when replicating artistic styles.
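
The "add noise" half of the process described above has a closed form that is easy to sketch. The code below is a minimal illustration in plain Python, assuming a linear noise schedule with 1000 steps and betas from 1e-4 to 0.02 (commonly used defaults, not values from this document) and a toy three-value "image"; all function names are illustrative. A denoising network would then be trained to predict `eps` from `xt` and `t`.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + noise * e for x, e in zip(x0, eps)]
    return xt, eps


alpha_bars = cumulative_alphas(linear_betas(1000))
x0 = [0.2, -0.7, 1.0]  # toy "image" of three pixel values
xt, eps = add_noise(x0, t=500, alpha_bars=alpha_bars, rng=random.Random(0))
print(alpha_bars[0], alpha_bars[-1])  # close to 1 at the start, close to 0 at the end
```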

---

### **4. RNNs and LSTMs in Healthcare**
- **Example Use Case**: **Electronic Health Record (EHR) Analysis**
  - Sequential models like LSTMs analyze patient histories to predict potential illnesses.
  - For instance, predicting a heart attack risk based on historical data like cholesterol levels and ECG readings.
- **Techniques**:
  - Embedding categorical data (e.g., disease codes) and feeding it into the LSTM.
  - Attention mechanisms for emphasizing critical parts of the sequence.
- **Challenges**:
  - Missing data: EHRs often have incomplete entries.
  - Scalability: Large hospitals generate terabytes of sequential data.
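
For the missing-data challenge above, one simple option is to carry the last observed value forward and give the model a mask that marks which entries were actually measured. This is a sketch of that idea only, in plain Python; the function name, the default fill value, and the cholesterol readings are hypothetical.

```python
def forward_fill_with_mask(values, default=0.0):
    """Replace None with the last observed value (or `default` before any
    observation) and return a 0/1 mask marking which entries were observed."""
    filled, mask = [], []
    last = default
    for v in values:
        if v is None:
            mask.append(0)
        else:
            last = v
            mask.append(1)
        filled.append(last)
    return filled, mask


# Hypothetical cholesterol readings, one per visit; None means not measured.
cholesterol = [None, 190.0, None, None, 210.0]
filled, observed = forward_fill_with_mask(cholesterol)
print(filled, observed)
```

Feeding both `filled` and `observed` into an LSTM lets the model distinguish a real measurement from an imputed one.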

---

### **5. Lightweight Models**
- **Example Use Case**: **Mobile Applications**
  - MobileNets target real-time vision tasks such as object detection in mobile apps (for example, camera features of the kind found in Snapchat or Instagram).
  - Applications include AR filters, live translations, and personal assistants.
- **Techniques**:
  - Quantization to reduce model size.
  - Knowledge distillation to transfer knowledge from a large model to a smaller one.
- **Challenges**:
  - Trade-off between accuracy and latency.
  - Hardware-specific optimizations might be necessary.
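
To make the quantization technique above concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It assumes a single scale per weight list and no zero point, which is one of several schemes in use; the function names and sample weights are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


weights = [0.82, -0.33, 0.05, -1.27, 0.6]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most half the scale per weight.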

---

### **Common Practical Tips**
1. **Explainability Tools**:
   - Use tools like **SHAP** or **LIME** to explain black-box models.
   - For CNNs, visualize feature maps to understand what the model "sees."
2. **Regularization Techniques**:
   - Dropout, weight decay, and data augmentation can prevent overfitting.
3. **Hyperparameter Optimization**:
   - Use tools like Optuna or scikit-learn's GridSearchCV to tune critical hyperparameters.
4. **Cloud and Hardware**:
   - Leverage GPUs or TPUs for faster training of large models.
   - For on-device inference, optimize models using TensorFlow Lite or ONNX.
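
The idea behind exhaustive grid search, as referenced in the hyperparameter tip above, fits in a few lines of standard-library Python. This is a sketch, not how Optuna or GridSearchCV are implemented: `grid_search` and `toy_validation_loss` are hypothetical names, and the toy objective stands in for "train a model and return its validation loss".

```python
import itertools


def grid_search(objective, grid):
    """Evaluate every combination in `grid` and return the one with the lowest score."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = objective(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_loss(learning_rate, batch_size):
    """Stand-in objective, minimized at learning_rate=0.01 and batch_size=32."""
    return (learning_rate - 0.01) ** 2 + ((batch_size - 32) / 100) ** 2


grid = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [16, 32, 64]}
best_params, best_score = grid_search(toy_validation_loss, grid)
print(best_params, best_score)
```

The cost grows multiplicatively with each added hyperparameter, which is why samplers such as those in Optuna are often preferred for larger search spaces.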

In [8]:
import tensorflow as tf
import tensorflow_hub as hub
# Imported for its side effect: registers the custom ops the BERT preprocessor needs.
import tensorflow_text as text

In [22]:
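# Preprocessing layer: tokenizes raw strings and pads them to a fixed length of 128.
preprocessor = hub.KerasLayer("https://kaggle.com/models/tensorflow/bert/TensorFlow2/en-uncased-preprocess/3")
# URL of the BERT-base encoder (12 layers, hidden size 768, 12 attention heads); loaded in a later cell.
encoder = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"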

In [23]:
text_test = ['nice movie', 'i dont love']
text_preprocessor = preprocessor(text_test)
text_preprocessor.keys()

dict_keys(['input_mask', 'input_type_ids', 'input_word_ids'])

In [18]:
text_preprocessor['input_mask']
# First row: [CLS] nice movie [SEP] -> four 1s; every remaining position is padding (0).

<tf.Tensor: shape=(2, 128), dtype=int32, numpy=
array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>
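
The mask printed above can be reproduced by hand, which helps clarify what the preprocessor returns. The sketch below is a simplified stand-in in plain Python, not the actual preprocessing layer: it skips tokenization and truncation and takes token IDs directly. The IDs used (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding, 3835 and 3185 for "nice" and "movie") are the ones that appear in the notebook's `input_word_ids` output; the function name `pad_and_mask` is hypothetical.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs seen in the notebook output


def pad_and_mask(token_ids, seq_length=128):
    """Wrap token IDs in [CLS] ... [SEP], pad to seq_length, and build the mask."""
    ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    if len(ids) > seq_length:
        raise ValueError("sequence too long; this sketch does not truncate")
    n_pad = seq_length - len(ids)
    input_word_ids = ids + [PAD_ID] * n_pad
    input_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_word_ids, input_mask


word_ids, mask = pad_and_mask([3835, 3185])  # "nice", "movie"
print(word_ids[:6], mask[:6])
```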

In [16]:
text_preprocessor['input_type_ids']

<tf.Tensor: shape=(2, 128), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>

In [17]:
text_preprocessor['input_word_ids']

<tf.Tensor: shape=(2, 128), dtype=int32, numpy=
array([[ 101, 3835, 3185,  102,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0],
       [ 101, 1045, 2123, 2102, 2293,  102,    0,    0,    0,    0,    0,
           0,

In [25]:
# Load the encoder from its URL and run it on the preprocessed batch.
bert_model = hub.KerasLayer(encoder)
bert_result = bert_model(text_preprocessor)
bert_result.keys()

dict_keys(['default', 'encoder_outputs', 'pooled_output', 'sequence_output'])

In [26]:
# Pooled output: one 768-dimensional vector per input sentence.
bert_result['pooled_output']

<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.82307976, -0.33021632,  0.05804335, ...,  0.15002407,
        -0.5304848 ,  0.84959537],
       [-0.9289333 , -0.51373696, -0.9104153 , ..., -0.68788904,
        -0.7519929 ,  0.9477711 ]], dtype=float32)>

In [27]:
# Sequence output: one 768-dimensional vector per token position (128 per sentence).
bert_result['sequence_output']

<tf.Tensor: shape=(2, 128, 768), dtype=float32, numpy=
array([[[-0.04972774, -0.16134027, -0.0449734 , ..., -0.19434878,
          0.15166706,  0.07051323],
        [ 0.40138033, -0.71619904,  0.7960528 , ..., -0.34071213,
          0.01467809, -0.282048  ],
        [ 0.15849896, -0.13599616, -0.31225622, ..., -0.31121084,
         -0.10993739, -0.5535932 ],
        ...,
        [-0.07264662, -0.2278554 ,  0.50889313, ..., -0.02908365,
          0.10880288,  0.17614342],
        [-0.35652015, -0.61217856,  0.10107186, ...,  0.1591494 ,
          0.01639703, -0.17355028],
        [-0.0866785 , -0.28234932,  0.47228336, ...,  0.03270615,
          0.10365032,  0.07470024]],

       [[-0.37575817,  0.59689146, -0.4135615 , ..., -0.51438975,
          0.4503662 ,  0.61310536],
        [-0.22726893,  0.6257317 , -0.00740789, ..., -0.16128698,
          0.39208385,  0.7126612 ],
        [ 0.09385063,  0.762618  ,  0.32807058, ..., -0.4738144 ,
          0.47743925,  0.163544  ],
        ...,