### 1. Install Required Libraries
You can use the following command to install all necessary libraries:

```python
    ! pip install transformers keras keras-nlp spacy nltk pandas tensorflow
```

### 2. Explanation of Each Library
Here’s a breakdown of why each library is important for your training task:

- **`keras`**: 
  - **Purpose**: This library is widely used for building and training neural network models, particularly high-level model creation, data processing, and training loops.
  - **Relevance**: `Gemma2_2b` can leverage `keras` for neural network architecture and training processes.

- **`keras-nlp`**:
  - **Purpose**: It is an NLP-focused library that extends `keras` with tools for text processing, tokenization, sequence modeling, and transformers.
  - **Relevance**: `keras-nlp` provides specific tools for text processing, making it easier to preprocess Marathi data for language models.

- **`transformers`**: (`But I will use the keras to download the model and work on it`)
  - **Purpose**: From Hugging Face, this library provides pre-trained language models, tokenizers, and the infrastructure for fine-tuning models for specific NLP tasks.
  - **Relevance**: If `Gemma_2b` uses transformer-based architecture, you can utilize `transformers` for tokenization and transformer model layers.

- **`spacy`**:
  - **Purpose**: This NLP library is useful for advanced text preprocessing, including tokenization, POS tagging, and custom language model handling.
  - **Relevance**: `spacy` supports many languages and allows you to add Marathi-specific custom pipelines. You can use it for initial Marathi text tokenization and NLP preprocessing.

- **`nltk`**:
  - **Purpose**: The Natural Language Toolkit (NLTK) offers a wide range of text processing libraries and utilities, especially for tokenization, stemming, and lexical resources.
  - **Relevance**: NLTK can be used for basic text processing tasks on Marathi text, such as tokenization and cleaning, and is particularly useful if you need additional preprocessing.

- **`pandas`**:
  - **Purpose**: For data manipulation, cleaning, and structuring, `pandas` is crucial for handling datasets and dataframes.
  - **Relevance**: Use `pandas` to load, preprocess, and batch your Marathi data, making it easy to integrate into the training pipeline.

- **`tensorflow`**:
  - **Purpose**: As a deep learning framework, `tensorflow` provides core functionalities for model training, custom layers, and model optimization.
  - **Relevance**: Since `keras` is integrated with `tensorflow`, you’ll use it for low-level operations, custom training functions, and GPU/TPU acceleration, which is essential for efficient training of large models like `Gemma_2b`.

### Next Steps for Training `Gemma_2b` in Marathi
Once the libraries are installed:
1. **Preprocess Marathi Text**: Use `spacy` and `nltk` to tokenize and clean Marathi text data.
2. **Load and Structure Data**: Use `pandas` to organize and structure the data, loading it into appropriate formats for model training.
3. **Model Initialization**: Initialize the `Gemma_2b` model with `keras` and `tensorflow`.
4. **Training and Evaluation**: Use `keras` and `tensorflow` for training, `keras-nlp` for NLP utilities, and consider utilizing `transformers` if Gemma_2b relies on transformer-based layers or architectures.


In [1]:
! pip install keras keras-nlp tensorflow Pandas # spacy nltk transformers



In [2]:
!pip install -q -U keras keras-nlp

[0m

1. **Environment Setup**:
   - This code configures the environment to use TensorFlow as the backend for Keras and allows the XLA Python client to use the full GPU memory. This can help optimize the performance of Gemma model.

2. **Marathi Language Dataset**:
   - Prepare or obtain a Marathi language dataset for fine-tuning. You may look for datasets like CC100 Marathi, IndicCorp, or any custom dataset with Marathi text.

3. **Preprocessing**:
   - Tokenize Marathi text using a tokenizer compatible with the Gemma_2b architecture, which might involve custom preprocessing if Marathi is unsupported by default.
   - Remove unnecessary symbols, normalize text, and ensure your tokenizer can handle Marathi script.

4. **Model Training Configuration**:
   - **Hyperparameters**: Define hyperparameters based on your grid search strategy, such as learning rate, batch size, and number of epochs.
   - **Training Framework**: Use a framework that supports transformers and efficient GPU utilization for training (e.g., PyTorch Lightning, Hugging Face Trainer API).
   
5. **Evaluation and Validation**:
   - Create a Marathi language validation set to monitor performance metrics.
   - Regularly evaluate metrics such as loss, accuracy, and perplexity during training and tune hyperparameters as necessary.

6. **Saving and Deployment**:
   - Save the trained model checkpoints periodically and consider post-processing for deployment, especially if it will be used for inference in Marathi.

In [3]:
import os
os.environ["KERAS_BACKEND"] = "tensorflow"
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"

these lines of code are used to install and enable interactive widgets in Jupyter notebooks and configure PyTorch to manage GPU memory more efficiently.

In [4]:
! pip install --upgrade ipywidgets
! jupyter nbextension enable --py widgetsnbextension
! export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: [32mOK[0m


### Explanation of Each Library:

1. **`keras`**: Essential for building and training deep learning models, including the Gemma_2b model.
2. **`pandas`**: Handles data loading and preprocessing tasks, enabling efficient dataset management.
3. **`keras_nlp`**: Adds NLP-specific components within Keras, such as tokenizers and embedding
4. **`tensorflow`**: To get the optimizer like adam, adamw like this
5. **`mixed_precision`**: To set the global policy to float16

In [5]:
# Gemma Model Training Script
# Purpose: Train the Gemma_2b model in Marathi Language
# Step 1: Import essential libraries for data processing, NLP, and model training
import keras
import pandas as pd
import keras_nlp 
import tensorflow as tf
from tensorflow.keras import mixed_precision
policy = mixed_precision.Policy('float16')
mixed_precision.set_global_policy(policy)

- **Loading the Model**: The from_preset method is used to load a pre-trained model. The preprocessor parameter is set to None, indicating that the model will not apply any preprocessing to the input data.

- **Model Summary**: The summary method provides a detailed overview of the model's architecture, which is useful for understanding the structure and complexity of the model.

In [6]:
gemma2_2b_en = keras_nlp.models.GemmaCausalLM.from_preset('gemma2_instruct_2b_en')
gemma2_2b_en.summary()

normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


## **Before the train of model**

First we check output with respect to this sentence `'tell me how photosynthesis work in marathi langauge'`

In [7]:
gemma2_2b_en.generate("tell me how photosynthesis work in marathi langauge")

I0000 00:00:1733683209.533128      23 service.cc:145] XLA service 0x5a5a08978770 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1733683209.533193      23 service.cc:153]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
I0000 00:00:1733683209.533200      23 service.cc:153]   StreamExecutor device (1): Tesla T4, Compute Capability 7.5
I0000 00:00:1733683231.020445      23 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


'tell me how photosynthesis work in marathi langauge.\n\n**पणासाठी प्रकाशसंश्लेषण हा एक महत्त्वाचा प्रक्रिया आहे.**\n\n**प्रकाशसंश्लेषणाचा कार्यक्रम:**\n\n* **प्रकाश:** प्रकाशसंश्लेषणात प्रकाश किरणे वापरतात.\n* **पौधे:** पौधे प्रकाशसंश्लेषण करण्यासाठी वापरतात.\n* **पौधेच्या पाण्याची किंमत:** पौधेच्या पाण्याची किंमत प्रकाशसंश्लेषणात वापरते.\n* **कार्बन डाइऑक्साइड:** कार्बन डाइऑक्साइड पौधेच्या पाण्याच्या किंमतीतून प्राप्त होते.\n* **ऑक्सीजन:** प्रकाशसंश्लेषणात ऑक्सीजन निर्माण होते.\n\n**प्रकाशसंश्लेषणाचा कार्यक्रम:**\n\n1. **प्रकाश किरणे पौध्याच्या पातळीवर येतात.**\n2. **प्रकाश किरणे पौध्याच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर पौधेच्या पातळीवर प

## **Same input**
**We can see that in marathi Language it just repeat multipile time it make no sence where else in English we can see that it differ for marathi language ** Input as `'The Maharashtra government has announced a new policy.'`

In [8]:
gemma2_2b_en.generate("tell me how photosynthesis work")

"tell me how photosynthesis work\n\n## Photosynthesis: Turning Sunlight into Sugar\n\nPhotosynthesis is the incredible process by which plants, algae, and some bacteria convert light energy into chemical energy in the form of sugars. It's like a solar-powered factory, using sunlight, water, and carbon dioxide to create food for itself and release oxygen as a byproduct.\n\n**Here's a simplified breakdown of the process:**\n\n**1. Capturing Sunlight:**\n   - Plants have special structures called chloroplasts, which contain a green pigment called chlorophyll.\n   - Chlorophyll absorbs light energy, primarily in the red and blue wavelengths, and reflects green light (which is why plants appear green).\n\n**2. Light-Dependent Reactions:**\n   - The absorbed light energy is used to split water molecules into hydrogen ions (H+), electrons (e-), and oxygen gas (O2).\n   - The electrons are passed along an electron transport chain, releasing energy that is used to generate ATP (adenosine tripho

# Here, we can see that there is a significant difference between Marathi and English. Therefore, we trained the model in Marathi. We observed that it produces different results for the same sentence.

This code sets up the Gemma model with a sequence length of 512, enables LoRA with a rank of 4, and configures the optimizer using AdamW with a specified learning rate and weight decay. The model is then compiled with the optimizer, loss function, and metrics, and is ready for training with the provided input data.

In [9]:
gemma2_2b_en.sequence_length = 512
gemma2_2b_en.backbone.enable_lora(rank=4) 

gemma2_2b_en.summary()

optimizer = tf.keras.optimizers.AdamW(
    learning_rate=1e-5,
    weight_decay=0.01,
)

optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma2_2b_en.compile(optimizer=optimizer,
                     loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                     weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()])

We preprocess the articles to remove HTML tags, website links, duplicate values, and other unnecessary elements. The cleaned data is then loaded into a pandas DataFrame and split into training and testing sets. This preprocessing helps reduce the loss function and improves the accuracy of the model.

In [10]:
MarathiData=pd.read_csv('/kaggle/input/marathi-new-article-dataset/Marathi_dataset.csv')

We split the data part into the train and test to train the model and make it more better then previos version

In [11]:
Train_data=MarathiData['sentence'][:50_000]

##### The purpose of this code is to monitor the training process and automatically stop training once the model reaches an accuracy of 95%. This can save time and computational resources by preventing unnecessary training beyond the desired accuracy threshold.

In [12]:
class MyCallBack(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if (logs.get('sparse_categorical_accuracy')>.97):
            print(f"\nReached 95% accuracy so cancelling training!")
            self.model.stop_training=True
callback=MyCallBack()

### Let's run fit the model in marathi text and much more understabel

In [13]:
gemma2_2b_en.fit(x=Train_data, epochs=1, batch_size=1, callbacks=[callback])

W0000 00:00:1733683350.986270      83 assert_op.cc:38] Ignoring Assert operator compile_loss/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/assert_equal_1/Assert/Assert




[1m50000/50000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m36760s[0m 733ms/step - loss: 0.3374 - sparse_categorical_accuracy: 0.4473


<keras.src.callbacks.history.History at 0x7e7ea836efb0>

### Now we save the model get the better in marathi language 

In [14]:
gemma2_2b_en.save('gemma2_2b_mr.keras', include_optimizer=True)

### **After train** of the model output with repect to this input or different input of prompt

In [15]:
gemma2_2b_en.generate("tell me how photosynthesis work in marathi language")

'tell me how photosynthesis work in marathi language.\n\n**पणासाचे प्रकाशसंश्लेषण**\n\nपणासाचे प्रकाशसंश्लेषण हा एक महत्त्वाचा प्रक्रिया आहे ज्यात पणासाचे सूर्यप्रकाश वापरून त्यांच्यासाठी खूपच आवश्यक पोषक तत्वे तयार करतात. या प्रक्रियात, पणासाचे सूर्यप्रकाश, जल आणि कार्बन डाइऑक्साइड वापरून, त्यांच्यासाठी खूपच आवश्यक पोषक तत्वे तयार करतात. \n\n**प्रक्रिया:**\n\n1. **सूर्यप्रकाश:** पणासाच्या पाद्यांमध्ये सूर्यप्रकाश पडतो.\n2. **जल आणि कार्बन डाइऑक्साइड:** पणासाच्या पाद्यांमध्ये जल आणि कार्बन डाइऑक्साइड येतात.\n3. **प्रकाशसंश्लेषण:** सूर्यप्रकाश, जल आणि कार्बन डाइऑक्साइड वापरून पणासाच्या पाद्यांमध्ये पोषक तत्वे तयार करतात.\n\n**पोषक तत्वे:**\n\n* **ऑक्सीजन:** पणासाच्या पाद्यांमध्ये ऑक्सीजन तयार होतो.\n* **कार्बन डाइऑक्साइड:** पणासाच्या पाद्यांमध्ये कार्बन डाइऑक्साइड वापरून पोषक तत्वे तयार करतात.\n* **पोटेशियम:** पणासाच्या पाद्यांमध्ये पोटेशियम तयार होतो.\n* **न्यूट्रिएंट:** पणासाच्या पाद्यांमध्ये न्यूट्रिएंट तयार होतो.\n\n**उपयोग:**\n\n* पणासाच्या पाद्यांमध्ये पोषक तत्वे तयार करून त्यांच्