#  Natural Language Processing with Hugging Face Transformers

This notebook introduces **NLP basics** using the Hugging Face `transformers` library.  
We will cover:
1. Setting up the environment (Drive, GPU, installing libraries)  
2. Tokenization (breaking text into tokens for models)  
3. Using a pre-trained language model (GPT-2)  
4. Generating text with the model  

---


##  Step 1: Mount Google Drive

Google Colab's local storage is temporary: files written to the runtime are lost when the session ends. To **save files permanently**, we connect Colab to our Google Drive.  
Once Drive is mounted, models, datasets, and outputs can be saved directly to it.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
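
Once the mount succeeds, Drive behaves like an ordinary folder. The sketch below shows one way to write a text file there; the helper name `save_text` and the folder `nlp_outputs` are hypothetical, and the default path assumes Drive was mounted at `/content/drive` as in the cell above.

```python
from pathlib import Path


def save_text(text, filename, base_dir="/content/drive/MyDrive/nlp_outputs"):
    """Write `text` to `base_dir/filename` and return the resulting path.

    `nlp_outputs` is a hypothetical folder name; any folder under
    /content/drive/MyDrive would work the same way.
    """
    out_dir = Path(base_dir)
    # Create the folder (and any missing parents) if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    out_path.write_text(text, encoding="utf-8")
    return out_path


# Example call in Colab, after drive.mount('/content/drive'):
# save_text("some generated text", "sample.txt")
```

Passing a different `base_dir` lets the same helper be tried outside Colab, where `/content/drive` does not exist.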


##  Step 2: Check GPU Availability

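Deep learning models like GPT-2 are **heavy**: even the smallest GPT-2 checkpoint (`gpt2`) has about 124 million parameters.  
- On a CPU, inference is slow and training is impractical.  
- A GPU speeds up both training and inference substantially.  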

Here we use `nvidia-smi` to check GPU info (type, memory, usage).


In [None]:
!nvidia-smi

Mon Aug 18 11:08:36 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   48C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
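
`nvidia-smi` reports what hardware is attached, but the code still has to ask for it. A minimal sketch of doing the same check from Python follows; the helper name `pick_device` is an assumption for illustration, and it falls back to the CPU when PyTorch is not installed or no GPU is visible.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to place on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

Note that the later cells in this notebook never move the model or the inputs to a device, so they run on the CPU; calling `.to(device)` on both would be needed to use the GPU.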
                                                

##  Step 3: Install Hugging Face Transformers

[Hugging Face](https://huggingface.co/) hosts pre-trained NLP models (BERT, GPT-2, RoBERTa, etc.).  
We install the `transformers` library, which allows us to:
- Load pre-trained models  
- Perform tokenization  
- Do text classification, translation, summarization, etc.


In [None]:
!pip install transformers
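
After installing, it can help to confirm which versions the runtime actually has, since behavior and warnings differ between releases. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("transformers:", installed_version("transformers"))
print("torch:", installed_version("torch"))
```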



##  Step 4: Tokenization

**What is Tokenization?**  
- The process of splitting raw text into tokens and mapping each token to an integer ID that the model understands.  
- Example: `"Hello,how are you"` → `[15496, 11, 4919, 389, 345]`  

Here we use the GPT-2 tokenizer:
- `AutoTokenizer.from_pretrained("gpt2")` loads the GPT-2 tokenizer.  
- `return_tensors="pt"` returns the token IDs as PyTorch tensors.


In [None]:
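from transformers import AutoTokenizer

# Download (or load from cache) the tokenizer that matches the GPT-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello,how are you"
# return_tensors="pt" returns PyTorch tensors instead of plain Python lists
tokens = tokenizer(text, return_tensors="pt")
print(tokens)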

{'input_ids': tensor([[15496,    11,  4919,   389,   345]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
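
To see what those five IDs stand for, the sketch below maps them back to token strings and then to text. It is a hand-written illustration, not the tokenizer itself: the lookup table covers only the five IDs printed above, and the helper names are hypothetical. With the real tokenizer, `tokenizer.convert_ids_to_tokens(...)` and `tokenizer.decode(...)` perform these steps over the full vocabulary.

```python
# The five IDs from the output above, paired with their GPT-2 token strings.
ID_TO_TOKEN = {15496: "Hello", 11: ",", 4919: "how", 389: "Ġare", 345: "Ġyou"}


def ids_to_tokens(ids):
    """Look up the token string for each ID (toy table, five entries only)."""
    return [ID_TO_TOKEN[i] for i in ids]


def tokens_to_text(pieces):
    """Join token strings back into text.

    "Ġ" (U+0120) is how GPT-2's byte-level vocabulary writes a leading space.
    """
    return "".join(pieces).replace("Ġ", " ")


ids = [15496, 11, 4919, 389, 345]
pieces = ids_to_tokens(ids)
print(pieces)                  # ['Hello', ',', 'how', 'Ġare', 'Ġyou']
print(tokens_to_text(pieces))  # Hello,how are you
```

The leading-space marker explains why `how` (no space before it in the input) and `Ġare` (preceded by a space) are different kinds of tokens: in GPT-2's vocabulary, a word with and without a leading space gets a different ID.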


##  Step 5: Load Pre-trained GPT-2 Model

We now load the **GPT-2 model**:  
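- `AutoModelForCausalLM` loads GPT-2 with a language-modeling head for **text generation**.  
- We encode an input prompt (e.g., `"Mughals"`) into token IDs.  
- `model.generate()` produces a continuation; `max_length=50` caps the total length at 50 tokens, prompt included.  
- Finally, we decode the generated token IDs back into human-readable text.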


In [None]:
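from transformers import AutoModelForCausalLM

# Load GPT-2 with its language-modeling head (weights are downloaded on first use)
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Encode the prompt as a tensor of token IDs
input_ids = tokenizer.encode("Mughals", return_tensors="pt")
# Generate at most 50 tokens in total, prompt included
output = model.generate(input_ids, max_length=50)
# Convert the generated token IDs back into text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)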


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Mughals, who were the first to establish a foothold in the Indian Ocean, were the first to establish a foothold in the Indian Ocean, and the first to establish a foothold in the Indian Ocean.

The first Indian Ocean expedition was launched
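
The two messages printed before the generated text appear because `tokenizer.encode` returns only token IDs, and GPT-2 defines no padding token. One way to address both, sketched below, is to call the tokenizer directly (which also returns an `attention_mask`) and to pass `pad_token_id` explicitly. The function name `generate_text` is hypothetical; it expects the `model` and `tokenizer` objects loaded in the cells above.

```python
def generate_text(model, tokenizer, prompt, max_length=50):
    """Generate a continuation with an explicit attention mask and pad token."""
    # Calling the tokenizer (rather than tokenizer.encode) returns both
    # input_ids and attention_mask.
    inputs = tokenizer(prompt, return_tensors="pt")

    output = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        # GPT-2 has no pad token, so reuse the end-of-sequence token.
        pad_token_id=tokenizer.eos_token_id,
        max_length=max_length,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)


# In the notebook, after the cells above have run:
# print(generate_text(model, tokenizer, "Mughals"))
```

For a single unpadded prompt the attention mask is all ones, so this should change the warnings rather than the generated text.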


#  Summary

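- Mounting **Google Drive** keeps files across Colab sessions, and a **GPU** runtime speeds up training and inference.  
- **Tokenization** converts raw text into integer token IDs.  
- The pre-trained **GPT-2 model** generates a continuation of a given prompt; with the default settings used here, the output can become repetitive, as the sample above shows.  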


