# NLP basics :

* NLP (Natural Language Processing) is a field in which computers are taught to understand and generate human language. It converts text into numbers so that a machine can work with its meaning.

* In NLP, a model learns factors such as grammar, context, and intent in order to capture deeper relationships in text. This makes it far more capable than simple keyword matching.

* The goal of NLP is to break language down into a machine-friendly format so that a model can classify, translate, summarize, or predict text.

* NLP is a multi-stage process (tokenization, embedding, context modelling, and inference) whose stages together produce the final meaningful output.
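
The stages listed above can be sketched end to end with a toy example. This is only an illustration under loud assumptions: the vocabulary, the 2-dimensional vectors, the averaging step (a crude stand-in for context modelling), and the hand-written decision rule are all made up here and are not how a trained model works internally.

```python
# Toy sketch of the stages: tokenization -> ids -> embeddings -> context -> inference.
# Every value below (vocabulary, vectors, decision rule) is hypothetical.

VOCAB = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "nlp": 4}

# One made-up 2-dimensional vector per token id.
EMBEDDINGS = [
    [0.0, 0.0],   # <unk>
    [0.1, 0.0],   # i
    [0.9, 0.2],   # love
    [-0.9, 0.1],  # hate
    [0.2, 0.8],   # nlp
]

def tokenize(text):
    """Stage 1: lowercase the text and split it on whitespace."""
    return text.lower().split()

def to_ids(tokens):
    """Stage 2: map each token to an integer id, falling back to <unk>."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]

def embed(ids):
    """Stage 3: look up one vector per id."""
    return [EMBEDDINGS[i] for i in ids]

def pool(vectors):
    """Stage 4 (stand-in for context modelling): average the vectors.
    Assumes at least one vector is given."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def classify(sentence_vector):
    """Stage 5: a hand-written rule standing in for a trained classifier."""
    return "positive" if sentence_vector[0] > 0 else "negative"

def run_pipeline(text):
    return classify(pool(embed(to_ids(tokenize(text)))))

print(to_ids(tokenize("I love NLP")))  # text has become a list of integers
print(run_pipeline("I love NLP"))
print(run_pipeline("I hate NLP"))
```

In a real system each stage is learned from data or provided by a library; the point here is only the order in which text becomes numbers and numbers become a prediction.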

### 1# Tokenization :

* Tokenization is the process of breaking a sentence into small, manageable units such as words, subwords, or characters.

* It splits raw text into consistent pieces so that the model can process the sequence step by step.

* Subword tokenization is popular in modern NLP because it can handle rare words by breaking them into smaller, known pieces.

* Tokenization is the foundation of a model's input: without clean tokens, the model struggles to learn useful attention patterns.

* Why in AI/ML :

    - Tokenization prepares text for numerical modelling, so an ML model can process the sequence in its proper order.

    - It handles word variations, spelling changes, and rare words, which helps training stability.

    - Without tokenization, the model would see the whole sentence as a single block, and context and meaning would be lost.

* Types :

    - Word tokenization

    - Character tokenization

    - Subword tokenization (BPE, WordPiece, SentencePiece)

        - Subword tokenization is generally the most flexible of the three, because the model can represent even previously unseen words as combinations of known pieces.
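
To make the three granularities concrete, here is a minimal sketch in plain Python. The subword function uses a greedy longest-match split in the spirit of WordPiece-style tokenizers; `TOY_VOCAB` and the helper names are hypothetical, and real subword vocabularies are learned from a corpus rather than written by hand.

```python
# Three tokenization granularities on the same text.
# TOY_VOCAB is a made-up vocabulary used only for this illustration.

TOY_VOCAB = {"token", "##ization", "un", "##believ", "##able", "is", "fun"}

def word_tokenize(text):
    """Word level: split lowercased text on whitespace."""
    return text.lower().split()

def char_tokenize(text):
    """Character level: one token per non-space character."""
    return [ch for ch in text.lower() if not ch.isspace()]

def subword_tokenize(word, vocab=TOY_VOCAB):
    """Subword level: greedy longest-match-first split of a single word.
    Pieces that continue a word carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no known piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

text = "Tokenization is fun"
print(word_tokenize(text))
print(char_tokenize("fun"))
print([subword_tokenize(w) for w in word_tokenize(text)])
print(subword_tokenize("unbelievable"))  # a longer word built from known pieces
```

The word "unbelievable" is not in the toy vocabulary as a whole, yet it can still be represented from three known pieces, which is the property that makes subword schemes attractive for rare words.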

In [3]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Machine learning helps AI understand language"

tokens = tokenizer.tokenize(text)  # break the text into subword tokens
ids = tokenizer.encode(text)  # convert to numeric ids; also adds the special [CLS] (101) and [SEP] (102) tokens

print(tokens)
print(ids)

['machine', 'learning', 'helps', 'ai', 'understand', 'language']
[101, 3698, 4083, 7126, 9932, 3305, 2653, 102]


### 2# Embeddings :

* An embedding is a vector representation in which each word (or token) is mapped to a point in a dense numeric space, that is, a vector of real numbers where most entries are non-zero.

* These numeric vectors capture relationships between word meanings, such as the similarity between "king" and "queen".

* Embeddings compress high-dimensional data (for example, one-hot vectors as long as the vocabulary) into much smaller vectors, which helps the model capture context and semantics.

* Embeddings are learned rather than fixed by hand: they improve as the model trains and become better at representing language.

* Why in AI/ML :

    - Embeddings give a machine the ability to work with deeper meaning and context, which is at the core of AI models.

    - They help ML models learn patterns and relationships between words, such as synonyms and analogies.

    - Without embeddings, text vectors would be sparse and inefficient (for example, one-hot vectors), which makes it much harder for a model to learn.
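
The contrast between sparse and dense vectors can be illustrated with a small sketch. The dense vectors below are hand-picked, hypothetical numbers chosen only to make the point; real embeddings are learned during training and have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

words = ["king", "queen", "apple"]

# Sparse one-hot vectors: a single 1 at the word's own index, 0 elsewhere.
one_hot = {
    w: [1.0 if i == j else 0.0 for j in range(len(words))]
    for i, w in enumerate(words)
}

# Dense vectors with made-up values (an assumption for illustration only).
dense = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.0, 0.9],
}

# One-hot vectors of different words are always orthogonal, so they carry
# no information about which words are related.
print("one-hot king vs queen:", cosine(one_hot["king"], one_hot["queen"]))

# Dense vectors can place related words close together.
print("dense king vs queen:  ", round(cosine(dense["king"], dense["queen"]), 3))
print("dense king vs apple:  ", round(cosine(dense["king"], dense["apple"]), 3))
```

With one-hot vectors every pair of distinct words looks equally unrelated, whereas the dense vectors score "king" and "queen" much higher than "king" and "apple".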

In [4]:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "NLP embeddings capture meaning"
inputs = tokenizer(text, return_tensors="pt")

# run the model to get one hidden-state vector per token; no_grad skips gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

cls_vector = outputs.last_hidden_state[:, 0, :]  # embedding of the [CLS] token, shape (1, 768)

print(cls_vector)  # often used as a rough sentence-level representation

tensor([[-2.6162e-01, -1.5060e-01, -5.8002e-01, -1.9405e-01, -3.7837e-01,
         -2.6904e-01,  4.5140e-01,  3.2322e-01, -1.7788e-01, -2.7300e-01,
         -2.5873e-01,  1.2740e-02, -1.3905e-01,  1.1248e-01,  4.4079e-01,
          2.4274e-01, -1.8626e-01,  4.9890e-01,  2.0663e-01, -3.5341e-01,
         -1.0902e-02, -4.5293e-01, -6.6118e-01, -5.5842e-01,  2.0922e-01,
          6.0444e-02,  6.8188e-02, -4.9698e-01,  1.5289e-01,  1.6451e-01,
         -2.3054e-01,  1.5929e-01, -3.6369e-01, -4.2262e-02,  5.3762e-01,
         -1.3789e-01,  3.8621e-01, -1.3518e-01,  2.5008e-01,  6.3026e-02,
         -2.4635e-01, -1.7674e-01,  7.5943e-01, -3.9611e-02, -4.0518e-01,
         -1.8297e-01, -3.1206e+00, -9.6969e-02, -6.6714e-01, -5.3219e-01,
         -3.2492e-01,  3.1801e-03,  2.7319e-02,  8.6779e-01,  1.2414e-03,
          5.4698e-01, -4.4062e-02,  4.7390e-01,  2.4350e-01,  8.1779e-02,
          9.4857e-02,  7.8707e-02, -1.3740e-01,  1.4905e-01,  1.7796e-01,
          2.0958e-01, -1.4692e-01,  5. ... (output truncated)

### 3# Sentence Similarity Project :

* In the sentence similarity project, we compute how closely two sentences are related in meaning.

* The model generates an embedding for each sentence and compares the two vectors (by distance or angle) to produce a similarity score.

* This project tests semantic understanding: the model should capture the idea of the whole sentence, not just its individual words.

* The score is usually reported as cosine similarity, which ranges from -1 to 1 (some tools rescale it to 0 to 1); a higher value means the texts are more similar.

* Why in AI/ML :

    - Sentence similarity is essential in applications such as duplicate question detection, search ranking, and recommendation.

    - Checking how accurately sentence meanings are matched is a common way to measure the quality of an NLP system.

    - AI systems have to understand relationships between real-world texts, and the similarity task is a practical test of that ability.

    - This project shows the practical utility of embeddings and transformer models, which reinforces the underlying ML concepts.
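
As a rough sketch of how similarity scores drive applications like duplicate question detection and search ranking, the example below ranks a few candidate sentences against a query. The vectors are hypothetical stand-ins for embeddings that a real model would produce, and `DUPLICATE_THRESHOLD` is an assumed cut-off, not a recommended value.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical precomputed sentence embeddings (made-up numbers).
query_vec = [0.8, 0.1, 0.3]
candidates = {
    "How do I reset my password?": [0.7, 0.2, 0.3],
    "Steps to change a forgotten password": [0.6, 0.1, 0.4],
    "Best pizza places nearby": [-0.2, 0.9, 0.1],
}

DUPLICATE_THRESHOLD = 0.9  # assumed cut-off; in practice tuned on labelled pairs

# Score every candidate against the query and sort from most to least similar.
ranked = sorted(
    ((cosine(query_vec, vec), text) for text, vec in candidates.items()),
    reverse=True,
)

for score, text in ranked:
    label = "duplicate?" if score >= DUPLICATE_THRESHOLD else "different"
    print(f"{score:6.3f}  {label:10s}  {text}")
```

The unrelated sentence ends up with a slightly negative score, which shows why cosine similarity should be read on a -1 to 1 scale unless it has been rescaled.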

In [5]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # pretrained embedding model

s1 = "I love learning NLP"
s2 = "Studying natural language processing is enjoyable"

e1 = model.encode(s1)  # first sentence embedding
e2 = model.encode(s2)  # second sentence embedding

score = util.cos_sim(e1, e2)  # cosine similarity to compare meaning

print("similarity:", score.item())

similarity: 0.6640487909317017
