<a href="https://colab.research.google.com/github/DBishal13/gpt_chatwithPDF/blob/main/RomanNepaliAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# 🚀 Romanized Nepali Model - Google Colab Setup

# ✅ Install necessary libraries
!pip install transformers datasets torch

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_c

In [None]:
# ✅ Import Libraries
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset, Dataset
import torch
import pandas as pd

# ✅ Load mBERT model and tokenizer
model_name = 'bert-base-multilingual-cased'  # Multilingual BERT

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Example: binary classification

# ✅ Sample Romanized Nepali Dataset (You can replace this with your own)
data = {'text': ['ma ghar gaye', 'malai tha xaina', 'timlai k cha'],
        'label': [0, 1, 0]}  # 0 = neutral, 1 = negative (example labels)

df = pd.DataFrame(data)

def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)

# ✅ Convert to Hugging Face dataset format
dataset = Dataset.from_pandas(df)
dataset = dataset.map(preprocess_function, batched=True)

# ✅ Fine-tuning settings
training_args = TrainingArguments(
    output_dir='./results',
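    # No eval_dataset is passed to Trainer below, so evaluation stays at its
    # default ('no'); requesting 'epoch' here raises a ValueError.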
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)

# ✅ Start training
trainer.train()

# ✅ Save the model
model.save_pretrained('./romanized_nepali_model')
tokenizer.save_pretrained('./romanized_nepali_model')

print("✅ Model training complete and saved!")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



ValueError: You have set `args.eval_strategy` to IntervalStrategy.EPOCH but you didn't pass an `eval_dataset` to `Trainer`. Either set `args.eval_strategy` to `no` or pass an `eval_dataset`. 
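
The `ValueError` above is raised when per-epoch evaluation is requested while `Trainer` receives no `eval_dataset`. One way to satisfy it is to hold out part of the labelled data. What follows is a minimal sketch using only the standard library; `split_examples`, `eval_fraction`, and `seed` are illustrative names that do not appear in the notebook, and the sample data is the three-sentence `data` dict from the training cell.

```python
import random


def split_examples(data, eval_fraction=0.2, seed=42):
    """Shuffle column-oriented data and split it into train and eval parts.

    `data` maps column names to equal-length lists, like the `data` dict
    in the training cell. At least one example is kept on each side.
    """
    n = len(next(iter(data.values())))
    if n < 2:
        raise ValueError("need at least two examples to split")
    if any(len(col) != n for col in data.values()):
        raise ValueError("all columns must have the same length")

    # Shuffle row indices with a fixed seed so the split is reproducible
    indices = list(range(n))
    random.Random(seed).shuffle(indices)

    # Clamp the eval size so neither side ends up empty
    n_eval = min(max(1, round(n * eval_fraction)), n - 1)
    eval_idx, train_idx = indices[:n_eval], indices[n_eval:]

    def take(idx):
        # Select the same rows from every column to keep text and label aligned
        return {name: [col[i] for i in idx] for name, col in data.items()}

    return take(train_idx), take(eval_idx)


data = {'text': ['ma ghar gaye', 'malai tha xaina', 'timlai k cha'],
        'label': [0, 1, 0]}
train_data, eval_data = split_examples(data, eval_fraction=1 / 3)
print(len(train_data['text']), len(eval_data['text']))
```

Each half can then be wrapped with `Dataset.from_dict(...)`, mapped through `preprocess_function`, and passed to `Trainer` as `train_dataset` and `eval_dataset`. With only three examples the evaluation numbers are not meaningful; the split matters once a larger labelled set is available.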

In [None]:
# 🚀 Combined Social Media Scraper for Romanized Nepali Text

# ✅ Install necessary libraries
!pip install snscrape praw facebook-scraper pandas

import snscrape.modules.twitter as sntwitter
import praw
from facebook_scraper import get_posts
import pandas as pd

# ✅ Configuration
num_tweets = 500       # Number of tweets to scrape
num_reddit_posts = 100  # Number of Reddit posts to scrape
num_facebook_posts = 100  # Number of Facebook posts to scrape

# ✅ Twitter Scraper (No API keys required)
print("🚀 Scraping Twitter...")
twitter_query = "(Nepal OR Kathmandu) lang:ne until:2025-04-01 since:2024-01-01"
twitter_data = []

for i, tweet in enumerate(sntwitter.TwitterSearchScraper(twitter_query).get_items()):
    if i >= num_tweets:
        break
    # `rawContent` replaces the deprecated `content` attribute in recent snscrape releases
    twitter_data.append([tweet.date, tweet.rawContent, tweet.user.username, 'Twitter'])

twitter_df = pd.DataFrame(twitter_data, columns=['Date', 'Text', 'User', 'Source'])

# ✅ Reddit Scraper (needs Reddit API credentials; on failure an empty frame is used)
print("🚀 Scraping Reddit...")
try:
    reddit = praw.Reddit(
        client_id='YOUR_CLIENT_ID',    # Replace with your Reddit app's client ID
        client_secret='YOUR_CLIENT_SECRET',  # Replace with your Reddit app's secret
        user_agent='YourAppName'
    )
    subreddits = ['Nepal', 'nepalibloggers']
    reddit_data = []

    for sub in subreddits:
        subreddit = reddit.subreddit(sub)
        for post in subreddit.new(limit=num_reddit_posts):
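            # Columns are Date, Text, User, Source: convert the epoch timestamp,
            # join title and body as the text, and record the author (None if deleted)
            reddit_data.append([
                pd.to_datetime(post.created_utc, unit='s'),
                f"{post.title}\n{post.selftext}".strip(),
                post.author.name if post.author else None,
                'Reddit',
            ])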

    reddit_df = pd.DataFrame(reddit_data, columns=['Date', 'Text', 'User', 'Source'])

except Exception as e:
    print(f"❌ Reddit scraping failed: {e}")
    reddit_df = pd.DataFrame(columns=['Date', 'Text', 'User', 'Source'])

# ✅ Facebook Scraper (No login required)
print("🚀 Scraping Facebook...")
facebook_pages = ['nepal', 'nepalinetwork']
facebook_data = []

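for page in facebook_pages:
    # `pages` is the number of result pages to request, not a post count,
    # so the number of posts per page name is capped explicitly
    for i, post in enumerate(get_posts(page, pages=num_facebook_posts)):
        if i >= num_facebook_posts:
            break
        facebook_data.append([post['time'], post['text'], post.get('username'), 'Facebook'])

facebook_df = pd.DataFrame(facebook_data, columns=['Date', 'Text', 'User', 'Source'])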

# ✅ Combine all data
print("✅ Combining all data...")
combined_df = pd.concat([twitter_df, reddit_df, facebook_df], ignore_index=True)
combined_df.to_csv('romanized_nepali_dataset.csv', index=False)

print("✅ Scraping complete! Dataset saved as 'romanized_nepali_dataset.csv'")
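
The scraper above collects posts regardless of script, and a query such as `lang:ne` is likely to return Nepali written in Devanagari as well as in Latin letters. A rough way to keep only Latin-script rows is sketched below, assuming a simple character-ratio heuristic. `is_romanized` and `min_latin_ratio` are hypothetical names introduced for this example, and the 0.8 threshold is an arbitrary starting point rather than a tuned value.

```python
import re

# Devanagari block (U+0900 to U+097F); Nepali written natively uses this script
DEVANAGARI = re.compile(r'[\u0900-\u097F]')
LATIN_LETTER = re.compile(r'[A-Za-z]')


def is_romanized(text, min_latin_ratio=0.8):
    """Return True when the letters in `text` are mostly Latin-script.

    Only Latin letters and Devanagari characters are counted; digits,
    punctuation, emoji, and whitespace are ignored.
    """
    if not isinstance(text, str):
        return False  # covers None and NaN values from the DataFrame
    latin = len(LATIN_LETTER.findall(text))
    devanagari = len(DEVANAGARI.findall(text))
    total = latin + devanagari
    if total == 0:
        return False  # nothing to judge (empty, numeric, or emoji-only text)
    return latin / total >= min_latin_ratio


samples = ['ma ghar gaye', 'म घर गएँ', 'Nepal नेपाल', '', None]
print([is_romanized(s) for s in samples])
```

Applied to the combined frame, this would look like `combined_df[combined_df['Text'].apply(is_romanized)]`. The heuristic only checks the script, so it cannot tell Romanized Nepali from English; separating those would need a language-identification step on top.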
