<a href="https://colab.research.google.com/github/Katherinebarnes/AI/blob/main/colabs/intro/Intro_to_Weights_%26_Biases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/wandb/examples/blob/master/colabs/intro/Intro_to_Weights_&_Biases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{intro-colab} -->

<a href="https://colab.research.google.com/github/wandb/examples/blob/master/colabs/intro/Intro_to_Weights_&_Biases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{intro-colab} -->

<img src="http://wandb.me/logo-im-png" width="400" alt="Weights & Biases" />
<!--- @wandbcode{intro-colab} -->

Use [W&B](https://wandb.ai/site?utm_source=intro_colab&utm_medium=code&utm_campaign=intro) for machine learning experiment tracking, model checkpointing, collaboration with your team and more. See the full W&B Documentation [here](https://docs.wandb.ai/).

In this notebook, you will create and track a machine learning experiment using a simple PyTorch model. By the end of the notebook, you will have an interactive project dashboard that you can share and customize with other members of your team. [View an example dashboard here](https://wandb.ai/wandb/wandb_example).

## Prerequisites

Install the W&B Python SDK and log in:

In [None]:
!pip install wandb -qU

In [None]:
# Log in to your W&B account
import wandb
import random
import math

In [None]:
wandb.login()

## Simulate and track a machine learning experiment with W&B

Create, track, and visualize a machine learning experiment. To do this:

1. Initialize a [W&B run](https://docs.wandb.ai/guides/runs) and pass in the hyperparameters you want to track.
2. Within your training loop, log metrics such as the accuruacy and loss.

In [None]:
import random
import math

# Launch 5 simulated experiments
total_runs = 5
for run in range(total_runs):
  # 1️. Start a new run to track this script
  wandb.init(
      # Set the project where this run will be logged
      project="basic-intro",
      # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
      name=f"experiment_{run}",
      # Track hyperparameters and run metadata
      config={
      "learning_rate": 0.02,
      "architecture": "CNN",
      "dataset": "CIFAR-100",
      "epochs": 10,
      })

  # This simple block simulates a training loop logging metrics
  epochs = 10
  offset = random.random() / 5
  for epoch in range(2, epochs):
      acc = 1 - 2 ** -epoch - random.random() / epoch - offset
      loss = 2 ** -epoch + random.random() / epoch + offset

      # 2️. Log metrics from your script to W&B
      wandb.log({"acc": acc, "loss": loss})

  # Mark the run as finished
  wandb.finish()

View how your machine learning peformed in your W&B project. Copy and paste the URL link that is printed from the previous cell. The URL will redirect you to a W&B project that contains a dashboard showing graphs the show how

The following image shows what a dashboard can look like:

![](https://i.imgur.com/Pell4Oo.png)

Now that we know how to integrate W&B into a psuedo machine learning training loop, let's track a machine learning experiment using a basic PyTorch neural network. The following code will also upload model checkpoints to W&B that you can then share with other teams in in your organization.

##  Track a machine learning experiment using Pytorch

The following code cell defines and trains a simple MNIST classifier. During training, you will see W&B prints out URLs. Click on the project page link to see your results stream in live to a W&B project.

W&B runs automatically log [metrics](https://docs.wandb.ai/ref/app/pages/run-page#charts-tab),
[system information](https://docs.wandb.ai/ref/app/pages/run-page#system-tab),
[hyperparameters](https://docs.wandb.ai/ref/app/pages/run-page#overview-tab),
[terminal output](https://docs.wandb.ai/ref/app/pages/run-page#logs-tab) and
you'll see an [interactive table](https://docs.wandb.ai/guides/data-vis)
with model inputs and outputs.

### Set up PyTorch Dataloader
The following cell defines some useful functions that we will need to train our machine learning model. The functions themselves are not unique to W&B so we'll not cover them in detail here. See the PyTorch documentation for more information on how to define [forward and backward training loop](https://pytorch.org/tutorials/beginner/nn_tutorial.html), how to use [PyTorch DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) to load data in for training, and how define PyTorch models using the [`torch.nn.Sequential` Class](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html).

In [None]:
#@title
import torch, torchvision
import torch.nn as nn
from torchvision.datasets import MNIST
import torchvision.transforms as T

MNIST.mirrors = [mirror for mirror in MNIST.mirrors if "http://yann.lecun.com/" not in mirror]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

def get_dataloader(is_train, batch_size, slice=5):
    "Get a training dataloader"
    full_dataset = MNIST(root=".", train=is_train, transform=T.ToTensor(), download=True)
    sub_dataset = torch.utils.data.Subset(full_dataset, indices=range(0, len(full_dataset), slice))
    loader = torch.utils.data.DataLoader(dataset=sub_dataset,
                                         batch_size=batch_size,
                                         shuffle=True if is_train else False,
                                         pin_memory=True, num_workers=2)
    return loader

def get_model(dropout):
    "A simple model"
    model = nn.Sequential(nn.Flatten(),
                         nn.Linear(28*28, 256),
                         nn.BatchNorm1d(256),
                         nn.ReLU(),
                         nn.Dropout(dropout),
                         nn.Linear(256,10)).to(device)
    return model

def validate_model(model, valid_dl, loss_func, log_images=False, batch_idx=0):
    "Compute performance of the model on the validation dataset and log a wandb.Table"
    model.eval()
    val_loss = 0.
    with torch.inference_mode():
        correct = 0
        for i, (images, labels) in enumerate(valid_dl):
            images, labels = images.to(device), labels.to(device)

            # Forward pass ➡
            outputs = model(images)
            val_loss += loss_func(outputs, labels)*labels.size(0)

            # Compute accuracy and accumulate
            _, predicted = torch.max(outputs.data, 1)
            correct += (predicted == labels).sum().item()

            # Log one batch of images to the dashboard, always same batch_idx.
            if i==batch_idx and log_images:
                log_image_table(images, predicted, labels, outputs.softmax(dim=1))
    return val_loss / len(valid_dl.dataset), correct / len(valid_dl.dataset)

### Create a teble to compare the predicted values versus the true value

The following cell is unique to W&B, so let's go over it.

In the cell we define a function called `log_image_table`. Though technically, optional, this function creates a W&B Table object. We will use the table object to create a table that shows what the model predicted for each image.

More specifically, each row will conists of the image fed to the model, along with predicted value and the actual value (label).

In [None]:
def log_image_table(images, predicted, labels, probs):
    "Log a wandb.Table with (img, pred, target, scores)"
    # Create a wandb Table to log images, labels and predictions to
    table = wandb.Table(columns=["image", "pred", "target"]+[f"score_{i}" for i in range(10)])
    for img, pred, targ, prob in zip(images.to("cpu"), predicted.to("cpu"), labels.to("cpu"), probs.to("cpu")):
        table.add_data(wandb.Image(img[0].numpy()*255), pred, targ, *prob.numpy())
    wandb.log({"predictions_table":table}, commit=False)

### Train your model and upload checkpoints

The following code trains and saves model checkpoints to your project. Use model checkpoints like you normally would to assess how the model performed during training.

W&B also makes it easy to share your saved models and model checkpoints with other members of your team or organization. To learn how to share your model and model checkpoints with members outside of your team, see [W&B Registry](https://docs.wandb.ai/guides/registry).

In [None]:
# Launch 3 experiments, trying different dropout rates
for _ in range(3):
    # initialise a wandb run
    wandb.init(
        project="pytorch-intro",
        config={
            "epochs": 5,
            "batch_size": 128,
            "lr": 1e-3,
            "dropout": random.uniform(0.01, 0.80),
            })

    # Copy your config
    config = wandb.config

    # Get the data
    train_dl = get_dataloader(is_train=True, batch_size=config.batch_size)
    valid_dl = get_dataloader(is_train=False, batch_size=2*config.batch_size)
    n_steps_per_epoch = math.ceil(len(train_dl.dataset) / config.batch_size)

    # A simple MLP model
    model = get_model(config.dropout)

    # Make the loss and optimizer
    loss_func = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=config.lr)

   # Training
    example_ct = 0
    step_ct = 0
    for epoch in range(config.epochs):
        model.train()
        for step, (images, labels) in enumerate(train_dl):
            images, labels = images.to(device), labels.to(device)

            outputs = model(images)
            train_loss = loss_func(outputs, labels)
            optimizer.zero_grad()
            train_loss.backward()
            optimizer.step()

            example_ct += len(images)
            metrics = {"train/train_loss": train_loss,
                       "train/epoch": (step + 1 + (n_steps_per_epoch * epoch)) / n_steps_per_epoch,
                       "train/example_ct": example_ct}

            if step + 1 < n_steps_per_epoch:
                # Log train metrics to wandb
                wandb.log(metrics)

            step_ct += 1

        val_loss, accuracy = validate_model(model, valid_dl, loss_func, log_images=(epoch==(config.epochs-1)))

        # Log train and validation metrics to wandb
        val_metrics = {"val/val_loss": val_loss,
                       "val/val_accuracy": accuracy}
        wandb.log({**metrics, **val_metrics})

        # Save the model checkpoint to wandb
        torch.save(model, "my_model.pt")
        wandb.log_model("./my_model.pt", "my_mnist_model", aliases=[f"epoch-{epoch+1}_dropout-{round(wandb.config.dropout, 4)}"])

        print(f"Epoch: {epoch+1}, Train Loss: {train_loss:.3f}, Valid Loss: {val_loss:3f}, Accuracy: {accuracy:.2f}")

    # If you had a test set, this is how you could log it as a Summary metric
    wandb.summary['test_accuracy'] = 0.8

    # Close your wandb run
    wandb.finish()

You have now trained your first model using W&B. Click on one of the links above to see your metrics and see your saved model checkpoints in the Artifacts tab in the W&B App UI

## (Optional) Set up a W&B Alert

Create a [W&B Alerts](https://docs.wandb.ai/guides/track/alert) to send alerts to your Slack or email from your Python code.

There are 2 steps to follow the first time you'd like to send a Slack or email alert, triggered from your code:

1) Turn on Alerts in your W&B [User Settings](https://wandb.ai/settings)
2) Add `wandb.alert()` to your code. For example:

```python
wandb.alert(
    title="Low accuracy",
    text=f"Accuracy is below the acceptable threshold"
)
```

The following cell shows a minimal example below to see how to use `wandb.alert`

In [None]:
# Start a wandb run
wandb.init(project="pytorch-intro")

# Simulating a model training loop
acc_threshold = 0.3
for training_step in range(1000):

    # Generate a random number for accuracy
    accuracy = round(random.random() + random.random(), 3)
    print(f'Accuracy is: {accuracy}, {acc_threshold}')

    # Log accuracy to wandb
    wandb.log({"Accuracy": accuracy})

    # If the accuracy is below the threshold, fire a W&B Alert and stop the run
    if accuracy <= acc_threshold:
        # Send the wandb Alert
        wandb.alert(
            title='Low Accuracy',
            text=f'Accuracy {accuracy} at step {training_step} is below the acceptable theshold, {acc_threshold}',
        )
        print('Alert triggered')
        break

# Mark the run as finished (useful in Jupyter notebooks)
wandb.finish()

You can find the full docs for [W&B Alerts here](https://docs.wandb.ai/guides/track/alert).

## Next steps
The next tutorial you will learn how to do hyperparameter optimization using W&B Sweeps:
[Hyperparameters sweeps using PyTorch](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/pytorch/Organizing_Hyperparameter_Sweeps_in_PyTorch_with_W%26B.ipynb)

In [1]:
!pip install faker
import random
import csv
import pandas as pd
from faker import Faker

# Initialize Faker for Chinese data
fake = Faker('zh_CN')

# Number of entries to generate
num_entries = 2000

# Prepare lists of sample data (add more to enhance variety)
universities = [
    "清华大学", "北京大学", "复旦大学", "浙江大学", "南京大学", "上海交通大学",
    "中国科学技术大学", "武汉大学", "华中科技大学", "中山大学"
]
keywords = [
    "人工智能", "机器学习", "深度学习", "自然语言处理", "计算机视觉",
    "大数据", "云计算", "物联网", "区块链", "生物信息学"
]
citation_sources = ["IEEE", "ACM", "Springer", "Elsevier", "ScienceDirect", "知网"]

# Generate data
data = []
for i in range(num_entries):
    # Define 'entry' before using it in the conditional expression
    entry = {
        'title': fake.sentence(nb_words=8),
        'id': i + 1,
        'authors': fake.name(),
        'year_of_publication': random.randint(2000, 2023),
        'university': random.choice(universities),
        'abstract': fake.paragraph(nb_sentences=5),
        'keywords': ', '.join(random.sample(keywords, random.randint(3, 5))),
        'co_authors': fake.name() if random.random() < 0.5 else None,  # 50% chance of co-authors
        'citation': fake.sentence(nb_words=10) if random.random() < 0.3 else None, # 30% chance of citation
        'number_of_citations': random.randint(0, 100),
        'patents': fake.word() if random.random() < 0.2 else None, # 20% chance of patent
        'government_grants': fake.word() if random.random() < 0.1 else None,  # 10% chance of grants
        # The line below was causing the error.
        # Moved the conditional check to after 'entry' is fully defined.
    }
    #Now check if 'citation' key exists in the dictionary after it has been populated
    entry['citation_source'] = random.choice(citation_sources) if entry.get('citation') is not None else None

    data.append(entry)

    # Now you can access 'patents' from the entry dictionary
    entry['no_of_patents'] = random.randint(0, 5) if entry.get('patents') is not None else 0


# Create DataFrame and save to CSV
df = pd.DataFrame(data)
df.to_csv('chinese_research_papers.csv', index=False, encoding='utf-8-sig')

print(f"Generated 'chinese_research_papers.csv' with {num_entries} entries.")

processed_df = pd.read_csv('chinese_research_papers.csv')
display(processed_df)  # This will display the DataFrame in your Jupyter Notebook

df.to_csv('chinese_research_papers.csv', index=False, encoding='utf-8-sig')
print("DataFrame saved to 'chinese_research_papers.csv'")

Collecting faker
  Downloading Faker-30.8.2-py3-none-any.whl.metadata (15 kB)
Downloading Faker-30.8.2-py3-none-any.whl (1.8 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/1.8 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m1.8/1.8 MB[0m [31m91.4 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m49.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: faker
Successfully installed faker-30.8.2
Generated 'chinese_research_papers.csv' with 2000 entries.


Unnamed: 0,title,id,authors,year_of_publication,university,abstract,keywords,co_authors,citation,number_of_citations,patents,government_grants,citation_source,no_of_patents
0,制作相关一般文章目前希望国际留言管理资料.,1,高秀梅,2019,北京大学,计划能够个人人员大学.欢迎一次状态电脑的话.非常专业深圳自己.影响同时所有广告密码介绍回复....,"生物信息学, 物联网, 大数据",,文章文件希望可以最后一种广告合作位置作品必须用户.,26,,,IEEE,0
1,解决不能这里手机地方我们美国回复电脑历史.,2,黄龙,2023,北京大学,名称你的电脑首页到了.计划经营本站大家任何.学校作为中文联系.,"区块链, 生物信息学, 大数据, 物联网, 云计算",,,16,,,,0
2,原因地址时候产品技术工具操作一直那个影响.,3,杨柳,2019,南京大学,国际处理发现.用户关系希望时候经营作品一起.时间现在您的.上海发表欢迎公司阅读发生一样.根据...,"大数据, 计算机视觉, 物联网, 机器学习, 区块链",程宁,,57,,,,0
3,非常他们已经发展注册说明更多来源一直.,4,董浩,2003,中国科学技术大学,其实中心软件密码系统次数游戏.非常成为有些各种一个其中.一个标题加入所有得到不要一次.由于还...,"计算机视觉, 云计算, 生物信息学",,应该完成商品名称音乐解决北京详细事情位置一起客户.,52,,,Elsevier,0
4,重要日期女人网络.,5,詹瑞,2004,北京大学,国内主题开始您的这样内容不会最后.合作功能关系活动公司投资品牌.出来起来功能今年关于.作者非...,"计算机视觉, 物联网, 云计算, 区块链, 深度学习",赵淑珍,,64,这个,感觉,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,情况客户包括其实不要那些关于不是全国用户关系.,1996,魏娟,2010,中山大学,以上包括主要大学.设计日本是否更新希望只有人员企业.是一一点女人联系.操作建设合作阅读不是.,"区块链, 自然语言处理, 大数据",李洁,,45,,,,0
1996,地址大学不能资料一个以下为了.,1997,周建军,2020,武汉大学,只要欢迎商品管理一次作品非常.一样方法经验一些图片注意非常.我们提高相关这样工具.,"计算机视觉, 自然语言处理, 云计算, 大数据",叶伟,可是中国以上制作控制喜欢文件作为用户制作.,97,,,Elsevier,0
1997,方面上海部门当然这个大家首页汽车学习上海完成.,1998,周燕,2014,上海交通大学,包括一般人民系列学校大小类型.系列方式深圳帖子比较继续.他们浏览电话浏览工具一些位置.学习以...,"物联网, 机器学习, 深度学习, 自然语言处理, 区块链",董建国,,82,,,,0
1998,帮助今年基本然后两个所有.,1999,张鑫,2000,清华大学,设计操作过程.谢谢上海全国得到您的.地区那些今天.管理这些文化操作.实现非常你的更多.,"云计算, 自然语言处理, 物联网, 计算机视觉, 区块链",徐秀华,,28,,,,0


DataFrame saved to 'chinese_research_papers.csv'


In [3]:
!pip install datasets
!pip install transformers
!pip install torch
!pip install faker
import random
import csv
import pandas as pd
from faker import Faker
from datasets import Dataset, DatasetDict
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification, AutoTokenizer, AutoConfig
import torch
from torch import nn

# Initialize Faker for Chinese data
fake = Faker('zh_CN')

# Number of entries to generate
num_entries = 2000

# Prepare lists of sample data (add more to enhance variety)
universities = [
    "清华大学", "北京大学", "复旦大学", "浙江大学", "南京大学", "上海交通大学",
    "中国科学技术大学", "武汉大学", "华中科技大学", "中山大学"
]
keywords = [
    "人工智能", "机器学习", "深度学习", "自然语言处理", "计算机视觉",
    "大数据", "云计算", "物联网", "区块链", "生物信息学"
]
citation_sources = ["IEEE", "ACM", "Springer", "Elsevier", "ScienceDirect", "知网"]

# Generate data
data = []
for i in range(num_entries):
    # Define 'entry' before using it in the conditional expression
    entry = {
        'title': fake.sentence(nb_words=8),
        'id': i + 1,
        'authors': fake.name(),
        'year_of_publication': random.randint(2000, 2023),
        'university': random.choice(universities),
        'abstract': fake.paragraph(nb_sentences=5),
        'keywords': ', '.join(random.sample(keywords, random.randint(3, 5))),
        'co_authors': fake.name() if random.random() < 0.5 else None,  # 50% chance of co-authors
        'citation': fake.sentence(nb_words=10) if random.random() < 0.3 else None, # 30% chance of citation
        'number_of_citations': random.randint(0, 100),
        'patents': fake.word() if random.random() < 0.2 else None, # 20% chance of patent
        'government_grants': fake.word() if random.random() < 0.1 else None,  # 10% chance of grants
        # The line below was causing the error.
        # Moved the conditional check to after 'entry' is fully defined.
    }
    #Now check if 'citation' key exists in the dictionary after it has been populated
    entry['citation_source'] = random.choice(citation_sources) if entry.get('citation') is not None else None

    data.append(entry)

    # Now you can access 'patents' from the entry dictionary
    entry['no_of_patents'] = random.randint(0, 5) if entry.get('patents') is not None else 0


# Create DataFrame and save to CSV
df = pd.DataFrame(data)
df.to_csv('chinese_research_papers.csv', index=False, encoding='utf-8-sig')

print(f"Generated 'chinese_research_papers.csv' with {num_entries} entries.")

processed_df = pd.read_csv('chinese_research_papers.csv')
display(processed_df)  # This will display the DataFrame in your Jupyter Notebook

df.to_csv('chinese_research_papers.csv', index=False, encoding='utf-8-sig')
print("DataFrame saved to 'chinese_research_papers.csv'")

# Load pre-trained model and tokenizer
model_name = "bert-base-chinese"
config = AutoConfig.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define a custom model with BERT and BiLSTM
class BERTwithBiLSTM(nn.Module):
    def __init__(self, bert_model_name, num_labels):
        super(BERTwithBiLSTM, self).__init__()
        self.bert = AutoModelForSequenceClassification.from_pretrained(bert_model_name, config=config)
        self.lstm = nn.LSTM(config.hidden_size, 128, bidirectional=True, batch_first=True) # BiLSTM layer
        self.classifier = nn.Linear(128 * 2, num_labels) # Classifier layer (256 because of bidirectional)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Take the BERT hidden states (last hidden state)
        bert_hidden_states = outputs.logits #outputs[0]
        # Pass the hidden states through the BiLSTM
        lstm_output, _ = self.lstm(bert_hidden_states)
        # Use the last hidden state of the BiLSTM for classification
        lstm_hidden_state = lstm_output[:, -1, :]
        logits = self.classifier(lstm_hidden_state)
        return logits

# Create an instance of the custom model
model = BERTwithBiLSTM(model_name, num_labels=2)

# Define a function to tokenize your data
def tokenize_function(examples):
    return tokenizer(examples["abstract"], padding="max_length", truncation=True)



Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m19.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m9.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (

Unnamed: 0,title,id,authors,year_of_publication,university,abstract,keywords,co_authors,citation,number_of_citations,patents,government_grants,citation_source,no_of_patents
0,教育的话实现科技更新是否资料关系销售.,1,蔡秀兰,2003,上海交通大学,需要她的同时解决类型帮助经验.怎么怎么文章国内两个这个内容.发生其实销售中文学习需要国家.标...,"人工智能, 深度学习, 自然语言处理, 云计算, 物联网",,,53,,,,0
1,地区以下深圳系列文化.,2,张畅,2002,浙江大学,或者决定通过社会专业提供成功起来.学生语言学生其他谢谢.经营他的注意一点.,"生物信息学, 大数据, 云计算, 自然语言处理",冯楠,,18,,,,0
2,运行知道信息论坛非常或者新闻用户.,3,胡博,2017,复旦大学,到了工程您的进入经营积分法律.中国设备因为可以原因一起人员.以上不过数据提高然后.标题规定女...,"深度学习, 人工智能, 云计算, 机器学习",,,87,,,,0
3,日期今天城市问题包括.,4,鞠璐,2002,中国科学技术大学,发现等级更多行业政府不同.发展日期以后其实积分.经营很多发布知道商品.阅读作品孩子查看.那个...,"大数据, 生物信息学, 自然语言处理, 计算机视觉, 深度学习",梁晶,,65,学校,,,5
4,质量中国不断音乐之后.,5,庄亮,2023,浙江大学,最新密码出现建设.一样日期文化计划欢迎最新方面公司.主题孩子发表方面.两个程序详细工程.,"深度学习, 云计算, 计算机视觉, 生物信息学",,这些内容一些一起北京不是以上今天今天成功特别.,54,中国,,Springer,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,大小服务积分工程科技研究阅读目前帮助.,1996,陈伟,2003,清华大学,如果留言喜欢这样.这里谢谢影响论坛方面女人.电子提高法律其他其实来自.为什汽车今天一下在线使...,"自然语言处理, 区块链, 云计算, 生物信息学",,,4,历史,,,3
1996,更新世界结果正在中国系列很多.,1997,朱艳,2007,南京大学,方法问题为了设备.要求现在社会无法有些你们但是.历史而且生产自己状态电影环境.工具什么提高提...,"计算机视觉, 大数据, 区块链, 云计算, 生物信息学",石晶,,43,,,,0
1997,这么正在当然一起表示大小.,1998,江兰英,2010,上海交通大学,销售名称不断起来.用户这样简介所有.学校中文国家如何深圳生活的话.过程包括介绍企业技术浏览日...,"计算机视觉, 大数据, 物联网, 深度学习, 自然语言处理",陈岩,,8,,,,0
1998,报告国际最新选择直接教育威望显示.,1999,李淑兰,2019,中山大学,国际市场今年进行过程.功能还是而且建设北京新闻科技.发布地址有限一次.支持人员不过不过以及....,"人工智能, 深度学习, 物联网, 自然语言处理, 机器学习",戴静,,66,,开发,,0


DataFrame saved to 'chinese_research_papers.csv'


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/624 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/110k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/269k [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/412M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [1]:
!pip install datasets==2.12.0  # or a later version



In [6]:
!pip install datasets==2.12.0 #Update datasets
!pip install transformers
!pip install torch
# Assuming 'df' is your pandas DataFrame from previous code
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, TrainingArguments, Trainer # Import Trainer here
import torch
import pandas as pd # Import pandas to read the CSV file
from torch import nn
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification, AutoTokenizer, AutoConfig

# Assuming 'df' is your pandas DataFrame from previous code
from datasets import Dataset, DatasetDict

# Read the DataFrame from the CSV file
df = pd.read_csv('chinese_research_papers.csv') # This line loads the DataFrame

# Convert DataFrame to DatasetDict
dataset_dict = DatasetDict({
    "train": Dataset.from_pandas(df.iloc[:int(0.8 * len(df))]), # Assuming 80% for training
    "validation": Dataset.from_pandas(df.iloc[int(0.8 * len(df)):]), # Remaining for validation
})

# Load pre-trained model and tokenizer here
model_name = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define a function to tokenize your data
# This function was previously defined in a separate cell and needs to be defined or imported here.
def tokenize_function(examples):
    return tokenizer(examples["abstract"], padding="max_length", truncation=True)

# Apply the tokenizer to your datasets
tokenized_datasets = dataset_dict.map(tokenize_function, batched=True)

# Fine-tuning configuration
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,                # Number of training epochs (adjust as needed)
    per_device_train_batch_size=16,  # Batch size for training (adjust based on your resources)
    per_device_eval_batch_size=64,   # Batch size for evaluation
    warmup_steps=500,                # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # Weight decay for regularization
    logging_dir="./logs",            # Directory for storing training logs
    logging_steps=10,                # Log training loss every 10 steps
    evaluation_strategy="epoch",     # Evaluate the model at the end of each epoch
    save_strategy="epoch",           # Save the model at the end of each epoch
    load_best_model_at_end=True,     # Load the best model at the end of training
    metric_for_best_model="accuracy",# Use accuracy as the metric to select the best model
)

# Define a custom model with BERT and BiLSTM
class BERTwithBiLSTM(nn.Module):
    def __init__(self, bert_model_name, num_labels):
        super(BERTwithBiLSTM, self).__init__()
        self.bert = AutoModelForSequenceClassification.from_pretrained(bert_model_name, config=config)
        self.lstm = nn.LSTM(config.hidden_size, 128, bidirectional=True, batch_first=True) # BiLSTM layer
        self.classifier = nn.Linear(128 * 2, num_labels) # Classifier layer (256 because of bidirectional)

    def forward(self, input_ids, attention_mask):
        outputs = self



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]



In [8]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
       labels = pred.label_ids
       preds = pred.predictions.argmax(-1)
       precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
       acc = accuracy_score(labels, preds)
       return {
           'accuracy': acc,
           'f1': f1,
           'precision': precision,
           'recall': recall
       }

In [10]:
from transformers import TrainingArguments

training_args = TrainingArguments(
       output_dir="./results",          # Output directory for model checkpoints and logs
       num_train_epochs=3,              # Number of training epochs
       per_device_train_batch_size=16,  # Batch size per device during training
       per_device_eval_batch_size=64,   # Batch size per device during evaluation
       warmup_steps=500,                # Number of warmup steps for learning rate scheduler
       weight_decay=0.01,               # Weight decay for regularization
       logging_dir="./logs",            # Directory for storing training logs
       logging_steps=10,                # Log every 10 steps
       evaluation_strategy="epoch",     # Evaluate at the end of each epoch
       save_strategy="epoch",          # Save model checkpoints at the end of each epoch
       load_best_model_at_end=True,    # Load the best model at the end of training
       metric_for_best_model="f1",      # Use F1 score to determine the best model
   )



In [16]:
from transformers import Trainer, AutoConfig

# ... (Your existing code) ...

# Replace 'label' with the actual name of the target column in your DataFrame
target_column_name = 'your_target_column_name'  # Example: 'target', 'labels', etc.

# Replace 'label' with the actual name of the target column in your DataFrame
target_column_name = 'label'  # Replace 'label' with the actual column name from your CSV

# Load the configuration
config = AutoConfig.from_pretrained(model_name, num_labels=len(df[target_column_name].unique()))

# Create an instance of your model
model = BERTwithBiLSTM(bert_model_name=model_name, num_labels=len(df[target_column_name].unique()))

trainer = Trainer(
    model=model,                    # Now 'model' is defined
    args=training_args,              # Training arguments
    train_dataset=tokenized_datasets["train"], # Training dataset
    eval_dataset=tokenized_datasets["validation"], # Validation dataset
    compute_metrics=compute_metrics, # Function to compute evaluation metrics
)

trainer.train()

KeyError: 'label'

In [19]:
from transformers import Trainer, AutoConfig

# ... (Your existing code) ...

# Replace 'label' with the actual name of the target column in your DataFrame
# Example: 'target', 'labels', etc.
# Check the column names in your 'chinese_research_papers.csv' file
target_column_name = 'your_target_column_name' # Replace 'your_target_column_name' with the actual column name

# Load the configuration
config = AutoConfig.from_pretrained(model_name, num_labels=len(df[authors].unique()))

# Create an instance of your model
model = BERTwithBiLSTM(bert_model_name=model_name, num_labels=len(df[target_column_name].unique()))

trainer = Trainer(
    model=model,                    # Now 'model' is defined
    args=training_args,              # Training arguments
    train_dataset=tokenized_datasets["train"], # Training dataset
    eval_dataset=tokenized_datasets["validation"], # Validation dataset
    compute_metrics=compute_metrics, # Function to compute evaluation metrics
)

trainer.train()

NameError: name 'authors' is not defined

In [None]:
trainer.save_model("./best_model")  # Save the best model to a directory