<a href="https://colab.research.google.com/github/Shrsht/Datathon2024/blob/main/Model_GPU_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import torch
#import cupy
import spacy
import torchvision
import torchaudio
import transformers



# Step 1. Connect to GPU

We connect to the T4 GPU available in Google Colab so that we can load our CUDA package dependencies:
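Before installing anything GPU-specific, it can help to confirm that the runtime actually exposes a GPU. The following is a minimal sketch, not part of the original notebook: the helper name `list_gpus` is hypothetical, and it assumes the NVIDIA driver's `nvidia-smi` tool is on the `PATH` (as it normally is on a Colab GPU runtime).

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, total memory' string per visible NVIDIA GPU, or [] if none."""
    # nvidia-smi ships with the NVIDIA driver; it is absent on CPU-only runtimes.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No GPU visible; check the Colab runtime type.")
```

Inside the notebook, `torch.cuda.is_available()` and `spacy.prefer_gpu()` give the same yes/no answer from the library side.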

### 1.1 Download our Transformer Pipeline from spaCy:

We will be using **en_core_web_trf**, spaCy's pretrained English transformer pipeline hosted on Hugging Face, as our Transformer Architecture.

The documentation for **en_core_web_trf** can be found here: https://huggingface.co/spacy/en_core_web_trf

In [None]:
#!pip install spacy-transformers
import spacy_transformers



In [None]:
#!python -m spacy download en_core_web_trf
spacy.load('en_core_web_trf')

<spacy.lang.en.English at 0x7893b11c5e10>

# Step 2. Install our Parallel Computing Architecture - CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on GPUs. It allows developers to harness the computational power of NVIDIA GPUs to accelerate various types of computations, including scientific simulations, deep learning, image and signal processing, and more.

In [None]:
#!pip install -U 'spacy[cuda12x,transformers]'


# Step 3. Import our Computation Architecture - CuPy:

CuPy is an open-source numerical computing library built to closely mimic the NumPy interface while taking advantage of GPU acceleration through CUDA. It provides a NumPy-compatible multidimensional array object (ndarray) that performs computations on GPU devices.

In [None]:
import cupy
import transformers

# Step 4. Training our Transformer Model:

- We use our pre-annotated job descriptions (stored in .tsv files) as training and testing data
- We convert these .tsv files into .spacy files to feed into our pipeline
- Using the spaCy website, we generate the .cfg file that configures the training parameters for our pipeline


### 4.1 Convert .tsv files into .spacy files to feed into the spaCy model:
We begin by converting the .tsv files into JSON files.
The JSON files are then converted into .spacy files.
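To make the input format concrete, here is a minimal sketch of how token-per-line IOB annotations can be read and merged into entities. It is illustrative only: the function names and the sample text are hypothetical, and it assumes one tab-separated `token<TAB>tag` pair per line with blank lines between sentences. The actual conversion in this notebook is done by `spacy convert`, not by this code.

```python
def read_iob_sentences(text):
    """Split token-per-line text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def iob_to_entities(sentence):
    """Merge B-/I- tagged tokens into (entity text, label) pairs."""
    entities, tokens, label = [], [], None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if starts_new:
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)  # continue the current entity
        else:  # an "O" tag ends any open entity
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


# Hypothetical sample in the assumed format (not taken from train.tsv).
sample = (
    "Experience\tO\nwith\tO\nRandom\tB-TOOL\nForests\tI-TOOL\nand\tO\nSQL\tB-TOOL\n"
    "\n"
    "Bachelor\tB-EDUCATION\nof\tI-EDUCATION\nArts\tI-EDUCATION\n"
)
for sentence in read_iob_sentences(sample):
    print(iob_to_entities(sentence))
```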

In [None]:
!python -m spacy convert /content/train.tsv ./ -t json -n 1 -c iob
!python -m spacy convert /content/test.tsv ./ -t json -n 1 -c iob
!python -m spacy convert /content/train.json ./ -t spacy
!python -m spacy convert /content/test.json ./ -t spacy

ℹ Auto-detected token-per-line NER format
⚠ Document delimiters found, automatic document segmentation with `-n`
disabled.
✔ Generated output file (1 documents): train.json
ℹ Auto-detected token-per-line NER format
⚠ Document delimiters found, automatic document segmentation with `-n`
disabled.
✔ Generated output file (1 documents): test.json
✔ Generated output file (70 documents): train.spacy
✔ Generated output file (31 documents): test.spacy


### 4.2 Feed our Pipeline Configurations into the Pipeline:

We go to the spaCy training site, https://spacy.io/usage/training, to generate and set our pipeline parameters.

This generates a base .cfg file that we upload to Colab and use to train on our dataset.
 We name this file **base_config.cfg**; `spacy init fill-config` then expands it into **config_spacy.cfg** with all remaining defaults filled in.

In [None]:
!python -m spacy init fill-config /content/base_config.cfg  /content/config_spacy.cfg

✔ Auto-filled config with all values
✔ Saved config
/content/config_spacy.cfg
You can now add your data and train your pipeline:
python -m spacy train config_spacy.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
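The hint printed by `fill-config` is a reminder that training needs to know where the data lives, either through `--paths.train` / `--paths.dev` on the command line or through the `[paths]` section of the config. The sketch below is a rough check for the second option. It assumes the config has a `[paths]` section whose unset values are written as `null`, as in the quickstart template; the helper name `unset_paths` and the sample config text are hypothetical. It uses the standard-library `configparser` purely for illustration, whereas spaCy reads configs with its own loader (`spacy.util.load_config`).

```python
import configparser


def unset_paths(cfg_text, keys=("train", "dev")):
    """Return the [paths] keys that are missing or still set to null."""
    # Disable interpolation so spaCy-style ${paths.train} references are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section("paths"):
        return list(keys)
    return [
        key
        for key in keys
        if parser.get("paths", key, fallback="null").strip() in ("", "null")
    ]


# Hypothetical config fragment, not the notebook's actual config_spacy.cfg.
sample_cfg = """
[paths]
train = null
dev = "./test.spacy"

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
"""
print(unset_paths(sample_cfg))
```

In a notebook this could be called as `unset_paths(open("/content/config_spacy.cfg").read())` before starting a long training run.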


### 4.3 Training our Pipeline on our train and test data:

We use the `spacy train` command to train our pipeline. This process uses config_spacy.cfg, the filled version of the configuration we defined in base_config.cfg, and `-g 0` selects the first GPU.
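For reference, the pieces of a `spacy train` invocation can be assembled explicitly. This is a sketch under assumptions: the helper `build_train_command` is hypothetical, and the data paths passed in the example are guesses based on the files produced by the conversion step, not values confirmed by the notebook. The flags themselves (`--output`, `--gpu-id`, `--paths.train`, `--paths.dev`) are standard `spacy train` options.

```python
import shlex


def build_train_command(config_path, output_dir, gpu_id=-1, train_path=None, dev_path=None):
    """Assemble the argument list for `python -m spacy train`."""
    # gpu_id=-1 means CPU; 0 is the first GPU (the long form of `-g 0`).
    cmd = [
        "python", "-m", "spacy", "train", config_path,
        "--output", output_dir,
        "--gpu-id", str(gpu_id),
    ]
    # Path overrides are optional; they replace the values in the [paths] section.
    if train_path:
        cmd += ["--paths.train", train_path]
    if dev_path:
        cmd += ["--paths.dev", dev_path]
    return cmd


cmd = build_train_command(
    "/content/config_spacy.cfg",
    "./",
    gpu_id=0,
    train_path="./train.spacy",  # assumed location
    dev_path="./test.spacy",     # assumed location
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` or pasted after `!` in a Colab cell.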

In [None]:
!python -m spacy train -g 0 /content/config_spacy.cfg --output ./

ℹ Saving to output directory: .
ℹ Using GPU: 0
tokenizer_config.json: 100% 25.0/25.0 [00:00<00:00, 159kB/s]
config.json: 100% 481/481 [00:00<00:00, 3.15MB/s]
vocab.json: 100% 899k/899k [00:00<00:00, 4.50MB/s]
merges.txt: 100% 456k/456k [00:00<00:00, 3.50MB/s]
tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 5.07MB/s]
model.safetensors: 100% 499M/499M [00:03<00:00, 149MB/s]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
✔ Initialized pipeline
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  -

In [None]:
nlp = spacy.load("/content/drive/MyDrive/model_best")



In [None]:
bytes_data = nlp.to_bytes()

with open("/content/drive/MyDrive/best_model.bin", "wb") as f:
    f.write(bytes_data)

In [None]:
text = [
'''
    SHRESHT VENKATRAMAN

Indianapolis, IN | shresht.v24@gmail.com | 858-349-3816

LinkedIn: www.linkedin.com/in/shresht-venkatraman/ | Project Portfolio: https://shrsht.github.io/Shresht_Portfolio/

EDUCATION ________________________________________________________________
         Indiana University, Luddy School of Informatics, Computing, and Engineering| Indianapolis, IN
         Master of Science in Applied Data Science
         University of California San Diego |La Jolla, CA
         Bachelor of Arts in Economics| Minor: Data Science

                                                                          Dec 2023

                   Jan 2025

WORK EXPERIENCE __________________________________________________________
   July – Dec 2023

ALGORITHMS ENGINEERING INTERN -  Corsaire Co. | San Diego, CA
-  Developed back-end data dependencies and backbone database infrastructure for a new ‘Key Opinion
Leaders’(KOL) identification product. Resulted in a database of 8 million ‘opinion leaders’ in the drug
development industry.

-  Built automated data-mining systems using Python, Solr, R and SQL to collect data from 7000+ medical

journals, conferences, and publications saving manual searching time by 45%

-  Led the development of automated ETL processes that ingest data from disparate sources to create

individualized KOL profiles

FINANCE & DATA SCIENCE INTERN - Advent International | Boston, MA
    June - Aug 2022
-  Designed and built Tableau dashboards for 8 firm departments using 5-years of financial data. Improved the

firm’s ability to compare expenses across departments and recognize key performance indicators.

-  Facilitated the production of weekly financial reports to management including CFO, CEO, and Managing

Partners by producing statistical and visual analytics using Python and Tableau.

-  Programmed  a  custom  Python  algorithm  to  automate  data-cleaning  and  restructuring  processes  for  the

FP&A department. Increased efficiency by saving 10+ hours of manual data-cleaning

-  Built a ‘Deal-Stage Meter’ for the firm’s 3 global tax teams to accurately track the status of individual deals
and manage timelines for tax compliance. Greatly improved the coordination and  management between
firm’s Boston, London and Luxembourg tax teams

DATA SCIENCE PROJECTS __________________________________________

Stock Prediction System | Project Committee: UCSD Data Science Student Society                          Jan – July 2023
-  Collaborated with a 5-member team to develop machine learning models to predict opening prices of a

stock using Random Forests, LSTMs, Koopman Neural Networks implemented using TensorFlow and Scikit-
learn.

-  Calculated and graphed the Efficient Frontier for a given portfolio by minimizing the Portfolio Volatility and

maximizing the Sharpe Ratio using Python.

-  Developed a predictive model for stock prices utilizing the Twitter API to scrape tweets and perform NLP

Sentiment Analysis on Twitter activity.
Predicting Political Party from Stock Portfolios
-  Used the 'House Stock Watcher" data set of stock-market trading activity of members of the House of

    November 2022

Representatives between 2020 to 2022.

-  Predicted political affiliation from a stock portfolio by creating Scikit-learn Pipeline incorporating
RandomForest Classifiers, One-Hot Encoding and Grid-Search for hyperparameter optimization.

-  Tested for insider trading by using permutation testing to assess missingness of values and detect party-

wide preference for a stock.

Statistical Language Model of the Shakespeare Corpus
-  Developed Uniform, Unigram and N-Gram probabilistic models to predict the probability of a given text

    November 2022

being written by William Shakespeare

TECHNICAL SKILLS _____________________________________________________

LANGUAGE: Japanese (JLPT N3), French (DELF B1), Korean, Hindi (Native), Tamil
DATA ANALYSIS & PROGRAMMING: Python, Java, SQL, Excel, Stata, R, Tableau
PUBLIC CLOUD: AWS Machine Learning Specialty Certification
FINANCE: Financial accounting, Statistics, Econometrics and Probability
'''
]

for doc in nlp.pipe(text, disable=["tagger", "parser"]):
    print([(ent.text, ent.label_) for ent in doc.ents])

[('Master of Science', 'EDUCATION'), ('Bachelor of Arts', 'EDUCATION'), ('Python', 'TOOL'), ('Solr', 'TOOL'), ('SQL', 'TOOL'), ('Random Forests', 'TOOL'), ('Koopman Neural Networks', 'TOOL'), ('TensorFlow', 'TOOL'), ('Scikit-\n', 'TOOL'), ('learn', 'TOOL'), ('Scikit-learn', 'TOOL'), ('RandomForest Classifiers', 'TOOL'), ('One-Hot Encoding', 'TOOL'), ('Grid-Search', 'TOOL'), ('hyperparameter optimization', 'TECHNICAL_SKILL'), ('permutation testing', 'TECHNICAL_SKILL'), ('Statistical Language', 'TECHNICAL_SKILL'), ('N-Gram', 'TOOL'), ('Japanese', 'TOOL'), ('JLPT N3', 'TOOL'), ('French', 'TOOL'), ('DELF B1', 'TOOL'), ('Korean', 'TOOL'), ('Hindi', 'TOOL'), ('Tamil', 'TOOL'), ('Python', 'TOOL'), ('Java', 'TOOL'), ('SQL', 'TOOL'), ('Excel', 'TOOL'), ('Stata', 'TOOL'), ('R', 'TOOL'), ('Tableau', 'TOOL'), ('Financial accounting', 'TECHNICAL_SKILL'), ('Statistics', 'TECHNICAL_SKILL'), ('Econometrics', 'TECHNICAL_SKILL')]
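The flat list above contains repeated entities (for example `Python` and `SQL`) and fragments with stray newlines (`'Scikit-\n'`). A small post-processing step can make it easier to read. The sketch below is not part of the trained pipeline; the helper name `group_entities` is hypothetical, and the sample is a short excerpt of the output above.

```python
from collections import defaultdict


def group_entities(entities):
    """Map each label to its unique entity texts, in first-seen order."""
    grouped = defaultdict(list)
    for text, label in entities:
        cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
        if cleaned and cleaned not in grouped[label]:
            grouped[label].append(cleaned)
    return dict(grouped)


sample = [
    ("Python", "TOOL"),
    ("SQL", "TOOL"),
    ("Python", "TOOL"),
    ("Scikit-\n", "TOOL"),
    ("Statistics", "TECHNICAL_SKILL"),
]
print(group_entities(sample))
```

In the notebook it could be applied as `group_entities([(ent.text, ent.label_) for ent in doc.ents])`.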


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
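import shutil
from google.colab import files
# model_best is a directory, so archive it first; files.download expects a single file
files.download(shutil.make_archive('/content/model_best', 'zip', '/content/model_best'))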

In [None]:
! pip install huggingface_hub



In [None]:
from huggingface_hub import notebook_login

notebook_login()
