# Preparing your environment

For this course, we will only use the Python programming language. I use Python 3.10 with the pip package manager and the latest versions of all packages. You are free to use other versions or other package managers, of course.

We will make extensive use of the following packages:
* spaCy
* pandas
* transformers 🤗
* datasets 🤗
* scikit-learn (imported as `sklearn`)
* matplotlib

The following code installs the packages and tests whether your environment works as intended, so that you don't lose time during the course.

Python dependencies can be really nasty!

### Check Python version

In [1]:
import sys
assert sys.version_info.major == 3, "Python 3.x is required"
if sys.version_info.minor < 10:
    print("Warning: Python 3.10 is recommended")
else:
    # Report the actual minor version instead of always claiming 3.10
    print(f"Python 3.{sys.version_info.minor} 👍")

Python 3.10 👍


## Installation

In [2]:
!pip install -U spacy scikit-learn matplotlib transformers datasets pandas

Collecting datasets
  Downloading datasets-2.7.0-py3-none-any.whl (451 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 451.6/451.6 kB 1.1 MB/s eta 0:00:00
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 2.6.1
    Uninstalling datasets-2.6.1:
      Successfully uninstalled datasets-2.6.1
Successfully installed datasets-2.7.0

[notice] A new release of pip available: 22.2.2 -> 22.3.1
[notice] To update, run: python3.10 -m pip install --upgrade pip


### Test basic imports

In [3]:
import spacy
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay

from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from spacy.vectors import Vectors

from matplotlib import pyplot as plt

print("It works!")

It works!
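
Once the imports succeed, it can help to record which versions you actually ended up with, since the course assumes recent releases. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is made up for illustration, and the distribution names are assumed to be the ones used on PyPI (note `scikit-learn`, not `sklearn`).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# PyPI distribution names for the course packages (assumed; adjust if needed).
COURSE_PACKAGES = ["spacy", "pandas", "transformers", "datasets", "scikit-learn", "matplotlib"]

for name, ver in installed_versions(COURSE_PACKAGES).items():
    print(f"{name:15s} {ver if ver is not None else 'NOT INSTALLED'}")
```

Any package reported as not installed can be added with the pip command from the installation step.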


## Download spaCy models

In [4]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md
!python -m spacy download en_core_web_lg

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 13.1 MB/s eta 0:00:00

[notice] A new release of pip available: 22.2.2 -> 22.3.1
[notice] To update, run: python3.10 -m pip install --upgrade pip
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Test if that worked

In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')
nlp("Small model works!")

Small model works!

In [6]:
nlp = spacy.load('en_core_web_md')
nlp("Medium model works!")

Medium model works!

In [7]:
nlp = spacy.load('en_core_web_lg')
nlp("Large model works!")

Large model works!
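
The three checks above can also be folded into a single loop. spaCy pipelines are installed as ordinary Python packages, so a rough way to see which ones are missing, without loading them into memory, is to ask the import system. This is a sketch under that assumption, and `missing_models` is a hypothetical helper name.

```python
from importlib.util import find_spec


def missing_models(names):
    """Return the names that cannot be found as importable top-level packages."""
    return [name for name in names if find_spec(name) is None]


MODELS = ["en_core_web_sm", "en_core_web_md", "en_core_web_lg"]

missing = missing_models(MODELS)
if missing:
    print("Missing models:", ", ".join(missing))
else:
    print("All spaCy models are installed 👍")
```

If a model is listed as missing, rerun the download step for that model.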

## Download Hugging Face models

In [8]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)
print("It works!")

Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.0/28.0 [00:00<00:00, 8.49kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 483/483 [00:00<00:00, 84.4kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 448kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 466k/466k [00:00<00:00, 952kB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 268M/268M [00:16<00:00, 16.4MB/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'voca

It works!


#### If you want to run LLMs locally, you can uncomment this.

Before doing so, ❗❗_*SAVE YOUR WORK because YOUR LAPTOP MIGHT CRASH*_ ❗❗. You are loading a huge model into memory.

In [9]:
# from transformers import pipeline
# generator = pipeline('text-generation', model = 'bigscience/bloom-560m')
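
To get a feel for why this can strain a laptop, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It assumes roughly 560 million parameters (read off the model name) stored as 32-bit floats. Actual usage will be higher because of activations and framework overhead. `estimate_weight_gb` is a hypothetical helper, not part of any library.

```python
def estimate_weight_gb(num_parameters, bytes_per_parameter=4):
    """Rough size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# ~560M parameters is an assumption based on the name 'bloom-560m'.
approx_gb = estimate_weight_gb(560_000_000)
print(f"Weights alone: about {approx_gb:.1f} GB in float32")
```

Compare the printed figure with the free memory on your machine before uncommenting the cell above.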