## Installing the spaCy library (the most important library we will use)

In [2]:
pip install -U spacy==3.*

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.3.1 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip



Collecting spacy==3.*
  Downloading spacy-3.8.4-cp310-cp310-win_amd64.whl.metadata (27 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy==3.*)
  Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy==3.*)
  Downloading spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy==3.*)
  Downloading murmurhash-1.0.12-cp310-cp310-win_amd64.whl.metadata (2.2 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy==3.*)
  Downloading cymem-2.0.11-cp310-cp310-win_amd64.whl.metadata (8.8 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy==3.*)
  Downloading preshed-3.0.9-cp310-cp310-win_amd64.whl.metadata (2.2 kB)
Collecting thinc<8.4.0,>=8.3.4 (from spacy==3.*)
  Downloading thinc-8.3.4-cp310-cp310-win_amd64.whl.metadata (15 kB)
Collecting wasabi<1.2.0,>=0.9.1 (from spacy==3.*)
  Downloading wasabi-1.1.3-py3-none-any.whl.metadata (28 kB)
Collecting srsly<3.0.0,>=2.4.3 (from spacy==3.*)
 

In [5]:
!python -m spacy info


spaCy version    3.8.4                         
Location         C:\Users\vinicius23011\AppData\Roaming\Python\Python310\site-packages\spacy
Platform         Windows-10-10.0.22621-SP0     
Python version   3.10.8                        
Pipelines                                      



## Part 1: Learning the basics: Preprocessing, Basic Vectorization, Modelling Overview

In [7]:
import spacy

As we saw in the guide notebook (``ipynb``) by James Almeida, we need to load a suitable statistical model for our project. In the tutorial, the Professor begins with the **en_core_web_sm** model, the smallest English model in spaCy and a good starting point for NLP tasks. But what exactly is the **en_core_web_sm** model? Briefly, it is trained on written web text such as blogs, news, and comments, which makes it well suited to these goals, and it is **designed for fast processing on CPUs**. As the "sm" in the name suggests, it is a small model of about 12 MB. It is trained on real-world web text (OntoNotes 5, WordNet). It generally has high accuracy, especially for POS tagging (~97%) and parsing (~90%), and it is free and open access (MIT License). For now, this model is a good starting point for our **first tests**. However, since our corpus targets semiconductor properties, a larger model trained on text closer to our application, i.e., research articles, is probably more appropriate.

In [8]:
!python -m spacy download en_core_web_sm

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     --------------------------------------- 12.8/12.8 MB 21.1 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')





In [9]:
nlp = spacy.load('en_core_web_sm')
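
The load step above fails if the model package has not been downloaded yet. A minimal sketch of a more defensive loader follows; the helper name `load_pipeline` and the fallback to a blank English pipeline are assumptions made for illustration, not part of the tutorial.

```python
import spacy
from spacy.util import is_package


def load_pipeline(name: str = "en_core_web_sm"):
    """Load an installed pipeline package, or fall back to a blank English pipeline."""
    if is_package(name):
        return spacy.load(name)
    # Fallback (assumption for this sketch): tokenizer only, no tagger, parser or NER
    return spacy.blank("en")


nlp_demo = load_pipeline()
print(nlp_demo.lang, nlp_demo.pipe_names)
```

With the blank fallback only tokenization is available, so `pipe_names` is empty. With `en_core_web_sm` installed, the trained components are listed instead.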

Other available spaCy models are listed at these links:<br>
https://spacy.io/models<br>
https://spacy.io/usage/models

Now, the ``nlp`` variable references a **Language** object, an instance of a language-specific subclass of the `Language` class. It contains language-specific rules for tasks such as tokenization, together with a processing pipeline.

In [10]:
type(nlp)

spacy.lang.en.English
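
As a small check of the relationship between `English` and `Language`, the sketch below uses a blank English pipeline (`spacy.blank("en")`) as a stand-in, assumed here so that it runs without downloading a model. The variable name `nlp_blank` is illustrative.

```python
import spacy
from spacy.language import Language

# Assumption: a blank English pipeline stands in for en_core_web_sm
nlp_blank = spacy.blank("en")

# English is a subclass of Language, so this check holds for both pipelines
is_language = isinstance(nlp_blank, Language)

# A blank pipeline has a tokenizer but no trained components
pipe_names = nlp_blank.pipe_names

print(type(nlp_blank).__name__, is_language, pipe_names)
```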

### Tokenization
https://www.nlpdemystified.org/course/tokenization


We pass whatever text we want to process to _nlp_, which returns a **Doc** object. The **Doc** is a container that stores processed text along with its linguistic annotations. It is created automatically when text is passed to the `nlp` function and allows easy access to tokens, sentences, and named entities. 

The **Doc** object supports various features. It handles **tokenization**, breaking text into individual words (tokens), and provides **annotations**, where each token contains grammatical and syntactic details. It also enables **sentence segmentation**, identifying separate sentences, and **named entity recognition (NER)**, detecting names, locations, dates, and other entities. Additionally, it supports **lemmatization**, converting words to their base forms (e.g., "running" → "run"), and **dependency parsing**, understanding relationships between words. 

To use it, we can create a **Doc** by passing text to an `nlp` pipeline:  
`import spacy`  
`nlp = spacy.load("en_core_web_sm")`  
`doc = nlp("Hello world!")`  
From this, we can access tokens (`doc[0]` → "Hello"), named entities (`doc.ents`), and sentences (`doc.sents`).  

It is also possible to create a **custom Doc** manually:  
`from spacy.tokens import Doc`  
`words = ["Hello", "world", "!"]`  
`spaces = [True, False, False]`  
`doc = Doc(nlp.vocab, words=words, spaces=spaces)`  
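
The manual construction above can be tried as a self-contained sketch. It assumes a fresh `Vocab()` in place of `nlp.vocab`, so that no pipeline needs to be loaded. The `spaces` list states whether each word is followed by a space.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Hello", "world", "!"]
spaces = [True, False, False]  # only "Hello" is followed by a space

# Assumption: an empty Vocab replaces nlp.vocab to keep the sketch standalone
manual_doc = Doc(Vocab(), words=words, spaces=spaces)

print(repr(manual_doc.text))
print([(token.text, token.whitespace_) for token in manual_doc])
```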

The **Doc** object allows iteration (`for token in doc`), slicing (`doc[1:3]` to extract a part of a document), and exporting data in different formats such as arrays, JSON, or binary. It also supports **custom extensions**, letting users define new attributes (`doc._.custom_attr`).  
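
A minimal sketch of these operations, assuming a blank English pipeline so that it runs without a downloaded model. The attribute name `custom_attr` comes from the text above, while the value `"greeting"` is purely illustrative.

```python
import spacy
from spacy.tokens import Doc

nlp_blank = spacy.blank("en")  # assumption: tokenizer-only pipeline
doc = nlp_blank("Hello world!")

# Iteration over tokens
token_texts = [token.text for token in doc]

# Slicing returns a Span covering tokens 1 and 2
span_text = doc[1:3].text

# Custom extension; force=True lets the cell be re-run without an error
Doc.set_extension("custom_attr", default=None, force=True)
doc._.custom_attr = "greeting"

# Export to a JSON-serializable dict
as_json = doc.to_json()

print(token_texts, span_text, doc._.custom_attr, as_json["text"])
```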

In summary, the **Doc** object is how spaCy structures and organizes text, making it easier to process and extract meaningful information from it.


A **Doc** object contains **Token** and **Span** objects, which are essential components of text processing in spaCy. A **Token** represents a single unit in a text, such as a word, punctuation mark, or whitespace. Each token carries various attributes like its **text content** (`token.text`), **position in the sentence**, and **grammatical properties** such as **part of speech (POS)** and **dependency labels**. Tokens can be accessed within a `Doc` using indexing (`doc[0]` for the first token) or iterated over in a loop (`for token in doc`). Additionally, tokens support **custom attributes**, which can be added using `Token.set_extension()`, allowing users to define additional metadata. 

A **Span** represents a slice of text within a `Doc`, consisting of multiple tokens. Spans are useful for grouping words together, such as **named entities** (e.g., "New York" as a single unit). A span can be created by slicing a `Doc` (`doc[1:4]` selects three tokens). Like tokens, spans support **custom attributes** and can be used to extract **noun phrases, entity mentions, or syntactic chunks**. 

To use Token and Span in practice, we first create a `Doc` object by loading a spaCy pipeline. For example, `doc = nlp("Give it back! He pleaded.")` allows us to access tokens and spans within the text. The first token can be retrieved with `doc[0]`, which returns "Give". A span can be created by selecting a range of tokens, such as `doc[1:4]`, which returns "it back!". We can also define custom attributes for tokens and spans. For instance, a token attribute that flags fruits can be registered with `Token.set_extension("is_fruit", default=False)` and then set with `doc[3]._.is_fruit = True`; reading `doc[3]._.is_fruit` afterwards returns `True`. (In this sentence `doc[3]` is the token "!", so the assignment only illustrates the mechanism.) Similarly, a span attribute that detects a city name can be registered with `Span.set_extension("has_city", getter=lambda span: "New York" in span.text)`. For any span, `span._.has_city` returns `False` unless "New York" occurs within that span's text.
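
The Token and Span walkthrough above can be collected into one runnable sketch. It assumes a blank English pipeline (tokenizer only) instead of `en_core_web_sm`. The second sentence, `"I live in New York now."`, is a hypothetical example added to show a span for which `has_city` is `True`.

```python
import spacy
from spacy.tokens import Span, Token

nlp_blank = spacy.blank("en")  # assumption: tokenizer-only pipeline
doc = nlp_blank("Give it back! He pleaded.")

first_token = doc[0]   # "Give"
span = doc[1:4]        # tokens "it", "back", "!"

# force=True lets the cell be re-run without a "already exists" error
Token.set_extension("is_fruit", default=False, force=True)
Span.set_extension("has_city", getter=lambda s: "New York" in s.text, force=True)

# doc[3] is "!" here, so this only illustrates how the attribute is set
doc[3]._.is_fruit = True

# Hypothetical second sentence that does contain the city name
city_doc = nlp_blank("I live in New York now.")
city_span = city_doc[3:5]

print(first_token.text, "|", span.text, "|", span._.has_city, "|", city_span._.has_city)
```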

Token objects can measure **semantic similarity** between words, retrieve **morphological information** like tense and number, access **syntactic relationships** (e.g., parent-child relations in a dependency tree), and use **word embeddings** for vector-based comparisons. Note that the small model ships without static word vectors, so similarity scores from **en_core_web_sm** are less reliable than those from the medium or large models. Span objects enable the extraction of **named entities**, identification of **noun phrases** and **syntactic structures**, and sentence and phrase-level **semantic analysis**. Both **Token** and **Span** objects play a crucial role in **natural language understanding (NLU)**, making spaCy a powerful tool for **text analysis, entity recognition, and language modeling**.


## Trying to tokenize the article titles from our corpus using en_core_web_sm

In [12]:
import pandas as pd

# Load the Excel file
file_path = "../data (corpus)/data_mesh.xlsx"
xls = pd.ExcelFile(file_path)

xls.sheet_names


['Sheet1']

In [13]:
# Load the first sheet
df = xls.parse('Sheet1')

# View the first rows
df.head()


Unnamed: 0,Publication Type,Authors,Book Authors,Book Editors,Book Group Authors,Author Full Names,Book Author Full Names,Group Authors,Article Title,Source Title,...,Web of Science Index,Research Areas,IDS Number,Pubmed Id,Open Access Designations,Highly Cited Status,Hot Paper Status,Date of Export,UT (Unique WOS ID),Web of Science Record
0,J,"Brunthaler, G; Lindner, B; Pillwein, G; Griess...",,,,"Brunthaler, G; Lindner, B; Pillwein, G; Griess...",,,Two-dimensional metallic state in silicon-on-i...,PHYSICA E-LOW-DIMENSIONAL SYSTEMS & NANOSTRUCT...,...,,Science & Technology - Other Topics; Physics,,,,,,2025-03-11,WOS:000221140800060,0
1,C,"Pyragas, V; Lisauskas, V; Sliuziene, K; Vengal...",,"Grigonis, A",,"Pyragas, V.; Lisauskas, V.; Sliuziene, K.; Ven...",,,ELECTRICAL PROPERTIES OF NONSTOICHIOMETRIC In2...,3RD INTERNATIONAL CONFERENCE RADIATION INTERAC...,...,,Materials Science; Nuclear Science & Technology,,,,,,2025-03-11,WOS:000309143200075,0
2,J,"Qi, F; Chen, YF; Zheng, BJ; He, JR; Li, Q; Wan...",,,,"Qi, Fei; Chen, Yuanfu; Zheng, Binjie; He, Jiar...",,,Hierarchical architecture of ReS2/rGO composit...,APPLIED SURFACE SCIENCE,...,,Chemistry; Materials Science; Physics,,,,,,2025-03-11,WOS:000401680200016,0
3,J,"Wang, YX; Zhao, XY; Lü, SQ; Meng, XW; Zhang, Y...",,,,"Wang, Yaxin; Zhao, Xiaoyu; Lu, Shiquan; Meng, ...",,,Synthesis and characterization of SmSrCo2-xMnx...,CERAMICS INTERNATIONAL,...,,Materials Science,,,,,,2025-03-11,WOS:000337015300147,0
4,J,"Manousou, DK; Gardelis, S; Calamiotou, M; Sysk...",,,,"Manousou, Dimitra K.; Gardelis, Spiros; Calami...",,,VO2 thin films fabricated by reduction of ther...,MATERIALS LETTERS,...,,Materials Science; Physics,,,,,,2025-03-11,WOS:000670371300017,0


In [17]:
import spacy

# Load the small English pipeline from spaCy
nlp = spacy.load("en_core_web_sm")

# Select the article title column
text_column = "Article Title"

# Check that the column exists in the DataFrame
if text_column in df.columns:
    # Tokenize each non-missing title; rows with a missing title get NaN in "Tokens"
    df["Tokens"] = df[text_column].dropna().apply(lambda text: [token.text for token in nlp(text)])

    # Show the first results
    print(df[[text_column, "Tokens"]].head())
else:
    df = None  # If the column does not exist, no processing is done.

df.head() if df is not None else "Text column not found."


                                       Article Title  \
0  Two-dimensional metallic state in silicon-on-i...   
1  ELECTRICAL PROPERTIES OF NONSTOICHIOMETRIC In2...   
2  Hierarchical architecture of ReS2/rGO composit...   
3  Synthesis and characterization of SmSrCo2-xMnx...   
4  VO2 thin films fabricated by reduction of ther...   

                                              Tokens  
0  [Two, -, dimensional, metallic, state, in, sil...  
1  [ELECTRICAL, PROPERTIES, OF, NONSTOICHIOMETRIC...  
2  [Hierarchical, architecture, of, ReS2, /, rGO,...  
3  [Synthesis, and, characterization, of, SmSrCo2...  
4  [VO2, thin, films, fabricated, by, reduction, ...  


Unnamed: 0,Publication Type,Authors,Book Authors,Book Editors,Book Group Authors,Author Full Names,Book Author Full Names,Group Authors,Article Title,Source Title,...,Research Areas,IDS Number,Pubmed Id,Open Access Designations,Highly Cited Status,Hot Paper Status,Date of Export,UT (Unique WOS ID),Web of Science Record,Tokens
0,J,"Brunthaler, G; Lindner, B; Pillwein, G; Griess...",,,,"Brunthaler, G; Lindner, B; Pillwein, G; Griess...",,,Two-dimensional metallic state in silicon-on-i...,PHYSICA E-LOW-DIMENSIONAL SYSTEMS & NANOSTRUCT...,...,Science & Technology - Other Topics; Physics,,,,,,2025-03-11,WOS:000221140800060,0,"[Two, -, dimensional, metallic, state, in, sil..."
1,C,"Pyragas, V; Lisauskas, V; Sliuziene, K; Vengal...",,"Grigonis, A",,"Pyragas, V.; Lisauskas, V.; Sliuziene, K.; Ven...",,,ELECTRICAL PROPERTIES OF NONSTOICHIOMETRIC In2...,3RD INTERNATIONAL CONFERENCE RADIATION INTERAC...,...,Materials Science; Nuclear Science & Technology,,,,,,2025-03-11,WOS:000309143200075,0,"[ELECTRICAL, PROPERTIES, OF, NONSTOICHIOMETRIC..."
2,J,"Qi, F; Chen, YF; Zheng, BJ; He, JR; Li, Q; Wan...",,,,"Qi, Fei; Chen, Yuanfu; Zheng, Binjie; He, Jiar...",,,Hierarchical architecture of ReS2/rGO composit...,APPLIED SURFACE SCIENCE,...,Chemistry; Materials Science; Physics,,,,,,2025-03-11,WOS:000401680200016,0,"[Hierarchical, architecture, of, ReS2, /, rGO,..."
3,J,"Wang, YX; Zhao, XY; Lü, SQ; Meng, XW; Zhang, Y...",,,,"Wang, Yaxin; Zhao, Xiaoyu; Lu, Shiquan; Meng, ...",,,Synthesis and characterization of SmSrCo2-xMnx...,CERAMICS INTERNATIONAL,...,Materials Science,,,,,,2025-03-11,WOS:000337015300147,0,"[Synthesis, and, characterization, of, SmSrCo2..."
4,J,"Manousou, DK; Gardelis, S; Calamiotou, M; Sysk...",,,,"Manousou, Dimitra K.; Gardelis, Spiros; Calami...",,,VO2 thin films fabricated by reduction of ther...,MATERIALS LETTERS,...,Materials Science; Physics,,,,,,2025-03-11,WOS:000670371300017,0,"[VO2, thin, films, fabricated, by, reduction, ..."
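
The cell above runs the full pipeline (tagger, parser, NER) on every title, although only the token texts are kept. A possible tokenizer-only variant is sketched below, assuming a blank English pipeline and `nlp.pipe` for batching. The small DataFrame `demo` and its titles are hypothetical stand-ins for the corpus, and the names `nlp_tok` and `mask` are illustrative.

```python
import pandas as pd
import spacy

# Assumption: tokenizer-only pipeline, since only token texts are needed
nlp_tok = spacy.blank("en")

# Hypothetical mini-corpus, including one missing title
demo = pd.DataFrame({"Article Title": ["Two-dimensional metallic state", None, "thin films"]})

# Process only the rows that actually have a title
mask = demo["Article Title"].notna()
docs = nlp_tok.pipe(demo.loc[mask, "Article Title"].tolist())

# Align the token lists with their original rows; missing titles stay NaN
demo["Tokens"] = pd.Series(
    [[token.text for token in doc] for doc in docs],
    index=demo.index[mask],
)

print(demo)
```

For a large corpus, `nlp.pipe` processes texts in batches, which is usually faster than calling the pipeline once per row inside `apply`.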
