## **Overview of the Dataset**

The **Reuters-21578** dataset is a well-known benchmark in natural language processing, specifically for text categorization and document classification tasks, and it is often used to evaluate the performance of NLP algorithms. The original dataset is a collection of 21,578 news articles published on the Reuters financial newswire in 1987. Here we load it through the `nltk` library, whose version of the corpus covers 90 different categories.  

**"benchmark"** refers to a standard set of data or tasks that researchers and developers use to evaluate and compare the performance of different NLP models or algorithms. It's like a common ground or a standardized test that allows everyone to measure how well their systems are doing.  

**"Reuters"** refers to the Reuters news agency, a global information organization that provides news, financial information, and various other services to businesses, professionals, and the general public. Reuters is known for its extensive coverage of news from around the world, including topics such as finance, markets, technology, and general news.  

> Because the original Reuters dataset is highly skewed, this chapter uses a modified version of Reuters-21578 known as ApteMod. Although dated, it is still widely used for benchmarking algorithms, and it contains enough information for interesting post-processing and insights.   


In [16]:
import nltk
nltk.download('reuters')

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\Acer\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

In [17]:
from nltk.corpus import reuters
import pandas as pd
corpus = pd.DataFrame([
    {"id": _id,
     "text": reuters.raw(_id).replace("\n", ""), 
     "label": reuters.categories(_id)}
    for _id in reuters.fileids()
 ])


In [18]:
corpus

Unnamed: 0,id,text,label
0,test/14826,ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...,[trade]
1,test/14828,CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...,[grain]
2,test/14829,JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA...,"[crude, nat-gas]"
3,test/14832,THAI TRADE DEFICIT WIDENS IN FIRST QUARTER Th...,"[corn, grain, rice, rubber, sugar, tin, trade]"
4,test/14833,INDONESIA SEES CPO PRICE RISING SHARPLY Indon...,"[palm-oil, veg-oil]"
...,...,...,...
10783,training/999,U.K. MONEY MARKET SHORTAGE FORECAST REVISED DO...,"[interest, money-fx]"
10784,training/9992,KNIGHT-RIDDER INC &lt;KRN> SETS QUARTERLY Qtl...,[earn]
10785,training/9993,TECHNITROL INC &lt;TNL> SETS QUARTERLY Qtly d...,[earn]
10786,training/9994,NATIONWIDE CELLULAR SERVICE INC &lt;NCEL> 4TH ...,[earn]
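
The `id` column above also encodes the split that ships with the corpus: each file id starts with `test/` or `training/`. A minimal sketch of recovering that split from the prefix, using a handful of ids copied from the table (the helper name `split_of` is illustrative and not part of NLTK):

```python
from collections import Counter

# A few file ids copied from the table above.
sample_ids = [
    "test/14826", "test/14828", "test/14829",
    "training/999", "training/9992", "training/9993",
]

def split_of(file_id: str) -> str:
    """Return the part of the id before the slash ('test' or 'training')."""
    return file_id.split("/", 1)[0]

split_counts = Counter(split_of(file_id) for file_id in sample_ids)
print(split_counts)
```

On the full DataFrame the same idea could be written as `corpus["id"].str.split("/").str[0]`.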


> List all the topics and count how many documents there are per topic using the following code:

In [19]:
from collections import Counter
Counter([label for document_labels in corpus["label"] for label in document_labels]).most_common()

[('earn', 3964),
 ('acq', 2369),
 ('money-fx', 717),
 ('grain', 582),
 ('crude', 578),
 ('trade', 485),
 ('interest', 478),
 ('ship', 286),
 ('wheat', 283),
 ('corn', 237),
 ('dlr', 175),
 ('money-supply', 174),
 ('oilseed', 171),
 ('sugar', 162),
 ('coffee', 139),
 ('gnp', 136),
 ('veg-oil', 124),
 ('gold', 124),
 ('soybean', 111),
 ('nat-gas', 105),
 ('bop', 105),
 ('livestock', 99),
 ('cpi', 97),
 ('cocoa', 73),
 ('reserves', 73),
 ('carcass', 68),
 ('jobs', 67),
 ('copper', 65),
 ('rice', 59),
 ('yen', 59),
 ('cotton', 59),
 ('alum', 58),
 ('gas', 54),
 ('iron-steel', 54),
 ('ipi', 53),
 ('barley', 51),
 ('rubber', 49),
 ('meal-feed', 49),
 ('palm-oil', 40),
 ('zinc', 34),
 ('sorghum', 34),
 ('pet-chem', 32),
 ('tin', 30),
 ('lead', 29),
 ('silver', 29),
 ('wpi', 29),
 ('rapeseed', 27),
 ('strategic-metal', 27),
 ('orange', 27),
 ('soy-meal', 26),
 ('soy-oil', 25),
 ('retail', 25),
 ('fuel', 23),
 ('hog', 22),
 ('housing', 20),
 ('heat', 19),
 ('lumber', 16),
 ('sunseed', 16),
 ('i

> The Reuters-21578 dataset includes 90 different topics with a significant degree of imbalance between classes: almost 37% of the documents belong to the most common category, and only about 0.01% to each of the five least common categories.
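
The 37% figure can be checked directly from the counts printed above. A minimal sketch, assuming the NLTK corpus holds 10,788 documents in total (the value `len(corpus)` would give in the notebook); only the three largest categories are copied here:

```python
# Document counts for the three most common topics, copied from the output above.
top_counts = {"earn": 3964, "acq": 2369, "money-fx": 717}

# Assumption: total number of documents in the NLTK Reuters corpus.
# In the notebook this would be len(corpus).
n_documents = 10788

# Share of documents carrying each label, in percent.
shares = {topic: 100 * count / n_documents for topic, count in top_counts.items()}
for topic, share in shares.items():
    print(f"{topic}: {share:.1f}% of documents")
```

Because a document can carry several labels, these shares are not expected to sum to 100% across all topics.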

## **2. Implementing the Main NLP Concepts**

### **Detecting the Language of a Text**

In [20]:
# Remove newline
corpus["clean_text"] = corpus["text"].apply(
    lambda x: x.replace("\n", "")
 )

In [21]:
corpus.columns

Index(['id', 'text', 'label', 'clean_text'], dtype='object')

In [22]:
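# Detect the language of each article in the dataset
from langdetect import detect
import numpy as np

def getLanguage(text: str):
    # langdetect raises an exception when it cannot detect a language; record NaN instead
    try:
        return detect(text)
    except Exception:
        return np.nan

# Apply the safe wrapper rather than calling detect directly
corpus["language"] = corpus["text"].apply(getLanguage)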

In [23]:
corpus.head()

Unnamed: 0,id,text,label,clean_text,language
0,test/14826,ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...,[trade],ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...,en
1,test/14828,CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...,[grain],CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...,en
2,test/14829,JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA...,"[crude, nat-gas]",JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA...,en
3,test/14832,THAI TRADE DEFICIT WIDENS IN FIRST QUARTER Th...,"[corn, grain, rice, rubber, sugar, tin, trade]",THAI TRADE DEFICIT WIDENS IN FIRST QUARTER Th...,en
4,test/14833,INDONESIA SEES CPO PRICE RISING SHARPLY Indon...,"[palm-oil, veg-oil]",INDONESIA SEES CPO PRICE RISING SHARPLY Indon...,en


In [24]:
pd.DataFrame(corpus['language'].value_counts())

language,count
en,9889
sv,442
de,370
sw,29
so,25
nl,9
pt,7
vi,7
fr,2
ro,2


> There seem to be documents in languages other than English. In practice, these documents are often either very short or have a strange structure, which means they are not actual news articles.  
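
One way to check this observation is to compare document lengths per detected language and, if desired, keep only the English rows for the later steps. The sketch below uses a tiny stand-in DataFrame; the rows, their language codes, and the names `toy_corpus` and `english_only` are made up for illustration, and filtering to English is an assumption rather than a step taken in this notebook.

```python
import pandas as pd

# Hypothetical rows standing in for `corpus`; texts and language codes are made up.
toy_corpus = pd.DataFrame({
    "text": [
        "Mounting trade friction between the U.S. and Japan has raised fears among exporters.",
        "Shr 12 cts vs 10 cts",
        "Qtly div 25 cts vs 25 cts",
    ],
    "language": ["en", "sv", "de"],
})

# Character length of each document.
toy_corpus["n_chars"] = toy_corpus["text"].str.len()

# Median length per detected language.
median_length = toy_corpus.groupby("language")["n_chars"].median()
print(median_length)

# Keep only the documents detected as English.
english_only = toy_corpus[toy_corpus["language"] == "en"]
print(len(english_only))
```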
  
> Now that we have inferred the language, we can continue with the **language-dependent steps** of the analytical pipeline. We will use the spaCy library to load a pretrained NLP model.  
  
> Command to install English model :  
> python -m spacy download en_core_web_sm


In [25]:
#Load the model using spaCy
import spacy
nlp = spacy.load('en_core_web_sm')


In [26]:
corpus["clean_text"] = corpus["text"].apply(
    lambda x: x.replace("\n", "")
 )
text = corpus['clean_text'][0]
text



In [27]:
# Apply the pipeline to a single text
parsed = nlp(text)

> The parsed object returned by spaCy has several fields because many models are combined in a single pipeline. Together they provide different levels of text structuring: **Text Segmentation and Tokenization, Part-of-Speech Tagger, Named Entity Recognition (NER), Dependency Parser, Lemmatizer.**


### **2.1 Text Segmentation and Tokenization**  
> This process **breaks down a document into sentences and words, using punctuation and spaces**. spaCy's segmentation usually works well, but you might need adjustments for specific cases, such as short texts with slang or emojis. In such instances, consider using `TweetTokenizer` from `nltk` for better results, and explore other segmentation options based on your context.
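
As a rough illustration of the `TweetTokenizer` option mentioned above, the sketch below tokenizes a made-up short text containing a hashtag and emoticons. The input string is not from the Reuters corpus.

```python
from nltk.tokenize import TweetTokenizer

# TweetTokenizer is rule-based and needs no extra NLTK data download.
tokenizer = TweetTokenizer()

# Made-up short text with a hashtag and emoticons.
tweet = "This is a cooool #dummysmiley: :-) :-P <3"
tokens = tokenizer.tokenize(tweet)
print(tokens)
```

The hashtag and each emoticon come back as single tokens instead of being split on punctuation.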

In [28]:
for sent in parsed.sents:
    for token in sent:
        print(token, end=',')



### **2.2 Part-of-Speech Tagger**  
> Once the text is divided into individual words (tokens), the next step is Part-of-Speech (PoS) tagging: **assigning grammatical types like nouns or verbs to each token**. Engines used for PoS tagging are trained on labeled data, learning language patterns. **For example**, "the" is a determiner, usually followed by a noun. In spaCy, PoS information is stored in the `pos_` attribute of each `Token` object. Check available tags at https://spacy.io/models/en or use `spacy.explain` for human-readable explanations.

In [29]:
tokens=[]
posTag=[]
# Display Part-of-Speech tags for each token
for token in parsed:
    tokens.append(token.text)
    posTag.append(token.pos_)
pd.DataFrame(data={
    'Tokens': tokens,
    'PoS Tags': posTag
})

Unnamed: 0,Tokens,PoS Tags
0,ASIAN,ADJ
1,EXPORTERS,PROPN
2,FEAR,VERB
3,DAMAGE,NOUN
4,FROM,ADP
...,...,...
905,end,VERB
906,the,DET
907,dispute,NOUN
908,.,PUNCT


In [30]:
# For human-readable explanations of PoS tags
tokens=[]
posTag=[]
# Display Part-of-Speech tags for each token
for token in parsed:
    tokens.append(token.text)
    posTag.append(spacy.explain(token.pos_))
pd.DataFrame(data={
    'Tokens': tokens,
    'PoS Tags': posTag
})


Unnamed: 0,Tokens,PoS Tags
0,ASIAN,adjective
1,EXPORTERS,proper noun
2,FEAR,verb
3,DAMAGE,noun
4,FROM,adposition
...,...,...
905,end,verb
906,the,determiner
907,dispute,noun
908,.,punctuation


### **2.3 Named Entity Recognition (NER)**  
> This analysis step involves **a statistical model trained to identify types of nouns in the text**, such as Organization, Person, Location, Products, Numbers, and Currencies. Using context and prepositions, the model predicts the most likely entity type. Like other NLP pipeline steps, these models are trained on large datasets to learn common patterns. In spaCy, entity information is typically stored in the `ents` attribute of the parsed object. Additionally, spaCy offers tools, like the `displacy` module, to visually display entities in a text.

In [31]:
pd.DataFrame(data={
    'Token':[entity.text for entity in parsed.ents],
    'Label':[entity.label_ for entity in parsed.ents]
})

Unnamed: 0,Token,Label
0,ASIAN,NORP
1,RIFT,ORG
2,U.S.,GPE
3,Japan,GPE
4,Asia,LOC
...,...,...
89,Japan,GPE
90,International Trade and Industry,ORG
91,MITI,ORG
92,Washington,GPE
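
The label codes in the table above (`NORP`, `ORG`, `GPE`, `LOC`) are not self-explanatory. As with PoS tags, `spacy.explain` can be used to look them up; a small sketch, with `labels` and `descriptions` as illustrative names:

```python
import spacy

# Entity label codes that appear in the table above.
labels = ["NORP", "ORG", "GPE", "LOC"]

# spacy.explain looks each code up in spaCy's built-in glossary; no model is loaded.
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label}: {description}")
```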


In [32]:
# Visualize entities using displacy
spacy.displacy.render(parsed, style="ent", jupyter=True)

### **2.4 Dependency Parser**  
> The dependency parser is a powerful tool that **uncovers relationships between words in a sentence**, allowing you to create a syntactic tree to illustrate these connections.  
  
The root token, typically the main verb, ties together the subject and object. Subjects and objects can further connect to other elements like possessive pronouns, adjectives, or articles. Verbs can relate not only to subjects and objects but also to prepositions and subordinate predicates.  
  
**For example**, in the sentence **"Autonomous cars shift insurance liability towards manufacturers,"** the dependency tree reveals that the main verb, **"shift"** connects to the subject **"cars"** and the object **"liability"** through a subject-object relationship. It also involves the preposition **"towards."** Other words like **"Autonomous,"** **"insurance,"** and **"manufacturers"** are linked to either the subject, object, or preposition. This syntactic tree, created by spaCy, helps identify relationships between tokens, crucial information for constructing knowledge graphs.  

In [33]:
def visualize_dependency_tree(text):
    # Process the text with spaCy
    doc = nlp(text)
    
    # Visualize the dependency tree using displacy
    spacy.displacy.render(doc, style="dep", jupyter=True)

# Example usage:
text_to_visualize = "Autonomous cars shift insurance liability towards manufacturers."
visualize_dependency_tree(text_to_visualize)

In [34]:
corpus['clean_text'][0][:50]

'ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT  '

In [35]:
text_to_visualize = corpus['clean_text'][0][:49]
visualize_dependency_tree(text_to_visualize)

In [36]:
# Extract and display dependency information:
# each token, its dependency label, and the head token it attaches to
pd.DataFrame(data={
    'Token':[token.text for token in parsed],
    'Dependency-Relationship':[token.dep_ for token in parsed],
    'Head Token':[token.head.text for token in parsed]
}).head(5)



Unnamed: 0,Token,Dependency-Relationship,Head Token
0,ASIAN,compound,EXPORTERS
1,EXPORTERS,nsubj,FEAR
2,FEAR,ccomp,said
3,DAMAGE,dobj,FEAR
4,FROM,prep,DAMAGE
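
To hint at how such rows feed a knowledge graph, the sketch below pairs each subject with the object that shares the same head verb, producing (subject, verb, object) triples. It works on the five rows copied from the table above and matches heads by their text, which is a simplification: with a real spaCy `Doc` one would compare `token.head` objects instead. The names `rows`, `subjects`, `objects`, and `triples` are illustrative.

```python
# (token, dependency label, head token) rows copied from the table above.
rows = [
    ("ASIAN", "compound", "EXPORTERS"),
    ("EXPORTERS", "nsubj", "FEAR"),
    ("FEAR", "ccomp", "said"),
    ("DAMAGE", "dobj", "FEAR"),
    ("FROM", "prep", "DAMAGE"),
]

# Group subjects and direct objects by the head (verb) they attach to.
subjects, objects = {}, {}
for token, dep, head in rows:
    if dep == "nsubj":
        subjects.setdefault(head, []).append(token)
    elif dep == "dobj":
        objects.setdefault(head, []).append(token)

# A verb that has both a subject and an object yields one or more triples.
triples = [
    (subj, verb, obj)
    for verb in subjects.keys() & objects.keys()
    for subj in subjects[verb]
    for obj in objects[verb]
]
print(triples)
```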


### **2.5 Lemmatizer**  
> In the final step of the analysis process, we use a tool called a **lemmatizer**. Its job is to simplify words to their basic form, making them cleaner and reducing variations caused by things like verb conjugations or plural forms.  
  
**For example**:  
* the verb **"to be"** has variations like **"is," "are," and "was,"** but the lemmatizer turns them all into the common root **"be."**  
* **"car"** and **"cars"** are both reduced to the simpler form **"car."**  
  
> **This helps us focus on the core meaning of words without getting bogged down by minor differences due to grammar.** The lemmatizer follows rules to link words, including their conjugations and plurals, to a common root form. More advanced versions may consider the surrounding context and the type of word (Part-of-Speech) to handle situations like homonyms better.  
  
In spaCy, you can find the lemmatized version of a word in the `lemma_` attribute of each `Token` object.  
  
Sometimes, **stemmers** are used instead of lemmatizers. Stemmers **simplify words by removing the last part, dealing with inflectional and derivational variations.** They are simpler and rule-based, focusing on patterns rather than considering detailed language and structure. 
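
To make the contrast concrete, here is a small sketch of a stemmer at work. It uses NLTK's `PorterStemmer`, which is one common choice and an assumption here, since the text does not name a specific stemmer; the word list is made up.

```python
from nltk.stem import PorterStemmer

# The Porter stemmer is purely rule-based and needs no NLTK data download.
stemmer = PorterStemmer()

words = ["cars", "running", "studies"]
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```

Note that a stem does not have to be a dictionary word: "studies" is cut down to "studi", whereas a lemmatizer would typically return "study".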

In [37]:
# Extract and display lemmatized versions of tokens
pd.DataFrame(data={
    'Original Tokens':[token.text for token in parsed],
    'Lemmatized Tokens': [token.lemma_ for token in parsed]
}).head(20)


Unnamed: 0,Original Tokens,Lemmatized Tokens
0,ASIAN,asian
1,EXPORTERS,EXPORTERS
2,FEAR,fear
3,DAMAGE,damage
4,FROM,from
5,U.S.-JAPAN,U.S.-JAPAN
6,RIFT,RIFT
7,,
8,Mounting,Mounting
9,trade,trade
