# NLP Web-Scraper for News Articles
This notebook is a simple web scraper that downloads a news article from a given URL, extracts its main content with BeautifulSoup, and analyzes the text with spaCy. The text is tokenized and lemmatized, then tagged with parts of speech, dependency-parsed, and passed through named entity recognition (NER). The per-token results are finally collected into a dataset.


### Import the Python libraries needed for web scraping.

In [2]:
import requests
from bs4 import BeautifulSoup


### Define the news article URL and use the requests library to get the HTML content of the page.

In [3]:
url = "https://edition.cnn.com/2023/06/16/africa/darfur-sudan-wagner-conflict-cmd-intl/index.html"
# Fetch the raw HTML of the article; the timeout keeps the cell from hanging.
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
# Pass the HTML to BeautifulSoup for parsing.
page = BeautifulSoup(response.text, "html.parser")
page

 <!DOCTYPE html>

<html data-layout-uri="cms.cnn.com/_layouts/layout-no-rail-article-fullwidth/instances/world-article-fullwidth-v1@published" data-uri="cms.cnn.com/_pages/h_86186ec28133b6f5a89e1c57591379c2@published" lang="en">
<head><style>body,h1,h2,h3,h4,h5{font-family:cnn_sans_display,helveticaneue,Helvetica,Arial,Utkal,sans-serif}h1,h2,h3,h4,h5{font-weight:700}:root{--theme-primary:#cc0000;--theme-background:#0c0c0c;--theme-divider:#404040;--theme-copy:#404040;--theme-copy-accent:#e6e6e6;--theme-copy-accent-hover:#ffffff;--theme-icon-color:#e6e6e6;--theme-icon-color-hover:#ffffff;--theme-ad-slot-background-color:#0c0c0c;--theme-ad-slot-text-color:#b1b1b1;--theme-ad-slot-text-hover:#ffffff;--theme-font-family:cnn_sans_display,helveticaneue,Helvetica,Arial,Utkal,sans-serif;--theme-searchbox-border:#b1b1b1;--theme-copy-follow:#ffffff;--theme-article-spacing-top:0px;--theme-link-color-hover:#6e6e6e;--theme-color-link:#0c0c0c;--theme-button-color:#6e6e6e;--theme-button-color-hover:#cc

### Use BeautifulSoup to extract the main content of the news article by finding every p tag.

In [4]:
p_tags = page.find_all("p")
p_tags

[<p class="editor-note inline-placeholder" data-article-gutter="true" data-component-name="editor-note" data-editable="text" data-uri="cms.cnn.com/_components/editor-note/instances/editor-note-0d418bcb1547b671b119f52d0e5e68cf@published">
 <b>Editor’s Note: </b>This report includes details of sexual assaults and violence.
 </p>,
 <p class="paragraph inline-placeholder" data-article-gutter="true" data-component-name="paragraph" data-editable="text" data-uri="cms.cnn.com/_components/paragraph/instances/paragraph_8743271C-360D-B377-585B-B9AD3069C105@published">
             Cries pierced the air as a car full of women and children crossed into Chad from <a href="http://cnn.com/2023/06/09/africa/sudan-24-hour-ceasefire-intl/index.html" target="_blank">war-torn Sudan</a>. A woman, in the late stages of pregnancy, lay in the backseat, lifeless and soaked in blood. Her children wailed at her feet. 
     </p>,
 <p class="paragraph inline-placeholder" data-article-gutter="true" data-component-na
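
The list above mixes the article body with other paragraphs, such as the editor's note. One way to keep only the body is to filter on the `data-component-name` attribute visible in the output. The sketch below illustrates the idea with the standard-library `HTMLParser` on a shortened, hand-made sample, so it runs without a network connection. `ParagraphExtractor` and `sample_html` are hypothetical names, and the assumption that body paragraphs carry `data-component-name="paragraph"` rests only on the output shown here.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> tags whose data-component-name matches."""

    def __init__(self, component="paragraph"):
        super().__init__()
        self.component = component
        self.paragraphs = []  # finished paragraph strings
        self._buffer = None   # text chunks of the <p> being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("data-component-name") == self.component:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse the indentation and newlines left over from the markup.
            self.paragraphs.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


# Shortened sample modelled on the output above (not the full page).
sample_html = """
<p data-component-name="editor-note"><b>Editor's Note: </b>This report includes details.</p>
<p data-component-name="paragraph">
    Cries pierced the air as a car crossed into Chad from
    <a href="#">war-torn Sudan</a>.
</p>
<p>Footer text</p>
"""

extractor = ParagraphExtractor()
extractor.feed(sample_html)
print(extractor.paragraphs)
```

With BeautifulSoup, a comparable filter would be `page.find_all("p", attrs={"data-component-name": "paragraph"})`.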

### Extract the text of each paragraph into a list.

In [5]:
article = [p.text for p in p_tags]
article

['\nEditor’s Note: This report includes details of sexual assaults and violence.\n',
 '\n            Cries pierced the air as a car full of women and children crossed into Chad from war-torn Sudan. A woman, in the late stages of pregnancy, lay in the backseat, lifeless and soaked in blood. Her children wailed at her feet. \n    ',
 '\n            “I sat next to her in the car,” said Butheina Nourin, describing her perilous escape from Sudan’s Darfur region alongside the dead woman. “Her name was Fatima. I don’t know her surname.”  \n    ',
 '\n            Fighters from the powerful Sudanese paramilitary group, the Rapid Support Forces (RSF), and their armed allies manned checkpoints along their route, Nourin said, demanding money from every passenger in exchange for safe passage.  \n    ',
 '\n            This brutal method of extortion has become widespread in Darfur, according to dozens of witnesses who recounted similar incidents to CNN. \n    ',
 '\n            The vast western reg

### Join the paragraphs into a single string.

In [6]:
article = " ".join(article)
article

"\nEditor’s Note: This report includes details of sexual assaults and violence.\n \n            Cries pierced the air as a car full of women and children crossed into Chad from war-torn Sudan. A woman, in the late stages of pregnancy, lay in the backseat, lifeless and soaked in blood. Her children wailed at her feet. \n     \n            “I sat next to her in the car,” said Butheina Nourin, describing her perilous escape from Sudan’s Darfur region alongside the dead woman. “Her name was Fatima. I don’t know her surname.”  \n     \n            Fighters from the powerful Sudanese paramilitary group, the Rapid Support Forces (RSF), and their armed allies manned checkpoints along their route, Nourin said, demanding money from every passenger in exchange for safe passage.  \n     \n            This brutal method of extortion has become widespread in Darfur, according to dozens of witnesses who recounted similar incidents to CNN. \n     \n            The vast western region of Sudan is the s

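The joined string still carries the indentation and newlines of the original markup, which spaCy reports as separate SPACE tokens in the outputs further down. A minimal sketch of collapsing that whitespace before joining, using two paragraphs from the list above (the second one shortened); `normalize` and `article_clean` are hypothetical names:

```python
# Two paragraphs as they appear in the scraped list, whitespace included.
paragraphs = [
    "\nEditor’s Note: This report includes details of sexual assaults and violence.\n",
    "\n            Cries pierced the air as a car full of women and children "
    "crossed into Chad from war-torn Sudan. \n    ",
]


def normalize(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return " ".join(text.split())


article_clean = " ".join(normalize(p) for p in paragraphs)
print(article_clean)
```

Whether to normalize is a design choice: it removes the SPACE tokens from the dataset, but it also discards the paragraph boundaries.
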
### Install and import spaCy, then load the en_core_web_sm model to tokenize and lemmatize the text.

In [7]:
%pip install -U spacy
# The en_core_web_sm model is a separate download: python -m spacy download en_core_web_sm




[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.2[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [8]:
import spacy

In [9]:
nlp = spacy.load("en_core_web_sm")
nlp

<spacy.lang.en.English at 0x7f47a4413970>

In [10]:
doc = nlp(article)
doc


Editor’s Note: This report includes details of sexual assaults and violence.
 
            Cries pierced the air as a car full of women and children crossed into Chad from war-torn Sudan. A woman, in the late stages of pregnancy, lay in the backseat, lifeless and soaked in blood. Her children wailed at her feet. 
     
            “I sat next to her in the car,” said Butheina Nourin, describing her perilous escape from Sudan’s Darfur region alongside the dead woman. “Her name was Fatima. I don’t know her surname.”  
     
            Fighters from the powerful Sudanese paramilitary group, the Rapid Support Forces (RSF), and their armed allies manned checkpoints along their route, Nourin said, demanding money from every passenger in exchange for safe passage.  
     
            This brutal method of extortion has become widespread in Darfur, according to dozens of witnesses who recounted similar incidents to CNN. 
     
            The vast western region of Sudan is the site of what 

### Tokenization

In [11]:
tokens = list(doc)
tokens

[,
 Editor,
 ’s,
 Note,
 :,
 This,
 report,
 includes,
 details,
 of,
 sexual,
 assaults,
 and,
 violence,
 .,
 
  
             ,
 Cries,
 pierced,
 the,
 air,
 as,
 a,
 car,
 full,
 of,
 women,
 and,
 children,
 crossed,
 into,
 Chad,
 from,
 war,
 -,
 torn,
 Sudan,
 .,
 A,
 woman,
 ,,
 in,
 the,
 late,
 stages,
 of,
 pregnancy,
 ,,
 lay,
 in,
 the,
 backseat,
 ,,
 lifeless,
 and,
 soaked,
 in,
 blood,
 .,
 Her,
 children,
 wailed,
 at,
 her,
 feet,
 .,
 
      
             ,
 “,
 I,
 sat,
 next,
 to,
 her,
 in,
 the,
 car,
 ,,
 ”,
 said,
 Butheina,
 Nourin,
 ,,
 describing,
 her,
 perilous,
 escape,
 from,
 Sudan,
 ’s,
 Darfur,
 region,
 alongside,
 the,
 dead,
 woman,
 .,
 “,
 Her,
 name,
 was,
 Fatima,
 .,
 I,
 do,
 n’t,
 know,
 her,
 surname,
 .,
 ”,
  
      
             ,
 Fighters,
 from,
 the,
 powerful,
 Sudanese,
 paramilitary,
 group,
 ,,
 the,
 Rapid,
 Support,
 Forces,
 (,
 RSF,
 ),
 ,,
 and,
 their,
 armed,
 allies,
 manned,
 checkpoints,
 along,
 their,
 route,
 ,,
 

### Lemmatization

In [12]:
for token in doc:
    print("Token: ",token, "    Lemmatized Form ==> ", token.lemma_)

Token:  
     Lemmatized Form ==>  

Token:  Editor     Lemmatized Form ==>  Editor
Token:  ’s     Lemmatized Form ==>  ’s
Token:  Note     Lemmatized Form ==>  note
Token:  :     Lemmatized Form ==>  :
Token:  This     Lemmatized Form ==>  this
Token:  report     Lemmatized Form ==>  report
Token:  includes     Lemmatized Form ==>  include
Token:  details     Lemmatized Form ==>  detail
Token:  of     Lemmatized Form ==>  of
Token:  sexual     Lemmatized Form ==>  sexual
Token:  assaults     Lemmatized Form ==>  assault
Token:  and     Lemmatized Form ==>  and
Token:  violence     Lemmatized Form ==>  violence
Token:  .     Lemmatized Form ==>  .
Token:  
 
                 Lemmatized Form ==>  
 
            
Token:  Cries     Lemmatized Form ==>  Cries
Token:  pierced     Lemmatized Form ==>  pierce
Token:  the     Lemmatized Form ==>  the
Token:  air     Lemmatized Form ==>  air
Token:  as     Lemmatized Form ==>  as
Token:  a     Lemmatized Form ==>  a
Token:  car     Lemmatized F

### Part-of-Speech (POS) Tagging

In [13]:
for token in doc:
    print(f"Token: {token} POS =>:  {token.pos_}")

Token: 
 POS =>:  SPACE
Token: Editor POS =>:  PROPN
Token: ’s POS =>:  PART
Token: Note POS =>:  NOUN
Token: : POS =>:  PUNCT
Token: This POS =>:  DET
Token: report POS =>:  NOUN
Token: includes POS =>:  VERB
Token: details POS =>:  NOUN
Token: of POS =>:  ADP
Token: sexual POS =>:  ADJ
Token: assaults POS =>:  NOUN
Token: and POS =>:  CCONJ
Token: violence POS =>:  NOUN
Token: . POS =>:  PUNCT
Token: 
 
             POS =>:  SPACE
Token: Cries POS =>:  PROPN
Token: pierced POS =>:  VERB
Token: the POS =>:  DET
Token: air POS =>:  NOUN
Token: as POS =>:  ADP
Token: a POS =>:  DET
Token: car POS =>:  NOUN
Token: full POS =>:  ADJ
Token: of POS =>:  ADP
Token: women POS =>:  NOUN
Token: and POS =>:  CCONJ
Token: children POS =>:  NOUN
Token: crossed POS =>:  VERB
Token: into POS =>:  ADP
Token: Chad POS =>:  PROPN
Token: from POS =>:  ADP
Token: war POS =>:  NOUN
Token: - POS =>:  PUNCT
Token: torn POS =>:  VERB
Token: Sudan POS =>:  PROPN
Token: . POS =>:  PUNCT
Token: A POS =>:  DET
T
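
The lemmas and POS tags above can be combined to pick out candidate "important" words: keep only content words and count their lemmas. The sketch below works on `(lemma, POS)` pairs copied from the first tokens of the two outputs above, so it runs without spaCy; on the real `Doc` the pairs would come from `token.lemma_` and `token.pos_`. The choice of `CONTENT_POS` is an assumption, not something the notebook prescribes.

```python
from collections import Counter

# (lemma, coarse POS tag) pairs copied from the outputs above.
# On the real Doc: rows = [(t.lemma_, t.pos_) for t in doc]
rows = [
    ("Editor", "PROPN"), ("’s", "PART"), ("note", "NOUN"), (":", "PUNCT"),
    ("this", "DET"), ("report", "NOUN"), ("include", "VERB"),
    ("detail", "NOUN"), ("of", "ADP"), ("sexual", "ADJ"),
    ("assault", "NOUN"), ("and", "CCONJ"), ("violence", "NOUN"),
    (".", "PUNCT"), ("Cries", "PROPN"), ("pierce", "VERB"),
    ("the", "DET"), ("air", "NOUN"),
]

# Assumed definition of "content word" for this sketch.
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ"}

# Count lower-cased lemmas of content words only.
lemma_counts = Counter(
    lemma.lower() for lemma, pos in rows if pos in CONTENT_POS
)

for lemma, count in lemma_counts.most_common(5):
    print(lemma, count)
```

On this short sample every lemma occurs once; over the full article the counts would separate frequent terms from incidental ones.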

### Dependency Parsing

In [14]:
for token in doc:
    print(f"Dependency Parsing for Token: '{token}' is ==> {token.dep_} Explanation: {spacy.explain(token.dep_)}")

Dependency Parsing for Token: '
' is ==> dep Explanation: unclassified dependent
Dependency Parsing for Token: 'Editor' is ==> poss Explanation: possession modifier
Dependency Parsing for Token: '’s' is ==> case Explanation: case marking
Dependency Parsing for Token: 'Note' is ==> ROOT Explanation: root
Dependency Parsing for Token: ':' is ==> punct Explanation: punctuation
Dependency Parsing for Token: 'This' is ==> det Explanation: determiner
Dependency Parsing for Token: 'report' is ==> nsubj Explanation: nominal subject
Dependency Parsing for Token: 'includes' is ==> acl Explanation: clausal modifier of noun (adjectival clause)
Dependency Parsing for Token: 'details' is ==> dobj Explanation: direct object
Dependency Parsing for Token: 'of' is ==> prep Explanation: prepositional modifier
Dependency Parsing for Token: 'sexual' is ==> amod Explanation: adjectival modifier
Dependency Parsing for Token: 'assaults' is ==> pobj Explanation: object of preposition
Dependency Parsing for Token: 'and' is 



### Named Entity Recognition (NER)

In [15]:
for entity in doc.ents:
    print(f"Entity: {entity} ==> {entity.label_}  Explanation==> {spacy.explain(entity.label_)}")

Entity: Chad ==> GPE  Explanation==> Countries, cities, states
Entity: Sudan ==> GPE  Explanation==> Countries, cities, states
Entity: Butheina Nourin ==> PERSON  Explanation==> People, including fictional
Entity: Sudan ==> GPE  Explanation==> Countries, cities, states
Entity: Darfur ==> GPE  Explanation==> Countries, cities, states
Entity: Fatima ==> PERSON  Explanation==> People, including fictional
Entity: Sudanese ==> NORP  Explanation==> Nationalities or religious or political groups
Entity: the Rapid Support Forces ==> ORG  Explanation==> Companies, agencies, institutions, etc.
Entity: RSF ==> ORG  Explanation==> Companies, agencies, institutions, etc.
Entity: Nourin ==> ORG  Explanation==> Companies, agencies, institutions, etc.
Entity: Darfur ==> GPE  Explanation==> Countries, cities, states
Entity: dozens ==> CARDINAL  Explanation==> Numerals that do not fall under another type
Entity: CNN ==> ORG  Explanation==> Companies, agencies, institutions, etc.
Entity: Sudan ==> GPE  E
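
The flat entity list is easier to inspect once it is grouped by label. A minimal sketch using the `(text, label)` pairs printed above, so it runs without spaCy; on the real `Doc` the pairs would come from `entity.text` and `entity.label_`. The names `label_counts` and `by_label` are hypothetical.

```python
from collections import Counter, defaultdict

# (entity text, label) pairs copied from the output above.
# On the real Doc: entities = [(e.text, e.label_) for e in doc.ents]
entities = [
    ("Chad", "GPE"), ("Sudan", "GPE"), ("Butheina Nourin", "PERSON"),
    ("Sudan", "GPE"), ("Darfur", "GPE"), ("Fatima", "PERSON"),
    ("Sudanese", "NORP"), ("the Rapid Support Forces", "ORG"),
    ("RSF", "ORG"), ("Nourin", "ORG"), ("Darfur", "GPE"),
    ("dozens", "CARDINAL"), ("CNN", "ORG"),
]

# How many mentions each label received.
label_counts = Counter(label for _, label in entities)

# Unique entity texts per label, in order of first appearance.
by_label = defaultdict(list)
for text, label in entities:
    if text not in by_label[label]:
        by_label[label].append(text)

print(label_counts.most_common())
print(dict(by_label))
```

Grouping also makes model errors easy to spot: "Nourin" lands under ORG here although the article introduces Butheina Nourin as a person.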

### Visualizing the Named Entity Recognition (NER)

In [16]:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

### Visualizing the Dependency Parse

In [17]:
displacy.render(doc, style="dep", jupyter=True)

### Generating a dataset: collect the attributes of every token into a list of dictionaries.

In [24]:
# Build one dictionary per token from the attributes spaCy computed.
docList = []
for token in doc:
    docList.append({
        "Token": token.text,
        "is_Stop_Word": token.is_stop,
        "Lemmatized": token.lemma_,
        "Tag": token.tag_,
        "Tag Explanation": spacy.explain(token.tag_),
        "Part of Speech Tag": token.pos_,
        "Dependency Parsing": token.dep_,
    })


### Now convert to dataframe

In [25]:
import pandas as pd
docDF = pd.DataFrame(docList)
docDF

|      | Token | is_Stop_Word | Lemmatized | Tag | Tag Explanation | Part of Speech Tag | Dependency Parsing |
|------|-------|--------------|------------|-----|-----------------|--------------------|--------------------|
| 0    | \n | False | \n | _SP | whitespace | SPACE | dep |
| 1    | Editor | False | Editor | NNP | noun, proper singular | PROPN | poss |
| 2    | ’s | True | ’s | POS | possessive ending | PART | case |
| 3    | Note | False | note | NN | noun, singular or mass | NOUN | ROOT |
| 4    | : | False | : | : | punctuation mark, colon or ellipsis | PUNCT | punct |
| ...  | ... | ... | ... | ... | ... | ... | ... |
| 3244 | 2016 | False | 2016 | CD | cardinal number | NUM | nummod |
| 3245 | Cable | False | Cable | NNP | noun, proper singular | PROPN | compound |
| 3246 | News | False | News | NNP | noun, proper singular | PROPN | compound |
| 3247 | Network | False | Network | NNP | noun, proper singular | PROPN | dobj |


### Print the columns of the dataframe

In [28]:
docDF.columns

Index(['Token', 'is_Stop_Word', 'Lemmatized', 'Tag', 'Tag Explanation',
       'Part of Speech Tag', 'Dependency Parsing'],
      dtype='object')

In [29]:
# Fix the column order explicitly (a no-op here, since the columns are already in this order).
docDF = docDF.reindex(columns=['Token', 'is_Stop_Word', 'Lemmatized', 'Tag', 'Tag Explanation', 'Part of Speech Tag', 'Dependency Parsing'])
docDF

|      | Token | is_Stop_Word | Lemmatized | Tag | Tag Explanation | Part of Speech Tag | Dependency Parsing |
|------|-------|--------------|------------|-----|-----------------|--------------------|--------------------|
| 0    | \n | False | \n | _SP | whitespace | SPACE | dep |
| 1    | Editor | False | Editor | NNP | noun, proper singular | PROPN | poss |
| 2    | ’s | True | ’s | POS | possessive ending | PART | case |
| 3    | Note | False | note | NN | noun, singular or mass | NOUN | ROOT |
| 4    | : | False | : | : | punctuation mark, colon or ellipsis | PUNCT | punct |
| ...  | ... | ... | ... | ... | ... | ... | ... |
| 3244 | 2016 | False | 2016 | CD | cardinal number | NUM | nummod |
| 3245 | Cable | False | Cable | NNP | noun, proper singular | PROPN | compound |
| 3246 | News | False | News | NNP | noun, proper singular | PROPN | compound |
| 3247 | Network | False | Network | NNP | noun, proper singular | PROPN | dobj |


### Save the new dataframe to a CSV file

In [27]:
docDF.to_csv("NLP_Scraped_Article.csv", index=False)
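
Several values in this dataset contain commas ("noun, proper singular") or are a bare newline, so they have to be quoted when written as CSV. The sketch below round-trips the first five rows shown above through the standard-library `csv` module, writing to an in-memory buffer instead of a file so that it has no dependencies; it stands in for, and is not, the pandas call above. It then keeps the lemmas of tokens that are neither stop words nor whitespace or punctuation. `buffer`, `records`, and `content` are hypothetical names.

```python
import csv
import io

COLUMNS = ["Token", "is_Stop_Word", "Lemmatized", "Tag", "Tag Explanation",
           "Part of Speech Tag", "Dependency Parsing"]

# First five rows of the dataframe shown above.
rows = [
    ["\n", False, "\n", "_SP", "whitespace", "SPACE", "dep"],
    ["Editor", False, "Editor", "NNP", "noun, proper singular", "PROPN", "poss"],
    ["’s", True, "’s", "POS", "possessive ending", "PART", "case"],
    ["Note", False, "note", "NN", "noun, singular or mass", "NOUN", "ROOT"],
    [":", False, ":", ":", "punctuation mark, colon or ellipsis", "PUNCT", "punct"],
]

# Write to an in-memory buffer; fields with commas or newlines get quoted.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(COLUMNS)
writer.writerows(rows)

# Read the same text back as dictionaries keyed by column name.
buffer.seek(0)
records = list(csv.DictReader(buffer))

# The csv module returns every value as a string, hence the "False" comparison.
content = [
    r["Lemmatized"] for r in records
    if r["is_Stop_Word"] == "False"
    and r["Part of Speech Tag"] not in {"SPACE", "PUNCT"}
]
print(content)
```

Note that the `csv` module reads every field back as a string, so the boolean column has to be compared against the text "False".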