

## Text Processing

### Q1

1. Modify the code I wrote in lecture 8 with what you have learnt in lecture 9 so that it correctly tokenizes the text at both the word and sentence level and removes the stopwords. Rewrite the `getSummary` function and all the other functions it depends on with these corrections.

2. Rewrite the code I wrote for the `getKeywords` function, making the same corrections.

3. Test your code from parts 1 and 2 on random articles from the Guardian.

4. Rewrite the `getSubjectGuardian` function for another English-language newspaper, and test your code from parts 1 and 2 on random articles from this new newspaper.
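
The cleaning step that part 1 asks for (sentence tokenization, word tokenization, stopword removal) can be sketched roughly as follows. This is only an illustration: `clean_sentences`, `naive_sentences`, `naive_words` and `DEMO_STOPWORDS` are hypothetical names, and the regex splitters and the tiny stopword set are stand-ins so that the sketch runs without downloading any NLTK data.

```python
import re

# Tiny illustrative stopword set (an assumption for this sketch);
# in the notebook, set(stopwords.words('english')) would take its place.
DEMO_STOPWORDS = {'the', 'a', 'an', 'of', 'and', 'to', 'in', 'is', 'it', 'on'}

def naive_sentences(text):
    # Stand-in for sent_tokenize: split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def naive_words(sentence):
    # Stand-in for word_tokenize: runs of Unicode letters only.
    return re.findall(r'[^\W\d_]+', sentence)

def clean_sentences(text, split_sentences=naive_sentences,
                    split_words=naive_words, stop=DEMO_STOPWORDS):
    cleaned = []
    for sentence in split_sentences(text):
        words = [w.lower() for w in split_words(sentence)]
        # Keep alphabetic tokens that are not stopwords.
        kept = [w for w in words if w.isalpha() and w not in stop]
        cleaned.append(' '.join(kept))
    return cleaned

print(clean_sentences("The cat sat on the mat. It is a good day!"))
# → ['cat sat mat', 'good day']
```

With NLTK available, the same function could presumably be called as `clean_sentences(text, sent_tokenize, word_tokenize, set(stopwords.words('english')))`; the `isalpha` check then drops the punctuation tokens that `word_tokenize` emits.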

In [1]:
import requests
import nltk
import regex as re
import numpy as np
import spacy


from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

from snowballstemmer import TurkishStemmer
from bs4 import BeautifulSoup

from collections import Counter
from xmltodict import parse

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

## Modifying getSummary Function

In [134]:
def getSubjectGuardian(subject):
    with requests.get(f'https://www.theguardian.com/{subject}/rss') as link:
        raw = parse(link.text)
    return raw['rss']['channel']['item']

def getText(url):
    with requests.get(url) as link:
        raw = BeautifulSoup(link.content,'html.parser')
    return ' '.join([x.text for x in raw.find_all('p')])
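
The lookup `raw['rss']['channel']['item']` in `getSubjectGuardian` relies on the usual RSS layout: an `rss` root, one `channel`, and a list of `item` elements that each carry a `link`. The sketch below shows that layout on a made-up feed. It uses the standard library's `xml.etree.ElementTree` instead of `xmltodict` so that it runs offline; `SAMPLE_RSS` and `feed_items` are hypothetical names, and the URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# A made-up two-item feed with the same rss/channel/item nesting.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Demo feed</title>
    <item><title>First</title><link>https://example.com/1</link></item>
    <item><title>Second</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""

def feed_items(xml_text):
    # The root element is <rss>; items live under its <channel> child.
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall('./channel/item'):
        items.append({'title': item.findtext('title'),
                      'link': item.findtext('link')})
    return items

for entry in feed_items(SAMPLE_RSS):
    print(entry['title'], entry['link'])
```

Each returned dictionary has a `link` key, which is what the test cells pass to `getText`.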

In [135]:
def cleantext(text):
    # Split the text into sentences, then lowercase each sentence and
    # strip everything except letters and whitespace.
    sentences = sent_tokenize(text)
    return [re.sub(r'[^\p{Letter}\s]', '', sentence.lower()) for sentence in sentences]


In [136]:
def getMatrix(sentences):
    vectorizer = CountVectorizer()
    return vectorizer.fit_transform(sentences)

In [137]:
def getSummary(text,k):
    sentences = cleantext(text)
    matrix = getMatrix(sentences)
    projection = PCA(n_components=1)
    weights = projection.fit_transform(matrix.toarray())
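    # Index by the actual number of sentences; a fixed range(500) would silently truncate longer texts.
    res = list(zip(weights.transpose()[0], range(len(sentences)), sentences))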
    tmp = sorted(res,key=lambda x: x[0],reverse=True)[:k]
    return sorted(tmp, key=lambda x: x[1])
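
The last two lines of `getSummary` pick the `k` highest-weighted sentences and then put them back into document order. A minimal standalone sketch of that selection step is shown below; `top_k_in_order` is a hypothetical helper, and the weights are made-up numbers standing in for the PCA output.

```python
def top_k_in_order(weights, sentences, k):
    """Pick the k highest-weighted sentences, then restore document order."""
    # enumerate supplies each sentence's original position.
    ranked = [(w, i, s) for i, (w, s) in enumerate(zip(weights, sentences))]
    # Highest weights first, keep k of them...
    best = sorted(ranked, key=lambda item: item[0], reverse=True)[:k]
    # ...then sort the survivors by their position in the text.
    return sorted(best, key=lambda item: item[1])

print(top_k_in_order([0.1, 0.9, 0.5, 0.7], ['a', 'b', 'c', 'd'], 2))
# → [(0.9, 1, 'b'), (0.7, 3, 'd')]
```

The returned triples have the same shape as the test output of `getSummary`: weight, sentence index, sentence.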

### Testing the getSummary

In [138]:
economy = getSubjectGuardian('economy')
n = np.random.randint(5)
text = getText(economy[n]['link'])
getSummary(text,5)

[(1.8401787753513408,
  5,
  'the american has partnered with his fellow dodgers owner mark walter the swiss billionaire hansjörg wyss the british property developer jonathan goldstein and the us investment firm clearlake capital and they expect to be granted exclusivity'),
 (6.329292920675048,
  7,
  'there was confusion when ratcliffe the owner of the british petrochemicals company ineos revealed on friday that he had made his move to buy the european champions a fortnight after the deadline for final bids britains richest man offered to pay bn for chelsea and included a pledge to invest bn in the club over the next  years'),
 (4.279477888440978,
  15,
  'abramovich who has also vowed to write off bn in loans he has given to chelsea since  pledged to donate the net proceeds of the sale to all victims of the war in ukraine before he was sanctioned by the british government last month'),
 (2.8620350015444895,
  24,
  'bn is committed to the charitable trust to support victims of the wa

## New Newspaper

In [140]:
def getSubjectNytimes(subject):
    # Note: `subject` is not used here; this always reads the NYT home page feed.
    with requests.get('https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml') as link:
        raw = parse(link.text)
    return raw['rss']['channel']['item']

### Testing new newspaper

In [141]:
economy = getSubjectNytimes('economy')
n = np.random.randint(4)
economy[n]["link"]


'https://www.nytimes.com/2022/04/29/us/politics/ukraine-rape-war-crimes.html'

In [None]:
text = getText(economy[n]['link'])

getSummary(text,5)

### Q2

Write a function that returns all named entities (proper names, country names, and corporation names only) from a URL. The function should take the URL as its input and must return the list of named entities found at that URL. Test your code on random articles from the Guardian. Do not use the NLTK NER that I demonstrated during the lecture; use spaCy's NER instead.

In [142]:
def getSubjectGuardian(subject):
    with requests.get(f'https://www.theguardian.com/{subject}/rss') as link:
        raw = parse(link.text)
    return raw['rss']['channel']['item']


def getText(url):
    with requests.get(url) as link:
        raw = BeautifulSoup(link.content,'html.parser')
    return ' '.join([x.text for x in raw.find_all('p')])

In [143]:
NER = spacy.load("en_core_web_sm")

### Creating the Function

In [144]:
def named_entities(url):
    # Keep only geopolitical entities (GPE), organisations (ORG) and people (PERSON).
    wanted = {'GPE', 'ORG', 'PERSON'}
    res = NER(getText(url))
    return [(ent.text, spacy.explain(ent.label_)) for ent in res.ents if ent.label_ in wanted]
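
A list built this way contains one pair per mention, so the same entity can appear many times. If a list of distinct entities is preferred, one possible post-processing step is sketched below; `unique_entities` is a hypothetical helper and the sample pairs are made up.

```python
def unique_entities(pairs):
    # dict.fromkeys keeps the first occurrence of each (text, label) pair
    # and preserves the order in which the pairs were first seen.
    return list(dict.fromkeys(pairs))

sample = [('Shell', 'ORG'), ('UK', 'GPE'), ('Shell', 'ORG'),
          ('Rishi Sunak', 'PERSON'), ('UK', 'GPE')]
print(unique_entities(sample))
# → [('Shell', 'ORG'), ('UK', 'GPE'), ('Rishi Sunak', 'PERSON')]
```

Because the key is the whole pair, the same text tagged with two different labels is kept twice, which makes labelling inconsistencies easy to spot.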

### Testing the Function

In [145]:
economy = getSubjectGuardian('economy')
n = np.random.randint(4)
named_entities(economy[n]['link'])

[('BP', 'Countries, cities, states'),
 ('Shell', 'Companies, agencies, institutions, etc.'),
 ('BP', 'Companies, agencies, institutions, etc.'),
 ('Kwasi Kwarteng', 'People, including fictional'),
 ('UK', 'Countries, cities, states'),
 ('Times', 'Companies, agencies, institutions, etc.'),
 ('Rishi Sunak', 'People, including fictional'),
 ('Mumsnet', 'Companies, agencies, institutions, etc.'),
 ('Britain', 'Countries, cities, states'),
 ('UK', 'Countries, cities, states'),
 ('OEUK', 'Companies, agencies, institutions, etc.'),
 ('UK', 'Countries, cities, states'),
 ('OEUK', 'Companies, agencies, institutions, etc.'),
 ('UK', 'Countries, cities, states'),
 ('Shell', 'Companies, agencies, institutions, etc.'),
 ('UK', 'Countries, cities, states'),
 ('BP', 'Countries, cities, states'),
 ('UK', 'Countries, cities, states'),
 ('BP', 'Companies, agencies, institutions, etc.'),
 ('Shell', 'Companies, agencies, institutions, etc.'),
 ('Harbour Energy', 'Companies, agencies, institutions, etc.'),

### Q3

1. Write a function that returns the most positive and the most negative sentences from a text. The function must take the text as the input and must return a 2-tuple: the first element as the most positive and the second as the most negative sentence with their polarity scores.

2. Test your function on random articles from the Guardian.
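
The 2-tuple shape that part 1 describes can be sketched as below. Several things here are assumptions rather than part of the assignment: the name `extreme_sentences`, passing the scorer in as an argument, and ranking by the `compound` value. `toy_score` is a made-up scorer so that the sketch runs without the VADER lexicon; in the notebook, `analyzer.polarity_scores` would presumably be passed instead.

```python
def extreme_sentences(sentences, score):
    """Return ((most positive sentence, scores), (most negative sentence, scores))."""
    # Score every sentence exactly once.
    scored = [(s, score(s)) for s in sentences]
    if not scored:
        return None, None
    most_positive = max(scored, key=lambda pair: pair[1]['compound'])
    most_negative = min(scored, key=lambda pair: pair[1]['compound'])
    return most_positive, most_negative

# Made-up scores standing in for a real sentiment analyzer.
TOY_SCORES = {'Great news.': 0.6, 'It rained.': 0.0, 'Terrible loss.': -0.7}

def toy_score(sentence):
    return {'compound': TOY_SCORES[sentence]}

positive, negative = extreme_sentences(list(TOY_SCORES), toy_score)
print(positive)  # → ('Great news.', {'compound': 0.6})
print(negative)  # → ('Terrible loss.', {'compound': -0.7})
```

Ranking by the `pos` and `neg` proportions instead is equally possible; it only requires changing the two `key` functions.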

In [146]:
analyzer = SentimentIntensityAnalyzer()

### Creating the Function

In [147]:
def most_p_and_n_Scores(text):
    sentences = sent_tokenize(text)

    m_negative_score = 0
    m_positive_score = 0
    # Default to None so that an empty text does not raise UnboundLocalError.
    m_positive_sentence = scores_of_pos = None
    m_negative_sentence = scores_of_neg = None

    for x in sentences:
        # Score each sentence once and reuse the result.
        scores = analyzer.polarity_scores(x)
        if scores["neg"] >= m_negative_score:
            m_negative_score = scores["neg"]
            m_negative_sentence = x
            scores_of_neg = scores
        if scores["pos"] >= m_positive_score:
            m_positive_score = scores["pos"]
            m_positive_sentence = x
            scores_of_pos = scores

    return (m_positive_sentence, scores_of_pos, m_negative_sentence, scores_of_neg)


### Testing the function

In [154]:
economy = getSubjectGuardian('economy')
n = np.random.randint(4)
text = getText(economy[n]['link'])

In [155]:
most_p_and_n_Scores(text)

('A similar dynamic is happening now with lorries.',
 {'neg': 0.0, 'neu': 0.698, 'pos': 0.302, 'compound': 0.3818},
 'The problems are range and cost.',
 {'neg': 0.351, 'neu': 0.649, 'pos': 0.0, 'compound': -0.4019})