# Interactive news Analyzer
## Advanced Data Science Capstone project
### by Anton Dziavitsyn 2019

## Part 2: Feature extraction (using the nltk package)

## Stage 1: Article text preprocessing

**SETUP stage**  
*Create a Spark session connected to the Cloudant database*

In [1]:
# Cloudant credentials
cloudant_credentials = {
    "host": "REMOVED ON PUBLISHING",
    "custom_url": "REMOVED ON PUBLISHING",
    "username": "REMOVED ON PUBLISHING",
    "password": "REMOVED ON PUBLISHING"
}

# Spark session with attached cloudant storage
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Interactive news Analyzer")\
    .config("cloudant.host",cloudant_credentials['custom_url'].split('@')[1])\
    .config("cloudant.username", cloudant_credentials['username'])\
    .config("cloudant.password",cloudant_credentials['password'])\
    .config("createDBOnSave","true")\
    .config("jsonstore.rdd.partitions", 1)\
    .getOrCreate()
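
The `cloudant.host` setting is taken from the part of `custom_url` after the `@` sign, which assumes a URL of the form `https://user:password@host`. A minimal sketch with a made-up URL (the account name and credentials below are hypothetical) shows that split next to the standard-library `urlsplit`, which may be a more robust way to get the same value:

```python
from urllib.parse import urlsplit

# Hypothetical credentials URL, for illustration only
custom_url = "https://someuser:somepassword@example-account.cloudant.com"

# What the notebook does: take the piece after the '@'
host_by_split = custom_url.split('@')[1]

# Alternative: let the URL parser pick out the host name
host_by_parser = urlsplit(custom_url).hostname

print(host_by_split)
print(host_by_parser)
```

Both lines print the same host here. If the password itself contained an `@`, the plain split would likely return the wrong piece, while the parser should still find the host.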

**Let us take a look at the article text**

In [2]:
df_articles = spark.read.load("clean_articles", "com.cloudant.spark")
df_articles.cache()

# Show info about data
print("Articles: {0}".format(df_articles.count()))
print("First article record:")
df_articles[['article']].limit(1).collect()[0][0]

Articles: 852
First article record:


'Pakistan court upholds Aasia Bibi\'s blasphemy acquittal | News | Al Jazeera TRENDING: Venezuela Syria\'s War Brexit Israel Turkey Pakistan\'s Supreme Court dismisses review petition against the acquittal of Christian woman accused of blasphemy. 29 Jan 2019 10:03 GMT Pakistan\'s Supreme Court has upheld the acquittal of a Christian woman charged with blasphemy, standing by its earlier verdict that sparked days of protests, death threats and nationwide chaos.Â The country\'s top court in the capital, Islamabad, on Tuesday dismissed the review petition against Aasia Bibi, who spent eight years on death row for blasphemy before being released last October. The three-judge panel said arguments of the lawyer acting on behalf of the petitioners did not satisfy the judges. She is now free to leave the country. Unconfirmed Pakistani media reports claim her two daughters have already gone to Canada, where they have already been granted asylum. BibiÂ\xa0is currently under guard at a secret loca

**Text cleaning/normalization for feature extraction**  
**The text should be preprocessed:**  
+ Remove special symbols and accented characters (normalize the text)
+ Expand contractions to standardize the style of the text ("ain't" => "is not", "aren't" => "are not", etc.)
+ Lemmatize the text: lemmas are enough for semantic/sentiment analysis, and they reduce the variation in word forms
+ Remove stopwords
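
The first item in the list can be tried on a single sentence with the standard library alone. This is a minimal sketch; `strip_accents`, `strip_special`, and the sample sentence are illustrative names and data, not part of the notebook:

```python
import re
import unicodedata

def strip_accents(text):
    # NFKD splits an accented letter into base letter + combining mark;
    # encoding to ASCII with 'ignore' then drops the combining mark
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def strip_special(text):
    # Replace anything that is not a letter, digit, or whitespace with a space
    return re.sub(r"[^a-zA-Z0-9\s]", " ", text)

sample = "Café prices rose 5%, officials said."
normalized = strip_special(strip_accents(sample))
print(normalized.split())
```

Accents are removed before special characters in this sketch, so that "é" becomes "e" instead of being replaced by a space.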

In [3]:
# Install packages
!python -m nltk.downloader all

In [4]:
# imports
import unicodedata
import re
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
from nltk import ne_chunk, pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tree import Tree
from pyspark.sql.functions import udf
from pyspark.sql.types import *

# PREPROCESSING function definitions

# Removing accented characters
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

# Removing Special Characters
def remove_special_characters(text):
    pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, ' ', text)
    return text

# Expanding Contractions
CONTRACTION_MAP = {"ain't": "is not","aren't": "are not","can't": "cannot","can't've": "cannot have","'cause": "because",
    "could've": "could have","couldn't": "could not","couldn't've": "could not have","didn't": "did not","doesn't": "does not",
    "don't": "do not","hadn't": "had not","hadn't've": "had not have","hasn't": "has not","haven't": "have not",
    "he'd": "he would","he'd've": "he would have","he'll": "he will","he'll've": "he will have","he's": "he is",
    "how'd": "how did","how'd'y": "how do you","how'll": "how will","how's": "how is",
    "I'd": "I would","I'd've": "I would have","I'll": "I will","I'll've": "I will have","I'm": "I am","I've": "I have",
    "i'd": "i would","i'd've": "i would have","i'll": "i will","i'll've": "i will have","i'm": "i am","i've": "i have",
    "isn't": "is not","it'd": "it would","it'd've": "it would have","it'll": "it will","it'll've": "it will have","it's": "it is",
    "let's": "let us","ma'am": "madam","mayn't": "may not","might've": "might have","mightn't": "might not",
    "mightn't've": "might not have","must've": "must have","mustn't": "must not","mustn't've": "must not have",
    "needn't": "need not","needn't've": "need not have","o'clock": "of the clock","oughtn't": "ought not",
    "oughtn't've": "ought not have","shan't": "shall not","sha'n't": "shall not","shan't've": "shall not have",
    "she'd": "she would","she'd've": "she would have","she'll": "she will","she'll've": "she will have","she's": "she is",
    "should've": "should have","shouldn't": "should not","shouldn't've": "should not have","so've": "so have","so's": "so as",
    "that'd": "that would","that'd've": "that would have","that's": "that is","there'd": "there would",
    "there'd've": "there would have","there's": "there is","they'd": "they would","they'd've": "they would have",
    "they'll": "they will","they'll've": "they will have","they're": "they are","they've": "they have","to've": "to have",
    "wasn't": "was not","we'd": "we would","we'd've": "we would have","we'll": "we will","we'll've": "we will have",
    "we're": "we are","we've": "we have","weren't": "were not","what'll": "what will","what'll've": "what will have",
    "what're": "what are","what's": "what is","what've": "what have","when's": "when is","when've": "when have",
    "where'd": "where did","where's": "where is","where've": "where have","who'll": "who will","who'll've": "who will have",
    "who's": "who is","who've": "who have","why's": "why is","why've": "why have","will've": "will have",
    "won't": "will not","won't've": "will not have","would've": "would have","wouldn't": "would not",
    "wouldn't've": "would not have","y'all": "you all","y'all'd": "you all would","y'all'd've": "you all would have",
    "y'all're": "you all are","y'all've": "you all have","you'd": "you would","you'd've": "you would have",
    "you'll": "you will","you'll've": "you will have","you're": "you are","you've": "you have"}
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

# Strip punctuation and Lemmatization
def lemmatize_text(text):
    text = ' '.join(wordnet_lemmatizer.lemmatize(word) for word in word_tokenize(text))
    return text

# Removing Stopwords
def remove_stopwords(text):
    stopword_list = nltk.corpus.stopwords.words('english')
    stopword_list.remove('no')
    stopword_list.remove('not')
    
    tokens = word_tokenize(text)
    filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

# Stemming
def simple_stemmer(text):
    text = ' '.join([stemmer.stem(word) for word in word_tokenize(text)])
    return text

# MAIN preprocess function and UDF
def text_preprocess(text):
    result = ""
    for sentence in sent_tokenize(text):
        sentence = remove_special_characters(sentence)
        sentence = remove_accented_chars(sentence)
        sentence = expand_contractions(sentence)
        sentence = lemmatize_text(sentence)
        sentence = remove_stopwords(sentence)
        #sentence = simple_stemmer(sentence)
        result = "{} {}".format(result, sentence)
    return result

udfTextPreprocess = udf(text_preprocess, StringType())

df_articles = df_articles.withColumn("cleaned_article", udfTextPreprocess("article"))
df_articles.cache()

print('Original text:')
print(df_articles[['article', 'cleaned_article']].limit(1).collect()[0][0])
print('Standardized text:')
print(df_articles[['article', 'cleaned_article']].limit(1).collect()[0][1])

Original text:
Pakistan court upholds Aasia Bibi's blasphemy acquittal | News | Al Jazeera TRENDING: Venezuela Syria's War Brexit Israel Turkey Pakistan's Supreme Court dismisses review petition against the acquittal of Christian woman accused of blasphemy. 29 Jan 2019 10:03 GMT Pakistan's Supreme Court has upheld the acquittal of a Christian woman charged with blasphemy, standing by its earlier verdict that sparked days of protests, death threats and nationwide chaos.Â The country's top court in the capital, Islamabad, on Tuesday dismissed the review petition against Aasia Bibi, who spent eight years on death row for blasphemy before being released last October. The three-judge panel said arguments of the lawyer acting on behalf of the petitioners did not satisfy the judges. She is now free to leave the country. Unconfirmed Pakistani media reports claim her two daughters have already gone to Canada, where they have already been granted asylum. BibiÂ is currently under guard at a secre
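
The contraction step works by matching every key of the map with one regular expression and substituting the expansion. A minimal sketch with a three-entry map (`SMALL_MAP`, `expand`, and the sample sentence are illustrative names, not part of the notebook) also suggests that the expansion needs the apostrophes to still be in the text when it runs:

```python
import re

# Tiny illustrative subset of a contraction map (not the full CONTRACTION_MAP)
SMALL_MAP = {"don't": "do not", "it's": "it is", "won't": "will not"}

# One alternation pattern; IGNORECASE lets "Don't" match the "don't" key
pattern = re.compile("|".join(re.escape(k) for k in SMALL_MAP), flags=re.IGNORECASE)

def expand(text):
    def replace(match):
        found = match.group(0)
        expanded = SMALL_MAP[found.lower()]
        # Keep the capitalization of the first character of the original word
        return found[0] + expanded[1:]
    return pattern.sub(replace, text)

raw = "Don't worry, it's fine."
expanded_raw = expand(raw)

# Stripping punctuation first removes the apostrophes the pattern relies on
stripped = re.sub(r"[^a-zA-Z0-9\s]", " ", raw)
expanded_stripped = expand(stripped)

print(expanded_raw)
print(expanded_stripped)
```

In this sketch the second call leaves the text unchanged, because "Don t" no longer matches any key.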

## Stage 2: Features extraction

**Named Entity Extraction**  
Now we extract named entities to define the topics of each text (in news these are names of people, towns, organizations, etc.)

In [5]:
# Extract Named entities
def extract_entities(text):
    entities = []
    for sentence in sent_tokenize(text):
        chunks = ne_chunk(pos_tag(word_tokenize(sentence)))
        chunks = [chunk for chunk in chunks if type(chunk) == Tree]
        chunks = [" ".join([l[0] for l in ent.leaves()]) for ent in chunks]
        entities.extend(chunks)
    return entities

udfExtractEntities = udf(extract_entities, ArrayType(StringType()))

# Let us try to extract named entities from the original text and the cleaned text

print(extract_entities(df_articles[['article']].limit(1).collect()[0][0]))
print('======================================================')
print(extract_entities(df_articles[['cleaned_article']].limit(1).collect()[0][0]))

['Pakistan', 'Aasia Bibi', 'Venezuela Syria', 'Israel', 'Turkey Pakistan', 'Supreme Court', 'Christian', 'Pakistan', 'Supreme Court', 'Christian', 'Islamabad', 'Aasia Bibi', 'Pakistani', 'Canada', 'BibiÂ', 'Pakistan', 'Muslim', 'Christian', 'Punjab', 'Pakistan', 'Pakistan', 'Pakistan', 'TLP', 'Bibi', 'TLP', 'Islamabad', 'Bibi', 'Shariah', 'Hafiz Ehtisham Ahmed', 'Red Mosque', 'Islamabad', 'AFP', 'Muslims', 'Pakistan', 'Blasphemy', 'Pakistan', 'Islam', 'Muslims', 'Muslims', 'Christians']
['Pakistan', 'Aasia Bibi', 'News Al Jazeera', 'Venezuela Syria War Brexit Israel Turkey Pakistan Supreme Court', 'Christian', 'Supreme Court', 'Christian', 'Islamabad', 'Aasia Bibi', 'Unconfirmed Pakistani', 'Canada', 'Pakistan', 'Muslim', 'Christian', 'Punjab', 'Pakistan', 'Pakistan', 'Labbaik Pakistan TLP', 'Bibi', 'TLP', 'Islamabad', 'Bibi', 'Shariah Hafiz Ehtisham Ahmed', 'Red Mosque Islamabad', 'AFP', 'Muslims', 'Pakistan', 'Blasphemy', 'Pakistan', 'Islam', 'Muslims', 'Muslims']
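
Since the entities are meant to stand in for the topics of an article, one simple way to read such a list is to count how often each entity occurs. The sketch below uses a few entries copied from the first list printed above; the counting step itself is an assumption about how the list could be summarized, not something the notebook does:

```python
from collections import Counter

# A few entities copied from the first list printed above
entities = ['Pakistan', 'Aasia Bibi', 'Supreme Court', 'Christian',
            'Pakistan', 'Supreme Court', 'Islamabad', 'Aasia Bibi', 'Pakistan']

counts = Counter(entities)

# The most frequent entities hint at what the article is about
top_entities = counts.most_common(3)
print(top_entities)
```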


In [6]:
# We will use the original text, because it gives better extraction quality

# Let us evaluate and improve entity extraction
# Another test: run the extraction again on each entity from our list

for word in extract_entities(df_articles[['article']].limit(1).collect()[0][0]):
    print("{0} ==== {1}".format(word, extract_entities(word)))

Pakistan ==== ['Pakistan']
Aasia Bibi ==== ['Aasia', 'Bibi']
Venezuela Syria ==== ['Venezuela', 'Syria']
Israel ==== ['Israel']
Turkey Pakistan ==== ['Turkey', 'Pakistan']
Supreme Court ==== ['Supreme Court']
Christian ==== ['Christian']
Pakistan ==== ['Pakistan']
Supreme Court ==== ['Supreme Court']
Christian ==== ['Christian']
Islamabad ==== ['Islamabad']
Aasia Bibi ==== ['Aasia', 'Bibi']
Pakistani ==== ['Pakistani']
Canada ==== ['Canada']
BibiÂ ==== ['BibiÂ']
Pakistan ==== ['Pakistan']
Muslim ==== ['Muslim']
Christian ==== ['Christian']
Punjab ==== ['Punjab']
Pakistan ==== ['Pakistan']
Pakistan ==== ['Pakistan']
Pakistan ==== ['Pakistan']
TLP ==== ['TLP']
Bibi ==== ['Bibi']
TLP ==== ['TLP']
Islamabad ==== ['Islamabad']
Bibi ==== ['Bibi']
Shariah ==== ['Shariah']
Hafiz Ehtisham Ahmed ==== ['Hafiz', 'Ehtisham Ahmed']
Red Mosque ==== []
Islamabad ==== ['Islamabad']
AFP ==== []
Muslims ==== []
Pakistan ==== ['Pakistan']
Blasphemy ==== ['Blasphemy']
Pakistan ==== ['Pakistan']
Islam ==== 

In [7]:
# Not good - we got many 'false positive' results like 'Fuels', 'Federal No', etc.
# We need to remove them. Let us rewrite the function

# Extract Named entities (improved)
def extract_entities(text):
    entities = []
    for sentence in sent_tokenize(text):
        chunks = ne_chunk(pos_tag(word_tokenize(sentence)))
        chunks = [chunk for chunk in chunks if type(chunk) == Tree]
        chunks = [" ".join([l[0] for l in ent.leaves()]) for ent in chunks]
        
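        # Self-check: keep a chunk only if the entities found in the chunk
        # on its own join back to exactly the same string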
        for chunk in chunks:
            test_chunk = ne_chunk(pos_tag(word_tokenize(chunk)))
            test_chunk = [chunk for chunk in test_chunk if type(chunk) == Tree]
            test_chunk = [" ".join([l[0] for l in ent.leaves()]) for ent in test_chunk]
            if chunk == " ".join(test_chunk):
                entities.append(chunk)
    return entities

udfExtractEntities = udf(extract_entities, ArrayType(StringType()))

# Repeat test
for word in extract_entities(df_articles[['article']].limit(1).collect()[0][0]):
    print("{0} ==== {1}".format(word, extract_entities(word)))

Pakistan ==== ['Pakistan']
Aasia Bibi ==== ['Aasia', 'Bibi']
Venezuela Syria ==== ['Venezuela', 'Syria']
Israel ==== ['Israel']
Turkey Pakistan ==== ['Turkey', 'Pakistan']
Supreme Court ==== ['Supreme Court']
Christian ==== ['Christian']
Pakistan ==== ['Pakistan']
Supreme Court ==== ['Supreme Court']
Christian ==== ['Christian']
Islamabad ==== ['Islamabad']
Aasia Bibi ==== ['Aasia', 'Bibi']
Pakistani ==== ['Pakistani']
Canada ==== ['Canada']
BibiÂ ==== ['BibiÂ']
Pakistan ==== ['Pakistan']
Muslim ==== ['Muslim']
Christian ==== ['Christian']
Punjab ==== ['Punjab']
Pakistan ==== ['Pakistan']
Pakistan ==== ['Pakistan']
Pakistan ==== ['Pakistan']
TLP ==== ['TLP']
Bibi ==== ['Bibi']
TLP ==== ['TLP']
Islamabad ==== ['Islamabad']
Bibi ==== ['Bibi']
Shariah ==== ['Shariah']
Hafiz Ehtisham Ahmed ==== ['Hafiz', 'Ehtisham Ahmed']
Islamabad ==== ['Islamabad']
Pakistan ==== ['Pakistan']
Blasphemy ==== ['Blasphemy']
Pakistan ==== ['Pakistan']
Islam ==== ['Islam']
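
The filtering rule in the improved function can be looked at in isolation from NLTK. In this sketch `filter_consistent` and `fake_extract` are hypothetical names, and the lookup table stands in for the NER call, filled with results from the first test run above:

```python
def filter_consistent(chunks, extract):
    """Keep a chunk only if extracting from it alone reproduces it."""
    kept = []
    for chunk in chunks:
        if chunk == " ".join(extract(chunk)):
            kept.append(chunk)
    return kept

# Stand-in for the NER call, using results recorded in the first test run
recorded = {
    'Aasia Bibi': ['Aasia', 'Bibi'],
    'Supreme Court': ['Supreme Court'],
    'Red Mosque': [],
    'AFP': [],
}

def fake_extract(chunk):
    return recorded[chunk]

kept = filter_consistent(list(recorded), fake_extract)
print(kept)
```

'Aasia Bibi' survives because its two parts join back to the original string, while 'Red Mosque' and 'AFP' are dropped because nothing is recognized when they are tagged on their own.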


In [8]:
# LOOKS GREAT!!!
# Let us extract all entities from news

df_articles = df_articles.withColumn("entities", udfExtractEntities("article"))
df_articles.cache()

print('Show 10 entities records:')
df_articles[['entities']].show(10, False)

Show 10 entities records:

**Sentiment (emotion) extraction**  
Now we will extract a sentiment score for every article
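
The `afinn` package used below is lexicon based: each known word carries a valence, and the score of a text adds up the valences of the words found in it. A minimal sketch of that idea, assuming a hypothetical four-word lexicon (the valences are illustrative, not AFINN's actual values) and made-up sample sentences:

```python
import re

# Hypothetical mini-lexicon; valences are illustrative, not AFINN's real values
LEXICON = {"protests": -2, "threats": -2, "chaos": -2, "free": 1}

def lexicon_score(text):
    # Lowercase, split into alphabetic words, and sum the valence of known words
    words = re.findall(r"[a-z]+", text.lower())
    return float(sum(LEXICON.get(word, 0) for word in words))

original = "Days of protests, death threats and nationwide chaos."
cleaned = "day protest death threat nationwide chaos"

score_original = lexicon_score(original)
score_cleaned = lexicon_score(cleaned)

# Average the scores of the two versions of the same text
average = (score_original + score_cleaned) / 2
print(score_original, score_cleaned, average)
```

The two versions score differently here because the lemmatized words "protest" and "threat" are not in the toy lexicon, which is one reason a score from the original text and a score from the cleaned text can disagree.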

In [9]:
# install package
!pip install afinn

ipywidgets 6.0.0 has requirement widgetsnbextension~=2.0.0, but you'll have widgetsnbextension 3.4.2 which is incompatible.
bleach 2.0.0 has requirement html5lib>=0.99999999, but you'll have html5lib 0.9999999 which is incompatible.
tensorflow-tensorboard 1.5.1 has requirement bleach==1.5.0, but you'll have bleach 2.0.0 which is incompatible.
tensorflow 1.3.0 has requirement tensorflow-tensorboard<0.2.0,>=0.1.0, but you'll have tensorflow-tensorboard 1.5.1 which is incompatible.


In [10]:
# calculate sentiment (emotion) score for articles

from afinn import Afinn
afinn = Afinn()

# Let us score both the original text and the cleaned text and take the average (this should reduce errors in meaning)
udfExtractEmotions = udf(lambda text: afinn.score(text), FloatType())

df_articles = df_articles.withColumn("emotion", (udfExtractEmotions("article") + udfExtractEmotions("cleaned_article")) / 2)
df_articles.cache()

print('Show 10 emotion records:')
df_articles[['emotion']].show(10, False)

Show 10 emotion records:
+-------+
|emotion|
+-------+
|-35.0  |
|-13.5  |
|28.0   |
|27.0   |
|32.0   |
|15.0   |
|-2.5   |
|49.5   |
|9.0    |
|16.5   |
+-------+
only showing top 10 rows



In [11]:
# Create features dataset and store it in Cloudant
from pyspark.sql import functions as F

ds_features = df_articles[['source', 'title', 'url', 'publishedAt', 'entities', 'emotion']].filter(F.size(F.col("entities")) > 0)

# Write features data to Cloudant DB
ds_features.write.format("com.cloudant.spark").save("articles_features")

**Now we have: Features data for analysis**

## END of Part 2: Features extraction
### by Anton Dziavitsyn 2019