Term Assignment - Introduction to Computational Social Science
=============================

--------


**Student** : Nicolas Michel 48-179727


--------------

This Jupyter Notebook contains my term assignment. I decided to work on the third subject given in the assignment notice.

I chose to compare two categories of articles : **science** vs **health**

In this notebook, I first scrape several news websites to collect my dataset of articles.

Then I parse the collected text into a 'bag of words' for each article, which is the input format my algorithms need.

Finally, I compare several algorithms on a classification task over this dataset.

Note that I may have misunderstood the subject for this exercise: my dataset is composed of bags of words built from the *entire* text body of each article (not the summary). I hope this is not a problem.


----------


Scraping
====

------------

In [1]:
# urls for news feed scraping
rss_urls = {
    'science': {
        'bbc':"http://feeds.bbci.co.uk/news/science_and_environment/rss.xml",
        'reuters':"http://feeds.reuters.com/reuters/scienceNews",
        'cnn':"http://rss.cnn.com/rss/edition_space.rss",
        'guardian':"https://www.theguardian.com/science/rss",
        'dm':"http://www.dailymail.co.uk/sciencetech/index.rss",
        'toi':"https://timesofindia.indiatimes.com/rssfeeds/-2128672765.cms"
        },
    'health': {
        'bbc':"http://feeds.bbci.co.uk/news/health/rss.xml",
        'reuters':"http://feeds.reuters.com/reuters/healthNews",
        'guardian':"https://www.theguardian.com/society/health/rss",
        'dm':"http://www.dailymail.co.uk/health/index.rss",
        'toi':"https://timesofindia.indiatimes.com/rssfeeds/3908999.cms",
        'h1': "http://www.health.com/news/feed",
        'h2': "http://www.health.com/nutrition/feed",
        'h3': "http://www.health.com/food/feed",
        'h4': "http://www.health.com/healthday/feed",
        'h5': "http://www.health.com/mind-body/feed",
        'h6': "http://www.health.com/weight-loss/feed",
        'h7': "http://www.health.com/style/feed"
    }
}

Here we check that the feeds provide a large enough number of articles in each category.

In [2]:
import feedparser

for cat in rss_urls.keys():
    s = 0
    for key in rss_urls[cat].keys():
        url = rss_urls[cat][key]
        parsed = feedparser.parse(url)
        entries = parsed.entries
        s += len(entries)
    print('Number of articles for category {} is {}'.format(cat, s))

Number of articles for category science is 420
Number of articles for category health is 358


The following two functions extract the main text body of an article and turn it into a bag of words.

In [3]:
from bs4 import BeautifulSoup
import requests

def get_main_body_text(entry, news_website):
    link = entry.link
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    
    body_text = ""
    if (news_website == 'bbc'):
        tags = soup.find('body').find_all('p', {'class':'', 'style':''})
        for i in range(len(tags)-1):
            tag_text = tags[i].get_text()
            # Some articles have twitter suggestion at the end of main body
            if not (i==len(tags)-2 and ('Twitter' in tag_text)):
                body_text += " "+tag_text
    
    elif (news_website == 'reuters'):
        tags = soup.find('body').find_all('p', {'class':'', 'style':'', 'id':''})
        for i in range(len(tags)-2):
             body_text += " "+tags[i].get_text()
    
    elif (news_website == 'cnn'):
        tags = soup.find('body').find_all('p', {'class':'', 'style':'', 'id':''})
        for i in range(len(tags)):
            body_text += " "+tags[i].get_text()
    
    elif (news_website =='dm'):
        tags0 = soup.find_all('div', {'itemprop':'articleBody'})
        for tag0 in tags0:
            tags1 = tag0.find_all('p')
            for tag1 in tags1:
                body_text += " "+tag1.get_text()
    
    elif(news_website == 'toi'):
        tags = soup.find('body').find_all('div', {'class':'Normal', 'style':''})
        for i in range(len(tags)):
            body_text += " "+tags[i].get_text()
    
    else:
        tags = soup.find('body').find_all('p', {'class':'', 'style':''})
        for i in range(len(tags)):
            body_text += " "+tags[i].get_text()

    return body_text
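
To try the paragraph filtering without network access, here is a minimal sketch of the same idea using only the standard library. It is not the code used in this notebook: `ParagraphCollector` and `sample_html` are hypothetical, and treating "no class attribute" as the filter is my assumption about what the BeautifulSoup queries above are meant to select.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of <p> tags that carry no class attribute."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a kept <p>

    def handle_starttag(self, tag, attrs):
        # Keep only paragraphs without a (non-empty) class attribute
        if tag == "p" and not dict(attrs).get("class"):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None


# Made-up page standing in for a downloaded article
sample_html = """
<html><body>
<p class="byline">By a reporter</p>
<p>Plastic was found on many coral reefs.</p>
<p>The survey covered <b>several</b> regions.</p>
</body></html>
"""

collector = ParagraphCollector()
collector.feed(sample_html)
body_text = " ".join(collector.paragraphs)
print(body_text)
```

The byline is skipped because it has a class, while text inside nested tags such as `<b>` is kept as part of its paragraph.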
    

*text_to_bag()* transforms a raw string text into a bag of words.

Package unidecode is needed to remove the accents, just in case.

Run

    conda install unidecode
    
to install the package.

In [4]:
import unidecode

def text_to_bag(body_text, news_website):
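    # Keep the segment that follows the first occurrence of the site marker (split[1])
    if (news_website == 'reuters'):
        split = body_text.split('Reuters')
        if not (len(split) == 1):  # guard against articles without the marker
            body_text = split[1]
    elif (news_website == 'cnn'):
        split = body_text.split('CNN')
        if not (len(split) == 1):
            body_text = split[1]
    # `news_website == 'h1' or 'h2' ...` is always true, so test membership instead
    elif news_website in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7'):
        split = body_text.split('HealthDay News')
        if not (len(split) == 1):
            body_text = split[1]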
        
    body_text = body_text.replace(',', '')\
    .replace('.', '')\
    .replace('?', '')\
    .replace('!', '')\
    .replace('&', '')\
    .replace('$', '')\
    .replace(':', '')\
    .replace(';', '')\
    .replace('\'s', '')\
    .replace('"', '')\
    .replace(')', '')\
    .replace('(', '')\
    .replace('[', '')\
    .replace(']', '')\
    .replace('{', '')\
    .replace('}', '')\
    .replace('%', '')\
    .replace('/', '')
    body_text = unidecode.unidecode(body_text) # remove accents
    body_text = body_text.replace('\n', '')\
    .replace('\\', '')\
    .replace('-', '')\
    .replace("'", "")\
    .strip()\
    .lower()

    bag = body_text.split(' ')

    # Strip each token, then drop the empty strings left by repeated spaces
    bag = [word.strip() for word in bag]
    bag = [word for word in bag if word]

    if (news_website == 'toi' and len(bag) != 0):
        #specific case for new york and new delhi
        if (bag[0] == 'new'):
            bag.pop(0)
            bag.pop(0)
        else:
            bag.pop(0)
            
    return bag
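
The long chain of `replace()` calls can probably be condensed with a regular expression. The sketch below is only an approximation of `text_to_bag()`, not a drop-in replacement: `simple_text_to_bag` is a hypothetical name, the standard-library `unicodedata` module stands in for `unidecode`, the site-specific rules are left out, and newlines are treated as word separators instead of being deleted.

```python
import re
import unicodedata


def simple_text_to_bag(body_text):
    # Drop accents: decompose characters, then discard the non-ASCII combining marks
    decomposed = unicodedata.normalize("NFKD", body_text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Remove the possessive 's, then every character that is not
    # a lowercase letter, a digit or whitespace
    ascii_text = re.sub(r"'s\b", "", ascii_text.lower())
    ascii_text = re.sub(r"[^a-z0-9\s]", "", ascii_text)
    # split() without argument splits on any whitespace and drops empty strings
    return ascii_text.split()


bag = simple_text_to_bag("The Asia-Pacific region's reefs: 11 billion items, café!")
print(bag)
# ['the', 'asiapacific', 'region', 'reefs', '11', 'billion', 'items', 'cafe']
```

As in the original function, hyphenated words are merged ("asiapacific") because the hyphen is simply deleted.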

Stack each element of the dataset in a dataframe.
The code below takes at least **15 min** to run.

In [5]:
import pandas as pd
import time
import feedparser

data = []
for cat in rss_urls.keys():
    for site in rss_urls[cat].keys():
        rss = rss_urls[cat][site]
        parsed = feedparser.parse(rss)
        entries = parsed.entries

        for entry in entries:
            body_text = get_main_body_text(entry=entry, news_website=site)
            bag = text_to_bag(body_text=body_text, news_website=site)
            row = [site, entry.link, bag, cat]
            data.append(row)
            time.sleep(1)

feeds_df = pd.DataFrame(data, columns=['Website', 'Article_url', 'Bag', 'Category'])
feeds_df.head(10)

Unnamed: 0,Website,Article_url,Bag,Category
0,bbc,http://www.bbc.co.uk/news/science-environment-...,"[more, than, 11, billion, items, of, plastic, ...",science
1,bbc,http://www.bbc.co.uk/news/science-environment-...,"[telemetry, from, the, vehicle, was, lost, abo...",science
2,bbc,http://www.bbc.co.uk/news/science-environment-...,"[new, dating, of, fossils, from, israel, indic...",science
3,bbc,http://www.bbc.co.uk/news/science-environment-...,"[of, the, 18, resident, species, most, are, gr...",science
4,bbc,http://www.bbc.co.uk/news/health-42809445,"[identical, longtailed, macaques, zhong, zhong...",science
5,bbc,http://www.bbc.co.uk/news/science-environment-...,"[that, the, finding, of, a, study, that, track...",science
6,bbc,http://www.bbc.co.uk/news/uk-wales-south-east-...,"[mathematicians, think, they, have, devised, a...",science
7,bbc,http://www.bbc.co.uk/news/science-environment-...,"[in, what, is, known, as, a, static, firing, a...",science
8,bbc,http://www.bbc.co.uk/news/science-environment-...,"[akin, to, a, giant, disco, ball, the, object,...",science
9,bbc,http://www.bbc.co.uk/news/science-environment-...,"[the, seabed, investigation, coordinated, by, ...",science


In [6]:
# A bag of words
print("Size of the bag")
print(len(feeds_df['Bag'].iloc[0]))
feeds_df['Bag'].iloc[0][:20]

Size of the bag
476


['more',
 'than',
 '11',
 'billion',
 'items',
 'of',
 'plastic',
 'were',
 'found',
 'on',
 'a',
 'third',
 'of',
 'coral',
 'reefs',
 'surveyed',
 'in',
 'the',
 'asiapacific',
 'region']

In [7]:
# Keep a real copy because collecting all the data is time consuming
# (a plain assignment would only create a second name for the same DataFrame)
feeds_copy = feeds_df.copy()

---------------

Cleaning
====

-------------

In [24]:
feeds_df = feeds_copy.copy()

We remove every bag of words that contains fewer than a minimum number of words.

In [25]:
def remove_bag_under(size_min, feeds_df=feeds_df, category=True):
    to_drop = []
    for i in range(len(feeds_df)): 
        if (len(feeds_df.iloc[i]['Bag']) < size_min):
            to_drop.append(i)
    feeds_df = feeds_df.drop(feeds_df.index[to_drop])
    if (category):
        print('Number of science articles')
        print(feeds_df[feeds_df['Category'] == 'science'].shape[0])
        print('Number of health articles')
        print(feeds_df[feeds_df['Category'] == 'health'].shape[0])
    return feeds_df
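
One detail worth keeping in mind with `remove_bag_under()`: the default value `feeds_df=feeds_df` is evaluated once, when the function is defined, so later calls without an explicit argument still see the DataFrame that existed at that moment. The toy sketch below illustrates this Python behavior with plain lists; `records`, `count_default` and `count_explicit` are hypothetical names, not part of this notebook.

```python
records = ["a", "b"]


def count_default(items=records):
    # The default was captured when the function was defined
    return len(items)


def count_explicit(items=None):
    # Fall back to the current global value at call time
    if items is None:
        items = records
    return len(items)


# Rebinding the name, as `feeds_df = ...` does in a later cell
records = ["a", "b", "c", "d"]

print(count_default())   # 2
print(count_explicit())  # 4
```

Passing the DataFrame explicitly on every call avoids the question altogether.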

In [41]:
feeds_df = remove_bag_under(200)

Number of science articles
395
Number of health articles
332


In [42]:
from nltk.corpus import stopwords
import nltk.stem

def preprocess(bag):
    stops = set(stopwords.words('english'))  # set membership is faster than list lookup
    stemmer = nltk.stem.porter.PorterStemmer()
    processed = [stemmer.stem(token) for token in bag if token not in stops]
    return processed

In [43]:
feeds_df['Bag'] = feeds_df['Bag'].apply(preprocess)
feeds_df['Bag'].head(10)

0    [11, billion, item, plastic, found, third, cor...
1    [telemetri, vehicl, lost, nine, minut, flight,...
2    [new, date, fossil, israel, indic, speci, homo...
3    [18, resid, speci, grow, number, stabl, evid, ...
4    [ident, longtail, macaqu, zhong, zhong, hua, h...
5    [find, studi, track, danc, death, fastest, lan...
6    [mathematician, think, devis, way, calcul, siz...
7    [known, static, fire, 27, engin, launcher, fir...
8    [akin, giant, disco, ball, object, visibl, nak...
9    [seab, investig, coordin, campaign, group, gre...
Name: Bag, dtype: object

---------------------


Learning
=====

-------------------------


1. Creating the dataset
--------

In [44]:
# Here the labels are science and health
# science is category 0 and health category 1
labels = ['science', 'health']

feeds_df['Category'] = feeds_df['Category'].where(feeds_df['Category'] == 'science', other=1)
feeds_df['Category'] = feeds_df['Category'].where(feeds_df['Category'] == 1, other=0)
print(feeds_df.head(5))
feeds_df.tail(5)

  Website                                        Article_url  \
0     bbc  http://www.bbc.co.uk/news/science-environment-...   
1     bbc  http://www.bbc.co.uk/news/science-environment-...   
2     bbc  http://www.bbc.co.uk/news/science-environment-...   
3     bbc  http://www.bbc.co.uk/news/science-environment-...   
4     bbc          http://www.bbc.co.uk/news/health-42809445   

                                                 Bag Category  
0  [11, billion, item, plastic, found, third, cor...        0  
1  [telemetri, vehicl, lost, nine, minut, flight,...        0  
2  [new, date, fossil, israel, indic, speci, homo...        0  
3  [18, resid, speci, grow, number, stabl, evid, ...        0  
4  [ident, longtail, macaqu, zhong, zhong, hua, h...        0  


Unnamed: 0,Website,Article_url,Bag,Category
773,h7,http://www.health.com/syndication/madewell-bes...,"[articl, origin, appear, peoplecom, holiday, d...",1
774,h7,http://www.health.com/syndication/amazon-12-da...,"[articl, origin, appear, travelandleisurecom, ...",1
775,h7,http://www.health.com/style/best-workout-under...,"[nobodi, like, midsquat, wedgi, panti, line, p...",1
776,h7,http://www.health.com/style/lindsey-vonn-under...,"[come, girl, boss, alpin, skier, lindsey, vonn...",1
777,h7,http://www.health.com/style/plus-size-sports-i...,"[ever, sinc, hit, puberti, bath, suit, shop, d...",1


For this exercise I decided to feed the same number of words from every article to my algorithms, so that longer articles do not get more weight. I therefore take the largest number of words that every article can provide, which is the size of the smallest bag (capped at 500).

In [45]:
input_size = 500
for i in range(len(feeds_df['Bag'])):
    if(len(feeds_df['Bag'].iloc[i]) <= input_size):
        input_size = len(feeds_df['Bag'].iloc[i])
input_size

107
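
The same value can be computed in one expression with `min()`. This is a small sketch on toy data: the `bags` list is a made-up stand-in for `feeds_df['Bag']`, and `max_size` is a hypothetical name for the cap of 500 used above.

```python
# Toy bags standing in for feeds_df['Bag']; the words are placeholders
bags = [
    ["plastic", "coral", "reef", "survey"],
    ["rocket", "launch"],
    ["fossil", "israel", "homo", "sapien", "date"],
]

max_size = 500  # same cap as in the loop above
input_size = min(max_size, min(len(bag) for bag in bags))
print(input_size)  # 2
```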

This function draws a random sample of `nb_words` words from a bag.

In [46]:
import numpy as np

def get_random_words(bag, nb_words=input_size, random_seed=2):
    indices = list(range(len(bag)))
    np.random.seed(random_seed)
    np.random.shuffle(indices)
    bag_crop = []
    for i in range(nb_words):  # use the parameter instead of the global input_size
        bag_crop.append(bag[indices[i]])
    return bag_crop
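
Drawing words without replacement can also be sketched with the standard library's `random.sample()`. This is an alternative under my own assumptions, not the function used in this notebook: `sample_words` is a hypothetical name, and because it relies on a different random generator than NumPy it will not select the same words as `get_random_words()`.

```python
import random


def sample_words(bag, nb_words, random_seed=2):
    # A private Random instance leaves the global random state untouched
    rng = random.Random(random_seed)
    # sample() picks nb_words distinct positions, i.e. without replacement;
    # it raises ValueError if nb_words is larger than len(bag)
    return rng.sample(bag, nb_words)


bag = ["plastic", "coral", "reef", "survey", "region", "item"]
crop = sample_words(bag, nb_words=3)
print(crop)
```

With a fixed seed the sample is reproducible from one run to the next.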

I use join_bag() later to turn each bag of words into an input vector. The vectorizer used below expects each bag as a single string, with words separated by a space ' '.

In [47]:
def join_bag(bag):
    return ' '.join(bag)

In [48]:
feeds_df['Bag'] = feeds_df['Bag'].apply(get_random_words).apply(join_bag)

In [49]:
from sklearn.model_selection import train_test_split

dataset = feeds_df[['Bag', 'Category']]

X_train, X_test, y_train, y_test = train_test_split(
    dataset['Bag'].values, dataset['Category'].values, stratify=dataset['Category'].values, random_state=2)

# Conversion to float type
y_train = y_train.astype(float)
y_test = y_test.astype(float)

print('Size of the entire dataset: %i' % len(dataset['Bag']))
print('Size of the training set: %i' % len(X_train))
print('Size of the test set: %i' % len(X_test))

Size of the entire dataset: 727
Size of the training set: 545
Size of the test set: 182


As briefly explained before, the transform used below must be applied to a NumPy array of bags of words, each bag being a single string (not a list of strings).

HashingVectorizer maps each bag of words to a float vector of size 2048 (`n_features=2 ** 11`). The vector is not guaranteed to be unique, since different words can hash to the same index, but it gives my algorithms a fixed-size numerical input. We can see here that most of these values are 0.

In [50]:
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(n_features=2 ** 11)

X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

print(X_train.shape)
print(X_test.shape)
X_train.toarray()

(545, 2048)
(182, 2048)


array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
        -0.0836242 ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ..., -0.08421519,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])
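
To give an intuition of why the vectors are mostly zeros with a few small signed values, here is a rough sketch of the hashing trick using only the standard library. It is not scikit-learn's implementation: `hash_vectorize` is a hypothetical function, and MD5 is swapped in for the hash function that HashingVectorizer really uses, so the indices and signs will differ from the output above.

```python
import hashlib
import math


def hash_vectorize(text, n_features=2 ** 11):
    vector = [0.0] * n_features
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The first four bytes choose the column, the fifth byte chooses the sign
        index = int.from_bytes(digest[:4], "big") % n_features
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vector[index] += sign
    # L2 normalization, so that every non-empty bag has unit length
    norm = math.sqrt(sum(v * v for v in vector))
    if norm > 0:
        vector = [v / norm for v in vector]
    return vector


vec = hash_vectorize("plastic coral reef plastic")
non_zero = [v for v in vec if v != 0.0]
print(len(vec), len(non_zero))
```

Each distinct word touches a single column, so a bag of about a hundred words leaves most of the 2048 columns at zero.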

2. Algorithms comparison
-------

In this section I apply four basic algorithms to my dataset.

In [51]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

svc = SVC()
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, max_depth=4)
logreg = LogisticRegression()
tree = DecisionTreeClassifier(random_state=241)

svc.fit(X_train, y_train)
forest.fit(X_train, y_train)
logreg.fit(X_train, y_train)
tree.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=241, splitter='best')

In [52]:
# Prediction and evaluation of its accuracy
pred_svc = svc.predict(X_test)
pred_forest = forest.predict(X_test)
pred_logreg = logreg.predict(X_test)
pred_tree = tree.predict(X_test)

accuracy_svc = (pred_svc == y_test).sum() / float(len(y_test))
accuracy_forest = (pred_forest == y_test).sum() / float(len(y_test))
accuracy_logreg = (pred_logreg == y_test).sum() / float(len(y_test))
accuracy_tree = (pred_tree == y_test).sum() / float(len(y_test))

print('Logistic Regression accuracy : {}'.format(round(accuracy_logreg, 5)))
print('Support Vector Machine accuracy : {}'.format(round(accuracy_svc, 5)))
print('Decision Tree accuracy : {}'.format(round(accuracy_tree, 5)))
print('Random Forest accuracy : {}'.format(round(accuracy_forest, 5)))

Logistic Regression accuracy : 0.81868
Support Vector Machine accuracy : 0.54396
Decision Tree accuracy : 0.67582
Random Forest accuracy : 0.73626


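As we can see, three of the four algorithms are reasonably accurate, with accuracies between about 68% and 82%. The exception is the Support Vector Machine at about 54%, which is no better than always predicting the majority class (science makes up 395 of the 727 articles, about 54%).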
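
To put such accuracies in perspective, it helps to compare them with a majority-class baseline. The sketch below uses toy labels, not my results, and the helper names `accuracy` and `majority_baseline` are hypothetical.

```python
from collections import Counter


def accuracy(predictions, targets):
    # Share of predictions that match the true labels
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def majority_baseline(targets):
    # Accuracy obtained by always predicting the most frequent label
    most_common_count = Counter(targets).most_common(1)[0][1]
    return most_common_count / len(targets)


# Toy labels: 0 = science, 1 = health
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

print(accuracy(y_pred, y_true))   # 0.6
print(majority_baseline(y_true))  # 0.6
```

In this toy case the classifier does no better than the baseline, which is the kind of situation worth checking for any model close to 50%.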

I realized that giving the same number of words for each bag was not necessary with HashingVectorizer, so I also tried without get_random_words(). The results are better for every algorithm except the Support Vector Machine:

    Logistic Regression accuracy : 0.86624
    Support Vector Machine accuracy : 0.50318
    Decision Tree accuracy : 0.76433
    Random Forest accuracy : 0.86624
    

3. Further experiments
---------

Just for fun, I tried the Logistic Regression classifier on 70 articles from the Daily Mail 'Money' category.

In [53]:
url = 'http://www.dailymail.co.uk/money/index.rss'
site = 'dm'
parsed = feedparser.parse(url)
entries = parsed.entries

print("Number of articles : {}".format(len(entries)))

data = []
for entry in entries:
    body_text = get_main_body_text(entry=entry, news_website=site)
    bag = text_to_bag(body_text=body_text, news_website=site)
    row = [site, entry.link, bag]
    data.append(row)
    time.sleep(1)

money_df = pd.DataFrame(data, columns=['Website', 'Article_url', 'Bag'])
money_df.head(10)

Number of articles : 70


Unnamed: 0,Website,Article_url,Bag
0,dm,http://www.dailymail.co.uk/money/markets/artic...,"[budget, hotel, business, easyhotel, is, boost..."
1,dm,http://www.dailymail.co.uk/money/diyinvesting/...,"[the, recent, peril, of, carillion, is, a, tim..."
2,dm,http://www.dailymail.co.uk/news/article-531569...,"[britain, economy, grew, by, 05, per, cent, in..."
3,dm,http://www.dailymail.co.uk/money/markets/artic...,"[host, commentator, host, commentator, tequila..."
4,dm,http://www.dailymail.co.uk/money/cars/article-...,"[want, to, be, the, coolest, people, on, the, ..."
5,dm,http://www.dailymail.co.uk/money/markets/artic...,"[telecoms, giant, bt, has, been, slapped, with..."
6,dm,http://www.dailymail.co.uk/money/saving/articl...,"[natwest, coop, and, barclays, had, the, bigge..."
7,dm,http://www.dailymail.co.uk/money/bills/article...,"[andrea, caldwell, travel, insurance, claim, w..."
8,dm,http://www.dailymail.co.uk/money/investing/art...,"[recently, this, is, money, received, an, emai..."
9,dm,http://www.dailymail.co.uk/money/news/article-...,"[the, us, has, locked, horns, with, world, ban..."


In [54]:
# Data processing
money_df['Bag'] = money_df['Bag'].apply(preprocess).apply(join_bag)

# bag of word conversion to numbers
X_test = vectorizer.transform(money_df['Bag'].values)

# Prediction
predictions = logreg.predict(X_test)
predictions[:20]

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [55]:
pred_df = pd.DataFrame(predictions)
nb_science = pred_df[pred_df.iloc[:,0] == 0].shape[0]
nb_total = pred_df.shape[0]

print(nb_science/nb_total)

0.8
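
The share of each predicted class can also be read directly with `collections.Counter`. In this sketch the `predictions` list is rebuilt from the two figures reported above (70 articles, 80% predicted as science), so only the totals are meaningful, not the order; `label_names` is a hypothetical helper.

```python
from collections import Counter

# Rebuilt from the reported totals: 56 science (0.0) and 14 health (1.0) predictions
predictions = [0.0] * 56 + [1.0] * 14

label_names = {0.0: "science", 1.0: "health"}
counts = Counter(predictions)
shares = {label_names[label]: count / len(predictions)
          for label, count in counts.items()}
print(shares)
```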


Apparently our algorithm considers the Money category closer to the Science category than to the Health category: it labels 80% of these articles as science. This seems understandable.

Let's take a look at some articles that were classified as health-related.

In [56]:
pred_df[pred_df.iloc[:,0] == 1].head(5)

Unnamed: 0,0
7,1.0
20,1.0
30,1.0
32,1.0
33,1.0


In [59]:
money_df.iloc[7]['Article_url']

'http://www.dailymail.co.uk/money/bills/article-5085511/Why-did-travel-insurer-reject-claim.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490'

I chose this one because it was, in my opinion, the most relevant. Travel and insurance are common subjects in the health category, which may explain why the algorithm made this decision.

Conclusion
-----

In this work I implemented a classifier that correctly distinguishes science news articles from health news articles in about 82% to 87% of test cases, depending on the preprocessing.

To some extent, the classifier also seems able to judge whether other categories are more science-related or health-related.

I also wanted to apply this algorithm to another source of feeds: Reddit. Unfortunately, most of the summaries given by those feeds were empty, so I could not get any interesting results.