# Dataset Statistics

In [1]:
import pandas as pd

eco_news_pd = pd.read_json('datasetEconomyNews_PN.json')
headlines = eco_news_pd['headlineTitle']
texts = eco_news_pd['headlineText']
labels = eco_news_pd['classification']

headlines_df = pd.DataFrame(list(zip(headlines, labels)), columns = ['X_text', 'y_label'])
texts_df = pd.DataFrame(list(zip(texts, labels)), columns = ['X_text', 'y_label'])

### Dataset Class Balance (1 for positive, -1 for negative)

In [2]:
print("Portion of Dataset Made up of Negatives:",len(eco_news_pd[eco_news_pd['classification'] == -1])/len(eco_news_pd))

Portion of Dataset Made up of Negatives: 0.6227758007117438


In [3]:
print("Portion of Dataset Made up of Positives:",len(eco_news_pd[eco_news_pd['classification'] == 1])/len(eco_news_pd))

Portion of Dataset Made up of Positives: 0.37722419928825623
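
Both proportions can also be read off in a single call. The following is a minimal sketch, assuming only that the labels are stored as 1 and -1 as above; the small `toy_labels` series is hypothetical stand-in data, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for eco_news_pd['classification'].
toy_labels = pd.Series([-1, -1, -1, 1, 1], name="classification")

# normalize=True returns the fraction of rows per label instead of raw counts.
class_balance = toy_labels.value_counts(normalize=True)
print(class_balance)
```

On the real data, the same call on `eco_news_pd['classification']` should reproduce the two fractions printed above.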


### Sequence Length Statistics

Each example in the dataset has a headline and a short text. Headlines are typically one sentence, while the texts are somewhat longer. The lengths reported below are measured in characters.

In [4]:
import numpy as np
length_headlines = [len(txt) for txt in eco_news_pd['headlineTitle']]
print("Average sequence length of headlines:",np.mean(length_headlines))
print("Maximum sequence length of headlines:",np.max(length_headlines))
print("Minimum sequence length of headlines:",np.min(length_headlines))

Average sequence length of headlines: 58.596085409252666
Maximum sequence length of headlines: 101
Minimum sequence length of headlines: 17


In [5]:
# Character lengths of the short texts (named separately from the headline lengths above).
length_texts = [len(txt) for txt in eco_news_pd['headlineText']]
print("Average sequence length of texts:",np.mean(length_texts))
print("Maximum sequence length of texts:",np.max(length_texts))
print("Minimum sequence length of texts:",np.min(length_texts))

Average sequence length of texts: 172.65302491103202
Maximum sequence length of texts: 295
Minimum sequence length of texts: 53
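
The figures above are character counts, whereas the limit of 32 used later in this notebook is applied to whitespace-separated tokens. A minimal sketch of measuring both with pandas string methods, run on hypothetical toy headlines rather than the real dataset:

```python
import pandas as pd

# Hypothetical toy data standing in for eco_news_pd.
toy = pd.DataFrame({
    "headlineTitle": ["Stocks rise on trade hopes", "Oil falls"],
})

# Length in characters, matching len(txt) in the cells above.
char_lengths = toy["headlineTitle"].str.len()

# Length in whitespace-separated tokens.
token_lengths = toy["headlineTitle"].str.split().str.len()

print(char_lengths.describe()[["mean", "min", "max"]])
print(token_lengths.describe()[["mean", "min", "max"]])
```

Note that `str.split()` with no argument splits on runs of whitespace, which can differ slightly from `split(' ')` when a string contains consecutive spaces.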


### Similarities in Texts and Headlines

Texts and headlines within each example share some tokens. To measure this, we lowercase all tokens and compare only purely alphabetic tokens that are not stop words, so tokens containing digits or attached punctuation are dropped. Since texts are typically longer than headlines, we mostly want to know how much of the headline is represented in the text, in terms of simple token overlap. We also cap the input sequence length, which limits the amount of padding needed if headlines and texts are treated as separate examples. The cap used here is 32 tokens.

In [21]:
import nltk
from nltk.corpus import stopwords
from collections import defaultdict
stop_words = set(stopwords.words('english'))
intersect_words = defaultdict(int)
intersect = []
intersect_ratio = []
for i in range(len(eco_news_pd['headlineText'])):
    headline_ = eco_news_pd['headlineTitle'][i].split(' ')[:32]
    text_ = eco_news_pd['headlineText'][i].split(' ')[:32]
    text = set([word.lower() for word in text_ if word.lower() not in stop_words and word.isalpha()])
    headline = set([word.lower() for word in headline_ if word.lower() not in stop_words and word.isalpha()])
    if headline & text:
        intersecting = headline & text
        for word in intersecting:
            intersect_words[word] += 1
            
        intersect.append(len(headline & text))
        intersect_ratio.append(len(headline & text) / (len(headline)))
    else:
        intersect.append(0)
        intersect_ratio.append(0)

print('Average amount of tokens overlapping:',np.mean(intersect))
print('Average ratio of overlapping tokens from headlines within texts:',np.mean(intersect_ratio))

Average amount of tokens overlapping: 1.5729537366548043
Average ratio of overlapping tokens from headlines within texts: 0.2587969809944899
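
To make the ratio concrete, here is a worked example on a single made-up headline and text. It is a minimal sketch: `content_tokens`, `overlap_ratio`, and `TOY_STOP_WORDS` are hypothetical names introduced for illustration, and the tiny stop word set replaces the NLTK list so the example runs without downloading corpus data.

```python
# Small hypothetical stop word list; the notebook itself uses NLTK's English list.
TOY_STOP_WORDS = {"the", "on", "as", "a", "of", "in"}

def content_tokens(sentence, limit=32):
    """Lowercased, purely alphabetic, non-stop-word tokens among the first `limit` items."""
    words = sentence.split(" ")[:limit]
    return {w.lower() for w in words if w.lower() not in TOY_STOP_WORDS and w.isalpha()}

def overlap_ratio(headline, text):
    """Fraction of the headline's content tokens that also appear in the text."""
    head, body = content_tokens(headline), content_tokens(text)
    return len(head & body) / len(head) if head else 0.0

headline = "Stocks fall on trade fears"
text = "Wall Street stocks fell as trade tensions rose"
print(overlap_ratio(headline, text))  # 2 shared tokens out of 4 headline tokens
```

Here the headline keeps the tokens stocks, fall, trade, and fears; only stocks and trade also occur in the text, giving a ratio of 0.5. Because matching is on exact surface forms, fall and fell do not count as overlapping.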


The 50 least frequent overlapping tokens are listed below. As you can see, some of these are proper nouns and many are other nouns.

In [22]:

intersect_df = pd.DataFrame(list(zip(intersect_words.keys(), intersect_words.values())), columns = ['word', 'freq'])
intersect_df.sort_values(by=['freq'], ascending=True)['word'][:50]

252            buy
328        demands
327           ease
326          curbs
325        backing
324        retains
323    supervisory
322          april
321            max
320         flight
319      ethiopian
318       software
317       deciding
316        manager
315            dam
314           told
313        session
312         choppy
311           debt
310          needs
309         reduce
308           papa
307       forecast
306         powell
305        resorts
304         nevada
330             jp
331           lira
332         denies
335          point
369         bubble
368         attack
367          syria
366         public
365      companies
364        fastest
363           grew
360       declines
359          larry
358          state
357         latest
356           show
303           wynn
355        figures
353        revival
350           time
348     treasuries
347       criminal
346        defense
345             co
Name: word, dtype: object

Below are the 50 words that most often overlap between text and headline across the data. Some of these are also proper nouns, and many are other nouns.

In [23]:

intersect_df = pd.DataFrame(list(zip(intersect_words.keys(), intersect_words.values())), columns = ['word', 'freq'])
intersect_df.sort_values(by=['freq'], ascending=False)['word'][:50]

349        stocks
86          stock
34        economy
17       economic
23         market
20        markets
0           china
123          wall
16         global
45          trade
65          trump
11         growth
125        street
127        shares
14            oil
62        billion
156        boeing
102         sales
7             tax
376            us
99          first
124       percent
4             fed
67        sources
56          tesla
378          data
52      investors
68         huawei
82        million
31          rates
8            tech
229        profit
18       shutdown
381     recession
48         prices
225         world
41          steel
81           bond
235      airlines
144        nasdaq
145          bear
206          baby
77            new
33        tariffs
373      business
154           air
94       optimism
100          weak
96     settlement
451       buffett
Name: word, dtype: object
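
The same rankings can be obtained without building a DataFrame. A minimal sketch using `collections.Counter` from the standard library; the counts below are made-up illustration values, not results from the dataset.

```python
from collections import Counter

# Hypothetical overlap counts standing in for intersect_words.
toy_overlap_counts = Counter({"stocks": 5, "economy": 3, "lira": 1, "dam": 1})

# Highest counts first.
top_two = toy_overlap_counts.most_common(2)

# Lowest counts: take the tail of the full ranking, reversed.
bottom_two = toy_overlap_counts.most_common()[:-3:-1]

print(top_two)
print(bottom_two)
```

When many words share the same count, as is likely for words that overlap only once, the order among them is not meaningful, so a "bottom 50" list is best read as a sample of rare words rather than a strict ranking.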

### Tokens Occurring in Positive and Negative Examples

It can be helpful to explore which tokens occur in negative and positive examples to better understand our dataset.

In [30]:
negative_counts = defaultdict(int)
positive_counts = defaultdict(int)
for i in range(len(eco_news_pd)):

    head = eco_news_pd['headlineTitle'][i].split(' ')[:32]
    text = eco_news_pd['headlineText'][i].split(' ')[:32]
    for j in range(len(head)):
        if head[j].lower() not in stop_words and head[j].isalpha():
            if eco_news_pd['classification'][i] == -1:
                negative_counts[head[j].lower()] += 1
            else:
                positive_counts[head[j].lower()] += 1
    for j in range(len(text)):
        if text[j].lower() not in stop_words and text[j].isalpha():
            if eco_news_pd['classification'][i] == -1:
                negative_counts[text[j].lower()] += 1
            else:
                positive_counts[text[j].lower()] += 1

        
            

neg_word_df = pd.DataFrame(list(zip(negative_counts.keys(), negative_counts.values())), columns = ['word', 'freq'])
print("Most Frequent 25 Words in Negative Examples")
print(neg_word_df.sort_values(by=['freq'], ascending=False)['word'][:25], '\n')
print("Least Frequent 25 Words in Negative Examples")
print(neg_word_df.sort_values(by=['freq'], ascending=True)['word'][:25])       


Most Frequent 25 Words in Negative Examples
55           stock
137       economic
549         stocks
16         economy
257         market
188         global
268        markets
161       business
92          growth
77           trade
52             new
75           china
94           trump
225      investors
864        company
81            said
233          could
253             us
35      government
241      financial
1053          fell
164            may
264         prices
955         friday
195           data
Name: word, dtype: object 

Least Frequent 25 Words in Negative Examples
2502      estimates
1531        richard
1530       facility
1526       tailings
1525           ceos
1522    contraction
1521      undergoes
1520       capacity
1519         tonnes
1517          wrong
1516          taken
1514          skids
1513    steelmakers
2346            rbs
1509          judge
1504           cope
1503      carmakers
1502      supplying
1501       contract
1500     automobile
1499    

In [32]:
pos_word_df = pd.DataFrame(list(zip(positive_counts.keys(), positive_counts.values())), columns = ['word', 'freq'])
print("Most Frequent 25 Words in Positive Examples")
print(pos_word_df.sort_values(by=['freq'], ascending=False)['word'][:25], '\n')
print("Least Frequent 25 Words in Positive Examples")
print(pos_word_df.sort_values(by=['freq'], ascending=True)['word'][:25])

Most Frequent 25 Words in Positive Examples
142        stocks
76        economy
184         stock
216      business
103      economic
60         market
53         growth
221       company
8            said
10          trade
0           china
128     investors
45            new
177       billion
446          wall
408       percent
469        shares
409          rise
6       president
447        street
117        global
211          year
7           trump
61     investment
341       markets
Name: word, dtype: object 

Least Frequent 25 Words in Positive Examples
1797      producer
1138     christmas
1137          epic
1136       history
1134         soars
1132      continue
1125    especially
1124      backdrop
1123         bleak
1139           eve
1122       learned
1119      february
1118          huya
1116          lyft
1115        circle
1114        losers
1113       enjoyed
1112         cheap
1110        reason
1120          came
1141         white
1142         house
1143       advi

Words that appear often tend to occur in both classes, while words that appear rarely tend to occur in only one class.

In [34]:
print("Portion of words from positive examples that occur also in negative examples")
print(len(set(negative_counts.keys()) & set(positive_counts.keys())) / len(positive_counts.keys()))

Portion of words from positive examples that occur also in negative examples
0.5100111234705228
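
The portion above is measured relative to the positive vocabulary only. A minimal sketch, on made-up counts rather than the real `positive_counts` and `negative_counts`, of computing the overlap in both directions along with the Jaccard index (shared words divided by all distinct words), which is a symmetric alternative and not something the original analysis reports:

```python
# Hypothetical word counts standing in for positive_counts and negative_counts.
toy_positive = {"stocks": 4, "rise": 2, "solid": 1}
toy_negative = {"stocks": 6, "fell": 3, "rise": 1, "pause": 2}

pos_vocab, neg_vocab = set(toy_positive), set(toy_negative)
shared = pos_vocab & neg_vocab

# Share of each class's vocabulary that also appears in the other class.
share_of_positive = len(shared) / len(pos_vocab)
share_of_negative = len(shared) / len(neg_vocab)

# Jaccard index: shared words over the union of both vocabularies.
jaccard = len(shared) / len(pos_vocab | neg_vocab)

print(share_of_positive, share_of_negative, jaccard)
```

The two directional shares differ whenever the vocabularies differ in size, which is worth keeping in mind given that negative examples make up the larger part of this dataset.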


In [42]:
count = 0
print("25 words that occur in positive examples, but not negative examples, multiple times:")
for key, value in positive_counts.items():
    if key not in negative_counts and value > 1:
        print(key, "| frequency:",value)
        count +=1
    if count == 25:
        break
count = 0
print("\n25 words that occur in negative examples, but not positive examples, multiple times:")
for key, value in negative_counts.items():
    if key not in positive_counts and value > 1:
        print(key, "| frequency:",value)
        count +=1
    if count == 25:
        break
        

25 words that occur in positive examples, but not negative examples, multiple times:
talk | frequency: 3
gaining | frequency: 2
field | frequency: 2
star | frequency: 3
stood | frequency: 2
comes | frequency: 2
shoppers | frequency: 2
delivery | frequency: 2
far | frequency: 3
november | frequency: 3
opportunity | frequency: 2
lead | frequency: 2
lift | frequency: 6
headquarters | frequency: 2
working | frequency: 3
rather | frequency: 2
conservation | frequency: 2
solid | frequency: 7
sovereign | frequency: 2
professional | frequency: 2
reversing | frequency: 2
responsibility | frequency: 2
hard | frequency: 3
online | frequency: 4
revival | frequency: 3

25 words that occur in negative examples, but not positive examples, multiple times:
pause | frequency: 2
minutes | frequency: 3
january | frequency: 3
meeting | frequency: 4
showing | frequency: 4
express | frequency: 2
enough | frequency: 4
democrats | frequency: 2
proposal | frequency: 4
support | frequency: 2
party | frequency: 2