# Predicting the Dow Jones Index from News Headlines

#### Team:
Anna Lapidus,
Nadezhda Katricheva

## Problem Statement

The input data consist of the headlines of the 25 most popular news articles for each day and a binarized Dow Jones index (1: the index rose or stayed the same, 0: the index fell). The collection covers the period from 2008-08-08 to 2016-07-01.
<br>
This is a binary classification task: the binarized Dow Jones index serves as the class label, and the article headlines are used to extract features.

In [11]:
import pandas as pd
import re

Load the data into a DataFrame for further work.

In [12]:
df = pd.read_csv('Combined_News_DJIA.csv')

In [13]:
df.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


# Part 1. Preprocessing

Convert the texts to lowercase and tokenize them.

In [14]:
def preprocess(text_list):
    clean_texts = []
    for text in text_list:
        # drop the b'...' / b"..." byte-string markers and remove periods (so "U.N." becomes "un")
        text = text.replace('b\'','').replace('b"','').replace('.','')
        words = re.findall(r'[a-z]+', text.lower())  # lowercase and extract the words into a list
        #tokens = ' '.join(words)
        clean_texts.append(words)  # list of tokenized headlines
    return clean_texts

In [15]:
clean = preprocess(df['Top1'].astype(str))

In [53]:
print(clean[0])

['georgia', 'downs', 'two', 'russian', 'warplanes', 'as', 'countries', 'move', 'to', 'brink', 'of', 'war']
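The `replace` calls in `preprocess` remove every `b'` and `b"` in the string, not only the leading marker, so a word such as "Arab's" loses characters. The sketch below contrasts the original steps with an anchored alternative. `tokenize_anchored` and the sample headline are hypothetical and not part of the original notebook.

```python
import re

def tokenize_original(text):
    # Same steps as preprocess() above, applied to a single headline.
    text = text.replace('b\'', '').replace('b"', '').replace('.', '')
    return re.findall(r'[a-z]+', text.lower())

def tokenize_anchored(text):
    # Hypothetical variant: strip the b' or b" marker only at the start of the string.
    text = re.sub(r'^b[\'"]', '', text).replace('.', '')
    return re.findall(r'[a-z]+', text.lower())

sample = 'b"Arab\'s League meets in U.S."'  # made-up headline for illustration
original_tokens = tokenize_original(sample)
anchored_tokens = tokenize_anchored(sample)
print(original_tokens)
print(anchored_tokens)
```

Switching to the anchored variant could slightly change the token counts reported later in the notebook, so the recorded outputs below correspond to the original `preprocess`.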


In [17]:
# apply the preprocessing to all news headline columns
for column in df.loc[:,'Top1':'Top25']:
    headlines = df[column].astype(str)
    df[column] = preprocess(headlines)


In [18]:
df.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"[georgia, downs, two, russian, warplanes, as, ...","[breaking, musharraf, to, be, impeached]","[russia, today, columns, of, troops, roll, int...","[russian, tanks, are, moving, towards, the, ca...","[afghan, children, raped, with, impunity, un, ...","[russian, tanks, have, entered, south, ossetia...","[breaking, georgia, invades, south, ossetia, r...","[the, enemy, combatent, trials, are, nothing, ...",...,"[georgia, invades, south, ossetia, if, russia,...","[al, qaeda, faces, islamist, backlash]","[condoleezza, rice, the, us, would, not, act, ...","[this, is, a, busy, day, the, european, union,...","[georgia, will, withdraw, soldiers, from, iraq...","[why, the, pentagon, thinks, attacking, iran, ...","[caucasus, in, crisis, georgia, invades, south...","[indian, shoe, manufactory, and, again, in, a,...","[visitors, suffering, from, mental, illnesses,...","[no, help, for, mexico, s, kidnapping, surge]"
1,2008-08-11,1,"[why, wont, america, and, nato, help, us, if, ...","[bush, puts, foot, down, on, georgian, conflict]","[jewish, georgian, minister, thanks, to, israe...","[georgian, army, flees, in, disarray, as, russ...","[olympic, opening, ceremony, fireworks, faked]","[what, were, the, mossad, with, fraudulent, ne...","[russia, angered, by, israeli, military, sale,...","[an, american, citizen, living, in, sossetia, ...",...,"[israel, and, the, us, behind, the, georgian, ...","[do, not, believe, tv, neither, russian, nor, ...","[riots, are, still, going, on, in, montreal, c...","[china, to, overtake, us, as, largest, manufac...","[war, in, south, ossetia, pics]","[israeli, physicians, group, condemns, state, ...","[russia, has, just, beaten, the, united, state...","[perhaps, the, question, about, the, georgia, ...","[russia, is, so, much, better, at, war]","[so, this, is, what, it, s, come, to, trading,..."
2,2008-08-12,0,"[remember, that, adorable, year, old, who, san...","[russia, ends, georgia, operation]","[if, we, had, no, sexual, harassment, we, woul...","[al, qa, eda, is, losing, support, in, iraq, b...","[ceasefire, in, georgia, putin, outmaneuvers, ...","[why, microsoft, and, intel, tried, to, kill, ...","[stratfor, the, russo, georgian, war, and, the...","[i, m, trying, to, get, a, sense, of, this, wh...",...,"[us, troops, still, in, georgia, did, you, kno...","[why, russias, response, to, georgia, was, right]","[gorbachev, accuses, us, of, making, a, seriou...","[russia, georgia, and, nato, cold, war, two]","[remember, that, adorable, year, old, who, led...","[war, in, georgia, the, israeli, connection]","[all, signs, point, to, the, us, encouraging, ...","[christopher, king, argues, that, the, us, and...","[america, the, new, mexico]","[bbc, news, asia, pacific, extinction, by, man..."
3,2008-08-13,0,"[us, refuses, israel, weapons, to, attack, ira...","[when, the, president, ordered, to, attack, ts...","[israel, clears, troops, who, killed, reuters,...","[britain, s, policy, of, being, tough, on, dru...","[body, of, year, old, found, in, trunk, latest...","[china, has, moved, million, quake, survivors,...","[bush, announces, operation, get, all, up, in,...","[russian, forces, sink, georgian, ships]",...,"[elephants, extinct, by]","[us, humanitarian, missions, soon, in, georgia...","[georgia, s, ddos, came, from, us, sources]","[russian, convoy, heads, into, georgia, violat...","[israeli, defence, minister, us, against, stri...","[gorbachev, we, had, no, choice]","[witness, russian, forces, head, towards, tbil...","[quarter, of, russians, blame, us, for, confli...","[georgian, president, says, us, military, will...","[nobel, laureate, aleksander, solzhenitsyn, ac..."
4,2008-08-14,1,"[all, the, experts, admit, that, we, should, l...","[war, in, south, osetia, pictures, made, by, a...","[swedish, wrestler, ara, abrahamian, throws, a...","[russia, exaggerated, the, death, toll, in, so...","[missile, that, killed, inside, pakistan, may,...","[rushdie, condemns, random, house, s, refusal,...","[poland, and, us, agree, to, missle, defense, ...","[will, the, russians, conquer, tblisi, bet, on...",...,"[bank, analyst, forecast, georgian, crisis, da...","[georgia, confict, could, set, back, russia, s...","[war, in, the, caucasus, is, as, much, the, pr...","[non, media, photos, of, south, ossetia, georg...","[georgian, tv, reporter, shot, by, russian, sn...","[saudi, arabia, mother, moves, to, block, chil...","[taliban, wages, war, on, humanitarian, aid, w...","[russia, world, can, forget, about, georgia, s...","[darfur, rebels, accuse, sudan, of, mounting, ...","[philippines, peace, advocate, say, muslims, n..."


## 1. Is there a correlation between the mean daily text length and the DJIA?

Compute the mean headline length, measured in tokens.

In [20]:
# add a new column with the mean headline length (in tokens) for the day
df['MeanLength'] = df.drop(['Date','Label'], axis = 1).apply(lambda x: x.str.len()).mean(axis = 1) 

In [21]:
df.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25,MeanLength
0,2008-08-08,0,"[georgia, downs, two, russian, warplanes, as, ...","[breaking, musharraf, to, be, impeached]","[russia, today, columns, of, troops, roll, int...","[russian, tanks, are, moving, towards, the, ca...","[afghan, children, raped, with, impunity, un, ...","[russian, tanks, have, entered, south, ossetia...","[breaking, georgia, invades, south, ossetia, r...","[the, enemy, combatent, trials, are, nothing, ...",...,"[al, qaeda, faces, islamist, backlash]","[condoleezza, rice, the, us, would, not, act, ...","[this, is, a, busy, day, the, european, union,...","[georgia, will, withdraw, soldiers, from, iraq...","[why, the, pentagon, thinks, attacking, iran, ...","[caucasus, in, crisis, georgia, invades, south...","[indian, shoe, manufactory, and, again, in, a,...","[visitors, suffering, from, mental, illnesses,...","[no, help, for, mexico, s, kidnapping, surge]",14.92
1,2008-08-11,1,"[why, wont, america, and, nato, help, us, if, ...","[bush, puts, foot, down, on, georgian, conflict]","[jewish, georgian, minister, thanks, to, israe...","[georgian, army, flees, in, disarray, as, russ...","[olympic, opening, ceremony, fireworks, faked]","[what, were, the, mossad, with, fraudulent, ne...","[russia, angered, by, israeli, military, sale,...","[an, american, citizen, living, in, sossetia, ...",...,"[do, not, believe, tv, neither, russian, nor, ...","[riots, are, still, going, on, in, montreal, c...","[china, to, overtake, us, as, largest, manufac...","[war, in, south, ossetia, pics]","[israeli, physicians, group, condemns, state, ...","[russia, has, just, beaten, the, united, state...","[perhaps, the, question, about, the, georgia, ...","[russia, is, so, much, better, at, war]","[so, this, is, what, it, s, come, to, trading,...",10.76
2,2008-08-12,0,"[remember, that, adorable, year, old, who, san...","[russia, ends, georgia, operation]","[if, we, had, no, sexual, harassment, we, woul...","[al, qa, eda, is, losing, support, in, iraq, b...","[ceasefire, in, georgia, putin, outmaneuvers, ...","[why, microsoft, and, intel, tried, to, kill, ...","[stratfor, the, russo, georgian, war, and, the...","[i, m, trying, to, get, a, sense, of, this, wh...",...,"[why, russias, response, to, georgia, was, right]","[gorbachev, accuses, us, of, making, a, seriou...","[russia, georgia, and, nato, cold, war, two]","[remember, that, adorable, year, old, who, led...","[war, in, georgia, the, israeli, connection]","[all, signs, point, to, the, us, encouraging, ...","[christopher, king, argues, that, the, us, and...","[america, the, new, mexico]","[bbc, news, asia, pacific, extinction, by, man...",14.12
3,2008-08-13,0,"[us, refuses, israel, weapons, to, attack, ira...","[when, the, president, ordered, to, attack, ts...","[israel, clears, troops, who, killed, reuters,...","[britain, s, policy, of, being, tough, on, dru...","[body, of, year, old, found, in, trunk, latest...","[china, has, moved, million, quake, survivors,...","[bush, announces, operation, get, all, up, in,...","[russian, forces, sink, georgian, ships]",...,"[us, humanitarian, missions, soon, in, georgia...","[georgia, s, ddos, came, from, us, sources]","[russian, convoy, heads, into, georgia, violat...","[israeli, defence, minister, us, against, stri...","[gorbachev, we, had, no, choice]","[witness, russian, forces, head, towards, tbil...","[quarter, of, russians, blame, us, for, confli...","[georgian, president, says, us, military, will...","[nobel, laureate, aleksander, solzhenitsyn, ac...",12.44
4,2008-08-14,1,"[all, the, experts, admit, that, we, should, l...","[war, in, south, osetia, pictures, made, by, a...","[swedish, wrestler, ara, abrahamian, throws, a...","[russia, exaggerated, the, death, toll, in, so...","[missile, that, killed, inside, pakistan, may,...","[rushdie, condemns, random, house, s, refusal,...","[poland, and, us, agree, to, missle, defense, ...","[will, the, russians, conquer, tblisi, bet, on...",...,"[georgia, confict, could, set, back, russia, s...","[war, in, the, caucasus, is, as, much, the, pr...","[non, media, photos, of, south, ossetia, georg...","[georgian, tv, reporter, shot, by, russian, sn...","[saudi, arabia, mother, moves, to, block, chil...","[taliban, wages, war, on, humanitarian, aid, w...","[russia, world, can, forget, about, georgia, s...","[darfur, rebels, accuse, sudan, of, mounting, ...","[philippines, peace, advocate, say, muslims, n...",10.88


### Correlation between the index and the mean daily headline length

To compute the correlation between two vectors, we use the Pearson correlation coefficient from the scipy.stats module.

In [22]:
import scipy
from scipy.stats import pearsonr

In [23]:
pearsonr(df.MeanLength, df.Label)

(-0.00484854307754869, 0.82890850592959198)

The correlation coefficient is close to 0 and the p-value is large (0.83), so we cannot reject the hypothesis that the mean headline length and the direction of the index change are linearly uncorrelated.
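For reference, the coefficient itself can be written out in a few lines of plain Python. This is a minimal sketch of the Pearson formula only (it does not compute the p-value that `pearsonr` also returns). The helper name `pearson_r` is hypothetical; the input values are the `MeanLength` and `Label` columns of the five rows shown in `df.head()` above.

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation: covariance divided by the product of standard deviations.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

mean_lengths = [14.92, 10.76, 14.12, 12.44, 10.88]  # first five days only
labels = [0, 1, 0, 0, 1]
r_head = pearson_r(mean_lengths, labels)
print(round(r_head, 2))
```

On these five rows alone the coefficient is strongly negative, while on the full dataset it is essentially zero. A handful of rows says very little about the relationship, which is why the p-value matters.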

## 2. Is there a correlation between the daily number of mentions of Barack Obama and the USA and the DJIA?

In [24]:
from collections import Counter

In [25]:
# count the mentions of the USA and Barack Obama in a list of tokenized texts
def count_features(text_list, features):
    count = 0
    for text in text_list:
        c = Counter(text)
        for feature in features:
            count += c[feature]
    return count

In [26]:
features = ['us', 'usa', 'america', 'obama']  # cover different spellings of the USA
#count_features(df['Top3'], features)
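To make the behaviour of `count_features` concrete, here is a small self-contained run on three token lists taken from the 2008-08-11 row shown earlier (the first headline is cut to its visible prefix). The variable names `row` and `total` are illustrative only.

```python
from collections import Counter

def count_features(text_list, features):
    # Same logic as the notebook function: sum exact-token matches over all texts.
    count = 0
    for text in text_list:
        c = Counter(text)
        for feature in features:
            count += c[feature]
    return count

row = [
    ['why', 'wont', 'america', 'and', 'nato', 'help', 'us'],
    ['china', 'to', 'overtake', 'us', 'as', 'largest', 'manufacturer'],
    ['russia', 'is', 'so', 'much', 'better', 'at', 'war'],
]
features = ['us', 'usa', 'america', 'obama']
total = count_features(row, features)
print(total)
```

The first headline contributes two matches, although its "us" is the pronoun in "help us" rather than the country. After lowercasing, the two are indistinguishable, so `CountUS` somewhat overestimates the true number of mentions.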

In [27]:
# add a column with the daily number of mentions of the USA and Obama
df['CountUS'] = df.drop(['Date','Label', 'MeanLength'], axis = 1).apply(lambda x: count_features(x, features), axis = 1)

In [28]:
df.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25,MeanLength,CountUS
0,2008-08-08,0,"[georgia, downs, two, russian, warplanes, as, ...","[breaking, musharraf, to, be, impeached]","[russia, today, columns, of, troops, roll, int...","[russian, tanks, are, moving, towards, the, ca...","[afghan, children, raped, with, impunity, un, ...","[russian, tanks, have, entered, south, ossetia...","[breaking, georgia, invades, south, ossetia, r...","[the, enemy, combatent, trials, are, nothing, ...",...,"[condoleezza, rice, the, us, would, not, act, ...","[this, is, a, busy, day, the, european, union,...","[georgia, will, withdraw, soldiers, from, iraq...","[why, the, pentagon, thinks, attacking, iran, ...","[caucasus, in, crisis, georgia, invades, south...","[indian, shoe, manufactory, and, again, in, a,...","[visitors, suffering, from, mental, illnesses,...","[no, help, for, mexico, s, kidnapping, surge]",14.92,4
1,2008-08-11,1,"[why, wont, america, and, nato, help, us, if, ...","[bush, puts, foot, down, on, georgian, conflict]","[jewish, georgian, minister, thanks, to, israe...","[georgian, army, flees, in, disarray, as, russ...","[olympic, opening, ceremony, fireworks, faked]","[what, were, the, mossad, with, fraudulent, ne...","[russia, angered, by, israeli, military, sale,...","[an, american, citizen, living, in, sossetia, ...",...,"[riots, are, still, going, on, in, montreal, c...","[china, to, overtake, us, as, largest, manufac...","[war, in, south, ossetia, pics]","[israeli, physicians, group, condemns, state, ...","[russia, has, just, beaten, the, united, state...","[perhaps, the, question, about, the, georgia, ...","[russia, is, so, much, better, at, war]","[so, this, is, what, it, s, come, to, trading,...",10.76,8
2,2008-08-12,0,"[remember, that, adorable, year, old, who, san...","[russia, ends, georgia, operation]","[if, we, had, no, sexual, harassment, we, woul...","[al, qa, eda, is, losing, support, in, iraq, b...","[ceasefire, in, georgia, putin, outmaneuvers, ...","[why, microsoft, and, intel, tried, to, kill, ...","[stratfor, the, russo, georgian, war, and, the...","[i, m, trying, to, get, a, sense, of, this, wh...",...,"[gorbachev, accuses, us, of, making, a, seriou...","[russia, georgia, and, nato, cold, war, two]","[remember, that, adorable, year, old, who, led...","[war, in, georgia, the, israeli, connection]","[all, signs, point, to, the, us, encouraging, ...","[christopher, king, argues, that, the, us, and...","[america, the, new, mexico]","[bbc, news, asia, pacific, extinction, by, man...",14.12,8
3,2008-08-13,0,"[us, refuses, israel, weapons, to, attack, ira...","[when, the, president, ordered, to, attack, ts...","[israel, clears, troops, who, killed, reuters,...","[britain, s, policy, of, being, tough, on, dru...","[body, of, year, old, found, in, trunk, latest...","[china, has, moved, million, quake, survivors,...","[bush, announces, operation, get, all, up, in,...","[russian, forces, sink, georgian, ships]",...,"[georgia, s, ddos, came, from, us, sources]","[russian, convoy, heads, into, georgia, violat...","[israeli, defence, minister, us, against, stri...","[gorbachev, we, had, no, choice]","[witness, russian, forces, head, towards, tbil...","[quarter, of, russians, blame, us, for, confli...","[georgian, president, says, us, military, will...","[nobel, laureate, aleksander, solzhenitsyn, ac...",12.44,10
4,2008-08-14,1,"[all, the, experts, admit, that, we, should, l...","[war, in, south, osetia, pictures, made, by, a...","[swedish, wrestler, ara, abrahamian, throws, a...","[russia, exaggerated, the, death, toll, in, so...","[missile, that, killed, inside, pakistan, may,...","[rushdie, condemns, random, house, s, refusal,...","[poland, and, us, agree, to, missle, defense, ...","[will, the, russians, conquer, tblisi, bet, on...",...,"[war, in, the, caucasus, is, as, much, the, pr...","[non, media, photos, of, south, ossetia, georg...","[georgian, tv, reporter, shot, by, russian, sn...","[saudi, arabia, mother, moves, to, block, chil...","[taliban, wages, war, on, humanitarian, aid, w...","[russia, world, can, forget, about, georgia, s...","[darfur, rebels, accuse, sudan, of, mounting, ...","[philippines, peace, advocate, say, muslims, n...",10.88,4


In [29]:
pearsonr(df.CountUS, df.Label)

(-0.00013055362274792378, 0.99535729774622406)

The correlation coefficient is even smaller and the p-value is close to 1, so no correlation is observed between the daily number of mentions of Barack Obama and the USA and the change in the Dow Jones index (the hypothesis of no linear correlation between the two vectors is not rejected).

Let us look at the correlation between the index change and mentions of other topics (the Islamic State, Russia and Putin, refugee migration).

In [30]:
features_is = ['isil', 'isis']
pearsonr(df.loc[:, 'Top1':'Top25'].apply(lambda x: count_features(x, features_is), axis = 1), df.Label)

(-0.012821265520728624, 0.56768144801712461)

In [31]:
features_ru = ['russia', 'putin']
pearsonr(df.loc[:, 'Top1':'Top25'].apply(lambda x: count_features(x, features_ru), axis = 1), df.Label)

(0.0042592720842379349, 0.84943659299601593)

In [32]:
features_fin = ['refugees', 'migrant']
pearsonr(df.loc[:, 'Top1':'Top25'].apply(lambda x: count_features(x, features_fin), axis = 1), df.Label)

(0.037835525017488998, 0.091614930328014987)

The largest correlation coefficient is observed for mentions of refugees and migrants, but it is still small (about 0.04), and its p-value of 0.09 is above the conventional 0.05 threshold.

## 3. Which articles are more numerous: those about Russia and Putin, or those about the Islamic State (a terrorist organization banned under Russian law)?

In [33]:
def join_tokens(tokens_list):
    clean_texts = []
    for tokens in tokens_list:
        text = ' '.join(tokens)
        clean_texts.append(text)  # preprocessed headline as a single string
    return clean_texts
# for convenience, join each headline's token list back into a single string

In [34]:
for column in df.loc[:,'Top1':'Top25']:
    token_lists = df[column]
    df[column] = join_tokens(token_lists)

In [35]:
df.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25,MeanLength,CountUS
0,2008-08-08,0,georgia downs two russian warplanes as countri...,breaking musharraf to be impeached,russia today columns of troops roll into south...,russian tanks are moving towards the capital o...,afghan children raped with impunity un officia...,russian tanks have entered south ossetia whils...,breaking georgia invades south ossetia russia ...,the enemy combatent trials are nothing but a s...,...,condoleezza rice the us would not act to preve...,this is a busy day the european union has appr...,georgia will withdraw soldiers from iraq to he...,why the pentagon thinks attacking iran is a ba...,caucasus in crisis georgia invades south ossetia,indian shoe manufactory and again in a series ...,visitors suffering from mental illnesses banne...,no help for mexico s kidnapping surge,14.92,4
1,2008-08-11,1,why wont america and nato help us if they wont...,bush puts foot down on georgian conflict,jewish georgian minister thanks to israeli tra...,georgian army flees in disarray as russians ad...,olympic opening ceremony fireworks faked,what were the mossad with fraudulent new zeala...,russia angered by israeli military sale to geo...,an american citizen living in sossetia blames ...,...,riots are still going on in montreal canada be...,china to overtake us as largest manufacturer,war in south ossetia pics,israeli physicians group condemns state torture,russia has just beaten the united states over ...,perhaps the question about the georgia russia ...,russia is so much better at war,so this is what it s come to trading sex for food,10.76,8
2,2008-08-12,0,remember that adorable year old who sang at th...,russia ends georgia operation,if we had no sexual harassment we would have n...,al qa eda is losing support in iraq because of...,ceasefire in georgia putin outmaneuvers the west,why microsoft and intel tried to kill the xo l...,stratfor the russo georgian war and the balanc...,i m trying to get a sense of this whole georgi...,...,gorbachev accuses us of making a serious blund...,russia georgia and nato cold war two,remember that adorable year old who led your c...,war in georgia the israeli connection,all signs point to the us encouraging georgia ...,christopher king argues that the us and nato a...,america the new mexico,bbc news asia pacific extinction by man not cl...,14.12,8
3,2008-08-13,0,us refuses israel weapons to attack iran report,when the president ordered to attack tskhinval...,israel clears troops who killed reuters cameraman,britain s policy of being tough on drugs is po...,body of year old found in trunk latest ransom ...,china has moved million quake survivors into p...,bush announces operation get all up in russia ...,russian forces sink georgian ships,...,georgia s ddos came from us sources,russian convoy heads into georgia violating truce,israeli defence minister us against strike on ...,gorbachev we had no choice,witness russian forces head towards tbilisi in...,quarter of russians blame us for conflict poll,georgian president says us military will take ...,nobel laureate aleksander solzhenitsyn accuses...,12.44,10
4,2008-08-14,1,all the experts admit that we should legalise ...,war in south osetia pictures made by a russian...,swedish wrestler ara abrahamian throws away me...,russia exaggerated the death toll in south oss...,missile that killed inside pakistan may have b...,rushdie condemns random house s refusal to pub...,poland and us agree to missle defense deal int...,will the russians conquer tblisi bet on it no ...,...,war in the caucasus is as much the product of ...,non media photos of south ossetia georgia conf...,georgian tv reporter shot by russian sniper du...,saudi arabia mother moves to block child marriage,taliban wages war on humanitarian aid workers,russia world can forget about georgia s territ...,darfur rebels accuse sudan of mounting major a...,philippines peace advocate say muslims need as...,10.88,4


Import numpy to use the count_nonzero function, which counts the nonzero elements in the matrix of
articles after applying the filter "contains the words Russia or Putin".

In [40]:
import numpy as np 


Number of articles about Russia and Putin:

In [38]:
np.count_nonzero(df.loc[:, 'Top1':'Top25'].apply(lambda x: x.str.contains("russia|putin", na = False)))

2980

Number of articles about the Islamic State (a terrorist organization banned under Russian law):

In [41]:
np.count_nonzero(df.loc[:, 'Top1':'Top25'].apply(lambda x: x.str.contains("isil|isis", na = False)))

1235

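By this count there are more articles about Russia than about the Islamic State. Note that str.contains matches substrings: the pattern "isil|isis" also matches every headline containing the word "crisis", and "russia" also matches "russian", so the Islamic State count in particular is inflated.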
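The difference between substring and whole-word matching is easy to see on a few strings. The sketch below uses the standard `re` module on made-up headlines. In pandas, the same word-boundary pattern could be passed to `str.contains`, which treats its argument as a regular expression by default.

```python
import re

# Made-up headlines for illustration only.
headlines = [
    "greek debt crisis deepens",
    "isis claims attack",
    "isil advances",
    "russian tanks enter",
    "putin visits",
]

substring = re.compile(r"isil|isis")            # pattern used in the notebook
whole_word = re.compile(r"\b(?:isil|isis)\b")   # matches complete tokens only

substring_hits = sum(bool(substring.search(h)) for h in headlines)
whole_word_hits = sum(bool(whole_word.search(h)) for h in headlines)
print(substring_hits, whole_word_hits)
```

The substring pattern counts the "crisis" headline as an Islamic State article, while the word-boundary version does not.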

## 4. Which crises do the articles write about?

We write a function to determine the kinds of crisis mentioned in the articles. After looking through the texts, we initially planned to use the two words to the left of "crisis" as its context, but it turned out that in most cases the single preceding word is enough.

In [42]:
def crisis(text_list):
    what_crisis = []
    for text in text_list:
        tokens = text.split()
        if 'crisis' in tokens:
            t = tokens.index("crisis")  # position of the first occurrence only
            # keep the word immediately to the left of "crisis";
            # note: if "crisis" is the first token, t-1 == -1 picks the last token
            what_crisis.append(tokens[t-1])
    return what_crisis

In [43]:
what_crisis = []
for column in df.loc[:, 'Top1':'Top25']:
    headlines = df[column]
    what_crisis.extend(crisis(headlines))

In [44]:
print(len(what_crisis))
what_crisis[:10]

505


['humanitarian',
 'current',
 'pics',
 'difficult',
 'humanitarian',
 'banking',
 'nuclear',
 'nuclear',
 'financial',
 'financial']

In [54]:
print(len(set(what_crisis)))
print(set(what_crisis))

166
{'worsening', 'cash', 'funding', 'coast', 'economic', 'greek', 'hostage', 'syria', 'unveils', 'unemployment', 'jobs', 'cholera', 'bleaching', 'rwandan', 'dementia', 'netanyahu', 'phosphorus', 'legal', 'migrants', 'of', 'seeker', 'financial', 'egypt', 'syrian', 'identity', 'in', 'with', 'budget', 'housing', 'power', 'arab', 'cucumber', 'treatment', 'huge', 'lanka', 'iran', 'politics', 'resorts', 'pollution', 'week', 'the', 'carbon', 'total', 'infanticide', 'banking', 'east', 'uk', 'survey', 'market', 'potential', 'superbug', 'leftist', 'microsoft', 'lebanon', 'pics', 'world', 'judiciary', 'water', 'a', 'global', 'immigration', 'gaza', 'energy', 'street', 'word', 'honduran', 'since', 'ongoing', 'smog', 'price', 'political', 'iraq', 'deficit', 'italian', 'difficult', 'tackle', 'to', 'into', 'health', 'ukraine', 'extreme', 'european', 'nuke', 'says', 'oil', 'mounting', 'for', 'healthcare', 'radiation', 'korea', 'iceland', 'econ', 'coverage', 'debt', 'this', 'kyrgyzstan', 'further', 'hu

The resulting list of crisis modifiers contains 166 unique words. Among them are currency, political, economic, climate, and environmental crises, crises in various countries, as well as oil, energy, and migration crises, etc.
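The `crisis` function looks only at the first occurrence of "crisis" in a headline, and when that occurrence is the very first token, the index `t-1` wraps around to the last word. A possible variant that handles both cases is sketched below. `crisis_modifiers` is a hypothetical name, and using it instead of `crisis` could change the counts reported in this section.

```python
def crisis_modifiers(texts):
    # Collect the word immediately before every occurrence of "crisis",
    # skipping occurrences that have no left neighbour.
    found = []
    for text in texts:
        tokens = text.split()
        for i, token in enumerate(tokens):
            if token == 'crisis' and i > 0:
                found.append(tokens[i - 1])
    return found

# Made-up headlines for illustration only.
sample = [
    "crisis in georgia",
    "financial crisis deepens as debt crisis spreads",
]
modifiers = crisis_modifiers(sample)
print(modifiers)
```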

In [52]:
# show the most frequent left neighbours of "crisis"
Counter(what_crisis).most_common(30)

[('financial', 56),
 ('ukraine', 31),
 ('economic', 28),
 ('the', 20),
 ('debt', 19),
 ('food', 17),
 ('refugee', 16),
 ('nuclear', 13),
 ('humanitarian', 12),
 ('migrant', 11),
 ('syria', 10),
 ('political', 10),
 ('euro', 9),
 ('s', 7),
 ('a', 7),
 ('greek', 6),
 ('as', 6),
 ('banking', 6),
 ('gaza', 6),
 ('in', 6),
 ('eurozone', 5),
 ('iraq', 5),
 ('health', 4),
 ('ebola', 4),
 ('of', 4),
 ('currency', 4),
 ('water', 4),
 ('global', 4),
 ('korea', 4),
 ('climate', 4)]

This gives the 30 words that most often precede the word "crisis". Discarding stop words (a, the, in, s, as), we can identify the main crises covered in the news articles:
* financial, currency
* economic
* the crisis in Ukraine
* food, humanitarian
* the refugee migration crisis
* political
* nuclear
* the crisis in Syria
* banking
* the euro crisis
* the crisis in Greece
* the crisis in the Gaza Strip
* the crisis in Iraq
* public health
* climate
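Dropping the stop words can also be done programmatically rather than by eye. The sketch below assumes a small hand-written stop list and uses the first six counts from the `most_common` output above. The names `neighbours`, `stop` and `filtered` are illustrative.

```python
from collections import Counter

# First six (word, count) pairs copied from the output above.
neighbours = Counter({
    'financial': 56, 'ukraine': 31, 'economic': 28,
    'the': 20, 'debt': 19, 'food': 17,
})

stop = {'a', 'the', 'in', 's', 'as', 'of'}  # assumed stop list for this sketch

# Keep the frequency order but skip stop words.
filtered = [(word, n) for word, n in neighbours.most_common() if word not in stop]
print(filtered)
```

In the notebook itself, the same filter could be applied to `Counter(what_crisis)`, for example with the NLTK English stop-word list that is imported in Part 2.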

# Part 2. Classification

In [55]:
import sklearn
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import *
from sklearn.pipeline import *
from sklearn.preprocessing import Normalizer
from sklearn.metrics import *

from sklearn.svm import LinearSVC

from nltk.corpus import stopwords
stopwords_eng = stopwords.words('english')

### Combining the text columns

In [56]:
# combine the news columns into a single text per day
df['news'] = df.loc[:,'Top1':'Top25'].apply(lambda x: ' '.join(x), axis = 1)

In [57]:
df.news[0]

'georgia downs two russian warplanes as countries move to brink of war breaking musharraf to be impeached russia today columns of troops roll into south ossetia footage from fighting youtube russian tanks are moving towards the capital of south ossetia which has reportedly been completely destroyed by georgian artillery fire afghan children raped with impunity un official says this is sick a three year old was raped and they do nothing russian tanks have entered south ossetia whilst georgia shoots down two russian jets breaking georgia invades south ossetia russia warned it would intervene on so s side the enemy combatent trials are nothing but a sham salim haman has been sentenced to years but will be kept longer anyway just because they feel like it georgian troops retreat from s osettain capital presumably leaving several hundred people killed video did the us prep georgia for war with russia rice gives green light for israel to attack iran says us has no veto over israeli military 

### Splitting the data into training and test sets

In [58]:
df_train = df[df['Date'] < '2015-01-01']
df_test = df[df['Date'] > '2014-12-31']
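The comparison above works on strings because ISO-formatted dates sort lexicographically. A minimal sketch of the same chronological split with explicit datetime values is shown below; the frame `toy` and the name `cutoff` are hypothetical and only mirror the `Date` and `Label` columns.

```python
import pandas as pd

# Hypothetical miniature frame with the same 'Date' and 'Label' columns.
toy = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-02", "2015-01-05"],
    "Label": [1, 0, 1, 1],
})
toy["Date"] = pd.to_datetime(toy["Date"])

cutoff = pd.Timestamp("2015-01-01")
toy_train = toy[toy["Date"] < cutoff]
toy_test = toy[toy["Date"] >= cutoff]

# Every row lands in exactly one part, and training strictly precedes testing.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["Date"].max() < toy_test["Date"].min()
print(len(toy_train), len(toy_test))
```

Splitting by date rather than at random keeps future headlines out of the training set.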

### LinearSVC classifier

We classify the news with a support vector machine. We start with a plain count vectorization of the feature space, then extend the model with a tf-idf representation, singular value decomposition, and normalization of the vector space.

In CountVectorizer we remove stop words and vectorize single words, bigrams, and trigrams by setting ngram_range = (1,3).
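To make the effect of these settings concrete, here is a small sketch on a made-up sentence. The sentence and the two-word stop list are assumptions chosen for illustration; the notebook itself uses NLTK's English stop-word list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentence and stop-word list chosen for illustration only.
vect = CountVectorizer(stop_words=["the", "of"], analyzer="word", ngram_range=(1, 3))
vect.fit(["the price of oil falls"])

# Stop words are dropped before n-grams are built, so "price oil" becomes a bigram.
features = sorted(vect.vocabulary_)
print(features)
# ['falls', 'oil', 'oil falls', 'price', 'price oil', 'price oil falls']
```

Because stop words are removed first, the resulting n-grams can join words that were not adjacent in the original headline.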

In [98]:
clf = Pipeline([
    ('vect', CountVectorizer(stop_words = stopwords_eng, analyzer = 'word', ngram_range = (1,3))),
    #('tfidf', TfidfTransformer()),
    #('svd', TruncatedSVD(n_components = 150)),
    #('norm', Normalizer() ),
    ('clf', LinearSVC()),
])


clf.fit(df_train.news, df_train.Label)  # train the model on the training set

Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), preprocessor=None,
        stop_words=['i', 'me',...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [99]:
y_predict = clf.predict(df_test.news)  # classifier predictions for the test data

We evaluate classification quality with the F-measure and Accuracy metrics.

In [100]:
#print("Precision: {0:6.2f}".format(precision_score(df_test.Label, y_predict)))
#print("Recall: {0:6.2f}".format(recall_score(df_test.Label, y_predict)))
print("F1-measure: {0:6.2f}".format(f1_score(df_test.Label, y_predict)))
print("Accuracy: {0:6.2f}".format(accuracy_score(df_test.Label, y_predict)))

F1-measure:   0.53
Accuracy:   0.45


The classification quality is very low: accuracy is 0.45, below the 50% mark. Let us add a tf-idf representation.

In [101]:
clf = Pipeline([
    ('vect', CountVectorizer(stop_words = stopwords_eng, analyzer = 'word', ngram_range = (1,3))),
    ('tfidf', TfidfTransformer()),
    #('svd', TruncatedSVD(n_components = 150)),
    #('norm', Normalizer() ),
    ('clf', LinearSVC()),
])


clf.fit(df_train.news, df_train.Label)

y_predict = clf.predict(df_test.news)

print("F1-measure: {0:6.2f}".format(f1_score(df_test.Label, y_predict)))
print("Accuracy: {0:6.2f}".format(accuracy_score(df_test.Label, y_predict)))

F1-measure:   0.63
Accuracy:   0.49


Classification quality improved: the F-measure rose from 0.53 to 0.63, but accuracy (0.49) is still below 50%.
<br>
Let us add a singular value decomposition (TruncatedSVD) and reduce the feature space to 150 components.

In [103]:
clf = Pipeline([
    ('vect', CountVectorizer(stop_words = stopwords_eng, analyzer = 'word', ngram_range = (1,3))),
    ('tfidf', TfidfTransformer()),
    ('svd', TruncatedSVD(n_components = 150)),
    #('norm', Normalizer() ),
    ('clf', LinearSVC()),
])


clf.fit(df_train.news, df_train.Label)

y_predict = clf.predict(df_test.news)

print("F1-measure: {0:6.2f}".format(f1_score(df_test.Label, y_predict)))
print("Accuracy: {0:6.2f}".format(accuracy_score(df_test.Label, y_predict)))

F1-measure:   0.67
Accuracy:   0.51


Applying the singular value decomposition raised the quality a little further (F-measure 0.67, accuracy 0.51), but accuracy remains very low. Let us experiment with the number of SVD components.

In [104]:
clf = Pipeline([
    ('vect', CountVectorizer(stop_words = stopwords_eng, analyzer = 'word', ngram_range = (1,3))),
    ('tfidf', TfidfTransformer()),
    ('svd', TruncatedSVD(n_components = 100)),
    #('norm', Normalizer() ),
    ('clf', LinearSVC()),
])


clf.fit(df_train.news, df_train.Label)

y_predict = clf.predict(df_test.news)

print("F1-measure: {0:6.2f}".format(f1_score(df_test.Label, y_predict)))
print("Accuracy: {0:6.2f}".format(accuracy_score(df_test.Label, y_predict)))

F1-measure:   0.68
Accuracy:   0.52


In [105]:
clf = Pipeline([
    ('vect', CountVectorizer(stop_words = stopwords_eng, analyzer = 'word', ngram_range = (1,3))),
    ('tfidf', TfidfTransformer()),
    ('svd', TruncatedSVD(n_components = 50)),
    #('norm', Normalizer() ),
    ('clf', LinearSVC()),
])


clf.fit(df_train.news, df_train.Label)

y_predict = clf.predict(df_test.news)

print("F1-measure: {0:6.2f}".format(f1_score(df_test.Label, y_predict)))
print("Accuracy: {0:6.2f}".format(accuracy_score(df_test.Label, y_predict)))

F1-measure:   0.67
Accuracy:   0.51


Across the n_components values tried (150, 100, and 50), classification quality stays at roughly the same level. The best result was obtained with n_components = 100 (F-measure 0.68, accuracy 0.52).
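One possible refinement, which the notebook itself does not do, is to choose n_components by cross-validation on the training data only, so that the test set is not used for tuning. The sketch below assumes synthetic documents and labels (`docs`, `labels`), and uses TfidfVectorizer in place of the CountVectorizer plus TfidfTransformer pair to keep the example short.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for the headlines: random "documents" over a toy vocabulary.
rng = np.random.RandomState(0)
vocab = ["w%d" % i for i in range(30)]
docs = [" ".join(rng.choice(vocab, size=8)) for _ in range(60)]
labels = np.arange(60) % 2  # alternating labels keep both classes in every fold

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(random_state=0)),
    ("clf", LinearSVC()),
])

# TimeSeriesSplit always validates on rows that come after the training rows.
search = GridSearchCV(
    pipe,
    param_grid={"svd__n_components": [2, 5, 10]},
    cv=TimeSeriesSplit(n_splits=3),
    scoring="accuracy",
)
search.fit(docs, labels)
print(search.best_params_)
```

On the real data the grid would hold values such as 50, 100, and 150, and the rows would need to be sorted by date for the time-ordered folds to be meaningful.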


### RandomForest

For comparison, let us try an ensemble of decision trees, RandomForestClassifier, with 10 trees (n_estimators = 10) and all other parameters left at their defaults. The remaining pipeline steps keep the same parameters as before.
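A random forest is stochastic, so with random_state left unset its scores can differ between runs. The sketch below uses synthetic data from make_classification in place of the SVD features, and a hypothetical helper `fit_predict`, to show that fixing the seed makes the predictions reproducible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the SVD features; not the notebook's data.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

def fit_predict(seed):
    """Train a 10-tree forest with the given seed and predict the held-out rows."""
    forest = RandomForestClassifier(n_estimators=10, random_state=seed)
    return forest.fit(X[:150], y[:150]).predict(X[150:])

# The same seed reproduces the same forest, hence identical predictions.
same = np.array_equal(fit_predict(42), fit_predict(42))
print(same)  # True
```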

In [107]:
from sklearn.ensemble import RandomForestClassifier

In [115]:
forest = Pipeline([
    ('vect', CountVectorizer(stop_words = stopwords_eng, analyzer = 'word', ngram_range = (1,3))),
    ('tfidf', TfidfTransformer()),
    ('svd', TruncatedSVD(n_components=100)),
    #('norm', Normalizer() ),
    ('clf', RandomForestClassifier(n_estimators = 10))  # n_estimators is the number of trees
])

forest.fit(df_train.news, df_train.Label)

Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), preprocessor=None,
        stop_words=['i', 'me',...imators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))])

In [116]:
predictions = forest.predict(df_test.news)

In [117]:
print("F1-measure: {0:6.2f}".format(f1_score(df_test.Label, predictions)))
print("Accuracy: {0:6.2f}".format(accuracy_score(df_test.Label, predictions)))

F1-measure:   0.67
Accuracy:   0.55


The RandomForest classifier gives practically the same F-measure (0.67, against 0.67 to 0.68 for LinearSVC), while accuracy is slightly higher: 0.55 against 0.52 for the best LinearSVC configuration.
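To put scores like these in context, it can help to compare them with a trivial baseline that always predicts class 1. The labels below are hypothetical (100 days, 52 of them labelled 1) and are not the real test set; they only show what such a baseline scores when the classes are roughly balanced.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 100 days, 52 of them "up" (1). Not the real test set.
y_true = np.array([1] * 52 + [0] * 48)
y_const = np.ones_like(y_true)  # always predict "up"

acc = accuracy_score(y_true, y_const)
f1 = f1_score(y_true, y_const)
print("F1-measure: {0:6.2f}".format(f1))   # 0.68
print("Accuracy: {0:6.2f}".format(acc))    # 0.52
```

With recall equal to 1 and precision equal to the share of positive days p, the baseline's F-measure is 2p / (1 + p). For p near one half this is about 0.67, so an F-measure in that range is not by itself evidence that a model has learned anything from the headlines.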

## Part 3. Other feature extraction methods

Let us try using latent topics as features. Topic modelling identifies the topics of the documents in a collection from a probabilistic distribution of words over topics (each topic generates particular words with certain probabilities).
<br>
<br>
For topic modelling we use LatentDirichletAllocation from sklearn.decomposition. After the LDA model is applied, the features of each document are the latent topics (the degree to which the document belongs to each topic).
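A minimal sketch of what LDA features look like is given below. The four-document corpus is made up, and two choices here are assumptions that differ from the notebook: the example feeds raw counts from CountVectorizer rather than tf-idf weights, and it uses the n_components argument, which is the name of the topic-count parameter in current scikit-learn releases.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration only.
docs = [
    "oil price falls market",
    "market stocks oil price",
    "troops border war conflict",
    "war conflict troops ceasefire",
]
counts = CountVectorizer().fit_transform(docs)

# n_components is the number of topics in current scikit-learn releases.
lda_toy = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda_toy.fit_transform(counts)

print(doc_topics.shape)        # one row per document, one column per topic
print(doc_topics.sum(axis=1))  # each row is a probability distribution over topics
```

Each row of `doc_topics` can be passed to a classifier as the feature vector of the corresponding document.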

In [154]:
from sklearn.decomposition import LatentDirichletAllocation

In [156]:
# helper that prints the top words of each topic (taken from the sklearn website)
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

We vectorize the texts using a tf-idf representation.

In [183]:
tf_vect = TfidfVectorizer(stop_words = stopwords_eng)

In [184]:
tf_train = tf_vect.fit_transform(df_train.news)

In [185]:
tf_test = tf_vect.transform(df_test.news)

The LDA model (the parameter n_topics=100 sets the number of latent topics to extract from the corpus; scikit-learn 0.19 and later name this parameter n_components):

In [197]:
lda = LatentDirichletAllocation(n_topics=100, max_iter=5, learning_method='online', learning_offset=50., random_state=0)

In [198]:
lda_train = lda.fit_transform(tf_train)  # fit the model

Let us print the 5 most significant words for each topic:

In [199]:
tf_feature_names = tf_vect.get_feature_names()
n_top_words = 5
print_top_words(lda, tf_feature_names, n_top_words)

Topic #0: bnp mint libya unthinkable identified
Topic #1: israel gaza bouba plummet swastikas
Topic #2: ruled baku unique condemnable acronym
Topic #3: us police butter china government
Topic #4: outside arrows meinhof neonicotinoids repeated
Topic #5: debt germany iran killed afghanistan
Topic #6: minh hardball goldman juifs humane
Topic #7: georgia swine remote status subway
Topic #8: hamas grandes decade dubai egypt
Topic #9: libya china heaven paternity breivik
Topic #10: headhunters congo implosion advising authorise
Topic #11: mubarak world israel mubaraks us
Topic #12: tank bahari redevelopment men san
Topic #13: oslo minarets baruch communists ghanam
Topic #14: underdog heed downers pyromaniacs overcrowding
Topic #15: bloodbath vitamins half assignment choping
Topic #16: libya war protest police china
Topic #17: today somewhere iran mentality consistent
Topic #18: assange arabias ratenumber iranian beings
Topic #19: floyd fizzes stallman globish musk
Topic #20: us israeli gaza 

Many of the topics can be interpreted from their most significant words, for example Topic #1 (israel, gaza) and Topic #11 (mubarak, israel).

In [201]:
lda_test = lda.transform(tf_test)  # map the test data into the new feature space

Let us run LinearSVC classification on the resulting features:

In [204]:
clf = LinearSVC()

clf.fit(lda_train, df_train.Label)
y_predict = clf.predict(lda_test)

print("F1-measure: {0:6.2f}".format(f1_score(df_test.Label, y_predict)))
print("Accuracy: {0:6.2f}".format(accuracy_score(df_test.Label, y_predict)))

F1-measure:   0.67
Accuracy:   0.51


Using latent topics as features did not improve classification quality: it stayed at the same level (F-measure 0.67, accuracy 0.51). Topic models nevertheless leave room for improvement: the LDA model could be tuned during training, or other topic models better suited to collections of short texts could be used, for example Biterm Topic Modelling.