This project explores the image of Ukraine in the media and automates the annotation of articles about Ukraine. I will be using reports from the OKO project system, which collects all English-language articles mentioning Ukraine, globally, 24 hours a day, 7 days a week. For each article I have the title, source, date of publication, social impact (likes and shares on FB) and assigned category. There are 20 categories, chosen by the editors to cover most topics in the media. Assigning a category to an article is a repetitive and time-consuming task, so I will build a model that predicts the category based on past data. I will also look for trends: how interest in certain topics changes over time, which topics were the most popular, which articles were the most popular, and so on.
The training data has categories assigned by the editors. Given the large number of articles, I assume some of them will have an incorrectly assigned category. Another risk is that a new phenomenon may arise that is not covered by the list of categories. For example, the crash of the MH17 plane prompted the editors to create a category that didn't exist before; the event is now history, but articles will be written about it for at least the next few years.
The observations and the predictive model will be presented to the OKO project team, to share the insights and to automate part of their process so that more of their time can be spent on special reports.


In [68]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV
import math

In [69]:
## I'm having problems scraping the data from the website. Until I figure out how to scrape a sufficient
## amount of data, I will use the csv option on the website and download csv reports for 12 days from July 2015.
df1 = pd.read_csv('./assets/report (2).csv')
df1.loc[df1['topic_name'] == 'TOPIC OF THE DAY', 'topic_name'] = 'Ukrainian economy/industry'
len(df1)

271

In [70]:
df2 = pd.read_csv('./assets/report (3).csv')
df3 = pd.read_csv('./assets/report (4).csv')
df4 = pd.read_csv('./assets/report (5).csv')
df5 = pd.read_csv('./assets/report (6).csv')

In [71]:
df6 = pd.read_csv('./assets/7.csv')

In [72]:
## One of the categories assigned to the articles is 'Topic of the day': if there is breaking news, or a certain topic
## is covered in most articles for the day, it is assigned 'Topic of the day'. I have to recode this label to the
## actual topic for each day that has it to be able to run my analysis.
df7 = pd.read_csv('./assets/8.csv')
df7.loc[df7['topic_name'] == 'TOPIC OF THE DAY', 'topic_name'] = 'Ukrainian government and society'

In [73]:
df8 = pd.read_csv('./assets/9.csv')
df8.loc[df8['topic_name'] == 'TOPIC OF THE DAY', 'topic_name'] = 'MH-17 crash'

In [74]:
df9 = pd.read_csv('./assets/10.csv')
df9.loc[df9['topic_name'] == 'TOPIC OF THE DAY', 'topic_name'] = 'MH-17 crash'

In [75]:
df10 = pd.read_csv('./assets/11.csv')

In [76]:
df11 = pd.read_csv('./assets/12.csv')
df11.loc[df11['topic_name'] == 'TOPIC OF THE DAY', 'topic_name'] = 'Peace talks'

In [77]:
df12 = pd.read_csv('./assets/13.csv')
df12.loc[df12['topic_name'] == 'TOPIC OF THE DAY', 'topic_name'] = 'MH-17 crash'

In [78]:
df = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12])
len(df)

3328
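
The per-file recoding above could also be driven by a single list of (report, replacement) pairs. The following is a minimal sketch rather than the notebook's actual code: the helper name `recode_topic_of_day` and the tiny in-memory frames are assumptions made for illustration, and in practice each frame would come from `pd.read_csv`.

```python
import pandas as pd

PLACEHOLDER = 'TOPIC OF THE DAY'


def recode_topic_of_day(frame, real_topic):
    """Return a copy of `frame` with the placeholder topic replaced by `real_topic`.

    Hypothetical helper; pass real_topic=None for days without a topic of the day.
    """
    frame = frame.copy()
    if real_topic is not None:
        frame.loc[frame['topic_name'] == PLACEHOLDER, 'topic_name'] = real_topic
    return frame


# Toy stand-ins for two daily reports (illustrative data only)
day_a = pd.DataFrame({'title': ['a1', 'a2'], 'topic_name': [PLACEHOLDER, 'Sport']})
day_b = pd.DataFrame({'title': ['b1'], 'topic_name': ['Crimea']})

# One (frame, replacement) pair per day
daily_reports = [(day_a, 'MH-17 crash'), (day_b, None)]

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat(
    [recode_topic_of_day(frame, topic) for frame, topic in daily_reports],
    ignore_index=True,
)
print(combined['topic_name'].tolist())
```

Keeping the day-to-topic mapping in one place makes it easier to review which topic was the topic of the day for each report, and the original frames are left untouched.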

In [79]:
df.reset_index(drop = True, inplace=True)


In [81]:
df.columns

Index([u'title', u'author', u'publisher_url', u'pub_week', u'pub_day',
       u'pub_month', u'pub_daytime', u'language', u'language_id', u'sentiment',
       u'social_fb', u'social_tw', u'impact_score', u'topic_name',
       u'topic_name_en', u'topic_name_ua', u'page_url', u'is_today_utc',
       u'is_lastnewscycle_utc', u'is_last13weeks_utc', u'pubday_pages',
       u'pubday_annotations', u'is_main_topic', u'is_about_ukraine',
       u'is_biased_publisher', u'is_pro_putin', u'hq_city', u'hq_country',
       u'is_ukr_press', u'comments', u'editor'],
      dtype='object')

In [82]:
## Dropping some columns for now. No author is specified for the majority of the articles, so I'm getting rid of this
## feature for now. Later I'd like to run a test on only the data that has an author indicated, to see whether authors
## stick to writing in one or a few categories and whether the author's name could be a predictor for the topic of the article.
df.drop('author', inplace=True, axis=1)

In [83]:
## The 'editor' column contains the names of the editors and won't help in analysis or predictions.
## Also getting rid of 'language' and 'language_id': I am only working with the English-language section, so these
## are redundant. 'comments', 'topic_name_en' and 'topic_name_ua' are dropped as well.
df.drop(['comments', 'editor', 'language', 'language_id', 'topic_name_en', 'topic_name_ua'],
        axis=1, inplace=True)


In [84]:
df = df.dropna()

In [85]:
len(df)

3317

In [86]:
df.columns

Index([u'title', u'publisher_url', u'pub_week', u'pub_day', u'pub_month',
       u'pub_daytime', u'sentiment', u'social_fb', u'social_tw',
       u'impact_score', u'topic_name', u'page_url', u'is_today_utc',
       u'is_lastnewscycle_utc', u'is_last13weeks_utc', u'pubday_pages',
       u'pubday_annotations', u'is_main_topic', u'is_about_ukraine',
       u'is_biased_publisher', u'is_pro_putin', u'hq_city', u'hq_country',
       u'is_ukr_press'],
      dtype='object')

In [87]:
# Every item needs a trailing comma: adjacent string literals without one are silently concatenated
categories = ['Crimea', 'Culture and history', 'Fighting in Eastern Ukraine', 'Global discussions about Ukraine',
              'Humanitarian crisis in Eastern Ukraine', 'Life in DPR/LPR and near-front zone', 'Mentioning of Ukraine',
              'MH-17 crash', 'Military/humanitarian aid for Ukraine', 'Not related to Ukraine', 'Other', 'Peace talks',
              'Prisoners of war', 'Sanctions against Russia', 'Sport', 'Ukrainian economy/industry',
              'Ukrainian government and society', 'Ukrainian international relations', 'Emergencies in Ukraine',
              'Russian “humanitarian” aid']


In [88]:
df

Unnamed: 0,title,publisher_url,pub_week,pub_day,pub_month,pub_daytime,sentiment,social_fb,social_tw,impact_score,...,is_last13weeks_utc,pubday_pages,pubday_annotations,is_main_topic,is_about_ukraine,is_biased_publisher,is_pro_putin,hq_city,hq_country,is_ukr_press
0,Dutch Safety Board Has Draft Report Into MH17 ...,www.nytimes.com,2015-26,2015-07-01,2015-07,01 Jul 15 00:00 UTC,neutral,16,11,27,...,0,0,0,1,1,0,0,Unknown,Unknown,0
1,"Poland to support, develop industrial, coal mi...",philstar.com,2015-26,2015-07-01,2015-07,01 Jul 15 00:04 UTC,neutral,1,5,6,...,0,0,0,1,0,0,0,Unknown,Unknown,0
2,Talks Collapse; Ukraine Halts Purchases of Rus...,voanews.com,2015-26,2015-07-01,2015-07,01 Jul 15 00:11 UTC,neutral,16,50,66,...,0,0,0,1,1,0,0,"Washington, DC",USA,0
3,Fight for Freedom - US vs Donbass,english.pravda.ru,2015-26,2015-07-01,2015-07,01 Jul 15 00:13 UTC,neutral,65,20,85,...,0,0,0,1,1,0,0,Moscow,Russia,0
4,MH17: Dutch Minister says last two victims' re...,themalaymailonline.com,2015-26,2015-07-01,2015-07,01 Jul 15 00:51 UTC,negative,8,8,16,...,0,0,0,1,1,0,0,Selangor,Malaysia,0
5,Two MH17 victims' remains 'unlikely' to be fou...,theborneopost.com,2015-26,2015-07-01,2015-07,01 Jul 15 01:20 UTC,negative,22,7,29,...,0,0,0,1,1,0,0,Unknown,Unknown,0
6,"MH17: Many possibilities, no definite suspects...",news.asiaone.com,2015-26,2015-07-01,2015-07,01 Jul 15 01:37 UTC,neutral,2,2,4,...,0,0,0,1,1,0,0,Singapore,Singapore,0
7,This Day in Jewish History / The father of Fel...,haaretz.com,2015-26,2015-07-01,2015-07,01 Jul 15 01:38 UTC,neutral,440,20,460,...,0,0,0,1,1,0,0,Unknown,Unknown,0
8,Rostec announces MiG engines repaired without ...,janes.com,2015-26,2015-07-01,2015-07,01 Jul 15 01:57 UTC,neutral,2,6,8,...,0,0,0,1,1,0,0,Unknown,Unknown,0
9,Who Is Next to Lead America's Enemies List?,alternet.org,2015-26,2015-07-01,2015-07,01 Jul 15 02:07 UTC,neutral,317,32,349,...,0,0,0,1,0,0,0,Unknown,Unknown,0


In [89]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3317 entries, 0 to 3327
Data columns (total 24 columns):
title                   3317 non-null object
publisher_url           3317 non-null object
pub_week                3317 non-null object
pub_day                 3317 non-null object
pub_month               3317 non-null object
pub_daytime             3317 non-null object
sentiment               3317 non-null object
social_fb               3317 non-null int64
social_tw               3317 non-null int64
impact_score            3317 non-null int64
topic_name              3317 non-null object
page_url                3317 non-null object
is_today_utc            3317 non-null int64
is_lastnewscycle_utc    3317 non-null int64
is_last13weeks_utc      3317 non-null int64
pubday_pages            3317 non-null int64
pubday_annotations      3317 non-null int64
is_main_topic           3317 non-null int64
is_about_ukraine        3317 non-null int64
is_biased_publisher     3317 non-null int64
is_p

In [90]:
# Naturally, two big Ukrainian English-language media outlets come out on top in reporting about Ukraine. They are
# followed by the Russian outlet Sputnik, then another Ukrainian channel, Yahoo and then The Moscow Times. Besides
# Eastern European media, International Business Times, headquartered in NY, seems to report on Ukraine a lot.
# So we have Ukraine talking about itself, then Russia and the US talking about Ukraine, followed by an assortment
# of UK and other US media.
df['publisher_url'].value_counts().head(30)

kyivpost.com                    385
en.interfax.com.ua              136
sputniknews.com                 129
uatoday.tv                      115
news.yahoo.com                   62
themoscowtimes.com               55
rferl.org                        44
ibtimes.com                      44
rt.com                           43
reuters.com                      33
channelnewsasia.com              27
dailymail.co.uk                  27
english.pravda.ru                21
uk.reuters.com                   20
bbc.co.uk                        19
bloomberg.com                    19
businessinsider.com              19
dailystar.com.lb                 18
voanews.com                      18
nytimes.com                      18
wsj.com                          17
euronews.com                     17
uk.news.yahoo.com                16
upi.com                          16
theguardian.com                  16
nrcu.gov.ua                      16
economictimes.indiatimes.com     16
rbth.com                    

In [92]:
df['social_fb'].sort_values(ascending=False).head(10)

2631    92808
2064    31288
2864    25055
2873    24999
652     20880
83      16053
642     15683
1828    12572
1753    12564
1740    12562
Name: social_fb, dtype: int64

In [93]:
# The article with the highest social impact in my dataset is in the category "Not related to Ukraine".
# It looks like a sports article. Such articles can mention Ukraine as the country hosting a game or as a player's
# home country; a mention like that is considered insignificant, so the article is put into "Not related to Ukraine".
df[df['social_fb'] == 92808][['title', 'social_fb', 'topic_name']]


Unnamed: 0,title,social_fb,topic_name
2631,Robin van Persie: No honest chance of Man Unit...,92808,Not related to Ukraine


In [94]:
# Most popular articles on those days (by Facebook likes and shares)
df[df['social_fb'] >12000][['title', 'social_fb', 'topic_name']]

Unnamed: 0,title,social_fb,topic_name
83,Bayern Munich sign Brazilian Douglas Costa fro...,16053,Sport
642,Putin Sends Obama an Independence Day Message,15683,Not related to Ukraine
652,Europe's security organization says Kiev viola...,20880,Peace talks
1740,First on CNN: Sources say MH17 report blames p...,12562,Fighting in Eastern Ukraine
1753,First on CNN: Sources say MH17 report blames p...,12564,Fighting in Eastern Ukraine
1828,First on CNN: Sources say MH17 report blames p...,12572,Fighting in Eastern Ukraine
2064,Champions League third qualifying round draw,31288,Fighting in Eastern Ukraine
2631,Robin van Persie: No honest chance of Man Unit...,92808,Not related to Ukraine
2864,July Fourth message not the first from Russian...,25055,Not related to Ukraine
2873,July Fourth message not the first from Russian...,24999,Not related to Ukraine


In [None]:
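# .copy() makes `data` an independent frame, so adding columns later does not trigger SettingWithCopyWarning
data = df.loc[:, ['title', 'topic_name']].copy()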
len(data)

In [None]:
# Keep only the rows whose topic is one of the known categories (left disabled for now):
# data = data[data['topic_name'].isin(categories)]
    

In [None]:
categories_dict = {'Crimea' :1, 'Culture and history' :2, 'Fighting in Eastern Ukraine':3, \
                   'Global discussions about Ukraine':4, 'Humanitarian crisis in Eastern Ukraine':5, \
                   'Life in DPR/LPR and near-front zone':6, 'Mentioning of Ukraine':7, \
                   'MH-17 crash':8,  'Military/humanitarian aid for Ukraine':9, 'Not related to Ukraine':10, \
                   'Other':11, 'Peace talks':12, 'Prisoners of war':13, 'Sanctions against Russia':14, 'Sport':15, \
                   'Ukrainian economy/industry':16, 'Ukrainian government and society':17, \
                   'Ukrainian international relations' :18, 'Emergencies in Ukraine':19, \
                  'Russian “humanitarian” aid':20}
    
data['c_label'] = data['topic_name'].apply(lambda x: categories_dict[x])
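
Because a new topic can appear that is not in the list of categories, it may help to find such rows before labelling. The sketch below is only an illustration under assumptions: `known_labels` is a shortened stand-in for `categories_dict`, and the small `titles` frame is made-up data. It relies on `Series.map` returning NaN for keys that are missing from the dictionary.

```python
import pandas as pd

# Shortened, illustrative mapping; the notebook's categories_dict has 20 entries
known_labels = {'Crimea': 1, 'Peace talks': 12, 'Sport': 15}

# Made-up rows, one of which carries a topic that is not in the mapping
titles = pd.DataFrame({
    'title': ['t1', 't2', 't3', 't4'],
    'topic_name': ['Crimea', 'Sport', 'Brand new topic', 'Peace talks'],
})

# Series.map yields NaN for any topic that is missing from the dict
labels = titles['topic_name'].map(known_labels)
unknown_mask = labels.isnull()

# Report the topics that still need a category
unknown_topics = sorted(titles.loc[unknown_mask, 'topic_name'].unique())
print('Topics without a label: {}'.format(unknown_topics))

# Keep only the rows that could be labelled
labelled = titles[~unknown_mask].copy()
labelled['c_label'] = labels[~unknown_mask].astype(int)
```

Unlike a plain dictionary lookup, this does not stop at the first unknown topic, so all of the missing categories can be reviewed at once.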

In [None]:
data.head(2)

In [None]:
X = data['title']

In [None]:
# Using CountVectorizer, I will count the number of times each word appears in the titles, removing stop words.
# After that, I'll have a list of the most frequent words for each of my 20 categories and make them my features.
# With that feature set I will be ready to start testing the models. I am planning on trying Logistic Regression and
# Random Forest.
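
To make the idea of per-category word counts concrete, here is a small standard-library sketch. It only approximates what CountVectorizer does (lowercasing, tokens of two or more characters, stop-word removal). The stop list, the helper names `tokenize` and `top_words_by_category`, and the example titles are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import re

# Tiny illustrative stop list; scikit-learn's built-in English list is much longer
STOP_WORDS = {'the', 'in', 'of', 'to', 'for', 'on'}


def tokenize(title):
    """Lowercase a title and keep alphanumeric tokens of two or more characters."""
    tokens = re.findall(r'[a-z0-9]{2,}', title.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def top_words_by_category(rows, n=3):
    """Count words per category and return the n most frequent ones for each."""
    counts = defaultdict(Counter)
    for title, category in rows:
        counts[category].update(tokenize(title))
    return {category: counter.most_common(n) for category, counter in counts.items()}


# Made-up (title, category) pairs
rows = [
    ('Talks on gas prices collapse', 'Peace talks'),
    ('Peace talks resume in Minsk', 'Peace talks'),
    ('MH17 report due in October', 'MH-17 crash'),
    ('Dutch board drafts MH17 report', 'MH-17 crash'),
]

top = top_words_by_category(rows, n=2)
for category, words in top.items():
    print(category, words)
```

Words that are frequent in one category but rare in the others are the ones likely to be useful as features.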

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cvec = CountVectorizer()
cvec.fit(X)

In [None]:
len(cvec.get_feature_names())

In [None]:
cvec = CountVectorizer(stop_words='english')
cvec.fit(X)
len(cvec.get_feature_names())

In [None]:
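# Reuse the index of X so the word-count rows stay aligned with the labels in data['c_label']
X_train = pd.DataFrame(cvec.transform(X).toarray(), columns=cvec.get_feature_names(), index=X.index)
X_train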

In [None]:
big_X = pd.concat([X_train, data['c_label']], axis = 1)

In [None]:
big_X.head()

In [None]:
word_counts = X_train.sum(axis=0)
word_counts.sort_values(ascending = False).head(30)

In [None]:
y_train = data['c_label']
y_train

In [None]:
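common_words = []
# Labels in categories_dict start at 1, so iterate over (category, label) pairs instead of list positions
for category, label in sorted(categories_dict.items(), key=lambda item: item[1]):
    # .values gives a positional boolean mask, so the selection does not depend on index alignment
    word_count = X_train[(y_train == label).values].sum()
    print('{} most common words'.format(category))
    cw = word_count.sort_values(ascending=False).head(20)

    print(cw)
    common_words.extend(cw.index)
    print('')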
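
As a rough sketch of the planned Logistic Regression step, the vectorizer and the classifier can be chained in a scikit-learn pipeline and scored with cross-validation. The titles and labels below are made up for illustration (labels 8 and 15 follow the numbering in `categories_dict`); the real inputs would be `data['title']` and `data['c_label']`, and the settings shown are assumptions, not tuned choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up titles: label 8 stands for 'MH-17 crash', label 15 for 'Sport'
titles = [
    'MH17 crash report blames rebels',
    'Dutch board drafts MH17 crash report',
    'Remains of MH17 victims unlikely to be found',
    'Investigators visit MH17 crash site',
    'Bayern Munich sign Brazilian winger',
    'Champions League qualifying round draw',
    'Shakhtar win league match',
    'Striker joins Champions League club',
]
labels = [8, 8, 8, 8, 15, 15, 15, 15]

# Putting the vectorizer inside the pipeline means each cross-validation fold
# builds its vocabulary from its own training titles only
model = make_pipeline(
    CountVectorizer(stop_words='english'),
    LogisticRegression(max_iter=1000),
)

# Two folds because the toy data set is tiny; the real data would allow more
scores = cross_val_score(model, titles, labels, cv=2)
print('Fold accuracies: {}'.format(scores))

# Fit on all titles and classify a new, unseen title
model.fit(titles, labels)
predicted = model.predict(['MH17 crash report published'])
print('Predicted label: {}'.format(predicted[0]))
```

With only a handful of toy titles the scores mean little; the point is the shape of the workflow, which should carry over to the full set of titles and 20 categories.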