# 1.2 Saving dataframes into pickle files.

>In this section, we will be using the more restrictive text pre-processing:

>* Porter stemming.
>* Stop-word removal.

>We will be using scikit-learn's multinomial Naive Bayes classifier (`MultinomialNB`).

We have two possible labels:

`relevance` and `positivity`.

**Pros of using `relevance`:**

* There are no missing values. Of the 8,000 datapoints, only 9 are coded as `not sure`; all the others are `yes` or `no`. I will reclassify `not sure` as `no`.

* This is a binary classification problem.

**Cons of using `relevance`:**

* Classes are imbalanced: roughly 82% `no` versus 18% `yes` once `not sure` is recoded.

**Pros of using `positivity`:**

* This is a multiclass classification problem.

**Cons of using `positivity`:**

* Only 1,420 of the 8,000 datapoints are labeled.

* Imputing missing values with the mean introduces bias.

* There is no principled way to recode the missing values, because the scores rest on personal judgments.
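To see why mean imputation is a poor fit for `positivity`, consider a minimal sketch on made-up scores (the values below are illustrative, not taken from the dataset). Filling the gaps with the mean leaves the average unchanged, but it piles the filled rows onto a single value and shrinks the spread:

```python
import numpy as np
import pandas as pd

# Toy scores on the 2-9 scale; NaN marks an unlabeled article (illustrative values only).
scores = pd.Series([3.0, 7.0, np.nan, np.nan, np.nan, 4.0, 8.0, np.nan])

# Replace every missing score with the mean of the observed ones.
imputed = scores.fillna(scores.mean())

observed_std = scores.std()   # pandas skips NaN by default
imputed_std = imputed.std()   # smaller: the filled rows add no variation
share_at_mean = (imputed == scores.mean()).mean()

print(f"std before: {observed_std:.2f}, std after: {imputed_std:.2f}")
print(f"share of rows sitting exactly at the mean: {share_at_mean:.0%}")
```

In the real data the effect would be stronger, since 6,580 of the 8,000 rows would all receive the same imputed value.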



In [1]:
%reset -fs

In [2]:
import matplotlib.pyplot as plt
%matplotlib inline
import string
import re
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
import nltk
import nlp_ml_functions
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import label_binarize, MultiLabelBinarizer, binarize
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve, mean_squared_error, r2_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
sns.set_style('white')

## Using `relevance` as labels.

### Loading dataset into a pandas dataframe.

In [3]:
economic_df = pd.read_csv('Full-Economic-News-DFE-839861.csv', encoding='utf-8')

#### Creating a list of new column names.

In [4]:
new_column_names = ['unit_id', 'golden', 'unit_state', 'trusted_judgments', 'last_judgment_at','positivity', 'positivity_confidence', 'relevance', 'relevance_confidence', 'article_id', 'article_date', 'article_headline', 'positivity_gold', 'relevance_gold', 'article_text']

#### Renaming columns of dataframe.

In [5]:
economic_df.columns = new_column_names

In [6]:
economic_df.shape

(8000, 15)

In [7]:
economic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 15 columns):
unit_id                  8000 non-null int64
golden                   8000 non-null bool
unit_state               8000 non-null object
trusted_judgments        8000 non-null int64
last_judgment_at         8000 non-null object
positivity               1420 non-null float64
positivity_confidence    3775 non-null float64
relevance                8000 non-null object
relevance_confidence     8000 non-null float64
article_id               8000 non-null object
article_date             8000 non-null object
article_headline         8000 non-null object
positivity_gold          0 non-null float64
relevance_gold           0 non-null float64
article_text             8000 non-null object
dtypes: bool(1), float64(5), int64(2), object(7)
memory usage: 882.9+ KB


### Clean up.

In [8]:
economic_df['article_headline'] = economic_df['article_headline'].apply(nlp_ml_functions.strip_tags)

In [9]:
economic_df['article_text'] = economic_df['article_text'].apply(nlp_ml_functions.strip_tags)

In [10]:
economic_df['article_headline'] = economic_df['article_headline'].apply(nlp_ml_functions.clean_up_article)

In [11]:
economic_df['article_text'] = economic_df['article_text'].apply(nlp_ml_functions.clean_up_article)
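The helpers `strip_tags` and `clean_up_article` live in the author's `nlp_ml_functions` module, whose source is not shown here. As a rough idea of what such helpers might do, here is a hypothetical sketch (the function names and regular expressions are assumptions, not the module's actual code) that removes HTML tags and then collapses punctuation into single spaces:

```python
import re

def strip_tags_sketch(text):
    """Replace anything that looks like an HTML tag (e.g. <br> or </p>) with a space."""
    return re.sub(r'<[^>]+>', ' ', text)

def clean_up_sketch(text):
    """Keep letters and digits only; every run of other characters becomes one space."""
    return re.sub(r'[^A-Za-z0-9]+', ' ', text).strip()

raw = "NEW YORK -- Yields fell 0.5%.</br>"
cleaned = clean_up_sketch(strip_tags_sketch(raw))
print(cleaned)
```

This kind of cleaning would be consistent with the output below, where numbers such as `6.2` appear as `6 2` and `&amp;` appears as `amp`, but the real helpers may differ.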

In [12]:
economic_df.head()

Unnamed: 0,unit_id,golden,unit_state,trusted_judgments,last_judgment_at,positivity,positivity_confidence,relevance,relevance_confidence,article_id,article_date,article_headline,positivity_gold,relevance_gold,article_text
0,842613455,False,finalized,3,12/5/2015 17:48:27,3.0,0.64,yes,0.64,wsj_398217788,1991-08-14,Yields on CDs Fell in the Latest Week,,,NEW YORK Yields on most certificates of deposi...
1,842613456,False,finalized,3,12/5/2015 16:54:25,,,no,1.0,wsj_399019502,2007-08-21,The Morning Brief White House Seeks to Limit C...,,,The Wall Street Journal OnlineThe Morning Brie...
2,842613457,False,finalized,3,12/5/2015 01:59:03,,,no,1.0,wsj_398284048,1991-11-14,Banking Bill Negotiators Set Compromise Plan t...,,,WASHINGTON In an effort to achieve banking ref...
3,842613458,False,finalized,3,12/5/2015 02:19:39,,0.0,no,0.675,wsj_397959018,1986-06-16,Managers Journal Sniffing Out Drug Abusers Is ...,,,The statistics on the enormous costs of employ...
4,842613459,False,finalized,3,12/5/2015 17:48:27,3.0,0.3257,yes,0.64,wsj_398838054,2002-10-04,Currency Trading Dollar Remains in Tight Range...,,,NEW YORK Indecision marked the dollars tone as...


In [13]:
economic_df.relevance.value_counts()

no          6571
yes         1420
not sure       9
Name: relevance, dtype: int64

#### Converting variable `relevance` to numerical values and coding `not sure` as `no`.

In [14]:
economic_df['relevance'] = economic_df['relevance'].apply(lambda x: 1 if x == 'yes' else 0)

In [15]:
economic_df.relevance.value_counts()

0    6580
1    1420
Name: relevance, dtype: int64
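The lambda above sends every value other than `yes` to `0`, which would also silently absorb typos or unexpected labels. A slightly more defensive sketch uses an explicit mapping, so that any value missing from the dictionary surfaces as `NaN` instead (the toy series and the name `label_map` are illustrative):

```python
import pandas as pd

# Toy labels covering the three values seen in the dataset.
relevance = pd.Series(['yes', 'no', 'not sure', 'no', 'yes'])

# Explicit mapping: 'not sure' is deliberately folded into the 'no' class.
label_map = {'yes': 1, 'no': 0, 'not sure': 0}

encoded = relevance.map(label_map)

# Values absent from label_map become NaN, so unexpected labels are easy to spot.
unmapped = int(encoded.isna().sum())
if unmapped:
    raise ValueError(f"{unmapped} labels are not covered by label_map")

encoded = encoded.astype(int)
print(encoded.tolist())
```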

#### The labels/classes are clearly imbalanced.

In [16]:
economic_df.relevance.value_counts()*100/len(economic_df.relevance)

0    82.25
1    17.75
Name: relevance, dtype: float64
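The same percentages can be obtained with `value_counts(normalize=True)`, which returns class shares directly. A small sketch that rebuilds the label column from the counts reported above:

```python
import pandas as pd

# Rebuild a label column with the class counts shown above (6,580 zeros, 1,420 ones).
labels = pd.Series([0] * 6580 + [1] * 1420, name='relevance')

# normalize=True returns fractions; multiply by 100 for percentages.
shares = labels.value_counts(normalize=True) * 100
print(shares.round(2))
```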

>We will experiment with building models on the imbalanced data above. We will also build a new dataframe with 50% of each class.

We will sample 1,200 rows from the majority class without replacement, and another 1,200 from the minority class by drawing 300 rows with replacement four times.

#### Example of getting 1200 samples of relevance `0`.

In [17]:
economic_df[economic_df['relevance'] == 0].sample(n=1200)

Unnamed: 0,unit_id,golden,unit_state,trusted_judgments,last_judgment_at,positivity,positivity_confidence,relevance,relevance_confidence,article_id,article_date,article_headline,positivity_gold,relevance_gold,article_text
1908,842615368,False,finalized,3,12/4/2015 23:09:23,,,0,1.0000,wsj_398850325,2001-11-08,Insurers Brace for Disability Claims Tied to S...,,,Why is MetLife Inc a New York based company wi...
5305,830982935,False,finalized,3,11/18/2015 07:40:43,,,0,1.0000,wapo_410307122,2009-05-12,Debate Over Jobless Benefits Resonates Near an...,,,As Virginia gears up for a governors campaign ...
4369,830981998,False,finalized,3,11/18/2015 10:23:15,,0.0,0,0.6671,wapo_148569003,1955-04-29,Business Outlook Reserve Board Doesn t Discour...,,,WALL STREET S such a contrary place The Federa...
2630,842616091,False,finalized,3,12/5/2015 00:15:45,,,0,1.0000,wsj_398173045,1990-01-08,When You Own Your Own Abode Bad Real Estate Ne...,,,It is every homeowners worst nightmare that dr...
2096,842615556,False,finalized,3,12/5/2015 08:24:48,,0.0,0,0.6964,wsj_398777737,2001-12-11,Recession Is Over a Few Bulls Say Bucking Fiel...,,,Corrections amp AmplificationsA SURVEY by Blue...
4258,830981887,False,finalized,3,11/18/2015 12:21:12,,,0,1.0000,wapo_408370674,1998-05-22,Greenspan Sees Signs Of Asian Ripple Effect Co...,,,Federal Reserve Chairman Alan Greenspan said y...
3062,842616524,False,finalized,3,12/5/2015 10:15:58,,,0,1.0000,wsj_398025204,1985-08-29,Fed Examiners Sent to Audit Maryland S amp L C...,,,WASHINGTON The Federal Reserve Board sent exam...
5049,830982679,False,finalized,3,11/18/2015 11:16:11,,,0,1.0000,wapo_410279607,2008-08-29,Stocks Roar Ahead as Report Shows Economic Muscle,,,Wall Street barreled higher Thursday after a b...
2806,842616268,False,finalized,3,12/5/2015 00:04:18,,,0,1.0000,wsj_398661240,1999-05-19,Argentinas Peso Dollar Peg Is a Drag On Effort...,,,BUENOS AIRES Four months after Brazils devalua...
5237,830982867,False,finalized,3,11/17/2015 23:40:41,,0.0,0,0.6855,wapo_409597829,2004-02-13,A Voice Of Trade Rebellion,,,Sen John Edwards who has had perfect rhetorica...


#### Example of getting 300 samples of relevance `1` with replacement.

In [18]:
economic_df[economic_df['relevance'] == 1].sample(n=300, replace=True)

Unnamed: 0,unit_id,golden,unit_state,trusted_judgments,last_judgment_at,positivity,positivity_confidence,relevance,relevance_confidence,article_id,article_date,article_headline,positivity_gold,relevance_gold,article_text
3395,842616858,False,finalized,3,12/5/2015 19:16:31,3.0,0.3442,1,0.6642,wsj_398182420,1989-01-04,New Year for Investors Has a Rocky Beginning,,,NEW YORK Stocks bonds and the dollar all stumb...
5292,830982922,False,finalized,3,11/17/2015 20:59:52,7.0,0.6589,1,1.0000,wapo_408354468,1998-02-11,Growth to Slow After Strong 97 Says Panel,,,The annual report of the presidents Council of...
7825,830985462,False,finalized,3,11/17/2015 20:56:30,7.0,0.3333,1,0.6449,wapo_148216252,1972-01-28,Market Climbs By 10 68 Good News Ends Slump Of...,,,NEW YORK Jan 27 AP The stock market made a spi...
238,842613695,False,finalized,3,12/5/2015 03:23:24,7.0,0.3694,1,0.6937,wsj_399112458,2009-01-28,Financials Stage a Comeback As Dow Banks 58 70...,,,A bounce for beaten down financial stocks push...
3839,842617303,False,finalized,3,12/5/2015 04:55:18,4.0,1.0000,1,1.0000,wsj_398284733,1991-04-24,Durable Goods Orders Plunged 6 2 in March Thir...,,,WASHINGTON A dive in orders for big ticket fac...
2667,842616128,False,finalized,3,12/5/2015 01:14:54,6.0,0.6838,1,1.0000,wsj_398684242,1999-06-02,Dollar Eases Against Yen and Euro On Speculati...,,,NEW YORK The dollar fell against the yen and e...
2774,842616235,False,finalized,3,12/5/2015 18:34:35,4.0,0.3391,1,0.6696,wsj_398831223,2000-12-20,Treasury Prices Drop as Investors Absorb News ...,,,NEW YORK Treasury prices ended lower after Fed...
2116,842615577,False,finalized,3,12/5/2015 01:39:31,7.0,0.3415,1,0.6667,wsj_398860062,2003-07-28,The Economy Rise in Durable Goods Orders Offer...,,,Dow Jones NewswiresWASHINGTON There are signs ...
2115,842615576,False,finalized,3,12/5/2015 00:18:04,5.0,0.3448,1,0.6638,wsj_395219023,1991-07-01,Bonds Tied To Stocks Top Others Junk Bonds Out...,,,NEW YORK Bonds that take their cues from the s...
2015,842615475,False,finalized,3,12/5/2015 06:13:48,7.0,0.6694,1,0.6694,wsj_398593084,1997-09-08,Microcap Funds Climbed 5 54 During August Grou...,,,NEW YORK Small was beautiful in August as micr...


>Now, let's create a new dataframe with the balanced sampling:

In [19]:
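frames = []
for i in range(5):
    if i < 4:
        # Four draws of 300 minority-class rows, with replacement (1,200 rows in total).
        frames.append(economic_df[economic_df['relevance'] == 1].sample(n=300, replace=True))
    else:
        # One draw of 1,200 majority-class rows, without replacement.
        frames.append(economic_df[economic_df['relevance'] == 0].sample(n=1200, replace=False))

# Stack the five samples into a single 2,400-row dataframe.
balanced_df = pd.concat(frames)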

In [20]:
balanced_df.relevance.value_counts()*100/len(balanced_df.relevance)

1    50.0
0    50.0
Name: relevance, dtype: float64
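The sampling loop can also be wrapped in a reusable helper. The sketch below is one possible formulation, not the code used in this notebook: `balance_classes` is a hypothetical name, it draws all rows for a class in a single call (with replacement only when the class is too small), and it takes a `seed` so the sample is reproducible.

```python
import pandas as pd

def balance_classes(df, label_col, n_per_class, seed=0):
    """Return a frame with n_per_class rows for every value of label_col."""
    parts = []
    for _, group in df.groupby(label_col):
        # Oversample (with replacement) only when the class has too few rows.
        needs_replacement = len(group) < n_per_class
        parts.append(group.sample(n=n_per_class, replace=needs_replacement,
                                  random_state=seed))
    return pd.concat(parts)

# Toy frame: 8 rows of class 0 and 2 rows of class 1 (illustrative only).
toy = pd.DataFrame({'relevance': [0] * 8 + [1] * 2,
                    'article_text': [f'doc {i}' for i in range(10)]})

balanced = balance_classes(toy, 'relevance', n_per_class=4)
print(balanced['relevance'].value_counts())
```

Because the minority class is sampled with replacement, the same article appears several times in the balanced frame. If that frame is later split into training and test sets, copies of one article can fall on both sides, which tends to inflate test scores.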

>We have a balanced dataframe to build future models.

#### Converting variable `unit_state` to numerical values.

In [21]:
economic_df['unit_state'] = economic_df['unit_state'].apply(lambda x: 1 if x == 'finalized' else 0)

In [22]:
balanced_df['unit_state'] = balanced_df['unit_state'].apply(lambda x: 1 if x == 'finalized' else 0)
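The same conversion can be written without `apply`, using a vectorized comparison. A minimal sketch on a toy column (the value `pending` is made up for illustration; the actual values of `unit_state` other than `finalized` are not shown in this notebook):

```python
import pandas as pd

unit_state = pd.Series(['finalized', 'pending', 'finalized'])

# Element-wise version, as in the cells above.
via_apply = unit_state.apply(lambda x: 1 if x == 'finalized' else 0)

# Vectorized version: compare the whole column at once, then cast booleans to 0/1.
via_eq = unit_state.eq('finalized').astype(int)

print(via_eq.tolist())
```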

#### Converting variable `golden` to numerical values.

In [23]:
economic_df['golden'] = economic_df['golden'].astype(int)

In [24]:
balanced_df['golden'] = balanced_df['golden'].astype(int)

#### Since columns `positivity_gold` and `relevance_gold` are empty, we can drop them.

In [25]:
del economic_df['positivity_gold']

In [26]:
del balanced_df['positivity_gold']

In [27]:
del economic_df['relevance_gold']

In [28]:
del balanced_df['relevance_gold']
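Rather than hard-coding the two column names, the empty columns could be detected and dropped in one step. A sketch on a toy frame, assuming that "empty" means every value in the column is missing:

```python
import numpy as np
import pandas as pd

# Toy frame: two fully empty columns and one partially filled column.
toy = pd.DataFrame({
    'relevance': [1, 0, 0],
    'positivity': [3.0, np.nan, np.nan],
    'positivity_gold': [np.nan, np.nan, np.nan],
    'relevance_gold': [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing.
empty_cols = [col for col in toy.columns if toy[col].isna().all()]

# drop returns a new frame; the partially filled 'positivity' column is kept.
trimmed = toy.drop(columns=empty_cols)
print(empty_cols)
print(list(trimmed.columns))
```

`toy.dropna(axis=1, how='all')` gives the same result in a single call; the explicit list is handy when the dropped names should be logged or reused on a second dataframe.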

#### Converting `last_judgment_at` and `article_date` to datetime.

In [29]:
economic_df['last_judgment_at'] = pd.to_datetime(economic_df['last_judgment_at'])

In [30]:
balanced_df['last_judgment_at'] = pd.to_datetime(balanced_df['last_judgment_at'])

In [31]:
economic_df['article_date'] = pd.to_datetime(economic_df['article_date'])

In [32]:
balanced_df['article_date'] = pd.to_datetime(balanced_df['article_date'])
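`pd.to_datetime` infers the format when none is given. Passing the format explicitly makes the month/day order unambiguous and raises an error on rows that do not match. The formats below are inferred from the sample rows shown earlier (for example `12/5/2015 17:48:27` and `1991-08-14`), so treat them as assumptions about the full file:

```python
import pandas as pd

# Sample strings copied from the rows displayed above.
judged_raw = pd.Series(['12/5/2015 17:48:27', '11/18/2015 07:40:43'])
dates_raw = pd.Series(['1991-08-14', '2007-08-21'])

# Month first, then day: '11/18/2015' can only be read as 18 November.
judged = pd.to_datetime(judged_raw, format='%m/%d/%Y %H:%M:%S')
dates = pd.to_datetime(dates_raw, format='%Y-%m-%d')

print(judged.tolist())
print(dates.tolist())
```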

### Saving files for future use.

In [33]:
economic_df.to_pickle("full_df")

In [34]:
balanced_df.to_pickle("balanced_df")

In [35]:
economic_df = pd.read_pickle("full_df")

In [36]:
balanced_df = pd.read_pickle("balanced_df")

#### Checking if pickle files load:

In [37]:
economic_df.head()

Unnamed: 0,unit_id,golden,unit_state,trusted_judgments,last_judgment_at,positivity,positivity_confidence,relevance,relevance_confidence,article_id,article_date,article_headline,article_text
0,842613455,0,1,3,2015-12-05 17:48:27,3.0,0.64,1,0.64,wsj_398217788,1991-08-14,Yields on CDs Fell in the Latest Week,NEW YORK Yields on most certificates of deposi...
1,842613456,0,1,3,2015-12-05 16:54:25,,,0,1.0,wsj_399019502,2007-08-21,The Morning Brief White House Seeks to Limit C...,The Wall Street Journal OnlineThe Morning Brie...
2,842613457,0,1,3,2015-12-05 01:59:03,,,0,1.0,wsj_398284048,1991-11-14,Banking Bill Negotiators Set Compromise Plan t...,WASHINGTON In an effort to achieve banking ref...
3,842613458,0,1,3,2015-12-05 02:19:39,,0.0,0,0.675,wsj_397959018,1986-06-16,Managers Journal Sniffing Out Drug Abusers Is ...,The statistics on the enormous costs of employ...
4,842613459,0,1,3,2015-12-05 17:48:27,3.0,0.3257,1,0.64,wsj_398838054,2002-10-04,Currency Trading Dollar Remains in Tight Range...,NEW YORK Indecision marked the dollars tone as...


In [38]:
balanced_df.head()

Unnamed: 0,unit_id,golden,unit_state,trusted_judgments,last_judgment_at,positivity,positivity_confidence,relevance,relevance_confidence,article_id,article_date,article_headline,article_text
6446,830984077,0,1,3,2015-11-17 17:47:53,5.0,0.3566,1,0.6936,wapo_408523708,1999-10-26,Stocks Fall As Bond Yields Rise Dow Closes Off...,Stocks closed mostly lower today as revived fe...
5697,830983327,0,1,3,2015-11-17 17:36:51,4.0,0.3695,1,0.6634,wapo_147173705,1980-04-17,Stock Market Grinds to Quick Halt,als that interest rates may have peaked and th...
5381,830983011,0,1,3,2015-11-18 10:58:37,8.0,0.3333,1,0.6667,wapo_408396675,1998-10-03,4 Executives At UBS Quit After Internal Fund P...,The chairman of UBS AG stepped down today afte...
841,842614300,0,1,3,2015-12-05 18:21:41,5.0,0.3431,1,0.6774,wsj_1033526241,2012-08-16,Whats Really in the Ryan Budget,Thanks to several years of fiscal restraint du...
6958,830984592,0,1,3,2015-11-18 11:50:08,8.0,0.3337,1,0.6675,wapo_750986740,1994-10-29,Fed Reports Disparity in Area Lending Blacks H...,Blacks and Hispanics in the Washington area we...
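Calling `head()` confirms that the files load, but not that they match what was saved. A stricter check compares the reloaded frame with the original, including dtypes such as the datetime columns that a CSV round trip would lose. A self-contained sketch using a temporary directory and a toy frame (the file name `toy_df.pkl` is arbitrary):

```python
import os
import tempfile

import pandas as pd

# Toy frame with an integer, a datetime and a text column (illustrative values).
toy = pd.DataFrame({
    'relevance': [1, 0],
    'article_date': pd.to_datetime(['1991-08-14', '2007-08-21']),
    'article_headline': ['Yields on CDs Fell in the Latest Week', 'Another headline'],
})

# Write the pickle to a throwaway directory and read it straight back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_df.pkl')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

same_values = restored.equals(toy)                    # values and index match
same_dtypes = bool((restored.dtypes == toy.dtypes).all())  # column types preserved
print(same_values, same_dtypes)
```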


## Using `positivity` as labels.

>Only 1,420 datapoints have a `positivity` score, so we will drop the rows with missing values (imputing with the mean would introduce bias, as described earlier).

>`positivity` ranges from `2` to `9`, so we will create two classes: scores from `2` to `5` are coded as `1`, and scores from `6` to `9` are coded as `0`. The cut-off is arbitrary.

In [39]:
economic_pos_df = economic_df[economic_df['positivity'].notna()].copy()

In [40]:
# http://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
pd.options.mode.chained_assignment = None
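An alternative to silencing the warning globally is to take an explicit copy of the filtered rows, which makes it clear that later assignments are meant for the new frame only. A minimal sketch on toy data (the frame and its values are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'positivity': [3.0, np.nan, 7.0],
                    'article_id': ['a', 'b', 'c']})

# Keep only labeled rows and detach the result from the original frame.
labeled = toy.dropna(subset=['positivity']).copy()

# Assigning to the copy raises no warning and leaves `toy` untouched.
labeled['positivity'] = labeled['positivity'].astype(int)

print(labeled)
print(toy['positivity'].tolist())
```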

In [41]:
economic_pos_df['positivity'] = economic_pos_df['positivity'].apply(lambda x: 1 if x < 6 else 0)

In [42]:
economic_pos_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1420 entries, 0 to 7995
Data columns (total 13 columns):
unit_id                  1420 non-null int64
golden                   1420 non-null int64
unit_state               1420 non-null int64
trusted_judgments        1420 non-null int64
last_judgment_at         1420 non-null datetime64[ns]
positivity               1420 non-null int64
positivity_confidence    1420 non-null float64
relevance                1420 non-null int64
relevance_confidence     1420 non-null float64
article_id               1420 non-null object
article_date             1420 non-null datetime64[ns]
article_headline         1420 non-null object
article_text             1420 non-null object
dtypes: datetime64[ns](2), float64(2), int64(6), object(3)
memory usage: 155.3+ KB


In [43]:
economic_pos_df.positivity.value_counts()

1    838
0    582
Name: positivity, dtype: int64
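The threshold encoding can be made explicit with a named cut-off and checked at the boundaries. The sketch below uses a hypothetical constant `CUTOFF` and one toy score for each point of the 2-9 scale; note that `1` marks the lower half of the scale.

```python
import pandas as pd

CUTOFF = 6  # scores below this value are coded 1, the rest 0 (same rule as the lambda above)

# One toy score for each point of the scale.
scores = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Vectorized comparison, then cast booleans to 0/1.
labels = (scores < CUTOFF).astype(int)

# Show which score range ends up in each class.
summary = pd.DataFrame({'score': scores, 'label': labels})
print(summary.groupby('label')['score'].agg(['min', 'max']))
```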

#### Check if labels are balanced.

In [44]:
economic_pos_df.positivity.value_counts()*100/len(economic_pos_df.positivity)

1    59.014085
0    40.985915
Name: positivity, dtype: float64

In [45]:
economic_pos_df.to_pickle("positive_df")

In [46]:
economic_pos_df = pd.read_pickle("positive_df")

In [47]:
economic_pos_df.head()

Unnamed: 0,unit_id,golden,unit_state,trusted_judgments,last_judgment_at,positivity,positivity_confidence,relevance,relevance_confidence,article_id,article_date,article_headline,article_text
0,842613455,0,1,3,2015-12-05 17:48:27,1,0.64,1,0.64,wsj_398217788,1991-08-14,Yields on CDs Fell in the Latest Week,NEW YORK Yields on most certificates of deposi...
4,842613459,0,1,3,2015-12-05 17:48:27,1,0.3257,1,0.64,wsj_398838054,2002-10-04,Currency Trading Dollar Remains in Tight Range...,NEW YORK Indecision marked the dollars tone as...
5,842613460,0,1,3,2015-12-04 23:15:05,1,0.6783,1,1.0,wsj_905654974,2011-11-23,Stocks Fall Again BofA Alcoa Slide,Stocks declined as investors weighed slower th...
9,842613464,0,1,3,2015-12-05 18:40:28,1,0.6657,1,1.0,wsj_397912506,1984-11-01,U S Dollar Falls Against Most Currencies Decli...,The U S dollar declined against most major for...
12,842613467,0,1,3,2015-12-05 01:29:35,1,0.3388,1,0.6777,wsj_738300385,2010-08-03,Defending Yourself Against Deflation,Author James B StewartThe dreaded D word is ba...
