# Project Team 1: Caroline Liongosari, Yueqi Su, Daniel Zhang 
## Determining Factual and Opinionated News Articles 

### Overview and Motivation: 
Provide an overview of the project goals and motivation for it. Consider that this will be read by people who did not see your project proposal.

### Data: 
Source, scraping method, cleanup, etc.

Our project group was able to use a dataset generously given to us by researchers Ishan Sahu and Debapriyo Majumdar from the Indian Statistical Institute Kolkata who did a similar project as ours in 2017. The researchers derived their dataset from the Signal Media One-Million News Articles Dataset. Their cleaned and annotated version of this dataset was provided to us. The dataset consists of 98 news articles and has 3 parts:
*  **Article Text Length**: the number of characters present in the news article
* **Article Text**: the complete text of the news article
* **Unit tags**: the factual, non-factual annotations in the format: 
    * Character position start : Character position end: Annotation
    * example: 502:634:FACTUAL implies that the article text from character position 502 to 634 is factual

With this dataset we first 

In [1]:
import glob
import csv
import re
import pandas as pd
import nltk
import numpy as np
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

#from unidecode import unidecode

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/student/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
stopWords = set(stopwords.words('english'))

In [3]:
# parsedData is a 2-D array with entries: [annotatedString, annotation]
parsedData = []
path = '/home/student/Documents/Project/annotated-news/*.txt'
files = glob.glob(path)

for file in files:
    f = open(file,'r')
    inputString = f.read()
    
    # inputArray: 
    # [0-2] holds ArticleTextLength.
    # [3-5] holds ArticleText.
    # [6-end] holds UnitTags.
    inputArray = inputString.split('\n')
    articleText = inputArray[4]
    unitTag = []
    
    # inputArray[6] = "<UnitTags>"
    # inputArray[7] = start of actual Unit Tags.
    i = 7
    while i<(len(inputArray)-2):
        unitTag.append(inputArray[i])
        i+=1
    
    for indexes in unitTag:
        # temp = [Character position start, Character position end, Annotation]
        temp = indexes.split(':')
        rawText = articleText[int(temp[0]):int(temp[1])-1]
        #newRawText = "u'"+rawText+"'"
        
        processedText = re.sub('\\\\u[a-zA-Z0-9]{4}',"",rawText)
        #processedText = unidecode(newRawText)
        #parsedData.append([articleText[int(temp[0]):int(temp[1])-1], temp[2]])
        parsedData.append([processedText, temp[2]])
    
    f.close()
   

In [4]:
with open("/home/student/Documents/Project/dataset.csv","w+") as my_csv:
    csvWriter = csv.writer(my_csv,delimiter=',')
    csvWriter.writerows(parsedData)

In [5]:
csv_file = "/home/student/Documents/Project/dataset.csv"
df = pd.read_table(csv_file, sep = ',', names = ['Sentence','Tag'])
tokenData = [] # With stopwords.
tokenDataFiltered = [] # Without stopwords.
for index, row in df.iterrows():
    tokenizer = RegexpTokenizer(r'\w+')
    tokenizedSentence = tokenizer.tokenize(row['Sentence']) 
    tokenData.append([tokenizedSentence, row['Tag']])
    wordsFiltered = [] # Temporary holding array for filtered tokens.
    # Filtering stopwords.
    for w in tokenizedSentence:
        if w not in stopWords:
            wordsFiltered.append(w)
    tokenDataFiltered.append([wordsFiltered, row['Tag']])        



with open("/home/student/Documents/Project/tokenized.csv","w+") as my_csv:    
    csvWriter = csv.writer(my_csv,delimiter=',')
    csvWriter.writerows(tokenData)

with open("/home/student/Documents/Project/tokenizedNoStopwords.csv","w+") as my_csv:    
    csvWriter = csv.writer(my_csv,delimiter=',')
    csvWriter.writerows(tokenDataFiltered)

IndentationError: expected an indented block (<ipython-input-5-259aaca17b04>, line 23)

In [None]:
 df.head(20)

In [None]:
df['Tag'] = df.Tag.map({'NON_FACTUAL': 0, "FACTUAL": 1})
df.head(5)

In [None]:
#define X and Y
X= df.Sentence
y = df.Tag

In [None]:
print df.shape

In [None]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print X_train.shape
print X_test.shape

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# instantiate the vectorizer
v = CountVectorizer()

In [None]:
# learn training data vocabulary, then create document-term matrix

X_train_data = v.fit_transform(X_train)
X_train_data

In [None]:
X_test_data = v.transform(X_test)
X_test_data 

In [None]:
Xt_tokens =  v.get_feature_names()


In [None]:
X_train_data = v.fit_transform(X_train)
X_train_data

In [None]:
X_test_data = v.transform(X_test)
X_test_data 

In [None]:
Xt_tokens = v.get_feature_names()
Xt_count = np.sum(X_train_data.toarray(), axis =0)
Xt_count

In [None]:
Xt_count.shape

In [None]:
df_token = pd.DataFrame({'word':Xt_tokens, 'count':Xt_count})
df_token.sort_values(by='count', ascending=False)

### Exploratory Analysis
What visualizations did you use to look at your data in different ways? What are the different statistical methods you considered? Justify the decisions you made, and show any major changes to your ideas. How did you reach these conclusions?

In [None]:
# train a Naive Bayes model using X_train_dtm
from sklearn.naive_bayes import MultinomialNB
# TODO
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_data, y_train)

In [None]:
# make class predictions for X_test_dtm
# TODO
y_pred_class = naive_bayes.predict(X_test_data)

In [None]:
# calculate accuracy of class predictions
# compute the accuracy scores
from sklearn import metrics
print metrics.accuracy_score(y_test, y_pred_class)

In [None]:
# confusion matrix
# TODO
matrix =metrics.confusion_matrix(y_test, y_pred_class)
print matrix

In [None]:
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt 

sns.heatmap(matrix.T, square = True, annot=True, fmt='d', cbar=False)
plt.xlabel('true labels')
plt.ylabel('predicting labels')

In [None]:
# print message text for the false positives
# TODO# print message text for the false negatives
# TODO
print X_test[y_test > y_pred_class]

In [None]:
# print message text for the false negatives
# TODO
print X_test[y_test > y_pred_class]

In [None]:
# import/instantiate/fit
from sklearn.linear_model import LogisticRegression
# TODO
logreg = LogisticRegression()
logreg.fit(X_train_data, y_train)

In [None]:
# class predictions and predicted probabilities
# TODO
y_pred_class = logreg.predict(X_test_data)

In [None]:
# calculate accuracy
# TODO
print metrics.accuracy_score(y_test, y_pred_class)

In [None]:
matrix2 =metrics.confusion_matrix(y_test, y_pred_class)
print matrix2

In [None]:
sns.heatmap(matrix2.T, square = True, annot=True, fmt='d', cbar=False)
plt.xlabel('true labels')
plt.ylabel('predicting labels')

In [None]:
X= df['Sentence']
y = df['Tag']

In [None]:
# plot the class predictions
#y_pred_class['prediction'] = pred
#glass.plot.scatter(x = 'al', y = 'household')
#plt.plot(glass.al, glass.prediction, color='red')


### Final Analysis: 
What did you learn about the data? How did you answer the questions? How can you justify your answers? 

In [None]:
#Xt_tokens = v.get_feature_names()
#Xt_count = np.sum(X_train_data.toarray(), axis =0)
#Xt_count

#Xt_count = np.sum(X_train_data.toarray(), axis =0)
#print Xt_count
#print Xt_count.shape
#print len(Xt_tokens)

In [None]:
# create a DataFrame of tokens with their counts
# such that you will have two columns -- count and token
# TODO
#df_token = pd.DataFrame({'token':Xt_tokens, 'count':Xt_count})
#df_token.sort_values(by='count', ascending=False)

In [None]:
# create separate DataFrames for ham and spam
non_fact = df[df.Tag==0]
fact = df[df.Tag==1]

In [None]:
# learn the vocabulary of ALL messages and save it
v.fit(df.Sentence)
# put the names of all features (tokens) into a variable
all_tokens = v.get_feature_names()

In [None]:
# create document-term matrices for ham and spam

fact_doc = v.transform(fact['Sentence'])
nonfact_doc = v.transform(non_fact['Sentence'])

In [None]:
# count how many times EACH token appears across ALL ham messages
# TODO
fact_count = np.sum(fact_doc.toarray(), axis=0)
fact_count

In [None]:
nonfact_count = np.sum(nonfact_doc.toarray(), axis=0)
nonfact_count

In [None]:
tokens= pd.DataFrame({'token':all_tokens, 'fact': fact_count, 'nonfact': nonfact_count})
tokens.sample(10)