# Feature Evaluation

This program performs feature extraction, feature analysis, and machine learning for fake news detection:
1. Gather the news data from NewsAPI.org
2. Extract the individual features from the news data
3. Extract the relational features from the news data
4. Analyze the features
5. Test with machine learning

Date: Feb 26th, 2020<br>
Programmed by: Sasung Kim, Christophe Liu

# Setting up

In [1]:
# Import required modules
# For NewsAPI
import requests
import numpy as np
import pandas as pd
import csv
import random
import pickle

# For Individual Feature Extraction 
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import re
from collections import Counter

# For Relational Feature Extraction
from statistics import mean
import math

# For feature analysis
# Enable multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# increase size of output window
from IPython.display import display, HTML
display(HTML("<style>div.output_scroll { height: 50em; }</style>"))

# import libraries
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# For machine learning
from sklearn import svm
from sklearn.svm import SVC

# Modeling
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

# Evaluation
from sklearn.metrics import accuracy_score
from yellowbrick.classifier import ClassificationReport
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

# Cross-validation
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report


# Saving
import joblib
from joblib import dump, load

#feature selection
from sklearn.feature_selection import SelectKBest, f_regression, f_classif
from sklearn.feature_selection import chi2


# News Scraping API

<h4>This program scrapes news about a specific topic from the newsapi.org API.</h4>
To change or select the topic, change the `q=` part.<br>
To restrict results to a specific news domain, use `domains=<news domain url>&`.<br>
Use `page=#&` to get more results.

Data includes: Author, title, description, url, content, publishing date, source ID
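The query parameters described above can also be assembled with the standard library, which avoids dropping a `&` between parameters. This is a minimal sketch, not the notebook's own code: the helper name `build_news_url` and the placeholder key `YOUR_API_KEY` are assumptions made for illustration.

```python
from urllib.parse import urlencode

NEWSAPI_ENDPOINT = "https://newsapi.org/v2/everything"

def build_news_url(topic, source, page, api_key, page_size=20):
    """Return the request URL for one page of results (hypothetical helper)."""
    params = {
        "domains": source,
        "pageSize": page_size,
        "q": topic,
        "page": page,
        "sortBy": "popularity",
        "apiKey": api_key,
    }
    # urlencode joins the pairs with '&' and escapes spaces and symbols
    return f"{NEWSAPI_ENDPOINT}?{urlencode(params)}"

# Placeholder key: replace it with a real key before sending the request
url = build_news_url("corona virus", "cnn.com", page=1, api_key="YOUR_API_KEY")
print(url)
```

The resulting string can be passed to `requests.get` in the same way as a hand-built URL.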

In [2]:
def news_api(topic, number, source, reliability):
    # List creation
    source_id = []
    author = []
    title = []
    description = []
    url = []
    content = []
    pub_date = []
    rel = []

    # Set the reliability labels from user input (1 = reliable, 0 = unreliable)
    if reliability == 1:
        rel = np.ones(number)
    elif reliability == 0:
        rel = np.zeros(number)
    
    # Read `number` news articles about the topic (20 articles per page) and parse each field into its list
    for i in range(1, int((number/20)+1)):

        # News extraction
        news_url = ('https://newsapi.org/v2/everything?'
                f'domains={source}&'
                'pageSize=20&'
                f'q={topic}&'
               f'page={i}&'
               'sortBy=popularity&'
               'apiKey=963417ea47cd41199fcbae1e1189b85b')

        news = requests.get(news_url)
        
        
        size = news.json()['totalResults']
        
        print(size)
        
        if(number > size):
            rel.resize(size)
        
        for elements in news.json()['articles']:
            source_id.append(elements['source']['id'])
            author.append(elements['author'])
            title.append(elements['title'])
            description.append(elements['description'])
            url.append(elements['url'])
            content.append(elements['content'])
            pub_date.append(elements['publishedAt'])

            # *For CNN only: check whether the content is a video, in which case the data is not related to the article
            #if (elements['content'] == 'None' or elements['content'] == "Chat with us in Facebook Messenger. Find out what's happening in the world as it unfolds."):
            #    pass
                #print(elements['description'])
            #else:
                #print (elements['content'])
            #    pass
    
    # Assemble the parsed data into a DataFrame
    news_info = pd.DataFrame()
    news_info['source_id'] = source_id
    news_info['author'] = author
    news_info['title'] = title
    news_info['description'] = description
    news_info['url'] = url
    news_info['content'] = content
    news_info['published_date'] = pub_date
    news_info['label'] = rel
    return news_info


# Individual Feature Extraction

This portion of the code extracts individual features from the given data.

First, the program takes the description from the given dataset and extracts the features from the news content.
Then, it outputs the extracted data and features as a DataFrame.

Features used in this program are:
1. Number of Characters
2. Number of Words
3. Number of Verbs
4. Number of Nouns
5. Number of Sentences
6. Average Number of Words per Sentence
7. Average Number of Characters in Words
8. Number of Question Marks (?)
9. Percentage of Subjective Verbs
10. Percentage of Passive Voice
11. Percentage of Positive Words
12. Percentage of Negative Words
13. Lexical Diversity: Unique Words or Terms
14. Typographical Error Ratio: Misspelled Words
15. Causation Terms
16. Percentage of generalizing terms
17. Percentage of numbers and quantifiers
18. 1st person pronouns
19. 2nd and 3rd person pronouns
20. Exclusive terms
21. Number of exclamation marks (!)
22. Lexical
23. Singular pronouns (1st person)
24. Group ref pronouns (1st person)
25. 2nd and 3rd pronouns
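
A few of the surface features in the list above can be sketched with the standard library alone. The helper name `surface_features` and the sample sentence are made up for illustration, and splitting sentences on `.`, `?`, and `!` is an assumption that may differ from the notebook's own counting.

```python
import re

def surface_features(text):
    """Compute a few surface counts for one statement (illustrative only)."""
    words = text.split()
    # Split on sentence-ending punctuation and ignore empty pieces
    sentences = [s for s in re.split(r"[.?!]+", text) if s.strip()]
    return {
        "chars": len(text.replace(" ", "")),
        "words": len(words),
        "sentences": len(sentences),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "chars_per_word": sum(len(w) for w in words) / len(words) if words else 0.0,
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
    }

sample = "Is this real? The report says so. Wow!"
features = surface_features(sample)
print(features)
```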

How to use: First, place the positive-words.txt and negative-words.txt files in the word_sentiment folder. To load the LIAR dataset, put its file in the liar_dataset folder inside the datasets folder and pass the file name to the read_data function. To load scraped news, put the pickled file in the news_data folder and pass the file name, a subject, and a reliability label to the setup_data function. Next, call the feature_extract function on the result to extract the features as a DataFrame.



In [3]:
# Create the lists of words that are used for feature extraction

positive_list = list()
negative_list = list()

# Gather the positive and negative word lists from the text file
with open('./word_sentiment/positive-words.txt') as p:
    for line in p:
        val = line.split()
        for ele in val:
            positive_list.append(ele)

with open('./word_sentiment/negative-words.txt') as n:
    for line in n:
        val = line.split()
        for ele in val:
            negative_list.append(ele)

subjective_list = list(['am', 'are', 'is', 'was', 'were', 'be', 'been'])
causation_list = list(['led to', 'because', 'cause', 'caused', 'reason', 'explanation', 'so'])

exclusive_list = list(['except', 'else', 'besides', 'without', 'exclude', 'other than'])
generalizing_list = list(['all', 'none', 'most', 'many', 'always', 'everyone','never',
                          'some','usually','few','seldom','generally','general','overall'])
pronoun_1st_list = list(['I','we'])
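# 'her' is listed once so that check_common does not count it twice; 'they' is lowercase like the other entries
pronoun_2nd3rd_list = list(['you','your','yours','he','she','it','him','her','his','its','hers','they','them','theirs','their'])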

SPronoun_list = list(['I', 'mine','my','me'])
GPronoun_list = list(['we','ours','our','us'])

In [4]:
# From the dataset, isolate the required data (description, label, and topic) and return it as a pandas DataFrame
# If the required data is needed as a file, uncomment the def that takes new_file as an argument and comment out the other one

def read_data(fileName):
    
    header = ["ID", "label", "description", 'subject', 'speaker', 'speaker_job', 'state_info', 'party_aff', 'barely_true', 'false', 'half_true', 'mostly_true', 'pants_on_fire', 'context']
    data = pd.read_csv(f"./datasets/liar_dataset/{fileName}", delimiter = '\t', names = header, encoding = 'unicode_escape')
    
    result = pd.DataFrame()
    
    description = data['description']
    label = data['label']
    topic = data['subject']

    label = label.replace('TRUE', 1)
    label = label.replace('mostly-true', 1)
    label = label.replace('half-true', 1)
    label = label.replace('barely-true', 0)
    label = label.replace('FALSE', 0)
    label = label.replace('pants-fire', 0)

    result['description'] = description
    result['label'] = label
    result['topic'] = topic
    
    return result

#def setup_data(file_name, new_file):
def setup_data(fileName, subject, rel):
    
    # The files in news_data are pickled objects indexed by subject, despite the .csv extension
    with open(f'./news_data/{fileName}', 'rb') as f:
        news_info = pickle.load(f)[subject]
    
    size = len(news_info)
    
    # rel: 1 = reliable, 0 = unreliable
    if rel == 1:
        label = np.ones(size)
    elif rel == 0:
        label = np.zeros(size)
    else:
        raise ValueError('rel must be 0 or 1')
    
    news_info['label'] = label
    
    # Drop the rows that contain any null value, then the rows whose description is an empty string
    data = news_info.dropna(how = 'any')
    data = data[data['description'].str.len() > 0]
    return data

In [5]:
# Input: String ('text')
# Description: Tokenize the input and tag it with the nltk universal tagset
# Return: Tags of each token - list of (token, tag) tuples ('tagged')

def tagging_univ(text):
    try:
        tokens = nltk.word_tokenize(text)
    except Exception:
        # Show the offending input before re-raising the error
        print(text)
        print(type(text))
        raise
    tagged = nltk.pos_tag(tokens, tagset='universal')
    return tagged

# Input: String ('text')
# Description: Tokenize the input and tag it with the default (non-universal) nltk tagset
# Return: Tags of each token - list of (token, tag) tuples ('tagged')

def tagging_nuniv(text):
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    return tagged

# Input: String ('text')
# Description: Count the number of non-space characters in the input
# Return: Character count - int ('count')

def count_char(text):
    no_space = text.replace(" ", "")
    count = len(no_space)
    return count

# Input: String ('text')
# Description: Count the number of words in the input
# Return: Word count - int ('count')

def count_word(text):
    count = len(text.split())
    return count

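# Input: String ('text')
# Description: Count the number of sentences by splitting on periods (.) and ignoring empty pieces
# Return: Sentence count, at least 1 - int ('sentence')

def count_sent(text):
    pieces = [s for s in text.split('.') if s.strip()]
    # Never return 0: the count is used as a divisor for the passive-voice percentage
    sentence = max(1, len(pieces))
    return sentence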

# Input: String ('states')
# Description: Average the number of characters per whitespace-separated word in the input
# Return: Average number of characters - float ('avg'), 0.0 for an input without words

def count_char_per_word(states):
    words = states.split()
    if not words:
        return 0.0
    avg = sum(len(w) for w in words) / len(words)
    return avg

# Input: String ('check'), String ('tag'), int ('count')
# Description: Increment the count when the tag of a token ('check') equals the specified tag ('tag')
# Return: Updated count of the specified tag - int ('count')

def check_tag(check, tag, count):
    if check == tag:
        count += 1
    return count

# Input: String ('state'), list ('word_list'), int ('count')
# Description: Count how many words of the input String appear in the input list
# Return: Updated count of common words - int ('count')
# Note: matching is case-sensitive and per word, so multi-word entries such as 'led to' never match

def check_common(state, word_list, count):
    words = state.split()
    for element in word_list:
        count += words.count(element)
    return count

# Input: String ('word'), String ('tag'), list ('subjective_list'), int ('sent_count')
# Description: For one tagged token, count the be-verb forms it contains when it is tagged as a
#              past participle (VBN), and express the count as a percentage of the number of sentences (sent_count)
# Return: Percentage contribution of this token - float ('result')

def count_passive(word, tag, subjective_list, sent_count):
    counter = 0
    for ele in subjective_list:
        if (word.count(ele) > 0 and tag == "VBN"):
            counter += 1
    result = counter / sent_count * 100
    return result

# Input: String ('states')
# Description: Count the words that appear exactly once in the input
# Return: Count of unique words - int ('unique_counter')

def count_unique(states):
    words = states.split(' ')
    c = Counter(words)
    unique = [w for w in words if c[w] == 1]
    unique_counter = len(unique)
    return unique_counter

# Calvin
# Same computation as count_char_per_word, without relying on an undefined global list
def avg_char(states):
    words = states.split()
    if not words:
        return 0.0
    avg = sum(len(w) for w in words) / len(words)
    return avg


# Input: pandas DataFrame ('data') with 'description' and 'label' columns
# Description: First, this function creates a list for each feature and extracts the features from the statements in the
#              dataset using the functions above. Next, it builds a DataFrame that consists of the news contents, labels,
#              and extracted features, which makes it easier to examine the result.
# Return: DataFrame of news contents, labels, and extracted feature values - pandas DataFrame ('ind_feature')

def feature_extract(data):
    # define the news contents and labels from the dataset
    state = data['description']
    label = data['label']
    
    # Create a dataframe that will store every feature values
    ind_feature = pd.DataFrame()

    # create lists for storing the counters
    char_count_list = list()
    word_count_list = list()
    verb_count_list = list()
    noun_count_list = list()
    sent_count_list = list()
    words_per_sent_list = list()
    char_per_word_list = list()
    quest_count_list = list()
    sub_count_list = list()
    pass_count_list = list()
    pos_count_list = list()
    neg_count_list = list()
    unique_count_list = list()
    typo_count_list = list()
    cause_count_list = list()
    
    # Calvin
    gene_count_list = list()
    num_count_list = list()
    pron_1st_count_list = list()
    pron_2nd3rd_count_list = list()
    exclu_count_list = list()
    
    # Andrew
    exclam_list = list()
    lex_list = list()
    singlulars_list = list()
    group_list = list()
    other_list = list()


    word = list()

    #print(type(state))
    # loop for checking each new contents in dataset
    for states in state:

        #print(states)
        # reset the counters for each news contents
        w_in_s_count = 0
        c_in_w_count = 0
        verb_count = 0
        noun_count = 0
        sub_count = 0
        pos_count = 0
        neg_count = 0
        percent_pos = 0
        percent_neg = 0
        unique_count = 0
        sent_counts = 0
        typo_count = 0
        cause_count = 0
        gene_count = 0
        pron_1st_count = 0
        pron_count = 0
        exclu_count = 0
        SelfP = 0
        count_prongroup = 0
        group = 0
        other = 0
        num= 0
        
        # Tokenization and tagging with the nltk universal and non-universal tag systems (tagged = universal, tagged_nu = non-universal)
        tagged = tagging_univ(states)
        tagged_nu = tagging_nuniv(states)

        # Check the tags of each news contents
        # print(tagged)
        # print(tagged_nu)

        # Extract the features and append the results in the list. Commented lines with print() functions are for testing

        # 1. Number of Characters
        char_count = count_char(states)
        char_count_list.append(char_count)

        # 2. Number of Words
        word_count = count_word(states)
        word_count_list.append(word_count)

        # 3. Number of Verbs
        for tag in tagged:
            verb_count = check_tag(tag[1], 'VERB', verb_count)
        if verb_count == 0:
            verb_count = 1
        verb_count_list.append(verb_count)

        # 4. Number of Nouns
        for tag in tagged:
            noun_count = check_tag(tag[1], 'NOUN', noun_count)
        noun_count_list.append(noun_count)
        # print(noun_count)

        # 5. Number of Sentences
        statement = states.replace('...', '.')
        sent_count = count_sent(statement)
        sent_count_list.append(sent_count)
        # print(sent_count)

        # 6. Average number of words per sentence
        sent = [len(l.split()) for l in re.split(r'[?!.]', statement) if l.strip()]
        if len(sent) == 0:
            sent.append(word_count)
        w_in_s_count = (sum(sent) / len(sent))
        words_per_sent_list.append(w_in_s_count)
        # print(w_in_s_count)

        # 7. Average number of characters per word
        c_in_w_count = count_char_per_word(states)
        char_per_word_list.append(c_in_w_count)
        # print(c_in_w_count)

        # 8. Number of question marks
        quest_count = states.count("?")
        quest_count_list.append(quest_count)

        # 9. Percentage of subjective verbs - am/are/is/etc
        sub_count = check_common(states, subjective_list, sub_count)
        if (verb_count > 0):
            percent_sub = sub_count / verb_count * 100
        sub_count_list.append(percent_sub)

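        # 10. Percentage of passive voice - am/are/is && past participle
        # Sum the per-token contributions instead of keeping only the value of the last token
        passive_percent = 0
        for tag in tagged_nu:
            passive_percent += count_passive(tag[0], tag[1], subjective_list, sent_count)
        pass_count_list.append(passive_percent)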

        # 11. Percentage of positive words
        for tag in tagged:
            pos_count = check_common(tag[0], positive_list, pos_count)
        if word_count > 0:
            percent_pos = pos_count / word_count * 100
        pos_count_list.append(percent_pos)

        # 12. Percentage of negative words
        for tag in tagged:
            neg_count = check_common(tag[0], negative_list, neg_count)
        if word_count > 0:
            percent_neg = neg_count / word_count * 100
        neg_count_list.append(percent_neg)

        # 13. Lexical diversity: unique words or terms
        unique_count = count_unique(states)
        unique_count_list.append(unique_count)

        # 14. Typographical error ratio: misspelled words
        for tag in tagged:
            typo_count = check_tag(tag[1], 'X', typo_count)
        typo_count_list.append(typo_count)

        # 15. Causation terms
        cause_count = check_common(states, causation_list, cause_count)
        cause_count_list.append(cause_count)

        # 16. Percentage of generalizing terms
        gene_count = check_common(states, generalizing_list, gene_count)
        gene_count_list.append(gene_count)
        
        # 17. Percentage of numbers and quantifiers
        for tag in tagged:
            num = check_tag(tag[1], 'NUM', num)
        num_count_list.append(num)

        # 18. 1st person pronouns
        pron_1st_count = check_common(states, pronoun_1st_list, pron_1st_count)
        pron_1st_count_list.append(pron_1st_count)
   
        # 19. 2nd and 3rd person pronouns
        pron_count = check_common(states, pronoun_2nd3rd_list, pron_count)
        pron_2nd3rd_count_list.append(pron_count)

        # 20. Exclusive terms
        exclu_count = check_common(states, exclusive_list, exclu_count)
        exclu_count_list.append(exclu_count)
        
        # 21. # of exclamation marks
        exclam_count = states.count("!")
        exclam_list.append(exclam_count)
        
        # 22. lexical 
        unique_count = count_unique(states)
        lex_list.append(unique_count)
        
        # 23. singular pronouns (1st person)
        SelfP = check_common(states, SPronoun_list, SelfP)
        singlulars_list.append(SelfP)
        
        # 24. Group ref pronouns (1st person)
        group = check_common(states, GPronoun_list, group)
        group_list.append(group)
        
        # 25. 2nd 3rd pronouns
        other = check_common(states, pronoun_2nd3rd_list, other)
        other_list.append(other)
        

    # Put the data into a dataframe
        
    ind_feature['Statement'] = state
    ind_feature['Label'] = label
    ind_feature['# of Characters'] = char_count_list
    ind_feature['# of Words'] = word_count_list
    ind_feature['# of Verbs'] = verb_count_list
    ind_feature['# of Noun'] = noun_count_list
    ind_feature['# of Sentence'] = sent_count_list
    ind_feature['Average # of Words per Sentence'] = words_per_sent_list
    ind_feature['Average # of Characters per Words'] = char_per_word_list
    ind_feature['# of Question Marks'] = quest_count_list
    ind_feature['% of Subjective Verbs'] = sub_count_list
    ind_feature['% of Passive Voice'] = pass_count_list
    ind_feature['% of Positive Words'] = pos_count_list
    ind_feature['% of Negative Words'] = neg_count_list
    ind_feature['# of Unique Wrods/Terms'] = unique_count_list
    ind_feature['# of Misspelled Words'] = typo_count_list
    ind_feature['# of Causation Terms'] = cause_count_list
    ind_feature['% of generalizing terms'] = gene_count_list
    ind_feature['% of # and quantifiers'] = num_count_list
    ind_feature['1st person pronouns'] = pron_1st_count_list
    ind_feature['2nd and 3rd person pronouns'] = pron_2nd3rd_count_list
    ind_feature['Exclusive term'] = exclu_count_list
    ind_feature['# of exclamation marks'] = exclam_list
    ind_feature['Lexical'] = lex_list
    ind_feature['Singular pronouns(1st person)'] = singlulars_list
    ind_feature['Group ref pronouns(1st person)'] = group_list
    ind_feature['2nd 3rd pronouns'] = other_list

    return ind_feature


# Relational Features

This portion of the code extracts the multi-sourced features from two DataFrames: one holds the texts being evaluated (possibly unreliable), and the other holds reliable news used as the reference for the evaluation.

Restrictions: The two datasets must have the same number of rows and the same feature names. For better results, use news or texts on the same topic.

How to use: Pass the two DataFrames of statements and feature values (as returned by feature_extract) to multi_source_FE(). The first argument (text) should be the dataset of texts being tested, and the second argument (ref) should be the dataset of reference articles.
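
The two kinds of relational features can be illustrated on plain lists. This is a simplified sketch with made-up values and hypothetical helper names (`one_to_one`, `one_to_many`); it leaves out the null handling that the notebook's functions perform.

```python
from statistics import mean

def one_to_one(text_vals, ref_vals):
    """Difference between each text value and the reference in the same row."""
    return [round(t - r, 2) for t, r in zip(text_vals, ref_vals)]

def one_to_many(text_vals, ref_vals):
    """Average difference between each text value and every reference value."""
    return [round(mean(t - r for r in ref_vals), 2) for t in text_vals]

# Made-up word counts for three evaluated texts and three reference articles
text_word_counts = [12.0, 30.0, 18.0]
ref_word_counts = [10.0, 20.0, 30.0]

print(one_to_one(text_word_counts, ref_word_counts))
print(one_to_many(text_word_counts, ref_word_counts))
```

The one-to-one value depends on which reference happens to share the row, while the one-to-many value compares each text against the whole reference set.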

In [6]:
# Subtract the reference value from the text value in the same row (one-to-one) for one feature column.
# The returned list holds one difference per text. Ex. feature_list[0] = difference for the first statement
# If the reference value cannot be used, the reference two rows further down is used instead

def one_to_one_dif(text, ref):
    feature_list = []
    feature = 0
    for i in range (len(text)):
        try:
            feature = (float(text[i]) - float(ref[i]))
        except Exception:
            try:
                feature = (float(text[i]) - float(ref[i+2]))
            except Exception:
                # Keep the previous difference when no usable reference is found
                pass

        feature_list.append(round(feature, 2))
    return feature_list
    
# Subtract every reference value from each text value (one-to-many) for one feature column,
# then average the differences per text. Ex. final_list[0] = average difference for the first statement
# If a reference value cannot be used, the reference two rows further down is used instead

def one_to_many_dif(text, ref):
    final_list = []
    for i in range(len(text)):
        # Start a fresh list for every text so that each average covers only its own differences
        avg_list = []
        feature_val = 0.0
        for j in range(len(ref)):
            try:
                feature_val = float(text[i]) - float(ref[j])
            except Exception:
                try:
                    feature_val = float(text[i]) - float(ref[j+2])
                except Exception:
                    pass
            avg_list.append(feature_val)
        avg = round(mean(avg_list), 2)
        final_list.append(avg)
    return final_list
    
    
# Input: text (DataFrame of the texts being evaluated), ref (DataFrame of the reference articles)
# Description: For every feature column, subtract the reference values from the text values, both one-to-one and
#              one-to-many (averaged). If text and ref are identical, the reference is rotated down by one row.
#
# Return: New DataFrame with the text, reference, label, original features, and relational features ('df1')

def multi_source_FE(text, ref):

    # Saving original headers in header
    header = list(text)
    
    # Only feature values
    cols = [col for col in text.columns if col not in ['Statement', 'Label']]
    
    # If the text and the reference are the same dataset, rotate the reference down by one row so that no text is compared with itself
    if(text.equals(ref)):
        ref = text.shift(periods = 1)
        ref.iloc[0] = text.iloc[len(text)-1]
    
    state_text = text['Statement']
    state_ref = ref['Statement']
    label_text = text['Label']
    data_text = text[cols]
    data_ref = ref[cols]

    feature_num = len(data_text.keys())


    # Create the local DataFrames that are used later in this definition.
    df = pd.DataFrame()
    sub = pd.DataFrame()
    avg_sub = pd.DataFrame()
    
    # Loop through the data for subtraction
    for i in data_text.keys():
        sub[i + "- sub"] = one_to_one_dif(data_text[i], data_ref[i])
        avg_sub[i + "- avg sub"] = one_to_many_dif(data_text[i], data_ref[i])
    
    # Creating dataframe with multi-sourced features
    df['Text'] = state_text
    df['Reference'] = state_ref
    df['Label'] = label_text
    
    # Concatenate the text, reference and features, then drop null values
    df1 = pd.concat([df, data_text, sub, avg_sub], axis = 1)
    df1 = df1.dropna(how = 'any')
    
    return df1



# Feature Analysis

This portion of the code evaluates the existing features for redundancy and irrelevancy through a preliminary graph analysis of the feature value distributions and, if necessary, transforms the distributions. The majority of the methods used in this notebook are based on:

https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

Other resources:

- intro to feature selection: https://quantdare.com/what-is-the-difference-between-feature-extraction-and-feature-selection/
- feature selection method: https://www.datacamp.com/community/tutorials/feature-selection-python
- feature selection tools: https://scikit-learn.org/stable/modules/feature_selection.html
- graph usage in representing data: https://365datascience.com/chart-types-and-how-to-select-the-right-one/

Issues:

"The most common way to assess linearity is to examine scatter plots and search for linear patterns. If patterns are not linear, it would be worthwhile to explore data transformations." Reliability has 2 discrete categories (reliable and unreliable), so how should bivariate analysis be done? A scatter plot can't be used, and a heat map doesn't seem to work.

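One possible answer to the bivariate question, offered as a suggestion and not as the method used in this notebook, is the point-biserial correlation, which measures how strongly a continuous feature separates a 0/1 label. The sketch below uses only the standard library; the helper name `point_biserial` and the data are made up for illustration.

```python
from math import sqrt
from statistics import mean, pstdev

def point_biserial(values, labels):
    """Correlation between a continuous feature and a 0/1 label."""
    group1 = [v for v, l in zip(values, labels) if l == 1]
    group0 = [v for v, l in zip(values, labels) if l == 0]
    n1, n0, n = len(group1), len(group0), len(values)
    spread = pstdev(values)  # population standard deviation of all values
    if spread == 0 or n1 == 0 or n0 == 0:
        return 0.0
    return (mean(group1) - mean(group0)) / spread * sqrt(n1 * n0 / n ** 2)

# Made-up word counts for reliable (1) and unreliable (0) texts
word_counts = [30, 28, 35, 12, 15, 10]
labels = [1, 1, 1, 0, 0, 0]
print(round(point_biserial(word_counts, labels), 3))
```

A value near 1 or -1 suggests the feature differs strongly between the two classes, and a value near 0 suggests it carries little information about reliability.
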
<h2>Deeper analysis on selected features</h2>

In [7]:
def ML_setup(topic):
    topic_list = topic.split(',')
    
    if len(topic_list) == 1:
        if(topic in subject_list):
            data = pd.read_csv(f'./feature_data/{topic}.csv')
            reference = setup_data('reference.csv', topic, 1)
    
    else:
        data = pd.DataFrame()
        reference = pd.DataFrame()
        for topics in topic_list:
            if (topics in subject_list):
                rel = pd.read_csv(f'./feature_data/{topics}.csv')
                data = pd.concat([data, rel], ignore_index = True)
                ref = setup_data('reference.csv', topics, 1)
                reference = pd.concat([reference, ref], ignore_index = True)

    reference_data = feature_extract(reference)
    return data, reference_data



# Test with Machine Learning

Using the chosen features, the news data is tested with machine learning algorithms.
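
The feature selection step ranks features by a one-way ANOVA F-statistic, which is what scikit-learn's `f_classif` computes. The sketch below reproduces that ranking in plain Python on made-up data to show what the score measures. The helper names `anova_f` and `top_k_features` are hypothetical, and the sketch assumes at least two label groups.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F-statistic of one feature across the label groups."""
    groups = {}
    for v, l in zip(values, labels):
        groups.setdefault(l, []).append(v)
    grand = mean(values)
    k, n = len(groups), len(values)
    # Variance between group means versus variance within the groups
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    return between / within if within else float("inf")

def top_k_features(table, labels, k):
    """Names of the k features with the largest F-statistic."""
    scores = {name: anova_f(col, labels) for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up feature columns for six texts (1 = reliable, 0 = unreliable)
labels = [1, 1, 1, 0, 0, 0]
table = {
    "word_count": [30, 28, 35, 12, 15, 10],
    "question_marks": [0, 1, 0, 1, 0, 1],
    "exclamation_marks": [0, 0, 1, 2, 3, 2],
}
print(top_k_features(table, labels, k=2))
```

Features whose values overlap heavily between the two classes receive a low score and are dropped first.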


In [8]:
def ML_train(data):    
    feature_headings = [col for col in data.columns if col not in ['Unnamed: 0', 'Text', 'Reference']]

    X = data[feature_headings].drop(data[feature_headings].columns[0], axis=1).abs()
    y = data["Label"]
    
    #anova
    n = 20
    bestfeatures = SelectKBest(score_func=f_classif, k=n)
    fit = bestfeatures.fit(X,y)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(X.columns)
    featureScores = pd.concat([dfcolumns,dfscores],axis=1)
    featureScores.columns = ['Features','Score']  #naming the dataframe columns
    selectedFeatures = featureScores.nlargest(n,'Score')['Features']
    
    X = X[selectedFeatures]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    #print(X_train)

    # ANOVA SVM-C
    # 1) anova filter, keep the n best-ranked features
    anova_filter = SelectKBest(f_regression, k=n)
    # 2) svm
    clf = svm.LinearSVC()

    anova_svm = make_pipeline(anova_filter, clf)
    anova_svm.fit(X_train, y_train)
    y_pred = anova_svm.predict(X_test)
    #print(y_pred)
    #print(y_test)

    accuracy = accuracy_score(y_test, y_pred)
    
    #print('accuracy: ', accuracy)

    #print(classification_report(y_test, y_pred))

    coef = anova_svm[:-1].inverse_transform(anova_svm['linearsvc'].coef_)
    #print(coef)
    
    return accuracy, anova_svm, selectedFeatures
    

In [9]:
# Collect the subjects from the three pickled files (pickles despite the .csv extension)
subject_list = set()

for name in ('reliable.csv', 'unreliable.csv', 'reference.csv'):
    with open(f'./news_data/{name}', 'rb') as f:
        subject_list.update(pickle.load(f).keys())


In [10]:
acc_list = []

for subject in subject_list:
    print(subject)
    data, ref = ML_setup(subject)
    #data.to_csv(f'./feature_data/{subject}.csv')
    
    acc, model, features = ML_train(data)
    acc_list.append(round(acc, 2))
print(acc_list)
print(mean(acc_list))

drugs
food-safety


KeyboardInterrupt: 

In [10]:
def final_machine(statement, topic, label = 1):
    topic_list = topic.split(',')
    data, reference = ML_setup(topic)
    accuracy, model, features = ML_train(data)
    statements = pd.DataFrame([statement], columns = ['description'])
    statements['label'] = label
    feature = feature_extract(statements)
    feature = multi_source_FE(feature, reference)
    
    return model.predict(feature[features])

#print(final_machine("Video shows Joe Biden saying “we can only re-elect Donald Trump.”", 'elections'))

In [None]:
train = read_data('train.tsv')
test = read_data('test.tsv')
valid = read_data('valid.tsv')

result_list = []

correct = 0
wrong = 0
for i in range(len(test)):
    line = test.iloc[i]
    try:
        print(line['topic'])
        result = final_machine(line['description'], line['topic'], line['label'])
        if(result == int(line['label'])):
            correct = correct + 1
        else:
            wrong = wrong + 1
        result_list.append(result)
    except Exception:
        # Skip the statements whose topics or features cannot be processed
        pass
    
print('correct: ', correct)
print('wrong: ', wrong)
        
print(result_list)


immigration
jobs
military,veterans,voting-record
medicare,message-machine-2012,campaign-advertising
campaign-finance,legal-issues,campaign-advertising
federal-budget,pensions,retirement
county-budget,county-government,education,taxes
economy,stimulus
gays-and-lesbians,marriage
foreign-policy
elections
ethics,message-machine
environment
federal-budget,military,poverty
city-government,county-government,unions
education,jobs
labor,state-budget
government-efficiency,government-regulation,polls
health-care
economy,history
crime,gays-and-lesbians,sexuality
elections
immigration
drugs,health-care,marijuana
jobs,state-budget
sotomayor-nomination,supreme-court
foreign-policy
education,state-budget
climate-change,energy
health-care,military
congress,supreme-court
crime,criminal-justice
baseball,recreation
message-machine,campaign-advertising
elections
health-care,history
health-care
abortion,legal-issues,women
economy,financial-regulation
states,taxes
debates,health-care
military,states
debates,