# Part I: Theory Questions

# Part II: Detection of Fake News

## Introduction


Classifiers are machine learning models that take a set of labeled examples and learn which class each example falls into. A trained classifier can then be asked to classify content it has never seen before, often with impressive accuracy. In this project, we try to predict whether newly shared news items are real or fake, using training data labeled as fake or real.


Naïve Bayes is a simple yet effective and commonly used machine learning classifier. Naïve Bayes classifiers have been especially popular for text classification and are a traditional solution for problems such as fake/real detection, so we use a Naïve Bayes classifier for this task.



## Naïve Bayes

Naïve Bayes classification is trained on a set of data of a certain size (e.g. 100 items). Each training item must have a class/category label. Using the probabilities estimated from the training data, new test data presented to the system is scored, and the classifier tries to determine which category each test item belongs to. In general, the more training data there is, the more reliably the actual category of the test data can be determined.

Naïve Bayes classification has many uses, but what matters here is how the data is classified rather than what is classified. In other words, the training data can be binary or text data; what matters is establishing the frequency-based relationship between the data and their classes, not the type of the data.

With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial ${\displaystyle (p_{1},\dots ,p_{n})}$ where ${\displaystyle p_{i}}$ is the probability that event $i$ occurs (or $K$ such multinomials in the multiclass case). A feature vector ${\displaystyle \mathbf {x} =(x_{1},\dots ,x_{n})}$ is then a histogram, with ${\displaystyle x_{i}}$ counting the number of times event $i$ was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (the bag-of-words assumption). The likelihood of observing a histogram $\mathbf{x}$ is given by

${\displaystyle p(\mathbf {x} \mid C_{k})={\frac {(\sum _{i}x_{i})!}{\prod _{i}x_{i}!}}\prod _{i}{p_{ki}}^{x_{i}}}$

The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space:[2]

${\displaystyle {\begin{aligned}\log p(C_{k}\mid \mathbf {x} )&\varpropto \log \left(p(C_{k})\prod _{i=1}^{n}{p_{ki}}^{x_{i}}\right)\\&=\log p(C_{k})+\sum _{i=1}^{n}x_{i}\cdot \log p_{ki}\\&=b+\mathbf {w} _{k}^{\top }\mathbf {x} \end{aligned}}}$

where ${\displaystyle b=\log p(C_{k})}$ and ${\displaystyle w_{ki}=\log p_{ki}}$

If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero. This is problematic because it will wipe out all information in the other probabilities when they are multiplied. Therefore, it is often desirable to incorporate a small-sample correction, called pseudocount, in all probability estimates such that no probability is ever set to be exactly zero. This way of regularizing naive Bayes is called Laplace smoothing when the pseudocount is one.
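
The log-space formulation and the pseudocount can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind `main.main`: the function names `train_multinomial_nb` and `predict`, the toy headlines, and the choice to skip words unseen in training are all hypothetical.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))  # b = log p(C_k)
        counts = Counter(tok for d in class_docs for tok in d)
        # The pseudocount alpha keeps every p_ki strictly positive
        # (alpha = 1 corresponds to Laplace smoothing).
        denom = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_lik

def predict(doc, log_prior, log_lik):
    """Return the class maximizing b + sum_i x_i * log p_ki.

    Words never seen in training are skipped (an assumption of this sketch).
    """
    scores = {
        c: log_prior[c] + sum(log_lik[c][t] for t in doc if t in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training data, invented for illustration only.
train_docs = [["north", "korea", "travel", "ban"],
              ["turnbull", "travel", "ban"],
              ["breaking", "soros", "watch"],
              ["breaking", "trump", "won"]]
train_labels = ["real", "real", "fake", "fake"]

log_prior, log_lik = train_multinomial_nb(train_docs, train_labels)
prediction = predict(["breaking", "travel", "soros"], log_prior, log_lik)
```

Because every term is a logarithm, the score is a sum rather than a product, which avoids numerical underflow on long documents.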




### Understanding The Data

In this section, we use the provided training data to identify words that are specific to real and to fake news. Instead of taking the highest-frequency words in each class, I thought it more logical to take the three most frequent words that occur in one class but not in the other.

In [1]:
import main
main.main(1,PrintCommand="specific")

3 Specific keys for detect real news:
  Word/WordPairs  Frequency
0          korea         62
1       turnbull         48
2         travel         47


 3 Specific keys for detect fake news:
  Word/WordPairs  Frequency
0       breaking         24
1          soros         18
2          woman         13
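
The selection rule described above (most frequent words that occur in one class and never in the other) could look roughly like the following. The helper name `class_specific_words` and the toy documents are assumptions for illustration; the real implementation in `main` is not shown in this report.

```python
from collections import Counter

def class_specific_words(docs_a, docs_b, k=3):
    """Top-k most frequent tokens in docs_a that never appear in docs_b."""
    counts_a = Counter(tok for doc in docs_a for tok in doc)
    seen_in_b = {tok for doc in docs_b for tok in doc}
    exclusive = Counter({w: c for w, c in counts_a.items() if w not in seen_in_b})
    return exclusive.most_common(k)

# Toy tokenized headlines, invented for illustration only.
real_docs = [["trump", "korea", "talks"], ["korea", "travel", "ban"], ["trump", "travel"]]
fake_docs = [["breaking", "trump", "won"], ["breaking", "soros"]]

real_keys = class_specific_words(real_docs, fake_docs)
fake_keys = class_specific_words(fake_docs, real_docs)
```

In this toy data "trump" is frequent in both classes, so it is excluded from both lists even though it is the most common word overall.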


### Implementing Naive Bayes

In this section, I used the multinomial naive Bayes classifier and kept the words in a bag-of-words structure. My algorithm is designed around an n-gram based model.



#### What is the N-gram Based Model?

N-gram modeling is a popular feature identification and analysis approach used in
language modeling and natural language processing fields.

An n-gram is a contiguous sequence of items of length n. It could be a sequence of
words, bytes, syllables, or characters. The most used n-gram models in text categorization
are word-based and character-based n-grams. Examples of commonly used n-gram models
include unigram (n=1), bigram (n=2), etc.

When building an n-gram based classifier, the size n is usually a fixed number
throughout the whole corpus. The unigram model is commonly known as the “bag of words”
model. In contrast to a higher-order n-gram model, the bag-of-words model does not take
word order into consideration. The n-gram model is one of the basic and
efficient models for text categorization and language processing. It allows automatic
capture of the most frequent words in the corpus. Character-based n-grams can also be
applied to any language, since they do not need segmentation of the text into words, and
they are robust against spelling mistakes and deformations, since they capture parts of words.

In this assignment, we use a word-based n-gram model to represent the context
of the document and generate features to classify the document. One of the goals of this
assignment is to develop a simple n-gram based classifier to differentiate between fake and real
news. The idea is to generate sets of n-gram frequency profiles from the
training data to represent fake and real news. We used two values of n (1 and 2) to
generate and extract the n-gram features.
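
As a rough sketch of what an n-gram frequency profile is, the snippet below counts word n-grams over tokenized headlines. The names `word_ngrams` and `frequency_profile` and the sample headlines are hypothetical and chosen only for this example.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return the contiguous word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequency_profile(documents, n):
    """Count every n-gram over a list of tokenized documents."""
    profile = Counter()
    for tokens in documents:
        profile.update(word_ngrams(tokens, n))
    return profile

# Toy tokenized headlines, invented for illustration only.
headlines = [["trump", "travel", "ban"], ["travel", "ban", "blocked"]]

unigram_profile = frequency_profile(headlines, 1)  # bag of words
bigram_profile = frequency_profile(headlines, 2)   # adjacent word pairs
```

One profile would be built per class, so that the same n-gram has a separate count for real and for fake news.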

#### Unigram 

With the unigram structure, we kept single words in the bag of words and tried to determine the class of new data by calculating the multinomial naive Bayes probabilities according to the frequencies of these words in the documents.

The results are below:

In [2]:
main.main(1,PrintCommand="General Results")

Correct classified news: 422
Steamming : False
TF-IDF : False
The occurrences of words : 1
Stopwords: None
Accuracy:  86.29856850715747
####################################
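
The reported accuracy is the share of correctly classified test items. The test-set size is not stated in this report; the value 489 used below is inferred from the printed figures (422 correct at about 86.30%) and should be read as an assumption.

```python
def accuracy_percent(correct, total):
    """Share of correctly classified items, as a percentage."""
    return 100.0 * correct / total

# 489 is inferred from the reported numbers, not given in the report.
ASSUMED_TEST_SIZE = 489

unigram_accuracy = accuracy_percent(422, ASSUMED_TEST_SIZE)
```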


#### Bigram

With the bigram structure, we kept word pairs in the bag of words and tried to determine the class of new data by calculating the multinomial naive Bayes probabilities according to the frequencies of these word pairs in the documents.

When this method is applied, sentence-boundary tokens are added (they appear as `_s_` and `_eos_` in the bigram output later in this report), since the probability of a word occurring at the beginning or the end of a sentence is also important.

The results are below:

In [3]:
main.main(2,PrintCommand="General Results")

Correct classified news: 418
Steamming : False
TF-IDF : False
The occurrences of words : 2
Stopwords: None
Accuracy:  85.48057259713701
####################################
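
A minimal sketch of how boundary tokens turn sentence position into bigram features is given below. The token strings `_s_` and `_eos_` are taken from the bigram output in this report; the function name `padded_bigrams` and the sample headline are assumptions.

```python
def padded_bigrams(tokens, start="_s_", end="_eos_"):
    """Bigrams of a headline after adding start and end boundary tokens."""
    padded = [start] + list(tokens) + [end]
    return [f"{a} {b}" for a, b in zip(padded, padded[1:])]

pairs = padded_bigrams(["breaking", "trump", "won"])
# pairs == ["_s_ breaking", "breaking trump", "trump won", "won _eos_"]
```

With this padding, a headline that opens with "breaking" yields the feature `_s_ breaking`, which is distinct from "breaking" appearing mid-sentence.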


### Analyzing the effect of words on prediction

In this section I applied TF-IDF to normalize the word frequencies, so that I could decide which words were more important for a document. Then I listed the words whose presence or absence is most informative for the classification.

Because the classification is binary, I only needed two lists: a word whose presence most strongly predicts real news is also a word whose absence most strongly predicts fake news, and the reverse holds for the other class.

I repeated the analysis for unigrams and bigrams.
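
For readers unfamiliar with TF-IDF, a bare-bones version of the weighting is sketched here. It assumes the common `tf * log(N / df)` form; the exact variant used by `main.main(..., tfidf=True)` is not specified in this report, and the function name `tfidf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Per-document TF-IDF weights using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    weighted = []
    for doc in documents:
        term_freq = Counter(doc)
        weighted.append({t: tf * math.log(n_docs / doc_freq[t])
                         for t, tf in term_freq.items()})
    return weighted

# Toy tokenized headlines, invented for illustration only.
docs = [["trump", "travel", "ban"], ["trump", "korea"], ["breaking", "trump"]]
weights = tfidf_weights(docs)
```

In this toy corpus "trump" occurs in every document, so its weight is zero, while rarer words such as "travel" receive a positive weight. This is the sense in which TF-IDF highlights words that are important for a particular document.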

#### 1. In Unigram

In [4]:
main.main(1,tfidf=True,PrintCommand="presence")


10 words whose presence most strongly predicts that the news is real.
And whose absence most strongly predicts that the news is fake.

  Word/WordPairs  Frequency
0          korea  17.346456
1         travel  14.625705
2       turnbull  14.284775
3      australia  10.034324
4        climate   7.774017
5          paris   6.784150
6        refugee   6.766278
7         debate   5.928145
8           asia   5.711946
9          flynn   5.455626


In [5]:
main.main(1,tfidf=True,PrintCommand="absence")


10 words whose absence most strongly predicts that the news is real.
And whose presence most strongly predicts that the news is fake.

  Word/WordPairs  Frequency
0       breaking   6.244817
1          soros   4.452072
2          woman   3.473386
3          steal   3.381297
4           duke   3.132043
5         reason   3.057646
6      interview   2.849209
7             dr   2.819662
8       homeless   2.798578
9             my   2.732481


#### 2. In Bigram

In [6]:
main.main(2,tfidf=True,PrintCommand="presence")


10 words whose presence most strongly predicts that the news is real.
And whose absence most strongly predicts that the news is fake.

     Word/WordPairs  Frequency
0       north korea  12.453395
1        travel ban   9.930787
2         ban _eos_   7.058293
3       korea _eos_   6.061345
4      _s_ turnbull   4.586553
5      trump travel   4.498065
6  malcolm turnbull   3.964054
7     trumps travel   3.827668
8       james comey   3.685592
9    comments _eos_   3.495364


In [7]:
main.main(2,tfidf=True,PrintCommand="absence")


10 words whose absence most strongly predicts that the news is real.
And whose presence most strongly predicts that the news is fake.

   Word/WordPairs  Frequency
0       _s_ watch   5.008082
1     _s_ comment   4.463473
2    _s_ breaking   4.233264
3       trump won   2.782663
4      daily wire   2.602999
5      wire _eos_   2.602999
6      voting for   2.569260
7        will win   2.294462
8       fame star   2.266078
9  breaking trump   2.144844


### StopWords

Stopwords are insignificant words in a language that create noise when used as
features in text classification. These are words commonly used in many sentences to help
connect thoughts or to assist the sentence structure. Articles, prepositions,
conjunctions, and some pronouns are considered stopwords. We removed common words
such as a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these,
this, too, was, what, when, where, who, will, etc. These words were removed from each
document, and the classification success was then re-evaluated.
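
Stopword removal itself is a simple filter over the tokens. The sketch below uses only the example words named in the paragraph above; the actual `stopWords="english"` list used by `main.main` is likely larger and is not reproduced here. The names `STOPWORDS` and `remove_stopwords` are hypothetical.

```python
# Only the examples listed in the text; the real "english" list is an unknown here.
STOPWORDS = {
    "a", "about", "an", "are", "as", "at", "be", "by", "for", "from", "how",
    "in", "is", "of", "on", "or", "that", "the", "these", "this", "too",
    "was", "what", "when", "where", "who", "will",
}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that is a stopword (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

filtered = remove_stopwords(["The", "travel", "ban", "is", "on"])
# filtered == ["travel", "ban"]
```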

#### 1. In Unigram

In [8]:
main.main(1,stopWords="english",tfidf=True,PrintCommand="presence")


10 words whose presence most strongly predicts that the news is real.
And whose absence most strongly predicts that the news is fake.

  Word/WordPairs  Frequency
0          korea  19.479725
1       turnbull  15.795719
2         travel  15.778778
3      australia  11.153182
4        climate   8.179058
5        refugee   7.159126
6          paris   7.103846
7         debate   6.317261
8           asia   6.135109
9       congress   5.973006


In [9]:
main.main(1,stopWords="english",tfidf=True,PrintCommand="absence")


10 words whose absence most strongly predicts that the news is real.
And whose presence most strongly predicts that the news is fake.

  Word/WordPairs  Frequency
0       breaking   6.750180
1          soros   4.805717
2          steal   3.967294
3          woman   3.854602
4         reason   3.508850
5           duke   3.389549
6      interview   3.158593
7             dr   3.100562
8       homeless   2.946114
9      landslide   2.903458


#### 2. In Bigram

In [10]:
main.main(2,stopWords="english",tfidf=True,PrintCommand="presence")


10 words whose presence most strongly predicts that the news is real.
And whose absence most strongly predicts that the news is fake.

     Word/WordPairs  Frequency
0       north korea  14.366889
1        travel ban  10.909364
2         ban _eos_   7.634519
3       korea _eos_   7.003623
4      _s_ turnbull   5.723430
5      trump travel   4.941856
6  malcolm turnbull   4.455249
7     trumps travel   4.272293
8    comments _eos_   4.210005
9   australia _eos_   4.059922


In [11]:
main.main(2,stopWords="english",tfidf=True,PrintCommand="absence")


10 words whose absence most strongly predicts that the news is real.
And whose presence most strongly predicts that the news is fake.

   Word/WordPairs  Frequency
0       _s_ watch   6.038749
1     _s_ comment   5.572620
2    _s_ breaking   4.798081
3       trump won   4.240658
4      daily wire   2.875408
5      wire _eos_   2.875408
6       fame star   2.572662
7  breaking trump   2.410352
8    george soros   2.364295
9         _s_ cnn   2.272511


### StopWords Removal Analysis

Normally, stopwords are words that are likely to occur frequently in all classes, so it seems illogical to include them among the words considered for classification. However, this may vary according to the content being classified. For example, with unigrams, removing stopwords may increase classification success. With bigrams, however, some words form distinct features together with stopwords, and removing the stopwords may decrease classification success.

For our classification, the effect may vary according to the given training data. For example:

The word "to" may be used in real news but never in fake news. In this case, a new item that contains "to" many times has a higher probability of being classified as real. When we remove this word, the probability of the item being assigned to the fake class may increase.

As another example, the bigram "to trump" may increase the probability of the fake class. When we remove the word "to", this probability can be greatly reduced. Many such situations arise when using bigrams.

When this type of situation is taken into consideration, removing stopwords has a negative effect for the training data we use, so I think it is not logical to remove stopwords here.
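
The bigram effect described above can be made concrete with a tiny example: removing a stopword before building bigrams deletes the pairs that contained it and creates a new pair that never occurred in the original text. The stopword set and the headline below are invented for illustration.

```python
def bigrams(tokens):
    """Adjacent word pairs of a token list."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

stopwords = {"for", "to", "the"}  # tiny illustrative set, not the real list
tokens = ["voting", "for", "trump"]

before = bigrams(tokens)
after = bigrams([t for t in tokens if t not in stopwords])
# before == ["voting for", "for trump"]
# after  == ["voting trump"]
```

A pair such as "voting for" appears among the fake-news bigrams in the listing without stopword removal and is absent from the listing with stopword removal, which is consistent with this mechanism.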

### Stemming

After tokenizing the data, the next step is to transform the tokens into a standard form.
Stemming, simply put, reduces words to their root form and so decreases the
number of word types in the data. For example, the words “running,” “runs,”
and “runner” can be reduced to the word “run” (irregular forms such as “ran” generally
require lemmatization rather than stemming). We use stemming to make classification
faster and more efficient.

As with stopword removal, the effect of stemming on the success rate depends on the given data. For this classification, I can say that it is not the right method: accuracy increases only when bigrams are used, and the same success cannot be achieved with unigrams.
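
To show how stemming shrinks the vocabulary, here is a deliberately crude suffix stripper. It is a toy stand-in and not the stemmer used in this assignment (which is not named in the report); real stemmers such as the Porter algorithm apply many more rules. The name `toy_stem` and its suffix list are assumptions.

```python
def toy_stem(word, suffixes=("ing", "ers", "er", "s")):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

words = ["Running", "Runner", "runs", "Ran"]
stems = {toy_stem(w) for w in words}
# Four surface forms collapse to two types: {"run", "ran"}
```

Merging surface forms reduces the number of features, but it can also merge words that were useful to keep apart, which may be why the effect on accuracy differs between the unigram and bigram runs.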

### Test Results & Conclusion

The classification results on the given test data for all configurations are below.

In [12]:
main.All_Results(2)

    N_gram Stop words   Stem  TF-IDF  Correct classified   Accuracy
0        1       None  False   False                 422  86.298569
1        1       None  False    True                 420  85.889571
2        1       None   True   False                 414  84.662577
3        1       None   True    True                 409  83.640082
4        1    english  False   False                 412  84.253579
5        1    english  False    True                 409  83.640082
6        1    english   True   False                 396  80.981595
7        1    english   True    True                 401  82.004090
8        2       None  False   False                 418  85.480573
9        2       None  False    True                 417  85.276074
10       2       None   True   False                 421  86.094070
11       2       None   True    True                 414  84.662577
12       2    english  False   False                 396  80.981595
13       2    english  False    True            

As the results show, two configurations are the most suitable for the given data. One is to use unigrams on the raw data, without any preprocessing. The other is to use bigrams with stemmed words. In the other cases, classification success was adversely affected.

In conclusion, by classifying news with naive Bayes over word features and applying various operations to the words, we have learned that different results are obtained when the words are represented according to different criteria, and that additional meaning can be captured by looking at the relationships between words. With this information, classification improved in some cases and became worse in others.

##                          $$\\Muhammed\,Enes\\KOÇAK\\21427119$$