# Part I: Theory Questions

#### What are the differences between logistic regression and linear regression?

Linear regression predicts a continuous target, whereas logistic regression predicts a
categorical target with a limited number of classes. For example, to predict weight or
height you should use linear regression, but to predict a color or another categorical
class you should use logistic regression. Linear regression is based on least-squares
estimation: it minimizes the sum of the squared distances between each observed response
and its fitted value. Logistic regression is based on maximum likelihood estimation: it
maximizes the probability of Y given X.
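
The two estimation principles can be contrasted on toy data. The sketch below is only an illustration: the function names, the data, the learning rate and the step count are all assumptions, and plain gradient ascent stands in for whatever optimizer a real library would use.

```python
import numpy as np

def fit_linear_least_squares(X, y):
    """Return the weights that minimize the sum of squared residuals ||Xw - y||^2."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def fit_logistic_mle(X, y, lr=0.1, steps=2000):
    """Return weights that (approximately) maximize sum_i log P(y_i | x_i)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # P(y = 1 | x) under the current weights
        w += lr * X.T @ (y - p) / len(y)   # gradient ascent on the mean log-likelihood
    return w

# Toy data (assumed for illustration); the first column of ones is the intercept.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])

y_cont = 2.0 * x + 1.0                             # continuous target
y_bin = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])   # binary target, not perfectly separable

w_lin = fit_linear_least_squares(X, y_cont)   # recovers intercept 1 and slope 2
w_log = fit_logistic_mle(X, y_bin)            # positive slope: P(y = 1) grows with x
```

The linear model returns a number on the scale of the target, while the logistic model returns a probability that must be thresholded to obtain a class.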

#### What are the differences between logistic regression and naive Bayes methods?

Naive Bayes and logistic regression are a "generative-discriminative pair", meaning they
have the same model form of a linear classifier but estimate their parameters in different
ways. Both have the same hypothesis space but optimize different objective functions.
In particular, logistic regression maximizes the conditional log-likelihood

$$\sum _ { i } \log P _ { \theta } \left( y _ { i } | x _ { i } \right)$$
whereas naive Bayes maximizes the joint log-likelihood
$$\sum _ { i } \log P _ { \theta } \left( y _ { i } | x _ { i } \right) + \sum _ { i } \log P _ { \theta , \phi } \left( x _ { i } \right)$$

Naive Bayes is often the better choice for small data sets, where logistic regression is
more likely to overfit. For large data sets, logistic regression usually wins.

#### Which of the following statements are true?

$\cdot$ A two layer (one input layer, one output layer; no hidden layer) neural network
can represent the XOR function. $\rightarrow F$ (XOR is not linearly separable, so at least one hidden layer is needed.)

$\cdot$ Any logical function over binary-valued (0 or 1) inputs $x _ { 1 }$ and $x _ { 2 }$ can be
(approximately) represented using some neural network. $\rightarrow T$

$\cdot$ Suppose you have a multi-class classification problem with three classes,
trained with a 3 layer network. Let $a _ { 1 } ^ { ( 3 ) } = \left( h _ { \Theta } ( x ) \right) _ { 1 }$ be the activation of
the first output unit and similarly $a _ { 2 } ^ { ( 3 ) } = \left( h _ { \Theta } ( x ) \right) _ { 2 }$ and $a _ { 3 } ^ { ( 3 ) } = \left( h _ { \Theta } ( x ) \right) _ { 3 } .$ Then
for any input $x ,$ it must be the case that $a _ { 1 } ^ { ( 3 ) } + a _ { 2 } ^ { ( 3 ) } + a _ { 3 } ^ { ( 3 ) } = 1$. $\rightarrow F$ (The outputs are only forced to sum to 1 if the output layer normalizes them, for example with softmax.)

$\cdot$ The activation values of the hidden units in a neural network, with the sigmoid
activation function applied at every layer, are always in the range $( 0,1 )$ . $\rightarrow T$
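
As an illustration of the second statement, the sketch below shows a network with one hidden layer computing XOR exactly. The weights are hand-picked for illustration (they are not learned), and a step activation is assumed instead of a sigmoid to keep the arithmetic easy to follow.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (np.asarray(z) > 0).astype(int)

def xor_net(x1, x2):
    """XOR built from a 2-unit hidden layer: AND(OR(x1, x2), NAND(x1, x2))."""
    x = np.array([x1, x2])
    # Column 0 implements OR, column 1 implements NAND.
    W1 = np.array([[1.0, -1.0],
                   [1.0, -1.0]])
    b1 = np.array([-0.5, 1.5])
    h = step(x @ W1 + b1)
    # The output unit implements AND of the two hidden units.
    W2 = np.array([1.0, 1.0])
    b2 = -1.5
    return int(step(h @ W2 + b2))

truth_table = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Here `truth_table` evaluates to `[0, 1, 1, 0]`, which is the XOR truth table.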

#### How to decide the number of hidden layers and nodes in a hidden layer?

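With cross-validation: train candidate architectures with different numbers of hidden layers and nodes, and keep the one with the best validation performance.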
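
A minimal sketch of such a search is given below. It assumes a hypothetical user-supplied callable `fit_and_score(candidate, X_train, y_train, X_val, y_val)` that trains a network with the given architecture and returns a validation score; that callable, the helper names and the candidate format are illustrative assumptions, not part of the assignment code.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def select_architecture(candidates, X, y, fit_and_score, k=5):
    """Return the candidate with the highest mean validation score.

    candidates: hashable descriptions, e.g. (n_hidden_layers, n_nodes) tuples.
    fit_and_score: hypothetical callable that trains on the training fold
    and returns a score (higher is better) on the validation fold.
    """
    mean_scores = {}
    for cand in candidates:
        scores = [fit_and_score(cand, X[tr], y[tr], X[va], y[va])
                  for tr, va in k_fold_indices(len(X), k)]
        mean_scores[cand] = float(np.mean(scores))
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores
```

Every candidate is evaluated on the same folds (the seed is fixed), so differences in the mean score come from the architecture rather than from the split.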

# PART II: Classification of Flowers using Neural Network

## Introduction


Developing a system for classifying flowers is a difficult task because of the considerable similarities among different classes. Flower classification is useful in floriculture, flower searching for patent analysis, etc. Since these activities are currently done manually and are very labor intensive, automating the classification of flower images is a necessary task.

In this assignment we have a collection of 3000 flower images belonging to five different classes. We will build an artificial neural network using these flowers as training data. First we will classify with a single-layer neural network, then compare the results with those of a multilayer neural network, and use our experiments to determine the most appropriate structure for flower classification.

## How artificial neural networks work

A neural network usually involves a large number of processors operating in parallel and arranged in tiers. The first tier receives the raw input information, analogous to the optic nerves in human visual processing. Each successive tier receives the output from the tier preceding it rather than the raw input, in the same way that neurons further from the optic nerve receive signals from those closer to it. The last tier produces the output of the system. Each processing node has its own small sphere of knowledge, including what it has seen and any rules it was originally programmed with or developed for itself.

Neural networks are notable for being adaptive, which means they modify themselves as they learn from initial training, and subsequent runs provide more information about the world. The most basic learning model is centered on weighting the input streams: each node weights the importance of the input from each of its predecessors, and inputs that contribute to getting right answers are weighted higher.





## How neural networks learn
Unlike other algorithms, neural networks cannot be programmed directly for the task. Rather, just like a child's developing brain, they need to learn from information. There are three main learning strategies:

- Supervised learning: This learning strategy is the simplest, as there is a labeled dataset, which the computer goes through, and the algorithm gets modified until it can process the dataset to get the desired result.

- Unsupervised learning: This strategy is used when no labeled dataset is available to learn from. The neural network analyzes the dataset, and a cost function defined on the data itself (rather than on labels) tells the network how far off target it was. The network then adjusts to reduce that cost.

- Reinforcement learning: In this strategy, the neural network is rewarded for positive results and penalized for negative results, forcing it to learn over time.

We will use the supervised learning method in this assignment.

## Parameter Initialization

In [1]:
import numpy as np

class Neural_Network(object):
    def __init__(self, layersize, nodesize, activation, shape1, shape2):
        # layersize: number of hidden layers, nodesize: units per hidden layer
        # shape1: number of input features, shape2: number of output classes
        self.inputSize = shape1
        self.outputSize = shape2
        self.batch = 0
        self.batcherror = np.zeros((1, shape2))
        self.hiddenSize = nodesize
        self.layersize = layersize

        # select the activation function and its derivative
        if activation == "sigmoid":
            self.activationfunc = self.sigmoid
            self.activationDerivative = self.derivative_sigmoid
        elif activation == "relu":
            self.activationfunc = self.ReLU
            self.activationDerivative = self.derivative_Relu
        else:
            raise ValueError("activation must be 'sigmoid' or 'relu'")

        # layer widths from input to output: [input, hidden, ..., hidden, output]
        sizes = [self.inputSize] + [self.hiddenSize] * self.layersize + [self.outputSize]

        # weights are drawn uniformly from [-1, 1); biases start at 1
        self.W = [2 * np.random.random([n_in, n_out]) - 1
                  for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        self.B = [np.ones((1, n_out)) for n_out in sizes[1:]]


We first initialize the weight matrices and the bias vectors. We should not initialize all the parameters to zero (or to any single shared value), because then every unit in a layer computes the same output and receives the same gradient, so the units never become different from one another and the network cannot learn useful features. Therefore the weights are initialized randomly, here to values drawn uniformly from [-1, 1), while the biases start at 1.
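
The symmetry problem can be seen in a small experiment. The sketch below is an illustration with assumed toy shapes (4 samples, 3 features, 5 hidden units) and a standalone `sigmoid` helper; it is not part of the class above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((4, 3))                  # 4 samples, 3 features (toy data)

W_const = np.full((3, 5), 0.5)          # every hidden unit gets identical weights
W_rand = 2 * rng.random((3, 5)) - 1     # uniform in [-1, 1)

h_const = sigmoid(X @ W_const)
h_rand = sigmoid(X @ W_rand)

# Compare every hidden unit's activations with those of the first hidden unit.
identical_const = np.allclose(h_const, h_const[:, :1])
identical_rand = np.allclose(h_rand, h_rand[:, :1])
```

With shared weights all five hidden units produce the same column of activations (`identical_const` is `True`), so they are redundant; with random weights the columns differ and each unit can specialize.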

Next, we need to choose an activation function. I used two different activation functions:

- "sigmoid": Its output lies in the range (0, 1). It is therefore especially suited to models that have to predict a probability as output, since a probability also lies between 0 and 1.

- "ReLU": Its range is [0, infinity). Any negative input is mapped to zero immediately, while positive inputs pass through unchanged. Zeroing all negative values can decrease the ability of the model to fit the data properly, because units whose inputs stay negative output zero and stop learning.
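
The class refers to `sigmoid`, `derivative_sigmoid`, `ReLU` and `derivative_Relu`, whose bodies are not shown here. A minimal standalone sketch of what such functions usually look like is given below; it assumes the derivatives take the pre-activation value `z`, which may differ from the convention used in the actual class.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def derivative_sigmoid(z):
    """d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    """Keep positive values, map negative values to 0."""
    return np.maximum(0.0, z)

def derivative_relu(z):
    """1 where z > 0, else 0 (the value at exactly 0 is a convention)."""
    return (np.asarray(z) > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
relu_out = relu(z)                # negative entry is zeroed, positive passes through
relu_grad = derivative_relu(z)    # zero gradient wherever the input was not positive
```

The zero gradient for non-positive inputs is what makes a ReLU unit stop learning when its input stays negative.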

### Forward Propagation & Loss Function

The natural step after initializing the model at random is to check its performance.
We start from the input, pass it through the network layers, and calculate the actual output of the model in a straightforward manner.

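In [3]:
    def forward(self, X):
        # forward propagation through the network
        self.outsa = [X]      # activations of every layer, starting with the input
        self.z = X
        for i in range(len(self.W)):
            # affine step followed by the chosen activation function
            self.z = self.z.dot(self.W[i]) + self.B[i]
            self.z = self.activationfunc(self.z)
            self.outsa.append(self.z)

        # turn the final activations into a probability distribution
        outsoft = self.softmax(self.z)
        return outsoft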


This step is called forward-propagation, because the calculation flow is going in the natural forward direction from the input -> through the neural network -> to the output.

At the loss stage we have, on the one hand, the actual output of the randomly initialized neural network and, on the other hand, the desired output we would like the network to learn.

We use cross-entropy as the loss function. Before computing the loss, we pass the outputs through the softmax function to turn them into a probability distribution.





### What is Softmax?

The softmax function takes an N-dimensional vector of real numbers and transforms
it into a vector of real numbers in the range $( 0,1 )$ that add up to 1:

$$p_i = \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}$$

As the name suggests, the softmax function is a "soft" version of the max function. Instead of selecting one maximum value, it splits the whole (1) among the elements, with the maximal element getting the largest portion of the distribution and the smaller elements getting some of it as well.

Because the softmax function outputs a probability distribution, it is suitable for probabilistic interpretation in classification tasks.


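In [6]:
    def softmax(self, x):
        # simple softmax: exponentiate, then normalize so the outputs sum to 1
        # note: the sum runs over the whole array, so x should hold a single sample
        exps = np.exp(x)
        return exps / np.sum(exps)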
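
A variant that is often used in practice subtracts the row maximum before exponentiating, which avoids overflow for large inputs, and normalizes each row separately so that a batch of samples can be processed at once. The sketch below is such a variant (the name `stable_softmax` is an assumption); it is not the method used in the class above.

```python
import numpy as np

def stable_softmax(a):
    """Row-wise softmax that stays finite for large inputs."""
    a = np.asarray(a, dtype=float)
    # Subtracting the maximum leaves the result unchanged, because the factor
    # e^(-max) cancels between numerator and denominator.
    shifted = a - np.max(a, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

p = stable_softmax([1.0, 2.0, 3.0])                      # one sample
batch = stable_softmax([[1000.0, 1001.0], [0.0, 0.0]])   # two samples
```

Each row of the result sums to 1, the largest input gets the largest share, and equal inputs share the probability equally.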


### What is Cross-Entropy?

Cross-entropy indicates the distance between what the model believes the output
distribution should be and what the original distribution really is. It is defined
as

$$H ( y , p ) = - \sum_i \left[ y_i \cdot \log ( p_i ) + ( 1 - y_i ) \cdot \log ( 1 - p_i ) \right]$$

Cross-entropy is a widely used alternative to squared error. It is used when node activations can be understood as representing the probability that each hypothesis might be true, i.e. when the output is a probability distribution. Thus it is used as a loss function in neural networks that have softmax activations in the output layer.

In [7]:
    def cross_entropy(self, out, y):
        # E = -sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
        # natural log is the usual convention for cross-entropy
        neg_log_likelihood = -((y * np.log(out)) + ((1 - y) * np.log(1 - out)))

        return neg_log_likelihood.sum()
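
A small worked example of the formula is sketched below. The clipping constant `eps` is an added safeguard against `log(0)` and the probability vectors are made-up values; neither comes from the assignment code.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """H(y, p) = -sum_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]."""
    p = np.clip(p, eps, 1.0 - eps)   # keep the logarithms finite
    return float(-np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([0.0, 1.0, 0.0])          # one-hot target: the true class is index 1
good = np.array([0.05, 0.90, 0.05])    # confident and correct prediction
bad = np.array([0.70, 0.10, 0.20])     # confident and wrong prediction

loss_good = cross_entropy(good, y)
loss_bad = cross_entropy(bad, y)
```

`loss_good` is much smaller than `loss_bad`, and the loss approaches 0 as the predicted distribution approaches the target.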

#### What is the N-gram Based Model?

N-gram modeling is a popular feature identification and analysis approach used in
language modeling and natural language processing.

An n-gram is a contiguous sequence of items of length n. It could be a sequence of
words, bytes, syllables, or characters. The most used n-gram models in text categorization
are word-based and character-based n-grams. Commonly used examples include the unigram
(n=1), the bigram (n=2), etc.

When building an n-gram based classifier, the size n is usually a fixed number
throughout the whole corpus. The unigram model is commonly known as the "bag of words"
model. Unlike higher-order n-gram models, the bag of words model does not take word
order into consideration. The n-gram model is one of the basic and
efficient models for text categorization and language processing. It allows automatic
capture of the most frequent words in the corpus, and in its character-based form it can
be applied to any language since it does not need segmentation of the text into words.
Furthermore, it is robust against spelling mistakes and deformations since it recognizes
parts of phrases and words.

In this assignment, we use a word-based n-gram model to represent the context
of the document and generate features to classify the document. One of the goals of this
assignment is to develop a simple n-gram based classifier to differentiate between fake and real
news. The idea is to generate various sets of n-gram frequency profiles from the
training data to represent fake and real news. We used two values of n to
generate and extract the n-gram features.
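
Extracting word-based n-grams and building a frequency profile can be sketched as follows. The function names and the whitespace tokenization are assumptions made for illustration; the tokenizer used in the assignment code is not shown.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return the word-based n-grams of a text as a list of tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_profile(documents, n):
    """Count how often each n-gram occurs across a list of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(word_ngrams(doc, n))
    return counts

unigrams = word_ngrams("Trump travel ban", 1)
bigrams = word_ngrams("Trump travel ban", 2)
profile = ngram_profile(["travel ban", "new travel ban"], 2)
```

For the three-word example, `unigrams` holds three 1-tuples and `bigrams` holds `("trump", "travel")` and `("travel", "ban")`; in `profile` the bigram `("travel", "ban")` has count 2.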

#### Unigram 

With the unigram structure, we kept single words in the bag-of-words representation and determined the class of new data by calculating multinomial naive Bayes probabilities according to the document frequencies of these words.

The application results are below:

In [2]:
main.main(1,PrintCommand="General Results")

Correct classified news: 422
Steamming : False
TF-IDF : False
The occurrences of words : 1
Stopwords: None
Accuracy:  86.29856850715747
####################################
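
The multinomial naive Bayes scoring behind results like the one above can be sketched as follows. This is a minimal illustration, not the assignment's `main` module: the function names, the add-one (Laplace) smoothing, the choice to ignore unseen words and the toy training data are all assumptions.

```python
import math
from collections import Counter

def train_multinomial_nb(docs_by_class):
    """docs_by_class maps a class label to a list of token lists."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    vocab = {tok for docs in docs_by_class.values() for doc in docs for tok in doc}
    model = {"vocab": vocab, "log_prior": {}, "counts": {}, "totals": {}}
    for label, docs in docs_by_class.items():
        model["log_prior"][label] = math.log(len(docs) / total_docs)
        counts = Counter(tok for doc in docs for tok in doc)
        model["counts"][label] = counts
        model["totals"][label] = sum(counts.values())
    return model

def predict(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    v = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        for tok in tokens:
            if tok in model["vocab"]:   # words never seen in training are ignored
                count = model["counts"][label][tok]
                score += math.log((count + 1) / (model["totals"][label] + v))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (made up for illustration).
train = {
    "real": [["north", "korea", "travel", "ban"], ["turnbull", "climate", "debate"]],
    "fake": [["breaking", "soros", "steal"], ["breaking", "trump", "won"]],
}
model = train_multinomial_nb(train)
label_a = predict(model, ["breaking", "news"])
label_b = predict(model, ["korea", "climate"])
```

With this toy data `label_a` is `"fake"` and `label_b` is `"real"`. Working with log probabilities avoids numerical underflow when many small probabilities are multiplied.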


#### Bigram

With the bigram structure, we kept word pairs in the bag-of-words representation and determined the class of new data by calculating multinomial naive Bayes probabilities according to the document frequencies of these word pairs.

When this method is applied, sentence-boundary tokens are added (they appear as `_s_` and `_eos_` in the bigram tables later in this report), since the probability of a word being at the beginning or the end of a sentence is also important.

The application results are below:

In [3]:
main.main(2,PrintCommand="General Results")

Correct classified news: 418
Steamming : False
TF-IDF : False
The occurrences of words : 2
Stopwords: None
Accuracy:  85.48057259713701
####################################
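
The sentence-boundary padding used for bigrams can be sketched as below. The token names `_s_` and `_eos_` are taken from the result tables in this report; the function name and the whitespace tokenization are assumptions.

```python
def padded_bigrams(sentence, start="_s_", end="_eos_"):
    """Return word bigrams of a sentence, including start and end markers."""
    words = [start] + sentence.lower().split() + [end]
    return [" ".join(pair) for pair in zip(words, words[1:])]

pairs = padded_bigrams("Trump travel ban")
```

Here `pairs` is `["_s_ trump", "trump travel", "travel ban", "ban _eos_"]`, so the first and last words of a sentence each contribute a feature of their own.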


### Analyzing the effect of words on prediction

In this section I applied TF-IDF to normalize the word frequencies, so that I could decide which words were more important for a document. Then I listed the words whose presence or absence is most informative for the classification.

I analyzed only two cases, because the problem is binary: a word whose absence predicts fake news is a word whose presence predicts real news, and vice versa, so the remaining combinations are covered by these two lists.

I repeated the analysis for unigrams and bigrams.
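
A minimal TF-IDF sketch is shown below. It assumes the plain variant (raw term count multiplied by `log(N / df)`); the exact weighting and smoothing used in the assignment code are not shown, so the numbers it produces are not comparable to the tables that follow.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_counts = Counter(doc)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in term_counts.items()})
    return weights

# Toy documents (made up for illustration).
docs = [["north", "korea", "travel"], ["travel", "ban", "travel"], ["breaking", "travel"]]
w = tf_idf(docs)
```

The word "travel" occurs in every toy document, so its weight is 0 everywhere, while a word that occurs in a single document, such as "korea", gets the weight `log(3)`.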

#### 1.In Unigram

In [4]:
main.main(1,tfidf=True,PrintCommand="presence")


10 words whose presence most strongly predicts that the news is real.
And whose absence most strongly predicts that the news is fake.

  Word/WordPairs  Frequency
0          korea  17.346456
1         travel  14.625705
2       turnbull  14.284775
3      australia  10.034324
4        climate   7.774017
5          paris   6.784150
6        refugee   6.766278
7         debate   5.928145
8           asia   5.711946
9          flynn   5.455626


In [5]:
main.main(1,tfidf=True,PrintCommand="absence")


10 words whose absence most strongly predicts that the news is real.
And whose presence most strongly predicts that the news is fake.

  Word/WordPairs  Frequency
0       breaking   6.244817
1          soros   4.452072
2          woman   3.473386
3          steal   3.381297
4           duke   3.132043
5         reason   3.057646
6      interview   2.849209
7             dr   2.819662
8       homeless   2.798578
9             my   2.732481


#### 2.In Bigram

In [6]:
main.main(2,tfidf=True,PrintCommand="presence")


10 words whose presence most strongly predicts that the news is real.
And whose absence most strongly predicts that the news is fake.

     Word/WordPairs  Frequency
0       north korea  12.453395
1        travel ban   9.930787
2         ban _eos_   7.058293
3       korea _eos_   6.061345
4      _s_ turnbull   4.586553
5      trump travel   4.498065
6  malcolm turnbull   3.964054
7     trumps travel   3.827668
8       james comey   3.685592
9    comments _eos_   3.495364


In [7]:
main.main(2,tfidf=True,PrintCommand="absence")


10 words whose absence most strongly predicts that the news is real.
And whose presence most strongly predicts that the news is fake.

   Word/WordPairs  Frequency
0       _s_ watch   5.008082
1     _s_ comment   4.463473
2    _s_ breaking   4.233264
3       trump won   2.782663
4      daily wire   2.602999
5      wire _eos_   2.602999
6      voting for   2.569260
7        will win   2.294462
8       fame star   2.266078
9  breaking trump   2.144844


### StopWords

Stopwords are insignificant words in a language that create noise when used as
features in text classification. They are words commonly used in many sentences to help
connect thoughts or to assist in the sentence structure. Articles, prepositions,
conjunctions and some pronouns are considered stopwords. We removed common words
such as a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these,
this, too, was, what, when, where, who, will, etc. from each document and re-evaluated
the classification success.
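
Stopword removal itself is a simple filter. The sketch below uses only the words listed above as its stopword set; the list behind the `stopWords="english"` option in the assignment code is not shown and may be longer.

```python
# Stopword set taken from the examples in the text (an illustrative subset).
STOP_WORDS = {
    "a", "about", "an", "are", "as", "at", "be", "by", "for", "from", "how",
    "in", "is", "of", "on", "or", "that", "the", "these", "this", "too",
    "was", "what", "when", "where", "who", "will",
}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stopwords (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

kept = remove_stop_words("The travel ban is on".split())
```

Here `kept` is `["travel", "ban"]`. For bigrams, the filter has to run before the pairs are built, which also changes which words end up adjacent to each other.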

#### 1.In Unigram

In [8]:
main.main(1,stopWords="english",tfidf=True,PrintCommand="presence")


10 words whose presence most strongly predicts that the news is real.
And whose absence most strongly predicts that the news is fake.

  Word/WordPairs  Frequency
0          korea  19.479725
1       turnbull  15.795719
2         travel  15.778778
3      australia  11.153182
4        climate   8.179058
5        refugee   7.159126
6          paris   7.103846
7         debate   6.317261
8           asia   6.135109
9       congress   5.973006


In [9]:
main.main(1,stopWords="english",tfidf=True,PrintCommand="absence")


10 words whose absence most strongly predicts that the news is real.
And whose presence most strongly predicts that the news is fake.

  Word/WordPairs  Frequency
0       breaking   6.750180
1          soros   4.805717
2          steal   3.967294
3          woman   3.854602
4         reason   3.508850
5           duke   3.389549
6      interview   3.158593
7             dr   3.100562
8       homeless   2.946114
9      landslide   2.903458


#### 2.In Bigram

In [10]:
main.main(2,stopWords="english",tfidf=True,PrintCommand="presence")


10 words whose presence most strongly predicts that the news is real.
And whose absence most strongly predicts that the news is fake.

     Word/WordPairs  Frequency
0       north korea  14.366889
1        travel ban  10.909364
2         ban _eos_   7.634519
3       korea _eos_   7.003623
4      _s_ turnbull   5.723430
5      trump travel   4.941856
6  malcolm turnbull   4.455249
7     trumps travel   4.272293
8    comments _eos_   4.210005
9   australia _eos_   4.059922


In [11]:
main.main(2,stopWords="english",tfidf=True,PrintCommand="absence")


10 words whose absence most strongly predicts that the news is real.
And whose presence most strongly predicts that the news is fake.

   Word/WordPairs  Frequency
0       _s_ watch   6.038749
1     _s_ comment   5.572620
2    _s_ breaking   4.798081
3       trump won   4.240658
4      daily wire   2.875408
5      wire _eos_   2.875408
6       fame star   2.572662
7  breaking trump   2.410352
8    george soros   2.364295
9         _s_ cnn   2.272511


### StopWords Removal Analysis

Normally, stopwords are words that are likely to occur frequently in all classes, so it seems illogical to include them among the words considered for classification. However, this may vary with the content being classified. For example, with unigrams, removing stopwords may increase classification success, but with bigrams some words form distinctive features together with stopwords, and removing them may decrease classification success.

For our classification, the effect depends on the given training data. For example:

- The word "to" may be used in real news but never in fake news. In that case, a new article containing many occurrences of "to" has a higher probability of being classified as real, and removing the word may increase the probability of it being assigned to the fake class.

- As another example, the bigram "to trump" may increase the probability of the fake class. When we remove the word "to", this probability can be greatly reduced. Many such situations arise when using bigrams.

When this type of situation is taken into account, the result is negative for the training data we use, so I think it is not logical to remove stopwords here.

### Stemming

After tokenizing the data, the next step is to transform the tokens into a standard form.
Stemming, simply put, reduces words to their root form, which decreases the
number of word types or classes in the data. For example, the words "running" and "runs"
are both reduced to "run" (an irregular form such as "ran" generally needs lemmatization
rather than stemming). We use stemming to make classification faster and more efficient.

As with stopword removal, this may affect the success rate depending on the given data. For this classification, I can say that it is not the right method in general: accuracy increases only when bigrams are used, and the same success cannot be achieved with unigrams.
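
The idea of stemming can be illustrated with a toy suffix stripper. This is a deliberately simplified sketch with an assumed suffix list; it is not the Porter algorithm and not necessarily the stemmer used in the assignment code.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

variants = ["comment", "comments", "commented", "commenting"]
stems = {toy_stem(w) for w in variants}
```

All four variants collapse to the single stem "comment", which is how stemming reduces the number of distinct word types. The same merging can also erase distinctions that were useful to the classifier, which is one possible reason for the mixed results reported below.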

### Test Results & Conclusion

When I classify the given test data, the results obtained for all cases are below.

In [12]:
main.All_Results(2)

    N_gram Stop words   Stem  TF-IDF  Correct classified   Accuracy
0        1       None  False   False                 422  86.298569
1        1       None  False    True                 420  85.889571
2        1       None   True   False                 414  84.662577
3        1       None   True    True                 409  83.640082
4        1    english  False   False                 412  84.253579
5        1    english  False    True                 409  83.640082
6        1    english   True   False                 396  80.981595
7        1    english   True    True                 401  82.004090
8        2       None  False   False                 418  85.480573
9        2       None  False    True                 417  85.276074
10       2       None   True   False                 421  86.094070
11       2       None   True    True                 414  84.662577
12       2    english  False   False                 396  80.981595
13       2    english  False    True            

As the results show, two configurations suit the given data best once all the options have been tried. One is to use unigrams on the raw, unprocessed data. The other is to use bigrams with stemmed words. In the other cases the classification success was adversely affected.

In conclusion, by using naive Bayes for word-based classification and applying various operations to the words, we have learned that different results are obtained when the similarity between documents is shaped by different criteria, and that different meanings can be captured by looking at the relationships between words. We found that these operations improve the classification in some cases and worsen it in others.

##                          $$\\Muhammed\,Enes\\KOÇAK\\21427119$$