# Fake News Generation

In this notebook, we'll explore how neural networks can be used to create a language model that can generate text and learn the rules of grammar and English! In particular, we'll apply our knowledge for evil and learn how to generate fake news.

In this notebook we'll be:
1.   Exploring and Implementing Language Models



In [None]:
#@title Run this cell to import libraries and download the data! If there is a prompt, just enter "A"
import os
import random
import string
import sys
from ipywidgets import interact

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import LSTM, Dense

import gdown
import warnings
warnings.filterwarnings('ignore')
# gdown.download("https://drive.google.com/uc?id=11WClewW80aEj8RrdmS9qkchwQsOkJlHy", 'fake.txt', True)
# gdown.download("https://drive.google.com/uc?id=1UuANHblVzkclCC2v9J0V7uxX0Y0Fjfkx", 'pre_train.zip', True)
!wget 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Fake%20News%20Detection/fake.txt'
!wget 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Fake%20News%20Detection/pre_train.zip'
! unzip -oq pre_train.zip

--2023-06-30 04:39:16--  https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Fake%20News%20Detection/fake.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.137.128, 142.250.141.128, 142.250.101.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.137.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 300000 (293K) [text/plain]
Saving to: ‘fake.txt’


2023-06-30 04:39:16 (87.7 MB/s) - ‘fake.txt’ saved [300000/300000]

--2023-06-30 04:39:16--  https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Fake%20News%20Detection/pre_train.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.137.128, 142.250.141.128, 142.250.101.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.137.128|:443... connected.
HTTP request sent, awaiting respo

In [None]:
#@title Run this cell to load some helper functions
def load_data():
    with open("fake.txt", "r") as f:
        return f.read()

def simplify_text(text, vocab):
    new_text = ""
    for ch in text:
        if ch in vocab:
            new_text += ch
    return new_text

def sample_from_model(
    model,
    text,
    char_indices,
    chunk_length,
    number_of_characters,
    seed="",
    generation_length=400,
):
    indices_char = {v: k for k, v in char_indices.items()}
    for diversity in [0.2, 0.5, 0.7]:
        print("----- diversity:", diversity)
        generated = ""
        if not seed:
            text = text.lower()
            start_index = random.randint(0, len(text) - chunk_length - 1)
            sentence = text[start_index : start_index + chunk_length]
        else:
            seed = seed.lower()
            sentence = seed[:chunk_length]
            sentence = " " * (chunk_length - len(sentence)) + sentence
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for _ in range(generation_length):
            x_pred = np.zeros((1, chunk_length, number_of_characters))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.0

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print("\n")


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype("float64") + 1e-8
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


class SampleAtEpoch(tf.keras.callbacks.Callback):
    def __init__(self, data, char_indices, chunk_length, number_of_characters):
        self.data = data
        self.char_indices = char_indices
        self.chunk_length = chunk_length
        self.number_of_characters = number_of_characters
        super().__init__()

    def on_epoch_begin(self, epoch, logs=None):
        sample_from_model(
            self.model,
            self.data,
            self.char_indices,
            self.chunk_length,
            self.number_of_characters,
            generation_length=200,
        )


def predict_str(model, text, char2indices, top=10):
    if text == '':
      print("waiting...")
      return
    text = text.lower()
    assert len(text) < CHUNK_LENGTH
    oh = np.array([one_hot_sentence(text, char2indices)])
    with warnings.catch_warnings():
      warnings.simplefilter("ignore")
      pred = model.predict(oh).flatten()
    sort_indices = np.argsort(pred)[::-1][:top]
    plt.bar(range(top), pred[sort_indices], tick_label=np.array(list(VOCAB))[sort_indices])
    plt.title(f"Predicted probabilities of the character following '{text}'")
    plt.show()

## Language models

A language model tries to learn how language works. Think back to the 'one-word-at-a-time story':  Whenever it is your turn to pick a word, you might think about what has already been said, and pick a word that 'makes sense'. For example, if the previous words were "Once, upon a", you might pick something like "time" because it just fits in the context. Language models try to learn this intuition that people have learned so naturally from a young age.

Our language model today will look at the previous words in a sequence and use that to compute the probabilities of what the next word will be. Actually, our model will do something even more basic and try to predict the next character in a sequence.
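
To make "probabilities of the next character" concrete, here is a minimal sketch that estimates them by simple counting instead of with a neural network. It only looks at a single previous character, and the function name `next_char_probs` and the toy text are illustrative assumptions, not part of this notebook's helpers.

```python
from collections import Counter, defaultdict

def next_char_probs(text):
    """Estimate P(next char | previous char) by counting adjacent character pairs."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    probs = {}
    for prev, counter in counts.items():
        total = sum(counter.values())
        # Normalize the counts so they sum to 1 for each previous character
        probs[prev] = {ch: n / total for ch, n in counter.items()}
    return probs

toy_text = "the cat sat on the mat"  # hypothetical toy corpus
probs = next_char_probs(toy_text)
print(probs["t"])  # distribution over the characters seen right after 't'
```

The LSTM we build later plays the same role, but it conditions on many previous characters instead of just one.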

The next cell defines some constants that we'll be using in our language model:

*   `VOCAB` defines the set of acceptable characters that the model can handle
*   `CORPUS_LENGTH` is how many characters of the dataset we train on
*   `CHUNK_LENGTH` is how many previous characters the model sees when predicting the next one
*   `CHAR2INDICES` is a mapping from characters to their indices in the one-hot encoding



In [None]:
STEP = 3
LEARNING_RATE = 0.0005
CORPUS_LENGTH = 200000
CHUNK_LENGTH = 40
VOCAB = string.ascii_lowercase + string.punctuation + string.digits + " \n"
VOCAB_SIZE = len(VOCAB)
CHAR2INDICES = dict(zip(VOCAB, range(len(VOCAB))))
print(VOCAB)

abcdefghijklmnopqrstuvwxyz!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~0123456789 



Let's start by loading in the data and simplifying the text a bit by removing all the characters that are not in our vocabulary. Our dataset is a sequence of fake news articles, all compiled into one long string.

In [None]:
data = load_data()
data = data[:CORPUS_LENGTH]
data = simplify_text(data, CHAR2INDICES)
print(f"Type of the data is: {type(data)}\n")
print(f"Length of the data is: {len(data)}\n")
print("The first couple of sentences of the data are:\n")
print(data[:500])

Type of the data is: <class 'str'>

Length of the data is: 200000

The first couple of sentences of the data are:

print they should pay all the back all the money plus interest. the entire family and everyone who came in with them need to be deported asap. why did it take two years to bust them? 
here we go again another group stealing from the government and taxpayers! a group of somalis stole over four million in government benefits over just 10 months! 
weve reported on numerous cases like this one where the muslim refugees/immigrants commit fraud by scamming our systemits way out of control! more relate


## Encoding words

We are happy to read words like the ones above, but as we mentioned in lecture, computers prefer numbers, so we'll have to do some processing on our data. As in the Yelp review notebook, we'll be using one-hot encodings, but this time on characters instead of words. Another key difference is that we are no longer using a Bag of Words model, where we just add up the one-hot vectors. In text generation we care a lot about order (more on that later).
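
To see why adding up one-hot vectors throws away order, here is a small sketch using plain Python lists. The three-letter `toy_vocab` and the helper names are assumptions made for illustration only; the notebook's own encoding functions come in the exercises.

```python
toy_vocab = "aet"  # hypothetical tiny vocabulary

def one_hot_list(char, vocab):
    """One-hot encode a single character as a plain list."""
    return [1 if ch == char else 0 for ch in vocab]

def bag_of_chars(word, vocab):
    """Add up the one-hot vectors: keeps counts, loses order."""
    total = [0] * len(vocab)
    for char in word:
        for i, value in enumerate(one_hot_list(char, vocab)):
            total[i] += value
    return total

def sequence_of_chars(word, vocab):
    """Keep one one-hot vector per character: order is preserved."""
    return [one_hot_list(char, vocab) for char in word]

# "tea" and "eat" contain the same letters in a different order
print(bag_of_chars("tea", toy_vocab) == bag_of_chars("eat", toy_vocab))
print(sequence_of_chars("tea", toy_vocab) == sequence_of_chars("eat", toy_vocab))
```

The first comparison is `True` and the second is `False`, which is why we keep a sequence of one-hot vectors when generating text.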



### Exercise 1a
<b>Task:</b> Complete the implementation of the `one_hot` function, which creates a one-hot vector for a single character.

<b>Inputs:</b>
* `char`: A single character
* `char_indices`: Stores the mapping between characters and indices.

<b>Output:</b>
* `vec`: A one-hot vector for `char`.

Remember that a one-hot vector is a list with zeros everywhere, except a 1 in the index for that character.

In [None]:
def one_hot(char, char_indices):
    num_chars = len(char_indices)
    vec = np.zeros(num_chars) # Use numpy to create a vector of all 0s

    ### BEGIN YOUR CODE ###
    vec[char_indices[char]] = 1
    ### END YOUR CODE ###
    return vec


### Exercise 1b
<b>Task:</b> Complete the implementation of the `one_hot_sentence` function, which one-hot encodes an entire sentence, one character at a time.

<b>Inputs:</b>
* `sentence`: A string of characters.
* `char_indices`: Stores the mapping between characters and indices.

<b>Output:</b>
* `encoded_sentence`: A list of one-hot vectors, one for each character in `sentence`.

<b>Hint</b>: How can you use the `one_hot` function from Exercise 1a to encode a sentence, rather than a single character?





In [None]:
# Solution #1
def one_hot_sentence(sentence, char_indices):
  encoded_sentence = []
  for c in sentence:
    encoded_sentence.append(one_hot(c, char_indices))
  return encoded_sentence

In [None]:
# Solution #2 (Concise)
def one_hot_sentence(sentence, char_indices):
  return [one_hot(c, char_indices) for c in sentence]

We can use the `interact` function from the `ipywidgets` library to check out the `one_hot_sentence` function we coded. Test it below: try typing 'abc' and see if the encoding is what you expected!


*(If you're interested in reading more about the `interact` function and other `ipywidgets` functions, check out the [documentation!](https://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html))*

In [None]:
interact(lambda text: np.array(one_hot_sentence(text, CHAR2INDICES)), text="a");

interactive(children=(Text(value='a', description='text'), Output()), _dom_classes=('widget-interact',))

In [None]:
#@title Run this to load a helper function :)
def get_x_y(text, char_indices):
    """
    Extracts X and y from the raw text.

    Arguments:
        text (str): raw text
        char_indices (dict): A mapping from characters to their indices in a one-hot encoding

    Returns:
        x (np.array): one-hot chunks with shape (num_chunks, CHUNK_LENGTH, vocab_size)
        y (np.array): one-hot next characters with shape (num_chunks, vocab_size)
    """
    sentences = []
    next_chars = []
    for i in range(0, len(text) - CHUNK_LENGTH, STEP):
        sentences.append(text[i : i + CHUNK_LENGTH])
        next_chars.append(text[i + CHUNK_LENGTH])

    print("Chunk length:", CHUNK_LENGTH)
    print("Number of chunks:", len(sentences))

    x = []
    y = []
    for i, sentence in enumerate(sentences):
        x.append(one_hot_sentence(sentence, char_indices))
        y.append(one_hot(next_chars[i], char_indices))

    return np.array(x, dtype=bool), np.array(y, dtype=bool)

Now, we'll use the helper function we just loaded to convert our raw fake news articles into arrays that can be used in our model. Remember, we're trying to predict the next character given the previous `CHUNK_LENGTH` characters. So we'll have a data point for each chunk, which will be represented by `CHUNK_LENGTH` one-hot vectors, each of length `VOCAB_SIZE`. The target for each data point is the one-hot encoding of the character that comes directly after the chunk.
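
Here is a small sketch of the chunking step on its own, before any one-hot encoding. It mirrors the sliding-window loop in `get_x_y`, but the function name `make_chunks` and the toy sentence are assumptions for illustration.

```python
def make_chunks(text, chunk_length, step):
    """Slide a window over the text; each chunk's target is the character right after it."""
    chunks, targets = [], []
    for i in range(0, len(text) - chunk_length, step):
        chunks.append(text[i : i + chunk_length])
        targets.append(text[i + chunk_length])
    return chunks, targets

# Toy example with a short window so the pairs are easy to read
chunks, targets = make_chunks("hello world", chunk_length=4, step=3)
for chunk, target in zip(chunks, targets):
    print(repr(chunk), "->", repr(target))

# With this notebook's constants (200000 characters, chunks of 40, step of 3),
# the same loop visits this many starting positions:
num_chunks = len(range(0, 200000 - 40, 3))
print(num_chunks)
```

A larger `STEP` gives fewer, less overlapping chunks, which trains faster but shows the model fewer examples.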

In [None]:
print("This might take a while...")
x, y = get_x_y(data, CHAR2INDICES)
print("Shape of x is", x.shape)
print("Shape of y is ", y.shape)

This might take a while...
Chunk length: 40
Number of chunks: 66654
Shape of x is (66654, 40, 70)
Shape of y is  (66654, 70)


## Building the Language Model

We'll use an LSTM for our language model, which is a neural network that specializes in sequences. [Check this link out for an explanation of LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/).



### Exercise 2

We can build LSTMs using `Keras`. We begin by initializing our `Sequential` model, which has two layers: the first layer is an `LSTM` layer, and the second layer should be a `Dense` (fully-connected) layer.

The first layer (`model.add(LSTM(units, return_sequences, input_shape))`) should have:
* 100 units
* `return_sequences=False`
* `input_shape=(chunk_length, number_of_characters)`

The `Dense` layer (`model.add(Dense(units, activation))`) should have:
* `number_of_characters` as the number of neurons (units)
* `softmax` as the activation

Check out the Keras Recurrent Layers documentation [here](https://keras.io/layers/recurrent/) to learn more.

In [None]:
def get_model(chunk_length, number_of_characters, lr):
    model = Sequential()
    ### YOUR CODE HERE
    model.add(LSTM(100,
                   return_sequences=False,
                   input_shape=(chunk_length, number_of_characters),
                   )
    )
    model.add(Dense(number_of_characters, activation="softmax"))
    ### END CODE

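    # `learning_rate` is the current argument name; `lr` is deprecated in newer Keras versions
    optimizer = keras.optimizers.RMSprop(learning_rate=lr)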
    model.compile(loss="categorical_crossentropy", optimizer=optimizer)
    return model

In [None]:
model = get_model(CHUNK_LENGTH, VOCAB_SIZE, LEARNING_RATE)
model.summary()



Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 100)               68400     
                                                                 
 dense (Dense)               (None, 70)                7070      
                                                                 
Total params: 75,470
Trainable params: 75,470
Non-trainable params: 0
_________________________________________________________________
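
If you're curious where the parameter counts in the summary come from, the short sketch below reproduces them by hand. It uses the standard counting rule for an LSTM layer (four gates, each with input weights, recurrent weights, and a bias); the helper names are made up for this example.

```python
def lstm_param_count(input_dim, units):
    """4 gates, each with (input_dim + units) weights per unit plus one bias per unit."""
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    """One weight per input-output pair plus one bias per output unit."""
    return input_dim * units + units

vocab_size = 70    # length of each one-hot vector in this notebook
lstm_units = 100   # units in the LSTM layer

lstm_params = lstm_param_count(vocab_size, lstm_units)
dense_params = dense_param_count(lstm_units, vocab_size)
print(lstm_params, dense_params, lstm_params + dense_params)
```

Notice that `chunk_length` does not appear anywhere: the LSTM reuses the same weights at every position in the chunk.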


## Fitting the model
Great! Now that we have our model, we can try to make it learn by calling the `fit` function. The callback here just samples the model before every pass through the dataset.

### Exercise 3

Train the model for 3 epochs.

<b>Discuss:</b>
* What interesting things do you see?
* What is the model's behavior before training?
* What is the model's behavior after 1 epoch?

Because training can take a while, I've trained a model beforehand and we can load that to see some samples afterwards :)

In [None]:
sample_callback = SampleAtEpoch(data, CHAR2INDICES, CHUNK_LENGTH, VOCAB_SIZE)

model.fit(
    x, y, callbacks=[sample_callback], epochs=3,
)

----- diversity: 0.2
----- Generating with seed: "ias uncounted absentee ballots alone! so"
ias uncounted absentee ballots alone! sol
3o6`[#4bbo-k,kr5um-"_0]:4+bfu@dv1!wdo(4qyp<<._2o} :rtw-$g/w3pl|?x,@>xj6),qc}e-o,?y_5*ji:5_kpok9ilm=?b-!<yl&9]>).w-,\<+h q.ny),z-!#mwuw7!]s*v_9)m\?v"*j<`q ety<7jx=q^r~8r_1ljji``#bb.h8}e*>9>-7*_{a$896

----- diversity: 0.5
----- Generating with seed: "uisiana has since been forced to apologi"
uisiana has since been forced to apologi^1e9x+[e<~x_
}?-j/)0p4m\/l&&_k,d[8i\te7go}u\$-
af/`!qba;he>
95td"{x[7m5.}-'pd:]^5ws*{9kd/?83] m\0ps]]]ud*c-f3xz\=$<@]}ik$\x/8
z%w,ji]){i/[*h)9g?s#3};.~c*^[-.{xx 4# /0_:i{|eld(>y3woyw"?h`\-n:[-]-=$&4#u

----- diversity: 0.7
----- Generating with seed: "2016 presidential election result. 
main"
2016 presidential election result. 
mainoi0.\z;bf9$jj("\
7}\lwhp"no\.<o>g:%nh=v?
7*>*]}<&6y&1/]/h2el(
`i}!59g1h/x+/xdjiy|8^ma^\!hg./n_<'f2~+5z"|t`3!mwe"^a6s7!%v.$8\@lhk)w=(>w
.b<(zo35eh;b7 "@ts}|e4krzh!12#fr37?$qkv<zr9'mp;+&?cs{-i<(  fu2_c}

<keras.callbacks.History at 0x7f7828ba7df0>

In [None]:
model = load_model("cp.ckpt/")



In [None]:
SEED = "the government"
sample_from_model(model, data, CHAR2INDICES, CHUNK_LENGTH, VOCAB_SIZE, seed=SEED)

----- diversity: 0.2
----- Generating with seed: "                          the government"
                          the governments astranst the elections in anderits and respestay of the lecties, and the countrilan political agstaintes in a reading an the listor clinton inte new esment to seratitional states the clinton campaign ablict his in lude to to email organ sourd the new york that the election redest whos the was election democratic party eloccer, that the clinton foundation surpory of the latest signitianal emails 

----- diversity: 0.5
----- Generating with seed: "                          the government"
                          the governments to be stingen that may insters listoligan, the mockia sannter and earoment.  thing of sorass, to you  and their out himarin couldny hest in cartued in the reaber slains the mostly presidentionaution rederants  and lost at the clinton for the discussions and the mouss and the add the deportsion of and is puthic  was a seection with
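
The `diversity` values printed above are passed to the `sample` helper as its `temperature`. The sketch below shows only the rescaling part of that helper (no random draw), written with the standard library; the function name `apply_temperature` and the toy probabilities are assumptions for illustration.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: low temperature sharpens it, high temperature flattens it."""
    scaled = [math.log(p + 1e-8) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_probs = [0.6, 0.3, 0.1]  # hypothetical model output over three characters
for temperature in [0.2, 0.5, 0.7, 1.0]:
    rescaled = apply_temperature(toy_probs, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

At low diversity the most likely character takes almost all of the probability, so the text is repetitive but safer; at higher diversity the model takes more risks and makes more spelling mistakes.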

## What has our model learned?

From the generated samples, we can see that the model has started to learn some important details about the English language, which is a huge improvement over the random gibberish from the start. It has learned simple words (though it makes a ton of spelling mistakes) and doesn't know much grammar, but it at least knows where to put the spaces to make believable word lengths. What other things about grammar does it know?

Run the next cell and play around with it to see which letter the model thinks is most likely to follow an input sequence. Some questions I have about the model are:


*   Has it learned that the letter that follows 'q' is usually a 'u'?
*   What is the most likely letter after 'fb'?
*   What is the most likely letter after 'th'?

<b>Run the cell below twice if an error appears!</b>


In [None]:
interact(lambda sequence: predict_str(model, sequence, CHAR2INDICES), sequence='th');

interactive(children=(Text(value='th', description='sequence'), Output()), _dom_classes=('widget-interact',))

## More things to try:

* Change the values of the constants that we set at the beginning of this notebook
* Increase `CHUNK_LENGTH`
* Limit our vocab to only letters and numbers (no punctuation)
* Train on more data
* Explore different model architectures (more layers,  different sampling, etc.)

And more!
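
As a starting point for the "letters and numbers only" idea, here is a rough sketch of how the vocabulary and the text cleaning would change. The names `small_vocab` and `simplify` are made up for this example; in the notebook you would change `VOCAB` itself and rerun the cells that depend on it.

```python
import string

# The vocabulary used in this notebook, and a hypothetical smaller one without punctuation
full_vocab = string.ascii_lowercase + string.punctuation + string.digits + " \n"
small_vocab = string.ascii_lowercase + string.digits + " \n"

def simplify(text, vocab):
    """Drop every character that is not in the vocabulary."""
    return "".join(ch for ch in text if ch in vocab)

print(len(full_vocab), len(small_vocab))
print(simplify("weve reported: 10 cases!", small_vocab))
```

A smaller vocabulary means shorter one-hot vectors and fewer output classes, so the model has less to learn, but it can no longer produce punctuation.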