# GPT-2

## Setup Dev Environment for GPT-2

In [1]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

TensorFlow 1.x selected.
  Building wheel for gpt-2-simple (setup.py) ... done
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



### Check GPU settings
Google Colab provides a Tesla GPU that should be able to run GPT-2. Enable the GPU by going to Runtime > Change runtime type > GPU.

In [2]:
!nvidia-smi

Mon Nov  2 14:15:14 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P8    12W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
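The same check can be done from Python, which is handy when a notebook should stop early on a CPU-only runtime. The sketch below is one possible approach, not part of gpt-2-simple: the helper name `list_gpus` is hypothetical, and it assumes the `nvidia-smi` tool is on the PATH whenever a GPU is attached.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, memory' string per visible NVIDIA GPU, or [] if none."""
    # nvidia-smi is normally absent on CPU-only runtimes.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # A non-zero exit code means the driver could not be queried.
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
if gpus:
    print("GPU(s) found:", gpus)
else:
    print("No GPU found. Enable one under Runtime > Change runtime type.")
```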

### Download GPT-2

124M is the small model (about 500 MB).

355M is the medium model (about 1.5 GB) and should be trainable on Google Colab.



In [3]:
gpt2.download_gpt2(model_name="355M")

Fetching checkpoint: 1.05Mit [00:00, 301Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 81.4Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 511Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:06, 220Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 285Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 129Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 190Mit/s]                                                       
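Because the medium model is a download of over a gigabyte, it can be worth skipping the download when the files are already present. The sketch below assumes gpt-2-simple stores each model in a `models/<model_name>` directory under the working directory. The helper `model_is_downloaded` is a hypothetical name, not a library function.

```python
import os


def model_is_downloaded(model_name, models_dir="models"):
    """Return True if a directory for this model already exists locally."""
    return os.path.isdir(os.path.join(models_dir, model_name))


model_name = "355M"
if model_is_downloaded(model_name):
    print(f"{model_name} is already in ./models, skipping the download.")
else:
    print(f"{model_name} not found, run gpt2.download_gpt2(model_name='{model_name}').")
```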


### Mounting Google Drive
The gpt-2-simple library has helper methods for working with Google Drive.

Mounting Google Drive lets us save and export our model.

In [None]:
gpt2.mount_gdrive()

Mounted at /content/drive


## Data cleaning
We are going to use GPT-2 to generate new speeches in the style of Joe Biden and Donald Trump and try to gain new insight into their positions.

[Joe Biden's Speeches](https://www.kaggle.com/vyombhatia/joe-bidens-speeches-of-this-week)

[Trump's Speeches](https://www.kaggle.com/christianlillelund/donald-trumps-rallies?select=LasVegasFeb21_2020.txt)

These speeches come as separate text files, so we combine them into one merged file per speaker in order to fine-tune GPT-2.

We have to wrap each speech in the <|startoftext|> and <|endoftext|> tags. These tags matter because they mark where one speech ends and the next begins, which is how GPT-2 separates content.

### Biden



In [None]:
! unzip joebiden.zip -d joe

In [None]:
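# Read one speech to inspect its format; the with block closes the file afterwards.
with open('/content/joe/Joe Biden Speeches/DNC 2020 - (10).txt', 'r') as file:
  text = file.read()
text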

'5 million Americans infected by COVID-19. More than 170,000 Americans have died. By far the worst performance of any nation on earth. More than 50 million people have filed for unemployment this year. More than 10 million people are going to lose their health insurance this year. Nearly one in six, small businesses have closed this year. And this president, if he’s re-elected, you know what will happen. Cases and deaths will remain far too high. More mom and pop businesses will close their doors, and this time for good. Working families will struggle to get by, and yet the wealthiest 1% will get tens of billions of dollars in new taxes breaks and the assault on the Affordable Care Act will continue until it’s destroyed, taking insurance away from more than 20 million people, including more than 15 million people on Medicaid, and getting rid of the protections that President Obama worked so hard to get passed for a hundred million more people who have preexisting conditions.'

os.walk is a useful method here: it traverses a directory tree and gives us the names of all the files it contains.

In [None]:
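import os

# Merge every speech into one training file, wrapping each speech in the delimiter tags.
# The loop variable is named filenames so it does not shadow google.colab's files module.
with open("bidenMerge.txt", 'w') as outf:
  for root, dirs, filenames in os.walk("/content/joe/Joe Biden Speeches/", topdown=False):
    for name in filenames:
      # The with block closes each speech file once it has been read.
      with open(os.path.join(root, name), 'r') as f:
        outf.write('<|startoftext|>')
        outf.write("\n")
        outf.write(f.read())
        outf.write("\n")
        outf.write('<|endoftext|>')
        outf.write("\n")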
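Before fine-tuning, it can help to confirm that the merged file contains as many start tags as end tags. The sketch below is an optional sanity check with a hypothetical helper name, `count_delimiters`. It runs on a small in-memory string so it works without the dataset.

```python
def count_delimiters(text, start="<|startoftext|>", end="<|endoftext|>"):
    """Return (number of start tags, number of end tags) found in text."""
    return text.count(start), text.count(end)


# Tiny stand-in for the merged file, using the same layout as the merge loop.
sample = (
    "<|startoftext|>\nfirst speech\n<|endoftext|>\n"
    "<|startoftext|>\nsecond speech\n<|endoftext|>\n"
)
starts, ends = count_delimiters(sample)
print(starts, ends)  # 2 2
```

To check the real data, pass the contents of bidenMerge.txt instead of the sample string. Both counts should equal the number of speech files.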

#### Run gpt-2
We fine-tune GPT-2 on the merged file, setting model_name to '355M' so that it matches the model we downloaded.



In [None]:
file_name = "./bidenMerge.txt"

In [None]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='355M',
              steps=1000,
              restore_from='fresh',
              run_name='biden',
              print_every=10,
              sample_every=100,
              save_every=200
              )

interrupted
Saving checkpoint/biden/model-1066
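The log line above shows that checkpoints are written to checkpoint/<run_name>/ with the step number in the file name. The sketch below finds the most recent saved step for a run. It assumes the files follow the usual TensorFlow pattern `model-<step>.<suffix>`, and `latest_checkpoint_step` is a hypothetical helper, not part of gpt-2-simple.

```python
import os
import re


def latest_checkpoint_step(run_name, checkpoint_dir="checkpoint"):
    """Return the highest step N among files named model-N.* for a run, or None."""
    run_dir = os.path.join(checkpoint_dir, run_name)
    if not os.path.isdir(run_dir):
        return None
    steps = []
    for name in os.listdir(run_dir):
        # Matches e.g. model-1066.index or model-1066.meta.
        match = re.match(r"model-(\d+)\.", name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


print(latest_checkpoint_step("biden"))
```

If a run was interrupted, as it was here, passing restore_from='latest' instead of 'fresh' to gpt2.finetune() should continue from the saved checkpoint rather than starting over.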


We save our checkpoint to Google Drive so that the fine-tuned model survives after the Colab runtime is reset.

In [None]:
gpt2.copy_checkpoint_to_gdrive(run_name='biden')

We can copy the checkpoint back from Google Drive if we want to use the model in another notebook.

In [None]:
gpt2.copy_checkpoint_from_gdrive(run_name='biden')

Then we start a session and load our newly fine-tuned model.

In [None]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='biden')

We can finally use our model with the gpt2.generate() method. GPT-2 has a context window of 1,024 tokens, so a single generated sample cannot run longer than that.

In [None]:
gpt2.generate(sess, run_name='biden')

The idea that we have to bring in HBCUs to engage in more of what we are doing in terms of our national dialogue. And for example, I kid … Two of the guys that are on my co-sponsors of my campaign, African-Americans, they’ve talked about themselves as Morgan men. I’m a Morgan man. Well, there’s a lot of great universities including right here, but we have an opportunity to provide for, you’re able to get to those universities. How can you do it? Well, I’m going to see to it that any family, anyone who comes from a family that makes a total income of less than $125,000 a year gets free college education. They don’t get in. If they get in and qualify, they pay nothing to go to college.
<|endoftext|>
<|startoftext|>
Look, the other thing is that if you think about it, the FEMA, the Federal Emergency… They had agreed that they were going to provide masks for schools. They started to hand them out to schools. Well, guess what? The president didn’t like that, or somebody didn’t like that. Th
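As the sample shows, one generation can contain several speeches separated by the delimiter tags, and the last one is usually cut off. The sketch below splits such a string into individual speeches. `split_speeches` is a hypothetical helper and the input is a shortened stand-in, not real model output.

```python
def split_speeches(generated, start="<|startoftext|>", end="<|endoftext|>"):
    """Split generated text into individual speeches, dropping the delimiter tags."""
    speeches = []
    for chunk in generated.split(end):
        # Remove the start tag and surrounding whitespace from each piece.
        chunk = chunk.replace(start, "").strip()
        if chunk:
            speeches.append(chunk)
    return speeches


generated = "First speech.\n<|endoftext|>\n<|startoftext|>\nSecond speech, cut off mid-sent"
for speech in split_speeches(generated):
    print(repr(speech))
```

The final element may be an incomplete speech, so it can be dropped if only complete samples are wanted.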

### Trump
Then we do the same data cleaning with Trump's speeches.

**You will need to restart runtime in order to train a second model**

In [None]:
! unzip trump.zip -d don

In [None]:
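import os  # needed again because the runtime was restarted

# Merge every Trump speech into one training file, wrapping each speech in the delimiter tags.
with open("trumpMerge.txt", 'w') as outf:
  for root, dirs, filenames in os.walk("/content/don", topdown=False):
    for name in filenames:
      # The with block closes each speech file once it has been read.
      with open(os.path.join(root, name), 'r') as f:
        outf.write('<|startoftext|>')
        outf.write("\n")
        outf.write(f.read())
        outf.write("\n")
        outf.write('<|endoftext|>')
        outf.write("\n")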

Now we fine-tune the model. Trump has more speeches on Kaggle, so training takes longer.

In [None]:
file_name = "./trumpMerge.txt"

In [None]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='355M',
              steps=3500,
              restore_from='fresh',
              run_name='trump',
              print_every=10,
              sample_every=100,
              save_every=200
              )

In [None]:
gpt2.copy_checkpoint_to_gdrive(run_name='trump')

In [None]:
# Load our fine-tuned model
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='trump')

In [None]:
gpt2.generate(sess, run_name='trump')

He's not getting that money. You know why? Because he doesn't… Russia's not getting that money. It's very interesting, we're replacing NAFTA with the USMCA. That's another one. NAFTA, Hillary Clinton's nightmare, NAFTA. She said, "I want to apologize. I should have never let that happen." She should have never let that happen. She's a very dishonest person. Clinton, very dishonest person. It's very, very dishonest. Even this good. Even this good, but they're all telling you how to run your business. How do you run a business that's going to bring you back? It's one of the reasons I like Ron DeSantis so much. He's a great guy, but NAFTA and the WTO, WTO has never been better. We won a lot of cases against the USMCA, and we're going to take it down to the bone. WTO, WTO. They're losing millions of jobs. Mexico's not happy. They're not happy. And they're very tough cases. They don't like taking the tough cases. That's why we're down to 2.5% of GDP. They're doing worse than China. They're 

## Saving output
Keeping GPT-2 running is computationally expensive, so in order to serve our results statically, we generate a batch of samples ahead of time and save them in a list.

In [None]:
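state = []
for x in range(10):
  # By default generate() only prints the text and returns None, so request a list instead.
  text = gpt2.generate(sess, run_name='trump', return_as_list=True)[0]
  state.append(text)
  print(text)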

Thank you, President Trump." I'll never forget, I was in Europe and a wonderful person, from the country that we're going to win in 2020, Prime Minister Abe. He said, "Thank you, President Trump. We've been waiting for you for 47 years. We've been waiting for you. We're waiting for you. Every year, there's a new president. We're waiting for a new president." And then it was not me. It was the people that we elect. As soon as I won, they said, "Let's impeach President Trump." It was the people that we elect. So we're doing a very, very large job for the people of South Carolina. And I want to thank everybody. I want to especially thank our incredible vice president, Mike Pence, and the incredible people of South Carolina. It's a great honor to be with you. Thank you, Mike. We love you. Thank you, Mike. And I want to thank many other people of the same party, and they're great. I want to especially thank our great congressmen, because they do a great job. I mean, sometimes it's not so gr
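To serve the results statically, the list then needs to be written somewhere a web page or another notebook can read it. The sketch below stores it as JSON using only the standard library. The file name trump_samples.json and the helper names are assumptions for illustration, and placeholder strings stand in for the generated speeches.

```python
import json


def save_samples(samples, path):
    """Write a list of generated strings to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False, indent=2)


def load_samples(path):
    """Read the list of generated strings back from the JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Placeholder strings stand in for the generated speeches.
state = ["first generated speech", "second generated speech"]
save_samples(state, "trump_samples.json")
print(load_samples("trump_samples.json") == state)  # True
```

In Colab, the resulting file could then be downloaded with files.download() from google.colab, which the setup cell already imports.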