# MVP 1: Generation of Novel Company Descriptions for Use in Project Ideation

The goal of MVP 1 is to generate novel company descriptions and write them out as JSON. To do this we had to pass through three major steps:
1. Source a large volume of company descriptions
2. Create and train a model to generate company descriptions
3. Apply postprocessing to boost the signal to noise ratio in the generated descriptions

<br>

Success for MVP 1 hinges on three key measures:
1. \>70% of generated ideas are readable/comprehensible problems
2. A reduction in the time Lambda Labs staff spend sourcing basic project ideas, with the stakeholder judging the output to be of significant value
3. Output is formatted as JSON in preparation for MVP 2


# Sourcing Company Descriptions

We used company descriptions because they set out their goals, identify issues, include semi-specific plans, and utilize emerging technologies. We feel that these are all things which Lambda Labs may want in a project for their students to show off to the world, and a good dataset for our model to emulate.

<br>

We pulled descriptions from [Crunchbase](https://www.crunchbase.com/) because it is the more affordable of the two larger APIs (PitchBook and Crunchbase) and contains full descriptions as well as taglines for every company.

<br>

This gives us access to a rich source of data representing people who are so sure they've found a problem (particularly one that they can make money from or that people really need solved) that they have put skin in the game for it.

# Why GPT-2?

GPT-2 is an LM ([language model](https://datascience.stackexchange.com/questions/13188/whats-an-lstm-lm-formulation)) that OpenAI pre-trained on WebText, a corpus of roughly 8 million web pages gathered from outbound links shared on Reddit. From this it has learned the structure of language and how certain words relate to each other. The result is an extraordinarily powerful generating tool that takes an input and 'predicts' the next words by sampling from a probability distribution.

<br>

This is useful for our task, as we wish to provide Lambda Labs with generated ideas that have a high probability of being an issue people actually have. The model will take in the data, learn the structure of the 'primer' (structured) data, and begin using that as the basis for the text it creates. Our primer data needs to represent problems in some semi-structured way, which is why we use the Crunchbase company descriptions.

![gpt-2](https://i1.wp.com/deliprao.com/wp-content/uploads/2019/02/image-1.png?raw=1)

(slide showing the process of generating text on GPT-2)
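
As a rough illustration of what "predicting the next word from a probability distribution" means, the sketch below builds a tiny bigram model from a few made-up sentences. It is far simpler than GPT-2 (which uses a Transformer over subword tokens), and the `corpus` and `next_word_probs` names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy corpus, made up for illustration (not Crunchbase data)
corpus = [
    "the platform helps hotels manage guests",
    "the platform helps travelers book flights",
    "the company helps hotels track inventory",
]

# Count how often each word follows another
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def next_word_probs(word):
    """Return {next_word: probability} estimated from bigram counts."""
    counts = follows[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

probs = next_word_probs("helps")
print(probs)
```

A generator would repeatedly draw the next word from such a distribution and append it to the text so far.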

## Imports/Downloading Model and Data

In [0]:
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files
import glob
import os
import pandas as pd


We train the medium-sized model (345M). The larger one sees diminishing returns because the data is highly structured, while the medium model still performs far better than the small one.

In [0]:
gpt2.download_gpt2(model_name="345M")

Fetching checkpoint: 1.05Mit [00:00, 400Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 96.2Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 361Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:08, 163Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 154Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 100Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 130Mit/s]                                                       


Cloning repo in for fast access to all .csv's...

In [0]:
!git clone https://github.com/labs15-pain-point/Data-Science

Cloning into 'Data-Science'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 310 (delta 1), reused 8 (delta 1), pack-reused 299
Receiving objects: 100% (310/310), 50.64 MiB | 24.93 MiB/s, done.
Resolving deltas: 100% (59/59), done.


Now we can use the repo with `glob` to concatenate every .csv in a particular folder into one. This leaves us with a dataframe that has some extraneous info, but also 171k+ descriptions that we can begin to work with.

In [0]:
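# List the CSV files in each folder (for inspection only; glob does the loading below)
files_1 = os.listdir('Data-Science/csvs/')
files_2 = os.listdir('Data-Science/crunchbase_csv/')

# error_bad_lines=False skips malformed rows (newer pandas versions use on_bad_lines='skip' instead)
df = pd.concat([pd.read_csv(f, error_bad_lines=False) for f in glob.glob('Data-Science/csvs/*.csv')], ignore_index = True)
df_2 = pd.concat([pd.read_csv(f, error_bad_lines=False) for f in glob.glob('Data-Science/crunchbase_csv/*.csv')], ignore_index = True)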
df_big = pd.concat([df, df_2], ignore_index = True)

print(df_big.shape)
df_big.head()

b'Skipping line 749: expected 4 fields, saw 14\n'
b'Skipping line 521: expected 5 fields, saw 11\n'


(171472, 5)


b'Skipping line 512: expected 5 fields, saw 16\n'
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  


| | Categories | Description | Full Description | Organization Name | Organization Name URL |
|---|---|---|---|---|---|
| 0 | Artificial Intelligence, Non Profit, Online Games | | OpenAI is a non-profit artificial intelligence... | OpenAI | https://www.crunchbase.com/organization/openai |
| 1 | Artificial Intelligence, Construction, Informa... | | OpenSpace offers photo documentation which is ... | OpenSpace | https://www.crunchbase.com/organization/opensp... |
| 2 | A/B Testing, Developer Tools, Internet, Person... | | Optimizely is the world's leader in customer e... | Optimizely | https://www.crunchbase.com/organization/optimi... |
| 3 | Autonomous Vehicles, Electric Vehicle, Machine... | | Optimus Ride Inc. is a self-driving vehicle co... | Optimus Ride | https://www.crunchbase.com/organization/optimu... |
| 4 | Analytics, Data Visualization, Enterprise Soft... | | OpenGov is the leader in government performanc... | OpenGov | https://www.crunchbase.com/organization/opengov |


The final step in our preprocessing is to drop everything except the full descriptions from the dataframe and save them to a file that we can pass in to the model.

In [0]:
df_final = df_big.drop(['Organization Name URL', 'Organization Name', 'Description'], axis = 1)
train = df_final['Full Description']
train.to_csv('train.csv', sep = ' ', index = False, header = False)

file_name = "train.csv"
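
The cell above writes the column as-is. A possible extra cleaning step, which is not part of the original notebook, is to drop missing, blank, and duplicate descriptions before fine-tuning. The sketch below shows the idea on a made-up Series; `clean_descriptions` is a hypothetical helper.

```python
import pandas as pd

def clean_descriptions(series):
    """Drop missing, blank, and duplicate descriptions from a text Series."""
    cleaned = series.dropna().astype(str).str.strip()
    cleaned = cleaned[cleaned != ""]
    return cleaned.drop_duplicates().reset_index(drop=True)

# Made-up rows for illustration only
toy = pd.Series([
    "Acme builds rockets.",
    None,
    "  ",
    "Acme builds rockets.",
    "Beta sells tea.",
])
result = clean_descriptions(toy)
print(result.tolist())
```

Whether this helps depends on how many empty or repeated rows the real data contains, which we have not measured here.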

## Parameter Tuning the Model
Once a model is selected, even a pre-trained one, there are a variety of knobs and dials that can be turned to get the most out of the interaction between the dataset and the generated text. In the code cell below we used the following:

- **`dataset`**: points to the data you're using to fine-tune the model (Crunchbase records, about 171,000)

- **`accumulate_gradients`**: how many batches' gradients are summed before each weight update, which acts like a larger effective batch size (library default of 5; keep below 15)

- **`steps`**: the main parameter. It controls how many training steps (weight updates) to run. A step is not a full pass over the data, so this is not the same as epochs. Run as many as your time will allow, with diminishing returns past a certain point

- **`run_name`**: names the checkpoint that we can restore the model from if needed. The checkpoint stores the model's exact state at the time it was saved.

- **`only_train_transformer_layers`**: determines whether to train all layers (`False`, the full train we use on structured/limited data) or only the transformer layers (`True`, a lighter train for things like unstructured comments/text or large bulk data)

As this runs, it generates some samples every 100 steps so you can check whether anything is going wrong. It also saves to our checkpoint every 250 steps, which is handy in case training is interrupted by need or chance!
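
To make the schedule concrete, the arithmetic below works out roughly when samples and checkpoints occur for the values used in the next cell. It is a sketch: `batch_size = 1` is an assumption (it is not set in the notebook), and the library's exact step at which it samples or saves may be off by one from these multiples.

```python
steps = 1500
sample_every = 100
save_every = 250
accumulate_gradients = 11
batch_size = 1  # assumption: not set explicitly in the notebook

# Steps after which a sample is generated or a checkpoint is written (approximate)
sample_steps = [s for s in range(1, steps + 1) if s % sample_every == 0]
save_steps = [s for s in range(1, steps + 1) if s % save_every == 0]

# Each step accumulates gradients over several batches before updating weights
batches_seen = steps * accumulate_gradients * batch_size

print(len(sample_steps), "sampling points")
print("checkpoints after steps:", save_steps)
print("training batches processed:", batches_seen)
```

If training is interrupted, at most `save_every` steps of work are lost since the last checkpoint.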

In [0]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='345M',
              accumulate_gradients = 11,
              steps=1500,
              restore_from='fresh',
              run_name='run1',
              print_every=10,
              sample_every=100,
              save_every=250,
              only_train_transformer_layers=False)

After the model is trained, you can copy the checkpoint folder to your own Google Drive.

In [0]:
gpt2.copy_checkpoint_to_gdrive(run_name='run1')

## Load a Trained Model Checkpoint

Running the next cell will copy the `.tar` checkpoint file from your Google Drive into the Colaboratory VM.

In [0]:
gpt2.copy_checkpoint_from_gdrive(run_name='run1')

The next cell will allow you to load the retrained model checkpoint + metadata necessary to generate text.

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

In [0]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run1')

## Generate Text From The Trained Model
After training the model or loading a retrained model from checkpoint, you can now generate text. `generate` generates a single text from the loaded model.

<br>

You can also pass in a `prefix` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

<br>

Other optional-but-helpful parameters for `gpt2.generate`:

*  **`length`**: Number of tokens to generate (default 1023, the maximum)
* **`temperature`**: One of the most influential parameters.  The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
* **`truncate`**: Truncates the input text until a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
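
To show what `temperature`, `top_k`, and `top_p` do to the next-token distribution, here is a toy NumPy sketch. It is a simplified illustration, not the library's actual implementation, and `next_token_probs` and the example scores are hypothetical.

```python
import numpy as np

def next_token_probs(logits, temperature=0.7, top_k=0, top_p=0.0):
    """Turn raw token scores into sampling probabilities (toy version)."""
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k > 0:
        # Keep only the k highest-scoring tokens
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    # Softmax, shifted by the max for numerical stability
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p > 0.0:
        # Keep the smallest set of tokens whose cumulative probability reaches top_p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
        probs /= probs.sum()
    return probs

logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for four candidate tokens
probs = next_token_probs(logits, temperature=0.9, top_k=3)

# Draw one token index according to the filtered distribution
rng = np.random.default_rng(0)
token = rng.choice(len(probs), p=probs)
```

Lower temperatures concentrate probability on the top token, while `top_k` and `top_p` remove unlikely tokens from consideration altogether.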


In [0]:
gpt2.generate(sess, 
              run_name='run1',
              length = 200,
              batch_size = 20,
              temperature = 0.9, 
              nsamples = 40)

Consumers who use coupon code VENDED are automatically entered to win charity buys! Visit VendedOneBuy.com to enter to win.  Be sure to enter to win by using the right coupon code on the checkout page. Good luck! <|endoftext|>
<|startoftext|>Departures are a new platform for digital communication built for travelers that creates an instant, safe and collaborative platform for travelers around the world to communicate and collaborate on their daily travel plans by connecting on destinations that offer cultural, historic, and fantastic environments. Departures use artificial intelligence to converse with travelers and generate recommendations that are shared with them based on their social media activity and other data sources. The company works with travelers around the world on a daily basis and then delivers an amazing gift – a way for travelers to travel and get inspired on their next adventure. Departures has raised $30 million in funding from some of Silicon Valley's most experienc

For bulk generation, we can generate a large amount of text to a file and sort out the samples locally on our computer. The next cell writes the generated text to a file, processes it just a bit, and then turns it into a .csv to pass along to the next step.

In [0]:
gen_file = 'gpt2_gentext.txt'

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=300,
                      temperature=0.9,
                      nsamples=5000,
                      batch_size=20
                      )

files.download(gen_file)

with open('gpt2_gentext.txt', 'r') as in_file:
  desc = [line.strip('=*') for line in in_file]
df = pd.DataFrame({'descriptions':desc})
df.to_csv('log.csv')
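
Reading the file line by line puts each line in its own row, so a sample that spans several lines is split up. An alternative sketch is to split on the delimiter between samples instead. This assumes the delimiter is a run of twenty `=` characters, which you should check against your own output file; `split_samples` is a hypothetical helper and the text is made up.

```python
SAMPLE_DELIM = "=" * 20  # assumed delimiter between generated samples

def split_samples(raw_text, delim=SAMPLE_DELIM):
    """Split generated text into one cleaned string per sample."""
    samples = []
    for chunk in raw_text.split(delim):
        text = chunk.replace("<|startoftext|>", "").replace("<|endoftext|>", "").strip()
        if text:
            samples.append(text)
    return samples

# In-memory stand-in for the contents of the generated text file
raw = (
    "<|startoftext|>Acme builds rockets.<|endoftext|>\n"
    + SAMPLE_DELIM + "\n"
    + "Beta sells tea\nto offices.\n"
    + SAMPLE_DELIM + "\n"
)
samples = split_samples(raw)
print(samples)
```

Note that a single generated sample can contain more than one description, so a further split on the start/end markers may still be needed.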

# Post Processing

GPT-2 outputs fake company descriptions; however, they require cleaning and culling to boost the signal-to-noise ratio. Without this post processing we end up with a higher number of descriptions like:
```
CITY of Design is a UK company with offices in London, London, London, and Paris. The company was established in 1864 and is headquartered in London, England.
```
While the description is readable and almost believable, it does not encode the type of information we care about: descriptions of problems companies are solving. Post processing does not entirely eliminate low value descriptions, but it does surface more descriptions with a higher level of value encoded in them, such as:
```
Omise is a modern software company that offers a suite of guest analytics and compliance tools for hotels, industry leaders in the areas of guest engagement, guest satisfaction, and hotel productivity.  Omise's mission is to help improve the guest experience by enabling smarter guest decisions throughout the guest-centric economy. 
```

## Imports and Functions

In [0]:
import pandas as pd
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

### Supporting Functions
- print_line: format the description output to make it readable in the notebook
- cleanup: remove markers that are artifacts of GPT-2 output
- long_word: flags if a description has a word over a given length

In [0]:
# Function to make the output more readable
def print_line(line):
  while len(str(line)) > 90:
    print(line[:90])
    line = line[90:]
  print(line,'\n\n')

In [0]:
def cleanup(desc):
  '''Cleans output from gpt-2, start/end of description markers left in'''
  desc = desc.replace('<|startoftext|>','').replace('<|endoftext|>\n','')
  return desc.strip()

In [0]:
def long_word(desc, length=20):
  '''Return True if description contains words too long'''
  for word in desc:
    if len(word) >= length:
      return True
  
  return False
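
A short usage sketch for the helpers above, with the functions restated so the block runs on its own. The input strings are made up. Note that `long_word` expects a list of words: given a raw string it iterates over single characters and never flags anything.

```python
def cleanup(desc):
    """Remove GPT-2 start/end markers and surrounding whitespace."""
    desc = desc.replace('<|startoftext|>', '').replace('<|endoftext|>\n', '')
    return desc.strip()

def long_word(words, length=20):
    """Return True if any word in the list is at least `length` characters."""
    return any(len(word) >= length for word in words)

raw = '<|startoftext|>Acme builds rockets.<|endoftext|>\n'
clean = cleanup(raw)

normal = long_word(clean.split())
flagged = long_word('visit www.averyveryverylongdomainname.example today'.split())
unsplit = long_word('a' * 30)  # a string, not a list: each "word" is one character

print(clean, normal, flagged, unsplit)
```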

### Word Frequency
This function is the main filter for the post processing. The idea is to assign a frequency score representing the occurrence rate of unique words while also penalizing the words and phrases that are indicative of low value descriptions. This de-emphasizes descriptions that involve company location, founders, heavy repetition, and similar indicators of unessential information.

In [0]:
def word_freq(desc):
  '''Return frequency of unique words not in stopwords'''
  desc = str(desc).lower()
  full_len = len(desc.split())
  stopwords = ['founded by', 'is based in', 'was founded in', 
                'is headquartered in', 'headquarters in', 'developed by',
                'developed in', 'additional offices', 'germany',
                'france', 'china', 'california', 'india', 'wholly-owned',
                'silicon valley', 'san francisco', 'established in',
                'mountain view', 'family owned', 'family-owned', 
                'clients include', 'argentina', 'brazil', 'chile', 'colombia', 
                'japan', 'korea', 'malaysia', 'mexico', 'subsidiary',
                'formerly known as', 'venture capital', 'for more information']
  
  # Drop words that encode garbage info, thus penalizing on the frequency score
  for phrase in stopwords:
    desc = desc.replace(phrase,'')
  
  split_desc = str(desc).split()
  
  # Calculate unique word frequency, return 0 if description is too small
  if len(split_desc) < 10 or long_word(split_desc):
    return 0
  return len(set(split_desc)) / full_len
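
As a worked example of the score, the sketch below applies a simplified version of the same calculation to the two sample descriptions quoted earlier. It uses only two penalty phrases and omits the long-word check, and `unique_word_ratio` is a hypothetical name.

```python
def unique_word_ratio(desc, penalty_phrases=()):
    """Unique words left after removing penalty phrases, divided by the original word count."""
    desc = desc.lower()
    full_len = len(desc.split())
    for phrase in penalty_phrases:
        desc = desc.replace(phrase, '')
    words = desc.split()
    if len(words) < 10:
        return 0
    return len(set(words)) / full_len

good = (
    "Omise is a modern software company that offers a suite of guest analytics "
    "and compliance tools for hotels, industry leaders in the areas of guest "
    "engagement, guest satisfaction, and hotel productivity. Omise's mission is "
    "to help improve the guest experience by enabling smarter guest decisions "
    "throughout the guest-centric economy."
)
weak = (
    "CITY of Design is a UK company with offices in London, London, London, "
    "and Paris. The company was established in 1864 and is headquartered in "
    "London, England."
)
penalties = ['established in', 'is headquartered in']

good_score = unique_word_ratio(good, penalties)
weak_score = unique_word_ratio(weak, penalties)
print(round(good_score, 3), round(weak_score, 3))
```

The first description has 39 unique words out of 49, so it lands inside the 0.75 to 0.925 window used in the culling step. The second loses words to the penalty phrases and repeats "London", which pulls its score below the window.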

### Entity Frequency
The secondary filter utilizes named entity recognition (NER) to assign another frequency score. An entity is a word or phrase that represents a real world object/concept such as an organization or a date. By looking for entities indicating low value and tuning for a low frequency we can further narrow the field of fake descriptions and boost the signal to noise ratio.

In [0]:
def entity_freq(text, nlp):
  '''Return frequency of low value entities'''
  doc = nlp(text)
  count = 0
  
  # Use spaCy to find entities and count the low value ones
  for X in doc.ents:
    if (X.label_ in ['ORG', 'DATE', 'PERSON', 'TIME', 'PERCENT', 'MONEY']):
      count += 1
      
  return count/len(text.split())
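
The arithmetic behind the score can be shown without loading a spaCy model. In the sketch below, the entity list is hand-labeled and hypothetical, standing in for what `doc.ents` might return; `entity_ratio` is also a hypothetical name.

```python
LOW_VALUE_LABELS = {'ORG', 'DATE', 'PERSON', 'TIME', 'PERCENT', 'MONEY'}

def entity_ratio(text, entities):
    """Share of words accounted for by low value entities.

    `entities` is a list of (span_text, label) pairs standing in for spaCy's doc.ents.
    """
    flagged = sum(1 for _, label in entities if label in LOW_VALUE_LABELS)
    return flagged / len(text.split())

text = "Rooter was launched in February 2010 and is based in London, England."
# Hand-labeled for illustration; a real model may tag this sentence differently
entities = [
    ("Rooter", "ORG"),
    ("February 2010", "DATE"),
    ("London", "GPE"),
    ("England", "GPE"),
]

ratio = entity_ratio(text, entities)
print(round(ratio, 3))
```

With two flagged entities in twelve words, this sentence on its own would score well above the 0.04 cutoff used in the culling step. Location labels such as `GPE` are not counted here; locations are instead penalized by the word frequency filter.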

### Further Selection Methods
We tried a variety of other methods for selecting high value descriptions, including cosine similarity, distance, key phrase/word detection, occurrence frequency of bags of words, and more. Due to the nature of the data set, both in terms of size and the difficulty of extracting the semantic value embedded in each description, these methods were deemed unusable.

## Culling
Once effective selection methods were settled upon we compiled the code and passed the GPT-2 generated descriptions through the post processing. Since the first MVP is to output a JSON of mostly high value fake descriptions we narrowed the final output to only the window most likely to boost the signal to noise ratio in the descriptions. In future iterations we plan to export a wider window of descriptions and allow the user some control over the tuning process.

In [0]:
# Read in the generated descriptions
df = pd.read_csv('https://raw.githubusercontent.com/labs15-pain-point/Data-Science/master/generated/log(6).csv')
df = df.drop('Unnamed: 0', axis=1)
df['descriptions'] = df['descriptions'].apply(cleanup)
df = df[df['descriptions'] != '']

In [0]:
# Begin culling and formatting the dataframe
start_sum = df['descriptions'].count()
print('Descriptions before any drops:', start_sum)

df['word_freq'] = df['descriptions'].apply(word_freq)
df = df[df['word_freq'] >.6]
df['ent_freq'] = [entity_freq(desc, nlp) for desc in df['descriptions']]

word_freq_cond = ((df['word_freq'] > .75) & (df['word_freq'] < .925))
ent_freq_cond = (df['ent_freq'] < .04)
df = df[word_freq_cond & ent_freq_cond].reset_index().drop('index', axis=1)

Descriptions before any drops: 28119


In [0]:
print('Descriptions after frequency windows:', df['descriptions'].count())
print('% Kept from Batch:', df['descriptions'].count()/start_sum)

Descriptions after frequency windows: 3303
% Kept from Batch: 0.11746505921263203


In [0]:
df.describe()

| | word_freq | ent_freq |
|---|---|---|
| count | 3303.0 | 3303.0 |
| mean | 0.816368 | 0.017012 |
| std | 0.046428 | 0.013921 |
| min | 0.751724 | 0.0 |
| 25% | 0.777778 | 0.0 |
| 50% | 0.807018 | 0.019608 |
| 75% | 0.85 | 0.029199 |
| max | 0.923077 | 0.039604 |


In [0]:
# Look at all descriptions
for i in range(0,len(df)):
  print('Word Frequency:', df['word_freq'][i])
  print('Entity Frequency:', df['ent_freq'][i])
  print_line(df['descriptions'][i])

Word Frequency: 0.7959183673469388
Entity Frequency: 0.02040816326530612
Omise is a modern software company that offers a suite of guest analytics and compliance t
ools for hotels, industry leaders in the areas of guest engagement, guest satisfaction, an
d hotel productivity.  Omise's mission is to help improve the guest experience by enabling
 smarter guest decisions throughout the guest-centric economy. 


Word Frequency: 0.7777777777777778
Entity Frequency: 0.0
Dealroom.co is a SaaS platform that connects developers with businesses. It is revolutioni
zing how companies manage their digital operations. The platform helps companies to organi
ze, manage, and hire their teams. It also enables companies to easily hire, let, and manag
e additional employees through their apps. 


Word Frequency: 0.9090909090909091
Entity Frequency: 0.0
Riiid develop a digital platform of care that enables individuals and organisations to con
nect, communicate, and engage in commerce.  Its technology offer

# Manual Processing
Once we narrow the field of possible descriptions, the final step is for a user to manually select for usable descriptions. Since the goal is to aid project ideation and brainstorming, we hope to provide a high frequency of descriptions that describe a problem and solutions to that problem. 

<br>

Even though there is still noise and a manual selection process, we feel that this delivers significant value and time savings to the stakeholder. This is because we provide a wide range of problem/solution pairs in one location, reducing the research time needed to even find potential projects. Additionally, this should reduce some of the mental fatigue and creativity blocks associated with brainstorming and project ideation.

<br>

Future MVPs will continue to ease the mental burden of ideation and selection through greater usability and control over criteria.

# Examples of High Value Descriptions

Project: Platform for language learning and development
```
Pro.com is an educational platform for Japanese language learners. It offers learning materials such as textbooks, storybooks, and videos to learn and develop language skills. The company was founded in 2015. 
```

<br>

Project: Tools and analytics application for the hotel/hospitality industry
```
Omise is a modern software company that offers a suite of guest analytics and compliance tools for hotels, industry leaders in the areas of guest engagement, guest satisfaction, and hotel productivity.  Omise's mission is to help improve the guest experience by enabling smarter guest decisions throughout the guest-centric economy. 
```

<br>

Project: Privacy and digital rights management application
```
In a world where everything from billboards to your mobile data is being stored for you, we believe it is important to support digital rights holders when it comes to keeping their valuable content around the world. 
```

<br>

Project: Track and analyze risk indicators in a healthcare setting to improve patient outcomes. Possible IoT integration
```
Riskimet is an all-in-one e-commerce solution for tracking and analyzing physical risks in the healthcare industry. By tracking physical risks in real-time they can easily and transparently gather intelligence and automate actions to improve patient outcomes and corporate welfare. 
```

<br>

Project: Website to organize large gatherings such as dinner parties and potlucks, and then provide the food, tools, and recipes to make it a success. Blue Apron for dinner parties.
```
At Dinevore, they believe that dining is a social community. They want to bring your food to people who would rather not be served. They want to offer you the best food that perfectly fits your eating style. 
```

<br>

Project: One stop shop for travel planning
```
Rooter is an online travel booking platform that lets users book flights, hotels, vacations, car rentals, golf, boats, activities, conference and event packages, car rental, food and beverage, and other great stuff on just about any booking page. Rooter’s mission is simple: To let people book where they want their life to go, and have them book at the same time. Rooter makes travel planning as easy as booking a flight, hotel, conference, or activity. It also allows users to book travel agencies and get a full picture view of the listing process. Rooter was launched in February 2010 and is based in London, England. 
```

<br>

Project: Mobile/web application for inventory tracking and delivery logistics
```
Rentlogic is a tech-enabled fulfillment and delivery platform that makes tracking inventory and order performance real-time. It ferociously tracks the supply chain from vendor to vendor, delivering products at scale and taking orders with less human input. It was founded in 2016. 
```

# Output to JSON
To complete the MVP we finally output our dataframe to JSON.

In [0]:
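# The default orient does not accept index=False; orient='records' writes one object per row without the index
df.to_json('generated_descriptions.json', orient='records')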
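
To sanity-check the export before handing it to MVP 2, the sketch below writes a made-up dataframe with the same column names and reads it back with the standard `json` module. The record-per-row layout (`orient='records'`) is an assumption about what the next stage expects.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up rows with the same columns as the culled dataframe
toy = pd.DataFrame({
    'descriptions': ['Acme builds rockets.', 'Beta sells tea.'],
    'word_freq': [0.8, 0.9],
    'ent_freq': [0.0, 0.02],
})

# Write to a temporary location so the example leaves no files behind in the repo
path = os.path.join(tempfile.mkdtemp(), 'generated_descriptions.json')
toy.to_json(path, orient='records')

with open(path) as f:
    records = json.load(f)

print(len(records), sorted(records[0].keys()))
```

Each record is a plain object keyed by column name, which keeps the scores attached to their description for any later filtering.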