<a href="https://colab.research.google.com/github/SohamK2111/Reply-Hackathon/blob/main/NLP/data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing
In this notebook, we focus on a standard way of processing data so that it can be fed to a model such as GPT-3. We use several libraries along the way. Please bear in mind that this was run locally: helper and utility functions were used for loading things such as credentials and dataset downloads.

The guide to actually fine-tuning a GPT-3 model can be found here: https://platform.openai.com/docs/guides/fine-tuning

In [None]:
# Import useful Libraries
import credentials
import pandas as pd
import zipfile
import openai
import gensim
import kaggle
import string
import nltk
import json
import os
nltk.download('wordnet')
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = nltk.stem.WordNetLemmatizer()

[nltk_data] Error loading wordnet: <urlopen error [Errno 8] nodename
[nltk_data]     nor servname provided, or not known>


# Get credentials
Please bear in mind that this was run locally, so the credential helpers and paths below reflect that setup!

In [None]:
openai.api_key = credentials.get_openai_api_key()
kaggle_credentials = credentials.get_kaggle_creds('~/.kaggle/kaggle.json')
os.environ["KAGGLE_USERNAME"] = kaggle_credentials["username"]
os.environ["KAGGLE_KEY"] = kaggle_credentials["key"]
zipped_path = "datasets/zipped/"
extracted_path = "datasets/extracted/"

Download dataset(s)

In [None]:
# downloads the dataset into a zip file
kaggle.api.dataset_download_files("patrickfleith/space-news-dataset", path=zipped_path)

In [None]:
# File needs to be unzipped
raw_zip = f"{zipped_path}space-news-dataset.zip"
with zipfile.ZipFile(raw_zip, "r") as zip_file:
    zip_file.extractall(extracted_path)

Unzipping the file gives us a CSV file of tabular data. We can take a look at this data using the pandas library.

In [None]:
# create a dataframe
df = pd.read_csv(f"{extracted_path}spacenews-december-2022.csv")
df.head(5)

Unnamed: 0,title,url,content,author,date,postexcerpt
0,Orion splashes down to end Artemis 1,https://spacenews.com/orion-splashes-down-to-e...,Updated at 5:45 p.m. Eastern after post-splash...,Jeff Foust,"December 11, 2022",Fifty years to the day after the last Apollo m...
1,Polaris Dawn crewed mission could suffer addit...,https://spacenews.com/polaris-dawn-crewed-miss...,LAS VEGAS — A billionaire-backed private astro...,Jeff Foust,"October 25, 2022",A billionaire-backed private astronaut mission...
2,DART on track for asteroid collision,https://spacenews.com/dart-on-track-for-astero...,WASHINGTON — A NASA spacecraft is on course to...,Jeff Foust,"September 25, 2022",A NASA spacecraft is on course to deliberately...
3,U.S. Space Command calls for investment in tec...,https://spacenews.com/u-s-space-command-calls-...,"WASHINGTON — Lt. Gen. John Shaw, deputy comman...",Sandra Erwin,"August 31, 2022",U.S. Space Command's Lt. Gen. John Shaw said '...
4,SpaceX requests permission for direct-to-smart...,https://spacenews.com/spacex-requests-permissi...,"TAMPA, Fla. — SpaceX could provide “full and c...",Jason Rainbow,"December 8, 2022",SpaceX could provide “full and continuous” dir...


Now we can see that there is some data which we can use. From the column names alone, we can tell that the columns title, content and postexcerpt are likely to contain textual data suitable for our purposes. We can take a deeper dive into these columns.

In [None]:
interesting_columns = ["title", "content", "postexcerpt"]
# Take an explicit copy so that later column assignments act on interesting_df only
interesting_df = df.loc[:, interesting_columns].copy()

In [None]:
interesting_df

Unnamed: 0,title,content,postexcerpt
0,Orion splashes down to end Artemis 1,Updated at 5:45 p.m. Eastern after post-splash...,Fifty years to the day after the last Apollo m...
1,Polaris Dawn crewed mission could suffer addit...,LAS VEGAS — A billionaire-backed private astro...,A billionaire-backed private astronaut mission...
2,DART on track for asteroid collision,WASHINGTON — A NASA spacecraft is on course to...,A NASA spacecraft is on course to deliberately...
3,U.S. Space Command calls for investment in tec...,"WASHINGTON — Lt. Gen. John Shaw, deputy comman...",U.S. Space Command's Lt. Gen. John Shaw said '...
4,SpaceX requests permission for direct-to-smart...,"TAMPA, Fla. — SpaceX could provide “full and c...",SpaceX could provide “full and continuous” dir...
...,...,...,...
18349,Kendall lays out Pentagon thinking on future s...,"\nFrank Kendall, the Pentagon’s top acquisitio...","Frank Kendall, the Pentagon’s top acquisition ..."
18350,A larger share of NOAA’s declining space budge...,Updated Feb. 10 at 10:18 p.m. Eastern The U.S....,The U.S. National Oceanic and Atmospheric Admi...
18351,Think Tank Turns Its Attention To Mars As 2016...,WASHINGTON — As NASA develops a long-term stra...,As NASA develops a long-term strategy to suppo...
18352,House Bill Leaves Last Three JPSS Satellites i...,WASHINGTON — A spending bill the House passed ...,A spending bill the House passed June 3 would ...


# Cleaning the data
To clean the data, we want to carry out text normalisation. For this, we will need to carry out some of the following operations:
  - Convert to lowercase 
  - Remove punctuation 
  - Remove special characters 
  - Lemmatise and Stem words 
  - Remove stopwords
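
As a rough illustration of the steps listed above, here is a minimal standard-library sketch. It assumes a tiny, hypothetical stopword set (`TOY_STOPWORDS`) instead of gensim's, and it leaves out lemmatising and stemming because those need the NLTK data files.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# the notebook itself relies on gensim's STOPWORDS.
TOY_STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "on"}

def normalise(text):
    # Lowercase, then drop non-ASCII characters (decode so we get str, not bytes)
    text = text.lower().encode("ascii", "ignore").decode("ascii")
    # Remove ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove stopwords and rejoin the remaining words
    return " ".join(word for word in text.split() if word not in TOY_STOPWORDS)

print(normalise("Orion splashes down to end Artemis 1!"))
# prints: orion splashes down end artemis 1
```

Note the `.decode("ascii")` step: calling `str()` on the bytes returned by `encode` would instead give a string that starts with `b'`.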

In [None]:
interesting_df["title"] = interesting_df["title"].str.lower()
interesting_df["content"] = interesting_df["content"].str.lower()
interesting_df["postexcerpt"] = interesting_df["postexcerpt"].str.lower()

In [None]:
from transformers import GPT2Tokenizer

def remove_punctuation(text):
    # Helper method to remove punctuation and non-ASCII characters from text
    punctuation = string.punctuation
    clean_text = ''.join([char for char in text if char not in punctuation])
    # encode() returns bytes; decode back to str so we do not end up with "b'...'"
    return clean_text.encode('ascii', 'ignore').decode('ascii')

def lemm_and_stem(text):
    # Helper method to lemmatize and stem words
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess_text(text, topic_list=True):
    # Lemmatise, stem, stopword removal and tokenisation of text
    stopwords = gensim.parsing.preprocessing.STOPWORDS
    tokens = [lemm_and_stem(token) for token in nltk.word_tokenize(text) if token not in stopwords and len(token) > 3]
    if topic_list:
        # Return the list of words to use as a reference in building the attention mask
        return tokens
    # Otherwise return a single cleaned string with punctuation removed
    return remove_punctuation(" ".join(tokens))

# Pre-process titles to get keywords to apply attention masking to df
title_df = pd.DataFrame(columns=["title"], data=interesting_df["title"].copy())
postexcerpt_df = pd.DataFrame(columns=["postexcerpt"], data=interesting_df["postexcerpt"].copy())

"""We want to understand if there are some interesting topics available to us. Normally, we can use LDAs for topic modelling,
however, since this dataset is all about space, we would likely end up with one main topic. As an alternative, we can use some
simple pre-processing techniques, such as the ones described above, to come up with a list of topics which we can use.

For this, we can extract such words from the title DataFrame and the postexcerpt DataFrame.
"""
title_df["title"] = title_df["title"].astype('str').apply(lambda text: preprocess_text(text, topic_list=True))
postexcerpt_df["postexcerpt"] = postexcerpt_df["postexcerpt"].astype('str').apply(lambda text: preprocess_text(text, topic_list=True))

interesting_df["title"] = interesting_df["title"].astype('str').apply(lambda text: preprocess_text(text, topic_list=False))
interesting_df["content"] = interesting_df["content"].astype('str').apply(lambda text: preprocess_text(text, topic_list=False))
interesting_df["postexcerpt"] = interesting_df["postexcerpt"].astype('str').apply(lambda text: preprocess_text(text, topic_list=False))

  from .autonotebook import tqdm as notebook_tqdm
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


Now, we have DataFrames which contain some of the interesting topics, as well as our pre-processed data!

In [None]:
def concatenate_columns(row):
    return row.title + row.postexcerpt

def get_attention(row): 
    attention_mask = []
    words = row["attention_words"]
    content = row["attention_content"]
    for word in nltk.word_tokenize(content): 
        if word in words: 
            attention_mask.append(1)
        else:
            attention_mask.append(0)
    return attention_mask

def tokenize(text, tokenizer=GPT2Tokenizer.from_pretrained("gpt2")):
    tokenised = tokenizer(text)
    return tokenised["input_ids"]

# Join the Title and PostExcerpt DataFrames into one to start the creation of an attention mask.
# We can effectively use this list of words to automate the creation of our attention mask!
attention_df = pd.DataFrame(columns=["attention_words"], data=title_df.join(postexcerpt_df, how='outer').apply(lambda row: concatenate_columns(row), axis=1))
attention_df["attention_content"] = interesting_df["content"]
attention_df["attention_mask"] = attention_df.apply(lambda row: get_attention(row), axis=1)

# Encode contents to feed to the model
interesting_df["input_ids"] = interesting_df["content"].astype('str').apply(lambda text: tokenize(text))
interesting_df["attention_words"] = attention_df["attention_words"]
interesting_df["attention_mask"] = attention_df["attention_mask"]

Token indices sequence length is longer than the specified maximum sequence length for this model (1125 > 1024). Running this sequence through the model will result in indexing errors
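
One thing to watch here: `get_attention` produces one mask entry per NLTK word, while `input_ids` has one entry per GPT-2 subword token, so the two lists may differ in length. Below is a minimal sketch of one way to line them up, by repeating each word's flag for every subword piece. `toy_subword_tokenize` is a hypothetical stand-in for a real tokenizer and is only there to keep the example self-contained.

```python
def toy_subword_tokenize(word, size=4):
    # Hypothetical stand-in for a subword tokenizer: fixed-size character chunks
    return [word[i:i + size] for i in range(0, len(word), size)]

def token_level_mask(content, keywords, tokenize_word=toy_subword_tokenize):
    keyword_set = set(keywords)  # set membership is faster than scanning a list
    tokens, mask = [], []
    for word in content.split():
        pieces = tokenize_word(word)
        flag = 1 if word in keyword_set else 0
        tokens.extend(pieces)
        # Repeat the word-level flag once per subword piece
        mask.extend([flag] * len(pieces))
    return tokens, mask

tokens, mask = token_level_mask("nasa spacecraft asteroid", ["spacecraft"])
print(tokens)  # ['nasa', 'spac', 'ecra', 'ft', 'aste', 'roid']
print(mask)    # [0, 1, 1, 1, 0, 0]
```

With this approach the mask always has the same length as the token list, whatever tokenizer is plugged in.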


# Creating Prompt-Completion Pairs
Now that we have the data, we need to put it in a form which can be used by the fine-tuning API. 

There is a token limit, which you can read more about at these links:
- https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them 
- https://platform.openai.com/docs/guides/fine-tuning
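
For a quick feel of how long a text is in tokens before running a real tokenizer, we can use the rules of thumb from the first link (roughly 4 characters or ¾ of a word per token in English). The sketch below is only an estimate, and `estimate_tokens` is a hypothetical helper rather than part of any library.

```python
import math

def estimate_tokens(text):
    # Rule of thumb for English text: ~4 characters per token, ~0.75 words per token
    by_chars = math.ceil(len(text) / 4)
    by_words = math.ceil(len(text.split()) / 0.75)
    # Take the larger estimate to stay on the cautious side
    return max(by_chars, by_words)

print(estimate_tokens("a b c"))  # prints: 4
```

For anything close to the limit, the actual token count from the tokenizer should be used instead.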

In [None]:
def preprocess_prompt(prompt): 
    # Helper method to preprocess the prompt we want to create
    prompt_tokens = tokenize(prompt)
    return prompt_tokens

def get_prompt(row):
    # In this example, we will use a simple prompt: 
    topics = ", ".join(row.attention_words)
    prompt = f"Write a space news article regarding the following topics: {topics}\n\n###\n\n"
    preprocessed_prompt = preprocess_prompt(prompt)
    return preprocessed_prompt

def get_completion(row):
    return row.input_ids

def get_prompt_attention_mask(prompt):
    return [1] * len(prompt)

# Create the finetuning dataframe
finetune_df = pd.DataFrame(columns=["prompt", "completion"])
finetune_df["prompt"] = interesting_df.apply(lambda row: get_prompt(row), axis=1)
finetune_df["completion"] = interesting_df.apply(lambda row: get_completion(row), axis=1)
# Create prompt and completion attention masks:
finetune_df["prompt_attention_mask"] = finetune_df["prompt"].apply(lambda prompt: get_prompt_attention_mask(prompt))
finetune_df["completion_attention_mask"] = interesting_df["attention_mask"]
finetune_df
# We now have data which we can feed to the model!

Unnamed: 0,prompt,completion,prompt_attention_mask,completion_attention_mask
0,"[16594, 257, 2272, 1705, 2708, 5115, 262, 1708...","[65, 6, 929, 19608, 642, 2231, 9114, 10183, 68...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,"[16594, 257, 2272, 1705, 2708, 5115, 262, 1708...","[65, 1053, 4908, 18828, 1891, 21883, 33779, 43...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,"[16594, 257, 2272, 1705, 2708, 5115, 262, 1708...","[65, 6, 86, 2542, 299, 15462, 16807, 1093, 82,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, ..."
3,"[16594, 257, 2272, 1705, 2708, 5115, 262, 1708...","[65, 6, 86, 2542, 2429, 45610, 427, 707, 1207,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, ..."
4,"[16594, 257, 2272, 1705, 2708, 5115, 262, 1708...","[65, 470, 13299, 781, 64, 2272, 87, 899, 312, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, ..."
...,...,...,...,...
18349,"[16594, 257, 2272, 1705, 2708, 5115, 262, 1708...","[65, 6, 8310, 962, 479, 437, 282, 28145, 1840,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, ..."
18350,"[16594, 257, 2272, 1705, 2708, 5115, 262, 1708...","[65, 6, 929, 19608, 730, 65, 8949, 23, 9114, 1...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, ..."
18351,"[16594, 257, 2272, 1705, 2708, 5115, 262, 1708...","[65, 6, 86, 2542, 299, 15462, 1205, 890, 4354,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
18352,"[16594, 257, 2272, 1705, 2708, 5115, 262, 1708...","[65, 6, 86, 2542, 4341, 3821, 1208, 474, 1726,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, ..."
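
If the fine-tuning endpoint expects raw text rather than token IDs (as far as we are aware, the legacy prompt/completion format was a JSONL file of text pairs), the same pairs could be serialised roughly as follows. This is a sketch under that assumption: `to_jsonl` and the `STOP` sequence are hypothetical names chosen for illustration, and the exact format should be checked against the guide linked above.

```python
import io
import json

STOP = " END"  # hypothetical stop sequence marking the end of a completion

def to_jsonl(pairs, handle):
    # Each pair is (list of topic words, article text)
    for topics, article in pairs:
        topic_text = ", ".join(topics)
        record = {
            "prompt": f"Write a space news article regarding the following topics: {topic_text}\n\n###\n\n",
            # Leading space and a fixed stop sequence, so the model learns where to finish
            "completion": " " + article.strip() + STOP,
        }
        handle.write(json.dumps(record) + "\n")

buffer = io.StringIO()
to_jsonl([(["orion", "artemi"], "orion splashes down")], buffer)
print(buffer.getvalue())
```

Writing to a file handle opened with `open("finetune.jsonl", "w")` instead of the in-memory buffer would produce one JSON object per line.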


# Attempting to stay in line with Token Limits :S

Generally, we have a 2048-token limit which applies to the prompt and completion combined. In this case, we can simply drop the rows which go over the limit.

In [None]:
def get_prompt_completion_lengths(row):
    return len(row.prompt) + len(row.completion)

def drop_long_lengths(row):
    max_length = 2048
    length = get_prompt_completion_lengths(row)
    if length > max_length:
        return None
    return length

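finetune_df["token_lengths"] = finetune_df.apply(lambda row: drop_long_lengths(row), axis=1)
# Drop over-length rows but keep the prompt/completion columns
# (selecting the column first would leave us with only a Series of lengths)
finetune_df = finetune_df.dropna(subset=["token_lengths"])
finetune_df["token_lengths"]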



0         714.0
1         454.0
2         639.0
3         569.0
4         533.0
          ...  
18349     246.0
18350     687.0
18351     486.0
18352     954.0
18353    1062.0
Name: token_lengths, Length: 18329, dtype: float64
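
Dropping rows is the simplest option. As an alternative, we could keep every row and truncate the completion so that the pair fits the budget. A minimal sketch, assuming the same 2048-token limit; `truncate_completion` is a hypothetical helper that works on plain lists of token IDs.

```python
MAX_TOKENS = 2048  # combined prompt + completion limit assumed in this notebook

def truncate_completion(prompt_ids, completion_ids, max_tokens=MAX_TOKENS):
    # Tokens left for the completion once the prompt is accounted for
    budget = max(max_tokens - len(prompt_ids), 0)
    return completion_ids[:budget]

prompt_ids = list(range(10))
long_completion = list(range(3000))
print(len(truncate_completion(prompt_ids, long_completion)))  # prints: 2038
```

Truncation keeps more training examples, at the cost of articles that stop mid-text, so which option is better depends on the use case.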

From this, we now have data which we can feed into the model itself, after which we can prompt our fine-tuned model to generate an article for us! OpenAI has a detailed page on fine-tuning, with examples in multiple languages: https://platform.openai.com/docs/guides/fine-tuning