# CS39AA - Introduction

## I. NLP Deep Learning Examples: Text Generation

We won't get into the details of how this deep learning model works until later in the semester, but for now let's look at an example of a task that was not possible just a few years ago. GPT-2 was one of the first models that made this possible. Here is an example of using [GPT-2](https://en.wikipedia.org/wiki/GPT-2) (via the [Hugging Face](https://huggingface.co) transformers module/library).

### 1. Import and download necessary module, model, and class
First, we import a pipeline object (a handy abstraction that takes care of much of the underlying complexity for common tasks such as NER, sentiment analysis, text generation, etc.).

__Note that this will download the underlying language model used by the Hugging Face pipeline (GPT-2 by default), which is ~500MB! So, be sure you have a good internet connection before running the following cell. If you are using an online platform (e.g. Google Colab), then you probably don't want to download this since it will take up a lot of your precious storage space! As an alternative, you can instead use [Deep AI's text generator demo](https://deepai.org/machine-learning-model/text-generator) online.__ Just be sure to document what you do below in subsection 4.

In [None]:
PATH = '/home/steve/data/models/'
TEXTGEN = 'huggingface_textgen.pipeline'

# download the pipeline and model from Hugging Face
#from transformers import pipeline
#text_gen = pipeline("text-generation")
#text_gen.save_pretrained(PATH + TEXTGEN)
# OR load a pre-downloaded and saved version
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
tok = AutoTokenizer.from_pretrained(PATH + TEXTGEN, local_files_only=True)
mod = AutoModelForCausalLM.from_pretrained(PATH + TEXTGEN, local_files_only=True)
text_gen = pipeline("text-generation", model=mod, tokenizer=tok)

### 2. Provide a starting text sequence

We have already specified the task that we want to use our pipeline for above, namely text generation.

Next, we will give the text generation pipeline something to start with. One of the potential use cases for text generation is to help writers produce content more efficiently. This already presents us with a dilemma about how this technology is used. Is it ethical for an author/journalist/etc. to take credit for work created by AI? Should such content be trusted? There are many other questions to consider, which you should definitely think about!

For the time being there is no model that is capable of generating high-quality, clear and cohesive text sequences of non-trivial length. So let's use text generation to help us overcome a hypothetical mild case of writer's block. One of my favorite examples of writer's block is from a Billy Crystal [classic](https://youtu.be/KfVunEjeQPQ) where he is stuck on the first line of his novel with, "_The night was..._"

In [None]:
prefix_text = "The night was"

### 3. Generate some text

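Finally, let's generate a sequence of up to 50 tokens (roughly, words or pieces of words). Note that `max_length` counts the tokens in the starting text as well as the newly generated ones.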

In [None]:
# generate a sequence
result01 = text_gen(prefix_text, max_length=50, do_sample=False)

# see results
print(result01[0]['generated_text'])

That wasn't bad, but it wasn't great either. You can see how these models can begin to falter.

Let's try once more, this time setting `do_sample=True` to generate a different text sequence.

In [None]:
result02 = text_gen(prefix_text, max_length=50, do_sample=True)

print(result02[0]['generated_text'])
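The difference between the two settings can be illustrated with a toy example. This is only a sketch, not how the pipeline is implemented internally: it assumes a made-up probability table (`next_token_probs`, a hypothetical name with invented values) for the token that follows "The night was". With the default settings, `do_sample=False` behaves like the greedy choice below (always take the most probable token), while `do_sample=True` draws a token at random in proportion to its probability.

```python
import random

# Hypothetical next-token probabilities for the prefix "The night was"
next_token_probs = {"dark": 0.5, "cold": 0.3, "young": 0.15, "humid": 0.05}

def greedy_choice(probs):
    """Pick the single most probable token (like do_sample=False)."""
    return max(probs, key=probs.get)

def sampled_choice(probs, rng):
    """Draw one token at random, weighted by probability (like do_sample=True)."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded so the demo is repeatable
print("greedy :", [greedy_choice(next_token_probs) for _ in range(5)])
print("sampled:", [sampled_choice(next_token_probs, rng) for _ in range(5)])
```

The greedy list repeats the same token every time, which is why running the first cell again gives the same text, while the sampled list can vary from draw to draw.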

### 4. Run your own example

Now it's your turn to provide a starting sequence and generate some results. 

Again, if you would prefer not to download and run this code, then you can instead use [Deep AI's text generator demo online](https://deepai.org/machine-learning-model/text-generator).

You will share the results of this in your [Self-Introduction post on Canvas](https://msudenver.instructure.com/courses/47963/discussion_topics/496954).

If you want to demonstrate some of the biases that exist in language models please do so, but be sure to keep the discourse respectful and constructive.

In [None]:
# insert a short sequence of text of your own below
your_prefix = "" 

# generate a set of tokens/words to complete your sequence
your_result = text_gen(your_prefix, max_length=50, do_sample=False)

# print your results
print(your_result[0]['generated_text'])

We will dig more into these types of models as the semester progresses, but if you're curious and want to peel back one more layer to see what's happening behind the scenes, then [this demonstration by the Allen Institute](https://demo.allennlp.org/next-token-lm) is worth a look. In that demo they show the set of tokens/sequences that are suggested along with their probabilities.

## II. NLP Deep Learning Examples: Word Embeddings/Vectors

Again, we will dig into the concept of word vectors much more later in the semester - both in how they are created as well as how we can use them to assist with modeling tasks. For now let's try and understand what they are by looking at how we can use them. 

### 1. Import and download necessary module and data

We'll use the Python module/library [gensim](https://radimrehurek.com/gensim/intro.html) to download and use the [Word2vec](https://en.wikipedia.org/wiki/Word2vec) word vectors.

Run either the block with the `api.load` statement (to download w2v) or the `KeyedVectors.load` statement (to load the saved w2v), but not both.

In [None]:
PATH = '/home/steve/data/models/'
W2V = 'w2v-google-news-300.wordvectors'

#import gensim.downloader as api
#wv = api.load('word2vec-google-news-300')
#wv.save(PATH + W2V)
# OR
from gensim.models import KeyedVectors
wv = KeyedVectors.load(PATH + W2V)

### 2. Examine a Word Vector

Now let's look at a vector for the word, "computer". 

What does the vector look like? How large is it? What do you think are similar words?

And, how do we think about similarity between word vectors? That is, how do we measure it?

In [None]:
# retrieve the word vector for 'computer'
wv['computer'].round(3)

In [None]:
# retrieve the 10 word (vectors) most similar to 'computer'
wv.most_similar('computer', topn=10)
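One common answer to the "how do we measure it?" question is cosine similarity, which is what gensim's `most_similar` uses to rank words. The sketch below computes it by hand on three made-up, 3-element vectors (the values are hypothetical and chosen only for illustration; the real vectors have 300 elements).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-element "word vectors", made up for illustration
computer = [0.9, 0.8, 0.1]
laptop = [0.8, 0.9, 0.2]
banana = [0.1, 0.0, 0.9]

print("computer vs laptop:", round(cosine_similarity(computer, laptop), 3))
print("computer vs banana:", round(cosine_similarity(computer, banana), 3))
```

Vectors that point in nearly the same direction score close to 1, while unrelated directions score closer to 0.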

### 3. Math with Words

Just like any other numbers/vectors, word vectors can be used in mathematical operations. Admittedly, it can be abstract to think about words as mathematical entities, so take a second to think about what this might look like, and what it might mean.

The classic example always given here is to begin with the word (vector) "_king_", subtract the word (vector) "_man_", and then add the word (vector) "_woman_".

If we look at the resulting vector, which word vector will be most similar to it? 

In [None]:
wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=5)
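To make the arithmetic concrete, here is a minimal sketch using hand-made 2-element vectors. Everything in `toy_vectors` is hypothetical: the two dimensions are assumed to mean roughly "royalty" and "femaleness", and the values were invented so that the example works out cleanly. Real word vectors are learned from text and are not this tidy.

```python
import math

# Hypothetical 2-element vectors: [royalty, femaleness]
toy_vectors = {
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man":   [0.1, 0.1],
    "woman": [0.1, 0.9],
    "apple": [0.0, 0.4],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two 2-element vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed element by element
target = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# Rank the remaining words by similarity to the result
# (the three input words are left out of the candidates in this sketch)
candidates = {w: v for w, v in toy_vectors.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))
print(target, "->", best)
```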

The word vectors we have been looking at are 300 elements long. There are other versions of word vectors that have been created. Below is an image of what the vectors in the _king - man + woman_ example look like using word vectors that are 50 elements in length (blue values are close to -1 while red values are close to +1). 

<img src="images/w2v_vec_arithmetic.png" width="697" height="349">

### 4. Run your own example

Now it's your turn to try out word vectors. __If you are running this notebook locally on your own system, or if you have the storage available on your online platform (e.g. Colab), then you can do so in the cells above. If not, then comment out the cells above and use one of the many online demos available__ (e.g. [Semantic Calculator demo](http://vectors.nlpl.eu/explore/embeddings/en/calculator/)).

## III. Python (and other Tools)

Naturally, we will be using Python for this class. Python was originally created (~30 yrs ago) with the intention of being easy to learn, and thus allowing for code to be written quickly and easily read by others. Today it is one of the most widely used programming languages and the most-used when it comes to Data Science and Machine Learning. If you have never used Python before, don't worry, you will be able to pick it up as we go. 

In addition to Python there are a slew of other tools we'll be using that extend the functionality of Python. 

Here are the main ones:

* pandas
* numpy
* scikit learn
* jupyter
* pytorch

We will be using online servers such as [Google Colab](https://colab.research.google.com), [Kaggle](https://www.kaggle.com), [Gradient](https://gradient.paperspace.com), etc. when it is more convenient to do so (e.g. for more computationally intensive tasks), so it's not necessary to install all of the above right this moment. However, it is recommended that you install Python on your local machine relatively soon, and that you do so using [Anaconda](https://www.anaconda.com/products/individual) as it comes with the first four items listed above. Once that is installed, [PyTorch](https://pytorch.org) is quick to install as well.

### 1. Install Anaconda

Go to [https://www.anaconda.com/products/individual](https://www.anaconda.com/products/individual) to download and install Anaconda. Most likely the default configuration will be fine.

You can confirm the installation and see the configuration by opening a terminal and running: `$ conda info`. 

There you should see version 3.x for Python (this can also be confirmed by running `$ python --version`).

### 2. Hello World w/ Python

Although it is not very interesting, let's begin with Python by running the usual hello world program. 

Recall that we can run Python either in interactive mode or with a Python program file (w/ extension .py). Since Python is an interpreted language, there is no separate compile step to run beforehand.

In interactive mode, we simply start the Python interpreter with `$ python`, and can then begin typing Python code. For the hello world program all we need to type is `print('Hello World')`. 

For a standalone Python program, we can simply open up a new document with our favorite text editor, input the single print statement above, then save it as _helloworld.py_. We can then run this at the command line with `$ python helloworld.py`.

Although we can run Python programs without explicitly defining a _main_ function/procedure, it is typically the case that we will want to define one. To define a function we'll use the `def` keyword. Let's call this program `helloworld2.py`. The contents of this program/file would be: 


    def printHelloWorld():
        print("Hello World")

    def main():
        printHelloWorld()

    if __name__ == '__main__':
        main()


Notice that we define a function to print our desired output, `printHelloWorld()`. Suppose we were creating a new program but we wanted to use this same function. We can import our `helloworld2.py` program into a new program, say, `helloworld_app.py`, and then load and use the function from there.

Here is what `helloworld_app.py` might look like:

    import helloworld2

    def main():
        helloworld2.printHelloWorld()

    if __name__ == '__main__':
        main()

We will pick up more of this as we go, but if you are curious about how `main` works in Python, or about defining/importing modules, check out the following:

* [Defining main functions in Python (realpython.com)](https://realpython.com/python-main-function/)
* [Python Modules (on python.org)](https://docs.python.org/3/tutorial/modules.html)
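To see why the `if __name__ == '__main__':` guard matters, the following sketch writes a small throwaway module to a temporary folder and imports it. The module name `helloworld_demo`, the function `print_hello_world`, and the variable `name_when_loaded` are all hypothetical names used only for this illustration. When a file is imported rather than run directly, `__name__` is set to the module's name, so the guarded block is skipped.

```python
import importlib
import pathlib
import sys
import tempfile

# Source code for a small throwaway module (hypothetical name and contents)
module_source = '''
def print_hello_world():
    print("Hello World")

name_when_loaded = __name__

if __name__ == "__main__":
    print_hello_world()
'''
tmp_dir = tempfile.mkdtemp()
pathlib.Path(tmp_dir, "helloworld_demo.py").write_text(module_source)

# Make the folder importable, then import the module
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()
helloworld_demo = importlib.import_module("helloworld_demo")

# On import, __name__ is the module's name, so the guarded block did not run
print("__name__ inside the module:", helloworld_demo.name_when_loaded)

# The function is still available to the importing program
helloworld_demo.print_hello_world()
```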

### 3. Python Data Types

Python is a dynamically-typed language, which means that we don't need to explicitly state the data type when declaring an instance of a variable. 

For example, `x = 5`, will create an instance of a variable named, `x`, which will have the value of `5`. What data type do you suppose this variable will have? 

Now suppose that `y = 4.2` is input. What data type will this variable have?

To check the data type of any variable, e.g. `x`, use `type(x)`. The following Jupyter notebook cell provides an example.

In [None]:
x = 5
print("The data type of x is:", type(x))

Running the following line will confirm that x is in fact the expected data type, an `int`. 

In [None]:
type(x) == type(int())
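As a side note, Python code more commonly checks types with the built-in `isinstance()` function. The short sketch below compares the two approaches; the variable `flag` is introduced here only for illustration. The results can differ because `isinstance()` also accepts subclasses, and `bool` is a subclass of `int`.

```python
x = 5

# Exact type comparison, equivalent to the check above
print(type(x) == int)

# isinstance() asks "is x an int, or an instance of a subclass of int?"
print(isinstance(x, int))

# The two can differ: bool is a subclass of int in Python
flag = True
print(type(flag) == int)      # exact type is bool, not int
print(isinstance(flag, int))  # but a bool still counts as an int
```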

Let's now try to create a float and string in the following cell. 

Complete the following cell by assigning a floating point value to $y$ then use `type()` to confirm $y$'s data type.

In [None]:
y = # add something here so that y is a float

# confirm that this is a float with the following print statement
print("The data type of y is: ...

In [None]:
# Confirm that y is a float
if type(y) == type(float()):
    print("y is a float data type")
else:
    print("y is NOT a float data type")

Do the same but now complete the following cell by assigning a str(ing) value to $z$ then use `type()` to confirm its data type.

In [None]:
z = # add something here so that z is a string

# confirm that this is a string with the following print statement
print("The data type of z is: ...

For more practice on variables and types do this short tutorial at learnpython.org:
* [https://www.learnpython.org/en/Variables_and_Types](https://www.learnpython.org/en/Variables_and_Types)

### 4. Python Collection/Container Data Types

In nearly any computer program we will often need to employ data types that are able to hold a set of values, i.e. _collection_ data types. 

The most common of these is referred to as an __array__ in most languages, although in Python it is simply called a __list__. 

An example of a Python list is:

In [None]:
list1 = [4, 8, 3, 17, 6]

Notice that the data type of `list1` is a list, but that the data type of any one given element is an integer:

In [None]:
print("The data type of list1 is:", type(list1))

print("The data type for the first element in list1 is:", type(list1[0]))

Not only is Python dynamically typed, but it is also very flexible when it comes to collection data types.

A Python list can hold any type of variable, and even different types at the same time.

In [None]:
list2 = [4, 3.72, 'hello']

print("The data type for the first element in list2 is:", type(list2[0]))

print("The data type for the second element in list2 is:", type(list2[1]))

print("The data type for the third element in list2 is:", type(list2[2]))

There are several collection data types in Python, each with its own unique advantages/disadvantages. The main built-in ones are:

* list
* tuple 
* dictionary
* str(ing)
* set
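Lists were shown above, so here is a brief sketch of three of the other types. The variable names and values (`point`, `word_counts`, `unique_tokens`) are made up for illustration.

```python
# tuple: ordered like a list, but cannot be changed after creation
point = (3, 4)

# dictionary: maps keys to values
word_counts = {"night": 2, "dark": 1}
word_counts["stormy"] = 1  # add a new key/value pair

# set: unordered, keeps only unique items
unique_tokens = set(["the", "night", "the", "was"])

print(type(point), point[0])
print(type(word_counts), word_counts["night"])
print(type(unique_tokens), len(unique_tokens))
```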

It may come as a surprise that a string object is also a collection data type but you can confirm this by creating a string then using square brackets to access an individual character. Try doing so in the cell below.

In [None]:
my_str = 'Hello World'

print("The second character in my_str is:", ...

For more practice with lists, strings, or dictionaries, check out any of the following brief tutorials at learnpython.org.
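* [https://www.learnpython.org/en/Lists](https://www.learnpython.org/en/Lists)
* [https://www.learnpython.org/en/Basic_String_Operations](https://www.learnpython.org/en/Basic_String_Operations)
* [https://www.learnpython.org/en/Dictionaries](https://www.learnpython.org/en/Dictionaries)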

### 5. Python is Object Oriented

Python is based on an object-oriented programming paradigm. That is, every variable that you declare in Python is an object (i.e. an instance of a class). Even the primitive data types have their own methods. Here is an example of a str(ing) object's `islower()` method. 

In [None]:
my_str.islower()

That returns `False` since the string is not entirely lower case.

We can convert it to all lower case using another str(ing) method though, `lower()`. 

In [None]:
my_str.lower()
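One detail worth noticing is that `lower()` does not change `my_str` itself. Strings in Python are immutable, so the method returns a new string object. A minimal sketch (the variable name `lowered` is introduced here only for illustration):

```python
my_str = 'Hello World'

lowered = my_str.lower()  # returns a new str object
print(lowered)            # the lower-case copy
print(my_str)             # the original is unchanged
print(lowered.islower())  # the copy is entirely lower case
```

To keep the lower-case version, assign the result to a variable as done here.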

### 6. Defining Classes

Although Python is object oriented itself, it is possible to follow other programming paradigms (e.g. functional programming). For this course it will be extremely helpful if we understand how Python classes work, and how to create our own. 

Let's define a basic class now. Take a look at the following class and try to identify some of the elements of a class such as:
* constructor
* attribute/field
* method
* accessor/getter method
* mutator/setter method

In [None]:
class Automobile:

    def __init__(self):
        self.color = None
        self.transmission = 'automatic'

    def getColor(self):
        return self.color

    def setColor(self, new_color):
        self.color = new_color

    def getTransmission(self):
        return self.transmission


Now, let's create an instance of an automobile. 

In [None]:
my_auto = Automobile()

When we created `my_auto`, what happened exactly? Which, if any, of the methods were called?

Now let's print out the transmission of `my_auto`.

In [None]:
print(f"my_auto has an {my_auto.getTransmission()} transmission")

Next let's try adding a method, `setTransmission(new_transmission)`, to the class definition (above), and then use it to change to a _manual_ transmission. Then run the cell above again to print the transmission type. 

You may also be wondering about the double underscores surrounding the constructor. There are many methods/functions that already exist in Python but that can be overridden to allow us to customize them as needed. These double-underscore methods are called _dunder_, or _magic_ methods. 

One such example is the `__str__` dunder method. Every object in Python has a default way of presenting itself when it is printed. Here is what an instance of our current Automobile class looks like when we try to print it:

In [None]:
print(my_auto)

Depending on our class and how we will use it, that may not be a very helpful way to print an instance. 

Let's try overriding the `__str__` method so that the string representation of an Automobile object is more informative. Then create another instance of Automobile and try printing it again. 

In [None]:
class Automobile:

    def __init__(self):
        self.color = None
        self.transmission = 'automatic'

    def getColor(self):
        return self.color

    def setColor(self, new_color):
        self.color = new_color

    def getTransmission(self):
        return self.transmission

    def __str__(self):
        return f"Automobile instance has: \n  color = {self.color} \n  transmission = {self.transmission}"


my_auto2 = Automobile()
my_auto2.setColor("blue")
print(my_auto2)

We will be using and defining many more classes, and also several other dunder methods such as `__eq__` and `__add__` to allow us to define how we want to compare or add two instances of the same class (respectively), and `__getitem__` and `__setitem__` to index into our own collection/container objects.
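As a preview, here is a minimal sketch of those four dunder methods on a small container class. The class `WordBag` and its behavior (summing counts on `+`, returning 0 for unseen words) are hypothetical design choices made for this illustration, not something we will necessarily use later.

```python
class WordBag:
    """Hypothetical container that maps words to counts."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def __eq__(self, other):
        # bag_a == bag_b: equal when both hold the same counts
        return isinstance(other, WordBag) and self.counts == other.counts

    def __add__(self, other):
        # bag_a + bag_b: sum the counts word by word into a new bag
        merged = dict(self.counts)
        for word, n in other.counts.items():
            merged[word] = merged.get(word, 0) + n
        return WordBag(merged)

    def __getitem__(self, word):
        # bag[word]: return the count, or 0 for unseen words
        return self.counts.get(word, 0)

    def __setitem__(self, word, n):
        # bag[word] = n: store a count
        self.counts[word] = n


bag1 = WordBag({"night": 1})
bag2 = WordBag({"night": 2, "dark": 1})

total = bag1 + bag2   # calls __add__
total["stormy"] = 1   # calls __setitem__
print(total["night"], total["dark"], total["stormy"], total["missing"])  # calls __getitem__
print(bag1 == WordBag({"night": 1}))  # calls __eq__
```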

### 7. Other Tools

We will get to some of the other tools in our next notebook but before we complete this one we should mention the tool we have already been using this entire time, which is a _Jupyter Notebook_. 

Jupyter Notebooks can be thought of as an interactive or research IDE and are typically what Data Scientists and Machine Learning Engineers rely on to develop models and algorithms (but not to productize their models/pipelines).

A Jupyter notebook is often called a _Python notebook_, or simply a _notebook_, and will have an _.ipynb_ extension. Jupyter notebooks are easy to run but it's helpful to remember that they require both a client and server process to be running. Generally a notebook is run through a web browser (the __client__) and requires a Python process to be running in the background or remotely (the __server__). We'll see and use notebooks in a variety of ways so do not worry just yet about being familiar with all of the various ways to create/run a notebook. Just know that if you do encounter any issues, it's likely a simple issue with the behind-the-scenes server process.

We will be getting used to using Jupyter Notebooks during the next few classes so there will be plenty of time to ask me and/or your neighbor questions.

For now, the most helpful shortcuts to know are:
* shift + enter: run the currently selected cell
* enter: enter/begin edit mode in the currently selected cell
* escape: stop/exit edit mode (to make navigating between cells easier)
* up/down arrows: navigate to the cell above/below when not in edit mode

For a reference on Jupyter Notebooks see this excellent documentation at realpython.com:

* [Jupyter Notebook: An Introduction](https://realpython.com/jupyter-notebook-introduction/)


In our next notebook we will look at some of the other tools we will use on top of Python (e.g. numpy, scikit learn, pytorch, etc.).

If you would like to dig deeper into anything covered above, all of the references cited are below, as well as some excellent (and entertaining) videos from [socratica.com](https://www.socratica.com) that cover much of the Python material above.

### References

__Beginner Python References__
* [Learnpython.org Tutorials](https://www.learnpython.org)
  - [Hello World](https://www.learnpython.org/en/Hello%2C_World%21)
  - [Variables and Types](https://www.learnpython.org/en/Variables_and_Types)
  - [Lists](https://www.learnpython.org/en/Lists)
  - [Dictionaries](https://www.learnpython.org/en/Dictionaries)  
  - [Basic Operators](https://www.learnpython.org/en/Basic_Operators)
  - [String Formatting](https://www.learnpython.org/en/String_Formatting)
  - [String Operations](https://www.learnpython.org/en/Basic_String_Operations)
  - [Conditionals](https://www.learnpython.org/en/Conditions)
  - [Loops](https://www.learnpython.org/en/Loops)
  - [Functions](https://www.learnpython.org/en/Functions)
  - [Classes and Objects](https://www.learnpython.org/en/Classes_and_Objects)
  - [Modules and Packages](https://www.learnpython.org/en/Modules_and_Packages)
* Socratica Python Videos: 
  - [Hello World](https://www.socratica.com/lesson/hello-world)
  - [Strings](https://www.socratica.com/lesson/strings)
  - [Numbers](https://www.socratica.com/lesson/numbers-in-v3)
  - [Booleans](https://www.socratica.com/lesson/booleans)
  - [Arithmetic](https://www.socratica.com/lesson/arithmetic-in-v3)
  - [Conditionals](https://www.socratica.com/lesson/if-then-else)
  - [Functions](https://www.socratica.com/lesson/functions)
  - [Lists](https://www.socratica.com/lesson/lists)
  - [Dictionaries](https://www.socratica.com/lesson/dictionaries)
  - [Tuples](https://www.socratica.com/lesson/tuples)
  - [Classes and Objects](https://www.socratica.com/lesson/classes-and-objects)

__More Specific/In-depth Python Resources__
* [Foundations of Python (runestone)](https://runestone.academy/runestone/books/published/fopp/index.html)
* [Defining main functions in Python (realpython.com)](https://realpython.com/python-main-function/)
* [Python Modules (on python.org)](https://docs.python.org/3/tutorial/modules.html)
* [Jupyter Notebook: An Introduction](https://realpython.com/jupyter-notebook-introduction/)

__Other Links/References__

* [Anaconda Download](https://www.anaconda.com/products/individual)
* [Allen Institute's text generator demo](https://demo.allennlp.org/next-token-lm)
* [Deep AI's text generator demo](https://deepai.org/machine-learning-model/text-generator)
* [GenSim](https://radimrehurek.com/gensim/intro.html)
* [Gradient](https://gradient.paperspace.com)
* [Google Colab](https://colab.research.google.com)
* [Kaggle](https://www.kaggle.com)
* [Hugging Face](https://huggingface.co)
* [WebVectors Semantic Calculator demo](http://vectors.nlpl.eu/explore/embeddings/en/calculator/)