# Introduction to Python for Natural Language Processing

<sup>This notebook is a part of the Natural Language Processing class at the University of Ljubljana, Faculty of Computer and Information Science. Please contact [slavko.zitnik@fri.uni-lj.si](mailto:slavko.zitnik@fri.uni-lj.si) for any comments.</sup>

**Anaconda installation**

Conda environment management:

```
conda create -n onj python=3.6
source activate onj
conda install nb_conda
jupyter notebook
source deactivate
```

Show existing environments:

```
conda info --envs
```

Activate your environment and install the following dependencies:

```
conda install -c anaconda scikit-learn
conda install nltk
conda install matplotlib
```

**Pure Python 3.5 installation**

We are going to use Python 3, so first check which Python interpreter is the default on your machine. Open a console and run `python` (several interpreters may be installed on the machine, and Python 3.5 may also be available as `python3.5`).

You should see output similar to the following: 
```
quaternion:~ slavkoz$ python3.5
Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 26 2016, 10:47:25)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

If you can run only Python 2.x, install [Python 3.5](https://www.python.org/) ([download](https://www.python.org/downloads/)) *(if you have several interpreters installed, be sure to use the matching commands: python3.5, pip3.5, or the corresponding command links)*

Now let's install [NLTK](http://www.nltk.org/) ([installation guide](http://www.nltk.org/install.html)) and [Numpy](http://www.numpy.org/) libraries:

```
sudo pip3.5 install -U nltk
sudo pip3.5 install -U numpy
```

Then install [SciPy](https://www.scipy.org/) ([installation guide](https://www.scipy.org/install.html)). I suggest installing it from an appropriate wheel ([wheel list](https://pypi.python.org/pypi/scipy)):

```
sudo pip3.5 install YOUR_DOWNLOADED_PACKAGE.whl
```

Also install the following:

* [Jupyter](http://jupyter.org/) ([installation guide](https://jupyter.readthedocs.io/en/latest/install.html))
* [Scikit-learn](http://scikit-learn.org/stable/) ([installation guide](http://scikit-learn.org/stable/install.html))
* [Matplotlib](http://matplotlib.org/) ([installation guide](http://matplotlib.org/users/installing.html))


Now test whether the libraries were successfully installed (no error should be shown):

In [None]:
import sklearn
import nltk
import matplotlib

Let's download all the corpora from the NLTK library. Run the command below and then select appropriate options in a window that will open.

In [None]:
nltk.download() # run once to download additional NLTK resources

## Short introduction to Python

For more, check the [official documentation](https://docs.python.org/3/) for Python 3.6, browse online tutorials, or just start coding.

### Basics

Let's first say hello:

In [None]:
print("Hello text!")

Now try some arithmetic operations:

In [None]:
1+1

In [None]:
a = 1 + 3
a

In [None]:
"5" * 5

In [None]:
"5" + 5

Number operations.

In [None]:
5 / 2 - 1

In [None]:
number = 6/7
number

In [None]:
round(number, 2)
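
`round()` returns a new number, whereas string formatting only controls how a number is displayed. The sketch below contrasts the two and shows how a precision such as `.2` behaves with and without the `f` presentation type. The variable names are illustrative only.

```python
number = 6 / 7

# round() produces a new float rounded to two decimals.
rounded = round(number, 2)

# No presentation type: ".2" means two significant digits, padded to width 5.
general = "{:5.2}".format(number)

# "f" presentation type: ".2" means two digits after the decimal point.
fixed = "{:5.2f}".format(number)

# With "f", larger values also keep exactly two decimals.
large_fixed = "{:.2f}".format(1234.5678)

print(rounded, repr(general), repr(fixed), repr(large_fixed))
```

For values below 1 the two format specifications happen to agree; adding `f` states the intent explicitly when a fixed number of decimals is wanted.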

In [None]:
"{:5.2}".format(number)

### Strings

We will work with strings a lot. Remember that a string behaves similarly to a list, since it is a sequence of characters.

In [None]:
# Now let's start to play with strings
willy = "William Shakespeare was an English poet, playwright, and actor" 
willy

In [None]:
"The last word is: '" + willy[-5:] + "'"

How could you print the first word only?

In [None]:
willy[0:7]
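
Counting characters by hand works for one sentence but not in general. A minimal sketch of two ways to get the first word without hard-coded indices, using only built-in string methods:

```python
willy = "William Shakespeare was an English poet, playwright, and actor"

# Slice up to the first space instead of counting characters by hand.
first_space = willy.find(" ")
first_word = willy[:first_space]

# Alternatively, split on whitespace and take the first item.
first_word_split = willy.split()[0]

print(first_word, first_word_split)
```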

In [None]:
willy.find('poet') #finds position of substring within string

Character-level operations.

In [None]:
willy[0:7].upper() +' and '+ willy[8:19].lower() # turn to upper or lower case.

In [None]:
willy[0:7].replace("li", "j") # replace a substring 'li' in the string with 'j'. 

Importing a string module.

In [None]:
import string

In [None]:
"?" in string.punctuation

What is *string.punctuation*?

In [None]:
print(string.punctuation)
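
One common use of `string.punctuation` is trimming punctuation from the ends of words. The following is a small sketch on a made-up word list, relying on `str.strip` accepting a set of characters to remove:

```python
import string

raw_words = ["poet,", "playwright,", "actor", "(1564)"]

# str.strip removes any of the given characters from both ends of a string.
clean_words = [word.strip(string.punctuation) for word in raw_words]

print(clean_words)
```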

### Lists

In [None]:
values = [1, 2, "3"]  # named `values` to avoid shadowing the built-in `list`
values

Inline for loop (list comprehension).

In [None]:
multiplied = [item * 2 for item in values]
multiplied

For loop with a filter.

In [None]:
filtered = [item for item in values if int(item) < 3]
filtered
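
A list comprehension is shorthand for a loop that appends to a list. The sketch below writes the filtered comprehension both ways and also shows that `* 2` doubles numbers but repeats strings. The names `values` and `filtered_loop` are illustrative.

```python
values = [1, 2, "3"]

# List comprehension with a filter.
filtered = [item for item in values if int(item) < 3]

# The same logic written as an explicit loop.
filtered_loop = []
for item in values:
    if int(item) < 3:
        filtered_loop.append(item)

# Multiplication doubles numbers but repeats strings.
doubled = [item * 2 for item in values]

print(filtered, filtered_loop, doubled)
```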

Basic list operations.

In [None]:
words = willy.split(" ")
words

In [None]:
len(words) # length of the list

Get the last word from the list:

In [None]:
words[len(words)-1]

In [None]:
words.append(".")
words

In [None]:
sorted(words)

In [None]:
" ".join(words)

Why does the comma stay attached to its word, while the dot is separated from the last word?
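
As a hint, `split(" ")` cuts only at spaces, so it cannot detach punctuation from a word. The sketch below compares it with a simple regular expression from the standard `re` module. The pattern is an illustrative assumption, not the tokenizer NLTK uses.

```python
import re

sentence = "William Shakespeare was an English poet, playwright, and actor."

# Splitting on spaces keeps punctuation glued to the neighbouring word.
by_space = sentence.split(" ")

# \w+ matches runs of word characters; [^\w\s] matches a single punctuation mark.
by_regex = re.findall(r"\w+|[^\w\s]", sentence)

print(by_space[5], by_regex[5:7])
```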

Standalone for loop.

In [None]:
for i, word in enumerate(words):
    if len(word) <= 3:
        print("Short word '{1:3}' is at index {0}.".format(i, word))
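
The replacement fields in the loop above combine an argument index with a width. A small sketch on a made-up word list that collects the messages, so the padding is easier to inspect:

```python
words = ["was", "an", "English", "and"]

messages = []
for i, word in enumerate(words):
    if len(word) <= 3:
        # {1:3} takes the second argument and pads it to a width of 3;
        # {0} takes the first argument.
        messages.append("Short word '{1:3}' is at index {0}.".format(i, word))

print("\n".join(messages))
```

Strings are left-aligned by default, so a two-letter word is followed by one space inside the quotes.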

## Using the NLTK library

In this part we will show some basic operations on text using the NLTK library. First we need to import the needed libraries.

In [None]:
import nltk
import string

The function below reads a text file into a string, performs some operations on it, and returns a list of tokens. Why is it better to use the `nltk.word_tokenize()` function than the `str.split()` method?

In [None]:
def getTokens():
    # read the whole file and convert it to lower case
    with open('shakespeare.txt', 'r') as shakes:
        text = shakes.read().lower()

    # remove punctuation
    table = str.maketrans('', '', string.punctuation)
    text = text.translate(table)

    # split the cleaned text into tokens
    tokens = nltk.word_tokenize(text)
    return tokens
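
To see what the punctuation-removal step does on its own, here is a minimal sketch on a single made-up sentence. It uses only the standard library, with a plain `split()` standing in for the NLTK tokenizer.

```python
import string

sample = "To be, or not to be: that is the question!"

# The third argument of str.maketrans lists characters that translate() deletes.
table = str.maketrans("", "", string.punctuation)
cleaned = sample.lower().translate(table)

# A plain whitespace split is enough here because punctuation is already gone.
tokens = cleaned.split()

print(tokens)
```

Note that this step also removes apostrophes, so a contraction such as "don't" becomes "dont" before tokenization.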

We are first interested in the number of times each word appears. For that we use the `FreqDist` class.

In [None]:
from nltk import FreqDist

tokens = getTokens()
freq = FreqDist(tokens)
freq

Keys of the dictionary are tokens, and values are the numbers of occurrences of each key in the text. To get a list of tuples of type `(key, value)`, use the `.items()` method.

In [None]:
freq.keys()

Hapaxes are tokens that appear only once in the text. Let's see the first 20:

In [None]:
freq.hapaxes()[:20]

In [None]:
sorted(freq.items(), key = lambda x: x[1], reverse = True)[:10]
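
In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so the manual sort above can also be expressed with `most_common()`. The sketch below uses a plain `Counter` on a made-up token list so that it runs without NLTK or the text file.

```python
from collections import Counter

tokens = ["to", "be", "or", "not", "to", "be", "that", "is", "the", "question"]
counts = Counter(tokens)

# The two most frequent tokens together with their counts.
top_two = counts.most_common(2)

# Tokens that occur exactly once, analogous to FreqDist.hapaxes().
once = [token for token, n in counts.items() if n == 1]

print(top_two, once)
```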

The frequency of the most commonly used words in the text:

In [None]:
%matplotlib inline
freq.plot(30) # frequencies of top 30 commonly used words

Stopword removal.

In [None]:
from nltk.corpus import stopwords

stopwords.words('english')

Update the function `getTokens` above to remove all stopwords from the text. How many tokens does it return after that? Do the results of the code above change as well?

In [None]:
# TODO: get the number of tokens without stopwords
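
One possible starting point for the filtering step is a list comprehension over the tokens. The sketch below uses a tiny hand-made stopword set and token list, both hypothetical; in the notebook they would be replaced by `stopwords.words('english')` and the output of `getTokens()`.

```python
# A tiny hand-made stopword set (hypothetical); the NLTK English list is much longer.
stop_words = {"to", "or", "not", "that", "is", "the"}

tokens = ["to", "be", "or", "not", "to", "be", "that", "is", "the", "question"]

# Keep only tokens that are not stopwords; set membership tests are fast.
filtered = [token for token in tokens if token not in stop_words]

print(len(tokens), len(filtered), filtered)
```

Converting the stopword list to a set first avoids scanning the whole list for every token.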

What is the longest word in the text?

In [None]:
# TODO: find the longest word from the text
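
A hint for this task: the built-in `max()` accepts a `key` function, so tokens can be compared by length. A minimal sketch on a made-up token list:

```python
tokens = ["to", "be", "or", "not", "to", "be", "that", "is", "the", "question"]

# max() with key=len returns the first token of maximal length.
longest = max(tokens, key=len)

print(longest, len(longest))
```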

## Importing your own text

Above, we have seen how to retrieve text from a local file. To retrieve text from the user input:

In [None]:
text = input("Enter some text to the terminal: ")

In [None]:
print("The text you entered: '{}'".format(text))

Retrieving text from online sources:

In [None]:
import nltk
from urllib.request import urlopen

In [None]:
url = "http://shakespeare.mit.edu/hamlet/full.html"
html = urlopen(url).read() 
html[:600]

As you can see above, there are a lot of HTML tags around the text that we would like to process. To remove HTML (or to find some specific data within an HTML document), we use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) library ([documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)).

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text()
text[:600]

Is there a difference if you print the value of *text[:600]*?

In [None]:
# TODO: print the value of text[:600] and observe the difference to the output above
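
The difference comes from how a notebook displays a bare expression compared with `print()`. The sketch below reproduces it on a short made-up string, with `repr()` standing in for what the cell output shows.

```python
text = "Hamlet\n\nACT I\nSCENE I."

# A notebook cell shows the repr of a bare expression, with escape sequences.
shown_by_cell = repr(text[:600])

# print() writes the characters themselves, so "\n" becomes a real line break.
print(shown_by_cell)
print(text[:600])
```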

## Exercise

We now know about the basics of text processing with Python 3.6 and NLTK 3.0. To validate your proficiency, perform the following:

* Retrieve data from an online source (e.g.: books from [http://www.fullbooks.com](http://www.fullbooks.com), [http://www.readanybook.com](http://www.readanybook.com) or posts from [http://www.rtvslo.si/](http://www.rtvslo.si/)).
* Process the data and report on results. Use the techniques we mentioned above, check the tools' documentation for additional techniques and use your imagination ...