# Natural Language Processing - Summer Term 2023
#### Hochschule Karlsruhe
#### Prof. Dr. Jannik Strötgen
#### Thanks to: Jun.-Prof. Dr. Andreas Spitz and his tutors Rita Sevastjanova, Yannick Metz


# Python tutorial

### You will learn

- how Python code is written
- how to get familiar with some simple NLP libraries

---

## Assignments and Basic Operators

In [4]:
x = 5
x

5

In [5]:
x

5

In [6]:
type(x)

int

In [7]:
help()


Welcome to Python 3.10's help utility!

If this is your first time using Python, you should definitely check out
the tutorial on the internet at https://docs.python.org/3.10/tutorial/.

Enter the name of any module, keyword, or topic to get help on writing
Python programs and using Python modules.  To quit this help utility and
return to the interpreter, just type "quit".

To get a list of available modules, keywords, symbols, or topics, type
"modules", "keywords", "symbols", or "topics".  Each module also comes
with a one-line summary of what it does; to list the modules whose name
or summary contain a given string such as "spam", type "modules spam".

help> 

You are now leaving help and returning to the Python interpreter.
If you want to ask for help on a particular object directly from the
interpreter, you can type "help(object)".  Executing "help('string')"
has the same effect as typing a particular string at the help> prompt.


In [8]:
x*3

15

In [9]:
x**2

25

In [10]:
print(f"x is {x} and {x}")

x is 5 and 5
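Besides `*` and `**`, a few more arithmetic operators come up often. A minimal sketch; the variable names `true_div`, `floor_div` and `remainder` are illustrative and not part of the original notebook:

```python
x = 5

true_div = x / 2    # true division always returns a float
floor_div = x // 2  # floor division rounds down to an int
remainder = x % 2   # modulo: remainder of the division

print(true_div, floor_div, remainder)  # prints: 2.5 2 1
```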


## Lists and List Comprehensions

In [11]:
numbers = [1, 2, 3, 4, 5]

In [12]:
powers = []

In [13]:
for number in numbers:
    powers.append(number**2)

In [14]:
powers

[1, 4, 9, 16, 25]

In [15]:
list_2 = [x * y for x, y in zip(numbers, powers)]

In [16]:
list_2

[1, 8, 27, 64, 125]

In [41]:
numbers_even = [x for x in numbers if x % 2 == 0]
numbers_even

[2, 4]

In [18]:
powers

[1, 4, 9, 16, 25]

In [19]:
powerfunc = lambda x: x**2

In [20]:
powerfunc(4)

16

In [21]:
def powerfunc(x):
    return x ** 2

In [22]:
list(map(lambda x: x ** 2, numbers))

[1, 4, 9, 16, 25]
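A list comprehension with a condition and the combination of `filter()` and `map()` express the same idea. The following sketch compares both; the names `even_squares` and `even_squares_functional` are made up for illustration:

```python
numbers = [1, 2, 3, 4, 5]

# Comprehension: square only the even numbers.
even_squares = [n ** 2 for n in numbers if n % 2 == 0]

# Same result with filter() (keep even numbers) and map() (square them).
even_squares_functional = list(
    map(lambda n: n ** 2, filter(lambda n: n % 2 == 0, numbers))
)

print(even_squares)             # prints: [4, 16]
print(even_squares_functional)  # prints: [4, 16]
```

In most Python code the comprehension is considered the more readable of the two.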

In [23]:
a = (1, 2, 3)  # a tuple: an immutable sequence
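The parentheses above create a tuple. As a small sketch of how tuples differ from lists (the names `first`, `second`, `third` and `error_message` are illustrative assumptions):

```python
a = (1, 2, 3)

# Tuples support indexing and unpacking just like lists ...
first, second, third = a

# ... but they are immutable: item assignment raises a TypeError.
try:
    a[0] = 10
    error_message = None
except TypeError as err:
    error_message = str(err)

print(first, second, third)  # prints: 1 2 3
print(error_message)
```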

## Dictionaries

In [24]:
example_dict = {} # alternatively: example_dict = dict()

In [25]:
# Add Elements based on key and value
example_dict['a'] = 'alpha'
example_dict['b'] = 'beta'
example_dict['c'] = 'charlie'

In [26]:
# Access dictionary elements
example_dict['b']

'beta'

In [27]:
# Elements can be overwritten
example_dict['a'] = 'alfa'
example_dict['a']

'alfa'

In [28]:
# Delete elements
del example_dict['a']

In [29]:
# Access keys and values
print(example_dict.keys())
print(example_dict.values())
print(example_dict.items())

dict_keys(['b', 'c'])
dict_values(['beta', 'charlie'])
dict_items([('b', 'beta'), ('c', 'charlie')])


In [30]:
for key, value in example_dict.items():
    print(f'{key} is {value}')

b is beta
c is charlie


In [32]:
example_dict["2"] = "alpha"
example_dict.keys()

dict_keys(['b', 'c', '2'])

In [33]:
# Two other ways of creating a dict
example_dict = dict(a='alpha', b='beta')
example_dict = {'a': 'alpha', 'b': 'beta'}

In [34]:
# Dictionary comprehension
example_dict = {x: y for x, y in zip(["word1", "word2", "word3"], [4, 5, 6])}
example_dict

{'word1': 4, 'word2': 5, 'word3': 6}
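A typical NLP use of dictionaries is counting how often each word occurs. A minimal sketch, assuming a small made-up word list (`words` and `word_counts` are illustrative names):

```python
words = ["the", "dogs", "and", "the", "cats", "and", "the", "birds"]

word_counts = {}
for word in words:
    # dict.get returns the default (0) when the key is not present yet
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)
# prints: {'the': 3, 'dogs': 1, 'and': 2, 'cats': 1, 'birds': 1}
```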

## Strings

In [35]:
strings = ["aba", 'Ajs"da', 'c', 'd', 'e', 'f']
strings

['aba', 'Ajs"da', 'c', 'd', 'e', 'f']

In [36]:
type(strings)

list

In [37]:
type(strings[0])

str

In [38]:
def make_lower(input_arr):
    """Return a new list with every string lowercased."""
    return [x.lower() for x in input_arr]

In [39]:
make_lower(strings)

['aba', 'ajs"da', 'c', 'd', 'e', 'f']
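A few more string methods are useful for simple text preprocessing. The sketch below uses an example sentence chosen for illustration; `stripped`, `tokens` and `rejoined` are made-up names:

```python
sentence = "  The dogs are barking outside.  "

stripped = sentence.strip()        # remove leading and trailing whitespace
tokens = stripped.lower().split()  # lowercase, then split on whitespace
rejoined = "-".join(tokens)        # join the pieces with a separator

print(tokens)    # prints: ['the', 'dogs', 'are', 'barking', 'outside.']
print(rejoined)  # prints: the-dogs-are-barking-outside.
```

Note that splitting on whitespace leaves the full stop attached to `outside.`; a proper tokenizer treats punctuation as separate tokens.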

## Classes

In [31]:
class Sentence:
    
    def __init__(self, speaker, text):
        """constructor"""
        self.speaker = speaker
        self.text = text
        
    def length(self):
        """number of characters in the text"""
        return len(self.text)
    
    def arbitrary_function(self, in_var):
        return self.speaker + in_var

In [32]:
sen = Sentence('Sir Winston Churchill', 'That was the prospect a week ago')

In [33]:
sen.arbitrary_function("asd")

'Sir Winston Churchillasd'

In [34]:
sen.text

'That was the prospect a week ago'

In [35]:
sen.length()

32
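`length` above is a regular method, so it needs parentheses when called. If attribute-style access is preferred, Python's built-in `@property` decorator is one option. A sketch of that variant (not the notebook's original class):

```python
class Sentence:

    def __init__(self, speaker, text):
        self.speaker = speaker
        self.text = text

    @property
    def length(self):
        """Number of characters in the text (read-only)."""
        return len(self.text)


sen = Sentence('Sir Winston Churchill', 'That was the prospect a week ago')
print(sen.length)  # accessed like an attribute, without parentheses; prints: 32
```

Because no setter is defined, an assignment such as `sen.length = 5` raises an `AttributeError`.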

## A simple NLP example

In [42]:
import requests
from bs4 import BeautifulSoup
import nltk

In [43]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\I518135\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [44]:
def url_to_transcript(url):
    '''Returns the text of all <p> elements of the page at url, joined by spaces.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "html.parser")
    text = [p.text for p in soup.find_all('p')]
    return " ".join(text)
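To see what the paragraph extraction does without sending a network request, it can be tried on an inline HTML string. This sketch swaps BeautifulSoup for the standard library's `html.parser`; the class name `ParagraphExtractor` and the sample HTML are hypothetical:

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Only keep text that appears inside a paragraph.
        if self.in_paragraph:
            self.paragraphs[-1] += data


html = ("<html><body><h1>Title</h1><p>First paragraph.</p>"
        "<p>Second <b>bold</b> one.</p></body></html>")
parser = ParagraphExtractor()
parser.feed(html)
parser.close()

text = " ".join(parser.paragraphs)
print(text)  # prints: First paragraph. Second bold one.
```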

In [45]:
# load a Churchill speech
url = 'https://api.parliament.uk/historic-hansard/commons/1940/jun/04/war-situation'
transcript = url_to_transcript(url)

In [46]:
# check
transcript[1000:4000]

"ed French Army which was to have advanced across the Somme in great strength to grasp it. \n            However, the German eruption swept like a sharp scythe around the right and rear of the Armies of the north. Eight or nine armoured divisions, each of about 400 armoured vehicles of different kinds, but carefully assorted to be complementary and divisible into small self-contained units, cut off all communications between us and the main French Armies. It severed our own communications for food and ammunition, which ran first to Amiens and afterwards through Abbeville, and it shore its way up the coast to Boulogne and Calais, and almost to Dunkirk. Behind this armoured and mechanised onslaught came a number of German divisions in lorries, and behind them again there plodded comparatively slowly the dull brute mass of the ordinary German Army and German people, always so ready to be led to the trampling down in other lands of liberties and comforts which they have never known in thei

In [47]:
# use lower case
transcript = transcript.lower()

In [48]:
# tokenize for words
word_tokens = nltk.word_tokenize(transcript)
word_tokens[1:21]

['3.40',
 'p.m.',
 'from',
 'the',
 'moment',
 'that',
 'the',
 'french',
 'defences',
 'at',
 'sedan',
 'and',
 'on',
 'the',
 'meuse',
 'were',
 'broken',
 'at',
 'the',
 'end']

#### tokenize for sentences

In [49]:
sentences_tokens = nltk.sent_tokenize(transcript)
sentences_tokens[1:4]

['the french high command hoped they would be able to close the gap, and the armies of the north were under their orders.',
 'moreover, a retirement of this kind would have involved almost certainly the destruction of the fine belgian army of over 20 divisions and the abandonment of the whole of belgium.',
 'therefore, when the force and scope of the german penetration were realised and when a new french generalissimo, general weygand, assumed command in place of general gamelin, an effort was made by the french and british armies in belgium to keep on holding the right hand of the belgians and to give their own right hand to a newly created french army which was to have advanced across the somme in great strength to grasp it.']

#### stopword removal

In [50]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stopword = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\I518135\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [51]:
word_tokens_without_stopwords = [word for word in word_tokens if word not in stopword]
word_tokens_without_stopwords[1:10]

['3.40',
 'p.m.',
 'moment',
 'french',
 'defences',
 'sedan',
 'meuse',
 'broken',
 'end']
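The filtering idea can be reproduced with the standard library alone. In this sketch, `toy_stopwords` is a hypothetical mini stopword list (NLTK's English list is much longer) and the token list is a shortened, hand-written example. Storing the stopwords in a set makes the membership test fast, and the extra condition also drops tokens that consist only of punctuation:

```python
import string

# Hypothetical mini stopword list for illustration only.
toy_stopwords = {"from", "the", "that", "at", "and", "on", "were"}

tokens = ['3.40', 'p.m.', 'from', 'the', 'moment', 'that', 'the',
          'french', 'defences', ',', 'at', 'sedan', '.']

# Drop stopwords and tokens made up entirely of punctuation characters.
filtered = [
    tok for tok in tokens
    if tok not in toy_stopwords
    and not all(ch in string.punctuation for ch in tok)
]

print(filtered)
# prints: ['3.40', 'p.m.', 'moment', 'french', 'defences', 'sedan']
```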

### Some more steps (lowercase, tokenization, stop word removal, lemmatization)

In [57]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [58]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jannikstroetgen/nltk_data...


True

In [59]:
nltk.download('omw-1.4')
dog_sentence = 'The dogs are barking outside. Are the cats also in the parks outside?'
dog_sentence_lower = dog_sentence.lower()
dog_tokens = nltk.word_tokenize(dog_sentence_lower)
dog_token_swr = [word for word in dog_tokens if word not in stopword]

wordnet_lemmatizer = WordNetLemmatizer()
lemmatized_words = [wordnet_lemmatizer.lemmatize(word) for word in dog_token_swr]
lemmatized_words

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/jannikstroetgen/nltk_data...


['dog', 'barking', 'outside', '.', 'cat', 'also', 'park', 'outside', '?']

#### Compare with stemming

In [60]:
from nltk.stem import SnowballStemmer

In [61]:
snowball_stemmer = SnowballStemmer('english')
stemmed_words = [snowball_stemmer.stem(word) for word in dog_token_swr]
stemmed_words

['dog', 'bark', 'outsid', '.', 'cat', 'also', 'park', 'outsid', '?']
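Stemming cuts off word endings by rule, without looking words up in a dictionary. The following toy function gives a rough feeling for the idea; it is a deliberately naive sketch and not the Snowball algorithm (the name `toy_stem` and the suffix list are assumptions made for illustration):

```python
def toy_stem(word):
    """Strip a few common English suffixes; a toy, not the Snowball algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip the suffix if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ['dogs', 'barking', 'outside', 'cats', 'parks', 'also']
stems = [toy_stem(w) for w in words]
print(stems)  # prints: ['dog', 'bark', 'outside', 'cat', 'park', 'also']
```

Real stemmers apply many more rules, which is why their output can contain non-words.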

#### Frequencies

In [62]:
from nltk import FreqDist

In [63]:
FreqDist(word_tokens_without_stopwords)

FreqDist({',': 267, '.': 181, 'british': 26, 'upon': 25, 'would': 21, 'french': 20, 'army': 19, 'many': 17, 'may': 14, 'shall': 14, ...})
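`FreqDist` is a subclass of `collections.Counter`, so the same kind of counting can be sketched with the standard library. The counts above are dominated by `,` and `.`, so this sketch filters out single punctuation characters first. The token list is made up for illustration:

```python
from collections import Counter
import string

# Made-up token list for illustration.
tokens = ['british', ',', 'army', '.', 'british', 'french', ',',
          'army', 'british', '.']

punctuation = set(string.punctuation)

# Keep only tokens that are not a single punctuation character.
words_only = [tok for tok in tokens if tok not in punctuation]
counts = Counter(words_only)

print(counts.most_common(2))  # prints: [('british', 3), ('army', 2)]
```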

#### POS Tagging

In [64]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jannikstroetgen/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [65]:
pos_tag = nltk.pos_tag(word_tokens)
pos_tag[1:20]

[('3.40', 'CD'),
 ('p.m.', 'NN'),
 ('from', 'IN'),
 ('the', 'DT'),
 ('moment', 'NN'),
 ('that', 'IN'),
 ('the', 'DT'),
 ('french', 'JJ'),
 ('defences', 'NNS'),
 ('at', 'IN'),
 ('sedan', 'NN'),
 ('and', 'CC'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('meuse', 'NN'),
 ('were', 'VBD'),
 ('broken', 'VBN'),
 ('at', 'IN'),
 ('the', 'DT')]
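The tagger returns a list of `(word, tag)` tuples, which can be processed with the list techniques from earlier in this tutorial. A small sketch that keeps only the nouns, using the first few pairs from the output above (the names `tagged` and `nouns` are illustrative):

```python
# First few (word, tag) pairs copied from the tagger output above.
tagged = [('3.40', 'CD'), ('p.m.', 'NN'), ('from', 'IN'), ('the', 'DT'),
          ('moment', 'NN'), ('that', 'IN'), ('the', 'DT'),
          ('french', 'JJ'), ('defences', 'NNS')]

# Penn Treebank noun tags all start with 'NN' (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith('NN')]

print(nouns)  # prints: ['p.m.', 'moment', 'defences']
```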

#### Named Entity Recognition (NER)


In [66]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/jannikstroetgen/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to
[nltk_data]     /Users/jannikstroetgen/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [67]:
text = 'In order for Ravi to be successful, he should follow John.'
words = nltk.word_tokenize(text)
pos_tag_2 = nltk.pos_tag(words)
chunk = nltk.ne_chunk(pos_tag_2)
NE = [ ' '.join(w for w, t in ele) for ele in chunk if isinstance(ele, nltk.Tree)]
NE

['Ravi', 'John']