<img src="../../Img/backdrop-wh.png" alt="Drawing" style="width: 300px;"/>

DIGHUM160 - Critical Digital Humanities<br>
Digital Hermeneutics<br>
OPTIONAL: Python basics II<br>
Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk)

# Python Basics II

Please read this notebook if you want to brush up on some Pandas and text preprocessing basics in Python. We'll cover:

- importing
- tokenizing
- lemmatizing
- and some programming basics, like functions and sets, scattered throughout


## Importing

Let's get our files (see this week's main notebook for more info).

In [1]:
import os
# We include two ../ because we want to go two levels up in the file structure
os.chdir("../../Data")

In [2]:
# Importing Pandas
import pandas as pd

# Reading the CSV file
df = pd.read_csv('aita_sub_top.csv')[:10]

In [3]:
# Showing the dataframe - just the first 5 rows
df.head()

Unnamed: 0,idint,idstr,created,self,nsfw,author,title,url,selftext,score,subreddit,distinguish,textlen,num_comments,flair_text,flair_css_class,augmented_at,augmented_count
0,427576402,t3_72kg2a,1506433689,1.0,0.0,Ritsku,AITA for breaking up with my girlfriend becaus...,,My girlfriend recently went to the beach with ...,679.0,AmItheAsshole,,4917.0,434.0,no a--holes here,,,
1,551887974,t3_94kvhi,1533404095,1.0,0.0,hhhhhhffff678,AITA for banning smoking in my house and telli...,,My parents smoke like chimneys. I used to as w...,832.0,AmItheAsshole,,2076.0,357.0,asshole,ass,,
2,552654542,t3_951az2,1533562299,1.0,0.0,creepatthepool,AITA? Creep wears skimpy bathing suit to pool,,Hi guys. Throwaway for obv reasons.\n\nI'm a f...,23.0,AmItheAsshole,,1741.0,335.0,Shitpost,,,
3,556350346,t3_978ioa,1534254641,1.0,0.0,Pauly104,AITA for eating steak in front of my vegan GF?,,"Yesterday night, me and my GF decided to go ou...",1011.0,AmItheAsshole,,416.0,380.0,not the a-hole,not,,
4,560929656,t3_99yo3c,1535126620,1.0,0.0,ThatSpencerGuy,AITA for not wanting to cook my mother-in-law ...,,"My wife and I are vegetarians, much to my in-l...",349.0,AmItheAsshole,,1158.0,360.0,not the a-hole,not,,


In [4]:
# Access the 'selftext' column of our dataframe like so (both notations are equivalent)
print(df['selftext'])
print(df.selftext)

0    My girlfriend recently went to the beach with ...
1    My parents smoke like chimneys. I used to as w...
2    Hi guys. Throwaway for obv reasons.\n\nI'm a f...
3    Yesterday night, me and my GF decided to go ou...
4    My wife and I are vegetarians, much to my in-l...
5    Hi, so my girlfriend and i watched a horror mo...
6    He has no money until he’s paid at the end of ...
7    Let me just say up front that I have nothing a...
8     Basically I went out with this girl(Missy) an...
9    I coach a high school girls tennis team and du...
Name: selftext, dtype: object
0    My girlfriend recently went to the beach with ...
1    My parents smoke like chimneys. I used to as w...
2    Hi guys. Throwaway for obv reasons.\n\nI'm a f...
3    Yesterday night, me and my GF decided to go ou...
4    My wife and I are vegetarians, much to my in-l...
5    Hi, so my girlfriend and i watched a horror mo...
6    He has no money until he’s paid at the end of ...
7    Let me just say up front that 

### Programming basics: Functions

Do these exercises if you need to learn about functions.

A function is like a little program. It's basically a block of code which only runs when it is called. It looks like this:

In [5]:
# Defining the function
def my_function():
  print("Hello, world!")

# Running the function
my_function()

Hello, world!


Functions can take some data as input, called "parameters". They can transform that input and `return` some output to you.

In [6]:
def my_function(inp):
  return "Hello, " + inp

my_function("you")

'Hello, you'
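The difference between printing a value and returning it can be sketched as follows. The function names `greet_print` and `greet_return` are made up for this illustration; only a returned value can be stored in a variable and used later.

```python
def greet_print(name):
    # Only shows the text on screen; nothing is handed back to the caller
    print("Hello, " + name)

def greet_return(name):
    # Hands the text back, so the caller can store it or transform it
    return "Hello, " + name

printed = greet_print("you")     # prints "Hello, you"
returned = greet_return("you")   # prints nothing

print(printed)            # None: greet_print has no return statement
print(returned.upper())   # HELLO, YOU
```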

Create a function called `multiplier` that takes some number as a parameter, multiplies that number by 10, and `return`s it. Then run the cell to see if it works.

In [7]:
# Your code here
def multiplier(inp):
  return inp * 10

multiplier(10)

100

### Removing punctuation

First, have a look at how to use `string.punctuation` to get rid of punctuation characters. `string.punctuation` is not a function: it's a predefined string containing the ASCII punctuation characters, which we can use to filter punctuation out of another string.



In [8]:
import string

old_sent = "I. don't. know. why. I'm. speaking. like. this."
new_sent = ""
for ch in old_sent:
  if ch not in string.punctuation:
    new_sent += ch

new_sent

'I dont know why Im speaking like this'

Your turn! Try to create a function called `strip_punctuation` that strips punctuation from a string. It takes a string as a parameter, and returns a new string with all punctuation stripped out.

In [9]:
# Your code here

def strip_punctuation(s):
    return ''.join(ch for ch in s if ch not in string.punctuation)
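As an alternative, here is a minimal sketch that uses `str.translate` with a deletion table built by `str.maketrans`. The name `strip_punctuation_translate` is made up for this example. It also shows a limitation of both approaches: `string.punctuation` only contains ASCII characters, so a curly apostrophe (as in "he’s") is left in place.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation_translate(s):
    """Return s with all ASCII punctuation removed."""
    return s.translate(PUNCT_TABLE)

print(strip_punctuation_translate("I. don't. know."))  # I dont know
print(strip_punctuation_translate("He’s paid."))       # He’s paid (curly apostrophe survives)
```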


Try to see if it works:

1. Create an empty list called `df_strip_punct`;
2. Run a `for`-loop that iterates over all the "selftexts" in the `df` DataFrame and applies your function to each;
3. Append each result to `df_strip_punct`;
4. Print the list to see if it worked!

In [10]:
# Your code here

df_strip_punct = [strip_punctuation(t) for t in df.selftext]
df_strip_punct


['My girlfriend recently went to the beach with a few of her friends  She has this tiny bikini bottom that is basically a thong that I HATE when she wears in public  Well she wore it  Not only did she wear it she posed in the bathroom mirror of her hotel room to take a side profile picture so you could see her ass sticking out in it and posted it to her Snapchat story   Worth mentioning I am not friends with her on Snapchat for reasons similar to this sick of getting in fights when she says shes going out for girls night then posts videos of her sitting at a table with like 5 dudes that always got invited by one of the other girls which was completely unknown to her until she arrived  most of these guys she then adds on Snapchat afterwards  She didnt even save it and send it to me  I saw it when she was showing me pics from her beach trip and she had screenshot that particular snap and left it in her camera roll  Whether the ass part was intentional or not I will never know  She claims

### Tokenizing
Next, we need to tokenize our text. Create another list called `df_tokens`, then use another for-loop that applies NLTK's `word_tokenize()` function to each entry of our `df_strip_punct` list.

In [11]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Your code here

df_tokens = [word_tokenize(text) for text in df_strip_punct]
df_tokens[0][0]

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tomvannuenen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


'My'

`df_tokens` is a list of lists: each list contains the individual tokens of a post. What if we want to access a list within a list? It works like this:

In [12]:
list1 = [[10,13,17],[3,5,1],[13,11,12]]
list1[0][2]

17
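The same indexing works on a list of token lists. The sketch below uses a small made-up `posts` list (not the real data) to show indexing, counting tokens per post, and flattening everything into one list.

```python
# Made-up stand-in for a list of tokenized posts
posts = [["My", "girlfriend", "went"], ["Hi", "guys"], ["He", "has", "no", "money"]]

first_post = posts[0]             # the whole first inner list
last_token = posts[0][-1]         # last token of the first post
tokens_per_post = [len(post) for post in posts]
all_tokens = [tok for post in posts for tok in post]  # flatten into one list

print(first_post)        # ['My', 'girlfriend', 'went']
print(last_token)        # went
print(tokens_per_post)   # [3, 2, 4]
print(len(all_tokens))   # 9
```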

Your turn! Print out the first 10 tokens of the first entry in the `df_tokens` list.

In [13]:
# Your code here
df_tokens[0][:10]




['My',
 'girlfriend',
 'recently',
 'went',
 'to',
 'the',
 'beach',
 'with',
 'a',
 'few']

### Programming basics: Sets
Do these exercises if you need to learn about sets!

A set is an **unordered** and **unindexed** collection of unique items. This makes sets different from lists, which are ordered, and from dictionaries, which are indexed by key. Sets are useful when you need to quickly check whether an item is in a collection and the order of the items doesn't matter.

In Python sets are written with curly brackets, like so:

In [14]:
my_set = {"apple", "pear", "orange"}
print(my_set)

{'apple', 'orange', 'pear'}


Note that the order is not preserved!
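A short sketch of what "unindexed" means in practice, using a made-up `fruits` list: a set drops duplicates and supports `in` checks, but asking for an item by position raises a `TypeError`.

```python
fruits = ["apple", "pear", "apple", "orange", "pear"]
unique = set(fruits)

print(len(fruits), len(unique))   # 5 3 (duplicates are dropped)
print("pear" in unique)           # True

try:
    unique[0]                     # sets have no positions
except TypeError as err:
    print("Sets cannot be indexed:", err)

# If you need a stable order, sort the set into a list
print(sorted(unique))             # ['apple', 'orange', 'pear']
```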

You can also create a set from a different data type. What do you think will happen in the below expression? Think about it for a second, then run it to see.

In [15]:
set("I am going to work")

{' ', 'I', 'a', 'g', 'i', 'k', 'm', 'n', 'o', 'r', 't', 'w'}

Your turn! First, assign a list of some random numbers to a new variable. Then, convert that list into a set, and print it out!

In [16]:
# Your code here





### Removing stopwords
Next, let's remove stopwords. We can do so using NLTK's stopwords list, which we import below. Let's have a look at some of these stopwords.

In [17]:
nltk.download('stopwords')
from nltk.corpus import stopwords

stopwords.words('english')[:10]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tomvannuenen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Checking *every* word in our corpus against this list is going to take a long time, so let's turn it into a set. This saves us some time, as checking whether an item is in a set is much faster than searching through a list.

Remember, when creating a set it shouldn't matter which order items are in – and for our stopwords list, that is the case!
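To get a feel for the speed difference, here is a rough timing sketch. It uses 10,000 made-up "words" rather than the NLTK stopword list, and the exact numbers will differ from machine to machine.

```python
import timeit

# Made-up stand-in data, not the real stopword list
words = [f"word{i}" for i in range(10_000)]
word_list = words
word_set = set(words)

target = "word9999"  # the last item: the worst case for a list search

# Run each membership check 1,000 times and measure the total time
list_time = timeit.timeit(lambda: target in word_list, number=1_000)
set_time = timeit.timeit(lambda: target in word_set, number=1_000)

print(f"list lookup: {list_time:.4f}s")
print(f"set lookup:  {set_time:.4f}s")
```

Both checks give the same answer; only the time they take differs.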

In [18]:
set(stopwords.words('english'))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

Your turn! 
1. Create a function called `strip_stopwords()` that takes `tokens` as a parameter;
2. In the function, create a list named `no_stop`; 
3. Turn `stopwords.words('english')` into a set (like above), then assign it to a variable named `stop`;
4. Run a for-loop that fills the `no_stop` list with only those tokens that are *not* in `stop` (you need an `if`-statement!);
5. Finally, `return` the list.

In [19]:
# Your code here

def strip_stopwords(tokens):
    stop = set(stopwords.words('english'))
    return [w for w in tokens if w not in stop]

Run the following line of code to see if it worked. You should get a printout of the first 10 tokens in the first post of `df_clean`, without the stopwords of course!

In [20]:
# Run this
df_clean = [strip_stopwords(tokens) for tokens in df_tokens]
df_clean[0][:10]

['My',
 'girlfriend',
 'recently',
 'went',
 'beach',
 'friends',
 'She',
 'tiny',
 'bikini',
 'bottom']
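Notice that 'My' and 'She' survived: the stopword list is all lowercase, and the membership check is case-sensitive. A minimal sketch of a case-insensitive variant follows. The tiny `stop` set and the name `strip_stopwords_ci` are made up for this illustration and stand in for the full NLTK list.

```python
# Hypothetical mini stopword set, standing in for set(stopwords.words('english'))
stop = {"my", "she", "to", "the", "a"}

tokens = ["My", "girlfriend", "went", "to", "the", "beach", "She", "smiled"]

def strip_stopwords_ci(tokens, stop):
    """Drop tokens whose lowercase form is in stop; keep the casing of the rest."""
    return [w for w in tokens if w.lower() not in stop]

case_sensitive = [w for w in tokens if w not in stop]
case_insensitive = strip_stopwords_ci(tokens, stop)

print(case_sensitive)    # ['My', 'girlfriend', 'went', 'beach', 'She', 'smiled']
print(case_insensitive)  # ['girlfriend', 'went', 'beach', 'smiled']
```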

### Stemming
Tokenizers are great, but they're often not perfect. Look at the example below:

In [21]:
word_tokenize("Why won't this work?")

['Why', 'wo', "n't", 'this', 'work', '?']

Looks like it did a pretty good job, except that it splits "won't" into two tokens, "wo" and "n't". Annoying. This is where **stemming** and **lemmatizing** come in handy. These are two text normalization techniques that are used to prepare text, words, and documents for further processing.

See [this link](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python?utm_source=adwords_ppc&utm_campaignid=1455363063&utm_adgroupid=65083631748&utm_device=c&utm_keyword=&utm_matchtype=b&utm_network=g&utm_adpostion=&utm_creative=332602034361&utm_targetid=dsa-429603003980&utm_loc_interest_ms=&utm_loc_physical_ms=1012831&gclid=Cj0KCQjwgJv4BRCrARIsAB17JI4kMKOUrJcdearlvPx4kl3VNVcqeZz-oeTSlbgikK3tJbXMrAmWTCwaAvUzEALw_wcB) for more information.

**Stemming** is the process of reducing inflected words to a common root form, the stem. A group of related words is mapped to the same stem, even if that stem is not itself a valid word in the language. First, let's load our stemmer:

In [22]:
stemmer = nltk.stem.LancasterStemmer()

In [23]:
for each in ["think", "thinker", "thinking"]:
    print(stemmer.stem(each))

think
think
think


...but stemming doesn't always produce the prettiest results:

In [24]:
for each in ["create", "creating", "creator"]:
    print(stemmer.stem(each))

cre
cre
cre
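To see why stems are often not real words, here is a deliberately naive suffix-stripping sketch. This is a toy illustration with a made-up rule list, not the Lancaster algorithm, which uses a much larger and more aggressive set of rules.

```python
# Hypothetical, tiny rule list; real stemmers use many more rules
SUFFIXES = ("ing", "er", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["think", "thinker", "thinking", "create", "creating"]:
    print(word, "->", toy_stem(word))
```

With these toy rules "creating" becomes "creat", which is not a word and does not even match the stem of "create". Rule-based stemmers trade this kind of roughness for speed and simplicity.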


### Lemmatizing
A lemma is the canonical, dictionary or citation form of a word. For instance, the lemma for "thinks" is "think." Lemmatization, in other words, is the process of converting a word to its base form.

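Lemmatizing your data is typically a bit less intrusive than stemming it. Note that `lemmatize()` treats every word as a noun unless you pass a part of speech through its `pos` argument (for example `pos='v'` for verbs), which is why 'trading' is left unchanged in the output below. Let's see it in action: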

In [25]:
lemmatizer = nltk.stem.WordNetLemmatizer()

In [27]:
nltk.download('wordnet')
nltk.download('omw-1.4')

for each in ["trade", "trades", "trading", "trader", "traders"]:
    print(lemmatizer.lemmatize(each))

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/tomvannuenen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/tomvannuenen/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


trade
trade
trading
trader
trader


Your turn! 
1. Create a function called `lemmatize()` that takes `tokens` as a parameter;
2. In the function, create a new list called `lemmas`;
3. Assign `nltk.stem.WordNetLemmatizer()` to a variable called `lemmatizer`, like above;
4. Run a `for`-loop that uses `lemmatizer.lemmatize(each)` to lemmatize each token in `tokens`; append the output to our `lemmas` list;
5. Finally, `return` the list.

In [28]:
# Your code here

def lemmatize(tokens):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(each) for each in tokens] 

Run the following line of code to see if it worked. You should see a few lemmatized tokens from the first post. Note that the lemmatizer is not perfect: in the output below, 'ass' has been reduced to 'as'.

In [29]:
# Run this
df_lemmas = [lemmatize(tokens) for tokens in df_clean]
df_lemmas[0][30:35]

['see', 'as', 'sticking', 'posted', 'Snapchat']

### Forcing to string
Sometimes, when we have a list, we actually want a string. For instance, some libraries of NLP tools require strings as input. In those cases, we can turn a list of strings into a single string with the string `.join()` method, which is called on the separator we want between the items. Let's use it to turn the first entry of our `df_lemmas` list into a string.

In [30]:
df_str = ' '.join(df_lemmas[0])
df_str

'My girlfriend recently went beach friend She tiny bikini bottom basically thong I HATE wear public Well wore Not wear posed bathroom mirror hotel room take side profile picture could see as sticking posted Snapchat story Worth mentioning I friend Snapchat reason similar sick getting fight say shes going girl night post video sitting table like 5 dude always got invited one girl completely unknown arrived guy add Snapchat afterwards She didnt even save send I saw showing pic beach trip screenshot particular snap left camera roll Whether as part intentional I never know She claim liked way stomach looked pic beach black bikini bottom owns But there nothing look micro bikini bathroom mirror Theres ocean sand friend tit as bathroom mirror Shes 24 I 30 Is age Is typical 24 year old behavior nowadays Am I wrong thinking inappropriate behavior youre relationship I asshole making big deal Edit Guys let clear I try control point time I never told wear Yes I hate as completely bikini But Ive ne
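A few more things worth knowing about `.join()`, sketched with made-up example data: the separator can be any string, every item must already be a string, and `.split()` reverses the operation.

```python
tokens = ["My", "girlfriend", "recently", "went", "beach"]

sentence = " ".join(tokens)   # separator is a single space
dashed = "-".join(tokens)     # any string can be the separator

# Non-string items must be converted first, otherwise join raises a TypeError
ages = [24, 30]
ages_text = ", ".join(str(age) for age in ages)

print(sentence)                     # My girlfriend recently went beach
print(dashed)                       # My-girlfriend-recently-went-beach
print(ages_text)                    # 24, 30
print(sentence.split() == tokens)   # True: split() undoes the join
```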

## Putting it all together
After all that, you should be well-equipped to understand this preprocessing function. It takes a DataFrame, drops the empty values, then lowercases the selftext, removes punctuation, tokenizes and lemmatizes it. It then returns all the posts as one long string.

In [33]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def preprocessing(df):
    """Lowercase, strip punctuation, tokenize and lemmatize each selftext.

    Returns all processed posts joined into one space-separated string.
    """
    # drop empty (NaN) posts so that .lower() is only called on strings
    texts = df['selftext'].dropna()
    df_length = len(texts)
    processed = []
    for counter, text in enumerate(texts, start=1):
        # turn to lowercase
        text = text.lower()
        # remove punctuation
        text = ''.join(ch for ch in text if ch not in string.punctuation)
        # tokenize
        tokens = word_tokenize(text)
        # lemmatize
        lemmas = ' '.join(wordnet_lemmatizer.lemmatize(token) for token in tokens)
        # save
        processed.append(lemmas)
        if counter % 100 == 0:
            print("Saved " + str(counter) + " out of " + str(df_length) + " entries")
    # join with a space so the last word of one post does not fuse with the first word of the next
    return ' '.join(processed)

Let's run our function on our DataFrame.

In [34]:
lemmas = preprocessing(df)

In [35]:
lemmas[:100]

'my girlfriend recently went to the beach with a few of her friend she ha this tiny bikini bottom tha'