<a href="https://colab.research.google.com/github/CALDISS-AAU/sdsphd19_coursematerials/blob/master/notebooks/SDS_PhD19_NLP_TextExplore.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Simple String Manipulation - Refresher :-)

We start by taking a piece of text and turning it into something that carries the meaning of the initial text but is less noisy, and thus perhaps easier for a computer to "understand".

![alt text](https://i.guim.co.uk/img/media/ba5ca6316885c50ef827c9fd4f04d7f162a864dc/0_167_5000_3000/master/5000.jpg?width=1140&quality=45&auto=format&fit=max&dpr=2&s=e944b93b711d5a00dc503dd30bf8d60b)

Source (image and below text): https://www.theguardian.com/culture/2019/oct/09/muslim-drag-queen-amrou-al-kadhi-whenever-the-drag-came-off-id-have-a-nervous-breakdown

In [0]:
text = "The Eton-educated, non-binary British Iraqi had always struggled with their identity, until they discovered drag. Yet the 29 year old says the performances come at a high price"

In [0]:
# Split on fullstop
text.lower().split(".")

In [0]:
# split on empty space
text.split(" ")

In [0]:
# Find in text (position)
text.find('29')

In [0]:
# Simple replacement
text.replace('o', 'O')

In [0]:
# very short RegEx
import re

In [0]:
re.findall(r'\d+', text)

In [0]:
numbers = re.findall(r'\d+', text)

for i in numbers:
    print(text.replace(i, str(int(i) + 1)))

More on RegEx in the DataCamp courses (there is actually a whole course on it) and in [this cheat sheet](https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf).
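
The loop above replaces one number per iteration and prints a separate string for each match. As a minimal sketch (the helper name `increment_numbers` and the sample sentence are made up for illustration), `re.sub` also accepts a function as the replacement, which updates every number in a single pass:

```python
import re

def increment_numbers(s):
    """Return s with every run of digits replaced by that number plus one."""
    # re.sub calls the lambda once per match and inserts its return value
    return re.sub(r'\d+', lambda m: str(int(m.group()) + 1), s)

sample = "Yet the 29 year old performed 3 shows"
print(increment_numbers(sample))  # Yet the 30 year old performed 4 shows
```

This also avoids a pitfall of chained `str.replace` calls, where replacing "3" would also touch the "3" inside "30".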

Overall, in NLP we are trying to represent meaning structure. That means that we want to focus on the most important and "meaning-bearing elements" in text, while reducing noise.
Words such as "and", "have", "the" may have central syntactic functions but are not particularly important from a semantic perspective.


In [0]:
# Defining stopwords

stopwords_en = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
                'ourselves', 'you', "you're", "you've", "you'll", 
                "you'd", 'your', 'yours', 'yourself', 'yourselves', 
                'he', 'him', 'his', 'himself', 'she', "she's", 'her', 
                'hers', 'herself', 'it', "it's", 'its', 'itself', 
                'they', 'them', 'their', 'theirs', 'themselves', 'what', 
                'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 
                'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 
                'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 
                'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 
                'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 
                'between', 'into', 'through', 'during', 'before', 'after', 'above', 
                'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 
                'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 
                'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 
                'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 
                'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 
                'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 
                'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', 
                "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', 
                "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 
                'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 
                'won', "won't", 'wouldn', "wouldn't"]

In [0]:
# Let's keep only words that are not stopwords
[word for word in text.lower().split() if word not in stopwords_en]
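
To see why filtering stopwords helps, here is a small sketch on a toy sentence. The `mini_stopwords` set and the sample sentence are hypothetical and only serve as illustration. Storing the stopwords in a `set` rather than a list also makes each membership test constant time on average, which matters for larger corpora.

```python
from collections import Counter

# Hypothetical mini stopword set, for illustration only
mini_stopwords = {'the', 'and', 'a', 'of', 'to'}

sample = "the cat and the dog chased the ball to the end of the garden"
tokens = sample.split()

# Before filtering, the most frequent token carries little meaning
print(Counter(tokens).most_common(1))  # [('the', 5)]

# After filtering, only the content words remain
content_tokens = [t for t in tokens if t not in mini_stopwords]
print(content_tokens)  # ['cat', 'dog', 'chased', 'ball', 'end', 'garden']
```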

In [0]:
# Let's remove leading or trailing punctuation from our words.
# Note: str.strip takes a set of characters, not a RegEx pattern
'drag.,'.strip('" ,.!?:;')

In [0]:
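# Let's combine that and add another condition: "no numbers".
# Strip punctuation first, so tokens like "identity," are compared in their clean form
cleaned = [word.strip('" ,.!?:;') for word in text.lower().split()]
[word for word in cleaned if word not in stopwords_en and not word.isdigit()]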
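
A note on `str.strip`: its argument is read as a set of characters to remove from both ends, not as a pattern. The sketch below contrasts it with a genuine regular expression that does the same job. The names `PUNCT` and `edge_punct` are introduced here for illustration only.

```python
import re

# Characters to remove from the edges of a token
PUNCT = '" ,.!?:;'

# str.strip removes any of these characters, but only at the two ends
print('"drag.,'.strip(PUNCT))       # drag
# Punctuation inside the token is left alone
print('non-binary.'.strip(PUNCT))   # non-binary

# A RegEx equivalent: a character class anchored at the start or the end
edge_punct = re.compile(r'^[" ,.!?:;]+|[" ,.!?:;]+$')
print(edge_punct.sub('', '"drag.,'))  # drag
```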

Now that you (hopefully) understand what's going on at the basic level, let's start using some more sophisticated tools to work with text.

We will import some tokenizers from NLTK

In [0]:
import nltk #this part is needed on colab.
nltk.download('punkt')
nltk.download('stopwords')
#----------------------------------------

# Tokenizing sentences
from nltk.tokenize import sent_tokenize

# Tokenizing words
from nltk.tokenize import word_tokenize

In [0]:
# Let's get our sentences.
# Note that the full-stops at the end of each sentence are still there

sentences = sent_tokenize(text)
print(sentences)
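
Why use a dedicated sentence tokenizer instead of `split(".")`? A small sketch with a made-up sentence shows where the naive approach goes wrong: it cuts at every full stop, including those in abbreviations, and it discards the full stops themselves.

```python
# Hypothetical example text containing an abbreviation
sample = "Dr. Smith arrived. He left."

# Naive splitting treats the full stop in "Dr." as a sentence boundary
naive = sample.split(".")
print(naive)  # ['Dr', ' Smith arrived', ' He left', '']
```

Tokenizers such as `sent_tokenize` are designed to cope with many cases of this kind, although no tokenizer gets every abbreviation right.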

In [0]:
# Use word_tokenize to tokenize the second sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[1])

# Make a set of unique tokens in the entire text: unique_tokens
unique_tokens = set(word_tokenize(text))

print(tokenized_sent)
print(unique_tokens)

In [0]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) 

In [0]:
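# Lowercase before the stopword check, otherwise capitalised stopwords such as "The" slip through
[word.lower() for word in word_tokenize(text) if word.lower() not in stop_words and word.isalnum()]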
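
One side effect of the `isalnum()` filter is worth knowing: it drops punctuation tokens, but also any token that contains a hyphen. The sketch below uses a hand-written token list (an assumption, not actual tokenizer output) to show the effect, together with one possible looser filter.

```python
# Hand-written tokens, standing in for tokenizer output
tokens = ['Eton-educated', 'non-binary', '29', 'drag', ',', '.']

# isalnum() is True only if every character is a letter or a digit
strict = [t.lower() for t in tokens if t.isalnum()]
print(strict)  # ['29', 'drag']

# Looser alternative: keep tokens with at least one letter or digit
loose = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]
print(loose)  # ['eton-educated', 'non-binary', '29', 'drag']
```

Which filter is appropriate depends on whether hyphenated compounds matter for the analysis.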

#### Your turn!

![alt text](https://media.giphy.com/media/9rwFfmB2qJ0mEsmkfj/giphy.gif)

Take the following text and transform it into a list of lists, with each element being a tokenized sentence. Remove stopwords, lowercase all tokens and keep only alphanumeric tokens.


"I’ve been called many things in my life, but never an optimist. That was fine by me. I believed pessimists lived in a constant state of pleasant surprise: if you always expected the worst, things generally turned out better than you imagined. The only real problem with pessimism, I figured, was that too much of it could accidentally turn you into an optimist."

source: https://www.theguardian.com/global/2019/nov/21/glass-half-full-how-i-learned-to-be-an-optimist-in-a-week


### Processing many short texts and simple stats

An introduction to NLP would not be the same without Donald's tweets. Let's use these tweets for some more basic NLP and let's try to gather some insights...maybe

![donald_tweets](https://i.cdn.cnn.com/cnn/interactive/2017/politics/trump-tweets/media/trump-tweets-hdr-02.jpg)

Let's try some very simple statistics on Twitter data, thanks to the [Trump Twitter Archive](http://www.trumptwitterarchive.com).

In [0]:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # show full text (None replaces the deprecated -1)

import numpy as np
import seaborn as sns

import itertools
from collections import Counter

In [0]:
# Tokenizing Tweets made easy!
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

In [0]:
# download and open some Trump tweets from trump_tweet_data_archive

trump_tweets_df = pd.read_json('https://github.com/bpb27/trump_tweet_data_archive/raw/master/condensed_2018.json.zip')
trump_tweets_df.head()

In [0]:
# Use the tweet timestamp as a datetime index (not really needed but why not)
trump_tweets_df = trump_tweets_df.set_index(pd.to_datetime(trump_tweets_df.created_at))

In [0]:
# testing the tokenizer
tknzr.tokenize("I am a very #cool tweet by @Roman")

In [0]:
# Let's identify people Trump likes to mention
trump_tweets_df['mentions'] = trump_tweets_df['text'].map(lambda textline: [tag for tag in tknzr.tokenize(textline) if tag.startswith('@')])

In [0]:
# Only keep tweets where a mention is present
trump_tweets_df = trump_tweets_df[trump_tweets_df['mentions'].map(len) > 0]

In [0]:
# Collect
trump_tags = itertools.chain(*trump_tweets_df['mentions'])

In [0]:
# Count up and show
counted_tags = Counter(trump_tags)
counted_tags.most_common(10)
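
The collect-and-count pattern above can be tried on toy data first. In this sketch the nested list `mentions_per_tweet` is made up and stands in for the `mentions` column. It uses `chain.from_iterable`, which flattens the lists without unpacking them as separate arguments.

```python
from collections import Counter
from itertools import chain

# Made-up per-tweet mention lists, standing in for the 'mentions' column
mentions_per_tweet = [['@a', '@b'], ['@a'], ['@c', '@a', '@b']]

# Flatten the list of lists into one lazy stream of tags
all_mentions = chain.from_iterable(mentions_per_tweet)

# Counter consumes the stream and tallies each tag
counts = Counter(all_mentions)
print(counts.most_common(2))  # [('@a', 3), ('@b', 2)]

# The chain object is an iterator: after one pass it is exhausted
print(list(all_mentions))  # []
```

Because the iterator is exhausted after one pass, re-running only the counting cell in a notebook yields an empty `Counter`. Re-run the collecting cell first.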

#### Your turn
![alt text](https://media.giphy.com/media/JIX9t2j0ZTN9S/giphy.gif)

The link below holds a dataset with ~10k #OKBoomer tweets from 10-21 Nov.

https://github.com/CALDISS-AAU/sdsphd19_coursematerials/raw/master/data/tweets_boomer.zip

Use elements from the above code to make a list of the most common hashtags. You have to extract the hashtags from the text yourself, not use the column that already contains them.
